
Handling Label Uncertainty for Camera Incremental Person Re-Identification

Zexian Yang1,2  Dayan Wu1 (Corresponding author)  Wanqian Zhang1  Bo Li1,2  Weiping Wang1,2
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
{yangzexian,wudayan,zhangwanqian,libo,wangweiping}@iie.ac.cn
(2023)
Abstract.

Incremental learning for person re-identification (ReID) aims to develop models that can be trained with a continuous data stream, which is a more practical setting for real-world applications. However, existing incremental ReID methods make two strong assumptions: the cameras are fixed and the newly emerging data is class-disjoint from previous data. This is unrealistic, as previously observed pedestrians may re-appear and be captured again by new cameras. In this paper, we investigate person ReID in an unexplored scenario named Camera Incremental Person ReID (CIPR), which advances existing lifelong person ReID by taking the class overlap issue into account. Specifically, new data collected from new cameras may contain an unknown proportion of identities seen before. This in turn leads to the lack of cross-camera annotations for new data due to privacy concerns. To address these challenges, we propose a novel framework, ExtendOVA. First, to handle the class overlap issue, we introduce an instance-wise seen-class identification module to discover previously seen identities at the instance level. Then, we propose a criterion for selecting confident ID-wise candidates and devise an early learning regularization term to correct noise in the pseudo labels. Furthermore, to compensate for the lack of previous data, we resort to a prototypical memory bank to create surrogate features, along with a cross-camera distillation loss to further retain the inter-camera relationships. Comprehensive experimental results on multiple benchmarks show that ExtendOVA outperforms state-of-the-art methods by remarkable margins.

person re-identification, incremental learning, class overlap

1. Introduction

Person Re-IDentification (ReID) aims to match the same identity across non-overlapping camera views. The success of the modern offline supervised person ReID paradigm  (Zheng et al., 2015; Song et al., 2019; Xiao et al., 2016) is largely attributed to the availability of large-scale cross-camera annotations and the assumption that the surveillance system is fixed. The problem arises when the model needs to acquire new knowledge from newly installed cameras over time, which may require re-collecting data and retraining the model. However, manually establishing cross-camera annotations for all identities from the new and old cameras and then retraining the model is expensive and cumbersome. Moreover, these methods are susceptible to catastrophic forgetting (McCloskey and Cohen, 1989) when adapted to real-world dynamic surveillance systems, particularly when data privacy concerns are taken into account.

Refer to caption
Figure 1. The comparison between the camera incremental setting and the previous setting in incremental person ReID. (a) The existing setting assumes that identities in new data are completely disjoint from previous data. (b) Our setting relaxes the strict class-disjoint assumption: the new data has only intra-camera annotations and may also contain previously seen people.

Recently, there has been increasing attention  (Wu and Gong, 2021; Huang et al., 2021; Pu et al., 2021) on incremental (or lifelong) learning for person ReID (ILReID), which aims to address the practical requirement of continuously learning person ReID models from a stream of incoming data. As new data arrives, old data is not available for re-training due to privacy concerns. However, as shown in Fig. 1(a), existing ILReID methods commonly assume that the classes of the new data are entirely different from the old ones. This assumption is inconsistent with real-world scenarios, as previously observed pedestrians may re-appear and be captured again by the cameras.

Motivated by this gap, in this paper, we introduce a new task setting named Camera Incremental Person ReID (CIPR), which naturally meets the demand of incrementally updating the model from newly installed cameras without access to previous data. As shown in Fig. 1(b), unlike previous incremental settings in person ReID (Wu and Gong, 2021; Huang et al., 2021; Pu et al., 2021) that heavily rely on the class-disjoint assumption, the proposed CIPR allows for partial class overlap between the old and new cameras. In fact, previous methods have overlooked a critical limitation: annotations in person ReID are merely numerical IDs that distinguish individuals, rather than semantic categories (e.g., "cat"). This means that when previous data is no longer available, it becomes difficult to determine whether new data belongs to an existing or a new class, resulting in uncertainty in the cross-camera labels of the new data. This makes CIPR a more realistic scenario, since annotations can only be performed independently for each camera. Despite these differences, CIPR still faces the risk of catastrophic forgetting due to the lack of prior data. In general, the challenges of CIPR stem from two main aspects: 1) How to recognize and associate seen classes without any prior data (termed the class-overlap issue). Since these seen classes should not be learned as new ones, any accumulated errors can lead to performance degradation over time. 2) How to learn more informative knowledge from new cameras while also retaining previously acquired knowledge.

To handle the above challenges in CIPR, a novel framework ExtendOVA is proposed. Specifically, to eliminate the detrimental effect of the class overlap issue, we first incorporate a One-vs-All (OVA) detector (Padhy et al., 2020) that can identify unknown samples in the new data. Nevertheless, directly applying the vanilla OVA detector to the CIPR task is problematic for two main reasons. On one hand, the OVA detector only models instance-level recognition, which cannot by itself decide whether a given class is unseen or not. On the other hand, the OVA detector is trained on the original camera data, leading to a domain shift with respect to the new camera. As a result, potentially seen classes will be misidentified as unseen ones. To achieve ID-wise cross-camera identification, we extend the OVA detector in two ways: 1) we propose a simple yet effective criterion for selecting confident seen classes; 2) we devise an early learning regularization term to address the domain shift and rectify potential noisy labels. In addition, to compensate for the lack of previous data against the second challenge of CIPR, we resort to a prototypical memory bank to create surrogate features based on the prototypes and the Batch-Normalization (BN) layer statistics. We also present a cross-camera distillation loss to retain the inter-camera relationships. In conclusion, our contributions can be summarized as follows:

  • We introduce a novel yet more practical ReID task, named Camera Incremental Person ReID (CIPR), which is fundamentally different from the existing lifelong person ReID tasks. It demands continuous learning of more generalizable representations from data of newly installed cameras with only intra-camera supervision.

  • We carefully design a novel framework ExtendOVA, which crafts an ID-wise pseudo label generation module to address the peculiar class overlap issue under the camera incremental setting.

  • For extensive assessment, we build a simple baseline in addition to ExtendOVA to tackle CIPR. Experimental results show that the proposed approach gains significant advantages over the comparative methods.

Refer to caption
Figure 2. The proposed framework consists of three parts. The first part, Instance-wise Seen Class Identification, detects seen and unseen samples using a One-vs-All detector before training. The second part generates ID-wise pseudo labels and further corrects noisy labels by $\mathcal{L}_{\text{Aux}}$ at the early training stage. The third part is cross-camera distillation, which leverages sampled surrogate features to regularize forgetting by forcing relationships between cameras to be maintained.

2. Related Work

Person Re-identification. Offline person ReID settings can be roughly distinguished into three categories: supervised person ReID, unsupervised person ReID, and intra-camera supervised person ReID. Supervised person ReID methods  (Luo et al., 2019; Sun et al., 2018; Wang et al., 2020) are usually superior in performance but less scalable, relying on a large amount of cross-camera annotations. Differently, unsupervised person ReID (Yu et al., 2017, 2018; Wu et al., 2020; Fu et al., 2019; Zou et al., 2020) is more challenging, employing either clustering algorithms to generate pseudo labels or extra labeled source data to boost performance. Moreover, intra-camera supervised (ICS) person ReID is another perspective to reduce annotation labor  (Ge et al., 2021; Zhu et al., 2019; Peng et al., 2022), where cross-camera association labels are removed from the training data. However, these settings assume that the training data is pre-collected, and thus camera relations can be learned by matching cross-camera images; they are not suited to incrementally adding new cameras over time. Our task of incrementally learning person ReID models from newly installed cameras is related to the problem of intra-camera supervision, but with the added challenge of privacy concerns preventing cross-camera images from being used to associate positive pairs.

Lifelong Person Re-identification. Recently, there has been significant interest in incremental learning for Person ReID, which aims to continuously learn new knowledge without experiencing catastrophic forgetting  (Ratcliff, 1990). Various methods have been proposed to prevent forgetting, which can be broadly divided into two categories: replay-based and data-free methods. Replay-based methods rely on maintaining a memory bank of limited samples that are recorded for replay. However, this approach requires additional storage, and maintaining raw data poses a risk to privacy. In contrast, data-free methods do not rely on any old samples. In this paper, we focus on a data-free incremental learning pipeline.

Previous research  (Pu et al., 2021; Wu and Gong, 2021; Lu et al., 2022; Ge et al., 2022) primarily focuses on incremental scenarios where new identities keep increasing in fixed camera systems. However, contemporary surveillance systems are dynamic, meaning cameras can be installed at any time. In this paper, we consider a more practical scenario for lifelong person ReID, which aims to optimize the model when one or more cameras are introduced into an existing surveillance system. Our approach does not require any strict class-disjoint assumption for model training, and it also considers a scenario where cross-camera labels are unavailable in the training data.

Out-of-Distribution Detection. Out-of-distribution (OOD) detection is a binary classification problem that involves the ability of a model to distinguish between in-distribution and out-of-distribution samples during inference. There are various approaches to OOD detection, some of which involve modeling different scoring functions, such as maximum softmax probability (Hendrycks and Gimpel, 2016; Liang et al., 2018) or entropy (Liu et al., 2020; Chan et al., 2021), to estimate confidence and identify OOD samples. Others (Zong et al., 2018; Pidhorskyi et al., 2018) utilize generative models to learn the distribution of in-distribution data. One approach (Padhy et al., 2020) proposed in a recent paper involves using neural One-vs-All (OVA) classifiers to handle out-of-distribution detection. In our work, we incorporate the OVA detector to differentiate between ”unseen” and potential ”seen” samples. However, it’s important to note that the OVA detector is unable to perform ID-wise prediction and may not be robust enough to handle data with domain gaps.

3. Preliminary

3.1. Problem Formulation

Consider a CIPR problem with several steps, where each incremental step introduces a new camera with a set of classes to learn. Formally, in the $t$-th step, we have the training data $\mathcal{D}_{t}=\{\bm{X_{t}},\bm{Y_{t}}\}=\{(x^{t}_{i},y^{t}_{i})\}^{N_{t}}_{i=1}$ with intra-camera annotations $y^{t}_{i}\in\bm{Y_{t}}$ captured by the newly installed camera $c_{t}$, where $N_{t}$ is the number of classes in $c_{t}$. We note that the training data $\mathcal{D}_{t}$ may contain classes overlapping with $\mathcal{D}_{t-1}$, while the old training data $\mathcal{D}_{t-1}$ is not available due to privacy concerns. Hence, we first need to identify the true number of new classes by which the classifier must be extended in order to learn them correctly. The goal of CIPR is to learn a robust ReID model that generalizes to unseen classes from all encountered cameras.

3.2. A CIPR Baseline

We first present a straightforward baseline for the CIPR task. In the $t$-th step ($t>1$), the feature extractor $F(\theta_{t})$, initialized by $F(\theta_{t-1})$, is updated on $\mathcal{D}_{t}$ to learn a new set of classes, and the classifier $G(\phi_{t})$ is extended to the corresponding new dimension (Hou et al., 2019), which is expected to predict all the classes seen so far. As a common incremental learning baseline, in addition to the ReID loss (He et al., 2020) (e.g., ID loss $\mathcal{L}_{\text{ID}}$ + triplet loss $\mathcal{L}_{\text{Triplet}}$ (Hermans et al., 2017)), a knowledge distillation (KD) loss $\mathcal{L}_{\text{KD}}$ is employed to prevent catastrophic forgetting, which can be formulated as:

(1) \mathcal{L}_{\text{KD}}=\sum_{i\in\bm{X_{t}}}KL(p^{n}_{i}\,||\,p^{o}_{i}),

where $KL(\cdot)$ is the Kullback-Leibler (KL) divergence, and $p^{o}_{i}$ and $p^{n}_{i}$ denote the logit outputs of the old and new models, respectively.

To discriminate seen and unseen identities without accessing the old data, a straightforward method is to leverage the softmax prediction score. We assume that samples belonging to unseen classes produce smooth probability distributions, since all old classes are equally wrong and ambiguous for them. Therefore, we treat an image as a seen class if its maximum softmax score is above a threshold $T$. For samples identified as a new class, we add a new ID on top of the existing old classes; for samples classified into old classes, we use the model prediction as the pseudo label. We then minimize the cross entropy with the resulting global pseudo labels. The loss function can be formulated as:

(2) \mathcal{L}_{\text{ID}}=\mathcal{L}_{\text{CE}}(G(F(\bm{X_{t}};\theta_{t});\phi_{t}),\bm{Y^{\prime}_{t}}),

where $\bm{Y^{\prime}_{t}}$ denotes the pseudo labels of the samples $\bm{X_{t}}$, and $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss.

Overall, the optimization objective of the baseline CIPR model can be formulated as:

(3) \mathcal{L}_{\text{Base}}=\mathcal{L}_{\text{ID}}+\mathcal{L}_{\text{Triplet}}+\lambda_{0}\mathcal{L}_{\text{KD}}.
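To make the baseline concrete, the following PyTorch-style sketch illustrates the losses of Eqs. 1-3 and the threshold-based pseudo-label rule. It is a minimal sketch under our own assumptions: the batch-hard triplet implementation, the argument order of the KL term (which follows common distillation code), and the way seen samples are detected are not details released with the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(new_logits, old_logits):
    # Eq. 1 style knowledge distillation between old and new predictions
    # (argument order follows the common PyTorch KD implementation).
    return F.kl_div(F.log_softmax(new_logits, dim=1),
                    F.softmax(old_logits, dim=1), reduction="batchmean")

def batch_hard_triplet(features, labels, margin=0.3):
    # Batch-hard triplet loss (Hermans et al., 2017), minimal version.
    dist = torch.cdist(features, features)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def detect_seen(new_logits, n_old, threshold=0.5):
    # Baseline rule: a sample is treated as a seen class if its maximum
    # softmax score over the old classes exceeds the threshold T.
    probs = F.softmax(new_logits[:, :n_old], dim=1)
    max_prob, pred_old = probs.max(dim=1)
    return pred_old, max_prob > threshold   # predicted old ID, "seen" flag

def baseline_loss(new_logits, old_logits, features, pseudo_labels, lambda_0=1.0):
    # Eq. 3: L_Base = L_ID + L_Triplet + lambda_0 * L_KD.
    l_id = F.cross_entropy(new_logits, pseudo_labels)
    l_tri = batch_hard_triplet(features, pseudo_labels)
    l_kd = kd_loss(new_logits[:, :old_logits.size(1)], old_logits)
    return l_id + l_tri + lambda_0 * l_kd
```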

4. METHODOLOGY

The filtering mechanism proposed in our baseline method is one way to address the class-overlap issue. However, the manually set threshold $T$ is not robust enough to identify old classes, and the classifier is biased toward classes with many samples over those with few (Hendrycks and Gimpel, 2016). Therefore, in this section, we introduce a new framework for CIPR.

4.1. Overview of Framework

The graphical illustration of our framework is depicted in Fig. 2. We replace the linear classifier with a non-parametric memory bank that stores the moving average of the cluster prototypes, which alleviates the over-confidence issue  (Bendale and Boult, 2016). The old model is fixed, and the new model is updated via an Exponential Moving Average (EMA) scheme during optimization. We then elaborate our ExtendOVA in three parts to tackle the CIPR problem. The first technical novelty comes from leveraging the One-vs-All (OVA) detector for instance-wise seen class identification before training (Section 4.2). Then the samples are assigned global pseudo labels via our criterion and an early learning regularization term $\mathcal{L}_{\text{Aux}}$, as detailed in Section 4.3. Finally, in Section 4.4, surrogate features are sampled from the memory bank to guide the cross-camera distillation objective.
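The paper describes the memory bank only as storing a moving average of the cluster prototypes; the sketch below shows one plausible realization. The momentum value, random initialization, and L2 normalization are our assumptions.

```python
import torch
import torch.nn.functional as F

class PrototypeMemoryBank:
    """Non-parametric classifier: one L2-normalized prototype per identity."""

    def __init__(self, feat_dim, num_classes, momentum=0.2):
        self.W = F.normalize(torch.randn(feat_dim, num_classes), dim=0)  # d x N
        self.momentum = momentum

    def scores(self, features):
        # Cosine scores against all prototypes, used in place of a linear classifier.
        return F.normalize(features, dim=1) @ self.W

    @torch.no_grad()
    def update(self, features, labels):
        # Moving-average update of the prototype of each labeled sample.
        for f, y in zip(F.normalize(features, dim=1), labels):
            self.W[:, y] = self.momentum * self.W[:, y] + (1 - self.momentum) * f
            self.W[:, y] = F.normalize(self.W[:, y], dim=0)

    def extend(self, num_new):
        # Append prototypes for newly discovered classes (N_t' = N_{t-1} + |C_uc|).
        new_cols = F.normalize(torch.randn(self.W.size(0), num_new), dim=0)
        self.W = torch.cat([self.W, new_cols], dim=1)
```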

4.2. Instance-wise Seen Class Identification

In this section, we elaborate on the process of instance-wise seen class identification. We first describe the training of the One-vs-All detector and then the remaining steps.

One-vs-All Detector. The One-vs-All (OVA) detector (Padhy et al., 2020; Saito and Saenko, 2021) was originally proposed for out-of-distribution detection; it extends a binary classifier to the multi-class case to learn a boundary between inliers and outliers. Specifically, the OVA detector consists of multiple binary sub-classifiers, each trained to distinguish one class from all others, i.e., samples belonging to this class are positive while all others are negative. To learn a boundary that identifies unknown identities more effectively, we only pick hard negative samples to compute the loss. Formally, we denote $p(\hat{y}^{c}|x)$ as the positive softmax output for class $c$. The optimization objective for a sample $x_{i}$ with label $y_{i}$ can be formulated as:

(4) \mathcal{L}_{\text{ova}}(x_{i},y_{i})=-\log{p(\hat{y}^{y_{i}}|x_{i})}-\min_{c\neq y_{i}}\log{(1-p(\hat{y}^{c}|x_{i}))}.
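A sketch of the detector head and of Eq. 4 is given below. The two-way softmax parameterization of each sub-classifier follows the OVANet design cited above and is our assumption about the concrete head; the epsilon stabilizer is likewise ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OVADetector(nn.Module):
    """One binary (negative/positive) sub-classifier per class, as used in Eq. 4."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 2 * num_classes)
        self.num_classes = num_classes

    def forward(self, features):
        logits = self.fc(features).view(-1, 2, self.num_classes)
        return F.softmax(logits, dim=1)[:, 1, :]   # p(y_hat^c | x) for every class c

def ova_loss(p_pos, labels, eps=1e-8):
    # Eq. 4: positive term for the ground-truth class plus the hardest negative,
    # i.e. the non-ground-truth class whose log(1 - p) is smallest.
    pos = -torch.log(p_pos.gather(1, labels.view(-1, 1)).squeeze(1) + eps)
    neg_log = torch.log(1.0 - p_pos + eps)
    mask = F.one_hot(labels, p_pos.size(1)).bool()
    hard_neg = -neg_log.masked_fill(mask, 0.0).min(dim=1).values   # min over c != y_i
    return (pos + hard_neg).mean()
```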

Seen Class Identification. When new data arrives, we first find the nearest prototype of each sample and take the output of the corresponding sub-classifier of the OVA detector to determine whether the sample belongs to a seen or an unseen class, as illustrated in Fig. 2. Essentially, each sub-classifier of the OVA detector corresponds to the latent space of its class, and if a sample exceeds all the boundaries of that space, it is recognized as an outlier. The advantage of the OVA detector is that it learns an adaptive threshold between seen and unseen classes.
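The decision rule can be sketched as follows; the 0.5 decision boundary on the positive sub-classifier output is our assumption, and the inputs are plain tensors so the sketch stays self-contained.

```python
import torch

@torch.no_grad()
def identify_seen(features, prototypes, ova_probs, boundary=0.5):
    """features: B x d, prototypes: d x N_{t-1}, ova_probs: B x N_{t-1} positive scores."""
    nearest = (features @ prototypes).argmax(dim=1)            # nearest old prototype
    p_pos = ova_probs.gather(1, nearest.view(-1, 1)).squeeze(1)
    return nearest, p_pos >= boundary                          # (old-class guess, is_seen flag)
```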

4.3. ID-wise Pseudo Label Generation

Although the OVA detector is effective and more robust for instance-level prediction, it may still introduce noisy labels, especially for hard samples. As a result, two images of the same class may be paradoxically predicted as a new class and an old one, which affects the class-level prediction. Furthermore, we argue that, due to the domain gap, the latent space trained for each class cannot effectively represent data from future cameras, as illustrated in Fig. 2. To this end, we propose an ID-wise pseudo label generation module (IPLG) to correct noisy labels and associate samples with the same local label to the same global pseudo label. We detail the operation of this module below.

Pseudo Label Initialization. We propose a simple criterion to select confident seen classes. Given a batch of samples $\{(x_{i}^{t},y_{i}^{t})\}^{B}_{i=1}$ drawn with PK sampling, we first analyze the OVA detector outputs of the samples sharing the same label $y_{i}^{t}$. An identity $y_{i}^{t}$ is regarded as a seen class if and only if all samples with label $y_{i}^{t}$ are predicted to belong to seen classes. Denoting the set of seen classes as $C_{sc}$, for $y_{i}^{t}\in C_{sc}$ we use the nearest class predicted by the prototype classifier as the pseudo label. The remaining classes (denoted by $C_{uc}$) excluded by our criterion are re-labeled as new classes. Formally, the pseudo labels are assigned as follows:

(5) \hat{y}_{i}=\begin{cases}\underset{k}{\arg\max}~{W}_{k}^{\top}F(x_{i}^{t};\theta_{t}),&y_{i}^{t}\in C_{sc},\ k\in[1,N_{t-1}]\\ N_{t-1}+n,&y_{i}^{t}\in C_{uc},\ n\in[1,|C_{uc}|],\end{cases}

where $W_{k}\in\mathcal{R}^{d}$ stands for the $k$-th column (ID-wise prototype) of the memory bank $\bm{W}\in\mathcal{R}^{d\times N_{t-1}}$, and $d$ is the feature dimension. Note that we only choose the most frequently predicted one if there are multiple $k$ for one class. Correspondingly, we expand the memory bank to $\bm{W}\in\mathcal{R}^{d\times N_{t}^{\prime}}$, with $N_{t}^{\prime}=N_{t-1}+|C_{uc}|$.
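A sketch of this ID-wise assignment (Eq. 5) is given below. Labels are 0-indexed with new classes appended after the old ones; applying the rule batch by batch under PK sampling and the majority-vote tie-breaking follow the text above, while the exact bookkeeping of newly created IDs across batches is our simplification.

```python
import torch
from collections import Counter

@torch.no_grad()
def init_pseudo_labels(features, local_ids, is_seen, prototypes, n_old):
    """
    features: B x d, local_ids: intra-camera labels (B,), is_seen: bool flags (B,),
    prototypes: d x n_old old prototypes, n_old = N_{t-1}.
    Returns per-sample global pseudo labels following Eq. 5 (a sketch).
    """
    nearest = (features @ prototypes).argmax(dim=1)       # nearest old class per sample
    pseudo = torch.empty_like(local_ids)
    next_new = n_old                                      # new IDs start after old ones
    for lid in local_ids.unique():
        idx = (local_ids == lid).nonzero(as_tuple=True)[0]
        if bool(is_seen[idx].all()):
            # Seen class: take the most frequently predicted old class (majority vote).
            votes = Counter(nearest[idx].tolist())
            pseudo[idx] = votes.most_common(1)[0][0]
        else:
            # Unseen class: relabel the whole identity to a fresh global ID.
            pseudo[idx] = next_new
            next_new += 1
    return pseudo, next_new - n_old                       # labels, number of new classes
```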

Pseudo Label Refinement. To rectify the noisy pseudo labels caused by the domain shift, an auxiliary loss is designed to regularize the early-stage learning. Concretely, leveraging the initialized pseudo labels, the network can be trained via a softmax loss to identify both seen and unseen classes in the new data, ensuring that features are closest to their corresponding prototypes, which can be formulated as:

(6) \mathcal{L}_{\text{ID}^{*}}=-\sum_{x^{t}_{i}\in\mathcal{D}_{t}}\log\frac{e^{W_{\hat{y}_{i}}^{\top}F(x_{i}^{t};\theta_{t})}}{e^{W_{\hat{y}_{i}}^{\top}F(x_{i}^{t};\theta_{t})}+\sum_{k=1,k\neq\hat{y}_{i}}^{N^{\prime}}e^{W_{k}^{\top}F(x_{i}^{t};\theta_{t})}},

where the key is to keep the old prototypes fixed throughout the process, so that they serve as templates for learning domain-invariant features.

Additionally, we hope that samples belonging to unseen classes also assign their second-largest probability to the nearest potential old prototype, which is encouraged by the following auxiliary loss:

(7) \mathcal{L}_{\text{Aux}}=-\sum_{\hat{y}_{i}>N_{t-1}}\log\frac{e^{W_{\tilde{y}_{i}}^{\top}F(x_{i}^{t};\theta_{t})}}{e^{W_{\tilde{y}_{i}}^{\top}F(x_{i}^{t};\theta_{t})}+\sum_{k=1,k\neq\tilde{y}_{i}}^{N_{t-1}}e^{W_{k}^{\top}F(x_{i}^{t};\theta_{t})}}.

Here, $\tilde{y}_{i}\in[1,N_{t-1}]$ is obtained in the same way as in Eq. 5. The motivation behind this approach lies in the consensus  (Ren et al., 2018; Grandvalet and Bengio, 2004) that seen classes usually cluster into high-density regions of the latent space. Hence, this regularizer encourages such samples to stay close to a shared, real prototype. On the other hand, unseen classes are often distributed in low-density regions, leading to optimization conflicts in which the samples struggle to simultaneously approach their current prototype and the nearest old prototype.
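A sketch of Eqs. 6 and 7 is shown below. Detaching the prototype matrix reflects the non-parametric memory bank (prototypes are updated by moving average, not by gradients); computing the target $\tilde{y}_i$ on the fly, rather than fixing it at initialization, is our simplification.

```python
import torch
import torch.nn.functional as F

def id_star_loss(features, pseudo_labels, prototypes):
    # Eq. 6: cross-entropy against all N' prototypes; the prototype matrix is
    # detached so that old prototypes stay fixed during back-propagation.
    logits = features @ prototypes.detach()               # B x N'
    return F.cross_entropy(logits, pseudo_labels)

def aux_loss(features, pseudo_labels, prototypes, n_old):
    # Eq. 7: only samples assigned to new classes (pseudo label >= N_{t-1} in
    # 0-indexed form) are pulled toward their nearest *old* prototype early on.
    new_mask = pseudo_labels >= n_old
    if not new_mask.any():
        return features.new_zeros(())
    old_logits = features[new_mask] @ prototypes[:, :n_old].detach()   # B' x N_{t-1}
    nearest_old = old_logits.argmax(dim=1)                             # y_tilde, as in Eq. 5
    return F.cross_entropy(old_logits, nearest_old)
```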

After the early-stage regularization, we employ the selection criterion again to obtain refined pseudo labels, which are then used for the remaining model updates. Meanwhile, new prototypes that were previously created by falsely identifying seen classes as unseen are removed from the memory bank.

4.4. Cross-camera Distillation

In incremental learning, models are susceptible to forgetting previously learned knowledge when the old data cannot be revisited. To address this, previous work (Lu et al., 2022) proposed using a GAN (Goodfellow et al., 2020) to reconstruct old data in the image space. In our method, we instead generate substitute samples in the feature space. Concretely, to estimate the distribution of the old data, we assume a class-conditional multivariate Gaussian distribution $F(x_{i}^{t-1}|y_{i}^{t-1}=k)\sim\mathcal{N}_{k}^{t-1}(\mu_{k}^{t-1},\Sigma_{k}^{t-1})$. Here, $\mu_{k}^{t-1}$ is the mean of the Gaussian distribution and can be approximated by our prototypical memory bank. To estimate the covariance matrix, we utilize the statistics of the Batch Normalization (BN) layers. During training, a BN layer normalizes the features, implicitly capturing the means and variances of the data (Yin et al., 2020), which enables estimating the covariance of the old data. Overall, we estimate the distribution of the data in the previous step by:

(8) \mu_{k}^{t-1}\simeq W_{k},\quad\Sigma_{k}^{t-1}\simeq\text{BN}(var).

Then we can sample surrogate features $\tilde{f_{k}}\sim\tilde{\mathcal{N}}_{k}(W_{k},\text{BN}(var))$. Based on these surrogate features, we present a cross-camera distillation loss that regularizes forgetting by ensuring that the cosine similarities across different cameras are maintained. Formally, given a batch of samples $\bm{\mathcal{X}}=\{(x^{t}_{i},\hat{y}_{i})\}^{B}_{i=1}$ along with a batch of sampled surrogate features $\bm{\tilde{F}}=\{\tilde{f}_{k_{1}},\tilde{f}_{k_{2}},\dots,\tilde{f}_{k_{B}}\}$, the loss is computed as:

(9) \mathcal{L}_{\text{CD}}=\left\|\cos(F(\bm{\mathcal{X}};\theta_{t}),\bm{\tilde{F}})-\cos(F(\bm{\mathcal{X}};\theta_{t-1}),\bm{\tilde{F}})\right\|_{2}^{2},

where $\cos(a,b)=\frac{a\cdot b}{\left\|a\right\|_{2}\left\|b\right\|_{2}}$ denotes the cosine similarity. The distillation loss $\mathcal{L}_{\text{CD}}$ improves stability to a degree commensurate with how well the surrogate features preserve the structure of the previous data.
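The sketch below illustrates Eqs. 8 and 9 under our reading of them: the covariance is taken as a diagonal matrix built from the BN layer's running variance, the cosine terms are computed for all sample/surrogate pairs, and the averaging of the squared differences is our choice of normalization.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_surrogate_features(prototypes, bn_running_var, class_ids):
    # Eq. 8: approximate each old class by N(W_k, diag(BN running variance))
    # and draw one surrogate feature per requested class index.
    mean = prototypes[:, class_ids].t()                    # B x d
    std = bn_running_var.clamp(min=1e-6).sqrt()            # d, shared across classes
    return mean + torch.randn_like(mean) * std

def cross_camera_distillation(new_feats, old_feats, surrogate_feats):
    # Eq. 9: keep the cosine-similarity structure between current-camera features
    # and old-camera surrogates consistent across the old and new models.
    sim_new = F.cosine_similarity(new_feats.unsqueeze(1), surrogate_feats.unsqueeze(0), dim=2)
    sim_old = F.cosine_similarity(old_feats.unsqueeze(1), surrogate_feats.unsqueeze(0), dim=2)
    return ((sim_new - sim_old.detach()) ** 2).mean()
```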

4.5. Optimization Summary

In summary, the overall objective function for our ExtendOVA framework is formulated as

(10) \mathcal{L}=\mathcal{L}_{\text{Triplet}}+\mathcal{L}_{\text{ID}^{*}}+\lambda_{1}\mathcal{L}_{\text{Aux}}+\lambda_{2}\mathcal{L}_{\text{CD}},

where $\lambda_{1}$ and $\lambda_{2}$ are balancing coefficients. To enhance the model's stability during optimization, we utilize the Exponential Moving Average (EMA) technique (Tarvainen and Valpola, 2017): the student model's parameters are initially shared with the teacher model $F(\theta_{t})$, and once an iteration is complete, the student model is updated with the EMA of the teacher model's parameters by

(11) \theta_{s,t}=\alpha\theta_{t}+(1-\alpha)\theta_{s,t-1},

where $\alpha$ is a smoothing factor, typically set to 0.99 (Xu et al., 2021). In the test phase, we use the student model to extract feature representations. By incorporating EMA into the training process, the updates to the student model's parameters are smoothed, leading to improved stability and better generalization.
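A minimal sketch of the update of Eq. 11, taken literally (the naming of the two models follows the text above and is otherwise an assumption):

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, alpha=0.99):
    # Eq. 11: theta_{s,t} = alpha * theta_t + (1 - alpha) * theta_{s,t-1}.
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_s.data.mul_(1.0 - alpha).add_(p_t.data, alpha=alpha)
```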

5. Experiments

5.1. Datasets and Evaluation Metrics

Datasets. To evaluate and compare different methods under the Camera Incremental Person Re-Identification (CIPR) setting, three large-scale person re-identification datasets are used: Market-1501 (Zheng et al., 2015), MSMT17 (Wei et al., 2018) and DukeMTMC (Zheng et al., 2017) (used only for academic purposes, without displaying images of persons). We form the intra-camera annotations based on the provided labels. To create a realistic incremental learning scenario, we emulate the deployment of a surveillance system that starts with multiple cameras and gradually adds new cameras over time. For example, we select 4 cameras from Market-1501 for initial training and incrementally add 1 more camera in each subsequent step. Similarly, we create a five-step incremental training setup for MSMT17 and DukeMTMC. It is worth noting that we do not employ all classes of the initial cameras in the first step, but instead sample from them to generate multiple setups for various conditions. The statistics of the datasets are shown in Fig. 3; Section 5.2 details the setups.

Testing Protocols. Two commonly used metrics, mean Average Precision (mAP) and Rank-1 (R-1) accuracy, are used to evaluate CIPR performance. To measure the model's ability to adapt and learn new knowledge, we evaluate it on unseen classes from all cameras encountered during incremental learning. Additionally, to assess the model's anti-forgetting ability, we evaluate it on unseen classes of the initial cameras as a measure of how well previously learned knowledge is retained.
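For reference, a simplified sketch of the standard cross-camera Rank-1/mAP computation is given below; junk-image handling and other dataset-specific details of the official protocols are omitted, so this is an illustration rather than the exact evaluation code.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, q_cams, g_cams):
    """dist: Q x G distance matrix; ids/cams: query and gallery labels and camera IDs."""
    aps, rank1_hits = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        # Discard gallery images of the same identity taken by the same camera.
        keep = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][keep] == q_ids[i]).astype(np.int32)
        if matches.sum() == 0:
            continue
        rank1_hits.append(matches[0])
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return float(np.mean(aps)), float(np.mean(rank1_hits))   # (mAP, Rank-1)
```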

Refer to caption
Figure 3. The distribution of identity number throughout training with different setups.
Table 1. Comparison of the final-step incremental results with the state-of-the-art methods in different setups. Joint-T refers to the upper-bound result. Red and blue: the best and second-best results.
Market-1501 MSMT17 DukeMTMC
s=3,G s=3,E s=5,G s=3,E s=5,G s=3,E
Method Reference mAP R-1 mAP R-1 mAP R-1 mAP R-1 mAP R-1 mAP R-1
Upper Bound and Lower Bound
Joint-T - 73.5 88.2 79.1 88.4 45.4 70.9 52.1 75.2 69.1 81.8 73.7 84.2
Finetune - 40.1 62.5 45.5 67.4 20.3 30.2 21.7 30.5 32.7 48.3 34.1 46.3
Data-free methods
LwF (Li and Hoiem, 2017) TPAMI'17 57.8 78.5 62.7 83.3 27.6 60.4 30.2 59.8 45.4 62.5 51.3 68.3
MMD (Long et al., 2015) ICML’15 60.7 79.7 65.8 83.7 24.9 58.6 28.8 53.5 38.7 57.9 51.4 68.8
AKA  (Pu et al., 2021) CVPR’21 60.2 70.3 61.5 81.1 29.7 57.9 33.0 60.2 41.9 57.6 52.6 61.9
AGD  (Lu et al., 2022) CVPR’22 54.4 68.1 58.6 76.1 30.5 64.8 36.8 62.1 46.0 65.6 48.2 64.9
PatchKD (Sun and Mu, 2022) MM’22 64.8 82.5 71.2 83.4 27.9 45.2 33.9 54.2 51.1 67.8 51.3 70.1
Replay-based methods
iCaRL (Rebuffi et al., 2017) CVPR'17 59.4 82.0 64.8 84.0 33.2 68.6 42.2 69.1 50.7 68.5 53.5 70.8
PTKP (Ge et al., 2022) AAAI’22 68.1 84.8 69.3 84.3 33.0 67.9 41.3 69.4 52.3 68.9 56.3 71.1
Our methods (Data-free)
Baseline 61.7 80.3 66.9 85.0 30.9 67.0 36.6 66.1 47.8 65.1 53.4 69.1
ExtendOVA 70.3 86.6 75.6 88.2 35.3 70.2 46.3 74.4 53.4 70.2 62.4 75.8
Refer to caption
Figure 4. The confidence score distribution of seen and unseen samples produced by baseline (left), OVA detector (middle) and ExtendOVA (right) on Market-1501 in general setup.

5.2. Implementation Details

We use the widely adopted ResNet-50 (He et al., 2016) as the backbone network. To obtain 2048-dimensional features, a Batch Normalization (BN) layer  (Ioffe and Szegedy, 2015) is placed after the last layer of the network. The batch size is set to 64, comprising 16 identities with 4 images per identity. The Adam optimizer with a learning rate of $3.5\times 10^{-4}$ is used at the initial step, and the backbone learning rate is set to $lr/10$ during incremental learning. The model is trained for 40 epochs per step, and the early-stage learning regularization is applied during the first ten epochs. The hyper-parameters $T$, $\lambda_{1}$, and $\lambda_{2}$ are set to 0.5, 0.9, and 0.6, respectively (see the hyper-parameter analysis in the supplemental material).
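A minimal sketch of the PK sampling used to build each batch of 16 identities with 4 images each is shown below; the handling of identities with fewer than K images (sampling with replacement) and the shuffling scheme are our assumptions.

```python
import random
from collections import defaultdict

def pk_batches(labels, p=16, k=4, seed=0):
    """Yield index batches of P identities x K images (a minimal PK sampler sketch)."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, y in enumerate(labels):
        by_id[y].append(idx)
    ids = list(by_id)
    rng.shuffle(ids)
    for start in range(0, len(ids) - p + 1, p):
        batch = []
        for y in ids[start:start + p]:
            pool = by_id[y]
            picks = rng.sample(pool, k) if len(pool) >= k else rng.choices(pool, k=k)
            batch.extend(picks)
        yield batch
```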

General setup assumes that, in most cases, there are initially more unseen classes than seen ones in a new camera, and that the number of unseen classes increases linearly over time. As shown in Fig. 3(a), the identity distribution of the new data is controlled by sampling the data of the initial cameras. For example, we sample 300, 500, and 300 identities in the first step for Market-1501, MSMT17 and DukeMTMC, respectively, yielding 110/131, 54/83, and 74/144 seen/unseen classes in the second step.

Exceptional setup is further considered for extreme scenarios where the majority of the classes captured by new cameras are old ones. This can be achieved by increasing the number of classes sampled in the initial step, thereby increasing the proportion of old classes in the new cameras.

For comparative experiments, we reproduce state-of-the-art methods on our setting, i.e., the data-free methods LwF (Li and Hoiem, 2017), AKA (Pu et al., 2021), AGD (Lu et al., 2022), and PatchKD (Sun and Mu, 2022), the replay-based methods iCaRL (Rebuffi et al., 2017) and PTKP (Ge et al., 2022), and a distribution alignment method, MMD (Long et al., 2015). It is noteworthy that these methods are based on a class-disjoint setting and do not match ours; to implement them in our setting, they can only treat old classes as new ones. For a more extensive assessment, we also include additional comparison methods: the baseline described in Section 3.2, a fine-tune method that simply fine-tunes the model on the new data, and Joint-T, which denotes the upper bound obtained by training the model on all data seen so far.

Table 2. Ablation study of the contribution of ExtendOVA components during every incremental step in the exceptional setup.
Market-1501 MSMT17 DukeMTMC
Step2 Step3 Step2 Step3 Step2 Step3
$\mathcal{L}_{\text{ID}^{*}}$ $\mathcal{L}_{\text{Aux}}$ $\mathcal{L}_{\text{CD}}$ EMA Method mAP R-1 mAP R-1 mAP R-1 mAP R-1 mAP R-1 mAP R-1
71.3 85.0 62.7 83.3 37.4 60.7 30.2 59.8 56.1 70.5 51.3 68.3
LwF 73.2 85.9 70.8 85.3 43.9 70.1 38.4 69.3 61.5 73.3 54.6 70.8
78.8 89.5 75.6 88.2 50.0 74.2 46.3 74.4 64.1 76.5 62.4 75.8
76.0 87.7 73.9 87.4 49.3 73.6 45.4 73.3 63.8 75.9 58.7 73.3
75.1 87.9 73.8 87.8 48.5 73.7 44.4 72.7 63.4 75.2 58.5 71.1
ExtendOVA 74.9 86.3 72.6 86.8 45.7 72.9 44.2 73.1 62.0 74.0 55.3 71.1
71.4 85.3 67.6 83.9 42.1 68.4 37.7 68.2 58.1 71.9 51.9 68.2

5.3. Comparative Results with Different Settings

We compare our ExtendOVA with the current state-of-the-art methods. The evaluation is conducted on all cameras encountered so far, and the final-step results are reported in Table 1 for both the general and exceptional setups. We summarize the results as follows:

  • Our proposed ExtendOVA outperforms the current state-of-the-art methods by a clear margin and is the closest to the upper bound Joint-T. Notably, it even achieves comparable or better results than the replay-based methods, validating the effectiveness of our proposed solutions.

  • Previous methods, designed for the non-overlapping setting, achieve poor performance. We attribute this to two aspects: first, in the absence of cross-camera labels, these methods fail to learn cross-view representations; second, turning old classes into new ones results in false-positive predictions for the spurious classes, leading to accumulated errors.

  • Interestingly, our proposed baseline outperforms the data-free methods LwF, AKA and AGD, demonstrating the potential improvement in addressing the class-overlap issue.

5.4. Ablation Study

A closer look at early stage regularization. In Fig. 4, we plot the per-sample probability distribution of confidence scores generated by different methods. As can be seen, the baseline method uses the maximum probability as the confidence score, resulting in significant confusion between seen and unseen classes. Higher threshold values will reject a large number of seen classes. While the output of the OVA detector is more discriminative, there is still significant noise introduced due to domain shift. Our method improves upon the OVA detector by incorporating early regularization learning, which significantly mitigates the noise caused by domain shift.

To further observe the impact of early regularization learning on the ID-wise predictive performance, Fig. 5 shows the training curves of the ID loss and the model accuracy on seen classes. We observe that both models show a decreasing loss and eventually converge. In the initial iterations, the accuracy of both models increases, indicating that the models have not yet started fitting the noise. After a certain number of iterations, the accuracy of the model without early regularization gradually decreases, while our method corrects the noise in the early stage, resulting in a continued increase in accuracy.

Table 3. Comparison of different distillation losses in the general setup. Base. is trained with $\mathcal{L}_{\text{ID}^{*}}+\mathcal{L}_{\text{Aux}}$.
Market-1501 DukeMTMC
Method mAP Rank-1 mAP Rank-1
Base. 57.8 78.4 44.1 62.7
Base. + $\mathcal{L}_{\text{KD}}$ 62.5 81.0 47.5 65.3
Base. + $\mathcal{L}_{\text{CD}}$ 65.5 83.4 48.6 65.6
Base. + EMA 66.4 83.7 48.7 66.7
Base. + EMA + $\mathcal{L}_{\text{KD}}$ 67.1 84.2 50.5 68.4
Base. + EMA + $\mathcal{L}_{\text{CD}}$ 70.3 86.6 53.4 70.2
Refer to caption
Figure 5. Curves of CE loss and model performance on seen classes. ELR: Early stage learning regularization.

Effectiveness of the different components. We conduct ablation studies in the three-step exceptional setup to evaluate the effectiveness of each module at each step. To evaluate our ID-wise pseudo label generation module, we conduct experiments in which the $\mathcal{L}_{\text{Aux}}$ component is disabled, and we also compare against a baseline, i.e., LwF, which is trained using the ReID loss supervised by intra-camera labels. First, as shown in Table 2, removing $\mathcal{L}_{\text{Aux}}$ decreases the final performance by 0.9% to 3.7% in mAP. Second, the combination of $\mathcal{L}_{\text{Aux}}$ and $\mathcal{L}_{\text{ID}^{*}}$ brings gains of 3.3% to 8.2% in mAP over the baseline. This suggests that simply optimizing the cross-entropy loss with intra-camera supervision is not sufficient.

To evaluate the contribution of the EMA scheme, we conduct experiments with it removed. Without the EMA technique, the performance drops by between 2.1% and 4.3% mAP in the second step, and the degradation becomes more significant as incremental training proceeds. This clearly indicates that this design is crucial for overall performance. While the effect of $\mathcal{L}_{\text{CD}}$ is not as pronounced as that of the EMA scheme, it still has a clear impact on performance. When both terms are removed, there is a significant decline, suggesting that $\mathcal{L}_{\text{CD}}$ plays a role in maintaining the overall performance.

Refer to caption
Figure 6. Anti-forgetting evaluation on MSMT17 and DukeMTMC in the general setup. mAP and Rank-1 score on the test set of original cameras (test set on step 1) during the training process.

Comparison with different distillation losses. Table 3 compares the performance of our method with different distillation losses in the general setup. Specifically, we evaluate the impact of the knowledge distillation (KD) and cross-camera distillation (CD) losses on a baseline model trained with the ID loss ($\mathcal{L}_{\text{ID}^{*}}$) and the auxiliary loss ($\mathcal{L}_{\text{Aux}}$). From the table, we observe that incorporating a distillation loss, either $\mathcal{L}_{\text{KD}}$ or $\mathcal{L}_{\text{CD}}$, improves the performance of the baseline model in terms of mAP and Rank-1 accuracy. Notably, adding $\mathcal{L}_{\text{CD}}$ achieves higher performance than adding $\mathcal{L}_{\text{KD}}$, indicating that $\mathcal{L}_{\text{CD}}$ is more effective in preserving the structure-wise knowledge.

Table 4. Performance (%) of seen class identification by our proposed method. We report the scores obtained in the second step.
Market-1501 MSMT17 DukeMTMC
Setup Method Prec Recall Prec Recall Prec Recall
Baseline 98.4 56.4 97.5 40.6 51.7 40.5
OVA 93.3 63.6 98.2 64.3 98.3 60.9
G ExtendOVA 89.8 80.0 95.5 74.5 92.9 86.5
Baseline 99.0 56.7 99.2 33.5 94.4 66.9
OVA 97.9 61.1 98.5 40.1 95.7 70.5
E ExtendOVA 96.5 75.6 98.8 51.8 95.0 77.9
Table 5. Results in the setup where multiple cameras are introduced.
Market-1501 MSMT17 DukeMTMC
Method mAP Rank-1 mAP Rank-1 mAP Rank-1
AKA (Pu et al., 2021) 57.9 69.4 32.5 60.6 45.2 62.1
AGD (Lu et al., 2022) 62.8 72.0 31.6 65.1 42.4 60.6
ExtendOVA 70.6 86.7 36.8 72.0 52.5 68.9

5.5. Anti-Forgetting Evaluation

We evaluate the anti-forgetting properties of our proposed method by measuring the performance on the test set of the first step after each subsequent step. Fig. 6 plots the forgetting trend on DukeMTMC and MSMT17 in the general setup. Our method shows superior anti-forgetting properties, with no performance degradation and even a slight improvement on previous tasks. Our baseline model also exhibits less forgetting than the data-free methods, clearly indicating that class overlap is an issue that must be addressed. AGD employs DeepInversion (Yin et al., 2020) to generate synthetic exemplars of previously learned classes; however, it still suffers from catastrophic forgetting. This is due to its reliance on a unified classifier: treating overlapping classes as new ones results in the generation of a significant amount of noise.

5.6. Further discussion

Seen class identification. To further study the potential of identifying seen classes, we compare our method with the baseline and the OVA detector, using precision (Prec) and recall as metrics. Precision is the percentage of truly seen classes among the selected classes, and recall is the percentage of selected seen classes among all seen classes in the new data. The results in Table 4 show that our proposed method outperforms both the baseline and OVA in identifying seen classes, with comparable precision but higher recall on all datasets. This indicates that our method effectively identifies seen classes in the new data, even under domain shift.

Extension to multiple introduced cameras. To validate that our method also applies to scenarios where multiple cameras are added at once, we conduct further evaluations by including 2 cameras in the incremental step. Table 5 presents the performance at the end of training in this setting. Notably, our method continues to outperform the other state-of-the-art methods even when additional cameras are introduced, suggesting that it can address a more general CIPR problem.

6. Conclusion

In this paper, we introduce a new yet very practical task, i.e., Camera Incremental Person ReID (CIPR). We particularly emphasize the class-overlap issue brought by CIPR, where a new camera may capture identities seen before while ideal global cross-camera annotations are absent. To approach this task, we design a novel framework called ExtendOVA, which addresses the class overlap issue by exploiting a One-vs-All detector combined with an early-stage regularization term to achieve global pseudo-label assignment. Extensive experiments verify the effectiveness and superiority of ExtendOVA.

References

  • Bendale and Boult (2016) Abhijit Bendale and Terrance E Boult. 2016. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chan et al. (2021) Robin Chan, Matthias Rottmann, and Hanno Gottschalk. 2021. Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Fu et al. (2019) Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S Huang. 2019. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In proceedings of the IEEE international conference on computer vision (ICCV).
  • Ge et al. (2022) Wenhang Ge, Junlong Du, Ancong Wu, Yuqiao Xian, Ke Yan, Feiyue Huang, and Wei-Shi Zheng. 2022. Lifelong Person Re-identification by Pseudo Task Knowledge Preservation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • Ge et al. (2021) Wenhang Ge, Chunyan Pan, Ancong Wu, Hongwei Zheng, and Wei-Shi Zheng. 2021. Cross-Camera Feature Prediction for Intra-Camera Supervised Person Re-identification across Distant Scenes. In Proceedings of the 29th ACM International Conference on Multimedia (ACMMM).
  • Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM (2020).
  • Grandvalet and Bengio (2004) Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised learning by entropy minimization. Advances in neural information processing systems (NeurIPS) (2004).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • He et al. (2020) Lingxiao He, Xingyu Liao, Wu Liu, Xinchen Liu, Peng Cheng, and Tao Mei. 2020. Fastreid: A pytorch toolbox for general instance re-identification. arXiv preprint arXiv:2006.02631 (2020).
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016).
  • Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
  • Hou et al. (2019) Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. 2019. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2021) Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Peng Chu, Quanzeng You, Jiang Wang, Zicheng Liu, and Zheng-jun Zha. 2021. Lifelong Unsupervised Domain Adaptive Person Re-identification with Coordinated Anti-forgetting and Adaptation. arXiv preprint arXiv:2112.06632 (2021).
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).
  • Li and Hoiem (2017) Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence (TPAMI) (2017).
  • Liang et al. (2018) Shiyu Liang, Yixuan Li, and R. Srikant. 2018. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In 6th International Conference on Learning Representations (ICLR).
  • Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. 2020. Energy-based out-of-distribution detection. Advances in neural information processing systems (NeurIPS) (2020).
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In International conference on machine learning (ICML).
  • Lu et al. (2022) Yichen Lu, Mei Wang, and Weihong Deng. 2022. Augmented Geometric Distillation for Data-Free Incremental Person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Luo et al. (2019) Hao Luo, Wei Jiang, Xuan Zhang, Xing Fan, Jingjing Qian, and Chi Zhang. 2019. Alignedreid++: Dynamically matching local information for person re-identification. Pattern Recognition (PR) (2019).
  • McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation (PLM).
  • Padhy et al. (2020) Shreyas Padhy, Zachary Nado, Jie Ren, Jeremiah Liu, Jasper Snoek, and Balaji Lakshminarayanan. 2020. Revisiting one-vs-all classifiers for predictive uncertainty and out-of-distribution detection in neural networks. arXiv preprint arXiv:2007.05134 (2020).
  • Peng et al. (2022) Yi-Xing Peng, Jile Jiao, Xuetao Feng, and Wei-Shi Zheng. 2022. Consistent Discrepancy Learning for Intra-camera Supervised Person Re-identification. IEEE Transactions on Multimedia (TMM) (2022).
  • Pidhorskyi et al. (2018) Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. 2018. Generative probabilistic novelty detection with adversarial autoencoders. Advances in neural information processing systems (NeurIPS) (2018).
  • Pu et al. (2021) Nan Pu, Wei Chen, Yu Liu, Erwin M Bakker, and Michael S Lew. 2021. Lifelong person re-identification via adaptive knowledge accumulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ratcliff (1990) Roger Ratcliff. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review (1990).
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR).
  • Ren et al. (2018) Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. 2018. Meta-learning for semi-supervised few-shot classification. International Conference on Learning Representations (ICLR) (2018).
  • Saito and Saenko (2021) Kuniaki Saito and Kate Saenko. 2021. Ovanet: One-vs-all network for universal domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Song et al. (2019) Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. 2019. Generalizable person re-identification by domain-invariant mapping network. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR).
  • Sun et al. (2018) Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European conference on computer vision (ECCV).
  • Sun and Mu (2022) Zhicheng Sun and Yadong Mu. 2022. Patch-based Knowledge Distillation for Lifelong Person Re-Identification. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM).
  • Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems (2017).
  • Wang et al. (2020) Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Wei et al. (2018) Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. 2018. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
  • Wu and Gong (2021) Guile Wu and Shaogang Gong. 2021. Generalising without forgetting for lifelong person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • Wu et al. (2020) Guile Wu, Xiatian Zhu, and Shaogang Gong. 2020. Tracklet self-supervised learning for unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • Xiao et al. (2016) Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. 2016. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
  • Xu et al. (2021) Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. 2021. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Yin et al. (2020) Hongxu Yin, Pavlo Molchanov, Jose M Alvarez, Zhizhong Li, Arun Mallya, Derek Hoiem, Niraj K Jha, and Jan Kautz. 2020. Dreaming to distill: Data-free knowledge transfer via deepinversion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yu et al. (2017) Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. 2017. Cross-view asymmetric metric learning for unsupervised person re-identification. In Proceedings of the IEEE international conference on computer vision (ICCV).
  • Yu et al. (2018) Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. 2018. Unsupervised person re-identification by deep asymmetric metric embedding. IEEE transactions on pattern analysis and machine intelligence (TPAMI) 42, 4 (2018).
  • Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision (ICCV).
  • Zheng et al. (2017) Zhedong Zheng, Liang Zheng, and Yi Yang. 2017. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision (ICCV).
  • Zhu et al. (2019) Xiangping Zhu, Xiatian Zhu, Minxian Li, Vittorio Murino, and Shaogang Gong. 2019. Intra-camera supervised person re-identification: A new benchmark. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV).
  • Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. 2018. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations (ICLR).
  • Zou et al. (2020) Yang Zou, Xiaodong Yang, Zhiding Yu, BVK Kumar, and Jan Kautz. 2020. Joint disentangling and adaptation for cross-domain person re-identification. In Proceedings of the European conference on computer vision (ECCV).