Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification

Jiangbo Pei1, Zhuqing Jiang21, Aidong Men, Haiying Wang, Haiyong Luo, Shiping Wen Jiangbo Pei, Aidong Men and Haiying Wang are with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China, (e-mail: [email protected]; [email protected]; [email protected]).Zhuqing Jiang is with Beijing Key Laboratory of Network System and Network Culture, and also with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China, (e-mail: [email protected]).Haiyong Luo is with Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, (e-mail: [email protected]).Shiping Wen is with Australian AI Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, NSW 2007, Australia, (email: [email protected]).2Corresponding author1Equal contribution

Abstract

Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into meta-train set and meta-test set based on camera IDs and perform a cross-camera simulation via meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.

Index Terms:

Person re-identification, single-camera-training, camera-invariant features, meta-learning, domain generalization.

I Introduction

Person re-identification (re-ID) aims to identify a specific person across multiple camera views [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. It has drawn increasing attention in recent years due to its importance in video surveillance and security systems. Most previous works focus on supervised re-ID [21, 22, 23, 24, 25, 26, 27, 28]. Although they have achieved high accuracies, their success relies on tremendous annotated data. More specifically, it depends particularly on the annotated data that belongs to the same person but is captured from different cameras, which we call cross-camera same-person (CCSP) data.

Refer to caption — Figure 1: The comparison between SCT re-ID setting and previous re-ID settings. Different colors represent different identities. Conventional supervised re-ID data are composed of a large number of people appearing in multiple cameras, and there are many data that belong to the same identity but are captured by different cameras (CCSP data), such as the ID-1 in this figure. In this setting, we know the identity of each data. Unsupervised re-ID has no identity annotation for data, but it still requires latent CCSP data in the training set (such as the blue data). In weakly-supervised re-ID, data are usually weakly annotated. We showed an example of [29], which annotates data at bag-level, and the identity of each data is unknown. Similar to unsupervised re-ID, this setting also needs CCSP data for training. In contrast, under the single-camera-training (SCT) setting, each person appears in only one camera; thus, there are no CCSP data. Compared to previous settings, this setting is easy to collect and annotate data.

In practice, it is tough to collect so much available CCSP data. First, the collection is based on the assumption that there is a large amount of overlapping person between different cameras. However, this assumption may not be valid in the real world, especially between some remote cameras. Second, since the time of a person appearing in different cameras is unpredictable, searching and annotating CCSP data are exhausting and expensive projects. These limitations severely hinder the real-world, large-scale deployment of conventional supervised re-ID methods.

In the past few years, many works have been proposed to reduce the annotation cost. These methods belong to three main categories: namely unsupervised re-ID, semi-supervised re-ID, and weakly-supervised re-ID. Unsupervised re-ID methods [30, 31, 32] aim to train a re-ID model using only unlabeled data, while semi-supervised re-ID methods [33] add some labeled data as assistance. And for weakly-supervised re-ID [29, 34, 35], previous works focus on leveraging more economic annotation as supervision. Although these methods have achieved outstanding results in alleviating the annotation cost, they do not change the inherent reliance of the model on CCSP data. Thus, they still require a lot of unlabeled/weakly-labeled CCSP data for training.

In this paper, we aim to address the single-camera-training (SCT) re-ID [36], which eliminates the reliance of the re-ID model on CCSP data by enforcing each person appears in only one camera. We compare SCT re-ID setting to previous settings in Fig. 1 and illustrate their two advantages as follows: First, SCT re-ID removes the heavy burden of collecting CCSP data and makes it easy to prepare the training data. For example, using some existing tracking techniques [37, 38], we can quickly get a large number of tracklets under each camera at different periods, and each of them most likely corresponds to a unique ID. Second, SCT re-ID meets the fact that there is almost no overlapping identity between remote cameras. Thus, it has the potential to be deployed in more scenarios.

The main challenge of SCT re-ID is to learn camera-invariant feature representations without CCSP data. Specifically, previous re-ID settings have sufficient CCSP data for training, which provides the critical supervision for a model to learn camera-invariant feature representations. In contrast, there are no available CCSP data in SCT re-ID, so that we must turn to other types of supervision to achieve the goal. The prior work [36] assumes that the most similar person is found in another camera. Based on the assumption, they formulate a multi-camera negative loss function to build a relation between samples from different cameras. Although they have made progress, the irrationality of the assumption limits their performance in practice.

To this end, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. Intuitively, if the feature representations are robust to camera changes, we could consider them to be camera-invariant. Inspired by this, a cross-camera simulation via meta-learning pipeline are designed. Specifically, at each training iteration, CIMN random samples data from two different cameras as meta-train set and meta-test set, respectively. CIMN first virtually trains the meta-train set to learn feature representations and then transfers the learned representations to the meta-test set to examine their applicability on the new set. The objective of the cross-camera simulation named simulation loss, enforcing the representations learned from the meta-train set to perform well on the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. A natural relation between the two sets is the negative relations, i.e., the images in one set have a different identity from the images in the other set since they belong to different cameras. To leverage this relation, we design two loss functions, namely meta triplet loss, and meta classification loss. There are also positive relations between the two sets. Specifically, we argue that there should be no camera-related features in the camera-invariance feature representations. Thus, data from different sets should be embedded into the same feature space. To exploit this relation, we introduce a meta camera alignment loss to align the distribution of the two sets in the feature space. Finally, these losses and the simulation loss are combined for meta-optimization. Note that the meta-learning technology has been introduced into re-ID currently: [39] propose MetaBIN to simulate unsuccessful generalization scenarios for domain generalizable re-ID, and [20] exploit meta-learning pipeline for unsupervised re-ID. However, our purpose is entirely different from them because they have a lot of labeled/unlabeled CCSP data for training, while we aim to train the model without CCSP data.

We conduct extensive experiments to demonstrate the effectiveness of our proposed approach on three open-source SCT re-ID benchmarks. The result shows that our approach is superior to the state-of-the-art SCT re-ID methods. More valuable, our controlled experiment shows that under the same data volume, the performance of our CIMN without CCSP data is comparable to that of the conventional approaches with CCSP data, and is also similar to that of CIMN with CCSP data. In addition, by promoting to learn camera-invariant feature representations, CIMN is also robust to camera changes. As a result, our model is also more generalizable to new re-ID datasets captured from other cameras.

We summarize our contributions as follows:

•

In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) to address the challenging SCT re-ID task. CIMN proposes a cross-camera simulation process to enforce the representations learned from one camera (the meta-train set) to perform well on another camera (the meta-test set), thus guiding the model to learn camera-invariant representations in the absence of cross-camera same-person (CCSP) data.
•

We explore the negative and positive relations that are ignored in the cross-camera simulation process and propose the meta triplet loss, the meta classification loss and the meta camera alignment loss to exploit these relations.
•

Extensive experiments demonstrate the effectiveness of our method in SCT re-ID. Particularly, we show that the performance of our CIMN without CCSP data is comparable to that of the conventional approaches with CCSP data under the same data volume. Besides, our method also promotes the model’s generalization ability to new re-ID datasets collected from other cameras.

II Related Work

II-A Conventional Person Re-Identification

Conventional supervised re-ID is often referred to as fully-supervised re-ID [40, 36, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51]. Most of the conventional supervised re-ID works aim to learn identity-discriminative features with representation methods [52]. These methods design attention selection[53], semantic segmentation[21] and diverse representations [22, 23] for extracting robust features. Besides, many works [54] address conventional supervised re-ID using metric learning methods. Inspired by advanced metric learning works [55], lots of ranking algorithms[56] and metric loss functions [57, 58] are proposed for re-ID. Although existing re-ID studies have achieved impressive accuracies in the field of the conventional supervised re-ID, their successes heavily rely on the annotated CCSP data. As mentioned, CCSP data is expensive to be annotated and is few to none among remote cameras, which prevents these methods from deploying on a large scale.

In the past few years, many works are proposed to reduce the annotation costs of re-ID. Unsupervised re-ID [59, 60, 61, 62, 31, 32, 63, 64, 65, 66, 67, 68] aims to train a re-ID model only using unlabeled data. These methods usually assign pseudo labels to unlabeled data and then train the model with the pseudo labels. Semi-supervised re-ID [33] uses labeled and unlabeled data to learn a re-ID model jointly. Similar to the unsupervised methods, they also focus on pseudo-labels approaches, but with additional labeled data as an auxiliary. Weakly-supervised re-ID aims to use more economic annotation to replace the full annotation in conventional supervised re-ID. For example, Wang et al. [29] group the images into bags and annotate data with bag-level annotations. Zhu et al. [35] propose to annotate each camera independently. Although the above methods have achieved great results in reducing the annotation cost, their essence is to self-discover the identity correspondence among unlabeled/labeled+unlabeled/weakly-labeled data, which still requires potential CCSP data for training.

II-B Single-Camera-Training Person Re-Identification

SCT re-ID[36] assumes that there is no CCSP data in the training set. It not only helps to reduce the annotation costs, but also has the potential to be deployed to some remote cameras where there is no CCSP data.

To address SCT re-ID, Zhang et al. [36] assume that the most similar person is found in another camera to associate data from different cameras. However, since the changes in the camera often lead to larger discrepancies in the image, this assumption is hardly guaranteed to hold true. Ge et al. [69] design a local-branch that localize and extract local-level feature for SCT re-ID using a transformer [70] structure. Wu et al. [71] propose CCFSG which introduce the $\sigma-$ Regularized Conditional Variational Autoencoder to synthesize the cross-camera samples for model training. In contrast, we propose the CIMN that guides the feature representation to be camera-invariant by enforcing the representation learned from a particular camera to perform well on other cameras, which is model-agnostic and is orthometric to these methods [69, 71].

II-C Meta-Learning

Meta-Learning has recently obtained much attention, in which a learner learns new tasks and another meta learner learns to train the learner. Recent meta-learning studies can be roughly classified into optimizing-based[72], model-based[73], and metric-based methods [74]. Our work is related to Model Agnostic Meta-Learning (MAML) [72], an optimizing-based meta-learning framework that aims to find a good initialization of parameters that are sensitive to changes in novel tasks. The variants of MAML have been widely used in computer vision tasks various tasks, such as domain generalization [75], and frame interpolation [76]. Recently, MAML has also been introduced into re-ID, including domain generalized re-ID [39, 77] and unsupervised re-ID[20]. Choi et al. [39] and Ni et al. [77] aim at simulating domain generalization scenarios to generalize normalization layers, which focuses on the domain level discrepancy. In both their meta-train process and meta-test process, the model is trained on several fully labeled datasets with sufficient CCSP data. Yang et al. [20] are more related to our works, which also focus on the camera level. They aim to adapt to the shifts caused by cameras by simulating a cross-camera searching process. However, the searching process essentially relies on associating potential unlabeled CCSP data with the cross-entropy loss function based on the pseudo-labels. In contrast, we mainly focus on how to train the model without CCSP data. We deliberately cut the relations between different cameras during the cross-camera simulation, and supplement them with three proposed losses in subsequent meta-optimizations. Thereby, our method is different from [39, 20] in both purpose and implementation.

II-D Domain Generalizable Person Re-Identification

Current re-ID models trained on a source dataset (domain) are well known to suffer from performance degradation when generalized to a new target dataset. To this end, a lot of generalizable re-ID methods are proposed [39, 77, 78, 79, 80, 81, 82, 83, 84]. For example, Jin et al.[79] design an effective SNR module to distill identity-relevant feature from the removed information and restitute it to the network. Chen et al. [80] propose a DDAN to map images into a domain-invariant feature space by selectively aligning the distributions of multiple source domains. Although domain generalization ability is not our main focus, we find that our model is also beneficial to this field. We conjecture the reason is that CIMN encourages the model to be robust to camera changes, therefore promoting it to be more generalizable to new re-ID datasets captured from other cameras.

III APPROACH

III-A Overview

We begin with a formal description of the SCT re-ID problem. We assume that we are given a dataset D captured from $N$ cameras $D=\{{C_{n}}\}_{n=1}^{N}$ . Each camera $C_{n}$ contains $M_{n}$ image-label pairs $C_{n}=\{(x_{i}^{n},y_{i}^{n})\}_{i=1}^{M_{n}}$ . Each image $x_{i}^{n}\in\mathcal{X}_{n}$ corresponds to an identity label $y_{i}^{n}\in\mathcal{Y}_{n}$ . Different from conventional re-ID, there is no overlapped identity between two different cameras in SCT re-ID, i.e., $\forall i,j,0\leq i<j\leq N:\mathcal{Y}_{i}\cap\mathcal{Y}_{j}=\emptyset$ . Our goal is to learn a model parameterized by $\theta$ with $D$ , which can extract camera-invariant identity-discriminative feature $f\in\mathbb{R}^{d\times 1}$ for cross-camera retrieval. The overall of our proposed CIMN is presented in Fig. 2. Particularly, we construct camera pairs and sample meta-train set and meta-test set from each pair to form meta-batch. During training, CIMN cuts off the relation between two sets to perform a cross-camera simulation, and encourages the model to learn camera-invariant features by the simulation loss $L_{smi}$ . After that, the meta triplet loss $L_{mtri}$ , the meta classification loss $L_{cl}$ , and the meta camera alignment loss $L_{mca}$ are proposed to leverage the negative and positive relations that ignored by the cross-camera simulation.

III-B Meta-Batch Preparation.

To simulate a cross-camera process, we sample meta-batch that includes images from two different cameras, and assign them to meta-train set and meta-test set according their camera IDs. Given a training dataset captured by $N$ different cameras $D=\{C_{1},C_{2},...,C_{N}\}$ , we sample the meta-batch as follows: (1) In the $n$ -th epoch, we select $C_{q(n)}$ as the meta-train camera $C_{mtr}$ , where $q(n)$ is the remainder of $n$ divided by $N$ . (2) A meta-test camera $C_{mte}$ is randomly selected from the remaining cameras, forming camera pair $\{C_{mtr},C_{mte}\}$ . (3) We randomly sample $P\times K$ images from the meta-train camera as the meta-train set $S_{mtr}$ and $P\times K$ images from the meta-test camera as the meta-test set $S_{mte}$ , where $P$ is the number of identities, and $K$ is the number of images that belong to each identity. Then, a $2\times P\times K$ meta-batch is built, denoted as $\{S_{mtr},S_{mte}\}$ .

III-C Cross-Camera Simulation

The cross-camera simulation simulates a process that trains a model on one camera and evaluates its performance on a new camera via meta-learning. The process consists of two steps: meta-train and meta-test.

III-C1 Meta-Train

In each meta-batch $\{S_{mtr},S_{mte}\}$ , given a model with initial parameters $\theta$ , we first virtually train the model on the meta-train set $S_{mtr}$ . In this step, we adopt triplet loss[85] as criterion. The triplet loss on $S_{mtr}$ can be formulated as:

L_{tri}(mtr,\theta)=\sum_{j=1}^{P\times K}[d(f_{j},f_{j}^{p}))-d(f_{j},f_{j}^{n}))+m]^{+},

(1)

where $f_{j}=f_{\theta}(x_{j})$ represents the feature of an anchor image $x_{j}$ extracted by the model with parameter $\theta$ , $f_{j}^{p}$ and $f_{j}^{n}$ are the features of the anchor’s farthest positive image and nearest negative image within $S_{mtr}$ , $d(.)$ is the euclidean distance function, $m$ is a margin value set to 0.3, and $[z]^{+}=max(z,0)$ .

With a learning rate $\eta$ , we can update the model with SGD optimizer like the conventional metric learning. Then we obtain an intermediate parameter $\theta^{\prime}$ , representing the parameter of the model after learning on the $S_{mtr}$ .

\theta^{\prime}=\theta-\eta\nabla_{\theta}L_{tri}(mtr,\theta).

(2)

III-C2 Meta-Test

We then transfer the model trained on $S_{mtr}$ (parameterized by $\theta^{\prime}$ ) to the meta-test set $S_{mte}$ for evaluation. Following 1, we calculate the triplet loss of the model on $S_{mte}$ :

\displaystyle L_{tri}(mte,\theta^{\prime})=\sum_{j=1}^{P\times K}[d({f^{\prime}}_{j},{f^{\prime}}_{j}^{p}))-d({f^{\prime}}_{j},{f^{\prime}}_{j}^{n}))+m]^{+},

(3)

where ${f^{\prime}}_{j}=f_{\theta^{\prime}}(x_{j})$ represent the feature of the image $x_{j}$ extract by the model with the intermediate parameter $\theta^{\prime}$ .

III-C3 Simulation Loss

On the one hand, we expect the model to learn identity-discriminative representations from $S_{mtr}$ so that it should perform well on this set. On the other hand, we expect the model learned from $S_{mtr}$ to be camera-invariant, so that it should also perform well on $S_{mte}$ . With the two purposes, the objective of the cross-camera simulation can be formulated as:

L_{smi}=\lambda L_{tri}(mtr,\theta)+(1-\lambda)L_{tri}(mte,\theta^{\prime}),

(4)

where $\lambda$ is a trade-off coefficient of the two purposes.

By substitute Eq. 2 into this objective, the simulation loss is:

	$\displaystyle L_{smi}=$	$\displaystyle\lambda L_{tri}(mtr,\theta)$		(5)
	$\displaystyle+$	$\displaystyle(1-\lambda)L_{tri}(mte,\theta-\eta\nabla_{\theta}L_{tri}(mtr,\theta)).$		(5)

Here we explain why we adopt triplet loss in the cross-camera simulation. Although there is no overlapping person between the meta-train set and the meta-test set, the two sets still have potential relations. We should cut off these relations in meta-train to avoid the model implicitly touching the meta-test set. We select the triplet loss as our criterion considering its local-level nature, i.e., it only considers the relationship between samples in a certain closed set and does not interact with other sets.

On the other hand, although we have to cut off the relations between the meta-train set and the meta-test set during simulation, these relations are undoubtedly beneficial for training. To leverage these relations, we further design three losses: 1) meta triplet loss, 2) meta classification loss and 3) meta-camera alignment loss, which are introduced in the following section.

III-D Meta-Optimization

III-D1 Meta Triplet Loss

We introduce a meta triplet loss to capture the identity relations between the two sets. In SCT re-ID setting, these relations are the negative relations, i.e., the images in one set have completely different identities from the images in the other set. Specifically, for an image, meta triplet loss aims to pull it closer to its farthest positive sample in the same set, and push it away from the nearest negative sample in different sets. It can be formulated as:

\displaystyle L_{mtri}=\sum_{j=1}^{2\times P\times K}[d({F}_{j},{F}_{j}^{p}))-d({F}_{j},{F}_{j}^{n}))+m]^{+},

(6)

where $F_{j}$ is the feature of an anchor image $x_{j}$ , $F_{j}^{p}$ is the feature of the anchor’s farthest positive image within the same set, while $F_{j}^{n}$ is the feature of its nearest negative image in the other set. To be consistent with the cross-camera simulation, we use different parameters to extract features from the meta-train set and the meta-test set:

F_{j}=\left\{\begin{aligned} &f_{\theta}(x_{j}),&x_{j}\in S_{mtr}\\ &{f}_{\theta^{\prime}}(x_{j}),&x_{j}\in S_{mte}\\ \end{aligned}\right.

(7)

III-D2 Meta Classification Loss

Classification loss is prevalent in the re-ID field, which uses cross-entropy to measure the difference between model prediction distribution and the actual label distribution. This loss has been widely demonstrated to be effective in re-ID. It can naturally exploit the negative relations between the meta-train set and the meta-test set in the label space since an image that belongs to one identity label must not belong to other identity labels. It can not be adopted by the cross-camera simulation which requires cutting off the two sets. However, this loss can be a good choice for associating the two sets. To make it suitable for our model, we improve it to meta classification loss.

For the meta-train set $S_{mtr}$ , we calculate the classification loss with the initial parameter $\theta$ in the meta-train process:

L_{cl}({mtr,\theta})=\sum_{j=1}^{P\times K}CE[(f_{\theta}(x_{j})),y_{j}],

(8)

where CE is the cross-entropy function, $x_{j}$ is the images in $S_{mtr}$ , $y_{j}$ is the label of $x_{j}$ , $f_{\theta}(x_{j})$ represents the features of $x_{j}$ extracted by model $\theta$ . Note that the features in this formulation should have the same dimension as the number of labels (identities).

For the meta-test set, the classification loss is calculated based on the intermediate parameters $\theta^{\prime}$ :

L_{cl}({mte,\theta^{\prime}})=\sum_{j=1}^{P\times K}CE(f_{\theta^{\prime}}(x_{j})),y_{j}],

(9)

where $x_{j}$ is the images in $S_{mte}$ .

Then the loss can be formulated as:

\ L_{mcl}=L_{cl}({mtr,\theta})+L_{cl}({mte,\theta^{\prime}}),

(10)

III-D3 Meta Camera Alignment Loss

Besides the negative relations in identities, there are also positive relations between the two sets. Specifically, we argue that there should be no camera-related features in the camera-invariance feature representations. Thus, data from different sets should be embedded into the same feature space. To this end, we propose a meta camera alignment loss, aiming at aligning the feature distribution of the two sets. Meta camera alignment adopts Maximum Mean Discrepancy (MMD) [86] and center distance to measure the distance between two distributions. We use the initial parameter $\theta$ as the model parameter to extract features from $S_{mtr}$ and use the intermediate parameter $\theta^{\prime}$ to extract from $S_{mte}$ . We denote the feature sets of $S_{mtr}$ and $S_{mte}$ as $F_{mtr}$ and $F_{mte}$ , respectively. The MMD is calculated by

	$\displaystyle MMD($	$\displaystyle F_{mtr},F_{mte})=\frac{1}{(P\times K)^{2}}[\sum_{i=1}^{P\times K}\sum_{j=1}^{P\times K}k(f_{mtr}^{i},f_{mtr}^{j})+$		(11)
		$\displaystyle\sum_{i=1}^{P\times K}\sum_{j=1}^{P\times K}k(f_{mte}^{i},f_{mte}^{j})-\sum_{i=1}^{P\times K}\sum_{j=1}^{P\times K}k(f_{mtr}^{i},f_{mte}^{j})],$		(11)

Where and $k(.,.)$ is a Gaussian kernel, $f_{mtr}$ , $f_{mte}$ is the features from $F_{mtr}$ and $F_{mte}$ , respectively.

The center distance is the Euclidean distance between the center of $F_{mtr}$ and $F_{mte}$ :

\begin{split}Cen(F_{mtr},F_{mte})=\frac{1}{(P\times K)^{2}}\|\sum_{i=1}^{P\times K}f_{mtr}^{i}-\sum_{j=1}^{P\times K}f_{mte}^{j}\|^{2},\end{split}

(12)

Then the meta camera alignment loss of two distribuitons can be formulated as

\displaystyle L_{mca}=MMD(F_{mtr},F_{mte})+Cen(F_{mtr},F_{mte}).

(13)

III-D4 Overall

Combined the three losses and simulation loss, the final objective of the meta-optimization can be formulated as:

\displaystyle L=L_{smi}+\gamma_{1}L_{mtri}+\gamma_{2}L_{mcl}+\gamma_{3}L_{mca},

(14)

where $\gamma_{1}$ , $\gamma_{2}$ , $\gamma_{3}$ represent the weight of meta triplet loss, meta classification loss and meta camera alignment loss. The whole process is illustrated in Algorithm. 1.

Algorithm 1 CIMN algorithm

Input: Training dataset captured by $N$ cameras
$D=\{C_{1},C_{2},...,C_{N}\}$ .
Init: Model $\theta$ , hyperparameters $P$ , $K$ , $\eta$ , $\lambda$ , $\gamma_{1}$ , $\gamma_{2}$ , $\gamma_{3}$ .

1:for

epoch

max\_epoch/N

2: for each

C_{mtr}

D

: do

3: Sampling

C_{mte}

from remain cameras

4: Build camera pair

\{C_{mtr},C_{mte}\}

5: Preparing meta-batch

6: for each meta-batch

\{S_{mtr},S_{mte}\}

7: Meta-Train:

8: Calculate meta-train loss

L_{tri}(mtr,\theta)

9: Update model

\theta^{\prime}=\theta-\eta\nabla_{\theta}L_{tri}(mtr,\theta)

10: Meta-Test:

11: Calculating meta-test loss

L_{tri}(mte,\theta^{\prime})

12: Calculating simulation loss:

13:

L_{smi}=\lambda L_{tri}(mtr,\theta)+(1-\lambda)L_{tri}(mte,\theta^{\prime})

14: Calculating meta triplet loss

L_{mtri}

15: Calculating meta classificaiton loss

L_{mcl}

16: Calculating meta camera alignment loss

L_{mca}

17: Optimization:

18:

L=L_{smi}+\gamma_{1}L_{mtri}+\gamma_{2}L_{mcl}+\gamma_{3}L_{mca}

19: Update

\theta\leftarrow\theta-\eta\frac{\partial L}{\partial\theta}

20: endfor

21: endfor

22:endfor

TABLE I: Detailed statistics of the datasets used in our experiments. Market-STD, Market-SCT and Market-CG represent Market-1501 dataset in standard protocol, SCT protocol and Control Group protocol, respectively. The same is true for Duke-STD, Duke-SCT, Duke-CG, MSMT-STD and MSMT-CG.

Dataset	Train IDs	Train Images	Test IDs	Test Images	With CCSP data?
Market-STD	751	12,936	750	15,913	True
Market-SCT	751	3,561	750	15,913	False
Market-CG	751	3,561	750	15,913	True
Duke-STD	702	16,522	1,110	17,661	True
Duke-SCT	702	5,993	1,110	17,661	False
Duke-CG	702	5,993	1,110	17,661	True
MSMT-SCT	1041	6.645	3,060	93.820	False

IV Experiments

IV-A Experiment Settings

IV-A1 Dataset Preparation and Evaluation Metrics

Our main experiments are conducted on the widely used Market-1501 [87] and DukeMTMC-reID [88] datasets. Market-1501 dataset contains 32,668 images, belonging to 1501 identities. These images are taken from 6 cameras in front of the supermarket in the Tsinghua campus. Each identity is captured by multiple cameras. The standard evaluation protocol uses 12,936 bounding boxes of 751 identities for training and 19,281 bounding boxes of 750 identities for testing. DukeMTMC-reID dataset contains 36,411 images, belonging to 1812 identities and captured from 8 cameras. It provides 16,522 images of 702 identities for training and 19,889 images of 1110 identities for testing. To distinguish the standard setting with the following SCT setting, we denote the standard Market-1501 protocol and DukeMTMC-reID protocol as Market-STD, Duke-STD, respectively.

We construct SCT benchmarks following [36]. Specifically, for the training set, we randomly choose one camera for each person, and only take their images under the selected camera as training images. In this case, each person can only appear in one camera, so that there are no CCSP data in the training set. For the testing set, we keep the same as the standard testing set. We denote the SCT dataset built from Market-1501 and DukeMTMC-reID as Market-SCT and Duke-SCT, respectively.

Note that the number of training images is significantly reduced in the SCT training set due to the abandonment of CCSP data. Therefore, for fairly comparing, we propose a control group setting. Take Market1501 as an example; we randomly selected the same number of images as Market-SCT from the standard Market-1501 training set, denoted as Market-CG. The same goes for building Duke-CG. The constructions of the training set of the SCT setting and control group setting are shown in Fig. 3. The detailed statistics of the three settings are displayed in Tab. I. Despite the above benchmarks, we also evaluate our method on the MSMT-SCT dataset [69], which is derived from the MSMT-17 [89] following [69]. The details are also reported in Tab. I.

In this paper, we use two evaluation metrics, named Cumulated Matching Characteristics (CMC) and mean average precision (mAP). The CMC curve generally considers re-ID as a ranking problem and focuses on precision, while the mAP considers both precision and recall. We experiment with the above datasets and report mAP and the cumulated matching accuracy at Rank-1, Rank-5, and Rank-10.

IV-A2 Model

Our method does not restrain the architecture of the deep network and can be applied to any model. We here take the ResNet-50 model pre-trained on ImageNet [90] as the backbone network to implement CIMN, considering its wide adoption in academia and industry. The four res-convolution blocks of the Resnet-50 model, stage1,stage2, stage3 and stage4, with a global maximize pooling layer, are adopted as the backbone. A Batch Normalization and a fully connected classifier with desired dimensions (corresponding to the number of identities of a person) are added in turn. As shown in Fig. 4, the triplet loss within the cross-camera simulation and the meta-triplet loss are calculated by the features after the global maximize pooling layer. The meta classification loss is calculated with the features after the classifier. For the meta camera alignment loss, we align the distribution of feature after stage2 between the meta-train set and the meta-test set.

IV-A3 Implementation Details

In this paper, all experiments are conducted with PyTorch. The image size is $384\times 128$ and padded with 10. For each minibatch, we randomly sample $P$ person from $2$ different cameras, respectively, and then randomly sample $K$ images of each person. Here we set $P=8$ and $K=2$ so that the mini-batch size is $32$ . Some commonly used tricks, including random cropping, random flipping, random rotation, and random color jitter, are used. The trade-off coefficient of simulation loss $\lambda$ is set to 0.6. The weights of meta triplet loss, meta classification loss and meta alignment loss, $\gamma_{1}$ , $\gamma_{2}$ and $\gamma_{3}$ , are set to $1.0$ , $1.0$ and $0.02$ , respectively. We train the model for 240 epochs. The learning rate $\eta$ is adjusted based on the epoch $t$ as follows.

\eta(t)=\left\{\begin{aligned} &3.5\times 10^{-4}\times\frac{t}{30},&t\leq 30\\ &3.5\times 10^{-4},&5\leq t\leq 120\\ &3.5\times 10^{-5},&20\leq t\leq 180\\ &3.5\times 10^{-6},&30\leq t\leq 240\\ \end{aligned}\right.

(15)

In addition, following two previous SCT re-ID works [69, 71], we also implement our method on the global-local baseline, which additionally adds a local branch on the ResNet-50 (global branch) backbone. The implementation details are as follows. 1) In the global branch, the locations to use the meta triplet loss $L_{mtri}$ , the meta classification loss $L_{mcl}$ and the meta camera alignment loss $L_{mca}$ are the same as our original one (see Fig. 4). 2) $L_{mtri}$ and $L_{mca}$ are also adopted in the output of the local branch. 3) Other settings are the same as [69]. We denote this implementation as Ours(GL).

IV-B Evaluation of CIMN

This section evaluates the performance of our CIMN on Market-1501 and DukeMTMC-reID. For comparison, we also train the baseline model with two main approaches in conventional supervised re-ID, i.e., trained with triplet loss and classification loss. We denote them as Bas-T and Bas-C, respectively.

IV-B1 The Performance on Market-1501

We first conduct an experiment on Market-1501. We trained the model on the training set of Market-STD, Market-SCT and Market-CG respectively, and tested them on the standard testing set. Tab. II reports the results. We have the following observations/conclusions:

(1) We can see the conventional approaches show a good performance on the Market-STD. The best performance is achieved by Bas-C, which hits 89.4%, 96.7%, 97.9%, and 77.3% in Rank-1, Rank-5, Rank-10 and mAP, respectively. However, if we train BAS-C on Market-SCT, it drop to 36.1%, 53.5%, 62.4% and 13.2%, with 53.3%, 43.2%, 35.5% and 64.1% performance degradations compared to Market–STD. Similarly, the performance of Bas-T has also been suppressed by 50.4%, 41.9%, 35.7% and 60.0%. In contrast, CIMN still remain 73.3%/85.7%/91.1%/47.2% in Rank-1/Rank-5/Rank-10/mAP accuracy, which significantly outperform Bas-C by 37.2%/32.2%/28.7%/34.1%. The results show a huge advantage of our CIMN in SCT re-ID.

(2) Since the number of training images is significantly reduced in the SCT training set, we evaluate these methods on Market-CG, where there is the same number of images as Market-SCT but contains many CCSP data. In this case, the Bas-C perform 73.9%, 87.8%, 93.4%, and 47.9% accuracy in Rank-1, Rank-5, Rank-10 and mAP, respectively. And the performance of BAS-T is similar to it, both of which are better than their results on Market-SCT. This phenomenon proves that the CCSP data is very important for Bas-T and Bas-C that adopt conventional re-ID methods. On the other hand, the performance of CIMN on Market-SCT and Market-CG is very similar, and is comparable with that of conventional Bas-T/Bas-C on Market-CG. There are only 0.8% and 0.9% gaps in Rank-1 and mAP accuracy. This phenomenon demonstrates that by learning camera-invariant features with CIMN, the non-CCSP data and CCSP data in the training set can have a similar effect; thus, we can train the model without the expensive CCSP data, This is extremely valuable for the real-world application of re-ID.

(3)We can see that CIMN also achieves better performance on Market-CG. It outperforms the second-placed Bas-C in Rank-1 and mAP by 0.2% and 0.3%. The results indicate that CIMN is also advantageous when training with a small number of images.

TABLE II: Performance on Market-STD, Market-SCT and Market-CG datasets. Rank-1 accuracy, Rank-5 accuracy, Rank-10 accuracy, and mAP (%) are reported.

Dataset	Method	Rank-1	Rank-5	Rank-10	mAP
Market-STD	Bas-T	88.9	96.1	97.4	74.3
	Bas-C	89.4	96.7	97.9	77.3
	CIMN	90.1	96.2	97.5	74.6
Market-SCT	Bas-T	38.5	54.2	61.7	14.3
	Bas-C	36.1	53.5	62.4	13.2
	CIMN	73.3	85.7	91.1	47.3
Market-CG	Bas-T	74.2	86.4	92.1	47.8
	Bas-C	73.9	87.8	93.4	47.9
	CIMN	74.1	87.9	93.4	48.2

TABLE III: Performance on Duke-STD, Duke-SCT and Duke-CG datasets. Rank-1 accuracy, Rank-5 accuracy, Rank-10 accuracy, and mAP (%) are reported.

Dataset	Method	Rank-1	Rank-5	Rank-10	mAP
Duke-STD	Bas-T	81.5	91.6	94.6	65.6
	Bas-C	83.2	91.3	93.7	68.2
	CIMN	81.7	91.3	92.5	65.9
Duke-SCT	Bas-T	20.7	34.1	39.7	11.6
	Bas-C	24.7	31.4	38.9	12.1
	CIMN	68.4	79.2	84.1	47.7
Duke-CG	Bas-T	68.5	83.1	87.5	48.2
	Bas-C	68.1	82.4	88.6	48.6
	CIMN	68.9	83.8	88.7	48.9

TABLE IV: Performance on global-local (GL) baseline. Rank-1 accuracy and mAP (%) are reported.

Method	Market-SCT		Duke-SCT
Method	Rank-1	mAP	Rank-1	mAP
Baseline(GL)[69]	75.7	51.5	70.1	53.1
Ours(GL)	85.6	69.1	81.0	64.5

IV-B2 Performance on DukeMTMC-reID

We then evaluate our method on DukeMTMC-reID. Following the experiments on Market-1501, we conduct experiments on standard Duke-STD, Duket-SCT and Duke-CG. From Tab. III, we can see our method is still effective. While conventional methods (Bas-C, Bas-T) show poor performance on Duket-SCT due to the lack of CCSP data, CIMN still has 68.4%, 79.2%, 84.1%, and 47.7% accuracies in Rank-1, Rank-5, Rank-10 and mAP. This outperform the two conventional methods in a large scale. For example, it surpasses Bas-C by 43.7% in Rank-1 accuracy, 47.8% in Rank-5 accuracy, 45.2% in Rank-10 accuracy and 35.6% in mAP. The improvements demonstrate the effectiveness of CIMN on SCT re-ID. Besides, unlike conventional methods with large performance gaps on Duket-SCT and Duke-CG, CIMN achieves comparable performance on them. This phenomenon shows the essence of CIMN that reduces the reliances of the model on CCSP data.

IV-B3 Stability

Although the amount of CCSP data is small, it is inevitable that some people appear in not only one camera in the real-world. To evaluate the robustness of our CIMN, we conduct experiments to show how the accuracy changes with respect to the number of CCSP data. Fig. 5 shows the Rank-1 tendencies on Market-1501 and DukeMTMC-reID. We can see that CIMN is quite robust and stable with the changes of CCSP data; thus, it is more suitable for practical application.

IV-B4 Extension to different structure

Our method guides the feature representation to be camera-invariant by enforcing the representation learned from a particular camera to perform well on other cameras, which is model-agnostic and feasible to be used on different structures. Here, we conduct experiments to evaluate CIMN on different models. Specifically, we implement CIMN on the global-local baseline used in recent SCT re-ID works ([69, 71]), which additionally adds a local branch on the ResNet-50 (global branch) backbone. The implementation details are as follows. 1) In the global branch, the locations to use the meta triplet loss $L_{mtri}$ , the meta classification loss $L_{mcl}$ and the meta camera alignment loss $L_{mca}$ are the same as our original one (see Fig. 4). 2) $L_{mtri}$ and $L_{mca}$ are also adopted in the output of the local-branch. 3) Other settings are the same as [69]. The results are shown in Tab. IV. It can be seen that our method is also effective in the global-local baseline, which improves the Rank-1/map of the baseline by 9.9%/17.6% and 10.9%/11.4%.

IV-C Comparison with the State-of-the-Art Methods

Here we compare our method with state-of-the-art methods in three SCT re-ID benchmarks: Market-SCT, Duke-SCT and MSMT-SCT. We first compare our performance with conventional supervised methods, including Center Loss [91], A-Softmax [92], ArcFace[93], PCB[94], Suh’s method [95], MGN [96], CBN [97], HOReID [98], ISP [21], MGN-ibn [96], Bagtrick [99] and AGW [100]. The results are reported in Tab V. We can see that our method significantly outperforms these methods, which outperform the second one 11.4%/6.7% (CBN [97] on Market-SCT), 5.7%/5.6% (CBN [97] on Duke-SCT) and 0.3%/1.2% (MGN-ibn [96] on MSMT-SCT) respectively. Meanwhile, CBN is a transductive module that requires collecting test data to update the model, while our approach can be directly applied.

We then compare our method with the SCT re-ID works MCNL[36], which is also implemented on ResNet-50 baseline. From Tab V, we can see that CIMN is superior to their MCNL, which surpasses MCNL by 7.1%/6.7% Rank-1/mAP on Market-SCT, 2.0%/2.4% on Duke-SCT and 1.5%/2.6% on MSMT-SCT.

Finally, we compare our method implemented on the global-local baseline, i.e. Ours(GL) in Tab V, with two SCT re-ID works which are also implemented on the global-local baseline: CCFP[69] and CCSFG [71]. It can see that CIMN is competitive with CCFP[69] and CCSFG [71] on Duke-SCT, and achieves the best results on Market-SCT and MSMT-SCT, which surpasses the second one (CCSFG) by 0.7%/1.5% and 0.6%/0.7% Rank-1/mAP on the two datasets, respectively.

TABLE V: Compared with state-of-the-art in SCT re-ID setting. Ours(GL) denotes implementing CIMN on the global-local baseline. Rank-1 accuracy and mAP (%) are reported. The best results are bolded.

Method	Market-SCT		Duke-SCT		MSMT-SCT
Method	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP
Center Loss [91]	40.3	18.5	38.7	23.2	-	-
A-Softmax [92]	41.9	23.2	34.8	22.9	-	-
ArcFace [93]	39.4	19.8	35.8	22.8	-	-
PCB [94]	43.5	23.5	32.7	22.2	-	-
Suh’s method [95]	48.0	27.3	38.5	25.4	-	-
MGN [96]	38.1	24.7	27.1	18.7	-	-
CBN[36]	61.9	40.6	62.7	42.1	-	-
HOReID[98]	48.1	29.6	40.2	29.8	-	-
ISP[21]	60.1	40.5	61.4	41.3	-	-
MGN-ibn [96]	45.6	26.6	46.7	32.6	27.8	11.7
Bagtrick [99]	54.0	34.0	54.2	40.2	20.4	9.8
AGW [100]	56.0	36.6	56.5	43.9	23.0	11.1
MCNL[36]	66.2	40.6	66.4	45.3	26.6	10.0
Ours	73.3	47.3	68.4	47.7	28.1	12.9
CCFP[69]	82.4	63.9	80.3	64.5	50.1	22.2
CCSFG [71]	84.9	67.7	81.1	63.7	54.6	24.6
Ours(GL)	85.6	69.1	81.0	64.5	55.2	25.0

TABLE VI: Performance of the proposed solutions with the changing CCSP data on Market-1501 dataset. Rank-1 accuracy, and mAP (%) are reported.

Percentage of	CCS		+ $L_{mtri}$		+ $L_{mtri}$ + $L_{mcl}$		+ $L_{mtri}$ + $L_{mcl}$ + $L_{mca}$
CCSP data	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP
100%(Market-STD)	85.5	71.1	89.8	72.4	90.1	74.4	90.7	74.6
80%	82.1	64.2	86.4	69.2	86.2	67.3	86.5	69.6
60%	78.6	58.4	82.7	63.1	82.8	62.9	82.9	63.5
40%	72.1	51.8	77.9	57.1	76.1	54.0	78.3	57.6
20%	70.8	48.1	73.8	51.6	74.1	50.2	74.2	51.4
0%(Market-SCT)	68.4	44.9	70.7	46.5	72.3	46.9	73.3	47.3
Market-CG	69.5	46.0	72.1	47.5	73.3	47.6	74.1	48.2

TABLE VII: Performance of the proposed solutions with the changing CCSP data on DukeMTMC-reID dataset. Rank-1 accuracy, and mAP (%) are reported.

Percentage of	CCS		+ $L_{mtri}$		+ $L_{mtri}$ + $L_{mcl}$		+ $L_{mtri}$ + $L_{mcl}$ + $L_{mca}$
CCSP data	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP
100%(Duke-STD)	77.7	61.7	80.5	63.9	81.1	64.8	81.7	65.9
80%	76.5	57.5	79.1	61.7	79.7	61.8	80.2	62.1
60%	70.8	54.9	75.1	59.7	75.4	59.5	75.5	60.5
40%	67.1	51.4	71.5	55.9	71.3	56.1	72.4	56.3
20%	64.6	46.8	68.2	51.3	68.9	51.4	69.1	51.9
0%(Duke-SCT)	62.4	41.0	65.2	45.2	67.7	46.9	68.4	47.7
Duke-CG	62.9	41.7	64.6	45.8	67.8	46.9	68.9	48.9

TABLE VIII: The performance of meta camera alignment loss at different positions. We evaluate the model on Market-SCT. Rank-1 accuracy, Rank-5 accuracy, Rank-10 accuracy, and mAP (%) are reported.

	Rank-1	Rank-5	Rank-10	mAP
Stage2	73.3	85.7	91.1	47.3
Stage3	73.1	85.2	91.0	46.9
Stage4	70.2	83.4	89.1	42.7

TABLE IX: The influence of the number of cameras

r

in the meta-train set and the meta-test set on Market-SCT. If

r\leq 3

, the cameras in the two sets is completely different. Otherwise, there will be

2*r-6

overlapped cameras in the two sets.

r	Rank-1	Rank-5	Rank-10	mAP
1	73.3	85.7	91.1	47.3
2	72.8	85.2	89.8	46.7
3	72.7	85.2	89.9	46.4
4	66.9	81.4	84.9	40.1
5	63.9	80.6	84.7	38.5
6	63.1	78.8	84.7	37.9

IV-D Ablation Experiments

We further conduct the ablation experiments to evaluate the impact of the components of the proposed method. We first implement Cross-Camera Simulation to the model. Then we add the meta triplet loss $L_{mtri}$ , the meta classification loss $L_{mcl}$ and the meta camera alignment loss $L_{mca}$ gradually. We train it on the standard dataset (i.e., 100% CCSP data) and then reduced the CCSP data until the CCSP data volume was zero. The result on Market-1501 and DukeMTMC-reID are reported on Tab. VI and Tab. VII, respectively.

IV-D1 Effectiveness of Cross-Camera Simulation

Cross-Camera Simulation (CCS) simulates a cross-camera process via meta-learning. As shown in Tab. VI and Tab. VII, with CCS, the model is more robust and stable with the changes of CCSP data. If we throw away all the CCSP data, i.e., on Market-SCT and Duke-SCT, it remains 68.4%/44.9% and 62.4%/41.0% Rank-1/mAP accuracy. This result is close to its performance on Market-CG (69.5%/46.0%) and Duke-CG (62.9%/41.7%), indicating the dependence of the model on CCSP data is significantly reduced.

However, since CCS separates different cameras, this strategy can not fully mine the relations between the data from these cameras. As a result, the performance of the model on Market-STD/Duke-STD and Market-CG/Duke-CG is worse than that of conventional approaches Bas-T and Bas-C (reported in Tab. II and Tab. III).

IV-D2 Effectiveness of Meta Triplet Loss

Meta triplet loss $L_{mtri}$ guides the model to leverage the potential negative relations between meta-train set and meta-test set. With $L_{mtri}$ , the performance of the model on Market-STD and Duke-STD improves 4.3% and 2.8% in Rank-1 accuracy, respectively. Similarly, its Rank-1 accuracy on Market-CG and Duke-CG are improved by 2.6% and 1.7%. These results demonstrate the effectiveness of $L_{mtri}$ . Besides, we can see that on Market-SCT and Duke-SCT, $L_{mtri}$ also improves CCS by 2.3% and 2.8% in Rank-1 accuracy.

IV-D3 Effectiveness of Meta Classification Loss

We then add meta classification loss $L_{mcl}$ to our model, with the same purposes as $L_{mtri}$ . From Tab. VI and Tab. VII, we can see that the $L_{mcl}$ is effective, which improves the Rank-1 accuracy by 0.3%/0.6% on Market-STD/Duke-STD and 1.2%/3.2% on Market-CG/Duke-CG. On SCT setting, it also brings 1.6% and 2.5% improvements on Rank-1 accuracy on Market-SCT and Duke-SCT.

IV-D4 Effectiveness of Meta Camera Alignment Loss

Finally, we conduct meta camera alignment Loss $L_{mca}$ to our model, which actively assigns the distribution of meta-train set and meta-test set, which encourage the model to leverage the potential positive relations between the two sets. We can see that $L_{mtri}$ improves the Rank-1/mAP accuracy by 0.6%/0.2% and 0.6%/1.1% on Market-STD and Duke-STD. It is also beneficial to the SCT re-ID setting, which improves the Rank-1/mAP of the model by 0.8%/0.6% and 1.1%/2.0% on Market-SCT and Duke-SCT, respectively.

IV-D5 Influence of the Number of Cameras in Meta-train and Meta-test Processes

In our Meta-Batch Preparation, we collect samples from one camera to the meta-train set and that from another camera to the meta-test set. Here we explore the result if there are multiple cameras in two sets. Specifically, in each meta-batch, we randomly select $r$ cameras for the meta-train set and the meta-test set, respectively. If $r\leq N/2$ , where $N$ is the number of the cameras in the dataset, the cameras in the meta-train set and the meta-test set are non-overlapped. Otherwise, there will be $2*r-N$ overlapped cameras in the two sets simultaneously. Other parts are the same as our original implementation. Note that, although there are samples from multiple cameras in the meta-train/meta-test set, there is no overlapped person among these cameras, i.e., we still guarantee that each person in the training set only appears in one camera to satisfy the SCT setting.

The results are reported in Tab. IX. We can see that if there are overlapped cameras in the two sets, the performance of the model will degrade significantly. If the camera composition is completely the same, i.e. $r=6$ , the mAP drops from $46.1\%$ to $36.9\%$ . We hypothesize the reason is that the meta-learning-based cross-camera simulation process encourages the model to learn the knowledge that is suitable for the meta-train set and meta-test set. In other words, this implicitly guides the model to learn to bridge the gap between the two sets. Therefore, the larger the gap between the camera composition of the meta-training set and the camera composition of the meta-test set, the more our method can promote the model to learn camera-invariant feature representations to bridge the camera gap. In addition, it can be seen that $r=1$ is better than other cases $r=2,3$ , but very slightly. We argue this is because when $r=1$ , our $L_{mca}$ can better capture the feature distribution of the data from each camera and align them.

IV-E Design Choices

Here we illustrate our design choices, including the selection of hyperparameters and the position to implement meta camera alignment loss. We evaluate our method on Market-SCT.

IV-E1 Important Hyperparameters

We examine the impact of trade-off coefficient $\lambda$ and the weight of meta triplet loss $\gamma_{1}$ , meta classification loss $\gamma_{2}$ and meta camera alignment loss $\gamma_{3}$ , respectively. For each hyperparameter, we fix other hyperparameters and then adjust this hyperparameter in a wide range. The results are shown in Fig. 6, it can be seen that the best performance is achieved when $\lambda=0.6$ .

From Fig. 6 and Fig. 6, we note that the choice of $\gamma_{1}$ and $\gamma_{2}$ has little effect on the model within a certain range (almost in [0.3, 2.0]). Therefore, we set $\gamma_{1}$ and $\gamma_{2}$ to 1.0 for convenience. The performance curve of $\gamma_{3}$ is shown in Fig. 6. It can be seen that the best performance is achieved when $\gamma_{3}=0.02$ .

IV-E2 Where to Apply Meta Camera Alignment Loss

Meta camera alignment loss $L_{mca}$ aims to align the feature distributions between different cameras. Here we discuss where $L_{mca}$ loss is applied. For simplicity, we only consider the following four simple cases: aligning the distribution of the feature after stage2, stage3, and stage4. We find that the impose the loss on the shallow layer is more effective than the deep layer. As shown in Tab VIII, the best performance is achieved on stage2. We believe that the reason is that: 1) $L_{mca}$ is not related to the specific identity of the data, so it does not need to be added to the deep layer of the network model where the output features represent the identity information of data. 2) By aligning the feature distributions of data from different cameras, $L_{mca}$ can help to filter out camera-related features. These camera-related features are more likely to be low-level semantics so that can be better handled in shallow layers.

TABLE X: Compared with state-of-the-art in single domain generalization re-ID setting. Label M, D denote Market-1501 dataset and DukeMTMC-reID dataset, respectively. M

\rightarrow

D means training the model on Market-1501 and testing it on DukeMTMC-reID. Rank-1 accuracy and mAP (%) are reported. The best results are bolded.

Method	M $\rightarrow$ D		D $\rightarrow$ M
Method	Rank1	mAP	Rank1	mAP
StrongBaseline [101]	41.4	25.7	54.3	25.5
OSNet-IBN [102]	48.5	26.7	57.7	26.1
OSNet-AIN [102]	52.4	30.5	61.0	30.6
PAP [103]	46.4	27.9	59.5	30.6
SNR [79]	55.1	33.6	66.7	33.9
DIR-ReID [104]	54.5	33.0	68.2	35.2
CIMN	58.3	36.7	65.9	32.9

IV-F Generalization Ability

In addition to our motivation to address SCT re-ID, CIMN is robust to camera changes since the learned representations are camera-invariant. As a result, the model is more generalizable to new re-ID datasets captured by new cameras. To verify this point, we evaluate our CIMN on single domain generalization re-ID setting, where the model is trained on a source dataset and tested on unseen target datasets.

Tab. X shows the comparison between our method and the state-of-the-arts methods in this field, including StrongBaseline [101], OSNet-IBN [102], OSNet-AIN [102], PAP [103], SNR [79] and DIR-ReID[104]. As illustrated, our method is comparable with the state-of-the-art methods. Especially, we achieve the best performance on M $\rightarrow$ D, which outperform the second one [79] by 3.2% and 0.6% in Rank-1 accuracy and mAP, respectively.

We also compare our method in Mutli-source generalization setting [78, 80, 79]. In this setting, the model trained on five source datasets: CUHK02 [105], CUHK03 [106], Market-1501, DukeMTMC-reID, and CUHK-SYSU [107] datasets, with a total of 121,765 images of 18,530 identities. And then the model is directly evaluated on four testing datasets: VIPeR [108], PRID [109], GRID [110], and i-LIDS [111]. However, there are no available camera annotations in the CUHK-SYSU dataset. Therefore, we only used 6596 identities from the four remaining datasets, i.e., CUHK02, CUHK03, Market-1501, and DukeMTMC-reID, as the training set. Our method achieves 50.4%/52.6%/48.0%/76.1% mAP on the VIPeR/PRID/GRID/i-iLDS, which outperforms Bas-T by 6.0%/16.9%/12.2%/9.6% and Bas-C 6.2%/15.1%/12.4%/4.9%, respectively. The above results prove the effectiveness of our method in improving the model’s generalization ability.

V Conclusion

In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for Single-camera-training person re-identification. CIMN performs a cross-camera simulation by split the training data into meta-train set and meta-test set based on camera IDs, enforcing the representations learned from one set to be robust on the other set. Then three losses, meta triplet loss, meta classification loss and meta alignment loss, are introduced to leverage the relations ignored by the cross-camera simulation. With the interaction from the meta-train set and the meta-test set, the model is guided to learn camera-invariant and identity-discriminative representations. We demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on two SCT re-ID benchmarks. Finally, we show that our method is also effective in improving the model’s generalization ability.

Acknowledgment

This work is sponsored by the National Key Research and Development Program under Grant (2018YFB0505200) and National Natural Science Funding (No.62002026).

References

[1] Y. Bai, J. Jiao, W. Ce, J. Liu, Y. Lou, X. Feng, and L.-Y. Duan, “Person30k: A dual-meta generalization network for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2123–2132.
[2] A. Li, L. Liu, K. Wang, S. Liu, and S. Yan, “Clothing attributes assisted person reidentification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 869–878, 2014.
[3] A. Zheng, X. Zhang, B. Jiang, B. Luo, and C. Li, “A subspace learning approach to multishot person reidentification,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 1, pp. 149–158, 2018.
[4] Y. Wang, R. Hu, C. Liang, C. Zhang, and Q. Leng, “Camera compensation using a feature projection matrix for person reidentification,” IEEE transactions on circuits and systems for video technology, vol. 24, no. 8, pp. 1350–1361, 2014.
[5] X. Ning, K. Gong, W. Li, L. Zhang, X. Bai, and S. Tian, “Feature refinement and filter network for person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3391–3402, 2020.
[6] J. Si, H. Zhang, C.-G. Li, and J. Guo, “Spatial pyramid-based statistical features for person re-identification: A comprehensive evaluation,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 7, pp. 1140–1154, 2017.
[7] P. Li, P. Pan, P. Liu, M. Xu, and Y. Yang, “Hierarchical temporal modeling with mutual distance matching for video based person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 2, pp. 503–511, 2020.
[8] S. Bai, B. Ma, H. Chang, R. Huang, S. Shan, and X. Chen, “Sanet: Statistic attention network for video-based person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 6, pp. 3866–3879, 2021.
[9] H. Li, S. Yan, Z. Yu, and D. Tao, “Attribute-identity embedding and self-supervised learning for scalable person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3472–3485, 2019.
[10] Z. Hu, Y. Sun, Y. Yang, and J. Zhou, “Divide-and-regroup clustering for domain adaptive person re-identification,” 2022.
[11] H. Nodehi and A. Shahbahrami, “Multi-metric re-identification for online multi-person tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 147–159, 2021.
[12] S. Li, H. Chen, S. Yu, Z. He, F. Zhu, R. Zhao, J. Chen, and Y. Qiao, “Cocas+: large-scale clothes-changing person re-identification with clothes templates,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 4, pp. 1839–1853, 2022.
[13] H. Tan, X. Liu, Y. Bian, H. Wang, and B. Yin, “Incomplete descriptor mining with elastic loss for person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 160–171, 2021.
[14] N. McLaughlin, J. M. del Rincon, and P. C. Miller, “Person reidentification using deep convnets with multitask learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 525–539, 2016.
[15] T. Zhang, L. Xie, L. Wei, Z. Zhuang, Y. Zhang, B. Li, and Q. Tian, “Unrealperson: An adaptive pipeline towards costless person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 506–11 515.
[16] Y. Zhao, Z. Zhong, F. Yang, Z. Luo, Y. Lin, S. Li, and N. Sebe, “Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6277–6286.
[17] Z. Bai, Z. Wang, J. Wang, D. Hu, and E. Ding, “Unsupervised multi-source domain adaptation for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 914–12 923.
[18] H. Li, G. Wu, and W.-S. Zheng, “Combined depth space based architecture search for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6729–6738.
[19] J. Chen, X. Jiang, F. Wang, J. Zhang, F. Zheng, X. Sun, and W.-S. Zheng, “Learning 3d shape feature for texture-insensitive person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8146–8155.
[20] F. Yang, Z. Zhong, Z. Luo, Y. Cai, Y. Lin, S. Li, and N. Sebe, “Joint noise-tolerant learning and meta camera shift adaptation for unsupervised person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4855–4864.
[21] K. Zhu, H. Guo, Z. Liu, M. Tang, and J. Wang, “Identity-guided human semantic parsing for person re-identification,” arXiv preprint arXiv:2007.13467, 2020.
[22] Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen, “Relation-aware global attention for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3186–3195.
[23] X. Chen, C. Fu, Y. Zhao, F. Zheng, J. Song, R. Ji, and Y. Yang, “Salience-guided cascaded suppression network for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3300–3310.
[24] L. An, M. Kafai, S. Yang, and B. Bhanu, “Person reidentification with reference descriptor,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 776–787, 2015.
[25] S. Tan, F. Zheng, L. Liu, J. Han, and L. Shao, “Dense invariant feature-based support vector ranking for cross-camera person reidentification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 2, pp. 356–363, 2016.
[26] S.-M. Li, C. Gao, J.-G. Zhu, and C.-W. Li, “Person reidentification using attribute-restricted projection metric learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1765–1776, 2016.
[27] T. He, X. Shen, J. Huang, Z. Chen, and X.-S. Hua, “Partial person re-identification with part-part correspondence learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9105–9115.
[28] A. Zhang, Y. Gao, Y. Niu, W. Liu, and Y. Zhou, “Coarse-to-fine person re-identification with auxiliary-domain classification and second-order information bottleneck,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 598–607.
[29] G. Wang, G. Wang, X. Zhang, J. Lai, and L. Lin, “Weakly supervised person re-id: Differentiable graphical learning and a new benchmark,” arXiv preprint arXiv:1904.03845, 2019.
[30] D. Fu, D. Chen, J. Bao, H. Yang, L. Yuan, L. Zhang, H. Li, and D. Chen, “Unsupervised pre-training for person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 750–14 759.
[31] Y. Lin, Y. Wu, C. Yan, M. Xu, and Y. Yang, “Unsupervised person re-identification via cross-camera similarity exploration,” IEEE Transactions on Image Processing, vol. 29, pp. 5481–5490, 2020.
[32] M. Wang, B. Lai, J. Huang, X. Gong, and X.-S. Hua, “Camera-aware proxies for unsupervised person re-identification,” arXiv preprint arXiv:2012.10674, 2020.
[33] X. Xin, J. Wang, R. Xie, S. Zhou, W. Huang, and N. Zheng, “Semi-supervised person re-identification using multi-view clustering,” Pattern Recognition, vol. 88, pp. 285–297, 2019.
[34] H. Chen, Y. Wang, B. Lagadec, A. Dantcheva, and F. Bremond, “Joint generative and contrastive learning for unsupervised person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2004–2013.
[35] X. Zhu, X. Zhu, M. Li, V. Murino, and S. Gong, “Intra-camera supervised person re-identification: A new benchmark,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[36] T. Zhang, L. Xie, L. Wei, Y. Zhang, B. Li, and Q. Tian, “Single camera training for person re-identification.” in AAAI, 2020, pp. 12 878–12 885.
[37] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele, “Motion segmentation & multiple object tracking by correlation co-clustering,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 1, pp. 140–153, 2018.
[38] W. Luo, B. Stenger, X. Zhao, and T.-K. Kim, “Trajectories as topics: Multi-object tracking by topic discovery,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 240–252, 2018.
[39] S. Choi, T. Kim, M. Jeong, H. Park, and C. Kim, “Meta batch-instance normalization for generalizable person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3425–3435.
[40] S. Liu, W. Huang, and Z. Zhang, “Learning hybrid relationships for person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2172–2179.
[41] H. Wang, J. Shen, Y. Liu, Y. Gao, and E. Gavves, “Nformer: Robust person re-identification with neighbor transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7297–7307.
[42] N. McLaughlin, J. M. del Rincon, and P. Miller, “Video person re-identification for wide area tracking based on recurrent neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2613–2626, 2017.
[43] M. Ren, L. He, X. Liao, W. Liu, Y. Wang, and T. Tan, “Learning instance-level spatial-temporal patterns for person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 930–14 939.
[44] H. Gu, J. Li, G. Fu, C. Wong, X. Chen, and J. Zhu, “Autoloss-gms: Searching generalized margin-based softmax loss function for person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4744–4753.
[45] Z. Wang, L. He, X. Tu, J. Zhao, X. Gao, S. Shen, and J. Feng, “Robust video-based person re-identification by hierarchical mining,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8179–8191, 2021.
[46] X. Liu, S. Bi, S. Fang, and A. Bouridane, “Bayesian inferred self-attentive aggregation for multi-shot person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3446–3458, 2019.
[47] T. Chai, Z. Chen, A. Li, J. Chen, X. Mei, and Y. Wang, “Video person re-identification using attribute-enhanced features,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7951–7966, 2022.
[48] A. Subramanyam, “Meta generative attack on person reidentification,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[49] H. Jin, S. Lai, and X. Qian, “Occlusion-sensitive person re-identification via attribute-based shift attention,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2170–2185, 2021.
[50] X. Shu, X. Wang, X. Zang, S. Zhang, Y. Chen, G. Li, and Q. Tian, “Large-scale spatio-temporal person re-identification: Algorithms and benchmark,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4390–4403, 2021.
[51] Z. Zhang, H. Zhang, and S. Liu, “Person re-identification using heterogeneous local graph attention networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 136–12 145.
[52] L. Wu, Y. Wang, L. Shao, and M. Wang, “3-d personvlad: Learning deep global representations for video-based person reidentification,” IEEE transactions on neural networks and learning systems, vol. 30, no. 11, pp. 3347–3359, 2019.
[53] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang, “Abd-net: Attentive but diverse person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8351–8361.
[54] W. Rao, M. Xu, and J. Zhou, “Improved metric learning algorithm for person re-identification based on asymmetric metric,” in 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA). IEEE, 2020, pp. 212–216.
[55] K. Musgrave, S. Belongie, and S.-N. Lim, “A metric learning reality check,” in European Conference on Computer Vision. Springer, 2020, pp. 681–699.
[56] Y. Hao, N. Wang, J. Li, and X. Gao, “Hsme: Hypersphere manifold embedding for visible thermal person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8385–8392.
[57] C. Zhao, X. Lv, Z. Zhang, W. Zuo, J. Wu, and D. Miao, “Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification,” IEEE Transactions on Multimedia, 2020.
[58] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6398–6407.
[59] J. Han, Y.-L. Li, and S. Wang, “Delving into probabilistic uncertainty for unsupervised domain adaptive person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 790–798.
[60] S. Xuan and S. Zhang, “Intra-inter camera similarity for unsupervised person re-identification,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 926–11 935.
[61] M. Wang, B. Lai, J. Huang, X. Gong, and X.-S. Hua, “Camera-aware proxies for unsupervised person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2764–2772.
[62] L. Qi, L. Wang, J. Huo, Y. Shi, X. Geng, and Y. Gao, “Adversarial camera alignment network for unsupervised cross-camera person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, pp. 2921–2936, 2021.
[63] Y. Dai, J. Liu, Y. Bai, Z. Tong, and L.-Y. Duan, “Dual-refinement: Joint label and feature refinement for unsupervised domain adaptive person re-identification,” IEEE Transactions on Image Processing, vol. 30, pp. 7815–7829, 2021.
[64] J. Li and S. Zhang, “Joint visual and temporal consistency for unsupervised domain adaptive person re-identification,” in European Conference on Computer Vision. Springer, 2020, pp. 483–499.
[65] B. Wang, M. Asim, G. Ma, and M. Zhu, “Central feature learning for unsupervised person re-identification,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 35, no. 08, p. 2151007, 2021.
[66] Y. Wang, X. Liang, and S. Liao, “Cloning outfits from real-world images to 3d characters for generalizable person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4900–4909.
[67] S. Liao and L. Shao, “Graph sampling based deep metric learning for generalizable person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7359–7368.
[68] Y. Cho, W. J. Kim, S. Hong, and S.-E. Yoon, “Part-based pseudo label refinement for unsupervised person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7308–7318.
[69] W. Ge, C. Pan, A. Wu, H. Zheng, and W.-S. Zheng, “Cross-camera feature prediction for intra-camera supervised person re-identification across distant scenes,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 3644–3653.
[70] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[71] C. Wu, W. Ge, A. Wu, and X. Chang, “Camera-conditioned stable feature generation for isolated camera supervised person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 238–20 248.
[72] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 1126–1135.
[73] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in International conference on machine learning. PMLR, 2016, pp. 1842–1850.
[74] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1199–1208.
[75] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Learning to generalize: Meta-learning for domain generalization,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[76] M. Choi, J. Choi, S. Baik, T. H. Kim, and K. M. Lee, “Scene-adaptive video frame interpolation via meta-learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9444–9453.
[77] H. Ni, J. Song, X. Luo, F. Zheng, W. Li, and H. T. Shen, “Meta distribution alignment for generalizable person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2487–2496.
[78] J. Song, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales, “Generalizable person re-identification by domain-invariant mapping network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 719–728.
[79] X. Jin, C. Lan, W. Zeng, Z. Chen, and L. Zhang, “Style normalization and restitution for generalizable person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3143–3152.
[80] P. Chen, P. Dai, J. Liu, F. Zheng, Q. Tian, and R. Ji, “Dual distribution alignment network for generalizable person re-identification,” arXiv preprint arXiv:2007.13249, 2020.
[81] J. Liu, Z. Huang, L. Li, K. Zheng, and Z.-J. Zha, “Debiased batch normalization via gaussian process for generalizable person re-identification,” arXiv preprint arXiv:2203.01723, 2022.
[82] S. Liao and L. Shao, “Transmatcher: Deep image matching through transformers for generalizable person re-identification,” Advances in Neural Information Processing Systems, vol. 34, pp. 1992–2003, 2021.
[83] K. Han, C. Si, Y. Huang, L. Wang, and T. Tan, “Generalizable person re-identification via self-supervised batch norm test-time adaption,” arXiv preprint arXiv:2203.00672, 2022.
[84] Y. Dai, X. Li, J. Liu, Z. Tong, and L.-Y. Duan, “Generalizable person re-identification with relevance-aware mixture of experts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 145–16 154.
[85] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
[86] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
[87] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015, pp. 1116–1124.
[88] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in European Conference on Computer Vision. Springer, 2016, pp. 17–35.
[89] J. Wang, X. Zhu, S. Gong, and W. Li, “Transferable joint attribute-identity deep learning for unsupervised person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
[90] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[91] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515.
[92] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
[93] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[94] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in ECCV, 2018, pp. 480–496.
[95] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee, “Part-aligned bilinear representations for person re-identification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 402–419.
[96] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in ACM MM, 2018, pp. 274–282.
[97] Z. Zhuang, L. Wei, L. Xie, T. Zhang, H. Zhang, H. Wu, H. Ai, and Q. Tian, “Rethinking the distribution gap of person re-identification with camera-based batch normalization,” in European Conference on Computer Vision. Springer, 2020, pp. 140–157.
[98] G. Wang, S. Yang, H. Liu, Z. Wang, Y. Yang, S. Wang, G. Yu, E. Zhou, and J. Sun, “High-order information matters: Learning relation and topology for occluded person re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6449–6458.
[99] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv preprint arXiv:1607.01759, 2016.
[100] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 6, pp. 2872–2893, 2021.
[101] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu, “A strong baseline and batch normalization neck for deep person re-identification,” IEEE Transactions on Multimedia, vol. 22, no. 10, pp. 2597–2609, 2019.
[102] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Learning generalisable omni-scale representations for person re-identification,” arXiv preprint arXiv:1910.06827, 2019.
[103] H. Huang, W. Yang, X. Chen, X. Zhao, K. Huang, J. Lin, G. Huang, and D. D. Eanet, “Enhancing alignment for cross-domain person reidentification,” arXiv preprint arXiv:1812.11369, vol. 3, 2018.
[104] Y.-F. Zhang, H. Zhang, Z. Zhang, D. Li, Z. Jia, L. Wang, and T. Tan, “Learning domain invariant representations for generalizable person re-identification,” arXiv preprint arXiv:2103.15890, 2021.
[105] W. Li and X. Wang, “Locally aligned feature transforms across views,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3594–3601.
[106] W. Li, R. Zhao, T. Xiao, and X. Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in CVPR, 2014, pp. 152–159.
[107] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “End-to-end deep learning for person search,” arXiv preprint arXiv:1604.01850, vol. 2, no. 2, 2016.
[108] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in European conference on computer vision. Springer, 2008, pp. 262–275.
[109] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and discriminative classification,” in Scandinavian conference on Image analysis. Springer, 2011, pp. 91–102.
[110] C. C. Loy, T. Xiang, and S. Gong, “Time-delayed correlation analysis for multi-camera activity understanding,” International Journal of Computer Vision, vol. 90, no. 1, pp. 106–129, 2010.
[111] W. S. Zheng, S. Gong, and T. Xiang, “Associating groups of people,” in British Machine Vision Conference, 2009.