
Adaptive Deep Metric Embeddings for Person Re-Identification under Occlusions

Wanxiang Yang, Yan Yan, Si Chen
Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Xiamen 361005, Fujian, China
School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, Fujian, China
Abstract

Person re-identification (ReID) under occlusions is a challenging problem in video surveillance. Most existing person ReID methods take advantage of local features to deal with occlusions. However, these methods usually extract features from the local regions of an image independently, without considering the relationships among different local regions. In this paper, we propose a novel person ReID method that learns the spatial dependencies between local regions and extracts a discriminative feature representation of the pedestrian image based on Long Short-Term Memory (LSTM), thereby dealing with the problem of occlusions. In particular, we propose a novel loss (termed the adaptive nearest neighbor loss) based on the classification uncertainty to effectively reduce intra-class variations while enlarging inter-class differences within the adaptive neighborhood of each sample. The proposed loss enables the deep neural network to adaptively learn discriminative metric embeddings, which significantly improve the generalization capability of recognizing unseen person identities. Extensive comparative evaluations on challenging person ReID datasets demonstrate the significantly improved performance of the proposed method compared with several state-of-the-art methods.

keywords:
Person re-identification, occlusion, long short-term memory, adaptive nearest neighbor loss

1 Introduction

Matching pedestrians across different camera views, known as person re-identification (ReID), is a challenging task in computer vision [1, 2, 3, 4, 5, 6]. One key challenge of person ReID is the significant appearance variation caused by occlusions in pedestrian images.

There are two major components in conventional person ReID methods: 1) an effective feature descriptor (such as SCNCD [7], gBiCov [8], and LOMO [9]) to characterize the pedestrian image, and 2) a suitable metric (such as LADF [10], KISSME [11], and XQDA [9]) to compare the similarity between pedestrian images. With the popularity of deep learning, several deep learning based person ReID methods [12, 13, 14, 15, 16] that effectively incorporate these two components into an integrated framework have been developed. Generally, the deep learning based methods automatically learn discriminative image representations from large-scale image data, and these representations have been shown to be highly robust to pedestrian appearance variations.

Partially occluded pedestrians are ubiquitous in person ReID. To deal with the problem of occlusions, several methods [17, 18] have been proposed that take advantage of part-based network architectures to learn representations from different local regions of pedestrian images. For example, in [17], the authors first split the pedestrian image into three overlapping local regions, and then apply a three-channel Convolutional Neural Network (CNN) architecture to learn discriminative local features from these regions. However, this method may suffer from the problem of spatial misalignment (recall that the local features are learned separately). Recently, Zhong et al. [19] propose to perform data augmentation with random erasing, which addresses the problem of occlusions to some extent. However, this method does not exploit the spatial structure of the pedestrian image, which can be beneficial for person ReID.

Recently, the Recurrent Neural Network (RNN) has shown great power in handling sequential data due to its capability of storing the representations of recent inputs. As an effective variant of the RNN, Long Short-Term Memory (LSTM) [20] can properly capture temporal/spatial dependencies. In this paper, inspired by the success of LSTM, we propose to model the spatial dependencies among different local regions of pedestrian images based on LSTM to handle the problem of occlusions. By making use of the internal gating mechanism of the LSTM cells, the proposed method effectively extracts an intrinsic feature representation by memorizing the spatial correlations and ignoring confusing distractors (i.e., occluded local regions), thus leading to performance improvements for person ReID under occlusions.

The loss function, which aims to learn separable and discriminative deep features, plays an important role in deep learning for the task of person ReID. A commonly used loss function, termed the triplet loss [21], can significantly improve the capability of distinguishing different classes. However, generating high-quality triplets (i.e., hard triplet mining) to ensure training efficiency is not a trivial task, since many triplets are uninformative. Moreover, the triplet loss often suffers from slow convergence and poor local optima, partially because a single triplet only considers the pairwise distance between the anchor and the positive (negative) sample.

In this paper, to overcome the above problems, we propose an Adaptive Nearest Neighbor (ANN) loss based on the classification uncertainty, which maintains a large margin between the inter-class distance and the intra-class distance within the adaptive neighborhood of each sample. In fact, the ANN loss can be viewed as a generalization of the triplet loss. Compared with the triplet loss, the proposed loss effectively exploits the neighborhood information (the distance between the anchor and the neighborhood of each positive/negative sample) of the training data. Therefore, the selected triplets are informative, which makes the training process converge quickly.

The main contributions are summarized as follows. 1) We propose to exploit the spatial dependencies between the local regions of pedestrian images based on LSTM, which significantly improves the performance of person ReID under occlusions. LSTM effectively memorizes the spatial correlations and automatically encodes the spatial information so as to reduce the noise caused by occlusions. 2) We develop an adaptive nearest neighbor (ANN) loss, which takes advantage of the neighborhood information to learn adaptive deep metric embeddings based on the classification uncertainty. Experimental results show the superiority of the proposed method over state-of-the-art methods on several challenging person ReID datasets.

2 The proposed method

In this section, we first introduce the overall framework of the proposed method in Section 2.1. Then, the spatial encoded local features, which exploit the spatial dependencies based on LSTM, are described in Section 2.2. Finally, the proposed ANN loss is formulated in Section 2.3.


Figure 1: The overall framework of the proposed method.

2.1 Overall framework

The overall framework of the proposed method is illustrated in Fig. 1. For each input pedestrian image $I$, we first use a base convolutional neural network (in this paper, ResNet [22] is used due to its superiority) to extract mid-level convolutional feature maps. For notational simplicity, we refer to the output of the last convolutional layer of ResNet as $g(I)$ for each input image. Specifically, the last convolutional layer is denoted as $g\in\mathbb{R}^{C\times H\times W}$ ($C=512$, $H=8$, $W=4$ in this paper), where $H$ and $W$ denote the spatial size (i.e., height and width) of the last convolutional layer and $C$ is the number of feature channels. Secondly, the global features and the spatial encoded local features (SELF) are extracted in the global and local branches, respectively. In the global branch, we extract the global features by applying global average pooling (GAP) to the mid-level feature maps to capture high-level semantics. Meanwhile, in the local branch, the local features are extracted by applying average pooling (AP) to each row of the mid-level feature maps, and then using a $1\times 1$ convolutional layer to reduce the number of feature channels from $C$ to $c$ ($c=128$ in this paper). These local features are fed into the LSTM layer to learn the spatial dependencies between different local regions of the input pedestrian image (see Section 2.2). The output of the LSTM layer is SELF, which is represented as $L\in\mathbb{R}^{c}$. Thirdly, the global features and SELF are concatenated to represent the pedestrian image, which contains complementary information from different levels of semantics, followed by a fully-connected (FC) layer to obtain a compact representation. Finally, the deep neural network is jointly optimized by the softmax loss and the proposed adaptive nearest neighbor loss (see Section 2.3).
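For concreteness, the following PyTorch sketch assembles the two branches described above. It is a minimal illustration, not the authors' implementation: the class name, the ResNet variant (any backbone with 512-channel output maps), the 256-dimensional embedding size, and the num_ids argument are assumptions.

    # A minimal sketch of the two-branch framework in Fig. 1 (illustrative only).
    import torch
    import torch.nn as nn
    import torchvision

    class TwoBranchReID(nn.Module):
        def __init__(self, num_ids, C=512, c=128, H=8):
            super().__init__()
            resnet = torchvision.models.resnet34(weights=None)             # backbone with C=512 output maps (assumption)
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # mid-level feature maps g(I)
            self.gap = nn.AdaptiveAvgPool2d(1)                             # global branch: GAP
            self.row_pool = nn.AdaptiveAvgPool2d((H, 1))                   # local branch: AP over each row
            self.reduce = nn.Conv2d(C, c, kernel_size=1)                   # 1x1 conv: C -> c channels
            self.lstm = nn.LSTM(input_size=c, hidden_size=c, batch_first=True)  # e = c hidden units
            self.fc = nn.Linear(C + c, 256)                                # compact concatenated representation
            self.classifier = nn.Linear(256, num_ids)                      # identity logits for the softmax loss

        def forward(self, x):
            g = self.backbone(x)                           # (B, C, H, W)
            global_feat = self.gap(g).flatten(1)           # (B, C) global features
            rows = self.reduce(self.row_pool(g))           # (B, c, H, 1) row-wise local features
            seq = rows.squeeze(-1).permute(0, 2, 1)        # (B, H, c): spatial sequence S_1, ..., S_H
            _, (h_n, _) = self.lstm(seq)                   # final hidden state h_H is SELF
            self_feat = h_n[-1]                            # (B, c)
            emb = self.fc(torch.cat([global_feat, self_feat], dim=1))
            return emb, self.classifier(emb)               # metric embedding and identity logits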

2.2 Spatial encoded local features (SELF)

In this section, SELF is extracted to effectively capture the dependencies of spatial structure in the pedestrian image. Specifically, the pedestrian image is decomposed into a sequence of local regions from head to foot, where each local region has a relatively fixed position due to prior knowledge about the human body structure. Based on these local regions, different local features are extracted accordingly. In this manner, all local features are treated as a spatial sequence. Following that, the sequence data consisting of all the local features are represented as $S_{t}\in\mathbb{R}^{c}$ ($t=1,\cdots,H$), where $S_{t}$ is the local feature of each row and $H$ denotes the sequence length (i.e., the number of local regions). The LSTM layer sequentially accepts the input local features, and the hidden state $h_{t}\in\mathbb{R}^{e}$ at each step $t$ is obtained using the following equations ($e$ is the number of hidden units of the LSTM layer; in this paper, we empirically set $e$ to be equal to $c$).

\begin{pmatrix} i_{t} \\ f_{t} \\ o_{t} \\ g_{t} \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W_{L} \begin{pmatrix} S_{t} \\ h_{t-1} \end{pmatrix},   (1)
d_{t} = f_{t} \odot d_{t-1} + i_{t} \odot g_{t},   (2)
h_{t} = o_{t} \odot \tanh(d_{t}),   (3)

where $i_{t}\in\mathbb{R}^{e}$, $f_{t}\in\mathbb{R}^{e}$, $o_{t}\in\mathbb{R}^{e}$, $g_{t}\in\mathbb{R}^{e}$ and $d_{t}\in\mathbb{R}^{e}$ are the input gate, forget gate, output gate, cell state candidate and cell state, respectively. $\mathrm{sigm}$ and $\tanh$ denote the non-linear activation functions (i.e., the sigmoid function and the tanh function), which are applied element-wise. $W_{L}\in\mathbb{R}^{4e\times(c+e)}$ denotes the weight matrix of the LSTM layer. $\odot$ denotes element-wise multiplication.

From Eq. (1), $i_{t}$, $f_{t}$ and $o_{t}$ decide which information will be updated, thrown away and output, respectively, according to the previous hidden state $h_{t-1}$ and the current input $S_{t}$. A $\tanh$ activation layer creates the cell state candidate $g_{t}$. In Eq. (2), the LSTM layer updates the old cell state $d_{t-1}$ by first multiplying it by $f_{t}$ (throwing away old information) and then adding the cell state candidate scaled by $i_{t}$ (updating the information), to obtain the cell state $d_{t}$. In Eq. (3), the LSTM layer passes the cell state $d_{t}$ through $\tanh$ (constraining the output values between $-1$ and $1$) and multiplies it by the output gate $o_{t}$ (outputting the critical information and reducing the noise), to obtain the hidden state $h_{t}$.
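To make the gating mechanism concrete, a single LSTM step following Eqs. (1)-(3) can be written as below. This is a minimal NumPy sketch; the helper name and the bias term (omitted in Eq. (1)) are assumptions.

    # One LSTM step according to Eqs. (1)-(3); the bias b is an assumption.
    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(S_t, h_prev, d_prev, W_L, b=0.0):
        e = h_prev.shape[0]
        z = W_L @ np.concatenate([S_t, h_prev]) + b   # (4e,) pre-activations, Eq. (1)
        i_t = sigm(z[:e])                             # input gate
        f_t = sigm(z[e:2 * e])                        # forget gate
        o_t = sigm(z[2 * e:3 * e])                    # output gate
        g_t = np.tanh(z[3 * e:])                      # cell state candidate
        d_t = f_t * d_prev + i_t * g_t                # Eq. (2): update the cell state
        h_t = o_t * np.tanh(d_t)                      # Eq. (3): hidden state
        return h_t, d_t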

The final hidden state of the LSTM layer is SELF (i.e., $L=h_{H}$, where $L\in\mathbb{R}^{c}$), which effectively captures the spatial relationship between different local regions. Since partial occlusions only affect some local regions, we exploit the intrinsic relationship between different local regions (recall that the occluded regions are considered as noise that can be filtered out by LSTM) to alleviate the problem of occlusions. Therefore, the SELF extracted by the LSTM layer is highly robust to occlusions.

2.3 Adaptive nearest neighbor loss

As discussed previously, hard triplet mining is critical for the triplet loss (note that most triplets are uninformative). Although some methods [23, 24] have been developed for hard triplet mining, the selected triplets only exploit pairwise distance information, which may result in poor local optima.

Motivated by the above issues, we propose a novel loss function, termed the adaptive nearest neighbor (ANN) loss, which effectively takes advantage of the neighborhood information to enlarge the inter-class dispersion while preserving the intra-class compactness. To be specific, the ANN loss is defined as follows,

L_{ANN} = \sum_{a=1}^{B}\left[m + D_{ap} - D_{an}\right]_{+},   (4)
D_{ap} = \frac{1}{K_{a}}\sum_{k=1}^{K_{a}}\left\|f(I_{a}) - f(I_{pk})\right\|_{2}^{2},   (5)
D_{an} = \frac{1}{K_{a}}\sum_{k=1}^{K_{a}}\left\|f(I_{a}) - f(I_{nk})\right\|_{2}^{2},   (6)

where $[\cdot]_{+}$ denotes the hinge loss. $B$ is the number of training samples. $f(\cdot)$ is the function that maps the raw image to the metric embedding representation. $I_{a}$, $I_{pk}$ and $I_{nk}$ represent the anchor sample, positive sample, and negative sample, respectively. $D_{ap}$ and $D_{an}$ respectively denote the average distance between the anchor sample $I_{a}$ and the $K_{a}$ hardest positive samples (i.e., the $K_{a}$ farthest positive samples), and the average distance between $I_{a}$ and the $K_{a}$ hardest negative samples (i.e., the $K_{a}$ closest negative samples). $m$ is a margin that keeps the separation between positive and negative pairs. $\left\|\cdot\right\|_{2}$ denotes the Euclidean distance. Note that $K_{a}$ denotes the number of positive/negative samples in the neighborhood of the anchor sample $I_{a}$.

In this paper, instead of fixing $K_{a}$ to a constant value, we adaptively set $K_{a}$ based on the classification uncertainty. That is, $K_{a}$ is formulated as follows:

K_{a} = \max(\left\lfloor H_{a}\right\rfloor, K_{0}),   (7)

where $H_{a}=-\sum_{j=1}^{N}p_{a}^{j}\log(p_{a}^{j})$ denotes the classification uncertainty of the anchor sample $I_{a}$. Here, $p_{a}^{j}$ is the probability that the sample $I_{a}$ belongs to the $j$-th class according to a softmax layer, and $N$ is the number of classes. $\left\lfloor\cdot\right\rfloor$ denotes the floor operation. $K_{0}$ is a constant denoting the minimum number of nearest neighbors (we set $K_{0}$ to 1 in this paper).

The classification uncertainty $H_{a}$ measures the confidence of classification based on the softmax classifier, which intrinsically characterizes the global data distribution. When the value of $H_{a}$ is high, the anchor sample is considered a hard-classified sample (the probability of assigning the sample to each class using the softmax classifier is around $1/N$). In this case, the number of neighbors $K_{a}$ should be increased, and vice versa. Therefore, ANN effectively integrates both global and local information of the training data into metric embedding learning, which can successfully overcome the problems of slow convergence and poor local optima in the triplet loss. Besides, different from the triplet loss (which only exploits the pairwise distance), the ANN loss considers the average distance between the anchor and the neighborhood of positive/negative samples, which makes the training process converge quickly.
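A hedged PyTorch sketch of the ANN loss with the adaptive neighborhood of Eq. (7) is given below. The function name, the margin value, and the per-anchor loop are illustrative choices under the stated assumptions, not the authors' code.

    # A sketch of the ANN loss (Eqs. (4)-(7)); margin and K0 defaults are assumptions.
    import torch
    import torch.nn.functional as F

    def ann_loss(emb, logits, labels, margin=0.3, K0=1):
        B = emb.size(0)
        dist = torch.cdist(emb, emb, p=2).pow(2)                    # squared Euclidean distances
        probs = F.softmax(logits, dim=1)
        H_a = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)    # classification uncertainty
        K = torch.clamp(H_a.floor().long(), min=K0)                 # Eq. (7): adaptive neighborhood size
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        not_self = ~torch.eye(B, dtype=torch.bool, device=emb.device)
        losses = []
        for a in range(B):
            pos = dist[a][same[a] & not_self[a]]                    # distances to positive samples
            neg = dist[a][~same[a]]                                 # distances to negative samples
            if pos.numel() == 0 or neg.numel() == 0:
                continue
            k = min(int(K[a]), pos.numel(), neg.numel())
            D_ap = pos.topk(k, largest=True).values.mean()          # Eq. (5): K_a farthest positives
            D_an = neg.topk(k, largest=False).values.mean()         # Eq. (6): K_a closest negatives
            losses.append(F.relu(margin + D_ap - D_an))             # Eq. (4): hinge over the margin
        return torch.stack(losses).sum() if losses else emb.new_zeros(())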

Finally, to learn both separable and discriminative features, we combine the softmax loss (denoted as $L_{s}$) with the ANN loss to jointly optimize the deep neural network, that is,

L = L_{s} + \lambda L_{ANN},   (8)

where $\lambda$ is the tradeoff parameter used to balance the two loss functions.
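Assuming the ann_loss sketch above, the joint objective of Eq. (8) can be computed as follows; setting $\lambda$ to 1 anticipates the parameter study in Section 3.2.

    # Joint objective of Eq. (8): softmax (cross-entropy) loss plus lambda * ANN loss.
    import torch.nn as nn

    cross_entropy = nn.CrossEntropyLoss()

    def total_loss(emb, logits, labels, lam=1.0):
        return cross_entropy(logits, labels) + lam * ann_loss(emb, logits, labels)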

3 Experiments

In this section, several person ReID datasets used for evaluation are introduced in Section 3.1. Then, the influence of the parameters is given in Section 3.2. Next, the ablation study is shown in Section 3.3. Finally, the comparison with several state-of-the-art methods is presented in Section 3.4.

3.1 Datasets

To verify the effectiveness of the proposed method, we perform extensive experiments on four challenging person ReID datasets, including Market1501 [25], DukeMTMC-reID [26], CUHK03 [27], and Partial REID [28]. The Market1501 dataset contains 1,501 identities captured by six camera views, where the dataset is split into 12,936 training images of 750 identities and 19,732 gallery images of 750 identities. The DukeMTMC-reID dataset contains 1,404 identities collected from eight cameras. The dataset is divided into 16,522 training images of 702 identities and 17,661 gallery images of 702 identities. The CUHK03 dataset contains 13,164 images of 1,360 identities captured by six cameras. Each identity is observed by two disjoint camera views, yielding an average of 4.8 images per view. The Partial REID dataset contains 600 images of 60 identities, with 5 full-body images and 5 partially occluded images for each identity.

Figure 2: The rank-1 (%) accuracy with different values of (a) $K_{a}$ and (b) $\lambda$ on the Market1501 dataset.

We use the standard metrics, including the mean Average Precision (mAP) and the Cumulative Matching Characteristic (CMC) curve at rank-1, to evaluate the performance of person ReID.
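The following sketch illustrates how these metrics can be computed from query and gallery embeddings. It is a simplified assumption-laden version: it omits the same-camera/same-identity filtering used in the official evaluation protocols, and the function and argument names are illustrative.

    # Simplified rank-1 (CMC) accuracy and mAP from L2 distances between embeddings.
    import numpy as np

    def rank1_and_map(qf, gf, q_ids, g_ids):
        dist = np.linalg.norm(qf[:, None, :] - gf[None, :, :], axis=2)   # (num_query, num_gallery)
        rank1, aps = [], []
        for i in range(len(q_ids)):
            order = np.argsort(dist[i])                                  # gallery sorted by distance
            matches = (g_ids[order] == q_ids[i]).astype(np.float64)
            if matches.sum() == 0:
                continue
            rank1.append(matches[0])                                     # hit at rank-1?
            precision = np.cumsum(matches) / (np.arange(matches.size) + 1)
            aps.append((precision * matches).sum() / matches.sum())      # average precision per query
        return float(np.mean(rank1)), float(np.mean(aps))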

3.2 Influence of the parameters

To observe the influence of the parameters on the proposed method, we evaluate two critical parameters, i.e., the neighborhood size ($K_{a}$ in Eq. (7)) and the tradeoff parameter used to combine the two losses ($\lambda$ in Eq. (8)). The rank-1 accuracy with different values of $K_{a}$ and $\lambda$ on the Market1501 dataset is given in Fig. 2.

From Fig. 2, we can see that the proposed method with the adaptive value of $K_{a}$ achieves much better results than that with a fixed value of $K_{a}$, which demonstrates the importance of the adaptive neighborhood. The value of $\lambda$ also significantly affects the final performance. In summary, when $\lambda$ is set to 1 and $K_{a}$ is adaptively set based on the classification uncertainty $H_{a}$, the proposed method achieves the best performance. In the following, the value of $\lambda$ is fixed to 1.

3.3 Ablation study

In this section, we evaluate several variants of the proposed method to verify the effectiveness of its key components for person ReID under occlusions. We conduct the experiments on the Market1501 dataset, where the query images are corrupted with different levels of occlusions. More specifically, we randomly occlude a region of each image with random values, where $s$ denotes the ratio of the occluded area to the whole image. This ratio is set within the range of [0.0, 0.6] (please refer to [19] for more details). Moreover, we also conduct experiments on a real and challenging occlusion dataset, the Partial REID dataset, which contains different types of severe occlusions. The partially occluded images are used as the query images and the full-body images are used as the gallery images.
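A simplified sketch of this occlusion protocol is given below. It assumes that $s$ is the fraction of the image area that is occluded and uses a roughly square occluder; the exact sampling of the occluded region in [19] differs.

    # Occlude a fraction s of the image area with random pixel values (cf. [19]).
    import numpy as np

    def occlude(img, s, rng=np.random):
        # img: (H, W, C) uint8 array; s = 0 leaves the image unchanged.
        H, W = img.shape[:2]
        if s <= 0:
            return img
        area = int(s * H * W)
        h = min(H, max(1, int(np.sqrt(area))))        # roughly square occluder (an assumption)
        w = min(W, max(1, area // h))
        top = rng.randint(0, H - h + 1)
        left = rng.randint(0, W - w + 1)
        out = img.copy()
        out[top:top + h, left:left + w] = rng.randint(0, 256, size=(h, w) + img.shape[2:])
        return out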

The proposed method contains two key components: LSTM, which exploits the spatial dependencies between the different local regions to enable the model to be robust to occlusions; and the ANN loss, which learns discriminative metric embeddings.

Table 1: The details of the nine variants.

Variants        | Global | Local                              | Loss
RN_S            | GAP    | -                                  | Softmax
RN_A            | GAP    | -                                  | Softmax+ANN
RNCONV_A        | GAP    | Conv + ReLU + Batch Normalization  | Softmax+ANN
RNFC_A          | GAP    | FC + ReLU                          | Softmax+ANN
RNRNN_A         | GAP    | RNN                                | Softmax+ANN
RNLSTM_S        | GAP    | LSTM                               | Softmax
RNLSTM_C        | GAP    | LSTM                               | Softmax+Contrastive
RNLSTM_T        | GAP    | LSTM                               | Softmax+Triplet
RNLSTM_A (ours) | GAP    | LSTM                               | Softmax+ANN

Therefore, nine different variants of the proposed method are evaluated. That is, (1) the baseline method (denoted as RN_S) that only uses the global branch of a ResNet model [22] with the softmax loss; (2) the method (denoted as RN_A) that uses the same network as RN_S, where both the softmax loss and the ANN loss are employed to jointly optimize the network; (3)-(6) the methods (respectively denoted as RNCONV_A, RNFC_A, RNRNN_A, and RNLSTM_A) that employ a ResNet model and combine the global branch and the local branch (here the convolutional layer, fully-connected layer, RNN layer, and LSTM layer are respectively used as the local branch), where the softmax loss and the ANN loss are used; note that RNLSTM_A is the proposed method in this paper; (7) the method (denoted as RNLSTM_S) that uses the proposed network, where only the softmax loss is used; (8)-(9) the methods (respectively denoted as RNLSTM_C and RNLSTM_T) that use the proposed network, where the softmax loss is combined with the contrastive loss [29] and the triplet loss [23], respectively. The details of the nine variants are summarized in Table 1.

Table 2: The rank-1 (%) accuracy and mAP (%) obtained by the proposed method and the state-of-the-art methods against different levels of occlusions on the Market1501, DukeMTMC-reID, and CUHK03 (detected) datasets. Each cell reports rank-1 / mAP. The best and second highest results are in red and blue, respectively.

Method          | Market1501: s=0 | s=0.3     | s=0.6     | DukeMTMC-reID: s=0 | s=0.3     | s=0.6     | CUHK03: s=0 | s=0.3     | s=0.6
XQDA [9]        | 43.0 / 21.7     | 28.3/14.2 | 24.3/12.0 | 31.2 / 17.2        | 20.5/10.6 | 17.4/9.4  | 44.2 / -    | 36.9/-    | 32.3/-
NPD [30]        | 55.4 / 30.0     | 39.6/19.1 | 32.5/16.1 | 46.7 / 27.3        | 33.7/17.7 | 29.7/15.7 | 53.7 / -    | 39.5/-    | 33.8/-
IDE [31]        | 81.9 / 61.0     | 62.4/48.2 | 45.6/36.4 | 66.3 / 45.2        | 57.9/41.6 | 41.3/30.3 | 68.2 / 62.7 | 65.1/59.8 | 46.2/43.6
TriNet [23]     | 83.2 / 64.9     | 68.6/54.7 | 47.9/38.9 | 71.4 / 51.6        | 56.0/40.8 | 39.0/28.4 | 79.1 / 76.4 | 68.0/66.9 | 48.1/49.2
PAN [32]        | 81.0 / 63.4     | 52.0/36.5 | 43.2/30.0 | 71.6 / 51.5        | 44.7/29.0 | 39.9/25.9 | 85.4 / 90.9 | 61.0/66.5 | 53.0/57.6
SVDNet [33]     | 81.4 / 61.2     | 62.3/46.9 | 52.0/40.3 | 75.9 / 56.3        | 59.1/43.5 | 50.6/37.9 | 81.2 / 84.5 | 71.2/66.8 | 63.9/62.1
DPFL [34]       | 88.6 / 72.6     | -         | -         | 79.2 / 60.6        | -         | -         | 82.0 / 78.1 | -         | -
RNLSTM_A (ours) | 90.3 / 76.4     | 76.6/63.8 | 52.9/44.8 | 77.0 / 62.1        | 69.3/58.3 | 51.9/41.2 | 86.1 / 83.6 | 77.0/75.3 | 59.8/59.6

The rank-1 accuracy obtained by the nine different variants on the Market1501 and Partial REID datasets is shown in Fig. 3, where Fig. 3(a) shows the robustness of the different variants against different levels of occlusions.

Figure 3: Comparison of the rank-1 (%) accuracy obtained by the different variants on the (a) Market1501 and (b) Partial REID datasets.

From Fig. 3(a), we draw the following conclusions. (1) In general, the recognition performance obtained by all the variants on Market1501 drops as the ratio of the occluded area increases, which further demonstrates how challenging person ReID under occlusions is. (2) By comparing the rank-1 accuracy obtained by RN_S and RN_A, the ResNet (with only the global branch) jointly optimized by the softmax and ANN losses obtains better performance than that using only the softmax loss. This is mainly because the joint loss enhances the discrimination ability of the model. However, the improvements are not significant, since ResNet only considers the high-level semantic information and ignores the local information, which is critical for classification under occlusions. (3) By comparing the variants with different local branches (RNCONV_A, RNFC_A, RNRNN_A and RNLSTM_A), we can see that LSTM plays a critical role in the final performance. Specifically, compared with the variants without LSTM (RNCONV_A, RNFC_A, RNRNN_A), the variant with LSTM (the proposed RNLSTM_A) improves the rank-1 accuracy by about 3% under different levels of occlusions. This is due to the fact that LSTM not only memorizes the spatial correlations between different local regions, but also reduces the noise caused by partial occlusions. (4) The proposed method (RNLSTM_A) outperforms the softmax-loss-based method (RNLSTM_S), as well as the joint-loss methods based on the contrastive loss (RNLSTM_C) and the triplet loss (RNLSTM_T), by a reasonable margin (about 3%~6% on Market1501). From this comparison, we can see that combining the classification loss with a metric loss (i.e., the contrastive loss or the triplet loss) improves the performance of the model. Furthermore, RNLSTM_A obtains superior performance by exploiting the neighborhood information to enlarge the inter-class dispersion while increasing the intra-class compactness. Among all the competing variants, RNLSTM_A consistently achieves the best results on the Market1501 dataset. This indicates that the deep model (based on LSTM and CNN) jointly optimized by the softmax loss and the ANN loss can effectively enhance the robustness to occlusions.

From Fig. 3(b), we can draw similar conclusions on the Partial REID dataset. Note that RNLSTM_A performs slightly worse than RNRNN_A. This is mainly because the LSTM layer has more parameters than the RNN layer. In other words, to ensure the effectiveness of the proposed method, a large training set is preferred for learning the parameters. However, the training set of the Partial REID dataset is small (only 300 images are used for training).

3.4 Comparison with the state-of-the-art methods

In this section, we compare the proposed method (i.e., RNLSTM_A) with several representative methods, including the traditional metric learning methods (NPD [30], XQDA [9]) and the recently proposed deep learning methods (IDE [31], TriNet [23], PAN [32], SVDNet [33] and DPFL [34]).

The rank-1 accuracy and mAP obtained by all the competing methods are shown in Table 2. Compared with the traditional person ReID methods (NPD and XQDA), the deep learning methods achieve significant performance improvements, which shows the superiority of deep learning. The proposed method obtains much better results than the softmax-loss-based IDE [31] method and the triplet-loss-based TriNet [23] method, which demonstrates the effectiveness of the proposed ANN loss. Moreover, the proposed method outperforms the part-based method PAN [32] under occlusions, since we exploit the spatial dependencies based on LSTM to learn discriminative representations. Compared with SVDNet [33], the proposed method achieves higher rank-1 accuracy and mAP under small occlusions. However, the proposed method obtains slightly inferior results under large occlusions ($s=0.6$) on the CUHK03 dataset. Although SVD in the FC layer of SVDNet can effectively extract discriminative information for person ReID, the training complexity of SVDNet is high. DPFL [34], which trains a multi-channel network on multi-scale images, achieves slightly better performance than the proposed method (which only exploits the single-scale image) on DukeMTMC-reID. However, the proposed method obtains better results than DPFL on the challenging CUHK03 dataset, where each pedestrian has a relatively small number of training images.

4 Conclusion

In this paper, we propose to exploit spatial dependencies based on LSTM to handle the problem of occlusions for person ReID. To better explore the discriminative capability of deep metric embedding, we propose an adaptive nearest neighbor loss to enlarge the inter-class dispersion while preserving the intra-class compactness. Experimental results on four challenging datasets have shown the effectiveness of the proposed method for person ReID under occlusions.

Acknowledgements

This work was supported by the National Key R&D Program of China under Grant 2017YFB1302400, by the National Natural Science Foundation of China under Grants 61571379, 61503315, U1605252, and 61472334, by the Natural Science Foundation of Fujian Province of China under Grant 2017J01127 and 2018J01576, by the Fundamental Research Funds for the Central Universities under Grant 20720170045, and by State Key Laboratory of Advanced Optical Communication Systems Networks, China.

References


  • [1] Y. Lin, F. Guo, L. Cao, J. Wang, Person re-identification based on multi-instance multi-label learning, Neurocomputing 217 (2016) 19–26.
  • [2] Y. Huang, H. Sheng, Y. Zheng, Z. Xiong, Deepdiff: Learning deep difference features on human body parts for person re-identification, Neurocomputing 241 (2017) 191–203.
  • [3] W. Fang, H. M. Hu, Z. Hu, S. Liao, B. Li, Perceptual hash-based feature description for person re-identification, Neurocomputing 272 (2017) 520–531.
  • [4] H. Dong, P. Lu, S. Zhong, C. Liu, Y. Ji, S. Gong, Person re-identification by enhanced local maximal occurrence representation and generalized similarity metric learning, Neurocomputing 307 (2018) 25–37.
  • [5] D. Cheng, X. Chang, L. Liu, A. G. Hauptmann, Y. Gong, N. Zheng, Discriminative dictionary learning with ranking metric embedded for person re-identification, in: International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 964–970.
  • [6] W. S. Zheng, S. Gong, T. Xiang, Reidentification by relative distance comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 35 (3) (2013) 653–668.
  • [7] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S. Z. Li, Salient color names for person re-identification, in: European Conference on Computer Vision (ECCV), 2014, pp. 536–551.
  • [8] B. Ma, Y. Su, F. Jurie, Covariance descriptor based on bio-inspired features for person re-identification and face verification, Image and Vision Computing (IVC) 32 (6) (2014) 379–390.
  • [9] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197–2206.
  • [10] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, J. R. Smith, Learning locally-adaptive decision functions for person verification, in: Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3610–3617.
  • [11] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2288–2295.
  • [12] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, R. J. Radke, A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2016) 1–1.
  • [13] S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition (PR) 48 (10) (2015) 2993–3003.
  • [14] W. Chen, X. Chen, J. Zhang, K. Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, in: Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1320–1329.
  • [15] E. Ustinova, Y. Ganin, V. Lempitsky, Multi-region bilinear convolutional neural networks for person re-identification, in: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 2993–3003.
  • [16] H. Liu, J. Feng, M. Qi, J. Jiang, S. Yan, End-to-end comparative attention networks for person re-identification, IEEE Transactions on Image Processing (TIP) 26 (7) (2017) 3492–3506.
  • [17] D. Yi, Z. Lei, S. Liao, S. Z. Li, Deep metric learning for person re-identification, in: International Conference on Pattern Recognition (ICPR), 2014, pp. 34–39.
  • [18] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, Person re-identification by multi-channel parts-based cnn with improved triplet loss function, in: Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1335–1344.
  • [19] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, arXiv preprint arXiv:1708.04896.
  • [20] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, R. Ward, Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (4) (2016) 694–707.
  • [21] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
  • [22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [23] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737.
  • [24] H. O. Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4004–4012.
  • [25] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: International Conference on Computer Vision (ICCV), 2015, pp. 1116–1124.
  • [26] E. Ristani, F. Solera, R. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: European Conference on Computer Vision (ECCV), 2016, pp. 17–35.
  • [27] W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: Deep filter pairing neural network for person re-identification, in: Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159.
  • [28] W. S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, S. Gong, Partial person re-identification, in: International Conference on Computer Vision (ICCV), 2015, pp. 4678–4686.
  • [29] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
  • [30] L. Zhang, T. Xiang, S. Gong, Learning a discriminative null space for person re-identification, in: Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1239–1248.
  • [31] L. Zheng, Y. Yang, A. G. Hauptmann, Person re-identification: Past, present and future, arXiv preprint arXiv:1610.02984.
  • [32] L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person re-identification, in: International Conference on Computer Vision (ICCV), 2017, pp. 3239–3248.
  • [33] Y. Sun, L. Zheng, W. Deng, S. Wang, Svdnet for pedestrian retrieval, in: International Conference on Computer Vision (ICCV), 2017, pp. 3820–3828.
  • [34] Y. Chen, X. Zhu, S. Gong, Person re-identification by deep learning multi-scale representations, in: International Conference on Computer Vision (ICCV), 2017, pp. 2590–2600.