Dynamic Prototype Mask for Occluded Person Re-Identification
Abstract.
Although person re-identification has achieved impressive improvement in recent years, the common occlusion caused by different obstacles is still an unsettled issue in real application scenarios. Existing methods mainly address this issue by employing body clues provided by an extra network to distinguish the visible parts. Nevertheless, the inevitable domain gap between the assistant model and the ReID datasets greatly increases the difficulty of obtaining an effective and efficient model. To escape from the extra pre-trained networks and achieve automatic alignment in an end-to-end trainable network, we propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge. Specifically, we first devise a Hierarchical Mask Generator which utilizes hierarchical semantics to select the visible pattern space between the high-quality holistic prototype and the feature representation of the occluded input image. Under this condition, the occluded representation can be spontaneously aligned in the selected subspace. Then, to enrich the feature representation of the high-quality holistic prototype and provide a more complete feature space, we introduce a Head Enrich Module to encourage different heads to aggregate different pattern representations in the whole image. Extensive experimental evaluations conducted on occluded and holistic person re-identification benchmarks demonstrate the superior performance of DPM over the state-of-the-art methods. The code is released at https://github.com/stone96123/DPM.

* Corresponding Author.
1. Introduction
Person re-identification (ReID), which aims to match people over a distributed set of non-overlapping cameras, has attracted intensive attention in the last few years due to its wide applications in surveillance systems (Eom and Ham, 2019; Zhai et al., 2020; Zheng et al., 2019a; Ye et al., 2021a). While recent large-scale ReID datasets (Zheng et al., 2017; Wei et al., 2018) have enabled deep neural networks to produce satisfying retrieval performance on holistic pedestrian regions, widespread occlusion caused by different obstacles is still an unsettled issue in real application scenarios. This practical condition has inspired a large amount of research effort on occluded person re-identification.

Compared with the general person re-identification problem, which assumes the whole body is available, the main challenge of occluded person re-identification is two-fold. Firstly, when obstacles cover the discriminative body regions, the valid information in the final feature representation is greatly reduced. Even if the remaining visible regions provide valuable information, the final representation of the whole image may not be rectified by these small contributions. Secondly, exploring fine-grained feature representations has been demonstrated to be an efficient strategy for building an advanced ReID framework (Sun et al., 2018; Wang et al., 2018; Zheng et al., 2019a). However, occluded person images usually lack several important parts due to the obstacles. Under this condition, the invalid noise easily yields spurious similarity between similar obstacles and induces erroneous results. To this end, two typical frameworks have been proposed to tackle the above issues. One mainstream framework (Wang et al., 2020; Hou et al., 2021) aims to aggregate the information from the whole image and handles the above issue by compensating the invisible body regions with their visible near-neighbors. With the assistance of well-trained human parsing or body key point estimation networks, these methods can easily construct a topology graph based on the body key points. By passing information from visible nodes to invisible nodes, the influence of occluded regions is largely alleviated. Although the information from the whole image is retained, the information from visible near-neighbors may not be convincing enough, and the inevitable domain gap between the assistant pre-trained model and the ReID datasets greatly increases the difficulty of obtaining an effective and efficient model. Alongside the above strategy, discovering and aligning the fine-grained visible body parts at the spatial level of the occluded person image is a prevailing and straightforward strategy that has received much attention (Qian et al., 2018; Miao et al., 2019; Gao et al., 2020). By ignoring the invisible parts, these works achieve significant improvements and visualization results. Nevertheless, to better distinguish the visible and invisible regions, most of the methods in this body of work also rely on extra pre-trained networks to provide body clues and suffer from the same domain gap. To make matters worse, erroneous segmentation or key point results easily cause valid nuances to be abandoned.
Therefore, in this paper, we propose a Dynamic Prototype Mask (DPM) which not only escapes from the extra pre-trained networks but also simultaneously retains the information from the whole image and achieves alignment. The DPM is built upon two pieces of self-evident prior knowledge: 1) During training, the loss scale of high-quality images which suffer little occlusion is lower than that of heavily occluded images. Since the fully connected layer used for classification can be regarded as a bank of prototypes, one per class, the lower loss scale can be seen as a high similarity between the high-quality sample and its corresponding prototype. In other words, the prototype of each identity can be considered a high-quality and complete feature representation that suffers little occlusion. 2) Each channel in the feature representation can be regarded as a response to a specific pattern. This phenomenon in CNNs has already been explored by several previous works (Chen et al., 2017; Hu et al., 2018). For the transformer, the multi-head self-attention directly aggregates features based on patch-to-patch similarity. These two observations indicate that alignment and matching during training for occluded person re-identification can largely be addressed by selecting a visible pattern subspace for both the input image and its holistic prototype. Motivated by this observation, we introduce the DPM. Different from the spatial attention strategy (Chen et al., 2021b) which acts on the feature of the input image itself, the DPM learns a dynamic mask to cut the holistic prototype and select an efficient subspace for matching. Meanwhile, this process is entirely spontaneous and does not rely on any extra network to provide body clues.
To do this, as shown in Figure 2, the DPM starts with a standard ViT (Dosovitskiy et al., 2020), which has demonstrated superior performance in computer vision tasks (Chen et al., 2022; Luo et al., 2022; He et al., 2021; Li et al., 2021a). Since the key idea of DPM is to generate a prototype mask that selects the visible subspace for matching, we first introduce a hierarchical mask generator (HMG) to provide a reliable mask feature. The HMG takes advantage of a convolutional neural network and evaluates the weight of each channel according to the correlation of local information. Meanwhile, we observe that as the network goes deeper, the feature representations of the patches are smoothed and become more similar to each other. Based on such highly similar input, it is difficult for a pure HMG to provide an efficient prototype mask. Therefore, we add a hierarchical structure that enhances the diversity of the input feature with shallow layers of high diversity. To fully explore the potential of DPM, a Head Enrich Module (HEM) is devised to enrich the feature representation. Specifically, each head in the final transformer block is encouraged to aggregate different patterns in the whole image. Finally, to evaluate the effectiveness of the proposed DPM, we conduct a series of experiments on both occluded and holistic ReID benchmarks.
The main contributions of the paper are summarized as follows:
• A novel end-to-end trainable network, DPM, is proposed. DPM not only escapes from the extra pre-trained networks but also simultaneously retains the information from the whole image and achieves automatic alignment.
• To fully explore the potential of DPM, a Hierarchical Mask Generator (HMG) together with a Head Enrich Module (HEM) is introduced. The HMG provides a high-quality subspace mask via hierarchical semantic information, while the HEM enriches the holistic prototype via diverse heads.
• Extensive experiments on two public occluded datasets, Occluded-Duke and Occluded-REID, demonstrate the superiority of our DPM.

2. Related Works
2.1. Holistic Person Re-Identification
Holistic person re-identification aims to match people over a distributed set of non-overlapping cameras. Prior works mainly focus on hand-crafted descriptors (Ma et al., 2014; Yang et al., 2014; Liao et al., 2015) combined with well-designed metric learning strategies (Zheng et al., 2012; Koestinger et al., 2012). With the resurgence of deep learning, deep feature representation learning has come to dominate vision tasks (He et al., 2017; Mou et al., 2020; Chen et al., 2021a; Peng et al., 2021; Zhang et al., 2021). Luo et al. (Luo et al., 2019) introduce the BN-Neck structure into the CNN-based ReID framework, which provides a strong baseline for holistic ReID. Chen et al. (Chen et al., 2019) introduce a high-order attention mechanism to capture and use high-order attention distributions. Zheng et al. (Zheng et al., 2019b) integrate discriminative and generative learning in a single unified network for person re-identification. Besides using the global feature representation (Luo et al., 2019; Zheng et al., 2019b; Ye et al., 2021b), employing part-level features for pedestrian image description to offer fine-grained information is also a mainstream strategy that has been verified as beneficial for person ReID. Methods such as PCB (Sun et al., 2018), MGN (Wang et al., 2018), and Pyramid (Zheng et al., 2019a) horizontally divide the input images or feature maps into several parts to construct a fine-grained representation. Most recently, we have witnessed the migration of transformer structures from natural language processing to computer vision. TransReID (He et al., 2021) is the first to apply the ViT structure to the ReID task. Although these methods reach satisfying performance on the holistic ReID benchmarks, the widely existing occlusion condition is largely ignored, and most of them suffer significant performance degradation when applied to real-world scenarios that contain occluded cases.
2.2. Occluded Person Re-Identification
Occluded person re-identification points out the weakness of holistic ReID methods in occluded cases. The main challenge of occluded ReID lies in the incomplete body information, which cannot provide a high-quality feature representation. To tackle this issue, early works attempt to remove the influence of obstacles in an end-to-end framework and generate the global feature representation from the visible parts. Zhuo et al. (Zhuo et al., 2018) introduce an extra occluded/non-occluded binary classification task to distinguish occluded images from holistic ones. Chen et al. (Chen et al., 2021b) combine an occlusion augmentation scheme with an attention mechanism to precisely capture body parts regardless of the occlusion. Although this kind of work is easy to implement and shows good performance, it always suffers from the noise caused by obstacles, which limits the performance upper bound. Therefore, recent methods attempt to avoid such a condition with two typical strategies. The first aims to aggregate the information from the whole image and handles the above issue by compensating the invisible body regions with their visible near-neighbors. Wang et al. (Wang et al., 2020) utilize high-order relation and human-topology information based on key point estimation to learn well-aligned and robust features. Hou et al. (Hou et al., 2021) propose a region feature completion module that exploits long-range spatial contexts from non-occluded regions to predict the features of occluded regions. Although aggregating the information from visible neighbor nodes can alleviate the occluded condition, this process still faces a great challenge when the neighbor nodes lack sufficient evidence. Meanwhile, the key point estimation network pre-trained on other datasets also struggles to provide reliable results under domain variation. The other strategy inherits the idea of fine-grained feature representation and aims to match images between the visible parts. Miao et al. (Miao et al., 2019) introduce Pose-Guided Feature Alignment (PGFA), exploiting pose landmarks to disentangle visible part information from occlusion noise. Gao et al. (Gao et al., 2020) introduce the Pose-guided Visible Part Matching (PVPM) model to learn discriminative part features via a pose-guided attention map. Li et al. (Li et al., 2021a) employ prototypes to disentangle fine-grained body parts without the help of an extra network and achieve satisfying performance. However, most of the fine-grained methods still rely on an extra network to provide body clues and suffer from the same domain variation problem. Furthermore, since fine-grained methods demand strict part prediction to send the feature to its corresponding branch, incorrect predictions easily cause valuable nuances to be ignored.
Differing from the above methods, the DPM not only escapes from the extra pre-trained networks but also simultaneously retains the global information from the whole image representation and achieves automatic alignment.
3. The Proposed Method
3.1. Overall Framework
The overview of our proposed DPM framework is illustrated in Figure 2. The DPM adopts a pre-trained ViT (Dosovitskiy et al., 2020) to extract the original feature representation from the input images. Herein, we denote the input image as $x \in \mathbb{R}^{H \times W \times C}$ with resolution $H \times W$. We first split the image into $N$ patches of size $P \times P$, which are flattened as $\{x_p^i \mid i = 1, 2, \dots, N\}$. Specifically, $N$ can be described as:

(1)  $N = \left\lfloor \frac{H + S - P}{S} \right\rfloor \times \left\lfloor \frac{W + S - P}{S} \right\rfloor$

where $P$ and $S$ refer to the size of an image patch and the step size of the sliding window, respectively. After a linear projection $F(\cdot)$, a learnable class token $x_{cls}$ is attached to aggregate the information from the image patches. Before feeding the sequence into the transformer blocks, following TransReID (He et al., 2021), a learnable position embedding $\mathcal{P}$ and a camera embedding $\mathcal{C}$ are added to the patch embeddings to retain positional and camera information, respectively, which can be formulated as:

(2)  $\mathcal{Z}_0 = [x_{cls};\, F(x_p^1);\, F(x_p^2);\, \dots;\, F(x_p^N)] + \mathcal{P} + \lambda \mathcal{C}$
where $\mathcal{Z}_0$ is the input of the transformer blocks and the hyper-parameter $\lambda$ is used to balance the weight of the camera embedding. The query and key vectors of the last transformer block are fed into the head enrich module (HEM), which takes advantage of the multi-head attention structure to explore diverse feature representations across the different heads. Meanwhile, the representation of the class token is utilized to train a holistic prototype for each class. To tackle the occluded case, the image patch representations of the 2nd, 4th, 10th, and 12th blocks are concatenated and sent to the hierarchical mask generator (HMG), which produces a dynamic prototype mask for every single input image. Different from spatial attention methods (Chen et al., 2021b) which act on the feature map itself, the prototype mask is used to cut the holistic prototype and select a subspace of highly discriminative visible patterns.
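For concreteness, the following PyTorch-style sketch shows one way to build the input sequence of Eq. (1)-(2) with an overlapping patch embedding, a class token, and position/camera embeddings. The image size, stride, embedding dimension, and camera-embedding weight used here are placeholder assumptions for illustration, not the values from our experiments.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbedding(nn.Module):
    """Builds the transformer input of Eq. (2) from an image and its camera id."""
    def __init__(self, img_h=256, img_w=128, patch=16, stride=12,
                 dim=768, num_cams=8, lam=3.0):
        super().__init__()
        # Eq. (1): N = floor((H + S - P) / S) * floor((W + S - P) / S)
        n = ((img_h + stride - patch) // stride) * ((img_w + stride - patch) // stride)
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=stride)   # linear projection F
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))             # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))         # position embedding P
        self.cam_embed = nn.Parameter(torch.zeros(num_cams, 1, dim))      # camera embedding C
        self.lam = lam                                                    # weight of the camera embedding

    def forward(self, x, cam_id):
        # x: (B, 3, H, W); cam_id: (B,) long tensor of camera indices
        patches = self.proj(x).flatten(2).transpose(1, 2)                 # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)                    # (B, 1, dim)
        z0 = torch.cat([cls, patches], dim=1)                             # [x_cls; F(x_p^1); ...; F(x_p^N)]
        return z0 + self.pos_embed + self.lam * self.cam_embed[cam_id]    # Eq. (2)
```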

3.2. Hierarchical Mask Generator
Based on the two pieces of prior knowledge mentioned before, the main idea of DPM is to achieve spontaneous alignment by selecting a discriminative subspace of the holistic prototype for matching every single image. Therefore, one of the most important steps in achieving the DPM is to generate an efficient prototype mask. In most cases, a pattern within an image is composed of pixels that are spatially concentrated and form a connected component. Therefore, using a local, region-based sliding window to weigh the importance of patterns is well suited to this application. To provide a high-quality prototype mask, as shown in Figure 2, we apply a convolution-based mask generator that takes the neighboring nodes of each patch into consideration.
Specifically, after the $l$-th block, the HMG adopts the reshaped image representation obtained by excluding the class token from the feature representation $\mathcal{Z}_l$. Although directly using the image representation provided by the last block seems like the most intuitive strategy, as shown in Figure 3 (b), after calculating the cosine similarity between the most dissimilar image patches in every block, we observe that passing information through the similarity among image patches smooths the feature representations and makes them more similar to each other. Such representations with little discriminative information make it hard to generate an efficient prototype mask. To this end, a hierarchical structure is utilized to aggregate the image representations from the shallow layers. Inspired by the success of the Swin-Transformer (Liu et al., 2021), which merges the patches after the 2nd, 4th, and 10th transformer blocks, we also pick the image representations from these three blocks and combine them with the last block as the input of the HMG. The final prototype mask is generated as:
(3)  $M = \sigma\left(\Phi\left(\left[g_2 \hat{\mathcal{Z}}_2;\; g_4 \hat{\mathcal{Z}}_4;\; g_{10} \hat{\mathcal{Z}}_{10};\; g_{12} \hat{\mathcal{Z}}_{12}\right]\right)\right)$
Herein, $\sigma(\cdot)$ refers to the sigmoid function, $\Phi(\cdot)$ denotes the convolutional layers of the HMG, $\hat{\mathcal{Z}}_l$ is the reshaped image representation after the $l$-th block, and $g_l$ is a binary gate that chooses the inputs for the HMG.
Finally, the masked prototype $W_M$ for the input image is generated as the Hadamard product of the row-extended prototype mask $\widetilde{M}$ and the prototype weight matrix $W$:
(4)  $W_M = \widetilde{M} \odot W$
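The sketch below illustrates one possible realization of the HMG and of the masked prototype in Eq. (3)-(4), assuming the convolutional design, the pooling over the patch grid, and all layer sizes; it reflects our reading of the text rather than the released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalMaskGenerator(nn.Module):
    def __init__(self, dim=768, n_levels=4, grid=(21, 10)):
        super().__init__()
        self.grid = grid
        # Local, region-based weighting: convolutions over the patch grid let
        # neighboring patches vote for the importance of each channel.
        self.conv = nn.Sequential(
            nn.Conv2d(n_levels * dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        # feats: list of patch features from blocks {2, 4, 10, 12}, each (B, N, dim)
        h, w = self.grid
        maps = [f.transpose(1, 2).reshape(f.size(0), -1, h, w) for f in feats]
        x = torch.cat(maps, dim=1)                         # hierarchical concatenation
        m = torch.sigmoid(self.conv(x).mean(dim=(2, 3)))   # channel-wise mask in (0, 1), Eq. (3)
        return m                                           # (B, dim)

def masked_prototypes(mask, prototypes):
    # Eq. (4): Hadamard product of the row-extended mask and the prototype
    # weight matrix W (the classifier weights, one prototype per identity).
    # mask: (B, dim); prototypes: (num_classes, dim) -> (B, num_classes, dim)
    return mask.unsqueeze(1) * prototypes.unsqueeze(0)
```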
3.3. Head Enrich Module
Multi-head self-attention, which aims to extract diverse and discriminative feature representations, is considered the key component of the transformer block. This kind of aggregation of different patterns suits the DPM both in training a complete holistic prototype matrix and in selecting an aligned, discriminative subspace. However, as shown in Figure 3 (a), in the original transformer block there is no explicit optimization that encourages the multiple heads to aggregate more nuances from the whole image, which allows different heads to produce similar feature embeddings. Such similar head representations in turn limit the ability of DPM to select a well-aligned subspace.
In view of this condition, we introduce a head enrich module (HEM) that pushes the multiple heads of the class token to capture diverse patterns in the last transformer block. Generally, multi-head attention adopts the query matrix Q, key matrix K, and value matrix V to pass information between patches. In the training phase, we only employ the class token as the global representation. Therefore, ignoring the key vector of the class token itself, we can obtain the attention map $A_i$ between the class token and the image patches in the $i$-th head as:
(5)  $A_i = \operatorname{softmax}\left(\frac{q_{cls}^{i}\,(K_{p}^{i})^{\top}}{\sqrt{d}}\right), \quad i = 1, 2, \dots, h$
where $h$ is the number of heads, $q_{cls}^{i}$ is the query vector of the class token in the $i$-th head of the last transformer block, $K_{p}^{i}$ contains the key vectors of the image patches in the same head, and $d$ refers to the dimension of each head. To push the attention maps of the heads apart, an orthonormal constraint is imposed as:
(6)  $\mathcal{L}_{div} = \left\| \bar{A}\bar{A}^{\top} - I \right\|_F^2$
where $\|\cdot\|_F$ is the Frobenius norm, $I$ is the identity matrix, and $\bar{A}$ is the attention matrix with each row (one per head) being L2-normalized. With $\mathcal{L}_{div}$, the class token provides a richer representation, which benefits both the learning of the holistic prototype and the learning of the mask generator.
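A minimal sketch of this diversity constraint is given below, assuming access to the class-token queries and patch keys of the last block; tensor shapes follow standard multi-head attention.

```python
import torch
import torch.nn.functional as F

def head_enrich_loss(q_cls, k_patch):
    # q_cls:   (B, h, d)    query of the class token for each of the h heads
    # k_patch: (B, h, N, d) keys of the N image patches for each head
    d = q_cls.size(-1)
    attn = torch.einsum('bhd,bhnd->bhn', q_cls, k_patch) / d ** 0.5
    attn = attn.softmax(dim=-1)                  # Eq. (5): per-head attention over patches
    a = F.normalize(attn, p=2, dim=-1)           # L2-normalize each head's attention row
    gram = torch.bmm(a, a.transpose(1, 2))       # (B, h, h) cross-correlation between heads
    eye = torch.eye(a.size(1), device=a.device).expand_as(gram)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()  # Eq. (6): squared Frobenius norm
```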
3.4. Loss Function and Optimization
The DPM contains a two-branch learning framework with an original classification loss and a masked classification loss. Intuitively, one could employ the softmax loss to optimize both branches. However, this strategy leads to a situation in which the mask cannot be learned well: after adding the mask, the scale of the loss generated by the masked branch is much smaller than that of the softmax loss, which cannot provide enough drive to learn a high-quality prototype mask. Although increasing the weight of the masked branch may help, the softmax loss is not a strong constraint for clustering the samples, and this limited constraint also limits the performance of DPM. Inspired by progress in metric learning (Deng et al., 2019), we add an extra angular margin to the original softmax loss to optimize the masked branch. This strategy not only balances the scales of the original and masked branches but also highlights the importance of learning a high-quality mask during the training phase. With the extra margin, for an input $x_i$ with label $y_i$, the masked classification loss $\mathcal{L}_{m}$ can be given as:
(7)  $\mathcal{L}_{m} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{C} e^{s\cos\theta_{j}}}$
where $B$ and $C$ refer to the batch size and the number of classes, respectively, and $\theta_j$ is the angle between the feature of $x_i$ and the $j$-th masked prototype. $m$ denotes the angular margin and $s$ is the hyper-parameter that adjusts the scale.
To increase the intra-class similarity and decrease the inter-class similarity, a triplet loss with online hard mining (Schroff et al., 2015) is also applied during supervised training. Therefore, the overall loss function can be formulated as:
(8)  $\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{tri} + \lambda_{1}\mathcal{L}_{m} + \lambda_{2}\mathcal{L}_{div}$
where $\mathcal{L}_{cls}$ and $\mathcal{L}_{tri}$ denote the original softmax classification loss and the triplet loss, and $\lambda_{1}$ and $\lambda_{2}$ are the hyper-parameters that adjust the weights of $\mathcal{L}_{m}$ and $\mathcal{L}_{div}$, respectively.
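The masked branch of Eq. (7) and the overall objective of Eq. (8) can be sketched as follows. The margin and scale values are placeholders, and applying the angular margin on the per-image masked prototypes reflects our reading of the method rather than the released code.

```python
import torch
import torch.nn.functional as F

def masked_margin_loss(feat, masked_protos, labels, m=0.5, s=30.0):
    # feat: (B, dim); masked_protos: (B, num_classes, dim) from Eq. (4); labels: (B,)
    f = F.normalize(feat, dim=-1).unsqueeze(1)            # (B, 1, dim)
    w = F.normalize(masked_protos, dim=-1)                # (B, C, dim)
    cos = (f * w).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7)      # (B, C) cosine similarities
    theta = torch.acos(cos)
    target = F.one_hot(labels, cos.size(1)).bool()
    logits = torch.where(target, torch.cos(theta + m), cos) * s  # margin on the true class only
    return F.cross_entropy(logits, labels)                # Eq. (7)

def overall_loss(l_cls, l_tri, l_m, l_div, lam1=1.0, lam2=1.0):
    # Eq. (8): original classification + triplet + masked branch + head diversity
    return l_cls + l_tri + lam1 * l_m + lam2 * l_div
```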
4. Experiment
4.1. Datasets and Experimental Setting
Datasets. To evaluate the effectiveness of the proposed DPM, we conduct extensive experiments on four publicly available ReID benchmarks which include both occluded and holistic person re-identification datasets. The details are as follows.
Occluded-Duke (Miao et al., 2019) is a large-scale dataset constructed from DukeMTMC-reID for occluded person re-identification. The training set consists of 15,618 images of 702 persons. The testing set contains 2,210 images of 519 persons as the query and 17,661 images of 1,110 persons as the gallery. To date, Occluded-Duke remains the most challenging occluded ReID dataset due to its scale.
Occluded-REID (Zhuo et al., 2018) is an occluded person dataset captured by mobile cameras. It consists of 2,000 images of 200 persons, where each person has 5 full-body images and 5 occluded images. Following the evaluation protocol of previous works (Gao et al., 2020; Wang et al., 2020; Chen et al., 2021b), Occluded-REID is only used as a testing set. The model used for experiments on this dataset is trained on the training set of Market-1501 (Zheng et al., 2015).
Market-1501 (Zheng et al., 2015) is a widely-used holistic ReID dataset captured from 6 cameras. It includes 12,936 training images of 751 persons as the training set, 3,368 images of 750 persons as the query, and 19,732 images of 750 persons as the gallery.
DukeMTMC-reID (Zheng et al., 2017) contains 36,411 images of 1,812 persons captured by eight cameras, in which 16,522 images of 702 identities are used as the training set, and 2,228 and 17,661 images of another 702 persons, which do not appear in the training set, are used as the query and gallery, respectively.
Evaluation Protocol. To ensure a fair comparison with other methods, we adopt the widely used Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) as evaluation metrics and follow the evaluation settings of existing occluded ReID methods (Wang et al., 2020; Gao et al., 2020).
Implementation details. We employ a ViT (Dosovitskiy et al., 2020) pre-trained on ImageNet (Deng et al., 2009) as the backbone network. All input images are resized to the same resolution, and the commonly used horizontal flipping, padding, random cropping, and random erasing (Zhong et al., 2020) are adopted as data augmentation. Following (Wang et al., 2020; Yang et al., 2021), we use extra color jitter augmentation to reduce domain variance when testing on Occluded-REID. Following the success of TransReID (He et al., 2021), we adopt a smaller stride for the sliding-window patch embedding. During the training stage, each mini-batch consists of 64 images from 4 identities. To strengthen the HMG, the training phase is divided into two steps in every iteration. In the first step, we freeze the parameters of the HMG and train the holistic prototype. In the second step, we freeze all parameters except those of the HMG to train a high-quality prototype mask. During the testing phase, we apply the mask generated by the query image to the gallery images, so the retrieval stage can still be computed in parallel for each query image. SGD is utilized as the optimizer and the learning rate follows a cosine decay schedule. The hyper-parameters $m$ and $s$ of the ArcFace-style loss are tuned separately for each dataset; since strong constraints easily induce overfitting given the domain variance between Market-1501 and Occluded-REID, we use smaller values of $m$ and $s$ when training for Occluded-REID. We implement DPM with PyTorch and conduct all experiments on a single Nvidia Tesla A100.
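The alternating two-step optimization described above can be sketched as follows; the attribute names (model.backbone, model.hmg) and the criterion interface are illustrative assumptions, not the released training code.

```python
def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train_iteration(model, batch, optimizer, criterion):
    images, labels, cam_ids = batch

    # Step 1: freeze the HMG and update the backbone, so the holistic
    # prototypes are learned from un-masked class-token features.
    set_requires_grad(model.hmg, False)
    set_requires_grad(model.backbone, True)
    loss = criterion(model(images, cam_ids), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Step 2: freeze everything except the HMG, so only the prototype
    # mask is refined against the (now fixed) holistic prototypes.
    set_requires_grad(model.backbone, False)
    set_requires_grad(model.hmg, True)
    loss = criterion(model(images, cam_ids), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```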
Table 1. Performance comparison with state-of-the-art methods on Occluded-Duke and Occluded-REID (Rank-1 accuracy and mAP, %).

| Method | Occluded-Duke R-1 | Occluded-Duke mAP | Occluded-REID R-1 | Occluded-REID mAP |
|---|---|---|---|---|
| PCB (Sun et al., 2018) | 42.6 | 33.7 | 41.3 | 38.9 |
| Part Bilinear (Suh et al., 2018) | 36.9 | - | - | - |
| FD-GAN (Ge et al., 2018) | 40.8 | - | - | - |
| ISP (Zhu et al., 2020) | 62.8 | 52.3 | - | - |
| TransReID* (He et al., 2021) | 66.4 | 59.2 | - | - |
| DSR (He et al., 2018) | 40.8 | 30.4 | 72.8 | 62.8 |
| Ad-Occluded (Huang et al., 2018) | 44.5 | 32.2 | - | - |
| FPR (He et al., 2019) | - | - | 78.3 | 68.0 |
| PGFA (Miao et al., 2019) | 51.4 | 37.3 | - | - |
| PVPM+Aug (Gao et al., 2020) | - | - | 70.4 | 61.2 |
| HOReID (Wang et al., 2020) | 55.1 | 43.8 | 80.3 | 70.2 |
| OAMN (Chen et al., 2021b) | 62.6 | 46.1 | - | - |
| Part-Label (Yang et al., 2021) | 62.2 | 46.3 | 81.0 | 71.0 |
| PAT* (Li et al., 2021a) | 64.5 | 53.6 | 81.6 | 72.1 |
| DPM | 71.4 | 61.8 | 85.5 | 79.7 |
Table 2. Performance comparison with state-of-the-art methods on Market-1501 and DukeMTMC-reID (Rank-1 accuracy and mAP, %).

| Method | Market-1501 R-1 | Market-1501 mAP | DukeMTMC R-1 | DukeMTMC mAP |
|---|---|---|---|---|
| PCB (Sun et al., 2018) | 92.3 | 71.4 | 81.8 | 66.1 |
| MGN (Wang et al., 2018) | 95.7 | 86.9 | 88.7 | 78.4 |
| ISP (Zhu et al., 2020) | 95.3 | 88.6 | 89.6 | 80.0 |
| CDNet (Li et al., 2021b) | 95.1 | 86.0 | 88.6 | 76.8 |
| TransReID* (He et al., 2021) | 95.2 | 88.9 | 90.7 | 82.0 |
| FPR (He et al., 2019) | 95.4 | 86.6 | 88.6 | 78.4 |
| PGFA (Miao et al., 2019) | 91.2 | 76.8 | 82.6 | 65.5 |
| HOReID (Wang et al., 2020) | 94.2 | 84.9 | 86.9 | 75.6 |
| OAMN (Chen et al., 2021b) | 93.2 | 79.8 | 86.3 | 72.6 |
| PAT* (Li et al., 2021a) | 95.4 | 88.0 | 88.8 | 78.2 |
| DPM | 95.5 | 89.7 | 91.0 | 82.6 |
4.2. Comparison with State-of-the-art Methods
Results on Occluded Datasets. To comprehensively demonstrate the performance of DPM, we evaluate it against previously reported state-of-the-art methods on Occluded-Duke and Occluded-REID in Table 1. The compared methods include holistic ReID methods (Sun et al., 2018; Suh et al., 2018; Ge et al., 2018; Zhu et al., 2020; He et al., 2021) and occluded ReID methods (He et al., 2018; Huang et al., 2018; He et al., 2019; Miao et al., 2019; Gao et al., 2020; Wang et al., 2020; Li et al., 2021a; Chen et al., 2021b; Yang et al., 2021). The transformer-based structures (PAT, TransReID) clearly have an advantage in handling occluded cases compared with convolutional neural networks. The DPM inherits this advantage and further outperforms the other transformer-based methods. On the most challenging occluded ReID dataset, Occluded-Duke, the DPM reaches an impressive performance of 71.4% rank-1 accuracy and 61.8% mAP, outperforming the other occluded ReID methods by at least 6.9% in rank-1 and 8.2% in mAP. Meanwhile, on Occluded-REID, the proposed DPM consistently surpasses the current state-of-the-art methods. Specifically, the DPM achieves 85.5% rank-1 accuracy and 79.7% mAP, which improves the rank-1 accuracy by 3.9% and mAP by 7.6% over PAT.
Although it does not rely on an extra network to provide body clues, the DPM still achieves superior performance on the occluded ReID benchmarks.
Results on Holistic Datasets. Although occluded ReID methods mainly focus on solving the occlusion issue, they may suffer a performance decrease on the original holistic ReID task due to incorrect alignment or the discarding of valuable regions. Therefore, we also evaluate the proposed DPM on the holistic ReID datasets Market-1501 and DukeMTMC-reID. For comparison, we select five holistic ReID methods (Sun et al., 2018; Wang et al., 2018; Zhu et al., 2020; Li et al., 2021b; He et al., 2021) and five occluded ReID methods (He et al., 2019; Miao et al., 2019; Wang et al., 2020; Chen et al., 2021b; Li et al., 2021a).
The results are shown in Table 2. On Market-1501, the DPM achieves 95.5% rank-1 accuracy and 89.7% mAP. On DukeMTMC-reID, the DPM achieves 91.0% rank-1 accuracy and 82.6% mAP. It is clear that DPM delivers competitive results on both holistic ReID datasets when compared with state-of-the-art holistic ReID methods, and it outperforms the previous state-of-the-art occluded ReID methods on both datasets. Overall, these results show that DPM is a universal framework: it mainly aims to tackle occluded cases, but it does not sacrifice performance on the general holistic ReID task.
Table 3. Ablation study of the proposed components on Occluded-Duke (%).

| Method | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| baseline | 64.8 | 80.8 | 85.8 | 57.8 |
| +DPM | 70.1 | 82.8 | 86.9 | 59.9 |
| +DPM+HR | 71.0 | 82.9 | 87.2 | 61.0 |
| +DPM+HR+HEM | 71.4 | 83.7 | 87.4 | 61.8 |
Table 4. Comparison of different loss combinations for the classification (Cls) and masked classification (Mask-Cls) branches on Occluded-Duke (%).

| Method | Cls | Mask-Cls | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|---|---|
| baseline | softmax | - | 64.8 | 80.8 | 85.8 | 57.8 |
| baseline | margin | - | 68.6 | 81.2 | 85.9 | 58.5 |
| DPM | softmax | softmax | 64.6 | 80.6 | 85.5 | 57.3 |
| DPM | margin | margin | 68.3 | 80.0 | 84.2 | 57.7 |
| DPM | softmax | margin | 70.1 | 82.8 | 86.9 | 59.9 |
4.3. Ablation Study
To evaluate the influence of the proposed architectural components, we conduct a series of experiments on Occluded-Duke with different settings and report the quantitative results in Table 3. The baseline uses the ViT as the backbone and is trained with the original softmax loss $\mathcal{L}_{cls}$ and the triplet loss $\mathcal{L}_{tri}$.
Compared with the baseline, adding the DPM strategy substantially improves the performance, by 5.3% in rank-1 accuracy and 2.1% in mAP. After adding the hierarchical structure to the mask generator, the performance further increases from 70.1% and 59.9% to 71.0% and 61.0% in rank-1 accuracy and mAP, respectively. This also demonstrates that the image representation provided by the last transformer block alone, which lacks sufficient diversity, limits the performance of the mask generator. Meanwhile, the HEM provides additional gains of 0.4% in rank-1 and 0.8% in mAP on top of the above results. These experiments indicate that every component fulfills its motivation in the DPM, and all of them contribute consistently to an effective framework and an impressive final performance.

Table 5. Comparison of different types of DPM on Occluded-Duke (%).

| Setting | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| ✓ | 71.4 | 83.7 | 87.4 | 61.8 |
| ✓ ✓ | 69.5 | 82.3 | 86.8 | 58.5 |
| ✓ | 70.2 | 84.0 | 87.4 | 60.8 |
| ✓ ✓ | 69.7 | 81.6 | 85.1 | 58.3 |
4.4. Discussions
The classification loss function for DPM. Since the mask generator is the most important module, how to train it efficiently is one of the main challenges in achieving the DPM. In Section 3.4, we mentioned that we add an extra angular margin when training the masked branch to avoid inefficient optimization of that branch. Therefore, in this part, we compare the performance of different loss combinations for the classification branch and the masked branch. The results are shown in Table 4. Here, we use "softmax" to denote a branch trained with the original softmax loss and "margin" to denote a branch trained with the extra angular margin. The two baselines are trained without the masked branch.
From Table 4, we observe that adding an extra angular margin improves the rank-1 accuracy, since it forces the network to pay more attention to outlier samples. However, this strategy does not bring a comparable increase in mAP, which indicates that the whole distribution may not be improved. Meanwhile, as mentioned before, adding the mask to select a subspace further decreases the scale of the loss: the mask generator only needs to output the same score for all channels and the loss scale will still be satisfying. Therefore, as shown in Table 4, when the same loss function is used in both branches, the optimization of the mask generator is limited and the whole network degrades to optimizing the classification branch. After adding the extra angular margin to the masked branch while keeping the original classification branch, the mask generator is encouraged to adopt a more radical optimization, which alleviates the above dilemma. As shown in the last row of Table 4, this strategy obtains a significant improvement in both rank-1 accuracy and mAP.

Effect of head enrich module. The HEM aims to enrich the feature representations of the multiple heads in the class token and thus overcome the tendency of different heads to produce similar representations. The HEM helps not only the training of a holistic prototype with more nuances but also the training of the mask generator to select a suitable subspace. In Figure 3 (a), we visualized the cross-correlation matrix between the attention maps of the multiple heads to show the limitation of the original transformer block. In Figure 4, we visualize the cross-correlation matrix again with the same input after adding the HEM.
By adding the HEM, which explicitly encourages each head to have a different attention map, we observe that the cross-correlation between the attention maps of the multiple heads decreases significantly, as shown in Figure 4. This indicates that the class token receives more diverse pattern information from the different image patches. Overall, the visualization demonstrates the effectiveness of the HEM in this application.
Impact of the hyper-parameters $\lambda_1$ and $\lambda_2$. As indicated by the loss function in Eq. 8, we set two hyper-parameters $\lambda_1$ and $\lambda_2$ to balance the weights of the different components in the overall loss. Specifically, $\lambda_1$ controls the trade-off between the holistic prototype matrix and the prototype mask, while $\lambda_2$ controls the correlation between different heads. Hence, in this part, we conduct empirical experiments to measure the performance of the model under different hyper-parameter settings. When discussing $\lambda_1$, we select the DPM with HMG as the baseline. As mentioned before, the performance is sensitive to $\lambda_1$: a small $\lambda_1$ limits the ability to learn a holistic prototype matrix, while a large $\lambda_1$ cannot provide enough drive to learn a high-quality prototype mask. As shown in Figure 5, the rank-1 accuracy and mAP increase with $\lambda_1$ until the performance reaches its peak, after which a larger $\lambda_1$ decreases the performance.
Based on this model, we further discuss the influence of $\lambda_2$. As shown in Figure 5, after adding the HEM into training, the rank-1 accuracy and mAP are also improved for moderate values of $\lambda_2$, with the best rank-1 accuracy and the best mAP reached at slightly different settings. Considering the overall performance, we fix $\lambda_2$ for the remaining experiments.
Different types of DPM. In DPM, we apply the mask to the holistic prototype matrix, while attention-based strategies explore a spatial attention mask that acts on the input image itself to alleviate the noise caused by obstacles. Therefore, in this part, we also discuss whether the mask should additionally be applied to the feature representation. Meanwhile, we evaluate the influence of L2 normalization in the DPM. Herein, we select the complete DPM as the baseline for comparison and give an empirical analysis on Occluded-Duke. The experimental results are shown in Table 5.
Benefiting from the great ability of the transformer structure to provide an effective representation for the visible parts, applying the mask to the feature representation does not bring an extra performance gain in either setting. On the other hand, under both settings, applying the mask to the prototype before the L2 normalization works better than applying it after the normalization. Thus, we select the model with only the prototype mask, applied before the L2 normalization, as the final model.
5. Conclusion
In this paper, we address occluded person re-identification with a novel dynamic prototype mask (DPM). The DPM takes advantage of prototype classification and transforms the alignment problem in occluded retrieval into a subspace selection task. This strategy not only gets rid of the extra pre-trained networks that provide body clues but also simultaneously retains the global information from the whole image and achieves automatic alignment. Meanwhile, based on observations of the original DPM framework, we further devise a Hierarchical Mask Generator (HMG) together with a Head Enrich Module (HEM) to fully exploit the potential of DPM. Finally, extensive experiments on occluded and holistic datasets demonstrate the superior performance of DPM.
Acknowledgments
This work was supported by the National Science Fund for Distinguished Young Scholars (No.62025603), the National Natural Science Foundation of China (No.U1705262, No.62176222, No.62176223, No.62176226, No.62072386, No.62072387, No.62072389, No.62002305, No.61772443, No.61802324 and No.61702136), Guangdong Basic and Applied Basic Research Foundation (No.2019B1515120049), the Natural Science Foundation of Fujian Province of China (No.2021J01002), and the Fundamental Research Funds for the Central Universities (No.20720200077, No.20720200090 and No.20720200091).
References
- Chen et al. (2019) Binghui Chen, Weihong Deng, and Jiani Hu. 2019. Mixed High-Order Attention Network for Person Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Chen et al. (2017) Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. 2017. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5659–5667.
- Chen et al. (2021b) Peixian Chen, Wenfeng Liu, Pingyang Dai, Jianzhuang Liu, Qixiang Ye, Mingliang Xu, Qi’an Chen, and Rongrong Ji. 2021b. Occlude them all: Occlusion-aware attention network for occluded person re-id. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11833–11842.
- Chen et al. (2021a) Zhiwei Chen, Liujuan Cao, Yunhang Shen, Feihong Lian, Yongjian Wu, and Rongrong Ji. 2021a. E2Net: Excitative-expansile learning for weakly supervised object Localization. In Proceedings of the 29th ACM International Conference on Multimedia. 573–581.
- Chen et al. (2022) Zhiwei Chen, Changan Wang, Yabiao Wang, Guannan Jiang, Yunhang Shen, Ying Tai, Chengjie Wang, Wei Zhang, and Liujuan Cao. 2022. Lctr: On awakening the local continuity of transformer for weakly supervised object localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 410–418.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the CVPR. 248–255.
- Deng et al. (2019) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4690–4699.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
- Eom and Ham (2019) Chanho Eom and Bumsub Ham. 2019. Learning disentangled representation for robust person re-identification. In Proceedings of the NeurIPS. 5297–5308.
- Gao et al. (2020) Shang Gao, Jingya Wang, Huchuan Lu, and Zimo Liu. 2020. Pose-guided visible part matching for occluded person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11744–11752.
- Ge et al. (2018) Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, et al. 2018. Fd-gan: Pose-guided feature distilling gan for robust person re-identification. Advances in neural information processing systems 31 (2018).
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
- He et al. (2018) Lingxiao He, Jian Liang, Haiqing Li, and Zhenan Sun. 2018. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7073–7082.
- He et al. (2019) Lingxiao He, Yinggang Wang, Wu Liu, He Zhao, Zhenan Sun, and Jiashi Feng. 2019. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision. 8450–8459.
- He et al. (2021) Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. 2021. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15013–15022.
- Hou et al. (2021) Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. 2021. Feature completion for occluded person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
- Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
- Huang et al. (2018) Houjing Huang, Dangwei Li, Zhang Zhang, Xiaotang Chen, and Kaiqi Huang. 2018. Adversarially occluded samples for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5098–5107.
- Koestinger et al. (2012) Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. 2012. Large scale metric learning from equivalence constraints. In 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2288–2295.
- Li et al. (2021b) Hanjun Li, Gaojie Wu, and Wei-Shi Zheng. 2021b. Combined depth space based architecture search for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6729–6738.
- Li et al. (2021a) Yulin Li, Jianfeng He, Tianzhu Zhang, Xiang Liu, Yongdong Zhang, and Feng Wu. 2021a. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2898–2907.
- Liao et al. (2015) Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. 2015. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2197–2206.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
- Luo et al. (2022) Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2022. Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks. IEEE Transactions on Image Processing 31 (2022), 3386–3398.
- Luo et al. (2019) Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 0–0.
- Ma et al. (2014) Bingpeng Ma, Yu Su, and Frederic Jurie. 2014. Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing 32, 6-7 (2014), 379–390.
- Miao et al. (2019) Jiaxu Miao, Yu Wu, Ping Liu, Yuhang Ding, and Yi Yang. 2019. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision. 542–551.
- Mou et al. (2020) Yongqiang Mou, Lei Tan, Hui Yang, Jingying Chen, Leyuan Liu, Rui Yan, and Yaohong Huang. 2020. Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In European Conference on Computer Vision. Springer, 158–174.
- Peng et al. (2021) Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2021. Knowledge-Driven Generative Adversarial Network for Text-to-Image Synthesis. IEEE Transactions on Multimedia (2021).
- Qian et al. (2018) Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. 2018. Pose-normalized image generation for person re-identification. In Proceedings of the ECCV. 650–667.
- Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the CVPR. 815–823.
- Suh et al. (2018) Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. 2018. Part-aligned bilinear representations for person re-identification. In Proceedings of the European conference on computer vision (ECCV). 402–419.
- Sun et al. (2018) Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the ECCV. 480–496.
- Wang et al. (2020) Guan’an Wang, Shuo Yang, Huanyu Liu, Zhicheng Wang, Yang Yang, Shuliang Wang, Gang Yu, Erjin Zhou, and Jian Sun. 2020. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6449–6458.
- Wang et al. (2018) Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. 2018. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the ACM MM. 274–282.
- Wei et al. (2018) Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. 2018. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the CVPR. 79–88.
- Yang et al. (2021) Jinrui Yang, Jiawei Zhang, Fufu Yu, Xinyang Jiang, Mengdan Zhang, Xing Sun, Ying-Cong Chen, and Wei-Shi Zheng. 2021. Learning To Know Where To See: A Visibility-Aware Approach for Occluded Person Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11885–11894.
- Yang et al. (2014) Yang Yang, Jimei Yang, Junjie Yan, Shengcai Liao, Dong Yi, and Stan Z Li. 2014. Salient color names for person re-identification. In European conference on computer vision. Springer, 536–551.
- Ye et al. (2021a) Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. 2021a. Channel Augmented Joint Learning for Visible-Infrared Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13567–13576.
- Ye et al. (2021b) Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. 2021b. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
- Zhai et al. (2020) Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, and Yonghong Tian. 2020. AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-Identification. In Proceedings of the CVPR.
- Zhang et al. (2021) Yukang Zhang, Yan Yan, Yang Lu, and Hanzi Wang. 2021. Towards a Unified Middle Modality Learning for Visible-Infrared Person Re-Identification. In Proceedings of the 29th ACM International Conference on Multimedia. 788–796.
- Zheng et al. (2019a) Feng Zheng, Cheng Deng, Xing Sun, Xinyang Jiang, Xiaowei Guo, Zongqiao Yu, Feiyue Huang, and Rongrong Ji. 2019a. Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training. In Proceedings of the CVPR. 8514–8522.
- Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision. 1116–1124.
- Zheng et al. (2012) Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. 2012. Reidentification by relative distance comparison. IEEE transactions on pattern analysis and machine intelligence 35, 3 (2012), 653–668.
- Zheng et al. (2019b) Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. 2019b. Joint discriminative and generative learning for person re-identification. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2138–2147.
- Zheng et al. (2017) Zhedong Zheng, Liang Zheng, and Yi Yang. 2017. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision. 3754–3762.
- Zhong et al. (2020) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random Erasing Data Augmentation.. In Proceedings of the AAAI.
- Zhu et al. (2020) Kuan Zhu, Haiyun Guo, Zhiwei Liu, Ming Tang, and Jinqiao Wang. 2020. Identity-guided human semantic parsing for person re-identification. In European Conference on Computer Vision. Springer, 346–363.
- Zhuo et al. (2018) Jiaxuan Zhuo, Zeyu Chen, Jianhuang Lai, and Guangcong Wang. 2018. Occluded person re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.