
Dynamic Feature Pruning and Consolidation for Occluded Person Re-Identification

Yuteng Ye1, Hang Zhou1, Jiale Cai1, Chenxing Gao1, Youjia Zhang1, Junle Wang2, Qiang Hu3, Junqing Yu1, Wei Yang1 (corresponding author).
Abstract

Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, such as human body key points and semantic segmentations, which easily fail in the presence of heavy occlusion and other humans as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework to circumvent explicit human structure parsing. The framework mainly consists of a sparse encoder, a multi-view feature matching module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens, mostly related to background noise and occluders, based solely on correlations within the class token attention. Subsequently, the matching stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors in the gallery by measuring a combined image- and patch-level similarity. Finally, we use the feature consolidation module to compensate pruned features with the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic ReID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.

1 Introduction

Person Re-Identification (ReID) refers to the process of retrieving the same person from a gallery set under non-overlapping surveillance cameras (Chen et al. 2017; Ye et al. 2021), and has made remarkable progress in tackling appearance change in the deep learning era (Wu et al. 2019; Lavi, Serj, and Ullah 2018). However, the re-identification of occluded persons remains a challenging problem for two reasons: 1. interference from wrongly included occluder features and 2. the partial or full absence of essential target features. To tackle occlusions, many existing approaches explicitly recover human semantics, via human pose estimation (Miao et al. 2019; Gao et al. 2020; Wang et al. 2022a) or body segmentation (Huang, Chen, and Huang 2020), as extra supervision to guide the network to focus on non-occluded features. Others (Yu et al. 2021; Xu et al. 2022) first partition the input image into horizontal or vertical parts, then identify the occlusion status of each part with off-the-shelf models (Mask-RCNN (He et al. 2017), HR-Net (Sun et al. 2019)), and finally recover occluded features from k-nearest neighbors in a gallery according to non-occluded features. Both strategies rely on extra modules for detecting occlusions and easily fail in the presence of heavy occlusion and persons as occluders; moreover, background noise persists because the partition is usually very coarse.

Inspired by recent advances in sparse encoders (Rao et al. 2021; Liang et al. 2022), we propose a feature pruning, matching, and consolidation (FPC) framework for occluded person re-identification that adaptively removes interference from occluders and background and consolidates the contaminated features. First, we send the query image into a modified transformer encoder with token sparsification to drop interfering tokens (usually related to occluders and background) while preserving attentive tokens. Unlike extra-cue-based approaches that rely on prior information about human semantics, the sparse encoder exploits correlation properties of attention maps and generalizes better to various occlusion situations; as an extra benefit, it also removes interference from the background. Then, we rank the full tokens in the gallery memory according to their similarity with the query image. The gallery memory, containing $[\mathrm{cls}]$ tokens and patch tokens, is obtained by pre-training a vision transformer encoder. The similarity metric for matching is defined as a linear combination of image-level cosine distance and patch-level earth mover's distance (Rubner, Tomasi, and Guibas 2000), which bridges the domain gap between the sparse query feature and the holistic features in the gallery memory. Finally, we select the k-nearest neighbors for each query and construct multi-view features by concatenating the averaged $[\mathrm{cls}]$ token and the respective patch tokens of the query and its selected neighbors. We send the multi-view feature into a transformer decoder for feature consolidation, compensating the occluded query feature with gallery neighbors to achieve better performance. During training, our method utilizes the entire training set as the gallery, a common practice in the ReID literature (Xu et al. 2022; Yu et al. 2021); for inference, only the test-set query and gallery images are used without any pre-filtering steps. Our FPC framework achieves state-of-the-art performance on occluded, partial, and holistic person ReID datasets. In particular, FPC outperforms the state of the art by 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset.

Our main contributions are as follows:

  • We introduce the token sparsification mechanism to the occluded person ReID problem, the first to the best of our knowledge, which avoids explicit use of human semantics and better prunes unrelated features.

  • We propose feature matching and consolidation modules to recover occluded query features from multi-view gallery neighbors. Compared to feature division approaches (Yu et al. 2021; Xu et al. 2022), our modules use transformer tokens that naturally preserve the connectivity and richness of the features.

  • We design a novel token-based metric to measure the similarity of images by linearly combining the patch-level distance and the image-level distance.

2 Related Work

Our method is closely related to occluded person ReID and feature pruning in vision transformers.

Occluded Person ReID. The task of occluded person ReID is to find the same person under different cameras when the target pedestrian is obscured. Current methods mainly fall into two categories, i.e., extra-clue-based methods and feature-reconstruction-based methods. Extra-clue-based methods locate non-occluded areas of the human body using prior knowledge cues, e.g., human pose (Miao et al. 2019; Gao et al. 2020; Wang et al. 2020a, 2022a; Miao, Wu, and Yang 2021) and human segments (Huang, Chen, and Huang 2020; Cai, Wang, and Cheng 2019). The other line of work is based on feature reconstruction. Hou et al. (Hou et al. 2021) locate occluded human parts by key-points and propose Region Feature Completion (RFC) to recover the semantics of occluded regions in feature space. Xu et al. (Xu et al. 2022) extract occluded features with pose information and propose the Feature Recovery Transformer (FRT) to recover the occluded features using the visible semantic features of the k-nearest gallery neighbors. In contrast, our approach adaptively removes interference from occluders and background according to correlations within the class token attention, which generalizes better to various occlusion situations.

Figure 1: Overview of the proposed framework, which consists of a sparse encoder $\mathcal{S}$, a multi-view feature matching module $\mathcal{M}$, and a feature consolidation module $\mathcal{C}$. The sparse encoder $\mathcal{S}$ removes interfering tokens while preserving attentive tokens. In the matching module $\mathcal{M}$, we generate a rank list between the sparse query feature and the holistic features in a gallery memory pretrained with a vision transformer, using the sum of the image-level cosine distance and the patch-level earth mover's distance as the ranking metric. In the consolidation module $\mathcal{C}$, we select the k-nearest neighbors for the query and construct multi-view features by concatenating the averaged $[\mathrm{cls}]$ token and the respective patch tokens of the query and its selected neighbors. The multi-view features are sent to the transformer decoder for feature consolidation.

Transformer Sparsification. Techniques for accelerating vision transformer models are necessary under limited computing resources. Most methods simplify the model structure with efficient attention mechanisms (Wang et al. 2020b; Kitaev, Kaiser, and Levskaya 2020; Zhang et al. 2022; Zhu et al. 2021) or compact structures (Liu et al. 2021; Touvron et al. 2021; Wu et al. 2021; Graham et al. 2021). Many works (Kong et al. 2022; Rao et al. 2021; Liang et al. 2022) also focus on effective token learning to reduce redundancy. However, these approaches have not explored applying model-acceleration techniques to the occluded person ReID problem. Since occluded person ReID involves various occlusions and background noise, we observe that directly applying the transformer model is sub-optimal. Following (Liang et al. 2022), our approach prunes ineffective tokens by exploiting the sparsity of informative image patches.

Figure 2: The structure of our sparse encoder. We divide and embed the image tokens using linear projection, positional encoding, and camera ID encoding. We perform token sparsification in layers 3, 6, and 9.

3 Method

Our feature pruning and consolidation framework is illustrated in fig. 1. The framework consists of (1) a sparse encoder $\mathcal{S}$ that conducts token sparsification to prune interfering tokens and preserve attentive tokens; (2) a multi-view feature matching module $\mathcal{M}$ that generates a rank list between the sparse query feature and a pre-trained gallery memory using the combined image- and patch-level similarity; and (3) a feature consolidation decoder $\mathcal{C}$ that utilizes the complete information of the identified neighbors to compensate the pruned query features.

3.1 Sparse Encoder

Inspired by advances in feature pruning (Liang et al. 2022) and person ReID (He et al. 2021), we use a transformer with token sparsification to prune interference from occlusion and background. As shown in fig. 2, given an input image $x\in\mathbb{R}^{H\times W\times C}$, where $W,H,C$ denote the width, height, and channel of the image respectively, we split $x$ into $N$ overlapping patches $\{p_{1},p_{2},\cdots,p_{N}\}$ and embed each patch with a linear projection denoted as $f(\cdot)$. We then combine a learnable $[\mathrm{cls}]$ token with the patch embeddings and apply positional encoding and camera index encoding following (He et al. 2021). The final input can be described as:

\mathcal{Z}=\left\{x_{cls},f\left(p_{1}\right),\cdots,f\left(p_{N}\right)\right\}+\mathcal{P}+\mathcal{C}_{id}   (1)

where $x_{cls}\in\mathbb{R}^{1\times D}$ is the learnable $[\mathrm{cls}]$ token, $f(p_{i})\in\mathbb{R}^{1\times D}$ is the $i$-th patch embedding, $\mathcal{P}\in\mathbb{R}^{(N+1)\times D}$ is the position embedding, and $\mathcal{C}_{id}\in\mathbb{R}^{(N+1)\times D}$ is the camera index embedding.
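
The following is a minimal PyTorch-style sketch of the input construction in eq. 1. The module name, the patch-extraction interface, and the default dimensions are illustrative assumptions, and a single per-camera vector is broadcast over all tokens as a simplification of $\mathcal{C}_{id}$.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Sketch of Eq. (1): patch embedding + [cls] token + position and camera encodings."""
    def __init__(self, num_patches=128, patch_dim=16 * 16 * 3, dim=768, num_cameras=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                          # linear projection f(.)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                # learnable [cls] token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # position embedding P
        self.cam = nn.Parameter(torch.zeros(num_cameras, 1, dim))      # camera embedding C_id

    def forward(self, patches, cam_id):
        # patches: (B, N, patch_dim) flattened overlapping patches; cam_id: (B,) camera indices
        b = patches.size(0)
        tokens = self.proj(patches)             # (B, N, D)
        cls = self.cls.expand(b, -1, -1)        # (B, 1, D)
        z = torch.cat([cls, tokens], dim=1)     # (B, N+1, D)
        return z + self.pos + self.cam[cam_id]  # Z in Eq. (1)
```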

Token Sparsification. We adopt the token sparsification strategy proposed in (Liang et al. 2022). Specifically, through the attention correlation (Vaswani et al. 2017) between the $[\mathrm{cls}]$ token and the other tokens in the vision transformer, the value of the $[\mathrm{cls}]$ token can be expressed as:

x_{cls}=A_{cls}V=\mathrm{softmax}\left(\frac{Q_{cls}K}{\sqrt{d}}\right)V   (2)

where $A_{cls}$ denotes the attention matrix of the $[\mathrm{cls}]$ token, i.e., the first row of the attention matrix, and $\sqrt{d}$ is the scale factor. $Q_{cls},K,V$ represent the query matrix of the $[\mathrm{cls}]$ token, the key matrix, and the value matrix respectively. For multiple heads in the self-attention layer, we average the attention matrices as $\bar{A}_{cls}=\sum_{i}^{n}\frac{1}{n}\cdot A_{cls}^{(i)}$, where $n$ is the total number of heads. Since the $[\mathrm{cls}]$ token attains larger attention values in significant patch regions (Caron et al. 2021), we can evaluate the importance of a token according to its relevance to the $[\mathrm{cls}]$ token. As $\bar{A}_{cls}$ represents the correlation between the $[\mathrm{cls}]$ token and all other tokens, we preserve the tokens with the $\mathcal{K}$ largest values in $\bar{A}_{cls}$ and drop the others, as shown in fig. 2. We define $\mathcal{K}=\left\lceil\gamma\cdot N_{c}\right\rceil$, where $\gamma$ is the keep rate, $N_{c}$ is the total number of tokens in the current layer, and $\left\lceil\cdot\right\rceil$ is the ceiling operation. With token sparsification, the preserved tokens are mostly related to the region of the target pedestrian, and the dropped tokens are related to occluders or backgrounds.
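
A minimal sketch of this pruning step is given below, assuming the averaged class-token attention and the token sequence of one encoder layer are available; here $N_{c}$ is taken as the number of patch tokens, which is a slight simplification.

```python
import math
import torch

def prune_tokens(tokens, attn, gamma=0.8):
    """Keep the [cls] token plus the K = ceil(gamma * N_c) patch tokens it attends to most.

    tokens: (B, N+1, D) token sequence of the current layer (index 0 is [cls]).
    attn:   (B, heads, N+1, N+1) self-attention weights of the same layer.
    """
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)        # A_cls averaged over heads, (B, N)
    n_keep = math.ceil(gamma * cls_attn.size(1))    # K = ceil(gamma * N_c)
    idx = cls_attn.topk(n_keep, dim=1).indices      # indices of the most attended patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    kept = torch.gather(tokens[:, 1:], 1, idx)      # gather the preserved patch tokens
    return torch.cat([tokens[:, :1], kept], dim=1)  # the [cls] token is always kept
```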

Sparse Encoder Supervision Loss. We use the cross-entropy ID loss and the triplet loss to supervise the $[\mathrm{cls}]$ token $x_{cls}^{\mathcal{S}}$ obtained by the sparse encoder:

\mathcal{L}_{S}=\mathcal{L}_{ID}(x_{cls}^{\mathcal{S}})+\mathcal{L}_{T}(x_{cls}^{\mathcal{S}})   (3)

where $\mathcal{L}_{ID}$ denotes the cross-entropy ID loss and $\mathcal{L}_{T}$ denotes the triplet loss. Compared with existing approaches, our feature pruning with token sparsification is adaptive, does not rely on prior knowledge of human semantics, and can better handle challenging scenarios such as heavy occlusion and other persons as occluders, as shown in fig. 6.
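
A rough sketch of this supervision (eq. 3), assuming a linear ID classifier and a standard batch-hard triplet loss with an illustrative margin; neither the classifier nor the margin value is specified in the text.

```python
import torch
import torch.nn as nn

def sparse_encoder_loss(cls_token, labels, classifier, margin=0.3):
    """L_S = L_ID + L_T on the [cls] token of the sparse encoder (Eq. (3))."""
    id_loss = nn.functional.cross_entropy(classifier(cls_token), labels)      # L_ID
    dist = torch.cdist(cls_token, cls_token)                                  # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values                     # hardest positive
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values               # hardest negative
    tri_loss = nn.functional.relu(hardest_pos - hardest_neg + margin).mean()  # batch-hard L_T
    return id_loss + tri_loss
```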

3.2 Multi-view Feature Matching Module

After feature pruning, we would like to find the most related non-occluded patches from other views for consolidation; this strategy has proven effective (Xu et al. 2022; Yu et al. 2021). Specifically, we first learn a gallery memory with the pre-trained encoder. With the pruned query image features, we rank the images in the gallery memory according to their similarity with the query. Considering the appearance gap between the pruned query feature and the holistic features in the gallery memory, we measure similarity at both the image level and the patch level. As shown in fig. 1, we linearly combine the image-level and patch-level distances to match the query image with gallery memory images.

Image-level distance. We define the image-level distance based on the cosine similarity between the $[\mathrm{cls}]$ tokens of the query and the gallery memory as follows:

\mathcal{D}_{COS}=1-\frac{\langle x_{cls}^{(i)},x_{cls}^{(j)}\rangle}{|x_{cls}^{(i)}|\cdot|x_{cls}^{(j)}|}   (4)

where $x_{cls}^{(i)}$ and $x_{cls}^{(j)}$ are the $[\mathrm{cls}]$ tokens of the $i$-th query image and the $j$-th image in the gallery memory, and $\langle\cdot,\cdot\rangle$ is the dot product.

Patch-level distance. We leverage the Earth Mover's Distance (EMD) (Rubner, Tomasi, and Guibas 2000) to measure patch-level similarity. EMD is usually employed to measure the similarity between two multidimensional distributions and is formulated as the following linear programming problem. Let $\mathcal{Q}=\left\{(q_{1},w_{q_{1}}),\dots,(q_{m},w_{q_{m}})\right\}$ be the set of patch tokens from the query image, where $q_{i}$ is the $i$-th patch token and $w_{q_{i}}$ is the weight of $q_{i}$. Similarly, $\mathcal{G}=\left\{(g_{1},w_{g_{1}}),\dots,(g_{n},w_{g_{n}})\right\}$ represents the set of patch tokens of a gallery image. The weight $w_{q_{i}}$ is defined as the proportional correlation weight (Phan and Nguyen 2022) of $q_{i}$ to the set $\{g_{1},g_{2},\dots,g_{n}\}$, i.e., $\max(0,\langle q_{i},\frac{1}{n}\sum_{j}^{n}g_{j}\rangle)$, and $w_{g_{j}}$ equals $\max(0,\langle g_{j},\frac{1}{m}\sum_{i}^{m}q_{i}\rangle)$. The ground distance $d_{ij}$ is the cosine distance between $q_{i}$ and $g_{j}$. The objective is to find the flow $\mathcal{F}$, where $f_{ij}$ indicates the flow between $q_{i}$ and $g_{j}$, that minimizes the following cost:

\begin{split}\mathcal{F}^{*}&=\operatorname*{arg\,min}_{\mathcal{F}}\mathrm{COST}(\mathcal{Q},\mathcal{G},\mathcal{F})\\ &=\operatorname*{arg\,min}_{\mathcal{F}}\left(\sum_{i=1}^{m}\sum_{j=1}^{n}f_{ij}d_{ij}\right)\end{split}   (5)

subject to the following constraints:

f_{ij}\geq 0,\ 1\leq i\leq m,\ 1\leq j\leq n   (6)
\sum_{i=1}^{m}f_{ij}\leq w_{g_{j}},\ 1\leq j\leq n   (7)
\sum_{j=1}^{n}f_{ij}\leq w_{q_{i}},\ 1\leq i\leq m   (8)
\sum_{i=1}^{m}\sum_{j=1}^{n}f_{ij}=\min\left(\sum_{i=1}^{m}w_{q_{i}},\sum_{j=1}^{n}w_{g_{j}}\right)   (9)

Eq. 5 can be solved by the iterative Sinkhorn algorithm (Cuturi 2013), which produces the earth mover's distance $\mathcal{D}_{EMD}$. Intuitively, $\mathcal{D}_{COS}$ represents a global distance, while $\mathcal{D}_{EMD}$ measures similarity from the perspective of the sets of local features. Naturally, we take the linear combination of the image-level cosine distance and the patch-level EMD as our final distance for feature matching, expressed as follows:

\mathcal{D}=(1-\alpha)\mathcal{D}_{COS}+\alpha\mathcal{D}_{EMD}   (10)

With $\mathcal{D}$, we rank the holistic features in the gallery memory against the sparse query feature and generate a ranking list. To expedite feature matching, we use an efficient two-stage selection strategy that reduces the time cost by a factor of 138.
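
The sketch below illustrates the combined distance of eqs. 4-10, using an entropy-regularized Sinkhorn approximation of EMD with normalized weights; the regularization strength, iteration count, and weight normalization are simplifying assumptions rather than choices stated in the paper.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # image-level distance of Eq. (4) between [cls] tokens
    return 1.0 - (F.normalize(a, dim=-1) * F.normalize(b, dim=-1)).sum(-1)

def emd_distance(q, g, eps=0.05, iters=50):
    # q: (m, D) query patch tokens, g: (n, D) gallery patch tokens
    qn, gn = F.normalize(q, dim=-1), F.normalize(g, dim=-1)
    cost = 1.0 - qn @ gn.t()                                            # ground distance d_ij
    w_q = torch.clamp(qn @ gn.mean(0, keepdim=True).t(), min=0).squeeze(-1) + 1e-6
    w_g = torch.clamp(gn @ qn.mean(0, keepdim=True).t(), min=0).squeeze(-1) + 1e-6
    w_q, w_g = w_q / w_q.sum(), w_g / w_g.sum()                         # normalized weights
    kernel = torch.exp(-cost / eps)                                     # entropic kernel
    u = torch.ones_like(w_q)
    for _ in range(iters):                                              # Sinkhorn iterations
        u = w_q / (kernel @ (w_g / (kernel.t() @ u)))
    v = w_g / (kernel.t() @ u)
    flow = u.unsqueeze(1) * kernel * v.unsqueeze(0)                     # approximate flow F*
    return (flow * cost).sum()                                          # D_EMD

def combined_distance(q_cls, g_cls, q_patches, g_patches, alpha=0.4):
    # D = (1 - alpha) * D_COS + alpha * D_EMD, Eq. (10)
    return (1 - alpha) * cosine_distance(q_cls, g_cls) + alpha * emd_distance(q_patches, g_patches)
```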

3.3 Feature Consolidation Decoder

After retrieving the $K$ nearest gallery neighbors, we consolidate the pruned query feature from the multi-view observations in the gallery memory. Specifically, we average the $[\mathrm{cls}]$ tokens of the query and its gallery neighbors as the initial global feature. We then combine the averaged $[\mathrm{cls}]$ token with the patch tokens of both the query and the gallery neighbors to aggregate multi-view pedestrian information. The multi-view feature is constructed as follows:

f_{m}=\left\{\bar{f_{c}},f_{p_{q}},f_{p_{g}}^{(1)},f_{p_{g}}^{(2)},\dots,f_{p_{g}}^{(K)}\right\}   (11)

where $\bar{f_{c}}\in\mathbb{R}^{1\times C}$ is the average of the $[\mathrm{cls}]$ tokens of the query and its neighbors, $f_{p_{q}}\in\mathbb{R}^{M\times C}$ is the patch tokens of the query with $M$ the number of patch tokens, and $f^{(i)}_{p_{g}}\in\mathbb{R}^{N\times C}$ is the patch tokens of the $i$-th gallery neighbor with $N$ the number of patch tokens per gallery image.
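
A minimal sketch of this construction (eq. 11); the tensor shapes follow the notation above, while the function signature itself is an assumption.

```python
import torch

def build_multiview_feature(q_cls, q_patches, neighbor_cls, neighbor_patches):
    # q_cls: (1, C), q_patches: (M, C), neighbor_cls: (K, C), neighbor_patches: list of K (N, C)
    avg_cls = torch.cat([q_cls, neighbor_cls], dim=0).mean(dim=0, keepdim=True)  # averaged [cls]
    return torch.cat([avg_cls, q_patches, *neighbor_patches], dim=0)             # f_m of Eq. (11)
```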

Transformer Decoder. Although the $[\mathrm{cls}]$ token serves as the class prediction in a transformer (Dosovitskiy et al. 2020) and is conventionally treated as the global feature for image representation, the query $[\mathrm{cls}]$ token is usually contaminated in the occluded person ReID problem. We therefore send the consolidated multi-view feature to a transformer decoder to compensate the incomplete $[\mathrm{cls}]$ token with gallery neighbors. In the decoder, we first transform the multi-view feature $f_{m}$ into $Q,K,V$ vectors using linear projections:

Q=W_{q}\cdot f_{m},\ K=W_{k}\cdot f_{m},\ V=W_{v}\cdot f_{m}   (12)

where $W_{q},W_{k},W_{v}$ are the weights of the linear projections. As the multi-view feature contains three parts, the $[\mathrm{cls}]$ token, the patch tokens of the query, and the patch tokens of the gallery neighbors, the attention with respect to the $[\mathrm{cls}]$ token can be decomposed into these three parts:

\begin{split}x_{cls}&=\mathrm{Cat}(A^{\prime}_{c},A^{\prime}_{q},A^{\prime}_{g})\cdot\mathrm{Cat}(V_{c},V_{q},V_{g})\\ &=A^{\prime}_{c}\cdot V_{c}+A^{\prime}_{q}\cdot V_{q}+A^{\prime}_{g}\cdot V_{g}\end{split}   (13)

where $\mathrm{Cat}(\cdot)$ is the vector concatenation operation, $A^{\prime}$ denotes the attention matrix of the $[\mathrm{cls}]$ token, and $V$ is the value vector. The subscripts $c,q,g$ correspond to the $[\mathrm{cls}]$ token, the query, and the gallery neighbors respectively. $A^{\prime}_{c}\cdot V_{c}+A^{\prime}_{q}\cdot V_{q}$ acts as the feature learning process on the sparse query, and $A^{\prime}_{g}\cdot V_{g}$ integrates completion information from the gallery neighbors into the $[\mathrm{cls}]$ token. The final consolidated $[\mathrm{cls}]$ token generated by the transformer decoder is denoted as $x_{cls}^{\mathcal{C}}=\tau(f_{m})$. We find that one transformer layer is sufficient, which also avoids the high memory consumption of neighborhood-based reconstruction (Yu et al. 2021).
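
A rough sketch of the consolidation step, approximating the one-layer decoder with a standard self-attention layer over the concatenated multi-view sequence; the layer hyperparameters are illustrative, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ConsolidationDecoder(nn.Module):
    """One self-attention layer over the multi-view feature; returns the consolidated [cls] token."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, f_m):
        # f_m: (B, 1 + M + K*N, dim) multi-view feature built from Eq. (11)
        out = self.layer(f_m)    # attention mixes query tokens with gallery-neighbor tokens
        return out[:, 0]         # consolidated [cls] token x_cls^C
```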

Consolidation Loss. The $[\mathrm{cls}]$ token $x_{cls}^{\mathcal{C}}$ obtained by the feature consolidation decoder is supervised with a cross-entropy ID loss and a triplet loss:

\mathcal{L}_{C}=\mathcal{L}_{ID}(x_{cls}^{\mathcal{C}})+\mathcal{L}_{T}(x_{cls}^{\mathcal{C}})   (14)

Therefore, the final loss function is:

\mathcal{L}=\mathcal{L}_{S}+\mathcal{L}_{C}   (15)

3.4 Implementation Details

We choose ViT-B/16 as the backbone for both the sparse encoder and the gallery encoder. Our sparse encoder incorporates several modifications into the backbone, including camera encoding, patch overlapping with a stride of 11, batch normalization, and token sparsification at layers 3, 6, and 9. Our gallery encoder has the same structure as the sparse encoder but does not conduct token sparsification. We construct the gallery memory with the gallery image tokens as in eq. 1. In the multi-view feature matching module, we first identify 100 globally similar gallery neighbors using the cosine distance in eq. 4, and then search for the final $K$ nearest gallery neighbors using the proposed distance in eq. 10. We set $K$ to 10 on the Occluded-Duke dataset and 5 on the others. During training, we resize all input images to 256 $\times$ 128. The training images are augmented with random horizontal flipping, padding, random cropping, and random erasing (Zhong et al. 2020). The batch size is 64 with 4 images per ID, and the learning rate is initialized to 0.008 with cosine learning rate decay. The distance weight $\alpha$ in eq. 10 is 0.4 and the keep rate $\gamma$ is 0.8. We take the consolidated $[\mathrm{cls}]$ token in eq. 13 for model inference. For large-scale problems, we can further replace the full image tokens with $[\mathrm{cls}]$ tokens to save computation and memory.
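
The two-stage neighbor selection can be sketched as follows, reusing the `cosine_distance` and `combined_distance` helpers from the matching sketch above; the shortlist size of 100 and the value of K follow the text, while the function signature itself is an assumption.

```python
import torch

def select_neighbors(q_cls, q_patches, gallery_cls, gallery_patches, k=10, shortlist=100, alpha=0.4):
    # Stage 1: rank the whole gallery by the cheap image-level cosine distance.
    coarse = cosine_distance(q_cls.unsqueeze(0), gallery_cls)                   # (G,)
    candidates = coarse.topk(min(shortlist, coarse.numel()), largest=False).indices
    # Stage 2: re-rank the shortlist with the combined image/patch-level distance of Eq. (10).
    scores = torch.stack([
        combined_distance(q_cls, gallery_cls[i], q_patches, gallery_patches[i], alpha)
        for i in candidates
    ])
    return candidates[scores.topk(k, largest=False).indices]                    # K nearest neighbors
```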

4 Experiment

We conduct extensive experiments to validate our framework. We first introduce the datasets we use.

4.1 Datasets and Evaluation Metric

Occluded-Duke (Miao et al. 2019) includes 15,618 training images of 702 persons, 2,210 occluded query images of 519 persons, and 17,661 gallery images of 1,110 persons. Occluded-ReID (Zhuo et al. 2018) consists of 1,000 occluded query images and 1,000 full-body gallery images, both belonging to 200 identities. Partial-ReID (Zheng et al. 2015b) involves 600 images from 60 persons, with 5 partial and 5 full-body images per person. Market-1501 (Zheng et al. 2015a) contains 12,936 training images of 751 persons, as well as 3,368 query images and 19,732 gallery images of 750 persons captured by six cameras.

Evaluation Metric. All methods are evaluated with the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP). Floating Point Operations (FLOPs) measure the amount of model computation.

Occluded-Duke Occluded-ReID
Method Rank-1 mAP Rank-1 mAP
DSR 40.8 30.4 72.8 62.8
PGFA 51.4 37.3 - -
HOReID 55.1 43.8 55.1 43.8
OAMN 62.6 46.1 - -
PAT 64.5 53.6 81.6 72.1
TransReID 67.4 59.5 - -
PFD 69.5 61.8 81.5 83.0
FED 67.9 56.3 87.0 79.4
RFCnet 63.9 54.5 - -
Yu et al. 67.6 64.2 68.8 67.3
FRT 70.7 61.3 80.4 71.0
FPC (ours) 76.7 72.8 86.3 84.6
Table 1: Comparison with state-of-the-art methods on Occluded-Duke and Occluded-ReID datasets.
Market-1501 Partial-ReID
Method Rank-1 mAP Rank-1 mAP
PCB 92.3 77.4 - -
PGFA 91.2 76.8 69.0 61.5
HOReID 94.2 84.9 85.3 -
OAMN 92.3 79.8 86.0 -
PAT 95.4 88.0 88.0 -
TransReID 95.0 88.2 83.0 77.5
PFD 95.5 89.7 - -
FED 95.0 86.3 84.6 82.3
RFCnet 95.2 89.2 - -
Yu et al. 94.5 86.5 - -
FRT 95.5 88.1 88.2 -
FPC (ours) 95.1 91.4 86.3 86.5
Table 2: Comparison with state-of-the-art methods on Market-1501 and Partial-ReID datasets.

4.2 Comparison with State-of-the-art Methods

Experimental Results on Occluded ReID Datasets. In table 1, we compare FPC with state-of-the-art methods on two occluded ReID datasets (i.e., Occluded-Duke and Occluded-ReID). Methods from different categories are compared, including partial ReID methods (He et al. 2018), key-point-based methods (Miao et al. 2019; Wang et al. 2020a, 2022a; Hou et al. 2021), data augmentation methods (Chen et al. 2021; Wang et al. 2020a, 2022b), transformer-based methods (Li et al. 2021; He et al. 2021), and gallery-based reconstruction methods (Yu et al. 2021; Xu et al. 2022). In addition, TransReID (He et al. 2021) and PFD (Wang et al. 2022a) use camera information. Since no training set is provided for Occluded-ReID, we adopt Market-1501 as the training set, the same as other methods, to ensure a fair comparison. Our FPC outperforms existing approaches, demonstrating its effectiveness on occluded ReID tasks. Specifically, FPC achieves the best performance on the challenging Occluded-Duke dataset, outperforming other methods by at least 6.0% Rank-1 accuracy and 8.6% mAP. On the Occluded-ReID dataset, FPC produces the highest mAP, outperforming the other methods by at least 1.6%. FPC achieves Rank-1 accuracy comparable to FED (Wang et al. 2022b) and much better than the others.

Experimental Results on Holistic and Partial ReID Datasets. Many existing occluded ReID methods cannot be effectively applied to holistic and partial ReID datasets. In contrast, FPC achieves strong performance on holistic and partial datasets, i.e., Market-1501 and Partial-ReID. The results are shown in table 2. We compare FPC with three categories of methods: holistic ReID methods (Sun et al. 2018), the current leading methods in occluded ReID (Miao et al. 2019; Wang et al. 2020a; Chen et al. 2021; Li et al. 2021; He et al. 2021; Wang et al. 2022a, b), and feature-reconstruction-based methods in occluded ReID (Hou et al. 2021; Yu et al. 2021; Xu et al. 2022). On the Partial-ReID dataset, FPC achieves the best mAP, though it is lower than PAT and FRT in Rank-1 accuracy; we attribute this to the ViT backbone being more prone to overfitting on small datasets, which leads to poorer cross-domain generalization. We also observe that FPC achieves competitive Rank-1 accuracy and the highest mAP on the Market-1501 dataset, at least 1.7% mAP ahead of other methods. The strong performance on holistic and partial ReID datasets illustrates the robustness of our FPC.

Index $\mathcal{S}$ $\mathcal{M}$ $\mathcal{C}$ Rank-1 mAP
1 - - - 65.2 56.6
2 ✓ - - 68.6 60.1
3 ✓ - ✓ 75.8 71.6
4 ✓ ✓ ✓ 76.7 72.8
Table 3: Ablation study on each component.
Metrics Keep Rate $\gamma$
1.0 0.9 0.8 0.7 0.6 0.5
Rank-1 (%) 76.0 76.6 (+0.6) 76.7 (+0.7) 75.9 (-0.1) 74.5 (-1.5) 72.1 (-3.9)
mAP (%) 72.3 72.7 (+0.4) 72.8 (+0.5) 72.4 (+0.1) 71.3 (-1.0) 69.3 (-3.0)
FLOPs (G) 20.8 18.1 (-13%) 15.7 (-25%) 13.6 (-33%) 11.9 (-43%) 10.4 (-50%)
Table 4: Analysis of the effectiveness of the keep rate $\gamma$. We perform the FLOPs comparison on the sparse encoder.

4.3 Ablation Study

Analysis of the proposed components. The results are shown in table 3. Index-1 is our baseline architecture. To assess the effectiveness of $\mathcal{S}$, we compare index-1 and index-2: $\mathcal{S}$ leads to a 3.4% improvement in Rank-1 accuracy and a 3.5% increase in mAP over the baseline. This suggests that $\mathcal{S}$ mitigates the interference of inattentive features (mainly related to occlusion and background noise). Furthermore, $\mathcal{S}$ also expedites model inference, as elaborated in table 4. Comparing index-2 and index-3, $\mathcal{C}$ improves performance by 7.2% Rank-1 accuracy and 11.5% mAP, which indicates that $\mathcal{C}$ effectively compensates the occluded query feature and brings a large performance improvement. Comparing index-3 and index-4, $\mathcal{M}$ increases performance by 0.9% Rank-1 accuracy and 1.2% mAP, indicating that the proposed distance metric helps find accurate gallery neighbors.

Figure 3: Analysis of the distance weight $\alpha$ and the number of nearest gallery neighbors $K$ on the Occluded-Duke dataset.

Analysis of Distance Weight $\alpha$. The distance weight $\alpha$ defined in eq. 10 balances the importance of the image-level and patch-level distances. As shown in fig. 3(a), as $\alpha$ goes from 0 to 1, the patch-level distance gradually takes effect. When $\alpha$ is 0.4, the combination of both achieves the best performance, with 76.7% Rank-1 accuracy and 72.8% mAP.

Analysis of the Number $K$ of Nearest Gallery Neighbors. We conduct quantitative experiments to find the most appropriate number $K$ of nearest gallery neighbors. We conclude from fig. 3(b) that too small a choice of $K$ lacks sufficient information for feature consolidation, while too large a choice increases the risk of incorrectly selected neighbors. When $K$ is 10, we achieve the optimal performance.

Figure 4: Illustration of three different patch drop strategies. White patches are the dropped image parts.
Figure 5: The analysis of three different patch drop strategies.

Analysis of Keep Rate $\gamma$ in Sparse Encoder. The keep rate $\gamma$ determines the number of preserved tokens in each layer of the sparse encoder. We draw instructive observations from table 4: as $\gamma$ decreases from 1.0 to 0.8, the FLOPs drop while model performance improves. This suggests that a reasonable choice of $\gamma$ effectively filters out inattentive features, reducing computational complexity while enhancing model inference. When $\gamma$ is 0.8, we achieve the best balance between computational complexity and performance, with an improvement of 0.5% mAP and 0.7% Rank-1 accuracy and a 25% reduction in FLOPs.

Analysis of Patch Drop Strategy in Sparse Encoder. We experiment with three variants of patch drop approaches to demonstrate the effectiveness of the proposed method. As shown in fig. 4, we preserve the same number of patches under different preservation methods: Random Drop randomly selects $\mathcal{K}$ patches to retain; Non-salient Drop, our proposed method, preserves the patches corresponding to the $\mathcal{K}$ largest values in the $[\mathrm{cls}]$ token attention; conversely, Salient Drop preserves the $\mathcal{K}$ smallest ones. The performance of the three patch drop approaches with different keep rates is shown in fig. 5. The proposed Non-salient Drop achieves the best performance under all keep rate settings, indicating the importance of attentive features for feature matching and consolidation. In addition, Random Drop still achieves relatively high performance, which reflects that our sparse encoder structure is robust to information loss.

Analysis of speed and storage optimization. Our approach relies on gallery information for feature consolidation. To address practical concerns regarding speed and storage in real-world settings, we analyze our optimization strategies on the Occluded-Duke dataset. In $\mathcal{S}$, token sparsification reduces FLOPs by 25% (table 4) with $\gamma$ equal to 0.8, accelerating model inference. In $\mathcal{M}$, we use a two-step matching procedure, i.e., cosine similarity for initial filtering and EMD in the finer step, achieving a 138-fold reduction in computational cost compared to an exhaustive search based on EMD: our approach takes 0.43ms per image, while the full EMD calculation takes 60.0ms per image. Furthermore, if we replace the full image tokens with the $[\mathrm{cls}]$ token, our approach saves substantial memory (only 0.05G for the entire gallery set) and achieves a 17x faster nearest-neighbor search (from 0.43ms to 0.026ms per image), while preserving an acceptable level of precision (mAP: 72.8% to 70.2%, Rank-1: 76.7% to 74.3%, which still outperforms the other methods).

Comparisons with Re-ranking. An alternative approach is the re-ranking technique (Zhong et al. 2017), which also uses k-nearest gallery neighbor information. We compare $\mathcal{C}$ with re-ranking (Zhong et al. 2017); the second group of results in table 5 indicates that re-ranking fails to attain performance comparable to $\mathcal{C}$, with a reduction of 3.2% Rank-1 and 0.3% mAP. Further, our $\mathcal{M}+\mathcal{C}$ is approximately 17 times faster than re-ranking, with respective times of 0.47ms and 7.9ms per image. Moreover, the last group shows that $\mathcal{C}$ and re-ranking can be jointly employed and outperform all other variants.

Methods Rank-1 mAP
FRT(Xu et al. 2022) 70.7 61.3
FRT(Xu et al. 2022) + re-ranking 70.8 65.0
Yu et al. (Yu et al. 2021) 67.6 64.2
Yu et al. (Yu et al. 2021) + re-ranking 68.9 67.3
$\mathcal{S}$ 68.6 60.1
$\mathcal{S}$ + re-ranking 72.6 71.3
$\mathcal{S}$ + $\mathcal{C}$ 75.8 71.6
FPC (ours) 76.7 72.8
FPC (ours) + re-ranking 78.6 78.3
Table 5: Performance comparisons with re-ranking on the Occluded-Duke dataset. $\mathcal{S}$ is our sparse encoder and $\mathcal{C}$ is feature consolidation.
Figure 6: Visualization of the patch drop process in different layers of the sparse encoder. We show various occlusion scenarios such as object occlusion, pedestrian occlusion, and heavy occlusion.

4.4 Visualization

Patch pruning. The patch drop process in different layers of the sparse encoder is shown in fig. 6. We observe that as the number of sparse encoder layers increases, more object occlusion, pedestrian occlusion, and background noise are filtered out, while the essential classification information of the target pedestrian is preserved.

5 Conclusion

In this paper, we propose the feature pruning and consolidation (FPC) framework for the occluded person ReID task. Specifically, our framework prunes interfering image tokens (mostly related to background noise and occluders) without relying on prior human shape information. We propose an effective way to consolidate the pruned features and achieve state-of-the-art performance on occluded, partial, and holistic ReID datasets, demonstrating the effectiveness of the introduced token sparsification technique.

6 Acknowledgment

This work is supported by the National Natural Science Foundation of China (NSFC No. 62272184). The computation is completed in the HPC Platform of Huazhong University of Science and Technology.

References

  • Cai, Wang, and Cheng (2019) Cai, H.; Wang, Z.; and Cheng, J. 2019. Multi-scale body-part mask guided attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 0–0.
  • Caron et al. (2021) Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660.
  • Chen et al. (2021) Chen, P.; Liu, W.; Dai, P.; Liu, J.; Ye, Q.; Xu, M.; Chen, Q.; and Ji, R. 2021. Occlude them all: Occlusion-aware attention network for occluded person re-id. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11833–11842.
  • Chen et al. (2017) Chen, Y.-C.; Zhu, X.; Zheng, W.-S.; and Lai, J.-H. 2017. Person re-identification by camera correlation aware feature augmentation. IEEE transactions on pattern analysis and machine intelligence, 40(2): 392–408.
  • Cuturi (2013) Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26.
  • Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Gao et al. (2020) Gao, S.; Wang, J.; Lu, H.; and Liu, Z. 2020. Pose-guided visible part matching for occluded person reid. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11744–11752.
  • Graham et al. (2021) Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; and Douze, M. 2021. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 12239–12249.
  • He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969.
  • He et al. (2018) He, L.; Liang, J.; Li, H.; and Sun, Z. 2018. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7073–7082.
  • He et al. (2021) He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; and Jiang, W. 2021. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF international conference on computer vision, 15013–15022.
  • Hou et al. (2021) Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; and Chen, X. 2021. Feature completion for occluded person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Huang, Chen, and Huang (2020) Huang, H.; Chen, X.; and Huang, K. 2020. Human parsing based alignment with multi-task learning for occluded person re-identification. In 2020 IEEE International Conference on Multimedia and Expo (ICME), 1–6. IEEE.
  • Kitaev, Kaiser, and Levskaya (2020) Kitaev, N.; Kaiser, L.; and Levskaya, A. 2020. Reformer: The Efficient Transformer. In 8th International Conference on Learning Representations (ICLR).
  • Kong et al. (2022) Kong, Z.; Dong, P.; Ma, X.; Meng, X.; Niu, W.; Sun, M.; Shen, X.; Yuan, G.; Ren, B.; Tang, H.; Qin, M.; and Wang, Y. 2022. SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning. In Proceedings of the European conference on computer vision (ECCV).
  • Lavi, Serj, and Ullah (2018) Lavi, B.; Serj, M. F.; and Ullah, I. 2018. Survey on deep learning techniques for person re-identification task. arXiv preprint arXiv:1807.05284.
  • Li et al. (2021) Li, Y.; He, J.; Zhang, T.; Liu, X.; Zhang, Y.; and Wu, F. 2021. Diverse part discovery: Occluded person re-identification with part-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2898–2907.
  • Liang et al. (2022) Liang, Y.; Ge, C.; Tong, Z.; Song, Y.; Wang, J.; and Xie, P. 2022. Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations. In arXiv preprint arXiv:2202.07800.
  • Liao and Shao (2021) Liao, S.; and Shao, L. 2021. Transmatcher: Deep image matching through transformers for generalizable person re-identification. Advances in Neural Information Processing Systems, 34: 1992–2003.
  • Liu et al. (2021) Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012–10022.
  • Malkov and Yashunin (2018) Malkov, Y. A.; and Yashunin, D. A. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4): 824–836.
  • Miao et al. (2019) Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; and Yang, Y. 2019. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision, 542–551.
  • Miao, Wu, and Yang (2021) Miao, J.; Wu, Y.; and Yang, Y. 2021. Identifying visible parts via pose estimation for occluded person re-identification. IEEE Transactions on Neural Networks and Learning Systems.
  • Phan and Nguyen (2022) Phan, H.; and Nguyen, A. 2022. DeepFace-EMD: Re-Ranking Using Patch-Wise Earth Mover’s Distance Improves Out-of-Distribution Face Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20259–20269.
  • Rao et al. (2021) Rao, Y.; Zhao, Y.; Liu, B.; Lu, J.; Zhou, J.; and Hsieh, C.-J. 2021. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), 13937–13949.
  • Rubner, Tomasi, and Guibas (2000) Rubner, Y.; Tomasi, C.; and Guibas, L. J. 2000. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2): 99–121.
  • Sun et al. (2019) Sun, K.; Xiao, B.; Liu, D.; and Wang, J. 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5693–5703.
  • Sun et al. (2018) Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; and Wang, S. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European conference on computer vision (ECCV), 480–496.
  • Touvron et al. (2021) Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), 10347–10357.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2020a) Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; and Sun, J. 2020a. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6449–6458.
  • Wang et al. (2020b) Wang, S.; Li, B.; Khabsa, M.; Fang, H.; and Ma, H. 2020b. Linformer: Self-Attention with Linear Complexity. In arXiv preprint arXiv:2006.04768.
  • Wang et al. (2022a) Wang, T.; Liu, H.; Song, P.; Guo, T.; and Shi, W. 2022a. Pose-guided feature disentangling for occluded person re-identification based on transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2540–2549.
  • Wang et al. (2022b) Wang, Z.; Zhu, F.; Tang, S.; Zhao, R.; He, L.; and Song, J. 2022b. Feature Erasing and Diffusion Network for Occluded Person Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4754–4763.
  • Wu et al. (2019) Wu, D.; Zheng, S.-J.; Zhang, X.-P.; Yuan, C.-A.; Cheng, F.; Zhao, Y.; Lin, Y.-J.; Zhao, Z.-Q.; Jiang, Y.-L.; and Huang, D.-S. 2019. Deep learning-based methods for person re-identification: A comprehensive review. Neurocomputing, 337: 354–371.
  • Wu, Zhu, and Gong (2022) Wu, G.; Zhu, X.; and Gong, S. 2022. Learning hybrid ranking representation for person re-identification. Pattern Recognition, 121: 108239.
  • Wu et al. (2021) Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Xiyang, D.; Yuan, L.; and Zhang, L. 2021. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 22–31.
  • Xu et al. (2022) Xu, B.; He, L.; Liang, J.; and Sun, Z. 2022. Learning Feature Recovery Transformer for Occluded Person Re-Identification. IEEE Transactions on Image Processing, 31: 4651–4662.
  • Ye et al. (2021) Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; and Hoi, S. C. 2021. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6): 2872–2893.
  • Yu et al. (2021) Yu, S.; Chen, D.; Zhao, R.; Chen, H.; and Qiao, Y. 2021. Neighbourhood-guided feature reconstruction for occluded person re-identification. arXiv preprint arXiv:2105.07345.
  • Zhang et al. (2022) Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; and Yuan, L. 2022. MiniViT: Compressing Vision Transformers with Weight Multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12145–12154.
  • Zheng et al. (2015a) Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015a. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, 1116–1124.
  • Zheng et al. (2015b) Zheng, W.-S.; Li, X.; Xiang, T.; Liao, S.; Lai, J.; and Gong, S. 2015b. Partial person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 4678–4686.
  • Zhong et al. (2017) Zhong, Z.; Zheng, L.; Cao, D.; and Li, S. 2017. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1318–1327.
  • Zhong et al. (2020) Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2020. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 13001–13008.
  • Zhu et al. (2021) Zhu, M.; Han, K.; Tang, Y.; and Wang, Y. 2021. Visual Transformer Pruning. arXiv preprint arXiv:2104.08500.
  • Zhuo et al. (2018) Zhuo, J.; Chen, Z.; Lai, J.; and Wang, G. 2018. Occluded person re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME), 1–6. IEEE.

7 Supplementary Material

We present additional experimental and qualitative results in this supplementary material.

8 More Experimental Details and Results

8.1 Visualization Results with Different Keep Rate

We show visualizations of patch drop with different keep rates $\gamma$ in fig. 7. $\gamma$ determines the number of preserved patches in the sparse encoder. As $\gamma$ decreases from 0.9 to 0.5, the patches corresponding to background and occlusions are gradually pruned. When $\gamma$ is 0.5, only the important target human areas and a few outliers are preserved.

Figure 7: Visualization of patch drop with different keep rates $\gamma$ in the sparse encoder.

8.2 Extended Visualization Results of Sparse Encoder

We show more visualization results in fig. 8 to illustrate attentive token identification. Input images are chosen from Occluded-Duke (Miao et al. 2019), covering both object occlusion and pedestrian occlusion. The results validate that our sparse encoder applies to a variety of occlusion cases.

8.3 Analysis of Token Sparsification

In our paper, we discuss the effect of token sparsification, i.e., preserving attentive tokens and dropping inattentive tokens, which both reduces model computation and enhances model inference. This sparsification mechanism is also applicable to other occluded ReID methods. Here we demonstrate the effectiveness of token sparsification with TransReID (He et al. 2021) on the Occluded-Duke (Miao et al. 2019) and Occluded-ReID (Zhuo et al. 2018) datasets, as shown in table 6. We find that token sparsification (abbreviated as T.S.) improves Rank-1 accuracy and mAP on both datasets, suggesting that token sparsification may have a broad scope of application in occluded ReID tasks.

Occluded-Duke Occluded-ReID
Method Rank-1 mAP Rank-1 mAP
TransReID 67.4 59.5 80.5 74.4
TransReID + T.S. 68.2 60.0 80.9 75.0
Table 6: Performance comparison with token sparsification on the Occluded-Duke and Occluded-ReID datasets. Here we set the keep rate to 0.9.

8.4 Ablation Study on Layers of Token Sparsification

We performed an ablation study on the token sparsification layers, as detailed in table 7. The results indicate that the location of the first sparsification layer has a significant impact on performance, and shifting it too far toward the early layers may lead to performance degradation.

Layer locations [3,6,9] [2,5,9] [2,5,8] [1,4,7] [1,6,9]
Rank-1 76.7 76.2 (-0.5) 76.3 (-0.4) 75.3 (-1.4) 75.2 (-1.5)
mAP 72.8 72.3 (-0.5) 72.5 (-0.3) 71.6 (-1.2) 71.4 (-1.4)
Table 7: Analysis of token sparsification layers on Occluded-Duke dataset.

8.5 More Implementation Details

We adopted a two-phase training strategy to optimize computational memory efficiency. Initially, we focused on training the sparse encoder. Upon completion of this phase, the parameters of the sparse encoder were fixed, and we proceeded to train the parameters of the transformer decoder module. The first stage effectively circumvents memory constraints associated with large gallery memory storage. Subsequently, the second stage demonstrated a reduced need for epochs to achieve effective convergence. Furthermore, we input only the tokens of gallery neighbors into the transformer decoder for model inference.

9 Further Analysis on Applicability

9.1 Analysis for Real-World Applications

We recognize that there are many challenges to overcome before our approach can be applied to real-world Re-ID scenarios. It is important to note that these challenges are not unique to our approach, but rather are common issues faced by most existing person Re-ID methods. These challenges are mainly: 1) the large number of gallery images; 2) the need to update the gallery; 3) the high cost of feature storage; and 4) the absence of positive samples for a query image.

Classic person Re-ID techniques, such as pair-wise similarity calculation (Ye et al. 2021) or gallery feature reconstruction/aggregation (Xu et al. 2022; Yu et al. 2021; Liao and Shao 2021; Wu, Zhu, and Gong 2022), face the same issues of large gallery size, the need for gallery updates, high feature storage cost, and the absence of positive samples in the gallery for a query image. Regarding 1) and 3), in our implementation the feature size is 243x768 compared to the gallery image size of 224x224x3, so the extra feature storage remains of the same order of magnitude. Regarding 2), our approach treats each image in the gallery as independent during matching and consolidation, so a gallery update simply involves delete and insert operations. Regarding 4), if there are no positive samples in the gallery for a query image, it is impossible for any person Re-ID approach, including ours, to find the correct identity, as this violates the pre-conditions of the task. In such cases, our approach would likely use the closest negative samples for consolidation and report a low confidence score.

In addition, our approach still has the potential to be applicable to real-world scenarios. To address the limitations outlined in points 1) and 2), we can employ the efficient HNSW algorithm (Malkov and Yashunin 2018), which has demonstrated fast search speeds of 94ms on a million-scale SIFT dataset and 18.3s on a million-scale ImageNet dataset. To tackle point 3), we can replace the fine-grained patch tokens with the $[\mathrm{cls}]$ token, which greatly reduces memory storage requirements while maintaining an acceptable level of accuracy, as in Sec. 4.3 of the paper. Finally, regarding point 4), we can adaptively adjust the number of neighbors based on the EMD distance rather than relying on a fixed number. Nevertheless, our framework is robust to the use of incorrect neighbor features, as demonstrated in Fig. 3(b) of the paper, where an over-sized choice of neighbors (K=14) had only a minimal impact on performance (a decrease of 1.1% in mAP and 2.1% in Rank-1 compared to the best results).

Gallery Feature Gallery Images Memory Cost (G) Rank-1 (%) mAP (%)
$[\mathrm{cls}]$ token 17,661 0.05 75.9 70.6
patch tokens + $[\mathrm{cls}]$ token 17,661 12.28 76.7 72.8
Table 8: The analysis of using [cls] token in feature consolidation to save storage.

9.2 Time Analysis of Distance Computation

As described in Sec. 3.4 of the paper, our method uses a two-step procedure to efficiently select optimal samples. The first step involves finding 100 candidate samples using cosine distance and the second step integrates EMD to determine the final optimal samples. Our approach showed its efficiency on the Occluded-Duke dataset, where computing EMD for all 17,661 gallery images took 1152.5s, while our two-stage process took only 11.0s, as presented in  table 9. This is in comparison to the k-reciprocal re-ranking process commonly used in Re-ID, which takes 157.1s, making our EMD computation highly efficient.

Time Cost (s) Metric (%)
Method Cosine EMD Total Rank-1 mAP
Bruteforce 0.5 1152.5 1153.0 76.7 72.8
Two-Stage (Ours) 0.5 10.5 11.0 76.8 72.7
Table 9: Running time analysis for the distance computation.

10 Future Work

10.1 Optimization for Drop Rate

We acknowledge, as a potential limitation, that the patch drop rate in our model is currently fixed. Our ablation studies on the keep rate show that a range of 0.7-0.9 consistently yields superior results on Occluded-Duke. While not currently implemented, future research will explore adaptive optimization of the drop rate.

10.2 Generalizability of Framework

In the future, we envisage investigating the proposed pruning and consolidation framework across a wider spectrum of occlusion challenges, including but not limited to mixed holistic and occluded ReID, occluded object classification, among others. This endeavor is aimed at bolstering the generalizability and applicability of our framework.

Figure 8: Extended visualization results of inattentive tokens in different layers of the sparse encoder. The regions without masks represent the attentive tokens (mainly related to the target person). The masked regions denote the inattentive tokens (mostly related to background noise and occlusions). Our sparse encoder is effective in dealing with occlusion and background.