
PAFormer : Part Aware Transformer for Person Re-identification

Hyeono Jung, Jangwon Lee, Jiwon Yoo, Dami Ko, Gyeonghwan Kim
Abstract

Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances through comparisons of body parts between samples. In practice, however, previous methods often lack sufficient awareness of the anatomical aspects of body parts, resulting in a failure to capture features of the same body parts across different samples. To address this issue, we introduce Part Aware Transformer (PAFormer), a pose estimation based ReID model that can perform precise part-to-part comparison. In order to inject part awareness, we introduce learnable parameters called ‘pose tokens’, which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without the additional modules related to body part localization that are commonly used in previous ReID methodologies leveraging pose estimation models. Additionally, leveraging the enhanced awareness of body parts, PAFormer employs a learning-based visibility predictor to estimate the degree of occlusion for each body part. We also introduce a teacher forcing technique using ground truth visibility scores, which enables PAFormer to be trained only with visible parts. A set of extensive experiments shows that our method outperforms existing approaches on well-known ReID benchmark datasets.

keywords:
Person Re-identification, Vision Transformer, Partial ReID
journal: Pattern Recognition

Dept. of Electronic Engineering, Sogang University, 35 Baekbum-ro, Mapo-gu, Seoul 04107, Rep. of Korea (https://mmi.sogang.ac.kr/mmi/)

Highlights

Previous partial ReID methods do not perform genuine part-to-part comparison

We introduce a pose estimation based ReID model, PAFormer

No body part localization module is required at the inference phase

A learned visibility predictor alleviates the occlusion problem

1 Introduction

Person re-identification (ReID) is a type of image retrieval task that aims to find samples with the same ID as a given query image from a gallery set. Specifically, the methodology termed part-based ReID or partial ReID aims to achieve identification by comparing distinctive features extracted from specific body parts. Following the success of stripe-based methods [1, 2, 3], which compute feature distances between spatially aligned regions after dividing images according to a predetermined division rule, partial ReID methodologies have emerged in earnest.

Recently, methods using attention mechanisms for localizing body parts have been proposed [4, 5, 6], with ViT [7]-based methods [8, 9, 10] being encompassed in this category. These approaches typically employ part prototypes [11, 12, 13, 14], which are expected to represent specific body parts, to capture partial features. They use a loss term that combines cross-entropy loss and triplet loss. However, this loss term is more inclined towards guiding the prototypes to excel in ID prediction within the training set than towards learning awareness of human anatomy. Consequently, there is a significant risk of overfitting, potentially leading the model to capture only dominant information from individual training samples.

Figure 1: Left) Patch embedding in ViT: CNN filters of the same size as the patch are applied, extracting visual representation features. Right) The generated patch embeddings contain both body part-related channels and unrelated channels.

We elaborate on the aforementioned issues, frequently observed in attention-based ReID, with its variant, the ViT-based model. In ViT-based models, each part prototype gathers information based on the cosine similarity with patch tokens, allowing all channels of the token to contribute to the result. Therefore, if there are channels in the token unrelated to the human body, the resulting similarity map may not accurately represent body parts. As in Figure 1, if there exists a channel related to color, the black lower body of the first sample might show a higher similarity with the black upper body of the second sample than with the gray lower body of the latter.

Figure 2: Each part token should be projected close to the patch tokens corresponding to the body part it represents in the embedding space. If body part-unrelated features are involved in the similarity computation of ViT, there is a risk of projecting based on specific appearances rather than body parts.

In partial ReID, we expect the projection of prototypes and patch tokens in the embedding space to occur as depicted in Figure 2. Many methods leveraging human body information [13, 15, 16] have been proposed to achieve this behavior. However, these methods often entail challenges, such as the reliance on external pose estimation models even during the inference phase, or the introduction of additional body part localization modules beyond the modules capturing partial features.

To address the deficiency of human anatomical awareness in attention-based partial ReID models and to utilize human body information efficiently, we propose Part Aware Transformer (PAFormer), a Transformer-based structure that incorporates pose heatmaps obtained from a pose estimation model. Since PAFormer provides direct supervision with pose heatmap information to the attention maps of the cross-attention mechanism built on the vanilla ViT, it is more efficient than other pose-based ReID methods that rely on supplementary modules.

In detail, we introduce a novel learnable parameter called ‘pose token’ to estimate the association between patch tokens and body parts. Unlike the part tokens in existing methods, the pose token is explicitly designed to represent a specific body part. Consequently, partial features are extracted by aggregating patch tokens based on the probabilities estimated by the pose tokens. Additionally, we introduce a learnable visibility predictor that takes the output pose tokens as its input, in contrast to existing methods that often rely on non-learnable visibility prediction strategies involving thresholding of externally or internally estimated heatmaps.

Our contributions are summarized as follows.

  • 1.

    We point out that the lack of human anatomical awareness in previous partial ReID methods hinders the achievement of the fundamental goals of partial ReID.

  • 2.

    We introduce PAFormer, which can perform part-to-part comparison by precisely localizing human body parts. PAFormer introduces a novel learnable parameter called ‘pose token’ to estimate the association between body parts and patch tokens. Based on the predicted association probabilities, PAFormer captures partial features and performs part-to-part comparison.

  • 3.

    We design a visibility predictor that takes pose tokens, designed to represent each body part, as input. This predictor estimates the degree of occlusion for specific body parts. By utilizing the estimated visibility scores, we compute feature distances between samples to effectively tackle occlusion challenges.

  • 4.

    PAFormer achieves state-of-the-art performance on various ReID benchmark datasets including Market-1501 [17], DukeMTMC-ReID [18], and Occluded-Duke [19]. PAFormer outperforms existing methods and demonstrates superior performance in ReID tasks.

2 Related works

2.1 CNN based ReID methods

Prior to the emergence of the Vision Transformer, almost all ReID approaches were based on CNNs. BoT [20] proposed using a CNN backbone to extract global features and applying cross-entropy loss and triplet loss as supervision. Various subsequent studies built upon this ReID loss have also been proposed. However, because of the challenges faced by approaches relying solely on global features, such as occlusion, part-based approaches have emerged.

In PCB [1], stripe-based methods are proposed to measure the distances between samples within the same region. Zhao et al. [5] introduce CNN-based attention networks, allowing the model to autonomously find important regions in the image for ReID. ABDNet [4] builds upon this attention network concept and applies orthogonality regularization to each channel, enabling the extraction of information from various body parts. SPReID [21] employs a human semantic parsing model to aggregate features only from the regions corresponding to each body part. AACN [22] and BPBReID [16] directly supervise the formation of attention maps by utilizing pose estimation models. These approaches suggest explicit answers on how attention maps should be formed.

2.2 Transformer based ReID methods

After the introduction of TransReID [9], which utilizes a jigsaw patch module to capture local features, many Transformer-based ReID methods have emerged. LA Transformer [3] implements stripe-based methods on the ViT architecture. PAT [11] adopts an encoder-decoder structure in the Transformer and uses a cosine dissimilarity loss to encourage each part token to gather information from different body parts. AAformer [12] enhances the distinctiveness of each token by physically limiting the number of patch tokens that can interact with each part token using the Optimal Transport algorithm [23]. NFormer [8] effectively removes noise by performing self-attention only between reciprocal neighbor patch tokens. HAT [10] utilizes multi-scale information from a ResNet [24] backbone to aggregate hierarchical information. ResT-ReID [25] combines attention-guided graph convolutional networks with a Transformer to find correlations between body parts. PFD [13] leverages an embedding of pose heatmaps as additional information for Transformer-based models.

3 Problem Setting

Let us assume that x_i represents the i-th image, and y_i represents the ID label of that image. For a training set T = {x_i, y_i}, a common partial ReID model is trained with the ReID loss, a combination of cross-entropy and triplet loss. These terms can be seen as a loss for solving the classification problem of determining whether x_i belongs to y_i. With this loss, the most straightforward approach during model training would be to reduce the loss by focusing on the information from the most discriminative regions associated with each ID, which can potentially cause the model to be overfitted to the classification problem on the training set T. Therefore, prior approaches lack the assurance that a single part prototype captures information from a consistent body region across all samples and may not operate as explicitly ‘part-based’ as they intended; instead, they could operate more as ‘attribute-based’ (e.g., whether the image contains the color yellow).

Figure 3: Attention maps of a vanilla ViT-based ReID model: we can observe that the first sample focuses on the head, the second sample on the lower body region, and the last sample on the upper body region. This demonstrates that a partial ReID model trained solely on the ReID loss is incapable of performing part-to-part comparisons effectively in practice.

To support the aforementioned claim, we visualize attention maps of the vanilla ViT, which can be considered a partial ReID model with a single part prototype, using attention rollout [26]. As shown in Figure 3, we observe that the model focuses on distinct body parts across different samples. However, since the fundamental goal of partial ReID is to perform precise comparison between the same body parts, we expect the highly-focused area in the attention maps to be formed on the same body parts across the samples. Highlighting the limitations of current approaches in the context of part-to-part matching, we propose PAFormer in the following section, which enables each part prototype to be aware of a distinct human body part.

4 PAFormer

Figure 4: Pipeline of PAFormer. We adopt pose tokens to estimate the association between body parts and patch tokens. Partial features are generated by aggregating the patch tokens, refined during the self-attention process, based on the predicted probabilities. Additionally, the output pose tokens pass through a visibility predictor to infer visibility scores.

The overall pipeline of our PAFormer is depicted in Figure 4. From section 4.1 to section 4.4, we explain each component of PAFormer and elucidate the principles of the learning process. Section 4.5 describes the visibility predictor proposed to tackle occlusion issues, while section 4.6 introduces a modification of the ReID loss based on pseudo ground truth visibility scores. Section 4.7 then covers the total loss function and the equation used to measure feature distances during inference. Finally, we analyze the computational efficiency of PAFormer in section 4.8.

4.1 Pose tokens

Our model employs a new learnable prototype named ‘pose token’. The key advantage of the pose token is that it learns to extract features of the same body part across diverse samples. This shifts the main focus from aligning IDs in the training set to estimating correlations between patch tokens and body parts, which significantly aids part-to-part comparison. Furthermore, the use of clear guidance about human body parts ensures that each pose token is trained to represent a specific, distinct body part. This attribute also enables the pose token to function as a valuable indicator for detecting occlusions of specific body parts. Detailed explanations regarding the training and application of the pose token are provided in the following subsections.

Figure 5: The visualization depicts the similarity between a query token (highlighted with a red border) and the other patch tokens when the vanilla ViT is trained with the ReID loss. Despite the consistent location of the query token, the patch tokens with high similarity vary based on the semantics of the query token. For instance, in the first sample, a high similarity can be observed with tokens corresponding to the lower body, while in the second sample, tokens associated with the arms show higher similarity.

4.2 Feature Refinement

We expect each pose token to demonstrate a high degree of similarity only with its associated patch tokens. We aim to leverage this similarity in the forthcoming cross-attention and aggregation layers, as detailed in subsequent subsections. To achieve this, a prerequisite step is required, which enhances the similarity among patch tokens that share the same anatomical semantics. To accomplish this, we preprocess the patch tokens using self-attention (SA), a mechanism that has already been demonstrated in various studies to perform this task effectively. In Figure 5, it is evident that patch tokens sharing anatomical semantics converge in the vanilla ViT-based ReID model.

Additionally, by utilizing the self-attention mechanism, we can harness the robust global feature extraction capabilities of ViT. Similar to the vanilla ViT, we employ a class token to extract global features and apply ReID loss on it:

L_{ReID}^{CLS}=L_{ID}(CLS)+L_{tri}(CLS) (1)

where L_{ID} and L_{tri} are the cross-entropy loss and the triplet loss, respectively. Also, we incorporate an SA layer among the pose tokens, with the objective of enabling PAFormer to learn associations among prototypes, akin to DETR [27] or PAT [11].
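For illustration, a minimal PyTorch-style sketch of the ReID loss in (1) is given below. The classifier head id_head and the margin value are assumptions, and the batch-hard triplet mining follows the common formulation [29] in simplified form, not necessarily the exact implementation used in PAFormer.

import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    # Pairwise Euclidean distances within the mini-batch.
    dist = torch.cdist(feats, feats, p=2)                    # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # (B, B) bool
    # Hardest positive: farthest sample sharing the same ID.
    d_ap = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different ID.
    d_an = (dist + same.float() * 1e6).min(dim=1).values
    return F.relu(d_ap - d_an + margin).mean()

def reid_loss_cls(cls_token, labels, id_head):
    # Eq. (1): cross-entropy on ID logits plus triplet loss on the class token.
    logits = id_head(cls_token)                              # (B, num_ids)
    return F.cross_entropy(logits, labels) + batch_hard_triplet(cls_token, labels)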

4.3 Part Awareness

In order to imbue PAFormer with part awareness, during training we create ground truth heatmaps by post-processing the initial keypoint heatmaps obtained from a pose estimation model, PifPaf [28]. The creation process of these ground truth heatmaps is detailed in section 5.3. The ground truth heatmaps directly guide how the attention maps between the pose tokens and patch tokens in the cross-attention (CA) layer should be formed. This guidance ensures that the pose tokens accurately localize body parts. The loss function required for this process is as follows:

L_{pose}={1\over P}\sum_{P}MSE(\overline{Attn}_{pose},GT) (2)

where P is the number of pose tokens, MSE denotes the mean square error, and GT represents the ground truth heatmap. \overline{Attn}_{pose} is the average of all Attn_{pose} maps (12 layers × 12 heads) present in the CA layers of PAFormer. In contrast to existing methods, PAFormer allows prototypes to concentrate on identifying patch tokens corresponding to specific body parts.
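A minimal sketch of (2) is shown below, assuming the pose-token attention maps from all CA layers and heads are stacked into a tensor of shape (layers*heads, B, P, N) and the ground truth heatmaps are flattened to (B, P, N); the tensor names are illustrative.

import torch.nn.functional as F

def pose_loss(attn_pose_all, gt_heatmaps):
    # Average the attention maps over all CA layers and heads (12 x 12 in PAFormer).
    attn_avg = attn_pose_all.mean(dim=0)          # (B, P, N)
    # Mean square error against the ground truth heatmaps; F.mse_loss averages
    # over all elements, matching Eq. (2) up to a constant factor.
    return F.mse_loss(attn_avg, gt_heatmaps)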

While some methods which utilize a pose estimation model have already been proposed, they often use a separate body part localization module within their models or continue to rely on an external pose estimation model during inference. However, PAFormer takes a different approach by learning how attention maps should be formed during the cross-attention process. As a result, it does not require pose heatmaps during inference.

The underlying principle of learning part awareness through a simple mean square error is as follows:

Attn_{pose}=(X_{pose}\cdot W_{Q})\cdot(X_{patch}\cdot W_{K})^{T}=X_{pose}\cdot(W_{Q}\cdot W_{K}^{T})\cdot X_{patch}^{T}=X_{pose}\cdot W\cdot X_{patch}^{T} (3)

where W_{Q} and W_{K} refer to the fully connected layers performing the query and key transforms, respectively, and X_{pose} and X_{patch} represent the pose tokens and the patch tokens, respectively. Expanding the expression, the product of W_{Q} and W_{K}^{T} can be expressed as a single weight matrix, denoted as W. The MSE loss supervises PAFormer so that W accentuates channels associated with specific body parts. Consequently, the dot product of paired patch tokens and pose tokens increases, causing the attention score to rise selectively only for patches belonging to the specific body part, as intended.
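The factorization in (3) can also be checked numerically; the small sketch below assumes bias-free query and key projections and randomly chosen token counts.

import torch

d, P, N = 64, 5, 128
W_Q = torch.randn(d, d, dtype=torch.float64)
W_K = torch.randn(d, d, dtype=torch.float64)
X_pose = torch.randn(P, d, dtype=torch.float64)    # pose tokens
X_patch = torch.randn(N, d, dtype=torch.float64)   # patch tokens

# Two-step form: project to queries/keys, then take the dot product.
attn_two_step = (X_pose @ W_Q) @ (X_patch @ W_K).T
# Collapsed form: a single weight matrix W = W_Q @ W_K^T acts between tokens.
attn_one_step = X_pose @ (W_Q @ W_K.T) @ X_patch.T
assert torch.allclose(attn_two_step, attn_one_step)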

4.4 Partial Features

Once the association of each pose token with specific patch tokens is estimated in the cross-attention layer, we leverage these probabilities to aggregate the refined values of the patch tokens, thereby extracting partial features. This stage is denoted as an aggregation layer. The aggregated partial features z_p can be expressed as:

z_{p}=\overline{Attn}_{pose}^{p}\cdot X_{patch} (4)

where \overline{Attn}_{pose}^{p} is the p-th pose token's attention map averaged across all heads from a CA layer. In typical ViT-based methods, a value transformation by a fully-connected network is applied to the patch tokens when creating partial features. In the aggregation layer, however, no separate transformation is applied, because the information from the patch tokens should be accepted directly based on the association probabilities. Similar to the conventional Transformer architecture, as each layer is traversed, the partial features z_p are also updated by layer normalization, feed-forward networks, and residual connections.
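A minimal sketch of the aggregation in (4) is given below, assuming an averaged cross-attention map of shape (B, P, N) and patch tokens of shape (B, N, d); as described above, no value transform is applied in this sketch.

import torch

def aggregate_partial_features(attn_pose, x_patch):
    # attn_pose: (B, P, N) association probabilities, x_patch: (B, N, d).
    # Each partial feature is a probability-weighted sum of the patch tokens.
    return torch.bmm(attn_pose, x_patch)          # (B, P, d)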

Relying solely on the pose heatmap for model guidance can be vulnerable if the heatmap is inaccurately generated. To mitigate this vulnerability, we supplement the guidance by incorporating the ReID loss for partial features:

L_{ReID}^{part}={1\over P}\sum_{i=1}^{P}(L_{ID}(z_{i}^{L})+L_{tri}(z_{i}^{L})) (5)

where z_{i}^{L} denotes the aggregated partial features from the i-th pose token at the last PAFormer block. We minimize attention to regions that do not correspond to body parts through L_{pose}. Subsequently, with the assistance of L_{ReID}^{part}, we perform additional filtering, addressing areas that may have been overlooked by L_{pose}. However, to prevent potential issues stemming from the dominant influence of the ReID loss, we assign a substantially higher weight to L_{pose} than to L_{ReID}^{part}. This strategic weighting ensures that the ReID loss functions merely as a complementary element.

4.5 Visibility Predictor

While many other methods that consider occlusion rely on the confidence score output by the body part localization module to address this issue, PAFormer employs a learning-based prediction network for a more sophisticated treatment of the problem, as in Figure 4. This is possible because the design of the preceding stages ensures that each pose token serves as a representative of a body part. For the ground truth visibility score, we assume that a body part is invisible if the maximum value in the pose heatmap does not exceed θ_p. As the pose estimation model yields varying confidence scores for different body parts, θ_p is set differently for each body part. To train the visibility predictor, we employ L_{vis}, a binary cross-entropy loss between the obtained output and the ground truth:

L_{vis}=-v_{GT}\cdot\log{v}-(1-v_{GT})\cdot\log{(1-v)} (6)

where v denotes the predicted visibility score, and v_{GT} denotes the ground truth visibility score from the pose estimation model.
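A minimal sketch of the visibility predictor and the loss in (6) is given below, assuming each output pose token is mapped to a single logit by a small linear head; the actual head architecture in PAFormer may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisibilityPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Linear(dim, 1)              # one visibility logit per pose token

    def forward(self, pose_tokens):                # (B, P, d)
        return self.head(pose_tokens).squeeze(-1)  # logits, (B, P)

def visibility_loss(logits, v_gt):
    # Eq. (6): binary cross-entropy against pseudo ground-truth visibility labels.
    return F.binary_cross_entropy_with_logits(logits, v_gt.float())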

4.6 Teacher forcing based on visibility score

It is essential to refrain from calculating the ID loss or triplet loss for occluded body parts. To guarantee this, we adapt the ReID loss by incorporating the ground truth visibility score as a form of teacher forcing, so that occluded parts are excluded from the learning process. In particular, during hard triplet mining [29], to deter the selection of occluded regions, the distance is multiplied by 0 for positive cases and by an extremely large value for negative cases.
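A minimal sketch of this teacher forcing during batch-hard triplet mining for a single body part is shown below, assuming a (B, B) matrix of part feature distances and ground truth visibility labels in {0, 1}; variable names are illustrative.

import torch
import torch.nn.functional as F

def masked_hard_triplet(part_dist, labels, v_gt, margin=0.3):
    # part_dist: (B, B) distances for one body part; v_gt: (B,) GT visibility (0 or 1).
    visible_pair = v_gt.unsqueeze(0) * v_gt.unsqueeze(1)      # 1 only if both samples are visible
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # Occluded pairs: the distance is multiplied by 0 for positives ...
    d_pos = part_dist * visible_pair
    # ... and by an extremely large value for negatives, so they are never mined.
    d_neg = part_dist * (visible_pair + (1.0 - visible_pair) * 1e6)
    d_ap = (d_pos * same).max(dim=1).values                   # hardest visible positive
    d_an = (d_neg + same * 1e6).min(dim=1).values             # hardest visible negative
    return F.relu(d_ap - d_an + margin).mean()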

4.7 Objective Function and Inference

The overall loss L is calculated as follows:

L=L_{ReID}^{CLS}+L_{ReID}^{part}+\lambda L_{pose}+L_{vis} (7)

λ is the weight for L_{pose} and is heuristically set to 10. All modules in PAFormer are trained jointly.

During the inference phase, PAFormer calculates distances as follows:

d^{i,j}=d_{CLS}^{i,j}+{\sum_{p=1}^{P}d_{p}^{i,j}v_{p}^{i}v_{p}^{j}\over\sum_{p=1}^{P}v_{p}^{i}v_{p}^{j}} (8)

where d^{i,j} is the distance between the i-th and j-th samples, and v denotes the predicted visibility score. Among the subscripts of d, ‘CLS’ refers to the class token, and p indicates the specific pose token's identifier.
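A minimal sketch of (8) for a single pair of samples is shown below, assuming the per-part distances and predicted visibility scores are 1-D tensors of length P; the small epsilon guarding against the case where no part is jointly visible is an assumption of this sketch.

import torch

def pair_distance(d_cls, d_part, v_i, v_j, eps=1e-6):
    # d_part: (P,) per-part distances; v_i, v_j: (P,) predicted visibility scores.
    w = v_i * v_j                                 # joint visibility weights
    return d_cls + (d_part * w).sum() / (w.sum() + eps)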

4.8 Time complexity of PAFormer

Generally, the time complexity of the Transformer is known to be O(N^2 d), where N and d are the number of tokens and channels, respectively. Since PAFormer performs self-attention similarly, it maintains the same O(N^2 d) time complexity. Additionally, the pose token self-attention process contributes O(P^2 d), and the cross-attention process adds O(PNd), resulting in an overall time complexity of O((N^2 + P^2 + PN)d). Given that N is usually much larger than P (N: three-digit, P: one-digit), the additional burden introduced by PAFormer is considered to be negligible.

5 Experiments

5.1 Datasets

Dataset #image #ID #cam
Market-1501 [17] 32,668 1,501 6
DukeMTMC-ReID [18] 36,441 1,404 8
Occluded-Duke [19] 36,441 1,404 8
Table 1: Statistics of ReID datasets used in experiments.

Three widely used ReID datasets are chosen for our experiments and their statistics are shown in Table 1. Images in every dataset are resized to 256×128 or 384×128. Then, the training images are augmented with random horizontal flipping, cropping, padding, grayscale [30] and erasing [31]. Among these methods, we apply all except random erasing and grayscale augmentation to the pose mask. Additionally, for regions affected by random erasing in the original data, we set the mask value to 0.

5.2 Implementation Details

We adopt an ImageNet pre-trained ViT-B/16 as the backbone. The model is trained with the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate is set to 0.008 with cosine decay, and the batch size is set to 64. We also use a sliding window (S=12) and side information embedding [9]. Both training and testing processes are implemented with two NVIDIA GeForce RTX 4080 GPUs. We conduct training for 320 epochs and evaluate the performance every 20 epochs. Among these evaluations, we choose the one with the highest performance.
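A minimal sketch of the training setup described above (SGD, momentum 0.9, weight decay 1e-4, base learning rate 0.008, cosine decay, 320 epochs, evaluation every 20 epochs) is given below; model, train_one_epoch, and evaluate are hypothetical placeholders rather than parts of the released implementation.

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.008,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=320)

for epoch in range(320):
    train_one_epoch(model, optimizer)     # hypothetical training loop
    scheduler.step()
    if (epoch + 1) % 20 == 0:
        evaluate(model)                   # hypothetical evaluation routine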

Head Torso Arms Legs Feet
Market-1501 0.6 0.7 0.85 0.8 0.7
Duke-reID 0.6 0.8 0.85 0.85 0.75
OCC-Duke 0.6 0.8 0.85 0.85 0.75
Table 2: Values of θ_p for various datasets.

5.3 Ground truth pose heatmap and visibility score

We use PifPaf [28] to obtain pose heatmaps from the input image. However, instead of using the heatmaps predicted by PifPaf directly, we employ several modification steps to adapt them as ground truth heatmaps.

Firstly, we apply random erasing augmentation [31] to the pose heatmaps as well. The probabilities corresponding to the erased areas in the original image are all set to 0.

Secondly, we combine heatmaps of small fragments such as keypoints and joints to form masks for various body parts. During this process, we regard heatmaps from PifPaf with a maximum confidence score below θ_p as noise and ignore them. The value of θ_p is set differently for each body part and for each dataset, as in Table 2.

Thirdly, we normalize the combined heatmaps. Since our objective is to avoid focusing solely on discriminative parts, we assume that the influence of all patch tokens belonging to a specific body part is uniform. Consequently, each value in the ground truth mask is either 0 or 1/K, where K represents the count of non-zero values in the mask. Following this transformation, we normalize the mask by dividing it by the sum of its values, ensuring a total sum of 1.

The ground truth visibility score is also determined by θ_p. Pseudo labels are assigned as ‘visible’ only for body parts with a maximum confidence score greater than or equal to θ_p.
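A minimal sketch of the construction above for a single body part is given below, assuming a combined part heatmap of shape (H, W) and a per-part threshold θ_p from Table 2; treating every non-zero heatmap value as belonging to the part is an assumption of this sketch.

import torch

def gt_mask_and_visibility(heatmap, theta_p):
    # heatmap: (H, W) combined keypoint/joint heatmap for one body part.
    if heatmap.max() < theta_p:                 # low confidence: treat as noise / occluded
        return torch.zeros_like(heatmap), 0.0
    binary = (heatmap > 0).float()              # uniform influence for all covered patches
    mask = binary / binary.sum()                # non-zero entries become 1/K, total sum is 1
    return mask, 1.0                            # part is labeled as visible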

5.4 Visualization

Figure 6: Attention maps and visibility scores of pose tokens. The color of the attention map goes from blue to red as the attention score increases. The example divides a person into five body parts (P=5), and ‘Fg’ refers to foreground features that combine all of the parts. Even as samples change, each pose token continues to focus on the same body part it represents. In addition, we show the visibility score of each pose token on the right side of the attention maps. The visibility score corresponding to an occluded body part is predicted to be very low (low scores are highlighted in red).

5.4.1 Attention maps

We present the attention maps of the pose tokens in Figure 6, visualized by averaging the attention maps across all CA layers. The visualization highlights that the pose tokens consistently concentrate on their respective distinctive body parts across various samples. This starkly contrasts with the attention maps in Figure 3, which are learned solely through the ReID loss. Figure 2 emphasizes that each prototype should aggregate information from the patch tokens based on their semantics, regardless of ID-specific appearances. Considering that the attention map in a CA layer represents a similarity matrix between the pose tokens and patch tokens, the visualization in Figure 6 demonstrates the successful resolution of this issue.

Furthermore, in instances where only a portion of a specific body part is occluded, PAFormer demonstrates the ability to selectively focus on the visible regions (Please refer to the Legs section of the second sample in the occluded part, in Figure 6). When a particular body part is fully occluded, PAFormer tends to concentrate on regions where corresponding body parts might be located (See Legs and Feet section of the first sample in the occluded part, in Figure 6). However, this is subsequently adjusted through visibility prediction.

5.4.2 Visibility scores

To further validate the accuracy of the predicted visibility scores, we also provide scores measured for samples, as shown in Figure 6. The scores in the figure are sigmoid-activated, clearly indicating that occluded body parts lead to notably lower visibility scores. Based on this, we measure distances between samples as described in (8), effectively mitigating the detrimental impact of occlusion.

Attempts to convert visibility scores into binary values of 0 or 1 resulted in a significant performance decrease. This highlights the advantage of depending on the model’s assessment of visibility, rather than entirely disregarding ambiguously visible regions (e.g., cases with visibility scores ranging between 0.2 and 0.8).

Method Ref Size Market-1501 Duke-reID OCC-Duke
mAP R1 mAP R1 mAP R1
PCB+RPP [1] ECCV18 256 81.6 93.8 69.2 83.3 - -
ISP [15] ECCV20 256 88.6 95.3 80.0 89.6 52.3 62.8
CBDB-Net [32] CSVT21 256 85.0 94.4 74.3 87.7 38.9 50.9
CDNet [33] CVPR21 256 86.0 95.1 76.8 88.6 - -
C2F [34] CVPR21 256 87.7 94.8 74.9 87.4 - -
PAT [11] CVPR21 256 88.0 95.4 78.2 88.8 53.6 64.5
TransReID [9] ICCV21 256 88.9 95.2 82.0 90.7 59.2 66.4
HAT [10] ACM21 256 89.5 95.6 81.4 90.4 - -
PFD [13] AAAI22 256 89.7 95.5 83.2 91.2 61.8 69.5
ResT-ReID [25] PRL22 256 88.2 95.3 80.6 90.0 51.9 59.6
DCAL [35] CVPR22 256 87.5 94.7 80.1 89.0 - -
ABDNet [4]+NFormer [8] CVPR22 256 93.0 95.7 85.7 90.6 - -
BPBReID [16] WACV23 256 87.0 95.1 78.3 89.6 54.1 66.1
DCFormer [14] AAAI23 256 90.4 96.0 - - - -
PAFormer - 256 90.8 96.0 83.3 92.1 59.9 67.0
ABDNet [4] ICCV19 384 88.3 95.6 78.6 89.0 - -
OH-former [36] Arxiv21 368 88.7 95.0 82.8 91.0 - -
AAformer [12] Arxiv21 384 87.7 95.4 80.0 90.1 58.2 67.0
TransReID [9] ICCV21 384 89.5 95.2 82.6 90.7 59.4 66.8
MGN [2]+AutoLoss [37] CVPR22 384 90.1 96.2 - - - -
BPBReID [16]+HRNet [38] WACV23 384 89.4 95.4 84.2 92.4 62.5 75.1
PAFormer - 384 90.9 96.1 84.2 92.5 60.4 66.4
Table 3: Performance comparison with state-of-the-art (SOTA) methods on ReID benchmarks. * denotes pose-estimation based methods. The highest and second-highest scores for each dataset are denoted by bold letters and underlines, respectively. Since TransReID [9] does not provide its performance on Occluded-Duke at the size of 384×128, we reproduced the results. Performance results are provided for inputs resized to 256×128 and to 368×128 or 384×128. Our PAFormer either outperforms existing SOTA models or shows performance on par with them.

5.5 Comparison with existing methods

Table 3 shows the performance of our PAFormer and other previous ReID methods. We adopt mean Average Precision (mAP) and Rank-1 score (R1) for evaluation. PAFormer demonstrates its strong capabilities by achieving state-of-the-art performance on various datasets, outperforming competitors and securing at least the second-place position. The performances on the holistic datasets and the occluded dataset are achieved with P=5 and P=6, respectively.

Market-1501. For an image size of 256×128, PAFormer achieves 90.8/96.0 for mAP and Rank-1, respectively. In terms of the Rank-1 score, PAFormer exhibits the highest score among the entire comparison group, and the mAP also surpasses 90, showing the second-highest performance following NFormer [8]. Since NFormer uses ABDNet [4], which has already demonstrated excellent performance in the ReID field, as its backbone, PAFormer appears to lag slightly behind in terms of mAP.

PAFormer exhibits strong performance even for 384×128 images, achieving 90.9 in mAP and 96.1 in Rank-1 score. It records the highest mAP and the second-highest Rank-1 score among the comparison groups, demonstrating excellent performance.

DukeMTMC-reID. Our method demonstrates a performance of 83.3 in mAP and 92.1 in Rank-1 on the DukeMTMC-ReID dataset with a resolution of 256×128. These scores correspond to the 2nd and 1st places, respectively, among the entire comparison group. While NFormer [8] demonstrates impressive performance in terms of mAP on holistic datasets, PAFormer exhibits superior results in terms of Rank-1 scores.

When applied to images resized to 384×128, PAFormer achieves a performance of 84.2 in mAP and 92.5 in Rank-1 score, which are both the highest scores among the comparisons. The high performance on both Market-1501 and DukeMTMC-reID datasets demonstrates the ReID ability of PAFormer on holistic datasets.

Occluded-Duke. To evaluate performance on occluded cases, we experiment on the Occluded-Duke dataset. For images of size 256×128, PAFormer achieves an mAP of 59.9 and a Rank-1 of 67.0, the second-highest performance on both metrics. Considering that PFD [13] uses the pose estimation model during inference, PAFormer's performance is competitive enough. While showing a slightly weaker performance on 384×128 images, PAFormer still achieves the second-highest mAP among the comparison groups. The decline in performance at the image size of 384×128 can be attributed to the growing inaccuracy of the pose estimation model's heatmaps on the occluded dataset as the image size increases.

5.6 Ablation Study

P Grouping strategy
3 {Head, Upper, Lower}
4 {Head, Upper, Legs, Feet}
5 {Head, Torso, Arms, Legs, Feet}
6 {Head, Upper torso, Lower torso, Arms, Legs, Feet}
Table 4: Grouping strategy of body parts for different P.
Figure 7: Performance comparison for different numbers of pose tokens, P.

5.6.1 The number of pose tokens, P

In order to analyze the impact of body part division methods on performance, we conduct an ablation study for different values of P. Table 4 illustrates the body part division strategies for various P, while Figure 7 displays the corresponding performance outcomes. We confirm that the optimal value of P for the holistic datasets is 5.

When the value of P is too small, it becomes hard to handle the occlusion problem. For instance, when P is 4, since the torso and arms are not separated, the features of the torso might be influenced even in cases where the arm region is occluded. Moreover, there may be disadvantages in situations where there is a significant difference in features between the arms and torso, for example, when the individual is wearing short sleeves or sleeveless clothing.

On the contrary, when P is excessively large, there is a risk of overfitting. Additionally, the influence of a particular body part may be duplicated during feature distance calculation, leading to an unintended amplification of the impact of that specific body part.

visibility score mAP R1
w 59.9 67.0
w/o 57.6 64.5
round 59.0 66.1
Table 5: Ablation of the visibility score.

5.6.2 Validity of the visibility score

In order to assess the effectiveness of the visibility prediction, we conduct an ablation study on visibility scores, shown in Table 5. We experiment with three scenarios for the visibility scores predicted by the model: 1) using the visibility score as-is, 2) excluding the visibility score, and 3) rounding the visibility score. All experiments are conducted on Occluded-Duke with P=6. Using the visibility score from PAFormer as-is yields the best results.

Figure 8: Failure cases, in which the visibility score or attention region is inaccurately formed. For instance, in the first sample, even though the legs are visible, the visibility score is measured to be significantly low. In the second sample, the model focuses on the wrong areas.

5.7 Limitation

PAFormer's training strategy hinges on the pose heatmaps output by a pose estimation model, revealing vulnerabilities in scenarios with two or more individuals, as depicted in Figure 8. This challenge is not so much a flaw in our methodology as a consequence of inaccuracies in generating the ground truth masks. By using models adept at multi-person pose estimation and leveraging precisely annotated data, PAFormer has the potential to achieve enhanced performance in such situations.

6 Conclusion

This paper introduces an effective model for partial ReID, named PAFormer. We tackle the shortcomings observed in existing partial ReID models by proposing PAFormer, which is aware of the anatomical aspects of body parts. Our approach uses learnable pose tokens to estimate the correlation between patch tokens and different body parts. We then aggregate information from the patch tokens based on the attention maps generated by the pose tokens. Furthermore, we introduce a visibility predictor to effectively handle occlusion issues. PAFormer demonstrates state-of-the-art performance on well-known ReID datasets.

References

  • Sun et al. [2018] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline), in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 480–496.
  • Wang et al. [2018] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, Learning discriminative features with multiple granularities for person re-identification, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 274–282.
  • Sharma et al. [2021] C. Sharma, S. R. Kapil, D. Chapman, Person re-identification with a locally aware Transformer, arXiv preprint arXiv:2106.03720 (2021).
  • Chen et al. [2019] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, Z. Wang, ABD-Net: Attentive but diverse person re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8351–8361.
  • Zhao et al. [2017] L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person re-identification, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 3219–3228.
  • Li et al. [2018] W. Li, X. Zhu, S. Gong, Harmonious attention network for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2285–2294.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).
  • Wang et al. [2022] H. Wang, J. Shen, Y. Liu, Y. Gao, E. Gavves, NFormer: Robust person re-identification with neighbor Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7297–7307.
  • He et al. [2021] S. He, H. Luo, P. Wang, F. Wang, H. Li, W. Jiang, TransReID: Transformer-based object re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15013–15022.
  • Zhang et al. [2021] G. Zhang, P. Zhang, J. Qi, H. Lu, HAT: Hierarchical aggregation Transformers for person re-identification, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 516–525.
  • Li et al. [2021] Y. Li, J. He, T. Zhang, X. Liu, Y. Zhang, F. Wu, Diverse part discovery: Occluded person re-identification with part-aware Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2898–2907.
  • Zhu et al. [2021] K. Zhu, H. Guo, S. Zhang, Y. Wang, G. Huang, H. Qiao, J. Liu, J. Wang, M. Tang, AAformer: Auto-aligned Transformer for person re-identification, arXiv preprint arXiv:2104.00921 (2021).
  • Wang et al. [2022] T. Wang, H. Liu, P. Song, T. Guo, W. Shi, Pose-guided feature disentangling for occluded person re-identification based on Transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 2540–2549.
  • Li et al. [2023] W. Li, C. Zou, M. Wang, F. Xu, J. Zhao, R. Zheng, Y. Cheng, W. Chu, DC-Former: Diverse and compact Transformer for person re-identification, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 1415–1423. URL: https://ojs.aaai.org/index.php/AAAI/article/view/25226. doi:10.1609/aaai.v37i2.25226.
  • Zhu et al. [2020] K. Zhu, H. Guo, Z. Liu, M. Tang, J. Wang, Identity-guided human semantic parsing for person re-identification, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, 2020, pp. 346–363.
  • Somers et al. [2023] V. Somers, C. De Vleeschouwer, A. Alahi, Body part-based representation learning for occluded person re-identification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1613–1623.
  • Zheng et al. [2015] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1116–1124.
  • Zheng et al. [2017] Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 3754–3762.
  • Miao et al. [2019] J. Miao, Y. Wu, P. Liu, Y. Ding, Y. Yang, Pose-guided feature alignment for occluded person re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 542–551.
  • Luo et al. [2019] H. Luo, Y. Gu, X. Liao, S. Lai, W. Jiang, Bag of tricks and a strong baseline for deep person re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0–0.
  • Kalayeh et al. [2018] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, M. Shah, Human semantic parsing for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1062–1071.
  • Xu et al. [2018] J. Xu, R. Zhao, F. Zhu, H. Wang, W. Ouyang, Attention-aware compositional network for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2119–2128.
  • Cuturi [2013] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in neural information processing systems 26 (2013).
  • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • Chen et al. [2022] Y. Chen, S. Xia, J. Zhao, Y. Zhou, Q. Niu, R. Yao, D. Zhu, D. Liu, ResT-ReID: Transformer block-based residual learning for person re-identification, Pattern Recognition Letters 157 (2022) 90–96.
  • Abnar and Zuidema [2020] S. Abnar, W. Zuidema, Quantifying attention flow in Transformers, arXiv preprint arXiv:2005.00928 (2020).
  • Carion et al. [2020] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with Transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
  • Kreiss et al. [2019] S. Kreiss, L. Bertoni, A. Alahi, PifPaf: Composite fields for human pose estimation, CoRR abs/1903.06593 (2019). URL: http://arxiv.org/abs/1903.06593. arXiv:1903.06593.
  • Schroff et al. [2015] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • Gong et al. [2021] Y. Gong, Z. Zeng, L. Chen, Y. Luo, B. Weng, F. Ye, A person re-identification data augmentation method with adversarial defense effect, arXiv preprint arXiv:2101.08783 (2021).
  • Zhong et al. [2020] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 13001–13008.
  • Tan et al. [2021] H. Tan, X. Liu, Y. Bian, H. Wang, B. Yin, Incomplete descriptor mining with elastic loss for person re-identification, IEEE Transactions on Circuits and Systems for Video Technology 32 (2021) 160–171.
  • Li et al. [2021] H. Li, G. Wu, W.-S. Zheng, Combined depth space based architecture search for person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6729–6738.
  • Zhang et al. [2021] A. Zhang, Y. Gao, Y. Niu, W. Liu, Y. Zhou, Coarse-to-fine person re-identification with auxiliary-domain classification and second-order information bottleneck, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 598–607.
  • Zhu et al. [2022] H. Zhu, W. Ke, D. Li, J. Liu, L. Tian, Y. Shan, Dual cross-attention learning for fine-grained visual categorization and object re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4692–4702.
  • Chen et al. [2021] X. Chen, C. Xu, Q. Cao, J. Xu, Y. Zhong, J. Xu, Z. Li, J. Wang, S. Gao, OH-Former: Omni-relational high-order transformer for person re-identification, arXiv preprint arXiv:2109.11159 (2021).
  • Gu et al. [2022] H. Gu, J. Li, G. Fu, C. Wong, X. Chen, J. Zhu, AutoLoss-GMS: Searching generalized margin-based softmax loss function for person re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4744–4753.
  • Wang et al. [2020] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE transactions on pattern analysis and machine intelligence 43 (2020) 3349–3364.