
PAFormer : Part Aware Transformer for Person Re-identification

Hyeono Jung, Jangwon Lee, Jiwon Yoo, Dami Ko, Gyeonghwan Kim
Abstract

Within the domain of person re-identification (ReID), partial ReID methods are considered mainstream, aiming to measure feature distances through comparisons of body parts between samples. In practice, however, previous methods often lack sufficient awareness of the anatomical aspects of body parts, resulting in a failure to capture features of the same body parts across different samples. To address this issue, we introduce Part Aware Transformer (PAFormer), a pose estimation based ReID model that can perform precise part-to-part comparison. In order to inject part awareness, we introduce learnable parameters called ‘pose tokens’, which estimate the correlation between each body part and partial regions of the image. Notably, at the inference phase, PAFormer operates without the additional modules related to body part localization that are commonly used in previous ReID methodologies leveraging pose estimation models. Additionally, leveraging the enhanced awareness of body parts, PAFormer employs a learning-based visibility predictor to estimate the degree of occlusion for each body part. We also introduce a teacher forcing technique using ground truth visibility scores, which enables PAFormer to be trained only with visible parts. A set of extensive experiments shows that our method outperforms existing approaches on well-known ReID benchmark datasets.

keywords:
Person Re-identification, Vision Transformer, Partial ReID
journal: Pattern Recognition

Dept. of Electronic Engineering, Sogang University, 35 Baekbum-ro, Mapo-gu, Seoul 04107, Rep. of Korea (https://mmi.sogang.ac.kr/mmi/)

Highlights

Previous partial ReID methods do not perform genuine part-to-part comparison

We introduce a pose estimation based ReID model, PAFormer

No body part localization module is required at the inference phase

A learned visibility predictor alleviates the occlusion problem

1 Introduction

Person re-identification (ReID) is a type of image retrieval task that aims to find samples with the same ID as a given query image from a gallery set. Specifically, the methodology termed part-based ReID or partial ReID aims to achieve identification by comparing distinctive features extracted from specific body parts. Following the success of stripe-based methods [1, 2, 3], which compute feature distances between spatially aligned regions after dividing images according to a predetermined division rule, partial ReID methodologies have emerged in earnest.

Recently, methods using attention mechanisms for localizing body parts have been proposed [4, 5, 6], with ViT [7]-based methods [8, 9, 10] being encompassed in this category. These approaches typically employ part prototypes [11, 12, 13, 14], which are expected to represent specific body parts, to capture partial features. They use a loss term that combines cross-entropy loss and triplet loss. However, this loss term is more inclined towards guiding the prototypes to excel in ID prediction within the training set than towards learning awareness of human anatomy. Consequently, there is a significant risk of overfitting, potentially leading the model to capture only dominant information from individual training samples.

Figure 1: Left) Patch embedding in ViT: CNN filters of the same size as the patch are applied, extracting visual representation features. Right) The generated patch embeddings contain both body part-related channels and unrelated channels.

We elaborate on the aforementioned issues, frequently observed in attention-based ReID, with its variant, the ViT-based model. In ViT-based models, each part prototype gathers information based on the cosine similarity with patch tokens, allowing all channels of the token to contribute to the result. Therefore, if there are channels in the token unrelated to the human body, the resulting similarity map may not accurately represent body parts. As in Figure 1, if there exists a channel related to color, the black lower body of the first sample might show a higher similarity with the black upper body of the second sample than with the gray lower body of the latter.

Figure 2: Each part token should be projected close to the patch tokens corresponding to the body part it represents in the embedding space. If body part-unrelated features are involved in the similarity computation of ViT, there is a risk of projecting based on specific appearances rather than body parts.

In partial ReID, we expect the projection of prototypes and patch tokens in the embedding space to occur as depicted in Figure 2. Many methods leveraging human body information [13, 15, 16] have been proposed to achieve this behavior. However, these methods often entail challenges, such as the reliance on external pose estimation models even during the inference phase, or the introduction of additional body part localization modules beyond the modules capturing partial features.

To address the deficiency of human anatomical awareness in attention-based partial ReID models and to utilize human body information efficiently, we propose Part Aware Transformer (PAFormer), a Transformer-based structure that incorporates pose heatmaps obtained from a pose estimation model. Since PAFormer provides direct supervision with pose heatmap information to the attention maps of the cross-attention mechanism built on the vanilla ViT, it is more efficient than other pose-based ReID methods that rely on supplementary modules.

In detail, we introduce a novel learnable parameter called ‘pose token’ to estimate the association between patch tokens and body parts. Unlike the part tokens in existing methods, the pose token is explicitly designed to represent a specific body part. Consequently, partial features are extracted by aggregating patch tokens based on the probabilities estimated by the pose tokens. Additionally, we introduce a learnable visibility predictor that takes the output pose tokens as its input, in contrast to existing methods that often rely on non-learnable visibility prediction strategies involving thresholding of externally or internally estimated heatmaps.

Our contributions are summarized as follows.

  • 1.

    We point out that the lack of human anatomical awareness in previous partial ReID methods hinders the achievement of the fundamental goals of partial ReID.

  • 2.

    We introduce PAFormer, which can perform part-to-part comparison by precisely localizing human body parts. PAFormer introduces a novel learnable parameter called ‘pose token’ to estimate the association between body parts and patch tokens. Based on the predicted association probabilities, PAFormer captures partial features and performs part-to-part comparison.

  • 3.

    We design a visibility predictor that takes pose tokens, designed to represent each body part, as input. This predictor estimates the degree of occlusion for specific body parts. By utilizing the estimated visibility scores, we compute feature distances between samples to effectively tackle occlusion challenges.

  • 4.

    PAFormer achieves state-of-the-art performance on various ReID benchmark datasets including Market-1501 [17], DukeMTMC-ReID [18], and Occluded-Duke [19]. PAFormer outperforms existing methods and demonstrates superior performance in ReID tasks.

2 Related works

2.1 CNN based ReID methods

Prior to the emergence of the Vision Transformer, almost all ReID approaches were based on CNNs. BoT [20] proposed using a CNN backbone to extract global features and applying cross-entropy loss and triplet loss as supervision. Various subsequent studies built upon this ReID loss have also been proposed. However, because of the challenges faced by approaches relying solely on global features, such as occlusion, part-based approaches have emerged.

In PCB [1], stripe-based methods are proposed to measure the distances between samples within the same region. Zhao et al. [5] introduce CNN-based attention networks, allowing the model to autonomously find important regions in the image for ReID. ABDNet [4] builds upon this attention network concept and applies orthogonality regularization to each channel, enabling the extraction of information from various body parts. SPReID [21] employs a human semantic parsing model to aggregate features only from the regions corresponding to each body part. AACN [22] and BPBReID [16] directly supervise the formation of attention maps by utilizing pose estimation models. These approaches suggest explicit answers on how attention maps should be formed.

2.2 Transformer based ReID methods

After the introduction of TransReID [9], which utilizes a jigsaw patch module to capture local features, many Transformer-based ReID methods have emerged. LA Transformer [3] implements stripe-based methods on the ViT architecture. PAT [11] adopts an encoder-decoder structure in the Transformer and uses a cosine dissimilarity loss to encourage each part token to gather information from different body parts. AAformer [12] enhances the distinctiveness of each token by physically limiting the number of patch tokens that can interact with each part token using the Optimal Transport algorithm [23]. NFormer [8] effectively removes noise by performing self-attention only between reciprocal neighbor patch tokens. HAT [10] utilizes multi-scale information from a ResNet [24] backbone to aggregate hierarchical information. ResT-ReID [25] combines attention-guided graph convolutional networks with a Transformer to find correlations between body parts. PFD [13] leverages an embedding of pose heatmaps as additional information for Transformer-based models.

3 Problem Setting

Let us assume that x_i represents the i-th image, and y_i represents the ID label of that image. For a training set T = {x_i, y_i}, a common partial ReID model is trained with the ReID loss, a combination of cross-entropy and triplet loss. These terms can be seen as a loss for solving the classification problem of determining whether x_i belongs to y_i. With this loss, the most straightforward approach during model training would be to reduce the loss by focusing on the information from the most discriminative regions associated with each ID, which can potentially cause the model to be overfitted to the classification problem on the training set T. Therefore, prior approaches lack the assurance that a single part prototype captures information from a consistent body region across all samples and may not operate as explicitly ‘part-based’ as they intended; instead, they could operate more as ‘attribute-based’ (e.g., whether the image contains the color yellow).

Figure 3: Attention maps of a vanilla ViT-based ReID model: we can observe that the first sample focuses on the head, the second sample on the lower body region, and the last sample on the upper body region. This demonstrates that a partial ReID model trained solely on the ReID loss is incapable of performing part-to-part comparisons effectively in practice.

To support the aforementioned claim, we visualize attention maps of the vanilla ViT, which can be considered a partial ReID model with a single part prototype, using attention rollout [26]. As shown in Figure 3, we observe that the model focuses on distinct body parts across different samples. However, since the fundamental goal of partial ReID is to perform precise comparison between the same body parts, we expect the highly-focused area in the attention maps to be formed on the same body parts across the samples. Highlighting the limitations of current approaches in the context of part-to-part matching, we propose PAFormer in the following section, which enables each part prototype to be aware of a distinct human body part.

4 PAFormer

Figure 4: Pipeline of PAFormer. We adopt pose tokens to estimate the association between body parts and patch tokens. Partial features are generated by aggregating the patch tokens, refined during the self-attention process, based on the predicted probabilities. Additionally, the output pose tokens pass through a visibility predictor to infer visibility scores.

The overall pipeline of our PAFormer is depicted in Figure 4. From section 4.1 to section 4.4, we explain each component of PAFormer and elucidate the principles of the learning process. Section 4.5 describes the visibility predictor proposed to tackle occlusion issues, while section 4.6 introduces a modification of the ReID loss based on pseudo ground truth visibility scores. Section 4.7 then covers the total loss function and the equation used to measure feature distances during inference. Finally, we analyze the computational efficiency of PAFormer in section 4.8.

4.1 Pose tokens

Our model employs a new learnable prototype named ‘pose token’. The key advantage of the pose token is that it learns to extract features of the same body part across diverse samples. This shifts the main focus from aligning IDs in the training set to estimating correlations between patch tokens and body parts, which significantly aids part-to-part comparison. Furthermore, the use of clear guidance about human body parts ensures that each pose token is trained to represent a specific, distinct body part. This attribute also enables the pose token to function as a valuable indicator for detecting occlusions of specific body parts. Detailed explanations regarding the training and application of the pose token are provided in the following subsections.

Figure 5: The visualization depicts the similarity between a query token (highlighted with a red border) and the other patch tokens when the vanilla ViT is trained with the ReID loss. Despite the consistent location of the query token, the patch tokens with high similarity vary based on the semantics of the query token. For instance, in the first sample, a high similarity can be observed with tokens corresponding to the lower body, while in the second sample, tokens associated with the arms show higher similarity.

4.2 Feature Refinement

We expect each pose token to demonstrate a high degree of similarity only with its associated patch tokens. We aim to leverage this similarity in the forthcoming cross-attention and aggregation layers, as detailed in subsequent subsections. To achieve this, a prerequisite step is required, which enhances the similarity among patch tokens that share the same anatomical semantics. To accomplish this, we preprocess the patch tokens using self-attention (SA), a mechanism that has already been demonstrated in various studies to perform this task effectively. In Figure 5, it is evident that patch tokens sharing anatomical semantics converge in the vanilla ViT-based ReID model.

Additionally, by utilizing the self-attention mechanism, we can harness the robust global feature extraction capabilities of ViT. Similar to the vanilla ViT, we employ a class token to extract global features and apply ReID loss on it:

L_{ReID}^{CLS}=L_{ID}(CLS)+L_{tri}(CLS) (1)

where L_{ID} and L_{tri} are the cross-entropy loss and the triplet loss, respectively. Also, we incorporate an SA layer among the pose tokens, with the objective of enabling PAFormer to learn associations among prototypes, akin to DETR [27] or PAT [11].
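For illustration, a minimal PyTorch-style sketch of the ReID loss in (1) is given below. The classifier head id_head and the margin value are assumptions, and the batch-hard triplet mining follows the common formulation [29] in simplified form, not necessarily the exact implementation used in PAFormer.

import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    # Pairwise Euclidean distances within the mini-batch.
    dist = torch.cdist(feats, feats, p=2)                    # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # (B, B) bool
    # Hardest positive: farthest sample sharing the same ID.
    d_ap = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different ID.
    d_an = (dist + same.float() * 1e6).min(dim=1).values
    return F.relu(d_ap - d_an + margin).mean()

def reid_loss_cls(cls_token, labels, id_head):
    # Eq. (1): cross-entropy on ID logits plus triplet loss on the class token.
    logits = id_head(cls_token)                              # (B, num_ids)
    return F.cross_entropy(logits, labels) + batch_hard_triplet(cls_token, labels)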

4.3 Part Awareness

In order to imbue PAFormer with part awareness, during training we create ground truth heatmaps by post-processing the initial keypoint heatmaps obtained from a pose estimation model, PifPaf [28]. The creation process of these ground truth heatmaps is detailed in section 5.3. The ground truth heatmaps directly guide how the attention maps between the pose tokens and patch tokens in the cross-attention (CA) layer should be formed. This guidance ensures that the pose tokens accurately localize body parts. The loss function required for this process is as follows:

L_{pose}={1\over P}\sum_{P}MSE(\overline{Attn}_{pose},GT) (2)

where P is the number of pose tokens, MSE denotes the mean square error, and GT represents the ground truth heatmap. \overline{Attn}_{pose} is the average of all Attn_{pose} maps (12 layers × 12 heads) present in the CA layers of PAFormer. In contrast to existing methods, PAFormer allows prototypes to concentrate on identifying patch tokens corresponding to specific body parts.
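A minimal sketch of (2) is shown below, assuming the pose-token attention maps from all CA layers and heads are stacked into a tensor of shape (layers*heads, B, P, N) and the ground truth heatmaps are flattened to (B, P, N); the tensor names are illustrative.

import torch.nn.functional as F

def pose_loss(attn_pose_all, gt_heatmaps):
    # Average the attention maps over all CA layers and heads (12 x 12 in PAFormer).
    attn_avg = attn_pose_all.mean(dim=0)          # (B, P, N)
    # Mean square error against the ground truth heatmaps; F.mse_loss averages
    # over all elements, matching Eq. (2) up to a constant factor.
    return F.mse_loss(attn_avg, gt_heatmaps)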

While some methods which utilize a pose estimation model have already been proposed, they often use a separate body part localization module within their models or continue to rely on an external pose estimation model during inference. However, PAFormer takes a different approach by learning how attention maps should be formed during the cross-attention process. As a result, it does not require pose heatmaps during inference.

The underlying principle of learning part awareness through a simple mean square error is as follows:

Attn_{pose}=(X_{pose}\cdot W_{Q})\cdot(X_{patch}\cdot W_{K})^{T}=X_{pose}\cdot(W_{Q}\cdot W_{K}^{T})\cdot X_{patch}^{T}=X_{pose}\cdot W\cdot X_{patch}^{T} (3)

where W_{Q} and W_{K} refer to the fully connected layers performing the query and key transforms, respectively, and X_{pose} and X_{patch} represent the pose tokens and the patch tokens, respectively. Expanding the expression, the product of W_{Q} and W_{K}^{T} can be expressed as a single weight matrix, denoted as W. The MSE loss supervises PAFormer so that W accentuates channels associated with specific body parts. Consequently, the dot product of paired patch tokens and pose tokens increases, causing the attention score to rise selectively only for patches belonging to the specific body part, as intended.
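The factorization in (3) can also be checked numerically; the small sketch below assumes bias-free query and key projections and randomly chosen token counts.

import torch

d, P, N = 64, 5, 128
W_Q = torch.randn(d, d, dtype=torch.float64)
W_K = torch.randn(d, d, dtype=torch.float64)
X_pose = torch.randn(P, d, dtype=torch.float64)    # pose tokens
X_patch = torch.randn(N, d, dtype=torch.float64)   # patch tokens

# Two-step form: project to queries/keys, then take the dot product.
attn_two_step = (X_pose @ W_Q) @ (X_patch @ W_K).T
# Collapsed form: a single weight matrix W = W_Q @ W_K^T acts between tokens.
attn_one_step = X_pose @ (W_Q @ W_K.T) @ X_patch.T
assert torch.allclose(attn_two_step, attn_one_step)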

4.4 Partial Features

Once the association of each pose token with specific patch tokens is estimated in the cross-attention layer, we leverage these probabilities to aggregate the refined values of the patch tokens, thereby extracting partial features. This stage is denoted as an aggregation layer. The aggregated partial features z_p can be expressed as:

z_{p}=\overline{Attn}_{pose}^{p}\cdot X_{patch} (4)

where \overline{Attn}_{pose}^{p} is the p-th pose token's attention map averaged across all heads from a CA layer. In typical ViT-based methods, a value transformation by a fully-connected network is applied to the patch tokens when creating partial features. In the aggregation layer, however, no separate transformation is applied, because the information from the patch tokens should be accepted directly based on the association probabilities. Similar to the conventional Transformer architecture, as each layer is traversed, the partial features z_p are also updated by layer normalization, feed-forward networks, and residual connections.
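A minimal sketch of the aggregation in (4) is given below, assuming an averaged cross-attention map of shape (B, P, N) and patch tokens of shape (B, N, d); as described above, no value transform is applied in this sketch.

import torch

def aggregate_partial_features(attn_pose, x_patch):
    # attn_pose: (B, P, N) association probabilities, x_patch: (B, N, d).
    # Each partial feature is a probability-weighted sum of the patch tokens.
    return torch.bmm(attn_pose, x_patch)          # (B, P, d)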

Relying solely on the pose heatmap for model guidance can be vulnerable if the heatmap is inaccurately generated. To mitigate this vulnerability, we supplement the guidance by incorporating the ReID loss for partial features:

L_{ReID}^{part}={1\over P}\sum_{i=1}^{P}(L_{ID}(z_{i}^{L})+L_{tri}(z_{i}^{L})) (5)

where z_{i}^{L} denotes the aggregated partial features from the i-th pose token at the last PAFormer block. We minimize attention to regions that do not correspond to body parts through L_{pose}. Subsequently, with the assistance of L_{ReID}^{part}, we perform additional filtering, addressing areas that may have been overlooked by L_{pose}. However, to prevent potential issues stemming from the dominant influence of the ReID loss, we assign a substantially higher weight to L_{pose} than to L_{ReID}^{part}. This strategic weighting ensures that the ReID loss functions merely as a complementary element.

4.5 Visibility Predictor

While many other methods that consider occlusion rely on the confidence score output by the body part localization module to address this issue, PAFormer employs a learning-based prediction network for a more sophisticated treatment of the problem, as in Figure 4. This is possible because the design of the preceding stages ensures that each pose token serves as a representative of a body part. For the ground truth visibility score, we assume that a body part is invisible if the maximum value in the pose heatmap does not exceed θ_p. As the pose estimation model yields varying confidence scores for different body parts, θ_p is set differently for each body part. To train the visibility predictor, we employ L_{vis}, a binary cross-entropy loss between the obtained output and the ground truth:

L_{vis}=-v_{GT}\cdot\log{v}-(1-v_{GT})\cdot\log{(1-v)} (6)

where v denotes the predicted visibility score, and v_{GT} denotes the ground truth visibility score from the pose estimation model.
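A minimal sketch of the visibility predictor and the loss in (6) is given below, assuming each output pose token is mapped to a single logit by a small linear head; the actual head architecture in PAFormer may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisibilityPredictor(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Linear(dim, 1)              # one visibility logit per pose token

    def forward(self, pose_tokens):                # (B, P, d)
        return self.head(pose_tokens).squeeze(-1)  # logits, (B, P)

def visibility_loss(logits, v_gt):
    # Eq. (6): binary cross-entropy against pseudo ground-truth visibility labels.
    return F.binary_cross_entropy_with_logits(logits, v_gt.float())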

4.6 Teacher forcing based on visibility score

It is essential to refrain from calculating the ID loss or triplet loss for occluded body parts. To guarantee this, we adapt the ReID loss by incorporating the ground truth visibility score as a form of teacher forcing, so that occluded parts are excluded from the learning process. In particular, during hard triplet mining [29], to deter the selection of occluded regions, the distance is multiplied by 0 for positive cases and by an extremely large value for negative cases.
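A minimal sketch of this teacher forcing during batch-hard triplet mining for a single body part is shown below, assuming a (B, B) matrix of part feature distances and ground truth visibility labels in {0, 1}; variable names are illustrative.

import torch
import torch.nn.functional as F

def masked_hard_triplet(part_dist, labels, v_gt, margin=0.3):
    # part_dist: (B, B) distances for one body part; v_gt: (B,) GT visibility (0 or 1).
    visible_pair = v_gt.unsqueeze(0) * v_gt.unsqueeze(1)      # 1 only if both samples are visible
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    # Occluded pairs: the distance is multiplied by 0 for positives ...
    d_pos = part_dist * visible_pair
    # ... and by an extremely large value for negatives, so they are never mined.
    d_neg = part_dist * (visible_pair + (1.0 - visible_pair) * 1e6)
    d_ap = (d_pos * same).max(dim=1).values                   # hardest visible positive
    d_an = (d_neg + same * 1e6).min(dim=1).values             # hardest visible negative
    return F.relu(d_ap - d_an + margin).mean()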

4.7 Objective Function and Inference

The overall loss L is calculated as follows:

L=L_{ReID}^{CLS}+L_{ReID}^{part}+\lambda L_{pose}+L_{vis} (7)

λ is the weight for L_{pose} and is heuristically set to 10. All modules in PAFormer are trained jointly.

During the inference phase, PAFormer calculates distances as follows:

d^{i,j}=d_{CLS}^{i,j}+{\sum_{p=1}^{P}d_{p}^{i,j}v_{p}^{i}v_{p}^{j}\over\sum_{p=1}^{P}v_{p}^{i}v_{p}^{j}} (8)

where d^{i,j} is the distance between the i-th and j-th samples, and v denotes the predicted visibility score. Among the subscripts of d, ‘CLS’ refers to the class token, and p indicates the specific pose token's identifier.
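A minimal sketch of (8) for a single pair of samples is shown below, assuming the per-part distances and predicted visibility scores are 1-D tensors of length P; the small epsilon guarding against the case where no part is jointly visible is an assumption of this sketch.

import torch

def pair_distance(d_cls, d_part, v_i, v_j, eps=1e-6):
    # d_part: (P,) per-part distances; v_i, v_j: (P,) predicted visibility scores.
    w = v_i * v_j                                 # joint visibility weights
    return d_cls + (d_part * w).sum() / (w.sum() + eps)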

4.8 Time complexity of PAFormer

Generally, the time complexity of the Transformer is known to be O(N^2 d), where N and d are the number of tokens and channels, respectively. Since PAFormer performs self-attention similarly, it maintains the same O(N^2 d) time complexity. Additionally, the pose token self-attention process contributes O(P^2 d), and the cross-attention process adds O(PNd), resulting in an overall time complexity of O((N^2 + P^2 + PN)d). Given that N is usually much larger than P (N: three-digit, P: one-digit), the additional burden introduced by PAFormer is considered to be negligible.

5 Experiments

5.1 Datasets

Dataset #image #ID #cam
Market-1501 [17] 32,668 1,501 6
DukeMTMC-ReID [18] 36,441 1,404 8
Occluded-Duke [19] 36,441 1,404 8
Table 1: Statistics of ReID datasets used in experiments.

Three widely used ReID datasets are chosen for our experiments and their statistics are shown in Table 1. Images in every dataset are resized to 256×128 or 384×128. Then, the training images are augmented with random horizontal flipping, cropping, padding, grayscale [30] and erasing [31]. Among these methods, we apply all except random erasing and grayscale augmentation to the pose mask. Additionally, for regions affected by random erasing in the original data, we set the mask value to 0.

5.2 Implementation Details

We adopt an ImageNet pre-trained ViT-B/16 as the backbone. The model is trained with the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate is set to 0.008 with cosine decay, and the batch size is set to 64. We also use a sliding window (S=12) and side information embedding [9]. Both training and testing processes are implemented with two NVIDIA GeForce RTX 4080 GPUs. We conduct training for 320 epochs and evaluate the performance every 20 epochs. Among these evaluations, we choose the one with the highest performance.
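A minimal sketch of the training setup described above (SGD, momentum 0.9, weight decay 1e-4, base learning rate 0.008, cosine decay, 320 epochs, evaluation every 20 epochs) is given below; model, train_one_epoch, and evaluate are hypothetical placeholders rather than parts of the released implementation.

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.008,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=320)

for epoch in range(320):
    train_one_epoch(model, optimizer)     # hypothetical training loop
    scheduler.step()
    if (epoch + 1) % 20 == 0:
        evaluate(model)                   # hypothetical evaluation routine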

Head Torso Arms Legs Feet
Market-1501 0.6 0.7 0.85 0.8 0.7
Duke-reID 0.6 0.8 0.85 0.85 0.75
OCC-Duke 0.6 0.8 0.85 0.85 0.75
Table 2: Values of θ_p for various datasets.

5.3 Ground truth pose heatmap and visibility score

We use PifPaf [28] to obtain pose heatmaps from the input image. However, instead of using the heatmaps predicted by PifPaf directly, we employ several modification steps to adapt them as ground truth heatmaps.

Firstly, we apply random erasing augmentation [31] to the pose heatmaps as well. The probabilities corresponding to the erased areas in the original image are all set to 0.

Secondly, we combine heatmaps of small fragments such as keypoints and joints to form masks for various body parts. During this process, we regard heatmaps from PifPaf with a maximum confidence score below θ_p as noise and ignore them. The value of θ_p is set differently for each body part and for each dataset, as in Table 2.

Thirdly, we normalize the combined heatmaps. Since our objective is to avoid focusing solely on discriminative parts, we assume that the influence of all patch tokens belonging to a specific body part is uniform. Consequently, each value in the ground truth mask is either 0 or 1/K, where K represents the count of non-zero values in the mask. Following this transformation, we normalize the mask by dividing it by the sum of its values, ensuring a total sum of 1.

The ground truth visibility score is also determined by θ_p. Pseudo labels are assigned as ‘visible’ only for body parts with a maximum confidence score greater than or equal to θ_p.
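A minimal sketch of the construction above for a single body part is given below, assuming a combined part heatmap of shape (H, W) and a per-part threshold θ_p from Table 2; treating every non-zero heatmap value as belonging to the part is an assumption of this sketch.

import torch

def gt_mask_and_visibility(heatmap, theta_p):
    # heatmap: (H, W) combined keypoint/joint heatmap for one body part.
    if heatmap.max() < theta_p:                 # low confidence: treat as noise / occluded
        return torch.zeros_like(heatmap), 0.0
    binary = (heatmap > 0).float()              # uniform influence for all covered patches
    mask = binary / binary.sum()                # non-zero entries become 1/K, total sum is 1
    return mask, 1.0                            # part is labeled as visible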

5.4 Visualization

Figure 6: Attention maps and visibility scores of pose tokens. The color of the attention map goes from blue to red as the attention score increases. The example divides a person into five body parts (P=5), and ‘Fg’ refers to foreground features that combine all of the parts. Even as samples change, each pose token continues to focus on the same body part it represents. In addition, we show the visibility score of each pose token on the right side of the attention maps. The visibility score corresponding to an occluded body part is predicted to be very low (low scores are highlighted in red).

5.4.1 Attention maps

We present the attention maps of the pose tokens in Figure 6, visualized by averaging the attention maps across all CA layers. The visualization highlights that the pose tokens consistently concentrate on their respective distinctive body parts across various samples. This starkly contrasts with the attention maps in Figure 3, which are learned solely through the ReID loss. Figure 2 emphasizes that each prototype should aggregate information from the patch tokens based on their semantics, regardless of ID-specific appearances. Considering that the attention map in a CA layer represents a similarity matrix between the pose tokens and patch tokens, the visualization in Figure 6 demonstrates the successful resolution of this issue.

Furthermore, in instances where only a portion of a specific body part is occluded, PAFormer demonstrates the ability to selectively focus on the visible regions (Please refer to the Legs section of the second sample in the occluded part, in Figure 6). When a particular body part is fully occluded, PAFormer tends to concentrate on regions where corresponding body parts might be located (See Legs and Feet section of the first sample in the occluded part, in Figure 6). However, this is subsequently adjusted through visibility prediction.

5.4.2 Visibility scores

To further validate the accuracy of the predicted visibility scores, we also provide scores measured for samples, as shown in Figure 6. The scores in the figure are sigmoid-activated, clearly indicating that occluded body parts lead to notably lower visibility scores. Based on this, we measure distances between samples as described in (8), effectively mitigating the detrimental impact of occlusion.

Attempts to convert visibility scores into binary values of 0 or 1 resulted in a significant performance decrease. This highlights the advantage of depending on the model’s assessment of visibility, rather than entirely disregarding ambiguously visible regions (e.g., cases with visibility scores ranging between 0.2 and 0.8).

Method Ref Size Market-1501 Duke-reID OCC-Duke
mAP R1 mAP R1 mAP R1
PCB+RPP [1] ECCV18 256 81.6 93.8 69.2 83.3 - -
ISP [15] ECCV20 256 88.6 95.3 80.0 89.6 52.3 62.8
CBDB-Net [32] CSVT21 256 85.0 94.4 74.3 87.7 38.9 50.9
CDNet [33] CVPR21 256 86.0 95.1 76.8 88.6 - -
C2F [34] CVPR21 256 87.7 94.8 74.9 87.4 - -
PAT [11] CVPR21 256 88.0 95.4 78.2 88.8 53.6 64.5
TransReID [9] ICCV21 256 88.9 95.2 82.0 90.7 59.2 66.4
HAT [10] ACM21 256 89.5 95.6 81.4 90.4 - -
PFD [13] AAAI22 256 89.7 95.5 83.2 91.2 61.8 69.5
ResT-ReID [25] PRL22 256 88.2 95.3 80.6 90.0 51.9 59.6
DCAL [35] CVPR22 256 87.5 94.7 80.1 89.0 - -
ABDNet [4]+NFormer [8] CVPR22 256 93.0 95.7 85.7 90.6 - -
BPBReID [16] WACV23 256 87.0 95.1 78.3 89.6 54.1 66.1
DCFormer [14] AAAI23 256 90.4 96.0 - - - -
PAFormer - 256 90.8 96.0 83.3 92.1 59.9 67.0
ABDNet [4] ICCV19 384 88.3 95.6 78.6 89.0 - -
OH-former [36] Arxiv21 368 88.7 95.0 82.8 91.0 - -
AAformer [12] Arxiv21 384 87.7 95.4 80.0 90.1 58.2 67.0
TransReID [9] ICCV21 384 89.5 95.2 82.6 90.7 59.4 66.8
MGN [2]+AutoLoss [37] CVPR22 384 90.1 96.2 - - - -
BPBReID [16]+HRNet [38] WACV23 384 89.4 95.4 84.2 92.4 62.5 75.1
PAFormer - 384 90.9 96.1 84.2 92.5 60.4 66.4
Table 3: Performance comparison with state-of-the-art (SOTA) methods on ReID benchmarks. * denotes pose-estimation based methods. The highest and second-highest scores for each dataset are denoted by bold letters and underlines, respectively. Since TransReID [9] does not provide its performance on Occluded-Duke at the size of 384×128, we reproduced the results. Performance results are provided for inputs resized to 256×128 and to 368×128 or 384×128. Our PAFormer either outperforms existing SOTA models or shows performance on par with them.

5.5 Comparison with existing methods

Table 3 shows the performance of our PAFormer and other previous ReID methods. We adopt mean Average Precision (mAP) and Rank-1 score (R1) for evaluation. PAFormer demonstrates its strong capabilities by achieving state-of-the-art performance on various datasets, outperforming competitors and securing at least the second-place position. The performances on the holistic datasets and the occluded dataset are achieved with P=5 and P=6, respectively.

Market-1501. For an image size of 256×128, PAFormer achieves 90.8/96.0 for mAP and Rank-1, respectively. In terms of the Rank-1 score, PAFormer exhibits the highest score among the entire comparison group, and the mAP also surpasses 90, showing the second-highest performance following NFormer [8]. Since NFormer uses ABDNet [4], which has already demonstrated excellent performance in the ReID field, as its backbone, PAFormer appears to lag slightly behind in terms of mAP.

PAFormer exhibits strong performance even for 384×128 images, achieving 90.9 in mAP and 96.1 in Rank-1 score. It records the highest mAP and the second-highest Rank-1 score among the comparison groups, demonstrating excellent performance.

DukeMTMC-reID. Our method demonstrates a performance of 83.3 in mAP and 92.1 in Rank-1 on the DukeMTMC-ReID dataset with a resolution of 256×128. These scores correspond to the 2nd and 1st places, respectively, among the entire comparison group. While NFormer [8] demonstrates impressive performance in terms of mAP on holistic datasets, PAFormer exhibits superior results in terms of Rank-1 scores.

When applied to images resized to 384×128, PAFormer achieves a performance of 84.2 in mAP and 92.5 in Rank-1 score, which are both the highest scores among the comparisons. The high performance on both Market-1501 and DukeMTMC-reID datasets demonstrates the ReID ability of PAFormer on holistic datasets.

Occluded-Duke. To evaluate performance on occluded cases, we experiment on the Occluded-Duke dataset. For images of size 256×128, PAFormer achieves an mAP of 59.9 and a Rank-1 of 67.0, the second-highest performance on both metrics. Considering that PFD [13] uses the pose estimation model during inference, PAFormer's performance is competitive enough. While showing a slightly weaker performance on 384×128 images, PAFormer still achieves the second-highest mAP among the comparison groups. The decline in performance at the image size of 384×128 can be attributed to the growing inaccuracy of the pose estimation model's heatmaps on the occluded dataset as the image size increases.

5.6 Ablation Study

P Grouping strategy
3 {Head, Upper, Lower}
4 {Head, Upper, Legs, Feet}
5 {Head, Torso, Arms, Legs, Feet}
6 {Head, Upper torso, Lower torso, Arms, Legs, Feet}
Table 4: Grouping strategy of body parts for different P.
Figure 7: Performance comparison for different numbers of pose tokens, P.

5.6.1 The number of pose tokens, P

In order to analyze the impact of body part division methods on performance, we conduct an ablation study for different values of P. Table 4 illustrates the body part division strategies for various P, while Figure 7 displays the corresponding performance outcomes. We confirm that the optimal value of P for the holistic datasets is 5.

When the value of P is too small, it becomes hard to handle the occlusion problem. For instance, when P is 4, since the torso and arms are not separated, the features of the torso might be influenced even in cases where the arm region is occluded. Moreover, there may be disadvantages in situations where there is a significant difference in features between the arms and torso, for example, when the individual is wearing short sleeves or sleeveless clothing.

On the contrary, when P is excessively large, there is a risk of overfitting. Additionally, the influence of a particular body part may be duplicated during feature distance calculation, leading to an unintended amplification of the impact of that specific body part.

visibility score mAP R1
w 59.9 67.0
w/o 57.6 64.5
round 59.0 66.1
Table 5: Ablation of the visibility score.

5.6.2 Validity of the visibility score

In order to assess the effectiveness of the visibility prediction, we conduct an ablation study on visibility scores, shown in Table 5. We experiment with three scenarios for the visibility scores predicted by the model: 1) using the visibility score as-is, 2) excluding the visibility score, and 3) rounding the visibility score. All experiments are conducted on Occluded-Duke with P=6. Using the visibility score from PAFormer as-is yields the best results.

Figure 8: Failure cases, in which the visibility score or attention region is inaccurately formed. For instance, in the first sample, even though the legs are visible, the visibility score is measured to be significantly low. In the second sample, the model focuses on the wrong areas.

5.7 Limitation

PAFormer's training strategy hinges on the pose heatmaps output by a pose estimation model, revealing vulnerabilities in scenarios with two or more individuals, as depicted in Figure 8. This challenge is not so much a flaw in our methodology as a consequence of inaccuracies in generating the ground truth masks. By using models adept at multi-person pose estimation and leveraging precisely annotated data, PAFormer has the potential to achieve enhanced performance in such situations.

6 Conclusion

This paper introduces an effective model for partial ReID, named PAFormer. We tackle the shortcomings observed in existing partial ReID models by proposing PAFormer, which is aware of the anatomical aspects of body parts. Our approach uses learnable pose tokens to estimate the correlation between patch tokens and different body parts. We then aggregate information from the patch tokens based on the attention maps generated by the pose tokens. Furthermore, we introduce a visibility predictor to effectively handle occlusion issues. PAFormer demonstrates state-of-the-art performance on well-known ReID datasets.

References

  • Sun et al. [2018] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline), in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 480–496.
  • Wang et al. [2018] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, Learning discriminative features with multiple granularities for person re-identification, in: Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 274–282.
  • Sharma et al. [2021] C. Sharma, S. R. Kapil, D. Chapman, Person re-identification with a locally aware Transformer, arXiv preprint arXiv:2106.03720 (2021).
  • Chen et al. [2019] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, Z. Wang, ABD-Net: Attentive but diverse person re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8351–8361.
  • Zhao et al. [2017] L. Zhao, X. Li, Y. Zhuang, J. Wang, Deeply-learned part-aligned representations for person re-identification, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 3219–3228.
  • Li et al. [2018] W. Li, X. Zhu, S. Gong, Harmonious attention network for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2285–2294.
  • Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR (2021).
  • Wang et al. [2022] H. Wang, J. Shen, Y. Liu, Y. Gao, E. Gavves, NFormer: Robust person re-identification with neighbor Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7297–7307.
  • He et al. [2021] S. He, H. Luo, P. Wang, F. Wang, H. Li, W. Jiang, TransReID: Transformer-based object re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15013–15022.
  • Zhang et al. [2021] G. Zhang, P. Zhang, J. Qi, H. Lu, HAT: Hierarchical aggregation Transformers for person re-identification, in: Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 516–525.
  • Li et al. [2021] Y. Li, J. He, T. Zhang, X. Liu, Y. Zhang, F. Wu, Diverse part discovery: Occluded person re-identification with part-aware Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2898–2907.
  • Zhu et al. [2021] K. Zhu, H. Guo, S. Zhang, Y. Wang, G. Huang, H. Qiao, J. Liu, J. Wang, M. Tang, AAformer: Auto-aligned Transformer for person re-identification, arXiv preprint arXiv:2104.00921 (2021).
  • Wang et al. [2022] T. Wang, H. Liu, P. Song, T. Guo, W. Shi, Pose-guided feature disentangling for occluded person re-identification based on Transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 2540–2549.
  • Li et al. [2023] W. Li, C. Zou, M. Wang, F. Xu, J. Zhao, R. Zheng, Y. Cheng, W. Chu, DC-Former: Diverse and compact Transformer for person re-identification, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 1415–1423. URL: https://ojs.aaai.org/index.php/AAAI/article/view/25226. doi:10.1609/aaai.v37i2.25226.
  • Zhu et al. [2020] K. Zhu, H. Guo, Z. Liu, M. Tang, J. Wang, Identity-guided human semantic parsing for person re-identification, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, 2020, pp. 346–363.
  • Somers et al. [2023] V. Somers, C. De Vleeschouwer, A. Alahi, Body part-based representation learning for occluded person re-identification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1613–1623.
  • Zheng et al. [2015] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1116–1124.
  • Zheng et al. [2017] Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 3754–3762.
  • Miao et al. [2019] J. Miao, Y. Wu, P. Liu, Y. Ding, Y. Yang, Pose-guided feature alignment for occluded person re-identification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 542–551.
  • Luo et al. [2019] H. Luo, Y. Gu, X. Liao, S. Lai, W. Jiang, Bag of tricks and a strong baseline for deep person re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019, pp. 0–0.
  • Kalayeh et al. [2018] M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, M. Shah, Human semantic parsing for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1062–1071.
  • Xu et al. [2018] J. Xu, R. Zhao, F. Zhu, H. Wang, W. Ouyang, Attention-aware compositional network for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2119–2128.
  • Cuturi [2013] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, Advances in neural information processing systems 26 (2013).
  • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • Chen et al. [2022] Y. Chen, S. Xia, J. Zhao, Y. Zhou, Q. Niu, R. Yao, D. Zhu, D. Liu, ResT-ReID: Transformer block-based residual learning for person re-identification, Pattern Recognition Letters 157 (2022) 90–96.
  • Abnar and Zuidema [2020] S. Abnar, W. Zuidema, Quantifying attention flow in Transformers, arXiv preprint arXiv:2005.00928 (2020).
  • Carion et al. [2020] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with Transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
  • Kreiss et al. [2019] S. Kreiss, L. Bertoni, A. Alahi, PifPaf: Composite fields for human pose estimation, CoRR abs/1903.06593 (2019). URL: http://arxiv.org/abs/1903.06593. arXiv:1903.06593.
  • Schroff et al. [2015] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • Gong et al. [2021] Y. Gong, Z. Zeng, L. Chen, Y. Luo, B. Weng, F. Ye, A person re-identification data augmentation method with adversarial defense effect, arXiv preprint arXiv:2101.08783 (2021).
  • Zhong et al. [2020] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang, Random erasing data augmentation, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 13001–13008.
  • Tan et al. [2021] H. Tan, X. Liu, Y. Bian, H. Wang, B. Yin, Incomplete descriptor mining with elastic loss for person re-identification, IEEE Transactions on Circuits and Systems for Video Technology 32 (2021) 160–171.
  • Li et al. [2021] H. Li, G. Wu, W.-S. Zheng, Combined depth space based architecture search for person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6729–6738.
  • Zhang et al. [2021] A. Zhang, Y. Gao, Y. Niu, W. Liu, Y. Zhou, Coarse-to-fine person re-identification with auxiliary-domain classification and second-order information bottleneck, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 598–607.
  • Zhu et al. [2022] H. Zhu, W. Ke, D. Li, J. Liu, L. Tian, Y. Shan, Dual cross-attention learning for fine-grained visual categorization and object re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4692–4702.
  • Chen et al. [2021] X. Chen, C. Xu, Q. Cao, J. Xu, Y. Zhong, J. Xu, Z. Li, J. Wang, S. Gao, OH-Former: Omni-relational high-order transformer for person re-identification, arXiv preprint arXiv:2109.11159 (2021).
  • Gu et al. [2022] H. Gu, J. Li, G. Fu, C. Wong, X. Chen, J. Zhu, AutoLoss-GMS: Searching generalized margin-based softmax loss function for person re-identification, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4744–4753.
  • Wang et al. [2020] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE transactions on pattern analysis and machine intelligence 43 (2020) 3349–3364.