Dynamic Patch-aware Enrichment Transformer for Occluded Person Re-Identification

Xin Zhang, Keren Fu, and Qijun Zhao Xin Zhang is with the National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China. (E-mail: [email protected]) Keren Fu and Qijun Zhao are with the College of Computer Science, and the National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China. (E-mail: [email protected]; [email protected]). (Corresponding author: Keren Fu).

Abstract

Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features through the utilization of external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM) [1]. As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks signify a substantial advancement of DPEFormer over existing state-of-the-art approaches. The code will be made publicly available.

Index Terms:

Person re-identification, occlusion, contrastive learning, token selection, Segment Anything Model.

I Introduction

PERSON re-identification (re-ID) is a challenging task involving the recognition and tracking of individuals across multiple non-overlapping cameras. Over recent years, it has attracted substantial research interest owing to its wide-ranging applications in various video surveillance scenarios. Thanks to the continuous advancements in deep learning techniques and the accessibility of extensive datasets, the field of re-ID has witnessed remarkable progress. Numerous techniques [2, 3, 4, 5, 6, 7, 8, 9] have emerged to tackle complex issues like viewpoint variations and variable lighting conditions in the domain of person re-identification. However, most of these methods heavily rely on extensive datasets containing comprehensive and unobstructed pedestrian images for training deep neural networks. Consequently, their effectiveness may diminish when faced with occlusion scenarios, where individuals are occluded by objects like poles, vehicles, or walls. These occlusions introduce formidable challenges in precisely identifying individuals. As a result, occluded person re-identification emerges as a crucial area that deserves more in-depth exploration and investigation.

Refer to caption — Figure 1: Applying pose estimation model [10] ((a), (d), (g)) and human parsing model [11] ((b), (e), (h)) to extract body information. Both models perform well when presented with holistic and object-occluded images but diminish when dealing with multi-pedestrian images. By contrast, our DPEFormer selects more accurate patches corresponding to body region ((c), (f), (i)).

In the context of occluded re-ID, a significant challenge arises from the presence of occluded regions, which tend to introduce noise and potentially lead to mismatches in the identification process. The crux of this challenge lies in the effective extraction of discriminative features from these non-occluded regions. Prevailing strategies to address this issue typically revolve around harnessing local features derived from distinct human body parts. Generally, these strategies rely on external cues provided by either semantic parsing [12, 13, 14] or pose estimation [15, 16, 17]. In such approaches, a pre-trained pose or semantic detector is employed to identify landmarks or regions of interest within the images. These detected landmarks or regions serve as valuable cues to pinpoint the non-occluded areas and facilitate the alignment of local features during the learning process. However, such solutions come with certain limitations. For instance, when dealing with considerable cross-domain disparities between training and testing data or in cluttered environments, the accuracy of off-the-shelf external detectors can be compromised. As depicted in Fig. 1, in scenarios involving multi-pedestrian occlusions, pose estimation results might erroneously align with other pedestrians, resulting in inaccurate information extraction. Furthermore, human parsing models may not always recognize items carried by individuals, such as backpacks, hats, umbrellas, etc., potentially leading to the omission of crucial information for re-ID purposes. Additionally, the use of external detectors can introduce extra computational costs, which may be less advantageous in real-time surveillance applications.

Considering the preceding discussions, it is a common consensus that part-based representations hold promise as solutions to the challenges encountered in occluded person re-ID. Regrettably, the absence of part labels in such scenarios makes it challenging to train reliable intra-domain detectors. To address this predicament within the context of occluded person re-ID, this paper introduces a novel end-to-end learning model known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). DPEFormer has the primary goal of localizing and selecting discriminative human body parts solely based on identity labels, without relying on external detectors. Leveraging the Transformer architecture [18], the pedestrian image is divided into smaller patches (e.g., Fig. 1 (c), (f), and (i)), each corresponding to a token. It is intuitive to expect that high-level tokens within the Transformer, associated with the same person, should exhibit more semantic similarity compared to tokens linked to occluded or background areas. Taking inspiration from this observation, we introduce the dynamic patch token selection module (DPSM). DPSM operates by considering the information encapsulated within each patch token and assessing its significance through similarity calculations with a label-guided proxy token. These similarity values are then used to dynamically select the most crucial tokens via a first-order derivative process. It is worth noting that DPSM essentially functions as a hard attention mechanism, as it assigns binary weights ({0,1}) to tokens, in contrast to the soft attention mechanism that employs continuous weights ([0,1]). As depicted in Fig. 1, DPEFormer, when trained with DPSM, demonstrates an improved ability to select more accurate patches for subsequent representation learning.

Furthermore, to enhance the aggregation of global features and part features selected by DPSM, we introduce a novel Feature Blending Module (FBM). FBM utilizes cross-attention mechanisms to comprehensively integrate both feature types, resulting in a more enriched final feature representation for re-ID tasks. Additionally, to facilitate improved learning for DPSM and the entire DPEFormer framework, we propose Realistic Occlusion Augmentation (ROA) at the image level. ROA takes advantage of recent advancements such as the Segment Anything Model (SAM) [1], allowing us to synthesize diverse and realistic occlusion data. This data includes scenarios involving multiple pedestrians and various objects, closely resembling real-world occlusion situations. By training with ROA-augmented data, we can observe an improvement in the robustness of both DPSM and DPEFormer when it comes to handling occlusions. Note that ROA is introduced as an auxiliary training strategy and does not play a role in the inference phase. This design ensures that DPEFormer remains highly flexible during real-world applications.

In summary, to cope with occluded person re-ID, this paper presents three distinct contributing components, including Dynamic Patch Token Selection Module (DPSM), Feature Blending Module (FBM), and Realistic Occlusion Augmentation (ROA). The primary contributions can be succinctly summarized as follows:

•

We propose a patch-aware feature selection paradigm called DPSM for occluded person re-ID. The primary objective of DPSM is to pinpoint crucial human body patch tokens from the multitude of available tokens with occlusion/background.
•

Expanding upon our patch-aware feature selection paradigm, we introduce a Feature Blending Module (FBM) that enhances the synergy between global and carefully selected local features. This augmentation results in effective feature fusion.
•

We propose Realistic Occlusion Augmentation (ROA) utilizing the Segment Anything Model (SAM) [1]. ROA serves the dual purpose of reducing information redundancy during image augmentation while faithfully emulating realistic occlusion scenarios, including variations in contour details.

The remainder of the paper is organized as follows. Section II discusses related work on holistic and occluded re-ID. Section III provides detailed descriptions of the proposed DPEFormer framework, as well as key components: Dynamic Patch Selection Module (DPSM), Feature Blending Module (FBM), and Realistic Occlusion Augmentation (ROA). Experimental results, performance evaluations, and comparative analyses are included in Section IV. Finally, conclusions are drawn in Section V.

II Related Work

In this section, we give a brief review of existing methods for holistic person re-ID and occluded person re-ID.

II-A Holistic Person Re-Identification

Person re-identification is a task that searches for or identifies a target person from multiple camera views. Existing methods can be roughly categorized into traditional [19, 20] and deep learning [21, 22] approaches. As a representative work, Yang et al. [19] propose a unique color descriptor and generate feature representation in the color space. With the emergence of large-scale datasets together with modern GPUs, deep learning techniques have been extensively adopted in the person re-ID field, among which part-based methods have shown competitive performance by leveraging fine-grained information of a human body. Sun et al. [2] presents a simple and efficient part-based baseline convolutional network that employs a uniform partition strategy to learn part-level features. Wang et al. [23] design a multi-branch deep network, consisting of one global branch and two local branches with varying numbers of segmentation parts. Lin et al. [24] introduce additional attribute information, such as gender, hair length, age, backpacks, and so on, to enforce model learning of local details. Attention mechanisms have been adopted in [25, 26, 27] to emphasize feature extraction on human body areas. [28, 29] fully exploit the knowledge from the source domain to address cross-domain re-ID. Tan et al. [30] incorporate spatial and temporal dual attention to refine pseudo-labels in unsupervised re-ID tasks. Xu et al. [31] introduce a novel recycling strategy for pseudo-labels, addressing both pre-clustering and post-clustering stages. Unfortunately, these above methods exhibit limited accuracy on retrieving individuals under occlusions, thereby hindering their applicability in crowded and complex scenarios.

II-B Occluded Person Re-Identification

The current mainstream treatment for such occluded person re-ID is to adopt external information, including human parsing and pose estimation. Huang et al. [12] are the pioneers in employing human parsing techniques for localizing body parts. He et al. [15] introduce a new network called Pose-Guided Feature Alignment (PGFA), which leverages pose landmarks to extract meaningful information while excluding occlusion noise. He et al. [32] introduce a new model called Foreground-aware Pyramid Reconstruction (FPR), which is dedicated to extracting features from foreground human body parts and computing matching scores between occluded pedestrians. Gao et al. [33] propose an integrated framework called Pose-guided Visible Part Matching (PVPM) to learn discriminative features using a part visibility predictor and a pose-guided attention module. Wang et al. [34] present an adaptive-direction graph convolutional (ADGC) model to learn semantic features, together with a cross-graph embedded alignment (CGEA) method for robust feature alignment. Yan et al. [35] propose a lightweight PRE to exploit local feature correlations and aggregate them without external detectors.

By leveraging transformer architecture, Jia et al. [36] propose DRL-Net through disentangled representation learning. Additionally, Wang et al. [37] propose a novel approach, called Feature Erasing and Diffusion Network (FED), which can effectively eliminate occlusion features in images. To summarize, the above methods represent significant advances in the field of occluded person re-ID, and demonstrate encouraging performance for this challenging task.

Unlike the aforementioned previous methods that rely heavily on external human detectors or predefined occlusion priors from training sets, the proposed DPEFormer does not require any additional knowledge and also enables end-to-end training. Through careful consideration and design for occlusion elimination, DPEFormer demonstrates good generalizability in handling different occlusion scenarios, and can also adapt to unseen occlusion cases effectively.

III Methodology

The overall architecture of our model is illustrated in Fig. 2, which begins with pedestrian images as inputs. Note that these images will be augmented by ROA described later. Following previous works [38, 36, 37], the Vision Transformer (ViT) [18] is adopted as our feature extractor. A learnable $[cls]$ classification token and position embeddings are prepended to the input image. After the transformer encoder, the output feature can be expressed as $f\in\mathbb{R}^{(N+1)\times c}$ , where $N+1$ including one $[cls]$ token and $N$ image patch tokens, and $c$ indicates the dimension. In practice, $N$ and $c$ are 128 and 768 (the same as ViT-Base [18]), respectively. Then, we feed the output feature $f$ to DPSM to perform token selection, which discards a certain amount of tokens while retaining the rest for subsequent feature representation. After DPSM, several local part features from adaptive pooling of the selected tokens, as well as the global classification token, are further fed to FBMs to embed global classification features into part features to obtain enriched feature representation. After that, the outputs of all FBMs (corresponding to individual parts) are concatenated to form the final pedestrian descriptor, which is supervised by identity loss. Meanwhile, we employ the memory bank and contrastive loss proposed in [39] for enhanced supervision, as shown by the pink region in Fig. 2. This part works concurrently with the DPSM and FBMs during training, but takes tokens prior to selection as inputs. Details of each component will be introduced in the following sections.

III-A Dynamic Patch Token Selection Module (DPSM)

Recognizing that not all patch tokens from ViT contribute equally to the representation of a pedestrian, especially an occluded one, we propose a dynamic patch token selection module (DPSM) to exclude a certain amount of ineffective patches. In turn, DPSM aims to emphasize contributions of those advantageous ones based on specific rules to enhance model performance. Firstly, we introduce a proxy token $f_{proxy}$ that is carefully selected from all patch tokens using a similarity measure that compares with the global features, namely the class token. This proxy serves as a key link between the global and patch features and will be used for subsequent patch token selection to alleviate the impact of occlusion. The features obtained after ViT backbone can be denoted as $f=\{f_{g},f_{1},f_{2},...,f_{N}\}$ , where $f_{g}$ is the class token and can be deemed as the global features for the re-ID task. Owing to that all the token features are normalized after ViT, we directly measure similarities by computing the inner product between $f_{g}$ and each patch token $f_{i}$ , where $i=1,2,...,N$ . Next, we identify the patch token having the greatest similarity with the class token $f_{g}$ as the proxy token $f_{proxy}$ . Mathematically, the lower-index $proxy$ of $f_{proxy}$ can be formulated as:

proxy=\underset{i}{\operatorname{arg\,max}}\,(f_{g}f_{1}^{\intercal},f_{g}f_{2}^{\intercal},\dots,f_{g}f_{N}^{\intercal}).

(1)

Since $f_{proxy}$ is obtained from the strongest patch token representation and it is reasonable to assume it corresponds to the pedestrian body area, $f_{proxy}$ can be considered as a reliable feature representation. To further evaluate similarities between $f_{proxy}$ and the other tokens, we perform dot product again between $f_{proxy}$ and each patch embedding, obtaining a set of similarity scores as:

\mathtt{S}^{p}=\{f_{proxy}f_{1}^{\intercal},f_{proxy}f_{2}^{\intercal},\dots,f_{proxy}f_{N}^{\intercal}\},

(2)

where we note that we do not yet exclude $f_{proxy}$ itself from the set, yielding $N$ scores in total, among which the maximum score equals 1.

As shown in Fig. 3, we then sort all similarity values in descending order via $\operatorname{Sort}(\cdot)$ operation as:

\mathtt{S}^{o}=\operatorname{Sort}(\mathtt{S}^{p}),

(3)

and calculate the first-order gradient of $\mathtt{S}^{o}$ :

\mathtt{D}[i]=\mathtt{S}^{o}[i]-\mathtt{S}^{o}[i+1],\\

(4)

and let $\mathtt{Diff}$ be denoted as:

\mathtt{Diff}=\{\mathtt{D}[1],\mathtt{D}[2],\dots,\mathtt{D}[N-1]\},

(5)

where $N$ means the total number of patch tokens. We assume that the pedestrian body and other information, such as background and occlusion, belong to distinct categories in the feature space. Our goal is to divide the patch tokens into two clusters, among which one corresponds to body information and the other is associated with background/occlusion information. To address this, we refer to $\mathtt{Diff}$ and consider the position with the maximal gradient magnitude, namely the maximal first-order difference, as the splitting point. The underlying assumption is that there should be a distinct feature transition when body features change to those tokens that are contaminated by occlusions or other interfering information. Therefore, this splitting point is chosen as:

k=\underset{i}{\operatorname{arg\,max}}(\mathtt{Diff}).

(6)

This means those tokens corresponding to the first $k$ scores in $\mathtt{S}^{o}$ will be selected as body feature tokens to represent a pedestrian, and the rest tokens are deemed less effective and will be discarded subsequently.

Despite the above maximal gradient magnitude demonstrating certain efficacy (will be discussed in Section IV), we also observe some failure cases in practice due to the strong assumption of a clear boundary between body tokens and background/occlusion tokens. If a pedestrian presents a similar appearance to the occlusion object or background, this strategy may lead to the misclassification of similar parts as disturbance, resulting in the selection of too few patch tokens. Consequently, this reduction in the number of selected patch tokens can compromise the model’s ability to accurately capture comprehensive pedestrian information, ultimately affecting its overall performance.

To address this issue, we set up an initial minimum for $k$ , denoted as $k_{min}$ , to define the least number for patch selection. The value of $k_{min}$ is empirically determined (see Section IV-D) and acts as a relaxation condition. Finally, the dynamic number of the selected candidate tokens can be defined as $k\leftarrow max(k,k_{min})$ . Lastly, we perform adaptive average pooling on the selected $k$ patch tokens, yielding $P$ part features denoted as $f_{{part}_{i}}$ , where $i=1,\dots,P$ , and together with $f_{g}$ as the outputs of DPSM.

Discussion. Actually, the proposed hard attention mechanism DPSM can be deemed as performing a two-class clustering process to divide the tokens into two classes, and the proposed algorithm is a simple tailored one to achieve such a goal. Given the clustering guidance by $f_{proxy}$ , DPSM treats $f_{proxy}$ as a seed and compares all the samples with this seed. Tokens similar to $f_{proxy}$ are clustered into one class, while those far away are put into the other class. It is also worth noting that, in the view of clustering, one can also apply any clustering algorithm, like naive K-means, to bi-partition tokens $\{f_{1},f_{2},...,f_{N}\}$ . However, such a way is much less efficient as we need to handle enormous training images during learning. The clustering results from K-means may also be less stable and reasonable. Therefore, we reject those clustering approaches and employ a proxy token-based sharpest gradient scheme to establish the splitting point.

To visualize the selected patch tokens, we map the selected ones back to image space, where raw image patches corresponding to the selected patch tokens are highlighted. The visualization results are shown in Fig. 1 and Fig. 4. One can observe that the majority of selected patches align with pedestrian bodies, effectively avoiding occlusions caused by other pedestrians or objects. However, DPSM sometimes fails to “perfectly” select patches. Instead, the selected patches could cover the background or some occlusion areas. One possible reason is that the global self-attention mechanism of Transformer has exchanged information across all patch tokens, making their feature representation complicated and not well aligned with the original image space, e.g., a patch token corresponds to the background of the original image may also convey pedestrian identity representation. Despite the above explanation, we observe notable performance gains after using DPSM.

III-B Feature Blending Module (FBM)

To accurately identify pedestrians, it is imperative to consider both contextual information from the global features $f_{g}\in\mathbb{R}^{1\times 768}$ and detailed information from the part features $f_{{part}_{i}}\in\mathbb{R}^{1\times 768},i=1,\dots,P$ . In recent work [40], a module was developed specifically for communicating information between global and part features in the multi-modal re-ID task. Inspired by [40], we propose a feature enrichment strategy for occluded re-ID. To this end, given these two types of features, we embed global context information into each local part feature to enrich representation power via FBM. Fig. 2 bottom right illustrates the structure of FBM, which we describe below using the $i$ -th part features $f_{{part}_{i}}$ and the global features $f_{g}$ for simplicity.

The underlying idea of FBM is to leverage the well-known multi-head self-attention (MHSA) mechanism as an enrichment mean (Fig. 2), rather than simple point-to-point addition (i.e., $f_{{part}_{i}}+f_{g}$ ) or concatenation. We initially implement three $1\times 1$ convolution layers on global and part features to obtain $f_{g}^{{}^{\prime}}$ , $f_{g}^{{}^{\prime\prime}}$ and $f_{{part}_{i}}^{{}^{\prime}}$ . Then we split these feature vectors into 48 non-overlapping sub-parts, with each sub-part having a fixed size of 16. Such 48 sub-parts are then formulated as regular inputs to the attention mechanism. Learnable positional embeddings are added to these feature sub-parts. In contrast to the standard MHSA, we modify MHSA to derive query set $Q_{p}$ from the feature sub-parts of $f_{{part}_{i}}^{{}^{\prime}}$ , and to derive key set $K_{g}$ and value set $V_{g}$ from the feature sub-parts of $f_{g}^{{}^{\prime}}$ and $f_{g}^{{}^{\prime\prime}}$ . The graphic illustration is shown in Fig. 2 bottom right, and our modified MHSA enables comprehensive mutual interactions between the part and global features, instead of naive point-to-point integration performed by simple addition.

Given $Q_{p},K_{g},V_{g}$ , the modified MHSA is formulated as:

\operatorname{MHSA}_{g\rightarrow p}(Q_{p},K_{g},V_{g})=\operatorname{Concat}(head_{0},\dots,head_{h})W^{O},

(7)

where $W^{O}$ is the output transformation matrix for integrating multi-head outputs, and $head_{i}$ is computed as:

head_{i}=\operatorname{Attention}\left(Q_{p}^{i},K_{g}^{i},V_{g}^{i}\right),

(8)

where $Q_{p}^{i},K_{g}^{i},V_{g}^{i}$ are the query, key, and value matrices of the $i$ -th head, respectively. The general computation of Attention( $\cdot$ ) is defined as:

\operatorname{Attention}(Q,K,V)=\operatorname{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,

(9)

where $\sqrt{d_{k}}$ denotes a scaling factor, and is equal to $4(\sqrt{16})$ in our case. So generally, the feature blending process in FBM is represented as:

f_{{mhsa}_{i}}\!=\!\operatorname{RE}\left(\operatorname{MHSA}_{g\rightarrow p}\left(\operatorname{LN}\left(\operatorname{FPR}\left(f_{g}^{{}^{\prime}},f_{g}^{{}^{\prime\prime}},f_{{part}_{i}}^{{}^{\prime}}\right)\!+\!\mathrm{PE}\right)\right)\right),

(10)

where $\operatorname{RE}$ represents the reshape operation, which aims to maintain the same dimension as $f_{{part}_{i}}$ . $\operatorname{PE}$ and $\operatorname{LN}(\cdot)$ denote position embedding and layer normalization, respectively. $\operatorname{FPR}(\cdot)$ represents feature partition and sub-part matrix formulation, namely transforming $\mathbb{R}^{1\times 768}$ into $\mathbb{R}^{48\times 16}$ . Finally, the entire forward blending process of FBM is computed as:

f_{{fbm}_{i}}=\operatorname{LN}\left(f_{{part}_{i}}+f_{{mhsa}_{i}}\right)+\operatorname{FFN}\left(\operatorname{LN}\left(f_{{part}_{i}}+f_{{mhsa}_{i}}\right)\right),

(11)

where $\operatorname{FFN}(\cdot)$ denotes the feed-forward layer, and $f_{{fbm}_{i}}$ denotes the obtained blending features by embedding $f_{g}$ into $f_{{part}_{i}}$ .

To ensure semantic diversity, we employ FBM for each part, namely, there will be $P$ FBM in total. Finally, all enriched part features are concatenated to form the final pedestrian descriptor:

f_{final}=\operatorname{Concat}(f_{{fbm}_{i}},\dots,f_{{fbm}_{P}}),

(12)

where $P$ denotes the total number of parts. During the training process, $f_{g}$ is employed solely for classification and discarded in the final feature representation. The final representation $f_{final}\in\mathbb{R}^{1\times 3072}$ encompasses diverse detail-specific information from different local parts, and is also enriched by global knowledge through the proposed feature embedding via FBM.

III-C Realistic Occlusion Augmentation (ROA)

A significant challenge faced by existing approaches in occluded person re-ID lies in the limited availability of occlusion data. In response to this challenge, researchers have explored various strategies. Zhong et al. [41] introduced a method known as Random Erasing (RE), which involves the random occlusion of a rectangular region with random pixel values. Chen et al. [42] adopted a random cut-and-paste approach, where a patch is cropped from a training image and pasted onto the input image. Following the idea of [42], Wang et al. [37] further cropped occlusion objects like bags, manually from the training set and pasted them. Such an approach can be deemed as incorporating additional prior information specifically related to pedestrian occlusion. The aforementioned approaches have made significant progress in addressing data scarcity. However, a common limitation shared by these methods is that the synthesized images appear to be coarse and not realistic. To overcome this limitation, we present Realistic Occlusion Augmentation (ROA) that significantly diminishes data redundancy and greatly enhances the realism of generated images, better simulating real-world occlusions.

The procedure of the proposed ROA is outlined in Algorithm 1, whose mask set is generated by the Segment Anything Model (SAM) [1] applied on natural images (Fig. 5). In Algorithm 1, variable $random\_area$ means the area of the bounding box of an occlusion object, and therefore, the actual occlusion area is usually smaller than this value. Following the steps in Algorithm 1, ROA is able to generate occluded images that closely resemble real-world scenarios, with occlusions manifesting in different directions and covering various regions of a pedestrian body. As shown in Fig. 5, the generated occlusion image by ROA is more natural and exhibits diverse contour details when compared to cut&paste [42].

Algorithm 1 Realistic Occlusion Augmentation (ROA)

0: Training set:

\mathcal{X}_{train}

; Mask set:

\mathcal{X}_{mask}

(generated by SAM [1], detailed in Sec. IV-B).

0: Realistic Occlusion Augmentation set

\mathcal{X}_{roa}

\%

For simplicity, we use the width and height of the bounding box to represent the size of an object mask.

2: for each mini-batch

\mathcal{B}\subset\mathcal{X}_{train}

3: for each

\textbf{x}_{i}\in\mathcal{B}

4: Randomly select a mask

\textbf{x}_{mask}

from

\mathcal{X}_{mask}

5: Obtain the size

mask\_h\times mask\_w

\textbf{x}_{mask}

and

image\_h\times image\_w

\textbf{x}_{i}

, and the area

area

\textbf{x}_{i}

6: Select a random value

random\_area

between

1/2*area

and

3/4*area

7: if

mask\_h/mask\_w>2

then

resize\_w=random\_area/image\_h

resize\_h=image\_h

10: else

11:

resize\_w=image\_w

12:

resize\_h=random\_area/image\_w

13: end if

14: Scale

\textbf{x}_{mask}

to size

resize\_w\times resize\_h

15: Perform random horizontal flip for

\textbf{x}_{mask}

16: Randomly place

\textbf{x}_{mask}

at one of the three positions

\{(0,0),(0,image\_h-resize\_h),(image\_w-resize\_w,0)\}

17: Generate

\textbf{x}_{roa}

by occluding

\textbf{x}_{i}

with

\textbf{x}_{mask}

18: Add

\textbf{x}_{roa}

\mathcal{X}_{roa}

19: end for

20: end for

21: return

\mathcal{X}_{roa}

III-D Loss Function

For better training our DPEFormer model, we adopt the same memory bank strategy as in [39] to regularize feature learning. Since the learning process of DPSM may not be stable at the beginning (patch tokens reside in learning and have not yet converged), the patch selection process yields less reliable results. So we apply the memory bank to the patch tokens prior to DPSM and construct the associated contrastive loss and identity loss.

Memory Bank

Following [39], the memory initialization takes place only once at the beginning of the training. Moreover, to obtain identity centers, all extracted features from the training set are averaged and stored in a memory-based feature dictionary. Memory updating is performed at every forward inference stage. As shown by Fig. 2, the feature vector of each training instance is utilized to update the corresponding dictionary vector during forward computation. For an instance with identity $j$ , after a series of operations (the ViT encoder, average pooling, and concatenation operation, yielding features $f_{q_{j}}$ ), the updating process is formulated as:

d_{j}\leftarrow\mu\cdot d_{j}+(1-\mu)\cdot f_{q_{j}},

(13)

where $\mu$ is a momentum factor, and $d_{j}$ is the dictionary vector corresponding to identity $j$ . Note that to the maintain stability of dictionary updating, we only use non-ROA samples to update the dictionary. Meanwhile, the memory bank is adopted for calculating contrastive loss.

Contrastive Loss

During training, we compare the training instance’s features $f_{q_{j}}$ to all the dictionary vectors $\{d_{1},d_{1},$ $\dots,d_{M}\}$ ( $M$ means the number of identity classes) using the InfoNCE loss [43]:

\mathcal{L}_{con}=-\log\frac{exp(f_{q_{j}}\cdot d_{j}^{\intercal}/\tau)}{\sum_{l=1}^{M}{exp(f_{q_{j}}\cdot d_{l}^{\intercal}/\tau)}},

(14)

where $\tau$ is a temperature factor, $d_{j}$ is the positive centroid feature vector for instance $f_{q_{j}}$ . The loss enforces an instance close to its positive centroid $d_{j}$ and deviates away from the other negative ones. Note that in practice, we render ROA images $\mathcal{X}_{roa}$ with a smaller weight (about 0.3) on this contrastive loss term while the original images $\mathcal{X}_{train}$ are rendered weight 1.

Total Loss

We choose the cross-entropy loss as the identity loss $\mathcal{L}_{id}$ to train the model in our experiments, and all the global and part features are under the constraint of $\mathcal{L}_{id}$ . The overall classification loss $\mathcal{L}_{cls}$ then can be formulated as:

\begin{split}\mathcal{L}_{cls}&=\sum^{P}_{i=1}(\mathcal{L}_{id}(\Theta(f_{{fbm}_{i}}),y))+\mathcal{L}_{id}(\Theta(f_{q_{j}}),y)\\ &+\mathcal{L}_{id}(\Theta(f_{g}),y)+\mathcal{L}_{id}(\Theta(f_{final}),y),\end{split}

(15)

where $\Theta$ represents the probability prediction function, which comprises a bottleneck layer followed by a fully connected layer. $y$ denotes the ground truth identity label. $P$ is the total number of parts. Finally, the total loss $L_{total}$ is defined as:

\mathcal{L}_{total}=\mathcal{L}_{con}+\mathcal{L}_{cls}.

(16)

IV Experiments and Results

IV-A Datasets and Evaluation Settings

Occluded-DukeMTMC [15] represents one of the most challenging occluded person ReID datasets due to the diversity of scenes and distractions encountered. It comprises 15,618 training images of 702 persons, 2,210 query images of 519 persons, and 17,661 gallery images of 1,110 persons.

Occluded-REID [44] focuses on occluded individuals captured through mobile cameras, comprising 2,000 images spanning 200 unique identities. Within each identity, the dataset includes five unobstructed full-body images and five images featuring severe occlusions, showcasing diverse viewpoints and occlusion types.

Market-1501 [45] is one of the most well-known holistic person ReID datasets, with 12,936 training images of 751 persons, 19,732 gallery images, and 3,368 query images of 750 persons captured from six cameras. Few of images in this dataset are occluded.

DukeMTMC [46] consists of 16,522 training images of 702 persons, 2,228 queries of 702 persons, and 17,661 gallery images of 702 persons. The images are captured by eight different cameras, making it more challenging. As it contains more holistic images than occluded ones, this dataset can be treated as a holistic ReID dataset.

IV-B Experimental Settings

Experimental details

During training, each pedestrian image is resized to a resolution of $256\times 128$ . For each iteration, we select a batch of 64 images, consisting of 16 identities, each with 4 distinct images. The Adam optimizer is employed for model optimization, with the weight decay factor set to $1\times 10^{-4}$ . We used the initial 100 images from the SA-1B dataset [1] as the original images to extract the masks. By utilizing SAM, we generated a collection $\mathcal{X}_{mask}$ of 9,913 occlusion masks. The minimum threshold of selected patch tokens is set to 48. The momentum factor $\mu$ is set to 0.2 in Eq. (13). Following [37], the part number $P$ is set to 4. To achieve optimal performance, we train the model for 120 epochs, initializing the learning rate of the parameters to $1\times 10^{-4}$ , adopting cosine learning rate decay. The platform for the implementation of our experiment is PyTorch [47]. All global and local features are subject to identity loss. Lastly, only four enriched part features are concatenated to produce a final 3072-dimensional person descriptor for testing.

For a fair comparison, our backbone network just adopts ViT [18] without employing a specific sliding window or other specialized configurations.

Protocols

The model is evaluated with two standard metrics: Rank-1 accuracy and mean average precision (mAP). All experiments are carried out in a single query mode.

IV-C Comparison with State-of-the-Art Methods

We compare our method with state-of-the-art methods on both occluded and holistic re-ID datasets in Table I and Table II, respectively. The backbones of compared methods are ResNet-50 [48] and Vision Transformer [18].

Comparisons on Occluded Re-ID Datasets. DPEFormer is evaluated with recent state-of-the-art occluded re-ID methods. The comparison results are presented in Table I. We can see that PAT [38] significantly improves accuracy by adopting a transformer encoder-decoder structure capable of leveraging diverse part-aware attention maps. These attention maps serve as specific part detectors and boost network performance. Our DPEFormer achieves 69.9% Rank-1 accuracy, 82.8% Rank-5 accuracy, 86.4% Rank-10 accuracy, and 58.9% mAP, which surpasses all other types of occluded methods by a substantial margin. Notably, our method does not rely on external cues. It is designed to mitigate interference of occlusion solely based on the available training data and the designed architecture.

Comparisons on Holistic Re-ID Datasets. We also evaluate our model on two holistic person Re-ID datasets and compare it with other state-of-the-art methods in Table II. While our proposed strategies and modules aim to address occlusion problems, they may not fully function well for holistic scenarios. Nonetheless, we still achieve an impressive Rank-1=95.4%/mAP=88.1% and Rank-1=90.0%/mAP=80.3% accuracy on Market-1501 and DukeMTMC, respectively, surpassing all other types of holistic re-ID methods. This comprehensive performance indicates that our proposed methods effectively exploit a robust feature representation not only for occlusion challenges but also for holistic ones.

TABLE I: Performance comparison with state-of-the-art methods on occluded datasets Occlude-DukeMTMC and Occluded-REID (%). “*” denotes that external information is utilized. The best results are shown in bold.

Backbone	Methods	Occlude-DukeMTMC		Occluded-REID
Backbone	Methods	Rank-1	mAP	Rank-1	mAP
CNN	Part Aligned [49] (ICCV 17)	28.8	20.2	-	-
	HACNN [50] (CVPR 18)	34.4	26.0	-	-
	Part Bilinear* [51] (ECCV 18)	36.9	-	-	-
	FD-GAN* [52] (NIPS 18)	40.8	-	-	-
	DSR [53] (CVPR 18)	40.8	30.4	72.8	62.8
	SFR [54] (ArXiv 18)	42.3	32.0	-	-
	PCB [2] (ECCV 18)	42.6	33.7	41.3	38.9
	Adver Occluded [55] (CVPR 18)	44.5	32.2	-	-
	PVPM* [33] (CVPR 20)	47.0	37.7	70.4	61.2
	PGFA* [15] (ICCV 19)	51.4	37.3	-	-
	HOReID* [34] (CVPR 20)	55.1	43.8	80.3	70.2
	MoS [56] (AAAI 21)	61.0	49.2	-	-
	OAMN [42] (ICCV 21)	62.6	46.1	-	-
	MMNet [57] (TITS 22)	56.1	50.1	-	-
	PRE-Net [35] (TCSVT 23)	67.1	54.3	-	-
Transformer	PAT [38] (CVPR 21)	64.5	53.6	81.6	72.1
	DRL-Net [36] (TMM 23)	65.0	50.8	-	-
	FED [37] (CVPR 22)	68.1	56.4	86.3	79.3
	DPEFormer (Ours)	69.9	58.9	87.0	79.5

TABLE II: Performance comparison with state-of-the-art methods on holistic datasets Market-1501 and DukeMTMC (%).

Backbone	Methods	Market-1501		DukeMTMC
Backbone	Methods	Rank-1	mAP	Rank-1	mAP
CNN	DSR [53] (CVPR 18)	83.6	64.3	-	-
	PSE* [58] (CVPR 18)	87.7	69.0	27.3	30.2
	VCFL [59] (ICCV 19)	89.3	74.5	-	-
	PGFA* [15] (ICCV 19)	91.2	76.8	82.6	65.5
	MVPM [60] (ICCV 19)	91.4	80.5	83.4	70.0
	PCB [2] (ECCV 18)	92.3	77.4	81.8	66.1
	OAMN* [42] (ICCV 21)	92.3	79.8	86.3	72.6
	VPM [10] (CVPR 19)	93.0	80.8	83.6	72.6
	SFT [61] (ICCV 19)	93.4	82.7	86.9	73.2
	AANet* [25] (CVPR 19)	93.9	82.5	86.4	72.6
	Circle [62] (CVPR 20)	94.2	84.9	-	-
	HOReID* [34] (CVPR 20)	94.2	84.9	86.9	75.6
	PRE-Net [35] (TCSVT 23)	94.5	86.0	88.9	76.5
Transformer	DRL-Net [36] (TMM 23)	94.7	86.9	88.1	76.6
	FED [37] (CVPR 22)	95.0	86.3	89.4	78.0
	PAT [38] (CVPR 21)	95.4	88.0	88.8	78.2
	DPEFormer (Ours)	95.4	88.1	90.0	80.3

TABLE III: Performance analysis of each component in DPEFormer on Occluded-DukeMTMC.

Index	RE	ROA	DPSM	FBM	Rank-1	mAP
1	-	-	-	-	60.4	50.4
2	✓	-	-	-	60.5	53.0
3	-	✓	-	-	67.1	58.1
4	-	✓	✓	-	68.7	58.7
5	-	✓	-	✓	68.1	58.6
6	-	✓	✓	✓	69.9	58.9

IV-D Ablation Study

In this section, we present a comprehensive series of ablation studies to thoroughly analyze our proposed framework.

Effectiveness of each component. In Table III, we present a series of ablation studies that investigate the individual contributions of various components: random erasing (RE) [41], realistic occlusion augmentation (ROA), the dynamic patch token selection module (DPSM), and the feature blending module (FBM). In detail, “index 1” corresponds to the baseline model, where different images of the same pedestrian are used as input for contrastive learning. Subsequently, we incrementally introduce augmentations and modules, denoted as indexes from 1 to 5: baseline + RE, baseline + ROA, baseline + ROA + DPSM, baseline + ROA + FBM, and DPEFormer, respectively. Comparing “index 0” (the baseline) to “index 1/2” (baseline + RE and baseline + ROA), we observe that the inclusion of data augmentation strategies, specifically ROA, significantly enhances performance. This improvement can be attributed to the generation of a more diverse and realistic set of augmented images, which in turn aids the model during training.

Comparing “index 3” to “index 4”, we observe that the DPSM leads to a further enhancement in representations, resulting in 1.6% improvement in Rank-1 accuracy. In Fig. 1 and 4, we provide a visual representation of the patches selected by DPSM, demonstrating its effectiveness in directing the model’s focus towards extracting human body information while filtering out tokens associated with occlusions or background elements. Furthermore, when comparing “index 3” to “index 5”, the FBM contributes to performance gains, showing improvements of 1.0% in Rank-1 accuracy and 0.5% in mAP. This underscores FBM’s ability to diversify local part features and enhance local representations by effectively fusing complementary global information. Ultimately, “index 6” (DPEFormer) achieves the highest accuracy, highlighting the efficacy of each component, both individually and in synergy.

Parameter analysis of DPSM. We investigate the impact of the patch token selection module on model performance. To this end, we introduce a fixed threshold selection strategy referred to as Fixed Threshold PSM (FPSM). In FPSM, we modify the token selection process in the Dynamic Patch Selection Module (DPSM) to a fixed number while keeping all other training strategies unchanged. Table IV presents our results. We observe an obvious trend as the number of selections increases: both mAP and Rank-1 accuracy improve. However, this trend changes after reaching a threshold of 48, where we begin to observe a gradual decrease in performance with further increases in the selection number. This observation suggests that an abundance of patch tokens, which incorporate occlusion or background information, can degrade the feature representation. Conversely, an insufficient number of patch tokens may fail to capture the necessary information required for accurate person characterization. Balancing the selection of patch tokens is a critical factor in achieving an optimal feature representation and robustness in person recognition. Therefore, it becomes essential to establish a minimum threshold for the number of patch token selections to ensure a more effective implementation of the dynamic selection strategy. Setting this threshold ensures that an adequate number of candidate patch tokens are retained, striking the right balance in the process.

TABLE IV: Performance analysis of different fixed selection numbers in FPSM on Occluded-DukeMTMC. “Number=128” denotes that all the patch tokens are selected.

Number	Rank-1	Rank-5	Rank-10	mAP
12	68.1	82.5	86.2	58.3
24	68.2	81.7	86.4	58.4
48	69.0	82.8	86.5	58.7
96	68.4	82.2	86.6	58.7
128	68.1	82.1	86.4	58.3

TABLE V: Parameter analysis for the minimum selection number

k_{min}

on Occluded-DukeMTMC. “

k_{min}=0

” denotes that no minimum number of patch token selections is imposed.

$k_{min}$	Rank-1	Rank-5	Rank-10	mAP
0	69.3	82.5	86.4	58.8
12	68.2	82.0	86.0	58.5
24	68.3	82.3	86.2	58.5
36	68.5	82.4	86.4	58.6
48 (Ours)	69.9	82.8	86.6	58.9
60	68.6	82.3	86.4	58.5
72	68.4	82.2	86.2	58.4

We conducted experiments to determine the optimal hyperparameter, denoted as $k_{min}$ , by varying its values and assessing the impact on the model performance, as presented in Table V. Notably, when there were no constraints on the minimum selection number ( $k_{min}=0$ ), meaning tokens were selected solely based on their first gradient value, the model achieved competitive performance. As we increased $k_{min}$ from 12, we observed an improvement in Rank-1/mAP performance by 1.6%/1.3% (reaching $k_{min}=48$ ), indicating that this parameter became beneficial for learning discriminative features. However, further increases in $k_{min}$ led to performance degradation, as more noisy tokens, including occlusions and background information, were included. Additionally, the dynamic selection strategy outperformed the fixed number strategy presented in Table IV. This improvement can be attributed to the enhanced flexibility of the dynamic selection module, which adapts better to the varying number of characteristics in each image. From another perspective, it is worth noting that the model achieved its best performance when the hyperparameter was set to 48, as evidenced in both Table IV and Table V.

Furthermore, Table VI presents a comparison of recognition accuracy achieved by using the feature representations before and after the DPSM for inference, without the need for retraining. “Before” signifies the utilization of a pooling operation on all patch tokens to form the feature representation, while “After” denotes the use of the patch tokens selected by DPSM for pooling. It is evident that utilizing the feature representation after the DPSM results in higher accuracy, underscoring the module’s effectiveness.

TABLE VI: Comparison between using features before and after the application of DPSM for pedestrian representation on Occluded-DukeMTMC.

Feature	Rank-1	Rank-5	Rank-10	mAP
Before	68.1	82.2	86.4	58.3
After	68.7	82.4	86.4	58.7

Necessity for the proxy token in DPSM. We refrain from using global features as proxies due to their inclusion of information from all body parts. The results, as shown in Table VII, clearly demonstrate that employing a selected patch token as the proxy yields superior performance compared to using global features (cls token). This highlights that the selected patch token provides more representative information for token selection. The effectiveness of using patch tokens as proxies can be attributed to the elimination of discrepancies caused by redundant information present in global features.

TABLE VII: Comparison of using the global token or patch token as the proxy token on Occluded-DukeMTMC.

Proxy token	Rank-1	Rank-5	Rank-10	mAP
Global token	68.4	81.9	86.4	58.6
Closest patch token (Ours)	69.9	82.8	86.6	58.9

FBM vs. REM. The efficacy of our FBM is confirmed through a comparative analysis with an alternative fusion strategy known as REM, introduced by Wang et al. [40]. All other experimental settings remain constant. The results, as presented in Table VIII, reveal that REM exhibits a decrease in Rank-1 accuracy by 1.2% and a decline in mAP by 0.5% when compared to our proposed FBM. This outcome underscores the superior capability of our FBM in effectively integrating and fusing global and local features within the context of DPEFormer.

TABLE VIII: Comparison of the proposed FBM with REM on Occluded-DukeMTMC.

Method	Rank-1	Rank-5	Rank-10	mAP
REM [40]	68.7	81.7	85.6	58.3
FBM (Ours)	69.9	82.8	86.6	58.9

On Realistic Occlusion Augmentation. We assess the efficacy of our Realistic Occlusion Augmentation (ROA) approach through a comparative analysis with two established methods: Random Erasing (RE) [41] and the Cut&Paste method [42]. Specifically, in the Cut&Paste configuration, we employ the same object instances as ROA to ensure a fair comparison. In this scenario, we directly extract rectangular regions from scene images instead of employing our occlusion generation method. The results of this evaluation are presented in Table X. We observed that the Cut&Paste method outperforms Random Erasing, possibly due to its ability to provide rich occlusion information. However, ROA significantly outperforms both of these augmentation methods. This substantial improvement validates the effectiveness of ROA, which leverages the Segment Anything Model (SAM) [1] to generate realistic occluded images. ROA proves to be a superior approach for augmenting occluded person images.

To ensure a fairer comparison, we replaced the occlusion augmentation in FED and the ROA in DPEFormer with the widely recognized occlusion augmentation RE. The results are also presented in Table X. We can see that our approach consistently outperforms FED, underscoring the effectiveness of our meticulously designed modules (DPSM and FBM). It is worth mentioning that FED achieves competitive performance, especially when leveraging its NPO augmentation strategy, which incorporates prior manual occlusion information from the training set.

Furthermore, to evaluate the influence of the generated occlusion mask set size on model performance, we employed varying quantities of original images choosen from SA-1B datasets. The findings, as illustrated in Table IX, reveal a performance uptick with a candidate image count ranging from 50 to 100. Optimal performance is attained at approximately 100 candidates. Beyond this threshold, augmenting the number of images yields minimal impact. At this juncture, the model is exposed to a sufficiently diverse array of occlusion shapes, and texture saturation has reached an optimal value. Hence, our investigation establishes a candidate set size of 100 as the optimal enhancement size.

TABLE IX: The influence of the quantity of selected images from the SA-1B dataset, utilized as source images for

\mathcal{X}_{mask}

formation.

Number	Rank-1	Rank-5	Rank-10	mAP
50	69.0	82.3	86.4	58.6
100 (Ours)	69.9	82.8	86.6	58.9
200	69.7	82.6	86.3	58.8

TABLE X: Comparison of occlusion augmentation method on Occluded-DukeMTMC.

Augmentation	Model	Rank-1	Rank-5	Rank-10	mAP
ROA (Ours)	DPEFormer	69.9	82.8	86.6	58.9
Cut&Paste [42]		66.9	79.5	84.1	56.0
RE [41]		63.9	78.9	84.0	55.5
RE [41]	FED [37]	62.6	77.7	83.0	55.2
NPO [37]	FED [37]	68.1	-	-	56.4

About the contrastive loss. We introduced an additional contrastive loss after the Feature Blending Module (FBM). As outlined in Table XI, the results demonstrate that the inclusion of extra contrastive loss led to a decrease of 1.5% in mAP and a 3.1% reduction in Rank-1 accuracy. Our analysis aligns with the observation that body region information differs between occluded and non-occluded images. Furthermore, this discrepancy is accentuated by the DPSM. Introducing contrastive loss as a constraint in this context may hinder the learning of crucial discriminative features, potentially impacting DPSM’s performance. However, the contrastive loss applied to the memory bank serves as guidance for the Vision Transformer (ViT) backbone to extract features devoid of occlusions. This preparation enhances the subsequent processing by DPSM.

TABLE XI: Analysis of the utilization of contrastive loss on Occluded-DukeMTMC. “One” denotes the execution of contrastive loss solely after the backbone, whereas “two” indicates additional implementation of contrastive loss post the FBM.

Contrastive loss	Rank-1	Rank-5	Rank-10	mAP
One (Ours)	69.9	82.8	86.6	58.9
Two	66.8	81.3	85.9	57.4

IV-E Qualitative Analyses

We present qualitative experimental results to demonstrate the efficacy of DPEFormer. In Fig. 4, we provide visualizations of the patch tokens selected by DPSM, mapping them back to their corresponding locations in the image space. We have highlighted the raw image patches that correspond to these selected patch tokens. These images depict instances where pedestrians and objects occlude the scene. Notably, a significant proportion of the selected patches align precisely with the pedestrians’ body regions, effectively mitigating occlusions caused by other pedestrians or objects. This visualization underscores the robustness and effectiveness of DPEFormer in handling occlusions within complex real-world scenarios.

Fig. 6 illustrates how DPEFormer overcomes occlusions by presenting some retrieval results. Each set comprises an occluded query person image displayed on the left, along with two rows of images on the right. These right-hand images represent the top-10 matches generated by the baseline model and our proposed DPEFormer. Upon observation, it is evident that DPEFormer excels at overcoming occlusions, accurately identifying images of the same pedestrian. In contrast, the baseline network exhibits heightened sensitivity to occlusions, resulting in a substantial number of false-positive matches.

V Conclusion

This paper introduces DPEFormer, a novel end-to-end architecture specifically tailored for occlusion re-identification tasks. DPEFormer operates at the patch token level, allowing for the automatic and insightful selection of human body part features free from occlusions, all without the need for additional supervision. Moreover, we present a novel Feature Blending Module meticulously crafted to enhance feature representation. It achieves this by harnessing the complementary nature of information and capitalizing on the diverse aspects of body parts. In addition, we introduce a Realistic Occlusion Augmentation (ROA) strategy, grounded in SAM, which enables the generation of more authentic occluded data. This augmentation significantly enriches DPEFormer’s overall learning capability, making it highly adaptable to real-world scenarios. Our extensive experiments and comparisons showcase substantial improvements achieved by DPEFormer over state-of-the-art models in the domain of occlusion handling. We believe that DPEFormer offers fresh insights into addressing occlusion challenges within the occluded re-ID problem, and we hope it could inspire further research.

References

[1] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[2] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Eur. Conf. Comput. Vis., 2018, pp. 480–496.
[3] J. Dai, P. Zhang, D. Wang, H. Lu, and H. Wang, “Video person re-identification by temporal residual learning,” IEEE T. Image Process., vol. 28, no. 3, pp. 1366–1377, 2019.
[4] J. Liu, W. Zhuang, Y. Wen, J. Huang, and W. Lin, “Optimizing federated unsupervised person re-identification via camera-aware clustering,” in IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), 2022, pp. 1–6.
[5] K. Jiang, T. Zhang, X. Liu, B. Qian, Y. Zhang, and F. Wu, “Cross-modality transformer for visible-infrared person re-identification,” in Eur. Conf. Comput. Vis., 2022, pp. 480–496.
[6] X. Liu, C. Yu, P. Zhang, and H. Lu, “Deeply coupled convolution–transformer with spatial–temporal complementary learning for video-based person re-identification,” IEEE T. Neural Netw. Learn. Syst., pp. 1–11, 2023.
[7] W. Zhuang, X. Gan, Y. Wen, and S. Zhang, “Optimizing performance of federated person re-identification: Benchmarking and analysis,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 19, pp. 1–18, 2023.
[8] Y. Dai, X. Li, J. Liu, Z. Tong, and L.-Y. Duan, “Generalizable person re-identification with relevance-aware mixture of experts,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 16 145–16 154.
[9] B. Hu, J. Liu, and Z.-j. Zha, “Adversarial disentanglement and correlation network for rgb-infrared person re-identification,” in Int. Conf. Multimedia and Expo, 2021, pp. 1–6.
[10] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 5693–5703.
[11] P. Li, Y. Xu, Y. Wei, and Y. Yang, “Self-correction for human parsing,” IEEE T. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 3260–3271, 2020.
[12] H. Huang, X. Chen, and K. Huang, “Human parsing based alignment with multi-task learning for occluded person re-identification,” in Int. Conf. Multimedia and Expo, 2020, pp. 1–6.
[13] S. Dou, C. Zhao, X. Jiang, S. Zhang, W.-S. Zheng, and W. Zuo, “Human co-parsing guided alignment for occluded person re-identification,” IEEE T. Image Process., vol. 32, pp. 458–470, 2023.
[14] V. Somers, C. De Vleeschouwer, and A. Alahi, “Body part-based representation learning for occluded person re-identification,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1613–1623.
[15] J. Miao, Y. Wu, P. Liu, Y. Ding, and Y. Yang, “Pose-guided feature alignment for occluded person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 542–551.
[16] T. Wang, H. Liu, P. Song, T. Guo, and W. Shi, “Pose-guided feature disentangling for occluded person re-identification based on transformer,” in AAAI Conf. Art. Intell., vol. 36, no. 3, 2022, pp. 2540–2549.
[17] J. Yang, C. Zhang, Z. Li, Y. Tang, and Z. Wang, “Discriminative feature mining with relation regularization for person re-identification,” Information Processing & Management, vol. 60, no. 3, p. 103295, 2023.
[18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[19] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, “Salient color names for person re-identification,” in Eur. Conf. Comput. Vis. Springer, 2014, pp. 536–551.
[20] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2197–2206.
[21] W. Chen, X. Chen, J. Zhang, and K. Huang, “Beyond triplet loss: a deep quadruplet network for person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 403–412.
[22] F. Zheng, C. Deng, X. Sun, X. Jiang, X. Guo, Z. Yu, F. Huang, and R. Ji, “Pyramidal person re-identification via multi-loss dynamic training,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8514–8522.
[23] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in ACM Int. Conf. Multimedia, 2018, pp. 274–282.
[24] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Z. Hu, C. Yan, and Y. Yang, “Improving person re-identification by attribute and identity learning,” Pattern Recognit., vol. 95, pp. 151–161, 2019.
[25] C.-P. Tay, S. Roy, and K.-H. Yap, “Aanet: Attribute attention network for person re-identifications,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 7134–7143.
[26] B. Chen, W. Deng, and J. Hu, “Mixed high-order attention network for person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 371–381.
[27] X. Zhang, M. Hou, X. Deng, and Z. Feng, “Multi-cascaded attention and overlapping part features network for person re-identification,” Signal, Image and Video Processing, vol. 16, no. 6, pp. 1525–1532, 2022.
[28] K. Zheng, W. Liu, L. He, T. Mei, J. Luo, and Z.-J. Zha, “Group-aware label transfer for domain adaptive person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 5306–5315.
[29] Y. Dai, J. Liu, Y. Sun, Z. Tong, C. Zhang, and L.-Y. Duan, “Idm: An intermediate domain module for domain adaptive person re-id,” in Int. Conf. Comput. Vis., 2021, pp. 11 864–11 874.
[30] Q. He, Z. Wang, Z. Zheng, and H. Hu, “Spatial and temporal dual-attention for unsupervised person re-identification,” IEEE T. Intell. Transp. Syst., pp. 1–13, 2023.
[31] M. Xu, H. Guo, Y. Jia, Z. Dai, and J. Wang, “Pseudo label rectification with joint camera shift adaptation and outlier progressive recycling for unsupervised person re-identification,” IEEE T. Intell. Transp. Syst., vol. 24, no. 3, pp. 3395–3406, 2023.
[32] L. He, Y. Wang, W. Liu, H. Zhao, Z. Sun, and J. Feng, “Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 8450–8459.
[33] S. Gao, J. Wang, H. Lu, and Z. Liu, “Pose-guided visible part matching for occluded person reid,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 11 744–11 752.
[34] G. Wang, S. Yang, H. Liu, Z. Wang, Y. Yang, S. Wang, G. Yu, E. Zhou, and J. Sun, “High-order information matters: Learning relation and topology for occluded person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6449–6458.
[35] G. Yan, Z. Wang, S. Geng, Y. Yu, and Y. Guo, “Part-based representation enhancement for occluded person re-identification,” IEEE T. Circuit Syst. Video Technol., vol. 33, no. 8, pp. 4217–4231, 2023.
[36] M. Jia, X. Cheng, S. Lu, and J. Zhang, “Learning disentangled representation implicitly via transformer for occluded person re-identification,” IEEE T. Multimedia, vol. 25, pp. 1294–1305, 2023.
[37] Z. Wang, F. Zhu, S. Tang, R. Zhao, L. He, and J. Song, “Feature erasing and diffusion network for occluded person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4754–4763.
[38] Y. Li, J. He, T. Zhang, X. Liu, Y. Zhang, and F. Wu, “Diverse part discovery: Occluded person re-identification with part-aware transformer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 2898–2907.
[39] Y. Ge, F. Zhu, D. Chen, R. Zhao, and H. Li, “Self-paced contrastive learning with hybrid memory for domain adaptive object re-id,” Adv. Neural Inform. Process. Syst., 2020.
[40] Z. Wang, C. Li, A. Zheng, R. He, and J. Tang, “Interact, embed, and enlarge: boosting modality-specific representations for multi-modal person re-identification,” in AAAI Conf. Art. Intell., vol. 36, no. 3, 2022, pp. 2633–2641.
[41] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” in AAAI Conf. Art. Intell., vol. 34, no. 07, 2020, pp. 13 001–13 008.
[42] P. Chen, W. Liu, P. Dai, J. Liu, Q. Ye, M. Xu, Q. Chen, and R. Ji, “Occlude them all: Occlusion-aware attention network for occluded person re-id,” in Int. Conf. Comput. Vis., 2021, pp. 11 833–11 842.
[43] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[44] J. Zhuo, Z. Chen, J. Lai, and G. Wang, “Occluded person re-identification,” in Int. Conf. Multimedia and Expo. IEEE, 2018, pp. 1–6.
[45] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Int. Conf. Comput. Vis., 2015, pp. 1116–1124.
[46] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in Int. Conf. Comput. Vis., 2017, pp. 3754–3762.
[47] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Adv. Neural Inform. Process. Syst., vol. 32, 2019.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
[49] L. Zhao, X. Li, Y. Zhuang, and J. Wang, “Deeply-learned part-aligned representations for person re-identification,” in Int. Conf. Comput. Vis., 2017, pp. 3219–3228.
[50] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2285–2294.
[51] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee, “Part-aligned bilinear representations for person re-identification,” in Eur. Conf. Comput. Vis., 2018, pp. 402–419.
[52] Y. Ge, Z. Li, H. Zhao, G. Yin, S. Yi, X. Wang, and H. Li, “Fd-gan: Pose-guided feature distilling gan for robust person re-identification,” in Adv. Neural Inform. Process. Syst., 2018, pp. 1229–1240.
[53] L. He, J. Liang, H. Li, and Z. Sun, “Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 7073–7082.
[54] L. He, Z. Sun, Y. Zhu, and Y. Wang, “Recognizing partial biometric patterns,” arXiv preprint arXiv:1810.07399, 2018.
[55] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, “Adversarially occluded samples for person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 5098–5107.
[56] M. Jia, X. Cheng, Y. Zhai, S. Lu, S. Ma, Y. Tian, and J. Zhang, “Matching on sets: Conquer occluded person re-identification without alignment,” in AAAI Conf. Art. Intell., 2021, pp. 1673–1681.
[57] M. Tu, K. Zhu, H. Guo, Q. Miao, C. Zhao, G. Zhu, H. Qiao, G. Huang, M. Tang, and J. Wang, “Multi-granularity mutual learning network for object re-identification,” IEEE T. Intell. Transp. Syst., vol. 23, no. 9, pp. 15 178–15 189, 2022.
[58] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 420–429.
[59] F. Liu and L. Zhang, “View confusion feature learning for person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 6639–6648.
[60] H. Sun, Z. Chen, S. Yan, and L. Xu, “Mvp matching: A maximum-value perfect matching for mining hard samples, with application to person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 6737–6747.
[61] C. Luo, Y. Chen, N. Wang, and Z. Zhang, “Spectral feature transformation for person re-identification,” in Int. Conf. Comput. Vis., 2019, pp. 4976–4985.
[62] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei, “Circle loss: A unified perspective of pair similarity optimization,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6398–6407.