
TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Shengcai Liao    and   Ling Shao
Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
https://liaosc.wordpress.com/
Shengcai Liao is the corresponding author.
Abstract

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.

1 Introduction

Figure 1: The structure of the proposed TransMatcher for image matching. A standard Transformer encoder without positional encoding is used for feature encoding. Then, query and gallery encodings are matched by a dot product. Global max pooling (GMP) is applied to find the optimal matching scores and locations, and an MLP head is appended to produce the final matching scores. Note that the batch dimension is ignored in this figure for simplicity.

The Transformer [32] is a neural network based on attention mechanisms. It has shown great success in the field of natural language processing. Recently, it has also shown promising performance for computer vision tasks, including image classification [12, 20], object detection [2, 50, 20, 34], and image segmentation [20, 34], thus gaining increasing attention in this field. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification or dense predictions, and the generalizability of Transformers is unknown. At a glance, query-key similarities are computed by dot products in the attention mechanisms of Transformers. Therefore, these models could potentially be useful for image matching. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images, with applications in generalizable person re-identification.

Attention mechanisms are used to gather global information from different locations according to query-key similarities. The vanilla Transformer [32] is composed of an encoder that employs self-attention, and a decoder that further incorporates a cross-attention module. The difference is that the query and key are the same in the self-attention, while they are different in the cross-attention. The Vision Transformer (ViT) [12] applies a pure Transformer encoder for feature learning and image classification. While the Transformer encoder facilitates feature interaction among different locations of the same image, it cannot address the image matching problem being studied in this paper, because it does not enable interaction between different images. In the decoder, however, the cross-attention module does have the ability for cross interaction between query and the encoded memory. For example, in the decoder of the detection Transformer (DETR) [2], learnable query embeddings are designed to decode useful information in the encoded image memory for object localization. However, the query embeddings are independent from the image inputs, and so there is still no interaction between pairs of input images. Motivated by this, how about using actual image queries instead of learnable query embeddings as input to decoders?

Person re-identification is a typical image matching and metric learning problem. In a recent study called QAConv [16], it was shown that explicitly performing image matching between pairs of deep feature maps helps the generalization of the learned model. This inspires us to investigate the capability and generalizability of Transformers for image matching and metric learning between pairs of images. Since training through classification is also a popular strategy for metric learning, we start from a direct application of ViT and the vanilla Transformer with a powerful ResNet [6] backbone for person re-identification. However, this results in poor generalization to different datasets. Then, we consider formulating explicit interactions between query and gallery images in Transformers. (Query/gallery in person re-identification and query/key or target/memory in Transformers are very similar concepts originating from information retrieval; we use the same word query here in different contexts.) Two naive solutions are thus designed. The first one uses a pure Transformer encoder, as in ViT, but concatenates the query and gallery features together as inputs, so as to enable the self-attention module to read both query and gallery content and apply the attention between them. The second design employs the vanilla Transformer, but replaces the learnable query embedding in the decoder with the ready-to-use query feature maps. This way, the query input acts as a real query from the actual retrieval inputs, rather than a learnable query, which is more like a prior or a template. Accordingly, the cross-attention module in the decoder is able to gather information across query-key pairs, where the key comes from the encoded memory of gallery images.

While the first solution does not lead to improvement, the second one is successful with a notable performance gain. However, compared to the state of the art in generalizable person re-identification, the performance of the second variant is still not satisfactory. We further consider that the attention mechanism in Transformers might be primarily for global feature aggregation, which is not naturally suitable for image matching, though the two naive solutions already enable feature interactions between query and gallery images. Therefore, to improve the effectiveness of image matching, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, inspired by QAConv [16], global max pooling (GMP) is applied, which acts as a hard attention to gather similarity values, instead of a soft attention to weight feature values. This is because, in image matching, we are more interested in matching scores than feature values. Finally, a multilayer perceptron (MLP) head maps the matching result to a similarity score for each query-gallery pair. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching.

We call the above design TransMatcher (see Fig. 1), which targets efficient image matching and metric learning in particular. The contributions of this paper are summarized as follows.

  • We investigate the possibility and generalizability of applying Transformers for image matching and metric learning, including direct applications of ViT and the vanilla Transformer, and two solutions adapted specifically for matching images through attention. This furthers our understanding of the capability and limitation of Transformers for image matching.

  • According to the above, a new simplified decoder is proposed for efficient image matching, with a focus on similarity computation and mapping.

  • With generalizable person re-identification experiments, the proposed TransMatcher is shown to achieve state-of-the-art performance on several popular datasets, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively.

2 Related Work

Given pairs of input images, deep feature matching has been shown to be effective for person re-identification. Li et al. [14] proposed a novel filter pairing neural network (FPNN) to handle misalignment and occlusions in person re-identification. Ahmed et al. [1] proposed a local neighborhood matching layer to match deep feature maps of query and gallery images. Suh et al. [28] proposed a deep neural network to learn part-aligned bilinear representations for person re-identification. Shen et al. [25] proposed a Kronecker-product matching (KPM) module for matching person images in a softly aligned way. Liao and Shao [16] proposed the query adaptive convolution (QAConv) for explicit deep feature matching, which is proved to be effective for generalizable person re-identification. They further proposed a graph sampler (GS) for efficient deep metric learning [17].

Generalizable person re-identification has gained increasing attention in recent years. Zhou et al. [49] proposed the OSNet, and showed that this new backbone network has advantages in generalization. Jia et al. [10] applied IBN-Net-b [22] together with a feature normalization to alleviate both style and content variance across datasets to improve generalizability. Song et al. [27] proposed a domain-invariant mapping network (DIMN) and further introduced a meta-learning pipeline for effective training and generalization. Qian et al. [24] proposed a deep architecture with leader-based multi-scale attention (MuDeep), with improved generalization of the learned models. Yuan et al. [42] proposed an adversarial domain-invariant feature learning network (ADIN) to separate identity-related features from challenging variations. Jin et al. [11] proposed a style normalization and restitution module, which shows good generalizability for person re-identification. Zhuang et al. [51] proposed a camera-based batch normalization (CBN) method for domain-invariant representation learning, which utilizes unlabeled target data to adapt the BN layer in a quick and unsupervised way. Wang et al. [36] created a large-scale synthetic person dataset called RandPerson, and showed that models learned from synthesized data generalize well to real-world datasets. However, current methods are still far from satisfactory in generalization for practical person re-identification.

There are a number of attentional networks [18, 23, 19, 41, 26, 15, 40, 43, 8] proposed for person re-identification, but they focus on representation learning. More recently, Zhao et al. [44] proposed a cross-attention network for person re-identification. However, it is still applied for feature refinement, rather than the explicit image matching between gallery and probe images studied in this paper.

Transformers have recently received increasing attention for computer vision tasks, including image classification [12, 20], object detection [2, 50, 20, 34], image segmentation [20, 34], and so on. For example, ViT was proposed in [12], showing that a pure Transformer-based architecture is capable of effective image classification. DETR was proposed in [2], providing a successful end-to-end Transformer solution for object detection. Later, several studies, such as the Deformable DETR [50], Swin [20], and PVT [34], improved the computation of Vision Transformers and further boosted their performance. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions. A comprehensive study on whether Transformers are effective for image matching and metric learning, and how well they generalize to unknown domains, is still lacking.

3 Transformers

For the vanilla Transformer [32], the core module is the multi-head attention (MHA). First, a scaled dot-product attention is defined as follows:

\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V, \quad (1)

where $Q\in\mathbb{R}^{T\times d_{k}}$ is the query (or target) matrix, $K\in\mathbb{R}^{M\times d_{k}}$ is the key (or memory) matrix, $V\in\mathbb{R}^{M\times d_{v}}$ is the value matrix, $T$ and $M$ are the sequence lengths of the query and key, respectively, $d_{k}$ is the feature dimension of the query and key, and $d_{v}$ is the feature dimension of $V$. In visual tasks, $Q$ and $K$ are usually reshaped query and key feature maps, with $T=M=hw$, where $h$ and $w$ are the height and width of the query and key feature maps, respectively. Then, the MHA is defined as:

\text{head}_{i}=\text{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}), \quad (2)
\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_{1},\ldots,\text{head}_{H})W^{O}, \quad (3)

where $W_{i}^{Q}\in\mathbb{R}^{d\times d_{k}}$, $W_{i}^{K}\in\mathbb{R}^{d\times d_{k}}$, $W_{i}^{V}\in\mathbb{R}^{d\times d_{v}}$, and $W^{O}\in\mathbb{R}^{Hd_{v}\times d}$ are parameter matrices, and $H$ is the number of heads. Then, $Q=K=V$ in the multi-head self-attention (MHSA) in the encoders, while they are defined separately in the multi-head cross-attention (MHCA) in the decoders.

The structure of the Transformer encoder without positional encoding is shown on the left of Fig. 1. Beyond MHSA, it further appends a feed-forward layer to first increase the feature dimension from $d$ to $D$, and then recover it back from $D$ to $d$. Besides, the encoder can be self-stacked $N$ times, where $N$ is the total number of encoder layers. In ViT [12], only Transformer encoders are used, and positional encoding is further applied. In the vanilla Transformer [32], decoders with MHCA are further applied, with the query being learnable query embeddings initially, and the output of the previous decoder layer later on, and the key and value being the output of the encoder layer. The decoder can also be self-stacked $N$ times.
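To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch of scaled dot-product attention and multi-head attention, using the notation above; it is an illustrative re-implementation for this section, not the code released with the paper.

```python
# A minimal PyTorch sketch of Eqs. (1)-(3), using the notation above. This is
# an illustrative re-implementation for this section, not the paper's code.
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V):
    # Q: (T, d_k), K: (M, d_k), V: (M, d_v)  ->  (T, d_v), Eq. (1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return F.softmax(scores, dim=-1) @ V


class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d, H, d_k, d_v):
        super().__init__()
        self.W_Q = torch.nn.ModuleList([torch.nn.Linear(d, d_k, bias=False) for _ in range(H)])
        self.W_K = torch.nn.ModuleList([torch.nn.Linear(d, d_k, bias=False) for _ in range(H)])
        self.W_V = torch.nn.ModuleList([torch.nn.Linear(d, d_v, bias=False) for _ in range(H)])
        self.W_O = torch.nn.Linear(H * d_v, d, bias=False)

    def forward(self, Q, K, V):
        # Eqs. (2)-(3): per-head attention, concatenation, and output projection
        heads = [scaled_dot_product_attention(wq(Q), wk(K), wv(V))
                 for wq, wk, wv in zip(self.W_Q, self.W_K, self.W_V)]
        return self.W_O(torch.cat(heads, dim=-1))


# Self-attention in the encoder: Q = K = V, the hw tokens of one feature map.
x = torch.randn(24 * 8, 512)                                      # (hw, d)
out = MultiHeadAttention(d=512, H=1, d_k=512, d_v=512)(x, x, x)   # (hw, d)
```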

4 Image Matching with Transformers: Naive Solutions

While the above ViT and vanilla Transformer are able to perform image matching through black-box feature extraction and distance learning, they are not optimal for this task because they lack image-to-image interaction in their designs. Though cross-attention is employed in the Transformer decoders, in its original form the query either comes from learnable query embeddings, or from the output of the previous decoder layer.

Therefore, we adapt Transformers with two naive solutions for image matching and metric learning. Building upon a powerful ResNet [6] backbone, the first solution appends ViT, but not simply for feature extraction. Instead, a pair of query and gallery feature maps are concatenated to double the sequence length, forming a single sample for the input of ViT. Thus, both the query and key for the self-attention layer contain query image information in one half and gallery image information in the other half. Therefore, the attention computation in Eq. (1) enables interaction between the query and gallery inputs for image matching. This variant is denoted as Transformer-Cat.
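As a rough illustration, the following hedged sketch shows the Transformer-Cat idea with standard PyTorch encoder layers, assuming hw = 24×8 tokens per image and d = 128 as in the smaller setting of Table 2; the mean pooling and the linear scoring head are assumptions added here for illustration only.

```python
# A hedged sketch of Transformer-Cat with standard PyTorch encoder layers.
# The pooling and scoring head are assumptions for illustration only.
import torch

d, hw = 128, 24 * 8
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=1,
                                                 dim_feedforward=512, batch_first=True)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)
score_head = torch.nn.Linear(d, 1)                   # assumed scoring head

query_feat = torch.randn(2, hw, d)                   # (batch, hw, d) query tokens
gallery_feat = torch.randn(2, hw, d)                 # (batch, hw, d) gallery tokens

pair = torch.cat([query_feat, gallery_feat], dim=1)  # (batch, 2*hw, d): doubled sequence
encoded = encoder(pair)                              # self-attention sees both images
score = score_head(encoded.mean(dim=1))              # (batch, 1) similarity score
```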

The second solution appends the vanilla Transformer, but instead of learnable query embeddings, ResNet query features are directly input into the first decoder. This way, the cross-attention layer in the decoders enables interaction between the query and gallery samples being matched. This variant is denoted as Transformer-Cross.
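A hedged sketch of Transformer-Cross follows: the vanilla encoder-decoder is kept, but the decoder input is the query image's tokens instead of learnable query embeddings, so cross-attention matches query tokens against the encoded gallery memory. The final scoring head is again an assumption for illustration.

```python
# A hedged sketch of Transformer-Cross with standard PyTorch layers.
import torch

d, hw = 128, 24 * 8
enc_layer = torch.nn.TransformerEncoderLayer(d_model=d, nhead=1, dim_feedforward=512, batch_first=True)
dec_layer = torch.nn.TransformerDecoderLayer(d_model=d, nhead=1, dim_feedforward=512, batch_first=True)
encoder = torch.nn.TransformerEncoder(enc_layer, num_layers=2)
decoder = torch.nn.TransformerDecoder(dec_layer, num_layers=2)
score_head = torch.nn.Linear(d, 1)                  # assumed scoring head

query_feat = torch.randn(2, hw, d)                  # query image tokens
gallery_feat = torch.randn(2, hw, d)                # gallery image tokens

memory = encoder(gallery_feat)                      # encode the gallery only
decoded = decoder(tgt=query_feat, memory=memory)    # cross-attention: query vs. gallery memory
score = score_head(decoded.mean(dim=1))             # (batch, 1) similarity score
```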

The structure of these two variants can be found in the Appendix. Note that these two solutions have high computational and memory costs, especially for large $d$, $D$, and $N$ (cf. Section 6.4).

5 The Proposed TransMatcher

Though the above two solutions enable query-gallery interaction in the attention mechanism for image matching, they are not adequate for distance metric learning. This is because, taking a deeper look at Eq. (1) for the attention, it can be observed that, though similarity values between $Q$ and $K$ are computed, they are only used for softmax-based weighting to aggregate features from $V$. Therefore, the output of the attention is always a weighted version of $V$ (or $K$), and thus cross-matching between a pair of inputs is not directly formulated.

To address this, we propose a simplified decoder, which is explicitly formulated towards similarity computation. The structure of this decoder is shown in the middle of Fig. 1. First, both gallery and query images are independently encoded by $N$ sequential Transformer encoders after a backbone network, as shown on the left of Fig. 1. This encoding helps aggregate global information from similar body parts for the subsequent matching step. The resulting feature encodings are denoted by $Q_{n}\in\mathbb{R}^{hw\times d}$ and $K_{n}\in\mathbb{R}^{hw\times d}$, $n=1,\ldots,N$, for the query and gallery, respectively. Then, as in Eq. (2), both the gallery and query encodings are transformed by a fully connected (FC) layer FC1:

Q^{\prime}_{n}=Q_{n}W_{n},\quad K^{\prime}_{n}=K_{n}W_{n}, \quad (4)

where $W_{n}\in\mathbb{R}^{d\times d}$ is the parameter matrix for encoder-decoder layer $n$. Different from Eq. (2), we use shared FC parameters for both query and gallery, because they are exchangeable in the image matching task, and the similarity metric needs to be symmetrically defined. Then, the dot product is computed between the transformed features, as in Eq. (1):

S_{n}=Q^{\prime}_{n}{K^{\prime}_{n}}^{T}, \quad (5)

where $S_{n}\in\mathbb{R}^{hw\times hw}$ are the similarity scores. In addition, a learnable prior score embedding $R\in\mathbb{R}^{hw\times hw}$ is designed, which defines prior matching scores between different locations of query and gallery images. Then, it is used to weight the similarity values:

S^{\prime}_{n}=S_{n}*\sigma(R), \quad (6)

where $*$ denotes element-wise multiplication, and $\sigma$ is the sigmoid function to map the prior score embedding into weights in $[0,1]$.

After that, a GMP layer is applied along the last dimension of $hw$ elements:

S^{\prime\prime}_{n}=\max(S^{\prime}_{n}, \text{dim}=-1). \quad (7)

This way, the optimal local matching over all key locations is obtained, as in QAConv [16]. Compared to Eq. (1), the GMP here can be considered as a hard attention, but it is used for similarity matching rather than softmax-based feature weighting like in the soft attention. Note that multi-head design in MHA is not considered here (c.f. Section 6.6).

Then, after a batch normalization layer BN1, an MLP head is further appended, similar to the feed-forward layer of Transformers. It is composed of MLPHead1=(FC2, BN2, ReLU) to map the $hw$ similarity values to dimension $D$, and MLPHead2=(FC3, BN3) to map dimension $D$ to 1 as a single output score $S^{\prime\prime\prime}_{n}$.

Finally, decoder $n$ outputs a similarity score by fusing the output of the previous decoder:

S^{\prime\prime\prime\prime}_{n}=S^{\prime\prime\prime}_{n}+S^{\prime\prime\prime\prime}_{n-1}, \quad (8)

where $S^{\prime\prime\prime\prime}_{0}$ is defined as 0. With $N$ stacked encoder-decoder blocks, as shown in Fig. 1, this can be considered as residual similarity learning. Note that the stack of encoder-decoder blocks in TransMatcher is different from that in the vanilla Transformer. In TransMatcher, the encoder and decoder are connected before being stacked, while in the vanilla Transformer they are stacked independently before connection. This way, the decoder of TransMatcher is able to perform cross matching with different levels of encoded features for residual similarity learning.
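The following is a minimal PyTorch sketch of the decoder part of one such block, following Eqs. (4)-(8): a shared FC for the query and gallery encodings, a dot product, weighting by the learnable prior score embedding, GMP over gallery locations, the MLP head, and residual fusion with the previous layer's score. It scores one query-gallery pair per batch element, whereas the actual implementation matches all pairs in a batch; the names are ours, and the released code should be consulted for exact details.

```python
# A minimal sketch of the proposed simplified decoder (Eqs. (4)-(8)).
import torch
import torch.nn as nn


class SimplifiedDecoder(nn.Module):
    def __init__(self, d=512, hw=24 * 8, D=2048):
        super().__init__()
        self.fc1 = nn.Linear(d, d, bias=False)           # W_n, shared by query and gallery (Eq. 4)
        self.prior = nn.Parameter(torch.zeros(hw, hw))    # prior score embedding R (Eq. 6)
        self.bn1 = nn.BatchNorm1d(hw)
        self.mlp_head1 = nn.Sequential(nn.Linear(hw, D), nn.BatchNorm1d(D), nn.ReLU())  # FC2, BN2, ReLU
        self.mlp_head2 = nn.Sequential(nn.Linear(D, 1), nn.BatchNorm1d(1))              # FC3, BN3

    def forward(self, Qn, Kn, prev_score=0.0):
        # Qn, Kn: (B, hw, d) query / gallery encodings from the n-th encoder
        Qp, Kp = self.fc1(Qn), self.fc1(Kn)               # Eq. (4)
        S = Qp @ Kp.transpose(1, 2)                       # Eq. (5): (B, hw, hw) similarities
        S = S * torch.sigmoid(self.prior)                 # Eq. (6): prior score weighting
        S = S.max(dim=-1).values                          # Eq. (7): GMP over gallery locations
        S = self.mlp_head2(self.mlp_head1(self.bn1(S)))   # MLP head, (B, 1)
        return S + prev_score                             # Eq. (8): residual score fusion


decoder = SimplifiedDecoder()
score = decoder(torch.randn(4, 24 * 8, 512), torch.randn(4, 24 * 8, 512))   # (4, 1)
```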

However, the GMP operation in Eq. (7) is not symmetric. To make TransMatcher symmetric for the query and gallery, the GMP operation in Eq. (7) can also be applied along dim=0; that is, conduct an inverse search of best matches over all query locations. Keeping other operations the same, this will result in another set of similarity scores, which are summed with the original ones after the FC3 layer. Further details can be found in the Appendix. Note that this is not reflected in Fig. 1 for simplicity of illustration.

Finally, the TransMatcher scores for all query-gallery pairs in a batch are collected for pairwise metric learning, following the same pipeline as in QAConv-GS [17], and the same binary cross-entropy loss is used as in QAConv-GS.
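A hedged sketch of this pairwise loss is given below: the batch's score matrix is compared against binary same-identity targets with binary cross-entropy; the exact pair construction and sampling details of QAConv-GS are omitted here.

```python
# A hedged sketch of pairwise metric learning with a binary cross-entropy loss.
import torch
import torch.nn.functional as F


def pairwise_bce_loss(scores, labels):
    # scores: (B, B) TransMatcher scores for every query-gallery pair in the batch
    # labels: (B,) identity labels of the batch samples
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # (B, B) same-identity mask
    return F.binary_cross_entropy_with_logits(scores, targets)


loss = pairwise_bce_loss(torch.randn(8, 8), torch.randint(0, 4, (8,)))
```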

6 Experiments

6.1 Datasets

Four large-scale person re-identification datasets, CUHK03 [14], Market-1501 [45], MSMT17 [37], and RandPerson [36], which are publicly available for research purposes, are used in our experiments. The CUHK03 dataset includes 1,360 persons and 13,164 images, with 767 and 700 subjects used for training and testing, respectively, as in the CUHK03-NP protocol [48]. Besides, the “detected” subset is used, which is more challenging than the “labeled” subset. The Market-1501 dataset contains 32,668 images of 1,501 identities captured from six cameras, with 12,936 images from 751 identities for training, and 19,732 images from 750 identities for testing. MSMT17 includes 4,101 identities and 126,441 images captured from 15 cameras, with 32,621 images from 1,041 identities for training, and the remaining images from 3,010 identities for testing. RandPerson is a recently released synthetic person re-identification dataset for large-scale training towards generalization testing. It contains 8,000 persons and 1,801,816 images; a subset of 132,145 images of the 8,000 IDs is used for training.

Cross-dataset evaluation is performed on these datasets by training on the training subset of one dataset and evaluating on the test subsets of the other datasets. In addition, for MSMT17 we use a further setting in which all images are used for training, regardless of the subset splits; this is denoted MSMT17all. All evaluations follow the single-query evaluation protocol. The Rank-1 (Top1) accuracy and mean average precision (mAP) are used as the performance evaluation metrics.
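For reference, the sketch below outlines how single-query Rank-1 and mAP can be computed from a query-gallery similarity matrix; the camera-based filtering used in the official protocols of these datasets is omitted for brevity, so this only approximates the standard evaluation code.

```python
# A hedged sketch of single-query Rank-1 and mAP computation.
import numpy as np


def rank1_and_map(sim, q_ids, g_ids):
    # sim: (num_query, num_gallery) similarities; q_ids, g_ids: identity label arrays
    rank1, aps = 0.0, []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                          # gallery ranked by similarity
        matches = (g_ids[order] == q_ids[i]).astype(float)   # 1 where the identity matches
        rank1 += matches[0]
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / max(matches.sum(), 1))
    return rank1 / sim.shape[0], float(np.mean(aps))


r1, mAP = rank1_and_map(np.random.rand(5, 20), np.arange(5), np.random.randint(0, 5, 20))
```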

6.2 Implementation Details

The implementation of TransMatcher is built upon the official PyTorch project of QAConv-GS (under MIT License: https://github.com/ShengcaiLiao/QAConv) [17], as the graph sampler (GS) proposed in this project is efficient for metric learning and quite suitable for the learning of TransMatcher. We keep most of the settings the same as QAConv-GS. Specifically, ResNet-50 [6] is used as the backbone network, with three instance normalization (IN) [31] layers further appended as in IBN-Net-b [22], following several recent studies [10, 49, 11, 51, 17]. The backbone network is pre-trained on ImageNet, with the states of the BN layers being fixed. The layer3 feature map is used, with a 3×3 neck convolution appended to produce the final feature map. The input image is resized to 384×128. The batch size is set to 64, with K=4 for the GS sampler. The network is trained with the SGD optimizer, with a learning rate of 0.0005 for the backbone network and 0.005 for newly added layers. They are decayed by 0.1 after 10 epochs, and 15 epochs are trained in total. For RandPerson [36], however, the total number of epochs is 4 and the learning rate step size is 2, following the experiences in [36, 17]. Gradient clipping is applied with T=4 [17]. Several commonly used data augmentation methods are applied, including random flipping, cropping, occlusion, and color jittering. All experiments are run on a single NVIDIA V100 GPU.
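The optimizer setup described above can be sketched as follows; the backbone and matcher modules are placeholders, and the momentum and weight-decay values are assumptions, as they are not specified in this section.

```python
# A hedged sketch of the optimizer and learning-rate schedule described above.
import torch

backbone = torch.nn.Conv2d(3, 512, 3)        # placeholder for the ResNet-50 backbone
matcher = torch.nn.Linear(512, 1)            # placeholder for the newly added TransMatcher layers

optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr": 0.0005},   # backbone learning rate
     {"params": matcher.parameters(), "lr": 0.005}],    # newly added layers
    lr=0.005, momentum=0.9, weight_decay=5e-4)          # momentum / weight decay are assumed values
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # decay by 0.1 after 10 epochs
```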

For the proposed TransMatcher, unless otherwise indicated, $d=512$ and $D=2048$ by default, as in the original Transformer [32], and $H=1$ and $N=3$ for higher efficiency. Please refer to Section 6.6 for further parameter analysis. Besides, in practice, we find that when $N$ decoders are used, using $N-1$ encoders, with the ResNet feature map directly paired with the first decoder, slightly improves the results while being more efficient; this is preferred in the implementation (cf. Appendix).

6.3 Comparison to the State of the Art

Table 1: Comparison of the state-of-the-art direct cross-dataset evaluation results (%). MSMTall means all images are used for training, regardless of the subset splits.
Method Venue Train Set CUHK03-NP (Rank-1 / mAP) Market-1501 (Rank-1 / mAP) MSMT17 (Rank-1 / mAP)
MGN [33, 24] MM’18 Market 8.5 7.4 - - - -
MuDeep [24] PAMI’20 Market 10.3 9.1 - - - -
CBN [51] ECCV’20 Market - - - - 25.3 9.5
QAConv [16] ECCV’20 Market 9.9 8.6 - - 22.6 7.0
QAConv-GS [17] arXiv’21 Market 16.4 15.7 - - 41.2 15.0
TransMatcher Ours Market 22.2 21.4 - - 47.3 18.4
PCB [29, 42] ECCV’18 MSMT - - 52.7 26.7 - -
MGN [33, 42] MM’18 MSMT - - 48.7 25.1 - -
ADIN [42] WACV’20 MSMT - - 59.1 30.3 - -
SNR [11] CVPR’20 MSMT - - 70.1 41.4 - -
CBN [51] ECCV’20 MSMT - - 73.7 45.0 - -
QAConv-GS [17] arXiv’21 MSMT 20.0 19.2 75.1 46.7 - -
TransMatcher Ours MSMT 23.7 22.5 80.1 52.0 - -
OSNet [49] CVPR’19 MSMTall - - 66.5 37.2 - -
QAConv [16] ECCV’20 MSMTall 25.3 22.6 72.6 43.1 - -
QAConv-GS [17] arXiv’21 MSMTall 27.2 27.1 80.6 55.6 - -
TransMatcher Ours MSMTall 31.9 30.7 82.6 58.4 - -
RP Baseline [36] MM’20 RandPerson 13.4 10.8 55.6 28.8 20.1 6.3
QAConv-GS [17] arXiv’21 RandPerson 14.8 13.4 74.0 43.8 42.4 14.4
TransMatcher Ours RandPerson 17.1 16.0 77.3 49.1 48.3 17.7

A comparison to the state of the art (SOTA) in generalizable person re-identification is shown in Table 1. Several methods published very recently for generalizable person re-identification are compared, including OSNet [49], MuDeep [24], ADIN [42], SNR [11], CBN [51], QAConv [16], and QAConv-GS [17]. From Table 1 it can be observed that TransMatcher significantly improves the previous SOTA. For example, with Market-1501 for training, the Rank-1 and mAP are improved by 5.8% and 5.7% on CUHK03-NP, respectively, and by 6.1% and 3.4% on MSMT17, respectively. With MSMT17 → Market-1501, the improvements are 5.0% for Rank-1 and 5.3% for mAP. With the synthetic dataset RandPerson for training, the improvements on Market-1501 are 3.3% for Rank-1 and 5.3% for mAP, and the gains on MSMT17 are 5.9% for Rank-1 and 3.3% for mAP.

Since the second best method, QAConv-GS, shares the same code base and training settings with the proposed TransMatcher, this comparison indicates that TransMatcher is a superior image matching and metric learning method for generalizable person re-identification, thanks to the effective cross-matching design in the new decoders.

6.4 Comparison of Transformers

Table 2: Comparison of different Transformers trained on MSMT17 for direct cross-dataset evaluation (%). mAcc (%) is the average of all Rank-1 and mAP results on both CUHK03-NP and Market-1501 over four random runs.
Method d D N Time (h) CUHK03-NP (Rank-1 / mAP) Market-1501 (Rank-1 / mAP) mAcc
ViT 512 2048 3 0.99 12.0 12.4 57.7 29.8 27.42
Transformer 512 2048 3 1.16 13.2 13.3 54.3 29.0 27.01
TransMatcher 512 2048 3 1.44 23.7 22.5 80.1 52.0 44.29
Transformer-Cat 128 512 2 4.89 13.1 13.2 53.9 27.4 25.34
Transformer-Cross 128 512 2 3.48 18.9 19.8 66.2 40.1 36.70
TransMatcher 128 512 2 1.11 22.5 21.4 77.4 49.3 42.12

A comparison of different Transformers trained on MSMT17 for direct cross-dataset evaluation is shown in Table 2. For a fair comparison, they are all trained with the same settings as described in Section 6.2. Besides, $H=1$ for all models. ViT, the vanilla Transformer, and TransMatcher all have the same parameter settings. Though we use an NVIDIA V100 GPU with 32GB of memory, Transformer-Cat and Transformer-Cross still encounter memory overflow under the same parameter settings as TransMatcher. Therefore, we have to set $d=128$, $D=512$, and $N=2$ for them to run, and accordingly, a smaller version of TransMatcher with the same set of parameters is also provided for comparison.

From the results shown in Table 2, it can be observed that ViT and the vanilla Transformer perform poorly in generalizing to other datasets. In contrast, the proposed TransMatcher significantly improves the performance. This confirms that simply applying Transformers to the image matching task is not effective, because they lack cross-image interaction in their designs.

Besides, we find that Transformer-Cat does not lead to improvement compared to ViT and the vanilla Transformer, though it is a smaller model. However, Transformer-Cross does lead to notable improvements, indicating that the cross-matching of gallery and query images in Transformer decoders is potentially more effective. However, it is still not as good as the smaller version of TransMatcher. For example, on Market-1501, TransMatcher improves the Rank-1 by 11.2% and the mAP by 9.2% over Transformer-Cross. Therefore, the cross-attention design in the original Transformers is not efficient enough for image matching, due to its focus on feature aggregation rather than similarity matching. More variants and experiments of Transformers can be found in the Appendix.

As for the running speed, the training times of these methods are also listed in Table 2. As can be seen, without cross-matching, ViT is the most efficient, followed by the vanilla Transformer. TransMatcher is not as efficient as ViT due to the explicit cross-matching between query and gallery images. However, it is still acceptable, thanks to the new simplified decoder. In contrast, even with a small set of parameters, Transformer-Cat and Transformer-Cross are still quite heavy to compute.

6.5 Ablation Study

Table 3: Ablation of different components in TransMatcher. Training is performed on MSMT17. mAcc (%) is the average of all Rank-1 and mAP results on all test sets over four random runs. PriorEmbed is the prior score embedding, while PosEmbed is the positional embedding.
Configuration mAcc
Dot Product + GMP + MLPHead2 41.44
+ MLPHead1 42.82
+ MLPHead1 + FC1 43.66
+ MLPHead1 + BN1 43.70
+ MLPHead1 + FC1 + BN1 44.04
+ MLPHead1 + FC1 + BN1 + PriorEmbed 44.29
With PosEmbed added (alone or together with PriorEmbed) 42.39 / 42.56

The structure of the proposed TransMatcher shown in Fig. 1 is carefully ablated, with results listed in Table 3. The training is performed on MSMT17. For easy and reliable comparison, we report the average of all Rank-1 and mAP results on all test sets over four random runs, denoted mAcc. We start with Dot Product + GMP + MLPHead2 (the input dimension of FC3 needs to be adapted to $hw$ accordingly), which is the simplest and most necessary configuration. Then, by adding MLPHead1, the performance is improved by 1.38%, indicating that increasing the dimension to $D$, as in Transformers, is useful. Then, by including FC1 / BN1 independently, the performance gain is 0.84% / 0.88%, and by including them together, the performance is further improved. Finally, when the prior score embedding is appended, the best performance is achieved. Interestingly, when we include a learnable positional embedding in the encoders, as in ViT, either independently or together with the prior score embedding, the performance is degraded. This indicates that mixing position information with visual features is not useful for image matching in our design. In contrast, learning spatial-aware prior matching scores separately for score weighting is more effective. More ablation study and analysis can be found in the Appendix.

6.6 Parameter Analysis

Figure 2: Parameter analysis of the proposed TransMatcher.

To understand the parameter selection of the proposed TransMatcher, we train it on MSMT17 with different parameter configurations relative to the defaults, with the mAcc results as well as the training time shown in Fig. 2. First, the performance is gradually improved by increasing the model dimension $d$. However, the training time also increases quadratically. Therefore, to balance accuracy and running speed, $d=512$ is selected, which is the same as in the vanilla Transformer [32].

For the feed-forward dimension $D$, the performance is also gradually improved with increasing values. However, the training time is less affected, because the feed-forward operation is only applied after the dot product and the GMP, where the dimension $d$ and one spatial dimension $hw$ have already been contracted. Nevertheless, a large $D$ will increase the memory usage. Therefore, $D=2048$ is selected, which is also the same as in the vanilla Transformer [32].

As for the number of layers $N$, the performance is also gradually improved with increasing $N$. However, after $N=3$ the performance tends to saturate, and the training time grows linearly with the number of layers. Therefore, $N=3$ is a reasonable balance for our choice. In addition, with $N=1$ there is no encoder used (for details please see the Appendix), and from Fig. 2 it is clear that this is inferior, indicating that including an encoder is important. On the other hand, the poor performance of ViT, which has only encoders, makes it clear that the decoder is also important.

Finally, for the number of heads $H$ in the encoders, it appears that a larger $H$ does not lead to improved results. Since the training time is also not affected, we simply select $H=1$ in the encoders, and do not implement the multi-head mechanism in the decoders.

6.7 Qualitative Analysis

(a) Positive pairs (similarity scores: 0.84, 0.76, 0.82, 0.80, 0.77, 0.75)
(b) Negative pairs (similarity scores: 0.93, 0.79, 0.79, 0.68, 0.70, 0.75)

Figure 3: Examples of qualitative matching results on the Market-1501 dataset, by the proposed TransMatcher trained on the MSMT17 dataset. For each pair of images, local correspondence matches found on the three layers of the TransMatcher are shown. Numbers represent similarity scores.

With the help of the GMP layer, inspired by QAConv [16], the proposed TransMatcher is able to find the best local correspondence matches in each decoder layer. Some qualitative matching results are shown in Fig. 3 for a better understanding of TransMatcher. More examples can be found in the Appendix. The model used here is trained on the MSMT17 dataset [37], and the evaluations are done on the query subset of the Market-1501 dataset [45]. Results of both positive pairs and hard negative pairs are shown. For a clear illustration, only reliable correspondences with matching scores over a certain threshold are shown, where the threshold is determined by a false acceptance rate of 1‰ over all matches of negative pairs. Note that the local positions are coarse due to the 24×8 size of the feature map.
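The visualization threshold can be obtained, for example, as the score quantile of negative-pair matches corresponding to a 1‰ false acceptance rate; the sketch below illustrates this idea and is not necessarily the exact procedure used for the figure.

```python
# A hedged sketch of selecting a threshold at a given false acceptance rate.
import torch


def far_threshold(negative_scores, far=1e-3):
    # negative_scores: 1-D tensor of local matching scores collected from negative pairs
    return torch.quantile(negative_scores, 1.0 - far)


threshold = far_threshold(torch.randn(100000))
```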

As can be observed from Fig. 3, the proposed method is able to find correct local correspondences for positive pairs of images, even when there are notable misalignments in scale and position, as well as pose, viewpoint, and illumination variations, occlusions, and low-resolution blur. Besides, for hard negative pairs, the matching of TransMatcher still appears to be mostly reasonable, linking visually similar parts or even the same person who might be incorrectly labeled.

This indicates that the proposed TransMatcher is effective in local correspondence matching, and note that it learns to do this with only identity labels as supervision. Besides, the matching capability generalizes to other datasets beyond the training set. From the illustration it can also be seen that, generally, the matching results of the first decoder layer are not as successful as those of the next two layers, and the matching with the last decoder layer appears to be the best. This indicates that both the Transformer encoders and decoders help the model match better by aggregating global similarity information.

7 Conclusion

With the study conducted in this paper, we conclude that: (1) direct applications of ViT and the vanilla Transformer are not effective for image matching and metric learning, because they lack cross-image interaction in their designs; (2) designing query-gallery concatenation in ViT does not help, while introducing query-gallery cross-attention in the vanilla Transformer leads to notable but not adequate improvements, probably because the attention mechanism in Transformers might be primarily designed for global feature aggregation, which is not naturally suitable for image matching; and (3) the new simplified decoder developed accordingly, which applies hard attention to cross-matching similarity scores, is more efficient and effective for image matching and metric learning. With generalizable person re-identification experiments, the proposed TransMatcher is shown to achieve state-of-the-art performance on several popular datasets with large improvements. Therefore, this study proves that Transformers can be effectively adapted for image matching and metric learning tasks, and other potentially useful variants will be of future interest.

Acknowledgements

The authors would like to thank Yanan Wang, who helped produce Fig. 1 in this paper, Anna Hennig, who helped proofread the paper, and all the anonymous reviewers for their valuable feedback in improving the paper.

References

  • [1] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [3] Peixian Chen, Pingyang Dai, Jianzhuang Liu, Feng Zheng, Qi Tian, and Rongrong Ji. Dual distribution alignment network for generalizable person re-identification. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
  • [4] R. Ergys, S. Francesco, Z. Roger, C. Rita, and T. Carlo. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.
  • [5] Douglas Gray, Shane Brennan, and Hai Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE International workshop on performance evaluation of tracking and surveillance, 2007.
  • [6] K. He, X. zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [7] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Proc. Scandinavian Conf. on Image Analysis, 2011.
  • [8] Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9317–9326, 2019.
  • [9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
  • [10] Jieru Jia, Qiuqi Ruan, and Timothy M Hospedales. Frustratingly easy person re-identification: Generalizing person re-id in practice. In British Machine Vision Conference, 2019.
  • [11] Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. Style Normalization and Restitution for Generalizable Person Re-identification. In CVPR, feb 2020.
  • [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [13] Wei Li and Xiaogang Wang. Locally aligned feature transforms across views. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [14] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [15] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
  • [16] Shengcai Liao and Ling Shao. Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting. In European Conference on Computer Vision (ECCV), 2020.
  • [17] Shengcai Liao and Ling Shao. Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification. CoRR, abs/2104.01546, April 4, 2021.
  • [18] Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 26(7):3492–3506, 2017.
  • [19] Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, and Xiaogang Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE international conference on computer vision, pages 350–359, 2017.
  • [20] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR, abs/2103.14030, 2021.
  • [21] Chen Change Loy, Tao Xiang, and Shaogang Gong. Multi-camera activity correlation analysis. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1988–1995. IEEE, 2009.
  • [22] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
  • [23] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-scale deep learning architectures for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5399–5408, 2017.
  • [24] Xuelin Qian, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xiangyang Xue. Leader-based Multi-Scale Attention Deep Architecture for Person Re-identification. TPAMI, 2020.
  • [25] Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. End-to-end deep kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6886–6895, 2018.
  • [26] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5363–5372, 2018.
  • [27] Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. Generalizable person re-identification by domain-invariant mapping network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 719–728, 2019.
  • [28] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. Proceedings of the European Conference on Computer Vision (ECCV), pages 402–419, 2018.
  • [29] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [30] Masato Tamura and Tomoaki Yoshinaga. Bcar: Beginner classifier as regularization towards generalizable re-id. In BMVC, 2020.
  • [31] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
  • [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, June 2017.
  • [33] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 274–282. ACM, 2018.
  • [34] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. CoRR, abs/2102.12122, 2021.
  • [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [36] Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing Real-World Source Training Data: Random 3D Characters for Generalizable Person Re-Identification. In 28th ACM International Conference on Multimedia (ACMMM), 2020.
  • [37] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
  • [38] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In European Conference on Computer Vision, volume 11211 of Lecture Notes in Computer Science, pages 3–19. Springer, 2018.
  • [39] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
  • [40] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2119–2128, 2018.
  • [41] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 4733–4742, 2017.
  • [42] Ye Yuan, Wuyang Chen, Tianlong Chen, Yang Yang, Zhou Ren, Zhangyang Wang, and Gang Hua. Calibrated Domain-Invariant Learning for Highly Generalizable Large Scale Re-Identification. WACV, pages 3578–3587, nov 2020.
  • [43] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020.
  • [44] Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, and Xing Sun. Do not disturb me: Person re-identification under the interference of other pedestrians. In European Conference on Computer Vision, pages 647–663. Springer, 2020.
  • [45] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Proceedings of IEEE International Conference on Computer Vision, 2015.
  • [46] Wei Shi Zheng, Shaogang Gong, and Tao Xiang. Associating groups of people. In British Machine Vision Conference, 2009.
  • [47] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. international conference on computer vision, pages 3774–3782, 2017.
  • [48] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.
  • [49] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [50] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection. CoRR, abs/2010.04159:1–16, 2020.
  • [51] Zijie Zhuang, Longhui Wei, Lingxi Xie, Tianyu Zhang, Hengheng Zhang, Haozhe Wu, Haizhou Ai, and Qi Tian. Rethinking the Distribution Gap of Person Re-identification with Camera-based Batch Normalization. In ECCV, pages 140–157, jan 2020.

Appendix

Appendix A mAcc Measure

For ease and reliable comparison, we report the average of all Rank-1 and mAP results on all test datasets over several random runs for ablation study and parameter analysis. This is denoted by mAcc. There are three reasons that we use mAcc.

  • It is a unified measure, which is convenient for algorithm comparison. Both Rank-1 and mAP are accuracy measures ranging from 0% to 100%, thus averaging them is possible. Besides, if a method’s mAcc is 1% higher than another’s, it means that, on average, every single measure on each dataset has been increased by 1%, which is a perceptible achievement.

  • Some algorithms perform unstably across different runs, thus the average among several runs is a more stable measure.

  • Using a unified measure is convenient, concise, and space-saving for ablation study and parameter analysis.

Appendix B Additional Description and Analysis of the Proposed Method

B.1 Symmetric Extension of GMP

In Eq. (6) of the main paper, $S^{\prime}_{n}$ is of size $HW\times hw$, with the first $HW$ associated with the query feature map $Q$, and the second $hw$ associated with the gallery feature map $K$. Here $H=h$ and $W=w$, but to be clear, let us denote them differently. Then, in Eq. (7), GMP is applied along the last dimension of $hw$ elements, resulting in a vector of size $HW$. However, considering that the roles of $Q$ and $K$ are exchangeable, we can have Eq. (7'): $S^{\prime\prime}_{n}=\max(S^{\prime}_{n},\text{dim}=0)$, that is, applying GMP along the first dimension of $HW$ elements, resulting in a vector of size $hw$. Afterwards, the two sets of results are independently processed by the MLP head and the final scores are summed. This way, if the input pair of images are swapped, the final similarity score will remain the same. Please refer to the source code of TransMatcher (https://github.com/ShengcaiLiao/QAConv/tree/master/projects/transmatcher) for further details.
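A minimal sketch of this symmetric extension is given below, assuming a shared MLP head for the two GMP directions; the stand-in mlp_head plays the role of MLPHead1/FC3 in the main text.

```python
# A minimal sketch of the symmetric GMP extension described above.
import torch


def symmetric_scores(S_prime, mlp_head):
    # S_prime: (B, HW, hw) weighted similarities between query and gallery locations
    s_qg = S_prime.max(dim=-1).values   # Eq. (7): best gallery location per query location, (B, HW)
    s_gq = S_prime.max(dim=1).values    # Eq. (7'): best query location per gallery location, (B, hw)
    # Summing the two processed scores makes the result invariant to swapping the inputs.
    return mlp_head(s_qg) + mlp_head(s_gq)


hw = 24 * 8
mlp_head = torch.nn.Sequential(torch.nn.Linear(hw, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 1))
score = symmetric_scores(torch.randn(4, hw, hw), mlp_head)    # (4, 1)
```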

B.2 Prior Score Embeddings

The prior score embeddings are learnable parameters of size $hw\times hw$. They can also be considered learnable weights, somewhat similar to the learnable FC weights. They act as spatial matching priors. For example, for a pair of person images to be matched, one head should be matched to the other head, and so the corresponding rough "head-to-head" locations in the $hw\times hw$ parameters should have large values, while others such as "head-to-foot" locations should have small values. It is not easy to define this manually, so we make it learnable so that this prior can be automatically learned from data. Furthermore, this can also be understood as the image matching extension of the original positional embedding proposed in Transformers.

B.3 Different Layer Configurations of TransMatcher

Figure D: Performance of different layer configurations of the proposed TransMatcher. With the same number of decoders, mAcc’ and Time’ are results of the configuration where the same number of encoders are paired with the decoders, while mAcc and Time are results of the configuration where the first encoder is replaced directly by the output of the deep feature map, which is the preferred default configuration.

As mentioned in the main paper, in practice, we find that replacing the first encoder directly with the output of the deep feature map slightly improves the results while being more efficient. Specifically, if $N$ decoder layers are used, normally there should be $N$ corresponding encoder layers, as shown in Fig. 1 of the main paper. Then, to save some computation, we only use $N-1$ encoder layers, together with the CNN feature map as the first layer to pair with the decoders. For these two configurations, the experimental results are illustrated in Fig. D, where it can be seen that replacing the first encoder slightly improves the running speed and, at the same time, slightly improves the accuracy, except when there is only one decoder. However, note that when there is only one decoder, there is no encoder in the default configuration, since the first encoder is replaced directly by the CNN feature map. Therefore, in this case the other configuration is better, which also proves the benefit of including an encoder.

B.4 Encoder vs. Decoder

Encoders are indeed useful, but not as important as the proposed decoder. This can be understood from the experiments. ViT is a pure encoder-based model, without decoders. However, from Table 2 of the main paper we can see that its generalization performance is not satisfactory. In contrast, Fig. D in this Appendix provides an example where $N=1$ corresponds to a model with only decoders but no encoders. In this case the mAcc is 38.76%, which is much better than the 27.42% of ViT in Table 2 of the main paper. The reasons may be twofold. First, though encoders are useful for feature learning, we observe that an encoder-only model may easily overfit the source data. Second, inspired by QAConv [16], the decoders perform local image matching explicitly, which is interpretable and has better generalization performance.

B.5 Efficiency And Practical Value

First, this work has theoretical value for understanding the Transformer's capability in image matching. Second, generalizable person re-identification is particularly designed towards practical applications, and as shown in Table 1 of the main paper, the proposed method has large improvements over existing methods; therefore, it deserves further study, e.g. improving its efficiency. Third, the proposed method already takes efficiency into account, with its simplified decoder and balanced parameter selection, and thus it is the most efficient among the cross-matching Transformers, as shown in Table 2 of the main paper. Other potential Transformers would encounter more difficulty in efficiency. Finally, compared to the SOTA method QAConv-GS, which spends 0.96 hours training on MSMT17, the proposed method spends 1.44 hours, which is still acceptable.

Appendix C Implementation Details And Variants of Different Transformers


(a) Transformer-Cat                                     (b) Transformer-Cross

Figure E: Illustration of (a) Transformer-Cat (adapted from ViT [12]) and (b) Transformer-Cross (adapted from the vanilla Transformer [32]).

Illustrations of Transformer-Cat and Transformer-Cross are shown in Fig. E. The same binary cross-entropy loss as in QAConv-GS [17] is applied for all methods.

C.1 ViT

ViT only contains Transformer encoders, and the same vanilla Transformer encoders are used for both ViT and the proposed method. The MLP head for the proposed model is only used in the decoder. Besides, the structure of the MLP head is almost the same as the feed-forward layer in the ViT encoder, where two FCs are used, together with ReLU and normalization.

C.2 Transformer-Cross

For Transformer-Cross, as in the vanilla Transformer, both $K$ and $V$ are the same, and they both come from the encoded memory (gallery features). In Transformer-Cross, encoders are applied only to the gallery features, while query features are input to the decoders directly without encoders. This is because:

  • Almost all components in Transformer encoders are already designed in decoders, such as the self-attention module (prior to cross-attention), the feed-forward layer, etc.

  • Besides, this keeps the design consistent with existing Transformers (e.g. the vanilla Transformer and DETR); we are not aware of a method that feeds query tokens through encoders before the decoders.

  • For the proposed method, the same encoders are applied to both gallery and query features. This is because, first, query and gallery are exchangeable, so we would like to design a symmetric distance metric; second, the proposed decoder is simplified and has no self-attention module as in the vanilla Transformer decoder.
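The following is a minimal PyTorch sketch of this Transformer-Cross data flow, with illustrative class and head names as assumptions: the gallery features pass through the encoder, while the raw query features go straight into a standard decoder layer, whose cross-attention takes K = V = the encoded gallery memory.

```python
import torch.nn as nn

class TransformerCrossSketch(nn.Module):
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.head = nn.Linear(d_model, 1)          # pairwise score head (assumption)

    def forward(self, query_feat, gallery_feat):   # (B, HW, d_model) token sequences
        memory = self.encoder(gallery_feat)        # only the gallery features are encoded
        out = self.decoder(tgt=query_feat, memory=memory)  # cross-attention: K = V = memory
        return self.head(out.mean(dim=1))          # one matching score per query-gallery pair
```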

As a variant, we further applied the encoders to the query features prior to the decoders in Transformer-Cross. The mAcc for this variant is 33.77%, lower than the 36.70% reported in Table 2 of the main paper. This may be because the structure becomes too complex to learn an effective distance metric.

C.3 Fusion in Transformer-Cat

For Transformer-Cat, we also tried appending the results of the Transformer in the lower block. We first tried score fusion as in the proposed method, resulting in an mAcc of 24.92%, compared to 25.34% in Table 2 of the main paper. Suspecting that the final normalization layer in the default encoder hindered the improvement, we removed it and obtained 25.10%. We then tried feature fusion instead of score fusion and obtained 25.22%. In no case were we able to improve the results through multi-layer fusion. This may be because the Transformer-Cat structure itself is not suitable for metric learning, as explained in Section 5 of the main paper.
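For reference, the two fusion strategies can be sketched as follows; the tensor shapes and the fusion head are assumptions for illustration, not the exact implementation used in these experiments.

```python
import torch

def score_fusion(per_layer_scores):
    """Average the per-pair matching scores produced by each Transformer block."""
    return torch.stack(per_layer_scores, dim=0).mean(dim=0)   # list of (B,) -> (B,)

def feature_fusion(per_layer_feats, fusion_head):
    """Concatenate per-pair features from each block and score them with one shared head."""
    fused = torch.cat(per_layer_feats, dim=-1)                # list of (B, d) -> (B, L*d)
    return fusion_head(fused).squeeze(-1)                     # e.g. an nn.Linear(L*d, 1) head
```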

C.4 Improved Components on Transformer-Cross

As with ViT and DETR, we would like to see the capability of the original Transformers for distance metric learning. Therefore, we made as few modifications as possible to the baseline architectures. Besides, both Transformer-Cat and Transformer-Cross are heavy to compute, which limits their value and further development. Nevertheless, we tried several variants, including multi-scale fusion among different Transformer layers and a shared FC for Q and K in cross-attention, to probe the maximal capability of Transformer-Cross.

  • For multi-scale fusion, the mAcc is 29.52%, lower than the 36.70% in Table 2 of the main paper, and we observe that the learning is not stable across different runs. Again, this may be because the structure of Transformer-Cross does not directly target image matching and metric learning.

  • An interesting finding is the role of a shared FC for Q and K in the cross-attention of the decoders (a minimal sketch is given after this list). We originally used a shared FC in the proposed model only because the distance metric is required to be symmetric; we did not observe, nor expect, it to be critical in the proposed pipeline (see Table 3 of the main paper for the ablation study). Transformer-Cross, on the other hand, is not a symmetric design for gallery-query pairs, so two different FCs are reasonable, as in the vanilla Transformer. However, when we force the FC to be shared between Q and K in Transformer-Cross, the mAcc surprisingly becomes 41.56%, which is much better than the 36.70% in Table 2 of the main paper and only slightly worse than the 42.12% of the proposed method. We conjecture that forcing a shared FC makes the feature spaces of Q and K consistent before cross-attention, which helps the subsequent metric learning task. Nevertheless, compared to the proposed model, Transformer-Cross is still too heavy, costing too much memory and computation time.

  • Based on the shared FC, we tried the multi-scale fusion again and obtained an mAcc of 41.52%.
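A minimal sketch of this shared-projection cross-attention is given below; the single-head formulation and the dimensions are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SharedQKCrossAttention(nn.Module):
    """Cross-attention where one linear projection is shared by Q and K, keeping the metric symmetric."""
    def __init__(self, d_model=512):
        super().__init__()
        self.qk_proj = nn.Linear(d_model, d_model)  # single FC shared by Q and K
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, query_feat, memory):          # (B, Nq, d), (B, Nk, d)
        q = self.qk_proj(query_feat)
        k = self.qk_proj(memory)                    # same weights as for Q
        v = self.v_proj(memory)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                             # (B, Nq, d)
```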

Appendix D Comparison to Other Methods

D.1 Comparison to DETR

DETR [Carion-2020-DETR] is the original work on Transformer-based object detection. Although DETR is not directly applicable to person re-identification due to its detection-oriented structure, beyond the particular detection head design, DETR is very similar to the vanilla Transformer compared in this paper, except that, to adapt to the person re-identification task (pairwise metric learning), our prediction head outputs pairwise similarities between gallery-query pairs. Therefore, the vanilla Transformer compared in this paper can be considered the person re-identification counterpart of DETR. Furthermore, the learnable queries in DETR inspired us to ask: how about using actual image queries instead of learnable queries? This leads to the Transformer-Cross method proposed in the paper, which is also compared in the experiments.

D.2 Why Not Other Attention Modules?

Other attention modules, such as the Non-local Network [wang2018non] and channel-wise attention [Hu-CVPR18-SE, Woo-eccv18-cbam], are mostly designed for feature representation learning, in particular feature enhancement or refinement, which operates within a single image; please refer to the Related Work section of the main paper. Image matching or distance metric learning, however, is a different task involving pairs of images. On the other hand, as motivated in the paper, Transformers in almost their original form have recently shown great success on computer vision tasks (e.g. ViT, DETR), and image matching is also a typical computer vision task (e.g. face recognition and person re-identification). Therefore, this is a timely study of whether Transformers are useful for image matching and how to apply or adapt them for this task.

D.3 Comparison to ResNet-IBN And FastReID

ResNet-IBN has already been compared in Table D of this supplementary material, where the DualNorm method is a straightforward extension of ResNet-IBN, and the proposed method performs much better. FastReID is a strong baseline for person re-identification; some results are available at https://github.com/JDAI-CV/fast-reid/tree/master/projects/CrossDomainReID, though most of them involve the DukeMTMC-reID dataset, which has been officially withdrawn due to ethical concerns. There is, however, a result for training on Market-1501 and testing on MSMT17, with Rank-1 29.8% and mAP 10.3%. Compared to the results in Table 1 of the main paper, the proposed method performs much better, with Rank-1 47.3% and mAP 18.4%.

Appendix E Additional Comparison to The State of The Art

E.1 Datasets

Some recent works, such as DIMN [song2019generalizable], DualNorm [jia2019frustratingly], and DDAN [Chen-AAAI2021-DDAN], used different experimental settings. To compare with them, we conducted additional experiments following their experimental protocols. Specifically, a large-scale combined source training dataset is constructed from the CUHK02 [Li-CVPR-2013-CUHK02], CUHK03 [Li-CVPR-2014-DeepReID], Market-1501 [zheng2015smarket], DukeMTMC-reID [ristani2016duke, zheng2017unlabeled] (note that DukeMTMC and its derived datasets have been officially withdrawn due to ethical concerns; we include it here only for comparison with existing results and discourage its further use), and CUHK-SYSU Person Search [Xiao-CVPR2017-CUHK-SYSU] datasets. All images in these datasets are used for training, regardless of their original training and test subset splits, resulting in a large-scale training set of 18,530 identities and 121,765 images.

For testing, four small datasets are used: VIPeR [gray2007evaluating], PRID [hirzer11a], QMUL GRID [loy2009multi], and i-LIDS [Zheng2009-Group]. The standard testing splits of these datasets are used for evaluation: 10 random training/test splits are evaluated per dataset, and the averaged results are reported. The single-shot evaluation protocol is used for all experiments. For each split, the probe/gallery image counts are as follows: VIPeR: 316/316; PRID: 100/649; GRID: 125/900; and i-LIDS: 60/60. On VIPeR, the swapped versions of the probe/gallery splits are also evaluated and averaged.
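The protocol above can be summarized with the following sketch, where evaluate_split is a hypothetical callable that returns (Rank-1, mAP) for a single probe/gallery split.

```python
# Probe/gallery image counts per split (single-shot protocol).
PROBE_GALLERY = {
    "VIPeR": (316, 316),
    "PRID": (100, 649),
    "GRID": (125, 900),
    "i-LIDS": (60, 60),
}

def evaluate_dataset(evaluate_split, name, num_splits=10):
    """Average Rank-1 and mAP over the 10 standard random splits of one test dataset."""
    results = [evaluate_split(name, split) for split in range(num_splits)]
    rank1 = sum(r1 for r1, _ in results) / num_splits
    mean_ap = sum(ap for _, ap in results) / num_splits
    return rank1, mean_ap
```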

Table D: Comparison of state-of-the-art direct cross-dataset evaluation results (%). Mob is short for MobileNetV2, Res for ResNet-50, and DN for DualNorm [jia2019frustratingly]. Each dataset column lists Rank-1 / mAP.

Method | Venue | Net | VIPeR | PRID | GRID | i-LIDS | Average
DIMN [song2019generalizable] | CVPR'19 | Mob | 51.2 / 60.1 | 39.2 / 52.0 | 29.3 / 41.1 | 70.2 / 78.4 | 47.5 / 57.9
DualNorm [jia2019frustratingly, Chen-AAAI2021-DDAN] | BMVC'19 | Mob | 53.9 / 58.0 | 60.4 / 64.9 | 41.4 / 45.7 | 74.8 / 78.5 | 57.6 / 61.8
BCaR [tamura2020bcar] | BMVC'20 | Mob | 50.4 / - | 37.1 / - | 31.9 / - | 68.7 / - | 47.0 / -
BCaR + DN [tamura2020bcar] | BMVC'20 | Mob | 57.3 / - | 62.0 / - | 42.3 / - | 80.0 / - | 60.4 / -
DDAN [Chen-AAAI2021-DDAN] | AAAI'21 | Mob | 52.3 / 56.4 | 54.5 / 58.9 | 50.6 / 55.7 | 78.5 / 81.5 | 59.0 / 63.1
DDAN + DN [Chen-AAAI2021-DDAN] | AAAI'21 | Mob | 56.5 / 60.8 | 62.9 / 67.5 | 46.2 / 50.9 | 78.0 / 81.2 | 60.9 / 65.1
QAConv-GS [Liao-2021-QAConv-GS] | arXiv'21 | Mob | 47.6 / 57.2 | 61.3 / 68.8 | 37.4 / 45.3 | 75.7 / 82.3 | 55.5 / 63.4
TransMatcher | Ours | Mob | 53.1 / 63.1 | 65.6 / 74.3 | 48.8 / 56.4 | 77.8 / 84.2 | 61.3 / 69.5
DualNorm [jia2019frustratingly, Chen-AAAI2021-DDAN] | BMVC'19 | Res | 59.4 / - | 69.6 / - | 43.7 / - | 78.2 / - | 62.7 / -
BCaR + DN [tamura2020bcar] | BMVC'20 | Res | 65.8 / - | 70.2 / - | 52.8 / - | 81.3 / - | 67.5 / -
QAConv-GS [Liao-2021-QAConv-GS] | arXiv'21 | Res | 57.8 / 67.5 | 63.0 / 71.5 | 51.9 / 61.3 | 79.2 / 85.4 | 63.0 / 71.4
TransMatcher | Ours | Res | 63.4 / 71.8 | 63.8 / 72.0 | 57.2 / 65.7 | 81.8 / 87.8 | 66.6 / 74.3

E.2 Results

For a fair comparison with existing results, MobileNetV2 with a width multiplier of 1.0 is additionally used as the backbone network for QAConv-GS and TransMatcher. The same configuration and training settings are applied as described in the main paper, except that the learning rates are decayed by 0.1 after 6 epochs and 9 epochs are trained in total, since with the GS sampler the number of iterations per epoch is determined by the number of classes, and the combined source training dataset has 18,530 identities. In addition, gradient clipping [Liao-2021-QAConv-GS] is applied with T=32 for MobileNetV2 rather than T=4 as with ResNet-50, because MobileNetV2 is smaller and suffers less from overfitting.
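As a rough illustration of this training schedule, the sketch below uses standard PyTorch utilities; note that the exact gradient clipping scheme of [Liao-2021-QAConv-GS] may differ from the plain gradient-norm clipping shown here, and the optimizer and model objects are assumed to exist.

```python
import torch

def build_schedule(optimizer, backbone="mobilenetv2"):
    # learning rates decayed by 0.1 after 6 epochs, 9 epochs trained in total
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.1)
    total_epochs = 9
    # clipping threshold T: 32 for MobileNetV2, 4 for ResNet-50
    clip_t = 32.0 if backbone == "mobilenetv2" else 4.0
    return scheduler, total_epochs, clip_t

# In the training loop, after loss.backward() and before optimizer.step():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_t)
```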

The evaluation results are shown in Table D. The proposed TransMatcher with the ResNet-50 backbone outperforms QAConv-GS by a large margin. With the MobileNetV2 backbone, TransMatcher also achieves the best results on average, though its Rank-1 results are slightly lower on some datasets. This shows TransMatcher's good learning capability from large-scale combined datasets. Besides, methods with the ResNet-50 backbone are much better than those with MobileNetV2, although MobileNetV2 theoretically has fewer parameters and lower computational cost. In practice, however, MobileNetV2 is not that efficient on GPUs: training TransMatcher with MobileNetV2 on the combined training dataset takes 153,903 seconds, compared to 169,257 seconds with ResNet-50 and 137,511 seconds with ResNet-18.

Appendix F Qualitative Analysis

(a) Positive pairs    (b) Negative pairs

Figure F: Examples of qualitative matching results on the Market-1501 dataset, obtained by the proposed TransMatcher using the model trained on the MSMT17 dataset. For each pair of images, the local correspondence matches found in the three decoder layers of TransMatcher are shown. Numbers represent similarity scores.

With the help of the GMP layer, inspired by QAConv [Liao-ECCV2020-QAConv], the proposed TransMatcher is able to find the best local correspondence matches in each decoder layer. Some qualitative matching results are shown in Fig. F for a better understanding of the proposed method. The model used here is trained on the MSMT17 dataset [Wei-CVPR18-PTGAN], and the evaluations are done on the query subsets of the Market-1501 dataset [zheng2015smarket]. Results of both positive pairs and hard negative pairs are shown. For a clear illustration, only reliable correspondences with matching scores above a certain threshold are shown, where the threshold is determined by a false acceptance rate of 1‰ over all matches of negative pairs. Note that the local positions are coarse due to the 24×8 size of the feature map.
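The following sketch shows one way such local correspondences and the score threshold could be obtained, under the assumption that each image pair yields a query-gallery similarity map in every decoder layer; the function names and tensor shapes are illustrative.

```python
import torch

def local_correspondences(q_tokens, g_tokens):
    """For one pair: per-query-location best matching score and gallery position via GMP."""
    sim = q_tokens @ g_tokens.t()            # (HWq, HWg) query-key similarities
    best_score, best_idx = sim.max(dim=1)    # global max pooling over gallery locations
    return best_score, best_idx              # score and matched position per query location

def far_threshold(negative_scores, far=1e-3):
    """Score threshold giving a 0.1% false acceptance rate over negative-pair matches."""
    return torch.quantile(negative_scores, 1.0 - far)
```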

As can be observed from Fig. F, the proposed method is able to find correct local correspondences for positive pairs of images, even under notable misalignments in scale and position, pose, viewpoint, and illumination variations, occlusions, and low-resolution blur. For hard negative pairs, the matching of TransMatcher still appears mostly reasonable, linking visually similar parts, or even the same person who might have been incorrectly labeled.

This indicates that the proposed TransMatcher is effective in local correspondence matching, and note that it learns to do so with only identity-level supervision. Moreover, its matching capability generalizes to datasets beyond the training set. The illustration also shows that, generally, the matching results of the first decoder layer are not as good as those of the next two layers, and the matching in the last decoder layer appears to be the best. This indicates that both Transformer encoders and decoders help the model to match better by aggregating global similarity information.

References

  • [1] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [3] Peixian Chen, Pingyang Dai, Jianzhuang Liu, Feng Zheng, Qi Tian, and Rongrong Ji. Dual distribution alignment network for generalizable person re-identification. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
  • [4] R. Ergys, S. Francesco, Z. Roger, C. Rita, and T. Carlo. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV workshop on Benchmarking Multi-Target Tracking, 2016.
  • [5] Douglas Gray, Shane Brennan, and Hai Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE International workshop on performance evaluation of tracking and surveillance, 2007.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [7] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Proc. Scandinavian Conf. on Image Analysis, 2011.
  • [8] Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9317–9326, 2019.
  • [9] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
  • [10] Jieru Jia, Qiuqi Ruan, and Timothy M Hospedales. Frustratingly easy person re-identification: Generalizing person re-id in practice. In British Machine Vision Conference, 2019.
  • [11] Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen, and Li Zhang. Style Normalization and Restitution for Generalizable Person Re-identification. In CVPR, 2020.
  • [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [13] Wei Li and Xiaogang Wang. Locally aligned feature transforms across views. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [14] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [15] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.
  • [16] Shengcai Liao and Ling Shao. Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting. In European Conference on Computer Vision (ECCV), 2020.
  • [17] Shengcai Liao and Ling Shao. Graph Sampling Based Deep Metric Learning for Generalizable Person Re-Identification. CoRR, abs/2104.01546, April 4, 2021.
  • [18] Hao Liu, Jiashi Feng, Meibin Qi, Jianguo Jiang, and Shuicheng Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, 26(7):3492–3506, 2017.
  • [19] Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan, and Xiaogang Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE international conference on computer vision, pages 350–359, 2017.
  • [20] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. CoRR, abs/2103.14030, 2021.
  • [21] Chen Change Loy, Tao Xiang, and Shaogang Gong. Multi-camera activity correlation analysis. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1988–1995. IEEE, 2009.
  • [22] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
  • [23] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-scale deep learning architectures for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5399–5408, 2017.
  • [24] Xuelin Qian, Yanwei Fu, Tao Xiang, Yu-Gang Jiang, and Xiangyang Xue. Leader-based Multi-Scale Attention Deep Architecture for Person Re-identification. TPAMI, 2020.
  • [25] Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. End-to-end deep kronecker-product matching for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6886–6895, 2018.
  • [26] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5363–5372, 2018.
  • [27] Jifei Song, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy M Hospedales. Generalizable person re-identification by domain-invariant mapping network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 719–728, 2019.
  • [28] Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. Part-aligned bilinear representations for person re-identification. Proceedings of the European Conference on Computer Vision (ECCV), pages 402–419, 2018.
  • [29] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [30] Masato Tamura and Tomoaki Yoshinaga. Bcar: Beginner classifier as regularization towards generalizable re-id. In BMVC, 2020.
  • [31] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
  • [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, June 2017.
  • [33] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 274–282. ACM, 2018.
  • [34] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. CoRR, abs/2102.12122, 2021.
  • [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [36] Yanan Wang, Shengcai Liao, and Ling Shao. Surpassing Real-World Source Training Data: Random 3D Characters for Generalizable Person Re-Identification. In 28th ACM International Conference on Multimedia (ACMMM), 2020.
  • [37] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
  • [38] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In European Conference on Computer Vision, volume 11211 of Lecture Notes in Computer Science, pages 3–19. Springer, 2018.
  • [39] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
  • [40] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2119–2128, 2018.
  • [41] Shuangjie Xu, Yu Cheng, Kang Gu, Yang Yang, Shiyu Chang, and Pan Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 4733–4742, 2017.
  • [42] Ye Yuan, Wuyang Chen, Tianlong Chen, Yang Yang, Zhou Ren, Zhangyang Wang, and Gang Hua. Calibrated Domain-Invariant Learning for Highly Generalizable Large Scale Re-Identification. In WACV, pages 3578–3587, 2020.
  • [43] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3186–3195, 2020.
  • [44] Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, and Xing Sun. Do not disturb me: Person re-identification under the interference of other pedestrians. In European Conference on Computer Vision, pages 647–663. Springer, 2020.
  • [45] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Proceedings of IEEE International Conference on Computer Vision, 2015.
  • [46] Wei Shi Zheng, Shaogang Gong, and Tao Xiang. Associating groups of people. In British Machine Vision Conference, 2009.
  • [47] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, pages 3774–3782, 2017.
  • [48] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.
  • [49] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [50] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable Transformers for End-to-End Object Detection. CoRR, abs/2010.04159:1–16, 2020.
  • [51] Zijie Zhuang, Longhui Wei, Lingxi Xie, Tianyu Zhang, Hengheng Zhang, Haozhe Wu, Haizhou Ai, and Qi Tian. Rethinking the Distribution Gap of Person Re-identification with Camera-based Batch Normalization. In ECCV, pages 140–157, 2020.

Biography

Shengcai Liao is a Lead Scientist at the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE, and a Senior Member of the IEEE. Previously, he was an Associate Professor at the Institute of Automation, Chinese Academy of Sciences (CASIA). He received the B.S. degree in mathematics from Sun Yat-sen University in 2005 and the Ph.D. degree from CASIA in 2010. He was a Postdoc at Michigan State University during 2010-2012. His research interests include object detection, recognition, and tracking, especially face- and person-related tasks. He has published over 100 papers, with over 15,000 citations and an h-index of 43 according to Google Scholar. He ranks 905th among 215,114 scientists (top 0.42%) in the field of AI for the single year 2019, according to Stanford University's study of the top 2% of scientists worldwide. His representative work, LOMO+XQDA, known for effective feature design and metric learning for person re-identification, has been cited over 1,900 times and ranks in the top 10 among 602 papers in CVPR 2015. He was awarded or co-awarded the Best Student Paper at ICB 2006, ICB 2015, and CCBR 2016, and the Best Paper at ICB 2007. He was also awarded IJCB 2014 Best Reviewer and CVPR 2019/2021 Outstanding Reviewer. He was an Assistant Editor for the book "Encyclopedia of Biometrics (2nd Ed.)". He serves as Program Chair for IJCB 2022 and Area Chair for CVPR 2022 and ECCV 2022. He served as Area Chair for ICPR 2016, ICB 2016 and 2018, SPC for IJCAI 2021, and as a reviewer for ICCV, CVPR, ECCV, NeurIPS, ICLR, AAAI, TPAMI, IJCV, TIP, etc. His team won the CVPR 2017 Detection in Crowded Scenes Challenge and the ICCV 2019 NightOwls Pedestrian Detection Challenge. Homepage: https://liaosc.wordpress.com/