
Meta Generative Attack on Person Reidentification

A V Subramanyam
A V Subramanyam is with IIITD, India (e-mail: [email protected]). We acknowledge the support of the IHUB-ANUBHUTI-IIITD FOUNDATION setup under the NM-ICPS scheme of the DST, India.
Abstract

Adversarial attacks have recently been investigated in person re-identification. These attacks perform well under cross-dataset or cross-model settings. However, the challenges present in the combined cross-dataset cross-model scenario do not allow these attacks to achieve similar accuracy. To this end, we propose a method with the goal of achieving better transferability against different models and across datasets. We generate a mask to obtain better performance across models and use meta learning to boost generalizability in the challenging cross-dataset cross-model setting. Experiments on Market-1501, DukeMTMC-reID and MSMT-17 demonstrate favorable results compared to other attacks.

Index Terms:
Adversarial attacks, Meta learning, ReID.

I Introduction

The tremendous performance of deep learning models has led to their widespread deployment in practice. However, these models can be manipulated by introducing minor perturbations [1, 2, 3, 4, 5]; such manipulations are known as adversarial attacks. In the case of person re-identification, for a given query input $\mathbf{x}$, a target model $f$ and a gallery, the attack is defined as,

$$\lVert f(\mathbf{x}+\boldsymbol{\delta})-f(\mathbf{x}_{g})\rVert_{2}>\lVert f(\mathbf{x}+\boldsymbol{\delta})-f(\bar{\mathbf{x}}_{g})\rVert_{2}\quad\text{s.t.}\quad\lVert\boldsymbol{\delta}\rVert_{p}\leq\epsilon,$$
$$\mathbf{x}_{g}\in topk(\mathbf{x}+\boldsymbol{\delta}),\quad ID(\mathbf{x})=ID(\mathbf{x}_{g})\neq ID(\bar{\mathbf{x}}_{g})$$

where $\mathbf{x}_{g}$ is a gallery sample sharing the query identity, $\bar{\mathbf{x}}_{g}$ is a gallery sample from a different identity, and $\boldsymbol{\delta}$ is the adversarial perturbation with an $l_{p}$-norm bound of $\epsilon$. $topk(\cdot)$ refers to the top $k$ retrieved images for the given query.
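For illustration, the retrieval criterion can be checked as in the short PyTorch sketch below. This is only a sketch under assumptions not stated above: the attack is judged by whether any same-identity gallery image survives in the top-$k$ list, and the helper name and tensor shapes are hypothetical.

```python
import torch

def attack_success(f, x_adv, gallery, gallery_ids, query_id, k=10):
    """Return True if no same-ID gallery image remains in the top-k list
    retrieved for the perturbed query (hypothetical helper)."""
    with torch.no_grad():
        q = f(x_adv.unsqueeze(0))            # (1, d) feature of the perturbed query
        g = f(gallery)                       # (N, d) gallery features
    dists = torch.cdist(q, g).squeeze(0)     # Euclidean distance to every gallery image
    topk_idx = dists.topk(k, largest=False).indices
    return not any(int(gallery_ids[i]) == int(query_id) for i in topk_idx)
```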

Adversarial attacks have been extensively investigated in the classification setting [6] and have also been studied in other domains [7, 8, 9] in recent times. However, to the best of our knowledge, there are very few works that study these attacks in the person re-identification domain. In the following we briefly discuss some classical attacks in the classification setting. Szegedy et al. [1] proposed the first work on generating adversarial samples for deep neural networks using L-BFGS. Goodfellow et al. [2] proposed an efficient adversarial sample generation method, the fast gradient sign method (FGSM). Kurakin et al. [10] proposed an iterative FGSM. Other prominent works include [11, 12, 13, 14, 15, 16].

In person re-identification [17, 18, 19, 20], both white-box and black-box attacks have been proposed [21, 22, 23, 24]. These attacks use a labeled source dataset and show that the attacks are transferable under cross-dataset, cross-model, or both settings. However, transferability of attacks in the challenging combined cross-dataset and cross-model setting remains an issue. In this work, we propose to use a mask and meta learning for better transferability of attacks. We also investigate adversarial attacks in a completely new setting where the source dataset does not have any labels and the target model structure and parameters are unknown.

II Related Works

In [25], the authors propose white-box and black-box attacks; the black-box attack only assumes that the victim model is unknown while the dataset is available. [26] introduces physically realizable attacks in the white-box setting by generating adversarial clothing patterns. [24] proposes a query-based attack wherein the images obtained by querying the victim model are used to form triplets for a triplet loss. [27] proposes a white-box self-metric attack, wherein the positive sample is obtained by adding noise to the given input and negative samples are obtained from other images. In [21], the authors propose a meta-learning framework using a labeled source and an extra association dataset; this method generalizes well in the cross-dataset scenario. In [22], Ding et al. propose a list-wise attack objective function along with model-agnostic regularization for better transferability. A GAN-based framework is proposed in [23], where the authors generate adversarial noise and a mask by training the network with a triplet loss.

In this work we use a GAN to generate adversarial samples. In order to achieve better transferability of the attack across models, we suppress the pixels that generate large gradients. Suppressing these gradients allows the network to focus on other pixels, i.e., pixels that are not explicitly salient with respect to the model used for the attack. We further use meta learning [28], which also allows the incorporation of an additional dataset to boost transferability. We refer to this attack as Meta Generative Attack (MeGA). Our work is closest in spirit to [21, 23]; however, the mask generation and the application of meta learning within a GAN framework are quite distinct from these works.

III Methodology

In this work we address both white-box and black-box attacks. We require the attack to be transferable across models and datasets. If we obtain the attack sample using a given model $f$, the attack is inherently tied to $f$ [16]. So that the attack does not over-learn, we apply a mask that shifts the focus to regions that are not highly salient for discrimination. This way the network can focus on less salient but still discriminative regions, thereby increasing the generalizability of the attack to other models. On the other hand, meta learning has been used efficiently in adversarial attacks [29, 21, 30] to obtain better transferability across datasets. However, meta learning has not been explored together with generative learning for attacks in the case of PRID. We adapt the MAML meta-learning framework [28] in our proposed method. While existing black-box attack works assume the presence of a labeled source dataset, we additionally present a more challenging setting wherein no labels are available during the attack.

Figure 1: Model architecture. Mask $\mathbf{M}$ is generated using model $f$ and is used to mask the input $\mathbf{x}$. The GAN is trained in a meta-learning framework with an adversarial triplet loss and a GAN loss.

Our proposed model is illustrated in Figure 1. In the white-box setting, the generator $\mathcal{G}$ is trained using the generator loss, the adversarial triplet loss and the meta-learning loss, while the discriminator $\mathcal{D}$ is trained with the classical binary cross-entropy discriminator loss. The mask is obtained via a self-supervised triplet loss. The network learns to generate adversarial images: while the GAN loss focuses on generating realistic samples, the adversarial triplet loss guides the network to generate samples that are closer to negative samples and farther away from positive samples.

III-A GAN training

Given a clean sample $\mathbf{x}$, we use the generator $\mathcal{G}$ to create the adversarial sample $\mathbf{x}_{adv}$. The overall GAN loss is given by $\mathcal{L}_{GAN}=E_{\mathbf{x}}\log\mathcal{D}(\mathbf{x})+E_{\mathbf{x}}\log(1-\mathcal{D}(\Pi(\mathcal{G}(\mathbf{x}))))$. Here $\Pi(\cdot)$ denotes the projection onto the $l_{\infty}$ ball of radius $\epsilon$ around $\mathbf{x}$, and $\mathbf{x}_{adv}=\Pi(\mathcal{G}(\mathbf{x}))$. In order to generate adversarial samples, a deep mis-ranking loss is used [23],

$$\mathcal{L}_{adv-trip}(\mathbf{x}_{adv}^{a},\mathbf{x}_{adv}^{n},\mathbf{x}_{adv}^{p})=\max(\lVert\mathbf{x}_{adv}^{a}-\mathbf{x}_{adv}^{n}\rVert_{2}-\lVert\mathbf{x}_{adv}^{a}-\mathbf{x}_{adv}^{p}\rVert_{2}+m,\,0)\quad(1)$$

where $m$ is the margin, $\mathbf{x}_{adv}^{a}$ is the adversarial sample obtained from the anchor sample $\mathbf{x}^{a}$, and $\mathbf{x}_{adv}^{p}$ and $\mathbf{x}_{adv}^{n}$ are the adversarial samples obtained from the respective positive and negative samples $\mathbf{x}^{p}$ and $\mathbf{x}^{n}$. This loss pulls the adversarial anchor closer to the negative and pushes it farther from the positive, so the network learns to generate convincing adversarial samples.
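A minimal PyTorch sketch of these components is given below, assuming images scaled to $[0,1]$ (so $\epsilon=16$ corresponds to $16/255$), a discriminator that outputs logits, and mis-ranking distances computed on features of the attacked model $f$; the helper names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def project_linf(x_gen, x, eps=16 / 255.0):
    """Pi(.): project the generated image into the l-infinity ball of radius
    eps around the clean image x, and into the valid pixel range."""
    delta = torch.clamp(x_gen - x, -eps, eps)
    return torch.clamp(x + delta, 0.0, 1.0)

def gan_losses(D, G, x, eps=16 / 255.0):
    """Discriminator loss (BCE form of max E[log D(x)] + E[log(1 - D(x_adv))])
    and the generator term L_G = E[log(1 - D(x_adv))], which G minimizes."""
    x_adv = project_linf(G(x), x, eps)
    real, fake = D(x), D(x_adv.detach())
    d_loss = F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) + \
             F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
    g_loss = torch.log(1.0 - torch.sigmoid(D(x_adv)) + 1e-6).mean()
    return d_loss, g_loss, x_adv

def adv_triplet_loss(f, xa_adv, xn_adv, xp_adv, m=1.0):
    """Deep mis-ranking loss of Eq. (1): pull the adversarial anchor toward the
    negative and push it away from the positive in f's feature space."""
    fa, fn, fp = f(xa_adv), f(xn_adv), f(xp_adv)
    d_an = torch.norm(fa - fn, p=2, dim=1)
    d_ap = torch.norm(fa - fp, p=2, dim=1)
    return torch.clamp(d_an - d_ap + m, min=0.0).mean()
```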

III-B Mask Generation

An attack obtained using the given model $f$ generalizes poorly to other networks. In order to obtain better transferability, we first compute the gradients with respect to the self-supervised triplet loss $\mathcal{L}_{adv-trip}(\mathbf{x},\mathbf{x}^{n},\mathbf{x}^{p})$, where $\mathbf{x}^{p}$ is obtained by augmenting $\mathbf{x}$ and $\mathbf{x}^{n}$ is the sample in the batch at the maximum Euclidean distance from $\mathbf{x}$. Here, the large gradients are primarily responsible for loss convergence. Since this route to convergence is clearly coupled with $f$, we mask the large gradients. The convergence then does not depend entirely on the large gradients and shifts to smaller ones, which can also possess discriminative information; the overfitting to $f$ is thereby reduced by the mask. To obtain the mask, we compute,

$$\mathbf{grad}_{adv-triplet}=\nabla_{\mathbf{x}}\mathcal{L}_{adv-trip}(\mathbf{x},\mathbf{x}^{n},\mathbf{x}^{p})\quad(2)$$

Note that we use the real samples in Eq. 2. The mask is given by $\mathbf{M}=\mathrm{sigmoid}(\lvert\mathbf{grad}_{adv-triplet}\rvert)$, where $\lvert\cdot\rvert$ denotes the absolute value. We mask $\mathbf{x}$ before feeding it as input to the generator $\mathcal{G}$. The masked input is given as $\mathbf{x}\leftarrow\mathbf{x}\odot(1-\mathbf{M})$, where $\odot$ denotes the Hadamard product.
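The mask computation admits a compact sketch under the same assumptions ($f$ returns features; the positive is an augmented view of $\mathbf{x}$ and the negative is the farthest in-batch sample, both prepared elsewhere); the helper name is hypothetical.

```python
import torch

def masked_input(f, x, x_pos, x_neg, m=1.0):
    """Compute M = sigmoid(|grad of the self-supervised triplet loss w.r.t. x|)
    (Eq. 2) and return the masked input x * (1 - M) fed to the generator."""
    x = x.clone().detach().requires_grad_(True)
    d_n = torch.norm(f(x) - f(x_neg), p=2, dim=1)   # distance to in-batch negative
    d_p = torch.norm(f(x) - f(x_pos), p=2, dim=1)   # distance to augmented positive
    loss = torch.clamp(d_n - d_p + m, min=0.0).mean()
    grad = torch.autograd.grad(loss, x)[0]
    M = torch.sigmoid(grad.abs())                   # continuous mask; large gradients map near 1
    return (x * (1.0 - M)).detach(), M.detach()
```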

Masking techniques have also been explored in [31, 32], where the idea is to learn a model that does not overfit to the training distribution. Our masking technique is motivated by the requirement that an adversarial example should be transferable across different re-ID models. It is distinct in that it can be applied to an individual sample, whereas the masking techniques in [31, 32] seek agreement among the gradients obtained from all samples of a batch and require tuning a hyperparameter. Further, the mask of [31] is boolean while ours is continuous.

III-C Meta Learning

Meta optimization allows learning from multiple datasets for different tasks while generalizing well on a given task. One of the popular meta-learning approaches, MAML [28], applies two update steps: the first happens in an inner loop with a meta-train set, while the second happens in an outer loop with a meta-test set. In our case, the inner-loop update is performed on the discriminator and generator parameters using the meta-train set, and the outer-loop update is performed on the generator parameters using the meta-test set.

input: Datasets $\mathcal{T}$ and $\mathcal{A}$, model $f$
output: Generator network $\mathcal{G}$ parameters $\boldsymbol{\theta}_{g}$
while not converged do
       for samples in $\mathcal{T}$ do
             /* Obtain the mask */
             $\mathbf{M}\leftarrow\sigma(\lvert\nabla_{\mathbf{x}}\mathcal{L}_{adv-trip}(\mathbf{x},\mathbf{x}^{n},\mathbf{x}^{p})\rvert)$
             /* Meta-train update using $\mathcal{T}$ */
             $\boldsymbol{\theta}_{d}\leftarrow\operatorname{arg\,max}_{\boldsymbol{\theta}_{d}}E_{\mathbf{x}}\log\mathcal{D}(\mathbf{x})+E_{\mathbf{x}}\log(1-\mathcal{D}(\Pi(\mathcal{G}(\mathbf{x}))))$
             $\boldsymbol{\theta}_{g}\leftarrow\operatorname{arg\,min}_{\boldsymbol{\theta}_{g}}\mathcal{L}_{\mathcal{G}}^{\mathcal{T}}+\lambda\mathcal{L}_{adv-trip}^{\mathcal{T}}(\mathbf{x}_{adv}^{a},\mathbf{x}_{adv}^{n},\mathbf{x}_{adv}^{p})$
             $\boldsymbol{\delta}=\mathbf{x}-\Pi(\mathcal{G}(\mathbf{x}))$
             /* Meta-test loss using $\mathcal{A}$ */
             Sample triplets from meta-test set $\mathcal{A}$ and compute $\mathcal{L}=\mathcal{L}_{adv-trip}^{\mathcal{A}}(\mathbf{x}^{a}-\boldsymbol{\delta},\mathbf{x}^{n},\mathbf{x}^{p})$
       /* Meta-test update */
       $\boldsymbol{\theta}_{g}\leftarrow\operatorname{arg\,min}_{\boldsymbol{\theta}_{g}}\lambda\mathcal{L}$
Algorithm 1: Training for MeGA

More formally, given a discriminator $\mathcal{D}$ parametrized by $\boldsymbol{\theta}_{d}$ and a generator $\mathcal{G}$ parametrized by $\boldsymbol{\theta}_{g}$, we perform the meta-training phase to obtain the parameters $\boldsymbol{\theta}_{d}$ and $\boldsymbol{\theta}_{g}$. The update steps are given in Algorithm 1. We also obtain the adversarial perturbation as $\boldsymbol{\delta}=\mathbf{x}-\Pi(\mathcal{G}(\mathbf{x}))$.

We then apply the meta-test update using the additional meta-test dataset $\mathcal{A}$. In Algorithm 1, $\mathcal{L}_{\mathcal{G}}^{\mathcal{T}}=E_{\mathbf{x}}\log(1-\mathcal{D}(\Pi(\mathcal{G}(\mathbf{x}))))$. We distinguish the datasets using the superscripts $\mathcal{T}$ for the meta-train set and $\mathcal{A}$ for the meta-test set; $\mathcal{L}_{adv-trip}^{\mathcal{A}}$ draws its samples $\mathbf{x}$ from $\mathcal{A}$. At the inference stage, we only use $\mathcal{G}$ to generate the adversarial sample.
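A condensed per-batch sketch of Algorithm 1 is shown below. It assumes the helpers sketched earlier (masked_input, gan_losses, project_linf, adv_triplet_loss), loaders that yield (anchor, positive, negative) triplets for $\mathcal{T}$ and $\mathcal{A}$, and the hyperparameters of Section IV-A; for brevity the meta-test update of $\boldsymbol{\theta}_{g}$ is applied per batch rather than once per pass over $\mathcal{T}$ as in Algorithm 1.

```python
import torch

def train_mega(G, D, f, loader_T, loader_A, lam=0.01, eps=16 / 255.0,
               lr=1e-5, epochs=40):
    """Sketch of MeGA training: meta-train updates of D and G on the source
    set T, followed by a meta-test update of G on the additional set A."""
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    for _ in range(epochs):
        for (xa, xp, xn), (ya, yp, yn) in zip(loader_T, loader_A):
            # Mask the anchor using gradients of the self-supervised triplet loss.
            xm, _ = masked_input(f, xa, xp, xn)

            # Meta-train: update the discriminator.
            d_loss, _, _ = gan_losses(D, G, xm, eps)
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Meta-train: update the generator with the GAN and mis-ranking losses.
            _, g_loss, xa_adv = gan_losses(D, G, xm, eps)
            xp_adv = project_linf(G(xp), xp, eps)
            xn_adv = project_linf(G(xn), xn, eps)
            loss_g = g_loss + lam * adv_triplet_loss(f, xa_adv, xn_adv, xp_adv)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()

            # Meta-test: transfer the perturbation delta to a meta-test triplet
            # and update the generator again (gradients flow through delta).
            delta = xa - project_linf(G(xm), xm, eps)
            meta_loss = lam * adv_triplet_loss(f, ya - delta, yn, yp)
            opt_g.zero_grad(); meta_loss.backward(); opt_g.step()
    return G
```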

III-D Training in absence of labels

The deep mis-ranking loss [23] can be used when labels are available for $\mathcal{T}$. Here we present the case where no labels are available. In the absence of labels, inspired by the unsupervised contrastive loss [33], we generate a positive sample $\mathbf{x}_{adv}^{p}$ by applying augmentation to the given sample $\mathbf{x}_{adv}^{a}$. The negative sample $\mathbf{x}_{adv}^{n}$ is generated using a batch-hard negative strategy: we consider all samples except the augmented version of $\mathbf{x}_{adv}^{a}$ as negatives and choose the one closest to $\mathbf{x}_{adv}^{a}$. We then use Eq. 1 to obtain the adversarial triplet loss.
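A sketch of this label-free triplet construction is given below; a horizontal flip stands in for the augmentation (the specific augmentations are an assumption) and the hardest in-batch negative is chosen in the feature space of $f$.

```python
import torch

def unsupervised_triplets(f, batch):
    """Build (anchor, positive, negative) triplets without labels: the positive
    is an augmented view of each anchor and the negative is the closest other
    sample in the batch (batch-hard negative)."""
    pos = torch.flip(batch, dims=[-1])        # illustrative augmentation: horizontal flip
    with torch.no_grad():
        feats = f(batch)                      # (B, d) features of the batch
    dists = torch.cdist(feats, feats)         # pairwise Euclidean distances
    dists.fill_diagonal_(float('inf'))        # a sample cannot be its own negative
    neg = batch[dists.argmin(dim=1)]          # hardest (closest) in-batch negative
    return batch, pos, neg
```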

IV Experimental Results

IV-A Implementation Details

We implement the proposed method in the PyTorch framework. The GAN architecture is similar to that used in [34, 35]. We use the models from the Model Zoo [36]: OSNet [17], MLFN [18], HACNN [37], ResNet-50 and ResNet-50-FC512. We also use AlignedReID [38, 39], LightMBN [40] and PCB [41, 42]. We use the Adam optimizer with a learning rate of $10^{-5}$, $\beta_{1}=0.5$ and $\beta_{2}=0.999$, and train the model for 40 epochs. We set $m=1$, $\lambda=0.01$ and $\epsilon=16$. In order to stabilize GAN training, we apply label flipping with 5% flipped labels. We first present ablations for the mask and meta learning.
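The label-flipping trick mentioned above can be sketched as follows; the 5% probability matches the text, while the function name and its exact placement in the discriminator update are assumptions.

```python
import torch

def flip_labels(targets, flip_prob=0.05):
    """Randomly flip a small fraction of the real/fake labels fed to the
    discriminator; a common trick to stabilize GAN training."""
    flip = torch.rand_like(targets) < flip_prob
    return torch.where(flip, 1.0 - targets, targets)

# Example: real labels for a batch of 32 images, with roughly 5% flipped to fake.
real_targets = torch.ones(32, 1)
noisy_targets = flip_labels(real_targets)
```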

IV-B Effect of mask $\mathbf{M}$

We find that when the mask is generated with ResNet-50 and the attack is tested on different models such as MLFN [18] and HACNN [37], there is a substantial gain in performance, as shown in Table I. In terms of R-1 accuracy, introducing the mask strengthens the attack by a relative 42.10% and 4.8% for MLFN and HACNN respectively (R-1 drops from 3.23% to 1.87% and from 42.01% to 39.99%). This indicates that the mask provides better transferability. When we evaluate on ResNet-50 itself, there is only a minor change in performance, which could be because the mask is learnt using ResNet-50 itself.

TABLE I: Trained on Market-1501 [43]. Setting: Market-1501 → Market-1501. $l$ indicates Market-1501 labels are used for training; $\mathbf{M}$ indicates the incorporation of the mask; AND and SAND denote the gradient-masking strategies of [31] and [32], respectively. 'Before' indicates accuracy on clean samples.

| Attack | ResNet-50 mAP | ResNet-50 R-1 | MLFN mAP | MLFN R-1 | HACNN mAP | HACNN R-1 |
|---|---|---|---|---|---|---|
| Before | 70.4 | 87.9 | 74.3 | 90.1 | 75.6 | 90.9 |
| l | 0.66 | 0.41 | 3.95 | 3.23 | 32.57 | 42.01 |
| l + AND | 0.56 | 0.35 | 5.39 | 4.55 | 35.13 | 44.20 |
| l + SAND | 0.51 | 0.33 | 6.01 | 4.89 | 37.50 | 45.11 |
| l + M | 0.69 | 0.50 | 2.80 | 1.87 | 31.73 | 39.99 |

IV-C Effect of meta learning

We demonstrate the effect of meta learning in Table II. In the cross-dataset (ResNet-50) as well as the cross-dataset cross-model (MLFN) setting, we observe that introducing meta learning gives a significant performance boost. In terms of R-1 accuracy, the attack improves by a relative 69.87% and 69.29% for ResNet-50 and MLFN respectively (R-1 drops from 24.86% to 7.49% and from 24.10% to 7.4%). We further observe that ResNet-50 does not transfer well to HACNN. This could be due to two reasons: first, ResNet-50 is a basic model compared to other, superior PRID models; second, HACNN is built on Inception units [44].

TABLE II: Trained on Market-1501 using MSMT-17 [45] as meta-test set. Setting: Market-1501 → DukeMTMC-reID [46]. $\mathcal{A}$ indicates the incorporation of meta learning.

| Attack | ResNet-50 mAP | ResNet-50 R-1 | MLFN mAP | MLFN R-1 | HACNN mAP | HACNN R-1 |
|---|---|---|---|---|---|---|
| Before | 58.9 | 78.3 | 63.2 | 81.1 | 63.2 | 80.1 |
| l | 17.96 | 24.86 | 18.25 | 24.10 | 42.75 | 58.48 |
| l + A | 5.80 | 7.49 | 6.15 | 7.4 | 43.12 | 58.97 |

IV-D Adversarial attack performance

We first present the results for the cross-model attack in Table III. We use the AlignedReID model, Market-1501 [43] as the training set and MSMT-17 [45] as the meta-test set. Results are reported for Market-1501 and DukeMTMC-reID [46]. In the case of Market-1501, it is clearly evident that the proposed method achieves strong transferability. Incorporating the meta-test set reduces mAP and R-1 to less than half of the values obtained when only labels are used; for instance, mAP and R-1 of AlignedReID go down from 7.00% and 6.38% to 3.51% and 2.82%, respectively, and this is consistently observed for all three models. Further, the combined use of mask and meta learning ($l+\mathbf{M}+\mathcal{A}$), denoted MeGA, achieves the best results in the cross-model cases of PCB and HACNN, with relative R-1 improvements of 10.00% and 9.10%, respectively. Thus our method is very effective in generating adversarial samples.

TABLE III: AlignedReID trained on Market-1501 with MSMT-17 as meta-test set. M is Market-1501 and D is DukeMTMC-reID. MeGA denotes $l+\mathbf{M}+\mathcal{A}$.

| Setting | Attack | AlignedReID mAP | AlignedReID R-1 | PCB mAP | PCB R-1 | HACNN mAP | HACNN R-1 |
|---|---|---|---|---|---|---|---|
| M → M | Before | 77.56 | 91.18 | 78.54 | 92.87 | 75.6 | 90.9 |
| | l | 7.00 | 6.38 | 16.46 | 29.69 | 16.39 | 20.16 |
| | l + M | 6.62 | 5.93 | 15.96 | 28.94 | 16.01 | 19.47 |
| | l + A | 3.51 | 2.82 | 8.07 | 13.86 | 5.44 | 5.28 |
| | MeGA | 5.50 | 5.07 | 7.39 | 12.47 | 4.85 | 4.80 |
| M → D | l | 16.04 | 21.14 | 13.35 | 15.66 | 15.94 | 21.85 |
| | l + M | 16.23 | 21.72 | 13.70 | 15.97 | 16.43 | 22.17 |
| | l + A | 4.69 | 5.70 | 11.10 | 12.88 | 5.40 | 6.55 |
| | MeGA | 7.70 | 9.47 | 11.81 | 14.04 | 4.73 | 5.40 |

In the case of Market-1501 to DukeMTMC-reID, we observe that simply applying meta learning ($l+\mathcal{A}$) generalizes very well. For AlignedReID, the mAP and R-1 of 4.69% and 5.70% are significantly lower than the results obtained in the $l$ or $l+\mathbf{M}$ settings. The combined setting of mask and meta learning yields better results for HACNN than for AlignedReID and PCB. This may be because the learning of the mask is still tied to the training set and can thus overfit.

In Table IV we discuss the results for the cross-dataset cross-model case against more models. Here too, both AlignedReID and PCB lead to strong attacks against other models on a different dataset.

In Table V, we present the results for MSMT-17. Here, the attack is trained on Market-1501 using AlignedReID and PCB, with DukeMTMC-reID as the meta-test set. When trained and tested using AlignedReID, the R-1 accuracy drops from 67.6% on clean samples to 17.69%. When trained using PCB and tested on AlignedReID, the performance drops to 16.70%. This shows that our attack is very effective on large-scale datasets such as MSMT-17.

TABLE IV: AlignedReID and PCB trained on Market-1501 with MSMT-17 as meta-test set. Setting: Market-1501 → DukeMTMC-reID.

| Target model | Before mAP | Before R-1 | AlignedReID mAP | AlignedReID R-1 | AlignedReID R-10 | PCB mAP | PCB R-1 | PCB R-10 |
|---|---|---|---|---|---|---|---|---|
| OSNet | 70.2 | 87.0 | 15.31 | 22.30 | 35.00 | 12.27 | 14.45 | 27.49 |
| LightMBN | 73.4 | 87.9 | 16.24 | 24.13 | 39.65 | 12.88 | 15.70 | 28.54 |
| ResNet-50 | 58.9 | 78.3 | 5.17 | 6.64 | 13.77 | 7.14 | 8.55 | 20.01 |
| MLFN | 63.2 | 81.1 | 12.28 | 16.38 | 29.39 | 11.95 | 16.54 | 30.92 |
| ResNet-50-FC512 | 64.0 | 81.0 | 6.97 | 9.69 | 19.38 | 9.45 | 11.46 | 23.90 |
| HACNN | 63.2 | 80.1 | 4.77 | 5.61 | 11.98 | 3.97 | 4.66 | 10.00 |
TABLE V: Trained on Market-1501 using DukeMTMC-reID as meta-test set. Setting: Market-1501 → MSMT-17. Target model: AlignedReID.

| Attack | mAP | R-1 | R-10 |
|---|---|---|---|
| MeGA (AlignedReID) | 9.37 | 17.69 | 33.42 |
| MeGA (PCB) | 8.82 | 16.70 | 31.98 |
Figure 2: Left column: the red and blue boxes show a given image from Market-1501 and its mask ($1-\mathbf{M}$), respectively. Right column: attacked (top) and clean (bottom) images from MSMT-17.

IV-E Comparison with SOTA models

In Table VI we present the comparison with TCIAA [23], UAP [47] and Meta-attack [21]. We observe that our method outperforms TCIAA by a large margin. We can also see that when the mis-ranking loss is naively applied in the case of TCIAA [21], the attack performance degrades. Our attack performs better than both TCIAA and Meta-attack.

TABLE VI: AlignedReID trained on Market-1501 with MSMT-17 as meta-test set. Setting: Market-1501 → DukeMTMC-reID. For methods listed twice, the variants differ in their use of PersonX [48]: one uses it as an extra dataset, the other uses it for meta learning.

| Target: AlignedReID | mAP | R-1 | R-10 |
|---|---|---|---|
| Before | 67.81 | 80.50 | 93.18 |
| TCIAA [23] | 14.2 | 17.7 | 32.6 |
| MeGA (Ours) | 11.34 | 12.81 | 24.11 |
| MeGA (Ours) | 7.70 | 9.47 | 19.16 |

| Target: PCB | mAP | R-1 | R-10 |
|---|---|---|---|
| Before | 69.94 | 84.47 | - |
| TCIAA [23] | 31.2 | 45.4 | - |
| TCIAA [23] | 38.0 | 51.4 | - |
| UAP [47] | 29.0 | 41.9 | - |
| Meta-attack ($\epsilon=8$) [21] | 26.9 | 39.9 | - |
| MeGA ($\epsilon=8$) (Ours) | 22.91 | 31.70 | - |
| MeGA ($\epsilon=8$) (Ours) | 18.01 | 21.85 | 44.29 |

IV-F Subjective Evaluation

We show example images generated by our algorithm in Figure 2 and the top-5 retrieved results in Figure 3 for the OSNet model. For the clean query, the top-3 retrieved images match the query ID; however, none of the retrieved images match the query ID in the presence of our attack.

Figure 3: Query image marked with a blue border. Top: top-5 images retrieved by OSNet on Market-1501 for the clean query; green boxes indicate correct matches and red boxes incorrect ones. Bottom: images retrieved after attacking the query sample.

IV-G Attack using unlabelled source

In this section we discuss the attack when the source dataset $\mathcal{T}$ is unlabeled and neither the victim model nor the dataset used to train it is available. This is a very challenging scenario as supervised models cannot be used for the attack. Towards this, we use unsupervised models trained on Market-1501 and MSMT-17 from [49]. In Table VII, we present results for training on MSMT-17 and testing on Market-1501. We observe that, against OSNet, IBN R50 obtains an mAP and R-1 accuracy of 40.7% and 52.34% when neither labels nor the mask is used. When the mask is incorporated, the attack strengthens by 3.82% in mAP and 4.81% in R-1 for OSNet; these gains are even higher for MLFN and HACNN.

In the case of Market-1501 to MSMT-17 in Table VIII, we see that the attack using only the mask performs reasonably well compared to the attacks using labels or both labels and the mask. Due to the comparatively small size of Market-1501, even the attacks using labels are not very effective.

TABLE VII: MSMT-17 → Market-1501. R50 denotes ResNet-50.

| Attack | OSNet mAP | OSNet R-1 | MLFN mAP | MLFN R-1 | HACNN mAP | HACNN R-1 |
|---|---|---|---|---|---|---|
| Before | 82.6 | 94.2 | 74.3 | 90.1 | 75.6 | 90.9 |
| l (R50) | 30.50 | 39.45 | 26.37 | 38.03 | 31.15 | 39.34 |
| l + M (R50) | 24.50 | 33.07 | 21.76 | 32.18 | 18.81 | 23.66 |
| M (R50) | 36.5 | 47.56 | 34.92 | 52.61 | 31.15 | 39.34 |
| IBN R50 | 40.7 | 52.34 | 40.62 | 61.46 | 35.44 | 44.84 |
| M (IBN R50) | 36.88 | 47.53 | 35.01 | 52.79 | 30.98 | 38.98 |
TABLE VIII: Market-1501 → MSMT-17.

| Attack | OSNet mAP | OSNet R-1 | MLFN mAP | MLFN R-1 | HACNN mAP | HACNN R-1 |
|---|---|---|---|---|---|---|
| Before | 43.8 | 74.9 | 37.2 | 66.4 | 37.2 | 64.7 |
| l (R50) | 31.78 | 60.43 | 25.17 | 49.33 | 28.9 | 54.91 |
| l + M (R50) | 29.04 | 56.11 | 22.02 | 43.57 | 28.26 | 53.53 |
| M (R50) | 35.16 | 66.28 | 29.16 | 56.65 | 29.69 | 57.81 |

V Conclusion

We present a generative adversarial attack method that uses a mask and meta learning. The mask enables better transferability across different networks, whereas meta learning provides better generalizability across datasets. We present results under various settings, and our ablations show the importance of both the mask and meta learning. Extensive experiments on Market-1501, MSMT-17 and DukeMTMC-reID show the efficacy of the proposed method.

References

  • [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [3] B. Wang, M. Zhao, W. Wang, F. Wei, Z. Qin, and K. Ren, “Are you confident that you have successfully generated adversarial examples?” TCSVT, vol. 31, no. 6, pp. 2089–2099, 2020.
  • [4] B. Wang, M. Zhao, W. Wang, X. Dai, Y. Li, and Y. Guo, “Adversarial analysis for source camera identification,” TCSVT, vol. 31, no. 11, pp. 4174–4186, 2020.
  • [5] H. Zhang, B. Chen, J. Wang, and G. Zhao, “A local perturbation generation method for gan-generated face anti-forensics,” IEEE TCSVT, 2022.
  • [6] N. Akhtar, A. Mian, N. Kardan, and M. Shah, “Advances in adversarial attacks and defenses in computer vision: A survey,” IEEE Access, vol. 9, pp. 155161–155196, 2021.
  • [7] Q. Li, X. Wang, B. Ma, X. Wang, C. Wang, S. Gao, and Y. Shi, “Concealed attack for robust watermarking based on generative model and perceptual loss,” TCSVT, 2021.
  • [8] Z. Li, Y. Shi, J. Gao, S. Wang, B. Li, P. Liang, and W. Hu, “A simple and strong baseline for universal targeted attacks on siamese visual tracking,” TCSVT, 2021.
  • [9] S. Jia, X. Li, C. Hu, G. Guo, and Z. Xu, “3d face anti-spoofing with factorized bilinear coding,” TCSVT, vol. 31, no. 10, pp. 4031–4045, 2020.
  • [10] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
  • [11] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
  • [12] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Symposium on security and privacy, 2017, pp. 39–57.
  • [13] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in European symposium on security and privacy, 2016, pp. 372–387.
  • [14] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” in CVPR, 2018, pp. 9185–9193.
  • [15] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in ICML.   PMLR, 2020, pp. 2206–2216.
  • [16] Z. Wang, H. Guo, Z. Zhang, W. Liu, Z. Qin, and K. Ren, “Feature importance-aware transferable adversarial attacks,” in ICCV, 2021.
  • [17] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in ICCV, 2019, pp. 3702–3712.
  • [18] X. Chang, T. M. Hospedales, and T. Xiang, “Multi-level factorisation net for person re-identification,” in CVPR, 2018, pp. 2109–2118.
  • [19] Y.-J. Li, C.-S. Lin, Y.-B. Lin, and Y.-C. F. Wang, “Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation,” in ICCV, 2019, pp. 7919–7929.
  • [20] Y. Yang, P. Tiwari, H. M. Pandey, Z. Lei et al., “Pixel and feature transfer fusion for unsupervised cross-dataset person reidentification,” TNNLS, 2021.
  • [21] F. Yang, Z. Zhong, H. Liu, Z. Wang, Z. Luo, S. Li, N. Sebe, and S. Satoh, “Learning to attack real-world models for person re-identification via virtual-guided meta-learning,” in AAAI, vol. 35, no. 4, 2021.
  • [22] W. Ding, X. Wei, R. Ji, X. Hong, Q. Tian, and Y. Gong, “Beyond universal person re-identification attack,” IEEE TIFS, vol. 16, 2021.
  • [23] H. Wang, G. Wang, Y. Li, D. Zhang, and L. Lin, “Transferable, controllable, and inconspicuous adversarial attacks on person re-identification with deep mis-ranking,” in CVPR, 2020, pp. 342–351.
  • [24] X. Li, J. Li, Y. Chen, S. Ye, Y. He, S. Wang, H. Su, and H. Xue, “Qair: Practical query-efficient black-box attacks for image retrieval,” in CVPR, 2021, pp. 3330–3339.
  • [25] S. Bai, Y. Li, Y. Zhou, Q. Li, and P. H. Torr, “Adversarial metric attack and defense for person re-identification,” IEEE TPAMI, vol. 43, no. 6, pp. 2119–2126, 2021.
  • [26] Z. Wang, S. Zheng, M. Song, Q. Wang, A. Rahimpour, and H. Qi, “advpattern: Physical-world attacks on deep person re-identification via adversarially transformable patterns,” in ICCV, 2019, pp. 8341–8350.
  • [27] Q. Bouniot, R. Audigier, and A. Loesch, “Vulnerability of person re-identification models to metric adversarial attacks,” in CVPR, 2020, pp. 794–795.
  • [28] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in ICML.   PMLR, 2017, pp. 1126–1135.
  • [29] Z. Yuan, J. Zhang, Y. Jia, C. Tan, T. Xue, and S. Shan, “Meta gradient adversarial attack,” in ICCV, 2021, pp. 7748–7757.
  • [30] W. Feng, B. Wu, T. Zhang, Y. Zhang, and Y. Zhang, “Meta-attack: Class-agnostic and model-agnostic physical adversarial attack,” in ICCV, 2021, pp. 7787–7796.
  • [31] G. Parascandolo, A. Neitz, A. Orvieto, L. Gresele, and B. Schölkopf, “Learning explanations that are hard to vary,” arXiv preprint arXiv:2009.00329, 2020.
  • [32] S. Shahtalebi, J.-C. Gagnon-Audet, T. Laleh, M. Faramarzi, K. Ahuja, and I. Rish, “Sand-mask: An enhanced gradient masking strategy for the discovery of invariances in domain generalization,” arXiv preprint arXiv:2106.02266, 2021.
  • [33] F. Wang and H. Liu, “Understanding the behaviour of contrastive loss,” in CVPR, 2021, pp. 2495–2504.
  • [34] C. Xiao, B. Li, J.-Y. Zhu, W. He, M. Liu, and D. Song, “Generating adversarial examples with adversarial networks,” arXiv preprint arXiv:1801.02610, 2018.
  • [35] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017, pp. 1125–1134.
  • [36] K. Zhou, “Model zoo,” https://kaiyangzhou.github.io/deep-person-reid/MODEL_ZOO.html.
  • [37] W. Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification,” in CVPR, 2018, pp. 2285–2294.
  • [38] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun, “Alignedreid: Surpassing human-level performance in person re-identification,” arXiv preprint arXiv:1711.08184, 2017.
  • [39] H. Huang, “Alignedreid,” https://github.com/michuanhaohao/AlignedReID.
  • [40] F. Herzog, X. Ji, T. Teepe, S. Hörmann, J. Gilg, and G. Rigoll, “Lightweight multi-branch network for person re-identification,” in ICIP.   IEEE, 2021, pp. 1129–1133.
  • [41] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in ECCV, 2018, pp. 480–496.
  • [42] H. Luo, “Pcb,” https://github.com/huanghoujing/beyond-part-models.
  • [43] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015, pp. 1116–1124.
  • [44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, 2017.
  • [45] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer gan to bridge domain gap for person re-identification,” in CVPR, 2018, pp. 79–88.
  • [46] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in ICCV, 2017.
  • [47] J. Li, R. Ji, H. Liu, X. Hong, Y. Gao, and Q. Tian, “Universal perturbation attack against image retrieval,” in ICCV, 2019.
  • [48] X. Sun and L. Zheng, “Dissecting person re-identification from the viewpoint of viewpoint,” in CVPR, 2019.
  • [49] Y. Ge, F. Zhu, D. Chen, R. Zhao et al., “Self-paced contrastive learning with hybrid memory for domain adaptive object re-id,” NeurIPS, vol. 33, pp. 11309–11321, 2020.