Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks
Abstract.
Despite the broad application of Machine Learning as a Service (MLaaS), deployed models are vulnerable to model stealing attacks. These attacks can replicate a model's functionality through black-box queries without any prior knowledge of the target victim model. Existing stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead attackers. However, these defenses suffer from high inference-time computational overheads and unfavorable trade-offs between benign accuracy and stealing robustness, which challenges their feasibility in practice. To address these problems, this paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses. Instead of deploying auxiliary defense modules that introduce redundant inference time, InI directly trains a defensive model by isolating the adversary's training gradient from the expected gradient, which effectively reduces the inference computational cost. In contrast to adding perturbations over model predictions that harm benign accuracy, we train models to produce uninformative outputs against stealing queries, which induces the adversary to extract little useful knowledge from victim models with minimal impact on benign performance. Extensive experiments on several visual classification datasets (e.g., MNIST and CIFAR-10) demonstrate the superior robustness (up to 48% reduction in stealing accuracy) and speed (up to 25.4× faster) of our InI over other state-of-the-art methods. Our code is available at https://github.com/DIG-Beihang/InI-Model-Stealing-Defense.
1. Introduction
Machine learning (ML) models, especially deep neural networks (DNNs), have been deployed across a wide range of areas, particularly in computer vision. Machine Learning as a Service (MLaaS) platforms have emerged to provide well-trained deep learning models to developers, since building a sophisticated ML model often requires substantial computational resources.

However, severe security issues arise when using such online platforms, since the knowledge of these ML models is exposed to the risk of being stolen. Extensive studies have revealed that the functionality of an ML model can be extracted by adversaries, even when they have no knowledge of training examples or model structures (Tramèr et al., 2016; Orekondy et al., 2019; Wallace et al., 2020; Kariyappa et al., 2021a; Yuan et al., 2022; Liang et al., 2022b). Adversaries can easily steal a deployed (victim) model by querying it with partial or surrogate input data, which is called a model stealing or model extraction attack. Model stealing attacks have raised increasing attention since they pose strong security threats: an attacker can acquire a functionally similar copy of the victim model to extract confidential data (Fredrikson et al., 2015) or even perform adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015; Liu et al., 2019, 2020b, 2020a; Wang et al., 2021; Li et al., 2022a; Liu et al., 2022, 2023a; Li et al., 2023c; Wei et al., 2018; Liang et al., 2020, 2022c, 2022a; Ma et al., 2022, 2021) via the substitute for the victim model (Papernot et al., 2017).
To mitigate the threat of model stealing attacks, several defensive methods (Lee et al., 2019; Orekondy et al., 2020; Mazeika et al., 2022; Lee et al., 2022) have been devoted to making the victim model hard to steal by introducing perturbations or randomness into the model output. However, the practical feasibility of these defenses is hampered by two limitations: (1) Existing defensive methods often incorporate auxiliary modules that validate the input or modify the output of the victim model, which introduces extra computational costs. In practice, the computational burden is concentrated mainly in the inference phase, and higher computational overheads mean longer user response times and increased financial expenditure. (2) Some defenses add perturbations to the model's output to enhance its resilience against model stealing attacks by providing erroneous predictions to attackers. However, this comes at the expense of reduced benign accuracy for legitimate users, an unfavorable trade-off.
To tackle these concerns, we propose InI, a novel and effective defensive training framework against model stealing attacks. Regarding computational overheads, distinct from prevailing methodologies that introduce auxiliary inference-time modules, InI directly trains a robust model that can defend against stealing attacks without extra inference modules. Based on the fact that DNNs are heavily over-parameterized (Zhang et al., 2017) and can be trained to fit and generalize across diverse data distributions, we posit that the victim model can achieve robustness by incorporating the countermeasures within its parameters during training. InI leverages a clone model during training as the surrogate adversary and estimates the adversary's optimization gradient and the expected gradient. With this estimation, the victim learns to adjust its posterior probabilities to maximize the directional divergence between the two gradients, thereby isolating the clone model's optimization gradient from the expected gradient. To ameliorate the trade-off between benign accuracy and stealing robustness, different from previous studies that add perturbations over all predictions and thus harm benign accuracy, we train models that learn to behave differently on benign and malicious queries. Following the assumption of previous works (Kariyappa and Qureshi, 2020; Kariyappa et al., 2021b) that malicious query samples deviate from the task distribution, InI trains a victim that behaves normally on the benign task yet produces inductive outputs on malicious samples. Specifically, we introduce an out-of-distribution (OOD) dataset during victim training and minimize the adversary's information gain on it. As a result, during inference, the adversary is induced to extract little useful knowledge from the victim model when its stealing queries consist of OOD examples. In addition, our method can be integrated with existing methods to obtain better defensive performance.
In summary, our main contributions are three-fold:
• We propose a novel and effective defensive training framework against model stealing attacks, called InI, which achieves robustness during training and provides a new perspective on model stealing defenses.
• To reduce computational overheads, InI incorporates the gradient isolation countermeasure within the victim's parameters; to ameliorate the unfavorable trade-off, InI produces distinct outputs on benign and malicious queries and induces the adversary to acquire minimal knowledge from the malicious ones.
• Extensive experiments on multiple datasets demonstrate state-of-the-art robustness and speed over other baselines. Moreover, InI shows flexible compatibility with existing methods for better defense.
2. Related work
2.1. Model Stealing Attacks
Model stealing attacks, also referred to as model extraction, aim at inferring hyper-parameters (Wang and Gong, 2018; Oh et al., 2019), extracting model parameters (Lowd and Meek, 2005; Tramèr et al., 2016; Milli et al., 2019), or copying the functionality of a machine learning model. Our work focuses on stealing the classification functionality of the model, which is the most prevalent and universal stealing attack in deep learning. Tramèr et al. (Tramèr et al., 2016) introduced the concept of model stealing, showing that attackers can "steal" the properties of a machine learning model through queries without prior knowledge of the victim model; in general, many properties can be stolen, e.g., model parameters, training data, or functionality. Papernot et al. (Papernot et al., 2017) proposed a partial-data approach named Jacobian-Based Dataset Augmentation (JBDA), which generates synthetic data by adding small perturbations to a small set of in-distribution samples. Orekondy et al. (Orekondy et al., 2019) proposed KnockoffNets, which employs samples from a surrogate dataset as query inputs to the victim model. The stealing performance of partial-data and surrogate-data approaches degrades when the available data differ from the original training set. In recent years, data-free stealing methods have been proposed. Motivated by the framework of data-free knowledge distillation (Hinton et al., 2015; Micaelli and Storkey, 2019), Kariyappa et al. (Kariyappa et al., 2021a) and Truong et al. (Truong et al., 2021) proposed data-free model stealing methods that use zeroth-order gradient estimation to estimate the victim's gradient in the black-box setting. Moreover, Sanyal et al. (Sanyal et al., 2022) proposed a model stealing attack in the hard-label setting; they utilize unrelated proxy data to pre-train a data generator, while the stealing process itself is data-free.
2.2. Model Stealing Defenses
Currently, most model stealing defenses add perturbations to the model outputs, thus disturbing the optimization of the adversary. Lee et al. (Lee et al., 2019) proposed an accuracy-preserving defense that adds deceptive perturbations to the model outputs while preserving the top-1 label, but it yields to hard-label stealing. Other defenses like Maximizing Angular Deviation (MAD) (Orekondy et al., 2020) perturb the model outputs with controllable intensity, defending against model stealing attacks at the expense of benign accuracy. Another line of defense exploits the data limitation of adversaries, making the victim model produce dissimilar outputs on in-distribution and out-of-distribution inputs. Kariyappa et al. (Kariyappa and Qureshi, 2020) proposed the Adaptive Misinformation (AM) defense, which detects OOD inputs and misleads adversaries with modified outputs. Kariyappa et al. (Kariyappa et al., 2021b) then proposed the Ensemble of Diverse Models (EDM) defense, which introduces randomness into the model output by using an ensemble of diverse models; the models in the ensemble are trained to behave diversely on OOD inputs, making the functionality of the model hard to steal. In addition, there are other types of countermeasures against model stealing, such as digital watermarking (Jia et al., 2021; Li et al., 2022b; Charette et al., 2022), which injects an extractable watermark into the victim model and can verify whether a suspect model was obtained by stealing.
Existing defensive approaches exploit certain limitations of model stealing attacks to mitigate knowledge leakage, but they suffer from high computational costs and unfavorable trade-offs. In this paper, we defend against stealing attacks by incorporating the countermeasure into the victim's parameters, inherently enhancing the robustness against model stealing attacks.
3. Threat Model
3.1. Attack Objective
In this paper, we mainly discuss functionality stealing of DNNs on image classification tasks. Specifically, an adversary aims to steal the functionality of a victim model $V$ by training a clone model $C$. These attacks usually follow the framework of knowledge distillation (Hinton et al., 2015), where the victim model $V$ plays the role of the "teacher" and its knowledge is distilled into the "student" model $C$. The objective of the adversary is to maximize the classification accuracy of the clone model $C$ on the victim's target distribution $\mathcal{D}_V$. Let $\theta_V$ be the parameters of $V$ and $\theta_C$ be the parameters of $C$; the adversary's goal can be formulated as Eqn. 1.
(1)   $\theta_C^{*} = \arg\max_{\theta_C} \ \mathbb{E}_{(x,y)\sim\mathcal{D}_V}\big[\operatorname{Acc}\big(C(x;\theta_C),\,y\big)\big]$
In most real-world settings, the adversary has no knowledge of the victim's structure, parameters, or training set. The only interaction between the adversary and the victim is the black-box query process: the adversary inputs an image, and the victim returns softmax probabilities or logits. Though the original training set is unavailable, the adversary can use synthetic or surrogate data to query the victim model. For example, JBDA (Papernot et al., 2017) synthesizes data from a small set of in-distribution samples, and KnockoffNets (Orekondy et al., 2019) uses surrogate datasets to query the victim model. Therefore, the adversary's learning objective is a surrogate goal based on the distribution $\mathcal{D}_A$ of the query dataset, which can be formulated as follows:
(2)   $\theta_C^{*} = \arg\min_{\theta_C} \ \mathbb{E}_{x\sim\mathcal{D}_A}\big[\mathcal{L}\big(C(x;\theta_C),\,V(x;\theta_V)\big)\big]$
In the ideal scenario, if $\mathcal{D}_A$ and $\mathcal{D}_V$ are close enough and the query budget is sufficient, model stealing is generally inevitable, since the victim must guarantee its performance for benign users on the target distribution. In practice, however, $\mathcal{D}_A$ and $\mathcal{D}_V$ are dissimilar due to the adversary's limited knowledge, and the query budget is limited by the adversary's financial cost. It is these limitations that existing defensive methods build upon.
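For concreteness, the following is a minimal PyTorch-style sketch of the soft-label stealing loop described by Eqn. 2, assuming a KnockoffNets-style loader that yields surrogate images (surrogate labels, if any, are ignored); the optimizer settings and function names are illustrative rather than the exact attack configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

def steal(victim, clone, surrogate_loader, epochs=10, lr=0.1, device="cuda"):
    """Black-box functionality stealing: query the victim on surrogate data
    and distill the returned probabilities into the clone model (Eqn. 2)."""
    victim, clone = victim.to(device).eval(), clone.to(device).train()
    opt = torch.optim.SGD(clone.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, _ in surrogate_loader:            # surrogate labels are unused
            x = x.to(device)
            with torch.no_grad():                # black-box query: outputs only
                teacher = F.softmax(victim(x), dim=1)
            # Soft cross-entropy between the clone's prediction and the
            # victim's returned probabilities.
            loss = -(teacher * F.log_softmax(clone(x), dim=1)).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clone
```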
3.2. Defense Objective
In defenses against model stealing, the defender aims to prevent the functionality of the victim model from being stolen with an acceptable impact on its benign accuracy. To be practical, the accuracy degradation relative to an undefended victim $V_0$ should be constrained by a threshold $T$. The objective of the defender is to minimize the classification accuracy of the clone model $C$ on the victim's target distribution $\mathcal{D}_V$, which can be formulated as Eqn. 3.
(3)   $\min_{\theta_V} \ \mathbb{E}_{(x,y)\sim\mathcal{D}_V}\big[\operatorname{Acc}\big(C(x;\theta_C^{*}),\,y\big)\big]$   s.t.   $\operatorname{Acc}(V;\mathcal{D}_V) \ge \operatorname{Acc}(V_0;\mathcal{D}_V) - T$

where $\theta_C^{*}$ is obtained by the adversary through Eqn. 2.
Considering the limitations of the adversary, existing defense methods either add adaptive perturbations to multiply the adversary’s query cost, or differentiate the victim’s behavior on ID and OOD data to deceive the adversary. In this paper, we propose a defensive method against model stealing attacks, which gets rid of the extra computational costs and ameliorates the trade-off between benign accuracy and stealing resistance.
4. Methodology

To build a defense with low computational costs and a favorable accuracy-robustness trade-off, we propose InI, a novel and effective defensive training framework. In this section, we first illustrate the training-time gradient isolation methodology that removes the need for auxiliary inference-time modules, then elaborate on the adversary induction approach that reduces knowledge leakage, and finally describe the overall training framework.
4.1. Gradient Isolation
Existing defensive methods suffer from extra computational costs during inference since they often employ auxiliary inference-time modules. Studies have revealed that DNNs are heavily over-parameterized (Zhang et al., 2017) and can be trained to fit and generalize across diverse data distributions, e.g., adversarial examples for adversarially-trained models (Goodfellow et al., 2015; Madry et al., 2018; Zhang et al., 2021; Liu et al., 2021, 2023b; Guo et al., 2023; Tang et al., 2021) and real-world disturbances for reinforcement learning agents (Pinto et al., 2017; Guo et al., 2022; Li et al., 2023b, a). Inspired by these findings, we propose a defensive training framework that directly trains a robust model, so that the model generates deceptive outputs towards stealing queries without extra modules.
Generally, as Eqn. 1 shows, the adversary's expected goal is to minimize the disagreement with the ground truth on the target distribution. This objective is not directly related to the victim's parameters $\theta_V$, but the adversary must extract knowledge from the victim; consequently, the real objective of the adversary is the one in Eqn. 2. A gap therefore exists between the expected goal and the real objective, and defense can be achieved by isolating the adversary's real objective from its expected goal. Since the adversary usually updates its parameters by gradient descent, we propose gradient isolation, which isolates the real gradient from the expected gradient, thereby incorporating robustness within the victim's parameters to mislead the adversary.
To achieve gradient isolation, we need to estimate the above two gradient terms during training. Therefore, we introduce a surrogate white-box clone model $C$ into the victim's training process to represent the stealing role of the adversary. To isolate the update gradient from the expected gradient, we maximize the directional divergence between them. Specifically, for a batch of data $x$ with the adversary's target denoted by $\hat{y}$, the update gradient can be written as:
(4)   $g(\hat{y}) = \nabla_{\theta_C}\,\mathcal{L}_{SCE}\big(C(x;\theta_C),\,\hat{y}\big) = J^{\top}\,\dfrac{\partial\,\mathcal{L}_{SCE}\big(C(x;\theta_C),\,\hat{y}\big)}{\partial\,C(x;\theta_C)}$
where $\mathcal{L}_{SCE}$ is the soft cross-entropy loss commonly used in model stealing attacks (Papernot et al., 2017; Orekondy et al., 2019), and $J = \partial C(x;\theta_C)/\partial\theta_C$ is a Jacobian matrix. When $\hat{y}$ comes from the ground truth $y$, $g(y)$ represents the correct (expected) optimization direction; when $\hat{y}$ comes from the victim model (i.e., $\hat{y} = V(x;\theta_V)$), $g(V(x;\theta_V))$ represents the actual optimization direction during stealing.
The goal of gradient isolation is to maximize the directional divergence of these gradients, which can be quantified by the cosine similarity. Therefore, the objective of gradient isolation can be written as follows:
(5)   $\mathcal{L}_{iso} = \cos\big(g(y),\ g(V(x;\theta_V))\big)$
where $\cos(\cdot,\cdot)$ denotes the cosine similarity. Both $g(y)$ and $g(V(x;\theta_V))$ can be computed by backward propagation. During victim training, InI minimizes $\mathcal{L}_{iso}$ with respect to $\theta_V$ to perform gradient isolation, enhancing the victim's robustness against model stealing attacks.
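The isolation objective can be implemented with standard automatic differentiation. Below is a minimal PyTorch-style sketch of Eqn. 5, assuming a white-box surrogate clone: it computes the clone's expected gradient (towards the ground truth) and its actual stealing gradient (towards the victim's outputs), and returns their cosine similarity, which remains differentiable with respect to the victim's parameters. Function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    # Soft cross-entropy loss used by distillation-style stealing attacks.
    return -(target_probs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def isolation_loss(victim, clone, x, y, num_classes):
    """Cosine similarity between the clone's stealing gradient and its
    expected (ground-truth) gradient; minimized w.r.t. the victim (Eqn. 5)."""
    clone_params = [p for p in clone.parameters() if p.requires_grad]
    clone_logits = clone(x)

    # Expected gradient g(y): the clone optimizes towards the ground truth.
    y_onehot = F.one_hot(y, num_classes).float()
    g_true = torch.autograd.grad(
        soft_cross_entropy(clone_logits, y_onehot), clone_params,
        create_graph=True, retain_graph=True)

    # Actual gradient g(V(x)): the clone optimizes towards the victim's output,
    # so this term is differentiable w.r.t. the victim's parameters.
    victim_probs = F.softmax(victim(x), dim=1)
    g_steal = torch.autograd.grad(
        soft_cross_entropy(clone_logits, victim_probs), clone_params,
        create_graph=True, retain_graph=True)

    g_true = torch.cat([g.flatten() for g in g_true])
    g_steal = torch.cat([g.flatten() for g in g_steal])
    return F.cosine_similarity(g_true, g_steal, dim=0)
```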
4.2. Adversary Induction
Some existing defenses add perturbations to the outputs of all samples, which harms benign accuracy and results in a poor trade-off. To improve the trade-off between clean accuracy and stealing robustness, we further design an adversary induction approach for training victim models. The victim model thus behaves normally for benign users but produces inductive outputs that lead the adversary to optimize without learning much useful knowledge.
Successfully inducing the adversary rests on the assumption that benign and malicious queries can be distinguished by their distributions (Kariyappa and Qureshi, 2020; Kariyappa et al., 2021b). In practice, the adversary has limited knowledge of the distribution of the victim's training set $\mathcal{D}_V$ and consequently uses a surrogate dataset to query and steal the victim model. Following this assumption, we regard the adversary's query samples as out-of-distribution and use an OOD dataset as their substitute during defense. The over-parameterization of DNNs ensures generalization across diverse distributions and thus allows us to obtain a victim that behaves divergently on benign in-distribution (ID) queries and malicious out-of-distribution (OOD) queries. Thus, our InI aims to preserve the victim's benign performance on ID queries while inducing the adversary to gain little knowledge from uninformative outputs on OOD queries.
In particular, on ID samples, we should guarantee the victim’s benign performance. This can be achieved by applying a cross-entropy loss to train the classification model, which is shown as follows:
(6)   $\mathcal{L}_{ben} = \mathbb{E}_{(x,y)\sim\mathcal{D}_V}\big[\mathcal{L}_{CE}\big(V(x;\theta_V),\,y\big)\big]$
On OOD samples, we should induce the adversary to acquire minimal knowledge. Intuitively, the adversary induction can be realized by reducing the adversary’s information gain on OOD queries. The information gain can be quantified by the KL divergence between the output probabilities of the clone and the victim model, which can be formulated as below:
(7)   $\mathcal{L}_{KL} = \mathbb{E}_{x'\sim\mathcal{D}_{OOD}}\big[D_{KL}\big(V(x';\theta_V)\,\|\,C(x';\theta_C)\big)\big]$
Note that $\mathcal{L}_{KL}$ is essentially the optimization objective of the adversary during model stealing. Therefore, if $\mathcal{L}_{KL}$ is already minimized before the stealing process, the adversary can only gain little information via optimization and thus extracts little knowledge from the victim model. Since the adversary typically uses gradient descent to update its parameters and attain knowledge, minimizing the first-order approximation of the KL divergence also helps reduce the information gain. Consequently, we additionally minimize the norm of $\nabla_{\theta_C}\mathcal{L}_{KL}$, i.e., the gradient of the information gain. In summary, the objective of adversary induction can be formulated as:
(8)   $\mathcal{L}_{ind} = \mathcal{L}_{KL} + \big\|\nabla_{\theta_C}\mathcal{L}_{KL}\big\|_2$
$\mathcal{L}_{ind}$ is a function of the victim's parameters $\theta_V$, so it can be integrated into our training framework. During the victim's training, we apply a surrogate white-box clone model and an OOD dataset to estimate the adversary's information gain, and minimize $\mathcal{L}_{ind}$ by updating the victim's parameters $\theta_V$, thus reducing the knowledge leakage from the victim. The OOD dataset used in training comes from another classification task and is different from the query dataset used in stealing attacks.
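A minimal PyTorch-style sketch of the induction objective as formulated in Eqn. 8 is given below: it measures the adversary's information gain on an OOD batch as the KL divergence between the victim's and the clone's outputs and adds the norm of its gradient with respect to the clone's parameters. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def induction_loss(victim, clone, x_ood):
    """Adversary's information gain on OOD queries plus the norm of its
    gradient w.r.t. the clone's parameters (first-order term, Eqn. 8)."""
    victim_probs = F.softmax(victim(x_ood), dim=1)
    clone_log_probs = F.log_softmax(clone(x_ood), dim=1)
    # Information gain of the adversary: KL(victim || clone).
    info_gain = F.kl_div(clone_log_probs, victim_probs, reduction="batchmean")

    clone_params = [p for p in clone.parameters() if p.requires_grad]
    grads = torch.autograd.grad(info_gain, clone_params, create_graph=True)
    grad_norm = torch.cat([g.flatten() for g in grads]).norm(p=2)
    return info_gain + grad_norm
```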
Though previous defenses (Kariyappa and Qureshi, 2020; Kariyappa et al., 2021b) also utilize OOD datasets to train the victim, they only constrain the victim to produce meaningless or diverse outputs and do not take the adversary's role into consideration. In contrast, InI produces inductive outputs that lead the adversary to learn less, reducing the knowledge leakage from the victim.
4.3. Overall Framework
Figure 2 illustrates our overall framework. To train a robust victim $V$, we introduce a surrogate clone model $C$ into the training process, to which the victim has white-box access. To jettison auxiliary modules and keep computational costs low, InI isolates the adversary's optimization gradient from the expected gradient during training via $\mathcal{L}_{iso}$ in Eqn. 5. To further improve the trade-off between benign performance and stealing robustness, InI produces uninformative outputs on malicious queries through $\mathcal{L}_{ind}$ in Eqn. 8, inducing the adversary to acquire less useful knowledge. By incorporating the robustness within the victim's parameters, the victim learns to resist model stealing attacks with a favorable trade-off and without any auxiliary modules during inference.
During training, all objectives are calculated and updated simultaneously. We use hyper-parameters $\alpha$ and $\beta$ to control the trade-off between the losses, and the total loss can be written as:
(9)   $\mathcal{L}_{total} = \mathcal{L}_{ben} + \alpha\,\mathcal{L}_{iso} + \beta\,\mathcal{L}_{ind}$

However, there may exist conflicts between the optimization directions of the above losses during training. As shown by the blue bars in Figure 3, we extract the gradients during the first 3 epochs of training and compute their pairwise cosine similarities, and we observe conflicts between the different objectives. To mitigate these conflicts, we leverage PCGrad (Yu et al., 2020). Before the gradient descent update, PCGrad detects conflicts among the gradients and projects a conflicting gradient onto the orthogonal direction of the other as follows:
(10)   $g_i \leftarrow g_i - \dfrac{\langle g_i, g_j\rangle}{\|g_j\|^2}\,g_j, \quad \text{if } \langle g_i, g_j\rangle < 0$
After PCGrad stabilization, the conflicts are better mitigated and the optimization process is improved (see the orange bars in Figure 3). The overall pseudo-algorithm of our InI can be found in Supplementary Material.
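A minimal sketch of the PCGrad projection in Eqn. 10, applied to flattened per-objective gradients, is shown below; it follows the procedure of Yu et al. (2020), and the function name is illustrative.

```python
import random
import torch

def pcgrad(grads):
    """Project each objective's gradient onto the normal plane of any gradient
    it conflicts with (negative inner product), then sum the results."""
    projected = [g.clone() for g in grads]
    for i, g_i in enumerate(projected):
        order = list(range(len(grads)))
        random.shuffle(order)                  # random order, as in PCGrad
        for j in order:
            if i == j:
                continue
            g_j = grads[j]
            dot = torch.dot(g_i, g_j)
            if dot < 0:                        # conflict: remove component along g_j
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    return torch.stack(projected).sum(dim=0)   # combined update direction
```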
5. Experiments
In this section, we first elaborate on the experimental settings; then, we illustrate the defense performance and inference speed analysis on image classification tasks; we finally provide ablation studies. More results such as feature visualization analysis of our defense are provided in Supplementary Materials.
5.1. Experimental Setup
In this part, we elaborate on our experimental settings about datasets, model architectures, defenses, attacks, and evaluation metrics.
Datasets and architectures. We evaluate our proposed InI on the most commonly-adopted image classification datasets for model stealing, including MNIST (LeCun et al., 1998), FashionMNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky et al., 2009), and CIFAR-100 (Krizhevsky et al., 2009). We choose ResNet-18 (He et al., 2016) as the backbone of all victim models. We also evaluate VGG networks (Simonyan and Zisserman, 2014), which yield similar observations (cf. Supplementary Materials).
Implementation details. For the training of InI, we use an SGD optimizer with momentum 0.5 and weight decay. For the MNIST and FashionMNIST datasets, we train for 50 epochs and anneal the learning rate by 0.1 every 20 epochs; for the CIFAR-10 and CIFAR-100 datasets, we train for 150 epochs and anneal the learning rate by 0.1 every 50 epochs. The initial learning rate is 0.1.
Defenses. To demonstrate the effectiveness of InI, we compare our method with commonly-adopted defensive approaches: MAD (Orekondy et al., 2020), AM (Kariyappa and Qureshi, 2020), and EDM (Kariyappa et al., 2021b). We also report the results of an undefended model, denoted by "Vanilla". We refer to the official implementations of these methods. The batch size of all defensive methods is 128. For the auxiliary OOD datasets used by AM, EDM, and InI, we choose KMNIST (Clanuwat et al., 2018) for MNIST and FashionMNIST, and TinyImageNet (Le and Yang, 2015) for CIFAR-10 and CIFAR-100. For the hash dataset used by EDM, we choose KMNIST for MNIST and FashionMNIST, and SVHN (Netzer et al., 2011) for CIFAR-10 and CIFAR-100.
Attacks. Following previous works (Orekondy et al., 2020; Kariyappa and Qureshi, 2020; Kariyappa et al., 2021b), we evaluate the performance of InI against the commonly-used stealing attacks KnockoffNets (Orekondy et al., 2019) and JBDA (Papernot et al., 2017), and we also evaluate the integration of InI with other defensive methods. For each attack, we conduct soft-label and hard-label attacks, meaning the adversary learns from the victim's output probabilities or from the top-1 label, respectively. The detailed settings of the attacks are listed as follows:
• KnockoffNets: The adversary queries the victim with samples drawn from a surrogate dataset and trains the clone model on the returned outputs.
• JBDA: We choose 150 images from the victim's training set as the seed samples. We use 6 rounds of Jacobian-based augmentation (sketched below) and a noise rate of 0.1 to synthesize the query data. The clone model is trained for 10 epochs in each augmentation round.
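For reference, a minimal sketch of one Jacobian-based augmentation round (Papernot et al., 2017) is given below, assuming the standard sign-of-Jacobian perturbation on the class assigned by the victim oracle; the noise_rate argument corresponds to the noise rate of 0.1 above, and the function name is illustrative.

```python
import torch

def jbda_augment(clone, victim, seeds, noise_rate=0.1):
    """One JBDA round: perturb each seed along the sign of the clone's Jacobian
    at the class label returned by the (black-box) victim oracle."""
    seeds = seeds.clone().detach().requires_grad_(True)
    with torch.no_grad():
        oracle_labels = victim(seeds).argmax(dim=1)      # black-box label query
    selected = clone(seeds).gather(1, oracle_labels.unsqueeze(1)).sum()
    grad = torch.autograd.grad(selected, seeds)[0]
    synthetic = (seeds + noise_rate * grad.sign()).detach()
    return torch.cat([seeds.detach(), synthetic], dim=0)  # doubled query set
```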
Evaluation metrics. Following (Kariyappa et al., 2021b), we evaluate the performance of defenses by comparing the clone accuracy achieved by attacks on the victim's test set. To take the victim's benign accuracy into consideration, we further compare the relative performance of the defenses, i.e., the ratio of the adversary's clone accuracy to the victim's benign accuracy. For methods with mutable parameters during inference (i.e., MAD and AM), we follow the setting in (Kariyappa et al., 2021b) and adjust the parameters of the defense so that all defenses have similar benign accuracy on the test set. The benign accuracy of these defenses on the classification tasks is shown in Table 1. For all the above metrics, lower values indicate better defenses.
Table 1. Benign accuracy (%) of each defense.

| Defense | MNIST | FashionMNIST | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| Vanilla | 99.46 | 93.89 | 94.71 | 76.63 |
| MAD | 99.46 | 93.87 | 94.31 | 75.44 |
| AM | 99.41 | 93.67 | 94.30 | 75.00 |
| EDM | 99.43 | 93.70 | 94.35 | 75.38 |
| InI (Ours) | 99.40 | 93.36 | 94.32 | 75.50 |
Table 2. Clone accuracy (%) and relative performance (×) of each defense against KnockoffNets attacks.

| Defense | MNIST soft-label | MNIST hard-label | FashionMNIST soft-label | FashionMNIST hard-label | CIFAR-10 soft-label | CIFAR-10 hard-label | CIFAR-100 soft-label | CIFAR-100 hard-label |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 99.39 (1.00×) | 98.84 (0.99×) | 71.58 (0.76×) | 57.95 (0.62×) | 79.55 (0.84×) | 69.60 (0.73×) | 50.89 (0.66×) | 27.44 (0.36×) |
| MAD | 99.31 (1.00×) | 99.05 (1.00×) | 68.84 (0.73×) | 44.61 (0.48×) | 70.31 (0.75×) | 65.07 (0.69×) | 37.36 (0.50×) | 18.58 (0.25×) |
| AM | 98.58 (0.99×) | 97.14 (0.98×) | 20.77 (0.22×) | 14.23 (0.15×) | 75.32 (0.80×) | 63.08 (0.67×) | 24.07 (0.32×) | 15.99 (0.21×) |
| EDM | 98.90 (0.99×) | 97.44 (0.98×) | 21.42 (0.23×) | 15.90 (0.17×) | 72.30 (0.77×) | 62.31 (0.66×) | 43.78 (0.58×) | 20.52 (0.27×) |
| InI (Ours) | 89.02 (0.90×) | 95.90 (0.96×) | 20.12 (0.22×) | 10.82 (0.12×) | 69.54 (0.74×) | 60.33 (0.64×) | 9.71 (0.13×) | 22.01 (0.29×) |
| InI + MAD | 88.09 (0.89×) | 92.50 (0.93×) | 20.18 (0.22×) | 10.77 (0.12×) | 67.45 (0.72×) | 60.25 (0.64×) | 9.37 (0.12×) | 13.06 (0.17×) |
| InI + AM | 88.22 (0.89×) | 94.12 (0.95×) | 15.01 (0.16×) | 10.27 (0.11×) | 65.80 (0.70×) | 58.35 (0.62×) | 9.36 (0.13×) | 12.47 (0.17×) |
Table 3. Clone accuracy (%) and relative performance (×) of each defense against JBDA attacks.

| Defense | MNIST soft-label | MNIST hard-label | FashionMNIST soft-label | FashionMNIST hard-label | CIFAR-10 soft-label | CIFAR-10 hard-label | CIFAR-100 soft-label | CIFAR-100 hard-label |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 73.00 (0.73×) | 72.77 (0.73×) | 71.09 (0.76×) | 67.80 (0.72×) | 26.19 (0.28×) | 25.59 (0.27×) | 4.82 (0.06×) | 4.09 (0.05×) |
| MAD | 61.69 (0.62×) | 72.81 (0.73×) | 57.70 (0.61×) | 66.46 (0.71×) | 18.73 (0.20×) | 24.89 (0.26×) | 2.44 (0.03×) | 3.90 (0.05×) |
| AM | 81.23 (0.82×) | 73.17 (0.74×) | 67.73 (0.72×) | 66.28 (0.71×) | 24.33 (0.26×) | 25.12 (0.27×) | 4.36 (0.06×) | 3.29 (0.04×) |
| EDM | 79.34 (0.80×) | 78.72 (0.79×) | 70.08 (0.75×) | 68.86 (0.73×) | 25.86 (0.27×) | 25.71 (0.27×) | 3.35 (0.04×) | 3.04 (0.04×) |
| InI (Ours) | 57.94 (0.58×) | 66.19 (0.67×) | 70.81 (0.76×) | 64.99 (0.70×) | 24.16 (0.26×) | 24.14 (0.26×) | 3.81 (0.05×) | 2.61 (0.03×) |
| InI + MAD | 55.99 (0.56×) | 66.09 (0.66×) | 65.63 (0.70×) | 64.61 (0.69×) | 23.04 (0.24×) | 24.04 (0.25×) | 3.51 (0.05×) | 2.61 (0.03×) |
| InI + AM | 56.42 (0.57×) | 62.35 (0.63×) | 68.14 (0.73×) | 63.72 (0.68×) | 22.68 (0.24×) | 22.43 (0.24×) | 3.33 (0.04×) | 2.45 (0.03×) |
5.2. Defense Results on Stealing Attacks
In this part, we compare the performance of our InI with other model stealing defenses. We report the clone accuracy and the relative performance of existing defenses and InI in Tables 2 and 3. To further improve the defensive performance, we integrate the model trained by InI with MAD and AM and evaluate the combinations. AM requires jointly training the victim; for simplicity, we do not retrain our model in their framework but instead load the parameters trained by InI as the victim's parameters.
KnockoffNets attacks. Table 2 presents the defense results against KnockoffNets attacks. We can draw some observations listed below:
• By inducing and isolating the adversary, InI achieves the best defense performance at similar benign accuracy on most of the results, which demonstrates its better trade-off.
• Notably, on the CIFAR-100 dataset, InI achieves extraordinary defense performance against the soft-label attack but behaves relatively poorly against the hard-label attack.
JBDA attacks. Table 3 shows the defense performance against JBDA attacks. Different from KnockoffNets, which applies surrogate datasets, JBDA uses a set of seed examples from the victim's training set, and thus the results differ from those of KnockoffNets. From the results, we can observe that:
• The OOD-based defenses (AM, EDM, and InI) perform poorly against soft-label JBDA. This is mainly because the JBDA samples are synthesized from seed examples in the training set and are therefore close to the target distribution.
• Instead, MAD, the perturbation-based method, achieves the best performance against the soft-label attack on the FashionMNIST, CIFAR-10, and CIFAR-100 datasets. However, its performance drops rapidly under hard-label attacks, as it rarely modifies the top-1 label of the output.
• JBDA achieves good stealing performance on simpler datasets such as MNIST and FashionMNIST, but performs poorly on more complex datasets such as CIFAR-10 and CIFAR-100. The JBDA stealing results on CIFAR-100 (around 0.05× for all defenses) are practically unusable.
Integration with other defenses. We also provide the experimental results of integrating our InI with MAD and AM in Tables 2 and 3. We can draw the following observations:
• Integrating InI with MAD and AM further improves the trade-off: the combinations achieve stronger defense at unchanged benign accuracy compared to the original MAD and AM against all attacks. Additionally, integrating InI with AM mostly yields more robustness than integrating it with MAD.
• On soft-label JBDA attacks, although the integration improves over Vanilla and plain InI, it still cannot surpass the performance of MAD at similar benign accuracy. We attribute this to the robustness trade-off between soft-label and hard-label attacks: InI has traded benign accuracy for robustness against hard-label attacks during training, so similar soft-label robustness can only be achieved at the cost of more benign performance.
5.3. Inference Speed Analysis
In this section, we provide the inference speed analysis of each defensive method and evaluate our analysis through experiments. In practice, the model for MLaaS would only train once but would receive millions of inference queries from users, which the service provider would charge for. Therefore, boosting the inference speed has a significant impact on the practical use of defensive methods. Evaluations and discussions about training-time speed are provided in Supplementary Materials.
We first provide some analyses of the inference process of each method. As shown in Table 4, our InI achieves the fastest inference speed among all defensive methods. In contrast, existing methods employ auxiliary modules during inference, which introduce extra computational operations or even harm the DNN's parallelism. Detailed analyses are given below.
Time cost of Vanilla and InI. We denote the time cost of a single forward pass of the victim's backbone as $t_f$. The time cost of a model without defense is $t_f$ per query, and the same holds for InI since it employs no extra modules.
Time cost of MAD. MAD computes a Jacobian matrix $G$ to maximize the angular deviation between the perturbed gradient and the original gradient. The official code needs $K$ backward passes, which cost $K t_b$ ($K$ is the number of classes of the target classification task and $t_b$ is the cost of one backward pass), to calculate the Jacobian matrix. After obtaining $G$, MAD heuristically searches for a perturbed probability, which costs $t_s$. When the victim receives a batch of data with a mini-batch size of $B$, the backward passes and the heuristic search cost $B K t_b$ and $B t_s$, respectively, as these operations cannot be performed in parallel.
Time cost of AM. AM employs an adaptive misinformation module to perturb the victim's output probability. The adaptive misinformation is generated by a DNN with the same architecture as the victim's backbone. Consequently, the time cost of AM mainly comes from the forward passes of the victim's backbone and the misinformation model, which cost around $2t_f$ in total.
Table 4. Inference time cost of each defense for a single query and a batched query of size $B$.

| Defense | Single Query | Batch Query |
|---|---|---|
| Vanilla / InI | $t_f$ | $t_f$ |
| MAD | $t_f + K t_b + t_s$ | $t_f + B K t_b + B t_s$ |
| AM | $2t_f$ | $2t_f$ |
| EDM | $t_f + t_h$ | $n t_f + t_h$ |

Time cost of EDM. EDM jointly trains an ensemble of $n$ victim models and selects a result from them according to a hash function. The hash function of EDM is a DNN with a simpler architecture, whose forward pass costs $t_h$. For a single query, the time cost of EDM should be $t_f + t_h$. However, in the official implementation of EDM, the time cost increases to $n t_f + t_h$ when EDM receives a batched query. The reason is that the forward pass of a single model over a batch of data can run in parallel, but the forward passes of different models cannot. Therefore, all models in the ensemble are accessed, and the time cost increases.
Empirical evaluation. We perform empirical experiments to evaluate the speed of each defense. The experiment is conducted on an RTX 3080 GPU, and the architecture of the victim's backbone is ResNet-18. We randomly generate input images, execute the inference, and record the time cost. For fair comparison, we repeat the process 2000 times and record the total time. As shown in Figure 4, other defenses consume significantly more time during inference than our method (1.98×–25.4×).
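The measurement loop is essentially the following sketch (the batch size, input resolution, and warm-up count shown here are illustrative assumptions, not necessarily the exact values used in Figure 4):

```python
import time
import torch
import torchvision

def benchmark(model, batch_size=128, trials=2000, device="cuda"):
    """Total inference time of `model` over repeated random-input queries."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 32, 32, device=device)  # CIFAR-sized inputs
    with torch.no_grad():
        for _ in range(10):                  # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(trials):
            model(x)
        torch.cuda.synchronize()
    return time.time() - start

elapsed = benchmark(torchvision.models.resnet18(num_classes=10))
print(f"total inference time over 2000 batched queries: {elapsed:.2f}s")
```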
5.4. Ablation Studies
Table 5. Ablation study on CIFAR-10: clone accuracy (%) and relative performance (×).

| Defense | KnockoffNets soft-label | KnockoffNets hard-label | JBDA soft-label | JBDA hard-label |
|---|---|---|---|---|
| Vanilla | 79.55 (0.84×) | 69.60 (0.73×) | 26.19 (0.28×) | 25.59 (0.27×) |
| $\mathcal{L}_{iso}$ only | 75.21 (0.79×) | 67.48 (0.71×) | 25.30 (0.27×) | 25.37 (0.27×) |
| $\mathcal{L}_{ind}$ only | 78.35 (0.83×) | 65.92 (0.70×) | 24.79 (0.26×) | 24.76 (0.26×) |
| InI w/o PCGrad | 74.59 (0.79×) | 64.73 (0.68×) | 24.55 (0.26×) | 25.15 (0.27×) |
| $\mathcal{L}_{iso}$ + $\mathcal{L}_{ind}$ (InI) | 69.54 (0.74×) | 60.33 (0.64×) | 24.16 (0.26×) | 24.14 (0.26×) |
We then conduct ablation studies to understand the contributions of gradient isolation and adversary induction. Specifically, we train the victim with (1) no extra defense (denoted by "Vanilla"); (2) only the isolation loss $\mathcal{L}_{iso}$; (3) only the induction loss $\mathcal{L}_{ind}$; (4) InI without PCGrad; and (5) the full InI ($\mathcal{L}_{iso}$ + $\mathcal{L}_{ind}$). We record the defense performance against KnockoffNets and JBDA on the CIFAR-10 dataset. As shown in Table 5, we can draw several observations:
• The model trained with $\mathcal{L}_{iso}$ shows an apparent drop in stealing performance (e.g., 4.34 points on soft-label KnockoffNets), which indicates that gradient isolation brings robustness against model stealing.
• The model trained with $\mathcal{L}_{ind}$ shows limited defense against model stealing attacks, as it only induces the adversary to learn less instead of directly misleading the adversary.
• With the two methods combined, InI achieves the best defense performance (69.54 / 60.33 on KnockoffNets and 24.16 / 24.14 on JBDA), implying that the cooperation of gradient isolation and adversary induction plays an important role during training.
6. Conclusion
Model stealing attacks have become a rising challenge to the privacy and intellectual property of machine learning models. Existing defensive methods suffer from additional computational costs and unfavorable trade-offs, which impede their practical deployment. To address this concern, we propose a novel and efficient training framework named InI. InI embeds the countermeasures within the victim's parameters, isolating the adversary's gradient from the expected gradient to achieve robustness without incurring extra computational overheads. InI further leverages the OOD assumption and induces the adversary to acquire minimal knowledge, thereby improving the trade-off. In our evaluations, InI surpasses existing defensive methods in terms of speed and robustness, and its integration with prior defenses makes it even more practical. We hope our proposed method can provide a new perspective on defense strategies against model stealing attacks.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (62022009 and 62206009), the Fundamental Research Funds for the Central Universities, and the State Key Laboratory of Software Development Environment.
References
- Charette et al. (2022) Laurent Charette, Lingyang Chu, Yizhou Chen, Jian Pei, Lanjun Wang, and Yong Zhang. 2022. Cosine Model Watermarking Against Ensemble Distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 9512–9520.
- Clanuwat et al. (2018) Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718 (2018).
- Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. 2017. EMNIST: Extending MNIST to handwritten letters. In 2017 international joint conference on neural networks (IJCNN). IEEE, 2921–2926.
- Fredrikson et al. (2015) Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. 1322–1333.
- Goodfellow et al. (2015) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- Guo et al. (2023) Jun Guo, Wei Bao, Jiakai Wang, Yuqing Ma, Xinghai Gao, Gang Xiao, Aishan Liu, Jian Dong, Xianglong Liu, and Wenjun Wu. 2023. A Comprehensive Evaluation Framework for Deep Model Robustness. Pattern Recognition (2023).
- Guo et al. (2022) Jun Guo, Yonghong Chen, Yihang Hao, Zixin Yin, Yin Yu, and Simin Li. 2022. Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 115–122.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- Jia et al. (2021) Hengrui Jia, Christopher A Choquette-Choo, Varun Chandrasekaran, and Nicolas Papernot. 2021. Entangled Watermarks as a Defense against Model Extraction.. In USENIX Security Symposium. 1937–1954.
- Kariyappa et al. (2021a) Sanjay Kariyappa, Atul Prakash, and Moinuddin K Qureshi. 2021a. Maze: Data-free model stealing attack using zeroth-order gradient estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13814–13823.
- Kariyappa et al. (2021b) Sanjay Kariyappa, Atul Prakash, and Moinuddin K Qureshi. 2021b. Protecting dnns from theft using an ensemble of diverse models. In International Conference on Learning Representations.
- Kariyappa and Qureshi (2020) Sanjay Kariyappa and Moinuddin K Qureshi. 2020. Defending against model stealing attacks with adaptive misinformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 770–778.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
- Le and Yang (2015) Ya Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
- Lee et al. (2022) Jeonghyun Lee, Sungmin Han, and Sangkyun Lee. 2022. Model Stealing Defense against Exploiting Information Leak through the Interpretation of Deep Neural Nets. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI).
- Lee et al. (2019) Taesung Lee, Benjamin Edwards, Ian Molloy, and Dong Su. 2019. Defending against neural network model stealing attacks using deceptive perturbations. In 2019 IEEE Security and Privacy Workshops (SPW). IEEE, 43–49.
- Li et al. (2023a) Simin Li, Jun Guo, Jingqiao Xiu, Pu Feng, Xin Yu, Jiakai Wang, Aishan Liu, Wenjun Wu, and Xianglong Liu. 2023a. Attacking Cooperative Multi-Agent Reinforcement Learning by Adversarial Minority Influence. arXiv preprint arXiv:2302.03322 (2023).
- Li et al. (2023b) Simin Li, Jun Guo, Jingqiao Xiu, Xini Yu, Jiakai Wang, Aishan Liu, Yaodong Yang, and Xianglong Liu. 2023b. Byzantine Robust Cooperative Multi-Agent Reinforcement Learning as a Bayesian Game. arXiv preprint arXiv:2305.12872 (2023).
- Li et al. (2022a) Simin Li, Huangxinxin Xu, Jiakai Wang, Aishan Liu, Fazhi He, Xianglong Liu, and Dacheng Tao. 2022a. Hierarchical Perceptual Noise Injection for Social Media Fingerprint Privacy Protection. arXiv preprint arXiv:2208.10688 (2022).
- Li et al. (2023c) Simin Li, Shuning Zhang, Gujun Chen, Dong Wang, Pu Feng, Jiakai Wang, Aishan Liu, Xin Yi, and Xianglong Liu. 2023c. Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12324–12333.
- Li et al. (2022b) Yiming Li, Linghui Zhu, Xiaojun Jia, Yong Jiang, Shu-Tao Xia, and Xiaochun Cao. 2022b. Defending against model stealing via verifying embedded external features. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 1464–1472.
- Liang et al. (2022a) Siyuan Liang, Longkang Li, Yanbo Fan, Xiaojun Jia, Jingzhi Li, Baoyuan Wu, and Xiaochun Cao. 2022a. A Large-Scale Multiple-objective Method for Black-box Attack Against Object Detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. Springer, 619–636.
- Liang et al. (2022b) Siyuan Liang, Aishan Liu, Jiawei Liang, Longkang Li, Yang Bai, and Xiaochun Cao. 2022b. Imitated Detectors: Stealing Knowledge of Black-box Object Detectors. In Proceedings of the 30th ACM International Conference on Multimedia. 4839–4847.
- Liang et al. (2020) Siyuan Liang, Xingxing Wei, Siyuan Yao, and Xiaochun Cao. 2020. Efficient adversarial attacks for visual object tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16. Springer, 34–50.
- Liang et al. (2022c) Siyuan Liang, Baoyuan Wu, Yanbo Fan, Xingxing Wei, and Xiaochun Cao. 2022c. Parallel rectangle flip attack: A query-based black-box attack against object detection. arXiv preprint arXiv:2201.08970 (2022).
- Liu et al. (2023a) Aishan Liu, Jun Guo, Jiakai Wang, Siyuan Liang, Renshuai Tao, Wenbo Zhou, Cong Liu, Xianglong Liu, and Dacheng Tao. 2023a. X-Adv: Physical Adversarial Object Attacks against X-ray Prohibited Item Detection. ArXiv (2023).
- Liu et al. (2020a) Aishan Liu, Tairan Huang, Xianglong Liu, Yitao Xu, Yuqing Ma, Xinyun Chen, Stephen J Maybank, and Dacheng Tao. 2020a. Spatiotemporal attacks for embodied agents. In ECCV.
- Liu et al. (2019) Aishan Liu, Xianglong Liu, Jiaxin Fan, Yuqing Ma, Anlan Zhang, Huiyuan Xie, and Dacheng Tao. 2019. Perceptual-sensitive gan for generating adversarial patches. In AAAI.
- Liu et al. (2021) Aishan Liu, Xianglong Liu, Hang Yu, Chongzhi Zhang, Qiang Liu, and Dacheng Tao. 2021. Training robust deep neural networks via adversarial noise propagation. TIP (2021).
- Liu et al. (2023b) Aishan Liu, Shiyu Tang, Siyuan Liang, Ruihao Gong, Boxi Wu, Xianglong Liu, and Dacheng Tao. 2023b. Exploring the Relationship between Architecture and Adversarially Robust Generalization. In CVPR.
- Liu et al. (2020b) Aishan Liu, Jiakai Wang, Xianglong Liu, Bowen Cao, Chongzhi Zhang, and Hang Yu. 2020b. Bias-based universal adversarial patch attack for automatic check-out. In ECCV.
- Liu et al. (2022) Shunchang Liu, Jiakai Wang, Aishan Liu, Yingwei Li, Yijie Gao, Xianglong Liu, and Dacheng Tao. 2022. Harnessing Perceptual Adversarial Patches for Crowd Counting. In ACM CCS.
- Lowd and Meek (2005) Daniel Lowd and Christopher Meek. 2005. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 641–647.
- Ma et al. (2021) Ke Ma, Qianqian Xu, Jinshan Zeng, Xiaochun Cao, and Qingming Huang. 2021. Poisoning attack against estimating from pairwise comparisons. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 10 (2021), 6393–6408.
- Ma et al. (2022) Ke Ma, Qianqian Xu, Jinshan Zeng, Guorong Li, Xiaochun Cao, and Qingming Huang. 2022. A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation is the Fixed Point of Adversarial Game. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2022), 4090–4108.
- Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations.
- Mazeika et al. (2022) Mantas Mazeika, Bo Li, and David Forsyth. 2022. How to steer your adversary: Targeted and efficient model stealing defenses with gradient redirection. In International Conference on Machine Learning. PMLR, 15241–15254.
- Micaelli and Storkey (2019) Paul Micaelli and Amos J Storkey. 2019. Zero-shot knowledge transfer via adversarial belief matching. Advances in Neural Information Processing Systems 32 (2019).
- Milli et al. (2019) Smitha Milli, Ludwig Schmidt, Anca D Dragan, and Moritz Hardt. 2019. Model reconstruction from model explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 1–9.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
- Oh et al. (2019) Seong Joon Oh, Bernt Schiele, and Mario Fritz. 2019. Towards reverse-engineering black-box neural networks. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (2019), 121–144.
- Orekondy et al. (2019) Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. 2019. Knockoff nets: Stealing functionality of black-box models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4954–4963.
- Orekondy et al. (2020) Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. 2020. Prediction poisoning: Towards defenses against dnn model stealing attacks. In International Conference on Learning Representations.
- Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security. 506–519.
- Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust adversarial reinforcement learning. In International Conference on Machine Learning. PMLR, 2817–2826.
- Sanyal et al. (2022) Sunandini Sanyal, Sravanti Addepalli, and R Venkatesh Babu. 2022. Towards data-free model stealing in a hard label setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15284–15293.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
- Tang et al. (2021) Shiyu Tang, Ruihao Gong, Yan Wang, Aishan Liu, Jiakai Wang, Xinyun Chen, Fengwei Yu, Xianglong Liu, Dawn Song, Alan Yuille, et al. 2021. Robustart: Benchmarking robustness on architecture design and training techniques. ArXiv (2021).
- Tramèr et al. (2016) Florian Tramèr, Fan Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. 2016. Stealing Machine Learning Models via Prediction APIs.. In USENIX security symposium, Vol. 16. 601–618.
- Truong et al. (2021) Jean-Baptiste Truong, Pratyush Maini, Robert J Walls, and Nicolas Papernot. 2021. Data-free model extraction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4771–4780.
- Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
- Wallace et al. (2020) Eric Wallace, Mitchell Stern, and Dawn Song. 2020. Imitation attacks and defenses for black-box machine translation systems. arXiv preprint arXiv:2004.15015 (2020).
- Wang and Gong (2018) Binghui Wang and Neil Zhenqiang Gong. 2018. Stealing hyperparameters in machine learning. In 2018 IEEE symposium on security and privacy (SP). IEEE, 36–52.
- Wang et al. (2021) Jiakai Wang, Aishan Liu, Zixin Yin, Shunchang Liu, Shiyu Tang, and Xianglong Liu. 2021. Dual attention suppression attack: Generate adversarial camouflage in physical world. In CVPR.
- Wei et al. (2018) Xingxing Wei, Siyuan Liang, Ning Chen, and Xiaochun Cao. 2018. Transferable adversarial attacks for image and video object detection. arXiv preprint arXiv:1811.12641 (2018).
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
- Yu et al. (2020) Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33 (2020), 5824–5836.
- Yuan et al. (2022) Xiaoyong Yuan, Leah Ding, Lan Zhang, Xiaolin Li, and Dapeng Oliver Wu. 2022. Es attack: Model stealing against deep neural networks without data hurdles. IEEE Transactions on Emerging Topics in Computational Intelligence 6, 5 (2022), 1258–1270.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations (ICLR).
- Zhang et al. (2021) Chongzhi Zhang, Aishan Liu, Xianglong Liu, Yitao Xu, Hang Yu, Yuqing Ma, and Tianlin Li. 2021. Interpreting and Improving Adversarial Robustness of Deep Neural Networks with Neuron Sensitivity. IEEE Transactions on Image Processing (2021).
Appendix
Appendix A Pseudo Code of InI
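Below is a minimal PyTorch-style sketch of one InI training step, combining the benign, isolation, and induction objectives with PCGrad as described in Section 4; it reuses the isolation_loss, induction_loss, and pcgrad helpers sketched there, and the loss weights, optimizer handling, and function names are illustrative.

```python
import torch
import torch.nn.functional as F
# Relies on isolation_loss, induction_loss, and pcgrad from the Section 4 sketches.

def ini_training_step(victim, clone, x_id, y_id, x_ood, opt_v,
                      num_classes, alpha=1.0, beta=1.0):
    """One victim update: benign CE on ID data, gradient isolation, and
    adversary induction on OOD data, combined with PCGrad."""
    victim_params = [p for p in victim.parameters() if p.requires_grad]
    losses = [
        F.cross_entropy(victim(x_id), y_id),                             # L_ben
        alpha * isolation_loss(victim, clone, x_id, y_id, num_classes),  # L_iso
        beta * induction_loss(victim, clone, x_ood),                     # L_ind
    ]
    # Per-objective gradients w.r.t. the victim, flattened for PCGrad.
    flat_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, victim_params,
                                    retain_graph=True, allow_unused=True)
        grads = [torch.zeros_like(p) if g is None else g
                 for g, p in zip(grads, victim_params)]
        flat_grads.append(torch.cat([g.flatten() for g in grads]))

    combined = pcgrad(flat_grads)            # conflict-projected update direction
    opt_v.zero_grad()
    offset = 0
    for p in victim_params:
        n = p.numel()
        p.grad = combined[offset:offset + n].view_as(p).clone()
        offset += n
    opt_v.step()
```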
Appendix B Training Time Evaluation
Though InI makes great progress on inference-time speed, computing the additional gradients increases its training overhead. We evaluate the training time of each defense method on CIFAR-10 and record the time consumption of 10 epochs. The results are listed in Table B.1. The training time of InI is 12.15× that of undefended training and 1.40× that of EDM, as we compute several different gradients to construct the isolation loss. However, InI achieves up to a 25.4× speed improvement over baseline methods during inference, and for MLaaS models, inference accounts for a far larger share of the total cost than training.
Table B.1. Training time (10 epochs) of each defense on CIFAR-10.

| | ND/MAD | AM | EDM | InI (Ours) |
|---|---|---|---|---|
| Time (s) | 155.7 (1.00×) | 551.1 (3.54×) | 1353.5 (8.69×) | 1892.1 (12.15×) |
Appendix C Training the Surrogate Adversary
In our main experiments of InI, the surrogate adversary $C$ remains untrained, which is similar to (Orekondy et al., 2020). We also experiment with adversarially training the surrogate adversary, where the victim $V$ and the clone $C$ have opposite objectives and the parameters $\theta_V$ and $\theta_C$ are updated alternately. The results are shown in Table C.1. We observe that adversarial training is very time-consuming and its defensive performance is sensitive to hyper-parameters. The results show that adversarial training either yields poor defensive performance or harms benign accuracy. This unstable behavior makes it almost unusable.
Table C.1. Clean and stealing accuracy (%) when adversarially training the surrogate adversary.

| Defense | Clean accuracy | Steal accuracy |
|---|---|---|
| Vanilla | 94.71 | 79.55 |
| InI | 94.32 | 69.54 |
| Adv Training 1 | 94.66 | 77.23 |
| Adv Training 2 | 64.95 | 40.40 |
Appendix D Feature Visualization
To better understand the effectiveness of our defense, we further conduct feature visualizations of ID and OOD samples for the victim models. Specifically, we pick 5,000 images from the test sets of the ID (MNIST) and OOD (EMNIST-Letters (Cohen et al., 2017)) datasets, use the convolutional layers of the victim models to extract features, and plot them using t-SNE (Van der Maaten and Hinton, 2008).
Figure 1(a) shows the feature distribution extracted by the undefended model (vanilla). Samples from the target distribution (ID) are clearly grouped into 10 clusters. The OOD samples exhibit no clusters in the feature space, but the boundaries among different classes are comparatively clear, which facilitates model stealing with OOD queries. Figure 1(b) presents the feature distribution extracted by our InI. The clusters of ID samples remain largely unaltered, while the distribution of OOD samples changes significantly: the boundaries between different classes become more ambiguous, which hinders model stealing attacks. We attribute this to adversary induction minimizing the disparity between samples of different classes, and to gradient isolation obfuscating the victim's decision boundary as seen through the adversary's update gradient.




Appendix E Additional Experimental Results
We evaluate our method on VGG-16 networks. Table E.1 shows the benign accuracy of each defense. The defense performance against KnockoffNets and JBDA is shown in Tables E.2 and E.3. The results are similar to those on ResNet-18, while InI achieves even better defense performance against JBDA on FashionMNIST and CIFAR-10.
Table E.1. Benign accuracy (%) of each defense with VGG-16 victims.

| Defense | MNIST | FashionMNIST | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| Vanilla | 99.63 | 94.02 | 93.24 | 72.75 |
| MAD | 99.43 | 93.84 | 92.58 | 72.04 |
| AM | 99.42 | 93.63 | 92.32 | 71.48 |
| EDM | 99.62 | 93.76 | 92.85 | 71.12 |
| InI (Ours) | 99.45 | 93.97 | 92.39 | 72.03 |
Table E.2. Clone accuracy (%) and relative performance (×) against KnockoffNets attacks with VGG-16 victims.

| Defense | MNIST soft-label | MNIST hard-label | FashionMNIST soft-label | FashionMNIST hard-label | CIFAR-10 soft-label | CIFAR-10 hard-label | CIFAR-100 soft-label | CIFAR-100 hard-label |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 99.52 (1.00×) | 99.12 (0.99×) | 64.35 (0.68×) | 49.95 (0.53×) | 82.35 (0.88×) | 78.21 (0.84×) | 35.89 (0.49×) | 24.87 (0.34×) |
| MAD | 97.28 (0.98×) | 98.81 (0.99×) | 46.69 (0.50×) | 43.72 (0.47×) | 73.10 (0.79×) | 70.17 (0.76×) | 24.97 (0.35×) | 20.28 (0.28×) |
| AM | 98.49 (0.99×) | 97.45 (0.98×) | 28.18 (0.30×) | 35.63 (0.38×) | 74.41 (0.81×) | 69.01 (0.75×) | 32.95 (0.46×) | 21.51 (0.30×) |
| EDM | 98.62 (0.99×) | 97.99 (0.98×) | 21.61 (0.23×) | 26.08 (0.28×) | 69.98 (0.75×) | 68.81 (0.74×) | 29.34 (0.41×) | 22.49 (0.32×) |
| InI (Ours) | 79.90 (0.80×) | 97.27 (0.98×) | 20.42 (0.22×) | 25.57 (0.27×) | 15.17 (0.16×) | 66.36 (0.72×) | 6.63 (0.09×) | 22.39 (0.31×) |
| InI + MAD | 75.21 (0.76×) | 97.05 (0.98×) | 20.91 (0.22×) | 25.69 (0.27×) | 13.59 (0.15×) | 64.77 (0.70×) | 6.22 (0.09×) | 13.71 (0.19×) |
| InI + AM | 31.22 (0.31×) | 97.03 (0.98×) | 19.67 (0.21×) | 17.91 (0.19×) | 14.93 (0.16×) | 66.76 (0.72×) | 6.13 (0.09×) | 21.76 (0.30×) |
Table E.3. Clone accuracy (%) and relative performance (×) against JBDA attacks with VGG-16 victims.

| Defense | MNIST soft-label | MNIST hard-label | FashionMNIST soft-label | FashionMNIST hard-label | CIFAR-10 soft-label | CIFAR-10 hard-label | CIFAR-100 soft-label | CIFAR-100 hard-label |
|---|---|---|---|---|---|---|---|---|
| Vanilla | 76.62 (0.77×) | 43.32 (0.43×) | 50.60 (0.54×) | 57.77 (0.61×) | 13.08 (0.14×) | 12.67 (0.14×) | 2.60 (0.04×) | 3.00 (0.04×) |
| MAD | 16.63 (0.17×) | 29.49 (0.30×) | 48.11 (0.51×) | 56.48 (0.60×) | 12.16 (0.13×) | 12.08 (0.13×) | 1.08 (0.01×) | 2.97 (0.04×) |
| AM | 70.03 (0.70×) | 25.20 (0.25×) | 66.76 (0.71×) | 53.88 (0.58×) | 16.09 (0.17×) | 12.21 (0.13×) | 1.79 (0.03×) | 2.93 (0.04×) |
| EDM | 61.28 (0.62×) | 69.79 (0.70×) | 51.74 (0.55×) | 49.28 (0.53×) | 10.88 (0.12×) | 15.59 (0.17×) | 2.42 (0.03×) | 2.60 (0.04×) |
| InI (Ours) | 10.64 (0.11×) | 24.33 (0.24×) | 17.21 (0.18×) | 48.96 (0.52×) | 10.38 (0.11×) | 11.76 (0.13×) | 1.78 (0.02×) | 2.33 (0.03×) |
| InI + MAD | 9.85 (0.10×) | 3.02 (0.03×) | 31.86 (0.34×) | 49.93 (0.53×) | 11.50 (0.12×) | 11.57 (0.13×) | 1.74 (0.02×) | 2.33 (0.03×) |
| InI + AM | 11.41 (0.11×) | 22.57 (0.23×) | 19.60 (0.21×) | 44.35 (0.47×) | 10.31 (0.11×) | 11.57 (0.13×) | 2.02 (0.03×) | 2.58 (0.04×) |