Institute for Network Sciences and Cyberspace, BNRist, Tsinghua University, Beijing, China. Email: [email protected]
School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore. Email: [email protected]
Department of Computer Science and Technology, Tsinghua University, Beijing, China. Email: [email protected]
Zhongguancun Laboratory, Beijing, China
Shandong Key Laboratory of Artificial Intelligence Security, Shandong, China
Hard-Label Cryptanalytic Extraction of Neural Network Models
Abstract
The machine learning problem of extracting neural network parameters was posed nearly three decades ago, and functionally equivalent extraction is a crucial goal of research on this problem. When the adversary has access to the raw output of the neural network, various attacks, including those presented at CRYPTO 2020 and EUROCRYPT 2024, have successfully achieved this goal. However, this goal has not been achieved when neural networks operate under a hard-label setting in which the raw output is inaccessible.
In this paper, we propose the first attack that theoretically achieves functionally equivalent extraction under the hard-label setting, which applies to ReLU neural networks. The effectiveness of our attack is validated through practical experiments on a wide range of ReLU neural networks, including neural networks trained on two real benchmarking datasets (MNIST, CIFAR10) widely used in computer vision. For a neural network consisting of parameters, our attack only requires several hours on a single core.
Keywords:
Cryptanalysis · ReLU Neural Networks · Functionally Equivalent Extraction · Hard-Label
1 Introduction
Extracting all the parameters (including weights and biases) of a neural network (called the victim model) is a long-standing open problem. It was first posed by cryptographers and mathematicians in the early 1990s [2, 7], and has since been widely studied by research groups from both industry and academia [14, 16, 11, 18, 1, 20, 4, 3].
In previous research [14, 16, 11, 18, 20, 4, 3], one of the most common attack scenarios is as follows. The victim model (denoted by $f_\theta$, where $\theta$ denotes the parameters) is made available as an Oracle; the adversary generates inputs $x$ to query the Oracle and collects the feedback to extract the parameters. This resembles a cryptanalysis problem: $\theta$ is considered the secret key, and the adversary tries to recover the secret key $\theta$ given the pairs $(x, f_\theta(x))$ [4]. If $\theta$ contains $m$ parameters (64-bit floating-point numbers), then the secret key contains $64m$ bits, and the computation complexity of brute-force search is $2^{64m}$.
Considering that there may be isomorphisms between neural networks, e.g., permutation and scaling for ReLU neural networks [18], an important concept named functionally equivalent extraction was summarized and proposed in [11].
Functionally Equivalent Extraction.
Denote by $\mathcal{X}$ the input space of the victim model $f_\theta$. Functionally equivalent extraction aims at generating a model $\hat{f}_{\hat{\theta}}$ (i.e., the extracted model) such that $\hat{f}_{\hat{\theta}}(x) = f_\theta(x)$ holds for all $x \in \mathcal{X}$, where $f_\theta(x)$ and $\hat{f}_{\hat{\theta}}(x)$ are, respectively, the raw outputs of the victim model and the extracted model [11]. Such an extracted model is called a functionally equivalent model of $f_\theta$ (we also say that $f_\theta$ and $\hat{f}_{\hat{\theta}}$ are isomorphic [18]). Since $\hat{f}_{\hat{\theta}}$ behaves the same as $f_\theta$, the adversary can explore the properties of $f_\theta$ by taking $\hat{f}_{\hat{\theta}}$ as a perfect substitute. (Due to the isomorphisms introduced in [18], the parameters of the extracted model may differ from those of the victim model, but this does not matter as long as $\hat{f}_{\hat{\theta}}$ is a functionally equivalent model of $f_\theta$.)
Functionally equivalent extraction is hard [11]. Consider the isomorphisms (permutation and scaling) introduced in [18]. Scaling can change parameters and permutation does not. For a ReLU neural network that contains parameters (64-bit floating-point numbers) and neurons, from the perspective of the above cryptanalysis problem, even though scaling can be applied to each neuron once, the adversary still needs to recover secret key bits. Note that the case of is common for neural networks. For example, for the ReLU neural networks extracted by Carlini et al. at CRYPTO 2020 [4], the pairs are , , , , and . Besides, even if we only recover a few bits (instead of bits) of each parameter, the number of secret key bits to be recovered is still large, particularly in the case of large .
Hard-label Setting.
According to the taxonomy in [11], when the Oracle is queried, there are five types of feedback given by the Oracle: (1) the label (the most likely class label, also called the hard-label), (2) the label and its score (the most likely class label and its probability score), (3) the top-k scores, (4) all the label scores, and (5) the raw output (i.e., $f_\theta(x)$). When the Oracle only returns the hard-label, we say that the victim model (i.e., a neural network in this paper) works under the hard-label setting [8].
To the best of our knowledge, there are so far no functionally equivalent extraction attacks based on the first four types of feedback. From the perspective of cryptanalysis, the raw output is equivalent to the complete ciphertext corresponding to the plaintext (i.e., the query $x$). The other four types of feedback only reveal some properties of the ciphertext (the raw output $f_\theta(x)$). For example, when the output is a scalar, the hard-label only tells whether $f_\theta(x) > 0$ holds or not [8]. Among the five types of feedback, the raw output leaks the most information, while the hard-label (i.e., the first type of feedback) leaks the least [11].
Assuming that the feedback of the Oracle is the raw output, Jagielski et al. propose the first functionally equivalent extraction attack against ReLU neural networks with one hidden layer [11], which is extended to deeper neural networks in [18]. At CRYPTO 2020, Carlini et al. propose a differential extraction attack [4] that requires fewer queries than the attack in [18]. However, the differential extraction attack requires an exponential amount of time, an issue addressed by Canales-Martínez et al. at EUROCRYPT 2024 [3]. Note that the extraction attacks in [18, 4, 3] are also based on the assumption that the feedback is the raw output. Due to this dependence on the raw output, the authors of [11, 18, 4, 3] all state that the hard-label setting (i.e., the first type of feedback) is a defense against functionally equivalent extraction.
The above background leads to a question that has not been studied before: is functionally equivalent extraction still possible when the Oracle only returns hard-labels?
1.1 Our Results and Techniques
1.1.1 Results.
We address this question head-on in this paper. In short, the answer is yes, and we propose the first functionally equivalent extraction attack against ReLU neural networks under the hard-label setting. Here, the definition of functionally equivalent extraction proposed in [4] is extended in a natural way.
Definition 1 (Extended Functionally Equivalent Extraction)
The goal of the extended functionally equivalent extraction is to generate a model $\hat{f}_{\hat{\theta}}$ (i.e., the extracted model) such that $f_\theta(x) = c \cdot \hat{f}_{\hat{\theta}}(x)$ holds for all $x \in \mathcal{X}$, where $c$ is a fixed constant, and $f_\theta(x)$ and $\hat{f}_{\hat{\theta}}(x)$ are, respectively, the raw outputs of the victim model and the extracted model. The extracted model $\hat{f}_{\hat{\theta}}$ is a functionally equivalent model of the victim model $f_\theta$.
Since $f_\theta(x) = c \cdot \hat{f}_{\hat{\theta}}(x)$ holds for all $x \in \mathcal{X}$, i.e., $f_\theta(x)$ is a simple scalar multiple of $\hat{f}_{\hat{\theta}}(x)$, the adversary can still explore the properties of $f_\theta$ by taking $\hat{f}_{\hat{\theta}}$ as a perfect substitute. This is why we propose this extended definition. From the perspective of cryptanalysis, this extended definition allows the adversary not to guess the bits of the constant $c$. To evaluate the efficacy of our model extraction attacks, and to quantify the degree to which a model extraction attack has succeeded in practice, we generalize the metric named $(\varepsilon, \delta)$-functional equivalence proposed in [4].
Definition 2 (Extended $(\varepsilon, \delta)$-Functional Equivalence)
Two models $f_\theta$ and $\hat{f}_{\hat{\theta}}$ are $(\varepsilon, \delta)$-functionally equivalent on $\mathcal{S}$ if there exists a fixed constant $c$ such that
$$\Pr_{x \in \mathcal{S}} \left[\, \big| f_\theta(x) - c \cdot \hat{f}_{\hat{\theta}}(x) \big| \le \varepsilon \,\right] \ge 1 - \delta.$$
In this paper, we propose two model extraction attacks: one applies to 0-deep neural networks, and the other applies to $k$-deep ($k \ge 1$) neural networks. The former attack is the basis of the latter. Our model extraction attacks theoretically achieve the functionally equivalent extraction described in Definition 1, where the constant $c$ is determined by the model parameters $\theta$.
We have also performed numerous experiments on both untrained and trained neural networks to verify the effectiveness of our model extraction attacks in practice. The untrained neural networks are obtained by randomly generating model parameters. To fully verify our attacks, we also adopt two real benchmarking image datasets (i.e., MNIST and CIFAR10) widely used in computer vision, and train many classifiers (i.e., trained neural networks) as victim models. Our model extraction attacks show good performance in these experiments; for the complete experiment results, refer to Tables 1 and 2 in Section 7. The number of parameters of the neural networks in our experiments is up to about $10^5$, yet the runtime of the proposed extraction attack on a single core is within several hours. Our experiment code is uploaded to GitHub (https://github.com/AI-Lab-Y/NN_cryptanalytic_extraction).
The analysis of the attack complexity is presented in Appendix 0.B. For the extraction attack on -deep neural networks, its query complexity is about , where and are, respectively, the input dimension (i.e., the size of ) and the number of neurons, is a precision chosen by the adversary. The computation complexity is about , where is the number of neurons and is the number of hidden layers. The computation complexity of our attack is much lower than that of brute-force searching.
Techniques.
By introducing two new concepts, namely the model activation pattern and the model signature, we obtain the following findings. A ReLU neural network is composed of a certain number of affine transformations corresponding to model activation patterns. Each affine transformation leaks partial information about the neural network parameters, and which information is leaked is determined by the corresponding model activation pattern. Most importantly, for a neural network that contains a given number of neurons, a specific set of special model activation patterns leaks all the information about the neural network parameters.
Inspired by the above findings, we design a series of methods to find decision boundary points, recover the corresponding affine transformations, and further extract the neural network parameters. These methods compose the complete model extraction attacks.
1.1.2 Organization.
The basic notations, threat model, attack goal, and assumptions are introduced in Section 2. Section 3 introduces some auxiliary concepts. We then present the overview of our model extraction attacks, the idealized model extraction attacks, and some practical refinements in the following three sections, respectively. Experiments are reported in Section 7. Finally, we present further discussion of our work and conclude the paper.
2 Preliminaries
2.1 Basic Definitions and Notations
This section presents some necessary definitions and notations.
Definition 3 ($k$-Deep Neural Network [4])
A $k$-deep neural network $f_\theta$ is a function parameterized by $\theta$ that takes inputs from an input space $\mathcal{X}$ and returns values in an output space $\mathcal{Y}$. The function $f_\theta$ is composed of alternating linear layers $f_i$ and a non-linear activation function $\sigma$:
$$f_\theta = f_{k+1} \circ \sigma \circ f_k \circ \cdots \circ \sigma \circ f_2 \circ \sigma \circ f_1. \qquad (1)$$
In this paper, we exclusively study neural networks over $\mathcal{X} = \mathbb{R}^{d_0}$ and $\mathcal{Y} = \mathbb{R}^{d_{k+1}}$, where $d_0$ and $d_{k+1}$ are positive integers. As in [4, 3], we only consider neural networks using the ReLU [15] activation function, given by $\sigma(x) = \max(x, 0)$.
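To make Definition 3 concrete, the following minimal Python sketch evaluates a $k$-deep ReLU network as alternating affine layers and ReLU activations; the helper names and the toy parameters are our own illustrations, not objects defined in this paper.

    import numpy as np

    def relu(v):
        # sigma(v) = max(v, 0), applied element-wise
        return np.maximum(v, 0.0)

    def forward(x, A, b):
        """Evaluate f_theta(x) = f_{k+1} o sigma o f_k o ... o sigma o f_1 (x).
        A: list of k+1 weight matrices, b: list of k+1 bias vectors."""
        h = x
        for i in range(len(A) - 1):        # hidden layers 1..k
            h = relu(A[i] @ h + b[i])
        return A[-1] @ h + b[-1]           # the output layer f_{k+1} is affine only

    # toy example: a 1-deep network with d0 = 3, d1 = 2, d2 = 1
    rng = np.random.default_rng(0)
    A = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
    b = [rng.standard_normal(2), rng.standard_normal(1)]
    x = rng.standard_normal(3)
    print(forward(x, A, b))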
Definition 4 (Fully Connected Layer [4])
The $i$-th fully connected layer of a neural network is a function $f_i: \mathbb{R}^{d_{i-1}} \to \mathbb{R}^{d_i}$ given by the affine transformation
$$f_i(x) = A^{(i)} x + b^{(i)}, \qquad (2)$$
where $A^{(i)} \in \mathbb{R}^{d_i \times d_{i-1}}$ is a weight matrix and $b^{(i)} \in \mathbb{R}^{d_i}$ is a $d_i$-dimensional bias vector.
Definition 5 (Neuron [4])
A neuron is a function determined by the corresponding weight matrix, bias vector, and activation function. Formally, the $j$-th neuron of layer $i$ is the function $\eta$ given by
$$\eta(x) = \sigma\big(A^{(i)}_j x + b^{(i)}_j\big), \qquad (3)$$
where $A^{(i)}_j$ and $b^{(i)}_j$ denote, respectively, the $j$-th row of $A^{(i)}$ and the $j$-th coordinate of $b^{(i)}$. In a $k$-deep neural network, there are a total of $\sum_{i=1}^{k} d_i$ neurons.
Definition 6 (Neuron State [3])
Let $\mathcal{V}(\eta; x)$ denote the value that neuron $\eta$ takes on $x$ before applying $\sigma$. If $\mathcal{V}(\eta; x) > 0$, then $\eta$ is active, i.e., the neuron state is active. Otherwise, the neuron state is inactive. (In [4, 3], the authors define one more neuron state, namely critical, i.e., $\mathcal{V}(\eta; x) = 0$, which is a special inactive state since the output of the neuron is $0$.) The state of the $j$-th neuron in layer $i$ on input $x$ is denoted by its state bit: if the state bit is $1$, the neuron is active; if it is $0$, the neuron is inactive.
Definition 7 (Neural Network Architecture [4])
The architecture of a fully connected neural network captures the structure of $f_\theta$: (a) the number of layers, and (b) the dimension $d_i$ of each layer $i$, for $0 \le i \le k+1$. We say that $d_0$ is the dimension of the input to the neural network, and $d_{k+1}$ denotes the number of outputs of the neural network.
Definition 8 (Neural Network Parameters [4])
The parameters $\theta$ of a $k$-deep neural network $f_\theta$ are the concrete assignments to the weights $A^{(i)}$ and biases $b^{(i)}$ for $i \in \{1, \ldots, k+1\}$.
When neural networks work under the hard-label setting, the raw output $f_\theta(x)$ is processed before being returned [8]. This paper considers the most common processing: the raw output is first transformed into a category probability vector by applying the Sigmoid (when $d_{k+1} = 1$) or Softmax (when $d_{k+1} > 1$) function to $f_\theta(x)$ [6]. Then, the category with the largest probability is returned as the hard-label. Definition 9 summarizes the hard-label and the corresponding decision conditions on the raw output $f_\theta(x)$.
Definition 9 (Hard-Label)
Consider a $k$-deep neural network $f_\theta$ with $d_{k+1} \ge 1$. The hard-label (denoted by $z$) is determined by the outputs $f_\theta(x)$. When $d_{k+1} = 1$, the hard-label is computed as
$$z = \begin{cases} 1, & \text{if } f_\theta(x) > 0, \\ 0, & \text{otherwise}. \end{cases} \qquad (4)$$
When $d_{k+1} > 1$, the output $f_\theta(x)$ is a $d_{k+1}$-dimensional vector, and the hard-label $z$ is the coordinate of the maximum of $f_\theta(x)$. (If there are ties, i.e., multiple entries of $f_\theta(x)$ share the same maximum, the hard-label is the smallest of the coordinates of these entries.)
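The decision rule above can be sketched as follows, assuming the Sigmoid/Softmax post-processing described in the text; the function name is our own.

    import numpy as np

    def hard_label(raw):
        """Map the raw output f_theta(x) to the hard-label z of Definition 9."""
        raw = np.atleast_1d(np.asarray(raw, dtype=float))
        if raw.size == 1:                       # d_{k+1} = 1: Sigmoid threshold at 0.5
            return int(raw[0] > 0)              # sigmoid(t) > 0.5  <=>  t > 0
        # d_{k+1} > 1: Softmax preserves the argmax of the raw output;
        # np.argmax returns the smallest index on ties, matching the tie-breaking rule.
        return int(np.argmax(raw))

    print(hard_label(-0.3))             # 0
    print(hard_label([0.2, 1.5, 1.5]))  # 1 (smallest coordinate among the tied maxima)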
2.2 Adversarial Goals and Assumptions
There are two parties in a model extraction attack: the Oracle, which possesses the neural network $f_\theta$, and the adversary, who generates queries to the Oracle. Under the hard-label setting, the Oracle returns the hard-label $z$ defined in Definition 9.
Definition 10 (Model Parameter Extraction Attack)
A model parameter extraction attack receives Oracle access to a parameterized function $f_\theta$ (i.e., a $k$-deep neural network in our paper) and the architecture of $f_\theta$, and returns a set of parameters $\hat{\theta}$ with the goal that $c \cdot \hat{f}_{\hat{\theta}}$ is as similar as possible to $f_\theta$, where $c$ is a fixed constant.
In this paper, we use the hat symbol to indicate an extracted parameter. For example, $\theta$ denotes the parameters of the victim model $f_\theta$, and $\hat{\theta}$ stands for the parameters of the extracted model $\hat{f}_{\hat{\theta}}$.
Assumptions.
We make the following assumptions about the Oracle and the capabilities of the attacker:
• Architecture knowledge. We require knowledge of the neural network architecture.
• Full-domain inputs. We can feed arbitrary inputs from $\mathcal{X}$.
• Precise computations. $f_\theta$ is specified and evaluated using 64-bit floating-point arithmetic.
• Scalar outputs. The output dimensionality is $d_{k+1} = 1$. (This assumption is fundamental to our work: our attack only applies to the case of scalar outputs.)
• ReLU activations. All activation functions ($\sigma$'s) are the ReLU function.
Compared with the work in [4], we remove the assumption of access to the raw output of the neural network. After querying the Oracle, the attacker obtains only the hard-label $z$; in other words, the attacker only knows whether $f_\theta(x) > 0$ holds or not.
3 Auxiliary Concepts
To help understand our attacks, this paper proposes some auxiliary concepts.
3.1 Model Activation Pattern
To describe all the neuron states, we introduce a new concept named Model Activation Pattern.
Definition 11 (Model Activation Pattern)
Consider a -deep neural network with neurons. The model activation pattern of over an input is a global description of the neuron states, and is denoted by where is the concatenation of neuron states (i.e., ) in layer .
In the rest of this paper, the notations , , and are simplified as , , and respectively, when the meaning is clear in context. Besides, is represented by a -bit integer. For example, means that only the -th neuron in layer is active, and means that all the neurons are active.
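As a sketch of how a model activation pattern can be computed in practice; the bit convention (1 = active, 0 = inactive), the integer packing, and all helper names are our own assumptions.

    import numpy as np

    def activation_pattern(x, A, b):
        """Return one 0/1 array per hidden layer: the neuron states on input x."""
        pattern, h = [], x
        for Ai, bi in zip(A[:-1], b[:-1]):      # hidden layers only
            pre = Ai @ h + bi                    # values before applying ReLU
            states = (pre > 0).astype(int)       # 1 = active, 0 = inactive
            pattern.append(states)
            h = np.maximum(pre, 0.0)
        return pattern

    def pattern_as_ints(pattern):
        # pack each layer's states into an integer, e.g. [1, 0] -> 0b10 = 2
        return [int("".join(map(str, layer)), 2) for layer in pattern]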
When the model activation pattern is known, one can precisely determine which neural network parameters influence the output . Consider the -th neuron in layer . Due to the ReLU activation function, if the neuron state is inactive, neuron does not influence the output . As a result, all the weights and ( i.e., the elements of the -th column of , and the -th row of respectively) and the bias do not affect the output .
Special ‘neuron’.
For the convenience of introducing the model extraction attacks later, we regard the input $x$ and the output $f_\theta(x)$ as special 'neurons' that are always active, and adopt two extra notations for describing the states of these special 'neurons'. When they are not needed, we omit the two notations.
3.2 Model Signature
Consider a $k$-deep neural network $f_\theta$. For an input $x$, $f_\theta(x)$ can be described as an affine transformation
$$f_\theta(x) = \Gamma x + \gamma, \qquad \Gamma = A^{(k+1)} I^{(k)} A^{(k)} \cdots I^{(1)} A^{(1)}, \qquad \gamma = b^{(k+1)} + \sum_{i=1}^{k} A^{(k+1)} I^{(k)} A^{(k)} \cdots A^{(i+1)} I^{(i)} b^{(i)}, \qquad (5)$$
where the matrices $I^{(1)}, \ldots, I^{(k)}$ are determined by the model activation pattern over $x$: each $I^{(i)}$ is a 0-1 diagonal matrix with a $1$ on the $j$-th diagonal entry when the $j$-th neuron in layer $i$ is active, and a $0$ when it is inactive.
The affine transformation is denoted by a tuple $(\Gamma, \gamma)$. Except for the model activation pattern, the value of the tuple is determined only by the neural network parameters, i.e., the $A^{(i)}$'s and $b^{(i)}$'s. Once the value of any neural network parameter is changed, the value of the tuple corresponding to some activation patterns will change too. (We do not consider the case of some neurons being always inactive, since such neurons are redundant and are usually deleted by various network pruning methods (e.g., [10]) before the neural network is deployed as a prediction service.) Therefore, we regard the set of all the possible tuples $(\Gamma, \gamma)$ as a unique model signature of the neural network.
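Under a fixed model activation pattern, the network collapses to the affine map $\Gamma x + \gamma$ described above. The sketch below computes the tuple and, looking ahead to Definition 13, a normalized version; the helper names and the choice of normalizing by the first non-zero entry are our own assumptions.

    import numpy as np

    def affine_from_pattern(A, b, pattern):
        """Collapse the network to (Gamma, gamma) for a fixed activation pattern.
        pattern[i] is the 0/1 state vector of hidden layer i+1."""
        Gamma = np.eye(A[0].shape[1])            # identity on the input space
        gamma = np.zeros(A[0].shape[1])
        for Ai, bi, states in zip(A[:-1], b[:-1], pattern):
            I = np.diag(states.astype(float))    # ReLU replaced by the fixed diagonal matrix
            Gamma = I @ Ai @ Gamma
            gamma = I @ (Ai @ gamma + bi)
        Gamma = A[-1] @ Gamma                    # final affine layer
        gamma = A[-1] @ gamma + b[-1]
        return Gamma, gamma

    def normalize(Gamma, gamma):
        """Divide the tuple by the first non-zero entry of Gamma (cf. Definition 13)."""
        Gamma, gamma = np.ravel(Gamma), np.ravel(gamma)
        nz = np.flatnonzero(np.abs(Gamma) > 0)
        if nz.size == 0:
            return Gamma, gamma                  # the all-zero case is kept as-is
        return Gamma / Gamma[nz[0]], gamma / Gamma[nz[0]]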
Definition 12 (Model Signature)
For a $k$-deep neural network $f_\theta$, the model signature is the set of all affine transformation tuples $(\Gamma, \gamma)$ induced by the possible model activation patterns of $f_\theta$.
In [3], Canales-Martínez et al. use the term ‘signature’ to describe the weights related to a neuron, which is different from the model signature. Except for the model signature, we propose another important concept, namely normalized model signature.
Definition 13 (Normalized Model Signature)
Consider a victim model and its model signature . Denote by the -th element of for . Divide the set of ’s into two subsets and . For each , for . For each , there is at least one non-zero element in , without loss of generality, assume that . Let be the following set
The set is the normalized model signature of .
In short, the difference between the normalized model signature and the initial model signature is as follows. For each tuple whose $\Gamma$ has at least one non-zero element (without loss of generality, assume that the first element is non-zero, i.e., $\Gamma_1 \neq 0$), we transform the parameter tuple $(\Gamma, \gamma)$ into $(\Gamma / \Gamma_1, \gamma / \Gamma_1)$.
In our attacks, the normalized model signature plays two important roles. First, the recovery of all the weights relies on the subset of tuples with a non-zero $\Gamma$. Second, our attacks produce many extracted models during the attack process while at most one is a functionally equivalent model of $f_\theta$, and the normalized model signature is used to filter out functionally inequivalent models.
3.3 Decision Boundary Point
Our attacks exploit a special class of inputs named Decision Boundary Points.
Definition 14 (Decision Boundary Point)
Consider a neural network $f_\theta$. If an input $x$ makes $f_\theta(x) = 0$ hold, then $x$ is a decision boundary point.
The extraction attacks presented at CRYPTO 2020 [4] and EUROCRYPT 2024 [3] exploit a class of inputs, namely critical points. Fig. 1 shows the difference between critical points and decision boundary points.

Critical points leak information about the neuron states, i.e., whether the output of a neuron is $0$, which is the core reason why the differential extraction attack can efficiently extract model parameters [4]. In contrast, decision boundary points do not leak information about the neuron states.
4 Overview of Our Cryptanalytic Extraction Attacks
Under the hard-label setting, i.e., when the Oracle returns the most likely class instead of the raw output $f_\theta(x)$, only decision boundary points leak the value of $f_\theta(x)$, since $f_\theta(x) = 0$ at such points. Motivated by this fact, our cryptanalytic extraction attacks focus on decision boundary points.
Attack Process.
At a high level, the complete attack contains five steps.
• Step 1: collect decision boundary points. The algorithm for finding decision boundary points will be introduced in Section 6.1. Suppose that a sufficient number of decision boundary points are collected.
• Step 2: recover the normalized model signature. Recover the tuples $(\Gamma, \gamma)$ corresponding to the collected decision boundary points. After filtering duplicate tuples, regard the set of the remaining tuples as the (partial) normalized model signature (refer to Definition 13). Each remaining tuple corresponds to a decision boundary point that can be used to recover weights.
• Step 3: recover weights layer by layer. Suppose that there are $n$ neurons in the neural network. Each time, randomly choose the required number of decision boundary points from those kept in Step 2, assign a specific model activation pattern to each selected decision boundary point, and recover the weights.
• Step 4: recover all the biases. Based on the recovered weights, recover all the biases simultaneously.
• Step 5: filter functionally inequivalent models. Since many combinations of decision boundary points (and sign guesses) are tried, we obtain many extracted models, but it is expected that at most one is the functionally equivalent model. Thus, we filter functionally inequivalent models in this step.
Some functionally inequivalent models may not be filtered. For each surviving extracted model, we test the Prediction Matching Ratio (PMR, introduced in Section 6.3) over randomly generated inputs, and take the one with the highest PMR as the final candidate.
In Step 2, we recover the tuple $(\Gamma, \gamma)$ by the extraction attack on 0-deep neural networks. In Step 3, for layers $i \ge 2$, the weight vector of the $t$-th neuron is recovered by solving a system of linear equations; for layer 1, except for selecting decision boundary points, the recovery of the weights does not use any extra techniques. In Step 4, all the biases are recovered by solving a system of linear equations.
5 Idealized Hard-Label Model Extraction Attack
This section introduces the idealized model extraction attacks under the hard-label setting, which assume infinite-precision arithmetic and recover a functionally equivalent model. We first introduce the 0-deep neural network extraction attack, which is used in the $k$-deep neural network extraction attack to recover the normalized model signature.
Note that this section only (partially) involves Steps 2, 3, and 4 introduced in Section 4. In the next section, we introduce the remaining steps and refine the idealized attacks to work with finite precision.
5.1 Zero-Deep Neural Network Extraction
According to Definition 3, zero-deep neural networks are affine functions $f_\theta(x) = A^{(1)} x + b^{(1)}$, where $A^{(1)} \in \mathbb{R}^{1 \times d_0}$ and $b^{(1)} \in \mathbb{R}$. Let $A = A^{(1)}$ and $b = b^{(1)}$. The model signature is $\{(A, b)\}$.
Our extraction attack is based on a decision boundary point $x^*$ (i.e., $A x^* + b = 0$), and is composed of three steps: (1) recover the weight signs, i.e., the signs of the $A_i$'s; (2) recover the weights $A$; (3) recover the bias $b$.
Recover Weight Signs.
Denote by $e_i$ the basis vector whose $i$-th element is $1$ and whose other elements are $0$.
Let the decision boundary point $x^*$ move along the direction $e_i$ with a moving stride $\delta$, where $\delta > 0$. Query the Oracle with $x^* + \delta e_i$ and obtain the hard-label $z$; then the sign of $A_i$ is
$$\operatorname{sign}(A_i) = \begin{cases} +, & \text{if } z = 1, \\ -, & \text{if } z = 0. \end{cases} \qquad (6)$$
Since $A x^* + b = 0$, we have $f_\theta(x^* + \delta e_i) = \delta A_i$; thus, the sign of $f_\theta(x^* + \delta e_i)$ is the same as that of $A_i$. If the hard-label never changes, no matter whether $\delta$ is positive or negative, then we have $A_i = 0$.
Recover Weights.
Without loss of generality, assume that $A_1 \neq 0$.
First, let the decision boundary point $x^*$ move along $e_1$ with a moving stride $\delta_1$, such that the hard-label of the new point $x^* + \delta_1 e_1$ is $1$, i.e., $f_\theta(x^* + \delta_1 e_1) > 0$. Then, let the new point move along $e_i$ with a moving stride $\delta_i$, where $i \neq 1$ and $A_i \neq 0$, such that $x^* + \delta_1 e_1 + \delta_i e_i$ is a decision boundary point too. As a result, we have
$$A_1 \delta_1 + A_i \delta_i = 0, \qquad (7)$$
and obtain the weight ratio $A_i / A_1 = -\delta_1 / \delta_i$. Since the signs of the $A_i$'s are known, the final extracted weights are
$$\hat{A}_i = \operatorname{sign}(A_i) \cdot \left| \frac{A_i}{A_1} \right|, \quad i \in \{1, \ldots, d_0\}. \qquad (8)$$
Recover Bias.
The extracted bias is $\hat{b} = -\hat{A} x^*$, since $x^*$ is a decision boundary point.
Thus, the model signature of the extracted model is $\{(\hat{A}, \hat{b})\} = \{(A / |A_1|,\; b / |A_1|)\}$, and $f_\theta(x) = |A_1| \cdot \hat{f}_{\hat{\theta}}(x)$, i.e., the constant is $c = |A_1|$.
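A compact sketch of the 0-deep extraction described above, assuming a hard-label oracle that returns $1$ exactly when $Ax + b > 0$ and using a simple bisection to land back on the boundary (Section 6.1 describes the practical search); all names, strides, and tolerances are our own.

    import numpy as np

    def extract_zero_deep(oracle, x_star, d0, delta=1.0, tol=1e-9):
        """Recover A (up to a positive scale) and b of f(x) = A x + b (scalar output).
        oracle(x) returns the hard-label 1[f(x) > 0]; x_star is a boundary point."""
        e = np.eye(d0)
        # Step 1: weight signs from single queries along each axis (Eq. (6)).
        signs = np.array([1.0 if oracle(x_star + delta * e[i]) == 1 else -1.0
                          for i in range(d0)])
        # Step 2: weight ratios A_i / A_1 = -delta_1 / delta_i (Eq. (7)).
        delta_1 = delta * signs[0]               # move so that the label becomes 1
        base = x_star + delta_1 * e[0]
        A_hat = np.empty(d0)
        A_hat[0] = signs[0] * 1.0                # |A_1 / A_1| = 1 (Eq. (8))
        for i in range(1, d0):
            lo, hi = 0.0, -signs[i] * delta      # moving against sign(A_i) decreases f
            for _ in range(200):                 # expand until the label flips
                if oracle(base + hi * e[i]) == 0:
                    break
                hi *= 2.0
            else:
                A_hat[i] = 0.0                   # label never flips: treat A_i as zero
                continue
            while abs(hi - lo) > tol:            # bisect back onto the boundary
                mid = 0.5 * (lo + hi)
                if oracle(base + mid * e[i]) == 1:
                    lo = mid
                else:
                    hi = mid
            delta_i = 0.5 * (lo + hi)
            A_hat[i] = signs[i] * abs(delta_1 / delta_i)
        b_hat = -A_hat @ x_star                  # x_star lies on the boundary
        return A_hat, b_hat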
5.2 -Deep Neural Network Extraction
Building on the 0-deep neural network extraction attack, we develop an extraction attack on $k$-deep neural networks. Recall that the expression of a $k$-deep neural network is
$$f_\theta(x) = A^{(k+1)} I^{(k)} \Big( A^{(k)} I^{(k-1)} \big( \cdots I^{(1)} \big( A^{(1)} x + b^{(1)} \big) \cdots \big) + b^{(k)} \Big) + b^{(k+1)}, \qquad (9)$$
where the model activation pattern over $x$ determines the matrices
$$I^{(i)} = \operatorname{diag}\big(\text{state bits of the neurons in layer } i\big), \quad 1 \le i \le k. \qquad (10)$$
Notations.
Our attack recovers weights layer by layer. Assuming that the weights of the first $i-1$ layers have been recovered and we are trying to recover $A^{(i)}$, where $1 \le i \le k$, we describe $k$-deep neural networks as:
(11) |
where and are, respectively, related to the unrecovered part (excluding ) and recovered part of the neural network .
The values of and are
(12) | ||||
where is a diagonal matrix with a on each diagonal entry.
Core Idea of Recovering Weights Layer by Layer.
To better grasp the attack details presented later, we first introduce the core idea of recovering weights layer by layer. Assuming that the extracted weights of the first $i-1$ layers are known, i.e., $\hat{A}^{(1)}, \ldots, \hat{A}^{(i-1)}$ are known, we try to recover the weights in layer $i$.
To obtain the weight vector of the $t$-th neuron (denoted by $A^{(i)}_t$) in layer $i$ (when $i = 1$, this means that we are trying to recover the weights of the first layer), we exploit a decision boundary point with the model activation pattern where
(13) |
It means that, in layer $i$, only the $t$-th neuron is active, and all the neurons in the subsequent layers are active. Fig. 2 shows a schematic diagram of this scenario.

Since , all the layers starting from layer collapse into a direct connection from to the output . The weight of this connection is , i.e., the -th element of (see Eq. (12)). The expression (see Eq. (11)) of the -deep neural network further becomes
where and .
In Step 2 (see Section 4), applying the extraction attack on zero-deep neural networks, we can obtain the tuple where
(14) |
Here the symbol stands for the -th column vector of .
According to Eq. (14), the value of each element of the recovered vector is not related to the absolute value of the collapsed weight, i.e., the unrecovered part does not affect the recovered affine transformation. Then, based on this vector and the extracted weights $\hat{A}^{(1)}, \ldots, \hat{A}^{(i-1)}$, we build a system of linear equations and solve it to obtain $\hat{A}^{(i)}_t$. Next, we introduce more attack details.
Recover Weights in Layer 1.
To recover the weight vector of the -th neuron in layer , we exploit the model activation pattern where
(15) |
It means that, in layer , only the -th neuron is active, and all the neurons in other layers are active.
Under this model activation pattern, according to Eq. (12), we have
(16) |
Now, the expression of the -deep neural network is
(17) |
where , is the -th element of . As for , it is a constant determined by . In other words, when changes, the value of does not change.
Recall that, in Step 2, we have recovered the following weight vector
(18) |
In this step, our target is to obtain where
(19) |
Therefore, we need to determine signs, i.e., the signs of for where .
Since , we know that holds for , which tells us that the above signs are the same. Thus, by guessing sign, i.e., the sign of for , we obtain weight vectors presented in Eq. (19).
Recover Weights in Layer $i$ ($2 \le i \le k$).
To recover the weight vector of the -th neuron in layer , we exploit the model activation pattern where
(20) |
It means that, in layer , only the -th neuron is active, and all the neurons in other layers are active.
Under this model activation pattern, according to Eq. (12), we have
(21) | ||||
Now, the expression of -deep neural networks becomes
(22) | ||||
where , and . Besides, is not related to , and only determined by , i.e., is the same constant for .
In Step 2, using the zero-deep neural network extraction attack, we have recovered the following weight vectors ()
(24) |
In this step, our target is to obtain the weight vector .
It is clear that we need to guess the sign of for . Again, all the signs are the same. Consider the expression in Eq. (22). Since the -th neuron is active, its output exceeds , i.e.,
Then holds for . At the same time, since is a constant, all the signs are the same. Therefore, by guessing one sign, i.e., the sign of , based on Eq. (24), we obtain
(25) |
Note that it can be obtained using Eq. (21), since $\hat{A}^{(1)}, \ldots, \hat{A}^{(i-1)}$ are known. Then, based on the vector in Eq. (25), we build a system of linear equations
(26) |
When the number of equations is at least the number of unknowns (this case is common in various applications, particularly in computer vision [9, 13, 17], since the dimensions of images or videos are often large), we obtain the weight vector by solving the above system of linear equations. Lemma 1 summarizes the expression of the extracted weight vectors.
Lemma 1
Based on the system of linear equations presented in Eq. (26), for and , the extracted weight vector is
(27) |
where .
Proof
The proof refers to Appendix 0.A.
In Lemma 1, for the consistency of the mathematical symbols, the weights are denoted by instead of .
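In practice, the system of linear equations in Eq. (26) amounts to one least-squares solve per targeted neuron and sign guess. The sketch below illustrates this step under our own naming, assuming the layers below the targeted one are fully active so that their extracted matrices simply multiply together.

    import numpy as np

    def recover_row(M, gamma_hat, sign_guess=+1.0):
        """Solve  w @ M = sign_guess * gamma_hat  for the weight row w of the target neuron.

        M:         (d_{i-1} x d_0) matrix, the product of the already-extracted
                   lower-layer matrices (all lower-layer neurons active).
        gamma_hat: (d_0,) normalized affine transformation of the boundary point.
        """
        # transpose to the standard form  M.T @ w = gamma_hat  and least-squares solve
        w, *_ = np.linalg.lstsq(M.T, sign_guess * np.asarray(gamma_hat), rcond=None)
        return w                                  # shape (d_{i-1},)

In the full attack this solve is repeated for every neuron of layer $i$ and for both sign guesses; wrong guesses are discarded later by the filters of Section 6.3.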
Recover All the Biases.
Since for have been obtained, we can extract all the biases by solving a system of linear equations.
Concretely, for the decision boundary points, should hold. Thus, we build a system of linear equations: where the expression of refers to Eq. (9). Combining with Lemma 1, by solving the above system, we will obtain
(28) |
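For a 1-deep network, the bias-recovery step can be spelled out explicitly: each decision boundary point $x_j$ with known activation pattern yields one linear equation in the biases, since $\hat{A}^{(2)} I_j (\hat{A}^{(1)} x_j + \hat{b}^{(1)}) + \hat{b}^{(2)} = 0$. A sketch under these assumptions (the names are ours):

    import numpy as np

    def recover_biases_1deep(A1_hat, A2_hat, boundary_points, patterns):
        """Solve for (b1_hat, b2_hat) from boundary points of a 1-deep network.
        A2_hat has shape (n,); patterns[j] is the 0/1 hidden-layer state vector
        of boundary_points[j]. At least n + 1 boundary points are needed."""
        n = A1_hat.shape[0]                       # number of hidden neurons
        rows, rhs = [], []
        for x, states in zip(boundary_points, patterns):
            I = np.diag(states.astype(float))
            rows.append(np.append(A2_hat @ I, 1.0))       # coefficients of (b1, b2)
            rhs.append(-(A2_hat @ I @ A1_hat) @ x)        # known part moves to the right
        sol, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
        return sol[:n], sol[n]                    # (b1_hat, b2_hat)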
Based on the extracted neural network parameters (see Eq. (19), Eq. (27), and Eq. (28)), the model signature of the extracted model is
Consider the -th neuron in layer . For an input , denote by the output of the neuron of the victim model. Based on the extracted neural network parameters (see Eq. (19), Eq. (27), and Eq. (28)), the output of the neuron of the extracted model is . At the same time, all the weights in layer are increased by a factor of . Thus, for the victim model and extracted model, the -th neuron in layer has the same influence on all the neurons in layer . As a result, for any , the model activation pattern of the victim model is the same as that of the extracted model. Combining with , for all , we have
(29) |
In Appendix 0.C, we apply the extraction attack on -deep neural networks and directly present the extracted model, which helps further understand our attack.
Remark 2
Except for the model activation patterns as shown in Eq. (15) and Eq. (20), the adversary could choose a new set of model activation patterns. The reason is as follows. Consider the recovery of the weight vector of the -th neuron in layer , and look at Fig. 2 again. Our attack only requires that: (1) in layer , only the -th neuron is active; (2) in layer , all the neurons are active. The neuron states in other layers do not affect the attack. Thus, there are more options for the model activation patterns, and the rationale does not change.
Discussion on The Computation Complexity.
Once decision boundary points and sign guesses are selected, to obtain an extracted model, we just need to solve systems of linear equations. However, since the model activation pattern of a decision boundary point is unknown, we have to traverse all the possible combinations of decision boundary points (see Step 3 in Section 4), which is the bottleneck of the total computation complexity. The complete analysis of the attack complexity is presented in Appendix 0.B.
6 Instantiating the Extraction Attack in Practice
Recall that the complete extraction attack contains the five steps introduced in Section 4. To obtain a functionally equivalent model, the adversary also needs three auxiliary techniques: finding decision boundary points (related to Steps 1 and 2), filtering duplicate affine transformations (related to Step 2), and filtering functionally inequivalent models (related to Step 5).
The idealized extraction attack introduced in Section 5 relies on decision boundary points that make $f_\theta(x) = 0$ hold exactly. This section proposes a binary-search method to find decision boundary points under the hard-label setting. Under finite precision, it is hard to find decision boundary points for which $f_\theta(x) = 0$ holds exactly; therefore, the proposed method returns input points close to the decision hyperplane as decision boundary points. As a result, the remaining two techniques need to take the influence of finite precision into account. This ensures that our model extraction attacks work in practice, producing an $(\varepsilon, \delta)$-functionally equivalent model.
6.1 Finding Decision Boundary Points
Let us see how to find decision boundary points under the hard-label setting. Fig. 3 shows a schematic diagram in a 2-dimensional input space.

We first randomly pick a starting point and non-zero direction vector . Then let the starting point move along the direction or the opposite direction . It is expected that the starting point will eventually cross the decision hyperplane in one direction, as long as and are not parallel to the decision hyperplane.
Denote by the moving stride of the starting point, which means that the starting point arrives at . After querying the Oracle with and , if (i.e., the two labels are different), we know that the starting point has crossed the decision hyperplane when the moving stride is . Now, the core of finding decision boundary points is to determine a suitable moving stride , such that the starting point reaches the decision hyperplane, i.e., holds.
This task is done by binary search. Concretely, randomly choose two different moving strides and at first, such that
(30) | ||||
Then, without changing the conditions presented in Eq. (30), we dynamically adjust the two strides until their absolute difference is smaller than a precision defined by the adversary. Finally, the resulting point is returned as a decision boundary point.
Since the precision is finite, the returned point is not strictly on the decision boundary, which inevitably introduces minor errors (equivalent to noise) into the extracted model. If the precision decreases, the returned point will be closer to the decision boundary, which is helpful to the model extraction attack; see the experiment results in Section 7.
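A sketch of the search just described: expand the stride from a random starting point until the hard-label flips along the chosen direction, then bisect until the bracket is narrower than the precision. The oracle interface and all parameter names are our own assumptions.

    import numpy as np

    def find_boundary_point(oracle, x0, direction, precision=1e-8, step=1.0, max_expand=60):
        """Return a point close to the decision hyperplane of a hard-label oracle."""
        z0 = oracle(x0)
        s = step
        for _ in range(max_expand):              # expand until the label flips
            if oracle(x0 + s * direction) != z0:
                lo, hi = 0.0, s
                break
            if oracle(x0 - s * direction) != z0:
                lo, hi = 0.0, -s
                break
            s *= 2.0
        else:
            return None                          # unlucky start/direction; retry with new ones
        # bisection: label(x0 + lo*d) == z0 and label(x0 + hi*d) != z0
        while abs(hi - lo) > precision:
            mid = 0.5 * (lo + hi)
            if oracle(x0 + mid * direction) == z0:
                lo = mid
            else:
                hi = mid
        return x0 + 0.5 * (lo + hi) * direction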
6.2 Filtering Duplicate Affine Transformations
For a -deep neural network consisting of neurons, the idealized extraction attack exploits special model activation patterns.
To ensure that the required model activation patterns occur with a probability as high as possible, in Step 1 introduced in Section 4, we collect decision boundary points where , e.g., and is a small factor. As a result, there are many collected decision boundary points with duplicate model activation patterns. Therefore, in Step 2, after recovering the parameter tuple (i.e., the affine transformation) corresponding to each decision boundary point, we need to filter the decision boundary points with duplicate affine transformations, since their model activation patterns should be the same. When filtering duplicate affine transformations, we consider two possible cases.
Filtering Correctly Recovered Affine Transformations.
In the first case, assume that two affine transformations are both correctly recovered.
However, recovering affine transformations (i.e., the 0-deep neural network extraction attack) relies on finding decision boundary points, which introduces minor errors. This is equivalent to adding noise to the recovered affine transformations, i.e., the tuples $(\Gamma, \gamma)$. To check whether two noisy affine transformations are the same, we adopt the checking rule below.
Comparing two vectors.
Consider two vectors with the same dimension, e.g., . Set a small threshold . If the following inequations hold simultaneously
(31) |
where and are, respectively, the -th element of and , the two vectors are considered to be the same.
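A sketch of this comparison rule, assuming Eq. (31) is an element-wise closeness test with a small threshold (the exact form in the paper may differ):

    import numpy as np

    def same_transformation(u, v, tau=1e-4):
        """Treat two recovered (noisy) affine-transformation vectors as equal
        when every coordinate differs by less than the threshold tau."""
        u, v = np.ravel(u), np.ravel(v)
        return u.shape == v.shape and bool(np.all(np.abs(u - v) < tau))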
Filtering Wrongly Recovered Affine Transformations.
In the second case, assume that one affine transformation is correctly recovered and another one is partially recovered.
For the extraction attack on $k$-deep neural networks, when recovering the affine transformation corresponding to an input by the 0-deep neural network extraction attack (see Section 5.1), the process of binary search should not change the model activation pattern; otherwise, the affine transformation may be wrongly recovered. Recall that, in the 0-deep neural network extraction attack, the elements of $\Gamma$ are recovered one by one independently. Thus, the wrong recovery of one element of $\Gamma$ does not influence the recovery of the other elements.
As a result, we have to consider the case in which one transformation is only partially recovered. In this case, the filtering method is as follows. Consider two vectors. If the closeness condition holds for at least a threshold fraction of their coordinates, the two vectors are considered to be the same. We then regard the one with the higher occurrence frequency among the collected decision boundary points as the correctly recovered affine transformation.
6.3 Filtering Functionally Inequivalent Extracted Models
Consider -deep neural networks consisting of neurons. As introduced in Section 4, each time we randomly choose out of collected decision boundary points to generate an extracted model. Moreover, according to Section 5.2, in the extraction attack, we need to guess signs, i.e., the sign of .
When the model activation patterns of the selected decision boundary points are not those required in the extraction attack, or at least one of the sign guesses is wrong, the resulting extracted model is not a functionally equivalent model of the victim model . Thus, we will get many functionally inequivalent extracted models.
Besides, due to the minor errors introduced by the finite precision used in finding decision boundary points, the parameters of the extracted model may be slightly different from the theoretical values (see Eq. (19), Eq. (27), and Eq. (28)). This subsection introduces three methods to filter functionally inequivalent extracted models, one of which considers the negative influence of finite precision together.
Filtering by the Normalized Model Signature.
Before introducing the filtering method, we discuss how many possible model activation patterns there are at most for a -deep neural network. Lemma 2 answers this question.
Lemma 2
For a -deep neural network consisting of neurons, the upper bound of the number of possible model activation patterns is
(32) |
where is the number of neurons in layer .
Proof
If all the neurons in layer are inactive, i.e., the outputs of these neurons are , then the neuron states of all the neurons in the last layers are deterministic. In this case, the number of possible model activation patterns is decided by the first layers, i.e., the maximum is . If there is at least one active neuron in each layer, then there are at most possible model activation patterns.
After all the weights and biases are obtained, we assume that all the model activation patterns are possible, and compute the resulting normalized model signature . Denote by the normalized model signature recovered in Step 2 (see Section 4). If is not a subset of , we regard as a functionally inequivalent model.
Due to the minor errors caused by finite precision, i.e., the slight difference between the extracted parameters and the theoretical values (see Eq. (19), Eq. (27), and Eq. (28)), when checking whether two tuples are equal, we adopt the checking rule presented in Section 6.2 (see Eq. (31)).
Besides, the filtering method in Section 6.2 does not ensure that all the wrongly recovered affine transformations are filtered. To avoid the functionally equivalent model being filtered, we adopt a flexible method.
Recall that, in Step 1, we collect a sufficient number of decision boundary points. For each tuple, denote by its frequency the number of times the tuple occurs among the collected decision boundary points. When at least a threshold number of the extracted model's tuples appear in the recovered set, the extracted model is regarded as a candidate functionally equivalent model. We call the corresponding ratio the adaptive threshold.
Filtering by Weight Signs.
After the extracted weights are obtained, we compute the corresponding matrices and check whether the signs are consistent with the guesses. If at least one sign is inconsistent with its guess, the extracted model is not the functionally equivalent model.
Interestingly, besides handling wrong sign guesses, this method also shows high filtering effectiveness when the model activation patterns of the selected decision boundary points are not those required by our extraction attacks. This is not surprising, since our extraction attack is designed for a specific set of model activation patterns; for wrong model activation patterns, whether a recomputed sign matches the guess is essentially a random event.
Filtering by Prediction Matching Ratio.
The above two filtering methods are effective, but we find that some functionally inequivalent models still escape them. Therefore, the third method performs a final filtering on the extracted models surviving the above two filtering methods. This method is based on the prediction matching ratio.
Prediction Matching Ratio.
Randomly generate inputs, query the extracted model and the victim model . Suppose that the two models return the same hard-label for out of inputs. The ratio is called the prediction matching ratio.
According to Definition 1 and Definition 2, for a functionally equivalent model, the prediction matching ratio should be high, or even close to $1$. Note that many random inputs and the corresponding hard-labels are collected during the attack process (see Steps 1 and 2 in Section 4); thus, we can reuse these inputs.
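Estimating the prediction matching ratio is straightforward; a sketch with our own names and a Gaussian input distribution as a placeholder for the paper's random inputs:

    import numpy as np

    def prediction_matching_ratio(victim_oracle, extracted_oracle, d0,
                                  n_samples=10000, seed=0):
        """Fraction of random inputs on which the two models return the same hard-label."""
        rng = np.random.default_rng(seed)
        matches = 0
        for _ in range(n_samples):
            x = rng.standard_normal(d0)
            matches += int(victim_oracle(x) == extracted_oracle(x))
        return matches / n_samples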
7 Experiments
Our model extraction attacks are evaluated on both untrained and trained neural networks. Concretely, we first perform experiments on untrained neural networks with diverse architectures and randomly generated parameters. Then, based on two typical benchmarking image datasets (i.e., MNIST, CIFAR10) in visual deep learning, we train a series of neural networks as classifiers and evaluate the model extraction attacks on these trained neural networks.
For convenience, denote by '$d_0$-$d_1$-$\cdots$-$d_{k+1}$' the victim model, where $d_i$ is the dimension of each layer. For example, the symbol 1000-1 stands for a 0-deep neural network with an input dimension of 1000 and an output dimension of 1.
Partial Universal Experiment Settings.
Some settings are used in all the following experiments. For -deep neural network extraction attacks, in Step 1, we randomly generate pairs of starting point and moving direction, where is the number of neurons. The prediction matching ratio is estimated over random inputs.
7.1 Computing -Functional Equivalence
To quantify the degree to which a model extraction attack has succeeded, the method (i.e., error bounds propagation [4]) proposed by Carlini et al. is adopted to compute $(\varepsilon, \delta)$-functional equivalence.
Error bounds propagation.
To compute -functional equivalence of the extracted neural network , one just needs to compare the extracted parameters (weights and biases ) to the real parameters (weights and biases ) and analytically derive an upper bound on the error when performing inference [4].
Before comparing the neural network parameters, one must 'align' them [4]. This involves two operations: (1) adjusting the order of the neurons in the network, i.e., the order of the rows or columns of the weight matrices, and (2) adjusting the values of the extracted weights and biases to the theoretical ones (see Eq. (19), Eq. (27), and Eq. (28)) obtained by the idealized model extraction attacks. This gives aligned parameters from which one can analytically derive upper bounds on the error. Other details (e.g., propagating error bounds layer by layer) are the same as those introduced in [4] and are not repeated in this paper.
7.2 Experiments on Untrained Neural Networks
Table 1 summarizes the experimental results on different untrained neural networks, which demonstrate the effectiveness of our model extraction attacks.
Architecture | Parameters | PMR | Queries | max | ||
512-2-1 | 1029 | |||||
2048-4-1 | 8201 | |||||
25120-4-1 | 100489 | |||||
50240-2-1 | 100485 | |||||
32-2-2-1 | 75 | |||||
512-2-2-1 | 1035 | |||||
1024-2-2-1 | 2059 | |||||
: the precision used to find decision boundary points.
max: the maximum extraction error of model parameters.
PMR: prediction matching ratio.
According to Appendix 0.B, the computation complexity of our model extraction attack is about , where is the number of neurons. Thus, we limit the number of neurons, which does not influence the verification of our model extraction attack. Note that the number of parameters is not limited. All the attacks can be finished within several hours on a single core.
The results in Table 1 also support our argument in Remark 2. For the -deep neural networks (e.g., 32-2-2-1), when recovering the weights in layer , we require that only one neuron in layer is active, instead of all the neurons being active. Our extraction attacks also achieve good performance.
The influence of the precision .
A smaller will make the returned point (see Section 6.1) closer to the decision boundary, which helps reduce the extraction error of affine transformations. As a result, the model extraction attack is expected to perform better. For example, for the 1-deep neural network 2048-4-1, when decreases from to , the value (respectively, max) decreases from to (respectively, from to ), which is a significant improvement.
At the same time, using a smaller precision does not increase the attack complexity significantly. According to Appendix 0.B, the query complexity is about . Thus, decreasing has little influence on the query complexity. Look at the neural network 2048-4-1 again. When decreases from to , the number of queries only increases from to . Besides, when (i.e., the number of neurons) is large, almost does not influence the computation complexity, since only influences Steps 1 and 2 (see Section 4), while the computation complexity is mainly determined by other steps (refer to Appendix 0.B). When is small, the practical runtime is determined by the query complexity, then decreasing also has little influence on the runtime.
Choosing an appropriate is simple. In our experiments, we find that a smaller should be used, when the prediction matching ratio estimated over random inputs is not , and the gap (e.g., , see the third or fifth row) is not negligible.
7.3 Experiments on Trained Neural Networks
The MNIST and CIFAR10 Dataset.
MNIST (respectively, CIFAR10) is a typical benchmarking dataset used in visual deep learning. It contains ten classes of handwritten digit grayscale images [12] (resp., real-object images in realistic environments [19]). The ten classes, i.e., '0', '1', '2', '3', '4', '5', '6', '7', '8', and '9' (resp., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), consist of 28×28-pixel grayscale images (resp., 32×32-pixel RGB images), totaling 60,000 (resp., 50,000) training and 10,000 (resp., 10,000) testing images.
Neural Network Training Pipelines.
The decision boundary of a trained neural network differs depending on which classes of objects it separates. To fully verify our model extraction attack, for MNIST (respectively, CIFAR10), we divide the ten classes into five groups and build a binary classification neural network for each group. All the neural networks share the same architecture $d_0$-2-1, where $d_0 = 784$ for MNIST (respectively, $d_0 = 3072$ for CIFAR10). On the MNIST and CIFAR10 datasets, we perform a standard rescaling of the pixel values from $[0, 255]$ to $[0, 1]$. For the model training, we choose typical settings (cross-entropy loss, standard stochastic gradient descent, and a typical batch size). The first four columns of Table 2 summarize a detailed description of the neural networks attacked in this section.
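A minimal PyTorch sketch of such a training pipeline (architecture $d_0$-2-1, binary cross-entropy on the single logit, plain SGD); the hyperparameter values and helper names are assumptions, not the exact settings used in the paper.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    def train_victim(train_x, train_y, d0, epochs=20, lr=0.1, batch_size=64):
        """train_x: (N, d0) float tensor scaled to [0, 1]; train_y: (N,) 0/1 float tensor."""
        model = nn.Sequential(nn.Linear(d0, 2), nn.ReLU(), nn.Linear(2, 1))
        loss_fn = nn.BCEWithLogitsLoss()          # binary cross-entropy on the raw logit
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loader = DataLoader(TensorDataset(train_x, train_y),
                            batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            for xb, yb in loader:
                opt.zero_grad()
                loss = loss_fn(model(xb).squeeze(1), yb)
                loss.backward()
                opt.step()
        return model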
Experiment Results.
The last four columns of Table 2 summarize the experiment results. Our extraction attack still achieves good performance when an appropriate precision is used, which further verifies its effectiveness.
task | architecture | accuracy | parameters | Queries | max | ||
‘0’ vs ‘1’ | 784-2-1 | 0.9035 | 1573 | ||||
‘2’ vs ‘3’ | 784-2-1 | 0.8497 | 1573 | ||||
‘4’ vs ‘5’ | 784-2-1 | 0.8570 | 1573 | ||||
‘6’ vs ‘7’ | 784-2-1 | 0.9290 | 1573 | ||||
‘8’ vs ‘9’ | 784-2-1 | 0.9501 | 1573 | ||||
airplane vs automobile | 3072-2-1 | 0.8120 | 6149 | | | |
bird vs cat | 3072-2-1 | 0.6890 | 6149 | ||||
deer vs dog | 3072-2-1 | 0.6870 | 6149 | ||||
frog vs horse | 3072-2-1 | 0.8405 | 6149 | ||||
ship vs truck | 3072-2-1 | 0.7995 | 6149 | ||||
max: the maximum extraction error of model parameters.
accuracy: classification accuracy of the victim model.
For saving space, prediction matching ratios are not listed.
The experimental results presented in Table 1 and Table 2 show that the attack performance (i.e., the value of and ) is related to the precision and the properties of the decision boundary. However, we do not find a clear quantitative relationship between the attack performance and the precision (or some unknown properties of the decision boundary). Considering that the unknown quantitative relationships do not influence the verification of the model extraction attack, we leave the problem of exploring the unknown relationships as a future work.
8 Conclusion
In this paper, we have studied model extraction attacks against neural network models under the hard-label setting, i.e., where the adversary only has access to the most likely class label corresponding to the raw output of the neural network. We propose new model extraction attacks that theoretically achieve functionally equivalent extraction. Practical experiments on numerous neural network models have verified the effectiveness of the proposed model extraction attacks. To the best of our knowledge, this is the first work to show with practical experiments that functionally equivalent extraction against neural network models is possible under the hard-label setting.
The future work will mainly focus on the following aspects:
• The (computation and query) complexity of our model extraction attack remains high, which limits its application to neural networks with a large number of neurons. Reducing the complexity is an important problem.
• In this paper, to recover the weight vector of the $t$-th neuron in layer $i$, we require that, in layer $i$, only the $t$-th neuron is active. However, such a model activation pattern may not occur in some cases. How to recover the weight vector of such a neuron based on other model activation patterns would be a vital step towards better generality.
• Explore possible quantitative relationships between the precision (or some unknown properties of the decision boundary) and the attack performance.
• Extend the extraction attack to the case of vector outputs, i.e., when the output dimensionality exceeds $1$.
• Develop extraction attacks against other kinds of neural network models.
8.0.1 Acknowledgments.
We would like to thank Adi Shamir for his guidance. We would like to thank the anonymous reviewers for their detailed and helpful comments. This work was supported by the National Key R&D Program of China (2018YFA0704701, 2020YFA0309705), Shandong Key Research and Development Program (2020ZLYS09), the Major Scientific and Technological Innovation Project of Shandong, China (2019JZZY010133), the Major Program of Guangdong Basic and Applied Research (2019B030302008), the Tsinghua University Dushi Program, and the Ministry of Education in Singapore under Grant RG93/23. Y. Chen was also supported by the Shuimu Tsinghua Scholar Program.
Appendix 0.A Proof of Lemma 1
We prove Lemma 1 by Mathematical Induction.
Proof
Besides, we have
(34) | ||||
where , and .
Look at the system of linear equations presented in Eq. (26). Now, the system of linear equations is transformed into
(35) |
when , by solving the system, it is expected to obtain
(36) |
which is consistent with the expected value in Eq. (33).
Next, consider the recovery of the weight vector of the -th neuron in layer , and assume that the weights as shown in Lemma 1 have been obtained. As a result, we have
(37) | ||||
Appendix 0.B Complexity of Hard-Label Model Extraction Attacks
For the $k$-deep neural network extraction attack, the complexity is composed of two parts: the Oracle query complexity and the computation complexity. Suppose that the number of neurons is $n$, the input size (i.e., the size of $x$) is $d_0$, and $k$ is the number of hidden layers. The precision adopted by the binary search is the one defined in Section 6.1.
Oracle Query Complexity.
For the -deep neural network extraction attack, we only query the Oracle in Steps 1 and 2 (see Section 4).
In Step 1, if decision boundary points are collected, then the number of queries to the Oracle is determined by the number of points, a factor determined by the precision, and a small factor defined by the attacker. In Step 2, for each decision boundary point collected in Step 1, to recover the corresponding affine transformation (i.e., $\Gamma$ and $\gamma$), we need to collect further decision boundary points. Therefore, the number of queries to the Oracle in this step dominates. Based on the above analysis, and noting that the per-point cost is proportional to the input dimension, we obtain the Oracle query complexity of our $k$-deep neural network extraction attack stated above.
Computation Complexity.
For the -deep neural network extraction attack, when is large, most computations are occupied by recovering neural network parameters, i.e., Steps 3 and 4 (see Section 4). Suppose that there are decision boundary points used to recover neural network parameters after filtering duplicate affine transformations in Step 2.
In Step 3, to recover the weight vector of the -th neuron in layer where , we need to solve a system of linear equations. For convenience, let us ignore the difference in the sizes of the different systems of linear equations. Then, to recover all the weights , a total of systems of linear equations need to be solved. In Step 4, to recover all the biases , only one system of linear equations needs to be solved. Therefore, to obtain an extracted model, we need to solve systems of linear equations.
There are two loops in the extraction attack. First, we need to select out of decision boundary points each time. More concretely, to recover the weights in layer , we choose decision boundary points. Then the number (denoted by ) of possible cases is
Second, we need to guess signs when recovering all the weights, i.e., there are cases.
Thus, the computation complexity is about . When an appropriate precision (i.e., is small) is adopted, we have , where is the number of possible model activation patterns (refer to Lemma 2). Then, we further have
(39) |
Thus, the computation complexity is about .
Appendix 0.C Extraction on -Deep Neural Networks
The parameters of the extracted -deep neural network are as follows.
(40) | ||||
Fig. 4 shows a diagram of a victim model (2-2-1) and the extracted model.

References
- [1] Batina, L., Bhasin, S., Jap, D., Picek, S.: CSI NN: reverse engineering of neural network architectures through electromagnetic side channel. In: Heninger, N., Traynor, P. (eds.) USENIX Security 2019. pp. 515–532. USENIX Association
- [2] Blum, A., Rivest, R.L.: Training a 3-node neural network is np-complete. In: Hanson, S.J., Remmele, W., Rivest, R.L. (eds.) Machine Learning: From Theory to Applications - Cooperative Research at Siemens and MIT. LNCS, vol. 661, pp. 9–28. Springer (1993)
- [3] Canales-Martínez, I., Chávez-Saab, J., Hambitzer, A., Rodríguez-Henríquez, F., Satpute, N., Shamir, A.: Polynomial time cryptanalytic extraction of neural network models. IACR Cryptol. ePrint Arch. p. 1526 (2023)
- [4] Carlini, N., Jagielski, M., Mironov, I.: Cryptanalytic extraction of neural network models. In: Micciancio, D., Ristenpart, T. (eds.) CRYPTO 2020. LNCS, vol. 12172, pp. 189–218. Springer
- [5] Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. In: SP 2017. pp. 39–57. IEEE Computer Society
- [6] Dubey, S.R., Singh, S.K., Chaudhuri, B.B.: Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 503, 92–108 (2022)
- [7] Fefferman, C.: Reconstructing a neural net from its output. Revista Matematica Iberoamericana 10, 507–555 (1994)
- [8] Galstyan, A., Cohen, P.R.: Empirical comparison of ”hard” and ”soft” label propagation for relational classification. In: Blockeel, H., Ramon, J., Shavlik, J.W., Tadepalli, P. (eds.) ILP 2007. LNCS, vol. 4894, pp. 98–111. Springer
- [9] Ganju, K., Wang, Q., Yang, W., Gunter, C.A., Borisov, N.: Property inference attacks on fully connected neural networks using permutation invariant representations. In: Lie, D., Mannan, M., Backes, M., Wang, X. (eds.) CCS 2018. pp. 619–633. ACM
- [10] Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural network. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) NeurIPS 2015. pp. 1135–1143
- [11] Jagielski, M., Carlini, N., Berthelot, D., Kurakin, A., Papernot, N.: High accuracy and high fidelity extraction of neural networks. In: Capkun, S., Roesner, F. (eds.) USENIX Security 2020. pp. 1345–1362. USENIX Association
- [12] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
- [13] Long, C., Collins, R., Swears, E., Hoogs, A.: Deep neural networks in fully connected CRF for image labeling with social network metadata. In: WACV 2019. pp. 1607–1615. IEEE
- [14] Lowd, D., Meek, C.: Adversarial learning. In: Grossman, R., Bayardo, R.J., Bennett, K.P. (eds.) SIGKDD 2005. pp. 641–647. ACM
- [15] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Fürnkranz, J., Joachims, T. (eds.) ICML, 2010. pp. 807–814. Omnipress
- [16] Oliynyk, D., Mayer, R., Rauber, A.: I know what you trained last summer: A survey on stealing machine learning models and defences. ACM Comput. Surv. 55(14s), 324:1–324:41 (2023)
- [17] Perazzi, F., Wang, O., Gross, M.H., Sorkine-Hornung, A.: Fully connected object proposals for video segmentation. In: ICCV 2015. pp. 3227–3234. IEEE Computer Society
- [18] Rolnick, D., Kording, K.P.: Reverse-engineering deep relu networks. In: ICML 2020. Proceedings of Machine Learning Research, vol. 119, pp. 8178–8187. PMLR
- [19] Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)
- [20] Tramèr, F., Zhang, F., Juels, A., Reiter, M.K., Ristenpart, T.: Stealing machine learning models via prediction apis. In: Holz, T., Savage, S. (eds.) USENIX Security 2016. pp. 601–618. USENIX Association