Breaking the Black-Box: Confidence-Guided Model Inversion Attack for Distribution Shift
Abstract
Model inversion attacks (MIAs) seek to infer the private training data of a target classifier by querying the model and generating synthetic images that reflect the characteristics of the target class. However, prior studies have relied on full access to the target model, which is impractical in real-world scenarios. Additionally, existing black-box MIAs assume that the image prior and the target model follow the same data distribution, and they may therefore perform suboptimally when the two distributions differ. To address these limitations, this paper proposes a Confidence-Guided Model Inversion attack method called CG-MI, which uses the latent space of a pre-trained, publicly available generative adversarial network (GAN) as prior information together with a gradient-free optimizer, enabling high-resolution MIAs across different data distributions in a black-box setting. Our experiments demonstrate that our method significantly outperforms the SOTA black-box MIA by more than 49% on CelebA and 58% on FaceScrub under different distribution settings. Furthermore, our method generates high-quality images comparable to those produced by white-box attacks. Our method provides a practical and effective solution for black-box model inversion attacks.
1 Introduction
Privacy protection and attacks have been extensively studied, attracting significant attention within the scientific community[24, 38, 31, 30]. Model inversion attacks (MIAs) represent a class of attacks aimed at compromising the privacy protection of models[33]. MIAs target the retrieval of sensitive information about the model’s training data by leveraging known model outputs, thus putting user privacy at risk. For instance, an attacker may query the output of a facial recognition model and, upon successful exploitation, generate synthetic images that reflect the user’s facial features, thereby violating user privacy.

Model inversion attacks can currently be categorized into three types based on the level of access the attacker has to the target model: white-box attacks[41, 40, 34, 4, 27, 36], black-box attacks[12, 2, 29, 39, 1], and label-only attacks[18]. In the white-box setting, the attacker has complete access to the target model, including its weights and output confidence scores. In the black-box setting, the attacker has limited access and can only use the confidence scores returned by the target model, without any internal knowledge. In the label-only setting, the attacker can only use the output labels provided by the model.
Currently, significant research has focused on white-box MIAs [41, 4]. These attack methods require complete access to the target model. Moreover, they assume that the private training data of the target model and the public data used to train the generative model follow the same distribution, which is not practical in real-world scenarios. To address this challenge, the Plug & Play Attack (PPA) [34] proposed a target-independent white-box MIA that works under different data distribution settings, but it still relies on complete access to the target model. In the black-box domain, the Reinforcement Learning-Based Model Inversion attack (RLB-MI) [12] focused on black-box MIAs within the same data distribution using reinforcement learning techniques. However, its attack performance across different data distributions is not satisfactory, and the synthesized images have low resolution. Therefore, a crucial challenge for black-box MIAs is how to generate high-resolution and effective synthetic images across different data distributions, relying solely on the confidence scores provided by the target model.
The challenges in this task can be summarized as follows. First, generating high-resolution synthetic images in a high-dimensional latent space poses optimization difficulties. Second, the absence of gradient information in black-box model inversion attacks may lead optimization algorithms to explore the GAN's latent space excessively, resulting in images without meaningful features and causing the attack to fail. These challenges hinder the direct application of the existing white-box attack method [34] to black-box scenarios, or limit attack effectiveness across different data distributions [12]. The adversary's attack process in a black-box scenario is illustrated in Figure 1.
To address the limitations described above, we propose CG-MI, a novel approach that achieves MIAs in black-box settings with different data distributions. The main idea is to leverage a pre-trained, target-independent generative adversarial network (GAN)[11, 21] as the image prior and then employ gradient-free optimization to minimize a confidence loss that measures the match between the GAN image manifold and the target model. To overcome dimensionality issues and avoid generating meaningless images during optimization, we propose a novel objective function whose core idea is to incorporate the mapping network of StyleGAN2[21] into the gradient-free optimization process, thereby keeping the solution of the optimization problem within a meaningful exploration space. Extensive experiments demonstrate that our method significantly outperforms existing black-box MIAs and generates high-quality images comparable to those of white-box attacks. Our main contributions are as follows:
• We present a novel approach to black-box MIAs based on a gradient-free optimizer. Our method enables MIAs in black-box scenarios, accommodates various data distributions, and generates high-resolution synthetic images.

• We propose the concept of synthetic image transferability in model inversion, analyze its impact on MIAs, and address this issue by designing a novel objective optimization function.

• We evaluate the proposed CG-MI on different datasets and models. Compared to state-of-the-art black-box attack methods, our approach significantly improves attack performance. Furthermore, visual comparisons indicate synthesized image quality comparable to white-box approaches.
Our work shows that, even in these more challenging scenarios, MIAs can still lead to the leakage of private information from DNNs.

2 Related Work
MIAs can be viewed as an optimization problem, where the objective is to maximize the confidence scores of a given target class in order to generate images that reveal sensitive data features. MIAs were first introduced by [9] for attacking linear regression models. Subsequently, [10] proposed a gradient descent-based algorithm to attack shallow networks. In the following sections, we will introduce recent attack methods based on the types of MIAs.
White-Box MIAs. White-box MIAs leverage full access to the target model and utilize gradient-based optimization techniques. [41] was the first to propose a generative attack method, enabling MIAs against deeper networks. They trained a DCGAN [28] on publicly available data that had no overlap with the private training data and conducted the attack by optimizing the latent input vector of the GAN against the target model trained on the private data. Subsequent work, such as [4], incorporated soft labels of the target model to guide GAN training and let the generator learn the latent distribution, yielding GANs tailored for MIAs. Furthermore, [40] introduced a conditional GAN [37] as the image prior for MIAs, addressing the issue that the generator in [4] does not fully utilize the target model's knowledge. Additionally, [27] proposed directly maximizing confidence scores instead of minimizing negative log-likelihood scores, aiming to improve the attack performance of [41] and [4]. [36] introduced a variational MIA using StyleGAN2 [22], capable of generating high-resolution images that reflect the target model's private training data. Moreover, [34] introduced a dataset-agnostic MIA approach utilizing pre-trained StyleGAN2 [21] models, targeting the vulnerability of prior works to dataset distribution shifts.
Black-Box MIAs. Black-box MIAs require access only to the confidence scores of the target model. [29] proposed a method that simultaneously trains a GAN and a surrogate model: the GAN is used to generate inputs similar to the private training data, while the surrogate model imitates the behavior of the target model for inversion attacks. [2] attempted to recover faces from deep feature vectors of a face recognition model in a black-box setting without prior knowledge. Another attack model was introduced by [39], which performs MIAs by learning an inverse mapping from the target model's prediction vectors back to inputs. Furthermore, [1] proposed a black-box MIA based on StyleGAN, using a classical genetic algorithm for optimization. Recently, [12] presented a reinforcement learning-based approach for MIAs, where the confidence scores of the target model's outputs serve as rewards.
Label-Only MIAs. Label-only MIAs focus on querying the model to obtain hard labels without confidence scores. [18] introduces an algorithm called Boundary Repulsion Model Inversion (Brep-MI). The core idea of this algorithm is to evaluate the model’s predicted labels on a spherical surface and then estimate the direction towards the center of the target class to generate the most representative image.
3 Threat Model
Attack goal. When the target model is a facial recognition classifier, the aim of MIAs is to exploit the attacker's access to generate facial images that reflect the features of a specific target class $c \in C$, where $C$ denotes the set of all classes.
Model Knowledge. In white-box MIAs, the attacker can download the model and exploit its weights and confidence information to launch the attack. In contrast, black-box MIAs restrict the attacker to using only the confidence scores provided by the target model. In label-only MIAs, the attacker is limited to the model's output labels. Our research focuses on black-box MIAs.
Data Knowledge. The majority of existing white-box MIAs[4, 41, 40] and black-box MIAs [12] assume that the attacker launches attacks on data from the same distribution, i.e., $p_{prior} = p_{target}$, where $p_{prior}$ represents the distribution of the image prior and $p_{target}$ represents the distribution of the target model's training data. In our work, we relax this assumption: the attacker only knows the target model's classification task, such as facial recognition, under the setting where $p_{prior} \neq p_{target}$.
4 Methodology
4.1 Background
Problem Formulation. We define the target classification model as $M_{target}$, with $x$ representing the input image to the target model and $c$ denoting the target class for the attack. To obtain a synthesized image $\hat{x}$ that reflects the private features of the target class $c$, we optimize the following loss function:

$$\hat{x} = \arg\min_{x} \mathcal{L}\big(M_{target}(x),\, c\big) \tag{1}$$

Here, $\mathcal{L}$ can be the cross-entropy loss or another suitable loss function. The purpose of this loss is to directly optimize the image in order to leak the private training data of the model. Since directly optimizing the high-dimensional image $x$ is inefficient, the following section will delve into generative MIAs.
Generative Model Inversion Attacks. The idea of training a generative model as an image prior and optimizing the latent vector $z$ of the GAN for image synthesis was first introduced by [41]. This approach addresses the problem discussed in [10], where directly optimizing $x$ in the high-dimensional, nonlinear, and non-convex solution space when attacking deep neural networks can lead to meaningless results. Their method involves training a DCGAN [28] on publicly available data that does not overlap with the private training data, and then optimizing the latent input vector of the GAN to attack a target model trained on private data. By introducing the image prior $G$, the optimization problem in Equation 1 can be expressed as:

$$\hat{z} = \arg\min_{z} \mathcal{L}\big(M_{target}(G(z)),\, c\big) \tag{2}$$

After optimizing Equation 2 to obtain $\hat{z}$, it is fed into the GAN generator [28] to produce the synthesized image $\hat{x} = G(\hat{z})$. This approach helps mitigate the issue of generating meaningless images to a certain extent.
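As a concrete illustration of this formulation, a minimal white-box sketch of Equation 2 is given below; it is not the exact implementation of [41], and `G` and `target_model` are placeholder PyTorch modules for the image prior and the target classifier.

```python
import torch
import torch.nn.functional as F

def whitebox_generative_mi(G, target_model, target_class, latent_dim=100,
                           steps=1000, lr=0.02, device="cuda"):
    """Sketch of the classic generative MIA (Eq. 2): optimize the GAN latent
    vector z by gradient descent so that G(z) is classified as target_class."""
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    target = torch.tensor([target_class], device=device)
    for _ in range(steps):
        optimizer.zero_grad()
        x = G(z)                      # synthesized image
        logits = target_model(x)      # forward pass through the target model
        loss = F.cross_entropy(logits, target)
        loss.backward()               # requires target-model gradients (white-box only)
        optimizer.step()
    return G(z).detach()              # final synthesized image x_hat
```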
4.2 Breaking the Black-Box
In this section, we will present our approach, Confidence-Guided Model Inversion (CG-MI), for attacking different data distribution models in a black-box scenario. An overview of CG-MI is illustrated in Figure 2.
Pre-Trained Publicly Available Image Prior. In the architecture of a GAN[28, 3, 21], the generator learns to map a latent vector $z$, sampled from a simple distribution (e.g., a Gaussian or uniform distribution) $p(z)$, to a generated image $G(z)$. StyleGAN2[21], however, consists of two main components: a mapping network $G_{mapping}$ and a synthesis network $G_{synthesis}$. In StyleGAN2, the latent vector $z$ is first transformed into a style vector $w$ by the non-linear mapping network $G_{mapping}$, implemented as an 8-layer multi-layer perceptron (MLP). The style vector is then transformed into a synthesized image: the mapping network $G_{mapping}: \mathcal{Z} \to \mathcal{W}$ maps $z \in \mathcal{Z}$ to $w \in \mathcal{W}$, and the synthesis network $G_{synthesis}: \mathcal{W} \to \mathcal{X}$ generates the corresponding image from the input style vector $w$. Previous work on PPA[34] has demonstrated the tremendous potential of StyleGAN2 across diverse data distributions. In our paper, while ensuring the independence between the generator and the target model, we leverage a pre-trained, publicly available StyleGAN2 model as our image prior to perform attacks across different data distributions.
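The two-stage generation of StyleGAN2 can be sketched as follows; `G_mapping` and `G_synthesis` stand for the loaded mapping and synthesis sub-networks (the exact loading interface depends on the StyleGAN2 release used and is assumed here).

```python
import torch

@torch.no_grad()
def stylegan2_generate(G_mapping, G_synthesis, batch_size=1, z_dim=512, device="cuda"):
    """Sketch of StyleGAN2's two-stage generation: z -> w -> image."""
    z = torch.randn(batch_size, z_dim, device=device)  # latent code from a Gaussian prior
    w = G_mapping(z)                                    # style vector(s) in the W space
    img = G_synthesis(w)                                # synthesized image
    return img
```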
Synthesis Image Transferability in MIAs. Consider two well-trained face recognition models, denoted $M_1$ and $M_2$, trained on the same dataset $D$. Let $\hat{x}$ be a synthetic image generated through an attack on $M_1$, with classification result $M_1(\hat{x}) = c$. In the ideal scenario of transferability, the generated image satisfies both $M_1(\hat{x}) = c$ and $M_2(\hat{x}) = c$; that is, both $M_1$ and $M_2$ recognize it and classify it into the target class $c$. Conversely, if $\hat{x}$ lacks transferability, it may result in $M_2(\hat{x}) \neq c$, leading to attack failure.

Enhancing Synthesized Image Transferability with Meaningful Exploration. In previous works, such as Brep-MI [18] and RLB-MI [12], the chosen image prior model was DCGAN[28]. However, the overall quality of the synthetic images generated by DCGAN is unsatisfactory, as illustrated in Figure 3, which undermines the transferability of the attack. In the white-box PPA[34], StyleGAN2 [21] was selected as a replacement for DCGAN. They achieved success by optimizing batches of style vectors $w$, leveraging the target model's weights and a gradient descent algorithm. The objective function of their method is shown in Equation 3.
$$\hat{w} = \arg\min_{w} \mathcal{L}\big(M_{target}(G_{synthesis}(w)),\, c\big) \tag{3}$$

where $G_{synthesis}$ is the synthesis network of StyleGAN2, $c$ is the target class for the attack, $M_{target}$ denotes the target model, and $w$ is the style vector of the generative model. However, in a black-box scenario, directly optimizing the objective above with gradient-free optimization algorithms runs into several model inversion issues, leading to the generation of images that lack meaningful features. The main issue is that gradient-free optimization algorithms focus solely on minimizing the loss function without preserving the underlying structure of the GAN latent space. Without constraints, they can drift away from the natural image space of the GAN, producing synthesized images that receive high confidence scores from the target model while the evaluation model assigns them low confidence scores. Moreover, the high dimensionality of $w$ hampers the efficiency of gradient-free optimization algorithms.
To address the issue of generating images that lack meaningful features and to reduce the dimensionality of the optimization variables, we focus on the latent vectors $z$ before they enter the mapping network. Regardless of how $z$ is changed during the optimization process, the mapping network consistently maps $z$ into a meaningful latent space, thereby avoiding the generation of meaningless images. Compared with the high dimensionality of $w$, $z$ has a fixed dimensionality of 512. Specifically, with a pre-trained StyleGAN2, we aim to solve the following optimization problem:
$$\hat{z} = \arg\min_{z} \mathcal{L}_{conf}\Big(M_{target}\big(G_{synthesis}(G_{mapping}(z))\big),\, c\Big) \tag{4}$$

where $\mathcal{L}_{conf}$ is the confidence matching loss and $G_{mapping}$ is the mapping network of StyleGAN2. By incorporating the mapping network into the optimization process and utilizing a gradient-free optimization algorithm, we transition the black-box MIA problem from an unrestricted latent space to a space characterized by meaningful facial features.
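A minimal sketch of the resulting black-box objective (Equation 4) is shown below: a candidate latent vector proposed by the optimizer is pushed through the mapping and synthesis networks, the resulting image is submitted to the target model as a query, and only the returned confidence scores are used to compute the confidence matching loss. The function and module names are placeholders for illustration.

```python
import torch

def cgmi_objective(z_numpy, G_mapping, G_synthesis, query_target_model,
                   target_class, loss_fn, device="cuda"):
    """Black-box objective of Eq. 4: confidence matching loss of the image
    generated from a candidate latent vector z (no gradients are required)."""
    with torch.no_grad():
        z = torch.from_numpy(z_numpy).float().unsqueeze(0).to(device)
        w = G_mapping(z)                      # keep the search inside a meaningful W region
        x = G_synthesis(w)                    # candidate synthetic image
        scores = query_target_model(x)        # black-box query: confidence scores only
        loss = loss_fn(scores, target_class)  # e.g., Poincaré / max-margin / cross-entropy
    return float(loss)
```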
Confidence Matching Loss. The confidence matching loss encourages the solver to find images in the image prior's latent space that reflect the characteristics of the private training data by minimizing the loss between the target model's output confidence and the label $c$. We explore the following confidence matching losses: (1) cross-entropy loss[4]; (2) max-margin loss[40]; and (3) Poincaré loss[34]. We compare them in Table 4 and finally choose the Poincaré loss. For details of the loss functions, please refer to Appendix A.
Gradient-free Optimizer. The optimization problem in Equation 4 is non-linear and non-convex, so choosing a suitable optimization algorithm is crucial for achieving good performance. In this study, we adopt the Covariance Matrix Adaptation Evolution Strategy (CMA-ES)[13], a gradient-free optimization algorithm that is particularly well suited to high-dimensional problems. CMA-ES is a variant of evolutionary strategies [6, 14] and adapts a covariance matrix to optimize the search distribution. We initiate the optimization process with an initial latent vector $z$. After obtaining the optimized $\hat{z}$, we generate an image $\hat{x} = G_{synthesis}(G_{mapping}(\hat{z}))$ that reflects the private training-data features of the target class $c$. We keep the parameter settings specified in [14] for the algorithm's parameters. Please refer to Algorithm 1 for details.
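The search itself can be driven by the pycma implementation of CMA-ES [14]; the sketch below uses an illustrative initial step size and reuses the black-box objective defined above.

```python
import cma
import numpy as np

def run_cma_es(objective, z_dim=512, sigma0=0.5, popsize=25, iterations=300):
    """Sketch of the gradient-free search: CMA-ES proposes candidate latent
    vectors, the black-box objective scores them, and the search distribution
    (mean and covariance) is adapted accordingly."""
    z0 = np.random.randn(z_dim)                          # initial latent vector z
    es = cma.CMAEvolutionStrategy(z0, sigma0,
                                  {"popsize": popsize, "maxiter": iterations})
    while not es.stop():
        candidates = es.ask()                            # sample a population of z vectors
        fitnesses = [objective(z) for z in candidates]   # one target-model query per candidate
        es.tell(candidates, fitnesses)                   # update mean and covariance
    return es.result.xbest                               # best latent vector found

# Example usage (all arguments besides the optimizer settings are placeholders):
# from functools import partial
# objective = partial(cgmi_objective, G_mapping=G_mapping, G_synthesis=G_synthesis,
#                     query_target_model=query_target_model, target_class=0,
#                     loss_fn=poincare_loss)
# z_hat = run_cma_es(objective)
```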

Transformation-based Selection. Following [34], we utilize transformation-based selection to choose the images from the optimized population that best reflect the private training data. Let $X$ denote the set of optimized synthetic images. We define a transformation operation $T(\cdot)$, which includes scaling, modifying the aspect ratio, and random horizontal flipping. By applying $T$ to two candidates $\hat{x}_i$ and $\hat{x}_j$, we obtain transformed images $T(\hat{x}_i)$ and $T(\hat{x}_j)$, which are then input to the target model, yielding new confidence scores. Typically, if $\hat{x}_i$ captures the target-class features more effectively than $\hat{x}_j$, the confidence scores after transformation satisfy $M_{target}(T(\hat{x}_i))_c > M_{target}(T(\hat{x}_j))_c$.

$$\hat{x}^{*} = \arg\max_{\hat{x} \in X} \; \frac{1}{R} \sum_{r=1}^{R} M_{target}\big(T_r(\hat{x})\big)_c \tag{5}$$

Hence, the image with the highest average target-class score after transformations, denoted $\hat{x}^{*}$, is selected as the final image. In Equation 5, Monte Carlo estimation is employed, where $R$ represents the number of applied transformations.
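A sketch of this selection step is given below; it follows the procedure of [34], with the transformation parameters chosen for illustration only.

```python
import torch
import torchvision.transforms as T

@torch.no_grad()
def select_final_images(candidates, target_model, target_class,
                        num_transforms=100, top_k=50):
    """Monte Carlo estimate of Eq. 5: average the target-class score of each
    candidate image over random transformations and keep the top-k images."""
    transform = T.Compose([
        T.RandomResizedCrop(224, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
        T.RandomHorizontalFlip(p=0.5),
    ])
    avg_scores = []
    for x in candidates:  # x: image tensor of shape (C, H, W)
        scores = [target_model(transform(x).unsqueeze(0))[0, target_class]
                  for _ in range(num_transforms)]
        avg_scores.append(torch.stack(scores).mean())
    order = torch.argsort(torch.stack(avg_scores), descending=True)
    return [candidates[i] for i in order[:top_k]]
```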
5 Experiments
This section begins with a comprehensive explanation of our experimental setup. We then assess the effectiveness of the CG-MI attack along several dimensions: the performance of different MIA methods in different access scenarios, on different datasets, and against different target models.
5.1 Experimental Settings
Datasets. In our experiments, we use six datasets: CelebA[25], FFHQ[19], FaceScrub[26], AFHQ Dogs[5], Metfaces[20], and Stanford Dogs[23]. For the attack evaluations, we trained target models on CelebA, FaceScrub, and Stanford Dogs. As image priors, we used StyleGAN2 models pre-trained on FFHQ, Metfaces, and AFHQ Dogs, enabling attacks across different data distributions.
Models. Our models are divided into target models and evaluation models. To facilitate a fair comparison, we conducted attack experiments on several popular network architectures, including Resnet [15] and DenseNet [17], while selecting Inception-V3 [35] as the evaluation model. For the facial recognition task, we trained Resnet18, Resnet152, Densenet169, and InceptionV3 models on both the CelebA and FaceScrub datasets. Similarly, for the dog breed classification task, we trained Resnet18, Resnet152, Densenet169, and InceptionV3 models on the Stanford Dogs dataset. To enable comparison with prior work, we attacked the Resnet18 models trained on CelebA and FaceScrub in the comparative experiments. The data partition used for training the target models, the training parameters, the attack process, and the comparative experiments are detailed in Section B.1.
Evaluation Metrics. To align with prior work, we followed PPA[34] to calculate various evaluation metrics. Firstly, we trained an independent Inception-v3 evaluation model on the training data of the target model. Then, we used the evaluation model to predict labels on the generated attack results and computed the TOP-1 and TOP-5 accuracy for the target class.
Next, we computed the shortest feature distance from each generated image to any training sample in the target class and report the average distance as $\delta_{eval}$. The distance is measured as the squared L2 distance between activations of the penultimate layer of the evaluation model. For facial images, we additionally used a pre-trained FaceNet model [32] to measure the feature distance $\delta_{face}$. Lower values indicate that the attack results are visually closer to the training data.
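The feature-distance metric can be computed as in the sketch below, assuming `features(imgs)` returns the penultimate-layer activations of the evaluation model (or the FaceNet embeddings) as an (N, d) tensor.

```python
import torch

@torch.no_grad()
def feature_distance(attack_images, train_images, features):
    """Average, over all attack results, of the squared L2 distance to the
    nearest training sample of the target class in the feature space of the
    evaluation model."""
    f_attack = features(attack_images)           # (N_attack, d) feature vectors
    f_train = features(train_images)             # (N_train, d) feature vectors
    dists = torch.cdist(f_attack, f_train) ** 2  # pairwise squared L2 distances
    return dists.min(dim=1).values.mean().item()
```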
The third metric is the FID (Fréchet Inception Distance) score [16]. FID calculates the distance between the feature vectors of the generated attack results and the training data of the target. The feature vectors are extracted using an Inception-v3 model trained on ImageNet [7]. A lower FID score indicates a higher similarity between the two datasets.
5.2 Experimental Results
Comparison with Previous MIA Approaches. We compared CG-MI with various MIA methods in different scenarios, including white-box, black-box, and label-only settings. For white-box MIAs, we used PPA[34] as the baseline method. In contrast to previous white-box attack methods (such as [41, 4, 40]), PPA focuses on scenarios with different data distributions and image priors independent of the target model, making it the most meaningful comparison. For the black-box scenario, we selected RLB-MI[12] and Brep-MI[18] as baselines, representing state-of-the-art MIA methods in the black-box and label-only settings, respectively.
To ensure a fair comparison, we trained Resnet18 models on the CelebA and FaceScrub datasets and attacked them with the different MIA methods. In our CG-MI method, we first ran the CMA-ES optimization algorithm multiple times for each class, generating a batch of 200 synthetic images; this ensures the stability and reliability of the results and reduces the impact of randomness. Next, we adopted the transformation-based selection strategy to choose the 50 most representative images from the batch of 200 synthetic images for evaluation. For the other MIA methods, we took the characteristics of each attack into account and ran it multiple times to generate a total of 200 synthetic images, likewise applying the transformation-based selection strategy to choose 50 images. We then used the same evaluation models and metrics to assess all methods.
It is worth mentioning that RLB-MI and Brep-MI both employ the same GAN[28] structure, which is designed specifically for generating low-resolution 64x64 pixel images. To ensure a fair comparison, we made adjustments to the aforementioned methods by using a deeper GAN architecture capable of generating higher-resolution images[34]. Firstly, we added two additional upsampling blocks (consisting of a transpose convolution layer and a batch normalization layer) to the generator. We also expanded the discriminator by adding two convolution blocks, with each block consisting of a convolution layer and an instance normalization layer. Subsequently, we trained the modified generator on the FFHQ256 dataset to generate 256x256 pixel images. We then utilized this enhanced GAN as the image prior for RLB-MI and Brep-MI. By adapting the GAN architecture and training on higher-resolution images, we ensured that the comparison between CG-MI and RLB-MI/Brep-MI was carried out on a level playing field.
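A hedged sketch of the kind of blocks we appended is shown below; the channel counts and activation functions are illustrative assumptions, and only the transpose-convolution/batch-norm and convolution/instance-norm structure follows the description above.

```python
import torch.nn as nn

def upsample_block(in_channels, out_channels):
    """Extra generator block used to raise the output resolution towards
    256x256: transpose convolution followed by batch normalization."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

def discriminator_block(in_channels, out_channels):
    """Extra discriminator block: strided convolution followed by instance normalization."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_channels),
        nn.LeakyReLU(0.2, inplace=True),
    )
```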
Table 1: Evaluation results of CG-MI and baseline methods attacking a Resnet18 trained on CelebA and FaceScrub.

| Dataset | Type | Method | ↑acc@1 | ↑acc@5 | ↓δ_face | ↓δ_eval | ↓FID |
|---|---|---|---|---|---|---|---|
| CelebA | White-box | PPA | 88.28% | 97.34% | 0.6992 | 283.89 | 40.43 |
| CelebA | Black-box | RLB-MI | 29.25% | 53.77% | 1.0740 | 358.18 | 101.86 |
| CelebA | Black-box | CG-MI (Ours) | 77.86% | 94.16% | 0.7465 | 292.14 | 46.66 |
| CelebA | Label-only | Brep-MI | 38.50% | 61.25% | 0.9700 | 356.83 | 93.05 |
| FaceScrub | White-box | PPA | 98.32% | 99.84% | 0.6735 | 107.35 | 45.73 |
| FaceScrub | Black-box | RLB-MI | 33.28% | 64.52% | 1.1097 | 135.16 | 111.06 |
| FaceScrub | Black-box | CG-MI (Ours) | 90.92% | 99.34% | 0.7570 | 111.75 | 62.24 |
| FaceScrub | Label-only | Brep-MI | 51.33% | 73.82% | 1.0664 | 132.59 | 102.94 |
Table 1 presents the evaluation results of CG-MI and the baseline methods when attacking a Resnet18 trained on the CelebA and FaceScrub datasets. The target model Resnet18 achieves test accuracies of 86.38% and 94.22% on CelebA and FaceScrub, respectively, while the evaluation model InceptionV3 achieves test accuracies of 93.28% and 96.20% on the same datasets. Based on the data in Table 1, CG-MI outperforms the other black-box methods in addressing distribution shift in black-box attack scenarios. It generates more transferable synthetic images, resulting in higher attack success rates, lower feature distances, and lower FID values.
The qualitative evaluation results shown in Figure 4 demonstrate that compared to previous black-box MIA methods, CG-MI is capable of generating more realistic images in different data distribution scenarios. It overcomes the limitations of distribution shift and produces synthetic images of comparable quality to state-of-the-art white-box methods.
Performance Evaluation on various Models and Datasets. We also evaluated the performance of CG-MI on deeper network architectures and datasets from different categories. Specifically, we trained Resnet152 and Densenet169 models on the CelebA, Facescrub, and Stanford Dogs datasets. Additionally, we trained Resnet18 models on the Facescrub and CelebA datasets, with the recognition accuracy on the respective test datasets indicated in parentheses.

Table 2: Attack results of CG-MI on further architectures, datasets, and distribution shifts. Test accuracies of the target models are given in parentheses.

| Image Prior → Target Data | Target Model | (↑acc@1, ↑acc@5) | (↓δ_face, ↓δ_eval, ↓FID) |
|---|---|---|---|
| FFHQ → CelebA | Resnet152 (86.78%) | (67.42%, 86.16%) | (0.7773, 319.42, 46.04) |
| FFHQ → CelebA | Densenet169 (85.39%) | (65.28%, 87.72%) | (0.7831, 321.16, 47.91) |
| Metfaces → CelebA | Resnet18 (86.38%) | (24.42%, 49.26%) | (1.2089, 420.75, 111.40) |
| FFHQ → FaceScrub | Resnet152 (93.74%) | (85.02%, 97.94%) | (0.7998, 122.53, 64.17) |
| FFHQ → FaceScrub | Densenet169 (95.49%) | (90.78%, 97.95%) | (0.7608, 115.02, 63.05) |
| Metfaces → FaceScrub | Resnet18 (94.22%) | (59.16%, 88.40%) | (1.0286, 133.42, 101.02) |
| AFHQ Dogs → Stanford Dogs | Resnet152 (71.23%) | (84.80%, 99.04%) | (-, 60.50, 58.04) |
| AFHQ Dogs → Stanford Dogs | Densenet169 (74.39%) | (83.06%, 98.02%) | (-, 63.51, 59.03) |
Table 3: Comparison of gradient-free optimizers.

| Optimizer | ↑acc@1 | ↑acc@5 | ↓δ_face | ↓δ_eval | ↓FID |
|---|---|---|---|---|---|
| BO | 54.65% | 85.26% | 0.9511 | 135.06 | 72.67 |
| CMA-ES | 90.92% | 99.34% | 0.7570 | 111.75 | 62.24 |
The objective of this experiment was to explore how well our method generalizes to other model architectures and deeper network structures, and to evaluate CG-MI under significant data distribution shifts. From the data in Table 2, it is evident that as the target model becomes structurally deeper, CG-MI encounters increased difficulty when attacking CelebA and FaceScrub, resulting in a slight reduction in attack effectiveness. In settings with significant data distribution shifts, such as using a Metfaces prior to attack CelebA and FaceScrub, CG-MI still maintains a certain level of attack effectiveness. The visual results in Figure 5 illustrate that CG-MI continues to produce meaningful attack results even when dealing with deeper network architectures, different classification tasks, and significant data distribution shifts, reinforcing the robustness and versatility of our method across model architectures and datasets.
Experiments with Various Gradient-free Optimizers. We compared different gradient-free optimization algorithms, namely CMA-ES[13] and BO[8], under the setting of $p_{prior}$ = FFHQ and $p_{target}$ = FaceScrub, with Resnet18 as the target model. Bayesian Optimization (BO) is a global optimization algorithm that uses probabilistic models to efficiently search for the optimum of an expensive black-box function. As Table 3 shows, CMA-ES clearly outperforms BO in this setting.
5.3 Ablation Study
In the ablation study, we considered two probability distributions: $p_{target}$, the distribution of images from the CelebA dataset, and $p_{prior}$, the distribution of images from the FFHQ dataset. We first compared three different loss functions for the model inversion attack: the Poincaré loss[34], the max-margin loss[40], and the cross-entropy loss[4]. The Poincaré loss achieved the best performance in terms of generating effective synthetic images.
The experiments also involved replacing the StyleGAN2 architecture with the DCGAN architecture in combination with our proposed attack CG-MI. The ablation results indicate that CG-MI is compatible with other GAN architectures: even when StyleGAN2 is replaced by DCGAN, CG-MI still yields favorable evaluation results. We also conducted ablation experiments employing the objective function proposed in [34] in combination with the gradient-free optimization algorithm CMA-ES. From these results (No Mapping), we observed that in the black-box scenario, where gradient information is unavailable, directly using the objective function from PPA does not yield favorable attack results. This underscores the importance of leveraging the mapping network as an integral part of the optimization process. By using our newly proposed objective function, we significantly improve our attack capabilities in the black-box scenario.
We also investigated the influence of the transformation-based selection technique[34] on the attack outcomes (No Trans. Selection). Transformation-based selection applies certain transformations to the synthesized images after the optimization process. By employing selection transformations, we can enhance the stability of attack outcomes and, to some extent, increase the success rate of attacks. Overall, the ablation results highlight the importance of our proposed objective function for generating effective synthesized images.
Table 4: Ablation study results.

| Setting | ↑acc@1 | ↑acc@5 | ↓δ_face | ↓δ_eval | ↓FID |
|---|---|---|---|---|---|
| Poincaré Loss | 77.86% | 94.16% | 0.7465 | 292.14 | 46.66 |
| Cross-Entropy Loss | 62.77% | 85.67% | 0.8550 | 329.71 | 51.81 |
| Max-Margin Loss | 71.50% | 91.25% | 0.7827 | 331.48 | 49.02 |
| DCGAN | 32.80% | 66.50% | 0.9526 | 322.88 | 91.32 |
| No Mapping | 0.03% | 0.09% | 1.4894 | 435.82 | 201.02 |
| No Trans. Selection | 72.79% | 89.31% | 0.7688 | 300.11 | 47.28 |
6 Discussion, Limitations and Conclusion
Our paper proposes a novel black-box attack method called CG-MI. Unlike existing black-box methods, we consider more realistic scenarios and make no assumptions about the data distribution of the target model. Our approach only requires knowledge of the model's classification task and leverages a pre-trained, publicly available image prior to attack various target models and generate high-resolution synthetic images. Furthermore, we introduce the concept of synthetic image transferability and investigate its impact on MIAs. By designing a novel objective function and combining it with gradient-free optimization, we achieve MIAs in black-box scenarios and enhance the transferability of the synthesized images. Experimental results demonstrate that CG-MI outperforms existing black-box MIAs in more realistic scenarios, achieving state-of-the-art attack performance.
However, current black-box MIAs, including our work, still have limitations. While black-box methods do not require full access to the target model, the frequent queries they issue may hinder the attack in real-world settings. Exploring how to control the query count while maintaining the attack success rate is therefore a worthwhile research direction.
It is worth noting that our research may have negative implications. However, the purpose of revealing vulnerabilities in existing systems is to promote the development of better defense mechanisms. Our work aims to call for attention from the academic and technical community to research on machine learning privacy protection. We believe that the positive impact of these efforts will outweigh the potential negative risks.
Acknowledgements
References
- An et al. [2022] Shengwei An, Guanhong Tao, Qiuling Xu, Yingqi Liu, Guangyu Shen, Yuan Yao, Jingwei Xu, and Xiangyu Zhang. Mirror: Model inversion for deep learning network with high fidelity. In Proceedings of the 29th Network and Distributed System Security Symposium, 2022.
- Aïvodji et al. [2019] Ulrich Aïvodji, Sébastien Gambs, and Timon Ther. Gamin: An adversarial approach to black-box model inversion. arXiv preprint, 2019.
- Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2018.
- Chen et al. [2021] Si Chen, Mostafa Kahla, Ruoxi Jia, and Guo-Jun Qi. Knowledge-enriched distributional model inversion attacks. In Proceedings of the IEEE/CVF international conference on computer vision (CVPR), pages 16178–16187, 2021.
- Choi et al. [2020] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Das and Suganthan [2011] Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation, page 4–31, 2011.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- Eriksson et al. [2019] David Eriksson, Michael Pearce, Jacob R. Gardner, Ryan Turner, and Matthias Poloczek. Scalable global optimization via local bayesian optimization. Neural Information Processing Systems (NeurIPS), 2019.
- Fredrikson et al. [2014] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pages 17–32, 2014.
- Fredrikson et al. [2015] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1322–1333, 2015.
- Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
- Han et al. [2023] Gyojin Han, Jaehyun Choi, Haeil Lee, and Junmo Kim. Reinforcement learning-based black-box model inversion attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20504–20513, 2023.
- Hansen [2016] Nikolaus Hansen. The cma evolution strategy: A tutorial. Towards a new evolutionary computation, page 75–102, 2016.
- Hansen et al. [2019] Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on Github. Zenodo, DOI:10.5281/zenodo.2559634, 2019.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Neural Information Processing Systems (NeurIPS), 2017.
- Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Kahla et al. [2022] Mostafa Kahla, Si Chen, Hoang Anh Just, and Ruoxi Jia. Label-only model inversion attacks via boundary repulsion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15045–15053, 2022.
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Karras et al. [2020a] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems (NeurIPS), 2020a.
- Karras et al. [2020b] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. arXiv preprint, 2020b.
- Karras et al. [2020c] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020c.
- Khosla et al. [2011] Aditya Khosla, N Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2011.
- Liu et al. [2020] Ximeng Liu, Lehui Xie, Yaopeng Wang, Jian Zou, Jinbo Xiong, Zuobin Ying, and Athanasios V Vasilakos. Privacy and security issues in deep learning: A survey. IEEE Access, 9:4566–4593, 2020.
- Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
- Ng and Winkler [2014] Hong-Wei Ng and Stefan Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), 2014.
- Nguyen et al. [2023] Ngoc-Bao Nguyen, Keshigeyan Chandrasegaran, Milad Abdollahzadeh, and Ngai-Man Cheung. Re-thinking model inversion attacks against deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16384–16393, 2023.
- Radford et al. [2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
- Razzhigaev et al. [2020] Anton Razzhigaev, Klim Kireev, Edgar Kaziakhmedov, Nurislam Tursynbek, and Aleksandr Petiushko. Black-box face recovery from identity features. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 462–475. Springer, 2020.
- Rigaki and García [2020a] Maria Rigaki and Sebastián García. A survey of privacy attacks in machine learning. arXiv preprint, 2020a.
- Rigaki and García [2020b] Maria Rigaki and Sebastián García. A survey of privacy attacks in machine learning. arXiv preprint, 2020b.
- Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Song and Namiot [2022] Junzhe Song and Dmitry Namiot. A survey of the implementations of model inversion attacks. In International Conference on Distributed Computer and Communication Networks, pages 3–16. Springer, 2022.
- Struppek et al. [2022] Lukas Struppek, Dominik Hintersdorf, Antonio De Almeida Correia, Antonia Adler, and Kristian Kersting. Plug & play attacks: Towards robust and flexible model inversion attacks. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 20522–20545. PMLR, 2022.
- Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Wang et al. [2021] Kuan-Chieh Wang, Yan Fu, Ke Li, Ashish Khisti, Richard Zemel, and Alireza Makhzani. Variational model inversion attacks. Advances in Neural Information Processing Systems, 34:9706–9719, 2021.
- Wang et al. [2018] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 8798–8807, 2018.
- Wernke et al. [2014] Marius Wernke, Pavel Skvortsov, Frank Dürr, and Kurt Rothermel. A classification of location privacy attacks and approaches. Personal and ubiquitous computing, 18:163–175, 2014.
- Yang et al. [2019] Ziqi Yang, Ee-Chien Chang, and Zhenkai Liang. Neural network inversion in adversarial setting via background knowledge alignment. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS ’19, page 225–240, 2019.
- Yuan et al. [2023] Xiaojian Yuan, Kejiang Chen, Jie Zhang, Weiming Zhang, Nenghai Yu, and Yang Zhang. Pseudo label-guided model inversion attack via conditional generative adversarial network. In AAAI Conference on Artificial Intelligence (AAAI), 2023.
- Zhang et al. [2020] Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, and Dawn Song. The secret revealer: Generative model-inversion attacks against deep neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 253–261, 2020.
Supplementary Material

Appendix
Appendix A Confidence Matching Loss
We explore three key loss functions: the cross-entropy loss[41, 4], the max-margin loss[40], and the Poincaré loss[34]. Inspired by [27], we represent the model's unnormalized score for class $j$ as the product of the last-layer weights and the penultimate-layer activations, i.e., $l_j(x) = w_j^{\top}\phi(x)$. The cross-entropy loss minimizes the negative log-likelihood of the target identity under the model parameters and is formulated as follows:

$$\mathcal{L}_{CE} = -\log \frac{\exp\big(w_c^{\top}\phi(x)\big)}{\sum_{j}\exp\big(w_j^{\top}\phi(x)\big)} \tag{6}$$

Here, $\phi(x)$ denotes the activation of the penultimate layer for sample $x$, and $w_c$ represents the last-layer weight vector for the $c$-th class in the target model $M_{target}$.
The max-margin loss not only encourages maximizing the confidence score of the target class but also emphasizes the separability of this class from the others. We formulate the max-margin loss as follows:

$$\mathcal{L}_{MM} = -w_c^{\top}\phi(x) + \max_{j \neq c} \; w_j^{\top}\phi(x) \tag{7}$$

Here, $\phi(x)$ represents the activation of the penultimate layer for sample $x$, the last-layer weight vector for the $j$-th class of the target model is denoted $w_j$, and $w_j^{\top}\phi(x)$ is the unnormalized logit for the $j$-th class.
The Poincaré loss is a hyperbolic-space embedding loss, which can be written as:

$$\mathcal{L}_{Poin} = d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\,\lVert u - v \rVert^{2}}{\big(1 - \lVert u \rVert^{2}\big)\big(1 - \lVert v \rVert^{2}\big)}\right) \tag{8}$$

The Poincaré loss measures the distance between two vectors $u$ and $v$ in hyperbolic space. $\lVert \cdot \rVert$ denotes the Euclidean norm, with $\lVert u \rVert < 1$ and $\lVert v \rVert < 1$. Here, $u = l / \lVert l \rVert_{1}$ is the vector of unnormalized scores $l$ normalized by its L1 (absolute value) norm, and $v$ is the one-hot encoded vector for class $c$, in which we replace the 1 with 0.9999. The Poincaré loss belongs to the hyperbolic distance learning paradigm, which enables measuring and comparing distances between vectors in a larger embedding space.
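A minimal PyTorch sketch of the Poincaré loss as defined in Equation 8 is given below; the epsilon clamp is an implementation detail added here for numerical stability.

```python
import torch

def poincare_loss(scores, target_class, eps=1e-6):
    """Poincaré (hyperbolic) distance between the L1-normalized score vector u
    and the softened one-hot target vector v (Eq. 8)."""
    u = scores / scores.abs().sum(dim=-1, keepdim=True)  # L1-normalize the model scores
    v = torch.zeros_like(u)
    v[..., target_class] = 0.9999                          # one-hot with 1 replaced by 0.9999
    diff = (u - v).pow(2).sum(dim=-1)
    denom = (1 - u.pow(2).sum(dim=-1)) * (1 - v.pow(2).sum(dim=-1))
    return torch.acosh(1 + 2 * diff / denom.clamp_min(eps)).mean()
```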
Appendix B Experimental Supplement
B.1 Datasets
The CelebA[25] dataset comprises 202,599 face images of 10,177 celebrity identities. For training the target models (Resnet18, Resnet152, and Densenet169), we selected the 1,000 identities with the largest number of samples, resulting in a total of 30,038 images. The FaceScrub[26] dataset provides cropped face images of 530 identities; note that the dataset's official website offers download links rather than the actual images. All FaceScrub identities were used as the target dataset. The Stanford Dogs[23] dataset is built on top of ImageNet[7], which is intended for non-commercial research purposes only, and provides 20,580 images of 120 dog breeds. For all datasets, the input images for the target models were resized to 224x224 pixels, while the input images for the evaluation model (Inception-V3) were resized to 299x299.
Attack Implementation. All models were trained using the Adam optimizer with a learning rate of 0.001 for a total of 100 epochs with a batch size of 128. The training data were normalized with mean and standard deviation set to 0.5. The input images of the target models were resized to 224x224, and the inputs of the evaluation model InceptionV3 to 299x299. Additionally, data augmentation was applied, including horizontal flipping with probability 0.5 and random adjustments of brightness, contrast, saturation, and hue within fixed ranges.
During the attack process, the StyleGAN2 synthesis network was configured with a truncation parameter of 0.5 and a truncation cutoff of 8. The CMA-ES algorithm, a gradient-free optimization method, was configured with 8 optimization runs per class, 300 rounds each, and a population size of 25. The transformation-based selection strategy used 100 transformations; it included center-cropping the images generated by the generative model, resizing them to 224, and applying random adjustments of the cropping size, scale, and ratio. To accelerate the multiple CMA-ES optimizations, 8 parallel processes were utilized. In the extended experiments, we performed central cropping to 800 pixels on the images generated by the Metfaces prior when attacking the CelebA and FaceScrub target models, followed by resizing to 224. For the remaining target model architectures, we kept the attack parameters consistent.
In the comparative experiments with RLB-MI and Brep-MI, the latent vector size of the GAN was set to 100. For GAN training, the Adam optimizer was used with a batch size of 64 and a learning rate of 0.0002, and training was conducted for 280 epochs.
B.2 Publicly Available Image Prior
We downloaded the code for StyleGAN2[21] from the official source at stylegan2-ada. The AFHQ dataset consists of 16,130 high-resolution images with a resolution of 512x512 pixels. We obtained the pre-trained model weights for AFHQ.dogs512[5] via the link provided in the official StyleGAN2-ADA code: AFHQ.dogs. Note that the FFHQ[19] dataset comprises 70,000 high-quality face images; compared to CelebA and FaceScrub, the image quality in FFHQ is significantly higher. We downloaded the pre-trained model weights for FFHQ256 from ffhq256 and for Metfaces from Metfaces. During the attack process, for the StyleGAN2 model pre-trained on FFHQ256, we performed central cropping to 200 pixels and then resized the images to 224x224 before inputting them into the target model. For the images generated on AFHQ.dogs, we applied central cropping to 400 pixels and then resized them to 224x224. In the comparative experiments, we followed the same cropping and resizing approach for the other MIA methods.
B.3 Additional Experimental Results.
Visual Comparison of Attack Results on FaceScrub. We also visualize the attack results on the FaceScrub dataset, comparing the performance of the different methods. As shown in Fig. 6, our approach generates synthesized images that better reflect the distinctive features of the target model's private training data.