
Unlearnable Graph: Protecting Graphs from Unauthorized Exploitation

Yixin Liu1, Chenrui Fan2, Pan Zhou2, and Lichao Sun1
1 Lehigh University, Bethlehem, PA, USA
2 Huazhong University of Science and Technology, Wuhan, Hubei, China
{yila22, lis221}@lehigh.edu, {chenrui_fan, panzhou}@hust.edu.cn
Abstract

While the use of graph-structured data in various fields is becoming increasingly popular, it also raises concerns about the potential unauthorized exploitation of personal data for training commercial graph neural network (GNN) models, which can compromise privacy. To address this issue, we propose a novel method for generating unlearnable graph examples. By injecting delusive but imperceptible noise into graphs with our Error-Minimizing Structural Poisoning (EMinS) module, we make the graphs unexploitable. Notably, by modifying at most 5% of the potential edges in the graph data, our method decreases the accuracy from 77.33% to 42.47% on the COLLAB dataset.

I Introduction

The abundance of data has led to the successful implementation of deep learning, which allows the integration of artificial intelligence (AI) into various domains [3]. However, with the increasing availability of publicly accessible data, concerns have arisen about the unauthorized exploitation of data. Many commercial AI models are trained on personal data unknowingly collected from the internet, raising questions about the potential misuse of this data for commercial or even illegal gain and posing a significant threat to individuals’ privacy, security, and copyright.

The threat of unauthorized data exploitation has made it imperative to develop defensive approaches. Recent studies have focused on developing Unlearnable Examples [2, 1]. These methods aim to make the original data unlearnable by adding imperceptible but delusive perturbations to data samples, so that deep learning models trained on the perturbed dataset achieve extremely low prediction accuracy.

Previous studies on unlearnable examples have primarily focused on the vision domain. However, as graph-structured data becomes more prevalent, particularly in privacy- and security-sensitive settings, it is important to explore the vulnerability of graph data to unauthorized exploitation. To the best of our knowledge, unlearnable graphs, i.e., unlearnable examples on graph data, have not yet been explored. In this paper, we aim to answer the question of how to make structured graph data unlearnable by a wide range of GNN models.

To tackle these issues, we propose the Adaptive GradArgMin method to craft error-minimizing structural perturbations based on gradient information. Adaptive GradArgMin selects the set of edges with the largest gradient magnitudes and flips the corresponding entries of the adjacency matrix. To achieve a good balance between invisibility and effectiveness under a limited manipulation budget, we design an adaptive constraint strategy that considers both vertex-based and edge-based information. The perturbed graph remains visually indistinguishable from the original under inspection, which preserves the utility of the data for other purposes while making it unexploitable by ML models.

Figure 1: An illustration of the motivation for Unlearnable Graph. Existing vision-based solutions fail to inject delusive patterns into graphs, a more challenging data structure, due to their discrete nature. In this paper, we propose Error-Minimizing Structural Poisoning to achieve efficient and effective data protection for graphs.

II Assumptions and Problem Formulation

Assumptions on Defender’s Capability. We assume that the data owner (the defender) has full access to the portion of graph data used by unauthorized exploiters to train their models. However, the defender cannot interfere with the model selection or the training procedure of the unauthorized users.

Objective. Given a clean graph training dataset $\mathcal{G}_{c}=\{G_{i},y_{i}\}_{i=1}^{N}$, our goal is to craft an unlearnable version of the training dataset $\mathcal{G}_{u}=\{\hat{G_{i}},y_{i}\}_{i=1}^{N}$ such that models trained on $\mathcal{G}_{u}$ perform poorly on the clean testing set $\mathcal{G}_{t}$. The task can be formulated as the following bi-level optimization:

\begin{gathered}
\max_{\delta_{i}\preceq c}\ \mathbb{E}_{(G_{i},y_{i})\sim\mathcal{G}_{t}}\left[\mathcal{L}\left(f_{\theta^{*}}(G_{i}),y_{i}\right)\right],\\
\mathrm{s.t.}\quad \theta^{*}=\underset{\theta}{\arg\min}\sum_{(G_{i},y_{i})\in\mathcal{G}_{u}}\left[\mathcal{L}\left(f_{\theta}\left(G_{i}\oplus\delta_{i}\right),y_{i}\right)\right].
\end{gathered} \qquad (1)

where $\oplus$ denotes the application of perturbations to the node features or topology of the original graph $G_{i}$, and $\preceq$ denotes the budget constraint.

III Proposed Methodology

In this section, we propose Error-Minimizing Structural (EMinS) noise, an effective and imperceptible defense against unauthorized exploitation.

The Min-min Optimization. To tackle the intractable bi-level problem in Eq. 1, an approximate min-min optimization process was proposed [2], which first learns a noise generator and then leverages it to generate noise. The main idea is to iteratively craft noise that can trick models trained on the poisoned data. The resulting problem is also bi-level, with two levels of minimization: the inner level is a constrained optimization that finds the noise, bounded by the given constraints, that minimizes the model’s classification loss, while the outer level finds the model parameters that minimize the same loss.
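To make the alternating procedure concrete, the following Python sketch outlines one possible min-min loop in a PyTorch style. The surrogate model interface, the craft_noise helper, and all hyperparameter values are illustrative assumptions rather than the exact implementation used in this paper.

import copy
import torch
import torch.nn.functional as F

def min_min_optimization(model, graphs, labels, craft_noise,
                         outer_steps=10, inner_epochs=5, lr=1e-3):
    # Approximate min-min loop (sketch): alternately (i) train a surrogate GNN
    # on the currently perturbed graphs and (ii) re-craft error-minimizing
    # structural noise against the trained surrogate.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    perturbed = [copy.deepcopy(g) for g in graphs]  # start from the clean graphs

    for _ in range(outer_steps):
        # Outer minimization: fit the surrogate on the perturbed dataset.
        model.train()
        for _ in range(inner_epochs):
            for g, y in zip(perturbed, labels):
                optimizer.zero_grad()
                loss = F.cross_entropy(model(g), y)  # assumed graph-level logits
                loss.backward()
                optimizer.step()

        # Inner minimization: update the structural noise of each graph so that
        # the surrogate's classification loss is further reduced.
        model.eval()
        perturbed = [craft_noise(model, g, y) for g, y in zip(graphs, labels)]

    return perturbed

Here craft_noise corresponds to the Adaptive GradArgMin step described next, and in practice the loop would terminate once the surrogate's loss on the perturbed data falls below a chosen threshold or the step budget is exhausted.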

Crafting Delusive Edges with Adaptive GradArgMin. The core of our method is to take gradients with respect to the adjacency matrix $A$ so as to obtain a gradient for every edge in the potential edge space ($\mathcal{V}\times\mathcal{V}$), regardless of whether it exists. For a selected edge $(u,v)$, we conduct a discrete version of gradient descent by deleting existing edges with a positive gradient or adding non-existent edges with a negative gradient. Given the edge-modification constraint $c$ and the elements $\alpha_{u_{t},v_{t}}\in A$, we obtain a set of edges via a greedy selection:

\left\{u_{t},v_{t}\right\}_{t=1}^{c}=\underset{\left\{u_{t},v_{t}\right\}_{t=1}^{c}}{\arg\max}\sum_{t=1}^{c}\left|\frac{\partial\mathcal{L}}{\partial\alpha_{u_{t},v_{t}}}\right|. \qquad (2)

After that, modifications are performed by sequentially flipping these edges in the direction that is most likely to reduce the loss function. Note that we stop the modification process once all gradients for existing edges are negative and all gradients for non-existent edges are positive, in which case no perturbation can further decrease the objective.
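As a rough illustration of the greedy selection in Eq. 2, the sketch below takes gradients of the classification loss with respect to a dense adjacency matrix and flips the top-scoring admissible entries. The dense-adjacency model interface and the helper signature are our own assumptions for exposition, not the paper's exact code.

import torch
import torch.nn.functional as F

def grad_argmin_perturb(model, feats, adj, label, budget):
    # Greedy GradArgMin sketch: flip up to `budget` entries of a dense,
    # symmetric adjacency matrix in the direction that decreases the loss.
    # Existing edges with a positive gradient are deleted; missing edges with
    # a negative gradient are added. Inadmissible entries are skipped, so the
    # loop stops changing the graph once no loss-decreasing flip remains.
    adj = adj.clone().requires_grad_(True)
    loss = F.cross_entropy(model(feats, adj), label)
    grad = torch.autograd.grad(loss, adj)[0]

    # Rank candidate edges in the upper triangle by gradient magnitude.
    mask = torch.triu(torch.ones_like(adj), diagonal=1)
    order = torch.argsort((grad.abs() * mask).flatten(), descending=True)

    perturbed = adj.detach().clone()
    flips = 0
    for idx in order.tolist():
        if flips >= budget:
            break
        u, v = divmod(idx, adj.size(1))
        if mask[u, v] == 0:  # consider each undirected pair only once
            continue
        g = grad[u, v].item()
        exists = perturbed[u, v].item() > 0.5
        # Admissible flips: delete an existing edge with positive gradient,
        # or add a non-existent edge with negative gradient.
        if (exists and g > 0) or (not exists and g < 0):
            new_val = 0.0 if exists else 1.0
            perturbed[u, v] = new_val
            perturbed[v, u] = new_val
            flips += 1
    return perturbed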

Adaptive Constraints. To ensure that we modify a sufficient number of essential edges in each graph to create delusive yet invisible patterns, we devise a mixed constraint based on vertex-based and edge-based information. We consider two perturbation ratios, $r_{V}$ and $r_{E}$, which are defined with respect to the entire edge space ($V\times V$) and the set of existing edges ($E$), respectively. The number of edges to be modified is constrained by both coefficients, as sketched below. This constraint is effective and flexible, as it allocates a larger budget to larger graphs, ensuring the overall imperceptibility and effectiveness of our method.
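One plausible way to turn the two ratios into a per-graph budget is sketched below. The paper only states that both coefficients constrain the number of modified edges; the specific combination rule here (taking the smaller of the two allocations) and the default values are our assumptions.

def adaptive_budget(num_nodes, num_edges, r_v=0.05, r_e=0.05, min_budget=1):
    # Hypothetical adaptive constraint: derive an edge-modification budget from
    # both the full undirected edge space (vertex-based ratio r_v) and the set
    # of existing edges (edge-based ratio r_e). Larger graphs automatically
    # receive a larger absolute budget.
    vertex_based = r_v * num_nodes * (num_nodes - 1) / 2
    edge_based = r_e * num_edges
    return max(min_budget, int(min(vertex_based, edge_based)))

# Example: a hypothetical graph with 74 nodes and 2457 edges.
print(adaptive_budget(74, 2457))  # -> 122, i.e. min(0.05 * 2701, 0.05 * 2457)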

IV Experiments

In order to evaluate the effectiveness of our proposed method, we conducted experiments on six benchmark graph classification datasets (MUTAG, ENZYMES, PROTEINS, IMDB-B, IMDB-M, and COLLAB) across four common GNN architectures (GCN, GAT, GIN, and GraphSage), and compared our method with random and error-maximizing noise. The results on the PROTEINS and IMDB-M datasets are reported in Figure 2.

Figure 2: A comparison among different methods on PROTEINS and IMDB-MULTI datasets.

In addition to the good performance of EMinS noise in degrading the models’ test accuracy, we visualize the graphs before and after perturbation in Table I to demonstrate the imperceptibility of our noise. From the visualizations, we observe that for the majority of graphs there are few visual discrepancies between the original and modified ones.

TABLE I: Graph visualizations on the IMDB-MULTI and ENZYMES datasets (the first row shows clean graphs, and the second row shows the graphs generated with EMinS).

V Conclusion

In this paper, we propose Error-Minimizing Structural Poisoning, to our knowledge the first method for generating unlearnable graphs. Our method crafts invisible structural noise to prevent GNN models from freely exploiting graph data. We verify our method on six benchmark graph datasets, and the extensive experimental results show that it is effective across various GNN architectures. This study represents an important first step toward safeguarding personal graph data from exploitation by GNN models.

References

[1] Liam Fowl, Ping-yeh Chiang, Micah Goldblum, Jonas Geiping, Arpit Bansal, Wojtek Czaja, and Tom Goldstein. Preventing unauthorized use of proprietary data: Poisoning for secure dataset release. arXiv preprint arXiv:2103.02683, 2021.
[2] Hanxun Huang, Xingjun Ma, Sarah Monazam Erfani, James Bailey, and Yisen Wang. Unlearnable examples: Making personal data unexploitable. arXiv preprint arXiv:2101.04898, 2021.
[3] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419, 2023.