Enhancing Privacy in ControlNet and Stable Diffusion via Split Learning

Dixi Yao University of Toronto, Toronto, ON M5S 1A1, Canada. E-mail: [email protected]

Abstract

With the emerging trend of large generative models, ControlNet is introduced to enable users to fine-tune pre-trained models with their own data for various use cases. A natural question arises: how can we train ControlNet models while ensuring users’ data privacy across distributed devices? Exploring different distributed training schemes, we find conventional federated learning and split learning unsuitable. Instead, we propose a new distributed learning structure that eliminates the need for the server to send gradients back. Through a comprehensive evaluation of existing threats, we discover that in the context of training ControlNet with split learning, most existing attacks are ineffective, except for two mentioned in previous literature. To counter these threats, we leverage the properties of diffusion models and design a new timestep sampling policy during forward processes. We further propose a privacy-preserving activation function and a method to prevent private text prompts from leaving clients, tailored for image generation with diffusion models. Our experimental results demonstrate that our algorithms and systems greatly enhance the efficiency of distributed training for ControlNet while ensuring users’ data privacy without compromising image generation quality.

I Introduction

Leading at the forefront in the emerging trend of large generative artificial intelligence, large diffusion models [1] have become commercial success stories, with models from Stability AI and Midjourney dominating the news. With large diffusion models, any user is able to generate artistically appealing images with short descriptive text prompts. However, short descriptive text prompts do not offer a sufficient level of control over the generated images to satisfy a user’s needs in many cases. To support an additional level of control using conditions, ControlNet [2] has recently emerged, allowing users to generate images with a wide variety of user-defined conditions beyond text prompts.

With fine-grained control over generated images using ControlNet, it’s intuitive that users would want to fine-tune pre-trained ControlNet models with their own data to meet various use cases. However, since the training dataset may contain users’ own artistic creations or faces, privacy concerns arise. Additionally, each user may possess only a small number of images, which may not suffice for fine-tuning a diffusion model unless aggregated, such as in a collection of 50,000 images [2]. To maintain data privacy, it’s essential to fine-tune ControlNet with distributed users, posing the research question: How can we train ControlNet models while preserving users’ data privacy, particularly when the data is distributed across multiple client devices?

Federated learning has been heralded in recent years as a distributed training paradigm that preserves user privacy by training directly on client devices and aggregating local training updates using a federated learning server. However, conventional federated learning may not be suitable for fine-tuning large ControlNet and stable diffusion models for three important reasons. First, ControlNets and stable diffusion models are large generative models, requiring formidable GPU resources on client devices for local fine-tuning of pre-trained models. Second, even if such GPU resources were available on client devices, pre-trained ControlNet and stable diffusion models may not be accessible as open-source due to commercial interests. For example, neither OpenAI nor Midjourney has open-sourced models such as DALL $\cdot$ E 2 [3]. Finally, our own experimental results, as presented in this paper, indicate that large ControlNet models fine-tuned with conventional federated averaging [4] as the aggregation mechanism experienced severely degraded performance compared to centralized training.

Hence, split learning[5] becomes the only feasible distributed training paradigm for fine-tuning ControlNets. Clients train the first few layers of the neural network with their local data and transmit intermediate features to the server. The server then sequentially sends gradients back to clients after the forward pass and backpropagation. However, recent literature highlights that split learning can be inefficient and vulnerable to adversarial attacks, such as inversion attacks[6, 7, 8, 9], which have the potential to reconstruct private data.

To preserve data privacy, before designing defense mechanisms for training ControlNet in split learning, we begin to have second thoughts on whether these attacks are practical in real-world use cases of training ControlNet and stable diffusion models. We test these existing attacks in practical settings and under valid assumptions and surprisingly discover that images reconstructed by most existing attacks are not recognizable by humans.

With our detailed analysis of existing attacks and case studies, we find that only inversion attacks using inverse network models are effective for reconstructing conditional images when we train models with split learning. These attacks first train an inverse network on a public dataset and then use it to reconstruct private data [6, 7]. We empirically demonstrate that the success of the attack depends on the types of private data and should be analyzed on a case-by-case basis. Furthermore, we find that defending against such successful attacks with existing defense mechanisms greatly degrades image generation performance.

Our original contributions are as follows:

First, to enhance the efficiency of fine-tuning ControlNets using split learning, we design a new deployment structure. This structure eliminates the need for the server to send data back to the clients, thereby addressing the issue of efficiency bottlenecks.

Second, inspired by our empirical observations, we find that the forward process when training diffusion models can be combined with local differential privacy guarantees. Based on this, we emphasize our privacy-preserving timestep scheduling policy, establishing a relationship between the timestep scheduling policy and the privacy budget $\epsilon$ . This allows us to adjust the privacy-preserving ability of the system by setting specific scheduling policies. Additionally, we propose a symmetric activation function to process intermediate features, preventing attackers from reconstructing conditional images while still enabling the generation of high-quality images.

Third, in addition to the privacy leakage of conditional images, we further explore the leakage of text prompts. To train the stable diffusion model and ControlNet, we need to upload the text prompts, which may contain private information, to the server. We propose a new mechanism to train ControlNets with zero prompts. This trained model can still maintain high performance in image generation, while the server does not know the text prompts.

Finally, to evaluate performance in production settings, we implement a system to train ControlNet with federated learning and split learning using Plato and conduct experiments in real-world settings. It is demonstrated that with split learning and our architecture design, clients require less than 3 GB of GPU memory and experience $3\times$ lower communication overhead. Unlike existing privacy-preserving mechanisms, we verify that our mechanism can protect the privacy of images, conditions, and text prompts without sacrificing image generation performance.

II Background and Related Work

Refer to caption — Figure 1: Examples of images output from the decoder corresponding to latent representation at different timesteps during forward and sampling process.

II-A Diffusion Model and ControlNet

Diffusion Models [1] are probabilistic models used to learn a data distribution by gradually denoising a normally distributed variable to generate high-quality images. The stable diffusion model [10] converts images into latent representations with an encoder $\mathcal{E}$ and conducts the diffusion process on the latent domain $Z$ . After the sampling process, images are generated through a corresponding decoder $\mathcal{D}$ .

Existing diffusion models allow users to guide image generation with text prompts. For example, in a stable diffusion model, we utilize a contrastive language-image pretraining (CLIP) model [11], with parameters $\gamma_{\theta}$ , to convert text prompts into features. A cross-attention layer is then employed to combine these features with latent representations.

The image generation process involves a sampling procedure, which is the inverse of the forward process depicted in Fig. 1. In the forward process, we follow a Markov Chain to gradually add Gaussian noise ( $\mathcal{N}\sim(0,1)$ ) to the data, based on a variance schedule $\beta_{1},\ldots,\beta_{T}$ , where $t\in[T]$ represents each timestep of noise addition. We denote this Gaussian noise as $\hat{n}$ . As an inverse of the diffusion process, during the sampling process, the diffusion model outputs an estimation of noise $n$ at timestep $t$ , and we sample the latent $z_{t-1}$ using the equation:

	$\displaystyle z_{t-1}=$	$\displaystyle\sqrt{\alpha_{t-1}}\left(\frac{z_{t}-\sqrt{1-\alpha_{t}}n}{\sqrt{\alpha_{t}}}\right)$		(1)
		$\displaystyle+\sqrt{1-\alpha_{t-1}-\lambda_{t}^{2}}n+\lambda_{t}o(z_{t})$		(1)

Here, $\alpha_{t}=1-\beta_{t}$ , $\lambda_{t}$ is a noise coefficient, and $o(z_{t})$ is a small value randomly generated from a standard normal distribution. The sampling process begins with randomized Gaussian noise and gradually samples until we obtain $z_{0}$ , which corresponds to the latent representation of the image we wish to generate.

During the training process, we uniformly sample timesteps from $[T]$ and train the diffusion model (DM) with given text prompts $y$ , aiming to minimize the loss:

\min\mathcal{L}_{\rm DM}=\mathbb{E}_{\mathcal{E}(x),y,n\sim\mathcal{N}(0,1),t}\left[\lVert\hat{n}-n_{\theta}(z_{t},t,\gamma_{\theta}(y))\rVert^{2}_{2}\right]

(2)

In each training step, we need to follow Eq. 1 to generate a random noise $\hat{n}$ as a label, which serves as the ground truth. The diffusion model’s objective is to learn the parameters $\theta$ , which enable it to infer the noise $n$ . This inferred noise is the output of the diffusion model, used for denoising the image.

To enable users to control the generated images with more detailed conditions such as scribbles [2], canny lines [12], depth maps [13], HED lines [14], and segmentation maps [15], in addition to the given text prompts, a conditional diffusion model is proposed. An example is ControlNet [2], as shown in Fig. 3. We can generate a stormtrooper with the same skeletons as in the left image of the depth maps.

ControlNet copies the encoders from the backbone diffusion models and replaces the decoders of the backbone diffusion models with convolution layers initialized with zeros (referred to as zero convolution). A control network comprises copied encoders, zero convolutions, and a condition encoder for converting conditions into latent representations. ControlNet consists of this control network and the original stable diffusion model. For the detailed structure of ControlNet, we will explain it later in Section III-B.

Apart from stable diffusion models, ControlNet can also leverage other backbones such as LCM [16] and ControlLoRA [17]. Concurrent works, T2I-Adapter [18] and Composer [19], feature much smaller and much larger control networks, respectively. The control network can have other structures, such as those seen in T2IAdapter [18] and Composer [19]. Our method can also be applied to these works. FreeDoM [20] is a training-free conditional diffusion model. However, generating images with fine-grained conditions, such as using canny edge maps, can be challenging, resulting in poor guidance. Training-required methods still remain the optimal solution for conditional diffusion models.

II-B Decentralized ControlNet Training

With the assistance of ControlNet, users can fine-tune well-trained stable diffusion (SD) models without disrupting the original SD models. However, the conditions and training images involved may contain privacy-sensitive information. One straightforward solution is to train ControlNet entirely on a single device. For inference with a batch size of 1, we need 7.50 GB of GPU memory. However, to train ControlNet, a minimum of 23.82 GB of GPU memory is needed (with a minimal batch size of 2). Such high GPU memory requirements are unfeasible for most client devices, let alone mobile devices. Even if clients possess powerful computing resources and acceptable training times, training ControlNet on the client side remains impossible if the server is unwilling to share the well-trained diffusion model.

Even if a client has enough GPU memory to fine-tune a diffusion model locally, another issue arises when it needs to collect training samples from different users, as it may lack sufficient samples in its local data. To allow users to fine-tune ControlNets without their private data leaving their devices, a common solution is to leverage privacy-preserving decentralized frameworks. One such decentralized training paradigm is federated learning [4, 21]. We follow the standard federated averaging scheme to train the ControlNet with 50 clients, each having 1000 training samples. We train for a total of 100 rounds and aggregate weights after every 250 local iterations. We evaluate the performance on the MS-COCO [22] validation set. As shown in Fig. 3, even under the assumption that clients have powerful computing units and the weights of the diffusion model are available, the ControlNet trained by FedAvg [4] fails to learn the conditions. The generated image does not match the condition at all; for example, the posture of the person in the generated image differs from that in the left condition images. Since federated learning without any privacy-preserving mechanism cannot work, we do not need to further evaluate methods with privacy-preserving mechanisms [21].

There are also data encryption approaches in decentralized systems, such as trusted execution environments, multi-party computation, and homomorphic encryption. However, the overhead is not at the same scale as computing over plaintext data. For example, during inference, the forward time on diffusion models with homomorphic encryption [23] is 79.19 days, compared to 35 seconds with plaintext using NVIDIA A100. Moreover, to the best of our knowledge, there is no encryption method that can be directly applied to the training process of diffusion models.

Considering the challenges involved in training conditional diffusion models either on clients or servers, a suitable solution is to train such models through split learning, involving multiple clients and the server. Unlike federated learning, which pushes the entire model to the edge, split learning employs a neural network spanning both the cloud and the edge. An edge device trains the network up to the partition layer and sends the intermediate features to the server. Upon receiving these features, the server takes over training the remaining layers and completes forward propagation. During backward propagation, the server conducts back-propagation up to the partition point and sends the gradients of the partition layers back to the client. The client then updates the local parameters through back-propagation using the received gradients. A major drawback of split learning is its sequential training manner, resulting in significant resource underutilization and high transmission overhead, leading to longer training times. In each training step, the server and clients exchange features and gradients, with one party waiting while the other computes or transmits data.

II-C Privacy Leakage in Split Learning

Potential threats arise from split learning, as it carries the risk of privacy leakage through data transmission between clients and the server. Literature highlights that an honest-but-curious server could reconstruct private data using the intermediate features sent from clients to the server. Zhang et al. [24] successfully reconstructed private data in a white-box setting. UnSplit[9] further refined this method to conduct a similar attack in a black-box setting. He et al. [25] trained an inverse network using a public dataset, taking intermediate results as inputs to output private data for reconstruction. Conversely, Li et al. [26] demonstrated that private labels on clients also face the risk of leakage. Pasquini et al. [27] proposed an attack to reconstruct private data by manipulating gradients sent back to clients, under the assumption of a dishonest server. Duan et al. [28] introduced a membership inference attack (MIA) tailored specifically for diffusion models, although they acknowledged its limited applicability in real-world scenarios. Direct MIA is excluded from the scope of our paper. Additionally, Carlini et al. [29] utilized leaked text prompts to generate numerous images, subsequently employing MIA to identify which images exist in private datasets.

II-D Privacy Protection in Split Learning

In response to potential privacy leakage in split learning, researchers have made efforts to defend against such attacks. Local differential privacy techniques, such as additive noise and randomized response [30], are employed to prevent reconstruction. Additionally, Gaussian noise is utilized to directly add noise to the raw data [8]. Subsequently, many works have adopted methods involving additive noise [31, 32, 33]. DataMix [8] and CutMix [34] leverage the concept of mixing a batch of samples. DataMix is designed for convolutional neural networks, while CutMix is tailored for vision image transformers. PatchShuffling [7, 35, 36] is a method specifically designed for transformer-structured models, where patches are shuffled among a batch of samples. However, all these methods must provide sufficient privacy guarantees at the cost of significant performance decreases.

Xiao et al. [6] utilized adversarial learning to enable clients to generate intermediate results that the server cannot use to reconstruct images. Shredder[33] introduced noise based on mutual information, while DISCO [37] employed a channel obfuscation method to process features before transmitting them to the server. However, all three of these methods are only applied during the inference stage and require training a network to process features.

III Preliminaries

III-A Local Differential Privacy

Differential privacy was originally conceived to ensure that an adversary’s ability to compromise the privacy of any set of users remains unchanged by an individual’s decision to opt into or out of the dataset [38]. This characteristic ensures that an adversary cannot glean additional information about any specific individual, thereby extending its capability to prevent inversion attacks such as reconstructing private data.

We assume we have two sets $X$ and $X^{\prime}$ , with only one sample difference between them. We say a privacy-preserving mechanism $\mathcal{M}$ is locally differentially private (LDP) if we cannot differentiate between $X$ and $X^{\prime}$ based on the outputs of these two sets by $\mathcal{M}$ . We provide a formal definition here.

Definition III.1.

( $(\epsilon,\Delta)-LDP$ ) A mechanism $\mathcal{M}$ is $(\epsilon,\Delta)-$ LDP if for two adjacent input sets $X$ and $X^{\prime}$ , and a set $\mathcal{O}$ of all possible outputs,

{\rm Pr}[\mathcal{M}(X)\in\mathcal{O}]\leq e^{\epsilon}\cdot{\rm Pr}[\mathcal{M}(X^{\prime})\in\mathcal{O}]+\Delta

(3)

We call this $\epsilon$ privacy budget. In common cases, a larger privacy budget implies easier differentiation between the two probabilities, indicating weaker privacy-preserving ability, and vice versa.

We can apply a differential privacy mechanism to either model weights or features. DPDM [39] and DPGM [40] have implemented differential privacy mechanisms over the model weights of a diffusion model. However, since our focus is on split learning, we need to concentrate on mechanisms applied to the inputs or features. To achieve privacy protection against reconstructing private images, we can add noise to the original inputs or intermediate features [41], making it $(\epsilon,\Delta)$ -LDP. The noise added can be Gaussian noise [30, 42].

Definition III.2.

(( $\epsilon,\Delta)-LDP$ noise adding.) A mechanism of adding noise over samples is $\epsilon-$ LDP if the Gaussian noise follows the normal distribution

\mathcal{N}\sim\left(0,2\ln\frac{1.25}{\Delta}\alpha^{2}\cdot\frac{1}{\epsilon^{2}}\right)

(4)

In literature, $\alpha$ is called sensitivity. It is the biggest $L_{2}$ distance between all possible inputs or intermediate features we are going to add noise.

We will later use this formulation of adding noise in the diffusion model to propose an $(\epsilon,\Delta)$ -LDP mechanism over features. Randomized response [43] is another typical mechanism to achieve local differential privacy. It encodes each real value in a feature into bits and randomly flips each bit. A drawback of existing LDP mechanisms, whether applied to model weights or features, is that they aim for privacy preservation at the expense of the quality of generated images.

III-B Training ControlNet with Split Learning

We first introduce the deployment when we want to fine-tune a diffusion model with ControlNet using split learning. Though this is engineering work to combine two existing frameworks, we would still like to provide the essential rationale behind how we decide the cut layers and model deployments over clients and the server. Different from the partition of dense models such as ResNet [44], which has the structure of blocks being sequentially placed, the conditional diffusion model contains two parts: a diffusion model with frozen weights and a trainable control network. The structure of the ControlNet is shown in Fig. 4. The control network will first process the condition images with a particular condition encoder (during condition encoding) trained from scratch and then mix it with the noisy latent representation as the input of the following blocks.

The diffusion model will first use a pre-trained encoder to convert the original image into the latent representation. The diffusion model has encoder blocks and decoder blocks. As they are all frozen, no parameters need to be updated. But the decoder blocks need to calculate gradients of parameters so as to update the control network parameters during backpropagation. Each diffusion model decoder will take the output of the corresponding encoder and the output of the corresponding block in the control network as the inputs. These inputs are fed into the decoder with a jump connection using a similar structure as the UNet [45].

For each encoder and decoder block in the original stable diffusion model and each encoder block in the control network, we will also put the text prompts and timestep as the inputs to realize the text-to-image generation. The condition encoder and the autoencoder $\mathcal{E}$ only take the conditions or original images as the inputs. During the forward process, the clients need to send text prompts as well as timesteps to the server.

Considering hiding the complete model weights of the well-trained diffusion models from the clients and achieving the best tradeoff between privacy and efficiency through choosing different partition points, we cut right after the first diffusion model encoder and the trainable condition encoder. If we cut deeper, the server still needs the output of the previous encoder blocks as the inputs of the decoders. This will not provide a better privacy guarantee but will increase the transmission and computation overhead on the clients.

We send the output of the model and gradients back to the clients In such a way, the images are generated on the clients. The output is the $n$ going to be used in Eq. 1 for denoising. It is an inferred random noise following the standard normal distribution. The server will not be able to generate images with only knowing $n$ . Regarding the diffusion model, if we place the partition point before the first encoder block, the server can subtract the estimated noise $n$ from received $z_{t}$ in Eq. 1 to recover the $z_{0}$ and retrieve the private images.

IV Speed Up by Not Sending Gradients Back

TABLE I: Comparison of memory usage, training time, and transmission overhead for different split learning structures training ControlNet.

{\rm M_{c}}

and

{\rm M_{s}}

denote GPU memory usage (in GB) on the client and server, respectively.

{\rm T_{c}}

and

{\rm T_{s}}

indicate training time (in hours) on the client and server.

{\rm T}

represents transmission overhead (in GB).

Structure	${\rm M_{c}}$	${\rm M_{s}}$	${\rm T_{c}}$	${\rm T_{s}}$	${\rm T}$
Split learning	2.78	22.04	22.46	14.10	559.17
Ours	2.75	22.04	0.446	14.10	186.56

To do split learning in practical use cases, we propose a new deployment structure to address the efficiency bottleneck. This design ensures that the server does not need to send back gradients, thereby removing the sequential dependency between clients and the server during training. Instead of training a condition encoder for each different condition, we propose to replace it with the pre-trained encoder used in the stable diffusion model. This way, clients only need to perform inference, allowing them to continuously forward without waiting for gradients from the server. This approach addresses the bottleneck caused by the sequential training manner.

As the clients share the same pre-trained model and the server model is shared between all clients, we do not need to aggregate client models. This makes the trained ControlNet have the same performance as centralized training. Besides that, since the condition encoder and pre-trained encoder both only need images as inputs, the replacement will not cause the outputs to have distribution drift. Hence, image generation performance will not be affected.

We compare the memory usage, training efficiency, and transmission overhead of these two structures in Table I. The diffusion model is stable diffusion V-1.5, ControlNet is of version 1.1, and the autoencoder is ViT-Large-Patch14 CLIP model [11]. The input resolution is $512\times 512$ . The NVIDIA A100 serves as the server’s device, and the NVIDIA A4500 is used for clients’ devices. The training batch size is 4, and the model is trained for $2.5\times 10^{3}$ iterations. The number of clients is set the same as in federated learning, which is 50, with each client having 1000 training samples.

Without sending back the gradients, our new structure can save much transmission overhead. Additionally, by eliminating the forward-backward lock between the client and the server, the clients, server, and intermediate data transmission can operate in a parallel pipeline. Clients no longer need to wait for other clients or the server, reducing the time required for each client. Our whole training time is $\max(\{\rm T_{c},T_{s},T/r\})$ while the original split learning structure needs ${\rm T_{c}+T_{s}+T/r}$ for training, where $r$ is the data transmission rate. We can increase the number of clients if we want, but since $T_{s}$ is much larger than $T_{c}$ , the whole training time is the same.

V Re-evaluating Potential Attacks

V-A Potential Threats in Split Learning

V-A1 Threat modeling

We begin by defining the threat model in practical scenarios. We assume the server to be honest but curious. In our designed split-learning structure, although the server does not need to send gradients back to the clients, it will still accurately complete the remaining training in each split learning iteration and send $n$ to the clients. However, simultaneously, the server will attempt to reconstruct private data using the received intermediate features. The server can conduct the reconstruction process in the background, ensuring that clients remain unaware of the attacks.

In our evaluation, we do not consider clients to be malicious. In split learning, a client receives no data if using our proposed framework. Therefore, malicious or colluding clients cannot obtain data related to constructing private images from other clients. However, malicious or colluding clients may send maliciously constructed data to the server to launch other attacks, such as harming model utility. Such cases are detectable as the model cannot generate the correct results. As our focus is on adversaries attempting to reconstruct private images, we do not consider that type of threat.

V-A2 Attacking methods

Several threats have been specifically proposed in split learning, ranging from the leakage of inputs to labels. The most threatening attack is the inversion attack, which attempts to reconstruct original private data based on the received intermediate feature. We summarize typical inversion attack methods proposed in previous literature in Fig. 5. There are two typical methods to do such an attack.

The first method is based on gradient descent. In this type of attack, the adversary first constructs a randomized input or an input with prior knowledge about the private data. This input is then forwarded through a saved client model on the server, and the reconstruction loss (usually MSE loss) between the output from the randomized input and the received intermediate features is minimized. After several iterations of gradient descent, the randomized input will be optimized to resemble the private data, which we consider to be the reconstruction of private data. The attacker can launch these attacks under a white-box setting [24] if it knows the parameters of client models; otherwise, it operates under the black-box setting.

In the black-box setting, the first method is a query-based attack [25] where the server sends specific designed inputs to the clients and observes the corresponding intermediate feature output. These queries are not included in all necessary transmitted data discussed in this paper. The second attack method, UnSplit [9], does not require such queries. In the UnSplit attack methodology, a client model replica, denoted as $M$ , is initialized on the server along with a training sample represented as $x$ . The parameters of the guessed client model are designated as $\theta$ . Following the completion of the UnSplit attack, the converged training samples are utilized as the desired reconstruction of private data. During each iteration of split learning, upon receiving intermediate features denoted as $\hat{h}$ , the server feeds the training sample into the guessed client model to obtain the output. Subsequently, the server undergoes multiple inner iterations to update $x$ using $\nabla_{x}L_{\rm MSE}(M_{\theta}(x),\hat{h})$ , followed by several inner iterations to update $\theta$ using $\nabla_{\theta}L_{\rm MSE}(M_{\theta}(x),\hat{h})$ . These steps are iteratively performed until convergence is achieved.

The second type of inversion attack is based on training an inverse network [7, 8, 25]. In this approach, the attacker first trains an inverse network on a public dataset, which is assumed to have a similar distribution as the private dataset. The inverse network takes the intermediate features as inputs and outputs the reconstructed private data. During the training of inverse networks, if it is under a white-box setting, the attacker will directly use known client model weights to train an inverse network. Otherwise, the attacker will first train an estimated client model using known server model weights on the same public dataset and then use this estimated client model to train an inverse network.

Apart from the leakage caused by the most threatening inversion attack, there are other concerns regarding data privacy leakage. Label leakage [26] assumes that labels contain private information. The server can infer the private label by observing the distribution of gradients that will be sent back to the clients. However, such an attack is only applicable for binary classification tasks in split learning. Inference attacks [27] steal private data by sending attacker-designed gradients to fool the client models into sending features that can be used by the attacker to reconstruct the private data.

Another potential leakage is the leakage of text prompts. As depicted in Fig. 4, during the training of the ControlNet, the server requires inputting the text prompts into the encoders and decoders of both the original stable diffusion model and the control network. Consequently, clients are required to upload their private text prompts to the server. Some may argue that prompts are short descriptive texts, containing limited useful private information. However, the server could utilize the known prompts to extract a training dataset [29]. Therefore, clients must keep the prompts confidential from the server.

V-B Re-evaluating the Validity of Assumptions

V-B1 The client model weights can be kept secretly

In a white-box setting [24], the client model weights are known. However, in real-world scenarios, the client does not need to disclose the model weights to the server for split learning to function. Even if an adversary manages to steal the client model weights, clients can simply re-initialize the model with different parameters. During the training process, if the client model is trainable, its weights will change in each iteration, making such an assumption invalid. The only potential vulnerability arises if the client model is a pre-trained model. Since pre-trained weights are typically publicly available on the Internet, such an attack could pose a threat.

V-B2 The client can do split learning without providing prior knowledge about private data to the server

In real-world split learning scenarios, the server only requires the client model for training, operating without any knowledge of the private data. In a black-box setting, it is assumed that the adversary possesses prior knowledge about the private data, enabling it to train an inverse network on public data. For instance, Yao et al. [7] employed CelebA [46] as the public dataset and LFWA [47] as the private dataset, both containing human faces. However, in practical contexts, the availability of a public dataset exhibiting such a correlation with private datasets remains uncertain.

V-B3 The client can reject the query request

In a specific inversion attack, an adversary must query the client model with samples supplied by the server [25]. However, in the standard split learning setup, clients do not need to respond to any queries from the server; the split learning still works. Therefore, to counter such an attack, clients can simply reject all queries originating from the server. One may argue that the server could construct these queries in a manner resembling gradients, making them indistinguishable to clients. However, with our structure that eliminates the need for gradient back-sending, such concerns are mitigated.

In inference attacks [27], if the client model is trained with gradients designed by the attacker, the resulting model will inevitably experience a performance decline. Users can easily detect this degradation in performance and cease using the compromised server. Furthermore, our designed structure offers a straightforward defense against such attacks as we do not need to train client models.

V-C Re-evaluating the Effectiveness of Attacks

In summary, practical applications of split learning face four threats. The first is a potential attack using gradient descents in a white-box scenario, particularly if clients utilize pre-trained weights. The second threat is an UnSplit attack, while the third involves training inverse networks to infer private data without prior knowledge of the data. The fourth threat is the leakage of text prompts.

V-C1 Metrics for privacy-preserving effectiveness

An honest-but-curious server aims to reconstruct private data based on intermediate results. We evaluate the similarity between reconstructed images and private images using peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [48]. Private images encompass users’ natural and conditional images. Both SSIM and PSNR utilize image pixel values ranging from 0 to 255. PSNR assesses image reconstruction quality, while SSIM gauges image similarity. Lower SSIM and PSNR values signify decreased image similarity, indicating improved privacy preservation.

V-C2 Attack by gradient descents

In the original structure, since the server lacks knowledge of the condition encoder weights, we resort to the UnSplit attack method, following the procedure outlined in UnSplit [9]. This attack involves updating the inputs based on the mean squared error (MSE) loss between intermediate results and outputs generated by randomly initialized inputs, iterated over 100 loops. Subsequently, these inputs are used to update the weights of the guessed client model, which is also initialized randomly on the server, for another 100 loops. This process is repeated for a total of 100 outer loops. We optimize the randomized model weights and inputs using the Adam optimizer with a learning rate of 0.001. The loss function utilized is $\mathcal{L}=\mathcal{L}_{MSE}+\mathcal{L}_{L_{2}}$ . For training the ControlNet, the dataset used is MS-COCO [22]. The successful reconstruction of images using the UnSplit method is illustrated in Fig. 6.

In the gradient back-sending free structure, the server possesses knowledge of the weights of the pre-trained condition encoder. Consequently, the server can launch attacks in a white-box setting. For each attack, we conduct 1000 iterations using the Adam optimizer with a learning rate of 0.001 and MSE as the loss function. Regarding the reconstruction of the original image with the output of the SD encoder block 1, the PSNR is 3.39, and the SSIM is 0.12. For reconstructing the condition image, the PSNR is 5.95, and the SSIM is only 0.002. As depicted in Fig. 7, the reconstructed images are far from recognizable. This ineffectiveness is attributed to the pre-trained autoencoder’s complex model structure, which incorporates dropout layers and batch normalization layers. Between the two runs, even with identical inputs, variations in outputs occur due to dropout layers. In dropout layers, the operation of zeroing elements also nullifies the gradient, making methods relying on gradient descent ineffective. Given the ineffectiveness of the white-box setting, we do not need to test the black-box setting Unsplit attack.

V-C3 Attack using inverse networks

TABLE II: The inverse networks for inversion attacks have two types: Type 1 reconstructs the original image; Type 2 reconstructs the condition image. Type 1&2 denotes layers used in both structures. Padding size is 1 and kernel size is 3.

Input	Operator	Stride	#Out	Structure	Activation
$64^{2}\times 320$	Conv2d	1	320	Type 1	SiLU
$64^{2}\times 320$	Conv2d,	1	256	Type 1	SiLU
$64^{2}\times 256$	Upsample	2	96	Type 1	SiLU
$64^{2}\times 4$	Upsample	2	96	Type 2	SiLU
$128^{2}\times 96$	Conv2d	1	96	Type 1&2	SiLU
$128^{2}\times 96$	Upsample	2	32	Type 1&2	SiLU
$256^{2}\times 32$	Conv2d	1	32	Type 1&2	SiLU
$256^{2}\times 32$	Upsample	2	16	Type 1&2	SiLU
$512^{2}\times 16$	Conv2d	1	16	Type 1&2	SiLU
$512^{2}\times 16$	Conv2d	1	3	Type 1&2	Sigmoid

For this attack, we examine two aspects: reconstructing the original image and the condition image. Since the server knows the stable diffusion model, it can directly use it to train an inverse network. The key question is how similar private and public datasets are. We choose MS-COCO as the public dataset and CelebA [46] and ImageNet [49] as the private datasets. MS-COCO and ImageNet contain wild images, while CelebA comprises over 200K faces from more than 10K celebrities.

The inverse network is trained on the public dataset and then evaluated on the private dataset. We use the AdamW optimizer with a learning rate of $1\times 10^{-5}$ , a batch size of 8, and train it for $7.5\times 10^{4}$ iterations. The structure of the inverse network is shown in Table II. As illustrated in Fig. 8, this attack is ineffective on both datasets. For CelebA, the PSNR is 5.87 and SSIM is 0.73, while for ImageNet, the PSNR is 6.56 and SSIM is 0.71. The reconstruction results are hardly recognizable as the original private images seen by human eyes.

Secondly, for the reconstruction of the condition image, if we employ our proposed structure, the server can directly train the inverse network. However, in the original structure, the server first utilizes its model weights to train an estimated client model on the public dataset and then trains the inverse network. Unfortunately, this attack is effective for both structures when considering condition images. We will present the results and defense mechanisms in the following sections.

V-D Summary

We summarize potential split learning attacks in Table III and their effectiveness. The remaining effective method is inverse network-based attacks for reconstructing condition images. Another valid threat is the leakage of text prompts.

TABLE III: Summary of existing attacks in split learning, assessing validity and effectiveness in the diffusion model scenario, using

\checkmark

for successful data reconstruction and

\times

otherwise; – denotes N/A.

		Original structure			Our structure
		Valid?	Raw image	Condition image	Valid?	Raw image	Condition image
Gradient descent	White-box	$\checkmark$	$\times$	–	$\checkmark$	$\times$	$\times$
	Query-based	$\times$	–	–	$\times$	–	–
	Black-box	$\checkmark$	–	$\times$	$\times$	–	–
Inverse network	White-box	$\checkmark$	$\times$	–	$\checkmark$	$\times$	$\checkmark$
Inverse network	Black-box	$\checkmark$	–	$\checkmark$	$\times$	–	–
Label leakage	–	Invalid: only applicable to binary image classification.
Inference attack	–	Invalid: detectable as the model cannot generate the correct results.
Text prompt leakage	–	The assumption is valid.

VI Privacy-Preserving Training of ControlNet

VI-A Local Differential Private Timestep Sampling

We have empirically shown that ControlNet itself is quite effective in defending against several attacks in split learning. If we look at its structure carefully, we will find that the forward process in Fig. 1 has already contained the process of adding noise over the latent representation. We will next show that such a mechanism is $(\epsilon,\Delta)-$ LDP using the Definition III.2. Based on this property, we propose a new sampling scheme over timesteps during the diffusion process, preserving privacy.

With a given latent representation $z_{0}$ , we will generate the noisy latent representation according to the timestep $t$ , scheduling parameter $\beta_{t}$ and a randomly generated noise $\hat{n_{t}}\sim\mathcal{N}(0,1)$ :

		$\displaystyle z_{t}=\sqrt{1-\beta_{t}}z_{0}+\sqrt{\beta_{t}}\hat{n_{t}}$		(5)
		$\displaystyle\frac{z_{t}}{\sqrt{1-\beta_{t}}}=z_{0}+\sqrt{\frac{\beta_{t}}{1-\beta_{t}}}\hat{n_{t}}$		(5)

According to Fig. 4, $z_{t}$ is the input to the first encoder block of stable diffusion model. Because $\beta_{t}$ is usually a small number, we approximate Eq. 5 as,

z_{t}\approx z_{0}+\sqrt{\frac{\beta_{t}}{1-\beta_{t}}}\hat{n_{t}}

(6)

We can view this equation as adding a noise following distribution $\mathcal{N}(0,\frac{\beta_{t}}{1-\beta_{t}})$ over $z_{0}$ to get $z_{t}$ . We then substitute the variance in Definition III.2. For convenience, we notate $H$ as hyper-parameter $2\ln\frac{1.25}{\Delta}\alpha^{2}$ .

		$\displaystyle\frac{\beta_{t}}{1-\beta_{t}}=H\cdot\frac{1}{\epsilon^{2}}$		(7)
		$\displaystyle\epsilon=\sqrt{H\cdot\frac{1-\beta_{t}}{\beta_{t}}}$		(7)

In the diffusion model, we employ the linear scheduling as the default method which is $\beta_{t}=k\cdot t+\beta_{0}$ , where $k$ , $\beta_{0}$ are scheduling parameters. Hence, we can derive the following relationship between privacy budget $\epsilon$ and $t$ , $k$ , and $\beta_{0}$ :

\epsilon(t,k,\beta_{0})=\sqrt{H\cdot\left(\frac{1}{kt+\beta_{0}}-1\right)}

(8)

From this equation, we can see that the privacy budget is related to the timestep. However, to protect different types of images, we need to set different levels of privacy budgets. For example, during the inversion attack for the original image, even if the $t=0$ , privacy can still be protected. However, for condition images which are simpler than detailed natural images, the budget at $t=0$ is not enough to protect privacy. In such a case, we need to set bigger privacy budgets. Fortunately, based on Eq. 8, we can set proper privacy budget by setting different $t,k,\beta_{0}$ in fine-tuning ControlNet.

Theorem VI.1.

( $(\epsilon,\Delta)$ - $\rm{LDP}$ timestep sampling mechanism in diffusion model) With a given privacy budget $\epsilon$ , we can have a sampling process in diffusion model which is $(\epsilon,\Delta)$ - $\rm{LDP}$ . The value of $\epsilon$ is set by a timestep ranging in $[t_{s},t_{\max}]$ and scheduling parameters $k$ and $\beta_{0}$ , according to Eq. 8.

VI-B Noise-Confounding Activation Function

However, as we can see from the Eq. 5, if we directly send the encoded condition mixed with the noisy latent representation to the server, as the server knows the timestep $t$ and the label $\hat{n}$ , it can directly subtract the added noise from Eq. 5. As a result, based on observation of the success defense by the diffusion model part, we find that leveraging some functions, especially non-linear, after the noisy latent representation and before sending to the server will confuse the attacker from stealing privacy. Hence, we propose to add a noise-confounding activation layer before sending features to the server. To design an activation function keeping privacy-preserving property while maintaining the image generation performance, we pass the sum of the encoded condition and the noisy latent representation through such a function:

y=\lvert x\rvert\cdot\left(\frac{2}{1+e^{-x}}\right)+\delta

(9)

The $\delta$ is a randomized noise following distribution $\mathcal{N}\sim(0,1)$ . The noise $\delta$ is randomized at the beginning of the training and fixed during the training. The server or the attacker has no access to $\delta$ . The function graph of Eq. 9 is shown in Fig. 10. The functionality of this function is to prevent the attacker from inferring the sum of the latent representation and encoder condition. In order to maintain image generation performance, we adopt a symmetric design with an SiLU-like shape. The SiLU function [50] is a widely used activation function, which can help improve the performance of neural networks. As $\delta$ is fixed during the training, the quality of the summation will not be degraded.

Another solution is to put the SD Encoder Block 1 of the control network on the clients. However, such moving will increase the computation overhead on the clients and break the structure of gradient back-sending free. On the other hand, adding this activation function will help preserve privacy without adding any overhead. Besides, moving the place of SD encoder block is specific for ControlNet while adding activation function can also apply to other conditional diffusion model such as T2I-Adapter [18].

VI-C Prompt-Hiding Training

Apart from keeping the privacy of images, text prompts can also contain private information. In the original ControlNet, for each encoder block and decoder block in the diffusion model and control network, text prompts will be input into the blocks and attention modules will be applied in these blocks to let the generate images learn the text prompts. As a result, the clients have to upload the text prompts or the text features to the server. The server can get the raw text information and privacy will be leaked. Uploading the output features of text encoder may be able to let the server not get texts, but the server can still use the same method proposed by Carlini et al. [29] to extract training dataset using text features as inputs.

As a result, to hide prompts from the server, we propose to not send the text prompts to the server directly. During the training, only the SD Encoder Block 1 in Fig. 4 on the clients, will take in text prompts as the input. The text prompts will not be uploaded to the server. Therefore, other encoder and decoder blocks in ControlNet, situated on the server, won’t utilize text prompts as inputs. Removing text prompts will not affect the performance of condition and image encoders as they are irrelevant to prompts. However, the input distribution of server-side encoders and decoders has changed. To maintain high-quality image generation, we introduce the following prompts-hiding training methods.

As the diffusion model is frozen and able to always keep the generation performance of a well-trained diffusion model while the control network needs further training, we use different policies for the encoder and decoder blocks in the control network and diffusion model. For blocks in control networks, no text features will be input and the attention modules will be replaced by self-attention modules for the condition features. Since we still need to further fine-tune the control network, the distribution drift caused by removal of text input will be mitigated during training.

To maintain the high image generation performance of the frozen diffusion model, we will keep the text attention modules but input a zero text feature. The zero text feature has the same feature dimension (by default 768) as the original text feature but with a length of one and always has the weights of zero. We need to keep the distribution of text input the same for these blocks as they will not be further trained. Otherwise distribution drift will affect image generation performance.

VII Evaluation

TABLE IV: Comparison of image generation and privacy preservation among different methods, where arrow directions indicate superior image quality and increased difficulty in recognizing reconstructed data.

Defendable by our structure without privacy-preserving methods
Condition	Scribble						Segmentation
Methods	Performance		Privacy				Privacy
	FID $\downarrow$	CLIP $\uparrow$	CelebA		Imagenet		Imagenet
	FID $\downarrow$	CLIP $\uparrow$	PSNR $\downarrow$	SSIM $\downarrow$	PSNR $\downarrow$	SSIM $\downarrow$	PSNR $\downarrow$			SSIM $\downarrow$
Centralized	19.53	26.04	–	–	–	–	–			–
SL	19.46	26.87	14.41	0.37	8.17	0.35	11.53			0.50
Ours	13.45	26.85	13.15	0.37	7.34	0.47	9.95			0.47
Not defendable by our structure without privacy-preserving methods
Condition	Canny						Segmentation				Attack works?
Methods	Performance		Privacy				Performance		Privacy
	FID $\downarrow$	CLIP $\uparrow$	CelebA		Imagenet		FID $\downarrow$	CLIP $\uparrow$	CelebA
	FID $\downarrow$	CLIP $\uparrow$	PSNR $\downarrow$	SSIM $\downarrow$	PSNR $\downarrow$	SSIM $\downarrow$	FID $\downarrow$	CLIP $\uparrow$	PSNR $\downarrow$	SSIM $\downarrow$
Centralized	11.60	26.42	–	–	–	–	15.23	26.82	–	–	$\checkmark$
SL	11.46	26.61	18.54	0.89	23.10	0.94	17.74	27.76	11.80	0.45	$\checkmark$
Ours	18.59	26.21	18.86	0.73	22.84	0.86	14.35	26.92	12.18	0.49	$\checkmark$
Ours+t	16.80	26.20	18.86	0.73	22.84	0.86	15.68	26.70	12.18	0.49	$\checkmark$
Ours+c	14.52	26.80	17.45	0.51	21.74	0.70	15.05	26.85	1.68	0.46	$\times$
Ours++	16.80	26.50	17.45	0.51	21.74	0.70	16.32	26.39	1.68	0.46	$\times$
LDP rr	18.11	27.22	18.86	0.82	23.77	0.97	17.49	27.23	14.92	0.72	$\checkmark$
LDP 0.1	18.00	27.15	16.84	0.03	19.70	0.04	17.96	27.15	7.56	0.33	$\times$
LDP 0.3	17.28	27.12	18.65	0.79	23.33	0.88	17.21	27.13	8.41	0.36	$\checkmark$ / $\times$
LDP 0.5	12.27	26.53	19.81	0.90	24.31	0.95	17.46	27.21	11.21	0.51	$\checkmark$
Add 1	11.77	26.60	25.69	0.98	31.02	0.995	17.51	27.30	22.96	0.88	$\checkmark$
Add 50	19.69	26.84	25.53	0.99	30.60	0.99	17.60	27.29	23.05	0.90	$\checkmark$
Mixup	401.62	13.54	17.96	0.14	22.84	0.19	384.24	13.99	13.45	0.73	$\times$ / $\checkmark$
PS	17.39	27.16	21.25	0.95	25.85	0.98	17.62	27.22	22.64	0.92	$\checkmark$
FedAvg	19.10	26.92	–	–	–	–	17.51	27.26	–	–	$\times$

•

For privacy: $\checkmark$ and $\times$ whether the attack is able to reconstruct condition image. – means not applicable.

VII-A Experimental Settings

We first introduce details about experimental settings and then evaluate the effectiveness of our methods and state-of-the-art mechanisms defending against successful inversion attacks. We conduct all our experiments on Plato[51], an open-source research framework for deploying decentralized training on multiple devices. Plato can support large-scale decentralized training and help deploy the server and the clients on separate devices conveniently. We use the same setting as previous where we have 50 clients in total and each has 1000 training samples. The number of clients will effect efficiency and scalability but will not effect image generation performance or privacy-preserving ability. Since the main focus of this paper is on the latter two aspects, we do not particularly study other settings.

For the pre-trained models, we used stable diffusion V-1.5 and ControlNet V-1.1. The autoencoder is from pre-trained CLIP model with ViT-Large-Patch14 [11]. The resolution of the input and generate images is $512\times 512$ . We used the MS-COCO [22] as the training dataset for fine-tuning diffusion models to generate high quality images with given conditions. The MS-COCO dataset contains over 120K wild images with proper prompts. It is a common dataset used in fine-tuning large diffusion models and text-to-image generation tasks. The model is fine-tuned over MS-COCO for 25000 iterations with a batch size of 4. The rest training settings is the same as default implementation of ControlNet [2] where we use AdamW optimizer with learning rate of $1\times 10^{-5}$ . The noise coefficient $\lambda_{t}$ is 0.

Evaluation Metrics. For comparing the performance, we need to verify that the privacy-preserving method will not harm image generation performance and that an adversary will not be able to reconstruct private images. For the first objective, we use Fréchet Inception Distance [52] (FID) to evaluate quality of generated images and the CLIP score [11] to evaluate whether the prompts and generated images are matched (in range of $[0,100]$ ). We use the MS-COCO validation set with over 5000 images to evaluate the quality of generated images. Lower FID indicates better quality of generated images. A higher CLIP score indicates that the text prompts and the generated images match to each other better.

For the second prospective, we use PSNR and SSIM as mentioned in previous sections. The images are generated with the same random seed. The settings for evaluating privacy against reconstructing private data is the same as in Section V. Lower PSNR and SSIM indicate the reconstructed images are less similar to the private data, meaning better privacy-preserving effectiveness.

For inverse-network based attacks, if we use the original split learning structure, the attack is under a black-box setting. If we use our designed split learning structure, as the weights of models on clients only involve weights of pre-trained CLIP models downloaded from [53], the attack is under a white-box setting. Within the realm of conditional image generation, various tasks involve different conditions. We assess three types of conditions: canny lines, scribbles, and segmentation maps. These conditions represent a range from detailed to coarse-grained, with the lines drawn in the condition images varying accordingly.

During the training process of ControlNet, the timestep is sampled among the range of $[t_{s},t_{\max}]$ . With the default $k$ and $\beta_{0}$ , we say $\epsilon_{s}=\epsilon(t_{s},k,\beta_{0})$ . So, according to Eq. 8, during the training process, the privacy budget is equal to or larger than $\epsilon_{s}$ . Because we need to ensure that the privacy of every image is preserved, when we evaluate the effectiveness of privacy-preserving methods, in terms of both numerical data and visualization, we consider the worst case of least noise added and sample the timestep as $t_{s}$ and send the intermediate features to the server.

VII-B Implementation of Our Methods and Baselines

For our and other privacy-preserving methods, we implement them with our designed gradient sending-back free structure. For $(\epsilon,\Delta)$ -LDP mechanism, $\Delta=1\times 10^{-4}$ and we calculate that $\alpha\approx 0.16$ . The training latency remains the same after adding our privacy-preserving methods.

VII-B1 Implementation of our methods

For our privacy preserving methods, we set the $t_{\max}$ , $k$ , and $\beta_{0}$ as default in ControlNet [2] which are $1000$ , $1.115\times 10^{-5}$ and $8.85\times 10^{-4}$ respectively. If $t_{s}$ is too big, the sampling range will be too small to get enough samples. If $t_{s}$ is too small, no privacy protection will be guaranteed. Hence, we set the $t_{s}$ around middle point which is 536, which results in $\epsilon_{s}\approx 8$ . We implement our methods in three ways: only protecting conditions, only hiding prompts and both. We denote our structure without any privacy-preserving methods implemented (Section IV ) as Ours. We denote our three ways of implementation as Ours+c, Ours+t, and Ours++, respectively.

VII-B2 Implementation of baselines

We compare our methods with several state-of-the-art privacy-preserving methods which can be used for split learning with ControlNet. The rationale for showing the results with the following chosen parameters is that we would like to show the cases that the baselines neither protect privacy nor generate images of high quality. If they want to generate a good image, they should choose a smaller disruption magnitude (e.g. smaller privacy budget). This will make privacy-preserving performance weaker. On the other hand, if they want to provide stronger privacy protection, they need to use stronger disruption, which will further degrade image generation performance. These can show that they cannot find a proper solution for preserving privacy and image generation performance at the same time.

LDP rr means a mechanism called randomized response which is local differential private [43]. We implement randomized response over intermediate results following steps in literature where we set privacy budget as $2$ .

LDP number means the mechanism in Definition III.2. The number means the privacy budgets where we have three values: $0.1$ , $0.3$ , and $0.5$ .

Add number means the mechanism of adding Gaussian noise on the raw data according to the distribution $\mathcal{N}\sim(0,\sigma^{2})$ . We have two numbers: $\sigma^{2}=1$ and $\sigma^{2}=50$ .

Mixup is the method of mixing up data proposed in DataMix [8] and CutMix [34]. We mix four images together which is the same as the batch size.

PS is the method called patch shuffling [7, 35]. The patch size is set to 4, same as the batch size.

We also compare the results of image generation with several other baselines.

Centralized means images generated by the well-trained ControlNet V 1.1 from [54]. We directly use the downloaded models to generate images. This is a production-level baseline.

SL is the deployment of ControlNet with split learning without any privacy-preserving methods applied (Section III-B). We fine-tune ControlNet following steps of split learning.

VII-C Comparison Results

We present qualitative results in Table IV. The conclusion is that from numerical data and visualization, we can see that our method is the only method that can protect privacy without loss of image generation quality. The methods that can generate images correctly cannot preserve privacy well. The methods can preserve privacy well are not able to generate good images. Though more advanced methods such as Mixup and PS can provide good privacy protection in some cases, they fail to correctly generate images of good quality conforming to the conditions.

One of our new insights is that all previous privacy-preserving methods try to propose a general method for split learning, overlooking the variance between different use cases. We can easily extend methods like DataMix from image classification to different tasks. However, they cannot achieve satisfactory performance when we really verify them on the task of image generation. Our privacy-preserving method is tailored for diffusion models, considering the specialty of the overall model structure of diffusion model to how prompts are processed. Let’s take a look at the detailed analysis.

VII-C1 Maintenance of image generation performance

An interesting result is that our designed split learning structure not only improves the efficiency but also improves the quality of generated images, reflected by FID. For example, on scribble conditions, we can improve FID from 19.53 to 13.45. This is possible as the pre-trained CLIP model is well-trained on large datasets. While for other methods, such as LDP Gaussian noise adding, though they can provide strong privacy protection with a small privacy budget, they need to sacrifice the image generation quality. Furthermore, our methods preserve data privacy regardless the number of samples on each client. Clients only need to do inferences with our designed structure. The results of inference are irrelevant to the number of samples passed through the models. Another reason is that our methods do not mix several training samples like what Mixup and Patch Shuffling did.

VII-C2 Privacy-preserving ability

For canny conditions, attackers find it easier to reconstruct private data, while for scribble conditions, it is much harder. However, as canny contains richer information, including complex lines, protecting such conditions is crucial. We analyze privacy concerns for individual conditions and private datasets separately. In segmentation, attack success depends on the datasets. In cases where split learning with the original and our designed structure can defend against existing attacks, our methods can enhance privacy. For other cases, having risks of data leakage, our methods can protect privacy. For instance, we can reduce PSNR from 11.80 to 1.68.

VII-D Ablation Study of Privacy Budgets

In Theorem VI.1, we can set different privacy budgets with proper $k$ and $\beta_{0}$ . In Fig. 11, we change privacy budgets by setting different scheduling parameters $k$ and $\beta_{0}$ respectively. In the default setting of our method, the privacy budget is $8$ . We try the other two cases of setting privacy budgets as $0.3$ and $2$ . As shown in Fig. 11, our method can still generate images of good quality. However, for the baseline methods, they fail to generate good images and protect privacy when they use the same privacy budgets of $0.3$ and $2$ .

VIII Conclusion

In this paper, we address the challenge of fine-tuning ControlNet models with locally distributed data across multiple users, focusing on feasibility and privacy. We initiate the study with federated learning and find that conventional federated learning is not suitable due to high GPU memory requirements during training, the unavailability of stable diffusion models, and empirically proven performance degradation. So we turn to split learning to solve such a problem where we first improve the structure so that the server does not need to send gradients back to the clients, greatly improving efficiency. Through in-depth study of existing attacks in split learning. We discover that the effectiveness of most existing methods is weakened. For the remaining threats, we propose differential private timestep sampling, a noise-confounding activation function, and prompts-hiding training, based on the built-in mechanisms in diffusion models with tunable privacy budgets. We show convincing results from a wide array of experiments that our method can provide stronger privacy protection without loss of image generation performance and train the models faster than its state-of-the-art alternatives in the literature.

References

[1] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 2256–2265.
[2] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3836–3847.
[3] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
[4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of he 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017, pp. 1273–1282.
[5] O. Gupta and R. Raskar, “Distributed Learning of Deep Neural Network over Multiple Agents,” Journal of Network and Computer Applications (JNCA), vol. 116, pp. 1–8, 2018.
[6] T. Xiao, Y.-H. Tsai, K. Sohn, M. Chandraker, and M.-H. Yang, “Adversarial learning of privacy-preserving and task-oriented representations,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, pp. 12 434–12 441.
[7] D. Yao, L. Xiang, H. Xu, H. Ye, and Y. Chen, “Privacy-preserving split learning via patch shuffling over transformers,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2022, pp. 638–647.
[8] Z. Liu, Z. Wu, C. Gan, L. Zhu, and S. Han, “DataMix: Efficient privacy-preserving edge-cloud inference,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 578–595.
[9] E. Erdoğan, A. Küpçü, and A. E. Çiçek, “UnSplit: Data-oblivious model inversion, model stealing, and label inference attacks against split learning,” in Proceedings of the 21st Workshop on Privacy in the Electronic Society (WPES), 2022, p. 115–124.
[10] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 684–10 695.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning (ICML), 2021, pp. 8748–8763.
[12] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), no. 6, pp. 679–698, 1986.
[13] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 44, no. 3, pp. 1623–1637, 2020.
[14] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2015, pp. 1395–1403.
[15] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 633–641.
[16] S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao, “Latent consistency models: Synthesizing high-resolution images with few-step inference,” arXiv preprint arXiv:2310.04378, 2023.
[17] W. Hecong, “ControlLoRA Version 2: A Lightweight Neural Network To Control Stable Diffusion Spatial Information Version 2,” 9 2023. [Online]. Available: https://github.com/HighCWu/control-lora-2
[18] C. Mou, X. Wang, L. Xie, J. Zhang, Z. Qi, Y. Shan, and X. Qie, “T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” arXiv preprint arXiv:2302.08453, 2023.
[19] L. Huang, D. Chen, Y. Liu, Y. Shen, D. Zhao, and J. Zhou, “Composer: Creative and controllable image synthesis with composable conditions,” arXiv preprint arXiv:2302.09778, 2023.
[20] J. Yu, Y. Wang, C. Zhao, B. Ghanem, and J. Zhang, “FreeDoM: Training-free energy-guided conditional diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 23 174–23 184.
[21] Y. Yang, B. Hui, H. Yuan, N. Gong, and Y. Cao, “PrivateFL: Accurate, differentially private federated learning via personalized data transformation,” in Proceedings of the USENIX Security Symposium (USENIX Security), 2023, pp. 1595–1612.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proceedings of the European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
[23] Y. Chen and Q. Yan, “Privacy-preserving diffusion model using homomorphic encryption,” arXiv preprint arXiv:2403.05794, 2024.
[24] Y. Zhang, R. Jia, H. Pei, W. Wang, B. Li, and D. Song, “The secret revealer: Generative model-inversion attacks against deep neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020, pp. 253–261.
[25] Z. He, T. Zhang, and R. B. Lee, “Model inversion attacks against collaborative inference,” in Proceedings of the 35th Annual Computer Security Applications Conference (ACSAC), 2019, pp. 148–162.
[26] O. Li, J. Sun, X. Yang, W. Gao, H. Zhang, J. Xie, V. Smith, and C. Wang, “Label leakage and protection in two-party split learning,” in Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[27] D. Pasquini, G. Ateniese, and M. Bernaschi, “Unleashing the tiger: Inference attacks on split learning,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021, pp. 2113–2129.
[28] J. Duan, F. Kong, S. Wang, X. Shi, and K. Xu, “Are diffusion models vulnerable to membership inference attacks?” in Proceedings of the International Conference on Machine Learning (ICML), 2023.
[29] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramèr, B. Balle, D. Ippolito, and E. Wallace, “Extracting training data from diffusion models,” in Proceedgins of the 32nd USENIX Security Symposium (USENIX Security), 2023, pp. 5253–5270.
[30] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Theoretical Computer Science (TCS), vol. 9, no. 3-4, pp. 211–407, 2014.
[31] T. Titcombe, A. J. Hall, P. Papadopoulos, and D. Romanini, “Practical defences against model inversion attacks for split neural networks,” in Proceedings of the ICLR 2021 Workshop on Distributed and Private Machine Learning (DPML), 2021.
[32] P. Vepakomma, O. Gupta, A. Dubey, and R. Raskar, “Reducing leakage in distributed deep learning for sensitive health data,” in Proceedings of the ICLR AI for Social Good Workshop, vol. 2, 2019.
[33] F. Mireshghallah, M. Taram, P. Ramrakhyani, A. Jalali, D. Tullsen, and H. Esmaeilzadeh, “Shredder: Learning noise distributions to protect inference privacy,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 3–18.
[34] S. Oh, J. Park, S. Baek, H. Nam, P. Vepakomma, R. Raskar, M. Bennis, and S.-L. Kim, “Differentially private cutmix for split learning with vision transformer,” in Proceedings of the First Workshop on Interpolation Regularizers and Beyond at NeurIPS, 2022.
[35] H. Xu, L. Xiang, H. Ye, D. Yao, P. Chu, and B. Li, “Shuffled transformer for privacy-preserving split learning,” arXiv preprint arXiv:2304.07735, 2023.
[36] ——, “Permutation equivariance of transformers and its applications,” in Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[37] A. Singh, A. Chopra, E. Garza, E. Zhang, P. Vepakomma, V. Sharma, and R. Raskar, “DISCO: Dynamic and invariant sensitive channel obfuscation for deep neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12 125–12 135.
[38] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2015, pp. 1310–1321.
[39] T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis, “Differentially private diffusion models,” Transactions on Machine Learning Research (TMLR), 2023.
[40] D. Jiang, S. Sun, and Y. Yu, “Functional renyi differential privacy for generative modeling,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023.
[41] C. Dwork, “Differential privacy: A survey of results,” in Proceedings of the International conference on theory and applications of models of computation (TAMC), 2008, pp. 1–19.
[42] J. Dong, A. Roth, and W. J. Su, “Gaussian differential privacy,” Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 84, no. 1, pp. 3–37, 2022.
[43] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan, “On the complexity of differentially private data release: Efficient algorithms and hardness results,” in Proceedings of the forty-first annual ACM symposium on Theory of computing (STOC), 2009, pp. 381–390.
[44] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
[45] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proceedings of theMedical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
[46] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2015, pp. 3730–3738.
[47] G. Huang, M. Mattar, H. Lee, and E. Learned-Miller, “Learning to align from scratch,” in Proceedgins of the Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 764–772.
[48] A. Hore and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in Proceedings of the International Conference on Pattern Recognition (ICPR), 2010, pp. 2366–2369.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[50] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[51] B. Li, N. Su, C. Ying, and F. Wang, “Plato: An open-source research framework for production federated learning,” in Proceedings of the 2023 ACM Turing Award Celebration Conference (ACM TURC), 2023, pp. 1–2.
[52] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedgins of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 6629–6640.
[53] OpenAI, “openai/clip-vit-large-patch14,” https://huggingface.co/openai/clip-vit-large-patch14.
[54] lllyasviel, “lllyasviel/controlnet,” https://huggingface.co/lllyasviel/ControlNet.

Appendix A Discussion

In this paper, we resolve the question of how we can train ControlNet and diffusion models while keeping users’ data privacy. Besides the aspect of preserving privacy, there are other issues worth studying in production level split learning with ControlNet and stable diffusion. Another challenging question is how we can keep users’ data privacy during the inference stage after deploying trained ControlNet and diffusion models. The inference process is different from the training. A trivial solution is to run the inference completely on the edge device, which needs about 7.5GB of memory. The memory requirement is much less than that of training, which is feasible. However, maybe not all clients have enough memory. It is a challenge that how we can still keep user data privacy if we deploy a ControlNet across the clients and the server. From related work, we can see large efforts are being put into privacy-preserving inference in split learning. It is worth studying whether these methods are helpful during the inference stage.

In this paper, the target is towards privacy-preserving split learning with ControlNet and diffusion model. In a broader research topic, one question is how we can safely do split learning. In such a case, we may not assume every client is honest, which means some clients are malicious and not sending the correct intermediate features. To harm the interests of other clients, some clients may do backdoor attacks or adversarial attacks, diminishing the utility of the fine-tuned ControlNet and diffusion model.

In our experiments, we deploy split learning with 50 clients. We can increase the number of clients if we want, but since $T_{s}$ is much larger than $T_{c}$ , the whole training time is the same. Therefore, we do not increase the number. On the production level, it is possible that there are more than 50 clients. With our methods, we can still train ControlNet with split learning over them while preserving data privacy. A minor issue is that since the clients only need to do inference, they may send intermediate features of large amounts continuously and simultaneously. It is worth studying how the server deals with a large scale of requests simultaneously. We can expand the client number to hundreds or thousands to evluate the scalability.