
DPFormer: Learning Differentially Private Transformer on Long-Tailed Data

Youlong Ding
Shenzhen University
Shenzhen, China
[email protected]
Xueyang Wu
Hong Kong University
of Science and Technology
Hong Kong SAR, China
[email protected]
Hao Wang
Rutgers University
Piscataway, NJ, USA
[email protected]
Weike Pan
Shenzhen University
Shenzhen, China
[email protected]
Part of this work was done while at WeBank AI.
Abstract

The Transformer has emerged as a versatile and effective architecture with broad applications. However, how to efficiently train a high-utility Transformer model with differential privacy guarantees remains an open problem. In this paper, we identify two key challenges in learning differentially private Transformers: the heavy computational overhead of per-sample gradient clipping, and unintentional attention distraction within the attention mechanism. In response, we propose DPFormer, equipped with Phantom Clipping and a Re-Attention Mechanism, to address these challenges. Our theoretical analysis shows that DPFormer reduces the computational cost of gradient clipping and effectively mitigates attention distraction, which would otherwise obstruct the training process and cause significant performance drops, especially in the presence of long-tailed data. This analysis is further corroborated by empirical results on two real-world datasets, demonstrating the efficiency and effectiveness of the proposed DPFormer.

1 Introduction

Differentially private deep learning has made remarkable strides, particularly in domains such as image classification [35, 16, 10] and natural language processing [45, 27, 19]. This success can be largely attributed to the availability of extensive pre-trained models, offering robust and diverse foundations for further learning. However, such reliance on vast, pre-existing datasets poses a significant challenge when these resources are not accessible or relevant. This hurdle becomes particularly pronounced when it is necessary to train differentially private Transformers using only domain-specific data gathered from real-world scenarios rather than generic, large-scale datasets.

This issue is glaringly evident in tasks like sequence prediction, which are integral to a wide range of applications, such as commercial recommender systems. These systems are designed to model and predict user behavior based on sequences of interactions like clicks or purchases. The reliance on domain-specific sequence data coupled with the constraints of differential privacy can significantly compromise the performance of such systems, particularly when large-scale public datasets or pre-existing pre-trained models are not at our disposal.

The challenges posed by this scenario can be summarized as two key hurdles. The first stems from the inherent nature of real-world data, which typically follows a long-tailed distribution: a small fraction of tokens occur frequently, while the majority appear only rarely. This creates intrinsic hardness for high-utility differentially private training, which, given its sample complexity [12], requires a sufficiently large volume of data to discern general patterns without resorting to the memorization of individual data points [7, 14]. Our theoretical analysis shows that during differentially private training of Transformers, attention scores tend to be skewed toward long-tailed tokens (i.e., tokens with few occurrences), leading to large performance drops.

The second hurdle arises from the resource-intensiveness of deep learning with differential privacy, which is primarily due to the requirement of clipping per-sample gradients. This requirement not only complicates the learning process but also imposes a significant computational burden, especially when resources are limited or scalability is a priority.

To address these issues, we propose DPFormer (Figure 1), a methodology for learning differentially private Transformers. One of our primary contributions is the introduction of an efficient technique, Phantom Clipping. This technique addresses the computational burden associated with differential privacy by computing the clipped gradient without the need for the per-sample gradient. This allows for the efficient scaling of the learning process. Additionally, to further enhance model utility under differentially private training, we introduce the Re-Attention Mechanism, which aims to mitigate the attention distraction phenomenon, ensuring the model’s focus is directed toward the most pertinent elements of the input. This enables effective learning, especially in the presence of long-tailed data.

Experiments on public real-world datasets demonstrate the efficiency and effectiveness of DPFormer. These results not only validate our approach but also demonstrate the practical applicability of our model in scenarios where data is limited and domain-specific, and differential privacy is a crucial requirement.

2 Related Work

Differentially Private Deep Learning. The most related works are [45, 27], which fine-tune Transformer-based language models with differential privacy, and [2], which trains a differentially private BERT, a masked language model built upon the Transformer. Their success is largely attributed to large-scale public information, which may not be available in some scenarios.

Effective Training of Transformers. Training Transformer models can be notoriously difficult, relying heavily on layer normalization [3], careful model initialization [22], learning rate warmup [31], etc. Beyond applying existing techniques proposed in the context of non-private learning, one intriguing research question is: can we develop effective Transformer-training techniques specialized for private training? In this paper, we make a first attempt and propose the Re-Attention Mechanism, discussed in Section 5.

Differentially Private Learning on Heavy-Tailed Data. Previous work has also considered differentially private learning algorithms on heavy-tailed data [37, 21, 23]. This line of research is mainly concerned with differentially private stochastic convex optimization (DP-SCO). Note that the notion of heavy-tailed there differs from the focus of this work: as pointed out in [23], the setting actually considered is heavy-tailed gradients caused by unbounded values in the input data.

For a detailed analysis of related work, please refer to Appendix A.

3 Preliminaries

Problem Setting: Sequential Prediction. (This setting does not compromise universality: a Transformer trained for sequential prediction can be viewed as a universal model for sequence modeling, enabling other tasks, e.g., classification, through fine-tuning.) Since Transformers are designed to predict the next token in an autoregressive manner, we focus our evaluation on sequential prediction tasks, where each training sample consists of a sequence of tokens. (Throughout this paper, we use 'token' to denote the discrete unit within the input sequence and 'vocabulary size' for the total count of relevant entities, generalizing their definitions from language modeling.) Given the preceding tokens $[s_1,s_2,\ldots,s_{t-1}]$, the task is to predict the next token $s_t$. Note that in practice (as is also the case for all our datasets), training data is typically long-tailed, in the sense that a small number of tokens occur quite frequently while others have fewer occurrences. Our goal is to train a Transformer with DP-SGD [1] such that it predicts the next token accurately while preserving differential privacy.

Definition 3.1 ($(\varepsilon,\delta)$-Differential Privacy (DP) [11, 13]).

A randomized mechanism $\mathcal{M}:\mathcal{D}\rightarrow\mathcal{R}$ satisfies $(\varepsilon,\delta)$-differential privacy if for any two datasets $\mathcal{D},\mathcal{D}'\in\mathrm{Domain}(\mathcal{M})$ that differ in one record, and for all sets $\mathcal{S}\subseteq\mathrm{Range}(\mathcal{M})$, it holds that $\Pr(\mathcal{M}(\mathcal{D})\in\mathcal{S})\leq e^{\varepsilon}\,\Pr(\mathcal{M}(\mathcal{D}')\in\mathcal{S})+\delta$.

One desirable property of DP is that it ensures privacy (in terms of $\varepsilon$ and $\delta$) under composition. Based on this property, DP-SGD [1] injects calibrated Gaussian noise into the model gradients at each training step to achieve differential privacy, as follows,

\bm{G}=\frac{1}{B}\left(\sum_{i=1}^{B}\bm{g}_{i}\cdot\operatorname{Clip}_{C}(\|\bm{g}_{i}\|)+\sigma_{\rm dp}\cdot C\cdot\mathcal{N}(0,\mathbf{I})\right), (3.1)

where $\bm{g}_i$ is the gradient of the $i$-th sample in the minibatch of size $B$, $C$ is the clipping norm, and $\operatorname{Clip}_C(\|g_i\|)=\min(C/\|g_i\|,1)$, ensuring that the sensitivity of the averaged gradient $\bm{G}$ is bounded by $\Delta_G\leq\|g_i\cdot\operatorname{Clip}_C(\|g_i\|)\|\leq C$. Here $\sigma_{\rm dp}$ is the noise multiplier derived from privacy accounting tools [4, 40].
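For concreteness, the following is a minimal NumPy sketch of the noisy aggregation step in Equation (3.1); it assumes the per-sample gradients are already computed and flattened into a matrix (precisely the expensive step that Section 4 avoids), and all names are ours.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, C, sigma_dp, rng):
    """Privatized average gradient per Equation (3.1).

    per_sample_grads: (B, D) array; one flattened gradient g_i per sample.
    C: clipping norm; sigma_dp: noise multiplier.
    """
    B, D = per_sample_grads.shape
    norms = np.linalg.norm(per_sample_grads, axis=1)           # ||g_i||
    scales = np.minimum(C / (norms + 1e-12), 1.0)              # Clip_C(||g_i||)
    clipped_sum = (per_sample_grads * scales[:, None]).sum(0)  # bounded-sensitivity sum
    noise = sigma_dp * C * rng.standard_normal(D)              # calibrated Gaussian noise
    return (clipped_sum + noise) / B

# Example: G = dp_sgd_step(grads, C=1.0, sigma_dp=0.8, rng=np.random.default_rng(0))
```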

In the following sections, we introduce the two key techniques of DPFormer, Phantom Clipping and the Re-Attention Mechanism, which enable more efficient and more accurate differentially private training of Transformers.

(a) Phantom Clipping. (b) Re-Attention Mechanism.
Figure 1: The DPFormer methodology overview. Left: Phantom Clipping makes private training of Transformers almost as efficient as non-private training. Right: the Re-Attention Mechanism combats attention distraction during private training and thereby improves model performance.

4 Phantom Clipping

4.1 Motivation: The Indispensability of Parameter Sharing as a Form of Inductive Bias

A recent advancement, Ghost Clipping [27], generalizes the Goodfellow trick [17] to facilitate efficient clipping when fine-tuning Transformer-based language models on private data, without necessitating per-sample gradient computation. Notwithstanding its considerable benefits for training Transformer models with DP-SGD compared to other libraries or implementations (for instance, Opacus [44] or JAX [34]), Ghost Clipping presents a limitation in its lack of support for parameter sharing (i.e., the practice of tying the parameters of the input embedding and the output embedding layer together). Although certain tasks, such as fine-tuning language models with differential privacy, can achieve high accuracy without parameter sharing, our investigation reveals that when training with DP-SGD, parameter sharing of the embedding layer becomes essential.

In more detail, to elucidate the role of parameter sharing when training with DP-SGD, we conduct experiments under three settings: (1) parameter sharing of the embedding layer, which aligns with the standard treatment in the Transformer; (2) no parameter sharing; and (3) no parameter sharing coupled with an embedding dimension reduced by half. The third setting is included to account for the potential impact of model dimension on accuracy in private training, given the difference in the number of parameters between models with and without parameter sharing. Model performance across different hyperparameters is shown in Figure 2.

The performance improvement brought by parameter sharing during private training is consistent and significant. The essence of embedding sharing lies in the assumption that tying the input and output embeddings keeps the representation of each token consistent throughout its retrieval. This inductive bias enhances the statistical efficiency of the model, enabling improved generalization. Without it, when training with DP-SGD on limited data, the model must uncover this relationship independently from noisy gradients with a low signal-to-noise ratio, heightening the convergence challenge.

(a) Parameter sharing. (b) Without parameter sharing. (c) Halved embedding dimension in (b).
Figure 2: Numbers are NDCG@10 (%) (higher is better) of the privately trained model (with $\varepsilon$ set to 5) on MovieLens (Figure 5). Parameter sharing for the embedding layer yields consistent and significant performance gains over the non-sharing settings in private training. The optimal hyperparameter configuration always uses a large batch size (with a large learning rate).

4.2 Efficient Per-Sample Gradient Norm Computation with Parameter Sharing

In this section, we present Phantom Clipping, a technique for efficient private training of Transformers without the need for instantiating per-sample gradient. Phantom Clipping is built upon Ghost Clipping, with additional support for efficient gradient clipping of the shared embedding layer.

(a) Memory efficiency. (b) Training speed.
Figure 3: Left: Phantom Clipping is 10-400× more memory efficient than Ghost Clipping and is almost as memory efficient as non-private training. Right: Phantom Clipping is 4-100× faster than Ghost Clipping, with training speed comparable to non-private training.

Recall that the computational bottleneck of gradient clipping in Equation (3.1) lies in the calculation of the per-sample gradient norm, i.e., $\|g_i\|$. Since the L2 norm of a vector can be decomposed across arbitrary dimensions, for example, $\|(a,b,c)\|=\|(\|a\|,\|[b,c]\|)\|$, it suffices to consider the per-sample gradient norm $\|g_{i,E}\|$ of the embedding layer $E$: the disparity due to parameter sharing lies solely in the shared embedding, and all other layers can be handled as in Ghost Clipping. After obtaining the gradient norm via $\|g_i\|=\|(\|g_{i,E}\|,\|g_{i,-E}\|)\|$, the next step is to scale the gradient by a factor of $\operatorname{Clip}_C(\|g_i\|)$ to bound its sensitivity. This can be accomplished either by re-scaling the loss $\mathcal{L}_i$ by this factor and performing a second backpropagation [26], or by manually scaling the gradient as demonstrated in [6].

The challenge of evaluating $\|g_{i,E}\|$ efficiently without instantiating $g_{i,E}$ stems from the non-plain feed-forward (and, symmetrically, backward) propagation topology caused by parameter sharing; see Figure 1(a) for a visual illustration, where the shared embedding leads to two branches of backpropagation. Despite this complexity, our Phantom Clipping, described below, is capable of achieving this task. The derivation is deferred to Appendix B.

Claim 4.1 (Phantom Clipping).

Let $a_{i,s}\in\{0,1\}^{L\times M}$ (resp. $a_{i,c}\in\{0,1\}^{M\times M}$) be the one-hot encodings of the input sequence $s_i$ (resp. of the candidate tokens for the output probability) in a minibatch. Let $e_{i,s}\in\mathbb{R}^{L\times d}$ (resp. $e_{i,c}\in\mathbb{R}^{M\times d}$) be the output of the shared embedding layer $E$ when fed $a_{i,s}$ (resp. $a_{i,c}$). Then the norm of the per-sample gradient with respect to $E$ can be efficiently evaluated as

\|g_{i,E}\|=\left(\langle a_{i,s}a_{i,s}^{T},\,\nabla e_{i,s}\nabla e_{i,s}^{T}\rangle+\|\nabla e_{i,c}\|^{2}+2\cdot\langle\nabla e_{i,s},\,a_{i,s}\nabla e_{i,c}\rangle\right)^{\frac{1}{2}}, (4.1)

where $\nabla e_{i,s}:=\partial\mathcal{L}_i/\partial e_{i,s}\in\mathbb{R}^{L\times d}$, $\nabla e_{i,c}:=\partial\mathcal{L}_i/\partial e_{i,c}\in\mathbb{R}^{M\times d}$, and $\langle\cdot,\cdot\rangle$ is the inner product of two matrices of the same shape.
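As an illustration, here is a minimal NumPy sketch of Equation (4.1), assuming the one-hot inputs and the two embedding-output gradients are available from a standard backward pass (array names are ours; for clarity, the one-hot tensors are materialized, whereas a practical implementation would use integer indexing):

```python
import numpy as np

def phantom_embedding_grad_norms(a_s, grad_e_s, grad_e_c):
    """Per-sample gradient norms w.r.t. the shared embedding E (Equation 4.1).

    a_s:      (B, L, M) one-hot input sequences.
    grad_e_s: (B, L, d) gradients of the loss w.r.t. the input-side output e_s.
    grad_e_c: (B, M, d) gradients of the loss w.r.t. the candidate-side output e_c.
    Returns a (B,) array of ||g_{i,E}|| without ever forming g_{i,E} in R^{M x d}.
    """
    # First term: <a_s a_s^T, grad_e_s grad_e_s^T>, two (B, L, L) Gram matrices.
    aa = np.einsum('blm,bkm->blk', a_s, a_s)
    ee = np.einsum('bld,bkd->blk', grad_e_s, grad_e_s)
    term1 = np.einsum('blk,blk->b', aa, ee)

    # Second term: ||grad_e_c||^2 (a_c is the identity, so no Gram matrix is needed).
    term2 = np.einsum('bmd,bmd->b', grad_e_c, grad_e_c)

    # Cross term: 2<grad_e_s, a_s grad_e_c>; a_s grad_e_c merely gathers rows of grad_e_c.
    gathered = np.einsum('blm,bmd->bld', a_s, grad_e_c)
    term3 = 2.0 * np.einsum('bld,bld->b', grad_e_s, gathered)

    return np.sqrt(term1 + term2 + term3)
```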

Memory Complexity. We study the additional memory footprint required by Phantom Clipping. Due to the storage of $a_s a_s^T$ and $\nabla e_s\nabla e_s^T\in\mathbb{R}^{B\times L\times L}$ from the first term of Equation (4.1) (note that $a_s\nabla e_c$ is merely an indexing operation, requiring no additional memory), Phantom Clipping has an overhead memory complexity of $O(BL^2)$. As a comparison, Ghost Clipping has a memory complexity of $O(BT^2)$ when the input $a_i$ to a layer has shape $\mathbb{R}^{T\times\cdot}$; hence its memory complexity for the two embedding layers is $O(BM^2+BL^2)$, where $M$ is the vocabulary size. Typically $L\ll M$, that is, the sequence length is much smaller than the vocabulary size, so the memory overhead of Phantom Clipping is only a negligible fraction of that of Ghost Clipping.

We implement Phantom Clipping on top of AWS's fastDP library (https://github.com/awslabs/fast-differential-privacy), which implements Ghost Clipping. We then empirically compare Phantom Clipping with Ghost Clipping in terms of both memory footprint and training speed on real-world datasets (see Figure 5 for details of the datasets). Since Ghost Clipping does not support parameter sharing, its results are obtained by training models without embedding sharing; for a fair comparison, we halve its embedding dimension to $d_E/2$, ending up with a similar number of parameters as the model with embedding sharing. Figure 3(a) shows the maximum batch size that fits into a Tesla V100 GPU (16 GB of VRAM): our technique is much more memory friendly, allowing up to a 450× larger batch size than Ghost Clipping on Amazon, almost as large as in non-private training. Figure 3(b) shows the training speed on a single Tesla V100 GPU: Phantom Clipping achieves up to a 100× speedup over Ghost Clipping in practice, reaching 0.68× the speed of non-private training.

5 Re-Attention Mechanism

Figure 4: Illustration of the proposed Re-Attention Mechanism. Fix some query $q$ and consider the attention scores with respect to $x_1$, $x_2$, $x_3$, and $x_4$, where $x_1$ and $x_4$ are assumed to be tail tokens. Left: attention scores in the non-private setting (the ground truth), with the highest attention on tokens $x_2$ and $x_3$. Middle: the expectation of attention scores in DP training, where attention is distracted toward $x_1$ and $x_4$ due to their relatively higher uncertainty. Right: the Re-Attention Mechanism handles attention distraction by correcting the attention scores.

5.1 Motivation: The Attention Distraction Phenomenon

Recall that the Attention Mechanism, the key component of the Transformer, computes, for a given query, attention scores over the tokens of relevance. Our key observation is that, over the randomness of the attention keys of the tokens of interest (in unambiguous contexts, 'token variance' denotes the variance of the relevant representation associated with that token, such as its attention key $K_i$ or its embedding $E_i$), the expectation of the attention scores is distorted: it mindlessly leans toward tokens with high variance, regardless of their actual relevance. See Figure 4 for a visual illustration of this attention distraction.

To shed light on this phenomenon, we offer a theoretical analysis. Fix some query $q$ and denote the attention key of token $i$ by $K_i$. Since DP-SGD injects Gaussian noise into the model, it is natural to assume that $K_i$ follows a Gaussian distribution with mean $k_i$ and variance $\sigma_i^2$. Let $S_i$ denote the random variable of the attention score assigned to token $i$. With some basic algebraic manipulation and extreme value theory [9], we can recast the formula for attention scores as follows (for ease of notation, we omit the constant factor $1/\sqrt{d}$ in the attention computation),

S_{i}=\frac{\exp\langle q,K_{i}\rangle}{\sum_{j=1}^{L}\exp\langle q,K_{j}\rangle}=\exp\left(\langle q,K_{i}\rangle-\log\sum_{j=1}^{L}\exp\langle q,K_{j}\rangle\right)=\exp\left(\langle q,K_{i}\rangle-\mathbb{E}_{\gamma}[\max_{j}\{\langle q,K_{j}\rangle+\gamma\}]\right), (5.1)

where $\gamma$ is distributed as a standard Gumbel. Consider some token $i'$ that should attract little attention given the query $q$; then the expectation of the noisy maximum $\mathbb{E}_{\gamma}[\max_j\{\langle q,K_j\rangle+\gamma\}]$ can be approximated by $\max_{j\neq i'}\langle q,K_j\rangle+\zeta$, where $\zeta=\mathbb{E}[\gamma]$ is the Euler-Mascheroni constant ($\approx 0.577$). Taking the expectation of Equation (5.1) over $K_{i'}$, and leveraging the fact that $\mathbb{E}[\exp(X)]=\exp(\mathbb{E}[X])\exp(\operatorname{Var}[X]/2)$ when $X$ follows a Gaussian distribution, we arrive at the following conclusion,

\mathbb{E}_{K_{i'}}[S_{i'}]\approx\mathbb{E}_{K_{i'}}\left[\exp\left(\langle q,K_{i'}\rangle-(\max_{j\neq i'}\{\langle q,K_{j}\rangle\}+\zeta)\right)\right]=\underbrace{\exp\left(\langle q,k_{i'}\rangle-\widetilde{M}\right)}_{\rm attentive\ relevance}\cdot\underbrace{\exp\left(C\sigma_{i'}^{2}/2\right)}_{\rm multiplicative\ error}, (5.2)

where $\widetilde{M}=\max_{j\neq i'}\{\langle q,K_j\rangle\}+\zeta$, and the last equality leverages the fact that $\langle q,K_{i'}\rangle\sim\mathcal{N}(\langle q,k_{i'}\rangle,\,C\sigma_{i'}^2)$ with $C=\langle q,q\rangle$. As a result, tokens with higher variance receive inflated attention scores due to the increased multiplicative bias, distracting attention away from more deserving tokens (recall that token $i'$ is presupposed to garner little attention under query $q$). If all tokens have similar variance, or the variance terms are negligible, the negative effect of this attention diversion is reduced. In less ideal conditions, however, especially with long-tailed data, this attention distraction can hinder the Transformer's training and thereby degrade model utility.
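The following Monte Carlo sketch illustrates the effect; the dimensions, variances, and three-token setup are illustrative assumptions rather than values from the paper. Averaging softmax scores over noisy keys visibly inflates the expected score of the high-variance tail token, even though its mean key is barely aligned with the query.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 16, 100_000
q = rng.standard_normal(d)

# Mean keys: token 0 is relevant; tokens 1 and 2 are nearly irrelevant,
# but token 2 is a tail token with a much larger effective error.
k = np.stack([q / np.linalg.norm(q),
              0.1 * rng.standard_normal(d),
              0.1 * rng.standard_normal(d)])
sigma = np.array([0.05, 0.05, 0.8])

base = np.exp(k @ q - (k @ q).max())
print(base / base.sum())                 # scores at the mean keys ("ground truth")

# Sample noisy keys K_i ~ N(k_i, sigma_i^2 I) and average the softmax scores.
K = k[None] + sigma[None, :, None] * rng.standard_normal((trials, 3, d))
logits = K @ q
s = np.exp(logits - logits.max(axis=1, keepdims=True))
s /= s.sum(axis=1, keepdims=True)
print(s.mean(axis=0))                    # token 2's expected score is inflated
```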

5.2 Re-Attention via Error Tracking

At its core, the Re-Attention Mechanism is designed to mitigate attention distraction during the learning process by debiasing the attention scores. To achieve this, it is natural to track the variance term identified as the multiplicative error in Equation (5.2). In the following discussion, we elaborate on the methodology for tracking this error term during private training.

5.2.1 Error Instantiation

Let us focus on the source of the randomness that leads to the attention distraction phenomenon: the DP noise injected into the model gradients. Inspired by [27], we propose the notion of effective error, a probabilistic treatment of the effective noise multiplier of [27], designed for models with sequential input. The effective error serves as an estimate of the uncertainty level of the model parameters, where the randomness is over the DP noise.

Definition 5.1 (Effective Error).

The effective error $\sigma_{\rm eff}^{\theta}$ associated with the model parameter $\theta$ is defined as

\sigma_{\rm eff}^{\theta}=\frac{\sigma_{\rm dp}}{B_{\rm eff}^{\theta}},\qquad\text{where}\quad B_{\rm eff}^{\theta}=\mathbb{E}_{\mathcal{B}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}^{B}}\left[\sum_{i=1}^{B}\mathbb{I}\left[R_{\theta}(\mathcal{B}_{i})\right]\right], (5.3)

where $B\in\mathbb{N}$ is the batch size; $\mathcal{B}\in\mathbb{N}^{B\times L}$ is the minibatch, sampled i.i.d. from the training data distribution $\mathcal{D}$ (note that each $\mathcal{B}_i$ is a sequence of tokens); $\sigma_{\rm dp}$ is the DP noise multiplier in Equation (3.1); and $\mathbb{I}(\cdot)$ is the indicator function. Here $R_\theta(\mathcal{B}_i)=1$ if $\mathcal{B}_i$ is relevant to $\theta$; for example, $R_{E_i}(\mathcal{B}_j)=\mathbb{I}[\text{token }i\in\mathcal{B}_j]$, where $E_i$ is the embedding of token $i$, and $R_W(\mathcal{B}_j)=1$, where $W$ is any parameter within a Transformer block (see Figure 1).

Remark 5.2.

Effective error recovers the effective noise multiplier when the model has no embedding layer, for example, an MLP model; in that case, $\sigma_{\rm eff}^{\theta}=\sigma_{\rm dp}/B$.

We then have the following claims for obtaining the effective errors of the Transformer's parameters. See Appendix C for the detailed derivation.

Claim 5.3.

For each layer parameterized by $W$ within the Transformer block, the effective error is $\sigma_{\rm eff}^{W}=\sigma_{\rm dp}/B$.

Claim 5.4.

For the embedding layer $E$, the effective error of token $i$ is $\sigma_{\rm eff}^{E_i}=\sigma_{\rm dp}/(B\cdot p_i)$, where $p_i$ is the frequency of token $i$ (i.e., the probability of token $i$'s occurrence in the data).
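In code, the two claims amount to a one-line computation per parameter group; below is a sketch under the assumption that the token frequencies $p_i$ are known (cf. Remark C.1) and strictly positive:

```python
import numpy as np

def effective_errors(sigma_dp, B, token_freqs):
    """Effective errors per Claims 5.3 and 5.4.

    token_freqs: (M,) array of token frequencies p_i (assumed positive).
    Returns sigma_eff for Transformer-block weights and for each embedding row E_i.
    """
    sigma_eff_W = sigma_dp / B                    # Claim 5.3: B_eff^W = B
    sigma_eff_E = sigma_dp / (B * token_freqs)    # Claim 5.4: B_eff^{E_i} = B * p_i
    return sigma_eff_W, sigma_eff_E
```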

5.2.2 Error Propagation

Given the effective errors of the embedding layer and of the Transformer encoder, our goal is to obtain the error term $\sigma_i$ identified in Equation (5.2) for each attention computation. Notably, this error-tracking problem aligns with studies in Bayesian deep learning [39], a field primarily focused on quantifying prediction uncertainty to enhance the robustness and reliability of machine learning systems. While our primary interest lies in unbiased attention score computation during private training, we can leverage and adapt existing methodologies from Bayesian deep learning for this distinct goal. Specifically, given the input embedding along with its effective error, we propagate the effective error through the Transformer layers (see Figure 1), with the goal of obtaining $\sigma_i$ for each attention calculation. Denote the output of the $l$-th layer by the random variable $X^{(l)}$. Given the output distribution of the preceding layer $X^{(l-1)}$, the distribution of $X^{(l)}$ can be computed layer by layer as follows,

p(X^{(l)}\,|\,X^{(0)})=\mathbb{E}_{X^{(l-1)}|X^{(0)}}\left[p\left(X^{(l)}\,|\,X^{(l-1)}\right)\right]=\mathbb{E}_{X^{(l-1)}|X^{(0)}}\left[p^{d}\left(X^{(l)}_{i}\,|\,X^{(l-1)}\right)\right], (5.4)

where the last equality is due to the isometric Gaussian noise of DP (see Equation 3.1), i.e., each dimension is independently and identically distributed. Based on variational inference [25], we can use an approximating distribution $q$ for the computationally intractable distribution $p$, where $q(X^{(l)})$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Note that minimizing the KL divergence $\mathrm{KL}\big(p(X^{(l)}|X^{(0)})\,\|\,q(X^{(l)})\big)$ reduces to matching the moments of $q(X^{(l)})$ to those of $p(X^{(l)}|X^{(0)})$. Since the mean and variance are sufficient statistics for a Gaussian distribution (mean and variance, the first two moments, and the natural parameters are mutually convertible, and we use them interchangeably), propagating the distribution reduces to propagating its natural parameters [38]. For linear layers coupled with a coordinate-wise non-linear activation, these statistics can be computed analytically using existing techniques from probabilistic neural networks [38, 33, 15, 32, 29]. Concretely, for a linear transformation $X^{(l)}=X^{(l-1)}W$, we can propagate the variance as

\sigma_{X^{(l)}}^{2}=\sigma_{X^{(l-1)}}^{2}\cdot\sigma_{W}^{2}+\sigma_{X^{(l-1)}}^{2}\cdot\mu_{W}^{2}+\sigma_{W}^{2}\cdot\left(\mu_{X^{(l-1)}}\right)^{2}. (5.5)

For non-linear activation functions, e.g., $X^{(l)}=\operatorname{ReLU}(X^{(l-1)})$, we can propagate the variance as

\sigma_{X^{(l)}}^{2}=\Phi\!\left(\frac{c}{\sqrt{d}}\right)(c^{2}+d)+\frac{c\sqrt{d}}{\sqrt{2\pi}}\exp\!\left(-\frac{c^{2}}{2d}\right)-\mu_{X^{(l)}}^{2},\qquad\mu_{X^{(l)}}=c\,\Phi\!\left(\frac{c}{\sqrt{d}}\right)+\sqrt{d}\,\phi\!\left(\frac{c}{\sqrt{d}}\right), (5.6)

where $\Phi(\cdot)$ and $\phi(\cdot)$ are the cumulative distribution function (CDF) and probability density function (PDF) of the standard Gaussian distribution, and $c$ and $d$ are the natural parameters (equivalently, the mean and variance) of $X^{(l-1)}$. For completeness, the derivations are included in Appendix D.
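Here is a small sketch of these propagation rules; the linear rule follows Equation (5.5) coordinate-wise, and the ReLU rule matches the first two moments via Lemma D.2 in Appendix D (we parameterize by mean and variance, which are interchangeable with the natural parameters $(c,d)$):

```python
import numpy as np
from scipy.stats import norm

def propagate_linear(mu_x, var_x, mu_w, var_w):
    """Variance of X*W for independent X and W (Equation 5.5), coordinate-wise."""
    return var_x * var_w + mu_x**2 * var_w + mu_w**2 * var_x

def propagate_relu(mu, var):
    """Mean and variance of ReLU(X), X ~ N(mu, var), via Lemma D.2 (mu2 = sigma2 = 0)."""
    std = np.sqrt(var)
    g = mu / std
    mean_z = mu * norm.cdf(g) + std * norm.pdf(g)
    second_z = (mu**2 + var) * norm.cdf(g) + mu * std * norm.pdf(g)
    return mean_z, second_z - mean_z**2
```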

All in all, we can obtain the output distribution of layer $l$ via analytic expressions in terms of the natural parameters [38] of the preceding layer's output distribution:

(c^{(l)},d^{(l)})=\mathcal{F}(c^{(l-1)},d^{(l-1)}),\qquad\sigma^{2}=\mathcal{T}(c^{(l)},d^{(l)}). (5.7)

Nevertheless, a nuanced difference exists between our error propagation and the existing techniques encapsulated in Equation (5.7). In the Bayesian approach, the model parameter is directly associated with the mean $\mu_W$ in Equation (5.5). During private training, however, we can only access the noisy parameter after the injection of DP noise. Interestingly, access to this noisy parameter can be interpreted as a single sampling opportunity from its underlying Gaussian distribution, which can be viewed as a one-time Markov chain sample [41]; the noisy parameter can therefore serve as an estimate of its mean. In addition, unlike variance propagation in Bayesian deep learning, the error propagation here incurs minimal computational and memory overhead, since the effective error can be represented as a scalar (again, due to the isometric DP noise) and the propagation is performed via analytic expressions.

Re-Attention. With the effective error tracked, we then mitigate the attention distraction identified in Equation (5.2) via $S_i\leftarrow S_i/\exp\left(C\sigma_i^2/2\right)$, obtaining unbiased attention scores.

In summary, we propagate and track the effective error through the layers: given the natural parameters of $X^{(l-1)}$, the variance can be estimated via analytic expressions and then used to correct the attention scores. This methodology of instantiating the error from the DP-SGD noise multiplier, propagating it through the layers, and debiasing the attention forms the crux of our Re-Attention Mechanism.
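As a minimal sketch of the correction step itself, assuming the per-key effective errors $\sigma_i$ have been tracked as above (the final renormalization, so that the corrected scores again form a distribution, is our assumption):

```python
import numpy as np

def re_attention(scores, q, sigma_eff):
    """Debias attention scores: S_i <- S_i / exp(C * sigma_i^2 / 2), with C = <q, q>.

    scores:    (L,) softmax attention scores from the noisy forward pass.
    sigma_eff: (L,) tracked effective error of each attention key.
    """
    C = float(q @ q)
    corrected = scores / np.exp(C * sigma_eff**2 / 2.0)
    return corrected / corrected.sum()  # renormalize to a probability distribution
```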

6 Experiments

Figure 5: Data in real-world scenarios exhibit a long-tailed (also known as power-law) distribution.

Datasets and Prevalence of Long-Tailed Distributions. We conduct experiments on two public datasets collected from real-world scenarios: MovieLens [18] and Amazon [28]. Figure 5 shows their data distributions, illustrating the prevalence of long tails: a small number of items are extremely popular and occur with relatively high frequency, while the remaining items occur infrequently. The embedded table above the 'long tail' reports the statistics of the two datasets, which vary significantly in size and sparsity. More details on the datasets can be found in Appendix E.1.

Baselines and Implementation Details. We compare our DPFormer with the vanilla Transformer [36] (i.e., without the Re-Attention Mechanism), the vanilla Transformer without parameter sharing, GRU [8], and LSTM [20]. For a fair comparison, embedding sharing is applied to all evaluated methods unless explicitly stated otherwise. The number of epochs is set to 100, with the first 20% of epochs used for learning rate warm-up; afterwards, the learning rate decays linearly over the remaining epochs. Following [5, 43], we normalize the gradients and set the clipping norm $C$ to 1, which eliminates hyperparameter tuning for $C$. The model dimension is set to 64. For privacy accounting, we fix the total number of training epochs (iterations) and derive the per-iteration noise from the preset privacy budget $\varepsilon$. We consider $\varepsilon=5,8,10$.

Table 1: Best results (%) on MovieLens at different privacy levels.

| Method | NDCG@10 (ε=5) | HIT@10 (ε=5) | NDCG@10 (ε=8) | HIT@10 (ε=8) | NDCG@10 (ε=10) | HIT@10 (ε=10) |
| GRU | 2.26 ± 0.04 | 4.58 ± 0.09 | 2.40 ± 0.03 | 4.75 ± 0.20 | 2.81 ± 0.03 | 5.53 ± 0.05 |
| LSTM | 2.65 ± 0.07 | 5.08 ± 0.08 | 2.76 ± 0.03 | 5.41 ± 0.06 | 2.95 ± 0.03 | 5.55 ± 0.06 |
| Transformer w/o PS | 2.33 ± 0.05 | 4.47 ± 0.07 | 2.56 ± 0.03 | 5.11 ± 0.05 | 2.74 ± 0.04 | 5.39 ± 0.08 |
| Transformer (Vanilla) | 4.57 ± 0.26 | 8.69 ± 0.53 | 7.05 ± 0.23 | 13.17 ± 0.37 | 7.99 ± 0.21 | 14.82 ± 0.38 |
| DPFormer (Ours) | 5.88 ± 0.24 | 11.13 ± 0.43 | 7.70 ± 0.26 | 14.31 ± 0.37 | 8.42 ± 0.22 | 15.40 ± 0.32 |
| Relative Improvement | 29% ↑ | 28% ↑ | 9.2% ↑ | 8.7% ↑ | 5.4% ↑ | 3.9% ↑ |
Table 2: Best results (%) on Amazon at different privacy levels.

| Method | NDCG@10 (ε=5) | HIT@10 (ε=5) | NDCG@10 (ε=8) | HIT@10 (ε=8) | NDCG@10 (ε=10) | HIT@10 (ε=10) |
| GRU | 1.13 ± 0.02 | 2.46 ± 0.03 | 1.33 ± 0.02 | 2.22 ± 0.02 | 1.47 ± 0.03 | 2.48 ± 0.02 |
| LSTM | 1.19 ± 0.01 | 2.46 ± 0.04 | 1.23 ± 0.01 | 2.46 ± 0.04 | 1.34 ± 0.01 | 2.51 ± 0.02 |
| Transformer w/o PS | 1.16 ± 0.01 | 2.36 ± 0.01 | 1.20 ± 0.02 | 2.38 ± 0.01 | 1.40 ± 0.01 | 2.47 ± 0.02 |
| Transformer (Vanilla) | 1.37 ± 0.04 | 2.47 ± 0.10 | 1.54 ± 0.03 | 2.77 ± 0.07 | 1.57 ± 0.03 | 2.83 ± 0.08 |
| DPFormer (Ours) | 1.64 ± 0.01 | 3.01 ± 0.01 | 1.98 ± 0.05 | 3.70 ± 0.15 | 1.99 ± 0.04 | 3.73 ± 0.11 |
| Relative Improvement | 20% ↑ | 22% ↑ | 28% ↑ | 34% ↑ | 27% ↑ | 31% ↑ |
Panels: (a) MovieLens ($\varepsilon=5$); (b) MovieLens ($\varepsilon=5$); (c) MovieLens ($\varepsilon=10$); (d) MovieLens ($\varepsilon=10$); (e) Amazon ($\varepsilon=5$); (f) Amazon ($\varepsilon=5$); (g) Amazon ($\varepsilon=10$); (h) Amazon ($\varepsilon=10$).
Figure 6: The Re-Attention Mechanism renders DPFormer notably more stable during private training. Each run is repeated five times with independent random seeds, with test accuracy (NDCG@10 (%) and HIT@10 (%)) reported every five epochs. The graduated shading (best viewed zoomed in) represents confidence intervals from 60% to 100%.

Table 1 and Table 2 show the best NDCG@10 and HIT@10 for all methods on MovieLens and Amazon. (Strictly speaking, hyperparameter tuning itself would cost privacy budget [30], but this is mainly of theoretical interest. We perform a grid search over the learning rate in $\{10^{-3},3\times 10^{-3},5\times 10^{-3},7\times 10^{-3},9\times 10^{-3}\}$ and the batch size in $\{256,512,1024,2048,4096\}$ for each method, ensuring a fair comparison.) The vanilla Transformer outperforms all other baselines, reaffirming its dominance in sequential data modeling thanks to the Attention Mechanism. Our DPFormer, incorporating the Re-Attention Mechanism, further boosts performance by around 20% on average; under the lowest privacy budget considered ($\varepsilon=5$), it achieves a relative improvement of around 25%, demonstrating its efficacy in attenuating attention distraction during private training. On MovieLens, as expected, the performance gain increases as the privacy budget $\varepsilon$ decreases (i.e., as the noise strength increases), since larger noise corresponds to more severe attention distraction, which better highlights the Re-Attention Mechanism's advantage. On Amazon, however, DPFormer achieves a smaller relative improvement at $\varepsilon=5$ than at $\varepsilon=10$. We suspect this is due to the two datasets' difference in sparsity (i.e., one minus the density in Figure 5) combined with the inherent difficulty of training Transformers [46, 42, 22] under substantial DP noise.

Figure 6 shows the model accuracy every five epochs during training. Evidently, the training dynamics of the vanilla Transformer, impacted by attention distraction, suffer from high variance and substantial fluctuation, especially on Amazon. In contrast, DPFormer enjoys faster and smoother convergence, highlighting its superior training stability under differential privacy.

Panels: (a) DPFormer; (b) Transformer; (c) DPFormer; (d) Transformer.
Figure 7: Results of grid search for hyperparameter tuning on Amazon with privacy budget $\varepsilon=8$.

Figure 7 shows the results of hyperparameter tuning via grid search. (Rather than running an additional differentially private algorithm to report a noisy max or argmax [30], we opt to display all results directly, for transparency and comprehensiveness.) For most hyperparameter configurations (especially along the main diagonal [35]), our DPFormer significantly outperforms the vanilla Transformer.

7 Conclusion

In this paper, we identified two key challenges in learning differentially private Transformers: heavy computational overhead due to per-sample gradient clipping, and attention distraction due to long-tailed data distributions. We then proposed DPFormer, equipped with Phantom Clipping and the Re-Attention Mechanism, to address these challenges. Our theoretical analysis shows that DPFormer can effectively correct attention distraction (which would otherwise lead to significant performance drops) and reduce the computational cost of gradient clipping. This analysis is further corroborated by empirical results on two real-world datasets. Limitations. One limitation of this work is the assumption of a trusted data curator. Another is the absence of a semantic interpretation of the privacy budget when the input is sequential data. Addressing these limitations would be interesting future work.

References

  • [1] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
  • [2] Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private BERT. In Findings of the Association for Computational Linguistics: EMNLP, 2022.
  • [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [4] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. Advances in Neural Information Processing Systems, 31, 2018.
  • [5] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Automatic Clipping: Differentially private deep learning made easier and stronger. arXiv preprint arXiv:2206.07136, 2022.
  • [6] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. arXiv preprint arXiv:2210.00038, 2022.
  • [7] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, volume 267, 2019.
  • [8] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [9] Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to statistical modeling of extreme values, volume 208. Springer, 2001.
  • [10] Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
  • [11] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284, 2006.
  • [12] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N Rothblum, and Salil Vadhan. On the complexity of differentially private data release: Efficient algorithms and hardness results. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, pages 381–390, 2009.
  • [13] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [14] Vitaly Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
  • [15] Jochen Gast and Stefan Roth. Lightweight probabilistic deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3369–3378, 2018.
  • [16] Aditya Golatkar, Alessandro Achille, Yu-Xiang Wang, Aaron Roth, Michael Kearns, and Stefano Soatto. Mixed differential privacy in computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8376–8386, 2022.
  • [17] Ian Goodfellow. Efficient per-example gradient computations. arXiv preprint arXiv:1510.01799, 2015.
  • [18] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):1–19, 2015.
  • [19] Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, and Jiang Bian. Exploring the limits of differentially private deep learning with group-wise clipping. In The Eleventh International Conference on Learning Representations, 2023.
  • [20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [21] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 227–236, 2022.
  • [22] Xiao Shi Huang, Felipe Perez, Jimmy Ba, and Maksims Volkovs. Improving Transformer optimization through better initialization. In International Conference on Machine Learning, pages 4475–4483. PMLR, 2020.
  • [23] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 10633–10660, 2022.
  • [24] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining, pages 197–206, 2018.
  • [25] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [26] Jaewoo Lee and Daniel Kifer. Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies, 2021(1), 2021.
  • [27] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2022.
  • [28] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52, 2015.
  • [29] Pablo Morales-Alvarez, Daniel Hernández-Lobato, Rafael Molina, and José Miguel Hernández-Lobato. Activation-level uncertainty in deep neural networks. In International Conference on Learning Representations, 2021.
  • [30] Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with renyi differential privacy. In International Conference on Learning Representations, 2022.
  • [31] Martin Popel and Ondřej Bojar. Training tips for the transformer model. arXiv preprint arXiv:1804.00247, 2018.
  • [32] Janis Postels, Francesco Ferroni, Huseyin Coskun, Nassir Navab, and Federico Tombari. Sampling-free epistemic uncertainty estimation using approximated variance propagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2931–2940, 2019.
  • [33] Alexander Shekhovtsov and Boris Flach. Feed-forward propagation in probabilistic neural networks with categorical and max layers. In International Conference on Learning Representations, 2019.
  • [34] Pranav Subramani, Nicholas Vadivelu, and Gautam Kamath. Enabling fast differentially private sgd via just-in-time compilation and vectorization. Advances in Neural Information Processing Systems, 34:26409–26421, 2021.
  • [35] Florian Tramer and Dan Boneh. Differentially private learning needs better features (or much more data). In International Conference on Learning Representations, 2021.
  • [36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [37] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10081–10091, 2020.
  • [38] Hao Wang, Xingjian Shi, and Dit-Yan Yeung. Natural-parameter networks: A class of probabilistic neural networks. Advances in Neural Information Processing Systems, 29, 2016.
  • [39] Hao Wang and Dit-Yan Yeung. A survey on Bayesian deep learning. ACM Computing Surveys (CSUR), 53(5):1–37, 2020.
  • [40] Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.
  • [41] Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pages 2493–2502. PMLR, 2015.
  • [42] Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon JD Prince, and Yanshuai Cao. Optimizing deeper transformers on small datasets. arXiv preprint arXiv:2012.15355, 2020.
  • [43] Xiaodong Yang, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Normalized/Clipped SGD with perturbation for differentially private non-convex optimization. arXiv preprint arXiv:2206.13033, 2022.
  • [44] Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in PyTorch. In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021.
  • [45] Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2022.
  • [46] Biao Zhang, Ivan Titov, and Rico Sennrich. Improving deep transformer with depth-scaled initialization and merged attention. arXiv preprint arXiv:1908.11365, 2019.

Appendix A Related Work

The most related work is [27], which finetunes large language models with differential privacy. They propose Ghost Clipping, which is a technique that enables efficient per-sample gradient clipping without instantiating per-sample gradient. Our Phantom Clipping can be viewed as an extension of Ghost Clipping that additionally handles parameter sharing of the embedding layer. They also introduce the idea of effective noise multiplier in order to explain the role of batch size in private learning. Our effective error (Equation 5.3) can be viewed as its generalization in order to account for the inherent input sparsity (i.e., only a small portion of tokens appear in one training sequence).

Another related work is [2], which establishes a baseline for BERT-Large pretraining with DP. They introduce several strategies to help private training, such as large weight decay and increasing batch size schedule. Notably, these strategies are independent yet complementary to the methodologies utilized in this work, thereby offering potential avenues for an integrated approach.

Appendix B Proof of Phantom Clipping (Claim 4.1)

Proof.

For simplicity, we omit the per-sample index $i$ throughout this proof. By the chain rule, the per-sample gradient with respect to the embedding layer $E$ is

g_{E}=\frac{\partial\mathcal{L}}{\partial e_{s}}\cdot\frac{\partial e_{s}}{\partial E}+\frac{\partial\mathcal{L}}{\partial e_{c}}\cdot\frac{\partial e_{c}}{\partial E}=\underbrace{a_{s}^{T}\cdot\nabla e_{s}}_{g_{E}^{(1)}}+\underbrace{a_{c}^{T}\cdot\nabla e_{c}}_{g_{E}^{(2)}}, (B.1)

where $a_s\in\{0,1\}^{L\times M}$ (resp. $a_c\in\{0,1\}^{M\times M}$) is the one-hot encoding of the input sequence (resp. of the candidate tokens for the output probability), and $e_s\in\mathbb{R}^{L\times d}$ (resp. $e_c\in\mathbb{R}^{M\times d}$) is the output of the shared embedding layer $E$ when fed $a_s$ (resp. $a_c$). Denoting the first term on the right-hand side by $g_E^{(1)}$ and the second by $g_E^{(2)}$, we have

\|g_{E}\|_{F}^{2}=\|g_{E}^{(1)}+g_{E}^{(2)}\|_{F}^{2}=\|g_{E}^{(1)}\|_{F}^{2}+\|g_{E}^{(2)}\|_{F}^{2}+2\cdot\langle g_{E}^{(1)},g_{E}^{(2)}\rangle. (B.2)

Ghost Clipping [27] allows us to evaluate $\|g_E^{(1)}\|_F$ without instantiating $g_E^{(1)}$; its square is given by

\|g_{E}^{(1)}\|_{F}^{2}=\langle a_{s}a_{s}^{T},\,\nabla e_{s}\nabla e_{s}^{T}\rangle. (B.3)

Similarly, we have

\|g_{E}^{(2)}\|_{F}^{2}=\langle a_{c}a_{c}^{T},\,\nabla e_{c}\nabla e_{c}^{T}\rangle. (B.4)

Note that $a_c$ is the one-hot encoding of $[1,2,3,\ldots,M]$ and is thus an identity matrix, so Equation B.4 can be further simplified to

\|g_{E}^{(2)}\|_{F}^{2}=\langle\mathbf{I},\,\nabla e_{c}\nabla e_{c}^{T}\rangle=\sum_{i=1}^{M}\langle(\nabla e_{c})_{i},(\nabla e_{c})_{i}\rangle=\|\nabla e_{c}\|^{2}. (B.5)

Note that this simplification reduces the memory footprint from $O(M^2)$ to $O(M)$, obviating the need to evaluate $\nabla e_c\nabla e_c^T$ in Equation B.4.

Therefore, computing the gradient norm of the shared embedding reduces to computing $\langle g_E^{(1)},g_E^{(2)}\rangle$ in Equation B.2:

\begin{split}\langle g_{E}^{(1)},g_{E}^{(2)}\rangle&=\langle a_{s}^{T}\cdot\nabla e_{s},\,a_{c}^{T}\cdot\nabla e_{c}\rangle\\&=\sum_{j=1}^{M}\sum_{k=1}^{d}\left(\sum_{i=1}^{L}(a_{s})_{ij}(\nabla e_{s})_{ik}\right)\left(\sum_{i=1}^{M}(a_{c})_{ij}(\nabla e_{c})_{ik}\right)\\&=\sum_{i_{1}=1}^{L}\sum_{i_{2}=1}^{M}\left(\sum_{j=1}^{M}(a_{s})_{i_{1}j}(a_{c})_{i_{2}j}\right)\left(\sum_{k=1}^{d}(\nabla e_{s})_{i_{1}k}(\nabla e_{c})_{i_{2}k}\right)\\&=\sum_{i_{1}=1}^{L}\sum_{i_{2}=1}^{M}\langle(a_{s})_{i_{1}},(a_{c})_{i_{2}}\rangle\cdot\langle(\nabla e_{s})_{i_{1}},(\nabla e_{c})_{i_{2}}\rangle\\&=\sum_{i_{1}=1}^{L}\sum_{i_{2}=1}^{M}\mathbb{I}\left[i_{2}=\operatorname{onehot}^{-1}((a_{s})_{i_{1}})\right]\cdot\langle(\nabla e_{s})_{i_{1}},(\nabla e_{c})_{i_{2}}\rangle\\&=\sum_{i_{1}=1}^{L}\langle(\nabla e_{s})_{i_{1}},(a_{s}\cdot\nabla e_{c})_{i_{1}}\rangle\\&=\langle\nabla e_{s},\,a_{s}\cdot\nabla e_{c}\rangle.\end{split} (B.6)

Combining Equations B.3, B.5, and B.6 yields the conclusion. ∎
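The identity can also be checked numerically. The sketch below instantiates $g_E$ directly via Equation B.1 and compares its Frobenius norm with the phantom formula; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, d = 5, 50, 8
a_s = np.eye(M)[rng.integers(0, M, size=L)]  # (L, M) one-hot input sequence
a_c = np.eye(M)                              # (M, M) candidate encoding (identity)
de_s = rng.standard_normal((L, d))           # gradient w.r.t. e_s
de_c = rng.standard_normal((M, d))           # gradient w.r.t. e_c

g_E = a_s.T @ de_s + a_c.T @ de_c            # Equation B.1, instantiated directly
direct = np.linalg.norm(g_E)                 # Frobenius norm

term1 = np.sum((a_s @ a_s.T) * (de_s @ de_s.T))   # Equation B.3
term2 = np.sum(de_c**2)                           # Equation B.5
term3 = 2.0 * np.sum(de_s * (a_s @ de_c))         # Equation B.6
phantom = np.sqrt(term1 + term2 + term3)

print(np.isclose(direct, phantom))           # True
```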

Appendix C Proof of Error Instantiation (Claims 5.3 and 5.4)

Correction of a typo: the definition of the effective batch size $B_{\rm eff}^{\theta}$ in Equation 5.3 should read

B_{\rm eff}^{\theta}=\mathbb{E}_{\mathcal{B}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}^{B}}\left[\sum_{i=1}^{B}\mathbb{I}\left[R_{\theta}(\mathcal{B}_{i})\right]\right]. (C.1)

Taking the average over the minibatch in Equation 3.1 can be viewed as a noise reduction of order $O(1/B)$: fixing the noise multiplier $\sigma_{\rm dp}$, the amount of DP noise incorporated into each parameter decreases as the batch size increases. Now suppose token $i$ is absent from the current training sequence. Its input embedding is not activated and thus not properly trained in this iteration, yet DP noise is nevertheless injected into its embedding. The notion of effective error is introduced to account for this phenomenon.

Proof.

It suffices to derive the formula for the effective batch size $B_{\rm eff}^{\theta}$ in Equation 5.3. Recall that its definition is

B_{\rm eff}^{\theta}=\mathbb{E}_{\mathcal{B}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}^{B}}\left[\sum_{i=1}^{B}\mathbb{I}\left[R_{\theta}(\mathcal{B}_{i})\right]\right]. (C.2)

For each layer parameterized by $W$ within the Transformer block, the effective batch size is $B_{\rm eff}^{W}=B$, since $R_W(\mathcal{B}_i)=1$ for every sample.

For the embedding layer $E$, the effective batch size of token $i$ is

\begin{split}B_{\rm eff}^{E_{i}}&=\mathbb{E}_{\mathcal{B}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}^{B}}\left[\sum_{j=1}^{B}\mathbb{I}\left[R_{E_{i}}(\mathcal{B}_{j})\right]\right]\\&=\sum_{j=1}^{B}\mathbb{E}_{\mathcal{B}_{j}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}}\left[\mathbb{I}\left[R_{E_{i}}(\mathcal{B}_{j})\right]\right]\qquad\text{(linearity of expectation)}\\&=\sum_{j=1}^{B}\mathbb{E}_{\mathcal{B}_{j}\,\overset{\text{i.i.d.}}{\sim}\,\mathcal{D}}\left[\mathbb{I}\left[\text{token }i\in\mathcal{B}_{j}\right]\right]\\&=\sum_{j=1}^{B}p_{i}=B\cdot p_{i},\end{split} (C.3)

where $p_i$ is the frequency of token $i$ (i.e., the probability of token $i$'s occurrence in the data). ∎

Remark C.1.

Note that $p_i$, being a frequency statistic, can be obtained with high accuracy using only a tiny privacy budget. For simplicity, we assume $p_i$ is publicly known in this work.

Appendix D Proof of Error Propagation

D.1 Proof of Linear Transformation (Equation 5.5)

Lemma D.1.

Let $X$ and $Y$ be two independent random variables and let $Z=XY$. Then the variance of $Z$ can be expressed as

\operatorname{Var}[Z]=\operatorname{Var}[XY]=\mathbb{E}[X^{2}]\,\mathbb{E}[Y^{2}]-\mathbb{E}[XY]^{2}. (D.1)

Lemma D.1 directly implies Equation 5.5 as follows.

Proof.

Suppose the linear transformation is given by $X^{(l)}=X^{(l-1)}W$. Then we have

\begin{split}\operatorname{Var}[X^{(l)}]&=\mathbb{E}[(X^{(l-1)})^{2}]\,\mathbb{E}[W^{2}]-\mathbb{E}[X^{(l-1)}W]^{2}\\&=\left(\mathbb{E}[X^{(l-1)}]^{2}+\operatorname{Var}[X^{(l-1)}]\right)\left(\mathbb{E}[W]^{2}+\operatorname{Var}[W]\right)-\mathbb{E}[X^{(l-1)}]^{2}\,\mathbb{E}[W]^{2}\\&=\operatorname{Var}[X^{(l-1)}]\operatorname{Var}[W]+\mathbb{E}[X^{(l-1)}]^{2}\operatorname{Var}[W]+\mathbb{E}[W]^{2}\operatorname{Var}[X^{(l-1)}].\end{split} (D.2)

∎

D.2 Proof of Non-linear Transformation (Equation 5.6)

Lemma D.2.

Let $X_1$ and $X_2$ be two independent Gaussian random variables with $X_i\sim\mathcal{N}(\mu_i,\sigma_i^2)$, $i=1,2$, and let $Z=\max(X_1,X_2)$. Then

\begin{split}&\mathbb{E}[Z]=\mu_{1}\Phi(\gamma)+\mu_{2}\Phi(-\gamma)+\nu\,\phi(\gamma),\\&\mathbb{E}[Z^{2}]=(\mu_{1}^{2}+\sigma_{1}^{2})\Phi(\gamma)+(\mu_{2}^{2}+\sigma_{2}^{2})\Phi(-\gamma)+(\mu_{1}+\mu_{2})\,\nu\,\phi(\gamma),\end{split} (D.3)

where $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard Gaussian distribution, $\phi(\cdot)$ is its probability density function (PDF), $\nu=\sqrt{\sigma_1^2+\sigma_2^2}$, and $\gamma=(\mu_1-\mu_2)/\nu$.

Lemma D.2 directly implies Equation 5.6 as follows.

Proof.

Let $X^{(l)}=\operatorname{ReLU}(X^{(l-1)})$ be the ReLU activation. Substituting $\mu_1=\mathbb{E}[X^{(l-1)}]$, $\sigma_1^2=\operatorname{Var}[X^{(l-1)}]$, and $\mu_2=\sigma_2=0$ into Equation D.3, and leveraging $\operatorname{Var}[X^{(l)}]=\mathbb{E}[(X^{(l)})^{2}]-\mathbb{E}[X^{(l)}]^{2}$, yields the conclusion. ∎

Remark D.3.

GELU activation. The GELU function is another widely used activation function within Transformer models. GELU can be viewed as a smooth version of ReLU (see Figure 8): their forward passes are similar, and the major distinction lies in the numerical behavior of backpropagation. Since error propagation is only concerned with the forward pass, we can also use Equation 5.6 to approximate the variance of the GELU output. Table 3 compares the analytic error propagation for ReLU and GELU activations with sampling-based estimates.

Figure 8: GELU activation and ReLU activation.
Table 3: Analytic error propagation for ReLU and GELU activations, compared with sampling-based estimates (column headers give the number of samples).

| Input | Activation | 10 | 100 | 1000 | 10000 | 100000 | 1000000 | Analytic |
| N(0, 0.01) | ReLU | 4.08e-6 | 3.60e-5 | 3.93e-5 | 3.45e-5 | 3.40e-5 | 3.40e-5 | 3.40e-5 |
| N(0, 0.01) | GELU | 2.48e-5 | 2.69e-5 | 2.72e-5 | 2.57e-5 | 2.50e-5 | 2.49e-5 | — |
| N(0, 0.1) | ReLU | 0.0030 | 0.0031 | 0.0037 | 0.0034 | 0.0035 | 0.0034 | 0.0034 |
| N(0, 0.1) | GELU | 0.0030 | 0.0025 | 0.0027 | 0.0025 | 0.0026 | 0.0025 | — |
| N(0, 1) | ReLU | 0.5299 | 0.2361 | 0.3649 | 0.3451 | 0.3387 | 0.3418 | 0.3408 |
| N(0, 1) | GELU | 0.5525 | 0.2306 | 0.3719 | 0.3506 | 0.3433 | 0.3467 | — |
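As a sanity check, the ReLU rows of Table 3 can be reproduced in a few lines (a sketch; the sample count is ours). For $X\sim\mathcal{N}(0,s^2)$, Lemma D.2 gives the variance of $\operatorname{ReLU}(X)$ as $s^2(1/2-1/(2\pi))\approx 0.3408\,s^2$, matching the Analytic column.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def relu_var_analytic(mu, var):
    """Variance of ReLU(X), X ~ N(mu, var), via Lemma D.2 with mu2 = sigma2 = 0."""
    std = np.sqrt(var)
    g = mu / std
    mean_z = mu * norm.cdf(g) + std * norm.pdf(g)
    second_z = (mu**2 + var) * norm.cdf(g) + mu * std * norm.pdf(g)
    return second_z - mean_z**2

for s in (0.01, 0.1, 1.0):
    x = rng.normal(0.0, s, size=1_000_000)
    print(s, np.maximum(x, 0.0).var(), relu_var_analytic(0.0, s**2))
```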

Appendix E Experimental Details

E.1 Datasets

MovieLens. The MovieLens dataset [18] is often used in the development and evaluation of collaborative filtering algorithms, which are used to make personalized recommendations based on user behavior. It is a benchmark dataset in the field of recommender systems due to its size, longevity, and richness of user-item interactions. We use the version (MovieLens-1M) that includes 1 million user behaviors.

Amazon. A series of datasets introduced in [28], comprising large corpora of product reviews crawled from Amazon.com. Top-level product categories on Amazon are treated as separate datasets; we consider the 'Games' category. This dataset is notable for its high sparsity and variability.

We follow [24] for data preprocessing. We use timestamps to determine the sequence order of actions. Each user is associated with one training sequence (their chronological behavior). We discard users and items with fewer than five related actions. For data partitioning, the last token of each sequence is held out for testing.

It is worth noting that, since each user is exclusively associated with exactly one training sample (sequence), the DP guarantee we provide is user-level: removing all information pertaining to a specific user yields an indistinguishable model.

E.2 Model Architecture

We use the standard Transformer encoder described in [36]. The model dimension is set to 64. The number of heads in the Attention Mechanism is set to 1. The number of transformer blocks is set to 2. Our model adopts a learned (instead of fixed) positional embedding.

E.3 Hyperparameters

The number of epochs is set to 100. The batch size is chosen from $\{256,512,1024,2048,4096\}$ and the learning rate from $\{10^{-3},3\times 10^{-3},5\times 10^{-3},7\times 10^{-3},9\times 10^{-3}\}$. The dropout rate is 0.2 for MovieLens and 0.5 for Amazon (due to its high sparsity). We use the Adam optimizer with a weight decay of $10^{-5}$.