

DP-Forward: Fine-tuning and Inference on Language Models with Differential Privacy in Forward Pass

Minxin Du The Chinese University of Hong Kong [email protected] Xiang Yue The Ohio State University [email protected] Sherman S. M. Chow The Chinese University of Hong Kong [email protected] Tianhao Wang University of Virginia [email protected] Chenyu Huang Independent [email protected]  and  Huan Sun The Ohio State University [email protected]
(2023)
Abstract.

Differentially private stochastic gradient descent (DP-SGD) adds noise to gradients in back-propagation, safeguarding training data from privacy leakage, particularly membership inference. It fails to cover (inference-time) threats like embedding inversion and sensitive attribute inference. It is also costly in storage and computation when used to fine-tune large pre-trained language models (LMs).

We propose DP-Forward, which directly perturbs embedding matrices in the forward pass of LMs. It satisfies stringent local DP requirements for training and inference data. To instantiate it using the smallest matrix-valued noise, we devise an analytic matrix Gaussian mechanism (aMGM) by drawing possibly non-i.i.d. noise from a matrix Gaussian distribution. We then investigate perturbing outputs from different hidden (sub-)layers of LMs with aMGM noises. Its utility on three typical tasks almost hits the non-private baseline and outperforms DP-SGD by up to 7.7pp at a moderate privacy level. It saves 3\times time and memory costs compared to DP-SGD with the latest high-speed library. It also reduces the average success rates of embedding inversion and sensitive attribute inference by up to 88pp and 41pp, respectively, whereas DP-SGD fails.

Local Differential Privacy, Natural Language Processing, Pre-trained Language Models, Privacy-preserving Fine-tuning and Inference of LMs, Embedding Matrices, Analytic Matrix Gaussian Mechanism
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, November 26–30, 2023, Copenhagen, Denmark; booktitle: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23), November 26–30, 2023, Copenhagen, Denmark; price: 15.00; doi: 10.1145/3576915.3616592; isbn: 979-8-4007-0050-7/23/11; ccs: Security and privacy / Data anonymization and sanitization; ccs: Security and privacy / Privacy-preserving protocols; ccs: Security and privacy / Privacy protections

1. Introduction

The deep learning architecture of the transformer (Vaswani et al., 2017) is now gaining popularity in computer vision and has been widely utilized in natural language processing (NLP). Transformer-based language models (LMs), such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019), have remarkably achieved state-of-the-art performance in almost every NLP task. They are first pre-trained on massive (public) self-labeled corpora and then fine-tuned for various tasks using much smaller, potentially private corpora. This paradigm avoids training from scratch and the possible shortage of task-specific corpora while earning versatility.

Training data contributing to the improved utility of fine-tuned LMs can be sensitive. LMs can (unintentionally) memorize them (Carlini et al., 2019) and become vulnerable to membership inference attacks (MIAs) (Shokri et al., 2017) that identify whether an example is in the training set. Worse still, verbatim training text (e.g., SSNs) can be extracted via only black-box access to GPT-2 (Carlini et al., 2021). It is also possible to recover personal health information (e.g., patient-condition pairs) from BERT trained over a clinical corpus (Lehman et al., 2021) based on the extraction attack (Carlini et al., 2021).

Differential privacy (DP) (Dwork et al., 2006) has emerged as the de facto privacy standard for protecting individual privacy. To thwart MIAs on individuals’ training data, DP stochastic gradient descent (DP-SGD) (Abadi et al., 2016) can be used. It clips the gradients of each example in a batch and adds random Gaussian noise to the aggregated gradient. It is more general than earlier attempts (Chaudhuri and Monteleoni, 2008; Chaudhuri et al., 2011) that focus on convex problems and has been implemented in modern ML frameworks, such as PyTorch and TensorFlow. One can apply it to fine-tune LM-based NLP pipelines while ensuring example-level privacy, assuming each individual contributes an example, typically a sequence-label pair.

Unfortunately, DP-SGD often uses a trusted party to curate users’ sensitive training data. Although it can be done distributively (Bonawitz et al., 2017; McMahan et al., 2018) via secure aggregation (Chase and Chow, 2009) with extra costs and trust assumptions, it offers central DP (CDP) at its core. [Footnote 1: Distributed DP-SGD adds local noise too small to achieve LDP. But it is protected by secret sharing. When all shares are aggregated, the noise terms cancel out, assuming an honest majority. It thus faces a “synchronization” issue begging for identification and recovery mechanisms with computation and communication overheads (Bonawitz et al., 2017).] Instantiating per-example gradients as large as entire pipelines (e.g., >110M parameters for BERT-Base) is obviously costly. Moreover, maintaining the utility of pipelines trained with the noisy aggregated gradient is tricky due to the dimensional “curse.” A recent study (Yu et al., 2021, Table 4) shows that the average accuracy in fine-tuning LMs for four NLP tasks at moderate privacy is 65.7% (vs. 91.8% without DP). Finally, the inference-time embeddings are not perturbed by the noise added during training, leaving inference queries vulnerable to various recovery attacks (Song and Raghunathan, 2020; Pan et al., 2020), ranging from sensitive attributes (e.g., authorship) to raw text.

1.1. Natural Privacy via Perturbing Embeddings

We propose DP-Forward, a radically different approach that perturbs forward-pass signals: Users can locally inject noise into the embeddings of (labeled) sequences before sharing them for training, in contrast to perturbing gradients in back-propagation (possibly by an untrusted party). It is meant for provable local DP (LDP) guarantees, thus protecting against stronger adversaries than DP-SGD.

Our approach also naturally fits the federated learning (FL) setting, which does not gather users’ data, albeit with substantial differences – FL typically shares noiseless local model updates. Note that any subsequent computation (e.g., gradient computation) on noisy embeddings incurs no extra privacy loss due to the free post-processing of LDP. One might force DP-SGD to offer LDP by adding “enough” noise to the orders-of-magnitude larger per-example gradient from a user, but it may yield unusable models at a similar privacy level.

DP-Forward also extends to inference by adding noise to users’ test-sequence embeddings, ensuring LDP as in training. As a “side” benefit, it can effectively mitigate emerging embedding-based privacy risks (Pan et al., 2020; Song and Raghunathan, 2020) beyond MIAs.

The design goals of DP-Forward align with our overarching objectives: LDP (vs. CDP), more direct protection of raw data (vs. gradients) against new threats (Pan et al., 2020; Song and Raghunathan, 2020), and efficiency comparable to regular non-private training (allowing batch processing of noisy embeddings). The foundation supporting these desiderata, unfortunately, was unavailable: a dedicated mechanism to perturb the forward-pass signals is indispensable.

Specifically, we need to derive noises for embeddings of training/inference text sequences obtained through the forward pass of LM-based pipelines as a real- and matrix-valued function. One might adopt the classical Gaussian mechanism (GM) (Dwork and Roth, 2014) to add i.i.d. noise drawn from a univariate Gaussian distribution. Yet, GM calibrates its noise variance based solely on a sufficient condition for DP, and its variance formula is not applicable to a low privacy regime (Balle and Wang, 2018). Another candidate is the matrix-variate Gaussian (MVG) mechanism (Chanyaswad et al., 2018), tailored for matrix-valued data: It exploits possibly non-i.i.d. noise from a matrix Gaussian distribution to perturb more important rows/columns less. Although it may show better utility over GM (Chanyaswad et al., 2018), it is still sub-optimal due to the sufficient condition.

To optimize MVG, we propose an analytic matrix Gaussian mechanism (aMGM) by integrating a necessary and sufficient condition from the analytic GM (aGM) (Balle and Wang, 2018) for non-i.i.d. noise calibration. Our challenge lies in manipulating the two covariance matrices instead of a single variance. We deduce a constraint only on the two smallest singular values (Section 4.2), indicating that i.i.d. noise (as in aGM) may already be optimal for general applications like DP-Forward. [Footnote 2: With extra assumptions, dedicated allocation of the other singular values by optimizing/maximizing utility functions specific to applications could help.]

A transformer-based pipeline contains an input embedding layer, encoders, and task layers. All these layers prominently manipulate embeddings of text inputs in training and subsequent inference. We investigate adding aMGM noise to embeddings output by any hidden (sub-)layer before the task layers (Figure 1). To ensure sequence-level LDP, we need to estimate the L_{2}-sensitivity (Dwork and Roth, 2014) of “pre-noise” functions for any two sequences. It is non-trivial since the functions can include different (sub-)layers that may not even be Lipschitz (Kim et al., 2021). Our strategy is to normalize the function outputs to have a fixed Frobenius (or L_{2}) norm, similar to gradient clipping (Abadi et al., 2016). It works especially well for deeper sub-layers, achieving comparable task accuracy to the non-private baseline (Section 5). For the first few (sub-)layers, we also make two specializations in relaxing LDP to the token level, elaborated in Appendix A.2, to improve accuracy.

Figure 1. A typical NLP pipeline built atop a pre-trained LM such as BERT with our matrix-valued Gaussian noise layer

1.2. Our Contributions

Motivated by prevailing privacy concerns in LM fine-tuning and inference and inherent shortcomings of DP-SGD, we initiate a formal study of an intuitive but rarely studied approach and explore its integration with a transformer-based NLP pipeline. Specifically:

1) We propose DP-Forward fine-tuning, which perturbs the forward-pass embeddings of every user’s (labeled) sequence. It offers more direct protection than DP-SGD perturbing aggregated gradients. Its provable guarantee (Theorem 3) is a new sequence-level LDP notion (SeqLDP, Definition 2), with the more stringent (\epsilon,\delta)-LDP guarantee to hold w.r.t. only sequences. Moreover, DP-Forward can naturally extend to inference, ensuring the standard LDP (Theorem 5) for test sequences without labels, whereas DP-SGD cannot.

2) To instantiate an optimal output perturbation mechanism for DP-Forward, we propose aMGM, which is of independent interest for any matrix-valued function. By exploiting a necessary and sufficient DP condition from aGM (Balle and Wang, 2018), it can draw possibly non-i.i.d. noise from a matrix Gaussian distribution like MVG (Chanyaswad et al., 2018) while producing orders-of-magnitude smaller noise for high-dimensional data (Section 5.3).

3) We conduct experiments [Footnote 3: Our code is available at https://github.com/xiangyue9607/DP-Forward.] on three typical NLP tasks in Section 5, showing how crucial hyperparameters (e.g., the sequence length) impact task accuracy. To fairly compare with DP-SGD on privacy-vs.-utility: i) We perturb labels by the randomized response (Warner, 1965) such that DP-Forward fine-tuning offers the standard LDP for sequence-label pairs (Theorem 4). ii) We “translate” DP-Forward with standard LDP to (example-level) CDP (as offered by DP-SGD) via shuffling (Erlingsson et al., 2019). Our accuracy gain (for deep-layer DP-Forward instantiations) is up to 7.7 percentage points (pp), compared to DP-SGD or its recent improvements (Yu et al., 2021, 2022) (reviewed in Section 7.3), at a similar privacy level. Efficiency-wise, DP-SGD incurs >3\times time and GPU-memory costs even with the latest Opacus library (Yousefpour et al., 2021).

4) We evaluate three classes of privacy threats. Like DP-SGD, DP-Forward (including the two token-level designs in Appendix A.3) can effectively defend against sequence-level MIAs, but only DP-Forward can thwart the two threats on (inference-time) embeddings. Specifically, Section 6 shows that DP-SGD totally fails in two embedding inversion attacks, while DP-Forward remarkably reduces their success rates by up to 88pp. For a neural-network-based attribute inference attack, DP-SGD reduces its success rates by only 15pp on average, while DP-Forward achieves a {\sim}41pp reduction, making the attack predict like assigning all labels to the majority class.

In short, DP-Forward is a better alternative to DP-SGD in training (and testing) deep-learning models, e.g., gigantic LM-based ones.

2. Preliminaries and Notations

2.1. Transformer Encoders in BERT

Modern transformer-based LMs, including BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), are first pre-trained on enormous unannotated (public) corpora to learn contextualized text representations. Later, they can be fine-tuned for various downstream NLP tasks (e.g., sentiment analysis, question answering) using much smaller, task-specific datasets.

We consider BERT (Figure 1), which comprises a stack of L identical layers (i.e., bidirectional transformer encoders (Vaswani et al., 2017)). Each layer has two sub-layers: the dot-product multi-head attention (MHA) (Vaswani et al., 2017) with h heads and a feed-forward network (FFN). Each sub-layer has an extra residual connection, followed by layer normalization (Ba et al., 2016).

Let X=\langle x_{i}\rangle^{n}_{i=1} be an input sequence of n tokens (e.g., characters, words, sub-words, q-grams), where x_{i} is from a vocabulary \mathcal{V}. The input embedding layer first maps each x_{i} to its representation in \mathbb{R}^{d}, which is the sum of the token, segment, and position embeddings. We re-use X to represent the hidden embedding matrix in \mathbb{R}^{n\times d}. For each of the h attentions \mathsf{Att}_{i} in the MHA layer, we derive the query, key, and value matrices Q,K,V\in\mathbb{R}^{n\times d/h} (h divides d) by multiplying X with head-specific weights W^{Q},W^{K},W^{V}\in\mathbb{R}^{d\times d/h}. Its output is

\mathsf{Att}_{i}(Q,K,V)=\mathsf{softmax}\left(\frac{QK^{\top}}{\sqrt{d/h}}\right)V,\quad\forall i\in[1,h].

The input to \mathsf{softmax}(\cdot) is an n\times n matrix of pairwise dot products. Finally, MHA concatenates (denoted by ||) all the head outputs into a matrix in \mathbb{R}^{n\times d}, right multiplied by a projection matrix W^{O}\in\mathbb{R}^{d\times d}:

\mathsf{MHA}(X)=[\mathsf{Att}_{1}||\cdots||\mathsf{Att}_{h}]W^{O}.

FFN is composed of two linear mappings with a ReLU activation in between. It operates separately and identically on each x_{i}, i\in[1,n],

\mathsf{FFN}(x_{i})=\max(0,\,x_{i}W_{1}+b_{1})W_{2}+b_{2},

where W_{1}, W_{2}, b_{1}, and b_{2} are trainable matrix-/vector-valued parameters. Its output on X is \mathsf{FFN}(X)=[\mathsf{FFN}(x_{1})^{\top}||\cdots||\mathsf{FFN}(x_{n})^{\top}]. The residual connection for a sub-layer is X+\mathsf{MHA}(X) or X+\mathsf{FFN}(X). The layer normalization \mathsf{LN}(x_{i}) normalizes all entries of x_{i} to have zero mean and unit variance using an extra scale-then-shift step.
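To make the shapes concrete, here is a minimal NumPy sketch of one attention head and the position-wise FFN; the dimensions, random weights, and initialization scale are illustrative stand-ins, not BERT's actual parameters.

import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

n, d, h = 128, 768, 12           # sequence length, hidden size, number of heads
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))  # hidden embedding matrix (stand-in values)

# One attention head: project X to query/key/value of width d/h
W_Q, W_K, W_V = (rng.standard_normal((d, d // h)) * 0.02 for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
att = softmax(Q @ K.T / np.sqrt(d / h)) @ V    # (n, d/h)

# Position-wise FFN applied identically to every token
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.02, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.02, np.zeros(d)
ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2     # max(0, X W1 + b1) W2 + b2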

At the output of the final encoder, the hidden embedding matrix is reduced to a sequence feature in \mathbb{R}^{1\times d}. Standard reduction methods include mean pooling (Reimers and Gurevych, 2019) (computing \sum^{n}_{i=1}x_{i}/n) or taking the last embedding of a special token [CLS] for classification (Devlin et al., 2019).

The pre-training of BERT is based on two self-supervised tasks: masked language model (MLM) and next sentence prediction (Devlin et al., 2019). We adopt MLM: It randomly masks out some tokens, indexed by \mathcal{I}, in an input sequence X. The objective is to predict those masked tokens using their context by minimizing the cross-entropy loss

(1) L_{\mathrm{MLM}}=-\sum_{i\in\mathcal{I}}\log\Pr[x_{i}|\hat{X};\theta],\quad\text{with }\hat{X}=X\setminus\{x_{i}\,|\,i\in\mathcal{I}\},

where \theta denotes all the parameters of the BERT transformer encoders.
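For illustration, a short PyTorch sketch of Eq. (1) on a toy example follows; the logits, token ids, and masked index set are made up, and a real MLM head would produce the logits from \hat{X}.

import torch
import torch.nn.functional as F

vocab_size, n = 30522, 128
logits = torch.randn(n, vocab_size)            # per-position scores (stand-in for the MLM head output)
true_ids = torch.randint(0, vocab_size, (n,))  # original token ids of the sequence
masked_idx = torch.tensor([3, 17, 42])         # the index set I of masked positions (illustrative)

# L_MLM = -sum_{i in I} log Pr[x_i | X_hat; theta]
log_probs = F.log_softmax(logits[masked_idx], dim=-1)
loss = -log_probs[torch.arange(len(masked_idx)), true_ids[masked_idx]].sum()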

2.2. (Local) Differential Privacy

DP (Dwork et al., 2006) is a rigorous, quantifiable privacy notion. It has two popular models, central and local. In central DP, a trusted data curator accesses the set \mathcal{X} of all individuals’ raw data and processes \mathcal{X} by a randomized mechanism \mathcal{M} with some random noise. Formally:

Definition 1 (Central DP).

For privacy parameters \epsilon\geq 0 and 0\leq\delta\leq 1, \mathcal{M} fulfills (\epsilon,\delta)-DP if, for all neighboring datasets \mathcal{X} and \mathcal{X}^{\prime} (denoted by \mathcal{X}\simeq\mathcal{X}^{\prime}) and any subset \mathcal{O} of the outputs of \mathcal{M},

\Pr[\mathcal{M}(\mathcal{X})\in\mathcal{O}]\leq e^{\epsilon}\Pr[\mathcal{M}(\mathcal{X}^{\prime})\in\mathcal{O}]+\delta.

We call it \epsilon-DP or pure DP when \delta=0.

The neighboring notion is application-dependent (to be discussed in Section 3.1). Typically, it involves the “replace-one” relation: \mathcal{X}^{\prime} can be obtained from \mathcal{X} by replacing a single individual’s data point (e.g., a sequence-label pair). CDP offers plausible deniability to any individual in a dataset. In contrast, local DP (LDP) (Kasiviswanathan et al., 2008) removes the trusted curator, allowing individuals to locally perturb their data using \mathcal{M} before sending it to an untrusted aggregator for analytics.

Definition 2 (Local DP).

For \epsilon\geq 0, 0\leq\delta\leq 1, \mathcal{M} is (\epsilon,\delta)-LDP if, for any two inputs X,X^{\prime} and any possible output subset \mathcal{O} of \mathcal{M},

\Pr[\mathcal{M}(X)\in\mathcal{O}]\leq e^{\epsilon}\Pr[\mathcal{M}(X^{\prime})\in\mathcal{O}]+\delta.

Similarly, we call it \epsilon-LDP when \delta=0.

Privacy Loss Random Variable (PLRV)

For a specific pair of inputs \mathcal{X}\simeq\mathcal{X}^{\prime}, the privacy loss (or the “actual \epsilon value”) (Balle and Wang, 2018) incurred by observing an output O is the log-ratio of two probabilities:

\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}(O)=\ln{\frac{\Pr[\mathcal{M}(\mathcal{X})=O]}{\Pr[\mathcal{M}(\mathcal{X}^{\prime})=O]}}.

When O varies according to \mathcal{M}(\mathcal{X}), we get the PLRV \mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}. A helpful way to work with DP is to analyze tail bounds on PLRVs (Dwork and Roth, 2014), which we utilize to build our proposed mechanism in Section 4.2.

DP has two desirable properties: free post-processing and composability. The former means that further computations on the outputs of an (\epsilon,\delta)-DP mechanism incur no extra privacy loss. The latter allows us to build more complicated mechanisms atop simpler ones: sequentially (and adaptively) running an (\epsilon,\delta)-DP mechanism k times on the same input is at least (k\epsilon,k\delta)-DP. The two properties also hold for LDP when considering a dataset with only one row.

An output perturbation mechanism \mathcal{M} for a matrix-valued function f:\mathcal{X}\rightarrow\mathbb{R}^{n\times d} is given by computing f on the inputs and then adding random noise drawn from a random variable to its outputs.

Gaussian Mechanism (GM). For (\epsilon,\delta)-DP, a typical instance of \mathcal{M} is the classical GM (Dwork and Roth, 2014), which adds noise Z\in\mathbb{R}^{n\times d} with each entry i.i.d. drawn from a univariate Gaussian distribution \mathcal{N}(0,\sigma^{2}). The variance is \sigma^{2}=2\ln(1.25/\delta)S^{2}_{2}(f)/\epsilon^{2} with the L_{2}-sensitivity:

S_{2}(f)=\sup_{\mathcal{X}\simeq\mathcal{X}^{\prime}}||f(\mathcal{X})-f(\mathcal{X}^{\prime})||_{F},

where ||\cdot||_{F} denotes the matrix Frobenius norm (Horn and Johnson, 2012).
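As a concrete reference point, a minimal NumPy sketch of the classical GM for a matrix-valued output is given below; the sensitivity and privacy parameters are placeholders, and (as noted later) the variance formula only applies to \epsilon\leq 1.

import numpy as np

def classical_gm(f_out, s2, eps, delta, rng=np.random.default_rng()):
    """Add i.i.d. Gaussian noise to a matrix-valued output f_out.

    s2 is the L2-sensitivity S_2(f); the classical formula assumes eps <= 1.
    """
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * s2 / eps
    return f_out + rng.normal(0.0, sigma, size=f_out.shape)

# Example: a 128 x 768 embedding matrix with sensitivity 2C (C = 1, illustrative)
noisy = classical_gm(np.zeros((128, 768)), s2=2.0, eps=1.0, delta=1e-5)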

Table 1 summarizes the acronyms throughout this work.

Table 1. Acronyms (newly proposed in this work: Seq(L)DP and aMGM)
NLP Natural Language Processing
LM Language Model
BERT Bidirectional Encoder Representations from Transformers
MLM Masked Language Modeling
MHA Multi-Head Attention
FFN Feed-Forward Network
MIA Membership Inference Attack
DP-SGD Differentially Private Stochastic Gradient Descent
PLRV Privacy Loss Random Variable
(C)DP (Central) Differential Privacy
LDP Local Differential Privacy
Seq(L)DP Sequence (Local) Differential Privacy
GM Gaussian Mechanism
RR Randomized Response
MVG Matrix-Variate Gaussian (Mechanism)
aGM Analytic Gaussian Mechanism
aMGM Analytic Matrix Gaussian Mechanism

3. DP-Forward

We study BERT-based pipelines as an example due to their superior performance in classification tasks. DP-Forward can be readily applied to other (transformer-based) NLP or computer vision models that involve matrix-valued computation during the forward pass.

Suppose each user holds a sequence-label pair (X,y) or only X for fine-tuning or testing a pipeline at an untrusted service provider. Sharing redacted X (with common PII removed) or its feature, a non-human-readable real-valued embedding matrix, is leaky (Sweeney, 2015; Song and Raghunathan, 2020; Pan et al., 2020).

For DP-Forward training, users perturb their embedding matrices locally to ensure (new notions of) LDP before being shared, and they should also perturb the corresponding labels if deemed sensitive (Section 3.4). We explore different options for splitting pipelines into pre-noise functions f(\cdot) and post-noise processing p(\cdot) in Section 3.2: Users can access f(\cdot) to derive embedding matrices, perturbed by an output perturbation mechanism \mathcal{M} (e.g., GM); the service provider runs p(\cdot) on noisy (labeled) embeddings for fine-tuning (Section 3.3) or pre-training (Section 3.6). The challenge lies in analyzing S_{2}(f) for different pipeline parts, which we address by normalizing f(\cdot).

DP-Forward can be naturally used to protect inference sequences (Section 3.5), unlike DP-SGD. It exploits the free post-processing (i.e., inference works on noisy embeddings), incurring minimal changes to pipelines with the extra “plug-and-play” noise layer.

3.1. Notions of Sequence (Local) DP

Embeddings f(X) encode semantic information of input sequences X, each of which has n tokens (Section 2.1). Fine-tuning (or subsequent inference of) NLP pipelines essentially processes f(X). DP-Forward fine-tuning protects every X by an output perturbation mechanism \mathcal{M} over f(X), in contrast to DP-SGD, which perturbs aggregates of gradients f^{\prime}(X,y) over X and label y. Simply put, our (\epsilon,\delta)-LDP holds for X, while DP-SGD provides CDP for (X,y).

Sequence-only protection is meaningful since sequences often convey (implicit) sensitive information (e.g., authorship), whereas labels (e.g., a single bit denoting positive/negative) can be public. We defer to Section 3.4 for achieving “full” LDP over (X,y). To bridge the gap between theoretical guarantees of DP-SGD and DP-Forward, we first define sequence DP (SeqDP) in the central setting. [Footnote 4: One could generalize it to “feature” (or “input”) DP, as DP-Forward also allows other types of features beyond embeddings (and its essence is input-only privacy). To keep our focus on NLP, we use “sequence” here. (PixelDP (Lécuyer et al., 2019) treats pixels as image features.)]

Definition 1 (SeqDP).

For \epsilon\geq 0, 0\leq\delta\leq 1, \mathcal{M} is (\epsilon,\delta)-SeqDP if, for all \mathcal{X}\simeq\mathcal{X}^{\prime} that only differ in a sequence at some index i, i.e., (X_{i},y_{i})\in\mathcal{X} and (X^{\prime}_{i},y_{i})\in\mathcal{X}^{\prime} for any X_{i},X^{\prime}_{i}, and any possible output subset \mathcal{O},

\Pr[\mathcal{M}(\mathcal{X})\in\mathcal{O}]\leq e^{\epsilon}\Pr[\mathcal{M}(\mathcal{X}^{\prime})\in\mathcal{O}]+\delta.

3.1.1. Label DP

The recently proposed notion of label DP (Ghazi et al., 2021; Esmaeili et al., 2021) was originally studied in PAC learning (Chaudhuri and Hsu, 2011). It only protects labels (not the corresponding inputs/images): (\epsilon,\delta)-DP holds only w.r.t. labels.

Our SeqDP is “more secure” than or at least “complements” label DP, which has an inherent flaw (Busa-Fekete et al., 2021): As labels typically rely on their sequences (but not vice versa), it is very likely that the true labels can be recovered from the raw sequences, even if the labels are protected (by any label-DP mechanism). The follow-up (Wu et al., 2023) shows the impossibility of label protection under label DP even with arbitrarily small (\epsilon,\delta) when models generalize. Moreover, labels can be absent (e.g., in inference or self-supervised learning), for which SeqDP upgrades to the standard (\epsilon,\delta)-DP, whereas label DP is simply inapplicable.

3.1.2. Sequence Local DP (SeqLDP)

We further define SeqLDP, the local counterpart of sequence DP. Note that the above discussion of label DP in relation to SeqDP also carries over to SeqLDP.

Definition 2 (SeqLDP).

For \epsilon\geq 0, 0\leq\delta\leq 1, \mathcal{M} satisfies (\epsilon,\delta)-SeqLDP if, for all X,X^{\prime} with the same y, and any possible output subset \mathcal{O},

\Pr[\mathcal{M}(X,y)\in\mathcal{O}]\leq e^{\epsilon}\Pr[\mathcal{M}(X^{\prime},y)\in\mathcal{O}]+\delta.

In theory, SeqLDP remains a strong notion (like the standard LDP). It provides information-theoretic protection on sequences and bounds the indistinguishability of any X,X^{\prime} (differing in up to n tokens), hence governing the “usefulness” of noisy embeddings.

3.1.3. Sequence-Level SeqLDP vs. Token-Level SeqLDP

In practice, as a strong notion balancing seemingly conflicting requirements (ideal theoretical guarantees and empirical utility), attaining a meaningful range of \epsilon for SeqLDP is a struggle. Adding Gaussian noise to the outputs of f(\cdot) for (\epsilon,\delta)-SeqLDP requires bounding the L_{2}-sensitivity S_{2}(f) for all X,X^{\prime}. Our approach is to normalize the outputs (with extra benefits elaborated in Section 3.2), similar to clipping gradients in DP-SGD. It generally works better when f(\cdot) has more layers (at the same meaningful range of \epsilon) since fewer (trainable) parameters/layers of p(\cdot) are “affected” by the noisy outputs.

Unfortunately, when f(\cdot) includes only the first few layer(s), e.g., only the input embedding layer is available to the users (say, for saving user-side storage and computation overheads), it leads to poor utility. As a comprehensive study, we resort to row-wise normalization with the (composition of) Lipschitz constants (Kim et al., 2021) to maintain utility for those cases. [Footnote 5: One might also resort to the weaker random DP (Hall et al., 2013): (\epsilon,\delta)-DP holds on all but a small \gamma-proportion of “unlikely” \mathcal{X}\simeq\mathcal{X}^{\prime} for an extra parameter \gamma\in(0,1). It is useful when the global sensitivity is hard to compute. Exploring it is left as future work.] In contrast to the general normalization, it aims for weaker SeqLDP at the token level (cf. event-level vs. user-level LDP (Zhou et al., 2022)), a finer granularity in the “protection hierarchy,” protecting any neighboring sequences (vs. datasets) differing in a single token (vs. sequence). Details are deferred to Appendix A.

3.2. Our Approach for Sequence LDP

DP-Forward in our paper (except Appendix A) applies the general normalization approach to any f(\cdot) for sequence-level (Seq)LDP.

Let f(\cdot) be an arbitrarily deep forward pass, ranging from the first (input embedding) layer alone to all but the last (task) layer in a BERT-based pipeline (Figure 1). Correspondingly, let p(\cdot) be the remaining layers, ranging from the task layers alone to all but the first (input embedding) layer. Every sequence X becomes an embedding matrix f(X)\in\mathbb{R}^{n\times d} at the output of layers in encoders, or in \mathbb{R}^{1\times d} before task layers (Section 2.1). To offer (\epsilon,\delta)-SeqLDP, we adopt a suitable output perturbation mechanism \mathcal{M}, such as GM, considering that a dataset has only one labeled sequence.

Since \mathcal{M} can work on the output of any hidden layer, estimating S_{2}(f) is non-trivial. Specifically, \mathsf{MHA} itself, let alone more layers included, is not Lipschitz continuous, meaning its outputs can change arbitrarily for even slight input variation (Kim et al., 2021). To address this, our approach is to normalize or clip the function outputs:

||f(\cdot)||_{F}=C \quad\text{or}\quad f(\cdot)/\max\left(1,\frac{||f(\cdot)||_{F}}{C}\right)

as in DP-SGD (Abadi et al., 2016), where C is a tunable parameter. We then have S_{2}(f)=2C. Such normalization makes task utility less “sensitive” to the choice of C since signal and noise increase proportionally with C, whereas the signal may be unchanged when f(\cdot) is not clipped. It also has many other benefits, such as stabilizing training, avoiding overfitting, and accelerating convergence (Aboagye et al., 2022). Hence, we resort to normalization in our experiments. One can then calibrate the Gaussian noise Z and derive f(X)+Z for the post-noise layers p(\cdot).
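A minimal sketch of this normalization (or clipping) step is given below, assuming i.i.d. Gaussian noise whose standard deviation sigma has already been calibrated from (\epsilon,\delta) and S_{2}(f)=2C (e.g., by the aMGM calibration in Section 4); the function names are ours, not from the released code.

import numpy as np

def normalize_to_C(E, C):
    """Rescale an embedding matrix E so that ||E||_F = C exactly."""
    return E * (C / (np.linalg.norm(E) + 1e-12))

def clip_to_C(E, C):
    """DP-SGD-style clipping: shrink E only if ||E||_F exceeds C."""
    return E / max(1.0, np.linalg.norm(E) / C)

def perturb(E, C, sigma, rng=np.random.default_rng()):
    """Normalize then add i.i.d. Gaussian noise; S_2(f) = 2C by construction."""
    return normalize_to_C(E, C) + rng.normal(0.0, sigma, size=E.shape)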

Note that we remove the residual connection when adding noise to the output of the first MHA layer to avoid p(\cdot) re-accessing X (dashed arrow, Figure 1), so as to maintain free post-processing. This may lead to instability (e.g., gradient vanishing) (Yu et al., 2021), but it can be mitigated by pre-training a new BERT without such a residual connection to keep consistent with later fine-tuning/inference.

DP-Forward using GM suffers from the “curse of dimensionality” when d is large (e.g., 768 for BERT-Base). To alleviate the curse, we can append two linear maps M_{1},M_{2}\in\mathbb{R}^{d\times d^{\prime}} with d^{\prime}\ll d, such that f(\cdot) and p(\cdot) respectively contain M_{1} and M_{2}. Both maps are randomly initialized and updated like other weights using gradients. The raw embedding matrix is first right multiplied by M_{1}, yielding a matrix in \mathbb{R}^{n\times d^{\prime}} or \mathbb{R}^{1\times d^{\prime}}, before being normalized. Our privacy guarantee will not be affected since S_{2}(f) remains the same. We then use M_{2} to restore the dimensionality to be compatible with the raw pipeline; M_{2} incurs no extra privacy loss due to the free post-processing. Nevertheless, it needs dedicated efforts to modify the pipeline; dimension-reduced embedding matrices may also lose useful information, degrading task utility. We thus make M_{1} and M_{2} optional (see Section 5.2).

3.3. DP-Forward Fine-tuning

Suppose we use a raw, public BERT checkpoint for fine-tuning. [Footnote 6: Using noisy BERT for fine-tuning (and subsequent inference) is deferred to Section 3.6.] In the forward pass of the i-th (i\geq 1) step, it offers the latest f^{(i-1)}(\cdot) to a batch of users, mimicking the regular mini-batch SGD; f^{(0)} is from the raw checkpoint. Users are randomly chosen (without replacement), and their number is a fixed parameter. Users in the batch individually compute their noisy embeddings f^{(i-1)}(X)+Z to ensure SeqLDP (Theorem 3). They then send them with unperturbed labels y to the service provider, who runs p^{(i-1)}(\cdot) over (f^{(i-1)}(X)+Z,y) to compute the batch loss; any post-processing of embeddings under SeqLDP incurs no extra privacy degradation on X. Here, p^{(0)} includes the rest of the raw BERT and randomly initialized task layers.

During the back-propagation, the service provider can update p^{(i-1)}(\cdot) to p^{(i)}(\cdot) via the gradient (derived from the loss and noisy embeddings) of the post-noise layers. To avoid accessing users’ raw X, it needs to freeze the pre-noise layers f^{(i-1)}(\cdot) as f^{(0)}. Parameter freezing is compatible with the more recent zero-shot or in-context learning paradigm (Min et al., 2022). It is useful when models are gigantic and full fine-tuning is expensive. However, the more layers are frozen, the worse the utility might be (even in non-private settings).

There are two general ways to update f^{(i-1)}(\cdot) securely: i) We can assume an extra trusted party (as in DP-SGD), but it becomes central DP. ii) Users can first derive the gradients for the layers inside f^{(i-1)}(\cdot) locally on their X and then resort to secure aggregation (Bonawitz et al., 2017) for global updates at the service provider. However, it is costly. For better utility, we update f^{(i-1)}(\cdot) in experiments, requiring us to consider privacy degradation across different epochs due to the composability (as detailed below). Dedicated approaches (that balance efficiency, privacy, and utility) are left as future work.
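As a rough sketch of one fine-tuning step under this protocol, assume a hypothetical split of the pipeline into `f_layers` (user side, frozen here) and `p_layers` (service-provider side); the function names, the split, and the use of plain PyTorch modules are illustrative assumptions, not the released implementation.

import torch

def user_step(f_layers, X, C, sigma):
    """User side: compute the pre-noise embedding, normalize, and add noise."""
    with torch.no_grad():                      # f is frozen (or updated separately)
        E = f_layers(X)                        # (n, d) embedding matrix
    E = E * (C / E.norm())                     # ||E||_F = C, so S_2(f) = 2C
    return E + sigma * torch.randn_like(E)     # noisy embedding; SeqLDP holds

def server_step(p_layers, optimizer, noisy_batch, labels, loss_fn):
    """Server side: ordinary training on the stacked noisy embeddings (free post-processing)."""
    optimizer.zero_grad()
    loss = loss_fn(p_layers(noisy_batch), labels)
    loss.backward()                            # gradients only for post-noise layers
    optimizer.step()
    return loss.item()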

Theorem 3.

Let f(\cdot) be the pre-noise function (of BERT-based pipelines) and \mathcal{M} be GM with \epsilon\geq 0, 0\leq\delta\leq 1. DP-Forward fine-tuning running \mathcal{M} on normalized/clipped f(\cdot) ensures (\epsilon,\delta)-SeqLDP.

The proof follows that of GM (Dwork and Roth, 2014). The crux is that S_{2}(f) for any X,X^{\prime} is given by the output normalization, independent of the inputs.

Privacy Accounting. An epoch refers to one entire pass over the private training corpus. Every X is used once per epoch. The number of epochs k is a hyperparameter, which is typically small. Repeated applications of GM over the same X require estimating the overall privacy loss due to composability (unless we freeze f and re-use f(X)+Z). The well-known moments accountant (Abadi et al., 2016) (or its generalization to Rényi DP (Mironov, 2017)) only provides a loose upper bound, which is even inapplicable if unbounded moments exist. Gaussian DP (Bu et al., 2019) proposes an accountant based on the central limit theorem. Yet, it leads to significant underestimation via a lower bound. Instead, we resort to a recent numerical accountant (Gopi et al., 2021), which outperforms RDP or GDP by approximating the true overall \epsilon to arbitrary accuracy. It composes the privacy curve of a mechanism by truncating and discretizing PLRVs, with their PDFs convolved by FFT (Gopi et al., 2021).

3.4. DP-Forward with Shuffling versus DP-SGD

DP-Forward ensures SeqLDP for fine-tuning, while DP-SGD offers central DP (for sequence-label pairs). To facilitate a fair comparison (on privacy-utility tradeoffs), we make two changes. First, we also perturb the labels with a suitable mechanism for the standard LDP, i.e., extending the protection from sequence to sequence-label pairs. Second, we use shuffling (Erlingsson et al., 2019) to “translate” our (label-protected) DP-Forward with LDP to claim (example-level) CDP as DP-SGD.

Discrete Label Perturbation. For most NLP tasks, e.g., binary or multi-class classification in the GLUE benchmark (Wang et al., 2019), the size |\mathbf{y}| of the label space is often small. A simple yet effective solution for discrete data is randomized response (RR) (Warner, 1965), proposed decades ago! Specifically, RR perturbs a true label y to itself, \hat{y}=y, with the probability

\Pr[y=\hat{y}]=e^{\epsilon}/(e^{\epsilon}+|\mathbf{y}|-1),

or to any \hat{y}\in\mathbf{y}\setminus y uniformly, where \mathbf{y} denotes the label space.

When |\mathbf{y}| is large, we can use a prior to “prune” \mathbf{y} to a smaller \mathbf{y}^{\prime} (Ghazi et al., 2021). The prior can be publicly available (e.g., auxiliary corpora similar to the users’ data) or progressively refined from a uniform distribution via multi-stage training (Ghazi et al., 2021). One can then estimate an optimal |\mathbf{y}^{\prime}| by maximizing the probability that the output is correct, i.e., \Pr[y=\hat{y}]. With (prior-aided) RR (Ghazi et al., 2021), we can achieve full LDP.
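A minimal sketch of plain RR (without the prior-aided pruning) follows directly from the probability above; the label space and \epsilon value in the usage line are arbitrary examples.

import numpy as np

def randomized_response(y, label_space, eps, rng=np.random.default_rng()):
    """Report y truthfully w.p. e^eps / (e^eps + |Y| - 1); else a uniform other label."""
    k = len(label_space)
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_true:
        return y
    others = [l for l in label_space if l != y]
    return others[rng.integers(len(others))]

# Example: a binary sentiment label at eps = 1
noisy_label = randomized_response(1, [0, 1], eps=1.0)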

Theorem 4.

Let f(\cdot) be the pre-noise function (of BERT-based pipelines), \mathcal{M} be GM with \epsilon_{1}\geq 0, 0\leq\delta\leq 1, and \mathcal{M}_{RR} be (prior-aided) RR with \epsilon_{2}\geq 0. DP-Forward fine-tuning perturbing f(X) and y separately by \mathcal{M} and \mathcal{M}_{RR} ensures (\epsilon_{1}+\epsilon_{2},\delta)-LDP.

The proof follows from the basic composition theorem (Dwork and Roth, 2014).

Privacy Amplification by Shuffling. If the noisy embedding-label pairs are also shuffled properly, DP-Forward can claim example-level CDP (as in DP-SGD), where shuffling “amplifies” LDP guarantees by \Theta(\sqrt{N}) for a total number of N users (without extra noise addition) (Erlingsson et al., 2019). We then show that DP-Forward qualitatively outperforms DP-SGD from the SNR perspective under a similar privacy regime.

Suppose we train for one epoch, and the normalization factor is C. For DP-SGD, the batch size is b; the subsampling probability and the number of training steps are respectively b/N and N/b. If each step is (\epsilon,\delta)-DP, the overall privacy loss is (O(\epsilon\sqrt{b/N}),\delta)-DP using the strong composition and privacy amplification by subsampling (Abadi et al., 2016).

DP-Forward with shuffling can also be seen as composing N subsamplings, each a fraction of size 1 (Steinke, 2022). It is (O(\epsilon\sqrt{1/N}),\delta)-DP, which is “amplified” from (\epsilon,\delta)-LDP. For an easier analysis of SNR, we omit \epsilon_{2} of RR since the overall \epsilon is dominated by composing the subsampled Gaussian. So, our Gaussian noise variance is b\times smaller than DP-SGD’s in each step; the SNR of each entry in embeddings vs. the aggregation of b gradients can be estimated as O(C/\sqrt{nd}) for DP-Forward vs. O(C/\sqrt{d^{\prime}}) for DP-SGD, where d^{\prime} is the gradient dimension and is much larger than nd, the embedding-matrix size.

3.5. DP-Forward Inference

Given only the fine-tuned pipeline part f(\cdot), users can derive the noisy embedding matrices of their test sequences for inference at the service provider while ensuring (\epsilon,\delta)-LDP. Inference using noise aligned with the noisy fine-tuning is also beneficial for task accuracy.

Local inference (as in DP-SGD) without noise forces the service provider to reveal its entire pipeline, losing its intellectual property and incurring more time and storage costs for both f(\cdot) and p(\cdot).

Theorem 5.

Let f(\cdot) be the fine-tuned pre-noise layers (of BERT-based pipelines) and \mathcal{M} be GM with \epsilon\geq 0, 0\leq\delta\leq 1. DP-Forward inference running \mathcal{M} on normalized/clipped f(\cdot) ensures (\epsilon,\delta)-LDP.

The proof is inherited from GM (Dwork and Roth, 2014). Different from DP-Forward fine-tuning, LDP holds for test sequences since the labels are absent.

3.6. DP-Forward Pre-training

Directly using the raw BERT might not “match” DP-Forward fine-tuning/inference, degrading task utility. Pre-training BERT with DP-Forward on publicly available text (e.g., Wikipedia), besides the private user-shared data, can make future operations “adaptive” to noise. It requires us to modify the raw MLM objective in Eq. (1):

L_{\mathrm{MLM}}^{*}=-\sum_{i\in\mathcal{I}}\log\Pr[x_{i}|\mathcal{M}(f(\hat{X}));\theta^{*}],

where \theta^{*} denotes the parameters of the “noisy” BERT. This endows the noisy BERT with some “de-noising” ability since the objective is to predict the raw masked tokens from the noisy embeddings \mathcal{M}(f(\hat{X})). It does not really breach privacy due to the free post-processing; LDP is ensured for each sequence, as the pre-training is self-supervised (without labels). Such noisy pre-training can also be outsourced to dedicated GPU clusters, enabling “de-noising BERT as a service.”

De-noising as post-processing is not new, but most prior arts need prior knowledge, e.g., a Bayesian prior. aGM formulates it as an unusual estimation problem since a single noisy output is observed for each input, which can then be solved by appropriate estimators, e.g., the Bayesian one (Balle and Wang, 2018). Another attempt (Lécuyer et al., 2019) trains a separate noisy auto-encoder, which learns the identity function f(X)=X, stacked before an image classification network, to de-noise the noisy input. It has limited applications for only noisy input embeddings and incurs extra changes when migrating it to an NLP pipeline.

4. Optimizing Matrix Gaussian Noise

To instantiate \mathcal{M} for f(\cdot)\in\mathbb{R}^{n\times d} of DP-Forward, a natural question is whether the classical GM is optimal. The answer is no. Its privacy analysis applies a sufficient but not necessary condition for (\epsilon,\delta)-DP while using Gaussian tail approximations, and its variance formula cannot extend to \epsilon>1 for a single run (e.g., inference) (Dwork and Roth, 2014).

Another candidate is the matrix-variate Gaussian (MVG) mechanism (Chanyaswad et al., 2018), tailored for matrix-valued functions. It exploits possibly non-i.i.d. noise from a matrix Gaussian distribution and outperforms GM in several usage cases (Chanyaswad et al., 2018). Yet, it is not optimal either, with the root cause still being based on a sufficient DP condition (Section 4.1). To improve it, we resort to a necessary and sufficient condition from aGM (Balle and Wang, 2018) for calibrating the matrix Gaussian noise (Section 4.2).

4.1. Matrix-Variate Gaussian (MVG) Mechanism

In contrast to the classical GM, MVG adopts possibly non-i.i.d. noise Z\in\mathbb{R}^{n\times d} drawn from the zero-mean matrix Gaussian distribution \mathcal{MN}_{n,d}(0,\Sigma,\Psi), where \Sigma\in\mathbb{R}^{n\times n} and \Psi\in\mathbb{R}^{d\times d} are the row- and column-wise covariance matrices. Intuitively, it adds less noise to more “important” rows or columns for possibly better utility.

Definition 1 (Matrix Gaussian Distribution).

The PDF of an n\times d random variable Z following \mathcal{MN}_{n,d}(0,\Sigma,\Psi) has the form:

(2) \Pr(Z|0,\Sigma,\Psi)=\frac{\exp\left(-\frac{1}{2}||U^{-1}ZV^{-\top}||^{2}_{F}\right)}{(2\pi)^{nd/2}|\Psi|^{d/2}|\Sigma|^{n/2}},

where U\in\mathbb{R}^{n\times n} and V\in\mathbb{R}^{d\times d} are invertible with \Sigma=UU^{\top} and \Psi=VV^{\top}, and |\cdot| denotes the matrix determinant (Horn and Johnson, 2012).

The definition is equivalent to the conventional form given by the matrix trace. It generalizes the univariate Gaussian used in GM; Z becomes i.i.d. when \Sigma and \Psi are diagonal and equal-valued. We restate the main theorem of the MVG mechanism for (\epsilon,\delta)-DP below.

Theorem 2 (The MVG Mechanism with (\epsilon,\delta)-DP (Chanyaswad et al., 2018)).

Let

\sigma(\Sigma^{-1})=[\sigma_{1}(\Sigma^{-1}),\ldots,\sigma_{n}(\Sigma^{-1})]^{\top},\qquad\sigma(\Psi^{-1})=[\sigma_{1}(\Psi^{-1}),\ldots,\sigma_{d}(\Psi^{-1})]^{\top}

be the vectors of (non-increasingly ordered) singular values of \Sigma^{-1} and \Psi^{-1}, respectively. The MVG mechanism using noise from the matrix Gaussian distribution \mathcal{MN}_{n,d}(0,\Sigma,\Psi) satisfies (\epsilon,\delta)-DP if

||\sigma(\Sigma^{-1})||_{2}\cdot||\sigma(\Psi^{-1})||_{2}\leq\frac{\left(-\beta+\sqrt{\beta^{2}+8\alpha\epsilon}\right)^{2}}{4\alpha^{2}},

where \alpha=[H_{r}+H_{r,1/2}]\gamma^{2}+2H_{r}\gamma S_{2}(f) and \beta=2(nd)^{1/4}H_{r}S_{2}(f)\zeta(\delta), with H_{r} (or H_{r,1/2}) being the generalized harmonic number of order r (of order 1/2), \gamma being \sup_{\mathcal{X}}||f(\mathcal{X})||_{F}, and \zeta(\delta)=2\sqrt{-nd\ln{\delta}}-2\ln{\delta}+nd.

To illustrate how the MVG mechanism works, we quote an example (Chanyaswad et al., 2018): performing regression using an identity query on a liver disorders dataset (McDermott and Forsyth, 2016) with 6 features and 248 samples (i.e., f\in\mathbb{R}^{248\times 6}). MVG treats ‘ALT’ and a teacher label as the two most indicative features based on some prior, and thus adds less noise to them (Chanyaswad et al., 2018). To report the best empirical results, it tries different precision budget (or noise variance) allocation strategies so that the total budget (Theorem 2) is not overspent. For example, it evenly allocates \tau>50\% (a tunable parameter) of the budget to the two important features and the rest to the other four. Compared to GM using i.i.d. Gaussian noise, MVG can improve the root mean square error (RMSE) by up to 0.003 at the same privacy level (Chanyaswad et al., 2018).

Sub-optimality of MVG. Theorem 2 presents an upper bound on the product of the L_{2}-norms of the two singular-value vectors \sigma(\Sigma^{-1}) and \sigma(\Psi^{-1}), assuming ||f(\mathcal{X})||_{F} is bounded for any \mathcal{X} by a constant \gamma. The upper bound monotonically decreases with \beta, which depends on nd and approaches 0 as nd\rightarrow\infty, making the sums of noise variances large. A similar situation arises in high privacy regimes \epsilon\rightarrow 0.

At least two slacks cause the sub-optimality. The first and foremost is due to a sufficient condition for (\epsilon,\delta)-DP (Dwork and Roth, 2014): \Pr[\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\geq\epsilon]\leq\delta, which is also used in the classical GM. With the Laurent-Massart theorem (Laurent and Massart, 2000), MVG further transforms it to \Pr[\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\leq\epsilon]=1 for a subset of all the possible outputs. The second lies in a loose matrix-trace-based privacy analysis; a follow-up (Yang et al., 2023) derives a tighter bound from Definition 1 and a matrix-norm inequality.

Input: Privacy parameters \epsilon,\delta
Output: \mathcal{B}
1: Let \delta_{0}=\Phi(0)-e^{\epsilon}\Phi(-\sqrt{2\epsilon});
2: if \delta\geq\delta_{0} then
3:     Re-write g^{+}_{\epsilon}(v)=\Phi(\sqrt{\epsilon v})-e^{\epsilon}\Phi(-\sqrt{\epsilon(v+2)});
4:     Compute v^{*}=\sup\{v\in\mathbb{R}_{\geq 0}:g^{+}_{\epsilon}(v)=\delta\};
5:     Let \alpha=\sqrt{1+v^{*}/2}-\sqrt{v^{*}/2};
6: else
7:     Re-write g^{-}_{\epsilon}(u)=\Phi(-\sqrt{\epsilon u})-e^{\epsilon}\Phi(-\sqrt{\epsilon(u+2)});
8:     Compute u^{*}=\inf\{u\in\mathbb{R}_{\geq 0}:g^{-}_{\epsilon}(u)=\delta\};
9:     Let \alpha=\sqrt{1+u^{*}/2}+\sqrt{u^{*}/2};
10: end if
11: Return \mathcal{B}=\sqrt{2\epsilon}/\alpha
Algorithm 1  A(\epsilon,\delta): Derive the Upper Bound \mathcal{B}

4.2. Analytic Matrix Gaussian Mechanism

To enhance MVG while still adding possibly non-i.i.d. noise Z\sim\mathcal{MN}_{n,d}(0,\Sigma,\Psi), we put forth the analytic matrix Gaussian mechanism (aMGM) by exploiting a necessary and sufficient condition for (\epsilon,\delta)-DP, which is formulated using two PLRVs by the analytic GM (aGM) (Balle and Wang, 2018). It is non-trivial since we now need to work with two covariance matrices \Sigma and \Psi instead of a single variance \sigma^{2} in aGM. [Footnote 7: A recent pre-print (Yang et al., 2021) also studied using the matrix Gaussian distribution. The proof of (Yang et al., 2021, Lemma 4), pivotal for our Theorem 6, is problematic. We prove it in Appendix C.]

Theorem 3 (Balle and Wang, 2018).

A mechanism \mathcal{M} is (\epsilon,\delta)-DP iff, for all \mathcal{X}\simeq\mathcal{X}^{\prime},

(3) \Pr[\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\geq\epsilon]-e^{\epsilon}\Pr[\mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}}\leq-\epsilon]\leq\delta.

It directly implies the sufficient condition due to \Pr[\mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}}\leq-\epsilon]\geq 0. We next show that \mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}} or \mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}} of aMGM is also Gaussian; a similar result has been proven for aGM (Balle and Wang, 2018, Lemma 3).

Lemma 4.

The PLRVs of our aMGM follow the distribution \mathcal{N}(\eta,2\eta) with \eta=\frac{||U^{-1}\Delta V^{-\top}||^{2}_{F}}{2}, where \Delta=f(\mathcal{X})-f(\mathcal{X}^{\prime}).

With Lemma 4, we can then specialize the left-hand side of Eq. (3). In particular, we use the Gaussian cumulative distribution function (CDF)

\Phi(t)=\Pr[\mathcal{N}(0,1)\leq t]=\frac{1}{\sqrt{2\pi}}\int^{t}_{-\infty}e^{-y^{2}/2}dy

to explicitly express the two probabilities (see Lemma 5) instead of approximating them by the tail bounds of a Gaussian distribution.

Lemma 5.

For any \mathcal{X}\simeq\mathcal{X}^{\prime}, let \Delta^{\prime}=U^{-1}\Delta V^{-\top} with \Delta=f(\mathcal{X})-f(\mathcal{X}^{\prime}). The following holds for any \epsilon\geq 0:

\Pr[\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\geq\epsilon]=\Phi\left(\frac{||\Delta^{\prime}||_{F}}{2}-\frac{\epsilon}{||\Delta^{\prime}||_{F}}\right),\qquad\Pr[\mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}}\leq-\epsilon]=\Phi\left(-\frac{||\Delta^{\prime}||_{F}}{2}-\frac{\epsilon}{||\Delta^{\prime}||_{F}}\right).

We can further re-write the left-hand side of Eq. (3) as g(||\Delta^{\prime}||_{F}):

(4) \Phi\left(\frac{||\Delta^{\prime}||_{F}}{2}-\frac{\epsilon}{||\Delta^{\prime}||_{F}}\right)-e^{\epsilon}\Phi\left(-\frac{||\Delta^{\prime}||_{F}}{2}-\frac{\epsilon}{||\Delta^{\prime}||_{F}}\right),

a function of \Delta and (\Sigma,\Psi); it is defined w.r.t. \Delta and \sigma^{2} for aGM (Balle and Wang, 2018). To satisfy Theorem 3, we require g(||\Delta^{\prime}||_{F})\leq\delta for all \mathcal{X}\simeq\mathcal{X}^{\prime}. Since g(\cdot) is monotonically increasing (Balle and Wang, 2018, Lemma 7), we first find the upper bound \mathcal{B} of ||\Delta^{\prime}||_{F} as the “solution” to g(||\Delta^{\prime}||_{F})=\delta and then determine U,V (hence \Sigma,\Psi) based on \mathcal{B} and \Delta with S_{2}(f)=\sup_{\mathcal{X}\simeq\mathcal{X}^{\prime}}||\Delta||_{F}.

4.2.1. Computing the upper bound \mathcal{B}

One could derive an analytic expression for \mathcal{B} using the tail bounds of \Phi(t), which is sub-optimal due to the slack in the tail bounds. Instead, we adapt a “numerical solver,” as detailed in Alg. 1, for \mathcal{B} since \Phi(t) can also be represented by (1+\mathsf{erf}(t/\sqrt{2}))/2, where \mathsf{erf} is the standard error function. [Footnote 8: Its efficient implementation to extremely high accuracy is supported in most statistical and numerical software packages, e.g., the Python math library.]

For the first term of Eq. (4), its input ||\Delta^{\prime}||_{F}/2-\epsilon/||\Delta^{\prime}||_{F} changes sign at ||\Delta^{\prime}||_{F}=\sqrt{2\epsilon}, while the other term’s input -||\Delta^{\prime}||_{F}/2-\epsilon/||\Delta^{\prime}||_{F} is always negative. Therefore, we only consider ||\Delta^{\prime}||_{F}=\sqrt{2\epsilon}/\alpha under the two cases 0<\alpha\leq 1 and \alpha>1 for a variable \alpha.

When \alpha=1, \delta_{0}=g(\sqrt{2\epsilon}) in line 1. If \delta\geq\delta_{0} (or 0<\alpha\leq 1), we can use v=(1/\alpha-\alpha)^{2}/2 to re-write g(\cdot) as g^{+}_{\epsilon}(v) (line 3). For \alpha>1, we can use u=(\alpha-1/\alpha)^{2}/2 to re-write g(\cdot) as g^{-}_{\epsilon}(u) (line 7). In either case, given the “oracle” computing \Phi(t) via \mathsf{erf}, we derive u^{*} or v^{*} using Newton’s method, recover \alpha, and return \mathcal{B}=\sqrt{2\epsilon}/\alpha.
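Below is one possible Python realization of Alg. 1; it computes \Phi via math.erf and replaces Newton's method with a bracketed root finder (scipy.optimize.brentq), which is an implementation choice of this sketch, and it assumes \epsilon>0 and 0<\delta<1.

from math import erf, exp, sqrt
from scipy.optimize import brentq

def Phi(t):
    """Standard normal CDF via the error function: (1 + erf(t / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def compute_B(eps, delta):
    """Derive the upper bound B on ||Delta'||_F such that g(B) = delta (Alg. 1)."""
    delta0 = Phi(0.0) - exp(eps) * Phi(-sqrt(2.0 * eps))
    if delta >= delta0:
        g = lambda v: Phi(sqrt(eps * v)) - exp(eps) * Phi(-sqrt(eps * (v + 2.0))) - delta
        hi = 1.0
        while g(hi) < 0.0:        # g^+ is increasing in v; expand until the root is bracketed
            hi *= 2.0
        v_star = brentq(g, 0.0, hi)
        alpha = sqrt(1.0 + v_star / 2.0) - sqrt(v_star / 2.0)
    else:
        g = lambda u: Phi(-sqrt(eps * u)) - exp(eps) * Phi(-sqrt(eps * (u + 2.0))) - delta
        hi = 1.0
        while g(hi) > 0.0:        # g^- is decreasing in u; expand until the root is bracketed
            hi *= 2.0
        u_star = brentq(g, 0.0, hi)
        alpha = sqrt(1.0 + u_star / 2.0) + sqrt(u_star / 2.0)
    return sqrt(2.0 * eps) / alpha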

4.2.2. Determining the covariance matrices \Sigma=UU^{\top} and \Psi=VV^{\top}

With Lemma 5, and letting \sigma_{i}(\cdot) be the i^{\text{th}} singular value, we have

(5) ||U^{-1}\Delta V^{-\top}||^{2}_{F}\leq\sum^{\min\{n,d\}}_{i=1}\sigma^{2}_{i}(U^{-1})\sigma^{2}_{i}(\Delta)\sigma^{2}_{i}(V^{-\top}).

Since \sigma_{i}(U^{-1})=1/\sigma_{n-i+1}(U) and \sigma_{i}(V^{-\top})=1/\sigma_{d-i+1}(V) with i\in[1,r], we transform the right-hand side of Eq. (5) to

\sum^{r}_{i=1}\frac{\sigma^{2}_{i}(\Delta)}{\sigma^{2}_{n-i+1}(U)\sigma^{2}_{d-i+1}(V)}\leq\frac{\sum^{r}_{i=1}\sigma^{2}_{i}(\Delta)}{\sigma^{2}_{n}(U)\sigma^{2}_{d}(V)}=\frac{||\Delta||^{2}_{F}}{\sigma^{2}_{n}(U)\sigma^{2}_{d}(V)},

where the inequality follows from Theorem 2, i.e., \sigma_{1}(\cdot)\geq\cdots\geq\sigma_{r}(\cdot), and the last equality follows directly from Lemma 4.

Given \mathcal{B}\geq||U^{-1}\Delta V^{-\top}||_{F}, it suffices to let ||\Delta||_{F}/(\sigma_{n}(U)\sigma_{d}(V))\leq\mathcal{B} with \Delta=f(\mathcal{X})-f(\mathcal{X}^{\prime}) for all \mathcal{X}\simeq\mathcal{X}^{\prime}. Recalling that S_{2}(f) is the upper bound on ||\Delta||_{F} for all \mathcal{X}\simeq\mathcal{X}^{\prime}, we now reach the main theorem.

Theorem 6.

Our aMGM satisfies (\epsilon,\delta)-DP iff

\frac{S_{2}(f)}{\mathcal{B}}\leq\sigma_{n}(U)\sigma_{d}(V),

where \mathcal{B}=A(\epsilon,\delta) as in Alg. 1, S_{2}(f) is the L_{2}-sensitivity, and \sigma_{n}(U) and \sigma_{d}(V) are respectively the smallest singular values of U and V.

Theorem 6 only constrains the lower bound on the product of \sigma_{n}(U) and \sigma_{d}(V), the two smallest singular values; it offers infinite choices for all the others, with the design space for (\Sigma,\Psi) even larger than that of MVG (Theorem 2). More importantly, the lower bound is independent of nd, which can lead to orders-of-magnitude variance reduction compared to MVG, confirmed by our experiments in Section 5. For \epsilon\rightarrow 0, we can still derive a valid \mathcal{B} from 2\Phi^{-1}((1+\delta)/2).

To determine \Sigma,\Psi, another implicit constraint is to keep the noise small for better utility. Let us first consider \Sigma=UU^{\top}. Since it is positive definite, we can also decompose it into W_{\Sigma}\Lambda_{\Sigma}W^{\top}_{\Sigma}; we then have U=W_{\Sigma}\Lambda^{1/2}_{\Sigma}, where \Lambda^{1/2}_{\Sigma}=\{\sigma_{i}(U)\}^{n}_{i=1} specifies the row-wise noise magnitudes. Assuming that the smallest overall noise will yield the best utility, we let all the singular values be the smallest: \sigma_{1}(U)=\cdots=\sigma_{n}(U). As W_{\Sigma} can be any unitary matrix, we simply use the standard basis, resulting in U=\sigma_{n}(U)\cdot I_{n} for an n\times n identity matrix I_{n} and hence the final \Sigma. Similarly, we can pick \Psi=VV^{\top} with V=\sigma_{d}(V)\cdot I_{d}, where I_{d} is a d\times d identity matrix.

4.2.3. Drawing the noise ZZ

With Σ\Sigma and Ψ\Psi, the last step is to draw ZZ. Pragmatically, we adopt the affine transformation below.

Lemma 0 ((Chanyaswad et al., 2018)).

Let Zn×dZ^{\prime}\in\mathbb{R}^{n\times d} be a random variable with each entry i.i.d. drawn from the standard normal distribution. The transformed Z=UZVZ=UZ^{\prime}V^{\top} follows 𝒩n,d(0,UU,VV)\mathcal{MN}_{n,d}(0,UU^{\top},VV^{\top}).

Hence, we can first sample ndnd i.i.d. values from 𝒩(0,1)\mathcal{N}(0,1) to form ZZ^{\prime} and then apply the transformation UZVUZ^{\prime}V^{\top}; with the scaled-identity UU and VV chosen above (making the bound in Theorem 6 tight), this yields

(6) ZS2(f)𝒩n,d(0,In,Id).\displaystyle Z\sim\frac{S_{2}(f)}{\mathcal{B}}\mathcal{MN}_{n,d}(0,I_{n},I_{d}).
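The sampling step takes only a few lines; the sketch below (ours, with illustrative names) covers both the degenerate i.i.d. case of Eq. (6) and the general affine transformation of the lemma above.

```python
import numpy as np

def draw_amgm_noise(n, d, S2, B, U=None, V=None, rng=None):
    # Z' holds nd i.i.d. samples from N(0, 1); Z = U Z' V^T then follows
    # MN_{n,d}(0, U U^T, V V^T) by the affine transformation in the lemma above.
    rng = np.random.default_rng() if rng is None else rng
    Z_prime = rng.standard_normal((n, d))
    if U is None and V is None:
        # Degenerate i.i.d. case of Eq. (6): U and V are scaled identities whose
        # smallest singular values multiply to S2 / B, making Theorem 6 tight.
        return (S2 / B) * Z_prime
    # General (possibly non-i.i.d.) case; U and V must already satisfy Theorem 6.
    return U @ Z_prime @ V.T
```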
Discussion.

When instantiating DP-Forward using aMGM, we set σ1(U)==σn(U)\sigma_{1}(U)=\cdots=\sigma_{n}(U) and σ1(V)==σd(V)\sigma_{1}(V)=\cdots=\sigma_{d}(V) such that the row- and column-wise noises are the smallest, and our pilot experiments show this yields optimal task utility; aMGM actually “degenerates” to aGM with i.i.d. noise. Nevertheless, aMGM also allows non-i.i.d. noise like MVG: By tuning the corresponding singular values larger, we can add more noise to the rows/columns that negatively impact the utility. It might be helpful when f()f(\cdot) (e.g., linear regression on a small liver dataset (Chanyaswad et al., 2018)) is simple or p()p(\cdot) does not “mix up” noisy rows/columns. In contrast to our empirical approach (like MVG), one could theoretically formulate the allocation of singular values as optimization problems that maximize different utility functions tailored to applications. It might outperform our uniform treatment but takes more dedicated efforts, which we leave as future work.

5. Experiments

5.1. Experimental Setup

Figure 2. SST-2 accuracy when tuning nn with noise added at different positions (ϵ=16\epsilon=16 for SeqLDP)
Figure 3. SST-2 accuracy with noise added to the outputs of different sub-layers in six encoders (ϵ=16\epsilon=16 for SeqLDP)

We use three typical datasets/tasks that are widely used in the NLP/DP literature (Yu et al., 2021; Li et al., 2022b; Yu et al., 2022; Yue et al., 2021) and the GLUE benchmark (Wang et al., 2019): i) Stanford sentiment treebank (SST-2) and ii) Internet movie database (IMDb) (Maas et al., 2011) for binary sentiment classification of single- and multi-sentence movie reviews, respectively, and iii) Quora question pairs (QQP) for testing the semantic equivalence of question pairs from Quora.com. Their test sets do not have any labels; we use the original dev sets as the test sets. Table 2 summarizes their characteristics. They all carry privacy risks; e.g., stylistic features of posts may leak the author’s identity. We use task accuracy (w.r.t. the ground-truth labels) as the utility metric.

Table 2. Statistics of the downstream task datasets
SST-2 (Wang et al., 2019) IMDb (Maas et al., 2011) QQP (Wang et al., 2019)
#train samples 67,349 25,000 363,846
#test samples 872 25,000 40,320

Baselines. We instantiate \mathcal{M} in DP-Forward by the classical GM, MVG (Chanyaswad et al., 2018), and aMGM. If not specified, all the results are based on aMGM. For MVG, we adopt its unimodal type, applicable to asymmetric functions like pre-noise layers f()f(\cdot). Specifically, we make the row-wise noise directional and assign the same precision budget to each row, assuming that tokens share the same importance.

By default, we report the accuracy of DP-Forward inferences on tasks fine-tuned using DP-Forward (with 2{\sim}2pp gains compared to the case of “DP-Forward fine-tuning + non-private inference”). We also realize DP-SGD fine-tuning with the latest Opacus (Yousefpour et al., 2021) but do not add any noise to its inference. Another baseline is non-private (in both fine-tuning and inference).

Implementation. We run experiments on a cluster with Tesla P100 GPUs. We implement all the mechanisms and baselines in Python. We use the raw BERT checkpoint bert-base-uncased (Face, 2023), available in the Huggingface transformers library, for fine-tuning (Section 3.3) or further pre-training over WikiCorpus (Section 3.6).

For the hyperparameters used throughout our experiments, we set the number of training epochs k=3k=3, learning rate η=2e5\eta=2\mathrm{e}{-5}, batch size b=32b=32, and normalization/clipping factor C=1C=1. We keep the others (e.g., no weight decay, no learning-rate decay) at their defaults, following the literature (Wang et al., 2019). The privacy parameter δ\delta is fixed at 1e51\mathrm{e}{-5} (Yu et al., 2022).

5.2. Configuring Matrix Dimensions

The sequence length nn is variable. While the hidden dimensionality dd is fixed at 768768 for BERT-Base, we can resort to two linear maps for “mediating” it (see Section 3.2). Since we normalize embedding matrices of size n×dn\times d to have a fixed norm CC, each entry’s signal magnitude depends on (n,d)(n,d). In contrast, the noise variance is the same given CC and fixed privacy parameters. The signal-to-noise ratio (SNR), which affects accuracy, can thus be configured via (n,d)(n,d).
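As a concrete illustration of this pre-noise normalization (a minimal sketch with our own names, not the paper's exact code), one can scale each embedding matrix to Frobenius norm C before adding noise:

```python
import torch

def normalize_to_C(H, C=1.0, eps=1e-12):
    # Scale an (n x d) or (batch x n x d) embedding matrix to Frobenius norm C,
    # so the pre-noise layers f(.) have a bounded output norm (and hence a
    # bounded L2-sensitivity) regardless of the input sequence.
    fro = torch.linalg.matrix_norm(H, keepdim=True)  # Frobenius norm over the last two dims
    return C * H / fro.clamp_min(eps)
```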

Figure 2 shows the evaluation accuracy of SST-2 fine-tuned using DP-Forward with nn varied from 1616 to 256256. We study adding aMGM noise to the outputs of five hidden layers. The results indicate that the best accuracy is often achieved at n=64n=64 or 128128, so we opt for n=128n=128 (which is sufficient for most sequences) in subsequent experiments.

We fine-tune SST-2 on noisy output embeddings under different choices of ϵ\epsilon and reduced dd. Table 3 summarizes the results. Reducing dd leads to larger SNRs (under fixed CC and nn) but may also lose useful information, degrading accuracy. For the same ϵ\epsilon, most accuracy variations are within 22pp under different choices of dd. Balancing everything, we use the raw d=768d=768 in later experiments so that no extra changes (including the two linear maps) are made to the pipelines.

Table 3. SST-2 accuracy with different dd and ϵ\epsilon for SeqLDP
d \ ϵ 2 4 8 12 16
16 0.6801 0.7851 0.8833 \mathbf{0.9232} 0.9266
64 0.6766 0.7752 0.8727 0.9037 0.9209
128 0.6732 \mathbf{0.7856} 0.8807 0.9128 0.9232
256 \mathbf{0.6835} 0.7695 \mathbf{0.8965} 0.9186 \mathbf{0.9249}
512 0.6411 0.7626 0.8831 0.9128 0.9243
768 0.6686 0.7741 0.8739 0.9128 0.9186
Table 4. Accuracy on output embeddings under SeqLDP
Task Mech. ϵ=0.5 ϵ=1 ϵ=2 ϵ=4 ϵ=8
SST-2 GM 0.5424 0.5757 0.6537 0.7466 0.8624
aMGM \mathbf{0.5516} \mathbf{0.6021} \mathbf{0.6686} \mathbf{0.7741} \mathbf{0.8739}
MVG ∼0.5
IMDb GM 0.5244 0.5498 0.6016 0.6902 0.8002
aMGM \mathbf{0.5353} \mathbf{0.5676} \mathbf{0.6224} \mathbf{0.7093} \mathbf{0.8109}
MVG ∼0.5
QQP GM 0.6304 0.6321 0.6571 0.7458 0.8638
aMGM \mathbf{0.6312} \mathbf{0.6348} \mathbf{0.6747} \mathbf{0.7685} \mathbf{0.8653}
MVG ∼0.5
Table 5. Accuracy of task models fine-tuned using DP-Forward and DP-SGD under (example-level) CDP
Method Noise position SST2 IMDb QQP
ϵ≈1 ϵ≈3 ϵ≈8 ϵ≈1 ϵ≈3 ϵ≈8 ϵ≈1 ϵ≈3 ϵ≈8
DP-Forward Embedding 0.6055 0.6146 0.6278 0.5 0.5 0.5 0.6534 0.6589 0.6594
Encoder 1 0.7971 0.8096 0.8073 0.5000 0.5016 0.5022 0.7857 0.7885 0.7906
Encoder 3 0.8096 0.8394 0.8463 0.7525 0.7545 0.7549 \mathbf{0.8513} \mathbf{0.8585} \mathbf{0.8607}
Encoder 5 0.8544 0.8658 0.8716 0.7642 0.7709 0.7719 \mathbf{0.8698} \mathbf{0.8765} \mathbf{0.8806}
Encoder 7 0.8624 \mathbf{0.8819} \mathbf{0.8872} 0.7765 \mathbf{0.7883} \mathbf{0.7924} \mathbf{0.8840} \mathbf{0.8887} \mathbf{0.8926}
Encoder 9 \mathbf{0.8945} \mathbf{0.8979} \mathbf{0.9002} \mathbf{0.7995} \mathbf{0.8105} \mathbf{0.8181} \mathbf{0.8895} \mathbf{0.8941} \mathbf{0.8955}
Encoder 11 \mathbf{0.8819} \mathbf{0.8968} \mathbf{0.8985} \mathbf{0.8042} \mathbf{0.8187} \mathbf{0.8265} \mathbf{0.8952} \mathbf{0.8997} \mathbf{0.9007}
Output \mathbf{0.8865} \mathbf{0.9009} \mathbf{0.9055} \mathbf{0.8096} \mathbf{0.8160} \mathbf{0.8270} \mathbf{0.8987} \mathbf{0.8994} \mathbf{0.9038}
DP-SGD Gradient 0.8650 0.8713 0.8759 0.7779 0.7826 0.7903 0.8219 0.8345 0.8433
Non-private baseline 0.9178 0.8378 0.9019

5.3. Fine-tuning with Sequence LDP

Our approach also supports perturbing sub-layer outputs during fine-tuning. We study six encoders as an example, with the results shown in Figure 3. Overall, DP-Forward performs better with deeper encoders since fewer parameters are directly affected by noise during fine-tuning. Another observation is that perturbing different sub-layer outputs, even inside the same encoder, may result in huge accuracy variation; e.g., using noisy outputs of the last sub-layer in Encoder 11 can bring 20{\sim}20pp gains over those of the first sub-layer.
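To make such an instantiation concrete, below is a minimal sketch (ours; the helper name, the in-hook normalization, and the example arguments are illustrative, not the paper's exact code) that perturbs a chosen encoder's outputs of a Huggingface BertForSequenceClassification model via a forward hook, using the i.i.d. noise of Eq. (6).

```python
import torch

def make_dp_forward_hook(C, S2, B):
    # Normalize the layer's output hidden states to Frobenius norm C per example,
    # then add i.i.d. Gaussian noise with per-entry std S2 / B as in Eq. (6).
    def hook(module, inputs, outputs):
        hidden = outputs[0]                                    # (batch, n, d)
        fro = torch.linalg.matrix_norm(hidden, keepdim=True)
        hidden = C * hidden / fro.clamp_min(1e-12)
        noisy = hidden + (S2 / B) * torch.randn_like(hidden)
        return (noisy,) + tuple(outputs[1:])
    return hook

# e.g., to perturb the outputs of Encoder 9 of a BertForSequenceClassification `model`:
# handle = model.bert.encoder.layer[9].register_forward_hook(make_dp_forward_hook(C=1.0, S2=2.0, B=...))
```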

We next evaluate the privacy-accuracy tradeoffs under different ϵ\epsilon and compare the instantiations using the classical GM, MVG (Chanyaswad et al., 2018), and aMGM. Note that we still compute the GM variance as σ2=2ln(1.25/δ)S22(f)/ϵ2\sigma^{2}=2\ln(1.25/\delta)S^{2}_{2}(f)/\epsilon^{2} for empirical evaluation, although this formula does not extend to ϵ>1\epsilon>1 for a single run with theoretical DP guarantees.

For the GM- and aMGM-based instantiations, Table 4 shows that the accuracy on all three tasks increases with ϵ\epsilon. Ours is more accurate than the GM-based one for all choices of ϵ\epsilon due to the smaller noise produced by aMGM. Although the noise variance gap (between GM and aMGM) widens as ϵ\epsilon decreases, one cannot fine-tune effective models in the high privacy regime ϵ<1\epsilon<1. The MVG-based one behaves like random guessing for all three tasks since its noise variance is proportional to ndn\cdot d, which is much larger than that of the classical GM in high-dimensional settings (see Section 4.1). For instance, under the same parameter setting (e.g., n=128,d=768n=128,d=768, and ϵ=8\epsilon=8), MVG produces noise with a variance orders of magnitude larger than aMGM’s (e.g., >108{>}10^{8} vs. 0.6{\sim}0.6), even assuming supf()F=1\sup||f(\cdot)||_{F}=1.

We remark that the local ϵ\epsilon values we use are not large. Most classical LDP works that would deem such ϵ\epsilon a low privacy regime target statistical analytics. In great contrast, we aim at fine-tuning large LM-based pipelines with high-dimensional signals and limited training data, which is much more complicated. Many prior works (Feyisetan et al., 2020; Qu et al., 2021; Yue et al., 2021; Feyisetan et al., 2019) use a larger ϵ\epsilon to ensure even a weaker token-level LDP variant, while others (Meehan et al., 2022) categorize ϵ<10\epsilon<10 and 10ϵ<2010\leq\epsilon<20 as strong and moderate privacy, respectively, for sequence-level LDP like ours. (Such choices can be “reduced” to smaller ones under the shuffling model (Section 5.4); cf. the U.S. census, which discloses demographic data at central ϵ=11.14\epsilon=11.14 (Abowd et al., 2022).) More importantly, our choices provide effective protection against various privacy threats, as detailed in Section 6.

5.4. DP-Forward versus DP-SGD

Fairness of comparisons on privacy-accuracy tradeoffs. As elaborated in Section 3.4, we can adopt RR (Warner, 1965) to perturb the labels and then report central ϵ\epsilon values for DP-Forward, amplified by shuffling, using the following parameters to ensure that comparisons are fair under (example-level) CDP. For DP-SGD, the subsampling probability is b/Nb/N, with b=32b=32 and the dataset size NN; the number of fine-tuning steps is T=kN/bT=k\cdot N/b with k=3k=3. For DP-Forward, the subsampling and non-flipping probabilities are respectively 1/N1/N (with T=kNT=k\cdot N) and 0.90.9; we still process bb noisy embeddings as a batch. For both, we use aGM (Balle and Wang, 2018), the degenerate version of aMGM (Section 4.2), and the same accountant (Gopi et al., 2021) to report approximated overall ϵ\epsilon values. (They are dominated by composing the subsampled Gaussian; e.g., composing subsampled RR only consumes 0.030.03 for SST-2, which is even overestimated by AutoDP.)
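For reference, a minimal sketch (ours) of the label perturbation via RR with the non-flipping probability of 0.9 used above:

```python
import random

def randomized_response(label, num_classes=2, keep_prob=0.9, rng=random):
    # Warner's RR: keep the true label w.p. keep_prob; otherwise output a
    # uniformly random *other* class. This offers label LDP with
    # eps_RR = ln(keep_prob * (num_classes - 1) / (1 - keep_prob)).
    if rng.random() < keep_prob:
        return label
    return rng.choice([c for c in range(num_classes) if c != label])
```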

We study eight instances of DP-Forward, including perturbing the outputs of the input embedding layer, six different encoders, and BERT. Their accuracies on all three tasks under three privacy levels, plus those of DP-SGD and the non-private baseline, are shown in Table 5. About half or more of our instances have better accuracy than DP-SGD for each task; the largest accuracy gain is 7.7{\sim}7.7pp for QQP. The noisy output embeddings often lead to the best accuracy for all tasks, even comparable to the non-private baseline, due to the dimension reduction at the last encoder output (Section 2.1).

Recent DP-SGD variants (Yu et al., 2021, 2022) improve DP-SGD (Abadi et al., 2016) by perturbing partial gradient entries using additional tricks (e.g., low-rank adaptation). They report best accuracies of 92.5%92.5\% and 85.7%85.7\% on SST-2 and QQP, respectively, with 2.32.3pp and 6.26.2pp drops from the non-private baselines at central ϵ=6.7\epsilon=6.7 (Yu et al., 2022, Table 44). DP-Forward with label privacy, incurring <1.7{<}1.7pp accuracy drops on the two tasks at ϵ3\epsilon\approx 3, can still beat them, even though their fine-tuning is based on RoBERTa-base, a robustly optimized BERT approach that by itself outperforms BERT due to a larger training set, longer training time, and better techniques (e.g., dynamic masking in MLM).

Figure 4 shows the efficiency comparisons for fine-tuning SST-2. The time and storage overheads of our approach (for all possible instances) are almost the same as the non-private baseline’s and 3×{\sim}3\times smaller than DP-SGD’s. This is because we allow batch processing as in normal fine-tuning; there is no need to handle per-example gradients. Meanwhile, our normalization and noise sampling/addition are also faster since embeddings are smaller than gradients.

5.5. Noisy Pre-training

Pre-training BERT using DP-Forward, aligned with the noisy fine-tuning, does help accuracy. We use SST-2 as an example and perturb the input embedding matrices. We continue pre-training BERT over English WikiCorpus, the 20062006 dump with about 600600M words, for an epoch. Table 6 shows that we can obtain 121{-}2pp accuracy gains for most choices of ϵ\epsilon, compared to fine-tuning on the original BERT.

Efficiency-wise, DP-Forward pre-training also consumes much fewer resources; e.g., an existing work (Anil et al., 2022) pre-trains BERT-Large (with 340340 million parameters) using DP-SGD on Google TPUs, which requires sufficient memory for handling batch sizes of millions.

Figure 4. Efficiency comparison for the case of SST-2
Table 6. SST-2 accuracy gain with pre-training under SeqLDP
ϵ Raw BERT Noisy BERT Δacc
2 0.5501 0.5665 0.0164
4 0.5950 0.5999 0.0049
8 0.6550 0.6708 0.0158
16 0.7345 0.7450 0.0105

6. Defense against Privacy Threats

Following the recent taxonomy (Song and Raghunathan, 2020), we study MIAs and two new threats that leak sequences from their embeddings: embedding inversion and attribute inference. We moderately adapt them to suit our context, e.g., upgrading MIAs (Song and Raghunathan, 2020) to the sequence level.

6.1. Threat Models

For MIAs, we follow prior arts (Shokri et al., 2017; Yeom et al., 2018; Song and Raghunathan, 2020) to consider an adversary with only black-box access to an entire (DP-SGD/DP-Forward-trained) pipeline: It can query the prediction results (e.g., each-class probability) of target sequences but cannot access the pipeline weights and architecture; the hidden embeddings are not revealed.

Despite “different” objectives of inverting or inferring (Section 6.3 or 6.4) from embeddings, we consider both threats involving a general adversary with black-box access to the trained pipeline part f()f(). It can get the inference-time (clear/noisy) embeddings of target sequences (Song and Raghunathan, 2020). Besides public prior knowledge, it can collect a limited auxiliary dataset 𝒳aux\mathcal{X}_{\mathrm{aux}}, sharing similar attributes to the targets (Song and Raghunathan, 2020).

DP-SGD only offers CDP for training data and does not protect inference-time inputs. (One might add the same noise to them as in DP-Forward inference, which indeed mitigates the new threats; however, perturbing gradients in training inherently “mismatches” perturbing embeddings in inference, deteriorating task performance significantly, e.g., SST-2 accuracy is reduced to 0.77860.7786, a 10{\sim}10pp drop, at central ϵ8\epsilon\approx 8.) What follows empirically confirms a major merit of DP-Forward: protection against stronger adversaries and against threats to both training- and inference-time inputs.

6.2. Membership Inference Attacks

Attack Objective. MIAs predict whether a data point is in the training set (Shokri et al., 2017). They often exploit the disparity in model behavior between training data and unseen data, i.e., poor model generalization due to overfitting (Yeom et al., 2018). Inferring membership at the token/word level, e.g., a sliding window of tokens (Song and Raghunathan, 2020), is not interesting. We consider more realistic MIAs on entire sequences, which can be extended for more devastating attacks, such as extracting verbatim pre-training sequences via black-box access to GPT-2 (Carlini et al., 2021).

Prior arts (Yeom et al., 2018; Song and Mittal, 2021) suggest that threshold-based MIAs using only prediction confidence (Yeom et al., 2018) or entropy (Song and Mittal, 2021) with proper assumptions are comparable to the more sophisticated one (Shokri et al., 2017) based on shadow training. Adapting the confidence-based MIA to our context exploits that a pipeline is fine-tuned by minimizing its prediction loss: The confidence/probability of predicting a training sequence as its true label should be close to 11. The adversary can then infer a candidate sequence XX^{*} as a member when the confidence for the predicted label ll output by pipeline \mathcal{F} is larger than a pre-set threshold τ\tau:

𝟙{Pr[(X)=l]τ},\mathds{1}\{\Pr[\mathcal{F}(X^{*})=l]\geq\tau\},

where 𝟙{}\mathds{1}\{\cdot\} is the indicator function. We simply use a fixed τ\tau for all possible labels in our evaluation, although it can be label-dependent.

The second MIA we use is based on the observation that the prediction output (i.e., a vector of probabilities) of a training sequence tends to be a one-hot vector, i.e., its entropy should be close to 0. Similarly, the adversary can infer XX^{*} as a member when its prediction entropy falls below a preset threshold τ\tau; otherwise, it is not deemed a member:

𝟙{iPr[(X)=li]log(Pr[(X)=li])τ},\mathds{1}\{-\sum_{i}\Pr[\mathcal{F}(X^{*})=l_{i}]\log(\Pr[\mathcal{F}(X^{*})=l_{i}])\leq\tau\},

for all possible labels {li}\{l_{i}\}. Note that a totally wrong prediction with probability 1{\sim}1 also leads to entropy approaching 0. We can address it by encoding the information of the ground-truth label of XX^{*} (Song and Mittal, 2021).
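Both decision rules translate directly into a few lines of code; the sketch below (ours) assumes the adversary observes the pipeline's predicted probability vector for the candidate sequence.

```python
import numpy as np

def confidence_mia(probs, tau):
    # Member iff the confidence of the predicted label reaches the threshold.
    return probs.max() >= tau

def entropy_mia(probs, tau, eps=1e-12):
    # Member iff the prediction entropy falls below the threshold.
    entropy = -np.sum(probs * np.log(probs + eps))
    return entropy <= tau
```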

Numerical Results. As in (Yu et al., 2021), all the test examples and a random subset of the training examples (as many as the test ones) are evenly split into two subsets (each has half of the training/test examples), one for finding the optimal τ\tau, and the other for reporting the attack success rates. Given that the training and test examples likely share the same distribution, we randomly drop/replace tokens in the test examples to enlarge the prediction difference to make MIAs easier.

We evaluate the adapted confidence- and entropy-based MIAs on SST-2 fine-tuned by the non-private baseline, DP-Forward, and DP-SGD. For DP-Forward, we investigate five instances, perturbing the input embeddings, three encoders’ outputs, and the output embeddings. Table 7 presents the results, where success rates within 0.49–0.51 are shown in bold. Both DP-Forward and DP-SGD can mitigate MIAs effectively. For all choices of ϵ\epsilon, the two MIAs’ success rates on DP-Forward are reduced to 0.5{\sim}0.5 (like random guessing) for deeper layers, outperforming DP-SGD by >6{>}6pp at the same privacy level.

Table 7. Success rates of two MIAs with (translated) central ϵ\epsilon
ϵ Method Attack Success Rate
Entropy Confidence
∞ Non-private baseline 0.659 0.645
1 DP-SGD 0.567 0.561
DP-Forward (Embedding) 0.586 0.576
DP-Forward (Encoder 1) 0.535 0.537
DP-Forward (Encoder 7) \mathbf{0.494} \mathbf{0.506}
DP-Forward (Encoder 11) \mathbf{0.506} \mathbf{0.494}
DP-Forward (Output) \mathbf{0.508} \mathbf{0.502}
3 DP-SGD 0.584 0.576
DP-Forward (Embedding) 0.584 0.576
DP-Forward (Encoder 1) 0.543 0.530
DP-Forward (Encoder 7) \mathbf{0.510} \mathbf{0.507}
DP-Forward (Encoder 11) 0.512 \mathbf{0.500}
DP-Forward (Output) \mathbf{0.503} \mathbf{0.499}
8 DP-SGD 0.580 0.580
DP-Forward (Embedding) 0.597 0.576
DP-Forward (Encoder 1) \mathbf{0.510} 0.536
DP-Forward (Encoder 7) \mathbf{0.510} \mathbf{0.504}
DP-Forward (Encoder 11) 0.520 \mathbf{0.513}
DP-Forward (Output) \mathbf{0.506} \mathbf{0.490}

6.3. Embedding Inversion Attacks

Attack Objective. These attacks aim at recovering the raw text as (unordered) tokens {xi}i[n]X\{x_{i}\}_{i\in[n]}\subseteq X from embeddings, highlighting the risk of directly sharing even just text embeddings (without noise) for training/inference. They have been employed to reconstruct specific patterns, e.g., identity codes and gene segments (Pan et al., 2020).

We first propose a simple token-wise inversion attack to invert (noisy) token embeddings output by the input embedding layer ϕ()\phi(\cdot) that maps every token in 𝒱\mathcal{V} to d\mathbb{R}^{d} (Wu et al., 2016). It can be formulated as:

i[n]:minxi𝒱ϕ(xi)(ϕ(xi)+zi)2,\forall i\in[n]:\min_{x_{i}^{*}\in\mathcal{V}}||\phi(x^{*}_{i})-(\phi(x_{i})+z_{i})||_{2},

where ziz_{i} is the ithi^{\text{th}} row of noise ZZ from \mathcal{M} (omitted for DP-SGD or the non-private baseline). It returns xix^{*}_{i} with its embedding closest to the observed one of xix_{i} via a nearest-neighbor search over 𝒱\mathcal{V}.
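A minimal sketch of this token-wise attack (ours), assuming access to the public input embedding table (e.g., model.get_input_embeddings().weight of BERT):

```python
import torch

@torch.no_grad()
def invert_input_embeddings(noisy_embs, embedding_table):
    # noisy_embs: (n, d) observed (noisy) token embeddings phi(x_i) + z_i;
    # embedding_table: (|V|, d), e.g., model.get_input_embeddings().weight.
    # For each position, return the vocabulary id whose embedding is closest in L2.
    dists = torch.cdist(noisy_embs, embedding_table)  # (n, |V|) pairwise distances
    return dists.argmin(dim=1)
```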

A token’s hidden embedding from deeper layers encodes more “abstract” contextual information of the entire sequence it belongs to; the token-wise inversion may be less accurate. We thus require a more general attack (Song and Raghunathan, 2020). It first maps the observed (noisy) embedding back to a lower-layer one using a linear least square model MM and then selects nn tokens as XX^{*} to minimize the L2L_{2}-distance between the lower-layer representation of XX^{*} and the one from MM:

minX𝒱nζ(X)M(f(X)+Z)2,\min_{X^{*}\in\mathcal{V}^{n}}||\zeta(X^{*})-M\left(f(X)+Z\right)||_{2},

where ζ()\zeta(\cdot) is a representation function at a lower layer than f()f(\cdot).

The above minimization is over |𝒱|n|\mathcal{V}|^{n}, larger than the token-wise candidate space. To determine XX^{*}, we first relax the token selection at each position i[n]i\in[n] using a continuous vector in |𝒱|\mathbb{R}^{|\mathcal{V}|}, which is then input (with another temperature parameter) to a softmax function to model the probabilities of selecting each token in 𝒱\mathcal{V}. We further derive the token embedding xix_{i}^{*} by multiplying the relaxed vector (with each entry as a weight) and the original embedding matrix. Finally, we solve it by a gradient-based method (Song and Raghunathan, 2020).
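A condensed sketch of this continuous relaxation (ours, omitting the mapping model M and other details of Song and Raghunathan (2020)); the hyperparameters are illustrative.

```python
import torch

def gradient_inversion(target, zeta, embedding_table, n, steps=500, lr=0.1, temp=0.1):
    # target: the lower-layer representation M(f(X) + Z) to match;
    # zeta: the lower-layer function applied to "soft" token embeddings;
    # embedding_table: (|V|, d) input embedding matrix, treated as a constant.
    table = embedding_table.detach()
    logits = torch.zeros(n, table.size(0), requires_grad=True)  # relaxed selection per position
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        weights = torch.softmax(logits / temp, dim=-1)  # probabilities over the vocabulary
        soft_embs = weights @ table                     # (n, d) weighted token embeddings
        loss = (zeta(soft_embs) - target).pow(2).sum()  # L2 objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return logits.argmax(dim=-1)                        # recovered token ids
```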

Numerical Results. The gradient-based attack reports the highest recall (or precision) on inverting the lowest-layer (clear) embeddings (Song and Raghunathan, 2020, Figure 2). To show that DP-Forward can mitigate such “strongest” inversion, we implement both (nearest-neighbor and gradient-based) attacks to invert input embeddings, with the public BERT embedding lookup table as prior. We also report their success rates as recall – the ratios of correct recoveries over the raw targets.

Table 8 shows that DP-Forward can reduce their success rates to a relatively low level; most are within 0.20.2. In contrast, DP-SGD fails to defend. The results corroborate our claim: DP-Forward directly adds noise to embeddings, thus mitigating embedding inversion, whereas DP-SGD only perturbs gradients, offering no protection for the (clear) inference-time embeddings of test sequences.

Table 8. Success rates of two inversion attacks on (the lowest-layer) input embeddings with (translated) central ϵ8\epsilon\approx 8
Nearest Neighbor Gradient-based
SST-2 IMDb QQP SST-2 IMDb QQP
Non-private 1 1 1 1 1 1
DP-SGD 1 1 1 .9991 .9982 1
DP-Forward \mathbf{.1811} \mathbf{.1420} \mathbf{.2457} \mathbf{.1622} \mathbf{.1241} \mathbf{.2226}

6.4. Sensitive Attribute Inference Attacks

Attack Objective. Instead of recovering exact tokens, one can try to infer sensitive attributes about target sequences from their embeddings. The attributes are often statistically unrelated to the training/inference objective but inherent in the sequences, e.g., stylometry implying the text’s authorship in sentiment analysis (Shetty et al., 2018). We are not interested in any global property of an entire corpus (Ganju et al., 2018).

Table 9. Success rates of a (neural-network-based) sensitive attribute inference attack with (translated) central ϵ8\epsilon\approx 8
action comedy drama horror Overall
Non-private 0.727 0.858 0.516 0.439 0.687
DP-SGD 0.664 0.733 0.253 0.324 0.536
DP-Forward (Embedding) 0.998 \mathbf{0} \mathbf{0} \mathbf{0.009} \mathbf{0.276}
DP-Forward (Encoder 1) 1.0 \mathbf{0} \mathbf{0} \mathbf{0} \mathbf{0.276}
DP-Forward (Encoder 7) 1.0 \mathbf{0} \mathbf{0} \mathbf{0} \mathbf{0.276}
DP-Forward (Encoder 11) 1.0 \mathbf{0} \mathbf{0} \mathbf{0} \mathbf{0.276}
DP-Forward (Output) 0.998 \mathbf{0} \mathbf{0} \mathbf{0.009} \mathbf{0.276}

We assume 𝒳aux\mathcal{X}_{\mathrm{aux}} has sequences labeled with sensitive attributes. The adversary can query f()f(\cdot) for the noisy (or clear) embeddings:

𝒳aux={(f(Xi)+Zi,si)},si𝒮,\mathcal{X}_{\mathrm{aux}}=\{\left(f(X_{i})+Z_{i},s_{i}\right)\},\ \forall s_{i}\in\mathcal{S},

where 𝒮\mathcal{S} is the set of all possible sensitive attributes of interest, say, authorship. It does not care about non-sensitive attributes.

To infer sensitive attributes, the adversary first trains a classifier on 𝒳aux\mathcal{X}_{\mathrm{aux}} via supervised learning and then uses it for an observed noisy (or clear) embedding f(X)+Zf(X^{*})+Z to predict s𝒮s^{*}\in\mathcal{S} of XX^{*}.
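A minimal sketch of such an attack classifier (ours; the depth and layer widths are illustrative), trained with cross-entropy on the auxiliary set and then applied to the observed embeddings:

```python
import torch.nn as nn

num_attrs = 4     # |S|, e.g., the four film genres evaluated in Table 9
emb_dim = 768     # dimensionality of the observed (noisy) embeddings

attack_classifier = nn.Sequential(
    nn.Linear(emb_dim, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, num_attrs),  # linear head over sensitive attributes
)
# Trained with cross-entropy on {(f(X_i) + Z_i, s_i)}, then applied to f(X*) + Z.
```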

Numerical Results. We train a three-layer neural network with a linear head as the classifier to infer the film genre (e.g., ‘horror’) as a sensitive attribute from a movie review using its output embedding. We employ IMDb with 2020k examples (90%90\% for training and 10%10\% for validation) as 𝒳aux\mathcal{X}_{\mathrm{aux}}, and SST-2 contributes 3.33.3k examples for testing the classifier. The attack success rates are measured using recall.

We investigate five DP-Forward instances. Table 9 shows that they “reduce” the classifier to majority-class prediction, which returns the majority class (‘action’) on all inputs. In contrast, DP-SGD only reduces the success rates moderately compared to the non-private baseline. This is because the embeddings from DP-SGD-trained/noisy models still “lose” some useful information (cf. the accuracy drops of DP-SGD inference on embeddings without noise). The results confirm that DP-Forward is more effective in thwarting attribute inference.

7. Related Work

7.1. Privacy Threats on LMs and Embeddings

An active line of research (Pan et al., 2020; Song and Raghunathan, 2020; Béguelin et al., 2020; Carlini et al., 2021) discloses severe privacy risks in modern LMs (even when used as black-box query “oracles”) concerning their (hidden/output) text embeddings. Song and Raghunathan (Song and Raghunathan, 2020) build a taxonomy of attacks that covers a broader scope than a parallel work (Pan et al., 2020). These attacks include embedding inversion (which can partially recover raw texts), membership inference (establishing the is-in relation between a target and private training data), and inferring sensitive attributes like text authorship from embeddings. A common defense against them is adversarial training, e.g., (Elazar and Goldberg, 2018).

Others (Béguelin et al., 2020; Carlini et al., 2021) study the “memorization” of training data in LMs (a.k.a. membership inference attack). In particular, Carlini et al. (Carlini et al., 2021) define kk-eidetic memorization, where a string is extractable or memorized if it appears in at most kk examples. Their black-box attacks on GPT-2 (Radford et al., 2018) can extract verbatim training texts even when k=1k=1 (e.g., a name that only appears once is still extractable). A smaller kk means a higher privacy risk. Beguelin et al. (Béguelin et al., 2020) define differential score and rank as two new metrics for analyzing the update leakage, enabling the recovery of new text used to update LMs. Incorporating DP to address memorization is a promising solution.

7.2. Input (Text/Feature) Perturbation for LDP

SynTF (Weggenmann and Kerschbaum, 2018) synthesizes term-frequency (feature) vectors under LDP, which have limited applications compared to sentence embeddings or text itself. Feyisetan et al. (Feyisetan et al., 2019, 2020) resort to metric-LDP (Alvim et al., 2018), a relaxed variant of LDP with a distance metric (e.g., Euclidean or Hyperbolic), which allows the indistinguishability of outputs to grow proportionally to the inputs’ distance. They first add noise to the outputs of a non-contextualized token embedding model (e.g., GLoVe (Pennington et al., 2014)), which are then projected back to “sanitized” text using the nearest neighbor search as post-processing. In contrast, Yue et al. (Yue et al., 2021) sanitize text by directly sampling token-wise replacements, avoiding adding noise to high-dimensional embeddings. All these works only achieve (variants of) token-level metric-LDP.

To offer sequence-level protection, recent studies (Lyu et al., 2020; Meehan et al., 2022) apply Laplace or exponential mechanism to perturb (the average of) sentence embeddings extracted by an LM (e.g., BERT (Devlin et al., 2019)). Both ensure pure LDP (homogeneously protecting any entire sequence), which may be too stringent and impact utility. In contrast, heterogeneous protection (Feyisetan et al., 2020; Yue et al., 2021) can strategically manage the privacy demands across inputs. Du et al. (Du et al., 2023) achieve metric-LDP (by Purkayastha and planar Laplace mechanisms) at the sequence level (unlike token-level in prior arts (Feyisetan et al., 2020; Yue et al., 2021)). To further boost the utility, they mitigate the dimensional curse via a random-projection-like approach. They also perturb sensitive sequence labels for enhanced privacy. Nevertheless, perturbing different hidden (rather than token or sentence) embeddings inside LM-based NLP pipelines remains unexplored.

7.3. DP-SGD (Variants) in Training LMs

An early attempt (McMahan et al., 2018) uses DP-SGD to train long short-term memory LMs in the federated learning setting. By configuring hyperparameters properly (e.g., setting the batch size to millions), one can even pre-train BERT-Large, an LM with 340{\sim}340M parameters, using DP-SGD/Adam while achieving acceptable (MLM) accuracy (Anil et al., 2022).

Using the vanilla DP-SGD in pre-training/fine-tuning large LMs leads to significant efficiency and accuracy drops due to the “curse of dimensionality.” Yu et al. (Yu et al., 2021) propose reparametrized gradient perturbation: It first reparameterizes/decomposes each high-rank weight matrix into two low-rank (gradient-carrier) ones with a residual matrix and then only perturbs the two low-rank gradients to alleviate the dimensional curse. The noisy low-rank gradients are finally projected back to update the raw high-rank weights.

Applying reparameterization to every weight in each update is still costly and may introduce instability (e.g., noises are “zoomed up” during the projection). Instead, the follow-up (Yu et al., 2022) builds atop the recent success of parameter-efficient fine-tuning (e.g., LoRA (Hu et al., 2022), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021)): It perturbs the gradients of a much smaller number of additional “plug-in” parameters. However, Li et al. (Li et al., 2022b) empirically show that parameter-efficient fine-tuning is not necessarily better than the full one; they propose ghost clipping, a memory-saving technique (“orthogonal” to dimension reduction), to use DP-SGD in full fine-tuning without instantiating per-example gradients. Despite efficiency/accuracy gains, all these works still only protect training data by perturbing (smaller) gradients.

7.4. DP Mechanisms for Matrix Functions

Gaussian and Laplace mechanisms are typically for scalar-/vector-valued functions (Dwork and Roth, 2014). Vectorizing the outputs and adding i.i.d. noise could generalize them for matrix-valued functions, but the structural information of matrix functions is not exploited. The MVG mechanism (Chanyaswad et al., 2018) is thus devised, which draws directional or non-i.i.d. noise from a matrix Gaussian distribution. It injects less noise into more “informative” output directions for better utility, with only a constraint on the sum of the singular values (determining the noise magnitude) of two covariance matrices. Such a constraint is only a sufficient condition for (ϵ,δ)(\epsilon,\delta)-DP, which is improved by the follow-up (Yang et al., 2023) with a tighter bound on the singular values.

There also exist mechanisms dedicated to restricted matrix-valued functions. The matrix mechanism (Li et al., 2015) considers a collection of linear counting queries represented by WxWx for query matrix WW and input vector xx. It still resorts to additive Laplace/Gaussian noise but with an extra transformation solving the min-variance estimation to the noisy WxWx. Another very recent study (Ji et al., 2021) focuses on matrix-valued queries with only binary (matrix) outputs. It then devises an exclusive-or (xor) mechanism xor-ing the outputs with noise attributed to a matrix-valued Bernoulli distribution.

8. Conclusion

Pre-trained LMs have become pivotal in NLP. Alarmingly, fine-tuning corpora and inference-time inputs face various privacy attacks. The popular DP-SGD only provides limited protection for training data by adding noise to gradients: raw tokens or sensitive attributes of training/inference data can still be inverted or inferred from embeddings in forward-pass computation. Vanilla DP-SGD also imposes high GPU memory and computational burdens since its per-example gradient handling precludes ordinary batch processing.

We propose DP-Forward, which directly adds noise to embedding matrices derived from the raw training/inference data in the forward pass. Its core is the analytic matrix Gaussian mechanism, a general-purpose tool of independent interest. It draws optimal matrix-valued noise from a matrix Gaussian distribution in a dedicated way, using a necessary and sufficient condition for DP.

Perturbing embeddings at various positions across multiple layers yields at least two benefits. DP-Forward users only need to download the pipeline parts for deriving noisy embeddings, which is more storage- and time-efficient than deriving noisy gradients. Together with our prior attempts (Yue et al., 2021; Du et al., 2023) at sanitizing input text tokens and output sentence embeddings, we provide a full suite of forward-pass signal sanitization options, so users share only their sanitized data with LM-as-a-Service APIs while protecting privacy.

Beyond the theoretical contribution of two local DP notions and the experimental comparisons with baselines (e.g., GM, MVG, and DP-SGD) across three typical NLP tasks, we investigate the hyperparameter configuration for reproducible validation of DP-Forward’s potential in terms of efficiency, accuracy, and its ability to withstand diverse attacks.

Altogether, our new perspective leads to a better approach to privacy-aware deep neural network training, challenging the traditional wisdom focusing on gradients. As a new paradigm for local DP in fine-tuning and inference, our work paves the way for a myriad of possibilities for new machine-learning privacy research (Ng and Chow, 2023), e.g., generalization to transformer-based computer vision tasks.

Acknowledgements.
We are grateful to the anonymous reviewers for their comments and Ashwin Machanavajjhala for his comments on a related Ph.D. thesis. Sherman Chow is supported in part by the General Research Funds (CUHK 14209918, 14210319, 14210621), Research Grants Council, University Grants Committee, Hong Kong. Authors at OSU are sponsored in part by NSF IIS #1815674, NSF CAREER #1942980, and the Ohio Supercomputer Center (OSC, 1987). Tianhao Wang is supported in part by the National Science Foundation (NSF) with grants CNS-2220433 and OAC-2319988.

References

  • Abadi et al. (2016) Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In CCS. 308–318.
  • Aboagye et al. (2022) Prince Osei Aboagye, Yan Zheng, Chin-Chia Michael Yeh, Junpeng Wang, Wei Zhang, Liang Wang, Hao Yang, and Jeff M. Phillips. 2022. Normalization of Language Embeddings for Cross-Lingual Alignment. In ICLR (Poster). 32 pages.
  • Abowd et al. (2022) John Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, Brett Moran, William Sexton, Matthew Spence, and Pavel Zhuravlev. 2022. The 2020 Census Disclosure Avoidance System TopDown Algorithm. Harvard Data Science Review Special Issue 2: Differential Privacy for the 2020 U.S. Census (Jun 24 2022), 77 pages. https://hdsr.mitpress.mit.edu/pub/7evz361i.
  • Alvim et al. (2018) Mário S. Alvim, Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Anna Pazii. 2018. Invited Paper: Local Differential Privacy on Metric Spaces: Optimizing the Trade-Off with Utility. In CSF. 262–267.
  • Anil et al. (2022) Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. 2022. Large-Scale Differentially Private BERT. In Findings of EMNLP. 6481–6491.
  • Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450.
  • Balle and Wang (2018) Borja Balle and Yu-Xiang Wang. 2018. Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising. In ICML. 403–412.
  • Béguelin et al. (2020) Santiago Zanella Béguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Köpf, and Marc Brockschmidt. 2020. Analyzing Information Leakage of Updates to Natural Language Models. In CCS. 363–375.
  • Bonawitz et al. (2017) Kallista A. Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In CCS. 1175–1191.
  • Bu et al. (2019) Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie J. Su. 2019. Deep Learning with Gaussian Differential Privacy. arXiv:1911.11607.
  • Busa-Fekete et al. (2021) Robert Istvan Busa-Fekete, Andres Munoz Medina, Umar Syed, and Sergei Vassilvitskii. 2021. On the pitfalls of label differential privacy. In NeurIPS Workshop. 6 pages.
  • Carlini et al. (2019) Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 2019. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. In USENIX Security. 267–284.
  • Carlini et al. (2021) Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In USENIX Security. 2633–2650.
  • Chanyaswad et al. (2018) Thee Chanyaswad, Alex Dytso, H. Vincent Poor, and Prateek Mittal. 2018. MVG Mechanism: Differential Privacy under Matrix-Valued Query. In CCS. 230–246.
  • Chase and Chow (2009) Melissa Chase and Sherman S. M. Chow. 2009. Improving privacy and security in multi-authority attribute-based encryption. In CCS. 121–130.
  • Chaudhuri and Hsu (2011) Kamalika Chaudhuri and Daniel J. Hsu. 2011. Sample Complexity Bounds for Differentially Private Learning. In COLT. 155–186.
  • Chaudhuri and Monteleoni (2008) Kamalika Chaudhuri and Claire Monteleoni. 2008. Privacy-preserving logistic regression. In NIPS. 289–296.
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 2011. Differentially Private Empirical Risk Minimization. J. Mach. Learn. Res. 12 (2011), 1069–1109.
  • Dasoulas et al. (2021) George Dasoulas, Kevin Scaman, and Aladin Virmaux. 2021. Lipschitz normalization for self-attention layers with application to graph neural networks. In ICML, Vol. 139. 2456–2466.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
  • Du et al. (2023) Minxin Du, Xiang Yue, Sherman S. M. Chow, and Huan Sun. 2023. Sanitizing Sentence Embeddings (and Labels) for Local Differential Privacy. In WWW. 2349–2359.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC. 265–284.
  • Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (2014), 211–407.
  • Elazar and Goldberg (2018) Yanai Elazar and Yoav Goldberg. 2018. Adversarial Removal of Demographic Attributes from Text Data. In EMNLP. 11–21.
  • Erlingsson et al. (2019) Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity. In SODA. 2468–2479.
  • Esmaeili et al. (2021) Mani Malek Esmaeili, Ilya Mironov, Karthik Prasad, Igor Shilov, and Florian Tramèr. 2021. Antipodes of Label Differential Privacy: PATE and ALIBI. In NeurIPS. 6934–6945.
  • Face (2023) Hugging Face. 2023. BERT base model (uncased). https://huggingface.co/bert-base-uncased, last accessed: September 12, 2023.
  • Feyisetan et al. (2020) Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. In WSDM. 178–186.
  • Feyisetan et al. (2019) Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. 2019. Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. In ICDM. 210–219.
  • Ganju et al. (2018) Karan Ganju, Qi Wang, Wei Yang, Carl A. Gunter, and Nikita Borisov. 2018. Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In CCS. 619–633.
  • Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, and Chiyuan Zhang. 2021. Deep Learning with Label Differential Privacy. In NeurIPS. 27131–27145.
  • Gopi et al. (2021) Sivakanth Gopi, Yin Tat Lee, and Lukas Wutschitz. 2021. Numerical Composition of Differential Privacy. In NeurIPS. 11631–11642.
  • Hall et al. (2013) Robert J. Hall, Larry A. Wasserman, and Alessandro Rinaldo. 2013. Random Differential Privacy. J. Priv. Confidentiality 4, 2 (2013), 43–59.
  • Horn and Johnson (2012) Roger A. Horn and Charles R. Johnson. 2012. Matrix Analysis, 2nd Ed. Cambridge University Press, N/A.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In ICML, Vol. 97. 2790–2799.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR (Poster). 13 pages.
  • Ji et al. (2021) Tianxi Ji, Pan Li, Emre Yilmaz, Erman Ayday, Yanfang Ye, and Jinyuan Sun. 2021. Differentially Private Binary- and Matrix-Valued Data Query: An XOR Mechanism. Proc. VLDB Endow. 14, 5 (2021), 849–862.
  • Kasiviswanathan et al. (2008) Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. 2008. What Can We Learn Privately?. In FOCS. 531–540.
  • Kim et al. (2021) Hyunjik Kim, George Papamakarios, and Andriy Mnih. 2021. The Lipschitz Constant of Self-Attention. In ICML, Vol. 139. 5562–5571.
  • Laurent and Massart (2000) Beatrice Laurent and Pascal Massart. 2000. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics 28, 5 (2000), 1302–1338.
  • Lécuyer et al. (2019) Mathias Lécuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. 2019. Certified Robustness to Adversarial Examples with Differential Privacy. In S&P. 656–672.
  • Lehman et al. (2021) Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, and Byron C. Wallace. 2021. Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?. In NAACL-HLT. 946–959.
  • Li et al. (2015) Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, and Vibhor Rastogi. 2015. The matrix mechanism: optimizing linear counting queries under differential privacy. VLDB J. 24, 6 (2015), 757–781.
  • Li et al. (2022a) Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2022a. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 34, 1 (2022), 50–70.
  • Li et al. (2022b) Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto. 2022b. Large Language Models Can Be Strong Differentially Private Learners. In ICLR. 30 pages.
  • Lyu et al. (2020) Lingjuan Lyu, Xuanli He, and Yitong Li. 2020. Differentially Private Representation for NLP: Formal Guarantee and An Empirical Study on Privacy and Fairness. In Findings of EMNLP. 2355–2365.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In ACL. 142–150.
  • Mahabadi et al. (2021) Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. In NeurIPS. 1022–1035.
  • McDermott and Forsyth (2016) James McDermott and Richard S. Forsyth. 2016. Diagnosing a disorder in a classification benchmark. Pattern Recognit. Lett. 73 (2016), 41–43.
  • McMahan et al. (2018) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2018. Learning Differentially Private Recurrent Language Models. In ICLR (Poster). 14 pages.
  • Meehan et al. (2022) Casey Meehan, Khalil Mrini, and Kamalika Chaudhuri. 2022. Sentence-level Privacy for Document Embeddings. In ACL. 3367–3380.
  • Min et al. (2022) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?. In EMNLP. 11048–11064.
  • Mironov (2017) Ilya Mironov. 2017. Rényi Differential Privacy. In CSF. 263–275.
  • Ng and Chow (2023) Lucien K. L. Ng and Sherman S. M Chow. 2023. SoK: Cryptographic Neural-Network Computation. In S&P. 497–514.
  • OSC (1987) OSC. 1987. Ohio Supercomputer Center. http://osc.edu/ark:/19495/f5s1ph73
  • Pan et al. (2020) Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. 2020. Privacy Risks of General-Purpose Language Models. In S&P. 1314–1331.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP. 1532–1543.
  • Qu et al. (2021) Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork. 2021. Natural Language Understanding with Privacy-Preserving BERT. In CIKM. 1488–1497.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Report.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Report.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In EMNLP-IJCNLP. 3980–3990.
  • Shetty et al. (2018) Rakshith Shetty, Bernt Schiele, and Mario Fritz. 2018. A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation. In USENIX Security. 1633–1650.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership Inference Attacks Against Machine Learning Models. In S&P. 3–18.
  • Song and Raghunathan (2020) Congzheng Song and Ananth Raghunathan. 2020. Information Leakage in Embedding Models. In CCS. 377–390.
  • Song and Mittal (2021) Liwei Song and Prateek Mittal. 2021. Systematic Evaluation of Privacy Risks of Machine Learning Models. In USENIX Security. 2615–2632.
  • Steinke (2022) Thomas Steinke. 2022. Composition of Differential Privacy & Privacy Amplification by Subsampling. arXiv:2210.00597.
  • Sweeney (2015) Latanya Sweeney. 2015. Only You, Your Doctor, and Many Others May Know. Technology Science 2015092903, 9 (2015), 29.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NeurIPS. 5998–6008.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR (Poster). 20 pages.
  • Warner (1965) Stanley L Warner. 1965. Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. JASA 60, 309 (1965), 63–69.
  • Weggenmann and Kerschbaum (2018) Benjamin Weggenmann and Florian Kerschbaum. 2018. SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining. In SIGIR. 305–314.
  • Wu et al. (2023) Ruihan Wu, Jin Peng Zhou, Kilian Q. Weinberger, and Chuan Guo. 2023. Does Label Differential Privacy Prevent Label Inference Attacks?. In AISTAST. 4336–4347.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.
  • Yang et al. (2021) Jungang Yang, Liyao Xiang, Weiting Li, Wei Liu, and Xinbing Wang. 2021. Improved Matrix Gaussian Mechanism for Differential Privacy. arXiv:2104.14808.
  • Yang et al. (2023) Jungang Yang, Liyao Xiang, Jiahao Yu, Xinbing Wang, Bin Guo, Zhetao Li, and Baochun Li. 2023. Matrix Gaussian Mechanisms for Differentially-Private Learning. IEEE Trans. Mob. Comput. 22, 2 (2023), 1036–1048.
  • Yeom et al. (2018) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. 2018. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In CSF. 268–282.
  • Yousefpour et al. (2021) Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. 2021. Opacus: User-Friendly Differential Privacy Library in PyTorch. arXiv:2109.12298.
  • Yu et al. (2022) Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. 2022. Differentially Private Fine-tuning of Language Models. In ICLR (Poster). 19 pages.
  • Yu et al. (2021) Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, and Tie-Yan Liu. 2021. Large Scale Private Learning via Low-rank Reparametrization. In ICML. 12208–12218.
  • Yue et al. (2021) Xiang Yue, Minxin Du, Tianhao Wang, Yaliang Li, Huan Sun, and Sherman S. M. Chow. 2021. Differential Privacy for Text Analytics via Natural Text Sanitization. In Findings of ACL/IJCNLP. 3853–3866.
  • Zhou et al. (2022) Mingxun Zhou, Tianhao Wang, T.-H. Hubert Chan, Giulia Fanti, and Elaine Shi. 2022. Locally Differentially Private Sparse Vector Aggregation. In S&P. 422–439.

Appendix A Token-level DP-Forward

A.1. Definition and Related Notions

Definition 0 (Token-level SeqLDP).

​​For ϵ0,0δ1\epsilon\geq 0,0\leq\delta\leq 1, \mathcal{M} fulfills token-level (ϵ,δ)(\epsilon,\delta)-SeqLDP, if XX\forall X\simeq X^{\prime} that differ in any single token but with the same yy, and any possible output subset 𝒪\mathcal{O},

Pr[(X,y)𝒪]eϵPr[(X,y)𝒪]+δ.\displaystyle\Pr[\mathcal{M}(X,y)\in\mathcal{O}]\leq e^{\epsilon}\Pr[\mathcal{M}(X^{\prime},y)\in\mathcal{O}]+\delta.

Despite being a token-level notion, our experiments (Appendix A.3) show that when f()f(\cdot) is only the input embedding layer, our token-level SeqLDP designs can also effectively mitigate MIAs on entire sequences, with up to 2020pp accuracy gains at the same choices of ϵ\epsilon. It is not necessarily weaker than sequence-level CDP (as offered by DP-SGD). One might doubt its usefulness since two neighboring sequences may be too similar. Nevertheless, there are cases where an entire sentence, e.g., “How’s it going,” may not matter much in a bigger unit (paragraph/essay) of the training data either. Moreover, a single token (e.g., yes/no) can play a crucial role, e.g., in named entity recognition (Li et al., 2022a). Our LDP guarantee covers any two such sequences, spanning the wide spectrum between the “too similar” and radically different cases.

Note that weakening privacy notions is not by itself our goal. (As a related example, in image classification, PixelDP (Lécuyer et al., 2019) has been proposed for a DP notion defined upon pixels, motivated by robustness to adversarial examples.) Protection at the token level has been studied under metric-DP (Feyisetan et al., 2020; Qu et al., 2021), a relaxation of LDP, which requires even much larger ϵ\epsilon, say, 175175. Our goal in studying token-level SeqLDP is to narrow the gap between theory and practice, i.e., to provide provable privacy notions tailored to the protection targets (the first few layers vs. the whole pipeline).

A.2. Two Token-level SeqLDP Designs

For token-level SeqLDP, we need to bound a “new” S2(f),XXS_{2}(f),\forall X\simeq X^{\prime}, which should be tight and smaller than the one over X,X\forall X,X^{\prime}, hence producing smaller noise for better utility at meaningful token-level ϵ\epsilon. This is still non-trivial since the output of f()f(\cdot), unless it is only the input embedding layer, may differ in every entry even for XXX\simeq X^{\prime}. One could also normalize the entire f()f(\cdot) for S2(f),XXS_{2}(f),\forall X\simeq X^{\prime}, which “degenerates” to the token-level SeqLDP. Instead, we tailor two designs that estimate a tighter S2(f)S_{2}(f) than the “general” one, for the input embedding layer only and for the first two layers, respectively. Specifically, we employ row-wise normalization and Lipschitz continuity (Kim et al., 2021).

After the Input Embedding Layer. When $f(\cdot)$ is only the input embedding layer, we work on each row $x_i$ independently: $||x_i||_2=C, \forall i\in[n]$, where $||\cdot||_2$ is the vector $2$-norm. As a token only affects one row, we have $S_2(f)=C$, independent of whether the embedding layer will be updated. Again with $\mathcal{B}$, we can draw noise $Z\in\mathbb{R}^{n\times d}$.
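To make this instantiation concrete, below is a minimal NumPy sketch (ours, not the paper's code), assuming the simplest i.i.d. configuration of the aMGM where $\Sigma$ and $\Psi$ are scaled identities, so the per-entry noise scale is $S_2(f)/\mathcal{B}=C/\mathcal{B}$; the function names and the value of $\mathcal{B}$ are illustrative.

```python
import numpy as np

def row_normalize(X, C=1.0):
    # Scale each row (token embedding) of X (n x d) to 2-norm exactly C.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return C * X / np.maximum(norms, 1e-12)

def perturb_embeddings(X, C, B):
    # i.i.d. special case of the aMGM: Sigma and Psi are scaled identities,
    # so every entry of Z is drawn from N(0, (S2(f)/B)^2) with S2(f) = C.
    # B is the noise-calibration bound for the target (epsilon, delta);
    # its computation is omitted here.
    Xn = row_normalize(X, C)
    Z = np.random.normal(0.0, C / B, size=Xn.shape)
    return Xn + Z

# Toy usage: n = 8 tokens, d = 16 embedding dimensions.
noisy = perturb_embeddings(np.random.randn(8, 16), C=1.0, B=2.0)
```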

In the First MHA Sub-layer. The second option is to add $Z$ right after the first MHA sub-layer: $\mathsf{MHA}(X)+Z$, where $\mathsf{MHA}(\cdot)$ is the concatenation of $\mathsf{Att}_i(\cdot), i\in[h]$. Yet, it is non-trivial to estimate $S_2(f)$ of $\mathsf{MHA}(\cdot)$ since $\mathsf{Att}_i(\cdot)$, let alone $\mathsf{MHA}(\cdot)$, is not Lipschitz (Kim et al., 2021).

Definition 2 (Lipschitz Continuity).

Given two metric spaces $(\mathcal{X},d_{\mathcal{X}})$ and $(\mathcal{Y},d_{\mathcal{Y}})$, a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ is Lipschitz continuous ($K$-Lipschitz) if there exists a constant $K\geq 0$ such that

d_{\mathcal{Y}}(f(X),f(X^{\prime}))\leq K\,d_{\mathcal{X}}(X,X^{\prime}),\ \forall X,X^{\prime}\in\mathcal{X}.

The smallest such $K$ is the Lipschitz constant, denoted by $\mathit{Lip}(f)$.

We consider $\mathcal{X}$ to be the space of row-wise normalized matrices in $\mathbb{R}^{n\times d}$, $\mathcal{Y}$ the output space $\mathbb{R}^{n\times d(/h)}$ of $\mathsf{Att}_{i\in[h]}(\cdot)$ or $\mathsf{MHA}(\cdot)$, and $d_{\mathcal{X}}=d_{\mathcal{Y}}=||\cdot||_F$ (or a $p$-norm $||\cdot||_p$). $\mathit{Lip}(f)$ generalizes $S_2(f)$ since it concerns any two inputs rather than just neighboring ones, allowing us to upper-bound $S_2(f)$ given $\mathit{Lip}(f)$.

The non-Lipschitzness stems from the non-linear Softmax activation, which takes pairwise dot products as input (Kim et al., 2021). To make MHA Lipschitz, one might replace the dot products with pairwise $L_2$-distances (hence called $L_2$-MHA) (Kim et al., 2021) or add a normalization step called LipschitzNorm (Dasoulas et al., 2021) inside $\mathsf{softmax}(\cdot)$. Unfortunately, estimating $\mathit{Lip}(f)$ of $L_2$-MHA requires solving an intractable optimization problem, and LipschitzNorm is ill-suited for the high-dimensional BERT attention.

Instead of adding $Z$ to the outputs of $\mathsf{MHA}(\cdot)$ or $\mathsf{Att}_i(\cdot)$, we can shift the boundary of $f(\cdot)$ to the computations before $\mathsf{softmax}(\cdot)$, where estimating $\mathit{Lip}(f)$ or $S_2(f)$ is feasible, e.g., the linear maps used to derive the $Q,K,V$ matrices. For a linear map $f(x)=xW$ with $W\in\mathbb{R}^{d\times d/h}$ and $x\in\mathbb{R}^{d}$, the $2$-norm Lipschitz constant $\mathit{Lip}_2(f)$ is the largest singular value $\sigma_{\max}(W)$ (Kim et al., 2021). Generalizing $f(\cdot)$ to matrices and considering any two $X\simeq X^{\prime}$, we can estimate

S_2(f)=\sup_{X\simeq X^{\prime}}||f(X)-f(X^{\prime})||_F=||f(x)||_2\leq C\cdot\sigma_{\max}(W).

We can now derive the noisy $Q,K,V$ matrices for $p(\cdot)$. The first step is to draw noise $Z^{Q^{*},K^{*},V^{*}}\in\mathbb{R}^{n\times d}$ given $W^{Q^{*},K^{*},V^{*}}\in\mathbb{R}^{d\times d}$. This requires us either to estimate $\sigma_{\max}(W^{Q^{*},K^{*},V^{*}})$ on the fly via power iteration or to fix the linear maps in each forward pass of fine-tuning or inference. We then compute $XW^{Q^{*},K^{*},V^{*}}+Z^{Q^{*},K^{*},V^{*}}$, which are reshaped into $3h$ matrices of size $n\times d/h$ for $\mathsf{Att}_{i\in[h]}(\cdot)$.
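The sketch below (ours, illustrative only) shows how this second option could look in NumPy, again assuming the i.i.d. special case of the aMGM and toy shapes: `power_iteration` estimates $\sigma_{\max}(W)$ on the fly, the noise scale follows the bound $S_2(f)\leq C\cdot\sigma_{\max}(W)$ above, and the noisy projection is split head-wise.

```python
import numpy as np

def power_iteration(W, iters=50):
    # Estimate the largest singular value sigma_max(W).
    v = np.random.randn(W.shape[1])
    for _ in range(iters):
        v = W.T @ (W @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

def noisy_projection(Xn, W, C, B):
    # Xn: row-normalized input (n x d); W: one of W^Q, W^K, W^V (d x d).
    # Token-level sensitivity is bounded by C * sigma_max(W), so the i.i.d.
    # noise scale (aMGM special case) is C * sigma_max(W) / B.
    sigma = C * power_iteration(W) / B
    Z = np.random.normal(0.0, sigma, size=(Xn.shape[0], W.shape[1]))
    return Xn @ W + Z

# Toy usage: n = 8 tokens, d = 16, h = 4 heads.
n, d, h = 8, 16, 4
Xn = np.random.randn(n, d)
WQ = np.random.randn(d, d) / np.sqrt(d)
Q_noisy = noisy_projection(Xn, WQ, C=1.0, B=2.0)
Q_heads = Q_noisy.reshape(n, h, d // h)  # one n x (d/h) slice per attention head
```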

Theorem 3.

The two instances (with row-wise normalization) for fine-tuning or inference fulfill token-level $(\epsilon,\delta)$-(Seq)LDP.

The proof mirrors the one for (Seq)LDP; it only requires computing $S_2(f)$ over $X\simeq X^{\prime}$ properly, which we have done above.

Discussion.

For minimal changes to the pipeline, we adopt the raw WordPiece tokenizer (Wu et al., 2016), which splits text into sub-words; using word-level tokenization instead yields word-level (Seq)LDP. Our notion can also be extended to phrase-level (Seq)LDP, either by directly applying group privacy (Dwork and Roth, 2014) or by dedicatedly computing an $L_2$-sensitivity smaller than $c\cdot S_2(f)$ for two sequences differing in (consecutive) $c$ tokens. Typically, $c$ is small since a few tokens suffice to carry most sensitive information. One could also add noise deeper in a pipeline using $S_2(f_1\circ f_2)\leq S_2(f_1)\cdot S_2(f_2)$, where $f_1\circ f_2$ denotes the composition $f_1(f_2(\cdot))$; we then need to estimate $S_2(f)$ of each (component of a) sub-layer. For example, $\mathsf{FFN}(\cdot)$ comprises two linear maps $W_1$ and $W_2$ with $\mathsf{ReLU}(\cdot)$ in between, where $S_2(f)$ of $\mathsf{ReLU}(\cdot)$ is $1$. For $W_{1,2}$, $S_2(f)$ is bounded by $\sqrt{d}C\sigma_{\max}(W_{1,2})$ since $||\cdot||_F\leq\sqrt{d}||\cdot||_2$ with $d$ being the rank. We can also estimate $S_2(f)$ of $\mathsf{LN}(\cdot)$ from its Lipschitz constant (Kim et al., 2021). When $f(\cdot)$ is composed of more layers, we can only obtain a looser estimate of the final $S_2(f)$. Hence, our general recommendation is to add noise early, where estimating a tight $S_2(f)$ is feasible.
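To illustrate the composition rule only, the toy sketch below (ours) multiplies the per-component estimates discussed above for $\mathsf{FFN}(\cdot)$; the shapes, constants, and the resulting (admittedly loose) bound are illustrative, not a recommended configuration.

```python
import numpy as np

def spectral_norm(W):
    # Largest singular value sigma_max(W).
    return np.linalg.svd(W, compute_uv=False)[0]

# FFN = W2 o ReLU o W1 with toy shapes; C is the row-normalization bound.
d, C = 16, 1.0
W1 = np.random.randn(d, 4 * d) / np.sqrt(d)
W2 = np.random.randn(4 * d, d) / np.sqrt(4 * d)

# Per-component estimates from the discussion: S2(ReLU) = 1, and each linear
# map contributes at most sqrt(d) * C * sigma_max(W) (a loose bound).
components = [np.sqrt(d) * C * spectral_norm(W1), 1.0, np.sqrt(d) * C * spectral_norm(W2)]

# Composition rule S2(f1 o f2) <= S2(f1) * S2(f2), applied repeatedly.
s2_ffn_bound = float(np.prod(components))
print(s2_ffn_bound)
```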

Figure 5. Privacy-accuracy tradeoff of token-level SeqLDP instances when tuning local $\epsilon$

A.3. More Experiment Results

We also study the privacy-accuracy tradeoff on all three tasks for our two token-level SeqLDP designs when tuning the local $\epsilon$. The results are compared with the non-private baseline and fine-tuning using MVG noise. Figure 5 shows that task accuracy increases with $\epsilon$. Perturbing input embeddings under token-level (vs. sequence-level) SeqLDP achieves remarkable accuracy gains, e.g., ${\sim}0.7$ vs. $0.5$ for IMDb.

We evaluate the two MIAs on SST-2 fine-tuned by our two token-level SeqLDP instances. Table 10 shows the results, with success rates within $0.48$–$0.52$ (akin to random guessing) marked by an asterisk. Even though the provable guarantee is only at the token level, our instances notably reduce the success rates of the confidence-based attack by ${\sim}14$pp and of the entropy-based one by ${\sim}11$pp, compared to the non-private baseline.

Local $\epsilon$   Method                    Attack Success Rate (Entropy / Confidence)
$\infty$           Non-private baseline      0.659 / 0.645
8                  DP-Forward (Embedding)    0.536 / 0.503*
                   DP-Forward (Attention)    0.545 / 0.516*
16                 DP-Forward (Embedding)    0.542 / 0.509*
                   DP-Forward (Attention)    0.552 / 0.519*
24                 DP-Forward (Embedding)    0.552 / 0.516*
                   DP-Forward (Attention)    0.559 / 0.523

Table 10. Success rates of the two (sequence-level) MIAs on our token-level SeqLDP instances; * marks rates within $0.48$–$0.52$, akin to random guessing.

Appendix B Relevant Matrix Algebra

Proposition 1.

The PDF defined in Eq. (1) and the matrix-trace-based one used in MVG (Chanyaswad et al., 2018) are equivalent.

Proof.

For the numerator part in Eq. (1), we have

\begin{aligned}
||U^{-1}(Z-M)V^{-\top}||^{2}_{F} &= \operatorname{Tr}[V^{-1}(Z-M)^{\top}U^{-\top}U^{-1}(Z-M)V^{-\top}] \\
&= \operatorname{Tr}[V^{-1}(Z-M)^{\top}\Sigma^{-1}(Z-M)V^{-\top}],
\end{aligned}

where $\operatorname{Tr}(\cdot)$ denotes the matrix trace. Denote

A=V^{-1}(Z-M)^{\top}\Sigma^{-1}(Z-M)V^{-\top}.

We compute

B=V^{-\top}AV^{\top}=\Psi^{-1}(Z-M)^{\top}\Sigma^{-1}(Z-M),

which is similar to $A$, and hence $\operatorname{Tr}(A)=\operatorname{Tr}(B)$. So, the two PDFs are equivalent since

||U^{-1}(Z-M)V^{-\top}||^{2}_{F}=\operatorname{Tr}(B). ∎
Theorem 2 (Singular Value Decomposition, or SVD (Horn and Johnson, 2012)).

A matrix $A\in\mathbb{R}^{n\times d}$ can be decomposed as $W_1\Lambda W_2^{\top}$, where $W_1\in\mathbb{R}^{n\times n}$ and $W_2\in\mathbb{R}^{d\times d}$ are unitary, and $\Lambda$ is an $n\times d$ diagonal matrix whose diagonal entries are the ordered singular values of $A$, denoted by $\sigma_1(A)\geq\ldots\geq\sigma_r(A)\geq 0$ (or simply $\sigma(A)$) with $r=\min\{n,d\}$.

Lemma 3.

Given a matrix $A\in\mathbb{R}^{n\times d}$ and two orthogonal matrices $W_1\in\mathbb{R}^{n\times n}$, $W_2\in\mathbb{R}^{d\times d}$, we have $||A||_F=||W_1A||_F=||AW_2||_F$; i.e., $||\cdot||_F$ is immune to pre- and post-orthogonal transformations.

Proof.

We first prove $||A||_F=||W_1A||_F$ by

||W_1A||^{2}_{F}=\operatorname{Tr}(A^{\top}W_1^{\top}W_1A)=\operatorname{Tr}(A^{\top}A)=||A||^{2}_{F},

and similarly, we can prove $||A||_F=||AW_2||_F$. ∎

Lemma 4.

For $A\in\mathbb{R}^{n\times d}$, $||A||^{2}_{F}=\sum^{r}_{i=1}\sigma^{2}_{i}(A)$, where $\sigma_i(A)$ is the $i^{\text{th}}$ singular value of $A$ and $r=\min\{n,d\}$.

Proof.

The SVD of $A$ is $W_1\Lambda W_2^{\top}$. By Lemma 3, we have

||A||^{2}_{F}=||W_1\Lambda W_2^{\top}||^{2}_{F}=||\Lambda||^{2}_{F}=\sum^{r}_{i=1}\sigma^{2}_{i}(A). ∎
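Both facts are easy to sanity-check numerically; the short snippet below (ours, illustrative only) verifies Lemmas 3 and 4 on a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
# Random orthogonal matrices via QR decomposition.
W1, _ = np.linalg.qr(rng.standard_normal((5, 5)))
W2, _ = np.linalg.qr(rng.standard_normal((3, 3)))

fro = np.linalg.norm(A, 'fro')
# Lemma 3: Frobenius norm is invariant under pre-/post-orthogonal transforms.
assert np.allclose(fro, np.linalg.norm(W1 @ A, 'fro'))
assert np.allclose(fro, np.linalg.norm(A @ W2, 'fro'))
# Lemma 4: squared Frobenius norm equals the sum of squared singular values.
assert np.allclose(fro ** 2, np.sum(np.linalg.svd(A, compute_uv=False) ** 2))
```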
Lemma 5 (Lemma 4 in (Yang et al., 2021)).

Given matrices $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times d}$, $C\in\mathbb{R}^{d\times d}$, we have $||ABC||^{2}_{F}\leq\sum^{r}_{i=1}\sigma^{2}_{i}(A)\sigma^{2}_{i}(B)\sigma^{2}_{i}(C)$, where $\sigma_i(\cdot)$ is the $i^{\text{th}}$ singular value and $r=\min\{n,d\}$.

Proof.

With SVD, $A, B, C$ are decomposed as

A=W_{A_1}\Lambda_A W^{\top}_{A_2},\quad B=W_{B_1}\Lambda_B W^{\top}_{B_2},\quad C=W_{C_1}\Lambda_C W^{\top}_{C_2}.

By Lemma 3, we have

\begin{aligned}
||ABC||^{2}_{F} &= ||W_{A_1}\Lambda_A W^{\top}_{A_2}W_{B_1}\Lambda_B W^{\top}_{B_2}W_{C_1}\Lambda_C W^{\top}_{C_2}||^{2}_{F} \\
&= ||\Lambda_A W\Lambda_B W^{\prime}\Lambda_C||^{2}_{F},
\end{aligned}

where $W=W^{\top}_{A_2}W_{B_1}=(w_{ij})_{n\times n}$ and $W^{\prime}=W^{\top}_{B_2}W_{C_1}=(w^{\prime}_{ij})_{d\times d}$ are still two unitary matrices. We further have

\begin{aligned}
||ABC||^{2}_{F} &= \sum^{n}_{i=1}\sum^{d}_{j=1}\sigma^{2}_{i}(A)\sigma^{2}_{j}(C)\Big[\sum^{r}_{k=1}\sigma_{k}(B)w_{ik}w^{\prime}_{kj}\Big]^{2} \\
&= \sum^{n}_{i=1}\sum^{d}_{j=1}\sigma^{2}_{i}(A)\sigma^{2}_{j}(C)\beta^{2}_{ij},
\end{aligned}

where $\beta_{ij}=\sum^{r}_{k=1}\sigma_{k}(B)w_{ik}w^{\prime}_{kj}$. Hence, we need to show

(7)  \sum^{n}_{i=1}\sum^{d}_{j=1}\sigma^{2}_{i}(A)\sigma^{2}_{j}(C)\beta^{2}_{ij}\leq\sum^{r}_{i=1}\sigma^{2}_{i}(A)\sigma^{2}_{i}(B)\sigma^{2}_{i}(C).

Following the strategy in (Yang et al., 2021) (cf. its Eqs. (29) and (30)), we rewrite $\sigma^{2}_{i}(A)$ and $\sigma^{2}_{j}(C)$ using non-negative values $\xi_t$ and $\eta_s$ such that

\sigma^{2}_{i}(A)=\sum^{n}_{t=i}\xi_{t},\ t\in[1,n];\qquad \sigma^{2}_{j}(C)=\sum^{d}_{s=j}\eta_{s},\ s\in[1,d].

For $i\in[1,n]$, $j\in[1,d]$, we denote $\gamma_{ij}=\sigma_i(B)$ if $i=j$, and $\gamma_{ij}=0$ otherwise. Then, we transform Inequality (7) as

\begin{aligned}
&\sum^{r}_{i=1}\sigma^{2}_{i}(A)\sigma^{2}_{i}(B)\sigma^{2}_{i}(C)-\sum^{n}_{i=1}\sum^{d}_{j=1}\sigma^{2}_{i}(A)\sigma^{2}_{j}(C)\beta^{2}_{ij} \\
=&\ \sum^{n}_{i=1}\sum^{d}_{j=1}(\gamma^{2}_{ij}-\beta^{2}_{ij})\sigma^{2}_{i}(A)\sigma^{2}_{j}(C) \\
=&\ \sum^{n}_{i=1}\sum^{d}_{j=1}(\gamma^{2}_{ij}-\beta^{2}_{ij})\sum^{n}_{t=i}\xi_{t}\sum^{d}_{s=j}\eta_{s} \\
=&\ \sum^{n}_{t=1}\sum^{d}_{s=1}\xi_{t}\eta_{s}\sum^{t}_{i=1}\sum^{s}_{j=1}(\gamma^{2}_{ij}-\beta^{2}_{ij}).
\end{aligned}

Since $\xi_t, \eta_s$ are non-negative, we only need to show

(8)  \sum^{t}_{i=1}\sum^{s}_{j=1}(\gamma^{2}_{ij}-\beta^{2}_{ij})\geq 0.

However, the original proof (Yang et al., 2021) has two issues: i) the case $t>s$ is not considered, and ii) the commutative law of matrix multiplication in its Eq. (35) does not hold since $E(t)$ in its Eq. (34) is not a standard diagonal matrix. To address them, we note

\sum^{t}_{i=1}\sum^{s}_{j=1}\gamma^{2}_{ij}=\sum^{\min\{t,s\}}_{k=1}\sigma^{2}_{k}(B).

We then denote by $B^{*}=(\beta_{ij})$, $i\in[1,t]$, $j\in[1,s]$, a sub-matrix of $W\Lambda_B W^{\prime}$. With the SVD of $B^{*}$, we have

\sum^{t}_{i=1}\sum^{s}_{j=1}\beta^{2}_{ij}=||B^{*}||^{2}_{F}=\sum^{\min\{t,s\}}_{k=1}\sigma^{2}_{k}(B^{*})\leq\sum^{\min\{t,s\}}_{k=1}\sigma^{2}_{k}(B).

The last inequality is due to $\sigma_k(B^{*})\leq\sigma_k(B)$ for all $k\in[1,r]$ (Horn and Johnson, 2012). So, Inequality (8) holds. ∎
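Lemma 5 itself is also easy to probe numerically; the snippet below (ours, illustrative only) checks the inequality on random matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, d))
Cm = rng.standard_normal((d, d))

lhs = np.linalg.norm(A @ B @ Cm, 'fro') ** 2
# Singular values come back in descending order from np.linalg.svd.
sA = np.linalg.svd(A, compute_uv=False)
sB = np.linalg.svd(B, compute_uv=False)
sC = np.linalg.svd(Cm, compute_uv=False)
r = min(n, d)
rhs = np.sum((sA[:r] * sB[:r] * sC[:r]) ** 2)
assert lhs <= rhs + 1e-9  # Lemma 5: ||ABC||_F^2 <= sum_i s_i(A)^2 s_i(B)^2 s_i(C)^2
```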

Appendix C Proofs for Our Analytic Matrix Gaussian Mechanism

This section proves Lemma 4, Lemma 5, and Theorem 6 in Section 4.2.

Proof of Lemma 4.

Recall that $\mathcal{M}(f(\mathcal{X}))=f(\mathcal{X})+Z$ with $Z\sim\mathcal{MN}_{n,d}(0,\Sigma,\Psi)$. The probability of $\mathcal{M}(f(\mathcal{X}))=O$ is

\Pr[\mathcal{M}(f(\mathcal{X}))=O]=\frac{\exp(-\frac{1}{2}||U^{-1}(O-f(\mathcal{X}))V^{-\top}||^{2}_{F})}{(2\pi)^{nd/2}|\Sigma|^{d/2}|\Psi|^{n/2}}.

Similarly, we can compute $\Pr[\mathcal{M}(f(\mathcal{X}^{\prime}))=O]$. Plugging them into $\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}(O)$ and letting $\Delta=f(\mathcal{X})-f(\mathcal{X}^{\prime})$,

\begin{aligned}
\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}(O) &= \ln\frac{\exp(-\frac{1}{2}||U^{-1}(O-f(\mathcal{X}))V^{-\top}||^{2}_{F})}{\exp(-\frac{1}{2}||U^{-1}(O-f(\mathcal{X}^{\prime}))V^{-\top}||^{2}_{F})} \\
&= \frac{1}{2}||U^{-1}(Z+\Delta)V^{-\top}||^{2}_{F}-\frac{1}{2}||U^{-1}ZV^{-\top}||^{2}_{F} \\
&= \frac{1}{2}||U^{-1}\Delta V^{-\top}||^{2}_{F}+\langle\mathit{vec}(U^{-1}\Delta V^{-\top}),\mathit{vec}(U^{-1}ZV^{-\top})\rangle,
\end{aligned}

where $\mathit{vec}(\cdot)$ is the vectorization of a matrix and $\langle\cdot,\cdot\rangle$ denotes the inner product. For ease of presentation, we denote $Z^{\prime}=U^{-1}ZV^{-\top}$ and $\Delta^{\prime}=U^{-1}\Delta V^{-\top}$, and re-write

\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}(O)=\frac{1}{2}||\Delta^{\prime}||^{2}_{F}+\langle\mathit{vec}(\Delta^{\prime}),\mathit{vec}(Z^{\prime})\rangle.

Given Lemma 7, $Z^{\prime}\sim\mathcal{MN}_{n,d}(0,I_n,I_d)$ with each entry i.i.d. drawn from $\mathcal{N}(0,1)$. Thus, $\langle\mathit{vec}(\Delta^{\prime}),\mathit{vec}(Z^{\prime})\rangle$ is a $\Delta^{\prime}$-weighted sum of $nd$ i.i.d. standard Gaussian random variables and hence follows $\mathcal{N}(0,||\Delta^{\prime}||^{2}_{F})$ (see en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables). So $\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\sim\mathcal{N}(\eta,2\eta)$ with $\eta=\frac{1}{2}||\Delta^{\prime}||^{2}_{F}$. ∎
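This conclusion is straightforward to verify by simulation; the snippet below (ours, with arbitrary full-rank $U$ and $V$ chosen only for illustration) samples the noise as $Z=UEV^{\top}$ with entry-wise standard normal $E$, so $Z^{\prime}=U^{-1}ZV^{-\top}=E$, and checks that the empirical mean and variance of the privacy loss match $\eta$ and $2\eta$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 3
# Arbitrary full-rank scale factors: Sigma = U U^T, Psi = V V^T.
U = np.tril(rng.standard_normal((n, n))) + 3 * np.eye(n)
V = np.tril(rng.standard_normal((d, d))) + 3 * np.eye(d)
Delta = rng.standard_normal((n, d))          # stands in for f(X) - f(X')

Dp = np.linalg.inv(U) @ Delta @ np.linalg.inv(V).T   # Delta'
eta = 0.5 * np.linalg.norm(Dp, 'fro') ** 2

# Monte Carlo over Z = U E V^T: the whitened noise Z' = U^{-1} Z V^{-T} = E,
# and the privacy loss is eta + <vec(Delta'), vec(E)>.
losses = np.array([eta + np.sum(Dp * rng.standard_normal((n, d)))
                   for _ in range(20000)])
print(eta, losses.mean())      # empirical mean should be close to eta
print(2 * eta, losses.var())   # empirical variance should be close to 2 * eta
```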

Proof of Lemma 5.

With Lemma 4 and the CDF of the standard normal distribution, we have

\begin{aligned}
\Pr[\mathcal{L}_{\mathcal{M},\mathcal{X},\mathcal{X}^{\prime}}\geq\epsilon] &= \Pr[\mathcal{N}(\eta,2\eta)\geq\epsilon] \\
&= \Pr[\mathcal{N}(0,1)\geq\frac{\epsilon-\eta}{\sqrt{2\eta}}]=\Pr[\mathcal{N}(0,1)\leq\frac{\eta-\epsilon}{\sqrt{2\eta}}] \\
&= \Phi\Big(\frac{||\Delta^{\prime}||_F}{2}-\frac{\epsilon}{||\Delta^{\prime}||_F}\Big),
\end{aligned}

where we used $\mathcal{N}(\eta,2\eta)=\eta+\sqrt{2\eta}\cdot\mathcal{N}(0,1)$ and the symmetry of the standard normal distribution, $\Pr[\mathcal{N}(0,1)\geq t]=\Pr[\mathcal{N}(0,1)\leq -t]$.

A similar argument applied to $\mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}}$ yields

\Pr[\mathcal{L}_{\mathcal{M},\mathcal{X}^{\prime},\mathcal{X}}\leq-\epsilon]=\Phi\Big(-\frac{||\Delta^{\prime}||_F}{2}-\frac{\epsilon}{||\Delta^{\prime}||_F}\Big). ∎
Proof of Theorem 6.

The proof has two directions. From $(\epsilon,\delta)$-DP (Theorem 3) to Theorem 6, it directly follows from the derivations in Section 4.2. For the converse, it suffices to show that $||\Delta^{\prime}||_F\leq\mathcal{B}$ holds for all $\mathcal{X}\simeq\mathcal{X}^{\prime}$ given Theorem 6. In particular, for $||\Delta^{\prime}||_F=||U^{-1}\Delta V^{-\top}||_F$, we have

\begin{aligned}
||U^{-1}\Delta V^{-\top}||_F^{2} &\leq \sum^{r}_{i=1}\frac{\sigma^{2}_{i}(\Delta)}{\sigma^{2}_{n-i+1}(U)\sigma^{2}_{d-i+1}(V)} \\
&\leq \frac{\sum^{r}_{i=1}\sigma^{2}_{i}(\Delta)}{\sigma^{2}_{n}(U)\sigma^{2}_{d}(V)}\leq\frac{||\Delta||^{2}_{F}}{S^{2}_{2}(f)/\mathcal{B}^{2}}\leq\mathcal{B}^{2},
\end{aligned}

where the first inequality is due to Lemma 5, the second holds since $\sigma_n(U)$ and $\sigma_d(V)$ are the smallest singular values of $U$ and $V$, respectively, the third follows from Lemma 4 and the condition on $\sigma_n(U)\sigma_d(V)$ in Theorem 6, and the last uses $||\Delta||_F\leq S_2(f)$. ∎
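Putting Lemma 5 and Theorem 6 to work, the noise can be calibrated by finding the largest admissible $\mathcal{B}$ for a given $(\epsilon,\delta)$ via a binary search on the tail condition $\Phi(\mathcal{B}/2-\epsilon/\mathcal{B})-e^{\epsilon}\Phi(-\mathcal{B}/2-\epsilon/\mathcal{B})\leq\delta$, and then enforcing $\sigma_n(U)\sigma_d(V)\geq S_2(f)/\mathcal{B}$. The sketch below is our illustrative re-implementation of that calibration (shown for the i.i.d. special case), not the paper's exact algorithm; the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def delta_of(B, eps):
    # Tail bound on the privacy loss implied by Lemma 5 when ||Delta'||_F <= B.
    return norm.cdf(B / 2 - eps / B) - np.exp(eps) * norm.cdf(-B / 2 - eps / B)

def calibrate_B(eps, delta, lo=1e-6, hi=100.0, iters=200):
    # Largest B satisfying delta_of(B, eps) <= delta (delta_of increases in B).
    for _ in range(iters):
        mid = (lo + hi) / 2
        if delta_of(mid, eps) <= delta:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative usage: i.i.d. aMGM instance with per-entry noise scale S2(f) / B.
eps, delta, S2 = 8.0, 1e-5, 1.0
B = calibrate_B(eps, delta)
sigma = S2 / B
print(B, sigma, delta_of(B, eps))
```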