How Does Diffusion Influence Pretrained Language Models on Out-of-Distribution Data?

Huazheng Wang Daixuan Cheng Haifeng Sun Corresponding Author. Email: [email protected]. Jingyu Wang Qi Qi Jianxin Liao Jing Wang Corresponding Author. Email: [email protected]. Cong Liu State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications China Mobile

Abstract

Transformer-based pretrained language models (PLMs) have achieved great success in modern NLP. An important advantage of PLMs is good out-of-distribution (OOD) robustness. Recently, diffusion models have attracted a lot of work to apply diffusion to PLMs. It remains under-explored how diffusion influences PLMs on OOD data. The core of diffusion models is a forward diffusion process which gradually applies Gaussian noise to inputs, and a reverse denoising process which removes noise. The noised input reconstruction is a fundamental ability of diffusion models. We directly analyze OOD robustness by measuring the reconstruction loss, including testing the abilities to reconstruct OOD data, and to detect OOD samples. Experiments are conducted by analyzing different training parameters and data statistical features on eight datasets. It shows that finetuning PLMs with diffusion degrades the reconstruction ability on OOD data. The comparison also shows that diffusion models can effectively detect OOD samples, achieving state-of-the-art performance in most of the datasets with an absolute accuracy improvement up to 18%. These results indicate that diffusion reduces OOD robustness of PLMs.

1 Introduction

Recently, diffusion models [13] have shown success in vision [27, 25] and text generation [19]. These models define a forward diffusion process which gradually applies Gaussian noise to real data with a Markov chain. Then they are trained to learn the reverse denoising process that inverts the forward process by incrementally removing noise from the noised inputs. Recent works show that transformer-based language models, whether pretrained or finetuned by diffusion, are capable of generating high-quality samples. This can be achieved by applying noise on discrete token space [9, 2] or continuous latent space [19, 29, 7].

Since the training distribution and the test distribution are not usually identical, the out-of-distribution (OOD) robustness is an important factor for pretrained language models (PLMs). Transformer-based PLMs have shown better robustness than previous language models [12]. It remains under-explored how diffusion influences PLMs. For this purpose, we systematically study the OOD robustness from the abilities of reconstructing noised OOD inputs and detecting OOD samples¹¹1Code is available at https://github.com/MaybeLizzy/Diffusion_OOD_Robustness.

Reconstruction Ability. The essence of diffusion models is the diffusion and denoising process. Whether the models can reconstruct noised inputs is a fundamental problem. A direct way to analyze OOD robustness is measuring the reconstruction loss. We use diffusion to finetune PLMs on in-domain (ID) data following [19], then measure the reconstruction loss on OOD data. If the models are robust, the reconstruction loss should not vary significantly from ID test data to OOD data. Specifically, diffusion models are sensitive to many issues, such as denoising step, noise ratio, and input size [13]. We conduct an in-depth analysis from two perspectives, namely training parameters and data statistical features. Experiments show that finetuning PLMs by diffusion leads to poor OOD reconstruction ability.

OOD Detection. In order to investigate to which extent diffusion influences PLMs on OOD data, we use diffusion models with PLMs to do OOD sample detection and perform a numerical comparison, by proposing a diffusion-based OOD detection method. It reconstructs noised inputs and measures the average token reconstruction loss with a threshold to detect OOD samples. Comparing with different widely-used baselines, diffusion models with PLMs achieve state-of-the-art performance on most datasets with an absolute accuracy improvement up to 18%. The improvement on OOD detection, on the other hand, demonstrates the reduction in the robustness of diffusion models with PLMs .

To summarize, our work reveals that finetuning PLMs with diffusion reduces the OOD robustness. The extent of the impact is related to factors including training step, model size, data diversity, and sentence length (§ 3). We also provide adapted solutions to improve the robustness. Based on the robustness degradation, we propose a diffusion-based OOD detectors which provides high-accuracy detections by simply computing the reconstruction loss (§ 4).

2 How We Test Reconstruction Ability

2.1 Datasets

Following [34], we use four classification-task datasets as ID data, including SST2 [28], IMDB [22], 20NG [17] and TREC-10 [20]. Any pair of the above datasets can be regarded as OOD to each other, except for IMDB and SST2 which belong to the same task category. Besides, we also select four additional datasets as the OOD datasets, including RTE [4], MNLI [31], WMT16 [3] and Multi30K [6]. We take the test splits in those OOD datasets for testing and report the average sequence lengths of the test splits. The statistics of the datasets are shown in Table 1.

2.2 Methods

A diffusion model [13] is a latent variable model that builds a transformation from Gaussian distribution to data distribution through a multi-step denoising process. Given a data distribution $\boldsymbol{x}_{0}\sim q(\boldsymbol{x}_{0})$ , at timestep $t$ , $\boldsymbol{x}_{t}$ is produced by a diffusion process which satisfies a Markov chain according to a variance schedule $\beta_{1},\beta_{2},...,\beta_{T}$ $\in(0,1)$ ,

q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_{0}):=\prod\limits_{t=1}^{T}q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1}),

(1)

q(\boldsymbol{x}_{t}|\boldsymbol{x}_{0})=N(\boldsymbol{x}_{t};\sqrt{\overline{\alpha}_{t}}\boldsymbol{x}_{0},(1-\overline{\alpha}_{t})\boldsymbol{I}),

(2)

where $0\leq t\leq T$ , $\alpha_{t}:=1-\beta_{t}$ , $\overline{\alpha}_{t}:=\prod_{s=1}^{t}\alpha_{s}$ , $T$ is the max iteration step. A closed form of $\boldsymbol{x}_{t}$ can be calculated for any arbitrary $t\geq 1$

\boldsymbol{x}_{t}=\sqrt{\overline{\alpha}_{t}}\boldsymbol{x}_{0}+\sqrt{1-\overline{\alpha_{t}}}\epsilon,

(3)

where $\epsilon\in N(0,1)$ . When the data is fully noised, $\boldsymbol{x}_{T}$ is close to an isotropic Gaussian, $\boldsymbol{x}_{T}\sim N(0,1)$ . Note that the predefined forward process $q$ contains no trainable parameters.

In the denoising process, given $\boldsymbol{x}_{t}$ and $t$ , the model is trained to reverse the diffusion process iteratively and to reconstruct data. Each denoising transition $\boldsymbol{x}_{t}\rightarrow\boldsymbol{x}_{t-1}$ is parametrized by

p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t})=N(\boldsymbol{x}_{t-1}|\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{t},t),\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_{t},t)),

(4)

where $\boldsymbol{\mu}_{\theta}$ and $\boldsymbol{\Sigma}_{\theta}$ are predicted by a transformer-based PLMs. Following [19], instead of predicting the mean posterior $\mu_{\theta}(\boldsymbol{x}_{t},t)$ , we train a neural network to predict $\boldsymbol{x}_{0}$ in every term and simplify the loss to a sum of mean-squared errors between the ground truth data $\boldsymbol{x}_{0}$ and its estimates $\hat{\boldsymbol{x}_{0}}$

L_{d}(\theta)=\mathbb{E}_{t,\boldsymbol{x}_{0}}\left[\|\boldsymbol{x}_{0}-\hat{\boldsymbol{x}_{0}}(\boldsymbol{x}_{t},t,\theta)\|^{2}\right].

(5)

To apply diffusion on transformer-based PLMs, we define the diffusion process on the continuous embedding space. Considering a sequence $z=\{w_{1},w_{2},...,w_{n}\}$ , where each token $w_{i}\in\mathbb{R}^{v}$ has an associated embedding $e_{i}\in\mathbb{R}^{d}$ , the discrete-to-continuous step is then defined as $q_{\phi}(\boldsymbol{x}_{0}|\boldsymbol{w})=N(\boldsymbol{E}\boldsymbol{w},\sigma_{0}^{2}\boldsymbol{I})$ , where $d$ is the dimension of input space, $\boldsymbol{E}$ is a matrix of all embeddings and $\sigma_{0}$ is a constant scale factor with a similar order of magnitude as $\beta_{1}$ . To map the predicted vector back to tokens, we define a reverse continuous-to-discrete step $p_{\theta}(\boldsymbol{w}|\boldsymbol{x}_{0})=\prod_{i=1}^{n}p_{\theta}(\boldsymbol{w}_{i}|\boldsymbol{x}_{i})$ , where $p_{\theta}(\boldsymbol{w}_{i}|\boldsymbol{x}_{i})$ is the softmax probability of token $i$ with logit $\boldsymbol{x}_{i}$ . A simple cross-entropy loss is then added to maximise $p_{\theta}(\boldsymbol{w}|\boldsymbol{x}_{0})$

L_{c}(\theta)=\mathbb{E}_{\boldsymbol{w},\boldsymbol{x}_{0}}\left[-logp_{\theta}(\boldsymbol{w}|\boldsymbol{x}_{0})\right].

(6)

The reconstruction loss function is then defined by

L_{recon}=L_{d}+L_{c}.

(7)

Dataset	Train	Dev	Test	Avg len
SST2	67349	872	1821	25
IMDB	22500	2500	25000	295
TREC-10	4907	545	500	10
20NG	15056	1876	1896	865
MNLI	-	-	19643	38
RTE	-	-	3000	65
WMT16	-	-	2999	25
Multi30K	-	-	2532	15

Table 1: Dataset Statistics.

2.3 Experimental Setups

We train the diffusion model on the training splits of the four aforementioned ID datasets respectively and use validation splits for evaluation to avoid overfitting. When it comes to inference, the other test splits of the remaining datasets are treated as OOD data. We implement the model upon pretrained RoBERTa-Large [32]. The learning rate is set to be $5e-5$ with a linear decay rate towards 0. We set $\beta$ to constants as [19] and [13] did. It increases linearly from $1e-4$ to $0.02$ , standing for the amount of noise added at diffusion step $t$ . And the batch size is 16. Each model is trained for $s=80k$ steps, optimized with Adam [15]. $T$ is set to be 1000. Reconstruction loss is calculated as the average loss of each word in the whole sentence.

3 Reconstruction ability of Diffusion Models with PLMs

3.1 Impact of Model Hyper-parameters and Structures

OOD data is more sensitive to higher noise level. In the process of diffusion, the added noise level, i.e. the diffusion step $t$ , is pivotal in restoring noised inputs. However, under which $t$ the model starts to produce dissimilar outputs of ID and OOD data has not yet been studied. Hence, in this experiment, we explore how reconstruction loss varies with the increase of $t$ .

Refer to caption — Figure 1: The reconstruction loss on different OOD datasets by increasing diffusion step $t$ .

As shown in Fig. 1, the reconstruction loss of both ID and OOD data keep rising as $t$ increases. More specifically, the loss of OOD data rises faster than ID data, indicating that OOD data is more sensitive to higher noise level. The loss gap between ID/OOD is also increasing with $t$ and achieves the maximum when $t=700\sim 800$ , where model outputs start to look dissimilar from the inputs. For instance, only the high-frequency word “what” can be correctly restored in ID dataset when $t=800$ (Table 2). As for OOD data, even the common words like “it”, “just” and “as” are wrongly restored. The reason is that highly-noised inputs reserve very little information from the original sentences and the model begins to output unconditioned samples more than doing reconstructions [8], which results in high reconstruction loss. It shows that diffusion models with PLMs have worse OOD robustness when $t$ is large.

[Uncaptioned image] — Table 2: Examples of the model outputs on ID (TREC-10) and OOD dataset (SST2) under different noise level $t$ . The original sentences are in the leftmost column. Words painted yellow are correctly restored.

Models with more training steps are more resilience to higher noise level. The number of training steps $s$ is another important factor influencing the sensitivity of the model to noise. We explore the model performance on ID/OOD data under different $t$ with the increase of $s$ .

As noted in Fig.2, when the added noise level is relative small ( $t=500$ ), reconstruction loss of ID and OOD data reaches the minimum in small training steps ( $s=20k\sim 40k$ ) and then starts to rise and fluctuate. When tested on higher noise level ( $t=700$ ), the reconstruction loss of ID data basically declines and then remains stable. However, reconstruction loss of OOD data goes up with the increase of $s$ . Such different behaviors verify that fully-trained models are more robust to higher noise level on ID data but the ability to restore OOD data is greatly degraded.

To verify the generation ability of fully-trained models, we test our models on SST2 dataset with different training steps and compare the score with MUCOLA-DISC [16]. The results are competitive with MUCOLA-DISC when reconstruction loss is relatively small ( $s=160k$ ). This suggests that training models fully with reconstruction loss can guarantee good generation ability.

Method	Perplexity	Dist-1	Dist-2	Dist-3	Reconstruc- tion loss
GPT-2	38.60	0.64	0.90	0.88	-
MUCOLA-DISC	27.90	0.50	0.81	0.82	-
Diffusion, s=20k	10.66	0.42	0.56	0.69	3.5726
Diffusion, s=80k	30.21	0.62	0.78	0.81	1.8430
Diffusion, s=160k	33.76	0.63	0.80	0.83	1.4710

Table 3: Scores of generation tasks on SST2 dataset compared with [16].

Larger models are not good at reconstruction and have bad OOD robustness. It is proved that RoBERTa exhibits greater robustness than BERT on OOD data since it is pretrained on more diverse corpora [12]. To test this hypothesis, we also evaluate the reconstruction ability of diffusion models with more PLMs.

As shown in Fig.3, the reconstruction loss of ID and OOD data when using RoBERTa is up to 2.8 times higher than that using BERT. Besides, the loss of OOD data is up to 4.2 times higher than that of ID data when using BERT, while 4.7 times when using RoBERTa. It shows that diffusion model with RoBERTa presents a greater gap of reconstruction loss between ID/OOD data and have worse OOD robustness than BERT.

Besides, diffusion models with BERT/RoBERTa-Base have lower reconstruction loss than BERT/RoBERTa-Large. The way in which the diffusion process adds noise determines this result. Since the latent space dimension of the base model is lower than that of the large model, the influence of the added noise is smaller and the model can make better reconstructions conditioned on the remaining inputs.

3.2 Impact of Data

Finetuning on more diverse data improves reconstruction quality, but not always. Since more diverse data improves OOD generalization of PLMs [12], we question if this assumption applies to diffusion as well. To test it, we compare the OOD reconstruction performances when training with different ID datasets. We set a small $t=500$ in this experiment where the performance gap between different ID datasets is the most obvious.

As shown in Table 4, compared to TREC-10, the model finetuned on SST2 dataset has a lower OOD reconstruction loss and a greater ability to restore the unfamiliar words that do not appear in ID dataset. Since SST2 is a large dataset with more diverse data and covers 72.5% OOD tokens, it suggests that finetuning on more diverse ID data improves OOD reconstruction performance [14].

Dataset	TREC-10	SST2	IMDB	20NG
occur & correct	1524	2497	3464	3024
occur & wrong	749	20	305	1169
not occur & correct	4	631	23	16
Avg loss	1.682	0.155	0.75	1.328
vocab size	9277	14409	41177	44944
overlap	52.1%	72.5%	99.2%	97.9%

Table 4: Statistics of the reconstruction results on OOD dataset (Multi30K).“not occur & correct” stands for the number of tokens that do not occur in ID dataset but are correctly restored. The vocabulary size of ID dataset and the number of overlapped OOD tokens are also reported.

However, for the datasets IMDB and 20NG, although the value of “overlap” reaches 99.2% and 97.9% respectively (Table. 4), models finetuned on these two datasets do not have better reconstruction performance. By carefully investigating which aspect of data causes this anomalies, we compare the statistical characteristics of these data and observe that the average sentence length of IMDB and 20NG is 10 $\sim$ 40 times larger than that of SST2 and TREC-10. So we speculate that the OOD performance of diffusion model is also related to sentence length.

Diffusion models with PLMs finetuned on long sentences are sensitive to length. To further test the hypothesis of length dependence, we explore the influence of sentence length on reconstruction quality.

As shown in Fig.4, when the model is finetuned on short sentences (TREC-10), the reconstruction loss of long sentence is overall higher than that of short sentence. The reason is that long sentences tend to contain more words not seen during training.

However, when the model is finetuned on long sentences (IMDB), the reconstruction loss of shorter sentence is at most 155 times higher than that of long sentence. Even when tested on SST2 dataset which is not considered as OOD dataset from IMDB, the model fails to produce low-loss reconstructions on short sentences as well. We call this phenomenon “Length Bias”, meaning that diffusion models with PLMs finetuned on long sentences are more sensitive to sentence length.

We give more examples to illustrate the “length bias” problem using 20NG as ID dataset which has an average sentence length of 865. As shown in Table 5, sentence from ID dataset can be perfectly restored with an average loss of 0.0008. It can also make near-perfect reconstructions on OOD dataset with long sentences (IMDB). However, when tested on short OOD sentences, even the common word “in” and “the” are not reconstructed correctly. Besides, the loss of long OOD sentence is up to 27 times higher than that of short OOD sentences, indicating serious “Length Bias” problem.

ID dataset				OOD dataset
`SST2`	`IMDB`	`TREC-10`	`20NG`	`MNLI`	`RTE`	`WMT16`	`Multi30K`
Cosine^∗	-	97.0/19.4	93.2/55.9	94.6/31.4	98.1/9.0	95.9/20.2	98.6/6.0
MLM	-	96.0/32.1	100/0	99.3/0.1	99.8/0.1	98.2/12.9	68.9/76.2
Hendrycks et al. [11]^∗	-	91.8/61.3	93.6/52.1	84.6/68.4	89.2/59.7	84.0/69.4	90.2/57.3
Liu et al. [21]^∗	-	91.5/63.1	93.4/52.5	83.6/68.7	87.4/62.4	82.4/70.6	88.1/61.7
Lee et al. [18]^∗	-	97.8/13.2	94.9/39.5	95.1/27.0	98.4/8.0	96.4/17.2	98.8/5.5
Zhou et al. [34]^∗	-	99.5/2.4	100/0	99.8/0.4	100/0	99.9/0.1	100/0
\hdashlineDiffusion	-	100/0	100/0	99.6/1.5	99.9/0.4	99.4/2.2	79.5/70.4
Diffusion+Maha	-	100/0	100/0	99.8/0.5	100/0	99.7/1.1	100/0
`IMDB`	`SST2`	`TREC-10`	`20NG`	`MNLI`	`RTE`	`WMT16`	`Multi30K`
Cosine^∗	-	99.5/1.5	99.6/0.6	99.0/4.8	99.5/1.9	99.1/3.8	99.8/0.2
MLM	-	99.7/3.1	100/0	87.8/36.1	90.0/33.8	86.8/36.4	99.7/0.1
Hendrycks et al. [11]^∗	-	94.9/37.4	96.0/28.4	93.1/51.4	93.9/49.9	93.4/50.9	96.4/25.7
Liu et al. [21]^∗	-	94.0/57.7	95.6/32.1	92.4/55.4	93.3/54.7	92.7/57.6	95.6/34.3
Lee et al. [18]^∗	-	100/0	99.8/0.2	99.5/2.2	99.8/0.7	99.7/1.2	99.9/0
Zhou et al. [34]^∗	-	100/0	100/0	100/0.1	100/0	100/0	100/0
\hdashlineDiffusion	-	100/0	100/0	99.4/2.7	98.3/8.9	100/0	100/0
Diffusion+Maha	-	100/0	100/0	99.2/0.5	99.4/3.3	100/0	100/0
`TREC-10`	`SST2`	`IMDB`	`20NG`	`MNLI`	`RTE`	`WMT16`	`Multi30K`
Cosine^∗	97.9/6.4	99.5/0.3	99.6/1.0	98.8/4.2	99.4/1.4	99.3/2.2	99.6/0.2
MLM	97.5/10.4	100/0	100/0	99.1/6.2	99.3/5.8	97.0/10.0	96.1/20.0
Hendrycks et al. [11]^∗	97.1/14.5	98.9/2.9	98.2/7.2	97.0/13.6	98.6/5.2	97.9/8.5	99.1/1.3
Liu et al. [21]^∗	94.8/28.5	98.9/4.5	99.0/6.6	97.3/15.5	98.8/5.5	98.2/10.2	99.2/1.8
Lee et al. [18]^∗	97.4/12.0	99.5/0.2	99.5/0.3	98.9/3.2	99.5/0.8	99.4/1.8	99.7/0.3
Zhou et al. [34]^∗	98.4/1.6	99.6/0	99.8/0	99.2/0.7	99.6/0.1	99.4/0.5	99.5/0
\hdashlineDiffusion	98.7/5.8	100/0	100/0	99.5/2.0	99.8/0.8	98.9/5.0	97.6/9.2
Diffusion+Maha	99.7/0.8	100/0	100/0	99.7/1.4	100/0.1	99.7/2.4	99.6/1.2
`20NG`	`SST2`	`IMDB`	`TREC-10`	`MNLI`	`RTE`	`WMT16`	`Multi30K`
Cosine^∗	99.7/0.9	98.6/6.4	98.7/8.3	97.5/13.9	95.2/21.9	96.8/16.4	98.3/7.0
MLM	88.0/80.0	94.9/23.1	82.0/96.5	89.5/77.0	93.8/32.0	90.0/77.0	89.3/77.7
Hendrycks et al. [11]^∗	98.6/9.0	95.9/25.3	95.1/35.0	94.1/36.0	90.3/49.0	92.7/40.5	95.7/18.9
Liu et al. [21]^∗	99.6/2.3	97.8/10.5	97.6/14.9	96.1/20.2	92.8/30.1	95.0/23.1	97.0/9.2
Lee et al. [18]^∗	99.4/0.3	98.9/4.4	98.9/1.3	98.1/10.1	96.5/17.1	97.8/11.9	98.7/5.9
Zhou et al. [34]^∗	99.5/1.2	99.0/4.7	99.6/1.4	98.4/9.6	98.2/11.1	98.5/7.5	99.1/6.9
\hdashlineDiffusion	100/0	99.9/0.4	100/0	99.9/0.1	100/0.0	99.8/0.1	100/0
Diffusion+Maha	100/0	100/0	100/0	100/0	100/0.0	100/0	100/0

Table 6: The AUROC

\uparrow

/ FAR95

\downarrow

accuracy on different datasets when keeping

t=700

s=80K

and

\lambda=0.99

. All data is averaged over ten times.

\ast

denotes that results are taken from [34].

4 Out-of-Distribution Detection

In order to investigate to which extent diffusion influences PLMs on OOD data, in this section we present general performance of diffusion models with PLMs on OOD detection task.

4.1 Baselines

We introduce the following methods as baselines of OOD detection.

$\bullet$ Cosine Similarity serves as a scoring function to measure the similarity of input representations. It is defined by the maximum cosine similarity of $\boldsymbol{h}$ to samples of the ID validation set $\boldsymbol{h}^{(dev)}$

d(x)=\mathop{\max}\limits_{i=1}^{N}cos(\boldsymbol{h},\boldsymbol{h}_{i}^{(dev)})

(8)

$\bullet$ MLM represents continuing training the pretrained transformers on ID data with Masked Language Modeling task [5]. Then we use the corresponding loss as a threshold to detect OOD data in a same way of diffusion-based OOD detection method.

$\bullet$ Hendrycks et al. [11] propose Maximum Softmax Probability (MSP) method by using the maximum class probability of a classifier with a softmax layer. The less confident classifier is, the higher the OOD score will be. It is defined by

d(x)=1-max_{j=1}^{C}\boldsymbol{p}_{j},

(9)

where $C$ is the number of classes.

$\bullet$ Liu et al. [21] propose Energy Score method by using the softmax function as the ratio of the joint probability in $X\times Y$ to the probability in $X$ , where $X$ and $Y$ denote the training corpus and its label set. A higher $d(x)$ means lower probability density in ID dataset, implying higher OOD likelihood. It calculates the following probability density

d(x)=-log\sum\limits_{j=1}^{C}exp(\boldsymbol{w}_{j}^{T}\boldsymbol{h}),

(10)

where $\boldsymbol{w}_{j}$ is the weight of class $j$ in the softmax layer and $\boldsymbol{h}$ is the input of the softmax layer.

$\bullet$ Lee et al. [18] propose to use Mahalanobis Distance [26] to determine the closeness of a sample to a set of samples belonging to the class $c$ by modeling the ID features with class-conditional multivariate Gaussian distributions

d(x)=\mathop{\min}\limits_{c=1}^{C}(\boldsymbol{h(x)}-\boldsymbol{\mu}_{c})^{T}\boldsymbol{\Sigma}^{+}(\boldsymbol{h(x)}-\boldsymbol{\mu}_{c}),

(11)

where $\boldsymbol{h(x)}$ is the vector representation of the sample $x$ , $\boldsymbol{\mu}_{c}$ is the mean vector of the training set of class $c$ , $\boldsymbol{\Sigma}$ is a shared covariance matrix and $\boldsymbol{\Sigma}^{+}$ is the pseudo-inverse of $\boldsymbol{\Sigma}$ defined by

\boldsymbol{\mu}_{c}=\frac{1}{N_{c}}\sum_{x\in D_{in}^{c}}\boldsymbol{h(x)},

(12)

\boldsymbol{\Sigma}=\frac{1}{N}\sum_{c\in C}\sum_{x\in D_{in}^{c}}(\boldsymbol{h(x)}-\boldsymbol{\mu}_{c})(\boldsymbol{h(x)}-\boldsymbol{\mu}_{c})^{T},

(13)

where $N$ is the total number of the samples and $N_{c}$ is the number of samples belonging to class $c$ .

$\bullet$ Zhou et al. [34] propose to finetune the Transformers with a contrastive loss. By contrasting samples to those from different ID classes, it improves the compactness of representations. The highest accuracy occurs when combining it with Mahalanobis Distance.

4.2 Method

We propose a diffusion-based detector which uses a score function $f(x)=L_{recon}(x)$ to measure the similarity of any sample $x$ with ID dataset by fixing a threshold $\gamma$ . If $f(x)\leq\gamma$ , we classify $x$ as ID data. Otherwise, $x$ is regarded as OOD data. Formally, we distinguish OOD data using the decision function:

\begin{split}g(x,\gamma)=\left\{\begin{array}[]{ll}0,&f(x)\leq\gamma\\ 1,&f(x)\textgreater\gamma\\ \end{array}\right.\end{split}

(14)

Since models learn to pay more attentions on token level after finetuned with diffusion, we further leverage Mahalanobis Distance method to add sentence level techniques. Following [18], we define Mahalanobis Distance without using the class labels of the training set, i.e, treating the whole set as one class. Eq. 11 changes to

d(x)=(\boldsymbol{h(x)}-\boldsymbol{\mu}_{s})^{T}\boldsymbol{\Sigma}^{+}(\boldsymbol{h(x)}-\boldsymbol{\mu}_{s}).

(15)

Hence, we define our score function with a mixed of loss and Mahalanobis Distance score.

f(x)=\lambda\times L_{recon}(x)+(1-\lambda)\times d(x)

(16)

where $\lambda$ is a hyperparameter treated as a constant with regards to optimization.

We use “Diffusion” to represent only using reconstruction loss as score function while “Diffusion+Maha” represents using both loss and Mahalanobis Distance.

4.3 Metrics

We apply two metrics that are commonly used in evaluating OOD detection performance [11] and both of them are threshold-independent.

$\bullet$ AUROC: AUROC is the area under the Receiver Operating Characteristic curve. Higher AUROC value indicates better OOD detection performance. A random guessing detector model has an AUROC of 50%.

$\bullet$ FAR95: FAR95 is the probability that a negative example (OOD) to be mistakenly classified as positive (ID) when the true positive rate is 95%. In this case, a lower value indicates better performance.

4.4 Results And Analysis

Overall. As shown in Table 6, the AUROC accuracies of “Diffusion” on most datasets are above 99% and become SOTA. It proves the effectiveness of diffusion models with PLMs on OOD detection task. Specifically, the AUROC accuracy of “Diffusion” is up to 18% higher than that of “MLM”, indicating that diffusion models with PLMs degrade the generalization on OOD data. One exception is that the accuracy when trained on SST2 and tested on Multi30K has a significant decline. We suspect the reason is that SST2 contains more diverse data and has a similar sentence length with Multi30K. So the reconstruction loss between ID and OOD data are too close to distinguish OOD data from ID data.

Impact of $\lambda$ . We test the impact of $\lambda$ on TREC-10 dataset while keeping $t=700$ and $s=80K$ . As shown in Table 7, the accuracy reaches a peak at $\lambda=0.99$ and it declines by up to 2.27% as the decrease of $\lambda$ . It indicates that the reconstruction loss plays a key role in OOD detection.

Dataset	Multi30K	SST2	WMT16	RTE
$\lambda=0.99$	98.32	98.57	99.33	99.53
$\lambda=0.9$	97.57	96.83	98.73	98.40
$\lambda=0.7$	97.45	96.47	98.64	98.25
$\lambda=0.5$	97.43	96.40	98.63	98.22
$\lambda=0.3$	97.41	96.37	98.24	98.21
$\lambda=0.1$	97.41	96.30	98.24	98.20

Table 7: AUROC accuracy under different

\lambda

`TREC-10`	`SST2`	`IMDB`	`20NG`	`MNLI`	`RTE`	`WMT16`	`Multi30K`
Diffusion (T5-large)	98.9/4.6	100/0.0	100/0.0	99.6/1.4	99.9/0.4	99.3/2.6	98.2/6.6
Diffusion (Bart)	99.3/3.2	100/0.0	100/0.0	100/0.0	100/0.0	99.8/0.6	99.4/1.8

Table 8: The AUROC

\uparrow

/ FAR95

\downarrow

accuracy of T5-Large and BART finetuned with diffusion.

Impact of $\beta$ . We test the impact of $\beta$ on TREC-10 dataset while keeping $t=700$ and $s=80K$ . As shown in Table 9, the reconstruction loss of ID and OOD data decreases as $\beta$ gets smaller. But the gap between ID and OOD remains obvious. The model achieves the best results when $\beta$ increases linearly from $1e-4$ to $0.02$ .

$\beta$		ID	Multi30K	RTE	WMT16
1e-3 to 0.2	loss	6.32	8.80	10.88	10.68
1e-3 to 0.2	AUROC	-	95.3	99.4	98.4
1e-4 to 2e-2	loss	4.07	7.12	9.55	8.99
1e-4 to 2e-2	AUROC	-	96.9	99.7	98.7
1e-5 to 2e-3	loss	0.11	0.55	1.18	1.30
1e-5 to 2e-3	AUROC	-	90.6	97.2	95.8

Table 9: Reconstruction loss and AUROC accuracy of ID (TREC-10) and OOD dataset under different scopes of

\beta

Visualization. By applying Gaussian Mixture Models (GMMs) to the learned representations, we draw the domain clusters in a 2D visualization as shown in Fig.5. Models are trained on TREC-10 as ID dataset. After finetuning, ID data is scattered throughout the output space while OOD data tends to aggregate into a cluster. Comparing with MLM, diffusion makes OOD data cluster in a more compact way like a Gauss. ID data is more scattered since all sentences can be perfectly restored. It further proves that diffusion models with PLMs are better OOD detectors.

Few-shot scenarios. In reality, many applications have very limited ID data. Hence, we test the robustness of diffusion-based detector finetuned on low-resource ID data. As shown in Fig.6, the AUROC accuracy remains high even when the number of training data is 10. It indicates that a few number of ID data is enough for diffusion-based detector to detect OOD data.

More PLMs. We further explore the OOD detection performance of T5-Large and BART finetuned with diffusion. TREC-10 is used as the ID dataset while keeping $s=80k$ and $t=800$ . Results in Table 8 indicate that the OOD detection performances of T5-Large and BART outperform RoBERTa and become the new state-of-the-art, further illustrating that larger models finetuned with diffusion have worse OOD robustness.

We conclude that diffusion models with PLMs are effective on OOD detection, even in few-shot scenarios, though it shows a limitation when two datasets have high lexical coverage and similar sentence length.

5 Related Work

In real world scenarios, training and test examples are not always identical distributed, which makes models struggle to maintain the accuracy on out-of-distribution data. Hence, exploring whether the model is robust on OOD data is essential to build a safe and reliable system.

How to test OOD robustness is a complex and multi-faceted problem. Evaluating the accuracy on unseen in-distribution data [23] is a widely-used approach to study OOD robustness. For instance, Angeli et al. [1] explore the robustness of CNNs for text classification under distribution shifts. It conducts two-phase learning by training and testing the model with different classes of ID data. Another line of work is evaluating OOD robustness by corrupting or perturbing inputs [10]. For instance, Wang et al. [30] explore the robustness of pretrained BNNs by fixing complex and nonlinear corruptions. Moradi et al. [24] investigate the ability of language models in handling different types of character-level and word-level perturbations. Zhang et al. [33] test how well the model learns to identify an unseen textual perturbation.

Different from prior works which conduct perturbation on discrete token space, we mainly focus on how diffusion in continuous space influences PLMs on OOD robustness. Since reconstructing a noised input is a self-supervised task and is a fundamental ability to measure diffusion model’s performance, it is suitable for testing on data in various distributions, which serves naturally as a method to measure the model’s OOD robustness. The reconstruction loss can not be directly compared with that of MLM since MLM can be seen as a one-step diffusion model on discrete state spaces [2] while we conduct diffusion on continuous latent space. Hence, we additionally use OOD detection task to further evaluate the OOD robustness [8] compared with other finetuning methods.

6 Conclusion

Understanding how diffusion influences pretrained language models on OOD data is crucial in NLP. In this work, we investigate the effect of diffusion models with PLMs on OOD robustness. The results indicate that diffusion reduces PLMs robustness, and provide a set of findings. Diffusion models with PLMs (1) fail to make perfect reconstructions on partially-noised OOD examples; (2) have serious length dependence when finetuned with long sentences; (3) have worse reconstruction ability on OOD data when trained with larger models. These findings have the following implications for other research:

Suggestions on applying diffusion. Correctly adjusting the training process can alleviate robustness issues. Selecting smaller models, finetuning on more diverse data with a variety of sentence length is favorable for OOD robustness.

Diffusion is a new tool for OOD detection. Diffusion models with PLMs achieve the state-of-the-art performance in most datasets, which is proved to be an effective tool. It provides a new idea to complete OOD sensitive work, such as outlier data detection, domain transfer, etc.

Ethical Statement

In real-world scenarios, models may face heterogeneous samples which has severe semantic shifts from its training distributions. The proposed work seeks to develop a comprehensive understanding of how diffusion influences PLMs on out-of-distribution data. We believe that this study brings intellectual benefits to reliable application of diffusion models in the field of NLP. And it potentially has broader impacts to tasks of other areas. There is not any direct societal consequence and all experiments are conducted on open datasets in this work.

Limitations

As discussed in § 3.1, the model size affects the robustness. Due to the constraints of computing power, it is difficult for us to incorporate some popular models with larger scales. It should also be noted that our study is limited to English-only datasets. The pretrained language models we investigate are not in multilingual settings either. While results using automatic metrics give a fair idea of task performance, we would like to conduct a human evaluation in the near future.

\ack

This work was supported in part by the National Key R&D Program of China 2020YFB1807800, in part by the National Natural Science Foundation of China under Grants (62201072, 62171057, 62071067), in part by the Ministry of Education and China Mobile Joint Fund (MCM20200202), Beijing University of Posts and Telecommunications-China Mobile Research Institute Joint Innovation Center.

References

[1] Kevin De Angeli, Shang Gao, Ioana Danciu, Eric B. Durbin, Xiao-Cheng Wu, Antoinette Stroup, Jennifer A. Doherty, Stephen M. Schwartz, Charles Wiggins, Mark Damesyn, Linda Coyle, Lynne Penberthy, Georgia D. Tourassi, and Hong-Jun Yoon, ‘Class imbalance in out-of-distribution datasets: Improving the robustness of the textcnn for the classification of rare cancer types’, J. Biomed. Informatics, 125, 103957, (2022).
[2] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg, ‘Structured denoising diffusion models in discrete state-spaces’, in NeurIPS 2021, pp. 17981–17993, (2021).
[3] Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno-Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana L. Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri, ‘Findings of the 2016 conference on machine translation’, in Proceedings of the First Conference on Machine Translation, WMT 2016, colocated with ACL 2016, August 11-12, Berlin, Germany, pp. 131–198, (2016).
[4] Ido Dagan, Oren Glickman, and Bernardo Magnini, ‘The PASCAL recognising textual entailment challenge’, in Machine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers, volume 3944 of Lecture Notes in Computer Science, pp. 177–190, (2005).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, ‘BERT: pre-training of deep bidirectional transformers for language understanding’, in NAACL-HLT 2019, Volume 1 (Long and Short Papers), pp. 4171–4186, (2019).
[6] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia, ‘Multi30k: Multilingual english-german image descriptions’, arXiv preprint arXiv:1605.00459, (2016).
[7] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong, ‘Diffuseq: Sequence to sequence text generation with diffusion models’, CoRR, abs/2210.08933, (2022).
[8] Mark S. Graham, Walter H. L. Pinaya, Petru-Daniel Tudosiu, Parashkev Nachev, Sébastien Ourselin, and M. Jorge Cardoso, ‘Denoising diffusion models for out-of-distribution detection’, CoRR, abs/2211.07740, (2022).
[9] Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu, ‘Diffusionbert: Improving generative masked language models with diffusion models’, CoRR, abs/2211.15029, (2022).
[10] Dan Hendrycks and Thomas G. Dietterich, ‘Benchmarking neural network robustness to common corruptions and perturbations’, in ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, (2019).
[11] Dan Hendrycks and Kevin Gimpel, ‘A baseline for detecting misclassified and out-of-distribution examples in neural networks’, in ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, (2017).
[12] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song, ‘Pretrained transformers improve out-of-distribution robustness’, in ACL 2020, Online, July 5-10, 2020, pp. 2744–2751, (2020).
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel, ‘Denoising diffusion probabilistic models’, in NeurIPS 2020, December 6-12, 2020, virtual, (2020).
[14] Nitish Joshi and He He, ‘An investigation of the (in)effectiveness of counterfactually augmented data’, in ACL 2022, pp. 3668–3681, (2022).
[15] Diederik P. Kingma and Jimmy Ba, ‘Adam: A method for stochastic optimization’, in ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, (2015).
[16] Sachin Kumar, Biswajit Paria, and Yulia Tsvetkov, ‘Gradient-based constrained sampling from language models’, in EMNLP 2022, pp. 2251–2277, (2022).
[17] Ken Lang, ‘Newsweeder: Learning to filter netnews’, in Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, California, USA, July 9-12, 1995, pp. 331–339, (1995).
[18] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin, ‘A simple unified framework for detecting out-of-distribution samples and adversarial attacks’, in NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 7167–7177, (2018).
[19] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto, ‘Diffusion-LM improves controllable text generation’, in NeurIPS 2022, November 28 - December 9, 2022, New Orleans, (2022).
[20] Xin Li and Dan Roth, ‘Learning question classifiers’, in COLING 2002, Howard International House and Academia Sinica, Taipei, Taiwan, August 24 - September 1, 2002, (2002).
[21] Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li, ‘Energy-based out-of-distribution detection’, in NeurIPS 2020, December 6-12, 2020, virtual, (2020).
[22] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, ‘Learning word vectors for sentiment analysis’, in The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 142–150, (2011).
[23] John Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt, ‘Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization’, in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 7721–7735. PMLR, (2021).
[24] Milad Moradi and Matthias Samwald, ‘Evaluating the robustness of neural language models to input perturbations’, in EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 1558–1570, (2021).
[25] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen, ‘GLIDE: towards photorealistic image generation and editing with text-guided diffusion models’, in ICML 2022, volume 162, pp. 16784–16804, (2022).
[26] Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, and Irina Piontkovskaya, ‘Revisiting mahalanobis distance for transformer-based out-of-domain detection’, in AAAI 2021, IAAI 2021, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 13675–13682, (2021).
[27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, ‘Hierarchical text-conditional image generation with CLIP latents’, CoRR, abs/2204.06125, (2022).
[28] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts, ‘Recursive deep models for semantic compositionality over a sentiment treebank’, in EMNLP 2013, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1631–1642, (2013).
[29] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond, ‘Self-conditioned embedding diffusion for text generation’, CoRR, abs/2211.04236, (2022).
[30] Xi Wang and Laurence Aitchison, ‘Out of distribution robustness with pre-trained bayesian neural networks’, CoRR, abs/2206.12361, (2022).
[31] Adina Williams, Nikita Nangia, and Samuel R. Bowman, ‘A broad-coverage challenge corpus for sentence understanding through inference’, in NAACL-HLT 2018, Volume 1 (Long Papers), pp. 1112–1122, (2018).
[32] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush, ‘Transformers: State-of-the-art natural language processing’, in EMNLP 2020 - Demos, Online, November 16-20, 2020, pp. 38–45, (2020).
[33] Yunxiang Zhang, Liangming Pan, Samson Tan, and Min-Yen Kan, ‘Interpreting the robustness of neural NLP models to textual perturbations’, in Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 3993–4007, (2022).
[34] Wenxuan Zhou, Fangyu Liu, and Muhao Chen, ‘Contrastive out-of-distribution detection for pretrained transformers’, in EMNLP 2021, pp. 1100–1111, (2021).