
Concurrent Linguistic Error Detection (CLED) for Large Language Models

Jinhua Zhu, Javier Conde, Zhen Gao, Pedro Reviriego, Shanshan Liu and Fabrizio Lombardi This work was supported by the Natural Science Funds of China (62171313), FUN4DATE (PID2022-136684OB-C22) and the ITACA (PDC2022-133888-I00) projects funded by the Spanish Agencia Estatal de Investigacion (AEI). (Corresponding author: Zhen Gao)
Jinhua Zhu and Zhen Gao are with School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China. Email: [email protected], [email protected].
J. Conde and P. Reviriego are with ETSI de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain. Email: [email protected],[email protected].
S. Liu is with the University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China. Email: [email protected].
F. Lombardi is with Northeastern University, Dept. of ECE, Boston, MA 02115, USA. Email: [email protected].
Abstract

The wide adoption of large language models (LLMs) makes their dependability a pressing concern. Error detection is the first step to mitigating the impact of errors on a system, so efficient error detection for LLMs is an important problem. In many settings, the LLM is a black box with no access to its internal nodes, which rules out the many error detection schemes that require such access. An interesting observation is that the output of an LLM in error-free operation should be valid, normal text. Therefore, when the text is invalid or deviates significantly from normal text, an error is likely. Based on this observation, we propose Concurrent Linguistic Error Detection (CLED), a scheme that extracts linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism relies only on the outputs of the model, it can be used on LLMs whose internal nodes are not accessible. The proposed CLED scheme has been evaluated on the T5 model for news summarization and on the OPUS-MT model for translation. In both cases, the same set of linguistic features is used for error detection, illustrating the applicability of the proposed scheme beyond a specific case. The results show that CLED detects most errors at a low overhead. The use of a concurrent classifier also enables a trade-off between error detection effectiveness and the associated overhead, providing flexibility to the designer.

Index Terms:
LLMs, soft errors, concurrent error detection, T5, OPUS-MT.

I Introduction

Large language models (LLMs) are becoming a centerpiece of many applications and services, and this trend is expected to continue in the coming years [1]. From the first popular models with hundreds of millions of parameters, such as BERT [2] or T5 [3], models have quickly grown to billions and, in some cases, possibly trillions of parameters. Examples of LLMs include commercial models such as GPT [4] or Gemini [5] and open-source models such as Llama [6] or Mistral [7].

Their wide adoption makes their dependability a significant concern, while their large size makes efficient error protection a priority. For example, a common approach to detect transient errors is to run the same code twice and compare the results; for LLMs, this implies doubling the already large inference time and power dissipation. However, error detection and protection for LLMs have not been widely studied. Many works have considered the impact of errors on neural networks and their protection on different platforms, because this is a key requirement for their use in safety-critical applications [8]. The impact of errors on neural networks has been evaluated extensively, both by simulation and by radiation testing [9],[10], showing that they have some intrinsic tolerance to errors, especially when a fixed-point representation is used for the parameters and arithmetic operations. The impact of errors on neural networks implemented on FPGAs has also been considered, showing that an ensemble of networks can be used to protect against errors [11]. As transformer-based LLMs become widely used in many applications, the next step is to study the impact of soft errors and to design error detection and correction schemes to protect them. The impact of soft errors on BERT has recently been studied in [12], and the results show that even a single bit flip can, in some cases, corrupt the output of the entire model when floating-point formats are used, as is common in GPU implementations. However, to the best of the authors' knowledge, the impact of soft errors on other popular models such as T5 [3] or OPUS-MT [13] has not been studied.

The detection of errors using additional logic that operates concurrently with the main system is attractive because it can exploit features of the system for error detection, and it has been widely used in computing and signal processing systems [14],[15],[16]. Error detection using a concurrent classifier has been proposed for large-scale machine learning systems in [17] and evaluated for BERT in question answering and emotion classification, showing good detection capabilities. However, these schemes rely on having access to internal nodes of the model, which in many settings is not possible for LLMs. When LLMs are used to produce text, for example in summarization, translation, or conversational tasks, a possibility is to use the properties of the text itself to detect errors. For example, errors that produce invalid sequences of characters or abnormal patterns could be detected by extracting a few features of the generated text and comparing them with those of normal texts. The key observation is that the output of an LLM (i.e., the generated text) should have features that are determined by the rules of the language, and these can potentially be used to perform concurrent error detection or, more precisely, language-based error detection. In this paper, we pursue this idea by proposing Concurrent Linguistic Error Detection (CLED), which detects errors based on the properties of the text, and evaluating it on T5 used for summarization and on OPUS-MT used for translation. The results show that CLED enables the concurrent detection of errors in LLMs through features extracted from the generated output text, without the need to access the LLM's internal nodes. Both experiments show an accuracy exceeding 90% in error detection, with the ability to tune the error detection rate through the decision threshold, at the cost of an increase in falsely detected errors.

The rest of this paper is organized as follows. Section II covers the preliminaries, describing OPUS-MT and T5, concurrent error detection, and the error model considered. The dependability of OPUS-MT and T5 is studied in Section III with fault injection experiments that identify the critical bits and the error patterns in the generated text. Section IV presents the proposed CLED scheme and describes how it can be applied to detect errors in the generated text. CLED is evaluated in Section V in terms of error detection effectiveness and the overhead introduced over the unprotected language model. Finally, the paper ends with conclusions and future work in Section VI.

II Preliminaries

This section covers the preliminaries describing OPUS-MT and T5, the language models considered in our evaluation. Then concurrent error detection is introduced by discussing previous works, and finally, the error model considered is presented. Both language models considered are based on the Transformer architecture [18] which has been employed with variations in the majority of LLMs.

II-A OPUS-MT

As shown in Fig. 1, the OPUS-MT [13] model employs the standard Transformer structure with an encoder and a decoder, followed by a prediction layer. The encoder consists of N layers, and each layer is made of two sub-layers, Self-Attention (SA) and Feed-Forward Network (FFN). The decoder also consists of N layers; however, different from the encoder, there is a cross-attention (CA) sub-layer between the SA and the FFN in each layer of the decoder. The prediction layer consists of a linear operation and a log softmax operation. The details of the encoder and decoder are introduced next.

For Natural Language Processing (NLP) applications, the input X of the encoder is an M×W embedding matrix for the M words (or tokens) in the input text. Each row is the combination of the token embedding of a word and the position embedding, whose dimensions are both 1×W. In the SA sub-layer, the Multi-Head Self-Attention (MH-SA) performs dot-product attention for a query matrix Q, a key matrix K, and a value matrix V as

MHSA(Q, K, V) = softmax(QK^T) × V,    (1)

where Q, K and V are linear transformations of X, namely Q = XW_Q^T + B_Q, K = XW_K^T + B_K and V = XW_V^T + B_V, where W_Q, W_K, and W_V are the weight matrices and B_Q, B_K, and B_V are the bias vectors. As shown in the upper left corner of Fig. 1, the MH-SA mechanism includes H (H = W/d_h) heads, and the output of the i-th head, head_i(X), is expressed as

head_i(X) = MHSA(X(W_Q^i)^T + B_Q^i, X(W_K^i)^T + B_K^i, X(W_V^i)^T + B_V^i).    (2)

The outputs of all heads are then concatenated into an M×W matrix and merged by a W×W SA output matrix W_O as

SA(X) = Concat(head_i(X)) W_O^T + B_O,    (3)

whose dimension is still M×W. After residual addition with the input X and layer normalization, the input to the FFN sub-layer becomes Y = norm(X + SA(X)).
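As a concrete illustration, the following NumPy sketch implements Eqns (1)-(3) for one SA sub-layer. It follows the text's formulation, which omits the usual 1/sqrt(d_h) scaling inside the softmax; all array names are illustrative, not part of any model API.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Bq, Wk, Bk, Wv, Bv, Wo, Bo, H):
    """Multi-head self-attention per Eqns (1)-(3).
    X is the M x W input; all weight matrices are W x W."""
    M, W = X.shape
    dh = W // H                      # per-head dimension d_h
    Q = X @ Wq.T + Bq                # linear transformations of X
    K = X @ Wk.T + Bk
    V = X @ Wv.T + Bv
    heads = []
    for i in range(H):               # Eqn (2), one slice per head
        s = slice(i * dh, (i + 1) * dh)
        heads.append(softmax(Q[:, s] @ K[:, s].T) @ V[:, s])
    concat = np.concatenate(heads, axis=1)    # M x W concatenation
    return concat @ Wo.T + Bo                 # Eqn (3), merge with W_O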

The Feed Forward (FF) block is composed of two fully connected layers and a non-linear activation GeLU between them, so the output of the FF block is expressed as

FF(Y) = GeLU(YW_F1^T + B_F1) W_F2^T + B_F2,    (4)

where W_F1 and W_F2 are the weight matrices of the two fully connected layers and B_F1 and B_F2 are the corresponding bias vectors. Finally, after residual addition and layer normalization, the output of one encoder layer becomes Z = norm(Y + FF(Y)), which has the same dimensions as the input matrix X (M×W) and can therefore be fed as input to the next encoder layer.

The decoder generates the representation of the (n+1)-th token based on the first n predicted tokens X_n, an n×W token embedding matrix. The SA in the decoder is almost the same as in the encoder, but it applies a mask matrix to hide the information after the current token. The processing of CA is similar to that of SA; the only difference is that K and V are generated from the encoder output Z rather than from the output of SA. The FFN in the decoder is the same as in the encoder.

The OPUS-MT model includes 6 encoder layers and 6 decoder layers (N = 6), and each MH-SA includes 8 heads (H = 8). The embedding dimension is W = 512. The total number of parameters in OPUS-MT is approximately 78 million, requiring 156 million bytes of memory when the parameters are stored in half-precision floating-point format.

Figure 1: Structure of the OPUS-MT Model.

II-B T5

The second model considered in this paper is T5 [3], a larger model that is a variant of the standard Transformer. As shown in Fig. 2, T5 introduces the following four main improvements over the original Transformer.

  1. The position information is not included in the input embedding matrix but is merged into the MH-SA operation in the form of a relative position bias B_RP. Therefore, Eqn. (1) becomes

     MHSA(Q, K, V) = softmax(QK^T + B_RP) × V.    (5)

  2. The layer normalization applied after the residual addition in SA, CA, or FFN in the standard Transformer is moved to the input of those sub-layers.

  3. The linear transformations in T5 remove all bias vectors.

  4. The FF block in T5 is made of three fully connected layers, and the operation is changed to (a sketch contrasting Eqns (4) and (6) follows this list)

     FF(Y) = (GeLU(YW_F1^T) ⊙ (YW_F2^T)) W_Fo^T,    (6)

     where W_F1, W_F2 and W_Fo are the weight matrices of the three fully connected layers, respectively, and ⊙ denotes the Hadamard product.
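To make the difference between the two feed-forward variants concrete, the NumPy sketch below contrasts the standard FF of Eqn (4) with T5's gated, bias-free FF of Eqn (6); all names are illustrative.

import numpy as np
from scipy.special import erf

def gelu(x):
    # exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def ff_standard(Y, Wf1, Bf1, Wf2, Bf2):
    # Eqn (4): two fully connected layers with GeLU in between
    return gelu(Y @ Wf1.T + Bf1) @ Wf2.T + Bf2

def ff_t5(Y, Wf1, Wf2, Wfo):
    # Eqn (6): three weight matrices, no biases; '*' is the Hadamard product
    return (gelu(Y @ Wf1.T) * (Y @ Wf2.T)) @ Wfo.T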

The standard T5 model includes 12 encoder layers and 12 decoder layers (N = 12), and each MH-SA includes 12 heads (H = 12). The embedding dimension is W = 768. The total number of parameters in T5 is approximately 247 million, requiring 494 million bytes of memory in half-precision floating-point format.

Figure 2: Structure of the T5 Model.

II-C Concurrent Error Detection

In many cases, error detection can be performed concurrently with the system operation [15]. For example, in the Fast Fourier Transform (FFT), errors can be detected by checking that the energy of the inputs equals the energy of the outputs [19]. Similar schemes that exploit the relations between inputs and outputs have been proposed for other signal processing blocks [14]. For machine learning systems, the use of a concurrent classifier has recently been proposed [17]. This scheme, referred to as Concurrent Classifier Error Detection (CCED), relies on monitoring a number of internal nodes of the machine learning model and feeding their values to a classifier that detects errors, as illustrated in Fig. 3. The scheme has been evaluated for BERT used for emotion classification and for locating the answer to a question in a given text. The nodes monitored in the CCED scheme of [17] were the softmax values, which are used to select the emotion or the position of the answer in the text. The use of internal nodes is needed in systems whose outputs have no inherent structure or properties that can be used for error detection. For example, in emotion classification, the selected emotion provides little information on whether the system is in error. Instead, the distribution of the softmax values among the possible emotions can exhibit different patterns when there is an error, so that the concurrent classifier can detect it. This is the case for BERT, as shown in the evaluation results presented in [17].

Figure 3: Block diagram of Concurrent Classifier Error Detection (CCED) [17].

However, when models are used to generate text, as is common in many modern applications such as ChatGPT, error detection based on the softmax values has some intrinsic limitations. First, some of the models are closed and do not expose the softmax values. For example, OpenAI only provides the values of the top five tokens (see https://platform.openai.com/docs/api-reference/chat/create), which limits the use of CCED. A second limitation is that CCED would be significantly more complex because 1) the number of tokens is typically in the order of tens of thousands, so the concurrent classifier would no longer be simple, and 2) it must be run per token, so a single interaction with the model, for example answering a question, would require hundreds of runs of a more complex concurrent classifier. An interesting observation is that when language models are used to generate text, their outputs themselves can be used for error detection, and thus the CCED scheme can be extended to work directly on the output of the system. This makes error detection significantly simpler, reduces its complexity and overhead, and makes it applicable to models for which there is no access to internal nodes.

II-D Error Model

The objective of the error model is to capture the impact of transient soft errors on the language models. This is motivated by the importance of radiation-induced soft errors in computing components such as memories, GPUs, or FPGAs [20],[21]. Soft errors can affect sequential logic elements, typically flipping one of the stored bits, and also combinational logic elements, from which the error can propagate until a sequential element latches an incorrect value [22]. In the first case, errors can easily be simulated, for example in the parameters of the model, by flipping one of the bits. However, simulating the effect of errors on combinational logic is more complex because it requires access to the detailed design of the component, for example a GPU on which the model is run. This is typically not feasible and, if it were, it would make the results specific to the platform evaluated.

Therefore, given the practical restrictions of simulating errors on combinational logic, and to make the evaluation independent of the specific components on which the LLMs are implemented, we consider a single-bit error in one of the parameters of the LLM. This model captures the effect of soft errors on the model parameters when stored inside a computing unit, for example in a cache or a register file inside a GPU. The effects of soft errors on combinational elements are not fully captured, but to some extent we can expect them to be similar to errors on the model parameters; for floating-point representations, errors on the most significant bits of the exponent are critical [12]. Hence, the error model enables us to explore the effects of transient soft errors during the execution of an LLM on a general computing platform.

Since the errors considered in this paper are soft errors, they do not cause any permanent damage to the circuit functionality. We also assume that they occur inside the computational unit (for example, a GPU or a TPU). Therefore, once an error is detected when executing the language model, it can be corrected by running the same input text a second time, as that implies that the LLM parameters are reloaded from memory and any previous intermediate values are removed. Hence, the CLED scheme proposed in this paper can correct detected errors by re-computation. Note that, unlike standard re-computation, which reruns all input texts, CLED only runs a second time for input texts on which an error has been detected.

III Dependability Evaluation of LLMs

The dependability of T5 and OPUS-MT against single errors on the weights is evaluated separately in this section. For each model, we first evaluate the impact of errors on different bit positions of the weights to identify the Critical Bits (CBs) that have a large impact on the model performance, and then analyze how bit flips on the different CBs affect the model's outputs. In this paper, all weights are represented in single-precision floating point (float32). A float32 value consists of a sign bit, 8 exponent bits, and 23 fraction bits, and its value is given by

value = (-1)^sign × 2^(exponent-127) × (1 + fraction/2^23).    (7)
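As a minimal sketch of this representation and of the error model used later, the following Python function flips one bit of a float32 value; bit indices are assumed to run from 0 (LSB of the fraction) to 31 (sign), so bit 30 is the 1st exponent bit.

import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 per Eqn (7): bit 31 = sign,
    bits 30..23 = exponent (bit 30 is the 1st exponent bit),
    bits 22..0 = fraction."""
    (u,) = struct.unpack('<I', struct.pack('<f', value))
    (flipped,) = struct.unpack('<f', struct.pack('<I', u ^ (1 << bit)))
    return flipped

# A 0-to-1 flip of the 1st exponent bit scales a small weight by 2^128
print(flip_bit(0.01, 30))   # roughly 3.4e+36, an abnormally large value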

III-A Dependability Evaluation for T5

1) Dataset and evaluation metric

To evaluate the performance of T5 summarization, the test set of CNN Daily Mail (further details on the datasets used for T5 and OPUS-MT are provided in the evaluation section) is used, and the Rouge1 score is applied as the performance metric, which is calculated as

Rouge1 = [ Σ_{S ∈ {RefSummaries}} Σ_{gram_1 ∈ S} Count_match(gram_1) ] / [ Σ_{S ∈ {RefSummaries}} Σ_{gram_1 ∈ S} Count(gram_1) ],    (8)

where the denominator is the total number of 1-grams in the reference summaries and the numerator is the number of 1-grams common to the reference summary and the generated summary. When there are no errors, the Rouge1 score for T5 summarization is 0.4081.
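A minimal implementation of Eqn (8) for a single reference/generated pair might look as follows (whitespace tokenization assumed; the actual evaluation presumably uses a standard Rouge package).

from collections import Counter

def rouge1(reference: str, generated: str) -> float:
    """Rouge-1 recall per Eqn (8): matched 1-grams over reference 1-grams."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    match = sum(min(count, gen[word]) for word, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0

print(rouge1("the cat sat on the mat", "the cat lay on the mat"))  # 5/6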

2) Distribution of weights in the T5 model

To study the impact of errors on different bits, we first analyze the distribution of all publicly available weights of T5, including W_Q, W_K, W_V, W_O, W_F1, W_F2 and W_Fo of the encoder and the corresponding weight matrices of the decoder. Based on our analysis, the distribution of the weights across layers and types is almost the same, and the average value of each bit of the single-precision weights is plotted in Fig. 4. From this figure, we can make three observations: (a) the 1st exponent bit is highly likely to be 0; (b) the 2nd to 6th exponent bits are highly likely to be 1; (c) the average value of the other bits is approximately 0.5. This means that the exponent bits have little variability and only the mantissa bits change frequently.
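This kind of analysis can be reproduced with a few lines of NumPy by reinterpreting the float32 weights as unsigned integers; the sketch below uses random Gaussian weights as a stand-in for the actual T5 matrices.

import numpy as np

def average_bit_values(weights: np.ndarray) -> np.ndarray:
    """Average value of each of the 32 bits over all float32 weights,
    as plotted in Fig. 4 (index 0 = sign bit, indices 1..8 = exponent)."""
    u = weights.astype(np.float32).view(np.uint32).ravel()
    return np.array([((u >> b) & 1).mean() for b in range(31, -1, -1)])

# With small weights, the 1st exponent bit (index 1) is almost always 0
# and the next few exponent bits are almost always 1.
w = np.random.normal(0, 0.05, 100000).astype(np.float32)
print(average_bit_values(w)[:9])   # sign bit followed by the 8 exponent bits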


Figure 4: Average value of each bit for the weights in T5 model.

3) General impact of errors on different bit positions

For a floating-point representation, a 0-to-1 flip of the i-th exponent bit increases the value of the weight by a factor of 2^(128/2^(i-1)), so errors on the upper exponent bits have a larger impact. In this section, we perform error injection experiments to evaluate the effect of flips on different bit positions. In the test for a specific bit position x, we first randomly select one parameter, flip bit x of this parameter, and then run the model on the test set. If at least one of the summaries differs from the error-free one, the Rouge1 score under error is recorded. The process is repeated 1000 times, and the average Rouge1 value is used as the significance of bit x. The results are shown in Fig. 5. Flips of the 1st exponent bit introduce a severe performance degradation, and flips of the 2nd to 5th exponent bits introduce a slight performance loss. In fact, all 0-to-1 flips of the first 5 exponent bits cause an obvious performance loss (as shown in Table I). As the 2nd to 5th exponent bits are 0 in only 1% of the cases, the overall performance loss appears small. In general, these results show that the first 5 exponent bits are the CBs for T5.
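A sketch of this injection loop, assuming a PyTorch model with float32 parameters (the bit numbering follows the flip_bit convention above, so bit 30 is the 1st exponent bit; the restore step and the comparison against error-free summaries are left to the caller):

import random
import struct
import torch

def inject_bit_flip(model: torch.nn.Module, bit: int):
    """Flip the given bit of one randomly chosen weight element and
    return what is needed to restore it after the test-set run."""
    weights = [p for p in model.parameters() if p.dim() >= 2]
    p = random.choice(weights)
    idx = tuple(random.randrange(s) for s in p.shape)
    original = p.data[idx].item()
    (u,) = struct.unpack('<I', struct.pack('<f', original))
    with torch.no_grad():
        p.data[idx] = struct.unpack('<f', struct.pack('<I', u ^ (1 << bit)))[0]
    return p, idx, original   # restore with p.data[idx] = original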

The experimental results show that, in both the encoder and decoder, errors on the CBs of W_Q and W_K have negligible impact on the model performance due to the softmax operation, and errors on the 1st exponent bit of W_F1 only introduce a few grammar errors in the output summary due to the GeLU operation. Therefore, the performance degradation mainly comes from 0-to-1 flips of the CBs of W_V, W_O, W_F2 and W_Fo.


Figure 5: Impact of errors on different bits of weights in T5 model.
TABLE I: Impact of 0-to-1 errors on different bits of weights in T5 model
Exp bit   1        2       3        4        5
Rouge1    0.0881   0.06    0.0282   0.1042   0.3089

4) Analysis of the impact of errors on CBs

The effects of errors on the CBs of W_V, W_O, W_F2 and W_Fo are similar, and the results for different CBs are listed in Table II, where typical wrong outputs are also provided. An error on the 1st exponent bit results in a fixed output that is irrelevant to the input text. Errors on the 2nd to 4th exponent bits also result in irrelevant summaries, but the outputs change for different inputs; errors on the 5th exponent bit only result in grammar errors in the output text. A discussion of the causes of these error characteristics follows.

  • Errors on the 1st exponent bit introduce abnormally large values (positive or negative) in the intermediate outputs, which may cause overflow when performing layer normalization. Since the T5 model does not have bias parameters, the overflow causes the layer normalization to output 0. Irrespective of the layer in which the errors occur (encoder or decoder), such 0 outputs force the final results to be fixed and unrelated to the input text.

  • Errors on the 2nd to 5th exponent bits do not cause overflow in the layer normalization operation, but the wrong values in the intermediate results change the final logits scores used for word selection in the prediction layer. For errors on the 2nd, 3rd, and 4th exponent bits of some large weights, the logits scores change significantly, so the output is irrelevant to the input text and looks random. However, for the 4th exponent bit of some small weights and the 5th exponent bit, errors only introduce small changes to the logits scores, which then cause some grammar errors in the final output. For example, errors on the 5th exponent bit may cause repeated words in the text, as shown in Table II.

TABLE II: Effect of errors on different CBs for T5-based summarization
Bit position | Error characteristics | Typical output examples
1 | Fixed sentences or strings | Decoder: "the the thea the the: the thet thesss thes the thesassa thes:ss:" (subsequent characters omitted). Encoder: "iReporter.com: What's New in New York City? . .: What Happens Next? - What Happened To You? "It's a Wonderful Life""
2 | Random and similar strings | "age .snaes-drubs of swaas - saber sdberbersedst syscercerssedcersssce of eraskar a yror .age ."
3 | Random and similar strings | "distr bebelus adanc spalat intrebgardinen Clickfunnel mes insotit"
4 | Random and similar strings or grammar errors | "agments cans, a symgael, is a noggottadfad, he says . Previously he and ad samess, the duddad isn't ay"
5 | Grammar errors | "The Palestinian Authority signed the International Criminal Criminal on Wednesday" (subsequent characters omitted)

III-B Dependability Evaluation for OPUS-MT

1) Dataset and evaluation metric

We use the IWSLT2017 dataset [23] in this section, and the BLEU score is used to evaluate the translation performance of OPUS-MT, which is defined as

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n ),
BP = 1 if c > r, and exp(1 - r/c) if c ≤ r,
p_n = min(count_match^n, count_ref^n) / count_output^n,    (9)

where p_n is the precision of n-grams, count_output^n and count_ref^n are the numbers of n-grams in the output translation and the reference translation, respectively, count_match^n is the number of n-grams common to the output translation and the reference translation, BP is the brevity penalty factor, c is the sentence length of the output translation, and r is the sentence length of the reference translation. When there are no errors, the BLEU score is 0.2186.
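A sentence-level sketch of Eqn (9), assuming uniform weights w_n = 1/N and whitespace tokenization (a standard package such as sacrebleu would be used in practice):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, output: str, N: int = 4) -> float:
    """Sentence-level BLEU per Eqn (9) with uniform weights w_n = 1/N."""
    ref, out = reference.split(), output.split()
    log_p = 0.0
    for n in range(1, N + 1):
        ref_n, out_n = ngrams(ref, n), ngrams(out, n)
        match = sum(min(c, ref_n[g]) for g, c in out_n.items())
        total = sum(out_n.values())
        if match == 0 or total == 0:
            return 0.0                    # zero n-gram precision at any order
        log_p += (1.0 / N) * math.log(match / total)
    c, r = len(out), len(ref)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)   # brevity penalty BP
    return bp * math.exp(log_p)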

2) Identification of CBs for OPUS-MT model

The weight distribution of the OPUS-MT model is first analyzed, and the average values of different bits of the weights are shown in Fig. 6. As we can see, the results are similar to T5, but they also reveal two differences: (a) the 1st exponent bit is always 0; (b) the 2nd to 4th exponent bits are always 1.

Then we perform error injection experiments to evaluate the impact of errors on different bit positions; the average BLEU scores over 1000 runs are shown in Fig. 7. Only errors on the 1st exponent bit introduce a severe performance loss, while errors on the other bits do not affect the model performance. These results are expected because errors on the upper exponent bits below the 1st one (which are always 1) only decrease the weights and thus do not affect the model performance. Therefore, the 1st exponent bit is the only CB for the OPUS-MT model.


Figure 6: Average value of each bit for OPUS-MT model.

3) Analysis of the impact of errors on the CB

Similar to the results for T5, the effects of errors on W_Q, W_K and W_F1 are reduced by the softmax and GeLU operations, and the performance loss mainly comes from errors on W_V, W_O and W_F2. However, different from T5, the layer normalization in OPUS-MT includes a bias, so the layer output becomes a fixed matrix (with each row equal to the bias vector) instead of a 0 matrix. Therefore, if the errors occur in the encoder part, the translation results are determined by the bias vector of the layer norm and are not related to the input text. Since the layer norm bias is different for different layers, the wrong translation result also changes from layer to layer, but it is fixed for different inputs. Errors in a layer of the decoder part also produce a fixed layer output matrix with each row equal to the bias vector, finally generating a wrong translation result that is irrelevant to the input text. However, there is a difference: the fixed layer output is only used for the computation of Q of the next layer, while the computations of K and V of the following layers are based on the correct results from the encoder. Therefore, the final results are different for different inputs. Table III shows some examples of the wrong translation results for different layers in the encoder and decoder parts. The results are meaningless and include repetitions of fixed words or symbols. This explains why the BLEU score for errors on the 1st exponent bit is 0 in Fig. 7.

TABLE III: Effect of errors on the CB for OPUS-MT based translation
Layer | Translation result (the first line)
Encoder 1 | All-for- and-for-for- and-for-for- and-for-for- and-for-
Encoder 3 | I'd be my I——————————————————
Encoder 5 | It's—'————————————————————
Decoder 1 | (empty strings)
Decoder 3 | ……………………………………………………………………………..
Decoder 5 | at at at at at at at at at at at at at at at at at at at at at at at


Figure 7: Impact of errors on each bit for OPUS-MT model.

IV Concurrent Linguistic Error Detection (CLED)

In the previous section, the impact of single-bit flips in OPUS-MT and T5 has been analyzed. There, error detection is based on comparison with the error-free output, which allows the generation of a labeled dataset. In this section, the proposed CLED scheme is presented. First, the motivation and high-level approach are discussed; then the main blocks of the scheme are described: the linguistic features and the concurrent classifier.

IV-A Motivation and Approach

The main motivation for the development of the proposed CLED scheme is to provide error detection mechanisms that:

  1. operate on models for which there is no access to the internal nodes; and

  2. are widely applicable to language models regardless of their implementation details.
To the best of the authors' knowledge, no concurrent error detection scheme meets both requirements; CCED [17] partially meets 2) but does not fulfill 1). As discussed in the introduction, a trivial but key observation is that when language models are used to generate text, we expect the text to be valid. This implies that the words must be correct and that sentences and constructions must follow the rules of the language. Intuitively, errors that cause the model to malfunction are likely to produce meaningless, or at least incorrect, text, as discussed in the previous section. As an example, this is an excerpt of the output of T5 with an error injected in one of its parameters: "Satoisesaias,mass of a customiseaed toeasaiesaado hudness in a shader in hiuued tuuuiut tiuisuaiui"; the text is clearly invalid. This example and the ones presented in the previous section suggest that it may be possible to detect errors based only on the features of the output text, which would also enable the protection of closed models, regardless of their implementation details.

The proposed CLED scheme is illustrated in Fig. 8. It operates on the output text of the LLM and therefore, it does not require any modifications to the model, or access to any internal nodes. This makes CLED attractive to protect models that are provided as a black box to which no modification can be made. The error detection is performed in two steps.

First, relevant linguistic features are extracted from the output text and fed to a concurrent classifier that performs error detection. Since error detection is based solely on the properties of the text, the approach should be applicable to newer LLMs. This is very interesting, as the pace at which models are introduced is hectic, and an error detection scheme that is general and can be applied to any LLM facilitates the fast protection of LLMs as they are introduced.

Then, once an error is detected, since it is a soft error that does not affect the component functionality, the model is run a second time on the same input text. This implies that the LLM parameters are reloaded from external memory and previous results or intermediate values are removed. Since soft errors are rare events, the occurrence of errors on two consecutive runs is extremely unlikely, i.e., the second execution is usually error-free. Therefore, if CLED detects an error on the second run, it is assumed to be a false positive and the generated text is assumed to be correct, as in [17].

The overheads introduced by the proposed scheme thus come from the extraction of the linguistic features, the concurrent classifier, and the re-computations performed for false positives. Given the large size of LLMs, the relative burden of the linguistic features and the concurrent classifier is negligible, so the main overhead comes from the re-computations induced by false positives. A minimal sketch of this detect-then-recompute flow is shown below.
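In the sketch, llm, classifier and extract_features are placeholders for the LLM call, the trained concurrent classifier, and the feature extraction of Section IV-B; none of them refers to a specific API.

def cled_generate(llm, classifier, extract_features, input_text, threshold=0.5):
    """CLED flow of Fig. 8: generate, extract linguistic features,
    classify; on a detected error, re-run once (soft errors are
    transient, so the second run is assumed error-free)."""
    text = llm(input_text)
    p_error = classifier.predict_proba([extract_features(text)])[0][1]
    if p_error >= threshold:       # error detected: re-compute once
        text = llm(input_text)     # parameters are reloaded from memory
        # a second detection is treated as a false positive, as in [17]
    return text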

Figure 8: Proposed Concurrent Linguistic Error Detection (CLED).

The main elements of the proposed CLED scheme are the linguistic features and the concurrent classifier, both are briefly discussed in the next subsections.

IV-B Linguistic Features

Languages have spelling rules that restrict the characteristics of a valid text. In English, for example, the same consonant cannot be tripled, digraphs such as 'uu' and 'aa' are extremely rare, and 'ii' never occurs [24] (pp. 132 and 449). Similarly, in Spanish, only six consonants can appear in a word's final position [25], and in German, sequences of fricatives (such as /vz/) are not allowed [26]. These rules can potentially be used to detect invalid texts that can be a sign of an error, because the models have been trained to produce valid texts. This seems to be confirmed by the examples of texts generated by the models under error shown in Tables II and III, which in many cases contain invalid words and character sequences.

The presence of invalid words is not the only linguistic information that can be used to detect errors. There are also patterns that valid texts follow; for example, the distribution of part-of-speech categories [27] typically falls within certain ranges, and not all words are, say, nouns or verbs. These types of patterns can also be used to detect anomalous text caused by errors. Some works have proposed analyzing the grammaticality of sentences to detect errors; for example, they extract the grammatical category of words with Part-of-Speech (POS) techniques and train models using the sequences of grammatical categories as features [28]. Other works demonstrate that the frequency of grammatical categories allows training simple classifiers with high accuracy, such as the one developed by Mendhakar and Darshan to differentiate between fiction and non-fiction texts [29]. However, these methods are not capable of detecting semantic errors and can classify sentences like "I play basketball" (correct) as well as "I play melon" (incorrect) as valid. This highlights the inadequacy of probabilistic grammatical models for detecting semantic incongruencies, a problem already described by linguists [30].

Therefore, based on the previous discussion, two types of features are used as inputs to the concurrent classifier of the proposed CLED scheme: linguistic rules and linguistic patterns. The first type of feature checks the restrictions that are defined by the language. The second type captures patterns that are common in the language; for example, a text in which the relative frequency of consonants is ten times that of vowels is also likely to be invalid due to an error. The features used can be adapted to better capture the impact of errors on the output text; some examples of features are:

  1. Restriction: number of capital letters in the middle of words.

  2. Restriction: number of sequences of the same letter repeated three or more times.

  3. Pattern: average word length.

  4. Pattern: frequency of vowels or consonants.

  5. Pattern: distribution of grammatical categories (e.g., verbs, nouns, determiners, conjunctions).

Ideally, a common set of features can be used for different LLMs and applications, but depending on the performance of the classifier, the features to use may have to be adjusted to better capture the error patterns caused by the soft error in a specific LLM and application. In the proposed design, the same set of features is used for both models (OPUS-MT and T5) and applications (translation and summarization) to illustrate the general applicability of the linguistic features. This facilitates the protection of LLMs because it makes feature extraction independent of the model application.

Table IV shows the linguistic features used for the concurrent classifier. The first three are rules, as they correspond to letter sequences that are invalid, while the rest of the features correspond to patterns that are expected to be within a range of values in valid texts. As discussed before, these features are used both for OPUS-MT and T5 to illustrate that a common set of features can serve different models and applications; a sketch of how they can be extracted follows the table.

TABLE IV: Linguistic features used for the concurrent classifier
Type     Feature                  Description
Rule     uppercase_middle         Frequency of words with an uppercase letter in the middle of the word
Rule     four_or_more_consonants  Frequency of words with sequences of four or more consonants
Rule     three_or_more_eq_chars   Frequency of words with sequences of three or more equal characters
Pattern  punctuation_mark         Frequency of punctuation marks in the text
Pattern  digit                    Frequency of digits in the text
Pattern  blank                    Frequency of blanks in the text
Pattern  vowel                    Frequency of vowels in the text
Pattern  words_density            Inverse of the average word length
Pattern  ADP                      Frequency of adpositions in the text
Pattern  NUM                      Frequency of numerals in the text
Pattern  VERB                     Frequency of verbs in the text
Pattern  DET                      Frequency of determiners in the text
Pattern  PRON                     Frequency of pronouns in the text
Pattern  NOUN                     Frequency of nouns in the text
Pattern  PRT                      Frequency of particles in the text
Pattern  ADV                      Frequency of adverbs in the text
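A possible implementation of the rule and character-level pattern features of Table IV is sketched below; the part-of-speech frequencies (ADP, NUM, VERB, etc.) are omitted here, as they would require a POS tagger such as the one in NLTK, and the regular expressions are illustrative rather than the exact definitions used in the paper.

import re

VOWELS = set("aeiouAEIOU")

def linguistic_features(text: str) -> list:
    """Rule and character-level pattern features in the spirit of Table IV."""
    words = text.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        # Rules: frequency of words violating spelling restrictions
        sum(1 for w in words if re.search(r".[A-Z]", w)) / n_words,
        sum(1 for w in words
            if re.search(r"[^aeiouAEIOU\W\d_]{4,}", w)) / n_words,
        sum(1 for w in words if re.search(r"(.)\1\1", w)) / n_words,
        # Patterns: character-level frequencies and word density
        sum(1 for c in text if c in ".,;:!?") / n_chars,
        sum(c.isdigit() for c in text) / n_chars,
        sum(c.isspace() for c in text) / n_chars,
        sum(1 for c in text if c in VOWELS) / n_chars,
        n_words / max(sum(len(w) for w in words), 1),  # inverse avg word length
    ]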

IV-C Concurrent Classifier

As the number of linguistic features is small and they are also simple, a basic binary classifier should be sufficient to detect the errors. In the proposed design, a Random Forest classifier (see https://scikit-learn.org/1.3/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is used, as in [17]. Different combinations of models and features have been tested. The hyperparameters have been configured using cross-validation with recall as the evaluation metric, to reduce false negatives. This process has been performed independently for OPUS-MT and T5; the same set of linguistic features and the same type of classifier (a Random Forest) are used for both models, but trained with the relevant dataset in each case. A sketch of this tuning process follows.
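The tuning loop might look as follows with scikit-learn; the parameter grid shown is illustrative (only the 50 trees and depth 10 reported in Table V come from the paper), and X_train/y_train stand for the feature vectors and error labels of the training split.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validation with recall as the scoring metric to reduce false negatives
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, 20]},
    scoring="recall",
    cv=5)
# search.fit(X_train, y_train)   # X_train: linguistic features, y_train: 0/1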

V Evaluation

To evaluate the proposed CLED scheme, T5 is applied to text summarization and OPUS-MT to Chinese-to-English translation as case studies. As discussed previously, CLED targets the protection of LLMs that the designer sees as a black box, i.e., with no access to the internal nodes. In this scenario, the applicable and comparable protection techniques are duplication or recomputation, which imply at least doubling the overhead of an implementation.

V-A T5 Summarization

The T5 model is used to summarize news taken from the CNN Daily Mail dataset (https://huggingface.co/datasets/cnn_dailymail/viewer/3.0.0/test). First, a subset of 11,490 news items from the dataset has been selected and run to generate error-free summaries. Then an error has been injected in a randomly selected bit of a randomly selected parameter of the T5 model, and the summaries for the same 11,490 news items have been generated. If at least one of them differed from the error-free summary, the error was considered relevant; otherwise, if the error had no effect on any of the summaries, it was considered irrelevant (this is commonly the case when the error changes one of the least significant bits of a parameter). The process has been repeated until 100 relevant errors were identified. Then a dataset has been built with the 11,490 error-free summaries and 11,490 erroneous summaries from the 100 relevant errors. The dataset is available online (https://doi.org/10.5281/zenodo.10647624). These error-free and erroneous texts form the dataset on which the concurrent classifier is trained and evaluated. The 22,980 samples are split, with 80% allocated for training and 20% for testing.

Different classifiers with multiple hyperparameter configurations have been tested. The best model has been achieved with a Random Forest comprising 50 trees with a maximum depth of 10 levels per tree, the square root of the number of features as the maximum number of features considered at each split, Gini as the impurity criterion for splitting, bootstrapping of samples, a minimum of 2 samples per leaf, and at least 10 samples required for a node split. Table V shows the resulting hyperparameters, which fit the error detection problem with the mentioned features while avoiding overfitting (a sketch mapping them to scikit-learn arguments follows the table). The results on the test dataset reveal an accuracy of 93% and a recall of 93%, with a false negative rate of 11% and a false positive rate of 2%.

TABLE V: Model hyperparameters of the concurrent Random Forest classifier
Hyperparameter | Value | Description
Number of trees | 50 | Improves the generalization of the model and avoids inter-tree overfitting by limiting the number of trees
Maximum depth | 10 | Avoids in-tree overfitting by limiting the maximum number of levels per tree
Maximum number of features per split | sqrt | Limits each split to the square root of the number of features, preventing the selection of an overly large feature subset
Impurity criterion | Gini | Measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled
Bootstrapping of samples | Yes | Introduces variability in the training datasets of each tree, which reduces the correlation between trees
Minimum samples per leaf | 2 | Avoids overfitting by requiring at least two samples per leaf in each tree
Minimum samples for a node split | 10 | Avoids overfitting by requiring ten samples to make a new split in the tree
Maximum leaf nodes | No limit | Adds flexibility to each tree, as other parameters limit overfitting
Minimum decrease in impurity for splitting | No limit | Allows a new split for any decrease in impurity; other parameters are set to avoid overfitting
Minimum weight fraction per leaf | No limit | Allows leaf nodes with any weight fraction; other parameters are set to avoid overfitting
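For reference, a sketch mapping the hyperparameters of Table V to scikit-learn arguments:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=50,        # number of trees
    max_depth=10,           # maximum depth
    max_features="sqrt",    # maximum number of features per split
    criterion="gini",       # impurity criterion
    bootstrap=True,         # bootstrapping of samples
    min_samples_leaf=2,     # minimum samples per leaf
    min_samples_split=10,   # minimum samples for a node split
    max_leaf_nodes=None)    # no limit on leaf nodes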

The results are summarized in Fig. 9. Changing the decision threshold of the model allows the re-computation overhead to be controlled, i.e., the unnecessary re-computations run when an error has been detected but the prediction was wrong (measured by the false positive rate), and shows the influence of this re-computation rate on error non-detection, quantified as false negatives. The results show that most of the errors, close to 90%, can be detected even with a very low recomputation overhead, for example 1%. The cost of detecting additional errors grows quickly: to detect an additional 4% of the errors, the recomputations increase by more than 5 times. Therefore, the designer can trade off detection effectiveness against recomputations depending on the system requirements and restrictions. Interestingly, since this only implies changing the decision threshold, the same LLM can be run with different levels of protection using the proposed CLED. For example, when used to translate medical texts, more than 20% recomputations can be allowed to prioritize error detection, whereas a 1% overhead may suffice when translating news. A sketch of this threshold sweep is shown below.
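The sweep can be sketched as follows, reusing the clf from the previous sketch and assuming held-out NumPy arrays X_test (features) and y_test (1 = erroneous text); the threshold values are illustrative.

import numpy as np

# probs: estimated probability that each test sample is erroneous
probs = clf.predict_proba(X_test)[:, 1]
for t in (0.3, 0.5, 0.7):
    pred = probs >= t
    detected = pred[y_test == 1].mean()    # detection rate (true positives)
    recompute = pred[y_test == 0].mean()   # re-computation overhead (false positives)
    print(f"threshold {t}: detection {detected:.2f}, overhead {recompute:.2f}")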

Figure 9: Error detection rates for different percentages of re-computation (ROC curve).

The relative complexity of the main model and the concurrent classifier is evaluated in terms of their average run time per sample in the dataset. The results are summarized in Table VI, where the times for the concurrent classifier include the linguistic feature extraction and the Random Forest classifier. The overhead for concurrent error detection is small, even taking into account that the main model was run on a GPU (NVIDIA GeForce RTX 4080) and the concurrent classifier on a standard CPU (Intel Xeon E5-2630 v2 @ 2.6 GHz with 32 GB RAM). The runtime of the concurrent classifier is at least one order of magnitude below that of the main model. It is interesting to note that the complexity of the main models keeps increasing; for example, most state-of-the-art LLMs have at least several billion parameters, significantly more than OPUS-MT and T5. Instead, the complexity of the proposed CLED is independent of the LLM size, as it operates directly on the output text. Therefore, the relative overhead of the concurrent classifier will be even lower for more advanced LLMs.

TABLE VI: Average run time for the case studies (milliseconds)
Case study | Main model | Concurrent classifier
T5 summarization | 453 ms/sample | 11 ms/sample
OPUS-MT translation | 113 ms/sample | 11 ms/sample

V-B OPUS-MT Translation

As a second case study, a different model (OPUS-MT) and application (translation) have been considered. The IWSLT17 translation dataset [23] has been used to select 8,549 samples of Chinese-to-English translation, and OPUS-MT has first been run with no errors. Then the same process as for T5 was performed to identify 100 relevant errors and build a dataset with 8,549 error-free and 8,549 erroneous samples. The 17,098 samples are split, with 80% allocated for training and 20% for testing. Again, the dataset is published online (https://doi.org/10.5281/zenodo.10647624).

Similarly to T5, a Random Forest binary classifier has been trained using the same features (Table IV) and hyperparameter configuration. As a result, a model with an accuracy of 95% and a recall of 95% was obtained, with a false negative rate of 9% and a false positive rate of 1%. Fig. 9 shows the error detection rate for different overhead rates (false positives). The results demonstrate that the use of a Random Forest and the selected features enables the training of a concurrent classifier for error detection with high accuracy and a low false negative rate for different LLMs and applications. This confirms our intuition that the linguistic features are, to some extent, general and can be used for different LLMs. As previously discussed, this makes CLED more attractive: the feature extraction and the classifier can remain the same, and when changing the LLM or the application, the designer only needs to retrain the concurrent classifier with the relevant dataset to achieve effective protection.

VI Conclusion

This paper has presented Concurrent Linguistic Error Detection (CLED), an efficient error detection technique for large language models (LLMs) that does not require access to the model's internal nodes. CLED exploits the rules of the language and the patterns of valid texts to detect erroneous texts generated by language models when they suffer errors. To this end, relevant linguistic features are extracted from the generated text and fed into a concurrent classifier that detects errors. The proposed scheme has been evaluated on two different models, T5 and OPUS-MT, in two different tasks, summarization and translation. The same linguistic features have been used for error detection in both models to illustrate the generality of the proposed approach. The results show that most errors, close to 90%, are detected at a very low recomputation overhead (of 1%). Increasing the detection rate to 95% introduces a significantly larger overhead (around 20%), which can still be acceptable for many applications. Differently from existing error detection methods, CLED can be used on LLMs to which the designer only has black-box access, because CLED operates only on the outputs of the LLM. Interestingly, the proposed CLED can trade off detection rate and overhead by simply changing the decision threshold of the concurrent classifier. This enables dynamic adjustment of the protection to the application or input text at run time.

The CLED scheme can be refined and extended in different directions as future work. For example, it is of interest to evaluate the effectiveness of CLED for languages other than English and for additional tasks and models. In particular, the evaluation of CLED on state-of-the-art commercial models such as GPT-4 or Gemini would be interesting; unfortunately, those models are not open source and thus error injection is not possible. Moreover, investigating additional linguistic features that can improve the detection rate is also of interest.

References

  • [1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A survey of large language models,” arXiv preprint 2303.18223, 2023.
  • [2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2019.
  • [3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020. [Online]. Available: http://jmlr.org/papers/v21/20-074.html
  • [4] OpenAI, “GPT-4 technical report,” arXiv preprint 2303.08774, 2023.
  • [5] G. Team, “Gemini: A family of highly capable multimodal models,” arXiv preprint 2312.11805, 2023.
  • [6] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint 2307.09288, 2023.
  • [7] A. Q. Jiang et al., “Mistral 7B,” arXiv preprint 2310.06825, 2023.
  • [8] M. A. Neggaz, I. Alouani, S. Niar, and F. Kurdahi, “Are cnns reliable enough for critical applications? an exploratory study,” IEEE Design and Test, vol. 37, no. 2, pp. 76–83, 2020.
  • [9] L. M. Luza, A. Ruospo, D. Söderström, C. Cazzaniga, M. Kastriotou, E. Sanchez, A. Bosio, and L. Dilillo, “Emulating the effects of radiation-induced soft-errors for the reliability assessment of neural networks,” IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 4, pp. 1867–1882, 2022.
  • [10] C. Bolchini, L. Cassano, A. Miele, and A. Toschi, “Fast and accurate error simulation for cnns against soft errors,” IEEE Transactions on Computers, vol. 72, no. 4, pp. 984–997, 2023.
  • [11] Z. Gao, H. Zhang, Y. Yao, J. Xiao, S. Zeng, G. Ge, Y. Wang, A. Ullah, and P. Reviriego, “Soft error tolerant convolutional neural networks on fpgas with ensemble learning,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 3, pp. 291–302, 2022.
  • [12] Z. Gao, J. Wang, R. Su, P. Reviriego, S. Liu, and F. Lombardi, “On the dependable operation of bidirectional encoder representations from transformers (BERT) in the presence of soft errors,” in 2023 IEEE 23rd International Conference on Nanotechnology (NANO), 2023, pp. 582–586.
  • [13] J. Tiedemann and S. Thottingal, “OPUS-MT — Building open translation services for the World,” in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal, 2020.
  • [14] L. Costas-Pérez and J. J. Rodríguez-Andina, “Algorithmic Concurrent Error Detection in Complex Digital-Processing Systems,” IEEE Design and Test of Computers, vol. 26, no. 1, pp. 60–67, 2009.
  • [15] A. Mahmood and E. McCluskey, “Concurrent error detection using watchdog processors-a survey,” IEEE Transactions on Computers, vol. 37, no. 2, pp. 160–174, 1988.
  • [16] Z. Alkhalifa, V. Nair, N. Krishnamurthy, and J. Abraham, “Design and evaluation of system-level checks for on-line control flow error detection,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 6, pp. 627–641, 1999.
  • [17] P. Reviriego, Z. Wang, A. Alonso, Z. Gao, F. Niknia, S. Liu, and F. Lombardi, “Concurrent classifier error detection (cced) in large scale machine learning systems,” IEEE Transactions on Reliability, pp. 1–10, 2024.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • [19] A. Reddy and P. Banerjee, “Algorithm-based fault detection for signal processing applications,” IEEE Transactions on Computers, vol. 39, no. 10, pp. 1304–1308, 1990.
  • [20] K. Ito, Y. Zhang, H. Itsuji, T. Uezono, T. Toba, and M. Hashimoto, “Analyzing due errors on gpus with neutron irradiation test and fault injection to control flow,” IEEE Transactions on Nuclear Science, vol. 68, no. 8, pp. 1668–1674, 2021.
  • [21] A. M. Keller and M. J. Wirthlin, “Partial tmr for improving the soft error reliability of sram-based fpga designs,” IEEE Transactions on Nuclear Science, vol. 68, no. 5, pp. 1023–1031, 2021.
  • [22] R. Baumann, “Soft errors in advanced computer systems,” IEEE Design and Test of Computers, vol. 22, no. 3, pp. 258–266, 2005.
  • [23] M. Cettolo, M. Federico, L. Bentivogli, J. Niehues, S. Stüker, K. Sudoh, K. Yoshino, and C. Federmann, “Overview of the IWSLT 2017 evaluation campaign,” in Proceedings of the 14th International Conference on Spoken Language Translation.   Tokyo, Japan: International Workshop on Spoken Language Translation, Dec. 14-15 2017, pp. 2–14. [Online]. Available: https://aclanthology.org/2017.iwslt-1.1
  • [24] G. Brooks, Dictionary of the British English Spelling System, 1st ed.   Open Book Publishers, 2015.
  • [25] J. H. Clegg and W. C. Fails, “Manual de fonética y fonología españolas,” Routledge, p. 111, 2018.
  • [26] M. G. O’Brien and S. M. Fagan, “German phonetics and phonology: Theory and practice,” Yale University Press, p. 65, 2016.
  • [27] E. van Lier, The Oxford Handbook of Word Classes.   Oxford University Press, 12 2023. [Online]. Available: https://doi.org/10.1093/oxfordhb/9780198852889.001.0001
  • [28] N. Agarwal, M. A. Wani, and P. Bours, “Lex-pos feature-based grammar error detection system for the english language,” Electronics, vol. 9, no. 10, 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/10/1686
  • [29] A. Mendhakar and D. H. S, “Parts-of-speech (pos) analysis and classification of various text genres,” Corpus-based Studies across Humanities, 2023. [Online]. Available: https://doi.org/10.1515/csh-2023-0002
  • [30] N. Chomsky, “Three models for the description of language,” IRE Transactions on information theory, vol. 2, no. 3, pp. 113–124, 1956.