
BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT Model Pretraining

Wen Liang
Google Research
Mountain View, CA 94043
[email protected]
Youzhi Liang
Department of Computer Science
Stanford University
Stanford, CA 94305
[email protected]
Abstract

BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks. Yet, the majority of researchers have mainly concentrated on enhancements related to the model structure, such as relative position embedding and more efficient attention mechanisms. Others have delved into pretraining tricks associated with Masked Language Modeling, including whole word masking. DeBERTa introduced an enhanced decoder adapted for BERT's encoder model for pretraining, proving to be highly effective. We argue that the design and research around enhanced masked language modeling decoders have been underappreciated. In this paper, we propose several designs of enhanced decoders and introduce BPDec (BERT Pretraining Decoder), a novel method for model pretraining. Typically, a pretrained BERT model is fine-tuned for specific Natural Language Understanding (NLU) tasks. In our approach, we use the original BERT model as the encoder and modify only the decoder, leaving the encoder unchanged. This approach requires no modifications to the encoder architecture and can be seamlessly integrated into existing fine-tuning pipelines and services, offering an efficient and effective enhancement strategy. Compared to other methods, we incur a moderate training cost for the decoder during pretraining, but our approach introduces no additional cost during the fine-tuning phase. We test multiple enhanced decoder structures after pretraining and evaluate their performance on the GLUE and SQuAD tasks. Our results demonstrate that BPDec, having only undergone subtle refinements to the model structure during pretraining, significantly enhances model performance without escalating the finetuning cost, inference time, or serving budget.

1 Introduction

The journey of deep learning has been marked by significant innovations that redefine our understanding and capabilities in solving complex problems [1, 2, 3]. ResNet revolutionized a wide range of computer vision tasks with its deep residual learning framework, enabling the training of deeper networks by addressing the vanishing gradient problem [2]. This breakthrough laid the groundwork for subsequent work on model architectures. The transformer model shifted the paradigm, establishing a new standard in handling sequential data [4]. Building on these foundations, the BERT model marked another milestone in natural language processing. By pretraining on a large corpus and adopting a bidirectional approach, BERT achieved unprecedented performance across a wide range of language tasks. Collectively, these contributions have not only advanced the field of deep learning but also set the stage for our current exploration into refining the BERT architecture through a pretraining and finetuning approach.

Following the groundbreaking inception of the Transformer architecture, there has been a continuous evolution in model design and methodology, further enhancing its capabilities and performance in diverse tasks. TransformerXL extended the Transformer’s capability to process longer sequences, overcoming the limitations of the standard architecture in handling extended context [5]. This was achieved through the innovative use of relative positional encoding, allowing the model to capture dependencies over extended sequences, far beyond the absolute positional encoding that was applied in Transformer and BERT models. This enhancement not only addressed the limitations in handling extended context but also paved the way for more dynamic and contextually aware representations in sequence modeling. Complementing this, XLNet integrated the strengths of both autoregressive and autoencoding techniques [6]. It further leveraged the concept of relative positional encoding, enhancing the model’s ability to understand and predict the relationships between elements in a sequence. This approach enabled XLNet to surpass BERT in numerous benchmarks. The Funnel-Transformer, another significant innovation, introduces a methodology that gradually compresses the sequence of hidden states into a shorter format, effectively reducing computation costs without compromising the model’s performance across a variety of NLP tasks [7]. This trend underscores a growing emphasis in the field not only on achieving high accuracy but also on improving computational efficiency.

The original Transformer architecture has a separation of the encoder and decoder components, each serving a unique function in the model's processing of sequences. BERT focused exclusively on the encoder part of this architecture. By harnessing the power of the encoder, BERT excelled in creating deep, bidirectional representations with Masked Language Modeling (MLM), but it forgoes the sequential generation capabilities of the decoder. On the other hand, T5 (Text-to-Text Transfer Transformer) took a holistic approach by further developing both the encoder and decoder components [8]. T5 not only embraced the full encoder-decoder structure but also innovated by treating every NLP task as a text-to-text problem and using a multi-tasking pretraining strategy. This model incorporated varied methods such as MLM with different lengths and formats, thereby expanding the versatility and applicability of the Transformer architecture. Such developments in models like T5 demonstrate the continuous evolution and adaptation of the Transformer framework, exploring new potentials in both its encoder and decoder capabilities. These advancements reflect that, in addition to improvements in model architecture, changes in strategies during the pretraining phase are also very important.

Not surprisingly, alongside these architectural advancements, there has been a significant emphasis on refining strategies and methods in optimization, hyperparameter tuning, and model pretraining. For instance, RoBERTa, an iteration of BERT, has unveiled its potential by utilizing more extensive and diverse datasets, dynamic masking in Masked Language Modeling (MLM), and a refined set of hyperparameters [9]. This work underlines the importance of not just model design, but also the quality and variety of data and training techniques. Furthermore, some studies have delved into the optimal ratio of masking in MLM, a key factor that significantly influences the efficiency and effectiveness of the pretraining process [10]. Moreover, the topic of how to mask tokens has also been widely discussed [11]. Another intriguing area of research is the adaptability and flexibility of training structures. Innovations like BERT-of-Theseus demonstrate how a compressing strategy can lead to more efficient models without compromising performance [12]. These strategies involve progressively replacing parts of the network with more efficient modules, thereby refining the model’s structure and training process. Such developments suggest a growing recognition of the intricate interplay between various aspects of model training, extending beyond mere architectural changes to include strategic modifications in data handling, training methodologies, and structural efficiency.

DeBERTa represents a significant milestone in the ongoing evolution of language models, introducing several novel methods that have achieved state-of-the-art performance [13]. The authors introduced the disentangled attention mechanism, which represents a significant deviation from vanilla multi-head attention. However, compared to a BERT model with the same number of layers and a similar number of parameters, DeBERTa incurs a higher computational cost for both training and serving. Furthermore, for developers considering transitioning from BERT to DeBERTa, there is an additional development cost that needs to be considered, which may not always be feasible or cost-effective for every application or organization.

However, DeBERTa's introduction of a pretraining-only module called the Enhanced Mask Decoder, which is essentially two additional layers discarded during fine-tuning, intrigued us more than previous architectural modifications. We were drawn to explore whether this decoder module could be generalized and optimized for use with the vanilla BERT model, and whether it could unlock the encoder's ability to match or even surpass the performance of more computationally expensive models. This line of inquiry forms the basis of our research.

In this paper, we propose BPDec (BERT Pretraining Decoder), a novel architecture that harnesses the potential of the MLM (Masked Language Modeling). Our primary contributions are as follows:

1. Introduction of BPDec: The encoder is identical to that of vanilla BERT, while pretraining adds a redesigned MLM decoder, which we call BPDec. The main modifications are:

  • Adding multiple transformer block layers after the encoder to function as a decoder.

  • In the decoder, we remove the restriction present in the encoder’s attention mechanism, which prevents attending to masked positions. Instead, we randomly and gradually relax some of the masked positions throughout the decoder layers.

  • Integration of a degree of randomness before output, where the final output is a randomly combined result of the encoder output and the decoder output.

2. Guidance on Hyperparameter Tuning for BPDec: To further assist readers, we have incorporated a dedicated section in our paper to guide the tuning process of the MLM decoder’s hyperparameters.

To assess the performance of our proposed method, we conducted a series of rigorous evaluations across a variety of tasks, including MNLI (Multi-Genre Natural Language Inference), SQuAD (Stanford Question Answering Dataset), and RACE (Reading Comprehension from Examinations), among others. In addition, an extensive ablation study was performed to validate the effectiveness of each modification we made. This study methodically deconstructed the model to isolate and evaluate the impact of individual changes, thereby providing empirical evidence of their contribution to the model's overall performance. Furthermore, comparative analyses were conducted against baseline models, including the original BERT and other state-of-the-art architectures. These comparisons not only highlight BPDec's advancements but also offer a comprehensive understanding of its positioning in the current NLP landscape.

2 Related Works

2.1 Masked Language Modeling

In the context of traditional Language Modeling, the objective is to maximize the probability of a given sequence of tokens. This can be mathematically defined as:

\max_{\theta}\log p_{\theta}(X|\tilde{X})=\max_{\theta}\sum_{i\in C}\log p_{\theta}(X_{i}|\tilde{X}) \qquad (1)

Masked Language Modeling (MLM) has become a fundamental technique in the field of natural language processing, particularly since the introduction of the BERT model [13, 9]. BERT represents a significant shift in model architecture by employing an encoder-only framework [1]. For autoregressive language modeling, as seen in models like GPT, the objective differs in its sequential approach to predicting each token. In autoregressive language modeling, each token $X_{i}$ is predicted based on the sequence of preceding tokens. Mathematically, this can be defined as:

\max_{\theta}\log p_{\theta}(X|\tilde{X})=\max_{\theta}\sum_{i=1}^{N}\log p_{\theta}(X_{i}|X_{1},X_{2},\ldots,X_{i-1}) \qquad (2)

The goal is to optimize the model parameters $\theta$ such that the probability of each token $X_{i}$, given its preceding context, is maximized. The encoder-only design of BERT enables bidirectional context understanding and full-text attention, unlike autoregressive decoders that are limited to unidirectional context. BERT's training methodology instead centers on masked language modeling (MLM). This approach can be mathematically represented as follows:

\max_{\theta}\log p_{\theta}(X_{\text{masked}}|\tilde{X})=\max_{\theta}\sum_{i\in C_{\text{masked}}}\log p_{\theta}(X_{i}|X_{\text{unmasked}}) \qquad (3)

Here, $X_{\text{masked}}$ represents the set of tokens that have been masked, and $X_{\text{unmasked}}$ is the set of tokens that remain visible to the model. The model parameters $\theta$ are optimized to maximize the log probability of correctly predicting the masked tokens given the context of unmasked tokens. BERT's MLM pretraining randomly masks 15% of the tokens in each sequence. Within this subset, 80% of the tokens are substituted with a [MASK] token, 10% remain as is, and the remaining 10% are replaced with a random token from the vocabulary. The model is trained to predict these tokens based on the context. This training method allows BERT to effectively grasp the bidirectional context, making it highly effective across a wide range of NLP tasks after finetuning.
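To make the masking procedure concrete, the sketch below implements the 15% selection with the 80/10/10 replacement split. It is a minimal PyTorch-style illustration under our own naming (`mask_tokens` and its arguments are assumptions, not code from the original BERT release).

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_token_ids,
                mask_prob=0.15):
    """Illustrative BERT-style MLM corruption: 15% of tokens are selected;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    labels = input_ids.clone()

    # Select ~15% of positions, never touching special tokens ([CLS], [SEP], [PAD]).
    special = torch.zeros_like(input_ids, dtype=torch.bool)
    for tid in special_token_ids:
        special |= input_ids == tid
    selected = (torch.rand_like(input_ids, dtype=torch.float) < mask_prob) & ~special

    # The MLM loss is computed only on the selected positions.
    labels[~selected] = -100

    # 80% of the selected positions become [MASK].
    replace = selected & (torch.rand_like(input_ids, dtype=torch.float) < 0.8)
    input_ids = input_ids.masked_fill(replace, mask_token_id)

    # Half of the remaining 20% (10% overall) become a random vocabulary token.
    randomize = selected & ~replace & (torch.rand_like(input_ids, dtype=torch.float) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape, device=input_ids.device)
    input_ids = torch.where(randomize, random_tokens, input_ids)

    # The final 10% keep their original token but are still predicted.
    return input_ids, labels
```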

2.2 Finetuning

Following the pretraining phase, BERT and similar models undergo a process known as fine-tuning, which tailors them to specific tasks. In this phase, the final-layer output of the [CLS] token, typically the first token of the sequence, is usually employed as a representative summary of the entire input sequence and is finetuned along with other model parameters to fit various downstream tasks. This fine-tuning phase is essential for adapting the generalized pretraining to more specialized applications, thereby enhancing the model's performance on task-specific datasets.
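As a brief illustration of this setup, the sketch below attaches a classification head to the final-layer [CLS] embedding; the `encoder` call signature is a simplifying assumption rather than a specific library API.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Typical fine-tuning setup: the final-layer [CLS] embedding summarizes
    the sequence and feeds a task-specific classifier."""
    def __init__(self, encoder, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = encoder          # pretrained BERT encoder (any extra decoder discarded)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)  # (batch, seq, hidden)
        cls = hidden[:, 0]                                # [CLS] is the first token
        return self.classifier(self.dropout(cls))
```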

There are variations in the finetuning process. Some approaches suggest the possibility of dropping certain layers during fine-tuning for efficiency, indicating the feasibility of omitting pretrained layers [14]. Notable findings in this area highlight that:

  1. Up to 50% of pretrained transformer blocks can be pruned while retaining approximately 98% of the original performance on specific downstream tasks.

  2. The retention of lower layers is essential for maintaining performance in downstream applications.

  3. For certain tasks, a minimal subset of layers, as few as 3 out of 12, can sustain performance within a 1% variance of the full model's capacity.

DeBERTa introduces an approach in which the model is pretrained with an additional decoder that is then dropped during the fine-tuning phase [13]. Similarly, the T5 model can be adapted to similar finetuning tasks by utilizing only its encoder part [8]. In addition, BERT-of-Theseus presents a novel compression approach for BERT that progressively replaces original modules with compact substitutes during training [12]. This method enhances interaction between the original and compact models without introducing extra loss functions, ultimately outperforming existing knowledge distillation techniques on the GLUE benchmark.

2.3 Enhanced Masked Language Modeling Decoder in DeBERTa

DeBERTa introduces an innovative approach known as the Enhanced Masked Language Modeling Decoder (EMD), as mentioned above. Moreover, while BERT incorporates absolute positional encodings at the input layer, DeBERTa adopts a different strategy. In DeBERTa, absolute positions are incorporated after all Transformer layers, just before the softmax layer dedicated to masked token prediction. This design choice allows the model to capture relative positions within all Transformer layers and to utilize absolute positions as supplementary information during the decoding phase. However, the disentangled attention mechanism notably introduces additional parameters and computational overhead compared to the original attention mechanism. Therefore, a direct comparison between DeBERTa and BERT with the same number of layers might not be entirely equitable due to these architectural differences. As we will elaborate in the next section, our approach ensures fairness in comparison. The only difference from the original BERT lies in the pretraining phase, which is often just a one-time cost, whereas in practical applications, finetuning and inference are the recurring stages where computational efficiency matters more.

3 Methods

Considering the structural modifications of DeBERTa and their impact on the overall computational cost of the model, we find the concept of Enhanced Mask Decoders (EMD) to be particularly intriguing. The MLM decoder offers a solution in which it is utilized exclusively during the pretraining phase, thereby avoiding additional computational burdens during finetuning and serving. This aspect caught our attention and led us to explore the potential it holds for BERT pretraining. The primary intent of introducing the Enhanced Mask Decoder (EMD) in the DeBERTa model, as highlighted by the authors, was to integrate absolute position information before the output layer. However, insights from other studies suggest an alternative perspective [14]. The ability to understand the input text and to distill language into specific embeddings is already well established in the earlier layers, which indicates that using the later layers during BERT finetuning might be unnecessary.

In the related work, we discussed how T5 and BERT-of-Theseus demonstrate that variation between the pretraining and fine-tuning architectures can yield stronger performance [8, 12]. Building upon these insights, we hypothesize that carefully designed differences between these architectures can not only avoid performance degradation but may even enhance parts of the model. This leads us to believe that, with further research and meticulous design, the original BERT model itself has untapped potential for significant improvement. Motivated by these insights and our analysis, we developed the BERT Pretraining MLM Decoder, termed BPDec, a novel architecture that serves as an innovative paradigm for BERT pretraining.

3.1 MLM Decoder

Our approach with BPDec preserves the core encoder structure of the original BERT while augmenting it with a specially designed decoder that is active exclusively during pretraining. This additional decoder is crafted to enhance the BERT encoder's ability to understand and process language, drawing upon the strengths of the EMD approach. By incorporating this decoder, we aim to improve the overall performance of the BERT encoder without imposing significant computational costs during the subsequent phases of finetuning and deployment. The BPDec design is strategically aligned with the objectives of pretraining. The decoder is positioned right after the BERT encoder, just before the final dense layer and softmax layer dedicated to masked token prediction. It focuses on refining the model's ability to predict masked tokens, thereby expanding the depth and versatility of the encoded representations.
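The overall pretraining-time layout described here can be sketched as follows. The class and argument names (`BertWithBPDec`, `decoder_masks`) are our own illustrative choices; the encoder and the individual decoder blocks are treated as black boxes with an assumed `(hidden, mask)` interface.

```python
import torch.nn as nn

class BertWithBPDec(nn.Module):
    """Pretraining-time layout sketched from Section 3.1: the unchanged BERT
    encoder, a small stack of extra transformer blocks (BPDec) used only
    during pretraining, and the usual dense + softmax MLM head."""
    def __init__(self, encoder, decoder_layers, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder                        # vanilla BERT encoder, kept for finetuning
        self.decoder = nn.ModuleList(decoder_layers)  # pretraining-only BPDec blocks
        self.mlm_dense = nn.Linear(hidden_size, hidden_size)
        self.mlm_act = nn.GELU()
        self.mlm_norm = nn.LayerNorm(hidden_size)
        self.mlm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, encoder_mask, decoder_masks):
        # Encoder runs with the standard (restrictive) attention mask.
        hidden = self.encoder(input_ids, encoder_mask)
        # Each BPDec layer receives its own, progressively relaxed mask (Section 3.2).
        for layer, mask in zip(self.decoder, decoder_masks):
            hidden = layer(hidden, mask)
        # The random encoder/decoder output mixing (Section 3.3) would be applied
        # to `hidden` here; it is omitted in this sketch for brevity.
        hidden = self.mlm_norm(self.mlm_act(self.mlm_dense(hidden)))
        return self.mlm_head(hidden)  # logits over the vocabulary for MLM
```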

3.2 Gradual Relaxation of Attention Mask

Figure 1: Examples of attention heads with and without attention masks. (a) Attention heads do not attend to the masked embedding (highlighted in red) due to the attention mask. (b) The attention mask is disabled.
Figure 2: Examples of gradual relaxation of the attention mask. (a) Encoder layers are unaffected. (b) A portion of the masked positions is randomly unmasked. (c) In the final layer of BPDec, the remaining masked positions are unmasked and the attention mask is fully disabled.

Another key aspect of our proposed methodology in BPDec involves modifying the restrictions present in the standard multi-head attention mechanism of BERT, particularly in relation to masked positions. As shown in Figure 1, the BERT architecture avoids attending to embeddings at the 15% masked positions during multi-head attention by applying an attention mask to them during pretraining. The embeddings at masked positions are considered less informative and less accurate than those that are not masked. This mechanism helps ensure that the model is not unduly influenced by those masked and noised inputs.

In our MLM decoder, however, we remove this restriction in the decoder layers. This modification is rooted in the understanding that as the layers in the model become deeper, the output features become increasingly abstract and, conceivably, informative enough to infer the masked or noised content. Such information, though not fully revealed, can provide valuable context to the model. It encourages the model to formulate hypotheses and fosters a deeper alignment and understanding of the input based on a mix of noise and these formulated assumptions. However, this also brings a drawback: the model's sudden exposure to this hypothesized information about the masked or noised input, and the noise it carries, may have adverse effects. To mitigate these risks and enhance the benefits, we made two optimizations shown in Figure 2: 1) fully disabling the attention mask only in the last several decoder layers to prevent the early involvement of hypothesized information, and 2) progressively unmasking the attention mask across the decoder layers, starting with a randomized partial unmasking followed by a complete unmasking. This approach not only enriches the model's interpretative capacity but also allows for a more gradual training process, leading to enhanced overall performance, as shown in the experiments.
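A possible implementation of this schedule, assuming the [0.5, 1.0] unmasking rates used for the base models, is sketched below. The function name and tensor layout are our own assumptions.

```python
import torch

def relaxed_attention_masks(base_key_mask, masked_positions, unmask_rates=(0.5, 1.0)):
    """Build one key-side attention mask per decoder layer.

    base_key_mask:    (batch, seq) bool, True where the encoder may attend
                      (i.e. positions that are not MLM-masked, not padding).
    masked_positions: (batch, seq) bool, True at the 15% corrupted positions.
    unmask_rates:     cumulative fraction of masked positions made visible at
                      each decoder layer, e.g. (0.5, 1.0) as in BPDec-base.
    """
    masks = []
    # Randomly score the masked positions once, so unmasking is nested:
    # positions revealed at layer 1 stay revealed at layer 2.
    scores = torch.rand(base_key_mask.shape, device=base_key_mask.device)
    for rate in unmask_rates:
        reveal = masked_positions & (scores < rate)  # rate=1.0 reveals everything
        masks.append(base_key_mask | reveal)
    return masks  # list of (batch, seq) bool masks, one per decoder layer
```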

3.3 Random Mix of Encoder and Decoder Outputs

Figure 3: Random encoder-decoder output mixing.

In our approach, we introduce a modest blending of the encoder's output into the decoder's output, as shown in Figure 3. This technique serves a dual purpose. Firstly, it encourages the MLM task not to depend exclusively on the representations formed in the deeper layers of the decoder. Instead, it also takes into account the more direct and less processed information from the encoder stage. Drawing inspiration from the NEFTUNE method, which highlights the potential performance benefits of introducing controlled randomness during language model training, we adopt a similar principle in our approach [15]. Additionally, echoing the compression method of BERT-of-Theseus, whose authors show that random layer replacement is an effective approach to model compression [12], randomly skipping decoder layers in our case can be seen as a form of compression as well. Our experimental results reveal that using 80% of the decoder output mixed with 20% of the encoder output can significantly improve performance on downstream tasks.
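A minimal sketch of this mixing step is shown below, assuming a 0.8 decoder ratio and embeddings selected wholesale along the (batch, seq) dimensions as described in Section 4.4.3; the helper name is hypothetical.

```python
import torch

def mix_encoder_decoder(encoder_out, decoder_out, decoder_ratio=0.8):
    """Randomly pick each output embedding wholesale from the decoder (with
    probability decoder_ratio) or from the encoder, per (batch, seq) position."""
    batch, seq, _ = decoder_out.shape
    take_decoder = torch.rand(batch, seq, 1, device=decoder_out.device) < decoder_ratio
    # Broadcasting keeps each embedding intact: a position is either fully
    # decoder output or fully encoder output, never a blend within the vector.
    return torch.where(take_decoder, decoder_out, encoder_out)
```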

4 Results

To examine the benefits of BPDec, we conduct experiments on various NLP tasks against several baseline models.

4.1 Pretraining

Pretraining was conducted on both base and large model sizes. Additionally, we introduced a variant of the base model (base-h256) to further demonstrate the generalizability of our method on smaller models. The encoder architecture, pretraining data, pretraining tasks, and mask settings are identical to those of the original BERT models. The change in BPDec lies solely in the addition of a decoder following the encoder. Details about the parameters of this decoder can be found in Table 1. This strategic augmentation aims to enhance the model's capabilities while maintaining the foundational architecture of the BERT encoder, and it also ensures a high degree of fairness in our comparative experiments.

BERT+BPDec Hyperparam base-h256 base large
Number of decoder layers 2 2 4
decoder layers with unmasking [1, 2] [1, 2] [1, 3]
decoder layer unmasking rates [0.5, 1.0] [0.5, 1.0] [0.5, 1.0]
% of mix with encoder output 20 20 20
Number of encoder layers 12 12 24
Hidden size 256 768 1024
FFN inner hidden size 1024 3072 4096
Number of attention heads 4 12 16
Attention head size 64 64 64
Table 1: Hyperparameters for the decoders used in BERT+BPDec-base-h256, BERT+BPDec-base, and BERT+BPDec-large pretraining. The architectural hyperparameters, including hidden size and the number of attention heads, are exactly the same as those of the corresponding BERT models. These unchanged parameters are highlighted with a gray background in the table for easy reference.
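For reference, the decoder settings in Table 1 can be written as a plain configuration; the dictionary keys below are our own (hypothetical) naming, not part of any released code.

```python
# BPDec decoder hyperparameters from Table 1 (illustrative naming).
BPDEC_CONFIGS = {
    "base-h256": dict(decoder_layers=2, unmask_layers=[1, 2],
                      unmask_rates=[0.5, 1.0], encoder_mix_pct=20),
    "base":      dict(decoder_layers=2, unmask_layers=[1, 2],
                      unmask_rates=[0.5, 1.0], encoder_mix_pct=20),
    "large":     dict(decoder_layers=4, unmask_layers=[1, 3],
                      unmask_rates=[0.5, 1.0], encoder_mix_pct=20),
}
```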

In our study, we conducted a comparative analysis using the BERT model, the BERT model with pre-LN (pre-layer normalization) [16], and DeBERTa [13] trained under identical settings as baselines. Here, we utilized the same hyperparameters, tokenizer, and pretraining dataset to isolate the impact of other factors on model performance. Our focus was solely on determining the effectiveness of the architectural change. In RoBERTa [9], the authors significantly extended the pretraining duration, increasing the number of steps from 100K to 500K. This led to substantial improvements in downstream task performance. The authors also emphasized that even the longest-trained model did not show signs of overfitting. This observation highlights the importance of controlling for training steps and pretraining dataset when evaluating architectural modifications, as they can act as confounding factors that influence model performance. Without careful consideration of such factors, drawing definitive conclusions about the true impact of architectural changes could be misleading.

Moreover, it is important to note that although the DeBERTa model was trained for the same number of steps as the other models, its disentangled attention mechanism introduced a significant overhead compared with the other baseline models and with our BERT+BPDec models. This added complexity affects both the pretraining and fine-tuning phases, whereas our BERT+BPDec models only incur extra overhead during pretraining. The implications of this increased overhead are critical, especially when considering the balance between model performance and computational efficiency. Our analysis aims to provide insights into how these different architectures impact training and computation efficiency, while also maintaining or enhancing language processing capabilities.

4.2 Performance on GLUE Tasks

We summarize the results on eight Natural Language Understanding (NLU) tasks from the GLUE benchmark. These tasks and their corresponding metrics are as follows:

  • AX (GLUE Diagnostic Task): Tests ability to understand linguistic phenomena such as logic, predicate-argument structure, and lexical semantics. The evaluation metric is typically overall accuracy across all diagnostic categories.

  • CoLA (Corpus of Linguistic Acceptability): Measures grammatical correctness. The metric used is Matthew’s correlation.

  • MNLI-m/mm (Multi-Genre NLI - Matched/Mismatched): Assesses the ability to predict textual entailment across different genres. Accuracy is the metric for both the matched (MNLI-m) and mismatched (MNLI-mm) sections.

  • MRPC (Microsoft Research Paraphrase Corpus): Focuses on identifying whether two sentences are paraphrases of each other. Accuracy is used as the main evaluation metric.

  • QNLI (Question NLI): Involves determining whether a context sentence contains the answer to a question. Accuracy is the evaluation metric.

  • QQP (Quora Question Pairs): Focuses on determining if two questions asked on Quora are semantically equivalent. The metrics for evaluation are accuracy and F1 score.

  • RTE (Recognizing Textual Entailment): Sentence entailment task. Accuracy is used as the metric.

  • STS (Semantic Textual Similarity Benchmark): Evaluates the degree of semantic similarity between two sentences. The primary metric is Pearson and Spearman correlation coefficients.

These tasks provide a broad evaluation of the model's performance across various dimensions of NLU. The experimental results are shown in Table 2.

Model AX COLA MNLI-m/mm MRPC QNLI QQP RTE SST Avg
BERT-base-h256 [1] 78.99 75.26 79.09/79.83 81.86 87.81 88.45 61.37 87.73 80.04
BERT-PreLN-base-h256 [16] 76.70 69.13 76.48/77.87 71.81 83.40 86.59 55.96 86.01 75.99
DeBERTa-base-h256 [13] 79.04 74.50 79.41/79.99 78.19 87.88 88.68 64.62 88.76 80.11
BERT+BPDec-base-h256 79.02 75.74 79.08/80.33 81.86 88.05 88.46 62.09 89.33 80.44
BERT-base [1] 85.05 83.13 84.67/84.85 88.48 91.84 90.85 71.92 93.58 86.04
BERT-PreLN-base [16] 85.03 83.89 85.10/85.45 88.24 91.96 90.78 71.56 92.89 86.10
DeBERTa-base [13] 85.43 82.74 85.51/85.66 88.48 92.17 91.18 71.79 93.58 86.28
BERT+BPDec-base 85.76 83.13 85.66/85.59 88.73 92.39 90.98 71.12 93.81 86.35
BERT-large [1] 87.21 85.43 87.32/87.49 87.50 93.47 91.44 72.92 94.04 87.42
BERT-PreLN-large [16] 87.11 84.99 86.91/87.41 88.24 93.70 91.24 75.81 93.69 87.68
DeBERTa-large [13] 87.23 85.62 87.01/87.54 87.75 93.14 91.17 76.53 94.84 87.87
BERT+BPDec-large 87.12 85.81 87.36/87.28 88.24 93.28 91.42 76.90 94.27 87.96
Table 2: Results on 8 GLUE benchmark tasks. Note that the results presented here are not based on publicly available checkpoints but are derived from models we trained ourselves with aligned settings. Given that DeBERTa incurs higher costs in finetuning and has more parameters, its corresponding results are highlighted in gray for comparative analysis.

In our finetuning experiments, each model was trained on the designated training set and subsequently evaluated on the development set. Table 2 illustrates that under identical conditions and with the same number of pretraining samples, BERT with the BPDec enhancement consistently outperforms both the BERT and DeBERTa models on many tasks and on the average score. Notably, MNLI, the largest dataset among these tasks, serves as a robust indicator of model performance. Our model competes closely with the similarly sized DeBERTa model, achieving slightly higher scores in the matched condition and comparable results in the mismatched one. However, it is crucial to consider that the DeBERTa model's encoder is approximately 17% larger in parameter count than the BERT encoder, and it incurs a 32% higher finetuning cost and a 29% increase in inference latency. Despite these disparities, BERT+BPDec matches or even surpasses DeBERTa on various benchmarks at the base-h256, base, and large sizes. Moreover, BERT+BPDec achieves significant improvement over the original BERT at exactly the same finetuning and serving costs. These results indicate that this decoder-only enhancement significantly boosts performance without incurring additional costs. Building on this, we extended our testing to question answering tasks, further solidifying the conclusion that our work can yield substantial performance gains.

4.3 Performance on SQuAD Tasks

SQuAD v1 SQuAD v2
Model EM F1 EM F1
BERT-base-h256 [1] 75.05 83.55 67.07 69.75
BERT-PreLN-base-h256 [16] 74.18 80.26 65.39 67.96
DeBERTa-base-h256 [13] 75.91 84.31 67.32 69.87
BERT+BPDec-base-h256 75.74 84.14 67.97 70.82
BERT-base [1] 83.70 90.63 76.83 79.78
BERT-PreLN-base [16] 83.65 90.58 77.03 79.87
DeBERTa-base [13] 83.75 90.78 77.87 80.94
BERT+BPDec-base 84.20 91.12 78.11 81.23
BERT-large [1] 86.58 92.92 82.06 85.22
BERT-PreLN-large [16] 85.94 92.56 81.19 84.42
DeBERTa-large [13] 86.91 92.96 82.09 85.12
BERT+BPDec-large 86.11 92.70 82.12 85.39
Table 3: Results on SQuAD v1 and v2 tasks. Given that DeBERTa incurs higher costs in finetuning and has more parameters, its corresponding results are highlighted in gray for comparative analysis.

The Stanford Question Answering Dataset (SQuAD) serves as a benchmark for evaluating question answering systems in NLP. It comprises two versions: SQuAD v1, focused on answerable questions from given passages, and SQuAD v2, which includes both answerable and unanswerable questions. Performance is measured using Exact Match (EM) and F1 Score, reflecting the precision and accuracy of the model’s responses.
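For readers unfamiliar with these metrics, the sketch below computes simplified versions of EM and token-level F1; note that the official SQuAD evaluation script additionally strips punctuation and articles, which we omit here, and the helper names are our own.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """Exact Match after lowercasing and whitespace normalization."""
    return " ".join(prediction.lower().split()) == " ".join(gold.lower().split())

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answer spans."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```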

The experiments conducted on both the v1 and v2 tasks, as presented in Table 3, followed a similar protocol to the GLUE experiments: all models were trained on the training set and evaluated on the development set. While the results for BERT+BPDec on SQuAD v1 did not achieve the highest scores in the table, its performance on SQuAD v2, particularly for the base and base-h256 sized models, was notably stronger than the original BERT. Especially when considering that BERT+BPDec competes closely with, or even surpasses, the DeBERTa model, the improvement is significant. These results underscore BPDec's efficiency and efficacy, showcasing its ability to deliver robust performance in question answering tasks, a key area in NLP, without additional overhead.

4.4 Ablation Study

BPDec involves multiple architectural modifications and sources of randomization, which introduce several hyperparameters. It is essential to clearly articulate whether each change is necessary and how much it contributes to the final result. Furthermore, understanding the influence of each major parameter on the outcome is also crucial. To further demonstrate the effectiveness of our method, we conducted several ablation experiments on selected GLUE tasks and SQuAD tasks with base-level sized models.

4.4.1 Number of MLM Decoder Layers

Model AX COLA MNLI-m/mm MRPC QNLI QQP SQuAD v2 EM/F1
BERT-base 85.05 83.13 84.67/84.85 88.48 91.84 90.85 76.83/79.78
+ 1 decoder layers 84.89 83.28 85.30/85.06 88.24 91.67 90.78 76.92/80.12
+ 2 decoder layers 85.10 83.41 84.82/85.53 88.97 91.31 90.99 77.67/81.13
+ 3 decoder layers 84.91 83.41 84.48/84.53 87.75 91.62 90.78 77.14/80.19
+ 4 decoder layers 84.74 81.21 84.19/84.64 88.24 91.01 91.11 75.89/78.72
+ 6 decoder layers 83.49 80.35 83.98/84.19 86.52 91.01 90.81 74.21/77.03
Table 4: Comparative results on various tasks with different numbers of decoder layers added to BERT-base.

In our study, we initially sought to determine the optimal number of MLM decoder layers for improving model performance. To achieve this, we conducted an ablation study focusing solely on increasing the number of decoder layers during pretraining, without implementing other changes. During finetuning, we use the same number of layers as the BERT-base model.

Our findings in Table 4 revealed that adding two decoder layers to the BERT-base model yielded the best results. Parallel experiments were conducted on the larger model as well, where adding four decoder layers to the BERT-large model proved most effective. These results led us to conclude that augmenting the model with additional decoder layers equivalent to one-sixth of the number of encoder layers seems to be the optimal strategy.
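This rule of thumb can be expressed as a small helper; the function is purely illustrative of our reading of the ablation, not part of any released code.

```python
def suggested_decoder_layers(num_encoder_layers: int) -> int:
    """Heuristic from the ablation: roughly one-sixth of the encoder depth,
    e.g. 12 encoder layers -> 2 decoder layers, 24 -> 4."""
    return max(1, num_encoder_layers // 6)
```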

4.4.2 Decoder Layers with Attention Mask Relaxation

Decoder layer unmasking rates AX COLA MNLI-m/mm MRPC QNLI QQP SQuAD v2 EM/F1
[0.0, 0.0] 85.10 83.41 84.82/85.53 88.97 91.31 90.99 77.67/81.13
[1.0, N/A] 84.77 82.84 84.80/85.08 87.50 91.32 90.86 77.68/80.82
[0.0, 1.0] 84.95 83.70 84.49/85.02 87.75 91.43 90.86 78.23/81.26
[0.5, 1.0] 85.08 83.13 85.08/85.40 88.46 91.62 90.99 78.83/81.73
Table 5: Comparative results on various tasks for a 12-layer BERT + 2-layer decoder model with different configurations regarding the application of attention mask restrictions during pretraining. Finetuning is applied to the BERT encoder only. The configurations are denoted by lists, where each list specifies the two random unmasking rates for the first and second decoder layers.

In our ablation study shown in Table 5, we examined the effects of releasing attention mask restrictions across various configurations of the BPDec model during MLM pretraining. Initially, we removed the attention mask entirely for the decoder, which had no positive impact on performance. However, a notable improvement emerged when the same restriction release was applied only to a deeper layer of the BPDec decoder. Moreover, we found an optimal configuration when we gradually remove the mask: the first decoder layer randomly unmasks 50% of the masked positions, and the second decoder layer releases the remaining 50%. This specific adjustment yielded the most significant performance enhancements. These results highlight the nuanced interplay between model architecture and attention mechanisms, emphasizing the potential of targeted modifications in the MLM decoder to achieve better outcomes in complex language processing tasks. Additionally, we observed that the improvement on the SQuAD task after adding attention mask relaxation was significantly larger than the gains on other classification tasks. Considering that SQuAD tasks generally have longer and more complex contexts, this indirectly indicates that our improvement enhances the model's ability to understand context.

4.4.3 Effectiveness of Output Random Mix

Ratio of Decoder AX COLA MNLI-m/mm MRPC QNLI QQP SQuAD v2 EM/F1
0.5 84.64 83.22 84.51/85.28 87.99 92.17 91.09 76.22/79.29
0.6 84.65 82.45 85.04/85.05 87.75 91.93 90.99 77.43/80.45
0.7 84.81 83.89 85.08/85.15 87.99 92.37 91.08 78.31/81.49
0.8 85.76 83.13 85.66/85.59 88.73 92.39 90.98 78.11/81.23
1.0 85.08 83.13 85.08/85.40 88.46 91.62 90.99 78.83/81.73
Table 6: Comparative results on various tasks with different ratios of decoder output in a BERT+BPDec-base model during pretraining.

In our ablation studies, we explored the integration of randomness into the model output before the Softmax layer for the Masked Language Modeling task. A particularly effective method we found is to randomly alternate between using outputs from the decoder and the encoder. We apply randomness only along the (batch, seq) dimensions, maintaining the cohesiveness of individual output embeddings. This means that each output embedding is sourced entirely from either the decoder or the encoder.

Furthermore, we experimented with varying mix ratios of decoder and encoder outputs. As shown in Table 6, the most effective ratio during pretraining was found to be 80% decoder output and 20% encoder output. This specific mix ratio led to the most balanced performance, underlining the importance of fine-tuning the balance between the two types of outputs to achieve optimal results in the finetuning phase. The 80% ratio for utilizing decoder output emerged as a reasonable choice. If this ratio were lower, it might result in insufficient training of the decoder, diminishing its potential benefits. Conversely, if the ratio were higher, it could undermine the intended effect of randomness.

From the perspective of experimental results, although completely avoiding random output mix yielded the best performance on the SQuAD task, random output mix provides us with an excellent method to balance the model’s capabilities across different tasks. We believe that the 80% rate trades off a portion of SQuAD’s context understanding ability in exchange for better performance on other GLUE tasks, ultimately achieving a good balance. Therefore, this balanced approach not only ensures effective training of the decoder but also maintains the necessary level of randomness to optimize the model’s overall performance.

4.5 Generalizability to Other BERT-like Models

To further demonstrate the effectiveness of BPDec, we applied the same methodology to ALBERT [17] and BERT-PreLN [16] models. For ALBERT, the BPDec decoder does not share parameters with the ALBERT encoder, but the decoder layers within BPDec share parameters with each other. For BERT-PreLN, we followed the encoder's structure and used pre-LayerNorm in BPDec. Experiments on the GLUE benchmark and SQuAD dataset showed consistent performance improvements across both models.

Model AX COLA MNLI-m/mm MRPC QNLI QQP SQuAD v2 EM/F1
ALBERT-base 84.34 80.15 84.46/84.51 85.78 91.03 90.81 75.62/78.77
ALBERT+BPDec-base 84.43 81.78 84.82/85.09 87.75 92.04 90.90 76.71/79.93
BERT-PreLN-base 85.03 83.89 85.10/85.45 88.24 91.96 90.78 77.03/79.87
BERT-PreLN+BPDec-base 85.42 85.08 85.55/85.86 88.75 91.71 90.77 78.04/80.92
Table 7: Comparative results of ALBERT and BERT-PreLN models with and without BPDec.

These additional experiments, shown in Table 7, demonstrate the effectiveness, adaptability, and generalizability of the BPDec method, highlighting its potential for broader application in various BERT-like architectures. The consistent improvements observed across different model types suggest that BPDec's principles can be successfully transferred to enhance the performance of a wider range of models.

5 Conclusion

In this paper, we introduced BERT+BPDec, an enhanced version of the original BERT model, focusing on improvements to the Masked Language Modeling (MLM) decoder used in pretraining. Key innovations include adding transformer block layers as the MLM decoder, relaxing the restriction on attending to masked positions within the MLM decoder, and introducing a degree of output randomness by randomly mixing encoder and decoder outputs. Through rigorous evaluations on tasks such as MNLI, SQuAD, and RACE, the BERT+BPDec model demonstrated superior performance over the original BERT and other state-of-the-art models, without increasing computational complexity in subsequent finetuning and inference. The ablation study confirmed the effectiveness of each modification, highlighting BPDec's potential in advancing NLP efficiency and effectiveness. The optimal hyperparameter settings we provide for BPDec not only improve the model's performance and efficiency but also contribute to sustainable computing by reducing energy consumption and emissions. This work has practical implications for real-world applications of language models and paves new avenues for future research on optimizing the masked language modeling pretraining process.

References

  • [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [3] Wen Liang, Youzhi Liang, and Jianguo Jia. Miamix: Enhancing image classification through a multi-stage augmented mixed sample data augmentation method. Processes, 11(12), 2023.
  • [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [5] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context, 2019.
  • [6] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2020.
  • [7] Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4271–4282. Curran Associates, Inc., 2020.
  • [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2020.
  • [9] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
  • [10] Alexander Wettig, Tianyu Gao, Zexuan Zhong, and Danqi Chen. Should you mask 15% in masked language modeling?, 2022.
  • [11] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529, 2019.
  • [12] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. Bert-of-theseus: Compressing BERT by progressive module replacing. CoRR, abs/2002.02925, 2020.
  • [13] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention, 2021.
  • [14] Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. Poor man’s BERT: smaller and faster transformer models. CoRR, abs/2004.03844, 2020.
  • [15] Neel Jain, Ping yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Neftune: Noisy embeddings improve instruction finetuning, 2023.
  • [16] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture, 2020.
  • [17] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942, 2019.