ELBERT: FAST ALBERT WITH CONFIDENCE-WINDOW BASED EARLY EXIT
Abstract
Despite their great success in the Natural Language Processing (NLP) area, large pre-trained language models like BERT are not well-suited for resource-constrained or real-time applications owing to their large number of parameters and slow inference speed. Recently, compressing and accelerating BERT have become important topics. By incorporating a parameter-sharing strategy, ALBERT greatly reduces the number of parameters while achieving competitive performance. Nevertheless, ALBERT still suffers from a long inference time. In this work, we propose ELBERT, which significantly improves the average inference speed compared to ALBERT thanks to the proposed confidence-window based early exit mechanism, without introducing additional parameters or extra training overhead. Experimental results show that ELBERT achieves an adaptive inference speedup varying from 2× to 10× with negligible accuracy degradation compared to ALBERT on various datasets. Besides, ELBERT achieves higher accuracy than existing early exit methods used for accelerating BERT under the same computation cost. Furthermore, to understand the principle of the early exit mechanism, we also visualize its decision-making process in ELBERT. Our code is publicly available online at https://github.com/shakeley/ELBERT
Index Terms— Natural Language Processing, BERT, Inference Acceleration, Early Exit, Model Compression
1 Introduction
In recent years, large pre-trained language models (e.g., BERT[1], RoBERTa[2], XLNet[3]) have made remarkable improvements in many Natural Language Processing (NLP) tasks. However, the great success is achieved at the cost of millions of parameters and huge computation cost. Hence, employing those models in resource-constrained and real-time scenarios is quite difficult.
To improve the applicability of BERT, some works using common compression methods have been proposed, such as Weight Pruning[4], Quantization[5] and Knowledge Distillation[6]. Compared with models based on these methods, ALBERT[7] greatly reduces the number of parameters and memory consumption by sharing parameters, and achieves even better performance than BERT. However, ALBERT does not reduce the computation cost or inference time.
Redundancy[8] and overthinking[9] are knotty problems that large models often suffer from. Early exit is a method that exploits differences in input complexity to avoid redundant computation and overthinking, thereby accelerating inference. Inputs judged as simple are processed with only a part of the whole model. Early exit enables one-for-all[10]: a single trained model can meet different accuracy-speed trade-offs by adjusting the input-complexity criterion at inference time only, whereas other common compression methods require time-consuming re-training.
In this paper, we propose ELBERT, a fast ALBERT coupled with a confidence-window based early exit mechanism, which achieves high-speed inference without introducing additional parameters. Specifically, ELBERT uses ALBERT as the backbone model (it is also compatible with other BERT-like models). The confidence-window based early exit mechanism enables input-adaptive efficient inference, saving both inference time and computation cost. We conduct extensive experiments on various datasets. The results show that ELBERT achieves at least 2× inference speedup while maintaining or even improving accuracy, and up to 10× speedup with negligible accuracy degradation.
The main contributions of this paper can be summarized as follows: 1) To the best of our knowledge, we are the first to propose a confidence-window based early exit mechanism. 2) We propose ELBERT, which achieves better performance than existing early exit methods used for accelerating BERT on many NLP tasks. 3) We visualize the decision-making process of the early exit in ELBERT, which sheds light on its internal mechanism.
2 Related Work
Prior works in model compression can be mainly divided into two categories:
A. Structure-wise compression methods try to remove the unimportant elements of models. For Weight Pruning, Gordon et al.[11] applied magnitude-based pruning to BERT, and Michel et al.[12] pruned BERT based on the gradients of weights. For Quantization, Q-BERT[13] utilized a Hessian-based mixed-precision approach to compress BERT, while Q8BERT[14] quantized BERT using symmetric linear quantization. Besides, Knowledge Distillation is applied by Tang et al.[15], Sun et al.[16], DistilBERT[17] and TinyBERT[18] to obtain lightweight BERT variants.
B. Input-wise compression methods focus on avoiding redundant computation based on the complexity of inputs. BranchyNet[19] proposed an entropy-based confidence measurement. Shallow-Deep Nets[9] mitigated the overthinking problem with an early exit mechanism. LayerDrop[20] randomly dropped layers at training time, allowing sub-networks of any desired depth to be selected at inference. Concurrently, DeeBERT[21] and the Right Tool[22] applied the basic early exit method to BERT. FastBERT[23] proposed a self-distillation method for fine-tuning. However, these works only explored the intermediate state of the classifier, while ELBERT proposes a two-stage early exit mechanism. Coincidentally, Zhou et al.[24] first proposed a criterion similar to one of the criteria proposed in this work.

3 Methodology
3.1 Model Architecture
As shown in Fig. 1, ELBERT uses ALBERT as the backbone model, which is composed of an encoder and a classifier. Additionally, ELBERT makes an early exit decision after each forward propagation through the encoder and the classifier.
3.2 Training
To fit the early exit mechanism used in inference, the losses of inputs exiting at different depths of ELBERT are calculated during training. For classification, the early exit loss at the $i$-th layer is calculated with Cross-Entropy

$$\mathcal{L}_i = -\sum_{z \in Z} \mathbb{1}[y = z]\,\log p_i(z), \qquad (1)$$

where $z$ and $Z$ denote one class label and the set of class labels, respectively, $y$ is the ground-truth label, and $p_i$ is the prediction probability distribution of the classifier at the $i$-th layer. The common practice is to simply add up the $\mathcal{L}_i$ as the total loss [21, 23]. For better training under various combinations of losses, we assign a trainable variable $a_i$ with an initial value of 4 to each layer, inspired by Wang et al.[25]. The weight $w_i$ of the $i$-th layer is calculated by

$$w_i = \frac{\sigma(a_i)}{\sum_{j=1}^{D} \sigma(a_j)}, \qquad (2)$$

where $D$ denotes the depth of ELBERT and $\sigma$ denotes the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$. Then the total loss is calculated by a weighted sum

$$\mathcal{L} = \sum_{i=1}^{D} w_i \mathcal{L}_i. \qquad (3)$$
In this way, the possibility that inputs exit at different depths is well accounted for, which helps to bridge the gap between training and inference in ELBERT.
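As a concrete illustration, below is a minimal PyTorch-style sketch of this weighting scheme, assuming the per-layer classifier logits are already available; names such as WeightedExitLoss and layer_logits are ours and not taken from the released code.

```python
import torch
import torch.nn.functional as F

class WeightedExitLoss(torch.nn.Module):
    """Sketch of Eqs. (1)-(3): trainable, sigmoid-normalized per-layer loss weights."""

    def __init__(self, depth, init_value=4.0):
        super().__init__()
        # One trainable scalar a_i per layer, initialized to 4 as described in the text.
        self.a = torch.nn.Parameter(torch.full((depth,), init_value))

    def forward(self, layer_logits, labels):
        # Eq. (1): cross-entropy loss of the classifier output at each layer.
        losses = torch.stack([F.cross_entropy(logits, labels) for logits in layer_logits])
        # Eq. (2): w_i = sigmoid(a_i) / sum_j sigmoid(a_j)
        weights = torch.sigmoid(self.a)
        weights = weights / weights.sum()
        # Eq. (3): weighted sum of the per-layer losses.
        return (weights * losses).sum()
```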
3.3 Inference
ELBERT is the first to introduce a two-stage early exit mechanism, which considers both the intermediate state and the historical trend of the classifier output to decide whether to exit the computation early.
Formally, the input goes through the shared encoder iteratively. The hidden state $h_i$ after the $i$-th forward propagation of the encoder is calculated by

$$h_i = \mathrm{Encoder}(h_{i-1}), \qquad (4)$$

where $h_0$ is the embedding of the input. After each forward propagation of the encoder, the hidden state $h_i$ is sent to the classifier, which outputs a prediction probability distribution $p_i$ via a fully-connected layer and the softmax function for classification. Then we obtain the predicted label $\hat{y}_i = \arg\max_{z \in Z} p_i(z)$.
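For concreteness, one such iteration can be sketched as follows (a simplified PyTorch-style sketch assuming a single shared encoder block and a classifier head operating on the [cls] position; real ALBERT layers also take attention masks, which we omit here).

```python
def forward_step(encoder, classifier, hidden):
    """One iteration of Eq. (4): h_i = Encoder(h_{i-1}), followed by the shared classifier."""
    hidden = encoder(hidden)             # reuse the parameter-shared encoder block
    logits = classifier(hidden[:, 0])    # classify on the [cls] representation
    probs = logits.softmax(dim=-1)       # prediction distribution p_i
    return hidden, probs                 # predicted label: probs.argmax(dim=-1)
```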
The first stage of the early exit focuses on the confidence, or intermediate state, of the classifier. Given a probability distribution $p_i$, we take its normalized entropy $U_i$ as the uncertainty of the current classifier, which is calculated by

$$U_i = \frac{-\sum_{z \in Z} p_i(z) \log p_i(z)}{\log N}, \qquad (5)$$

where $N$ denotes the number of labeled classes. The model stops the inference in advance and takes $\hat{y}_i$ as the final prediction, skipping further computation, when $U_i < \theta$, where $\theta$ is a user-defined threshold. When a faster model is needed and some accuracy degradation is tolerable, we can set a higher $\theta$, while the opposite situation calls for a lower $\theta$.
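A sketch of this first-stage check is shown below (batch size 1 is assumed, matching our inference setting; the small clamp is only for numerical safety and is not part of Eq. (5)).

```python
import math
import torch

def normalized_entropy(probs: torch.Tensor) -> float:
    """Eq. (5): entropy of the distribution p_i divided by log(N), so U_i lies in [0, 1]."""
    n_classes = probs.shape[-1]
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    return entropy / math.log(n_classes)

def first_stage_exit(probs: torch.Tensor, threshold: float) -> bool:
    # Exit as soon as the classifier is confident enough,
    # i.e. the normalized entropy U_i falls below the threshold theta.
    return normalized_entropy(probs) < threshold
```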
The second stage traces the historical trend of the classifier output within a time window, whose size is defined based on user demands. We propose three criteria for triggering the second-stage early exit within a time window: 1) the prediction probability of a certain class varies monotonically; 2) the range of the prediction probability is less than a set value; 3) the predicted label stays the same. Experimental results show that the first criterion outperforms the others. In subsequent experiments, we use the first criterion for the second stage by default, and the window size is set to 8.
In general, we prefer to exit at the moment the model becomes sufficiently confident: only when the first-stage condition is not satisfied do we consider the second-stage early exit.
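Putting the pieces together, the inference loop might look like the following sketch, which reuses forward_step and normalized_entropy from the earlier sketches and implements the first window criterion under our reading of "varies monotonically" (the probability of some class is non-decreasing or non-increasing over the window); it is an illustration, not the released implementation.

```python
import torch

def elbert_inference(encoder, classifier, embedding_output, depth,
                     threshold=0.5, window_size=8):
    """Two-stage early exit: entropy threshold first, trend window second."""
    hidden = embedding_output          # h_0: embedded input, batch size 1
    prob_history = []                  # p_i of every layer seen so far
    for i in range(depth):
        hidden, probs = forward_step(encoder, classifier, hidden)
        pred = int(probs.argmax(dim=-1))
        prob_history.append(probs[0])

        # Stage 1: exit as soon as the normalized entropy drops below theta.
        if normalized_entropy(probs) < threshold:
            return pred, i

        # Stage 2 (criterion 1): exit if the probability of some class has
        # varied monotonically over the last `window_size` layers.
        if len(prob_history) >= window_size:
            recent = torch.stack(prob_history[-window_size:])  # (window, n_classes)
            diffs = recent[1:] - recent[:-1]
            monotone = ((diffs >= 0).all(dim=0) | (diffs <= 0).all(dim=0)).any()
            if monotone:
                return pred, i

    return pred, depth - 1  # no early exit: keep the final layer's prediction
```

With this structure, the same trained model can trade accuracy for speed at inference time simply by changing threshold and window_size, which is the one-for-all property discussed in Section 1.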
4 Experiments
4.1 Baselines
We select three baselines. 1) Original model: We choose ALBERT-large (depth=24). 2) Plain compression: We evaluate several models with smaller depths based on ALBERT-large. 3) Early exit approach: The methods in DeeBERT and FastBERT are applied to ALBERT for comparison.
4.2 Datasets
We evaluate ELBERT on several text classification datasets, including GLUE tasks[26] (e.g., SST-2, QNLI, RTE), AG News[27] and IMDB[28].
4.3 Experimental Setup
Training: For GLUE tasks, we use the corresponding hyperparameters from the original ALBERT paper for a fair comparison; for the other datasets, we use a default learning rate of 3e-5 and a batch size of 32.
Inference: In practical scenarios, user requests often arrive one at a time, so the inference batch size is set to 1, following prior work[21, 19]. The experiments are conducted on an NVIDIA 2080Ti GPU.
4.4 Main Results
Efficient inference acceleration We evaluate ELBERT on the above datasets and report the median of 5 runs in Fig. 2 and Fig. 3. The curves are drawn by interpolating several points corresponding to different values of the first-stage threshold $\theta$, which varies from 0.1 to 1.0 with a step size of 0.1. For all datasets, ELBERT achieves at least 2× inference speedup while maintaining or even improving the accuracy. When a small accuracy degradation is tolerable, the acceleration ratio can reach 10×. This demonstrates ELBERT's superiority in inference acceleration.

Task-related trends An interesting observation is that the curves in Fig. 2 show different trends on different kinds of tasks. For News Classification (AG), ELBERT achieves the best acceleration performance, followed by Sentiment Analysis (SST-2, IMDB), whose curves drop slightly faster. Natural Language Inference (QNLI, RTE) shows the lowest acceleration performance. This indicates that different tasks may have different internal characteristics and acceleration difficulty, and that early exit may help us understand tasks better. We discuss this further in Section 4.5.
Flexible and better accuracy-speed tradeoffs We compare different models on several datasets. The results are shown in Fig. 3, where the red star-shaped points represent different models obtained by plain compression.
Our first observation is that ELBERT significantly outperforms the plain compression models. Moreover, compared with other early-exit based methods, ELBERT obtains higher accuracy than both DeeBERT and FastBERT under the same computation cost, demonstrating its clear advantage over these approaches.
4.5 Visualization of Early Exit
To visualize the decision-making process of the early exit in ELBERT, we make some changes to BertViz[29], a tool for visualizing attention in the Transformer. We use the attention scores of each layer to compute cumulative attention scores, which allows us to see clearly how the attention relationships between tokens evolve as the input passes through different depths of ELBERT. Since ELBERT only takes the [cls] token as the sentence representation for classification, we only show the cumulative attention scores from [cls] to the other tokens in the figures.
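One plausible way to compute such cumulative scores is sketched below, assuming the per-layer attention tensors are available (e.g., via output_attentions=True in common Transformer implementations); averaging over heads and keeping a renormalized running sum are our interpretation of "cumulative attention scores", not necessarily the exact procedure used with BertViz.

```python
import torch

def cumulative_cls_attention(attentions):
    """attentions: list of per-layer tensors of shape (heads, seq_len, seq_len).

    Returns a (num_layers, seq_len) tensor: for each depth, the attention that
    the [cls] token (position 0) has paid to every token, accumulated over layers.
    """
    running = torch.zeros(attentions[0].shape[-1])
    cumulative = []
    for layer_attn in attentions:
        cls_attn = layer_attn.mean(dim=0)[0]        # average heads, take the [cls] row
        running = running + cls_attn
        cumulative.append(running / running.sum())  # renormalize for comparability
    return torch.stack(cumulative)
```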
We take SST-2, a Sentiment Analysis dataset, as an example, and find two main patterns of early exit.
Note: the results are based on our implementation on ALBERT-large.
Simple input, simple exit For the most common inputs, those without emotional turns or negative words, as shown in Fig. 4, the attention of [cls] to the emotional keywords (i.e., hampered) tends to increase monotonically. Early exit is triggered when such attention exceeds a certain limit determined by the exit criterion, thus reducing redundant computation. In fact, the prediction remains unchanged after the exit at layer 11.
Mitigating overthinking As Fig. 3 shows, ELBERT sometimes achieves even higher accuracy than the original model, indicating that the early exit mechanism corrects some wrong predictions of the final layer. As shown in Fig. 5, the model first pays attention to the commendatory word (i.e., benign) and predicts a positive label. Next, an irrelevant negative word (i.e., rarely) is noticed and treated as negating the commendatory word, so the model predicts a negative label. This is exactly an example of overthinking.
In correct cases, the negation and the corresponding word are often noticed simultaneously.
The above patterns demonstrate that ELBERT's prediction for classification is mainly determined by a few keywords, such as negation words and words with strong emotional orientation. The early exit mechanism helps establish appropriate attention to these words, which enables the model to exit early on simple inputs and avoid overthinking.
5 Conclusion
In this paper, we propose ELBERT, a fast ALBERT coupled with a confidence-window based early exit mechanism. Our empirical results demonstrate that ELBERT achieves excellent inference acceleration and outperforms other early exit methods used for accelerating BERT. Moreover, other models can easily achieve fast and flexible inference by adopting the proposed method. Our future work includes exploring the confidence-window based early exit mechanism on more kinds of models and combining our method with common compression methods.
[Fig. 4 and Fig. 5 panels: (a) Layer 11, (b) Layer 23]
References
- [1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
- [2] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [3] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in neural information processing systems, 2019.
- [4] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- [5] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
- [6] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in ICLR, 2019.
- [8] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky, “Revealing the dark secrets of bert,” in EMNLP-IJCNLP, 2019, pp. 4365–4374.
- [9] Y. Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking,” in ICML. PMLR, 2019, pp. 3301–3310.
- [10] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” in ICLR, 2019.
- [11] M. A. Gordon, K. Duh, and N. Andrews, “Compressing bert: Studying the effects of weight pruning on transfer learning,” arXiv preprint arXiv:2002.08307, 2020.
- [12] P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?,” in Advances in Neural Information Processing Systems, 2019, pp. 14014–14024.
- [13] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, “Q-bert: Hessian based ultra low precision quantization of bert.,” in AAAI, 2020, pp. 8815–8821.
- [14] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8bert: Quantized 8bit bert,” arXiv preprint arXiv:1910.06188, 2019.
- [15] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, “Distilling task-specific knowledge from bert into simple neural networks,” arXiv preprint arXiv:1903.12136, 2019.
- [16] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” in EMNLP-IJCNLP, 2019, pp. 4314–4323.
- [17] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
- [18] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
- [19] S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in ICPR, 2016, pp. 2464–2469.
- [20] A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout,” in ICLR, 2019.
- [21] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “Deebert: Dynamic early exiting for accelerating bert inference,” arXiv preprint arXiv:2004.12993, 2020.
- [22] R. Schwartz, G. Stanovsky, S. Swayamdipta, J. Dodge, and N. A. Smith, “The right tool for the job: Matching model and instance complexities,” in ACL, 2020.
- [23] W. Liu, P. Zhou, Z. Zhao, Z. Wang, H. Deng, and Q. Ju, “Fastbert: a self-distilling bert with adaptive inference time,” arXiv preprint arXiv:2004.02178, 2020.
- [24] W. Zhou, C. Xu, T. Ge, J. J. McAuley, K. Xu, and F. Wei, “BERT loses patience: Fast and robust inference with early exit,” CoRR, vol. abs/2006.04152, 2020.
- [25] M. Wang, J. Mo, J. Lin, Z. Wang, and L. Du, “Dynexit: A dynamic early-exit strategy for deep residual networks,” in SiPS. IEEE, 2019, pp. 178–183.
- [26] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in ICLR, 2019.
- [27] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in Neural Information Processing Systems, 2015.
- [28] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in ACL, 2011, pp. 142–150.
- [29] J. Vig, “A multiscale visualization of attention in the transformer model,” in ACL, 2019, pp. 37–42.