
F-PABEE: Flexible-patience-based Early Exiting for Single-label and Multi-label text Classification Tasks

Abstract

Computational complexity and overthinking problems have become the bottlenecks for pre-trained language models (PLMs) with millions or even trillions of parameters. We propose a Flexible-Patience-Based Early Exiting method (F-PABEE) to alleviate these problems for single-label classification (SLC) and multi-label classification (MLC) tasks. F-PABEE makes predictions at each classifier and exits early if the cross-layer predicted distributions are consecutively similar. It is more flexible than the previous state-of-the-art (SOTA) early exiting method PABEE because it can simultaneously adjust the similarity score thresholds and the patience parameters. Extensive experiments show that: (1) F-PABEE achieves a better speedup-accuracy balance than existing early exiting strategies on both SLC and MLC tasks. (2) F-PABEE achieves faster inference and better performance on different PLMs such as BERT and ALBERT. (3) F-PABEE-JSKD performs best among F-PABEE variants with different similarity measures.

Index Terms—  F-PABEE, PABEE, Early Exiting, Multi-label Classification, Single-label Classification

1 Introduction

Fine-tuning PLMs has become the de-facto paradigm in natural language processing [1], due to the impressive performance gains on a wide range of natural language processing tasks [2, 3, 4, 5, 6]. Despite SOTA performances, BERT [7] and its variants [8, 9, 10, 11] still face significant application challenges: cumbersome computation and overthinking problems caused by their huge parameter counts and deep architectures. Early exiting attracts much attention as an input-adaptive method to speed up inference [12]. Early exiting installs a classifier at each transformer layer to evaluate the predictions and exits once an exiting criterion is met. Three different early exiting strategies exist: (1) The confidence-based strategy evaluates the predictions based on specific confidence measurements. (2) The learning-based strategy learns a criterion for early exiting. (3) The patience-based strategy exits when consecutive classifiers make the same prediction. Among them, the patience-based strategy PABEE [13] achieves SOTA results.

We raise two issues for the current SOTA strategy: (1) PABEE faces a practical limitation: it cannot flexibly adjust the speedup ratio on a given task once the patience parameter is fixed, mainly because of its strict cross-layer comparison strategy. Thus, we wonder whether we can combine PABEE with a softer cross-layer comparison strategy. (2) Current early exiting strategies mainly focus on SLC tasks, while MLC tasks are neglected. Can they also speed up MLC tasks?

Therefore, we propose a Flexible-Patience-Based Early Exiting method (F-PABEE) to address the above issues. F-PABEE makes predictions at each classifier and exits early if the current layer and the last few layers produce similar predicted distributions (similarity scores below a threshold). F-PABEE can be seen as a natural extension of PABEE and is more flexible, since it achieves better speed-accuracy tradeoffs by adjusting both the similarity score thresholds and the patience parameters. It also extends to MLC tasks effortlessly.

Our contributions are summarized as follows: (1) We propose F-PABEE, a novel and effective inference mechanism that is flexible in adjusting the speedup ratios of PLMs. (2) The results show that our method can accelerate inference effectively while maintaining good performances across different SLC and MLC tasks. (3) We are the first to investigate the early exiting of MLC tasks, and F-PABEE is suitable for this type of task.

Fig. 1: Inference procedure of PABEE and F-PABEE. C_i is the classifier, thre is the threshold, and P_0 is the pre-defined patience.

2 Related works

2.1 Static inference approach

The static inference approach compresses the heavy model into a smaller one, including pruning, knowledge distillation, quantization, and weight sharing [14, 15, 16]. For example, HeadPrune [17] ranks the attention heads and prunes them to reduce inference latency. PKD [18] investigates the best practices of distilling knowledge from BERT into smaller-sized models. I-BERT [19] performs end-to-end BERT inference without any floating point calculation. ALBERT [8] shares parameters across layers. [20], [21], and [22] distill knowledge from a larger BERT teacher model to improve the performance of student networks learned with neural architecture search. Note that statically compressed models are still deep neural networks with multiple stacked layers, and the computational path is identical for all examples at inference time, which is inflexible.

2.2 Dynamic early exiting

Orthogonal to the static inference approach, early exiting dynamically adjusts its hyper-parameters in response to changes in request traffic. It requires no significant changes to the original model structure or weight bits, nor the training of separate teacher-student networks, which saves computing resources [23].

There are mainly three groups of dynamic early exiting strategies. The first type is confidence-based early exiting [24]. For example, BranchyNet [25], FastBERT [26], and DeeBERT [27] calculate the entropy of the prediction probability distribution to estimate the confidence of classifiers and enable dynamic early exiting. Shallow-Deep [28] and RightTool [29] leverage the maximum of the predicted distribution as the exiting signal. The second type is learning-based exiting, such as BERxiT [30] and CAT [31], which learn a criterion for early exiting. The third type is patience-based early exiting, such as PABEE [13], which stops inference and exits early if the classifiers' predictions remain unchanged for a pre-defined number of times. Among them, patience-based PABEE achieves SOTA performance. However, PABEE suffers from an overly strict cross-layer comparison, and its application to MLC tasks is neglected. There is also work focusing on improving the training of multi-exit BERT, such as LeeBERT [32] and GAML-BERT [33].

F-PABEE is a more flexible extension of PABEE that can simultaneously adjust the similarity score thresholds and patience parameters to meet different requirements. In addition, it outperforms other existing early exiting strategies on both SLC and MLC tasks.

2.3 Training of multi-exit backbones

The literature on early exiting focuses more on the design of early exiting strategies, thus neglecting advances in training methods for multi-exit backbones. LeeBERT [32] employs an adaptive learning method for training multiple exits. GAML-BERT [33] enhances the training of multi-exit backbones via a mutual learning approach.

| Method (setting) | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 |
| BERT base | 54.2 / 0% | 83.1 / 0% | 86.8 / 0% | 89.8 / 0% | 89.2 / 0% | 69.1 / 0% | 91.3 / 0% |
| Fixed-Exit-3L | 0.0 / 75% | 70.0 / 75% | 75.8 / 75% | 77.4 / 75% | 81.8 / 75% | 54.7 / 75% | 81.0 / 75% |
| Fixed-Exit-6L | 0.0 / 50% | 79.6 / 50% | 84.7 / 50% | 85.3 / 50% | 89.3 / 50% | 68.1 / 50% | 88.6 / 50% |
| BranchyNet (~75% speedup) | 0.0 / 74% | 63.8 / 76% | 75.7 / 76% | 74.2 / 80% | 71.6 / 80% | 54.7 / 76% | 79.9 / 76% |
| BranchyNet (~50% speedup) | 0.0 / 51% | 78.3 / 53% | 83.0 / 52% | 87.1 / 47% | 89.3 / 50% | 67.4 / 47% | 88.3 / 49% |
| Shallow-Deep (~75% speedup) | 0.0 / 75% | 64.1 / 77% | 75.6 / 76% | 74.3 / 78% | 71.4 / 79% | 54.7 / 76% | 79.5 / 77% |
| Shallow-Deep (~50% speedup) | 0.0 / 52% | 78.2 / 51% | 82.8 / 51% | 87.2 / 49% | 89.6 / 51% | 67.2 / 48% | 88.4 / 48% |
| BERxiT (~75% speedup) | 0.0 / 76% | 63.5 / 76% | 75.6 / 76% | 73.3 / 78% | 68.2 / 80% | 55.3 / 77% | 79.5 / 76% |
| BERxiT (~50% speedup) | 12.3 / 52% | 78.4 / 51% | 82.9 / 51% | 87.0 / 48% | 89.1 / 49% | 67.3 / 47% | 88.3 / 49% |
| PABEE (~75% speedup) | 0.0 / 75% | 63.9 / 77% | 75.8 / 75% | 73.6 / 81% | 68.6 / 82% | 55.8 / 75% | 79.9 / 77% |
| PABEE (~50% speedup) | 0.0 / 50% | 78.9 / 52% | 83.1 / 53% | 87.2 / 46% | 89.6 / 49% | 67.7 / 46% | 88.7 / 48% |
| F-PABEE (~75% speedup) | 0.0 / 75% | 66.9 / 72% | 81.5 / 77% | 76.2 / 75% | 79.6 / 82% | 56.0 / 76% | 80.5 / 76% |
| F-PABEE (~50% speedup) | 13.6 / 52% | 83.9 / 53% | 87.3 / 53% | 88.6 / 54% | 90.8 / 49% | 68.1 / 47% | 92.3 / 48% |
Table 1: Experimental results of different early exiting methods with the BERT backbone on the GLUE benchmark. Each cell reports score / speedup ratio; each dynamic method is shown at two operating points (roughly 75% and 50% speedup).

3 Flexible patience-based early exiting

3.1 Inference procedure for SLC and MLC tasks

The inference procedure of F-PABEE is shown in Fig 1(b), which is an improved version of PABEE (Fig 1(a)), where L_i is the i-th transformer block of the model, n is the number of transformer layers, C_i is the classifier inserted after layer i, s is the cross-layer similarity score, thre is the similarity score threshold, and P_0 is the pre-defined patience value.

The input sentences are first embedded as the vector:

h_{0} = \text{Embedding}(x). \qquad (1)

The vector is then passed through the transformer layers (L_1, ..., L_n) to extract features and compute the hidden states h. Internal classifiers (C_1, ..., C_n), one attached to each transformer layer, then predict the probability distribution p:

p_{i} = C_{i}(h_{i}) = C_{i}(L_{i}(h_{i-1})). \qquad (2)

We denote the similarity score between the prediction distributions of layers i-1 and i as s(p_{i-1}, p_{i}) \in \mathbb{R}. The smaller s(p_{i-1}, p_{i}) is, the more consistent the two predicted distributions are. The premise of an early exit is that the comparison scores between successive layers are consistently small. The similarity threshold thre is a hyper-parameter. We use pat_{i} to store the number of times the cross-layer comparison score has been consecutively below the threshold thre when the model reaches the current layer i:

pat_{i} = \begin{cases} pat_{i-1} + 1, & s(p_{i-1}, p_{i}) < thre \\ 0, & s(p_{i-1}, p_{i}) \geq thre \end{cases} \qquad (3)

If s(p_{i-1}, p_{i}) is less than the similarity score threshold thre, the patience counter is increased by 1; otherwise, it is reset to 0. This process is repeated until pat reaches the pre-defined patience value P_0, at which point the model dynamically stops inference and exits early. If this condition is never met, the model uses the final classifier layer to make predictions. In this way, the model can stop inference early without going through all layers.
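To make the exiting rule concrete, the following sketch (a minimal Python/PyTorch illustration, not the authors' released code; `layers`, `classifiers`, and `similarity` are hypothetical callables standing in for L_i, C_i, and the measure s of Section 3.2) runs an input through the stacked layers and exits once the patience counter of Eq. (3) reaches P_0:

```python
import torch

def f_pabee_forward(layers, classifiers, h0, similarity, thre=0.05, patience=3):
    """Minimal sketch of the F-PABEE inference loop (Eqs. 1-3).

    All arguments are illustrative: `layers` are the transformer blocks L_i,
    `classifiers` are the per-layer classifiers C_i, and `similarity` is the
    cross-layer similarity measure s(., .); `thre` and `patience` correspond
    to the threshold thre and the pre-defined patience P_0.
    """
    h, prev_p, pat = h0, None, 0
    for layer, clf in zip(layers, classifiers):
        h = layer(h)                       # h_i = L_i(h_{i-1})
        p = torch.softmax(clf(h), dim=-1)  # p_i = C_i(h_i)
        if prev_p is not None:
            # Increase patience if consecutive predictions are similar enough,
            # otherwise reset the counter (Eq. 3).
            pat = pat + 1 if similarity(prev_p, p) < thre else 0
            if pat >= patience:            # exit early once P_0 is reached
                return p
        prev_p = p
    return p                               # otherwise use the final classifier
```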

3.2 Similarity measures for SLC and MLC tasks

Under the framework of F-PABEE, we can adopt different similarity measures for the predicted probability distributions. This work uses knowledge distillation objectives as the similarity measures [34]. When the model reaches the current layer l, for SLC tasks, we consider the following similarity measures for F-PABEE:

F-PABEE-KD: It adopts the knowledge distillation objective from probability distribution p^{l-1} to p^{l}:

s(p^{l-1}, p^{l}) = -\sum_{j=1}^{k} p_{j}^{l-1} \log(p_{j}^{l}); \qquad (4)

F-PABEE-ReKD: It adopts the knowledge distillation objective in the reverse direction, from probability distribution p^{l} to p^{l-1}:

s(p^{l}, p^{l-1}) = -\sum_{j=1}^{k} p_{j}^{l} \log(p_{j}^{l-1}); \qquad (5)

F-PABEE-SymKD: It adopts a symmetrical knowledge distillation objective:

\mathrm{SymKD} = s(p^{l-1}, p^{l}) + s(p^{l}, p^{l-1}); \qquad (6)

F-PABEE-JSKD: It adopts another symmetrical distillation objective, similar to the Jensen-Shannon divergence:

\mathrm{JSKD} = \frac{1}{2} s\left(p^{l-1}, \frac{p^{l-1}+p^{l}}{2}\right) + \frac{1}{2} s\left(p^{l}, \frac{p^{l-1}+p^{l}}{2}\right) \qquad (7)
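As a rough reference implementation (our own PyTorch sketch, assuming p_prev and p_cur are softmax outputs of shape (batch, k); the small eps term that avoids log(0) is our addition), the four SLC measures can be written as:

```python
import torch

def kd(p_prev, p_cur, eps=1e-12):
    # Cross-entropy from p^{l-1} to p^l (Eq. 4); eps avoids log(0).
    return -(p_prev * torch.log(p_cur + eps)).sum(dim=-1)

def re_kd(p_prev, p_cur, eps=1e-12):
    # Reverse direction, from p^l to p^{l-1} (Eq. 5).
    return -(p_cur * torch.log(p_prev + eps)).sum(dim=-1)

def sym_kd(p_prev, p_cur):
    # Symmetric objective: sum of both directions (Eq. 6).
    return kd(p_prev, p_cur) + re_kd(p_prev, p_cur)

def js_kd(p_prev, p_cur):
    # Jensen-Shannon-style objective against the mixture distribution (Eq. 7).
    m = 0.5 * (p_prev + p_cur)
    return 0.5 * kd(p_prev, m) + 0.5 * kd(p_cur, m)
```

Any of these functions can be plugged in as the `similarity` argument of the inference sketch above.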

In addition, for MLC tasks, we transform them into multiple binary classification problems and sum the similarity scores over all labels. The measures become:

F-PABEE-KD:

s(p^{l-1}, p^{l}) = -\sum_{j=1}^{k} \sum_{i=1}^{2} p_{ji}^{l-1} \log(p_{ji}^{l}); \qquad (8)

F-PABEE-ReKD:

s(p^{l}, p^{l-1}) = -\sum_{j=1}^{k} \sum_{i=1}^{2} p_{ji}^{l} \log(p_{ji}^{l-1}); \qquad (9)

The formulations of F-PABEE-SymKD and F-PABEE-JSKD for MLC tasks are similar to those of SLC tasks.
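For the MLC case, a hedged sketch of F-PABEE-KD (assuming p_prev and p_cur hold the per-label positive probabilities produced by sigmoid classifiers; shapes and naming are our own) treats each label as a binary distribution (p, 1-p) and sums the cross-entropies over labels, which is one way to realize Eq. (8); the reverse, symmetric, and JS variants follow analogously:

```python
import torch

def kd_mlc(p_prev, p_cur, eps=1e-12):
    # p_prev, p_cur: per-label positive probabilities, shape (batch, num_labels).
    # Each label j contributes a binary cross-entropy between (p_j, 1 - p_j) of
    # the previous layer and of the current layer; contributions are summed.
    pos = -(p_prev * torch.log(p_cur + eps))
    neg = -((1.0 - p_prev) * torch.log(1.0 - p_cur + eps))
    return (pos + neg).sum(dim=-1)
```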

Fig. 2: Speed-accuracy curves of F-PABEE, PABEE and BERxiT on SLC tasks with BERT backbone.

3.3 Training procedure

F-PABEE is trained on both SLC and MLC tasks, with different activation and loss functions. For SLC tasks, we use the softmax activation and the cross-entropy loss; for MLC tasks, we use the sigmoid activation and the binary cross-entropy loss.

After that, we optimize the model parameters by minimizing the overall loss function LL, which is the weighted average of the loss terms from all classifiers:

L = \sum_{j=1}^{n} j L_{j} \Big/ \sum_{j=1}^{n} j \qquad (10)
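As a small illustration (our own PyTorch sketch; `losses` is assumed to be the list of per-exit scalar losses [L_1, ..., L_n]), Eq. (10) weights deeper exits more heavily:

```python
import torch

def multi_exit_loss(losses):
    # Weighted average of per-classifier losses (Eq. 10): exit j gets weight j,
    # so later, typically more accurate classifiers dominate the objective.
    weights = torch.arange(1, len(losses) + 1, dtype=torch.float32)
    return sum(w * l for w, l in zip(weights, losses)) / weights.sum()
```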

4 Experiments

4.1 Tasks and Baselines

We evaluate F-PABEE on the GLUE benchmark [35] for SLC tasks and on four datasets for MLC tasks: MixSNIPS [36], MixATIS [37], AAPD [38], and Stackoverflow [39]. We compare F-PABEE with three groups of baselines: (1) BERT-base; (2) static exiting; (3) dynamic exiting methods, including BranchyNet [25], Shallow-Deep [28], BERxiT [30], and PABEE. Taking the FLOPs of inference with the full BERT model as the base, the speedup ratio is defined as the average ratio of FLOPs saved by early exiting.
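For illustration only (a simplification that uses the number of executed layers as a proxy for FLOPs, since each transformer layer costs roughly the same; the paper measures FLOPs directly), the speedup ratio over a set of examples could be computed as:

```python
def speedup_ratio(exit_layers, num_layers=12):
    # Fraction of layer computation saved by early exiting, averaged over
    # examples, relative to always running the full num_layers-deep backbone.
    saved = [(num_layers - l) / num_layers for l in exit_layers]
    return sum(saved) / len(saved)

# e.g. three examples exiting at layers 6, 9, and 3 of a 12-layer BERT
print(speedup_ratio([6, 9, 3]))  # 0.5, i.e. a 50% speedup
```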

4.2 Experimental setting

In the training process, we perform grid search over batch sizes {16, 32, 128} and learning rates {1e-5, 2e-5, 3e-5, 5e-5} with the AdamW optimizer [41]. The batch size in the inference process is 1. We implement F-PABEE on top of HuggingFace Transformers [42]. All experiments are conducted on two Nvidia TITAN X 24GB GPUs.

4.3 Overall comparisons

Fig. 3: Speed-accuracy curves of F-PABEE, PABEE and BERxiT on MLC tasks with BERT backbone.
Fig. 4: The distribution of executed layers of MRPC and MixSNIPS on average at different speeds (50%, 75%).
Fig. 5: Speed-accuracy curves of F-PABEE, PABEE and BERxiT on SLC and MLC tasks with ALBERT backbone.
Fig. 6: Speed-accuracy curves of different similarity measures on SLC and MLC tasks with BERT backbone.

In Table 1, we compare F-PABEE with other early exiting strategies. We adjust the hyper-parameters of F-PABEE and the other baselines to ensure speedups similar to PABEE. The results show that F-PABEE balances speedup and performance better than the baselines, especially at large speedup ratios. Moreover, we draw the score-speedup curves for BERxiT, PABEE, and F-PABEE. They show that F-PABEE outperforms the baseline models on both SLC (Fig 2) and MLC tasks (Fig 3). Furthermore, the distribution of executed layers (Fig 4) indicates that F-PABEE can choose faster off-ramps and achieve a better trade-off between accuracy and efficiency by flexibly adjusting the similarity score thresholds and patience parameters.

4.4 Ablation studies

Ablation on different PLMs  F-PABEE is flexible and can work well with other pre-trained models, such as ALBERT. Therefore, to show the acceleration ability of F-PABEE with different backbones, we compare F-PABEE to other early exiting strategies with ALBERT base as the backbone. The results in Fig 5 show that F-PABEE outperforms other early exiting strategies under different backbones by large margins on both SLC and MLC tasks, indicating that F-PABEE can accelerate the inference process for numerous PLMs.

Comparisons between different similarity measures  We consider F-PABEE with different similarity measures, denoted as F-PABEE-KD, F-PABEE-ReKD, F-PABEE-SymKD, and F-PABEE-JSKD, and the results are presented in Fig 6. F-PABEE-JSKD performs the best on both SLC and MLC tasks. We attribute this to the symmetry of F-PABEE-JSKD: its similarity discrimination is more accurate than that of asymmetric measures. Therefore, it is better at determining which samples should exit at shallow layers and which should go through deeper layers.

5 Conclusions

We proposed F-PABEE, a novel and efficient early exiting method that combines PABEE with a softer cross-layer comparison strategy. F-PABEE is more flexible than PABEE since it can achieve different speed-performance tradeoffs by adjusting the similarity score thresholds and patience parameters. In addition, we investigated the acceleration ability of F-PABEE with different backbones and compared the performances of F-PABEE with different similarity measures. Extensive experiments on SLC and MLC tasks demonstrate that: (1) F-PABEE performs better than previous SOTA adaptive early exiting strategies on both SLC and MLC tasks. As far as we know, we are the first to investigate early exiting methods for MLC tasks. (2) F-PABEE performs well on different PLMs such as BERT and ALBERT. (3) Ablation studies show that F-PABEE-JSKD performs best among F-PABEE variants with different similarity measures.

References

  • [1] Tianyang Lin et al., “A survey of transformers,” arXiv, vol. abs/2106.04554, 2021.
  • [2] Wei Zhu, “Mvp-bert: Redesigning vocabularies for chinese bert and multi-vocab pretraining,” ArXiv, vol. abs/2011.08539, 2020.
  • [3] Wei Zhu, Xiaofeng Zhou, Keqiang Wang, Xun Luo, Xiepeng Li, Yuan Ni, and Guo Tong Xie, “Panlp at mediqa 2019: Pre-trained language models, transfer learning and knowledge distillation,” in BioNLP@ACL, 2019.
  • [4] Yuhui Zuo, Wei Zhu, and Guoyong Cai, “Continually detection, rapidly react: Unseen rumors detection based on continual prompt-tuning,” in International Conference on Computational Linguistics, 2022.
  • [5] Wei Zhu, Peng Wang, Xiaoling Wang, Yuan Ni, and Guo Tong Xie, “Acf: Aligned contrastive finetuning for language and vision tasks,” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  • [6] Zhao Guo, Yuan Ni, Keqiang Wang, Wei Zhu, and Guo Tong Xie, “Global attention decoder for chinese spelling error correction,” in Findings, 2021.
  • [7] Jacob Devlin et al., “BERT: pre-training of deep bidirectional transformers for language understanding,” ArXiv, vol. abs/1810.04805, 2018.
  • [8] Zhenzhong Lan et al., “ALBERT: A lite BERT for self-supervised learning of language representations,” arXiv, vol. abs/1909.11942, 2019.
  • [9] Zhilin Yang et al., “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv, vol. abs/1906.08237, 2019.
  • [10] Yinhan Liu et al., “Roberta: A robustly optimized BERT pretraining approach,” arXiv, vol. abs/1907.11692, 2019.
  • [11] Wei Zhu, “MVP-BERT: Multi-vocab pre-training for Chinese BERT,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, Online, Aug. 2021, pp. 260–269, Association for Computational Linguistics.
  • [12] Canwen Xu et al., “A survey on dynamic neural networks for natural language processing,” arXiv, vol. abs/2202.07101, 2022.
  • [13] Wangchunshu Zhou et al., “BERT loses patience: Fast and robust inference with early exit,” arXiv, vol. abs/2006.04152, 2020.
  • [14] Canwen Xu et al., “Bert-of-theseus: Compressing bert by progressive module replacing,” EMNLP, 2020.
  • [15] Victor Sanh et al., “Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv, vol. abs/1910.01108, 2019.
  • [16] Angela Fan et al., “Reducing transformer depth on demand with structured dropout,” arXiv, vol. abs/1909.11556, 2019.
  • [17] Paul Michel et al., “Are sixteen heads really better than one?,” arXiv, vol. abs/1905.10650, 2019.
  • [18] Siqi Sun et al., “Patient knowledge distillation for BERT model compression,” arXiv, vol. abs/1908.09355, 2019.
  • [19] Sehoon Kim et al., “I-BERT: integer-only BERT quantization,” arXiv, vol. abs/2101.01321, 2021.
  • [20] Wei Zhu, “Autonlu: Architecture search for sentence and cross-sentence attention modeling with re-designed search space,” in Natural Language Processing and Chinese Computing, Lu Wang, Yansong Feng, Yu Hong, and Ruifang He, Eds., Cham, 2021, pp. 155–168, Springer International Publishing.
  • [21] Zhexi Zhang, Wei Zhu, Junchi Yan, Peng Gao, and Guowang Xie, “Automatic student network search for knowledge distillation,” 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2446–2453, 2021.
  • [22] Wei Zhu, Yuan Ni, Xiaoling Wang, and Guo Tong Xie, “Discovering better model architectures for medical query understanding,” in North American Chapter of the Association for Computational Linguistics, 2021.
  • [23] Mostafa Dehghani et al., “Universal transformers,” arXiv, vol. abs/1807.03819, 2018.
  • [24] Zhen Zhang, Wei Zhu, Jinfan Zhang, Peng Wang, Rize Jin, and Tae-Sun Chung, “Pcee-bert: Accelerating bert inference via patient and confident early exiting,” in NAACL-HLT, 2022.
  • [25] Teerapittayanon et al., “Branchynet: Fast inference via early exiting from deep neural networks,” in 2016 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 2464–2469.
  • [26] Weijie Liu et al., “Fastbert: a self-distilling BERT with adaptive inference time,” arXiv, vol. abs/2004.02178, 2020.
  • [27] Ji Xin et al., “Deebert: Dynamic early exiting for accelerating BERT inference,” arXiv, vol. abs/2004.12993, 2020.
  • [28] Yigitcan Kaya et al., “How to stop off-the-shelf deep neural networks from overthinking,” CoRR, vol. abs/1810.07052, 2018.
  • [29] Roy Schwartz et al., “The right tool for the job: Matching model and instance complexities,” arXiv, vol. abs/2004.07453, 2020.
  • [30] Ji Xin et al., “BERxiT: Early exiting for BERT with better fine-tuning and extension to regression,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, Apr. 2021, pp. 91–104, Association for Computational Linguistics.
  • [31] Tal Schuster et al., “Consistent accelerated inference via confident adaptive transformers,” CoRR, vol. abs/2104.08803, 2021.
  • [32] Wei Zhu, “LeeBERT: Learned early exit for BERT with cross-level optimization,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, Aug. 2021, pp. 2968–2980, Association for Computational Linguistics.
  • [33] Wei Zhu, Xiaoling Wang, Yuan Ni, and Guotong Xie, “GAML-BERT: Improving BERT early exiting by gradient aligned mutual learning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 3033–3044, Association for Computational Linguistics.
  • [34] Geoffrey E. Hinton et al., “Distilling the knowledge in a neural network,” ArXiv, vol. abs/1503.02531, 2015.
  • [35] Alex Wang et al., “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” CoRR, vol. abs/1804.07461, 2018.
  • [36] Alice Coucke et al., “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” arXiv, vol. abs/1805.10190, 2018.
  • [37] Hemphill et al., “The ATIS spoken language systems pilot corpus,” in Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.
  • [38] Pengcheng Yang et al., “SGM: sequence generation model for multi-label classification,” arXiv, vol. abs/1806.04822, 2018.
  • [39] Jeff Atwood, “Stack overflow creative commons data dump,” 2009.
  • [40] Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael R. Lyu, and Irwin King, “Binarybert: Pushing the limit of bert quantization,” in Annual Meeting of the Association for Computational Linguistics, 2020.
  • [41] Ilya Loshchilov et al., “Decoupled weight decay regularization,” in ICLR, 2019.
  • [42] Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45.