
The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA

Youngmin Lee
Department of Computing & Mathematics
Oral Roberts University
[email protected]
Andrew S.I.D. Lang
Department of Computing & Mathematics
Oral Roberts University
[email protected]
Duoduo Cai
Department of Computing & Mathematics
Oral Roberts University
[email protected]
Stephen R. Wheat
Department of Computing & Mathematics
Oral Roberts University
[email protected]
Equal Contribution
Abstract

The application of Artificial Intelligence (AI) in cheminformatics, specifically through the use of fine-tuned Large Language Models (LLMs), has shown significant potential in predicting molecular properties. This study introduces a systematic framework to compare the efficacy of LLMs for fine-tuning across various cheminformatics tasks. Employing a uniform training methodology, we assessed three well-known models—RoBERTa, BART, and LLaMA—on their ability to predict molecular properties using the Simplified Molecular Input Line Entry System (SMILES) as a universal molecular representation format. Our comparative analysis involved pre-training 18 configurations of these models, with varying parameter sizes and dataset scales, followed by fine-tuning them on six benchmarking tasks from DeepChem. We maintained consistent training environments across models to ensure reliable comparisons. This approach allowed us to assess the influence of model type, size, and training dataset size on model performance. Specifically, we found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales. However, we observed that absolute validation loss is not a definitive indicator of model performance, contradicting previous research, at least for fine-tuning tasks; instead, model size plays a crucial role. Through rigorous replication and validation, involving multiple training and fine-tuning cycles, our study not only delineates the strengths and limitations of each model type but also provides a robust methodology for selecting the most suitable LLM for specific cheminformatics applications. This research underscores the importance of considering model architecture and dataset characteristics in deploying AI for molecular property prediction, paving the way for more informed and effective utilization of AI in drug discovery and related fields.

1 Introduction

The use of Artificial Intelligence (AI), including fine-tuned Large Language Models (LLMs), for predicting molecular properties has become increasingly prevalent [1]. However, the development of these models often focuses on achieving high benchmark scores and sometimes neglects how variations in training datasets, molecular representations, and training parameters influence overall performance [2]. This complicates the task of selecting the best LLM to fine-tune for a specific task. Our research presents a straightforward framework designed to compare LLMs, assessing their suitability for fine-tuning across a range of tasks. This approach is demonstrated through the evaluation of three popular open-source models for fine-tuning for molecular property prediction: RoBERTa, a BERT (Bidirectional Encoder Representations from Transformers)-based model [3], BART (Bidirectional and Auto-Regressive Transformers) [4], and LLaMA (Large Language model Meta AI) [5], chosen for their proven utility in chemistry-specific Natural Language Processing (NLP) applications. The intent of this study is not to identify the universally best LLM across all models and tasks but to demonstrate an effective method for comparing various LLMs. In conducting our comparative analysis, we adopted a consistent training methodology and used uniform training data across all models to fine-tune RoBERTa, BART, and LLaMA for molecular property prediction tasks. Additionally, given the extensive use of the Simplified Molecular Input Line Entry System (SMILES) for representing chemical structures in the fine-tuning process of LLMs [6-11], our methodology employs SMILES encoding as the standard form of molecular representation. While acknowledging the existence of alternative molecular encoding techniques [12], our study primarily aims to showcase a comparative methodology for determining the most effective LLM for a specific task under uniform training conditions.

2 Related Works

Recent studies have highlighted the utility of fine-tuning BERT and its variants to create specialized models such as CHEM-BERT [6] and ChemBERTa-2 [7]. These models can undergo further fine-tuning for molecular property prediction [8], indicating the versatility of BERT-based architectures. Similarly, research has demonstrated the usefulness of BART-based architectures for small-molecule drug discovery tasks, exemplified by models such as Chemformer [8] and MegaMolBART [9], with the latter also being applied to molecular property prediction tasks [10]. Furthermore, LLaMA (and other recently released open models) have also been fine-tuned for chemistry-related tasks through instruction tuning [13], suggesting their potential use in various chemistry applications. Finally, Molformer, leveraging techniques like encoder-based architecture, Masked Language Modeling (MLM), linear attention, and bucketing, is prominently used in cheminformatics for predictive tasks, including molecular property prediction and chemical reaction outcomes. Its capability for de novo drug design allows for the generation of novel molecular structures with desired properties, aiding in efficient drug discovery processes. Additionally, Molformer excels in tasks such as molecular similarity assessment and clustering, enhancing the screening and optimization of compounds in large datasets [14].

These recent advancements in fine-tuning BART, RoBERTa, and LLaMA (amongst others) underscore the importance of foundational LLMs in addressing domain-specific challenges in chemistry and illustrate the need to be able to directly compare the performance of such models on the same task.

3 Methodology

3.1 Pre-training Models for Multi-Task Regression (MTR)

To assess the effects of model size and dataset size on the performance of foundational transformer models (RoBERTa, BART, and LLaMA), the authors established 18 configurations by training two versions of each model—one with 13 million parameters and another with 30 million parameters—across three distinct training dataset sizes: 10 million, 20 million, and 30 million instances. This setup was designed to explore scalability and performance impacts. To ensure experimental consistency, the same data sets were used for each dataset size, with fixed random seeds to guarantee reproducibility (see Table 1).

Table 1: Model combination chart: The 18 experimental configurations result from the product of three variables: three foundational model types, two parameter sizes, and three training dataset sizes.
Factor            Values
Model Type (MT)   ChemBART | ChemBERTa | ChemLLaMA
Model Size (MS)   Small (13M) | Medium (30M)
Data Size (DS)    10M | 20M | 30M

3.1.1 Dataset

We downloaded the canonicalized SMILES dataset utilized for Molformer [14], initially processing 33 million SMILES strings. After removing duplicates, we randomly selected 30 million entries for further analysis. Using DeepChem’s RDKitDescriptors Featurizer, we calculated descriptors for each SMILES string [15, 16]. This process revealed significant skewing in some property columns due to a few extreme values (e.g., ‘Kappa3’ values exceeding 1e+36, while most are below 1e+3). We addressed this by removing outliers and normalizing all properties using z-scores. Finally, we excluded any columns where the minimum and median values were identical, resulting in a curated initial training dataset of 30 million distinct compounds (30M rows) with 105 properties (106 columns: 1 identifier + 105 properties). To ensure a uniform training environment, we used the exact same training and validation data for all transformer models at each data size, shuffling the order of the training data each epoch.
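A minimal sketch of this preprocessing pipeline is shown below. The input file and column names, the generic descriptor names, and the quantile rule standing in for the unspecified outlier-removal criterion are all assumptions for illustration, not the authors' exact code.

```python
# Sketch of the descriptor pipeline described above (assumptions: input file
# and column name, generic descriptor names, and a quantile rule standing in
# for the unspecified outlier-removal criterion; all SMILES assumed valid).
import deepchem as dc
import pandas as pd

smiles = (pd.read_csv("smiles_30M.csv")["smiles"]   # hypothetical input file/column
            .drop_duplicates().tolist())

featurizer = dc.feat.RDKitDescriptors()
X = featurizer.featurize(smiles)                    # shape: (n_molecules, n_descriptors)
props = pd.DataFrame(X, columns=[f"desc_{i}" for i in range(X.shape[1])])
props.insert(0, "smiles", smiles)                   # 1 identifier column + property columns

desc_cols = list(props.columns[1:])

# Remove extreme outliers (illustrative 0.1%/99.9% quantile rule), then
# z-score normalize every property column.
for col in desc_cols:
    lo, hi = props[col].quantile([0.001, 0.999])
    props = props[props[col].between(lo, hi)]
    props[col] = (props[col] - props[col].mean()) / (props[col].std() + 1e-12)

# Drop degenerate property columns whose minimum equals their median.
keep = [c for c in desc_cols if props[c].min() != props[c].median()]
props = props[["smiles"] + keep]
```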

3.1.2 Tokenizer

The tokenizers for all three transformer models are based on the ChemBERTa tokenizer [7], with the special tokens replaced by the Hugging Face defaults. The maximum sequence length for the tokenizers is 512. Note that the ChemLLaMA tokenizer adds the ‘eos’ token at the end of each sequence before padding, while the default LLaMA tokenizer does not.
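For illustration, a minimal sketch of this setup is given below, using a publicly available ChemBERTa tokenizer checkpoint as a stand-in; the checkpoint name and the manual eos handling are assumptions based on the description above, not the authors' exact configuration.

```python
# Illustrative tokenizer setup (the checkpoint name is an assumption; the
# paper's tokenizers are ChemBERTa-based with Hugging Face default special
# tokens and a maximum sequence length of 512).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1",
                                          model_max_length=512)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin

# ChemBERTa / ChemBART style: rely on the tokenizer's default special tokens.
enc = tokenizer(smiles, padding="max_length", truncation=True, return_tensors="pt")

# ChemLLaMA style: append the eos token id at the end of the sequence before
# padding (the default LLaMA tokenizer does not do this on its own).
ids = tokenizer(smiles, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]
```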

3.1.3 Model for MTR (pre-trained)

The authors utilized three types of transformer-based models, representing encoder-based, decoder-based, and encoder-decoder-based architectures: RoBERTa, LLaMA, and BART, respectively. Default settings were used for each model, with adjustments made only to the special token IDs and vocabulary size, as illustrated in Figure 1. The models were trained with two parameter sizes, featuring 30 million (Medium) and 13 million (Small) parameters, as calculated using the .num_parameters() method from the Hugging Face ‘transformers’ library. During training, the BART and LLaMA models calculated loss using the ‘eos’ token, whereas RoBERTa utilized the ‘bos’ token. We employed L1Loss, corresponding to Mean Absolute Error (MAE). The MTR model families discussed in this paper—ChemBERTa, ChemLLaMA, and ChemBART, see Table 2—correlate with the architectures of RoBERTa, LLaMA, and BART, respectively.
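To illustrate, the Small configurations in Table 2 can be instantiated from the default Hugging Face configs roughly as sketched below. The vocabulary size is a placeholder, and the paired ChemBART values in Table 2 are interpreted here as encoder | decoder settings; both are assumptions rather than the authors' exact setup.

```python
# Sketch of the Small (13M) configurations from Table 2, built on default
# Hugging Face configs; vocab_size is a placeholder, and the paired ChemBART
# values are interpreted as encoder | decoder settings (an assumption).
from transformers import (BartConfig, BartModel, LlamaConfig, LlamaModel,
                          RobertaConfig, RobertaModel)

vocab_size = 591  # placeholder for the ChemBERTa-style SMILES vocabulary size

chembart_small = BartModel(BartConfig(
    vocab_size=vocab_size, d_model=624,
    encoder_ffn_dim=624, decoder_ffn_dim=624,
    encoder_layers=2, decoder_layers=2,
    encoder_attention_heads=2, decoder_attention_heads=2))

chemberta_small = RobertaModel(RobertaConfig(
    vocab_size=vocab_size, hidden_size=620, intermediate_size=710,
    num_hidden_layers=5, num_attention_heads=5, max_position_embeddings=514))

chemllama_small = LlamaModel(LlamaConfig(
    vocab_size=vocab_size, hidden_size=600, intermediate_size=620,
    num_hidden_layers=5, num_attention_heads=5))

# Parameter counts reported with the Hugging Face helper:
for name, model in [("ChemBART", chembart_small),
                    ("ChemBERTa", chemberta_small),
                    ("ChemLLaMA", chemllama_small)]:
    print(name, model.num_parameters())
```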

Table 2: Model Size Chart: The configuration of each transformer model (Model Type) by Model Size. All other parameters are the default values provided for each model on Hugging Face.
Size           Model Type   Hid. Size   Int. Size   Hid. Layers   Att. Heads
Small (13M)    ChemBART     624 | 624   624 | 624   2 | 2         2 | 2
               ChemBERTa    620         710         5             5
               ChemLLaMA    600         620         5             5
Medium (30M)   ChemBART     768 | 768   768 | 768   3 | 3         3 | 3
               ChemBERTa    768         768         8             8
               ChemLLaMA    768         768         7             8
Figure 1: The architecture of the Multi-Task Regression (MTR) model and the fine-tuned model.

3.1.4 Training

We configured both the training and validation phases with a batch size of 64. For optimization, we employed the AdamW optimizer with its default settings. We also used the ‘Linear Warmup Cosine Annealing Learning Rate’ scheduler from PyTorch Lightning Bolts, which gradually ramped the learning rate up to a peak of 0.0001 by the end of the first epoch (starting from epoch 0), with training continuing for a total of 7 epochs. Our MTR models were trained on four NVIDIA A30 GPUs using a Distributed Data Parallel (DDP) strategy, facilitated by the PyTorch Lightning framework.
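A sketch of this optimizer and scheduler setup follows, assuming the AdamW base learning rate is set to the 1e-4 peak targeted by the warm-up and that the scheduler is stepped per batch (so its epoch arguments are expressed in optimizer steps); these details are assumptions, not the authors' exact code.

```python
# Illustrative optimizer/scheduler setup (assumptions: AdamW base lr set to
# the 1e-4 peak; scheduler stepped per batch, so epoch arguments are given
# in optimizer steps).
import torch
from pl_bolts.optimizers.lr_scheduler import LinearWarmupCosineAnnealingLR

def build_optimizer(model: torch.nn.Module, steps_per_epoch: int, epochs: int = 7):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # otherwise default settings
    scheduler = LinearWarmupCosineAnnealingLR(
        optimizer,
        warmup_epochs=steps_per_epoch,        # linear warm-up over epoch 0
        max_epochs=steps_per_epoch * epochs,  # cosine annealing over the remaining epochs
    )
    return optimizer, scheduler

# Pre-training ran on four A30 GPUs with DDP via PyTorch Lightning, e.g.:
# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=7)
```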

3.2 Models for Fine-tuning (FT)

3.2.1 Dataset & Tokenizer

We selected six benchmarking tasks from DeepChem to evaluate the performance of our transformer models after fine-tuning. These tasks, detailed in Table 3, are representative of common benchmark challenges in cheminformatics. For regression analyses, we utilized the ‘bace_regression,’ ‘delaney,’ and ‘lipo’ datasets. For classification tasks, we selected ‘bace_classification,’ ‘hiv,’ and ‘tox21_sr_p53.’ Consistency in preprocessing was maintained by using the same tokenizers as those applied in the pre-training phase of our models.

Table 3: Fine-tuning datasets: the datasets employed for fine-tuning, their task type, and their sizes.
Task    Regression                       Classification
Name    Bace      Delaney    Lipo        Bace      Hiv       Tox21
Size    1513      1127       4200        1513      40000     8000
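These six benchmarks can be loaded directly from DeepChem's MoleculeNet module, as in the sketch below; the featurizer and splitter arguments shown are assumptions, since the paper does not specify the exact loader settings.

```python
# Sketch of loading the six MoleculeNet benchmarks in Table 3 via DeepChem
# (the 'Raw' featurizer and scaffold splitter are assumptions; the tox21
# loader returns all 12 tasks, of which the paper uses only SR-p53).
import deepchem as dc

loaders = {
    "bace_regression": dc.molnet.load_bace_regression,
    "delaney": dc.molnet.load_delaney,
    "lipo": dc.molnet.load_lipo,
    "bace_classification": dc.molnet.load_bace_classification,
    "hiv": dc.molnet.load_hiv,
    "tox21": dc.molnet.load_tox21,
}

splits = {}
for name, load in loaders.items():
    tasks, (train, valid, test), transformers = load(featurizer="Raw", splitter="scaffold")
    splits[name] = (train, valid, test)   # each split exposes .ids (SMILES) and .y (labels)
```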

3.2.2 Models Used for Fine-Tuning

In the fine-tuning process, we utilized the 18 pre-trained models from our MTR series. These models were initially frozen to preserve the learned features and then appended with linear layers incorporating the GELU activation function, as depicted in Figure 1. For the loss functions, the fine-tuned models employed Mean Absolute Error (MAE, or L1 Loss) for regression tasks and Binary Cross-Entropy with Logits (BCE with Logits Loss) for classification tasks.
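A minimal sketch of such a fine-tuning wrapper is shown below; the pooling choice, head width, and number of layers are assumptions, since Figure 1 does not fix these details.

```python
# Minimal sketch of the fine-tuning wrapper: frozen pre-trained backbone plus
# a small GELU head (pooling choice, head width, and depth are assumptions).
import torch
import torch.nn as nn

class FineTunedModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, n_outputs: int = 1):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():       # freeze the learned MTR features
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, n_outputs),
        )

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]        # first-token pooling (bos/eos depends on model)
        return self.head(pooled)

regression_loss = nn.L1Loss()                       # MAE for regression tasks
classification_loss = nn.BCEWithLogitsLoss()        # BCE with logits for classification tasks
```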

3.2.3 Training

During the fine-tuning of our models, we allocated one GPU per model due to the relatively smaller size of the fine-tuning datasets compared to those used for pre-training. Consequently, we did not employ parallel processing techniques and instead adhered to the default training settings. We retained the same learning rate scheduler as in the pre-training phase but increased the peak learning rate from 0.0001 to 0.01 for both regression and classification tasks to better suit the scale of the datasets. All other training configurations were maintained as in the initial setup. To ensure the reliability of our results and reduce potential biases from single-model evaluations, we repeated the fine-tuning process five times. This approach enabled us to compile and analyze statistics across multiple iterations, enhancing the robustness of our findings.

4 Results & Discussion

4.1 Model Multi-Task Regression

4.1.1 Training and Validation Loss

During the training of the MTR models, we observed that the BART- and RoBERTa-based models exhibited similar training loss behaviors across all model sizes. However, ChemBART’s training loss decreased more rapidly than ChemBERTa’s at the outset, across all sizes, as shown in Figure 2 (data for the 20 million data size is displayed; the other data sizes present similar trends). Regarding validation loss, the smaller-sized models of ChemBART and ChemBERTa displayed comparable loss values across various data sizes and epochs. However, for the medium-sized models, ChemBART consistently showed lower validation losses in all cases, as illustrated in Figure 3. ChemLLaMA achieved the lowest validation loss values, irrespective of epoch or model size.

Additionally, we confirmed that larger MTR models tend to reach saturation more quickly than their smaller counterparts, aligning with previous findings [17]. Interestingly, as Figure 3 illustrates, ChemLLaMA models of the same size exhibited lower validation losses with smaller datasets than with larger ones during the early gradient steps. Nonetheless, models trained with larger datasets ultimately achieved lower validation losses as the number of gradient steps increased.

Figure 2: Training Loss vs. Steps for Each Model Type by Model Size with a Data Size of 20 Million. The loss trends for the 10 million and 30 million datasets are similar.
Figure 3: Validation Loss vs. Steps for Each Model Type Across All Model Sizes and Data Sizes (10, 20, and 30 million).

4.2 Model Fine-Tuning

The authors trained each fine-tuned model for seven epochs and replicated this process five times to mitigate potential statistical anomalies from any single run. The total count of metrics (MAE/BCE) recorded for each fine-tuned (FT) dataset, encompassing all model types (MT), model sizes (MS), and data sizes (DS), amounted to 3 (MT) × 2 (MS) × 3 (DS) × 7 (epochs for MTR models) × 7 (epochs for FT models) × 5 (iterations) = 4,410 performance values. The ‘best’ metric over all epochs was determined by selecting the model with the lowest validation loss and was recorded in a ‘best metrics set’ of 3 (MT) × 2 (MS) × 3 (DS) × 5 (iterations) = 90 models. From these 90 models, we calculated the performance metric (RMSE/AUC) on the test dataset.

For the training, we employed Mean Absolute Error (MAE) for regression and Binary Cross Entropy (BCE) for classification. However, for benchmarking each fine-tuning task with the test datasets, we used Root Mean Squared Error (RMSE) and the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC).
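For concreteness, the benchmarking metrics can be computed as in the following sketch; the arrays are placeholders and scikit-learn is used purely for illustration.

```python
# Illustrative computation of the test-time benchmarking metrics
# (placeholder arrays; scikit-learn used purely for illustration).
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Regression benchmarks (bace_regression, delaney, lipo): RMSE on the test set.
y_true = np.array([1.2, 0.4, 2.5])
y_pred = np.array([1.0, 0.7, 2.3])
rmse = mean_squared_error(y_true, y_pred, squared=False)

# Classification benchmarks (bace_classification, hiv, tox21_sr_p53): ROC-AUC.
labels = np.array([0, 1, 1, 0])
logits = np.array([-1.3, 0.8, 2.1, -0.2])
auc = roc_auc_score(labels, 1.0 / (1.0 + np.exp(-logits)))   # sigmoid -> probability -> AUC

print(f"RMSE = {rmse:.3f}, AUC = {auc:.3f}")
```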

4.2.1 Total Error Sum (TES) and its Standard Deviation (STD)

We assess the performance of the various fine-tuned models across multiple tasks using the following grouping categories: Model Type, Model Type & Model Size, Model Type & Data Size, and Model Size & Data Size. Our evaluation process is as follows:

  1. We compute the average RMSE (for regression tasks) or AUC (for classification tasks) for each group: Model Type, Model Type & Model Size, Model Type & Data Size, and Model Size & Data Size.

  2. We establish the optimal performance — the lowest average RMSE in regression tasks or the highest average AUC in classification tasks — as the benchmark for each group.

  3. For each individual model run, we measure the deviation of the RMSE or AUC from the established benchmark. These deviations are signed to reflect that a lower RMSE or higher AUC is preferable, aligning with performance standards for both regression and classification tasks.

  4. We then aggregate these deviations, grouped by the aforementioned categories, into an Error Sum (ES) for each task.

  5. We calculate the Total Error Sum (TES) by summing all the ES values, providing a comprehensive performance metric across tasks. Additionally, we determine the Standard Deviation (STD) based on the deviations from the benchmark identified in Step 3.

This structured approach enables a detailed model performance analysis relative to established standards across different configurations.
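The following pandas sketch makes Steps 1–5 concrete under the assumption of a long-format results table with one row per fine-tuning run; the column names and the exact aggregation details are illustrative, not the authors' code.

```python
# Pandas sketch of Steps 1-5 (assumed long-format input: one row per run with
# columns "group", "task", "metric", and a boolean "is_regression").
import pandas as pd

def tes_and_std(runs: pd.DataFrame):
    deviations = []
    for task, t in runs.groupby("task"):
        avg = t.groupby("group")["metric"].mean()            # Step 1: group averages
        if t["is_regression"].iloc[0]:
            benchmark, sign = avg.min(), 1.0                  # Step 2: lowest RMSE is best
        else:
            benchmark, sign = avg.max(), -1.0                 # Step 2: highest AUC is best
        t = t.copy()
        t["dev"] = sign * (t["metric"] - benchmark)           # Step 3: signed deviation
        deviations.append(t)
    dev = pd.concat(deviations)
    es = dev.groupby(["group", "task"])["dev"].sum()          # Step 4: ES per group and task
    tes = es.groupby(level="group").sum()                     # Step 5: TES per group
    std = dev.groupby("group")["dev"].std()                   # Step 5: STD of the deviations
    return tes, std
```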

4.2.2 Overall Test Evaluation

Table 4 displays the average best metrics (RMSE/AUC) for our fine-tuned models, grouped by model type. For example, the average of the best metric for the BACE regression benchmark is calculated from 30 RMSE values, derived from combinations of 2 model sizes, 3 data sizes, and 5 runs. This yields three comparative averages—one for each model type. Among them, ChemBART records the lowest average RMSE at 0.896. Average metrics for other tasks are calculated similarly. Table 5 further splits the results by model size.

Table 4: Best Avg. Metrics by MT
Task (Metric)              FT Dataset   Avg. Metrics   MT
Regression (RMSE)          Bace         0.895847       ChemBART
                           Delaney      0.478322       ChemBART
                           Lipo         0.735585       ChemBERTa
Classification (AUC-ROC)   Bace         0.802156       ChemBERTa
                           Hiv          0.769427       ChemLLaMA
                           Tox21        0.789914       ChemLLaMA
Table 5: Best Avg. Metrics by MTMS (same FT datasets as Table 4, in the same order)
FT Dataset   Avg. Metrics   MTMS
Bace         0.867828       ChemBART Medium
Delaney      0.452183       ChemLLaMA Medium
Lipo         0.731018       ChemBERTa Medium
Bace         0.802258       ChemBERTa Small
Hiv          0.774683       ChemLLaMA Small
Tox21        0.794821       ChemBART Small

4.2.3 Best Metrics Grouped by Model Type and Model Size (MTMS)

When focusing solely on the transformer’s Model Type and disregarding size, ChemBART outperforms the other models in regression tasks but ranks lowest in classification tasks, whereas ChemLLaMA exhibits the worst regression performance (highest TES) when aggregated across both sizes, as shown in Table 6. However, ChemLLaMA Medium has the best overall performance (lowest TES values) among all Model Types for medium-sized models in both regression and classification tasks, as highlighted in Table 7. These tables categorize the best metric sets from each dataset, highlighting differences across model configurations.

Table 6: TES and STD grouped by MT
            Regression            Classification
MT          TES        STD        TES        STD
ChemBART    0.001455   0.055811   0.017267   0.020405
ChemBERTa   0.036971   0.072974   0.012621   0.019049
ChemLLaMA   0.042069   0.074816   0.010000   0.021938
Table 7: TES and STD grouped by MTMS
                       Regression            Classification
MT          MS         TES        STD        TES        STD
ChemBART    Small      0.085963   0.062398   0.028660   0.024561
            Medium     0.034399   0.048846   0.026405   0.015154
ChemBERTa   Small      0.086802   0.065119   0.023367   0.019636
            Medium     0.104591   0.083724   0.022404   0.019529
ChemLLaMA   Small      0.171555   0.094720   0.022002   0.023530
            Medium     0.030035   0.039855   0.018527   0.019365

The baseline values for each dataset are taken from Table 5. Additionally, the best metrics are organized by MTMS from the original best metric set for each dataset.

As previously mentioned, ChemLLaMA with a ‘Medium’ model size demonstrates the best performance across all regression and classification tasks, even though ChemLLaMA ‘Small’ has the worst performance in regression tasks among all MTMS combinations. The ‘Medium’ size ChemBART also performs impressively in regression tasks.

Furthermore, our analysis reveals that the performance of each Model Type generally scales with Model Size across all task types, with the notable exception of ChemBERTa in regression tasks.

4.2.4 Best Metrics Grouped by Model Type/Data Size (MTDS) and Model Size/Data Size (MSDS)

When the average metrics and TES are calculated from the best models and grouped by Model Type and Data Size (MTDS), as presented in Tables 8 and 10, ChemBART pre-trained on 10M of MTR data excels in regression tasks compared to the other groups. For classification tasks, ChemLLaMA with 30M of data achieves the lowest TES value. ChemBERTa with 30M of data also displays strong performance in classification, alongside other groups with similarly low TES values. Additionally, performance for ChemBERTa in regression tasks and for ChemLLaMA in classification tasks shows a clear proportional relationship to data size.

Considering Model Size and Data Size (MSDS), as detailed in Tables 9 and 11, a Medium Model Size with the 10M Data Size yields the best (lowest) TES value in regression tasks. Generally, larger Model Sizes outperform smaller ones across all Data Sizes. In classification tasks, the Small Model Size with the 30M Data Size achieves the lowest TES. However, with the exception of the 30M data scenarios, Medium Model Sizes tend to exhibit lower errors than smaller sizes. Notably, in regression tasks, Small Model Sizes demonstrate a non-proportional relationship between TES and Data Size. Conversely, in classification tasks, there is a consistent proportional relationship between performance and Data Size, regardless of Model Size.

Table 8: Best Avg. Metrics by MTDS
Task (Metric)              FT Dataset   Avg. Metrics   MTDS
Regression (RMSE)          Bace         0.877418       ChemBART 10M
                           Delaney      0.459106       ChemBART 10M
                           Lipo         0.733709       ChemBERTa 20M
Classification (AUC-ROC)   Bace         0.806304       ChemBERTa 30M
                           Hiv          0.778740       ChemLLaMA 30M
                           Tox21        0.798990       ChemLLaMA 30M
Table 9: Best Avg. Metrics by MSDS (same FT datasets as Table 8, in the same order)
FT Dataset   Avg. Metrics   MSDS
Bace         0.879725       Medium 10M
Delaney      0.470251       Medium 30M
Lipo         0.730185       Medium 10M
Bace         0.801208       Medium 20M
Hiv          0.773886       Small 30M
Tox21        0.796570       Small 30M
Table 10: TES and STD grouped by MTDS
                       Regression            Classification
MT          DS         TES        STD        TES        STD
ChemBART    10M        0.002017   0.048120   0.040099   0.018766
            20M        0.090277   0.067623   0.037477   0.021300
            30M        0.030635   0.046525   0.041837   0.021029
ChemBERTa   10M        0.059857   0.061768   0.046010   0.021386
            20M        0.074375   0.098413   0.044872   0.020028
            30M        0.095244   0.054786   0.014592   0.014878
ChemLLaMA   10M        0.081123   0.073842   0.048363   0.022265
            20M        0.079537   0.087679   0.036204   0.024237
            30M        0.084112   0.064089   0.013043   0.015452
Table 11: TES and STD grouped by MSDS
                       Regression            Classification
MS          DS         TES        STD        TES        STD
Small       10M        0.072341   0.076212   0.033915   0.022837
            20M        0.089094   0.086085   0.029663   0.024573
            30M        0.095489   0.062148   0.010157   0.019240
Medium      10M        0.003135   0.042831   0.030993   0.018664
            20M        0.053843   0.083541   0.024632   0.019120
            30M        0.024651   0.045687   0.011417   0.016060

5 Conclusion

In our MTR training sessions, ChemLLaMA consistently demonstrated the lowest validation loss across all model sizes and epochs. However, we observed that absolute validation loss is not a definitive indicator of model performance, at least for fine-tuning tasks; instead, model size plays a crucial role. Overall, ChemBART excels in regression tasks, while ChemLLaMA performs better in classification tasks when considering only the model type.

For regression tasks, larger-sized ChemBART models using smaller data sets emerge as one of the best configurations. Yet, ChemLLaMA exhibits a clear proportional relationship between performance and model size, a pattern not observed in the other models. Additionally, ChemLLaMA with larger model sizes outperforms all other foundational model types across various tasks. Notably, ChemLLaMA’s performance in classification tasks also shows a direct correlation with the data size of the pre-training models. When sufficient computational resources and MTR datasets are available, training large-scale ChemLLaMA models with extensive datasets proves to be the most effective strategy for both regression and classification tasks.

References

[1] Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D. & Wang, F. (2023) A systematic study of key elements underlying molecular property prediction. Nature Communications, 14(1), 6395.

[2] David, L., Thakkar, A., Mercado, R. & Engkvist, O. (2020) Molecular representations in AI-driven drug discovery: a review and practical guide. J Cheminformatics 12: 56.

[3] Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

[4] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … & Zettlemoyer, L. (2019) Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv:1910.13461.

[5] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., … & Lample, G. (2023) Llama: Open and efficient foundation language models. arXiv:2302.13971.

[6] Kim, H., Lee, J., Ahn, S. & Lee, J. R. (2021) A merged molecular representation learning for molecular properties prediction with a web-based service. Scientific Reports 11(1), 11028.

[7] Ahmad, W., Simon, E., Chithrananda, S., Grand, G. & Ramsundar, B. (2022) Chemberta-2: Towards chemical foundation models. arXiv:2209.01712.

[8] Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. (2022) Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 3(1), 015022.

[9] Nvidia. (2022). MegaMolBART. Retrieved from https://github.com/NVIDIA/MegaMolBART.

[10] Hödl, S., Robinson, W., Bachrach, Y., Huck, W. & Kachman, T. (2023) Explainability Techniques for Chemical Language Models. arXiv:2305.16192.

[11] Lang, A. S. I. D., Chong, W. C. & Wörner, J. H. (2023) Fine-Tuning ChemBERTa-2 for Aqueous Solubility Prediction. Ann. Chem. Sci. Res, 4, 1-3.

[12] Liao, C., Yu, Y., Mei, Y. & Wei, Y. (2024) From Words to Molecules: A Survey of Large Language Models in Chemistry. arXiv:2402.01439.

[13] Yu, B., Baker, F. N., Chen, Z., Ning, X. & Sun, H. (2024) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset. arXiv:2402.09391.

[14] Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y. & Das, P. (2022) Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence 4(12), 1256-1264.

[15] Ramsundar, B., Eastman, P., Walters, P., Pande, V., Leswing, K. & Wu, Z. (2019) Deep Learning for the Life Sciences. O’Reilly Media, https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.

[16] Landrum, G. (2013) Rdkit documentation. Release 1(1-79), 4.

[17] Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D. & Gonzalez, J. (2020) Train big, then compress: Rethinking model size for efficient training and inference of transformers. In International Conference on machine learning, pp. 5958-5968. PMLR.