
Vithya Yogarajan, The University of Auckland. Email: [email protected]
Jacob Montiel, Tony Smith, Bernhard Pfahringer, The University of Waikato. Email: [email protected]

Improving Predictions of Tail-end Labels using Concatenated BioMed-Transformers for Long Medical Documents

Vithya Yogarajan    Bernhard Pfahringer    Tony Smith    Jacob Montiel
Abstract

Multi-label learning predicts a subset of labels from a given label set for an unseen instance while considering label correlations. A known challenge with multi-label classification is the long-tailed distribution of labels. Many studies focus on improving the overall predictions of the model and thus do not prioritise tail-end labels. Improving tail-end label predictions in multi-label classification of medical text makes it possible to understand patients better and improve care. The knowledge gained from one or more infrequent labels can influence the course of medical decisions and treatment plans. This research presents variations of concatenated domain-specific language models, including multi-BioMed-Transformers, to achieve two primary goals. First, to improve F1 scores of infrequent labels across multi-label problems, especially with long-tail labels; second, to handle long medical text and multi-sourced electronic health records (EHRs), a challenging task for standard transformers designed to work on short input sequences. A vital contribution of this research is new state-of-the-art (SOTA) results obtained using TransformerXL for predicting medical codes. A variety of experiments are performed on the Medical Information Mart for Intensive Care (MIMIC-III) database. Results show that concatenated BioMed-Transformers outperform standard transformers in terms of overall micro and macro F1 scores and individual F1 scores of tail-end labels, while incurring lower training times than existing transformer-based solutions for long input sequences.

Keywords:
Multi-label · Transformers · Long Documents · Medical Text · Tail-end Labels · SOTA

1 Introduction

Multi-label text classification techniques enable predictions of treatable risk factors in patients, aiding in better life expectancy and quality of life aubert2019patterns . The goal of multi-label learning is to predict a subset of labels for an unseen instance from a given label set while considering label correlations zhang2013review . One of the known challenges with multi-label classification is the long-tailed distribution of labels. In general, with multi-label problems, a small subset of the labels are associated with a large number of instances, and a significant fraction of the labels are associated with a small number of instances (as shown in Figure 1).

There are some examples of studies that focus on exploiting label structure zhang2018deep and label co-occurrence patterns kurata2016improved . However, in studies relating to medical text in particular, the focus is on improving the overall performance of the model rather than individual tail-end labels moons2020comparison ; amin2019mlt . There are also examples of studies, such as Wei and Li (2019) wei2019does , which demonstrate that tail-end labels have minimal impact on the overall performance. However, prediction of infrequent labels in order to understand all aspects of a patient's prognosis is as crucial as predicting frequent labels flegel2018we . The knowledge gained from one or more infrequent labels can influence the course of medical decisions, treatment plans and patient care.

This research explores the opportunity to improve predictions of tail-end labels using transformers for medical-domain specific tasks by exploiting models pre-trained on health data. We consider the option of using three variations of concatenated language models: multi-CNNText, multi-BioMed-Transformers and CNNText with Transformers. We show concatenated BioMed-Transformers improve tail-end predictions compared to other neural networks and single transformers.

In addition to improving the tail-end performance, we demonstrate concatenated domain-specific transformer models are a solution for handling text data with extended text and multi-sources of texts. For short or truncated electronic health records (EHRs), medical domain-specific transformer models outperform state-of-the-art (SOTA) methods for many classification tasks, including predicting medical codes and name entity recognition yogarajan2021trans ; domains ; gu2020domain . However, given that most transformer models are limited to a maximum sequence length of 512 tokens, with some exceptions, there is still a gap in alternative solutions for long documents. Transformer models such as Longformer beltagy1904longformer and TransformerXL dai2019transformer can handle longer sequences and perform better than other language models for long documents. Unfortunately, these models require considerable amounts of memory and processing time. In contrast, concatenated domain-specific transformers require fewer resources.

We also present new SOTA results using TransformerXL for predicting medical codes. We compare these results directly with the most recent (Nov, 2021) published SOTA liueffective for the exact same multi-label text classification problem.

We compare concatenated domain-specific transformer models with standard language models for increasingly larger multi-label problems with 30, 42, 50, 73, 158 and 923 labels. The multi-label problems considered in this paper are: predicting ICD-9 codes for ICD-9 hierarchy levels, most frequent 50 ICD-9 codes, cardiovascular disease, COVID-19 patient shielding (introduced in Yogarajan et al (2021) yogarajan2021predicting ) and systemic fungal or bacterial disease. The contributions of this work are:

  1. analyse the effectiveness of using concatenated domain-specific language models (multi-CNNText, multi-BioMed-Transformers and CNNText with Transformers) for predicting medical codes from EHRs across multiple document lengths, multiple text sources and numbers of labels;

  2. show that concatenated domain-specific transformers improve F1 scores of infrequent labels;

  3. show improvements in overall micro and macro F1 scores and achieve such improvements with fewer resources;

  4. present new SOTA results for predicting medical codes from EHRs.

Figure 1: Percentage frequency of labels for ICD-9 level-3 codes with 923 labels (left) and systemic fungal or bacterial infection with 73 labels (right) for MIMIC-III data. The labels are ordered from most frequent (left) to least frequent (right) in each plot. The threshold for tail-end labels with % Freq of occurrences <1% is indicated for reference.

2 Related Work

In the last two to three years, there have been considerable advancements in transformer models, which have shown substantial improvements in many NLP tasks, including BioNLP tasks gu2020domain ; yang2020clinical . With minimal effort, transfer learning from pre-trained models by fine-tuning on downstream supervised tasks achieves very good results amin2020exploring ; amin2019mlt . Examples of BioNLP tasks where transformers have shown performance improvements include named entity recognition, question answering, relation extraction, and clinical concept extraction gu2020domain ; yang2020clinical ; domains .

A significant obstacle for transformers is the 512-token limit they impose on input sequences 9364676 . Gao et al. (2021) 9364676 present evidence showing that BERT-based models under-perform in clinical text classification tasks with long input data, such as MIMIC-III johnson2016mimic , when compared to a CNN trained on word embeddings that can process the complete input sequences. Si and Roberts (2021) si2021hierarchical present an alternative system to overcome the issue of long documents, where transformer-based encoders are used to learn progressively from words to sentences, sentences to notes and notes to patients. This transformer-based hierarchical attention network system provides SOTA methods for in-hospital mortality prediction and phenotype prediction using MIMIC-III. However, it requires considerable computational resources si2021hierarchical . Chalkidis et al. (2020) chalkidis2020empirical propose a similar hierarchical version using SCI-BERT to deal with long documents for predicting medical codes from MIMIC-III. Here SCI-BERT reads the words of each sentence, resulting in sentence embeddings. This is followed by a self-attention mechanism that reads the sentence embeddings to produce a single document embedding, which is fed through an output layer. Unfortunately, HIER-SCI-BERT performed poorly compared to other neural networks chalkidis2020empirical . One possible reason for the poor results is the use of a continuously pre-trained BERT model chalkidis2020empirical . The continuous training approach initialises with the standard BERT model, pre-trained on Wikipedia and BookCorpus, and then continues the pre-training process with a masked language model and next-sentence prediction using domain-specific data. In this case, the vocabulary is the same as that of the original BERT model, which is considered a disadvantage for domain-specific tasks gu2020domain . For our research, PubMedBERT gu2020domain , a domain-specific BERT-based model trained solely on biomedical text, is used.

Our research focuses on automatically predicting medical codes from medical text as the multi-label classification task. Examples of predicting medical codes using transformers include ICD-10 predictions from German documents amin2019mlt ; sanger2019classifying , and predicting frequent medical codes from MIMIC-III biswas2021transicd ; yogarajan2021trans . These examples restrict themselves to (1) truncated text sequences of <512 tokens and (2) predicting frequent labels biswas2021transicd ; amin2020exploring . MIMIC-III contains many infrequent labels, as shown in Figure 1, where most codes occur in only a small number of clinical documents. This research focuses on improving the predictive accuracy for infrequent labels and on using long medical texts. Moons et al. (2020) moons2020comparison present a survey of deep learning methods for ICD coding of medical documents and identify Convolutional Attention for Multi-Label classification (CAML) mullenbach2018explainable as the SOTA method for automatically predicting medical codes from EHRs. Yogarajan et al. (2021) yogarajan2021trans present evidence that domain-specific transformers outperform CAML for truncated sequences. Liu et al. (2021) liueffective present the most recent evidence, where EffectiveCAN (an effective convolution attention network) outperforms SOTA for predicting medical codes. We extend the findings of Yogarajan et al. (2021) yogarajan2021trans by providing evidence that TransformerXL outperforms CAML and sets new SOTA results for predicting medical codes. We also present a direct comparison with EffectiveCAN for the same multi-label problem, with the same labels and data, to show that transformers such as TransformerXL outperform SOTA.

3 Data

Medical Information Mart for Intensive Care (MIMIC-III) is one of the most extensive publicly available medical databases johnson2016mimic ; goldberger2000physiobank , with more than 50,000 patient EHRs. It contains data including billing, laboratory, medications, notes, physiological information, and reports. Among the available free-form medical text, more than 90% of the unique hospital admissions contain at least one discharge summary (dis). In addition to the free-form medical text from dis, this research also makes use of text summaries from the ECG (ecg) and radiology (rad) categories. As with most free-form EHRs, MIMIC-III text data includes acronyms, abbreviations, and spelling errors. For example (data as presented in MIMIC-III, with errors):

82 yo M with h/o CHF, COPD on 5 L oxygen at baseline, tracheobronchomalacia s/p stent, presents with acute dyspnea over several days, and lethargy…

MIMIC-III data includes long documents, where dis ranges from 60 to 9,500 tokens with an average of 1,513 tokens, and rad has an average of 2,500 tokens. The documents in ecg are short, with an average of 84 tokens. In this research, MIMIC-III text is pre-processed by removing tokens that contain non-alphabetic characters (including all special characters) and tokens that appear in fewer than three training documents.
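A minimal sketch of this pre-processing step is given below. It is an illustration under stated assumptions (whitespace tokenisation, purely alphabetic tokens), not the exact implementation used in this work; the function names are hypothetical.

    from collections import Counter

    def tokenise(text):
        # Keep purely alphabetic tokens only (drops numbers and special characters).
        return [t for t in text.lower().split() if t.isalpha()]

    def preprocess_corpus(train_docs, min_doc_freq=3):
        # Remove tokens that appear in fewer than `min_doc_freq` training documents.
        tokenised = [tokenise(doc) for doc in train_docs]
        doc_freq = Counter()
        for tokens in tokenised:
            doc_freq.update(set(tokens))  # document frequency, not term frequency
        vocab = {t for t, df in doc_freq.items() if df >= min_doc_freq}
        return [[t for t in tokens if t in vocab] for tokens in tokenised], vocab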

The discharge summary is split into equal segments for a given hospital admission, and each section is labelled text 1,...,4. For example, for two splits, if a given discharge summary is 700 tokens long, text 1 is the first 350 tokens, and text 2 is the last 350 tokens. In the case of a lengthy document, if the discharge summary is 2,500 tokens long, text 1 is the first 1,250 tokens, and text 2 is the last 1,250 tokens. For multi-BioMed-Transformers, where the maximum sequence length is 512, each of text 1,...,4 is truncated to 512 tokens. There are many other ways to split the text, including sequential splits; for instance, with the first example above, text 1 being the first 512 tokens and text 2 being the remaining 238 tokens. Each of these decisions has some advantages and disadvantages. After preliminary experiments, the decision was made to split the discharge summary into equal sections (a minimal splitting sketch is given after the list below). This research presents results for the following configurations:

  Option 0: dis (1 of 2) + dis (2 of 2)

  Option 1: dis (1 of 3) + dis (2 of 3) + dis (3 of 3)

  Option 2: dis (1 of 2) + dis (2 of 2) + ecg

  Option 3: dis (1 of 2) + dis (2 of 2) + rad

  Option 4: dis (1 of 2) + dis (2 of 2) + ecg + rad

  Option 5: dis (1 of 4) + dis (2 of 4) + dis (3 of 4) + dis (4 of 4)

  Option 6: dis + ecg

  Option 7: dis + rad
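The equal splitting and truncation described above can be sketched as follows; the helper names are hypothetical, and the exact segment boundaries depend on the tokeniser used.

    def split_equal(tokens, n_segments):
        # Split a token list into n_segments contiguous, (roughly) equal parts,
        # e.g. a 700-token discharge summary -> two segments of 350 tokens each.
        size = -(-len(tokens) // n_segments)  # ceiling division
        return [tokens[i * size:(i + 1) * size] for i in range(n_segments)]

    def prepare_inputs(dis_tokens, n_segments, max_len=512):
        # Equal splits, each truncated to the 512-token transformer limit
        # (only segments longer than max_len are actually truncated).
        return [seg[:max_len] for seg in split_equal(dis_tokens, n_segments)]

    # Option 0 above corresponds to prepare_inputs(dis_tokens, n_segments=2).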

4 Multi-label Datasets and Labels

We consider predicting ICD-9 codes (the International Statistical Classification of Diseases and Related Health Problems) from EHRs as flat multi-label problems. ICD codes are used to classify diseases, symptoms, signs, and causes of diseases. Almost all health conditions can be assigned a unique code. Manually assigning medical codes requires expert knowledge and is very time-consuming; thus, the ability to predict and automate medical coding is vital. ICD-9 codes are grouped in a hierarchical tree-like structure by the World Health Organisation. In this research, we focus on levels 2 and 3 for MIMIC-III data, containing 158 labels at level 2 and 923 labels at level 3 with associated medical text for the patient. In addition, we consider three case studies, cardiovascular disease, COVID-19 patient shielding, and systemic fungal or bacterial infections, where commonly used medical codes serve as labels. As mentioned earlier, for the purposes of direct comparison with the recently published SOTA, the most frequent 50 ICD-9 codes in MIMIC-III are also considered.

Table 1: Statistics of the multi-label classification problems. Counts of frequent and infrequent (tail-end) labels are also provided. * MIMIC-III Top50 comprises the most frequent 50 labels, hence has no tail labels, and is only used in this research for direct SOTA comparison.
Multi-label Problems q # Inst LCard LDens LFreq ≥1% LFreq <1%
MIMIC-III Level 3 923 52,722 14.43 0.02 244 679
MIMIC-III Level 2 158 52,722 11.61 0.07 100 58
MIMIC-III Top50* 50 50,957 5.60 0.11 50 0
Fungal or bacterial 73 30,814 2.06 0.03 34 39
COVID-19 yogarajan2021predicting 42 35,458 1.84 0.04 27 15
Cardiovascular 30 28,154 2.51 0.08 16 14

Table 1 provides a summary of the multi-label problems used in this research. For multi-label problems, the notation of Tsoumakas et al. (2009) tsoumakas2009mining is used, where $L=\{\lambda_{j}:j=1\ldots q\}$ refers to the finite set of labels and $D=\{(x_{i},Y_{i}),i=1\ldots m\}$ refers to the set of multi-label training examples. Here $x_{i}$ is the feature vector, and $Y_{i}\subseteq L$ is the set of labels of the $i$-th example. Label cardinality (LCard) is the average number of labels of the examples in a dataset, and label density (LDens) is cardinality divided by $q$. Table 1 also provides the number of labels selected for the experiments presented in this paper, split into labels with frequency of occurrence ≥1% and tail-end labels (frequency <1%).
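As a concrete illustration of these quantities, a minimal sketch (assuming Y is a list of per-example label sets) is:

    def label_cardinality(Y):
        # LCard: average number of labels per example.
        return sum(len(y) for y in Y) / len(Y)

    def label_density(Y, q):
        # LDens: label cardinality divided by the total number of labels q.
        return label_cardinality(Y) / q

    # For MIMIC-III Level 2 (q = 158), Table 1 reports LCard = 11.61 and LDens = 0.07.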

5 Language Models

This research mainly focuses on transformer models. Transformers are feed-forward models based on the self-attention mechanism, with no recurrence. Self-attention takes into account the context of a word while processing it. Similar to the sequence-to-sequence attention mechanism, self-attention is a soft measure in which multiple words are considered at once. Transformer models process all the tokens in a sequence in parallel, enabling the capture of long-distance dependencies. Vaswani et al. (2017) vaswani2017attention provide an introduction to the transformer architecture.
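To make the mechanism concrete, the following is a minimal single-head scaled dot-product self-attention in PyTorch (no masking or multi-head projections), a simplified sketch of the computation described in Vaswani et al. (2017) vaswani2017attention :

    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (batch, seq_len, d_k). Every position attends to every other
        # position in parallel, so long-distance dependencies can be captured.
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
        weights = torch.softmax(scores, dim=-1)        # soft attention weights over all tokens
        return weights @ V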

BERT (Bidirectional Encoder Representations from Transformers) DBLP:journals/corr/abs-1810-04805 is one of the early transformer models that applies bidirectional training of encoders vaswani2017attention to language modelling. The 12-layer BERT-base model, with a hidden size of 768, 12 self-attention heads and 110M parameters, was pre-trained from scratch on BookCorpus and English Wikipedia. PubMedBERT gu2020domain uses the same architecture and is domain-specifically pre-trained from scratch using abstracts from PubMed and full-text articles from PubMedCentral, to better capture biomedical language gu2020domain .

BioMed-RoBERTa-base domains is based on the RoBERTa-base DBLP:journals/corr/abs-1907-11692 architecture. RoBERTa-base, originally trained on 160GB of general-domain data, was further continuously pre-trained using 2.68 million scientific papers from the Semantic Scholar corpus. Gururangan et al. (2020) domains show that BioMed-RoBERTa-base, which was specifically pre-trained on medical text data, outperforms the generically trained RoBERTa-base model on biomedical domain-specific tasks.

TransformerXL dai2019transformer is an architecture that enables the representation of language beyond a fixed length. It can learn dependencies that are longer than those captured by recurrent neural networks and vanilla transformers. The Longformer beltagy1904longformer model is designed to handle longer sequences without the limitation of the maximum token size of 512. Longformer reduces the model complexity from quadratic to linear by reformulating the self-attention computation. Compared to TransformerXL dai2019transformer , Longformer is not restricted to a left-to-right approach to processing documents.

In addition to transformer models, CNNText kim2014convolutional with domain-specific fastText pre-trained 100-dimensional embeddings is used. CNNText combines one-dimensional convolutions with a max-over-time pooling layer and a fully connected layer. The final prediction is made by computing a weighted combination of the pooled values and applying a sigmoid function. A simple architecture of CNNText is presented in Figure 2.
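A minimal sketch of such a CNNText model in PyTorch is shown below; the kernel sizes and filter count are illustrative assumptions rather than the exact hyperparameters used in the experiments.

    import torch
    import torch.nn as nn

    class CNNText(nn.Module):
        # 1-D convolutions over (e.g. 100-dimensional fastText) word embeddings,
        # max-over-time pooling, and a sigmoid output per label.
        def __init__(self, embedding_matrix, num_labels, kernel_sizes=(3, 4, 5), n_filters=100):
            super().__init__()
            emb_dim = embedding_matrix.size(1)
            self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
            self.convs = nn.ModuleList(nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
            self.fc = nn.Linear(n_filters * len(kernel_sizes), num_labels)

        def forward(self, token_ids):                      # (batch, seq_len)
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            logits = self.fc(torch.cat(pooled, dim=1))
            return torch.sigmoid(logits)                   # one probability per label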

CAML mullenbach2018explainable is also used for comparison with TransformerXL and the other language models. CAML combines convolutional networks with an attention mechanism. Simultaneously, a second module is used to learn embeddings of the descriptions of ICD-9 codes, to improve the predictions of less frequent labels and to provide target regularisation. For each word in a given document, word embeddings are concatenated into a matrix, and a one-dimensional convolution layer is used to combine these adjacent embeddings.

Algorithm 1 Multiple BioMed-Transformer
1:  Input: Fixed-length multi-sourced or long-document text input with set of labels $Y \subseteq L$, domain-specific pre-trained transformer models $x_i$ with parameters $\theta_{1,2,\ldots,n}$, a linear layer (FC) with $L$ output units and parameters $\theta_l$, and the binary cross-entropy (BCE) loss function.
2:  for each mini-batch do
3:     pooled_features = []
4:     for each document $i$ do
5:        $x_i$ = BioMed-Transformer(document$_i$)
6:        pooled_features.append(AVG_POOL($x_i$))
7:     end for
8:     combined_features = CONCATENATE(pooled_features)
9:     drop_output = DROPOUT(combined_features)
10:     output = FC$_{\theta_l}$(drop_output)
11:     $\mathcal{L} = \mathcal{L}_{BCE}$(output, targets)
12:     $\theta = [\theta_1, \theta_2, \theta_3, \ldots, \theta_n, \theta_l]$
13:     $\theta = \theta - \nabla_{\theta}\mathcal{L}$
14:  end for

6 Concatenated Language Models

6.1 Multi-BioMed-Transformers

Multi-BioMed-Transformers use an architecture in which two or more domain-specific transformer models are concatenated to enable the use of multiple text inputs. Algorithm 1 presents an outline of multi-BioMed-Transformer models concatenated together. We explore the options of two to four PubMedBERT models concatenated together; see Figure 2 for an example of the TriplePubMedBERT architecture. Concatenated transformer models enable the processing of longer sequences, where the longer input sequence is split into multiple smaller segments with a maximum length of 512 tokens. The average length of discharge summaries in MIMIC-III is approximately 1,500 tokens, hence the choice of concatenating two to four PubMedBERT models. Moreover, as indicated in Section 3, MIMIC-III contains text from other categories, such as ecg and rad. Multi-BioMed-Transformers provide the option of using these other available texts as additional input.
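A minimal PyTorch sketch of Algorithm 1 is shown below. The Hugging Face checkpoint name and the dropout rate are assumptions of this sketch, not necessarily the exact configuration used; the per-segment encoders, average pooling, concatenation and single linear output layer mirror the algorithm.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class MultiBioMedTransformer(nn.Module):
        def __init__(self, num_labels, n_segments=2,
                     model_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"):
            super().__init__()
            # One PubMedBERT encoder per text segment, each with its own parameters.
            self.encoders = nn.ModuleList(AutoModel.from_pretrained(model_name)
                                          for _ in range(n_segments))
            hidden = self.encoders[0].config.hidden_size
            self.dropout = nn.Dropout(0.1)
            self.fc = nn.Linear(hidden * n_segments, num_labels)

        def forward(self, input_ids_list, attention_mask_list):
            pooled = []
            for enc, ids, mask in zip(self.encoders, input_ids_list, attention_mask_list):
                hidden_states = enc(input_ids=ids, attention_mask=mask).last_hidden_state
                pooled.append(hidden_states.mean(dim=1))  # AVG_POOL over tokens
            combined = torch.cat(pooled, dim=1)           # CONCATENATE
            return self.fc(self.dropout(combined))        # logits; BCE loss applied outside

With n_segments=2 this corresponds to DualPubMedBERT, and with three or four segments to the Triple and Quadruple variants.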

6.2 Multi-CNNText

Multi-CNNText adopts the same idea as multi-BioMed-Transformers, where two or more CNNText models are concatenated together. Figure 2 presents an example of DualCNNText, where two CNNText models are concatenated. Although CNNText can handle longer input sequences, concatenating multiple CNNText models makes it possible to use input text from different categories, such as ECG and radiology, so that the features of each category can be captured separately.

6.3 CNNText with Transformers

The third variation combines CNNText with transformers (see Figure 2). Although many variations are possible, this research considers only two: BERT-base and PubMedBERT combined with CNNText. Other variations, such as different embedding dimensions or multiple transformer models, could also be paired with CNNText. It is also important to point out that CNNText is just one possible choice; many other deep learning models could be used in its place.

Figure 2: Concatenated language model architectures: (a) TriplePubMedBERT, (b) DualCNNText, and (c) CNNText with Transformer.

7 Experiments

We present overall micro and macro F1 scores and individual label F1 scores for the multi-label problems outlined in Table 1. Critical difference (CD) plots are presented as supporting statistical analysis. The Nemenyi post-hoc test (95% confidence level) identifies statistical differences between learning methods. CD graphs show the average ranking of individual F1 scores obtained using the various language models; the lower the rank, the better. The difference in average ranking is statistically significant if there is no bold line connecting the two settings. All experimental results are obtained from a random-seed training-testing scheme and averaged over three runs. The variation across these three independent runs is within a range of $\pm 0.015$. We explore several different transformer models and compare their performance to concatenated BioMed-Transformers.

Transformer implementations are based on the open-source PyTorch transformer repository (https://github.com/huggingface/transformers). Transformer models are fine-tuned on all layers without freezing. For the optimiser, we use Adam kingma2014adam with learning rates between 9e-6 and 1e-5. Training batch sizes were varied between 1 and 16. A non-linear sigmoid function $f(z)=\frac{1}{1+e^{-z}}$, with a range of 0 to 1, is used as the activation function. The binary cross-entropy cox1958regression loss, $Loss_{BCE}(X,y)=-\sum_{l=1}^{L}\left(y_{l}\log(\hat{y}_{l})+(1-y_{l})\log(1-\hat{y}_{l})\right)$, over the labels is used for multi-label classification. Domain-specific fastText embeddings yogarajan2020 ; yogarajan2020seeing of a 100-dimensional skipgram model are used for the neural networks. Our source code can be obtained from:
https://github.com/vithyayogarajan/Medical-Domain-Specific-Language-Models/tree/main/Concatenated-Language-Models-Multi-label
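A sketch of this optimisation set-up is given below, reusing the MultiBioMedTransformer sketch from Section 6 and assuming a hypothetical train_loader that yields tokenised segments and binary target vectors.

    import torch

    model = MultiBioMedTransformer(num_labels=30, n_segments=2)  # e.g. DualPubMedBERT
    criterion = torch.nn.BCEWithLogitsLoss()   # sigmoid + binary cross-entropy in one step
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-5)

    for input_ids_list, attention_mask_list, targets in train_loader:  # targets: (batch, L) in {0, 1}
        optimiser.zero_grad()
        logits = model(input_ids_list, attention_mask_list)
        loss = criterion(logits, targets.float())  # averaged over labels rather than summed
        loss.backward()
        optimiser.step()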

8 Results

Results are presented in three parts. First, we present the overall performance of the language models, followed by the SOTA comparison, and finally the tail-end label performance.

8.1 Overall performance

We present an extensive comparison across models for cardiovascular disease, followed by selected results for the other multi-label problems. Table 2 presents the results of various language model variations for cardiovascular disease, using MIMIC-III data with 28,154 hospital admissions and 30 labels. Multi-PubMedBERT and multi-BioMed-RoBERTa show a consistent improvement of 3% to 7% in micro-F1 scores over single PubMedBERT and BioMed-RoBERTa, respectively. The macro-F1 scores of the TriplePubMedBERT options are better than those of the other language models presented, by at least 3%, except for TransformerXL with 3,072 tokens. The macro-F1 scores of multi-CNNText and CNNText with Transformers are poor compared to all other language models presented. For cardiovascular disease, incorporating ecg and rad does show some improvement in overall results, especially with the TriplePubMedBERT options. Critical difference plots for individual label F1 scores obtained using the language models in Table 2 are presented in Figure 3. Both Table 2 and Figure 3 show that TransformerXL with dis 3,072 tokens is the best option. However, multi-BioMed-Transformers show improvements, especially when compared to single BioMed-Transformers.

Table 2: Comparison of micro-F1, macro-F1 of cardiovascular disease among various language models and input text for MIMIC-III data. Input text options include the maximum sequence length and reference to the options. Bold is used to indicate the best results for each grouping in the table, and underline is used for overall best results. Results are averaged over three runs.
Neural Network Details Input Text Options Micro-F1 Macro-F1
BioMed-RoBERTa dis 512 0.69 0.30
PubMedBERT dis 512 0.70 0.30
TransformerXL dis 1,536 0.75 0.28
TransformerXL dis 3,072 0.78 0.32
Longformer dis 3,000 0.74 0.30
CAML (T100SG) dis 3,000 0.77 0.24
\hdashline Dual-Bio-RoBERTa Option 0: 512 0.72 0.28
DualPubMedBERT Option 0: 512 0.72 0.30
Triple-BioMed-RoBERTa Option 1: 512 0.72 0.29
TriplePubMedBERT Option 1: 512 0.73 0.29
TriplePubMedBERT Option 2: 512 0.73 0.31
TriplePubMedBERT Option 3: 512 0.73 0.30
QuadruplePubMedBERT Option 4: 512 0.74 0.28
\hdashline CNNText (T100SG) dis 512 0.72 0.23
CNNText (T100SG) dis 3,000 0.74 0.30
DualCNNText (T100SG) Option 0: 1,000 0.73 0.22
TripleCNNText (T100SG) Option 2: 1,000 0.74 0.24
TripleCNNText (T100SG) Option 3: 1,000 0.75 0.25
QuadrupleCNNText (T100SG) Option 4: 1,000 0.74 0.22
\hdashline CNNText (T100SG) + BERT-base Option 6: dis 3,000 + ecg 512 0.75 0.20
CNNText (T100SG) + PubMedBERT Option 6: dis 3,000 + ecg 512 0.76 0.22
CNNText (T100SG) + PubMedBERT Option 7: dis 3,000 + rad 512 0.75 0.21
Figure 3: Critical difference plots. Nemenyi post-hoc test (95% confidence level), identifying statistical differences between the language models for cardiovascular disease presented in Table 2.
Table 3: Comparison of micro-F1 and macro-F1 scores for COVID-19 patient shielding, systemic fungal or bacterial infection, and levels 2 and 3 of ICD-9 codes among various language models. Time required per epoch (average times in seconds, based on experiments run on a 12-core Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz with a GV100GL [Quadro GV100] GPU) is also presented for systemic fungal or bacterial infection and MIMIC-III Level 3. Input text options are included for reference. Bold indicates the best results within each group (for time, the lowest time), and underline indicates the overall best results. Published results are also presented for direct comparison. Results are averaged over three runs.
COVID-19 Fungal or Bacterial
Transformers Input Text Micro-F1 Macro-F1 Micro-F1 Macro-F1 Time (epoch)
BioMed-RoBERTa dis 512 0.53 yogarajan2021predicting 0.45 yogarajan2021predicting 0.45 0.39 2,554 sec
PubMedBERT dis 512 0.54 yogarajan2021predicting 0.48 yogarajan2021predicting 0.48 0.39 2,940 sec
TransformerXL dis 512 0.51 0.45 0.47 0.39 2,921 sec
TransformerXL dis 3,072 0.65 yogarajan2021predicting 0.51 yogarajan2021predicting 0.64 0.46 43,200 sec
Longformer dis 3,000 0.58 yogarajan2021predicting 0.50 yogarajan2021predicting 0.58 0.43 13,500 sec
CAML (T100SG) dis 3,000 0.61 yogarajan2021predicting 0.40 yogarajan2021predicting 0.62 0.38 47 sec
DualPubMedBERT Option 0: 512 0.58 0.49 0.57 0.43 4,020 sec
TriplePubMedBERT Option 1: 512 0.54 0.46 0.56 0.40 5,580 sec
TriplePubMedBERT Option 2: 512 - - 0.54 0.39 5,580 sec
TriplePubMedBERT Option 3: 512 - - 0.54 0.39 5,580 sec
QuadruplePubMedBERT Option 4: 512 - - 0.54 0.40 7,080 sec
QuadruplePubMedBERT Option 5: 512 0.52 0.46 0.57 0.40 7,080 sec
MIMIC-III Level 2 codes MIMIC-III Level 3 codes
Transformers Input Text Micro-F1 Macro-F1 Micro-F1 Macro-F1 Time (epoch)
PubMedBERT dis 512 0.65 yogarajan2021multilabel 0.41 yogarajan2021multilabel 0.55 0.18 3,393 sec
BioMed-RoBERTa dis 512 0.64 yogarajan2021multilabel 0.40 yogarajan2021multilabel 0.53 0.18 4,877 sec
TransformerXL dis 3,072 0.73 0.46 - - -
Longformer dis 3,000 0.72 0.45 0.62 0.19 16,889 sec
CAML (T100SG) dis 3,000 0.72 0.43 0.64 0.26 64 sec
DualPubMedBERT Option 0: 512 0.68 0.45 0.57 0.20 4,750 sec
DualBioMed-RoBERTa Option 0: 512 0.66 0.43 0.56 0.19 6,842 sec
TriplePubMedBERT Option 1: 512 0.66 0.43 - - -
Figure 4: Critical difference plots. Nemenyi post-hoc test (95% confidence level), identifying statistical differences between the language models in Table 3, where the critical difference is calculated for individual label F1 scores: (a) systemic fungal or bacterial infections; (b) MIMIC-III Level 2 codes.

Table 3 presents micro and macro F1 scores for various language model variations for COVID-19 patient shielding and systemic fungal or bacterial infection using MIMIC-III data. For systemic fungal or bacterial infections, multi-PubMedBERT models show improvements of 12% to 19% in micro-F1 and 2% to 10% in macro-F1 scores over single PubMedBERT, except for TriplePubMedBERT with rad and ecg, where the macro-F1 score is on par with single PubMedBERT. Contrary to the case of cardiovascular disease, here the additional inputs of ecg and rad do not result in better performance. It is likely that ecg and rad are not that relevant for coding fungal or bacterial infections. For COVID-19 patient shielding, Table 3 shows TransformerXL with dis 3,072 tokens to be the best option, as observed with the other case studies. DualPubMedBERT shows improvements over single PubMedBERT and the other variations of multi-PubMedBERT.

All three case studies show that TransformerXL with dis 3,072 tokens is the top performer in terms of predictive performance. However, concatenated BioMed-Transformers show improvements, especially when compared to single BioMed-Transformers. Table 3 also presents the time per epoch in seconds for systemic fungal or bacterial infection, to provide a direct comparison among the language models. The requirements of TransformerXL (3,072) are much greater than those of the other language models, including multi-PubMedBERT: for example, it needs 240 hours (for dis 3,072) where DualPubMedBERT requires only 22 hours.

Table 3 also presents micro and macro F1 scores for levels 2 and 3 of ICD-9 codes using MIMIC-III data. As mentioned above, due to the processing time required by TransformerXL (3,072), we only use Longformer for encoding long documents for ICD-9 level 3. For MIMIC-III Level 2 codes, TransformerXL with dis 3,072 tokens is the top performer. DualPubMedBERT shows improvements in both micro and macro F1 scores by 3% to 5% over other PubMedBERT variations, and macro-F1 of DualPubMedBERT and Longformer are equal and only marginally behind TransformerXL. For MIMIC-III Level 3 codes, the macro-F1 score of DualPubMedBERT is better than other transformer models, including Longformer. However, CAML (T100SG) outperforms all variations of transformer models.

Figure 4 presents the critical difference plots for the results presented in Table 3. The Nemenyi post-hoc test (95% confidence level) shows statistical differences between learning methods. TransformerXL (3,072) and Longformer (3,000) are the overall top performers. However, the difference between them and DualPubMedBERT is not statistically significant.

This section compared the overall performance of multiple language models on MIMIC-III data with 30, 42, 73, 158 and 923 labels. TransformerXL (3,072) consistently outperformed the other language models. Multi-CNNText and CNNText with Transformers performed poorly compared to the other language model variations; hence, results for the CNNText variations are only presented for cardiovascular disease. Multi-BioMed-Transformers outperform single BioMed-Transformers, with a more noticeable improvement in micro-F1 scores for cardiovascular disease and systemic fungal or bacterial infections. Due to computational restrictions, only Longformer was used to handle long text sequences for level 3 of ICD-9 codes. The DualPubMedBERT macro-F1 score was the same as that of Longformer and TransformerXL for level 2 ICD-9 codes, and better than that of Longformer for level 3 ICD-9 codes with 923 labels.

8.2 SOTA Results

Both Tables 2 and 3 and Figures 3 and 4 show that TransformerXL outperforms CAML across all multi-label problems for predicting medical codes. In addition, there are other language models, including concatenated models, that perform on par with or above CAML, especially when macro-F1 scores are compared.

Table 4: Overall results for the MIMIC-III Top 50 ICD-9 codes. Bold is used to indicate the best results. Published results are presented for direct comparison. Description Regularised CAML is referred to as DR-CAML. Both variations of EffectiveCAN are presented (see Liu et al. (2021) liueffective for details of the EffectiveCAN variations and architectures). Our results are averaged over three runs.
Models Micro-F1 Macro-F1
CAML mullenbach2018explainable 0.614 0.532
DR-CAML mullenbach2018explainable 0.633 0.576
EffectiveCAN (Sum-pooling attention) liueffective 0.702 0.644
EffectiveCAN (Multi-layer attention) liueffective 0.717 0.668
\hdashline DualPubMedBERT (Option 0: 512) 0.640 0.576
TriplePubMedBERT (Option 1: 512) 0.641 0.583
Longformer (3,000) 0.703 0.654
TransformerXL (3,072) 0.723 0.677

Table 4 provides the overall micro and macro F1 scores for the most frequent 50 ICD-9 codes in MIMIC-III using the discharge summary. In this particular case, for direct comparison, the labels and input data are matched to the exact specifications of the compared published methods. This is the only section in this research where the Top 50 ICD-9 codes are used for experimental evaluation. Evidently, TransformerXL (3,072) with a learning rate of 1e-5 presents new SOTA results.

Figure 5: F1 scores for tail-end labels (frequency <1%), where single PubMedBERT (in red) is compared with multi-BioMed-Transformer models, for each of the five multi-label problems.

8.3 Tail-end Labels

This section presents a comparison of individual label F1 scores for the multi-label problems presented in Section 8.1. The focus here is on showing the differences and the improvements in F1 scores of tail-end labels with multi-BioMed-Transformers compared to single transformer models, including Longformer and TransformerXL. Table 1 presents the number of labels with frequency ≥1% and the number of tail-end labels (label frequency <1%).
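The tail-end analysis can be reproduced from per-label F1 scores, for example as in the following sketch (assuming y_true and y_pred are binary indicator matrices of shape (instances, labels)):

    import numpy as np
    from sklearn.metrics import f1_score

    def tail_end_f1(y_true, y_pred, threshold=0.01):
        # Per-label F1 restricted to tail-end labels, i.e. labels that occur in
        # fewer than 1% of instances (the threshold used throughout this paper).
        label_freq = y_true.mean(axis=0)  # fraction of instances carrying each label
        per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
        tail = label_freq < threshold
        return per_label_f1[tail], np.where(tail)[0]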

Figure 5 presents tail-end F1 scores across all five multi-label problems. With the exception of a few specific labels, such as ICD-9 code 508 for the COVID-19 patient shielding problem, the F1 scores of concatenated BioMed-Transformers for tail-end labels are consistently better. This improvement is more evident for long tail-end cases, such as levels 2 and 3 of ICD-9 codes, where there is also an improvement in the number of labels with an F1 score ≠ 0.

Figure 6: MIMIC-III Level 2 and 3 codes, where the difference between the F1 scores of dual/triple language model variations and Longformer (3,000 tokens) is presented for tail-end labels (frequency <1%). Labels are ordered by frequency, with the least frequent label at the right end. The legend is presented for reference. Negative values indicate better F1 scores for Longformer (3,000).

Tables 2 and 3 show that the overall performance of Longformer and TransformerXL is generally better, especially when compared to single BioMed-Transformers. To analyse tail-end behaviour, we also present the difference in F1 scores between a particular language model variation and Longformer or TransformerXL; negative values indicate that Longformer or TransformerXL has a better F1 score.

Figure 6 presents the difference in F1 scores for level 2 and 3 ICD-9 codes for the following combinations: DualPubMedBERT - Longformer (3,000), TriplePubMedBERT - Longformer (3,000) and DualBioMed-RoBERTa - Longformer (3,000). Due to space restrictions, only tail-end labels are presented. However, it is important to note that for frequent labels, the F1 scores of Longformer are on par with or better than those of the other three models. Smaller differences in F1 scores are noticed among the most frequent labels, and occasionally the dual and triple models perform slightly better than Longformer for specific labels. In general, Longformer has the most wins over the other models for label frequency ≥1%. This pattern is reversed for tail-end labels, with Longformer losing more often to the dual and triple models where a difference in F1 scores is observed. For some tail-end labels, these differences are noticeably larger than for other labels. Level 3 contains 923 labels, with more than 650 labels being infrequent. Figure 6 also shows Longformer losing more often to dual transformers on tail-end labels.

Table 5: The number of wins, draws and losses of concatenated language models compared to Longformer (LF) and TransformerXL (TXL) for systemic fungal or bacterial infections and levels 2 and 3 of ICD-9 codes.
Systemic fungal or bacterial infections, 73 labels
Models Freq ≥1% Freq <1%
wins draws losses   wins draws losses
PubMedBERT -TXL 1 1 32 11 12 16
DualPubMedBERT - TXL 3 3 28 15 9 15
TriplePubMedBERT - TXL 4 0 30 8 13 18
QuadruplePubMedBERT - TXL 3 0 31 10 12 17
PubMedBERT - LF 4 3 27 13 10 16
DualPubMedBERT - LF 12 0 21 14 13 12
TriplePubMedBERT - LF 9 2 23 12 12 15
QuadruplePubMedBERT - LF 15 1 18 10 10 19
MIMIC-III Level 2 codes, 158 labels
DualPubMedBERT - TXL 26 0 74 26 16 16
DualBioMed-RoBERTa - TXL 25 0 75 28 16 14
TriplePubMedBERT - TXL 20 1 79 27 14 17
DualPubMedBERT - LF 19 11 70 30 22 6
DualBioMed-RoBERTa - LF 12 5 83 28 21 9
TriplePubMedBERT - LF 13 2 85 26 20 12
MIMIC-III Level 3 codes, 923 labels
DualPubMedBERT -LF 92 17 135 160 476 43
DualBioMed-RoBERTa -LF 83 21 144 181 454 40

Table 5 presents the number of per-label wins, draws, and losses for levels 2 and 3 ICD-9 codes and for fungal or bacterial infections. For multi-label problems, the F1 scores of many infrequent labels are zero. This observation is also evident in Figures 5 and 6. To quantify the observations, differences in F1 scores are presented as wins, draws and losses. In most cases, draws occur where both F1 scores are zero. We acknowledge that further analysis is needed to understand the behaviour observed in Table 5.

As observed in Figure 6 for tail-end labels, more wins are observed for the concatenated models. DualPubMedBERT is the best-performing option, with the fewest losses among the more frequent label groups and the most wins among the tail-end labels. For MIMIC-III Level 3 codes, the results in Table 5 show Longformer losing more often to dual transformers on tail-end labels. For frequent labels in systemic fungal or bacterial infection, the F1 scores of TransformerXL are consistently better than those of the PubMedBERT variations, making it a clear winner. For infrequent labels, the multi-PubMedBERT variations perform better than TransformerXL for many labels.

9 Discussion

We presented concatenated domain-specific language model variations to improve performance on the many infrequent labels in multi-label problems with long input sequences. Although TransformerXL and Longformer can encode long sequences, and TransformerXL in general outperforms the other models, setting new SOTA results, the required computational resources are prohibitive. Concatenated PubMedBERT models outperformed single BioMed-Transformers. There was a noticeable improvement in micro-F1 for multi-BioMed-Transformers with cardiovascular disease and systemic fungal or bacterial infection. For larger multi-label problems, DualPubMedBERT, TransformerXL and Longformer achieve the same macro-F1 for MIMIC-III Level 2, but DualPubMedBERT wins for MIMIC-III Level 3.

We also studied the impact on predictive performance for less frequent labels. Label frequency is highly biased by the hospital or department from which the data were collected. If the data were from a fertility ward, the label frequency of pregnancy-related medical codes would be high, while for a cardiovascular ward this may not be the case. However, only being able to predict highly frequent labels well poses risks to a patient's health and well-being. Hence, this research also compared individual label F1 scores for multi-label problems, focusing on tail-end labels. For larger multi-label problems with long tail-end labels, such as level 2 and 3 ICD-9 codes, multi-BioMed-Transformers had more wins than Longformer and TransformerXL.

The experimental evidence presented shows that, with fewer resources, concatenated BioMed-Transformers can improve overall micro and macro F1 scores for multi-label problems with long medical text. In addition, for multi-label problems with many tail-end labels, multi-BioMed-Transformers outperform other language models when the F1 scores of tail-end labels are compared directly.

Many avenues of research arise directly from this work. If processing time or resources are not an issue, then continuous pre-training of TransformerXL and Longformer on health-related data might improve prediction accuracy, possibly even for tail-end labels. Concatenating TransformerXL or Longformer is also a possibility. ICD-9 codes have a tree-like hierarchical structure; hence, predicting ICD-9 codes as a hierarchical multi-label classification problem, using transformers to encode the medical text, is another relevant avenue to explore.

References

  • (1) Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In: CLEF (Working Notes) (2019)
  • (2) Amin-Nejad, A., Ive, J., Velupillai, S.: Exploring Transformer Text Generation for Medical Dataset Augmentation. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 4699–4708 (2020)
  • (3) Aubert, C.E., Schnipper, J.L., Fankhauser, N., Marques-Vidal, P., Stirnemann, J., Auerbach, A.D., Zimlichman, E., Kripalani, S., Vasilevskis, E.E., Robinson, E., et al.: Patterns of multimorbidity associated with 30-day readmission: a multinational study. BMC public health 19(1), 738 (2019)
  • (4) Beltagy, I., Peters, M., Cohan, A.: Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
  • (5) Biswas, B., Pham, T.H., Zhang, P.: Transicd: Transformer based code-wise attention model for explainable icd coding. arXiv preprint arXiv:2104.10652 (2021)
  • (6) Chalkidis, I., Fergadiotis, M., Kotitsas, S., Malakasiotis, P., Aletras, N., Androutsopoulos, I.: An empirical study on large-scale multi-label text classification including few and zero-shot labels. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7503–7515 (2020)
  • (7) Cox, D.R.: The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological) 20(2), 215–232 (1958)
  • (8) Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. In: ACL (2019)
  • (9) Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL-HLT (2019)
  • (10) Flegel, K.: What we need to learn about multimorbidity. CMAJ 190(34) (2018)
  • (11) Gao, S., Alawad, M., Young, M.T., Gounley, J., Schaefferkoetter, N., Yoon, H.J., Wu, X.C., Durbin, E.B., Doherty, J., Stroup, A., Coyle, L., Tourassi, G.D.: Limitations of transformers on clinical text classification. IEEE Journal of Biomedical and Health Informatics pp. 1–12 (2021). DOI 10.1109/JBHI.2021.3062322
  • (12) Goldberger, A.L., Amaral, L.A., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000)
  • (13) Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., Poon, H.: Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779 (2020)
  • (14) Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In: Proceedings of ACL (2020)
  • (15) Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific data 3, 160035 (2016)
  • (16) Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Association for Computational Linguistics (2014)
  • (17) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
  • (18) Kurata, G., Xiang, B., Zhou, B.: Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 521–526 (2016)
  • (19) Liu, Y., Cheng, H., Klopfer, R., Schaaf, T., Gormley, M.R.: Effective convolutional attention network for multi-label clinical document classification. In: EMNLP (2021)
  • (20) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • (21) Moons, E., Khanna, A., Akkasi, A., Moens, M.F.: A comparison of deep learning methods for icd coding of clinical records. Applied Sciences 10(15), 5262 (2020)
  • (22) Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. ACL: New Orleans, LA, USA (2018)
  • (23) Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In: CLEF (Working Notes) (2019)
  • (24) Si, Y., Roberts, K.: Hierarchical transformer networks for longitudinal clinical document classification. arXiv preprint arXiv:2104.08444 (2021)
  • (25) Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data mining and knowledge discovery handbook, pp. 667–685. Springer (2009)
  • (26) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30, 5998–6008 (2017)
  • (27) Wei, T., Li, Y.F.: Does tail label help for large-scale multi-label learning? IEEE transactions on neural networks and learning systems 31(7), 2315–2324 (2019)
  • (28) Yang, X., Bian, J., Hogan, W.R., Wu, Y.: Clinical concept extraction using transformers. Journal of the American Medical Informatics Association 27(12), 1935–1942 (2020)
  • (29) Yogarajan, V., Gouk, H., Smith, T., Mayo, M., Pfahringer, B.: Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words For Predicting Medical Codes. In: Asian Conference on Intelligent Information and Database Systems, pp. 97–108. Springer (2020)
  • (30) Yogarajan, V., Montiel, J., Smith, T., Pfahringer, B.: Seeing the whole patient: Using multi-label medical text classification techniques to enhance predictions of medical codes. arXiv preprint arXiv:2004.00430 (2020)
  • (31) Yogarajan, V., Montiel, J., Smith, T., Pfahringer, B.: Transformers for multi-label classification of medical text: An empirical comparison. In: A. Tucker, P. Henriques Abreu, J. Cardoso, P. Pereira Rodrigues, D. Riaño (eds.) Artificial Intelligence in Medicine, pp. 114–123. Springer International Publishing, Cham (2021)
  • (32) Yogarajan, V., Montiel, J., Smith, T., Pfahringer, B.: Transformers for multi-label classification of medical text: An empirical comparison. In: A. Tucker, P. Henriques Abreu, J. Cardoso, P. Pereira Rodrigues, D. Riaño (eds.) Artificial Intelligence in Medicine, pp. 114–123. Springer International Publishing, Cham (2021)
  • (33) Yogarajan, V., Montiel, J., Smith, T., Pfahringer, B.: Predicting covid-19 patient shielding: A comprehensive study. In: Australasian Joint Conference on Artificial Intelligence. Springer (2021 (to appear))
  • (34) Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26(8), 1819–1837 (2013)
  • (35) Zhang, W., Yan, J., Wang, X., Zha, H.: Deep extreme multi-label learning. In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 100–107 (2018)