An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification
Abstract
Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features in machine and deep learning models for malware classification, as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships among API calls. Unlike them, transformer-based models process a sequence as a whole and learn relationships among API calls through multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the Transformer model with one transformer block layer surpasses the widely used base architecture, LSTM. Moreover, the pre-trained transformer models BERT and CANINE outperform both base models in classifying highly imbalanced malware families according to our evaluation metrics, F1-score and AUC score. Furthermore, our proposed bagging-based random transformer forest (RTF) model, an ensemble of BERT or CANINE, reaches state-of-the-art evaluation scores on three out of four datasets; in particular, it attains a state-of-the-art F1-score of 0.6149 on one of the commonly used benchmark datasets.
keywords:
Transformer, Tokenization-free, API Calls, Imbalanced, Multiclass, BERT, CANINE, Ensemble, Malware Classification
[inst1] Cyber Security Graduate Program, Kadir Has University, Istanbul, Turkey
[inst3] Huawei R&D Center, Istanbul, Turkey
[inst2] Management Information Systems, Kadir Has University, Istanbul, Turkey
1 Introduction
In recent times, with our dependence on information technologies, the Internet has been widely used by people of all ages. People who want to quickly meet daily needs such as online banking, online shopping, health, and transportation-related transactions have also driven an enormous increase in Internet usage. This exponential growth in Internet usage plays a significant role in making life easier. On the other hand, it poses a severe threat, as cyber attacks increase drastically in parallel with the growth of the Internet. Among these cyber attacks, malicious software (malware) is the primary weapon attackers use to conduct malicious activities against a victim's machine, such as a computer, smartphone, or computer network, in order to disrupt the system's functions and gain unauthorized access [1, 2].
Cybercriminals use several ways to spread malware, such as phishing e-mails with malicious links and attachments, text messages, and malicious advertisements. According to the state of e-mail security report released by Mimecast in 2021, 61% of organizations were exposed to e-mail-based ransomware in 2020, an increase of 10% compared to the previous year [3]. The average amount spent to recover from a ransomware attack, when factors such as downtime, device, human, and network costs are included, is about $1.85 million, as reported by Sophos [4]. According to another cyber threat report, published by SonicWall, 5.6 billion malware attacks were carried out in 2020 [5].
In light of all these findings, one can safely claim that the sheer volume of malware, regardless of the identification/classification methods used, affects many victims destructively. Since the number of malicious programs and the damage they cause to institutions increase every day, it is crucial to map malware behavior, which malware family identification provides, so that security researchers and incident responders can speed up recognition and mitigation.
There are two main approaches used to detect malware. One of them is the signature-based malware detection method. The signatures, sequences of bytes created using static, dynamic, or hybrid methods, are stored uniquely in a database. Whether a given file is malware is determined by looking up the file's unique signature in this predefined database [6]. Although signature-based methods are the most commonly used technique in antivirus software, they cannot catch previously unidentified malware, because only a predefined list of known malware variants is kept [7].
The other main malware detection approach is the behavior-based method, which examines the behavior and characteristics of a given file, decides whether the file is malware, and, if so, also identifies the malware family the file belongs to. Although behavior-based methodologies demand far more effort, time, and storage, they can detect and classify unknown attacks better than signature-based methods [8].
In the report released by SonicWall, 268,362 of the malware samples detected in 2020 had never been seen before, a rise of 74% from the preceding year [5]. Considering the increasing number of unseen malware samples over the years, a behavior-based approach is more reasonable. This report indicates the significance of developing more innovative and effective malware defense mechanisms to detect and classify unknown malware.
The effectiveness of a malware defense mechanism is directly associated with the right choice of behavioral features extracted from malware. Several features can be extracted from malware due to its diverse nature, but obtaining adequate features is time-consuming, and learning becomes difficult for a model if some of the features used are non-distinctive [9]. In our study, both static and dynamic API call sequences are leveraged to classify malware families, since these sequences represent behavioral patterns for each sample. Given such sequences, machine learning becomes the primary choice for capturing the relationships between sequence elements for malware classification.
Different machine learning algorithms have been used in the literature for malware detection and classification [10, 7]. For sequence data, traditional machine learning models may not be sufficient, as the relations among API calls must be taken into account to successfully predict the malware families of unseen API call sequences. Current deep learning based models, mainly pre-trained transformer models, outperform traditional machine learning approaches for sequential text classification [11, 12].
In this paper, we answer the following research questions:
RQ.1: What are the suitable classification metrics for imbalanced datasets in multiclass malware classification?
RQ.2: What are the appropriate base models for multiclass malware classification based on API call sequences?
RQ.3: What are the effects of pre-processing API call sequences on the model results?
RQ.4: What are the effects of a tokenizer-based (word-piece) pre-trained transformer model (e.g., BERT) and a tokenizer-free transformer model (e.g., CANINE) on our model results?
RQ.5: What is the effect of an ensemble of the pre-trained transformer models, BERT and CANINE, based on bagging, for imbalanced multiclass malware classification?
Our main contributions through this study, in the light of the answers to our research questions, can be summarized as follows:
1. We noticed inconsistent evaluation results caused by a logical error in the code of a published article [13].
2. To the best of our knowledge, this study is the first to use the pre-trained CANINE transformer model in the malware domain.
3. Again, to the best of our knowledge, a bagging-based ensemble of pre-trained transformer models is used for the first time in malware classification.
4. Our proposed Random Transformer Forest (RTF) model surpasses the state-of-the-art results obtained in malware classification.
5. We achieve a state-of-the-art result on one of the well-known API call datasets in the literature [14] with our proposed RTF model.
This paper is structured as follows: Section 2 presents the related work. The description of the datasets, base models, pre-trained models, and our proposed model are presented in Section 3. The test results are discussed and compared with the related studies in Section 4, and lastly, the conclusion and future work are given in Section 5.
2 Related Work
Cybercriminals leverage malware to exploit any device or system to steal sensitive data and hence cause enormous problems for victims. Analyzing and classifying incoming malware helps us define the problem and understand how to recover from the damage as quickly as possible.
Two techniques are most commonly used in malware analysis: static analysis and dynamic analysis. Static analysis examines the given malware without running it. In dynamic analysis, by contrast, the malware file is executed in an isolated environment to avoid harming the computer system.
Malware developers may implement various techniques to evade detection mechanisms, such as code obfuscation, dynamic code loading, polymorphism, and metamorphism. For instance, MD5 hash based detection can easily be bypassed by malware authors with the methods mentioned above. As these methods change the binary of the file, they also change its hash. While the hash of the malicious file changes and the file is classified as benign, the behavior of the file, and thus its effect, remains unchanged [7].
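A toy illustration of this weakness (the byte strings below are placeholders, not real malware): any trivial change to the binary yields an entirely different hash, so a signature database keyed on MD5 no longer matches, while the program's behavior is untouched.

```python
import hashlib

original = b"MZ\x90\x00malicious-payload"   # stand-in for a malicious binary
mutated = original + b"\x00"                # trivial padding changes the bytes

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(mutated).hexdigest())     # completely different signature
```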
Since dynamic analysis monitors and observes the execution of a given sample in an isolated environment, even malware written with code obfuscation techniques hardly eludes dynamic analysis, contrary to static analysis. This makes dynamic analysis more robust than static analysis [15].
However, performing dynamic analysis requires more time than static analysis, and organizations deal with millions of attacks a day. These shortcomings provide an excellent opportunity for machine learning to complement dynamic and static analysis, since machine learning can handle large volumes of data [16].
In the malware detection and classification process, understanding malware behavior is one of the substantial parts of detecting and classifying malware. Dynamic API calls are obtained by tracing the sequences of calls made by malware samples to operating system services, such as creating a file or allocating virtual memory. Static API calls, on the other hand, are extracted from the portable executable (PE) format of the executable files; they are unordered, as the sequences of calls are not traced, unlike dynamic API calls [17]. In general, since API call sequences generate specific behavioral patterns and hence represent malware families, they can be considered one of the most distinguishing features among malware families [18, 19].
Related studies about base models for malware analysis using API calls and transformer-based models on sequence problems will be examined respectively in the rest of this section.
2.1 Base Models for Malware Analysis using API Calls
In [20], the authors used DNA sequence alignment algorithms, Multiple Sequence Alignment (MSA) and Longest Common Subsequence (LCS), to extract the most critical API call sequence patterns among different malware families and to generate a signature-based malware detection mechanism that determines whether a program is malware based on these extracted patterns. The API call sequences determined by the MSA and LCS algorithms can be misleading for the model if sequences grow longer than a preset API call sequence length.
In [21], the authors proposed a model using text mining and topic modeling for the feature extraction and selection processes based on API call sequences. The machine learning based Group Method of Data Handling (GMDH) and traditional machine learning models, Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), are compared on two different datasets. Although the DT and SVM models performed best and the authors suggest DT for a malware detection expert system, the size of the datasets is inadequate to rely on the models.
In [22], the authors integrated Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) layers into one neural network architecture. With this model, they achieved an accuracy of 89.4%, a precision of 85.6%, and a recall of 89.4% when classifying malware families. Newly generated subsequences of the original API call sequences are given to the model as input: in each API call sequence, if the same API call is repeated more than twice in a row, only two consecutive identical API calls are kept in the resulting sequence. Since their corpus contains only 60 different API calls, they did not set any boundaries; otherwise, they might have had to set a predetermined length to avoid tracking loops and to keep the model less complex.
In [23], a two-stage Deep Neural Network is applied for malware detection: a CNN is used to classify feature images extracted with a Long Short-Term Memory (LSTM) model from API calls. Although the authors achieved an Area Under Curve (AUC) score of 96%, the score may be misleading since the dataset is relatively small.
In [24], the authors leveraged N-grams and Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction and selection, respectively. The proposed LSTM model is used for binary classification (benign or malware) based on API call sequences. The authors reached a 92% accuracy score on unseen test API call sequences.
In [25], the authors trained two different LSTM networks on system call sequences for both malware and benign Android applications, respectively. The new sequence has been classified by comparing two similarity scores obtained from two different LSTM networks. The LSTM model has been compared with two n-gram models, MALINE and BMSCS, based on accuracy, precision, recall, and False Positive Rate (FPR). They have shown that the LSTM model outperformed MALINE and BMSCS models.
In [26], the authors tried several models, from LSTM to traditional machine learning models (RF, DT, SVM, and K-Nearest Neighbour (KNN)), on a dataset of 7,107 API call sequence samples generated by them. In multiclass classification, they achieved their highest F1-score, 47%, with a single-layer LSTM model, compared to a two-tier LSTM and common machine learning models.
In [27], the authors generated new API call sequences by applying data pre-processing steps to the benchmark dataset [14]. If any API call was repeated consecutively in a sequence, they kept only one occurrence and removed the repetitions; similarly, they removed consecutively repeated sub-sequences of length two or three. They proposed two RNN-based models, LSTM and Gated Recurrent Unit (GRU), and significantly increased their precision, recall, and F1-score after data pre-processing.
In [28], similar to the feature pre-processing method applied in previous work [22], the authors prepared new API call sequences and kept only the first 100 non-consecutive calls to avoid repeating API call loops. With their proposed model, based on Deep Graph Convolutional Neural Networks (DGCNNs), they achieved AUC scores and F1-scores similar to LSTM.
As seen in the studies mentioned above, the LSTM model is widely used as the underlying architecture for malware detection and classification based on API calls.
2.2 Transformer Based Models
This section is structured as follows: Section 2.2.1 introduces the pre-trained transformer models. Section 2.2.2 presents the related work relevant to the transformer based models on sequence problems.
2.2.1 Pre-trained Transformer Models
With the increasing use of deep learning approaches, the number of model parameters, and hence the need for much larger datasets to train them, has increased. Since constructing a large labeled dataset is time-consuming and requires extremely expensive annotation, contrary to constructing a large unlabeled dataset, pre-trained models have gained importance. Models pre-trained on huge unlabeled text data learn good universal representations, which are then used to fine-tune the model on downstream tasks [29]. Moreover, pre-trained transformer models have reached state-of-the-art results, as they can capture dependencies over a wide range of scales, unlike convolutional and recurrent networks [30, 31]. The pre-training techniques leveraged to capture these dependencies are elaborated in Sections 3.5.1 and 3.5.2.
2.2.2 Transformer Based Models on Sequence Problems
In [32], the authors evaluated several deep learning models, from widely used architectures such as CNN, RNN, LSTM, GRU, and BiLSTM with GloVe and fastText embeddings to pre-trained transformer models, for multiclass text classification. For their experiment, they utilized the highly imbalanced RCV1-v2 dataset, which contains 800,000 news stories, and showed that transformer models outperform traditional deep learning models in terms of F1-score.
In [33], the authors used the BERT model for the cyber-bullying detection task. The proposed model has been tested on three different datasets taken from Twitter, Wikipedia, and FormSpring. Compared to common machine learning models, SVM and Logistic Regression (LR), and deep-learning based models, CNN, RNN + LSTM, and Bidirectional LSTM (BiLSTM), they have achieved higher F1-scores.
In [34], the authors generated word embeddings for each opcode of malware samples using Word2Vec and BERT. They classified malware with different classifiers, such as LR, SVM, and MLP, to see the effect of the different word embeddings. With the same input parameters and the same set of classifiers, BERT embeddings achieved higher classification accuracy across five almost balanced malware families.
In [35], the authors proposed a transformer-based architecture for the detection and classification of malware using opcode sequences of Windows executable files. The proposed transformer model achieved better results than the Gradient Boosting Method (GBM) and BiLSTM in terms of accuracy, precision, recall, and F1-score.
In [36], the authors proposed Malbert, a transformer model first pre-trained on 15,000 malware and 15,000 benign samples [20] to learn the relationships among API calls. This model and the existing pre-trained BERT-base-uncased model are fine-tuned on two different datasets for malware detection. The pre-trained transformer models achieved higher results than the LSTM model and traditional machine learning models on the commonly used evaluation metrics: accuracy, precision, recall, and F1-score.
In [37], the authors proposed a model, called CyberBert which uses bidirectional transformer architecture for two different tasks, session-based recommendation, and malware classification based on API calls. Compared to the LSTM model and transformer-encoder, a unidirectional model, they have achieved higher F1-scores for binary and multiclass classification with CyberBert.
In [38], the authors leveraged a pre-trained BERT transformer model for malware detection (malware vs. benign) using the Android operating system API calls made by an application. Their experiments show that the BERT model obtained state-of-the-art results compared to the LSTM model on sequence classification.
3 Methodology
In the methodology section, firstly, the datasets used in experiments are introduced. Secondly, the most suitable evaluation metrics for highly imbalanced datasets are specified. Then, base model structures, the effect of the pre-processing method, pre-trained transformer models, CANINE and BERT, and the proposed RTF model architectures are explained.
3.1 Datasets
To verify how effectively a model classifies malware, it is necessary to test the model on different malware datasets. Since malware constantly evolves, assessing the effectiveness of proposed models requires working on up-to-date malware data; comparing several models on a single dataset of outdated malware samples and highlighting one model might not be reliable. Thus, four different datasets containing API call sequences of malware samples and their corresponding malware families are utilized to evaluate the models used in this study. Of these four datasets, Catak and Oliveira are created with dynamic API calls, whereas VirusSample and VirusShare are constructed with static API calls.
3.1.1 Dynamic API Call Datasets
3.1.1.1 Catak Dataset
In this dataset's study [14], sequences of Windows operating system API calls were obtained within the Cuckoo Sandbox isolated environment for each malware file. Malware family labels were determined using the unique hash code of each malware sample on the VirusTotal website. In total, 7,107 samples were created, each containing the hash code of the malware, its Windows operating system API call sequence, and its malware family class.
3.1.1.2 Oliveira Dataset
In this dataset, 42,797 malware and 1,079 benign API call sequences were obtained via Cuckoo Sandbox for dynamic malware analysis. Instead of whole API call sequences, the first 100 non-consecutive API calls were extracted from the parent processes to reduce complexity and detect malicious patterns as quickly as possible. The generated dataset, containing hash codes, a label (malware or benign), and 100 non-consecutive API calls for each sample, has been used for binary malware classification [28].
Since we work on a multiclass classification problem, the malware families of the 42,797 malware samples were determined through VirusTotal. Of these, 2,081 were labeled "unknown" by VirusTotal. Several malware families hold a small number of samples, fewer than 100; these samples were removed since they could mislead the models. Thus, the dataset in question has been turned into a multiclass classification case: the compiled dataset used in our study consists of 40,566 malware samples with their API call sequences and malware families.
3.1.2 Static API Call Datasets
3.1.2.1 VirusShare
Malware samples obtained from VirusShare are represented by their unique hash codes. Each unique hash code in the text files is passed to VirusTotal to learn the corresponding malware family, and the Python module PEfile is leveraged to extract API calls from each malware sample. Lastly, malware families having fewer than 100 samples are removed.
Thus, 13,849 malware samples with their corresponding API call sequences and malware families are obtained [39].
3.1.2.2 VirusSample
Malware samples taken from VirusSample are kept with their unique hash codes in text files. The corresponding malware families and API calls are obtained from the VirusTotal site and the PEfile module, respectively, as for VirusShare. Finally, malware families having fewer than 100 samples are removed from the dataset.
Therefore, 9,732 malware samples with their corresponding API call sequences and malware families are obtained [39].
Since the malware samples in these two datasets are the most up-to-date API call data, we also have the opportunity to test our models on recent malware samples. Table 1 shows the number of samples per malware family in each dataset.
| Malware Family | Oliveira | VirusShare | Catak | VirusSample |
|---|---|---|---|---|
| Trojan | 31,979 | 8,919 | 1,001 | 6,153 |
| Virus | 102 | 2,490 | 1,001 | 2,367 |
| Adware | 5,444 | 908 | 379 | 222 |
| Backdoor | 135 | 510 | 1,001 | 447 |
| Downloader | 1,948 | 218 | 1,001 | N/A |
| Worms | N/A | 524 | 1,001 | 441 |
| Agent | 220 | 165 | N/A | 102 |
| Ransomware | 404 | 115 | N/A | N/A |
| Dropper | 118 | N/A | 891 | N/A |
| Riskware | 216 | N/A | N/A | N/A |
| Spyware | N/A | N/A | 832 | N/A |
| Total | 40,566 | 13,849 | 7,107 | 9,732 |
3.2 RQ.1-) What are the suitable classification metrics for imbalanced datasets in multiclass malware classification?
The degree of imbalance may vary within different domains. One of these domains is malware, as specific malware families dominate cyber attacks in particular periods.
According to the report released by Malwarebytes, the total number of Trojans detected by Malwarebytes in 2018 was almost 26 times higher than the total number of Worms. The total number of Riskware tool detections in 2019 was 6,632,817, a decrease of 35% compared to the previous year. In another chart showing the number of detections of malware families by month, the number of Trojan attacks increased dramatically at the beginning of 2019 with the spread of Emotet, one of the advanced Trojan campaigns of that period [40].
These situations demonstrate that there could be significant differences in the distribution of malware families according to years or even months. Therefore, when the collected malware is classified according to their families, the distribution will vary according to the malware type prevailing at the time of collection and hence lead to imbalance.
For these reasons, almost all of the datasets belonging to malware have an imbalanced class distribution in the literature. Thus, we are required to leverage the most suitable metrics to evaluate our classification performance on imbalanced datasets.
The datasets leveraged in our study have highly imbalanced class distribution as expected and shown in Table 1.
Evaluation metrics are one of the crucial steps in assessing model performance. An incorrectly chosen evaluation metric can make a poorly performing algorithm seem effective. The metrics used to evaluate model performance may differ between balanced and imbalanced datasets. For instance, using the accuracy metric on a balanced dataset may provide an objective evaluation, yet it may not be the right choice for an imbalanced dataset, as it is biased toward the majority class. Take malware classification as an example: assume there are six different classes in the dataset and 95% of the samples belong to Trojan. A dummy model that predicts every sample in the test data as Trojan will achieve an accuracy score of 95% even though it does not predict any other class correctly, as the sketch below illustrates. For these reasons, it may not always be correct to use the most widely preferred evaluation metric without examining the class distribution of the dataset.
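As a concrete illustration, the following minimal sketch (with synthetic labels rather than one of our datasets) shows how a majority-class dummy baseline earns a deceptively high accuracy while the macro-averaged F1-score exposes it:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 95% Trojan (class 0), 1% each for five other families.
rng = np.random.default_rng(0)
y = rng.choice(6, size=2000, p=[0.95, 0.01, 0.01, 0.01, 0.01, 0.01])
X = np.zeros((len(y), 1))  # features are irrelevant to a dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

print(accuracy_score(y, pred))             # ~0.95, looks deceptively good
print(f1_score(y, pred, average="macro"))  # ~0.16, reveals the majority bias
```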
Recently, researchers have started to use the Matthews Correlation Coefficient (MCC) to evaluate model performance on imbalanced datasets. Although this metric was previously used mostly in biomedical research, it is now used in many areas, including malware classification [41, 42]. Yet, according to experimental results investigating the behavior of the MCC metric, MCC is not suitable for direct application to imbalanced datasets [43]. Empirical research conducted on 54 imbalanced datasets demonstrates that the AUC score is more discriminating than MCC [44]. Many metrics frequently used to assess model performance in multiclass classification have been shown to be inadequate on imbalanced datasets, such as Precision, Recall, MCC, Confusion Entropy (CEN), Classes Average Accuracy (AvAcc), and Class Balanced Accuracy (CBA) [45].
Although the choice of the right metric is still an open issue, following our search for the most suitable metrics for multiclass classification on imbalanced datasets, we have used AUC, a summary of the Receiver Operating Characteristic (ROC) probability curve based on FPR and True Positive Rate (TPR), as an evaluation metric. In addition, the F1-score has been used to be comparable with studies conducted on one of the well-known datasets in the literature [26].
Equations (1) and (2) define the recall and precision metrics, respectively. Equation (3) defines the F1-score in terms of precision and recall, together with its explicit form in terms of True Positives (TP), False Negatives (FN), and False Positives (FP).

$$\text{Recall} = \frac{TP}{TP + FN} \quad (1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \quad (3)$$
3.3 Base Models
LSTM and a Transformer architecture with a single transformer block are leveraged as base models for malware classification, addressing RQ.2. In the rest of the paper, this single-block architecture is referred to as the Transformer model.
3.3.1 LSTM Based Malware Classification
Recurrent Neural Networks (RNNs) have a structure that uses a recurrent relation: the same step is performed for each element of a sequence, with the current output depending on the previously computed hidden state. In these structures, information is retained through the previous hidden state, and processing continues in time steps. The recursion between sequence elements hinders parallelization during the training phase and consequently causes longer training runtimes.
LSTM can learn relatively long-term dependencies compared to other RNNs because it provides deeper processing of hidden states through specific units. This situation causes an increase in the number of parameters used for training. Besides, since LSTM has a recursive structure like other RNNs inherently and hence can not be trained in parallel, the training period may take relatively longer compared to other RNNs [46].
Since our purpose is to classify malware families, the fully connected layer output, which captures the information from the LSTM network, is fed to a softmax layer for multiclass classification.
The standard LSTM network is preferred as one of the base models for comparison since it has been used widely as a base network and performed successfully for several malware classification problems using API call sequences [47].
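As a minimal sketch of this baseline (the vocabulary size, embedding dimension, hidden size, and class count below are illustrative assumptions, not the exact hyperparameters of our experiments), the Keras model can be written as:

```python
from tensorflow.keras import layers, models

def build_lstm(vocab_size=400, n_classes=8):
    """Standard LSTM baseline: embedding -> LSTM -> dense -> softmax."""
    model = models.Sequential([
        layers.Embedding(vocab_size, 64),              # API-call embeddings
        layers.LSTM(128),                              # recurrent encoder
        layers.Dense(64, activation="relu"),           # fully connected layer
        layers.Dense(n_classes, activation="softmax"), # family probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```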
3.3.2 Transformer Model
The Transformer is a network architecture designed in 2017 to overcome the deficiencies of sequence-to-sequence neural network approaches such as LSTM and RNN for sequence modeling and transduction problems [48].
[Figure 1: Transformer model architecture utilized in this study.]
Since transformer-based architectures avoid recursion, they overcome the parallelization problem that both LSTM and other RNNs suffer from. In traditional sequence-to-sequence architectures, information coming from the previous hidden state is processed recursively to capture dependencies. In contrast, since transformer models refrain from recurrence and convolution, positional encodings are used to preserve the order of the sequence and provide position-related information for the tokens. Transformer models leverage the attention mechanism to capture and preserve long-term dependencies in sequences processed as a whole; positional information is retained with attention layers instead of recurrent and convolutional layers [48]. Figure 1 shows the Transformer model architecture utilized in this study.
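A compact sketch of such a one-transformer-block classifier in Keras, combining token embeddings, learned positional embeddings, multi-head self-attention, and a feed-forward sub-layer, is shown below; all dimensions are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding summed with a learned positional embedding."""
    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.tok = layers.Embedding(vocab_size, embed_dim)
        self.pos = layers.Embedding(seq_len, embed_dim)
        self.seq_len = seq_len

    def call(self, x):
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return self.tok(x) + self.pos(positions)

def build_transformer(vocab_size=400, seq_len=200, n_classes=8,
                      embed_dim=64, num_heads=2, ff_dim=64):
    inputs = layers.Input(shape=(seq_len,))
    x = TokenAndPositionEmbedding(seq_len, vocab_size, embed_dim)(inputs)
    # Single transformer block: multi-head self-attention + feed-forward,
    # each followed by a residual connection and layer normalization.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    ff = layers.Dense(embed_dim)(layers.Dense(ff_dim, activation="relu")(x))
    x = layers.LayerNormalization(epsilon=1e-6)(x + ff)
    x = layers.GlobalAveragePooling1D()(x)   # condense the sequence to a vector
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```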
3.4 Pre-processing Method on Datasets
Each API call sequence is pre-processed similarly to the steps in [27]. The pre-processing consists of three main steps.
In the first step, any API call repeated more than once in a row is collapsed, so that the resulting sequence contains no consecutive duplicates of the same API call. In the second and third steps, respectively, immediately repeated binary and triple sub-sequences are removed from the sequence created by the first step. Figure 2 shows the pre-processing step outputs for randomly given sequences; a code sketch of these steps follows the figure.
[Figure 2: Pre-processing step outputs for randomly given sequences.]
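A minimal sketch of the three steps in Python follows; the function names are ours, and this is our reading of the procedure in [27], not the authors' original code:

```python
def collapse_runs(seq):
    """Step 1: keep a single occurrence of consecutively repeated API calls."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return out

def collapse_ngram_runs(seq, n):
    """Steps 2-3: drop a length-n sub-sequence whenever it immediately repeats."""
    out = []
    for call in seq:
        out.append(call)
        while len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
            del out[-n:]
    return out

def preprocess(seq):
    seq = collapse_runs(seq)            # remove consecutive duplicates
    seq = collapse_ngram_runs(seq, 2)   # remove repeated binary sub-sequences
    return collapse_ngram_runs(seq, 3)  # remove repeated triple sub-sequences

# Example: a-a-b-a-b-c collapses to a-b-c.
print(preprocess(["ldrloaddll", "ldrloaddll", "ntcreatefile",
                  "ldrloaddll", "ntcreatefile", "regopenkey"]))
```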
These pre-processing steps were not applied to the Oliveira dataset, as it is already pre-processed and limited to the first 100 non-consecutive API calls. On the other hand, although the pre-processing steps were applied to the VirusSample and VirusShare datasets as well, only one of the 9,732 VirusSample samples and only two of the 13,849 VirusShare samples were affected, and in those samples only two or three API calls changed. These results show that, as expected, static API calls do not include noisy and redundant API calls, contrary to dynamic API calls. Thus, we have continued with the original API call sequences of the VirusSample and VirusShare datasets. After performing the pre-processing steps on the Catak dataset, only 11 API call sequences remained unchanged. The effect of the pre-processing steps on the Catak dataset can easily be seen in Figure 3 below.
[Figure 3: API call sequence lengths of the Catak dataset before and after pre-processing.]
After pre-processing, the number of samples whose API call sequences are shorter than 200 more than doubled, and the number of samples with sequences longer than 200 dropped by more than half. These changes clearly show the effect of the pre-processing steps on the Catak dataset. Finally, after performing all the pre-processing steps on the Catak set only, the effect of pre-processing is examined through model performance.
3.5 CANINE and BERT
Large-scale pre-trained models have recently become very popular in the field of artificial intelligence. Because the model was previously trained on large-scale data, the captured information can be transferred to specific tasks by fine-tuning the pre-trained model [49]. This study utilizes two different pre-trained model architectures, BERT and CANINE.
3.5.1 BERT
BERT, Bidirectional Encoder Representations from Transformers, is a language model that uses the transformer architecture and is pre-trained on Wikipedia and the BookCorpus of unlabelled text [50].
BERT preserves semantic content thanks to the masked language modeling (MLM) and next sentence prediction (NSP) unsupervised tasks, which enable it to generate deep bidirectional representations during pre-training. BERT uses WordPiece tokenization to create a token vocabulary consisting of learned representations of the words. Figure 4 shows the BERT architecture in detail.
3.5.2 CANINE
CANINE, Character Architecture with No tokenization In Neural Encoders, is a tokenizer-free pre-trained encoder model designed to overcome the shortcomings of tokenization processes such as word-piece and sentence-piece tokenization [51]. For example, a pre-trained model that uses a specific tokenization may not be convenient for specialized domains: in [52], it is shown that a word-piece based pre-training strategy is less well-suited than a character-piece strategy when fine-tuned on medical data.
Similar to BERT [50], CANINE is pre-trained on the MLM and NSP tasks. Unlike commonly used pre-trained models, CANINE uses neural encoders that encode the sequence of characters directly, with sub-words optionally used as a soft inductive bias, instead of performing explicit tokenization on the input data. The CANINE structure is shown in Figure 5.
In general, the BERT model is widely used by many studies for malware classification, as shown in the Related Work. We have assumed that the tokenization-free strategy used by the CANINE model might be well-suited for API calls since an API call such as ’ldrloaddll’ may not be appropriate for word-tokenization. Therefore, we have included the CANINE model in our study.
Thus, BERT, CANINE-C (Pre-trained with autoregressive character loss), and CANINE-S (Pre-trained with subword loss) pre-trained transformer models are leveraged regarding the RQ.4.
[Figure 4: BERT architecture.]
[Figure 5: CANINE architecture.]
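As a minimal sketch of how both kinds of checkpoints can be obtained from HuggingFace and applied to an API call sequence (the class count and the space-joined input format are our assumptions, not fixed by the datasets):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_FAMILIES = 8  # assumed number of malware family labels

# Word-piece based BERT.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_FAMILIES)

# Tokenization-free CANINE-S (subword loss); "google/canine-c" would
# select the character-loss variant instead.
canine_tok = AutoTokenizer.from_pretrained("google/canine-s")
canine = AutoModelForSequenceClassification.from_pretrained(
    "google/canine-s", num_labels=NUM_FAMILIES)

seq = "ldrloaddll ntcreatefile regopenkeyexa"
for tok, model in [(bert_tok, bert), (canine_tok, canine)]:
    batch = tok([seq], truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # one row of family logits per sequence
    print(logits.shape)                 # torch.Size([1, 8])
```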
3.6 Proposed Model: Random Transformer Forest (RTF)
Several important studies show the success of different ensemble types of pre-trained transformer models, such as stacking and majority voting of heterogeneous pre-trained transformers, on various downstream tasks [53, 54, 55]. Unlike these ensemble models, the Random Transformer Forest (RTF) is a bagging-based ensemble model inspired by the Random Forest (RF) machine learning model [56]. Similar to RF, an ensemble of pre-trained transformer models is assumed to increase classification performance on highly imbalanced malware datasets compared to a single pre-trained transformer model [57, 58].
The training phase requires creating N different training subsets from the original training set using the bootstrap sampling method. Due to the highly imbalanced class distribution, the malware class distribution of the original training set must be preserved in the resampling step. After resampling, each training subset is used to fine-tune a pre-trained transformer model. Each pre-trained transformer model has the same structure, so the base estimators are homogeneous.
In the testing phase, each fine-tuned transformer model takes a given malware API call sequence, and the class probabilities of the fine-tuned models are aggregated by averaging. For the majority voting, the final prediction is the malware family that receives the highest probability from the aggregation step. Figure 6 shows the structure of the RTF model; a sketch of this procedure follows the figure.
[Figure 6: Structure of the RTF model.]
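The following sketch outlines the RTF procedure under the assumption of generic `fine_tune` and `predict_proba` helpers standing in for the HuggingFace fine-tuning and prediction code; it illustrates the bagging scheme, not our exact implementation:

```python
import numpy as np
from sklearn.utils import resample

def train_rtf(X_train, y_train, n_estimators, fine_tune):
    """Fine-tune n_estimators copies of a pre-trained transformer,
    each on a stratified bootstrap sample of the training set."""
    estimators = []
    for seed in range(n_estimators):
        # Bootstrap sample that preserves the malware family distribution.
        X_b, y_b = resample(X_train, y_train, replace=True,
                            stratify=y_train, random_state=seed)
        estimators.append(fine_tune(X_b, y_b))
    return estimators

def predict_rtf(estimators, X_test, predict_proba):
    """Average the class probabilities of all fine-tuned models and
    take the argmax as the ensemble's majority decision."""
    probs = np.mean([predict_proba(m, X_test) for m in estimators], axis=0)
    return probs.argmax(axis=1)
```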
4 Experiment and Results
The experimental setup and the conducted experiments are described below, organized around the predetermined research questions, except for RQ.1, which was elaborated in the Methodology section.
4.1 Experimental Setup
We have utilized Google Colab Pro+ for our experiments, working on a Tesla T4 GPU with 51 GB of available RAM provided by the platform. The Keras framework [59] is used for the base model comparison. The PyTorch framework [60] is leveraged, with pre-trained models taken from HuggingFace [61], for the pre-trained transformer models and the RTF model. All Jupyter notebook files containing the source code for the research questions, from the base model comparison to the RTF model, can be found in the GitHub repository: https://github.com/Ferhat94/Random-Transformer-Forest.
4.2 RQ.2-) What are the appropriate base models for multiclass malware classification based on API call sequences?
In this part of the study, the Transformer and LSTM models are leveraged as base architectures for comparison on the four datasets.
All datasets used to evaluate the base model performances are divided into three parts: training, validation, and testing. 20% of each original dataset is allocated for testing. The split is performed in a stratified way, since preserving the class distribution is one of the crucial steps on highly imbalanced datasets. Considering the imbalanced class distribution, we have leveraged the class weight approach to give different weights to the majority and minority classes, so that the training algorithms take class imbalance into account.
A stratified 10-fold strategy is applied to the training data of each dataset, with 10% of the training data used for validation in each iteration. Thus, we guarantee that each fold has the same distribution of malware families and that every sample has a chance of appearing in both the training and validation data. The standard deviation and mean of the 10 validation results are calculated for each evaluation metric used in our study, along with the training runtime, for robust interpretation. A sketch of this setup is given below.
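A sketch of this evaluation setup with scikit-learn, using synthetic stand-ins for the encoded sequences and labels, could look as follows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-ins for the encoded API call sequences and family labels.
X = np.random.rand(1000, 200)
y = np.random.choice(["trojan", "virus", "adware", "backdoor"],
                     size=1000, p=[0.7, 0.15, 0.1, 0.05])

# Stratified 80/20 train-test split preserves the family distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Class weights counter the imbalance; Keras accepts such a dict via
# model.fit(..., class_weight=class_weight) when labels are integer-encoded.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# Stratified 10-fold cross-validation on the training data.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # ... fit the base model on (X_tr, y_tr), validate on (X_val, y_val)
```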
We have provided a dummy classifier as a simple baseline, since the class distribution of each dataset is imbalanced, using the "most frequent" strategy. The base model comparison results on each dataset are shown in Tables 2-5.
| Base Model | F1-score | AUC score | Training Time (s) |
|---|---|---|---|
| Validation Mean Scores | | | |
| LSTM | 0.4873 ± 0.0126 | 0.7887 ± 0.0128 | 68.62 ± 11.44 |
| Transformer | 0.5689 ± 0.0578 | 0.8676 ± 0.0329 | 11.74 ± 3.43 |
| Test Scores | | | |
| LSTM | 0.4638 | 0.7885 | |
| Transformer | 0.5042 | 0.8246 | |
| Dummy | 0.0308 | 0.5000 | |
| Base Model | F1-score | AUC score | Training Time (s) |
|---|---|---|---|
| Validation Mean Scores | | | |
| LSTM | 0.5570 ± 0.0182 | 0.8853 ± 0.0142 | 111.99 ± 20.11 |
| Transformer | 0.5792 ± 0.0379 | 0.9280 ± 0.0142 | 33.48 ± 7.91 |
| Test Scores | | | |
| LSTM | 0.5637 | 0.8844 | |
| Transformer | 0.5650 | 0.8855 | |
| Dummy | 0.0980 | 0.5000 | |
| Base Model | F1-score | AUC score | Training Time (s) |
|---|---|---|---|
| Validation Mean Scores | | | |
| LSTM | 0.7690 ± 0.0419 | 0.9656 ± 0.0110 | 21.80 ± 4.04 |
| Transformer | 0.8070 ± 0.0323 | 0.9885 ± 0.0055 | 13.26 ± 3.38 |
| Test Scores | | | |
| LSTM | 0.7531 | 0.9701 | |
| Transformer | 0.7548 | 0.9680 | |
| Dummy | 0.1291 | 0.5000 | |
| Base Model | F1-score | AUC score | Training Time (s) |
|---|---|---|---|
| Validation Mean Scores | | | |
| LSTM | 0.7121 ± 0.0231 | 0.9274 ± 0.0130 | 96.79 ± 8.38 |
| Transformer | 0.7641 ± 0.0297 | 0.9700 ± 0.0182 | 31.76 ± 9.80 |
| Test Scores | | | |
| LSTM | 0.7071 | 0.9298 | |
| Transformer | 0.7125 | 0.9350 | |
| Dummy | 0.0980 | 0.5000 | |
Considering the two base models, LSTM and Transformer, the standard deviations of the mean validation scores are higher for the Transformer model. Even so, the Transformer achieves higher results on unseen test data.
The evaluation results demonstrate that the Transformer model is the more reasonable model to continue with, compared to the LSTM model, in terms of both evaluation scores and training time.
4.3 RQ.3-) What are the effects of pre-processing API call sequences on the model results?
The data pre-processing steps described in Section 3.4 are applied to the Catak dataset [14] only, as pre-processing has no effect on the VirusSample and VirusShare datasets [39], and the Oliveira dataset [28] is already pre-processed and limited to 100 non-consecutive API calls.
The original and pre-processed API call sequences of the Catak dataset have been compared using the LSTM and Transformer models.
Table 6 shows the comparison results of Original API call sequences and pre-processed API call sequences.
| Base Model | F1-score | AUC score | Training Time (s) |
|---|---|---|---|
| On Original API Call Sequences | | | |
| LSTM | 0.4638 | 0.7885 | 68.62 ± 11.44 |
| Transformer | 0.5042 | 0.8246 | 11.74 ± 3.43 |
| On Pre-processed API Call Sequences | | | |
| LSTM | 0.5020 | 0.8156 | 71.58 ± 12.27 |
| Transformer | 0.5106 | 0.8372 | 18.27 ± 8.74 |
The evaluation results demonstrate that pre-processing improves performance on the Catak dataset: both the AUC score and F1-score increased after the pre-processing steps for both the LSTM and Transformer models. Although we expected the training runtime to decrease after pre-processing, training on the original Catak dataset was shorter because the monitored evaluation metric, validation AUC, stopped improving early during the learning phase; after pre-processing, both the LSTM and Transformer models kept learning for longer. The new sequences generated by pre-processing are leveraged in the following experiments on the Catak dataset, both for the pre-trained models and for the RTF model.
4.4 RQ.4-) What are the effects of a tokenizer-based (word-piece) pre-trained transformer model (e.g., BERT) and a tokenizer-free transformer model (e.g., CANINE) on our model results?
In this part, due to the large number of parameters in the pre-trained models, 20% of the training data is allocated for validation instead of the stratified k-fold strategy. The CANINE model comes with two different pre-training strategies: sub-word loss (CANINE-S) and character loss (CANINE-C). In our experiments, both strategies are evaluated to see which performs better on each dataset, and only the best CANINE model, CANINE-C or CANINE-S, is included in the final results in Tables 8, 9, 10 and 11 for each individual dataset.
The comparison of the pre-trained models, BERT and CANINE, is presented at the end of the experiments together with the RTF model results, so that the comparison can be seen clearly.
4.5 RQ.5-) What is the effect of an ensemble of the pre-trained transformer models, BERT and CANINE, based on bagging, for imbalanced multiclass malware classification?
In the RTF model, N different training subsets, and thus N different base estimators, are used to fine-tune N copies of the pre-trained BERT or CANINE model. We have tried several combinations of N and pre-trained transformer models for each dataset; the combination providing the best scores is accepted as our RTF score. Table 7 shows the best combination for each dataset, and the model comparison results on each dataset are shown in Tables 8, 9, 10 and 11.
Dataset | Number of Base Estimators (N) | Pre-trained Model |
---|---|---|
Catak [14] | 6 | BERT |
Oliveira [28] | 2 | BERT |
VirusSample [39] | 10 | CANINE-S |
VirusShare [39] | 5 | CANINE-S |
Model | F1-score | AUC score |
---|---|---|
Transformer | 0.5106 | 0.8372 |
CANINE-S | 0.5633 | 0.8339 |
BERT | 0.5919 | 0.8735 |
RTF | 0.6149 | 0.8818 |
Model | F1-score | AUC score |
---|---|---|
Transformer | 0.5650 | 0.8855 |
CANINE-S | 0.4725 | 0.8636 |
BERT | 0.4839 | 0.8321 |
RTF | 0.4061 | 0.8714 |
Model | F1-score | AUC score |
---|---|---|
Transformer | 0.7548 | 0.9680 |
CANINE-C | 0.7893 | 0.9570 |
BERT | 0.7759 | 0.9690 |
RTF | 0.8059 | 0.9773 |
Model | F1-score | AUC score |
---|---|---|
Transformer | 0.7125 | 0.9350 |
CANINE-S | 0.7064 | 0.9286 |
BERT | 0.7145 | 0.9364 |
RTF | 0.7275 | 0.9513 |
According to the results, at least one of the pre-trained transformer models, CANINE or BERT, surpassed Transformer model, and the RTF model obtained the highest scores on three out of four datasets.
The per-class performance of our proposed RTF model on the VirusShare, VirusSample, and Oliveira datasets is shown in Figure 7. Its per-class performance on the Catak dataset is shown separately in Figure 8(b), as it is compared there with the original authors' confusion matrix (CM).
[Figure 7: Confusion matrices of the RTF model on the VirusShare, VirusSample, and Oliveira datasets.]
The results show that our proposed model is better at identifying minority classes on the VirusSample and VirusShare datasets than on the Oliveira dataset. Considering the false negative predictions on these three datasets, the model is prone to predicting some minority classes as the majority class, which is Trojan in these cases.
In our study, the proposed RTF architecture is compared with the base models (LSTM and Transformer) and the pre-trained transformer models (BERT and CANINE) on each dataset (Catak, Oliveira, VirusSample, and VirusShare). Besides, since the VirusShare and VirusSample datasets are newly published and the Oliveira dataset was transformed into a multiclass problem by us, we compare our approach with other approaches in the literature on the Catak dataset only.
In [13], a model is proposed for multiclass classification on the Catak dataset. In that study, all the samples in the Catak dataset are shuffled, and 80% of the dataset is allocated as the train set; then, all the samples are shuffled again, and 20% of the dataset is allocated as the test set. The logical error here is that the whole dataset is shuffled twice, independently, which causes the test part to contain samples that also fall into the train part; thus, the model may be tested on what it has already seen during training. We ran the exact train/test splitting script shared with us by the authors [13] from their GitHub repository (https://github.com/MattScho/MalwareClassificationCNN) and found that 1,117 of the 1,371 samples allocated to the test set intersect with the train set. For these reasons, the evaluation scores obtained in that article are not taken into account when comparing our results. The logical error has been reported to the article's authors.
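The following toy sketch reproduces the effect of that protocol (the authors' actual script differs; this only illustrates why two independent shuffles leak test samples into training):

```python
import numpy as np

# Stand-in sample indices, sized so that the 20% slice is 1,371 samples,
# matching the reported test split.
data = np.arange(6855)
rng = np.random.default_rng(1)

train = rng.permutation(data)[: int(0.8 * len(data))]  # first shuffle, 80%
test = rng.permutation(data)[int(0.8 * len(data)):]    # second shuffle, 20%

overlap = np.intersect1d(train, test)
print(len(overlap))  # ~1,100 of 1,371 test samples also appear in training
```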
To the best of our knowledge, the highest F1-score previously obtained on the Catak dataset for multiclass classification is 0.57 [27], compared to the baseline score of 0.47 obtained by the dataset creators [26]. Although the F1-score reported in [26] is 0.47, the F1-score calculated from the CM given in [26] is 0.41, as shown in Figure 8(a). As in [26], 20% of the original Catak dataset is allocated as unseen test data for the RTF experiments. In [26], the authors present the CM of their LSTM model on the unseen test set; we refer to it as the source CM, as this is the first study performed on the Catak dataset. Since only this study provides a CM, we compare it with the CM of our proposed RTF model on unseen test data.
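For reference, a small helper of the kind we used for this check, which recomputes the macro F1-score from any published confusion matrix via the per-class form of Eq. (3) (macro averaging is our assumption):

```python
import numpy as np

def macro_f1_from_cm(cm):
    """Macro F1 from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp          # class c missed by the model
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)  # Eq. (3), per class
    return f1.mean()
```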
Among the experimental studies conducted on the Catak dataset for multiclass classification [27, 37, 26], our proposed RTF model has surpassed them all and reached a state-of-the-art F1-score of 0.6149, as shown in Figure 8(b).
[Figure 8: (a) Confusion matrix of the LSTM model from [26] (source CM); (b) confusion matrix of our RTF model on the Catak test data.]
So far, model performances have been assessed using the F1-score and AUC score for the pre-trained models, BERT and CANINE, and the RTF model. However, besides these metrics, training time is one of the factors in evaluating model efficiency. Therefore, the training runtime results for BERT, CANINE, and RTF are given in Table 12.
Model | Oliveira | VirusShare | Catak | VirusSample |
---|---|---|---|---|
BERT | 141.02 | 26.25 | 28.63 | 18.73 |
CANINE | 58.27 | 11.12 | 12.03 | 7.77 |
RTF | 145.19 | 11.38 | 27.10 | 8.03 |
As seen in Table 12, the training runtime of the pre-trained transformer models and our proposed model has increased dramatically compared to the LSTM and Transformer models (Tables 2-5). The high training runtime stems from BERT and CANINE being large models pre-trained on huge text data.
Considering the security aspects of our study, a higher training time does not hurt reaction time, as these models are trained ahead of time by AV scanners. The essential concern for security researchers who need to act as soon as possible is therefore the inference time an AV scanner spends processing unseen data and making a prediction. The inference time for each model is given in Table 13.
Model | Oliveira | VirusShare | Catak | VirusSample |
---|---|---|---|---|
LSTM | 0.3130 | 0.1110 | 0.3710 | 0.1070 |
Transformer | 0.0944 | 0.0931 | 0.0945 | 0.0924 |
BERT | 4.5137 | 2.7555 | 4.9658 | 1.4297 |
CANINE | 4.2778 | 2.6898 | 4.4324 | 1.4224 |
RTF | 4.5316 | 2.6856 | 4.9524 | 1.4183 |
The inference times given in Table 13 show the processing and prediction time together for a single unseen observation.
4.6 Limitations
As described in Table 1, even though the four datasets used in our study cover different malware families, we have a total of only 11 unique malware families. We think 11 families are inadequate for comprehensively understanding the malicious behavioral characteristics of a malware sample. According to [62], malware taxonomy is divided into four categories, stealing, evasion, disruption, and modification, as these categories represent the main malware behaviors. Therefore, among the defined behavior types such as Trojan, Virus, and Ransomware, several types need to be expanded and refined.
5 Conclusion and Future Work
In this study, we have leveraged several deep learning models for highly imbalanced multiclass malware classification based on API calls, which are inherently sequence problems. We have assessed the performance of our models with AUC score and F1-score evaluation metrics as the four datasets used in this study are imbalanced.
Our evaluation results demonstrate that the Transformer model with one transformer block layer achieved slightly better results than the LSTM model. Moreover, the pre-trained transformer models, BERT and CANINE, outperformed the one-block Transformer architecture. On the other hand, the Transformer and LSTM models are noticeably faster than the pre-trained models in both training and inference time. However, considering that training time does not directly affect response time and that the inference-time differences are on the order of seconds, the RTF model is the more reasonable model to continue with.
The CANINE model has been used in the field of malware classification for the first time in this study. We have reached state-of-the-art results on the static API call datasets, VirusShare and VirusSample, with a bagging-based ensemble of the CANINE model, demonstrating the success of CANINE combined with the strength of RTF.
We have achieved a state-of-the-art F1-score of 0.6149 on the Catak dataset with the power of a bagging-based ensemble of the BERT model and pre-processing, since the pre-processing steps significantly increased our results on this dataset, which is built from dynamic API call sequences. This F1-score on the well-known Catak benchmark demonstrates the success of our proposed RTF model. In general, the proposed RTF model has obtained state-of-the-art results on three out of four datasets.
This study can be extended by integrating our proposed ensemble model with the AUC maximization paradigm [63]. We believe we may increase our results in this way.
References
- [1] J. Jang-Jaccard, S. Nepal, A survey of emerging threats in cybersecurity, Journal of Computer and System Sciences 80 (5) (2014) 973–993.
- [2] Ö. A. Aslan, R. Samet, A comprehensive review on malware detection approaches, IEEE Access 8 (2020) 6249–6271.
- [3] Mimecast, The state of email security, Tech. rep., Mimecast (2021).
- [4] Sophos, The state of ransomware 2021, Tech. rep., Sophos (2021).
- [5] SonicWall, Cyber threat report, Tech. rep., SonicWall (2021).
- [6] P. Shijo, A. Salim, Integrated static and dynamic analysis for malware detection, Procedia Computer Science 46 (2015) 804–811.
- [7] D. Ucci, L. Aniello, R. Baldoni, Survey of machine learning techniques for malware analysis, Computers & Security 81 (2019) 123–147.
- [8] D. Gibert, C. Mateu, J. Planes, The rise of machine learning for detection and classification of malware: Research developments, trends and challenges, Journal of Network and Computer Applications 153 (2020) 102526.
- [9] C. Jindal, C. Salls, H. Aghakhani, K. Long, C. Kruegel, G. Vigna, Neurlux: Dynamic malware analysis without feature engineering, in: Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 444–455.
- [10] R. Komatwar, M. Kokare, A survey on malware detection and classification, Journal of Applied Security Research 16 (3) (2021) 390–420.
- [11] Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, L. He, A survey on text classification: From shallow to deep learning, arXiv preprint arXiv:2008.00364 (2020).
- [12] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, J. Gao, Deep learning–based text classification: A comprehensive review, ACM Computing Surveys (CSUR) 54 (3) (2021) 1–40.
- [13] M. Schofield, Comparison of malware classification methods using convolutional neural network based on api call stream, International Journal of Network Security & Its Applications (IJNSA) 13 (2021).
- [14] F. O. Catak, A. F. Yazı, A benchmark api call dataset for windows pe malware classification, arXiv preprint arXiv:1905.01999 (2019).
- [15] O. Or-Meir, N. Nissim, Y. Elovici, L. Rokach, Dynamic malware analysis in the modern era—a state of the art survey, ACM Computing Surveys (CSUR) 52 (5) (2019) 1–48.
- [16] J. B. Fraley, J. Cannady, The promise of machine learning in cybersecurity, in: SoutheastCon 2017, IEEE, 2017, pp. 1–6.
- [17] W. Han, J. Xue, Y. Wang, L. Huang, Z. Kong, L. Mao, Maldae: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics, computers & security 83 (2019) 208–233.
- [18] Y. Ding, X. Xia, S. Chen, Y. Li, A malware detection method based on family behavior graph, Computers & Security 73 (2018) 73–86.
- [19] A. Fujino, J. Murakami, T. Mori, Discovering similar malware samples using api call topics, in: 2015 12th annual IEEE consumer communications and networking conference (CCNC), IEEE, 2015, pp. 140–147.
- [20] Y. Ki, E. Kim, H. K. Kim, A novel approach to detect malware based on api call sequence analysis, International Journal of Distributed Sensor Networks 11 (6) (2015) 659101.
- [21] G. G. Sundarkumar, V. Ravi, I. Nwogu, V. Govindaraju, Malware detection via api calls, topic models and machine learning, in: 2015 IEEE International Conference on Automation Science and Engineering (CASE), IEEE, 2015, pp. 1212–1217.
- [22] B. Kolosnjaji, A. Zarras, G. Webster, C. Eckert, Deep learning for classification of malware system call sequences, in: Australasian joint conference on artificial intelligence, Springer, 2016, pp. 137–149.
- [23] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, T. Yagi, Malware detection with deep neural network using process behavior, in: 2016 IEEE 40th annual computer software and applications conference (COMPSAC), Vol. 2, IEEE, 2016, pp. 577–582.
- [24] J. Mathew, M. A. Kumara, Api call based malware detection approach using recurrent neural network—lstm, in: International Conference on Intelligent Systems Design and Applications, Springer, 2018, pp. 87–99.
- [25] X. Xiao, S. Zhang, F. Mercaldo, G. Hu, A. K. Sangaiah, Android malware detection based on system call sequences and lstm, Multimedia Tools and Applications 78 (4) (2019) 3979–3999.
- [26] F. O. Catak, A. F. Yazı, O. Elezaj, J. Ahmed, Deep learning based sequential model for malware analysis using windows exe api calls, PeerJ Computer Science 6 (2020) e285.
- [27] C. Li, J. Zheng, Api call-based malware classification using recurrent neural networks, Journal of Cyber Security and Mobility (2021) 617–640.
- [28] A. Oliveira, R. Sassi, Behavioral malware detection using deep graph convolutional neural networks, TechRxiv (2019).
- [29] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models for natural language processing: A survey, Science China Technological Sciences (2020) 1–26.
- [30] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, M. Winslett, Compressing large-scale transformer-based models: A case study on bert, Transactions of the Association for Computational Linguistics 9 (2021) 1061–1080.
- [31] T. Lin, Y. Wang, X. Liu, X. Qiu, A survey of transformers, arXiv preprint arXiv:2106.04554 (2021).
- [32] N. E. Erciyes, A. K. Görür, Deep learning methods with pre-trained word embeddings and pre-trained transformers for extreme multi-label text classification, in: 2021 6th International Conference on Computer Science and Engineering (UBMK), IEEE, 2021, pp. 50–55.
- [33] S. Paul, S. Saha, Cyberbert: Bert for cyberbullying identification, Multimedia Systems (2020) 1–8.
- [34] J. L. Alvares, Malware classification with bert, Master’s thesis, San José State University (2021).
- [35] F. Nassar, N. Hubballi, Malware detection and classification using transformer-based learning, Ph.D. thesis, Discipline of Computer Science and Engineering, IIT Indore (2021).
- [36] Z. Xu, X. Fang, G. Yang, Malbert: A novel pre-training method for malware detection, Computers & Security 111 (2021) 102458.
- [37] S. McDonnell, O. Nada, M. R. Abid, E. Amjadian, Cyberbert: A deep dynamic-state session-based recommender system for cyber threat recognition, in: 2021 IEEE Aerospace Conference (50100), IEEE, 2021, pp. 1–12.
- [38] R. Oak, M. Du, D. Yan, H. Takawale, I. Amit, Malware detection on highly imbalanced data through sequence modeling, in: Proceedings of the 12th ACM Workshop on artificial intelligence and security, 2019, pp. 37–48.
- [39] B. Düzgün, A. Çayır, F. Demirkıran, C. N. Kayha, B. Gençaydın, H. Dağ, New datasets for dynamic malware classification, arXiv preprint arXiv:2111.15205 (2021).
- [40] Malwarebytes Labs, 2020 state of malware report, Tech. rep., Malwarebytes (2020).
- [41] M. Kim, D. Kim, C. Hwang, S. Cho, S. Han, M. Park, Machine-learning-based android malware family classification using built-in and custom permissions, Applied Sciences 11 (21) (2021) 10244.
- [42] A. N. Jahromi, S. Hashemi, A. Dehghantanha, K.-K. R. Choo, H. Karimipour, D. E. Newton, R. M. Parizi, An improved two-hidden-layer extreme learning machine for malware hunting, Computers & Security 89 (2020) 101655.
- [43] Q. Zhu, On the performance of matthews correlation coefficient (mcc) for imbalanced dataset, Pattern Recognition Letters 136 (2020) 71–80.
- [44] C. Halimu, A. Kasem, S. S. Newaz, Empirical comparison of area under roc curve (auc) and mathew correlation coefficient (mcc) for evaluating machine learning algorithms on imbalanced datasets for binary classification, in: Proceedings of the 3rd international conference on machine learning and soft computing, 2019, pp. 1–6.
- [45] P. Branco, L. Torgo, R. P. Ribeiro, Relevance-based evaluation metrics for multi-class imbalanced domains, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2017, pp. 698–710.
- [46] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
- [47] D. S. Berman, A. L. Buczak, J. S. Chavis, C. L. Corbett, A survey of deep learning methods for cyber security, Information 10 (4) (2019) 122.
- [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
- [49] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, L. Zhang, W. Han, M. Huang, et al., Pre-trained models: Past, present and future, AI Open (2021).
- [50] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
- [51] J. H. Clark, D. Garrette, I. Turc, J. Wieting, Canine: Pre-training an efficient tokenization-free encoder for language representation, arXiv preprint arXiv:2103.06874 (2021).
- [52] H. E. Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, J. Tsujii, Characterbert: Reconciling elmo and bert for word-level open-vocabulary representations from characters, arXiv preprint arXiv:2010.10392 (2020).
- [53] M. Marcinczuk, Punctuation restoration with ensemble of neural network classifier and pre-trained transformers, Proceedings of the PolEval 2021 Workshop (2021) 47.
- [54] G. Morio, T. Morishita, H. Ozaki, T. Miyoshi, Hitachi at semeval-2020 task 11: An empirical study of pre-trained transformer family for propaganda detection, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1739–1748.
- [55] S. Malla, P. Alphonse, Covid-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets, Applied Soft Computing 107 (2021) 107495.
- [56] L. Breiman, Random forests, Machine learning 45 (1) (2001) 5–32.
- [57] A. Çayır, U. Ünal, H. Dağ, Random capsnet forest model for imbalanced malware type classification task, Computers & Security 102 (2021) 102133.
- [58] S. Kobayashi, J. von Oswald, B. Grewe, On the reversed bias-variance tradeoff in deep ensembles, in: ICML 2021 Workshop on Uncertainty and Robustness in Deep Learning, 2021.
- [59] F. Chollet, et al., Keras: The python deep learning library, Astrophysics Source Code Library (2018) ascl–1806.
- [60] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch (2017).
- [61] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Huggingface’s transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
- [62] A. R. A. Grégio, V. M. Afonso, D. S. F. Filho, P. L. d. Geus, M. Jino, Toward a taxonomy of malware behaviors, The Computer Journal 58 (10) (2015) 2758–2777.
- [63] Z. Yuan, Y. Yan, M. Sonka, T. Yang, Large-scale robust deep auc maximization: A new surrogate loss and empirical studies on medical image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3040–3049.