This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Clustering based opcode graph generation for malware variant detection

1st Fok Kar Wai Cybersecurity Strategic Technology Centre
ST Engineering
Singapore, Singapore
[email protected]
   2nd Vrizlynn L. L. Thing Cybersecurity Strategic Technology Centre
ST Engineering
Singapore, Singapore
[email protected]
Abstract

Malwares are the key means leveraged by threat actors in the cyber space for their attacks. There is a large array of commercial solutions in the market and significant scientific research to tackle the challenge of the detection and defense against malwares. At the same time, attackers also advance their capabilities in creating polymorphic and metamorphic malwares to make it increasingly challenging for existing solutions. To tackle this issue, we propose a methodology to perform malware detection and family attribution. The proposed methodology first performs the extraction of opcodes from malwares in each family and constructs their respective opcode graphs. We explore the use of clustering algorithms on the opcode graphs to detect clusters of malwares within the same malware family. Such clusters can be seen as belonging to different sub-family groups. Opcode graph signatures are built from each detected cluster. Hence, for each malware family, a group of signatures is generated to represent the family. These signatures are used to classify an unknown sample as benign or belonging to one the malware families. We evaluate our methodology by performing experiments on a dataset consisting of both benign files and malware samples belonging to a number of different malware families and comparing the results to existing approach.

Index Terms:
malware detection and attribution; malware family; clustering; opcode graph

I Introduction

Malware is often used as a means to gain an initial compromise of victim hosts commonly through activities such as social engineering and phishing schemes. Once targets are infected, a variety of malicious activities are furthered performed by the malwares. These activities include lateral movement to infect more targets in the same network, exfiltration or destruction of valuable data and the disruption of critical systems. The detection of malwares is therefore a must for protecting critical systems and hence is a widely researched topic. There is a large variety of research works and methodologies on the detection of malwares which have been introduced by various surveys [24, 16, 21, 39]. However, it has become increasingly challenging for these methods to have ideal detection performances due to the rapidly evolving malwares and growing pool of different malware families [11]. In the SonicWall’s 2021 Cyber Threat Report [12], their solutions found a total of 589,313 new malware variants in 2020. Overall, 74% more never-before-seen malware variants were identified in 2020 than in 2019.

Malware authors apply various types of techniques on the creation of malware variants in order to evade detection from the variety of defense mechanisms. Such obfuscation techniques include encryption, polymorphism and metamorphism which have been introduced in research works such as in [40]. In particular, metamorphism provides the capability to alter the internal structure of each malware, whilst being able to maintain the same functional outputs [28, 35, 15]. Thus it is possible to produce an infinite pool of malwares within the same family. These malwares achieve the same purpose but are not identical. Common signature-based solutions therefore become ineffective as each of these malwares have different signatures. This also makes it challenging for other types of methods such as dynamic and behaviour analysis to effectively detect these continuously morphing malwares that have evasive mechanisms during runtime.

Static analysis, which involves analyzing the file properties and structure, may also be relied upon to detect obfuscated malware. In this paper, our proposed methodology focuses on the analysis of one such static property, which is opcodes. Opcodes are machine instruction codes that define the operations that needs to be performed when the file is executed. Hence, it is a representation of a file’s structure and operations and is useful for malware detection.

In this work, we extract opcodes from malware samples, build opcode graphs using bigrams and perform graph similarity comparison for malware classification and family attribution. However differing from the above previous works and graph-based methods in particular, we integrate the use of clustering algorithms on the graphs belonging to samples within the same families. The purpose of this idea is to first detect potential clusters of sub-family samples as a preliminary step, as malwares belonging to the same family may be generated from different versions of the family codebase. Hence there may be some considerable differences in the structure of the malwares generated by different code versions within the same family. It is then meaningful to detect these different groups to be able to boost detection accuracy. Opcode bigrams belonging to clustered samples are used to build a single opcode graph which represents its signature. Each malware family is thus represented by a set of opcode graph signatures, with each signature belonging to a different sub-family or version. For an incoming unknown sample, its opgraph is extracted and compared against each signature to find the closest matching family for classification.

In summary, these are the contributions of our paper:

  1. 1.

    An opcode-based approach for malware detection and attribution is proposed. The method builds opcode graph signatures for each malware family, which can be used for similarity comparison against samples and classify them to one of the malware families.

  2. 2.

    Clustering algorithms are explored to perform the detection of clusters on samples belonging to the same malware family. The existence of different clusters of samples that have considerable difference in characteristics and therefore potentially belong to separate sub-families is demonstrated as a result.

  3. 3.

    The proposed methodology is evaluated on a dataset comprising of 2300 malware samples belonging to six different malware families and 580 benign file samples. Both binary classification experiments for general malware detection and multi-class classification experiments for family attribution are performed.

In the following section, we discuss the related works. In Section III, we present our methodology in detail and describe its various steps. Experiments and performance evaluation of the proposed method are presented in section IV. Finally, future improvements and conclusions are discussed in section V

II Related works

Vast research has been carried out in the areas of malware analysis due to the rampant growth of the issue and the critical need for defense mechanisms against malwares. Many research surveys have been published to provide the community with a comprehensive breakdown of features and techniques explored. [37, 39] are some recent surveys. Malware analysis can be broadly categorised into two categories, static analysis and dynamic analysis. In static analysis, extraction of features based on the content of the files without executing them is performed. Such features include file strings, byte sequences, opcodes, static API calls and control flow graphs. There are also features unique to different types of applications such as android. In [41], n-gram analysis techniques were applied on XML and DEX files found within the application’s Android Package Kit (APK). On the other hand, dynamic analysis involve observing the behavior of malwares during execution in a simulated environment. This approach allows more information related to the malwares’ interactions, such as its dynamic API calls, network activity, access and modifications to the file systems and registries to be collected.

In this paper we focus on the static analysis feature of opcodes for malware family attribution and detection of metamorphic malware. Santos et al. [33] represented each malware sample as a vector containing the frequencies of fixed length opcode sequences. Similarity comparison is performed against variants belonging to a set of different malware families and attribute the sample to the closest family. Rad and Masrom [14] extracted opcode frequencies as histograms and applies histogram dissimilarity metrics to detect similarities to metamorphic families and hence classify them. Toderici and Stamp [36] proposed to combine chi-squared statistical test with HMM method. Each program is represented by a spectrum of opcode frequencies and chi-squared test is performed to determine whether it matches the expected spectrum of frequencies from a malware family. This method is effective at metamorphic virus detection but performs poorly when techniques such as copying benign codes into malware are used. Shangmugam et al. [34] applied the use of a simple substitution cipher technique as a measure of similarity between the opcode distribution matrix of each metamorphic family to that of an unknown sample. Fazlali et al. [20] extracted an opcode feature set such as normalized frequency count of opcodes and appearance of high frequency opcodes. The feature set is used to train models of six different decision tree based machine learning algorithms such as Random forest to classify samples into metamorphic families or as benign. Sahay and Sharma [32] investigated and found that malwares generated from the same metamorphic virus kit are similar in size. Thus they applied K-Means clustering on their size values to detect clusters of metamorphic malwares from a dataset. Opcode frequency based features are then extracted from each cluster and used to train common classifiers for detection. Okane et al. [29] explored the use of runtime opcodes during the execution of the program for obfuscated malware such as encrypted malware. Support vector machine (SVM) was applied on density histogram of runtime opcodes as features.

Recently, works have applied deep learning on opcode-based features. Darem et al. [17] uses the popular method of converting features into images. Opcode n-gram features were converted into malware binary code and finally into gray-scale images before feeding into a Convolutional Neural Network (CNN) model. Their final model is an ensemble of CNN and XGBoost, combining both deep learning and traditional machine learning. [27] is another image based deep learning work that combines opcode features from both ASM and Bytes file of the malwares.

Graph-based techniques have also been applied on opcodes for malware analysis. Many of such works extract weighted directed graph of opcode digrams for each malware sample. One aspect of such works involve the use of sub-graph matching techniques. Khalilian et al. [26] constructs the opcode graph of samples and applies graph mining techniques to detect frequently occurring subgraphs across samples of the same malware family. These subgraphs represent micro-signatures of each family and are used as features to train classifiers for detection. [18] is a closely similar work but uses control flow graphs (CFG) instead of opcodes for frequent sub-graph mining. Gulmez et al. [22] performs further engineering of the sub-graphs of opcodes into histograms containing the degree of each graph node to produce final features as input to the machine learning models. Alam et al. [13] proposes two different techniques, with one using CFG and the other using opcodes. One key novelty of their work involves representing both CFG and opcodes as higher level patterns. This CFG is termed as Annotated CFG (ACFG) and they apply sub-graph isomorphism technique for malware pattern matching. Their second technique performs analysis of pattern distributions by applying sliding windows on the higher level opcode representations.

The other aspect of graph-based works consist of graph similarity analysis approaches which are closely related to our proposed methodology. Runwal et al. [31] first extracts the opcode graph of each malware sample. In order to determine the similarity between two graphs extracted from different samples, they proposed a formula to compute a score value from the graph matrices. They determined a threshold for the score value by performing a comparison of all samples in each metamorphic family and benign files. To classify new samples, comparison is performed on the opcode graph of the sample and any sample belonging to each metamorphic family or benign files and the sample is assigned to the class in which the score falls below the threshold. Kakisim et al. [25] proposes two graph similarity based techniques. Their first technique, coined as Co-opcode Graph Similarity based Metamorphic Malware Identification method (CGS-MMI), involves building a single opcode graph to represent each metamorphic family. A similarity comparison of opcode graphs of an unknown sample and that of each metamorphic family is done to determine the family with highest similarity for classification. Their similarity score is computed using a 2-D correlation coefficient formula which uses the mean of the weights of edges in the graphs. Their second proposed technique, Higher-Level Engine Signatures based Metamorphic Malware Identification method (HLES-MMI), aims to extract a higher level signature from the representative opcode graph of each metamorphic family. The idea is to only retain nodes and edges with the largest weights in each graph which indicates a specific pattern signature for each family. A metric is then used to determine the similarity among signatures for classification.

As our work also involves the use of clustering algorithms, following is a discussion on some clustering works. Zhang et al. [42] uses both static and dynamic features and applies clustering algorithms such as K-means and Hierarchical clustering. An ensemble method to combine both algorithms and improve robustness of results is proposed. Pitolli et al. [30] also use hybrid features but apply BIRCH clustering. Clustering is performed on the input dataset to group samples for the purpose of identifying the set of families that exist and attribute the samples to those families.

Hu et al. [23] propose a clustering-based work using opcode n-grams. Feature vectors containing occurrences of each opcode n-gram are used as input to calculate the Euclidean distance and hence determine program similarities. The authors argue that classic algorithms such as K-means and hierarchical clustering do not scale and propose the use of a prototype-based clustering method to cluster samples in their dataset of 20 malware families. Wang et al. [38] extracts opcode bigrams and uses information entropy on the bigram probabilities to select a smaller set of important bigrams as features. They propose a Fast Density-based Clustering which is an improvement over the standard density-based algorithm.

Differing from the above clustering works, our proposed methodology does not apply clustering on a dataset as a whole to identify the malware families or use clustering as the main means to classify new samples to their families. Instead we apply clustering separately on samples belonging to each malware family using a labelled dataset of samples with family categorisation. The purpose of this to attain a more fine-grained attribution capability by detecting potential sub-family groups that may have considerable differences even if they belong to the same family. In [23], it was reported that previous clustering approaches resulted in families being separated into several sub-family clusters, more specifically the detection of 50 clusters from a dataset of 20 families, due to highly diverse samples. This is an indication that our approach would be a meaningful research direction.

Our clustering approach is integrated to the graph similarity analysis techniques in [31, 25]. We follow these methods of similarity comparison as they have been shown to be effective in the detection of malware samples belonging to different metamorphic families. However, instead of applying the similarity comparison directly on samples, our approach applies such techniques as a means for similarity and distance calculations for input to the clustering algorithm to detect sub-family clusters within the same family.

III Methodology

Refer to caption
Figure 1: Three-stage process of methodology

Our proposed methodology can be organised as a three-stage process consisting of the following:

  1. 1.

    Creation of family distance matrix via similarity comparison of opcode graphs

  2. 2.

    Detection of sub-family groups via application of clustering algorithm

  3. 3.

    Building of opcode graph signatures from clusters for detection and family attribution

An illustration of the above process is shown in Fig. 1.

III-A Stage 1: Creation of family distance matrix

In order to perform clustering on samples belonging to each malware family and detect sub-family clusters, the first step is to create the feature inputs. Our inputs to the clustering phase are a set of N×NN\times N matrices, where NN represents the total number of samples belonging to each malware family in the dataset. Each matrix contains the distance values of each sample in the family to all the other samples within the same family.

III-A1 Opcode graph construction

In order to perform the similarity comparison to create the distance matrices, we leverage on opcode graph similarity comparison techniques. Specifically for opcode graph construction, we choose to use the weighted directed graphs in [31, 26].

All file samples in our dataset are disassembled to obtain their opcode sequences. From the set of opcode sequences, two key information are extracted. First is the list of all unique opcodes found. The other is the set of opcode bigrams which are pairs of opcodes in which one in the pair follows the other subsequently in the sequence. Using the above information, a weighted directed graph of opcodes is constructed for each sample in the dataset. Each node in the graph represents a unique opcode in the dataset. The edge from one node to another represents an opcode bigram (i, j), where j occurs subsequently after i in the opcode sequence of the sample. To compute an edge weight for (i, j), we first total the number of occurrences for (i, j) in the sample. This value is then divided by the total number of opcode bigrams where i belongs to the first member in the pair. The result is an edge weight representing the probability that j occurs after i, whenever i appears in the sample’s opcode sequence.

Refer to caption
(a) Graph for Sequence 1
Refer to caption
(b) Graph for Sequence 2
Figure 2: Constructed opgraph graphs for sequences in Table I
Sequence 1 Sequence 2
PUSH MOV
POP CALL
MOV JMP
POP MOV
RET PUSH
MOV PUSH
CALL POP
PUSH CALL
PUSH JMP
CALL PUSH
POP MOV
MOV JMP
TABLE I: Example illustration of two opcode sequences

Table I and Fig. 2 provide an example illustration of the opcode graph construction from opcode sequences. We consider a simple example scenario where the input contains only two file samples. Table I shows the opcode sequences of each sample. We may then refer to Fig. 2 to observe the opcode graphs that are constructed from each of the sequences, noting that the edges in the graphs contain the probability value of one specific opcode appearing after another specific opcode based on the direction of the edge. For example, in Fig. 2(a), we can observe that after a PUSH instruction, either a POP, CALL or another PUSH appears subsequently each with an equal probability.

III-A2 Opcode filtering

During the opcode sequence and opcode bigram extraction process, we had also performed filtering of opcodes to remove some from consideration in the opcode graph construction. As previously defined, the number of nodes in the opcode graph is the number of unique opcodes found in the samples of the entire dataset. Thus depending on how many unique opcodes there are, the size of the opcode graph may become very large. We filter and remove rarely occurring opcodes which lead to the graphs becoming unnecessarily large while not providing significant information or negatively impacting the similarity comparisons. More specifically, the process involves sorting the opcode bigrams according to their number of occurrences and we calculate the total number of bigrams. A simple threshold percentage is then set and applied on this total value. If the threshold value is 90%, the top occurring opcode bigrams that contribute to 90% of the total number of bigrams is retained while the remaining ones are filtered off. From the pool of filtered bigrams, unique opcodes that no longer exist in any of the retained bigrams are removed from consideration thereby reducing the size of the opcode graphs.

III-A3 Similarity comparison of opcode graphs

After the opcode graph construction for the samples, we proceed to perform similarity comparison of all samples within the same malware family in the dataset. The measure of similarity between samples, i.e. distance matrix, is a possible feature input to clustering algorithms for them to able to group our samples within the same family into different sub-family clusters.

To compute a similarity score between any two samples using their opcode graphs, we refer to the computation formula proposed in [31]. The scoring function is simple and efficient and is able to sufficiently reflect the similarity in the structure of the opcode graphs. Recall that each opcode graph is an N×NN\times N weighted directed graph. Hence the more similar one sample is to another sample, their edge weights should be closer in value in general. The scoring approach takes the above into consideration. It computes the difference in values of the corresponding edge weights in the two samples. Then the final score is computed using a formula which includes the summation of this difference in values between all the edge weights. Hence, the lower the score value, the more similar any two samples are to one another. A minimum score value of 0 then indicates that the two graphs are identical. In strict terms, this is in fact a measure of distance as opposed to similarity.

For each malware family, we proceed to perform the above computation to calculate the score between each sample and all the other samples in the family. Thus, the final output is an N×NN\times N distance matrix for each malware family. Each matrix will be separately used as a feature input for clustering in the next stage.

III-B Stage 2: Detection of sub-family groups via application of clustering algorithms

To perform the clustering in this stage, we need to first select a clustering algorithm capable of meeting the requirements for our problem. An important criteria for the algorithm is to be able to make a decision by itself on the number of clusters there are. Our input dataset is labelled with specific number of families. However our aim is to perform clustering within each family to detect the potential sub-family groups which is unknown and meant to be discovered through this process. Hence clustering algorithms such as the K-means are not suitable as they require the number of clusters to be defined. A likely candidate for this problem is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [19] which is able to automatically discover the number of clusters. Density-based clustering methods like DBSCAN do not group every point in the input into a cluster. Instead, only points that are very tightly packed are clustered and any points that are far from clusters and do not have their own close neighbours are considered as noise. This is applicable for the input scenario, where not all samples may belong to a group of samples generated by the same sub-family version. There may be lone samples generated by new versions which are not yet commonly found, and such samples could be detected as noise in the DBSCAN clustering process.

III-B1 Investigation of DBSCAN parameter

The DBSCAN algorithm automatically discovers the number of clusters based on a very important input parameter, eps. This parameter indicates the maximum distance between two samples for them to be considered in the same neighborhood for clustering.

Refer to caption
(a) Clustering results using eps=0.01
Refer to caption
(b) Subplot of the graph in 3(a)
Figure 3: DBSCAN clustering results on Qakbot malware samples

As the selection of this parameter is critical, we performed an investigation on how it may affect the outcome of clustering for the different malware families in our dataset with relation to the distance function used and distance matrices produced in Section III-A3 . Fig 3 shows clustering results for Qakbot malware samples. In particular Fig 3(a) shows the overall clustering when we used an eps value of 0.01. Clusters are indicated by the different shape patterns in the graph. Points that have the “x” pattern are noise points that do not belong to any clusters. Based on the graphical view, our observation is that the clustering results were promising. To better demonstrate, Fig 3(b) shows a subplot of  3(a) and the clusters being marked out in the graph contain a considerable number of samples with very high similarity.

Refer to caption
(a) Clustering results using eps=0.01
Refer to caption
(b) Clustering results using eps=0.1
Figure 4: DBSCAN clustering results on Formbook malware samples

Fig 4 shows the clustering results on the Formbook malware samples using two different eps values. We can observe in Fig 4(a) when the eps value is 0.01, the same as that in the Qakbot investigation, DBSCAN fails to cluster most of the samples which are all detected as noise. When a larger eps value of 0.1 is used as shown in Fig 4(b), we can observe that most samples are clustered. This provides a conclusion that as each malware family has its own unique properties, clustering cannot be applied exactly the same way for all of them. There may be a need to apply different eps values for different circumstances.

III-B2 Clustering process

Based on the above conclusion, in order to obtain optimal clustering results for different malware family, we came up with a clustering process that involves multiple rounds of DBSCAN clustering on a set of eps values. We first manually define the eps values and sort them according to increasing order. DBSCAN clustering is then applied on the input samples using the smallest eps value. All points detected as noise are then retained and sent for another round of clustering using the following eps value.

Algorithm 1 Malware sub-family clustering process
1:procedure PerformClustering
2:     M{M1,,MN}M\leftarrow\{M^{1},...,M^{N}\} \triangleright Set of distance matrices for N families
3:     S{S1,,SN}S\leftarrow\{S1,...,S^{N}\} \triangleright Corresponding samples for each distance matrix
4:     C{}C\leftarrow\{\} \triangleright Variable to store final result of detected clusters
5:     eps{eps1,,epsD}eps\leftarrow\{eps^{1},...,eps^{D}\} \triangleright Define set of eps values
6:     for n1n\leftarrow 1 to NN do
7:         mMnm\leftarrow M^{n}
8:         sSns\leftarrow S^{n}
9:         for d1d\leftarrow 1 to DD do
10:              labelsDbscanClustering(epsd,m)labels\leftarrow\textsc{DbscanClustering}(eps^{d},m) \triangleright Perform DBSCAN clustering with selected eps
11:              clusters,noiseProcessLabels(s,labels)clusters,noise\leftarrow\textsc{ProcessLabels}(s,labels) \triangleright Extract clustered/noise samples from DBSCAN labels
12:              mRecomputeDistanceMatrix(m,noise)m\leftarrow\textsc{RecomputeDistanceMatrix}(m,noise) \triangleright Compute new matrix using noise samples for next round
13:              CC+clustersC\leftarrow C+clusters \triangleright Add detected clusters to final result
14:         end for
15:     end for
16:     return C \triangleright Final result containing all detected clusters
17:end procedure

The above clustering process is illustrated in Algorithm 1. The set of eps values with increasing order allows DBSCAN to cater to the distribution of distance values for different malware families. By starting with a low eps value, there is good clustering performance for malware families such as Qakbot that contain distinct sub-family groups with high sample similarity as shown in Fig 3. When the same value is applied for other families such as Formbook, most samples are detected as noise in Fig 4(a). In such cases, when the algorithm progresses to the next eps which has a higher value, we will obtain better clustering results.

Once the clustering process is complete, we would obtain groups of clustered samples for each malware family. Each group contains samples that may potentially have been generated from a different version of the malware code. However at the end of this process, there may still be remaining samples that have been labelled as noise even after all rounds of clustering. We consider such lone samples as their own groups containing only a single member.

eps Setting Family No. Samples No. Clusters No. Unclustered
0.01 Qakbot 320 11 9
AgentTesla 400 13 363
Formbook 320 13 252
0.1 Qakbot 320 7 2
AgentTesla 400 22 98
Formbook 320 27 127
Proposed Qakbot 320 13 2
AgentTesla 400 37 4
Formbook 320 40 6
TABLE II: Evaluation of clusters detected over different eps settings

We provide an evaluation of our proposed clustering scheme by looking at the concrete number of clusters detected and unclustered samples belonging to three malware families in our dataset on different eps settings. We can observe in Table II that for an eps value setting of 0.01, the number of unclustered samples for Qakbot is very low. This coincides with our earlier explanation with reference to Fig 3, that such a setting was effective for this particular malware family. On the other hand, the number of unclustered samples for AgentTesla and Formbook form the majority. This situation is improved for both families when the setting of 0.1 is used as more samples are able to be clustered. For Qakbot, observe that the number of clusters detected in this setting is 7 which is lesser than the previous of 11. This indicates that more distinct sub-family clusters could have been detected which the algorithm was unable to do so in the current setting. Finally, for our proposed process of applying increasing eps value settings in an iterative fashion, the clustering results obtained is the best with low number of unclustered samples. This is an evidence of the benefits of the proposed scheme as compared to directly applying the DBSCAN algorithm on a specific eps value setting.

III-C Stage 3: Opcode Graph Signature Building

After the clustering stage, we proceed to the final stage of building opcode graph signatures which are used for similarity comparison to detect and classify new samples.

As opposed to building an opgraph graph for each individual malware sample as in Section III-A1, a single opcode graph is built for each group of clustered samples formed in the previous stage. This is done by consolidating the opcode bigrams extracted from every member in the cluster as if they belonged to a single sample and using them to build a single opcode graph. This graph is a representation of all members belonging to the cluster as a whole and is thus a signature for the cluster.

Once all signatures have been built, each malware family has a set of opcode graph signatures. Each signature represents a sub-family group within the malware family. In order to detect a new sample as malicious and classify it to one of the malware families, the same similarity comparison approach in Section III-A3 is used. The incoming sample is disassembled to obtain its opcode sequence and used to construct its opcode graph. The scoring function is used to compute the similarity score between the sample’s opcode graph and all opcode graph signatures. The sample is thus classified to the malware family that contains the signature with highest similarity to the sample.

IV Performance evaluations

IV-A Dataset

To evaluate the proposed methodology, we curated a dataset consisting of 2300 malware samples belonging to six different malware families and 580 benign file samples. Although the dataset is imbalanced, our methodology does not involve directly applying a machine learning technique for learning and prediction. Thus issues related to imbalanced data in the above approach does not exist in our methodology.

IV-A1 Malware

Table III shows a breakdown of the malware families and their corresponding sample count in our dataset. These malwares are collected from the MalwareBazaar project [9], a public malware database with samples contributed by the community. Specifically, our malwares are collected from a portion of MalwareBazaar’s daily lists of malware uploads from the period of November 2020 to May 2021. In order to use these samples in our experiments, there are two types of verifications that needed to be performed on the samples.

Firstly, we need to verify the validity that the collected samples were indeed malwares. To do so, we leveraged on VirusTotal’s [10] online file submission API service to perform analysis on the samples. The analysis results contain verdicts from a list of malware detection engines. To ensure validity, we choose to only use samples that have been flagged by at least a number of their detection engines (10% of the engines) in our dataset.

The second action performed was to obtain the malware family information of the collected samples. As our work’s focus is on malware family attribution, the family information needs to be accurate. While MalwareBazaar provides its own signature information containing malware family labels, there is a need for further verification. MalwareBazaar results contain information from other sources such as Hatching Triage malware analysis sandbox [7]. We also leverage on other sandbox services such as Joe Sandbox [8] to obtain analysis data. Thus, we only used the malware samples that have the same family information from multiple of these sources.

Malware Family Count
AgentTesla 500
Loki 411
Formbook 400
Qakbot 400
Emotet 300
AveMaria 290
TABLE III: Malware families and corresponding sample count in dataset

The malware families in our curated dataset are all significant threats and have seen continued activity in recent years. Agent Tesla is a remote access trojan (RAT) malware to steal credentials and user information and has been active for more than seven years [1]. New variants are continuously appearing within the growing number of attacks. Lokibot, another information-stealing malware, has been observed by the Cybersecurity and Infrastructure Security Agency (CISA) to have a considerable increase in popularity since July 2020 [5]. In 2021, Formbook malware has been distributed via COVID-19 themed campaigns and has made it to the top of Check Point Research’s Global Threat Index [3]. Emotet, which is a sophisticated Trojan that commonly acts as a downloader for other malwares, has also appeared at the top of Check Point Research’s index even after its global impact had been reduced due to an international police operation [4]. Lastly, the AveMaria RAT was observed to be distributed in a malicious email campaign and disguised as Microsoft Word documents in the period of December 2020 [2]. Hence there is no coincidence that these continuously active malwares were collected from MalwareBazaar’s daily uploads during the aforementioned period. Samples of these malware families made up the most significant proportion of the daily malwares indicating highest rates of exposure and activity.

IV-A2 Benign

Our benign file samples are collected from two sources. One set of files is collected from online repositories such as EXE Files [6]. The other set consists of standard files collected from Windows system directories.

IV-B Experimental Process

In our experiments, we applied the 5-fold cross validation technique to obtain separate samples for signature building and evaluation. Note that this is not applied to the dataset as a whole, but instead needs to be performed separately for each malware family. The samples within each family is randomly shuffled and split into five equal non-overlapping portions. One portion is used for testing. The remaining samples are sent for clustering as in Section III-B and signature building as in Section III-C to obtain the sub-family signatures for each malware family. The testing samples for all families are then combined into an aggregated testing set. Similarity comparison is performed on each testing sample against the built signatures to obtain its classification result. The same approach is also applied to the benign samples which is treated as a single ”family” with a set of detection signatures built.

Our experimental evaluation consists of two approaches:

  1. 1.

    Binary classification approach

  2. 2.

    Multi-class classification approach

In binary classification approach, we focus on the proposed methodology’s ability to only detect a sample as either a malware or a benign file. For this, as long as a malware sample is matched to a signature belonging to any of the malware families in our dataset, it is considered a correct classification as it has been detected as malware regardless of the family.

In multi-class classification approach, the focus is on family attribution. Each test malware sample must match to a sub-family signature belonging to the same malware family as its true label, to constitute a true positive.

IV-C Experimental Results

Refer to caption
Figure 5: Aggregated confusion matrix for binary classification
Refer to caption
Figure 6: Aggregated confusion matrix for multi-class classification

Our experimental results are presented in Fig. 5 and Fig. 6 which are the aggregated confusion matrices of the 5-fold cross validation for the binary classification and multi-class classification approaches respectively.

For binary classification, we focus on two important performance metrics. The first metric is True Positive Rate (TPR), which is the ratio of malware samples that can be correctly detected as malware. Following is the False Positive Rate (FPR) which is the measure of benign samples being incorrectly classified as malwares. In Fig. 5, we can observe that the TPR is 98% while the FPR is at 6%. With a high TPR of 98%, the experiments indicate that the methodology has a high confidence in being able to detect malware samples while only mistakingly classifying benign samples as malwares at a low rate.

For multi-class classification, we also focus on the TPR of each malware family and look at the rate of misclassification to each of other families to assess accurate malware family attribution. From Fig. 6, we can observe high TPR of 100%, 98% and 90% for the Emotet, Qakbot and AveMaria malware families respectively. The performance for these families may be attributed to the fact that they contain distinct sub-family clusters with a very high rate of similarity amongst samples within each cluster. Again, this can be observed in Fig. 3 where little to no scattering of points are observed for the marked clusters for the Qakbot malware family when eps value of 0.01 is used as input to the DBSCAN algorithm. The ability to cluster with lower eps values indicates that points within clusters are very close to one another. Samples in the test set for these families are thus more easily matched to its family signatures instead of being misclassified to the others leading to high TPR. This is an indication that there are very distinct and strong opcode sequence patterns for the sub-family clusters in the above families.

The TPR rates for the other three families AgentTesla, Loki and Formbook are 71%, 49% and 59% respectively and are not as high, especially when compared to the previous families. This is an indication that the sub-family clusters created and hence the signatures extracted for these families do not match the test samples closely enough. Instead a higher similarity score was obtained for the signatures of other malware families. However, a key observation can be made from the results in Fig. 6. A majority of the misclassifications for AgentTesla, Loki and Formbook actually occur amongst themselves. Out of AgentTesla’s misclassification rate of 29%, a total of 22% misclassification was attributed to Loki and Formbook. 34% of Loki’s test samples were misclassified as Formbook and 11% was misclassfied as AgentTesla while only 6% was misclassified as benign and other families. Similarly for Formbook, 18% and 19% test samples were misclassified as AgentTesla and Loki respectively, out of 41% total misclassification rate. In order to explain the above observation, we performed further investigations to understand the root cause of these misclassifications.

IV-D Investigations

Families Emotet Qakbot AveMaria AgentTesla Loki Formbook
Emotet - 0.795 0.801 0.784 0.786 0.782
Qakbot 0.795 - 0.840 0.941 0.927 0.931
AveMaria 0.801 0.840 - 0.886 0.917 0.901
AgentTesla 0.784 0.941 0.886 - 0.985 0.987
Loki 0.786 0.927 0.917 0.985 - 0.995
Formbook 0.782 0.931 0.901 0.987 0.995 -
TABLE IV: Experimental investigation results of malware family similarities

Two forms of investigations were performed. The first is an experimental investigation to get an indication of the level of similarity each malware family as a whole has to one another. To do so, we treated the samples within each family as a whole and created an opcode graph signature to represent each family. This is similar to how signatures were built for the clusters in Section III-C. Similarity comparison was then performed between the signature representing each family to that of all the other families.

Table IV presents the investigation results. The table shows the score values between each family and all other families in the dataset. In order for the results to be intuitive, we converted the scores to truely reflect similarity. This means that the higher the value, the higher level of similarity. This is the opposite to the measure of distance used earlier in our methodology introduction. To analyse the results, it would be important to gain a contextual view of the values in the table. We can observe that the lowest score value in the table is 0.782, which is between Emotet and Formbook. This tells that there should be considerable differences between Emotet and Formbook in terms of opcode sequence patterns when comparing between malware families. The score between Emotet and the rest of the families are relatively close in value, indicating that Emotet is indeed very different to the rest. The highest scores in the table belong to those in between AgentTesla, Loki and Formbook with values above 0.98. Specifically, Loki and Formbook are the most similar families producing the highest score value of 0.995. The above helps to explain the multi-class classification results presented in Fig 6. As AgentTesla, Loki and Formbook have very similar opcode characteristics to one another, it was natural that their samples would easily become recognized as belonging to any one of their families. This lead to the high misclassification rate that was contained within these three families.

The other investigation performed was research-based to learn about the key functionalities of each malware family and their similarities. Through our study, all the malware families are forms of information stealers. However, a common property of AgentTesla, Loki and Formbook malwares is that their main functions are to steal information and credentials from browsers, mail clients and FTP clients as well as keylogging. While the other malware families also steal data, they contain a variety of other functions. For example, Emotet is used as downloader for other malwares. Qakbot is a banking trojan and is able to drop additional payload such as ransomware. Hence it is possible that the similar streamlined capabilities of AgentTesla, Loki and Formbook lead to them having similar static features.

IV-E Comparison with Existing Approach

Refer to caption
Figure 7: Aggregated confusion matrix for multi-class classification using existing methodology without clustering

In this section, we provide an experimental comparison of our methodology against the existing approach in  [31]. The major difference in our methodology is the clustering process described in  III-B. By performing experiments without the inclusion of the clustering process, we can evaluate its impact on the performance and compare with the existing approach.

It is important to point out that we are unable to fully replicate the experimental procedure of the work in  [31]. Firstly, the work uses only a representative sample from each metamorphic family for comparisons to classify a new file. This was possible because based on their investigations on the samples in their data, there was a clear separation of similarity score values amongst samples in the same metamorphic family and that between the samples and benign files. In other words, a boundary threshold value of the score could be set such that a score below the threshold would indicate that a new sample is classified to the family, and a higher score means that the sample does not belong to the family. However such a pattern does not exist in our dataset. Our investigations indicate that samples within the same malware family do have considerable opcode structural differences such that the scores would vary and a valid threshold is not possible. For each malware family in the dataset, we thus combine the opcodes of all its samples to build a single opcode graph signature which is the approach in  [25] and also applied in our work.

We provide the multi-class classification results in Fig. 7. A significant drop in performance can be observed when the clustering process is absent. The first observation that can be made is close to total loss of ability to correctly classify benign files. This was an expected observation as benign files should not bear much similarities with one another unless they are related to the same application. Hence creating a single signature out of benign files would not be meaningful as the signature would not be similar to any individual benign file.  [31] uses only CYGWIN utilies as their benign files which would explain the ability to obtain a similar match.

The top performing malware families in our methodology had a reduction in TPR except for the Emotet malware family. AgentTesla and Loki hit a 0% TPR. Only Formbook had an improvement to its TPR. An interesting observation is that most of the misclassifications for all classes went to the Emotet malware family. However, such a phenomenom was not observed in the previous experimental results. This indicates that without the clustering process, generalisation of the characteristics of each malware family into a single signature is not effective. As each test sample does not bear significant resemblance to its own family signature, it happens that the score is closest to Emotet for most cases. Thus, the above experimental result highlights the importance of sub-family detection and signature creation in our proposed methodology.

IV-F Time Performance

Our experiments were carried out on a machine with a processor model of Intel(R) Core(TM) i7-9700K CPU 3.60GHz with eight cores available. In order to perform the classification on an unknown sample, the first step is to perform file disassembly to obtain opcodes and construct the opcode graphs. On average, this step takes approximately 0.5 seconds. For opcode graph comparisons, a single comparison takes approximately 0.07 seconds. Hence the total time for comparison depends on the number of signatures that is required for comparisons against. In our experiments, classification of each test sample would take approximately 17 seconds with the utilisation of only one core of the processor. With the implementation of multiprocessing, this can be significantly sped up if all available cores are utilised.

V Conclusions

In this paper, we presented an opcode-based methodology involving graph similarity comparison for malware detection and malware family attribution. We integrated a clustering algorithm into our approach to detect sub-family groups of malwares that potentially represent different versions of the malware family. Opcode graph signatures are extracted from each cluster which are used for similarity comparison to detect and attribute incoming samples. Experiments were performed on a dataset containing different malware families to evaluate the proposed methodology. The method achieved high recall and low false positives for binary classification. For multi-class classification, high attribution performance was obtained for some malware families. For certain families with relatively lower performance, investigations were performed to provide insights on the result.

There are several future work and enhancements that can be explored for the work. Further improvements or new methodologies can be proposed for the opcode graph based similarity comparison techniques that are used. Input parameters for the clustering algorithm resemble thresholds and are currently manually determined. It may be beneficial to look into how to better set these parameters, possibly through an automated approach. Another area is on scalability. The nature of signature comparison approaches is that the more signatures there are, the processing time required increases linearly. More signatures will be created from the addition of new malware samples and families and more comparisons have to performed. Hence we should also look into efficiency improvements.

References

  • [1] Agent Tesla. https://news.sophos.com/en-us/2021/02/02/agent-tesla-amps-up-information-stealing-attacks/.
  • [2] AveMaria RAT Malspam Campaign. https://blogs.infoblox.com/cyber-threat-intelligence/avemaria-rat-malspam-campaign/.
  • [3] Check Point Research Global Threat Index, August 2021. https://blog.checkpoint.com/2021/09/10/august-2021s-most-wanted-malware-formbook-climbs-into-first-place/.
  • [4] Check Point Research Global Threat Index, January 2021. https://blog.checkpoint.com/2021/02/11/january-2021s-most-wanted-malware-emotet-continues-reign-as-top-malware-threat-despite-takedown/.
  • [5] CISA Lokibot Alert. https://us-cert.cisa.gov/ncas/alerts/aa20-266a.
  • [6] EXE Files. https://www.exefiles.com/en/.
  • [7] Hatching Triage. https://tria.ge/.
  • [8] Joe Sandbox. https://www.joesandbox.com/.
  • [9] MalwareBazaar. https://bazaar.abuse.ch/.
  • [10] VirusTotal. https://www.virustotal.com/.
  • [11] FireEye Mandiant M-Trends Report 2020, 2020. https://content.fireeye.com/m-trends/rpt-m-trends-2020.
  • [12] SonicWall Threat Report 2021, 2021. https://www.sonicwall.com/medialibrary/en/white-paper/2021-cyber-threat-report.pdf.
  • [13] Shahid Alam, R.Nigel Horspool, Issa Traore, and Ibrahim Sogukpinar. A framework for metamorphic malware analysis and real-time detection. Computers & Security, 48:212–233, 2015.
  • [14] Babak Bashari Rad and Maslin Masrom. Metamorphic virus variants classification using opcode frequency histogram. 01 2011.
  • [15] Donabelle Baysa, Richard Low, and Mark Stamp. Structural entropy and metamorphic malware. Journal of Computer Virology and Hacking Techniques, 9, 11 2013.
  • [16] Zahra Bazrafshan, Hashem Hashemi, Seyed Mehdi Hazrati Fard, and Ali Hamzeh. A survey on heuristic malware detection techniques. In The 5th Conference on Information and Knowledge Technology, pages 113–120, 2013.
  • [17] Abdulbasit Darem, Jemal Abawajy, Aaisha Makkar, Asma Alhashmi, and Sultan Alanazi. Visualization and deep-learning-based malware variant detection using opcode-level features. Future Generation Computer Systems, 125:314–323, 2021.
  • [18] M. Eskandari and Hooman Raesi. Frequent sub-graph mining for intelligent malware detection. Secur. Commun. Networks, 7:1872–1886, 2014.
  • [19] M. Ester, H. Kriegel, J. Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
  • [20] Mahmood Fazlali, Peyman Khodamoradi, Farhad Mardukhi, Masoud Nosrati, and Mohammad Mahdi Dehshibi. Metamorphic malware detection using opcode frequency rate and decision tree. International Journal of Information Security and Privacy, 10:67–86, 07 2016.
  • [21] Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. Malware analysis and classification: A survey. Journal of Information Security, 05:56–64, 01 2014.
  • [22] Sibel Gülmez and Ibrahim Sogukpinar. Graph-based malware detection using opcode sequences. In 2021 9th International Symposium on Digital Forensics and Security (ISDFS), pages 1–5, 2021.
  • [23] Xin Hu, K. Shin, S. Bhatkar, and Kent Griffin. Mutantx-s: Scalable malware clustering based on static features. In USENIX Annual Technical Conference, 2013.
  • [24] Nwokedi Idika and Aditya Mathur. A survey of malware detection techniques. Purdue University, 03 2007.
  • [25] Arzu Kakisim, Mert Nar, and Ibrahim Sogukpinar. Metamorphic malware identification using engine-specific patterns based on co-opcode graphs. Computer Standards & Interfaces, 71:103443, 04 2020.
  • [26] Alireza Khalilian, Amir Nourazar, Mojtaba Vahidi-Asl, and H. Haghighi. G3md: Mining frequent opcode sub-graphs for metamorphic malware detection of existing families. Expert Systems with Applications, 112:15–33, 12 2018.
  • [27] Lin Li, Ying Ding, Bo Li, Mengqing Qiao, and Biao Ye. Malware classification based on double byte feature encoding. Alexandria Engineering Journal, 61, 06 2021.
  • [28] Da Lin and Mark Stamp. Hunting for undetectable metamorphic viruses. Journal in Computer Virology, 7:201–214, 08 2011.
  • [29] Philip O’Kane, S. Sezer, and K. McLaughlin. Detecting obfuscated malware using reduced opcode set and optimised runtime trace. Security Informatics, 5:1–12, 2016.
  • [30] Gregorio Pitolli, Giuseppe Laurenza, Leonardo Aniello, Leonardo Querzoni, and R. Baldoni. Malfamaware: automatic family identification and malware classification through online clustering. International Journal of Information Security, 20:371–386, 2020.
  • [31] Neha Runwal, Richard Low, and Mark Stamp. Opcode graph similarity and metamorphic detection. Journal in Computer Virology, 8, 05 2012.
  • [32] Sanjay K. Sahay and Ashu Sharma. Grouping the executables to detect malwares with high accuracy. Procedia Computer Science, 78:667–674, 2016. 1st International Conference on Information Security & Privacy 2015.
  • [33] Igor Santos, Felix Brezo, Javier Nieves, Yoseba Penya, Borja Sanz, Carlos Laorden, and Pablo Bringas. Idea: Opcode-sequence-based malware detection. pages 35–43, 02 2010.
  • [34] G. Shanmugam, R. M. Low, and M. Stamp. Simple substitution distance and metamorphic detection. Journal of Computer Virology and Hacking Techniques, 9:159–170, 2013.
  • [35] Mark Stamp and S. Sridhara. Metamorphic worm that carries its own morphing engine. Journal of Computer Virology and Hacking Techniques, 9:49–58, 05 2013.
  • [36] Annie H. Toderici and M. Stamp. Chi-squared distance and metamorphic virus detection. Journal of Computer Virology and Hacking Techniques, 9:1–14, 2012.
  • [37] Daniele Ucci and Leonardo Aniello. Survey on the usage of machine learning techniques for malware analysis. Computers & Security, 81, 10 2017.
  • [38] Cheng Wang, Zheng Qin, Jixin Zhang, and Hui Yin. A malware variants detection methodology with an opcode based feature method and a fast density based clustering algorithm. In 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pages 481–487, 2016.
  • [39] Yanfang Ye, Tao Li, Donald Adjeroh, and S. Sitharama Iyengar. A survey on malware detection using data mining techniques. ACM Comput. Surv., 50(3), June 2017.
  • [40] Ilsun You and Kangbin Yim. Malware obfuscation techniques: A brief survey. In 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, pages 297–300, 2010.
  • [41] L. Zhang, V. Thing, and Yao Cheng. A scalable and extensible framework for android malware detection and family attribution. Comput. Secur., 80:120–133, 2019.
  • [42] Yunan Zhang, Chenghao Rong, Qingjia Huang, Yang Wu, Zeming Yang, and Jianguo Jiang. Based on multi-features and clustering ensemble method for automatic malware categorization. In 2017 IEEE Trustcom/BigDataSE/ICESS, pages 73–82, 2017.