Multi-features based Semantic Augmentation Networks for Named Entity Recognition in
Threat Intelligence

Peipei Liu^1,2, Hong Li^1,2^∗ , Zuoguang Wang^1,2, Jie Liu^1,2, Yimo Ren^1,2, Hongsong Zhu^1,2 *Corresponding author ¹School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China ²Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China {liupeipei, lihong, liujie1, renyimo, zhuhongsong}@iie.ac.cn,
[email protected]

Abstract

Extracting cybersecurity entities such as attackers and vulnerabilities from unstructured network texts is an important part of security analysis. However, the sparsity of intelligence data resulted from the higher frequency variations and the randomness of cybersecurity entity names makes it difficult for current methods to perform well in extracting security-related concepts and entities. To this end, we propose a semantic augmentation method which incorporates different linguistic features to enrich the representation of input tokens to detect and classify the cybersecurity names over unstructured text. In particular, we encode and aggregate the constituent feature, morphological feature and part of speech feature for each input token to improve the robustness of the method. More than that, a token gets augmented semantic information from its most similar K words in cybersecurity domain corpus where an attentive module is leveraged to weigh differences of the words, and from contextual clues based on a large-scale general field corpus. We have conducted experiments on the cybersecurity datasets DNRTI and MalwareTextDB, and the results demonstrate the effectiveness of the proposed method.

Index Terms:

cybersecurity, named entity recognition, multi-features, semantic augmentation, attention mechanism

I Introduction

The cyber threat intelligence (CTI) is a collection of evidence-based information, which is often used to describe threat information for cyber assets[1]. Named entity recognition (NER) in CTI aims to tag cybersecurity entity names with their corresponding types such as users, malicious programs, hackers and vulnerabilities from plenty of unstructured CTI text. Accurate and speedy NER can be beneficial for researchers to carry out security analysis and assessment as well as enhance the real-time and precision of network security situation awareness. Thus, NER in CTI plays a major role in supporting and achieving cybersecurity research. Researches about NER in CTI have been widely pursued in recent years, and they can be summarized in the following three categories: rules-based [2, 3, 4, 5, 6], statistical characteristics-based [7, 8, 9, 10, 11] and deep learning-based [12, 13, 14, 15, 16, 17, 18, 19].

Refer to caption — Figure 1: The sample sentences of CTI. What is the ‘Apple’? And How can we understand ‘PE800BmcA06.exe’?

The rules-based methods use predefined rules and dictionaries to locate and extract entities with the advantages of accuracy, reliability and efficiency, while statistical characteristics-based methods apply feature engineering to learn representation for text and utilize machine learning algorithms to train models to recognize the entities. However, these approaches suffer from the limitations of poor portability and complex manual efforts.

Compared with the above two, deep learning-based methods have the unique advantages in the capability of representation learning and the semantic composition empowered by both the vector representation and neural processing[20]. Deep neural networks like recurrent neural networks(RNN)[21] and convolutional neural networks(CNN)[22] with their variants learn intricate features and discover the semantic information automatically from raw data via non-linear activation functions in multiple processing layers[23]. These neural models can be trained in an end-to-end paradigm by gradient descent, and they have been widely used for NER in CTI such as [17, 16, 18, 13].

Although the great success has been achieved by existing neural models, the sparsity of CTI data resulted from the high frequency variations and the randomness of cybersecurity entity names is not still given the priority. What’s more, vague entity types such as ‘APPLE’ for ORG and DEV¹¹1ORG means the organization, and DEV means the device. They are both the entity types in cybersecurity. also make the existing methods inefficient and unperfect for recognition, as seen at Figure.1. To solve the problem, we design a multi-features based semantic augmentation model, which is inspired by the studies using semantic augmentation to improve model performance on other natural language processing (NLP) tasks [24, 25, 26]. Our model consists of three main modules: Cybersecurity Domain Semantic Augmentation, General Domain Semantic Augmentation and Mixed Features Input. Cybersecurity Domain Semantic Augmentation enriches the representation of each input token from its most similar K words in cybersecurity domain corpus, where an attentive module is leveraged to weigh differences of the words. In General Domain Semantic Augmentation, we refer to the embeddings produced by finetuning the pretrained BERT [27] as the external support for input words, as it can give sufficient contextual clues from the large-scale corpus and rich lexical knowledge. We regard both Cybersecurity Domain Semantic Augmentation and General Domain Semantic Augmentation as Semantic Augmentation (SA). As for Mixed Features Input, except for common word embedding, morphological features encoded by CNN on character-level, part of speech features and component features are incorporated into the final representation of initial input words, before feeding into context encoding layers. After encodings of those three modules, a gate module is deployed on top of them to compute the most effective feature for entity detection and classification, with a followed softmax as tag decoder.

In summary, the contributions of this work are:

•

We propose a multi-features based semantic augmentation model, aiming to detect and classify the cybersecurity entities effectively by solving the data sparsity and vague entity types problem.
•

The model is highly modularized, and it can be easily transferred for other related works such as event extraction, but not only NER in CTI.
•

We evaluate our approach on the CTI datasets DNRTI and MalwareTextDB, and the results show that our model is competitive with the current state-of-the-art methods.

II Related Works

As a specific aspect of the cybersecurity research, NER in CTI has experienced three stages of development. Most of the early works use hand-crafted rules to achieve decent performance. [2] analyzes the regular expression matching method, and uses it to identify the types of attacks. The method is later used in Snort, l7-filter and Bro for the Deep Packet Inspection. A technique combining regular expression and ontology is then presented by [3] to extract entities from log files. The technique can use the formats of semi-structured files as features for type recognition and generating regular expressions, however, it cannot be applied to the extraction of unstructured files. [4] designs the iACE, an innovative solution for extraction of Indicators of Compromise (IOC) (e.g., malware signatures, botnet IPs) from public sources (e.g., blogs, forums, etc.) based on the combination of regular expression and syntax tree similarity. Furthermore, Bootstrapping algorithm is often viewed as the supplementary to improve the applicability and efficiency of rule-based matching methods [5, 6, 7].

Rules-based models work well when rules are exhaustive. Unfortunately, they cost a lot of manual efforts and are difficult to adapt to new tasks or new texts. To overcome the drawbacks of rules-based models, supervised methods based on machine learning such as Support Vector Machine (SVM) [9, 11], Hidden Markov Models (HMM) [28] and Conditional Random Fields (CRF) [10, 8, 29] are applied to automatically learn similar patterns from unstructured text data.

In recent years, with the development of deep learning, attention has been transferred to neural network models for NER in CTI. In [12], cybersecurity entities are extracted by a NER system, which is comprised of a residual multi-task CNN trained on OntoNotes 5 and the automatically labeled vulnerability corpus. [13] creates the VIEM, which employs a joint model based on character-level bidirectional gated recurrent units (BiGRU) and word-level BiGRU to identify the entities. [15] uses the stacked bidirectional Long Short-Term Memory (BiLSTM) network to process the preliminary character-level features, then the resulting features are concatenated with word-level vectors from the GLoVe [30] model serving as the input for encoder BiLSTM. Finally, two Dense layers and the Softmax function are designed for detection and classification. [16, 31, 32, 14] apply CNN and BiLSTM neural networks as the basic framework to extract features of the domain information for NER in CTI. The differences among them are just attention mechanisms such as self-attention or multi-head attention allocating the corresponding weight of the extracted token feature, and tag decoders such as softmax or CRF obtaining association information among tags. [17] presents CASIE, a system that extracts information about cybersecurity events from text. The CASIE combines BiLSTM with multiple different linguistic features, providing a suite of competitive information extraction models for cybersecurity.

Although these models have achieved fine results, they ignore the facts that vague cybersecurity entity types problem and the randomness and high-frequency variability of cybersecurity entity names could cause the data sparsity. In this paper, we will propose a model to solve both problems.

III The Proposed Method

The task of NER is conventionally posited as a standard sequence labeling problem, where an input sequence $X=x_{1},x_{2},...x_{N}$ with $N$ words is annotated with its corresponding labels ${Y}={y}_{1},{y}_{2},...,{y}_{N}$ . The goal is thus to learn a parametrized mapping $f_{\theta}:X\rightarrow{Y}$ from input words to output labels. To achieve the goal, we firstly enter the mixed feature input of each word into an encoder to get the basic contextual representation. Then, cybersecurity domain semantic augmentation representation and general domain semantic augmentation representation are combined with the basic contextual representation through a gate mechanism to enhance the semantic feature of the word. Finally, the resulting representation is passed into the decoder containing a feed forward neural networks and softmax to get label.

In this section, we will describe main modules of the proposed neural network model one-by-one. Figure 2 presents a high-level overview of the model framework architecture.

III-A Basic model with Mixed Feature Input

In this module, we first construct the input representation for each word in the input sentence, and then pass it to the encoder to extract contextual feature.

III-A1 Mixed Feature Input

Similar to the state-of-the-art NER approaches[33, 34, 35], we also use word embeddings, character-level word embeddings and part-of-speech (POS) embeddings for NER in CTI. In addition, considering the characteristics of cybersecurity entity names, we take component embedding of each word into account. We will discuss each of these embeddings in the following text.

Word embedding derived from the large corpus can learn co-occurrence statistics of words [36]. For each word, we retrieve one embedding from the lookup table initialized by Glove [30]. For the out of vocabulary (OOV) words, embeddings are randomly initialized within $(-\sqrt{\frac{3.0}{word_{dim}}},\sqrt{\frac{3.0}{word_{dim}}})$ , where $word_{dim}$ is the word dimension size.

Character-level word embeddings have been proved useful in capturing morphological features for dealing with unseen words and OOV [37], which is very beneficial for NER in CTI. In our model, input character sequences are first passed through the embedding layer to get character embeddings, then a two-layers CNN with batch-norm is deployed to extract local features. Finally, CNN results are sent to a max-pooling layer with a Rectifier Linear Unit (ReLU) [38] activation to map varying length vectors into fixed size representations of words.

POS can indicate the property of a word like noun or preposition, which is important for entity classification. For example, the POS tags of words ‘Shamoon’ and ‘StoneDrill’ in the sentence ‘Kaspersky believes both Shamoon and StoneDrill groups…’ are NN, indicating that they may be potential entities. Therefore, we utilize POS embeddings to enhance the representation ability of the input. In order to capture the contextual information of POS tags, we pretrain a CBOW [36] model based on POS tags with a context window of 3 tags and vector dimension size 30. In our work, the POS tag for each word is retrieved from Stanza [39].

Since the cybersecurity entity names contain various forms of characters, the component features may help us know the labels of words better. For example, words made up of numbers are more likely to be IPs or product models while words with all uppercase characters are more likely to be devices or vendors. Therefore, in this work, we get a separate one-hot lookup table to add the component feature with following options: allNum, allLower, allUpper, upperInit, mainNum, containNum, other.

The above four types of embedding are then concatenated and linearly converted as the final embedding for each word. For the word $x_{i}$ , we can get its embedding $w_{i}$ as:

\displaystyle w_{i}=[w_{wi},w_{mi},w_{pi},w_{ci}]W_{E}

(1)

where $[\cdot,\cdot]$ is the concatenation operation with the same below, $W_{E}\in R^{98\times 128}$ is a learnable matrix, $w_{wi}\in R^{50}$ denotes the word embedding, $w_{mi}\in R^{30}$ denotes the character-level word embedding, $w_{pi}\in R^{10}$ denotes the POS embedding, $w_{ci}\in R^{8}$ denotes the component feature.

III-A2 Encoder in the model

For many sequence labeling tasks, it is beneficial to have access to both past and future contexts, and BiLSTM coincidentally is an elegant solution to achieve the goal. In our work, the mixed representation embedding of each word is passed through a two-layers BiLSTM to capture past and future information from different directions, and then the two separate hidden states are concatenated to form the final contextual representation. That is, for the ${i_{th}}$ word $x_{i}$ in the sentence, its contextual representation $h_{i}\in R^{256}$ from BiLSTM can be given as:

\displaystyle h_{i}=[LSTM_{f}(w_{i}),LSTM_{b}(w_{i})]

(2)

where $LSTM_{f}$ denotes the forward LSTM and $LSTM_{b}$ denotes the backward LSTM.

The BiLSTM model can learn context features from input sequences automatically and effectively. However, these features have different contributions to NER. Fortunately, the multi-head attention can make the model learn relatively important features from different representation subspaces [40, 41]. Hence, in our study, we capture relatively important features from outputs of the former BiLSTM by introducing the multi-head attention mechanism to benefit for NER. We firstly calculate the self-attention using the scaled dot-product attention function as:

\displaystyle Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d}})V

(3)

where $Q,K,V$ represent query, key, and value matrix, respectively. d is the dimension of $K$ .

Then, the multi-head attention is computed by 8 parallel self-attention layers and the $t_{th}$ layer can be expressed by:

\displaystyle head_{t}=Attention(HW_{t}^{Q},HW_{t}^{K},HW_{t}^{V})

(4)

where $H\in R^{256}$ is the output of BiLSTM model and $W_{t}^{Q},W_{t}^{K},W_{t}^{V}\in R^{256\times 32}$ are projection matrices. In result, the multi-head attention is the concatenation of $\{head_{1},head_{2},...,head_{8}\}$ :

\displaystyle m=[head_{1},...,head_{t},...,head_{8}]W_{M}

(5)

where $W_{M}\in R^{256\times 256}$ is the learnable parameter matrix.

After obtaining the word representation from the multi-head attention, we apply a feed-forward neural network (FFNN) to better aggregate and encode the features from different spaces.

\displaystyle m=FFNN(m)

(6)

Finally, $m$ serves as the output from encoder and $m_{i}\in R^{256}$ acts for the feature representation of $x_{i}$ in the input sentence.

III-B Cybersecurity Domain Semantic Augmentation

We also call this as internal semantic augmentation for the reason that the support corpora is from the experimental dataset. To handle the data sparsity, the internal semantic augmentation captures commonalities of named entities in semantic space to improve the predicting effect of our model. Here, two internal semantic augmentation methods are designed named Hard Semantic Augmentation(HSA) and Soft Semantic Augmentation(SSA) respectively. We introduce details of these two methods in the following.

Firstly, a word embedding model is trained on the unlabeled experimental dataset. For each word $x_{i}$ in the input sentence, we can get its embedding $v_{i}$ via the trained model. Subsequently, we can get the most similar K words $\Phi_{i}=\{x_{i1},...,x_{iK}\}$ of $x_{i}$ with corresponding embeddings $\Psi_{i}=\{v_{i1},...,v_{iK}\}$ based on the semantic space similarities.

III-B1 Hard Semantic Augmentation

In this section, we enhance the semantic representation of $x_{i}$ by its K nearest neighbors without training, which is called hard. Since not all the K words make equal contribution to assisting label prediction of $x_{i}$ , it is important to get the most effective information from different words. Thus, an attentive module is leveraged to weigh differences of the words. For each word $x_{ij}$ , it is assigned a weight as:

\displaystyle\alpha_{ij}=\frac{exp(cosine(v_{i},v_{ij}))}{\sum_{k}^{K}{exp(cosine(v_{i},v_{ik}))}}

(7)

Then, we can get the semantic augmentation representation $w^{I}_{i}$ of $x_{i}$ by computing the weighted sum of all $\alpha_{ij}$ with their corresponding embeddings $v_{ij}$ :

\displaystyle w^{I}_{i}=\sum^{K}_{j=1}{\alpha_{ij}v_{ij}}

(8)

III-B2 Soft Semantic Augmentation

Different from the Hard Semantic Augmentation, the acquisition of Soft Semantic Augmentation needs the training procedure. We compute the augmentation representation of $x_{i}$ by measuring $\Psi_{i}$ and the $m_{i}$ derived from the encoder instead of $v_{i}$ . The same as above, the augmentation ability of the word $x_{ij}$ can be weighted by:

\displaystyle\beta_{ij}=\frac{exp(m_{i}W_{I}v_{ij})}{\sum_{k}^{K}{m_{i}W_{I}v_{ik}}}

(9)

where $W_{I}\in R^{256\times 256}$ is a learnable parameter matrix. Next, the weighted semantic representation can be derived from:

\displaystyle w^{I}_{i}=\sum^{K}_{j=1}{\beta_{ij}v_{ij}}

(10)

After the above processing, the module ensures that semantic augmentation representations are weighted based on contributions of different words, and the most effective information for NER is obtained.

III-C General Domain Semantic Augmentation

BERT [27] is a powerful language representation model, and it can explicitly model the correlation of a pair of tokens which is very helpful for solving vague cybersecurity entity types problem. For instance, the DEV ‘APLLE’ often cooccur with ‘operation system’, while the ORG ‘APPLE’ often appear with ‘Google’. In this work, we incorporate BERT into NER tasks after the finetuning procedure. As the BERT is pre-trained on the general field corpus, we also call this module as external support.

We take the final hidden state of the token corresponding to the target word from the BERT output as the word representation. If there are more than one token referring to a word, we sum them, i.e., we can get the representation of ’gpu’ by summing ’gp’ and ’##u’. Then, a fully connect layer for every target lemma is added, which is the same as the last layer of the multi-head attention module. For the word $x_{i}$ with $P$ subtokens $\{\mu_{i1},...,\mu_{iP}\}$ , its general semantic augmentation representation can be created by:

\displaystyle w_{i}^{E}=fc(\sum^{P}_{j=1}{e_{ij}})

(11)

where $e_{ij}$ is the token representation of $\mu_{ij}$ , and $fc(\cdot)$ denotes the full connection.

III-D Decoder

The gating mechanism regulates how much of the message propagates to the next step [42], and this provides the model a way to control contributions from the three modules in different text environments. We define a two-layers gate to compute the final comprehensive representation of $x_{i}$ :

\displaystyle f_{i}=\sigma(W_{G}[\theta_{i},\lambda_{i}])

(12)

\displaystyle\epsilon_{i}=f_{i}\odot\theta_{i}+(\boldsymbol{1}-f_{i})\odot\lambda_{i}

(13)

where $W_{G}\in R^{256\times 512}$ are the trainable parameters, $\sigma(\cdot)$ is the sigmoid function, $\odot$ stands for element-wise multiplication, and $\boldsymbol{1}$ is a vector whose elements are all 1. What’s more, $\epsilon_{i}^{1}$ yielded in the first gate layer where $[\theta_{i},\lambda_{i}]=[m_{i},w^{I}_{i}]$ is then used as the input for the second gate layer with $[\theta_{i},\lambda_{i}]=[\epsilon_{i}^{1},w^{E}_{i}]$ to output $\epsilon_{i}^{2}$ .

Next, the $\epsilon_{i}^{2}$ is passed through a final softmax layer for label classification:

\displaystyle o_{i}=softmax({\epsilon_{i}^{2}}W_{O}+b_{O})

(14)

where $W_{O}\in R^{256\times n_{l}}$ is a parametrized matrix, $b_{O}\in R^{n_{l}}$ is the bias and $n_{l}$ indicates the number of entity types.

To extract the optimal predicted entity type $\hat{y}_{i}$ of $x_{i}$ , we select the type corresponding to the maximum probability:

\displaystyle\hat{y}_{i}=argmax(o_{i})

(15)

During the training, we optimize our model with a cross-entropy loss:

\displaystyle loss=-\sum^{N}_{i=1}{\sum^{n_{l}}_{j=1}y_{i}^{j}log(o_{i}^{j})}

(16)

where $y_{i}^{j}$ equal to $0$ or $1$ is a binary indicator indicating whether $x_{i}$ truly belongs to $j_{th}$ entity type.

IV Experiments

We conduct extensive experiments on two open cybersecurity datasets and our model is trained end-to-end by forward and back propagation. The project source can be publicly available at https://github.com/LiuPeiP-CS/NER4CTI.

IV-A Datasets and Evaluation

Datasets: DNRTI [15] is a large-scale dataset for NER in CTI, consisting of 6574 annotated sentences and 36412 entities. All the entities can be divided into 13 categories, including hacker-organization(HackOrg), attack(OffAct), sample-file(SamFile), security-team(SecTeam), tool(Tool), time(Time), purpose(Purp), area(Area), industry(Idus), organization(Org), way(W-ay), loophole(Exp), features(Features).

MalwareTextDB [43] is an annotated malware database, annotated from 39 APT reports with a total of 6819 sentences and 10983 entities. All the entity tokens in MalwareTextDB can be labelled by 3 types: Action, Entity, and Modifier. Compared with DNRTI, MalwareTextDB is inferior to DNRTI in the number of category and the number of instances per category. Moreover, MalwareTextDB is sparser than DNRTI due to the more sentences but the less entities.

To approximately match the training data and testing data distributions, follow [15], we randomly select 70% of the original text as the training set, 15% as validation set, and 15% as the testing set on both datasets.

Evaluation: Following suggestions in [15], we evaluate Precision (P), Recall (R), and F1 scores with micro-averaging and adopt the strict evaluation criterion. Specifically, a predicted entity is correct only if its type and boundaries are correct.

IV-B Setting

We use the pre-trained uncased ${BERT_{base}}$ ²²2https://huggingface.co/bert-base-uncased model for finetuning, since we find that the cased BERT model performs slightly worse than the uncased model in this task. During finetuning, we use the training dataset to find the best settings for our task. We keep the dropout rate at 0.5. In addition, we have the learning rate warmup with the warmup proportion 0.002, the weight decay with the rate 1e-5, and the learning rate decay with the rate 1e-5. We finetune the BERT model in 100 epochs with the batch size 32 and learning rate 5e-5.

For the BiLSTM in our basic model, our training uses 128 hidden states with batch size of 64. To train the model, we minimize the cross-entropy loss of the softmax class probability in Eq.16. The model hyperparameters are updated using back-propagation by the Adam optimizer [44]. The learning rate is 1e-3 with weight decay 1e-5, and the minimum learning rate is set to 5e-5. The model is regularized with a locked dropout rate of 0.3. We use 50-dimentional pre-trained word embeddings from Glove [30], 30-dimentional random initialized character embedding, 10-dimensional POS embedding, and 8-dimnesional component embedding as described in III-A. We train our model for 200 epochs, and report the result for the model performing best on the validation set of the open data set collection.

Besides, we train the word2vec[36] word embedding model³³3We also have experiments on other models, such as fastText, but find no significant differences. for III-B with min_count 2, size 256 and window 3 in CBOW mode by using the gensim tool⁴⁴4https://radimrehurek.com/gensim/.

IV-C Main Results and Analysis

Experiments are conducted on the open data collection introduced in IV-A. Table.I and Table.II exhibit our results in comparison to previously published results. We have five findings below:

TABLE I: Main comparison results on MalwareTextDB dataset.

Methods	ACC	P	R	F1
CRF[43]	-	51.7	27.0	35.2
NaiveBayes+CRF[43]	-	45.9	36.3	40.3
BiLSTM+CRF[37]	83.74	40.18	46.64	43.17
IDCNN+CRF[45]	85.40	50.00	46.46	48.17
CNN+BiLSTM+CRF [37]	84.52	47.10	48.21	47.65
LSTM+BiLSTM+CRF [15]	87.53	-	-	47.52
$BERTfinetuning_{ours}$	87.40	52.53	63.49	57.49
$basemodel_{ours}$	84.55	46.51	51.18	48.73
$FinalModel_{ours}$	87.99	58.92	62.01	60.43

TABLE II: Main comparison results on DNRTI dataset.

Methods	ACC	P	R	F1
IDCNN+CRF[45]	98.66	74.42	77.40	75.88
BiLSTM+CRF [37]	98.58	74.07	75.98	75.01
LSTM+LSTM+CRF[15]	89.53	-	-	67.09
LSTM+BiLSTM+CRF[15]	90.85	-	-	71.29
CNN+BiLSTM+CRF [37]	98.69	76.20	76.07	76.14
$BERTfinetuing_{ours}$	94.33	80.55	80.27	80.41
$basemodel_{ours}$	94.90	81.18	81.77	81.47
$FinalModel_{ours}$	95.43	86.16	84.54	85.34

TABLE III:

Case1

GroundTruth

We have previously observed APT19[HackOrg] steal data[OffAct] from law and investment firms for competitive economic[Purp] purposes .

CNN+BiLSTM+CRF

APT19[HackOrg]

\surd

steal data[OffAct]

\surd

economic[Idus]

\times

Our Model

APT19[HackOrg]

\surd

steal data[OffAct]

\surd

competitive economic[Purp]

\surd

Case2

GroundTruth

In some attacks[OffAct], Whitefly[HackOrg] has used a second piece of custom malware[Tool], Trojan.Nibatad[Tool] .

CNN+BiLSTM+CRF

attacks[OffAct]

\surd

Trojan.Nibatad[Tool]

\surd

Our Model

attacks[OffAct]

\surd

Whitefly[HackOrg]

\surd

custom malware[Tool]

\surd

Trojan.Nibatad[Tool]

\surd

Case3

GroundTruth

It seems Eset[SecTeam] has discovered and published on a new malware module created by Turla[HackOrg].

CNN+BiLSTM+CRF

Turla[HackOrg]

\surd

Our Model

Eset[HackOrg]

\times

Turla[HackOrg]

\surd

TABLE IV: Module comparisons on the two datasets.

Methods	MalwareTextDB			DNRTI
Methods	P	R	F1	P	R	F1
base model	46.51	51.18	48.73	81.18	81.77	81.47
base+SSA	50.80	49.87	50.33	84.01	82.54	83.27
base+HSA	49.78	48.47	49.12	83.31	82.92	83.12
base+BERT	59.29	59.65	59.47	85.11	83.48	84.29
base+BERT+SSA	58.04	63.06	60.44	86.32	84.63	85.47
base+BERT+HSA	58.92	62.01	60.43	86.16	84.54	85.34

1.

Our model improves NER performance on every dataset and this improvement is particularly large on F1.
2.

CNN+BiLSTM performs better than BiLSTM, indicating the importance of applying character-level features for NER.
3.

Mixed Feature (i.e. CNN+POS+Components) plays an important role in improving the performance of NER by comparing the CNN+BiLSTM to our base model, which confirms our thought in III-A.
4.

Benefiting from the large-scale external corpus and transfer learning support, Bert can effectively adapt to CTI corpus and achieve satisfactory results for NER.
5.

The effect of our final model on DNRTI is stronger than MalwareTextDB, because the sparser characteristic of MalwareTextDB limits the ability of the model.

IV-D Module Study

To determine which modules are responsible for our models better, we conduct six different incremental comparison experiments, as seen at Table. IV. We choose the best of them as the final model.

For both datasets, we observe that both internal semantic augmentation and external semantic augmentation can improve the performance of base model. Compared to HSA, SSA can get the slightly better result might due to the effect of powerful contextual augmentation representation. However, we find that the average time cost of SSA is about 11 times than HSA during training and testing, since the attention of HSA is computed in advance (Eq.7,8) while the computation of SSA is posteriori (Eq.9,10) . To decide the best model, we further incorporate the HSA and SSA into base model with BERT respectively, which are base+BERT+HSA and base+BERT+SSA. Comprehensively considering the performance and time cost, we choose base+BERT+HSA as our final model. The results of our final model demonstrate the importance of gate and integration of different semantic augmentation modules.

Besides, different from the results on DNRTI, the performance of different combinations on MalwareTextDB is quite fluctuant. The BERT can have an important impact on the final model while internal semantic augmentations have a slight effect. That may be because the strong contextual encoding capability is more useful than internal semantic augmentations in an environment where data is too sparse.

IV-E Case Study

In order to verify the effectiveness of our model for data sparsity intuitively, we select three representative cases from the test set, and compare their prediction results of CNN+BiLSTM+CRF with our model. Table. III shows their prediction results. Next, we will analyze each case in detail.

From Case 1, we find that sufficient contextual reasoning is necessary. Predicting the entity ‘competitive economic’ depends on the context words ‘for’ and ‘purpose’, and BERT based models deal with the problem better than pure BiLSTM.

Our model achieves more adequate results in Case 2 while the CNN+BiLSTM+CRF misses some predictions. Specifically, the similar words of ‘Whitefly’ and ‘custom malware’ are {‘ESET’, ‘Symantec’, ‘Butterfly’, ‘LuckyMouse’} and {‘backdoor’, ‘tool’, ‘EternalBlue’, ‘Rocket’} respectively, making us believe the effectiveness of semantic augmentation.

In Case 3, we detect all entities but CNN+BiLSTM+CRF still loses one entity. However, we unfortunately assign the wrong label to ‘Eset’. The reason may be that ‘Eset’ has the similar label space and semantic role to ‘Turla’.

V Conclusion

In this paper, we seek to better leverage the augmentation semantics to solve the data sparsity for NER in CTI. We propose a new neural network model consisting of three parts: internal domain augmentation, external general augmentation and mixed Linguistic Features. The internal domain augmentation strengthens the semantics of input words by their most-K similar words in cybersecurity. We finetune the BERT model on cybersecurity datasets, and output representations of tokens are used for external general augmentation. What’s more, the POS feature, morphology feature and component feature construct the mixed Linguistic Features. The experiments on two CTI datasets show the effectiveness of our approach. We expect that this idea can, at least partially, help perfect classification and detection tasks in cybersecurity research while data sparsity occurs.

VI Acknowledgments

This work was supported by the National Key R&D Program of China (No.2018YFB0803402) and the National Natural Science Foundation of China (No.61842202). We thank our anonymous reviewers for their valuable suggestions. We also appreciate Beijing Key Laboratory of IOT Information Security Technology for providing the environment of our experiments.

VII supplementary material

In this supplementary material, we give more performance details of each entity type predicted by our model (Table.V and Table.VI). In Table. VI, we observe that recall rates of some types of entities are very high, even 1 (such as “Purp” and “Exp”) , which indicates that the model has a high coverage for such entities.

TABLE V: The performance details on MalwareTextDB dataset.

Entity Types	P	R	F1
Action	59.89	67.61	63.52
Entity	59.76	60.12	59.94
Modifier	54.21	58.86	56.44

TABLE VI: The performance details on DNRTI dataset.

Entity Types	P	R	F1
Area	86.26	84.26	85.26
Exp	98.51	100	99.25
Features	91.27	99.14	95.04
HackOrg	82.92	81.79	82.35
Idus	91.04	94.57	92.78
OffAct	85.94	73.33	79.14
Org	72.06	71.53	71.79
Purp	85.82	100	92.37
SamFile	96.38	85.89	90.83
SecTeam	88.36	84.87	86.58
Time	87.65	88.17	87.91
Tool	78.85	70.51	74.45
Way	84.21	97.96	90.57

References

[1] J. Williams, “Cyber threat intelligence,” SC magazine, vol. 29, no. 3SUPPL., pp. 62–63, 2018.
[2] ZHANG, Shu-Zhuang, LUO, Hao, FANG, and Bin-Xing, “Regular expressions matching for network security,” Journal of Software, 2011.
[3] M. Balduccini, S. Kushner, and J. Speck, “Ontology-driven data semantics discovery for cyber-security,” in International Symposium on Practical Aspects of Declarative Languages, 2015.
[4] X. Liao, K. Yuan, X. F. Wang, Z. Li, and R. Beyah, “Acing the ioc game: Toward automatic discovery and analysis of open-source cyber threat intelligence,” in Acm Sigsac Conference on Computer & Communications Security, 2016.
[5] E. M. Thelen, “A bootstrapping method for learning semantic lexicons using extraction pattern contexts,” 2002.
[6] C. L. Jones, R. A. Bridges, K. M. T. Huffer, and J. R. Goodall, “Towards a relation extraction framework for cyber-security concepts,” ACM, 2015.
[7] N. Mcneil, R. A. Bridges, M. Iannacone, B. Czejdo, N. Perez, and J. R. Goodall, “Pace: Pattern accurate computationally efficient bootstrapping for timely discovery of cyber-security concepts,” IEEE, 2014.
[8] R. Lal, “Information extraction of cyber security related terms and concepts from unstructured text,” Dissertations & Theses - Gradworks, 2013.
[9] V. MulwaD, W. Li, A. Joshi, T. Finin, and K. Viswanathan, “Extracting information about security vulnerabilities from web text,” in IEEE/WIC/ACM International Conference on Web Intelligence & Intelligent Agent Technology, 2011.
[10] A. Joshi, R. Lal, T. Finin, and A. Joshi, “Extracting cybersecurity related linked data from text,” in Semantic Computing (ICSC), 2013 IEEE Seventh International Conference on, 2013.
[11] S. More, M. Matthews, A. Joshi, and T. Finin, A Knowledge-Based Approach to Intrusion Detection Modeling. A Knowledge-Based Approach to Intrusion Detection Modeling, 2012.
[12] I. Perera, J. Hwang, K. Bayas, B. Dorr, and Y. Wilks, “Cyberattack prediction through public text analysis and mini-theories,” in 2018 IEEE International Conference on Big Data (Big Data), 2018.
[13] Y. Dong, W. Guo, Y. Chen, X. Xing, Y. Zhang, and G. Wang, “Towards the detection of inconsistencies in public security vulnerability reports,” in 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 869–885. [Online]. Available: https://www.usenix.org/conference/usenixsecurity19/presentation/dong
[14] Y. Qin, G. Shen, W. Zhao, and Y. Chen, “Research on the method of network security entity recognition based on deep neural network,” Journal of Nanjing University(Natural Science), 2019.
[15] X. Wang, X. Liu, S. Ao, N. Li, and X. Zhang, “Dnrti: A large-scale dataset for named entity recognition in threat intelligence,” in 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2020.
[16] Pingchuan, B. Ma, Z. Jiang, N. Lu, Z. Li, and Jiang, “Cybersecurity named entity recognition using bidirectional long short-term memory with conditional random fields,” Tsinghua Science and Technology, vol. v.26, no. 03, pp. 11–17, 2021.
[17] T. Satyapanich, F. Ferraro, and T. Finin, “Casie: Extracting cybersecurity event information from text,” Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[18] F. Ren, Z. Jiang, and J. Liu, “Integrating an attention mechanism and deep neural network for detection of dga domain names,” in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019.
[19] Y. Qin, G. W. Shen, W. B. Zhao, Y. P. Chen, Y. U. Miao, and X. Jin, “A network security entity recognition method based on feature template and cnn-bilstm-crf,” Frontiers of Information Technology and Electronic Engineering, vol. 020, no. 006, pp. P.872–884, 2019.
[20] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[21] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Recurrent neural network regularization.”
[22] Y. Lecun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 2014.
[23] J. Li, A. Sun, J. Han, and C. Li, “A survey on deep learning for named entity recognition,” IEEE Transactions on Knowledge and Data Engineering, vol. PP, no. 99, pp. 1–1, 2020.
[24] V. Kumar, H. Glaude, C. D. Lichy, and W. Campbell, “A closer look at feature space data augmentation for few-shot intent classification,” 2019.
[25] M. Amjad, G. Sidorov, and A. Zhila, “Data augmentation using machine translation for fake news detection in the Urdu language,” in Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 2537–2542. [Online]. Available: https://aclanthology.org/2020.lrec-1.309
[26] Y. Nie, Y. Tian, X. Wan, Y. Song, and B. Dai, “Named entity recognition for social media texts with semantic augmentation,” 2020.
[27] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
[28] G. Zhou and J. Su, “Named entity recognition using an hmm-based chunk tagger,” in Proceedings of the 40th Annual 82 Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., 2002.
[29] Y. Qin, G. Shen, and Y. U. Hongxing, “Large-scale network security entity recognition method based on hadoop,” CAAI Transactions on Intelligent Systems, 2019.
[30] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Conference on Empirical Methods in Natural Language Processing, 2014.
[31] H. Shin, W. Shim, J. Moon, J. W. Seo, S. Lee, and Y. H. Hwang, “Cybersecurity event detection with new and re-emerging words,” ser. ASIA CCS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 665–678.
[32] C. Y. WEI Xiao, QIN Yongbin, “A network security named entity recognition method based on the component cnn,” Computer & Digital Engineering, vol. v.48;No.363, no. 01, pp. 111–116, 2020.
[33] P.-H. Li, R.-P. Dong, Y.-S. Wang, J.-C. Chou, and W.-Y. Ma, “Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 2664–2669.
[34] G. Aguilar, S. Maharjan, A. P. López-Monroy, and T. Solorio, “A multi-task approach for named entity recognition in social media data,” 2019.
[35] P. Jansson and S. Liu, “Distributed representation, LDA topic modelling and deep learning for emerging named entity recognition from social media,” in Proceedings of the 3rd Workshop on Noisy User-generated Text. Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 154–159.
[36] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” Computer Science, 2013.
[37] X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional lstm-cnns-crf,” 2016.
[38] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” vol. 15, 01 2010.
[39] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, “Stanza: A python natural language processing toolkit for many human languages,” 2020.
[40] J. Wang, X. Chen, Y. Zhang, Y. Zhang, and X. Wang, “Document-level biomedical relation extraction using graph convolutional network and multi-head attention (preprint),” 2019.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv, 2017.
[42] Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi, “A general framework for information extraction using dynamic span graphs,” in Proceedings of the 2019 Conference of the North, 2019.
[43] S. K. Lim, A. O. Muis, W. Lu, and C. H. Ong, “MalwareTextDB: A database for annotated malware articles,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 1557–1567.
[44] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
[45] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” 11 2016.

Multi-features based Semantic Augmentation Networks for Named Entity Recognition in Threat Intelligence

Abstract

Index Terms:

I Introduction

II Related Works

III The Proposed Method

III-A Basic model with Mixed Feature Input

III-A1 Mixed Feature Input

III-A2 Encoder in the model

III-B Cybersecurity Domain Semantic Augmentation

III-B1 Hard Semantic Augmentation

III-B2 Soft Semantic Augmentation

III-C General Domain Semantic Augmentation

III-D Decoder

IV Experiments

IV-A Datasets and Evaluation

IV-B Setting

IV-C Main Results and Analysis

IV-D Module Study

IV-E Case Study

V Conclusion

VI Acknowledgments

VII supplementary material

References

Multi-features based Semantic Augmentation Networks for Named Entity Recognition in
Threat Intelligence