
urlBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification

Yujie Li    Yanbin Wang    Haitao Xu    Zhenhao Guo    Zheng Cao    Lun Zhang

Y. Li, Y. Wang, H. Xu, Z. Guo, Z. Cao, and L. Zhang are affiliated with the School of Cyber Science and Technology, College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China, and with the Key Laboratory of Blockchain and Cyberspace Governance of Zhejiang Province, 330000, China. Y. Li is also affiliated with Beijing University of Posts and Telecommunications, China.
Y. Wang and H. Xu are co-corresponding authors.

Abstract

URLs play a crucial role in understanding and categorizing web content, particularly in tasks related to security control and online recommendations. While pre-trained models currently dominate various fields, the domain of URL analysis still lacks specialized pre-trained models. To address this gap, this paper introduces urlBERT, the first pre-trained representation learning model applied to a variety of URL classification and detection tasks. We first train a URL tokenizer on a corpus of billions of URLs to handle URL tokenization. Additionally, we propose two novel pre-training tasks: (1) a self-supervised contrastive learning task, which strengthens the model's understanding of URL structure and its ability to capture category differences by distinguishing different variants of the same URL; and (2) virtual adversarial training, aimed at improving the model's robustness in extracting semantic features from URLs. Finally, our proposed methods are evaluated on tasks including phishing URL detection, web page classification, and ad filtering, achieving state-of-the-art performance. Importantly, we also explore multi-task learning with urlBERT, and experimental results demonstrate that multi-task models based on urlBERT are as effective as independently fine-tuned models, underscoring the simplicity with which urlBERT handles complex task requirements. The code for our work is available at https://github.com/Davidup1/URLBERT.

1 Introduction

In today's highly connected online environment, URLs are a vital source of information for web understanding and categorization, particularly in network security, personalized online services, and network engineering. For instance, phishing link recognition, a significant research task in network security, requires models that can parse the semantic features of URL data. In recent years, a substantial body of work has proposed neural networks for URL analysis and achieved tangible progress.

Following the success of pre-training in the NLP domain, models like BERT[3] have demonstrated outstanding performance when fine-tuned across various downstream tasks. This success has inspired researchers in diverse domains to leverage domain-specific corpora to construct pre-trained models based on the Transformer architecture. These models, enriched with extensive domain knowledge, can be effectively deployed on various downstream tasks through fine-tuning. Encouragingly, in domains such as the life sciences, programming languages, and financial technology, models produced with this approach have achieved state-of-the-art performance on downstream tasks, validating the feasibility of pre-training domain-specific models. To facilitate work in the field of URL analysis, this work introduces urlBERT, a pre-trained model built on BERT, created with carefully designed pre-training tasks and trained on 3 billion unlabeled URLs.

In this work, URLs are treated as a distinct form of text with a specific structure determined by their protocols, where each part carries particular content. To enable the BERT model to learn the semantic information and structural features of URLs more comprehensively, we tailored two new pre-training tasks on top of BERT's masked language modeling task[3]: (1) a self-supervised contrastive learning task, which generates augmented samples from the training data for contrastive learning to enhance the model's understanding of URL structural characteristics; and (2) virtual adversarial training, which introduces perturbations to generate augmented samples for adversarial training, improving the model's robustness and its comprehension of URL semantics.

Following pre-training, we conducted comprehensive comparative experiments with urlBERT on multiple common downstream tasks across various scenarios. The results demonstrate that urlBERT, as a pre-trained model, achieves significant improvements in metrics such as Accuracy, Precision, and Recall compared to other neural networks commonly used in the NLP domain, while also showing lower dependency on data scale.

Our contributions can be summarized as follows:

  1. To the best of our knowledge, urlBERT is the first publicly available pre-trained Transformer model in the field of URL analysis.

  2. To enhance the understanding of URL semantic features, we customized contrastive learning tasks and virtual adversarial training tasks during the pre-training phase for urlBERT.

  3. We demonstrated the effectiveness of urlBERT in various common downstream tasks under different scenarios. Additionally, we explored the potential of urlBERT further by adopting optimized fine-tuning approaches.

  4. We released a dataset specifically designed for the advertisement link recognition task, contributing to the public availability of resources in this domain.

2 Related Works

2.1 Pretrained Models

In 2018, Google introduced BERT[3], which, after pre-training on a substantial corpus of document-level text with the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks, demonstrated outstanding performance when fine-tuned for downstream tasks such as text classification, named entity recognition, and question answering. Subsequently, teams across numerous research domains were inspired by this "pre-training + fine-tuning" approach, recognizing the advantages of the Transformer architecture in representing textual information. They pre-trained BERT models on domain-specific corpora and deployed them on various downstream tasks through fine-tuning, achieving significant advancements. BioBERT[12], trained on a large biomedical corpus, outperformed traditional NLP models in understanding biomedical literature. CodeBERT[5], based on RoBERTa[15], was pre-trained on a substantial dataset of natural language-programming language pairs and exhibits remarkable performance in natural language code search and code documentation generation. PhishBERT[21] pre-trained BERT on a large amount of unlabeled URL data and achieved excellent accuracy when fine-tuned for phishing link detection. Today, facing a multitude of downstream tasks such as URL webpage topic classification, phishing link recognition, and advertisement URL identification, pre-training a publicly available model based on the Transformer architecture remains valuable.

2.2 URL Classification

URL classification tasks, including malicious URL detection, webpage topic classification, and advertisement URL recognition, have always been crucial components of cybersecurity governance and network service optimization; among them, the identification of malicious URLs has drawn particular attention from cybersecurity researchers. Traditional identification methods relied on blacklists, where URLs were categorized against predefined lists. However, the timeliness and accuracy issues of this approach prompted researchers to build rich datasets [18, 1, 19] and train various neural network models. Because URLs are a unique form of textual data, classification tasks are typically handled by neural networks commonly used in the NLP domain. Focusing on malicious URL detection in particular, researchers have designed specialized neural network architectures tailored to URL data, and models such as Grambeddings [1], URLTran [16], URLNet [11], and URL2Vec [23] have demonstrated excellent performance on this task. Moreover, these networks, when adjusted and fine-tuned for other tasks, have also shown promising accuracy. Deploying these models on downstream tasks, however, requires adjusting the model structure and training for multiple epochs on large datasets to reach convergence. In URL classification, following the approach outlined in BERT [3], fine-tuning a pre-trained model that has already learned abundant URL semantic features, together with task-specific fully connected layers, for only a few epochs offers greater convenience and efficiency.

2.3 Self-supervised Contrastive Learning

Represented by MoCo[8] and SimCLR[2], contrastive learning aims to enhance performance by simultaneously maximizing the consistency between different transformations (positive samples) of the same instance and minimizing the consistency between different instances (negative samples), and has demonstrated significant performance improvements on ImageNet. Self-supervised contrastive learning uses the training data itself as supervisory information, generating positive and negative samples for contrastive learning. In NLP, prior work has explored applying self-supervised contrastive learning to BERT training. ConSERT[22], during the fine-tuning stage, augments the data representations produced by the embedding layer to obtain positive and negative samples; training on the contrastive loss with a shared BERT model improves performance over the original model. CERT[4], on the other hand, augments pre-training data through multilingual back-translation in the pre-training phase and incorporates contrastive learning, yielding performance improvements on some downstream tasks.

2.4 Virtual Adversarial Training

The concept of Virtual Adversarial Training (VAT)[17] is derived from adversarial training and enables semi-supervised learning. For unlabeled data, the model's predictions from the previous stage are used as virtual labels; adversarial perturbations are then computed with respect to these virtual labels to generate adversarial samples. During training, the goal is to minimize the discrepancy between the model's outputs on the original and adversarial samples, thereby enhancing robustness. ALUM[13] applies VAT to both the pre-training and fine-tuning stages of BERT; whether VAT is introduced in both stages or in only one, significant improvements in both accuracy and robustness on downstream tasks are observed.

3 Method

This section will introduce the construction of the urlBERT pre-training task and the data augmentation techniques applied to training samples.

3.1 Framework

The pre-training of urlBERT consists of three tasks: Masked Language Modeling (MLM), sentence-pair self-supervised contrastive learning, and token-level virtual adversarial training. The structure of the training process for this stage is depicted in Figure 1.

Figure 1: The pre-training task framework of urlBERT.

URL data typically consists of various components, including the network protocol, domain name or IP address, resource path, and query parameters, organized in a specified format. Each URL is uniquely identified under this structural specification, and even slight differences between two URLs can lead to significant differences in their functionality. Taking phishing links as an example, harmful and legitimate links may differ only subtly in their query strings. In response, we introduce a self-supervised contrastive learning (SSCL) task to enable the model to learn the subtle differences and structural features between URLs at the sentence level during the pre-training phase. We denote the contrastive learning loss as $Loss_{con}$; the training details and loss function for this task are elaborated in Section 3.2.

We retained the Masked Language Modeling (MLM) pre-training task from BERT, consistent with the settings in the original paper[3]. This task predicts the actual values of tokens masked with "[MASK]", helping the model learn the compositional structure and semantic information of URLs. In addition to the MLM task, we introduced a Virtual Adversarial Training (VAT) task. VAT generates adversarial samples corresponding to the data samples and performs adversarial training. We compute the adversarial training loss, $Loss_{VAT}$, at the MLM output of the shared BERT model, which enhances the robustness of the model's predictions against token-level fluctuations in URLs. The detailed implementation of this task is presented in Section 3.3.

The losses of the three tasks are combined as a weighted sum, with the weight of the VAT training loss following the configuration outlined in ALUM[13]. The final loss is as follows:

$$\mathrm{Loss} = \mathrm{Loss}_{con} + \mathrm{Loss}_{MLM} + 10\,\mathrm{Loss}_{VAT}$$
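For concreteness, this weighted combination can be expressed as a short PyTorch sketch. The three loss values are assumed to be scalar tensors already computed for the current batch; only the 10x weight on the VAT term is taken from the equation above.

import torch

def combined_pretraining_loss(loss_con: torch.Tensor,
                              loss_mlm: torch.Tensor,
                              loss_vat: torch.Tensor,
                              vat_weight: float = 10.0) -> torch.Tensor:
    # Weighted sum of the three pre-training losses; the 10x weight on the
    # VAT term follows the ALUM-style configuration referenced in the text.
    return loss_con + loss_mlm + vat_weight * loss_vat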

3.2 SSCL

Inspired by the work of ConSERT[22] and SimCSE[6], the self-supervised contrastive learning task in urlBERT applies data augmentation at the token embedding level to generate positive and negative samples. Differing slightly from ConSERT[22], urlBERT uses the dropout module within BERT's embedding layer to produce two augmented samples for a given input; the fast gradient sign method[7] is then applied separately to both samples to generate adversarial samples, thereby enhancing the effectiveness of the data augmentation. The sample generation scheme is illustrated in Figure 2.

Figure 2: The process of generating augmented samples in the self-supervised contrastive learning task. H represents the depth of the hidden layer, and L denotes the length of the data sample.
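This augmentation scheme can be sketched in PyTorch as follows. The sketch is a minimal illustration rather than the authors' code: embeddings is assumed to be the output of BERT's embedding layer for one batch, and loss_fn is any scalar loss defined on a perturbed embedding tensor (for example, the contrastive loss routed through the encoder), used only to obtain the gradient direction for the fast gradient sign step.

import torch
import torch.nn.functional as F

def augment_token_embeddings(embeddings: torch.Tensor,
                             loss_fn,
                             dropout_p: float = 0.1,
                             epsilon: float = 1e-3):
    # Two stochastic views of the same token embeddings via independent dropout masks.
    view_a = F.dropout(embeddings, p=dropout_p, training=True)
    view_b = F.dropout(embeddings, p=dropout_p, training=True)

    def fgsm(view: torch.Tensor) -> torch.Tensor:
        # One fast-gradient-sign step: perturb the view in the direction of the
        # sign of the gradient of loss_fn with respect to the view.
        view = view.detach().requires_grad_(True)
        loss = loss_fn(view)
        grad, = torch.autograd.grad(loss, view)
        return (view + epsilon * grad.sign()).detach()

    return fgsm(view_a), fgsm(view_b)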

A batch of $N$ token embeddings generates $2N$ augmented data samples. For each augmented sample within a batch, there is exactly one positive sample, while the remaining $2(N-1)$ samples serve as negatives. The objective of the contrastive learning task is to minimize the distance between positive samples and maximize the distance between negative samples. Therefore, similar to ConSERT[22], urlBERT employs the Normalized Temperature-scaled Cross-Entropy Loss (NT-Xent):

$$\mathcal{L}_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(r_i, r_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(r_i, r_k)/\tau\right)}$$

$r_i$ is the representation of an augmented data sample obtained from the final transformer block of the shared BERT encoder, and $r_j$ is the representation of the other augmented sample derived from the same input. The function $\mathrm{sim}(\cdot)$ denotes cosine similarity, and $\tau$ is the temperature coefficient. This loss is computed for each data point and averaged to obtain the task loss, denoted $Loss_{con}$. The pseudocode for the SSCL task is illustrated in Algorithm 1.

Algorithm 1 SSCL Task

Input: $\chi$: dataset, $N$: data batch, $\theta$: shared urlBERT model, $\eta$: learning rate
Output: $\theta$

1:  for $N \in \chi$ do
2:     $2N \leftarrow \text{data augmentation}(N, \theta)$
3:     for $x \in 2N$ do
4:        $y \leftarrow \text{model encoding}(x, \theta)$
5:        $Loss_i \leftarrow \text{calculate NT-Xent}(y, x)$
6:     end for
7:     $Loss_{con} \leftarrow \frac{1}{2N}\sum_{i=1}^{2N} Loss_i$
8:     $g_{con} \leftarrow \nabla Loss_{con}$
9:     $\theta \leftarrow \theta - \eta\, g_{con}$
10:  end for
11:  return  $\theta$
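As a complement to Algorithm 1, a minimal PyTorch sketch of the batch-level NT-Xent computation (steps 3 to 7) is given below. It assumes the 2N representations are arranged as [a_1, ..., a_N, b_1, ..., b_N], so the positive of index i is (i + N) mod 2N; this arrangement and the implementation details are illustrative and may differ from the authors' code.

import torch
import torch.nn.functional as F

def nt_xent_loss(reps: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # reps has shape (2N, hidden): two augmented views per original sample.
    two_n = reps.size(0)
    n = two_n // 2
    # Pairwise cosine similarities scaled by the temperature coefficient.
    sim = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1) / temperature
    # Exclude self-similarity, i.e. the 1_[k != i] term in the denominator.
    mask = torch.eye(two_n, dtype=torch.bool, device=reps.device)
    sim = sim.masked_fill(mask, float("-inf"))
    # The positive of sample i is the other augmented view of the same input.
    targets = (torch.arange(two_n, device=reps.device) + n) % two_n
    return F.cross_entropy(sim, targets)

# Example: a batch of N = 64 URLs encoded into 768-dimensional representations.
# loss_con = nt_xent_loss(torch.randn(128, 768), temperature=0.1)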

3.3 VAT

The VAT task in urlBERT is primarily inspired by ALUM[13], introducing regularized perturbations in the token embedding space to generate adversarial samples for the unlabeled URL data. Like ALUM[13], urlBERT initializes the perturbation with a mean of 0 and a variance of 1 at the start of training. The data samples are then encoded by the shared BERT encoder to obtain the model outputs $logits$ and $logits_{adv}$. After computing the virtual adversarial loss $Loss_{VAT}$ with the KL-divergence function, a regularization perturbation is added along the gradient direction to maximize the adversarial loss. This process can be repeated multiple times to strengthen the perturbation, but urlBERT applies it only once. The final loss is obtained by adding the weighted $Loss_{VAT}$ to the MLM task loss. Algorithm 2 outlines the training process for the VAT task.

Algorithm 2 VAT Task

Input: $\chi$: dataset, $x$: data batch, $\theta$: shared urlBERT model, $\eta$: learning rate, $\sigma^2$: variance of the noise, $\alpha$: weight of the VAT task loss, $\Pi$: projection operation, $\delta$: introduced perturbation, $\mu$: perturbation step size
Output: $\theta$

1:  for $x \in \chi$ do
2:     $\delta \sim N(0, \sigma^2)$
3:     $y_1 \leftarrow \text{model encoding}(\theta, x + \delta)$
4:     $y_2 \leftarrow \text{model encoding}(\theta, x)$
5:     $Loss_{adv} \leftarrow \text{calculate KL-div Loss}(y_1, y_2)$
6:     $g_{adv} \leftarrow \nabla Loss_{adv}$
7:     $\delta \leftarrow \Pi(\delta + \mu\, g_{adv})$
8:     $y_3 \leftarrow \text{model encoding}(\theta, x + \delta)$
9:     $Loss_{VAT} \leftarrow \text{calculate KL-div Loss}(y_3, y_2)$
10:     $g_{\theta} \leftarrow \nabla Loss_{MLM} + \nabla Loss_{con} + \alpha\,\nabla Loss_{VAT}$
11:     $\theta \leftarrow \theta - \eta\, g_{\theta}$
12:  end for
13:  return  $\theta$
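The single adversarial step of Algorithm 2 (steps 2 to 9) can be sketched in PyTorch as follows. The sketch assumes a HuggingFace-style masked-LM model that accepts inputs_embeds and returns .logits (as BertForMaskedLM does); the perturbation magnitudes and the gradient normalization are illustrative choices, not the authors' exact settings.

import torch
import torch.nn.functional as F

def vat_loss(model, input_embeds, attention_mask,
             sigma: float = 1e-5, step_size: float = 1e-3, eps: float = 1e-6):
    # Clean prediction used as the "virtual label" (no gradient needed here).
    with torch.no_grad():
        clean_logits = model(inputs_embeds=input_embeds,
                             attention_mask=attention_mask).logits

    # Initial Gaussian perturbation in the token embedding space.
    delta = torch.randn_like(input_embeds) * sigma
    delta.requires_grad_(True)

    adv_logits = model(inputs_embeds=input_embeds + delta,
                       attention_mask=attention_mask).logits
    adv_loss = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                        F.softmax(clean_logits, dim=-1),
                        reduction="batchmean")
    grad, = torch.autograd.grad(adv_loss, delta)

    # Single ascent step along the (normalized) gradient to maximize the loss.
    delta = (delta + step_size * grad / (grad.norm(dim=-1, keepdim=True) + eps)).detach()

    adv_logits = model(inputs_embeds=input_embeds + delta,
                       attention_mask=attention_mask).logits
    return F.kl_div(F.log_softmax(adv_logits, dim=-1),
                    F.softmax(clean_logits, dim=-1),
                    reduction="batchmean")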

3.4 Pre-training Setting

During the pre-training phase, urlBERT utilized the Web Tracking Datasets released by the Broadband Communications Systems and Architectures Research Group at UPC, which comprise approximately 3 billion web URLs. Our team extracted unlabeled URLs from this extensive dataset and used them for urlBERT's pre-training. The pre-training was based on the BertForMaskedLM model and conducted on four A100-40G GPUs. The AdamW optimizer was employed, with a learning rate of 2e-6 for the SSCL task and 1e-5 for the other two tasks. The batch size was set to 64, and the temperature coefficient $\tau$ in the SSCL objective was set to 0.1.
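As a hedged illustration of this configuration, the sketch below uses one AdamW optimizer per update step, mirroring the separate parameter updates in Algorithms 1 and 2; the checkpoint name is a placeholder rather than the authors' released initialization.

import torch
from transformers import BertForMaskedLM

# Shared encoder used by all three pre-training tasks (placeholder checkpoint).
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Separate AdamW optimizers reflecting the per-task learning rates above.
optimizer_sscl = torch.optim.AdamW(model.parameters(), lr=2e-6)  # SSCL update
optimizer_main = torch.optim.AdamW(model.parameters(), lr=1e-5)  # MLM + VAT update

BATCH_SIZE = 64    # batch size stated above
TEMPERATURE = 0.1  # NT-Xent temperature coefficient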

4 Experiments

In this section, we conducted multiple experiments to validate the effectiveness of urlBERT and explored its potential in various fine-tuning scenarios. The experiments were organized as follows:

  1. Exploration of the model's potential in fine-tuning across multiple downstream tasks.

  2. Comparison of the model's performance in various downstream tasks with commonly used NLP neural networks.

  3. Testing the model's dependency on data scale under different training dataset sizes.

  4. Evaluation of the model's feature extraction capabilities across different transformer layers.

  5. Assessment of the model's ability for multi-task learning.

These experiments comprehensively and fairly validate the effectiveness of the pre-trained urlBERT model in various URL-related research tasks and different scenarios.

Setup: All experiments were conducted on a single A100-40G GPU, using the PyTorch 2.0.0 framework and Python 3.8. The batch size for all experiments was set to 64. The optimizer and its learning rate varied across experiments, with specific configurations detailed in the corresponding subsections. We conducted experiments on three downstream tasks: phishing link recognition, webpage topic classification, and advertisement link recognition. For the phishing URL recognition task, we employed the publicly available GramBeddings dataset, consisting of 640,000 training samples and 160,000 test samples. The webpage topic classification task used the publicly available DMOZ URL Classification Dataset, which includes URLs from 15 different topics with imbalanced class sizes; we selected five categories, namely "Games", "Health", "Kids", "Reference", and "Shopping", for the classification experiments. For the advertisement link recognition task, we used an ad URL dataset collected by our team, comprising 23,530 ad URLs and 65,263 whitelist URLs; 80% of the data was randomly selected as the training set, with the remaining 20% used as the test set. The dataset is available at Google Drive Repositories.

4.1 Model Potential Exploration

We conducted fine-tuning experiments on three downstream tasks: phishing link detection, webpage topic classification, and advertisement link identification. The traditional fine-tuning approach takes the [CLS] representation from the last transformer block of BERT and feeds it into a fully connected layer built for the downstream task; during training, all parameters, including those of BERT, are fine-tuned. Another approach uses BERT as a frozen feature extractor and feeds the resulting representations into other commonly used deep neural networks for training. Generally, the former method yields superior performance. Inspired by the work of Ran Wang[20], we adopted the FT-TM fine-tuning approach to further enhance urlBERT's performance on downstream tasks.

Initially, we froze the parameters of the urlBERT model and used a CNN as the downstream task network, fine-tuning it for each task; we then unfroze the BERT parameters for full-parameter fine-tuning. Throughout these experiments, the AdamW optimizer was employed, with a learning rate of 2e-3 in the first stage and 2e-5 in the second stage; both stages involve 5 epochs of training.
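A sketch of this two-stage FT-TM procedure is shown below. The CNN head, its layer sizes, and the checkpoint name are illustrative assumptions; only the freeze-then-unfreeze schedule and the two learning rates follow the description above.

import torch
import torch.nn as nn
from transformers import BertModel

class UrlBertCnnClassifier(nn.Module):
    # Encoder plus a 1-D CNN head over the token representations (illustrative sizes).
    def __init__(self, encoder_name: str = "bert-base-uncased", num_labels: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.classifier = nn.Linear(128, num_labels)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        feats = torch.relu(self.conv(tokens.transpose(1, 2))).max(dim=-1).values
        return self.classifier(feats)

model = UrlBertCnnClassifier()

# Stage 1: freeze the encoder and train only the CNN head (lr = 2e-3).
for p in model.encoder.parameters():
    p.requires_grad = False
stage1_optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-3)
# ... train for 5 epochs ...

# Stage 2: unfreeze everything and fine-tune all parameters (lr = 2e-5).
for p in model.encoder.parameters():
    p.requires_grad = True
stage2_optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... train for another 5 epochs ...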

Table 1: Comparison of Test Accuracy for Full Parameter Fine-tuning and FT-TM Fine-tuning Approaches Across Various Downstream Tasks.
Methods Phishing Detection Advertising Detection Web Classification
urlBERT 0.9716 0.9990 0.7436
urlBERT(FT-TM) 0.9739 0.9995 0.7448

Table 1 compares the test-set Accuracy of models trained on the downstream tasks with urlBERT under the FT-TM and full-parameter fine-tuning methods. Fine-tuning with the FT-TM approach yields a consistent improvement across all three downstream tasks. In the advertisement link recognition task in particular, the test-set Accuracy increases from 0.9990 to 0.9995.

4.2 Comparison with Other Neural Network

To demonstrate the effectiveness of urlBERT in URL security-related research tasks, we conducted comparative experiments on three downstream tasks: phishing link detection, webpage topic classification, and advertisement link recognition. We compared urlBERT with neural network models commonly used in the NLP domain, namely BiLSTM[9] and TextCNN[10]. All models were trained and evaluated on the same training and test sets using the AdamW optimizer. The learning rate was set to 0.005 for BiLSTM, 2e-3 for TextCNN, and 2e-5 for urlBERT. Testing comparisons were performed after 5 training epochs.

Table 2: Comparison of Test Accuracy of Multiple Neural Networks Trained on Various Downstream Tasks.
Baselines Phishing Detection Advertising Detection Web Classification
BiLSTM 0.9545 0.9976 0.5531
TextCNN 0.9667 0.9990 0.7373
urlBERT 0.9716 0.9990 0.7436
urlBERT(FT-TM) 0.9739 0.9995 0.7448
Table 3: Test accuracy of features from different Transformer layers in urlBERT, and the impact of various feature pooling methods on downstream task performance.
Layers Test Accuracy
Layer-8 0.9713
Layer-9 0.9706
Layer-10 0.9717
Layer-11 0.9716
Last 4 Layers + Concat 0.9713
Last 4 Layers + MeanPooling 0.9707
Last 4 Layers + MaxPooling 0.9612
Last 4 Layers + MinPooling 0.9716
Last 4 Layers + WeightedPooling 0.9718
Last 4 Layers + AttentionPooling 0.9720
Figure 3: ROC curves for urlBERT trained on training sets of different sizes.
Table 4: Comparison of various metrics for three models under different training set sizes.
Data Scale Baselines Accuracy Precision Recall F1-Score
30K BiLSTM 0.9426 0.9427 0.9412 0.9432
  TextCNN 0.9283 0.9299 0.9266 0.9282
  urlBERT 0.9456 0.9570 0.9423 0.9454
320K BiLSTM 0.9541 0.9525 0.9511 0.9518
  TextCNN 0.9657 0.9658 0.9656 0.9649
  urlBERT 0.9661 0.9677 0.9645 0.9663
520K BiLSTM 0.9545 0.9542 0.9534 0.9547
  TextCNN 0.9667 0.9602 0.9624 0.9654
  urlBERT 0.9716 0.9757 0.9675 0.9716
Figure 4: Comparison of urlBERT with other models across various metrics at different training points in the phishing link detection task.

Table 2 presents a comparative analysis of urlBERT and other neural networks commonly used for NLP tasks across the downstream tasks. The urlBERT row reports results obtained with full-parameter fine-tuning, while urlBERT(FT-TM) reports results obtained with the FT-TM method. Compared to the other models, urlBERT consistently achieves higher predictive accuracy across the downstream tasks. Particularly noteworthy is webpage topic classification, where urlBERT achieves an accuracy of 0.7448, a 35% relative improvement over BiLSTM's 0.5531 and a slight improvement over TextCNN's 0.7373. In the phishing link recognition task, urlBERT achieves a test accuracy of 0.9739, a significant improvement over BiLSTM (0.9545) and TextCNN (0.9667).

Due to the significance of malicious link recognition in the current research landscape related to URLs, we have documented various metrics for different models at different training points in this task. The optimization processes for these metrics are illustrated in Figure 4. It is evident that during the 5 training rounds on this downstream task, urlBERT consistently outperforms BiLSTM in terms of Accuracy, Precision, and F1-score. Additionally, compared to TextCNN, urlBERT exhibits smoother and faster convergence in terms of Accuracy and F1-score, surpassing TextCNN after 5 rounds of training. While both models experience a decline in Precision after the second round of training, urlBERT demonstrates greater stability in this aspect.

4.3 Data Scale Dependency Test

We evaluated the predictive capabilities of urlBERT models trained on different data scales in the phishing link detection task. Models were trained on datasets consisting of 30,000, 320,000, and 520,000 instances, respectively, and were tested on the corresponding test sets. We compared the performance with TextCNN and BiLSTM to assess urlBERT’s dependency on dataset scale. All three models were trained for 5 epochs on datasets of varying sizes, maintaining consistent training settings with those used in downstream task comparison experiments.

Table 5: Comparison of the test Accuracy for models trained with urlBERT on datasets of the same scale using experimental settings across various tasks and those obtained through multi-task learning.
Methods Phishing Detection Advertising Detection Web Classification
urlBERT 0.9601 0.9990 0.7264
urlBERT-MTL 0.9612 0.9981 0.7304

Table 4 presents the metrics of models trained with urlBERT and the two other neural networks at various data scales. urlBERT exhibits the least dependency on data scale: across the three data volumes, it consistently outperforms TextCNN and BiLSTM in Accuracy, Precision, and F1-Score. Only at a data volume of 320,000 does urlBERT's Recall (0.9645) fall slightly below that of TextCNN (0.9656). TextCNN, on the other hand, shows significant dependence on data scale, with noticeable declines in all metrics at a data volume of 30,000. Figure 3 shows the ROC curves for models trained with urlBERT at different data volumes. Despite the decrease in training data, urlBERT's inference performance does not deteriorate significantly: at a fixed FPR of 0.01, the test TPR still reaches approximately 0.9, indicating that urlBERT does not heavily rely on large-scale training data.

4.4 Layer Feature Extraction Test

To investigate the feature extraction capabilities of different transformer blocks in urlBERT, we extracted the [CLS] representations from outputs at various layers of the transformer during fine-tuning. These representations were then input into a fully connected layer for training in the phishing link recognition task. Additionally, different pooling methods were employed in subsequent experiments to assess the impact of this factor on urlBERT’s performance in downstream tasks. Throughout the training process, each experiment group utilized the AdamW optimizer with a learning rate set to 2e-5 and a weight decay set to 1e-4. The models were trained for 5 rounds, and the metrics were compared on the test set thereafter.
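The layer-wise feature extraction and the pooling variants can be sketched as follows. The checkpoint name and example URL are placeholders, and the weighted pooling weights are fixed here for illustration, although in practice they would be learned during fine-tuning.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

batch = tokenizer(["http://example.com/login?id=1"], return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**batch)

# hidden_states: embeddings plus one tensor per transformer layer,
# each of shape (batch, seq_len, hidden).
hidden_states = outputs.hidden_states
cls_single_layer = hidden_states[10][:, 0]                              # [CLS] of one layer
last4_cls = torch.stack([h[:, 0] for h in hidden_states[-4:]], dim=1)   # (batch, 4, hidden)

concat_feat = last4_cls.flatten(start_dim=1)       # Concat
mean_feat = last4_cls.mean(dim=1)                  # MeanPooling
max_feat = last4_cls.max(dim=1).values             # MaxPooling
min_feat = last4_cls.min(dim=1).values             # MinPooling
weights = torch.softmax(torch.ones(4), dim=0)      # fixed here; learnable in practice
weighted_feat = (weights[None, :, None] * last4_cls).sum(dim=1)  # WeightedPooling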

Table 3 presents the test results obtained from training urlBERT on different layers’ feature outputs in the downstream task of phishing link recognition and also illustrates the impact of various feature pooling methods on the model’s performance. The tested transformer layers are the last four layers of the BertEncoder, and the experimental results confirm the effectiveness of these layers’ data representations. Employing different pooling methods on the output features of the last four layers yields varying effects on model performance. In comparison to the best single-layer test accuracy of 0.9717, WeightedPooling and AttentionPooling result in performance improvements, reaching 0.9718 and 0.9720, respectively. However, MaxPooling disrupts the model’s data representations, causing a decline in metrics to 0.9612.

4.5 Multi-Task Learning in Fine-tuning

Multi-task learning is another important direction for fine-tuning urlBERT for various downstream tasks. MT-DNN[14] uses BERT as a shared data encoder and assigns separate neural network layers for different downstream tasks during training. This training approach allows the model to learn knowledge from a large amount of training data from different downstream tasks, particularly beneficial when training data for a single task is limited. It enhances the model’s generalization capabilities and mitigates the negative effects of small training data volumes. We attempted multi-task learning for urlBERT, designing classifier networks for phishing link recognition, webpage topic classification, and advertisement link recognition with the shared urlBERT. During training, to create a suitable scenario for multi-task learning, we set different training data volumes for the three tasks: 40,000 samples for advertisement link recognition and 180,000 samples for the other two tasks. We employed the AdamW optimizer with a learning rate of 2e-5 and a weight decay of 1e-4.
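A minimal sketch of this shared-encoder, per-task-head setup is shown below; the head dimensions, task names, and checkpoint are illustrative assumptions rather than the exact architecture used in the experiments. During training, batches from the three tasks are interleaved, and each batch updates the shared encoder through its own head.

import torch.nn as nn
from transformers import BertModel

class MultiTaskUrlBert(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        # One shared encoder, one lightweight classifier per downstream task.
        self.encoder = BertModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            "phishing": nn.Linear(hidden, 2),      # phishing link recognition
            "advertising": nn.Linear(hidden, 2),   # advertisement link recognition
            "web_topic": nn.Linear(hidden, 5),     # five DMOZ topic categories
        })

    def forward(self, input_ids, attention_mask, task: str):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        return self.heads[task](pooled)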

To demonstrate the effectiveness of multi-task fine-tuning for urlBERT when the data scale is limited, we compared it against traditional full-parameter fine-tuning under the same data volume settings. The experimental results are presented in Table 5. Multi-task fine-tuning of urlBERT in situations of limited training data effectively enhances the model's predictive performance. Particularly noteworthy is the webpage topic classification task: compared to the test accuracy of 0.7264 obtained by fine-tuning on 180,000 data points, urlBERT achieves 0.7304 through multi-task fine-tuning. In the phishing link recognition task, multi-task learning also increases accuracy from 0.9601 to 0.9612. While performance on the advertisement link recognition task is slightly lower with multi-task fine-tuning than with full-parameter fine-tuning, the accuracy still maintains an impressive 0.9981.

5 Conclusion

Applying pre-trained Transformer-based models to various downstream tasks to improve performance has become a widely adopted practice in fields such as natural language processing, biomedicine, and code generation. Inspired by this approach, we used carefully designed pre-training tasks to pre-train urlBERT, a BERT-based model, on a large amount of unlabeled URL data, aiming to facilitate URL analysis research. Our training tasks enable the model to grasp both the structural and semantic features of URLs, allowing the foundational BERT model to adapt comprehensively to the characteristics of URL text. Well-designed experiments across diverse scenarios validate the effectiveness of urlBERT. Consequently, when facing various downstream tasks in the field of URL analysis, urlBERT can be efficiently fine-tuned as a feature encoder and deployed for specific tasks.

References

  • [1] Bozkir, A. S., Dalgic, F. C., and Aydos, M. Grambeddings: a new neural network for url based identification of phishing web pages through n-gram embeddings. Computers & Security 124 (2023), 102964.
  • [2] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning (2020), PMLR, pp. 1597–1607.
  • [3] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • [4] Fang, H., Wang, S., Zhou, M., Ding, J., and Xie, P. Cert: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766 (2020).
  • [5] Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  • [6] Gao, T., Yao, X., and Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021).
  • [7] Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • [8] He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020), pp. 9729–9738.
  • [9] Huang, Z., Xu, W., and Yu, K. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
  • [10] Kim, Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
  • [11] Le, H., Pham, Q., Sahoo, D., and Hoi, S. C. Urlnet: Learning a url representation with deep learning for malicious url detection. arXiv preprint arXiv:1802.03162 (2018).
  • [12] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
  • [13] Liu, X., Cheng, H., He, P., Chen, W., Wang, Y., Poon, H., and Gao, J. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994 (2020).
  • [14] Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504 (2019).
  • [15] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • [16] Maneriker, P., Stokes, J. W., Lazo, E. G., Carutasu, D., Tajaddodianfar, F., and Gururajan, A. Urltran: Improving phishing url detection using transformers. In MILCOM 2021-2021 IEEE Military Communications Conference (MILCOM) (2021), IEEE, pp. 197–204.
  • [17] Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41, 8 (2018), 1979–1993.
  • [18] Nokhbeh Zaeem, R., and Barber, K. S. A large publicly available corpus of website privacy policies based on dmoz. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy (2021), pp. 143–148.
  • [19] Singh, A. Malicious and benign webpages dataset. Data in brief 32 (2020), 106304.
  • [20] Wang, R., Su, H., Wang, C., Ji, K., and Ding, J. To tune or not to tune? how about the best of both worlds? arXiv preprint arXiv:1907.05338 (2019).
  • [21] Wang, Y., Zhu, W., Xu, H., Qin, Z., Ren, K., and Ma, W. A large-scale pretrained deep model for phishing url detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023), IEEE, pp. 1–5.
  • [22] Yan, Y., Li, R., Wang, S., Zhang, F., Wu, W., and Xu, W. Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741 (2021).
  • [23] Yuan, H., Yang, Z., Chen, X., Li, Y., and Liu, W. Url2vec: Url modeling with character embeddings for fast and accurate phishing website detection. In 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) (2018), IEEE, pp. 265–272.