
SNAKE: A Sustainable and Multi-functional Traffic Analysis System utilizing Specialized Large-Scale Models with a Mixture of Experts Architecture

Tian Qin
Southeast University
   Guang Cheng
Southeast University
   Yuyang Zhou
Southeast University
   Zihan Chen
Southeast University
   Xing Luan
Southeast University
Abstract

The rapid advancement of internet technology has led to a surge in data transmission, making network traffic classification crucial for security and management. However, existing classification systems show significant deficiencies in the efficiency of multi-attribute analysis and in their ability to expand model knowledge, making it difficult to adapt to the ever-changing network environment and increasingly complex identification requirements. To address this issue, we propose the SNAKE (Sustainable Network Analysis with Knowledge Exploration) system, which adopts a multi-gated mixture of experts architecture to construct a multi-functional traffic classification model. The system analyzes traffic attributes at different levels through multiple expert sub-models, providing predictions for these attributes via gating and a final Tower network. Additionally, through an intelligent gating configuration, the system enables extremely fast model integration and evolution across various knowledge expansion scenarios. Its excellent compatibility allows it to continuously evolve into a multi-functional large-scale model in the field of traffic analysis. Our experimental results demonstrate that the SNAKE system exhibits remarkable scalability when faced with incremental challenges in diverse traffic classification tasks. Currently, we have integrated multiple models into the system, enabling it to classify a wide range of attributes, such as encapsulation usage, application types, and numerous malicious behaviors. We believe that, with continuous expansion, SNAKE can pioneer a sustainable and multi-functional large-scale model in the field of network traffic analysis.

1 Introduction

With the advent of technologies such as 5G and the Internet of Things, the Internet has undergone rapid and significant expansion. Alongside this growth, new online services, applications, and even cyber threats are continually emerging. According to a recent report by the International Telecommunication Union (ITU), global Internet users reached 4.1 billion in 2019, marking an increase of over 53% since 2005 and 5.3% compared to 2018 [44]. Ensuring the security and quality of service for Internet users has become a critical concern for major Internet Service Providers (ISPs) and tech companies. Network traffic classification technology is crucial for categorizing Internet traffic based on various criteria, including application type, service, protocol, and potential malicious intent. This technology helps operators deliver customized services, detect cybercriminals, and optimize network resource utilization [1].

In the early stages, network traffic was easily identifiable through features like port numbers and identifiers, with the transmitted content being openly accessible and unencrypted. Techniques for classifying network traffic primarily relied on port-based methods [25] alongside DPI (Deep Packet Inspection), sometimes combined with machine learning algorithms that analyzed the plain-text content of the payload [49]. However, with the increasing emphasis on protecting user privacy, encryption protocols, particularly the TLS/SSL suites, were developed to secure data transmitted across networks. This resulted in the widespread use of the HTTPS protocol, an enhanced version of HTTP, which became the preferred choice for network traffic transmission [9]. Furthermore, the QUIC protocol, which combines UDP with HTTPS and is especially advantageous for streaming media, has been increasingly adopted. The surge in the variety of applications and services, coupled with the use of port mapping techniques to circumvent firewalls, has rendered traditional port-based classification methods largely obsolete. In short, traffic encryption and dynamic port assignment have made managing network traffic increasingly complex, rendering both port-based and payload-based methods ineffective.

Figure 1: Challenges in Existing Network Traffic Analysis Systems
This diagram illustrates two key deficiencies in current network traffic analysis systems that lead to low timeliness. On the right, the need for diverse classified intelligence requires repetitive processing of traffic samples through multiple models, resulting in inefficiency. On the left, newly emerging sample sets struggle to be quickly integrated or incrementally updated in trained models, further degrading model timeliness.

The underlying traffic data remains interactive and travels over network media, adhering to specific network protocols. While encryption obscures a significant amount of information, different network services still require formats for segmenting transmitted data, and both parties involved follow established interaction rules. Additionally, various applications develop their own data transmission modules, resulting in differences in how traffic data is segmented at the application layer (so-called application data units [10]). These characteristics enable traffic classification without the need for decryption. With the ongoing advancements in deep learning, numerous artificial neural network architectures have been developed, many of which (such as GRU, LSTM, and Transformer) are well-suited for processing sequential data. Various features of traffic data can often be represented as discrete sequences, including the lengths of traffic packets [32], the lengths of PDUs/ADUs, and the timing of packet arrivals [50], among others. Numerous studies utilizing these models have achieved impressive results in the realm of traffic classification.

However, the complexity of traffic classification tasks is heightened by the demands of various downstream network management functions. Examples include ISP-customized billing [6], APM for optimizing the QoS of specific applications [26], and website fingerprinting for Tor detection [54]. In systems such as IDS and IPS [18], which are designed to prevent and mitigate malicious attacks, traffic classification must prioritize the detection of harmful behaviors. Additionally, to facilitate content censorship, traffic classification may need to identify flows related to Tor [41], VPNs [20], and specific DNS queries [4]. Moreover, for particular network scenarios, it may be necessary to develop customized classification models tailored to IoT devices [27], smart mobile devices [2], and Wi-Fi networks [19]. Currently, effective classification models are generally designed for specific tasks and do not offer a unified approach for simultaneously classifying different attribute categories of network traffic.

Figure 1 illustrates two major challenges faced by current traffic analysis systems. Firstly, the network environment is rapidly evolving: new internet services, private protocols, application versions, corresponding new types of network devices, and new families of malware targeting different vulnerabilities appear constantly. Traffic analysis technology therefore operates over a continuously changing and expanding knowledge base, and the models we use must always contend with "concept drift." As shown on the left side of the figure, researchers continuously collect data to train models in various new scenarios. However, these models have different objectives, and their processing logic and feature extraction schemes vary significantly. It is challenging to deeply integrate them with existing models; at most, we can list them in a model pool, which leads to another dilemma. Secondly, network traffic classification technology is often deployed at large-scale network ingress points, where traffic is "copied" and transmitted to designated servers for analysis via Network Packet Brokers (NPBs) [42]. This imposes very high requirements on the identification rate of traffic analysis. However, as shown on the right side of the figure, the different services supported by the analysis system focus on very different traffic attributes. For example, an Intrusion Detection System (IDS) must swiftly detect attack events; any delay in this information could allow a network attack to succeed, rendering the prior traffic classification ineffective [23]. Currently, such mixed identification demands can only be met by repeatedly invoking various models in the model pool, resulting in the repeated processing of traffic samples. When faced with high-throughput data, this mode inevitably leads to omissions and even system crashes.

To address this issue, we propose the SNAKE (Sustainable Network Analysis with Knowledge Exploration) system, designed for multi-functional traffic classification with robust scalability. The core component of this system is a model architecture that employs a multi-gated mixture of experts (MMoE) approach. This architecture treats traffic classification models for various tasks as distinct expert sub-networks. Traffic data is uniformly processed by these sub-models and integrated through multiple gating functions, directing different prediction outcomes to specific fully connected layers. An expert model can directly reuse parameters trained for a particular task, such as application classification, after which only the fully connected layers need fine-tuning to quickly achieve model integration. Unlike regular multi-task learning frameworks, the MMoE architecture isolates model parameters for distinct scenarios, minimizing task interference. It also differs from conventional incremental learning approaches by enabling model-level expansion for identical tasks, thereby mitigating the issue of catastrophic forgetting.

The contributions of this paper can be summarized as follows:

  • We have innovatively proposed the SNAKE system, which is an evolutionary framework for traffic analysis that integrates multi-functionality and high scalability. It is currently compatible with five distinct datasets and has accomplished the classification of eight tasks, with performance on each task approaching or even reaching the level of individual models. We believe that it can continuously accommodate various network traffic analysis tasks and generate a universal and efficient large-scale model in this field, significantly enhancing the comprehensiveness, efficiency, and timeliness of the analysis system.

  • Specifically, we are the first to adopt a mixture of experts architecture in a traffic classification model. This architecture avoids directly sharing parameters across different tasks, allowing the model to effectively accommodate training tasks with significant differences. The separated parameter settings also provide the SNAKE system with strong scalability, enabling it to quickly deploy pre-trained models for new analysis tasks and achieve efficient knowledge exploration.

  • Additionally, we drew inspiration from the multi-gate mixture of experts model mechanisms used in other fields, as well as various gating configuration schemes. This allows the SNAKE system to intelligently integrate models under complex scenarios, such as when label relationships are unrelated, parallel, or contain nested relationships, thereby further enhancing its scalability and usability.

The rest of this paper is organized as follows. Section 2 introduces the background related to encrypted traffic classification and the mixture of experts model. Section 3 presents the various components and operational logic of the SNAKE system. Section 4 describes different task expansion scenarios encountered during actual operations, along with the related mathematical descriptions. In Section 5, we set up various scenarios to evaluate the system’s performance. Section 6 discusses its limitations. Finally, Section 7 summarizes the paper. To enable the research community to utilize our tool, we will open source our code upon the paper’s acceptance.

2 Related Works

2.1 Evolution of Encrypted Traffic Classification Techniques

Traditional traffic classification techniques, such as port-based  [25] and payload-based  [49] methods, rely on plain-text information or fixed rules in unencrypted traffic. However, dynamic port mapping and the widespread use of encrypted transmission protocols like TLS/SSL have rendered these methods largely ineffective. In response, researchers have turned to machine learning, using labeled traffic datasets and supervised learning models for automated classification. For instance, T. Bujlow and colleagues applied the C5.0 algorithm  [8], which combines multiple decision trees to enhance accuracy. Similarly, S. Huang and others utilized the k-Nearest Neighbors (kNN) algorithm  [22] to classify internet traffic based on statistical features, bypassing the need for a training phase. Other methods, such as Support Vector Machine (SVM) and Naive Bayes (NB)  [5], have also been explored. However, traditional machine learning techniques often struggle with over-fitting and lack the capability for fine-grained classification in an open world [21]. This has led to the exploration of deep learning approaches, which offer improved performance and adaptability in complex network scenarios.

With the rapid advancement of deep learning technology, neural network architectures such as residual neural networks have enabled deeper exploration of features related to data packets. For example, Wei et al. [46] transformed the first 784 bytes of traffic packets from session groupings into image pixels to construct a CNN-based traffic classifier. In 2022, Dai et al. [12] applied the Transformer model architecture, known for its success in natural language processing, to traffic classification with impressive results. More recently, Wang et al. have established proofs for various mathematical theoretical foundations of deep learning in the field of traffic classification [47]. Additionally, some deep learning methods for low-quality traffic data have also emerged [39].

Despite these advancements, the goals of encrypted traffic classification are diverse, including application classification, VPN  [20] and Tor  [41] detection, service type identification, and monitoring of malicious traffic such as malware  [7], attacks in IPv6  [11], DDoS attacks  [24], and even attacks in IoT environments  [14]. However, these models are not well-integrated, posing challenges for multi-objective traffic analysis.

2.2 Potential Approaches for Scalable Traffic Classification

Multi-task learning approach: In the early stages of deep learning, multi-task learning was introduced to tackle various classification tasks within a single model  [52]. This approach connects multiple fully connected layers, each with its own loss function, to a shared neural network, allowing simultaneous training for different tasks. Some research has applied this method to traffic classification; however, our practical experience suggests limited feasibility for two main reasons:
       1) Different tasks require distinct feature sets. For instance, identifying unknown protocols demands analyzing local byte entropy changes, while VPN traffic classification relies on detecting packet re-encryption. In contrast, application and service type classification focuses on payload length sequences, and detecting malicious behaviors like DDoS attacks requires attention to statistical traffic features. These diverse features are challenging to integrate into a single model.
       2) Traffic classification attributes can exhibit subordinate and parallel relationships, creating complex task domains  [38]. For example, traffic may simultaneously belong to a specific protocol, application, and be classified as an attack or associated with VPN/Tor routing. When multi-task learning directly connects these classification targets, task divergence or entanglement can impede effective model training.

Large Model Fine-tuning Approach: Large language models have advanced rapidly in recent years, and some of their latest innovations are being applied in the networking field [53]. However, no existing work has developed a traffic analysis model capable of achieving multiple classification objectives simultaneously. Current studies typically take one of two approaches: ① utilize established open-source large language models, such as Llama and Grok, and develop a model specifically tailored for traffic classification using fine-tuning techniques; ② drawing inspiration from the Masked AutoEncoder (MAE) concept found in BERT [31] and GPT [51], train a large model head specifically for traffic classification, followed by fine-tuning the terminal layers to perform designated classification tasks [13]. The first approach, which leverages the parameters of large language models, undoubtedly possesses strong classification capabilities. However, since language models are generally designed as broad conversational tools, most of their parameters are actually irrelevant to traffic classification. This vast number of parameters can result in high computational costs and delays in output during task execution, rendering it less suitable for high-throughput traffic classification scenarios. The second approach appears to be more advantageous, but it primarily aids in training downstream models. Ultimately, these models still need to be executed separately to produce classification results, which does not address the challenges highlighted in the first subsection.

2.3 Mixture of Experts Architecture

The Mixture of Experts (MoE) architecture  [16] is designed to facilitate the horizontal expansion of models by dynamically routing inputs to different sub-model modules based on actual demands. In neural network models, transforming specific layer outputs into MoE layers allows for the identification of the most suitable expert modules for processing current inputs. By distributing tasks among one or more expert models, the system can efficiently handle a wide array of tasks while maintaining a sparse parameter distribution. This design enables the selective activation of parameters during both training and inference, thereby reducing unnecessary computational power consumption. When applied to traffic analysis, this approach can significantly enhance the speed of result generation.
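
To make the routing idea concrete, the following minimal sketch (our own illustration, not tied to any specific MoE library or to the final SNAKE design) shows how a gating network can score a set of experts and activate only the top-k of them for each input:

import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Minimal sketch of a sparse MoE layer: a gating network scores the
    experts and only the top-k are evaluated and combined for each input."""
    def __init__(self, in_dim, out_dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)   # routing scores per expert
        self.k = k

    def forward(self, x):                              # x: (batch, in_dim)
        scores = torch.softmax(self.gate(x), dim=-1)   # (batch, num_experts)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros(x.size(0), self.experts[0].out_features, device=x.device)
        for slot in range(self.k):                     # evaluate only selected experts
            idx, w = topk_idx[:, slot], topk_w[:, slot].unsqueeze(-1)
            for e_id, expert in enumerate(self.experts):
                mask = idx == e_id
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out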

Additionally, the MoE architecture provides strong horizontal scalability, allowing for the seamless integration of new functional modules into an existing trained model. This capability is especially advantageous in the context of rapidly evolving network traffic scenarios. Advanced variants such as the Multi-gate Mixture of Experts (MMoE)  [36] further refine this approach by employing multiple gating networks to allocate inputs to the most relevant experts, enhancing model performance in multi-task learning environments. Other techniques, such as Hierarchical MoE  [29] and Switch Transformers  [17], provide additional flexibility and efficiency in routing and processing, demonstrating the versatility of MoE architectures in adapting to complex and dynamic classification tasks.

3 System Design of SNAKE

Figure 2: Overview of the SNAKE System Structure
The SNAKE system is designed to efficiently aggregate various network traffic classification models. It comprises expert sub-models from pre-trained models in different domains, a concatenation layer that combines their inputs, gates that control the flow of information, and tower layers for specific tasks. This system can continuously integrate new task models, enabling rapid classification of multiple network traffic attributes.

3.1 Threat Model

As discussed, the current landscape of traffic classification tasks is exceedingly diverse, with many tasks playing a crucial role in network security and management. While the deployment of classification systems aims to provide a comprehensive understanding of data flows, there is no cohesive framework to integrate techniques from the various domains. A single model often struggles to meet the complex demands of different scenarios. For instance, intrusion detection systems must address malicious data from various environments, including mobile devices, IoT, and even microphone-equipped devices [33], as well as diverse attack patterns such as DDoS [30] and malicious DoH.

Moreover, common user traffic can experience concept drift due to changes in internet protocols and emerging services, impacting the overall classification effectiveness of the model. The significant disparities among different tasks further impede multi-task learning architectures designed to accommodate multiple objectives. This challenge stems from the direct sharing of underlying model parameters  [16], which restricts their ability to meet the varied requirements of distinct tasks.

The objective of the SNAKE system is to facilitate rapid model fusion and multi-target identification in the complex domain of network traffic analysis. Our final implementation enables the architecture trained by the SNAKE system to classify various application types, service categories, VPN and Tor usage, as well as identify four major categories of Android malware (including 42 specific attack tools), three types of DoH attacks, and ten types of Trojans. Achieving these functionalities requires the SNAKE model to exhibit strong scalability and specific handling mechanisms for compatibility across different task patterns.

3.2 Overview

To construct a highly scalable traffic classification system compatible with multiple identification targets, SNAKE draws on the MMoE architecture to build its classification model. At the model hierarchy level, SNAKE consists of four important components. First, all traffic classification tasks share a unified data preprocessing system for feature extraction. To accommodate different tasks, we have referenced the work of Distiller  [3], extracting information from traffic headers and payloads to form a common input for our model.

The second component is the combination of expert networks, which are tailored for different traffic classification tasks. The SNAKE system can set up the same network structure to directly read pre-trained model parameters. Certainly, the model architecture also supports the option to train a specific classification task from the ground up; however, due to the complexity of the overall network structure, we do not recommend adopting this approach.

The third component comprises an intelligent multi-gating structure featuring configurable modes, which is employed to regulate the outputs of multiple expert networks tailored for specific tasks. We have developed three distinct gating schemes corresponding to three different task fusion scenarios, with detailed descriptions to be provided in subsequent sections.

Finally, the architecture includes several classification tower networks designed for the various models, which typically comprise fully connected networks. Following the model fusion process, only a limited number of fine-tuning iterations are conducted on this segment of the model parameters. The overall architecture of the SNAKE system is depicted in Figure 2.

3.3 Preprocessing

Our preprocessing treats individual flows, uniquely defined by their five-tuple, as the fundamental classification unit. The input data provided to the classifier comprises two types: (a) the first N_b bytes of the transport-layer payload (PAY) of the traffic classification object [48]; and (b) informative protocol header fields (HDR) from the first N_p packets [34]. In the first case, the input is represented in binary format, organized byte-wise, and normalized to the range (0, 1). The second type of input includes: (i) the number of bytes in the transport-layer payload, (ii) the TCP window size (set to zero for UDP packets), (iii) the inter-arrival time, and (iv) the packet direction (\in \{0, 1\}), all derived from the first N_p packets.

It is important to note that, in both cases, longer instances are truncated while shorter ones are padded with zeros to conform to the specified lengths in bytes (N_b) or packets (N_p). This input selection is driven by the need to mitigate biased inputs, a common pitfall identified in related works, which may arise from factors such as PCAP metadata, data-link layer information, and certain transport-layer header fields (e.g., source and destination ports). Such biases can lead to inflated performance metrics and hinder generalization capabilities.

This processing approach enables us to accommodate both flow-level and packet-level perspectives, thereby supporting a range of tasks with varying emphases. For instance, while tasks related to DDoS attacks or port scanning require a flow-centric view, application classification tasks necessitate a detailed analysis of the payload data.
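
For illustration, the following sketch shows how a single flow could be turned into the two modalities described above. The use of Scapy, the field handling, and the values of N_b and N_p are assumptions made for this example rather than a verbatim description of our implementation:

import numpy as np
from scapy.all import IP, TCP, UDP  # assumed packet-parsing backend

N_B, N_P = 784, 32  # illustrative payload-byte and packet limits

def _payload_and_window(pkt):
    """Transport-layer payload bytes and TCP window (zero for UDP)."""
    if pkt.haslayer(TCP):
        return bytes(pkt[TCP].payload), int(pkt[TCP].window)
    if pkt.haslayer(UDP):
        return bytes(pkt[UDP].payload), 0
    return b"", 0

def preprocess_flow(packets, client_ip):
    """Build the PAY (first N_B payload bytes, normalized) and HDR
    (per-packet fields from the first N_P packets) inputs for one flow,
    with truncation and zero padding to the fixed lengths."""
    # PAY modality: byte-wise values normalized to (0, 1)
    raw = b"".join(_payload_and_window(p)[0] for p in packets)
    pay = np.zeros(N_B, dtype=np.float32)
    head = np.frombuffer(raw[:N_B], dtype=np.uint8)
    pay[:len(head)] = head / 255.0

    # HDR modality: payload length, TCP window, inter-arrival time, direction
    hdr = np.zeros((N_P, 4), dtype=np.float32)
    prev_time = None
    for i, p in enumerate(packets[:N_P]):
        body, win = _payload_and_window(p)
        iat = 0.0 if prev_time is None else float(p.time) - prev_time
        direction = 1 if p.haslayer(IP) and p[IP].src == client_ip else 0
        hdr[i] = [len(body), win, iat, direction]
        prev_time = float(p.time)
    return pay, hdr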

3.4 Expert Sub-models

The expert classification networks serve as the primary entities for executing the various tasks. In this paper, we adopt a fine-tuning approach for the integration of expert sub-models: we first independently train usable models on their respective labeled datasets. Subsequently, the SNAKE system retrieves and freezes the parameters of the core components from these models, relying solely on the gating mechanisms and tower networks in the latter stages for fine-tuning. Admittedly, a plethora of model architectures tailored to different classification tasks exists, with extensive research conducted in the field. However, the diverse input formats of these models are evidently detrimental to the model fusion process. In light of the data preprocessing methods discussed in the previous section, this paper proposes a unified architecture for the expert sub-models.

In the original Distiller work, two modalities of input were analyzed using GRU and convolutional neural networks, which were effective for capturing temporal dependencies and texture features of payload data, respectively. However, with the advent of the transformer architecture, we have identified that this framework can effectively accommodate both modalities, thereby simplifying the model structure and avoiding the complexities arising from the use of heterogeneous architectures. The transformer architecture employs a multi-head attention mechanism composed of queries (Q), keys (K), and values (V), enabling it to extract sequential dependencies across varying spans, thus fulfilling the analytical requirements for both flow-level and packet-level modalities. In this paper, we uniformly adopt an expert model structure that flattens and concatenates the data from both modalities as input, establishing two attention heads for feature extraction. The specific model structure parameters and hyper-parameters utilized for training each traffic classification task are presented in Table 1.

Table 1: Architecture and Hyper-parameters of Expert Sub-models
Parameter | Value
Transformer Encoder | 912 dim, 2 heads
Linear Layers | [912, 256, N_target]
Optimizer, Learning Rate, Activation | Adam, 1e-3, ReLU
Dropout Rate, Batch Size, Epochs | 0.2, 32, 50
Loss Function | Cross-Entropy

During the model fusion process within the SNAKE system, only the parameters of the core transformer model components are read, excluding the final fully connected layers. We would like to emphasize that there may exist more optimal data processing or model configuration strategies for each specific task. However, this paper primarily focuses on evaluating the model fusion and expansion capabilities of SNAKE; hence, we have selected a relatively generic architecture for this purpose. The system itself supports customization of the structure, parameters, and hyper-parameters for each expert network.
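
Under the settings of Table 1, an expert sub-model can be sketched as follows. The split of the 912-dimensional input into 784 payload bytes plus 32 × 4 header fields, and the use of a single encoder layer, are our illustrative assumptions for this example:

import torch
import torch.nn as nn

class ExpertSubModel(nn.Module):
    """Sketch of an expert sub-model per Table 1: flattened PAY and HDR are
    concatenated into a 912-dim vector, processed by a 2-head Transformer
    encoder, and classified by linear layers [912, 256, N_target]."""
    def __init__(self, n_target, d_model=912, n_heads=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)  # depth assumed
        self.head = nn.Sequential(               # discarded when SNAKE fuses experts
            nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, n_target))

    def features(self, pay, hdr):
        """912-dim encoder output that SNAKE later reuses as the expert output."""
        x = torch.cat([pay.flatten(1), hdr.flatten(1)], dim=-1)  # (batch, 912)
        return self.encoder(x.unsqueeze(1)).squeeze(1)           # (batch, 912)

    def forward(self, pay, hdr):
        return self.head(self.features(pay, hdr))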

3.5 Gates and Towers

In this paper, we adopt a supervised learning paradigm for traffic classification. In the field of network traffic research, labeled sample data suitable for various tasks, such as malicious attacks and traffic application types, can be collected using constructed scripts and automated tools. Thus, a dataset can be described as D_k(x_i, y_{i,k}), where the subscript k denotes the classification task for the k-th attribute, x_i represents the i-th traffic sample, and y_{i,k} is the corresponding label. We build the SNAKE system to provide multiple attribute categories of traffic at once, and it employs an MMoE architecture, establishing an independent gating layer for each task. Therefore, for a known set of k attribute classification tasks, we need to establish k Gates and corresponding Tower classification networks. Assuming that n expert sub-models f_j(x) are in use, we define a concatenation operation as follows:

X_i = \mathrm{concat}(f_1(x_i), f_2(x_i), \dots, f_n(x_i)) = \begin{bmatrix} f_1(x_i) \\ f_2(x_i) \\ \vdots \\ f_n(x_i) \end{bmatrix}

After the k-th gate g_k receives the overall output X_i, the gated result is fed into the corresponding Tower h_k. This operation can be represented as follows:

y_k = h_k(g_k(X_i))

In the conventional MMoE architecture, each gate is a learnable linear transformation of the input followed by a softmax layer. In the field of traffic classification, however, whether the outputs of different expert models should be mixed depends on the correlation between the corresponding task attributes. Therefore, the gate we use is simply a vector of the same dimension as in the traditional MMoE method, with a length equal to the number of expert models n, written as g_i(\delta_{1i}, \delta_{2i}, \dots, \delta_{ni}). For the specific value settings, we provide the following three modes:

Default Mode: In this mode, if there is only one expert model connected to the gate, the task and the expert are in a one-to-one correspondence. Consequently, the gating mechanism is configured to output only the result from the corresponding expert, as expressed below:

g_i(\delta_{1i}, \delta_{2i}, \dots, \delta_{ni}), \quad \text{where } \delta_{ji} = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{if } j \neq i \end{cases}

Top-K Mode: In this mode, multiple expert models are connected to the gate, and these experts share the same classification granularity. For instance, Expert 1 may classify app_1, app_2, ..., app_n, while Expert 2 classifies app_{n+1}, ..., app_m. Here, the task and experts are in a one-to-many relationship, and the gating mechanism is designed to consider the results from those experts equally (the experts involved can be represented by a set S \subseteq \{1, 2, \ldots, n\}). The gating vector in top-k mode is set as follows:

g_i(\delta_{1i}, \delta_{2i}, \dots, \delta_{ni}), \quad \text{where } \delta_{ji} = \begin{cases} 1/n & \text{if } j \in S \\ 0 & \text{if } j \notin S \end{cases}
Algorithm 1 Multi-Attribute Traffic Classification Process Using the SNAKE System
1:  Notation: X: traffic sample; Y_j: attribute value for task j; N: number of experts; K: number of classification tasks; HDR: packet header feature vector; PAY: payload feature vector; V: preprocessed data vector; V_i: output from expert model E_i; \hat{V}: concatenated outputs; g_j: gating value for task j; G_j: gated output for task j; T_j: tower network for task j; C_j: confidence vector for task j
2:  Input: Traffic sample X
3:  Output: Attribute values Y_1, Y_2, ..., Y_K
4:  {Step 1: Preprocess the traffic data}
5:  Extract HDR and PAY from X
6:  Normalize and encode the extracted data
7:  {Step 2: Mixture of Experts Model}
8:  Concatenate preprocessed data into vector V
9:  for each expert E_i, i = 1, ..., N do
10:     Pass V through E_i
11:     Store output V_i
12:  end for
13:  {Step 3: Gating Mechanism}
14:  Concatenate outputs: \hat{V} = concat(V_1, V_2, ..., V_N)
15:  for each task j = 1, 2, ..., K do
16:     Compute gating value g_j for task j
17:     Apply g_j to \hat{V} to produce G_j
18:  end for
19:  {Step 4: Tower Networks}
20:  for each task j = 1, 2, ..., K do
21:     Pass G_j to tower T_j
22:     Generate confidence vector C_j for task j
23:  end for
24:  {Step 5: Extract Attribute Values}
25:  for each C_j do
26:     Y_j = argmax(C_j)  (attribute value for task j)
27:  end for
28:  return Attribute values Y_1, Y_2, ..., Y_K

Trainable Mode: In this mode, multiple expert models are connected to the gating layer, and these experts operate at different classification granularities. For example, Expert Model 1 may judge whether this traffic is malicious, while Expert Model 2 focuses on specific tools. In this case, there exists a many-to-many relationship between the tasks and the experts. If the gating layer were to equally consider the outputs of both models, it could interfere with their respective tasks. Therefore, these gating layers should be set to an updatable state, allowing for timely updates during fine-tuning based on the gradients returned by the loss function. Specifically, the gating vector can be updated using the following formula:

g_i(\delta_{1i}, \delta_{2i}, \dots, \delta_{ni}), \quad \text{where } \delta_{ji} = \begin{cases} \omega_j & \text{if } j \in S \\ 0 & \text{if } j \notin S \end{cases}

where \omega_j is a trainable weight calculated through a linear layer and a softmax function, ensuring that the gating layer can flexibly adapt to the outputs of different expert models and thereby optimize overall classification performance.
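
The three gating modes can be summarized by the sketch below, where a gate is either a fixed weight vector over the experts (default and top-k modes) or a small learnable linear-plus-softmax module restricted to the experts in S (trainable mode); the helper names are our own:

import torch
import torch.nn as nn

def fixed_gate(num_experts, active_experts, mode="default"):
    """Fixed gating vector (delta_1i, ..., delta_ni) for the default and top-k
    modes; `active_experts` is the set S of expert indices serving this task."""
    g = torch.zeros(num_experts)
    if mode == "default":                 # one-to-one: pass a single expert through
        g[next(iter(active_experts))] = 1.0
    elif mode == "top_k":                 # one-to-many: weight the experts in S equally
        g[list(active_experts)] = 1.0 / num_experts
    return g

class TrainableGate(nn.Module):
    """Trainable gate (Mode III): weights over the experts in S come from a
    linear layer followed by a softmax; experts outside S stay at zero."""
    def __init__(self, concat_dim, num_experts, active_experts):
        super().__init__()
        self.proj = nn.Linear(concat_dim, num_experts)
        mask = torch.zeros(num_experts)
        mask[list(active_experts)] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, expert_concat):          # (batch, concat_dim)
        scores = self.proj(expert_concat)
        scores = scores.masked_fill(self.mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)   # zero weight outside S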

Each gating layer connects to the final tower network for the task, which is responsible for producing the ultimate output. For this purpose, we uniformly configure a fully connected layer network. Additionally, to address potential over-fitting during fine-tuning, we have incorporated dropout layers. Specifically, tower networks are configured as a two-layer fully connected network, utilizing a dropout rate of 0.2, a learning rate of 0.001, a batch size of 128, and the ReLU activation function.

It is important to note that the architecture, parameters, and hyper-parameters of each tower can also be customized. Through experimentation, we have found that adding more fully connected layers does not significantly impact model performance; thus, we have opted for a simplified configuration in this context.
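
For reference, the default tower used in our experiments can be sketched as the small network below; the hidden width is an illustrative choice:

import torch.nn as nn

class Tower(nn.Module):
    """Two-layer fully connected tower with ReLU and a 0.2 dropout,
    matching the default settings described above (hidden width assumed)."""
    def __init__(self, in_dim, num_classes, hidden_dim=256, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)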

The pseudo-code for the system’s operation is presented in Algorithm 1.

4 Mathematical Description and Convergence Proof of Incremental Knowledge Scenarios

In the previous section, we introduced the three modes set by Gates, which correspond to three different task integration scenarios in the field of network traffic classification. These three scenarios are somewhat similar to the task expansion learning scenarios mentioned in the literature  [45]. However, our assumption here is that we have a clear understanding of the label attributes of the traffic samples and their relationships with other attributes.

For a given traffic sample, we define its corresponding task domain and label domain. Several examples are shown in Figure 3.

Figure 3: Examples of Task and Label Domains
This figure illustrates the examples of corresponding task domains and label domains, along with the relationships between them.

We define \mathcal{X} as the sample input space, \mathcal{Y} as the corresponding label space for classification, and \mathcal{T} as the function that implements the mapping between them. As shown in the figure, we encounter the following three situations when handling task fusion.
Task Attribute-Independent Scenarios: As shown in Mode I in the figure, the task of distinguishing whether traffic is malicious and the task of classifying traffic by the application set in label domain 3 are mutually independent. In this case, our task fusion objective can be represented as follows:

\begin{aligned} \text{Given} \quad & \mathcal{X} \times \mathcal{T}_1 \rightarrow \mathcal{Y}_1, \quad \mathcal{X} \times \mathcal{T}_2 \rightarrow \mathcal{Y}_2 \\ \text{Search} \quad & \mathcal{F} \quad \text{s.t.} \quad \mathcal{F}(\mathcal{T}_1, \mathcal{T}_2): \mathcal{X} \rightarrow (\mathcal{Y}_1, \mathcal{Y}_2) \end{aligned}

Category Expansion Scenarios: As shown in Mode II, the classification tasks from app_1 to app_n and from app_{n+1} to app_m belong to category expansion within the same task, which can be expressed as follows:

\begin{aligned} \text{Given} \quad & \mathcal{X}_{L_1} \times \mathcal{T}_1 \rightarrow \mathcal{Y}_{L_1}, \quad \mathcal{X}_{L_2} \times \mathcal{T}_2 \rightarrow \mathcal{Y}_{L_2} \\ \text{Search} \quad & \mathcal{F} \quad \text{s.t.} \quad \mathcal{F}(\mathcal{T}_1, \mathcal{T}_2): \mathcal{X}_{L_1 \cup L_2} \rightarrow \mathcal{Y}_{L_1 \cup L_2} \end{aligned}

Category Refinement Scenarios: As shown in Mode III, the task of determining whether classified traffic is malicious and the task of identifying the specific malicious family of the flow form a relationship in which the former task is refined by the latter. This can be expressed as follows:

\begin{aligned} \text{Given} \quad & \mathcal{X} \times \mathcal{T}_1 \rightarrow \mathcal{Y}, \quad \mathcal{X} \times \mathcal{T}_2 \rightarrow \mathcal{Y}_1 \quad (\mathcal{Y}_1 \in \mathcal{Y}) \\ \text{Search} \quad & \mathcal{F} \quad \text{s.t.} \quad \mathcal{F}(\mathcal{T}_1, \mathcal{T}_2): \mathcal{X} \rightarrow \mathcal{Y} \rightarrow \mathcal{Y}_1 \end{aligned}
Algorithm 2 SNAKE System Training Process for Model Task Expansion
1:  Definitions: T_1: model for Task 1; T_2: model for Task 2; X_1, Y_1: dataset and labels for Task 1; X_2, Y_2: dataset and labels for Task 2; g_1, g_2: gates for each task; \mathcal{T}: tower network; M: new MMoE model
2:  Input: T_1, T_2; X_1, X_2; Y_1, Y_2
3:  Output: M
4:  Initialize components of model M:
5:  SET M.experts = {T_1, T_2}
6:  Analyze Y_1 and Y_2 to determine the mode
7:  Switch mode:
8:   case Mode I:
9:  Configure g_1, g_2 as gates (default mode); M.gate ← (g_1, g_2); Configure (\mathcal{T}_1, \mathcal{T}_2); M.tower ← \mathcal{T}
10:   case Mode II:
11:  Configure g_1 as gate (top-k mode); M.gate ← g_1; Configure \mathcal{T}_1; M.tower ← \mathcal{T}
12:   case Mode III:
13:  Configure g_1, g_2 as gates (trainable mode); M.gate ← (g_1, g_2); Configure (\mathcal{T}_1, \mathcal{T}_2); M.tower ← \mathcal{T}
14:  Lock M.experts
15:  Fine-tune M.tower with datasets (X_1, Y_1) and (X_2, Y_2)
16:  Update M.tower after several epochs

After defining and describing the above scenarios, the workflow of the SNAKE system when executing a specific scenario is given in Algorithm 2. Of course, in practical scenarios the categories in Mode II may overlap; in this case, the output dimension of the base Tower network should equal the total number of distinct sample types.
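
To make Algorithm 2 concrete, the sketch below shows the fusion step in PyTorch, reusing the illustrative ExpertSubModel, gate, and Tower components from the earlier sketches: expert parameters are frozen, and only the towers (and the gates, when trainable) are updated during fine-tuning. The data-loading format is an assumption made for this example:

import torch
import torch.nn as nn

def fuse_and_finetune(experts, gates, towers, loader, epochs=5, lr=1e-3):
    """Sketch of SNAKE's fusion step: freeze experts, fine-tune gates/towers.
    `loader` is assumed to yield (pay, hdr, labels) with labels = {task_id: tensor}."""
    for expert in experts:                           # lock all expert parameters
        for p in expert.parameters():
            p.requires_grad = False
        expert.eval()
    params = [p for t in towers for p in t.parameters()]
    params += [p for g in gates if isinstance(g, nn.Module) for p in g.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for pay, hdr, labels in loader:
            with torch.no_grad():                    # expert features, no gradients
                feats = [e.features(pay, hdr) for e in experts]
            concat = torch.cat(feats, dim=-1)
            loss = 0.0
            for k, (gate, tower) in enumerate(zip(gates, towers)):
                if k not in labels:                  # sample has no label for task k
                    continue
                if isinstance(gate, nn.Module):      # trainable gate: per-sample weights
                    w = gate(concat)                                  # (batch, n_experts)
                    mixed = (w.unsqueeze(-1) * torch.stack(feats, 1)).sum(dim=1)
                else:                                # fixed gate vector (default / top-k)
                    mixed = sum(wj * f for wj, f in zip(gate.tolist(), feats))
                loss = loss + ce(tower(mixed), labels[k])
            if torch.is_tensor(loss):                # at least one task had labels
                opt.zero_grad()
                loss.backward()
                opt.step()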

Convergence analysis: The SNAKE system employs a mixed expert model architecture with separate parameter updates. One reason for this is to avoid the difficulty of model convergence caused by overlapping loss functions from multiple objectives updating shared model parameters. In the architecture we adopted, each cross-entropy function only updates the parameters of its connected Tower network, ensuring that the training updates for different tasks are independent of each other.
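
Concretely, let \theta_k denote the parameters of the k-th Tower and anticipate the weighted total loss L_total defined below. Assuming, as in our fusion setting, that the expert parameters are frozen and the Towers share no parameters, every other task loss L_j (j ≠ k) is computed through a different Tower, so the gradient seen by the k-th Tower reduces to a single term:

\nabla_{\theta_k} L_{\text{total}} = \sum_{j=1}^{K} \alpha_j \nabla_{\theta_k} L_j = \alpha_k \nabla_{\theta_k} L_k, \qquad \text{since } \nabla_{\theta_k} L_j = 0 \text{ for } j \neq k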

Assume we have N expert networks, where each expert n handles a different classification task or objective. The output of each expert is \hat{y}_n, and these outputs are concatenated into a single input that is passed to the gating layers, represented as g = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N]. For each classification target k, we define a cross-entropy loss function L_k that depends on the output from the gating network and the true label y_k:

L_k(y_k, \hat{y}) = -\sum_i y_{k,i} \log(\hat{y}_i)

In this case, the overall loss function can be expressed as a weighted sum of the losses from all classification targets:

L_{\text{total}} = \sum_{k=1}^{K} \alpha_k L_k(y_k, \hat{y}_{\text{final}})

Here, \alpha_k is the weight for each classification target, which can be adjusted by the gates. Because the Gates are configured according to the schemes described in Section 3, \alpha_k is guaranteed to be non-negative. The overall loss function can therefore be regarded as a linear combination of several cross-entropy losses, which preserves the Lipschitz continuity and strong convexity of the cross-entropy function when processing softmax outputs. This allows us to state the following lemmas, where we denote the model parameters by \omega.
Lemma 1 (Lipschitz Continuity): L_{total}(\omega) (denoted L_o) is continuously differentiable and its gradient \nabla L_o is Lipschitz continuous with constant c > 0; that is, \forall \omega_1, \omega_2 \in \mathbb{R}^d:

\| \nabla L_o(\omega_1) - \nabla L_o(\omega_2) \|_2 \leq c \| \omega_1 - \omega_2 \|_2

Lemma 2 (Strong Convexity): L_o(\omega) is strongly convex; that is, \forall \omega_1, \omega_2 \in \mathbb{R}^d, \exists \xi > 0 s.t.:

L_o(\omega_2) - L_o(\omega_1) \leq \nabla L_o(\omega)^{\top}(\omega_1 - \omega_2) + \frac{1}{2}\xi \| \omega_2 - \omega_1 \|_2^2

This lemma establishes that L_o(\omega) has a unique global minimum, which we denote L_o(\omega^*) with \omega^* \in \mathbb{R}^d. Based on these two lemmas, we can infer the following conclusion: with a reasonably set learning rate, the overall loss function will eventually approach its minimum value. After T rounds of training with learning rate \alpha \leq \frac{1}{c}, the convergence upper bound of L_o(\omega) can be formulated as follows:

L_o(\omega^T) - L_o(\omega^*) \leq \frac{1}{T z \alpha (1 - \frac{c\alpha}{2})}

This indicates that as the number of training iterations increases, L_o(\omega^T) will gradually converge to the global minimum L_o(\omega^*). For the sake of the article's coherence, the complete proof can be found in Appendix Section A.

5 System Evaluation

We established eight distinct traffic classification tasks, encompassing five public datasets with a total data volume of 172.29 GB, to validate the effectiveness of the SNAKE system. The experiments include a comparative analysis of the structural configurations of the expert models, an assessment of the model fusion effectiveness across three task expansion scenarios, a comparison of model hyper-parameter settings, and, finally, an evaluation of integrating numerous tasks scenario.

5.1 Experimental Setup

In this paper, aside from the corresponding comparative experiments, the default expert network structure and parameters, tower model parameters, dropout settings, and other hyper-parameter configurations are consistent with those outlined in Section 3. Brief information about the datasets used in our experiments can be found in Table 2. Because multiple datasets are used, a detailed introduction is provided in Appendix Section B for the sake of narrative clarity.

Table 2: Overview of the Datasets Used
Dataset Name | Classes | Dataset Size | Task Objective
ISCXVPN2016 [15] | 2/6/15 | 8.58G | VPN/Service/Application
ISCXTor2016 [28] | 2 | 21.6G | Tor usage
CIC-DoHBrw-2020 [37] | 2 | 90.2G | Malicious DNS
USTC-TFC2016 [46] | 11 | 3.71G | Malware Detection
IPTAS-Tbps [10] | 7 | 10.7G | Application Classification

During model training, we performed a uniform split of each dataset, allocating 75% to the training set, 10% to the validation set, and 15% to the test set. Whether for the training of the expert models or for the fine-tuning and integration operations conducted by the SNAKE system, training samples are consistently drawn from the training set, so the experimental results cannot be inflated by any overlap between the splits. We conducted experiments with ten random data splits, so all the following experiments present the results of these ten repeated trials.

Table 3: Performance of the proposed expert sub-model structures across different tasks
Algorithm | Encapsulation: Accuracy / Precision / F1 (%) | Traffic Type: Accuracy / Precision / F1 (%) | Application: Accuracy / Precision / F1 (%)
SNAKE Expert | 98.66 (±2.22) / 98.67 (±2.18) / 98.67 (±2.22) | 80.25 (±1.19) / 80.01 (±3.90) / 79.65 (±1.05) | 77.28 (±0.87) / 77.75 (±2.78) / 75.84 (±1.38)
CNN-RNN-2a [35] | 65.63~90.00 / 74.11~90.07 / 58.22~90.02 | 69.46~75.62 / 72.69~78.94 / 67.66~74.31 | 52.05~67.01 / 55.74~69.86 / 47.44~63.93
MTLCNN [40] | 82.60 (±1.71) / 83.27 (±1.74) / 82.65 (±2.05) | 73.78 (±2.68) / 77.10 (±3.66) / 72.17 (±2.84) | 63.46 (±7.89) / 65.56 (±9.59) / 59.55 (±10.81)
MT-DNN-FL [55] | 82.63 (±8.91) / 83.03 (±7.26) / 81.82 (±9.86) | 74.05 (±2.68) / 77.26 (±3.66) / 72.35 (±2.84) | 64.56 (±3.68) / 66.95 (±4.17) / 59.94 (±10.43)
multi-output DNN [43] | 97.98 (±0.41) / 97.99 (±0.37) / 97.99 (±0.41) | 77.78 (±2.23) / 78.55 (±2.83) / 76.47 (±2.60) | 75.35 (±1.88) / 76.45 (±3.04) / 74.38 (±1.64)
* All results are presented as mean ± range, except for CNN-RNN-2a, which is given as a min~max range.
(a) Task Attribute-Independent Scenarios
(b) Category Expansion Scenarios
(c) Category Refinement Scenarios
Figure 4: Model Extension Performance in Three Incremental Knowledge Scenarios

In our experiments, we utilized deep learning models implemented with PyTorch 2.0.0 and CUDA 11.7. Data pre-processing and post-processing were primarily conducted using the NumPy and Scapy libraries. For graphical data representation, we employed Matplotlib and MATLAB. All experiments were performed on a PC with the following hardware specifications: a 13th Gen Intel® Core™ i7-13700KF processor running at 3.40 GHz, 32 GB of RAM, and an NVIDIA GeForce RTX 4080 GPU. The operating system used was Windows 11.

5.2 Experiments on Classification Effects of Different Expert Models

In this subsection, we aimed to compare the recognition performance of expert networks using different model architectures. Here, we compare five traffic classification models that utilize data processing schemes similar to ours. We evaluated the detection results of these models on the ISCX-VPN-2016 dataset for three tasks: encapsulation, service type, and application, as shown in Table 3.

In this experiment, all models were set with a batch size of 16 and trained for 50 epochs. The dataset was randomly split with 75% for the training set, 10% for the validation set, and 15% for the test set, and this process was repeated ten times. The table shows the accuracy, precision, and F1-score values of the five model architectures. The CNN-RNN-2a model exhibited significant oscillations, possibly because the small batch size and number of epochs were not well-suited to training RNN models; therefore, its range of variation is reported, while the results of the other models are shown as means with ranges. It is evident that the expert model architecture adopted in this paper performed the best on these three tasks, outperforming the next-best results by approximately 0.5% to 2.5% on average. According to the experimental results, the transformer method we employed demonstrates significantly superior overall performance across the three tasks. Therefore, we used this model architecture for all expert networks in the classification tasks.

5.3 Model Extension Performance in Three Incremental Knowledge Scenarios

To explore the optimal gating configuration pattern and to lay the groundwork for the SNAKE system to integrate complex traffic classification task sets, we set up experiments in three traffic classification task fusion scenarios to test the effectiveness of different gating configurations. The schematic diagram of the scenarios is shown in Figure 4.

5.3.1 Task Attribute-Independent Scenarios

In this scenario, we utilized the three classification tasks from ISCX-VPN-2016 to construct a task attribute-independent model fusion scenario. As shown in Figure 4(a), the SNAKE system established three gates and corresponding Tower networks for the classification of three traffic attributes in the ISCX-VPN-2016 dataset. All gates were set to default mode, meaning each Tower network only considered the output of its corresponding domain expert sub-model.

Figure 5: Model extension performance in Scenarios 5.3.1

The accuracy and loss function values of the model across the three tasks as a function of training epochs are shown in Figure 5. The results indicate that even with a very small learning rate (1×10^{-4}), the model can still converge within five epochs, which is quite rapid. After conducting ten repeated experiments, we found that the average accuracy of the SNAKE system's model in this scenario for the three tasks reached 98.84%, 80.87%, and 77.78%, which is comparable to the optimal results of each independent model in Experiment 5.2, indicating that the SNAKE system has good compatibility in this scenario. In this experimental scenario, the basic settings include a batch size of 128, a tower network with only two layers and no dropout, and frozen expert network parameters. The impact of these hyper-parameter settings on model performance in this scenario, along with related discussion, can be found in Section 5.4.

5.3.2 Category Expansion Scenarios

In this scenario, we aim to evaluate the effect of model fusion when the number of sample categories increases within a given classification task. Here, we utilize the application classification tasks from the ISCX-VPN-2016 and IPTAS datasets to construct a model fusion scenario with an increased number of sample categories. The specific setup of the scenario is shown in Figure 4(b).

We set the gating layer to top-k mode and redefined the newly added types. Unlike the simple default mode, configuring the gating mechanism is very important and complex in the case where multiple expert networks correspond to a single gating network. We first present the model fusion results under the top-k gating configuration, as shown in Figure 6.

Figure 6: Model extension performance in Scenarios 5.3.2

In our experiments, we found that the model fusion in this scenario is similar to incremental learning. Although the top-k gating simply averages the two outputs, which is a straightforward linear transformation, the Tower does need to relearn the relationship between features and the remapped categories. Therefore, we appropriately increased the learning rate to 0.001 for this experiment. As shown in Figure 6, it is evident that the model fusion converges in about five rounds, which is still relatively quick. We divided the final labels into two domains of the original dataset to observe their respective classification performance. Overall, the results are quite close to those of the individual models, although the classification results of IPTAS are slightly worse. However, it is possible to set hyper-parameters more reasonably to bring the accuracy closer to that of the original model; further details are discussed in Section 5.4. In addition, we also analyzed the reasons why the trainable configuration is not suitable for this case. However, to maintain the flow of the text, the details are included in Appendix Section C.

5.3.3 Category Refinement Scenarios

In this scenario, after implementing the original classification task, the newly added expert model refines the classification of a specific category. We constructed this fusion scenario using the USTC-TFC2016 dataset. Here, in order to highlight the advantages of the SNAKE system and the trainable gate setting, we established a scenario similar to incremental learning, where Expert One has learned five of the ten categories of malicious samples, while Expert Two holds the remaining five categories. This setup closely resembles traffic analysis in network intrusion detection, where new types of attacks constantly emerge and the model must maintain its effectiveness in recognizing existing attacks while rapidly adapting to identify new malicious behavior. Due to the limited size of the dataset and the significant differences between malicious and normal behaviors, the performance of the model is quite good both before and after fusion. Here, we focus primarily on the improvement in the overall recognition performance of Expert Task One after model fusion, particularly concerning the inclusion of newly added attack samples. The experimental results are shown in Figure 7.

Figure 7: Model extension performance in Scenarios 5.3.3

The experimental results demonstrate that the SNAKE system performs the model fusion process very rapidly. We also found that the fused model successfully leverages the additional sample information provided by Expert Two to improve the recognition accuracy on the sample set in Task One (approximately 65%), which includes new attack samples. The results clearly show the advantages of the trainable gating settings in this scenario, which resembles incremental learning and involves the refinement of categories.

Figure 8: Impact of Hyper-parameter Settings on Model Performance in Task Attribute-Independent Scenarios

5.4 Impact of Hyper-parameter Settings on Model Performance

In this section of the experiments, we aimed to verify whether the settings of certain hyper-parameters in SNAKE affect the overall performance of the model. First, we analyzed the Task Attribute-Independent Scenarios and utilized the three classification tasks from the ISCX-VPN-2016 dataset to examine the impact of different batch sizes, the number of layers in the tower, and the dropout rates in the tower layers on the model's recognition performance. The corresponding trends in model accuracy are illustrated in Figure 8.

From the figure, it is not difficult to see that the batch size has a relatively minor average impact on the model accuracy in this scenario. However, a smaller batch size can lead to bolder updates in the direction of optimization, potentially resulting in slightly better performance. That said, when fusing models, if there are too many label types across two or more tasks, a smaller batch size might result in some labels not being represented during a single training iteration, which could negatively affect the training process. Therefore, we believe that setting the batch size to 128 is more reasonable. In terms of other hyper-parameters, setting dropout for the Tower network results in a slight improvement in the model’s average performance, although this is not very noticeable. On the other hand, increasing the depth of the Tower and not freezing the parameters of the expert models can lead to a significant enhancement in model performance. However, both of these approaches may slightly increase the computational resource requirements, so whether to implement these settings should depend on the specific needs of the user.

For the scenario in Experiment 5.3.2, the choice of batch size is similar to the situation described above, so we do not elaborate further here. In this case, the base model did not use dropout, the learning rate was set to 0.001, and the parameters of the expert networks were frozen. We then experimented with a smaller learning rate, dropout layers in the Tower network, and unfrozen expert-network parameters, observing the impact of these settings on accuracy. The model’s classification accuracy on the IPTAS and ISCX-VPN-2016 datasets under these configurations is shown in Figure 9.

Figure 9: Impact of Hyper-parameter Settings on Model Performance in Category Expansion Scenarios

From the figure, we can conclude that in this scenario model fusion indeed requires a slightly higher learning rate to quickly update the mapping relationships. In addition, configuring dropout layers in the Tower network helps improve the model’s accuracy. Allowing the expert models to be updated can further enhance fusion performance, but it requires more computational resources and should be enabled only when resources permit. Furthermore, as observed in Section 5.3.2, after configuring the dropout layers the fused model’s recognition performance fully matches the accuracy of the independent expert models. This indicates that, with appropriate hyper-parameter settings, the SNAKE system can effectively approximate the capabilities of the individual models along each classification dimension.
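The trade-off between frozen and trainable experts discussed above amounts to a simple optimizer configuration choice, sketched below; the attribute name `model.experts` and the use of Adam are assumptions for illustration, not the SNAKE API.

```python
import torch

def configure_fusion_optimizer(model, lr=1e-3, finetune_experts=False):
    """Freeze the expert networks by default so only the gates and Tower are
    updated during fusion; optionally unfreeze the experts at extra cost.
    `model.experts` is an assumed attribute name."""
    for p in model.experts.parameters():
        p.requires_grad = finetune_experts
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```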

5.5 Overall Effects of Integrating Numerous Tasks on Model Performance

In this final experiment, we showcase the model expansion that the SNAKE system has achieved so far, as well as its recognition performance on the corresponding tasks. While a variety of additional models and tasks could be integrated, this paper limits its scope to the current findings. Figure 10 illustrates the detection performance on the test set of each task after model fusion, compared with the original individual models. Due to manuscript length constraints, the training processes and detailed results of the individual models are provided in Appendix D.

Figure 10: Overall Effects of Integrating Numerous Tasks on Model Performance

Based on the figure, the performance of the fused model on each task closely aligns with that of the corresponding individual model, indicating that the SNAKE system has preliminarily formed a relatively universal large-scale model in the field of traffic classification.

6 Limitations and Challenges

After extensive experimental validation, the SNAKE system has demonstrated the ability to rapidly integrate a wide range of categories in the field of network traffic classification, exhibiting strong scalability in different scenarios and adapting well to the ever-changing network environment. However, there are still areas that can be further improved.

First, we believe the architecture of the SNAKE system is not yet efficient enough. If we can identify a large pre-trained model that is highly compatible with a wide range of traffic classification tasks, the SNAKE architecture could be enhanced into a three-layer design: individual classification tasks would be fine-tuned on top of the shared pre-trained model, and the SNAKE system would only need to load a small fine-tuning layer as each expert network. This would greatly reduce the operational complexity of the SNAKE system.
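One possible realization of this three-layer idea, assuming a shared frozen pre-trained backbone with a small adapter acting as the task-specific "expert", is sketched below; the dimensions and adapter form are illustrative, not a design we have implemented.

```python
import torch.nn as nn

class AdapterExpert(nn.Module):
    """Sketch of the envisioned lightweight expert: a shared, frozen
    pre-trained backbone plus a small task-specific adapter that SNAKE
    would load instead of a full expert network. Sizes are placeholders."""
    def __init__(self, backbone, feat_dim=768, bottleneck=64):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # the backbone is shared and frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, feat_dim))

    def forward(self, x):
        h = self.backbone(x)          # shared representation
        return h + self.adapter(h)    # small task-specific refinement
```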

Secondly, most existing network traffic classification approaches still treat individual streams as the classification unit, which limits the information gain from applying large models. To fully leverage the benefits of high-parameter models, traffic classification should evolve towards using multiple streams over longer time windows as the recognition unit, shifting the goal from atomic classification results to summarizing network events over time periods. This evolution could enable the SNAKE system to transition from a sample-based mixture-of-experts architecture to a token-based mixture-of-experts architecture, similar to contemporary mainstream large language models.

These two points represent the main directions for optimizing the SNAKE system. Additionally, further exploration is needed regarding the integration of the SNAKE system with other traffic classification tasks beyond those discussed in this paper.

7 Conclusion

In this paper, we introduced the SNAKE system to address the challenges faced in traffic recognition tasks, which stem from the diverse environments in which traffic data is generated and the variety of input formats, classification granularities, and recognition objectives. Existing traffic recognition models often provide limited functionality, and training numerous models and selecting among them is both inefficient and difficult to coordinate.

By leveraging the MMoE (Multi-gate Mixture of Experts) architecture, the SNAKE system can swiftly integrate pre-trained models from different traffic classification domains, merging them into a large-scale traffic classification model capable of performing multiple classification functions. We explored three distinct task expansion modes and examined their compatibility through three different gating layer configurations.

Additionally, we tested various factors influencing the multi-task recognition accuracy of the SNAKE system, including the expert model structure, the different task expansion modes, hyper-parameters, and the Tower network design. Finally, we conducted a comprehensive evaluation across eight traffic classification tasks using several publicly available datasets. The experimental results demonstrate that the SNAKE system achieves commendable overall recognition accuracy, enabling efficient inference of multi-attribute traffic intelligence.

This work not only highlights the potential of the SNAKE system in enhancing traffic classification but also sets the stage for future research aimed at further optimizing and expanding its capabilities.

References

  • [1] Jemal H. Abawajy. Trends in crime toolkit development. Network Security Technologies Design & Applications, 2014.
  • [2] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, and Antonio Pescapé. Toward effective mobile encrypted traffic classification through deep learning. Neurocomputing, 409:306–315, 2020.
  • [3] Giuseppe Aceto, Domenico Ciuonzo, Antonio Montieri, and Antonio Pescapé. Distiller: Encrypted traffic classification via multimodal multitask deep learning. Journal of Network and Computer Applications, 183:102985, 2021.
  • [4] Ahmad Reda Alzighaibi. Detection of doh traffic tunnels using deep learning for encrypted traffic classification. Computers, 12(3):47, 2023.
  • [5] Ahmad Azab, Mahmoud Khasawneh, Saed Alrabaee, Kim-Kwang Raymond Choo, and Maysa Sarsour. Network traffic classification: Techniques, datasets, and challenges. Digital Communications and Networks, 2022.
  • [6] Ali Javed Azhar-ud din, Ayesha Hanif, M Awais Azam, and Tasawer Hussain. Development of postpaid and prepaid billing system for isps. Proceedings Appeared on IOARP Digital Library, 2016.
  • [7] Onur Barut, Yan Luo, Peilong Li, and Tong Zhang. R1dit: Privacy-preserving malware traffic classification with attention-based neural networks. IEEE Transactions on Network and Service Management, 20(2):2071–2085, 2022.
  • [8] Tomasz Bujlow, Tahir Riaz, and Jens Myrup Pedersen. A method for classification of network traffic based on c5.0 machine learning algorithm. In 2012 International Conference on Computing, Networking and Communications (ICNC), 2012.
  • [9] Zigang Cao, Gang Xiong, Yong Zhao, Zhenzhen Li, and Li Guo. A survey on encrypted traffic classification. In Lynn Batten, Gang Li, Wenjia Niu, and Matthew Warren, editors, Applications and Techniques in Information Security, pages 73–81, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
  • [10] Zihan Chen, Guang Cheng, Ziheng Xu, Keya Xu, Yuhang Shan, and Jiakang Zhang. A3c system: one-stop automated encrypted traffic labeled sample collection, construction and correlation in multi-systems. Applied Sciences, 12(22):11731, 2022.
  • [11] Tianyu Cui, Gaopeng Gou, Gang Xiong, Zhen Li, Mingxin Cui, and Chang Liu. Siamhan: Ipv6 address correlation attacks on TLS encrypted traffic via siamese heterogeneous graph attention network. CoRR, abs/2204.09465, 2022.
  • [12] Jianbang Dai, Xiaolong Xu, Honghao Gao, Xinheng Wang, and Fu Xiao. Shape: A simultaneous header and payload encoding model for encrypted traffic classification. IEEE Transactions on Network and Service Management, 20(2):1993–2012, 2023.
  • [13] Wenqi Dong, Jing Yu, Xinjie Lin, Gaopeng Gou, and Gang Xiong. Deep learning and pre-training technology for encrypted traffic classification: A comprehensive review. Neurocomputing, page 128444, 2024.
  • [14] Yutao Dong, Qing Li, Kaidong Wu, Ruoyu Li, Dan Zhao, Gareth Tyson, Junkun Peng, Yong Jiang, Shutao Xia, and Mingwei Xu. Horuseye: A realtime iot malicious traffic detection framework using programmable switches. In Joseph A. Calandrino and Carmela Troncoso, editors, 32nd USENIX Security Symposium, USENIX Security 2023, Anaheim, CA, USA, August 9-11, 2023, pages 571–588. USENIX Association, 2023.
  • [15] Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Saiful Islam Mamun, and Ali A Ghorbani. Characterization of encrypted and vpn traffic using time-related features. In Proceedings of the 2nd international conference on information systems security and privacy (ICISSP), pages 407–414, 2016.
  • [16] David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314, 2013.
  • [17] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • [18] Chuanpu Fu, Qi Li, Meng Shen, and Ke Xu. Realtime robust malicious traffic detection via frequency domain analysis. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3431–3446, 2021.
  • [19] Giuseppe Granato, Alessio Martino, Andrea Baiocchi, and Antonello Rizzi. Graph-based multi-label classification for wifi network traffic analysis. Applied Sciences, 12(21):11303, 2022.
  • [20] Lulu Guo, Qianqiong Wu, Shengli Liu, Ming Duan, Huijie Li, and Jianwen Sun. Deep learning-based real-time vpn encrypted traffic identification methods. Journal of Real-Time Image Processing, 17(1):103–114, 2020.
  • [21] Yuzong Hu, Futai Zou, Linsen Li, and Ping Yi. Traffic classification of user behaviors in tor, i2p, zeronet, freenet. In 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pages 418–424, 2020.
  • [22] Shijun Huang, Kai Chen, Chao Liu, Alei Liang, and Haibing Guan. A statistical-feature-based approach to internet traffic classification using machine learning. In Proceedings of the International Conference on Ultra Modern Telecommunications, ICUMT 2009, 12-14 October 2009, St. Petersburg, Russia, 2009.
  • [23] Syed Usman Jafri, Sanjay Rao, Vishal Shrivastav, and Mohit Tawarmalani. Leo: Online ML-based traffic classification at multi-terabit line rate. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 1573–1591, 2024.
  • [24] Danial Javaheri, Saeid Gorgin, Jeong-A Lee, and Mohammad Masdari. Fuzzy logic-based ddos attacks and network traffic anomaly detection methods: Classification, overview, and future perspectives. Information Sciences, 626:315–338, 2023.
  • [25] Thomas Karagiannis, Andre Broido, Michalis Faloutsos, and Kc Claffy. Transport layer identification of p2p traffic. In ACM SIGCOMM conference on Internet measurement, 2004.
  • [26] Gunjan Khanna, Kirk Beaty, Gautam Kar, and Andrzej Kochut. Application performance management in virtualized server environments. In 2006 IEEE/IFIP Network Operations and Management Symposium NOMS 2006, pages 373–381. IEEE, 2006.
  • [27] Rakesh Kumar, Mayank Swarnkar, Gaurav Singal, and Neeraj Kumar. Iot network traffic classification using machine learning algorithms: An experimental analysis. IEEE Internet of Things Journal, 9(2):989–1008, 2021.
  • [28] Arash Habibi Lashkari, Gerard Draper Gil, Mohammad Saiful Islam Mamun, and Ali A Ghorbani. Characterization of tor traffic using time based features. In International Conference on Information Systems Security and Privacy, volume 2, pages 253–262. SciTePress, 2017.
  • [29] Weikai Li, Ding Wang, Zijian Ding, Atefeh Sohrabizadeh, Zongyue Qin, Jason Cong, and Yizhou Sun. Hierarchical mixture of experts: Generalizable learning for high-level synthesis. arXiv preprint arXiv:2410.19225, 2024.
  • [30] Xiang Li, Dashuai Wu, Haixin Duan, and Qi Li. Dnsbomb: A new practical-and-powerful pulsing dos attack exploiting DNS queries-and-responses. In IEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024, pages 4478–4496. IEEE, 2024.
  • [31] Xinjie Lin, Gang Xiong, Gaopeng Gou, Zhen Li, Junzheng Shi, and Jing Yu. Et-bert: A contextualized datagram representation with pre-training transformers for encrypted traffic classification. In Proceedings of the ACM Web Conference 2022, pages 633–642, 2022.
  • [32] Chang Liu, Longtao He, Gang Xiong, Zigang Cao, and Zhen Li. Fs-net: A flow sequence network for encrypted traffic classification. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 1171–1179, 2019.
  • [33] Tiantian Liu, Feng Lin, Zhongjie Ba, Li Lu, Zhan Qin, and Kui Ren. Micguard: A comprehensive detection system against out-of-band injection attacks for different level microphone-based devices. pages 3963–3978, Philadelphia, PA, United States, 2024.
  • [34] Manuel Lopez-Martin, Belen Carro, Antonio Sanchez-Esguevillas, and Jaime Lloret. Network traffic classifier with convolutional and recurrent neural networks for internet of things. IEEE access, 5:18042–18050, 2017.
  • [35] Manuel Lopez-Martin, Belen Carro, Antonio Sanchez-Esguevillas, and Jaime Lloret. Network traffic classifier with convolutional and recurrent neural networks for internet of things. IEEE Access, 5:18042–18050, 2017.
  • [36] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1930–1939, 2018.
  • [37] Mohammadreza MontazeriShatoori, Logan Davidson, Gurdip Kaur, and Arash Habibi Lashkari. Detection of doh tunnels using time-series classification of encrypted traffic. 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pages 63–70, 2020.
  • [38] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019.
  • [39] Yuqi Qing, Qilei Yin, Xinhao Deng, Yihao Chen, Zhuotao Liu, Kun Sun, Ke Xu, Jia Zhang, and Qi Li. Low-quality training data only? A robust framework for detecting encrypted malicious network traffic. In 31st Annual Network and Distributed System Security Symposium, NDSS 2024, San Diego, California, USA, February 26 - March 1, 2024. The Internet Society, 2024.
  • [40] Shahbaz Rezaei and Xin Liu. Multitask learning for network traffic classification. In 2020 29th International Conference on Computer Communications and Networks (ICCCN), pages 1–9, 2020.
  • [41] Debmalya Sarkar, P Vinod, and Suleiman Y Yerima. Detection of tor traffic using deep learning. In 2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), pages 1–8. IEEE, 2020.
  • [42] GVK Sasirekha, GH Annapoorna, Madhav Rao, Jyotsna Bapat, and Debabrata Das. Ml-augmented network packet broker based anomaly detection at iiot-edge egress port. In 2023 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), pages 1–6. IEEE, 2023.
  • [43] Haifeng Sun, Yunming Xiao, Jing Wang, Jingyu Wang, Qi Qi, Jianxin Liao, and Xiulei Liu. Common knowledge based and one-shot learning enabled multi-task traffic classification. IEEE Access, 7:39485–39495, 2019.
  • [44] International Telecommunication Union. Measuring digital development: Facts and figures 2023, 2023.
  • [45] Gido M Van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
  • [46] Wei Wang, Ming Zhu, Xuewen Zeng, Xiaozhou Ye, and Yiqiang Sheng. Malware traffic classification using convolutional neural network for representation learning. In 2017 International Conference on Information Networking (ICOIN), pages 712–717, 2017.
  • [47] Wei Wang, Ming Zhu, Xuewen Zeng, Xiaozhou Ye, and Yiqiang Sheng. Malware traffic classification using convolutional neural network for representation learning. In 2017 International conference on information networking (ICOIN), pages 712–717. IEEE, 2017.
  • [48] Xiaoliang Wang, Ke Xu, Wenlong Chen, Qi Li, Meng Shen, and Bo Wu. Id-based sdn for the internet of things. IEEE Network, 34(4):76–83, 2020.
  • [49] Yu Wang, Yang Xiang, and Shun Zheng Yu. Automatic application signature construction from unknown traffic. In IEEE International Conference on Advanced Information Networking & Applications, 2010.
  • [50] Lu Xu, Daihui Dou, and H. Jonathan Chao. Etcnet: Encrypted traffic classification using siamese convolutional networks. In Proceedings of the Workshop on Network Application Integration/CoDesign, NAI ’20, page 51–53, New York, NY, USA, 2020. Association for Computing Machinery.
  • [51] Siyao Zhang, Daocheng Fu, Wenzhe Liang, Zhao Zhang, Bin Yu, Pinlong Cai, and Baozhen Yao. Trafficgpt: Viewing, processing and interacting with traffic foundation models. Transport Policy, 150:95–105, 2024.
  • [52] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE transactions on knowledge and data engineering, 34(12):5586–5609, 2021.
  • [53] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [54] Xiyuan Zhao, Xinhao Deng, Qi Li, Yunpeng Liu, Zhuotao Liu, Kun Sun, and Ke Xu. Towards fine-grained webpage fingerprinting at scale. In Bo Luo, Xiaojing Liao, Jun Xu, Engin Kirda, and David Lie, editors, Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, October 14-18, 2024, pages 423–436. ACM, 2024.
  • [55] Ying Zhao, Junjun Chen, Di Wu, Jian Teng, and Shui Yu. Multi-task network anomaly detection using federated learning. In the Tenth International Symposium, 2019.

Ethics Considerations

In our network traffic analysis experiments, we are committed to upholding the highest ethical standards. All datasets utilized in our research are open-source and publicly available, sourced from reputable repositories to ensure transparency and accessibility. We explicitly affirm that our study does not involve any analysis of personal or private data, thereby safeguarding individual privacy rights. Furthermore, all malicious traffic samples employed in our experiments are strictly confined to a controlled local environment managed by the researchers, ensuring that no disruptions or harm can occur to external networks or individuals.

Appropriate measures were implemented to safeguard the privacy of personal information. No personal data that could potentially identify participants will be disclosed or made public. In the research reports, all data have been anonymized and securely stored on password-protected servers. Access to the raw data is restricted to members of the research team, and any identifying information was removed prior to analysis. During the project design phase, we conducted a thorough assessment of potential risks associated with the research and implemented suitable measures to mitigate or manage these risks. We ensure that participants will not experience any physical or psychological harm as a result of their participation in this study. We are committed to maintaining the legality and transparency of data usage, ensuring the accurate application and interpretation of research data. Efforts have been made to minimize the risk of data misinterpretation and misuse, and data will be utilized solely for research purposes.

Compliance with Open Science Policy

In alignment with the principles of open science, we are committed to promoting transparency and reproducibility in our research.

The data that support the findings of this study were derived from the following publicly available sources:
The ISCXVPN2016 dataset is available at https://doi.org/10.5220/0005740704070414. The ISCXTor2016 dataset is available at https://doi.org/10.1109/NOMS.2006.1687567. The CIC-DoHBrw-2020 dataset is available at https://doi.org/10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00026. The USTC-TFC2016 dataset is available at https://doi.org/10.1109/ICOIN.2017.7899588.

In this study, we also commit to adhering to open science principles to ensure the transparency and reproducibility of our research findings. Below is a comprehensive list of all artifacts related to this paper, along with their availability status:

  1. Source Code:

    • Artifact Name: SNAKE System

    • Description: Source code implementing the network traffic analysis system we proposed.

    • Availability: Will be made publicly available upon acceptance of the paper, hosted on GitHub.

  2. Datasets:

    • Note: We use only publicly available datasets, and we are happy to provide the complete procedure we used to process the data, as well as our final segmentation method.

  3. Analysis Scripts:

    • Artifact Name: Analyze Script

    • Description: Scripts used for processing and analyzing the experimental results.

    • Availability: Will be made publicly available upon acceptance of the paper.

  4. Models:

    • Artifact Name: Models for Network Traffic Analysis

    • Description: The final fused model and the individual classification task models produced in this paper, including both model structures and parameters.

    • Availability: Will be made publicly available upon acceptance of the paper.

  5. Non-Public Artifacts:

    • None: the code, algorithms, models, and data in this paper are all available for open-source access.

We have made every effort to identify and list all necessary artifacts to ensure the reproducibility of our research findings. Should the availability of any artifacts change, we will update this statement accordingly.

Appendix

Appendix A Proof of Model Convergence

This section provides the detailed derivation and proof of model convergence that were not presented in Section 4.

To observe whether model training gradually approaches the global optimal solution, assume that after $t$ rounds of supervised training following $\omega^{t+1}=\omega^{t}-\alpha\nabla L_{o}(\omega^{t})$, the gap between $L_{o}(\omega^{t})$ and the optimal value $L_{o}(\omega^{*})$ is $\delta^{t}$. The following corollary can be derived from Lemma 2 in Section 4:

$$\delta^{t}=L_{o}(\omega^{t})-L_{o}(\omega^{*})\leq\nabla L_{o}(\omega^{*})^{\top}(\omega^{t}-\omega^{*})\leq\left\|\nabla L_{o}(\omega^{*})\right\|\left\|\omega^{t}-\omega^{*}\right\|$$

Therefore, we can derive the following inequality:

$$\frac{\delta^{t}}{\left\|\omega^{t}-\omega^{*}\right\|}\leq\left\|\nabla L_{o}(\omega^{*})\right\|\leq\left\|\nabla L_{o}(\omega^{t})\right\|$$

Besides, the following formula can be derived based on Lemma 1:

$$\begin{aligned}
\delta^{t+1}-\delta^{t}&=L_{o}(\omega^{t+1})-L_{o}(\omega^{t})\\
&\leq\nabla L_{o}(\omega^{t})^{\top}(\omega^{t+1}-\omega^{t})+\frac{c}{2}\left\|\omega^{t+1}-\omega^{t}\right\|_{2}^{2}\\
&=-\alpha\nabla L_{o}(\omega^{t})^{\top}\nabla L_{o}(\omega^{t})+\frac{c\alpha^{2}}{2}\left\|\nabla L_{o}(\omega^{t})\right\|_{2}^{2}\\
&=-\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)\left\|\nabla L_{o}(\omega^{t})\right\|_{2}^{2}\\
&\leq-\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)\Bigl(\frac{\delta^{t}}{\left\|\omega^{t}-\omega^{*}\right\|}\Bigr)^{2}\\
&\leq-z\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)\bigl(\delta^{t}\bigr)^{2},\qquad z=\min_{t\in[0,T]}\frac{1}{\left\|\omega^{t}-\omega^{*}\right\|^{2}}
\end{aligned}$$

It is beneficial to set the learning rate $\alpha$ to a value below $\frac{1}{c}$; the following two conditions then hold: $L_{o}(\omega^{t+1})-L_{o}(\omega^{t})\leq 0$ and $\frac{\delta^{t}}{\delta^{t+1}}\geq 1$. Thus, we can further rewrite the corollary of Lemma 1 as the following inequality:

$$\frac{1}{\delta^{t+1}}-\frac{1}{\delta^{t}}\geq\frac{z\alpha(1-\frac{c\alpha}{2})\delta^{t}}{\delta^{t+1}}\geq z\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)$$

By summing these inequalities over the training period $t\in[0,T]$, we obtain:

$$\sum_{t=1}^{T}\Bigl(\frac{1}{\delta^{t}}-\frac{1}{\delta^{t-1}}\Bigr)\geq Tz\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)$$

Finally, the following formula can be derived:

$$\frac{1}{L_{o}(\omega^{T})-L_{o}(\omega^{*})}=\frac{1}{\delta^{T}}\geq\frac{1}{\delta^{T}}-\frac{1}{\delta^{0}}=\sum_{t=1}^{T}\Bigl(\frac{1}{\delta^{t}}-\frac{1}{\delta^{t-1}}\Bigr)\geq Tz\alpha\Bigl(1-\frac{c\alpha}{2}\Bigr)$$

The above derivation shows that the learning rate can be set so that the overall loss function approaches its minimum value. After $T$ rounds of training with learning rate $\alpha\leq\frac{1}{c}$, the convergence upper bound of $L_{o}(\omega)$ can be formulated as follows:

$$L_{o}(\omega^{T})-L_{o}(\omega^{*})\leq\frac{1}{Tz\alpha(1-\frac{c\alpha}{2})}$$

This indicates that as the number of training iterations increases, $L_{o}(\omega^{T})$ gradually converges to the global minimum $L_{o}(\omega^{*})$.
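As a quick numerical illustration of the $O(1/T)$ behaviour of this bound, the short script below evaluates it for a few training lengths; the constants $c$, $\alpha$, and $z$ are arbitrary placeholder values chosen only for illustration and do not come from the paper's experiments.

```python
# Numerical illustration of the bound 1 / (T * z * alpha * (1 - c*alpha/2)).
# c, z, and alpha are placeholder values, not constants measured for SNAKE.
c, z = 10.0, 0.05
alpha = 1.0 / c                      # satisfies the condition alpha <= 1/c
bound = lambda T: 1.0 / (T * z * alpha * (1 - c * alpha / 2))

for T in (10, 100, 1000):
    print(f"T={T:5d}  bound={bound(T):.4f}")   # shrinks proportionally to 1/T
```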

Appendix B Detailed Description of the Dataset

In this section, we present detailed information about the dataset used in this paper:

ISCXVPN2016 [15]: The dataset aims to generate a representative sample of real-world traffic based on ISCX by defining a set of tasks that ensure a rich diversity and quantity. User accounts for Alice and Bob were created to utilize services such as Skype and Facebook. The dataset encompasses various types of traffic and applications, resulting in a total of 14 traffic categories, including VOIP, VPN-VOIP, P2P, and VPN-P2P, capturing both regular sessions and sessions over VPN. This diversity allows for a comprehensive reflection of actual application scenarios within the network environment. In this study, three labeling schemes were employed using the dataset: 1) VPN-nonVPN; 2) Service type, which includes Chat, Email, VoIP, File Transfer, Streaming, and P2P; and 3) 15 specific applications, including AIM, Email, Facebook, FTPS, Hangout, ICQ, Netflix, SCP, SFTP, Skype, Spotify, Vimeo, VoIPbuster, YouTube, and BitTorrent.

ISCXTor2016 [28]: To ensure the quantity and diversity of the dataset in the CIC, a set of tasks was defined to generate a representative sample of real-world traffic. Three user accounts were created for the collection of browser traffic, while two additional user accounts were established for communication activities, including chat, email, FTP, and P2P. For non-Tor traffic, previously collected benign traffic from a VPN project was utilized, and the Tor traffic was categorized into seven distinct traffic types. This structured approach enables the dataset to accurately reflect real-world network behavior with comprehensive and varied data.

CIRA-CIC-DoHBrw-2020 [37]: The CIRA-CIC-DoHBrw-2020 dataset employs a two-layered approach to capture benign and malicious DNS over HTTPS (DoH) traffic, as well as non-DoH traffic. To create a representative dataset, HTTPS traffic (including benign DoH and non-DoH) and DoH traffic were generated by accessing the top 10,000 Alexa websites, utilizing browsers and DNS tunneling tools that support the DoH protocol. In the first layer, the captured traffic is classified as DoH and non-DoH using a statistical features classifier. In the second layer, the DoH traffic is further characterized as benign DoH or malicious DoH through the application of a time-series classifier.

USTC-TFC2016 [46]: This dataset is divided into two parts, as detailed in Tables I and II. Part I includes ten types of malware traffic collected from public websites by CTU researchers in real network environments between 2011 and 2015. For larger traffic samples, only a subset was utilized, while smaller traffic samples were merged if they originated from the same application. Part II consists of ten types of normal traffic gathered using IXIA BPS, a professional network traffic simulation tool. Further details regarding the simulation methods can be found on the product’s website. To encompass a broader range of traffic types, the ten traffic types include eight classes of common applications. The USTC-TFC2016 dataset has a total size of 3.71GB and is formatted in pcap.

IPTAS-Tbps [10]: IPTAS-Tbps consists of traffic collected during normal web browsing on CERNET (China Education and Research Network). It covers seven mainstream applications running on user terminals, such as JD, Sohu, and NetEase. The transmission protocols and application versions in this dataset are more up-to-date and better suited to the current network environment.

Appendix C Case Analysis of Trainable Gate Configuration in Experiments 5.3.2

Initially, we also used trainable gating networks for Category Expansion Scenarios, but we found a certain probability of encountering training anomalies. Here, we conduct a case analysis using the same data scenario as in Section 5.3.2, focusing on the results of ten experiments under the trainable configuration with a learning rate of $1\times 10^{-4}$. The conditions of these experiments are shown in Figure 11.

Figure 11: Case Analysis of Trainable Gate Configuration

The upper sub-figure of Figure 11 illustrates an abnormal situation with the trainable gating setting during model training: after four iterations, the model’s loss function value increased instead of decreasing. Although the overall accuracy of the model did not decline correspondingly, some relatively hidden anomalies were present. We subsequently grouped the classified samples by their source dataset and found a significant difference in recognition accuracy on the IPTAS dataset between the trainable setting and the top-k setting. Upon closer examination, we observed that in three experimental runs (cases 3, 5, and 7) with the trainable setting, the corresponding recognition accuracy was abnormally low. Considering the trend of the loss function, we can also conclude that this low accuracy is not recoverable.

This situation occasionally occurs under different hyper-parameter settings, affecting the overall performance of the model ensemble. We attribute this to multiple expert models sharing the Tower network base in Category Expansion Scenarios, which makes training susceptible to the data distribution. The anomalies arise when the trainable gating assigns excessively high weights to the tasks of certain experts, while the relatively small data volume from the other experts makes recovery difficult. Our preliminary experiments suggest that increasing the learning rate, using a smaller batch size, or increasing the optimizer’s momentum may alleviate this phenomenon. Nevertheless, we believe that in such cases it is safer to choose the top-k setting, even though the trainable setting may occasionally yield better results, because the integration of such tasks involves many unstable factors, such as sample imbalance and uneven label distribution, that can trigger these anomalies. Compared to general neural network layers, the gating layers are relatively small and fragile. We therefore recommend enabling the trainable gating mode only when it is very clear that the patterns of the two tasks correspond to Category Expansion Scenarios.
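For clarity, the snippet below sketches one common form of hard top-k gating: only the k highest-scoring experts receive non-zero weight, so a single expert cannot silently dominate the mixture the way a fully trainable soft gate can. The score source and shapes are illustrative; in the safer configuration described above, the scores could even come from a fixed task-to-expert prior rather than a trained layer.

```python
import torch

def topk_gate(scores: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Hard top-k gating over expert scores of shape (batch, num_experts):
    keep only the k largest scores per sample, renormalize them with a
    softmax, and zero out the rest. Illustrative sketch only."""
    topv, topi = scores.topk(k, dim=1)
    gated = torch.zeros_like(scores).scatter(1, topi, torch.softmax(topv, dim=1))
    return gated  # each row sums to 1 over the selected experts
```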

Figure 12: Training Details of Each Independent Model

Appendix D Training Processes and Detailed Results of Individual Models

This section presents the training processes and final results of the expert models that are read and integrated by the SNAKE system, each applied to its respective independent task. With the exception of the CIRA-CIC-DoHBrw-2020 dataset, which was limited to a random sample of 7.92GB of data for training and testing due to equipment constraints, all other datasets were fully utilized and split into training, validation, and testing sets in a ratio of 75%, 10%, and 15%. The data preprocessing and model architectures were consistent across all models; further details can be found in Sections 3.4 and 3.5. For the USTC-TFC2016, CIRA-CIC-DoHBrw-2020, and ISCX-Tor datasets, we applied early stopping, training for only 20 and 10 epochs respectively, while all other models were trained for a consistent 50 epochs. The training loss and accuracy trends for each model across epochs are illustrated in Figure 12.
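For reproducibility, the helper below sketches the 75%/10%/15% split described above; the fixed seed and the use of torch.utils.data.random_split are illustrative choices rather than the exact procedure used in our pipeline.

```python
import torch
from torch.utils.data import random_split

def split_75_10_15(dataset, seed=0):
    """Split a dataset into train/validation/test subsets in a 75/10/15 ratio.
    The seed value is an arbitrary placeholder."""
    n = len(dataset)
    n_train, n_val = int(0.75 * n), int(0.10 * n)
    n_test = n - n_train - n_val
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```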

Table 4: Classification Performance of Independent Models for Each Task from Different Datasets
Dataset Name | Task Objective | Accuracy % | Precision % | F1-Score %
ISCXVPN2016 | Encapsulation | 98.09 (± 1.11) | 98.11 (± 1.09) | 98.09 (± 1.11)
ISCXVPN2016 | Service Types | 80.29 (± 0.52) | 80.35 (± 1.95) | 98.67 (± 0.52)
ISCXVPN2016 | Application Types | 77.16 (± 0.44) | 77.99 (± 1.49) | 98.67 (± 0.69)
ISCXTor2016 | Tor Usage | 99.89 (± 0.07) | 99.89 (± 0.07) | 99.89 (± 0.08)
CIC-DoHBrw-2020 | Malicious DoH | 99.96 (± 0.02) | 99.97 (± 0.02) | 99.97 (± 0.02)
USTC-TFC2016 | Benign/Malicious | 99.98 (± 0.03) | 99.98 (± 0.03) | 99.98 (± 0.03)
USTC-TFC2016 | Specific Tools | 99.91 (± 0.03) | 99.91 (± 0.03) | 99.91 (± 0.03)
IPTAS-Tbps | Application Types | 88.52 (± 3.99) | 89.02 (± 2.89) | 88.42 (± 3.42)
* All results are presented as mean ± 1/2 range.

For each independent expert model shown in the figure, we conducted ten repeated experiments covering the entire process from dataset partitioning to the end of model training. The figure displays the mean and range of each model’s loss function values and accuracy over the ten training runs. However, due to an oversight on our part, we did not record the per-epoch validation results when training the Encapsulation and Service Type classifiers on the ISCX-VPN-2016 dataset. Nonetheless, the test-set results for all tasks across all datasets are recorded and presented in Table 4.