Distributed Traffic Synthesis and Classification in Edge Networks: A Federated Self-supervised Learning Approach

Yong Xiao, , Rong Xia, Yingyu Li, Guangming Shi, , Diep N. Nguyen, , Dinh Thai Hoang, , Dusit Niyato, , and Marwan Krunz *This work has been accepted at IEEE Transactions on Mobile Computing. Copyright may be transferred without notice, after which this version may no longer be accessible. Y. Xiao was supported in part by the National Natural Science Foundation of China under grant 62071193 and the Key R & D Program of Hubei Province of China under grants 2021EHB015 and 2020BAA002. G. Shi was supported in part by the National Natural Science Foundation of China under grants 62293483, 61871304, and 61976169. Y. Xiao and G. Shi were supported in part by the major key project of Peng Cheng Laboratory (grant No. PCL2021A12). M. Krunz was supported by the National Science Foundation (grants 1910348, 1731164, 1813401, 2229386, and IIP-1822071) and by the Broadband Wireless Access & Applications Center (BWAC). Y. Xiao is with the School of Electronic Information and Communications at the Huazhong University of Science and Technology, Wuhan 430074, China, also with the Peng Cheng Laboratory, Shenzhen, Guangdong 518055, China, and the Pazhou Laboratory (Huangpu), Guangzhou, Guangdong 510555, China (e-mail: [email protected]). R. Xia is with the School of Electronic Information and Communications at the Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]). Y. Li is with the School of Mech. Eng. and Elect. Inform. at the China University of Geosciences, Wuhan, China, 430074 (e-mail: [email protected]). G. Shi is with the Peng Cheng Laboratory, Shenzhen, Guangdong, 518055, China and is also with the School of Artificial Intelligence, the Xidian University, Xi’an, Shanxi, China, 710071, and Pazhou Lab (Huangpu), Guangdong, 510555, China (e-mail: [email protected]). D. Nguyen and D. T. Hoang are with the School of Electrical and Data Engineering, University of Technology Sydney, Australia (e-mail: {diep.nguyen, hoang.dinh}@uts.edu.au). D. Niyato is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]). M. Krunz is with the Department of Electrical and Computer Engineering, the University of Arizona, Tucson, AZ (e-mail: [email protected]).
Abstract

With the rising demand for wireless services and increased awareness of the need for data protection, existing network traffic analysis and management architectures are facing unprecedented challenges in classifying and synthesizing the increasingly diverse services and applications. This paper proposes FS-GAN, a federated self-supervised learning framework to support automatic traffic analysis and synthesis over a large number of heterogeneous datasets. FS-GAN is composed of multiple distributed Generative Adversarial Networks (GANs), with a set of generators, each being designed to generate synthesized data samples following the distribution of an individual service traffic, and each discriminator being trained to differentiate the synthesized data samples from the real data samples of a local dataset. A federated learning-based framework is adopted to coordinate local model training processes of different GANs across different datasets. FS-GAN can classify data of unknown types of service and create synthetic samples that capture the traffic distribution of the unknown types. We prove that FS-GAN can minimize the Jensen-Shannon Divergence (JSD) between the distribution of real data across all the datasets and that of the synthesized data samples. FS-GAN also maximizes the JSD among the distributions of data samples created by different generators, resulting in each generator producing synthetic data samples that follow the same distribution as one particular service type. Extensive simulation results show that the classification accuracy of FS-GAN achieves over 20% improvement on average compared to the state-of-the-art clustering-based traffic analysis algorithms. FS-GAN also has the capability to synthesize highly complex mixtures of traffic types without requiring any human-labeled data samples.

Index Terms:
Traffic classification, self-supervised learning, edge computing, federated learning, generative adversarial networks.

1 Introduction

Telecommunication technologies have evolved tremendously over the past few decades, driven by innovative services and applications in a wide range of verticals [1, 2]. In particular, recently standardized 5G technology introduces three major use cases: eMBB (enhanced Mobile BroadBand), URLLC (Ultra Reliable Low Latency Communications), and mMTC (massive Machine Type Communications), with promises to enable novel applications including IoT [3, 4, 5] and autonomous vehicles [6]. According to recent reports[7, 8], the next-generation mobile technologies, e.g., B5G and 6G, are expected to further extend the application scenarios of 5G by bringing more diverse and innovative services, such as holographic-type communications, Tactile Internet[9], semantic communications[10, 11], and others[12].

As more services and applications are introduced and adopted at different times in different regions, network traffic analysis and management face unprecedented challenges. In particular, the diverse demands and requirements of different services significantly increase the dynamics of network traffic, making traffic prediction, network planning, resource allocation and scheduling more challenging than before [13, 14]. Most existing solutions rely on regression or convolutional neural network (CNN)-based supervised learning to fit historical data and accordingly classify the traffic of existing services [15]. These solutions often require a large amount of manually labeled datasets, which may not be practically feasible to obtain [16]. Recent works employ clustering-based unsupervised learning solutions to analyze and cluster unknown traffic [17]. However, these solutions often suffer from limited accuracy and cannot be applied to predict or keep track of highly mixed and evolving service traffic flows. There is a pressing need to develop intelligent traffic analysis and classification solutions that can automatically identify known services and classify/label unknown services and data samples that emerge in decentralized datasets across various locations.

One promising solution to the above challenges is self-supervised learning, a novel algorithmic framework that can learn features and representations without requiring any labeled data samples for training machine learning (ML) models. Self-supervised learning combines the advantages of supervised learning and unsupervised learning by first solving some carefully designed pretext tasks to automatically create pseudo-labels for the unlabeled data samples. It then employs supervised-learning-based downstream training tasks to construct ML models based on the pseudo-labeled dataset[18, 19, 20]. Due to its potential to significantly improve data efficiency and model generality, self-supervised learning is commonly believed to be one of the key enablers for the "next artificial intelligence revolution"[21].

Despite their potential, most existing works on self-supervised learning focus on designing pretext tasks that utilize the attributes of images or videos, such as image orientation, gray-scale image colorization, and image jigsaw puzzle solving, none of which can be applied to analyze network traffic datasets. Also, compared to image data samples, each of which contains rich (pixel-level) information with well-known representation features and attributes, network packets are generally much smaller. The data streams communicated throughout the network contain highly mixed traffic types, and data packets associated with different services may exhibit very similar frame patterns that are challenging to classify. Moreover, there is still no commonly known set of attributes of data packets associated with each individual service.

The increasingly stringent requirements on the responsiveness of smart services and the growing awareness of data protection and network security further exacerbate the above challenges. This is especially the case, considering that most existing self-supervised learning solutions are cloud-based in which every mobile device must upload its locally collected data samples to a cloud server and wait for feedback before making decisions [22, 23]. These solutions are known to suffer from long communication delay and risk of privacy leakage caused by showing raw data. To address these issues, edge intelligence has been recently introduced, which enables distributed data processing and learning over a large number of low-cost, often decentralized edge devices, e.g., mobile edge servers [24]. It is considered one of the most sought-after functions in B5G/6G.

One of the challenges of implementing edge intelligence is to develop simple and effective algorithmic solutions for decentralized data learning and processing over large resource-limited wireless networks [12]. As an emerging distributed AI solution, federated learning (FL) allows collaborative construction of ML models without exposing any raw data across decentralized datasets. Integrating FL with self-supervised learning will have the potential to significantly reduce communication overhead, enhance data protection, and improve efficiency of model training. Unfortunately, conventional FL approaches often suffer from slow convergence and bias when the distributions of traffic across different datasets are highly heterogeneous and unbalanced. In fact, the most commonly used federated learning solution, FedAvg, can only be applied when all decentralized datasets observe the same set of features. Furthermore, most existing FL solutions are based on supervised learning, which require high-quality labeled samples across all decentralized datasets [25]. Currently, there is no simple and effective self-supervised learning solution that allows joint model construction based on heterogeneous datasets.

In this paper, we propose the federated self-supervised Generative Adversarial Networks (FS-GAN), a novel framework for distributed traffic analysis and synthesis over decentralized datasets consisting of highly heterogeneous and unbalanced traffic types. FS-GAN is comprised of multiple decentralized GANs, deployed at a set of edge servers, each of which can access an exclusive set of data samples associated with a combination of local services. Different edge servers can access different combinations of services. A set of generators is deployed at the edge servers or virtual machines that offer computational functions. Each discriminator, in contrast, is implemented at an edge server and has been assigned the exclusive right to access a local dataset that requires privacy protection of its local data samples. FS-GAN learns to create synthetic data samples that capture the statistical features of real data (i.e., its distribution) obtained from each individual type of service to support automatic traffic classification. Compared to a traditional GAN, FS-GAN provides three unique advantages. First, it supports collaborative model construction based on decentralized datasets by adopting a federated learning-based approach to coordinate model training at different edge servers. Second, it addresses the model collapse problem suffered by many GAN-based approaches by associating a classifier with each local dataset to classify samples generated by different generators. Third, FS-GAN does not require any labeled samples to train the model; rather, it adopts a self-supervised learning-based approach that automatically assigns different pseudo-labels to synthesized data samples generated by different generators. Compared to a traditional network data classification/clustering solution, the proposed FS-GAN trains multiple generators, each of which is capable of synthesizing data traffic for a given service. These features of FS-GAN make it possible to further improve the performance of data traffic classification, especially when traffic samples of different services are highly unbalanced, i.e., some services have much fewer data samples than others. They also make FS-GAN applicable to a wide range of scenarios and use cases that involve traffic prediction and synthesis, such as dynamic network slicing that involves multi-service traffic tracking and prediction, unknown attack detection, and prediction-based network planning. We conduct extensive simulations based on real-world traffic associated with ten popular service applications including email, FTP, video and audio chat, etc. [26]. Our results show that the classification accuracy of FS-GAN achieves over 20% improvement, on average, compared to the state-of-the-art clustering-based traffic analysis algorithms.

Refer to caption
Figure 1: Comparison between (a) the distribution of three service traffic types $T_1$, $T_2$, $T_3$ measured in packet value (converted decimal value of data packet), (b) the synthesized traffic data produced by FS-GAN, and (c) the synthesized traffic produced by F-GAN.

To highlight the advantages of FS-GAN, in Fig. 1 we present the traffic distribution of the data packets from real traffic associated with 3 types of (unknown) applications (Fig. 1(a)) and the distribution of synthesized traffic generated by FS-GAN after 50 rounds of training (Fig. 1(b)), compared with the distribution of data produced by a straightforward extension of the traditional multi-generator GAN into an FL setting, referred to as F-GAN (Fig. 1(c)). It can be observed that FS-GAN is able to successfully generate synthetic data samples that match the distributions of the different types of applications. F-GAN, however, fails to capture the differences in data distribution between the various services and can only generate samples with limited diversity.

The main contributions of this paper are as follows:

  • 1)

    We introduce a novel framework called FS-GAN that exploits concepts from GAN, FL, and self-supervised learning to enable automatic learning, synthesis, and classification of heterogeneous traffic over decentralized datasets. Compared to regression-based clustering solutions, which separate data samples based on a single or a limited number of pre-selected features, FS-GAN autonomously creates synthetic samples with pseudo-labels to capture the distributions of local mixtures of traffic types and then applies supervised-learning-like solutions to further improve the classification performance. To the best of our knowledge, FS-GAN is the first network traffic synthesis and classification solution that uses self-supervised learning.

  • 2)

We prove that the model trained by FS-GAN minimizes the difference between the distribution of the real traffic and that of the synthetic data samples created by the FS-GAN generators. The unique ability of FS-GAN to learn and create synthetic samples that capture the distribution of heavily mixed but unknown traffic across decentralized datasets allows it to be applied to a range of novel applications, well beyond traditional traffic classification. We discuss some potential applications and point out the limitations of FS-GAN under different scenarios.

  • 3)

    We conduct extensive simulations using real traffic datasets. Our results show that FS-GAN achieves significant improvement in both traffic classification and data synthesis.

The rest of the paper is organized as follows. In Section 2, we first review recent works that are relevant to FS-GAN. We then introduce the architecture and discuss its application scenarios in Section 3. Details of the proposed architecture, algorithms, as well as the main theoretical results are presented in Section 4. We conduct extensive simulations and evaluate the performance of FS-GAN in Section 5. The paper is concluded in Section 6.

2 Related Work

The solution proposed in this paper is closely related to recent progress in deep-learning-based traffic classification and FL-based data analysis and synthesis. Most related works are reviewed in this section.

2.1 Deep-learning-based Traffic Classification

Deep neural networks (DNNs) are among the most promising solutions for network traffic classification [17, 15, 27, 28, 29, 30, 31, 32, 33, 34]. For example, Wang et al. proposed a hybrid neural network that combines a recurrent neural network (RNN) and a CNN to learn dynamic features of mobile data for traffic classification[15]. Liang et al. introduced a deep reinforcement learning-based approach, called NeuroCuts, for packet classification[30]. Zhang et al. proposed a deep learning-based traffic clustering solution that can classify packets associated with unknown classes and automatically build a new training dataset to update the classifier of unknown traffic[17]. Liu et al. applied an RNN-based flow sequence network approach to classify encrypted traffic flows[28]. Ensemble learning-based methods have been adopted to improve traffic classification accuracy by Aouedi et al.[35] and Jin et al.[36].

2.2 FL-based Network Data Analysis

FL and its applications in network data analysis have recently attracted significant interest [37, 38, 39, 40, 41, 42, 43, 44]. FL has the potential to protect data privacy during distributed training and coordination across decentralized datasets. For example, Wen et al. proposed a two-step FL framework to achieve privacy-preserving model training with high communication efficiency. Wang et al. proposed an empirically driven solution to optimize the selection of client devices that participate in model updating[39]. Recently, FL was extended to data analysis in resource-limited networking systems. For example, Wang et al. investigated the convergence of gradient descent-based FL solutions and optimized the tradeoff between local updates and global model aggregation under a given resource constraint[45]. Luo et al. introduced a low-cost sampling-based algorithm to learn the convergence-related parameters and minimize the cost in terms of learning time and energy[46]. Xiao et al. proposed a federated edge intelligence (FEI) framework for implementing FL in edge computing-assisted IoT networks with communication and computational resource constraints[47]. This framework has been further extended to real-time learning in edge intelligence networks[48], as well as semantic communications[10] and semantic-aware networking systems[11, 49, 50]. Zhang et al.[51] studied homomorphic encryption-based FL and its communication and computation cost for private medical data analysis, and Lu et al.[52] investigated FL-based imbalanced classification for industrial data.

2.3 Self-supervised learning-based Data Analysis

Recently, self-supervised learning was introduced as a way to improve data efficiency and model generalization [20, 53, 54, 55, 19, 18, 56]. In particular, Araslanov et al. proposed a domain adaptation approach for semantic segmentation based on predictions produced by pseudo-supervision targets [20]. Li et al. proposed a two-stage framework for anomaly detection in which a self-supervised deep representation model was learned to build a generative classifier [53]. Sun et al. introduced a novel self-supervised model training algorithm for Graph Convolutional Networks (GCNs), which improved model generalization performance while reducing the number of required labeled samples [54]. Zhang et al. proposed a self-supervised learning method with an adaptive memory network to improve model generalization and enrich feature representativeness[57]. Shi et al. studied a novel masked image model for self-supervised learning, in which the idea of adversarial training was utilized[58].

Our proposed FS-GAN differs from traditional clustering-based data analysis techniques in that it adopts a self-supervised learning-based solution to first create synthetic samples with pseudo-labels that statistically match a mixture of real traffic types and then train an ML model using the pseudo-labeled datasets to classify and identify each individual service traffic.

3 Architecture Overview and Application Scenarios

3.1 Architecture Overview

Refer to caption
Figure 2: FS-GAN architecture, including its key components, i.e., generators (G), discriminators (D), classifiers (C), and a global model coordinator (left), and the main algorithmic procedures (right).

The main architecture, including the key components and algorithmic procedures of FS-GAN, is presented in Fig. 2. FS-GAN consists of: (1) multiple generators that create synthetic samples whose statistical distribution is similar to that of the real traffic in local datasets, and (2) multiple discriminators, each of which tries to differentiate the real traffic from the synthetic samples produced by the generators. Our proposed FS-GAN framework employs the following three procedures:

(1) Local Data Synthesis: One discriminator is deployed in each edge server with exclusive right access to a local dataset. Each discriminator is jointly trained with several associated generators. Although GANs, in general, have high flexibility in generating synthetic samples [59, 60, 61], they suffer from the so-called model collapsing problem, where the generator learns to produce synthetic samples with extremely limited diversity, e.g., samples that only capture the data distribution of a single traffic type or a fixed mixture of a given set of traffic types. To tackle this problem, we borrow the idea of multi-generator GANs [62, 63, 64] and assign a pseudo-label to each generated sample to indicate which generator the sample came from. As will be proven later, a generator that is locally trained with a discriminator is able to create synthetic data samples that match the data distribution of the real traffic in the local dataset.

(2) Global Model Coordination: The local models trained by the discriminators and generators will be periodically uploaded to a global model coordinator. The coordinator can be deployed at one of the edge servers or at the cloud data center. One straightforward implementation of model coordination is to adopt FL solutions[40, 41], in which the local model parameters (of generators and/or discriminators) are aggregated, e.g., averaged in FedAvg algorithm[65], to form an updated global model that is then broadcasted to edge servers. Note that the number of model parameters from local edge servers will affect the communication efficiency. In this paper, we consider two model coordination schemes:

  • (i)

    Coordination Scheme I (C-I): Both locally trained discriminators and the associated generators are uploaded to the global coordinator.

  • (ii)

    Coordination Scheme II (C-II): Only discriminators are uploaded and aggregated by the coordinator.

C-II requires less model-related data to be exchanged between edge servers and the coordinator, resulting in a higher communication efficiency. However, as will be shown later in this paper, C-I has faster convergence than C-II, which in some cases compensates for the extra communication overhead.

In the local data synthesis procedure, each discriminator only determines whether the data samples belong to the real traffic or are synthetic data produced by the generators. After the global model coordination, all the discriminators should be able to differentiate real data in any local dataset from synthetic data generated by any generator. Similarly, the generators will be able to capture richer features across different local datasets. To address the model collapsing problem, we introduce a classifier for each local dataset to classify samples from different generators. As shown in Section 4, this classifier increases the divergence of the synthetic data distributions of different generators, hence alleviating the model collapsing problem of traditional GANs.

(3) Self-Supervised Learning: The trained and periodically updated classifier creates pseudo-labels for synthetic samples generated by different generators. As the generators become more capable of producing close-to-real synthetic data, the classifier associated with each discriminator will naturally support data classification and self-labeling of real traffic that arrives in the future. We show that the classifier actually shares the same model parameters as the locally trained discriminator, and therefore does not require any extra effort for model training.

We present a theoretical analysis that proves the effectiveness of the FS-GAN framework. In particular, we prove that FS-GAN possesses two important properties: (1) the difference between the distribution of the mixture of synthetic data generated by the generators and the distribution of the real traffic data is minimized, and (2) the JSD among the distributions of synthetic data samples generated by different generators is maximized. Based on these properties, we prove the main result of this paper, namely, that the synthetic data samples generated by each individual generator capture the real traffic distribution of one individual traffic type, and that the classifier can differentiate all the synthetic samples coming from different generators, provided that the distributions of the traffic data associated with different services are separable.

Compared to existing self-supervised learning solutions, FS-GAN provides the following unique advantages that make it more suitable for network traffic classification problems. First, FS-GAN can be applied to a large network with highly decentralized and imbalanced traffic distribution with data privacy protection requirements. Second, FS-GAN does not require any pre-labeling of data, which makes it suitable for many of today’s networking systems that involve a large volume of unlabeled network traffic datasets that are prohibitively expensive to label. Third, the data synthesizing capability of FS-GAN makes it possible to be applied in many emerging networking services that involve traffic prediction and synthesis.

3.2 Examples of Application Scenarios

Compared to traditional traffic classification solutions, FS-GAN adopts a fundamentally different approach by first training multiple generator networks to generate synthetic samples that capture the distribution of real traffic associated with each individual service, and then applying the parameters of the already trained models to classify the traffic of different services. This uniqueness of FS-GAN makes it applicable to a range of new application scenarios and use cases, including the following:

1) Dynamic Network Slicing in 5G/B5G: 5G NR Release 16 [66] introduced the concept of network slicing, which focuses on isolating and preserving partitions of network resources and functions, commonly referred to as slices. These slices are tailored and orchestrated to support applications with diverse QoS requirements. One of the key challenges in network slicing is how to keep track of the traffic demands of potentially unknown applications, so that the appropriate amount of resources can be reserved while meeting the needs of these applications. FS-GAN can be directly applied to classify and keep track of the needs of different services from a highly mixed traffic. It can also assist the network slicing manager in identifying newly observed service traffic types that do not have a sufficient amount of pre-labeled data.

2) Unknown Attack Detection: An important application scenario for traffic classification is attack detection. Existing solutions mostly focus on detecting known attacks. Detecting unknown attacks is a notoriously difficult problem[67]. FS-GAN has the potential to identify unknown attacks and simulate various attack scenarios as well as their mixtures.

3) Traffic Prediction-based Network Planning: Because the final model obtained by FS-GAN is capable of generating synthetic data samples that are statistically similar to the real data of various types of emerging traffic, FS-GAN can be applied to predict future traffic trends and assist in planning the network infrastructure to better cope with anticipated needs.

4 FS-GAN Design and Theoretical Analysis

In this section, we present the algorithmic details of the main procedures of FS-GAN (both C-I and C-II): local data synthesis, global model coordination, and self-supervised learning. The detailed architecture is shown in Fig. 3. Theoretical analysis is presented at the end of this section.

Refer to caption
Figure 3: Detailed architectural components of FS-GAN
TABLE I: List of Notation
Notation Description
$d$  index of local datasets
$D$  discriminator of FS-GAN
$G$  generator of FS-GAN
$P$  distribution of data
$z$  noise drawn from a Gaussian distribution
$m$  index of generators
$\mathcal{L}$  loss function
$x$  real data sample
$\hat{x}$  synthetic data sample
$\lambda$  diversity hyper-parameter
$\omega$  parameters of generator
$\theta$  parameters of discriminator and classifier

4.1 Local Data Synthesis

First, we consider local-model training of a single generator at the $d$th local dataset, $d=1,\ldots,N$. As mentioned before, the main objective of this procedure is to train a model to generate synthetic data samples that capture the distribution of real traffic in a local dataset. Let $G_d$ and $D_d$ be, respectively, the generator and discriminator associated with the $d$th dataset. Training GANs can be formulated as a minimax game between $G_d$ and $D_d$ [59]. The generator and the discriminator are often DNNs. Let $P_{data}$ be the distribution of the real dataset and suppose that $\boldsymbol{x}$ is drawn from $P_{data}$, i.e., we write $\boldsymbol{x}\sim P_{data}$. The generator $G_d$ aims to train a multi-layer neural network to map its input variables. Initially, a random noise $\boldsymbol{z}$ is generated to synthesize data samples; its distribution is denoted by $P_{\boldsymbol{z}}$. Without loss of generality, we slightly abuse the notation and use $G_d(\boldsymbol{z})$ to denote the trained generator's output given input $\boldsymbol{z}$. Meanwhile, the discriminator $D_d$ tries to distinguish real data samples from synthetic samples produced by the generator. We use $D_d(\cdot)$ to denote the output of the discriminator, which represents the probability that the discriminator decides the given input sample comes from the real dataset rather than being synthesized by the generator. More formally, we can express the adversarial training process between a single generator and a single discriminator via the following minimax problem:

\min_{G_{d}}\max_{D_{d}}\;\mathbb{E}_{\boldsymbol{x}\sim P_{data}}[\log D_{d}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim P_{\boldsymbol{z}}}[\log(1-D_{d}(G_{d}(\boldsymbol{z})))]. \qquad (1)
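For concreteness, the following is a minimal PyTorch-style sketch of one alternating update of the minimax game in (1); the scalar-probability discriminator output, the noise dimension, and the optimizers are illustrative assumptions rather than the exact implementation used in our experiments.

```python
# A hedged sketch of one adversarial update for the single-generator game in (1).
import torch

def local_gan_step(G_d, D_d, x_real, opt_G, opt_D, noise_dim=100):
    B = x_real.size(0)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    # (implemented as descending the negated objective).
    z = torch.randn(B, noise_dim)
    x_fake = G_d(z).detach()                      # block gradients into G_d
    d_loss = -(torch.log(D_d(x_real) + 1e-8).mean()
               + torch.log(1.0 - D_d(x_fake) + 1e-8).mean())
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: minimize log(1 - D(G(z))).
    z = torch.randn(B, noise_dim)
    g_loss = torch.log(1.0 - D_d(G_d(z)) + 1e-8).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```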

The key idea of FS-GAN is to use multiple generators $\boldsymbol{G}_{d}=\{G_{d}^{1},G_{d}^{2},\ldots,G_{d}^{M}\}$, rather than a single one, to jointly train a model that can capture the distribution of a mixture of different traffic types in a given local dataset $d$. To address the issue of model collapse, we introduce a classifier network $C_{d}$ that classifies synthetic data produced by the generators. More specifically, the classifier first generates a pseudo-label for each synthetic data sample, indicating which generator it came from. It then minimizes its loss to encourage the generators to create samples that are different from each other. As we prove later in this section, by employing $C_{d}$, each of the generators will synthesize data samples that represent a specific type of traffic, provided that the differences between the distributions of different traffic types are sufficiently large.

Now we focus on the training process of the multi-generator local model. We use the subscript to denote the index of the local dataset and the superscript for the generator index. Suppose a set $\boldsymbol{G}_{d}=\{G_{d}^{1},G_{d}^{2},\ldots,G_{d}^{M}\}$ of $M$ generators has already been assigned to the discriminator $D_{d}$, where $d\in\{1,2,\ldots,N\}$ represents the discriminator index. Each discriminator $D_{d}$ also interacts with a local dataset $X_{d}$, and each generator $G^{m}_{d}$, $m\in\{1,2,\ldots,M\}$, maps the input noise $\boldsymbol{z}$ to $\hat{\boldsymbol{x}}=G^{m}_{d}(\boldsymbol{z})$. The classifier $C_{d}$ is co-trained with $\boldsymbol{G}_{d}$ and $D_{d}$. Let $P_{G^{m}_{d}}$ be the distribution of the synthetic samples produced by $G^{m}_{d}$. We now describe the rule for each generator to interact with the discriminator. We set an index vector $\boldsymbol{v}$, randomly drawn from a predefined multinomial distribution $\boldsymbol{v}\sim{\cal M}(\boldsymbol{\pi})$, where $\boldsymbol{\pi}=\langle\pi_{\boldsymbol{v}}\rangle_{\boldsymbol{v}\in\boldsymbol{G}_{d}}$ is the weighting coefficient of the distribution mixture. By setting $G^{\boldsymbol{v}}_{d}\triangleq\{G^{i}_{d}:i\in\boldsymbol{v}\}$ as a subset of $\boldsymbol{G}_{d}$, we can convert the multi-generator GAN into a standard GAN with a generator vector $G^{\boldsymbol{v}}_{d}$ and discriminator $D_{d}$. We use $P_{data}^{d}$ to denote the real data distribution of dataset $X_{d}$. We can then formulate the local data synthesis model with multiple generators as the following minimax game:

\min_{\boldsymbol{G}_{d},C_{d}}\max_{D_{d}}\;\;\mathbb{E}_{\boldsymbol{x}\sim P_{data}^{d}}[\log D_{d}(\boldsymbol{x})] + \sum_{m=1}^{M}\mathbb{E}_{\boldsymbol{z}\sim P_{z}}[\log(1-D_{d}(G^{m}_{d}(\boldsymbol{z})))] - \lambda\sum_{m=1}^{M}\pi_{m}\,\mathbb{E}_{\hat{\boldsymbol{x}}\sim P_{G^{m}_{d}}}[\log C^{m}_{d}(\hat{\boldsymbol{x}})] \qquad (2)

where $C^{m}_{d}(\hat{\boldsymbol{x}})$ is the output of classifier $C_{d}$, specifying the probability that data sample $\hat{\boldsymbol{x}}$ was generated by $G^{m}_{d}$. The last term in (2) is a standard softmax loss in a multi-classification setting [68]. In other words, minimizing over $C_{d}$ in (2) under a fixed $D_{d}$ minimizes the entropy of $C_{d}$, which leads to increased divergence among the generators. We introduce a diversity hyper-parameter $\lambda>0$ to specify the degree to which the classifier can influence the generators. We compare the model performance under different values of $\lambda$ in Section 5.

Let $\omega_{d}$ be the model parameters of the generators $\boldsymbol{G}_{d}$ associated with discriminator $D_{d}$, and let $\omega_{d}^{m}$ be the model parameters of generator $G^{m}_{d}$, $m\in\{1,\ldots,M\}$. Classifier $C_{d}$ and discriminator $D_{d}$ share the same model parameters except for the last layer. Let $\theta_{d}$ be the model parameters shared between the classifier and the discriminator $D_{d}$. Note that the last layer of $C_{d}$ outputs an automatically learned pseudo-label for each input data sample, while the last layer of $D_{d}$ outputs a binary result. Similar to a standard GAN, we alternately learn $\omega_{d}$ and $\theta_{d}$ using the stochastic gradient descent (SGD) method. The local data synthesis framework is illustrated in Fig. 3.
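To make this parameter sharing concrete, below is a minimal PyTorch-style sketch of a discriminator and classifier built on a common trunk ($\theta_d$) with two separate output heads; the fully connected layers and hidden width are illustrative assumptions and do not reproduce the exact architecture in Table III.

```python
# A hedged sketch of the shared discriminator/classifier network described above.
import torch.nn as nn

class SharedDiscriminatorClassifier(nn.Module):
    def __init__(self, in_dim=2500, hidden=512, num_generators=3):
        super().__init__()
        # Shared trunk: the parameters theta_d used by both D_d and C_d.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        # Discriminator head: probability that the input sample is real.
        self.d_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        # Classifier head: logits over which generator produced a synthetic sample.
        self.c_head = nn.Linear(hidden, num_generators)

    def forward(self, x):
        h = self.trunk(x)
        return self.d_head(h), self.c_head(h)   # (D_d(x), logits for C_d(x))
```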

Let $B$ be the mini-batch size used during the training process. At the beginning of each local training iteration, each discriminator $D_{d}$ samples a mini-batch of $B$ data points $(\boldsymbol{x}_{d}^{(1)},\boldsymbol{x}_{d}^{(2)},\ldots,\boldsymbol{x}_{d}^{(B)})$ from the real data distribution $P_{data}^{d}$. A mini-batch of $B$ synthetic data samples $(\hat{\boldsymbol{x}}_{d}^{(1)},\hat{\boldsymbol{x}}_{d}^{(2)},\ldots,\hat{\boldsymbol{x}}_{d}^{(B)})$ is also created by a mixture of generators $G_{d}^{\boldsymbol{v}}$ in $\boldsymbol{G}_{d}$. This mixture can be generated according to a predefined sampling probability $\pi_{m}$. During each training iteration, we update $C_{d}$, $D_{d}$, and $G^{\boldsymbol{v}}_{d}$ along the gradients of their loss functions. According to (2), we can write the loss functions under the aforementioned settings as follows:

\mathcal{L}(C_{d},\hat{\boldsymbol{x}}_{d}) = -\frac{1}{B}\sum_{b=1}^{B}\log C^{m}_{d}\big(\hat{\boldsymbol{x}}_{d}^{(b)}\big) \qquad (3a)
\mathcal{L}(D_{d},\boldsymbol{x}_{d},\hat{\boldsymbol{x}}_{d}) = -\frac{1}{B}\sum_{b=1}^{B}\Big(\log D_{d}\big(\boldsymbol{x}_{d}^{(b)}\big) + \log\big(1-D_{d}\big(\hat{\boldsymbol{x}}_{d}^{(b)}\big)\big)\Big) \qquad (3b)
\mathcal{L}(G^{\boldsymbol{v}}_{d},\hat{\boldsymbol{x}}_{d}) = -\frac{1}{B}\sum_{b=1}^{B}\Big(\log D_{d}\big(\hat{\boldsymbol{x}}_{d}^{(b)}\big) + \lambda\log C^{m}_{d}\big(\hat{\boldsymbol{x}}_{d}^{(b)}\big)\Big) \qquad (3c)
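For illustration, a minimal PyTorch-style sketch of how the mini-batch losses (3a)-(3c) can be computed is given below, assuming the shared discriminator/classifier network sketched earlier; the helper is a simplified assumption, and in practice the three losses drive the alternating updates of Algorithm 1.

```python
# A hedged sketch of the mini-batch losses in (3a)-(3c). `gen_ids` stores, for each
# synthetic sample, the index of the generator that produced it (its pseudo-label).
import torch
import torch.nn.functional as F

def local_losses(dc_net, x_real, x_fake, gen_ids, lam=1.0):
    d_real, _ = dc_net(x_real)                 # D_d on real samples
    d_fake, c_logits = dc_net(x_fake)          # D_d and C_d on synthetic samples

    # (3a) classifier loss: softmax cross-entropy against the generator index.
    loss_C = F.cross_entropy(c_logits, gen_ids)

    # (3b) discriminator loss over real and synthetic samples.
    loss_D = -(torch.log(d_real + 1e-8).mean()
               + torch.log(1.0 - d_fake + 1e-8).mean())

    # (3c) generator loss: fool D_d while remaining classifiable by C_d.
    log_c = F.log_softmax(c_logits, dim=1).gather(1, gen_ids.unsqueeze(1))
    loss_G = -(torch.log(d_fake + 1e-8).mean() + lam * log_c.mean())
    return loss_C, loss_D, loss_G
```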

We summarize the detailed procedure in Algorithm 1.

Algorithm 1 Local Data Synthesis

Input: Local dataset $X_d$; noise distribution $P_{\boldsymbol{z}}$; sampling probability $\pi_m$;
Output: Updated local parameters $\omega_d$ of the generators $\boldsymbol{G}_d$; updated local parameters $\theta_d$ of the discriminator $D_d$ and classifier $C_d$;
Initialize: $\lambda>0$; mini-batch size $B$; network parameters $\omega_d$ and $\theta_d$; maximum number of iterations $I$;

for each iteration $i\leq I$

  • Generators $\boldsymbol{G}_d$ do:

    1. Select generators $G^{\boldsymbol{v}}_d$ according to $\pi_m$;
    2. Each generator $G^m_d\in G^{\boldsymbol{v}}_d$ simultaneously does:
       • Sample a noise input variable $\boldsymbol{z}$ from $P_{\boldsymbol{z}}$;
       • Synthesize a batch of data samples $\hat{\boldsymbol{x}}_d^{(b)}$, $b\in\{1,2,\ldots,B\}$;
    3. Send the synthesized data samples $\hat{\boldsymbol{x}}_d^{(b)}$ to the discriminator $D_d$ and classifier $C_d$;
    4. Update $\omega_d$ by descending along the gradient of the loss function: $\nabla_{\omega_d}\mathcal{L}(G^{\boldsymbol{v}}_d)$;

  • Discriminator $D_d$ do:

    1. Sample a batch of data $\boldsymbol{x}_d^{(b)}$, $b\in\{1,2,\ldots,B\}$, from the local dataset $X_d$;
    2. Distinguish $\boldsymbol{x}_d^{(b)}$ from $\hat{\boldsymbol{x}}_d^{(b)}$;
    3. Update $\theta_d$ by descending along the gradient of the loss functions: $\nabla_{\theta_d}\left(\mathcal{L}(C_d)+\mathcal{L}(D_d)\right)$;

  • Classifier $C_d$ do:

    1. Classify the synthesized samples $\hat{\boldsymbol{x}}_d^{(b)}$ with respect to different generators;
    2. Update $\theta_d$ by descending along the gradient of the loss functions: $\nabla_{\theta_d}\left(\mathcal{L}(C_d)+\mathcal{L}(D_d)\right)$;

  • $i=i+1$;

Return locally trained parameters

4.2 Global Model Coordination

In the previous section, the generators associated with a local discriminator produce synthetic data samples that capture the distributions of different traffic types in the local dataset. However, different local datasets may have different mixtures of traffic. It is therefore desirable to train a global model that can capture the diverse traffic across all local datasets. We adopt an FL-based framework to coordinate the local model training procedures without requiring any exchange of local datasets. One of the key ideas in FL is to coordinate different local model training procedures via their model parameters. In one of the popular FL solutions, called FedAvg, each edge server periodically uploads its locally trained models to a global coordinator. The uploaded models are then averaged to form an updated global model, which is broadcast to all edge servers.

Formally, let $\Omega^{\prime}$ be the subset of discriminators in coordination scheme C-II (or discriminator-and-generator pairs in scheme C-I) that has been selected for uploading the local models. Note that $\Omega^{\prime}\subseteq\Omega$. Let $\omega_{0}$ and $\theta_{0}$ be the corresponding global model parameters, and let $p_{d}$ be the weight of the $d$th local dataset such that $p_{d}\geq 0$ and $\sum_{d=1}^{N}p_{d}=1$. We can write the global model coordination procedure as follows:

\omega_{0}=\sum_{d\in\Omega^{\prime}}p_{d}\,\omega_{d}\;\mbox{ and }\;\theta_{0}=\sum_{d\in\Omega^{\prime}}p_{d}\,\theta_{d},\;\;\mbox{if Scheme C-I};\qquad \theta_{0}=\sum_{d\in\Omega^{\prime}}p_{d}\,\theta_{d},\;\;\mbox{if Scheme C-II}. \qquad (6)
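As a simple illustration, the weighted aggregation in (6) can be sketched as follows, assuming each edge server reports its parameters as a PyTorch state_dict together with its weight $p_d$; the function names are hypothetical.

```python
# A hedged sketch of the global model coordination step in (6).
import copy

def weighted_average(state_dicts, weights):
    """Weighted average of a list of PyTorch state_dicts (weights sum to 1)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(w * sd[key].float() for sd, w in zip(state_dicts, weights))
    return avg

def coordinate(theta_locals, omega_locals, weights, scheme="C-I"):
    theta_0 = weighted_average(theta_locals, weights)      # D_d / C_d parameters
    # Scheme C-I also aggregates the generator parameters; C-II does not.
    omega_0 = weighted_average(omega_locals, weights) if scheme == "C-I" else None
    return theta_0, omega_0
```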
Algorithm 2 Global Model Coordination (Scheme C-I)

Input: Locally trained parameters $\theta_d$ of $D_d$ and $C_d$; locally trained parameters $\omega_d$ of $\boldsymbol{G}_d$;

Output: Aggregated parameter $\theta_0$ for the global discriminator $D_0$ and classifier $C_0$; aggregated parameter $\omega_0$ for the global generators $\boldsymbol{G}_0$;

Initialize: Number of local datasets $N$; maximum number of communication rounds $J$;

for each communication round $j\leq J$

  • The global model coordinator do:

    1. Calculate the aggregated parameters $\theta_0$ and $\omega_0$ according to equation (6):
       $\theta_0$, $\omega_0$ $\leftarrow$ Local Data Synthesis($X_d$, $\theta_d$, $\omega_d$);
    2. Broadcast $\theta_0$ and $\omega_0$ to the local discriminators, classifiers, and generators, respectively;

  • Each local model do:

    • Initialize the local parameters $\theta_d$ and $\omega_d$ by:
      $\theta_d\leftarrow\theta_0$, $\omega_d\leftarrow\omega_0$;

  • $j=j+1$;

end for

Algorithm 3 Global Model Coordination (Scheme C-II)

Input: Locally trained parameters $\theta_d$ of $D_d$ and $C_d$;
Output: Aggregated parameter $\theta_0$ for the global discriminator $D_0$ and classifier $C_0$;
Initialize: Number of local datasets $N$; maximum number of communication rounds $J$;

for each communication round $j\leq J$

  • The global model coordinator do:

    1. Calculate the aggregated parameter $\theta_0$ according to the C-II case of equation (6):
       $\theta_0$ $\leftarrow$ Local Data Synthesis($X_d$, $\theta_d$);
    2. Broadcast $\theta_0$ to the local discriminators and classifiers;

  • Each local model do:

    • Initialize the local parameter $\theta_d$ by:
      $\theta_d\leftarrow\theta_0$;

  • $j=j+1$;

end for

4.3 Self-Supervised Learning

As mentioned in the previous procedures, the classifier shares the same network parameters with the local discriminator. Therefore, this classifier can be updated in the global model coordination procedure to classify and create pseudo-labels for the synthetic samples according to their corresponding generators. This classifier will also be able to classify the real data associated with all traffic types in all datasets. Since this classifier is a part of the training process of FS-GAN and is trained based on the pseudo-labeled synthetic datasets, it does not introduce any extra training efforts.

4.4 Implementation Complexity

FS-GAN reuses the model parameters of the discriminators and therefore does not introduce any new computational complexity compared to a traditional federated learning implementation of GANs. The complexity of implementing FS-GAN is related to two types of implementation cost: communication cost and computational cost. The computational complexity of FS-GAN mainly comes from training a set of local generative and discriminative models in parallel based on SGD, i.e., the total number of local SGD rounds performed by the edge servers is given by $O(MKT)$. The communication cost depends on the model size and the number of coordination rounds. In other words, FS-GAN introduces novel capabilities and addresses the issues of the traditional federated GAN without introducing any extra computational or communication complexity.

4.5 Theoretical Analysis

We analyze FS-GAN from the following two aspects: data synthesis ability and classification accuracy across local models. Consistent with our previous notation, we use $P_{data}^{d}$ to denote the mixture distribution of real data samples in dataset $X_{d}$, and use $P_{model}^{d}$ to denote the distribution of the combination of synthetic data samples generated by generators $G^{1}_{d},G^{2}_{d},\ldots,G^{M}_{d}$. First, we provide the optimal solution for classifier $C_{d}$ in Equation (2).

Proposition 1

For a set of generators $\boldsymbol{G}$ and a given sampling probability $\pi_{m}$, the optimal distribution learned by classifier ${C_{d}}^{*}(\boldsymbol{x};\theta_{d})$ has the following form:

{C^{m}_{d}}^{*}(\boldsymbol{x};\theta_{d}) = \frac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}, \quad m\in\{1,2,\ldots,M\} \qquad (7)

Proof:

See Appendix A. ∎

The output of the optimal classifier ${C_{d}}^{*}(\boldsymbol{x};\theta_{d})$ can be seen as a weighted normalization of the different synthesized sample distributions.

We follow a commonly adopted setting and use the Jensen-Shannon Divergence (JSD) [69] to quantify the difference between two distributions. For the optimal generator ${\boldsymbol{G}_{d}}^{*}(\boldsymbol{x};\omega_{d})$, we have the following proposition:

Proposition 2

Given the optimal discriminator and classifier, the objective of the generators is to minimize:

{\boldsymbol{G}_{d}}^{*}(\boldsymbol{x};\omega_{d}) = \mathop{\arg\min}\limits_{\boldsymbol{G}}\Big(2\cdot\mbox{JSD}(P_{data}^{d}\,\|\,P_{model}^{d}) - \lambda\cdot\mbox{JSD}_{\pi_{1},\pi_{2},\ldots,\pi_{M}}(P_{G^{1}_{d}},P_{G^{2}_{d}},\ldots,P_{G^{M}_{d}})\Big). \qquad (8)

Proof:

See Appendix B. ∎

It can be observed that the two terms on the right-hand side of (8) correspond, respectively, to the difference between distributions $P_{data}^{d}$ and $P_{model}^{d}$, and the negative of the divergence among the synthetic data generated by different generators. In other words, minimizing this objective directly minimizes the difference between $P_{data}^{d}$ and $P_{model}^{d}$ while maximizing the differences among the generators.
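For intuition, the following small numerical sketch shows how the JSD used here behaves for discrete distributions (e.g., packet-value histograms); it is only an illustration and not part of the FS-GAN training procedure.

```python
# A hedged sketch of the Jensen-Shannon divergence for discrete distributions.
import numpy as np

def jsd(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give JSD = 0; disjoint supports give log 2 ≈ 0.693,
# i.e., the two extremes the mixture term is pushed toward and the diversity
# term is pushed away from.
print(jsd([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # ~0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))            # ~0.693
```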

Before presenting the main theorem, we make the following assumption:

Assumption 1

The real data distribution $P_{data}^{d}$ of dataset $X_{d}$ can be written in the form:

P_{data}^{d}(\boldsymbol{x}) = \sum_{m=1}^{M}\pi_{m}\,p_{d}^{m}(\boldsymbol{x}), \qquad (9)

where for any given $\boldsymbol{x}$, if $p_{d}^{m}(\boldsymbol{x})>0$ for some $m\in\{1,2,\ldots,M\}$, then $p_{d}^{m^{\prime}}(\boldsymbol{x})=0$ for all $m^{\prime}\neq m$.

Equation (9) means that the real data distribution is a mixture of $M$ separable distributions $p_{d}^{m}(\boldsymbol{x})$, $m=1,2,\ldots,M$. We can prove the following theorem:

Theorem 1

Suppose that Assumption 1 holds. At the equilibrium point of the minimax game defined in equation (2), the optimal classifier and generators satisfy the following equations:

{C_{d}^{m}}^{*}(\boldsymbol{x}) \rightarrow \begin{cases}1, & \text{if } \boldsymbol{x}\sim P_{G_{d}^{m}},\\ 0, & \text{if } \boldsymbol{x}\sim P_{G_{d}^{m^{\prime}}} \text{ for } m^{\prime}\neq m,\end{cases} \qquad (10a)
P_{G_{d}^{m}}^{*}(\boldsymbol{x}) = p_{d}^{m}(\boldsymbol{x}), \quad \forall m=1,2,\ldots,M, \qquad (10b)
P_{model}^{d}(\boldsymbol{x}) = P_{data}^{d}(\boldsymbol{x}). \qquad (10c)

Proof:

See Appendix C. ∎

We can observe from (10a) that the classifier can correctly differentiate synthetic samples produced by different generators, i.e., the probability that the classifier identifies the correct generator $G_{d}^{m}$ that produced a given synthetic sample $\boldsymbol{x}$ approaches one. As shown in (10b) and (10c), synthetic samples produced by the optimal generator ${G_{d}^{m}}^{*}$ follow the same distribution as $p_{d}^{m}$. Also, all the generators weighted by $\pi_{m}$ together synthesize samples that match the data distribution of the real dataset.

In cases where Assumption 1 is not satisfied, Theorem 1 can be considered an upper bound for the local data synthesis and classification performance. Existing works, as well as our own experiments, show that in many practical scenarios, even if Assumption 1 does not hold, FS-GAN can still create high-quality synthesized data with high classification accuracy. We will come back to this issue in Section 5.

So far, we have analyzed the optimal solution of the local model. Next, we focus on the convergence of the global model. In the global model coordination procedure, we use Federated Averaging (FedAvg) [65], a widely adopted algorithm in federated learning, to periodically aggregate local models. More specifically, suppose that each participating local model performs $I$ local updates after receiving the latest global model. Let $T$ be the total number of local iterations performed by each client and let $B$ be the mini-batch size. Let $\mathcal{L}_{d}(\boldsymbol{\varepsilon},\xi_{t}^{d})$, $d\in\{1,2,\ldots,N\}$, be the loss function of the $d$th local model in the $t$th iteration with network parameters $\boldsymbol{\varepsilon}$ and a mini-batch $\xi_{t}^{d}$ of $B$ samples. We define $\mathcal{L}_{d}(\boldsymbol{\varepsilon})$ as the expected value of $\mathcal{L}_{d}(\boldsymbol{\varepsilon},\xi_{t}^{d})$, $t=1,\ldots,T$, i.e., $\mathcal{L}_{d}(\boldsymbol{\varepsilon})=\mathbb{E}[\mathcal{L}_{d}(\boldsymbol{\varepsilon},\xi_{t}^{d})]$. Consider the following optimization model:

\min_{\boldsymbol{\varepsilon}}\Big\{\mathcal{L}(\boldsymbol{\varepsilon})\triangleq\sum_{d=1}^{N}p_{d}\,\mathcal{L}_{d}(\boldsymbol{\varepsilon})\Big\}. \qquad (11)

Theoretical guarantees of (11) can be obtained by following the same approach as [70, 71, 43].

Theorem 1 shows that the proposed FS-GAN addresses two fundamental issues of federated learning and self-supervised GANs. In particular, compared to federated learning, which requires all the decentralized datasets to share the same combination of data features, FS-GAN allows edge servers with different combinations of service traffic to jointly construct a shared model that can identify and synthesize the service traffic of all the edge servers. Also, FS-GAN addresses the model collapse problem of GANs by reusing the model trained by the discriminator. Furthermore, as described in Section 4.4, FS-GAN does not introduce any new components or increase the complexity compared to the traditional federated GAN. This further justifies the flexibility and practicality of FS-GAN in practical networking systems.

5 Performance Evaluation

In this section, we conduct extensive experiments using real-world traffic datasets to verify the two key capabilities of FS-GAN proved in Theorem 1: the classification of unknown service traffic and the synthesis of different types of service traffic. In the rest of this section, we first introduce the setup of our simulations in Section 5.1 and then present detailed simulation results to evaluate the classification performance and traffic data synthesis performance of FS-GAN in Sections 5.2 and 5.3, respectively.

5.1 Simulation Setup

To evaluate the performance of FS-GAN, we consider the real-world "VPN-nonVPN dataset" (ISCXVPN2016)[26], which provides a highly mixed traffic data flow consisting of multiple types of unknown services. Within this dataset, we select data samples associated with 10 popular services, summarized in Table II. As our focus is on the classification of valid information, we remove the PHY and MAC headers from the packets and limit the length of each payload to the same size of 2500 bytes so as to achieve a proper trade-off between training efficiency and classification accuracy.

TABLE II: Dataset Used for Evaluation of FS-GAN
Service #\# of Packets Service #\# of Packets
Email 32,566 Youtube 12,738
ftps (upload) 47,795 sftps (upload) 107,234
SCP (download) 85,018 Skype Video 140,569
Facebook Audio 91,815 Skype Chat 53,996
Tor-Youtube 54,294 VPN-Vimeo 215,102
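The payload preprocessing described above can be sketched as follows; the header-stripping offset and the byte normalization are illustrative assumptions rather than the exact parsing code used for ISCXVPN2016.

```python
# A hedged sketch of the packet preprocessing: strip headers, then truncate or
# zero-pad each payload to a fixed 2500-byte vector scaled to [0, 1].
import numpy as np

PAYLOAD_LEN = 2500

def packet_to_vector(raw_bytes: bytes, header_len: int = 0) -> np.ndarray:
    payload = raw_bytes[header_len:][:PAYLOAD_LEN]            # drop headers, truncate
    vec = np.zeros(PAYLOAD_LEN, dtype=np.float32)
    vec[:len(payload)] = np.frombuffer(payload, dtype=np.uint8)
    return vec / 255.0                                         # normalize byte values
```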

We conduct our experiments on a workstation with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, 64.0 GB RAM @ 2133 MHz, a 2 TB HD, and two NVIDIA Corporation GP102 [TITAN X] GPUs. Each edge server is simulated with a TITAN X GPU running Ubuntu 16.04, Python 3.6, CUDA 10.0, and PyTorch 1.3.1. The training process between edge servers is simulated using the PySyft framework [72]. We combine the packets of all selected traffic types into one trace and equally divide the data into six local datasets, each being assigned to a discriminator (an edge server). The network architecture and parameter settings for the generator, discriminator, and classifier are summarized in Table III.

TABLE III: Simulation Setup
Architecture Setup | Network Type
 | Generator | Discriminator | Classifier
Input Size | 100×1 | 2500×1 | 2500×1
Output Size | 2500×1 | 2×1 | # of generators
Activation Function | LeakyReLU | Sigmoid | LeakyReLU
# of Layers | 8 | 6 | shared with discriminator
Optimizer | Adam with $\beta_1=0.5$ and $\beta_2=0.9$
TABLE IV: Clustering Performance of FS-GAN and State-of-the-art Schemes
Scheme | Training Loss | Training Method | RI | NMI | ACC
K-means++ | Mean Square Error | Iterative Method | 0.71 | 0.30 | 0.32
DEC | Reconstruction and Cluster Assignment | Pretraining and Fine-tuning | 0.82 | 0.39 | 0.39
DCN | Reconstruction and K-means Loss | Pretraining and Joint Training | 0.81 | 0.31 | 0.35
IDEC | Reconstruction and Cluster Assignment | Pretraining and Joint Training | 0.85 | 0.42 | 0.40
FS-GAN | Adversarial and Classification Loss | Jointly Adversarial Training | 0.89 | 0.62 | 0.58
TABLE V: Model Training Time and Clustering Performance under Different Schemes
N n E B Time per round (sec) RI NMI ACC
(where two values are listed for a metric, they correspond to schemes C-I and C-II, respectively)
1 12,000 4 100 31.19 0.91 0.64 0.63
6 2,000 4 100 34.32 39.16 0.89 0.89 0.62 0.43 0.58 0.57
3 2,000 4 100 34.72 35.43 0.89 0.88 0.61 0.42 0.60 0.58
6 2,000 4 100 34.32 39.16 0.89 0.89 0.62 0.43 0.58 0.57
9 2,000 4 100 38.03 37.21 0.89 0.88 0.60 0.40 0.63 0.57
6 500 4 100 10.14  11.36 0.89 0.88 0.62 0.41 0.60 0.51
6 1000 4 100 18.33 18.14 0.89 0.89 0.60 0.47 0.56 0.55
6 2000 4 100 34.32 39.16 0.89 0.89 0.62 0.43 0.58 0.57
6 2000 2 100 20.32 19.73 0.89 0.88 0.62 0.43 0.60 0.53
6 2000 4 100 34.32 39.16 0.89 0.89 0.62 0.43 0.58 0.57
6 2000 8 100 68.15 75.32 0.89 0.90 0.64 0.47 0.61 0.54
6 2000 4 50 61.17 62.73 0.90 0.89 0.63 0.46 0.57 0.60
6 2000 4 100 34.32 39.16 0.89 0.89 0.62 0.43 0.58 0.57
6 2000 4 200 21.50 24.13 0.89 0.89 0.57 0.43 0.57 0.57
Refer to caption
Figure 4: Clustering results of 10 service traffic data using: (a) FS-GAN, (b) K-means++, and (c) IDEC.
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 5: Clustering performance of FS-GAN-I and FS-GAN-II based on real-world traffic datasets.
Refer to caption
(a)
Refer to caption
(b)
Figure 6: Clustering performance of FS-GAN and SD-GAN based on real-world traffic datasets.

5.2 Classification Performance

As mentioned earlier, compared to existing clustering solutions, FS-GAN introduces a novel approach to model training by autonomously creating synthetic data with pseudo-labels. In this subsection, we compare the clustering performance of FS-GAN with that of existing clustering solutions. We consider three commonly adopted performance metrics:

(1) Rand Index (RI): measures the fraction of samples pairs being correctly clustered, including similar samples being classified into the same category and different samples being classified into different categories. Formally, RI is defined as follows:

RI=TP+TNTP+TN+FN+FP,\displaystyle\mbox{RI}=\frac{\mbox{TP}+\mbox{TN}}{\mbox{TP}+\mbox{TN}+\mbox{FN}+\mbox{FP}}, (12)

where TP is the number of same-type packet pairs assigned to the same service type, TN is the number of different-type packet pairs assigned to different service types, FN is the number of same-type packet pairs assigned to different service types, and FP is the number of different-type packet pairs assigned to the same service type. RI evaluates the clustering performance by judging whether samples of the same type fall into the same category. In particular, the value of RI is always greater than or equal to 0, and it approaches 1 as FN and FP approach zero.
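A minimal pair-counting sketch of Equation (12) is given below; it assumes integer cluster and class labels and is meant only to make the pair-based definition concrete (recent versions of scikit-learn also provide an equivalent rand_score function).

from itertools import combinations

def rand_index(true_labels, cluster_labels):
    # Count packet pairs on which the clustering agrees with the ground truth:
    # TP (same type, same cluster) and TN (different type, different cluster).
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = 0
    for i, j in pairs:
        same_type = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_type == same_cluster:
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0: relabeling the clusters does not matter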

(2) Normalized Mutual Information (NMI): measures the reduction in uncertainty about the ground-truth classes when the clustering result is given. Let $U$ denote the ground-truth class assignment and $V$ the labels produced by the clustering algorithm. NMI is defined as:

\mbox{NMI}(U,V)=\dfrac{2\,\mbox{MI}(U,V)}{H(U)+H(V)},    (13)

where $\mbox{MI}(U,V)$ is the mutual information between $U$ and $V$, and $H(\cdot)$ denotes the entropy.
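As a quick illustration (not the paper's implementation), scikit-learn's normalized_mutual_info_score realizes Equation (13) when the arithmetic average of the two entropies is used for normalization:

from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 1, 1, 2, 2]   # U: true service types
clusters     = [1, 1, 0, 0, 2, 2]   # V: labels from the clustering algorithm
# With average_method="arithmetic", this equals 2*MI(U,V) / (H(U) + H(V)).
print(normalized_mutual_info_score(ground_truth, clusters,
                                   average_method="arithmetic"))   # 1.0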

(3) Unsupervised Accuracy (ACC): measures the best one-to-one matching between the labels produced by the clustering solution and the ground-truth classes. ACC is defined as:

\mbox{ACC}=\frac{1}{n}\max_{\sigma}\sum_{i=1}^{n}\mathrm{1}_{y_{i}=\sigma(x_{i})},    (14)

where $x_{i}$ and $y_{i}$ denote, respectively, the cluster label and the ground-truth class of data sample $i$, and $\sigma$ ranges over all possible one-to-one mappings between cluster labels and class labels.
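Since enumerating all permutations $\sigma$ is infeasible for many classes, the maximization in Equation (14) is usually solved as an assignment problem. The hedged sketch below uses the Hungarian method from SciPy; the helper name is ours.

import numpy as np
from scipy.optimize import linear_sum_assignment

def unsupervised_accuracy(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    k = int(max(true_labels.max(), cluster_labels.max())) + 1
    # count[c, t]: number of samples placed in cluster c whose true class is t
    count = np.zeros((k, k), dtype=int)
    for c, t in zip(cluster_labels, true_labels):
        count[c, t] += 1
    rows, cols = linear_sum_assignment(-count)   # best one-to-one matching sigma
    return count[rows, cols].sum() / len(true_labels)

print(unsupervised_accuracy([0, 0, 1, 1, 2], [2, 2, 0, 0, 1]))   # 1.0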

Note that each of the above three metrics takes values between 0 and 1; the larger the value, the better the clustering performance.

In Table IV, we compare FS-GAN with four clustering solutions: K-means++, DEC [73], IDEC [74], and DCN [75]. FS-GAN achieves the best performance among all tested solutions in all three metrics. DEC, IDEC, and DCN use deep neural networks to recover latent representations for clustering; compared to K-means++, which directly searches for cluster centers (centroids) of the data samples, such deep-learning-based solutions often achieve better performance. FS-GAN takes a different approach: it first generates synthetic data samples with pseudo-labels and then trains a classifier on the resulting labeled dataset. In this sense, it addresses the poor performance caused by the lack of labeled data and improves performance when the quality of the pseudo-labeled synthetic samples is high.

To compare FS-GAN with a state-of-the-art self-supervised learning solution, we consider a recent extension of the DEC algorithm, called semi-supervised DEC (SDEC) [76], which incorporates semi-supervised information learned from labeled data samples into DEC to further improve the clustering performance. We use SDEC as the pretext task and then feed the pseudo-labeled data samples into the self-supervised GAN algorithm to train the model; we refer to this approach as SD-GAN. Note that, compared to FS-GAN, SD-GAN requires labeled datasets to learn prior information to guide the learning process, and it also requires an extra neural network model to perform the data clustering and pseudo-labeling. In Fig. 6, we compare FS-GAN with SD-GAN under different setups and different numbers of training rounds. We can observe that FS-GAN-I outperforms SD-GAN, while FS-GAN-II achieves similar performance in terms of NMI.

To evaluate the impact of model training parameters on the computational load of FS-GAN, Table V reports the per-round running time of FS-GAN in a fixed hardware and software environment, together with the clustering metrics after 50 rounds of model training. We mainly evaluate the impact of four key parameters: the number of local datasets $N$, the size $n$ of each dataset, the number of local training steps $E$ between two consecutive global model coordinations, and the mini-batch size $B$. The results obtained with a single centralized dataset are also presented. We can see that both the training time and the performance of FS-GAN are slightly degraded compared with the centralized setting, because in the centralized case the model aggregation procedure is not needed. It can be observed that the computational load is most sensitive to the size of the dataset. Compared to Scheme C-I, Scheme C-II generally results in a slightly higher computational load for training. This is because, in Scheme C-I, both generators and discriminators are coordinated, which accelerates convergence of the model training and reduces the overall computation time. The improved performance offered by Scheme C-I over C-II is further reflected in the NMI and ACC metrics. More specifically, C-I always exhibits more accurate clustering than C-II when the number of training rounds is fixed. C-I and C-II offer similar performance in terms of RI. This is because RI measures the combined correctness over all clustering results, rather than the best performance among individual classes as NMI and ACC do. Therefore, the performance difference in RI achieved by different schemes is always smaller than that of the other two metrics.
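To make the roles of $N$, $E$, and $B$ concrete, the following PyTorch-style sketch outlines one global round: each of the $N$ edge servers performs $E$ local adversarial training steps with mini-batch size $B$, after which the edge models are averaged. The trainer attributes (local_step, generators, discriminator) are hypothetical placeholders; only the fact that Scheme C-I coordinates both generators and discriminators is taken from the text, and which modules Scheme C-II averages is left as a configurable assumption here.

import copy
import torch

def federated_average(models):
    # FedAvg-style coordination: average parameters across edge servers and
    # push the averaged model back to every server.
    avg_state = copy.deepcopy(models[0].state_dict())
    for key in avg_state:
        avg_state[key] = torch.stack(
            [m.state_dict()[key].float() for m in models]).mean(dim=0)
    for m in models:
        m.load_state_dict(avg_state)

def training_round(edge_servers, E, B, coordinate_discriminators=True):
    for server in edge_servers:               # local data synthesis on each server
        for _ in range(E):
            server.local_step(batch_size=B)   # one adversarial + classification update
    # Global model coordination; Scheme C-I averages generators and discriminators.
    federated_average([s.generators for s in edge_servers])       # e.g., an nn.ModuleList
    if coordinate_discriminators:
        federated_average([s.discriminator for s in edge_servers])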

In Fig. 4, we present the clustering results when FS-GAN is used to classify the 10 services in Table II. Once again, FS-GAN delivers superior performance to IDEC and K-means++. Note that since K-means++ directly separates data samples based on the centroid, it tends to cluster most of the samples into a single or a limited number of clusters.

TABLE VI: Performance under Different Values of the Diversity Parameter λ
Scheme | Metric | λ = 0.1 | λ = 0.5 | λ = 5
FS-GAN (C-I) | RI | 0.89 | 0.89 | 0.90
FS-GAN (C-I) | NMI | 0.60 | 0.60 | 0.64
FS-GAN (C-I) | ACC | 0.57 | 0.58 | 0.60
FS-GAN (C-II) | RI | 0.88 | 0.89 | 0.88
FS-GAN (C-II) | NMI | 0.43 | 0.47 | 0.40
FS-GAN (C-II) | ACC | 0.52 | 0.57 | 0.51
Figure 7: Distribution of the assigned labels under different schemes.
TABLE VII: Quality of Synthesized Data Under Different Scenarios
Metric | F-GAN (C-I), Three-G / Six-G | F-GAN (C-II), Three-G / Six-G | FS-GAN (C-I), Three-G / Six-G | FS-GAN (C-II), Three-G / Six-G
KL Divergence | 5.643 / 0.360 | 0.226 / 0.224 | 0.091 / 0.043 | 0.188 / 0.099
JS Divergence | 0.086 / 0.062 | 0.054 / 0.051 | 0.022 / 0.011 | 0.041 / 0.025
Wasserstein Distance | 0.647 / 0.619 | 1.013 / 0.578 | 0.353 / 0.286 | 0.484 / 0.387

In Fig. 5, we compare the convergence performance in terms of NMI under different model parameters. We observe that C-I always offers better convergence performance than C-II, which is consistent with the previous observation. For a fixed number of local datasets, the convergence performance is closely correlated with the size of each local dataset: the larger the dataset, the faster the model training converges. Note that the fluctuations in NMI are due to the interaction between generators and discriminators in the local data synthesis, as well as the global model coordination.

As mentioned earlier, the classifier helps increase the diversity of the synthesized samples produced by the generators. At the same time, it interferes with the adversarial training process between the generators and discriminators. We investigate the impact of the classifier on the data clustering performance by adjusting the diversity hyper-parameter λ in Equation (2). Intuitively, when the penalty imposed by the classifier becomes larger than that of the discriminator, the generators are encouraged to produce more diversified samples even if these samples do not follow the same distribution as the real data. In Table VI, we present the clustering performance of FS-GAN for different values of λ. We can observe that neither a very large nor a very small value of λ typically offers the best performance; the optimal λ varies with the model setup. Finding the optimal λ that strikes the right balance between classification and discrimination is left for future work.
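The role of λ can be seen from the following sketch of a per-generator loss consistent with Equation (2): an adversarial term that pushes synthetic samples toward the real distribution, minus λ times the log-probability that the classifier attributes the samples to the generator that produced them. This is a hedged PyTorch-style illustration of the trade-off, not the exact training code used in our experiments.

import torch
import torch.nn.functional as F

def generator_loss(d_fake_logits, c_logits, generator_index, lam):
    # Adversarial term: synthetic samples should look real to the discriminator.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # Diversity term: the classifier should recognize which generator made them.
    log_probs = F.log_softmax(c_logits, dim=1)
    diversity = log_probs[:, generator_index].mean()
    # Larger lam favors easily separable (more diverse) generators;
    # smaller lam emphasizes realism of the individual samples.
    return adv - lam * diversity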

5.3 Performance Evaluation of Traffic Data Synthesis

As mentioned earlier, FS-GAN is more than just a traffic classification approach. It can also learn from the decentralized datasets and produce synthetic data samples that capture the distribution of real data associated with unknown services. In this subsection, we evaluate this aspect of FS-GAN.

We compare the data synthesis capability of FS-GAN to various extensions of multi-generator GANs with FL, which we refer to as F-GAN. In particular, we compare the following four schemes:

  • 1) F-GAN (three-G): three generators for each dataset, without the classifier.
  • 2) FS-GAN (three-G): three generators for each dataset.
  • 3) F-GAN (six-G): six generators for each dataset, without the classifier.
  • 4) FS-GAN (six-G): six generators for each dataset.

As mentioned earlier, one of the key challenges of GAN-based solutions is the possibility of mode collapse, in which case different generators produce samples with limited variety regardless of the input. In Fig. 7, we evaluate the diversity of the synthesized data samples produced by different schemes by comparing the numbers of synthetic samples associated with each service when the same input is used. It can be observed that different generators tend to produce different numbers of synthetic samples. For example, among all six types of services, SCP (download) is the least frequently synthesized service; e.g., less than 0.02% of the total synthesized samples produced by F-GAN (C-I, six-G) fall into this service category. We observe, however, that FS-GAN, especially the C-I scheme with six generators, produces a relatively uniform number of samples across services. This means that FS-GAN offers the best performance in terms of balancing synthesized data samples.

To quantify the diversity of the synthesized samples, we use statistical distance metrics that measure the difference between the distribution of the synthetic data samples and that of the real service traffic data. For two distributions $p(x)$ and $q(x)$, we consider three distance metrics: the Kullback-Leibler (KL) divergence, the Jensen-Shannon (JS) divergence [69], and the Wasserstein (W) distance [77]. Table VII provides the distance results under both FS-GAN and F-GAN. FS-GAN achieves the smallest distance between synthetic and real data; the JS divergence under FS-GAN (C-I) is as low as 0.011, almost one fifth of the lowest distance achieved by F-GAN. This means that the proposed FS-GAN is well suited to classifying and synthesizing highly heterogeneous traffic flows with mixed services.
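As an illustration of how these metrics could be computed on traffic data, the sketch below estimates the KL and JS divergences from histograms of a scalar feature (e.g., normalized byte values) and the Wasserstein distance directly from samples; the binning and feature choice are assumptions, not the paper's exact procedure.

import numpy as np
from scipy.stats import entropy, wasserstein_distance

def distribution_distances(real, synthetic, bins=50):
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(synthetic, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-12, q + 1e-12        # avoid zero bins
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = entropy(p, q)                                  # KL(p || q)
    js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)      # JS divergence
    w = wasserstein_distance(real, synthetic)           # 1-D Wasserstein distance
    return kl, js, w

rng = np.random.default_rng(0)
print(distribution_distances(rng.normal(0.0, 1.0, 10000),
                             rng.normal(0.1, 1.0, 10000)))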

6 Conclusions

In this paper, we proposed FS-GAN, a federated self-supervised learning framework to automatically recognize, classify, and synthesize different types of traffic over a large number of decentralized datasets. FS-GAN consists of three components: local data synthesis, global model coordination, and self-supervised learning. We adopted an FL-based approach and proved that the jointly trained global model simultaneously minimizes the JSD between the distribution of real data across all the datasets and that of the synthesized data samples, while maximizing the JSD among the distributions of data samples created by different generators. Simulation results show that FS-GAN achieves significant performance improvement over state-of-the-art data clustering solutions, and almost a five-fold improvement over federated GAN solutions in terms of data synthesis diversity.

References

  • [1] K. Pentikousis, Y. Wang, and W. Hu, “Mobileflow: Toward software-defined mobile networks,” IEEE Communications Magazine, vol. 51, no. 7, pp. 44–53, Jul. 2013.
  • [2] V. Yazıcı, U. C. Kozat, and M. O. Sunay, “A new control plane for 5G network architecture with a case study on unified handoff, mobility, and routing management,” IEEE Communications Magazine, vol. 52, no. 11, pp. 76–85, Nov. 2014.
  • [3] M. Agiwal, A. Roy, and N. Saxena, “Next Generation 5G Wireless Networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 1617–1655, Feb. 2016.
  • [4] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of things: A survey on enabling technologies, protocols, and applications,” IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 2347–2376, Jun. 2015.
  • [5] Y. Xiao, M. Krunz, and T. Shu, “Multi-operator network sharing for massive iot,” IEEE Communications Magazine, vol. 57, no. 4, pp. 96–101, Apr. 2019.
  • [6] T. Luettel, M. Himmelsbach, and H. Wuensche, “Autonomous ground vehicles—concepts and a path to the future,” Proceedings of the IEEE, vol. 100, pp. 1831–1839, Apr. 2012.
  • [7] “ITU 2019 network 2030,” Geneva, Jul. 2018. [Online]. Available: https://www.itu.int/en/ITU-T/focusgroups/net2030/Pages/default.aspx
  • [8] I. FG-NET2030, “New services and capabilities for network 2030: description, technical gap and performance target analysis,” Oct. 2019.
  • [9] Y. Xiao and M. Krunz, “Distributed optimization for energy-efficient fog computing in the Tactile Internet,” IEEE J. Sel. Areas Commun., vol. 36, no. 11, pp. 2390–2400, Nov. 2018.
  • [10] G. Shi, Y. Xiao, Y. Li, and X. Xie, “From semantic communication to semantic-aware networking: Model, architecture, and open problems,” IEEE Commun. Magazine, vol. 59, no. 8, pp. 44–50, Aug. 2021.
  • [11] Y. Xiao, Z. Sun, G. Shi, and D. Niyato, “Imitation learning-based implicit semantic-aware communication networks: Multi-layer representation and collaborative reasoning,” IEEE J. Sel. Areas Commun., vol. 41, no. 3, March 2023.
  • [12] Y. Xiao, G. Shi, Y. Li, W. Saad, and H. V. Poor, “Towards self-learning edge intelligence in 6G,” IEEE Communications Magazine, vol. 58, no. 12, Dec. 2020.
  • [13] M. Reiter and R. Steinberg, “Forward contracts for complementary segments of a communication network,” in IEEE INFOCOM, San Diego, USA, Mar. 2010, pp. 1–9.
  • [14] Y. Xiao and M. Krunz, “AdaptiveFog: A modelling and optimization framework for fog computing in intelligent transportation systems,” IEEE Transactions on Mobile Computing, vol. 21, no. 12, pp. 4187–4200, Dec. 2022.
  • [15] X. Wang, S. Chen, and J. Su, “App-net: A hybrid neural network for encrypted mobile traffic classification,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 424–429.
  • [16] M. H. Almannaa, M. Elhenawy, and H. A. Rakha, “A novel supervised clustering algorithm for transportation system applications,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 1, pp. 222–232, Jan. 2020.
  • [17] J. Zhang, F. Li, F. Ye, and H. Wu, “Autonomous unknown-application filtering and labeling for dl-based traffic classifier update,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 397–405.
  • [18] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [19] X. Wang, S. Zhang, Z. Qing, Y. Shao, C. Gao, and N. Sang, “Self-supervised learning for semi-supervised temporal action proposal,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021.
  • [20] N. Araslanov and S. Roth, “Self-supervised augmentation consistency for adapting semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021.
  • [21] Y. LeCun, “Self-supervised learning.”   Keynote Presentation at AAAI 2020 conference, Feb. 2020.
  • [22] S. B. Baker, W. Xiang, and I. Atkinson, “Internet of things for smart healthcare: Technologies, challenges, and opportunities,” IEEE Access, vol. 5, pp. 26 521–26 544, 2017.
  • [23] D. Perepelkin and M. Ivanchikova, “Problem of network traffic classification in multiprovider cloud infrastructures based on machine learning methods,” Budva, Jun. 2021.
  • [24] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  • [25] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
  • [26] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, and A. A. Ghorbani, “Characterization of encrypted and vpn traffic using time-related features,” in ICISSP, Portugal, Feb. 2016, pp. 407–414.
  • [27] P. Tang, Y. Dong, and S. Mao, “Online traffic classification using granules,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 1135–1140.
  • [28] C. Liu, L. He, G. Xiong, Z. Cao, and Z. Li, “Fs-net: A flow sequence network for encrypted traffic classification,” in IEEE INFOCOM, Paris, France, Apr. 2019, pp. 1171–1179.
  • [29] E. Areström and N. Carlsson, “Early online classification of encrypted traffic streams using multi-fractal features,” in IEEE INFOCOM, Paris, France, Apr. 2019, pp. 84–89.
  • [30] E. Liang, H. Zhu, X. Jin, and I. Stoica, “Neural packet classification,” Proceedings of the ACM Sigcomm, Beijing, China, Aug. 2019.
  • [31] R. Li, X. Xiao, S. Ni, H. Zheng, and S. Xia, “Byte segment neural network for network traffic classification,” in IEEE ACM IWQoS, Alberta, Canada, Jun. 2018, pp. 1–10.
  • [32] A. Nascita, A. Montieri, G. Aceto, D. Ciuonzo, V. Persico, and A. Pescapé, “Xai meets mobile traffic classification: Understanding and improving multimodal deep learning architectures,” IEEE Transactions on Network and Service Management, 2021.
  • [33] A. M. Sadeghzadeh, S. Shiravi, and R. Jalili, “Adversarial network traffic: Towards evaluating the robustness of deep-learning-based network traffic classification,” IEEE Transactions on Network and Service Management, vol. 18, no. 2, pp. 1962–1976, 2021.
  • [34] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Distiller: Encrypted traffic classification via multimodal multitask deep learning,” Journal of Network and Computer Applications, vol. 183, p. 102985, 2021.
  • [35] O. Aouedi, K. Piamrat, and B. Parrein, “Ensemble-based deep learning model for network traffic classification,” IEEE Transactions on Network and Service Management, pp. 1–12, 2022.
  • [36] D. Jin, Y. Lu, J. Qin, Z. Cheng, and Z. Mao, “Swiftids: Real-time intrusion detection system based on lightgbm and parallel intrusion detection mechanism,” Computers & Security, vol. 97, pp. 1–12, 2020.
  • [37] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning: Concept and applications,” ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 2, pp. 1–19, Jan. 2019.
  • [38] H. Wen, Y. Wu, C. Yang, H. Duan, and S. Yu, “A unified federated learning framework for wireless communications: towards privacy, efficiency, and security,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 653–658.
  • [39] H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on non-iid data with reinforcement learning,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 1698–1707.
  • [40] Q. Li, Z. Wen, and B. He, “Federated learning systems: Vision, hype and reality for data privacy and protection,” ArXiv: 1907.09693, Jul. 2019.
  • [41] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, May 2020.
  • [42] J. Shi, H. Zhao, M. Wang, and Q. Tian, “Signal recognition based on federated learning,” in IEEE INFOCOM, Virtual conference, Jul. 2020, pp. 1105–1110.
  • [43] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of fedavg on non-iid data,” ICLR, New Orleans, Apr. 2019.
  • [44] X. Zhu, N. Shu, H. Wang, and T. Wu, “A distributed traffic classification model based on federated learning,” in IEEE International Conference on Big Data Computing and Communications (BigCom), 2021, pp. 75–81.
  • [45] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, June 2019.
  • [46] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, “Cost-effective federated learning design,” in IEEE INFOCOM, Virtual conference, 2021.
  • [47] Y. Xiao, Y. Li, G. Shi, and H. V. Poor, “Optimizing resource efficiency for federated edge intelligence in iot networks,” in International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, Oct. 2020.
  • [48] Y. Xiao, X. Zhang, Y. Li, G. Shi, M. Krunz, D. N. Nguyen, and D. T. Hoang, “Time-sensitive learning for heterogeneous federated edge intelligence,” accepted at IEEE Transactions on Mobile Computing, Jan. 2023.
  • [49] Y. Xiao, Y. Li, G. Shi, and H. V. Poor, “Reasoning on the air: An implicit semantic communication architecture,” in Proc. of the IEEE ICC Workshop on Data Driven Intelligence for Networks and Systems, Seoul, South Korea, May 2022.
  • [50] Y. Xiao, X. Zhang, Y. Li, G. Shi, and T. Basar, “Rate-distortion theory for strategic semantic communication,” in Proc. of the IEEE Information Theory Workshop, Mumbai, India, Nov. 2022.
  • [51] L. Zhang, J. Xu, P. Vijayakumar, P. K. Sharma, and U. Ghosh, “Homomorphic encryption-based privacy-preserving federated learning in iot-enabled healthcare system,” IEEE Transactions on Network Science and Engineering, pp. 1–17, 2022.
  • [52] S. Lu, Z. Gao, Q. Xu, C. Jiang, A. Zhang, and X. Wang, “Class-imbalance privacy-preserving federated learning for decentralized fault diagnosis with biometric authentication,” IEEE Transactions on Industrial Informatics, pp. 1–11, 2022.
  • [53] C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “Cutpaste: Self-supervised learning for anomaly detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2021.
  • [54] K. Sun, Z. Lin, and Z. Zhu, “Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, New York, USA, Feb. 2020.
  • [55] A. Saeed, F. D. Salim, T. Ozcelebi, and J. Lukkien, “Federated self-supervised learning of multisensor representations for embedded intelligence,” IEEE Internet of Things Journal, vol. 8, no. 2, pp. 1030–1040, 2021.
  • [56] J. Z. Bengar, J. van de Weijer, B. Twardowski, and B. Raducanu, “Reducing label effort: Self-supervised meets active learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1631–1639.
  • [57] Y. Zhang, J. Wang, Y. Chen, H. Yu, and T. Qin, “Adaptive memory networks with self-supervised learning for unsupervised anomaly detection,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–13, 2022.
  • [58] Y. Shi, N. Siddharth, P. Torr, and A. R. Kosiorek, “Adversarial masking for self-supervised learning,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162, 17–23 Jul 2022, pp. 20 026–20 040.
  • [59] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” in NIPS, Montreal, Canada, Dec. 2014, pp. 2672–2680.
  • [60] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” in AAAI conference on artificial intelligence, San Francisco, California USA, Feb. 2017.
  • [61] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets,” in NIPS, Barcelona, Spain, Dec. 2016, pp. 2172–2180.
  • [62] Q. Hoang, T. D. Nguyen, T. Le, and D. Phung, “MGAN: Training Generative Adversarial Nets with Multiple Generators,” in ICLR, Vancouver, Canada, May 2018.
  • [63] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, and P. K. Dokania, “Multi-agent Diverse Generative Adversarial Networks,” in ICPR, Beijing, China, Aug. 2018, pp. 8513–8521.
  • [64] H. Zhang, S. Xu, J. Jiao, P. Xie, R. Salakhutdinov, and E. P. Xing, “Stackelberg GAN: Towards Provable Minimax Equilibrium via Multi-Generator Architectures,” Nov. 2018.
  • [65] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in AISTATS, vol. 54, Ft. Lauderdale, USA, Apr. 2017, pp. 1273–1282.
  • [66] “Release 16,” Jul. 2020. [Online]. Available: https://www.3gpp.org/release-16
  • [67] N. Mangrulkar, A. Bhagat Patil, and A. Pande, “Network attacks and their detection mechanisms: A review,” International Journal of Computer Applications, vol. 90, Feb. 2014.
  • [68] K. Duan, S. S. Keerthi, W. Chu, S. K. Shevade, and A. N. Poo, “Multi-category classification by soft-max combination of binary classifier,” in Multiple Classifier Systems, Berlin, Heidelberg, Jun. 2003, pp. 125–134.
  • [69] J. Lin, “Divergence measures based on the shannon entropy,” IEEE Transactions on Information theory, vol. 37, no. 1, pp. 145–151, Sep. 1991.
  • [70] M. Rasouli, T. Sun, and R. Rajagopal, “FedGAN: Federated Generative Adversarial Networks for Distributed Data,” arXiv preprint arXiv: 2006.07228, Jun. 2020.
  • [71] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” ArXiv preprint arXiv: 1806.00582, Jun. 2018.
  • [72] T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert, and J. Passerat-Palmbach, “A generic framework for privacy preserving deep learning,” arXiv preprint arXiv:1811.04017, 2018.
  • [73] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in ICML, NY, USA, Jun. 2016, pp. 478–487.
  • [74] X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering with local structure preservation.” in IJCAI, Melbourne, Australia, Aug. 2017, pp. 1753–1759.
  • [75] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in ICML, Sydney, Australia, Aug. 2017, pp. 3861–3870.
  • [76] Y. Ren, K. Hu, X. Dai, L. Pan, S. C. Hoi, and Z. Xu, “Semi-supervised deep embedded clustering,” Neurocomputing, vol. 325, pp. 121–130, 2019.
  • [77] S. Vallender, “Calculation of the wasserstein distance between probability distributions on the line,” Theory of Probability & Its Applications, vol. 18, no. 4, pp. 784–786, Feb. 1973.
  • [78] J. Lin, “Divergence measures based on the shannon entropy,” IEEE Trans. Inf. Theor., vol. 37, no. 1, Jan. 1991.
[Uncaptioned image] Yong Xiao (Senior Member, IEEE) received his B.S. degree in electrical engineering from China University of Geosciences, Wuhan, China in 2002, M.Sc. degree in telecommunication from Hong Kong University of Science and Technology in 2006, and his Ph. D degree in electrical and electronic engineering from Nanyang Technological University, Singapore in 2012. He is now a professor in the School of Electronic Information and Communications at the Huazhong University of Science and Technology (HUST), Wuhan, China. He is also with Peng Cheng Laboratory, Shenzhen, China and Pazhou Laboratory (Huangpu), Guangzhou, China. He is the associate group leader of the network intelligence group of IMT-2030 (6G promoting group) and the vice director of 5G Verticals Innovation Laboratory at HUST. His research interests include machine learning, game theory, distributed optimization, and their applications in semantic communications and semantic-aware networks, cloud/fog/mobile edge computing, green communication systems, wireless communication networks, and Internet-of-Things (IoT).
[Uncaptioned image] Rong Xia received her B.S. degree and M.Sc. degree both in information and communication engineering from Huazhong University of Science and Technology, Wuhan, China in 2018 and 2022, respectively. Her research interests include federated edge intelligence and edge computing networks.
[Uncaptioned image] Yingyu Li (Member, IEEE) received the B.Eng. degree in electronic information engineering and the Ph.D. degree in circuits and systems from the Xidian University, Xi’an, China, in 2012 and 2018, respectively. From 2014 to 2016, she was a Research Scholar with the Department of Electronic Computer Engineering at the University of Houston, TX, USA. She was a postdoctoral researcher in the School of Electronic Information and Communications at Huazhong University of Science and Technology from 2018 to 2021. She is now an associate professor at the School of Mechanical Engineering and Electronic Information, China University of Geosciences (Wuhan). Her research interests include semantic communications, edge intelligence, green communication networks, and IoT.
[Uncaptioned image] Guangming Shi (Fellow, IEEE) received the M.S. degree in computer control and the Ph.D. degree in electronic information technology from Xidian University, Xi’an, China, in 1988, and 2002, respectively. He was the vice president of Xidian University from 2018 to 2022. Currently, he is the Vice Dean of Peng Cheng Laboratory and a Professor with the School of Artificial Intelligence, Xidian University. He is an IEEE Fellow, the chair of IEEE CASS Xi’an Chapter, senior member of ACM and CCF, Fellow of Chinese Institute of Electronics, and Fellow of IET. He was awarded Cheung Kong scholar Chair Professor by the ministry of education in 2012. He won the second prize of the National Natural Science Award in 2017. His research interests include Artificial Intelligence, Semantic Communications, and Human-Computer Interaction.
[Uncaptioned image] Diep N. Nguyen (Senior Member, IEEE) received the M.E. degree in electrical and computer engineering from the University of California at San Diego (UCSD), La Jolla, CA, USA, in 2008, and the Ph.D. degree in electrical and computer engineering from The University of Arizona (UA), Tucson, AZ, USA, in 2013. He is currently a Faculty Member with the Faculty of Engineering and Information Technology, University of Technology Sydney (UTS), Sydney, NSW, Australia. Before joining UTS, he was a DECRA Research Fellow with Macquarie University, Macquarie Park, NSW, Australia, and a Member of the Technical Staff with Broadcom Corporation, San Jose, CA, USA, and ARCON Corporation, Boston, MA, USA, and consulting the Federal Administration of Aviation, Washington, DC, USA, on turning detection of UAVs and aircraft, and the U.S. Air Force Research Laboratory, Wright-Patterson Air Force Base, OH, USA, on anti-jamming. His research interests include computer networking, wireless communications, and machine learning application, with emphasis on systems’ performance and security/privacy. Dr. Nguyen received several awards from LG Electronics, UCSD, UA, the U.S. National Science Foundation, and the Australian Research Council. He is currently an Editor, an Associate Editor of the IEEE Transactions on Mobile Computing, IEEE Communications Surveys & Tutorials (COMST), IEEE Open Journal of the Communications Society, and Scientific Reports (Nature’s).
[Uncaptioned image] Dinh Thai Hoang (Senior Member, IEEE) is currently a faculty member at the School of Electrical and Data Engineering, University of Technology Sydney, Australia. He received his Ph.D. in Computer Science and Engineering from the Nanyang Technological University, Singapore, in 2016. His research interests include emerging wireless communications and networking topics, especially machine learning applications in networking, edge computing, and cybersecurity. He has received several awards, including the Australian Research Council and IEEE TCSC Award for Excellence in Scalable Computing (Early Career Researcher). He is an Editor of IEEE Transactions on Wireless Communications, IEEE Transactions on Cognitive Communications and Networking, IEEE Transactions on Vehicular Technology, and Associate Editor of IEEE Communications Surveys & Tutorials.
[Uncaptioned image] Dusit Niyato (Fellow, IEEE) is a professor in the School of Computer Science and Engineering, at Nanyang Technological University, Singapore. He received B.Eng. from King Mongkuts Institute of Technology Ladkrabang (KMITL), Thailand in 1999 and Ph.D. in Electrical and Computer Engineering from the University of Manitoba, Canada in 2008. His research interests are in the areas of Internet of Things (IoT), machine learning, and incentive mechanism design.
[Uncaptioned image] Marwan Krunz (Fellow, IEEE) is a Regents Professor at the University of Arizona. He holds the Kenneth VonBehren Endowed Professorship in ECE and is also a professor of computer science. He directs the Broadband Wireless Access and Applications Center (BWAC), a multi-university NSF/industry center that focuses on next-generation wireless technologies. He also holds a courtesy appointment as a professor at University Technology Sydney. Previously, he served as the site director for Connection One, an NSF/industry-funded center of five universities and 20+ industry affiliates. Dr. Krunz’s research is in the fields of wireless communications, networking, and security, with recent focus on applying AI and machine learning techniques for protocol adaptation, resource management, and signal intelligence. He has published more than 320 journal articles and peer-reviewed conference papers, and is a named inventor on 12 patents. His latest h-index is 60. He is an IEEE Fellow, an Arizona Engineering Faculty Fellow, and an IEEE Communications Society Distinguished Lecturer (2013-2015). He received the NSF CAREER award. He served as the Editor-in-Chief for the IEEE Transactions on Mobile Computing. He also served as editor for numerous IEEE journals. He was the TPC chair for INFOCOM’04, SECON’05, WoWMoM’06, and Hot Interconnects 9. He was the general vice-chair for WiOpt 2016 and general co-chair for WiSec’12. Dr. Krunz served as chief scientist/technologist for two startup companies that focus on 5G and beyond wireless systems.

Appendix A Proof of Proposition 1

Proof:

According to the multi-player zero-sum game in Equation (2), the classifier $C_{d}$ tries to maximize the objective $V(C)$, where

\begin{split}V(C)=&\int_{\boldsymbol{x}}\sum_{m=1}^{M}\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})\log C^{m}_{d}({\boldsymbol{x}})dx\\ =&\int_{\boldsymbol{x}}\pi_{1}P_{G^{1}_{d}}(\boldsymbol{x})\log(1-\sum_{m=2}^{M}C^{m}_{d}({\boldsymbol{x}}))dx\\ &+\int_{\boldsymbol{x}}\sum_{m=2}^{M}\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})\log C^{m}_{d}({\boldsymbol{x}})dx.\end{split}    (15)

To obtain the optimal classifier, we first compute the functional derivative with respect to $C^{m}_{d}$, $m=2,3,\ldots,M$:

\frac{\partial V(C)}{\partial C^{m}_{d}(\boldsymbol{x})}=\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{C^{m}_{d}(\boldsymbol{x})}-\dfrac{\pi_{1}P_{G^{1}_{d}}(\boldsymbol{x})}{C^{1}_{d}(\boldsymbol{x})},\quad\text{where }C^{1}_{d}(\boldsymbol{x})=1-\sum_{m=2}^{M}C^{m}_{d}(\boldsymbol{x}).    (16)

Then, by setting the above derivative to zero for all $m\in\{2,3,\ldots,M\}$, we obtain:

\dfrac{\pi_{1}P_{G^{1}_{d}}(\boldsymbol{x})}{C^{1}_{d}(\boldsymbol{x})}=\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{C^{m}_{d}(\boldsymbol{x})},\quad\forall m\in\{2,3,\ldots,M\}.    (17)

Since all training samples used by the classifier come from the generators, we have $\sum_{m=1}^{M}C^{m}_{d}(\boldsymbol{x})=1$, which yields the optimal classifier $C_{d}^{m*}$:

{C^{m}_{d}}^{*}(\boldsymbol{x};\theta_{d})=\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})},\quad m\in\{1,2,\ldots,M\}.    (18)

This concludes the proof. ∎

Appendix B Proof of Proposition 2

Proof:

Let the optimal discriminator be $D_{d}^{*}$. We directly use the result from [59] that $D_{d}^{*}=\dfrac{P_{data}^{d}(\boldsymbol{x})}{P_{data}^{d}(\boldsymbol{x})+P_{model}^{d}(\boldsymbol{x})}$. Denote the value function of the generators by $V(\boldsymbol{G})$. According to the multi-player zero-sum game in (2), given the optimal discriminator $D_{d}^{*}$ above and the optimal classifier $C_{d}^{*}$ from Proposition 1, the generators try to minimize:

\begin{split}V({\boldsymbol{G}}(\boldsymbol{x};\omega_{d}))=&\ \mathbb{E}_{\boldsymbol{x}\sim P_{data}^{d}}[\log\dfrac{P_{data}^{d}(\boldsymbol{x})}{P_{data}^{d}(\boldsymbol{x})+P_{model}^{d}(\boldsymbol{x})}]\\ &+\mathbb{E}_{\boldsymbol{x}\sim P_{model}^{d}}[\log\dfrac{P_{model}^{d}(\boldsymbol{x})}{P_{data}^{d}(\boldsymbol{x})+P_{model}^{d}(\boldsymbol{x})}]\\ &-\lambda\{\sum_{m=1}^{M}\pi_{m}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{m}_{d}}}[\log\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}]\}.\end{split}    (19)

The first two terms on the right-hand side of (19) are the same as in standard GANs [59] and equal $2\cdot\mbox{JSD}(P_{data}^{d}\|P_{model}^{d})-\log 4$. We therefore focus on the last term of this equation to clarify the effect of the auxiliary classifier.

In light of the conclusion in standard GANs, we can formulate (19) as follows:

\begin{split}V({\boldsymbol{G}}(\boldsymbol{x};\omega_{d}))=&\ 2\cdot\mbox{JSD}(P_{data}^{d}\|P_{model}^{d})-\log 4\\ &-\lambda\{\sum_{m=1}^{M}\pi_{m}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{m}_{d}}}[\log\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}]\}\end{split}    (20)

To present the proof more clearly, we use $-\lambda\hat{V}$ to denote the last term in (20), and let $\hat{P_{G}}(\boldsymbol{x})=\sum_{m=1}^{M}\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})$. Then we have:

\begin{split}\hat{V}=&\ \sum_{m=1}^{M}\pi_{m}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{m}_{d}}}[\log\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}]\quad\text{(21a)}\\ =&\ \sum_{m=1}^{M}\pi_{m}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{m}_{d}}}[\log\dfrac{P_{G^{m}_{d}}(\boldsymbol{x})}{\hat{P_{G}}(\boldsymbol{x})}]+\sum_{m=1}^{M}\pi_{m}\log\pi_{m}\quad\text{(21b)}\\ =&\ \pi_{1}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{1}_{d}}}[\log\dfrac{P_{G^{1}_{d}}(\boldsymbol{x})}{\hat{P_{G}}(\boldsymbol{x})}]+\ldots+\pi_{M}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{M}_{d}}}[\log\dfrac{P_{G^{M}_{d}}(\boldsymbol{x})}{\hat{P_{G}}(\boldsymbol{x})}]+\sum_{m=1}^{M}\pi_{m}\log\pi_{m}.\quad\text{(21c)}\end{split}

Using the standard definition of the KL divergence, the above formula can be expressed as the sum of multiple KL divergences and the negative entropy of $\boldsymbol{\pi}$, that is:

\begin{split}\hat{V}=&\ \pi_{1}\cdot\mbox{KL}(P_{G^{1}_{d}}(\boldsymbol{x})\|\hat{P_{G}})+\ldots+\pi_{M}\cdot\mbox{KL}(P_{G^{M}_{d}}(\boldsymbol{x})\|\hat{P_{G}})+\sum_{m=1}^{M}\pi_{m}\log\pi_{m}\quad\text{(22a)}\\ =&\ \sum_{m=1}^{M}\pi_{m}\mbox{KL}(P_{G^{m}_{d}}(\boldsymbol{x})\|\hat{P_{G}})+\sum_{m=1}^{M}\pi_{m}\log\pi_{m}\quad\text{(22b)}\\ =&\ \mbox{JSD}_{\pi_{1},\pi_{2},\ldots,\pi_{M}}(P_{G^{1}_{d}},P_{G^{2}_{d}},\ldots,P_{G^{M}_{d}})+\sum_{m=1}^{M}\pi_{m}\log\pi_{m},\quad\text{(22c)}\end{split}

where $\mbox{JSD}_{\pi_{1},\pi_{2},\ldots,\pi_{M}}(P_{G^{1}_{d}},P_{G^{2}_{d}},\ldots,P_{G^{M}_{d}})$ is the generalized Jensen-Shannon divergence [78]. Therefore, we can rewrite the value function of the generators in equation (19) as:

\begin{split}V({\boldsymbol{G}}(\boldsymbol{x};\omega_{d}))=&\ 2\cdot\mbox{JSD}(P_{data}^{d}\|P_{model}^{d})-\log 4-\lambda\hat{V}\quad\text{(23a)}\\ =&\ 2\cdot\mbox{JSD}(P_{data}^{d}\|P_{model}^{d})-\lambda\cdot\mbox{JSD}_{\pi_{1},\pi_{2},\ldots,\pi_{M}}(P_{G^{1}_{d}},P_{G^{2}_{d}},\ldots,P_{G^{M}_{d}})\\ &\ -\lambda\sum_{m=1}^{M}\pi_{m}\log\pi_{m}-\log 4.\quad\text{(23b)}\end{split}

Ignoring the last two constant terms, we finally obtain the objective function of the generators in equation (2). Therefore, optimizing the generators is equivalent to minimizing the JS divergence between $P_{data}^{d}$ and $P_{model}^{d}$ while maximizing the differences among the generators. This concludes our proof. ∎

Appendix C Proof of Theorem 1

Proof:

To further simplify the objective function of the generators in Proposition 2, we first recast the expression for $\hat{V}$ in (21):

\hat{V}=\sum_{m=1}^{M}\pi_{m}\mathbb{E}_{\boldsymbol{x}\sim P_{G^{m}_{d}}}[\log\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}].    (24)

It can be observed that $\hat{V}\leq 0$, with equality only if $\dfrac{\pi_{m}P_{G^{m}_{d}}(\boldsymbol{x})}{\sum_{i=1}^{M}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})}=1$ wherever $P_{G^{m}_{d}}(\boldsymbol{x})>0$. Therefore, $P_{G^{m}_{d}}(\boldsymbol{x})\neq 0$ implies $\sum_{i\neq m}\pi_{i}P_{G^{i}_{d}}(\boldsymbol{x})=0$, i.e., $P_{G^{i}_{d}}(\boldsymbol{x})=0$ for all $i\neq m$. In this case, the distributions of different generators are well-separated without overlapping. Further, we can rewrite the objective function of the generators in Proposition 2 under $\hat{V}=0$ as:

{{\boldsymbol{G}}_{d}}^{*}(\boldsymbol{x};\omega_{d})=\mathop{\arg\min}\limits_{{\boldsymbol{G}}}2\cdot\mbox{JSD}(P_{data}^{d}\|P_{model}^{d}).    (25)

This is a much simpler formulation whose optimal solution can be obtained directly, that is, $P_{data}^{d}=P_{model}^{d}$. When Assumption 1 holds, which constrains the real data distribution to be a mixture of $M$ components $p_{d}^{m}(\boldsymbol{x})$, $m=1,2,\ldots,M$, we obtain the results in (10b) and (10c), that is:

P_{G_{d}^{m}}^{*}(\boldsymbol{x})=p_{d}^{m}(\boldsymbol{x}),\;\forall m=1,2,\ldots,M,    (26a)
P_{model}^{d}(\boldsymbol{x})=\sum_{m=1}^{M}\pi_{m}P_{G_{d}^{m}}^{*}(\boldsymbol{x})=P_{data}^{d}(\boldsymbol{x}).    (26b)

We have thus shown that, at the equilibrium point of the multi-player zero-sum game, the generators synthesize a mixture of traffic types with the same distribution as the real data in the local dataset. To evaluate the performance of the classifier, we substitute the well-separated optimal solution $P_{G^{m}_{d}}$ into Proposition 1. It can be seen that ${C^{m}_{d}}^{*}$ equals $1$ when $\boldsymbol{x}$ is drawn from $P_{G^{m}_{d}}(\boldsymbol{x})$, and $0$ otherwise. Therefore, the optimal classifier correctly differentiates all the samples produced by different generators. Formally, we have:

{C_{d}^{m}}^{*}(\boldsymbol{x})=\begin{cases}1,&\boldsymbol{x}\sim P_{G_{d}^{w}},\;w=m\\ 0,&\boldsymbol{x}\sim P_{G_{d}^{w}},\;w\neq m\end{cases}\quad\forall m=1,2,\ldots,M.    (27)

Moreover, when $P_{data}^{d}=P_{model}^{d}$, i.e., the generators synthesize high-quality samples, the classifier is also able to classify real samples from the local dataset, e.g., labeling the different local services. This concludes our proof. ∎
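A small numerical check of Equation (18) and Equation (27), under the assumption that the generator distributions are three well-separated one-dimensional Gaussians (chosen here only for illustration), is the following:

import numpy as np
from scipy.stats import norm

pis = np.array([0.2, 0.3, 0.5])                       # mixture weights pi_m
components = [norm(-4, 1), norm(0, 1), norm(4, 1)]    # stand-ins for P_{G^m_d}

x = np.linspace(-8, 8, 5)
weighted = np.array([pi * comp.pdf(x) for pi, comp in zip(pis, components)])
posterior = weighted / weighted.sum(axis=0)           # Equation (18)

print(posterior.sum(axis=0))    # each column sums to 1
print(posterior.round(3))       # nearly 0/1 outputs for well-separated components,
                                # matching the indicator form in Equation (27)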