
Fed-TGAN: Federated Learning Framework for Synthesizing Tabular Data

Zilong Zhao (TU Delft, Delft, Netherlands), Robert Birke (ABB Corporate Research, Dättwil, Switzerland), Aditya Kunar (TU Delft, Delft, Netherlands), and Lydia Y. Chen (TU Delft, Delft, Netherlands)
Abstract.

Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images and more recently tabular data, under the assumption of directly accessible training data. Federated learning (FL) is an emerging paradigm that features decentralized learning on clients' local data with privacy-preserving capabilities. While learning GANs to synthesize images on FL systems has just been demonstrated, it is unknown if GANs for tabular data can be learned from decentralized data sources, and it remains unclear which distributed architecture suits them best. Different from image GANs, state-of-the-art tabular GANs require prior knowledge of the data distribution of each (discrete and continuous) column to agree on a common encoding – risking privacy guarantees. In this paper, we propose Fed-TGAN, the first federated learning framework for tabular GANs. To effectively learn a complex tabular GAN on non-identical participants, Fed-TGAN designs two novel features: (i) a privacy-preserving multi-source feature encoding for model initialization; and (ii) table-similarity aware weighting strategies to aggregate local models for countering data skew. We extensively evaluate the proposed Fed-TGAN against variants of decentralized learning architectures on four widely used datasets. Results show that Fed-TGAN accelerates training time per epoch by up to 200% compared to the alternative architectures, for both IID and Non-IID data. Overall, Fed-TGAN not only stabilizes the training loss, but also achieves better similarity between generated and original data.

Tabular GAN, federated learning, table data, Non-IID
conference: CIKM; 2021; Australia

1. Introduction

Figure 1. Challenge of initializing the column distribution of a tabular model, using the age column of the Adult dataset as an example: (1) original data; (2) skewed 1% sample used to build the VGM encoder; (3) data generated by a model using the VGM encoder built on the skewed sample; (4) for comparison, data generated by a model using a VGM encoder built from the original data.

Generative Adversarial Networks (GANs) (gan, ) are an emerging methodology to synthesize data, ranging from images (stylegan, ; stylegan2, ), to text (semeniuta2018accurate, ), to tables (tablegan, ; ctgan, ). GANs train two competing neural networks, i.e., a generator and a discriminator, where the former iteratively generates synthetic data and the latter judges its quality. During the training process, the discriminator needs to access the original data and provide feedback to the generator by comparing it with the generated data. However, such a privilege of direct data access may no longer be taken for granted due to the ever increasing concern for data privacy. For instance, training a medical image generator (asyndgan, ) from multiple hospitals refrains from centralized data processing and calls for decentralized and privacy-preserving learning solutions.

In response to such a demand, the federated learning (FL) paradigm emerges. FL features decentralized local processing, under which machine learning (ML) models are first trained on clients' local data in parallel and subsequently securely aggregated by the federator. As such, the local data is not directly accessed, except by its owner, and only intermediate models are shared. The key design choices when constructing an FL framework for GANs depend on how to effectively distribute the training of the generator and discriminator networks across data sources. On the one hand, discriminators are typically located on clients' premises due to the need to process the clients' data. On the other hand, the prior art explores disparate approaches to training image generators: centrally at the server (mdgan, ) or locally at the clients (fegan, ). While tabular data is the most dominant data type in industries (arik2019tabnet, ), there is no prior study on training GANs for tabular data under the FL paradigm.

Training state-of-the-art tabular GANs, e.g., CTGAN (ctgan, ), from decentralized data sources in a privacy-preserving manner presents multiple additional challenges compared to image GANs. They are closely related to how current tabular GANs explicitly model each column, be it a continuous or categorical variable, via data-dependent coding schemes and statistical distributions. Hence, the first challenge is to unify the encoding schemes across data sources that are non-identically independently distributed (Non-IID), and to do so in a privacy-preserving manner (privacy-preserving solutions here refer to ones that do not require full knowledge of the local data). Secondly, the convergence speed of GANs critically depends on how local models are merged (jill_fed, ). For image GANs (fedgan, ), the merging weights are determined jointly by the data quantity and the (dis)similarity of class distributions across clients. Beyond that, tabular GANs need a more fine-grained (dis)similarity mechanism for deciding merging weights, i.e., one that considers differences in every column across clients.

In this paper, we aim to design a federated learning framework, Fed-TGAN, that allows training tabular GAN models from decentralized clients. In the Fed-TGAN architecture, (i) each client trains its generator and discriminator networks using its local data and (ii) the federator aggregates the generators and discriminators. We also propose two algorithmic features that address the fine-grained per-column modeling in a privacy-preserving manner. First, the novel feature encoding scheme of Fed-TGAN reconstructs the entire column distribution by bootstrapping each client's partial information. Secondly, a more precise weighting scheme effectively merges local models by considering the quantity and distribution dissimilarity of every column across all clients. We design and implement a first-of-its-kind federated learning framework for tabular GANs using the PyTorch RPC framework.

We extensively evaluate Fed-TGAN on a vast number of client scenarios with disparate data distributions. Specifically, Fed-TGAN is compared with three architecture baselines: (1) a centralized approach, (2) vanilla federated learning and (3) a multiple-discriminator architecture (mdgan, ) comprising a single generator and multiple discriminators. The evaluation is performed on four commonly used machine learning datasets, where the statistical similarity between generated and original data is reported as the evaluation metric. Our results show that Fed-TGAN remarkably reduces training time per epoch compared to the multi-discriminator solution by up to 200%. Additionally, under an unbalanced amount of local data among the clients, Fed-TGAN converges much faster than vanilla federated learning. And, for scenarios where data across clients is not independently and identically distributed, the convergence of Fed-TGAN is not only stable, but also provides better similarity between generated and real data. The main contributions of this study can be summarized as follows:

  • We design and prototype a one-of-a-kind federated learning framework for the decentralized learning of tabular GANs (i.e. CTGAN) on distributed clients’ data.

  • We create a privacy preserving feature encoding method, which allows the federator to build the global feature encoders (either for categorical or continuous columns) without accessing local data.

  • We design a table-similarity aware weighting scheme for merging local models, which is shown to achieve a faster convergence speed when the data quantity and data quality are highly imbalanced among clients.

  • We extensively evaluate Fed-TGAN to synthesize four widely used tabular datasets on the prototype testbed. Across various client scenarios, Fed-TGAN shows remarkably high similarity to the original data while also converging faster than vanilla FL and MD-GAN.

2. Preliminary and Motivation

Figure 2. Privacy-preserving frameworks for distributed GANs: (a) Centralized GAN, (b) Multi-Discriminator structure, (c) Federated Learning structure.

Preliminary. Key in federated, and generally decentralized, learning is that all participating nodes use the same model structure. This structure heavily depends on the input data type and its encoding. Previous federated GAN studies (mdgan, ; fedgan, ; asyndgan, ) focus on only one type of data, i.e., images. Image data makes it easy to pre-define the encoding and neural network structure independently of the specific images located at each participating node. However, the same does not apply to tabular data. Each column requires an encoding, which shapes the input layer and thus influences the model structure. These encodings depend on both the data type, i.e., categorical or continuous, and the data values, i.e., the data distribution. For example, the state-of-the-art generative tabular data models CTGAN (ctgan, ) and CTAB-GAN (ctabgan, ) use one-hot encoding for categorical columns and variational Gaussian mixture (VGM) encoding for continuous columns. Both encoding types require knowing the per-column global data properties: one-hot encoding requires the list of all possible distinct values, and VGM encoding depends on the estimation of all possible modes with their means and variances. Hence, image federated learning systems cannot readily be applied to tabular data problems.
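As a concrete illustration, the sketch below shows how a CTGAN-style encoder could turn a single cell into model inputs: a one-hot vector for a categorical value, and a normalized scalar plus a one-hot mode indicator for a continuous value. This is a simplified sketch, not the exact CTGAN implementation; it assumes the VGM mode for the continuous value has already been selected from the mixture's posterior responsibilities, and all helper names are ours.

```python
# Simplified sketch of per-column encodings that require global information:
# the full category list for one-hot encoding and the global VGM modes
# (means/stds) for mode-specific normalization of continuous values.
import numpy as np

def encode_categorical(value, categories):
    """One-hot encode `value` using the globally agreed category list."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def encode_continuous(value, means, stds, mode):
    """Encode a continuous value given a pre-selected VGM mode index."""
    alpha = np.clip((value - means[mode]) / (4 * stds[mode]), -1, 1)  # normalized scalar
    beta = np.zeros(len(means))                                       # one-hot mode indicator
    beta[mode] = 1.0
    return alpha, beta
```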

Tabular GANs on federated learning systems need to agree on a common encoding and, consequently, a common model structure during initialization. For this it is important to know the column-by-column data distribution across all participants. This is straightforward if privacy is of no concern: e.g., collect all the client data on one node, decide the encoding and distribute the decision to all other nodes. However, this goes against the fundamental aim of federated learning: training models without sharing detailed information about the local data, so as to preserve privacy. For categorical columns the problem can be solved by the participants sharing the list of distinct values in each column with little to no privacy infringement. But for continuous columns the problem is not as straightforward due to the VGM requirement.

Motivation Example. We demonstrate the challenge of encoding continuous columns with the following experiment using the Adult dataset (see Sec. 5 for details on the setup and dataset). Here we momentarily relax the privacy requirement and assume that the federator coordinating the federated learning has access to 1% of the global data. This 1% of the data is used to build the VGM encoders for all continuous columns, which are then distributed and used by all clients to encode their local data. Without a global view it is impossible to know how well the 1% sample represents the global population. If this 1% sample is drawn in a skewed way, it can severely degrade the encoding quality, leading to poor model performance. We show this effect on the age column. We select the 1% sample from the tail of the age distribution. The distributions of the original and selected data are shown in Fig. 1(1) and Fig. 1(2). A VGM encoder fitted on the sampled data only encodes well the data between 75 and 90 years, which, however, represents only the data above the 99th percentile of the real age distribution. Using this encoder to train a model leads to poor generation performance, i.e., the generated samples are not representative of the original data (see Fig. 1(3)). For comparison, Fig. 1(4) shows the distribution of samples generated with a VGM encoder built from all of the data.
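A minimal sketch of this motivating experiment is shown below: fit a VGM encoder on a skewed 1% tail sample versus on the full column and compare the estimated modes. The synthetic "age" data is an illustrative stand-in for the Adult age column, not the real dataset.

```python
# Fit a VGM encoder on a skewed 1% sample versus the full column.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
age = np.clip(rng.normal(38, 13, size=40_000), 17, 90)   # stand-in for the full age column
skewed_sample = np.sort(age)[-400:]                      # 1% of rows taken from the tail

def fit_vgm(values, max_modes=10):
    """Fit a variational Gaussian mixture encoder, CTGAN-style."""
    vgm = BayesianGaussianMixture(
        n_components=max_modes,
        weight_concentration_prior=1e-3,  # encourages pruning of unused modes
        max_iter=200,
    )
    return vgm.fit(values.reshape(-1, 1))

vgm_full = fit_vgm(age)
vgm_skewed = fit_vgm(skewed_sample)

# The skewed encoder only models ages above the 99th percentile,
# so most real values fall far outside its modes.
print("full-data modes:  ", np.sort(vgm_full.means_.ravel()).round(1))
print("skewed-data modes:", np.sort(vgm_skewed.means_.ravel()).round(1))
```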

Another key issue for federated tabular learning is the weighting of models from different clients during model aggregation. This issue is exacerbated with tabular data. Non-IID data across clients can lead to poor training convergence and, ultimately, poor model performance. Federated learning systems counter this effect by weighting each client model differently based on the similarity of the local data to the global data. Image federated learning estimates this similarity based on the distribution of labels, which can be seen as 1-dimensional data. But for tabular data, each column can be seen as one dimension, requiring a multi-dimensional solution. Moreover, while the same method as for image labels can be applied to categorical columns, one cannot directly estimate the similarity for continuous columns without knowing all the data points. Thus, a new weighting method is needed for tabular data.

Figure 3. Initialization and training process of Fed-TGAN: (a) collect column statistics and create encoders, (b) distribute encoders to initialize models and weight clients, (c) training.

3. Decentralized Architecture

GANs comprise two types of antagonizing deep neural networks: a generator and a discriminator. In training the two take turns. First the generator tries to synthesize data that is as indistinguishable as possible from the real data to fool the discriminator. Then the discriminator tries to tell the fake data from the real data to counter the generator. Model weights are updated using a minimax loss: the generator minimizes a loss related to how many of its fakes are correctly detected by the discriminator, while the discriminator maximizes its ability to separate real from generated samples, see Fig. 2a.
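The sketch below illustrates this alternating update with a generic, vanilla GAN loss in PyTorch. It is only illustrative: the generator G, discriminator D, optimizers and the tabular encoding of `real_batch` are assumed to exist elsewhere, and state-of-the-art tabular GANs such as CTGAN use more elaborate losses and conditioning than shown here.

```python
# One alternating minimax training step for a generic GAN.
import torch
import torch.nn.functional as F

def train_step(G, D, real_batch, opt_g, opt_d, noise_dim=128):
    bs = real_batch.size(0)

    # Discriminator turn: learn to separate real rows from generated rows.
    opt_d.zero_grad()
    fake = G(torch.randn(bs, noise_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(real_batch), torch.ones(bs, 1))
              + F.binary_cross_entropy_with_logits(D(fake), torch.zeros(bs, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator turn: try to fool the (now fixed) discriminator.
    opt_g.zero_grad()
    fake = G(torch.randn(bs, noise_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(bs, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```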

Training GANs is computationally demanding since they comprise at least two deep neural networks (i.e., one generator and one discriminator) and are trained on big datasets. Therefore, a decentralized training framework can be highly beneficial in such a setting, but has so far been explored only for image GANs. Existing solutions for decentralized GAN training can be classified into two categories: (1) the Multi-Discriminator (MD) structure (mdgan, ; asyndgan, ; temporary_dis, ) and (2) the Federated Learning (FL) structure (fedgan, ; fegan, ; fl, ).

Multi-Discriminator has a single generator at the server and multiple discriminators distributed across the clients. The structure is illustrated in Fig. 2b. The server determines the network architecture for the generator and discriminators. The generator is located at the server and trains its network using random inputs and the gradients from all discriminators, as typically done for centralized GANs. In contrast, the discriminators located at the decentralized data sources train their networks locally using outputs from the generator, i.e., synthesized data. Such a structure ensures that the clients' data never leaves the clients' machines. The downside of the MD structure is that it induces significant communication overhead between the generator and the discriminators, i.e., sending synthesized data to all discriminators and returning the discriminators' gradients to the generator in every training epoch. In addition, client discriminators tend to over-fit to their local data with more training epochs. MD-GAN (mdgan, ) counters the latter issue by allowing clients to randomly swap their models in a peer-to-peer way every several epochs. Even so, each discriminator is given the same weight when updating the generator. Thus, the convergence of the generator is not optimal (fegan, ) when the quantity and distribution of data are highly skewed among clients.

The Federated Learning (fl, ) structure (shown in Fig. 2c) places a full GAN (a generator and a discriminator) on each client, which has direct access to its local data. Each client first trains a GAN using the local data and then sends the GAN model to the federator. The key roles of the federator are: (i) during initialization, to determine the GAN architecture; and (ii) during training, to aggregate the local GAN models into a global GAN and redistribute it to all clients. Communication occurs when clients upload their model weights to the federator and when the federator redistributes the updated weights. Such a round of communication and merging of local models is commonly referred to as a global training round in FL studies. The resulting overhead is lower than for the MD structure, which requires communication between server and clients in every training epoch. Additionally, transferring model weights to/from the federator is more efficient than transferring synthesized data to each discriminator as in the MD structure. The FL structure also scales well with the number of clients, as the computational complexity of model aggregation is lower than that of training the generator network. Another advantage of the FL structure is that it allows weighting local models during aggregation, which helps to accelerate the convergence of the generator under skewed data distributions among clients. Local data ratios and Kullback-Leibler (KL) based weighting methods from (fegan, ) were introduced to address skewed data challenges for image data.
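A minimal sketch of the aggregation step in such a global training round is shown below: the federator computes a weighted average of the clients' model weights (with uniform weights in vanilla FL). The helper name and the use of PyTorch state_dicts are our assumptions, not the paper's released implementation.

```python
# Weighted averaging of client models for one FL aggregation round.
import copy
import torch

def aggregate(client_state_dicts, client_weights):
    """Return the weighted average of the clients' state_dicts.

    Assumes all state_dicts share the same keys, contain floating-point
    tensors, and that the weights sum to 1 (e.g., 1/P for vanilla FL).
    """
    global_state = copy.deepcopy(client_state_dicts[0])
    for key in global_state:
        global_state[key] = sum(
            w * sd[key].float() for sd, w in zip(client_state_dicts, client_weights)
        )
    return global_state
```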

Architecture choice for Fed-TGAN. The FL structure has multiple benefits over the MD structure, ranging from lower communication overhead to better scalability, training stability, and handling of skewed client data. In this work, we thus adopt the FL structure to enable training tabular GANs on decentralized data sources. In summary, the proposed Fed-TGAN is composed of one federator and multiple clients, following the training procedure of the FL structure.

4. Fed-TGAN

In this section we introduce the design of Fed-TGAN, which adapts the FL structure presented in Sec. 3 to overcome the challenges presented in Sec. 2. To this end, we first add an initialization step to standardize the encoding for each column across all participants. Second, we choose the best encoding in a privacy-preserving manner by estimating the global data distribution without directly accessing the participants' local data. Third, we introduce a multi-dimensional weighting mechanism to ensure model convergence under Non-IID data distributions across multiple columns.

4.1. Privacy-preserving feature encoding

Our privacy-preserving model initialization comprises two steps as shown in Fig. 3a and 3b.

Step 1. Each of the $P$ clients extracts the statistical properties of its local data and sends them to the federator. The information sent differs by column type. For any categorical column $j$, each client $i$ computes and sends the category frequency distribution $X_{ij}$. This information is used in three ways. First, the federator uses all distinct categories to build the label encoder $LE_j$ for column $j$. A label encoder is a table which maps every possible distinct value of a categorical column to its corresponding rank in the one-hot encoding. Second, the frequency information is used to build an aggregated global frequency distribution $X_j$ for column $j$. Third, the sum of the frequency values is used to compute the number of table rows: (i) available locally, $N_i$, at each client $i$; and (ii) available globally, $N$, across all clients. The global label frequency distribution $X_j$, $N_i$ and $N$ are needed to estimate the similarity of the clients' local data when computing the clients' weights for model aggregation. If no categorical columns are present in the tabular data, the client sends out $N_i$ instead.
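A minimal sketch of this categorical exchange follows, with client-side frequency reporting and the federator-side construction of $LE_j$, $X_j$, $N_i$ and $N$. Function names are ours and the example data is hypothetical.

```python
# Step 1 for a categorical column j: clients report category frequencies,
# the federator builds the label encoder LE_j, the global frequency X_j,
# and the row counts N_i and N.
from collections import Counter

def client_categorical_stats(local_column_values):
    return Counter(local_column_values)            # X_ij: category -> local count

def federator_merge(client_counters):
    global_freq = Counter()                        # X_j
    n_local = []                                   # N_i per client
    for counts in client_counters:
        global_freq.update(counts)
        n_local.append(sum(counts.values()))
    label_encoder = {cat: idx for idx, cat in enumerate(sorted(global_freq))}  # LE_j
    return label_encoder, global_freq, n_local, sum(n_local)                   # ..., N

# Example with three clients holding a workclass-like column:
stats = [client_categorical_stats(v) for v in
         (["Private", "Private", "State-gov"], ["Private"], ["Self-emp", "Private"])]
le_j, x_j, n_i, n = federator_merge(stats)
```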

For any continuous column $j$, each client $i$ fits and sends the parameters of a VGM model $VGM_{ij}$. To estimate the global distribution of column $j$, the federator uses $VGM_{1j}, VGM_{2j}, \dots, VGM_{Pj}$ to create the data sets $D_{1j}, D_{2j}, \dots, D_{Pj}$ with $N_1, N_2, \dots, N_P$ data points. The federator then uses these data sets to fit a new global VGM model $VGM_j$ for column $j$ (it might be possible to fit the global model directly from the parameters of the local models by, e.g., adapting (DBLP:conf/icpr/BruneauGP08, ); this is left for future work). $VGM_j$ is used as the final encoder for column $j$.
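The sketch below illustrates this continuous-column exchange under stated assumptions: local VGMs are scikit-learn BayesianGaussianMixture models, and the federator samples $N_i$ points from each local model to form $D_{ij}$ before refitting the global $VGM_j$, never touching raw client rows. Helper names are ours.

```python
# Step 1 for a continuous column j: local VGM fit on each client,
# then a global refit on samples drawn from the local VGMs.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def client_fit_vgm(values, max_modes=10):
    """Fit the local VGM_ij on the client's raw column values."""
    vgm = BayesianGaussianMixture(n_components=max_modes,
                                  weight_concentration_prior=1e-3)
    return vgm.fit(np.asarray(values, dtype=float).reshape(-1, 1))

def federator_global_vgm(local_vgms, n_rows, max_modes=10):
    """Build D_ij by sampling N_i points per client, then fit the global VGM_j."""
    d_sets = [vgm.sample(n)[0] for vgm, n in zip(local_vgms, n_rows)]   # D_1j ... D_Pj
    global_vgm = BayesianGaussianMixture(n_components=max_modes,
                                         weight_concentration_prior=1e-3)
    global_vgm.fit(np.vstack(d_sets))
    return global_vgm, d_sets   # D_ij are reused later for the divergence matrix
```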

Step 2. The federator distributes all the column encoders $LE_j$ and $VGM_j$ to each client. Clients use this information to encode their local data and initialize the local models. Models initialized using the same encoders have the same input/output layers. This solves the first challenge outlined in Sec. 2. Note that the number and structure of the internal layers used for the generator and discriminator networks are predefined and independent of the data. In our evaluation against the MD structure, this information is also used by the server to initialize the hosted generator network. Note that in this process the federator never directly accesses the local data of the clients, only their statistical distributions; this addresses the second challenge from Sec. 2.

4.2. Table-similarity aware weighting scheme

Figure 4. Weights calculation of Fed-TGAN. Starting with a matrix of divergence scores for each client and table column: (1) normalize scores by column; (2) aggregate the scores per client; (3) incorporate differences in local data quantities; and (4) perform the final weight normalization.

After model initialization, the federator uses the collected global data statistics to pre-compute the weights for each client. These weights are used during training in the model aggregation (shown in Fig. 3c) to smooth convergence in the presence of skewed data across the clients. The weight calculation process is presented in Fig. 4.

Step 0 is to build a $P \times Q$ divergence matrix $\bm{S}$, where $P$ is the number of clients and $Q$ is the number of columns. Each matrix element $S_{ij}$ is the divergence of client $i$ for column $j$ when compared to the global statistics of column $j$. The metric used depends on the type of the column.

  • Categorical columns use the Jensen-Shannon Divergence (JSD) (jsd, ). The JSD between two probability vectors $p$ and $q$ is defined as $\sqrt{\frac{D(p\,\|\,m)+D(q\,\|\,m)}{2}}$, where $m$ is the point-wise mean of $p$ and $q$, and $D$ is the Kullback-Leibler divergence (Joyce2011, ). The JSD is symmetric and bounded between 0 and 1, enabling a hassle-free interpretation of results. For each categorical column $j$ and client $i$ we compute $S_{ij} = \text{JSD}(X_{ij}, X_j)$.

  • Continuous columns use the Wasserstein Distance (WD) (wgan_test, ). The first Wasserstein distance between two distributions $u$ and $v$ is defined as $WD(u,v)=\inf_{\pi\in\Gamma(u,v)}\int_{\mathbb{R}\times\mathbb{R}}|x-y|\,d\pi(x,y)$, where $\Gamma(u,v)$ is the set of probability distributions on $\mathbb{R}\times\mathbb{R}$ whose marginals are $u$ and $v$ on the first and second factors, respectively. It can be interpreted as the minimum cost to transform one distribution into another, where the cost is the amount of probability mass to shift times the distance it must be shifted. For each continuous column $j$, we use the data sets $D_{ij}$ created previously for each client $i$ to compute $S_{ij}$ as the WD between $VGM_{ij}$ and $VGM_j$ (a sketch of both divergence computations follows this list).
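A minimal sketch of the two divergence computations, assuming the categorical frequencies are the Counters and the continuous samples are the $D_{ij}$ arrays from the earlier sketches; SciPy provides both metrics.

```python
# Entries of the divergence matrix S: JSD for categorical columns,
# WD for continuous columns.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def jsd_score(local_freq, global_freq):
    """JSD(X_ij, X_j) on category frequencies aligned to the global categories."""
    cats = sorted(global_freq)
    p = np.array([local_freq.get(c, 0) for c in cats], dtype=float)
    q = np.array([global_freq[c] for c in cats], dtype=float)
    return jensenshannon(p, q, base=2)      # normalizes p, q internally; bounded in [0, 1]

def wd_score(local_samples, global_samples):
    """WD between samples drawn from VGM_ij and from VGM_j."""
    return wasserstein_distance(np.ravel(local_samples), np.ravel(global_samples))
```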

Step 1 normalizes the matrix $\bm{S}$ across the $P$ clients for each table column $j$. This is done by dividing each matrix element by the sum of the elements in the corresponding matrix column. This step maintains the relative divergence between different clients with respect to the global column data distribution while giving the same importance to all columns (each column sums up to 1).

Step 2 aggregates the divergence across the different table columns $j$. This is done via a sum along the rows of the matrix. For each client $i$, the resulting score $SS_i$ already represents the divergence between the client's and the global data distribution, but it does not yet take into account possible differences in the amount of local data available at each client.

Step 3 fuses the divergence in data values with the data quantity at each client. It first normalizes the divergence metric between 0 and 1 across the clients. Then it uses the complement to represent similarity instead of divergence and combines it with the ratio of local data to global data, i.e., $\frac{N_i}{N_{all}}$. The resulting $SD_i$ takes into account differences in both the number of values and the distribution of values of the local vs. global data. It considers all the dimensions given by the different columns, addressing the third challenge from Sec. 2.

Step 4 computes the final weights $W_i$. The $W_i$ for each client $i$ is obtained by passing the $SD_i$ through a softmax function. $W_i$ is the weight that the federator uses when it aggregates the model from client $i$.
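The sketch below runs Steps 1-4 end to end. One caveat: the text above does not spell out how the similarity complement and the data ratio are combined in Step 3, so the sketch multiplies them as one plausible choice; that fusion operator, like the helper name and the toy numbers, is an assumption.

```python
# Turn the P x Q divergence matrix S and the row counts N_i into weights W_i.
import numpy as np

def compute_client_weights(S, n_rows):
    S = np.asarray(S, dtype=float)
    N = np.asarray(n_rows, dtype=float)
    S_norm = S / S.sum(axis=0, keepdims=True)     # Step 1: each column sums to 1
    SS = S_norm.sum(axis=1)                       # Step 2: aggregate over columns
    SD = (1.0 - SS / SS.sum()) * (N / N.sum())    # Step 3: similarity x data ratio (assumed fusion)
    return np.exp(SD) / np.exp(SD).sum()          # Step 4: softmax -> W_i

# Toy example: client 0 holds more, better-matching data and gets the largest weight.
S = [[0.05, 0.10], [0.30, 0.40], [0.25, 0.35]]
print(compute_client_weights(S, n_rows=[40_000, 500, 500]))
```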

4.3. Implementation details

Fed-TGAN is implemented using the PyTorch RPC framework. This choice makes it easy to control the flow of the training steps from the federator: clients just need to join the group, then wait to be initialized and assigned work. To parallelize the training across all clients, RPC provides the function rpc_async(), which allows the federator to make non-blocking RPC calls that run functions at a client. To implement synchronization points, rpc_async() returns a Future-type object with a blocking wait() method: once wait() is called on this object, the process blocks until the return values are received from the client. The federator starts the training on all clients via rpc_async(). Then it waits for the new model from each client via wait(). Once all models are received, they are aggregated into a single model using the client weights and redistributed to the clients via rpc_async(). Once all clients confirm the reception of the updated model (via wait()), the federator starts the next round of training. We plan to provide the source code via GitHub after publication of the paper.
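A minimal sketch of this control flow is shown below. The client-side functions train_one_round() and set_model(), and the aggregate() helper (the weighted averaging sketched in Sec. 3), are assumptions passed in as callables; only rpc_async() and Future.wait() are actual PyTorch RPC primitives.

```python
# Federator-side training loop built on torch.distributed.rpc.
import torch.distributed.rpc as rpc

def federator_loop(client_names, client_weights, global_state, rounds,
                   train_one_round, set_model, aggregate):
    for _ in range(rounds):
        # Non-blocking calls start local training on every client in parallel.
        futures = [rpc.rpc_async(name, train_one_round, args=(global_state,))
                   for name in client_names]
        # wait() blocks until each client returns its updated model weights.
        local_states = [fut.wait() for fut in futures]
        # Merge the local models using the pre-computed client weights W_i.
        global_state = aggregate(local_states, client_weights)
        # Redistribute the merged model and wait for all confirmations.
        for ack in [rpc.rpc_async(name, set_model, args=(global_state,))
                    for name in client_names]:
            ack.wait()
    return global_state
```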

One weakness of the current RPC framework in PyTorch v1.8.1 is that it does not support transmitting tensors that reside on the GPU through an RPC call. This means that each time we collect or update the model weights, we pay an extra time cost to move the weights from GPU to CPU or reload them from CPU to GPU.

5. Experimental Analysis

Our algorithm Fed-TGAN is evaluated on four commonly used datasets and compared with three alternative architectures. To evaluate the similarity between real and synthetically generated data, we use the Avg-JSD for categorical columns and the Avg-WD for continuous columns. We also provide an ablation analysis to highlight the efficacy of the proposed client weighting strategy of Fed-TGAN. A training time analysis is reported at the end to show the time efficiency of all algorithms.

5.1. Experimental setup

Datasets. We test our algorithm on four commonly used machine learning datasets: Adult, Covertype and Intrusion are from the UCI machine learning repository (UCIdataset, ), and Credit is from Kaggle (kagglecredit, ). Due to computational limitations, we randomly sample 40k rows from each of the above datasets. The details of each dataset are shown in Tab. 1.

Table 1. Description of datasets.
Dataset    Rows [#]  Columns [#] (Categorical / Continuous / Total)
Adult      40k       9 / 5 / 14
Covertype  40k       45 / 10 / 55
Credit     40k       1 / 30 / 31
Intrusion  40k       20 / 22 / 42

Baselines. We compare Fed-TGAN against 3 baselines: (i) the multi-discriminator structure, (ii) the vanilla federated learning structure and (iii) a centralized approach, abbreviated as MD-TGAN, vanilla FL-TGAN, and Centralized, respectively. The aim is to learn a CTGAN model from distributed clients using the three frameworks on the basis of CTGAN's default settings for encoding features (ctgan, ). Specifically, we limit the VGM encoders to at most 10 estimated modes for each continuous column and use one-hot encoding for categorical columns. We re-implement all baselines using the PyTorch v1.8.1 RPC framework.

MD-TGAN clients swap discriminator models with each other at the end of each training epoch (mdgan, ). For a fair comparison with MD-TGAN, we also force Fed-TGAN and vanilla FL-TGAN to share the model weights with the federator at the end of each training epoch. Due to this, the notion of per round commonly used in FL studies equals per epoch in this paper, unless otherwise stated. Vanilla FL-TGAN is identical to Fed-TGAN, except that it uses identical weights for all clients, equal to $\frac{1}{P}$ where $P$ is the number of clients. Due to the different learning speed per epoch of the four frameworks, for a fair comparison we fix the number of epochs so that the training time is similar. In particular, we use 500, 500, 500, and 150 epochs for Fed-TGAN, vanilla FL-TGAN, Centralized, and MD-TGAN, respectively. We repeat each experiment 3 times and report the average.

Testbed. Experiments are run under Ubuntu 20.04 on two machines. Each machine is equipped with 32 GB of memory, a GeForce RTX 2080 Ti GPU and a 10-core Intel i9 CPU. Each CPU core has two threads, hence each machine contains 20 logical CPU cores in total. The machines are interconnected via 1G Ethernet links (measured speed: 943 Mb/s). One machine hosts the federator, the other all the clients. When not otherwise stated, both federator and clients use the GPU for training. For the experiments in Sec. 5.4, when CPUs are used to host clients for Fed-TGAN and MD-TGAN, CPU affinity (via the taskset command in Linux) is used to bind each client to one logical CPU core to reduce interference between processes.

5.2. Evaluation metrics

To evaluate the performance of the generator, 40k synthetic rows are sampled from the trained generator at the end of each epoch. We use two metrics to quantitatively measure the statistical similarity between the real and synthetic data:

Average Jensen-Shannon divergence (Avg-JSD). Used for categorical columns (see Sec. 4 for its definition). First, we compute the JSD between the synthetic and real data for each categorical column. Second, we average the obtained JSDs to obtain a compact comprehensible score, abbreviated as Avg-JSD. The closer to 0 Avg-JSD is, the more realistic the synthetic data is.

Average Wasserstein distance (Avg-WD). Used for continuous columns (see Sec. 4 for its definition). We use WD rather than JSD for continuous columns since JSD is not well-defined when the synthetic values lie outside the original value range of the real dataset, i.e., the KL divergence is not defined when comparing probability distributions with non-overlapping support. Unlike JSD, WD is unbounded and can vary greatly depending on the scale of the data. To make the WD scores comparable across columns, before computing the WD we fit and apply a min-max normalizer to each continuous column of the real data and apply the same normalizer to the corresponding columns of the synthetic data. We average all column WD scores to obtain the final score, abbreviated as Avg-WD. The closer Avg-WD is to 0, the more realistic the synthetic data is.
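A minimal sketch of the two scores, assuming the real and synthetic tables are pandas DataFrames with identical column names; the function names are ours.

```python
# Avg-JSD over categorical columns and Avg-WD over min-max normalized
# continuous columns.
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def avg_jsd(real, synth, cat_cols):
    scores = []
    for c in cat_cols:
        cats = sorted(set(real[c]) | set(synth[c]))
        p = real[c].value_counts(normalize=True).reindex(cats, fill_value=0)
        q = synth[c].value_counts(normalize=True).reindex(cats, fill_value=0)
        scores.append(jensenshannon(p, q, base=2))
    return float(np.mean(scores))

def avg_wd(real, synth, num_cols):
    scores = []
    for c in num_cols:
        lo, hi = real[c].min(), real[c].max()   # min-max normalizer fit on real data (assumes hi > lo)
        r = (real[c] - lo) / (hi - lo)
        s = (synth[c] - lo) / (hi - lo)         # same normalizer applied to synthetic data
        scores.append(wasserstein_distance(r, s))
    return float(np.mean(scores))
```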

5.3. Result analysis

We first evaluate how realistic the generated synthetic data is. Sec. 5.3.1 designs an experiment where all the clients contain the whole dataset, to test the performance of each framework in the ideal case. Then, in Sec. 5.3.2, we implement a scenario where the data on the clients is IID, but the quantity of data is highly imbalanced across the clients; the objective of this experiment is to show the effect of our model aggregation weighting method. Finally, in Sec. 5.3.3, an ablation analysis is designed where one of the clients has a much larger amount of data, but of low quality, to show the efficacy of the table-similarity aware component in the calculation of the model aggregation weights.

5.3.1. Ideal case of full dataset

Figure 5. MD-TGAN, Fed-TGAN and Centralized: 5 clients each with a complete Intrusion data copy. Panels: (a) Avg-JSD by epoch, (b) Avg-JSD by time, (c) Avg-WD by epoch, (d) Avg-WD by time.

This experiment uses one server (or federator) and 5 clients. Each client is provided with a copy of the full real dataset. This represents the ideal case with perfectly identical clients, i.e., each client has identical IID data. We compare Fed-TGAN, MD-TGAN and Centralized; since in this case the aggregation weights of Fed-TGAN are the same as for vanilla FL-TGAN due to the identical data, we skip vanilla FL-TGAN. Results for the Intrusion dataset are shown in Fig. 5. Avg-JSD and Avg-WD are presented both by epoch and by time (in seconds), as the different architectures spend vastly different amounts of time per epoch. For categorical columns, Fed-TGAN converges faster both by epoch and by time (see Fig. 5a and 5b). Moreover, the Avg-JSD of MD-TGAN converges quite slowly after epoch 24. For continuous columns, in terms of epochs, the Avg-WD of Fed-TGAN converges faster at the beginning and then becomes slightly worse than the Avg-WD of MD-TGAN (see Fig. 5c). However, inspecting the results by time, Fed-TGAN not only converges faster, but also achieves a lower Avg-WD than the other two architectures (see Fig. 5d). The performance gap between the centralized approach and Fed-TGAN may look counter-intuitive; however, similar results are reported by FeGAN (fegan, ). The reason is that Fed-TGAN sees the data five times per epoch compared to the centralized approach, which sees it only once. This boosts the diversity of samples seen by Fed-TGAN, thereby providing superior performance.

We summarize the final similarity results of all three approaches on all four datasets in Tab. 2. The scores are taken at the time (in seconds) when Centralized finishes its 500 training epochs. One can see that Fed-TGAN consistently achieves higher similarity (lower Avg-JSD and Avg-WD values) than the other two approaches.

Table 2. Final similarity for MD-TGAN, Fed-TGAN and Centralized: 5 clients each having a complete data copy.
Dataset Avg JSD (MD/Fed/Centralized) Avg WD (MD/Fed/Centralized)
Adult 0.072/0.059/0.117 0.014/0.012/0.015
Covertype 0.038/0.018/0.075 0.022/0.021/0.086
Credit 0.083/0/0.012 0.006/0.006/0.041
Intrusion 0.095/0.031/0.032 0.027/0.02/0.026

5.3.2. Imbalanced amount of IID data

For this experiment, we design a scenario where the number of data rows distributed among clients is highly imbalanced. Specifically, we include 5 clients in the group: 4 out of the 5 clients contain only 500 rows of data randomly sampled from the original dataset, and the last client contains the full dataset. We select 500 because it is the default batch size of CTGAN, so at least 500 rows are needed to form one mini-batch per epoch. This scenario shows the effect of the model aggregation weights that the federator calculates during initialization.

Results are shown in Fig. 6. For categorical columns, one can see that the Avg-JSD by epoch converges around 35% faster for Fed-TGAN than for vanilla FL-TGAN (epoch 17 versus epoch 26) (see Fig. 6a). Moreover, the Avg-JSD value of Fed-TGAN after convergence is also smaller than that of MD-TGAN and vanilla FL-TGAN. Similar results hold when measuring the Avg-JSD by time (see Fig. 6b). For continuous columns, Fed-TGAN converges faster at the very beginning. Between 80 and 400 seconds, Fed-TGAN is slightly worse than MD-TGAN and vanilla FL-TGAN. From then on until the end, MD-TGAN and Fed-TGAN perform similarly, see Fig. 6d. A similar pattern can be found when computing the Avg-WD with respect to epochs, see Fig. 6c. Full results on the four datasets are presented in Tab. 3. We notice that, except for the Intrusion dataset, Fed-TGAN and vanilla FL-TGAN perform similarly for continuous columns, but for categorical columns Fed-TGAN outperforms vanilla FL-TGAN on most datasets. Fed-TGAN converges better than vanilla FL-TGAN because the model trained on 40k rows converges better than the models trained on 500 rows, since all the data is IID and sampled from the original dataset; as we give more weight to the better-trained model, we benefit from its better convergence. The similar performance of Fed-TGAN and vanilla FL-TGAN on the Adult dataset can be due to the fact that Adult has fewer columns and is thus simpler to learn. The Avg-WD results for the Adult, Covertype and Credit datasets are similar because, in each of the 500-row IID samples, the continuous column distributions are well preserved with respect to the original data. From Fig. 6 and Tab. 3 (results are taken at the time when MD-TGAN finishes 150 training epochs), we conclude that under an imbalanced data quantity distribution across clients, vanilla FL-TGAN not only suffers from slow convergence, but also results in poor sample quality.

Figure 6. MD-TGAN, Fed-TGAN and vanilla FL-TGAN: 4 clients have 500, 1 client has 40k rows of sampled IID data. Panels: (a) Avg-JSD by epoch, (b) Avg-JSD by time, (c) Avg-WD by epoch, (d) Avg-WD by time.
Table 3. Final similarity for MD-TGAN, Fed-TGAN and vanilla FL-TGAN: 4 clients have 500, 1 client has 40k rows of sampled IID data.
Dataset Avg JSD (MD/Fed/Vanilla-FL) Avg WD (MD/Fed/Vanilla-FL)
Adult 0.07/0.062/0.062 0.014/0.012/0.012
Covertype 0.029/0.026/0.032 0.02/0.02/0.02
Credit 0.078/0.007/0.011 0.006/0.005/0.005
Intrusion 0.092/0.037/0.044 0.025/0.025/0.052

5.3.3. Ablation analysis

Recall the weight calculation process in Fig. 4. $SD_i$ is composed of two parts: (1) the ratio of the number of data rows locally available at client $i$ to the global number of data rows, i.e., $\frac{N_i}{N_{all}}$; and (2) the similarity calculated between the local data distribution of client $i$ and the global distribution, i.e., $1-\frac{SS_i}{\sum_{i=1}^{P}SS_i}$. Our experiment in Sec. 5.3.2 shows the difference between Fed-TGAN and vanilla FL-TGAN (i.e., Fed-TGAN with equal weights for all clients). The results show that weighting clients differently based on the amount of data is indeed useful when the data quantity at each client is skewed; the contribution of the data ratio part is intuitive. Therefore, in this ablation analysis, we design a scenario where the client weights of Fed-TGAN are calculated using only the data ratio of each client, without the similarity component.

Figure 7. MD-TGAN, Fed-TGAN and Fed-TGAN without similarity weights: 4 clients have 10k, 1 client has 40k rows of sampled Non-IID data. Panels: (a) Avg-JSD by epoch, (b) Avg-JSD by time, (c) Avg-WD by epoch, (d) Avg-WD by time.
Table 4. Final similarity for MD-TGAN, Fed-TGAN and Fed-TGAN without similarity weights (Fed w/o SW): 4 clients have 10k, 1 client has 40k rows of sampled Non-IID data.
Dataset Avg JSD (MD/Fed/Fed w/o SW) Avg WD (MD/Fed/Fed w/o SW)
Adult 0.37/0.149/0.261 0.107/0.026/0.027
Covertype 0.089/0.05/0.06 0.125/0.045/0.056
Credit 0.074/0.014/0.06 0.04/0.01/0.015
Intrusion 0.208/0.068/0.073 0.107/0.032/0.036

To better show the importance of the similarity weights, we design a specific scenario for this experiment. Still with 5 clients, 4 of them contain 10k IID rows sampled from the original data, while the last client is modified to contain 40k rows obtained by repeating a single row sampled from the original dataset 40k times. This last client thus has a large number of rows but carries little information. Fig. 7 shows the results on the Intrusion dataset. One can notice that this scenario badly hits MD-TGAN, since it treats all clients equally when updating the generator's weights. Moreover, in Fig. 7c and 7d one can see that the client with 40k repeated rows introduces oscillations into the curves of Fed-TGAN with and without the similarity component. As expected, Fed-TGAN without the similarity component performs worse than Fed-TGAN. The results in Tab. 4 (scores are taken at the time when MD-TGAN finishes 150 training epochs) show that Fed-TGAN clearly outperforms MD-TGAN and Fed-TGAN without the similarity computation on all datasets. Therefore, the similarity component of Fed-TGAN provides more stability for model convergence.

5.4. Training time analysis

The above experiments all focus on the quality of generation. In this section, we study the training efficiency of MD-TGAN and Fed-TGAN. The first experiment scenario is the same as in Sec. 5.3.1: the FL system consists of 5 clients, each possessing the full original dataset. Fig. 8a shows the time distribution for MD-TGAN and Fed-TGAN during one training epoch on the Intrusion dataset. Calculation on C is measured at the server (or federator): when the server (or federator) triggers training on the clients, the time is measured from the start of the first client to the completion of all clients. For MD-TGAN, all clients need to wait for the synthesized data from the generator; this waiting time is excluded from Calculation on C and added to Communication instead. Communication counts the time for exchanging model weights, swapping discriminators between clients, and sharing training data between the server (or federator) and the clients.

We can see that for Fed-TGAN the calculation time on the federator is negligible, since it only averages model weights. Fed-TGAN has a slightly higher calculation time on the clients because it trains both the generator and the discriminator networks on the clients. Moreover, the communication time of MD-TGAN is much higher, because to update the generator or the discriminators in MD-TGAN the server needs to send the generated data to each discriminator; since MD-TGAN has only one server, these tasks cannot be distributed. Fig. 8a shows that Fed-TGAN trains each epoch more than 200% faster than MD-TGAN. The communication time of Fed-TGAN is only 30% of that of MD-TGAN.

Figure 8. (a) Epoch training time per phase (server/federator vs. clients) of MD-TGAN and Fed-TGAN with 5 clients. (b) Total training time with varying local epochs per round for Fed-TGAN.

In the second experiment, instead of aggregating models at the end of each epoch for every client, we vary the number of local training epochs before aggregating models for Fed-TGAN. Fig. 8b shows the total training time for different numbers of local epochs per round, where we fix all clients to train for 500 epochs in total. Therefore, with more local epochs per round, fewer rounds remain for the whole training process: for 1, 10, 25 and 50 local epochs, the corresponding numbers of rounds are 500, 50, 20 and 10. The massive time decrease between 1 local epoch and the others is simply due to the reduced number of model aggregations; the differences among the other numbers of local epochs are not as significant. Fig. 9 shows the generation results under different numbers of local epochs. For categorical columns, the Avg-JSD converges to a small value in all settings. For continuous columns, the Avg-WD for Fed-TGAN with 10 local epochs per round converges fastest and provides the best result until 1150s. This result indicates that it is possible to speed up the training of Fed-TGAN by using more local epochs while still preserving the statistical similarity between real and synthetic datasets. However, increasing the local epochs to a large value can potentially lead to over-fitting on the local data of the clients, ultimately deteriorating performance. Thus, the number of local epochs introduces a trade-off between efficiency and performance.

Figure 9. Varying epochs per round with 5 clients. Panels: (a) Avg-JSD by time, (b) Avg-WD by time.

Next, we evaluate the per-epoch time consumption of Fed-TGAN and MD-TGAN under varying factors. First, we fix the number and type of data (10k IID rows from the Intrusion dataset) on each client and vary the number of clients from 5 to 20. Due to computing resource limitations, all experiments with a varying number of clients run the clients on CPUs, while the server (or federator) uses the GPU. To limit interference between processes, CPU affinity is used to bind each client to one logical CPU core. Fig. 10a clearly shows that Fed-TGAN scales better than MD-TGAN with the number of clients. In MD-TGAN, the central server increasingly becomes the bottleneck when adding clients due to the large amount of data exchanged with each client. Second, we fix the number of clients to 5 and vary the amount of IID data sampled from the Intrusion dataset from 10k to 40k rows per client. These experiments are run with clients on both CPU and GPU. The result in Fig. 10b shows that, with an increasing number of data rows per client, both Fed-TGAN and MD-TGAN experience an increase in training time per epoch. The difference between the two algorithms is small when training happens on CPU, but when using the GPU the gap grows with the amount of data on the clients. The reason is that when sharing data between client and server, tensors residing on the GPU must first be moved to the CPU before being sent through the PyTorch RPC framework. Since Fed-TGAN trains all tabular GAN models locally on each client, its training is highly accelerated by GPUs: clients only need to move the model from GPU to CPU at the end of a training round to share it with the federator. Since the server in MD-TGAN exchanges messages more often than the federator in Fed-TGAN, GPUs do not accelerate the training of MD-TGAN as much as that of Fed-TGAN.

5.5. Further discussion

For calculating the weights for merging local client models during the initialization process, we only use the individual column data distributions to compare local data distributions with the global distributions. But for tabular data, inter-dependency between columns is also an important factor.

Figure 10. MD-TGAN and Fed-TGAN with: (a) varying number of clients and fixed data per client; (b) fixed 5 clients and varying amounts of data per client.

The reason we do not use it is privacy: since the server (or federator) cannot collect real data from the clients, inter-dependencies between columns cannot be inferred from the per-column distributions alone.

However, analyzing the column inter-dependency of clients' data is not useless. Recall the experiment in Sec. 5.3.3: one malicious client contains 40k rows, which are a single row repeated 40k times, while the other 4 clients each contain 10k IID rows sampled from the original data. For a dataset of one row repeated 40k times, the correlation between every two columns is 0, since they are just two constant columns. Therefore, the analysis of self-reported column inter-dependency may not improve the similarity calculation under the current privacy-preserving rule, but it may still help to identify some types of malicious clients.

Further note on Privacy-Preserving Technologies. FL indeed emerges as a viable solution that enables collaborative distributed learning without disclosing original data. Orthogonal to FL, there is an array of privacy-preserving techniques that can be jointly applied to further strengthen the privacy guarantees of GANs and FL, namely differential privacy (DP) (dwork2006our, ; DPGAN, ; pategan, ) and homomorphic encryption (HE) (gentry2010computing, ; crossSilos, ). Exploring advanced privacy enhancing technologies is beyond the scope of this paper and will be addressed in our future work.

6. Conclusion

Due to ever increasing distributed data sources and privacy concerns, it is imperative to learn GANs in a decentralized and privacy-preserving manner – features offered by federated learning systems. While the prior art demonstrates the feasibility of learning image GANs in FL systems, it remained unknown if the tabular data predominant in industry and its GANs can be trained in an FL framework. This paper proposes and implements Fed-TGAN, a first-of-its-kind FL architecture and prototype for tabular GANs, overcoming specific challenges related to tabular data. The two main features of Fed-TGAN are (i) privacy-preserving feature encoding to enable model initialization across heterogeneous data sources, and (ii) table-similarity aware weighting for merging local models. We extensively evaluate Fed-TGAN using a state-of-the-art tabular GAN and compare it with two alternative decentralized architectures, i.e., MD-TGAN and vanilla FL-TGAN, and a centralized approach. Our results show that Fed-TGAN can generate synthetic tabular data that preserves high similarity to the original data with faster convergence speeds, even in the challenging case of Non-IID data among clients. The prototype of Fed-TGAN is currently under testing by a Fortune 500 financial institution. The promising evaluation results confirm that Fed-TGAN can help large organizations to unlock their data stored across multi-national silos to build a better tabular data synthesizer in a privacy-preserving manner. We plan to release the source code after publication of the paper.

References

  • [1] S. O. Arik and T. Pfister. Tabnet: Attentive interpretable tabular learning. arXiv preprint arXiv:1908.07442, 2019.
  • [2] P. Bruneau, M. Gelgon, and F. Picarougne. Parameter-based reduction of gaussian mixture models with a variational-bayes approach. In IEEE International Conference on Pattern Recognition ICPR, pages 1–4, 2008.
  • [3] Q. Chang, H. Qu, Y. Zhang, M. Sabuncu, C. Chen, T. Zhang, and D. N. Metaxas. Synthetic learning: Learn from distributed asynchronized discriminator gan without sharing medical image data. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13853–13863, 2020.
  • [4] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
  • [5] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
  • [6] C. Gentry. Computing arbitrary functions of encrypted data. Communications of the ACM, 53(3):97–105, 2010.
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, page 2672–2680, Cambridge, MA, USA, 2014.
  • [8] R. Guerraoui, A. Guirguis, A.-M. Kermarrec, and E. Le Merrer. Fegan: Scaling distributed gans. In ACM/IFIP Middleware, 2020.
  • [9] C. Hardy, E. Le Merrer, and B. Sericola. Md-gan: Multi-discriminator generative adversarial networks for distributed datasets. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 866–877, 2019.
  • [10] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677, 2017.
  • [11] J. Huang, R. Talbi, Z. Zhao, S. Boucchenak, L. Y. Chen, and S. Roos. An exploratory analysis on users’ contributions in federated learning. In 2020 Second IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pages 20–29, 2020.
  • [12] J. M. Joyce. Kullback-Leibler Divergence, pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  • [13] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2019.
  • [14] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8107–8116, 2020.
  • [15] J. Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
  • [16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
  • [17] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim. Data synthesis based on generative adversarial networks. Proc. VLDB Endow., 11(10):1071–1083, 2018.
  • [18] H. Qu, Y. Zhang, Q. Chang, Z. Yan, C. Chen, and D. Metaxas. Learn distributed gan with temporary discriminators. In European Conference on Computer Vision, pages 175–192. Springer, 2020.
  • [19] A. Ramdas, N. G. Trillos, and M. Cuturi. On wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2), 2017.
  • [20] M. Rasouli, T. Sun, and R. Rajagopal. Fedgan: Federated generative adversarial networks for distributed data. CoRR, abs/2006.07228, 2020.
  • [21] S. Semeniuta, A. Severyn, and S. Gelly. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936, 2018.
  • [22] M. L. G. ULB. Kaggle - anonymized credit card transactions labeled as fraudulent or genuine. https://www.kaggle.com/mlg-ulb/creditcardfraud, 2018.
  • [23] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
  • [24] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. Modeling tabular data using conditional gan. In Advances in Neural Information Processing Systems, 2019, volume 32, pages 7335–7345. Curran Associates, Inc., 2019.
  • [25] J. Yoon, J. Jordon, and M. van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2019.
  • [26] Z. Zhao, A. Kunar, H. Van der Scheer, R. Birke, and L. Y. Chen. CTAB-GAN: Effective Table Data Synthesizing. arXiv e-prints, page arXiv:2102.08369, Feb. 2021.