Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments

Rui Song^{*, 1, 2} , Dai Liu^{*, 2}, Dave Zhenyu Chen²,
Andreas Festag^{1, 3}, Carsten Trinitis², Martin Schulz², Alois Knoll²
¹Fraunhofer IVI ²Technical University of Munich ³Technische Hochschule Ingolstadt
{rui.song, andreas.festag}@ivi.fraunhofer.de
{rui.song, dai.liu, zhenyu.chen, carsten.trinitis}@tum.de,
{schulzm, knoll}@in.tum.de, [email protected]
^∗Equal Contribution.This work was supported by the German Federal Ministry for Digital and Transport (BMVI) in the projects “KIVI – KI im Verkehr Ingolstadt” and ”5GoIng – 5G Innovation Concept Ingolstadt”.

Abstract

In federated learning, all networked clients contribute to the model training cooperatively. However, with model sizes increasing, even sharing the trained partial models often leads to severe communication bottlenecks in underlying networks, especially when communicated iteratively. In this paper, we introduce a federated learning framework FedD3 requiring only one-shot communication by integrating dataset distillation instances. Instead of sharing model updates in other federated learning approaches, FedD3 allows the connected clients to distill the local datasets independently, and then aggregates those decentralized distilled datasets (e.g. a few unrecognizable images) from networks for model training. Our experimental results show that FedD3 significantly outperforms other federated learning frameworks in terms of needed communication volumes, while it provides the additional benefit to be able to balance the trade-off between accuracy and communication cost, depending on usage scenario or target dataset. For instance, for training an AlexNet model on CIFAR-10 with 10 clients under non-independent and identically distributed (Non-IID) setting, FedD3 can either increase the accuracy by over 71% with a similar communication volume, or save 98% of communication volume, while reaching the same accuracy, compared to other one-shot federated learning approaches.

I Introduction

Refer to caption — Figure 1: Decentralized distilled datasets on federated CIFAR10 and MNIST divided into 10 clients. Each has only 2 classes of data and the class combination in each client is individual. We distill one image per class in each of 10 clients.

Federated learning has become an emerging paradigm for collaborative learning in large-scale distributed systems with a massive number of networked clients, such as smartphones, connected vehicles or edge devices. Due to the limited bandwidth between clients [1], previous research [2, 3, 4, 5, 6, 7, 8] attempts to speed up convergence and improve communication efficiency. However, for modern neural networks with over hundreds of million parameters, this kind of cooperative optimization still leads to extreme communication volumes, which require substantial network bandwidth (up to the Gbps level [9]) in order to work reliably and efficiently. This drawback hinders any large-scale deployment of federated learning models in commercial wireless mobile networks, e.g., vehicular communication networks [10] or industrial sensor networks [11].

Motivated by this communication bottleneck, prior federated learning algorithms attempt to reduce the number of communication rounds and with that the communication volume, to reach a good learning performance. Guha et al. [12] introduce a one-shot federated learning approach aimed at reducing communication overhead during the training process of a support vector machine (SVM). By exchanging information in a single communication round, it offers significant efficiency improvements. Kasturi et al. [13] provide a fusion federated learning method that uploads both model and data distribution to the server, but characterizing a distribution of a real dataset can be difficult. The one-shot federated learning based on knowledge transfer from Li et al. [14] is general, but it requires additional communication overhead to transmit multiple student models to the server.

Inspired by the one-shot scheme [12], we introduce a federated learning training scheme with one-shot communication via dataset distillation [15, 16, 17, 18]. Intuitively, dramatically smaller but more informative datasets, which include dense features, are synthesized and transmitted. This way, more informative training data is transmitted across the limited bandwidth without any privacy violation.

Specifically, we introduce a novel federated learning framework incorporating dataset distillation, FedD3, which is shown in Fig. 1. It enables efficient federated learning via transmitting the locally distilled dataset to the server in a one-shot manner, which can be applied as a pre-trained model and used for personalized [19, 20] and fairness-aware [21] learning. Note that dataset distillation can keep the advantage of privacy in federated learning[22, 23, 24, 25, 26, 27, 28, 29, 30]. It anonymously maps distilled datasets from the original client data without any exposure, which is analogous to the shared model parameters in previous federated learning methods, but substantially more efficient and effective.

We perform an extensive analysis of our method in various scenarios to showcase the effectiveness of our method in massively distributed systems and on Non-IID (non-independent and identically distributed) datasets. Specifically, our experiments highlight the trade-off between accuracy and communication cost. To handle this trade-off, we propose a new evaluation metric, the $\gamma$ -accuracy gain. By tuning the importance of accuracy gain $\gamma$ to the communication cost, the communication efficiency in federated learning is scored accordingly. We also investigate the effects of specific external parameters, including Non-IID datasets, number of clients and local contributions, and demonstrate a great potential for our framework in networks with constrained communication budgets in federated learning. We experimentally show that FedD3 has the following advantages: (i) Compared to conventional multi-shot federated learning, FedD3 significantly reduces the amount of bits that needs to be communicated, making our approach practically feasible even in low-bandwidth environments; (ii) Compared to other approaches for one-shot federated learning, FedD3 achieves a much better performance even with less communication volume, where the accuracy in a distributed system with 500 clients is enhanced by over 2.3 $\times$ (from 42.08% to 94.74%) on Non-IID MNIST and 3.6 $\times$ (from 10.74% to 38.27%) on Non-IID CIFAR-10 compared to FedAvg in one single round; (iii) Compared to centralized dataset distillation, FedD3 achieves much better results due to the broader data resource via federated learning.

Contributions To summarize, our contributions are as fourfold:

•

We introduce a decentralized dataset distillation scheme in federated learning systems, where distilled data instead of models are uploaded to the server;
•

We formulate and propose a novel framework, FedD3, for efficient federated learning in a one-shot manner, and demonstrate FedD3 with two different dataset distillation instances in clients;
•

We propose a novel evaluation metric $\gamma$ -accuracy gain, which can be used to tune the importance of accuracy and analyze communication efficiency;
•

We conduct an extensive analysis of the proposed framework. The experiments showcase the great potentials of our framework in networks with constrained communication budget in federated learning, especially considering the trade-off between accuracy and communication costs. The software implementation of FedD3 is publicly available as open source at GitHub¹¹1https://github.com/rruisong/FedD3.git.

II Background and Related Work

Federated learning Federated learning was first introduced by McMahan et al. [1], where models can be learned collaboratively from decentralized data through model exchange between clients and a central server without violating privacy. The proposed federated learning scheme FedAvg [1] aggregates the received models and updates global model by averaging their parameters.

Compared to other distributed optimization approaches, federated optimization addresses more practical challenges, e.g., communication efficiency [2], data heterogeneity [31], privacy protection [32], system design [33], which enables a large-scale deployment in real-world application scenarios [34, 35, 36, 37, 38, 39, 40, 41, 42, 43].

In a federated learning scenario, given a set of clients indexed by $k$ , machine learning models with weights $w_{k}$ are trained individually on local client datasets $\mathcal{D}_{k}=\{(x_{i})|i=1,2,...,n_{k}\}$ , where $x_{i}$ is one data point with its label $y_{i}$ in client $k$ and $n_{k}$ is the number of the local data points. The goal of local training in client $k$ is to minimize

F_{k}(w)=\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}f_{i}(w)

(1)

where $f_{i}(w)$ is the loss function on one data point $x_{i}$ with the label $y_{i}$ . Finally, the goal is to minimize aggregated local goals $F_{k}(w)$ in (1):

f(w)=\sum_{k=1}^{m}\frac{n_{k}}{n}F_{k}(w)

(2)

i.e. $f(w)=\mathbb{E}_{\mathcal{D}_{k}}[F_{k}(w)]$ . Note that the datasets across clients can be Non-IID in federated learning.

However, most federated optimization methods exchange models or gradients for learning updates, which can still lead to excessive communication volumes when a model has numerous parameters. This is even more problematic in wireless networks common in many mobile applications, as frequently exchanging data leads to higher error likelihood when connections are unstable, which can cause federated learning to fail.

One-shot Federated Learning Federated learning with one-shot communication has been studied in several projects [12, 13, 14]. Guha et al. [12] introduce an algorithm for training a support vector machine in one-shot fashion. The framework proposed by Kasturi et al. [13] uploads models and additionally the local dataset distribution, which hard to do when training on real datasets. Li et al. [14] utilize knowledge transfer to distill models in each client, which can change the original model structure and may cause much communication cost for sharing student models.

Instead of distilling models, our framework distills input data into smaller synthetic datasets in clients, which can be used for training a more general model by only relying on one-shot communication.

Dataset Distillation Dataset distillation [15] has become an attractive paradigm. It attempts to synthesize a significantly smaller dataset from a large dataset, aiming to maintain the same training performance in terms of test accuracy. Dataset distillation was proposed by Wang et al. [15]. Methods like matching outputs or gradients [44, 17, 45, 16] have been achieving outstanding results. Beside updating synthetic datasets with forward and backward propagation, Nguyen et al. [18] perform Kernel Inducing Points (KIP) and Label Solve (LS) for the optimal solution. Relating back to our work, dataset distillation has been successfully applied in centralized training in limited fashion.

Dataset Distillation in Federated Learning Though some previous works in dataset distillation, e.g. [44], have mentioned the dataset distillation might be beneficial for federated learning, they have not provided detailed analysis and experimental evaluation on it. Only little research has explored the dataset distillation approaches in federated learning: Zhou et al. [46] and Goetz et al. [47] have attempted to employ the approaches proposed by Wang et al. [15] ,and Sucholutsky and Schonlau [48] in federated learning, respectively. However, more advanced methods, e.g. [18] [16] with more stable performance have been developed, and further studies on decentralized dataset distillation in federated settings are needed. Furthermore, both of them have not pointed out the dataset distillation can improve the training on heterogeneous data, which is one of the biggest challenges in federated learning.

In fact, the computation abilities in distributed edge devices are normally limited, while most dataset distillation methods require high computation power. For instance, the approach from [16] can lead to computation overhead, though it can generate satisfactory distilled dataset. Therefore, in this work, we consider the coreset-based and KIP-based [49, 18] methods for decentralized dataset distillation in our federated learning framework, and focus on improving the communication efficiency while training on federated datasets.

III FedD3: Federated Learning from Decentralized Distilled Datasets

To explore the dataset distillation in federated settings, we introduce FedD3, a federated learning framework from decentralized distilled datasets in this section. Specifically, we consider a joint learning task with $m$ clients, where the client $k$ owns local dataset $\mathcal{D}_{k}$ . If we distill a synthetic dataset $\mathcal{\tilde{D}}_{k}$ ( $\tilde{n}_{k}=|\mathcal{\tilde{D}}_{k}|$ ), in the client $k$ from its local dataset $\mathcal{D}_{k}$ ( $n_{k}=|\mathcal{D}_{k}|$ ), the goal of the dataset distillation instance is to minimize $H_{k}(\tilde{X}_{k},\tilde{y}_{k};\Theta_{k})$ , where the $\tilde{X}_{k}$ represents the matrix of stacked distilled data points in the client $k$ , $\tilde{x}_{k,i}$ , $\tilde{y}_{k}$ contains the corresponding labels, and $\Theta_{k}$ indicates a set of parameters in the instance models. Note that $H_{k}$ can vary, depending on the instance used in the client $k$ .

Algorithm 1 FedD3: Federated Learning from Decentralized Distilled Datasets

0: number of the learning epochs

E

0: learning rate

\eta

0: number of the client

m

0: number of the distillation epochs

E_{k}

0: distillation rate

\eta_{k}

0: decentralized dataset

\mathcal{D}_{k}

\tilde{w}

1: Server:

2: select instance with hyper-parameters

\Theta

3: for

k\in\{1,2,...,m\}

in parallel do

{\tilde{D}_{k}}\leftarrow

ClientDatasetDistillation(k,

\Theta

)

5: end for

\mathcal{\tilde{D}}\leftarrow aggregate(\mathcal{\tilde{D}}_{1},\mathcal{\tilde{D}}_{2},...,\mathcal{\tilde{D}}_{m})

7: for epoch

e=1,2,...,E

8: for each batch

\mathcal{\tilde{B}}=(\tilde{X}_{b},\tilde{y}_{b})

\mathcal{\tilde{D}}

\tilde{w}\leftarrow\tilde{w}-\eta(\nabla l(\tilde{X}_{b},\tilde{y}_{b};\tilde{w}))

10: end for

11: end for

12: return

\tilde{w}

13: ClientDatasetDistillation(k,

\Theta

)

14:

\Theta_{k}\leftarrow\Theta

15: initialize

\tilde{D}_{k}

with a subset of

D_{k}

and

\tilde{n}_{k}=|\tilde{D}_{k}|

16: for epoch

e_{k}=1,2,...,E_{k}

17: for each pair of batches

\mathcal{\tilde{B}}_{k}=(\tilde{X}_{b},\tilde{y}_{b})

\mathcal{\tilde{D}}_{k}

and

\mathcal{B}_{k}=(X_{b},y_{b})

\mathcal{D}_{k}

18:

{\tilde{X}}_{b}\leftarrow{\tilde{X}}_{b}-\eta_{k}(\frac{\partial H_{k}(\tilde{X}_{b},\tilde{y}_{b};\Theta_{k})}{\partial{\tilde{X}}_{b}})

19: update

{\mathcal{\tilde{B}}_{k}}

and

{\mathcal{\tilde{D}}_{k}}

with

{\tilde{X}}_{b}

20: end for

21: if converged then

22: break

23: end if

24: end for

25: return

{\tilde{D}_{k}}

III-A Coreset-based Methods

We start from coreset-based methods to distill the decentralized datasets. We assume that there exists a synthetic dataset $\mathcal{\tilde{D}}_{k}$ , which can approximate the statistical distribution of the original dataset $\mathcal{D}_{k}$ . Through minimizing the error in a small subset of $\mathcal{D}_{k}$ , we can generate the coreset $\mathcal{\tilde{D}}_{k}$ using a specific instance, e.g. Kernel Herding [50]. More generally, we consider a clustering-based methods to generate a coreset in the client $k$ for one of classes $s\in S_{k}$ ( $S_{k}$ is the set of local classes), then the goal is to minimize a clustering loss, for instance generating coreset using a Gaussian mixture model (GMM) [51] for each class $s$ in all clients.

III-B KIP-based Methods

We review and adopt KIP [49, 18] to federated fashion for its fast divide up gradient computation. Each client $k$ aims to distill the original local dataset (a.k.a. target dataset) $\mathcal{D}$ to a synthetic dataset (a.k.a. support dataset) $\mathcal{\tilde{D}}$ by minimizing the kernel ridge-regression (KRR) loss $H_{k}(\tilde{X}_{k})$ as follows:

\mathrm{H_{k}}(\tilde{X}_{k})=\\ \frac{1}{2}||y_{k}-\mathrm{K}_{k}({X_{k},\tilde{X}_{k}})(\mathrm{K}_{k}({\tilde{X}_{k},\tilde{X}_{k}})+\lambda_{k}I)^{-1}\tilde{y}_{k}||^{2}_{2}

(3)

where $\mathrm{K}_{k}$ is the kernel used in the client $k$ and $\lambda_{k}$ is the regularization constant deduced by ridge regression in KIP. $I$ is an identity matrix. We refer the readers to the work of Nguyen et al. [49] for further details regarding kernels.

III-C Aggregation and Learning

After decentralized dataset distillation, the distilled datasets in all connected clients are transmitted to the server and aggregated as $\mathcal{\tilde{D}}=\{\mathcal{\tilde{D}}_{k}|k=1,2,...,m\}$ . We consider a non-convex neural network objective in the server and train a machine learning model on $\mathcal{\tilde{D}}$ instead of on original dataset $\mathcal{D}$ , the objective is then minimizing:

f(w)=\frac{1}{\tilde{n}}\sum_{k=1}^{m}\sum_{i=1}^{\tilde{n}_{k}}f_{i}(w)=\mathbb{E}_{\tilde{D}}[f_{i}(w)]

(4)

where $f_{i}(w)=l(\tilde{x}_{i},\tilde{y}_{i};w)$ is the loss of the prediction on one distilled data point $\tilde{x}_{i}$ with its label $\tilde{y}_{i}$ and model weights $w$ . If the $\tilde{X}_{k}$ could be distilled from $X_{k}$ perfectly, the learning result of minimizing $f(w)$ on $\tilde{D}$ should be similar to it on $D$ . The FedD3 pseudocode is given in Algorithm 1.

III-D Gamma Communication Efficiency

In federated learning, communication cost is even more expensive than computational cost [1]. It is worth studying how much communication cost is needed for achieving a dedicated gain of the model performance in terms of accuracy. To tackle the trade-off between model performance and required communication cost, we define the Gamma Communication Efficiency (GCE) using $\gamma$ -accuracy gain per binary logarithmic bit as follows:

GCE=\frac{ACC}{(1-ACC)^{\gamma}*\sum_{t=1}^{T}\log_{2}(V_{t}+1)}

(5)

where $T\in\mathbb{N}_{+}$ is the total communication rounds and $V_{t}\in\mathbb{R}_{+}$ is the required communication volume in each round.

We use the binary logarithmic bit $\sum_{t=1}^{T}\log_{2}(V_{t}+1)$ to describe the communication cost from communication round $t=1$ to $T$ . Then the gain per binary logarithmic bit can be defined by $1/\sum_{t=1}^{T}\log_{2}(V_{t}+1)$ , where the 0 communication cost gives us infinitive high gain and infinitive high communication cost lead to 0 gain. $ACC$ is the prediction accuracy value. $\gamma\in\mathbb{R}_{+}$ is an tunable parameter to represent the importance of the prediction accuracy. If $\gamma\rightarrow 0$ , the accuracy and communication cost in logarithmic bit is near proportional. The higher $\gamma$ is defined, the more test accuracy is weighted. A tiny test accuracy gain can be scored very well with an infinitely high $\gamma$ . Through selecting an appropriate $\gamma$ , we can evaluate the performance of federated learning approaches, considering both test accuracy and communication cost based on the application scenarios.

IV Experiment

IV-A Experimental Settings

We conduct experiments mainly on MNIST [52] and CIFAR-10 [53], as they are widely used for federated learning evaluation. For FedD3, we use GMM for coreset generation in coreset-based instance [51] and employ a four-layer fully connected neural network model with the width of 1024 as instance for KIP-based instance [18].

TABLE I: The baselines and the corresponding binary logarithmic bit for one communication round, where

P

is the bit size of models

FedAvg	FedProx	FedNova	SCAFFOLD
$\log_{2}(P+1)$	$\log_{2}(P+1)$	$\log_{2}(P+9)$	$\log_{2}(2*P+1)$

TABLE II: The prediction accuracy (ACC) and

\gamma-

GCE comparison between FedD3 and other baselines on both MNIST and CIFAR-10 distributed in 500 clients.

Dataset

Metric

MSFL¹

OSFL

FedD3 (Ours)²

FedAvg

FedProx

FedNova

SCAFFOLD

FedAvg

FedProx

FedNova

SCAFFOLD

Coreset

KIP

MNIST

IID

ACC %

96.97

\pm

0.02

96.55

\pm

0.03

85.50

\pm

0.01

97.49

\pm

0.02

85.34

\pm

0.02

83.63

\pm

0.02

74.78

\pm

0.02

67.47

\pm

5.89

86.82

\pm

0.46

94.37

\pm

0.67

0.01-GCE %

0.27

\pm

0.01

0.27

\pm

0.00

0.69

\pm

0.00

0.13

\pm

0.00

4.16

\pm

0.00

4.07

\pm

0.00

3.63

\pm

0.00

1.63

\pm

0.15

5.56

\pm

0.03

6.09

\pm

0.05

0.5-GCE %

1.48

\pm

0.00

1.38

\pm

0.01

1.79

\pm

0.00

0.82

\pm

0.00

10.66

\pm

0.01

9.88

\pm

0.01

7.12

\pm

0.01

2.85

\pm

0.54

15.01

\pm

0.34

24.99

\pm

1.78

Non-IID

ACC %

71.29

\pm

0.02

67.54

\pm

0.04

71.33

\pm

0.08

88.69

\pm

0.12

42.08

\pm

0.03

49.25

\pm

0.01

63.61

\pm

0.75

36.92

\pm

0.03

77.29

\pm

2.58

94.74

\pm

0.64

0.01-GCE %

0.20

\pm

0.00

0.20

\pm

0.00

0.20

\pm

0.00

0.12

\pm

0.00

2.02

\pm

0.00

2.37

\pm

0.00

3.07

\pm

0.04

0.89

\pm

0.00

5.76

\pm

0.20

7.17

\pm

0.06

0.5-GCE %

0.31

\pm

0.00

0.31

\pm

0.00

0.30

\pm

0.00

0.35

\pm

0.00

2.64

\pm

0.00

3.31

\pm

0.00

5.04

\pm

0.11

1.11

\pm

0.00

11.95

\pm

1.13

30.43

\pm

2.17

CIFAR-10

IID

ACC %

48.12

\pm

0.38

47.89

\pm

0.27

47.69

\pm

0.80

41.89

\pm

0.11

36.23

\pm

0.60

36.26

\pm

0.63

36.59

\pm

0.27

24.33

\pm

0.01

46.18

\pm

0.08

48.97

\pm

0.83

0.1-GCE %

0.60

\pm

0.76

0.33

\pm

0.00

0.42

\pm

0.06

0.14

\pm

0.00

1.47

\pm

0.03

1.47

\pm

0.03

1.49

\pm

0.01

0.49

\pm

0.00

2.74

\pm

0.01

2.92

\pm

0.05

2-GCE %

1.60

\pm

1.35

1.14

\pm

0.02

1.34

\pm

0.21

0.40

\pm

0.00

3.46

\pm

0.12

3.47

\pm

0.13

3.54

\pm

0.06

0.83

\pm

0.00

8.91

\pm

0.04

10.50

\pm

0.53

Non-IID

ACC %

13.14

\pm

5.02

18.46

\pm

0.83

12.98

\pm

5.00

34.27

\pm

0.04

10.73

\pm

0.01

10.71

\pm

0.01

10.72

\pm

0.01

10.05

\pm

0.00

30.32

\pm

1.43

38.27

\pm

1.45

0.1-GCE %

0.35

\pm

0.04

0.19

\pm

0.04

0.39

\pm

0.00

0.12

\pm

0.00

0.42

\pm

0.00

0.42

\pm

0.00

0.42

\pm

0.00

0.20

\pm

0.00

2.02

\pm

0.10

2.58

\pm

0.10

2-GCE %

0.44

\pm

0.04

0.24

\pm

0.05

0.48

\pm

0.00

0.26

\pm

0.00

0.52

\pm

0.00

0.52

\pm

0.00

0.52

\pm

0.00

0.24

\pm

0.00

4.01

\pm

0.36

6.45

\pm

0.56

1

We select the best result in the first 18 and 6 communication rounds for the training on MNIST and CIFAR-10, respectively.
2

Each client contributes 1 distilled image per class (Img/Cls = 1) from its local dataset.

TABLE III: The prediction accuracy (Acc) and

\gamma-

GCE comparison between FedD3 and other baselines on Fashion-MNIST and CIFAR-100 distributed in 200 clients, and on SVHN distributed in 100 clients.

Dataset		Metric	MSFL	OSFL	FedD3
Fashion-MNIST	IID	Acc %	79.40 $\pm$ 0.00	69.69 $\pm$ 0.07	74.80 $\pm$ 0.75
		0.01-GCE %	0.95 $\pm$ 0.00	2.48 $\pm$ 0.00	4.65 $\pm$ 0.05
		0.5-GCE %	2.05 $\pm$ 0.00	4.45 $\pm$ 0.01	9.13 $\pm$ 0.23
	Non-IID	Acc %	40.69 $\pm$ 0.01	27.92 $\pm$ 0.00	76.78 $\pm$ 0.98
		0.01-GCE %	0.48 $\pm$ 0.00	0.99 $\pm$ 0.00	5.56 $\pm$ 0.07
		0.5-GCE %	0.62 $\pm$ 0.00	1.16 $\pm$ 0.00	11.39 $\pm$ 0.39
SVHN	IID	Acc %	80.99 $\pm$ 0.00	25.00 $\pm$ 0.49	80.42 $\pm$ 0.63
		0.01-GCE %	0.16 $\pm$ 0.00	0.44 $\pm$ 0.01	3.85 $\pm$ 0.03
		0.5-GCE %	0.36 $\pm$ 0.00	0.51 $\pm$ 0.01	8.56 $\pm$ 0.21
	Non-IID	Acc %	46.32 $\pm$ 0.06	19.96 $\pm$ 0.05	69.10 $\pm$ 0.98
		0.01-GCE %	0.10 $\pm$ 0.01	0.35 $\pm$ 0.00	3.69 $\pm$ 0.05
		0.5-GCE %	0.13 $\pm$ 0.02	0.39 $\pm$ 0.01	6.56 $\pm$ 0.02
CIFAR-100	IID	Acc %	41.15 $\pm$ 0.03	20.21 $\pm$ 0.06	47.89 $\pm$ 0.42
		0.1-GCE %	0.15 $\pm$ 0.01	0.36 $\pm$ 0.00	2.41 $\pm$ 0.02
		2-GCE %	0.05 $\pm$ 0.00	0.55 $\pm$ 0.00	8.31 $\pm$ 0.21
	Non-IID¹	Acc %	29.79 $\pm$ 0.03	10.27 $\pm$ 0.00	38.05 $\pm$ 0.49
		0.1-GCE %	0.04 $\pm$ 0.00	0.18 $\pm$ 0.00	1.97 $\pm$ 0.03
		2-GCE %	0.07 $\pm$ 0.00	0.22 $\pm$ 0.00	4.90 $\pm$ 0.14

1

Each client holds data with 50 classes ( $C_{k}=50$ ).

We compare FedD3 with eight other baselines, where four federated learning methods are evaluated in both multi-shot federated learning (MSFL) and one-shot federated learning (OSFL). The federated learning methods are: FedAvg [1], FedProx [54], FedNova [55] and SCAFFOLD [31]. The required binary logarithmic bit for communication is shown in Tab. I. We train an LeNet [56] and an AlexNet [57] on MNIST and CIFAR-10, respectively.

Considering properties in federated learning [1], we demonstrate our methods and baselines on IID and Non-IID datasets in distributed systems with a different number of clients. We vary the number of classes in each client $C_{k}$ to design Non-IID datasets [58]. Guided by federated learning benchmark datasets made by Caldas et al. [59], we use Non-IID to denote $C_{k}=2$ (pathological Non-IID [19]), unless specified otherwise.

IV-B Robust Training on Heterogeneous Data

First, we focus on the performance of FedD3 on data heterogeneity and consider FedAvg as the baseline here. The accuracy changing with increasing total communication volume required in each method is observed in Fig, 2. In experiment with IID decentralized datasets, though FedAvg with unlimited communication cost has higher accuracy than FedD3, FedD3 outperforms FedAvg at the same communication volume. The best performance of one-shot FedAvg can be reached by using FedD3 with only around half of the communication volume. Note that we consider the communication volume only for uploading. In fact, MSFL methods still require further cost for downloading the global models.

Additionally, as shown in Fig. 2, the performance of FedD3 is obviously more robust than FedAvg, when the data heterogeneity across clients increases from left to right. With decreasing number of classes in local datasets, the prediction accuracy FedAvg reduces notably, however, the results of FedD3 are not much affected. In the scenario with extreme data heterogeneity, i.e. $C_{k}=1$ , the standard federated learning can even not easily converged and one-shot federated learning performances very badly, while FedD3 can achieve the similar accuracy as in IID scenarios. We believe the reason is that aggregating distilled datasets allows server to train a model on a similar distribution of original datasets.

IV-C Scalable Communication Efficiency

Then, we compare the results from 2 variants of FedD3 with 8 baselines on both IID and Non-IID, MNIST and CIFAR10 distributed in 500 clients. As shown in Tab. II, on IID datasets, FedD3 can achieve a comparable test accuracy to other federated learning, while the GCE is significantly higher. On Non-IID datasets, FedD3 is outperforms others for both test accuracy and GCE. We consider various $\gamma$ to indicate the evaluation based on different importance of accuracy. We assign $\gamma$ for CIFAR-10 with greater values than for MNIST, as a high accuracy for training model on CIFAR-10 is harder to achieve and should deserve to spend more communication cost for it.

We conduct further experiments for training a ResNet-18 [60] on Fashion-MNIST [61] and SVHN [62], and training a CNN consisting of 5 convolutional and 3 fully connected layers on CIFAR-100 [62], using MSFL and OSFL with FedAvg, and FedD3 with KIP-instances. As shown in Tab. III, we observe consistent results.

To perform the scalablity of communication efficiency, we meanwhile evaluate FedD3 with increasing number of images per class in each client. Fig. 3 shows the test accuracy raises and thereby GCE with higher $\gamma$ increases, when each client contributes more distilled images to the server. However, GCE with $\gamma=0.01$ reduces, because much communication cost is required for only a small accuracy gain. FedD3 provides the opportunity to optimize the GCE by adjusting the Img/Cls in decentralized dataset distallation, considering different importance of accuracy and constraint communication budget in practical applications.

IV-D Evaluation with System Parameters

Finally, we explore the performance of FedD3 affected by federated system parameters, including the number of clients and distilled images. In Fig. 4, we run FedD3 in federated systems containing various number of clients with IID and Non-IID decentralized datasets. We hold the total number of distilled images constant by varying the number of distilled images in each client. Fig. 4 shows that the prediction accuracy decreases at a larger number of clients. This can be led by the following reason: When increasing the client number, both the local data volume and distilled image per client reduce in our experimental setup. This results in lower granularity and thereby decreases prediction accuracy.

In fact, federated learning considers massively distributed systems in practical applications. Thus, even if each client provides a small number of distilled images, a promising training performance can be achieved, when the number of clients is large. As we can observe in Fig. 4, when each client provides the same number of distilled images, a better model can be trained with more clients, due to more distilled images received in the server. Moreover, additional clients in real applications can enrich the training dataset, and hence reach a better prediction accuracy, which is consistent with the motivation of deploying federated learning.

V Discussion

Distilled Datasets in Multi-round Due to the robustness on data heterogeneity, we further explore the potential benefits of distilled datasets in federated learning. We believe that sharing such synthetic data might bridge the information silos. For that, we extend FedD3 to multiple shots and consider a hybrid federated learning method by adding a spoon of distilled datasets from other clients via D2N (Device to Networks) or D2D (Device to Device) networks, before the first round in standard federated learning.

Network Assumption Compared to multi-shot schemes, a one-shot scheme is less affected by network heterogeneity. Despite, to evaluate the impact of network heterogeneity and address the network assumption in federated learning application scenarios, we consider the Quality of Service (QoS) of at most once for the one-shot scheme and compare the performance of FedD3 and OSFL with different ratios of stragglers. As shown in Fig 5, FedD3 outperforms OSFL.

Computation Costs Most dataset distillation methods can result in high computation loads. In our experiment setup, we take into account computation limitations in KIP and scale the number of distilled images on each client. In our FedD3 setup, we use FC-4 kernel without pooling and convolutional kernels for dataset distillation. In each client, we only distill a small number of images, i.e. 1 Img/Cls. In Federated Learning, local training can result in a linear increase in computational cost with the number of epochs. We also disregard the computational cost caused by a large number of epochs and aim to achieve better results in the baselines for a fair comparison only for communication efficiency. Additionally, we measure the required time on a compact NVIDIA Quadro RTX 4000 GPU for training an AlexNet on CIFAR-10: On average 73.69s are needed for FedD3 with 800 distillation steps for 1 Img/Cls, and 89.96s for one-shot FedAvg with 10 epochs.

Beyond Kernel: Individual DD-instances Because of stable performance, we mainly employ KIP methods as dataset distillation instances in FedD3. In fact, according to the actual local datasets, we can also consider individual instances to generate the synthetic dataset for uploading. Self-selecting instances in each client should be encouraged, which can be advantageous to the quality of distilled data, as the server and other clients should be not aware of the distribution of local data. Additionally, we believe the autonomy of local dataset distillation can enhance privacy.

Lessons Learned for Orchestration In FedD3, all local datasets should meet the requirement of the used dataset distillation method. Here we can think about two example scenarios: (i) If there is only one data point in one of the clients, $\tilde{D}_{k}$ and $D_{k}$ are the same at the first epoch in ClientDatasetDistillation in Algorithm 1; (ii) If there is only one data point with its label $x_{i}$ for a specific class $y_{c}$ , the distilled dataset $\tilde{x}_{i}$ can be always the same as $x_{i}$ , especially when only the loss is backpropagated on $\tilde{X}$ . Both situations can break the privacy, as it will upload at least one raw data point to the server. To tackle this issue, we believe a good orchestration for dataset and client selection is recommended.

VI Conclusion

In this work, we introduce a novel federated learning framework, FedD3, which reduces the overall communication volume and with that opens up the concept of federated learning to more application scenarios in particular in network constrained environments. It achieves this by leveraging local dataset distillation instead of traditional learning approaches (i) to significantly reduce communication volumes and (ii) to limit transfers to one-shot communication, rather than iterative multi-way communication. Our conducted experimental evaluation shows that FedD3 can well balance the trade-off between prediction accuracy and communication cost for federated learning. Compared to other federated learning, it provides a more robust training performance, especially on Non-IID data silos.

References

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 54, 2017, pp. 1273–1282.
[2] R. Pathak and M. J. Wainwright, “Fedsplit: An algorithmic framework for fast federated optimization,” Advances in Neural Information Processing Systems, vol. 33, pp. 7057–7066, 2020.
[3] H. Yuan and T. Ma, “Federated accelerated stochastic gradient descent,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 5332–5344. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/39d0a8908fbe6c18039ea8227f827023-Paper.pdf
[4] G. Malinovskiy, D. Kovalev, E. Gasanov, L. Condat, and P. Richtárik, “From local sgd to local fixed-point methods for federated learning,” in ICML, 2020, pp. 6692–6701. [Online]. Available: http://proceedings.mlr.press/v119/malinovskiy20a.html
[5] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora, “Fetchsgd: Communication-efficient federated learning with sketching,” in ICML, 2020, pp. 8253–8265. [Online]. Available: http://proceedings.mlr.press/v119/rothchild20a.html
[6] J. Hamer, M. Mohri, and A. T. Suresh, “FedBoost: A communication-efficient algorithm for federated learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 3973–3983. [Online]. Available: https://proceedings.mlr.press/v119/hamer20a.html
[7] R. Song, L. Zhou, L. Lyu, A. Festag, and A. Knoll, “Resfed: Communication efficient federated learning by transmitting deep compressed residuals,” arXiv preprint arXiv:2212.05602, 2022.
[8] H. Gao, A. Xu, and H. Huang, “On the convergence of communication-efficient local sgd for federated learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 7510–7518.
[9] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[10] A. Festag, “Standards for vehicular communication - from ieee 802.11p to 5g,” e & i Elektrotechnik und Informationstechnik, Springer Verlag, vol. 132, no. 7, pp. 409–416, 2015. [Online]. Available: https://link.springer.com/article/10.1007/s00502-015-0343-0
[11] R. Karlstetter, A. Raoofy, M. Radev, C. Trinitis, J. Hermann, and M. Schulz, “Living on the edge: Efficient handling of large scale sensor data,” in 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2021, pp. 1–10.
[12] N. Guha, A. Talwalkar, and V. Smith, “One-shot federated learning,” arXiv preprint arXiv:1902.11175, 2019.
[13] A. Kasturi, A. R. Ellore, and C. Hota, “Fusion learning: A one shot federated learning,” in International Conference on Computational Science. Springer, 2020, pp. 424–436.
[14] Q. Li, B. He, and D. Song, “Practical one-shot federated learning for cross-silo setting,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Z.-H. Zhou, Ed. International Joint Conferences on Artificial Intelligence Organization, 8 2021, pp. 1484–1490, main Track. [Online]. Available: https://doi.org/10.24963/ijcai.2021/205
[15] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros, “Dataset distillation,” arXiv preprint arXiv:1811.10959, 2018.
[16] G. Cazenavette, T. Wang, A. Torralba, A. A. Efros, and J.-Y. Zhu, “Dataset distillation by matching training trajectories,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[17] B. Zhao, K. R. Mopuri, and H. Bilen, “Dataset condensation with gradient matching,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=mSAKhLYLSsl
[18] T. Nguyen, R. Novak, L. Xiao, and J. Lee, “Dataset distillation with infinitely wide convolutional networks,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[19] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang, “Personalized cross-silo federated learning on non-iid data.” in AAAI, 2021, pp. 7865–7873.
[20] R. Song, R. Xu, A. Festag, J. Ma, and A. Knoll, “Fedbevt: Federated learning bird’s eye view perception transformer in road traffic systems,” arXiv preprint arXiv:2304.01534, 2023.
[21] H. Yu, Z. Liu, Y. Liu, T. Chen, M. Cong, X. Weng, D. Niyato, and Q. Yang, “A fairness-aware incentive scheme for federated learning,” in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020, pp. 393–399.
[22] T. Dong, B. Zhao, and L. Lyu, “Privacy for free: How does dataset condensation help privacy?” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 162. PMLR, 17–23 Jul 2022, pp. 5378–5396. [Online]. Available: https://proceedings.mlr.press/v162/dong22c.html
[23] I. Sucholutsky and M. Schonlau, “Secdd: Efficient and secure method for remotely training neural networks (student abstract),” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 15 897–15 898.
[24] G. Li, R. Togo, T. Ogawa, and M. Haseyama, “Soft-label anonymous gastric x-ray image distillation,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 305–309.
[25] Y. Han and X. Zhang, “Robust federated learning via collaborative machine teaching,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 4075–4082.
[26] Y. Wang, Y. Tong, and D. Shi, “Federated latent dirichlet allocation: A local differential privacy based framework,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 6283–6290.
[27] Y. Zhou, X. Ma, D. Wu, and X. Li, “Communication-efficient and attack-resistant federated edge learning with dataset distillation,” IEEE Transactions on Cloud Computing, 2022.
[28] Y. Xiong, R. Wang, M. Cheng, F. Yu, and C.-J. Hsieh, “FedDM: Iterative distribution matching for communication-efficient federated learning,” in Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022. [Online]. Available: https://openreview.net/forum?id=erV2t8ZLk2o
[29] G. Li, R. Togo, T. Ogawa, and M. Haseyama, “Dataset distillation for medical dataset sharing,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Workshop, 2023, pp. 1–6.
[30] ——, “Compressed gastric image generation based on soft-label dataset distillation for medical data sharing,” Computer Methods and Programs in Biomedicine, vol. 227, p. 107189, 2022.
[31] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 5132–5143. [Online]. Available: https://proceedings.mlr.press/v119/karimireddy20a.html
[32] A. Reisizadeh, F. Farnia, R. Pedarsani, and A. Jadbabaie, “Robust federated learning: The case of affine distribution shifts,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 554–21 565, 2020.
[33] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečnỳ, S. Mazzocchi, B. McMahan et al., “Towards federated learning at scale: System design,” Proceedings of Machine Learning and Systems, vol. 1, pp. 374–388, 2019.
[34] R. Xu, X. Xia, J. Li, H. Li, S. Zhang, Z. Tu, Z. Meng, H. Xiang, X. Dong, R. Song, H. Yu, B. Zhou, and J. Ma, “V2v4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception,” in The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2023.
[35] C. He, S. Li, J. So, X. Zeng, M. Zhang, H. Wang, X. Wang, P. Vepakomma, A. Singh, H. Qiu et al., “Fedml: A research library and benchmark for federated machine learning,” arXiv preprint arXiv:2007.13518, 2020.
[36] L. U. Khan, W. Saad, Z. Han, E. Hossain, and C. S. Hong, “Federated learning for internet of things: Recent advances, taxonomy, and open challenges,” IEEE Communications Surveys & Tutorials, 2021.
[37] R. Song, L. Zhou, V. Lakshminarasimhan, A. Festag, and A. Knoll, “Federated learning framework coping with hierarchical heterogeneity in cooperative its,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 2022, pp. 3502–3508.
[38] D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, 2020, pp. 202–221.
[39] Z. Chen, A. Gholami, M. Niessner, and A. X. Chang, “Scan2cap: Context-aware dense captioning in rgb-d scans,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3193–3203.
[40] D. Z. Chen, Q. Wu, M. Nießner, and A. X. Chang, “D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,” in Proceedings of European Conference on Computer Vision (ECCV), 2022.
[41] R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2583–2589.
[42] R. Xu, Y. Guo, X. Han, X. Xia, H. Xiang, and J. Ma, “Opencda: an open cooperative driving automation framework integrated with co-simulation,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 1155–1162.
[43] S. Banik, A. M. GarcÍa, and A. Knoll, “3d human pose regression using graph convolutional network,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 924–928.
[44] B. Zhao and H. Bilen, “Dataset condensation with differentiable siamese augmentation,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 12 674–12 685. [Online]. Available: https://proceedings.mlr.press/v139/zhao21a.html
[45] K. Wang, B. Zhao, X. Peng, Z. Zhu, S. Yang, S. Wang, G. Huang, H. Bilen, X. Wang, and Y. You, “Cafe: Learning to condense dataset by aligning features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 12 196–12 205.
[46] Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, “Distilled one-shot federated learning,” arXiv preprint arXiv:2009.07999, 2020.
[47] J. Goetz and A. Tewari, “Federated learning via synthetic data,” arXiv preprint arXiv:2008.04489, 2020.
[48] I. Sucholutsky and M. Schonlau, “Soft-label dataset distillation and text dataset distillation,” in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
[49] T. C. Nguyen, Z. Chen, and J. Lee, “Dataset meta-learning from kernel ridge-regression,” in ICLR 2021, 2021. [Online]. Available: https://openreview.net/forum?id=l-PrrQrK0QR
[50] Y. Chen, M. Welling, and A. Smola, “Super-samples from kernel herding,” in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, ser. UAI’10. Arlington, Virginia, USA: AUAI Press, 2010, p. 109–116.
[51] J.-P. Baudry, A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo, “Combining mixture components for clustering,” Journal of computational and graphical statistics, vol. 19, no. 2, pp. 332–353, 2010.
[52] Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[53] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical report, 2009.
[54] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.
[55] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in neural information processing systems, vol. 33, pp. 7611–7623, 2020.
[56] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998, doi:10.1109/5.726791.
[57] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012. [Online]. Available: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[58] Q. Li, Y. Diao, Q. Chen, and B. He, “Federated learning on non-iid data silos: An experimental study,” in IEEE International Conference on Data Engineering, 2022.
[59] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar, “Leaf: A benchmark for federated settings,” arXiv preprint arXiv:1812.01097, 2018.
[60] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[61] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[62] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.