
Mode Connectivity and Data Heterogeneity of Federated Learning

Tailin Zhou, Jun Zhang,  and Danny H.K. Tsang T. Zhou is with IPO, Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (Email: [email protected]). J. Zhang is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (E-mail: [email protected]). D. H.K. Tsang is with the Internet of Things Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China, and also with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (Email: [email protected]). (The corresponding author is J. Zhang.)
Abstract

Federated learning (FL) enables multiple clients to collaboratively train a model while keeping their data private. Previous studies have shown that data heterogeneity between clients leads to drifts across client updates. However, there are few studies on the relationship between client and global modes, making it unclear where these updates end up drifting. We perform empirical and theoretical studies on this relationship by utilizing mode connectivity, which measures performance change (i.e., connectivity) along parametric paths between different modes. Empirically, reducing data heterogeneity makes the connectivity on different paths more similar, forming more low-error overlaps between client and global modes. We also find that a barrier to connectivity occurs when linearly connecting two global modes, whereas it disappears when considering non-linear mode connectivity. Theoretically, we establish a quantitative bound on the global-mode connectivity using mean-field theory and dropout stability. The bound demonstrates that the connectivity improves when reducing data heterogeneity and widening trained models. Numerical results further corroborate our analytical findings.

Index Terms:
Federated learning, mode connectivity, data heterogeneity, mean-field theory, loss landscape visualization.

I Introduction

Federated Learning (FL) [1] is a collaborative paradigm that enables multiple clients to train a model while preserving data privacy. A primary challenge in FL is data heterogeneity across clients [2]. Previous works like [3] found that data heterogeneity induces a misalignment between clients’ local objectives and the FL’s global objective (hereinafter referred to as client objectives and the FL objective, respectively) and causes client updates to drift. Nevertheless, despite client-update drift, the FL objective can typically be optimized effectively [4]. This implies that the destinations of these drifting updates (i.e., the solutions to the client and FL objectives) may overlap. However, a systematic investigation of this relationship is currently lacking.

A straightforward method to examine this overlapping relationship is to investigate mode connectivity [5], where a mode refers to a set of solutions to a specific objective, i.e., a set of neuron parameters, under permutation invariance. Mode connectivity measures the connectivity of two modes via a parametric path and reveals their geometric relationship in the solution space. Specifically, it evaluates the performance change (e.g., taking loss/error as metrics) along a given path whose endpoints correspond to two different modes. A small change along the path indicates better connectivity (i.e., the geometric landscape between the two modes is more connected). Some works [5, 6] have discovered that two modes can be connected well in centralized training, but research on FL remains scarce.

Figure 1: An illustration of the relationship between client modes and global modes of FL in the solution space.

In this work, we conduct empirical and theoretical studies on mode connectivity in FL under data heterogeneity to unravel the relationship between client modes (i.e., the solutions to client objectives) and global modes (i.e., the solutions to the FL objective). Empirically, we consider linear and non-linear paths to evaluate the connectivity between client modes and global modes to show their overlap relationships, as illustrated in Figure 1. We also investigate the connectivity of global modes obtained under varying levels of data heterogeneity. Theoretically, we leverage mean-field theory [7] and dropout stability to establish a quantitative bound on the connectivity errors of global modes. In addition, we conduct a comprehensive numerical analysis to verify our analytical findings.

This study takes the first step to explore mode connectivity in FL. Our main contributions are summarized as follows:

  • Our empirical and theoretical investigations demonstrate that data heterogeneity deteriorates the connectivity between client and global modes, making it challenging to find solutions that meet both client and FL objectives.

  • It is revealed that FL solutions from different data heterogeneity belong to distinct global modes within a common region of the solution space. They can be connected on a polygonal chain while keeping error low but face an error barrier when considering a linear path.

  • We establish quantitative bounds on dropout and connectivity errors of global modes using mean-field theory, showing that both errors decrease with reduced data heterogeneity and widening trained models.

II Related Works

Data heterogeneity in FL. According to [3], data heterogeneity across clients significantly degrades the FL performance. Subsequent works have observed that data heterogeneity induces weight divergence and feature inconsistency. For weight divergence, studies such as [8, 9] show that client-update drift slows the FL convergence due to inconsistent client and FL objectives. Moreover, some studies, e.g., [10, 11], discover that the classifier divergence across clients is the main culprit of feature inconsistency, negatively affecting FL generalization. In contrast to previous works, we offer a novel perspective on data heterogeneity’s impact within the FL solution space.

Mode connectivity. Mode connectivity, i.e., the connectivity between neural network (NN) solutions that arises from NN non-convexity, has provided a new view of the geometric landscape of NNs in recent years. Despite the high dimensionality of NNs, different solutions follow a clear connectivity pattern [12, 13, 6]. Specifically, it is possible to connect local minima through specific paths, which makes it easier to find high-performance solutions when using gradient-based optimizers. These paths are often non-linear curves and are discovered by solving dedicated path-finding tasks. This suggests that local minima are not isolated but rather interconnected within a manifold. Moreover, the geometric landscape of an NN becomes approximately linearly connected when the number of neurons goes to infinity, as per [12]. Previous studies mainly focus on mode connectivity in centralized training with homogeneous data. Several pilot studies [14, 15] visualize the loss landscape of FL but do not study mode connectivity; our work fills this gap by examining mode connectivity in FL under data heterogeneity.

Mean-field theory. A mean-field theory for analyzing NN training dynamics is proposed in [7, 16] based on a gradient-flow approximation. The theory shows that, when the number of neurons is sufficiently large, the dynamics of stochastic gradient descent (SGD) for optimizing NNs can be well approximated by a Wasserstein gradient flow. This approximation has been studied both in the two-layer case [7, 17] and the multi-layer case [18, 19]. In particular, the approximation is utilized in [19] to analyze the dropout stability of SGD solutions, and dropout stability implies mode connectivity of NNs according to [20]. However, these results have only been established for centralized training. To the best of our knowledge, this paper is the first to extend these theoretical results to FL.

III Preliminaries

Figure 2: Loss landscapes for the global mode and client modes on their corresponding test sets at the 200-th round (final round), and LMC from client modes to the global mode. Loss landscapes of VGG11 on CIFAR-10 with $\alpha=0.1$ (left), CIFAR-10 with $\alpha=0.5$ (middle), and PACS with $\alpha=0.5$ (right).

Mode connectivity. The forward function of an NN parameterized by $\bm{\theta}\in\Theta$ is represented as $f_{\bm{\theta}}:\mathcal{X}\rightarrow\mathcal{Y}$, where $\Theta$ denotes the parameter space of an NN architecture, and $\mathcal{X}\subseteq\mathbb{R}^{d}$ and $\mathcal{Y}\subseteq\mathbb{R}$ are the input and output spaces, respectively. When models $\bm{\theta}_{1}$ and $\bm{\theta}_{2}$ have the same neurons but different neuron permutations, they belong to the same mode. According to [20, 19], given an NN architecture $\Theta$, a dataset $\mathcal{D}$ and a loss function $\mathcal{L}(\cdot)$, mode connectivity is defined as follows:

Definition 1.

(Mode connectivity). Two mode parameters $\bm{\theta}_{1}\in\Theta$ and $\bm{\theta}_{2}\in\Theta$ are $\varepsilon_{C}$-connected if there exists a continuous path in parameter space $\pi_{\bm{\theta}}(\nu):[0,1]\rightarrow\Theta$ such that $\pi_{\bm{\theta}}(0)=\bm{\theta}_{1}$ and $\pi_{\bm{\theta}}(1)=\bm{\theta}_{2}$ with $\mathcal{L}(\pi_{\bm{\theta}}(\nu))\leq\max(\mathcal{L}(\bm{\theta}_{1}),\mathcal{L}(\bm{\theta}_{2}))+\varepsilon_{C}$ for all $\nu\in[0,1]$, where $\varepsilon_{C}$ is the connectivity error.

This definition measures the connectivity of two modes by comparing the loss change along a path to its endpoints. Moreover, we consider a path-finding task proposed by [5], which minimizes the expected loss over a uniform distribution on the path as:

\[\min\mathcal{L}_{\rm curve}(\bm{\theta})=\mathbb{E}_{\nu\sim U(0,1)}\left[\mathcal{L}(\pi_{\bm{\theta}}(\nu))\right], \qquad (1)\]

where the path $\pi_{\bm{\theta}}(\nu)$ can be characterized by a polygonal chain (PolyChain) with one or more bends [5]. We consider a PolyChain with one bend:

\[\pi_{\bm{\theta}}(\nu)=\begin{cases}2\left[(0.5-\nu)\bm{\theta}_{1}+\nu\bm{\theta}_{\rm b}\right],&0\leq\nu\leq 0.5,\\ 2\left[(\nu-0.5)\bm{\theta}_{2}+(1-\nu)\bm{\theta}_{\rm b}\right],&0.5\leq\nu\leq 1,\end{cases} \qquad (2)\]

where $\bm{\theta}_{\rm b}$ is the model at the PolyChain bend and can be found by solving $\min_{\bm{\theta}_{\rm b}}\mathcal{L}_{\rm curve}(\bm{\theta}_{\rm b})$ in (1). We refer to mode connectivity as linear mode connectivity (LMC) when using linear interpolation, and as non-linear mode connectivity when using a PolyChain. Compared with its non-linear counterpart, LMC is a stronger requirement since it forces the two modes to lie in the same basin.
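As a concrete illustration, the following sketch evaluates the loss along a linear path or along the one-bend PolyChain in (2) by interpolating model parameters in PyTorch. The helper names (path_point, path_losses) and the user-supplied eval_loss callable are our own assumptions, not part of any FL library.

import torch

def combine(pairs):
    # Weighted combination sum_j w_j * theta_j of (weight, state_dict) pairs;
    # non-floating buffers, if any, are copied from the first state dict.
    out = {}
    for key, value in pairs[0][1].items():
        if value.is_floating_point():
            out[key] = sum(w * sd[key] for w, sd in pairs)
        else:
            out[key] = value.clone()
    return out

def path_point(sd1, sd2, nu, sd_bend=None):
    # Linear interpolation, or the one-bend PolyChain of Eq. (2) when sd_bend is given.
    if sd_bend is None:
        return combine([(1.0 - nu, sd1), (nu, sd2)])
    if nu <= 0.5:
        return combine([(2 * (0.5 - nu), sd1), (2 * nu, sd_bend)])
    return combine([(2 * (nu - 0.5), sd2), (2 * (1.0 - nu), sd_bend)])

@torch.no_grad()
def path_losses(model, sd1, sd2, eval_loss, sd_bend=None, steps=21):
    # Traverse the path and record the loss at each point nu in [0, 1].
    losses = []
    for i in range(steps):
        nu = i / (steps - 1)
        model.load_state_dict(path_point(sd1, sd2, nu, sd_bend))
        losses.append(eval_loss(model))
    return losses

Finding the bend $\bm{\theta}_{\rm b}$ as in (1) then amounts to sampling $\nu\sim U(0,1)$ and training the bend parameters to minimize the training loss of path_point(sd1, sd2, $\nu$, sd_bend).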

Federated learning (FL). We consider an FL framework with $K$ clients, each with its own dataset $\mathcal{D}_{k}=\{({\bm{x}}_{i_{k}},y_{i_{k}})\}$, where data samples $({\bm{x}}_{i_{k}},y_{i_{k}})\sim\mathbb{P}_{k}$ on $\mathbb{R}^{d}\times\mathbb{R}$ are indexed by $i_{k}\in[n_{k}]$. The global dataset is the union of all client datasets, denoted by $\mathcal{D}=\cup_{k=1}^{K}\mathcal{D}_{k}\sim\mathbb{P}$ on $\mathbb{R}^{d}\times\mathbb{R}$, and contains $n=\sum_{k=1}^{K}n_{k}$ data samples. The FL objective is to minimize the expected global loss $\mathcal{L}(\bm{\theta})$ on $\mathcal{D}$, formulated as:

\[\min_{\bm{\theta}}\mathcal{L}(\bm{\theta})=\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}}[l(\bm{\theta};({\bm{x}},y))]=\sum_{k=1}^{K}\frac{n_{k}}{n}\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}_{k}}[l_{k}(\bm{\theta};({\bm{x}},y))], \qquad (3)\]

where $l(\cdot)$ and $l_{k}(\cdot)$ denote the global and client loss functions, respectively. The $k$-th client objective is $\min_{\bm{\theta}_{k}}\mathcal{L}_{k}(\bm{\theta}_{k})=\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}_{k}}[l_{k}(\bm{\theta}_{k};({\bm{x}},y))]$. When $\mathbb{P}_{k}\neq\mathbb{P}$, FedAvg [1] uses the client models $\{\bm{\theta}_{k}\}_{k=1}^{K}$ to solve (3) by minimizing the client objectives locally and obtaining the global model $\bm{\theta}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{k}$ round by round.
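For completeness, a minimal FedAvg round consistent with the description above is sketched below in PyTorch-style Python; the local_update routine, the loader handles, and the function name are assumptions rather than the authors' implementation.

import copy

def fedavg_round(global_model, client_loaders, local_update):
    # One FedAvg round: broadcast the global model, train locally on each client,
    # and aggregate by the data-weighted average theta = sum_k (n_k / n) * theta_k.
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)
        local_update(local_model, loader)           # minimize the k-th client objective locally
        client_states.append(local_model.state_dict())
        client_sizes.append(len(loader.dataset))    # n_k
    n = float(sum(client_sizes))
    averaged = {}
    for key, value in client_states[0].items():
        if value.is_floating_point():
            averaged[key] = sum((n_k / n) * sd[key] for n_k, sd in zip(client_sizes, client_states))
        else:
            averaged[key] = value.clone()           # integer buffers (e.g., counters) are not averaged
    global_model.load_state_dict(averaged)
    return global_model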

Experimental setups. We explore mode connectivity in FL on classification tasks, including MNIST [21], CIFAR-10 [22], and PACS [23]. The experimental setups used throughout this paper are as follows, unless stated otherwise. For NN architectures, we consider a two-layer NN, a shallow convolutional neural network (CNN) with two convolutional layers, and a deep CNN, i.e., the VGG architecture [24]. For the FL setup, we consider ten clients participating in FL for a total of 200 rounds. Client optimizers are SGD with a learning rate of 0.01 and momentum of 0.9, a mini-batch size of 50, one local epoch, and dropout of 0.5 for training VGG. For data heterogeneity, we follow [2] and examine label distribution skew by using a Dirichlet distribution $Dir(\alpha)$ [25] to create client label distributions, where a larger $\alpha$ yields more homogeneous distributions across clients. On this basis, we also make use of the built-in domain shift of PACS to introduce feature distribution skew across clients.
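One common way to realize the $Dir(\alpha)$ label-distribution skew (in the spirit of [25]) is sketched below; the function name and the use of NumPy's Dirichlet sampler are our own choices.

import numpy as np

def dirichlet_label_split(labels, num_clients=10, alpha=0.5, seed=0):
    # Partition sample indices across clients with label-distribution skew Dir(alpha);
    # a smaller alpha yields more heterogeneous client label distributions.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))   # share of class c per client
        cut_points = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx_c, cut_points)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ids) for ids in client_indices]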

IV Mode Connectivity of Client Modes

In this section, we explore mode connectivity among client modes trained on client datasets, where $\mathbb{P}_{k}\neq\mathbb{P}_{k^{\prime}}$ for $k\neq k^{\prime}$. We first investigate the geometric landscape of client modes on their corresponding test sets. Specifically, we create a hyperplane using the global mode $\bm{\theta}$ and client modes $\bm{\theta}_{k}$, $\bm{\theta}_{k^{\prime}}$, represented as $\{(a,b)\,|\,\bm{\theta}+a(\bm{\theta}_{k}-\bm{\theta})+b(\bm{\theta}_{k^{\prime}}-\bm{\theta})\}$. We then use the test sets of clients $k$ and $k^{\prime}$ to evaluate the performance of all points on the hyperplane. Figure 2 displays the loss landscapes obtained on different test sets, distinguished by different colors. We also depict the loss landscape of the global mode (i.e., the initialization of the client modes) on the global test set in Figure 2. For comparison, the legend scale of the client-mode landscape is the same in each sub-figure of Figure 2 except for the lowest value, and the scale of the global-mode landscape is consistent across all sub-figures.
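The hyperplane evaluation can be reproduced with a sketch like the following (PyTorch-style; the grid ranges and the eval_loss callable are assumptions).

import numpy as np
import torch

@torch.no_grad()
def hyperplane_losses(model, sd_global, sd_k, sd_kp, eval_loss,
                      a_grid=np.linspace(-0.5, 1.5, 21),
                      b_grid=np.linspace(-0.5, 1.5, 21)):
    # Loss surface over the plane theta + a * (theta_k - theta) + b * (theta_k' - theta).
    losses = np.zeros((len(a_grid), len(b_grid)))
    for i, a in enumerate(a_grid):
        for j, b in enumerate(b_grid):
            point = {key: value + a * (sd_k[key] - value) + b * (sd_kp[key] - value)
                     if value.is_floating_point() else value
                     for key, value in sd_global.items()}
            model.load_state_dict(point)
            losses[i, j] = eval_loss(model)   # evaluated on the chosen test set
    return losses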

Figure 3: Traversing loss under $\alpha=0.1$ (left) and $\alpha=0.5$ (right) from client 1 to client 2.
Figure 4: Client models’ optimization trajectory (VGG11) along the training rounds (left) and along the local iterations within one round (right).
Figure 5: Training accuracy (left) and test accuracy (right) with linear connectivity and polygonal connectivity on global modes.

Visualizing mode connectivity on client modes. As shown in Figure 2, comparing the cases of $\alpha=0.1$ and $\alpha=0.5$ indicates that when data heterogeneity increases, the low-loss overlaps between client and global modes shrink, and the number of solutions satisfying both the client and FL objectives is reduced. Similar results are observed when comparing CIFAR-10 with PACS. Meanwhile, although client modes achieve a lower test loss than the global mode, they stray from the low-loss regions of the global mode. This indicates that client modes are prone to overfitting their own objectives when FedAvg solves (3), even though the global mode provides an adequate initialization for them in the final round. On the other hand, we can linearly connect client modes to the global mode in all cases, as shown in the LMC plots of Figure 2. As the loss changes along these linear paths, it can reach the low-loss regions of the global mode while still remaining in the low-loss regions of the client modes. These observations reveal that there exist solutions satisfying both the client and FL objectives, but they may be overlooked by FedAvg.

Traversing loss along the paths connecting client modes. To delve deeper into the loss landscapes of Figure 2, we consider two paths connecting two client modes. One path is the linear connection between them (i.e., $\bm{\theta}_{k}\rightarrow\bm{\theta}_{k^{\prime}}$), while the other is a PolyChain with the global mode as its bend (i.e., $\bm{\theta}_{k}\rightarrow\bm{\theta}\rightarrow\bm{\theta}_{k^{\prime}}$) as per (2). We follow the paths to traverse the landscape from $\bm{\theta}_{k}$ to $\bm{\theta}_{k^{\prime}}$ on their test sets and show the traversing losses along the paths in Figure 3. For the PolyChain in the case of $\alpha=0.1$, the intersection point of the two curves deviates from the global mode (i.e., $a=0.5$), where the intersection represents a solution that works effectively on both client datasets. Moreover, the traversing-loss gap between the two paths for $\alpha=0.1$ is greater than that for $\alpha=0.5$. This means that when client data are heterogeneous, it becomes more difficult to find solutions that align with both the client and FL objectives.

Client-mode trajectory. The optimization trajectory of client modes for $\alpha=0.1$ is depicted in Figure 4. The trajectory is visualized with t-SNE [26] and covers both training rounds and local iterations within each round. The figure illustrates that throughout training, the distance between client modes remains controlled even under severe data heterogeneity, such that they closely surround the global mode. Within the same round, data heterogeneity pushes client modes with the same initialization to be optimized locally along different directions. Therefore, the solutions found by the client objectives differ under data heterogeneity even though they are close to each other in the parameter space. Figure 12 shows similar results for the shallow CNN.
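A trajectory plot of this kind can be produced by flattening the model checkpoints collected over rounds and local iterations and projecting them with t-SNE; a minimal sketch (using scikit-learn, with the perplexity value as an assumption) is given below.

import numpy as np
from sklearn.manifold import TSNE

def embed_trajectory(checkpoints, perplexity=5, seed=0):
    # checkpoints: list of state dicts (global and client models saved along training).
    flat = np.stack([np.concatenate([p.detach().cpu().numpy().ravel() for p in sd.values()])
                     for sd in checkpoints])
    # Project the high-dimensional parameter vectors to 2D for visualization;
    # perplexity must be smaller than the number of checkpoints.
    return TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed).fit_transform(flat)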

V Mode Connectivity of Global Modes

In this section, we explore the mode connectivity of global modes obtained under different levels of data heterogeneity for the same training task. We consider two paths, linear interpolation and PolyChain, to investigate the mode connectivity among global modes on CIFAR-10 under $\alpha=0.1$, $0.2$, and $0.5$, as shown in Figure 5. Note that the initialization of the global modes remains the same across these three cases.

Visualizing mode connectivity on global modes. In Figure 5, the linear-interpolation landscape reveals that global modes obtained under varying data heterogeneity are situated in distinct basins with an error barrier separating them. The landscape also shows that, when sharing the same initialization, the global-mode basins lie within a common region surrounded by high errors. With more heterogeneous data, e.g., the pair $\alpha=0.1,0.2$, both the training and test error barriers become higher compared with the pair $\alpha=0.1,0.5$. Furthermore, we use a PolyChain found by (1) to connect the global modes of $\alpha=0.1$ and $0.2$. Along the PolyChain, the solutions exhibit no barrier in either the training or test landscape. These results indicate that global modes can be connected without any barrier by suitable paths and are situated within a manifold, as illustrated in Figure 1.

Figure 6: Global-mode barrier with linear interpolation (left) and PolyChain (right).

Barrier among global modes. To further quantify the mode connectivity of two global modes $\bm{\theta}$ and $\bm{\theta}^{\prime}$, we define a barrier metric as:

\[B({\bm{\theta}},{\bm{\theta}^{\prime}})=\max_{a\in[0,1]}\frac{\left|\mathcal{L}\left(\pi_{\bm{\theta}}(a)\right)-\min(\mathcal{L}({\bm{\theta}}),\mathcal{L}({\bm{\theta}}^{\prime}))\right|}{|\mathcal{L}({\bm{\theta}})-\mathcal{L}({\bm{\theta}}^{\prime})|}-1, \qquad (4)\]

where $\pi_{\bm{\theta}}(a)$ equals $(1-a)\bm{\theta}+a\bm{\theta}^{\prime}$ for linear interpolation and follows (2) for the PolyChain. Here, $B(\bm{\theta},\bm{\theta}^{\prime})$ measures the largest performance gap along $\pi_{\bm{\theta}}(a)$ compared with its endpoints $\bm{\theta}$ and $\bm{\theta}^{\prime}$. We say that $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are well connected if the barrier $B(\bm{\theta},\bm{\theta}^{\prime})$ is close to 0.
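Given the losses recorded along a path (e.g., by the path-traversal sketch in Section III), the barrier (4) can be computed as follows; the sampling grid over $a$ is an assumption, and the formula assumes the two endpoint losses differ.

import numpy as np

def barrier(losses_on_path, loss_a, loss_b):
    # Eq. (4): largest loss rise along pi_theta(a), a in [0, 1], relative to the endpoints.
    losses_on_path = np.asarray(losses_on_path)
    gap = np.abs(losses_on_path - min(loss_a, loss_b)).max()
    return gap / abs(loss_a - loss_b) - 1.0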

We then use $B(\bm{\theta},\bm{\theta}^{\prime})$ to explore mode connectivity over the whole training process. In Figure 6, the left sub-figure shows that more severe data heterogeneity causes greater barriers, as seen by comparing the pairs $\alpha=0.1,0.2$ and $\alpha=0.1,0.5$; these barriers persist throughout training. Meanwhile, as the rounds progress, the barrier shows an increasing trend after an initial oscillation, which drives the various global modes into their own basins, as found in Figure 5. We also examine the difficulty of finding the global-mode PolyChain under varying data heterogeneity. Here, the difficulty is quantified by the relationship between the barrier $B(\bm{\theta},\bm{\theta}^{\prime})$ and the number of data samples used to optimize (1), depicted in the right sub-figure of Figure 6. The sub-figure illustrates that more samples are needed to find the specific PolyChain as data heterogeneity increases, making it more difficult to connect global modes without an error barrier.

Function dissimilarity and distance of global modes. According to [27], permutation invariance may contribute to a linear-interpolation barrier between NN solutions in centralized training. In FL, data heterogeneity may disturb the permutation of FL solutions, leading to the barriers found in Figure 5. To verify whether the FL solutions obtained under different data heterogeneity belong to the same global mode, we measure their function dissimilarity on the left-hand side of Figure 7. The figure shows that the function dissimilarity remains above 0.2 among these FL solutions even when data heterogeneity is weakened, which means the FL solutions belong to different global modes. Furthermore, Figure 7 shows that the distance between these FL solutions is smaller than the distance between them and their initial points, which means that different global modes with the same initialization are close to each other. Combining these results, we infer that these FL solutions, though belonging to different global modes, are adjacent and lie in a common region, but they cannot be linearly connected.

Figure 7: Global modes in function space (left) and weight space (right). Function dissimilarity is computed as the fraction of labels on which the predictions from different global modes disagree when given the same inputs. The L2 distance is normalized by the norm of the model initialization.
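Both quantities in Figure 7 can be computed with a sketch like the following (PyTorch-style; the loader and the classifier interface are assumptions): the disagreement fraction in function space and the initialization-normalized L2 distance in weight space.

import torch

@torch.no_grad()
def disagreement(model_a, model_b, loader):
    # Fraction of inputs on which two global modes predict different labels.
    diff, total = 0, 0
    for x, _ in loader:
        diff += (model_a(x).argmax(dim=1) != model_b(x).argmax(dim=1)).sum().item()
        total += x.size(0)
    return diff / total

def normalized_l2(sd_a, sd_b, sd_init):
    # ||theta_a - theta_b||_2 normalized by the norm of the shared initialization.
    flat = lambda sd: torch.cat([v.flatten().float() for v in sd.values()])
    return ((flat(sd_a) - flat(sd_b)).norm() / flat(sd_init).norm()).item()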

VI Mode Connectivity Analysis under Data Heterogeneity

As indicated by our experimental results, a sufficient condition for finding solutions that satisfy both the client and FL objectives (3) is that the distributions represented by the client modes $\bm{\theta}_{k}$ coincide with that of the global mode $\bm{\theta}$. However, satisfying this condition is generally challenging since data heterogeneity induces client-gradient drifts [3] and, in turn, an implicit output bias among client modes [15]. Building upon these drifts and biases, we consider $K$ client modes $\{\bm{\theta}_{k}\}_{k=1}^{K}$ and formalize data heterogeneity as:

Definition 2.

(Data heterogeneity). Given a global mode $\bm{\theta}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{k}$ and data samples $({\bm{x}},y)\sim\mathbb{P}$, for all $k\in[K]$, the global-mode output $\bar{y}=f_{\bm{\theta}}(\bm{x})$ and its gradient $\nabla_{\bm{\theta}}(\bm{x})$ differ from the client-mode output $\hat{y}_{k}=f_{\bm{\theta}_{k}}(\bm{x})$ and its gradient $\nabla_{\bm{\theta}_{k}}(\bm{x})$, respectively, if the client distribution $\mathbb{P}_{k}$ differs from the global distribution $\mathbb{P}$.

In this section, we will utilize dropout stability to analyze mode connectivity of global modes. First, let us define it as:

Definition 3.

(Dropout stability). Given a model $\bm{\theta}$ with $N$ neurons and its dropout network $\bm{\theta}_{\mathrm{S}}$ that keeps a subset of neurons $\mathcal{A}\subseteq[N]$, the model $\bm{\theta}$ is $\varepsilon_{\mathrm{D}}$-dropout stable if $\left|\mathcal{L}_{N}(\bm{\theta})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}\right)\right|\leq\varepsilon_{\mathrm{D}}$.

Then, we adopt the mean-field theory developed in [7] to analyze dropout stability under data heterogeneity. Mean-field theory shows that the trajectory of SGD can be approximated by a partial differential equation (PDE) called the distributional dynamics (DD). Namely, under the assumptions of sufficiently many neurons (i.e., $N$ is large enough) and one-pass data processing (i.e., for all $i\in[n]$, samples $({\bm{x}}_{i},y_{i})\in\mathcal{D}\sim\mathbb{P}$ are independent and identically distributed), the SGD trajectory of the $N$ neurons is close to the evolution of $N$ i.i.d. particles following the DD. Here, we consider a two-layer NN with $N$ hidden neurons, denoted as $\bm{\theta}=(\bm{\theta}_{1},\dots,\bm{\theta}_{N})$, where $\bm{\theta}_{i}\in\mathbb{R}^{D}$ for all $i\in[N]$. The forward function of $\bm{\theta}$ is represented as:

\[f_{\bm{\theta}}({\bm{x}})=\frac{1}{N}\sum_{i=1}^{N}\sigma_{*}({\bm{x}};{\bm{\theta}_{i}}), \qquad (5)\]

where $\sigma_{*}:\mathbb{R}^{d}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ is the activation function. For simplicity, we focus on the ReLU activation and the mean-square loss, i.e., $\sigma_{*}({\bm{x}};{\bm{\theta}_{i}})=\sigma(\langle{\bm{x}},{\bm{\theta}_{i}}\rangle)$ and $l(\bm{\theta};({\bm{x}},y))=(y-f_{\bm{\theta}}({\bm{x}}))^{2}$; the analysis can be extended to other activation functions and losses as in [7, 16].
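To make (5), the mean-square loss, and Definition 3 concrete, a NumPy sketch of the two-layer ReLU network and its dropout error is given below; the data arrays and the choice of the kept-neuron subset are assumptions.

import numpy as np

def two_layer_output(X, theta):
    # Eq. (5) with ReLU: f_theta(x) = (1/N) * sum_i relu(<x, theta_i>); theta has shape (N, d).
    return np.maximum(X @ theta.T, 0.0).mean(axis=1)

def mse_loss(X, y, theta):
    # Mean-square loss l(theta; (x, y)) = (y - f_theta(x))^2 averaged over the data.
    return float(np.mean((y - two_layer_output(X, theta)) ** 2))

def dropout_error(X, y, theta, keep_idx):
    # Definition 3: |L_N(theta) - L_{|A|}(theta_S)|, where theta_S keeps only the neurons in
    # keep_idx and the 1/|A| normalization of Eq. (5) is applied by the sub-network itself.
    return abs(mse_loss(X, y, theta) - mse_loss(X, y, theta[keep_idx]))

# Usage sketch: drop a random half of the neurons.
# keep_idx = np.random.default_rng(0).choice(theta.shape[0], theta.shape[0] // 2, replace=False)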

Meanwhile, we consider FedAvg with each client performing $p$ local iterations in the $m$-th round, and write the global-mode update on the $i$-th neuron $\bm{\theta}_{i}$ as $\Delta\bm{\theta}_{i}^{(m)}=\bm{\theta}_{i}^{m+1,0}-\bm{\theta}_{i}^{m,0}$ to obtain:

\[\Delta\bm{\theta}_{i}^{(m)}=2\sum_{\tau=0}^{p}{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2\sum_{\tau=0}^{p}{s}^{m,\tau}\bm{n}_{i}^{m,\tau}, \qquad (6)\]

where $\tau\in[p]$ denotes the $\tau$-th local iteration, ${s}^{m,\tau}$ denotes the step size at the $\tau$-th iteration, $\bar{\bm{\theta}}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,\tau}$ denotes the weighted average of the $i$-th neuron across clients, $\bar{y}^{m,\tau}=\frac{1}{N}\sum_{i=1}^{N}\sigma({\bm{x}^{m,\tau}};{\bar{\bm{\theta}}_{i}^{m,\tau}})$ denotes the output of the weighted-average model $\bar{\bm{\theta}}^{m,\tau}$ on the one-pass sample $({\bm{x}^{m,\tau}},{y^{m,\tau}})\sim\mathbb{P}$, and $\bm{n}_{i}^{m,\tau}$ denotes the noise induced by data heterogeneity. Specifically, we follow Definition 2 and represent $\bm{n}_{i}^{m,\tau}$ (the superscript is omitted for brevity) as:

\[\bm{n}_{i}=\sum_{k=1}^{K}\frac{n_{k}}{n}\Big[(y_{k}-\hat{y}_{k})\underbrace{[\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k};\bm{\theta}_{i,k})-\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}_{k};\bar{\bm{\theta}}_{i})]}_{\text{gradient drift}}+\underbrace{(\bar{y}_{k}-\hat{y}_{k})}_{\text{output bias}}\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}_{k};\bar{\bm{\theta}}_{i})\Big], \qquad (7)\]

where the gradient drift and output bias depend on data heterogeneity, i.e., $(\bm{x}_{k},y_{k})\sim\mathbb{P}_{k}$ while the global samples follow $\mathbb{P}$, and $\mathbb{P}_{k}\neq\mathbb{P}$.
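A direct NumPy sketch of the noise term (7) for the two-layer ReLU network is shown below (the noise norms reported later in Figure 10 could be measured this way); the argument layout and variable names are assumptions.

import numpy as np

def relu_grad(x, theta_i):
    # Gradient of relu(<x, theta_i>) with respect to theta_i: x * 1[<x, theta_i> > 0].
    return x * float(x @ theta_i > 0)

def neuron_noise(i, client_thetas, weights, client_samples):
    # Eq. (7): heterogeneity-induced noise on the i-th neuron of the averaged model.
    # client_thetas: list of (N, d) arrays; weights: n_k / n; client_samples: list of (x_k, y_k).
    theta_bar = sum(w * th for w, th in zip(weights, client_thetas))
    output = lambda th, x: np.maximum(x @ th.T, 0.0).mean()      # two-layer output, Eq. (5)
    n_i = np.zeros_like(theta_bar[i])
    for w, th_k, (x_k, y_k) in zip(weights, client_thetas, client_samples):
        y_hat_k = output(th_k, x_k)          # client-mode output
        y_bar_k = output(theta_bar, x_k)     # averaged-model output on the client sample
        grad_drift = relu_grad(x_k, th_k[i]) - relu_grad(x_k, theta_bar[i])
        n_i += w * ((y_k - y_hat_k) * grad_drift
                    + (y_bar_k - y_hat_k) * relu_grad(x_k, theta_bar[i]))
    return n_i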

Furthermore, the trajectory of $\bar{\bm{\theta}}_{i}$ in (6) subsumes the trajectory of the global-model neuron $\bm{\theta}_{i}$, since FedAvg performs weighted averaging to form the global model at each round. Therefore, we represent the global-mode trajectory as:

\[\bm{\theta}_{i}^{\tau+1}=\bm{\theta}_{i}^{\tau}+2s^{\tau}\left(y^{\tau}-\bar{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}^{\tau};\bm{\theta}_{i}^{\tau}\right)+2{s}^{\tau}{\bm{n}}_{i}^{\tau}, \qquad (8)\]

where $\tau\in[Mp]$ for a total of $M$ rounds, and $\bm{n}_{i}^{\tau}=0$ when $\tau\bmod p\in[M]$. See the appendix for the detailed proof.
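For reference, one step of the noisy global-mode update (8) for the two-layer ReLU network can be simulated as below; the noise argument stacks $\bm{n}_{i}^{\tau}$ for all neurons (e.g., computed via (7)) and its layout is an assumption.

import numpy as np

def sgd_step(theta, x, y, step, noise):
    # One step of Eq. (8) for the two-layer ReLU network of Eq. (5).
    # theta: (N, d) neurons; x: (d,) sample; y: scalar label; noise: (N, d) stack of n_i^tau.
    pre_activation = theta @ x                               # <x, theta_i> for every neuron
    y_bar = np.maximum(pre_activation, 0.0).mean()           # network output, Eq. (5)
    grad = (pre_activation > 0).astype(float)[:, None] * x   # grad of relu(<x, theta_i>) w.r.t. theta_i
    return theta + 2 * step * (y - y_bar) * grad + 2 * step * noise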

We make the following assumptions as per [7]:

Assumption 1.

The step size is $s^{\tau}=\alpha\xi(\tau\varepsilon)$, where $\xi:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{>0}$ is bounded by $C_{1}$ and $C_{1}$-Lipschitz.

Assumption 2.

For $(\bm{x},y)\sim\mathbb{P}$, the label $y$ and the activation function $\sigma(\bm{x},\bm{\theta})$, whose gradient is $C_{2}$-sub-Gaussian, are bounded by $K_{2}$.

Assumption 3.

The functions $v(\bm{\theta})=-\mathbb{E}\{y\sigma(\bm{x},\bm{\theta})\}$ and $u(\bm{\theta}_{1},\bm{\theta}_{2})=\mathbb{E}\{\sigma(\bm{x},\bm{\theta}_{1})\sigma(\bm{x},\bm{\theta}_{2})\}$ are differentiable, and their gradients are bounded by $C_{3}$ and $C_{3}$-Lipschitz.

Assumption 4.

The initial condition $\bm{\theta}_{i}^{0}$ is $K^{2}_{4}/D$-sub-Gaussian. Let $v(\bm{\theta})\in C^{4}(\mathbb{R}^{D})$ and $u(\bm{\theta}_{1},\bm{\theta}_{2})\in C^{4}(\mathbb{R}^{D}\times\mathbb{R}^{D})$, and let $\nabla^{k}_{\bm{\theta}_{1}}u(\bm{\theta}_{1},\bm{\theta}_{2})$ be uniformly bounded for $0\leq k\leq 4$.

Assumption 5.

For $m\in[M]$ and $\tau\in[p]$, the noise $\bm{n}_{i}^{\tau}=\sqrt{\beta h(\alpha)}\,\bar{\bm{n}}_{i}^{\tau}$, where $\bar{\bm{n}}_{i}^{\tau}\sim\mathcal{N}\left(0,\bm{I}_{D}\right)$.

Assumptions 1-4 state the requirements of mean-field theory on the learning rate $s^{\tau}$, the data distribution $(\bm{x},y)\sim\mathbb{P}$, the activation function $\sigma$, and the initialization $\rho_{0}$. Assumption 5 indicates that the noise $\bm{n}_{i}$ in (7) depends on data heterogeneity. Specifically, a generalized function $h(\alpha)$ measures the impact of data heterogeneity on $\bm{n}_{i}$, including the gradient drift and output bias, where $\beta h(\alpha)\in[\beta,\infty)$ and $1/\alpha$ denotes the degree of data heterogeneity across clients (i.e., the smaller $\alpha$ is, the larger $h(\alpha)$ is).

We apply mean-field theory to quantify the difference between the loss along the global-mode trajectory and the loss of the DD solution, and obtain:

Lemma 1.

(Mean-field approximation.) Assume that Assumptions 1-5 hold. The solution $(\rho_{t})_{t\geq 0}$ of a PDE with initialization $\rho_{0}$ can approximate the update trajectory of the global mode $(\bm{\theta}_{i}^{\tau})$ in (8) with initialization $\bm{\theta}_{i}^{0}\sim\rho_{0}$ and data heterogeneity $h(\alpha)$ unchanged throughout training. When $\beta=2s^{\tau}\varepsilon/D$, $h(\alpha)\leq C_{5}$, $T\geq 1$ and $\varepsilon\leq 1$, there exists a constant $C$ (depending solely on the constants $C_{i}$ of Assumptions 1-4) such that

\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}(\rho_{\tau\varepsilon})\right|\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

with probability at least $1-e^{-z^{2}}$.

Lemma 1 shows that when the number of neurons $N$ is sufficiently large and the step-size scale $\varepsilon$ is sufficiently small, the global-mode neurons $\bm{\theta}_{i}^{\tau}$ obtained by running $\tau$ steps of (8) can be approximated by $N$ i.i.d. particles that evolve according to the DD at time $\tau\varepsilon$. We then utilize Lemma 1 to show that the global mode remains dropout-stable even under data heterogeneity.

Theorem 1.

(Dropout stability under data heterogeneity.) Assume that Assumptions 1-5 hold, and fix $T\geq 1$ and $h(\alpha)\leq C_{5}$. Let the global model $\bm{\theta}^{\tau}$ be obtained by running $\tau$ steps of (8) with data $\left\{\left(\bm{x}_{i},y_{i}\right)\right\}_{i=0}^{\tau}\sim\mathbb{P}$ and initialization $\rho_{0}$. Pick $\mathcal{A}\subseteq[N]$ independent of $\bm{\theta}^{\tau}$. Then, with probability at least $1-e^{-z^{2}}$, for all $\tau\in[T/\varepsilon]$, $\bm{\theta}^{\tau}$ is $\varepsilon_{\mathrm{D}}$-dropout stable with $\varepsilon_{\mathrm{D}}$ equal to

\[Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log|\mathcal{A}|}+z}{\sqrt{|\mathcal{A}|}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

where the constant $C$ depends only on the constants $C_{i}$ in Assumptions 1-4.

Following [20], we demonstrate that dropout-stable NNs in FL have mode connectivity as follows.

Theorem 2.

(Mode connectivity under data heterogeneity.) Under the setting of Theorem 1, fix $T\geq 1$, $T^{\prime}\geq 1$ and let $\bm{\theta}$ and $\bm{\theta}^{\prime}$ be the global models obtained by running $\tau$ and $\tau^{\prime}$ steps of (8) with data heterogeneity $h(\alpha)$ and $h(\alpha^{\prime})$ and initialization $\rho_{0}$ and $\rho_{0}^{\prime}$, respectively. Then, with probability at least $1-e^{-z^{2}}$, for all $\tau\in[T/\varepsilon]$ and $\tau^{\prime}\in\left[T^{\prime}/\varepsilon\right]$, $\bm{\theta}^{\tau}$ and ${\bm{\theta}^{\prime}}^{\tau^{\prime}}$ are $\varepsilon_{\mathrm{C}}$-connected with $\varepsilon_{\mathrm{C}}$ equal to

\[Ce^{CT_{\max}}\max(\sqrt{h(\alpha)},\sqrt{h(\alpha^{\prime})})\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

where $T_{\max}=\max\left(T,T^{\prime}\right)$. Furthermore, the path connecting $\bm{\theta}^{\tau}$ with ${\bm{\theta}^{\prime}}^{\tau^{\prime}}$ consists of 7 line segments.

Theorem 2 shows that global modes obtained from FL under different data heterogeneity can be connected, as shown in Figure 5, and that the mode-connectivity error depends on the data heterogeneity $h(\alpha)$ and the number of neurons $N$. See the appendix for the detailed proofs of Theorems 1 and 2.

VII Numerical Results

To verify our analytical results, we undertake two classification tasks: i) a two-layer NN on MNIST; ii) a VGG11 network trained from scratch on CIFAR-10. The NNs use ReLU activations for both tasks and are optimized with the cross-entropy loss. For the FL setup, we consider ten clients with 50/200 training rounds for MNIST/CIFAR-10. Client optimizers are SGD with a learning rate of 0.02/0.01 for MNIST/CIFAR-10, where the mini-batch size is 50 and the number of local iterations is 10. Within the same task, the NN initialization is consistent under varying data heterogeneity and independent of the neuron number $N$, which aligns with our theoretical assumptions. We perform 20 independent trials to report the mean and standard deviation in Figures 8 and 9.

Dropout stability under varying data heterogeneity.

Figure 8: Training/test dropout stability under varying data heterogeneity on MNIST with a two-layer NN (left) and CIFAR-10 with VGG11 (right).

Figure 8 compares the dropout error $\varepsilon_{\mathrm{D}}$ under varying data heterogeneity, as per Definition 3, on the training dataset (blue curve) and test dataset (orange curve). For MNIST, the neuron number of the two-layer NN is $N=1600$; for CIFAR-10, VGG11 follows the standard setup in [24]. As expected from Theorem 1, the dropout error shows a downward trend as client datasets become more homogeneous. For example, the case of $\alpha=0.1$ has a larger training/test dropout error than the i.i.d. case at the different training stages of both tasks. Under the same heterogeneity setting (e.g., $\alpha=0.1$), the effect of data heterogeneity on CIFAR-10 is stronger than that on MNIST, leading to a larger dropout error on CIFAR-10: $\varepsilon_{\mathrm{D}}<0.04$ for MNIST and $\varepsilon_{\mathrm{D}}<0.15$ for CIFAR-10. Moreover, we consider two training stages in Figure 8, the middle stage (dashed curve) and the ending stage (solid curve). For both tasks, the dropout-error gap between the two stages remains small, i.e., $\varepsilon_{\mathrm{D}}<0.01$ for MNIST and $\varepsilon_{\mathrm{D}}<0.05$ for CIFAR-10. As expected, this indicates that mean-field theory can describe the dropout dynamics as long as the training time is sufficient.

Mode connectivity under varying neuron numbers and data heterogeneity.

Figure 9: Dropout stability under varying neuron numbers (left) and linear-interpolation barrier under varying neuron numbers and data heterogeneity (right).

Figure 9 uses dropout stability (left sub-figure) and the linear-interpolation barrier (right sub-figure) to validate our analysis of the mode connectivity of global modes in FL. Theorem 2 shows that dropout stability implies mode connectivity, and that the connectivity error $\varepsilon_{\mathrm{C}}$ depends on the neuron number and the data heterogeneity. For MNIST under $\alpha=0.1$, we consider a two-layer NN and plot the relationship between its dropout error $\varepsilon_{\mathrm{D}}$ and the neuron number $N$. As expected, the training/test dropout error rapidly decreases as the width of the network grows: the dropout error for $N=6400$ is less than $0.01$, while for $N=200$ it is up to $0.06$. We then use the linear-interpolation barrier to explore the connectivity error $\varepsilon_{\mathrm{C}}$, connecting the global mode obtained in the case $\alpha=0.1$ to those of the cases $\alpha=0.2$ (solid curve) and $\alpha=0.5$ (dashed curve). As the network widens, the barrier (i.e., the connectivity error $\varepsilon_{\mathrm{C}}$) diminishes rapidly, which is consistent with the relationship between the dropout error $\varepsilon_{\mathrm{D}}$ and the neuron number $N$. Furthermore, the barrier from $\alpha=0.1$ to $\alpha=0.2$ is higher than that from $\alpha=0.1$ to $\alpha=0.5$ when $N=200$. This is expected because when $N$ is small, the connectivity error $\varepsilon_{\mathrm{C}}$ is amplified by the effect of data heterogeneity $h(\alpha)$, i.e., the effect of data heterogeneity amortized over fewer neurons is greater.

Neuron noise $\bm{n}_{i}$ under varying data heterogeneity.

Figure 10: Neuron noise statistics under data heterogeneity (left) and neuron noise along the whole training (right).

Figure 10 illustrates the L2 norm of the global-mode update noise in (7) under varying data heterogeneity across all iterations (right sub-figure), with the mean, maximal, and minimal noise norms reported in the left sub-figure. As shown in the left sub-figure, the noise norm decreases as data heterogeneity is alleviated. Meanwhile, the maximal noise norm is limited, i.e., below 0.012, but the noise norm does not decrease to zero even in the i.i.d. case due to the randomness of SGD. Moreover, according to the right sub-figure, the noise is not strongly correlated with the iterations when data heterogeneity is mild (i.e., $\alpha\geq 0.4$), and it slightly decreases along the training iterations. This supports Assumption 5, where the neuron noise arises from data heterogeneity and remains independent of the number of iterations.

VIII Discussion and Future Work

Through experimental and theoretical analysis, we unravel the relationship between client and global modes in FL under varying data heterogeneity. As data become more heterogeneous, finding solutions that optimize both the client and FL objectives becomes increasingly challenging due to the reduction of overlapping solutions between client and global modes. Meanwhile, the FL solutions obtained from different setups of data heterogeneity belong to distinct global modes, and are located in distinct basins of a joint region. These solutions can be connected by a specific PolyChain while showing an error barrier along the linear interpolation. This suggests that global modes in FL are not isolated but rather interconnected within a manifold. Our analysis demonstrates that the connectivity error between global modes decreases with weakened data heterogeneity and wider trained models.

The findings suggest potential directions for future work for the design of FL: i) exploring the solutions to the FL objective that maintain low error for both client and global test sets, i.e., the solutions to both client and FL objectives; ii) investigating the effect of model depth and architecture in FL based on mean-field theory; iii) understanding FL training dynamics by the gap between random and pre-trained initialization (see a preliminary investigation in Appendix).

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 54.   PMLR, 20–22 Apr 2017, pp. 1273–1282. [Online]. Available: https://proceedings.mlr.press/v54/mcmahan17a.html
  • [2] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [3] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [4] J. Shao, Z. Li, W. Sun, T. Zhou, Y. Sun, L. Liu, Z. Lin, and J. Zhang, “A survey of what to share in federated learning: Perspectives on model utility, privacy leakage, and communication efficiency,” arXiv preprint arXiv:2307.10655, 2023.
  • [5] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf
  • [6] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, “Essentially no barriers in neural network energy landscape,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80.   PMLR, 10–15 Jul 2018, pp. 1309–1318. [Online]. Available: https://proceedings.mlr.press/v80/draxler18a.html
  • [7] S. Mei, A. Montanari, and P.-M. Nguyen, “A mean field view of the landscape of two-layer neural networks,” Proceedings of the National Academy of Sciences, vol. 115, no. 33, pp. E7665–E7671, 2018. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.1806579115
  • [8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.
  • [9] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in Neural Information Processing Systems, vol. 33, pp. 7611–7623, 2020.
  • [10] M. Luo, F. Chen, D. Hu, Y. Zhang, J. Liang, and J. Feng, “No fear of heterogeneity: Classifier calibration for federated learning with non-iid data,” Advances in Neural Information Processing Systems, vol. 34, pp. 5972–5984, 2021.
  • [11] T. Zhou, J. Zhang, and D. Tsang, “Fedfa: Federated learning with feature anchors to align feature and classifier for heterogeneous data,” arXiv preprint arXiv:2211.09299, 2022.
  • [12] C. D. Freeman and J. Bruna, “Topology and geometry of half-rectified network optimization,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=Bk0FWVcgx
  • [13] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [14] Z. Li, H.-Y. Chen, H. W. Shen, and W.-L. Chao, “Understanding federated learning through loss landscape visualizations: A pilot study,” in Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • [15] T. Zhou, Z. Lin, J. Zhang, and D. H. Tsang, “Understanding model averaging in federated learning on heterogeneous data,” arXiv preprint arXiv:2305.07845, 2023.
  • [16] S. Mei, T. Misiakiewicz, and A. Montanari, “Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit,” in Proceedings of the Thirty-Second Conference on Learning Theory, ser. Proceedings of Machine Learning Research, vol. 99.   PMLR, 25–28 Jun 2019, pp. 2388–2464. [Online]. Available: https://proceedings.mlr.press/v99/mei19a.html
  • [17] G. M. Rotskoff and E. Vanden-Eijnden, “Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error,” Advances in neural information processing systems, p. 7146–7155, 2018.
  • [18] P.-M. Nguyen and H. T. Pham, “A rigorous framework for the mean field limit of multilayer neural networks,” arXiv preprint arXiv:2001.11443, 2020.
  • [19] A. Shevchenko and M. Mondelli, “Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119.   PMLR, 13–18 Jul 2020, pp. 8773–8784. [Online]. Available: https://proceedings.mlr.press/v119/shevchenko20a.html
  • [20] R. Kuditipudi, X. Wang, H. Lee, Y. Zhang, Z. Li, W. Hu, R. Ge, and S. Arora, “Explaining landscape connectivity of low-cost solutions for multilayer nets,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/46a4378f835dc8040c8057beb6a2da52-Paper.pdf
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [22] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [23] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier domain generalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5542–5550.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [25] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni, “Bayesian nonparametric federated learning of neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97.   Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 7252–7261. [Online]. Available: http://proceedings.mlr.press/v97/yurochkin19a.html
  • [26] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [27] R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur, “The role of permutation invariance in linear mode connectivity of neural networks,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=dNigytemkL

Appendix A Additional Experiments

A-A Additional Experiments in client-mode connectivity

Figure 11: Test loss of the interpolated model along two given paths. The test set is the global test set of FL, where VGG11 is trained in FL from scratch (left) and from a VGG11 pre-trained on ImageNet (right).

Testing the models on the global test set along the paths connecting client models. Figure 11 considers the same paths as Figure 3 (i.e., the linear path and the PolyChain path) to connect two client models, and reports the loss of the models along both paths on the global test set. The left and right sub-figures display the results of the global model trained from scratch and from a pre-trained model, respectively. When data are more heterogeneous, the loss gap between the two paths becomes greater; for example, the gap for $\alpha=0.1$ is much larger than that for $\alpha=0.5$. Both sub-figures indicate that models along the PolyChain path perform well on the global test set when $a$ falls within the range $[0.4,0.6]$. The right-hand sub-figure of Figure 3 also shows that the models in the range $a\in[0.4,0.6]$ perform well on the client test sets. Comparing the left and right sub-figures, we observe a significant reduction in the loss gap between the two paths when the pre-trained model is used. Therefore, there appear to be solutions to the FL objective (3) that maintain a low loss on both the client test sets and the global test set, particularly when using a pre-trained model as the initialization. It would be beneficial for future work to further explore how to find these solutions.

Figure 12: Client models’ optimization trajectory (shallow CNN with two convolutional layers) along the training rounds (left) and along the local iterations within one round (right).

Optimization trajectory of the shallow CNN. In Figure 12, we also visualize the optimization trajectory of client models to compare with the VGG11 trajectory illustrated in Figure 4. Similar to the observation in Figure 4, the solutions found by the client objectives vary under data heterogeneity even though they are close to each other in the parameter space. Additionally, during training with the shallow CNN, the distance between client solutions is still regulated but is greater than that of VGG11. This implies that data heterogeneity affects models of different sizes differently, which is in line with the results of Theorem 2 and Figure 14.

A-B Additional Experiments in global-mode connectivity

Connectivity landscape along the linear-interpolation path and the PolyChain path.

Figure 13: Linear-interpolation path (left) and PolyChain path (right) of global models.

We take the classification accuracy as the metric and plot the connectivity landscape along the linear-interpolation path and the PolyChain path in Figure 13. As a complement to Figure 5, Figure 13 displays the full accuracy fluctuations along both paths. As shown in the left sub-figure, when dealing with more heterogeneous data, such as the pair $\alpha=0.1,0.2$, both training and test barriers become higher compared with the pair $\alpha=0.1,0.5$. Meanwhile, since all global models are not far away from each other and are located in a common region, the maximal accuracy drop induced by the barriers is around 10%. The right sub-figure shows that the PolyChain with one bend easily connects the global models from $\alpha=0.1$ to $0.2$ and from $\alpha=0.1$ to $0.5$, and all models along the PolyChain exhibit no barrier.

Neuron noise $\bm{n}_{i}$ insight into global-mode connectivity.

Figure 14: Neuron noise under varying neuron numbers (left) and its visualization along all the local iterations (right).

We resort to the noise norm in (7) to better understand the relationship between the number of neurons $N$ and the dropout error $\varepsilon_{\mathrm{D}}$ and connectivity error $\varepsilon_{\mathrm{C}}$. Figure 14 depicts the L2 norm of the global-mode update noise under $\alpha=0.1$, displaying the noise norms for different neuron numbers and their visualization across all iterations, along with the mean, maximal, and minimal noise norms. The left sub-figure demonstrates that as the number of neurons increases, the noise norm decreases; when $N=6400$, the gap between the maximal and minimal noise norms becomes much smaller than that of $N=200$. As shown in the right sub-figure, the noise levels are clearly distinguishable across different numbers of neurons. These results reveal that wider models exhibit smaller noise in their FL updates, and hence have a lower dropout error $\varepsilon_{\mathrm{D}}$ and connectivity error $\varepsilon_{\mathrm{C}}$ in Figure 9.

Appendix B Proof

The approximation in Lemma 1 builds the connection from the nonlinear dynamics to the particle dynamics, to gradient descent, and finally to SGD under the data-heterogeneity noise. The approximation relies on the following model update:

\[\bm{\theta}_{i}^{\tau+1}=(1-2\lambda s^{\tau})\bm{\theta}_{i}^{\tau}+2s^{\tau}\left(y^{\tau}-\hat{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}^{\tau};\bm{\theta}_{i}^{\tau}\right)+\beta{\bm{n}}_{i}^{\tau}, \qquad (9)\]

where $\bm{\theta}_{i}$ denotes the $i$-th neuron parameter, $\lambda$ denotes the coefficient of the regularizer in the objective (i.e., $\lambda=0$ without regularizers), $s^{\tau}$ denotes the step size at the $\tau$-th iteration, $\beta=\sqrt{2{s}^{\tau}\varepsilon/D}$, and $\beta{\bm{n}}_{i}^{\tau}$ denotes the random noise at the $\tau$-th iteration. By Propositions 33, 35, 37 and 38 in [16], when Assumptions 1 to 5 hold, we have the following proposition:

Proposition 1.

There exists a constant $C$ (depending solely on the constants $C_{i}$ of Assumptions 1-5) such that, with probability at least $1-e^{-z^{2}}$, we have:

\[\sup_{t\in[0,T]}\left|\mathcal{L}_{N}\left(\overline{\bm{\theta}}^{t}\right)-\mathcal{L}\left(\rho_{t}\right)\right|\leq Ce^{CT}\frac{\sqrt{h(\alpha)}}{\sqrt{N}}\left[\sqrt{\log(NT)}+z\right],\]
\[\sup_{t\in[0,T]}\left|\mathcal{L}_{N}\left(\underline{\bm{\theta}}^{t}\right)-\mathcal{L}_{N}\left(\overline{\bm{\theta}}^{t}\right)\right|\leq Ce^{CT}\frac{1}{\sqrt{N}}\left[\sqrt{\log(NT)}+z\right],\]
\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}\left(\underline{\bm{\theta}}^{\tau\varepsilon}\right)-\mathcal{L}_{N}\left(\tilde{\bm{\theta}}^{\tau}\right)\right|\leq Ce^{CT}\left[\sqrt{\log(N(T/\varepsilon\vee 1))}+z\right]\sqrt{\varepsilon h(\alpha)},\]
\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}\left(\tilde{\bm{\theta}}^{\tau}\right)-\mathcal{L}_{N}\left(\bm{\theta}^{\tau}\right)\right|\leq Ce^{CT}\sqrt{T\varepsilon}\left[\sqrt{D+\log N}+z\right],\]

where $\overline{\bm{\theta}}^{t}$, $\underline{\bm{\theta}}^{t}$, $\tilde{\bm{\theta}}^{\tau}$, and ${\bm{\theta}}^{\tau}$ denote the solutions of the nonlinear dynamics, the particle dynamics, gradient descent, and SGD, respectively.

In Proposition 1, the main distinction from its counterpart in [16] is that it accounts for the impact of data heterogeneity on the model update, i.e., $\sqrt{h(\alpha)}$ is not absorbed into the constant $C$. The proof of Proposition 1 parallels the proof of its counterpart in [16]; see [16] for the details.

Lemma 2.

(From Lemma 31 in [16]). There exists a constant $C$ such that, with probability at least $1-e^{-z^{2}}$,

\sup_{i\leq N}\;\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\;\sup_{u\in[0,\varepsilon]}\left\|\overline{\bm{\theta}}_{i}^{\tau\varepsilon+u}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq Ce^{CT}\left[\sqrt{\log\left(N(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon},

where $\overline{\bm{\theta}}_{i}^{\tau\varepsilon+u}$ and $\overline{\bm{\theta}}_{i}^{\tau\varepsilon}$ denote the solutions of the PDE in Lemma 1 at time $\tau\varepsilon+u$ and $\tau\varepsilon$, respectively.

B-A Proof of Theorem 1

Proof.

In the following proof, we use $C$ to denote a generic constant that depends solely on the constants $C_{i}$ of Assumptions 1-4. According to Definition 3, we have:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)\right| \qquad (10)
\leq\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta}_{\mathrm{S}}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta}_{\mathrm{S}}^{\tau})-\mathcal{L}_{|\mathcal{A}|}(\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|,

where $\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}$ denotes the dropout model containing the first $|\mathcal{A}|$ non-zero elements (i.e., the non-dropout elements) of $\bar{\bm{\theta}}^{\tau\varepsilon}$. Here, $\bar{\bm{\theta}}^{\tau\varepsilon}$ is the PDE solution of the distributional dynamics at time $\tau\varepsilon$.

For the first term, we invoke Lemma 1 and directly obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right). \qquad (11)

For the second term, we upper-bound it by the maximal output gap between the $i$-th neuron of $\bm{\theta}_{\mathrm{S}}^{\tau}$ and the $j$-th neuron of $\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}$, which can be written as:

\left|\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)-\mathcal{L}_{|\mathcal{A}|}(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})\right| \qquad (12)
\leq 2\max_{i\in\mathcal{A}}\left|\mathbb{E}\left\{y\,\sigma\left(\bm{x},\bm{\theta}_{i}^{\tau}\right)\right\}-\mathbb{E}\left\{y\,\sigma(\bm{x},\overline{\bm{\theta}}_{i}^{\tau\varepsilon})\right\}\right|
+\max_{i,j\in\mathcal{A}}\left|\mathbb{E}\left\{\sigma\left(\bm{x},\bm{\theta}_{i}^{\tau}\right)\sigma\left(\bm{x},\bm{\theta}_{j}^{\tau}\right)\right\}-\mathbb{E}\left\{\sigma(\bm{x},\overline{\bm{\theta}}_{i}^{\tau\varepsilon})\,\sigma(\bm{x},\overline{\bm{\theta}}_{j}^{\tau\varepsilon})\right\}\right|
\leq 4C_{2}\max_{i\in[|\mathcal{A}|]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq 4C_{2}\max_{i\in[N]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2},

where the second inequality follows from the fact that $y$, $\sigma(\cdot)$, and the gradient of $\sigma(\cdot)$ are bounded.

Furthermore, summing all the inequalities in Proposition 1, we obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\max_{i\in[N]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right), \qquad (13)

with probability at least $1-e^{-z^{2}}$. Consequently, with probability at least $1-e^{-z^{2}}$, we have:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)-\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right). \qquad (14)

For the third term, we apply the triangle inequality and obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right| \qquad (15)
\leq\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}\right|
+\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|,

where $\mathbb{E}_{\rho_{0}}\{\cdot\}$ takes the expectation over $\bm{\theta}_{i}^{0}\sim\rho_{0}$. For the first term of (15), since we consider the mean-square loss in this work, we have:

\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}\right)=\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}}\left\{\left(y-\frac{1}{|\mathcal{A}|}\sum_{i=1}^{|\mathcal{A}|}\sigma\left(\bm{x},\bm{\theta}_{i}\right)\right)^{2}\right\}, \qquad (16)

where $\mathbb{E}_{(\bm{x},y)}$ takes the expectation over $(\bm{x},y)\sim\mathbb{P}$. According to Lemma 1, we have $\{\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\}_{i=1}^{|\mathcal{A}|}\sim\rho_{\tau\varepsilon}$ under the PDE approximation, and we rewrite the second term of (15) as:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right| \qquad (17)
=\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\frac{1}{|\mathcal{A}|}\left|\int\mathbb{E}_{(\bm{x},y)}\left\{\left(\sigma(\bm{x},\bm{\theta})\right)^{2}\right\}\rho_{\tau\varepsilon}(\mathrm{d}\bm{\theta})-\int\mathbb{E}_{(\bm{x},y)}\left\{\sigma\left(\bm{x},\bm{\theta}_{1}\right)\sigma\left(\bm{x},\bm{\theta}_{2}\right)\right\}\rho_{\tau\varepsilon}\left(\mathrm{d}\bm{\theta}_{1}\right)\rho_{\tau\varepsilon}\left(\mathrm{d}\bm{\theta}_{2}\right)\right|
\leq\frac{C}{|\mathcal{A}|},

where the inequality holds because the activation function $\sigma(\cdot)$ is bounded by $C_{2}$ according to Assumption 2. That is, denoting by $\bm{\theta}$ and $\bm{\theta}^{\prime}$ two parameter vectors that differ only in the $i$-th component, i.e., $\bm{\theta}=\left(\bm{\theta}_{1},\ldots,\bm{\theta}_{i},\ldots,\bm{\theta}_{|\mathcal{A}|}\right)$ and $\bm{\theta}^{\prime}=\left(\bm{\theta}_{1},\ldots,\bm{\theta}_{i}^{\prime},\ldots,\bm{\theta}_{|\mathcal{A}|}\right)$, we have $\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}^{\prime}\right)\right|\leq{C}/{|\mathcal{A}|}$.

With McDiarmid’s inequality, we have:

\mathbb{P}\left(\left|\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|>\delta\right)\leq\exp\left(-\frac{|\mathcal{A}|\delta^{2}}{C}\right). \qquad (18)
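To spell out this step (our own reconstruction of the standard argument): the neurons $\{\bar{\bm{\theta}}_{i}^{t}\}_{i\in\mathcal{A}}$ are functions of the independent initializations $\{\bm{\theta}_{i}^{0}\}$, and by the bound above changing a single one changes $\mathcal{L}_{|\mathcal{A}|}$ by at most $C/|\mathcal{A}|$, so the bounded-difference form of McDiarmid's inequality gives

\mathbb{P}\left(\left|\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|>\delta\right)
\leq 2\exp\left(-\frac{2\delta^{2}}{\sum_{i=1}^{|\mathcal{A}|}\left(C/|\mathcal{A}|\right)^{2}}\right)
=2\exp\left(-\frac{2|\mathcal{A}|\delta^{2}}{C^{2}}\right),

which matches (18) after absorbing the leading factor $2$ and the constant $C^{2}$ into the generic constant $C$.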

Furthermore, we have the following increment bound for times $t,h\geq 0$:

\Big|\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)\right\}\right|-\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|\Big| \qquad (19)
\leq\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)-\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right|+\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)\right\}-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|
\leq C\left[\sup_{i\in[N]}\left\|\overline{\bm{\theta}}_{i}^{t+h}-\overline{\bm{\theta}}_{i}^{t}\right\|_{2}+\mathbb{E}\left[\left\|\overline{\bm{\theta}}_{j}^{t+h}-\overline{\bm{\theta}}_{j}^{t}\right\|_{2}\right]\right].

Using Lemma 2, we get:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\sup_{u\in[0,\varepsilon]}\Big|\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon+u}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon+u}\right)\right\}\right|-\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon}\right)\right\}\right|\Big| \qquad (20)
\leq Ce^{CT}\left[\sqrt{\log\left(|\mathcal{A}|(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon h(\alpha)},

with probability at least $1-e^{-z^{2}}$.

Here, we take a union bound over $s\in\varepsilon\{0,1,\ldots,\lfloor T/\varepsilon\rfloor\}$ and bound the variation inside the grid intervals. Then, with McDiarmid's inequality, we get:

\mathbb{P}\left(\sup_{t\in[0,T]}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|\geq\delta+Ce^{CT}\left[\sqrt{\log\left(|\mathcal{A}|(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon h(\alpha)}\right) \qquad (21)
\leq(T/\varepsilon)\exp\left\{-|\mathcal{A}|\delta^{2}/C\right\}+e^{-z^{2}}.

We take $\varepsilon=1/|\mathcal{A}|$ and $\delta=C[\sqrt{\log(|\mathcal{A}|T)}+z]/\sqrt{|\mathcal{A}|}$ in (21), combine the result with (17), and get:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\frac{\left(\sqrt{\log|\mathcal{A}|}+z\right)\sqrt{h(\alpha)}}{\sqrt{|\mathcal{A}|}}, \qquad (22)

with probability at least $1-e^{-z^{2}}$.
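For the reader's convenience, here is how we read this substitution (constants are repeatedly absorbed into $C$, and we assume $C\geq 1$ and $|\mathcal{A}|T\geq 1$): with $\delta=C[\sqrt{\log(|\mathcal{A}|T)}+z]/\sqrt{|\mathcal{A}|}$, we have $|\mathcal{A}|\delta^{2}/C=C(\sqrt{\log(|\mathcal{A}|T)}+z)^{2}\geq C(\log(|\mathcal{A}|T)+z^{2})$, so the failure probability in (21) with $\varepsilon=1/|\mathcal{A}|$ satisfies

(T/\varepsilon)\exp\left\{-|\mathcal{A}|\delta^{2}/C\right\}+e^{-z^{2}}
\leq|\mathcal{A}|T\,(|\mathcal{A}|T)^{-C}e^{-Cz^{2}}+e^{-z^{2}}
\leq 2e^{-z^{2}},

and the leading factor $2$ can be removed by a rescaling of $z$ (or by enlarging $C$). The deviation level $\delta+Ce^{CT}[\sqrt{\log(|\mathcal{A}|(T/\varepsilon\vee 1))}+z]\sqrt{\varepsilon h(\alpha)}$ then collapses, up to logarithmic factors in $T$ and under the convention $h(\alpha)\geq 1$ noted earlier, to the right-hand side of (22).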

We combine (11), (14) and (22) and obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log|\mathcal{A}|}+z}{\sqrt{|\mathcal{A}|}}+\sqrt{\alpha}\left(\sqrt{D+\log N}+z\right)\right), \qquad (23)

with probability at least $1-e^{-z^{2}}$. Therefore, the global model $\bm{\theta}$ obtained from FL under data heterogeneity $h(\alpha)$ is $\varepsilon_{D}$-dropout stable with $\varepsilon_{D}$ equal to $Ce^{CT}\sqrt{h(\alpha)}\left[(\sqrt{\log|\mathcal{A}|}+z)/\sqrt{|\mathcal{A}|}+\sqrt{\alpha}(\sqrt{D+\log N}+z)\right]$.

B-B Proof of Theorem 2

The proof of Theorem 2 is obtained by combining Theorem 1 with the following lemma, which is based on [20, 19].

Lemma 3.

(Dropout networks obtained from FL under varying data heterogeneity are connected). Consider a two-layer neural network with $N$ neurons, as in (5). Given $\mathcal{A}=[N/2]$, let $\bm{\theta}$ and $\bm{\theta}^{\prime}$ be the models trained under data heterogeneity $h(\alpha)$ and $h(\alpha^{\prime})$, respectively, both of which are dropout stable as in Definition 3. Then, $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are $\varepsilon$-connected as in Definition 1. Furthermore, the path connecting $\bm{\theta}$ with $\bm{\theta}^{\prime}$ consists of 7 line segments.

Proof.

To account for the effect of dropout stability on the given models, we represent the model parameters as:

\bm{\theta}=\left((a_{1},\bm{\theta}_{1}),(a_{2},\bm{\theta}_{2}),\ldots,(a_{N},\bm{\theta}_{N})\right),
\bm{\theta}^{\prime}=\left((a_{1}^{\prime},\bm{\theta}_{1}^{\prime}),(a_{2}^{\prime},\bm{\theta}_{2}^{\prime}),\ldots,(a_{N}^{\prime},\bm{\theta}_{N}^{\prime})\right),

where $a_{i}$, $\forall i\in[N]$, is the rescaling factor of dropout. That is, $a_{i}=0$ if the $i$-th component is dropped out (i.e., $i\notin\mathcal{A}$), and $a_{i}=2$ if it is not (i.e., $i\in\mathcal{A}$).

We consider a piecewise linear path in parameter space that connects $\bm{\theta}$ to $\bm{\theta}^{\prime}$ as one mode-connectivity path for over-parameterized neural networks. Namely, when $N$ is even, we can connect $\bm{\theta}$ to $\bm{\theta}^{\prime}$ through the following sequence of points:

\bm{\theta}=\left((a_{1},\bm{\theta}_{1}),\ldots,(a_{N/2},\bm{\theta}_{N/2}),\ldots,(a_{N},\bm{\theta}_{N})\right),
\bm{\theta}_{1}=\left((2,\bm{\theta}_{1}),\ldots,(2,\bm{\theta}_{N/2}),(0,\bm{\theta}_{N/2+1}),\ldots,(0,\bm{\theta}_{N})\right),
\bm{\theta}_{2}=\left((2,\bm{\theta}_{1}),\ldots,(2,\bm{\theta}_{N/2}),(0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{3}=\left((0,\bm{\theta}_{1}),\ldots,(0,\bm{\theta}_{N/2}),(2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{4}=\left((0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime}),(2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{5}=\left((2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime}),(0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{6}=\left((2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime}),(0,\bm{\theta}_{N/2+1}^{\prime}),\ldots,(0,\bm{\theta}_{N}^{\prime})\right),
\bm{\theta}^{\prime}=\left((a_{1}^{\prime},\bm{\theta}_{1}^{\prime}),\ldots,(a_{N/2}^{\prime},\bm{\theta}_{N/2}^{\prime}),\ldots,(a_{N}^{\prime},\bm{\theta}_{N}^{\prime})\right).

First, for the path from $\bm{\theta}$ to $\bm{\theta}_{1}$: since $\bm{\theta}$ is $\varepsilon_{D}$-dropout stable, we have $\mathcal{L}_{N}\left(\bm{\theta}_{1}\right)\leq\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D}$. Similarly, the loss along the path that connects $\bm{\theta}_{6}$ to $\bm{\theta}^{\prime}$ is upper bounded via $\mathcal{L}_{N}\left(\bm{\theta}_{6}\right)\leq\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)+\varepsilon_{D}^{\prime}$.

Then, for the path from $\bm{\theta}_{1}$ to $\bm{\theta}_{2}$, only the components $\bm{\theta}_{i}$ with the corresponding $a_{i}=0$ are changed to $\bm{\theta}^{\prime}_{i}$. Thus, the loss along this path does not change, i.e., $\mathcal{L}_{N}(\bm{\theta}_{1})=\mathcal{L}_{N}(\bm{\theta}_{2})$. Similarly, the loss does not change along the path from $\bm{\theta}_{3}$ to $\bm{\theta}_{4}$ and the path from $\bm{\theta}_{5}$ to $\bm{\theta}_{6}$: $\mathcal{L}_{N}(\bm{\theta}_{3})=\mathcal{L}_{N}(\bm{\theta}_{4})$ and $\mathcal{L}_{N}(\bm{\theta}_{5})=\mathcal{L}_{N}(\bm{\theta}_{6})$.

Next, for the path from $\bm{\theta}_{4}$ to $\bm{\theta}_{5}$, the two subnetworks are equal, so the loss of the interpolated models along this path does not change, i.e., $\mathcal{L}_{N}(\bm{\theta}_{4})=\mathcal{L}_{N}(\bm{\theta}_{5})$.

Finally, based on the above analysis, we have $\mathcal{L}_{N}\left(\bm{\theta}_{2}\right)\leq\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D}$ and $\mathcal{L}_{N}\left(\bm{\theta}_{3}\right)\leq\mathcal{L}_{N}(\bm{\theta}^{\prime})+\varepsilon_{D}^{\prime}$. For the path from $\bm{\theta}_{2}$ to $\bm{\theta}_{3}$, since the loss is convex in the weights of the last layer, the loss along this path is upper bounded by $\max\left(\mathcal{L}_{N}(\bm{\theta}_{2}),\mathcal{L}_{N}\left(\bm{\theta}_{3}\right)\right)\leq\max\left(\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D},\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)+\varepsilon_{D}^{\prime}\right)$.
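To make the convexity step explicit (a short reconstruction under the mean-square loss used in this work): along the segment $\bm{\theta}(t)=(1-t)\bm{\theta}_{2}+t\bm{\theta}_{3}$, $t\in[0,1]$, only the rescaling factors $a_{i}$ change while the neuron weights stay fixed, so the network output is affine in $t$ and the loss is convex in $t$; hence

\mathcal{L}_{N}\left(\bm{\theta}(t)\right)\leq(1-t)\,\mathcal{L}_{N}(\bm{\theta}_{2})+t\,\mathcal{L}_{N}(\bm{\theta}_{3})\leq\max\left(\mathcal{L}_{N}(\bm{\theta}_{2}),\mathcal{L}_{N}(\bm{\theta}_{3})\right),\quad t\in[0,1].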

When $N$ is odd, a similar analysis to the even-$N$ case applies. The differences are that (i) the $\lceil N/2\rceil$-th parameter of $\bm{\theta}_{1},\bm{\theta}_{2}$ and $\bm{\theta}_{3}$ is $\left(0,\bm{\theta}_{\lceil N/2\rceil}\right)$ and the $\lceil N/2\rceil$-th parameter of $\bm{\theta}_{4},\bm{\theta}_{5}$ and $\bm{\theta}_{6}$ is $\left(0,\bm{\theta}_{\lceil N/2\rceil}^{\prime}\right)$, and (ii) the constant rescaling factor $a_{i}=2$ is replaced by $N/\lfloor N/2\rfloor$.

Therefore, the loss along the above path between $\bm{\theta}$ and $\bm{\theta}^{\prime}$ is upper bounded by $\max\left(\mathcal{L}_{N}(\bm{\theta}),\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)\right)+\varepsilon_{C}$ with $\varepsilon_{C}$ equal to $\max(\varepsilon_{D},\varepsilon_{D}^{\prime})$.
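A minimal numerical sketch of this 7-segment construction (Python/NumPy; the toy $\tanh$ neuron, the helper names, and the evaluation grid are our own assumptions and only illustrate how the intermediate points $\bm{\theta}_{1},\ldots,\bm{\theta}_{6}$ are formed and how the loss along the path could be checked for even $N$):

import numpy as np

def forward(params, x):
    """Two-layer output: mean over neurons of a_i * tanh(w_i @ x) (toy activation)."""
    return np.mean([a * np.tanh(w @ x) for a, w in params])

def mse(params, data):
    """Mean-square loss over a list of (x, y) samples."""
    return np.mean([(y - forward(params, x)) ** 2 for x, y in data])

def seven_segment_path(theta, theta_p):
    """Intermediate points theta_1..theta_6 of the piecewise-linear path (N even)."""
    N = len(theta)
    half = N // 2
    keep = lambda ps: [(2.0, w) for _, w in ps[:half]]   # rescaled active half
    drop = lambda ps: [(0.0, w) for _, w in ps]           # zeroed-out components
    t1 = keep(theta) + drop(theta[half:])
    t2 = keep(theta) + drop(theta_p[:half])
    t3 = drop(theta[:half]) + keep(theta_p)
    t4 = drop(theta_p[:half]) + keep(theta_p)
    t5 = keep(theta_p) + drop(theta_p[:half])
    t6 = keep(theta_p) + drop(theta_p[half:])
    return [theta, t1, t2, t3, t4, t5, t6, theta_p]

def max_loss_on_path(theta, theta_p, data, steps=10):
    """Maximum loss along the linear interpolation of consecutive path points."""
    pts = seven_segment_path(theta, theta_p)
    worst = 0.0
    for p, q in zip(pts[:-1], pts[1:]):
        for t in np.linspace(0.0, 1.0, steps):
            interp = [((1 - t) * ap + t * aq, (1 - t) * wp + t * wq)
                      for (ap, wp), (aq, wq) in zip(p, q)]
            worst = max(worst, mse(interp, data))
    return worst

In this sketch the rescaling factor 2 is hard-coded for even $N$; the odd-$N$ variant would replace it by $N/\lfloor N/2\rfloor$ as described above.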

B-C Proof of global-mode update (6)

Proof.

Here, we denote the model of client $k$ as $\bm{\theta}_{\cdot,k}$ and its $i$-th neuron as $\bm{\theta}_{i,k}$. Then, we represent the weighted average of all client models as $\bar{\bm{\theta}}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{\cdot,k}^{m,\tau}$ and its $i$-th neuron as $\bar{\bm{\theta}}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,\tau}$. Meanwhile, for a given data sample $(\bm{x}_{k}^{m,\tau},y_{k}^{m,\tau})$, we write the model outputs as $\bar{y}_{k}^{m,\tau}=f_{\bar{\bm{\theta}}^{m,\tau}}(\bm{x}_{k}^{m,\tau})$ and $\hat{y}_{k}^{m,\tau}=f_{{\bm{\theta}}_{\cdot,k}^{m,\tau}}(\bm{x}_{k}^{m,\tau})$, and the gradient deviation as $\Delta\bm{g}_{i,k}^{m,\tau}=\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})$.

With these notations, we formulate the update of the global model at the $m$-th round as:

\bm{\theta}_{i}^{m+1,0}-\bm{\theta}_{i}^{m,0} \qquad (24)
= 2\sum_{k=1}^{K}\frac{n_{k}}{n}\sum_{\tau=0}^{p}s^{m,\tau}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left[\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\left(\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right)\right]
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left[\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\Delta\bm{g}_{i,k}^{m,\tau}\right]
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(y_{k}^{m,\tau}-\bar{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right]+2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\bar{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}\right] \qquad (25)
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\sum_{\tau=0}^{p}2{s}^{m,\tau}{\bm{n}}_{i}^{m,\tau},

where the last equality holds because of the assumption of one-pass data processing and the fact that all client samples belong to the global distribution $\mathbb{P}$, i.e., $(\bm{x}_{k},y_{k})\sim\mathbb{P}$. Here, we represent the noise as:

\bm{n}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}\right] \qquad (26)
=\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left(\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right)\right].

Therefore, we obtain (6), i.e.,

\bm{\theta}_{i}^{m+1,\tau+1}=\bm{\theta}_{i}^{m,\tau}+2{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2{s}^{m,\tau}\bm{n}_{i}^{m,\tau}. \qquad (27)

Furthermore, according to Definition 2, $\bm{n}_{i}^{m,\tau}$ can be regarded as the noise induced by data heterogeneity in FL. Since $\bm{\theta}_{i}^{m,p-1}=\bar{\bm{\theta}}_{i}^{m,p-1}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,p-1}$, we apply Assumption 5 and simplify (6), with $\tau\in[0,Mp-1]$ for a given total number of training rounds $M$, as:

\bm{\theta}_{i}^{\tau+1}=\bm{\theta}_{i}^{\tau}+2s_{\tau}\left(y^{\tau}-\bar{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}_{\tau};\bm{\theta}_{i}^{\tau}\right)+2s_{\tau}{\bm{n}}_{i}^{\tau}. \qquad (28)

Therefore, we obtain (8).
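As a numerical companion to (26) and (28) (a sketch under our own assumptions: a toy two-layer model with neuron output $a\tanh(\bm{w}^{\top}\bm{x})$ and hypothetical data structures, not the authors' implementation), the heterogeneity noise of neuron $i$ at one step and the corresponding noisy update can be assembled as follows:

import numpy as np

def neuron_grad(a, w, x):
    """Gradient of the neuron output a*tanh(w @ x) w.r.t. (a, w) (toy activation)."""
    z = np.tanh(w @ x)
    return z, a * (1.0 - z ** 2) * x

def forward(params, x):
    """Two-layer output: average of the neuron outputs a_j * tanh(w_j @ x)."""
    return np.mean([a * np.tanh(w @ x) for a, w in params])

def heterogeneity_noise(i, client_params, weights, client_batches):
    """Noise n_i of Eq. (26) for neuron i, given K client models.

    client_params[k]: client k's model as a list of (a, w) tuples;
    weights[k] = n_k / n; client_batches[k] = (x_k, y_k) at the current step.
    """
    K, N = len(client_params), len(client_params[0])
    # Weighted average of the client models (the "bar" model).
    avg = [(sum(weights[k] * client_params[k][j][0] for k in range(K)),
            sum(weights[k] * client_params[k][j][1] for k in range(K)))
           for j in range(N)]
    a_bar, w_bar = avg[i]
    noise_a, noise_w = 0.0, np.zeros_like(w_bar)
    for k in range(K):
        x, y = client_batches[k]
        y_hat = forward(client_params[k], x)   # client-model prediction
        y_bar = forward(avg, x)                # averaged-model prediction
        ga_bar, gw_bar = neuron_grad(a_bar, w_bar, x)
        ga_k, gw_k = neuron_grad(*client_params[k][i], x)
        noise_a += weights[k] * ((y_bar - y_hat) * ga_bar + (y - y_hat) * (ga_k - ga_bar))
        noise_w += weights[k] * ((y_bar - y_hat) * gw_bar + (y - y_hat) * (gw_k - gw_bar))
    return (a_bar, w_bar), (noise_a, noise_w)

def noisy_update(theta_bar_i, noise_i, x, y, y_bar, s):
    """One step of the simplified update (28) for neuron i."""
    (a, w), (na, nw) = theta_bar_i, noise_i
    ga, gw = neuron_grad(a, w, x)
    return (a + 2 * s * (y - y_bar) * ga + 2 * s * na,
            w + 2 * s * (y - y_bar) * gw + 2 * s * nw)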