
Mode Connectivity and Data Heterogeneity of Federated Learning

Tailin Zhou, Jun Zhang,  and Danny H.K. Tsang T. Zhou is with IPO, Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (Email: [email protected]). J. Zhang is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (E-mail: [email protected]). D. H.K. Tsang is with the Internet of Things Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China, and also with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China (Email: [email protected]). (The corresponding author is J. Zhang.)
Abstract

Federated learning (FL) enables multiple clients to collaboratively train a model while keeping their data private. Previous studies have shown that data heterogeneity between clients leads to drifts across client updates. However, there are few studies on the relationship between client and global modes, making it unclear where these updates end up drifting. We perform empirical and theoretical studies on this relationship by utilizing mode connectivity, which measures performance change (i.e., connectivity) along parametric paths between different modes. Empirically, reducing data heterogeneity makes the connectivity on different paths more similar, forming more low-error overlaps between client and global modes. We also find that a barrier to connectivity occurs when linearly connecting two global modes, whereas it disappears when considering non-linear mode connectivity. Theoretically, we establish a quantitative bound on the global-mode connectivity using mean-field theory and dropout stability. The bound demonstrates that the connectivity improves when reducing data heterogeneity and widening trained models. Numerical results further corroborate our analytical findings.

Index Terms:
Federated learning, mode connectivity, data heterogeneity, mean-field theory, loss landscape visualization.

I Introduction

Federated Learning (FL) [1] is a collaborative paradigm that enables multiple clients to train a model while preserving data privacy. A primary challenge in FL is data heterogeneity across clients [2]. Previous works like [3] found that data heterogeneity induces a misalignment between clients’ local objectives and the FL’s global objective (hereinafter referred to as client objectives and the FL objective, respectively) and causes client updates to drift. Nevertheless, despite client-update drift, the FL objective can typically be optimized effectively [4]. This implies that the destinations of these drifting updates (i.e., the solutions to the client and FL objectives) may overlap. However, a systematic investigation of this relationship is currently lacking.

A straightforward method to examine this overlapping relationship is to investigate mode connectivity [5], where a mode refers to a set of solutions to a specific objective, i.e., a set of neuron parameters, under permutation invariance. Mode connectivity measures the connectivity of two modes via a parametric path and reveals their geometric relationship in the solution space. Specifically, it evaluates the performance change (e.g., taking loss/error as metrics) along a given path whose endpoints correspond to two different modes. A small change along the path indicates better connectivity (i.e., the geometric landscape between the two modes is more connected). Some works [5, 6] have discovered that two modes can be connected well in centralized training, but research on FL remains scarce.

Figure 1: An illustration of the relationship between client modes and global modes of FL in the solution space.

In this work, we conduct empirical and theoretical studies on mode connectivity in FL under data heterogeneity to unravel the relationship between client modes (i.e., the solutions to client objectives) and global modes (i.e., the solutions to the FL objective). Empirically, we consider linear and non-linear paths to evaluate the connectivity between client modes and global modes to show their overlap relationships, as illustrated in Figure 1. We also investigate the connectivity of global modes obtained under varying levels of data heterogeneity. Theoretically, we leverage mean-field theory [7] and dropout stability to establish a quantitative bound on the connectivity errors of global modes. In addition, we conduct a comprehensive numerical analysis to verify our analytical findings.

This study takes the first step to explore mode connectivity in FL. Our main contributions are summarized as follows:

  • Our empirical and theoretical investigations demonstrate that data heterogeneity deteriorates the connectivity between client and global modes, making it challenging to find solutions that meet both client and FL objectives.

  • It is revealed that FL solutions from different data heterogeneity belong to distinct global modes within a common region of the solution space. They can be connected on a polygonal chain while keeping error low but face an error barrier when considering a linear path.

  • We establish quantitative bounds on dropout and connectivity errors of global modes using mean-field theory, showing that both errors decrease with reduced data heterogeneity and widening trained models.

II Related Works

Data heterogeneity in FL. According to [3], data heterogeneity across clients significantly degrades the FL performance. Subsequent works have observed that data heterogeneity induces weight divergence and feature inconsistency. For weight divergence, studies such as [8, 9] show that client-update drift slows the FL convergence due to inconsistent client and FL objectives. Moreover, some studies, e.g., [10, 11], discover that the classifier divergence across clients is the main culprit of feature inconsistency, negatively affecting FL generalization. In contrast to previous works, we offer a novel perspective on data heterogeneity’s impact within the FL solution space.

Mode connectivity. Mode connectivity, i.e., the connectivity between neural network (NN) solutions that arises from NN non-convexity, has provided a new view of the geometric landscape of NNs in recent years. Despite the high dimensionality of NNs, different solutions follow a clear connectivity pattern [12, 13, 6]. Specifically, it is possible to connect local minima through specific paths, which makes it easier to find high-performance solutions when using gradient-based optimizers. These paths are often non-linear curves and are discovered by solving dedicated path-finding tasks. This suggests that local minima are not isolated but rather interconnected within a manifold. Moreover, the geometric landscape of an NN becomes approximately linearly connected when the number of neurons goes to infinity, as per [12]. Previous studies mainly focus on mode connectivity in centralized training with homogeneous data. Several pilot studies [14, 15] visualize the loss landscape of FL but do not study mode connectivity; our work fills this gap by examining mode connectivity in FL under data heterogeneity.

Mean-field theory. A mean-field theory for analyzing NN training dynamics is proposed in [7, 16] based on a gradient-flow approximation. The theory shows that, when the number of neurons is sufficiently large, the dynamics of stochastic gradient descent (SGD) for optimizing NNs can be well approximated by a Wasserstein gradient flow. This approximation has been studied both in the two-layer case [7, 17] and the multi-layer case [18, 19]. In particular, the approximation is utilized in [19] to analyze the dropout stability of SGD solutions, and dropout stability implies mode connectivity of NNs according to [20]. However, these results have only been established for centralized training. To the best of our knowledge, this paper is the first to extend these theoretical results to FL.

III Preliminaries

Figure 2: Loss landscapes for the global mode and client modes on their corresponding test sets at the 200-th round (final round), and LMC from client modes to the global mode. Loss landscapes of VGG11 on CIFAR-10 with $\alpha=0.1$ (left), CIFAR-10 with $\alpha=0.5$ (middle), and PACS with $\alpha=0.5$ (right).

Mode connectivity. The forward function of an NN parameterized by $\bm{\theta}\in\Theta$ is represented as $f_{\bm{\theta}}:\mathcal{X}\rightarrow\mathcal{Y}$, where $\Theta$ denotes the parameter space of an NN architecture, and $\mathcal{X}\subseteq\mathbb{R}^{d}$ and $\mathcal{Y}\subseteq\mathbb{R}$ are the input and output spaces, respectively. When models $\bm{\theta}_{1}$ and $\bm{\theta}_{2}$ have the same neurons but different neuron permutations, they belong to the same mode. According to [20, 19], given an NN architecture $\Theta$, a dataset $\mathcal{D}$ and a loss function $\mathcal{L}(\cdot)$, mode connectivity is defined as follows:

Definition 1.

(Mode connectivity). Two mode parameters $\bm{\theta}_{1}\in\Theta$ and $\bm{\theta}_{2}\in\Theta$ are $\varepsilon_{C}$-connected if there exists a continuous path in parameter space $\pi_{\bm{\theta}}(\nu):[0,1]\rightarrow\Theta$ such that $\pi_{\bm{\theta}}(0)=\bm{\theta}_{1}$ and $\pi_{\bm{\theta}}(1)=\bm{\theta}_{2}$ with $\mathcal{L}(\pi_{\bm{\theta}}(\nu))\leq\max(\mathcal{L}(\bm{\theta}_{1}),\mathcal{L}(\bm{\theta}_{2}))+\varepsilon_{C}$ for all $\nu\in[0,1]$, where $\varepsilon_{C}$ is the connectivity error.

This definition measures the connectivity of two modes by comparing the loss change along a path to its endpoints. Moreover, we consider a path-finding task proposed by [5], which minimizes the expected loss over a uniform distribution on the path as:

\[\min\mathcal{L}_{\rm curve}(\bm{\theta})=\mathbb{E}_{\nu\sim U(0,1)}\left[\mathcal{L}(\pi_{\bm{\theta}}(\nu))\right], \qquad (1)\]

where the path $\pi_{\bm{\theta}}(\nu)$ can be characterized by a polygonal chain (PolyChain) with one or more bends [5]. We consider a PolyChain with one bend:

\[\pi_{\bm{\theta}}(\nu)=\begin{cases}2\left[(0.5-\nu)\bm{\theta}_{1}+\nu\bm{\theta}_{\rm b}\right],&0\leq\nu\leq 0.5,\\ 2\left[(\nu-0.5)\bm{\theta}_{2}+(1-\nu)\bm{\theta}_{\rm b}\right],&0.5\leq\nu\leq 1,\end{cases} \qquad (2)\]

where $\bm{\theta}_{\rm b}$ is the model at the PolyChain bend and can be found by solving $\min_{\bm{\theta}_{\rm b}}\mathcal{L}_{\rm curve}(\bm{\theta}_{\rm b})$ in (1). We refer to mode connectivity as linear mode connectivity (LMC) when using linear interpolation, and as non-linear mode connectivity when using a PolyChain. Compared with its non-linear counterpart, LMC is a stronger requirement since it forces the two modes to lie in the same basin.
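As a concrete illustration, the following sketch evaluates the loss along a linear path or along the one-bend PolyChain in (2) by interpolating model parameters in PyTorch. The helper names (path_point, path_losses) and the user-supplied eval_loss callable are our own assumptions, not part of any FL library.

import torch

def combine(pairs):
    # Weighted combination sum_j w_j * theta_j of (weight, state_dict) pairs;
    # non-floating buffers, if any, are copied from the first state dict.
    out = {}
    for key, value in pairs[0][1].items():
        if value.is_floating_point():
            out[key] = sum(w * sd[key] for w, sd in pairs)
        else:
            out[key] = value.clone()
    return out

def path_point(sd1, sd2, nu, sd_bend=None):
    # Linear interpolation, or the one-bend PolyChain of Eq. (2) when sd_bend is given.
    if sd_bend is None:
        return combine([(1.0 - nu, sd1), (nu, sd2)])
    if nu <= 0.5:
        return combine([(2 * (0.5 - nu), sd1), (2 * nu, sd_bend)])
    return combine([(2 * (nu - 0.5), sd2), (2 * (1.0 - nu), sd_bend)])

@torch.no_grad()
def path_losses(model, sd1, sd2, eval_loss, sd_bend=None, steps=21):
    # Traverse the path and record the loss at each point nu in [0, 1].
    losses = []
    for i in range(steps):
        nu = i / (steps - 1)
        model.load_state_dict(path_point(sd1, sd2, nu, sd_bend))
        losses.append(eval_loss(model))
    return losses

Finding the bend $\bm{\theta}_{\rm b}$ as in (1) then amounts to sampling $\nu\sim U(0,1)$ and training the bend parameters to minimize the training loss of path_point(sd1, sd2, $\nu$, sd_bend).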

Federated learning (FL). We consider an FL framework with $K$ clients, each with its own dataset $\mathcal{D}_{k}=\{({\bm{x}}_{i_{k}},y_{i_{k}})\}$, where data samples $({\bm{x}}_{i_{k}},y_{i_{k}})\sim\mathbb{P}_{k}$ on $\mathbb{R}^{d}\times\mathbb{R}$ are indexed by $i_{k}\in[n_{k}]$. The global dataset is the union of all client datasets, denoted by $\mathcal{D}=\cup_{k=1}^{K}\mathcal{D}_{k}\sim\mathbb{P}$ on $\mathbb{R}^{d}\times\mathbb{R}$, and contains $n=\sum_{k=1}^{K}n_{k}$ data samples. The FL objective is to minimize the expected global loss $\mathcal{L}(\bm{\theta})$ on $\mathcal{D}$, formulated as:

\[\min_{\bm{\theta}}\mathcal{L}(\bm{\theta})=\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}}[l(\bm{\theta};({\bm{x}},y))]=\sum_{k=1}^{K}\frac{n_{k}}{n}\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}_{k}}[l_{k}(\bm{\theta};({\bm{x}},y))], \qquad (3)\]

where $l(\cdot)$ and $l_{k}(\cdot)$ denote the global and client loss functions, respectively. The $k$-th client objective is $\min_{\bm{\theta}_{k}}\mathcal{L}_{k}(\bm{\theta}_{k})=\mathbb{E}_{({\bm{x}},y)\in\mathcal{D}_{k}}[l_{k}(\bm{\theta}_{k};({\bm{x}},y))]$. When $\mathbb{P}_{k}\neq\mathbb{P}$, FedAvg [1] uses the client models $\{\bm{\theta}_{k}\}_{k=1}^{K}$ to solve (3) by minimizing the client objectives locally and obtaining the global model $\bm{\theta}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{k}$ round by round.
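For completeness, a minimal FedAvg round consistent with the description above is sketched below in PyTorch-style Python; the local_update routine, the loader handles, and the function name are assumptions rather than the authors' implementation.

import copy

def fedavg_round(global_model, client_loaders, local_update):
    # One FedAvg round: broadcast the global model, train locally on each client,
    # and aggregate by the data-weighted average theta = sum_k (n_k / n) * theta_k.
    client_states, client_sizes = [], []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)
        local_update(local_model, loader)           # minimize the k-th client objective locally
        client_states.append(local_model.state_dict())
        client_sizes.append(len(loader.dataset))    # n_k
    n = float(sum(client_sizes))
    averaged = {}
    for key, value in client_states[0].items():
        if value.is_floating_point():
            averaged[key] = sum((n_k / n) * sd[key] for n_k, sd in zip(client_sizes, client_states))
        else:
            averaged[key] = value.clone()           # integer buffers (e.g., counters) are not averaged
    global_model.load_state_dict(averaged)
    return global_model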

Experimental setups. We explore mode connectivity in FL on classification tasks, including MNIST [21], CIFAR-10 [22], and PACS [23]. The experimental setups used throughout this paper are as follows, unless stated otherwise. For NN architectures, we consider a two-layer NN, a shallow convolutional neural network (CNN) with two convolutional layers, and a deep CNN, i.e., the VGG architecture [24]. For the FL setup, we consider ten clients participating in FL for a total of 200 rounds. Client optimizers are SGD with a learning rate of 0.01 and momentum of 0.9, a mini-batch size of 50, one local epoch, and dropout of 0.5 for training VGG. For data heterogeneity, we follow [2] and examine label distribution skew by using a Dirichlet distribution $Dir(\alpha)$ [25] to create client label distributions, where a larger $\alpha$ yields more homogeneous distributions across clients. On this basis, we also make use of the built-in domain shift of PACS to introduce feature distribution skew across clients.
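One common way to realize the $Dir(\alpha)$ label-distribution skew (in the spirit of [25]) is sketched below; the function name and the use of NumPy's Dirichlet sampler are our own choices.

import numpy as np

def dirichlet_label_split(labels, num_clients=10, alpha=0.5, seed=0):
    # Partition sample indices across clients with label-distribution skew Dir(alpha);
    # a smaller alpha yields more heterogeneous client label distributions.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))   # share of class c per client
        cut_points = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx_c, cut_points)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ids) for ids in client_indices]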

IV Mode Connectivity of Client Modes

In this section, we explore mode connectivity among client modes trained on client datasets, where $\mathbb{P}_{k}\neq\mathbb{P}_{k^{\prime}}$ for $k\neq k^{\prime}$. We first investigate the geometric landscape of client modes on their corresponding test sets. Specifically, we create a hyperplane using the global mode $\bm{\theta}$ and client modes $\bm{\theta}_{k}$, $\bm{\theta}_{k^{\prime}}$, represented as $\{(a,b)\,|\,\bm{\theta}+a(\bm{\theta}_{k}-\bm{\theta})+b(\bm{\theta}_{k^{\prime}}-\bm{\theta})\}$. We then use the test sets of clients $k$ and $k^{\prime}$ to evaluate the performance of all points on the hyperplane. Figure 2 displays the loss landscapes obtained on different test sets, distinguished by different colors. We also depict the loss landscape of the global mode (i.e., the initialization of the client modes) on the global test set in Figure 2. For comparison, the legend scale of the client-mode landscape is the same in each sub-figure of Figure 2 except for the lowest value, and the scale of the global-mode landscape is consistent across all sub-figures.
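The hyperplane evaluation can be reproduced with a sketch like the following (PyTorch-style; the grid ranges and the eval_loss callable are assumptions).

import numpy as np
import torch

@torch.no_grad()
def hyperplane_losses(model, sd_global, sd_k, sd_kp, eval_loss,
                      a_grid=np.linspace(-0.5, 1.5, 21),
                      b_grid=np.linspace(-0.5, 1.5, 21)):
    # Loss surface over the plane theta + a * (theta_k - theta) + b * (theta_k' - theta).
    losses = np.zeros((len(a_grid), len(b_grid)))
    for i, a in enumerate(a_grid):
        for j, b in enumerate(b_grid):
            point = {key: value + a * (sd_k[key] - value) + b * (sd_kp[key] - value)
                     if value.is_floating_point() else value
                     for key, value in sd_global.items()}
            model.load_state_dict(point)
            losses[i, j] = eval_loss(model)   # evaluated on the chosen test set
    return losses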

Figure 3: Traversing loss under $\alpha=0.1$ (left) and $\alpha=0.5$ (right) from client 1 to client 2.
Figure 4: Client models’ optimization trajectory (VGG11) along the training rounds (left) and along the local iterations within one round (right).
Figure 5: Training accuracy (left) and test accuracy (right) with linear connectivity and polygonal connectivity on global modes.

Visualizing mode connectivity on client modes. As shown in Figure 2, comparing the cases of $\alpha=0.1$ and $\alpha=0.5$ indicates that when data heterogeneity increases, the low-loss overlaps between client and global modes shrink, and the number of solutions satisfying both the client and FL objectives is reduced. Similar results are observed when comparing CIFAR-10 with PACS. Meanwhile, although client modes achieve a lower test loss than the global mode, they stray from the low-loss regions of the global mode. This indicates that client modes are prone to overfitting their own objectives when FedAvg solves (3), even though the global mode provides an adequate initialization for them in the final round. On the other hand, we can linearly connect client modes to the global mode in all cases, as shown in the LMC plots of Figure 2. As the loss changes along these linear paths, it can reach the low-loss regions of the global mode while still remaining in the low-loss regions of the client modes. These observations reveal that there exist solutions satisfying both the client and FL objectives, but they may be overlooked by FedAvg.

Traversing loss along the paths connecting client modes. To delve deeper into the loss landscapes of Figure 2, we consider two paths connecting two client modes. One path is the linear connection between them (i.e., $\bm{\theta}_{k}\rightarrow\bm{\theta}_{k^{\prime}}$), while the other is a PolyChain with the global mode as its bend (i.e., $\bm{\theta}_{k}\rightarrow\bm{\theta}\rightarrow\bm{\theta}_{k^{\prime}}$) as per (2). We follow the paths to traverse the landscape from $\bm{\theta}_{k}$ to $\bm{\theta}_{k^{\prime}}$ on their test sets and show the traversing losses along the paths in Figure 3. For the PolyChain in the case of $\alpha=0.1$, the intersection point of the two curves deviates from the global mode (i.e., $a=0.5$), where the intersection represents a solution that works effectively on both client datasets. Moreover, the traversing-loss gap between the two paths for $\alpha=0.1$ is greater than that for $\alpha=0.5$. This means that when client data are heterogeneous, it becomes more difficult to find solutions that align with both the client and FL objectives.

Client-mode trajectory. The optimization trajectory of client modes for $\alpha=0.1$ is depicted in Figure 4. The trajectory is visualized with t-SNE [26] and covers both training rounds and local iterations within each round. The figure illustrates that throughout training, the distance between client modes remains controlled even under severe data heterogeneity, such that they closely surround the global mode. Within the same round, data heterogeneity pushes client modes with the same initialization to be optimized locally along different directions. Therefore, the solutions found by the client objectives differ under data heterogeneity even though they are close to each other in the parameter space. Figure 12 shows similar results for the shallow CNN.
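A trajectory plot of this kind can be produced by flattening the model checkpoints collected over rounds and local iterations and projecting them with t-SNE; a minimal sketch (using scikit-learn, with the perplexity value as an assumption) is given below.

import numpy as np
from sklearn.manifold import TSNE

def embed_trajectory(checkpoints, perplexity=5, seed=0):
    # checkpoints: list of state dicts (global and client models saved along training).
    flat = np.stack([np.concatenate([p.detach().cpu().numpy().ravel() for p in sd.values()])
                     for sd in checkpoints])
    # Project the high-dimensional parameter vectors to 2D for visualization;
    # perplexity must be smaller than the number of checkpoints.
    return TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed).fit_transform(flat)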

V Mode Connectivity of Global Modes

In this section, we explore the mode connectivity of global modes obtained under different levels of data heterogeneity for the same training task. We consider two paths, linear interpolation and PolyChain, to investigate the mode connectivity among global modes on CIFAR-10 under $\alpha=0.1$, $0.2$, and $0.5$, as shown in Figure 5. Note that the initialization of the global modes remains the same across these three cases.

Visualizing mode connectivity on global modes. In Figure 5, the linear-interpolation landscape reveals that global modes obtained under varying data heterogeneity are situated in distinct basins with an error barrier separating them. The landscape also shows that, when sharing the same initialization, the global-mode basins lie within a common region surrounded by high errors. With more heterogeneous data, e.g., the pair $\alpha=0.1,0.2$, both the training and test error barriers become higher compared with the pair $\alpha=0.1,0.5$. Furthermore, we use a PolyChain found by (1) to connect the global modes of $\alpha=0.1$ and $0.2$. Along the PolyChain, the solutions exhibit no barrier in either the training or test landscape. These results indicate that global modes can be connected without any barrier by suitable paths and are situated within a manifold, as illustrated in Figure 1.

Figure 6: Global-mode barrier with linear interpolation (left) and PolyChain (right).

Barrier among global modes. To further quantify the mode connectivity of two global modes $\bm{\theta}$ and $\bm{\theta}^{\prime}$, we define a barrier metric as:

\[B({\bm{\theta}},{\bm{\theta}^{\prime}})=\max_{a\in[0,1]}\frac{\left|\mathcal{L}\left(\pi_{\bm{\theta}}(a)\right)-\min(\mathcal{L}({\bm{\theta}}),\mathcal{L}({\bm{\theta}}^{\prime}))\right|}{|\mathcal{L}({\bm{\theta}})-\mathcal{L}({\bm{\theta}}^{\prime})|}-1, \qquad (4)\]

where $\pi_{\bm{\theta}}(a)$ equals $(1-a)\bm{\theta}+a\bm{\theta}^{\prime}$ for linear interpolation and follows (2) for the PolyChain. Here, $B(\bm{\theta},\bm{\theta}^{\prime})$ measures the largest performance gap along $\pi_{\bm{\theta}}(a)$ compared with its endpoints $\bm{\theta}$ and $\bm{\theta}^{\prime}$. We say that $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are well connected if the barrier $B(\bm{\theta},\bm{\theta}^{\prime})$ is close to 0.
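Given the losses recorded along a path (e.g., by the path-traversal sketch in Section III), the barrier (4) can be computed as follows; the sampling grid over $a$ is an assumption, and the formula assumes the two endpoint losses differ.

import numpy as np

def barrier(losses_on_path, loss_a, loss_b):
    # Eq. (4): largest loss rise along pi_theta(a), a in [0, 1], relative to the endpoints.
    losses_on_path = np.asarray(losses_on_path)
    gap = np.abs(losses_on_path - min(loss_a, loss_b)).max()
    return gap / abs(loss_a - loss_b) - 1.0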

We then use $B(\bm{\theta},\bm{\theta}^{\prime})$ to explore mode connectivity over the whole training process. In Figure 6, the left sub-figure shows that more severe data heterogeneity causes greater barriers, as seen by comparing the pairs $\alpha=0.1,0.2$ and $\alpha=0.1,0.5$; these barriers persist throughout training. Meanwhile, as the rounds progress, the barrier shows an increasing trend after an initial oscillation, which drives the various global modes into their own basins, as found in Figure 5. We also examine the difficulty of finding the global-mode PolyChain under varying data heterogeneity. Here, the difficulty is quantified by the relationship between the barrier $B(\bm{\theta},\bm{\theta}^{\prime})$ and the number of data samples used to optimize (1), depicted in the right sub-figure of Figure 6. The sub-figure illustrates that more samples are needed to find the specific PolyChain as data heterogeneity increases, making it more difficult to connect global modes without an error barrier.

Function dissimilarity and distance of global modes. According to [27], permutation invariance may contribute to a linear-interpolation barrier between NN solutions in centralized training. In FL, data heterogeneity may disturb the permutation of FL solutions, leading to the barriers found in Figure 5. To verify whether the FL solutions obtained under different data heterogeneity belong to the same global mode, we measure their function dissimilarity on the left-hand side of Figure 7. The figure shows that the function dissimilarity remains above 0.2 among these FL solutions even when data heterogeneity is weakened, which means the FL solutions belong to different global modes. Furthermore, Figure 7 shows that the distance between these FL solutions is smaller than the distance between them and their initial points, which means that different global modes with the same initialization are close to each other. Combining these results, we infer that these FL solutions, though belonging to different global modes, are adjacent and lie in a common region, but they cannot be linearly connected.

Figure 7: Global modes in function space (left) and weight space (right). Function dissimilarity is computed as the fraction of labels on which the predictions from different global modes disagree when given the same inputs. The L2 distance is normalized by the norm of the model initialization.
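Both quantities in Figure 7 can be computed with a sketch like the following (PyTorch-style; the loader and the classifier interface are assumptions): the disagreement fraction in function space and the initialization-normalized L2 distance in weight space.

import torch

@torch.no_grad()
def disagreement(model_a, model_b, loader):
    # Fraction of inputs on which two global modes predict different labels.
    diff, total = 0, 0
    for x, _ in loader:
        diff += (model_a(x).argmax(dim=1) != model_b(x).argmax(dim=1)).sum().item()
        total += x.size(0)
    return diff / total

def normalized_l2(sd_a, sd_b, sd_init):
    # ||theta_a - theta_b||_2 normalized by the norm of the shared initialization.
    flat = lambda sd: torch.cat([v.flatten().float() for v in sd.values()])
    return ((flat(sd_a) - flat(sd_b)).norm() / flat(sd_init).norm()).item()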

VI Mode Connectivity Analysis under Data Heterogeneity

As indicated by our experimental results, a sufficient condition for finding solutions that satisfy both the client and FL objectives (3) is that the distributions represented by the client modes $\bm{\theta}_{k}$ coincide with that of the global mode $\bm{\theta}$. However, satisfying this condition is generally challenging since data heterogeneity induces client-gradient drifts [3] and, in turn, an implicit output bias among client modes [15]. Building upon these drifts and biases, we consider $K$ client modes $\{\bm{\theta}_{k}\}_{k=1}^{K}$ and formalize data heterogeneity as:

Definition 2.

(Data heterogeneity). Given a global mode $\bm{\theta}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{k}$ and data samples $({\bm{x}},y)\sim\mathbb{P}$, for all $k\in[K]$, the global-mode output $\bar{y}=f_{\bm{\theta}}(\bm{x})$ and its gradient $\nabla_{\bm{\theta}}(\bm{x})$ differ from the client-mode output $\hat{y}_{k}=f_{\bm{\theta}_{k}}(\bm{x})$ and its gradient $\nabla_{\bm{\theta}_{k}}(\bm{x})$, respectively, if the client distribution $\mathbb{P}_{k}$ differs from the global distribution $\mathbb{P}$.

In this section, we will utilize dropout stability to analyze mode connectivity of global modes. First, let us define it as:

Definition 3.

(Dropout stability). Given a model $\bm{\theta}$ with $N$ neurons and its dropout network $\bm{\theta}_{\mathrm{S}}$ that keeps a subset of neurons $\mathcal{A}\subseteq[N]$, the model $\bm{\theta}$ is $\varepsilon_{\mathrm{D}}$-dropout stable if $\left|\mathcal{L}_{N}(\bm{\theta})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}\right)\right|\leq\varepsilon_{\mathrm{D}}$.

Then, we adopt the mean-field theory developed in [7] to analyze dropout stability under data heterogeneity. Mean-field theory shows that the trajectory of SGD can be approximated by a partial differential equation (PDE) called the distributional dynamics (DD). Namely, under the assumptions of sufficiently many neurons (i.e., $N$ is large enough) and one-pass data processing (i.e., for all $i\in[n]$, samples $({\bm{x}}_{i},y_{i})\in\mathcal{D}\sim\mathbb{P}$ are independent and identically distributed), the SGD trajectory of the $N$ neurons is close to the evolution of $N$ i.i.d. particles following the DD. Here, we consider a two-layer NN with $N$ hidden neurons, denoted as $\bm{\theta}=(\bm{\theta}_{1},\dots,\bm{\theta}_{N})$, where $\bm{\theta}_{i}\in\mathbb{R}^{D}$ for all $i\in[N]$. The forward function of $\bm{\theta}$ is represented as:

\[f_{\bm{\theta}}({\bm{x}})=\frac{1}{N}\sum_{i=1}^{N}\sigma_{*}({\bm{x}};{\bm{\theta}_{i}}), \qquad (5)\]

where $\sigma_{*}:\mathbb{R}^{d}\times\mathbb{R}^{D}\rightarrow\mathbb{R}$ is the activation function. For simplicity, we focus on the ReLU activation and the mean-square loss, i.e., $\sigma_{*}({\bm{x}};{\bm{\theta}_{i}})=\sigma(\langle{\bm{x}},{\bm{\theta}_{i}}\rangle)$ and $l(\bm{\theta};({\bm{x}},y))=(y-f_{\bm{\theta}}({\bm{x}}))^{2}$; the analysis can be extended to other activation functions and losses as in [7, 16].
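To make (5), the mean-square loss, and Definition 3 concrete, a NumPy sketch of the two-layer ReLU network and its dropout error is given below; the data arrays and the choice of the kept-neuron subset are assumptions.

import numpy as np

def two_layer_output(X, theta):
    # Eq. (5) with ReLU: f_theta(x) = (1/N) * sum_i relu(<x, theta_i>); theta has shape (N, d).
    return np.maximum(X @ theta.T, 0.0).mean(axis=1)

def mse_loss(X, y, theta):
    # Mean-square loss l(theta; (x, y)) = (y - f_theta(x))^2 averaged over the data.
    return float(np.mean((y - two_layer_output(X, theta)) ** 2))

def dropout_error(X, y, theta, keep_idx):
    # Definition 3: |L_N(theta) - L_{|A|}(theta_S)|, where theta_S keeps only the neurons in
    # keep_idx and the 1/|A| normalization of Eq. (5) is applied by the sub-network itself.
    return abs(mse_loss(X, y, theta) - mse_loss(X, y, theta[keep_idx]))

# Usage sketch: drop a random half of the neurons.
# keep_idx = np.random.default_rng(0).choice(theta.shape[0], theta.shape[0] // 2, replace=False)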

Meanwhile, we consider FedAvg with each client performing $p$ local iterations in the $m$-th round, and write the global-mode update on the $i$-th neuron $\bm{\theta}_{i}$ as $\Delta\bm{\theta}_{i}^{(m)}=\bm{\theta}_{i}^{m+1,0}-\bm{\theta}_{i}^{m,0}$ to obtain:

\[\Delta\bm{\theta}_{i}^{(m)}=2\sum_{\tau=0}^{p}{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2\sum_{\tau=0}^{p}{s}^{m,\tau}\bm{n}_{i}^{m,\tau}, \qquad (6)\]

where $\tau\in[p]$ denotes the $\tau$-th local iteration, ${s}^{m,\tau}$ denotes the step size at the $\tau$-th iteration, $\bar{\bm{\theta}}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,\tau}$ denotes the weighted average of the $i$-th neuron across clients, $\bar{y}^{m,\tau}=\frac{1}{N}\sum_{i=1}^{N}\sigma({\bm{x}^{m,\tau}};{\bar{\bm{\theta}}_{i}^{m,\tau}})$ denotes the output of the weighted-average model $\bar{\bm{\theta}}^{m,\tau}$ on the one-pass sample $({\bm{x}^{m,\tau}},{y^{m,\tau}})\sim\mathbb{P}$, and $\bm{n}_{i}^{m,\tau}$ denotes the noise induced by data heterogeneity. Specifically, we follow Definition 2 and represent $\bm{n}_{i}^{m,\tau}$ (the superscript is omitted for brevity) as:

\[\bm{n}_{i}=\sum_{k=1}^{K}\frac{n_{k}}{n}\Big[(y_{k}-\hat{y}_{k})\underbrace{[\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k};\bm{\theta}_{i,k})-\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}_{k};\bar{\bm{\theta}}_{i})]}_{\text{gradient drift}}+\underbrace{(\bar{y}_{k}-\hat{y}_{k})}_{\text{output bias}}\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}_{k};\bar{\bm{\theta}}_{i})\Big], \qquad (7)\]

where the gradient drift and output bias depend on data heterogeneity, i.e., $(\bm{x}_{k},y_{k})\sim\mathbb{P}_{k}$ while the global samples follow $\mathbb{P}$, and $\mathbb{P}_{k}\neq\mathbb{P}$.
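A direct NumPy sketch of the noise term (7) for the two-layer ReLU network is shown below (the noise norms reported later in Figure 10 could be measured this way); the argument layout and variable names are assumptions.

import numpy as np

def relu_grad(x, theta_i):
    # Gradient of relu(<x, theta_i>) with respect to theta_i: x * 1[<x, theta_i> > 0].
    return x * float(x @ theta_i > 0)

def neuron_noise(i, client_thetas, weights, client_samples):
    # Eq. (7): heterogeneity-induced noise on the i-th neuron of the averaged model.
    # client_thetas: list of (N, d) arrays; weights: n_k / n; client_samples: list of (x_k, y_k).
    theta_bar = sum(w * th for w, th in zip(weights, client_thetas))
    output = lambda th, x: np.maximum(x @ th.T, 0.0).mean()      # two-layer output, Eq. (5)
    n_i = np.zeros_like(theta_bar[i])
    for w, th_k, (x_k, y_k) in zip(weights, client_thetas, client_samples):
        y_hat_k = output(th_k, x_k)          # client-mode output
        y_bar_k = output(theta_bar, x_k)     # averaged-model output on the client sample
        grad_drift = relu_grad(x_k, th_k[i]) - relu_grad(x_k, theta_bar[i])
        n_i += w * ((y_k - y_hat_k) * grad_drift
                    + (y_bar_k - y_hat_k) * relu_grad(x_k, theta_bar[i]))
    return n_i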

Furthermore, the trajectory of $\bar{\bm{\theta}}_{i}$ in (6) subsumes the trajectory of the global-model neuron $\bm{\theta}_{i}$, since FedAvg performs weighted averaging to form the global model at each round. Therefore, we represent the global-mode trajectory as:

\[\bm{\theta}_{i}^{\tau+1}=\bm{\theta}_{i}^{\tau}+2s^{\tau}\left(y^{\tau}-\bar{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}^{\tau};\bm{\theta}_{i}^{\tau}\right)+2{s}^{\tau}{\bm{n}}_{i}^{\tau}, \qquad (8)\]

where $\tau\in[Mp]$ for a total of $M$ rounds, and $\bm{n}_{i}^{\tau}=0$ when $\tau\bmod p\in[M]$. See the appendix for the detailed proof.
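For reference, one step of the noisy global-mode update (8) for the two-layer ReLU network can be simulated as below; the noise argument stacks $\bm{n}_{i}^{\tau}$ for all neurons (e.g., computed via (7)) and its layout is an assumption.

import numpy as np

def sgd_step(theta, x, y, step, noise):
    # One step of Eq. (8) for the two-layer ReLU network of Eq. (5).
    # theta: (N, d) neurons; x: (d,) sample; y: scalar label; noise: (N, d) stack of n_i^tau.
    pre_activation = theta @ x                               # <x, theta_i> for every neuron
    y_bar = np.maximum(pre_activation, 0.0).mean()           # network output, Eq. (5)
    grad = (pre_activation > 0).astype(float)[:, None] * x   # grad of relu(<x, theta_i>) w.r.t. theta_i
    return theta + 2 * step * (y - y_bar) * grad + 2 * step * noise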

We make the following assumptions as per [7]:

Assumption 1.

The step size is $s^{\tau}=\alpha\xi(\tau\varepsilon)$, where $\xi:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{>0}$ is bounded by $C_{1}$ and $C_{1}$-Lipschitz.

Assumption 2.

For $(\bm{x},y)\sim\mathbb{P}$, the label $y$ and the activation function $\sigma(\bm{x},\bm{\theta})$, whose gradient is $C_{2}$-sub-Gaussian, are bounded by $K_{2}$.

Assumption 3.

The functions $v(\bm{\theta})=-\mathbb{E}\{y\sigma(\bm{x},\bm{\theta})\}$ and $u(\bm{\theta}_{1},\bm{\theta}_{2})=\mathbb{E}\{\sigma(\bm{x},\bm{\theta}_{1})\sigma(\bm{x},\bm{\theta}_{2})\}$ are differentiable, and their gradients are bounded by $C_{3}$ and $C_{3}$-Lipschitz.

Assumption 4.

The initial condition $\bm{\theta}_{i}^{0}$ is $K^{2}_{4}/D$-sub-Gaussian. Let $v(\bm{\theta})\in C^{4}(\mathbb{R}^{D})$ and $u(\bm{\theta}_{1},\bm{\theta}_{2})\in C^{4}(\mathbb{R}^{D}\times\mathbb{R}^{D})$, and let $\nabla^{k}_{\bm{\theta}_{1}}u(\bm{\theta}_{1},\bm{\theta}_{2})$ be uniformly bounded for $0\leq k\leq 4$.

Assumption 5.

For $m\in[M]$ and $\tau\in[p]$, the noise $\bm{n}_{i}^{\tau}=\sqrt{\beta h(\alpha)}\,\bar{\bm{n}}_{i}^{\tau}$, where $\bar{\bm{n}}_{i}^{\tau}\sim\mathcal{N}\left(0,\bm{I}_{D}\right)$.

Assumptions 1-4 state the requirements of mean-field theory on the learning rate $s^{\tau}$, the data distribution $(\bm{x},y)\sim\mathbb{P}$, the activation function $\sigma$, and the initialization $\rho_{0}$. Assumption 5 indicates that the noise $\bm{n}_{i}$ in (7) depends on data heterogeneity. Specifically, a generalized function $h(\alpha)$ measures the impact of data heterogeneity on $\bm{n}_{i}$, including the gradient drift and output bias, where $\beta h(\alpha)\in[\beta,\infty)$ and $1/\alpha$ denotes the degree of data heterogeneity across clients (i.e., the smaller $\alpha$ is, the larger $h(\alpha)$ is).

We apply mean-field theory to quantify the difference between the loss along the global-mode trajectory and the loss of the DD solution, and obtain:

Lemma 1.

(Mean-field approximation.) Assume that Assumptions 1-5 hold. The solution $(\rho_{t})_{t\geq 0}$ of a PDE with initialization $\rho_{0}$ can approximate the update trajectory of the global mode $(\bm{\theta}_{i}^{\tau})$ in (8) with initialization $\bm{\theta}_{i}^{0}\sim\rho_{0}$ and data heterogeneity $h(\alpha)$ unchanged throughout training. When $\beta=2s^{\tau}\varepsilon/D$, $h(\alpha)\leq C_{5}$, $T\geq 1$ and $\varepsilon\leq 1$, there exists a constant $C$ (depending solely on the constants $C_{i}$ of Assumptions 1-4) such that

\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}(\rho_{\tau\varepsilon})\right|\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

with probability at least $1-e^{-z^{2}}$.

Lemma 1 shows that when the number of neurons $N$ is sufficiently large and the step-size scale $\varepsilon$ is sufficiently small, the global-mode neurons $\bm{\theta}_{i}^{\tau}$ obtained by running $\tau$ steps of (8) can be approximated by $N$ i.i.d. particles that evolve according to the DD at time $\tau\varepsilon$. We then utilize Lemma 1 to show that the global mode remains dropout-stable even under data heterogeneity.

Theorem 1.

(Dropout stability under data heterogeneity.) Assume that Assumptions 1-5 hold, and fix $T\geq 1$ and $h(\alpha)\leq C_{5}$. Let the global model $\bm{\theta}^{\tau}$ be obtained by running $\tau$ steps of (8) with data $\left\{\left(\bm{x}_{i},y_{i}\right)\right\}_{i=0}^{\tau}\sim\mathbb{P}$ and initialization $\rho_{0}$. Pick $\mathcal{A}\subseteq[N]$ independent of $\bm{\theta}^{\tau}$. Then, with probability at least $1-e^{-z^{2}}$, for all $\tau\in[T/\varepsilon]$, $\bm{\theta}^{\tau}$ is $\varepsilon_{\mathrm{D}}$-dropout stable with $\varepsilon_{\mathrm{D}}$ equal to

\[Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log|\mathcal{A}|}+z}{\sqrt{|\mathcal{A}|}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

where the constant $C$ depends only on the constants $C_{i}$ in Assumptions 1-4.

Following [20], we demonstrate that dropout-stable NNs in FL have mode connectivity as follows.

Theorem 2.

(Mode connectivity under data heterogeneity.) Under the setting of Theorem 1, fix $T\geq 1$, $T^{\prime}\geq 1$ and let $\bm{\theta}$ and $\bm{\theta}^{\prime}$ be the global models obtained by running $\tau$ and $\tau^{\prime}$ steps of (8) with data heterogeneity $h(\alpha)$ and $h(\alpha^{\prime})$ and initialization $\rho_{0}$ and $\rho_{0}^{\prime}$, respectively. Then, with probability at least $1-e^{-z^{2}}$, for all $\tau\in[T/\varepsilon]$ and $\tau^{\prime}\in\left[T^{\prime}/\varepsilon\right]$, $\bm{\theta}^{\tau}$ and ${\bm{\theta}^{\prime}}^{\tau^{\prime}}$ are $\varepsilon_{\mathrm{C}}$-connected with $\varepsilon_{\mathrm{C}}$ equal to

\[Ce^{CT_{\max}}\max(\sqrt{h(\alpha)},\sqrt{h(\alpha^{\prime})})\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}(\sqrt{D+\log N}+z)\right),\]

where $T_{\max}=\max\left(T,T^{\prime}\right)$. Furthermore, the path connecting $\bm{\theta}^{\tau}$ with ${\bm{\theta}^{\prime}}^{\tau^{\prime}}$ consists of 7 line segments.

Theorem 2 shows that global modes obtained from FL under different data heterogeneity can be connected, as shown in Figure 5, and that the mode-connectivity error depends on the data heterogeneity $h(\alpha)$ and the number of neurons $N$. See the appendix for the detailed proofs of Theorems 1 and 2.

VII Numerical Results

To verify our analytical results, we undertake two classification tasks: i) a two-layer NN on MNIST; ii) a VGG11 network trained from scratch on CIFAR-10. The NNs use ReLU activations for both tasks and are optimized with the cross-entropy loss. For the FL setup, we consider ten clients with 50/200 training rounds for MNIST/CIFAR-10. Client optimizers are SGD with a learning rate of 0.02/0.01 for MNIST/CIFAR-10, where the mini-batch size is 50 and the number of local iterations is 10. Within the same task, the NN initialization is consistent under varying data heterogeneity and independent of the neuron number $N$, which aligns with our theoretical assumptions. We perform 20 independent trials to report the mean and standard deviation in Figures 8 and 9.

Dropout stability under varying data heterogeneity.

Figure 8: Training/test dropout stability under varying data heterogeneity on MNIST with a two-layer NN (left) and CIFAR-10 with VGG11 (right).

Figure 8 compares the dropout error $\varepsilon_{\mathrm{D}}$ under varying data heterogeneity, as per Definition 3, on the training dataset (blue curve) and test dataset (orange curve). For MNIST, the neuron number of the two-layer NN is $N=1600$; for CIFAR-10, VGG11 follows the standard setup in [24]. As expected from Theorem 1, the dropout error shows a downward trend as client datasets become more homogeneous. For example, the case of $\alpha=0.1$ has a larger training/test dropout error than the i.i.d. case at the different training stages of both tasks. Under the same heterogeneity setting (e.g., $\alpha=0.1$), the effect of data heterogeneity on CIFAR-10 is stronger than that on MNIST, leading to a larger dropout error on CIFAR-10: $\varepsilon_{\mathrm{D}}<0.04$ for MNIST and $\varepsilon_{\mathrm{D}}<0.15$ for CIFAR-10. Moreover, we consider two training stages in Figure 8, the middle stage (dashed curve) and the ending stage (solid curve). For both tasks, the dropout-error gap between the two stages remains small, i.e., $\varepsilon_{\mathrm{D}}<0.01$ for MNIST and $\varepsilon_{\mathrm{D}}<0.05$ for CIFAR-10. As expected, this indicates that mean-field theory can describe the dropout dynamics as long as the training time is sufficient.

Mode connectivity under varying neuron numbers and data heterogeneity.

Figure 9: Dropout stability under varying neuron numbers (left) and linear-interpolation barrier under varying neuron numbers and data heterogeneity (right).

Figure 9 uses dropout stability (left sub-figure) and the linear-interpolation barrier (right sub-figure) to validate our analysis of the mode connectivity of global modes in FL. Theorem 2 shows that dropout stability implies mode connectivity, and that the connectivity error $\varepsilon_{\mathrm{C}}$ depends on the neuron number and the data heterogeneity. For MNIST under $\alpha=0.1$, we consider a two-layer NN and plot the relationship between its dropout error $\varepsilon_{\mathrm{D}}$ and the neuron number $N$. As expected, the training/test dropout error rapidly decreases as the width of the network grows: the dropout error for $N=6400$ is less than $0.01$, while for $N=200$ it is up to $0.06$. We then use the linear-interpolation barrier to explore the connectivity error $\varepsilon_{\mathrm{C}}$, connecting the global mode obtained in the case $\alpha=0.1$ to those of the cases $\alpha=0.2$ (solid curve) and $\alpha=0.5$ (dashed curve). As the network widens, the barrier (i.e., the connectivity error $\varepsilon_{\mathrm{C}}$) diminishes rapidly, which is consistent with the relationship between the dropout error $\varepsilon_{\mathrm{D}}$ and the neuron number $N$. Furthermore, the barrier from $\alpha=0.1$ to $\alpha=0.2$ is higher than that from $\alpha=0.1$ to $\alpha=0.5$ when $N=200$. This is expected because when $N$ is small, the connectivity error $\varepsilon_{\mathrm{C}}$ is amplified by the effect of data heterogeneity $h(\alpha)$, i.e., the effect of data heterogeneity amortized over fewer neurons is greater.

Neuron noise $\bm{n}_{i}$ under varying data heterogeneity.

Figure 10: Neuron noise statistics under data heterogeneity (left) and neuron noise along the whole training (right).

Figure 10 illustrates the L2 norm of the global-mode update noise in (7) under varying data heterogeneity across all iterations (right sub-figure), with the mean, maximal, and minimal noise norms reported in the left sub-figure. As shown in the left sub-figure, the noise norm decreases as data heterogeneity is alleviated. Meanwhile, the maximal noise norm is limited, i.e., below 0.012, but the noise norm does not decrease to zero even in the i.i.d. case due to the randomness of SGD. Moreover, according to the right sub-figure, the noise is not strongly correlated with the iterations when data heterogeneity is mild (i.e., $\alpha\geq 0.4$), and it slightly decreases along the training iterations. This supports Assumption 5, where the neuron noise arises from data heterogeneity and remains independent of the number of iterations.

VIII Discussion and Future Work

Through experimental and theoretical analysis, we unravel the relationship between client and global modes in FL under varying data heterogeneity. As data become more heterogeneous, finding solutions that optimize both the client and FL objectives becomes increasingly challenging due to the reduction of overlapping solutions between client and global modes. Meanwhile, the FL solutions obtained from different setups of data heterogeneity belong to distinct global modes, and are located in distinct basins of a joint region. These solutions can be connected by a specific PolyChain while showing an error barrier along the linear interpolation. This suggests that global modes in FL are not isolated but rather interconnected within a manifold. Our analysis demonstrates that the connectivity error between global modes decreases with weakened data heterogeneity and wider trained models.

The findings suggest potential directions for future work for the design of FL: i) exploring the solutions to the FL objective that maintain low error for both client and global test sets, i.e., the solutions to both client and FL objectives; ii) investigating the effect of model depth and architecture in FL based on mean-field theory; iii) understanding FL training dynamics by the gap between random and pre-trained initialization (see a preliminary investigation in Appendix).

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, vol. 54.   PMLR, 20–22 Apr 2017, pp. 1273–1282. [Online]. Available: https://proceedings.mlr.press/v54/mcmahan17a.html
  • [2] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [3] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [4] J. Shao, Z. Li, W. Sun, T. Zhou, Y. Sun, L. Liu, Z. Lin, and J. Zhang, “A survey of what to share in federated learning: Perspectives on model utility, privacy leakage, and communication efficiency,” arXiv preprint arXiv:2307.10655, 2023.
  • [5] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf
  • [6] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, “Essentially no barriers in neural network energy landscape,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80.   PMLR, 10–15 Jul 2018, pp. 1309–1318. [Online]. Available: https://proceedings.mlr.press/v80/draxler18a.html
  • [7] S. Mei, A. Montanari, and P.-M. Nguyen, “A mean field view of the landscape of two-layer neural networks,” Proceedings of the National Academy of Sciences, vol. 115, no. 33, pp. E7665–E7671, 2018. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.1806579115
  • [8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 429–450, 2020.
  • [9] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in Neural Information Processing Systems, vol. 33, pp. 7611–7623, 2020.
  • [10] M. Luo, F. Chen, D. Hu, Y. Zhang, J. Liang, and J. Feng, “No fear of heterogeneity: Classifier calibration for federated learning with non-iid data,” Advances in Neural Information Processing Systems, vol. 34, pp. 5972–5984, 2021.
  • [11] T. Zhou, J. Zhang, and D. Tsang, “Fedfa: Federated learning with feature anchors to align feature and classifier for heterogeneous data,” arXiv preprint arXiv:2211.09299, 2022.
  • [12] C. D. Freeman and J. Bruna, “Topology and geometry of half-rectified network optimization,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=Bk0FWVcgx
  • [13] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [14] Z. Li, H.-Y. Chen, H. W. Shen, and W.-L. Chao, “Understanding federated learning through loss landscape visualizations: A pilot study,” in Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
  • [15] T. Zhou, Z. Lin, J. Zhang, and D. H. Tsang, “Understanding model averaging in federated learning on heterogeneous data,” arXiv preprint arXiv:2305.07845, 2023.
  • [16] S. Mei, T. Misiakiewicz, and A. Montanari, “Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit,” in Proceedings of the Thirty-Second Conference on Learning Theory, ser. Proceedings of Machine Learning Research, vol. 99.   PMLR, 25–28 Jun 2019, pp. 2388–2464. [Online]. Available: https://proceedings.mlr.press/v99/mei19a.html
  • [17] G. M. Rotskoff and E. Vanden-Eijnden, “Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error,” Advances in neural information processing systems, p. 7146–7155, 2018.
  • [18] P.-M. Nguyen and H. T. Pham, “A rigorous framework for the mean field limit of multilayer neural networks,” arXiv preprint arXiv:2001.11443, 2020.
  • [19] A. Shevchenko and M. Mondelli, “Landscape connectivity and dropout stability of SGD solutions for over-parameterized neural networks,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119.   PMLR, 13–18 Jul 2020, pp. 8773–8784. [Online]. Available: https://proceedings.mlr.press/v119/shevchenko20a.html
  • [20] R. Kuditipudi, X. Wang, H. Lee, Y. Zhang, Z. Li, W. Hu, R. Ge, and S. Arora, “Explaining landscape connectivity of low-cost solutions for multilayer nets,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/46a4378f835dc8040c8057beb6a2da52-Paper.pdf
  • [21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [22] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [23] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier domain generalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5542–5550.
  • [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [25] M. Yurochkin, M. Agarwal, S. Ghosh, K. Greenewald, N. Hoang, and Y. Khazaeni, “Bayesian nonparametric federated learning of neural networks,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97.   Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 7252–7261. [Online]. Available: http://proceedings.mlr.press/v97/yurochkin19a.html
  • [26] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [27] R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur, “The role of permutation invariance in linear mode connectivity of neural networks,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=dNigytemkL

Appendix A Additional Experiments

A-A Additional Experiments in client-mode connectivity

Figure 11: Test loss of the interpolated model along two given paths. The test set is the global test set of FL, where VGG11 is trained in FL from scratch (left) and from a VGG11 pre-trained on ImageNet (right).

Testing the models on the global test set along the paths connecting client models. Figure 11 considers the same paths as Figure 3 (i.e., the linear path and the PolyChain path) to connect two client models, and reports the loss of the models along both paths on the global test set. The left and right sub-figures display the results of the global model trained from scratch and from a pre-trained model, respectively. When data are more heterogeneous, the loss gap between the two paths becomes greater; for example, the gap for $\alpha=0.1$ is much larger than that for $\alpha=0.5$. Both sub-figures indicate that models along the PolyChain path perform well on the global test set when $a$ falls within the range $[0.4,0.6]$. The right-hand sub-figure of Figure 3 also shows that the models in the range $a\in[0.4,0.6]$ perform well on the client test sets. Comparing the left and right sub-figures, we observe a significant reduction in the loss gap between the two paths when the pre-trained model is used. Therefore, there appear to be solutions to the FL objective (3) that maintain a low loss on both the client test sets and the global test set, particularly when using a pre-trained model as the initialization. It would be beneficial for future work to further explore how to find these solutions.

Figure 12: Client models’ optimization trajectory (shallow CNN with two convolutional layers) along the training rounds (left) and along the local iterations within one round (right).

Optimization trajectory of the shallow CNN. In Figure 12, we also visualize the optimization trajectory of client models to compare with the VGG11 trajectory illustrated in Figure 4. Similar to the observation in Figure 4, the solutions found by the client objectives vary under data heterogeneity even though they are close to each other in the parameter space. Additionally, during training with the shallow CNN, the distance between client solutions is still regulated but is greater than that of VGG11. This implies that data heterogeneity affects models of different sizes differently, which is in line with the results of Theorem 2 and Figure 14.

A-B Additional Experiments in global-mode connectivity

Connectivity landscape along the linear-interpolation path and the PolyChain path.

Figure 13: Linear-interpolation path (left) and PolyChain path (right) of global models.

We take the classification accuracy as the metric and plot the connectivity landscape along the linear-interpolation path and the PolyChain path in Figure 13. As a complement to Figure 5, Figure 13 displays the full accuracy fluctuations along both paths. As shown in the left sub-figure, when dealing with more heterogeneous data, such as the pair $\alpha=0.1,0.2$, both training and test barriers become higher compared with the pair $\alpha=0.1,0.5$. Meanwhile, since all global models are not far away from each other and are located in a common region, the maximal accuracy drop induced by the barriers is around 10%. The right sub-figure shows that the PolyChain with one bend easily connects the global models from $\alpha=0.1$ to $0.2$ and from $\alpha=0.1$ to $0.5$, and all models along the PolyChain exhibit no barrier.

Neuron noise $\bm{n}_{i}$ insight into global-mode connectivity.

Figure 14: Neuron noise under varying neuron numbers (left) and its visualization along all the local iterations (right).

We resort to the noise norm in (7) to better understand the relationship between the number of neurons $N$ and the dropout error $\varepsilon_{\mathrm{D}}$ and connectivity error $\varepsilon_{\mathrm{C}}$. Figure 14 depicts the L2 norm of the global-mode update noise under $\alpha=0.1$, displaying the noise norms for different neuron numbers and their visualization across all iterations, along with the mean, maximal, and minimal noise norms. The left sub-figure demonstrates that as the number of neurons increases, the noise norm decreases; when $N=6400$, the gap between the maximal and minimal noise norms becomes much smaller than that of $N=200$. As shown in the right sub-figure, the noise levels are clearly distinguishable across different numbers of neurons. These results reveal that wider models exhibit smaller noise in their FL updates, and hence have a lower dropout error $\varepsilon_{\mathrm{D}}$ and connectivity error $\varepsilon_{\mathrm{C}}$ in Figure 9.

Appendix B Proof

The approximation in Lemma 1 builds the connection from the nonlinear dynamics to the particle dynamics, to gradient descent, and finally to SGD under the data-heterogeneity noise. The approximation relies on the following model update:

\[\bm{\theta}_{i}^{\tau+1}=(1-2\lambda s^{\tau})\bm{\theta}_{i}^{\tau}+2s^{\tau}\left(y^{\tau}-\hat{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}^{\tau};\bm{\theta}_{i}^{\tau}\right)+\beta{\bm{n}}_{i}^{\tau}, \qquad (9)\]

where $\bm{\theta}_{i}$ denotes the $i$-th neuron parameter, $\lambda$ denotes the coefficient of the regularizer in the objective (i.e., $\lambda=0$ without regularizers), $s^{\tau}$ denotes the step size at the $\tau$-th iteration, $\beta=\sqrt{2{s}^{\tau}\varepsilon/D}$, and $\beta{\bm{n}}_{i}^{\tau}$ denotes the random noise at the $\tau$-th iteration. By Propositions 33, 35, 37 and 38 in [16], when Assumptions 1 to 5 hold, we have the following proposition:

Proposition 1.

There exists a constant $C$ (depending solely on the constants $C_{i}$ of Assumptions 1-5) such that, with probability at least $1-e^{-z^{2}}$, we have:

\[\sup_{t\in[0,T]}\left|\mathcal{L}_{N}\left(\overline{\bm{\theta}}^{t}\right)-\mathcal{L}\left(\rho_{t}\right)\right|\leq Ce^{CT}\frac{\sqrt{h(\alpha)}}{\sqrt{N}}\left[\sqrt{\log(NT)}+z\right],\]
\[\sup_{t\in[0,T]}\left|\mathcal{L}_{N}\left(\underline{\bm{\theta}}^{t}\right)-\mathcal{L}_{N}\left(\overline{\bm{\theta}}^{t}\right)\right|\leq Ce^{CT}\frac{1}{\sqrt{N}}\left[\sqrt{\log(NT)}+z\right],\]
\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}\left(\underline{\bm{\theta}}^{\tau\varepsilon}\right)-\mathcal{L}_{N}\left(\tilde{\bm{\theta}}^{\tau}\right)\right|\leq Ce^{CT}\left[\sqrt{\log(N(T/\varepsilon\vee 1))}+z\right]\sqrt{\varepsilon h(\alpha)},\]
\[\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}\left(\tilde{\bm{\theta}}^{\tau}\right)-\mathcal{L}_{N}\left(\bm{\theta}^{\tau}\right)\right|\leq Ce^{CT}\sqrt{T\varepsilon}\left[\sqrt{D+\log N}+z\right],\]

where $\overline{\bm{\theta}}^{t}$, $\underline{\bm{\theta}}^{t}$, $\tilde{\bm{\theta}}^{\tau}$, and ${\bm{\theta}}^{\tau}$ denote the solutions of the nonlinear dynamics, the particle dynamics, gradient descent, and SGD, respectively.

In Proposition 1, the main distinction from its counterpart in [16] is that it accounts for the impact of data heterogeneity on the model update, i.e., $\sqrt{h(\alpha)}$ is not absorbed into the constant $C$. The proof of Proposition 1 parallels the proof of its counterpart in [16]; see [16] for the details.

Lemma 2.

(From Lemma 31 in [16]). There exists a constant $C$ such that, with probability at least $1-e^{-z^{2}}$,

\sup_{i\leq N}\;\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\;\sup_{u\in[0,\varepsilon]}\left\|\overline{\bm{\theta}}_{i}^{\tau\varepsilon+u}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq Ce^{CT}\left[\sqrt{\log\left(N(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon},

where $\overline{\bm{\theta}}_{i}^{\tau\varepsilon+u}$ and $\overline{\bm{\theta}}_{i}^{\tau\varepsilon}$ denote the solutions of the PDE in Lemma 1 at time $\tau\varepsilon+u$ and $\tau\varepsilon$, respectively.

B-A Proof of Theorem 1

Proof.

In the following proof, we use $C$ to denote a generic constant that depends solely on the constants $C_{i}$ of Assumptions 1-4. According to Definition 3, we have:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)\right| \qquad (10)
\leq\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta}_{\mathrm{S}}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta}_{\mathrm{S}}^{\tau})-\mathcal{L}_{|\mathcal{A}|}(\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})\right|+\left|\mathcal{L}_{|\mathcal{A}|}(\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|,

where $\bar{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}$ denotes the dropout model containing the first $|\mathcal{A}|$ non-zero elements (i.e., the non-dropout elements) of $\bar{\bm{\theta}}^{\tau\varepsilon}$. Here, $\bar{\bm{\theta}}^{\tau\varepsilon}$ is the PDE solution of the distributional dynamics at time $\tau\varepsilon$.

For the first term, we invoke Lemma 1 and directly obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right). \qquad (11)

For the second term, we upper-bound it by the maximal output gap between the $i$-th neuron of $\bm{\theta}_{\mathrm{S}}^{\tau}$ and the $j$-th neuron of $\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}$, which can be written as:

\left|\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)-\mathcal{L}_{|\mathcal{A}|}(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon})\right| \qquad (12)
\leq 2\max_{i\in\mathcal{A}}\left|\mathbb{E}\left\{y\,\sigma\left(\bm{x},\bm{\theta}_{i}^{\tau}\right)\right\}-\mathbb{E}\left\{y\,\sigma(\bm{x},\overline{\bm{\theta}}_{i}^{\tau\varepsilon})\right\}\right|
+\max_{i,j\in\mathcal{A}}\left|\mathbb{E}\left\{\sigma\left(\bm{x},\bm{\theta}_{i}^{\tau}\right)\sigma\left(\bm{x},\bm{\theta}_{j}^{\tau}\right)\right\}-\mathbb{E}\left\{\sigma(\bm{x},\overline{\bm{\theta}}_{i}^{\tau\varepsilon})\,\sigma(\bm{x},\overline{\bm{\theta}}_{j}^{\tau\varepsilon})\right\}\right|
\leq 4C_{2}\max_{i\in[|\mathcal{A}|]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq 4C_{2}\max_{i\in[N]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2},

where the second inequality follows from the fact that $y$, $\sigma(\cdot)$, and the gradient of $\sigma(\cdot)$ are bounded.

Furthermore, summing all the inequalities in Proposition 1, we obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\max_{i\in[N]}\left\|\bm{\theta}_{i}^{\tau}-\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\right\|_{2}
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right), \qquad (13)

with probability at least $1-e^{-z^{2}}$. Consequently, with probability at least $1-e^{-z^{2}}$, we have:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)-\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log N}+z}{\sqrt{N}}+\sqrt{\varepsilon}\left(\sqrt{D+\log N}+z\right)\right). \qquad (14)

For the third term, we apply the triangle inequality and obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right| \qquad (15)
\leq\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}\right|
+\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|,

where $\mathbb{E}_{\rho_{0}}\{\cdot\}$ takes the expectation over $\bm{\theta}_{i}^{0}\sim\rho_{0}$. For the first term of (15), since we consider the mean-square loss in this work, we have:

\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}\right)=\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}}\left\{\left(y-\frac{1}{|\mathcal{A}|}\sum_{i=1}^{|\mathcal{A}|}\sigma\left(\bm{x},\bm{\theta}_{i}\right)\right)^{2}\right\}, \qquad (16)

where $\mathbb{E}_{(\bm{x},y)}$ takes the expectation over $(\bm{x},y)\sim\mathbb{P}$. According to Lemma 1, we have $\{\overline{\bm{\theta}}_{i}^{\tau\varepsilon}\}_{i=1}^{|\mathcal{A}|}\sim\rho_{\tau\varepsilon}$ under the PDE approximation, and we rewrite the second term of (15) as:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)\right\}-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right| \qquad (17)
=\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\frac{1}{|\mathcal{A}|}\left|\int\mathbb{E}_{(\bm{x},y)}\left\{\left(\sigma(\bm{x},\bm{\theta})\right)^{2}\right\}\rho_{\tau\varepsilon}(\mathrm{d}\bm{\theta})-\int\mathbb{E}_{(\bm{x},y)}\left\{\sigma\left(\bm{x},\bm{\theta}_{1}\right)\sigma\left(\bm{x},\bm{\theta}_{2}\right)\right\}\rho_{\tau\varepsilon}\left(\mathrm{d}\bm{\theta}_{1}\right)\rho_{\tau\varepsilon}\left(\mathrm{d}\bm{\theta}_{2}\right)\right|
\leq\frac{C}{|\mathcal{A}|},

where the inequality holds because the activation function $\sigma(\cdot)$ is bounded by $C_{2}$ according to Assumption 2. That is, denoting by $\bm{\theta}$ and $\bm{\theta}^{\prime}$ two parameter vectors that differ only in the $i$-th component, i.e., $\bm{\theta}=\left(\bm{\theta}_{1},\ldots,\bm{\theta}_{i},\ldots,\bm{\theta}_{|\mathcal{A}|}\right)$ and $\bm{\theta}^{\prime}=\left(\bm{\theta}_{1},\ldots,\bm{\theta}_{i}^{\prime},\ldots,\bm{\theta}_{|\mathcal{A}|}\right)$, we have $\left|\mathcal{L}_{|\mathcal{A}|}(\bm{\theta})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}^{\prime}\right)\right|\leq{C}/{|\mathcal{A}|}$.

With McDiarmid’s inequality, we have:

\mathbb{P}\left(\left|\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|>\delta\right)\leq\exp\left(-\frac{|\mathcal{A}|\delta^{2}}{C}\right). \qquad (18)
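To spell out this step (our own reconstruction of the standard argument): the neurons $\{\bar{\bm{\theta}}_{i}^{t}\}_{i\in\mathcal{A}}$ are functions of the independent initializations $\{\bm{\theta}_{i}^{0}\}$, and by the bound above changing a single one changes $\mathcal{L}_{|\mathcal{A}|}$ by at most $C/|\mathcal{A}|$, so the bounded-difference form of McDiarmid's inequality gives

\mathbb{P}\left(\left|\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\bar{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|>\delta\right)
\leq 2\exp\left(-\frac{2\delta^{2}}{\sum_{i=1}^{|\mathcal{A}|}\left(C/|\mathcal{A}|\right)^{2}}\right)
=2\exp\left(-\frac{2|\mathcal{A}|\delta^{2}}{C^{2}}\right),

which matches (18) after absorbing the leading factor $2$ and the constant $C^{2}$ into the generic constant $C$.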

Furthermore, we have the following increment bound for times $t,h\geq 0$:

\Big|\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)\right\}\right|-\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|\Big| \qquad (19)
\leq\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)-\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right|+\left|\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t+h}\right)\right\}-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|
\leq C\left[\sup_{i\in[N]}\left\|\overline{\bm{\theta}}_{i}^{t+h}-\overline{\bm{\theta}}_{i}^{t}\right\|_{2}+\mathbb{E}\left[\left\|\overline{\bm{\theta}}_{j}^{t+h}-\overline{\bm{\theta}}_{j}^{t}\right\|_{2}\right]\right].

Using Lemma 2, we get:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\sup_{u\in[0,\varepsilon]}\Big|\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon+u}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon+u}\right)\right\}\right|-\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{\tau\varepsilon}\right)\right\}\right|\Big| \qquad (20)
\leq Ce^{CT}\left[\sqrt{\log\left(|\mathcal{A}|(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon h(\alpha)},

with probability at least $1-e^{-z^{2}}$.

Here, we take a union bound over $s\in\varepsilon\{0,1,\ldots,\lfloor T/\varepsilon\rfloor\}$ and bound the variation inside the grid intervals. Then, with McDiarmid's inequality, we get:

\mathbb{P}\left(\sup_{t\in[0,T]}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)-\mathbb{E}_{\rho_{0}}\left\{\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\rm S}^{t}\right)\right\}\right|\geq\delta+Ce^{CT}\left[\sqrt{\log\left(|\mathcal{A}|(T/\varepsilon\vee 1)\right)}+z\right]\sqrt{\varepsilon h(\alpha)}\right) \qquad (21)
\leq(T/\varepsilon)\exp\left\{-|\mathcal{A}|\delta^{2}/C\right\}+e^{-z^{2}}.

We take $\varepsilon=1/|\mathcal{A}|$ and $\delta=C[\sqrt{\log(|\mathcal{A}|T)}+z]/\sqrt{|\mathcal{A}|}$ in (21), combine the result with (17), and get:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{|\mathcal{A}|}\left(\overline{\bm{\theta}}_{\mathrm{S}}^{\tau\varepsilon}\right)-\mathcal{L}\left(\rho_{\tau\varepsilon}\right)\right|
\leq Ce^{CT}\frac{\left(\sqrt{\log|\mathcal{A}|}+z\right)\sqrt{h(\alpha)}}{\sqrt{|\mathcal{A}|}}, \qquad (22)

with probability at least $1-e^{-z^{2}}$.
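For the reader's convenience, here is how we read this substitution (constants are repeatedly absorbed into $C$, and we assume $C\geq 1$ and $|\mathcal{A}|T\geq 1$): with $\delta=C[\sqrt{\log(|\mathcal{A}|T)}+z]/\sqrt{|\mathcal{A}|}$, we have $|\mathcal{A}|\delta^{2}/C=C(\sqrt{\log(|\mathcal{A}|T)}+z)^{2}\geq C(\log(|\mathcal{A}|T)+z^{2})$, so the failure probability in (21) with $\varepsilon=1/|\mathcal{A}|$ satisfies

(T/\varepsilon)\exp\left\{-|\mathcal{A}|\delta^{2}/C\right\}+e^{-z^{2}}
\leq|\mathcal{A}|T\,(|\mathcal{A}|T)^{-C}e^{-Cz^{2}}+e^{-z^{2}}
\leq 2e^{-z^{2}},

and the leading factor $2$ can be removed by a rescaling of $z$ (or by enlarging $C$). The deviation level $\delta+Ce^{CT}[\sqrt{\log(|\mathcal{A}|(T/\varepsilon\vee 1))}+z]\sqrt{\varepsilon h(\alpha)}$ then collapses, up to logarithmic factors in $T$ and under the convention $h(\alpha)\geq 1$ noted earlier, to the right-hand side of (22).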

We combine (11), (14) and (22) and obtain:

\sup_{\tau\in[0,T/\varepsilon]\cap\mathbb{N}}\left|\mathcal{L}_{N}(\bm{\theta}^{\tau})-\mathcal{L}_{|\mathcal{A}|}\left(\bm{\theta}_{\mathrm{S}}^{\tau}\right)\right|
\leq Ce^{CT}\sqrt{h(\alpha)}\left(\frac{\sqrt{\log|\mathcal{A}|}+z}{\sqrt{|\mathcal{A}|}}+\sqrt{\alpha}\left(\sqrt{D+\log N}+z\right)\right), \qquad (23)

with probability at least $1-e^{-z^{2}}$. Therefore, the global model $\bm{\theta}$ obtained from FL under data heterogeneity $h(\alpha)$ is $\varepsilon_{D}$-dropout stable with $\varepsilon_{D}$ equal to $Ce^{CT}\sqrt{h(\alpha)}\left[(\sqrt{\log|\mathcal{A}|}+z)/\sqrt{|\mathcal{A}|}+\sqrt{\alpha}(\sqrt{D+\log N}+z)\right]$.

B-B Proof of Theorem 2

The proof of Theorem 2 is obtained by combining Theorem 1 with the following lemma, which is based on [20, 19].

Lemma 3.

(Dropout networks obtained from FL under varying data heterogeneity are connected). Consider a two-layer neural network with $N$ neurons, as in (5). Given $\mathcal{A}=[N/2]$, let $\bm{\theta}$ and $\bm{\theta}^{\prime}$ be the models trained under data heterogeneity $h(\alpha)$ and $h(\alpha^{\prime})$, respectively, both of which are dropout stable as in Definition 3. Then, $\bm{\theta}$ and $\bm{\theta}^{\prime}$ are $\varepsilon$-connected as in Definition 1. Furthermore, the path connecting $\bm{\theta}$ with $\bm{\theta}^{\prime}$ consists of 7 line segments.

Proof.

To account for the effect of dropout stability on the given models, we represent the model parameters as:

\bm{\theta}=\left((a_{1},\bm{\theta}_{1}),(a_{2},\bm{\theta}_{2}),\ldots,(a_{N},\bm{\theta}_{N})\right),
\bm{\theta}^{\prime}=\left((a_{1}^{\prime},\bm{\theta}_{1}^{\prime}),(a_{2}^{\prime},\bm{\theta}_{2}^{\prime}),\ldots,(a_{N}^{\prime},\bm{\theta}_{N}^{\prime})\right),

where $a_{i}$, $\forall i\in[N]$, is the rescaling factor of dropout. That is, $a_{i}=0$ if the $i$-th component is dropped out (i.e., $i\notin\mathcal{A}$), and $a_{i}=2$ if it is not (i.e., $i\in\mathcal{A}$).

We consider a piecewise linear path in parameter space that connects $\bm{\theta}$ to $\bm{\theta}^{\prime}$ as one mode-connectivity path for over-parameterized neural networks. Namely, when $N$ is even, we can connect $\bm{\theta}$ to $\bm{\theta}^{\prime}$ through the following sequence of points:

\bm{\theta}=\left((a_{1},\bm{\theta}_{1}),\ldots,(a_{N/2},\bm{\theta}_{N/2}),\ldots,(a_{N},\bm{\theta}_{N})\right),
\bm{\theta}_{1}=\left((2,\bm{\theta}_{1}),\ldots,(2,\bm{\theta}_{N/2}),(0,\bm{\theta}_{N/2+1}),\ldots,(0,\bm{\theta}_{N})\right),
\bm{\theta}_{2}=\left((2,\bm{\theta}_{1}),\ldots,(2,\bm{\theta}_{N/2}),(0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{3}=\left((0,\bm{\theta}_{1}),\ldots,(0,\bm{\theta}_{N/2}),(2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{4}=\left((0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime}),(2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{5}=\left((2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime}),(0,\bm{\theta}_{1}^{\prime}),\ldots,(0,\bm{\theta}_{N/2}^{\prime})\right),
\bm{\theta}_{6}=\left((2,\bm{\theta}_{1}^{\prime}),\ldots,(2,\bm{\theta}_{N/2}^{\prime}),(0,\bm{\theta}_{N/2+1}^{\prime}),\ldots,(0,\bm{\theta}_{N}^{\prime})\right),
\bm{\theta}^{\prime}=\left((a_{1}^{\prime},\bm{\theta}_{1}^{\prime}),\ldots,(a_{N/2}^{\prime},\bm{\theta}_{N/2}^{\prime}),\ldots,(a_{N}^{\prime},\bm{\theta}_{N}^{\prime})\right).

First, for the path from $\bm{\theta}$ to $\bm{\theta}_{1}$: since $\bm{\theta}$ is $\varepsilon_{D}$-dropout stable, we have $\mathcal{L}_{N}\left(\bm{\theta}_{1}\right)\leq\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D}$. Similarly, the loss along the path that connects $\bm{\theta}_{6}$ to $\bm{\theta}^{\prime}$ is upper bounded via $\mathcal{L}_{N}\left(\bm{\theta}_{6}\right)\leq\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)+\varepsilon_{D}^{\prime}$.

Then, for the path from $\bm{\theta}_{1}$ to $\bm{\theta}_{2}$, only the components $\bm{\theta}_{i}$ with the corresponding $a_{i}=0$ are changed to $\bm{\theta}^{\prime}_{i}$. Thus, the loss along this path does not change, i.e., $\mathcal{L}_{N}(\bm{\theta}_{1})=\mathcal{L}_{N}(\bm{\theta}_{2})$. Similarly, the loss does not change along the path from $\bm{\theta}_{3}$ to $\bm{\theta}_{4}$ and the path from $\bm{\theta}_{5}$ to $\bm{\theta}_{6}$: $\mathcal{L}_{N}(\bm{\theta}_{3})=\mathcal{L}_{N}(\bm{\theta}_{4})$ and $\mathcal{L}_{N}(\bm{\theta}_{5})=\mathcal{L}_{N}(\bm{\theta}_{6})$.

Next, for the path from $\bm{\theta}_{4}$ to $\bm{\theta}_{5}$, the two subnetworks are equal, so the loss of the interpolated models along this path does not change, i.e., $\mathcal{L}_{N}(\bm{\theta}_{4})=\mathcal{L}_{N}(\bm{\theta}_{5})$.

Finally, based on the above analysis, we have $\mathcal{L}_{N}\left(\bm{\theta}_{2}\right)\leq\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D}$ and $\mathcal{L}_{N}\left(\bm{\theta}_{3}\right)\leq\mathcal{L}_{N}(\bm{\theta}^{\prime})+\varepsilon_{D}^{\prime}$. For the path from $\bm{\theta}_{2}$ to $\bm{\theta}_{3}$, since the loss is convex in the weights of the last layer, the loss along this path is upper bounded by $\max\left(\mathcal{L}_{N}(\bm{\theta}_{2}),\mathcal{L}_{N}\left(\bm{\theta}_{3}\right)\right)\leq\max\left(\mathcal{L}_{N}(\bm{\theta})+\varepsilon_{D},\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)+\varepsilon_{D}^{\prime}\right)$.
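To make the convexity step explicit (a short reconstruction under the mean-square loss used in this work): along the segment $\bm{\theta}(t)=(1-t)\bm{\theta}_{2}+t\bm{\theta}_{3}$, $t\in[0,1]$, only the rescaling factors $a_{i}$ change while the neuron weights stay fixed, so the network output is affine in $t$ and the loss is convex in $t$; hence

\mathcal{L}_{N}\left(\bm{\theta}(t)\right)\leq(1-t)\,\mathcal{L}_{N}(\bm{\theta}_{2})+t\,\mathcal{L}_{N}(\bm{\theta}_{3})\leq\max\left(\mathcal{L}_{N}(\bm{\theta}_{2}),\mathcal{L}_{N}(\bm{\theta}_{3})\right),\quad t\in[0,1].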

When $N$ is odd, a similar analysis to the even-$N$ case applies. The differences are that (i) the $\lceil N/2\rceil$-th parameter of $\bm{\theta}_{1},\bm{\theta}_{2}$ and $\bm{\theta}_{3}$ is $\left(0,\bm{\theta}_{\lceil N/2\rceil}\right)$ and the $\lceil N/2\rceil$-th parameter of $\bm{\theta}_{4},\bm{\theta}_{5}$ and $\bm{\theta}_{6}$ is $\left(0,\bm{\theta}_{\lceil N/2\rceil}^{\prime}\right)$, and (ii) the constant rescaling factor $a_{i}=2$ is replaced by $N/\lfloor N/2\rfloor$.

Therefore, the loss along the above path between $\bm{\theta}$ and $\bm{\theta}^{\prime}$ is upper bounded by $\max\left(\mathcal{L}_{N}(\bm{\theta}),\mathcal{L}_{N}\left(\bm{\theta}^{\prime}\right)\right)+\varepsilon_{C}$ with $\varepsilon_{C}$ equal to $\max(\varepsilon_{D},\varepsilon_{D}^{\prime})$.
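A minimal numerical sketch of this 7-segment construction (Python/NumPy; the toy $\tanh$ neuron, the helper names, and the evaluation grid are our own assumptions and only illustrate how the intermediate points $\bm{\theta}_{1},\ldots,\bm{\theta}_{6}$ are formed and how the loss along the path could be checked for even $N$):

import numpy as np

def forward(params, x):
    """Two-layer output: mean over neurons of a_i * tanh(w_i @ x) (toy activation)."""
    return np.mean([a * np.tanh(w @ x) for a, w in params])

def mse(params, data):
    """Mean-square loss over a list of (x, y) samples."""
    return np.mean([(y - forward(params, x)) ** 2 for x, y in data])

def seven_segment_path(theta, theta_p):
    """Intermediate points theta_1..theta_6 of the piecewise-linear path (N even)."""
    N = len(theta)
    half = N // 2
    keep = lambda ps: [(2.0, w) for _, w in ps[:half]]   # rescaled active half
    drop = lambda ps: [(0.0, w) for _, w in ps]           # zeroed-out components
    t1 = keep(theta) + drop(theta[half:])
    t2 = keep(theta) + drop(theta_p[:half])
    t3 = drop(theta[:half]) + keep(theta_p)
    t4 = drop(theta_p[:half]) + keep(theta_p)
    t5 = keep(theta_p) + drop(theta_p[:half])
    t6 = keep(theta_p) + drop(theta_p[half:])
    return [theta, t1, t2, t3, t4, t5, t6, theta_p]

def max_loss_on_path(theta, theta_p, data, steps=10):
    """Maximum loss along the linear interpolation of consecutive path points."""
    pts = seven_segment_path(theta, theta_p)
    worst = 0.0
    for p, q in zip(pts[:-1], pts[1:]):
        for t in np.linspace(0.0, 1.0, steps):
            interp = [((1 - t) * ap + t * aq, (1 - t) * wp + t * wq)
                      for (ap, wp), (aq, wq) in zip(p, q)]
            worst = max(worst, mse(interp, data))
    return worst

In this sketch the rescaling factor 2 is hard-coded for even $N$; the odd-$N$ variant would replace it by $N/\lfloor N/2\rfloor$ as described above.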

B-C Proof of global-mode update (6)

Proof.

Here, we denote the model of client $k$ as $\bm{\theta}_{\cdot,k}$ and its $i$-th neuron as $\bm{\theta}_{i,k}$. Then, we represent the weighted average of all client models as $\bar{\bm{\theta}}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{\cdot,k}^{m,\tau}$ and its $i$-th neuron as $\bar{\bm{\theta}}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,\tau}$. Meanwhile, for a given data sample $(\bm{x}_{k}^{m,\tau},y_{k}^{m,\tau})$, we write the model outputs as $\bar{y}_{k}^{m,\tau}=f_{\bar{\bm{\theta}}^{m,\tau}}(\bm{x}_{k}^{m,\tau})$ and $\hat{y}_{k}^{m,\tau}=f_{{\bm{\theta}}_{\cdot,k}^{m,\tau}}(\bm{x}_{k}^{m,\tau})$, and the gradient deviation as $\Delta\bm{g}_{i,k}^{m,\tau}=\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})$.

With these notations, we formulate the update of the global model at the $m$-th round as:

\bm{\theta}_{i}^{m+1,0}-\bm{\theta}_{i}^{m,0} \qquad (24)
= 2\sum_{k=1}^{K}\frac{n_{k}}{n}\sum_{\tau=0}^{p}s^{m,\tau}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left[\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\left(\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right)\right]
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left[\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\Delta\bm{g}_{i,k}^{m,\tau}\right]
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(y_{k}^{m,\tau}-\bar{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right]+2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}(y_{k}^{m,\tau}-\bar{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2\sum_{\tau=0}^{p}{s}^{m,\tau}\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}\right] \qquad (25)
= 2\sum_{\tau=0}^{p}{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+\sum_{\tau=0}^{p}2{s}^{m,\tau}{\bm{n}}_{i}^{m,\tau},

where the last equality holds because of the assumption of one-pass data processing and the fact that all client samples belong to the global distribution $\mathbb{P}$, i.e., $(\bm{x}_{k},y_{k})\sim\mathbb{P}$. Here, we represent the noise as:

\bm{n}_{i}^{m,\tau}=\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\Delta\bm{g}_{i,k}^{m,\tau}\right] \qquad (26)
=\sum_{k=1}^{K}\frac{n_{k}}{n}\left[(\bar{y}_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+(y_{k}^{m,\tau}-\hat{y}_{k}^{m,\tau})\left(\nabla_{\bm{\theta}_{i,k}}\sigma(\bm{x}_{k}^{m,\tau};\bm{\theta}_{i,k}^{m,\tau})-\nabla_{\bm{\theta}_{i}}\sigma(\bm{x}_{k}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})\right)\right].

Therefore, we obtain (6), i.e.,

\bm{\theta}_{i}^{m+1,\tau+1}=\bm{\theta}_{i}^{m,\tau}+2{s}^{m,\tau}(y^{m,\tau}-\bar{y}^{m,\tau})\nabla_{\bar{\bm{\theta}}_{i}}\sigma(\bm{x}^{m,\tau};\bar{\bm{\theta}}_{i}^{m,\tau})+2{s}^{m,\tau}\bm{n}_{i}^{m,\tau}. \qquad (27)

Furthermore, according to Definition 2, $\bm{n}_{i}^{m,\tau}$ can be regarded as the noise induced by data heterogeneity in FL. Since $\bm{\theta}_{i}^{m,p-1}=\bar{\bm{\theta}}_{i}^{m,p-1}=\sum_{k=1}^{K}\frac{n_{k}}{n}\bm{\theta}_{i,k}^{m,p-1}$, we apply Assumption 5 and simplify (6), with $\tau\in[0,Mp-1]$ for a given total number of training rounds $M$, as:

\bm{\theta}_{i}^{\tau+1}=\bm{\theta}_{i}^{\tau}+2s_{\tau}\left(y^{\tau}-\bar{y}^{\tau}\right)\nabla_{\bm{\theta}_{i}}\sigma\left(\bm{x}_{\tau};\bm{\theta}_{i}^{\tau}\right)+2s_{\tau}{\bm{n}}_{i}^{\tau}. \qquad (28)

Therefore, we obtain (8).
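As a numerical companion to (26) and (28) (a sketch under our own assumptions: a toy two-layer model with neuron output $a\tanh(\bm{w}^{\top}\bm{x})$ and hypothetical data structures, not the authors' implementation), the heterogeneity noise of neuron $i$ at one step and the corresponding noisy update can be assembled as follows:

import numpy as np

def neuron_grad(a, w, x):
    """Gradient of the neuron output a*tanh(w @ x) w.r.t. (a, w) (toy activation)."""
    z = np.tanh(w @ x)
    return z, a * (1.0 - z ** 2) * x

def forward(params, x):
    """Two-layer output: average of the neuron outputs a_j * tanh(w_j @ x)."""
    return np.mean([a * np.tanh(w @ x) for a, w in params])

def heterogeneity_noise(i, client_params, weights, client_batches):
    """Noise n_i of Eq. (26) for neuron i, given K client models.

    client_params[k]: client k's model as a list of (a, w) tuples;
    weights[k] = n_k / n; client_batches[k] = (x_k, y_k) at the current step.
    """
    K, N = len(client_params), len(client_params[0])
    # Weighted average of the client models (the "bar" model).
    avg = [(sum(weights[k] * client_params[k][j][0] for k in range(K)),
            sum(weights[k] * client_params[k][j][1] for k in range(K)))
           for j in range(N)]
    a_bar, w_bar = avg[i]
    noise_a, noise_w = 0.0, np.zeros_like(w_bar)
    for k in range(K):
        x, y = client_batches[k]
        y_hat = forward(client_params[k], x)   # client-model prediction
        y_bar = forward(avg, x)                # averaged-model prediction
        ga_bar, gw_bar = neuron_grad(a_bar, w_bar, x)
        ga_k, gw_k = neuron_grad(*client_params[k][i], x)
        noise_a += weights[k] * ((y_bar - y_hat) * ga_bar + (y - y_hat) * (ga_k - ga_bar))
        noise_w += weights[k] * ((y_bar - y_hat) * gw_bar + (y - y_hat) * (gw_k - gw_bar))
    return (a_bar, w_bar), (noise_a, noise_w)

def noisy_update(theta_bar_i, noise_i, x, y, y_bar, s):
    """One step of the simplified update (28) for neuron i."""
    (a, w), (na, nw) = theta_bar_i, noise_i
    ga, gw = neuron_grad(a, w, x)
    return (a + 2 * s * (y - y_bar) * ga + 2 * s * na,
            w + 2 * s * (y - y_bar) * gw + 2 * s * nw)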