Hierarchical Federated Learning in Wireless Networks: Pruning Tackles Bandwidth
Scarcity and System Heterogeneity
Abstract
A practical wireless network has many tiers in which end users do not communicate directly with the central server; moreover, the users’ devices have limited computation and battery power, and the serving base station (BS) has a fixed bandwidth. Motivated by these practical constraints and this system model, this paper leverages model pruning and proposes pruning-enabled hierarchical federated learning (PHFL) for heterogeneous networks (HetNets). We first derive an upper bound on the convergence rate that clearly demonstrates the impact of model pruning and of the wireless communication between the clients and their associated BS. We then jointly optimize the model pruning ratio, central processing unit (CPU) frequency and transmission power of the clients in order to minimize the controllable terms of the convergence bound under strict delay and energy constraints. Since the original problem is not convex, we apply successive convex approximation (SCA) and jointly optimize the parameters of the relaxed convex problem. Through extensive simulation, we validate the effectiveness of the proposed PHFL algorithm in terms of test accuracy, wall-clock time, energy consumption and bandwidth requirement.
Index Terms:
Heterogeneous network, hierarchical federated learning, model pruning, resource management.

I Introduction
Federated learning (FL) has garnered significant attention as a privacy-preserving distributed edge learning solution in wireless edge networks [1, 2, 3]. Since the original FL follows the parameter server paradigm, many state-of-the-art works consider a single server with distributed clients as the general system model in order to study the analytical and empirical performance [2, 3, 4, 5, 6]. Given that there are $U$ clients, each with a local dataset $\mathcal{D}_u = \{\mathbf{x}_u^d, y_u^d\}_{d=1}^{D_u}$, where $\mathbf{x}_u^d$ and $y_u^d$ are the feature vector and the corresponding label, the central server wants to train a global machine learning (ML) model $\mathbf{w}$ by minimizing a weighted combination of the clients’ local objective functions $f_u(\mathbf{w})$’s, as follows:

$$\min_{\mathbf{w}} \; f(\mathbf{w}) := \sum\nolimits_{u=1}^{U} \alpha_u f_u(\mathbf{w}), \qquad (1)$$
$$f_u(\mathbf{w}) := \frac{1}{D_u} \sum\nolimits_{d=1}^{D_u} l\big(\mathbf{w}; \mathbf{x}_u^d, y_u^d\big), \qquad (2)$$

where $\alpha_u$ is the corresponding weight, and $l(\cdot)$ denotes the loss function associated with each data sample. However, general networks usually follow a hierarchical structure [7], where the clients are connected to edge servers, the edge servers are connected to fog nodes/servers, and these fog nodes/servers are connected to the cloud server [8]. Naturally, some recent works [9, 10, 11, 12, 13, 14] have extended FL to accommodate this hierarchical network topology.
A client (the terms client and UE are used interchangeably when there is no ambiguity) does not directly communicate with the central server in the hierarchical network topology. Instead, the clients usually perform multiple local rounds of model training before sending the updated models to the edge server. The edge server aggregates the received models and updates its edge model, and then broadcasts the updated model to the associated clients for local training. The edge servers repeat this for multiple edge rounds and finally send the updated edge models to the upper-tier servers, which then undergo the same process before finally sending the updated models to the cloud/central server. This is usually known as hierarchical federated learning (HFL) [11]. On the one hand, HFL acknowledges the practical wireless heterogeneous network (HetNet) architecture. On the other hand, it avoids costly direct communication between the far-away cloud server and the capacity-limited clients [14]. Moreover, since local averaging improves learning accuracy [7], the central server ends up with a better-trained model.
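To make the multi-tier aggregation concrete, the following minimal sketch collapses the hierarchy to two aggregation tiers (edge servers and the cloud) and uses uniform weighting; the function names, the toy quadratic losses and the round counts are illustrative assumptions rather than the paper's algorithm, which is detailed in Section II-B.

```python
import numpy as np

def local_sgd(model, grad_fn, steps, lr=0.05):
    # Plain SGD on one client; grad_fn returns a (stochastic) gradient of the local loss.
    for _ in range(steps):
        model = model - lr * grad_fn(model)
    return model

def average(models):
    # Uniformly weighted aggregation (FedAvg-style); non-uniform weights are easily added.
    return np.mean(np.stack(models), axis=0)

def hfl_global_round(cloud_model, hierarchy, grad_fns, local_steps=5, edge_rounds=2):
    """One global round: clients train locally, each edge server aggregates its clients
    for several edge rounds, and the cloud finally averages the edge models."""
    edge_models = []
    for client_ids in hierarchy.values():
        edge_model = cloud_model.copy()
        for _ in range(edge_rounds):
            client_models = [local_sgd(edge_model.copy(), grad_fns[c], local_steps)
                             for c in client_ids]
            edge_model = average(client_models)
        edge_models.append(edge_model)
    return average(edge_models)

# Toy usage: 2 edge servers with 3 clients each, quadratic local losses 0.5*||w - t_c||^2.
dim = 4
rng = np.random.default_rng(0)
targets = {c: rng.normal(size=dim) for c in range(6)}
grad_fns = {c: (lambda w, t=t: w - t) for c, t in targets.items()}
hierarchy = {"edge_0": [0, 1, 2], "edge_1": [3, 4, 5]}
w = np.zeros(dim)
for _ in range(30):
    w = hfl_global_round(w, hierarchy, grad_fns)  # w approaches the mean of the targets
```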
While HFL can alleviate communication bottlenecks for the cloud server, data and system heterogeneity amongst the clients still need to be addressed. Since the clients are usually scattered in different locations and have various onboard sensors, the data collected/sensed by these clients are diverse, causing statistical data heterogeneity that the server cannot govern. As such, we need to embrace it in our theoretical and empirical study. Besides, the well-known system heterogeneity arises from the clients’ diverse computation powers [15]. Recently, some works have been proposed to deal with system heterogeneity. For example, FedProx [16], anarchic federated averaging (AFA) [17] and the federated normalized averaging algorithm (FedNova) [18], to name a few, considered different numbers of local rounds for different clients to address system heterogeneity. More specifically, FedProx adds a proximal term to the client’s local objective function to handle heterogeneity. AFA and FedNova present different ways to aggregate the clients’ trained model weights at the server to tackle this system heterogeneity. However, these algorithms still assume that each client stores and trains the original ML model, i.e., neither the computation time for the client’s local training nor the communication overhead for offloading the trained model to the server is considered in the system design.
Model pruning has attracted research interest recently [19, 20]. It makes the over-parameterized model sparser, which allows less computationally capable clients to perform local training more efficiently without sacrificing much test accuracy. Besides, since the trained model contains fewer non-zero entries, the communication overhead over the unreliable wireless link between the client and the associated base station (BS) is also dramatically reduced. However, pruning generally introduces errors that only partially vanish, causing the pruned model to converge only to a neighborhood of the optimal solution [19]. Besides, unlike traditional FL, where model averaging happens only at the central server, HFL has multiple hierarchical levels that may adopt their own aggregation strategies. Therefore, model pruning at the local client level introduces additional errors into the models available at different levels, which eventually propagate into the global model. As such, a more in-depth study is needed to understand the full benefit of model pruning in hierarchical networks.
I-A Related Work
Some recent works studied HFL [9, 10, 11, 12, 13, 14] and model pruning for traditional single-server FL [20, 21, 22, 23, 24] separately. In [9], Xu et al. proposed an adaptive HFL scheme, where they optimized edge aggregation intervals and bandwidth allocation to minimize a weighted combination of the model training delay and training loss. Liu et al. proposed network-assisted HFL in [10], where they optimized wireless resource allocation and user associations to minimize the learning latency for independent and identically distributed (IID) data, and a weighted sum of the total data distance and learning latency for non-IID data. Similar to [9, 10], Luo et al. in [11] jointly optimized the wireless network parameters in order to minimize a weighted combination of the total energy consumption and delay during the training process. Besides, [12] also proposed an HFL algorithm based on federated averaging (FedAvg) [1]. In [13], Feng et al. proposed a mobility-aware clustered FL algorithm to account for user mobility. More specifically, assuming that all users had an equal probability of staying in a cluster, the authors derived an upper bound of the convergence rate to capture the impact of user mobility, data heterogeneity and network heterogeneity. Abad et al. also optimized wireless resources to reduce the communication latency and facilitate HFL in [14].
On the model pruning side, Jiang et al. considered two-stage distributed model pruning in [20] in a traditional single-server FL setting without any wireless network aspects. In a similar setting, Zhu et al. proposed a layer-wise pruning mechanism in [21]. Liu et al. optimized the pruning ratio and time allocation in [22] in order to maximize the convergence rate in a small BS (sBS) operated with time division multiple access. The idea was extended to joint client selection, pruning ratio optimization and time allocation in [23]. Using a similar network model, Ren et al. jointly optimized pruning ratios and bandwidth allocations to minimize a weighted combination of the FL training time and pruning error in [24]. These works [22, 23, 24] decomposed the original problem into different sub-problems that were solved iteratively in an attempt to solve the original problem sub-optimally. Moreover, [22, 23, 24] considered a simple network system model with a single BS serving the distributed clients over wireless links.
I-B Our Contributions
While the studies mentioned above shed some light on HFL and model pruning in the traditional single server based FL, the impact of pruning on HFL in resource-constrained wireless HetNet is yet to be explored. On the one hand, the clients need to train the original model for a few local episodes to determine the neurons they shall prune, which adds additional time and energy costs. On the other hand, pruning adds errors to the learning performance. Therefore, it is necessary to theoretically and empirically study these errors from different levels in HFL. Moreover, it is also crucial to justify how and when one should adopt model pruning in practical wireless HetNets. Motivated by these, in this work, we present our pruning-enabled HFL (PHFL) framework with the following major contributions:
- Considering a practical wireless HetNet, we propose a PHFL solution in which the clients perform local training on the initial models to determine the neurons to prune, perform extensive training on the pruned models, and offload the trained models under strict delay and energy constraints.
- We theoretically analyze how pruning introduces errors in different levels under resource constraints in wireless HetNets by deriving a convergence bound that captures the impact of the wireless links between the clients and server and the pruning ratios. More specifically, the proposed solution converges to the neighborhood of a stationary point of traditional HFL with a convergence rate of , where is the total number of clients, is the total local iterations, quantifies smoothness of the loss function, is an upper bound of the norm of the model weights, and is the maximum allowable pruning ratio.
- Then, we formulate an optimization problem to maximize the convergence rate by jointly configuring wireless resources and system parameters. To tackle the non-convexity of the original problem, we use a successive convex approximation (SCA) algorithm to solve the relaxed convex problem efficiently.
- Finally, using extensive simulation on two popular datasets and three popular ML models, we show the effectiveness of our proposed solution in terms of test accuracy, training time, energy consumption and bandwidth requirement.
The rest of the paper is organized as follows: Section II introduces our system model. Detailed theoretical analysis is performed in Section III, followed by our joint problem formulation and solution in Section IV. Based on our extensive simulation, we discuss the results in Section V. Finally, Section VI concludes the paper. Moreover, Table I summarizes the important notations used in the paper.
Notation | Description |
---|---|
; ; | user; sBS; mBS |
; | sBS set under the mBS; sBS in |
; | VC set of sBS under the mBS; VC of sBS |
; | Client set of the VC of sBS under the mBS; client in |
; | pRB; pRB set |
Client ’s transmission power during | |
; ; | Original model; binary mask; pruned model |
; ; ; | Local model of the client, VC, sBS, and mBS |
; ; ; ; | Loss function of the client, VC, sBS, mBS, and central server, respectively |
; ; | True gradient; stochastic gradient; learning rate |
; | Total & pruned parameters of the ML model |
; | Pruning ratio of client during ; max pruning ratio |
; ; ; | Weight of client, VC, sBS, and mBS |
Number of SGD rounds on to get winning ticket | |
; ; ; | Number of local, VC, sBS, and mBS rounds |
; | Binary indicator function to define if sBS receives client’s trained model; probability that |
Smoothness of the loss functions | |
Bounded variance of the gradients | |
Bounded divergence of the loss functions of two inter-connected tiers | |
; | Upper bound of the -norm of stochastic gradients and model weights, respectively |
; | CPU clock cycle of during ; max CPU cycle of i |
, | Batch size; number of mini-batch |
; | Required number of CPU cycle of to process -bit data; each data sample size in bits |
Floating point precision | |
; | Time and energy overheads to get the lottery ticket |
; | Time and energy overheads to compute local SGD rounds with the pruned model |
; | Time and energy overheads for offloading client ’s trained model |
; | Client ’s total time and energy overheads to finish one VC round |
; | Time and energy budgets to finish one VC round |
II System Model
II-A Wireless Network Model
We consider a generic heterogeneous network (HetNet) consisting of UEs, sBSs and macro BSs (mBSs), as shown in Fig. 1. Denote the UE, sBS and mBS sets by , and , respectively. Each UE is connected to one sBS, and each sBS is connected to one mBS. The mBSs are connected to the central server. While the UEs communicate with their sBS over wireless links, the connections between the sBS and mBS and between the mBS and the central server are wired. Moreover, due to the UEs’ system heterogeneity, we consider that each sBS groups UEs with similar computation and battery powers into a virtual cluster (VC).

We can benefit from the VC since it enables one additional aggregation tier. Besides, thanks to the recent progress of the proximity-services in practical networks [25], one can also select a cluster head and leverage device-to-device communication to receive and distribute the models to the UEs in the same VC. However, in this work, we assume that the sBS creates the VC and also manages it. We use the notation , and to represent the sBS set associated to mBS , VC set of sBS and UE set of VC , respectively. Moreover, denote the UEs associated with sBS , mBS by and , respectively. Finally, . Note that we consider that these associations are known and provided by the network administrator. The network has a fixed bandwidth divided into orthogonal physical resource blocks (pRBs) for performing the FL task. Denote the pRB set by . Each mBS reuses the same pRB set, i.e., the frequency reuse factor is . Besides, each mBS allocates dedicated pRBs to its associated sBSs. Each sBS further uses dedicated pRBs to communicate with the associated UEs. As such, there is no intra-tier interference in the system.
To that end, denote the distance between UE $u$ and its serving sBS by $d_u$. The wireless fading channel between the UE and the sBS follows a Rayleigh distribution (widely used for its simplicity; other distributions are not precluded) and is denoted by $h_u$. The transmission power of UE $u$ is denoted by $P_u$. As such, the uplink signal-to-interference-plus-noise ratio (SINR) is expressed as

$$\Gamma_u = \frac{P_u |h_u|^2 d_u^{-\tau}}{I_u + \omega \sigma^2}, \qquad (3)$$
where $\tau$ is the path loss exponent, $\sigma^2$ is the variance of the circularly symmetric zero-mean Gaussian distributed random noise, $\omega$ is the pRB size and $I_u$ is the inter-cell interference. Thus, the achievable data rate is

$$r_u = \omega \log_2\!\left(1 + \Gamma_u\right). \qquad (4)$$
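For concreteness, the following short sketch evaluates the link model in (3)-(4) for one pRB. The parameter names, the unit-mean exponential channel-power model implied by Rayleigh fading, and the default numerical values (e.g., the thermal noise density of roughly 4e-21 W/Hz) are assumptions made here for illustration only.

```python
import numpy as np

def uplink_rate_bps(p_tx_w, dist_m, bw_hz, noise_psd_w_per_hz,
                    interference_w=0.0, path_loss_exp=3.0, rng=None):
    """Shannon rate on one pRB with Rayleigh fading, in the spirit of (3)-(4)."""
    rng = rng or np.random.default_rng()
    h2 = rng.exponential(1.0)                    # |h|^2 for Rayleigh fading (unit mean)
    rx_power = p_tx_w * h2 * dist_m ** (-path_loss_exp)
    noise = noise_psd_w_per_hz * bw_hz
    sinr = rx_power / (interference_w + noise)   # cf. (3): SINR on the assigned pRB
    return bw_hz * np.log2(1.0 + sinr)           # cf. (4): achievable data rate in bit/s

# Example: 0.2 W transmit power, 100 m link, 180 kHz pRB.
rate = uplink_rate_bps(0.2, 100.0, 180e3, 4e-21)
```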
II-B Pruning-Enabled Hierarchical Federated Learning (PHFL)
In this work, we consider that each client uses mini-batch stochastic gradient descent (SGD) to minimize (2) over mini-batches, since gradient descent over the entire dataset is time-consuming. Denote the stochastic gradient by $g_u(\mathbf{w}) := \nabla f_u(\mathbf{w}; \xi_u)$ such that $\mathbb{E}[g_u(\mathbf{w})] = \nabla f_u(\mathbf{w})$, where $\xi_u$ is a randomly sampled mini-batch from dataset $\mathcal{D}_u$. However, computation with the original model is still costly and may significantly extend the training time of a computationally limited client. To alleviate this, the UE trains a pruned model by removing some of the weights of the original model [19].
Denote a binary mask by $\mathbf{m}_u \in \{0,1\}^{N_p}$ and the pruned model by $\hat{\mathbf{w}}_u := \mathbf{w} \odot \mathbf{m}_u$, where $\odot$ means element-wise multiplication and $N_p$ is the total number of parameters of the original model. Note that training the pruned model is computationally less expensive as it has fewer parameters than the original model. It is worth pointing out that the UE utilizes the state-of-the-art lottery ticket hypothesis [26] to find the winning ticket and the corresponding mask with the following key steps. Denote the number of parameters required to be pruned by $N_u^{\mathrm{pr}}$. The UE performs a given number of local iterations on the original model as

$$\mathbf{w}_u^{t+1} = \mathbf{w}_u^{t} - \eta\, g_u\big(\mathbf{w}_u^{t}\big), \qquad (5)$$

where $\eta$ is the step size. The UE then prunes the $N_u^{\mathrm{pr}}$ entries of the resulting model with the smallest magnitudes and generates a binary mask $\mathbf{m}_u$ (the time complexity of sorting the parameters and then pruning the smallest ones depends on the sorting technique; many sorting algorithms have log-linear time complexity and can be computed quickly on modern graphics processing units; following the common practice in the literature [22, 23, 24], the overhead for pruning is ignored in this work, and the proposed method can be readily extended to incorporate it). To that end, the client obtains the winning ticket by retaining the original weights of the corresponding nonzero entries of the mask from the original initial model [26]. Note that other pruning techniques can also be adopted. Moreover, we denote the pruning ratio by [23]

$$\rho_u := \frac{N_u^{\mathrm{pr}}}{N_p}. \qquad (6)$$
Given the pruned model $\hat{\mathbf{w}}_u$, the loss function of the UE is rewritten as

$$f_u\big(\hat{\mathbf{w}}_u\big) := \frac{1}{D_u} \sum\nolimits_{d=1}^{D_u} l\big(\hat{\mathbf{w}}_u; \mathbf{x}_u^d, y_u^d\big). \qquad (7)$$

Each UE updates its pruned model as

$$\hat{\mathbf{w}}_u^{t+1} = \hat{\mathbf{w}}_u^{t} - \eta\, \mathbf{m}_u \odot g_u\big(\hat{\mathbf{w}}_u^{t}\big). \qquad (8)$$
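The mask construction and the masked update in (8) can be sketched as follows. The flat-parameter-vector view, the use of plain NumPy instead of a deep-learning framework, and all function and variable names are simplifying assumptions for illustration.

```python
import numpy as np

def magnitude_mask(weights, prune_ratio):
    """Binary mask that zeroes out the `prune_ratio` fraction of smallest-magnitude weights."""
    n_prune = int(prune_ratio * weights.size)
    mask = np.ones_like(weights)
    if n_prune > 0:
        idx = np.argsort(np.abs(weights).ravel())[:n_prune]   # smallest-magnitude entries
        mask.ravel()[idx] = 0.0
    return mask

def pruned_sgd_step(w_pruned, mask, grad, lr):
    """One masked SGD step in the spirit of (8): only surviving weights are updated."""
    return w_pruned - lr * mask * grad

# Example: warm up on the full model, build the mask, then train only the winning ticket.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # stands in for the flattened initial model
mask = magnitude_mask(w, prune_ratio=0.6)
w_ticket = w * mask                  # winning ticket: initial values of surviving weights
grad = rng.normal(size=1000)         # stands in for a stochastic gradient
w_ticket = pruned_sgd_step(w_ticket, mask, grad, lr=0.01)
```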
As such, we denote the loss functions of the VC, sBS, mBS and central server as , , , and , respectively. Note that for simplicity, we consider identical weights, i.e., , , and , which can be easily adjusted for other weighting strategies. Besides, since aggregation happens at different times and at different levels, we need to capture the time indices explicitly. Let each UE perform local iterations before sending the updated model to the associated VC. It is worth noting that the winning ticket and the corresponding binary mask are only obtained before these local rounds begin. Besides, since training the original model is costly, it is reasonable to keep the number of initial full-model iterations small (in our simulation, we observed that even a very small number of such iterations performed well). Moreover, although the sBS-mBS and mBS-central server links are wired, communication and computation at these nodes incur additional burdens. As such, we assume that each VC, sBS and mBS perform , and rounds, respectively, before sending the trained model to the respective upper layers. Denote the indices of the current global round, mBS round, sBS round, VC round and UE’s local round by , , , and , respectively. Besides, similar to [13, 9], let denote the index of local update iterations.
If , the UE receives the latest available model of its associated VC, i.e.,
(9) |
where . The UE then computes the pruned model and the binary mask . It then performs local SGD rounds as
(10) |
Each VC performs local rounds. When , the VC’s model gets updated by the latest available sBS model, i.e.,
(11) |
where . Besides, between two VC rounds, the local model of the VC is updated as
(12) |
where and is a binary indicator function that indicates whether the sBS receives the trained model of during the VC aggregation round or not, and is defined as follows:
(13) |
where is the probability of receiving the trained model over the wireless link and is calculated in the subsequent section (c.f. (IV-A)). Note that since the sBS has to receive the gradient over the wireless link, we use the binary indicator function in (12) as a common practice [27, 13].
The sBS performs local rounds before updating its model. When , the sBS updates its local model with the latest available model at its associated mBS, i.e.,
(14) |
where . In each sBS round , the sBS updates its model as
(15) |
where and . Similarly, the mBS performs local rounds before updating its local model with the latest available global model when , i.e.,
(16) |
Moreover, between two mBS rounds, we can write
(17) |
where and . Finally, the central server performs global aggregation by collecting the updated models from all mBSs as follows:
(18) |
where .
The proposed PHFL process is summarized in Algorithm 1.
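Because Algorithm 1 itself is not reproduced in this text, the following sketch captures its nesting of local, VC, sBS and mBS rounds together with per-client pruning and the probabilistic reception of the pruned updates over the wireless uplink (cf. (12)-(13)). All function names, the uniform averaging weights and the data structures are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def phfl_global_round(global_model, net, rounds, train_fn, prune_fn, rx_prob_fn, rng):
    """One global round of PHFL. `net` maps mBS -> sBS -> VC -> list of client ids.
    `rounds` = (local, vc, sbs, mbs) iteration counts. `rx_prob_fn(c)` gives the probability
    that the sBS receives client c's pruned update over the wireless link."""
    n_local, n_vc, n_sbs, n_mbs = rounds
    mbs_models = []
    for sbs_dict in net.values():                                 # each mBS
        mbs_model = global_model.copy()
        for _ in range(n_mbs):                                    # mBS rounds
            sbs_models = []
            for vc_dict in sbs_dict.values():                     # each sBS under the mBS
                sbs_model = mbs_model.copy()
                for _ in range(n_sbs):                            # sBS rounds
                    vc_models = []
                    for clients in vc_dict.values():              # each VC under the sBS
                        vc_model = sbs_model.copy()
                        for _ in range(n_vc):                     # VC rounds
                            received = []
                            for c in clients:
                                mask = prune_fn(c, vc_model)      # lottery-ticket style mask
                                w = train_fn(c, vc_model * mask, mask, n_local)
                                if rng.random() < rx_prob_fn(c):  # update survives the uplink
                                    received.append(w)
                            if received:                          # otherwise keep previous model
                                vc_model = np.mean(received, axis=0)
                        vc_models.append(vc_model)
                    sbs_model = np.mean(vc_models, axis=0)
                sbs_models.append(sbs_model)
            mbs_model = np.mean(sbs_models, axis=0)
        mbs_models.append(mbs_model)
    return np.mean(mbs_models, axis=0)                            # central-server aggregation
```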
III PHFL: Convergence Analysis
III-A Assumptions
We make the following standard assumptions [13, 7, 20, 23, 28]:
1. The loss functions are lower-bounded, i.e., .
2. The loss functions are -smooth, i.e., .
3. Mini-batch gradients are unbiased. Besides, the variance of the gradients is bounded, i.e., .
4. The divergence of the local, VC, sBS, mBS and global loss functions is bounded for all , , , and .
5. The stochastic gradients are independent of each other in different iterations.
6. The stochastic gradients are bounded, i.e., .
7. The model weights are bounded, i.e., .
8. The pruning ratio , in which and is the maximum allowable pruning ratio, follows
(19)
Since the updated global, mBS, sBS and VC models are not available in each local iteration , similar to standard practice [13, 7, 9], we assume the virtual copies of these models, denoted by , , and , respectively, are available. Besides, we assume that the bounded divergence assumptions amongst the above loss functions also hold for these virtual models. Moreover, analogous to our previous notations, we express and .
III-B Convergence Analysis
Similar to existing literature [13, 7, 9, 20], we consider the average global gradient norm as the indicator of the proposed PHFL algorithm’s convergence. As such, in the following, we seek an -suboptimal solution such that and . Particularly, we start with Theorem 1 that requires bounding the differences amongst the models in different hierarchical levels. These differences are first calculated in Lemma 1 to Lemma 4 and then plugged into Theorem 1 to get the -suboptimal bound in Corollary 1.
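A commonly used form of this $\epsilon$-suboptimality criterion for non-convex losses, written here as a sketch in which $K$ denotes the total number of local iterations and $\bar{\mathbf{w}}^{k}$ the (virtual) global model at iteration $k$ (both symbols are chosen for illustration and need not match the paper's exact notation), is

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\Big[\big\|\nabla f\big(\bar{\mathbf{w}}^{k}\big)\big\|_2^2\Big] \le \epsilon, \qquad \epsilon \ge 0.$$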
Theorem 1.
When the assumptions in Section III-A hold and , we have
(20) |
where , , and is the client’s central processing unit (CPU) frequency in the wireless factor. Besides, the terms , , and are
(21)
(22)
(23)
(24)
(25)
The proofs of Theorem 1 and the subsequent Lemmas are provided in the supplementary materials.
Remark 1.
The first term in (1) is what we get for centralized learning, while the second term arises from the randomness of the mini-batch gradients [29]. The third term appears from model pruning. Besides, the fourth term arises from the wireless links between the sBS and UEs. It is worth noting that when all ’s are ’s. Finally, the last term is due to the differences among the VC-local, sBS-VC, sBS-mBS and mBS-global model parameters, respectively, which are derived in the following.
Remark 2.
When the system has no pruning, i.e., all UEs use the original models, all . Besides, under the perfect communication among the sBS and UEs, we have . In such cases, the -suboptimal bound boils down to
(26) |
Besides, the last term in (26) appears from the four hierarchical levels. When and there are no levels, i.e., , the convergence bound is exactly the same as the original SGD with non-convex loss function [7].
To that end, we calculate the divergence among the local, VC, sBS, mBS and global model parameters, and derive the corresponding pruning errors in each level in what follows.
Lemma 1.
When , the average difference between the VC and local model parameters, i.e., the term of (1), is upper bounded as
(27) |
where .
Remark 3.
In (1), the first term comes from the statistical data heterogeneity, while the second term arises from the divergence between the local and VC loss functions. The third term emanates from model pruning. Finally, the fourth term stems from the wireless links among the UEs and sBS.
Lemma 2.
When , the difference between the sBS model parameters and VC model parameters, i.e., the term of (1), is upper bounded as
(28) |
where .
Remark 4.
The first term in (2) appears from the divergence of the loss functions of the clients and VC, while the second term stems from the divergence between the loss function of the VC and sBS. The rest of the terms are due to the statistical data heterogeneity, model pruning and wireless links, respectively.
Lemma 3.
When , the average difference between the sBS and mBS model parameters, i.e., the term of (1), is upper bounded as
(29) |
where .
Lemma 4.
When , the average difference between the global and the mBS models, i.e., the term, is bounded as follows:
(30) |
where .
Note that we have similar observations for (3) and (4) as in Remark 4. Now, using the above Lemmas, we find the final convergence rate in Corollary 1.
Corollary 1.
When , the bound of Theorem 1 boils down to
(31) |
Remark 5.
In (1), the third, fourth, fifth and sixth terms appear from the divergence between client-VC, VC-sBS, sBS-mBS and mBS-global loss functions, respectively.
Remark 6.
Remark 7.
When , we have . With a sufficiently large , when the trained model reception success probability is for all users in all time steps, we have , where the second term comes from the pruning error. Therefore, the proposed PHFL solution converges to the neighborhood of a stationary point of traditional HFL.
IV Joint Problem Formulation and Solution
Similar to existing literature [3, 4, 8, 27], we ignore the downlink delay in this paper since the sBS can utilize a higher spectrum and transmission power to broadcast the updated model. Moreover, since the sBS-mBS and mBS-cloud server links are wired, we ignore the transmission delays for these links (these transmissions happen in the backhaul, and the corresponding delays are quite small; calculating them precisely would also require considering the overall network load, which is beyond the scope of this paper). Furthermore, since the sBS, mBS and the cloud server usually have high computation power, we also ignore the model aggregation and processing delays (adding the parameters and then taking the average has a small time complexity; with highly capable CPUs at the sBS, mBS and central server, the corresponding delays are usually small and are therefore ignored in the literature [9, 10, 23]). Therefore, at the beginning of each VC round, we first calculate the required computation time for finding the lottery ticket as
(33) |
where is the batch size, is the number of batches, is the CPU cycles to process -bit data, is UE ’s each data sample’s size in bits and is the CPU frequency. Upon finding the pruned model, each client performs local iterations, which require the following computation time [23]
(34) |
To that end, the UE only offloads the non-zero weights along with the binary mask to the sBS. As such, we calculate the uplink payload size of UE as follows (note that one may instead send the non-pruned weights and the corresponding indices, which are unknown until the original initial model has been trained for the initial iterations; we therefore consider an upper bound for the uplink payload, which is used during the joint parameter optimization phase):
(35) |
where is the floating point precision. Note that, in (35), we need bit to represent the sign of the entry. Therefore, we calculate the uplink payload offloading delay as follows:
(36) |
As such, UE ’s total duration for local computing and trained model offloading is
(37) |
We now calculate the energy consumption for performing the model training, followed by the required energy for offloading the trained models. First, let us calculate the energy consumption to get the lottery ticket as
(38) |
where is the effective capacitance of UE’s CPU chip. Similarly, we calculate the energy consumption to train local iterations using the pruned model as
(39) |
Moreover, we calculate the uplink payload offloading energy consumption as follows:
(40) |
Therefore, the total energy consumption is calculated as
(41) |
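To make the delay and energy accounting in (33)-(41) concrete, the sketch below computes one client's per-VC-round time and energy from a given pruning ratio, CPU frequency, transmission power and uplink rate. The cycles-per-sample computation model, the capacitance-times-f-squared energy model, the (1 - rho) scaling of the pruned-model workload and the payload expression are common conventions (as in [22, 23]) adopted here as assumptions; they are not the paper's exact expressions.

```python
def client_round_cost(rho, f_hz, p_tx_w, rate_bps,
                      n_params, n_samples, cycles_per_sample,
                      warmup_iters, local_iters,
                      kappa=1e-28, bits_per_weight=32):
    """Per-VC-round delay [s] and energy [J] of one client, under assumed models:
    computation time = cycles / f, computation energy = kappa * cycles * f^2,
    uplink payload = surviving weights plus a 1-bit-per-parameter mask."""
    cyc_full = warmup_iters * n_samples * cycles_per_sample        # lottery-ticket warm-up, full model
    cyc_pruned = local_iters * n_samples * cycles_per_sample * (1.0 - rho)
    t_cmp = (cyc_full + cyc_pruned) / f_hz
    e_cmp = kappa * (cyc_full + cyc_pruned) * f_hz ** 2

    payload_bits = (1.0 - rho) * n_params * bits_per_weight + n_params  # weights + binary mask
    t_up = payload_bits / rate_bps
    e_up = p_tx_w * t_up
    return t_cmp + t_up, e_cmp + e_up

# Example feasibility check against a deadline and an energy budget.
t, e = client_round_cost(rho=0.5, f_hz=1e9, p_tx_w=0.2, rate_bps=5e6,
                         n_params=3e6, n_samples=500, cycles_per_sample=2e4,
                         warmup_iters=2, local_iters=10)
feasible = (t <= 30.0) and (e <= 15.0)
```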
IV-A Problem Formulation
Denote the duration between VC aggregation . Then, we calculate the probability of successful reception of UE’s trained model as follows:
(42) |
where and follows from the Rayleigh fading channels between the UE and the sBS.
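Under the Rayleigh fading model of Section II-A, the channel power gain $|h_u|^2$ is exponentially distributed, so this success probability admits a closed form. The sketch below assumes a unit-mean gain, an uplink payload of $S_u$ bits, an available offloading window of $t_u$ seconds, and a given inter-cell interference level (all of these are notational assumptions); it otherwise reuses the SINR and rate expressions (3)-(4):

$$p_u = \Pr\!\Big\{\omega\, t_u \log_2\!\big(1+\Gamma_u\big) \ge S_u\Big\} = \exp\!\left(-\frac{\big(2^{S_u/(\omega t_u)}-1\big)\big(I_u+\omega\sigma^2\big)}{P_u\, d_u^{-\tau}}\right).$$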
Notice that the pruning ratio , CPU frequency , transmission power and the probability of successful model reception are intertwined. More specifically, depends on , and , given that the other parameters remain fixed. As such, we aim to optimize these parameters jointly by considering the controllable terms in our convergence bound in Corollary 1. Therefore, we focus on each VC round, i.e., the local iteration round at which . Specifically, we focus on minimizing the error terms due to pruning and wireless links, which are given by
(43) |
Remark 8.
In the above expression, the first term appears from the pruning error , while the second term comes from the wireless factor .
Based on the above observations, we consider a weighted combination of these two terms as our objective function to minimize the bound in (43). Using the success probability derived in Section IV-A for the wireless factor, we therefore consider the following objective function.
(44) | |||
where and are two weights to strike the balance between the terms. Note that the wireless factor is multiplied by the learning rate and gradient in (43). Typically, the learning rate is small. Besides, the gradient becomes smaller as the training progresses. As such, the wireless factor term is relatively small when for all UEs and VC aggregation rounds. The model weights are non-negative. Furthermore, a larger pruning ratio can dramatically reduce the computation and offloading time, making the wireless factor . However, as a higher pruning ratio means more model parameters are pruned, we wish to avoid making the ’s large to reduce the pruning-induced errors. The above facts suggest we put more weight on the pruning error term to penalize more for the ’s. As such, we consider . However, in our resource-constrained setting, a small can prolong the training and offloading time, leading to be , i.e., the sBS will not receive the local trained model. Therefore, although is small, we keep the wireless factor to ensure is never .
Therefore, we pose the following optimization problem to configure the parameters jointly.
(45)
(45a)
(45b)
(45c)
(45d)
(45e)
where constraint (45a) ensures that one VC round is completed within the required deadline, and constraint (45b) keeps the energy expense within the allowable budget. Besides, constraints (45c) and (45d) restrict the CPU frequency and the transmission power to lie within the UE’s minimum and maximum CPU cycles and transmission power, respectively. Finally, constraint (45e) ensures that the pruning ratio stays within a tolerable limit.
Remark 9.
We assume that the clients’ system configurations remain unchanged over time, while the channel state information (CSI) is dynamic and known at the sBS. The clients share their system configurations with their associated sBS. The sBSs share their respective users’ system configurations and CSI with the central server. As such, problem (45) is solved centrally, and the optimized parameters are broadcast to the clients. Besides, problem (45) is non-convex due to the multiplications and divisions of the optimization variables in the second term. Moreover, some of the constraints are also non-convex. Therefore, the original problem cannot be solved directly with standard convex optimization methods. In the following, we transform the problem into an approximate convex problem that can be solved efficiently.
IV-B Problem Transformation
Let us define . Given an initial feasible point set (, , ), we perform a linear approximation of this non-convex expression as follows:
(46) |
where , and .
Moreover, ,
and are calculated in (47), (48) and (49), respectively.
(47)
(48)
(49)
(50)
(51)
We now focus on the non-convex constraints. First, let us approximate the local pruned model computation time as
(53) |
Then, we approximate the non-convex uplink model offloading delay as shown in (50). Using a similar treatment, we write
(54) |
Similarly, we approximate the energy consumption for model offloading as shown in (IV-B).
Therefore, we pose the following transformed problem
(55)
(55a)
(55b)
(55c)
where the constraints are taken for the same reasons as in the original problem. Besides, .
Note that problem (55) is now convex and can be solved iteratively using existing solvers such as CVXPY [30]. The key steps of our iterative solution are summarized in Algorithm 2. Moreover, as (55) has decision variables and constraints, the time complexity of running Algorithm 2 for iterations is [31]. While Algorithm 2 yields a suboptimal solution and converges to a local stationary solution set of the original problem (45), SCA-based solutions are well known for their fast convergence [32]. Moreover, our extensive empirical study in the sequel suggests that the proposed PHFL solution with Algorithm 2 delivers performance nearly identical to that of the upper-bound baselines.
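To show how each SCA iteration can be handed to an off-the-shelf convex solver, the skeleton below re-linearizes one non-convex per-client delay constraint around the previous iterate and solves the resulting convex subproblem with CVXPY [30]. The toy objective (a plain pruning-ratio penalty), the delay model C(1-rho)/f <= T, and all numerical values are illustrative assumptions rather than the exact expressions (46)-(55) or Algorithm 2.

```python
import cvxpy as cp
import numpy as np

# Toy SCA skeleton: minimize a pruning-error proxy subject to a linearized
# (originally non-convex) per-client delay constraint c(rho, f) = C*(1-rho)/f <= T.
n = 4                                   # clients in one VC (illustrative)
C = np.array([2e9, 3e9, 2.5e9, 4e9])    # total CPU cycles if nothing were pruned (assumed)
T = 10.0                                # deadline [s]
f_min, f_max = 0.5e9, 2.0e9
rho_max = 0.9

rho_k = np.full(n, 0.5)                 # feasible starting point
f_k = np.full(n, 1.5e9)

for it in range(10):                    # SCA iterations: re-linearize, solve, update
    rho = cp.Variable(n)
    f = cp.Variable(n)
    # First-order Taylor expansion of c(rho, f) around the previous iterate (rho_k, f_k).
    c_k = C * (1 - rho_k) / f_k
    grad_rho = -C / f_k
    grad_f = -C * (1 - rho_k) / f_k ** 2
    c_lin = c_k + cp.multiply(grad_rho, rho - rho_k) + cp.multiply(grad_f, f - f_k)
    prob = cp.Problem(cp.Minimize(cp.sum(rho)),          # proxy for the pruning-error term
                      [c_lin <= T, rho >= 0, rho <= rho_max, f >= f_min, f <= f_max])
    prob.solve()
    if prob.status not in ("optimal", "optimal_inaccurate"):
        break
    rho_k, f_k = rho.value, f.value      # next linearization point
```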
V Simulation Results and Discussions
V-A Simulation Setting
For the performance evaluation, we consider , and . We let each sBS maintain VCs, where each VC has UEs. In other words, we have and , and , and . We assume megahertz (MHz). We randomly generate the maximum transmission power , energy budget for each VC aggregation round , CPU frequency and required CPU cycles to process per-bit data , respectively, from dBm, Joules, gigahertz (GHz) and for these two VCs. Therefore, all UEs in a VC have the above randomly generated system configurations (our approach can easily be extended to the case where all clients have their own random values of these parameters; this setting is practical since the parameters depend on the clients’ manufacturers and their specific models). Moreover, as described earlier in Section II, our proposed PHFL has four tiers, namely (i) UE-VC, (ii) VC-sBS, (iii) sBS-mBS, and (iv) mBS-central server.
For our ML task, we use image classification with the popular CIFAR- and CIFAR- datasets [33] for performance evaluation. We use symmetric Dirichlet distribution with concentration parameter for the non-IID data distribution as commonly used in literature [4, 27]. Besides, we use convolutional neural network (CNN), residual network (ResNet)- [34] and ResNet- [34]. The CNN model has the following architecture: , , , , , , whereas the ResNets have a similar architecture as in the original paper [34]. Moreover, the total number of trainable parameters depends on various configurations, such as the input/output shapes, kernel sizes, strides, etc. In our implementation, the original CNN, ResNet- and ResNet- models, respectively, have ; and trainable parameters on CIFAR-, and ; and trainable parameters on CIFAR-. Besides, with , we have a wireless payload of about megabits (Mbs), Mbs and Mbs for CIFAR-, and Mbs, Mbs and Mbs for CIFAR- datasets for the respective three original models.
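The non-IID partitioning described above can be generated in a few lines: the sketch below splits a labeled dataset across clients by drawing per-class client proportions from a symmetric Dirichlet distribution, so that a smaller concentration parameter yields more skewed local label distributions. The client count, concentration value, synthetic labels and the use of plain NumPy are illustrative assumptions.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions.
    Smaller alpha -> more skewed (more non-IID) label distributions per client."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            client_idx[client].extend(part.tolist())
    return [np.array(ci) for ci in client_idx]

# Example with synthetic labels standing in for an image-classification dataset.
labels = np.random.default_rng(1).integers(0, 10, size=50_000)
parts = dirichlet_partition(labels, n_clients=16, alpha=0.5)
```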



V-B Performance Study
First, we investigate the pruning ratios ’s in different VCs. When the system configurations remain the same, the pruning ratio depends on the deadline threshold . More specifically, a larger deadline allows the client to prune fewer model parameters, given that the energy constraint is satisfied. Intuitively, less pruning leads to a bulky model that takes longer training time. The CNN model is shallower compared to the ResNets. More specifically, the original non-pruned ResNet- and ResNet- models have about times and times the trainable parameters of the CNN model, respectively, on CIFAR-. Therefore, the clients require a larger to perform their local training and trained model offloading as the trainable parameters increase.
Intuitively, given a fixed , the clients need to prune more model parameters for a bulky model in order to meet the deadline and energy constraints. Our simulation results also show that this general intuition holds in determining the ’s, as shown in Fig. 2, which shows the cumulative distribution function (CDF) of the ’s in different VCs. It is worth noting that the pruning ratios ’s in each VC aggregation round are not deterministic due to the randomness of the wireless channels. We know the optimal variables once we solve the optimization problem in (55), which depends on the realizations of the wireless channels. Then, for a given VC , we generate the plot by calculating , where is an indicator function that takes value if and otherwise. With the CNN model, about clients have a less than , , and in VC- in all cells, for s, s, s and s deadline thresholds, respectively, in Fig. 2(a). Note that we use , i.e., the clients can prune up to of the neurons. Moreover, we consider s and s to make the problem feasible for all clients for the ResNet- and ResNet- models, respectively. Furthermore, from Fig. 2(a) - Fig. 2(c), it is quite clear that the UEs in VC- have to prune slightly fewer model parameters than the UEs in VC-, even though the maximum CPU frequency of the UEs in VC- is GHz, which is about higher than that of the UEs in VC-. However, due to the wireless payloads in the offloading phase, the transmission powers of the clients can also influence the ’s. In our setting, the UEs’ maximum transmission powers are Watt and Watt, respectively, in VC- and VC-. As such, with a similar wireless channel, the UEs in VC- can offload much faster than the UEs in VC-. The above observations thus point out that the trained model offloading time dominates the total time to finish one VC round .
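The CDF curves described above can be reproduced directly from the optimized pruning ratios collected over the VC aggregation rounds and channel realizations; a minimal sketch is given below, where the data layout and all variable names are assumptions.

```python
import numpy as np

def empirical_cdf(values):
    """Empirical CDF of observed pruning ratios: F(x) = fraction of samples <= x."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# `rho_history[vc]` would hold the optimized pruning ratios of all clients of one VC
# over all VC aggregation rounds and channel realizations (assumed data layout).
rho_history = {"VC-1": np.random.default_rng(2).uniform(0.2, 0.8, size=400)}
x, y = empirical_cdf(rho_history["VC-1"])
```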



When it comes to the energy expense, from (39) and (40), it is quite obvious that a high shall lead to less energy consumption for both training and offloading. However, the total energy expense of the clients boils down to the dominating factor between the required energy for computation and that for trained model offloading, due to the interplay between the wireless and the learning parameters. Particularly, with MHz, the clients can offload the CNN model fast, leading the computational energy consumption to be the dominating factor. The ResNets, on the other hand, have huge wireless overheads, leading the offloading time and energy to be the dominating factors. This is also observed in our results in Fig. 3, which shows the CDF of the energy expense , calculated in (41), of the clients in each VC and is generated following a similar strategy as in Fig. 2. When the CNN model is used, the total energy cost of the clients in VC- is larger, even though they prune more parameters, compared to the clients in VC-, since the larger ’s of the clients in VC- lead to a higher computational energy cost. This, however, changes for the ResNets since the wireless communication burden dominates the computation burden. The clients in VC- can use their higher ’s to reduce the offloading time when they determine the pruning ratios ’s. As such, the total energy expenses of the clients in VC- are much larger than those of the clients in VC-. Our simulation results in Fig. 3(a) and Figs. 3(b)-3(c) also reveal the same trends.






Now, we observe the impact of the ’s on the test accuracies and the required bandwidth for trained model offloading. Intuitively, if the model is shallow, pruning makes it even shallower. Therefore, the test performance can deteriorate if the ’s increase for a shallow model. On the other hand, for a bulky model, pruning may have a less severe effect. Specifically, under the deadline and energy constraints, pruning may eventually help because pruning a few neurons leads to a shallower but still reasonably well-constructed model that can be trained more efficiently. Moreover, our convergence bound in (1) clearly shows that increasing the ’s decreases the convergence speed. However, the wireless payload is directly related to the ’s, as shown in (35). Particularly, the wireless payload is an increasing function of the ’s. As such, increasing the deadline threshold should decrease the ’s but significantly increase the wireless payload size.
Our simulation results also reveal the above trends in Fig. 4. Note that, in Fig. 4 and the subsequent figures, the (solid/dashed) lines are the average of independent simulation trials using the configurations mentioned in Section V-A, while the shaded strips show the corresponding standard deviations. Particularly, we observe that the CNN model is largely affected by a small because that leads to a large , which eventually prunes more neurons of the already shallow model. On the other hand, the bulky ResNets exhibit small performance degradation when decreases. Moreover, compared to the original non-pruned counterparts, the performance difference is small if is selected appropriately, as shown in Fig. 4(a) to Fig. 4(c). However, even a slight increase in shall reduce the wireless payload size, which can significantly save a large portion of the bandwidth, as observed in Fig. 4(d) to Fig. 4(f). For example, if the CNN model is used, with s, the performance degradation on test accuracy is about after PHFL round, while the per PHFL round bandwidth saving for at least of the clients is about . Similarly, if s, the test accuracy degradation is about and after global rounds, while the per PHFL round bandwidth saving for at least of the clients are about and , for ResNet- and ResNet-, respectively.
V-C Baseline Comparisons
We now focus on performance comparisons. First, we consider the existing HFL algorithm [10, 9], which considers neither model pruning nor any energy constraints. Besides, [10, 9] only considered two levels, UE-BS and BS-cloud/server. For a fair comparison, we adapt HFL to three levels: 1) UE-sBS, 2) sBS-mBS and 3) mBS-cloud/server. Furthermore, we enforce the energy and deadline constraints in each UE-sBS aggregation round and name this baseline HFL with constraints (HFL-WC). Moreover, since [10, 9] did not have any VCs and we have VC aggregation rounds before the sBS aggregation round, we have adapted the deadline accordingly for HFL-WC to make our comparison fair. Furthermore, we consider a random PHFL (R-PHFL) scheme, which has the same system model as ours and allows model pruning. In R-PHFL, the pruning ratios ’s are randomly selected between and to satisfy constraint (45e). Moreover, in both HFL-WC and R-PHFL, a common that satisfies both the deadline and energy constraints for all clients in all VCs leads to poor test accuracy. As such, we determine the local iterations of the UEs within a VC by selecting the maximum possible number of iterations that all clients within that VC can perform without violating the delay and energy constraints. For our proposed PHFL, we choose and . Moreover, we let for both HFL-WC and PHFL. We also consider centralized SGD to show the performance gap of PHFL with the ideal case where all training data samples are available centrally.






From our above discussion, it is expected that pruning will likely not have the edge over the HFL-WC with a shallow model. Moreover, one may not need pruning for a shallow model in the first place. However, for a bulky model, due to a large number of training parameters and a huge wireless payload, pruning can be a necessity under extreme resource constraints. Furthermore, it is crucial to jointly optimize ’s, ’s and ’s in order to increase the test accuracy. The simulation results in Fig. 5 also validate these claims. We observe that when the increases, HFL-WC’s performance improves with the CNN model. Moreover, when HFL-WC has times the deadline of PHFL, the performance is comparable, as shown in Fig. 5(a) and Fig. 5(d). More specifically, the maximum performance degradation of the proposed PHFL algorithm is about on CIFAR- when HFL-WC has times the deadline of PHFL. However, for the ResNets model, HFL-WC requires a significantly longer deadline threshold to make the problem feasible. Particularly, with ResNet-, s does not allow the UEs to perform even a single local iteration, leading to the same initial model weights and, thus, the same test accuracy in HFL-WC. Moreover, when the deadline threshold is times the of the PHFL, HFL-WC’s test accuracy significantly lags, as shown in Fig. 5(b) and Fig. 5(e). Particularly, after rounds, our proposed solution with s provides about , and , and about , and better test accuracy on CIFAR- and on CIFAR-, respectively, than the HFL-WC with s, s and s. For the ResNet- model, the clients require s for performing some local training in HFL-WC, whereas our proposed solution can achieve significantly better performance with only s. For example, our proposed solution with s yields about , and , and about , and better test accuracy than the HFL-WC with s, s and s on CIFAR- and CIFAR-, respectively. Moreover, our proposed solution provides about and and about and better test accuracy than R-PHFL on CIFAR- and CIFAR- with the ResNet- and ResNet- models, respectively. The gap with the ideal centralized ML is also expected since FL suffers from data and system heterogeneity.

We also examine the performance of the proposed method in terms of wall-clock time. Since HFL-WC does not have the VC tier, the wall-clock time to run global rounds for HFL-WC with a deadline smaller than will be lower than the proposed PHFL’s wall-clock time to run the same number of global rounds. However, that does not necessarily guarantee a higher test accuracy than our PHFL, since training and offloading the original bulky model may take a long time, allowing only a few local SGD rounds at the clients. Besides, any deadline greater than for HFL-WC will require a longer wall-clock time than our proposed PHFL solution. Our simulation results in Fig. 6 clearly show these trends. We observe that when HFL-WC has a deadline , i.e., s = s, the test accuracies are about and , respectively, for HFL-WC and PHFL when the wall clock reaches seconds. Even with seconds, the HFL-WC algorithm performs worse than our proposed PHFL solution.
From the above results and discussion, it is quite clear that R-PHFL yields poor test accuracy due to the random selection of ’s. Besides, HFL-WC cannot deliver reasonable performance when the model has a large number of training parameters. Furthermore, our proposed PHFL’s performance is comparable to the non-pruned HFL-WC performance with the shallow model. As such, in the following, we only consider the upper bound (UB) of the HFL baseline, which does not consider the constraints. Moreover, to show how pruning degrades test accuracy, we also consider the UB, called HFL-VC-UB, by inheriting the same system model described in Section II, but without the model pruning and the constraints. In other words, we inherit the same underlying four levels, UE-VC, VC-sBS, sBS-mBS and mBS-cloud/server aggregation policy with the original non-pruned model.






Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
R-PHFL | ||||||||||
HFL | ||||||||||
(UB) - |
Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
HFL | ||||||||||
(UB)- |
Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
R-PHFL | ||||||||||
First, we illustrate the required computation time and the corresponding energy expense for the baselines and our proposed PHFL solution with different deadline thresholds. Naturally, if we increase the for each VC aggregation round, our proposed solution will take a longer time and more energy to perform the global rounds. Moreover, since there are VC aggregation rounds in our proposed system model, the original model training and offloading with HFL-VC-UB is expected to take significantly longer and consume more energy than the HFL-UB baseline. Therefore, the effectiveness, in terms of time and energy consumption, of PHFL largely depends on the deadline threshold . While R-PHFL requires the same time as our proposed PHFL, the energy requirements of R-PHFL can vary significantly due to the random selection of the ’s, which leads to different model sizes and different numbers of local training episodes in different VCs.
Our results in Figs. 7(a)-7(c) and Figs. 7(d)-7(f) clearly show these trade-offs with respect to time and energy consumption, respectively, for the three models. It is worth pointing out that we adopted the popular lottery ticket hypothesis [26] for finding the winning ticket, which requires performing iterations on the initial model with the full parameter space, as described in Section II-B. This incurs additional time and energy overheads, as calculated in (33) and (38), respectively. As such, when the is large, if the UEs have sufficient energy budgets, they can prune only a few neurons. Therefore, the total time and energy consumption for the original model computation for getting the winning ticket, training the pruned model and offloading the trained pruned model parameters can become slightly larger than those of HFL-VC-UB. This is observed in Fig. 7(a) and Fig. 7(d) for the CNN model when s, and also in Fig. 7(b) and Fig. 7(e) when s for the ResNet- model. Besides, the dashed vertical lines in Fig. 7(e) and Fig. 7(f) show the mean energy budget of the clients for each PHFL global round. While all UEs are able to perform the learning and offloading within this mean energy budget with the CNN and ResNet- models, clearly, when the ResNet- model is used, more than of the clients will fail to perform HFL-VC-UB due to their limited energy budgets.
Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
R-PHFL | ||||||||||
HFL | ||||||||||
(UB) - | ||||||||||
Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
HFL | ||||||||||
(UB) - | ||||||||||
Methods | Dir() | With CNN Model | With ResNet- Model | With ResNet- Model | ||||||
Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | Acc | Req T [s] | Req E [J] | ||
PHFL | ||||||||||
(Ours) | ||||||||||
HFL-VC | ||||||||||
(UB) | ||||||||||
(Ours) | ||||||||||
R-PHFL | ||||||||||
Finally, we show the impact of and for different dataset heterogeneity levels on the CIFAR- dataset in Table II - Table IV, where s and s are used in our proposed PHFL algorithm for the ResNet- and ResNet- models, respectively. Besides, for the CNN model, we used s and s, respectively, in our PHFL algorithm when and , due to the facts that pruning degrades test performance for a shallow model and that the computation time dominates the offloading delay. Similarly, we considered s and s, respectively, for and on the CIFAR- dataset for the CNN model. The performance comparisons for different ’s are shown in Table V - Table VII. From the tables, it is quite clear that pruning helps, with negligible performance deviation from the original non-pruned counterparts. Besides, for the shallow CNN model, the performance gain of our proposed PHFL, in terms of test accuracy, is insignificant compared to that with the bulky ResNets. Moreover, increasing or generally improves the test accuracy. However, if the same is to be used, it is more beneficial to increase than to increase .
VI Conclusion
This work proposed a model pruning solution to alleviate the bandwidth scarcity and limited computational capacity of wireless clients in heterogeneous networks. Using the derived convergence upper bound, the pruning ratio, computation frequency and transmission power of the clients were jointly optimized to maximize the convergence rate. The performance was evaluated on two popular datasets using three popular machine learning models with different numbers of training parameters. The results suggest that pruning can significantly reduce the training time, energy expense and bandwidth requirement while incurring negligible test performance degradation.
References
- [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017.
- [2] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. IEEE INFOCOM, 2019.
- [3] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
- [4] R. Jin, X. He, and H. Dai, “Communication efficient federated learning with energy awareness over wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 7, pp. 5204–5219, Jan. 2022.
- [5] Z. Chen, W. Yi, and A. Nallanathan, “Exploring representativity in device scheduling for wireless federated learning,” IEEE Trans. Wireless Commun., pp. 1–1, 2023.
- [6] T. Zhang, K.-Y. Lam, J. Zhao, and J. Feng, “Joint device scheduling and bandwidth allocation for federated learning over wireless networks,” IEEE Trans. Wireless Commun., pp. 1–1, 2023.
- [7] J. Wang, S. Wang, R.-R. Chen, and M. Ji, “Demystifying why local aggregation helps: Convergence analysis of hierarchical SGD,” in Proc. AAAI, 2022.
- [8] S. Hosseinalipour, S. S. Azam, C. G. Brinton, N. Michelusi, V. Aggarwal, D. J. Love, and H. Dai, “Multi-stage hybrid federated learning over large-scale d2d-enabled fog networks,” IEEE/ACM Trans. Network., vol. 30, no. 4, pp. 1569–1584, 2022.
- [9] B. Xu, W. Xia, W. Wen, P. Liu, H. Zhao, and H. Zhu, “Adaptive hierarchical federated learning over wireless networks,” IEEE Trans. Vehicular Technol., vol. 71, no. 2, pp. 2070–2083, 2021.
- [10] S. Liu, G. Yu, X. Chen, and M. Bennis, “Joint user association and resource allocation for wireless hierarchical federated learning with iid and non-iid data,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 7852–7866, 2022.
- [11] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, “Hfel: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6535–6548, 2020.
- [12] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in Proc. IEEE ICC, 2020.
- [13] C. Feng, H. H. Yang, D. Hu, Z. Zhao, T. Q. S. Quek, and G. Min, “Mobility-aware cluster federated learning in hierarchical wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 8441–8458, Oct. 2022.
- [14] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in Proc. ICASSP. IEEE, 2020.
- [15] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
- [16] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proc. MLSys, vol. 2, pp. 429–450, 2020.
- [17] H. Yang, X. Zhang, P. Khanduri, and J. Liu, “Anarchic federated learning,” in Proc. ICML. PMLR, 2022, pp. 25 331–25 363.
- [18] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Proc. NeurIPS, vol. 33, pp. 7611–7623, 2020.
- [19] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi, “Dynamic model pruning with feedback,” in Proc. ICLR, 2020.
- [20] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, Apr. 2022.
- [21] Z. Zhu, Y. Shi, J. Luo, F. Wang, C. Peng, P. Fan, and K. B. Letaief, “FedLP: Layer-wise pruning mechanism for communication-computation efficient federated learning,” arXiv preprint arXiv:2303.06360, 2023.
- [22] S. Liu, G. Yu, R. Yin, and J. Yuan, “Adaptive network pruning for wireless federated learning,” IEEE Wireless Commun. Lett., vol. 10, no. 7, pp. 1572–1576, 2021.
- [23] S. Liu, G. Yu, R. Yin, J. Yuan, L. Shen, and C. Liu, “Joint model pruning and device selection for communication-efficient federated edge learning,” IEEE Trans. Commun., vol. 70, no. 1, pp. 231–244, Jan. 2022.
- [24] J. Ren, W. Ni, and H. Tian, “Toward communication-learning trade-off for federated learning at the network edge,” IEEE Commun. Lett., vol. 26, no. 8, pp. 1858–1862, Aug. 2022.
- [25] “3rd Generation Partnership Project; Technical Specification Group Core Network and Terminals; Proximity-services in 5G System protocol aspects; Stage 3,” 3GPP TS 24.554 V18.0.0, Rel. 18, Mar. 2023.
- [26] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proc. ICLR, 2019.
- [27] M. F. Pervej, R. Jin, and H. Dai, “Resource constrained vehicular edge federated learning with highly mobile connected vehicles,” IEEE J. Sel. Areas Commun., vol. 41, no. 6, pp. 1825–1844, June 2023.
- [28] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” in Proc. NeurIPS, 2018.
- [29] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
- [30] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2909–2913, 2016.
- [31] E. Che, H. D. Tuan, and H. H. Nguyen, “Joint optimization of cooperative beamforming and relay assignment in multi-user wireless relay networks,” IEEE Trans. Wireless Commun., vol. 13, no. 10, pp. 5481–5495, 2014.
- [32] Y. Sun, D. Xu, D. W. K. Ng, L. Dai, and R. Schober, “Optimal 3D-trajectory design and resource allocation for solar-powered UAV communication systems,” IEEE Trans. Commun., vol. 67, no. 6, pp. 4281–4298, 2019.
- [33] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Apr. 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
- [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016.
Md Ferdous Pervej (M’23) received the B.Sc. degree in electronics and telecommunication engineering from the Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh, in 2014, and the M.S. and Ph.D. degrees in electrical engineering from Utah State University, Logan, UT, USA, in and North Carolina State University, Raleigh, NC, USA, in , respectively. He was with Mitsubishi Electric Research Laboratories, Cambridge, MA, in summer , and with Futurewei Wireless Research and Standards, Schaumburg, IL from May to December . He is currently a Postdoctoral Scholar – Research Associate in the Ming Hsieh Department of Electrical and Computer Engineering at the University of Southern California, Los Angeles, CA, USA. His primary research interests are wireless networks, distributed machine learning, vehicle-to-everything communication, edge caching/computing, and machine learning for wireless networks.
Richeng Jin (M’21) received the B.S. degree in information and communication engineering from Zhejiang University, Hangzhou, China, in 2015, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, NC, USA, in 2020. He was a Postdoctoral Researcher in electrical and computer engineering at North Carolina State University, Raleigh, NC, USA, from 2021 to 2022. He is currently a faculty member of the department of information and communication engineering with Zhejiang University, Hangzhou, China. His research interests are in the general area of wireless AI, game theory, and security and privacy in machine learning/artificial intelligence and wireless networks.
Huaiyu Dai (F’17) received the B.E. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, in 2002. He was with Bell Labs, Lucent Technologies, Holmdel, NJ, in summer 2000, and with AT&T Labs-Research, Middletown, NJ, in summer 2001. He is currently a Professor of Electrical and Computer Engineering with NC State University, Raleigh, holding the title of University Faculty Scholar. His research interests are in the general areas of communications, signal processing, networking, and computing. His current research focuses on machine learning and artificial intelligence for communications and networking, multilayer and interdependent networks, dynamic spectrum access and sharing, as well as security and privacy issues in the above systems. He has served as an area editor for IEEE Transactions on Communications, a member of the Executive Editorial Committee for IEEE Transactions on Wireless Communications, and an editor for IEEE Transactions on Signal Processing. Currently he serves as an area editor for IEEE Transactions on Wireless Communications. He was a co-recipient of best paper awards at 2010 IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS 2010), 2016 IEEE INFOCOM BIGSECURITY Workshop, and 2017 IEEE International Conference on Communications (ICC 2017). He received the Qualcomm Faculty Award in 2019.
Supplementary Materials
Additional Notations: Analogous to our previous notations, we express the pruned virtual models at different hierarchy levels as , , and . Moreover, , and .
Appendix A Proof of Theorem 1
Theorem 1.
When the assumptions in Section III-A hold and , we have
(56)
where , , and is the client’s CPU frequency in the wireless factors. Besides, the terms , , and are
(57)
(58)
(59)
(60)
(61)
Proof.
Since there are clients and we consider the virtual models at different hierarchy levels, we start the convergence proof by assuming that the global model is the weighted combination of all of these clients' local models. We then break down our derivations for the different hierarchy levels based on our proposed PHFL algorithm. Note that this is a standard practice in the literature [9, 13].
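As a point of reference, a minimal sketch of this starting point in generic notation (the symbols below are illustrative and are not necessarily the paper's own) is

\[
\bar{\mathbf{w}}_t = \sum_{i=1}^{N} \alpha_i \mathbf{w}_{i,t}, \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i = 1,
\]

i.e., a virtual global model maintained as a weighted combination of all clients' (pruned) local models at every iteration, even though in PHFL the physical aggregations only occur at the VC, sBS, and mBS synchronization rounds.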
First, let us write the update rule for the global model as
(62)
where . As such, we write
(63)
where stems from the -smoothness assumption, whose generic form is recalled after (64) below. For the third term in (63), we can write
(64)
where .
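For reference, the -smoothness step used to obtain (63) is the standard descent inequality; in generic notation, with the smoothness constant written here as $\beta$ (an illustrative symbol, not necessarily the paper's), it reads

\[
F(\mathbf{w}_{t+1}) \le F(\mathbf{w}_t) + \left\langle \nabla F(\mathbf{w}_t), \mathbf{w}_{t+1} - \mathbf{w}_t \right\rangle + \frac{\beta}{2} \left\| \mathbf{w}_{t+1} - \mathbf{w}_t \right\|^2 .
\]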
Plugging (A) into (63) and taking expectation over both sides, we get
(65) |
We simplify the third term in (65) as follows:
(66)
where follows from the independence of client selection and SGD. Besides, we define in .
The second term in (65) can be simplified as follows:
(67)
where comes from the definition of variance, follows from Jensen's inequality, i.e., , stems from the fact that , and is due to the bounded stochastic gradient assumption.
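The two elementary facts used at these steps are, in generic and illustrative notation, the variance decomposition and Jensen's inequality for the squared norm of a convex combination:

\[
\mathbb{E}\!\left[\|\mathbf{x}\|^2\right] = \left\|\mathbb{E}[\mathbf{x}]\right\|^2 + \mathbb{E}\!\left[\left\|\mathbf{x} - \mathbb{E}[\mathbf{x}]\right\|^2\right],
\qquad
\Big\|\sum_{i=1}^{N} \alpha_i \mathbf{x}_i\Big\|^2 \le \sum_{i=1}^{N} \alpha_i \left\|\mathbf{x}_i\right\|^2 \ \ \text{for } \alpha_i \ge 0,\ \sum_{i=1}^{N}\alpha_i = 1;
\]

the exact weights and random variables used in the paper's steps may differ from this sketch.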
Plugging (66) and (A) in (65), we get
(68)
Note that, when , . Therefore, we can drop the fourth term in (68).
To that end, rearranging the terms in (68), then dividing both sides by , taking expectations on both sides and averaging over time, we get the following
(69)
where we use the notation to represent for simplicity.
The last term in (A) can be expanded as follows:
(70)
where we used the notation and to represent the models of the VC, sBS, and mBS, respectively, that UE is connected to. Furthermore, and stem from the fact that . Moreover, arises from the -smoothness assumption and the assumption on the divergence between the loss functions.
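The expansion in (70) follows the usual hierarchical decomposition: the gap between a client's model and the virtual global model is split across the tiers, and each piece is then bounded separately (these are the four difference terms handled in Lemmas 1–4). A minimal sketch of this splitting, with illustrative symbols, is

\[
\left\| \mathbf{w}_i - \bar{\mathbf{w}} \right\|^2
\le 4 \left( \left\| \mathbf{w}_i - \mathbf{w}_{\mathrm{VC}(i)} \right\|^2
+ \left\| \mathbf{w}_{\mathrm{VC}(i)} - \mathbf{w}_{\mathrm{sBS}(i)} \right\|^2
+ \left\| \mathbf{w}_{\mathrm{sBS}(i)} - \mathbf{w}_{\mathrm{mBS}(i)} \right\|^2
+ \left\| \mathbf{w}_{\mathrm{mBS}(i)} - \bar{\mathbf{w}} \right\|^2 \right),
\]

which relies on the elementary bound $\big\|\sum_{k=1}^{4} \mathbf{a}_k\big\|^2 \le 4 \sum_{k=1}^{4} \|\mathbf{a}_k\|^2$; in the paper, the smoothness assumption is additionally used to relate gradient differences to these parameter differences.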
To this end, we bound the terms to using our aggregation rules and definitions. First, let us focus on the term
(71)
where in , we trace back to the nearest synchronization iteration where and . Besides, we use the definition of pruning ratio in .
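The trace-back step above is the standard local-drift argument: between two consecutive synchronization points, a local model differs from its value at the last synchronization only through the accumulated (and, here, pruned) SGD steps. A hedged sketch with illustrative symbols, ignoring pruning, is

\[
\mathbf{w}_{i,t} - \mathbf{w}_{i,t_0} = -\eta \sum_{\tau = t_0}^{t-1} g_{i,\tau},
\]

where $t_0$ denotes the most recent synchronization iteration, $\eta$ the learning rate, and $g_{i,\tau}$ the stochastic gradient; the paper's version also carries the pruning ratio through this accumulated sum, which is how pruning enters the resulting bound.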
Now, we calculate the bound for the term as follows
(72)
where arises from Jensen's inequality and appears from the same reasoning as in . Using similar steps, we write the following:
(73)
(74)
(75)
When , , and , we have . As such, using the fact that and definition of , we get
(77)
∎
Appendix B Proof of Lemma 1
Lemma 1.
When , the average difference between the VC and local model parameters, i.e., the term of (56), is upper bounded as
(78)
where .
Proof.
(79)
where and comes from .
Note that the first term in (B) comes from the VC receiving a weighted combination of the pruned models of its associated UEs. For this term, we have
(80)
where is true since and comes from .
As such, the first term is bounded as
(81)
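For reference, the pruned VC aggregation invoked above can be sketched, with illustrative symbols, as

\[
\mathbf{w}_{\mathrm{VC}} = \sum_{i \in \mathcal{U}_{\mathrm{VC}}} \alpha_i \left( \mathbf{m}_i \odot \mathbf{w}_i \right), \qquad \sum_{i \in \mathcal{U}_{\mathrm{VC}}} \alpha_i = 1,
\]

where $\mathcal{U}_{\mathrm{VC}}$ is the set of UEs associated with the VC, $\mathbf{m}_i$ is UE $i$'s binary pruning mask, and $\odot$ is the element-wise product; the exact weights and mask definition follow the paper's aggregation rule and pruning-ratio definition.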
For the second term of (B), we have
(82)
Lemma 5.
(83)
where .
Lemma 6.
(84)
where .
Now multiplying both sides of (85) by , we get
(86)
When , we have , and the previous assumption of is automatically satisfied. As such, we write
(87)
B-A Missing Proof of Lemma 5
(88)
where stems from the fact that , which holds for any and . Besides, is true due to the time-independence assumption on the SGD. Furthermore, in and , we use the bounded divergence of the mini-batch gradients assumption and the client independence property.
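The elementary inequalities typically invoked at such a step are, for any vectors and any $c > 0$ (stated here in generic form; the paper's exact choice of constants is not reproduced in this extract),

\[
\|\mathbf{a} + \mathbf{b}\|^2 \le (1+c)\,\|\mathbf{a}\|^2 + \left(1 + \tfrac{1}{c}\right)\|\mathbf{b}\|^2,
\qquad
\Big\|\sum_{k=1}^{K} \mathbf{a}_k\Big\|^2 \le K \sum_{k=1}^{K} \|\mathbf{a}_k\|^2 .
\]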
B-B Missing Proof of Lemma 6
(89)
where . ∎
Appendix C Proof of Lemma 2
Lemma 2.
When , the difference between the sBS model parameters and VC model parameters, i.e., the term of (56), is upper bounded as
(90)
where .
Proof.
(91)
where and the inequalities in the last term arise from Jensen's inequality.
For the first term in (C), we have
(92)
where in we use the fact that , and follows from similar steps as in (73) and (72).
As such we derive the upper bound of the first term of (C) as
(93)
For the second term in (C), we have
(94)
Lemma 7.
(95)
(96)
where .
Lemma 8.
(97)
C-A Missing Proof of Lemma 7
(99)
where .
C-B Missing Proof of Lemma 8
(100)
which concludes the proof of Lemma 8. ∎
Appendix D Proof of Lemma 3
Lemma 3.
When , the average difference between the sBS and mBS model parameters, i.e., the term of (56), is upper bounded as
(101)
where .
Proof.
(102)
where and the inequalities in the last term arise from Jensen's inequality.
For the first term in (D), we have
(103)
where in we use the fact that . Moreover, follows from similar steps as in (74) and (73).
As such, we have
(104)
For the second term in (D), we have
(105)
Lemma 9.
(106)
where .
Lemma 10.
(107)
D-A Missing Proof of Lemma 9
(109)
where .
D-B Missing Proof of Lemma 10
(110)
∎
Appendix E Proof of Lemma 4
Lemma 4.
When , the average difference between the global and the mBS models, i.e., the term, is bounded as follows:
(111)
where .
Proof.
(112)
where the last inequality follows from Jensen's inequality.
For the first term of (E), we have
(113)
where in , we use the fact that and stems from . Moreover, the last inequality follows from similar steps as in (75) and (74).
As such, we simplify the first term as
(114)
The second term of (E) is further simplified as follows:
(115)
Lemma 11.
Lemma 12.
The second term of (E) is bounded as follows:
(117)
E-A Missing Proof of Lemma 11
(119)
where .
E-B Missing Proof of Lemma 12
(120)
(121)
∎