Hierarchical Federated Learning in Wireless Networks: Pruning Tackles Bandwidth
Scarcity and System Heterogeneity

Md Ferdous Pervej, Richeng Jin, and Huaiyu Dai

This research was supported in part by the National Natural Science Foundation of China under Grant 62301487, in part by the Zhejiang Provincial Natural Science Foundation of China under Grants LQ23F010021 and LD21F010001, in part by the Ng Teng Fong Charitable Foundation in the form of a ZJU-SUTD IDEA Grant under Grant 188170-11102, and in part by the US National Science Foundation under Grants CNS-1824518 and ECCS-2203214. (Corresponding author: Richeng Jin.)

M. F. Pervej was with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695, USA, and is now with the Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA (e-mail: [email protected]).

H. Dai is with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695, USA (e-mail: [email protected]).

R. Jin is with the College of Information Science and Electronic Engineering, Zhejiang University, the Zhejiang–Singapore Innovation and AI Joint Research Lab, and the Zhejiang Provincial Key Lab of Information Processing, Communication and Networking (IPCAN), Hangzhou, China, 310000 (e-mail: [email protected]).

©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract

A practical wireless network has many tiers where end users do not directly communicate with the central server; moreover, the users’ devices have limited computation and battery power, and the serving base station (BS) has a fixed bandwidth. Owing to these practical constraints and system models, this paper leverages model pruning and proposes a pruning-enabled hierarchical federated learning (PHFL) framework for heterogeneous networks (HetNets). We first derive an upper bound of the convergence rate that clearly demonstrates the impact of model pruning and of the wireless communications between the clients and the associated BS. Then we jointly optimize the model pruning ratio, the central processing unit (CPU) frequency and the transmission power of the clients in order to minimize the controllable terms of the convergence bound under strict delay and energy constraints. However, since the original problem is not convex, we perform successive convex approximation (SCA) and jointly optimize the parameters of the relaxed convex problem. Through extensive simulations, we validate the effectiveness of our proposed PHFL algorithm in terms of test accuracy, wall-clock time, energy consumption and bandwidth requirement.

Index Terms:
Heterogeneous network, hierarchical federated learning, model pruning, resource management.

I Introduction

Federated learning (FL) has garnered significant attention as a privacy-preserving distributed edge learning solution in wireless edge networks [1, 2, 3]. Since the original FL follows the parameter server paradigm, many state-of-the-art works consider a single server with distributed clients as the general system model in order to study the analytical and empirical performance [2, 3, 4, 5, 6]. Given that there are $\mathcal{U}\coloneqq\{u\}_{u=1}^{U}$ clients, each with a local dataset $\mathcal{D}_{u}\coloneqq\{\mathbf{x}_{a},y_{a}\}_{a=1}^{A}$, where $\mathbf{x}_{a}$ and $y_{a}$ are the $a^{\mathrm{th}}$ feature vector and the corresponding label, the central server wants to train a global machine learning (ML) model $\mathbf{w}$ by minimizing a weighted combination of the clients’ local objective functions $f_{u}(\mathbf{w})$’s, as follows:

$f(\mathbf{w})\coloneqq\sum\nolimits_{u=1}^{U}\alpha_{u}f_{u}(\mathbf{w})$, (1)
$f_{u}(\mathbf{w})\coloneqq(1/|\mathcal{D}_{u}|)\sum\nolimits_{(\mathbf{x}_{a},y_{a})\in\mathcal{D}_{u}}\mathrm{l}(\mathbf{w},\mathbf{x}_{a},y_{a})$, (2)

where $\alpha_{u}$ is the corresponding weight, and $\mathrm{l}(\mathbf{w},\mathbf{x}_{a},y_{a})$ denotes the loss function associated with the $a^{\mathrm{th}}$ data sample. However, general networks usually follow a hierarchical structure [7], where the clients are connected to edge servers, the edge servers are connected to fog nodes/servers, and these fog nodes/servers are connected to the cloud server [8]. Naturally, some recent works [9, 10, 11, 12, 13, 14] have extended FL to accommodate this hierarchical network topology.
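As a concrete illustration, the weighted objective in (1)-(2) can be sketched in a few lines. This is a toy setup, not the paper's experiments: a scalar model, a squared per-sample loss, and made-up client data.

```python
# Sketch of Eqs. (1)-(2): the global objective is a weighted sum of
# per-client empirical losses. The scalar model and squared loss are
# illustrative placeholders.

def local_loss(w, data):
    # f_u(w): average of l(w, x, y) = (w*x - y)^2 over the client's samples
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def global_loss(w, client_data, alphas):
    # f(w) = sum_u alpha_u * f_u(w)
    return sum(a * local_loss(w, d) for a, d in zip(alphas, client_data))

clients = [[(1.0, 2.0), (2.0, 4.0)],   # client 1: y = 2x exactly
           [(1.0, 3.0)]]               # client 2: y = 3x
alphas = [0.5, 0.5]                    # equal weights summing to 1
print(global_loss(2.0, clients, alphas))  # client 1 loss 0, client 2 loss 1 -> 0.5
```

With $w=2$, client 1 fits its data perfectly while client 2 does not, so the global loss is driven entirely by the heterogeneous client, which is exactly the tension the weighted combination captures.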

A client (the terms client and UE are used interchangeably when there is no ambiguity) does not directly communicate with the central server in the hierarchical network topology. Instead, the clients usually perform multiple local rounds of model training before sending the updated models to the edge server. The edge server aggregates the received models, updates its edge model, and then broadcasts the updated model to the associated clients for further local training. The edge servers repeat this for multiple edge rounds and finally send the updated edge models to the upper-tier servers, which undergo the same process before finally sending the updated models to the cloud/central server. This is usually known as hierarchical federated learning (HFL) [11]. On the one hand, HFL acknowledges the practical wireless heterogeneous network (HetNet) architecture. On the other hand, it avoids costly direct communication between the far-away cloud server and the capacity-limited clients [14]. Moreover, since local averaging improves learning accuracy [7], the central server ends up with a better-trained model.
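The client/edge/cloud schedule described above can be sketched with scalar "models". This is a minimal abstraction with illustrative round counts and a constant stand-in for the SGD step, not the paper's algorithm.

```python
# Toy sketch of the HFL schedule: clients run local steps, each edge
# server averages its clients for several edge rounds, and the cloud
# averages the edge models. All counts and the fixed "gradient" are
# illustrative.

def local_train(w, steps=2, grad=0.1):
    for _ in range(steps):
        w -= grad                  # stand-in for one SGD step
    return w

def average(models):
    return sum(models) / len(models)

def hfl_round(cloud_w, n_edges=2, clients_per_edge=3, edge_rounds=2):
    edge_models = []
    for _ in range(n_edges):
        w_edge = cloud_w
        for _ in range(edge_rounds):                       # edge aggregation
            updated = [local_train(w_edge) for _ in range(clients_per_edge)]
            w_edge = average(updated)
        edge_models.append(w_edge)
    return average(edge_models)                            # cloud aggregation

print(hfl_round(1.0))
```

Note that the clients only ever exchange models with their edge server; the cloud sees one message per edge server per global round, which is the communication saving HFL provides.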

While HFL can alleviate the communication bottleneck at the cloud server, data and system heterogeneity amongst the clients still need to be addressed. Since the clients are usually scattered across different locations and have various onboard sensors, the data collected/sensed by these clients are diverse, causing statistical data heterogeneity that the server cannot govern. As such, we need to embrace it in our theoretical and empirical study. Besides, the well-known system heterogeneity arises from the clients’ diverse computation powers [15]. Recently, several works have been proposed to deal with system heterogeneity. For example, FedProx [16], anarchic federated averaging (AFA) [17] and the federated normalized averaging algorithm (FedNova) [18], to name a few, allow different numbers of local rounds for different clients. More specifically, FedProx adds a proximal term to each client’s local objective function to handle heterogeneity, while AFA and FedNova present different ways for the server to aggregate the clients’ trained model weights. However, these algorithms still assume that each client stores and trains the original ML model, i.e., neither the computation time for the client’s local training nor the communication overhead for offloading the trained model to the server is considered in the system design.
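As an illustration of FedProx's mechanism mentioned above, its proximal modification of the local objective can be sketched as follows. The coefficient mu, the quadratic toy loss and all values are illustrative, not taken from [16].

```python
# FedProx (sketch): the client's local objective gains a proximal term
# (mu/2)*||w - w_global||^2 that discourages the local model from
# drifting far from the global model under heterogeneous local work.

def fedprox_local_objective(w, w_global, base_loss, mu=0.1):
    return base_loss(w) + 0.5 * mu * (w - w_global) ** 2

loss = lambda w: (w - 3.0) ** 2        # toy local loss minimized at w = 3
print(fedprox_local_objective(3.0, 0.0, loss, mu=0.1))  # 0 + 0.05*9 = 0.45
```

The proximal term penalizes exactly the client whose local optimum (here $w=3$) sits far from the global model (here $w=0$), which is how different amounts of local work stay compatible.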

Model pruning has attracted research interest recently [19, 20]. It makes the over-parameterized model sparser, which allows less computationally capable clients to perform local training more efficiently without sacrificing much test accuracy. Besides, since the trained model contains fewer non-zero entries, the communication overhead over the unreliable wireless link between the client and the associated base station (BS) is also dramatically reduced. However, pruning generally introduces errors that only partially vanish, causing the pruned model to converge only to a neighborhood of the optimal solution [19]. Besides, unlike traditional FL, where model averaging happens only at the central server, HFL has multiple hierarchical levels that may adopt their own aggregation strategies. Therefore, model pruning at the local client level introduces additional errors into the models available at different levels, which eventually propagate to the global model. As such, a more in-depth study is needed to understand the full benefit of model pruning in hierarchical networks.

I-A Related Work

Some recent works studied HFL [9, 10, 11, 12, 13, 14] and model pruning in traditional single-server FL [20, 21, 22, 23, 24] separately. In [9], Xu et al. proposed an adaptive HFL scheme, where they optimized edge aggregation intervals and bandwidth allocation to minimize a weighted combination of the model training delay and training loss. Liu et al. proposed network-assisted HFL in [10], where they optimized wireless resource allocation and user association to minimize 1) the learning latency for independent and identically distributed (IID) data and 2) a weighted sum of the total data distance and learning latency for non-IID data. Similar to [9, 10], Luo et al. jointly optimized the wireless network parameters in order to minimize a weighted combination of the total energy consumption and delay during the training process in [11]. Besides, [12] also proposed an HFL algorithm based on federated averaging (FedAvg) [1]. In [13], Feng et al. proposed a mobility-aware clustered FL algorithm to account for user mobility. More specifically, assuming that all users have an equal probability of staying in a cluster, the authors derived an upper bound of the convergence rate to capture the impact of user mobility, data heterogeneity and network heterogeneity. Abad et al. also optimized wireless resources to reduce the communication latency and facilitate HFL in [14].

On the model pruning side, Jiang et al. considered two-stage distributed model pruning in [20] in a traditional single-server FL setting without any wireless network aspects. In a similar setting, Zhu et al. proposed a layer-wise pruning mechanism in [21]. Liu et al. optimized the pruning ratio and time allocation in [22] in order to maximize the convergence rate in a time-division multiple access (TDMA)-operated small BS (sBS) network. The idea was extended to joint client selection, pruning ratio optimization and time allocation in [23]. Using a similar network model, Ren et al. jointly optimized pruning ratios and bandwidth allocations to minimize a weighted combination of the FL training time and pruning error in [24]. These works [22, 23, 24] decomposed the original problem into sub-problems that they solved iteratively to obtain a sub-optimal solution of the original problem. Moreover, [22, 23, 24] considered a simple system model with a single BS serving the distributed clients over wireless links.

I-B Our Contributions

While the studies mentioned above shed some light on HFL and on model pruning in traditional single-server FL, the impact of pruning on HFL in resource-constrained wireless HetNets is yet to be explored. On the one hand, the clients need to train the original model for a few local episodes to determine the neurons to prune, which adds time and energy costs. On the other hand, pruning adds errors to the learning performance. Therefore, it is necessary to theoretically and empirically study these errors from the different levels of HFL. Moreover, it is also crucial to determine how and when one should adopt model pruning in practical wireless HetNets. Motivated by these observations, we present our pruning-enabled HFL (PHFL) framework with the following major contributions:

  • Considering a practical wireless HetNet, we propose a PHFL solution in which the clients perform local training on the initial models to determine the neurons to prune, perform extensive training on the pruned models, and offload the trained models under strict delay and energy constraints.

  • We theoretically analyze how pruning introduces errors at different levels under resource constraints in wireless HetNets by deriving a convergence bound that captures the impact of the wireless links between the clients and the server as well as the pruning ratios. More specifically, the proposed solution converges to the neighborhood of a stationary point of traditional HFL with a convergence rate of $\mathcal{O}\big(1/\sqrt{UT}\big)+\mathcal{O}(\beta^{2}D^{2}\delta^{\mathrm{th}})$, where $U$ is the total number of clients, $T$ is the total number of local iterations, $\beta$ quantifies the smoothness of the loss function, $D^{2}$ is an upper bound on the $L_{2}$ norm of the model weights, and $0<\delta^{\mathrm{th}}<1$ is the maximum allowable pruning ratio.

  • Then, we formulate an optimization problem to maximize the convergence rate by jointly configuring wireless resources and system parameters. To tackle the non-convexity of the original problem, we use a successive convex approximation (SCA) algorithm to solve the relaxed convex problem efficiently.

  • Finally, using extensive simulation on two popular datasets and three popular ML models, we show the effectiveness of our proposed solution in terms of test accuracy, training time, energy consumption and bandwidth requirement.

The rest of the paper is organized as follows: Section II introduces our system model. Detailed theoretical analysis is performed in Section III, followed by our joint problem formulation and solution in Section IV. Based on our extensive simulation, we discuss the results in Section V. Finally, Section VI concludes the paper. Moreover, Table I summarizes the important notations used in the paper.

TABLE I: Important Notations
Notation : Description
$u$; $b$; $l$ : $u^{\mathrm{th}}$ user; $b^{\mathrm{th}}$ sBS; $l^{\mathrm{th}}$ mBS
$\mathcal{B}_{l}$; $k$ : sBS set under the $l^{\mathrm{th}}$ mBS; $k^{\mathrm{th}}$ sBS in $\mathcal{B}_{l}$
$\mathcal{V}_{k,l}$; $j$ : VC set of sBS $k$ under the $l^{\mathrm{th}}$ mBS; $j^{\mathrm{th}}$ VC of sBS $k$
$\mathcal{U}_{j,k,l}$; $i$ : Client set of the $j^{\mathrm{th}}$ VC of the $k^{\mathrm{th}}$ sBS under the $l^{\mathrm{th}}$ mBS; $i^{\mathrm{th}}$ client in $\mathcal{U}_{j,k,l}$
$z$; $\mathcal{Z}$ : $z^{\mathrm{th}}$ pRB; pRB set
$\mathrm{P}_{i}^{t}$ : Client $i$'s transmission power during $t$
$\mathbf{w}$; $\mathbf{m}$; $\tilde{\mathbf{w}}$ : Original model; binary mask; pruned model
$\mathbf{w}_{i}$; $\mathbf{w}_{j}$; $\mathbf{w}_{k}$; $\mathbf{w}_{l}$ : Local model of the client, VC, sBS, and mBS
$f_{i}(\cdot)$; $f_{j}(\cdot)$; $f_{k}(\cdot)$; $f_{l}(\cdot)$; $f(\cdot)$ : Loss function of the client, VC, sBS, mBS, and central server, respectively
$\nabla f_{i}(\cdot)$; $g(\cdot)$; $\eta$ : True gradient; stochastic gradient; learning rate
$d$; $d_{p}$ : Total and pruned parameters of the ML model
$\delta_{i}^{t}$; $\delta^{\mathrm{th}}$ : Pruning ratio of client $i$ during $t$; maximum pruning ratio
$\alpha_{i}$; $\alpha_{j}$; $\alpha_{k}$; $\alpha_{l}$ : Weight of the $i^{\mathrm{th}}$ client, $j^{\mathrm{th}}$ VC, $k^{\mathrm{th}}$ sBS, and $l^{\mathrm{th}}$ mBS
$\rho$ : Number of SGD rounds on $\mathbf{w}$ to get the winning ticket
$\kappa_{0}$; $\kappa_{1}$; $\kappa_{2}$; $\kappa_{3}$ : Number of local, VC, sBS, and mBS rounds
$\mathbf{1}_{i}^{t}$; $p_{i}^{t}$ : Binary indicator of whether the sBS receives the $i^{\mathrm{th}}$ client's trained model; probability that $\mathbf{1}_{i}^{t}=1$
$\beta$ : Smoothness of the loss functions
$\sigma^{2}$ : Bounded variance of the gradients
$\epsilon_{\cdot}^{2}$ : Bounded divergence of the loss functions of two inter-connected tiers
$G^{2}$; $D^{2}$ : Upper bounds of the $L_{2}$-norm of the stochastic gradients and model weights, respectively
$\mathrm{f}_{i}^{t}$; $\mathrm{f}_{i}^{\mathrm{max}}$ : CPU clock cycle of client $i$ during $t$; maximum CPU cycle of client $i$
$b$; $n$ : Batch size; number of mini-batches
$c_{i}$; $\mathrm{D}_{i}$ : Number of CPU cycles client $i$ needs to process $1$ bit of data; size of each data sample in bits
$\mathrm{FPP}$ : Floating-point precision
$\mathrm{t}_{i}^{\mathrm{cp_{d}}}$; $\mathrm{e}_{i}^{\mathrm{cp_{d}}}$ : Time and energy overheads to get the lottery ticket
$\mathrm{t}_{i}^{\mathrm{cp_{s}}}$; $\mathrm{e}_{i}^{\mathrm{cp_{s}}}$ : Time and energy overheads to compute $\kappa_{0}$ local SGD rounds with the pruned model
$\mathrm{t}_{i}^{\mathrm{up}}$; $\mathrm{e}_{i}^{\mathrm{up}}$ : Time and energy overheads to offload client $i$'s trained model
$\mathrm{t}_{i}^{\mathrm{tot}}$; $\mathrm{e}_{i}^{\mathrm{tot}}$ : Client $i$'s total time and energy overheads to finish one VC round
$\mathrm{t^{th}}$; $\mathrm{e}_{i}^{\mathrm{th}}$ : Time and energy budgets to finish one VC round

II System Model

II-A Wireless Network Model

We consider a generic heterogeneous network (HetNet) consisting of UEs, sBSs and macro BSs (mBSs), as shown in Fig. 1. Denote the UE, sBS and mBS sets by $\mathcal{U}\coloneqq\{u\}_{u=1}^{U}$, $\mathcal{B}\coloneqq\{b\}_{b=1}^{B}$ and $\mathcal{L}\coloneqq\{l\}_{l=1}^{L}$, respectively. Each UE is connected to one sBS, and each sBS is connected to one mBS. The mBSs are connected to the central server. While the UEs communicate with the sBSs over wireless links, the connections between the sBSs and mBSs and between the mBSs and the central server are wired. Moreover, due to the UEs’ system heterogeneity, we consider that each sBS groups UEs with similar computation and battery powers into a virtual cluster (VC).

Figure 1: Pruning-enabled hierarchical FL system model

We can benefit from the VC since it enables one additional aggregation tier. Besides, thanks to the recent progress of proximity services in practical networks [25], one can also select a cluster head and leverage device-to-device communication to receive and distribute the models among the UEs in the same VC. However, in this work, we assume that the sBS creates and manages the VCs. We use the notation $\mathcal{B}_{l}\coloneqq\{k\}_{k=1}^{B_{l}}$, $\mathcal{V}_{k,l}\coloneqq\{j\}_{j=1}^{V_{k,l}}$ and $\mathcal{U}_{j,k,l}\coloneqq\{i\}_{i=1}^{U_{j,k,l}}$ to represent the sBS set associated with mBS $l$, the VC set of sBS $k\in\mathcal{B}_{l}$ and the UE set of VC $j\in\mathcal{V}_{k,l}$, respectively. Moreover, denote the UEs associated with sBS $k$ and mBS $l$ by $\mathcal{U}_{k,l}=\bigcup_{j=1}^{V_{k,l}}\mathcal{U}_{j,k,l}$ and $\mathcal{U}_{l}=\bigcup_{k=1}^{B_{l}}\mathcal{U}_{k,l}$, respectively. Finally, $\mathcal{U}=\bigcup_{l=1}^{L}\mathcal{U}_{l}$. Note that these associations are assumed known and provided by the network administrator. The network has a fixed bandwidth divided into orthogonal physical resource blocks (pRBs) for performing the FL task. Denote the pRB set by $\mathcal{Z}\coloneqq\{z\}_{z=1}^{Z}$. Each mBS reuses the same pRB set, i.e., the frequency reuse factor is $1$. Besides, each mBS allocates dedicated pRBs to its associated sBSs, and each sBS further uses dedicated pRBs to communicate with its associated UEs. As such, there is no intra-tier interference in the system.

To this end, denote the distance between UE $i$ and sBS $k$ by $d_{i,k}$. The wireless fading channel between the UE and the sBS, denoted by $h_{i,k}^{t}$, follows a Rayleigh distribution (widely used for its simplicity; other distributions are not precluded). The transmission power of the UE is denoted by $\mathrm{P}_{i}^{t}$. As such, the uplink signal-to-interference-plus-noise ratio (SINR) is expressed as

$\gamma_{i,k}^{t}=\mathrm{P}_{i}^{t}h_{i,k}^{t}d_{i,k}^{-\alpha}/(\omega\zeta^{2}+I_{i,k}^{t})$, (3)

where $\alpha$ is the path loss exponent, $\zeta^{2}$ is the variance of the circularly symmetric zero-mean Gaussian noise and $\omega$ is the pRB size. Moreover, $I_{i,k}^{t}=\sum_{l=1}^{L}\sum_{k^{\prime}=1,k^{\prime}\neq k}^{B_{l}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\sum_{i^{\prime}\in\mathcal{U}_{j^{\prime},k^{\prime},l}}\mathrm{P}_{i^{\prime}}^{t}h_{i^{\prime},k^{\prime}}^{t}d_{i^{\prime},k^{\prime}}^{-\alpha}$ is the inter-cell interference. Thus, the data rate is

$r_{i}^{t}=\omega\log_{2}\big[1+\gamma_{i,k}^{t}\big]$. (4)
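A quick numerical sketch of (3) and (4) follows; every parameter value below is an illustrative placeholder, not a setting from the paper.

```python
import math

# Sketch of Eqs. (3)-(4): uplink SINR and the resulting achievable
# rate on one pRB of bandwidth omega.

def sinr(P, h, d, alpha, omega, zeta2, interference):
    # gamma = P * h * d^{-alpha} / (omega * zeta^2 + I)
    return P * h * d ** (-alpha) / (omega * zeta2 + interference)

def rate(omega, gamma):
    # r = omega * log2(1 + gamma)
    return omega * math.log2(1.0 + gamma)

g = sinr(P=0.2, h=1.0, d=10.0, alpha=2.0, omega=1.0, zeta2=1e-3, interference=1e-3)
print(rate(omega=1.0, gamma=g))  # approximately 1.0 with these placeholder values
```

Doubling the distance with $\alpha=2$ quarters the received power, which is why the client-sBS link budget directly limits how fast trained models can be offloaded.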

II-B Pruning-Enabled Hierarchical Federated Learning (PHFL)

In this work, we consider that each client uses mini-batch stochastic gradient descent (SGD) to minimize (2) over $n$ mini-batches, since gradient descent over the entire dataset is time-consuming. Denote the stochastic gradient by $g(\mathbf{w})$ such that $\mathbb{E}_{\xi_{n}\sim\mathcal{D}_{i}}[g(\mathbf{w})]\coloneqq\nabla f_{i}(\mathbf{w})$, where $\xi_{n}$ is a randomly sampled batch from dataset $\mathcal{D}_{i}$. However, computation with the original model $\mathbf{w}\in\mathbb{R}^{d}$ is still costly and may significantly extend the training time of a computationally limited client. To alleviate this, the UE trains a pruned model obtained by removing some of the weights of the original model $\mathbf{w}$ [19].

Denote a binary mask by $\mathbf{m}_{i}\in\{0,1\}^{d}$ and the pruned model by $\tilde{\mathbf{w}}_{i}\coloneqq\mathbf{w}_{i}\odot\mathbf{m}_{i}$, where $\odot$ denotes element-wise multiplication. Note that training the pruned model $\tilde{\mathbf{w}}_{i}$ is computationally less expensive as it has fewer parameters than the original model. It is worth pointing out that the UE utilizes the state-of-the-art lottery ticket hypothesis [26] to find the winning ticket $\tilde{\mathbf{w}}_{i}$ and the corresponding mask with the following key steps. Denote the number of parameters to be pruned by $d_{p}$. The UE performs $\rho$ local iterations on the original model $\mathbf{w}_{i}$ as

$\mathbf{w}_{i}^{\rho}=\mathbf{w}_{i}-\eta\sum\nolimits_{\bar{\rho}=1}^{\rho}g(\mathbf{w}_{i}^{\bar{\rho}})$, (5)

where $\eta$ is the step size. The UE then prunes the $d_{p}$ entries of $\mathbf{w}_{i}^{\rho}\in\mathbb{R}^{d}$ with the smallest magnitudes and generates a binary mask $\mathbf{m}_{i}\in\{0,1\}^{d}$. (The time complexity of sorting the $d$ parameters and pruning the $d_{p}$ smallest ones depends on the sorting technique; many sorting algorithms have logarithmic time complexity and run quickly on modern graphics processing units. Following common practice in the literature [22, 23, 24], the pruning overhead is ignored in this work, although the proposed method can be readily extended to incorporate it.) To that end, the client obtains the winning ticket $\tilde{\mathbf{w}}_{i}$ by retaining the weights of the original initial model $\mathbf{w}_{i}$ at the nonzero entries of the mask $\mathbf{m}_{i}$ [26]. Note that other pruning techniques can also be adopted. Moreover, we denote the pruning ratio by [23]

$\delta_{i}\coloneqq d_{p}/d$. (6)
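The pruning steps above, i.e., warm-up training as in (5), magnitude-based masking with ratio (6), and taking the winning ticket from the initial weights, can be sketched as follows. The toy weights and the constant stand-in gradients are illustrative.

```python
# Sketch of lottery-ticket pruning: run a few warm-up steps on the
# dense model, prune the d_p smallest-magnitude entries (delta = d_p/d),
# then take the winning ticket from the *initial* weights.

def magnitude_mask(w, delta):
    d_p = int(len(w) * delta)                      # number of entries to prune
    order = sorted(range(len(w)), key=lambda i: abs(w[i]))
    mask = [1] * len(w)
    for i in order[:d_p]:
        mask[i] = 0                                # zero out smallest magnitudes
    return mask

def masked_sgd_step(w, grad, mask, eta):
    # Eq. (8): w~ <- w~ - eta * g(w~) ⊙ m
    return [wi - eta * gi * mi for wi, gi, mi in zip(w, grad, mask)]

w_init = [0.5, -0.05, 1.2, 0.01]                                  # toy dense model
w_rho = masked_sgd_step(w_init, [0.1] * 4, [1] * 4, eta=1.0)      # warm-up step, Eq. (5)
mask = magnitude_mask(w_rho, delta=0.5)                           # prune d_p = 2 of d = 4
ticket = [wi * mi for wi, mi in zip(w_init, mask)]                # winning ticket from w_init
print(mask, ticket)
```

Subsequent local training then repeatedly applies `masked_sgd_step` with this fixed mask, so pruned coordinates stay zero throughout the $\kappa_{0}$ local rounds.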

Given the pruned model $\tilde{\mathbf{w}}_{i}^{t,0}\coloneqq\mathbf{w}_{i}^{t}\odot\mathbf{m}_{i}^{t}$, the loss function of the UE is rewritten as

$f_{i}(\tilde{\mathbf{w}}_{i}^{t,0})\coloneqq[1/|\mathcal{D}_{i}|]\sum\nolimits_{(\mathbf{x}_{a},y_{a})\in\mathcal{D}_{i}}\mathrm{l}(\tilde{\mathbf{w}}_{i}^{t,0},\mathbf{x}_{a},y_{a})$. (7)

Each UE updates its pruned model as

$\tilde{\mathbf{w}}_{i}^{t+1}\coloneqq\tilde{\mathbf{w}}_{i}^{t,0}-\eta g(\tilde{\mathbf{w}}_{i}^{t,0})\odot\mathbf{m}_{i}^{t}$. (8)

As such, we denote the loss functions of the $j^{\mathrm{th}}$ VC, $k^{\mathrm{th}}$ sBS, $l^{\mathrm{th}}$ mBS and central server as $f_{j}(\mathbf{w}_{j})\coloneqq\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}f_{i}(\tilde{\mathbf{w}}_{i})$, $f_{k}(\mathbf{w}_{k})\coloneqq\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}f_{j}(\mathbf{w}_{j})$, $f_{l}(\mathbf{w}_{l})\coloneqq\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}f_{k}(\mathbf{w}_{k})$, and $f(\mathbf{w})\coloneqq\sum\nolimits_{l=1}^{L}\alpha_{l}f_{l}(\mathbf{w}_{l})$, respectively. Note that for simplicity, we consider identical weights, i.e., $\alpha_{i}=1/U_{j,k,l}$, $\alpha_{j}=1/V_{k,l}$, $\alpha_{k}=1/B_{l}$ and $\alpha_{l}=1/L$, which can easily be adjusted for other weighting strategies. Besides, since aggregation happens at different times and different levels, we need to capture the time indices explicitly. Let each UE perform $\kappa_{0}$ local iterations before sending the updated model to the associated VC. It is worth noting that the winning ticket and the corresponding binary mask are obtained only before these $\kappa_{0}$ local rounds begin. Besides, since training the original model is costly, it is reasonable to consider $\rho\ll\kappa_{0}$ (in our simulation, we considered $\rho<\kappa_{0}$ and observed that even $\rho=1$ with $\kappa_{0}\geq 5$ performed well). Moreover, although the sBS-mBS and mBS-central server links are wired, communication and computation at these nodes incur additional burdens. As such, we assume that each VC, sBS and mBS performs $\kappa_{1}$, $\kappa_{2}$ and $\kappa_{3}$ rounds, respectively, before sending the trained model to the respective upper layer. Denote the indices of the current global round, mBS round, sBS round, VC round and UE's local round by $m$, $t_{3}$, $t_{2}$, $t_{1}$ and $t_{0}$, respectively. Besides, similar to [13, 9], let $t\coloneqq[\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}+t_{1}]\kappa_{0}+t_{0}$ denote the index of local update iterations.
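The flattened iteration index $t$ simply nests the round counters, as the following one-line sketch verifies; the $\kappa$ values below are illustrative.

```python
# The nested counters (m, t3, t2, t1, t0) flatten into one local-iteration
# index: t = [{(m*k3 + t3)*k2 + t2}*k1 + t1]*k0 + t0.

def flat_index(m, t3, t2, t1, t0, k3, k2, k1, k0):
    return (((m * k3 + t3) * k2 + t2) * k1 + t1) * k0 + t0

# With illustrative k3 = k2 = k1 = 2 and k0 = 5, the first local step of
# the very first round is t = 1, and one full global round spans
# k3*k2*k1*k0 = 40 local iterations.
print(flat_index(0, 0, 0, 0, 1, k3=2, k2=2, k1=2, k0=5))  # -> 1
```

This is the same base-mixed-radix bookkeeping the aggregation rules (9)-(18) use through the shorthands $\bar{t}_{0}$, $\bar{t}_{1}$, $\bar{t}_{2}$ and $\bar{t}_{3}$.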

If $t~\mathrm{mod}~\kappa_{0}=0$, the UE receives the latest available model of its associated VC, i.e.,

$\mathbf{w}_{i}^{\bar{t}_{0}}\leftarrow\mathbf{w}_{j}^{\bar{t}_{0}}$, (9)

where $\bar{t}_{0}=[\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}+t_{1}]\kappa_{0}$. The UE then computes the pruned model $\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}$ and the binary mask $\mathbf{m}_{i}^{\bar{t}_{0}}$. It then performs $\kappa_{0}$ local SGD rounds as

$\tilde{\mathbf{w}}_{i}^{\bar{t}_{0}+\kappa_{0}}=\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}-\eta\sum\nolimits_{t_{0}=1}^{\kappa_{0}}g\big(\tilde{\mathbf{w}}_{i}^{\bar{t}_{0}+t_{0},0}\big)\odot\mathbf{m}_{i}^{\bar{t}_{0}}$. (10)

Each VC $j$ performs $t_{1}=0,\dots,\kappa_{1}-1$ local rounds. When $(t_{1}+1)~\mathrm{mod}~\kappa_{1}=0$, the VC's model is updated by the latest available sBS model, i.e.,

$\mathbf{w}_{j}^{\bar{t}_{1}}\leftarrow\mathbf{w}_{k}^{\bar{t}_{1}}$, (11)

where $\bar{t}_{1}=\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}\kappa_{0}$. Besides, between two VC rounds, the local model of the VC is updated as

$\mathbf{w}_{j}^{\bar{t}_{1}+(t_{1}+1)\kappa_{0}}=\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{\bar{t}_{1}+(t_{1}+1)\kappa_{0}}=\tilde{\mathbf{w}}_{j}^{\bar{t}_{1}+t_{1}\kappa_{0},0}-\eta\sum\nolimits_{i=1}^{U_{j,k,l}}\big[\boldsymbol{1}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}/p_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}\big]\alpha_{i}\sum\nolimits_{t_{0}=1}^{\kappa_{0}}g\big(\tilde{\mathbf{w}}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}+t_{0},0}\big)\odot\mathbf{m}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}$, (12)

where $\tilde{\mathbf{w}}_{j}^{\bar{t}_{1}+t_{1}\kappa_{0},0}\coloneqq\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0},0}$ and $\boldsymbol{1}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}$ is a binary indicator of whether the sBS receives the trained model of client $i$ during the VC aggregation round $\bar{t}_{1}+(t_{1}+1)\kappa_{0}$, defined as follows:

$\boldsymbol{1}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}\coloneqq\begin{cases}1,&\text{with probability }p_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}},\\ 0,&\text{otherwise},\end{cases}$ (13)

where $p_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}$ is the probability of receiving the trained model over the wireless link, calculated in the subsequent section (c.f. (IV-A)). Note that since the sBS has to receive the gradient over the wireless link, we use the binary indicator $\boldsymbol{1}_{i}^{\bar{t}_{1}+t_{1}\kappa_{0}}$ in (12), as is common practice [27, 13].
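The role of the indicator in (12)-(13) can be illustrated numerically: scaling each received update by $1/p_{i}$ keeps the aggregate unbiased despite random wireless failures. The client gradients and success probabilities below are illustrative.

```python
import random

# Sketch of Eqs. (12)-(13): each client's gradient is weighted by
# 1_i / p_i, where 1_i ~ Bernoulli(p_i) models wireless reception.
# The 1/p_i factor makes the aggregate equal, in expectation, to the
# full-participation average.

def vc_aggregate(grads, probs, rng):
    agg = 0.0
    for g, p in zip(grads, probs):
        received = 1 if rng.random() < p else 0    # indicator 1_i
        agg += (received / p) * g / len(grads)     # alpha_i = 1/U_{j,k,l}
    return agg

rng = random.Random(0)
grads, probs = [1.0, 2.0, 3.0], [0.9, 0.8, 0.7]
# Averaging over many rounds approaches the full-participation mean 2.0:
est = sum(vc_aggregate(grads, probs, rng) for _ in range(20000)) / 20000
print(est)
```

The price of unbiasedness is variance: a client with small $p_{i}$ contributes rarely but with a large $1/p_{i}$ weight, which is one way the wireless link quality enters the convergence bound.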

The sBS performs $t_{2}=0,\dots,\kappa_{2}-1$ local rounds before updating its model. When $(t_{2}+1)~\mathrm{mod}~\kappa_{2}=0$, the sBS updates its local model with the latest available model at its associated mBS, i.e.,

$\mathbf{w}_{k}^{\bar{t}_{2}}\leftarrow\mathbf{w}_{l}^{\bar{t}_{2}}$, (14)

where $\bar{t}_{2}=(m\kappa_{3}+t_{3})\kappa_{2}\kappa_{1}\kappa_{0}$. In each sBS round $t_{2}$, the sBS updates its model as

$\mathbf{w}_{k}^{\bar{t}_{2}+(t_{2}+1)\kappa_{1}\kappa_{0}}=\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\mathbf{w}_{j}^{\bar{t}_{2}+(t_{2}+1)\kappa_{1}\kappa_{0}}=\tilde{\mathbf{w}}_{k}^{\bar{t}_{2}+t_{2}\kappa_{1}\kappa_{0},0}-\eta\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{t_{1}=0}^{\kappa_{1}-1}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\frac{\boldsymbol{1}_{i}^{\bar{t}_{2}+\tilde{t}_{2}}}{p_{i}^{\bar{t}_{2}+\tilde{t}_{2}}}\sum_{t_{0}=1}^{\kappa_{0}}g\big(\tilde{\mathbf{w}}_{i}^{\bar{t}_{2}+\tilde{t}_{2}+t_{0},0}\big)\odot\mathbf{m}_{i}^{\bar{t}_{2}+\tilde{t}_{2}}$, (15)

where $\tilde{\mathbf{w}}_{k}^{\bar{t}_{2}+t_{2}\kappa_{1}\kappa_{0},0}\coloneqq\sum_{j=1}^{V_{k,l}}\alpha_{j}\tilde{\mathbf{w}}_{j,k,l}^{\bar{t}_{2}+t_{2}\kappa_{1}\kappa_{0},0}$ and $\tilde{t}_{2}=(t_{2}\kappa_{1}+t_{1})\kappa_{0}$. Similarly, the mBS performs $t_{3}=0,\dots,\kappa_{3}-1$ local rounds before updating its local model with the latest available global model when $(t_{3}+1)~\mathrm{mod}~\kappa_{3}=0$, i.e.,

$\mathbf{w}_{l}^{m\kappa_{3}\kappa_{2}\kappa_{1}\kappa_{0}}\leftarrow\mathbf{w}^{m\kappa_{3}\kappa_{2}\kappa_{1}\kappa_{0}}$. (16)

Moreover, between two mBS rounds, we can write

$\mathbf{w}_{l}^{(m\kappa_{3}+(t_{3}+1))\kappa_{2}\kappa_{1}\kappa_{0}}=\tilde{\mathbf{w}}_{l}^{(m\kappa_{3}+t_{3})\kappa_{2}\kappa_{1}\kappa_{0},0}-\eta\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{t_{2}=0}^{\kappa_{2}-1}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{t_{1}=0}^{\kappa_{1}-1}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\frac{\boldsymbol{1}_{i}^{\bar{t}_{3}}}{p_{i}^{\bar{t}_{3}}}\sum_{t_{0}=1}^{\kappa_{0}}g\big(\tilde{\mathbf{w}}_{i}^{\bar{t}_{3}+t_{0},0}\big)\odot\mathbf{m}_{i}^{\bar{t}_{3}}$, (17)

where $\tilde{\mathbf{w}}_{l}^{(m\kappa_{3}+t_{3})\kappa_{2}\kappa_{1}\kappa_{0},0}\coloneqq\sum_{k=1}^{B_{l}}\alpha_{k}\tilde{\mathbf{w}}_{k}^{(m\kappa_{3}+t_{3})\kappa_{2}\kappa_{1}\kappa_{0},0}$ and $\bar{t}_{3}=[((m\kappa_{3}+t_{3})\kappa_{2}+t_{2})\kappa_{1}+t_{1}]\kappa_{0}$. Finally, the central server performs global aggregation by collecting the updated models from all mBSs as follows:

𝐰(m+1)z=03κz=𝐰~mz=03κz,0ηl=1Lαlt3=0κ31k=1Blαk\displaystyle\mathbf{w}^{(m+1)\prod_{z=0}^{3}\kappa_{z}}=\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}\!\!-\eta\sum\nolimits_{l=1}^{L}\!\!\alpha_{l}\sum\nolimits_{t_{3}=0}^{\kappa_{3}-1}\sum\nolimits_{k=1}^{B_{l}}\!\!\alpha_{k}
t2=0κ21j=1Vk,lαjt1=0κ11i=1Uj,k,lαi𝟏it¯3pit¯3t0=1κ0g(𝐰~it¯3+t0,0)𝐦it¯3,\displaystyle\qquad\sum_{t_{2}=0}^{\kappa_{2}-1}\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\!\!\sum_{t_{1}=0}^{\kappa_{1}-1}\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}\frac{\boldsymbol{1}_{i}^{\bar{t}_{3}}}{p_{i}^{\bar{t}_{3}}}\sum_{t_{0}=1}^{\kappa_{0}}g\big{(}\tilde{\mathbf{w}}_{i}^{\bar{t}_{3}+t_{0},0}\big{)}\odot\mathbf{m}_{i}^{\bar{t}_{3}}, (18)

where 𝐰~mz=03κz,0l=1Lαl𝐰~lmz=03κz,0\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}\coloneqq\sum_{l=1}^{L}\alpha_{l}\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}.

The proposed PHFL process is summarized in Algorithm 1.

Input: Total global round MM, initial model 𝐰0\mathbf{w}^{0}
1 Synchronize all edge devices with the initial model 𝐰0\mathbf{w}^{0}
2 for All global rounds m=0m=0 to M1M-1 do
3       for All mBS rounds t3=0,1,,κ31t_{3}=0,1,\dots,\kappa_{3}-1 do
4             for All sBS rounds t2=0,1,,κ21t_{2}=0,1,\dots,\kappa_{2}-1 do
5                   for All VC rounds t1=0,1,,κ11t_{1}=0,1,\dots,\kappa_{1}-1 do
6                         for i𝒰j,k,li\in\mathcal{U}_{j,k,l} in parallel do
7                               UE receives the latest available model from the associated VC
8                               Compute the binary mask and obtain the winning ticket using the lottery ticket hypothesis
9                               for All local rounds t0=1,2,,κ0t_{0}=1,2,\dots,\kappa_{0} do
10                                     t[{(mκ3+t3)κ2+t2}κ1+t1]κ0+t0t\leftarrow[\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}+t_{1}]\kappa_{0}+t_{0}
11                                     UE updates local model using (8)
12                               end for
13                              
14                         end for
15                        sBS updates VC model using (11) and (12)
16                        
17                   end for
18                  sBS updates local cell model using (14) and (15)
19                  
20             end for
21            mBS updates local cell model using (16) and (17)
22            
23       end for
24      Central server updates global model using (18)
25      
26 end for
Output: Global ML model 𝐰Mz=03κz\mathbf{w}^{M\prod_{z=0}^{3}\kappa_{z}}
Algorithm 1 Pruning-Enabled Hierarchical FL
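To make the nested schedule of Algorithm 1 concrete, the following is a minimal runnable sketch of the four-level aggregation hierarchy. All sizes, the uniform aggregation weights, the quadratic surrogate loss, and the magnitude-pruning rule are illustrative assumptions, not the paper's experimental setup; the wireless model-reception randomness is also omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy and schedule sizes (illustrative assumptions):
L_, B_, V_, U_ = 2, 2, 2, 2          # mBSs, sBSs per mBS, VCs per sBS, UEs per VC
M, K3, K2, K1, K0 = 2, 2, 2, 2, 3    # global, mBS, sBS, VC and local rounds
d, eta = 8, 0.1                      # model dimension and learning rate

def grad(w):
    """Surrogate stochastic gradient of the quadratic loss f(w) = ||w||^2 / 2."""
    return w + 0.01 * rng.standard_normal(w.size)

def prune_mask(w, ratio):
    """Magnitude pruning: keep the largest-|w| entries, zero the rest."""
    mask = np.ones(w.size)
    k = int(ratio * w.size)
    if k > 0:
        mask[np.argsort(np.abs(w))[:k]] = 0.0
    return mask

w_global = rng.standard_normal(d)
for m in range(M):                                       # global rounds
    w_mbs = [w_global.copy() for _ in range(L_)]
    for t3 in range(K3):                                 # mBS rounds
        for l in range(L_):
            w_sbs = [w_mbs[l].copy() for _ in range(B_)]
            for t2 in range(K2):                         # sBS rounds
                for k in range(B_):
                    w_vc = [w_sbs[k].copy() for _ in range(V_)]
                    for t1 in range(K1):                 # VC rounds
                        for j in range(V_):
                            local_models = []
                            for i in range(U_):          # UEs in parallel
                                mask = prune_mask(w_vc[j], ratio=0.25)
                                w = w_vc[j] * mask       # pruned "winning ticket"
                                for t0 in range(K0):     # kappa_0 local SGD steps
                                    w = w - eta * grad(w) * mask
                                local_models.append(w)
                            w_vc[j] = np.mean(local_models, axis=0)  # VC aggregation
                    w_sbs[k] = np.mean(w_vc, axis=0)     # sBS aggregation
            w_mbs[l] = np.mean(w_sbs, axis=0)            # mBS aggregation
    w_global = np.mean(w_mbs, axis=0)                    # global aggregation, cf. (18)
```

Uniform means stand in for the weighted sums with the αi, αj, αk, αl coefficients; the pruned coordinates stay zero throughout a local block because the gradient is masked, mirroring (8).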

III PHFL: Convergence Analysis

III-A Assumptions

We make the following standard assumptions [13, 7, 20, 23, 28]:

  1.

    The loss functions are lower-bounded, i.e., f(𝐰)finff(\mathbf{w})\geq f_{\text{inf}}.

  2.

    The loss functions are β\beta-smooth, i.e., fi(𝐰)fi(𝐰)β𝐰𝐰\|\nabla f_{i}(\mathbf{w})-\nabla f_{i}(\mathbf{w}^{\prime})\|\leq\beta\|\mathbf{w}-\mathbf{w}^{\prime}\|.

  3.

    Mini-batch gradients are unbiased, i.e., 𝔼ξ𝒟i[g(𝐰~i)]=fi(𝐰~i)\mathbb{E}_{\xi\sim\mathcal{D}_{i}}[g(\tilde{\mathbf{w}}_{i})]=\nabla f_{i}(\tilde{\mathbf{w}}_{i}). Besides, the variance of the gradients is bounded, i.e., g(𝐰~i)fi(𝐰~i)2σ2\|g(\tilde{\mathbf{w}}_{i})-\nabla f_{i}(\tilde{\mathbf{w}}_{i})\|^{2}\leq\sigma^{2}.

  4.

    The divergences of the local, VC, sBS, mBS and global loss functions are bounded for all ii, jj, kk, ll and 𝐰\mathbf{w}, i.e.,

    i=1Uj,k,lαifi(𝐰)fj(𝐰)2ϵvc2,\displaystyle\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\|\nabla f_{i}(\mathbf{w})-\nabla f_{j}(\mathbf{w})\|^{2}\leq\epsilon_{\mathrm{vc}}^{2},
    j=1Vk,lαjfj(𝐰)fk(𝐰)2ϵsbs2,\displaystyle\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\|\nabla f_{j}(\mathbf{w})-\nabla f_{k}(\mathbf{w})\|^{2}\leq\epsilon_{\mathrm{sbs}}^{2},
    k=1Blαkfk(𝐰)fl(𝐰)2ϵmbs2,\displaystyle\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\|\nabla f_{k}(\mathbf{w})-\nabla f_{l}(\mathbf{w})\|^{2}\leq\epsilon_{\mathrm{mbs}}^{2},
    l=1Lαlfl(𝐰)f(𝐰)2ϵ2.\displaystyle\sum\nolimits_{l=1}^{L}\alpha_{l}\|\nabla f_{l}(\mathbf{w})-\nabla f(\mathbf{w})\|^{2}\leq\epsilon^{2}.
  5.

    The stochastic gradients are independent of each other in different iterations.

  6.

    The stochastic gradients are bounded, i.e., 𝔼g(𝐰i)2G2\mathbb{E}\|g(\mathbf{w}_{i})\|^{2}\leq G^{2}.

  7.

    The model weights are bounded, i.e., 𝔼𝐰i2D2\mathbb{E}\|\mathbf{w}_{i}\|^{2}\leq D^{2}.

  8.

    The pruning ratio δit[0,δth]\delta_{i}^{t}\in[0,\delta^{\mathrm{th}}], in which 0<δth<10<\delta^{\mathrm{th}}<1 and δth\delta^{\mathrm{th}} is the maximum allowable pruning ratio, satisfies

    δit𝐰it𝐰~it,02/𝐰it2.\delta_{i}^{t}\geq\big{\|}\mathbf{w}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\big{\|}^{2}/\|\mathbf{w}_{i}^{t}\|^{2}. (19)
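Assumption 8 ties the pruning ratio to the relative energy removed from the model. Magnitude pruning, as used with the lottery-ticket procedure, satisfies (19) by construction: the δd smallest-magnitude entries carry at most a δ fraction of the total squared norm. A minimal numerical check (the test vectors and ratios below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def magnitude_prune(w, delta):
    """Zero out the delta-fraction of entries of w with the smallest magnitudes."""
    w_pruned = w.copy()
    k = int(delta * w.size)
    if k > 0:
        w_pruned[np.argsort(np.abs(w))[:k]] = 0.0
    return w_pruned

# The k = delta*d smallest squared entries contribute at most k/d of the total
# energy, hence ||w - w_pruned||^2 / ||w||^2 <= delta, i.e., inequality (19).
w = rng.standard_normal(1000)
for delta in (0.1, 0.3, 0.5):
    w_p = magnitude_prune(w, delta)
    assert np.sum((w - w_p) ** 2) / np.sum(w ** 2) <= delta
```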

Since the updated global, mBS, sBS and VC models are not available in each local iteration tt, similar to standard practice [13, 7, 9], we assume the virtual copies of these models, denoted by 𝐰¯t\bar{\mathbf{w}}^{t}, 𝐰¯lt\bar{\mathbf{w}}_{l}^{t}, 𝐰¯kt\bar{\mathbf{w}}_{k}^{t} and 𝐰¯jt\bar{\mathbf{w}}_{j}^{t}, respectively, are available. Besides, we assume that the bounded divergence assumptions amongst the above loss functions also hold for these virtual models. Moreover, analogous to our previous notations, we express 𝐰¯~t,0l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝐰~it,0=u=1Uαu𝐰~ut,0\tilde{\bar{\mathbf{w}}}^{t,0}\coloneqq\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{t,0}=\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{\mathbf{w}}_{u}^{t,0} and 𝐰¯~0𝐰¯~0,0\tilde{\bar{\mathbf{w}}}^{0}\coloneqq\tilde{\bar{\mathbf{w}}}^{0,0}.

III-B Convergence Analysis

Similar to existing literature [13, 7, 9, 20], we consider the average global gradient norm as the indicator of the proposed PHFL algorithm’s convergence. As such, in the following, we seek a θPHFL\theta_{\mathrm{PHFL}}-suboptimal solution such that 1Tt=0T1u=1Uαufu(𝐰¯~t,0)𝐦ut2θPHFL\frac{1}{T}\sum_{t=0}^{T-1}\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla f_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\odot\mathbf{m}_{u}^{t}\|^{2}\leq\theta_{\mathrm{PHFL}} and θPHFL0\theta_{\mathrm{PHFL}}\geq 0. In particular, we start with Theorem 1, which requires bounding the differences amongst the models in different hierarchical levels. These differences are first calculated in Lemma 1 to Lemma 4 and then plugged into Theorem 1 to get the θPHFL\theta_{\mathrm{PHFL}}-suboptimal bound in Corollary 1.

Theorem 1.

When the assumptions in Section III-A hold and η1/β\eta\leq 1/\beta, we have

θPHFL𝒪(f(𝐰¯~0)finfηT)+𝒪(βησ2U)+𝒪(δthβ2D2)pruningerror+\displaystyle\!\!\theta_{\mathrm{{PHFL}}}\leq\mathcal{O}\bigg{(}\!\!\frac{f(\tilde{\bar{\mathbf{w}}}^{0})-f_{\mathrm{inf}}}{\eta T}\!\!\bigg{)}+\mathcal{O}\bigg{(}\!\!\frac{\beta\eta\sigma^{2}}{\mathrm{U}}\!\bigg{)}+\underbrace{\mathcal{O}\big{(}\!\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}}_{\mathrm{pruning~{}error}}+
𝒪(βηG2φw,0(𝜹,𝐟,𝐏))wirelessfactor+𝒪(β2[L1+L2+L3+L4]),\displaystyle\underbrace{\mathcal{O}\big{(}\beta\eta G^{2}\cdot\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}}_{\mathrm{wireless~{}factor}}+\mathcal{O}\big{(}\beta^{2}\big{[}\mathrm{L}_{1}+\mathrm{L}_{2}+\mathrm{L}_{3}+\mathrm{L}_{4}\big{]}\big{)}, (20)

where 𝛅={δ1t,,δUt}t=0T1\boldsymbol{\delta}=\{\delta_{1}^{t},\dots,\delta_{U}^{t}\}_{t=0}^{T-1}, 𝐟={f1t,,fUt}t=0T1\boldsymbol{\mathrm{f}}=\{\mathrm{f}_{1}^{t},\dots,\mathrm{f}_{U}^{t}\}_{t=0}^{T-1}, 𝐏={P1t,,PUt}t=0T1\boldsymbol{\mathrm{P}}=\{P_{1}^{t},\dots,P_{U}^{t}\}_{t=0}^{T-1} and fit\mathrm{f}_{i}^{t} is the ithi^{\mathrm{th}} client’s central processing unit (CPU) frequency in the wireless factor. Besides, the terms L1\mathrm{L}_{1}, L2\mathrm{L}_{2}, L3\mathrm{L}_{3} and L4\mathrm{L}_{4} are

φw,0(𝜹,𝐟,𝐏)=1Tt=0T1l=1Lαl2k=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2[1pit1],\displaystyle\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\!\!\alpha_{l}^{2}\sum_{k=1}^{B_{l}}\!\!\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}^{2}\bigg{[}\frac{1}{p_{i}^{t}}-1\bigg{]}, (21)
L1=1Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰¯jt𝐰~it2,\displaystyle\mathrm{L}_{1}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}, (22)
L2=[1/T]t=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼𝐰¯kt𝐰¯jt2,\displaystyle\mathrm{L}_{2}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\!\!\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\!\!\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\!\!\alpha_{j}\mathbb{E}\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\|^{2},\!\!\!\! (23)
L3=[1/T]t=0T1l=1Lαlk=1Blαk𝔼𝐰¯lt𝐰¯kt2,\displaystyle\mathrm{L}_{3}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}, (24)
L4=[1/T]t=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2.\displaystyle\mathrm{L}_{4}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\mathbb{E}\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\|^{2}. (25)

The proofs of Theorem 1 and the subsequent lemmas are provided in the supplementary materials.

Remark 1.

The first term in (20) is what we get for centralized learning, while the second term arises from the randomness of the mini-batch gradients [29]. The third term appears from model pruning. Besides, the fourth term arises from the wireless links between the sBSs and UEs. It is worth noting that φw,0(𝛅,𝐟,𝐏)=0\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=0 when all pitp_{i}^{t}’s are 11’s. Finally, the last term is due to the differences between the VC-local, sBS-VC, sBS-mBS and mBS-global model parameters, which are derived in the following.

Remark 2.

When the system has no pruning, i.e., all UEs use the original models, all δit=0\delta_{i}^{t}=0. Besides, under perfect communication between the sBSs and UEs, we have pit=1p_{i}^{t}=1. In such cases, the θPHFL\theta_{\mathrm{PHFL}}-suboptimal bound boils down to

θPHFL\displaystyle\theta_{\mathrm{{PHFL}}} 𝒪((f(𝐰¯0)finf)/[ηT])+𝒪(βησ2/U)+\displaystyle\leq\mathcal{O}\big{(}(f(\bar{\mathbf{w}}^{0})-f_{\mathrm{inf}})/[\eta T]\big{)}+\mathcal{O}\big{(}\beta\eta\sigma^{2}/\mathrm{U}\big{)}+
𝒪(β2[L1+L2+L3+L4]).\displaystyle\qquad\qquad\qquad\qquad\mathcal{O}\big{(}\beta^{2}[\mathrm{L}_{1}+\mathrm{L}_{2}+\mathrm{L}_{3}+\mathrm{L}_{4}]\big{)}. (26)

Besides, the last term in (26) appears from the four hierarchical levels. When U=1\mathrm{U}=1 and there are no levels, i.e., L1=L2=L3=L4=0\mathrm{L}_{1}=\mathrm{L}_{2}=\mathrm{L}_{3}=\mathrm{L}_{4}=0, the convergence bound is exactly the same as that of the original SGD with a non-convex loss function [7].

To this end, we calculate the divergences among the local, VC, sBS, mBS and global model parameters, and derive the corresponding pruning errors at each level in what follows.

Lemma 1.

When η1/[210κ0β]\eta\leq 1/[2\sqrt{10}\kappa_{0}\beta], the average difference between the VC and local model parameters, i.e., the L1\mathrm{L}_{1} term of (20), is upper bounded as

β2Tt=0Tl=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰¯jt𝐰~it2𝒪(κ0η2β2σ2)+\displaystyle\!\!\!\!\!\!\frac{\beta^{2}}{T}\!\!\sum_{t=0}^{T}\sum_{l=1}^{L}\!\!\alpha_{l}\!\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\!\!\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\!\!\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}\!\!\leq\mathcal{O}\big{(}\kappa_{0}\eta^{2}\beta^{2}\sigma^{2}\big{)}+
𝒪(κ02η2β2ϵvc2)+𝒪(δthβ2D2)+𝒪(κ0η2β2G2φw,L1),\displaystyle\!\!\mathcal{O}\big{(}\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)},\!\! (27)

where φw,L1=1Tl=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαit=0T1(1/pit1)\varphi_{\mathrm{w,L}_{1}}\!\!\!=\!\frac{1}{T}\sum_{l=1}^{L}\!\!\alpha_{l}\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}\sum_{t=0}^{T-1}(1/p_{i}^{t}-1).

Remark 3.

In (27), the first term comes from the statistical data heterogeneity, while the second term arises from the divergence between the local and VC loss functions. The third term emanates from model pruning. Finally, the fourth term stems from the wireless links between the UEs and the sBS.

Lemma 2.

When η1/[210κ0κ1β]\eta\leq 1/[2\sqrt{10}\kappa_{0}\kappa_{1}\beta], the average difference between the sBS and VC model parameters, i.e., the L2\mathrm{L}_{2} term of (20), is upper bounded as

β2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼𝐰¯kt𝐰¯jt2𝒪(β4κ04κ12η4ϵvc2)+\displaystyle\!\!\!\!\!\!\!\!\frac{\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\!\!\alpha_{l}\!\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\!\!\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\mathbb{E}\left\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\right\|^{2}\!\!\leq\mathcal{O}\big{(}\beta^{4}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+
𝒪(κ02κ12η2β2ϵsbs2)+𝒪(κ0κ1η2σ2β2)+𝒪(δthβ2D2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\beta^{2}\big{)}+\mathcal{O}\big{(}\delta^{th}\beta^{2}D^{2}\big{)}+
𝒪(κ03κ12β4η4G2φw,L1)+𝒪(κ0κ1β2η2φw,L2),\displaystyle\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\beta^{4}\eta^{4}G^{2}\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\beta^{2}\eta^{2}\varphi_{\mathrm{w,L}_{2}}\big{)}, (28)

where φw,L2=1Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi2(1/pit1)\varphi_{\mathrm{w,L}_{2}}\!\!=\!\!\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\!\!\alpha_{l}\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\!\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}^{2}(1/p_{i}^{t}-1).

Remark 4.

The first term in (28) appears from the divergence of the loss functions of the clients and the VC, while the second term stems from the divergence between the loss functions of the VC and the sBS. The rest of the terms are due to the statistical data heterogeneity, model pruning and wireless links, respectively.

Lemma 3.

When η1/[214κ0κ1κ2β]\eta\leq 1/[2\sqrt{14}\kappa_{0}\kappa_{1}\kappa_{2}\beta], the average difference between the sBS and mBS model parameters, i.e., the L3\mathrm{L}_{3} term of (20), is upper bounded as

[β2/T]t=0T1l=1Lαlk=1Blαk𝔼𝐰¯lt𝐰¯kt2\displaystyle[\beta^{2}/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\right\|^{2}
𝒪(κ03κ12κ22η4β4ϵvc2)+𝒪(κ04κ14κ22η4β4ϵsbs2)+\displaystyle\leq\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+
𝒪(κ02κ12κ22η2β4ϵmbs2)+𝒪(κ0κ1κ2η2β2σ2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}\sigma^{2}\big{)}+
𝒪(δthβ2D2)+𝒪(κ03κ12κ22η4β4G2φw,L1)+\displaystyle\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+
𝒪(κ02κ12κ22β4η4φw,L2)+𝒪(κ0κ1κ2η2β2G2φw,L3).\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}. (29)

where φw,L3=1Tl=1Lαlk=1Blαkj=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1/pit1)\varphi_{\mathrm{w,L}_{3}}\!\!\!=\!\frac{1}{T}\sum_{l=1}^{L}\!\!\alpha_{l}\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}^{2}\sum_{t=0}^{T-1}\!(1\!/\!p_{i}^{t}-1).

Lemma 4.

When η1/[62κ0κ1κ2κ3β]\eta\leq 1/[6\sqrt{2}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta], the average difference between the global and the mBS models, i.e., the L4\mathrm{L}_{4} term of (20), is bounded as follows:

[β2/T]t=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2𝒪(κ04κ12κ22κ32η4β4ϵvc2)+\displaystyle\!\!\!\!\!\![\beta^{2}/T]\!\sum\nolimits_{t=0}^{T-1}\!\!\sum\nolimits_{l=1}^{L}\!\!\alpha_{l}\mathbb{E}\left\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\right\|^{2}\!\!\leq\mathcal{O}\big{(}\!\!\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\!\big{)}+
𝒪(κ04κ14κ22κ32η4β4ϵsbs2)+𝒪(κ04κ14κ24κ32η4β6ϵmbs2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{6}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ02κ12κ22κ32β4η2ϵ2)+𝒪(κ0κ1κ2κ3β2η2σ2)+𝒪(δthβ2D2)+\displaystyle\!\!\mathcal{O}\!(\!\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\!)\!+\!\mathcal{O}(\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}\sigma^{2})\!+\!\mathcal{O}\!\big{(}\!\delta^{\mathrm{th}}\beta^{2}D^{2}\!\big{)}\!+\!
𝒪(κ03κ12κ22κ32η4β4G2φw,L1)+𝒪(κ03κ13κ22κ32β4η4φw,L2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\varphi_{\mathrm{w,L}_{1}}\big{)}\!+\!\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{4}\varphi_{\mathrm{w,L}_{2}}\big{)}+
𝒪(κ03κ13κ23κ32η4β4G2φw,L3)+𝒪(κ0κ1κ2κ3β2η2G2φw,L4),\displaystyle\!\!\mathcal{O}\!(\!\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\varphi_{\mathrm{w,L}_{3}}\!)\!+\!\mathcal{O}\!(\!\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}G^{2}\varphi_{\mathrm{w,L}_{4}}\!),\!\! (30)

where φw,L4=1Tl=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1/pit1)\varphi_{\mathrm{w,L}_{4}}\!\!=\frac{1}{T}\sum_{l=1}^{L}\!\!\alpha_{l}\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}^{2}\!\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}^{2}\!\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}^{2}\!\sum_{t=0}^{T-1}\left(1\!/\!p_{i}^{t}-1\right).

Note that we have similar observations for (29) and (30) as in Remark 4. Now, using the above lemmas, we find the final convergence rate in Corollary 1.

Corollary 1.

When η1/[62κ0κ1κ2κ3β]\eta\leq 1/[6\sqrt{2}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta], the θPHFL\theta_{\mathrm{{PHFL}}} bound of Theorem 1 boils down to

θPHFL\displaystyle\theta_{\mathrm{{PHFL}}} 𝒪([f(𝐰¯~0)finf]/[ηT])+𝒪(βησ2/U)+\displaystyle\leq\mathcal{O}\big{(}[f(\tilde{\bar{\mathbf{w}}}^{0})-f_{\mathrm{inf}}]/[\eta T]\big{)}+\mathcal{O}(\beta\eta\sigma^{2}/\mathrm{U})+
𝒪(κ02η2β2ϵvc2)+𝒪(κ02κ12η2β2ϵsbs2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+
𝒪(κ02κ12κ22η2β4ϵmbs2)+𝒪(κ02κ12κ22κ32β4η2ϵ2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\big{)}+
𝒪(δthβ2D2)pruningerror+𝒪(βηG2φw,0(𝜹,𝐟,𝐏))wirelessfactor.\displaystyle\underbrace{\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}}_{\mathrm{pruning~{}error}}+\underbrace{\mathcal{O}\big{(}\beta\eta G^{2}\cdot\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}}_{\mathrm{wireless~{}factor}}. (31)
Remark 5.

In (31), the third, fourth, fifth and sixth terms appear from the divergence between the client-VC, VC-sBS, sBS-mBS and mBS-global loss functions, respectively.

Remark 6.

In typical HFL with no model pruning, i.e., δut=0\delta_{u}^{t}=0, u𝒰\forall u\in\mathcal{U}, we have 𝒪(δthβ2D2)=0\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}=0. Besides, when the wireless links are ignored, the last term in (31) becomes zero. In such a special case, Corollary 1 boils down to

θPHFL\displaystyle\theta_{\mathrm{{PHFL}}} 𝒪([f(𝐰¯~0)finf]/[ηT])+𝒪(βησ2/U)+\displaystyle\leq\mathcal{O}\big{(}[f(\tilde{\bar{\mathbf{w}}}^{0})-f_{\mathrm{inf}}]/[\eta T]\big{)}+\mathcal{O}(\beta\eta\sigma^{2}/\mathrm{U})+
𝒪(κ02η2β2ϵvc2)+𝒪(κ02κ12η2β2ϵsbs2)+\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+
𝒪(κ02κ12κ22η2β4ϵmbs2)+𝒪(κ02κ12κ22κ32β4η2ϵ2).\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\big{)}. (32)
Remark 7.

When η=U/T\eta=\sqrt{\mathrm{U}/T}, the step-size condition of Corollary 1 requires T72Uκ02κ12κ22κ32β2T\geq 72\mathrm{U}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}. With a sufficiently large TT, when the trained model reception success probability is 11 for all users in all time steps, we have θPHFL𝒪(1/UT)+𝒪(δthβ2D2)\theta_{\mathrm{PHFL}}\approx\mathcal{O}\big{(}1/\sqrt{\mathrm{U}T}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}, where the second term comes from the pruning error. Therefore, the proposed PHFL solution converges to the neighborhood of a stationary point of traditional HFL.

IV Joint Problem Formulation and Solution

Similar to existing literature [3, 4, 8, 27], we ignore the downlink delay in this paper since the sBS can utilize higher spectrum bands and transmission power to broadcast the updated model. Moreover, since the sBS-mBS and mBS-cloud server links are wired, we ignore the transmission delays for these links. (These transmissions happen in the backhaul, and the corresponding delays are quite small; calculating them precisely would also require considering the overall network load, which is beyond the scope of this paper.) Furthermore, since the sBS, mBS and the cloud server usually have high computation power, we also ignore the model aggregation and processing delays. (Adding dd parameters and then taking their average has a time complexity of 𝒪(d+1)\mathcal{O}(d+1); with highly capable CPUs at the sBS, mBS and central server, the corresponding delays are usually small and are therefore ignored in the literature [9, 10, 23].) Therefore, at the beginning of each VC round, i.e., at each local iteration round tt at which t mod κ0=0t\textsl{ mod }\kappa_{0}=0, we first calculate the required computation time for finding the lottery ticket as

ticpd=ρ×(bnciDi/fit),\displaystyle\mathrm{t}_{i}^{\mathrm{cp_{d}}}=\rho\times\left(bnc_{i}\mathrm{D}_{i}/\mathrm{f}_{i}^{t}\right), (33)

where bb is the batch size, nn is the number of batches, cic_{i} is the number of CPU cycles required to process 11 bit of data, Di\mathrm{D}_{i} is the size of each of UE uiu_{i}’s data samples in bits, and fit\mathrm{f}_{i}^{t} is the CPU frequency. Upon finding the pruned model, each client performs κ0\kappa_{0} local iterations, which require the following computation time [23]:

ticps=κ0×(bn(1δit)ciDi/fit).\displaystyle\mathrm{t}_{i}^{\mathrm{cp_{s}}}=\kappa_{0}\times\left(bn(1-\delta_{i}^{t})c_{i}\mathrm{D}_{i}/\mathrm{f}_{i}^{t}\right). (34)

After local training, the UE offloads only the non-zero weights along with the binary mask to the sBS. As such, we calculate the uplink payload size of UE ii as follows (note that one may instead send the non-pruned weights and the corresponding indices, which are unknown until the original initial model is trained for ρ\rho iterations; we consider an upper bound for the uplink payload, which will be used during the joint parameter optimization phase):

sid[1δit](FPP+1)+d,\displaystyle s_{i}\leq d\big{[}1-\delta_{i}^{t}\big{]}\left(\mathrm{FPP}+1\right)+d, (35)

where FPP\mathrm{FPP} is the floating point precision. Note that, in (35), each non-zero weight requires an extra 11 bit to represent its sign, while the additional dd bits carry the binary mask. Therefore, we calculate the uplink payload offloading delay as follows:

tiupd[(1δit)(FPP+1)+1]/rit.\displaystyle\mathrm{t}_{i}^{\mathrm{up}}\leq d\left[\left(1-\delta_{i}^{t}\right)\left(\mathrm{FPP}+1\right)+1\right]/r_{i}^{t}. (36)

As such, UE ii’s total duration for local computing and trained model offloading is

titotticpd+ticps+tiup.\mathrm{t}_{i}^{\mathrm{tot}}\leq\mathrm{t}_{i}^{\mathrm{cp_{d}}}+\mathrm{t}_{i}^{\mathrm{cp_{s}}}+\mathrm{t}_{i}^{\mathrm{up}}. (37)
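The delay model in (33)–(37) is straightforward to evaluate numerically. The sketch below uses arbitrary illustrative values for all system parameters (the model size, CPU frequency, uplink rate, etc. are assumptions, not the paper's simulation settings):

```python
# Illustrative parameter values (assumptions):
d = 1_000_000        # number of model parameters
FPP = 32             # floating-point precision in bits
b, n = 32, 10        # batch size and number of batches
c_i = 20.0           # CPU cycles needed to process 1 bit of data
D_i = 8 * 3072       # size of one data sample in bits
f_i = 1e9            # CPU frequency in Hz
rho, kappa0 = 2, 5   # lottery-ticket search rounds and local iterations
delta_i = 0.4        # pruning ratio
r_i = 5e6            # achievable uplink rate in bits/s

t_cp_d = rho * (b * n * c_i * D_i / f_i)                     # (33) dense training
t_cp_s = kappa0 * (b * n * (1 - delta_i) * c_i * D_i / f_i)  # (34) sparse training
s_i = d * (1 - delta_i) * (FPP + 1) + d                      # (35) uplink payload
t_up = s_i / r_i                                             # (36) offloading delay
t_tot = t_cp_d + t_cp_s + t_up                               # (37) total delay
```

With these numbers the uplink delay dominates the two computation delays, which is exactly why a larger pruning ratio helps satisfy the per-round deadline constraint.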

We now calculate the energy consumption for performing the model training, followed by the required energy for offloading the trained models. First, let us calculate the energy consumption to get the lottery ticket as

eicpd=ρ×0.5ξbnciDi(fit)2,\displaystyle\mathrm{e}_{i}^{\mathrm{cp_{d}}}=\rho\times 0.5\xi bnc_{i}\mathrm{D}_{i}(\mathrm{f}_{i}^{t})^{2}, (38)

where 0.5ξ0.5\xi is the effective capacitance of the UE’s CPU chip. Similarly, we calculate the energy consumption to train κ0\kappa_{0} local iterations using the pruned model as

eicps=κ0×0.5ξbn(1δit)ciDi(fit)2.\displaystyle\mathrm{e}_{i}^{\mathrm{cp_{s}}}=\kappa_{0}\times 0.5\xi bn(1-\delta_{i}^{t})c_{i}\mathrm{D}_{i}(\mathrm{f}_{i}^{t})^{2}. (39)

Moreover, we calculate the uplink payload offloading energy consumption as follows:

eiupd[(1δit)(FPP+1)+1]Pit/rit.\displaystyle\mathrm{e}_{i}^{\mathrm{up}}\leq d\left[\left(1-\delta_{i}^{t}\right)\left(\mathrm{FPP}+1\right)+1\right]P_{i}^{t}/r_{i}^{t}. (40)

Therefore, the total energy consumption is calculated as

eitoteicpd+eicps+eiup.\mathrm{e}_{i}^{\mathrm{tot}}\leq\mathrm{e}_{i}^{\mathrm{cp_{d}}}+\mathrm{e}_{i}^{\mathrm{cp_{s}}}+\mathrm{e}_{i}^{\mathrm{up}}. (41)
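Similarly, the energy model (38)–(41) can be evaluated directly; all numbers below are illustrative assumptions (ξ, the transmit power, and the uplink rate in particular are not taken from the paper):

```python
# Illustrative parameter values (assumptions):
xi = 2e-28           # so that 0.5*xi is the effective capacitance coefficient
d = 1_000_000        # number of model parameters
FPP = 32             # floating-point precision in bits
b, n = 32, 10        # batch size and number of batches
c_i, D_i = 20.0, 8 * 3072   # CPU cycles per bit and sample size in bits
f_i = 1e9            # CPU frequency in Hz
rho, kappa0 = 2, 5   # lottery-ticket search rounds and local iterations
delta_i = 0.4        # pruning ratio
P_i, r_i = 0.2, 5e6  # transmit power in W and uplink rate in bits/s

e_cp_d = rho * 0.5 * xi * b * n * c_i * D_i * f_i**2                     # (38)
e_cp_s = kappa0 * 0.5 * xi * b * n * (1 - delta_i) * c_i * D_i * f_i**2  # (39)
e_up = d * ((1 - delta_i) * (FPP + 1) + 1) * P_i / r_i                   # (40)
e_tot = e_cp_d + e_cp_s + e_up                                           # (41)
```

Note that the offloading energy is simply the offloading time multiplied by the transmit power, so pruning reduces both the delay and energy budgets at once.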

IV-A Problem Formulation

Let tth\mathrm{t}^{\mathrm{th}} denote the duration between two consecutive VC aggregation rounds, i.e., the deadline for completing one VC round. Then, we calculate the probability of successful reception of the UE’s trained model as follows:

pit\displaystyle p_{i}^{t} =Pr{titottth}=Pr{sirit[tthticpdticps]}\displaystyle=\mathrm{Pr}\big{\{}\mathrm{t}_{i}^{\mathrm{tot}}\leq\mathrm{t}^{\mathrm{th}}\big{\}}\!=\!\mathrm{Pr}\big{\{}\!s_{i}\!\leq\!r_{i}^{t}\big{[}\mathrm{t}^{\mathrm{th}}\!\!-\mathrm{t}_{i}^{\mathrm{cp_{d}}}\!\!-\mathrm{t}_{i}^{\mathrm{cp_{s}}}\big{]}\!\!\big{\}}
=Pr{hi,kt[(2χit1)(ωζ2+Ii,kt)/(Pitdi,kα)]}\displaystyle\!\!=\mathrm{Pr}\big{\{}h_{i,k}^{t}\geq\big{[}(2^{\chi_{i}^{t}}-1)(\omega\zeta^{2}+I_{i,k}^{t})/(\mathrm{P}_{i}^{t}d_{i,k}^{-\alpha})\big{]}\big{\}}
=(a)exp[(2χit1)(ωζ2+Ii,kt)/(Pitdi,kα)],\displaystyle\overset{(a)}{=}\exp\big{[}-(2^{\chi_{i}^{t}}-1)(\omega\zeta^{2}+I_{i,k}^{t})/(\mathrm{P}_{i}^{t}d_{i,k}^{-\alpha})\big{]}, (42)

where χit=dfit[(1δit)(FPP+1)+1]ω[fittthbnciDi(ρ+κ0(1δit))]\chi_{i}^{t}=\frac{d\mathrm{f}_{i}^{t}\left[\left(1-\delta_{i}^{t}\right)\left(\mathrm{FPP}+1\right)+1\right]}{\omega\left[\mathrm{f}_{i}^{t}\mathrm{t^{th}}-bnc_{i}\mathrm{D}_{i}\left(\rho+\kappa_{0}(1-\delta_{i}^{t})\right)\right]} and (a)(a) follows from the exponentially distributed channel gains of the Rayleigh fading channels between the UE and the sBS.
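Equation (42) can be evaluated in closed form once the per-round deadline and channel statistics are fixed. The sketch below uses arbitrary illustrative values (the bandwidth, noise level, distance, path-loss exponent, and all other numbers are assumptions):

```python
import math

# Illustrative parameter values (assumptions):
d, FPP = 1_200_000, 32          # model parameters and bits per weight
b, n, c_i, D_i = 32, 10, 20.0, 8 * 784
f_i, delta_i = 1e9, 0.5         # CPU frequency (Hz) and pruning ratio
rho, kappa0 = 2, 5              # lottery-ticket rounds and local iterations
omega = 1e6                     # bandwidth in Hz
t_th = 2.0                      # VC-round deadline in s
zeta2 = 4e-21                   # noise power spectral density in W/Hz
I_ik = 0.0                      # interference power in W
P_i = 0.2                       # transmit power in W
dist, alpha = 500.0, 3.0        # UE-sBS distance in m and path-loss exponent

# Required spectral efficiency chi_i^t so the payload fits in the residual time:
num = d * f_i * ((1 - delta_i) * (FPP + 1) + 1)
den = omega * (f_i * t_th - b * n * c_i * D_i * (rho + kappa0 * (1 - delta_i)))
chi = num / den

# Success probability (42) under Rayleigh fading (exponential channel gain):
gap = (2**chi - 1) * (omega * zeta2 + I_ik) / (P_i * dist**(-alpha))
p_success = math.exp(-gap)
```

Shrinking the deadline, enlarging the model, or reducing the pruning ratio all increase the required spectral efficiency χ and hence drive the success probability toward zero.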

Notice that the pruning ratio δit\delta_{i}^{t}, CPU frequency fit\mathrm{f}_{i}^{t}, transmission power Pit\mathrm{P}_{i}^{t} and the probability of successful model reception pitp_{i}^{t} are intertwined. More specifically, pitp_{i}^{t} depends on δit\delta_{i}^{t}, fit\mathrm{f}_{i}^{t} and Pit\mathrm{P}_{i}^{t}, given that the other parameters remain fixed. As such, we aim to optimize these parameters jointly by considering the controllable terms in our convergence bound in Corollary 1. Therefore, we focus on each VC round, i.e., the local iteration round tt at which t mod κ0=0t\textsl{ mod }\kappa_{0}=0. Specifically, we focus on minimizing the error terms due to pruning and wireless links, which are given by

𝒪(δthβ2D2)+𝒪(βηG2φw,0(𝜹,𝐟,𝐏)).\displaystyle\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\beta\eta G^{2}\cdot\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}. (43)
Remark 8.

In the above expression, the first term appears from the pruning error 12β2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(1+2{αi[1+αj(1+αk{1+αl})]})δit𝐰it2𝒪(δthβ2D2){\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}}\cdot\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{(}1+2\big{\{}\alpha_{i}\big{[}1+\alpha_{j}\big{(}1+\alpha_{k}\big{\{}1+\alpha_{l}\big{\}}\big{)}\big{]}\big{\}}\big{)}\delta_{i}^{t}\|\mathbf{w}_{i}^{t}\|^{2}\leq\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}, while the second term comes from the wireless factor 2βηTl=1Lαl2k=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1/pit1)𝔼g~(𝐰~it,0)2𝒪(βηG2Tl=1Lαl2k=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1/pit1))\frac{2\beta\eta}{T}\sum_{l=1}^{L}\alpha_{l}^{2}\cdot\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(1/p_{i}^{t}-1\right)\mathbb{E}\big{\|}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\big{\|}^{2}\leq\mathcal{O}\big{(}\frac{\beta\eta G^{2}}{T}\sum_{l=1}^{L}\alpha_{l}^{2}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(1/p_{i}^{t}-1\right)\big{)}.

Based on the above observations, we consider a weighted combination of these two terms as our objective function to minimize the bound in (43). Using (42) in the wireless factor, we therefore consider the following objective function:

φt(𝜹t,𝐟t,𝐏t)=ϕ1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαiδit+\displaystyle\!\!\!\!\varphi^{t}(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t})\!\!=\phi_{1}\sum\nolimits_{l=1}^{L}\!\!\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\!\!\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\!\!\alpha_{j}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}+\!\! (44)
ϕ2l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi[exp((2χit1)(ωζ2+Ii,kt)Pitdi,kα)1],\displaystyle\!\!\!\!\phi_{2}\sum_{l=1}^{L}\!\!\alpha_{l}\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\!\!\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\!\!\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}\bigg{[}\exp\bigg{(}\frac{(2^{\chi_{i}^{t}}-1)(\omega\zeta^{2}+I_{i,k}^{t})}{\mathrm{P}_{i}^{t}d_{i,k}^{-\alpha}}\bigg{)}-1\bigg{]},

where ϕ1\phi_{1} and ϕ2\phi_{2} are two weights that strike a balance between the two terms. Note that the wireless factor is multiplied by the learning rate and the gradient bound in (43). Typically, the learning rate is small, and the gradients become smaller as the training progresses. As such, the wireless factor term is relatively small as long as pit>0p_{i}^{t}>0 for all UEs and VC aggregation rounds. The pruning error term, in contrast, scales with the non-negative model weight norms. Furthermore, a larger pruning ratio δit\delta_{i}^{t} can dramatically reduce the computation and offloading time, driving the wireless factor to 0. However, since a higher pruning ratio means more model parameters are pruned, we wish to avoid large δit\delta_{i}^{t}’s in order to limit the pruning-induced errors. These facts suggest putting more weight on the pruning error term so that large δit\delta_{i}^{t}’s are penalized more heavily. As such, we consider ϕ1ϕ2\phi_{1}\gg\phi_{2}. However, in our resource-constrained setting, a small δit\delta_{i}^{t} can prolong the training and offloading time, driving pitp_{i}^{t} to 0, i.e., the sBS will not receive the locally trained model. Therefore, although ϕ2\phi_{2} is small, we keep the wireless factor in the objective to ensure that pitp_{i}^{t} never becomes 0.

Therefore, we pose the following optimization problem to configure the parameters jointly.

$$\underset{\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t}}{\text{minimize}}\quad\varphi^{t}(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t}),\tag{45}$$
$$\text{s.t.}\quad(C1)\quad\mathrm{t}_{i}^{\mathrm{tot}}\leq\mathrm{t}^{\mathrm{th}},\quad\forall i,\tag{45a}$$
$$(C2)\quad\mathrm{e}_{i}^{\mathrm{tot}}\leq\mathrm{e}_{i}^{\mathrm{th}},\quad\forall i,\tag{45b}$$
$$(C3)\quad 0\leq\mathrm{f}_{i}^{t}\leq\mathrm{f}_{i}^{\mathrm{max}},\quad\forall i,\tag{45c}$$
$$(C4)\quad 0\leq\mathrm{P}_{i}^{t}\leq P_{i}^{\mathrm{max}},\quad\forall i,\tag{45d}$$
$$(C5)\quad 0\leq\delta_{i}^{t}\leq\delta^{\mathrm{th}},\quad\forall i,\tag{45e}$$

where constraint $(C1)$ ensures that one VC round is completed within the required deadline, and constraint $(C2)$ keeps the energy expense within the allowable budget. Besides, $(C3)$ and $(C4)$ restrict each UE's CPU frequency and transmission power to lie between $0$ and its respective maximum. Finally, constraint $(C5)$ keeps the pruning ratio within a tolerable limit $\delta^{\mathrm{th}}$.

Remark 9.

We assume that the clients' system configurations remain unchanged over time, while the channel state information (CSI) is dynamic and known at the sBS. The clients share their system configurations with their associated sBSs, and the sBSs share their respective users' system configurations and CSI with the central server. As such, problem (45) is solved centrally, and the optimized parameters are broadcast to the clients. Note that problem (45) is non-convex due to the multiplications and divisions of the optimization variables in the second term. Moreover, constraints $(C1)$ and $(C2)$ are not convex. Therefore, it is difficult to solve the original problem directly. In the following, we transform it into an approximate convex problem that can be solved efficiently.

IV-B Problem Transformation

Let us define $A(\delta_{i}^{t},\mathrm{f}_{i}^{t},\mathrm{P}_{i}^{t})\coloneqq\exp\big[(2^{\chi_{i}^{t}}-1)(\omega\zeta^{2}+I_{i,k}^{t})/(\mathrm{P}_{i}^{t}d_{i,k}^{-\alpha})\big]$. Given an initial feasible point $(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})$, we perform a first-order (linear) approximation of this non-convex expression as follows:

$$A(\delta_{i}^{t},\mathrm{f}_{i}^{t},\mathrm{P}_{i}^{t})\approx A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})+\nabla_{\delta_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big](\delta_{i}^{t}-\delta_{i}^{t,q})+\nabla_{\mathrm{f}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big](\mathrm{f}_{i}^{t}-\mathrm{f}_{i}^{t,q})+\nabla_{\mathrm{P}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big](\mathrm{P}_{i}^{t}-\mathrm{P}_{i}^{t,q})\coloneqq\tilde{A}(\delta_{i}^{t},\mathrm{f}_{i}^{t},\mathrm{P}_{i}^{t}),\tag{46}$$

where $A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})=\exp\big[(2^{\chi_{i}^{t,q}}-1)(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})/(\mathrm{P}_{i}^{t,q}d_{i,k}^{-\alpha})\big]$, $\chi_{i}^{t,q}=\frac{d\,\mathrm{f}_{i}^{t,q}\left[\left(1-\delta_{i}^{t}\right)\left(\mathrm{FPP}+1\right)+1\right]}{\omega\left[\mathrm{f}_{i}^{t,q}\mathrm{t^{th}}-bnc_{i}\mathrm{D}_{i}\big(\rho+\kappa_{0}(1-\delta_{i}^{t,q})\big)\right]}$ and $\tilde{I}_{i,k}^{t}=\sum_{l=1}^{L}\sum_{k^{\prime}=1,k^{\prime}\neq k}^{K}\sum_{j^{\prime}=1}^{J_{k^{\prime},l}}\sum_{u_{i^{\prime}}\in\mathcal{U}_{j^{\prime},k^{\prime},l}}\mathrm{P}_{i^{\prime}}^{t,q}h_{i^{\prime},k^{\prime}}d_{i^{\prime},k^{\prime}}^{-\alpha}$. Moreover, $\nabla_{\delta_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]$, $\nabla_{\mathrm{f}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]$ and $\nabla_{\mathrm{P}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]$ are calculated in (47), (48) and (49), respectively.

$$\nabla_{\delta_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]=\frac{\ln(2)2^{\chi_{i}^{t,q}}d\mathrm{f}_{i}^{t,q}A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})\big[bnc_{i}\mathrm{D}_{i}\big(\rho(\mathrm{FPP}+1)-\kappa_{0}\big)-\mathrm{f}_{i}^{t,q}\mathrm{t^{th}}(\mathrm{FPP}+1)\big]}{\omega\mathrm{P}_{i}^{t,q}d_{i,k}^{-\alpha}\big[\mathrm{f}_{i}^{t,q}\mathrm{t^{th}}-bnc_{i}\mathrm{D}_{i}\big(\rho+\kappa_{0}(1-\delta_{i}^{t,q})\big)\big]^{2}}.\tag{47}$$
$$\nabla_{\mathrm{f}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]=\frac{-\ln(2)2^{\chi_{i}^{t,q}}bnc_{i}d\mathrm{D}_{i}A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})\big[(1-\delta_{i}^{t,q})(\mathrm{FPP}+1)+1\big]\big(\rho+\kappa_{0}[1-\delta_{i}^{t,q}]\big)}{\omega\mathrm{P}_{i}^{t,q}d_{i,k}^{-\alpha}\big[\mathrm{f}_{i}^{t,q}\mathrm{t^{th}}-bnc_{i}\mathrm{D}_{i}\big(\rho+\kappa_{0}(1-\delta_{i}^{t,q})\big)\big]^{2}}.\tag{48}$$
$$\nabla_{\mathrm{P}_{i}^{t}}\big[A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})\big]=-\frac{A(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q},\mathrm{P}_{i}^{t,q})(2^{\chi_{i}^{t,q}}-1)(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})}{(\mathrm{P}_{i}^{t,q})^{2}d_{i,k}^{-\alpha}}.\tag{49}$$
$$\mathrm{t}_{i}^{\mathrm{up}}\approx\frac{d\big(2+\mathrm{FPP}-(1+\mathrm{FPP})\delta_{i}^{t}\big)}{\omega\log_{2}\big(1+\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}/[\omega\zeta^{2}+\tilde{I}_{i,k}^{t}]\big)}-\frac{\ln(2)dh_{i,k}d_{i,k}^{-\alpha}\big[(1-\delta_{i}^{t,q})(\mathrm{FPP}+1)+1\big](\mathrm{P}_{i}^{t}-\mathrm{P}_{i}^{t,q})}{\omega\big\{\ln\big(1+[\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}]/[\omega\zeta^{2}+\tilde{I}_{i,k}^{t}]\big)\big\}^{2}\big(\omega\zeta^{2}+\tilde{I}_{i,k}^{t}+\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}\big)}\coloneqq\tilde{\mathrm{t}}_{i}^{\mathrm{up}}.\tag{50}$$
$$\mathrm{e}_{i}^{\mathrm{up}}\approx\frac{d\mathrm{P}_{i}^{t,q}\big[(\mathrm{FPP}+2)-(\mathrm{FPP}+1)\delta_{i}^{t}\big]}{\omega\log_{2}\big(1+\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}/(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})\big)}+\frac{d\big[(1-\delta_{i}^{t,q})(\mathrm{FPP}+1)+1\big]\Big[\log_{2}\Big(1+\frac{\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}}{\omega\zeta^{2}+\tilde{I}_{i,k}^{t}}\Big)-\frac{\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}}{\ln(2)(\omega\zeta^{2}+\tilde{I}_{i,k}^{t}+\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha})}\Big]}{\omega\big\{\log_{2}\big(1+\mathrm{P}_{i}^{t,q}h_{i,k}d_{i,k}^{-\alpha}/(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})\big)\big\}^{2}}(\mathrm{P}_{i}^{t}-\mathrm{P}_{i}^{t,q})\coloneqq\tilde{\mathrm{e}}_{i}^{\mathrm{up}}.\tag{51}$$
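As a quick sanity check on these derivatives, the closed form in (49) can be compared against a finite difference. The sketch below uses an assumed lumped constant $K$ standing in for $(2^{\chi_{i}^{t,q}}-1)(\omega\zeta^{2}+\tilde{I}_{i,k}^{t})/d_{i,k}^{-\alpha}$ (an illustrative value, not from the paper's simulation), so that $A(\mathrm{P})=\exp(K/\mathrm{P})$ and $\nabla_{\mathrm{P}}A=-A\,K/\mathrm{P}^{2}$:

```python
import math

# Assumed lumped constant: K stands in for (2^chi - 1)(omega*zeta^2 + I_tilde)/d^{-alpha}
K = 0.8
P_q = 0.5  # current transmit-power iterate P_i^{t,q}

A = lambda P: math.exp(K / P)        # A viewed as a function of P only
grad_P = -A(P_q) * K / P_q ** 2      # closed form, matching the structure of (49)

# Finite-difference check of the closed-form derivative
eps = 1e-6
fd = (A(P_q + eps) - A(P_q - eps)) / (2 * eps)

# First-order expansion as in (46), evaluated slightly away from P_q
approx = A(P_q) + grad_P * (0.55 - P_q)
```

The finite difference agrees with the closed form, and the linearization stays accurate in a neighborhood of the expansion point, which is exactly what the SCA update relies on.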

 

As such, we approximate (44) as follows:

$$\tilde{\varphi}^{t}(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t})=\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big(\phi_{1}\delta_{i}^{t}+\phi_{2}\big[\tilde{A}(\delta_{i}^{t},\mathrm{f}_{i}^{t},\mathrm{P}_{i}^{t})-1\big]\big),\tag{52}$$

where $\tilde{A}(\delta_{i}^{t},\mathrm{f}_{i}^{t},\mathrm{P}_{i}^{t})$ is calculated in (46).

We now focus on the non-convex constraints. First, let us approximate the local pruned model computation time as

$$\mathrm{t}_{i}^{\mathrm{cp_{s}}}\approx\frac{\kappa_{0}bnc_{i}\mathrm{D}_{i}}{\mathrm{f}_{i}^{t,q}}\bigg(1-\delta_{i}^{t}-\frac{(1-\delta_{i}^{t,q})(\mathrm{f}_{i}^{t}-\mathrm{f}_{i}^{t,q})}{\mathrm{f}_{i}^{t,q}}\bigg)\coloneqq\tilde{\mathrm{t}}_{i}^{\mathrm{cp_{s}}}.\tag{53}$$
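One can verify numerically that (53) is exactly the first-order Taylor expansion of $\mathrm{t}_{i}^{\mathrm{cp_{s}}}=\kappa_{0}bnc_{i}\mathrm{D}_{i}(1-\delta_{i}^{t})/\mathrm{f}_{i}^{t}$ around $(\delta_{i}^{t,q},\mathrm{f}_{i}^{t,q})$. A short sketch with an assumed lumped constant $C=\kappa_{0}bnc_{i}\mathrm{D}_{i}$ (illustrative values only):

```python
C = 2.0e9            # assumed lumped constant kappa_0*b*n*c_i*D_i (CPU cycles)
dq, fq = 0.3, 2.0e9  # expansion point: current pruning ratio and CPU frequency (Hz)

def t_exact(delta, f):
    """Exact pruned-model computation time C(1 - delta)/f."""
    return C * (1.0 - delta) / f

def t_tilde(delta, f):
    """Linear surrogate in the form of (53)."""
    return (C / fq) * (1.0 - delta - (1.0 - dq) * (f - fq) / fq)

def t_taylor(delta, f):
    """First-order Taylor expansion of t_exact around (dq, fq)."""
    t0 = C * (1.0 - dq) / fq
    return t0 - (C / fq) * (delta - dq) - (C * (1.0 - dq) / fq**2) * (f - fq)
```

Expanding the two expressions term by term shows they coincide identically, so the surrogate inherits the usual second-order accuracy of a Taylor linearization near the current iterate.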

Then, we approximate the non-convex uplink model offloading delay as shown in (50). Using a similar treatment, we write

$$\mathrm{e}_{i}^{\mathrm{cp_{s}}}\approx\kappa_{0}\xi bnc_{i}\mathrm{D}_{i}\mathrm{f}_{i}^{t,q}\big[(\delta_{i}^{t,q}-0.5)\mathrm{f}_{i}^{t,q}-0.5\,\mathrm{f}_{i}^{t,q}\delta_{i}^{t}+(1-\delta_{i}^{t,q})\mathrm{f}_{i}^{t}\big]\coloneqq\tilde{\mathrm{e}}_{i}^{\mathrm{cp_{s}}}.\tag{54}$$

Similarly, we approximate the energy consumption for model offloading as shown in (51).

Therefore, we pose the following transformed problem

$$\underset{\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t}}{\text{minimize}}\quad\tilde{\varphi}^{t}(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t}),\tag{55}$$
$$\text{s.t.}\quad(\tilde{C}1)\quad\tilde{\mathrm{t}}_{i}^{\mathrm{cp_{d}}}+\tilde{\mathrm{t}}_{i}^{\mathrm{cp_{s}}}+\tilde{\mathrm{t}}_{i}^{\mathrm{up}}\leq\mathrm{t}^{\mathrm{th}},\tag{55a}$$
$$(\tilde{C}2)\quad\mathrm{e}_{i}^{\mathrm{cp_{d}}}+\tilde{\mathrm{e}}_{i}^{\mathrm{cp_{s}}}+\tilde{\mathrm{e}}_{i}^{\mathrm{up}}\leq\mathrm{e}_{i}^{\mathrm{th}},\tag{55b}$$
$$(45\text{c}),\ (45\text{d}),\ (45\text{e}),\tag{55c}$$

where the constraints are taken for the same reasons as in the original problem. Besides, $\tilde{\mathrm{t}}_{i}^{\mathrm{cp_{d}}}=2\rho bnc_{i}\mathrm{D}_{i}/\mathrm{f}_{i}^{t,q}-\rho bnc_{i}\mathrm{D}_{i}\mathrm{f}_{i}^{t}/(\mathrm{f}_{i}^{t,q})^{2}$.

Note that problem (55) is now convex and can be solved iteratively using existing solvers such as CVX [30]. The key steps of our iterative solution are summarized in Algorithm 2. Moreover, since (55) has $3\mathrm{U}$ decision variables and $5\mathrm{U}$ constraints, the time complexity of running Algorithm 2 for $Q$ iterations is $\mathcal{O}\left(Q\times[(3\mathrm{U})^{3}\times 5\mathrm{U}]\right)$ [31]. While Algorithm 2 yields a suboptimal solution and converges to a stationary solution set of the original problem (45), SCA-based solutions are well known for fast convergence [32]. Moreover, our extensive empirical study in the sequel suggests that the proposed PHFL solution with Algorithm 2 delivers performance nearly identical to that of the upper bound.

Input: Initial feasible set $(\boldsymbol{\delta}^{t,0},\boldsymbol{\mathrm{f}}^{t,0},\boldsymbol{\mathrm{P}}^{t,0})$, maximum iteration count $Q$, precision level $\epsilon^{\mathrm{prec}}$; set $q=0$
1 Repeat:
2 Solve (55) at the point $(\boldsymbol{\delta}^{t,q},\boldsymbol{\mathrm{f}}^{t,q},\boldsymbol{\mathrm{P}}^{t,q})$ to obtain the optimized $(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t})$
3 $q\leftarrow q+1$; $\boldsymbol{\delta}^{t,q}\leftarrow\boldsymbol{\delta}^{t}$; $\boldsymbol{\mathrm{f}}^{t,q}\leftarrow\boldsymbol{\mathrm{f}}^{t}$; $\boldsymbol{\mathrm{P}}^{t,q}\leftarrow\boldsymbol{\mathrm{P}}^{t}$
4 Until convergence with precision $\epsilon^{\mathrm{prec}}$ or $q=Q$
Output: Optimal $(\boldsymbol{\delta}^{t},\boldsymbol{\mathrm{f}}^{t},\boldsymbol{\mathrm{P}}^{t})$
Algorithm 2: Iterative Joint Pruning Ratio, CPU Frequency and Transmission Power Selection Process
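To illustrate how an iteration of Algorithm 2 behaves, the following is a minimal numpy sketch on a toy single-client instance of the objective in (44). The weights obey $\phi_{1}\gg\phi_{2}$ as argued above, but the constants C1 and C2, the box bounds, and the quadratic proximal damping factor mu are all illustrative assumptions; the paper's actual implementation solves the convexified problem (55) with CVX.

```python
import numpy as np

PHI1, PHI2 = 10.0, 1.0          # phi_1 >> phi_2, per the weighting argument
C1, C2 = 5.0, 0.5               # assumed payload/channel constants (toy values)
LO = np.array([0.0, 0.1])       # lower bounds: [delta_min, P_min], (C5)/(C4)-style box
HI = np.array([0.9, 1.0])       # upper bounds: [delta^th,  P_max]

def phi(delta, P):
    """Toy per-client objective mimicking (44): pruning error + wireless factor."""
    A = np.exp((2.0 ** (C1 * (1.0 - delta)) - 1.0) * C2 / P)
    return PHI1 * delta + PHI2 * (A - 1.0)

def grad(x, eps=1e-6):
    """Central-difference gradient of phi at x = [delta, P]."""
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        g[i] = (phi(*(x + e)) - phi(*(x - e))) / (2.0 * eps)
    return g

# Damped successive-approximation loop in the spirit of Algorithm 2: linearize
# the non-convex term at the current iterate and minimize the surrogate plus a
# quadratic proximal term, whose box-constrained minimizer is a clipped step.
x, mu, Q = np.array([0.5, 0.5]), 50.0, 300
for q in range(Q):
    x_new = np.clip(x - grad(x) / mu, LO, HI)
    if np.max(np.abs(x_new - x)) < 1e-6:
        x = x_new
        break
    x = x_new
x_opt = x
```

The proximal term is a common damping choice that keeps successive iterates close; here the iterate settles at a pruning ratio that balances the pruning-error weight against the exponential wireless factor, while the transmit power is pushed to its maximum.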

V Simulation Results and Discussions

V-A Simulation Setting

For the performance evaluation, we consider $L=2$, $B=4$ and $\mathrm{U}=48$. We let each sBS maintain $2$ VCs, where each VC has $6$ UEs. In other words, we have $U_{j,k,l}=6$, $V_{k,l}=2$ and $B_{l}=2$, $\forall j,k,l$. We assume $\omega=1$ megahertz (MHz). We randomly generate the maximum transmission power $P^{\mathrm{max}}$, the energy budget for each VC aggregation round $\mathrm{e^{th}}$, the maximum CPU frequency $\mathrm{f^{max}}$ and the required CPU cycles to process per-bit data $c$, respectively, from $[23,30]$ dBm, $[10,13]$ Joules, $[1.8,2.8]$ gigahertz (GHz) and $[20,25]$ for these two VCs. Therefore, all UEs in a VC share the above randomly generated system configurations. Note that our approach can easily be extended so that each client has its own random $\mathrm{f}_{i}^{\mathrm{max}}$, $\mathrm{e}_{i}^{\mathrm{th}}$ and $P_{i}^{\mathrm{max}}$; this is practical since these parameters depend on the clients' manufacturers and their specific models. Moreover, as described earlier in Section II, our proposed PHFL has $4$ tiers, namely ($1$) UE-VC, ($2$) VC-sBS, ($3$) sBS-mBS, and ($4$) mBS-central server.

For our ML task, we use image classification with the popular CIFAR-10 and CIFAR-100 datasets [33] for performance evaluation. We use the symmetric Dirichlet distribution $\mathrm{Dir}(\bar{\alpha})$ with concentration parameter $\bar{\alpha}$ for the non-IID data distribution, as commonly used in the literature [4, 27]. Besides, we use $1)$ a convolutional neural network (CNN), $2)$ residual network (ResNet)-18 [34] and $3)$ ResNet-34 [34]. The CNN model has the following architecture: Conv2d(3,128), MaxPool2d, Conv2d(128,64), MaxPool2d, Linear(256,256), Linear(256,#Labels), whereas the ResNets have the same architecture as in the original paper [34]. Moreover, the total number of trainable parameters depends on various configurations, such as the input/output shapes, kernel sizes, strides, etc. In our implementation, the original CNN, ResNet-18 and ResNet-34 models, respectively, have 151,882; 6,992,138 and 12,614,794 trainable parameters on CIFAR-10, and 175,012; 7,038,308 and 12,660,964 trainable parameters on CIFAR-100. Besides, with $\mathrm{FPP}=32$, we have wireless payloads of about 5.01 megabits (Mb), 230.7 Mb and 416.3 Mb on CIFAR-10, and 5.8 Mb, 232.3 Mb and 417.8 Mb on CIFAR-100, for the three respective original models.
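The quoted payload sizes are consistent with a per-parameter cost of $\mathrm{FPP}+1$ bits (one plausible accounting, a 32-bit value plus one extra bit per parameter; this reading is our inference from the quoted figures rather than something stated explicitly in the text). A small sketch reproducing the numbers:

```python
FPP = 32  # full-precision bits per (unpruned) model parameter

def payload_mb(num_params: int, fpp: int = FPP) -> float:
    """Wireless payload in megabits, assuming (FPP + 1) bits per parameter
    (an assumed accounting, consistent with the quoted figures)."""
    return num_params * (fpp + 1) / 1e6

# Trainable-parameter counts from the text (CIFAR-10, CIFAR-100)
models = {
    "CNN":       (151_882, 175_012),
    "ResNet-18": (6_992_138, 7_038_308),
    "ResNet-34": (12_614_794, 12_660_964),
}
for name, (d10, d100) in models.items():
    print(f"{name}: {payload_mb(d10):.2f} Mb (CIFAR-10), "
          f"{payload_mb(d100):.2f} Mb (CIFAR-100)")
```

This makes clear why even modest pruning ratios translate into multi-megabit savings per offloading round for the ResNets.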

Figure 2: CDF of clients' pruning ratios in different VCs for different $\mathrm{t^{th}}$ with different ML models. (a) CDF of $\delta_{u}^{t}$'s with CNN; (b) CDF of $\delta_{u}^{t}$'s with ResNet-18; (c) CDF of $\delta_{u}^{t}$'s with ResNet-34.

V-B Performance Study

First, we investigate the pruning ratios $\delta_{i}^{t}$'s in different VCs. When the system configurations remain the same, the pruning ratio depends on the deadline threshold $\mathrm{t^{th}}$. More specifically, a larger deadline allows a client to prune fewer model parameters, provided that the energy constraint is satisfied. Intuitively, less pruning leaves a bulkier model that takes a longer training time. The CNN model is shallow compared to the ResNets: the original non-pruned ResNet-18 and ResNet-34 models have about $46$ times and $83$ times the trainable parameters of the CNN model, respectively, on CIFAR-10. Therefore, the clients require a larger $\mathrm{t^{th}}$ for local training and trained model offloading as the number of trainable parameters increases.

Intuitively, given a fixed $\mathrm{t^{th}}$, the clients need to prune more model parameters of a bulky model in order to meet the deadline and energy constraints. Our simulation results show that this general intuition holds in determining the $\delta_{i}^{t}$'s, as shown in Fig. 2, which presents the cumulative distribution function (CDF) of the $\delta_{i}^{t}$'s in different VCs. It is worth noting that the pruning ratios $\delta_{i}^{t}$'s in each VC aggregation round are not deterministic due to the randomness of the wireless channels: the optimized variables are obtained by solving (55), which depends on the realizations of the wireless channels. Then, for a given VC $j$, we generate the plot by calculating $\frac{\sum_{l=1}^{L}\sum_{k=1}^{B_{l}}\sum_{i=1}^{U_{j,k,l}}\mathbf{1}\left(\delta_{i}^{t}\leq\delta\right)}{\sum_{l=1}^{L}\sum_{k=1}^{B_{l}}U_{j,k,l}}$, where $\mathbf{1}\left(\delta_{i}^{t}\leq\delta\right)$ is an indicator function that takes value $1$ if $\delta_{i}^{t}\leq\delta$ and $0$ otherwise. With the CNN model, about $50\%$ of the clients have a $\delta_{i}^{t}$ less than $0.23$, $0.43$, $0.58$ and $0.72$ in VC-$0$ in all cells, for the $1.3$ s, $1$ s, $0.8$ s and $0.6$ s deadline thresholds, respectively, in Fig. 2(a). Note that we use $\delta^{\mathrm{th}}=0.9$, i.e., the clients can prune up to $90\%$ of the neurons. Moreover, we consider $\mathrm{t^{th}}=4$ s and $\mathrm{t^{th}}=6$ s to make the problem feasible for all clients for the ResNet-18 and ResNet-34 models, respectively. Furthermore, from Figs. 2(a)-2(c), it is clear that the UEs in VC-$1$ prune slightly fewer model parameters than the UEs in VC-$0$, even though the maximum CPU frequency $\mathrm{f^{max}}$ of the UEs in VC-$0$ is $2.58$ GHz, about $6.22\%$ higher than that of the UEs in VC-$1$.
However, due to the wireless payloads in the offloading phase, the clients' transmission powers can also influence the $\delta_{i}^{t}$'s. In our setting, the UEs' maximum transmission powers are $0.35$ Watt and $0.95$ Watt in VC-$0$ and VC-$1$, respectively. As such, under similar wireless channels, the UEs in VC-$1$ can offload much faster than the UEs in VC-$0$. These observations thus indicate that the trained model offloading time $\mathrm{t}_{i}^{\mathrm{up}}$ dominates the total time $\mathrm{t}_{i}^{\mathrm{tot}}$ to finish one VC round.
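The CDF curves in Fig. 2 follow directly from the indicator-function average describedded above; a minimal numpy sketch (with synthetic pruning ratios, since the actual $\delta_{i}^{t}$'s come from solving (55) under the realized channels):

```python
import numpy as np

def empirical_cdf(deltas, grid):
    """Fraction of clients whose pruning ratio is <= each grid value,
    i.e., the indicator-function average used to build Fig. 2."""
    deltas = np.asarray(deltas)
    return np.array([(deltas <= x).mean() for x in grid])

rng = np.random.default_rng(0)
deltas = rng.uniform(0.0, 0.9, size=48)   # U = 48 clients, delta^th = 0.9 (synthetic)
grid = np.linspace(0.0, 0.9, 19)
cdf = empirical_cdf(deltas, grid)
```

Reading, e.g., the grid value where the curve crosses $0.5$ recovers statements like "about $50\%$ of the clients have a $\delta_{i}^{t}$ below $0.23$."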

Figure 3: CDF of clients' $\mathrm{e}_{i}^{\mathrm{tot}}$'s in different VCs for different $\mathrm{t^{th}}$ with different ML models. (a) CDF of $\mathrm{e}_{i}^{\mathrm{tot}}$'s with CNN; (b) CDF of $\mathrm{e}_{i}^{\mathrm{tot}}$'s with ResNet-18; (c) CDF of $\mathrm{e}_{i}^{\mathrm{tot}}$'s with ResNet-34.

When it comes to energy expense, from (39) and (40), it is obvious that a high $\delta_{i}^{t}$ leads to less energy consumption for both training and offloading. However, the total energy expense of a client boils down to the dominating factor between the energy required for computation and for trained model offloading, due to the interplay between the wireless and learning parameters. Particularly, with $\omega=1$ MHz, the clients can offload the CNN model quickly, making the computational energy consumption the dominating factor. The ResNets, on the other hand, have huge wireless overheads, making the offloading time and energy the dominating factors. This is also observed in Fig. 3, which shows the CDF of the clients' energy expense $\mathrm{e}_{i}^{\mathrm{tot}}$, calculated in (41), in each VC, generated following a similar strategy as in Fig. 2. When the CNN model is used, the total energy cost of the clients in VC-$0$ is larger than that of the clients in VC-$1$, even though they prune more parameters, since the larger $\mathrm{f}_{i}^{\mathrm{max}}$'s of the clients in VC-$0$ lead to higher computational energy costs. This, however, changes for the ResNets, since the wireless communication burden dominates the computation burden. The clients in VC-$1$ can use their higher $P_{i}^{\mathrm{max}}$'s to reduce the offloading time when determining the pruning ratios $\delta_{i}^{t}$'s. As such, the total energy expenses of the clients in VC-$1$ are much larger than those in VC-$0$. Our simulation results in Fig. 3(a) and Figs. 3(b)-3(c) reveal these trends.

Figure 4: Trade-offs between test accuracies and required bandwidth for different $\mathrm{t^{th}}$'s with different ML models. (a) Test accuracy with CNN; (b) test accuracy with ResNet-18; (c) test accuracy with ResNet-34; (d) required bandwidth with CNN; (e) required bandwidth with ResNet-18; (f) required bandwidth with ResNet-34.

Now, we observe the impact of the $\delta_{i}^{t}$'s on the test accuracies and the bandwidth required for trained model offloading. Intuitively, if the model is shallow, pruning makes it even shallower. Therefore, the test performance can deteriorate as the $\delta_{i}^{t}$'s increase for a shallow model. On the other hand, for a bulky model, pruning may have a less severe effect. Specifically, under the deadline and energy constraints, pruning may eventually help, because pruning a few neurons leaves a shallower but still reasonably well-constructed model that can be trained more efficiently. Moreover, our convergence bound in (1) clearly shows that increasing the $\delta_{i}^{t}$'s decreases the convergence speed. However, the wireless payload is directly related to the $\delta_{i}^{t}$'s, as shown in (35): the payload is a decreasing function of the $\delta_{i}^{t}$'s. As such, increasing the deadline threshold $\mathrm{t^{th}}$ should decrease the $\delta_{i}^{t}$'s but significantly increase the wireless payload size.

Our simulation results in Fig. 4 also reveal the above trends. Note that, in Fig. 4 and the subsequent figures, the (solid/dashed) lines are the averages of $4$ independent simulation trials using the configurations mentioned in Section V-A, while the shaded strips show the corresponding standard deviations. Particularly, we observe that the CNN model is largely affected by a small $\mathrm{t^{th}}$, because it leads to large $\delta_{i}^{t}$'s, which prune many neurons of the already shallow model. On the other hand, the bulky ResNets exhibit only a small performance degradation when $\mathrm{t^{th}}$ decreases. Moreover, compared to the original non-pruned counterparts, the performance difference is small if $\mathrm{t^{th}}$ is selected appropriately, as shown in Figs. 4(a)-4(c). However, even a slight increase in the $\delta_{i}^{t}$'s reduces the pruned model size $d_{p}$ and thus the wireless payload, which can save a large portion of the bandwidth, as observed in Figs. 4(d)-4(f). For example, with the CNN model and $\mathrm{t^{th}}=1.3$ s, the test accuracy degradation is about $1.92\%$ after $100$ PHFL rounds, while the per-round bandwidth saving for at least $70\%$ of the clients is about $20.84\%$. Similarly, with $\mathrm{t^{th}}=6$ s, the test accuracy degradations are about $0.47\%$ and $1.11\%$ after $100$ global rounds, while the per-round bandwidth savings for at least $70\%$ of the clients are about $33.12\%$ and $81.26\%$, for ResNet-18 and ResNet-34, respectively.

V-C Baseline Comparisons

We now focus on performance comparisons. First, we consider the existing HFL algorithm [10, 9], which considers neither model pruning nor any energy constraints. Besides, [10, 9] only considered two levels, UE-BS and BS-cloud/server. For a fair comparison, we adapt HFL to three levels: $1)$ UE-sBS, $2)$ sBS-mBS and $3)$ mBS-cloud/server. Furthermore, we enforce the energy and deadline constraints in each UE-sBS aggregation round and name this baseline HFL with constraints (HFL-WC). Moreover, since [10, 9] did not have any VCs, while we have $\kappa_{1}$ VC aggregation rounds before each sBS aggregation round, we adapt the deadline accordingly for HFL-WC to keep the comparison fair. Furthermore, we consider a random PHFL (R-PHFL) scheme, which has the same system model as ours and allows model pruning. In R-PHFL, the pruning ratios $\delta_{i}^{t}$'s are randomly selected between $0$ and $\delta^{\mathrm{th}}$ to satisfy constraint (45e). Moreover, in both HFL-WC and R-PHFL, a common $\kappa_{0}$ that satisfies both the deadline and energy constraints for all clients in all VCs leads to poor test accuracy. As such, we determine the local iterations of the UEs within a VC by selecting the maximum number of iterations that all clients within that VC can perform without violating the delay and energy constraints. For our proposed PHFL, we choose $\kappa_{0}=5$ and $\kappa_{1}=2$. Moreover, we let $\kappa_{2}=\kappa_{3}=2$ for both HFL-WC and PHFL. We also consider centralized SGD to show the performance gap between PHFL and the ideal case where all training data samples are available centrally.

Figure 5: Test accuracies for different $\mathrm{t^{th}}$'s with different ML models on different datasets. (a) CNN on CIFAR-10; (b) ResNet-18 on CIFAR-10; (c) ResNet-34 on CIFAR-10; (d) CNN on CIFAR-100; (e) ResNet-18 on CIFAR-100; (f) ResNet-34 on CIFAR-100.

From our above discussion, it is expected that pruning will likely not have an edge over HFL-WC with a shallow model; indeed, one may not need pruning for a shallow model in the first place. However, for a bulky model, due to the large number of training parameters and the huge wireless payload, pruning can be a necessity under extreme resource constraints. Furthermore, it is crucial to jointly optimize the $\delta_{i}^{t}$'s, $\mathrm{f}_{i}^{t}$'s and $\mathrm{P}_{i}^{t}$'s in order to increase the test accuracy. The simulation results in Fig. 5 validate these claims. We observe that as $\mathrm{t^{th}}$ increases, HFL-WC's performance improves with the CNN model. Moreover, when HFL-WC has $\kappa_{1}$ times the deadline of PHFL, the performance is comparable, as shown in Fig. 5(a) and Fig. 5(d). More specifically, the maximum performance degradation of the proposed PHFL algorithm is about $1.88\%$ on CIFAR-10 when HFL-WC has $2.31$ times the deadline of PHFL. However, for the ResNet models, HFL-WC requires a significantly longer deadline threshold to make the problem feasible. Particularly, with ResNet-18, $\mathrm{t^{th}}\leq 6$ s does not allow the UEs to perform even a single local iteration, leading to the same initial model weights and, thus, the same test accuracy in HFL-WC. Moreover, when the deadline threshold is $\kappa_{1}$ times the $\mathrm{t^{th}}$ of PHFL, HFL-WC's test accuracy significantly lags, as shown in Fig. 5(b) and Fig. 5(e). Particularly, after $T=100$ rounds, our proposed solution with $\mathrm{t^{th}}=4$ s provides about $55.06\%$, $11.62\%$ and $2.33\%$ better test accuracy on CIFAR-10, and about $122.8\%$, $26.7\%$ and $3.62\%$ better test accuracy on CIFAR-100, than HFL-WC with $\mathrm{t^{th}}=8$ s, $\mathrm{t^{th}}=9$ s and $\mathrm{t^{th}}=10$ s, respectively.
For the ResNet-34 model, the clients require $\mathrm{t^{th}}\geq 13$ s to perform any local training in HFL-WC, whereas our proposed solution achieves significantly better performance with only $\mathrm{t^{th}}=6$ s. For example, our proposed solution with $\mathrm{t^{th}}=6$ s yields about $28.09\%$, $14.34\%$ and $8.22\%$ better test accuracy on CIFAR-10, and about $61.97\%$, $34.19\%$ and $21.86\%$ better test accuracy on CIFAR-100, than HFL-WC with $\mathrm{t^{th}}=16$ s, $\mathrm{t^{th}}=17$ s and $\mathrm{t^{th}}=18$ s, respectively. Moreover, our proposed solution provides about $243.47\%$ and $643.29\%$ better test accuracy on CIFAR-10, and about $648.36\%$ and $4542.45\%$ on CIFAR-100, than R-PHFL with the ResNet-18 and ResNet-34 models, respectively. The gap with the ideal centralized ML is expected, since FL suffers from data and system heterogeneity.

Figure 6: Wall clock time vs. test accuracy on CIFAR-10 with ResNet-18.

We also examine the performance of the proposed method in terms of wall clock time. Since HFL-WC does not have the VC tier, the wall clock time to run $M$ global rounds for HFL-WC with a deadline smaller than $\kappa_{1}\times\mathrm{t^{th}}$ is lower than the proposed PHFL's wall clock time for the same number of global rounds. However, that does not necessarily guarantee higher test accuracy than our PHFL, since training and offloading the original bulky model may take a long time, allowing only a few local SGD rounds for the clients. Besides, any deadline greater than $\kappa_{1}\times\mathrm{t^{th}}$ for HFL-WC requires a longer wall clock time than our proposed PHFL solution. Our simulation results in Fig. 6 clearly show these trends. We observe that when HFL-WC has a deadline of $\kappa_{1}\times\mathrm{t^{th}}$, i.e., $2\times 4$ s $=8$ s, the test accuracies are about $42.52\%$ and $76.24\%$ for HFL-WC and PHFL, respectively, when the wall clock reaches $1800$ seconds. Even with $\mathrm{t^{th}}>4\kappa_{1}$ seconds, the HFL-WC algorithm performs worse than our proposed PHFL solution.

From the above results and discussion, it is clear that R-PHFL yields poor test accuracy due to the random selection of the $\delta_{i}^{t}$'s. Besides, HFL-WC cannot deliver reasonable performance when the model has a large number of training parameters. Furthermore, our proposed PHFL's performance is comparable to the non-pruned HFL-WC performance with the shallow model. As such, in the following, we only consider the upper bound (UB) of the HFL baseline, which does not consider the constraints. Moreover, to show how pruning degrades test accuracy, we also consider a UB, called HFL-VC-UB, that inherits the same system model described in Section II but without model pruning and the constraints. In other words, it inherits the same underlying four-level UE-VC, VC-sBS, sBS-mBS and mBS-cloud/server aggregation policy with the original non-pruned model.

Figure 7: CDF of per PHFL round time and energy consumption with different ML models on CIFAR-10. (a) Per-round time with CNN; (b) per-round time with ResNet-18; (c) per-round time with ResNet-34; (d) per-round $\mathrm{e^{tot}}$ with CNN; (e) per-round $\mathrm{e^{tot}}$ with ResNet-18; (f) per-round $\mathrm{e^{tot}}$ with ResNet-34.
TABLE II: Test Accuracy with Trained w^T on CIFAR-10 dataset with κ0 = 5, κ1 = κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.6791±0.0049, 1040, 48355±14664 | 0.7613±0.0026, 3200, 98122±15878 | 0.7677±0.0017, 4800, 140469±20993
PHFL (Ours) | 0.9 | 0.6930±0.0049, 1040, 48355±14663 | 0.7780±0.0043, 3200, 98122±15878 | 0.7854±0.0026, 4800, 140469±20993
PHFL (Ours) | 10 | 0.7091±0.0022, 1040, 48355±14663 | 0.7899±0.0020, 3200, 98122±15878 | 0.7994±0.0010, 4800, 140469±20994
HFL-VC (UB) (Ours) | 0.5 | 0.6971±0.0037, 1293±110, 53041±12377 | 0.7689±0.0054, 8985±182, 231634±32761 | 0.7789±0.0030, 15370±269, 378436±51167
HFL-VC (UB) (Ours) | 0.9 | 0.7066±0.0017, 1293±110, 53041±12377 | 0.7807±0.0019, 8985±182, 231634±32761 | 0.7942±0.0033, 15370±269, 378436±51167
HFL-VC (UB) (Ours) | 10 | 0.7211±0.0010, 1293±110, 53041±12377 | 0.7920±0.0037, 8985±182, 231634±32761 | 0.8031±0.0027, 15370±269, 378436±51167
R-PHFL | 0.5 | 0.5346±0.0074, 1040, 31483±9454 | 0.2447±0.0161, 3200, 92529±12400 | 0.1112±0.0193, 4800, 158622±20589
R-PHFL | 0.9 | 0.5471±0.0089, 1040, 31467±9427 | 0.2265±0.0439, 3200, 92495±12356 | 0.1049±0.0086, 4800, 158827±20834
R-PHFL | 10 | 0.5847±0.0164, 1040, 31455±9406 | 0.3265±0.0273, 3200, 92441±12287 | 0.1096±0.0108, 4800, 158651±20624
HFL (UB) - κ0 | 0.5 | 0.6624±0.0021, 646±55, 26520±6188 | 0.7539±0.0044, 4493±90, 115810±16379 | 0.7445±0.0049, 7686±132, 189207±25580
HFL (UB) - κ0 | 0.9 | 0.6752±0.0018, 646±55, 26520±6188 | 0.7695±0.0014, 4493±90, 115810±16379 | 0.7664±0.0045, 7686±132, 189207±25580
HFL (UB) - κ0 | 10 | 0.6934±0.0012, 646±55, 26520±6188 | 0.7833±0.0032, 4493±90, 115810±16379 | 0.7844±0.0013, 7686±132, 189207±25580
TABLE III: Test Accuracy with Trained w^T on CIFAR-10 dataset with κ0 = 10, κ1 = κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.6859±0.0037, 1600, 76084±23665 | 0.7668±0.0016, 3200, 103997±18692 | 0.7774±0.0016, 4800, 146238±23406
PHFL (Ours) | 0.9 | 0.6966±0.0034, 1600, 76084±23665 | 0.7804±0.0014, 3200, 103998±18693 | 0.7948±0.0023, 4800, 146239±23408
PHFL (Ours) | 10 | 0.7093±0.0024, 1600, 76083±23664 | 0.7935±0.0015, 3200, 103999±18695 | 0.8092±0.0019, 4800, 146238±23407
HFL-VC (UB) (Ours) | 0.5 | 0.6932±0.0021, 2417±221, 102117±24405 | 0.7704±0.0042, 10052±267, 280710±43428 | 0.7825±0.0033, 16416±346, 427512±61116
HFL-VC (UB) (Ours) | 0.9 | 0.7070±0.0024, 2417±221, 102117±24405 | 0.7866±0.0016, 10052±267, 280710±43428 | 0.7975±0.0025, 16416±346, 427512±61116
HFL-VC (UB) (Ours) | 10 | 0.7164±0.0042, 2417±221, 102117±24405 | 0.7936±0.0009, 10052±267, 280710±43428 | 0.8102±0.0018, 16416±346, 427512±61116
HFL (UB) - κ0 | 0.5 | 0.6934±0.0030, 1208±110, 51058±12202 | 0.7634±0.0017, 5026±132, 140349±21712 | 0.7726±0.0014, 8209±171, 213745±30554
HFL (UB) - κ0 | 0.9 | 0.7033±0.0031, 1208±110, 51058±12202 | 0.7769±0.0060, 5026±132, 140349±21712 | 0.7847±0.0014, 8209±171, 213745±30554
HFL (UB) - κ0 | 10 | 0.7208±0.0018, 1208±110, 51058±12202 | 0.7895±0.0023, 5026±132, 140349±21712 | 0.8004±0.0016, 8209±171, 213745±30554
TABLE IV: Test Accuracy with Trained w^T on CIFAR-10 dataset with κ0 = 5, κ1 = 4, κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.6950±0.0047, 2080, 96710±29327 | 0.7699±0.0016, 6400, 196244±31755 | 0.7826±0.0023, 9600, 280939±41988
PHFL (Ours) | 0.9 | 0.7060±0.0026, 2080, 96710±29327 | 0.7828±0.0018, 6400, 196245±31757 | 0.8010±0.0033, 9600, 280938±41986
PHFL (Ours) | 10 | 0.7136±0.0043, 2080, 96710±29327 | 0.7945±0.0014, 6400, 196244±31756 | 0.8111±0.0014, 9600, 280938±41987
HFL-VC (UB) (Ours) | 0.5 | 0.7052±0.0038, 2587±220, 106082±24755 | 0.7726±0.0057, 17969±364, 463257±65516 | 0.7881±0.0028, 30737±534, 756853±102321
HFL-VC (UB) (Ours) | 0.9 | 0.7122±0.0016, 2587±220, 106082±24755 | 0.7874±0.0015, 17969±364, 463257±65516 | 0.8012±0.0009, 30737±534, 756853±102321
HFL-VC (UB) (Ours) | 10 | 0.7195±0.0015, 2587±220, 106082±24755 | 0.7961±0.0028, 17969±364, 463257±65516 | 0.8117±0.0031, 30737±534, 756853±102321
R-PHFL | 0.5 | 0.5813±0.0146, 2080, 62927±18914 | 0.3863±0.0408, 6400, 184493±24419 | 0.1042±0.0069, 9600, 316985±41406
R-PHFL | 0.9 | 0.6013±0.0164, 2080, 62933±18924 | 0.4010±0.0208, 6400, 184766±24766 | 0.1157±0.0137, 9600, 316870±41267
R-PHFL | 10 | 0.6360±0.0082, 2080, 62849±18781 | 0.4648±0.0352, 6400, 184617±24577 | 0.1111±0.0078, 9600, 317122±41570

First, we illustrate the required computation time and the corresponding energy expense of the baselines and our proposed PHFL solution under different deadline thresholds. Naturally, if we increase t^th for each VC aggregation round, our proposed solution takes more time and energy to perform T = 100 global rounds. Moreover, since there are κ1 VC aggregation rounds in our proposed system model, training and offloading the original model with HFL-VC-UB is expected to take significantly longer and consume more energy than the HFL-UB baseline. Therefore, the effectiveness of PHFL, in terms of time and energy consumption, largely depends on the deadline threshold t^th. While R-PHFL requires the same time as our proposed PHFL, its energy requirement can vary significantly because the random selection of the δ_i^t's leads to different model sizes and different local training episodes in different VCs.
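To make the time-energy trade-off concrete, the following is a minimal sketch of the standard CPU-cycle delay/energy model commonly used in wireless FL. The paper's exact expressions are not reproduced in this excerpt, so the constants (cycles per sample c, effective switched capacitance zeta, local sample count D_i) and the assumption that a pruning ratio delta_i scales the workload by (1 - delta_i) are illustrative:

```python
def local_time(kappa0, c, D_i, delta_i, f_i):
    # Time for kappa0 local SGD rounds over a (1 - delta_i)-pruned model
    # at CPU frequency f_i [Hz]; c is cycles per sample (assumed constant).
    return kappa0 * c * D_i * (1.0 - delta_i) / f_i

def local_energy(kappa0, c, D_i, delta_i, f_i, zeta=1e-28):
    # Dynamic CPU energy: (zeta / 2) * total cycles * f^2, the usual CMOS model.
    return kappa0 * (zeta / 2.0) * c * D_i * (1.0 - delta_i) * f_i ** 2

# 5 local rounds, 2e4 cycles/sample, 256 samples, 40% pruning, 1 GHz CPU
t = local_time(5, 2e4, 256, 0.4, 1e9)
e = local_energy(5, 2e4, 256, 0.4, 1e9)
```

Under this model, a larger pruning ratio directly shrinks both the per-round delay and the energy, which is the lever the joint optimization trades against the pruning-induced accuracy loss.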

Our results in Figs. 7(a)-7(c) and Figs. 7(d)-7(f) clearly show these trade-offs in time and energy consumption, respectively, for the three models. It is worth pointing out that we adopted the popular lottery ticket hypothesis [26] for finding the winning ticket, which requires performing ρ iterations on the initial model with the full parameter space, as described in Section II-B. This incurs additional time and energy overheads, t_i^{cp_d} and e_i^{cp_d}, as calculated in (33) and (38), respectively. As such, when t^th is large and the UEs have sufficient energy budgets, they can prune only a few neurons. The total time and energy consumed to compute the winning ticket on the original model, train the pruned model and offload the trained pruned model parameters can then become slightly larger than for HFL-VC-UB. This is observed in Fig. 7(a) and Fig. 7(d) for the CNN model when t^th = 1.3 s, and in Fig. 7(b) and Fig. 7(e) for the ResNet-18 model when t^th = 8 s. Besides, the dashed vertical lines in Fig. 7(e) and Fig. 7(f) show the mean energy budget of the clients for each PHFL global round. While all UEs are able to perform the learning and offloading within this mean energy budget with the CNN and ResNet-18 models, clearly, when the ResNet-34 model is used, more than 57% of the clients fail to perform HFL-VC-UB due to their limited energy budgets.
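As a concrete illustration of the mask construction step, the sketch below builds a magnitude-based binary mask m_i^t that keeps the largest (1 - delta) fraction of parameters. This is a generic magnitude-pruning illustration in the spirit of the lottery ticket hypothesis [26]; the ρ full-model training iterations the paper uses to identify the ticket are omitted, and the array sizes are toy values:

```python
import numpy as np

def winning_ticket_mask(weights, delta):
    """Binary mask keeping the largest (1 - delta) fraction of entries by
    magnitude; delta is the pruning ratio (fraction of parameters removed)."""
    flat = np.abs(weights).ravel()
    k = int(np.ceil((1.0 - delta) * flat.size))   # number of entries to keep
    if k == 0:
        return np.zeros_like(weights)
    thresh = np.partition(flat, -k)[-k]           # k-th largest magnitude
    return (np.abs(weights) >= thresh).astype(weights.dtype)

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 8))                   # toy weight tensor
m = winning_ticket_mask(w, delta=0.75)            # prune 75% of entries
pruned = w * m                                    # element-wise mask, w ⊙ m_i^t
```

The element-wise product with the binary mask is exactly the operation assumed in the supplementary notation, where masked entries are simply replaced by zero.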

TABLE V: Test Accuracy with Trained w^T on CIFAR-100 dataset with κ0 = 5, κ1 = κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.3723±0.0101, 1120, 51667±15530 | 0.4725±0.0032, 3200, 98074±15856 | 0.4770±0.0032, 4800, 140445±20984
PHFL (Ours) | 0.9 | 0.3786±0.0096, 1120, 51667±15531 | 0.4765±0.0003, 3200, 98075±15857 | 0.4840±0.0032, 4800, 140444±20983
PHFL (Ours) | 10 | 0.3795±0.0084, 1120, 51667±15531 | 0.4811±0.0024, 3200, 98075±15857 | 0.4834±0.0023, 4800, 140444±20984
HFL-VC (UB) (Ours) | 0.5 | 0.3962±0.0030, 1319±109, 53645±12431 | 0.4805±0.0022, 9038±183, 232839±32910 | 0.4909±0.0017, 15422±270, 379641±51319
HFL-VC (UB) (Ours) | 0.9 | 0.3956±0.0030, 1319±109, 53645±12431 | 0.4832±0.0034, 9038±183, 232839±32910 | 0.4922±0.0019, 15422±270, 379641±51319
HFL-VC (UB) (Ours) | 10 | 0.4004±0.0042, 1319±109, 53645±12431 | 0.4800±0.0019, 9038±183, 232839±32910 | 0.4895±0.0024, 15422±270, 379641±51319
R-PHFL | 0.5 | 0.1994±0.0075, 1120, 33751±10069 | 0.0587±0.0212, 3200, 92996±12374 | 0.0123±0.0028, 4800, 159083±20562
R-PHFL | 0.9 | 0.2119±0.0156, 1120, 33763±10089 | 0.0641±0.0371, 3200, 92893±12245 | 0.0104±0.0007, 4800, 159122±20608
R-PHFL | 10 | 0.2165±0.0057, 1120, 33733±10040 | 0.0728±0.0427, 3200, 92893±12245 | 0.0109±0.0011, 4800, 159175±20671
HFL (UB) - κ0 | 0.5 | 0.3509±0.0023, 659±54, 26822±6215 | 0.4730±0.0025, 4519±90, 116413±16453 | 0.4663±0.0018, 7712±133, 189809±25656
HFL (UB) - κ0 | 0.9 | 0.3524±0.0029, 659±54, 26822±6215 | 0.4768±0.0047, 4519±90, 116413±16453 | 0.4756±0.0021, 7712±133, 189809±25656
HFL (UB) - κ0 | 10 | 0.3590±0.0019, 659±54, 26822±6215 | 0.4841±0.0057, 4519±90, 116413±16453 | 0.4776±0.0035, 7712±133, 189809±25656
TABLE VI: Test Accuracy with Trained w^T on CIFAR-100 dataset with κ0 = 10, κ1 = κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.3700±0.0035, 2000, 94632±29277 | 0.4654±0.0035, 3200, 103923±18656 | 0.4801±0.0009, 4800, 146195±23388
PHFL (Ours) | 0.9 | 0.3648±0.0044, 2000, 94632±29277 | 0.4699±0.0026, 3200, 103923±18655 | 0.4810±0.0047, 4800, 146196±23388
PHFL (Ours) | 10 | 0.3634±0.0039, 2000, 94632±29277 | 0.4763±0.0019, 3200, 103922±18655 | 0.4780±0.0018, 4800, 146195±23388
HFL-VC (UB) (Ours) | 0.5 | 0.3756±0.0041, 2443±220, 102721±24458 | 0.4720±0.0020, 10104±268, 281915±43569 | 0.4877±0.0058, 16469±346, 428717±61264
HFL-VC (UB) (Ours) | 0.9 | 0.3754±0.0053, 2443±220, 102721±24458 | 0.4701±0.0043, 10104±268, 281915±43569 | 0.4895±0.0015, 16469±346, 428717±61264
HFL-VC (UB) (Ours) | 10 | 0.3794±0.0025, 2443±220, 102721±24458 | 0.4736±0.0012, 10104±268, 281915±43569 | 0.4736±0.0012, 10104±268, 281915±43569
HFL (UB) - κ0 | 0.5 | 0.3877±0.0031, 1221±110, 51360±12228 | 0.4713±0.0052, 5052±132, 140951±21783 | 0.4806±0.0058, 8235±171, 214347±30628
HFL (UB) - κ0 | 0.9 | 0.3877±0.0024, 1221±110, 51360±12228 | 0.4731±0.0020, 5052±132, 140951±21783 | 0.4855±0.0011, 8235±171, 214347±30628
HFL (UB) - κ0 | 10 | 0.3922±0.0042, 1221±110, 51360±12228 | 0.4772±0.0047, 5052±132, 140951±21783 | 0.4844±0.0031, 8235±171, 214347±30628
TABLE VII: Test Accuracy with Trained w^T on CIFAR-100 dataset with κ0 = 5, κ1 = 4, κ2 = κ3 = 2 and T = 100
Methods | Dir(ᾱ) | CNN Model: Acc, Req T [s], Req E [J] | ResNet-18 Model: Acc, Req T [s], Req E [J] | ResNet-34 Model: Acc, Req T [s], Req E [J]
PHFL (Ours) | 0.5 | 0.3817±0.0053, 2240, 103334±31061 | 0.4712±0.0043, 6400, 196151±31714 | 0.4932±0.0039, 9600, 280889±41968
PHFL (Ours) | 0.9 | 0.3782±0.0022, 2240, 103335±31061 | 0.4815±0.0049, 6400, 196151±31714 | 0.4928±0.0029, 9600, 280889±41968
PHFL (Ours) | 10 | 0.3853±0.0035, 2240, 103335±31061 | 0.4806±0.0035, 6400, 196150±31714 | 0.4935±0.0037, 9600, 280889±41968
HFL-VC (UB) (Ours) | 0.5 | 0.3924±0.0055, 2638±219, 107290±24863 | 0.4780±0.0043, 18074±365, 465668±65814 | 0.4974±0.0032, 30841±535, 759264±102626
HFL-VC (UB) (Ours) | 0.9 | 0.3871±0.0063, 2638±219, 107290±24863 | 0.4819±0.0012, 18074±365, 465668±65814 | 0.4992±0.0022, 30841±535, 759264±102626
HFL-VC (UB) (Ours) | 10 | 0.3934±0.0048, 2638±219, 107290±24863 | 0.4824±0.0035, 18074±365, 465668±65814 | 0.4954±0.0058, 30841±535, 759264±102626
R-PHFL | 0.5 | 0.2338±0.0047, 2240, 67480±20211 | 0.1562±0.0181, 6400, 185662±24670 | 0.0153±0.0032, 9600, 318013±41476
R-PHFL | 0.9 | 0.2282±0.0092, 2240, 67530±20296 | 0.1706±0.0180, 6400, 185738±24765 | 0.0149±0.0011, 9600, 318004±41465
R-PHFL | 10 | 0.2398±0.0085, 2240, 67535±20304 | 0.1837±0.0220, 6400, 185639±24640 | 0.0144±0.0036, 9600, 317815±41240

Finally, we show the impact of κ0 and κ1 for different dataset heterogeneity levels on the CIFAR-10 dataset in Table II - Table IV, where t^th = 4 s and t^th = 6 s are used in our proposed PHFL algorithm for the ResNet-18 and ResNet-34 models, respectively. Besides, for the CNN model, we used t^th = 1.3 s and 2 s when κ0 = 5 and κ0 = 10, respectively, since pruning degrades test performance for a shallow model and the computation time dominates the offloading delay. Similarly, we considered t^th = 1.4 s and t^th = 2.5 s, respectively, for κ0 = 5 and κ0 = 10 on the CIFAR-100 dataset for the CNN model. The performance comparisons for different ᾱ's are shown in Table V - Table VII. From the tables, it is quite clear that pruning incurs a negligible performance deviation from the original non-pruned counterparts. Besides, for the shallow CNN model, the test accuracy gain of our proposed PHFL is insignificant compared to the bulky ResNets. Moreover, increasing κ0 or κ1 generally improves the test accuracy. However, under the same t^th, it is more beneficial to increase κ0 than to increase κ1.
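The PHFL "Req T" entries in the tables are consistent with a simple budget: T global rounds, each containing κ1·κ2·κ3 VC aggregation rounds, each bounded by the per-round deadline t^th. A quick sanity check against the reported values:

```python
def phfl_required_time(T, kappa1, kappa2, kappa3, t_th):
    # Each global round contains kappa1*kappa2*kappa3 VC aggregation rounds,
    # and each VC round is bounded by the deadline t_th [s].
    return T * kappa1 * kappa2 * kappa3 * t_th

# Cross-check against the "Req T" column reported for PHFL (kappa2 = kappa3 = 2):
checks = [
    ((100, 2, 2, 2, 1.3), 1040),   # Table II,  CNN,       t_th = 1.3 s
    ((100, 2, 2, 2, 4),   3200),   # Table II,  ResNet-18, t_th = 4 s
    ((100, 2, 2, 2, 6),   4800),   # Table II,  ResNet-34, t_th = 6 s
    ((100, 2, 2, 2, 2),   1600),   # Table III, CNN,       t_th = 2 s
    ((100, 2, 2, 2, 1.4), 1120),   # Table V,   CNN,       t_th = 1.4 s
    ((100, 4, 2, 2, 1.3), 2080),   # Table IV,  CNN,       kappa1 = 4
]
for args, expected in checks:
    assert abs(phfl_required_time(*args) - expected) < 1e-6
```

This also makes the κ0-versus-κ1 trade-off explicit: increasing κ0 adds local SGD steps inside an existing deadline, whereas increasing κ1 multiplies the number of deadline-bounded VC rounds and hence the total wall clock time.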

VI Conclusion

This work proposed a model pruning solution to alleviate the bandwidth scarcity and limited computational capacity of wireless clients in heterogeneous networks. Using the convergence upper bound, the pruning ratio, CPU frequency and transmission power of the clients were jointly optimized to maximize the convergence rate. The performance was evaluated on two popular datasets using three popular machine learning models with different total numbers of training parameters. The results suggest that pruning can significantly reduce training time, energy expense and bandwidth requirement while incurring negligible test performance loss.

References

  • [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. AISTATS, 2017.
  • [2] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. IEEE INFOCOM, 2019.
  • [3] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, 2020.
  • [4] R. Jin, X. He, and H. Dai, “Communication efficient federated learning with energy awareness over wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 7, pp. 5204–5219, Jan. 2022.
  • [5] Z. Chen, W. Yi, and A. Nallanathan, “Exploring representativity in device scheduling for wireless federated learning,” IEEE Trans. Wireless Commun., pp. 1–1, 2023.
  • [6] T. Zhang, K.-Y. Lam, J. Zhao, and J. Feng, “Joint device scheduling and bandwidth allocation for federated learning over wireless networks,” IEEE Trans. Wireless Commun., pp. 1–1, 2023.
  • [7] J. Wang, S. Wang, R.-R. Chen, and M. Ji, “Demystifying why local aggregation helps: Convergence analysis of hierarchical sgd,” in Proc. AAAI, 2022.
  • [8] S. Hosseinalipour, S. S. Azam, C. G. Brinton, N. Michelusi, V. Aggarwal, D. J. Love, and H. Dai, “Multi-stage hybrid federated learning over large-scale d2d-enabled fog networks,” IEEE/ACM Trans. Network., vol. 30, no. 4, pp. 1569–1584, 2022.
  • [9] B. Xu, W. Xia, W. Wen, P. Liu, H. Zhao, and H. Zhu, “Adaptive hierarchical federated learning over wireless networks,” IEEE Trans. Vehicular Technol., vol. 71, no. 2, pp. 2070–2083, 2021.
  • [10] S. Liu, G. Yu, X. Chen, and M. Bennis, “Joint user association and resource allocation for wireless hierarchical federated learning with iid and non-iid data,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 7852–7866, 2022.
  • [11] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, “Hfel: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6535–6548, 2020.
  • [12] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Client-edge-cloud hierarchical federated learning,” in Proc. IEEE ICC, 2020.
  • [13] C. Feng, H. H. Yang, D. Hu, Z. Zhao, T. Q. S. Quek, and G. Min, “Mobility-aware cluster federated learning in hierarchical wireless networks,” IEEE Trans. Wireless Commun., vol. 21, no. 10, pp. 8441–8458, Oct. 2022.
  • [14] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” in Proc. ICASSP.   IEEE, 2020.
  • [15] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” Foundations and Trends® in Machine Learning, vol. 14, no. 1–2, pp. 1–210, 2021.
  • [16] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proc. MLSys, vol. 2, pp. 429–450, 2020.
  • [17] H. Yang, X. Zhang, P. Khanduri, and J. Liu, “Anarchic federated learning,” in Proc. ICML.   PMLR, 2022, pp. 25 331–25 363.
  • [18] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Proc. NeurIPS, vol. 33, pp. 7611–7623, 2020.
  • [19] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, and M. Jaggi, “Dynamic model pruning with feedback,” in Proc. ICLR, 2020.
  • [20] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, Apr. 2022.
  • [21] Z. Zhu, Y. Shi, J. Luo, F. Wang, C. Peng, P. Fan, and K. B. Letaief, “Fedlp: Layer-wise pruning mechanism for communication-computation efficient federated learning,” arXiv preprint arXiv:2303.06360, 2023.
  • [22] S. Liu, G. Yu, R. Yin, and J. Yuan, “Adaptive network pruning for wireless federated learning,” IEEE Wireless Commun. Lett., vol. 10, no. 7, pp. 1572–1576, 2021.
  • [23] S. Liu, G. Yu, R. Yin, J. Yuan, L. Shen, and C. Liu, “Joint model pruning and device selection for communication-efficient federated edge learning,” IEEE Trans. Commun., vol. 70, no. 1, pp. 231–244, Jan. 2022.
  • [24] J. Ren, W. Ni, and H. Tian, “Toward communication-learning trade-off for federated learning at the network edge,” IEEE Commun. Lett., vol. 26, no. 8, pp. 1858–1862, Aug. 2022.
  • [25] “3rd Generation Partnership Project; Technical Specification Group Core Network and Terminals; Proximity-services in 5G System protocol aspects; Stage 3,” 3GPP TS 24.554 V18.0.0, Rel. 18, Mar. 2023.
  • [26] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in Proc. ICLR, 2019.
  • [27] M. F. Pervej, R. Jin, and H. Dai, “Resource constrained vehicular edge federated learning with highly mobile connected vehicles,” IEEE J. Sel. Areas Commun., vol. 41, no. 6, pp. 1825–1844, June 2023.
  • [28] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” in Proc. NeurIPS, 2018.
  • [29] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Siam Review, vol. 60, no. 2, pp. 223–311, 2018.
  • [30] S. Diamond and S. Boyd, “Cvxpy: A python-embedded modeling language for convex optimization,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2909–2913, 2016.
  • [31] E. Che, H. D. Tuan, and H. H. Nguyen, “Joint optimization of cooperative beamforming and relay assignment in multi-user wireless relay networks,” IEEE Trans. Wireless Comm., vol. 13, no. 10, pp. 5481–5495, 2014.
  • [32] Y. Sun, D. Xu, D. W. K. Ng, L. Dai, and R. Schober, “Optimal 3d-trajectory design and resource allocation for solar-powered uav communication systems,” IEEE Trans. Commun., vol. 67, no. 6, pp. 4281–4298, 2019.
  • [33] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Apr. 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
  • [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016.
[Uncaptioned image] Md Ferdous Pervej (M’23) received the B.Sc. degree in electronics and telecommunication engineering from the Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh, in 2014, and the M.S. and Ph.D. degrees in electrical engineering from Utah State University, Logan, UT, USA, in 2019 and North Carolina State University, Raleigh, NC, USA, in 2023, respectively. He was with Mitsubishi Electric Research Laboratories, Cambridge, MA, in summer 2021, and with Futurewei Wireless Research and Standards, Schaumburg, IL from May to December 2022. He is currently a Postdoctoral Scholar – Research Associate in the Ming Hsieh Department of Electrical and Computer Engineering at the University of Southern California, Los Angeles, CA, USA. His primary research interests are wireless networks, distributed machine learning, vehicle-to-everything communication, edge caching/computing, and machine learning for wireless networks.
[Uncaptioned image] Richeng Jin (M’21) received the B.S. degree in information and communication engineering from Zhejiang University, Hangzhou, China, in 2015, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, NC, USA, in 2020. He was a Postdoctoral Researcher in electrical and computer engineering at North Carolina State University, Raleigh, NC, USA, from 2021 to 2022. He is currently a faculty member of the department of information and communication engineering with Zhejiang University, Hangzhou, China. His research interests are in the general area of wireless AI, game theory, and security and privacy in machine learning/artificial intelligence and wireless networks.
[Uncaptioned image] Huaiyu Dai (F’17) received the B.E. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ in 2002. He was with Bell Labs, Lucent Technologies, Holmdel, NJ, in summer 2000, and with AT&T Labs-Research, Middletown, NJ, in summer 2001. He is currently a Professor of Electrical and Computer Engineering with NC State University, Raleigh, holding the title of University Faculty Scholar. His research interests are in the general areas of communications, signal processing, networking, and computing. His current research focuses on machine learning and artificial intelligence for communications and networking, multilayer and interdependent networks, dynamic spectrum access and sharing, as well as security and privacy issues in the above systems. He has served as an area editor for IEEE Transactions on Communications, a member of the Executive Editorial Committee for IEEE Transactions on Wireless Communications, and an editor for IEEE Transactions on Signal Processing. Currently he serves as an area editor for IEEE Transactions on Wireless Communications. He was a co-recipient of best paper awards at 2010 IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS 2010), 2016 IEEE INFOCOM BIGSECURITY Workshop, and 2017 IEEE International Conference on Communications (ICC 2017). He received Qualcomm Faculty Award in 2019.

Supplementary Materials

Additional Notations: Analogous to our previous notations, we express the pruned virtual models at different hierarchy levels as $\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\coloneqq\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{t,0}$, $\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\coloneqq\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{t,0}=\sum_{j=1}^{V_{k,l}}\alpha_{j}\tilde{\bar{\mathbf{w}}}_{j}^{t,0}$, $\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\coloneqq\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{\mathbf{w}}_{i}^{t,0}=\sum_{k=1}^{B_{l}}\alpha_{k}\tilde{\bar{\mathbf{w}}}_{k}^{t,0}$ and $\tilde{\bar{\mathbf{w}}}^{t,0}\coloneqq\sum_{l=1}^{L}\alpha_{l}\tilde{\bar{\mathbf{w}}}_{l}^{t,0}=\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{\mathbf{w}}_{u}^{t,0}$. Moreover, $\tilde{g}(\tilde{\mathbf{w}}_{i}^{t,0})\coloneqq g(\tilde{\mathbf{w}}_{i}^{t,0})\odot\mathbf{m}_{i}^{t}$, $\nabla\tilde{f}_{i}(\tilde{\bar{\mathbf{w}}}^{t,0})\coloneqq\nabla f_{i}(\tilde{\bar{\mathbf{w}}}^{t,0})\odot\mathbf{m}_{i}^{t}$ and $\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})=\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})$.
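These nested definitions collapse to a single weighted sum over all U clients, as the last identity states. A small NumPy check of that identity (the toy topology, model dimension and random weights are illustrative, and the per-tier weights are taken as the renormalized client weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                                  # toy model dimension
# toy nesting: cloud -> mBS (L) -> sBS (B_l) -> VC (V_{k,l}) -> UE indices
topo = [[[[0, 1], [2, 3]]], [[[4], [5, 6]], [[7]]]]
U = 8
w_tilde = [rng.standard_normal(d) for _ in range(U)]   # pruned models w~_u^{t,0}
alpha_u = rng.random(U); alpha_u /= alpha_u.sum()      # per-UE weights alpha_u

def collapse(node):
    """Return (weighted virtual model, aggregate weight) for a subtree."""
    if isinstance(node[0], int):                       # VC level: UE indices
        ws = np.array([alpha_u[u] for u in node])
        stack = np.stack([w_tilde[u] for u in node])
        return np.tensordot(ws / ws.sum(), stack, 1), ws.sum()
    parts = [collapse(child) for child in node]
    ws = np.array([p[1] for p in parts])
    return np.tensordot(ws / ws.sum(), np.stack([p[0] for p in parts]), 1), ws.sum()

w_bar, total_w = collapse(topo)                        # cloud-level model
direct = sum(a * w for a, w in zip(alpha_u, w_tilde))  # sum_u alpha_u * w~_u
assert np.allclose(w_bar, direct)
```

With per-tier renormalized weights, aggregating level by level and averaging all clients directly produce the same global virtual model, which is what the chain of identities in the notation asserts.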

Additional Assumptions: Since the element-wise mask operation is equivalent to replacing a subset of the actual entries with zero, similar to [20, 22], we also assume that the assumptions made in Section III-A hold if we apply the binary masks to all gradients.

Appendix A Proof of Theorem 1

Theorem 1.

When the assumptions in Section III-A hold and $\eta\leq 1/\beta$, we have

\theta_{\mathrm{PHFL}}\leq\mathcal{O}\big([f(\tilde{\bar{\mathbf{w}}}^{0})-f_{\mathrm{inf}}]/[\eta T]\big)+\mathcal{O}\big(\beta\eta\sigma^{2}/\mathrm{U}\big)+\underbrace{\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)}_{\mathrm{pruning~error}}+\underbrace{\mathcal{O}\big(\beta\eta G^{2}\cdot\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)}_{\mathrm{wireless~factor}}+\mathcal{O}\big(\beta^{2}\big[\mathrm{L}_{1}+\mathrm{L}_{2}+\mathrm{L}_{3}+\mathrm{L}_{4}\big]\big), (56)

where $\boldsymbol{\delta}=\{\delta_{1}^{t},\dots,\delta_{U}^{t}\}_{t=0}^{T-1}$, $\boldsymbol{\mathrm{f}}=\{\mathrm{f}_{1}^{t},\dots,\mathrm{f}_{U}^{t}\}_{t=0}^{T-1}$ and $\boldsymbol{\mathrm{P}}=\{P_{1}^{t},\dots,P_{U}^{t}\}_{t=0}^{T-1}$, with $\mathrm{f}_{i}^{t}$ denoting the $i^{\mathrm{th}}$ client's CPU frequency in the wireless factors. Besides, the wireless factor $\varphi_{\mathrm{w,0}}$ and the terms $\mathrm{L}_{1}$, $\mathrm{L}_{2}$, $\mathrm{L}_{3}$ and $\mathrm{L}_{4}$ are

\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}^{2}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}^{2}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\big[1/p_{i}^{t}-1\big], (57)
\mathrm{L}_{1}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}, (58)
\mathrm{L}_{2}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\|^{2}, (59)
\mathrm{L}_{3}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}, (60)
\mathrm{L}_{4}=[1/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\mathbb{E}\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\|^{2}. (61)
Proof.

Since there are $\mathrm{U}$ clients and we consider virtual models at the different hierarchy levels, we start the convergence proof by assuming that the global model is the weighted combination of all $\mathrm{U}$ clients' models. After that, we break down our derivations for the different hierarchy levels based on our proposed PHFL algorithm. Note that this is also a standard practice in the literature [13, 9].

First, let us write the update rule for the global model as

$$\bar{\mathbf{w}}^{t+1}=\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{\mathbf{w}}_{u}^{t+1}=\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left(\tilde{\mathbf{w}}_{u}^{t,0}-\eta\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right)=\tilde{\bar{\mathbf{w}}}^{t,0}-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}},\quad(62)$$

where $\tilde{\bar{\mathbf{w}}}^{t,0}\coloneqq\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{\mathbf{w}}_{u}^{t,0}$. As such, we write

$$f(\bar{\mathbf{w}}^{t+1})=f\left(\tilde{\bar{\mathbf{w}}}^{t,0}-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right)\overset{(a)}{\leq}f(\tilde{\bar{\mathbf{w}}}^{t,0})+\left\langle\nabla f(\tilde{\bar{\mathbf{w}}}^{t,0}),-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle+\frac{\beta\eta^{2}}{2}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\|^{2},\quad(63)$$

where $(a)$ stems from the $\beta$-smoothness assumption. For the second term on the right-hand side of (63), i.e., the inner product, we can write

$$\left\langle\nabla f(\tilde{\bar{\mathbf{w}}}^{t,0}),-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle=-\eta\left\langle\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla f_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle$$
$$=-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\langle\nabla f_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\odot\mathbf{m}_{u}^{t},\big(g(\tilde{\mathbf{w}}_{u}^{t,0})\odot\mathbf{m}_{u}^{t}\big)\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle=-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\langle\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle,\quad(64)$$

where $\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\coloneqq\nabla f_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\odot\mathbf{m}_{u}^{t}$.

Plugging (64) into (63) and taking the expectation of both sides, we get

$$\mathbb{E}\left[f(\bar{\mathbf{w}}^{t+1})\right]\leq f(\tilde{\bar{\mathbf{w}}}^{t,0})-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\left[\left\langle\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle\right]+\frac{\beta\eta^{2}}{2}\mathbb{E}\left[\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\|^{2}\right].\quad(65)$$

We simplify the second term in (65) as follows:

$$-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\left[\left\langle\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\rangle\right]\overset{(a)}{=}-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\langle\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\mathbb{E}\left[\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right]\mathbb{E}\left[\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right]\right\rangle$$
$$=-\eta\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\langle\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0}),\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\rangle$$
$$=-\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}+\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}$$
$$=\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}\quad(66)$$
$$=\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}-\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}$$
$$\leq\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}-\frac{\eta}{2}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\right\|^{2}$$
$$\overset{(b)}{=}\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}-\frac{\eta}{2}\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2},$$

where $(a)$ follows from the independence of the client selection and the stochastic gradients. Besides, we define $\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}\coloneqq\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}=\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla f_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\odot\mathbf{m}_{u}^{t}\|^{2}$ in $(b)$.
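Two facts used in the chain above can be verified numerically: the inverse-probability weighting in $(a)$ is unbiased, i.e., $\mathbb{E}[\boldsymbol{1}_{u}^{t}/p_{u}^{t}]=1$ for a Bernoulli($p_{u}^{t}$) participation indicator, and the split of the inner product into squared norms uses the identity $2\langle\mathbf{a},\mathbf{b}\rangle=\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2}-\|\mathbf{a}-\mathbf{b}\|^{2}$. A quick check on synthetic values:

```python
import numpy as np

rng = np.random.default_rng(2)

# (a): for a Bernoulli(p) indicator, E[1/p on success, 0 on failure] = p * (1/p) = 1,
# so scaling each received update by 1/p_u^t keeps the aggregated update unbiased.
p = 0.6
exact_mean = p * (1 / p) + (1 - p) * 0.0
assert np.isclose(exact_mean, 1.0)

# Polarization identity used to expand the inner product of the two gradients:
# 2<a, b> = ||a||^2 + ||b||^2 - ||a - b||^2.
a, b = rng.normal(size=5), rng.normal(size=5)
lhs = 2 * np.dot(a, b)
rhs = np.dot(a, a) + np.dot(b, b) - np.dot(a - b, a - b)
assert np.isclose(lhs, rhs)
```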

The third term in (65) can be simplified as follows:

$$\mathbb{E}\left[\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right\|^{2}\right]\overset{(a)}{=}\mathbb{E}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-\mathbb{E}\left[\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right]\right\|^{2}+\left\|\mathbb{E}\left[\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\right]\right\|^{2}$$
$$=\mathbb{E}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}\pm\sum_{u=1}^{\mathrm{U}}\alpha_{u}\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})-\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$\overset{(b)}{\leq}\mathbb{E}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left(\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left(\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right)\right\|^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$\overset{(c)}{\leq}2\mathbb{E}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left(\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+2\mathbb{E}\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left(\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right)\right\|^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$=2\mathbb{E}\bigg[\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left\|\left(\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\sum_{u^{\prime}=1,u^{\prime}\neq u}^{\mathrm{U}}\alpha_{u^{\prime}}\left\langle\left(\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0}),\left(\frac{\boldsymbol{1}_{u^{\prime}}^{t}}{p_{u^{\prime}}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u^{\prime}}^{t,0})\right\rangle\bigg]$$
$$\quad+2\mathbb{E}\bigg[\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left\|\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\sum_{u^{\prime}=1,u^{\prime}\neq u}^{\mathrm{U}}\alpha_{u^{\prime}}\left\langle\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0}),\tilde{g}(\tilde{\mathbf{w}}_{u^{\prime}}^{t,0})-\nabla\tilde{f}_{u^{\prime}}(\tilde{\mathbf{w}}_{u^{\prime}}^{t,0})\right\rangle\bigg]+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$\overset{(d)}{\leq}2\mathbb{E}\bigg[\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left\|\left(\frac{\boldsymbol{1}_{u}^{t}}{p_{u}^{t}}-1\right)\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}\bigg]+2\sigma^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$=2\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left(\frac{1-p_{u}^{t}}{p_{u}^{t}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+2\sigma^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}+\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2},\quad(67)$$

where $(a)$ comes from the definition of variance, $(b)$ follows from Jensen's inequality, i.e., $\left\|\sum_{u=1}^{\mathrm{U}}\alpha_{u}\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}\leq\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$, $(c)$ stems from the fact that $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$, and $(d)$ is due to the bounded variance assumption and the fact that the zero-mean cross terms vanish in expectation.
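The three ingredients of (67) can each be checked numerically: Jensen's inequality with convex weights, the $n$-term norm bound, and the variance factor $\mathbb{E}[(\boldsymbol{1}_{u}^{t}/p_{u}^{t}-1)^{2}]=(1-p_{u}^{t})/p_{u}^{t}$ of the Bernoulli indicator. A small sketch with synthetic vectors and weights:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 4
a = rng.normal(size=(n, d))
alpha = rng.dirichlet(np.ones(n))  # convex weights: alpha_u >= 0, sum to 1

# (b) Jensen's inequality: ||sum_u alpha_u a_u||^2 <= sum_u alpha_u ||a_u||^2.
lhs_b = np.sum((alpha[:, None] * a).sum(axis=0) ** 2)
rhs_b = np.sum(alpha * (a ** 2).sum(axis=1))
assert lhs_b <= rhs_b + 1e-12

# (c) ||sum_i a_i||^2 <= n * sum_i ||a_i||^2.
lhs_c = np.sum(a.sum(axis=0) ** 2)
rhs_c = n * np.sum(a ** 2)
assert lhs_c <= rhs_c + 1e-12

# (d)-(67): for a Bernoulli(p) indicator, E[(1/p - 1)^2 on success, 1 on failure]
# = p*(1/p - 1)^2 + (1 - p)*1 = (1 - p)/p, computed here exactly.
p = 0.7
second_moment = p * (1 / p - 1) ** 2 + (1 - p) * (0 / p - 1) ** 2
assert np.isclose(second_moment, (1 - p) / p)
```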

Plugging (66) and (67) into (65), we get

$$\mathbb{E}\left[f(\bar{\mathbf{w}}^{t+1})\right]\leq f(\tilde{\bar{\mathbf{w}}}^{t,0})+\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}-\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big\}-\frac{\eta}{2}\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\|^{2}+\beta\eta^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left(\frac{1-p_{u}^{t}}{p_{u}^{t}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+\beta\eta^{2}\sigma^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}+\frac{\beta\eta^{2}}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}$$
$$=f(\tilde{\bar{\mathbf{w}}}^{t,0})+\frac{\eta}{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\left\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}-\frac{\eta}{2}\left\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\right\|^{2}-\frac{\eta}{2}\left(1-\beta\eta\right)\sum_{u=1}^{\mathrm{U}}\alpha_{u}\|\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}+\beta\eta^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left(\frac{1-p_{u}^{t}}{p_{u}^{t}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}+\beta\eta^{2}\sigma^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}.\quad(68)$$

Note that when $\eta\leq\frac{1}{\beta}$, we have $(1-\beta\eta)\geq 0$. Therefore, we can drop the fourth term in (68).

To that end, rearranging the terms in (68), dividing both sides by $\frac{\eta}{2}$, taking expectations on both sides and averaging over time, we get

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\right\|^{2}\right]\leq\frac{2\left(f(\tilde{\bar{\mathbf{w}}}^{0})-\mathbb{E}[f(\bar{\mathbf{w}}^{T})]\right)}{\eta T}+2\beta\eta\sigma^{2}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}+\frac{2\beta\eta G^{2}}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}^{2}\left(\frac{1-p_{u}^{t}}{p_{u}^{t}}\right)+\frac{1}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\left[\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\right],\quad(69)$$

where we use the notation $\tilde{\bar{\mathbf{w}}}^{0}$ to represent $\tilde{\bar{\mathbf{w}}}^{0,0}$ for simplicity.

The last term in (69) can be expanded as follows:

$$\frac{1}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\left[\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\right]=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\left[\left\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})\pm\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{l,(u)}^{t,0})\pm\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{k,(u)}^{t,0})\pm\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{j,(u)}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\right\|^{2}\right]$$
$$\overset{(a)}{\leq}\frac{4}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\mathbb{E}\Big[\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{l,(u)}^{t,0})\|^{2}+\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{l,(u)}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{k,(u)}^{t,0})\|^{2}+\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{k,(u)}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{j,(u)}^{t,0})\|^{2}+\|\nabla\tilde{f}_{u}(\tilde{\bar{\mathbf{w}}}_{j,(u)}^{t,0})-\nabla\tilde{f}_{u}(\tilde{\mathbf{w}}_{u}^{t,0})\|^{2}\Big]$$
$$\overset{(b)}{\leq}\frac{4\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{u=1}^{\mathrm{U}}\alpha_{u}\Big\{\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}^{t,0}-\tilde{\bar{\mathbf{w}}}_{l,(u)}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{l,(u)}^{t,0}-\tilde{\bar{\mathbf{w}}}_{k,(u)}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{k,(u)}^{t,0}-\tilde{\bar{\mathbf{w}}}_{j,(u)}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{j,(u)}^{t,0}-\tilde{\mathbf{w}}_{u}^{t,0}\|^{2}\right]\Big\}$$
$$=\frac{4\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\Big\{\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}^{t,0}-\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{l}^{t,0}-\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{k}^{t,0}-\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\|^{2}\right]+\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{j}^{t,0}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}\right]\Big\}$$
$$=\frac{4\beta^{2}}{T}\sum_{t=0}^{T-1}\Bigg\{\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}^{t,0}-\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{l}^{t,0}-\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{k}^{t,0}-\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{j}^{t,0}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}\right]\Bigg\}$$
$$=\frac{4\beta^{2}}{T}\sum_{t=0}^{T-1}\Big\{\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}^{t,0}\pm\bar{\mathbf{w}}^{t}\pm\bar{\mathbf{w}}_{l}^{t}-\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\pm\bar{\mathbf{w}}_{l}^{t}\pm\bar{\mathbf{w}}_{k}^{t}-\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\pm\bar{\mathbf{w}}_{k}^{t}\pm\bar{\mathbf{w}}_{j}^{t}-\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\pm\bar{\mathbf{w}}_{j}^{t}\pm\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}\right]\Big\}$$
$$\overset{(c)}{\leq}\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\bigg\{\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}^{t,0}-\bar{\mathbf{w}}^{t}\|^{2}+\|\bar{\mathbf{w}}_{l}^{t}-\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\|^{2}+\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{l}^{t,0}-\bar{\mathbf{w}}_{l}^{t}\|^{2}+\|\bar{\mathbf{w}}_{k}^{t}-\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\|^{2}+\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{k}^{t,0}-\bar{\mathbf{w}}_{k}^{t}\|^{2}+\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\|^{2}+\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\|^{2}\right]+\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left[\|\tilde{\bar{\mathbf{w}}}_{j}^{t,0}-\bar{\mathbf{w}}_{j}^{t}\|^{2}+\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}+\|\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}\right]\bigg\}$$
$$=12\beta^{2}\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\|^{2}}_{\mathrm{L}_{4}}+12\beta^{2}\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}}_{\mathrm{L}_{3}}+12\beta^{2}\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\|^{2}}_{\mathrm{L}_{2}}+12\beta^{2}\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}}_{\mathrm{L}_{1}}$$
$$\quad+\underbrace{\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\bar{\mathbf{w}}^{t}-\tilde{\bar{\mathbf{w}}}^{t,0}\|^{2}}_{\mathrm{T}_{5}}+\underbrace{\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\tilde{\bar{\mathbf{w}}}_{l}^{t,0}\|^{2}}_{\mathrm{T}_{4}}+\underbrace{\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{k}^{t}-\tilde{\bar{\mathbf{w}}}_{k}^{t,0}\|^{2}}_{\mathrm{T}_{3}}+\underbrace{\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\bar{\mathbf{w}}}_{j}^{t,0}\|^{2}}_{\mathrm{T}_{2}}+\underbrace{\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}}_{\mathrm{T}_{1}},\quad(70)$$

where we use the notations $\bar{\mathbf{w}}_{j,(u)}^{t}$, $\bar{\mathbf{w}}_{k,(u)}^{t}$ and $\bar{\mathbf{w}}_{l,(u)}^{t}$ to represent the models of the VC, sBS and mBS, respectively, that UE $u$ is connected to. Furthermore, $(a)$ and $(c)$ stem from the fact that $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$. Moreover, $(b)$ arises from the $\beta$-smoothness assumption and the assumed bounded divergence between the loss functions.
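The "$\pm$" insertion of the intermediate VC, sBS and mBS models followed by the $n$-term bound can be illustrated on synthetic vectors: inserting three intermediate points between two endpoints and applying $\|\sum_{i=1}^{4}\mathbf{a}_{i}\|^{2}\leq 4\sum_{i=1}^{4}\|\mathbf{a}_{i}\|^{2}$ gives exactly the kind of chain used in $(a)$ above. A minimal check with arbitrary stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
# Five stand-in model vectors: the two endpoints (a, e) and three inserted
# intermediate points (b, c, d_), mimicking the global/mBS/sBS/VC/local chain.
a, b, c, d_, e = rng.normal(size=(5, 6))

# Telescoping via the ± trick: a - e = (a-b) + (b-c) + (c-d_) + (d_-e),
# then ||sum of 4 terms||^2 <= 4 * (sum of squared norms).
lhs = np.sum((a - e) ** 2)
rhs = 4 * (np.sum((a - b) ** 2) + np.sum((b - c) ** 2)
           + np.sum((c - d_) ** 2) + np.sum((d_ - e) ** 2))
assert lhs <= rhs + 1e-12
```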

To this end, we bound the terms $\mathrm{T}_{1}$ to $\mathrm{T}_{5}$ using our aggregation rules and definitions. First, let us focus on the term $\mathrm{T}_{1}$:

$$\mathrm{T}_{1}\overset{(a)}{=}\frac{12\beta^{2}}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{t=0}^{T-1}\mathbb{E}\left\|\mathbf{w}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\right\|^{2}\overset{(b)}{\leq}\frac{12\beta^{2}}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{t=0}^{T-1}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2},\quad(71)$$

where in $(a)$ we trace back to the nearest synchronization iteration, at which $\mathbf{w}_{i}^{t}\leftarrow\bar{\mathbf{w}}_{j}^{t}$ and $\tilde{\mathbf{w}}_{i}^{t}=\mathbf{w}_{i}^{t}$. Besides, $(b)$ uses the definition of the pruning ratio.
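Step $(b)$ relies on the pruning-ratio property that the energy removed by the mask is at most a $\delta_{i}^{t}$ fraction of the model's energy. For magnitude-based pruning (zeroing the smallest-magnitude entries, which is one common choice; the specific pruning rule here is an assumption for illustration), this holds deterministically, since the pruned entries' average energy cannot exceed the overall average:

```python
import numpy as np

rng = np.random.default_rng(4)
d, delta = 10, 0.3
w = rng.normal(size=d)

# Magnitude pruning: zero out the delta-fraction of entries with smallest |w|.
k = int(delta * d)                      # number of pruned entries
idx = np.argsort(np.abs(w))[:k]
m = np.ones(d)
m[idx] = 0.0                            # binary mask m_i^t
w_tilde = w * m                         # pruned model w~_i^t = w_i^t ⊙ m_i^t

# Pruned energy is at most a delta fraction of the total energy:
# ||w - w ⊙ m||^2 <= delta * ||w||^2, which is the step used in (b).
assert np.sum((w - w_tilde) ** 2) <= delta * np.sum(w ** 2) + 1e-12
```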

Now, we calculate the bound for the term $\mathrm{T}_{2}$ as follows:

$$\mathrm{T}_{2}=\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg\|\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big(\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\big)\bigg\|^{2}\overset{(a)}{\leq}\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\big\|\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\big\|^{2}\overset{(b)}{\leq}\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2},\quad(72)$$

where $(a)$ arises from Jensen's inequality and $(b)$ follows from the same reasoning as in $\mathrm{T}_{1}$. Using similar steps, we write the following:

$$\mathrm{T}_{3}\leq\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2},\quad(73)$$
$$\mathrm{T}_{4}\leq\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2},\quad(74)$$
$$\mathrm{T}_{5}\leq\frac{24\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2}.\quad(75)$$

Plugging the above bounds into (69), we get

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla\tilde{f}(\tilde{\bar{\mathbf{w}}}^{t,0})\right\|^{2}\right]\leq\frac{2\left(f(\tilde{\bar{\mathbf{w}}}^{0})-\mathbb{E}[f(\bar{\mathbf{w}}^{T})]\right)}{\eta T}+2\beta\eta\sigma^{2}\sum_{l=1}^{L}\alpha_{l}^{2}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+2\beta\eta G^{2}\cdot\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}^{2}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\bigg(\frac{1-p_{i}^{t}}{p_{i}^{t}}\bigg)}_{\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})}+12\beta^{2}\cdot\big(\mathrm{L}_{1}+2\big\{\mathrm{L}_{2}+\mathrm{L}_{3}+\mathrm{L}_{4}\big\}\big)+96\beta^{2}\cdot\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\|\mathbf{w}_{i}^{t}\|^{2}}_{\mathrm{e}_{0}(\boldsymbol{\delta})}.\quad(76)$$

When $\alpha_{i}=\frac{1}{U_{j,k,l}}$, $\alpha_{j}=\frac{1}{V_{k,l}}$, $\alpha_{k}=\frac{1}{B_{l}}$ and $\alpha_{l}=\frac{1}{L}$, we have $\sum_{l=1}^{L}\alpha_{l}^{2}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}=\frac{1}{\mathrm{U}}$. As such, using the fact that $f(\bar{\mathbf{w}}^{T})\geq f_{\mathrm{inf}}$ and the definition of $\theta_{\mathrm{PHFL}}$, we get

$$\theta_{\mathrm{PHFL}}\leq\mathcal{O}\bigg(\frac{f(\tilde{\bar{\mathbf{w}}}^{0})-f_{\mathrm{inf}}}{\eta T}\bigg)+\mathcal{O}\bigg(\frac{\beta\eta\sigma^{2}}{\mathrm{U}}\bigg)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)+\mathcal{O}\big(\beta\eta G^{2}\cdot\varphi_{\mathrm{w,0}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)+\mathcal{O}\big(\beta^{2}\big[\mathrm{L}_{1}+\mathrm{L}_{2}+\mathrm{L}_{3}+\mathrm{L}_{4}\big]\big).\quad(77)$$
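The weight-collapse step used above, $\sum_{l}\alpha_{l}^{2}\sum_{k}\alpha_{k}^{2}\sum_{j}\alpha_{j}^{2}\sum_{i}\alpha_{i}^{2}=1/\mathrm{U}$ under uniform tier weights, can be verified for a small balanced toy hierarchy (the sizes below are illustrative assumptions):

```python
import numpy as np

# Balanced toy hierarchy: L mBSs, B sBSs each, V VCs each, U_jkl UEs each.
L, B, V, U_jkl = 2, 3, 2, 4
U = L * B * V * U_jkl  # total number of clients

# With alpha_i = 1/U_jkl, alpha_j = 1/V, alpha_k = 1/B, alpha_l = 1/L, each of the
# U leaves contributes (1/U)^2 to the nested sum of squared weights, so it sums to 1/U.
total = 0.0
for _l in range(L):
    for _k in range(B):
        for _j in range(V):
            for _i in range(U_jkl):
                total += (1 / L) ** 2 * (1 / B) ** 2 * (1 / V) ** 2 * (1 / U_jkl) ** 2
assert np.isclose(total, 1.0 / U)
```

Note that the collapse to exactly $1/\mathrm{U}$ requires the hierarchy to be balanced, i.e., $\mathrm{U}=L\cdot B_{l}\cdot V_{k,l}\cdot U_{j,k,l}$ as in this sketch.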

Appendix B Proof of Lemma 1

Lemma 1.

When $\eta\leq 1/[2\sqrt{10}\kappa_{0}\beta]$, the average difference between the VC and local model parameters, i.e., the $\mathrm{L}_{1}$ term of (56), is upper bounded as

[\beta^{2}/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}
\qquad\leq\mathcal{O}\big(\kappa_{0}\eta^{2}\beta^{2}\sigma^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big), \quad (78)

where $\varphi_{\mathrm{w,L}_{1}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=[1/T]\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{t=0}^{T-1}\left(1/p_{i}^{t}-1\right)$.

Proof.
\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}
=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\eta\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\sum_{\tau=\bar{t}_{0}}^{t-1}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\Big(\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}-\eta\sum_{\tau=\bar{t}_{0}}^{t-1}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\Big)\Bigg\|^{2},
\overset{(a)}{\leq}\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}\Bigg\|^{2}+
\qquad\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\sum_{\tau=\bar{t}_{0}}^{t-1}\tilde{g}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\Bigg\|^{2}, \quad (79)

where $\bar{t}_{0}=[\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}+t_{1}]\kappa_{0}$ and $(a)$ comes from $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$.
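The norm inequality behind step $(a)$ can be checked numerically on arbitrary vectors; the dimensions and the random data below are illustrative only:

```python
# Check: ||sum_i a_i||^2 <= n * sum_i ||a_i||^2 for arbitrary vectors,
# a direct consequence of the Cauchy-Schwarz (Jensen) inequality.
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 16
A = rng.normal(size=(n, d))                   # n arbitrary vectors a_1..a_n

lhs = float(np.linalg.norm(A.sum(axis=0)) ** 2)      # ||sum_i a_i||^2
rhs = n * float((np.linalg.norm(A, axis=1) ** 2).sum())
assert lhs <= rhs
```

Equality holds only when all the $\mathbf{a}_{i}$ coincide, which is why the factor $n$ (here, the factor 2) appears whenever a sum is split into its components.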

Note that the first term in (79) arises because the VC receives a weighted combination of the pruned models of its associated UEs. For this term, we have

2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}\bigg\|^{2}
\overset{(a)}{=}2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg\|\left(\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\mathbf{w}_{j}^{\bar{t}_{0}}\right)+\left(\mathbf{w}_{i}^{\bar{t}_{0}}-\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}\right)\bigg\|^{2},
\overset{(b)}{\leq}4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\mathbf{w}_{j}^{\bar{t}_{0}}\bigg\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg\|\mathbf{w}_{i}^{\bar{t}_{0}}-\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}\bigg\|^{2},
\leq 4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\bigg\{\mathbb{E}\bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\mathbf{w}_{j}^{\bar{t}_{0}}\bigg\|^{2}+\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{0}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{0}}\right\|^{2}\bigg\},
\leq 8\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{0}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{0}}\right\|^{2}, \quad (80)

where $(a)$ is true since $\mathbf{w}_{j}^{\bar{t}_{0}}=\mathbf{w}_{i}^{\bar{t}_{0}}$ and $(b)$ comes from $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$.
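The last two steps of (80) use the pruning-error bound $\mathbb{E}\|\mathbf{w}_{i}-\tilde{\mathbf{w}}_{i}\|^{2}\leq\delta_{i}\|\mathbf{w}_{i}\|^{2}$. One operator satisfying this bound is magnitude-based pruning, which zeroes the $\lfloor\delta d\rfloor$ smallest-magnitude entries; the sketch below illustrates the bound under that assumption (the `magnitude_prune` helper is illustrative, not the paper's implementation):

```python
# Check: zeroing the floor(delta*d) smallest-magnitude entries of w
# yields ||w - w_tilde||^2 <= delta * ||w||^2, since the pruned entries'
# average energy cannot exceed the overall average energy.
import numpy as np

def magnitude_prune(w, delta):
    """Zero out the floor(delta * len(w)) smallest-magnitude entries."""
    n_prune = int(np.floor(delta * len(w)))
    idx = np.argsort(np.abs(w))[:n_prune]   # indices of smallest entries
    w_tilde = w.copy()
    w_tilde[idx] = 0.0
    return w_tilde

rng = np.random.default_rng(1)
w = rng.normal(size=1000)
for delta in (0.1, 0.3, 0.5):
    w_t = magnitude_prune(w, delta)
    err = float(np.linalg.norm(w - w_t) ** 2)
    assert err <= delta * float(np.linalg.norm(w) ** 2)
```

Any pruning mask that removes at most a $\delta$ fraction of the model's energy satisfies the same bound, so the derivation does not depend on this particular operator.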

As such, the first term is bounded as

\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg\|\tilde{\mathbf{w}}_{j}^{\bar{t}_{0},0}-\tilde{\mathbf{w}}_{i}^{\bar{t}_{0},0}\bigg\|^{2}\leq\frac{8}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\lfloor t/\kappa_{0}\rfloor}\mathbb{E}\left\|\mathbf{w}_{i}^{\lfloor t/\kappa_{0}\rfloor}\right\|^{2}. \quad (81)

For the second term of (79), we have

\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\tilde{g}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg]\Bigg\|^{2}
=\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\pm\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\pm\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\tilde{g}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg]\Bigg\|^{2}
\leq\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\bigg(\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\bigg)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\bigg(\tilde{g}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\bigg)\bigg]\Bigg\|^{2}+
\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\bigg]\Bigg\|^{2}. \quad (82)

We bound the first and the second terms of (82) in Lemma 5 and Lemma 6, respectively.

Lemma 5.
\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\bigg(\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\bigg)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\bigg(\tilde{g}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\bigg)\bigg]\Bigg\|^{2}
\leq 8\kappa_{0}\eta^{2}\sigma^{2}+8\kappa_{0}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}, \quad (83)

where $\varphi_{\mathrm{w,L}_{1}}=\frac{1}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{t=0}^{T-1}\left(\frac{1-p_{i}^{t}}{p_{i}^{t}}\right)$.
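The $(1-p_{i}^{t})/p_{i}^{t}$ factor in $\varphi_{\mathrm{w,L}_{1}}$ is exactly the variance inflation of the inverse-probability-weighted indicator $\boldsymbol{1}_{i}^{\tau}/p_{i}^{\tau}$. For a fixed gradient vector this can be verified by exact enumeration of the two Bernoulli outcomes; the vector and the success probabilities below are arbitrary illustrative choices:

```python
# Check: for 1 ~ Bernoulli(p) and a fixed vector g,
#   E|| (1/p) * 1 * g - g ||^2 = ((1 - p) / p) * ||g||^2,
# by enumerating both outcomes of the indicator exactly.
import numpy as np

g = np.array([0.5, -1.0, 2.0])            # a fixed (pruned) gradient
for p in (0.2, 0.5, 0.9):
    # outcome 1 (prob p): estimator is g/p; outcome 0 (prob 1-p): it is 0.
    second_moment = (p * np.linalg.norm(g / p - g) ** 2
                     + (1 - p) * np.linalg.norm(-g) ** 2)
    target = (1 - p) / p * np.linalg.norm(g) ** 2
    assert abs(second_moment - target) < 1e-12
```

As $p\to 1$ (reliable links) the extra variance vanishes, while small success probabilities inflate it, which is what the joint optimization of transmission power and CPU frequency in the main text trades off against the delay and energy budgets.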

Lemma 6.
\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\bigg]\Bigg\|^{2}
\leq 20\kappa_{0}^{2}\eta^{2}\epsilon_{\mathrm{vc}}^{2}+40\kappa_{0}^{2}\eta^{2}\beta^{2}\cdot\bar{\mathrm{e}}_{\boldsymbol{\delta}}+\frac{40\kappa_{0}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}, \quad (84)

where $\bar{\mathrm{e}}_{\boldsymbol{\delta}}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2}$.

Using Lemma 5 and Lemma 6, we write

\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}\leq\frac{8\kappa_{0}\eta^{2}\sigma^{2}+20\kappa_{0}^{2}\eta^{2}\epsilon_{\mathrm{vc}}^{2}+8\kappa_{0}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}+40\kappa_{0}^{2}\eta^{2}\beta^{2}\cdot\bar{\mathrm{e}}_{\boldsymbol{\delta}}+8\cdot\mathrm{e}_{\mathrm{p,L}_{1}}}{1-40\kappa_{0}^{2}\eta^{2}\beta^{2}}, \quad (85)

where $\mathrm{e}_{\mathrm{p,L}_{1}}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\lfloor t/\kappa_{0}\rfloor}\mathbb{E}\left\|\mathbf{w}_{i}^{\lfloor t/\kappa_{0}\rfloor}\right\|^{2}$.

Now, multiplying both sides of (85) by $12\beta^{2}$, we get

\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}
\leq\frac{96\kappa_{0}\eta^{2}\beta^{2}\sigma^{2}+240\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}+96\kappa_{0}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}+480\kappa_{0}^{2}\eta^{2}\beta^{4}\cdot\bar{\mathrm{e}}_{\boldsymbol{\delta}}+96\beta^{2}\cdot\mathrm{e}_{\mathrm{p,L}_{1}}}{1-40\kappa_{0}^{2}\eta^{2}\beta^{2}}. \quad (86)

When $\eta\leq\frac{1}{2\sqrt{10}\kappa_{0}\beta}$, we have $40\kappa_{0}^{2}\eta^{2}\beta^{2}\leq 1$, so the denominator of (86) is bounded, and the previous assumption of $\eta\leq\frac{1}{\beta}$ is automatically satisfied. As such, absorbing the constant factor from the denominator into the order notation, we write

\frac{12\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}
\leq 96\kappa_{0}\eta^{2}\beta^{2}\sigma^{2}+240\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}+96\kappa_{0}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}+480\kappa_{0}^{2}\eta^{2}\beta^{4}\cdot\bar{\mathrm{e}}_{\boldsymbol{\delta}}+96\beta^{2}\cdot\mathrm{e}_{\mathrm{p,L}_{1}}
\approx\mathcal{O}\big(\kappa_{0}\eta^{2}\beta^{2}\sigma^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big). \quad (87)

B-A Missing Proof of Lemma 5

4\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\Bigg\{\left(\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\left(\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0})\right)\Bigg\}\Bigg\|^{2},
\overset{(a)}{=}4\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\Bigg\{\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\left(\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right)\Bigg\|^{2}-\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\left(\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0})\right)\Bigg\|^{2}\Bigg\},
\overset{(b)}{=}4\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\Bigg\{\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{\tau=\bar{t}_{0}}^{t-1}\mathbb{E}\Bigg\|\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\pm\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\nabla\tilde{f}_{i}(\tilde{\mathbf{w}}_{i}^{\tau,0})\Bigg\|^{2}-
\qquad\sum_{\tau=\bar{t}_{0}}^{t-1}\mathbb{E}\Bigg\|\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\left(\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\pm\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\nabla\tilde{f}_{i^{\prime}}(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0})\right)\Bigg\|^{2}\Bigg\},
\overset{(c)}{=}8\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{\tau=\bar{t}_{0}}^{t-1}\mathbb{E}\Bigg\{\Bigg\|\left(\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-1\right)\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\Bigg\|^{2}+\Bigg\|\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\nabla\tilde{f}_{i}(\tilde{\mathbf{w}}_{i}^{\tau,0})\Bigg\|^{2}\Bigg\}-
8\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{\tau=\bar{t}_{0}}^{t-1}\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}^{2}\mathbb{E}\Bigg\{\Bigg\|\left(\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-1\right)\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\Bigg\|^{2}+\Bigg\|\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\nabla\tilde{f}_{i^{\prime}}(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0})\Bigg\|^{2}\Bigg\},
\overset{(d)}{\leq}8\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{\tau=\bar{t}_{0}}^{t-1}\Bigg\{\left(\frac{1-p_{i}^{\tau}}{p_{i}^{\tau}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right\|^{2}+\sigma^{2}\Bigg\}-
\qquad 8\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{\tau=\bar{t}_{0}}^{t-1}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\Bigg\{\left(\frac{1-p_{i}^{\tau}}{p_{i}^{\tau}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right\|^{2}+\sigma^{2}\Bigg\},
\overset{(e)}{\leq}8\kappa_{0}\eta^{2}\sigma^{2}-8\eta^{2}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\sum\nolimits_{\tau=\bar{t}_{0}}^{t-1}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sigma^{2}+
\qquad 8\eta^{2}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\sum\nolimits_{j=1}^{V_{k,l}}\alpha_{j}\sum\nolimits_{\tau=\bar{t}_{0}}^{t-1}\sum\nolimits_{i=1}^{U_{j,k,l}}\alpha_{i}\left(1-\alpha_{i}\right)\left([1-p_{i}^{\tau}]/p_{i}^{\tau}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right\|^{2},
\leq 8\kappa_{0}\eta^{2}\sigma^{2}+8\eta^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{\tau=\bar{t}_{0}}^{t-1}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\left(\frac{1-p_{i}^{\tau}}{p_{i}^{\tau}}\right)\mathbb{E}\left\|\tilde{g}(\tilde{\mathbf{w}}_{i}^{\tau,0})\right\|^{2}, \quad (88)

where $(a)$ stems from the fact that $\sum_{i=1}^{n}p_{i}\|\mathbf{x}_{i}-\bar{\mathbf{x}}\|^{2}=\sum_{i=1}^{n}p_{i}\|\mathbf{x}_{i}\|^{2}-\|\bar{\mathbf{x}}\|^{2}$, where $\bar{\mathbf{x}}=\sum_{i=1}^{n}p_{i}\mathbf{x}_{i}$ for any $0\leq p_{i}\leq 1$ with $\sum_{i=1}^{n}p_{i}=1$. Besides, $(b)$ is true since the SGD noise is independent across time. Furthermore, in $(c)$ and $(d)$, we use the bounded-divergence assumption on the mini-batch gradients and the independence across clients.
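The weighted variance decomposition invoked in step $(a)$ is an exact identity and can be checked numerically; the vector count, dimension, and random weights below are illustrative only:

```python
# Check: sum_i p_i ||x_i - xbar||^2 = sum_i p_i ||x_i||^2 - ||xbar||^2,
# where xbar = sum_i p_i x_i and the weights p_i sum to one.
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8
X = rng.normal(size=(n, d))               # n vectors x_1..x_n
p = rng.random(n)
p /= p.sum()                              # nonnegative weights, sum to 1

xbar = (p[:, None] * X).sum(axis=0)       # weighted mean vector
lhs = float((p * ((X - xbar) ** 2).sum(axis=1)).sum())
rhs = float((p * (X ** 2).sum(axis=1)).sum() - (xbar ** 2).sum())
assert abs(lhs - rhs) < 1e-9
```

This is the vector analogue of $\operatorname{Var}(X)=\mathbb{E}[X^{2}]-(\mathbb{E}[X])^{2}$, which is why subtracting the $\|\bar{\mathbf{x}}\|^{2}$ term in step $(a)$ incurs no inequality.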

B-B Missing Proof of Lemma 6

\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\Bigg\|\sum_{\tau=\bar{t}_{0}}^{t-1}\bigg[\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\Big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\Big)\bigg]\Bigg\|^{2}
\leq\frac{4\kappa_{0}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\sum_{\tau=\bar{t}_{0}}^{t-1}\mathbb{E}\Bigg\|\bigg(\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau}\big)\bigg)+\bigg(\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau}\big)-\nabla\tilde{f}_{i}\big(\bar{\mathbf{w}}_{j}^{\tau}\big)\bigg)+
\qquad\bigg(\nabla\tilde{f}_{i}\big(\bar{\mathbf{w}}_{j}^{\tau}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big(\bar{\mathbf{w}}_{j}^{\tau}\big)\bigg)+\bigg(\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big(\bar{\mathbf{w}}_{j}^{\tau}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big(\tilde{\mathbf{w}}_{i}^{\tau}\big)\bigg)+
\qquad\bigg(\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big(\tilde{\mathbf{w}}_{i}^{\tau}\big)-\sum_{i^{\prime}=1}^{U_{j,k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big)\bigg)\Bigg\|^{2}
\leq\frac{20\kappa_{0}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg[2\beta^{2}\mathbb{E}\|\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}+\epsilon_{\mathrm{vc}}^{2}+2\beta^{2}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}\bigg]
\leq 20\kappa_{0}^{2}\eta^{2}\epsilon_{\mathrm{vc}}^{2}+40\kappa_{0}^{2}\eta^{2}\beta^{2}\cdot\bar{\mathrm{e}}_{\boldsymbol{\delta}}+\frac{40\kappa_{0}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\left\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\right\|^{2}, \quad (89)

where $\bar{\mathrm{e}}_{\boldsymbol{\delta}}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{t}\mathbb{E}\|\mathbf{w}_{i}^{t}\|^{2}$. ∎

Appendix C Proof of Lemma 2

Lemma 2.

When $\eta\leq 1/[2\sqrt{10}\kappa_{0}\kappa_{1}\beta]$, the difference between the sBS model parameters and the VC model parameters, i.e., the $\mathrm{L}_{2}$ term of (56), is upper bounded as

\frac{\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\right\|^{2}
\leq\mathcal{O}\big(\beta^{4}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\beta^{2}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)+
\qquad\mathcal{O}\big(\kappa_{0}^{3}\kappa_{1}^{2}\beta^{4}\eta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\beta^{2}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big), \quad (90)

where $\varphi_{\mathrm{w,L}_{2}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=[1/T]\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}(1/p_{i}^{t}-1)$.

Proof.
\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\right\|^{2}
=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\Big\|\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}-\eta\sum_{\tau=\bar{t}_{1}}^{t-1}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}+\eta\sum_{\tau=\bar{t}_{1}}^{t-1}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\Big\|^{2},
\leq\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}\right\|^{2}+
\qquad\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg[\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg]\right\|^{2}, \quad (91)

where $\bar{t}_{1}=\{(m\kappa_{3}+t_{3})\kappa_{2}+t_{2}\}\kappa_{1}\kappa_{0}$ and the inequality in the last step follows from Jensen's inequality.

For the first term in (91), we have

2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}\right\|^{2}
\overset{(a)}{=}2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}-\mathbf{w}_{k}^{\bar{t}_{1}}+\mathbf{w}_{j}^{\bar{t}_{1}}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}\right\|^{2},
\leq 4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\mathbf{w}_{k}^{\bar{t}_{1}}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}\right\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\mathbf{w}_{j}^{\bar{t}_{1}}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}\right\|^{2},
\overset{(b)}{\leq}4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{1}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{1}}\right\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{1}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{1}}\right\|^{2},
=8\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{1}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{1}}\right\|^{2}, \quad (92)

where in $(a)$ we use the fact that $\mathbf{w}_{k}^{\bar{t}_{1}}=\mathbf{w}_{j}^{\bar{t}_{1}}$, and $(b)$ stems from following similar steps as in (73) and (72).

As such, we derive the upper bound of the first term of (91) as

2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\tilde{\mathbf{w}}_{k}^{\bar{t}_{1},0}-\tilde{\mathbf{w}}_{j}^{\bar{t}_{1},0}\right\|^{2}
\leq\frac{8}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\left\lfloor t/(\kappa_{0}\kappa_{1})\right\rfloor}\mathbb{E}\left\|\mathbf{w}_{i}^{\left\lfloor t/(\kappa_{0}\kappa_{1})\right\rfloor}\right\|^{2}\approx\mathcal{O}\big(\delta^{\mathrm{th}}D^{2}\big). \quad (93)

For the second term in (91), we have

\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\sum_{\tau=\bar{t}_{1}}^{t-1}\Big(\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\Big)\right\|^{2}
=\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg\|\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg[\bigg\{\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\bigg\}+
\qquad\bigg\{\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg\}+
\qquad\bigg\{\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg\}\bigg]\bigg\|^{2},
\leq\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg\|\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg[\bigg\{\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)\bigg\}+
{j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ}]2+\displaystyle\qquad\qquad\qquad\qquad\qquad\bigg{\{}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{\}}\bigg{]}\bigg{\|}^{2}+
4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼τ=t¯1t1[i=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2,\displaystyle~{}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg{[}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg{]}\bigg{\|}^{2}, (94)
We bound the two terms on the right-hand side of (94) in the following two lemmas.

Lemma 7.
4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼τ=t¯1t1[{i=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτi=1Uj,k,lαif~i(𝐰~iτ,0)}+\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg{[}\bigg{\{}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{\}}+
{j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ}]2\displaystyle\qquad\qquad\qquad\qquad\qquad\bigg{\{}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{\}}\bigg{]}\bigg{\|}^{2}
\displaystyle\leq 8\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+8\kappa_{0}\kappa_{1}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}} (95)
\displaystyle\approx\mathcal{O}\big(\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}}\big), (96)

where $\varphi_{\mathrm{w,L}_{2}}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\big(\frac{1}{p_{i}^{t}}-1\big)$.
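The $(1/p_{i}^{t}-1)$ factor in $\varphi_{\mathrm{w,L}_{2}}$ is exactly the variance inflation of the inverse-probability-weighted gradient $\tilde{g}\,\boldsymbol{1}_{i}^{t}/p_{i}^{t}$: with $\boldsymbol{1}_{i}^{t}\sim\mathrm{Bernoulli}(p_{i}^{t})$ independent of the stochastic gradient, the estimator is unbiased and $\mathbb{E}\|(\boldsymbol{1}_{i}^{t}/p_{i}^{t}-1)\tilde{g}\|^{2}=(1/p_{i}^{t}-1)\,\mathbb{E}\|\tilde{g}\|^{2}$. A minimal Monte Carlo check of this identity (the probability and gradient values are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.6                          # client participation probability (illustrative)
g = np.array([1.0, -2.0, 0.5])   # a fixed "gradient" for illustration
n = 200_000

mask = rng.random(n) < p                     # Bernoulli(p) participation indicator
est = (mask[:, None] / p) * g                # inverse-probability-weighted gradient
mean_est = est.mean(axis=0)                  # ~ g, i.e., the estimator is unbiased
sq_err = np.sum((est - g) ** 2, axis=1).mean()

print(mean_est)                              # close to [1.0, -2.0, 0.5]
print(sq_err, (1 / p - 1) * np.sum(g ** 2))  # both close to (1/p - 1) * ||g||^2
```

Intermittent participation therefore costs nothing in expectation but adds a $(1/p-1)$-scaled second-moment term, which is what the optimization in the main text trades off against energy and delay.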

Lemma 8.
\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg\|\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg[\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg]\bigg\|^{2}
\displaystyle\leq\mathcal{O}\big(\beta^{2}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\epsilon_{\mathrm{sbs}}^{2}\big)+\mathcal{O}\big(\beta^{2}\sigma^{2}\kappa_{0}^{3}\kappa_{1}^{2}\eta^{4}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}D^{2}\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\big)+
\displaystyle\qquad\qquad\mathcal{O}\big(\beta^{2}\kappa_{0}^{3}\kappa_{1}^{2}\eta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big)+\frac{40\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\big\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\big\|^{2}. (97)

Now, using Lemma 7 and Lemma 8, and assuming $\eta\leq\frac{1}{2\sqrt{10}\kappa_{0}\kappa_{1}\beta}$, we have the following:

\displaystyle\frac{\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\left\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\right\|^{2}
\displaystyle\leq\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\beta^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\beta^{2}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}}\big)+\mathcal{O}\big(\beta^{4}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big)+
\displaystyle\qquad\qquad\qquad\mathcal{O}\big(\sigma^{2}\kappa_{0}^{3}\kappa_{1}^{2}\beta^{4}\eta^{4}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}D^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{4}\big)+\mathcal{O}\big(\kappa_{0}^{3}\kappa_{1}^{2}\beta^{4}\eta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big)
\displaystyle\approx\mathcal{O}\big(\beta^{4}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\beta^{2}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)+
\displaystyle\qquad\qquad\qquad\mathcal{O}\big(\kappa_{0}^{3}\kappa_{1}^{2}\beta^{4}\eta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\beta^{2}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}}\big). (98)

C-A Missing Proof of Lemma 7

4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼τ=t¯1t1[{i=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτi=1Uj,k,lαif~i(𝐰~iτ,0)}+\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg{[}\bigg{\{}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{\}}+
{j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ}]2\displaystyle\qquad\qquad\qquad\qquad\qquad\bigg{\{}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{\}}\bigg{]}\bigg{\|}^{2}
=(a)4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼τ=t¯1t1[i=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτi=1Uj,k,lαif~i(𝐰~iτ,0)]2\displaystyle\overset{(a)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg{[}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}-
4η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯1t1j=1Vk,lαji=1Uj,k,lαi[g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0)]2\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\bigg{[}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg{]}\bigg{\|}^{2}
=(b)4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαjτ=t¯1t1𝔼i=1Uj,k,lαi[g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0)]2\displaystyle\overset{(b)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{\tau=\bar{t}_{1}}^{t-1}\mathbb{E}\bigg{\|}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{[}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}-
4η2Tt=0T1l=1Lαlk=1Blαkτ=t¯1t1𝔼j=1Vk,lαji=1Uj,k,lαi[g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0)]2\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{\tau=\bar{t}_{1}}^{t-1}\mathbb{E}\bigg{\|}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\bigg{[}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg{]}\bigg{\|}^{2}
(c)4κ0κ1η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼i=1Uj,k,lαi[g~(𝐰~it,0)𝟏itpitf~i(𝐰~it,0)]2\displaystyle\overset{(c)}{\leq}\frac{4\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{[}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{]}\bigg{\|}^{2}-
4κ0κ1η2Tt=0T1l=1Lαlk=1Blαk𝔼j=1Vk,lαji=1Uj,k,lαi[g~(𝐰~it,0)𝟏itpitf~i(𝐰~it,0)]2\displaystyle\qquad\qquad\qquad\frac{4\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\bigg{\|}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\bigg{[}\tilde{g}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{t,0}\right)\frac{\boldsymbol{1}_{i^{\prime}}^{t}}{p_{i^{\prime}}^{t}}-\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{t,0}\right)\bigg{]}\bigg{\|}^{2}
=4κ0κ1η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi2𝔼g~(𝐰~it,0)𝟏itpit±g~(𝐰~it,0)f~i(𝐰~it,0)2\displaystyle=\frac{4\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\mathbb{E}\bigg{\|}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{\|}^{2}-
4κ0κ1η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj2i=1Uj,k,lαi2𝔼g~(𝐰~it,0)𝟏itpit±g~(𝐰~it,0)f~i(𝐰~it,0)2\displaystyle\qquad\qquad\qquad\frac{4\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\mathbb{E}\bigg{\|}\tilde{g}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)-\nabla\tilde{f}_{i}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)\bigg{\|}^{2}
8κ0κ1η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi2[𝔼(𝟏itpit1)g~(𝐰~it,0)2+𝔼g~(𝐰~it,0)f~i(𝐰~it,0)2]\displaystyle\leq\frac{8\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\bigg{[}\mathbb{E}\bigg{\|}\bigg{(}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}-1\bigg{)}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{\|}^{2}+\mathbb{E}\bigg{\|}\tilde{g}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{\|}^{2}\bigg{]}-
8κ0κ1η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj2i=1Uj,k,lαi2[𝔼(𝟏itpit1)g~(𝐰~it,0)2+𝔼g~(𝐰~it,0)f~i(𝐰~it,0)2]\displaystyle\quad\frac{8\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\bigg{[}\mathbb{E}\bigg{\|}\bigg{(}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}-1\bigg{)}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{\|}^{2}+\mathbb{E}\bigg{\|}\tilde{g}\left(\tilde{\mathbf{w}}_{i}^{t,0}\right)-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{\|}^{2}\bigg{]}
\displaystyle\leq\frac{8\kappa_{0}\kappa_{1}\eta^{2}G^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\bigg(\frac{1}{p_{i}^{t}}-1\bigg)+8\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}-
\displaystyle\qquad\frac{8\kappa_{0}\kappa_{1}\eta^{2}G^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\bigg(\frac{1}{p_{i}^{t}}-1\bigg)-8\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}
\displaystyle\leq 8\kappa_{0}\kappa_{1}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+8\kappa_{0}\kappa_{1}\eta^{2}\cdot\varphi_{\mathrm{w,L}_{2}}, (99)

where $\varphi_{\mathrm{w,L}_{2}}=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\big(\frac{1}{p_{i}^{t}}-1\big)$.

C-B Missing Proof of Lemma 8

4η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼τ=t¯1t1[i=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{1}}^{t-1}\bigg{[}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\left(\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\right)\bigg{]}\bigg{\|}^{2}
\displaystyle\leq\frac{4\kappa_{0}\kappa_{1}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{\tau=\bar{t}_{1}}^{t-1}\mathbb{E}\bigg\|\bigg(\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big[\nabla\tilde{f}_{i}\big(\tilde{\mathbf{w}}_{i}^{\tau,0}\big)-\nabla\tilde{f}_{i}\big(\bar{\mathbf{w}}_{j}^{\tau}\big)\big]\bigg)+
(i=1Uj,k,lαi[f~i(𝐰¯jτ)f~i(𝐰¯kτ)])+(i=1Uj,k,lαif~i(𝐰¯kτ)j=1Vk,lαji=1Uj,k,lαif~i(𝐰¯kτ))+\displaystyle\bigg{(}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{j}^{\tau}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}\big{]}\bigg{)}+\bigg{(}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}-\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}\bigg{)}+
(j=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯kτ)f~i(𝐰¯jτ)])+(j=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯jτ)f~i(𝐰~iτ,0)])2\displaystyle\bigg{(}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j}^{\tau}\big{)}\big{]}\bigg{)}+\bigg{(}\sum_{j^{\prime}=1}^{V_{k,l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k,l}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j}^{\tau}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\big{]}\bigg{)}\bigg{\|}^{2}
20κ02κ12η2ϵsbs2+40β2κ02κ12η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼𝐰¯kt𝐰¯jt2+80β2κ02κ12η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰¯jt𝐰~it2+\displaystyle\leq 20\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\epsilon_{\mathrm{sbs}}^{2}+\frac{40\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\big{\|}\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\big{\|}^{2}+\frac{80\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\big{\|}\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\big{\|}^{2}+
80β2κ02κ12η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰~it𝐰~it,02\displaystyle\qquad\qquad\qquad\frac{80\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\big{\|}\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\big{\|}^{2}
𝒪(β2κ04κ12η4ϵvc2)+𝒪(κ02κ12η2ϵsbs2)+𝒪(β2σ2κ03κ12η4)+𝒪(δthD2β2κ02κ12η2)+𝒪(β2κ03κ12η4G2φw,L1)+\displaystyle\approx\mathcal{O}\big{(}\beta^{2}\kappa_{0}^{4}\kappa_{1}^{2}\eta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\beta^{2}\sigma^{2}\kappa_{0}^{3}\kappa_{1}^{2}\eta^{4}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}D^{2}\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}\big{)}+\mathcal{O}\big{(}\beta^{2}\kappa_{0}^{3}\kappa_{1}^{2}\eta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+
40β2κ02κ12η2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼𝐰¯kt𝐰¯jt2.\displaystyle\qquad\qquad\qquad\frac{40\beta^{2}\kappa_{0}^{2}\kappa_{1}^{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\big{\|}\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\big{\|}^{2}. (100)

This concludes the proof of Lemma 8. ∎

Appendix D Proof of Lemma 3

Lemma 3.

When $\eta\leq 1/[2\sqrt{14}\kappa_{0}\kappa_{1}\kappa_{2}\beta]$, the average difference between the sBS and mBS model parameters, i.e., the $\mathrm{L}_{3}$ term of (56), is upper bounded as

\displaystyle[\beta^{2}/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\sum\nolimits_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\right\|^{2}
\displaystyle\leq\mathcal{O}\big(\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}\sigma^{2}\big)+\mathcal{O}\big(\delta^{\mathrm{th}}\beta^{2}D^{2}\big)+
\displaystyle\qquad\mathcal{O}\big(\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)+\mathcal{O}\big(\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big)+\mathcal{O}\big(\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big). (101)

where $\varphi_{\mathrm{w,L}_{3}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=[1/T]\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(1/p_{i}^{t}-1\right)$.

Proof.
1Tt=0T1l=1Lαlk=1Blαk𝔼𝐰¯lt𝐰¯kt2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\right\|^{2}
=l=1Lαlk=1Blαk𝔼𝐰~lt¯2,0𝐰~kt¯2,0ητ=t¯2t1k=1Blαkj=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ+ητ=t¯2t1j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ2,\displaystyle=\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}-\eta\sum_{\tau=\bar{t}_{2}}^{t-1}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}+\eta\sum_{\tau=\bar{t}_{2}}^{t-1}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\Bigg{\|}^{2},
2Tt=0T1l=1Lαlk=1Blαk𝔼𝐰~lt¯2,0𝐰~kt¯2,02+2η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯2t1[j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτk=1Blαkj=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ]2,\displaystyle\leq\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\!\!\alpha_{l}\!\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\mathbb{E}\left\|\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}\right\|^{2}+\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\!\!\sum_{l=1}^{L}\!\!\alpha_{l}\!\!\sum_{k=1}^{B_{l}}\!\!\alpha_{k}\mathbb{E}\bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\bigg{[}\sum_{j=1}^{V_{k,l}}\!\!\alpha_{j}\!\!\sum_{i=1}^{U_{j,k,l}}\!\!\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{k^{\prime}=1}^{B_{l}}\!\!\alpha_{k^{\prime}}\!\!\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\!\!\alpha_{j^{\prime}}\!\!\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\!\!\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{]}\bigg{\|}^{2}\!\!, (102)

where $\bar{t}_{2}=(m\kappa_{3}+t_{3})\kappa_{2}\kappa_{1}\kappa_{0}$, and the inequality in the last step follows from Jensen's inequality.
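The $\kappa_{0}\kappa_{1}\kappa_{2}$ factors that accumulate throughout this appendix all come from the same elementary step: bounding the squared norm of a sum of at most $n$ local-update terms by $\|\sum_{\tau=1}^{n}\mathbf{x}_{\tau}\|^{2}\leq n\sum_{\tau=1}^{n}\|\mathbf{x}_{\tau}\|^{2}$ (Jensen's, equivalently Cauchy–Schwarz, inequality), with $n$ at most the number of local steps between global rounds. A quick numerical check of this inequality (illustrative only, not part of the proof):

```python
import numpy as np

def jensen_gap(xs):
    """Return (||sum_t x_t||^2, n * sum_t ||x_t||^2) for a stack of vectors."""
    n = len(xs)
    lhs = np.sum(np.sum(xs, axis=0) ** 2)          # ||sum of vectors||^2
    rhs = n * sum(np.sum(x ** 2) for x in xs)      # n * sum of squared norms
    return lhs, rhs

rng = np.random.default_rng(2)
for _ in range(1000):
    xs = rng.standard_normal((rng.integers(1, 20), 8))
    lhs, rhs = jensen_gap(xs)
    assert lhs <= rhs + 1e-9  # ||sum||^2 <= n * sum ||.||^2 always holds
print("norm-of-sum bound holds")
```

Equality is attained when all terms are identical (e.g., every local step moves in the same direction), which is why the bound is tight up to constants when clients drift coherently.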

For the first term in (D), we have

\displaystyle 2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}\right\|^{2}
\displaystyle\overset{(a)}{=}2\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}-\mathbf{w}_{l}^{\bar{t}_{2}}+\mathbf{w}_{k}^{\bar{t}_{2}}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}\right\|^{2},
\displaystyle\leq 4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\mathbf{w}_{l}^{\bar{t}_{2}}-\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}\right\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\mathbf{w}_{k}^{\bar{t}_{2}}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}\right\|^{2},
\displaystyle\overset{(b)}{\leq}4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{2}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{2}}\right\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{2}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{2}}\right\|^{2},
\displaystyle=8\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\bar{t}_{2}}\mathbb{E}\left\|\mathbf{w}_{i}^{\bar{t}_{2}}\right\|^{2}, (103)

where in $(a)$ we use the fact that $\mathbf{w}_{l}^{\bar{t}_{2}}=\mathbf{w}_{k}^{\bar{t}_{2}}$. Moreover, $(b)$ follows from steps similar to those in (73) and (74).

As such, we have

\displaystyle\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\tilde{\mathbf{w}}_{l}^{\bar{t}_{2},0}-\tilde{\mathbf{w}}_{k}^{\bar{t}_{2},0}\right\|^{2}
\displaystyle\leq\frac{8}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\big\lfloor\frac{t}{\kappa_{0}\kappa_{1}\kappa_{2}}\big\rfloor}\mathbb{E}\bigg\|\mathbf{w}_{i}^{\big\lfloor\frac{t}{\kappa_{0}\kappa_{1}\kappa_{2}}\big\rfloor}\bigg\|^{2}
\displaystyle\approx\mathcal{O}\big(\delta^{\mathrm{th}}D^{2}\big). (104)

For the second term in (D), we have

2η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯2t1(j=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτk=1Blαkj=1Vk,lαji=1Uj,k,lαig~(𝐰~iτ,0)𝟏iτpiτ)2\displaystyle\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\sum_{\tau=\bar{t}_{2}}^{t-1}\Big{(}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\Big{)}\right\|^{2}
=2η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯2t1[j=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))+\displaystyle=\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}+
k=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2,\displaystyle\qquad\qquad\qquad\qquad\qquad\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}-\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\Bigg{\|}^{2},
4η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯2t1[j=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))]2+\displaystyle\leq\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\Bigg{\|}^{2}+
4η2Tt=0T1l=1Lαlk=1Blαk𝔼τ=t¯2t1[k=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)j=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2,\displaystyle\qquad\qquad\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}-\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\Bigg{\|}^{2}, (105)
Lemma 9.
\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\Bigg{\|}^{2}
\displaystyle\leq 8\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+8\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}
\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}, (106)

where $\varphi_{\mathrm{w,L}_{3}}=\frac{1}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(\frac{1}{p_{i}^{t}}-1\right)$.
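As a sanity check on this definition (not part of the paper's analysis), the following sketch evaluates $\varphi_{\mathrm{w,L}_3}$ for a hypothetical toy hierarchy with arbitrarily chosen sizes and uniform aggregation weights; it confirms the term vanishes when every client's upload succeeds with probability $p_i^t=1$ and is strictly positive otherwise.

```python
import numpy as np

# Toy evaluation of the straggler penalty
#   phi_{w,L3} = (1/T) sum_l a_l sum_k a_k sum_j a_j^2 sum_i a_i^2
#                * sum_t (1/p_i^t - 1),
# which is 0 when all success probabilities p_i^t equal 1.
rng = np.random.default_rng(0)

def phi_w_L3(p, a_l, a_k, a_j, a_i):
    """p has shape (L, B, V, U, T): per-client success probabilities."""
    T = p.shape[-1]
    penalty = (1.0 / p - 1.0).sum(axis=-1)  # sum_t (1/p_i^t - 1), shape (L, B, V, U)
    w = (a_l[:, None, None, None]
         * a_k[None, :, None, None]
         * a_j[None, None, :, None] ** 2
         * a_i[None, None, None, :] ** 2)
    return (w * penalty).sum() / T

# Hypothetical sizes: L mBSs, B sBSs, V VCs, U users, T rounds.
L, B, V, U, T = 2, 2, 2, 3, 5
a_l = np.full(L, 1 / L); a_k = np.full(B, 1 / B)
a_j = np.full(V, 1 / V); a_i = np.full(U, 1 / U)

assert phi_w_L3(np.ones((L, B, V, U, T)), a_l, a_k, a_j, a_i) == 0.0
p = rng.uniform(0.5, 1.0, size=(L, B, V, U, T))
assert phi_w_L3(p, a_l, a_k, a_j, a_i) > 0.0
```

The penalty grows as success probabilities shrink, which matches the role of $\varphi_{\mathrm{w,L}_3}$ as the wireless-failure contribution to the bound.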

Lemma 10.
\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}-\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\Bigg{\|}^{2}
\displaystyle\leq\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+
\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}\big{)}+\frac{56\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}. (107)

Using Lemma 9 and Lemma 10, when $\eta\leq\frac{1}{2\sqrt{14}\kappa_{0}\kappa_{1}\kappa_{2}\beta}$, we have

\displaystyle\frac{\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\left\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\right\|^{2}
\displaystyle\leq\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+
\displaystyle\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}\big{)}
\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+
\displaystyle\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}. (108)

D-A Missing Proof of Lemma 9

\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\Bigg{\|}^{2}
\displaystyle\overset{(a)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\bigg{]}\Bigg{\|}^{2}-
\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\Bigg{\|}^{2}
\displaystyle\overset{(b)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{\tau=\bar{t}_{2}}^{t-1}\mathbb{E}\Bigg{\|}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\Bigg{\|}^{2}-
\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{\tau=\bar{t}_{2}}^{t-1}\mathbb{E}\Bigg{\|}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\Bigg{\|}^{2}
\displaystyle\overset{(c)}{\leq}\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{)}\Bigg{\|}^{2}-
\displaystyle\qquad\qquad\qquad\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\Bigg{\|}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{)}\Bigg{\|}^{2}
\displaystyle\overset{(d)}{=}\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\mathbb{E}\Bigg{\|}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\Bigg{\|}^{2}-
\displaystyle\qquad\qquad\qquad\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\mathbb{E}\Bigg{\|}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\Bigg{\|}^{2}
\displaystyle\leq 8\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+8\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}
\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}, (109)

where $\varphi_{\mathrm{w,L}_{3}}=\frac{1}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(\frac{1}{p_{i}^{t}}-1\right)$.

D-B Missing Proof of Lemma 10

\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\Bigg{\|}\sum_{\tau=\bar{t}_{2}}^{t-1}\!\!\bigg{[}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}-\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{]}\Bigg{\|}^{2}
\displaystyle\leq\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{\tau=\bar{t}_{2}}^{t-1}\mathbb{E}\Bigg{\|}\bigg{(}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{j}^{\tau}\big{)}\big{]}\bigg{)}+
\displaystyle\qquad\qquad\qquad\bigg{(}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{j}^{\tau}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}\big{]}\bigg{)}+\bigg{(}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{\tau}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{l}^{\tau}\big{)}\big{]}\bigg{)}+
\displaystyle\qquad\qquad\qquad\bigg{(}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{l}^{\tau}\big{)}-\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{l}^{\tau}\big{)}\bigg{)}+
\displaystyle\qquad\qquad\qquad\bigg{(}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{l}^{\tau}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k^{\prime}}^{\tau}\big{)}\big{]}\bigg{)}+
\displaystyle\qquad\qquad\qquad\bigg{(}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k^{\prime}}^{\tau}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j^{\prime}}^{\tau}\big{)}\big{]}\bigg{)}+
\displaystyle\qquad\qquad\qquad\bigg{(}\sum_{k^{\prime}=1}^{B_{l}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j^{\prime}}^{\tau}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\big{]}\bigg{)}\Bigg{\|}^{2}
\displaystyle\leq 28\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{mbs}}^{2}+\frac{128\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\|^{2}+
\displaystyle\qquad\qquad\qquad\frac{128\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\|\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\|^{2}+
\displaystyle\qquad\qquad\qquad\frac{56\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\|\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\|^{2}+
\displaystyle\qquad\qquad\qquad\frac{56\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}
\displaystyle\leq\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+
\displaystyle\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\beta^{4}\kappa_{0}^{6}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{6}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+
\displaystyle\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\eta^{4}\sigma^{2}\beta^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{5}\kappa_{1}^{4}\kappa_{2}^{2}\beta^{4}\eta^{6}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+
\displaystyle\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}\big{)}+\frac{56\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}
\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+
\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}\big{)}+\frac{56\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\eta^{2}\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\|\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\|^{2}. (110)

Appendix E Proof of Lemma 4

Lemma 4.

When $\eta\leq 1/[6\sqrt{2}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta]$, the average difference between the global and the mBS models, i.e., the $\mathrm{L}_{4}$ term, is bounded as follows:

\displaystyle[\beta^{2}/T]\sum\nolimits_{t=0}^{T-1}\sum\nolimits_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\right\|^{2}\leq\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+
\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{6}\epsilon_{\mathrm{mbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+
\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}+
\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{4}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})\big{)}, (111)

where $\varphi_{\mathrm{w,L}_{4}}(\boldsymbol{\delta},\boldsymbol{\mathrm{f}},\boldsymbol{\mathrm{P}})=[1/T]\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\left(1/p_{i}^{t}-1\right)$.

Proof.
\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\right\|^{2}
\displaystyle=\frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\eta\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-
\displaystyle\qquad\qquad\qquad\Big{(}\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}-\eta\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}\Big{)}\bigg{\|}^{2},
\displaystyle\leq\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}\right\|^{2}+
\displaystyle\qquad\qquad\qquad\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{]}\bigg{\|}^{2}, (112)

where the last inequality follows from Jensen's inequality.

For the first term of (112), we have

\displaystyle 2\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}\right\|^{2},
\displaystyle\overset{(a)}{=}2\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\mathbf{w}^{m\prod_{z=0}^{3}\kappa_{z}}+\mathbf{w}_{l}^{m\prod_{z=0}^{3}\kappa_{z}}-\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}\right\|^{2},
\displaystyle\overset{(b)}{\leq}4\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\mathbf{w}^{m\prod_{z=0}^{3}\kappa_{z}}\right\|^{2}+4\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\mathbf{w}_{l}^{m\prod_{z=0}^{3}\kappa_{z}}-\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}\right\|^{2},
\displaystyle\leq 8\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{m\prod_{z=0}^{3}\kappa_{z}}\mathbb{E}\left\|\mathbf{w}_{i}^{m\prod_{z=0}^{3}\kappa_{z}}\right\|^{2}, (113)

where in $(a)$ we use the fact that $\mathbf{w}^{m\prod_{z=0}^{3}\kappa_{z}}=\mathbf{w}_{l}^{m\prod_{z=0}^{3}\kappa_{z}}$, and $(b)$ stems from $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$. Moreover, the last inequality follows from the same steps as in (74) and (75).
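The inequality used in step $(b)$, $\|\sum_{i=1}^{n}\mathbf{a}_{i}\|^{2}\leq n\sum_{i=1}^{n}\|\mathbf{a}_{i}\|^{2}$, can be spot-checked numerically; the following sketch (purely illustrative, with random vectors) verifies it over many draws.

```python
import numpy as np

# Numerical spot check of ||sum_i a_i||^2 <= n * sum_i ||a_i||^2,
# a standard consequence of the Cauchy-Schwarz (or Jensen) inequality.
rng = np.random.default_rng(1)
for _ in range(100):
    n, d = int(rng.integers(1, 10)), int(rng.integers(1, 8))
    A = rng.standard_normal((n, d))              # n random vectors in R^d
    lhs = np.linalg.norm(A.sum(axis=0)) ** 2     # ||sum_i a_i||^2
    rhs = n * (np.linalg.norm(A, axis=1) ** 2).sum()
    assert lhs <= rhs + 1e-9                     # holds for every draw
```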

As such, we simplify the first term as

\displaystyle\frac{2}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\tilde{\mathbf{w}}^{m\prod_{z=0}^{3}\kappa_{z},0}-\tilde{\mathbf{w}}_{l}^{m\prod_{z=0}^{3}\kappa_{z},0}\right\|^{2}\leq\frac{8}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\delta_{i}^{\left\lfloor t/[\prod_{z=0}^{3}\kappa_{z}]\right\rfloor}\mathbb{E}\left\|\mathbf{w}_{i}^{\left\lfloor t/[\prod_{z=0}^{3}\kappa_{z}]\right\rfloor}\right\|^{2}
\displaystyle\approx\mathcal{O}\big{(}\delta^{\mathrm{th}}D^{2}\big{)}. (114)

The second term of (112) is bounded as follows:

\displaystyle\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}\bigg{]}\bigg{\|}^{2}
\displaystyle=\frac{2\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-
\displaystyle\qquad\qquad\qquad\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}+
\displaystyle\qquad\qquad\qquad\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}
\displaystyle\leq\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-
\displaystyle\qquad\qquad\qquad\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\bigg{\|}^{2}+
\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}. (115)
Lemma 11.

The first term of (E) is bounded as follows:

4η2Tt=0T1l=1Lαl𝔼τ=mz=03κzt1[k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-
l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))]2\displaystyle\qquad\qquad\qquad\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\bigg{\|}^{2}
8(z=03κz)η2σ2l=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2+8(z=03κz)η2G2Tl=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1pitpit)\displaystyle\leq 8\Big{(}\prod_{z=0}^{3}\kappa_{z}\Big{)}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+\frac{8\Big{(}\prod_{z=0}^{3}\kappa_{z}\Big{)}\eta^{2}G^{2}}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\bigg{(}\frac{1-p_{i}^{t}}{p_{i}^{t}}\bigg{)}
𝒪(κ0κ1κ2κ3η2σ2)+𝒪(κ0κ1κ2κ3η2G2φw,L4),\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{4}}\big{)}, (116)

where φw,L4=1Tl=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1pitpit)\varphi_{\mathrm{w,L}_{4}}=\frac{1}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\big{(}\frac{1-p_{i}^{t}}{p_{i}^{t}}\big{)}.
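As a quick numeric sanity check on the role of $\varphi_{\mathrm{w,L}_{4}}$ (this snippet is illustrative only and not part of the analysis; the toy topology and uniform aggregation weights below are our own assumptions), the quantity vanishes under full participation, i.e., $p_i^t=1$ for all $i$ and $t$, and grows as the success probabilities shrink:

```python
# Toy evaluation of phi_{w,L4} = (1/T) sum_l a_l sum_k a_k^2 sum_j a_j^2
#   sum_i a_i^2 sum_t (1 - p_i^t)/p_i^t
# Hypothetical topology: L macro cells, B small cells each, V virtual
# clusters each, U clients each, with uniform weights at every tier.

def phi_w_L4(p, L=2, B=2, V=2, U=3, T=5):
    """p: per-client success probability, held constant over the T rounds."""
    a_l, a_k, a_j, a_i = 1 / L, 1 / B, 1 / V, 1 / U
    total = 0.0
    for _ in range(L):
        inner = 0.0
        for _ in range(B):
            for _ in range(V):
                for _ in range(U):
                    # sum over t of (1 - p)/p, weighted by squared weights
                    inner += a_k**2 * a_j**2 * a_i**2 * sum(
                        (1 - p) / p for _ in range(T)
                    )
        total += a_l * inner
    return total / T

print(phi_w_L4(1.0))                    # full participation -> 0.0
print(phi_w_L4(0.5) < phi_w_L4(0.25))   # smaller p -> larger phi: True
```

With uniform weights the expression collapses to $(1-p)/(pBVU)$, so halving $p$ from $0.5$ to $0.25$ triples the term, which matches the intuition that unreliable uplinks inflate the straggling penalty in Lemma 11.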

Lemma 12.

The second term of (E) is bounded as follows:

4η2Tt=0T1l=1Lαl𝔼τ=mz=03κzt1[k=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}
𝒪(κ04κ12κ22κ32η4β2ϵvc2)+𝒪(κ04κ14κ22κ32η4β2ϵsbs2)+𝒪(κ04κ14κ24κ32η4β4ϵmbs2)+\displaystyle\leq\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ02κ12κ22κ32β2η2ϵ2)+𝒪(κ03κ12κ22κ32η4β2σ2)+𝒪(δthκ02κ12κ22κ32η2β2D2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+
𝒪(κ03κ12κ22κ32η4β2G2φw,L1)+𝒪(κ03κ13κ22κ32β2η4φw,L2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+
𝒪(κ03κ13κ23κ32η4β2G2φw,L3)+72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2.\displaystyle\qquad\qquad\qquad\!\!\!\!\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}+\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\bigg{\|}^{2}. (117)

Using Lemma 11 and Lemma 12, and assuming η162κ0κ1κ2κ3β\eta\leq\frac{1}{6\sqrt{2}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta}, we get

β2Tt=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2\displaystyle\frac{\beta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\left\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\right\|^{2}
𝒪(δthβ2D2)+𝒪(κ0κ1κ2κ3β2η2σ2)+𝒪(κ0κ1κ2κ3β2η2G2φw,L4)+\displaystyle\leq\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{4}}\big{)}+
𝒪(κ04κ12κ22κ32η4β4ϵvc2)+𝒪(κ04κ14κ22κ32η4β4ϵsbs2)+𝒪(κ04κ14κ24κ32η4β6ϵmbs2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{6}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ02κ12κ22κ32β4η2ϵ2)+𝒪(κ03κ12κ22κ32η4β4σ2)+𝒪(δthκ02κ12κ22κ32η2β4D2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{2}\beta^{4}D^{2}\big{)}+
𝒪(κ03κ12κ22κ32η4β4G2φw,L1)+𝒪(κ03κ13κ22κ32β4η4φw,L2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+
𝒪(κ03κ13κ23κ32η4β4G2φw,L3)\displaystyle\qquad\qquad\qquad\!\!\!\!\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}
𝒪(κ04κ12κ22κ32η4β4ϵvc2)+𝒪(κ04κ14κ22κ32η4β4ϵsbs2)+𝒪(κ04κ14κ24κ32η4β6ϵmbs2)+\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{6}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ02κ12κ22κ32β4η2ϵ2)+𝒪(κ0κ1κ2κ3β2η2σ2)+𝒪(δthβ2D2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\beta^{2}D^{2}\big{)}+
𝒪(κ03κ12κ22κ32η4β4G2φw,L1)+𝒪(κ03κ13κ22κ32β4η4φw,L2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+
𝒪(κ03κ13κ23κ32η4β4G2φw,L3)+𝒪(κ0κ1κ2κ3β2η2G2φw,L4).\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta^{2}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{4}}\big{)}. (118)
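The step from Lemma 12 to (118) deserves a word: the right-hand side of Lemma 12 contains the very divergence being bounded, so the bound is self-referential and must be absorbed. A sketch of that absorption argument, in our own shorthand ($X$, $A$, and $c$ below are not symbols from the paper):

```latex
X := \frac{1}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\,
\mathbb{E}\big\|\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\big\|^{2}
\;\le\; A + c\,X,
\qquad c := 72\big(\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big)^{2},
```

where $A$ collects the remaining $\mathcal{O}(\cdot)$ terms of Lemmas 11 and 12. Since $(6\sqrt{2})^{2}=72$, the assumption $\eta\leq\frac{1}{6\sqrt{2}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\beta}$ keeps $c\leq 1$, with strict inequality whenever $\eta$ is strictly below the threshold; rearranging $(1-c)X\leq A$ then gives $X=\mathcal{O}(A)$, which, scaled by $\beta^{2}$, is (118) up to absolute constants.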

E-A Missing Proof of Lemma 11

4η2Tt=0T1l=1Lαl𝔼τ=mz=03κzt1[k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}-
l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))]2\displaystyle\qquad\qquad\qquad\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i^{\prime}}^{\tau}}{p_{i^{\prime}}^{\tau}}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{)}\bigg{]}\bigg{\|}^{2}
=(a)4η2Tt=0T1l=1Lαl𝔼τ=mz=03κzt1[k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))]2\displaystyle\overset{(a)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\bigg{]}\bigg{\|}^{2}-
4η2Tt=0T1𝔼τ=mz=03κzt1[l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))]2\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\bigg{]}\bigg{\|}^{2}
=(b)4η2Tt=0T1l=1Lαlτ=mz=03κzt1𝔼k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))2\displaystyle\overset{(b)}{=}\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\mathbb{E}\bigg{\|}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\bigg{\|}^{2}-
4η2Tt=0T1τ=mz=03κzt1𝔼l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~iτ,0)𝟏iτpiτf~i(𝐰~iτ,0))2\displaystyle\qquad\qquad\qquad\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\mathbb{E}\bigg{\|}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\frac{\boldsymbol{1}_{i}^{\tau}}{p_{i}^{\tau}}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}\bigg{)}\bigg{\|}^{2}
(c)4κ0κ1κ2κ3η2Tt=0T1l=1Lαl𝔼k=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~it,0)𝟏itpit±g~(𝐰~it,0)f~i(𝐰~it,0))2\displaystyle\overset{(c)}{\leq}\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{)}\bigg{\|}^{2}-
4κ0κ1κ2κ3η2Tt=0T1𝔼l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi(g~(𝐰~it,0)𝟏itpit±g~(𝐰~it,0)f~i(𝐰~it,0))2\displaystyle\qquad\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigg{\|}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\bigg{(}\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\frac{\boldsymbol{1}_{i}^{t}}{p_{i}^{t}}\pm\tilde{g}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}\bigg{)}\bigg{\|}^{2}
8κ0κ1κ2κ3η2σ2l=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2+8κ0κ1κ2κ3η2G2Tl=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1pitpit)\displaystyle\leq 8\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}\sigma^{2}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}+\frac{8\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}G^{2}}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\bigg{(}\frac{1-p_{i}^{t}}{p_{i}^{t}}\bigg{)}
𝒪(κ0κ1κ2κ3η2σ2)+𝒪(κ0κ1κ2κ3η2G2φw,L4),\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{4}}\big{)}, (119)

where φw,L4=1Tl=1Lαlk=1Blαk2j=1Vk,lαj2i=1Uj,k,lαi2t=0T1(1pitpit)\varphi_{\mathrm{w,L}_{4}}=\frac{1}{T}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}^{2}\sum_{j=1}^{V_{k,l}}\alpha_{j}^{2}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}^{2}\sum_{t=0}^{T-1}\big{(}\frac{1-p_{i}^{t}}{p_{i}^{t}}\big{)}.

E-B Missing Proof of Lemma 12

4η2Tt=0T1l=1Lαl𝔼τ=mz=03κzt1[k=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰~iτ,0)]2\displaystyle\frac{4\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\sum_{\tau=m\prod_{z=0}^{3}\kappa_{z}}^{t-1}\bigg{[}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{\tau,0}\big{)}-\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{\tau,0}\big{)}\bigg{]}\bigg{\|}^{2}
=4κ0κ1κ2κ3η2Tt=0T1l=1Lαl𝔼(k=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰~it,0)f~i(𝐰¯jt)])+\displaystyle=\frac{4\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\eta^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\bigg{(}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\tilde{\mathbf{w}}_{i}^{t,0}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{j}^{t}\big{)}\big{]}\bigg{)}+
(k=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯jt)f~i(𝐰¯kt)])+(k=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯kt)f~i(𝐰¯lt)])+\displaystyle\qquad\bigg{(}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{j}^{t}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{t}\big{)}\big{]}\bigg{)}+\bigg{(}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{k}^{t}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{l}^{t}\big{)}\big{]}\bigg{)}+
(k=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯lt)f~i(𝐰¯t)])+(k=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰¯t)\displaystyle\qquad\bigg{(}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\big{[}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}_{l}^{t}\big{)}-\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}^{t}\big{)}\big{]}\bigg{)}+\bigg{(}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\nabla\tilde{f}_{i}\big{(}\bar{\mathbf{w}}^{t}\big{)}-
l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαif~i(𝐰¯t))+(l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯t)f~i(𝐰¯lt)])+\displaystyle\qquad\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}^{t}\big{)}\bigg{)}+\bigg{(}\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}^{t}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{l}^{t}\big{)}\big{]}\bigg{)}+
(l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯lt)f~i(𝐰¯kt)])+\displaystyle\qquad\bigg{(}\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{l}^{t}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k}^{t}\big{)}\big{]}\bigg{)}+
(l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯kt)f~i(𝐰¯jt)])+\displaystyle\qquad\bigg{(}\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{k}^{t}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j}^{t}\big{)}\big{]}\bigg{)}+
(l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi[f~i(𝐰¯jt)f~i(𝐰~it,0)])2\displaystyle\qquad\bigg{(}\sum_{l^{\prime}=1}^{L}\alpha_{l^{\prime}}\sum_{k^{\prime}=1}^{B_{l^{\prime}}}\alpha_{k^{\prime}}\sum_{j^{\prime}=1}^{V_{k^{\prime},l^{\prime}}}\alpha_{j^{\prime}}\sum_{i^{\prime}=1}^{U_{j^{\prime},k^{\prime},l^{\prime}}}\alpha_{i^{\prime}}\big{[}\nabla\tilde{f}_{i^{\prime}}\big{(}\bar{\mathbf{w}}_{j}^{t}\big{)}-\nabla\tilde{f}_{i^{\prime}}\big{(}\tilde{\mathbf{w}}_{i^{\prime}}^{t,0}\big{)}\big{]}\bigg{)}~{}\bigg{\|}^{2} (120)
36(βϵηκ0κ1κ2κ3)2+144(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰~it𝐰~it,02+\displaystyle\leq 36\big{(}\beta\epsilon\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}+\frac{144\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg{\|}\tilde{\mathbf{w}}_{i}^{t}-\tilde{\mathbf{w}}_{i}^{t,0}\bigg{\|}^{2}+
144(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαji=1Uj,k,lαi𝔼𝐰¯jt𝐰~it2+\displaystyle\qquad\qquad\qquad\frac{144\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\sum_{i=1}^{U_{j,k,l}}\alpha_{i}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}_{j}^{t}-\tilde{\mathbf{w}}_{i}^{t}\bigg{\|}^{2}+
72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαlk=1Blαkj=1Vk,lαj𝔼𝐰¯kt𝐰¯jt2+\displaystyle\qquad\qquad\qquad\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\sum_{j=1}^{V_{k,l}}\alpha_{j}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}_{k}^{t}-\bar{\mathbf{w}}_{j}^{t}\bigg{\|}^{2}+
72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαlk=1Blαk𝔼𝐰¯lt𝐰¯kt2+\displaystyle\qquad\qquad\qquad\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\sum_{k=1}^{B_{l}}\alpha_{k}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}_{l}^{t}-\bar{\mathbf{w}}_{k}^{t}\bigg{\|}^{2}+
72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2\displaystyle\qquad\qquad\qquad\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\bigg{\|}^{2}
𝒪(κ02κ12κ22κ32β2η2ϵ2)+𝒪(δthκ02κ12κ22κ32β2η2D2)+𝒪(κ03κ12κ22κ32η4β2σ2)+\displaystyle\leq\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+
𝒪(κ04κ12κ22κ32η4β2ϵvc2)+𝒪(κ03κ12κ22κ32η4β2G2φw,L1)+𝒪(δthκ02κ12κ22κ32η2β2D2)+\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+
𝒪(κ06κ14κ22κ32η6β4ϵvc2)+𝒪(κ04κ14κ22κ32η4β2ϵsbs2)+𝒪(κ03κ13κ22κ32η4σ2β2)+\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{6}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{6}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\sigma^{2}\beta^{2}\big{)}+
𝒪(δthκ02κ12κ22κ32β2η2D2)+𝒪(κ05κ14κ22κ32β4η6G2φw,L1)+𝒪(κ03κ13κ22κ32β2η4φw,L2)+\displaystyle\qquad\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{5}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{4}\eta^{6}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+
𝒪(κ05κ14κ24κ32η6β4ϵvc2)+𝒪(κ06κ16κ24κ32η6β4ϵsbs2)+𝒪(κ04κ14κ24κ32η4β4ϵmbs2)+\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{5}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{6}\beta^{4}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{6}\kappa_{1}^{6}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{6}\beta^{4}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ03κ13κ23κ32η4β2σ2)+𝒪(δthκ02κ12κ22κ32η2β2D2)+𝒪(κ05κ14κ24κ32η6β4G2φw,L1)+\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{5}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{6}\beta^{4}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+
𝒪(κ04κ14κ24κ32β4η6φw,L2)+𝒪(κ03κ13κ23κ32η4β2G2φw,L3)+\displaystyle\qquad\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\beta^{4}\eta^{6}\cdot\varphi_{\mathrm{w,L}_{2}})+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}+
72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2\displaystyle\qquad\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\bigg{\|}^{2}
𝒪(κ04κ12κ22κ32η4β2ϵvc2)+𝒪(κ04κ14κ22κ32η4β2ϵsbs2)+𝒪(κ04κ14κ24κ32η4β4ϵmbs2)+\displaystyle\approx\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{vc}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\epsilon_{\mathrm{sbs}}^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{4}\kappa_{1}^{4}\kappa_{2}^{4}\kappa_{3}^{2}\eta^{4}\beta^{4}\epsilon_{\mathrm{mbs}}^{2}\big{)}+
𝒪(κ02κ12κ22κ32β2η2ϵ2)+𝒪(κ03κ12κ22κ32η4β2σ2)+𝒪(δthκ02κ12κ22κ32η2β2D2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{2}\epsilon^{2}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}\sigma^{2}\big{)}+\mathcal{O}\big{(}\delta^{\mathrm{th}}\kappa_{0}^{2}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{2}\beta^{2}D^{2}\big{)}+
𝒪(κ03κ12κ22κ32η4β2G2φw,L1)+𝒪(κ03κ13κ22κ32β2η4φw,L2)+\displaystyle\qquad\qquad\qquad\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{2}\kappa_{2}^{2}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{1}}\big{)}+\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{2}\kappa_{3}^{2}\beta^{2}\eta^{4}\cdot\varphi_{\mathrm{w,L}_{2}})+
𝒪(κ03κ13κ23κ32η4β2G2φw,L3)+72(βηκ0κ1κ2κ3)2Tt=0T1l=1Lαl𝔼𝐰¯t𝐰¯lt2.\displaystyle\qquad\qquad\qquad\!\!\mathcal{O}\big{(}\kappa_{0}^{3}\kappa_{1}^{3}\kappa_{2}^{3}\kappa_{3}^{2}\eta^{4}\beta^{2}G^{2}\cdot\varphi_{\mathrm{w,L}_{3}}\big{)}+\frac{72\big{(}\beta\eta\kappa_{0}\kappa_{1}\kappa_{2}\kappa_{3}\big{)}^{2}}{T}\sum_{t=0}^{T-1}\sum_{l=1}^{L}\alpha_{l}\mathbb{E}\bigg{\|}\bar{\mathbf{w}}^{t}-\bar{\mathbf{w}}_{l}^{t}\bigg{\|}^{2}\!\!.\!\! (121)