
The Impact of Cut Layer Selection in Split Federated Learning

Justin Dachille1, Chao Huang2, Xin Liu1
Abstract

Split Federated Learning (SFL) is a distributed machine learning paradigm that combines federated learning and split learning. In SFL, a neural network is partitioned at a cut layer, with the initial layers deployed on clients and remaining layers on a training server. There are two main variants of SFL: SFL-V1 where the training server maintains separate server-side models for each client, and SFL-V2 where the training server maintains a single shared model for all clients. While existing studies have focused on algorithm development for SFL, a comprehensive quantitative analysis of how the cut layer selection affects model performance remains unexplored. This paper addresses this gap by providing numerical and theoretical analysis of SFL performance and convergence relative to cut layer selection. We find that SFL-V1 is relatively invariant to the choice of cut layer, which is consistent with our theoretical results. Numerical experiments on four datasets and two neural networks show that the cut layer selection significantly affects the performance of SFL-V2. Moreover, SFL-V2 with an appropriate cut layer selection outperforms FedAvg on heterogeneous data.

1 Introduction

Federated Learning (FL) is a popular approach for collectively training machine learning models while preserving data privacy among clients such as mobile phones, IoT devices, and edge devices (McMahan et al. 2017; Brisimi et al. 2018). In conventional FL, clients train models in parallel and then upload their models to a coordinating server for aggregation. However, FL faces significant computational and communication challenges in resource-constrained environments, as each client needs to train the entire model on-premise and communicate the full model to the server (Hamer, Mohri, and Suresh 2020; Caldas et al. 2018). This limitation becomes particularly critical as modern machine learning models, especially Large Language Models (LLMs), continue to grow to billions of parameters and beyond (Minaee et al. 2024).

To address these limitations, Split Learning (SL) has emerged as a promising solution (Gupta and Raskar 2018; Vepakomma et al. 2018). In SL, a neural network is typically split into two parts: the clients train the initial layers of the neural network (e.g., feature extraction layers) and send activations to the server. The server completes the training with the remaining layers and sends back gradients, which the client uses to finish backpropagation on its client-side network. By splitting the model, SL reduces the computational burden on clients. Additionally, studies have shown that SL can outperform FL in terms of communication efficiency as the number of clients increases, since only layer activations rather than full model parameters need to be transmitted (Singh et al. 2019). However, the sequential relay-based training of clients results in prohibitively long training times and inherently reduced scalability. Moreover, SL suffers from catastrophic forgetting, where the model tends to forget previously learned features when training on new client data (Duan et al. 2022).

These limitations motivated the development of Split Federated Learning (SFL) (Thapa et al. 2022). SFL aims to combine the advantages of both FL and SL while minimizing their respective drawbacks. By leveraging parallel training from FL and reduced client-side computation from SL, SFL offers a promising solution for efficient and privacy-preserving distributed learning in resource-constrained environments. We summarize the differences between these three algorithms in Fig. 1.

Figure 1: Comparison of distributed learning architectures. The Model Sync Server maintains model consistency across clients, while the Training Server handles model computations. In FL (top left), clients train complete model copies locally and periodically synchronize with the Model Sync Server (1). In SL (top right), the model $\theta$ is split at a cut layer into client-side ($\theta^{c}$) and server-side ($\theta^{s}$) components, where clients sequentially take turns: each client computes forward activations up to the cut layer (2), the server completes the forward pass and backpropagation to the cut layer, and the client finishes the backward pass using returned gradients (3). SFL comes in two variants: SFL-V1 (bottom left) and SFL-V2 (bottom right). Both variants split the model and maintain client-side synchronization through FedAvg (1), but differ in server-side processing: after clients send activations (2) and receive gradients (3), SFL-V1 aggregates both client- and server-side models, whereas SFL-V2 only aggregates client-side models.

Despite recent advances in SFL algorithms, a critical yet understudied aspect is the selection of the cut layer—the point at which the neural network is split between client and server. Prior work on cut layer selection in SFL has primarily focused on communication efficiency and privacy preservation (Wu et al. 2023; Zhang et al. 2023). However, the impact of cut layer selection on model convergence and performance remains under-explored. This is a critical gap because the cut layer determines the computational load distribution and can affect model performance. Understanding these dynamics is essential for deploying SFL effectively in real-world applications.

In this paper, we aim to address two fundamental questions regarding SFL:

  • Question 1: How does the cut layer affect the performance of SFL?

  • Question 2: How does the performance of SFL compare to FL?

These questions are motivated by several critical factors. First, deep learning is resource-intensive to implement on edge devices (Tak and Cherkaoui 2021), and the choice of cut layer can significantly affect computation requirements. Recent work by Lee et al. (Lee et al. 2024) demonstrates that deeper cut layers increase both client-side computation and communication energy costs: client computation increases with more local layers, while communication costs rise from transmitting larger model parameters during synchronization. They also show that deeper cut layers enhance privacy by making input reconstruction from smashed data more difficult, highlighting the trade-off between efficiency and privacy in cut layer selection. Second, while SFL shows promise as a distributed learning approach, its performance characteristics, particularly in relation to cut layer selection and data heterogeneity, remain underexplored.

Regarding the impact of cut layer selection, we prove theoretically and validate experimentally that SFL-V1's performance remains invariant to cut layer placement in both Independent and Identically Distributed (IID) and non-IID settings. This behavior stems from SFL-V1's architecture, where separate server-side models for each client preserve the independence of client training paths regardless of cut layer location. In contrast, SFL-V2 exhibits significant performance variations with respect to cut layer placement, which we postulate is because its shared server-side model enables effective learning from all client activations simultaneously.

Our key contributions are summarized below:

  • We provide a quantitative study on the effect of cut layer selection on SFL performance.

  • We provide a convergence analysis of SFL with respect to cut layer selection. We show that SFL-V1 is invariant to cut layer selection.

  • We conduct numerical experiments across four datasets and two model architectures. We find that SFL-V2 with an early cut layer significantly outperforms FedAvg on heterogeneous data. SFL-V1, while maintaining consistent performance across cut layers, achieves lower accuracy than SFL-V2 and accuracy similar to that of FL.

2 Related Work

2.1 Federated Learning

Federated Learning (FL) was introduced by McMahan et al. (McMahan et al. 2017) as a distributed learning paradigm where multiple clients collaborate to train a model while keeping their data local. In FL, each client trains a complete copy of the model on their local data, and a model synchronization server aggregates these models to create a global model. The most widely-used FL algorithm, FedAvg, performs weighted averaging of client models based on their local dataset sizes.

Data heterogeneity represents one of the fundamental challenges in FL. When clients have heterogeneous data distributions, FedAvg can suffer significant performance degradation due to the notorious client drift issue (Hsu, Qi, and Brown 2019). Several approaches have been proposed to address this challenge: SCAFFOLD (Karimireddy et al. 2020) uses control variates to correct for client drift in non-IID settings, while FedPer (Arivazhagan et al. 2019) maintains personalized layers on clients while sharing base layers globally. Other approaches like FedDF (Lin et al. 2020) employ ensemble distillation for better heterogeneous model aggregation, and FedProx (Li et al. 2020) adds a proximal term to limit local updates from diverging from the global model. Recent works have introduced novel approaches to tackle heterogeneity: FedUV (Son et al. 2024) introduces regularization terms to emulate IID settings locally by controlling classifier variance and representation uniformity. MOON (Li, He, and Song 2021) leverages model-level contrastive learning to correct local training. FedAlign (Sun et al. 2024) addresses data heterogeneity through data-free knowledge distillation, using a generator to estimate global feature distributions and align local models accordingly. (While these techniques were primarily designed for FL, they can be easily implemented in SFL.)

2.2 Split Learning

Split Learning (SL), introduced by Gupta and Raskar (Gupta and Raskar 2018), takes a different approach to distributed learning. Instead of training complete models on clients, SL splits a neural network at a cut layer, with initial layers on clients and remaining layers on the server. In SL, clients process data through their local layers up to the cut layer, then send activations to the server. The server completes the forward pass, computes gradients, and sends these gradients back to clients for local model updates.

Several works have explored SL’s potential for handling heterogeneous data distributions. SplitAVG (Zhang et al. 2022) combines network splitting with feature map concatenation to train an unbiased estimator of the target data distribution. Li and Lyu (Li and Lyu 2024) provided theoretical convergence guarantees for split learning, suggesting potential advantages over FedAvg on heterogeneous data. COMSPLIT (Ninkovic et al. 2024) introduced a communication-aware SL framework for time series data, incorporating early-exit strategies to handle devices with heterogeneous computational capabilities. FedCST (Wang et al. 2024) proposed a hybrid approach combining device clustering with SL to address both data and device heterogeneity. AdaSplit (Chopra et al. 2021) focused on improving SL’s performance across heterogeneous clients while reducing bandwidth consumption through adaptive mechanisms. In the domain of graph neural networks, SplitGNN (Xu et al. 2023b) addressed heterogeneity in graph data through a split architecture with heterogeneous attention mechanisms. RoofSplit (Huang et al. 2023) proposed an edge computing framework that optimally splits CNN models across heterogeneous edge nodes based on Roofline theory (Williams, Waterman, and Patterson 2009).

2.3 Split Federated Learning

Split Federated Learning (SFL) was originally proposed by Thapa et al. (Thapa et al. 2022) as a hybrid approach combining FL and SL. Since its introduction, several works have explored various aspects of SFL, from privacy and security to system optimization and scalability.

Privacy preservation has been a key focus in SFL research. Li et al. (Li et al. 2022) proposed ResSFL, a framework designed to be resistant to model inversion attacks during training through attacker-aware training of the feature extractor. Khan et al. (Khan et al. 2022) conducted the first empirical analysis of SFL’s robustness against model poisoning attacks, demonstrating that SFL’s lower-dimensional model updates provide inherent robustness compared to traditional FL. Zhang et al. (Zhang et al. 2023) demonstrated that deeper cut layers in SFL improve privacy by reducing reconstruction accuracy at the cost of increased client computation and communication overhead. Zheng et al. (Zheng, Chen, and Lai 2024) introduced PPSFL, incorporating private group normalization layers to protect privacy while addressing data heterogeneity.

System optimization and efficiency have been another major research direction. Mu et al. (Mu and Shen 2023) developed CSE-FSL to reduce communication overhead and server storage requirements through an auxiliary network and selective epoch communication. Gao et al. (Gao et al. 2024) proposed PipeSFL, a fine-grained parallelization framework that addresses client heterogeneity through priority scheduling and hybrid training modes. Xu et al. (Xu et al. 2023a) tackled the challenge of heterogeneous devices by jointly optimizing cut layer selection and bandwidth allocation.

Several works have focused on enhancing SFL’s performance and applicability. Yang et al. (Yang et al. 2022) introduced RoS-FL specifically for U-shaped medical image networks, incorporating a dynamic weight correction strategy to stabilize training and prevent model drift. Li et al. (Li et al. 2023) demonstrated SFL’s effectiveness in healthcare settings, achieving comparable performance to centralized and federated approaches while improving privacy and computational efficiency. Shin et al. (Shin et al. 2023) proposed FedSplitX to handle system heterogeneity in foundation models by implementing multiple partition points and auxiliary networks. Liao et al. (Liao et al. 2024) developed MergeSFL, which combines feature merging and batch size regulation to improve accuracy and training efficiency. Zhu et al. (Zhu et al. 2024) proposed ESFL, which jointly optimizes user-side workload and server-side resource allocation for heterogeneous environments.

Recent work has begun to examine the implications of cut layer selection, though primarily from system-level perspectives. Lee et al. (Lee et al. 2024) provided a comprehensive analysis of how cut layer placement affects both energy consumption and privacy preservation in SFL systems, demonstrating that cut layer selection creates important trade-offs between computational efficiency and data protection. Shiranthika et al. (Shiranthika et al. 2023) investigated the robustness of different cut layer positions against network packet loss, finding statistical evidence that deeper split points can provide advantages in certain applications. While these works provide valuable insights into system-level tradeoffs, they do not address the question of how cut layer selection affects model performance. Note that recent work has examined SFL convergence properties (Han et al. 2024b), but it does not analyze the specific effects of cut layer selection. Our work addresses this gap by providing the first comprehensive analysis of how cut layer selection affects model performance across various datasets, architectures, and data distributions.

3 Problem Formulation

We provide an overview of SFL here. The complete algorithmic description with detailed notation can be found in Appendix A. Consider a set of clients $\mathcal{K}=\{1,2,\ldots,K\}$, where each client $k\in\mathcal{K}$ possesses a local private dataset $\mathcal{D}_{k}$ of size $D_{k}=|\mathcal{D}_{k}|$. The global model, parameterized by $\theta$, consists of $L$ layers. In SFL, the global model is split at the $L_{c}$-th layer (the cut layer) into two segments:

  • Client-side model $\theta^{C}$ (layers $1$ to $L_{c}$),

  • Server-side model $\theta^{S}$ (layers $L_{c}+1$ to $L$),

where $\theta=\{\theta^{C};\theta^{S}\}$. Let $\theta^{C}_{k}$ denote client $k$'s local client-side model. The system involves two servers (see Fig. 1):

  • Model synchronization server (MSS): It aggregates and distributes model weights, similar to FL.

  • Training server (TS): It maintains server-side models and performs training computations.

In this paper, we focus on two variants of SFL as proposed in (Thapa et al. 2022): SFL-V1 and SFL-V2. The key distinction between these variants lies in how the TS manages its models. In SFL-V1, the TS maintains separate server-side models $\theta^{S}_{k}$ for each client $k$, which the MSS aggregates after each communication round to form a global server-side model. In SFL-V2, the TS maintains a single shared model $\theta^{S}$ that processes data from all clients sequentially.

More specifically, let $F_{k}(\theta,B_{k})$ denote the loss of model $\theta$ over client $k$'s mini-batch samples $B_{k}$, which is randomly sampled from client $k$'s dataset $\mathcal{D}_{k}$. Let $F_{k}(\theta)\triangleq\mathbb{E}_{B_{k}\sim\mathcal{D}_{k}}[F_{k}(\theta,B_{k})]$ denote the expected loss of model $\theta$ over client $k$'s dataset. SFL aims to minimize the expected loss of the model $\theta$ over the datasets of all clients:

$\min_{\theta}f(\theta)=\sum_{k=1}^{K}\alpha_{k}F_{k}\left(\theta\right),$ (1)

where $\alpha_{k}\in[0,1]$ is the weight of client $k$, and we typically have $\alpha_{k}=D_{k}/\sum_{k^{\prime}\in\mathcal{K}}D_{k^{\prime}}$.

An SFL process operates for $T$ communication rounds to minimize the global loss function $f(\theta)$. Each communication round consists of the following steps:

Client Forward Pass

At the beginning of each round $t$, every client $k$ (in parallel) samples a mini-batch $B_{k}$ from its local dataset $\mathcal{D}_{k}$. The client performs a forward pass through its local model $\theta^{C}_{k}(t)$ using $B_{k}$, producing activations $a_{k}(t)$ at the cut layer. These activations and their corresponding ground-truth labels $y_{k}(t)$ from $B_{k}$ are sent to the TS.
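To make the client-side step concrete, the following is a minimal PyTorch-style sketch of the client forward pass; it is not the authors' implementation, and the function and variable names (client_forward_pass, smashed_data) are illustrative.

```python
import torch

def client_forward_pass(client_model: torch.nn.Module, batch, device="cpu"):
    """Sketch of client k's forward pass up to the cut layer.

    The client keeps `activations` (with its autograd graph) for the later
    backward pass and transmits only a detached copy a_k(t) plus the labels.
    """
    inputs, labels = batch
    inputs, labels = inputs.to(device), labels.to(device)
    activations = client_model(inputs)           # layers 1..L_c
    smashed_data = activations.detach().clone()  # sent to the training server
    return activations, smashed_data, labels
```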

Training Server-side Computation

The TS operation differs between SFL-V1 and SFL-V2. For SFL-V1, each client's activations $a_{k}(t)$ are processed by their corresponding server-side model $\theta^{S}_{k}(t)$. The TS computes predictions $\hat{y}_{k}(t)$ and the loss $\mathcal{L}(\hat{y}_{k}(t),y_{k}(t))$ for each client. Then, the TS computes and applies gradients $\nabla\theta^{S}_{k}(t)$ to update each server-side model. It computes gradients $\nabla a_{k}(t)$ at the cut layer, and sends $\nabla a_{k}(t)$ back to each client.

For SFL-V2, client activations $a_{k}(t)$ are processed sequentially in a randomized manner through the shared server-side model $\theta^{S}(t)$. For each client's data, the TS computes predictions $\hat{y}_{k}(t)$ and loss $\mathcal{L}(\hat{y}_{k}(t),y_{k}(t))$, and updates $\theta^{S}(t)$ using gradient descent. Then, it computes gradients $\nabla a_{k}(t)$ at the cut layer, and sends $\nabla a_{k}(t)$ back to each client.
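A hedged sketch of the corresponding server-side step for SFL-V2 is shown below; SFL-V1 would apply the same update to the per-client model $\theta^{S}_{k}$ instead. The helper name server_step_v2 and the use of cross-entropy loss are our own assumptions.

```python
import torch.nn.functional as F

def server_step_v2(server_model, server_optimizer, smashed_data, labels):
    """Sketch of one shared-model update on a single client's activations.

    Returns the gradient of the loss with respect to the cut-layer
    activations, which the training server sends back to the client.
    """
    smashed_data = smashed_data.requires_grad_(True)  # leaf of the server graph
    server_optimizer.zero_grad()
    predictions = server_model(smashed_data)          # layers L_c+1..L
    loss = F.cross_entropy(predictions, labels)
    loss.backward()                                   # grads for theta^S and a_k(t)
    server_optimizer.step()
    return smashed_data.grad.detach()                 # nabla a_k(t)
```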

Client Backward Pass

After receiving $\nabla a_{k}(t)$, each client completes its backward pass to compute gradients $\nabla\theta^{C}_{k}(t)$. The client updates its model: $\theta^{C}_{k}(t+1)=\theta^{C}_{k}(t)-\eta^{t}\nabla\theta^{C}_{k}(t)$, where $\eta^{t}$ is the learning rate. This process repeats for $E$ local epochs before the next communication round.
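The client-side backward pass can be sketched as follows (illustrative names; one gradient application per received batch, with the $E$ local epochs handled by an outer loop).

```python
def client_backward_pass(client_optimizer, activations, grad_from_server):
    """Sketch of client k's backward pass using the returned cut-layer gradient.

    `activations` is the tensor produced by the client's forward pass (still
    attached to the client-side graph); `grad_from_server` is nabla a_k(t).
    """
    client_optimizer.zero_grad()
    activations.backward(grad_from_server)  # backprop through layers L_c..1
    client_optimizer.step()                 # theta_k^C update with step size eta^t
```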

Model Aggregation

At the end of each round, the MSS performs FedAvg (McMahan et al. 2017) on the client-side models, averaging them weighted by their local dataset sizes to form $\theta^{C}(t+1)$. In SFL-V1, the MSS similarly aggregates its per-client server-side models $\theta^{S}_{k}(t)$ into a single model $\theta^{S}(t+1)$. In SFL-V2, the server-side model $\theta^{S}(t)$ is not aggregated.
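The MSS aggregation is standard dataset-size-weighted FedAvg; a minimal sketch over PyTorch state dicts is given below (in SFL-V1 the same routine would also be applied to the per-client server-side models).

```python
import copy

def fedavg(state_dicts, dataset_sizes):
    """Sketch of weighted FedAvg over client-side model parameters theta_k^C."""
    total = float(sum(dataset_sizes))
    averaged = copy.deepcopy(state_dicts[0])
    for key in averaged:
        averaged[key] = sum(
            (d_k / total) * sd[key].float()
            for sd, d_k in zip(state_dicts, dataset_sizes)
        )
    return averaged
```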

4 Theoretical Results

In this section, we provide convergence analysis to understand the impact of cut layer selection. We start with some assumptions that are widely adopted in the distributed learning literature, e.g., (Han et al. 2024b; Woodworth, Patel, and Srebro 2020; Wang et al. 2019).

Assumption 1.

Each client $k$'s loss function $F_{k}$ is non-convex.

Assumption 2.

Each client $k$'s loss function $F_{k}$ is $S$-smooth. That is, for all model parameters $\theta,\phi$,

$F_{k}(\theta)\leq F_{k}(\phi)+\langle\nabla F_{k}(\phi),\theta-\phi\rangle+\frac{S}{2}\|\theta-\phi\|^{2}.$ (2)
Assumption 3.

The stochastic gradients $g_{k}(\cdot)$ of $F_{k}(\cdot)$ are unbiased with the variance bounded by $\sigma_{k}^{2}$:

$\mathbb{E}_{B_{k}\sim\mathcal{D}_{k}}\left[g_{k}\left(\theta,B_{k}\right)\right]=\nabla F_{k}\left(\theta\right),$ (3)
$\mathbb{E}_{B_{k}\sim\mathcal{D}_{k}}\left[\left\|g_{k}\left(\theta,B_{k}\right)-\nabla F_{k}\left(\theta\right)\right\|^{2}\right]\leq\sigma_{k}^{2}.$ (4)
Assumption 4.

There exists an $\epsilon^{2}$ such that for any client $k$,

$\left\|\nabla F_{k}\left(\theta\right)-\nabla f\left(\theta\right)\right\|^{2}\leq\epsilon^{2}.$ (5)

A larger $\epsilon^{2}$ indicates greater data heterogeneity. Next, we show that the convergence of SFL-V1 is invariant to the choice of cut layer $L_{c}$.

Proposition 1.

(Convergence Invariance to Cut Layer Selection in SFL-V1) Let Assumptions 1-4 hold, and let $\eta^{t}\leq\min\left\{\frac{1}{16S\tau},\frac{1}{8SK\tau\sum_{k=1}^{K}\alpha_{k}^{2}}\right\}$. Then, for any $L_{c}\in\{1,2,\cdots,L-1\}$, the following inequality holds:

$\frac{1}{T}\sum_{t=0}^{T-1}\eta^{t}\,\mathbb{E}\left[\left\|\nabla f\left(\theta(t)\right)\right\|^{2}\right]\leq\frac{4}{T\tau}\left(f\left(\theta(0)\right)-f\left(\theta^{\ast}\right)\right)+\frac{16KS\tau}{T}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right)\sum_{t=0}^{T-1}\left(\eta^{t}\right)^{2},$ (6)

where $\theta^{\ast}$ is the optimal global model, $E$ is the number of epochs, and $\tau\triangleq\lceil\frac{ED_{k}}{B_{k}}\rceil$ denotes the number of model updates in one communication round.

The proof of Proposition 1 is given in Appendix B. The convergence bound is affected by terms involving the initial condition $f(\theta(0))-f(\theta^{\ast})$, the data heterogeneity $\epsilon^{2}$, and the gradient variance $\sigma_{k}^{2}$. A worse initial condition, larger data heterogeneity, or larger gradient variance will hurt the convergence, which is also observed in other distributed learning algorithms (Han et al. 2024a; Huang, Dachille, and Liu 2024). Importantly, Proposition 1 holds for any cut layer selection. The key rationale is that SFL-V1 with different cut layers can be equivalently transformed into FedAvg with identical model updates. This mathematical invariance to $L_{c}$ aligns with our empirical observations in Sec. 5 that SFL-V1's performance remains stable across different cut layer selections.
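As an illustration of the bound (our own simplification, not part of Proposition 1), with a constant learning rate $\eta^{t}=\eta$ satisfying the step-size condition, dividing both sides of (6) by $\eta$ gives the familiar two-term trade-off, neither term of which depends on $L_{c}$:

$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla f(\theta(t))\right\|^{2}\right]\leq\frac{4\left(f(\theta(0))-f(\theta^{\ast})\right)}{T\tau\eta}+16KS\tau\eta\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right).$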

Note that the convergence proof of SFL-V2 would require a significantly different approach, and is left to future work. We will show in Sec. 5 that SFL-V2 outperforms SFL-V1 and FL under various experiment settings.

5 Numerical Results

We conduct experiments to validate our theoretical analysis and provide additional insights.

Table 1: Dataset configurations showing architectures, client counts, and communication rounds.
Dataset | Model | Clients | Rounds
HAM10000 | ResNet-18 | 100 | 100
CIFAR-10 | ResNet-18 | 100 | 200
CIFAR-100 | ResNet-50 | 100 | 300
Tiny ImageNet | ResNet-50 | 10 | 300
Table 2: Training hyperparameters for experiments.
Algorithm | Learning rate | Batch size | Optimizer
SFL-V1 | 0.001 | 64 | Adam
SFL-V2 | 0.001 | 64 | Adam
FedAvg | 0.01 | 64 | SGD

5.1 Experimental Setup

We conduct experiments on CIFAR-10 (Krizhevsky, Hinton et al. 2009) and HAM10000 (Tschandl 2018) with ResNet-18, and CIFAR-100 (Krizhevsky, Hinton et al. 2009) and Tiny ImageNet (Le and Yang 2015) with ResNet-50. The architectures use cut layer $L_{c}\in\{1,2,3,4\}$, where $L_{c}=i$ denotes that the network is cut after the $i$-th residual block. These cut points correspond to the boundaries between macro-residual blocks in ResNet architectures (He et al. 2016). We set the number of clients $K=100$, except for Tiny ImageNet, where $K=10$. The models, numbers of clients, and rounds of training used for each dataset are enumerated in Table 1. The local training epochs for each client were set to $E=5$. Hyperparameters for all algorithms are enumerated in Table 2.
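For concreteness, a possible way to realize these cut points for torchvision's ResNet-18 is sketched below; the exact module grouping in our implementation may differ, and num_classes=10 is an illustrative assumption.

```python
import torch.nn as nn
from torchvision.models import resnet18

def split_resnet18(cut_layer: int, num_classes: int = 10):
    """Sketch: split ResNet-18 after the cut_layer-th residual macro-block.

    cut_layer in {1, 2, 3, 4} maps to the layer1..layer4 boundaries; the stem
    (conv1/bn1/relu/maxpool) always stays on the client side.
    """
    model = resnet18(num_classes=num_classes)
    blocks = [
        nn.Sequential(model.conv1, model.bn1, model.relu, model.maxpool, model.layer1),
        model.layer2,
        model.layer3,
        model.layer4,
    ]
    head = nn.Sequential(model.avgpool, nn.Flatten(), model.fc)
    client_side = nn.Sequential(*blocks[:cut_layer])        # theta^C
    server_side = nn.Sequential(*blocks[cut_layer:], head)  # theta^S
    return client_side, server_side
```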

We consider both IID and non-IID scenarios. In our non-IID scenario, we use a label-based Dirichlet distribution (Hsu, Qi, and Brown 2019; Li, He, and Song 2021) with $\mu=0.1$ to simulate highly imbalanced datasets across clients. Here, $\mu>0$ is a tunable parameter, where a smaller value of $\mu$ yields a more imbalanced partition and $\mu=\infty$ corresponds to an IID data partition. We exclude the HAM10000 dataset from this process because the dataset itself is highly label-skewed. The results for the IID and non-IID settings are presented in Tables 3 and 4, respectively. Bold values indicate the best performance for each dataset, while underlined values show the second-best results.
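A minimal sketch of the label-based Dirichlet partition we describe is shown below (following the scheme of Hsu, Qi, and Brown 2019; the function name and seed handling are illustrative).

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, mu=0.1, seed=0):
    """Sketch: partition sample indices across clients with a Dir(mu) prior.

    For each class, a proportion vector drawn from Dirichlet(mu) decides how
    many of that class's samples each client receives; smaller mu yields a
    more imbalanced split. Returns one index array per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(np.full(num_clients, mu))
        splits = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, splits)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]
```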

Baseline Selection

We use FedAvg as our FL baseline since innovations like proximal terms, control variates, and knowledge distillation could enhance both FL and SFL implementations. For instance, proximal terms could be added to client objectives in SFL to limit divergence, or control variates could be incorporated to correct for client drift. Our choice enables a fair comparison of the architectural differences between approaches without conflating algorithm-specific improvements.

Table 3: Test accuracy (%) comparison on IID data. Bold and underlined indicate best and second-best results respectively (mean ± std over 3 independent runs). CIFAR10 uses ResNet-18; CIFAR100 and TinyImageNet use ResNet-50.
Method | CIFAR10 | CIFAR100 | TinyImageNet
FedAvg | 85.25 ± 0.07 | 48.48 ± 0.38 | 47.49 ± 0.07
SFL-V1 ($L_c=1$) | 86.34 ± 0.37 | 49.56 ± 0.18 | 48.73 ± 0.43
SFL-V1 ($L_c=2$) | 86.12 ± 0.46 | 50.28 ± 0.76 | 49.26 ± 0.78
SFL-V1 ($L_c=3$) | 85.94 ± 0.36 | 49.29 ± 0.11 | 49.11 ± 0.70
SFL-V1 ($L_c=4$) | 86.01 ± 0.45 | 46.59 ± 1.42 | 48.11 ± 0.27
SFL-V2 ($L_c=1$) | 92.30 ± 0.15 | 56.00 ± 0.69 | 54.77 ± 0.22
SFL-V2 ($L_c=2$) | 89.56 ± 0.15 | 45.93 ± 0.21 | 50.86 ± 0.05
SFL-V2 ($L_c=3$) | 87.57 ± 0.34 | 41.29 ± 0.26 | 44.15 ± 0.11
SFL-V2 ($L_c=4$) | 86.11 ± 0.36 | 45.98 ± 0.24 | 43.85 ± 0.25
Table 4: Test accuracy (%) comparison on non-IID data with Dirichlet $\mu=0.1$. Bold and underlined indicate best and second-best results respectively (mean ± std over 3 independent runs). HAM10000 and CIFAR10 use ResNet-18; CIFAR100 and TinyImageNet use ResNet-50.
Method | HAM10000 | CIFAR10 | CIFAR100 | TinyImageNet
FedAvg | 77.37 ± 0.35 | 67.59 ± 2.52 | 42.60 ± 1.18 | 28.33 ± 0.28
SFL-V1 ($L_c=1$) | 77.88 ± 0.13 | 66.38 ± 2.22 | 39.96 ± 0.84 | 12.84 ± 0.73
SFL-V1 ($L_c=2$) | 78.18 ± 0.67 | 65.64 ± 1.49 | 39.82 ± 0.63 | 13.42 ± 0.60
SFL-V1 ($L_c=3$) | 78.40 ± 0.66 | 65.36 ± 1.09 | 40.32 ± 0.88 | 13.66 ± 0.97
SFL-V1 ($L_c=4$) | 78.10 ± 0.83 | 64.51 ± 1.84 | 41.24 ± 1.18 | 13.89 ± 1.00
SFL-V2 ($L_c=1$) | 80.58 ± 0.35 | 67.58 ± 6.28 | 52.38 ± 0.97 | 30.14 ± 8.58
SFL-V2 ($L_c=2$) | 79.56 ± 0.88 | 59.98 ± 11.99 | 45.90 ± 1.08 | 22.55 ± 2.28
SFL-V2 ($L_c=3$) | 79.26 ± 0.54 | 61.73 ± 4.04 | 42.42 ± 0.99 | 27.08 ± 1.48
SFL-V2 ($L_c=4$) | 78.27 ± 0.65 | 69.45 ± 0.64 | 43.31 ± 1.17 | 26.42 ± 1.52

5.2 Impact of Cut Layer on SFL

SFL-V1: Robust Across Cut Layers

SFL-V1 demonstrates stability across cut layers in both IID and non-IID settings, which is consistent with our analysis. For IID CIFAR10, the performance variation is minimal ($0.4\%$), while for non-IID CIFAR10, it's slightly higher but still modest ($1.87\%$). This stability stems from SFL-V1's algorithmic similarity to FedAvg, with minor variations potentially attributable to implementation-specific factors in PyTorch's computation graph. The robustness of SFL-V1 to cut layer selection enhances its deployability, particularly for resource-constrained edge devices and heterogeneous client environments.

SFL-V2: Performance Affected by Cut Layer Selection

In contrast, SFL-V2 exhibits a strong dependency on cut layer selection. In three out of four datasets, SFL-V2 achieves peak performance with $L_{c}=1$. The impact of the cut layer becomes more pronounced in non-IID settings. For instance, on non-IID CIFAR100, SFL-V2's performance ranges from $42.42\%$ ($L_{c}=3$) to $52.38\%$ ($L_{c}=1$). While SFL-V2 generally achieves optimal performance with $L_{c}=1$, we observe an interesting exception with non-IID CIFAR-10, where $L_{c}=4$ performs best ($69.45\%$ vs. $67.58\%$ at $L_{c}=1$). This outlier may be attributed to CIFAR-10's relatively simple feature space compared to CIFAR-100 or TinyImageNet. With only 10 classes, later-layer features might be more discriminative for class separation under non-IID conditions. However, this pattern doesn't generalize across other datasets, suggesting that early cut layers remain the more robust choice for most applications.

The aforementioned behavior can be attributed to the fact that lower cut layers allow more data to pass through the training server, which more closely approximates centralized learning (CL), where all data is pooled and trained at a single location. Conceptually, $L_{c}=0$ would be equivalent to CL, while $L_{c}=L$ would reduce to FedAvg. Thus, the cut layer in SFL-V2 can serve as a crucial performance-tuning parameter.

5.3 SFL vs FL

Our experiments demonstrate that SFL, particularly SFL-V2, often outperforms FL. In IID settings, SFL-V2 ($L_{c}=1$) consistently outperforms FedAvg across all datasets. For CIFAR10, SFL-V2 achieves $92.30\%$ accuracy compared to FedAvg's $85.25\%$, a significant $7.05\%$ improvement. The performance gap widens in non-IID scenarios. On CIFAR100, SFL-V2 ($L_{c}=1$) reaches $52.38\%$, while FedAvg achieves $42.60\%$, a substantial $9.78\%$ difference. This advantage is maintained with increasing dataset complexity: for non-IID TinyImageNet, SFL-V2 ($L_{c}=1$) achieves $30.14\%$ versus FedAvg's $28.33\%$.

Our results suggest that SFL-V2's superior performance stems from its unique architecture combining local and centralized computation. With an early cut layer, most of the model computation occurs at the training server, where the server-side model processes features from all clients without requiring weight aggregation. Unlike FedAvg, where models must reconcile entirely separate local training trajectories through weight averaging, SFL-V2's server-side model directly updates on client features through backpropagation. This more direct form of learning appears to be more effective than weight aggregation, particularly for heterogeneous data distributions. The sequential processing of client features by a single shared server-side model, combined with client-side model aggregation, enables SFL-V2 to learn global patterns more efficiently than traditional FL approaches. This helps explain why SFL-V2 can surpass FedAvg, particularly when the cut layer is placed early in the network.

6 Discussion

6.1 Addressing Non-IID Challenges

While our experiments demonstrate SFL-V2’s strong performance in non-IID settings, several important theoretical questions remain unexplored. First, we lack a theoretical framework explaining why SFL-V2’s architecture provides advantages at certain cut layer selections. Understanding this could help optimize cut layer placement and guide architectural choices for different types of data heterogeneity. Additionally, while we’ve shown empirical improvements, we haven’t established theoretical guarantees for the advantage of SFL-V2 over FedAvg. A comprehensive theoretical analysis would need to account for the complex interplay between data heterogeneity, model architecture, and system constraints like client computation and communication resources.

6.2 Privacy Considerations

Although SFL maintains some level of privacy by keeping part of the model on client devices, the improved performance with lower cut layers in SFL-V2 raises important privacy considerations. As more layers are shifted to the server, there’s an increased risk of potential privacy leakage through model inversions or inference attacks. Further research is needed to quantify these risks and develop mitigation strategies, possibly through the integration of differential privacy techniques or secure multi-party computation.

7 Conclusion

This paper presents an analysis of split federated learning, focusing on the impact of cut layer selection on model performance across various datasets and data distributions. Our study reveals that SFL-V1 exhibits stability across cut layers, while SFL-V2’s performance is significantly influenced by the cut layer depth, with lower layers generally yielding better results. Notably, SFL-V2 outperforms FedAvg in both IID and non-IID settings. These findings highlight SFL’s potential as a flexible and effective approach for privacy-preserving distributed learning, particularly in scenarios with heterogeneous data distributions.

For future work, it would be interesting to explore how to combine SFL with FL techniques handling data heterogeneity such as SCAFFOLD’s control variates, FedProx’s proximal terms, and MOON’s contrastive learning approaches, using the unique availability of smashed data in SFL. Several challenges remain: developing a theoretical framework explaining SFL-V2’s improved performance under non-IID conditions, quantifying privacy-performance trade-offs in cut layer selection, and validating these findings on real-world datasets. Additionally, developing adaptive algorithms that optimize these trade-offs based on specific application requirements, privacy constraints, and system limitations represents an important direction for future research.

References

  • Arivazhagan et al. (2019) Arivazhagan, M. G.; Aggarwal, V.; Singh, A. K.; and Choudhary, S. 2019. Federated learning with personalization layers. arXiv preprint arXiv:1912.00818.
  • Brisimi et al. (2018) Brisimi, T. S.; Chen, R.; Mela, T.; Olshevsky, A.; Paschalidis, I. C.; and Shi, W. 2018. Federated learning of predictive models from federated electronic health records. International journal of medical informatics, 112: 59–67.
  • Caldas et al. (2018) Caldas, S.; Konečny, J.; McMahan, H. B.; and Talwalkar, A. 2018. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210.
  • Chopra et al. (2021) Chopra, A.; Sahu, S. K.; Singh, A.; Java, A.; Vepakomma, P.; Sharma, V.; and Raskar, R. 2021. Adasplit: Adaptive trade-offs for resource-constrained distributed deep learning. arXiv preprint arXiv:2112.01637.
  • Duan et al. (2022) Duan, Q.; Hu, S.; Deng, R.; and Lu, Z. 2022. Combined federated and split learning in edge computing for ubiquitous intelligence in internet of things: State-of-the-art and future directions. Sensors, 22(16): 5983.
  • Gao et al. (2024) Gao, Y.; Hu, B.; Mashhadi, M. B.; Wang, W.; and Bennis, M. 2024. PipeSFL: A Fine-Grained Parallelization Framework for Split Federated Learning on Heterogeneous Clients. IEEE Transactions on Mobile Computing.
  • Gupta and Raskar (2018) Gupta, O.; and Raskar, R. 2018. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116: 1–8.
  • Hamer, Mohri, and Suresh (2020) Hamer, J.; Mohri, M.; and Suresh, A. T. 2020. FedBoost: A Communication-Efficient Algorithm for Federated Learning. In III, H. D.; and Singh, A., eds., Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, 3973–3983. PMLR.
  • Han et al. (2024a) Han, P.; Huang, C.; Shi, X.; Huang, J.; and Liu, X. 2024a. Incentivizing Participation in SplitFed Learning: Convergence Analysis and Model Versioning. In International Conference on Distributed Computing Systems, 846–856. IEEE.
  • Han et al. (2024b) Han, P.; Huang, C.; Tian, G.; Tang, M.; and Liu, X. 2024b. Convergence Analysis of Split Federated Learning on Heterogeneous Data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hsu, Qi, and Brown (2019) Hsu, T.-M. H.; Qi, H.; and Brown, M. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.
  • Huang, Dachille, and Liu (2024) Huang, C.; Dachille, J.; and Liu, X. 2024. When Federated Learning Meets Oligopoly Competition: Stability and Model Differentiation. IEEE Internet of Things Journal.
  • Huang et al. (2023) Huang, Y.; Zhang, H.; Shao, X.; Li, X.; and Ji, H. 2023. RoofSplit: an edge computing framework with heterogeneous nodes collaboration considering optimal CNN model splitting. Future Generation Computer Systems, 140: 79–90.
  • Karimireddy et al. (2020) Karimireddy, S. P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; and Suresh, A. T. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, 5132–5143. PMLR.
  • Khan et al. (2022) Khan, M. A.; Shejwalkar, V.; Houmansadr, A.; and Anwar, F. M. 2022. Security analysis of splitfed learning. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, 987–993.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Le and Yang (2015) Le, Y.; and Yang, X. 2015. Tiny imagenet visual recognition challenge. CS 231N, 7(7): 3.
  • Lee et al. (2024) Lee, J.; Seif, M.; Cho, J.; and Poor, H. V. 2024. Exploring the Privacy-Energy Consumption Tradeoff for Split Federated Learning. IEEE Network.
  • Li et al. (2022) Li, J.; Rakin, A. S.; Chen, X.; He, Z.; Fan, D.; and Chakrabarti, C. 2022. Ressfl: A resistance transfer framework for defending model inversion attack in split federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10194–10202.
  • Li, He, and Song (2021) Li, Q.; He, B.; and Song, D. 2021. Model-contrastive federated learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10713–10722.
  • Li et al. (2020) Li, T.; Sahu, A. K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; and Smith, V. 2020. Federated optimization in heterogeneous networks. Proceedings of Machine learning and systems, 2: 429–450.
  • Li and Lyu (2024) Li, Y.; and Lyu, X. 2024. Convergence analysis of sequential federated learning on heterogeneous data. Advances in Neural Information Processing Systems, 36.
  • Li et al. (2023) Li, Z.; Yan, C.; Zhang, X.; Gharibi, G.; Yin, Z.; Jiang, X.; and Malin, B. A. 2023. Split Learning for Distributed Collaborative Training of Deep Learning Models in Health Informatics. In AMIA Annual Symposium Proceedings, 1047. American Medical Informatics Association.
  • Liao et al. (2024) Liao, Y.; Xu, Y.; Xu, H.; Wang, L.; Yao, Z.; and Qiao, C. 2024. Mergesfl: Split federated learning with feature merging and batch size regulation. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), 2054–2067. IEEE.
  • Lin et al. (2020) Lin, T.; Kong, L.; Stich, S. U.; and Jaggi, M. 2020. Ensemble distillation for robust model fusion in federated learning. Advances in neural information processing systems, 33: 2351–2363.
  • McMahan et al. (2017) McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, 1273–1282. PMLR.
  • Minaee et al. (2024) Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; and Gao, J. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196.
  • Mu and Shen (2023) Mu, Y.; and Shen, C. 2023. Communication and storage efficient federated split learning. In IEEE International Conference on Communications, 2976–2981. IEEE.
  • Ninkovic et al. (2024) Ninkovic, V.; Vukobratovic, D.; Miskovic, D.; and Zennaro, M. 2024. COMSPLIT: A Communication–Aware Split Learning Design for Heterogeneous IoT Platforms. IEEE Internet of Things Journal.
  • Shin et al. (2023) Shin, J.; Ahn, J.; Kang, H.; and Kang, J. 2023. FedSplitX: Federated Split Learning for Computationally-Constrained Heterogeneous Clients. arXiv preprint arXiv:2310.14579.
  • Shiranthika et al. (2023) Shiranthika, C.; Kafshgari, Z. H.; Saeedi, P.; and Bajić, I. V. 2023. SplitFed resilience to packet loss: Where to split, that is the question. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 367–377. Springer.
  • Singh et al. (2019) Singh, A.; Vepakomma, P.; Gupta, O.; and Raskar, R. 2019. Detailed comparison of communication efficiency of split learning and federated learning. arXiv preprint arXiv:1909.09145.
  • Son et al. (2024) Son, H. M.; Kim, M.-H.; Chung, T.-M.; Huang, C.; and Liu, X. 2024. FedUV: Uniformity and Variance for Heterogeneous Federated Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5863–5872.
  • Sun et al. (2024) Sun, W.; Yan, R.; Jin, R.; Zhao, R.; and Chen, Z. 2024. FedAlign: Federated Model Alignment via Data-Free Knowledge Distillation for Machine Fault Diagnosis. IEEE Transactions on Instrumentation and Measurement, 73: 1–12.
  • Tak and Cherkaoui (2021) Tak, A.; and Cherkaoui, S. 2021. Federated Edge Learning: Design Issues and Challenges. IEEE Network, 35(2): 252–258.
  • Thapa et al. (2022) Thapa, C.; Arachchige, P. C. M.; Camtepe, S.; and Sun, L. 2022. Splitfed: When federated learning meets split learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 8485–8493.
  • Tschandl (2018) Tschandl, P. 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.
  • Vepakomma et al. (2018) Vepakomma, P.; Gupta, O.; Swedish, T.; and Raskar, R. 2018. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564.
  • Wang et al. (2019) Wang, S.; Tuor, T.; Salonidis, T.; Leung, K. K.; Makaya, C.; He, T.; and Chan, K. 2019. Adaptive federated learning in resource constrained edge computing systems. IEEE journal on selected areas in communications, 37(6): 1205–1221.
  • Wang et al. (2024) Wang, Z.; Lin, H.; Liu, Q.; Zhang, Y.; and Liu, X. 2024. FedCST: Federated Learning on Heterogeneous Resource-constrained Devices Using Clustering and Split Training. In 2024 IEEE 24th International Conference on Software Quality, Reliability, and Security Companion (QRS-C), 786–792. IEEE.
  • Williams, Waterman, and Patterson (2009) Williams, S.; Waterman, A.; and Patterson, D. 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4): 65–76.
  • Woodworth, Patel, and Srebro (2020) Woodworth, B. E.; Patel, K. K.; and Srebro, N. 2020. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33: 6281–6292.
  • Wu et al. (2023) Wu, W.; Li, M.; Qu, K.; Zhou, C.; Shen, X.; Zhuang, W.; Li, X.; and Shi, W. 2023. Split learning over wireless networks: Parallel design and resource management. IEEE Journal on Selected Areas in Communications, 41(4): 1051–1066.
  • Xu et al. (2023a) Xu, C.; Li, J.; Liu, Y.; Ling, Y.; and Wen, M. 2023a. Accelerating split federated learning over wireless communication networks. IEEE Transactions on Wireless Communications.
  • Xu et al. (2023b) Xu, X.; Lyu, L.; Dong, Y.; Lu, Y.; Wang, W.; and Jin, H. 2023b. SplitGNN: Splitting GNN for Node Classification with Heterogeneous Attention. arXiv preprint arXiv:2301.12885.
  • Yang et al. (2022) Yang, Z.; Chen, Y.; Huangfu, H.; Ran, M.; Wang, H.; Li, X.; and Zhang, Y. 2022. Robust split federated learning for u-shaped medical image networks. arXiv preprint arXiv:2212.06378.
  • Zhang et al. (2022) Zhang, M.; Qu, L.; Singh, P.; Kalpathy-Cramer, J.; and Rubin, D. L. 2022. Splitavg: A heterogeneity-aware federated deep learning method for medical imaging. IEEE Journal of Biomedical and Health Informatics, 26(9): 4635–4644.
  • Zhang et al. (2023) Zhang, Z.; Pinto, A.; Turina, V.; Esposito, F.; and Matta, I. 2023. Privacy and efficiency of communications in federated split learning. IEEE Transactions on Big Data, 9(5): 1380–1391.
  • Zheng, Chen, and Lai (2024) Zheng, J.; Chen, Y.; and Lai, Q. 2024. PPSFL: Privacy-Preserving Split Federated Learning for heterogeneous data in edge-based Internet of Things. Future Generation Computer Systems, 156: 231–241.
  • Zhu et al. (2024) Zhu, G.; Deng, Y.; Chen, X.; Zhang, H.; Fang, Y.; and Wong, T. F. 2024. ESFL: Efficient Split Federated Learning over Resource-Constrained Heterogeneous Wireless Devices. IEEE Internet of Things Journal.

Appendix A Appendix: Algorithmic Framework

A.1 Notation

We denote the number of clients as $K$ and the set of all clients as $\mathcal{K}=\{1,\ldots,K\}$. Each client $k$ has a local dataset $\mathcal{D}_{k}$ of size $|\mathcal{D}_{k}|$. The client-side model parameters for client $k$ are denoted as $\theta^{C}_{k}$, while the server-side model parameters are denoted as $\theta^{S}$ for SFL-V2 and $\theta^{S}_{k}$ for each client $k$ in SFL-V1. The cut layer index is denoted as $L_{c}$, and $B_{k}$ represents a mini-batch sampled from client $k$'s local dataset. We use $\eta$ as the learning rate, $E$ as the number of local epochs, and $T$ as the number of communication rounds. The activations at the cut layer from client $k$ are denoted as $a_{k}$, with $y_{k}$ representing the ground-truth labels and $\hat{y}_{k}$ representing the predictions for client $k$'s data. The system involves two distinct server roles: the Training Server (TS), which maintains and updates server-side models, and the Model Synchronization Server (MSS), which handles model aggregation and distribution.

A.2 SFL-V1 Algorithm

Algorithm 1 Split Federated Learning V1 (SFL-V1)
Input: clients $\mathcal{K}=\{1,\ldots,K\}$; communication rounds $T$; local epochs $E$; learning rate $\eta$; cut layer index $L_{c}$; local datasets $\{\mathcal{D}_{k}\}_{k=1}^{K}$
Output: final client models $\{\theta^{C}_{k}(T)\}_{k=1}^{K}$; final server models $\{\theta^{S}_{k}(T)\}_{k=1}^{K}$
1: Initialize client models $\{\theta^{C}_{k}(0)\}_{k=1}^{K}$ and server models $\{\theta^{S}_{k}(0)\}_{k=1}^{K}$
2: for round $t=0,\ldots,T-1$ do
3:     Client-side Parallel Processing:
4:     for each client $k\in\mathcal{K}$ in parallel do
5:         Sample mini-batch $B_{k}\sim\mathcal{D}_{k}$
6:         Compute activations $a_{k}(t)$ through $\theta^{C}_{k}(t)$ up to $L_{c}$
7:         Send $(a_{k}(t),y_{k}(t))$ to the Training Server
8:     end for
9:     Training Server Operations:
10:     for each client $k\in\mathcal{K}$ in parallel do
11:         Compute predictions: $\hat{y}_{k}(t)=\theta^{S}_{k}(t)(a_{k}(t))$
12:         Calculate loss: $\mathcal{L}(\hat{y}_{k}(t),y_{k}(t))$
13:         Update server model: $\theta^{S}_{k}(t)\leftarrow\theta^{S}_{k}(t)-\eta\nabla\theta^{S}_{k}(t)$
14:         Compute and send gradients $\nabla a_{k}(t)$ to client $k$
15:     end for
16:     Client Backward Pass:
17:     for each client $k\in\mathcal{K}$ in parallel do
18:         for epoch $e=1,\ldots,E$ do
19:             Backpropagate using $\nabla a_{k}(t)$
20:             Update client model: $\theta^{C}_{k}(t)\leftarrow\theta^{C}_{k}(t)-\eta\nabla\theta^{C}_{k}(t)$
21:         end for
22:     end for
23:     Model Synchronization:
24:     Aggregate client models: $\theta^{C}(t+1)\leftarrow\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{\sum_{i}|\mathcal{D}_{i}|}\theta^{C}_{k}(t)$
25:     Aggregate server models: $\theta^{S}(t+1)\leftarrow\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{\sum_{i}|\mathcal{D}_{i}|}\theta^{S}_{k}(t)$
26:     Broadcast aggregated models to all clients for the next round:
27:     for each client $k\in\mathcal{K}$ do
28:         Set client model: $\theta^{C}_{k}(t+1)\leftarrow\theta^{C}(t+1)$
29:         Set server model: $\theta^{S}_{k}(t+1)\leftarrow\theta^{S}(t+1)$
30:     end for
31: end for

A.3 SFL-V2 Algorithm

Algorithm 2 Split Federated Learning V2 (SFL-V2)
Input: clients $\mathcal{K}=\{1,\ldots,K\}$; communication rounds $T$; local epochs $E$; learning rate $\eta$; cut layer index $L_{c}$; local datasets $\{\mathcal{D}_{k}\}_{k=1}^{K}$
Output: final client models $\{\theta^{C}_{k}(T)\}_{k=1}^{K}$; final server model $\theta^{S}(T)$
1: Initialize client models $\{\theta^{C}_{k}(0)\}_{k=1}^{K}$ and server model $\theta^{S}(0)$
2: for round $t=0,\ldots,T-1$ do
3:     Client-side Parallel Processing:
4:     for each client $k\in\mathcal{K}$ in parallel do
5:         Sample mini-batch $B_{k}\sim\mathcal{D}_{k}$
6:         Compute activations $a_{k}(t)$ through $\theta^{C}_{k}(t)$ up to $L_{c}$
7:         Send $(a_{k}(t),y_{k}(t))$ to the Training Server
8:     end for
9:     Training Server Sequential Processing:
10:     Generate random permutation $\pi$ of clients $\mathcal{K}$
11:     for $k\in\pi$ do
12:         Compute predictions: $\hat{y}_{k}(t)=\theta^{S}(t)(a_{k}(t))$
13:         Calculate loss: $\mathcal{L}(\hat{y}_{k}(t),y_{k}(t))$
14:         Update server model: $\theta^{S}(t)\leftarrow\theta^{S}(t)-\eta\nabla\theta^{S}(t)$
15:         Compute and send gradients $\nabla a_{k}(t)$ to client $k$
16:     end for
17:     Client Backward Pass:
18:     for each client $k\in\mathcal{K}$ in parallel do
19:         for epoch $e=1,\ldots,E$ do
20:             Backpropagate using $\nabla a_{k}(t)$
21:             Update client model: $\theta^{C}_{k}(t)\leftarrow\theta^{C}_{k}(t)-\eta\nabla\theta^{C}_{k}(t)$
22:         end for
23:     end for
24:     Model Synchronization:
25:     $\theta^{C}(t+1)\leftarrow\sum_{k=1}^{K}\frac{|\mathcal{D}_{k}|}{\sum_{i}|\mathcal{D}_{i}|}\theta^{C}_{k}(t)$
26:     Broadcast the aggregated model to all clients for the next round:
27:     for each client $k\in\mathcal{K}$ do
28:         $\theta^{C}_{k}(t+1)\leftarrow\theta^{C}(t+1)$
29:     end for
30: end for
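For readers who prefer code to pseudocode, the following is a compact, self-contained PyTorch-style sketch of one SFL-V2 communication round (one mini-batch per client per round for brevity); all names are illustrative and this is not the authors' implementation.

```python
import copy
import random
import torch.nn.functional as F

def sfl_v2_round(client_models, client_loaders, client_opts,
                 server_model, server_opt, dataset_sizes, device="cpu"):
    """Sketch of Algorithm 2 for a single mini-batch per client."""
    order = list(range(len(client_models)))
    random.shuffle(order)                      # random permutation pi of clients
    for k in order:
        x, y = next(iter(client_loaders[k]))
        x, y = x.to(device), y.to(device)

        # Client forward pass up to the cut layer
        activations = client_models[k](x)
        smashed = activations.detach().requires_grad_(True)

        # Shared server-side update on client k's activations
        server_opt.zero_grad()
        loss = F.cross_entropy(server_model(smashed), y)
        loss.backward()
        server_opt.step()

        # Client backward pass with the returned cut-layer gradient
        client_opts[k].zero_grad()
        activations.backward(smashed.grad.detach())
        client_opts[k].step()

    # Model synchronization: weighted FedAvg over client-side models only
    total = float(sum(dataset_sizes))
    averaged = copy.deepcopy(client_models[0].state_dict())
    for key in averaged:
        averaged[key] = sum((dataset_sizes[k] / total) * client_models[k].state_dict()[key].float()
                            for k in range(len(client_models)))
    for model in client_models:
        model.load_state_dict(averaged)
```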

Appendix B Proof of Proposition 1

The proof mainly follows (Han et al. 2024b). The key difference is that we specify how the choice of cut layer affects the model updates at the training server and client side. For ease of presentation, we put the communication round index in the superscript and the client/server notations ($c$/$s$) in the subscripts.

B.1 Preliminary

We start with a few lemmas that facilitate the proof.

Lemma 1.

[Multiple iterations of local training in each round] Under Assumptions 2-4, let $\eta^{t}\leq\frac{1}{\sqrt{8}S\tau}$ and run client $k$'s local model for $\tau$ iterations continuously in any round $t$. Then, for any cut layer selection $L_{c}\in\{1,2,\cdots,L-1\}$, we have

$\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\theta^{t,i}_{k}-\theta^{t}\right\|^{2}\right]\leq 2\tau^{2}\left(8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}\right).$ (7)
Proof.

$\mathbb{E}\left[\left\|\theta^{t,i}_{k}-\theta^{t}\right\|^{2}\right]$
$\leq\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\eta^{t}\boldsymbol{g}_{k}^{t,i-1}-\theta^{t}\right\|^{2}\right]$
$\leq\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}-\eta^{t}\left(\boldsymbol{g}_{k}^{t,i-1}-\nabla_{\theta}F_{k}\left(\theta^{t,i-1}_{k}\right)+\nabla_{\theta}F_{k}\left(\theta^{t,i-1}_{k}\right)-\nabla_{\theta}F_{k}\left(\theta^{t}\right)+\nabla_{\theta}F_{k}\left(\theta^{t}\right)-\nabla_{\theta}f\left(\theta^{t}\right)+\nabla_{\theta}f\left(\theta^{t}\right)\right)\right\|^{2}\right]$
$\leq\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}\right\|^{2}\right]+8\tau\mathbb{E}\left[\left\|\eta^{t}\left(\boldsymbol{g}_{k}^{t,i-1}-\nabla_{\theta}F_{k}\left(\theta^{t,i-1}_{k}\right)\right)\right\|^{2}\right]+8\tau\mathbb{E}\left[\left\|\eta^{t}\left(\nabla_{\theta}F_{k}\left(\theta^{t,i-1}_{k}\right)-\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right)\right\|^{2}\right]+8\tau\mathbb{E}\left[\left\|\eta^{t}\left(\nabla_{\theta}F_{k}\left(\theta^{t}\right)-\nabla_{\theta}f\left(\theta^{t}\right)\right)\right\|^{2}\right]+8\tau\left\|\eta^{t}\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}$
$\leq\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}\right\|^{2}\right]+8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}\right\|^{2}\right]+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}$
$\leq\left(1+\frac{1}{\tau}+8\tau\left(\eta^{t}\right)^{2}S^{2}\right)\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}\right\|^{2}\right]+8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}$
$\leq\left(1+\frac{2}{\tau}\right)\mathbb{E}\left[\left\|\theta^{t,i-1}_{k}-\theta^{t}\right\|^{2}\right]+8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}$ (8)

where we have applied Assumptions 2-4, the inequality $\left(X+Y\right)^{2}\leq\left(1+a\right)X^{2}+\left(1+\frac{1}{a}\right)Y^{2}$ for some positive $a$, and $\eta^{t}\leq\frac{1}{\sqrt{8}S\tau}$.
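For readability, the specific instances used above are (our own elaboration of the standard steps in (Han et al. 2024b)): taking $a=\frac{1}{\tau}$ gives $\|X+Y\|^{2}\leq(1+\frac{1}{\tau})\|X\|^{2}+(1+\tau)\|Y\|^{2}$, and Jensen's inequality on the four-term sum inside $Y$ gives $\|\sum_{j=1}^{4}Y_{j}\|^{2}\leq 4\sum_{j=1}^{4}\|Y_{j}\|^{2}$; the combined coefficient $4(1+\tau)$ is then bounded by $8\tau$ since $\tau\geq 1$.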

Let

$A_{t,i}:=\mathbb{E}\left[\left\|\theta^{t,i}_{k}-\theta^{t}\right\|^{2}\right],$
$B:=8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2},$
$C:=1+\frac{2}{\tau}.$

We have

$A_{t,i}\leq CA_{t,i-1}+B$ (9)

We can show that

$A_{t,i}\leq C^{i}A_{t}+B\sum_{j=0}^{i-1}C^{j}$

Note that $A_{t}=\mathbb{E}\left[\left\|\theta^{t}-\theta^{t}\right\|^{2}\right]=0$. Accumulating the above over $\tau$ iterations, we have

$\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\theta^{t,i}_{k}-\theta^{t}\right\|^{2}\right]\leq\sum_{i=0}^{\tau-1}B\sum_{j=0}^{i-1}C^{j}\leq 2\tau^{2}B\leq 2\tau^{2}\left(8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}\right)$ (10)

where we use $\sum_{i=0}^{N-1}x^{i}=\frac{x^{N}-1}{x-1}$ and $(1+\frac{n}{x})^{x}\leq e^{n}$. Therefore, we complete the proof. ∎

Lemma 2.

[Multiple iterations of local gradient accumulation in each round] Under Assumptions 2-4, let $\eta^{t}\leq\frac{1}{2S\tau}$ and run client $k$'s local model for $\tau$ iterations continuously in any round $t$. Then, for any cut layer selection $L_{c}\in\{1,2,\cdots,L-1\}$, we have

\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]\leq 8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\left(\left\|\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right). (11)
Proof.
\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t,i-1}+\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\left(1+\tau\right)\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t,i-1}\right\|^{2}\right]+\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\left(1+\tau\right)S^{2}\mathbb{E}\left[\left\|\theta^{t,i}_{k}-\theta^{t,i-1}_{k}\right\|^{2}\right]+\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}\right\|^{2}\right]+\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}+\boldsymbol{g}_{k}^{t}\right\|^{2}\right]+\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq 2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]+2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\quad+\left(1+\frac{1}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq\left(1+\frac{2}{\tau}\right)\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]+2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right]. (12)
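The last inequality uses the step-size condition $\eta^{t}\leq\frac{1}{2S\tau}$ to absorb the coefficient of $\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t,i-1}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]$; a quick check (assuming $\tau\geq 1$):

2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\leq\frac{2\left(1+\tau\right)S^{2}}{4S^{2}\tau^{2}}=\frac{1+\tau}{2\tau^{2}}\leq\frac{1}{\tau},\qquad\text{so}\qquad\left(1+\frac{1}{\tau}\right)+2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\leq 1+\frac{2}{\tau}.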

We define the following notation for simplicity:

A_{t,i}:=\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right] (13)
B:=2\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right] (14)
C:=1+\frac{2}{\tau} (15)

We have

A_{t,i}\leq CA_{t,i-1}+B (16)

We can show that

A_{t,i}\leq C^{i}A_{t,0}+B\sum_{j=0}^{i-1}C^{j}

Note that $A_{t,0}=\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]=0$. Summing over the $\tau$ local iterations, we have

\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{k}-\boldsymbol{g}_{k}^{t}\right\|^{2}\right]\leq\sum_{i=0}^{\tau-1}B\sum_{j=0}^{i-1}C^{j}\leq 2\tau^{2}B
=4\tau^{2}\left(1+\tau\right)\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq 8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right]
\leq 8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\left(\left\|\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right). (17)
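The last two inequalities hold because $4\tau^{2}\left(1+\tau\right)\leq 8\tau^{3}$ for $\tau\geq 1$, and because the stochastic gradient is assumed unbiased with variance at most $\sigma_{k}^{2}$ (Assumptions 2-4), so that

\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}\right\|^{2}\right]=\left\|\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right\|^{2}+\mathbb{E}\left[\left\|\boldsymbol{g}_{k}^{t}-\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right\|^{2}\right]\leq\left\|\nabla_{\theta}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}.

This completes the proof of Lemma 2. ∎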

B.2 Main server’s model update

We first analyze the one-round model update on the main server side. For any cut layer selection $L_{c}\in\{1,2,\cdots,L-1\}$, we have

\mathbb{E}\left[\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\theta^{t+1}_{s}-\theta^{t}_{s}\right\rangle\right]
\leq\mathbb{E}\left[\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\theta^{t+1}_{s}-\theta^{t}_{s}+\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)-\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\rangle\right]
\leq\mathbb{E}\left[\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\theta^{t+1}_{s}-\theta^{t}_{s}+\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\rangle-\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\rangle\right]
\leq\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\mathbb{E}\left[-\eta^{t}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\boldsymbol{g}_{s,k}^{t,i}\right]+\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\rangle-\eta^{t}\tau\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}
\leq\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\mathbb{E}\left[-\eta^{t}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\left\{\theta^{t,i}_{c,k},\theta^{t,i}_{s,k}\right\}\right)\right]+\eta^{t}\tau\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\rangle-\eta^{t}\tau\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}
\leq\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\mathbb{E}\left[-\eta^{t}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\left\{\theta^{t,i}_{c,k},\theta^{t,i}_{s,k}\right\}\right)+\eta^{t}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right]\right\rangle
\quad-\eta^{t}\tau\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}
\leq\eta^{t}\tau\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\mathbb{E}\left[-\frac{1}{\tau}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\left\{\theta^{t,i}_{c,k},\theta^{t,i}_{s,k}\right\}\right)+\frac{1}{\tau}\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right]\right\rangle
\quad-\eta^{t}\tau\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}
\leq\frac{\eta^{t}\tau}{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+\frac{\eta^{t}}{2\tau}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\left\{\theta^{t,i}_{c,k},\theta^{t,i}_{s,k}\right\}\right)-\sum_{k=1}^{K}\sum_{i=0}^{\tau-1}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}\right]
\quad-\eta^{t}\tau\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}
\leq-\frac{\eta^{t}\tau}{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+\frac{K\eta^{t}}{2\tau}\sum_{k=1}^{K}\alpha_{k}^{2}\mathbb{E}\left[\left\|\sum_{i=0}^{\tau-1}\left(\nabla_{\theta_{s}}F_{k}\left(\left\{\theta^{t,i}_{c,k},\theta^{t,i}_{s,k}\right\}\right)-\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right)\right\|^{2}\right]
\leq-\frac{\eta^{t}\tau}{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+\frac{K\eta^{t}S^{2}}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\theta^{t,i}_{s,k}-\theta^{t}_{s}\right\|^{2}\right], (18)

where we have used the fact that $\nabla_{\theta_{s}}f\left(\theta^{t}\right)=\sum_{k=1}^{K}\alpha_{k}\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)$, and the inequality $\left\langle a,b\right\rangle\leq\frac{\left\|a\right\|^{2}+\left\|b\right\|^{2}}{2}$.
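The passage from the inner product to the squared-norm terms also uses Jensen's inequality $\left\|\mathbb{E}\left[\mathbf{z}\right]\right\|^{2}\leq\mathbb{E}\left[\left\|\mathbf{z}\right\|^{2}\right]$ and a Cauchy-Schwarz bound over the $K$ clients; writing $v_{k}$ for client $k$'s accumulated gradient difference (a notational shorthand used only here), the latter reads

\left\|\sum_{k=1}^{K}\alpha_{k}v_{k}\right\|^{2}\leq K\sum_{k=1}^{K}\alpha_{k}^{2}\left\|v_{k}\right\|^{2}.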

By Lemma 1 with $\eta^{t}\leq\frac{1}{\sqrt{8}S\tau}$, we have

\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\theta^{t,i}_{s,k}-\theta^{t}_{s}\right\|^{2}\right]\leq 2\tau^{2}\left(8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}\right). (19)

Thus, (18) becomes

\mathbb{E}\left[\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\theta^{t+1}_{s}-\theta^{t}_{s}\right\rangle\right]
\leq-\frac{\eta^{t}\tau}{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+\frac{K\eta^{t}S^{2}}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\cdot 2\tau^{2}\left(8\tau\left(\eta^{t}\right)^{2}\sigma_{k}^{2}+8\tau\left(\eta^{t}\right)^{2}\epsilon^{2}+8\tau\left(\eta^{t}\right)^{2}\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}\right)
\leq\left(-\frac{\eta^{t}\tau}{2}+8K\left(\eta^{t}\right)^{3}\tau^{3}S^{2}\sum_{k=1}^{K}\alpha_{k}^{2}\right)\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+8K\eta^{t}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\eta^{t}\right)^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right). (20)

Furthermore, we have

\frac{S}{2}\mathbb{E}\left[\left\|\theta^{t+1}_{s}-\theta^{t}_{s}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}}{2}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\sum_{i=0}^{\tau-1}\alpha_{k}\boldsymbol{g}^{t,i}_{s,k}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\mathbb{E}\left[\left\|\sum_{i=0}^{\tau-1}\boldsymbol{g}^{t,i}_{s,k}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{s,k}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{s,k}-\boldsymbol{g}_{s,k}^{t}+\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sum_{i=0}^{\tau-1}\left(\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{s,k}-\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]+\mathbb{E}\left[\left\|\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]\right)
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sum_{i=0}^{\tau-1}\left(\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{s,k}-\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]+\mathbb{E}\left[\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right]\right), (21)

where the last line uses Assumptions 2-4 and the variance decomposition $\mathbb{E}\left[\|\mathbf{z}\|^{2}\right]=\|\mathbb{E}[\mathbf{z}]\|^{2}+\mathbb{E}\left[\|\mathbf{z}-\mathbb{E}[\mathbf{z}]\|^{2}\right]$ for any random vector $\mathbf{z}$.
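Applied to the server-side stochastic gradient, which is assumed unbiased with variance at most $\sigma_{k}^{2}$ (Assumptions 2-4), this decomposition gives

\mathbb{E}\left[\left\|\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]=\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\mathbb{E}\left[\left\|\boldsymbol{g}_{s,k}^{t}-\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}\right]\leq\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}.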

By Lemma 2 with $\eta^{t}\leq\frac{1}{2S\tau}$, we have

\sum_{i=0}^{\tau-1}\mathbb{E}\left[\left\|\boldsymbol{g}^{t,i}_{s,k}-\boldsymbol{g}_{s,k}^{t}\right\|^{2}\right]\leq 8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\left(\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right). (22)

Thus,

\frac{S}{2}\mathbb{E}\left[\left\|\theta^{t+1}_{s}-\theta^{t}_{s}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\left(\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right)+\tau\mathbb{E}\left[\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right]\right)
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\left(\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right)
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\left(\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)-\nabla_{\theta_{s}}f\left(\theta^{t}\right)+\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+\sigma_{k}^{2}\right)
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\left(2\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+2\epsilon^{2}+\sigma_{k}^{2}\right). (23)
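where the last step applies $\left\|a+b\right\|^{2}\leq 2\left\|a\right\|^{2}+2\left\|b\right\|^{2}$ together with the assumption bounding $\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)-\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}$ by $\epsilon^{2}$ (cf. Assumptions 2-4); concretely,

\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)-\nabla_{\theta_{s}}f\left(\theta^{t}\right)+\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}\leq 2\left\|\nabla_{\theta_{s}}F_{k}\left(\theta^{t}\right)-\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}+2\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}\leq 2\epsilon^{2}+2\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}.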

B.3 Clients’ model update

The analysis of the client-side model update is similar to that of the main server. For any cut layer selection $L_{c}\in\{1,2,\cdots,L-1\}$, we have

\mathbb{E}\left[\left\langle\nabla_{\theta_{c}}f\left(\theta^{t}\right),\theta^{t+1}_{c}-\theta^{t}_{c}\right\rangle\right]
\leq\left(-\frac{\eta^{t}\tau}{2}+8K\left(\eta^{t}\right)^{3}\tau^{3}S^{2}\sum_{k=1}^{K}\alpha_{k}^{2}\right)\left\|\nabla_{\theta_{c}}f\left(\theta^{t}\right)\right\|^{2}+8K\eta^{t}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\eta^{t}\right)^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right). (24)

For $\eta^{t}\leq\frac{1}{2S\tau}$,

\frac{S}{2}\mathbb{E}\left[\left\|\theta^{t+1}_{c}-\theta^{t}_{c}\right\|^{2}\right]
\leq\frac{SK\left(\eta^{t}\right)^{2}\tau}{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\left(2\left\|\nabla_{\theta_{c}}f\left(\theta^{t}\right)\right\|^{2}+2\epsilon^{2}+\sigma_{k}^{2}\right). (25)

B.4 Superposition of main server and clients

For any cut layer selection $L_{c}\in\{1,2,\cdots,L-1\}$, we have

\mathbb{E}\left[f\left(\theta^{t+1}\right)\right]-f\left(\theta^{t}\right)
\leq\mathbb{E}\left[\left\langle\nabla_{\theta_{c}}f\left(\theta^{t}\right),\theta^{t+1}_{c}-\theta^{t}_{c}\right\rangle\right]+\frac{S}{2}\mathbb{E}\left[\left\|\theta^{t+1}_{c}-\theta^{t}_{c}\right\|^{2}\right]+\mathbb{E}\left[\left\langle\nabla_{\theta_{s}}f\left(\theta^{t}\right),\theta^{t+1}_{s}-\theta^{t}_{s}\right\rangle\right]
\quad+\frac{S}{2}\mathbb{E}\left[\left\|\theta^{t+1}_{s}-\theta^{t}_{s}\right\|^{2}\right]
\leq\left(-\frac{\eta^{t}\tau}{2}+8K\left(\eta^{t}\right)^{3}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}+SK\left(\eta^{t}\right)^{2}\tau\sum_{k=1}^{K}\alpha_{k}^{2}\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\right)\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}
\quad+8K\eta^{t}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\eta^{t}\right)^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right)
\quad+\frac{1}{2}SK\left(\eta^{t}\right)^{2}\tau\left(\tau+8\tau^{3}\left(\eta^{t}\right)^{2}S^{2}\right)\sum_{k=1}^{K}\alpha_{k}^{2}\left(2\epsilon^{2}+\sigma_{k}^{2}\right)
\leq\left(-\frac{\eta^{t}\tau}{2}+SK\left(\eta^{t}\right)^{2}\tau^{2}\sum_{k=1}^{K}\alpha_{k}^{2}+8K\left(\eta^{t}\right)^{3}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}+8S^{3}K\left(\eta^{t}\right)^{4}\tau^{4}\sum_{k=1}^{K}\alpha_{k}^{2}\right)\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}
\quad+8K\left(\eta^{t}\right)^{3}S^{2}\tau^{3}\sum_{k=1}^{K}\alpha_{k}^{2}\sigma_{k}^{2}+8K\left(\eta^{t}\right)^{3}S^{2}\tau^{3}\epsilon^{2}\sum_{k=1}^{K}\alpha_{k}^{2}
\quad+SK\left(\eta^{t}\right)^{2}\tau^{2}\epsilon^{2}\sum_{k=1}^{K}\alpha_{k}^{2}+\frac{1}{2}SK\left(\eta^{t}\right)^{2}\tau^{2}\sum_{k=1}^{K}\alpha_{k}^{2}\sigma_{k}^{2}+8KS^{3}\left(\eta^{t}\right)^{4}\tau^{4}\epsilon^{2}\sum_{k=1}^{K}\alpha_{k}^{2}+4KS^{3}\left(\eta^{t}\right)^{4}\tau^{4}\sum_{k=1}^{K}\alpha_{k}^{2}\sigma_{k}^{2}
\leq-\frac{\eta^{t}\tau}{2}\left(1-4KS\eta^{t}\tau\sum_{k=1}^{K}\alpha_{k}^{2}\right)\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}+2KS\left(\eta^{t}\right)^{2}\tau^{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right)
\leq-\frac{\eta^{t}\tau}{4}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}+2KS\left(\eta^{t}\right)^{2}\tau^{2}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right), (26)

where we first let $\eta^{t}\leq\frac{1}{16S\tau}$ and then let $\eta^{t}\leq\frac{1}{8SK\tau\sum_{k=1}^{K}\alpha_{k}^{2}}$. We also use $\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}=\left\|\nabla_{\theta_{c}}f\left(\theta^{t}\right)\right\|^{2}+\left\|\nabla_{\theta_{s}}f\left(\theta^{t}\right)\right\|^{2}$.
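To see how the two step-size conditions enter, note that with $\eta^{t}\leq\frac{1}{16S\tau}$ the higher-order coefficients of $\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}$ are absorbed (a brief check):

8S\tau\eta^{t}\leq\frac{1}{2},\quad 8S^{2}\tau^{2}\left(\eta^{t}\right)^{2}\leq\frac{1}{32},\quad\text{so}\quad SK\left(\eta^{t}\right)^{2}\tau^{2}\left(1+8S\tau\eta^{t}+8S^{2}\tau^{2}\left(\eta^{t}\right)^{2}\right)\sum_{k=1}^{K}\alpha_{k}^{2}\leq 2SK\left(\eta^{t}\right)^{2}\tau^{2}\sum_{k=1}^{K}\alpha_{k}^{2},

and the noise terms in the preceding two lines are absorbed in the same way; the second condition guarantees $4KS\eta^{t}\tau\sum_{k=1}^{K}\alpha_{k}^{2}\leq\frac{1}{2}$, which yields the final line of (26).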

Rearranging the above, we have

\eta^{t}\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}\leq\frac{4}{\tau}\left(f\left(\theta^{t}\right)-\mathbb{E}\left[f\left(\theta^{t+1}\right)\right]\right)+16KS\left(\eta^{t}\right)^{2}\tau\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right). (27)

Taking expectation and averaging over all tt, we have

\frac{1}{T}\sum_{t=0}^{T-1}\eta^{t}\mathbb{E}\left[\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}\right]\leq\frac{4}{T\tau}\left(f\left(\theta^{0}\right)-f^{\ast}\right)+\frac{16KS\tau}{T}\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right)\sum_{t=0}^{T-1}\left(\eta^{t}\right)^{2}. (28)

This completes the proof of Proposition 1. ∎
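For instance, with a constant step size $\eta^{t}=\eta$ satisfying the conditions above, dividing both sides of (28) by $\eta$ gives

\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla_{\theta}f\left(\theta^{t}\right)\right\|^{2}\right]\leq\frac{4}{\eta T\tau}\left(f\left(\theta^{0}\right)-f^{\ast}\right)+16KS\tau\eta\sum_{k=1}^{K}\alpha_{k}^{2}\left(\sigma_{k}^{2}+\epsilon^{2}\right),

so choosing $\eta=\mathcal{O}(1/\sqrt{T})$ makes the average squared gradient norm vanish at rate $\mathcal{O}(1/\sqrt{T})$; note that the bound holds for any cut layer selection $L_{c}$ and does not depend on $L_{c}$.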