IP-FL: Incentivized and Personalized Federated Learning

Ahmad Faraz Khan¹ ([email protected]), Xinran Wang², Qi Le², Zain ul Abdeen¹, Azal Ahmad Khan², Haider Ali¹, Ming Jin¹, Jie Ding², Ali R. Butt¹, Ali Anwar²

Abstract

Existing incentive solutions for traditional Federated Learning (FL) focus on individual contributions to a single global objective, neglecting the nuances of clustered personalization with multiple cluster-level models and the non-monetary incentives such as personalized model appeal for clients. In this paper, we first propose to treat incentivization and personalization as interrelated challenges and solve them with an incentive mechanism that fosters personalized learning. Additionally, current methods depend on an aggregator for client clustering, which is limited by a lack of access to clients’ confidential information due to privacy constraints, leading to inaccurate clustering. To overcome this, we propose direct client involvement, allowing clients to indicate their cluster membership preferences based on data distribution and incentive-driven feedback. Our approach enhances the personalized model appeal for self-aware clients with high-quality data leading to their active and consistent participation. Our evaluation demonstrates significant improvements in test accuracy (8–45%), personalized model appeal (3–38%), and participation rates (31–100%) over existing FL models, including those addressing data heterogeneity and personalization.

¹Department of Computer Science, Virginia Tech, USA

²Department of Computer Science, University of Minnesota, USA

1 Introduction

Training high-quality models using traditional distributed machine learning requires massive data transfer from the data sources to a central location, which raises various communication, computation, and privacy challenges. In response, Federated Learning (FL) [28, 38] has emerged as a solution to train models at the source, reducing privacy issues and addressing the need for high-quality models. However, the success of FL relies on resolving various new challenges related to statistical heterogeneity [34, 37, 61, 58], scheduling [8, 30, 19], and incentive distribution [12, 51, 47]. To overcome data heterogeneity challenges, personalized FL (pFL) has emerged as an effective solution to generate separate yet related models [29, 49, 9, 11, 31]. Among pFL techniques, similarity-based approaches that use clustering of clients at the aggregator have gained popularity [36, 15, 45, 53, 57].

However, existing pFL solutions do not include any incentive mechanism, which is crucial in FL to motivate participants to contribute their data and computation resources. Existing incentive mechanisms [12, 20, 23] for traditional FL cannot be applied to pFL techniques because they only consider the performance contribution of clients towards training a single objective. In contrast, clients in pFL contribute towards multiple objectives simultaneously [33, 15, 52, 61, 37, 45]. Furthermore, traditional incentive solutions only provide monetary benefits and do not consider increasing personalized models’ appeal as an incentive for encouraging active and reliable participation of clients. Without incentives, participants may provide low-quality data [12, 23, 47] or opt out of participation [59] (by “opt-out” we mean that clients voluntarily leave FL due to the lack of incentivization), leading to poorly performing pFL models [51, 23]. Collaboration fairness [17, 46] can also be ensured by appropriately rewarding contributions and accounting for data heterogeneity [62, 47].

In addition, since existing pFL techniques assume voluntary and consistent participation from clients, the aggregator controls the client selection and training with insufficient knowledge of clients’ training capacity, availability, frequency of new incoming data, clustering preferences, and performance requirements from the trained personalized models. These factors can directly influence the motivation of self-conscious clients to participate consistently. In this paper, we show that these factors cause frequent opt-outs from uninterested clients due to uninformed clustering decisions by the server and low personalized model appeal (PMA), a new metric that we propose, akin to global model appeal [10], to measure the appeal of personalized models. These opt-outs lead to reduced pFL performance. We also show that solving personalization and incentivization as interrelated challenges yields better outcomes for pFL than solving them as separate problems. Incentives can establish a feedback mechanism by providing a cost-benefit analysis guiding clients to make informed decisions in the pFL process. However, this requires new paradigms for clustered pFL using data distribution information available to clients via their preferences and designing incentive mechanisms for increasing pFL appeal to reduce client opt-outs.

In this paper, we propose IP-FL that combines clustering-based pFL with token-based incentivization. Unlike prior works that control clustering from the server side, IP-FL allows clients to estimate the importance of each cluster and send their preferences for joining them to the aggregator as bids. To identify a cluster’s importance for a client we use the importance weight of the cluster model as defined by FedSoft [45]. Clients also use the importance weights for weighted local aggregation during single-shot personalization. This client-driven clustering approach results in accurate clustering because clients can attain a global perspective from their local dataset which is only accessible to them and the importance weights information of each cluster. This allows them to make informed decisions that the server cannot make, resulting in improved PMA and reduced opt-outs. To incentivize clients for consistent participation, IP-FL motivates clients to join clusters with the clients that are most similar to them, maximizing their contribution to the cluster and, in turn, their rewards. Good quality cluster-level models then produce more appealing personalized models for each client. The incentive mechanism treats clients as both providers and consumers. As a consumer, the client tries to attain a certain level of personalized model appeal, so it pays the provider to spend resources to participate in training for the said model in each round. Whereas as a provider, the client earns a profit based on its marginal contribution to training the cluster models. The marginal contribution is calculated with a Shapley Value approximation due to the large computational overhead of the original algorithm [55, 35].

Contributions.

In this work, we propose IP-FL, an incentive-driven pFL method. The key contributions of IP-FL are as follows. First, it integrates client preferences for personalization and provides a feedback mechanism via contribution-based incentives, which leads to accurate clustering choices and, in turn, improved personalization. Second, IP-FL introduces novel incentives, such as improved personalized model appeal for clients, to prevent opt-outs. Third, IP-FL has the added advantage of creating personalized models for unseen clients with unknown data distributions that perform similarly to those of seen clients without requiring additional training. Lastly, we show the efficacy of IP-FL with a theoretical analysis and experimental verification, comparing it against other pFL solutions. In particular, we observe an 8–45% test accuracy improvement of the cluster models, a 3–38% improvement in personalized model appeal, and a 31–100% increase in the participation rate compared to a wide range of FL and pFL solutions.

2 Related Work

Cluster-based pFL: Works like FedSoft [45], FedGroup [15], and [52] employ clustering techniques in pFL. FedSoft performs soft clustering by matching client data distributions, whereas FedGroup evaluates client gradient similarities through the Euclidean distance of decomposed cosine similarity. [52] determines personalization-generalization trade-offs via a bi-level optimization problem. Despite these, issues like clustering overhead and client distribution overlap restrictions exist. Other noteworthy models are IFCA [18], focusing on loss-based clustering, and [36], presenting three personalization strategies.

Other pFL models: Meta-learning techniques for rapid personalized model training are found in works like Per-FedAvg [16] and regularization models [21, 22]. Multi-task learning models include Ditto [33], FedALA [58], and pFedMe [13]. Additionally, FedFomo [61] introduces adaptive local aggregation for personalization, while FedProx [34] aims to stabilize FL. However, many of these models face challenges in sustaining long-term client engagement and require additional training or restructuring for new clients.

Incentivized FL: FAIR [12] and FedFAIM [47] focus on quality and fairness-based incentives, respectively. A reputation-based reverse auction method is presented in [59], and a utility-centric approach in [23]. Yet, these methods do not integrate seamlessly with all pFL models.

Why can existing incentive mechanisms not be applied directly to pFL frameworks? Standard FL incentives targeting a single global goal [12, 59, 47, 23] may not align with the multifaceted objectives in pFL, such as cluster-based [15, 52, 45] or multi-task models [33, 22, 13]. IP-FL employs clustering for pFL, adjusting cluster memberships every $R$ rounds. It distinguishes itself by establishing cluster boundaries while enhancing shared learning via multiple participation at the client level. Moreover, IP-FL promotes consistent client participation using PMA and an incentive mechanism based on Individual Rationality (IR) from game theory [12, 60].

Figure 1: IP-FL design

3 Problem Formulation

In pFL, each client’s natural goal is to develop a model that maximizes its test accuracy. Different clients may have varying thresholds for the minimum accuracy gain required to justify their participation in pFL. We denote this self-defined threshold as $\rho_{i}$ , where $i\in[N]$ . The threshold $\rho_{i}$ is defined as the test accuracy that client $i$ would achieve if it used the conventional FedAvg approach. Therefore, $\mathrm{PMA}_{i}$ represents the gain in performance from pFL compared to vanilla FL using FedAvg for a client $i$ among $N$ clients. PMA is analogous to global model appeal (GMA) from [10]; however, our analysis and experiments demonstrate that creating a single global model may not align with the interests of all clients. We formalize the concept of PMA and opt-outs as follows:

\mathrm{PMA}_{i}=f_{i}(w_{k})-\rho_{i}\mid i\in[N],k\in[K]

(1)

\textrm{opt-outs}=\sum_{i=1}^{N}H(\mathrm{PMA}_{i}),\quad H(x)=\begin{cases}1&\text{if }x\leq 0\\ 0&\text{if }x>0\end{cases}

(2)

where $f_{i}(w_{k})$ is the test accuracy achieved by pFL for client $i$ with model parameters $w_{k}$ , and $H$ flags a client whose PMA is non-positive, i.e., a client that gains nothing from staying in pFL.
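As a minimal sketch of how Equation 1 and the opt-out count could be evaluated, the snippet below assumes that a client opts out when its PMA is non-positive, which is the reading consistent with Remark 5. The function names and the accuracy values are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def pma(pfl_accuracy: float, fedavg_accuracy: float) -> float:
    """Eq. (1): accuracy gain of the personalized model over the FedAvg baseline rho_i."""
    return pfl_accuracy - fedavg_accuracy


def count_opt_outs(pfl_acc: Sequence[float], fedavg_acc: Sequence[float]) -> int:
    """Count clients with non-positive PMA (the opt-out rule assumed in this sketch)."""
    return sum(1 for f, rho in zip(pfl_acc, fedavg_acc) if pma(f, rho) <= 0)


# Illustrative accuracies for four hypothetical clients.
pfl_acc = [0.64, 0.58, 0.49, 0.71]
fedavg_acc = [0.50, 0.60, 0.49, 0.55]
gains = [pma(f, rho) for f, rho in zip(pfl_acc, fedavg_acc)]
opt_outs = count_opt_outs(pfl_acc, fedavg_acc)
```

In this toy setting, the second client loses accuracy and the third gains nothing, so both would be counted as opt-outs.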

However, pFL faces the challenge of incentive distribution, as this approach necessitates considering the multidimensional nature of personalization. Furthermore, data-shift and feature-shift in client data over time complicate clients’ decisions on whether continued participation is beneficial, leading to potential opt-outs. Therefore, we aim to develop a pFL framework that provides specialized incentives for personalized training, thereby reducing opt-outs and increasing $\mathrm{PMA}$ to ensure successful collaboration in pFL.

Theoretical foundation

We study the following particular case to develop insights. Suppose there are $m$ clients in total, each observing a set of independent Gaussian observations $z_{i,j}\sim\mathcal{N}(\mu_{i},\sigma^{2}),j=1,\ldots,n_{i}$ , with a personalized task of estimating its unknown mean $\mu_{i}\in\mathbb{R}$ . The quality of a learning result, denoted by $\hat{\mu}$ , will be assessed by the mean squared error $\mathbb{E}_{i}(\hat{\mu}-\mu_{i})^{2}$ , where the expectation $\mathbb{E}_{i}$ is taken with respect to the distribution of client $i$ . It is conceivable that if the clients’ underlying parameters $\mu_{i}$ are arbitrary, personalized FL may not improve the local learning result. To highlight the potential benefit of cluster-based modeling, we suppose that the $m$ clients can be partitioned into two or more subsets, a common assumption in related works [33, 18, 45]. For simplicity, we consider two subsets: one with $m_{1}$ clients, say $T_{1}=\{1,\ldots,m_{1}\}$ , and the other with $m_{2}$ clients, say $T_{2}=\{m_{1}+1,\ldots,m\}$ , whose underlying parameters are randomly generated as follows:

\displaystyle\mu_{i}\sim\mathcal{N}(\beta_{1},\tau^{2})\mid\quad i\in T_{1},\hskip 10.00002pt\mu_{i}\sim\mathcal{N}(\beta_{2},\tau^{2})\mid\quad i\in T_{2}.

(3)

Here, $\beta_{1}$ and $\beta_{2}$ can be treated as the root cause of two underlying clusters. We will study how the values of sample size $n_{i}$ , data variation $\sigma$ , within-cluster similarity as quantified by $\tau$ , and cross-cluster similarity as quantified by $|\beta_{1}-\beta_{2}|$ will influence the gain of a client in personalized learning. To simplify the discussion, we will assess the learning quality (based on the mean squared error) of any particular client $i$ in the following three procedures:

Local training: Client $i$ only performs local learning by minimizing the local loss $L_{i}(\mu)=\sum_{j=1}^{n_{i}}(\mu-z_{i,j})^{2},$ and obtains $\hat{\mu}_{i}=n_{i}^{-1}\sum_{j=1}^{n_{i}}z_{i,j}$ . Thus, the corresponding error variance is

\displaystyle e(\hat{\mu}_{i})=\mathbb{E}_{i}(\hat{\mu}_{i}-\mu_{i})^{2}=\frac{\sigma^{2}}{n_{i}}.

(4)

Federated training: Suppose the FL converges to the global minimum of the loss, $\sum_{i=1}^{m}\frac{n_{i}}{n}L_{i}(\mu),\quad n\overset{\Delta}{=}\sum_{i=1}^{m}n_{i},$ which can be calculated to be $\hat{\mu}_{\textrm{FL}}=\sum_{i=1}^{m}\frac{n_{i}}{n}\hat{\mu}_{i}$ . Consider any particular client $i$ . Without loss of generality, suppose it belongs to cluster 1, namely $i\in T_{1}$ . From the client $i$ ’s angle, conditional on its local $\mu_{i}$ and assuming a flat prior on $\beta_{1}$ and $\beta_{2}$ , client $j$ ’s $\mu_{j}$ follows $\mu_{j}\mid\mu_{i}\sim\mathcal{N}(\mu_{i},2\tau^{2})$ for $j\in T_{1}$ and $j\neq i$ , and $\mu_{j}\mid\mu_{i}\sim\mathcal{N}(\mu_{i}+\beta_{2}-\beta_{1},2\tau^{2})$ for $j\in T_{2}$ . Then, the corresponding error is

\displaystyle e(\hat{\mu}_{\textrm{FL}})=\mathbb{E}_{i}(\hat{\mu}_{\textrm{FL}}-\mu_{i})^{2}=\biggl{\{}\sum_{j\in T_{2}}\frac{n_{j}}{n}(\beta_{2}-\beta_{1})\biggr{\}}^{2}+\sum_{j=1,\ldots,m,j\neq i}\biggl{(}\frac{n_{j}}{n}\biggr{)}^{2}\biggl{(}\frac{\sigma^{2}}{n_{j}}+2\tau^{2}\biggr{)}+\biggl{(}\frac{n_{i}}{n}\biggr{)}^{2}\frac{\sigma^{2}}{n_{i}}.

It can be seen that compared with (4), the above FL error can be non-vanishing if $\sum_{j\in T_{2}}\frac{n_{j}}{n}(\beta_{2}-\beta_{1})$ is away from zero, even if sample sizes go to infinity. In other words, in the presence of a significant difference between the two clusters, FL may not bring additional gain compared with local learning.

Cluster-based personalized FL: Suppose our algorithm allows both clusters to be correctly identified upon convergence. Consider any particular client $i$ . Suppose it belongs to Cluster 1 and will use a weighted average of cluster-specific models. Specifically, the Cluster 1 model will be the minimizer of the loss $\sum_{j\in T_{1}}\frac{n_{j}}{n_{\textrm{T1}}}L_{j}(\mu),\quad n_{\textrm{T1}}\overset{\Delta}{=}\sum_{j\in T_{1}}n_{j},$ which can be calculated to be $\hat{\mu}_{\textrm{T1}}=\sum_{j\in T_{1}}\frac{n_{j}}{n_{\textrm{T1}}}\hat{\mu}_{j}$ . By an argument similar to the derivation of the FL error above, we can calculate

\displaystyle e(\hat{\mu}_{\textrm{T1}})=\sum_{j\in T_{1},j\neq i}\biggl{(}\frac{n_{j}}{n_{\textrm{T1}}}\biggr{)}^{2}\biggl{(}\frac{\sigma^{2}}{n_{j}}+2\tau^{2}\biggr{)}+\biggl{(}\frac{n_{i}}{n_{\textrm{T1}}}\biggr{)}^{2}\frac{\sigma^{2}}{n_{i}}.

(5)

The above value can be smaller than that in (4). To see this, let us suppose the sample sizes $n_{i}$ ’s are all equal to, say $n_{0}$ , for simplicity. Then, we have

e(\hat{\mu}_{\textrm{T1}})=\frac{m_{1}-1}{m_{1}^{2}}\biggl(\frac{\sigma^{2}}{n_{0}}+2\tau^{2}\biggr)+\frac{1}{m_{1}^{2}}\frac{\sigma^{2}}{n_{0}}=\frac{1}{m_{1}}\frac{\sigma^{2}}{n_{0}}+\frac{m_{1}-1}{m_{1}^{2}}2\tau^{2},

(6)

which, for $m_{1}>1$ , is smaller than (4) if and only if $\tau^{2}<\frac{m_{1}\sigma^{2}}{2n_{0}}$ .
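The comparison between local training, federated training, and cluster-based training can be checked numerically. The sketch below evaluates the three closed-form errors under the equal-sample-size assumption $n_{i}=n_{0}$ , so that each aggregation weight is $1/m$ (FL) or $1/m_{1}$ (cluster). The function names and parameter values are hypothetical choices made for illustration.

```python
def local_error(sigma2: float, n0: int) -> float:
    """Eq. (4): error of purely local training."""
    return sigma2 / n0


def fl_error(sigma2: float, tau2: float, n0: int, m1: int, m2: int, beta_gap: float) -> float:
    """FL error for a client in cluster 1, with equal sample sizes (weights 1/m)."""
    m = m1 + m2
    bias = (m2 / m) * beta_gap  # cross-cluster bias term, does not vanish as n0 grows
    return bias ** 2 + (m - 1) / m ** 2 * (sigma2 / n0 + 2 * tau2) + sigma2 / (m ** 2 * n0)


def cluster_error(sigma2: float, tau2: float, n0: int, m1: int) -> float:
    """Eq. (6): error of the cluster-1 model, with equal sample sizes."""
    return sigma2 / (m1 * n0) + (m1 - 1) / m1 ** 2 * 2 * tau2


def cluster_beats_local(sigma2: float, tau2: float, n0: int, m1: int) -> bool:
    """Condition stated after Eq. (6): tau^2 < m1 * sigma^2 / (2 * n0)."""
    return tau2 < m1 * sigma2 / (2 * n0)


# Hypothetical setting: two clusters of 5 clients, well-separated cluster means.
e_local = local_error(1.0, 10)
e_fl = fl_error(1.0, 0.01, 10, 5, 5, beta_gap=2.0)
e_cluster = cluster_error(1.0, 0.01, 10, 5)
```

With these values the cluster model has the smallest error and plain FL the largest, since the squared cross-cluster bias dominates. Raising $\tau^{2}$ above the threshold reverses the comparison between the cluster model and local training.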

We derive the following intuitions from this analysis: R1. If the within-cluster bias is relatively small, the number of cluster-specific clients is large, and data noise is large, a client will have personalized gain from collaborating with others in the same cluster. R2. IP-FL’s incentive algorithm rewards accuracy improvement reflected in PMA, which directly correlates with reducing within-cluster bias as per Equation 6. R3. By association, the incentive algorithm motivates clients to join similar clusters which increases cluster homogeneity and reduces the within-cluster bias. We later show the impact of change in performance with an ablation study of IP-FL incentive.

4 Proposed Methodology

In this section, we introduce IP-FL, consisting of three key modules: profiler, token manager, and the scheduler (Figure 1). The profiler computes and maintains client contributions using Shapley Values approximation (lines 24-27 in Algorithm 1), aiding cluster formation. The token manager handles auctions, rewards, and reimbursement transactions (lines 13 & 14). The scheduler selects clients based on bids and contributions, clustering them for improved homogeneity (lines 20 & 27-29). Clients calculate importance weights from cluster models, submit preference bids to the token manager for cluster participation (lines 23-28 in Algorithm 2), and generate personalized models (line 29). Clients seek to maximize profits through IR by choosing clusters where they can contribute the most for maximum rewards.

Algorithm 1 IP-FL (Server)

Input: $R$ : Rounds, $K$ : Number of clusters, $M_{k}$ : Cluster-level model of cluster $k\in K$ , $N$ : Number of clients, $\zeta_{a}$ : Available Clients, $N_{p}$ : Number of clients selected on performance basis, $N_{r}$ : Number of clients selected randomly, $\zeta_{k}$ : Clients selected for training in cluster $k\in K$ , sort():Timsort [40]

for each round $r\in R$ do
    $\zeta_{k}=SelectClients(r)$ for each cluster $k\in K$
    for cluster $k\in K$ do
        Server deploys model $M_{k}$ for training to the clients in $\zeta_{k}$
        Token manager collects bid payments from all willing clients via Eqn. 9
        Token manager updates available tokens for round $r$ via Eqn. 10
        $U_{k}\leftarrow$ model updates received from clients in $\zeta_{k}$
        $M_{k}=FedAvg(U_{k})$ [38]

Function SelectClients( $r$ )
    if $r=0$ then
        for $k=1$ to $K$ do
            $\zeta_{k}^{*}\leftarrow$ Scheduler selects random clients from $\zeta_{a}$
        return $\zeta_{k}^{*}$
    else if $r\geq 1$ then
        for $i=1$ to $N$ do
            ${P_{B}}\leftarrow ClientPreferences(M_{1},...,M_{k})\mid\forall k\in[K]$ // from Algo. 2
        The server calculates marginal contributions $\psi_{{k}{i}}$ of each client within its cluster via Shapley Values approximation in Appendix H Algo. 3
        $S_{c}=sort(P_{B},\psi_{{k}{i}})$ // Profiler sorts clients by marginal contributions and preference bids
        for $k=1$ to $K$ do
            $\zeta_{k}^{*}\leftarrow$ $N_{p}$ clients selected from $S_{c}$ and $N_{r}$ clients randomly from $\zeta_{a}$ by Scheduler
        return $\zeta_{k}^{*}$

4.1 Profiler

At the pFL training’s onset, the scheduler forms initial clusters by assigning clients randomly. At each round, clients train the cluster-level model on local data and compute the importance weight for each aggregated cluster model $M_{k}$ via Equation 7, where $\upsilon_{{i}{k}}$ is the normalized sum of correctly predicted data points on local dataset $D_{i}$ . These weights create a personalized model through weighted aggregation via Equation 8. Here, $P_{{i}{k}}$ is client $i$ ’s personalized model and $\omega_{k}$ is cluster $k$ ’s weight vector. Through client-centric clustering and participation, clients produce personalized models offline to suit dynamic needs unknown to the server. Clients can also decide on training participation based on budget, past rewards, and importance weights. The clients then bid for the desired cluster in the next training round.

\upsilon_{{i}{k}}=n_{{i}{k}}/n_{k}\in{[0,1]}\mid k\in[K]

(7)

P_{{i}{k}}=\sum_{k=1}^{K}\upsilon_{{i}{k}}\times(\omega_{k})

(8)
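Equations 7 and 8 can be illustrated with a small sketch. Equation 7 leaves the exact normalizer open to interpretation, so the code below assumes that the per-cluster counts of correctly predicted local points are normalized across clusters, which makes the weights sum to one. The toy threshold classifiers, the flat parameter lists, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], int]  # hypothetical cluster model: feature -> label


def importance_weights(models: Sequence[Predictor],
                       data: Sequence[Tuple[float, int]]) -> List[float]:
    """Eq. (7) under the stated assumption: normalized counts of correct predictions."""
    counts = [sum(1 for x, y in data if model(x) == y) for model in models]
    total = sum(counts)
    if total == 0:  # no model is correct anywhere: fall back to uniform weights
        return [1.0 / len(models)] * len(models)
    return [c / total for c in counts]


def personalized_model(weights: Sequence[float],
                       cluster_params: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (8): importance-weighted sum of the cluster-level parameter vectors."""
    dim = len(cluster_params[0])
    return [sum(w * params[j] for w, params in zip(weights, cluster_params))
            for j in range(dim)]


# Two toy cluster models and one client's local labeled data.
models = [lambda x: int(x > 0.0), lambda x: int(x > 2.0)]
local_data = [(-1.0, 0), (1.0, 1), (3.0, 1), (1.5, 1)]
weights = importance_weights(models, local_data)
personal = personalized_model(weights, [[0.0, 1.0], [2.0, 1.0]])
```

Here the first cluster model is correct on all four points and the second on two, so the client weights the first cluster twice as heavily when building its personalized parameters.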

Before the start of the next training round, the profiler calculates client marginal contributions using the Shapley Values approximation algorithm given in Appendix H, providing the scheduler with insights into the data quality of each client.
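To give a sense of what a Shapley Value approximation involves, the sketch below uses generic permutation sampling (Monte Carlo). This is a standard estimator swapped in for illustration and is not necessarily the algorithm of Appendix H. The `utility` callback, which would score a coalition of clients (for example by the accuracy of a model aggregated from their updates), is a hypothetical placeholder.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_permutation_estimate(clients: Sequence[str],
                                 utility: Callable[[FrozenSet[str]], float],
                                 num_permutations: int = 200,
                                 seed: int = 0) -> Dict[str, float]:
    """Average each client's marginal contribution over random join orders."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in clients}
    for _ in range(num_permutations):
        order = list(clients)
        rng.shuffle(order)
        coalition = set()
        previous = utility(frozenset(coalition))
        for c in order:
            coalition.add(c)
            current = utility(frozenset(coalition))
            totals[c] += current - previous  # marginal gain of adding client c
            previous = current
    return {c: total / num_permutations for c, total in totals.items()}


# Toy additive utility: each client's Shapley value equals its own quality score.
quality = {"a": 0.5, "b": 0.25, "c": 0.125}
estimates = shapley_permutation_estimate(
    list(quality), lambda s: sum(quality[c] for c in s))
```

Each sampled permutation costs one utility evaluation per client, compared with the exponential number of coalitions required by the exact definition.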

4.2 Token Manager

The token manager coordinates client transactions, operating similarly to a bank executing reward transactions from consumer to provider clients and any reimbursement transactions from providers back to consumer clients. At the beginning of each training round, it conducts an auction for each cluster. Each client $i$ interested in a cluster places its bid using tokens. These bids, represented as $\tau_{p}$ , are deducted from the clients’ total tokens, $\tau_{i}$ , as detailed in Equation 9:

\tau_{i}=\tau_{i}-\tau_{p}

(9)

Collected tokens are then added to the Token Manager’s overall pool $\tau_{{a}{r}}$ via Equation 10. $N_{p}$ and $N_{r}$ represent the number of clients selected based on performance and random selection respectively.

\tau_{{a}{r}}=\tau_{{a}{r}}+(N_{p}+N_{r})\times\tau_{p}\mid r\in[1,R]

(10)
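The bookkeeping in Equations 9 and 10 can be sketched as follows. The sketch assumes that only the $N_{p}+N_{r}$ selected clients are charged the bid $\tau_{p}$ , which is what makes the pool grow by $(N_{p}+N_{r})\times\tau_{p}$ . The function name and the dictionary-based ledger are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def collect_bids(balances: Dict[str, float],
                 selected: Sequence[str],
                 bid: float,
                 pool: float = 0.0) -> Tuple[Dict[str, float], float]:
    """Deduct the bid from each selected client (Eq. 9) and add it to the pool (Eq. 10)."""
    updated = dict(balances)  # leave the caller's ledger untouched
    for client in selected:
        if updated[client] < bid:
            raise ValueError(f"client {client} cannot afford the bid")
        updated[client] -= bid
        pool += bid
    return updated, pool


balances = {"a": 10.0, "b": 8.0, "c": 5.0}
new_balances, pool = collect_bids(balances, selected=["a", "c"], bid=2.0)
```

Tokens are conserved in this step: whatever leaves the client balances appears in the pool.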

The token manager’s responsibilities also encompass rewards and penalties as reimbursements. It calculates reimbursements using the utility function, which quantifies the average accuracy improvement or decrement of the cluster model $M_{k}$ over the maximum achieved accuracy in past rounds on the local data of clients in cluster $k$ . Then it reimburses consumer clients and penalizes any degradation in the performance of the provider clients to ensure contribution fairness on a finer granularity for consumer and provider clients. This utility and the subsequent reimbursement are defined in Equations 11 and 12:

U_{til}=\frac{\eta\times\bigl(\gamma-\min\bigl(\gamma,\max\bigl(0.0,\frac{Acc_{{k}{r}}-Acc_{{k}{max}}}{Acc_{{k}{max}}}\bigr)\bigr)\bigr)}{\gamma}\mid\eta\in[0,1],\gamma\in[0,1]

(11)

\tau_{i}=\tau_{i}-\tau_{{a}{r}}\times U_{til}\mid U_{til}\in[0,\gamma],\forall i\in[N],\forall r\in[1,R]

(12)
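A direct transcription of the utility in Equation 11 is sketched below. It assumes $\gamma>0$ , since $\gamma$ appears in the denominator, and the function and argument names are hypothetical.

```python
def utility(acc_round: float, acc_max: float, eta: float, gamma: float) -> float:
    """Eq. (11): utility from the relative accuracy change of a cluster model."""
    if not (0.0 < gamma <= 1.0 and 0.0 <= eta <= 1.0):
        raise ValueError("expected 0 < gamma <= 1 and 0 <= eta <= 1")
    relative_gain = (acc_round - acc_max) / acc_max
    clipped = min(gamma, max(0.0, relative_gain))  # clip the gain into [0, gamma]
    return eta * (gamma - clipped) / gamma


u_no_gain = utility(acc_round=0.50, acc_max=0.50, eta=0.5, gamma=0.2)
u_partial = utility(acc_round=0.55, acc_max=0.50, eta=0.5, gamma=0.2)
u_full = utility(acc_round=0.80, acc_max=0.50, eta=0.5, gamma=0.2)
```

As written, the utility is largest when the cluster model does not improve on its past best accuracy and falls to zero once the relative improvement reaches $\gamma$ , so the token adjustment in Equation 12 shrinks as the model improves.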

$Acc_{{k}{r}}$ is the cluster-level model accuracy in round $r$ , $Acc_{{k}{max}}$ is the maximum cluster-level accuracy achieved until this round, $\eta$ and $\gamma$ represent token limitations and accuracy improvement thresholds, and $\tau_{{a}{r}}$ is the tokens collected from clients in round $r$ . Our approach is inspired by [20], but it is adapted to avoid the impractical assumption of having an IID dataset at the aggregator that perfectly corresponds to the client data distribution within the cluster. Instead, we rely on the local client dataset. Post-reimbursement, the token manager distributes rewards. Provider clients are ranked based on their contributions and participation, and rewards are allocated via Equation 13:

\tau_{i}=\tau_{i}+\text{sort}(\psi_{{k}{i}},\Omega_{{k}{i}})\times\frac{\tau_{{a}{r}}}{N_{r}\times\frac{(N_{r}+1)}{2}}\mid\forall k\in[K],\forall i\in[N],\forall r\in[R]

(13)

In Equation 13, $\psi_{{k}{i}}$ represents client contributions, and $\Omega_{{k}{i}}$ tracks client participation. $\beta$ normalizes token distribution based on provider ranks $\alpha$ in Equation 13. $\tau_{i}$ are client-owned tokens, and $\tau_{{a}{r}}$ are tokens available for distribution by the Token Manager. Through reimbursements to consumers and payments to providers, IP-FL incentivizes personalized learning, improving PMA, reducing opt-outs, and ensuring fair client incentives based on contributions to enhance personalized learning outcomes.
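One way to read Equation 13 is as a rank-proportional split of the token pool. The sketch below makes two assumptions that go beyond the text: the sort term returns each provider's ascending rank (1 for the lowest contributor), with participation breaking ties, and the normalizer counts the $n$ ranked providers, so that the denominator $n(n+1)/2$ is the sum of all ranks. All names are hypothetical.

```python
from typing import Dict


def distribute_rewards(contributions: Dict[str, float],
                       participation: Dict[str, int],
                       pool: float) -> Dict[str, float]:
    """Split the pool in proportion to each provider's rank (assumed reading of Eq. 13)."""
    ranked = sorted(contributions,
                    key=lambda c: (contributions[c], participation[c]))
    n = len(ranked)
    unit = pool / (n * (n + 1) / 2)  # tokens paid per rank position
    return {client: rank * unit for rank, client in enumerate(ranked, start=1)}


rewards = distribute_rewards(
    contributions={"a": 0.3, "b": 0.1, "c": 0.2},
    participation={"a": 5, "b": 5, "c": 2},
    pool=60.0,
)
```

Under this reading the rewards always sum to the pool, and a higher-ranked provider never receives less than a lower-ranked one.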

Algorithm 2 IP-FL (Client)

Input: $T_{h}$ : Importance weight threshold, $K$ : Number of clusters, $M_{k}$ : Cluster-level model of cluster $k$ , $D$ : Local dataset of client

Function ClientPreferences( $M_{1},...,M_{k}$ )
    for each cluster $k\in K$ do
        for each data point $d\in D$ do
            The client $i$ computes the importance weight $\upsilon_{ik}$ of model $M_{k}$ for each data point $d$ via Eqn. 7
        if $\upsilon_{ik}>T_{h}$ then
            Client includes cluster $k$ in their preference list ${P_{B}}$
    The client generates personalized model $P_{{i}{k}}$ via Eqn. 8
    return ${P_{B}}$

4.3 Scheduler

The scheduler employs the $SelectClients(r)$ function detailed in Algorithm 1 to choose clients for each round $r$ . It collects preference bids ${P_{B}}$ from the token manager and marginal contributions $\psi_{{k}{i}}$ from the profiler for every client $i\in N$ in cluster $k\in K$ . Here, $N$ and $K$ denote the total counts of clients and clusters, respectively. With this data, the scheduler groups clients by similar preference bids, arranges them by their marginal contributions, and subsequently picks $N_{p}$ from this ordered list and $N_{r}$ at random. These quantities, $N_{p}$ and $N_{r}$ , are adjustable parameters. Following the methodology in [30], we prioritize exploration in early rounds, initializing $N_{r}$ to $20\%$ of all clients and decrementing it based on the count of remaining unexplored clients. The reduction strategy, which caps the minimum $N_{r}$ at $5\%$ , guarantees an effective exploration technique. Random selection of $N_{r}$ clients mitigates accuracy bias, echoing techniques from prior studies [38, 7, 20, 27]. By clustering clients with similar preferences, the scheduler reduces within-cluster bias and enhances within-cluster uniformity, culminating in a cluster model that truly mirrors its constituent clients. Section 3 delves into the significance of this in boosting the PMA.
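The selection step for a single cluster can be sketched as below: the top $N_{p}$ bidders by marginal contribution are taken first, and $N_{r}$ further clients are drawn at random from the remaining available clients for exploration. The function signature and the treatment of clients without a recorded contribution (scored as zero) are assumptions made for this sketch.

```python
import random
from typing import Dict, List, Sequence


def select_clients(bidders: Sequence[str],
                   contributions: Dict[str, float],
                   available: Sequence[str],
                   n_p: int,
                   n_r: int,
                   rng: random.Random) -> List[str]:
    """Pick n_p bidders by contribution, then n_r random clients for exploration."""
    ranked = sorted(bidders, key=lambda c: contributions.get(c, 0.0), reverse=True)
    top = ranked[:n_p]
    remaining = [c for c in available if c not in top]
    explore = rng.sample(remaining, min(n_r, len(remaining)))
    return top + explore


selected = select_clients(
    bidders=["a", "b", "c", "d"],
    contributions={"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2},
    available=["a", "b", "c", "d", "e", "f"],
    n_p=2,
    n_r=2,
    rng=random.Random(0),
)
```

The random draw excludes clients already chosen on performance, so no client is selected twice for the same cluster in a round.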

5 Convergence Analysis of the IP-FL

This section explores the convergence properties of IP-FL. Beginning with foundational propositions, we establish the basis for cluster formation in IP-FL. Subsequent theorems then address the convergence of personalized and global models across various data distributions, showing that IP-FL reaches stable parameters. Proofs for all propositions and theorems are detailed in Appendix I.

Proposition 1.

Let $\{\mathcal{C}_{k}\}_{k=1}^{K}$ represent the set of clusters formed by the IP-FL algorithm, where each cluster $\mathcal{C}_{k}$ contains clients $\{i_{1},\ldots,i_{N}\}$ with their respective data distributions $\{D_{i_{1}},\ldots,D_{i_{N}}\}$ . Then the clustering converges, and within each cluster $\mathcal{C}_{k}$ the data distributions of the clients are statistically similar to each other up to a statistical threshold $\delta$ , so that the within-cluster bias is reduced.

Remark 2.

Proposition 1 establishes the fundamental concept of cluster formation within IP-FL. The result ensures that data distributions within each cluster are homogeneous. Theorem 3 builds directly on Proposition 1, detailing the convergence of personalized models within clusters, which underscores the algorithm’s adaptability and efficiency in handling diverse data distributions.

Theorem 3.

Let $\{\mathcal{C}_{k}\}_{k=1}^{K}$ be a set of clusters formed by the IP-FL algorithm, where each cluster’s data follows a Gaussian distribution. If the loss function $L(\theta)$ is convex for the model parameters $\theta$ , then the IP-FL algorithm converges to a set of stable parameters $\theta^{*}$ for the global model. Furthermore, once clustering convergence is established, each cluster converges to a set of stable parameters $\theta_{k}^{*}$ .

Theorem 4.

Under the assumptions of Proposition 1 and Theorem 3, for every $\epsilon>0$ there exists a finite number of training iterations $T$ such that for all $t\geq T$ , the performance metric $\mathrm{PMA}_{i}$ of each client $i$ is within $\epsilon$ of its optimum, i.e., $|\mathrm{PMA}^{(t)}_{i}-\mathrm{PMA}_{i}^{*}|<\epsilon$ , where $\mathrm{PMA}_{i}^{*}$ is the optimal value.

Remark 5.

Since $\mathrm{PMA}_{i}:=f_{i}(w_{k})-\rho_{i}$ , Theorem 4 ensures that $w_{k}$ is optimized over the course of training and that $f_{i}(w_{k})$ converges to the maximum achievable performance for client $i$ . This leads to $\mathrm{PMA}_{i}>0$ , which implies that the opt-out likelihood is minimized. Theorem 6 below demonstrates convergence with general convex and smooth loss functions, showing that IP-FL can reach a global minimizer of the weighted loss.

Theorem 6 (Convergence of IP-FL Algorithm).

Let $\{c_{i}\}_{i=1}^{N}$ be a set of clients participating in the IP-FL framework with local loss functions $\{L_{i}\}_{i=1}^{N}$ , where each $L_{i}$ is convex and $\beta$ -smooth. Suppose that the IP-FL algorithm updates the global model $M$ by a weighted aggregation of locally updated models using a personalized diminishing learning rate $\eta_{t}$ . If the series $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$ , then the sequence of global models $\{M_{t}\}$ generated by the IP-FL algorithm converges to a global minimizer $M^{*}$ of the weighted average loss function $L$ , defined as $L(M)=\sum_{i=1}^{N}w_{i}L_{i}(M)$ where $w_{i}$ is the weight corresponding to client $i$ ’s data contribution.
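The step-size conditions in Theorem 6 can be illustrated on a toy problem. This is a numerical illustration under simplifying assumptions, not the proof: each client has the convex, 1-smooth loss $L_{i}(M)=(M-c_{i})^{2}/2$ , all clients share the schedule $\eta_{t}=1/(t+1)$ (whose sum diverges while the sum of squares converges), and the aggregation weights sum to one, so the minimizer of the weighted loss is $\sum_{i}w_{i}c_{i}$ . The names and values below are hypothetical.

```python
from typing import Sequence


def run_weighted_gd(targets: Sequence[float],
                    weights: Sequence[float],
                    m0: float = 0.0,
                    rounds: int = 1000) -> float:
    """Weighted aggregation of one local gradient step per client per round."""
    m = m0
    for t in range(1, rounds + 1):
        eta = 1.0 / (t + 1)  # diminishing step: sum diverges, sum of squares converges
        # Each client takes one gradient step on L_i(M) = (M - c_i)^2 / 2.
        local_models = [m - eta * (m - c) for c in targets]
        m = sum(w * lm for w, lm in zip(weights, local_models))
    return m


targets = [1.0, 3.0, 8.0]
weights = [0.5, 0.3, 0.2]
minimizer = sum(w * c for w, c in zip(weights, targets))
m_short = run_weighted_gd(targets, weights, rounds=10)
m_long = run_weighted_gd(targets, weights, rounds=1000)
```

In this toy case the distance to the minimizer shrinks like $1/(t+1)$ , so running more rounds brings the aggregated model steadily closer to the weighted optimum.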

Table 1: Test accuracy (CIFAR10)

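| Data | 10:90 IP-FL c0 | 10:90 IP-FL c1 | 10:90 FedSoft c0 | 10:90 FedSoft c1 | 30:70 IP-FL c0 | 30:70 IP-FL c1 | 30:70 FedSoft c0 | 30:70 FedSoft c1 |
|---|---|---|---|---|---|---|---|---|
| $\theta_{0}$ | 63.7 | 41.3 | 48.9 | 49.5 | 58 | 57.7 | 48 | 48.4 |
| $\theta_{1}$ | 43.7 | 63.8 | 50.7 | 49.6 | 58.6 | 58.8 | 50 | 50 |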

Table 2: Clusters models accuracy (Synthetic CIFAR10)

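| Method | 10:90 c0 | 10:90 c1 | 30:70 c0 | 30:70 c1 | Linear c0 | Linear c1 | Random c0 | Random c1 |
|---|---|---|---|---|---|---|---|---|
| IP-FL | 62.7(0) | 71(1) | 53.3(0) | 61.9(1) | 61.7(0) | 59.4(1) | 66.9(0) | 58.4(1) |
| FedSoft | 32.5(0) | 38.6(1) | 20.3(0) | 23.6(0) | 34.42(1) | 49.6(1) | 21.6(1) | 33.12(1) |
| IFCA | 30.5(0) | 36.2(0) | 18.8(0) | 20.4(0) | 31.2(1) | 46.3(1) | 18.7(1) | 31.24(1) |
| FedEM | 29.7(0) | 37.6(0) | 17.3(0) | 20.6(0) | 30.6(1) | 45.8(1) | 18.4(1) | 30.4(1) |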

6 Experimental Study

6.1 Experimental Setup

We use NVIDIA RTX 3070 GPUs to evaluate IP-FL and other pFL methods across four datasets. Our CNN model (32x64x64 convolutional with 3136x128 linear layer parameters) was designed for efficient training on client devices with limited resources, aligning with Cross-Device FL settings [25].

CIFAR10 Data: We utilized the CIFAR10 dataset from FedSoft [45], comprising 32 × 32 × 3 images and 10 output classes. We replicated data heterogeneity conditions: 10:90, 30:70, linear, and random partitions. Data was divided into two clusters, $D_{A}$ and $D_{B}$ . In the 10:90 partition, 50 clients had $90\%$ training data from $D_{A}$ and $10\%$ from $D_{B}$ , while the other $50$ had $10\%$ training data from $D_{A}$ and $90\%$ from $D_{B}$ . The 30:70 partition followed the same scheme with a $30\%$ and $70\%$ split.

EMNIST Data: This dataset contained 28 × 28 images and 52 output classes (26 lowercase and 26 uppercase letters). We employed 10:90 and 30:70 data partitions, along with linear and random partitions. In the linear partition, client $k$ had $(0.5+k)\%$ training and testing data from $D_{A}$ and $(99.5-k)\%$ from $D_{B}$ . In the random partition, clients were assigned a mixture vector generated randomly based on $Uniform(0,1)$ . The EMNIST dataset was divided into $K$ clusters, ensuring no data overlap between them.

Synthetic CIFAR10: This synthetic dataset mirrored CIFAR10’s partitions but had different training and testing data distributions to simulate dynamic client data. In the 10:90 partition, 50 clients had $90\%$ training data with $10\%$ testing data from $D_{A}$ and vice versa. The rationale for separate training and testing data distributions is explained in Appendix D.2.4.
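The linear and random partitions can be expressed as per-client mixture shares over $D_{A}$ and $D_{B}$ . The sketch below assumes two clusters and client indices $k=0,\ldots,99$ ; for the random partition it assumes that a single $Uniform(0,1)$ draw gives the $D_{A}$ share and the remainder goes to $D_{B}$ . The function names are hypothetical.

```python
import random
from typing import Tuple


def linear_mixture(k: int) -> Tuple[float, float]:
    """Client k draws (0.5 + k)% of its data from D_A and (99.5 - k)% from D_B."""
    share_a = (0.5 + k) / 100.0
    return share_a, 1.0 - share_a


def random_mixture(rng: random.Random) -> Tuple[float, float]:
    """Assumed two-cluster reading: share of D_A ~ Uniform(0, 1), rest from D_B."""
    share_a = rng.random()
    return share_a, 1.0 - share_a


first = linear_mixture(0)    # client 0 is almost entirely D_B
last = linear_mixture(99)    # client 99 is almost entirely D_A
sampled = random_mixture(random.Random(0))
```

Across the 100 clients, the linear partition sweeps smoothly from nearly pure $D_{B}$ data to nearly pure $D_{A}$ data.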

6.2 Focus of Experimental Study

First, we compare the clustering ability of IP-FL with a recent clustering-based pFL algorithm [45]. Second, we compare IP-FL with other non-clustering pFL models using a simple test accuracy evaluation. Taking it one step further, we compare IP-FL with other clustering and non-clustering pFL models in terms of opt-out reduction and PMA maintenance. Lastly, we show that including client preferences while clustering yields better personalization results, because clients can make decisions based on knowledge that is not available to the aggregator server. We evaluate all algorithms using the datasets that were employed in their original publications, which ensures that we highlight the strengths of each algorithm rather than skewing the evaluation towards our methodology.

Table 3: Cluster model accuracy (CIFAR10)

	10:90				30:70
	$\theta_{0}$		$\theta_{1}$		$\theta_{0}$		$\theta_{1}$
	c0	c1	c0	c1	c0	c1	c0	c1
IP-FL	63.7	41.3	43.7	63.8	58	57.7	58.6	58.8
FedSoft	48.9	49.5	50.7	49.6	48	48.4	50	50
IFCA	46.4	44.3	47	45	46	44	47.4	46.5
FedEM	45.7	46.5	46.8	47.1	43.8	45.2	46.3	47.4

Table 4: Test accuracy of pFL methods (EMNIST)

Partitions	Ditto	FedProx	FedALA	PerFedAvg	FedProto	IP-FL
10:90	85.8±4.8	75.2±4.8	75.5±4.7	87.5±3.8	72±1.4	87.5±3.7
30:70	76±4.5	79.7±4	78.4±3.2	76.6±3.9	59.7±4.7	85.1±3.4
Linear	75.3±5.1	82.8±2.7	82±3.6	80.8±3.5	62.6±4.9	83.4±4.9
Random	77.8±6.8	80.9±4.4	79±5.1	83.3±5.2	68.4±5.7	86.2±4.3

6.3 Test Accuracy Performance Study

Effectiveness of clustering. We evaluate cluster-level model performance on holdout datasets from the cluster distributions ( $D_{A}$ and $D_{B}$ ), comparing IP-FL with the recent cluster-based pFL algorithm FedSoft on CIFAR10. We match the experiment settings from [45]: 100 clients, batch size 128, learning rate $\eta=0.01$ , and 300 training rounds. Table 3 presents test accuracy for two data partitions, 10:90 and 30:70. Notably, IP-FL outperforms FedSoft in the 10:90 partition, where each cluster represents a singular distribution. Clients with more $\theta_{0}$ data train in cluster $c_{0}$ with $63.68\%$ accuracy, and clients with more $\theta_{1}$ data prefer cluster $c_{1}$ with $63.82\%$ accuracy. In contrast, FedSoft scores lower accuracies of $50.7\%$ and $49.6\%$ in the 10:90 partition, as it struggles with data partition variability and with representing singular distributions. FedSoft’s models $c_{0}$ and $c_{1}$ show similar performance across $\theta_{0}$ and $\theta_{1}$ data, indicating that its personalization is more effective when the majority of clients have shared data; it therefore struggles to represent individual distributions and underperforms in non-IID scenarios. The 30:70 partition shows diminished performance, attributed to less data heterogeneity, which constrains personalization-driven accuracy gains. In Tables 2 and 3, we compare IP-FL with three clustering-based pFL algorithms: FedSoft, IFCA, and FedEM [37]. IP-FL consistently outperforms all three algorithms, supporting its effectiveness.

Comparison with non-clustering pFL models. Table 4 shows the test accuracy comparison of IP-FL with other recent pFL algorithms. Some pFL models perform well on individual partitions, such as Ditto for 10:90, FedProx and FedALA for Linear, and PerFedAvg for Random; however, IP-FL achieves the best accuracy on every partition (tying with PerFedAvg on 10:90).

Convergence speed. Figure 2 shows IP-FL’s superior convergence speed and personalized accuracies over other pFL methods, matching the communication rounds per training epoch of clustering-based algorithms [18, 45]. This efficiency stems from incorporating client preferences into post-aggregation updates, avoiding the additional proximal update step needed in clustering-based approaches. Moreover, IP-FL enables optional single-shot personalization with cluster models during training or post-training, further enhancing efficiency.

Table 5: PMA and opt-outs with the feature-shift EMNIST

pFL algo	FedAvg	Ditto	FedProx	FedALA	FedFomo	FedProto	PerFedAvg	IP-FL
Pers. Acc.	75.7±3	83.7±2	78.8±14	83.3±3	63.9±10	46.5±21	83.7±3	84.3±2
Opt-outs	-	0	23	0	100	0	0	0
Avg. PMA	-	7.6	3	7.2	-17	-48.6	7.7	8.6

6.4 Experimental Study of Opt-outs and PMA

Figure 3 shows the CDF of PMA for clients using different datasets and pFL methods. IP-FL excels, especially for the 10:90 partition, where data heterogeneity is most prominent (Figure 3(a)). In contrast, the EMNIST dataset, which features a broader class distribution per client, allows FedAvg to perform relatively well, suggesting limited scope for further improvement via personalization. IP-FL enhances PMA, particularly in the 10:90 and 30:70 partitions, where other pFL solutions struggle. Table 10 in the Appendix covers a highly heterogeneous scenario with 52 clusters and 4 classes per client: Ditto and FedProto perform well, but IP-FL outperforms them by $15\%$ in PMA. Opt-out ratios for FedProx, FedALA, and PerFedAvg are $0.64$ , $0.31$ , and $0.68$ , respectively, with no opt-outs for Ditto, FedFomo, and IP-FL, showcasing IP-FL’s effectiveness in reducing opt-outs and improving PMA under high data heterogeneity.
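For readers who want to reproduce this kind of summary from per-client PMA values, the sketch below computes an empirical CDF and an opt-out ratio. It is a minimal illustration, not the evaluation code used here: the PMA values in the usage note are hypothetical, and the opt-out rule (a client opts out when its PMA is not over a chosen threshold, following the description in Appendix C) uses a threshold that is an assumed parameter.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return (value, fraction of samples <= value) for each distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, bisect_right(ordered, v) / n) for v in sorted(set(ordered))]


def opt_out_ratio(pma_values, threshold):
    """Fraction of clients whose PMA is not over the threshold.

    Assumption: `threshold` is a single value shared by all clients.
    """
    return sum(1 for p in pma_values if p <= threshold) / len(pma_values)
```

For example, with hypothetical PMA values `[-2.0, 1.0, 1.0, 5.0]` and a threshold of `0.0`, one client out of four would opt out, giving a ratio of `0.25`.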

PMA and opt-outs with feature shift. We compared IP-FL to other pFL algorithms in non-IID data scenarios using a feature-shift technique, with identical hyperparameters. Two datasets were created: $D_{A}$ with normal EMNIST images and $D_{B}$ with images rotated by 90 degrees. Results in Table 5 show that IP-FL surpasses other pFL models, improving test accuracy by 1–38% and PMA by 12–118% with feature-shifted non-IID data.
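The feature-shift construction can be sketched in a few lines. This is an illustrative sketch only: images are represented as plain nested lists, the helper names are hypothetical, and the rotation direction (clockwise) is an assumption, since the text only states a 90-degree rotation.

```python
def rotate90_clockwise(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_feature_shift(images):
    """Build D_A (unchanged copies) and D_B (rotated copies) from the same images."""
    d_a = [[list(row) for row in img] for img in images]
    d_b = [rotate90_clockwise(img) for img in images]
    return d_a, d_b
```

Labels are untouched by this transformation, so the two datasets share the same label space while differing in input features.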

Advantages of including client preferences in pFL. IP-FL maintains personalized model test accuracy, even with dynamic client data or the inadvertent addition of a new client to the wrong cluster. In Figure 4, we observe the CDF of clients’ test accuracy after 500 training rounds. IP-FL’s robustness to varying client data is evident, while FedSoft, relying on the server’s perspective without access to client data, struggles to make precise clustering decisions.

Ablation study with Incentive in IP-FL. In Figure 5, we compare IP-FL with (I) and without (NI) incentives. The results show higher accuracy with incentives enabled under a majority of data scenarios, suggesting the importance of incentives in pFL.

7 Conclusion

In this paper, we proposed IP-FL to address the challenges of incentive provision in pFL, increasing consistent participation by providing appealing personalized models to clients. IP-FL's client-centric clustering approach ensures accurate clustering and improved performance even when a client’s local data distribution shifts dynamically or a client inadvertently makes a mistaken clustering decision. Unlike prior works that treat incentivization and personalization as separate problems, IP-FL solves them as interrelated challenges, yielding improvements in pFL performance. Extensive empirical evaluation shows its promising performance compared to other state-of-the-art works.

References

aim [2023] Ai marketplace. https://aimarketplace.co/, 2023.
aws [2023] Aws ai marketplace. https://aws.amazon.com/marketplace/solutions/machine-learning, 2023.
gra [2023] Gravityai. https://www.gravity-ai.com/, 2023.
mod [2023] Modelplace. https://modelplace.ai/, 2023.
Act [1996] Accountability Act. Health insurance portability and accountability act of 1996. Public law, 104:191, 1996.
Arikumar et al. [2022] K. S. Arikumar, Sahaya Beni Prathiba, Mamoun Alazab, Thippa Reddy Gadekallu, Sharnil Pandya, Javed Masood Khan, and Rajalakshmi Shenbaga Moorthy. Fl-pmi: Federated learning-based person movement identification through wearable devices in smart healthcare systems. Sensors, 22(4), 2022. ISSN 1424-8220. doi: 10.3390/s22041377. URL https://www.mdpi.com/1424-8220/22/4/1377.
Bonawitz et al. [2019] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design, 2019. URL https://arxiv.org/abs/1902.01046.
Chai et al. [2020] Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. Tifl: A tier-based federated learning system. CoRR, abs/2001.09249, 2020. URL https://arxiv.org/abs/2001.09249.
Chen et al. [2022] Huili Chen, Jie Ding, Eric W Tramel, Shuang Wu, Anit Kumar Sahu, Salman Avestimehr, and Tao Zhang. Self-aware personalized federated learning. Advances in Neural Information Processing Systems, 35:20675–20688, 2022.
Cho et al. [2023] Yae Jee Cho, Divyansh Jhunjhunwala, Tian Li, Virginia Smith, and Gauri Joshi. Maximizing global model appeal in federated learning, 2023.
Collins et al. [2022] Liam Collins, Enmao Diao, Tanya Roosta, Jie Ding, and Tao Zhang. Perfedsi: A framework for personalized federated learning with side information. 2022.
Deng et al. [2021] Yongheng Deng, Feng Lyu, Ju Ren, Yi-Chao Chen, Peng Yang, Yuezhi Zhou, and Yaoxue Zhang. Fair: Quality-aware federated learning with precise user incentive and model aggregation. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pages 1–10, 2021. doi: 10.1109/INFOCOM42981.2021.9488743.
Dinh et al. [2022] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes, 2022.
Du [2023] Chen Du. Solutions to high-dimensional statistics chapter 2. https://chendu2017.github.io/files/solns_to_HDStat_Chapter_2.pdf, January 2023. Solutions to exercises from "High-Dimensional Statistics: A Non-Asymptotic Viewpoint" by Martin J. Wainwright.
Duan et al. [2021] Moming Duan, Duo Liu, Xinyuan Ji, Renping Liu, Liang Liang, Xianzhang Chen, and Yujuan Tan. Fedgroup: Accurate federated learning via decomposed similarity-based clustering. 2021.
Fallah et al. [2020] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3557–3568. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/24389bfe4fe2eba8bf9aa9203a44cdad-Paper.pdf.
Gao et al. [2021] Liang Gao, Li Li, Yingwen Chen, Wenli Zheng, ChengZhong Xu, and Ming Xu. Fifl: A fair incentive mechanism for federated learning. In Proceedings of the 50th International Conference on Parallel Processing, ICPP ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450390682. doi: 10.1145/3472456.3472469. URL https://doi.org/10.1145/3472456.3472469.
Ghosh et al. [2020] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems, 33:19586–19597, 2020.
Han et al. [2022a] Jingoo Han, Ahmad Faraz Khan, Syed Zawad, Ali Anwar, Nathalie Baracaldo Angel, Yi Zhou, Feng Yan, and Ali R. Butt. Heterogeneity-aware adaptive federated learning scheduling. In 2022 IEEE International Conference on Big Data (Big Data), pages 911–920, 2022a. doi: 10.1109/BigData55660.2022.10020721.
Han et al. [2022b] Jingoo Han, Ahmad Faraz Khan, Syed Zawad, Ali Anwar, Nathalie Baracaldo Angel, Yi Zhou, Feng Yan, and Ali R. Butt. Tiff: Tokenized incentive for federated learning. In 2022 IEEE 15th International Conference on Cloud Computing (CLOUD), pages 407–416, 2022b. doi: 10.1109/CLOUD55607.2022.00064.
Hanzely and Richtárik [2021] Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models, 2021.
Hanzely et al. [2021] Filip Hanzely, Boxin Zhao, and Mladen Kolar. Personalized federated learning: A unified framework and universal optimization techniques. ArXiv, abs/2102.09743, 2021.
Hu et al. [2022] Miao Hu, Di Wu, Yipeng Zhou, Xu Chen, and Min Chen. Incentive-aware autonomous client participation in federated learning. IEEE Transactions on Parallel and Distributed Systems, 33(10):2612–2627, 2022. doi: 10.1109/TPDS.2022.3148113.
K-Means [2013] K-Means. Sklearn.cluster.kmeans. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html, 2013.
Kairouz et al. [2019] Peter Kairouz, H Brendan McMahan, Andrew Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Christina Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 12(3-4):1–357, 2019.
Kang et al. [2020] Jiawen Kang, Zehui Xiong, Dusit Niyato, Yuze Zou, Yang Zhang, and Mohsen Guizani. Reliable federated learning for mobile networks. IEEE Wireless Communications, 27(2):72–80, 2020. doi: 10.1109/MWC.001.1900119.
Khan et al. [2022] Ahmad Khan, Yuze Li, Ali Anwar, Yue Cheng, Thang Hoang, Nathalie Baracaldo, and Ali Butt. A distributed and elastic aggregation service for scalable federated learning systems, 2022. URL https://arxiv.org/abs/2204.07767.
Konečný et al. [2016] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
Kulkarni et al. [2020] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 794–797, 2020. doi: 10.1109/WorldS450073.2020.9210355.
Lai et al. [2021] Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection, 2021.
Le et al. [2022] Qi Le, Enmao Diao, Xinran Wang, Ali Anwar, Vahid Tarokh, and Jie Ding. Personalized federated recommender systems with private and partially federated autoencoders. arXiv preprint arXiv:2212.08779, 2022.
Li et al. [2020a] Lianghao Li, Qiang Li, Hong Chen, and Yiming Chen. Federated learning with strategic participants: A game theoretic approach. In Proceedings of the 37th International Conference on Machine Learning, pages 8597–8606. PMLR, 2020a.
Li et al. [2020b] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Federated multi-task learning for competing constraints. CoRR, abs/2012.04221, 2020b. URL https://arxiv.org/abs/2012.04221.
Li et al. [2020c] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks, 2020c.
Liu et al. [2022] Zelei Liu, Yuanyuan Chen, Han Yu, Yang Liu, and Lizhen Cui. Gtg-shapley: Efficient and accurate participant contribution evaluation in federated learning. ACM Trans. Intell. Syst. Technol., 13(4), may 2022. ISSN 2157-6904. doi: 10.1145/3501811. URL https://doi.org/10.1145/3501811.
Mansour et al. [2020] Y. Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. ArXiv, abs/2002.10619, 2020.
MARFOQ et al. [2021] Othmane MARFOQ, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=YCqx6zhEzRp.
McMahan et al. [2016] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. CoRR, abs/1602.05629, 2016. URL http://arxiv.org/abs/1602.05629.
Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Python [2002] Python. Cpython/functions.rst at main | python/cpython. https://github.com/python/cpython/blob/main/Doc/library/functions.rst, 2002.
Pérez et al. [2019] Sergio Pérez, Jaime Pérez, Patricia Arroba, Roberto Blanco, José L. Ayala, and José M. Moya. Predictive gpu-based adas management in energy-conscious smart cities. In 2019 IEEE International Smart Cities Conference (ISC2), pages 349–354, 2019. doi: 10.1109/ISC246665.2019.9071685.
Regulation [2018] General Data Protection Regulation. General data protection regulation (gdpr). Intersoft Consulting, Accessed in October, 24(1), 2018.
Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
Roth [1988] Alvin E Roth. Introduction to the shapley value. The Shapley value, pages 1–27, 1988.
Ruan and Joe-Wong [2022] Yichen Ruan and Carlee Joe-Wong. Fedsoft: Soft clustered federated learning with proximal local updating. In AAAI, 2022.
Shi et al. [2023] Yuxin Shi, Han Yu, and Cyril Leung. Towards fairness-aware federated learning. IEEE Transactions on Neural Networks and Learning Systems, pages 1–17, 2023. doi: 10.1109/TNNLS.2023.3263594.
Shi et al. [2022] Zhuan Shi, Lan Zhang, Zhenyu Yao, Lingjuan Lyu, Cen Chen, Li Wang, Junhao Wang, and Xiang-Yang Li. Fedfaim: A model performance-based fair incentive mechanism for federated learning. IEEE Transactions on Big Data, pages 1–13, 2022. doi: 10.1109/TBDATA.2022.3183614.
Singh et al. [2022] Saurabh Singh, Shailendra Rathore, Osama Alfarraj, Amr Tolba, and Byungun Yoon. A framework for privacy-preservation of iot healthcare data using federated learning and blockchain technology. Future Generation Computer Systems, 129:380–388, 2022. ISSN 0167-739X. doi: https://doi.org/10.1016/j.future.2021.11.028. URL https://www.sciencedirect.com/science/article/pii/S0167739X21004726.
Tan et al. [2021] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. CoRR, abs/2103.00710, 2021. URL https://arxiv.org/abs/2103.00710.
Tan et al. [2020] Kang Tan, Duncan Bremner, Julien Le Kernec, and Muhammad Imran. Federated machine learning in vehicular networks: A summary of recent applications. In 2020 International Conference on UK-China Emerging Technologies (UCET), pages 1–4, 2020. doi: 10.1109/UCET51115.2020.9205482.
Tang and Wong [2021] Ming Tang and Vincent W.S. Wong. An incentive mechanism for cross-silo federated learning: A public goods perspective. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pages 1–10, 2021. doi: 10.1109/INFOCOM42981.2021.9488705.
Tang et al. [2021] Xueyang Tang, Song Guo, and J Guo. Personalized federated learning with clustered generalization. ArXiv, abs/2106.13044, 2021.
Tang et al. [2022] Xueyang Tang, Song Guo, and Jingcai Guo. Personalized federated learning with contextualized generalization. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2241–2247. International Joint Conferences on Artificial Intelligence Organization, 7 2022. doi: 10.24963/ijcai.2022/311. URL https://doi.org/10.24963/ijcai.2022/311. Main Track.
The White House [2023] The White House. National strategy to advance privacy-preserving data sharing and analytics. March 2023.
Wei et al. [2020] Shuyue Wei, Yongxin Tong, Zimu Zhou, and Tianshu Song. Efficient and Fair Data Valuation for Horizontal Federated Learning, pages 139–152. 11 2020. ISBN 978-3-030-63075-1. doi: 10.1007/978-3-030-63076-8_10.
Xiao et al. [2021] Zhiwen Xiao, Xin Xu, Huanlai Xing, Fuhong Song, Xinhan Wang, and Bowen Zhao. A federated learning system with enhanced feature extraction for human activity recognition. Knowledge-Based Systems, 229:107338, 2021. ISSN 0950-7051. doi: https://doi.org/10.1016/j.knosys.2021.107338. URL https://www.sciencedirect.com/science/article/pii/S0950705121006006.
Ye et al. [2022] Chenglong Ye, Reza Ghanadan, and Jie Ding. Meta clustering for collaborative learning. Journal of Computational and Graphical Statistics, pages 1–10, 2022.
Zhang et al. [2022] Jianqing Zhang, Yang Hua, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, and Haibing Guan. Fedala: Adaptive local aggregation for personalized federated learning, 12 2022.
Zhang et al. [2021a] Jingwen Zhang, Yuezhou Wu, and Rong Pan. Incentive mechanism for horizontal federated learning based on reputation and reverse auction. In Proceedings of the Web Conference 2021, WWW ’21, page 947–956, New York, NY, USA, 2021a. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3449888. URL https://doi.org/10.1145/3442381.3449888.
Zhang et al. [2023] Lefeng Zhang, Tianqing Zhu, Ping Xiong, Wanlei Zhou, and Philip S. Yu. A robust game-theoretical federated learning framework with joint differential privacy. IEEE Transactions on Knowledge and Data Engineering, 35(4):3333–3346, 2023. doi: 10.1109/TKDE.2021.3140131.
Zhang et al. [2021b] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization, 2021b.
Zhou et al. [2021] Zirui Zhou, Lingyang Chu, Changxin Liu, Lanjun Wang, Jian Pei, and Yong Zhang. Towards fair federated learning. In Proceedings of the 27th ACM SIGKDD, KDD ’21, page 4100–4101, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3470814. URL https://doi.org/10.1145/3447548.3470814.

Appendix A IP-FL Use Cases

A.1 Personalized and Incentives Learning Applications

Incentivizing participants in personalized federated learning can guide their behavior towards joining appropriate or similar subgroups, enhancing the personalization and effectiveness of models. Here are a few use cases demonstrating this:

Educational Content Personalization: IP-FL can be used to gather learning behavior data from students. Incentives encourage students to share data, helping to cluster them based on learning styles and progress, leading to more tailored educational content and methods.

Energy Consumption Optimization in Smart Homes: Homeowners can be incentivized to share energy usage data. This data helps create models that predict and optimize energy consumption patterns in different household clusters, promoting energy efficiency [41].

Precision Agriculture: Farmers can be motivated to share crop and soil data. Clustering this data helps develop models that provide personalized farming advice, enhancing crop yields and sustainable practices [48].

Healthcare Data from Wearables: Incentives encourage users of wearable health devices to share specific health data. This data can then be clustered into subgroups like heart patients or diabetes management, leading to more personalized healthcare recommendations [6].

Driving Data in Autonomous Vehicles: Vehicle owners can be incentivized to share their driving data. This data can be used to cluster drivers into groups based on driving styles or habits, which can then be used to improve autonomous driving systems tailored to different driving behaviors [50, 56].

Mobile Device Usage: Users can be incentivized to share data for services like app recommendations. By clustering users with similar app usage patterns, more personalized and relevant app suggestions can be made [26].

In each scenario, IP-FL not only boosts participation rates but also ensures that contributions are more aligned with specific subgroup needs, leading to highly personalized and effective solutions.

A.2 Responsible Personalized learning over BigData through incentives

The recent Privacy-Preserving Data Sharing and Analytics (PPDSA) initiative from the US government [54] aims to balance the need for leveraging big data for societal benefits (like healthcare, security, or economic growth) with the imperative of protecting individual privacy rights. Personalized Federated Learning enables the collaborative use of data in a way that inherently respects privacy concerns, and it can be used to learn personalized objectives for individual groups in large populations. However, collecting this data for personalized learning is only possible with proper incentivization of each population group.

A.3 AI Marketplaces and role of IP-FL

The AI field is witnessing the emergence of several AI marketplaces like ModelPlace [4], GravityAI [3], AWS AI Marketplace [2], and AI Marketplace [1], serving both model and data requirements. A crucial distinction, however, is that while these platforms offer data and model training services, their data is often static and not as contemporary as data generated on individual client devices. There are also pronounced challenges related to data transportation, such as significant costs and privacy issues.

For applications rooted in user recommendations and driven by user data, assimilating the latest user trends is of paramount significance. Personalized models through federated learning offer an effective strategy to realize this goal. However, the current federated model marketplace is limited, primarily due to the lack of incentives encouraging clients or users to share their recent and valuable data, which can be pivotal in developing top-tier models.

Our work aims to bridge this gap by designing architectures and frameworks that stimulate this sector’s growth. By emphasizing contemporary and pertinent data from client devices, our objective is to create a framework that incentivizes active user engagement, leading to higher-quality models. This initiative is grounded in the belief that motivating clients and users to share their data is essential to harness federated learning’s full potential in dynamic, personalized applications.

Appendix B Initial Clustering

In addition to the analyses presented in the "Experimental Study" section, we also evaluate the 40:60 partition using the EMNIST dataset, following the hyperparameters outlined in that section. In this partition setting, our proposed IP-FL algorithm performs better than the other algorithms under consideration. The 40:60 partition is slightly heterogeneous compared to the completely IID case of 50:50, which is why we see improved performance with personalization, particularly with IP-FL.

IP-FL also has a mode that helps clients form initial clusters. This lets clients bypass early decisions and optimizes spending when contributions and cluster distributions are ambiguous and the similarity metrics that clients use, among other metrics, to make clustering decisions are unknown. Initially, the clients perform pre-training for a few rounds; in our evaluation, pre-training for just 5 rounds was enough to profile the clients. After this, the profiler calculates per-class F1-scores $\xi$ on an IID test dataset [39]. The next training round’s client clustering uses the K-Means algorithm [24] based on the most varied F1-scores $V_{F1}$ from $C$ classes, as per Equation 14.

$$V_{F1}=\mathrm{var}(\xi_{i})\in[1,C]\mid\forall i\in N\qquad(14)$$

Our evaluations exclude this feature, but IP-FL offers it for rapid convergence and cost-saving. Because not every client can be selected for pre-training, only clients replying within a threshold are used for F1-score calculations. Others are assigned to clusters randomly and later settle into suitable clusters through preference and contribution.
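The initial-clustering mode can be sketched as follows: compute the variance of each class's F1-score across the profiled clients, keep the most varied classes as features, and run K-Means on them. This is a minimal sketch under stated assumptions, not the IP-FL implementation. The paper cites scikit-learn's K-Means [24]; to stay dependency-free, the sketch swaps in a plain Lloyd's-algorithm loop. The parameter `top_m` (how many of the most varied classes to keep), the use of population variance, and all function names are hypothetical.

```python
import random
from statistics import pvariance


def most_varied_classes(f1, top_m):
    """Indices of the top_m classes whose F1-score varies most across clients.

    f1[i][c] is the F1-score of client i's pre-trained model on class c.
    """
    num_classes = len(f1[0])
    variances = [pvariance([row[c] for row in f1]) for c in range(num_classes)]
    ranked = sorted(range(num_classes), key=lambda c: variances[c], reverse=True)
    return ranked[:top_m]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def initial_clusters(f1, k=2, top_m=2, seed=0):
    """Cluster profiled clients on their most varied per-class F1-scores."""
    keep = most_varied_classes(f1, top_m)
    features = [[row[c] for c in keep] for row in f1]
    return kmeans(features, k, seed=seed)
```

Clients that did not reply in time would simply be given a random label in `range(k)`, matching the random assignment described above.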

Appendix C Fairness in IP-FL

Consumer and Provider Roles: The incentive mechanism in IP-FL treats clients as both providers and consumers. As a consumer, the client tries to attain a certain level of personalized model appeal, so it pays providers to spend resources on training the said model in each round. As a provider, the client earns a profit based on its marginal contribution to training the cluster models.

Fairness for consumers: Clients with a consumer profile pay for the tokens they use. To ensure that consumers only pay for what they get, they have the option to opt out: a consumer client can opt out if it is not getting a model with a PMA over the desired threshold. To overcome the coarse-grained limitation of opt-out, consumer clients also receive a partial reimbursement of the tokens they spent if the resulting PMA does not meet their expected threshold.

Opportunity fairness: We follow other state-of-the-art token-based incentive FL frameworks in ensuring that no consumer or sub-group of consumers can manipulate the overall training process by spending more tokens, which is why we use a single pricing model for all consumers. Starting every consumer client with the same number of tokens also ensures equal opportunity.

Fairness for providers: We also ensure collaboration fairness among clients with a provider profile. A provider with a larger marginal contribution to training the model (calculated using Shapley values) receives more token rewards, as highlighted in the algorithms presented in the paper. The second incentive for providers, as for consumers, is increased PMA. Both rewards depend on accurate clustering choices by individual provider clients. Hence, to obtain these incentives, each client makes better clustering choices, which leads to clusters with similar client objectives (good personalization) and in turn increases the provider client’s incentives (PMA and tokens). This is supported by the theoretical analysis presented in the paper.
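To make the provider-side rule concrete, the sketch below computes exact Shapley values for a small set of providers and splits a token budget in proportion to them. It is an illustration under stated assumptions rather than the IP-FL reward algorithm: the coalition `utility` function (for instance, the accuracy of a model trained by a subset of clients) is supplied by the caller, exact enumeration is only feasible for a handful of clients, and the proportional split with negative values clipped to zero is an assumed rule, since the text only states that larger contributions earn more tokens.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for a coalition utility function.

    `utility` maps a frozenset of players to a number.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def token_rewards(values, budget):
    """Split a token budget in proportion to non-negative Shapley values."""
    clipped = {p: max(v, 0.0) for p, v in values.items()}
    total = sum(clipped.values())
    if total == 0:
        return {p: 0.0 for p in values}
    return {p: budget * v / total for p, v in clipped.items()}
```

With an additive utility in which provider `a` contributes 1 and provider `b` contributes 3, the Shapley values are 1 and 3, so a budget of 100 tokens would be split 25 to 75.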

Appendix D Further evaluation of the empirical performance of IP-FL

Table 6: Test accuracy of ablation study with Incentive (I) and without incentive (NI)

	10:90		30:70		linear		random
	c0	c1	c0	c1	c0	c1	c0	c1
IP-FL (I)	$58.62(0)$	$67.4(1)$	$51.12(0)$	$64.06(1)$	$66.2(0)$	$57.02(1)$	$64.54(0)$	$56.86(1)$
IP-FL (NI)	$49.92(1)$	$52.8(1)$	$48.9(0)$	$44.44(1)$	$50.82(1)$	$45.92(1)$	$57.42(1)$	$46.76(1)$

D.1 Ablation study of clustering performance with Incentive in IP-FL

We perform an ablation study with the incentive component of IP-FL on the Synthetic CIFAR10 dataset for 200 rounds with $N=100$ clients, batch size 128, and learning rate $\eta=0.01$ . When the incentive is disabled, clients do not consider maximizing their incentive while sending preference bids. Instead, clients send preference bids with random cluster choices to the scheduler, as in FedAvg [38].

Table 6 shows the test accuracies of the cluster-level models. IP-FL(I) indicates that incentives are enabled and IP-FL(NI) shows the accuracies when incentives are disabled. IP-FL(I) outperforms IP-FL(NI) in test accuracy for all partitions. The important point is that the incentive mechanism in IP-FL directly motivates clients to join the clusters in which they can contribute the most. This results in accurate clustering based on client data distributions and in good-quality personalized models. The performance of IP-FL(NI) illustrates this: without incentives, the cluster-level models fail to specialize, and in every partition except 30:70 both models perform well on the same single distribution. By comparison, in IP-FL(I) each cluster-level model dominates and performs well on its own distribution.
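The two observations above can be checked directly from Table 6. The sketch below copies the table's values and reads the number in parentheses as the index of the distribution on which that cluster model scores best; this reading is an assumption, though it is consistent with how Tables 8 and 9 relate to each other. The helper names are hypothetical.

```python
# (accuracy, best-served distribution index) for models c0 and c1, from Table 6.
TABLE6 = {
    "IP-FL (I)": {
        "10:90": [(58.62, 0), (67.4, 1)],
        "30:70": [(51.12, 0), (64.06, 1)],
        "linear": [(66.2, 0), (57.02, 1)],
        "random": [(64.54, 0), (56.86, 1)],
    },
    "IP-FL (NI)": {
        "10:90": [(49.92, 1), (52.8, 1)],
        "30:70": [(48.9, 0), (44.44, 1)],
        "linear": [(50.82, 1), (45.92, 1)],
        "random": [(57.42, 1), (46.76, 1)],
    },
}


def collapsed_partitions(rows):
    """Partitions where both cluster models favour the same distribution."""
    return [name for name, cells in rows.items() if len({d for _, d in cells}) == 1]


def accuracy_gains(with_inc, without_inc):
    """Per-partition (c0, c1) accuracy gain of the first variant over the second."""
    return {
        name: tuple(
            round(a - b, 2)
            for (a, _), (b, _) in zip(with_inc[name], without_inc[name])
        )
        for name in with_inc
    }
```

Calling `collapsed_partitions` on the NI rows returns every partition except 30:70, while the I rows yield an empty list, and every entry of `accuracy_gains` is positive.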

D.1.1 Multiple distribution Results

D.2 Advantages of including client preferences in pFL.

D.2.1 Challenges requiring client autonomy in pFL

Prior personalized FL works generate personalized models from the server’s perspective. We argue that the server may not have complete information to produce good-quality models due to a variety of challenges [25, 32]. These challenges are as follows:

Confidentiality: Some clients may have sensitive data that they do not want to share with others for privacy or security reasons. For example, a company may have confidential customer data that it does not want to share with a third-party vendor.

Competitive Advantage: In some industries, companies may want to keep their data private to maintain a competitive advantage. For example, a company may not want to share its sales data with competitors.

Data Governance: Some organizations may have strict data governance policies that prohibit the sharing of certain types of data. For example, a healthcare organization may not be able to share patient data without proper consent.

Resource Limitations: Clients with large datasets may not have the resources to share all their data for training. In these cases, they may choose to share a random sample of their data to keep the training process manageable.

Data Anonymization: Sometimes, clients may not want to share the raw data and may instead share a subset of the data that has been anonymized to protect the privacy of individuals.

Compliance with Privacy Laws: To comply with privacy laws (GDPR [42], HIPAA [5]), some clients might only share anonymized data while keeping Personally Identifiable Information (PII) private.

Prior clustering-based pFL works also lack support for accurately placing new clients, whose data qualities are unknown, into clusters. By including client preferences, IP-FL performs accurate clustering and generates appealing personalized models for new clients.

We use the Synthetic datasets to test IP-FL where the aggregator is unaware of the client’s dataset distribution and goals.

Table 7: Test accuracy for IP-FL on Synthetic CIFAR10 Dataset

	10:90		30:70		linear		random
	c0	c1	c0	c1	c0	c1	c0	c1
$\theta_{0}$	62.78	2.56	53.28	34.32	61.66	12.08	66.9	19.48
$\theta_{1}$	1.5	70.96	30.38	61.94	30.94	59.44	19.12	58.42

Table 8: Test accuracy for IP-FL and FedSoft on Synthetic CIFAR10 Dataset

	10:90		30:70		linear		random
	c0	c1	c0	c1	c0	c1	c0	c1
IP-FL	$62.68(0)$	$70.96(1)$	$53.28(0)$	$61.94(1)$	$61.66(0)$	$59.44(1)$	$66.90(0)$	$58.42(1)$
FedSoft	$32.50(0)$	$38.62(1)$	$20.28(0)$	$23.58(0)$	$34.42(1)$	$49.62(1)$	$21.62(1)$	$33.12(1)$

Table 9: Test accuracy for FedSoft on Synthetic CIFAR10 Dataset

	10:90		30:70		linear		random
	c0	c1	c0	c1	c0	c1	c0	c1
$\theta_{0}$	32.5	13.6	20.28	23.58	8.48	2.82	16.18	0.28
$\theta_{1}$	11.76	38.62	0.18	0.08	34.42	49.62	21.62	33.12

D.2.2 Clustering performance with Synthetic CIFAR10 dataset

Cluster-based pFL methods group heterogeneous clients into clusters so that each cluster contains clients with similar data distributions. We therefore first test the client-preference-driven clustering design of IP-FL on the Synthetic CIFAR10 data and report the performance of the cluster-level models. We use the same configurations described in the main paper for the CIFAR10 dataset evaluation and train for 500 rounds with both IP-FL and FedSoft. Table 7 shows the cluster-level model test accuracies for IP-FL with Synthetic CIFAR10 data.

IP-FL accurately differentiates between clients of different distributions. This is visible in the accuracy gap of each cluster-level model across distributions. For example, on the 10:90 partition the $c1$ model has $70.96\%$ accuracy on the $\theta_{1}$ distribution but only $2.56\%$ accuracy on $\theta_{0}$ , which indicates that cluster-level model $c1$ trains with clients that have the majority of their training data from $\theta_{1}$ . Similarly, $c0$ trains with clients that have the majority of their training data from $\theta_{0}$ and has an accuracy of $62.68\%$ .

Table 8 shows the cluster-level model accuracy comparison of IP-FL and FedSoft. The number in parentheses next to each accuracy is the distribution on which that cluster-level model performs best. For example, for the linear partition, the IP-FL $c0$ cluster-level model has an accuracy of $61.66\%$ on distribution $\theta_{0}$ , and the $c1$ model has an accuracy of $59.44\%$ on distribution $\theta_{1}$ . For the linear partition with FedSoft, both $c0$ and $c1$ perform best on the same distribution $\theta_{1}$ , with accuracies of $34.42\%$ and $49.62\%$ . IP-FL outperforms FedSoft in terms of accuracy for each partition and can distinguish between different distributions accurately.

Table 9 represents the result of FedSoft [45] tested with the Synthetic CIFAR10 dataset. The experimental setup is explained in more detail in the main paper. We can observe that except for the 10:90 partition, FedSoft is unable to cluster clients under individual clusters in a way that each cluster-level model could dominate one distribution. Both cluster-level models perform well on only one distribution and the other is ignored leading to low test accuracy of both cluster-level models for the ignored distribution.
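The relation between Table 9 and the FedSoft row of Table 8 can be made explicit with a short sketch. The accuracies below are copied from Table 9; the helper names (`best_distribution`, `covers_all_distributions`) are illustrative and not part of the IP-FL or FedSoft code.

```python
# Table 9 (FedSoft): per partition and cluster-level model,
# (accuracy on theta_0, accuracy on theta_1).
fedsoft_acc = {
    "10:90":  {"c0": (32.50, 11.76), "c1": (13.60, 38.62)},
    "30:70":  {"c0": (20.28, 0.18),  "c1": (23.58, 0.08)},
    "linear": {"c0": (8.48, 34.42),  "c1": (2.82, 49.62)},
    "random": {"c0": (16.18, 21.62), "c1": (0.28, 33.12)},
}


def best_distribution(per_dist_acc):
    """Return (best accuracy, index of the distribution where it is reached)."""
    best_idx = max(range(len(per_dist_acc)), key=lambda d: per_dist_acc[d])
    return per_dist_acc[best_idx], best_idx


def covers_all_distributions(cluster_accs):
    """True if every distribution is the best one for some cluster-level model."""
    dominated = {best_distribution(acc)[1] for acc in cluster_accs.values()}
    n_dists = len(next(iter(cluster_accs.values())))
    return len(dominated) == n_dists


# Entries in the style of Table 8: accuracy(best distribution).
summary = {p: {c: best_distribution(a) for c, a in clusters.items()}
           for p, clusters in fedsoft_acc.items()}
# Partitions where the two models specialize on different distributions.
separated = {p: covers_all_distributions(c) for p, c in fedsoft_acc.items()}
```

Under this reading, only the 10:90 partition has one FedSoft model per distribution, which matches the observation in the text.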

D.2.3 Clustering performance with Synthetic EMNIST dataset

This image dataset has images of dimension 28 x 28 and 52 output classes where 26 classes are lower case letters and 26 classes are upper case letters. We test the dataset on different partitions of 10:90, 30:70, linear, and random created in the same way as the Synthetic CIFAR10 data. The only difference is that $D_{A}$ contains 26 lowercase letters and $D_{B}$ has 26 uppercase letters.

Since Ditto is not a clustering-based algorithm, we only compare the test accuracy of personalized models on the EMNIST dataset for IP-FL and Ditto in Figure 6. IP-FL outperforms Ditto significantly, especially for highly heterogeneous data partitions such as 10:90 and linear. The reason for this performance improvement is that, unlike Ditto, IP-FL provides autonomy to clients to personalize according to their own goals. With Ditto, the goal of each client which consists of their private data is hidden from the aggregator server which affects the quality of personalized models.

IP-FL outperforms other FL personalization algorithms in heterogeneous cases when the clients and dataset are divided between 2 distributions (DA & DB). To analyze the reliability of IP-FL, it is also evaluated in highly heterogeneous conditions with 4 different distributions (DA, DB, DC, and DD) on the EMNIST dataset. The EMNIST dataset has a total of 52 classes, thus each of the 4 distributions gets $25\%$ of the available classes. DA has the first 13 classes, DB has the next 13, and so on. Figure 7 shows the CDF of test accuracy for all personalized models at clients. IP-FL outperforms Ditto for all partition types. For the linear partition, fewer than $10\%$ of clients have lower than $80\%$ accuracy. We attribute these outliers to the partition type: when the data is divided linearly, some clients receive very few data samples. This trend can also be seen with Ditto for the linear partition.

D.2.4 Insights from the analysis with Synthetic datasets

We use the synthetic datasets to emulate dynamic client data or the induction of new clients whose data distributions are unknown. This presents new challenges for accurate clustering and for the generation of personalized models for less familiar clients. Our empirical evaluation with the Synthetic datasets shows that conventional server-driven pFL methods can lead to $100\%$ opt-outs from these clients. In contrast, by including client preferences in clustering, IP-FL can cluster accurately and generate appealing models for new clients and for clients whose data distribution changes.

D.3 Opt-out results for 4 classes per client with EMNIST dataset

Table 10 shows the complete opt-out, test accuracy, and PMA results for the test conducted with the EMNIST dataset with 4 classes per client. Some pFL algorithms such as FedFomo and Ditto have similar performance compared to IP-FL in terms of opt-outs. However, IP-FL outperforms all of them in terms of PMA and test accuracy.

D.4 Opt-out results for 2 classes per client with CIFAR10 dataset

We also test highly heterogeneous data conditions where each client has only 2 classes. We use a learning rate of 0.01, a batch size of 128, and 150 global epochs with the CIFAR10 dataset and the same CNN model mentioned in the main paper. Figure 8 shows the PMA for different pFL algorithms. IP-FL outperforms FedFomo by $30\%$ and all other pFL algorithms by approximately $50\%$ . This shows that IP-FL performs even better in highly heterogeneous conditions where personalization is used to tackle data heterogeneity.

Table 10: PMA and opt-outs with the 4 classes per client EMNIST dataset

pFL Method	FedAvg	Ditto	FedProx	FedALA	PerFedAvg	FedFomo	IP-FL
Personalized Accuracy	82.8±4.2	90.78±2.1	61.23±2.86	62.17±3.9	59.99±1.87	86.57±1.59	98.59±1.2
Optouts	-	0	0.64	0.31	0.68	0	0
Average PMA	-	24.48±4.2	0.57±5.14	1.5±4.74	-0.6±1.5	25.9±3.6	37.93±3.43

Table 11: PMA and opt-outs with 40:60 partition of the EMNIST dataset

pFL Method	FedAvg	Ditto	FedProx	FedALA	PerFedAvg	FedFomo	IP-FL
Personalized Accuracy	76.22±0.47	77.93±1.1	82.2±0.47	74.65±3.51	79.14±2.36	47.65±14	85.11±2.1
Optouts	-	0	26	0	100	0	0
Average PMA	-	1.7	5.9	0.14	2.9	-28.6	8.86

D.5 Opt-out results for slightly non-IID EMNIST dataset

We also evaluate the 40:60 partition using the EMNIST dataset, following the parameters outlined in Section 6.4. The results are shown in Table 11. In this partition setting, our proposed IP-FL algorithm performs better than the other algorithms under consideration. The 40:60 partition is slightly heterogeneous compared to the fully IID 50:50 case, which is why we see improved performance with personalization, particularly with IP-FL.

Appendix E Convergence Speed

In this section, we include the remaining results of convergence speed from the experimental evaluation. Figure 9 shows the convergence speed of IP-FL in comparison with other recent pFL algorithms. The results clearly indicate that IP-FL outperforms all other pFL algorithms in terms of both accuracy achieved as well as time to convergence for all datasets.

Appendix F Cost and Overhead

In terms of computation costs, IP-FL vastly outperforms clustering-based pFL algorithms. The significant advantage arises because other recent cluster-based pFL solutions [45] necessitate an additional update step during training, referred to as proximal update. On the other hand, IP-FL clients only require a single-shot personalization using cluster models, which is not only efficient but also conveniently performed by each client once the training concludes. It is worth noting that this personalization step is not mandatory during training, which further adds to the efficiency of our approach.

Appendix G Limitations

Calculating Shapley values for contribution calculation puts some additional overhead on the aggregator server; however, we have partially resolved this by using Shapley value approximations. In the future, more lightweight contribution calculation mechanisms can be included as they evolve. In addition, calculating evaluation metrics such as PMA could consume resources. However, we emphasize that PMA is an evaluation metric employed to demonstrate the performance improvements of IP-FL over other pFL algorithms. Neither PMA nor the threshold calculation is required by the IP-FL algorithm itself, so it is not a requisite step in the practical application of IP-FL. Even if we take local training as the threshold, the results will be the same because the threshold only serves the purpose of comparing IP-FL with other pFL algorithms.

Appendix H Shapley value approximation for client contribution

Here we present the Shapley Value approximation derivation we use for calculating the client contributions.

Algorithm 3 Estimated Shapley value of any client in an FL

Input: Test data $(x_{i},y_{i}),i=1,\ldots,n_{\textrm{test}}$ ; clients’ local model parameters and aggregation weights $W_{m},\lambda_{m}$ ; server’s aggregated model parameter $W_{M}=\sum_{m=1}^{M}\lambda_{m}W_{m}$

1: Calculate $\gamma_{M}\overset{\Delta}{=}n_{\textrm{test}}^{-1}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})$

2: for $k=1,\ldots,M$ do

3: Calculate $\textsc{shap}(k\rightarrow[M])$ using

\displaystyle-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\lambda_{k}W_{k}

(15)

for unnormalized aggregation, or

\displaystyle-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\lambda_{k}(W_{k}-W_{M})

(16)

for normalized aggregation.

4: end for

Output: all clients’ Shapley values
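A minimal Python sketch of Algorithm 3 is given below. It assumes, purely for illustration, a linear model with squared loss $\ell(x,y;W)=\frac{1}{2}(x^{\mathrm{T}}W-y)^{2}$ ; the function names and the synthetic data are hypothetical, and any differentiable loss could replace `mean_test_gradient`.

```python
import numpy as np


def mean_test_gradient(W, X, y):
    """Step 1: average gradient of the (assumed) squared loss over the test set."""
    residual = X @ W - y
    return X.T @ residual / len(y)


def estimated_shapley(client_params, weights, X_test, y_test, normalized=False):
    """Steps 2-4: first-order Shapley estimate for every client."""
    client_params = np.asarray(client_params, dtype=float)  # shape (M, d)
    weights = np.asarray(weights, dtype=float)               # shape (M,)
    W_M = weights @ client_params                            # aggregated model
    gamma = mean_test_gradient(W_M, X_test, y_test)
    if normalized:
        deltas = weights[:, None] * (client_params - W_M)    # Eq. (16)
    else:
        deltas = weights[:, None] * client_params            # Eq. (15)
    return -deltas @ gamma                                   # one value per client


# Hypothetical toy setup: 4 clients with noisy copies of a true model.
rng = np.random.default_rng(0)
X_test = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_test = X_test @ w_true
client_params = w_true + 0.1 * rng.normal(size=(4, 3))
weights = np.full(4, 0.25)

shap_unnorm = estimated_shapley(client_params, weights, X_test, y_test)
shap_norm = estimated_shapley(client_params, weights, X_test, y_test,
                              normalized=True)
```

When the weights sum to one, the normalized estimates sum to zero because $\sum_{k}\lambda_{k}(W_{k}-W_{M})=0$ , so they rank clients relative to the aggregate rather than measure an absolute gain.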

Notation: Let $[M]$ denote the set $\{1,\ldots,M\}$ , and $A-B$ the set of elements in $A$ but not in $B$ . In this section, $[M]$ denotes all the agents that participate in the coalition. We will consider those not participating in the near future.

We aim to look for a reasonable way to quantify the amount of each client’s contribution in a round. Suppose at any particular round, the server obtains an aggregated model with parameter

\displaystyle W_{[M]}\overset{\Delta}{=}\sum_{m\in[M]}\lambda_{m}W_{m},

(17)

where $\lambda_{m}$ is the weight (usually $n_{m}/n$ where $n_{m}$ and $n$ are sample sizes of client $m$ and all clients, respectively), and $W_{m}$ is the locally updated model of client $m$ .

The prediction loss of the model with parameter $W$ , denoted by $\mathcal{L}(W)$ , is approximated by

\displaystyle\mathcal{L}(W)\approx\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\ell(x_{i},y_{i};W),

(18)

where $(x_{i},y_{i})$ , $i=1,\ldots,n_{\textrm{test}}$ , is a set of test data. At round $t$ , we define the value function of a set of agents $C$ based on how much their contributed model, denoted by $W_{C}$ , has decreased the loss of the earlier model, denoted by $W_{t-1}$ , namely

\displaystyle v_{t}(C)\overset{\Delta}{=}\mathcal{L}(W_{t-1})-\mathcal{L}(W_{C}),

(19)

so that the larger the better. When there is no ambiguity, we simply write $v_{t}$ as $v$ . It is worth noting that $v$ is a function of the set while $\mathcal{L}$ is a function of the parameter. Once $C$ is realized, $W_{C}$ will become $W_{t}$ for the next round.

Recall that the original Shapley value [44] of agent $m$ given a set of agents $A$ and a value function $v$ is defined by

\displaystyle\sum_{S\subseteq A-\{m\}}\frac{|S|!(|A|-1-|S|)!}{|A|!}(v(S\cup\{m\})-v(S)),

(20)

whose sum over all agents is equal to $v(A)-v(\emptyset)$ . Here, $\emptyset$ represents the baseline coalition scenario, from which the contribution of each agent is quantified. To highlight the dependency on baseline, we use $B$ to denote the baseline and rewrite (20) as

	$\displaystyle\textsc{shap}(m\rightarrow A\mid B)$	$\displaystyle=\sum_{S\subseteq A-\{m\}}\frac{|S|!(|A|-1-|S|)!}{|A|!}\times$		
		$\displaystyle\quad(v(S\cup\{m\}\mid B)-v(S\mid B)),$		(21)

where $v(S\mid B)$ means the value of $S$ conditional on the baseline $B$ . In our scenario, $B$ means the set of agents that are already in coalition and thus

\displaystyle v(S\mid B)\overset{\Delta}{=}v(S\cup B).

(22)
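The conditional Shapley value in (21) with the convention (22) can be computed exactly by enumeration when the coalition is small. The sketch below does so for a hypothetical toy value function (`toy_value`, with made-up contributions and a synergy bonus) and checks the efficiency property, i.e., that the values over $A$ sum to $v(A\mid B)-v(\emptyset\mid B)$ .

```python
from itertools import combinations
from math import factorial


def conditional_shapley(m, A, B, v):
    """Shapley value of agent m within A, conditional on baseline B (Eqs. 21-22)."""
    others = [a for a in A if a != m]
    n = len(A)
    total = 0.0
    for size in range(len(others) + 1):
        coeff = factorial(size) * factorial(n - 1 - size) / factorial(n)
        for subset in combinations(others, size):
            S = frozenset(subset)
            # v(S | B) is evaluated as v(S union B), following Eq. (22).
            total += coeff * (v(S | {m} | B) - v(S | B))
    return total


# Hypothetical value function: additive contributions plus a bonus
# when agents 1 and 2 are both present.
contrib = {1: 3.0, 2: 1.0, 3: 2.0, 4: 0.5}


def toy_value(coalition):
    base = sum(contrib[a] for a in coalition)
    return base + (1.0 if {1, 2} <= coalition else 0.0)


A = frozenset({1, 2})   # agents whose contribution is being split
B = frozenset({3, 4})   # agents already in the coalition
shap_values = {m: conditional_shapley(m, A, B, toy_value) for m in A}
```

In this toy case the bonus of 1.0 is split equally, so agents 1 and 2 receive 3.5 and 1.5.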

Let us consider the baseline as $B\overset{\Delta}{=}[M]-\{i,j\}$ . The corresponding baseline model will be

	$\displaystyle\textrm{unnormalized version:}\quad W_{[M]-\{i,j\}}\overset{\Delta}{=}\sum_{m\in[M]-\{i,j\}}\lambda_{m}W_{m},$		(23)

	$\displaystyle\textrm{normalized version:}\quad W_{[M]-\{i,j\}}^{*}\overset{\Delta}{=}\frac{1}{\sum_{m\in[M]-\{i,j\}}\lambda_{m}}\sum_{m\in[M]-\{i,j\}}\lambda_{m}W_{m}.$		(24)

We consider an unnormalized version for brevity. The additional value by introducing $i,j$ is

	$\displaystyle v(\{i,j\}\mid[M]-\{i,j\})-v(\emptyset\mid[M]-\{i,j\})$		(25)
	$\displaystyle\begin{subarray}{c}\textrm{use }(22)\end{subarray}=v([M])-v([M]-\{i,j\})$		(26)
	$\displaystyle\begin{subarray}{c}\textrm{use }(19)\textrm{ and recall }(17)\textrm{ and }(23)\end{subarray}=\mathcal{L}(W_{[M]-\{i,j\}})-\mathcal{L}(W_{M})$		(27)
	$\displaystyle=\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\{\ell(x_{i},y_{i};W_{[M]-\{i,j\}})-\ell(x_{i},y_{i};W_{M})\}$
	$\displaystyle\approx\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})^{\mathrm{\scriptscriptstyle T}}(W_{[M]-\{i,j\}}-W_{M})$		(28)
	$\displaystyle=-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}(\lambda_{i}W_{i}+\lambda_{j}W_{j}).$		(29)

Next, we calculate how much agent $i$ should be attributed to the above gain that is achieved by $i,j$ jointly. To that end, we calculate the Shapley value of agent $i$ conditional on that agents in $[M]-\{i,j\}$ already participate, namely

	$\displaystyle\textsc{shap}(i\rightarrow\{i,j\}\mid[M]-\{i,j\})$		(30)
	$\displaystyle\begin{subarray}{c}\textrm{recall }(21)\end{subarray}=\sum_{S\subseteq\{j\}}\frac{|S|!(1-|S|)!}{2!}\biggl{(}v\biggl{(}S\cup\{i\}\cup([M]-\{i,j\})\biggr{)}-v\biggl{(}S\cup([M]-\{i,j\})\biggr{)}\biggr{)}$
	$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}=\frac{1}{2}\biggl{(}v([M])-v([M]-\{i\})+v([M]-\{j\})-v([M]-\{i,j\})\biggr{)}$
	$\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}=\frac{1}{2}\biggl{(}-\mathcal{L}(W_{M})+\mathcal{L}(W_{M,-i})-\mathcal{L}(W_{M,-j})+\mathcal{L}(W_{M,-ij})\biggr{)}$

	$\displaystyle=\frac{1}{2}\biggl{(}-\mathcal{L}(W_{M})+\mathcal{L}(W_{M,-i})+\mathcal{L}(W_{M})-\mathcal{L}(W_{M,-j})+\mathcal{L}(W_{M,-ij})-\mathcal{L}(W_{M})\biggr{)}$		(31)
	$\displaystyle\begin{subarray}{c}\textrm{use }(27)\textrm{--}(29)\textrm{ and alike }\end{subarray}\approx$		(32)
	$\displaystyle-\frac{1}{2}\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}$
	$\displaystyle\hskip 28.45274pt\cdot(\lambda_{i}W_{i}-\lambda_{j}W_{j}+\lambda_{i}W_{i}+\lambda_{j}W_{j})$
	$\displaystyle=-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\lambda_{i}W_{i},$		(33)

which, interestingly, does not depend on $j$ . As such, we use this to calculate the Shapley value of client $i$ , denoted by

\displaystyle\textsc{shap}(i\rightarrow[M])\overset{\Delta}{=}-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\lambda_{i}W_{i}.

(34)

From Equalities (29) and (33), we can verify that

	$\displaystyle v(\{i,j\}\mid[M]-\{i,j\})-v(\emptyset\mid[M]-\{i,j\})$		(35)
	$\displaystyle=\textsc{shap}(i\rightarrow[M])+\textsc{shap}(j\rightarrow[M])$

Remark 7 (Intuitions).

Intuitively, our derived Shapley value of client $i$ in (33) can be regarded as the model’s marginal reduction of the test loss by introducing client $i$ . To see that, consider the following approximation based on first-order Taylor expansion:

	$\displaystyle\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\ell(x_{i},y_{i};W_{M}-\Delta W)-\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\ell(x_{i},y_{i};W_{M})$
	$\displaystyle\approx-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\Delta W,$

which becomes the term in (33) when $\Delta W\overset{\Delta}{=}\lambda_{i}W_{i}$ . The above quantity approximates the amount of client $i$ ’s contribution to decreasing the test loss of the server’s aggregated model, the larger the better.
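The quality of the first-order approximation in Remark 7 can be checked numerically. The sketch below assumes a linear model with mean squared test loss (an illustrative choice, not the model used in the paper) and a small hypothetical perturbation `delta_W` standing in for $\lambda_{i}W_{i}$ ; for a quadratic loss the gap between the exact loss change and the first-order term is exactly the second-order term.

```python
import numpy as np


def mean_loss(W, X, y):
    """Mean squared loss 0.5*(x^T W - y)^2 over the test set."""
    return 0.5 * np.mean((X @ W - y) ** 2)


def mean_gradient(W, X, y):
    """Gradient of mean_loss with respect to W."""
    return X.T @ (X @ W - y) / len(y)


rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=500)

W_M = np.array([0.4, -0.9, 1.8, 0.1])               # hypothetical aggregated model
delta_W = 0.01 * np.array([1.0, -2.0, 0.5, 1.5])    # small stand-in for lambda_i * W_i

# Left-hand side of the approximation in Remark 7.
exact_change = mean_loss(W_M - delta_W, X, y) - mean_loss(W_M, X, y)
# Right-hand side: minus the mean gradient times the perturbation.
first_order = -mean_gradient(W_M, X, y) @ delta_W
gap = abs(exact_change - first_order)
# For a quadratic loss the gap equals the second-order Taylor term.
second_order = 0.5 * delta_W @ (X.T @ X / len(y)) @ delta_W
```

The gap shrinks quadratically with the size of the perturbation, which is why the approximation is most accurate when each client's weighted contribution is small.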

Remark 8 (Normalized counterpart).

Suppose we use the normalized version introduced in (24) when considering the baseline without clients $i,j$ . Thus,

	$\displaystyle W^{*}_{[M]-\{i,j\}}$	$\displaystyle=\frac{W_{M}-\lambda_{i}W_{i}-\lambda_{j}W_{j}}{\sum_{m\in[M]-\{i,j\}}\lambda_{m}}$
		$\displaystyle=W_{M}+\frac{(\lambda_{i}+\lambda_{j})W_{M}-\lambda_{i}W_{i}-\lambda_{j}W_{j}}{\sum_{m\in[M]-\{i,j\}}\lambda_{m}}$
		$\displaystyle=W_{M}-\frac{\lambda_{i}(W_{i}-W_{M})+\lambda_{j}(W_{j}-W_{M})}{1-(\lambda_{i}+\lambda_{j})}.$

Similarly, we have

\displaystyle W^{*}_{[M]-\{i\}}=W_{M}-\frac{\lambda_{i}(W_{i}-W_{M})}{1-\lambda_{i}}.
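The two closed forms above rely on the aggregation weights summing to one. As a sanity check under that assumption, the following sketch compares the directly renormalized leave-out model with the closed-form expression on random, hypothetical client parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 5, 3
client_params = rng.normal(size=(M, d))   # hypothetical local models W_m
weights = rng.random(M)
weights /= weights.sum()                  # the identity assumes sum(lambda_m) = 1
W_M = weights @ client_params


def normalized_without(excluded):
    """Renormalized aggregate over the clients not in `excluded` (Eq. 24)."""
    keep = [m for m in range(M) if m not in excluded]
    return weights[keep] @ client_params[keep] / weights[keep].sum()


def closed_form_without(excluded):
    """Closed form: W_M minus the weighted deviations over the remaining mass."""
    excluded = list(excluded)
    shift = sum(weights[m] * (client_params[m] - W_M) for m in excluded)
    return W_M - shift / (1.0 - weights[excluded].sum())


i, j = 0, 3
direct_ij = normalized_without({i, j})
closed_ij = closed_form_without({i, j})
direct_i = normalized_without({i})
closed_i = closed_form_without({i})
```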

Bringing the above formula into (31), we have

	$\displaystyle\textsc{shap}(i\rightarrow\{i,j\}\mid[M]-\{i,j\})$		(36)
	$\displaystyle=\frac{1}{2}\biggl{(}-\mathcal{L}(W_{M})+\mathcal{L}(W_{M,-i})+\mathcal{L}(W_{M})-\mathcal{L}(W_{M,-j})$
	$\displaystyle\hskip 28.45274pt+\mathcal{L}(W_{M,-ij})-\mathcal{L}(W_{M})\biggr{)}$		(37)
	$\displaystyle\approx-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\Delta W^{*}\textrm{ where }$
	$\displaystyle 2\Delta W^{*}\overset{\Delta}{=}\frac{\lambda_{i}(W_{i}-W_{M})}{1-\lambda_{i}}-\frac{\lambda_{j}(W_{j}-W_{M})}{1-\lambda_{j}}+$
	$\displaystyle\frac{\lambda_{i}(W_{i}-W_{M})+\lambda_{j}(W_{j}-W_{M})}{1-(\lambda_{i}+\lambda_{j})}\approx 2\lambda_{i}(W_{i}-W_{M})$

assuming small $\lambda_{i}$ and $\lambda_{j}$ . Therefore, under normalization we have

	$\displaystyle\textsc{shap}(i\rightarrow\{i,j\}\mid[M]-\{i,j\})\approx$
	$\displaystyle-\biggl{(}\frac{1}{n_{\textrm{test}}}\sum_{i=1}^{n_{\textrm{test}}}\nabla_{W}\ell(x_{i},y_{i};W_{M})\biggr{)}^{\mathrm{\scriptscriptstyle T}}\lambda_{i}(W_{i}-W_{M}).$

The intuition is the same as in Remark 7, except that the server model without client $i$ satisfies

	$\displaystyle\textrm{unnormalized version}:\quad W_{[M]-\{i\}}=W_{M}-\lambda_{i}W_{i}.$
	$\displaystyle\textrm{normalized version}:\quad W_{[M]-\{i\}}\approx W_{M}-\lambda_{i}(W_{i}-W_{M}).$

The Shapley Values are used to calculate the performance of individual clients towards the tier they participate in. Therefore, we use a small holdout dataset per tier that represents the data distribution of clients within that tier to calculate the Shapley Values.

Appendix I Proofs for Section 5

In this section, we provide detailed proof of results previously established in Section 5.

Proof of Proposition 1. To prove this proposition we divide the proof into two parts:

1. Convergence of Cluster: Given the objective function for a personalized and incentivized federated learning (IP-FL) system with $m$ clients, each observing a set of independent Gaussian observations $z_{i,j}\sim N(\mu_{i},\sigma^{2})$ for $j=1,\ldots,n_{i}$ , aiming to estimate its unknown mean $\mu_{i}\in\mathbb{R}$ , the objective function is defined as:

L(\mu)=\sum_{i=1}^{m}\frac{n_{i}}{n}L_{i}(\mu)=\sum_{i=1}^{m}\frac{n_{i}}{n}\sum_{j=1}^{n_{i}}(\mu-z_{i,j})^{2},

where $n=\sum_{i=1}^{m}n_{i}$ represents the total number of observations across all clients, and $\hat{\mu}_{FL}=\frac{\sum_{i=1}^{m}n_{i}\hat{\mu}_{i}}{n}$ is the federated estimate of the global mean and $\hat{\mu}_{i}=\frac{1}{n_{i}}\sum_{j=1}^{n_{i}}z_{i,j}$ . The algorithm iteratively assigns clients to clusters and updates cluster centroids to minimize the local loss function $L(\mu)$ .

The incentive mechanism encourages client participation. Clients are assigned to the cluster whose current centroid minimizes their local loss. Formally, client $i$ is assigned to cluster $k$ if:

k=\arg\min_{k}\sum_{j=1}^{n_{i}}(\beta_{k}^{(t)}-z_{i,j})^{2},

where $\beta_{k}^{(t)}$ is the centroid of cluster $k$ at iteration $t$ . The centroid of each cluster is updated to be the mean (because the mean is the statistic that minimizes the sum of squared deviations):

\beta_{k}^{(t+1)}=\frac{\sum_{i\in C_{k}^{(t)}}n_{i}\hat{\mu}_{i}}{\sum_{i\in C_{k}^{(t)}}n_{i}},\quad\text{where},\quad C_{k}^{(t)}=\{i:\text{client }i\text{ is assigned to cluster }k\text{ at iteration }t\}.

Since there is a finite number of clients, there is also a finite number of ways to partition these clients into $K$ clusters. Given this finite search space and the monotonic decrease of the objective function $L$ with each iteration, the algorithm must eventually reach a point where $C_{k}^{(t)}=C_{k}^{(t+1)}$ for all $k$ beyond a certain iteration $t$ . At this point, the clustering algorithm has converged.
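The assignment and centroid-update steps used in this argument can be sketched for the scalar Gaussian setting as follows. The sketch uses the fact that $\sum_{j}(\beta_{k}-z_{i,j})^{2}=n_{i}(\beta_{k}-\hat{\mu}_{i})^{2}+\sum_{j}(z_{i,j}-\hat{\mu}_{i})^{2}$ , so minimizing the local loss over $k$ amounts to picking the centroid nearest to $\hat{\mu}_{i}$ . The function name, the toy client statistics, and the initial centroids are hypothetical.

```python
import numpy as np


def cluster_clients(client_means, client_sizes, init_centroids, max_iter=100):
    """Alternate assignment and weighted-mean updates until assignments repeat."""
    mu_hat = np.asarray(client_means, dtype=float)
    n = np.asarray(client_sizes, dtype=float)
    beta = np.asarray(init_centroids, dtype=float).copy()
    assignment = None
    for iteration in range(max_iter):
        # Assignment step: nearest centroid minimizes the client's local loss.
        new_assignment = np.argmin((mu_hat[:, None] - beta[None, :]) ** 2, axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            return beta, assignment, iteration   # C_k^(t) == C_k^(t+1) for all k
        assignment = new_assignment
        # Update step: sample-size-weighted mean of the assigned clients.
        for k in range(len(beta)):
            members = assignment == k
            if members.any():
                beta[k] = np.sum(n[members] * mu_hat[members]) / np.sum(n[members])
    return beta, assignment, max_iter


# Hypothetical clients drawn around two true means (0 and 5).
client_means = [0.1, -0.2, 0.05, 4.9, 5.2, 5.0]
client_sizes = [10, 20, 30, 10, 20, 30]
centroids, assignment, iterations = cluster_clients(
    client_means, client_sizes, init_centroids=[0.0, 1.0])
```

For this toy input the assignments stabilize after one update, illustrating the finite-termination argument on a small instance.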

2. Within each cluster $C_{k}$ , the data distributions of the clients are statistically similar to each other up to a threshold $\delta$ , and the within-cluster bias is reduced: Let $\mathcal{P}$ be the statistical similarity index property that we are interested in. Then we need to show that

|\mathcal{P}(D_{i})-\mathcal{P}(D_{j})|<\delta\quad\text{for any clients}\quad i,j\in C_{k}.

(38)

The objective of the clustering algorithm is to minimize the within-cluster sum of squares (WCSS),

\mathrm{minimize}_{C_{1},\ldots,C_{K}}\sum_{k=1}^{K}\sum_{i\in C_{k}}\|x_{i}-\mu_{k}\|^{2}.

In the context of IP-FL, we can regard the difference $\|x_{i}-\mu_{k}\|^{2}$ as being measured by the importance weights $\upsilon_{ik}$ from Equation 7. Here $\upsilon_{ik}$ quantifies the alignment between client $i$ ’s data and the model of cluster $k$ , with a value of 1 indicating perfect prediction accuracy for the client’s data by the cluster model. This is because $\upsilon_{ik}$ represents the similarity between the client’s local data distribution and the cluster model representing the centroid. Thus, the relationship between $\upsilon_{ik}$ and the difference $\|x_{i}-\mu_{k}\|^{2}$ can be expressed as follows:

\lim_{\upsilon_{ik}\to 1}\|x_{i}-\mu_{k}\|^{2}=0.

(39)

Thus, for two clients $i$ and $j$ within the same cluster, if $\upsilon_{ik}\to 1$ and $\upsilon_{jk}\to 1$ , then $\|x_{i}-\mu_{k}\|^{2}\to 0$ and $\|x_{j}-\mu_{k}\|^{2}\to 0$ , where $\mu_{k}$ represents the centroid of the cluster, i.e., the cluster-level model. Since the data distributions of both clients $i$ and $j$ are close to the centroid, they are also statistically similar to each other, because $\|x_{i}-x_{j}\|\leq\|x_{i}-\mu_{k}\|+\|x_{j}-\mu_{k}\|$ by the triangle inequality.

To show that clustering reduces the within-cluster bias, we show that $\mathrm{Bias}(C_{k})<\mathrm{Bias}(System)$ in the context of the IP-FL clusters.

Let $\mathrm{Bias}(C_{k})$ be defined as the deviation of the average data distribution in the cluster from an ideal/true distribution.

The clustering algorithm partitions the dataset into $K$ clusters by minimizing the within-cluster sum of squares, effectively reducing the intra-cluster variance. Due to the reduced intra-cluster variance, each cluster $C_{k}$ exhibits a higher degree of homogeneity, i.e., $\|x_{i}-\mu_{k}\|^{2}\to 0$ , compared to the dataset as a whole. Consequently, the average characteristics of the data within $C_{k}$ more accurately represent the data points within $C_{k}$ . Given this homogeneity, the data characteristics within each cluster are represented more accurately, which reduces the bias, i.e., $\mathrm{Bias}(C_{k})<\mathrm{Bias}(System)$ .

Proof of Theorem 3. We begin by establishing that the loss function $L(\theta)$ is convex. For any two parameter vectors $\theta,\theta^{\prime}$ , the following condition holds due to convexity:

L(\lambda\theta+(1-\lambda)\theta^{\prime})\leq\lambda L(\theta)+(1-\lambda)L(\theta^{\prime}),\quad\forall\lambda\in[0,1].

(40)

The parameter update rule in IP-FL can be expressed as:

\theta^{(t+1)}=\theta^{(t)}-\eta_{t}\nabla L(\theta^{(t)}),

(41)

where $\eta_{t}$ is the learning rate at iteration $t$ .

To establish convergence, we show that the sequence $\{\theta^{(t)}\}$ approaches a fixed point $\theta^{*}$ as $t$ grows large. Since $L(\theta)$ is convex, we have:

L(\theta^{(t)})-L(\theta^{*})\geq\nabla L(\theta^{*})^{\top}(\theta^{(t)}-\theta^{*}).

(42)

Employing the descent lemma for a $\beta$ -smooth $L(\theta)$ , we obtain:

L(\theta^{(t+1)})\leq L(\theta^{(t)})+\nabla L(\theta^{(t)})^{\top}(\theta^{(t+1)}-\theta^{(t)})+\frac{\beta}{2}\|\theta^{(t+1)}-\theta^{(t)}\|^{2}.

(43)

By substituting the update rule into the descent lemma, we can simplify the inequality as follows:

	$\displaystyle L(\theta^{(t+1)})$	$\displaystyle\leq L(\theta^{(t)})-\eta_{t}\|\nabla L(\theta^{(t)})\|^{2}+\frac{\beta}{2}\eta_{t}^{2}\|\nabla L(\theta^{(t)})\|^{2}$		(44)
		$\displaystyle=L(\theta^{(t)})-\eta_{t}(1-\frac{\beta\eta_{t}}{2})\|\nabla L(\theta^{(t)})\|^{2}.$		(45)

Provided that the learning rate $\eta_{t}$ satisfies $0<\eta_{t}<\frac{2}{\beta}$ , the quantity $(1-\frac{\beta\eta_{t}}{2})$ is positive, which guarantees a decrease in the loss function at each iteration. The learning rate $\eta_{t}$ is typically chosen to diminish over time, but slowly enough that its sum diverges, i.e., $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$ .
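The per-iteration guarantee in (45) can be verified numerically on a simple example. The sketch below assumes a hypothetical convex quadratic loss $L(\theta)=\frac{1}{2}\theta^{\mathrm{T}}A\theta$ , whose smoothness constant $\beta$ is the largest eigenvalue of $A$ , and a step size $\eta_{t}=0.4/t^{0.75}$ that stays below $2/\beta$ and satisfies both summability conditions.

```python
import numpy as np

# Hypothetical beta-smooth convex loss: L(theta) = 0.5 * theta^T A theta.
A = np.diag([1.0, 4.0])
beta_smooth = 4.0                      # largest eigenvalue of A


def loss(theta):
    return 0.5 * theta @ A @ theta


def grad(theta):
    return A @ theta


theta = np.array([3.0, -2.0])
losses = [loss(theta)]
bound_holds = True
for t in range(1, 201):
    eta_t = 0.4 / t ** 0.75            # in (0, 2/beta); sum diverges, sum of squares converges
    g = grad(theta)
    # Right-hand side of (45): guaranteed upper bound on the next loss.
    guaranteed = losses[-1] - eta_t * (1 - beta_smooth * eta_t / 2) * (g @ g)
    theta = theta - eta_t * g          # update rule (41)
    losses.append(loss(theta))
    bound_holds = bound_holds and bool(losses[-1] <= guaranteed + 1e-12)

monotone = all(b <= a for a, b in zip(losses, losses[1:]))
```

On this example the bound holds at every step and the loss decreases monotonically toward its minimum value of zero.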

As $\eta_{t}$ diminishes and the loss function’s gradient decreases, the sequence $\{\theta^{(t)}\}$ approaches a point $\theta^{*}$ where $\nabla L(\theta^{*})=0$ , suggesting convergence to a critical point. Assuming $L(\theta)$ has a unique global minimum, this critical point $\theta^{*}$ is the global minimum.

Therefore, as $t\rightarrow\infty$ , we have $\|\theta^{(t+1)}-\theta^{(t)}\|\rightarrow 0$ and $\nabla L(\theta^{(t)})\rightarrow 0$ , indicating convergence of the IP-FL algorithm to a set of stable parameters $\theta^{*}$ for the global model. $\blacksquare$

Proof of Theorem 4.

Let $C=\{C_{1},C_{2},\ldots,C_{k}\}$ represent the clusters. Each client $i$ updates its model parameters $\theta_{i}$ based on local data $D_{i}$ and the aggregated model of its cluster. The update at each iteration $t$ can be expressed as:

\theta_{i}^{(t+1)}=\theta_{i}^{(t)}-\eta_{i}\nabla L_{i}(\theta_{i}^{(t)},D_{i})

\theta_{i}^{(t+1)}-\theta_{i}^{(t)}=-\eta_{i}\nabla L_{i}(\theta_{i}^{(t)},D_{i})

Here, $\eta_{i}$ is the learning rate for client $i$ , and $\nabla L_{i}$ is the gradient of the loss function with respect to the model parameters $\theta_{i}$ . The magnitude of $\nabla L_{i}(\theta_{i}^{(t)},D_{i})$ gets smaller as $\theta_{i}$ approaches the minimum, reducing the difference between consecutive iterates $\theta_{i}^{(t)}$ and $\theta_{i}^{(t+1)}$ . Thus, for every $\epsilon_{1}>0$ , there exists an $N$ such that $|\theta_{i}^{m}-\theta_{i}^{n}|<\epsilon_{1}$ for all $m,n>N$ . Therefore, the sequence $\{\theta_{i}^{(t)}\}$ is Cauchy and hence converges to a global optimum $\theta_{i}^{*}$ .

The aggregation step in (IP-FL) combines the individual client models into a cluster model, and then the cluster model is personalized for each client. The personalization step can be written as a weighted average:

\theta_{i}^{(t+1)}=\sum_{k=1}^{K}w_{ik}\theta_{k}^{(t)}

where $w_{ik}$ is the weight assigned to the cluster model $\theta_{k}^{(t)}$ by client $i$ . As the training progresses, the weights $w_{ik}$ are adjusted based on the performance of the cluster models on each client’s data. Under the assumption of proper weight adjustment and model updating, $\theta_{i}^{(t)}$ will converge to an optimal set of parameters $\theta_{i}^{*}$ that maximizes $\text{PMA}_{i}$ .
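The personalization step can be illustrated with a short sketch. It assumes that the weights $w_{ik}$ of a client are non-negative and sum to one (consistent with a weighted average, but not stated explicitly here) and treats them as given inputs; how IP-FL computes them is described in the main paper. The function name `personalize` and the toy values are hypothetical.

```python
import numpy as np


def personalize(cluster_models, client_weights):
    """Weighted average of cluster-level models for one client."""
    cluster_models = np.asarray(cluster_models, dtype=float)   # shape (K, d)
    w = np.asarray(client_weights, dtype=float)                 # shape (K,)
    if w.shape != (cluster_models.shape[0],):
        raise ValueError("need exactly one weight per cluster model")
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights are assumed non-negative and summing to 1")
    return w @ cluster_models


# Hypothetical example: two cluster models, client mostly aligned with cluster 0.
cluster_models = [[1.0, 0.0], [0.0, 1.0]]
theta_client = personalize(cluster_models, [0.9, 0.1])
```

A client whose data matches a single cluster would place almost all of its weight on that cluster's model, so its personalized model stays close to that cluster-level model.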

Thus for given $\epsilon>0$ , there exists a finite $T$ such that:

|\mathrm{PMA}_{i}^{(t)}-\mathrm{PMA}_{i}^{*}|<\epsilon\quad\forall t\geq T.

(46)

The above inequality (46) follows from the convergence of $\theta^{t}_{i}$ and the continuity of the performance metric $\mathrm{PMA}_{i}$ with respect to the model parameters.

Corollary 8.1.

For each client $i$ , the $\mathrm{PMA_{i}}$ is positive and approaches the optimal $\mathrm{PMA}_{i}^{*}$ . Additionally, if $\mathrm{PMA_{i}}>0$ , then the likelihood of client $i$ opting out of pFL training is minimized.

Proof.

Let $f_{i}(w_{k})$ denote the performance measure (such as accuracy) of the personalized model for client $i$ with model parameter $w_{k}$ . Let $\rho_{i}$ represent the baseline performance measure (of a global model). The $\mathrm{PMA_{i}}$ of a client is defined as the difference between the two performances, $\mathrm{PMA_{i}}=f_{i}(w_{k})-\rho_{i}$ .

We assume the pFL system employs algorithms and strategies that enhance the performance of each client by utilizing personalized models. Through iterative training, the model parameters $w_{k}$ are adjusted to improve $f_{i}(w_{k})$ such that for most clients $f_{i}(w_{k})\geq\rho_{i}$ . This leads to $\mathrm{PMA_{i}}>0$ as $f_{i}(w_{k})-\rho_{i}>0$ . Let $\mathrm{PMA}_{i}^{*}$ denote optimal $\mathrm{PMA_{i}}$ for client $i$ over the course of training. As $w_{k}$ is optimized, $f_{i}(w_{k})$ reaches the maximum achievable performance for client $i$ i.e.

\lim_{r\to R}\mathrm{PMA_{i}}=\mathrm{PMA}_{i}^{*},

where $r$ indexes the training round and $R$ is the total number of training rounds.

Define the opt-out likelihood for client $i$ as a function $g(\mathrm{PMA_{i}})$ , where higher $\mathrm{PMA_{i}}$ values correlate with a lower likelihood of opt-out.

It is reasonable to posit that $g$ is a monotonically decreasing function of $\mathrm{PMA_{i}}$ .

Therefore, the opt-out likelihood $g(\mathrm{PMA_{i}})$ is minimized when $\mathrm{PMA_{i}}>0$ .

∎

Proof of Theorem 6. We assume each client $i$ performs local updates using gradient descent:

M_{t+1}^{i}=M_{t}^{i}-\eta_{t}\nabla L_{i}(M_{t}^{i})

where, $L_{i}$ is convex and $\beta$ -smooth, for the local model $M_{t}^{i}$ , thus by definition for any vector $y$ , we have:

L_{i}(y)\leq L_{i}(M_{t}^{i})+\langle\nabla L_{i}(M_{t}^{i}),y-M_{t}^{i}\rangle+\frac{\beta}{2}\|y-M_{t}^{i}\|^{2}.

The above inequality implies that the local update decreases the loss. The global model at iteration $t+1$ is a weighted sum of the local models,

M_{t+1}=\sum_{i=1}^{N}w_{i}M_{t+1}^{i},

where $w_{i}$ is the weight corresponding to client $i$ ’s contribution to the global model. By the convexity of $L_{i}$ and Jensen’s inequality, the global loss function $L$ also decreases:

L(M_{t+1})\leq\sum_{i=1}^{N}w_{i}L_{i}(M_{t+1}^{i}).

Using the $\beta$ -smoothness of $L_{i}$ and the update rule for $M_{t}^{i}$ , we can bound the difference in the loss for each client

L_{i}(M_{t+1}^{i})\leq L_{i}(M_{t}^{i})-\eta_{t}\|\nabla L_{i}(M_{t}^{i})\|^{2}+\frac{\beta}{2}\eta_{t}^{2}\|\nabla L_{i}(M_{t}^{i})\|^{2}.

(47)

Now, to show convergence of the global model, we sum the inequalities (47) over all clients and multiply by their respective weights

L(M_{t+1})\leq L(M_{t})-\eta_{t}\sum_{i=1}^{N}w_{i}\|\nabla L_{i}(M_{t}^{i})\|^{2}+\frac{\beta}{2}\eta_{t}^{2}\sum_{i=1}^{N}w_{i}\|\nabla L_{i}(M_{t}^{i})\|^{2}

(48)

By the Robbins-Monro condition [43], and since $\sum_{t=1}^{\infty}\eta_{t}=\infty$ and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$ , the second term in (48) dominates the third term, ensuring that the global loss function $L$ decreases over time.

The convergence of the sequence $\{L(M_{t})\}$ to the minimum loss value $L(M^{*})$ is facilitated by the diminishing learning rates and the weighted contributions of the clients. Formally, this is observed in the non-increasing sequence of expected losses,

\mathbb{E}[L(M_{t+1})]\leq\mathbb{E}[L(M_{t})]-\eta_{t}\left(\sum_{i=1}^{N}w_{i}\|\nabla L_{i}(M_{t}^{i})\|^{2}-\frac{\beta}{2}\eta_{t}\sum_{i=1}^{N}w_{i}\|\nabla L_{i}(M_{t}^{i})\|^{2}\right)

Under the Robbins-Monro conditions for the learning rates, the series of errors $\mathbb{E}[L(M_{t})-L(M^{*})]$ converges to zero. Thus, we have:

\lim_{t\to\infty}\mathbb{E}[L(M_{t})-L(M^{*})]=0

By applying the Supermartingale Convergence Theorem [14], we conclude that $\lim_{t\to\infty}M_{t}=M^{*},~{}~{}\text{almost surely.}$
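The argument above can be illustrated with a small deterministic simulation. The sketch assumes hypothetical client losses $L_{i}(M)=\frac{1}{2}\|M-c_{i}\|^{2}$ (convex and $1$-smooth), one local gradient step per round starting from the current global model, weights $w_{i}$ that sum to one, and the Robbins-Monro step size $\eta_{t}=1/(t+1)$ . Under these assumptions the minimizer of $L=\sum_{i}w_{i}L_{i}$ is $M^{*}=\sum_{i}w_{i}c_{i}$ .

```python
import numpy as np


def federated_gd(client_optima, client_weights, init_model, rounds):
    """Weighted aggregation of one-step local gradient updates per round."""
    c = np.asarray(client_optima, dtype=float)        # shape (N, d), minimizers c_i
    w = np.asarray(client_weights, dtype=float)       # shape (N,), assumed to sum to 1
    global_model = np.asarray(init_model, dtype=float)
    history = []
    for t in range(1, rounds + 1):
        eta_t = 1.0 / (t + 1)                         # sum diverges, sum of squares converges
        # Local step on L_i(M) = 0.5 * ||M - c_i||^2, whose gradient is M - c_i.
        local_models = global_model - eta_t * (global_model - c)
        global_model = w @ local_models               # weighted aggregation
        history.append(global_model.copy())
    return global_model, history


client_optima = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
client_weights = [0.5, 0.25, 0.25]
target = np.array([0.5, 1.0])                         # sum_i w_i * c_i
final_model, history = federated_gd(client_optima, client_weights,
                                    init_model=[10.0, -10.0], rounds=2000)
errors = [float(np.linalg.norm(h - target)) for h in history]
```

In this setting the distance to $M^{*}$ shrinks by a factor $(1-\eta_{t})$ in each round, so it decays like $1/(T+1)$ after $T$ rounds.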

We have made the code available as part of the supplementary material.