Model-sharing Games:
Analyzing Federated Learning Under Voluntary Participation
Abstract
Federated learning is a setting where agents, each with access to their own data source, combine models learned from local data to create a global model. If agents are drawing their data from different distributions, though, federated learning might produce a biased global model that is not optimal for each agent. This means that agents face a fundamental question: should they join the global model or stay with their local model? In this work, we show how this situation can be naturally analyzed through the framework of coalitional game theory.
Motivated by these considerations, we propose the following game: there are multiple heterogeneous players, each with different model parameters governing their data distribution and different amounts of data noisily drawn from their own distribution. Each player's goal is to obtain a model with minimal expected mean squared error (MSE) on their own distribution. They have a choice of fitting a model based solely on their own data, or combining their learned parameters with those of some subset of the other players. Combining models reduces the variance component of their error through access to more data, but increases the bias because of the heterogeneity of distributions.
In this work, we derive exact expected MSE values for problems in linear regression and mean estimation. We use these values to analyze the resulting game in the framework of hedonic game theory; we study how players might divide into coalitions, where each set of players within a coalition jointly construct model(s). We analyze three methods of federation, modeling differing degrees of customization. In uniform federation, the agents collectively produce a single model. In coarse-grained federation, each agent can weight the global model together with their local model. In fine-grained federation, each agent can flexibly combine models from each other agent in the federation. For each method, we constructively analyze the stable partitions of players into coalitions.
1 Introduction
Imagine a situation as follows: a hospital is trying to evaluate the effectiveness of a certain procedure based on data it has collected from procedures done on patients in its facilities. It seems likely that certain attributes of the patient influence the effectiveness of the procedure, so the hospital analysts opt to fit a linear regression model with parameters $\hat\theta$. However, because of the limited amount of data the hospital has access to, this model has relatively high error. Luckily, other hospitals also have data from implementations of this same procedure. However, for reasons of privacy, data incompatibility, data size, or other operational considerations, the hospitals don't wish to share raw patient data. Instead, they opt to combine their models by taking a weighted average of the parameters learned by each hospital. If there are $M$ hospitals and hospital $j$ has $n_j$ samples and locally learned parameters $\hat\theta_j$, the combined model parameters would look like:

$$\hat\theta^f = \frac{\sum_{j=1}^{M} n_j \hat\theta_j}{\sum_{j=1}^{M} n_j}$$
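As a concrete illustration, the weighted combination above can be sketched in a few lines of Python. This is an illustrative sketch of the averaging step only, not an implementation of any particular federated learning system; the function name and data layout are our own.

```python
def combine_models(params, sample_counts):
    """Sample-size-weighted average of locally fit parameter vectors.

    params: one parameter vector (list of floats) per hospital.
    sample_counts: the number of samples each hospital used.
    """
    total = sum(sample_counts)
    dim = len(params[0])
    # Coordinate t of the combined model is sum_j (n_j / total) * theta_j[t].
    return [sum(n * p[t] for p, n in zip(params, sample_counts)) / total
            for t in range(dim)]

# Two hospitals with 100 and 300 samples: the larger one gets weight 3/4.
theta = combine_models([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(theta)  # [2.5, 5.0]
```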
The situation described above could be viewed as a stylized model of federated learning. Federated learning is a distributed learning process that is currently experiencing rapid innovation and widespread implementation (Li et al. 2020; Kairouz et al. 2019). It is used in cases where data is distributed across multiple agents and cannot be combined centrally for training. For example, federated learning is implemented in word prediction on cell phones, where transferring the raw text data would be infeasible given its large size (and sensitive content). The motivating factor for using federated learning is that access to more data will reduce the variance in a learned model, reducing its error.
However, there could be a downside to using federated learning. In the hospital example, it seems quite reasonable that certain hospitals might have different true generating models for their data, based on differences in patient populations or variants of the procedure implementation, for example. Two dissimilar hospitals that federate together will see a decrease in their model's error due to model variance, but an increase in their error due to model bias. This raises some fundamental questions for each participating hospital, or, more generally, for each agent $i$ considering federated learning. Which other agents should $i$ federate with in order to minimize its error? Will those other agents be interested in federating with $i$? Does there exist some stable arrangement of agents into federating clusters, and if so, what does that arrangement look like?
Numerous works have explored the issue of heterogeneous data in federated learning; we discuss specifically how they relate to ours in a later section. Often the goal in these lines of work is to achieve equality in the error rates guaranteed to each agent, potentially by actively collecting more data or by using transfer learning to ensure the model better fits local data. However, to our knowledge, there has not yet been work that systematically looks at the participation questions inherent in federated learning through the lens of game theory, especially the theory of hedonic games, which studies the formation of self-sustaining coalitions.
In a hedonic game, players are grouped together into clusters or coalitions: the overall collection of coalitions is called a coalition structure. Each player's utility depends solely on the identity of the other players in its coalition. A common question in hedonic games is the stability of a coalition structure. A coalition structure $\Pi$ is core-stable (or "in the core") if there does not exist a coalition $C$ so that every player in $C$ strictly prefers $C$ to its coalition in $\Pi$. A coalition structure $\Pi$ is strictly core stable if there does not exist a coalition $C$ so that every player in $C$ weakly prefers $C$ to its coalition in $\Pi$, and at least one player strictly prefers $C$ to its coalition in $\Pi$. A coalition structure $\Pi$ is individually stable if there does not exist a coalition $C \in \Pi$ so that some player $i$ prefers $C \cup \{i\}$ to its arrangement in $\Pi$ and all players in $C$ weakly prefer $C \cup \{i\}$ to $C$ (Bogomolnaia and Jackson 2002).
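These definitions can be made concrete with a small brute-force checker. The sketch below is ours (not from the paper's repository) and assumes error values are simply given as a lookup table; it tests core stability by searching for a blocking coalition.

```python
from itertools import combinations

def is_core_stable(partition, err):
    """True iff no coalition C exists in which every player strictly
    prefers C to its own coalition in `partition`.

    partition: list of frozensets of player names.
    err: dict mapping (player, coalition) -> that player's error.
    """
    players = sorted({p for coal in partition for p in coal})
    current = {p: coal for coal in partition for p in coal}
    for r in range(1, len(players) + 1):
        for C in combinations(players, r):
            C = frozenset(C)
            if all(err[p, C] < err[p, current[p]] for p in C):
                return False  # C is a blocking coalition
    return True

# Hypothetical two-player errors where federating helps both players:
err = {("A", frozenset("A")): 2.0, ("B", frozenset("B")): 2.0,
       ("A", frozenset("AB")): 1.5, ("B", frozenset("AB")): 1.5}
print(is_core_stable([frozenset("AB")], err))                 # True
print(is_core_stable([frozenset("A"), frozenset("B")], err))  # False
```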
To explain the analogy of federated learning to hedonic games, we first consider that each agent in federated learning is a player in a hedonic game. A player is in coalition with other players if it is federating with them. Its cost is its expected error in a given federating cluster, which depends only on the identity of other players in its federating cluster. Players are assumed to be able to move between federating clusters only if doing so would benefit itself and not harm other players in the cluster it is moving to: notably, we allow players to freely leave a cluster, even if doing so would harm the players in the cluster it leaves behind.
The present work: Analyzing federated learning through hedonic game theory
In this work, we use the framework of hedonic games to analyze the stability of coalitions in data-sharing applications that capture key issues in federated learning. By working through a sequence of deliberately stylized models, we obtain some general insights about participation and stability in these kinds of applications.
For the first case, we analyze uniform federation. In this simplest case, a federating cluster produces a single model, which each player uses. For uniform federation, first we consider the case where all players have the same number of data points $n$. We show that in this game, when $n$ is fairly small, the only core-stable coalition structure is to have all players federating together, in the "grand coalition". When $n$ is large, the only core-stable coalition structure is to have all players separate (doing local learning). There is a knife-edge case of intermediate $n$ where all coalition structures are core-stable. Next, we analyze the case where all players have either one of two sizes ("small" or "large"). The analysis is more complicated, but we demonstrate constructively that there always exists an individually stable partition of players into clusters.
Besides uniform federation, we also analyze two other forms of federation. For coarse-grained federation, the federating cluster still produces a single model, but each player can weight the global model with their local model, allowing some personalization. For coarse-grained federation, when all players have the same number of samples, we show that the grand coalition (all players federating together) is always the only core stable arrangement. For the small/large case, we produce a simple, efficient algorithm to calculate a strictly core-stable arrangement. Additionally, we show that, for this federation method, the grand coalition is always individually stable (no player wishes to unilaterally defect).
Finally, for fine-grained federation, each player is allowed to take the local models of other players in the federation and combine them using whichever weights they choose to produce a model customized for their use. With fine-grained federation, we show that the grand coalition is always core stable.
We are only able to produce these hedonic game theory results because of our derivations of exact error values for the underlying inference problems. We calculate these values for all three methods of federation, and for agents federating in two situations: 1) a mean estimation problem and 2) a linear regression problem. The error values depend on the number of samples each agent has access to, with the expectation taken over the values of the samples each agent draws as well as the possibly different true parameters of the data each player is trying to model. Our results are otherwise completely independent of the generating distributions used, relying only on two scalar parameters.
The results in this paper are theoretical and do not depend on any experiments or code. However, while writing the paper, we found it useful to write and work with code to check conjectures. Some of that code is publicly available at https://github.com/kpdonahue/model_sharing_games.
Before moving to the main technical content, the next section walks through a motivating example, followed by a review of related literature and a description of our model and assumptions. Beyond technical assumptions, recent work (Cooper 2020) has highlighted the importance of describing the normative assumptions researchers make: we also include a summary of the most important normative assumptions of our analysis.
Related works
Incentives and federated learning:
Blum et al. (2017) describes an approach to handling heterogeneous data where more samples are iteratively gathered from each agent in a way so that all agents are incentivized to participate in the grand coalition during federated learning. Duan et al. (2021) builds a framework to schedule data augmentation and resampling. Yu, Bagdasaryan, and Shmatikov (2020) demonstrates empirically that there can be cases where individuals get lower error with local training than federated and evaluates empirical solutions. Wang et al. (2020) analyzes the question of when it makes sense to split or not to split datasets drawn from different distributions. Finally, Blum et al. (2020) analyzes notions of envy and efficiency with respect to sampling allocations in federated learning.
Transfer learning:
Mansour et al. (2020) and Deng, Kamani, and Mahdavi (2020) both propose theoretical methods for using transfer learning to minimize the error provided to agents with heterogeneous data. Li et al. (2019) and Martinez, Bertran, and Sapiro (2020) both provide methods to produce a more uniform level of error rates across agents participating in federated learning.
Clustering and federated learning:
Sattler, Muller, and Samek (2020) and Shlezinger, Rini, and Eldar (2020) provide algorithms to "cluster" together players with similar data distributions, with the aim of providing them with lower error. They differ from our approach in that they consider the case where there is some knowledge of each player's data distribution, whereas we only assume knowledge of the number of data points. Additionally, their approach doesn't explicitly treat agents as game-theoretic actors in the way that ours does. Interestingly, Guazzone, Anglano, and Sereno (2014) uses a game-theoretic framework to analyze federated learning, but with the aim of minimizing energy usage, not error rate.
2 Motivating example
Table 1: Expected MSE per player under uniform federation, each player with 5 samples ($\mu_e = 10$, $\sigma^2 = 1$).

| Coalition structure | $err_A$ | $err_B$ | $err_C$ |
|---|---|---|---|
| $\{A\}, \{B\}, \{C\}$ | 2 | 2 | 2 |
| $\{A, B\}, \{C\}$ | 1.5 | 1.5 | 2 |
| $\{A, B, C\}$ | 1.3 | 1.3 | 1.3 |
Table 2: Expected MSE per player under uniform federation, with players $A$ and $B$ holding 5 samples and player $C$ holding 25 samples.

| Coalition structure | $err_A$ | $err_B$ | $err_C$ |
|---|---|---|---|
| $\{A\}, \{B\}, \{C\}$ | 2 | 2 | 0.4 |
| $\{A, B\}, \{C\}$ | 1.5 | 1.5 | 0.4 |
| $\{A\}, \{B, C\}$ | 2 | 1.72 | 0.39 |
| $\{A, B, C\}$ | 1.55 | 1.55 | 0.41 |
Table 3: Expected MSE per player under uniform federation, each player with 25 samples.

| Coalition structure | $err_A$ | $err_B$ | $err_C$ |
|---|---|---|---|
| $\{A\}, \{B\}, \{C\}$ | 0.4 | 0.4 | 0.4 |
| $\{A, B\}, \{C\}$ | 0.7 | 0.7 | 0.4 |
| $\{A, B, C\}$ | 0.8 | 0.8 | 0.8 |
To motivate our problem and clarify the types of analyses we will be exploring, we will first work through a simple mean estimation example. (The Github repository contains numerical calculations and full tables for this section.) Calculating the error each player can expect requires two parameters: $\mu_e$, which reflects the average noise each player experiences when sampling data from its own personal distribution, and $\sigma^2$, which reflects the variance in the true parameters between players. In this section we will use $\mu_e = 10$ and $\sigma^2 = 1$, but will discuss later how to handle the case where these parameters may be imperfectly known.
First, we will analyze uniform federation with three players, $A$, $B$, and $C$. We will first assume that each player has 5 samples from their local data distribution: Table 1 gives the error each player can expect in this situation. Arrangements equivalent up to renaming of players are omitted. Every player sees its error minimized in the "grand coalition" where all three players are federating together. This implies that the only arrangement that is stable (core-stable or individually stable) is $\{A, B, C\}$.
Next, we assume that player $C$ increases the amount of samples it has from 5 to 25: Table 2 shows the error each player can expect in this situation. Here, the players have different preferences over which arrangement they would most prefer. The "small" players $A$ and $B$ would most prefer $\{A, B\}, \{C\}$, whereas the "large" player $C$ would most prefer $\{A, C\}, \{B\}$ or (identically) $\{B, C\}, \{A\}$. However, out of all of these coalition structures, only $\{A, B\}, \{C\}$ is stable (either core stable or individually stable). Note that the grand coalition $\{A, B, C\}$ is not stable because the coalition $\{A, B\}$ is one where each player prefers it to its current situation.
Thirdly, we will assume that all three players have 25 samples: this example is shown in Table 3. As in Table 1, the players have identical preferences. However, in this case, the players minimize their error by being alone. Overall, stability results from this example are part of a broader pattern we will analyze in later sections.
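The numbers in Tables 1 through 3 can be reproduced from the closed-form mean-estimation error for uniform federation derived later (Corollary 4.3). The sketch below hard-codes this section's parameters, $\mu_e = 10$ and $\sigma^2 = 1$.

```python
def uniform_mse(n_j, others, mu_e=10.0, sigma2=1.0):
    """Expected MSE of a player with n_j samples under uniform federation,
    federating with players holding `others` samples (mean estimation)."""
    N = n_j + sum(others)
    bias = sigma2 * (sum(n * n for n in others) + sum(others) ** 2) / N ** 2
    return mu_e / N + bias

# Table 1: three players with 5 samples each.
print(round(uniform_mse(5, []), 2))       # 2.0  (alone)
print(round(uniform_mse(5, [5]), 2))      # 1.5  (pair)
print(round(uniform_mse(5, [5, 5]), 2))   # 1.33 (grand coalition)
# Table 2: player C grows to 25 samples.
print(round(uniform_mse(5, [5, 25]), 2))  # 1.55 (A in the grand coalition)
print(round(uniform_mse(25, [5, 5]), 2))  # 0.41 (C in the grand coalition)
```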
Next, we will explore the two other methods of federation: coarse-grained and fine-grained. Both offer some degree of personalization, with varying levels of flexibility.
Table 4 shows an example using coarse-grained federation with four players: three have 30 samples each, and the fourth player has 300 samples. We assume the weights $w_j$ are set optimally for each player. For conciseness, some columns and rows are omitted. Note that both player types get lower error in the grand coalition $\pi_g$ than they would with local learning: that is, $\pi_g$ is individually stable (stable against defections of any player alone). However, it is also clear that $\pi_g$ is not core stable: in particular, the three small players would get lower error federating by themselves than in $\pi_g$. These results will be examined theoretically in later sections: with optimal weighting, coarse-grained federation always makes the grand coalition individually stable, but not necessarily core stable.
Table 4: Expected MSE under optimal coarse-grained federation, with three "small" players (30 samples each) and one "large" player (300 samples).

| Coalition structure | err (small) | err (large) |
|---|---|---|
| all players local | 0.333 | 0.0333 |
| small players federate, large player local | 0.278 | 0.0333 |
| grand coalition $\pi_g$ | 0.280 | 0.0326 |
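The coarse-grained numbers in Table 4 can be reproduced by scanning the weight $w$ on a grid rather than solving for it analytically. The error expression below instantiates the coarse-grained mean-estimation error (Corollary 4.4) with $\mu_e = 10$ and $\sigma^2 = 1$; the grid search is our own simplification.

```python
def coarse_mse(w, n_j, others, mu_e=10.0, sigma2=1.0):
    """MSE when player j mixes its local model (weight w) with the
    uniform federated model (weight 1 - w), mean estimation."""
    N = n_j + sum(others)
    u = 1.0 - w
    a_j = 1.0 - u * (N - n_j) / N        # effective weight on own estimate
    a = [u * n / N for n in others]      # effective weights on the others
    bias = sigma2 * (sum(x * x for x in a) + (1.0 - a_j) ** 2)
    var = mu_e * (a_j ** 2 / n_j + sum(x * x / n for x, n in zip(a, others)))
    return bias + var

# Three small players (30 samples each) federating together:
best = min(coarse_mse(w / 1000, 30, [30, 30]) for w in range(1001))
print(round(best, 3))  # 0.278, matching Table 4
# A small player in the grand coalition (with the 300-sample player):
best_g = min(coarse_mse(w / 1000, 30, [30, 30, 300]) for w in range(1001))
print(round(best_g, 3))  # 0.28
```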
Finally, we examine fine-grained federation. Table 5 analyzes the same case as coarse-grained federation previously, but with optimally-weighted fine-grained federation. The full error table shows that $\pi_g$ is core stable because each player minimizes their error in this arrangement. This result holds in general: when optimal fine-grained federation is used, $\pi_g$ always minimizes error for every player and is thus core stable.
Table 5: Expected MSE under optimal fine-grained federation, with the same players as Table 4.

| Coalition structure | err (small) | err (large) |
|---|---|---|
| all players local | 0.333 | 0.0333 |
| small players federate, large player local | 0.278 | 0.0333 |
| grand coalition $\pi_g$ | 0.269 | 0.0325 |
In later sections we will give theoretical results that explain this example more fully, but understanding the core-stable partitions here will help to build intuition for more general results.
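The fine-grained numbers in Table 5 can be reproduced by minimizing the quadratic error objective over the weight vector. The sketch below solves the first-order conditions directly; the shrinkage factors $h_i$ are our notation, and the resulting weights match the closed form derived in Section 7.

```python
def fine_grained_opt_mse(n_j, others, mu_e=10.0, sigma2=1.0):
    """Minimum mean-estimation MSE for a player with n_j samples that may
    freely weight every member's local estimate (weights summing to 1)."""
    ns = [n_j] + list(others)
    h = [n / (mu_e + sigma2 * n) for n in ns]  # per-player shrinkage factor
    H = sum(h)
    scale = (1.0 - sigma2 * h[0]) / H
    v = [scale * hi for hi in h]
    v[0] += sigma2 * h[0]                      # extra weight on own estimate
    bias = sigma2 * (sum(x * x for x in v[1:]) + (1.0 - v[0]) ** 2)
    var = mu_e * sum(x * x / n for x, n in zip(v, ns))
    return bias + var

# A small player in the grand coalition of Table 5 (30, 30, 30, 300 samples):
print(round(fine_grained_opt_mse(30, [30, 30, 300]), 3))  # 0.269
```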
3 Model and assumptions
Model and technical assumptions
This section introduces our model. We assume that there is a fixed set of $M$ players, and player $i$ has a fixed number of samples, $n_i$. Though the number of samples is fixed, it is possible to analyze a varying number of samples by investigating all games involving the relevant numbers of samples. Each player draws its true parameters $\theta_i$ i.i.d. (independent and identically distributed) from a distribution with variance $\sigma^2$. The parameter $\epsilon_i^2$ represents the amount of noise in the sampling process for a given player.
In the case of mean estimation, $\theta_i$ is a scalar representing the true mean of player $i$. Player $i$ draws $n_i$ samples i.i.d. from its true distribution. Samples are drawn with variance $\epsilon_i^2$ around the true mean of the distribution.
In the case of linear regression, $\theta_i$ is a $d$-dimensional vector representing the coefficients of the true classification function, which is also assumed to be linear. Each player draws $n_i$ input datapoints $x$ from its own input distribution. It then noisily observes the outputs, drawing values $y = x \cdot \theta_i + \epsilon_i$ i.i.d., where $\epsilon_i^2$ again denotes the variance of how samples are drawn around the true mean.
There are three methods of federation. In uniform federation, a single model is produced for all members of the federating coalition $C$:

$$\hat\theta^f = \frac{1}{N}\sum_{i \in C} n_i \hat\theta_i, \qquad N = \sum_{i \in C} n_i$$

In coarse-grained federation, each player $j$ has a parameter $w_j$ that it uses to weight the global model together with its own local model, producing an averaged model:

$$\hat\theta_j^w = w_j \hat\theta_j + (1 - w_j)\,\hat\theta^f$$

for $w_j \in [0, 1]$. Note that $w_j = 0$ corresponds to unweighted federated learning and $w_j = 1$ corresponds to pure local learning. Finally, with fine-grained federation, each player $j$ has a vector of weights $v_j$ that it uses to weight every other player's contribution to its estimate:

$$\hat\theta_j^v = \sum_{i \in C} v_{ji}\,\hat\theta_i$$

for $\sum_{i \in C} v_{ji} = 1$. Note that we can recover the coarse-grained weighting case with $v_{jj} = w_j + (1 - w_j)\frac{n_j}{N}$ and $v_{ji} = (1 - w_j)\frac{n_i}{N}$ for $i \neq j$. Coarse-grained and fine-grained federation each have player-specific parameters ($w_j$, $v_j$) that can be tuned. When those parameters are set optimally for the given player, we refer to the models as "optimal" coarse-grained or fine-grained federation. We will prove in later sections how to calculate optimal weights.
We denote $\mu_e = E[\epsilon_i^2]$: the expectation of the error parameter. In the mean estimation case, $\epsilon_i^2$ represents the variance of samples around the mean. In the linear regression case, it is the variance of the observed output around the true regression value.
We assume that each player knows how many samples it has access to. It may or may not have access to the data itself, but it does not know how its $\epsilon_i^2$ values (or its $\theta_i$ parameters) differ from the mean. For example, it does not know if the data it has is unusually noisy or if its true mean lies far from the true mean of other players.
All of the stability analysis results depend on the parameters $\mu_e$ and $\sigma^2$. However, the reliance is fairly weak: often a player only needs to know whether the number of samples it has is larger or smaller than the ratio $\mu_e/\sigma^2$.
Much of this paper analyzes the stability of coalition structures. Analyzing stability could be relevant because players can actually move between coalitions. However, even if players arenβt able to actually move, analyzing the stability of a coalition tells us something about its optimality for each set of players.
Normative assumptions
This paper is primarily descriptive: it aims to model a phenomenon in the world, not to say whether that phenomenon is good or bad. For example, it could be that society as a whole values situations where many players federate together and might wish to require players to do so, regardless of whether this minimizes their error. It might be the case that society prefers that all players, regardless of how many samples they have access to, have roughly similar error rates. Our use of the expected mean squared error is also worth reflecting on: it assumes that over- and under-estimates are equally costly and that larger mis-estimates are more costly. More subtly, we are taking the expected MSE over parameter draws $\theta_i$. A player whose true mean happens to fall far from the population mean might experience a much higher error than its expected MSE.
In the entirety of this paper, we take as fixed the requirement that data not be shared, whether for privacy or technical-capability reasons, and so are implicitly valuing that requirement more than the desire for lower error. We are also assuming that the problem at hand is completely encompassed by the machine learning task, which might omit the fact that non-machine-learning solutions may be better suited. It may also be that technical requirements other than error rate are more important: for example, the desire to balance the amount of computation done by each agent.
4 Expected error results
This paperβs first contribution is to derive exact expected values for the MSE of players under different situations. The fact that these values are exact allows us to precisely reason about each playerβs incentives in later sections. We will state the theorems here and provide the proofs in Appendix B.
The approach for this section was first to derive expected MSE values for the most general case and then derive values for other cases as corollaries. The most general case is linear regression with fine-grained federation. First, we note that we can recover coarse-grained or uniform federation by setting the weights to the appropriate values. Next, we note that mean estimation is a special case of linear regression. For intuition, consider a model where a player draws an $x$ value that is deterministically 1, then multiplies it by an unknown single parameter $\theta$, and then takes a measurement of this mean with noise $\epsilon$. This corresponds exactly to the mean estimation case, where a player has a true mean $\theta$ and observes $y = \theta + \epsilon$ as a sample. We can use this representation to simplify the error terms, with more details given in Appendix B.
First, we give the expected MSE for local estimation:
Theorem 4.1.
For linear regression, the expected MSE of local estimation for a player $j$ with $n_j$ samples is

$$\mu_e \, E_{X_j}\!\left[\mathrm{tr}\!\left(\Sigma_j (X_j^T X_j)^{-1}\right)\right]$$

where $X_j$ is the matrix of player $j$'s input datapoints and $\Sigma_j = E[x x^T]$ is the second-moment matrix of its input distribution. If the distribution of input values is a $d$-dimensional multivariate normal distribution with 0 mean, then the expected MSE of local estimation can be simplified to:

$$\mu_e \frac{d}{n_j - d - 1}$$

In the case of mean estimation, the error term can be simplified to:

$$\frac{\mu_e}{n_j}$$
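The mean-estimation value $\mu_e/n_j$ can also be checked by simulation. The sketch below assumes Gaussian draws for both the true mean and the samples (the theorem itself does not require this) and uses parameter choices of our own.

```python
import random

def simulate_local_mse(n, mu_e=10.0, sigma2=1.0, trials=100_000, seed=0):
    """Monte Carlo estimate of expected MSE under local estimation: draw a
    true mean theta, estimate it by the sample mean of n noisy draws, and
    average the squared error over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.gauss(0.0, sigma2 ** 0.5)
        est = sum(rng.gauss(theta, mu_e ** 0.5) for _ in range(n)) / n
        total += (est - theta) ** 2
    return total / trials

print(simulate_local_mse(5))  # close to mu_e / n = 2.0
```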
Next, we calculate the expected MSE for fine-grained federation:
Theorem 4.2.
For mean estimation with fine-grained federation, the expected MSE of federated estimation for a player $j$ in coalition $C$ using weights $v_j$ is:

$$\mu_e \sum_{i \in C} \frac{v_{ji}^2}{n_i} \;+\; \sigma^2\left(\sum_{i \in C,\, i \neq j} v_{ji}^2 + (1 - v_{jj})^2\right)$$

For linear regression, the bias component (the term multiplying $\sigma^2$) keeps the same structure, while the variance component replaces each factor $1/n_i$ with the trace term of Theorem 4.1; for 0-mean multivariate normal inputs this factor simplifies to $d/(n_i - d - 1)$.
Finally, we derive as corollaries the expected MSE for the uniform federation and the coarse-grained case.
Corollary 4.3.
For mean estimation with uniform federation, the expected MSE of federated estimation for a player $j$ in coalition $C$ with $n_j$ samples is:

$$\frac{\mu_e}{N} \;+\; \sigma^2\,\frac{\sum_{i \neq j} n_i^2 + \left(\sum_{i \neq j} n_i\right)^2}{N^2}$$

where $N = \sum_{i \in C} n_i$. The linear regression form follows by the same substitution as in Theorem 4.2.
Corollary 4.4.
For mean estimation with coarse-grained federation with weight $w_j$, the expected MSE of federated estimation for a player $j$ in coalition $C$ is:

$$\mu_e\left(\frac{\left(w_j + (1 - w_j)\frac{n_j}{N}\right)^2}{n_j} + (1 - w_j)^2\,\frac{\sum_{i \neq j} n_i}{N^2}\right) \;+\; \sigma^2 (1 - w_j)^2\,\frac{\sum_{i \neq j} n_i^2 + \left(\sum_{i \neq j} n_i\right)^2}{N^2}$$

where $N = \sum_{i \in C} n_i$. The linear regression form again follows by the same substitution as in Theorem 4.2.
The exact MSE for linear regression follows a very similar form to that for mean estimation. In all cases, the bias component (the term involving $\sigma^2$) has exactly the same form. The variance component (the term involving $\mu_e$) fits the exact form of mean estimation in the limit where $n_i \gg d$: in this case, since $d/(n_i - d - 1) \approx d/n_i$, the error can be matched to the mean estimation expression by using $\mu_e d$ in place of $\mu_e$. This approximation is good when there are many more samples than the dimension of the linear regression problem under investigation: for most cases of model fitting, this assumption is reasonable.
For the rest of the paper, we will use the $n_i \gg d$ assumption: consequently, all of our results apply equally to linear regression and mean estimation.
5 Uniform federation: coalition formation
In this section, we analyze the stability of coalition structures in the case that uniform federation is used. We consider two cases: 1) where all players have the same number of datapoints and 2) where all players have either a "small" or "large" number of points. We will use $\pi_\ell$ to refer to the coalition partition where all players are alone and $\pi_g$ to refer to the grand coalition. Proofs from this section are given in Appendix C.
All players have the same number of samples
In this case, the analysis simplifies greatly:
Lemma 5.1.
If all players have the same number of samples $n$, then:
- If $n < \mu_e/\sigma^2$, players minimize their error in the grand coalition $\pi_g$.
- If $n > \mu_e/\sigma^2$, players minimize their error in $\pi_\ell$ (local learning).
- If $n = \mu_e/\sigma^2$, players are indifferent between any arrangement of players.
Proof.
In the case that all players have the same number of samples $n$, we can simplify the error term for a player in a coalition of $k$ players to:

$$\mathrm{err}(k) = \frac{\mu_e}{kn} + \sigma^2\,\frac{k - 1}{k}$$

In order to see whether players would prefer a larger group (higher $k$) or a smaller group (smaller $k$), we take the derivative of the error with respect to $k$:

$$\frac{d}{dk}\,\mathrm{err}(k) = \frac{1}{k^2}\left(\sigma^2 - \frac{\mu_e}{n}\right)$$

This is positive when $n > \mu_e/\sigma^2$: a player gets higher error the more players it is federating with. This is negative when $n < \mu_e/\sigma^2$: a player gets lower error the more players it is federating with. This is 0 when $n = \mu_e/\sigma^2$, which implies players should be indifferent between different arrangements. Plugging $n = \mu_e/\sigma^2$ into the error equation gives $\mathrm{err}(k) = \sigma^2$, which is equivalent to the error a player would get alone: $\mu_e/n$. ∎
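The threshold behavior in this proof is easy to see numerically. The helper below transcribes the simplified equal-size error expression (our reconstruction of the expression used above) with $\mu_e = 10$ and $\sigma^2 = 1$, so the threshold is $\mu_e/\sigma^2 = 10$.

```python
def equal_size_err(k, n, mu_e=10.0, sigma2=1.0):
    """Expected MSE of a player in a coalition of k players, each with n
    samples, under uniform federation (mean estimation)."""
    return mu_e / (k * n) + sigma2 * (k - 1) / k

# Below the threshold (n = 5 < 10), error falls as the coalition grows:
assert equal_size_err(3, 5) < equal_size_err(2, 5) < equal_size_err(1, 5)
# Above it (n = 25 > 10), error rises:
assert equal_size_err(3, 25) > equal_size_err(2, 25) > equal_size_err(1, 25)
# Exactly at n = 10, every coalition size gives mu_e / n = 1:
assert all(abs(equal_size_err(k, 10) - 1.0) < 1e-12 for k in range(1, 6))
```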
As a corollary, we can classify the core stable arrangements cleanly:
Corollary 5.2.
For uniform federation, if all players have the same number of samples $n$, then:
- If $n < \mu_e/\sigma^2$, the grand coalition $\pi_g$ is the only partition that is core-stable.
- If $n > \mu_e/\sigma^2$, $\pi_\ell$ (all players alone) is the only partition that is core-stable.
- If $n = \mu_e/\sigma^2$, any arrangement of players is core-stable.
Small & large player case
In this section, we add another layer of depth by allowing players to come in one of two "sizes": "small" players have $n_s$ samples and "large" ones have $n_\ell$ samples, with $n_s < n_\ell$. We demonstrate that every game of this form has a stable partition by constructively producing an arrangement that is stable. Note that this is not true of hedonic games in general: as discussed in Bogomolnaia and Jackson (2002), there are multiple instances where a game might have no stable partition.
To characterize this space, we divide it into cases depending on the sizes of $n_s$ and $n_\ell$ relative to the ratio $\mu_e/\sigma^2$. We will use the notation $C(s, \ell)$ to denote a coalition with $s$ small players and $\ell$ large players, out of a total of $S$ and $L$ present. We will use $C_1 \succ_s C_2$ to mean that the small players prefer coalition $C_1$ to $C_2$, and $C_1 \succ_\ell C_2$ to mean the same preference, but for large players.
Case 1: $n_s \geq \mu_e/\sigma^2$
The first case is when even $n_s$ is large: it turns out that each player minimizes their error by using local learning, which means that $\pi_\ell$ is in the core. The lemma below is more general than the small/large case, but implies that when $n_s > \mu_e/\sigma^2$, $\pi_\ell$ is the only element in the core, and when $n_s = \mu_e/\sigma^2$, any arrangement where the large players are alone is in the core.
Lemma 5.3.
For uniform federation, if $n_i > \mu_e/\sigma^2$ for all players $i$, then $\pi_\ell$ is the unique element in the core.
If $n_i \geq \mu_e/\sigma^2$ for all players $i$, with $n_i = \mu_e/\sigma^2$ for at least one player $i$, then any arrangement where the players with $n_i > \mu_e/\sigma^2$ samples are alone is in the core.
Case 2: $n_\ell \leq \mu_e/\sigma^2$
Next, we consider the case where both the small and the large players have a relatively small number of samples. In this situation, it turns out that the grand coalition is core stable.
Theorem 5.4.
For uniform federation, if $n_s \leq n_\ell$ and $n_\ell \leq \mu_e/\sigma^2$, then the grand coalition $\pi_g$ is core stable.
Case 3: $n_s < \mu_e/\sigma^2 < n_\ell$
Finally, we consider the case where the small players have a number of samples below the $\mu_e/\sigma^2$ boundary, while the large players have a number of samples above this threshold.
Theorem 5.5.
Assume uniform federation with $n_s < \mu_e/\sigma^2 < n_\ell$. Then, there exists an arrangement of small and large players that is individually stable, along with a computationally efficient algorithm to calculate it.
The proof of Theorem 5.5 is constructive: it gives an exact arrangement that is individually stable. One natural question is whether this arrangement is also core stable. The answer is "no": we show that this arrangement can fail to be core stable. This avenue is explored further in Appendix C.
6 Coarse-grained federation
In this section, we analyze coarse-grained federation. As a reminder, in this situation, each player has a parameter that it uses to weight the global model with its own local model.
for . All proofs from this section are given in Appendix D.
Note that $w_j$ is a parameter that each player can set independently. The lemma below analyzes the optimal value of $w_j$ and tells us that each player would prefer federation, in some form, to being alone.
Lemma 6.1.
For coarse-grained federation, the minimum error is always achieved at some $w_j < 1$, implying that federation is always preferable to local learning.
Corollary 6.2.
For coarse-grained federation, when $w_j$ is set optimally, the grand coalition is always individually stable.
Specifically, this means that no player wishes to unilaterally deviate from $\pi_g$. However, this does not mean that each player prefers the grand coalition to every other federating coalition: see Section 2 for an example where the grand coalition is not core stable.
In the rest of this section, we will analyze the stability of coalition structures in the case that the $w_j$ parameters are set optimally (optimal coarse-grained federation). First, we will find it useful to get the closed-form value for the expected MSE of a player using optimal coarse-grained federation:
Lemma 6.3.
A player $j$ in coalition $C$ using the optimal coarse-grained federation parameter $w_j$ has expected MSE:

$$\frac{\mu_e}{n_j} - \frac{B^2}{A}$$

where $N = \sum_{i \in C} n_i$, $B = \frac{\mu_e (N - n_j)}{n_j N}$, and

$$A = \frac{\mu_e (N - n_j)^2}{n_j N^2} + \frac{\mu_e (N - n_j)}{N^2} + \sigma^2\,\frac{\sum_{i \neq j} n_i^2 + (N - n_j)^2}{N^2}$$

(The expected MSE of Corollary 4.4 is quadratic in $1 - w_j$, so the optimum $1 - w_j = B/A$ yields the value above.)
All players have the same number of samples
Lemma 6.4 is the analog of Lemma 5.1 in the previous section. Here, the results differ: with optimal coarse-grained federation, the grand coalition is always the only stable arrangement, no matter how small or large $n$ is relative to $\mu_e/\sigma^2$.
Lemma 6.4.
For mean estimation with optimal coarse-grained federation, if all players have the same number of samples $n$, then $\pi_g$ is the only element in the core.
Proof.
Using the error term derived in Lemma 6.3, plugging in $n_i = n$ for a coalition of $k$ players and simplifying gives:

$$\frac{\mu_e}{n}\left(1 - \frac{k - 1}{k}\cdot\frac{\mu_e}{\mu_e + n\sigma^2}\right)$$

As $k$ increases, this error always decreases, so $\pi_g$ is where each player minimizes their error and is thus the only core stable arrangement. ∎
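A closed form for the optimal equal-size coarse-grained error makes this monotonicity easy to verify numerically. The expression below is our derivation (consistent with the lemma: it decreases in $k$ for every $n$), again with $\mu_e = 10$ and $\sigma^2 = 1$.

```python
def coarse_opt_equal(k, n, mu_e=10.0, sigma2=1.0):
    """Minimum coarse-grained MSE when k players, each with n samples,
    federate together (mean estimation, optimal weight)."""
    return (mu_e / n) * (1.0 - ((k - 1) / k) * mu_e / (mu_e + n * sigma2))

# Unlike uniform federation, a larger coalition always helps, even when
# n = 25 exceeds mu_e / sigma2 = 10:
errs = [coarse_opt_equal(k, 25) for k in range(1, 6)]
assert all(a > b for a, b in zip(errs, errs[1:]))
print(round(coarse_opt_equal(3, 30), 3))  # 0.278, matching Table 4
```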
Small & large player case
In this subsection, we similarly extend results to the "small" and "large" case introduced in the previous section. The analysis turns out to be much simpler than in the uniform federation case, and also produces stronger results: strict core stability, rather than individual stability.
Theorem 6.5.
If optimal coarse-grained federation is used, then:
- If the small players weakly prefer the grand coalition to federating among themselves, then the grand coalition $\pi_g$ is strictly core stable.
- If the small players strictly prefer federating among themselves, then the partition in which all small players federate together and all large players federate together is strictly core stable.
7 Fine-grained federation
In this section, we analyze fine-grained federation. As a reminder, with this method, each player $j$ has a vector of weights $v_j$ that it uses to weight every other player's contribution to its estimate:

$$\hat\theta_j^v = \sum_{i \in C} v_{ji}\,\hat\theta_i$$

for $\sum_{i \in C} v_{ji} = 1$.
We calculate the optimal weights for player $j$'s error.
Lemma 7.1.
Define $h_i = \frac{n_i}{\mu_e + \sigma^2 n_i}$ and $H = \sum_{i \in C} h_i$. Then, the set of weights $v_j$ that minimizes player $j$'s error is:

$$v_{ji} = \frac{(1 - \sigma^2 h_j)\,h_i}{H} \;\;\text{for } i \neq j, \qquad v_{jj} = \sigma^2 h_j + \frac{(1 - \sigma^2 h_j)\,h_j}{H}$$
The proof of this lemma is given in Appendix E.
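Transcribing these optimal weights directly (with shrinkage factors $h_i = n_i/(\mu_e + \sigma^2 n_i)$, our notation) confirms that they form a valid convex combination with every entry strictly between 0 and 1:

```python
def optimal_weights(j, ns, mu_e=10.0, sigma2=1.0):
    """Fine-grained weights minimizing player j's mean-estimation MSE;
    ns is the list of sample counts in the coalition."""
    h = [n / (mu_e + sigma2 * n) for n in ns]
    H = sum(h)
    scale = (1.0 - sigma2 * h[j]) / H
    v = [scale * hi for hi in h]
    v[j] += sigma2 * h[j]  # the player puts extra weight on its own estimate
    return v

# Small player (index 0) in a coalition with sample counts 30, 30, 30, 300:
v = optimal_weights(0, [30, 30, 30, 300])
assert abs(sum(v) - 1.0) < 1e-12
assert all(0.0 < x < 1.0 for x in v)
print([round(x, 3) for x in v])  # [0.808, 0.058, 0.058, 0.075]
```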
From this analysis, a few properties become clear. To start with, $v_{jj}$ and each $v_{ji}$ are always strictly between 0 and 1. This implies the following corollary:
Corollary 7.2.
With optimal fine-grained federation, $\pi_g$ is optimal for each player.
Proof.
Suppose by contradiction that some other coalition gave player $j$ a lower error. WLOG, assume this coalition omits player $i$. In this case, the weights player $j$ uses can be represented as a length-$M$ vector with 0 in the $i$th entry. However, this set of weights is achievable in $\pi_g$: it is always an option to set a player's coefficient equal to 0. This contradicts the optimality of the weights used in $\pi_g$, so it cannot be the case that any player gets lower error in a different coalition. ∎
Similarly, the fact that is optimal for every player implies that it is in the core, and that it is the only element in the core.
8 Conclusions and future directions
In this work, we have drawn a connection between a simple model of federated learning and the game theoretic tool of hedonic games. We used this tool to examine stable partitions of the space for two variants of the game. In service of this analysis, we computed exact error values for mean estimation and linear regression, as well as for three different variations of federation.
We believe that this framework is a simple and useful tool for analyzing the incentives of multiple self-interested agents in a learning environment. There are many fascinating extensions. For example, completely characterizing the core (including whether it is always non-empty) in the case of an arbitrary number of samples is an obvious area of investigation. Besides this, it could be interesting to compute exact or approximate error values for cases beyond mean estimation and linear regression.
Acknowledgments
This work was supported in part by a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, a MURI grant, AFOSR grant FA9550-19-1-0183, grants from the ARO and the MacArthur Foundation, and NSF grant DGE-1650441. We are grateful to A. F. Cooper, Thodoris Lykouris, Hakim Weatherspoon, and the AI in Policy and Practice working group at Cornell for invaluable discussions. In particular, we thank A. F. Cooper for discussions around normative assumptions. Finally, we are grateful to Katy Blumer for discussions around code in the GitHub repository.
References
- Abu-Mostafa, Y.; Lin, H.; and Magdon-Ismail, M. 2012. Learning from Data: A Short Course. AMLbook.
- Anderson, T. W. 1962. An introduction to multivariate statistical analysis. Technical report, Wiley New York.
- Blum, A.; Haghtalab, N.; Procaccia, A. D.; and Qiao, M. 2017. Collaborative PAC Learning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 2392–2401. Curran Associates, Inc. URL http://papers.nips.cc/paper/6833-collaborative-pac-learning.pdf.
- Blum, A.; Haghtalab, N.; Shao, H.; and Phillips, R. L. 2020. Unpublished work, private correspondence.
- Bogomolnaia, A.; and Jackson, M. O. 2002. The Stability of Hedonic Coalition Structures. Games and Economic Behavior 38(2): 201–230. ISSN 0899-8256. doi:10.1006/game.2001.0877. URL http://www.sciencedirect.com/science/article/pii/S0899825601908772.
- Casella, G. 1992. Illustrating empirical Bayes methods. Chemometrics and Intelligent Laboratory Systems 16(2): 107–125.
- Cooper, A. F. 2020. Where Is the Normative Proof? Assumptions and Contradictions in ML Fairness Research.
- Deng, Y.; Kamani, M. M.; and Mahdavi, M. 2020. Adaptive Personalized Federated Learning.
- Duan, M.; Liu, D.; Chen, X.; Liu, R.; Tan, Y.; and Liang, L. 2021. Self-Balancing Federated Learning With Global Imbalanced Data in Mobile Systems. IEEE Transactions on Parallel and Distributed Systems 32(1): 59–71.
- Efron, B.; and Morris, C. 1977. Stein's paradox in statistics. Scientific American 236(5): 119–127.
- Guazzone, M.; Anglano, C.; and Sereno, M. 2014. A Game-Theoretic Approach to Coalition Formation in Green Cloud Federations. 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. doi:10.1109/ccgrid.2014.37. URL http://dx.doi.org/10.1109/CCGrid.2014.37.
- Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; D'Oliveira, R. G. L.; Rouayheb, S. E.; Evans, D.; Gardner, J.; Garrett, Z.; Gascón, A.; Ghazi, B.; Gibbons, P. B.; Gruteser, M.; Harchaoui, Z.; He, C.; He, L.; Huo, Z.; Hutchinson, B.; Hsu, J.; Jaggi, M.; Javidi, T.; Joshi, G.; Khodak, M.; Konečný, J.; Korolova, A.; Koushanfar, F.; Koyejo, S.; Lepoint, T.; Liu, Y.; Mittal, P.; Mohri, M.; Nock, R.; Özgür, A.; Pagh, R.; Raykova, M.; Qi, H.; Ramage, D.; Raskar, R.; Song, D.; Song, W.; Stich, S. U.; Sun, Z.; Suresh, A. T.; Tramèr, F.; Vepakomma, P.; Wang, J.; Xiong, L.; Xu, Z.; Yang, Q.; Yu, F. X.; Yu, H.; and Zhao, S. 2019. Advances and Open Problems in Federated Learning.
- Li, T.; Sahu, A. K.; Talwalkar, A.; and Smith, V. 2020. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine 37(3): 50–60. ISSN 1558-0792. doi:10.1109/msp.2020.2975749. URL http://dx.doi.org/10.1109/MSP.2020.2975749.
- Li, T.; Sanjabi, M.; Beirami, A.; and Smith, V. 2019. Fair Resource Allocation in Federated Learning.
- Mansour, Y.; Mohri, M.; Ro, J.; and Suresh, A. T. 2020. Three Approaches for Personalization with Applications to Federated Learning.
- Martinez, N.; Bertran, M.; and Sapiro, G. 2020. Minimax Pareto Fairness: A Multi Objective Perspective. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR. URL https://proceedings.icml.cc/static/paper_files/icml/2020/1084-Paper.pdf.
- Morris, C. N. 1986. Empirical Bayes: a frequency-Bayes compromise, volume 8 of Lecture Notes–Monograph Series, 195–203. Hayward, CA: Institute of Mathematical Statistics. doi:10.1214/lnms/1215540299. URL https://doi.org/10.1214/lnms/1215540299.
- Paquay, P. 2018. Learning-from-data-Solutions. URL https://github.com/ppaquay/Learning-from-Data-Solutions.
- Sattler, F.; Muller, K.-R.; and Samek, W. 2020. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints. IEEE Transactions on Neural Networks and Learning Systems 1–13. ISSN 2162-2388. doi:10.1109/tnnls.2020.3015958. URL http://dx.doi.org/10.1109/TNNLS.2020.3015958.
- Sellentin, E.; and Heavens, A. F. 2015. Parameter inference with estimated covariance matrices. Monthly Notices of the Royal Astronomical Society: Letters 456(1): L132–L136. ISSN 1745-3925. doi:10.1093/mnrasl/slv190. URL https://doi.org/10.1093/mnrasl/slv190.
- Shlezinger, N.; Rini, S.; and Eldar, Y. C. 2020. The Communication-Aware Clustered Federated Learning Problem. In 2020 IEEE International Symposium on Information Theory (ISIT), 2610–2615.
- Wang, H.; Hsu, H.; Diaz, M.; and Calmon, F. P. 2020. To Split or Not to Split: The Impact of Disparate Treatment in Classification.
- Yu, T.; Bagdasaryan, E.; and Shmatikov, V. 2020. Salvaging Federated Learning by Local Adaptation.
Appendix A Relationship to other approaches
This section contains a high-level summary of similar approaches and how they relate to ours. Throughout we assume the goal is to estimate some unknown given samples drawn .
A frequentist approach would take to be a constant that would be estimated by the average of the given samples .
A hierarchical Bayesian estimator assumes data is generated in the following way: data is drawn . The parameter is drawn , where the hyperparameter is drawn from a known distribution . Given some data, the parameter can be estimated as follows:
Parametric empirical Bayes (Morris 1986; Casella 1992) is frequently described as an intermediate between these two viewpoints. Similar to the hierarchical Bayesian viewpoint, it assumes data is drawn , with parameter drawn . However, it differs in that it estimates based on the data, producing . This estimate of the hyperparameter is used, along with the data, to estimate .
A related example is the James-Stein estimator (Efron and Morris 1977). The estimator assumes the following process: each of players draws a single sample from a normal distribution with variance .
This is different from the empirical Bayes or Bayes case in that it is assumed that the means are completely unrelated to each other. Nevertheless, it has been demonstrated that the James-Stein estimator:
has lower expected MSE than simply using the drawn parameters . In the case that the variance is not known perfectly, it can be estimated as using the entire vector of data .
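As a minimal sketch of this dominance phenomenon, the code below implements the standard known-variance James-Stein estimator (shrinking toward 0, unit noise variance, hypothetical fixed means) and compares its total MSE to the raw observations over many trials.

```python
import numpy as np

rng = np.random.default_rng(2)
k, trials, sigma2 = 10, 20000, 1.0
theta = rng.normal(0.0, 1.0, size=k)   # fixed, unrelated true means (hypothetical)

# Each trial: one observation x_j ~ N(theta_j, sigma2) per coordinate.
x = theta + rng.normal(0.0, np.sqrt(sigma2), size=(trials, k))

# James-Stein shrinkage toward 0, assuming the variance sigma2 is known.
shrink = 1.0 - (k - 2) * sigma2 / np.sum(x ** 2, axis=1, keepdims=True)
js = shrink * x

mse_raw = float(np.mean(np.sum((x - theta) ** 2, axis=1)))  # about k * sigma2
mse_js = float(np.mean(np.sum((js - theta) ** 2, axis=1)))  # strictly smaller
```

The improvement holds even though the coordinates' means are unrelated, which is exactly the surprising feature noted above.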
Our method is similar at a high level to empirical Bayes: we assume each player draws data from a personal distribution governed by and that the terms are in turn drawn from some distribution . However, one key difference is that all three methods discussed above assume knowledge of the distributions generating the data, or at least which family they are drawn from. For example, the James-Stein estimator assumes a normal distribution: variants of it exist for different distributions, but no version works for all distributions. Similarly, a hierarchical Bayes or empirical Bayes viewpoint would require knowledge of the distributions. In our approach, we do not assume that we know the form of these generating distributions, only some summary statistics (mean and variance) of the distribution.
It is entirely possible that other approaches, especially those that assume knowledge of the generating distribution, will outperform our approach in terms of the error guarantees they can provide. However, our distribution-free approach can be applied in a broader range of situations.
Additionally, our approach is restricted to linear combinations of estimators such as . It is possible that a method outside this class, for example something like , or the non-linear James-Stein estimator, would produce better estimates.
Appendix B Expected error proofs
For convenience, we will restate the model setup for the most general case of linear regression. We assume that each player draws parameters , where is a length vector and is a scalar-valued variance parameter. The th entry in the vector is , and . We assume that each value is drawn independently of the others. The main result of this section will assume that each dimension is drawn independently, for example that is independent of , for , but we will demonstrate how this can be relaxed. Each player draws input data points from their own input distribution, such that . They then noisily observe the outputs, drawing . We use to denote the length vector of errors so that . Each player uses ordinary least squares (OLS) to compute estimates of their parameters, which requires that is invertible. This happens when the columns of are linearly independent. We will require that be such that is invertible with probability 1. This rules out cases where one dimension is a deterministic function of another, for example. The OLS calculation below gives the local estimate:
It is worth pausing briefly to note why mean estimation is a special case of linear regression. Consider the case where the distribution is deterministically 1, meaning that is a vector of 1s of length . is always invertible, as . Each is multiplied by an unknown single parameter , which each player is attempting to learn.
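This reduction is easy to verify numerically: with a design matrix of all 1s, the OLS solution is exactly the sample mean (the values below are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
y = rng.normal(2.5, 1.0, size=n)   # n noisy draws whose mean we want to estimate
X = np.ones((n, 1))                # input is deterministically 1 for every point

# X^T X = n, which is always invertible, so OLS is well defined.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# theta_hat[0] coincides with y.mean(): mean estimation is OLS on a ones column.
```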
Besides local estimation, there are multiple federation possibilities. Uniform federation is given by:
Coarse-grained federation is given by:
for . Finally, fine-grained federation is given by:
for . Note that fine-grained federation is the most general case of federation. It is possible to derive coarse-grained federation, uniform federation, or local estimation by appropriately setting the weights. In this section, we will first derive the expected error for the fine-grained federation, linear regression case, and get expected error results for other cases as corollaries of this result.
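The relationship between the methods can be sketched as weightings of the local estimates. Note that the sample-size weighting used for the shared model below is an assumption matching standard federated averaging, and the estimates and sample counts are hypothetical.

```python
import numpy as np

def fine_grained(local_estimates, weights):
    """Each player may choose any weighting of all players' local estimates."""
    return float(np.dot(weights, local_estimates))

def uniform(local_estimates, n_samples):
    """All players share one model: the sample-size-weighted average."""
    n = np.asarray(n_samples, dtype=float)
    return fine_grained(local_estimates, n / n.sum())

def coarse_grained(local_estimates, n_samples, j, w_self):
    """Player j mixes its own estimate with the shared uniform model."""
    return (w_self * local_estimates[j]
            + (1.0 - w_self) * uniform(local_estimates, n_samples))

theta_hat = np.array([1.0, 2.0, 4.0])   # hypothetical local estimates
n = np.array([10, 20, 10])              # hypothetical sample counts

shared = uniform(theta_hat, n)          # (10*1 + 20*2 + 10*4) / 40 = 2.25
local0 = fine_grained(theta_hat, np.array([1.0, 0.0, 0.0]))  # player 0 alone
```

Setting a player's own weight to 1 recovers local estimation, and a coarse-grained weight of 0 recovers the uniform model, illustrating that fine-grained federation generalizes the other methods.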
The expected error produced by a set of estimates is determined by the expectation of the following quantity.
Here, the expectation is taken over four sources of randomness.
1. : Drawing parameters for player 's distribution from .
2. : Drawing the training dataset from the data distribution .
3. : Drawing labels for the training dataset from the distribution .
4. : Drawing a new test point from the data distribution .
measures the expected error of a set of parameters at a particular point : when the expectation is taken over all , it represents the average error everywhere on the distribution. It might not be immediately clear, though, why is the correct term to consider. Other potential candidates might include:
1.
2. for .
The first candidate measures the difference in estimated parameters; however, we assume that the objective of learning is to have low error when predicting future points, rather than solely estimating the parameters. The second candidate represents the error of predicting an instance as opposed to a mean value: it ends up simply producing an additive increase in our overall error term. To see this, note that we can write
The first term is the same error function we are considering. The middle term is 0 in expectation and the last term is in expectation, so this approach simply shifts the error we were looking at by an additive constant.
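This additive relationship is straightforward to confirm by simulation (the estimator spread and noise variance below are hypothetical): the MSE against a noisy instance exceeds the MSE against the mean value by exactly the noise variance, in expectation.

```python
import numpy as np

rng = np.random.default_rng(4)
trials, sigma2 = 200000, 1.0
f_x = 0.7   # hypothetical true mean response at a fixed test point

y_hat = f_x + rng.normal(0.0, 0.3, size=trials)           # imperfect estimates
y = f_x + rng.normal(0.0, np.sqrt(sigma2), size=trials)   # noisy instances

mse_vs_mean = float(np.mean((y_hat - f_x) ** 2))   # error against the mean value
mse_vs_inst = float(np.mean((y_hat - y) ** 2))     # error against an instance
# mse_vs_inst - mse_vs_mean is close to sigma2: the purely additive noise term.
```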
See 4.1. Note: portions of Abu-Mostafa, Lin, and Magdon-Ismail (2012) and Paquay (2018), especially problem 3.11, were helpful in formulating this approach. Sellentin and Heavens (2015) and Anderson (1962) were helpful in providing the connection to the Inverse Wishart distribution.
Proof.
First, note that:
Then,
To simplify, we note that the above quantity is a scalar. For a scalar, , and, for any matrix, , by the cyclic property of the trace.
To evaluate, we start by applying the various expectations, noting that expectation and trace commute. Applying to the term above allows us to rewrite it as:
where
is an matrix. The th diagonal entry is , which has expectation . Off-diagonal entries have value for . Because the errors for each data point are drawn independently and with 0 mean, the expectation of each such entry is 0. is a diagonal matrix with along the diagonal: we can pull it out of the trace to obtain:
Taking the expectation over the drawn parameters gives:
Taking the expectation over the test point gives:
Finally, we take the expectation with respect to
Note that because the inverse and expectation do not commute, in general, we cannot simplify this without stronger assumptions.
There is one other situation where a particular case of linear regression gives us simpler results. As mentioned in the statement of the lemma, in this case we assume that the distribution of input values is a 0-mean normal distribution with covariance matrix .
Note that, in general, . has, along the diagonal, , and, on the off-diagonals, . By contrast, the covariance matrix has the same term along the diagonal, but the off-diagonal term has . In the case we are looking at, the distribution is 0-mean, so the off-diagonal terms match as well, and .
If this is the case, then is distributed according to a Wishart distribution with parameters and covariance with dimension . Given this, is distributed according to an Inverse Wishart distribution with parameters and covariance with dimension .
The expectation of the inverse Wishart tells us that:
Using these results, we can directly calculate the desired expectation:
Next, we can reduce the linear regression case to mean estimation. In this case, assume a 1-dimensional input with deterministically. After drawing , we multiply it by and add some noise governed by : this is the exact same structure as mean estimation. In this case, . Similarly, deterministically, so the error term reduces to as desired.
Note that, as expected, this does not simplify down to the mean estimation case for : that case would model a version of 1-dimensional linear regression, where it is necessary to estimate both as well as , the mean of the input distribution. ∎
Next, we calculate expected MSE for the fine-grained linear regression case.
See 4.2
Proof.
Here, we will use to mean the expectation taken over data from all players , given that all of the data influences the federated learning result.
(1)
Note that the expectation of the last term in Equation 1 is 0 because . Next, we investigate the second term in Equation 1.
Expanding out the squared term gives us:
The second term ends up being irrelevant: because each set of parameters is drawn independently and each data set is drawn independently, the terms are independent of each other. Because each is 0 in expectation, the entire product has expectation 0. Rewriting the first term gives:
The term inside the sum is exactly equivalent to the value we solved for in the local estimation case: we can rewrite this as
or, if the necessary conditions are satisfied,
Finally, we will explore the first term of Equation 1:
Taking the expectation and using the cyclic property of the trace gives:
(2)
Next, we focus on simplifying the inner term of Equation 2 involving the values. Using the definition of gives:
We know that , so we can rewrite this as:
Expanding gives us two terms:
(3)
Note that
Because the parameters are drawn i.i.d., is taken to mean the same thing as . Using these results allows us to expand out the product and take the expectation. The first term in Equation 3 becomes:
For the second term in Equation 3, each product within the sum becomes :
The last two terms cancel because the parameters are drawn i.i.d., so the sum becomes:
The overall sum in Equation 3 becomes:
The sum of both of these terms taken together becomes:
Next, we can use the identity:
which allows us to rewrite the expression as:
Alternatively, because , we have
Recall that this analysis was focusing solely on the component of Equation 2 that involved the product. We can recombine our simplification into Equation 2 to rewrite it as:
Next, we need to reason about the difference in the expected terms. In this setting, we are assuming that each coefficient is drawn independently of the other coefficients. has, on the th element of the diagonal, , and, on the off-diagonal th entry, . Here, we are assuming that and are independent, so this equals ; we relax this assumption below. has, on the th element of the diagonal, , and its th off-diagonal term has the same value as the other matrix: . The difference between these two matrices is a diagonal matrix with on the diagonal, where denotes the variance of the th coefficient. This turns the term involving the trace into a simple sum:
In the proof above, we assumed that the draw of parameter value is independent of , for . A case where this might not occur is when these values are correlated: say, the value drawn for is anti-correlated with the parameter drawn for . (Note that we still assume draws are independent across players: is independent of and ). Relaxing this assumption is not hard and would change the results in the following way: the off-diagonal terms of the difference would no longer be 0. Instead, the off-diagonal th entry becomes
Performing the matrix multiplication with turns this into:
Our final value for this component of the error would have the same form, but with a slightly different coefficient.
Finally, we will consider the mean estimation case. As before, we note that and deterministically, so that component of the error term reduces to
Similarly, we note that , so the second component reduces to:
∎
See 4.3
Proof.
To obtain this result, we note that the uniform federation case amounts to weights . For both linear regression and mean estimation, the multiplier becomes:
For linear regression, the multiplier becomes:
And for mean estimation, the multiplier becomes:
∎
See 4.4
Proof.
To obtain this result, we note that the coarse-grained case corresponds to and . For both linear regression and mean estimation, the multiplier becomes:
For linear regression, let stand for either or , depending on the linear regression case. Then, the multiplier becomes:
For mean estimation, the multiplier becomes:
∎
Appendix C Supporting proofs for uniform federation
See 5.2
Proof.
If a partition is optimal for every player, then it is core stable: there does not exist a coalition where all players prefer to , because there does not exist a coalition where any players prefer to .
If a partition is optimal for every player, then no other partition can be core stable: any set of players not in their optimal configuration could form a coalition where all players would prefer .
In the case that players are indifferent between any arrangement, then for any partition and any competing coalition , all players would be indifferent between and , so is core stable. ∎
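The notion of core stability used throughout can be checked by brute force for small examples. The sketch below is a generic checker over an arbitrary error oracle; the toy oracle here (error falling with coalition size) is illustrative only, not the paper's MSE formulas.

```python
from itertools import chain, combinations

def partitions(players):
    """Enumerate all partitions of a list (Bell-number growth: keep lists small)."""
    if not players:
        yield []
        return
    first, rest = players[0], players[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]

def coalitions(players):
    """All non-empty subsets of the player set."""
    return chain.from_iterable(combinations(players, r)
                               for r in range(1, len(players) + 1))

def is_core_stable(partition, err):
    """Core stability: no coalition C exists in which every member of C gets
    strictly lower error than under the current partition."""
    players = sorted(p for block in partition for p in block)
    current = {p: err(p, frozenset(block)) for block in partition for p in block}
    return not any(all(err(p, frozenset(C)) < current[p] for p in C)
                   for C in coalitions(players))

# Toy error oracle: error falls as the coalition grows (everyone likes company),
# so the grand coalition is the unique core-stable partition.
toy_err = lambda player, coalition: 1.0 / len(coalition)

stable = [part for part in partitions([0, 1, 2, 3])
          if is_core_stable(part, toy_err)]
```

Under this toy oracle only the grand coalition survives, mirroring the argument in the proof above: any other partition is blocked by the coalition of all players.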
See 5.3
Proof.
First, we will consider the case where the inequality is strict; we will show the statement holds by showing that every player minimizes their error by being alone in . We will use to denote the sum of elements within a coalition . could be the coalition of all players () or some strict subset, but we will assume it contains at least 2 elements. We will show that every player gets higher error in than it would get alone. We wish to show:
Cross multiplying gives:
Rewriting:
The right-hand side can be rewritten as:
Then, we can prove the inequality by splitting it up into two terms. The first:
which is true because . The second:
which is satisfied because, for each player,
because .
Next, we will consider the case where the inequality may not be strict. We can note that, in the proof above, any coalition containing at least one player with more than samples would satisfy the desired inequality: all players participating would get higher error than they could alone. This shows that any coalition involving a player with more samples than is infeasible. We have previously shown that all players with get equal error no matter their arrangement.
∎
As a reminder, we will use the notation to mean a coalition with small players and large players. The next few lemmas describe how the errors of large and small players in a coalition change as and are increased.
Lemma C.1.
For uniform federation, if and , small players always see their error decrease with the addition of more small players:
Proof.
We will show that the derivative of the small player's error with respect to is always negative. The error is:
The derivative with respect to is:
Showing that the derivative is negative is equivalent to showing that the term below is negative:
We can break this term into multiple components:
because . We can rewrite a second term as:
We know that , and because ,
These facts, taken together, show that the derivative is always negative. ∎
Lemma C.2.
For uniform federation, if and , large players see their error increase with the addition of more large players, so they prefer as small as possible:
If , then there exists some such that for all , the derivative of the large players' error with respect to is positive, and for all , the derivative of their error is negative.
Proof.
To prove this, we will show that the derivative of the large player's error with respect to is always positive when . The error is:
The derivative with respect to is:
We wish to show that the numerator is positive. We can break it into multiple components:
because . We can rewrite the second term as
which is negative because and .
Next, we consider the case where . The first term is now negative and the second term is the sum of two terms: one is positive and one is negative.
The overall derivative is negative whenever:
Note that, under the assumption, the denominator is negative. If the numerator is positive, then this is true for all , so the slope is always negative. If the numerator is negative, then the slope is negative for all for the given . ∎
Lemma C.3.
Assume uniform federation with and . If , small players always prefer federating with more large players:
If , then there exists some such that for all , the derivative of the small players' error with respect to is positive, and for all , the derivative of their error is negative.
Proof.
The small player's error is
The derivative with respect to is:
The derivative is negative when the term below is positive:
The first term (multiplying ) is always positive. For the second term is also positive or zero, so the derivative is always negative.
If , then the second term (multiplying ) is negative. The overall derivative is negative when the term below is positive:
∎
Lemma C.4.
Assume uniform federation with and . If , then there exists some such that for all , the derivative of the large players' error with respect to the number of small players is negative, and for all the derivative of their error is positive.
If , then the shape of the curve either first increases, then decreases for all , or else first decreases, and then increases for all .
Proof.
The large player's error is:
The derivative with respect to is:
The derivative is negative when the term below is positive:
For , the first term is always positive or zero and the second term is always negative. Solving for when the overall derivative is positive gives:
Next, we consider the case where . Again, the first term is positive or zero. The second term, though, is composed of a sum: one component is positive and one is negative, so it is not necessarily clear whether the overall sum is positive or negative. If the coefficient on is negative, then the curve has the same shape as in the case: first decreasing, and then increasing. If the coefficient is positive, then by the same analysis as before, the derivative is negative whenever the following inequality holds:
In this case, the derivative of the large players' error first increases, then decreases. The shape of the curve depends on more information, which indicates whether the coefficient on is positive or negative. ∎
See 5.4
Proof.
The small players always prefer as large as possible, and for they also prefer as large as possible, so minimizes error for small players. For this reason, any defection coalition that has is infeasible because the small players would get higher error.
The only kinds of defections we need to consider are of the form . We will consider and show that the large players prefer to : . In the case that , , so any other arrangement is also not a possible defection. In the case that , , so similarly any other defection is not possible. What we'd like to show is:
Cross multiplying turns the condition into:
If we collect the terms, we get:
If we collect the terms, we get:
First, we expand the first squared term and combine it with another term:
Multiplied by , it becomes:
Expanding out the second squared term gives us:
When we multiply this by , it becomes
Next, we combine similar terms in both sums. First, we start with coefficients of
Next, we consider the term involving coefficients of :
And the last term, with coefficients of :
If we take the only nonzero term and multiply it by the terms we pulled out, it becomes:
Next, we return this to our inequality. What we're trying to show is:
Cancelling some terms:
Expanding out terms:
We can prove this by splitting it up piecewise:
because . Similarly,
because . ∎
See 5.5
Proof.
We will prove this directly by calculating an arrangement that is individually stable, relying on the results in Lemmas C.1, C.2, C.3, C.4.
First, group all of the small players together. Then calculate such that : the largest number of large players that can be in the coalition such that the large players prefer this to being alone. Check if . If this is true, make the final arrangement : by previous lemmas around how the small player's error changes with , we know that if , then for all .
We will show that this is individually stable by showing no players wish to unilaterally deviate.
- No small player wishes to go to : reducing the number of small players in a group from to 1 monotonically increases the error the small player faces.
- No small player wishes to go to : it is possible to reach this state by first going from to (which would increase error because and ) and then from to (which would increase error because reducing the number of small players increases error).
- No large player can go to : this would increase the error of the small players.
- No large player wishes to go to : this would increase the error of both large players.
Next, we will consider the case where : in this case, we will show that is individually stable.
- No small player wishes to go from to . We can see that has higher error because we know .
- No small player wishes to go to . We can see that has higher error for the small player because . The first inequality comes from the following reasoning: if is negative at , then there is a monotonically increasing path of error from to 1. If is positive at 1, then we know that , whereas .
- No large player wishes to go to : by definition of , it would get greater error than in .
- No large player wishes to go to , for the same reason as above.
By this analysis, is individually stable. ∎
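Individual stability, the notion this theorem establishes, can likewise be checked mechanically for small examples: a single player may move to another coalition (or go off alone) only if the mover strictly gains and no member of the receiving coalition is made worse off. As before, the error oracle below is a toy stand-in, not the paper's MSE formulas.

```python
def is_individually_stable(partition, err):
    """Individual stability: no single player can move to another coalition
    (or to an empty one) so that the mover strictly gains and every member
    of the receiving coalition weakly gains."""
    for i, src in enumerate(partition):
        for player in src:
            here = err(player, frozenset(src))
            options = [blk for j, blk in enumerate(partition) if j != i] + [[]]
            for blk in options:
                joined = frozenset(blk) | {player}
                if err(player, joined) < here and all(
                        err(q, joined) <= err(q, frozenset(blk)) for q in blk):
                    return False
    return True

# Toy oracle as before: larger coalitions mean lower error for everyone, so
# the grand coalition is individually stable while all-singletons is not.
toy_err = lambda player, coalition: 1.0 / len(coalition)
```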
Lemma C.5, below, will be useful in Example C.6 in analyzing whether the individually stable arrangement produced by Theorem 5.5 is also core stable.
Lemma C.5.
Consider uniform federation with two coalitions and with . Then, it is not possible to pick so that and .
Proof.
To prove this, we will rely on Lemmas C.1, C.2, C.3, C.4. First, we consider a hypothetical defined such that
Note that may not be an integer: we're using it as a reference tool (a hypothetical allocation), so this doesn't matter.
We will first show that the small player gets greater error in this case. To do so, we rewrite it as:
and plug it in to the equation for the error of a small player. Note that most of the values stay the same: the entire term and the denominator of the term. Similarly, note that with the given substitution,
The only term that changes is the other portion of the numerator, which becomes:
We would like to show that it is larger than the relevant portion of the error for , which is:
This is equivalent to showing:
We can lower-bound the left-hand side using :
which shows that the small player gets greater error in than in .
Next, weβll show that the large player also gets greater error. We will first note that from the definition of the errors of the small and large players, we can write:
and
Note that, by the definition of , the additive term for each of these quantities is the same. We also have just shown that . These two facts together imply that .
We have shown that both the large and small players get higher error in than . Next, we will show that they have different preferences about whether they wish were larger or smaller, which means that no matter what is, it will leave at least one of them with higher error.
First, we know from previous analysis in Lemma C.2 that the large players always wish there were fewer other large players in a coalition: the large players want .
Secondly, we know from Lemma C.3 that the error the small player experiences first increases and then decreases as increases. Is it possible to pick an so that the small player gets lower error there than in ?
Suppose that the derivative of with respect to is positive at : then, reducing from to might reduce the small player's error. However, for every point where the small player's derivative is positive, : the small player would not wish to move here because it would get strictly higher error than it would get in .
Suppose instead that the derivative of with respect to is negative or zero at . Then, if , reducing the number of large players in the coalition from to would increase the error of the small players. This is also not an allocation that the small players would prefer.
Increasing so that is sufficiently large would satisfy the small player, but we already showed that it would increase the error the large player experiences. As a result, it is not possible to pick an allocation that both the small and large players prefer to . ∎
Example C.6.
For uniform federation, the arrangement as produced in Theorem 5.5 is not necessarily core stable.
Theorem 5.5 produces an arrangement of the form , where we note that could equal 0. As mentioned at the end of Section 5, this arrangement is not necessarily core stable. However, the reason is subtle.
Theorem 5.5 checks that the given arrangement is individually stable, so the remaining cases to check would involve multiple players moving together to a new group. First, we will consider deviations consisting of homogeneous groups: for example, or . From Lemma C.1 we know that small players always prefer having more small players in their coalition, so . Similarly, and (if ), so the large players do not wish to unilaterally deviate.
Next, we might wonder whether some of the small players and large players in the coalition might wish to deviate to some where : assume that for now. Lemma C.5 shows that this is not feasible: it is not possible to find a location with where both the small and large players get lower error. That is, it is not possible to have both and .
However, it still might be possible to have a deviating coalition. Recall from Theorem 5.5 that if , : the large players doing local learning get strictly greater error than the large players in . Could it be possible to find an arrangement so that and ?
The answer is yes. We show this by constructively producing an arrangement of players that is individually stable as produced by Theorem 5.5, but is not core stable because a subset of the small players in and the large players doing local learning both strictly prefer some .
For this arrangement, we set (note the larger value). We fix and . Then, we calculate the error players get in various arrangements. (The code to produce these calculations is given in our GitHub repository: https://github.com/kpdonahue/model_sharing_games.)
First, we calculate the individually stable arrangement as given in Theorem 5.5. We claim that is such an arrangement. For reference, we calculate the errors the small and large players get in various arrangements.
Note that and : both the small and large players would rather participate. Because , we cannot add another large player to the coalition.
Next, we will show that the alternate coalition is one that the small players and the large players currently doing local learning would both strictly prefer.
Note that and .
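The deviation check in this example can also be written as a short program. The error values below are hypothetical placeholders, not the values computed from the example's parameters; the sketch only shows how one tests whether a candidate coalition strictly improves every joining player:

```python
# Hypothetical error table: errors[(h, g)] = (small_err, large_err) for a
# coalition with h small players and g large players.  These numbers are
# placeholders, not the values computed in the example.
errors = {
    (5, 1): (1.00, 0.90),   # the individually stable coalition
    (0, 1): (None, 1.20),   # a large player doing local learning
    (3, 2): (0.95, 0.85),   # candidate deviating coalition C
}

def strictly_improving(candidate, small_current, large_current):
    """True if both player types in `candidate` get strictly lower error
    than they currently receive."""
    small_err, large_err = errors[candidate]
    return small_err < small_current and large_err < large_current

# Small players currently sit in the (5, 1) coalition; the deviating
# large player currently does local learning:
print(strictly_improving((3, 2), errors[(5, 1)][0], errors[(0, 1)][1]))  # True
```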
Appendix D Supporting proofs for coarse-grained federation
See 6.1
Proof.
Taking the derivative of the error with respect to produces:
Setting this equal to 0 and solving for produces equal to
Note that this value depends on the player in question. It is also always strictly between 0 and 1. To confirm that this is a point of minimum error rather than maximum error, we can take the second derivative of the error, which gives a result that is always positive:
∎
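The shape of this argument can be seen in a simplified stand-in for the coarse-grained error. This is an illustrative toy objective, not the paper's exact expression: with err(w) = A·w² + B·(1 − w)² for constants A, B > 0 playing the roles of the two error contributions, the first-order condition yields an interior minimizer strictly between 0 and 1, and the second derivative is positive:

```python
# Simplified stand-in for the coarse-grained error (an illustrative toy
# objective, NOT the paper's exact expression): err(w) = A*w^2 + B*(1-w)^2
# with constants A, B > 0 playing the roles of the two error contributions.
#   d err/dw    = 2*A*w - 2*B*(1 - w)  =>  w* = B / (A + B), always in (0, 1)
#   d^2err/dw^2 = 2*(A + B) > 0        =>  w* is a minimum, not a maximum

def err(w, A, B):
    return A * w**2 + B * (1 - w)**2

def optimal_weight(A, B):
    return B / (A + B)

A, B = 0.7, 0.3
w_star = optimal_weight(A, B)
assert 0 < w_star < 1
# The interior optimum beats both endpoints (pure global and pure local):
assert err(w_star, A, B) < min(err(0, A, B), err(1, A, B))
```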
See 6.3
Proof.
For conciseness, we will rewrite the error and weight with . The error becomes:
and the weight becomes
Substituting in for gives the error as:
Collecting and pulling out common terms:
We multiply the top and bottom by :
which gives the desired value. ∎
Next, we will prove a series of lemmas describing how small and large players in a coalition using optimal coarse-grained federation see their error change as and increase.
Lemma D.1.
For optimal coarse-grained federation, small players always see their error decrease with the addition of more small players. That is,
Proof.
The error that the small players experience is where
Next, we consider the derivative of the error term with respect to , the number of small players. If this is always negative, then small players always see their error decrease with the addition of more small players. The derivative is negative when:
Calculating and simplifying the left-hand side of this equation gives:
Each individual term is positive: note that , so . The overall expression is the negative of a sum of positive terms, so the derivative of the error is always negative. ∎
Lemma D.2.
For optimal coarse-grained federation, there exists some such that for all , the small players always see their error decrease with the addition of more large players.
Proof.
The error that the small player experiences can be given by the ratio of as before. In this case, we are interested in the derivative with respect to , which is negative when:
Calculating and simplifying the left-hand side of this equation gives:
This term is negative whenever:
So, for sufficiently large , the derivative of the small player's error is always negative. Note that if the term on the RHS is negative, then the derivative is negative for any . ∎
Lemma D.3.
For optimal coarse-grained federation, there exists some such that for all , the large players always see their error decrease with the addition of more large players. If there are no small players (), then large players would most prefer to all federate together in .
Proof.
The error that the large players experience is where
The derivative of the error with respect to , the number of large players, is negative whenever:
Calculating and simplifying the left-hand side of this equation gives:
This term is negative whenever:
Note that if , this reduces to , which says that whenever large players are arranged without small players, they would most prefer to all be together. ∎
Lemma D.4.
For optimal coarse-grained federation, large players always see their error decrease with the addition of more small players. Because of this, for all : large players would always prefer federating with more small players.
Proof.
The error that the large players experience can be given by the ratio as before. In this case, we are interested in the derivative with respect to , which is negative when:
Gathering and simplifying the left-hand side of this equation gives:
Each individual term is positive: note that , so . The overall expression is the negative of a sum of positive terms, so the derivative of the error is always negative. ∎
Given these lemmas, we will divide the shape of the error curves into three components. Note that and always curve downwards. We will refer to the section of with error greater than the error the small players get in as and the region with error lower than this as . Similarly, we will define the region of where the error is higher than the error the large players get in as , and the region where the error is lower as .
Note that, by Lemma D.2, is always decreasing for all : before that, it is increasing. We will call the region where the error is increasing , the portion where the error is decreasing but still greater than the error the small player gets in as , and the final region, where the error is decreasing and lower than or equal to , as .
We will not divide the regions of curve because the proof below will not depend on those results.
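The region taxonomy can be illustrated with a toy curve. The labels A, B, and C below are placeholder names for the three regions just defined, and the error values are hypothetical: given errors err(g) that first increase and then decrease in g, each g is labeled relative to a reference error r (the error the small player gets in the arrangement being compared against):

```python
# Toy illustration of the region taxonomy.  "A" = error still increasing,
# "B" = error decreasing but above the reference level r, "C" = error
# decreasing and at most r.  Labels and values are placeholders.

def label_regions(errs, r):
    labels = []
    for g in range(1, len(errs)):
        if errs[g] > errs[g - 1]:
            labels.append("A")   # error still increasing in g
        elif errs[g] > r:
            labels.append("B")   # decreasing, but above the reference error
        else:
            labels.append("C")   # decreasing and at most the reference error
    return labels

# Hypothetical unimodal error curve and reference level:
errs = [1.0, 1.3, 1.5, 1.2, 0.9, 0.7]
print(label_regions(errs, r=1.0))   # ['A', 'A', 'B', 'C', 'C']
```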
We will use as a shorthand for .
See 6.5
Proof.
In this proof, we will use the results of Lemmas D.1, D.2, D.3, and D.4. The statement of the theorem has three parts (including the equality ), which we will prove in sequence. We note that Lemma D.4 indicates that always: the large players always prefer federating with the smaller players to being alone.
Case:
In this case, we will show that is strictly core stable.
We know from the precondition that the point is in either region or of . We will show that any other arrangement is one where the small players get higher error than in . We will assume : if and , then we know by Lemma D.1. We will do this by starting at the point and moving along the relevant error curves towards , showing that the small players experience strictly greater error than in .
First, we will consider the curve . From Lemma D.1, we know that it is decreasing in . We will move from to , which decreases error. (If , then this leaves error unchanged.) Next, we will consider the point on the curve . If it is in region or , then we know
We will show that it is impossible for it to be in region . If it were, then moving along this curve from to would only decrease the error the small player experiences, because the slope is decreasing with in region . But then that would imply that is in region of this graph, and so , which contradicts the precondition of this case.
This shows that the small players minimize their error in , so they will not even weakly prefer any other arrangement. Given this, any deviation from could only involve large players. However, by Lemma D.3, out of any arrangement excluding small players, the large players minimize their error in (their current arrangement) so they would not even weakly prefer any other arrangement.
This implies that is strictly core stable given .
Case:
In this case, we assume that the small players are indifferent between and and will show that is strictly core stable.
The key part of this proof is to show that, if the small players are indifferent between and , then they get strictly greater error in every other arrangement that isn't equal to or .
The precondition means that is in region of both graphs. To move to , first we move along from to . In doing so, we end up in either region or of the curve: note that we cannot remain in region unless . This is because, by definition of the regions, region has error that decreases with , which means that error strictly increases from when is decreased. This means that if , gives the small player greater error than it gets in (or equivalently, in ).
If the point is in region or of , then it is also in region of the curve . Reducing from to can only increase the error the small player experiences. This shows that the small players get strictly greater error in any arrangement that is not or .
Note: if , then point is in region of both curves, but because , we must have , so .
Next, we will show that is strictly core stable.
As shown above, the small players get higher error in any arrangement besides and . If they are in , they cannot be convinced to deviate to any arrangement except , and then only weakly (they get identical error). Conditional on the small players all being in , the large players would most prefer to be in , since by Lemma D.3 they strictly prefer it to . This implies that there does not exist a group of players all of whom weakly wish to defect, with at least one strictly wishing to defect: this is exactly the condition for strict core stability.
Case:
We will show that is strictly core stable.
The precondition means that is in region of both small-player curves. Consider an arbitrary other coalition : we will show that the small players get strictly greater error there than in .
First, we start at the arrangement . Consider moving along the curve : we hold constant and reduce from to . We could end up in region or of this curve.
First, assume we end up in region . This means that, as we've reduced , our error has been monotonically increasing (unless , in which case it is unchanged). We also know that, in the curve , the point is in region of this graph. Next, we similarly move along this curve to reduce from to . From Lemma D.1, we know is monotonically decreasing in , which means reducing monotonically increases the small player's error (unless , in which case the error is unchanged). Because we have assumed either or , we have produced a path of monotonically increasing error from to , so we know the small player experiences greater error here.
Next, we will instead assume that we ended up in region or of the curve when we reduced . By definition of the and regions, we know that the small player experiences larger error in than it would in . (The one exception is the point , which is in region and which obviously gives equal error to ). Then, we know that in the curve , the point must be in region . Because is monotonically decreasing in , reducing from to monotonically increases error. We then know that:
where the last inequality comes from the premise of this statement. The first and second inequalities are strict if or , respectively.
We have just shown that the small player prefers to any other arrangement. If there were going to be a defecting group, it would have to involve only large players. However, the arrangement where large players (federating by themselves) get lowest error is in , which by Lemma D.4 gives them higher error than .
By this reasoning, is strictly core stable. ∎
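The strict-core condition invoked in each case above can be stated as a small program. All error values below are hypothetical placeholders; the sketch encodes only the definition (no deviating group in which every member weakly improves and at least one strictly improves):

```python
# Sketch of the strict-core condition used throughout this appendix: an
# arrangement is strictly core stable iff no group can deviate such that
# every member weakly improves and at least one strictly improves.
# All error values below are hypothetical placeholders.

def blocks(current_err, deviation_err):
    """Does `deviation_err` (player -> error after deviating) block the
    arrangement given by `current_err` (player -> current error)?"""
    members = deviation_err.keys()
    all_weak = all(deviation_err[p] <= current_err[p] for p in members)
    some_strict = any(deviation_err[p] < current_err[p] for p in members)
    return all_weak and some_strict

def strictly_core_stable(current_err, candidate_deviations):
    return not any(blocks(current_err, d) for d in candidate_deviations)

current = {"s1": 1.0, "s2": 1.0, "l1": 0.8}
deviations = [
    {"s1": 1.0, "s2": 1.0},          # ties only: does not block
    {"s1": 1.1, "l1": 0.7},          # s1 strictly worse: does not block
]
assert strictly_core_stable(current, deviations)

deviations.append({"s1": 0.9, "l1": 0.8})  # s1 strictly better, l1 tied: blocks
assert not strictly_core_stable(current, deviations)
```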
Appendix E Supporting proofs for fine-grained federation
See 7.1
Proof.
To minimize, we will take the derivative of player 's error with respect to the weight. Note that we only have so appears twice in the component involving . Rewriting the error gives:
Taking the derivative with respect to gives:
To confirm that we are finding a minimum rather than a maximum, we note that the second derivative is always positive:
We first simplify the derivative by substituting in for :
And then solve for to obtain:
To find , we use the fact that all of the weights sum to 1:
Next, we define . This allows us to rewrite the term as:
Similarly, we can rewrite:
∎
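The final steps above follow a standard pattern that can be illustrated numerically: minimizing a quadratic objective subject to the weights summing to 1. The matrix Q below is a hypothetical stand-in for the fine-grained error coefficients (the true objective in Theorem 7.1 is the paper's fine-grained MSE); the sketch only shows the Lagrange-multiplier mechanics:

```python
import numpy as np

# Generic sketch of the constrained-minimization pattern: minimize
# w^T Q w subject to sum(w) = 1.  Setting the Lagrangian's gradient to
# zero gives Q w = (lam / 2) * 1, so w is proportional to Q^{-1} 1,
# normalized so the weights sum to 1.  Q is a hypothetical stand-in.

Q = np.diag([1.0, 2.0, 4.0])   # hypothetical positive-definite coefficients
w = np.linalg.solve(Q, np.ones(3))
w /= w.sum()

assert abs(w.sum() - 1.0) < 1e-12
# With diagonal Q, weights are inversely proportional to the diagonal
# (noisier players receive less weight): w = [4/7, 2/7, 1/7].
assert np.allclose(w, np.array([4.0, 2.0, 1.0]) / 7.0)
```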