
Model-sharing Games:
Analyzing Federated Learning Under Voluntary Participation

Kate Donahue,1 Jon Kleinberg1,2
Abstract

Federated learning is a setting where agents, each with access to their own data source, combine models learned from local data to create a global model. If agents are drawing their data from different distributions, though, federated learning might produce a biased global model that is not optimal for each agent. This means that agents face a fundamental question: should they join the global model or stay with their local model? In this work, we show how this situation can be naturally analyzed through the framework of coalitional game theory.

Motivated by these considerations, we propose the following game: there are heterogeneous players with different model parameters governing their data distribution and different amounts of data they have noisily drawn from their own distribution. Each player’s goal is to obtain a model with minimal expected mean squared error (MSE) on their own distribution. They have a choice of fitting a model based solely on their own data, or combining their learned parameters with those of some subset of the other players. Combining models reduces the variance component of their error through access to more data, but increases the bias because of the heterogeneity of distributions.

In this work, we derive exact expected MSE values for problems in linear regression and mean estimation. We use these values to analyze the resulting game in the framework of hedonic game theory; we study how players might divide into coalitions, where each set of players within a coalition jointly construct model(s). We analyze three methods of federation, modeling differing degrees of customization. In uniform federation, the agents collectively produce a single model. In coarse-grained federation, each agent can weight the global model together with their local model. In fine-grained federation, each agent can flexibly combine models from each other agent in the federation. For each method, we constructively analyze the stable partitions of players into coalitions.

1 Introduction

Imagine a situation as follows: a hospital is trying to evaluate the effectiveness of a certain procedure based on data it has collected from procedures done on patients in its facilities. It seems likely that certain attributes of the patient influence the effectiveness of the procedure, so the hospital analysts opt to fit a linear regression model with parameters $\hat{\bm{\theta}}$. However, because of the limited amount of data the hospital has access to, this model has relatively high error. Luckily, other hospitals also have data from implementations of this same procedure. However, for reasons of privacy, data incompatibility, data size, or other operational considerations, the hospitals do not wish to share raw patient data. Instead, they opt to combine their models by taking a weighted average of the parameters learned by each hospital. If there are $M$ hospitals and hospital $i$ has $n_{i}$ samples, the combined model parameters would look like:

$\hat{\bm{\theta}}^{f}=\frac{1}{\sum_{i=1}^{M}n_{i}}\sum_{i=1}^{M}\hat{\bm{\theta}}_{i}\cdot n_{i}$
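As a concrete (purely illustrative) sketch, the snippet below combines locally fitted coefficient vectors by this sample-size-weighted average; the hospital names, coefficients, and sample counts are all hypothetical.

```python
import numpy as np

# Hypothetical local regression coefficients and sample counts for three
# hospitals (values invented for illustration only).
local_params = {
    "hospital_A": (np.array([0.8, 1.9]), 50),   # (theta_hat_i, n_i)
    "hospital_B": (np.array([1.1, 2.2]), 120),
    "hospital_C": (np.array([0.9, 2.0]), 30),
}

# Combined model: average of the local parameters, weighted by n_i.
N = sum(n for _, n in local_params.values())
theta_f = sum(theta * n for theta, n in local_params.values()) / N
print(theta_f)  # weighted-average coefficient vector
```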

The situation described above can be viewed as a stylized model of federated learning. Federated learning is a distributed learning process that is currently experiencing rapid innovation and widespread implementation (Li et al. 2020; Kairouz et al. 2019). It is used in cases where data is distributed across multiple agents and cannot be combined centrally for training. For example, federated learning is implemented in word prediction on cell phones, where transferring the raw text data would be infeasible given its large size (and sensitive content). The motivating factor for using federated learning is that access to more data reduces the variance in a learned model, reducing its error.

However, there could be a downside to using federated learning. In the hospital example, it seems quite reasonable that certain hospitals might have different true generating models for their data, based on differences in patient populations or variants of the procedure's implementation, for example. Two dissimilar hospitals that federate together will see a decrease in the component of their model's error due to variance, but an increase in the component due to bias. This raises some fundamental questions for each participating hospital, or, more generally, for each agent $i$ considering federated learning. Which other agents should $i$ federate with in order to minimize its error? Will those other agents be interested in federating with $i$? Does there exist some stable arrangement of agents into federating clusters, and if so, what does that arrangement look like?

Numerous works have explored the issue of heterogeneous data in federated learning; we discuss specifically how they relate to ours in a later section. Often, the goal in these lines of work is to achieve equality in the error rates guaranteed to each agent, potentially by actively collecting more data or by using transfer learning to ensure the model better fits local data. However, to our knowledge, there has not yet been work that systematically looks at the participation questions inherent in federated learning through the lens of game theory, especially the theory of hedonic games, which studies the formation of self-sustaining coalitions.

In a hedonic game, players are grouped together into clusters or coalitions: the overall collection of coalitions is called a coalition structure. Each player's utility depends solely on the identity of the other players in its coalition. A common question in hedonic games is the stability of a coalition structure. A coalition structure $\Pi$ is core-stable (or "in the core") if there does not exist a coalition $C$ so that every player in $C$ prefers $C$ to its coalition in $\Pi$. A coalition structure is strictly core-stable if there does not exist a coalition $C$ so that every player in $C$ weakly prefers $C$ to its coalition in $\Pi$, and at least one player in $C$ strictly prefers $C$ to its coalition in $\Pi$. A coalition structure is individually stable if there does not exist a coalition $C\in\Pi$ so that a player $i\notin C$ prefers $C\cup\{i\}$ to its arrangement in $\Pi$ and all players in $C$ weakly prefer $C\cup\{i\}$ to $C$ (Bogomolnaia and Jackson 2002).

To explain the analogy between federated learning and hedonic games, we first consider each agent in federated learning to be a player in a hedonic game. A player is in coalition with other players if it is federating with them. Its cost is its expected error in a given federating cluster, which depends only on the identity of the other players in that cluster. A player is assumed to be able to move between federating clusters only if doing so would benefit itself and not harm the players in the cluster it joins; notably, we allow players to freely leave a cluster, even if doing so would harm the players left behind.

The present work: Analyzing federated learning through hedonic game theory

In this work, we use the framework of hedonic games to analyze the stability of coalitions in data-sharing applications that capture key issues in federated learning. By working through a sequence of deliberately stylized models, we obtain some general insights about participation and stability in these kinds of applications.

For the first case, we analyze uniform federation. In this simplest case, a federating cluster produces a single model, which each player uses. For uniform federation, we first consider the case where all players have the same number of data points. We show that in this game, when the number of data points $n$ is fairly small, the only core-stable coalition structure is to have all players federating together, in the "grand coalition". When $n$ is large, the only core-stable coalition structure is to have all players separate (doing local learning). There is a single intermediate value of $n$ at which every coalition structure is core-stable. Next, we analyze the case where players come in one of two sizes ("small" or "large"). The analysis is more complicated, but we demonstrate constructively that there always exists an individually stable partition of players into clusters.

Besides uniform federation, we also analyze two other forms of federation. For coarse-grained federation, the federating cluster still produces a single model, but each player can weight the global model with their local model, allowing some personalization. For coarse-grained federation, when all players have the same number of samples, we show that the grand coalition (all players federating together) is always the only core stable arrangement. For the small/large case, we produce a simple, efficient algorithm to calculate a strictly core-stable arrangement. Additionally, we show that, for this federation method, the grand coalition is always individually stable (no player wishes to unilaterally defect).

Finally, for fine-grained federation, each player is allowed to take the local models of other players in the federation and combine them using whichever weights they choose to produce a model customized for their use. With fine-grained federation, we show that the grand coalition is always core stable.

We are only able to produce these hedonic game theory results because of our derivations of exact error values for the underlying inference problems. We calculate these values for all three methods of federation and for two settings: 1) a mean estimation problem and 2) a linear regression problem. The error values depend on the number of samples each agent has access to, with the expectation taken over the values of the samples each agent draws as well as over the possibly different true parameters each player is trying to model. Our results do not depend on the form of the generating distributions, relying only on two summary parameters.

The results in this paper are theoretical and do not depend on any experiments or code. However, while writing the paper, we found it useful to write and work with code to check conjectures. Some of that code is publicly available at https://github.com/kpdonahue/model_sharing_games.

Before moving to the main technical content, the next section walks through a motivating example, followed by a review of related literature and a description of our model and assumptions. Beyond technical assumptions, recent work (Cooper 2020) has highlighted the importance of describing the normative assumptions researchers make: we also include a summary of the most important normative assumptions of our analysis.

Related works

Incentives and federated learning:

Blum et al. (2017) describes an approach to handling heterogeneous data in which more samples are iteratively gathered from each agent in a way that incentivizes all agents to participate in the grand coalition during federated learning. Duan et al. (2021) builds a framework to schedule data augmentation and resampling. Yu, Bagdasaryan, and Shmatikov (2020) demonstrates empirically that there can be cases where individuals get lower error with local training than with federated learning, and evaluates empirical solutions. Wang et al. (2020) analyzes the question of when it makes sense to split or not to split datasets drawn from different distributions. Finally, Blum et al. (2020) analyzes notions of envy and efficiency with respect to sampling allocations in federated learning.

Transfer learning:

Mansour et al. (2020) and Deng, Kamani, and Mahdavi (2020) both propose theoretical methods for using transfer learning to minimize the error provided to agents with heterogeneous data. Li et al. (2019) and Martinez, Bertran, and Sapiro (2020) both provide methods to produce a more uniform level of error rates across agents participating in federated learning.

Clustering and federated learning:

Sattler, Muller, and Samek (2020) and Shlezinger, Rini, and Eldar (2020) provide algorithms to "cluster" together players with similar data distributions, with the aim of providing them with lower error. They differ from our approach in that they consider the case where there is some knowledge of each player's data distribution, whereas we only assume knowledge of the number of data points. Additionally, their approach doesn't explicitly consider agents to be game-theoretic actors in the way that ours does. Interestingly, Guazzone, Anglano, and Sereno (2014) uses a game-theoretic framework to analyze federated learning, but with the aim of minimizing energy usage, not error rate.

2 Motivating example

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       2                2                2
$\{a,b\},\{c\}$           1.5              1.5              2
$\{a,b,c\}$               1.3              1.3              1.3
Table 1: The expected errors using uniform federation for players in each coalition when all three players have 5 samples each, with parameters $\mu_{e}=10,\sigma^{2}=1$. Each row denotes a different coalition partition: for example, $\{a,b\},\{c\}$ indicates that players $a$ and $b$ are federating together while $c$ is alone.

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       2                2                0.4
$\{a,b\},\{c\}$           1.5              1.5              0.4
$\{a\},\{b,c\}$           2                1.72             0.39
$\{a,b,c\}$               1.55             1.55             0.41
Table 2: The expected errors using uniform federation for players in each coalition when players $a$ and $b$ have 5 samples each and player $c$ has 25 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       0.4              0.4              0.4
$\{a,b\},\{c\}$           0.7              0.7              0.4
$\{a,b,c\}$               0.8              0.8              0.8
Table 3: The expected errors using uniform federation for players in each coalition when players $a,b,c$ each have 25 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

To motivate our problem and clarify the types of analyses we will be exploring, we first work through a simple mean estimation example. (The Github repository contains numerical calculations and full tables for this section.) Calculating the error each player can expect requires two parameters: $\mu_{e}$, which reflects the average error each player experiences when sampling data from its own personal distribution, and $\sigma^{2}$, which reflects the variance in the true parameters across players. In this section we will use $\mu_{e}=10,\sigma^{2}=1$, but will discuss later how to handle the case where they are imperfectly known.

First, we will analyze uniform federation with three players, $a$, $b$, and $c$. We will first assume that each player has 5 samples from its local data distribution: Table 1 gives the error each player can expect in this situation. Arrangements equivalent up to renaming of players are omitted. Every player sees its error minimized in the "grand coalition" $\pi_{g}$ where all three players are federating together. This implies that the only arrangement that is stable (core-stable or individually stable) is $\pi_{g}$.

Next, we assume that player $c$ increases the number of samples it has from 5 to 25: Table 2 shows the error each player can expect in this situation. Here, the players have different preferences over which arrangement they would most prefer. The "small" players $a$ and $b$ would most prefer $\{a,b\},\{c\}$, whereas the "large" player $c$ would most prefer $\{a\},\{b,c\}$ or (identically) $\{b\},\{a,c\}$. However, out of all of these coalition structures, only $\{a,b\},\{c\}$ is stable (either core-stable or individually stable). Note that $\{a\},\{b,c\}$ is not stable because the coalition $C=\{a,b\}$ is one where each player prefers $C$ to its current situation.

Third, we assume that all three players have 25 samples: this example is shown in Table 3. As in Table 1, the players have identical preferences. However, in this case, the players minimize their error by staying alone. The stability results from this example are part of a broader pattern we will analyze in later sections.
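The numbers in Tables 1-3 can be reproduced from the closed-form mean-estimation error for uniform federation derived later (Corollary 4.3). The sketch below is a minimal illustration of that calculation for the Table 2 setup; it is provided only to make the example concrete.

```python
mu_e, sigma2 = 10.0, 1.0

def uniform_err(n_j, coalition):
    """Expected MSE of a player with n_j samples in a federating coalition,
    using the mean-estimation formula for uniform federation (Corollary 4.3):
    mu_e / N + (sum_{i != j} n_i^2 + (N - n_j)^2) / N^2 * sigma^2."""
    N = sum(coalition)
    others = sum(n * n for n in coalition) - n_j * n_j
    return mu_e / N + (others + (N - n_j) ** 2) / N ** 2 * sigma2

# Table 2 setup: players a and b have 5 samples each, player c has 25.
print(uniform_err(5, [5]))            # a alone: 2.0
print(uniform_err(5, [5, 5]))         # a in {a,b}: 1.5
print(uniform_err(25, [5, 25]))       # c in {b,c}: ~0.39
print(uniform_err(5, [5, 5, 25]))     # a in the grand coalition: ~1.55
print(uniform_err(25, [5, 5, 25]))    # c in the grand coalition: ~0.41
```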

Next, we will explore the two other methods of federation: coarse-grained and fine-grained. Both offer some degree of personalization, with varying levels of flexibility.

Table 4 shows an example using coarse-grained federation with four players: three have 30 samples each, and the fourth player has 300 samples. We assume the weights $w_{j}$ are set optimally for each player. For conciseness, some columns and rows are omitted. Note that both player types get lower error in $\pi_{g}$ than they would with local learning: that is, $\pi_{g}$ is individually stable (stable against defections of any player alone). However, it is also clear that $\pi_{g}$ is not core-stable: in particular, the three small players would get lower error in $\{a,b,c\}$ than in $\pi_{g}$. These results will be examined theoretically in later sections: with optimal weighting, coarse-grained federation always has an individually stable $\pi_{g}$ that is not necessarily core-stable.

Coalition structure           $err_a(\cdot)$   $err_d(\cdot)$
$\{a\},\{b\},\{c\},\{d\}$     0.333            0.0333
$\{a,b,c\},\{d\}$             0.278            0.0333
$\{a,b,c,d\}$                 0.280            0.0326
Table 4: The expected errors using optimal coarse-grained federation when players $a,b,c$ each have 30 samples, while player $d$ has 300 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

Finally, we examine fine-grained federation. Table 5 analyzes the same case as the coarse-grained example above, but with optimally weighted fine-grained federation. The full error table shows that $\pi_{g}$ is core-stable because each player minimizes its error in this arrangement. This result holds in general: when optimal fine-grained federation is used, $\pi_{g}$ always minimizes error for every player and is thus core-stable.

Coalition structure           $err_a(\cdot)$   $err_d(\cdot)$
$\{a\},\{b\},\{c\},\{d\}$     0.333            0.0333
$\{a,b,c\},\{d\}$             0.278            0.0333
$\{a,b,c,d\}$                 0.269            0.0325
Table 5: The expected errors using optimal fine-grained federation when players $a,b,c$ each have 30 samples, while player $d$ has 300 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

In later sections we will give theoretical results that explain this example more fully, but understanding the core-stable partitions here will help to build intuition for more general results.

3 Model and assumptions

Model and technical assumptions

This section introduces our model. We assume that there is a fixed set of $[M]$ players, and player $j$ has a fixed number of samples, $n_{j}$. Though the number of samples is fixed, it is possible to analyze a varying number of samples by investigating all games involving the relevant numbers of samples. Each player draws their true parameters i.i.d. (independently and identically distributed): $(\theta_{j},\epsilon^{2}_{j})\sim\Theta$. Here $\epsilon^{2}_{j}$ represents the amount of noise in the sampling process for a given player.

In the case of mean estimation, $\theta_{j}$ is a scalar representing the true mean of player $j$. Player $j$ draws samples i.i.d. from its true distribution: $Y\sim\mathcal{D}_{j}(\theta_{j},\epsilon^{2}_{j})$. Samples are drawn with variance $\epsilon^{2}_{j}$ around the true mean of the distribution.

In the case of linear regression, $\bm{\theta}_{j}$ is a $D$-dimensional vector representing the coefficients of the true regression function, which is assumed to be linear. Each player draws $n_{j}$ input datapoints from their own input distribution $\bm{X}_{j}\sim\mathcal{X}_{j}$ such that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{T}\bm{x}]=\Sigma_{j}$. They then noisily observe the outputs, drawing values i.i.d. $\bm{Y}_{j}\sim\mathcal{D}_{j}(\bm{X}_{j}^{T}\bm{\theta}_{j},\epsilon^{2}_{j})$, where $\epsilon^{2}_{j}$ again denotes the variance with which samples are drawn around the true mean.

There are three methods of federation. In uniform federation, a single model is produced for all members of the federating coalition:

$\hat{\bm{\theta}}^{f}=\frac{1}{\sum_{i=1}^{M}n_{i}}\sum_{i=1}^{M}\hat{\bm{\theta}}_{i}\cdot n_{i}$

In coarse-grained federation, each player has a parameter $w_{j}$ that it uses to weight the global model together with its own local model, producing an averaged model:

$\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\theta}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\theta}_{i}\cdot n_{i}$

for $w_{j}\in[0,1]$. Note that $w_{j}=0$ corresponds to unweighted federated learning and $w_{j}=1$ corresponds to pure local learning. Finally, with fine-grained federation, each player $j$ has a vector of weights $\bm{v}_{j}$ that they use to weight every other player's contribution to their estimate:

$\hat{\theta}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\theta}_{i}$

for $\sum_{i=1}^{M}v_{ji}=1$. Note that we can recover the $w$-weighted (coarse-grained) case by setting $v_{jj}=w+\frac{(1-w)\cdot n_{j}}{N}$ and $v_{ji}=(1-w)\cdot\frac{n_{i}}{N}$ for $i\neq j$. Coarse-grained and fine-grained federation each have player-specific parameters ($w$, $\bm{v}$) that can be tuned. When those parameters are set optimally for the given player, we refer to the models as "optimal" coarse-grained or fine-grained federation. We show in later sections how to calculate the optimal weights.
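As an illustrative sketch (not the implementation used in the paper), the three combination rules could be written as follows, given each player's local estimate and sample count; the numeric values at the end are hypothetical.

```python
def uniform_model(thetas, ns):
    """Single global model: local estimates averaged with weights n_i / N."""
    N = sum(ns)
    return sum(t * n for t, n in zip(thetas, ns)) / N

def coarse_grained_model(j, w_j, thetas, ns):
    """Player j's model: weight w_j on its own local estimate and 1 - w_j on
    the global (uniform) model."""
    return w_j * thetas[j] + (1 - w_j) * uniform_model(thetas, ns)

def fine_grained_model(v_j, thetas):
    """Player j's model: a convex combination of all local estimates, with
    player-specific weights v_j summing to 1."""
    return sum(v * t for v, t in zip(v_j, thetas))

# Hypothetical local mean estimates and sample counts for three players.
thetas, ns = [9.8, 10.3, 10.1], [5, 5, 25]
print(uniform_model(thetas, ns))
print(coarse_grained_model(0, 0.4, thetas, ns))
print(fine_grained_model([0.5, 0.2, 0.3], thetas))
```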

We denote $\mu_{e}=\mathbb{E}_{(\theta_{i},\epsilon^{2}_{i})\sim\Theta}[\epsilon^{2}_{i}]$: the expectation of the error parameter. In the mean estimation case, $\sigma^{2}=\mathrm{Var}(\theta_{i})$ represents the variance of the true means across players. In the linear regression case, $\sigma^{2}_{d}=\mathrm{Var}(\bm{\theta}^{d}_{i})$ for $d\in[D]$.

We assume that each player knows how many samples it has access to. It may or may not have access to the data itself, but it does not know how its own values (or parameters) differ from those of the other players. For example, it does not know whether its data is unusually noisy or whether its true mean lies far from the true means of the other players.

All of the stability analysis results depend on the parameters $\mu_{e}$ and $\sigma^{2}$. However, the reliance is fairly weak: often a player only needs to know whether the number of samples it has, $n_{j}$, is larger or smaller than the ratio $\frac{\mu_{e}}{\sigma^{2}}$.

Much of this paper analyzes the stability of coalition structures. Analyzing stability is most directly relevant when players can actually move between coalitions. However, even if players are not able to move, the stability of a coalition structure tells us something about its optimality for each set of players.

Normative assumptions

This paper is primarily descriptive: it aims to model a phenomenon in the world, not to say whether that phenomenon is good or bad. For example, it could be that society as a whole values situations where many players federate together, and might wish to require players to do so, regardless of whether this minimizes their error. It might also be the case that society prefers that all players, regardless of how many samples they have access to, obtain roughly similar error rates. Our use of the expected mean squared error is also worth reflecting on: it assumes that over- and under-estimates are equally costly and that larger mis-estimates are more costly. More subtly, we take the expected MSE over the parameter draws $\mathbb{E}_{(\theta_{i},\epsilon^{2}_{i})\sim\Theta}$. A player whose true mean happens to fall far from the population mean might experience a much higher error than its expected MSE.

In the entirety of this paper, we take as fixed the requirement that data not be shared, for reasons of privacy or technical capability, and so we implicitly value that requirement more than the desire for lower error. We also assume that the problem at hand is completely encompassed by the machine learning task, which may overlook the possibility that non-machine-learning solutions are better suited. It may also be that technical requirements other than error rate are more important: for example, the desire to balance the amount of computation done by each agent.

4 Expected error results

This paper’s first contribution is to derive exact expected values for the MSE of players under different situations. The fact that these values are exact allows us to precisely reason about each player’s incentives in later sections. We will state the theorems here and provide the proofs in Appendix B.

The approach in this section is first to derive expected MSE values for the most general case and then to derive values for the other cases as corollaries. The most general case is linear regression with fine-grained federation. First, we note that we can derive coarse-grained or uniform federation by setting the $v_{ji}$ weights to the appropriate values. Next, we note that mean estimation is a special case of linear regression. For intuition, consider a model where a player draws an $x$ value that is deterministically 1, multiplies it by an unknown single parameter $\theta_{j}$, and then takes a measurement $y$ of this mean with noise $\epsilon^{2}_{j}$. This corresponds exactly to the mean estimation case, where a player has a true mean $\theta_{j}$ and observes $y$ as a sample with noise $\epsilon^{2}_{j}$. We can use this representation to simplify the error terms, with more details given in Appendix B.

First, we give the expected MSE for local estimation:

Theorem 4.1.

For linear regression, the expected MSE of local estimation for a player with $n_{j}$ samples is

$\mu_{e}\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]$

If the distribution of input values $\mathcal{X}_{j}$ is a $D$-dimensional multivariate normal distribution with mean 0, then the expected MSE of local estimation can be simplified to:

$\frac{\mu_{e}\cdot D}{n_{j}-D-1}$

In the case of mean estimation, the error term can be simplified to:

$\frac{\mu_{e}}{n_{j}}$

Next, we calculate the expected MSE for fine-grained federation:

Theorem 4.2.

For linear regression with fine-grained federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$

If the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, this can be simplified to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sigma^{2}$

Finally, we derive as corollaries the expected MSE for the uniform federation and the coarse-grained case.

Corollary 4.3.

For linear regression with uniform federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\,\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$

or, if the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, can be simplified to

$\mu_{e}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\frac{D}{n_{i}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\frac{\mu_{e}}{N}+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sigma^{2}$

where $N=\sum_{i=1}^{M}n_{i}$.
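As a quick consistency check (illustrative values only), plugging the uniform weights $v_{ji}=n_{i}/N$ into the mean-estimation form of Theorem 4.2 recovers the mean-estimation formula of Corollary 4.3:

```python
mu_e, sigma2 = 10.0, 1.0
ns = [5, 5, 25]          # illustrative sample counts
j = 0                    # evaluate the error of player 0
N = sum(ns)

# Fine-grained formula (Theorem 4.2) with uniform weights v_ji = n_i / N.
v = [n / N for n in ns]
var_term = mu_e * sum(v[i] ** 2 / ns[i] for i in range(len(ns)))
off = [v[i] for i in range(len(ns)) if i != j]
bias_term = (sum(x ** 2 for x in off) + sum(off) ** 2) * sigma2
fine = var_term + bias_term

# Uniform formula (Corollary 4.3).
uniform = mu_e / N + (sum(ns[i] ** 2 for i in range(len(ns)) if i != j)
                      + (N - ns[j]) ** 2) / N ** 2 * sigma2

print(fine, uniform)     # the two values agree
```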

Corollary 4.4.

For linear regression with coarse-grained federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+(1-w)^{2}\cdot\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\cdot(1-w)^{2}\cdot\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\,\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]+\mu_{e}\left(w^{2}+\frac{2(1-w)w\cdot n_{j}}{N}\right)\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{j},\epsilon^{2}_{j})}\left[(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]\right]$

or, if the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, can be simplified to

$\mu_{e}\cdot(1-w)^{2}\cdot\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\frac{D}{n_{i}-D-1}+\mu_{e}\cdot\left(w^{2}+\frac{2(1-w)\cdot w\cdot n_{j}}{N}\right)\cdot\frac{D}{n_{j}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\mu_{e}\left(\frac{w^{2}}{n_{j}}+\frac{1-w^{2}}{N}\right)+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot(1-w)^{2}\sigma^{2}$

where $N=\sum_{i=1}^{M}n_{i}$.
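The sketch below (illustrative sample counts, taken from the Table 4 setup) evaluates the coarse-grained mean-estimation error above over a grid of $w$ values for one player; consistent with Lemma 6.1 later, the error-minimizing weight is strictly less than 1.

```python
import numpy as np

mu_e, sigma2 = 10.0, 1.0
ns = [30, 30, 30, 300]   # illustrative sample counts (Table 4 setup)
j = 0
N = sum(ns)
others = sum(n ** 2 for i, n in enumerate(ns) if i != j) + (N - ns[j]) ** 2

def coarse_err(w):
    """Mean-estimation error of player j under coarse-grained federation
    with weight w (Corollary 4.4)."""
    return (mu_e * (w ** 2 / ns[j] + (1 - w ** 2) / N)
            + others / N ** 2 * (1 - w) ** 2 * sigma2)

ws = np.linspace(0.0, 1.0, 10001)
errs = coarse_err(ws)
print(ws[errs.argmin()], errs.min())  # best weight is strictly less than 1
```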

The exact MSE for linear regression follows a very similar form to that for mean estimation. In all cases, the bias component (the term involving $\sigma^{2}_{d}$) has exactly the same form and could be directly modified to mean estimation by using $(\sigma^{2})^{\prime}=\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$. The variance component (the term involving $\mu_{e}$) fits the exact form of mean estimation in the limit where $n_{j}\gg D$. In this case, the error can be modified to fit mean estimation by using $\mu_{e}^{\prime}=D\cdot\mu_{e}$. This approximation is good when there are many more samples than the dimension of the linear regression problem under investigation: for most cases of model fitting, this assumption is reasonable.

For the rest of the paper, we will use the $n_{j}\gg D$ assumption; consequently, all of our results apply equally to linear regression and mean estimation.

5 Uniform federation: coalition formation

In this section, we analyze the stability of coalition structures in the case that uniform federation is used. We consider two cases: 1) where all players have the same number of datapoints $n$, and 2) where all players have either a "small" or a "large" number of points. We will use $\pi_{l}$ to refer to the coalition partition where all players are alone and $\pi_{g}$ to refer to the grand coalition. Proofs from this section are given in Appendix C.

All players have the same number of samples

In this case, the analysis simplifies greatly:

Lemma 5.1.

If all players have the same number of samples $n$, then:

  •  If $n<\frac{\mu_{e}}{\sigma^{2}}$, players minimize their error in $\pi_{g}$.

  •  If $n>\frac{\mu_{e}}{\sigma^{2}}$, players minimize their error in $\pi_{l}$.

  •  If $n=\frac{\mu_{e}}{\sigma^{2}}$, players are indifferent between any arrangement of players.

Proof.

In the case that all players have the same number of samples, we can use $n_{i}=n$ to simplify the error term:

$\frac{\mu_{e}}{M\cdot n}+\sigma^{2}\frac{M-1}{M}$

In order to see whether players would prefer a larger group (higher $M$) or a smaller group (smaller $M$), we take the derivative of the error with respect to $M$:

$-\frac{\mu_{e}}{M^{2}\cdot n}+\frac{\sigma^{2}}{M^{2}}=\frac{\sigma^{2}\cdot n-\mu_{e}}{n\cdot M^{2}}$

This is positive when $n>\frac{\mu_{e}}{\sigma^{2}}$: a player gets higher error the more players it is federating with. It is negative when $n<\frac{\mu_{e}}{\sigma^{2}}$: a player gets lower error the more players it is federating with. It is 0 when $n=\frac{\mu_{e}}{\sigma^{2}}$, which implies players are indifferent between different arrangements. Plugging $n=\frac{\mu_{e}}{\sigma^{2}}$ into the error equation gives $\frac{\mu_{e}\cdot\sigma^{2}}{M\cdot\mu_{e}}+\sigma^{2}\frac{M-1}{M}=\sigma^{2}$, which is exactly the error a player would get alone: $\frac{\mu_{e}}{n}=\frac{\mu_{e}\cdot\sigma^{2}}{\mu_{e}}=\sigma^{2}$. ∎
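A quick numeric illustration of this threshold behavior, using the same-size error formula from the proof and the running parameters $\mu_{e}=10,\sigma^{2}=1$ (so the threshold is $\mu_{e}/\sigma^{2}=10$):

```python
mu_e, sigma2 = 10.0, 1.0

def same_size_err(n, M):
    """Uniform-federation error when all M players have n samples each."""
    return mu_e / (M * n) + sigma2 * (M - 1) / M

# n below the threshold: error shrinks as the coalition grows.
print([round(same_size_err(5, M), 3) for M in range(1, 5)])
# n above the threshold: error grows as the coalition grows.
print([round(same_size_err(25, M), 3) for M in range(1, 5)])
```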

As a corollary, we can classify the core stable arrangements cleanly:

Corollary 5.2.

For uniform federation, if all players have the same number of samples $n$, then:

  •  If $n<\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{g}$ is the only partition that is core-stable.

  •  If $n>\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{l}$ is the only partition that is core-stable.

  •  If $n=\frac{\mu_{e}}{\sigma^{2}}$, any arrangement of players is core-stable.

Small & large player case

In this section, we add another layer of depth by allowing players to come in one of two "sizes". "Small" players have $n_{s}$ samples and "large" ones have $n_{\ell}$ samples, with $n_{s}<n_{\ell}$. We demonstrate that games of this form always have a stable partition by constructively producing a partition that is stable. Note that this is not true of hedonic games in general: as discussed in Bogomolnaia and Jackson (2002), there are multiple instances where a game might have no stable partition.

To characterize this space, we divide it into cases depending on the relative sizes of $n_{s},n_{\ell}$. We will use the notation $\pi(s,\ell)$ to denote a coalition with $s$ small players and $\ell$ large players, out of a total of $S$ and $L$ present. We write $\pi(s_{1},\ell_{1})\succ_{S}\pi(s_{2},\ell_{2})$ to mean that the small players prefer coalition $\pi(s_{1},\ell_{1})$ to $\pi(s_{2},\ell_{2})$, and $\pi(s_{1},\ell_{1})\succ_{L}\pi(s_{2},\ell_{2})$ to mean the same preference, but for large players.

Case 1: $n_{s},n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$

The first case is when $n_{s}$ is large: it turns out that each player minimizes their error by using local learning, which means that $\pi_{l}$ is in the core. The lemma below is more general than the small/large case, but implies that when $n_{s}>\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{l}$ is the only element in the core, and that when $n_{s}=\frac{\mu_{e}}{\sigma^{2}}$, any arrangement where the large players are alone is in the core.

Lemma 5.3.

For uniform federation, if $n_{i}>\frac{\mu_{e}}{\sigma^{2}}$ for all $i\in[M]$, then $\pi_{l}$ is the unique element in the core.
If $n_{i}\geq\frac{\mu_{e}}{\sigma^{2}}$ for all $i\in[M]$, with $n_{k}>\frac{\mu_{e}}{\sigma^{2}}$ for at least one player $k$, then any arrangement where the players with samples $n_{k}>\frac{\mu_{e}}{\sigma^{2}}$ are alone is in the core.

Case 2: $n_{s},n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$

Next, we consider the case where both the small and large players have a relatively small number of samples. In this situation, it turns out that the grand coalition is core stable.

Theorem 5.4.

For uniform federation, if $n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$ and $n_{s}<n_{\ell}$, then the grand coalition $\pi_{g}$ is core-stable.

Case 3: $n_{s}<\frac{\mu_{e}}{\sigma^{2}}$, $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$

Finally, we consider the case where the small players have a number of samples below the $\frac{\mu_{e}}{\sigma^{2}}$ boundary, while the large players have a number of samples above this threshold.

Theorem 5.5.

Assume uniform federation with $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$. Then, there exists an arrangement of small and large players that is individually stable, and a computationally efficient algorithm to calculate it.

The proof of Theorem 5.5 is constructive: it gives an exact arrangement that is individually stable. One natural question is whether this arrangement is also core-stable. The answer is "no": we show that this arrangement can fail to be core-stable. This avenue is explored further in Appendix C.

6 Coarse-grained federation

In this section, we analyze coarse-grained federation. As a reminder, in this situation, each player has a parameter $w_{j}$ that it uses to weight the global model with its own local model:

$\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\theta}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\theta}_{i}\cdot n_{i}$

for $w_{j}\in[0,1]$. All proofs from this section are given in Appendix D.

Note that the $w_{j}$ value is a parameter that each player can set independently. The lemma below analyzes the optimal value of $w_{j}$ and tells us that each player would prefer federation, in some form, to being alone.

Lemma 6.1.

For coarse-grained federation, the minimum error is always achieved when $w_{j}<1$, implying that federation is always preferable to local learning.

Corollary 6.2.

For coarse-grained federation, when $w_{j}$ is set optimally, the grand coalition $\pi_{g}$ is always individually stable.

Specifically, this means that no player wishes to unilaterally deviate from $\pi_{g}$. However, this does not mean that each player prefers the grand coalition $\pi_{g}$ to every other federating coalition: Section 2 gives an example where the grand coalition $\pi_{g}$ is not core-stable.

In the rest of this section, we will analyze the stability of coalition structures in the case that the $w$ parameters are set optimally (optimal coarse-grained federation). First, we will find it useful to derive the closed-form value of the expected MSE of a player using optimal coarse-grained federation:

Lemma 6.3.

A player with $n_{j}$ samples using the optimal coarse-grained federation weight has expected MSE:

$\frac{\mu_{e}\cdot(N-n_{j})+\left(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}\right)\cdot\sigma^{2}}{(N-n_{j})\cdot N+n_{j}\cdot\left(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}\right)\cdot\frac{\sigma^{2}}{\mu_{e}}}$

where $N=\sum_{i=1}^{M}n_{i}$.
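As a sanity check, the sketch below evaluates this closed form for the Table 4 setup (three players with 30 samples each, one with 300) and recovers the optimal coarse-grained entries shown there; it is an illustration only.

```python
mu_e, sigma2 = 10.0, 1.0

def optimal_coarse_err(n_j, coalition):
    """Expected MSE under optimally weighted coarse-grained federation
    (Lemma 6.3) for a player with n_j samples in a non-trivial coalition."""
    N = sum(coalition)
    others = sum(n ** 2 for n in coalition) - n_j ** 2 + (N - n_j) ** 2
    num = mu_e * (N - n_j) + others * sigma2
    den = (N - n_j) * N + n_j * others * sigma2 / mu_e
    return num / den

# Table 4 setup: players a, b, c have 30 samples each, player d has 300.
print(optimal_coarse_err(30, [30, 30, 30]))        # a in {a,b,c}: ~0.278
print(optimal_coarse_err(30, [30, 30, 30, 300]))   # a in the grand coalition: ~0.280
print(optimal_coarse_err(300, [30, 30, 30, 300]))  # d in the grand coalition: ~0.0326
```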

All players have the same number of samples

Lemma 6.4 is the analog to Lemma 5.1 in the previous section. Here, the results differ: with optimal coarse-grained federation, the grand coalition $\pi_{g}$ is always the only stable arrangement, no matter how small or large $n$ is relative to $\frac{\mu_{e}}{\sigma^{2}}$.

Lemma 6.4.

For mean estimation with coarse-grained federation, if $n_{j}=n$ for all players $j$, then $\pi_{g}$ is the only element in the core.

Proof.

Using the error term derived in Lemma 6.3, plugging in $n_{i}=n$ and simplifying gives:

$\frac{\frac{\mu_{e}^{2}}{n\cdot M}+\mu_{e}\cdot\sigma^{2}}{\mu_{e}+n\cdot\sigma^{2}}$

As $M$ increases, the error always decreases (only the numerator depends on $M$), so $\pi_{g}$ is where each player minimizes their error and is thus core-stable. ∎

Small & large player case

In this subsection, we similarly extend results to the "small" and "large" case introduced in the previous section. The analysis turns out to be much simpler than in the uniform federation case, and also produces stronger results: strict core stability, rather than individual stability.

Theorem 6.5.

If optimal coarse-grained federation is used, then:

  •  If $\pi_{g}\preceq_{S}\pi(S,0)$ (small players weakly prefer $\pi(S,0)$), then $\{\pi(S,0),\pi(0,L)\}$ is strictly core-stable.

  •  If $\pi_{g}\succ_{S}\pi(S,0)$ (small players strictly prefer $\pi_{g}$), then $\pi_{g}$ is strictly core-stable.

7 Fine-grained federation

In this section, we analyze fine-grained federation. As a reminder, with this method, each player $j$ has a vector of weights $\bm{v}_{j}$ that they use to weight every other player's contribution to their estimate:

$\hat{\theta}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\theta}_{i}$

for $\sum_{i=1}^{M}v_{ji}=1$.

We now calculate the optimal $\bm{v}$ weights, those minimizing player $j$'s error.

Lemma 7.1.

Define $V_{i}=\sigma^{2}+\frac{\mu_{e}}{n_{i}}$. Then, the values of $\{v_{ji}\}$ that minimize player $j$'s error are:

$v_{jj}=\frac{1+\sigma^{2}\sum_{i\neq j}\frac{1}{V_{i}}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}$
$v_{jk}=\frac{1}{V_{k}}\cdot\frac{V_{j}-\sigma^{2}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}\quad k\neq j$

The proof of this lemma is given in Appendix E.
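As an illustration, the sketch below computes these optimal weights for the Table 5 setup and plugs them into the mean-estimation error of Theorem 4.2, recovering the grand-coalition entry of roughly 0.269 for a small player; the sample counts are those of the earlier example.

```python
mu_e, sigma2 = 10.0, 1.0
ns = [30, 30, 30, 300]   # Table 5 setup; player j = 0 has 30 samples
j = 0

# Optimal fine-grained weights from Lemma 7.1, with V_i = sigma^2 + mu_e / n_i.
V = [sigma2 + mu_e / n for n in ns]
S = sum(1 / V[i] for i in range(len(ns)) if i != j)
v = [(1 / V[k]) * (V[j] - sigma2) / (1 + V[j] * S) for k in range(len(ns))]
v[j] = (1 + sigma2 * S) / (1 + V[j] * S)

# Expected MSE from Theorem 4.2 (mean-estimation form).
var_term = mu_e * sum(v[i] ** 2 / ns[i] for i in range(len(ns)))
off = [v[i] for i in range(len(ns)) if i != j]
err = var_term + (sum(x ** 2 for x in off) + sum(off) ** 2) * sigma2
print(round(err, 3))     # ~0.269, matching the grand-coalition entry in Table 5
```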

From this analysis, a few properties become clear. To start with, $v_{jj}$ and $v_{jk}$ are always strictly between 0 and 1. This implies the following corollary:

Corollary 7.2.

With optimal fine-grained federation, $\pi_{g}$ is optimal for each player.

Proof.

Suppose by contradiction that some other coalition $\pi^{\prime}$ gave player $j$ a lower error. WLOG, assume this coalition omits player $k$. In this case, the $v$ weights for $\pi^{\prime}$ can be represented as a length-$M$ vector with 0 in the $k$th entry. However, this set of weights is achievable in $\pi_{g}$: it is always an option to set a player's coefficient $v_{jk}$ equal to 0. This contradicts the optimality of the weighting $\bm{v}_{j}$, so it cannot be the case that any player gets lower error in a different coalition. ∎

Similarly, the fact that $\pi_{g}$ is optimal for every player implies that it is in the core, and that it is the only element in the core.

8 Conclusions and future directions

In this work, we have drawn a connection between a simple model of federated learning and the game theoretic tool of hedonic games. We used this tool to examine stable partitions of the space for two variants of the game. In service of this analysis, we computed exact error values for mean estimation and linear regression, as well as for three different variations of federation.

We believe that this framework is a simple and useful tool for analyzing the incentives of multiple self-interested agents in a learning environment. There are many fascinating extensions. For example, completely characterizing the core (including whether it is always non-empty) in the case of an arbitrary number of samples $\{n_{i}\}$ is an obvious area of investigation. Besides this, it could be interesting to compute exact or approximate error values for cases beyond mean estimation and linear regression.

Acknowledgments

This work was supported in part by a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, a MURI grant, AFOSR grant FA9550-19-1-0183, grants from the ARO and the MacArthur Foundation, and NSF grant DGE-1650441. We are grateful to A. F. Cooper, Thodoris Lykouris, Hakim Weatherspoon, and the AI in Policy and Practice working group at Cornell for invaluable discussions. In particular, we thank A. F. Cooper for discussions around normative assumptions. Finally, we are grateful to Katy Blumer for discussions around code in the Github repository.

References

  • Abu-Mostafa, Lin, and Magdon-Ismail (2012) Abu-Mostafa, Y.; Lin, H.; and Magdon-Ismail, M. 2012. Learning from data: a short course. AMLbook.
  • Anderson (1962) Anderson, T. W. 1962. An introduction to multivariate statistical analysis. Technical report, Wiley New York.
  • Blum et al. (2017) Blum, A.; Haghtalab, N.; Procaccia, A. D.; and Qiao, M. 2017. Collaborative PAC Learning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 2392–2401. Curran Associates, Inc. URL http://papers.nips.cc/paper/6833-collaborative-pac-learning.pdf.
  • Blum et al. (2020) Blum, A.; Haghtalab, N.; Shao, H.; and Phillips, R. L. 2020. Unpublished work, private correspondence.
  • Bogomolnaia and Jackson (2002) Bogomolnaia, A.; and Jackson, M. O. 2002. The Stability of Hedonic Coalition Structures. Games and Economic Behavior 38(2): 201–230. ISSN 0899-8256. doi:https://doi.org/10.1006/game.2001.0877. URL http://www.sciencedirect.com/science/article/pii/S0899825601908772.
  • Casella (1992) Casella, G. 1992. Illustrating empirical Bayes methods. Chemometrics and intelligent laboratory systems 16(2): 107–125.
  • Cooper (2020) Cooper, A. F. 2020. Where Is the Normative Proof? Assumptions and Contradictions in ML Fairness Research.
  • Deng, Kamani, and Mahdavi (2020) Deng, Y.; Kamani, M. M.; and Mahdavi, M. 2020. Adaptive Personalized Federated Learning.
  • Duan et al. (2021) Duan, M.; Liu, D.; Chen, X.; Liu, R.; Tan, Y.; and Liang, L. 2021. Self-Balancing Federated Learning With Global Imbalanced Data in Mobile Systems. IEEE Transactions on Parallel and Distributed Systems 32(1): 59–71.
  • Efron and Morris (1977) Efron, B.; and Morris, C. 1977. Stein's paradox in statistics. Scientific American 236(5): 119–127.
  • Guazzone, Anglano, and Sereno (2014) Guazzone, M.; Anglano, C.; and Sereno, M. 2014. A Game-Theoretic Approach to Coalition Formation in Green Cloud Federations. 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. doi:10.1109/ccgrid.2014.37. URL http://dx.doi.org/10.1109/CCGrid.2014.37.
  • Kairouz et al. (2019) Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; D'Oliveira, R. G. L.; Rouayheb, S. E.; Evans, D.; Gardner, J.; Garrett, Z.; Gascón, A.; Ghazi, B.; Gibbons, P. B.; Gruteser, M.; Harchaoui, Z.; He, C.; He, L.; Huo, Z.; Hutchinson, B.; Hsu, J.; Jaggi, M.; Javidi, T.; Joshi, G.; Khodak, M.; Konečný, J.; Korolova, A.; Koushanfar, F.; Koyejo, S.; Lepoint, T.; Liu, Y.; Mittal, P.; Mohri, M.; Nock, R.; Özgür, A.; Pagh, R.; Raykova, M.; Qi, H.; Ramage, D.; Raskar, R.; Song, D.; Song, W.; Stich, S. U.; Sun, Z.; Suresh, A. T.; Tramèr, F.; Vepakomma, P.; Wang, J.; Xiong, L.; Xu, Z.; Yang, Q.; Yu, F. X.; Yu, H.; and Zhao, S. 2019. Advances and Open Problems in Federated Learning.
  • Li et al. (2020) Li, T.; Sahu, A. K.; Talwalkar, A.; and Smith, V. 2020. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine 37(3): 50–60. ISSN 1558-0792. doi:10.1109/msp.2020.2975749. URL http://dx.doi.org/10.1109/MSP.2020.2975749.
  • Li et al. (2019) Li, T.; Sanjabi, M.; Beirami, A.; and Smith, V. 2019. Fair Resource Allocation in Federated Learning.
  • Mansour et al. (2020) Mansour, Y.; Mohri, M.; Ro, J.; and Suresh, A. T. 2020. Three Approaches for Personalization with Applications to Federated Learning.
  • Martinez, Bertran, and Sapiro (2020) Martinez, N.; Bertran, M.; and Sapiro, G. 2020. Minimax Pareto Fairness: A Multi Objective Perspective. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR. URL https://proceedings.icml.cc/static/paper_files/icml/2020/1084-Paper.pdf.
  • Morris (1986) Morris, C. N. 1986. Empirical Bayes: a frequency-Bayes compromise, volume 8 of Lecture Notes–Monograph Series, 195–203. Hayward, CA: Institute of Mathematical Statistics. doi:10.1214/lnms/1215540299. URL https://doi.org/10.1214/lnms/1215540299.
  • Paquay (2018) Paquay, P. 2018. Learning-from-data-Solutions. URL https://github.com/ppaquay/Learning-from-Data-Solutions.
  • Sattler, Muller, and Samek (2020) Sattler, F.; Muller, K.-R.; and Samek, W. 2020. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints. IEEE Transactions on Neural Networks and Learning Systems 1–13. ISSN 2162-2388. doi:10.1109/tnnls.2020.3015958. URL http://dx.doi.org/10.1109/TNNLS.2020.3015958.
  • Sellentin and Heavens (2015) Sellentin, E.; and Heavens, A. F. 2015. Parameter inference with estimated covariance matrices. Monthly Notices of the Royal Astronomical Society: Letters 456(1): L132–L136. ISSN 1745-3925. doi:10.1093/mnrasl/slv190. URL https://doi.org/10.1093/mnrasl/slv190.
  • Shlezinger, Rini, and Eldar (2020) Shlezinger, N.; Rini, S.; and Eldar, Y. C. 2020. The Communication-Aware Clustered Federated Learning Problem. In 2020 IEEE International Symposium on Information Theory (ISIT), 2610–2615.
  • Wang et al. (2020) Wang, H.; Hsu, H.; Diaz, M.; and Calmon, F. P. 2020. To Split or Not to Split: The Impact of Disparate Treatment in Classification.
  • Yu, Bagdasaryan, and Shmatikov (2020) Yu, T.; Bagdasaryan, E.; and Shmatikov, V. 2020. Salvaging Federated Learning by Local Adaptation.

Appendix A Relationship to other approaches

This section contains a high-level summary of similar approaches and how they relate to ours. Throughout, we assume the goal is to estimate some unknown $\theta_{j}$ given samples drawn $Y_{i}\sim D(\theta_{j})$.

A frequentist approach would take $\theta_{j}$ to be a constant, estimated by the average of the given samples $\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}Y_{i}$.

A hierarchical Bayesian estimator assumes data is generated in the following way: data is drawn $Y_{i}\sim D(Y|\theta_{i})$. The parameter $\theta_{i}$ is drawn $\theta_{i}\sim\Theta_{i}(\theta|\lambda_{i})$, where the hyperparameter $\lambda_{i}$ is drawn from a known distribution $p(\lambda)$. Given some data, the parameter $\theta_{i}$ can be estimated as follows:

$p(\theta_{i}|Y_{i})=\frac{p(Y_{i}|\theta_{i})p(\theta_{i})}{p(Y_{i})}=\frac{p(Y_{i}|\theta_{i})}{p(Y_{i})}\int p(\theta_{i}|\lambda_{i})p(\lambda_{i})d\lambda$

Parametric empirical Bayes (Morris 1986; Casella 1992) is frequently described as an intermediate between these two viewpoints. Similar to the hierarchical Bayesian viewpoint, it assumes data is drawn $Y_{i}\sim D(Y|\theta_{i})$, with parameter $\theta_{i}$ drawn $\theta_{i}\sim\Theta_{i}(\theta|\lambda_{i})$. However, it differs in that it estimates $\lambda_{i}$ based on the data, producing $\hat{\lambda}_{i}$. This estimate of the hyperparameter is used, along with the data, to estimate $\theta_{i}$.

A related example is the James-Stein estimator (Efron and Morris 1977). The estimator assumes the following process: each of $m$ players draws a single sample from a normal distribution with variance $s^{2}$:

$Y_{i}\sim\mathcal{N}(\theta_{i},s^{2})$

This is different from the empirical Bayes or hierarchical Bayes case in that the means $\theta_{i}$ are assumed to be completely unrelated to each other. Nevertheless, it has been demonstrated that the James-Stein estimator

$\hat{\theta}_{JS}=\left(1-\frac{(m-2)\cdot s^{2}}{\left\lVert\bm{Y}\right\rVert^{2}}\right)\bm{Y}$

has lower expected MSE than simply using the drawn values $Y_{i}$. In the case that the variance $s^{2}$ is not known perfectly, it can be estimated as $\hat{s}^{2}$ using the entire vector of data $\bm{Y}$.
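A minimal sketch of this shrinkage formula on synthetic data (the means, variance, and random seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, s2 = 50, 1.0
theta = rng.normal(0.0, 0.5, size=m)        # synthetic true means, near zero
Y = rng.normal(theta, np.sqrt(s2))          # one noisy sample per player

# James-Stein estimator: shrink the raw observations toward zero.
theta_js = (1 - (m - 2) * s2 / np.sum(Y ** 2)) * Y

print(np.mean((Y - theta) ** 2))            # MSE of the raw observations
print(np.mean((theta_js - theta) ** 2))     # typically substantially lower here
```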

Our method is similar at a high level to empirical Bayes: we assume each player draws data from a personal distribution governed by $\theta_{i}$ and that the $\theta_{i}$ terms are in turn drawn from some distribution $\Theta$. However, one key difference is that all three methods discussed above assume knowledge of the distributions generating the data, or at least of which family they are drawn from. For example, the James-Stein estimator assumes a normal distribution: variants of it exist for other specific distributions, but not a version that works for all distributions. Similarly, a hierarchical Bayes or empirical Bayes viewpoint would require knowledge of the $D,\Theta,p$ distributions. In our approach, we do not assume that we know the form of these generating distributions, only some summary statistics (mean and variance) of the distribution.

It is entirely possible that other approaches, especially those that assume knowledge of the generating distribution, will outperform our approach in terms of the error guarantees they can provide. However, our distribution-free approach can be applied in a broader range of situations.

Additionally, our approach is restricted to linear combinations of estimators such as $\hat{\theta}^{f}=w\cdot\hat{\theta}_{j}+(1-w)\sum_{i=1}^{M}\hat{\theta}_{i}$. It is possible that a method outside this class, for example something like $\hat{\theta}^{f}=x\cdot\hat{\theta}_{j}+y\sum_{i=1}^{M}\hat{\theta}_{i}^{2}$, or something like the non-linear James-Stein estimator, would produce better estimates.

Appendix B Expected error proofs

For convenience, we will restate the model setup for the most general case of linear regression. We assume that each player $j\in[M]$ draws parameters $(\bm{\theta}_{j},\epsilon^{2}_{j})\sim\Theta$, where $\bm{\theta}_{j}$ is a length-$D$ vector and $\epsilon^{2}_{j}$ is a scalar-valued variance parameter. The $d$th entry of the vector is $\bm{\theta}_{j}^{d}$, and $\mathrm{Var}(\bm{\theta}_{j}^{d})=\sigma^{2}_{d}$. We assume that each value $\bm{\theta}_{j}$ is drawn independently of the others. The main result of this section will assume that each dimension is drawn independently, for example that $\bm{\theta}_{j}^{l}$ is independent of $\bm{\theta}_{j}^{k}$ for $k\neq l$, but we will demonstrate how this can be relaxed. Each player draws $n_{j}$ input data points from their own input distribution, $\bm{X}_{j}\sim\mathcal{X}_{j}$, such that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{T}\bm{x}]=\Sigma_{j}$. They then noisily observe the outputs, drawing $\bm{Y}_{j}\sim\mathcal{D}_{j}(\bm{X}_{j}^{T}\bm{\theta}_{j},\epsilon^{2}_{j})$. We use $\bm{\eta}_{j}$ to denote the length-$n_{j}$ vector of errors, so that $\bm{Y}_{j}=\bm{X}_{j}^{T}\bm{\theta}_{j}+\bm{\eta}_{j}$. Each player uses ordinary least squares (OLS) to compute estimates of their parameters, which requires that $\bm{X}_{j}^{T}\bm{X}_{j}$ is invertible. This happens when the columns of $\bm{X}_{j}$ are linearly independent. We will require that $\Sigma_{j}$ be such that $\bm{X}_{j}^{T}\bm{X}_{j}$ is invertible with probability 1. This rules out cases where one dimension is a deterministic function of another, for example. The OLS calculation below gives the local estimate:

$\hat{\bm{\theta}}_{j}=(\bm{X}^{T}_{j}\bm{X}_{j})^{-1}\bm{X}^{T}_{j}\bm{Y}_{j}=(\bm{X}^{T}_{j}\bm{X}_{j})^{-1}\bm{X}^{T}_{j}(\bm{X}_{j}\bm{\theta}_{j}+\bm{\eta}_{j})$

It is worth pausing briefly to note why mean estimation is a special case of linear regression. Consider the case where the distribution $\mathcal{X}_{j}$ is deterministically 1, meaning that $\bm{X}_{j}$ is a vector of 1s of length $n_{j}$. Then $(\bm{X}_{j}^{T}\bm{X}_{j})$ is always invertible, as $(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}=n_{j}^{-1}$. Each entry of $\bm{X}_{j}$ is multiplied by an unknown single parameter $\theta_{j}$, which each player is attempting to learn.
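A small numerical sketch of this local estimation step (synthetic data, illustrative dimensions): OLS on a player's own samples, plus the all-ones design that reduces OLS to the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n_j, D = 40, 3
theta_j = rng.normal(size=D)                  # player j's true coefficients
X_j = rng.normal(size=(n_j, D))               # input datapoints
Y_j = X_j @ theta_j + rng.normal(scale=0.5, size=n_j)   # noisy outputs

# Local OLS estimate: (X^T X)^{-1} X^T Y.
theta_hat = np.linalg.solve(X_j.T @ X_j, X_j.T @ Y_j)
print(theta_hat, theta_j)

# Mean estimation as a special case: a design of all 1s makes the OLS
# estimate equal to the sample mean of the observations.
ones = np.ones((n_j, 1))
samples = 2.0 + rng.normal(scale=0.5, size=n_j)
print(np.linalg.solve(ones.T @ ones, ones.T @ samples), samples.mean())
```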

Besides local estimation, there are multiple federation possibilities. Uniform federation is given by:

𝜽^jf=1Nβ€‹βˆ‘i=1M𝜽^iβ‹…ni\hat{\bm{\bm{\theta}}}_{j}^{f}=\frac{1}{N}\sum_{i=1}^{M}\hat{\bm{\bm{\theta}}}_{i}\cdot n_{i}

Coarse-grained federation is given by:

ΞΈ^jw=wjβ‹…πœ½^j+(1βˆ’wj)β‹…1Nβ€‹βˆ‘i=1M𝜽^iβ‹…ni\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\bm{\bm{\theta}}}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\bm{\bm{\theta}}}_{i}\cdot n_{i}

for wj∈[0,1]w_{j}\in[0,1]. Finally, fine-grained federation is given by:

\hat{\bm{\theta}}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\bm{\theta}}_{i}

for βˆ‘i=1Mvj​i=1\sum_{i=1}^{M}v_{ji}=1. Note that fine-grained federation is the most general case of federation. It is possible to derive coarse-grained federation, uniform federation, or local estimation by appropriately setting the vv weights. In this section, we will first derive the expected error for the fine-grained federation, linear regression case, and get expected error results for other cases as corollaries of this result.
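To make these definitions concrete, the numpy sketch below (not part of the paper) fits local OLS models on synthetic data and forms the uniform, coarse-grained, and fine-grained combinations; the number of players, sample sizes, dimension, and the choice of standard-normal inputs and noise are all illustrative assumptions.

```python
# A minimal sketch (not from the paper) of local estimation and the three
# federation schemes; all parameter choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, D = 4, 3                        # number of players and dimension
n = np.array([20, 35, 50, 80])     # samples n_i per player
sigma2, mu_e = 0.5, 1.0            # Var(theta^d) and E[eps_j^2]

# Draw each player's true parameters from Theta (taken to be normal here).
theta_true = rng.normal(0.0, np.sqrt(sigma2), size=(M, D))

# Local estimation: each player fits OLS on its own data.
theta_hat = np.zeros((M, D))
for j in range(M):
    X = rng.normal(size=(n[j], D))                     # X_j, standard-normal inputs
    Y = X @ theta_true[j] + rng.normal(0.0, np.sqrt(mu_e), size=n[j])
    theta_hat[j] = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y

N = n.sum()

# Uniform federation: a single model, weighting each player by its sample count.
theta_uniform = (n[:, None] * theta_hat).sum(axis=0) / N

# Coarse-grained federation: player j mixes its local model with the global one.
def coarse_grained(j, w):
    return w * theta_hat[j] + (1 - w) * theta_uniform

# Fine-grained federation: player j uses arbitrary weights v_j that sum to 1.
def fine_grained(v_j):
    v_j = np.asarray(v_j, dtype=float)
    assert np.isclose(v_j.sum(), 1.0)
    return v_j @ theta_hat

print(theta_uniform)
print(coarse_grained(0, w=0.3))
print(fine_grained([0.4, 0.2, 0.2, 0.2]))
```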

The expected error produced by a set of estimates 𝜽^\hat{\bm{\bm{\theta}}} is determined by the expectation of the following quantity.

(𝒙Tβ€‹πœ½^βˆ’π’™Tβ€‹πœ½j)2(\bm{x}^{T}\hat{\bm{\bm{\theta}}}-\bm{x}^{T}\bm{\bm{\theta}}_{j})^{2}

Here, the expectation is taken over four sources of randomness.

  1. $\mathbb{E}_{(\bm{\theta}_{i},\epsilon^{2}_{i})\sim\Theta}$: drawing parameters $\bm{\theta}_{i},\epsilon^{2}_{i}$ for each player $i$'s distribution from $\Theta$.

  2. $\mathbb{E}_{X_{i}\sim\mathcal{X}_{i}}$: drawing the training dataset $\bm{X}_{i}$ from the data distribution $\mathcal{X}_{i}$.

  3. $\mathbb{E}_{Y_{i}\sim\mathcal{D}_{i}(X_{i}^{T}\bm{\theta}_{i},\epsilon^{2}_{i})}$: drawing labels for the training dataset $\bm{X}_{i}$ from the distribution $\mathcal{D}_{i}(\bm{X}_{i}^{T}\bm{\theta}_{i},\epsilon^{2}_{i})$.

  4. $\mathbb{E}_{x\sim\mathcal{X}_{j}}$: drawing a new test point $\bm{x}$ from the data distribution $\mathcal{X}_{j}$.

$(\bm{x}^{T}\hat{\bm{\theta}}-\bm{x}^{T}\bm{\theta}_{j})^{2}$ measures the error of a set of parameters at a particular point $\bm{x}$: when the expectation is taken over all $\bm{x}\sim\mathcal{X}_{j}$, it represents the average error over the whole distribution. It might not be immediately clear, though, why $(\bm{x}^{T}\hat{\bm{\theta}}-\bm{x}^{T}\bm{\theta}_{j})^{2}$ is the correct term to be considering. Other potential candidates include:

  1. $\left\lVert\hat{\bm{\theta}}_{j}-\bm{\theta}_{j}\right\rVert^{2}$

  2. $(\bm{x}^{T}\hat{\bm{\theta}}_{j}-y)^{2}$ for $y=\bm{x}^{T}\bm{\theta}_{j}+\eta_{j}$.

The first candidate measures the difference in estimated parameters; however, we assume that the objective of learning is to have low error on predicting future points, rather than solely estimate the parameters. The second candidate represents the error of predicting an instance as opposed to a mean value: it ends up simply producing an additive increase in our overall error term. To see this, note that we can write

(𝒙Tβ€‹πœ½^jβˆ’y)2=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+𝒙Tβ€‹πœ½jβˆ’y)2(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-y)^{2}=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\bm{x}^{T}\bm{\theta}_{j}-y)^{2}
=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½j+Ξ·j)2=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\eta_{j})^{2}
=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+Ξ·j)2=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\eta_{j})^{2}
=(\bm{x}^{T}\hat{\bm{\theta}}_{j}-\bm{x}^{T}\bm{\theta}_{j})^{2}+2(\bm{x}^{T}\hat{\bm{\theta}}_{j}-\bm{x}^{T}\bm{\theta}_{j})\eta_{j}+\eta_{j}^{2}

The first term is the same error function we are considering. The middle term is 0 in expectation and the last term is $\mu_e$ in expectation, so this candidate simply adds $\mu_e$ to the error term we are already studying.

See 4.1. Note: portions of Abu-Mostafa, Lin, and Magdon-Ismail (2012) and Paquay (2018), especially problem 3.11, were helpful in formulating this approach. Sellentin and Heavens (2015) and Anderson (1962) were helpful in providing the connection to the inverse Wishart distribution.

Proof.

First, note that:

𝒙Tβ€‹πœ½π’‹βˆ’π’™Tβ€‹πœ½^j=𝒙T​(𝜽jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jT​𝒀j)\bm{x}^{T}\bm{\bm{\theta}_{j}}-\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{Y}_{j}\right)
=𝒙T​(𝜽jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jT​(𝑿jβ€‹πœ½j+𝜼j))=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}(\bm{X}_{j}\bm{\bm{\theta}}_{j}+\bm{\eta}_{j})\right)
=𝒙T​(𝜽jβˆ’πœ½jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jTβ€‹πœΌj)=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\right)
=βˆ’π’™T​(𝑿jT​𝑿j)βˆ’1​𝑿jTβ€‹πœΌj=-\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}

Then,

(𝒙Tβ€‹πœ½π’‹βˆ’π’™Tβ€‹πœ½^j)2(\bm{x}^{T}\bm{\bm{\theta}_{j}}-\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j})^{2}
=\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}

To simplify, we note that the above quantity is a scalar. For a scalar, $a=\text{tr}(a)$, and for any matrices of compatible dimensions, $\text{tr}(AB)=\text{tr}(BA)$ by the cyclic property of the trace.

=\text{tr}\left[\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\right]
=\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

To evaluate, we start by applying the various expectations, noting that expectation and trace commute. Applying π”ΌπœΌjβˆΌπ’Ÿj​(0,Ο΅j2)\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})} to the term above allows us to rewrite it as:

=\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}V\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

where

V=π”ΌπœΌjβˆΌπ’Ÿj​(0,Ο΅j2)​[𝜼jβ€‹πœΌjT]V=\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})}[\bm{\eta}_{j}\bm{\eta}_{j}^{T}]

$\bm{\eta}_{j}\bm{\eta}_{j}^{T}$ is an $n_{j}\times n_{j}$ matrix. The $l$th diagonal entry is $(\bm{\eta}_{j}^{l})^{2}$, which has expectation $\epsilon^{2}_{j}$. Off-diagonal entries have value $\bm{\eta}_{j}^{l}\cdot\bm{\eta}_{j}^{k}$ for $l\neq k$; because the errors for each data point are drawn independently and with 0 mean, the expectation of each of these is 0. Therefore $\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})}[\bm{\eta}_{j}\bm{\eta}_{j}^{T}]$ is a diagonal matrix with $\epsilon^{2}_{j}$ along the diagonal, and we can pull the scalar $\epsilon^{2}_{j}$ out of the trace to obtain:

=\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]
=Ο΅j2​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]=\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Taking the expectation over the drawn parameters gives:

=𝔼(ΞΈj,Ο΅j2)βˆΌΞ˜β€‹[Ο΅j2​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]]=\mathbb{E}_{(\theta_{j},\epsilon^{2}_{j})\sim\Theta}[\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]]
=ΞΌe​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Taking the expectation over the test point π’™βˆΌπ’³j\bm{x}\sim\mathcal{X}_{j} gives:

=ΞΌe​tr​[𝔼xβˆΌπ’³j​[𝒙​𝒙T]​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}](\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]
=ΞΌe​tr​[Ξ£j​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\Sigma_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Finally, we take the expectation with respect to 𝑿jβˆΌπ’³j\bm{X}_{j}\sim\mathcal{X}_{j}

=ΞΌe​tr​[Ξ£j​𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]]=\mu_{e}\text{tr}\left[\Sigma_{j}\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]

Note that because the inverse and expectation do not commute, in general, we cannot simplify this without stronger assumptions.

There is one other situation where a particular case of linear regression gives us simpler results. As mentioned in the statement of the lemma, in this case we assume that the distribution of input values 𝒳j\mathcal{X}_{j} is a 0-mean normal distribution with covariance matrix C​o​vjCov_{j}.

Note that, in general, $Cov_{j}\neq\Sigma_{j}=\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}]$. $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}]$ has, along the diagonal, $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}\bm{x}^{d}]$, and on the off-diagonals, $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\bm{x}^{k}]$. By contrast, the covariance matrix $Cov_{j}$ has $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}\bm{x}^{d}]-\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}]^{2}$ along the diagonal and $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\bm{x}^{k}]-\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}]\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{k}]$ on the off-diagonals. In the case we are looking at, the distribution has mean 0, so the subtracted products vanish, every entry matches, and $Cov_{j}=\Sigma_{j}$.

If this is the case, then (𝑿jT​𝑿j)(\bm{X}_{j}^{T}\bm{X}_{j}) is distributed according to a Wishart distribution with parameters njn_{j} and covariance C​o​vj=Ξ£jCov_{j}=\Sigma_{j} with dimension DD. Given this, (𝑿jT​𝑿j)βˆ’1(\bm{X}_{j}^{T}\bm{X}_{j})^{-1} is distributed according to an Inverse Wishart distribution with parameters njn_{j} and covariance C​o​vjβˆ’1=Ξ£jβˆ’1Cov_{j}^{-1}=\Sigma_{j}^{-1} with dimension DD.

The expectation of the inverse Wishart tells us that:

𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]=1njβˆ’Dβˆ’1​C​o​vjβˆ’1\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]=\frac{1}{n_{j}-D-1}Cov_{j}^{-1}

Using these results, we can directly calculate the desired expectation:

ΞΌe​tr​[Ξ£j​𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]]\mu_{e}\text{tr}\left[\Sigma_{j}\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]
=ΞΌe​tr​[1njβˆ’Dβˆ’1​Σj​Σjβˆ’1]=\mu_{e}\text{tr}\left[\frac{1}{n_{j}-D-1}\Sigma_{j}\Sigma_{j}^{-1}\right]
=ΞΌenjβˆ’Dβˆ’1​tr​[Ξ£j​Σjβˆ’1]=\frac{\mu_{e}}{n_{j}-D-1}\text{tr}\left[\Sigma_{j}\Sigma_{j}^{-1}\right]
=ΞΌenjβˆ’Dβˆ’1​tr​[ID]=\frac{\mu_{e}}{n_{j}-D-1}\text{tr}\left[I_{D}\right]
=ΞΌenjβˆ’Dβˆ’1β‹…D=\frac{\mu_{e}}{n_{j}-D-1}\cdot D

Next, we can reduce the linear regression case to mean estimation. In this case, assume a 1-dimensional input with $x=1$ deterministically. After drawing $x_{j}$, we multiply it by $\bm{\theta}_{j}$ and add some noise governed by $\epsilon^{2}_{j}$: this is the exact same structure as mean estimation. In this case, $\Sigma_{j}=\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}[\bm{X}_{j}]^{2}+Var(\bm{X}_{j})=1+0=1$. Similarly, $\bm{X}_{j}^{T}\bm{X}_{j}=n_{j}$ deterministically, so the error term reduces to $\frac{\mu_{e}}{n_{j}}$ as desired.

Note that, as expected, the linear regression result does not simplify down to the mean estimation case for $D=1$: that case would model a version of 1-dimensional linear regression, where it is necessary to estimate both $\bm{\theta}$ and $\hat{x}$, the mean of the input distribution. ∎
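As a sanity check on the closed form just derived, the Monte Carlo sketch below (not part of the paper) estimates the expected MSE of local estimation under a 0-mean Gaussian design and compares it to $\mu_{e}D/(n_{j}-D-1)$; the parameter values are illustrative and $\epsilon_{j}^{2}$ is held fixed at $\mu_{e}$. Taking $D=1$ with $\bm{X}_{j}$ identically 1 instead recovers the mean estimation value $\mu_{e}/n_{j}$.

```python
# A Monte Carlo check (not part of the paper) of mu_e * D / (n_j - D - 1)
# for the 0-mean Gaussian-design case; parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
D, n_j = 3, 12
mu_e, sigma2 = 1.0, 0.5      # eps_j^2 is held fixed at mu_e below
trials = 20000

errs = []
for _ in range(trials):
    theta = rng.normal(0.0, np.sqrt(sigma2), size=D)   # theta_j ~ Theta
    X = rng.normal(size=(n_j, D))                      # 0-mean normal design, Cov = I
    Y = X @ theta + rng.normal(0.0, np.sqrt(mu_e), size=n_j)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # local OLS estimate
    x = rng.normal(size=D)                             # fresh test point from X_j
    errs.append((x @ theta_hat - x @ theta) ** 2)

print(np.mean(errs))                # Monte Carlo estimate of the expected MSE
print(mu_e * D / (n_j - D - 1))     # closed form: 1.0 * 3 / (12 - 3 - 1) = 0.375
```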

Next, we calculate expected MSE for the fine-grained linear regression case.

See 4.2

Proof.

Here, we will use 𝔼YβˆΌπ’Ÿβ€‹(ΞΈi,Ο΅i2)\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})} to mean the expectation taken over data from all players i∈[M]i\in[M], given that all of the data influences the federated learning result.

(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½^jv)2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}
=(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv+𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)2=(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j}+\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}
=(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)2+(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)2+2​(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)β‹…(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)\begin{split}=(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})^{2}+(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}\\ +2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})\cdot(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})\end{split} (1)

Note that the expectation of the last term in Equation 1 results in 0 because $\mathbb{E}_{Y\sim\mathcal{D}(\theta_{j},\epsilon^{2}_{j})}\left[\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j}\right]=0$. Next, we investigate the second term in Equation 1.

(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½jv^)2(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}^{v}_{j}})^{2}
=(𝒙Tβ€‹βˆ‘i=1Mvj​iβ€‹πœ½iβˆ’π’™Tβ€‹βˆ‘i=1Mvj​iβ€‹πœ½^i)2=\left(\bm{x}^{T}\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}-\bm{x}^{T}\sum_{i=1}^{M}v_{ji}\hat{\bm{\theta}}_{i}\right)^{2}
=(βˆ‘i=1Mvj​i​𝒙T​(𝜽iβˆ’πœ½^i))2=\left(\sum_{i=1}^{M}v_{ji}\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\right)^{2}

Expanding out the squared term gives us:

βˆ‘i=1M(vj​i​𝒙T​(𝜽iβˆ’πœ½^i))2\sum_{i=1}^{M}\left(v_{ji}\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\right)^{2}
+βˆ‘i=1Mβˆ‘kβ‰ i(vj​i⋅𝒙T​(𝜽iβˆ’πœ½^i)β‹…vj​k⋅𝒙T​(𝜽kβˆ’πœ½^k))+\sum_{i=1}^{M}\sum_{k\neq i}\left(v_{ji}\cdot\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\cdot v_{jk}\cdot\bm{x}^{T}(\bm{\theta}_{k}-\hat{\bm{\theta}}_{k})\right)

The second term ends up being irrelevant: because each set of parameters $\bm{\theta}_{i}\sim\Theta$ is drawn independently and because each data set $\bm{X}_{i}\sim\mathcal{X}_{i}$ is drawn independently, the $\bm{\theta}_{i}-\hat{\bm{\theta}}_{i}$ terms are independent of each other. Because each is 0 in expectation, the entire product has expectation 0. Rewriting the first term gives:

βˆ‘i=1Mvj​i2β‹…(𝒙Tβ€‹πœ½iβˆ’π’™Tβ€‹πœ½^i)2\sum_{i=1}^{M}v_{ji}^{2}\cdot(\bm{x}^{T}\bm{\theta}_{i}-\bm{x}^{T}\hat{\bm{\theta}}_{i})^{2}

The term inside the sum is exactly the quantity we solved for in the local estimation case: we can rewrite this as

\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\text{tr}\left[\Sigma_{j}\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]

or, if the necessary conditions are satisfied,

ΞΌeβ€‹βˆ‘i=1Mvj​i2β‹…Dniβˆ’Dβˆ’1\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}

Finally, we will explore the first term in Equation 1:

(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})^{2}
=(𝒙T​(𝜽jβˆ’πœ½jv))T​𝒙T​(𝜽jβˆ’πœ½jv)=(\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j}))^{T}\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})
=(𝜽jβˆ’πœ½jv)T​𝒙​𝒙T​(𝜽jβˆ’πœ½jv)=(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\bm{x}\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})

Taking the expectation and using the cyclic property of the trace gives:

=tr​[(𝜽jβˆ’πœ½jv)T​𝔼xβˆΌπ’³j​[𝒙​𝒙T]​(𝜽jβˆ’πœ½jv)]=\text{tr}\left[(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}](\bm{\theta}_{j}-\bm{\theta}^{v}_{j})\right]
=tr​[Ξ£j​(𝜽jβˆ’πœ½jv)​(𝜽jβˆ’πœ½jv)T]=\text{tr}\left[\Sigma_{j}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\right] (2)

Next, we focus on simplifying the inner term of Equation 2 involving the 𝜽\bm{\theta} values. Using the definition of 𝜽jv\bm{\theta}^{v}_{j} gives:

(𝜽jβˆ’βˆ‘i=1Mvj​iβ€‹πœ½i)​(𝜽jβˆ’βˆ‘i=1Mvj​iβ€‹πœ½i)T\left(\bm{\theta}_{j}-\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}\right)\left(\bm{\theta}_{j}-\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}\right)^{T}
=((1βˆ’vj​j)β‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)=\left((1-v_{jj})\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)
β‹…((1βˆ’vj​j)β‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)T\cdot\left((1-v_{jj})\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)^{T}

We know that 1=vj​j+βˆ‘iβ‰ jvj​i1=v_{jj}+\sum_{i\neq j}v_{ji}, so we can rewrite this as:

=(βˆ‘iβ‰ jvj​iβ‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)​(βˆ‘iβ‰ jvj​iβ‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)T=\left(\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)\left(\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)^{T}
=(βˆ‘iβ‰ jvj​iβ‹…(𝜽jβˆ’πœ½i))​(βˆ‘iβ‰ jvj​iβ‹…(𝜽jβˆ’πœ½i))T=\left(\sum_{i\neq j}v_{ji}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})\right)\left(\sum_{i\neq j}v_{ji}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})\right)^{T}

Expanding gives us two terms:

βˆ‘iβ‰ jvj​i2​(𝜽jβˆ’πœ½i)​(𝜽jβˆ’πœ½i)T+βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​kβ‹…(𝜽jβˆ’πœ½i)​(𝜽jβˆ’πœ½k)T\begin{split}\sum_{i\neq j}&v_{ji}^{2}(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{i})^{T}\\ +\sum_{i,k\neq j,i\neq k}&v_{ji}\cdot v_{jk}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{k})^{T}\end{split} (3)

Note that

(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{k})^{T}=\bm{\theta}_{j}\bm{\theta}_{j}^{T}-\bm{\theta}_{j}\bm{\theta}_{k}^{T}-\bm{\theta}_{i}\bm{\theta}_{j}^{T}+\bm{\theta}_{i}\bm{\theta}_{k}^{T}

Because the parameters are drawn iid,

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]=π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]β€‹βˆ€i∈[M]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{i}^{T}\right]\ \forall i\in[M]
π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]=π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½lT]β€‹βˆ€jβ‰ k,iβ‰ l,i,j,k,l∈[M]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{l}^{T}\right]\ \forall j\neq k,i\neq l,\ i,j,k,l\in[M]

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}\bm{\theta}_{j}^{T}] is implied to mean the same thing as π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{i}\bm{\theta}_{i}^{T}]. Using these results allows us to expand out the product and take the expectation. The first term in Equation 3 becomes:

βˆ‘iβ‰ jvj​i2β‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]+π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]βˆ’2β€‹π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT])\sum_{i\neq j}v_{ji}^{2}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]+\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{i}^{T}\right]-2\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]\right)
=2β€‹βˆ‘iβ‰ jvj​i2β‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT])=2\sum_{i\neq j}v_{ji}^{2}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]\right)

For the second term in Equation 3, each product within the sum becomes :

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT]+π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½kT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]+\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{k}^{T}\right]

The last two terms cancel because the parameters are drawn i.i.d., so each product becomes:

=π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]

The second sum in Equation 3 therefore becomes:

=βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​kβ‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])=\sum_{i,k\neq j,i\neq k}v_{ji}\cdot v_{jk}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)

The sum of both of these terms taken together becomes:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(2β€‹βˆ‘iβ‰ jvj​i2+βˆ‘i,kβ‰ j,iβ‰ kvj​i)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(2\sum_{i\neq j}v_{ji}^{2}+\sum_{i,k\neq j,i\neq k}v_{ji}\right)

Next, we can use the identity:

βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​k=(βˆ‘iβ‰ jvj​i)2βˆ’βˆ‘iβ‰ jvj​i2\sum_{i,k\neq j,i\neq k}v_{ji}\cdot v_{jk}=\left(\sum_{i\neq j}v_{ji}\right)^{2}-\sum_{i\neq j}v_{ji}^{2}

Which allows us to rewrite as:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)

We can alternatively write this in a different way. Because $1=v_{jj}+\sum_{i\neq j}v_{ji}$, we have $1-v_{jj}=\sum_{i\neq j}v_{ji}$, so the expression becomes:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(βˆ‘iβ‰ jvj​i2+(1βˆ’vj​j)2)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(1-v_{jj}\right)^{2}\right)

Recall that this analysis was focusing solely on the component of Equation 2 that involved the 𝜽\bm{\theta} product. We can recombine our simplification into Equation 2 to rewrite it as:

tr​[Ξ£j​(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½i,𝜽jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½iT])]\text{tr}\left[\Sigma_{j}\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{i},\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{i}^{T}\right]\right)\right]
β‹…(βˆ‘iβ‰ jvj​i2+(1βˆ’vj​j)2)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(1-v_{jj}\right)^{2}\right)

Next, we need to reason about the difference between the two expectation terms. In this setting, we are assuming that each coefficient is drawn independently of the other coefficients. $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]$ has, in the $d$th diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[(\bm{\theta}_{j}^{d})^{2}]$, and in the $l,k$th off-diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]$. Here, we are assuming that $\bm{\theta}_{j}^{l}$ and $\bm{\theta}_{j}^{k}$ are independent, so this equals $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]$; we relax this assumption below. $\mathbb{E}_{\bm{\theta}_{i},\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{i}^{T}\right]$ has, in the $d$th diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{d}]^{2}$, and in the $l,k$th off-diagonal entry, the same value as the other matrix: $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]$. The difference between these two matrices is therefore a diagonal matrix with $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[(\bm{\theta}_{j}^{d})^{2}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{d}]^{2}=\sigma^{2}_{d}$ in the $d$th diagonal entry, where $\sigma^{2}_{d}$ represents the variance of the $d$th coefficient. That turns the term involving the trace into a simple sum:

(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)β‹…βˆ‘d=1D𝔼xβˆΌπ’³j​[(𝒙d)2]β‹…Οƒd2\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}

In the proof above, we assumed that the draw of parameter value 𝜽jl\bm{\theta}_{j}^{l} is independent of 𝜽jk\bm{\theta}_{j}^{k}, for lβ‰ kl\neq k. A case where this might not occur is when these values are correlated: say, the value drawn for 𝜽jl\bm{\theta}_{j}^{l} is anti-correlated with the parameter drawn for 𝜽jk\bm{\theta}_{j}^{k}. (Note that we still assume draws are independent across players: 𝜽jl\bm{\theta}_{j}^{l} is independent of 𝜽il\bm{\theta}_{i}^{l} and 𝜽ik\bm{\theta}_{i}^{k}). Relaxing this assumption is not hard and would change the results in the following way: the off-diagonal terms of the difference would no longer be 0. Instead, the off-diagonal l,kl,kth entry becomes

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jlβ‹…πœ½jk]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jl]β‹…π”Όπœ½jβˆΌΞ˜β€‹[𝜽jk]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]

Performing the matrix multiplication with Ξ£j\Sigma_{j} turns this into:

\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}+\sum_{l\neq k}\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\cdot\bm{x}^{k}]\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]\right)

Our final value for this component of the error would thus have the same form, but with a slightly different coefficient.

Finally, we will consider the mean estimation case. As before, we note that $\Sigma_{j}=1$ and $\bm{X}_{i}^{T}\bm{X}_{i}=n_{i}$ deterministically, so that component of the error term reduces to

\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}

Similarly, we note that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]=1$, so the second component reduces to:

(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)​σ2\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\sigma^{2}

∎
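The mean estimation expression just derived can be checked by simulation. The sketch below (not part of the paper) simulates the fine-grained estimate for the first player under illustrative weights and sample counts and compares the empirical error to $\mu_{e}\sum_{i}v_{ji}^{2}/n_{i}+\sigma^{2}\left(\sum_{i\neq j}v_{ji}^{2}+(\sum_{i\neq j}v_{ji})^{2}\right)$.

```python
# A Monte Carlo check (not part of the paper) of the fine-grained mean estimation
# error for player j = 0; weights and sample counts are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = np.array([10, 30, 60])            # samples per player
mu_e, sigma2 = 1.0, 0.2               # E[eps_i^2] and Var(theta_i)
v = np.array([0.5, 0.3, 0.2])         # player 0's weights v_{0i}, summing to 1
trials = 50000

errs = []
for _ in range(trials):
    theta = rng.normal(0.0, np.sqrt(sigma2), size=len(n))      # theta_i ~ Theta
    # Each player's local estimate is the mean of n_i noisy observations.
    theta_hat = np.array([
        np.mean(theta[i] + rng.normal(0.0, np.sqrt(mu_e), size=n[i]))
        for i in range(len(n))
    ])
    errs.append((v @ theta_hat - theta[0]) ** 2)

closed_form = mu_e * np.sum(v**2 / n) \
    + sigma2 * (np.sum(v[1:]**2) + np.sum(v[1:])**2)
print(np.mean(errs), closed_form)     # the two values should be close
```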

See 4.3

Proof.

To obtain this result, we note that the uniform federation case amounts to weights vj​i=niNv_{ji}=\frac{n_{i}}{N}. For both linear regression and mean estimation, the Οƒ2\sigma^{2} multiplier becomes:

βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}
=1N2​(βˆ‘iβ‰ jni2+(βˆ‘iβ‰ jni)2)=\frac{1}{N^{2}}\left(\sum_{i\neq j}n_{i}^{2}+\left(\sum_{i\neq j}n_{i}\right)^{2}\right)

For linear regression, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…Dniβˆ’Dβˆ’1\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}
=1N2β€‹βˆ‘i=1Mni2β‹…Dniβˆ’Dβˆ’1=\frac{1}{N^{2}}\sum_{i=1}^{M}n_{i}^{2}\cdot\frac{D}{n_{i}-D-1}

And for mean estimation, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…1ni\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}
=βˆ‘i=1Mni2N2β‹…1ni=βˆ‘i=1MniN2=1N=\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\cdot\frac{1}{n_{i}}=\sum_{i=1}^{M}\frac{n_{i}}{N^{2}}=\frac{1}{N}

∎
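The substitution above can also be checked numerically: plugging $v_{ji}=n_{i}/N$ into the fine-grained mean estimation multipliers should reproduce the uniform-federation values. A small sketch (not part of the paper, with illustrative sample counts):

```python
# Numeric check (not part of the paper) that v_ji = n_i / N recovers the
# uniform-federation multipliers; sample counts are illustrative.
import numpy as np

n = np.array([10, 30, 60])
N = n.sum()
j = 0
v = n / N                                        # uniform-federation weights

mask = np.arange(len(n)) != j
sigma2_mult = np.sum(v[mask]**2) + np.sum(v[mask])**2
mu_e_mult = np.sum(v**2 / n)

print(sigma2_mult, (np.sum(n[mask]**2) + np.sum(n[mask])**2) / N**2)  # equal
print(mu_e_mult, 1 / N)                                               # equal
```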

See 4.4

Proof.

To obtain this result, we note that the coarse-grained case corresponds to $v_{jj}=w+\frac{(1-w)\cdot n_{j}}{N}$ and $v_{ji}=(1-w)\cdot\frac{n_{i}}{N}$ for $i\neq j$. For both linear regression and mean estimation, the $\sigma^{2}$ multiplier becomes:

βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}
=(1βˆ’w)2β‹…βˆ‘iβ‰ jni2+(βˆ‘iβ‰ jni)2N2=(1-w)^{2}\cdot\frac{\sum_{i\neq j}n_{i}^{2}+\left(\sum_{i\neq j}n_{i}\right)^{2}}{N^{2}}

For linear regression, let $\zeta_{i}$ stand for either $\text{tr}\left[\Sigma_{j}\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$ or $\frac{D}{n_{i}-D-1}$, depending on the linear regression case. Then, the $\mu_{e}$ multiplier becomes:

βˆ‘i=1Mvj​i2β‹…ΞΆi\sum_{i=1}^{M}v_{ji}^{2}\cdot\zeta_{i}
=\left(w+\frac{(1-w)\cdot n_{j}}{N}\right)^{2}\cdot\zeta_{j}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}^{2}}{N^{2}}\cdot\zeta_{i}
=\zeta_{j}\cdot\left(w^{2}+2\frac{(1-w)\cdot w\cdot n_{j}}{N}\right)+(1-w)^{2}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\cdot\zeta_{i}

For mean estimation, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…1ni\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}
=(w+(1βˆ’w)β‹…njN)2β‹…1nj+βˆ‘iβ‰ j(1βˆ’w)2β‹…ni2N2β‹…1ni=\left(w+\frac{(1-w)\cdot n_{j}}{N}\right)^{2}\cdot\frac{1}{n_{j}}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}^{2}}{N^{2}}\cdot\frac{1}{n_{i}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2β‹…njN2+βˆ‘iβ‰ j(1βˆ’w)2β‹…niN2=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+\frac{(1-w)^{2}\cdot n_{j}}{N^{2}}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}}{N^{2}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2β€‹βˆ‘i=1MniN2=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+(1-w)^{2}\sum_{i=1}^{M}\frac{n_{i}}{N^{2}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2N=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+\frac{(1-w)^{2}}{N}
=w2nj+1βˆ’w2N=\frac{w^{2}}{n_{j}}+\frac{1-w^{2}}{N}

∎
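The same kind of check applies to the coarse-grained weights. The sketch below (not part of the paper, with illustrative values) substitutes $v_{jj}=w+(1-w)n_{j}/N$ and $v_{ji}=(1-w)n_{i}/N$ into the fine-grained mean estimation multipliers and compares against the simplified forms above.

```python
# Numeric check (not part of the paper) of the coarse-grained simplifications;
# the sample counts and mixing weight w are illustrative.
import numpy as np

n = np.array([10, 30, 60])
N = n.sum()
j, w = 0, 0.4

v = (1 - w) * n / N
v[j] += w                               # v_jj = w + (1 - w) * n_j / N

mu_e_mult = np.sum(v**2 / n)
print(mu_e_mult, w**2 / n[j] + (1 - w**2) / N)                        # equal

mask = np.arange(len(n)) != j
sigma2_mult = np.sum(v[mask]**2) + np.sum(v[mask])**2
print(sigma2_mult,
      (1 - w)**2 * (np.sum(n[mask]**2) + np.sum(n[mask])**2) / N**2)  # equal
```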

Appendix C Supporting proofs for uniform federation

See 5.2

Proof.

If a partition Ο€\pi is optimal for every player, then it is core stable: there does not exist a coalition CC where all players prefer CC to Ο€\pi, because there does not exist a coalition where any players prefer CC to Ο€\pi.

If a partition Ο€\pi is optimal for every player, then no other partition can be core stable: any set of players not in their optimal configuration could form a coalition CC where all players would prefer CC.

In the case that players are indifferent between any arrangement, then for any partition Ο€\pi and any competing coalition CC, all players would be indifferent between Ο€\pi and CC, so Ο€\pi is core stable. ∎

See 5.3

Proof.

First, we will consider the case where the inequality is strict and will show the statement is satisfied by showing that every player minimizes their error by being alone, as in $\pi_{l}$. We will use $N_{Q}$ to denote the total number of samples held by players in a coalition $Q$. $Q$ could be the grand coalition of all players ($\pi_{g}$) or some strict subset, but we will assume it contains at least 2 players. We will show that every player gets higher error in $Q$ than it would get alone. We wish to show:

ΞΌeNQ+Οƒ2β€‹βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2NQ2>ΞΌenj\frac{\mu_{e}}{N_{Q}}+\sigma^{2}\frac{\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}}{N_{Q}^{2}}>\frac{\mu_{e}}{n_{j}}

Cross multiplying gives:

ΞΌeβ‹…NQβ‹…nj+Οƒ2β‹…njβ‹…(βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2)>ΞΌeβ‹…NQ2\mu_{e}\cdot N_{Q}\cdot n_{j}+\sigma^{2}\cdot n_{j}\cdot\left(\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}\right)>\mu_{e}\cdot N_{Q}^{2}

Rewriting:

Οƒ2β‹…njβ‹…(βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2)>ΞΌeβ‹…NQ2βˆ’ΞΌeβ‹…NQβ‹…nj\sigma^{2}\cdot n_{j}\cdot\left(\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}\right)>\mu_{e}\cdot N_{Q}^{2}-\mu_{e}\cdot N_{Q}\cdot n_{j}

The righthand side can be rewritten as:

ΞΌeβ‹…NQ2βˆ’ΞΌeβ‹…NQβ‹…nj=ΞΌeβ‹…(NQβˆ’nj)2+ΞΌeβ‹…njβ‹…(NQβˆ’nj)\mu_{e}\cdot N_{Q}^{2}-\mu_{e}\cdot N_{Q}\cdot n_{j}=\mu_{e}\cdot(N_{Q}-n_{j})^{2}+\mu_{e}\cdot n_{j}\cdot(N_{Q}-n_{j})

Then, we can prove the inequality by splitting it up into two terms. The first:

Οƒ2β‹…njβ‹…(NQβˆ’nj)2>ΞΌeβ‹…(NQβˆ’nj)2\sigma^{2}\cdot n_{j}\cdot(N_{Q}-n_{j})^{2}>\mu_{e}\cdot(N_{Q}-n_{j})^{2}

which is true because njβ‹…Οƒ2>ΞΌen_{j}\cdot\sigma^{2}>\mu_{e}. The second:

Οƒ2β‹…njβ‹…βˆ‘iβ‰ j,i∈Qni2>ΞΌeβ‹…njβ‹…(NQβˆ’nj)\sigma^{2}\cdot n_{j}\cdot\sum_{i\neq j,i\in Q}n_{i}^{2}>\mu_{e}\cdot n_{j}\cdot(N_{Q}-n_{j})
Οƒ2β‹…βˆ‘iβ‰ j,i∈Qni2>ΞΌeβ‹…βˆ‘iβ‰ j,i∈Qni\sigma^{2}\cdot\sum_{i\neq j,i\in Q}n_{i}^{2}>\mu_{e}\cdot\sum_{i\neq j,i\in Q}n_{i}

which is satisfied because, for each player,

Οƒ2β‹…ni2>ΞΌeβ‹…ni\sigma^{2}\cdot n_{i}^{2}>\mu_{e}\cdot n_{i}

because Οƒ2β‹…ni>ΞΌe\sigma^{2}\cdot n_{i}>\mu_{e}.
Next, we will consider the case where the inequality may not be strict. We can note that, in the proof above, any coalition $Q$ containing at least one player with $n_{i}>\frac{\mu_{e}}{\sigma^{2}}$ would satisfy the desired inequality: all players participating would get higher error than they would get alone. This shows that any coalition involving a player with more than $\frac{\mu_{e}}{\sigma^{2}}$ samples is infeasible. We have previously shown that all players with $n_{i}=\frac{\mu_{e}}{\sigma^{2}}$ get equal error no matter their arrangement. ∎
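To illustrate the condition used in this proof, the snippet below (not part of the paper) evaluates each player's uniform-federation error in the grand coalition against their local error $\mu_{e}/n_{j}$, for an instance where every player has $n_{i}\sigma^{2}>\mu_{e}$; the parameter values are illustrative.

```python
# Numeric illustration (not part of the paper): with n_i * sigma^2 > mu_e for
# every player, each player's grand-coalition error exceeds its local error.
import numpy as np

mu_e, sigma2 = 1.0, 0.5
n = np.array([3, 4, 6])               # every n_i > mu_e / sigma2 = 2
N = n.sum()

def grand_coalition_error(j):
    others = np.delete(n, j)
    return mu_e / N + sigma2 * (np.sum(others**2) + np.sum(others)**2) / N**2

for j in range(len(n)):
    print(grand_coalition_error(j), ">", mu_e / n[j])
```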

As a reminder, we will use the notation π​(s,β„“)\pi(s,\ell) to mean a coalition with ss small players and β„“\ell large players. The next few lemmas describe how the errors of large and small players in a coalition change as ss and β„“\ell are increased.

Lemma C.1.

For uniform federation, if ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}, small players always see their error decrease with the addition of more small players:

s2>s1⇒π​(s2,β„“)≻Sπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{S}\pi(s_{1},\ell)
Proof.

We will show that the derivative of the small player’s error with respect to ss is always negative. The error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{(s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to ss is:

nsβ‹…(sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe))(sβ‹…ns+β„“β‹…nβ„“)3\frac{n_{s}\cdot(s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e}))}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{3}}
βˆ’nsβ‹…(β„“β‹…nβ„“β‹…(ΞΌe+2​nβ„“β‹…Οƒ2βˆ’3​nsβ‹…Οƒ2))(sβ‹…ns+β„“β‹…nβ„“)3-\frac{n_{s}\cdot(\ell\cdot n_{\ell}\cdot(\mu_{e}+2n_{\ell}\cdot\sigma^{2}-3n_{s}\cdot\sigma^{2}))}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{3}}

Showing that the derivative is negative is equivalent to showing that the term below is negative:

sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe)βˆ’β„“β‹…nβ„“β‹…(ΞΌe+2​nβ„“β‹…Οƒ2βˆ’3​nsβ‹…Οƒ2)s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e})-\ell\cdot n_{\ell}\cdot(\mu_{e}+2n_{\ell}\cdot\sigma^{2}-3n_{s}\cdot\sigma^{2})

We can break this term into multiple components:

sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe)≀0s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e})\leq 0

because $n_{s}\leq\frac{\mu_{e}}{\sigma^{2}}$. We can rewrite the factor appearing in the second term as:

ΞΌeβˆ’nsβ‹…Οƒ2+2​σ2​(nβ„“βˆ’ns)\mu_{e}-n_{s}\cdot\sigma^{2}+2\sigma^{2}(n_{\ell}-n_{s})

We know that ΞΌeβˆ’nsβ‹…Οƒ2β‰₯0\mu_{e}-n_{s}\cdot\sigma^{2}\geq 0, and because nβ„“>nsn_{\ell}>n_{s},

2​nβ„“β‹…Οƒ2βˆ’2​nsβ‹…Οƒ2>02n_{\ell}\cdot\sigma^{2}-2n_{s}\cdot\sigma^{2}>0

These facts, taken together, show that the derivative is always negative. ∎
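The derivative used in this proof can be checked symbolically. The sketch below (not part of the paper) uses sympy to compare the derivative of the small player's error with the expression stated above; the analogous checks for the derivatives in Lemmas C.2 through C.4 follow the same pattern.

```python
# Symbolic check (not part of the paper) that d/ds of the small player's error
# matches the expression stated in the proof of Lemma C.1.
import sympy as sp

s, l, ns, nl, mu_e, sigma2 = sp.symbols('s ell n_s n_ell mu_e sigma2', positive=True)
N = s * ns + l * nl

err_small = mu_e / N + sigma2 * ((s - 1) * ns**2 + l * nl**2
                                 + ((s - 1) * ns + l * nl)**2) / N**2

claimed = (ns * (s * ns * (ns * sigma2 - mu_e))
           - ns * (l * nl * (mu_e + 2 * nl * sigma2 - 3 * ns * sigma2))) / N**3

# Should print 0 if the stated derivative is correct.
print(sp.simplify(sp.diff(err_small, s) - claimed))
```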

Lemma C.2.

For uniform federation, if nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}, large players see their error increase with the addition of more large players, so they prefer β„“\ell as small as possible:

β„“2<β„“1⇒π​(s,β„“2)≻Lπ​(s,β„“1)\ell_{2}<\ell_{1}\quad\Rightarrow\quad\pi(s,\ell_{2})\succ_{L}\pi(s,\ell_{1})

If nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}, then there exists some β„“0\ell_{0} such that for all β„“β€²<β„“0\ell^{\prime}<\ell_{0}, the derivative of the large players’ error with respect to β„“\ell is positive, and for all β„“β€²>β„“0\ell^{\prime}>\ell_{0}, the derivative of their error is negative.

Proof.

To prove this, we will show that the derivative of the large player’s error with respect to β„“\ell is always positive when nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}. The error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+Οƒ2​sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2(sβ‹…ns+β„“β‹…nβ„“)2+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to β„“\ell is:

nℓ​(β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)βˆ’nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2))(β„“β‹…nβ„“+nsβ‹…s)3\frac{n_{\ell}(\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})-n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2}))}{(\ell\cdot n_{\ell}+n_{s}\cdot s)^{3}}

We wish to show that the numerator is positive. We can break it into multiple components:

nβ„“β‹…Οƒ2βˆ’ΞΌeβ‰₯0n_{\ell}\cdot\sigma^{2}-\mu_{e}\geq 0

because $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$. We can rewrite the factor appearing in the second term as

ΞΌeβˆ’nβ„“β‹…Οƒ2+2​σ2​(nsβˆ’nβ„“)\mu_{e}-n_{\ell}\cdot\sigma^{2}+2\sigma^{2}(n_{s}-n_{\ell})

which is negative because $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$ and $n_{\ell}>n_{s}$; the second term of the numerator is therefore nonnegative, so the numerator as a whole is positive.
Next, we consider the case where nℓ<μeσ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}. The first term is now negative and the second term is the sum of two terms: one is positive and one is negative. The overall derivative is negative whenever:

β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)βˆ’nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)<0\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})-n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})<0
β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)<nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})<n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})
β„“>nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)nβ„“β‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)\ell>\frac{n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})}{n_{\ell}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})}

Note that, under this assumption, the denominator is negative. If the numerator is positive, then the inequality holds for all $\ell>0$, so the slope is always negative. If the numerator is negative, then the slope is negative for all $\ell>\ell_{0}$, for the resulting $\ell_{0}>0$. ∎

Lemma C.3.

Assume uniform federation with ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}. If nℓ≀μeΟƒ2n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}, small players always prefer federating with more large players:

β„“2>β„“1⇒π​(s,β„“2)≻Sπ​(s,β„“1)\ell_{2}>\ell_{1}\quad\Rightarrow\quad\pi(s,\ell_{2})\succ_{S}\pi(s,\ell_{1})

If nβ„“>ΞΌeΟƒ2n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}, then there exists some β„“1\ell_{1} such that for all β„“β€²<β„“1\ell^{\prime}<\ell_{1}, the derivative of the small players’ error with respect to β„“\ell is positive, and for all β„“β€²>β„“1\ell^{\prime}>\ell_{1}, the derivative of their error is negative.

Proof.

The small player’s error is

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{(s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to β„“\ell is:

-\frac{n_{\ell}\cdot\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}
-\frac{n_{\ell}\cdot s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}

The derivative is negative when the term below is positive:

β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))+sβ‹…nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2)\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))+s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})

The first term (multiplying β„“β‹…nβ„“\ell\cdot n_{\ell}) is always positive. For nℓ≀μeΟƒ2n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}} the second term is also positive or zero, so the derivative is always negative.

If $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$, then the second term (multiplying $s$) is negative. The overall derivative is negative when the term below is positive:

β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))+sβ‹…nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2)>0\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))+s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})>0
β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))>sβ‹…nsβ‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))>s\cdot n_{s}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})
β„“>sβ‹…nsβ‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))\ell>\frac{s\cdot n_{s}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})}{n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))}

∎

Lemma C.4.

Assume uniform federation with ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}. If nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}, then there exists some s0s_{0} such that for all sβ€²<s0s^{\prime}<s_{0}, the derivative of the large players’ error with respect to the number of small players ss is negative, and for all s>s0s>s_{0} the derivative of their error is positive.
If $n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}$, then the large players' error, viewed as a function of $s$, either increases for $s<s_{0}$ and decreases for all $s>s_{0}$, or else decreases for $s<s_{0}$ and increases for all $s>s_{0}$.

Proof.

The large player’s error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to ss is:

βˆ’nsβ‹…(β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2))(β„“β‹…nβ„“+sβ‹…ns)3-\frac{n_{s}\cdot(\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2}))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}
βˆ’nsβ‹…(nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“)))(β„“β‹…nβ„“+sβ‹…ns)3-\frac{n_{s}\cdot(n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell})))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}

The derivative is negative when the term below is positive:

β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)+nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})+n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))

For $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$, the first term is always positive or zero and the second term is always negative. Solving for when the overall derivative is negative gives:

β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)+nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>0\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})+n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>0
nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})
s<βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))s<\frac{-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})}{n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))}

Next, we consider the case where nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}. Again, the first term is positive or zero. The second term, though, is composed of a sum: one component is positive and one is negative, so it is not necessarily clear whether the overall sum is positive or negative. If the coefficient on ss is negative, then the curve has the same shape as in the nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}} case: first decreasing, and then increasing. If the coefficient is positive, then by the same analysis as before, the derivative is negative whenever the following inequality holds:

nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})
s>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))s>\frac{-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})}{n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))}

In this case, the righthand side is nonpositive, so the inequality holds for all $s>0$ and the large players' error is decreasing throughout. The shape of the curve therefore depends on the sign of the coefficient on $s$. ∎

See 5.4

Proof.

In this proof, we will use the results of Lemmas C.1, C.2, C.3, and C.4.

The small players always prefer $s$ as large as possible, and for $n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$ they also prefer $\ell$ as large as possible, so $\pi(S,L)=\pi_{g}$ minimizes error for the small players. For this reason, any defecting coalition of the form $\pi(s>0,\ell)$ is infeasible, because the small players in it would get higher error.

The only kind of defections we need to consider are in the form π​(0,β„“)\pi(0,\ell). We will consider π​(0,L)\pi(0,L) and show that the large players prefer π​(S,L)\pi(S,L) to π​(0,L)\pi(0,L): π​(S,L)≻Lπ​(0,L)\pi(S,L)\succ_{L}\pi(0,L). In the case that nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}, π​(0,L)≻Lπ​(0,β„“<L)\pi(0,L)\succ_{L}\pi(0,\ell<L), so any other arrangement is also not a possible defection. In the case that nβ„“=ΞΌeΟƒ2n_{\ell}=\frac{\mu_{e}}{\sigma^{2}}, π​(0,L)=Lπ​(0,β„“<L)\pi(0,L)=_{L}\pi(0,\ell<L), so similarly any other defection is not possible. What we’d like to show is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+Οƒ2​sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2(sβ‹…ns+β„“β‹…nβ„“)2+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}
<ΞΌeβ„“β‹…nβ„“+Οƒ2​(β„“βˆ’1)β‹…nβ„“2+(β„“βˆ’1)2β‹…nβ„“2β„“2β‹…nβ„“2<\frac{\mu_{e}}{\ell\cdot n_{\ell}}+\sigma^{2}\frac{(\ell-1)\cdot n_{\ell}^{2}+(\ell-1)^{2}\cdot n_{\ell}^{2}}{\ell^{2}\cdot n_{\ell}^{2}}

Cross multiplying turns the condition into:

ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…β„“2β‹…nβ„“2\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot\ell^{2}\cdot n_{\ell}^{2}
+Οƒ2β‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)β‹…β„“2β‹…nβ„“2+\sigma^{2}\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})\cdot\ell^{2}\cdot n_{\ell}^{2}
<ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)2+Οƒ2β‹…nβ„“2β‹…(β„“βˆ’1)β‹…β„“β‹…(sβ‹…ns+β„“β‹…nβ„“)2<\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}+\sigma^{2}\cdot n_{\ell}^{2}\cdot(\ell-1)\cdot\ell\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}

If we collect the ΞΌe\mu_{e} terms, we get:

ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…(sβ‹…ns+β„“β‹…nβ„“βˆ’β„“β‹…nβ„“)\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot(s\cdot n_{s}+\ell\cdot n_{\ell}-\ell\cdot n_{\ell})
=ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…sβ‹…ns=\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot s\cdot n_{s}

If we collect the Οƒ2\sigma^{2} terms, we get:

Οƒ2β‹…β„“β‹…nβ„“2β‹…(β„“β‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2\sigma^{2}\cdot\ell\cdot n_{\ell}^{2}\cdot(\ell\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}
+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)βˆ’(β„“βˆ’1)β‹…(sβ‹…ns+β„“β‹…nβ„“)2)+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})-(\ell-1)\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2})

First, we expand the first squared term and combine it with another term:

s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+s^{2}\cdot n_{s}^{2}+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+(\ell-1)^{2}\cdot n_{\ell}^{2}
=ns2β‹…sβ‹…(s+1)+2β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“+nβ„“2β‹…(β„“βˆ’1)β‹…β„“=n_{s}^{2}\cdot s\cdot(s+1)+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+n_{\ell}^{2}\cdot(\ell-1)\cdot\ell

Multiplied by β„“\ell, it becomes:

β„“β‹…(ns2β‹…sβ‹…(s+1)+2β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“+nβ„“2β‹…(β„“βˆ’1)β‹…β„“)\ell\cdot(n_{s}^{2}\cdot s\cdot(s+1)+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+n_{\ell}^{2}\cdot(\ell-1)\cdot\ell)

Expanding out the second squared term gives us:

s2β‹…ns2+2​sβ‹…β„“β‹…nsβ‹…nβ„“+β„“2β‹…nβ„“2s^{2}\cdot n_{s}^{2}+2s\cdot\ell\cdot n_{s}\cdot n_{\ell}+\ell^{2}\cdot n_{\ell}^{2}

When we multiply this by βˆ’(β„“βˆ’1)-(\ell-1), it becomes

βˆ’(β„“βˆ’1)β‹…(s2β‹…ns2+2​sβ‹…β„“β‹…nsβ‹…nβ„“+β„“2β‹…nβ„“2)-(\ell-1)\cdot(s^{2}\cdot n_{s}^{2}+2s\cdot\ell\cdot n_{s}\cdot n_{\ell}+\ell^{2}\cdot n_{\ell}^{2})

Next, we combine similar terms in both sums. First, we start with coefficients of ns2n_{s}^{2}

β„“β‹…ns2β‹…sβ‹…(s+1)βˆ’(β„“βˆ’1)β‹…ns2β‹…s2\ell\cdot n_{s}^{2}\cdot s\cdot(s+1)-(\ell-1)\cdot n_{s}^{2}\cdot s^{2}
=ns2β‹…sβ‹…(β„“β‹…(s+1)βˆ’(β„“βˆ’1)β‹…s)=n_{s}^{2}\cdot s\cdot(\ell\cdot(s+1)-(\ell-1)\cdot s)
=ns2β‹…sβ‹…(β„“β‹…s+β„“βˆ’β„“β‹…s+s)=n_{s}^{2}\cdot s\cdot(\ell\cdot s+\ell-\ell\cdot s+s)
=ns2β‹…sβ‹…(β„“+s)=n_{s}^{2}\cdot s\cdot(\ell+s)

Next, we consider the term involving coefficients of $n_{s}\cdot n_{\ell}$:

2β‹…β„“β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“βˆ’2β‹…(β„“βˆ’1)β‹…sβ‹…β„“β‹…nsβ‹…nβ„“2\cdot\ell\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}-2\cdot(\ell-1)\cdot s\cdot\ell\cdot n_{s}\cdot n_{\ell}
=0=0

And the last term, which involves coefficients of $n_{\ell}^{2}$:

β„“2β‹…(β„“βˆ’1)β‹…nβ„“2βˆ’(β„“βˆ’1)β‹…β„“2β‹…nβ„“2\ell^{2}\cdot(\ell-1)\cdot n_{\ell}^{2}-(\ell-1)\cdot\ell^{2}\cdot n_{\ell}^{2}
=0=0

If we take the only nonzero term and multiply it by the factors we pulled out, it becomes:

ns2β‹…sβ‹…(β„“+s)β‹…β„“β‹…nβ„“2β‹…Οƒ2n_{s}^{2}\cdot s\cdot(\ell+s)\cdot\ell\cdot n_{\ell}^{2}\cdot\sigma^{2}

Next, we return this to our inequality. What we’re trying to show is:

β„“β‹…nβ„“2β‹…ns2β‹…sβ‹…(β„“+s)β‹…Οƒ2<ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…sβ‹…ns\ell\cdot n_{\ell}^{2}\cdot n_{s}^{2}\cdot s\cdot(\ell+s)\cdot\sigma^{2}<\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot s\cdot n_{s}

Cancelling some terms:

nβ„“β‹…nsβ‹…(β„“+s)β‹…Οƒ2<ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)n_{\ell}\cdot n_{s}\cdot(\ell+s)\cdot\sigma^{2}<\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})

Expanding out terms:

Οƒ2β‹…(β„“β‹…nβ„“β‹…ns+sβ‹…nβ„“β‹…ns)<ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)\sigma^{2}\cdot(\ell\cdot n_{\ell}\cdot n_{s}+s\cdot n_{\ell}\cdot n_{s})<\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})

We can prove this by splitting up piecewise:

Οƒ2β‹…nsβ‹…β„“β‹…nβ„“<ΞΌeβ‹…β„“β‹…nβ„“\sigma^{2}\cdot n_{s}\cdot\ell\cdot n_{\ell}<\mu_{e}\cdot\ell\cdot n_{\ell}

because Οƒ2β‹…ns<ΞΌe\sigma^{2}\cdot n_{s}<\mu_{e}. Similarly,

Οƒ2β‹…nβ„“β‹…sβ‹…ns≀μeβ‹…sβ‹…ns\sigma^{2}\cdot n_{\ell}\cdot s\cdot n_{s}\leq\mu_{e}\cdot s\cdot n_{s}

because Οƒ2β‹…nℓ≀μe\sigma^{2}\cdot n_{\ell}\leq\mu_{e}. ∎

See 5.5

Proof.

We will prove this directly by calculating an arrangement that is individually stable, relying on the results in Lemmas C.1, C.2, C.3, C.4.

First, group all of the small players together. Then calculate $\ell'=\max\ell$ such that $\pi(S,\ell)\succeq_{L}\pi(0,1)$: the largest number of large players that can be in the coalition such that the large players prefer this to being alone. Check whether $\pi(S,\ell')\prec_{S}\pi(S,0)$. If this is true, make the final arrangement $\pi(S,0),\pi(0,1)\cdot L$: by the previous lemmas describing how the small players' error changes with $\ell$, we know that if $\pi(S,\ell')\prec_{S}\pi(S,0)$, then $\pi(S,\ell)\prec_{S}\pi(S,0)$ for all $\ell\leq\ell'$.

We will show that this is individually stable by showing no players wish to unilaterally deviate.

  • No small player wishes to go to $\pi(1,0)$: reducing the number of small players in a group from $S$ to 1 monotonically increases the error the small player faces.

  • No small player wishes to go to $\pi(1,1)$: it is possible to reach this state by first going from $\pi(S,0)$ to $\pi(S,1)$ (which would increase error because $1\leq\ell'$ and hence $\pi(S,1)\prec_{S}\pi(S,0)$) and then from $\pi(S,1)$ to $\pi(1,1)$ (which would increase error because reducing the number of small players increases error).

  • No large player can go to $\pi(S,1)$: this would increase the error of the small players.

  • No large player wishes to go to $\pi(0,2)$: this would increase the error of both large players.

Next, we will consider the case where $\pi(S,\ell')\succeq_{S}\pi(S,0)$: in this case, we will show that $\pi(S,\ell')$ is individually stable.

  • No small player wishes to go from $\pi(S,\ell')$ to $\pi(1,0)$. We can see that $\pi(1,0)$ has higher error because we know $\pi(S,\ell')\succeq_{S}\pi(S,0)\succ_{S}\pi(1,0)$.

  • No small player wishes to go to $\pi(1,1)$. We can see $\pi(1,1)$ has higher error for the small player because $\pi(S,\ell')\succeq_{S}\pi(S,1)\succ_{S}\pi(1,1)$. The first inequality comes from the following reasoning: if $\frac{d}{d\ell}err_{S}(\pi(S,\ell))$ is negative at $\ell=1$, then the error increases monotonically as $\ell$ decreases from $\ell'$ to 1. If $\frac{d}{d\ell}err_{S}(\pi(S,\ell))$ is positive at $\ell=1$, then we know that $\pi(S,1)\prec_{S}\pi(S,0)$, whereas $\pi(S,\ell')\succeq_{S}\pi(S,0)$.

  • No large player wishes to go to $\pi(S,\ell'+1)$: by the definition of $\ell'$, it would get greater error there than in $\pi(0,1)$.

  • No large player wishes to go to $\pi(0,2)$, for the same reason as above.

By this analysis, π​(S,β„“β€²)\pi(S,\ell^{\prime}) is individually stable. ∎
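The construction in this proof can also be written out procedurally. The sketch below (not part of the paper) uses the mean estimation errors under uniform federation for a two-type instance, computes $\ell'$, and returns the resulting individually stable arrangement; the instance parameters at the bottom are illustrative.

```python
# A sketch (not from the paper) of the construction in Theorem 5.5, using the
# uniform-federation mean estimation errors; all parameters are illustrative.
import numpy as np

def error(me, other_sizes, mu_e, sigma2):
    """Error of a player with `me` samples federating with players of `other_sizes`."""
    other_sizes = np.asarray(other_sizes, dtype=float)
    N = me + other_sizes.sum()
    return mu_e / N + sigma2 * (np.sum(other_sizes**2) + (N - me)**2) / N**2

def stable_arrangement(S, L, ns, nl, mu_e, sigma2):
    err_large_alone = error(nl, [], mu_e, sigma2)              # pi(0,1): mu_e / n_l
    # l' = max number of large players preferring pi(S, l) to being alone.
    l_prime = 0
    for l in range(1, L + 1):
        others = [ns] * S + [nl] * (l - 1)
        if error(nl, others, mu_e, sigma2) <= err_large_alone:
            l_prime = l
    # Do the small players prefer pi(S, l') to pi(S, 0)?
    small_mixed = error(ns, [ns] * (S - 1) + [nl] * l_prime, mu_e, sigma2)
    small_only = error(ns, [ns] * (S - 1), mu_e, sigma2)
    if small_mixed <= small_only:
        return f"pi(S={S}, l'={l_prime}) plus {L - l_prime} lone large players"
    return f"pi(S={S}, 0) plus {L} lone large players"

print(stable_arrangement(S=5, L=3, ns=2, nl=5, mu_e=5.0, sigma2=1.0))
```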

Lemma C.5, below, will be useful in Example C.6 in analyzing whether the individually stable arrangement produced by Theorem 5.5 is also core stable.

Lemma C.5.

Consider uniform federation with two coalitions $\pi(s_{1},\ell_{1})$ and $\pi(s_{2},\ell_{2})$ with $s_{2}<s_{1}$. Then, it is not possible to pick $\ell_{2}$ so that $\pi(s_{2},\ell_{2})\succ_{S}\pi(s_{1},\ell_{1})$ and $\pi(s_{2},\ell_{2})\succ_{L}\pi(s_{1},\ell_{1})$.

Proof.

To prove this, we will rely on Lemmas C.1, C.2, C.3, C.4. First, we consider a hypothetical β„“2β€²\ell_{2}^{\prime} defined such that

s1β‹…ns+β„“1β‹…nβ„“=s2β‹…ns+β„“2β€²β‹…nβ„“s_{1}\cdot n_{s}+\ell_{1}\cdot n_{\ell}=s_{2}\cdot n_{s}+\ell_{2}^{\prime}\cdot n_{\ell}

Note that $\ell_{2}'$ may not be an integer: we are using it only as a reference point (a hypothetical allocation), so this does not matter.

First, we will show that the small player gets greater error in $\pi(s_{2},\ell_{2}')$ than in $\pi(s_{1},\ell_{1})$. To do so, we rewrite $\ell_{2}'$ as:

β„“2β€²=(s1βˆ’s2)β‹…nsnβ„“+β„“1\ell_{2}^{\prime}=(s_{1}-s_{2})\cdot\frac{n_{s}}{n_{\ell}}+\ell_{1}

and plug it into the equation for the error of a small player. Note that most of the values stay the same: the entire $\mu_{e}$ term and the denominator of the $\sigma^{2}$ term. Similarly, note that with the given substitution,

((s2βˆ’1)β‹…ns+β„“2β€²β‹…nβ„“)2=((s1βˆ’1)β‹…ns+β„“1β‹…nβ„“)2((s_{2}-1)\cdot n_{s}+\ell_{2}^{\prime}\cdot n_{\ell})^{2}=((s_{1}-1)\cdot n_{s}+\ell_{1}\cdot n_{\ell})^{2}

The only term that changes is the other portion of the numerator, which becomes:

nβ„“2β‹…β„“1+nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2n_{\ell}^{2}\cdot\ell_{1}+n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}

We would like to show that it is larger than the relevant portion of the error for π​(s1,β„“1)\pi(s_{1},\ell_{1}), which is:

β„“1β‹…nβ„“2+(s1βˆ’1)β‹…ns2\ell_{1}\cdot n_{\ell}^{2}+(s_{1}-1)\cdot n_{s}^{2}

This is equivalent to showing:

nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2>(s1βˆ’1)β‹…ns2n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}>(s_{1}-1)\cdot n_{s}^{2}

We can lower bound the lefthand side using nβ„“>nsn_{\ell}>n_{s}:

nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2>ns2​(s1βˆ’s2)+(s2βˆ’1)β‹…ns2n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}>n_{s}^{2}(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}
=ns2β‹…(s1βˆ’1)=n_{s}^{2}\cdot(s_{1}-1)

which shows that the small player gets greater error in π​(s2,β„“2β€²)\pi(s_{2},\ell_{2}^{\prime}) than in π​(s1,β„“1)\pi(s_{1},\ell_{1}).

Next, we’ll show that the large player also gets greater error. We will first note that from the definition of the errors of the small and large players, we can write:

e​r​rS​(s1,β„“1)=e​r​rL​(s1,β„“1)+2​σ2​nβ„“βˆ’nsβ„“1β‹…nβ„“+s1β‹…nserr_{S}(s_{1},\ell_{1})=err_{L}(s_{1},\ell_{1})+2\sigma^{2}\frac{n_{\ell}-n_{s}}{\ell_{1}\cdot n_{\ell}+s_{1}\cdot n_{s}}

and

e​r​rS​(s2,β„“2β€²)=e​r​rL​(s2,β„“2β€²)+2​σ2​nβ„“βˆ’nsβ„“2β€²β‹…nβ„“+s2β‹…nserr_{S}(s_{2},\ell_{2}^{\prime})=err_{L}(s_{2},\ell_{2}^{\prime})+2\sigma^{2}\frac{n_{\ell}-n_{s}}{\ell_{2}^{\prime}\cdot n_{\ell}+s_{2}\cdot n_{s}}

Note that, by the definition of \ell_{2}^{\prime}, the additive term in each of these equalities is the same. We have also just shown that err_{S}(s_{2},\ell_{2}^{\prime})>err_{S}(s_{1},\ell_{1}). Combining these two equalities with that inequality implies that err_{L}(s_{2},\ell_{2}^{\prime})>err_{L}(s_{1},\ell_{1}).

We have shown that both the large and small players get higher error in π​(s2,β„“2β€²)\pi(s_{2},\ell_{2}^{\prime}) than π​(s1,β„“1)\pi(s_{1},\ell_{1}). Next, we will show that they have different preferences about whether they wish β„“2β€²\ell_{2}^{\prime} were larger or smaller, which means that no matter what β„“2\ell_{2} is, it will leave at least one of them with higher error.

First, we know from previous analysis in Lemma C.2 that the large players always wish there were fewer other large players in a coalition: the large players want β„“2<β„“2β€²\ell_{2}<\ell_{2}^{\prime}.

Secondly, we know from Lemma C.3 that the error the small player experiences first increases and then decreases as β„“\ell increases. Is it possible to pick an β„“2<β„“2β€²\ell_{2}<\ell_{2}^{\prime} so that the small player gets lower error there than in π​(s1,β„“1)\pi(s_{1},\ell_{1})?

Suppose that the derivative of e​r​rS​(s2,β„“)err_{S}(s_{2},\ell) with respect to β„“\ell is positive at β„“=β„“2\ell=\ell_{2}: then, reducing β„“\ell from β„“2β€²\ell_{2}^{\prime} to β„“2\ell_{2} might reduce the small player’s error. However, for every point where the small player’s derivative is positive, e​r​rS​(s2,β„“)>e​r​rS​(s2,0)>e​r​rS​(S,0)err_{S}(s_{2},\ell)>err_{S}(s_{2},0)>err_{S}(S,0): the small player would not wish to move here because it would get strictly higher error than it would get in π​(S,0)\pi(S,0).

Suppose instead that the derivative of err_{S}(s_{2},\ell) with respect to \ell is negative or zero at \ell=\ell_{2}. Then, if \ell_{2}<\ell_{2}^{\prime}, reducing the number of large players in the coalition from \ell_{2}^{\prime} to \ell_{2} would increase the error of the small players. This is also not an allocation that the small players would prefer.

Increasing \ell so that \ell_{2}>\ell_{2}^{\prime} is sufficiently large would satisfy the small player, but we have already shown that this would increase the error the large player experiences. As a result, it is not possible to pick an allocation that both the small and large players prefer to \pi(s_{1},\ell_{1}). ∎

Example C.6.

For uniform federation, the arrangement as produced in Theorem 5.5 is not necessarily core stable.

Theorem 5.5 produces an arrangement of the form \{\pi(S,\ell^{\prime}),\pi(0,1)\cdot(L-\ell^{\prime})\}, where we note that \ell^{\prime} could equal 0. As mentioned at the end of Section 5, this arrangement is not necessarily core stable; however, the reason why is subtle.

Theorem 5.5 checks that the given arrangement is individually stable, so the remaining cases to check would involve multiple players moving together to a new group. First, we will consider deviations consisting of homogeneous groups: for example, π​(s,0)\pi(s,0) or π​(0,β„“)\pi(0,\ell). From Lemma C.1 we know that small players always prefer having more small players in their coalition, so π​(s,0)β‰ΊSπ​(S,0)βͺ―Sπ​(S,β„“β€²)\pi(s,0)\prec_{S}\pi(S,0)\preceq_{S}\pi(S,\ell^{\prime}). Similarly, π​(0,β„“)β‰ΊLπ​(0,1)\pi(0,\ell)\prec_{L}\pi(0,1) and π​(0,β„“)β‰ΊLπ​(S,β„“β€²)\pi(0,\ell)\prec_{L}\pi(S,\ell^{\prime}) (if β„“β€²>0\ell^{\prime}>0), so the large players do not wish to unilaterally deviate.

Next, we might wonder whether some of the small players and large players in the \pi(S,\ell^{\prime}) coalition might wish to deviate to some \pi(s,\ell) with s<S: assume that \ell^{\prime}>0 for now. Lemma C.5 shows that this is not feasible: it is not possible to find a coalition \pi(s,\ell) with s<S where both the small and large players get lower error. That is, it is not possible to have both \pi(s,\ell)\succ_{S}\pi(S,\ell^{\prime}) and \pi(s,\ell)\succ_{L}\pi(S,\ell^{\prime}).

However, it still might be possible to have a deviating coalition. Recall from Theorem 5.5 that if \ell^{\prime}>0, \pi(S,\ell^{\prime})\succ_{L}\pi(0,1): the large players doing local learning get strictly greater error than the large players in \pi(S,\ell^{\prime}). Could it be possible to find an arrangement \pi(s,\ell) so that \pi(s,\ell)\succ_{S}\pi(S,\ell^{\prime}) and \pi(s,\ell)\succ_{L}\pi(0,1)?

The answer is yes. We show this by constructively producing an arrangement of players that is individually stable as produced by Theorem 5.5, but is not core stable because a subset of the small players in π​(S,β„“β€²)\pi(S,\ell^{\prime}) and the large players doing local learning both strictly prefer some π​(s,β„“)\pi(s,\ell).

For this arrangement, we set \mu_{e}=100,\sigma^{2}=1 (note the larger \mu_{e} value). We fix S=70 and L\geq 7. Then, we calculate the error players get in various arrangements. (The code to produce these calculations is given in our Github repository: https://github.com/kpdonahue/model_sharing_games.)

First, we calculate the individually stable arrangement as given in Theorem 5.5. We claim that \ell^{\prime}=3, so that the arrangement is \{\pi(70,3),\pi(0,1)\cdot(L-3)\}. For reference, we calculate the errors the small and large players get in various arrangements.

e​r​rS​(π​(70,3))=1.107322e​r​rS​(π​(70,0))=1.115584err_{S}(\pi(70,3))=1.107322\quad err_{S}(\pi(70,0))=1.115584
e​r​rL​(π​(70,3))=0.932690e​r​rL​(π​(0,1))=0.943396err_{L}(\pi(70,3))=0.932690\quad err_{L}(\pi(0,1))=0.943396
e​r​rL​(π​(70,4))=0.943664err_{L}(\pi(70,4))=0.943664

Note that \pi(70,3)\succ_{S}\pi(70,0) and \pi(70,3)\succ_{L}\pi(0,1): both the small and large players would rather participate. Because \pi(70,4)\prec_{L}\pi(0,1), we cannot add another large player to the \pi(70,3) coalition.

Next, we will show that the alternative coalition \pi(68,4) is one that both the small players and the large players doing local learning would strictly prefer.

e​r​rS​(π​(68,4))=1.105263e​r​rL​(π​(68,4))=0.943147err_{S}(\pi(68,4))=1.105263\quad err_{L}(\pi(68,4))=0.943147

Note that \pi(68,4)\succ_{S}\pi(70,3) and \pi(68,4)\succ_{L}\pi(0,1): the deviating small players and the deviating large players are both strictly better off, so the arrangement produced by Theorem 5.5 is not core stable.
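
The error values above can be reproduced with a few lines of code. The sketch below is an illustrative check, not part of the proof: it assumes the expected-MSE expression for uniform federation derived earlier in the paper, err_{j}=\mu_{e}/N+\sigma^{2}(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})/N^{2}, together with sample sizes n_{s}=11 and n_{\ell}=106, which reproduce the printed values.

```python
# Spot-check of the error values in Example C.6, assuming the uniform-federation
# expected MSE  err_j = mu_e / N + sigma2 * (sum_{i != j} n_i^2 + (N - n_j)^2) / N^2
# used earlier in the paper. The sample sizes below reproduce the printed numbers.

MU_E, SIGMA2 = 100.0, 1.0
N_SMALL, N_LARGE = 11, 106   # inferred small / large sample sizes

def err_uniform(n_j, others):
    """Expected MSE of a player holding n_j samples in a uniform-federation
    coalition whose other members hold the sample counts in `others`."""
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return MU_E / N + SIGMA2 * B / N ** 2

def err_small(s, l):
    """Error of one small player in the coalition pi(s, l)."""
    return err_uniform(N_SMALL, [N_SMALL] * (s - 1) + [N_LARGE] * l)

def err_large(s, l):
    """Error of one large player in the coalition pi(s, l)."""
    return err_uniform(N_LARGE, [N_SMALL] * s + [N_LARGE] * (l - 1))

print(err_small(70, 3), err_small(70, 0))   # ~1.107322, ~1.115584
print(err_large(70, 3), err_large(0, 1))    # ~0.932690, ~0.943396
print(err_large(70, 4))                     # ~0.943664
print(err_small(68, 4), err_large(68, 4))   # ~1.105263, ~0.943147
```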

Appendix D Supporting proofs for coarse-grained federation

See 6.1

Proof.

Taking the derivative of the error with respect to ww produces:

2​μe​(wnjβˆ’wN)βˆ’2β€‹βˆ‘iβ‰ jni2+(Nβˆ’nj)2N2β‹…(1βˆ’w)β‹…Οƒ22\mu_{e}\left(\frac{w}{n_{j}}-\frac{w}{N}\right)-2\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot(1-w)\cdot\sigma^{2}

Setting this equal to 0 and solving for ww produces wjw_{j} equal to

(βˆ‘iβ‰ jni2+(Nβˆ’nj)2)β‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+(βˆ‘iβ‰ jni2+(Nβˆ’nj)2)β‹…Οƒ2\frac{(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})\cdot\sigma^{2}}

Note that this value wjw_{j} depends on the player jj that it is in reference to. It is also always strictly between 0 and 1. To confirm that this is a point of minimum error rather than maximum, we can take the second derivative of the error, which gives a result that is always positive:

2​μe​(1njβˆ’1N)+2β€‹βˆ‘iβ‰ jni2+(Nβˆ’nj)2N2β‹…Οƒ22\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+2\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot\sigma^{2}

∎
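
As a sanity check on this derivation, the following sketch (with hypothetical parameter values) computes w_{j} from the closed form above and verifies numerically that it lies strictly between 0 and 1 and minimizes the error expression over a fine grid of weights.

```python
# Numerical check of the optimal coarse-grained weight w_j derived above, using the
# error expression  err_j(w) = mu_e*(w^2/n_j + (1 - w^2)/N) + (B/N^2)*(1 - w)^2*sigma2
# with B = sum_{i != j} n_i^2 + (N - n_j)^2. The parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]       # hypothetical sample sizes of the coalition members
j = 0                          # player whose weight we optimize
n_j = sizes[j]
N = sum(sizes)
B = sum(n * n for i, n in enumerate(sizes) if i != j) + (N - n_j) ** 2

def err(w):
    return MU_E * (w * w / n_j + (1 - w * w) / N) + (B / N ** 2) * (1 - w) ** 2 * SIGMA2

w_j = (B * SIGMA2) / (MU_E * N ** 2 * (1 / n_j - 1 / N) + B * SIGMA2)

assert 0 < w_j < 1
# err is convex in w, so w_j should (approximately) minimize it on a fine grid.
grid_min = min(err(k / 10000) for k in range(10001))
assert grid_min >= err(w_j) - 1e-9
print(w_j, err(w_j))
```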

See 6.3

Proof.

For conciseness, we will rewrite the error and wjw_{j} weight with B=βˆ‘iβ‰ jni2+(Nβˆ’nj)2B=\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}. The error becomes:

ΞΌeβ‹…(wj2nj+1βˆ’wj2N)+BN2β‹…(1βˆ’wj)2​σ2\mu_{e}\cdot\left(\frac{w_{j}^{2}}{n_{j}}+\frac{1-w_{j}^{2}}{N}\right)+\frac{B}{N^{2}}\cdot(1-w_{j})^{2}\sigma^{2}

and the wjw_{j} weight becomes

Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2\frac{B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}}

Substituting in for wjw_{j} gives the error as:

ΞΌeN+ΞΌeβ‹…(1njβˆ’1N)β‹…B2β‹…(Οƒ2)2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2\frac{\mu_{e}}{N}+\mu_{e}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot\frac{B^{2}\cdot(\sigma^{2})^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}
+BN2β‹…Οƒ2β‹…ΞΌe2β‹…N4β‹…(1njβˆ’1N)2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2+\frac{B}{N^{2}}\cdot\sigma^{2}\cdot\frac{\mu_{e}^{2}\cdot N^{4}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}

Collecting and pulling out common terms:

ΞΌeN+ΞΌe​(1njβˆ’1N)β‹…Bβ‹…Οƒ2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2\frac{\mu_{e}}{N}+\frac{\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot B\cdot\sigma^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}
β‹…(Bβ‹…Οƒ2+ΞΌeβ‹…N2β‹…(1njβˆ’1N))\cdot\left(B\cdot\sigma^{2}+\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\right)
=ΞΌeN+ΞΌe​(1njβˆ’1N)β‹…Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2=\frac{\mu_{e}}{N}+\frac{\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}}
=ΞΌe2β‹…N2​(1njβˆ’1N)+ΞΌeβ‹…Bβ‹…Οƒ2+Nβ‹…ΞΌe​(1njβˆ’1N)​Bβ‹…Οƒ2Nβ‹…(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)=\frac{\mu_{e}^{2}\cdot N^{2}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+\mu_{e}\cdot B\cdot\sigma^{2}+N\cdot\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)B\cdot\sigma^{2}}{N\cdot\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)}

We multiply the top and bottom by Nβ‹…njN\cdot n_{j}:

ΞΌe2β‹…N2​(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…Nβ‹…nj+Nβ‹…ΞΌe​(Nβˆ’nj)​Bβ‹…Οƒ2Nβ‹…(ΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj)\frac{\mu_{e}^{2}\cdot N^{2}\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot N\cdot n_{j}+N\cdot\mu_{e}\left(N-n_{j}\right)B\cdot\sigma^{2}}{N\cdot\left(\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}\right)}
=ΞΌe2β‹…Nβ‹…(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…nj+ΞΌeβ‹…(Nβˆ’nj)β‹…Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj=\frac{\mu_{e}^{2}\cdot N\cdot\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot n_{j}+\mu_{e}\cdot\left(N-n_{j}\right)\cdot B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}}
=ΞΌe2β‹…Nβ‹…(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…NΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj=\frac{\mu_{e}^{2}\cdot N\cdot\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot N}{\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}}
=ΞΌeβ‹…(Nβˆ’nj)+Bβ‹…Οƒ2Nβ‹…(Nβˆ’nj)+Bβ‹…Οƒ2ΞΌeβ‹…nj=\frac{\mu_{e}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}}{N\cdot\left(N-n_{j}\right)+B\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot n_{j}}

which gives the desired value. ∎
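
The algebra above can also be checked numerically. For hypothetical parameter values, the sketch below verifies that the closed-form error equals the error expression evaluated at the optimal weight w_{j}.

```python
# Numerical check that the closed-form error above matches the error expression
# evaluated at the optimal weight w_j. Parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]
N = sum(sizes)
for j, n_j in enumerate(sizes):
    B = sum(n * n for i, n in enumerate(sizes) if i != j) + (N - n_j) ** 2
    w_j = (B * SIGMA2) / (MU_E * N ** 2 * (1 / n_j - 1 / N) + B * SIGMA2)
    direct = (MU_E * (w_j ** 2 / n_j + (1 - w_j ** 2) / N)
              + (B / N ** 2) * (1 - w_j) ** 2 * SIGMA2)
    closed = (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 / MU_E * n_j)
    assert abs(direct - closed) < 1e-9
    print(j, direct, closed)
```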

Next, we will prove a series of lemmas describing how the small and large players in a coalition \pi(s,\ell) using optimal coarse-grained federation see their error change as s and \ell increase.

Lemma D.1.

For optimal coarse-grained federation, small players always see their error decrease with the addition of more small players. That is,

s2>s1⇒π​(s2,β„“)≻Sπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{S}\pi(s_{1},\ell)
Proof.

The error that the small players experience is gS​(s,β„“)hS​(s,β„“)\frac{g_{S}(s,\ell)}{h_{S}(s,\ell)} where

gS​(s,β„“)=ΞΌeβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)g_{S}(s,\ell)=\mu_{e}\cdot\left((s-1)\cdot n_{s}+\ell\cdot n_{\ell}\right)
+((sβˆ’1)β‹…ns2+β„“β‹…nβ„“2+((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2)β‹…Οƒ2+\left((s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+\left((s-1)\cdot n_{s}+\ell\cdot n_{\ell}\right)^{2}\right)\cdot\sigma^{2}
hS​(s,β„“)=((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2+nsβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)h_{S}(s,\ell)=((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}+n_{s}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
+nsβ‹…Οƒ2ΞΌeβ‹…((sβˆ’1)β‹…ns2+β„“β‹…nβ„“2+((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2)+n_{s}\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot((s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2})

Next, we consider the derivative of the error term with respect to ss, the number of small players. If this is always negative, then small players always see their error decrease with the addition of more small players. The derivative is negative when:

dd​s​[gS​(s,β„“)]β‹…hS​(s,β„“)βˆ’gS​(s,β„“)β‹…dd​s​[hS​(s,β„“)]<0\frac{d}{ds}\left[g_{S}(s,\ell)\right]\cdot h_{S}(s,\ell)-g_{S}(s,\ell)\cdot\frac{d}{ds}\left[h_{S}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nsβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)-n_{s}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
β‹…((sβˆ’1)β‹…nsβ‹…(ΞΌe+nsβ‹…Οƒ2)+β„“β‹…nβ„“β‹…(ΞΌe+Οƒ2β‹…(2​nβ„“βˆ’ns)))\cdot\left((s-1)\cdot n_{s}\cdot(\mu_{e}+n_{s}\cdot\sigma^{2})+\ell\cdot n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{\ell}-n_{s}))\right)

Each individual term is positive: note that nβ„“>nsn_{\ell}>n_{s}, so 2​nβ„“βˆ’ns>02n_{\ell}-n_{s}>0. The overall term is negative, meaning that the overall derivative of the error is always negative. ∎

Lemma D.2.

For optimal coarse-grained federation, there exists some β„“0\ell_{0} such that for all β„“>β„“0\ell>\ell_{0}, the small players always see their error decrease with the addition of more large players.

Proof.

The error that the small player experiences can be given by the ratio of gS​(β‹…)/hS​(β‹…)g_{S}(\cdot)/h_{S}(\cdot) as before. In this case, we are interested in the derivative with respect to β„“\ell, which is negative when:

dd​ℓ​[gS​(s,β„“)]β‹…hS​(s,β„“)βˆ’gS​(s,β„“)β‹…dd​ℓ​[hS​(s,β„“)]<0\frac{d}{d\ell}\left[g_{S}(s,\ell)\right]\cdot h_{S}(s,\ell)-g_{S}(s,\ell)\cdot\frac{d}{d\ell}\left[h_{S}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nβ„“β‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)-n_{\ell}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
β‹…((sβˆ’1)ns(ΞΌe+Οƒ2(2nsβˆ’nβ„“))+β„“β‹…nβ„“(ΞΌe+Οƒ2β‹…nβ„“)))\cdot\left((s-1)n_{s}(\mu_{e}+\sigma^{2}(2n_{s}-n_{\ell}))+\ell\cdot n_{\ell}(\mu_{e}+\sigma^{2}\cdot n_{\ell}))\right)

This term is negative whenever:

\ell>-\frac{(s-1)\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))}{n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot n_{\ell})}

So, for sufficiently large β„“\ell, the derivative of the small player’s error is always negative. Note that if the term on the RHS is negative, then the derivative is negative for any β„“β‰₯0\ell\geq 0. ∎

Lemma D.3.

For optimal coarse-grained federation, there exists some β„“1\ell_{1} such that for all β„“>β„“1\ell>\ell_{1}, the large players always see their error decrease with the addition of more large players. If there are no small players (s=0s=0), then large players would most prefer to all federate together in π​(0,L)\pi(0,L).

Proof.

The error that the large players experience is gL​(s,β„“)hL​(s,β„“)\frac{g_{L}(s,\ell)}{h_{L}(s,\ell)} where

gL​(s,β„“)=ΞΌeβ‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)g_{L}(s,\ell)=\mu_{e}\cdot\left(s\cdot n_{s}+(\ell-1)\cdot n_{\ell}\right)
+(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)β‹…Οƒ2+\left(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+\left(s\cdot n_{s}+(\ell-1)\cdot n_{\ell}\right)^{2}\right)\cdot\sigma^{2}
h_{L}(s,\ell)=(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}+n_{\ell}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
+nβ„“β‹…Οƒ2ΞΌeβ‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)+n_{\ell}\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})

The derivative of the error with respect to β„“\ell, the number of large players, is negative whenever:

dd​ℓ​[gL​(s,β„“)]β‹…hL​(s,β„“)βˆ’gL​(s,β„“)β‹…dd​ℓ​[hL​(s,β„“)]<0\frac{d}{d\ell}\left[g_{L}(s,\ell)\right]\cdot h_{L}(s,\ell)-g_{L}(s,\ell)\cdot\frac{d}{d\ell}\left[h_{L}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nβ„“β‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)-n_{\ell}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
\cdot\left((\ell-1)\cdot n_{\ell}\cdot(\mu_{e}+n_{\ell}\cdot\sigma^{2})+s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))\right)

This term is negative whenever:

\ell>1-\frac{s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))}{n_{\ell}\cdot(\mu_{e}+n_{\ell}\cdot\sigma^{2})}

Note that if s=0s=0, this reduces to β„“>1\ell>1, which says that whenever large players are arranged without small players, they would most prefer to all be together. ∎

Lemma D.4.

For optimal coarse-grained federation, large players always see their error decrease with the addition of more small players. Because of this, π​(s,β„“)≻Lπ​(0,β„“)\pi(s,\ell)\succ_{L}\pi(0,\ell) for all s>0s>0: large players would always prefer federating with more small players.

s2>s1⇒π​(s2,β„“)≻Lπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{L}\pi(s_{1},\ell)
Proof.

The error that the large players experience can be given by the ratio gL​(β‹…)/hL​(β‹…)g_{L}(\cdot)/h_{L}(\cdot) as before. In this case, we are interested in the derivative with respect to ss, which is negative when:

dd​s​[gL​(s,β„“)]β‹…hL​(s,β„“)βˆ’gL​(s,β„“)β‹…dd​s​[hL​(s,β„“)]<0\frac{d}{ds}\left[g_{L}(s,\ell)\right]\cdot h_{L}(s,\ell)-g_{L}(s,\ell)\cdot\frac{d}{ds}\left[h_{L}(s,\ell)\right]<0

Gathering and simplifying the lefthand side of this inequality gives:

βˆ’nsβ‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)-n_{s}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
\cdot\left(s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot n_{s})+(\ell-1)\cdot n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{\ell}-n_{s}))\right)

Each individual term is nonnegative, and the expression is nonzero whenever the coalition contains at least one player besides the large player in question: note that n_{\ell}>n_{s}, so 2n_{\ell}-n_{s}>0. The overall term is therefore negative, meaning that the derivative of the error is always negative. ∎
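
The monotonicity claims of Lemmas D.1 and D.4 can be spot-checked numerically using the optimal coarse-grained error from Lemma 6.3. The sketch below uses hypothetical parameter values and checks a single setting; it illustrates the lemmas but is not a substitute for the proofs.

```python
# Spot-check of Lemmas D.1 and D.4 for one hypothetical parameter setting, using the
# optimal coarse-grained error from Lemma 6.3:
#   err_j = (mu_e*(N - n_j) + B*sigma2) / (N*(N - n_j) + B*sigma2*n_j/mu_e).

MU_E, SIGMA2 = 10.0, 2.0
N_SMALL, N_LARGE = 5, 50     # hypothetical small / large sample sizes

def err_opt(n_j, others):
    if not others:           # a player alone simply uses its local model: error mu_e / n_j
        return MU_E / n_j
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 * n_j / MU_E)

def err_small(s, l):
    return err_opt(N_SMALL, [N_SMALL] * (s - 1) + [N_LARGE] * l)

def err_large(s, l):
    return err_opt(N_LARGE, [N_SMALL] * s + [N_LARGE] * (l - 1))

for l in range(0, 6):
    # Lemma D.1: a small player's error strictly decreases as s grows (l fixed).
    vals = [err_small(s, l) for s in range(1, 40)]
    assert all(a > b for a, b in zip(vals, vals[1:]))
for l in range(1, 6):
    # Lemma D.4: a large player's error strictly decreases as s grows (l fixed).
    vals = [err_large(s, l) for s in range(0, 40)]
    assert all(a > b for a, b in zip(vals, vals[1:]))
print("monotonicity checks passed")
```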

Given these lemmas, we will divide the error curves into regions. Note that err_{S}(s,\ell=\ell^{\prime}) and err_{L}(s,\ell=\ell^{\prime}) are both monotonically decreasing in s. We will refer to the section of err_{S}(s,\ell=\ell^{\prime}) with error greater than the error the small players get in \pi(S,0) as B and the region with error lower than this as C. Similarly, we will define the region of err_{L}(s,\ell=\ell^{\prime}) where the error is higher than the error the large players get in \pi(0,L) as B, and the region where the error is lower as C.

Note that, by Lemma D.2, err_{S}(s=s^{\prime},\ell) is decreasing for all \ell>\ell_{0}; before that, it is increasing. We will call the region where the error is increasing A, the region where the error is decreasing but still greater than the error the small player gets in \pi(S,0) B, and the final region, where the error is decreasing and lower than or equal to the error in \pi(S,0), C.

We will not divide the regions of curve e​r​rL​(s=sβ€²,β„“)err_{L}(s=s^{\prime},\ell) because the proof below will not depend on those results.

We will use \pi_{d}=\{\pi(S,0),\pi(0,L)\} as a shorthand.

See 6.5

Proof.

In this proof, we will use the results of Lemmas D.1, D.2, D.3, and D.4. The statement of the theorem has three parts (including the equality Ο€g=Sπ​(S,0)\pi_{g}=_{S}\pi(S,0)), which we will prove in sequence. We note that Lemma D.4 indicates that π​(S,L)≻Lπ​(0,L)\pi(S,L)\succ_{L}\pi(0,L) always: the large players always prefer federating with the smaller players to being alone.

Case: π​(S,L)β‰ΊSπ​(S,0)\pi(S,L)\prec_{S}\pi(S,0)
In this case, we will show that Ο€d={π​(S,0),π​(0,L)}\pi_{d}=\{\pi(S,0),\pi(0,L)\} is strictly core stable.

We know from the precondition that the point \pi(S,L) is in either region A or B of err_{S}(s=S,\ell). We will show that any other arrangement \pi(s^{\prime},\ell^{\prime}) is one where the small players get higher error than in \pi(S,0). We will assume \ell^{\prime}>0: if \ell^{\prime}=0 and s^{\prime}<S, then we know \pi(s^{\prime},0)\prec_{S}\pi(S,0) by Lemma D.1. We will do this by starting at the point \pi(s^{\prime},\ell^{\prime}) and moving along the relevant error curves towards \pi(S,0) and \pi(S,L) to show that the small players experience strictly greater error than in \pi(S,0).

First, we will consider the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}). From Lemma D.1, we know that it is decreasing in ss. We will move ss from sβ€²s^{\prime} to SS, which decreases error. (If sβ€²=Ss^{\prime}=S, then this leaves error unchanged.) Next, we will consider the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) on the curve e​r​rS​(s=S,β„“)err_{S}(s=S,\ell). If it is in region AA or BB, then we know

π​(sβ€²,β„“β€²)βͺ―Sπ​(S,β„“β€²)β‰ΊSπ​(S,0)\pi(s^{\prime},\ell^{\prime})\preceq_{S}\pi(S,\ell^{\prime})\prec_{S}\pi(S,0)

We will show that it is impossible for it to be in region CC. If it were, then moving along this curve from β„“β€²\ell^{\prime} to LL would only decrease the error the small player experiences, because the slope is decreasing with β„“\ell in region CC. But then that would imply that π​(S,L)\pi(S,L) is in region CC of this graph, and so π​(S,L)≻Sπ​(S,0)\pi(S,L)\succ_{S}\pi(S,0), which contradicts the precondition of this case.

This shows that the small players minimize their error in π​(S,0)\pi(S,0), so they will not even weakly prefer any other arrangement. Given this, any deviation from Ο€d\pi_{d} could only involve large players. However, by Lemma D.3, out of any arrangement excluding small players, the large players minimize their error in π​(0,L)\pi(0,L) (their current arrangement) so they would not even weakly prefer any other arrangement.

This implies that \pi_{d} is strictly core stable given \pi(S,L)\prec_{S}\pi(S,0).

Case: π​(S,L)=Sπ​(S,0)\pi(S,L)=_{S}\pi(S,0)
In this case, we assume that the small players are indifferent between Ο€g\pi_{g} and Ο€d\pi_{d} and will show that Ο€g\pi_{g} is strictly core stable.

The key part of this proof is to show that, if the small players are indifferent between Ο€g\pi_{g} and π​(S,0)\pi(S,0), then they get strictly greater error in every other arrangement π​(sβ€²,β„“β€²)\pi(s^{\prime},\ell^{\prime}) that isn’t equal to π​(S,0)\pi(S,0) or π​(S,L)\pi(S,L).

The precondition means that \pi(S,L) is in region C of both graphs. To move to \pi(s^{\prime},\ell^{\prime}), first we move along err_{S}(s=S,\ell) from L to \ell^{\prime}. In doing so, we end up in either region A or B of the curve: note that we cannot remain in region C unless \ell^{\prime}=L. This is because, by definition of the regions, region C has error that decreases with \ell, which means that error strictly increases from \pi(S,L) as \ell is decreased. This means that if \ell^{\prime}<L, \pi(S,\ell^{\prime}) gives the small player greater error than it gets in \pi(S,0) (or equivalently, in \pi_{g}).

If the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) is in region AA or BB of e​r​rS​(s=S,β„“)err_{S}(s=S,\ell), then it is also in region BB of the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}). Reducing ss from SS to sβ€²s^{\prime} can only increase the error the small player experiences. This shows that the small players get strictly greater error in any arrangement that is not π​(S,0)\pi(S,0) or π​(S,L)\pi(S,L).

Note: if β„“β€²=L\ell^{\prime}=L, then point π​(S,β„“β€²)=π​(S,L)\pi(S,\ell^{\prime})=\pi(S,L) is in region CC of both curves, but because π​(sβ€²,β„“β€²)≠π​(S,L)\pi(s^{\prime},\ell^{\prime})\neq\pi(S,L), we must have sβ€²<Ss^{\prime}<S, so π​(sβ€²,β„“β€²)β‰ΊSπ​(S,L)=Sπ​(S,0)\pi(s^{\prime},\ell^{\prime})\prec_{S}\pi(S,L)=_{S}\pi(S,0).

Next, we will show that Ο€g\pi_{g} is strictly core stable.

As shown above, the small players get higher error in any arrangement besides Ο€g\pi_{g} and π​(S,0)\pi(S,0). If they are in Ο€g\pi_{g}, they cannot be convinced to deviate to any arrangement except π​(S,0)\pi(S,0), and then only weakly (they get identical error). Conditional on the small players all being in Ο€g\pi_{g}, the large players would most prefer to be in Ο€g\pi_{g}, since by Lemma D.3 they strictly prefer it to π​(0,L)\pi(0,L). This implies that there does not exist a group of players where all weakly would like to defect and at least one strictly wishes to defect, the condition for strict core stability.

Case: π​(S,L)≻Sπ​(S,0)\pi(S,L)\succ_{S}\pi(S,0):
We will show that Ο€g\pi_{g} is strictly core stable.

The precondition means that π​(S,L)\pi(S,L) is in section CC of both small player graphs. Consider an arbitrary other coalition π​(sβ€²,β„“β€²)\pi(s^{\prime},\ell^{\prime}): we will show that the small players get strictly greater error here than in π​(S,L)\pi(S,L).

First, we start at the arrangement \pi(S,L). Consider moving along the curve err_{S}(s=S,\ell): we hold s constant and reduce \ell from L to \ell^{\prime}. We could end up in region C, B, or A of this curve.

First, assume we end up in section C. This means that, as we have reduced \ell, our error has been monotonically increasing (unless \ell^{\prime}=L, in which case it is unchanged). We also know that, on the curve err_{S}(s,\ell=\ell^{\prime}), the point \pi(S,\ell^{\prime}) is in region C. Next, we similarly move along this curve to reduce s from S to s^{\prime}. From Lemma D.1, we know err_{S}(s,\ell=\ell^{\prime}) is monotonically decreasing in s, which means reducing s monotonically increases the small player's error (unless s^{\prime}=S, in which case the error is unchanged). Because we have assumed either s^{\prime}\neq S or \ell^{\prime}\neq L, we have produced a path of monotonically increasing error from \pi(S,L) to \pi(s^{\prime},\ell^{\prime}), so we know the small player experiences greater error there.

Next, we will instead assume that we ended up in region AA or BB of the e​r​rS​(s=S,β„“)err_{S}(s=S,\ell) curve when we reduced β„“\ell. By definition of the AA and BB regions, we know that the small player experiences larger error in π​(S,β„“β€²)\pi(S,\ell^{\prime}) than it would in π​(S,0)\pi(S,0). (The one exception is the point π​(S,0)\pi(S,0), which is in region AA and which obviously gives equal error to π​(S,0)\pi(S,0)). Then, we know that in the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}), the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) must be in region BB. Because e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}) is monotonically decreasing in ss, reducing ss from SS to sβ€²s^{\prime} monotonically increases error. We then know that:

π​(sβ€²,β„“β€²)βͺ―Sπ​(S,β„“β€²)βͺ―Sπ​(S,0)β‰ΊSπ​(S,L)\pi(s^{\prime},\ell^{\prime})\preceq_{S}\pi(S,\ell^{\prime})\preceq_{S}\pi(S,0)\prec_{S}\pi(S,L)

where the last inequality comes from the premise of this case. The first and second inequalities are strict if s^{\prime}<S or \ell^{\prime}\neq 0, respectively.

We have just shown that the small player prefers π​(S,L)\pi(S,L) to any other arrangement. If there were going to be a defecting group, it would have to involve only large players. However, the arrangement where large players (federating by themselves) get lowest error is in π​(0,L)\pi(0,L), which by Lemma D.4 gives them higher error than π​(S,L)\pi(S,L).

By this reasoning, Ο€g\pi_{g} is strictly core stable. ∎
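
For a concrete instance, the case distinction above can be evaluated directly from the Lemma 6.3 error expression. The sketch below (with hypothetical parameter values) computes the small players' error in \pi(S,L) and in \pi(S,0) and reports which case of the theorem applies.

```python
# Evaluating the case condition of the theorem for a concrete (hypothetical) instance,
# using the optimal coarse-grained error from Lemma 6.3.

MU_E, SIGMA2 = 10.0, 2.0
N_SMALL, N_LARGE = 5, 50   # hypothetical sample sizes
S, L = 8, 3                # hypothetical numbers of small and large players

def err_opt(n_j, others):
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 * n_j / MU_E)

err_small_grand = err_opt(N_SMALL, [N_SMALL] * (S - 1) + [N_LARGE] * L)  # pi(S, L)
err_small_split = err_opt(N_SMALL, [N_SMALL] * (S - 1))                  # pi(S, 0)
if err_small_grand > err_small_split:
    print("pi(S, L) gives small players higher error: pi_d is strictly core stable")
else:
    print("pi(S, L) gives small players weakly lower error: pi_g is strictly core stable")
```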

Appendix E Supporting proofs for fine-grained federation

See 7.1

Proof.

To minimize, we take the derivative of player j's error with respect to the weight v_{jk}. Note that v_{jj}=1-\sum_{i\neq j}v_{ji}=1-v_{jk}-\sum_{i\neq j,i\neq k}v_{ji}, so v_{jk} appears twice in the component involving \mu_{e}. Rewriting the error gives:

ΞΌeβ€‹βˆ‘iβ‰ jMvj​i2ni+ΞΌe​(1βˆ’βˆ‘iβ‰ jvj​i)2nj\mu_{e}\sum_{i\neq j}^{M}\frac{v_{ji}^{2}}{n_{i}}+\mu_{e}\frac{(1-\sum_{i\neq j}v_{ji})^{2}}{n_{j}}
+(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)β‹…Οƒ2+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sigma^{2}

Taking the derivative with respect to vj​kv_{jk} gives:

ΞΌe​2​vj​knkβˆ’2​μe​1βˆ’βˆ‘iβ‰ jvj​inj+Οƒ2​(2β€‹βˆ‘iβ‰ jvj​i+2​vj​k)\mu_{e}\frac{2v_{jk}}{n_{k}}-2\mu_{e}\frac{1-\sum_{i\neq j}v_{ji}}{n_{j}}+\sigma^{2}\left(2\sum_{i\neq j}v_{ji}+2v_{jk}\right)

To confirm that we are finding a minimum rather than a maximum, we note that the second derivative is always positive:

ΞΌe​2nk+ΞΌe​2nj+Οƒ2​(2+2)>0\mu_{e}\frac{2}{n_{k}}+\mu_{e}\frac{2}{n_{j}}+\sigma^{2}\left(2+2\right)>0

We first simplify the derivative by substituting in for vj​jv_{jj}:

ΞΌe​2​vj​knkβˆ’2​μe​vj​jnj+2​σ2​(1βˆ’vj​j+vj​k)=0\mu_{e}\frac{2v_{jk}}{n_{k}}-2\mu_{e}\frac{v_{jj}}{n_{j}}+2\sigma^{2}\left(1-v_{jj}+v_{jk}\right)=0

And then solve for vj​kv_{jk} to obtain:

vj​k=vj​jβ‹…(Οƒ2+ΞΌenj)βˆ’Οƒ2Οƒ2+ΞΌenkv_{jk}=\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)-\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{k}}}

To find vj​jv_{jj}, we use that all of the weights sum up to 1:

v_{jj}+\sum_{i\neq j}\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)-\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}+\sum_{i\neq j}\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}-\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}\left(1+\sum_{i\neq j}\frac{\sigma^{2}+\frac{\mu_{e}}{n_{j}}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}\right)-\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}=\frac{1+\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}}{1+\sum_{i\neq j}\frac{\sigma^{2}+\frac{\mu_{e}}{n_{j}}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}}

Next, we define Vi=Οƒ2+ΞΌeniV_{i}=\sigma^{2}+\frac{\mu_{e}}{n_{i}}. This allows us to rewrite the term as:

vj​j=1+Οƒ2β€‹βˆ‘iβ‰ j1Vi1+Vjβ€‹βˆ‘iβ‰ j1Viv_{jj}=\frac{1+\sigma^{2}\sum_{i\neq j}\frac{1}{V_{i}}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}

Similarly, we can rewrite:

vj​k=1Vkβ‹…Vjβˆ’Οƒ21+Vjβ€‹βˆ‘iβ‰ j1Viv_{jk}=\frac{1}{V_{k}}\cdot\frac{V_{j}-\sigma^{2}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}

∎
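
As a final sanity check, the sketch below (with hypothetical parameter values) computes the fine-grained weights v_{jj} and v_{jk} from the expressions above, confirms that they sum to one, and verifies that small perturbations of the weights that preserve the sum do not decrease player j's error.

```python
# Numerical check of the optimal fine-grained weights derived above, using the error
#   err_j(v) = mu_e * sum_i v_ji^2 / n_i
#              + sigma2 * (sum_{i != j} v_ji^2 + (sum_{i != j} v_ji)^2).
# Parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]       # hypothetical sample sizes
j = 1                          # player whose weights we compute

V = [SIGMA2 + MU_E / n for n in sizes]                     # V_i = sigma^2 + mu_e / n_i
inv_sum = sum(1 / V[i] for i in range(len(sizes)) if i != j)
v_jj = (1 + SIGMA2 * inv_sum) / (1 + V[j] * inv_sum)
v = [v_jj if k == j else (1 / V[k]) * (V[j] - SIGMA2) / (1 + V[j] * inv_sum)
     for k in range(len(sizes))]

assert abs(sum(v) - 1) < 1e-12                             # the weights sum to one

def err(weights):
    # mu_e * sum_i v_ji^2 / n_i + sigma^2 * (sum_{i != j} v_ji^2 + (sum_{i != j} v_ji)^2)
    off = [w for k, w in enumerate(weights) if k != j]
    return (MU_E * sum(w * w / n for w, n in zip(weights, sizes))
            + SIGMA2 * (sum(w * w for w in off) + sum(off) ** 2))

base = err(v)
for k in range(len(sizes)):
    if k == j:
        continue
    for eps in (-1e-3, 1e-3):
        pert = list(v)
        pert[k] += eps         # perturb one off-diagonal weight
        pert[j] -= eps         # let v_jj absorb the change so the weights still sum to one
        assert err(pert) >= base - 1e-12
print(v, base)
```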