
Model-sharing Games:
Analyzing Federated Learning Under Voluntary Participation

Kate Donahue,1 Jon Kleinberg1,2
Abstract

Federated learning is a setting where agents, each with access to their own data source, combine models learned from local data to create a global model. If agents are drawing their data from different distributions, though, federated learning might produce a biased global model that is not optimal for each agent. This means that agents face a fundamental question: should they join the global model or stay with their local model? In this work, we show how this situation can be naturally analyzed through the framework of coalitional game theory.

Motivated by these considerations, we propose the following game: there are heterogeneous players with different model parameters governing their data distribution and different amounts of data they have noisily drawn from their own distribution. Each player’s goal is to obtain a model with minimal expected mean squared error (MSE) on their own distribution. They have a choice of fitting a model based solely on their own data, or combining their learned parameters with those of some subset of the other players. Combining models reduces the variance component of their error through access to more data, but increases the bias because of the heterogeneity of distributions.

In this work, we derive exact expected MSE values for problems in linear regression and mean estimation. We use these values to analyze the resulting game in the framework of hedonic game theory; we study how players might divide into coalitions, where each set of players within a coalition jointly construct model(s). We analyze three methods of federation, modeling differing degrees of customization. In uniform federation, the agents collectively produce a single model. In coarse-grained federation, each agent can weight the global model together with their local model. In fine-grained federation, each agent can flexibly combine models from each other agent in the federation. For each method, we constructively analyze the stable partitions of players into coalitions.

1 Introduction

Imagine a situation as follows: a hospital is trying to evaluate the effectiveness of a certain procedure based on data it has collected from procedures done on patients in its facilities. It seems likely that certain attributes of the patient influence the effectiveness of the procedure, so the hospital analysts opt to fit a linear regression model with parameters $\hat{\bm{\theta}}$. However, because of the limited amount of data the hospital has access to, this model has relatively high error. Luckily, other hospitals also have data from implementations of this same procedure. However, for reasons of privacy, data incompatibility, data size, or other operational considerations, the hospitals do not wish to share raw patient data. Instead, they opt to combine their models by taking a weighted average of the parameters learned by each hospital. If there are $M$ hospitals and hospital $i$ has $n_{i}$ samples, the combined model parameters would look like:

$\hat{\bm{\theta}}^{f}=\frac{1}{\sum_{i=1}^{M}n_{i}}\sum_{i=1}^{M}\hat{\bm{\theta}}_{i}\cdot n_{i}$
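As a concrete (purely illustrative) sketch, the snippet below combines locally fitted coefficient vectors by this sample-size-weighted average; the hospital names, coefficients, and sample counts are all hypothetical.

```python
import numpy as np

# Hypothetical local regression coefficients and sample counts for three
# hospitals (values invented for illustration only).
local_params = {
    "hospital_A": (np.array([0.8, 1.9]), 50),   # (theta_hat_i, n_i)
    "hospital_B": (np.array([1.1, 2.2]), 120),
    "hospital_C": (np.array([0.9, 2.0]), 30),
}

# Combined model: average of the local parameters, weighted by n_i.
N = sum(n for _, n in local_params.values())
theta_f = sum(theta * n for theta, n in local_params.values()) / N
print(theta_f)  # weighted-average coefficient vector
```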

The situation described above can be viewed as a stylized model of federated learning. Federated learning is a distributed learning process that is currently experiencing rapid innovation and widespread implementation (Li et al. 2020; Kairouz et al. 2019). It is used in cases where data is distributed across multiple agents and cannot be combined centrally for training. For example, federated learning is implemented in word prediction on cell phones, where transferring the raw text data would be infeasible given its large size (and sensitive content). The motivating factor for using federated learning is that access to more data reduces the variance in a learned model, reducing its error.

However, there could be a downside to using federated learning. In the hospital example, it seems quite reasonable that certain hospitals might have different true generating models for their data, based on differences in patient populations or variants of the procedure's implementation, for example. Two dissimilar hospitals that federate together will see a decrease in the component of their model's error due to variance, but an increase in the component due to bias. This raises some fundamental questions for each participating hospital, or, more generally, for each agent $i$ considering federated learning. Which other agents should $i$ federate with in order to minimize its error? Will those other agents be interested in federating with $i$? Does there exist some stable arrangement of agents into federating clusters, and if so, what does that arrangement look like?

Numerous works have explored the issue of heterogeneous data in federated learning; we discuss specifically how they relate to ours in a later section. Often, the goal in these lines of work is to achieve equality in the error rates guaranteed to each agent, potentially by actively collecting more data or by using transfer learning to ensure the model better fits local data. However, to our knowledge, there has not yet been work that systematically looks at the participation questions inherent in federated learning through the lens of game theory, especially the theory of hedonic games, which studies the formation of self-sustaining coalitions.

In a hedonic game, players are grouped together into clusters or coalitions: the overall collection of coalitions is called a coalition structure. Each player's utility depends solely on the identity of the other players in its coalition. A common question in hedonic games is the stability of a coalition structure. A coalition structure $\Pi$ is core-stable (or "in the core") if there does not exist a coalition $C$ so that every player in $C$ prefers $C$ to its coalition in $\Pi$. A coalition structure is strictly core-stable if there does not exist a coalition $C$ so that every player in $C$ weakly prefers $C$ to its coalition in $\Pi$, and at least one player in $C$ strictly prefers $C$ to its coalition in $\Pi$. A coalition structure is individually stable if there does not exist a coalition $C\in\Pi$ so that a player $i\notin C$ prefers $C\cup\{i\}$ to its arrangement in $\Pi$ and all players in $C$ weakly prefer $C\cup\{i\}$ to $C$ (Bogomolnaia and Jackson 2002).

To explain the analogy between federated learning and hedonic games, we first consider each agent in federated learning to be a player in a hedonic game. A player is in coalition with other players if it is federating with them. Its cost is its expected error in a given federating cluster, which depends only on the identity of the other players in that cluster. A player is assumed to be able to move between federating clusters only if doing so would benefit itself and not harm the players in the cluster it joins; notably, we allow players to freely leave a cluster, even if doing so would harm the players left behind.

The present work: Analyzing federated learning through hedonic game theory

In this work, we use the framework of hedonic games to analyze the stability of coalitions in data-sharing applications that capture key issues in federated learning. By working through a sequence of deliberately stylized models, we obtain some general insights about participation and stability in these kinds of applications.

For the first case, we analyze uniform federation. In this simplest case, a federating cluster produces a single model, which each player uses. For uniform federation, we first consider the case where all players have the same number of data points. We show that in this game, when the number of data points $n$ is fairly small, the only core-stable coalition structure is to have all players federating together, in the "grand coalition". When $n$ is large, the only core-stable coalition structure is to have all players separate (doing local learning). There is a single intermediate value of $n$ at which every coalition structure is core-stable. Next, we analyze the case where players come in one of two sizes ("small" or "large"). The analysis is more complicated, but we demonstrate constructively that there always exists an individually stable partition of players into clusters.

Besides uniform federation, we also analyze two other forms of federation. For coarse-grained federation, the federating cluster still produces a single model, but each player can weight the global model with their local model, allowing some personalization. For coarse-grained federation, when all players have the same number of samples, we show that the grand coalition (all players federating together) is always the only core stable arrangement. For the small/large case, we produce a simple, efficient algorithm to calculate a strictly core-stable arrangement. Additionally, we show that, for this federation method, the grand coalition is always individually stable (no player wishes to unilaterally defect).

Finally, for fine-grained federation, each player is allowed to take the local models of other players in the federation and combine them using whichever weights they choose to produce a model customized for their use. With fine-grained federation, we show that the grand coalition is always core stable.

We are only able to produce these hedonic game theory results because of our derivations of exact error values for the underlying inference problems. We calculate these values for all three methods of federation and for two settings: 1) a mean estimation problem and 2) a linear regression problem. The error values depend on the number of samples each agent has access to, with the expectation taken over the values of the samples each agent draws as well as over the possibly different true parameters each player is trying to model. Our results do not depend on the form of the generating distributions, relying only on two summary parameters.

The results in this paper are theoretical and do not depend on any experiments or code. However, while writing the paper, we found it useful to write and work with code to check conjectures. Some of that code is publicly available at https://github.com/kpdonahue/model_sharing_games.

Before moving to the main technical content, the next section walks through a motivating example, followed by a review of related literature and a description of our model and assumptions. Beyond technical assumptions, recent work (Cooper 2020) has highlighted the importance of describing the normative assumptions researchers make: we also include a summary of the most important normative assumptions of our analysis.

Related works

Incentives and federated learning:

Blum et al. (2017) describes an approach to handling heterogeneous data in which more samples are iteratively gathered from each agent in a way that incentivizes all agents to participate in the grand coalition during federated learning. Duan et al. (2021) builds a framework to schedule data augmentation and resampling. Yu, Bagdasaryan, and Shmatikov (2020) demonstrates empirically that there can be cases where individuals get lower error with local training than with federated learning, and evaluates empirical solutions. Wang et al. (2020) analyzes the question of when it makes sense to split or not to split datasets drawn from different distributions. Finally, Blum et al. (2020) analyzes notions of envy and efficiency with respect to sampling allocations in federated learning.

Transfer learning:

Mansour et al. (2020) and Deng, Kamani, and Mahdavi (2020) both propose theoretical methods for using transfer learning to minimize the error provided to agents with heterogeneous data. Li et al. (2019) and Martinez, Bertran, and Sapiro (2020) both provide methods to produce a more uniform level of error rates across agents participating in federated learning.

Clustering and federated learning:

Sattler, Muller, and Samek (2020) and Shlezinger, Rini, and Eldar (2020) provide algorithms to "cluster" together players with similar data distributions, with the aim of providing them with lower error. They differ from our approach in that they consider the case where there is some knowledge of each player's data distribution, whereas we only assume knowledge of the number of data points. Additionally, their approach doesn't explicitly consider agents to be game-theoretic actors in the way that ours does. Interestingly, Guazzone, Anglano, and Sereno (2014) uses a game-theoretic framework to analyze federated learning, but with the aim of minimizing energy usage, not error rate.

2 Motivating example

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       2                2                2
$\{a,b\},\{c\}$           1.5              1.5              2
$\{a,b,c\}$               1.3              1.3              1.3
Table 1: The expected errors using uniform federation for players in each coalition when all three players have 5 samples each, with parameters $\mu_{e}=10,\sigma^{2}=1$. Each row denotes a different coalition partition: for example, $\{a,b\},\{c\}$ indicates that players $a$ and $b$ are federating together while $c$ is alone.

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       2                2                0.4
$\{a,b\},\{c\}$           1.5              1.5              0.4
$\{a\},\{b,c\}$           2                1.72             0.39
$\{a,b,c\}$               1.55             1.55             0.41
Table 2: The expected errors using uniform federation for players in each coalition when players $a$ and $b$ have 5 samples each and player $c$ has 25 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

Coalition structure       $err_a(\cdot)$   $err_b(\cdot)$   $err_c(\cdot)$
$\{a\},\{b\},\{c\}$       0.4              0.4              0.4
$\{a,b\},\{c\}$           0.7              0.7              0.4
$\{a,b,c\}$               0.8              0.8              0.8
Table 3: The expected errors using uniform federation for players in each coalition when players $a,b,c$ each have 25 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

To motivate our problem and clarify the types of analyses we will be exploring, we first work through a simple mean estimation example. (The Github repository contains numerical calculations and full tables for this section.) Calculating the error each player can expect requires two parameters: $\mu_{e}$, which reflects the average error each player experiences when sampling data from its own personal distribution, and $\sigma^{2}$, which reflects the variance in the true parameters across players. In this section we will use $\mu_{e}=10,\sigma^{2}=1$, but will discuss later how to handle the case where they are imperfectly known.

First, we will analyze uniform federation with three players, $a$, $b$, and $c$. We will first assume that each player has 5 samples from its local data distribution: Table 1 gives the error each player can expect in this situation. Arrangements equivalent up to renaming of players are omitted. Every player sees its error minimized in the "grand coalition" $\pi_{g}$ where all three players are federating together. This implies that the only arrangement that is stable (core-stable or individually stable) is $\pi_{g}$.

Next, we assume that player $c$ increases the number of samples it has from 5 to 25: Table 2 shows the error each player can expect in this situation. Here, the players have different preferences over which arrangement they would most prefer. The "small" players $a$ and $b$ would most prefer $\{a,b\},\{c\}$, whereas the "large" player $c$ would most prefer $\{a\},\{b,c\}$ or (identically) $\{b\},\{a,c\}$. However, out of all of these coalition structures, only $\{a,b\},\{c\}$ is stable (either core-stable or individually stable). Note that $\{a\},\{b,c\}$ is not stable because the coalition $C=\{a,b\}$ is one where each player prefers $C$ to its current situation.

Third, we assume that all three players have 25 samples: this example is shown in Table 3. As in Table 1, the players have identical preferences. However, in this case, the players minimize their error by staying alone. The stability results from this example are part of a broader pattern we will analyze in later sections.
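The numbers in Tables 1-3 can be reproduced from the closed-form mean-estimation error for uniform federation derived later (Corollary 4.3). The sketch below is a minimal illustration of that calculation for the Table 2 setup; it is provided only to make the example concrete.

```python
mu_e, sigma2 = 10.0, 1.0

def uniform_err(n_j, coalition):
    """Expected MSE of a player with n_j samples in a federating coalition,
    using the mean-estimation formula for uniform federation (Corollary 4.3):
    mu_e / N + (sum_{i != j} n_i^2 + (N - n_j)^2) / N^2 * sigma^2."""
    N = sum(coalition)
    others = sum(n * n for n in coalition) - n_j * n_j
    return mu_e / N + (others + (N - n_j) ** 2) / N ** 2 * sigma2

# Table 2 setup: players a and b have 5 samples each, player c has 25.
print(uniform_err(5, [5]))            # a alone: 2.0
print(uniform_err(5, [5, 5]))         # a in {a,b}: 1.5
print(uniform_err(25, [5, 25]))       # c in {b,c}: ~0.39
print(uniform_err(5, [5, 5, 25]))     # a in the grand coalition: ~1.55
print(uniform_err(25, [5, 5, 25]))    # c in the grand coalition: ~0.41
```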

Next, we will explore the two other methods of federation: coarse-grained and fine-grained. Both offer some degree of personalization, with varying levels of flexibility.

Table 4 shows an example using coarse-grained federation with four players: three have 30 samples each, and the fourth player has 300 samples. We assume the weights $w_{j}$ are set optimally for each player. For conciseness, some columns and rows are omitted. Note that both player types get lower error in $\pi_{g}$ than they would with local learning: that is, $\pi_{g}$ is individually stable (stable against defections of any player alone). However, it is also clear that $\pi_{g}$ is not core-stable: in particular, the three small players would get lower error in $\{a,b,c\}$ than in $\pi_{g}$. These results will be examined theoretically in later sections: with optimal weighting, coarse-grained federation always has an individually stable $\pi_{g}$ that is not necessarily core-stable.

Coalition structure           $err_a(\cdot)$   $err_d(\cdot)$
$\{a\},\{b\},\{c\},\{d\}$     0.333            0.0333
$\{a,b,c\},\{d\}$             0.278            0.0333
$\{a,b,c,d\}$                 0.280            0.0326
Table 4: The expected errors using optimal coarse-grained federation when players $a,b,c$ each have 30 samples, while player $d$ has 300 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

Finally, we examine fine-grained federation. Table 5 analyzes the same case as the coarse-grained example above, but with optimally weighted fine-grained federation. The full error table shows that $\pi_{g}$ is core-stable because each player minimizes its error in this arrangement. This result holds in general: when optimal fine-grained federation is used, $\pi_{g}$ always minimizes error for every player and is thus core-stable.

Coalition structure           $err_a(\cdot)$   $err_d(\cdot)$
$\{a\},\{b\},\{c\},\{d\}$     0.333            0.0333
$\{a,b,c\},\{d\}$             0.278            0.0333
$\{a,b,c,d\}$                 0.269            0.0325
Table 5: The expected errors using optimal fine-grained federation when players $a,b,c$ each have 30 samples, while player $d$ has 300 samples, with parameters $\mu_{e}=10,\sigma^{2}=1$.

In later sections we will give theoretical results that explain this example more fully, but understanding the core-stable partitions here will help to build intuition for more general results.

3 Model and assumptions

Model and technical assumptions

This section introduces our model. We assume that there is a fixed set of $[M]$ players, and player $j$ has a fixed number of samples, $n_{j}$. Though the number of samples is fixed, it is possible to analyze a varying number of samples by investigating all games involving the relevant numbers of samples. Each player draws their true parameters i.i.d. (independently and identically distributed): $(\theta_{j},\epsilon^{2}_{j})\sim\Theta$. Here $\epsilon^{2}_{j}$ represents the amount of noise in the sampling process for a given player.

In the case of mean estimation, $\theta_{j}$ is a scalar representing the true mean of player $j$. Player $j$ draws samples i.i.d. from its true distribution: $Y\sim\mathcal{D}_{j}(\theta_{j},\epsilon^{2}_{j})$. Samples are drawn with variance $\epsilon^{2}_{j}$ around the true mean of the distribution.

In the case of linear regression, $\bm{\theta}_{j}$ is a $D$-dimensional vector representing the coefficients of the true regression function, which is assumed to be linear. Each player draws $n_{j}$ input datapoints from their own input distribution $\bm{X}_{j}\sim\mathcal{X}_{j}$ such that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{T}\bm{x}]=\Sigma_{j}$. They then noisily observe the outputs, drawing values i.i.d. $\bm{Y}_{j}\sim\mathcal{D}_{j}(\bm{X}_{j}^{T}\bm{\theta}_{j},\epsilon^{2}_{j})$, where $\epsilon^{2}_{j}$ again denotes the variance with which samples are drawn around the true mean.

There are three methods of federation. In uniform federation, a single model is produced for all members of the federating coalition:

$\hat{\bm{\theta}}^{f}=\frac{1}{\sum_{i=1}^{M}n_{i}}\sum_{i=1}^{M}\hat{\bm{\theta}}_{i}\cdot n_{i}$

In coarse-grained federation, each player has a parameter $w_{j}$ that it uses to weight the global model together with its own local model, producing an averaged model:

$\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\theta}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\theta}_{i}\cdot n_{i}$

for $w_{j}\in[0,1]$. Note that $w_{j}=0$ corresponds to unweighted federated learning and $w_{j}=1$ corresponds to pure local learning. Finally, with fine-grained federation, each player $j$ has a vector of weights $\bm{v}_{j}$ that they use to weight every other player's contribution to their estimate:

$\hat{\theta}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\theta}_{i}$

for $\sum_{i=1}^{M}v_{ji}=1$. Note that we can recover the $w$-weighted (coarse-grained) case by setting $v_{jj}=w+\frac{(1-w)\cdot n_{j}}{N}$ and $v_{ji}=(1-w)\cdot\frac{n_{i}}{N}$ for $i\neq j$. Coarse-grained and fine-grained federation each have player-specific parameters ($w$, $\bm{v}$) that can be tuned. When those parameters are set optimally for the given player, we refer to the models as "optimal" coarse-grained or fine-grained federation. We show in later sections how to calculate the optimal weights.
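As an illustrative sketch (not the implementation used in the paper), the three combination rules could be written as follows, given each player's local estimate and sample count; the numeric values at the end are hypothetical.

```python
def uniform_model(thetas, ns):
    """Single global model: local estimates averaged with weights n_i / N."""
    N = sum(ns)
    return sum(t * n for t, n in zip(thetas, ns)) / N

def coarse_grained_model(j, w_j, thetas, ns):
    """Player j's model: weight w_j on its own local estimate and 1 - w_j on
    the global (uniform) model."""
    return w_j * thetas[j] + (1 - w_j) * uniform_model(thetas, ns)

def fine_grained_model(v_j, thetas):
    """Player j's model: a convex combination of all local estimates, with
    player-specific weights v_j summing to 1."""
    return sum(v * t for v, t in zip(v_j, thetas))

# Hypothetical local mean estimates and sample counts for three players.
thetas, ns = [9.8, 10.3, 10.1], [5, 5, 25]
print(uniform_model(thetas, ns))
print(coarse_grained_model(0, 0.4, thetas, ns))
print(fine_grained_model([0.5, 0.2, 0.3], thetas))
```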

We denote $\mu_{e}=\mathbb{E}_{(\theta_{i},\epsilon^{2}_{i})\sim\Theta}[\epsilon^{2}_{i}]$: the expectation of the error parameter. In the mean estimation case, $\sigma^{2}=\mathrm{Var}(\theta_{i})$ represents the variance of the true means across players. In the linear regression case, $\sigma^{2}_{d}=\mathrm{Var}(\bm{\theta}^{d}_{i})$ for $d\in[D]$.

We assume that each player knows how many samples it has access to. It may or may not have access to the data itself, but it does not know how its own values (or parameters) differ from those of the other players. For example, it does not know whether its data is unusually noisy or whether its true mean lies far from the true means of the other players.

All of the stability analysis results depend on the parameters $\mu_{e}$ and $\sigma^{2}$. However, the reliance is fairly weak: often a player only needs to know whether the number of samples it has, $n_{j}$, is larger or smaller than the ratio $\frac{\mu_{e}}{\sigma^{2}}$.

Much of this paper analyzes the stability of coalition structures. Analyzing stability is most directly relevant when players can actually move between coalitions. However, even if players are not able to move, the stability of a coalition structure tells us something about its optimality for each set of players.

Normative assumptions

This paper is primarily descriptive: it aims to model a phenomenon in the world, not to say whether that phenomenon is good or bad. For example, it could be that society as a whole values situations where many players federate together, and might wish to require players to do so, regardless of whether this minimizes their error. It might also be the case that society prefers that all players, regardless of how many samples they have access to, obtain roughly similar error rates. Our use of the expected mean squared error is also worth reflecting on: it assumes that over- and under-estimates are equally costly and that larger mis-estimates are more costly. More subtly, we take the expected MSE over the parameter draws $\mathbb{E}_{(\theta_{i},\epsilon^{2}_{i})\sim\Theta}$. A player whose true mean happens to fall far from the population mean might experience a much higher error than its expected MSE.

In the entirety of this paper, we take as fixed the requirement that data not be shared, for reasons of privacy or technical capability, and so we implicitly value that requirement more than the desire for lower error. We also assume that the problem at hand is completely encompassed by the machine learning task, which may overlook the possibility that non-machine-learning solutions are better suited. It may also be that technical requirements other than error rate are more important: for example, the desire to balance the amount of computation done by each agent.

4 Expected error results

This paper’s first contribution is to derive exact expected values for the MSE of players under different situations. The fact that these values are exact allows us to precisely reason about each player’s incentives in later sections. We will state the theorems here and provide the proofs in Appendix B.

The approach in this section is first to derive expected MSE values for the most general case and then to derive values for the other cases as corollaries. The most general case is linear regression with fine-grained federation. First, we note that we can derive coarse-grained or uniform federation by setting the $v_{ji}$ weights to the appropriate values. Next, we note that mean estimation is a special case of linear regression. For intuition, consider a model where a player draws an $x$ value that is deterministically 1, multiplies it by an unknown single parameter $\theta_{j}$, and then takes a measurement $y$ of this mean with noise $\epsilon^{2}_{j}$. This corresponds exactly to the mean estimation case, where a player has a true mean $\theta_{j}$ and observes $y$ as a sample with noise $\epsilon^{2}_{j}$. We can use this representation to simplify the error terms, with more details given in Appendix B.

First, we give the expected MSE for local estimation:

Theorem 4.1.

For linear regression, the expected MSE of local estimation for a player with $n_{j}$ samples is

$\mu_{e}\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]$

If the distribution of input values $\mathcal{X}_{j}$ is a $D$-dimensional multivariate normal distribution with mean 0, then the expected MSE of local estimation can be simplified to:

$\frac{\mu_{e}\cdot D}{n_{j}-D-1}$

In the case of mean estimation, the error term can be simplified to:

$\frac{\mu_{e}}{n_{j}}$

Next, we calculate the expected MSE for fine-grained federation:

Theorem 4.2.

For linear regression with fine-grained federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$

If the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, this can be simplified to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sigma^{2}$

Finally, we derive as corollaries the expected MSE for the uniform federation and the coarse-grained case.

Corollary 4.3.

For linear regression with uniform federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\,\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$

or, if the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, can be simplified to

$\mu_{e}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\frac{D}{n_{i}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\frac{\mu_{e}}{N}+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sigma^{2}$

where $N=\sum_{i=1}^{M}n_{i}$.
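As a quick consistency check (illustrative values only), plugging the uniform weights $v_{ji}=n_{i}/N$ into the mean-estimation form of Theorem 4.2 recovers the mean-estimation formula of Corollary 4.3:

```python
mu_e, sigma2 = 10.0, 1.0
ns = [5, 5, 25]          # illustrative sample counts
j = 0                    # evaluate the error of player 0
N = sum(ns)

# Fine-grained formula (Theorem 4.2) with uniform weights v_ji = n_i / N.
v = [n / N for n in ns]
var_term = mu_e * sum(v[i] ** 2 / ns[i] for i in range(len(ns)))
off = [v[i] for i in range(len(ns)) if i != j]
bias_term = (sum(x ** 2 for x in off) + sum(off) ** 2) * sigma2
fine = var_term + bias_term

# Uniform formula (Corollary 4.3).
uniform = mu_e / N + (sum(ns[i] ** 2 for i in range(len(ns)) if i != j)
                      + (N - ns[j]) ** 2) / N ** 2 * sigma2

print(fine, uniform)     # the two values agree
```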

Corollary 4.4.

For linear regression with coarse-grained federation, the expected MSE of federated estimation for a player with $n_{j}$ samples is:

$L_{j}+(1-w)^{2}\cdot\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$

where $L_{j}$ is equal to:

$\mu_{e}\cdot(1-w)^{2}\cdot\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\,\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]+\mu_{e}\left(w^{2}+\frac{2(1-w)w\cdot n_{j}}{N}\right)\cdot\mathrm{tr}\left[\Sigma_{j}\,\mathbb{E}_{Y\sim\mathcal{D}(\theta_{j},\epsilon^{2}_{j})}\left[(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]\right]$

or, if the distribution of input values $\mathcal{X}_{i}$ is a $D$-dimensional multivariate normal distribution with mean 0, can be simplified to

$\mu_{e}\cdot(1-w)^{2}\cdot\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\frac{D}{n_{i}-D-1}+\mu_{e}\cdot\left(w^{2}+\frac{2(1-w)\cdot w\cdot n_{j}}{N}\right)\cdot\frac{D}{n_{j}-D-1}$

In the case of mean estimation, the entire error term can be simplified to:

$\mu_{e}\left(\frac{w^{2}}{n_{j}}+\frac{1-w^{2}}{N}\right)+\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot(1-w)^{2}\sigma^{2}$

where $N=\sum_{i=1}^{M}n_{i}$.
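The sketch below (illustrative sample counts, taken from the Table 4 setup) evaluates the coarse-grained mean-estimation error above over a grid of $w$ values for one player; consistent with Lemma 6.1 later, the error-minimizing weight is strictly less than 1.

```python
import numpy as np

mu_e, sigma2 = 10.0, 1.0
ns = [30, 30, 30, 300]   # illustrative sample counts (Table 4 setup)
j = 0
N = sum(ns)
others = sum(n ** 2 for i, n in enumerate(ns) if i != j) + (N - ns[j]) ** 2

def coarse_err(w):
    """Mean-estimation error of player j under coarse-grained federation
    with weight w (Corollary 4.4)."""
    return (mu_e * (w ** 2 / ns[j] + (1 - w ** 2) / N)
            + others / N ** 2 * (1 - w) ** 2 * sigma2)

ws = np.linspace(0.0, 1.0, 10001)
errs = coarse_err(ws)
print(ws[errs.argmin()], errs.min())  # best weight is strictly less than 1
```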

The exact MSE for linear regression follows a very similar form to that for mean estimation. In all cases, the bias component (the term involving $\sigma^{2}_{d}$) has exactly the same form and could be directly modified to mean estimation by using $(\sigma^{2})^{\prime}=\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}$. The variance component (the term involving $\mu_{e}$) fits the exact form of mean estimation in the limit where $n_{j}\gg D$. In this case, the error can be modified to fit mean estimation by using $\mu_{e}^{\prime}=D\cdot\mu_{e}$. This approximation is good when there are many more samples than the dimension of the linear regression problem under investigation: for most cases of model fitting, this assumption is reasonable.

For the rest of the paper, we will use the $n_{j}\gg D$ assumption; consequently, all of our results apply equally to linear regression and mean estimation.

5 Uniform federation: coalition formation

In this section, we analyze the stability of coalition structures in the case that uniform federation is used. We consider two cases: 1) where all players have the same number of datapoints $n$, and 2) where all players have either a "small" or a "large" number of points. We will use $\pi_{l}$ to refer to the coalition partition where all players are alone and $\pi_{g}$ to refer to the grand coalition. Proofs from this section are given in Appendix C.

All players have the same number of samples

In this case, the analysis simplifies greatly:

Lemma 5.1.

If all players have the same number of samples $n$, then:

  •  If $n<\frac{\mu_{e}}{\sigma^{2}}$, players minimize their error in $\pi_{g}$.

  •  If $n>\frac{\mu_{e}}{\sigma^{2}}$, players minimize their error in $\pi_{l}$.

  •  If $n=\frac{\mu_{e}}{\sigma^{2}}$, players are indifferent between any arrangement of players.

Proof.

In the case that all players have the same number of samples, we can use $n_{i}=n$ to simplify the error term:

$\frac{\mu_{e}}{M\cdot n}+\sigma^{2}\frac{M-1}{M}$

In order to see whether players would prefer a larger group (higher $M$) or a smaller group (smaller $M$), we take the derivative of the error with respect to $M$:

$-\frac{\mu_{e}}{M^{2}\cdot n}+\frac{\sigma^{2}}{M^{2}}=\frac{\sigma^{2}\cdot n-\mu_{e}}{n\cdot M^{2}}$

This is positive when $n>\frac{\mu_{e}}{\sigma^{2}}$: a player gets higher error the more players it is federating with. It is negative when $n<\frac{\mu_{e}}{\sigma^{2}}$: a player gets lower error the more players it is federating with. It is 0 when $n=\frac{\mu_{e}}{\sigma^{2}}$, which implies players are indifferent between different arrangements. Plugging $n=\frac{\mu_{e}}{\sigma^{2}}$ into the error equation gives $\frac{\mu_{e}\cdot\sigma^{2}}{M\cdot\mu_{e}}+\sigma^{2}\frac{M-1}{M}=\sigma^{2}$, which is exactly the error a player would get alone: $\frac{\mu_{e}}{n}=\frac{\mu_{e}\cdot\sigma^{2}}{\mu_{e}}=\sigma^{2}$. ∎
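A quick numeric illustration of this threshold behavior, using the same-size error formula from the proof and the running parameters $\mu_{e}=10,\sigma^{2}=1$ (so the threshold is $\mu_{e}/\sigma^{2}=10$):

```python
mu_e, sigma2 = 10.0, 1.0

def same_size_err(n, M):
    """Uniform-federation error when all M players have n samples each."""
    return mu_e / (M * n) + sigma2 * (M - 1) / M

# n below the threshold: error shrinks as the coalition grows.
print([round(same_size_err(5, M), 3) for M in range(1, 5)])
# n above the threshold: error grows as the coalition grows.
print([round(same_size_err(25, M), 3) for M in range(1, 5)])
```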

As a corollary, we can classify the core stable arrangements cleanly:

Corollary 5.2.

For uniform federation, if all players have the same number of samples $n$, then:

  •  If $n<\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{g}$ is the only partition that is core-stable.

  •  If $n>\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{l}$ is the only partition that is core-stable.

  •  If $n=\frac{\mu_{e}}{\sigma^{2}}$, any arrangement of players is core-stable.

Small & large player case

In this section, we add another layer of depth by allowing players to come in one of two "sizes". "Small" players have $n_{s}$ samples and "large" ones have $n_{\ell}$ samples, with $n_{s}<n_{\ell}$. We demonstrate that games of this form always have a stable partition by constructively producing a partition that is stable. Note that this is not true of hedonic games in general: as discussed in Bogomolnaia and Jackson (2002), there are multiple instances where a game might have no stable partition.

To characterize this space, we divide it into cases depending on the relative sizes of $n_{s},n_{\ell}$. We will use the notation $\pi(s,\ell)$ to denote a coalition with $s$ small players and $\ell$ large players, out of a total of $S$ and $L$ present. We write $\pi(s_{1},\ell_{1})\succ_{S}\pi(s_{2},\ell_{2})$ to mean that the small players prefer coalition $\pi(s_{1},\ell_{1})$ to $\pi(s_{2},\ell_{2})$, and $\pi(s_{1},\ell_{1})\succ_{L}\pi(s_{2},\ell_{2})$ to mean the same preference, but for large players.

Case 1: $n_{s},n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$

The first case is when $n_{s}$ is large: it turns out that each player minimizes their error by using local learning, which means that $\pi_{l}$ is in the core. The lemma below is more general than the small/large case, but implies that when $n_{s}>\frac{\mu_{e}}{\sigma^{2}}$, $\pi_{l}$ is the only element in the core, and that when $n_{s}=\frac{\mu_{e}}{\sigma^{2}}$, any arrangement where the large players are alone is in the core.

Lemma 5.3.

For uniform federation, if $n_{i}>\frac{\mu_{e}}{\sigma^{2}}$ for all $i\in[M]$, then $\pi_{l}$ is the unique element in the core.
If $n_{i}\geq\frac{\mu_{e}}{\sigma^{2}}$ for all $i\in[M]$, with $n_{k}>\frac{\mu_{e}}{\sigma^{2}}$ for at least one player $k$, then any arrangement where the players with samples $n_{k}>\frac{\mu_{e}}{\sigma^{2}}$ are alone is in the core.

Case 2: $n_{s},n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$

Next, we consider the case where both the small and large players have a relatively small number of samples. In this situation, it turns out that the grand coalition is core stable.

Theorem 5.4.

For uniform federation, if $n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$ and $n_{s}<n_{\ell}$, then the grand coalition $\pi_{g}$ is core-stable.

Case 3: $n_{s}<\frac{\mu_{e}}{\sigma^{2}}$, $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$

Finally, we consider the case where the small players have a number of samples below the $\frac{\mu_{e}}{\sigma^{2}}$ boundary, while the large players have a number of samples above this threshold.

Theorem 5.5.

Assume uniform federation with $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$. Then, there exists an arrangement of small and large players that is individually stable, and a computationally efficient algorithm to calculate it.

The proof of Theorem 5.5 is constructive: it gives an exact arrangement that is individually stable. One natural question is whether this arrangement is also core-stable. The answer is "no": we show that this arrangement can fail to be core-stable. This avenue is explored further in Appendix C.

6 Coarse-grained federation

In this section, we analyze coarse-grained federation. As a reminder, in this situation, each player has a parameter $w_{j}$ that it uses to weight the global model with its own local model:

$\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\theta}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\theta}_{i}\cdot n_{i}$

for $w_{j}\in[0,1]$. All proofs from this section are given in Appendix D.

Note that the $w_{j}$ value is a parameter that each player can set independently. The lemma below analyzes the optimal value of $w_{j}$ and tells us that each player would prefer federation, in some form, to being alone.

Lemma 6.1.

For coarse-grained federation, the minimum error is always achieved when $w_{j}<1$, implying that federation is always preferable to local learning.

Corollary 6.2.

For coarse-grained federation, when $w_{j}$ is set optimally, the grand coalition $\pi_{g}$ is always individually stable.

Specifically, this means that no player wishes to unilaterally deviate from $\pi_{g}$. However, this does not mean that each player prefers the grand coalition $\pi_{g}$ to every other federating coalition: Section 2 gives an example where the grand coalition $\pi_{g}$ is not core-stable.

In the rest of this section, we will analyze the stability of coalition structures in the case that the $w$ parameters are set optimally (optimal coarse-grained federation). First, we will find it useful to derive the closed-form value of the expected MSE of a player using optimal coarse-grained federation:

Lemma 6.3.

A player with $n_{j}$ samples using the optimal coarse-grained federation weight has expected MSE:

$\frac{\mu_{e}\cdot(N-n_{j})+\left(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}\right)\cdot\sigma^{2}}{(N-n_{j})\cdot N+n_{j}\cdot\left(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}\right)\cdot\frac{\sigma^{2}}{\mu_{e}}}$

where $N=\sum_{i=1}^{M}n_{i}$.
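As a sanity check, the sketch below evaluates this closed form for the Table 4 setup (three players with 30 samples each, one with 300) and recovers the optimal coarse-grained entries shown there; it is an illustration only.

```python
mu_e, sigma2 = 10.0, 1.0

def optimal_coarse_err(n_j, coalition):
    """Expected MSE under optimally weighted coarse-grained federation
    (Lemma 6.3) for a player with n_j samples in a non-trivial coalition."""
    N = sum(coalition)
    others = sum(n ** 2 for n in coalition) - n_j ** 2 + (N - n_j) ** 2
    num = mu_e * (N - n_j) + others * sigma2
    den = (N - n_j) * N + n_j * others * sigma2 / mu_e
    return num / den

# Table 4 setup: players a, b, c have 30 samples each, player d has 300.
print(optimal_coarse_err(30, [30, 30, 30]))        # a in {a,b,c}: ~0.278
print(optimal_coarse_err(30, [30, 30, 30, 300]))   # a in the grand coalition: ~0.280
print(optimal_coarse_err(300, [30, 30, 30, 300]))  # d in the grand coalition: ~0.0326
```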

All players have the same number of samples

Lemma 6.4 is the analog to Lemma 5.1 in the previous section. Here, the results differ: with optimal coarse-grained federation, the grand coalition $\pi_{g}$ is always the only stable arrangement, no matter how small or large $n$ is relative to $\frac{\mu_{e}}{\sigma^{2}}$.

Lemma 6.4.

For mean estimation with coarse-grained federation, if $n_{j}=n$ for all players $j$, then $\pi_{g}$ is the only element in the core.

Proof.

Using the error term derived in Lemma 6.3, plugging in $n_{i}=n$ and simplifying gives:

$\frac{\frac{\mu_{e}^{2}}{n\cdot M}+\mu_{e}\cdot\sigma^{2}}{\mu_{e}+n\cdot\sigma^{2}}$

As $M$ increases, the error always decreases (only the numerator depends on $M$), so $\pi_{g}$ is where each player minimizes their error and is thus core-stable. ∎

Small & large player case

In this subsection, we similarly extend results to the "small" and "large" case introduced in the previous section. The analysis turns out to be much simpler than in the uniform federation case, and also produces stronger results: strict core stability, rather than individual stability.

Theorem 6.5.

If optimal coarse-grained federation is used, then:

  •  If $\pi_{g}\preceq_{S}\pi(S,0)$ (small players weakly prefer $\pi(S,0)$), then $\{\pi(S,0),\pi(0,L)\}$ is strictly core-stable.

  •  If $\pi_{g}\succ_{S}\pi(S,0)$ (small players strictly prefer $\pi_{g}$), then $\pi_{g}$ is strictly core-stable.

7 Fine-grained federation

In this section, we analyze fine-grained federation. As a reminder, with this method, each player $j$ has a vector of weights $\bm{v}_{j}$ that they use to weight every other player's contribution to their estimate:

$\hat{\theta}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\theta}_{i}$

for $\sum_{i=1}^{M}v_{ji}=1$.

We now calculate the optimal $\bm{v}$ weights, those minimizing player $j$'s error.

Lemma 7.1.

Define $V_{i}=\sigma^{2}+\frac{\mu_{e}}{n_{i}}$. Then, the values of $\{v_{ji}\}$ that minimize player $j$'s error are:

$v_{jj}=\frac{1+\sigma^{2}\sum_{i\neq j}\frac{1}{V_{i}}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}$
$v_{jk}=\frac{1}{V_{k}}\cdot\frac{V_{j}-\sigma^{2}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}\quad k\neq j$

The proof of this lemma is given in Appendix E.
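As an illustration, the sketch below computes these optimal weights for the Table 5 setup and plugs them into the mean-estimation error of Theorem 4.2, recovering the grand-coalition entry of roughly 0.269 for a small player; the sample counts are those of the earlier example.

```python
mu_e, sigma2 = 10.0, 1.0
ns = [30, 30, 30, 300]   # Table 5 setup; player j = 0 has 30 samples
j = 0

# Optimal fine-grained weights from Lemma 7.1, with V_i = sigma^2 + mu_e / n_i.
V = [sigma2 + mu_e / n for n in ns]
S = sum(1 / V[i] for i in range(len(ns)) if i != j)
v = [(1 / V[k]) * (V[j] - sigma2) / (1 + V[j] * S) for k in range(len(ns))]
v[j] = (1 + sigma2 * S) / (1 + V[j] * S)

# Expected MSE from Theorem 4.2 (mean-estimation form).
var_term = mu_e * sum(v[i] ** 2 / ns[i] for i in range(len(ns)))
off = [v[i] for i in range(len(ns)) if i != j]
err = var_term + (sum(x ** 2 for x in off) + sum(off) ** 2) * sigma2
print(round(err, 3))     # ~0.269, matching the grand-coalition entry in Table 5
```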

From this analysis, a few properties become clear. To start with, $v_{jj}$ and $v_{jk}$ are always strictly between 0 and 1. This implies the following corollary:

Corollary 7.2.

With optimal fine-grained federation, $\pi_{g}$ is optimal for each player.

Proof.

Suppose by contradiction that some other coalition $\pi^{\prime}$ gave player $j$ a lower error. WLOG, assume this coalition omits player $k$. In this case, the $v$ weights for $\pi^{\prime}$ can be represented as a length-$M$ vector with 0 in the $k$th entry. However, this set of weights is achievable in $\pi_{g}$: it is always an option to set a player's coefficient $v_{jk}$ equal to 0. This contradicts the optimality of the weighting $\bm{v}_{j}$, so it cannot be the case that any player gets lower error in a different coalition. ∎

Similarly, the fact that $\pi_{g}$ is optimal for every player implies that it is in the core, and that it is the only element in the core.

8 Conclusions and future directions

In this work, we have drawn a connection between a simple model of federated learning and the game theoretic tool of hedonic games. We used this tool to examine stable partitions of the space for two variants of the game. In service of this analysis, we computed exact error values for mean estimation and linear regression, as well as for three different variations of federation.

We believe that this framework is a simple and useful tool for analyzing the incentives of multiple self-interested agents in a learning environment. There are many fascinating extensions. For example, completely characterizing the core (including whether it is always non-empty) in the case of an arbitrary number of samples $\{n_{i}\}$ is an obvious area of investigation. Besides this, it could be interesting to compute exact or approximate error values for cases beyond mean estimation and linear regression.

Acknowledgments

This work was supported in part by a Simons Investigator Award, a Vannevar Bush Faculty Fellowship, a MURI grant, AFOSR grant FA9550-19-1-0183, grants from the ARO and the MacArthur Foundation, and NSF grant DGE-1650441. We are grateful to A. F. Cooper, Thodoris Lykouris, Hakim Weatherspoon, and the AI in Policy and Practice working group at Cornell for invaluable discussions. In particular, we thank A. F. Cooper for discussions around normative assumptions. Finally, we are grateful to Katy Blumer for discussions around code in the Github repository.

References

  • Abu-Mostafa, Lin, and Magdon-Ismail (2012) Abu-Mostafa, Y.; Lin, H.; and Magdon-Ismail, M. 2012. Learning from data: a short course. AMLbook.
  • Anderson (1962) Anderson, T. W. 1962. An introduction to multivariate statistical analysis. Technical report, Wiley New York.
  • Blum et al. (2017) Blum, A.; Haghtalab, N.; Procaccia, A. D.; and Qiao, M. 2017. Collaborative PAC Learning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 2392–2401. Curran Associates, Inc. URL http://papers.nips.cc/paper/6833-collaborative-pac-learning.pdf.
  • Blum et al. (2020) Blum, A.; Haghtalab, N.; Shao, H.; and Phillips, R. L. 2020. Unpublished work, private correspondence.
  • Bogomolnaia and Jackson (2002) Bogomolnaia, A.; and Jackson, M. O. 2002. The Stability of Hedonic Coalition Structures. Games and Economic Behavior 38(2): 201–230. ISSN 0899-8256. doi:https://doi.org/10.1006/game.2001.0877. URL http://www.sciencedirect.com/science/article/pii/S0899825601908772.
  • Casella (1992) Casella, G. 1992. Illustrating empirical Bayes methods. Chemometrics and intelligent laboratory systems 16(2): 107–125.
  • Cooper (2020) Cooper, A. F. 2020. Where Is the Normative Proof? Assumptions and Contradictions in ML Fairness Research.
  • Deng, Kamani, and Mahdavi (2020) Deng, Y.; Kamani, M. M.; and Mahdavi, M. 2020. Adaptive Personalized Federated Learning.
  • Duan et al. (2021) Duan, M.; Liu, D.; Chen, X.; Liu, R.; Tan, Y.; and Liang, L. 2021. Self-Balancing Federated Learning With Global Imbalanced Data in Mobile Systems. IEEE Transactions on Parallel and Distributed Systems 32(1): 59–71.
  • Efron and Morris (1977) Efron, B.; and Morris, C. 1977. Stein's paradox in statistics. Scientific American 236(5): 119–127.
  • Guazzone, Anglano, and Sereno (2014) Guazzone, M.; Anglano, C.; and Sereno, M. 2014. A Game-Theoretic Approach to Coalition Formation in Green Cloud Federations. 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. doi:10.1109/ccgrid.2014.37. URL http://dx.doi.org/10.1109/CCGrid.2014.37.
  • Kairouz et al. (2019) Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; D'Oliveira, R. G. L.; Rouayheb, S. E.; Evans, D.; Gardner, J.; Garrett, Z.; Gascón, A.; Ghazi, B.; Gibbons, P. B.; Gruteser, M.; Harchaoui, Z.; He, C.; He, L.; Huo, Z.; Hutchinson, B.; Hsu, J.; Jaggi, M.; Javidi, T.; Joshi, G.; Khodak, M.; Konečný, J.; Korolova, A.; Koushanfar, F.; Koyejo, S.; Lepoint, T.; Liu, Y.; Mittal, P.; Mohri, M.; Nock, R.; Özgür, A.; Pagh, R.; Raykova, M.; Qi, H.; Ramage, D.; Raskar, R.; Song, D.; Song, W.; Stich, S. U.; Sun, Z.; Suresh, A. T.; Tramèr, F.; Vepakomma, P.; Wang, J.; Xiong, L.; Xu, Z.; Yang, Q.; Yu, F. X.; Yu, H.; and Zhao, S. 2019. Advances and Open Problems in Federated Learning.
  • Li et al. (2020) Li, T.; Sahu, A. K.; Talwalkar, A.; and Smith, V. 2020. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine 37(3): 50–60. ISSN 1558-0792. doi:10.1109/msp.2020.2975749. URL http://dx.doi.org/10.1109/MSP.2020.2975749.
  • Li et al. (2019) Li, T.; Sanjabi, M.; Beirami, A.; and Smith, V. 2019. Fair Resource Allocation in Federated Learning.
  • Mansour et al. (2020) Mansour, Y.; Mohri, M.; Ro, J.; and Suresh, A. T. 2020. Three Approaches for Personalization with Applications to Federated Learning.
  • Martinez, Bertran, and Sapiro (2020) Martinez, N.; Bertran, M.; and Sapiro, G. 2020. Minimax Pareto Fairness: A Multi Objective Perspective. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR. URL https://proceedings.icml.cc/static/paper_files/icml/2020/1084-Paper.pdf.
  • Morris (1986) Morris, C. N. 1986. Empirical Bayes: a frequency-Bayes compromise, volume 8 of Lecture Notes–Monograph Series, 195–203. Hayward, CA: Institute of Mathematical Statistics. doi:10.1214/lnms/1215540299. URL https://doi.org/10.1214/lnms/1215540299.
  • Paquay (2018) Paquay, P. 2018. Learning-from-data-Solutions. URL https://github.com/ppaquay/Learning-from-Data-Solutions.
  • Sattler, Muller, and Samek (2020) Sattler, F.; Muller, K.-R.; and Samek, W. 2020. Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints. IEEE Transactions on Neural Networks and Learning Systems 1–13. ISSN 2162-2388. doi:10.1109/tnnls.2020.3015958. URL http://dx.doi.org/10.1109/TNNLS.2020.3015958.
  • Sellentin and Heavens (2015) Sellentin, E.; and Heavens, A. F. 2015. Parameter inference with estimated covariance matrices. Monthly Notices of the Royal Astronomical Society: Letters 456(1): L132–L136. ISSN 1745-3925. doi:10.1093/mnrasl/slv190. URL https://doi.org/10.1093/mnrasl/slv190.
  • Shlezinger, Rini, and Eldar (2020) Shlezinger, N.; Rini, S.; and Eldar, Y. C. 2020. The Communication-Aware Clustered Federated Learning Problem. In 2020 IEEE International Symposium on Information Theory (ISIT), 2610–2615.
  • Wang et al. (2020) Wang, H.; Hsu, H.; Diaz, M.; and Calmon, F. P. 2020. To Split or Not to Split: The Impact of Disparate Treatment in Classification.
  • Yu, Bagdasaryan, and Shmatikov (2020) Yu, T.; Bagdasaryan, E.; and Shmatikov, V. 2020. Salvaging Federated Learning by Local Adaptation.

Appendix A Relationship to other approaches

This section contains a high-level summary of similar approaches and how they relate to ours. Throughout, we assume the goal is to estimate some unknown $\theta_{j}$ given samples drawn $Y_{i}\sim D(\theta_{j})$.

A frequentist approach would take $\theta_{j}$ to be a constant, estimated by the average of the given samples $\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}Y_{i}$.

A hierarchical Bayesian estimator assumes data is generated in the following way: data is drawn $Y_{i}\sim D(Y|\theta_{i})$. The parameter $\theta_{i}$ is drawn $\theta_{i}\sim\Theta_{i}(\theta|\lambda_{i})$, where the hyperparameter $\lambda_{i}$ is drawn from a known distribution $p(\lambda)$. Given some data, the parameter $\theta_{i}$ can be estimated as follows:

$p(\theta_{i}|Y_{i})=\frac{p(Y_{i}|\theta_{i})p(\theta_{i})}{p(Y_{i})}=\frac{p(Y_{i}|\theta_{i})}{p(Y_{i})}\int p(\theta_{i}|\lambda_{i})p(\lambda_{i})d\lambda$

Parametric empirical Bayes (Morris 1986; Casella 1992) is frequently described as an intermediate between these two viewpoints. Similar to the hierarchical Bayesian viewpoint, it assumes data is drawn $Y_{i}\sim D(Y|\theta_{i})$, with parameter $\theta_{i}$ drawn $\theta_{i}\sim\Theta_{i}(\theta|\lambda_{i})$. However, it differs in that it estimates $\lambda_{i}$ based on the data, producing $\hat{\lambda}_{i}$. This estimate of the hyperparameter is used, along with the data, to estimate $\theta_{i}$.

A related example is the James-Stein estimator (Efron and Morris 1977). The estimator assumes the following process: each of $m$ players draws a single sample from a normal distribution with variance $s^{2}$:

$Y_{i}\sim\mathcal{N}(\theta_{i},s^{2})$

This is different from the empirical Bayes or hierarchical Bayes case in that the means $\theta_{i}$ are assumed to be completely unrelated to each other. Nevertheless, it has been demonstrated that the James-Stein estimator

$\hat{\theta}_{JS}=\left(1-\frac{(m-2)\cdot s^{2}}{\left\lVert\bm{Y}\right\rVert^{2}}\right)\bm{Y}$

has lower expected MSE than simply using the drawn values $Y_{i}$. In the case that the variance $s^{2}$ is not known perfectly, it can be estimated as $\hat{s}^{2}$ using the entire vector of data $\bm{Y}$.
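A minimal sketch of this shrinkage formula on synthetic data (the means, variance, and random seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, s2 = 50, 1.0
theta = rng.normal(0.0, 0.5, size=m)        # synthetic true means, near zero
Y = rng.normal(theta, np.sqrt(s2))          # one noisy sample per player

# James-Stein estimator: shrink the raw observations toward zero.
theta_js = (1 - (m - 2) * s2 / np.sum(Y ** 2)) * Y

print(np.mean((Y - theta) ** 2))            # MSE of the raw observations
print(np.mean((theta_js - theta) ** 2))     # typically substantially lower here
```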

Our method is similar at a high level to empirical Bayes: we assume each player draws data from a personal distribution governed by $\theta_{i}$ and that the $\theta_{i}$ terms are in turn drawn from some distribution $\Theta$. However, one key difference is that all three methods discussed above assume knowledge of the distributions generating the data, or at least of which family they are drawn from. For example, the James-Stein estimator assumes a normal distribution: variants of it exist for other specific distributions, but not a version that works for all distributions. Similarly, a hierarchical Bayes or empirical Bayes viewpoint would require knowledge of the $D,\Theta,p$ distributions. In our approach, we do not assume that we know the form of these generating distributions, only some summary statistics (mean and variance) of the distribution.

It is entirely possible that other approaches, especially those that assume knowledge of the generating distribution, will outperform our approach in terms of the error guarantees they can provide. However, our distribution-free approach can be applied in a broader range of situations.

Additionally, our approach is restricted to linear combinations of estimators such as $\hat{\theta}^{f}=w\cdot\hat{\theta}_{j}+(1-w)\sum_{i=1}^{M}\hat{\theta}_{i}$. It is possible that a method outside this class, for example something like $\hat{\theta}^{f}=x\cdot\hat{\theta}_{j}+y\sum_{i=1}^{M}\hat{\theta}_{i}^{2}$, or something like the non-linear James-Stein estimator, would produce better estimates.

Appendix B Expected error proofs

For convenience, we will restate the model setup for the most general case of linear regression. We assume that each player $j\in[M]$ draws parameters $(\bm{\theta}_{j},\epsilon^{2}_{j})\sim\Theta$, where $\bm{\theta}_{j}$ is a length-$D$ vector and $\epsilon^{2}_{j}$ is a scalar-valued variance parameter. The $d$th entry of the vector is $\bm{\theta}_{j}^{d}$, and $\mathrm{Var}(\bm{\theta}_{j}^{d})=\sigma^{2}_{d}$. We assume that each value $\bm{\theta}_{j}$ is drawn independently of the others. The main result of this section will assume that each dimension is drawn independently, for example that $\bm{\theta}_{j}^{l}$ is independent of $\bm{\theta}_{j}^{k}$ for $k\neq l$, but we will demonstrate how this can be relaxed. Each player draws $n_{j}$ input data points from their own input distribution, $\bm{X}_{j}\sim\mathcal{X}_{j}$, such that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{T}\bm{x}]=\Sigma_{j}$. They then noisily observe the outputs, drawing $\bm{Y}_{j}\sim\mathcal{D}_{j}(\bm{X}_{j}^{T}\bm{\theta}_{j},\epsilon^{2}_{j})$. We use $\bm{\eta}_{j}$ to denote the length-$n_{j}$ vector of errors, so that $\bm{Y}_{j}=\bm{X}_{j}^{T}\bm{\theta}_{j}+\bm{\eta}_{j}$. Each player uses ordinary least squares (OLS) to compute estimates of their parameters, which requires that $\bm{X}_{j}^{T}\bm{X}_{j}$ is invertible. This happens when the columns of $\bm{X}_{j}$ are linearly independent. We will require that $\Sigma_{j}$ be such that $\bm{X}_{j}^{T}\bm{X}_{j}$ is invertible with probability 1. This rules out cases where one dimension is a deterministic function of another, for example. The OLS calculation below gives the local estimate:

$\hat{\bm{\theta}}_{j}=(\bm{X}^{T}_{j}\bm{X}_{j})^{-1}\bm{X}^{T}_{j}\bm{Y}_{j}=(\bm{X}^{T}_{j}\bm{X}_{j})^{-1}\bm{X}^{T}_{j}(\bm{X}_{j}\bm{\theta}_{j}+\bm{\eta}_{j})$

It is worth pausing briefly to note why mean estimation is a special case of linear regression. Consider the case where the distribution $\mathcal{X}_{j}$ is deterministically 1, meaning that $\bm{X}_{j}$ is a vector of 1s of length $n_{j}$. Then $(\bm{X}_{j}^{T}\bm{X}_{j})$ is always invertible, as $(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}=n_{j}^{-1}$. Each entry of $\bm{X}_{j}$ is multiplied by an unknown single parameter $\theta_{j}$, which each player is attempting to learn.
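A small numerical sketch of this local estimation step (synthetic data, illustrative dimensions): OLS on a player's own samples, plus the all-ones design that reduces OLS to the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n_j, D = 40, 3
theta_j = rng.normal(size=D)                  # player j's true coefficients
X_j = rng.normal(size=(n_j, D))               # input datapoints
Y_j = X_j @ theta_j + rng.normal(scale=0.5, size=n_j)   # noisy outputs

# Local OLS estimate: (X^T X)^{-1} X^T Y.
theta_hat = np.linalg.solve(X_j.T @ X_j, X_j.T @ Y_j)
print(theta_hat, theta_j)

# Mean estimation as a special case: a design of all 1s makes the OLS
# estimate equal to the sample mean of the observations.
ones = np.ones((n_j, 1))
samples = 2.0 + rng.normal(scale=0.5, size=n_j)
print(np.linalg.solve(ones.T @ ones, ones.T @ samples), samples.mean())
```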

Besides local estimation, there are multiple federation possibilities. Uniform federation is given by:

𝜽^jf=1Nβ€‹βˆ‘i=1M𝜽^iβ‹…ni\hat{\bm{\bm{\theta}}}_{j}^{f}=\frac{1}{N}\sum_{i=1}^{M}\hat{\bm{\bm{\theta}}}_{i}\cdot n_{i}

Coarse-grained federation is given by:

ΞΈ^jw=wjβ‹…πœ½^j+(1βˆ’wj)β‹…1Nβ€‹βˆ‘i=1M𝜽^iβ‹…ni\hat{\theta}^{w}_{j}=w_{j}\cdot\hat{\bm{\bm{\theta}}}_{j}+(1-w_{j})\cdot\frac{1}{N}\sum_{i=1}^{M}\hat{\bm{\bm{\theta}}}_{i}\cdot n_{i}

for wj∈[0,1]w_{j}\in[0,1]. Finally, fine-grained federation is given by:

\hat{\bm{\theta}}_{j}^{v}=\sum_{i=1}^{M}v_{ji}\hat{\bm{\theta}}_{i}

for βˆ‘i=1Mvj​i=1\sum_{i=1}^{M}v_{ji}=1. Note that fine-grained federation is the most general case of federation. It is possible to derive coarse-grained federation, uniform federation, or local estimation by appropriately setting the vv weights. In this section, we will first derive the expected error for the fine-grained federation, linear regression case, and get expected error results for other cases as corollaries of this result.
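To make these definitions concrete, the numpy sketch below (not part of the paper) fits local OLS models on synthetic data and forms the uniform, coarse-grained, and fine-grained combinations; the number of players, sample sizes, dimension, and the choice of standard-normal inputs and noise are all illustrative assumptions.

```python
# A minimal sketch (not from the paper) of local estimation and the three
# federation schemes; all parameter choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, D = 4, 3                        # number of players and dimension
n = np.array([20, 35, 50, 80])     # samples n_i per player
sigma2, mu_e = 0.5, 1.0            # Var(theta^d) and E[eps_j^2]

# Draw each player's true parameters from Theta (taken to be normal here).
theta_true = rng.normal(0.0, np.sqrt(sigma2), size=(M, D))

# Local estimation: each player fits OLS on its own data.
theta_hat = np.zeros((M, D))
for j in range(M):
    X = rng.normal(size=(n[j], D))                     # X_j, standard-normal inputs
    Y = X @ theta_true[j] + rng.normal(0.0, np.sqrt(mu_e), size=n[j])
    theta_hat[j] = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y

N = n.sum()

# Uniform federation: a single model, weighting each player by its sample count.
theta_uniform = (n[:, None] * theta_hat).sum(axis=0) / N

# Coarse-grained federation: player j mixes its local model with the global one.
def coarse_grained(j, w):
    return w * theta_hat[j] + (1 - w) * theta_uniform

# Fine-grained federation: player j uses arbitrary weights v_j that sum to 1.
def fine_grained(v_j):
    v_j = np.asarray(v_j, dtype=float)
    assert np.isclose(v_j.sum(), 1.0)
    return v_j @ theta_hat

print(theta_uniform)
print(coarse_grained(0, w=0.3))
print(fine_grained([0.4, 0.2, 0.2, 0.2]))
```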

The expected error produced by a set of estimates 𝜽^\hat{\bm{\bm{\theta}}} is determined by the expectation of the following quantity.

(𝒙Tβ€‹πœ½^βˆ’π’™Tβ€‹πœ½j)2(\bm{x}^{T}\hat{\bm{\bm{\theta}}}-\bm{x}^{T}\bm{\bm{\theta}}_{j})^{2}

Here, the expectation is taken over four sources of randomness.

  1. $\mathbb{E}_{(\bm{\theta}_{i},\epsilon^{2}_{i})\sim\Theta}$: drawing parameters $\bm{\theta}_{i},\epsilon^{2}_{i}$ for each player $i$'s distribution from $\Theta$.

  2. $\mathbb{E}_{X_{i}\sim\mathcal{X}_{i}}$: drawing the training dataset $\bm{X}_{i}$ from the data distribution $\mathcal{X}_{i}$.

  3. $\mathbb{E}_{Y_{i}\sim\mathcal{D}_{i}(X_{i}^{T}\bm{\theta}_{i},\epsilon^{2}_{i})}$: drawing labels for the training dataset $\bm{X}_{i}$ from the distribution $\mathcal{D}_{i}(\bm{X}_{i}^{T}\bm{\theta}_{i},\epsilon^{2}_{i})$.

  4. $\mathbb{E}_{x\sim\mathcal{X}_{j}}$: drawing a new test point $\bm{x}$ from the data distribution $\mathcal{X}_{j}$.

$(\bm{x}^{T}\hat{\bm{\theta}}-\bm{x}^{T}\bm{\theta}_{j})^{2}$ measures the error of a set of parameters at a particular point $\bm{x}$: when the expectation is taken over all $\bm{x}\sim\mathcal{X}_{j}$, it represents the average error over the whole distribution. It might not be immediately clear, though, why $(\bm{x}^{T}\hat{\bm{\theta}}-\bm{x}^{T}\bm{\theta}_{j})^{2}$ is the correct term to be considering. Other potential candidates include:

  1. $\left\lVert\hat{\bm{\theta}}_{j}-\bm{\theta}_{j}\right\rVert^{2}$

  2. $(\bm{x}^{T}\hat{\bm{\theta}}_{j}-y)^{2}$ for $y=\bm{x}^{T}\bm{\theta}_{j}+\eta_{j}$.

The first candidate measures the difference in estimated parameters; however, we assume that the objective of learning is to have low error on predicting future points, rather than solely estimate the parameters. The second candidate represents the error of predicting an instance as opposed to a mean value: it ends up simply producing an additive increase in our overall error term. To see this, note that we can write

(𝒙Tβ€‹πœ½^jβˆ’y)2=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+𝒙Tβ€‹πœ½jβˆ’y)2(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-y)^{2}=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\bm{x}^{T}\bm{\theta}_{j}-y)^{2}
=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½j+Ξ·j)2=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\eta_{j})^{2}
=(𝒙Tβ€‹πœ½^jβˆ’π’™Tβ€‹πœ½j+Ξ·j)2=(\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}-\bm{x}^{T}\bm{\theta}_{j}+\eta_{j})^{2}
=(\bm{x}^{T}\hat{\bm{\theta}}_{j}-\bm{x}^{T}\bm{\theta}_{j})^{2}+2(\bm{x}^{T}\hat{\bm{\theta}}_{j}-\bm{x}^{T}\bm{\theta}_{j})\eta_{j}+\eta_{j}^{2}

The first term is the same error function we are considering. The middle term is 0 in expectation and the last term is $\mu_e$ in expectation, so this candidate simply adds $\mu_e$ to the error term we are already studying.

See 4.1. Note: portions of Abu-Mostafa, Lin, and Magdon-Ismail (2012) and Paquay (2018), especially problem 3.11, were helpful in formulating this approach. Sellentin and Heavens (2015) and Anderson (1962) were helpful in providing the connection to the inverse Wishart distribution.

Proof.

First, note that:

𝒙Tβ€‹πœ½π’‹βˆ’π’™Tβ€‹πœ½^j=𝒙T​(𝜽jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jT​𝒀j)\bm{x}^{T}\bm{\bm{\theta}_{j}}-\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j}=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{Y}_{j}\right)
=𝒙T​(𝜽jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jT​(𝑿jβ€‹πœ½j+𝜼j))=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}(\bm{X}_{j}\bm{\bm{\theta}}_{j}+\bm{\eta}_{j})\right)
=𝒙T​(𝜽jβˆ’πœ½jβˆ’(𝑿jT​𝑿j)βˆ’1​𝑿jTβ€‹πœΌj)=\bm{x}^{T}\left(\bm{\bm{\theta}}_{j}-\bm{\bm{\theta}}_{j}-(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\right)
=βˆ’π’™T​(𝑿jT​𝑿j)βˆ’1​𝑿jTβ€‹πœΌj=-\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}

Then,

(𝒙Tβ€‹πœ½π’‹βˆ’π’™Tβ€‹πœ½^j)2(\bm{x}^{T}\bm{\bm{\theta}_{j}}-\bm{x}^{T}\hat{\bm{\bm{\theta}}}_{j})^{2}
=\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}

To simplify, we note that the above quantity is a scalar. For a scalar, $a=\text{tr}(a)$, and for any matrices of compatible dimensions, $\text{tr}(AB)=\text{tr}(BA)$ by the cyclic property of the trace.

=\text{tr}\left[\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\right]
=\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{\eta}_{j}\bm{\eta}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

To evaluate, we start by applying the various expectations, noting that expectation and trace commute. Applying π”ΌπœΌjβˆΌπ’Ÿj​(0,Ο΅j2)\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})} to the term above allows us to rewrite it as:

=\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}V\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

where

V=π”ΌπœΌjβˆΌπ’Ÿj​(0,Ο΅j2)​[𝜼jβ€‹πœΌjT]V=\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})}[\bm{\eta}_{j}\bm{\eta}_{j}^{T}]

$\bm{\eta}_{j}\bm{\eta}_{j}^{T}$ is an $n_{j}\times n_{j}$ matrix. The $l$th diagonal entry is $(\bm{\eta}_{j}^{l})^{2}$, which has expectation $\epsilon^{2}_{j}$. Off-diagonal entries have value $\bm{\eta}_{j}^{l}\cdot\bm{\eta}_{j}^{k}$ for $l\neq k$; because the errors for each data point are drawn independently and with 0 mean, the expectation of each of these is 0. Therefore $\mathbb{E}_{\bm{\eta}_{j}\sim\mathcal{D}_{j}(0,\epsilon^{2}_{j})}[\bm{\eta}_{j}\bm{\eta}_{j}^{T}]$ is a diagonal matrix with $\epsilon^{2}_{j}$ along the diagonal, and we can pull the scalar $\epsilon^{2}_{j}$ out of the trace to obtain:

=\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\bm{X}_{j}^{T}\bm{X}_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]
=Ο΅j2​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]=\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Taking the expectation over the drawn parameters gives:

=𝔼(ΞΈj,Ο΅j2)βˆΌΞ˜β€‹[Ο΅j2​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]]=\mathbb{E}_{(\theta_{j},\epsilon^{2}_{j})\sim\Theta}[\epsilon^{2}_{j}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]]
=ΞΌe​tr​[𝒙​𝒙T​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\bm{x}\bm{x}^{T}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Taking the expectation over the test point π’™βˆΌπ’³j\bm{x}\sim\mathcal{X}_{j} gives:

=ΞΌe​tr​[𝔼xβˆΌπ’³j​[𝒙​𝒙T]​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}](\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]
=ΞΌe​tr​[Ξ£j​(𝑿jT​𝑿j)βˆ’1]=\mu_{e}\text{tr}\left[\Sigma_{j}(\bm{X}_{j}^{T}\bm{X}_{j})^{-1}\right]

Finally, we take the expectation with respect to 𝑿jβˆΌπ’³j\bm{X}_{j}\sim\mathcal{X}_{j}

=ΞΌe​tr​[Ξ£j​𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]]=\mu_{e}\text{tr}\left[\Sigma_{j}\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]

Note that because the inverse and expectation do not commute, in general, we cannot simplify this without stronger assumptions.

There is one other situation where a particular case of linear regression gives us simpler results. As mentioned in the statement of the lemma, in this case we assume that the distribution of input values 𝒳j\mathcal{X}_{j} is a 0-mean normal distribution with covariance matrix C​o​vjCov_{j}.

Note that, in general, $Cov_{j}\neq\Sigma_{j}=\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}]$. $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}]$ has, along the diagonal, $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}\bm{x}^{d}]$, and on the off-diagonals, $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\bm{x}^{k}]$. By contrast, the covariance matrix $Cov_{j}$ has $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}\bm{x}^{d}]-\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{d}]^{2}$ along the diagonal and $\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\bm{x}^{k}]-\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}]\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{k}]$ on the off-diagonals. In the case we are looking at, the distribution has mean 0, so the subtracted products vanish, every entry matches, and $Cov_{j}=\Sigma_{j}$.

If this is the case, then (𝑿jT​𝑿j)(\bm{X}_{j}^{T}\bm{X}_{j}) is distributed according to a Wishart distribution with parameters njn_{j} and covariance C​o​vj=Ξ£jCov_{j}=\Sigma_{j} with dimension DD. Given this, (𝑿jT​𝑿j)βˆ’1(\bm{X}_{j}^{T}\bm{X}_{j})^{-1} is distributed according to an Inverse Wishart distribution with parameters njn_{j} and covariance C​o​vjβˆ’1=Ξ£jβˆ’1Cov_{j}^{-1}=\Sigma_{j}^{-1} with dimension DD.

The expectation of the inverse Wishart tells us that:

𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]=1njβˆ’Dβˆ’1​C​o​vjβˆ’1\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]=\frac{1}{n_{j}-D-1}Cov_{j}^{-1}

Using these results, we can directly calculate the desired expectation:

ΞΌe​tr​[Ξ£j​𝔼XjβˆΌπ’³j​[(𝑿jT​𝑿j)βˆ’1]]\mu_{e}\text{tr}\left[\Sigma_{j}\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}\left[\left(\bm{X}_{j}^{T}\bm{X}_{j}\right)^{-1}\right]\right]
=ΞΌe​tr​[1njβˆ’Dβˆ’1​Σj​Σjβˆ’1]=\mu_{e}\text{tr}\left[\frac{1}{n_{j}-D-1}\Sigma_{j}\Sigma_{j}^{-1}\right]
=ΞΌenjβˆ’Dβˆ’1​tr​[Ξ£j​Σjβˆ’1]=\frac{\mu_{e}}{n_{j}-D-1}\text{tr}\left[\Sigma_{j}\Sigma_{j}^{-1}\right]
=ΞΌenjβˆ’Dβˆ’1​tr​[ID]=\frac{\mu_{e}}{n_{j}-D-1}\text{tr}\left[I_{D}\right]
=ΞΌenjβˆ’Dβˆ’1β‹…D=\frac{\mu_{e}}{n_{j}-D-1}\cdot D

Next, we can reduce the linear regression case to mean estimation. In this case, assume a 1-dimensional input with $x=1$ deterministically. After drawing $x_{j}$, we multiply it by $\bm{\theta}_{j}$ and add some noise governed by $\epsilon^{2}_{j}$: this is the exact same structure as mean estimation. In this case, $\Sigma_{j}=\mathbb{E}_{X_{j}\sim\mathcal{X}_{j}}[\bm{X}_{j}]^{2}+Var(\bm{X}_{j})=1+0=1$. Similarly, $\bm{X}_{j}^{T}\bm{X}_{j}=n_{j}$ deterministically, so the error term reduces to $\frac{\mu_{e}}{n_{j}}$ as desired.

Note that, as expected, the linear regression result does not simplify down to the mean estimation case for $D=1$: that case would model a version of 1-dimensional linear regression, where it is necessary to estimate both $\bm{\theta}$ and $\hat{x}$, the mean of the input distribution. ∎
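As a sanity check on the closed form just derived, the Monte Carlo sketch below (not part of the paper) estimates the expected MSE of local estimation under a 0-mean Gaussian design and compares it to $\mu_{e}D/(n_{j}-D-1)$; the parameter values are illustrative and $\epsilon_{j}^{2}$ is held fixed at $\mu_{e}$. Taking $D=1$ with $\bm{X}_{j}$ identically 1 instead recovers the mean estimation value $\mu_{e}/n_{j}$.

```python
# A Monte Carlo check (not part of the paper) of mu_e * D / (n_j - D - 1)
# for the 0-mean Gaussian-design case; parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
D, n_j = 3, 12
mu_e, sigma2 = 1.0, 0.5      # eps_j^2 is held fixed at mu_e below
trials = 20000

errs = []
for _ in range(trials):
    theta = rng.normal(0.0, np.sqrt(sigma2), size=D)   # theta_j ~ Theta
    X = rng.normal(size=(n_j, D))                      # 0-mean normal design, Cov = I
    Y = X @ theta + rng.normal(0.0, np.sqrt(mu_e), size=n_j)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # local OLS estimate
    x = rng.normal(size=D)                             # fresh test point from X_j
    errs.append((x @ theta_hat - x @ theta) ** 2)

print(np.mean(errs))                # Monte Carlo estimate of the expected MSE
print(mu_e * D / (n_j - D - 1))     # closed form: 1.0 * 3 / (12 - 3 - 1) = 0.375
```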

Next, we calculate expected MSE for the fine-grained linear regression case.

See 4.2

Proof.

Here, we will use 𝔼YβˆΌπ’Ÿβ€‹(ΞΈi,Ο΅i2)\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})} to mean the expectation taken over data from all players i∈[M]i\in[M], given that all of the data influences the federated learning result.

(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½^jv)2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}
=(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv+𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)2=(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j}+\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}
=(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)2+(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)2+2​(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)β‹…(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½^jv)\begin{split}=(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})^{2}+(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})^{2}\\ +2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})\cdot(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j})\end{split} (1)

Note that the expectation of the last term in Equation 1 results in 0 because $\mathbb{E}_{Y\sim\mathcal{D}(\theta_{j},\epsilon^{2}_{j})}\left[\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}}^{v}_{j}\right]=0$. Next, we investigate the second term in Equation 1.

(𝒙Tβ€‹πœ½jvβˆ’π’™Tβ€‹πœ½jv^)2(\bm{x}^{T}\bm{\theta}^{v}_{j}-\bm{x}^{T}\hat{\bm{\theta}^{v}_{j}})^{2}
=(𝒙Tβ€‹βˆ‘i=1Mvj​iβ€‹πœ½iβˆ’π’™Tβ€‹βˆ‘i=1Mvj​iβ€‹πœ½^i)2=\left(\bm{x}^{T}\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}-\bm{x}^{T}\sum_{i=1}^{M}v_{ji}\hat{\bm{\theta}}_{i}\right)^{2}
=(βˆ‘i=1Mvj​i​𝒙T​(𝜽iβˆ’πœ½^i))2=\left(\sum_{i=1}^{M}v_{ji}\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\right)^{2}

Expanding out the squared term gives us:

βˆ‘i=1M(vj​i​𝒙T​(𝜽iβˆ’πœ½^i))2\sum_{i=1}^{M}\left(v_{ji}\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\right)^{2}
+βˆ‘i=1Mβˆ‘kβ‰ i(vj​i⋅𝒙T​(𝜽iβˆ’πœ½^i)β‹…vj​k⋅𝒙T​(𝜽kβˆ’πœ½^k))+\sum_{i=1}^{M}\sum_{k\neq i}\left(v_{ji}\cdot\bm{x}^{T}(\bm{\theta}_{i}-\hat{\bm{\theta}}_{i})\cdot v_{jk}\cdot\bm{x}^{T}(\bm{\theta}_{k}-\hat{\bm{\theta}}_{k})\right)

The second term ends up being irrelevant: because each set of parameters $\bm{\theta}_{i}\sim\Theta$ is drawn independently and because each data set $\bm{X}_{i}\sim\mathcal{X}_{i}$ is drawn independently, the $\bm{\theta}_{i}-\hat{\bm{\theta}}_{i}$ terms are independent of each other. Because each is 0 in expectation, the entire product has expectation 0. Rewriting the first term gives:

βˆ‘i=1Mvj​i2β‹…(𝒙Tβ€‹πœ½iβˆ’π’™Tβ€‹πœ½^i)2\sum_{i=1}^{M}v_{ji}^{2}\cdot(\bm{x}^{T}\bm{\theta}_{i}-\bm{x}^{T}\hat{\bm{\theta}}_{i})^{2}

The term inside the sum is exactly the quantity we solved for in the local estimation case: we can rewrite this as

\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\text{tr}\left[\Sigma_{j}\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]

or, if the necessary conditions are satisfied,

ΞΌeβ€‹βˆ‘i=1Mvj​i2β‹…Dniβˆ’Dβˆ’1\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}

Finally, we will explore the first term in Equation 1:

(𝒙Tβ€‹πœ½jβˆ’π’™Tβ€‹πœ½jv)2(\bm{x}^{T}\bm{\theta}_{j}-\bm{x}^{T}\bm{\theta}^{v}_{j})^{2}
=(𝒙T​(𝜽jβˆ’πœ½jv))T​𝒙T​(𝜽jβˆ’πœ½jv)=(\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j}))^{T}\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})
=(𝜽jβˆ’πœ½jv)T​𝒙​𝒙T​(𝜽jβˆ’πœ½jv)=(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\bm{x}\bm{x}^{T}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})

Taking the expectation and using the cyclic property of the trace gives:

=tr​[(𝜽jβˆ’πœ½jv)T​𝔼xβˆΌπ’³j​[𝒙​𝒙T]​(𝜽jβˆ’πœ½jv)]=\text{tr}\left[(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}\bm{x}^{T}](\bm{\theta}_{j}-\bm{\theta}^{v}_{j})\right]
=tr​[Ξ£j​(𝜽jβˆ’πœ½jv)​(𝜽jβˆ’πœ½jv)T]=\text{tr}\left[\Sigma_{j}(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})(\bm{\theta}_{j}-\bm{\theta}^{v}_{j})^{T}\right] (2)

Next, we focus on simplifying the inner term of Equation 2 involving the 𝜽\bm{\theta} values. Using the definition of 𝜽jv\bm{\theta}^{v}_{j} gives:

(𝜽jβˆ’βˆ‘i=1Mvj​iβ€‹πœ½i)​(𝜽jβˆ’βˆ‘i=1Mvj​iβ€‹πœ½i)T\left(\bm{\theta}_{j}-\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}\right)\left(\bm{\theta}_{j}-\sum_{i=1}^{M}v_{ji}\bm{\theta}_{i}\right)^{T}
=((1βˆ’vj​j)β‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)=\left((1-v_{jj})\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)
β‹…((1βˆ’vj​j)β‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)T\cdot\left((1-v_{jj})\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)^{T}

We know that 1=vj​j+βˆ‘iβ‰ jvj​i1=v_{jj}+\sum_{i\neq j}v_{ji}, so we can rewrite this as:

=(βˆ‘iβ‰ jvj​iβ‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)​(βˆ‘iβ‰ jvj​iβ‹…πœ½jβˆ’βˆ‘iβ‰ jvj​iβ‹…πœ½i)T=\left(\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)\left(\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{j}-\sum_{i\neq j}v_{ji}\cdot\bm{\theta}_{i}\right)^{T}
=(βˆ‘iβ‰ jvj​iβ‹…(𝜽jβˆ’πœ½i))​(βˆ‘iβ‰ jvj​iβ‹…(𝜽jβˆ’πœ½i))T=\left(\sum_{i\neq j}v_{ji}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})\right)\left(\sum_{i\neq j}v_{ji}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})\right)^{T}

Expanding gives us two terms:

βˆ‘iβ‰ jvj​i2​(𝜽jβˆ’πœ½i)​(𝜽jβˆ’πœ½i)T+βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​kβ‹…(𝜽jβˆ’πœ½i)​(𝜽jβˆ’πœ½k)T\begin{split}\sum_{i\neq j}&v_{ji}^{2}(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{i})^{T}\\ +\sum_{i,k\neq j,i\neq k}&v_{ji}\cdot v_{jk}\cdot(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{k})^{T}\end{split} (3)

Note that

(\bm{\theta}_{j}-\bm{\theta}_{i})(\bm{\theta}_{j}-\bm{\theta}_{k})^{T}=\bm{\theta}_{j}\bm{\theta}_{j}^{T}-\bm{\theta}_{j}\bm{\theta}_{k}^{T}-\bm{\theta}_{i}\bm{\theta}_{j}^{T}+\bm{\theta}_{i}\bm{\theta}_{k}^{T}

Because the parameters are drawn iid,

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]=π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]β€‹βˆ€i∈[M]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{i}^{T}\right]\ \forall i\in[M]
π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]=π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½lT]β€‹βˆ€jβ‰ k,iβ‰ l,i,j,k,l∈[M]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{l}^{T}\right]\ \forall j\neq k,i\neq l,\ i,j,k,l\in[M]

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}\bm{\theta}_{j}^{T}] is implied to mean the same thing as π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{i}\bm{\theta}_{i}^{T}]. Using these results allows us to expand out the product and take the expectation. The first term in Equation 3 becomes:

βˆ‘iβ‰ jvj​i2β‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]+π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½iT]βˆ’2β€‹π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT])\sum_{i\neq j}v_{ji}^{2}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]+\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{i}^{T}\right]-2\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]\right)
=2β€‹βˆ‘iβ‰ jvj​i2β‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT])=2\sum_{i\neq j}v_{ji}^{2}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]\right)

For the second term in Equation 3, each product within the sum becomes :

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½jT]+π”Όπœ½jβˆΌΞ˜β€‹[𝜽iβ€‹πœ½kT]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{j}^{T}\right]+\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{i}\bm{\theta}_{k}^{T}\right]

The last two terms cancel because the parameters are drawn i.i.d., so each product becomes:

=π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT]=\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]

The second sum in Equation 3 therefore becomes:

=βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​kβ‹…(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])=\sum_{i,k\neq j,i\neq k}v_{ji}\cdot v_{jk}\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)

The sum of both of these terms taken together becomes:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(2β€‹βˆ‘iβ‰ jvj​i2+βˆ‘i,kβ‰ j,iβ‰ kvj​i)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(2\sum_{i\neq j}v_{ji}^{2}+\sum_{i,k\neq j,i\neq k}v_{ji}\right)

Next, we can use the identity:

βˆ‘i,kβ‰ j,iβ‰ kvj​iβ‹…vj​k=(βˆ‘iβ‰ jvj​i)2βˆ’βˆ‘iβ‰ jvj​i2\sum_{i,k\neq j,i\neq k}v_{ji}\cdot v_{jk}=\left(\sum_{i\neq j}v_{ji}\right)^{2}-\sum_{i\neq j}v_{ji}^{2}

Which allows us to rewrite as:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)

We can alternatively write this in a different way. Because $1=v_{jj}+\sum_{i\neq j}v_{ji}$, we have $1-v_{jj}=\sum_{i\neq j}v_{ji}$, so the expression becomes:

(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½kT])β‹…(βˆ‘iβ‰ jvj​i2+(1βˆ’vj​j)2)\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{k}^{T}\right]\right)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(1-v_{jj}\right)^{2}\right)

Recall that this analysis was focusing solely on the component of Equation 2 that involved the 𝜽\bm{\theta} product. We can recombine our simplification into Equation 2 to rewrite it as:

tr​[Ξ£j​(π”Όπœ½jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½jT]βˆ’π”Όπœ½i,𝜽jβˆΌΞ˜β€‹[𝜽jβ€‹πœ½iT])]\text{tr}\left[\Sigma_{j}\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]-\mathbb{E}_{\bm{\theta}_{i},\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{i}^{T}\right]\right)\right]
β‹…(βˆ‘iβ‰ jvj​i2+(1βˆ’vj​j)2)\cdot\left(\sum_{i\neq j}v_{ji}^{2}+\left(1-v_{jj}\right)^{2}\right)

Next, we need to reason about the difference between the two expectation terms. In this setting, we are assuming that each coefficient is drawn independently of the other coefficients. $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{j}^{T}\right]$ has, in the $d$th diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[(\bm{\theta}_{j}^{d})^{2}]$, and in the $l,k$th off-diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]$. Here, we are assuming that $\bm{\theta}_{j}^{l}$ and $\bm{\theta}_{j}^{k}$ are independent, so this equals $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]$; we relax this assumption below. $\mathbb{E}_{\bm{\theta}_{i},\bm{\theta}_{j}\sim\Theta}\left[\bm{\theta}_{j}\bm{\theta}_{i}^{T}\right]$ has, in the $d$th diagonal entry, $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{d}]^{2}$, and in the $l,k$th off-diagonal entry, the same value as the other matrix: $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]$. The difference between these two matrices is therefore a diagonal matrix with $\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[(\bm{\theta}_{j}^{d})^{2}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{d}]^{2}=\sigma^{2}_{d}$ in the $d$th diagonal entry, where $\sigma^{2}_{d}$ represents the variance of the $d$th coefficient. That turns the term involving the trace into a simple sum:

(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)β‹…βˆ‘d=1D𝔼xβˆΌπ’³j​[(𝒙d)2]β‹…Οƒd2\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}

In the proof above, we assumed that the draw of parameter value 𝜽jl\bm{\theta}_{j}^{l} is independent of 𝜽jk\bm{\theta}_{j}^{k}, for lβ‰ kl\neq k. A case where this might not occur is when these values are correlated: say, the value drawn for 𝜽jl\bm{\theta}_{j}^{l} is anti-correlated with the parameter drawn for 𝜽jk\bm{\theta}_{j}^{k}. (Note that we still assume draws are independent across players: 𝜽jl\bm{\theta}_{j}^{l} is independent of 𝜽il\bm{\theta}_{i}^{l} and 𝜽ik\bm{\theta}_{i}^{k}). Relaxing this assumption is not hard and would change the results in the following way: the off-diagonal terms of the difference would no longer be 0. Instead, the off-diagonal l,kl,kth entry becomes

π”Όπœ½jβˆΌΞ˜β€‹[𝜽jlβ‹…πœ½jk]βˆ’π”Όπœ½jβˆΌΞ˜β€‹[𝜽jl]β‹…π”Όπœ½jβˆΌΞ˜β€‹[𝜽jk]\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]

Performing the matrix multiplication with Ξ£j\Sigma_{j} turns this into:

\sum_{d=1}^{D}\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]\cdot\sigma^{2}_{d}+\sum_{l\neq k}\mathbb{E}_{x\sim\mathcal{X}_{j}}[\bm{x}^{l}\cdot\bm{x}^{k}]\cdot\left(\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}\cdot\bm{\theta}_{j}^{k}]-\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{l}]\cdot\mathbb{E}_{\bm{\theta}_{j}\sim\Theta}[\bm{\theta}_{j}^{k}]\right)

Our final value for this component of the error would thus have the same form, but with a slightly different coefficient.

Finally, we will consider the mean estimation case. As before, we note that $\Sigma_{j}=1$ and $\bm{X}_{i}^{T}\bm{X}_{i}=n_{i}$ deterministically, so that component of the error term reduces to

\mu_{e}\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}

Similarly, we note that $\mathbb{E}_{x\sim\mathcal{X}_{j}}[(\bm{x}^{d})^{2}]=1$, so the second component reduces to:

(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)​σ2\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\sigma^{2}

∎
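The mean estimation expression just derived can be checked by simulation. The sketch below (not part of the paper) simulates the fine-grained estimate for the first player under illustrative weights and sample counts and compares the empirical error to $\mu_{e}\sum_{i}v_{ji}^{2}/n_{i}+\sigma^{2}\left(\sum_{i\neq j}v_{ji}^{2}+(\sum_{i\neq j}v_{ji})^{2}\right)$.

```python
# A Monte Carlo check (not part of the paper) of the fine-grained mean estimation
# error for player j = 0; weights and sample counts are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n = np.array([10, 30, 60])            # samples per player
mu_e, sigma2 = 1.0, 0.2               # E[eps_i^2] and Var(theta_i)
v = np.array([0.5, 0.3, 0.2])         # player 0's weights v_{0i}, summing to 1
trials = 50000

errs = []
for _ in range(trials):
    theta = rng.normal(0.0, np.sqrt(sigma2), size=len(n))      # theta_i ~ Theta
    # Each player's local estimate is the mean of n_i noisy observations.
    theta_hat = np.array([
        np.mean(theta[i] + rng.normal(0.0, np.sqrt(mu_e), size=n[i]))
        for i in range(len(n))
    ])
    errs.append((v @ theta_hat - theta[0]) ** 2)

closed_form = mu_e * np.sum(v**2 / n) \
    + sigma2 * (np.sum(v[1:]**2) + np.sum(v[1:])**2)
print(np.mean(errs), closed_form)     # the two values should be close
```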

See 4.3

Proof.

To obtain this result, we note that the uniform federation case amounts to weights vj​i=niNv_{ji}=\frac{n_{i}}{N}. For both linear regression and mean estimation, the Οƒ2\sigma^{2} multiplier becomes:

βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}
=1N2​(βˆ‘iβ‰ jni2+(βˆ‘iβ‰ jni)2)=\frac{1}{N^{2}}\left(\sum_{i\neq j}n_{i}^{2}+\left(\sum_{i\neq j}n_{i}\right)^{2}\right)

For linear regression, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…Dniβˆ’Dβˆ’1\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{D}{n_{i}-D-1}
=1N2β€‹βˆ‘i=1Mni2β‹…Dniβˆ’Dβˆ’1=\frac{1}{N^{2}}\sum_{i=1}^{M}n_{i}^{2}\cdot\frac{D}{n_{i}-D-1}

And for mean estimation, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…1ni\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}
=βˆ‘i=1Mni2N2β‹…1ni=βˆ‘i=1MniN2=1N=\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\cdot\frac{1}{n_{i}}=\sum_{i=1}^{M}\frac{n_{i}}{N^{2}}=\frac{1}{N}

∎
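The substitution above can also be checked numerically: plugging $v_{ji}=n_{i}/N$ into the fine-grained mean estimation multipliers should reproduce the uniform-federation values. A small sketch (not part of the paper, with illustrative sample counts):

```python
# Numeric check (not part of the paper) that v_ji = n_i / N recovers the
# uniform-federation multipliers; sample counts are illustrative.
import numpy as np

n = np.array([10, 30, 60])
N = n.sum()
j = 0
v = n / N                                        # uniform-federation weights

mask = np.arange(len(n)) != j
sigma2_mult = np.sum(v[mask]**2) + np.sum(v[mask])**2
mu_e_mult = np.sum(v**2 / n)

print(sigma2_mult, (np.sum(n[mask]**2) + np.sum(n[mask])**2) / N**2)  # equal
print(mu_e_mult, 1 / N)                                               # equal
```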

See 4.4

Proof.

To obtain this result, we note that the coarse-grained case corresponds to $v_{jj}=w+\frac{(1-w)\cdot n_{j}}{N}$ and $v_{ji}=(1-w)\cdot\frac{n_{i}}{N}$ for $i\neq j$. For both linear regression and mean estimation, the $\sigma^{2}$ multiplier becomes:

βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}
=(1βˆ’w)2β‹…βˆ‘iβ‰ jni2+(βˆ‘iβ‰ jni)2N2=(1-w)^{2}\cdot\frac{\sum_{i\neq j}n_{i}^{2}+\left(\sum_{i\neq j}n_{i}\right)^{2}}{N^{2}}

For linear regression, let $\zeta_{i}$ stand for either $\text{tr}\left[\Sigma_{j}\mathbb{E}_{Y\sim\mathcal{D}(\theta_{i},\epsilon^{2}_{i})}\left[(\bm{X}_{i}^{T}\bm{X}_{i})^{-1}\right]\right]$ or $\frac{D}{n_{i}-D-1}$, depending on the linear regression case. Then, the $\mu_{e}$ multiplier becomes:

βˆ‘i=1Mvj​i2β‹…ΞΆi\sum_{i=1}^{M}v_{ji}^{2}\cdot\zeta_{i}
=\left(w+\frac{(1-w)\cdot n_{j}}{N}\right)^{2}\cdot\zeta_{j}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}^{2}}{N^{2}}\cdot\zeta_{i}
=\zeta_{j}\cdot\left(w^{2}+2\frac{(1-w)\cdot w\cdot n_{j}}{N}\right)+(1-w)^{2}\sum_{i=1}^{M}\frac{n_{i}^{2}}{N^{2}}\cdot\zeta_{i}

For mean estimation, the ΞΌe\mu_{e} multiplier becomes:

βˆ‘i=1Mvj​i2β‹…1ni\sum_{i=1}^{M}v_{ji}^{2}\cdot\frac{1}{n_{i}}
=(w+(1βˆ’w)β‹…njN)2β‹…1nj+βˆ‘iβ‰ j(1βˆ’w)2β‹…ni2N2β‹…1ni=\left(w+\frac{(1-w)\cdot n_{j}}{N}\right)^{2}\cdot\frac{1}{n_{j}}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}^{2}}{N^{2}}\cdot\frac{1}{n_{i}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2β‹…njN2+βˆ‘iβ‰ j(1βˆ’w)2β‹…niN2=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+\frac{(1-w)^{2}\cdot n_{j}}{N^{2}}+\sum_{i\neq j}(1-w)^{2}\cdot\frac{n_{i}}{N^{2}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2β€‹βˆ‘i=1MniN2=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+(1-w)^{2}\sum_{i=1}^{M}\frac{n_{i}}{N^{2}}
=w2nj+2​(1βˆ’w)β‹…wN+(1βˆ’w)2N=\frac{w^{2}}{n_{j}}+2\frac{(1-w)\cdot w}{N}+\frac{(1-w)^{2}}{N}
=w2nj+1βˆ’w2N=\frac{w^{2}}{n_{j}}+\frac{1-w^{2}}{N}

∎
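The same kind of check applies to the coarse-grained weights. The sketch below (not part of the paper, with illustrative values) substitutes $v_{jj}=w+(1-w)n_{j}/N$ and $v_{ji}=(1-w)n_{i}/N$ into the fine-grained mean estimation multipliers and compares against the simplified forms above.

```python
# Numeric check (not part of the paper) of the coarse-grained simplifications;
# the sample counts and mixing weight w are illustrative.
import numpy as np

n = np.array([10, 30, 60])
N = n.sum()
j, w = 0, 0.4

v = (1 - w) * n / N
v[j] += w                               # v_jj = w + (1 - w) * n_j / N

mu_e_mult = np.sum(v**2 / n)
print(mu_e_mult, w**2 / n[j] + (1 - w**2) / N)                        # equal

mask = np.arange(len(n)) != j
sigma2_mult = np.sum(v[mask]**2) + np.sum(v[mask])**2
print(sigma2_mult,
      (1 - w)**2 * (np.sum(n[mask]**2) + np.sum(n[mask])**2) / N**2)  # equal
```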

Appendix C Supporting proofs for uniform federation

See 5.2

Proof.

If a partition Ο€\pi is optimal for every player, then it is core stable: there does not exist a coalition CC where all players prefer CC to Ο€\pi, because there does not exist a coalition where any players prefer CC to Ο€\pi.

If a partition Ο€\pi is optimal for every player, then no other partition can be core stable: any set of players not in their optimal configuration could form a coalition CC where all players would prefer CC.

In the case that players are indifferent between any arrangement, then for any partition Ο€\pi and any competing coalition CC, all players would be indifferent between Ο€\pi and CC, so Ο€\pi is core stable. ∎

See 5.3

Proof.

First, we will consider the case where the inequality is strict and will show the statement is satisfied by showing that every player minimizes their error by being alone, as in $\pi_{l}$. We will use $N_{Q}$ to denote the total number of samples held by players in a coalition $Q$. $Q$ could be the grand coalition of all players ($\pi_{g}$) or some strict subset, but we will assume it contains at least 2 players. We will show that every player gets higher error in $Q$ than it would get alone. We wish to show:

ΞΌeNQ+Οƒ2β€‹βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2NQ2>ΞΌenj\frac{\mu_{e}}{N_{Q}}+\sigma^{2}\frac{\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}}{N_{Q}^{2}}>\frac{\mu_{e}}{n_{j}}

Cross multiplying gives:

ΞΌeβ‹…NQβ‹…nj+Οƒ2β‹…njβ‹…(βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2)>ΞΌeβ‹…NQ2\mu_{e}\cdot N_{Q}\cdot n_{j}+\sigma^{2}\cdot n_{j}\cdot\left(\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}\right)>\mu_{e}\cdot N_{Q}^{2}

Rewriting:

Οƒ2β‹…njβ‹…(βˆ‘iβ‰ j,i∈Qni2+(NQβˆ’nj)2)>ΞΌeβ‹…NQ2βˆ’ΞΌeβ‹…NQβ‹…nj\sigma^{2}\cdot n_{j}\cdot\left(\sum_{i\neq j,i\in Q}n_{i}^{2}+(N_{Q}-n_{j})^{2}\right)>\mu_{e}\cdot N_{Q}^{2}-\mu_{e}\cdot N_{Q}\cdot n_{j}

The righthand side can be rewritten as:

ΞΌeβ‹…NQ2βˆ’ΞΌeβ‹…NQβ‹…nj=ΞΌeβ‹…(NQβˆ’nj)2+ΞΌeβ‹…njβ‹…(NQβˆ’nj)\mu_{e}\cdot N_{Q}^{2}-\mu_{e}\cdot N_{Q}\cdot n_{j}=\mu_{e}\cdot(N_{Q}-n_{j})^{2}+\mu_{e}\cdot n_{j}\cdot(N_{Q}-n_{j})

Then, we can prove the inequality by splitting it up into two terms. The first:

Οƒ2β‹…njβ‹…(NQβˆ’nj)2>ΞΌeβ‹…(NQβˆ’nj)2\sigma^{2}\cdot n_{j}\cdot(N_{Q}-n_{j})^{2}>\mu_{e}\cdot(N_{Q}-n_{j})^{2}

which is true because njβ‹…Οƒ2>ΞΌen_{j}\cdot\sigma^{2}>\mu_{e}. The second:

Οƒ2β‹…njβ‹…βˆ‘iβ‰ j,i∈Qni2>ΞΌeβ‹…njβ‹…(NQβˆ’nj)\sigma^{2}\cdot n_{j}\cdot\sum_{i\neq j,i\in Q}n_{i}^{2}>\mu_{e}\cdot n_{j}\cdot(N_{Q}-n_{j})
Οƒ2β‹…βˆ‘iβ‰ j,i∈Qni2>ΞΌeβ‹…βˆ‘iβ‰ j,i∈Qni\sigma^{2}\cdot\sum_{i\neq j,i\in Q}n_{i}^{2}>\mu_{e}\cdot\sum_{i\neq j,i\in Q}n_{i}

which is satisfied because, for each player,

Οƒ2β‹…ni2>ΞΌeβ‹…ni\sigma^{2}\cdot n_{i}^{2}>\mu_{e}\cdot n_{i}

because Οƒ2β‹…ni>ΞΌe\sigma^{2}\cdot n_{i}>\mu_{e}.
Next, we will consider the case where the inequality may not be strict. We can note that, in the proof above, any coalition $Q$ containing at least one player with $n_{i}>\frac{\mu_{e}}{\sigma^{2}}$ would satisfy the desired inequality: all players participating would get higher error than they would get alone. This shows that any coalition involving a player with more than $\frac{\mu_{e}}{\sigma^{2}}$ samples is infeasible. We have previously shown that all players with $n_{i}=\frac{\mu_{e}}{\sigma^{2}}$ get equal error no matter their arrangement. ∎
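To illustrate the condition used in this proof, the snippet below (not part of the paper) evaluates each player's uniform-federation error in the grand coalition against their local error $\mu_{e}/n_{j}$, for an instance where every player has $n_{i}\sigma^{2}>\mu_{e}$; the parameter values are illustrative.

```python
# Numeric illustration (not part of the paper): with n_i * sigma^2 > mu_e for
# every player, each player's grand-coalition error exceeds its local error.
import numpy as np

mu_e, sigma2 = 1.0, 0.5
n = np.array([3, 4, 6])               # every n_i > mu_e / sigma2 = 2
N = n.sum()

def grand_coalition_error(j):
    others = np.delete(n, j)
    return mu_e / N + sigma2 * (np.sum(others**2) + np.sum(others)**2) / N**2

for j in range(len(n)):
    print(grand_coalition_error(j), ">", mu_e / n[j])
```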

As a reminder, we will use the notation π​(s,β„“)\pi(s,\ell) to mean a coalition with ss small players and β„“\ell large players. The next few lemmas describe how the errors of large and small players in a coalition change as ss and β„“\ell are increased.

Lemma C.1.

For uniform federation, if ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}, small players always see their error decrease with the addition of more small players:

s2>s1⇒π​(s2,β„“)≻Sπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{S}\pi(s_{1},\ell)
Proof.

We will show that the derivative of the small player’s error with respect to ss is always negative. The error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{(s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to ss is:

nsβ‹…(sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe))(sβ‹…ns+β„“β‹…nβ„“)3\frac{n_{s}\cdot(s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e}))}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{3}}
βˆ’nsβ‹…(β„“β‹…nβ„“β‹…(ΞΌe+2​nβ„“β‹…Οƒ2βˆ’3​nsβ‹…Οƒ2))(sβ‹…ns+β„“β‹…nβ„“)3-\frac{n_{s}\cdot(\ell\cdot n_{\ell}\cdot(\mu_{e}+2n_{\ell}\cdot\sigma^{2}-3n_{s}\cdot\sigma^{2}))}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{3}}

Showing that the derivative is negative is equivalent to showing that the term below is negative:

sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe)βˆ’β„“β‹…nβ„“β‹…(ΞΌe+2​nβ„“β‹…Οƒ2βˆ’3​nsβ‹…Οƒ2)s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e})-\ell\cdot n_{\ell}\cdot(\mu_{e}+2n_{\ell}\cdot\sigma^{2}-3n_{s}\cdot\sigma^{2})

We can break this term into multiple components:

sβ‹…nsβ‹…(nsβ‹…Οƒ2βˆ’ΞΌe)≀0s\cdot n_{s}\cdot(n_{s}\cdot\sigma^{2}-\mu_{e})\leq 0

because $n_{s}\leq\frac{\mu_{e}}{\sigma^{2}}$. We can rewrite the factor appearing in the second term as:

ΞΌeβˆ’nsβ‹…Οƒ2+2​σ2​(nβ„“βˆ’ns)\mu_{e}-n_{s}\cdot\sigma^{2}+2\sigma^{2}(n_{\ell}-n_{s})

We know that ΞΌeβˆ’nsβ‹…Οƒ2β‰₯0\mu_{e}-n_{s}\cdot\sigma^{2}\geq 0, and because nβ„“>nsn_{\ell}>n_{s},

2​nβ„“β‹…Οƒ2βˆ’2​nsβ‹…Οƒ2>02n_{\ell}\cdot\sigma^{2}-2n_{s}\cdot\sigma^{2}>0

These facts, taken together, show that the derivative is always negative. ∎
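The derivative used in this proof can be checked symbolically. The sketch below (not part of the paper) uses sympy to compare the derivative of the small player's error with the expression stated above; the analogous checks for the derivatives in Lemmas C.2 through C.4 follow the same pattern.

```python
# Symbolic check (not part of the paper) that d/ds of the small player's error
# matches the expression stated in the proof of Lemma C.1.
import sympy as sp

s, l, ns, nl, mu_e, sigma2 = sp.symbols('s ell n_s n_ell mu_e sigma2', positive=True)
N = s * ns + l * nl

err_small = mu_e / N + sigma2 * ((s - 1) * ns**2 + l * nl**2
                                 + ((s - 1) * ns + l * nl)**2) / N**2

claimed = (ns * (s * ns * (ns * sigma2 - mu_e))
           - ns * (l * nl * (mu_e + 2 * nl * sigma2 - 3 * ns * sigma2))) / N**3

# Should print 0 if the stated derivative is correct.
print(sp.simplify(sp.diff(err_small, s) - claimed))
```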

Lemma C.2.

For uniform federation, if nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}, large players see their error increase with the addition of more large players, so they prefer β„“\ell as small as possible:

β„“2<β„“1⇒π​(s,β„“2)≻Lπ​(s,β„“1)\ell_{2}<\ell_{1}\quad\Rightarrow\quad\pi(s,\ell_{2})\succ_{L}\pi(s,\ell_{1})

If nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}, then there exists some β„“0\ell_{0} such that for all β„“β€²<β„“0\ell^{\prime}<\ell_{0}, the derivative of the large players’ error with respect to β„“\ell is positive, and for all β„“β€²>β„“0\ell^{\prime}>\ell_{0}, the derivative of their error is negative.

Proof.

To prove this, we will show that the derivative of the large player’s error with respect to β„“\ell is always positive when nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}. The error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+Οƒ2​sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2(sβ‹…ns+β„“β‹…nβ„“)2+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to β„“\ell is:

nℓ​(β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)βˆ’nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2))(β„“β‹…nβ„“+nsβ‹…s)3\frac{n_{\ell}(\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})-n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2}))}{(\ell\cdot n_{\ell}+n_{s}\cdot s)^{3}}

We wish to show that the numerator is positive. We can break it into multiple components:

nβ„“β‹…Οƒ2βˆ’ΞΌeβ‰₯0n_{\ell}\cdot\sigma^{2}-\mu_{e}\geq 0

because $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$. We can rewrite the factor appearing in the second term as

ΞΌeβˆ’nβ„“β‹…Οƒ2+2​σ2​(nsβˆ’nβ„“)\mu_{e}-n_{\ell}\cdot\sigma^{2}+2\sigma^{2}(n_{s}-n_{\ell})

which is negative because $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$ and $n_{\ell}>n_{s}$; the second term of the numerator is therefore nonnegative, so the numerator as a whole is positive.
Next, we consider the case where nℓ<μeσ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}. The first term is now negative and the second term is the sum of two terms: one is positive and one is negative. The overall derivative is negative whenever:

β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)βˆ’nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)<0\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})-n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})<0
β„“β‹…nℓ​(nℓ​σ2βˆ’ΞΌe)<nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)\ell\cdot n_{\ell}(n_{\ell}\sigma^{2}-\mu_{e})<n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})
β„“>nsβ‹…s​(ΞΌeβˆ’3​nβ„“β‹…Οƒ2+2​nsβ‹…Οƒ2)nβ„“β‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)\ell>\frac{n_{s}\cdot s(\mu_{e}-3n_{\ell}\cdot\sigma^{2}+2n_{s}\cdot\sigma^{2})}{n_{\ell}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})}

Note that, under this assumption, the denominator is negative. If the numerator is positive, then the inequality holds for all $\ell>0$, so the slope is always negative. If the numerator is negative, then the slope is negative for all $\ell>\ell_{0}$, for the resulting $\ell_{0}>0$. ∎

Lemma C.3.

Assume uniform federation with ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}. If nℓ≀μeΟƒ2n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}, small players always prefer federating with more large players:

β„“2>β„“1⇒π​(s,β„“2)≻Sπ​(s,β„“1)\ell_{2}>\ell_{1}\quad\Rightarrow\quad\pi(s,\ell_{2})\succ_{S}\pi(s,\ell_{1})

If nβ„“>ΞΌeΟƒ2n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}, then there exists some β„“1\ell_{1} such that for all β„“β€²<β„“1\ell^{\prime}<\ell_{1}, the derivative of the small players’ error with respect to β„“\ell is positive, and for all β„“β€²>β„“1\ell^{\prime}>\ell_{1}, the derivative of their error is negative.

Proof.

The small player’s error is

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{(s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to β„“\ell is:

-\frac{n_{\ell}\cdot\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}
-\frac{n_{\ell}\cdot s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}

The derivative is negative when the term below is positive:

β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))+sβ‹…nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2)\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))+s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})

The first term (multiplying β„“β‹…nβ„“\ell\cdot n_{\ell}) is always positive. For nℓ≀μeΟƒ2n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}} the second term is also positive or zero, so the derivative is always negative.

If $n_{\ell}>\frac{\mu_{e}}{\sigma^{2}}$, then the second term (multiplying $s$) is negative. The overall derivative is negative when the term below is positive:

β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))+sβ‹…nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2)>0\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))+s\cdot n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2})>0
β„“β‹…nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))>sβ‹…nsβ‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)\ell\cdot n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))>s\cdot n_{s}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})
β„“>sβ‹…nsβ‹…(nβ„“β‹…Οƒ2βˆ’ΞΌe)nβ„“β‹…(ΞΌeβˆ’Οƒ2β‹…ns+Οƒ2​(nβ„“βˆ’ns))\ell>\frac{s\cdot n_{s}\cdot(n_{\ell}\cdot\sigma^{2}-\mu_{e})}{n_{\ell}\cdot(\mu_{e}-\sigma^{2}\cdot n_{s}+\sigma^{2}(n_{\ell}-n_{s}))}

∎

Lemma C.4.

Assume uniform federation with ns≀μeΟƒ2n_{s}\leq\frac{\mu_{e}}{\sigma^{2}} and ns<nβ„“n_{s}<n_{\ell}. If nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}, then there exists some s0s_{0} such that for all sβ€²<s0s^{\prime}<s_{0}, the derivative of the large players’ error with respect to the number of small players ss is negative, and for all s>s0s>s_{0} the derivative of their error is positive.
If $n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}$, then the large players' error, viewed as a function of $s$, either increases for $s<s_{0}$ and decreases for all $s>s_{0}$, or else decreases for $s<s_{0}$ and increases for all $s>s_{0}$.

Proof.

The large player’s error is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}

The derivative with respect to ss is:

βˆ’nsβ‹…(β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2))(β„“β‹…nβ„“+sβ‹…ns)3-\frac{n_{s}\cdot(\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2}))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}
βˆ’nsβ‹…(nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“)))(β„“β‹…nβ„“+sβ‹…ns)3-\frac{n_{s}\cdot(n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell})))}{(\ell\cdot n_{\ell}+s\cdot n_{s})^{3}}

The derivative is negative when the term below is positive:

β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)+nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})+n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))

For $n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}}$, the first term is always positive or zero and the second term is always negative. Solving for when the overall derivative is negative gives:

β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)+nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>0\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})+n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>0
nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})
s<βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))s<\frac{-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})}{n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))}

Next, we consider the case where nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}. Again, the first term is positive or zero. The second term, though, is composed of a sum: one component is positive and one is negative, so it is not necessarily clear whether the overall sum is positive or negative. If the coefficient on ss is negative, then the curve has the same shape as in the nβ„“β‰₯ΞΌeΟƒ2n_{\ell}\geq\frac{\mu_{e}}{\sigma^{2}} case: first decreasing, and then increasing. If the coefficient is positive, then by the same analysis as before, the derivative is negative whenever the following inequality holds:

nsβ‹…sβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)n_{s}\cdot s\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))>-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})
s>βˆ’β„“β‹…nℓ​(ΞΌeβˆ’nsβ‹…Οƒ2)nsβ‹…(ΞΌeβˆ’nβ„“β‹…Οƒ2+Οƒ2​(nsβˆ’nβ„“))s>\frac{-\ell\cdot n_{\ell}(\mu_{e}-n_{s}\cdot\sigma^{2})}{n_{s}\cdot(\mu_{e}-n_{\ell}\cdot\sigma^{2}+\sigma^{2}(n_{s}-n_{\ell}))}

In this case, the righthand side is nonpositive, so the inequality holds for all $s>0$ and the large players' error is decreasing throughout. The shape of the curve therefore depends on the sign of the coefficient on $s$. ∎

See 5.4

Proof.

In this proof, we will use the results of Lemmas C.1, C.2, C.3, and C.4.

The small players always prefer $s$ as large as possible, and for $n_{\ell}\leq\frac{\mu_{e}}{\sigma^{2}}$ they also prefer $\ell$ as large as possible, so $\pi(S,L)=\pi_{g}$ minimizes error for the small players. For this reason, any defecting coalition of the form $\pi(s>0,\ell)$ is infeasible, because the small players in it would get higher error.

The only kind of defections we need to consider are in the form π​(0,β„“)\pi(0,\ell). We will consider π​(0,L)\pi(0,L) and show that the large players prefer π​(S,L)\pi(S,L) to π​(0,L)\pi(0,L): π​(S,L)≻Lπ​(0,L)\pi(S,L)\succ_{L}\pi(0,L). In the case that nβ„“<ΞΌeΟƒ2n_{\ell}<\frac{\mu_{e}}{\sigma^{2}}, π​(0,L)≻Lπ​(0,β„“<L)\pi(0,L)\succ_{L}\pi(0,\ell<L), so any other arrangement is also not a possible defection. In the case that nβ„“=ΞΌeΟƒ2n_{\ell}=\frac{\mu_{e}}{\sigma^{2}}, π​(0,L)=Lπ​(0,β„“<L)\pi(0,L)=_{L}\pi(0,\ell<L), so similarly any other defection is not possible. What we’d like to show is:

ΞΌesβ‹…ns+β„“β‹…nβ„“\frac{\mu_{e}}{s\cdot n_{s}+\ell\cdot n_{\ell}}
+Οƒ2​sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2(sβ‹…ns+β„“β‹…nβ„“)2+\sigma^{2}\frac{s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}}{(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}}
<ΞΌeβ„“β‹…nβ„“+Οƒ2​(β„“βˆ’1)β‹…nβ„“2+(β„“βˆ’1)2β‹…nβ„“2β„“2β‹…nβ„“2<\frac{\mu_{e}}{\ell\cdot n_{\ell}}+\sigma^{2}\frac{(\ell-1)\cdot n_{\ell}^{2}+(\ell-1)^{2}\cdot n_{\ell}^{2}}{\ell^{2}\cdot n_{\ell}^{2}}

Cross multiplying turns the condition into:

ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…β„“2β‹…nβ„“2\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot\ell^{2}\cdot n_{\ell}^{2}
+Οƒ2β‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)β‹…β„“2β‹…nβ„“2+\sigma^{2}\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})\cdot\ell^{2}\cdot n_{\ell}^{2}
<ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)2+Οƒ2β‹…nβ„“2β‹…(β„“βˆ’1)β‹…β„“β‹…(sβ‹…ns+β„“β‹…nβ„“)2<\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}+\sigma^{2}\cdot n_{\ell}^{2}\cdot(\ell-1)\cdot\ell\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2}

If we collect the ΞΌe\mu_{e} terms, we get:

ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…(sβ‹…ns+β„“β‹…nβ„“βˆ’β„“β‹…nβ„“)\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot(s\cdot n_{s}+\ell\cdot n_{\ell}-\ell\cdot n_{\ell})
=ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…sβ‹…ns=\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot s\cdot n_{s}

If we collect the Οƒ2\sigma^{2} terms, we get:

Οƒ2β‹…β„“β‹…nβ„“2β‹…(β„“β‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2\sigma^{2}\cdot\ell\cdot n_{\ell}^{2}\cdot(\ell\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}
+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)βˆ’(β„“βˆ’1)β‹…(sβ‹…ns+β„“β‹…nβ„“)2)+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})-(\ell-1)\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})^{2})

First, we expand the first squared term and combine it with another term:

s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+s^{2}\cdot n_{s}^{2}+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+(\ell-1)^{2}\cdot n_{\ell}^{2}
=ns2β‹…sβ‹…(s+1)+2β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“+nβ„“2β‹…(β„“βˆ’1)β‹…β„“=n_{s}^{2}\cdot s\cdot(s+1)+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+n_{\ell}^{2}\cdot(\ell-1)\cdot\ell

Multiplied by β„“\ell, it becomes:

β„“β‹…(ns2β‹…sβ‹…(s+1)+2β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“+nβ„“2β‹…(β„“βˆ’1)β‹…β„“)\ell\cdot(n_{s}^{2}\cdot s\cdot(s+1)+2\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}+n_{\ell}^{2}\cdot(\ell-1)\cdot\ell)

Expanding out the second squared term gives us:

s2β‹…ns2+2​sβ‹…β„“β‹…nsβ‹…nβ„“+β„“2β‹…nβ„“2s^{2}\cdot n_{s}^{2}+2s\cdot\ell\cdot n_{s}\cdot n_{\ell}+\ell^{2}\cdot n_{\ell}^{2}

When we multiply this by βˆ’(β„“βˆ’1)-(\ell-1), it becomes

βˆ’(β„“βˆ’1)β‹…(s2β‹…ns2+2​sβ‹…β„“β‹…nsβ‹…nβ„“+β„“2β‹…nβ„“2)-(\ell-1)\cdot(s^{2}\cdot n_{s}^{2}+2s\cdot\ell\cdot n_{s}\cdot n_{\ell}+\ell^{2}\cdot n_{\ell}^{2})

Next, we combine similar terms in both sums. First, we start with coefficients of ns2n_{s}^{2}

β„“β‹…ns2β‹…sβ‹…(s+1)βˆ’(β„“βˆ’1)β‹…ns2β‹…s2\ell\cdot n_{s}^{2}\cdot s\cdot(s+1)-(\ell-1)\cdot n_{s}^{2}\cdot s^{2}
=ns2β‹…sβ‹…(β„“β‹…(s+1)βˆ’(β„“βˆ’1)β‹…s)=n_{s}^{2}\cdot s\cdot(\ell\cdot(s+1)-(\ell-1)\cdot s)
=ns2β‹…sβ‹…(β„“β‹…s+β„“βˆ’β„“β‹…s+s)=n_{s}^{2}\cdot s\cdot(\ell\cdot s+\ell-\ell\cdot s+s)
=ns2β‹…sβ‹…(β„“+s)=n_{s}^{2}\cdot s\cdot(\ell+s)

Next, we consider the term involving coefficients of $n_{s}\cdot n_{\ell}$:

2β‹…β„“β‹…sβ‹…(β„“βˆ’1)β‹…nsβ‹…nβ„“βˆ’2β‹…(β„“βˆ’1)β‹…sβ‹…β„“β‹…nsβ‹…nβ„“2\cdot\ell\cdot s\cdot(\ell-1)\cdot n_{s}\cdot n_{\ell}-2\cdot(\ell-1)\cdot s\cdot\ell\cdot n_{s}\cdot n_{\ell}
=0=0

And the last term, which involves coefficients of $n_{\ell}^{2}$:

β„“2β‹…(β„“βˆ’1)β‹…nβ„“2βˆ’(β„“βˆ’1)β‹…β„“2β‹…nβ„“2\ell^{2}\cdot(\ell-1)\cdot n_{\ell}^{2}-(\ell-1)\cdot\ell^{2}\cdot n_{\ell}^{2}
=0=0

If we take the only nonzero term and multiply it by the factors we pulled out, it becomes:

ns2β‹…sβ‹…(β„“+s)β‹…β„“β‹…nβ„“2β‹…Οƒ2n_{s}^{2}\cdot s\cdot(\ell+s)\cdot\ell\cdot n_{\ell}^{2}\cdot\sigma^{2}

Next, we return this to our inequality. What we’re trying to show is:

β„“β‹…nβ„“2β‹…ns2β‹…sβ‹…(β„“+s)β‹…Οƒ2<ΞΌeβ‹…β„“β‹…nβ„“β‹…(sβ‹…ns+β„“β‹…nβ„“)β‹…sβ‹…ns\ell\cdot n_{\ell}^{2}\cdot n_{s}^{2}\cdot s\cdot(\ell+s)\cdot\sigma^{2}<\mu_{e}\cdot\ell\cdot n_{\ell}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})\cdot s\cdot n_{s}

Cancelling some terms:

nβ„“β‹…nsβ‹…(β„“+s)β‹…Οƒ2<ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)n_{\ell}\cdot n_{s}\cdot(\ell+s)\cdot\sigma^{2}<\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})

Expanding out terms:

Οƒ2β‹…(β„“β‹…nβ„“β‹…ns+sβ‹…nβ„“β‹…ns)<ΞΌeβ‹…(sβ‹…ns+β„“β‹…nβ„“)\sigma^{2}\cdot(\ell\cdot n_{\ell}\cdot n_{s}+s\cdot n_{\ell}\cdot n_{s})<\mu_{e}\cdot(s\cdot n_{s}+\ell\cdot n_{\ell})

We can prove this by splitting up piecewise:

Οƒ2β‹…nsβ‹…β„“β‹…nβ„“<ΞΌeβ‹…β„“β‹…nβ„“\sigma^{2}\cdot n_{s}\cdot\ell\cdot n_{\ell}<\mu_{e}\cdot\ell\cdot n_{\ell}

because Οƒ2β‹…ns<ΞΌe\sigma^{2}\cdot n_{s}<\mu_{e}. Similarly,

Οƒ2β‹…nβ„“β‹…sβ‹…ns≀μeβ‹…sβ‹…ns\sigma^{2}\cdot n_{\ell}\cdot s\cdot n_{s}\leq\mu_{e}\cdot s\cdot n_{s}

because Οƒ2β‹…nℓ≀μe\sigma^{2}\cdot n_{\ell}\leq\mu_{e}. ∎

See 5.5

Proof.

We will prove this directly by calculating an arrangement that is individually stable, relying on the results in Lemmas C.1, C.2, C.3, C.4.

First, group all of the small players together. Then calculate $\ell'=\max\ell$ such that $\pi(S,\ell)\succeq_{L}\pi(0,1)$: the largest number of large players that can be in the coalition such that the large players prefer this to being alone. Check whether $\pi(S,\ell')\prec_{S}\pi(S,0)$. If this is true, make the final arrangement $\pi(S,0),\pi(0,1)\cdot L$: by the previous lemmas describing how the small players' error changes with $\ell$, we know that if $\pi(S,\ell')\prec_{S}\pi(S,0)$, then $\pi(S,\ell)\prec_{S}\pi(S,0)$ for all $\ell\leq\ell'$.

We will show that this is individually stable by showing no players wish to unilaterally deviate.

  • No small player wishes to go to $\pi(1,0)$: reducing the number of small players in a group from $S$ to 1 monotonically increases the error the small player faces.

  • No small player wishes to go to $\pi(1,1)$: it is possible to reach this state by first going from $\pi(S,0)$ to $\pi(S,1)$ (which would increase error because $1\leq\ell'$ and hence $\pi(S,1)\prec_{S}\pi(S,0)$) and then from $\pi(S,1)$ to $\pi(1,1)$ (which would increase error because reducing the number of small players increases error).

  • No large player can go to $\pi(S,1)$: this would increase the error of the small players.

  • No large player wishes to go to $\pi(0,2)$: this would increase the error of both large players.

Next, we will consider the case where $\pi(S,\ell')\succeq_{S}\pi(S,0)$: in this case, we will show that $\pi(S,\ell')$ is individually stable.

  • No small player wishes to go from $\pi(S,\ell')$ to $\pi(1,0)$. We can see that $\pi(1,0)$ has higher error because we know $\pi(S,\ell')\succeq_{S}\pi(S,0)\succ_{S}\pi(1,0)$.

  • No small player wishes to go to $\pi(1,1)$. We can see $\pi(1,1)$ has higher error for the small player because $\pi(S,\ell')\succeq_{S}\pi(S,1)\succ_{S}\pi(1,1)$. The first inequality comes from the following reasoning: if $\frac{d}{d\ell}err_{S}(\pi(S,\ell))$ is negative at $\ell=1$, then the error increases monotonically as $\ell$ decreases from $\ell'$ to 1. If $\frac{d}{d\ell}err_{S}(\pi(S,\ell))$ is positive at $\ell=1$, then we know that $\pi(S,1)\prec_{S}\pi(S,0)$, whereas $\pi(S,\ell')\succeq_{S}\pi(S,0)$.

  • No large player wishes to go to $\pi(S,\ell'+1)$: by the definition of $\ell'$, it would get greater error there than in $\pi(0,1)$.

  • No large player wishes to go to $\pi(0,2)$, for the same reason as above.

By this analysis, π​(S,β„“β€²)\pi(S,\ell^{\prime}) is individually stable. ∎
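The construction in this proof can also be written out procedurally. The sketch below (not part of the paper) uses the mean estimation errors under uniform federation for a two-type instance, computes $\ell'$, and returns the resulting individually stable arrangement; the instance parameters at the bottom are illustrative.

```python
# A sketch (not from the paper) of the construction in Theorem 5.5, using the
# uniform-federation mean estimation errors; all parameters are illustrative.
import numpy as np

def error(me, other_sizes, mu_e, sigma2):
    """Error of a player with `me` samples federating with players of `other_sizes`."""
    other_sizes = np.asarray(other_sizes, dtype=float)
    N = me + other_sizes.sum()
    return mu_e / N + sigma2 * (np.sum(other_sizes**2) + (N - me)**2) / N**2

def stable_arrangement(S, L, ns, nl, mu_e, sigma2):
    err_large_alone = error(nl, [], mu_e, sigma2)              # pi(0,1): mu_e / n_l
    # l' = max number of large players preferring pi(S, l) to being alone.
    l_prime = 0
    for l in range(1, L + 1):
        others = [ns] * S + [nl] * (l - 1)
        if error(nl, others, mu_e, sigma2) <= err_large_alone:
            l_prime = l
    # Do the small players prefer pi(S, l') to pi(S, 0)?
    small_mixed = error(ns, [ns] * (S - 1) + [nl] * l_prime, mu_e, sigma2)
    small_only = error(ns, [ns] * (S - 1), mu_e, sigma2)
    if small_mixed <= small_only:
        return f"pi(S={S}, l'={l_prime}) plus {L - l_prime} lone large players"
    return f"pi(S={S}, 0) plus {L} lone large players"

print(stable_arrangement(S=5, L=3, ns=2, nl=5, mu_e=5.0, sigma2=1.0))
```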

Lemma C.5, below, will be useful in Example C.6 in analyzing whether the individually stable arrangement produced by Theorem 5.5 is also core stable.

Lemma C.5.

Consider uniform federation with two coalitions $\pi(s_{1},\ell_{1})$ and $\pi(s_{2},\ell_{2})$ with $s_{2}<s_{1}$. Then, it is not possible to pick $\ell_{2}$ so that $\pi(s_{2},\ell_{2})\succ_{S}\pi(s_{1},\ell_{1})$ and $\pi(s_{2},\ell_{2})\succ_{L}\pi(s_{1},\ell_{1})$.

Proof.

To prove this, we will rely on Lemmas C.1, C.2, C.3, C.4. First, we consider a hypothetical β„“2β€²\ell_{2}^{\prime} defined such that

s1β‹…ns+β„“1β‹…nβ„“=s2β‹…ns+β„“2β€²β‹…nβ„“s_{1}\cdot n_{s}+\ell_{1}\cdot n_{\ell}=s_{2}\cdot n_{s}+\ell_{2}^{\prime}\cdot n_{\ell}

Note that $\ell_{2}'$ may not be an integer: we are using it only as a reference point (a hypothetical allocation), so this does not matter.

First, we will show that the small player gets greater error in $\pi(s_{2},\ell_{2}')$ than in $\pi(s_{1},\ell_{1})$. To do so, we rewrite $\ell_{2}'$ as:

β„“2β€²=(s1βˆ’s2)β‹…nsnβ„“+β„“1\ell_{2}^{\prime}=(s_{1}-s_{2})\cdot\frac{n_{s}}{n_{\ell}}+\ell_{1}

and plug it into the equation for the error of a small player. Note that most of the values stay the same: the entire $\mu_{e}$ term and the denominator of the $\sigma^{2}$ term. Similarly, note that with the given substitution,

((s2βˆ’1)β‹…ns+β„“2β€²β‹…nβ„“)2=((s1βˆ’1)β‹…ns+β„“1β‹…nβ„“)2((s_{2}-1)\cdot n_{s}+\ell_{2}^{\prime}\cdot n_{\ell})^{2}=((s_{1}-1)\cdot n_{s}+\ell_{1}\cdot n_{\ell})^{2}

The only term that changes is the other portion of the numerator, which becomes:

nβ„“2β‹…β„“1+nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2n_{\ell}^{2}\cdot\ell_{1}+n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}

We would like to show that it is larger than the relevant portion of the error for π​(s1,β„“1)\pi(s_{1},\ell_{1}), which is:

β„“1β‹…nβ„“2+(s1βˆ’1)β‹…ns2\ell_{1}\cdot n_{\ell}^{2}+(s_{1}-1)\cdot n_{s}^{2}

This is equivalent to showing:

nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2>(s1βˆ’1)β‹…ns2n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}>(s_{1}-1)\cdot n_{s}^{2}

We can lower bound the lefthand side using nβ„“>nsn_{\ell}>n_{s}:

nsβ‹…nβ„“β‹…(s1βˆ’s2)+(s2βˆ’1)β‹…ns2>ns2​(s1βˆ’s2)+(s2βˆ’1)β‹…ns2n_{s}\cdot n_{\ell}\cdot(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}>n_{s}^{2}(s_{1}-s_{2})+(s_{2}-1)\cdot n_{s}^{2}
=ns2β‹…(s1βˆ’1)=n_{s}^{2}\cdot(s_{1}-1)

which shows that the small player gets greater error in π​(s2,β„“2β€²)\pi(s_{2},\ell_{2}^{\prime}) than in π​(s1,β„“1)\pi(s_{1},\ell_{1}).

Next, we’ll show that the large player also gets greater error. We will first note that from the definition of the errors of the small and large players, we can write:

e​r​rS​(s1,β„“1)=e​r​rL​(s1,β„“1)+2​σ2​nβ„“βˆ’nsβ„“1β‹…nβ„“+s1β‹…nserr_{S}(s_{1},\ell_{1})=err_{L}(s_{1},\ell_{1})+2\sigma^{2}\frac{n_{\ell}-n_{s}}{\ell_{1}\cdot n_{\ell}+s_{1}\cdot n_{s}}

and

e​r​rS​(s2,β„“2β€²)=e​r​rL​(s2,β„“2β€²)+2​σ2​nβ„“βˆ’nsβ„“2β€²β‹…nβ„“+s2β‹…nserr_{S}(s_{2},\ell_{2}^{\prime})=err_{L}(s_{2},\ell_{2}^{\prime})+2\sigma^{2}\frac{n_{\ell}-n_{s}}{\ell_{2}^{\prime}\cdot n_{\ell}+s_{2}\cdot n_{s}}

Note that, by the definition of \ell_{2}^{\prime}, the additive term in each of these equalities is the same. We have also just shown that err_{S}(s_{2},\ell_{2}^{\prime})>err_{S}(s_{1},\ell_{1}). Combining these two equalities with that inequality implies that err_{L}(s_{2},\ell_{2}^{\prime})>err_{L}(s_{1},\ell_{1}).

We have shown that both the large and small players get higher error in π​(s2,β„“2β€²)\pi(s_{2},\ell_{2}^{\prime}) than π​(s1,β„“1)\pi(s_{1},\ell_{1}). Next, we will show that they have different preferences about whether they wish β„“2β€²\ell_{2}^{\prime} were larger or smaller, which means that no matter what β„“2\ell_{2} is, it will leave at least one of them with higher error.

First, we know from previous analysis in Lemma C.2 that the large players always wish there were fewer other large players in a coalition: the large players want β„“2<β„“2β€²\ell_{2}<\ell_{2}^{\prime}.

Secondly, we know from Lemma C.3 that the error the small player experiences first increases and then decreases as β„“\ell increases. Is it possible to pick an β„“2<β„“2β€²\ell_{2}<\ell_{2}^{\prime} so that the small player gets lower error there than in π​(s1,β„“1)\pi(s_{1},\ell_{1})?

Suppose that the derivative of e​r​rS​(s2,β„“)err_{S}(s_{2},\ell) with respect to β„“\ell is positive at β„“=β„“2\ell=\ell_{2}: then, reducing β„“\ell from β„“2β€²\ell_{2}^{\prime} to β„“2\ell_{2} might reduce the small player’s error. However, for every point where the small player’s derivative is positive, e​r​rS​(s2,β„“)>e​r​rS​(s2,0)>e​r​rS​(S,0)err_{S}(s_{2},\ell)>err_{S}(s_{2},0)>err_{S}(S,0): the small player would not wish to move here because it would get strictly higher error than it would get in π​(S,0)\pi(S,0).

Suppose instead that the derivative of err_{S}(s_{2},\ell) with respect to \ell is negative or zero at \ell=\ell_{2}. Then, if \ell_{2}<\ell_{2}^{\prime}, reducing the number of large players in the coalition from \ell_{2}^{\prime} to \ell_{2} would increase the error of the small players. This is also not an allocation that the small players would prefer.

Increasing \ell so that \ell_{2}>\ell_{2}^{\prime} is sufficiently large would satisfy the small player, but we have already shown that this would increase the error the large player experiences. As a result, it is not possible to pick an allocation that both the small and large players prefer to \pi(s_{1},\ell_{1}). ∎

Example C.6.

For uniform federation, the arrangement as produced in Theorem 5.5 is not necessarily core stable.

Theorem 5.5 produces an arrangement of the form \{\pi(S,\ell^{\prime}),\pi(0,1)\cdot(L-\ell^{\prime})\}, where we note that \ell^{\prime} could equal 0. As mentioned at the end of Section 5, this arrangement is not necessarily core stable; however, the reason why is subtle.

Theorem 5.5 checks that the given arrangement is individually stable, so the remaining cases to check would involve multiple players moving together to a new group. First, we will consider deviations consisting of homogeneous groups: for example, π​(s,0)\pi(s,0) or π​(0,β„“)\pi(0,\ell). From Lemma C.1 we know that small players always prefer having more small players in their coalition, so π​(s,0)β‰ΊSπ​(S,0)βͺ―Sπ​(S,β„“β€²)\pi(s,0)\prec_{S}\pi(S,0)\preceq_{S}\pi(S,\ell^{\prime}). Similarly, π​(0,β„“)β‰ΊLπ​(0,1)\pi(0,\ell)\prec_{L}\pi(0,1) and π​(0,β„“)β‰ΊLπ​(S,β„“β€²)\pi(0,\ell)\prec_{L}\pi(S,\ell^{\prime}) (if β„“β€²>0\ell^{\prime}>0), so the large players do not wish to unilaterally deviate.

Next, we might wonder whether some of the small players and large players in the \pi(S,\ell^{\prime}) coalition might wish to deviate to some \pi(s,\ell) with s<S: assume that \ell^{\prime}>0 for now. Lemma C.5 shows that this is not feasible: it is not possible to find a coalition \pi(s,\ell) with s<S where both the small and large players get lower error. That is, it is not possible to have both \pi(s,\ell)\succ_{S}\pi(S,\ell^{\prime}) and \pi(s,\ell)\succ_{L}\pi(S,\ell^{\prime}).

However, it still might be possible to have a deviating coalition. Recall from Theorem 5.5 that if \ell^{\prime}>0, \pi(S,\ell^{\prime})\succ_{L}\pi(0,1): the large players doing local learning get strictly greater error than the large players in \pi(S,\ell^{\prime}). Could it be possible to find an arrangement \pi(s,\ell) so that \pi(s,\ell)\succ_{S}\pi(S,\ell^{\prime}) and \pi(s,\ell)\succ_{L}\pi(0,1)?

The answer is yes. We show this by constructively producing an arrangement of players that is individually stable as produced by Theorem 5.5, but is not core stable because a subset of the small players in π​(S,β„“β€²)\pi(S,\ell^{\prime}) and the large players doing local learning both strictly prefer some π​(s,β„“)\pi(s,\ell).

For this arrangement, we set \mu_{e}=100,\sigma^{2}=1 (note the larger \mu_{e} value). We fix S=70 and L\geq 7. Then, we calculate the error players get in various arrangements. (The code to produce these calculations is given in our Github repository: https://github.com/kpdonahue/model_sharing_games.)

First, we calculate the individually stable arrangement as given in Theorem 5.5. We claim that \ell^{\prime}=3, so that the arrangement is \{\pi(70,3),\pi(0,1)\cdot(L-3)\}. For reference, we calculate the errors the small and large players get in various arrangements.

e​r​rS​(π​(70,3))=1.107322e​r​rS​(π​(70,0))=1.115584err_{S}(\pi(70,3))=1.107322\quad err_{S}(\pi(70,0))=1.115584
e​r​rL​(π​(70,3))=0.932690e​r​rL​(π​(0,1))=0.943396err_{L}(\pi(70,3))=0.932690\quad err_{L}(\pi(0,1))=0.943396
e​r​rL​(π​(70,4))=0.943664err_{L}(\pi(70,4))=0.943664

Note that \pi(70,3)\succ_{S}\pi(70,0) and \pi(70,3)\succ_{L}\pi(0,1): both the small and large players would rather participate. Because \pi(70,4)\prec_{L}\pi(0,1), we cannot add another large player to the \pi(70,3) coalition.

Next, we will show that the alternative coalition \pi(68,4) is one that both the small players and the large players doing local learning would strictly prefer.

e​r​rS​(π​(68,4))=1.105263e​r​rL​(π​(68,4))=0.943147err_{S}(\pi(68,4))=1.105263\quad err_{L}(\pi(68,4))=0.943147

Note that \pi(68,4)\succ_{S}\pi(70,3) and \pi(68,4)\succ_{L}\pi(0,1): the deviating small players and the deviating large players are both strictly better off, so the arrangement produced by Theorem 5.5 is not core stable.
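
The error values above can be reproduced with a few lines of code. The sketch below is an illustrative check, not part of the proof: it assumes the expected-MSE expression for uniform federation derived earlier in the paper, err_{j}=\mu_{e}/N+\sigma^{2}(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})/N^{2}, together with sample sizes n_{s}=11 and n_{\ell}=106, which reproduce the printed values.

```python
# Spot-check of the error values in Example C.6, assuming the uniform-federation
# expected MSE  err_j = mu_e / N + sigma2 * (sum_{i != j} n_i^2 + (N - n_j)^2) / N^2
# used earlier in the paper. The sample sizes below reproduce the printed numbers.

MU_E, SIGMA2 = 100.0, 1.0
N_SMALL, N_LARGE = 11, 106   # inferred small / large sample sizes

def err_uniform(n_j, others):
    """Expected MSE of a player holding n_j samples in a uniform-federation
    coalition whose other members hold the sample counts in `others`."""
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return MU_E / N + SIGMA2 * B / N ** 2

def err_small(s, l):
    """Error of one small player in the coalition pi(s, l)."""
    return err_uniform(N_SMALL, [N_SMALL] * (s - 1) + [N_LARGE] * l)

def err_large(s, l):
    """Error of one large player in the coalition pi(s, l)."""
    return err_uniform(N_LARGE, [N_SMALL] * s + [N_LARGE] * (l - 1))

print(err_small(70, 3), err_small(70, 0))   # ~1.107322, ~1.115584
print(err_large(70, 3), err_large(0, 1))    # ~0.932690, ~0.943396
print(err_large(70, 4))                     # ~0.943664
print(err_small(68, 4), err_large(68, 4))   # ~1.105263, ~0.943147
```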

Appendix D Supporting proofs for coarse-grained federation

See 6.1

Proof.

Taking the derivative of the error with respect to ww produces:

2​μe​(wnjβˆ’wN)βˆ’2β€‹βˆ‘iβ‰ jni2+(Nβˆ’nj)2N2β‹…(1βˆ’w)β‹…Οƒ22\mu_{e}\left(\frac{w}{n_{j}}-\frac{w}{N}\right)-2\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot(1-w)\cdot\sigma^{2}

Setting this equal to 0 and solving for ww produces wjw_{j} equal to

(βˆ‘iβ‰ jni2+(Nβˆ’nj)2)β‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+(βˆ‘iβ‰ jni2+(Nβˆ’nj)2)β‹…Οƒ2\frac{(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+(\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2})\cdot\sigma^{2}}

Note that this value wjw_{j} depends on the player jj that it is in reference to. It is also always strictly between 0 and 1. To confirm that this is a point of minimum error rather than maximum, we can take the second derivative of the error, which gives a result that is always positive:

2​μe​(1njβˆ’1N)+2β€‹βˆ‘iβ‰ jni2+(Nβˆ’nj)2N2β‹…Οƒ22\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+2\frac{\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}}{N^{2}}\cdot\sigma^{2}

∎
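
As a sanity check on this derivation, the following sketch (with hypothetical parameter values) computes w_{j} from the closed form above and verifies numerically that it lies strictly between 0 and 1 and minimizes the error expression over a fine grid of weights.

```python
# Numerical check of the optimal coarse-grained weight w_j derived above, using the
# error expression  err_j(w) = mu_e*(w^2/n_j + (1 - w^2)/N) + (B/N^2)*(1 - w)^2*sigma2
# with B = sum_{i != j} n_i^2 + (N - n_j)^2. The parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]       # hypothetical sample sizes of the coalition members
j = 0                          # player whose weight we optimize
n_j = sizes[j]
N = sum(sizes)
B = sum(n * n for i, n in enumerate(sizes) if i != j) + (N - n_j) ** 2

def err(w):
    return MU_E * (w * w / n_j + (1 - w * w) / N) + (B / N ** 2) * (1 - w) ** 2 * SIGMA2

w_j = (B * SIGMA2) / (MU_E * N ** 2 * (1 / n_j - 1 / N) + B * SIGMA2)

assert 0 < w_j < 1
# err is convex in w, so w_j should (approximately) minimize it on a fine grid.
grid_min = min(err(k / 10000) for k in range(10001))
assert grid_min >= err(w_j) - 1e-9
print(w_j, err(w_j))
```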

See 6.3

Proof.

For conciseness, we will rewrite the error and wjw_{j} weight with B=βˆ‘iβ‰ jni2+(Nβˆ’nj)2B=\sum_{i\neq j}n_{i}^{2}+(N-n_{j})^{2}. The error becomes:

ΞΌeβ‹…(wj2nj+1βˆ’wj2N)+BN2β‹…(1βˆ’wj)2​σ2\mu_{e}\cdot\left(\frac{w_{j}^{2}}{n_{j}}+\frac{1-w_{j}^{2}}{N}\right)+\frac{B}{N^{2}}\cdot(1-w_{j})^{2}\sigma^{2}

and the wjw_{j} weight becomes

Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2\frac{B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}}

Substituting in for wjw_{j} gives the error as:

ΞΌeN+ΞΌeβ‹…(1njβˆ’1N)β‹…B2β‹…(Οƒ2)2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2\frac{\mu_{e}}{N}+\mu_{e}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot\frac{B^{2}\cdot(\sigma^{2})^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}
+BN2β‹…Οƒ2β‹…ΞΌe2β‹…N4β‹…(1njβˆ’1N)2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2+\frac{B}{N^{2}}\cdot\sigma^{2}\cdot\frac{\mu_{e}^{2}\cdot N^{4}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}

Collecting and pulling out common terms:

ΞΌeN+ΞΌe​(1njβˆ’1N)β‹…Bβ‹…Οƒ2(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)2\frac{\mu_{e}}{N}+\frac{\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot B\cdot\sigma^{2}}{\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)^{2}}
β‹…(Bβ‹…Οƒ2+ΞΌeβ‹…N2β‹…(1njβˆ’1N))\cdot\left(B\cdot\sigma^{2}+\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\right)
=ΞΌeN+ΞΌe​(1njβˆ’1N)β‹…Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2=\frac{\mu_{e}}{N}+\frac{\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)\cdot B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}}
=ΞΌe2β‹…N2​(1njβˆ’1N)+ΞΌeβ‹…Bβ‹…Οƒ2+Nβ‹…ΞΌe​(1njβˆ’1N)​Bβ‹…Οƒ2Nβ‹…(ΞΌeβ‹…N2β‹…(1njβˆ’1N)+Bβ‹…Οƒ2)=\frac{\mu_{e}^{2}\cdot N^{2}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+\mu_{e}\cdot B\cdot\sigma^{2}+N\cdot\mu_{e}\left(\frac{1}{n_{j}}-\frac{1}{N}\right)B\cdot\sigma^{2}}{N\cdot\left(\mu_{e}\cdot N^{2}\cdot\left(\frac{1}{n_{j}}-\frac{1}{N}\right)+B\cdot\sigma^{2}\right)}

We multiply the top and bottom by Nβ‹…njN\cdot n_{j}:

ΞΌe2β‹…N2​(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…Nβ‹…nj+Nβ‹…ΞΌe​(Nβˆ’nj)​Bβ‹…Οƒ2Nβ‹…(ΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj)\frac{\mu_{e}^{2}\cdot N^{2}\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot N\cdot n_{j}+N\cdot\mu_{e}\left(N-n_{j}\right)B\cdot\sigma^{2}}{N\cdot\left(\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}\right)}
=ΞΌe2β‹…Nβ‹…(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…nj+ΞΌeβ‹…(Nβˆ’nj)β‹…Bβ‹…Οƒ2ΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj=\frac{\mu_{e}^{2}\cdot N\cdot\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot n_{j}+\mu_{e}\cdot\left(N-n_{j}\right)\cdot B\cdot\sigma^{2}}{\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}}
=ΞΌe2β‹…Nβ‹…(Nβˆ’nj)+ΞΌeβ‹…Bβ‹…Οƒ2β‹…NΞΌeβ‹…N2β‹…(Nβˆ’nj)+Bβ‹…Οƒ2β‹…Nβ‹…nj=\frac{\mu_{e}^{2}\cdot N\cdot\left(N-n_{j}\right)+\mu_{e}\cdot B\cdot\sigma^{2}\cdot N}{\mu_{e}\cdot N^{2}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}\cdot N\cdot n_{j}}
=ΞΌeβ‹…(Nβˆ’nj)+Bβ‹…Οƒ2Nβ‹…(Nβˆ’nj)+Bβ‹…Οƒ2ΞΌeβ‹…nj=\frac{\mu_{e}\cdot\left(N-n_{j}\right)+B\cdot\sigma^{2}}{N\cdot\left(N-n_{j}\right)+B\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot n_{j}}

which gives the desired value. ∎
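
The algebra above can also be checked numerically. For hypothetical parameter values, the sketch below verifies that the closed-form error equals the error expression evaluated at the optimal weight w_{j}.

```python
# Numerical check that the closed-form error above matches the error expression
# evaluated at the optimal weight w_j. Parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]
N = sum(sizes)
for j, n_j in enumerate(sizes):
    B = sum(n * n for i, n in enumerate(sizes) if i != j) + (N - n_j) ** 2
    w_j = (B * SIGMA2) / (MU_E * N ** 2 * (1 / n_j - 1 / N) + B * SIGMA2)
    direct = (MU_E * (w_j ** 2 / n_j + (1 - w_j ** 2) / N)
              + (B / N ** 2) * (1 - w_j) ** 2 * SIGMA2)
    closed = (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 / MU_E * n_j)
    assert abs(direct - closed) < 1e-9
    print(j, direct, closed)
```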

Next, we will prove a series of lemmas describing how the small and large players in a coalition \pi(s,\ell) using optimal coarse-grained federation see their error change as s and \ell increase.

Lemma D.1.

For optimal coarse-grained federation, small players always see their error decrease with the addition of more small players. That is,

s2>s1⇒π​(s2,β„“)≻Sπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{S}\pi(s_{1},\ell)
Proof.

The error that the small players experience is gS​(s,β„“)hS​(s,β„“)\frac{g_{S}(s,\ell)}{h_{S}(s,\ell)} where

gS​(s,β„“)=ΞΌeβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)g_{S}(s,\ell)=\mu_{e}\cdot\left((s-1)\cdot n_{s}+\ell\cdot n_{\ell}\right)
+((sβˆ’1)β‹…ns2+β„“β‹…nβ„“2+((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2)β‹…Οƒ2+\left((s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+\left((s-1)\cdot n_{s}+\ell\cdot n_{\ell}\right)^{2}\right)\cdot\sigma^{2}
hS​(s,β„“)=((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2+nsβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)h_{S}(s,\ell)=((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2}+n_{s}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
+nsβ‹…Οƒ2ΞΌeβ‹…((sβˆ’1)β‹…ns2+β„“β‹…nβ„“2+((sβˆ’1)β‹…ns+β„“β‹…nβ„“)2)+n_{s}\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot((s-1)\cdot n_{s}^{2}+\ell\cdot n_{\ell}^{2}+((s-1)\cdot n_{s}+\ell\cdot n_{\ell})^{2})

Next, we consider the derivative of the error term with respect to ss, the number of small players. If this is always negative, then small players always see their error decrease with the addition of more small players. The derivative is negative when:

dd​s​[gS​(s,β„“)]β‹…hS​(s,β„“)βˆ’gS​(s,β„“)β‹…dd​s​[hS​(s,β„“)]<0\frac{d}{ds}\left[g_{S}(s,\ell)\right]\cdot h_{S}(s,\ell)-g_{S}(s,\ell)\cdot\frac{d}{ds}\left[h_{S}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nsβ‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)-n_{s}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
β‹…((sβˆ’1)β‹…nsβ‹…(ΞΌe+nsβ‹…Οƒ2)+β„“β‹…nβ„“β‹…(ΞΌe+Οƒ2β‹…(2​nβ„“βˆ’ns)))\cdot\left((s-1)\cdot n_{s}\cdot(\mu_{e}+n_{s}\cdot\sigma^{2})+\ell\cdot n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{\ell}-n_{s}))\right)

Each individual term is positive: note that nβ„“>nsn_{\ell}>n_{s}, so 2​nβ„“βˆ’ns>02n_{\ell}-n_{s}>0. The overall term is negative, meaning that the overall derivative of the error is always negative. ∎

Lemma D.2.

For optimal coarse-grained federation, there exists some β„“0\ell_{0} such that for all β„“>β„“0\ell>\ell_{0}, the small players always see their error decrease with the addition of more large players.

Proof.

The error that the small player experiences can be given by the ratio of gS​(β‹…)/hS​(β‹…)g_{S}(\cdot)/h_{S}(\cdot) as before. In this case, we are interested in the derivative with respect to β„“\ell, which is negative when:

dd​ℓ​[gS​(s,β„“)]β‹…hS​(s,β„“)βˆ’gS​(s,β„“)β‹…dd​ℓ​[hS​(s,β„“)]<0\frac{d}{d\ell}\left[g_{S}(s,\ell)\right]\cdot h_{S}(s,\ell)-g_{S}(s,\ell)\cdot\frac{d}{d\ell}\left[h_{S}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nβ„“β‹…((sβˆ’1)β‹…ns+β„“β‹…nβ„“)-n_{\ell}\cdot((s-1)\cdot n_{s}+\ell\cdot n_{\ell})
β‹…((sβˆ’1)ns(ΞΌe+Οƒ2(2nsβˆ’nβ„“))+β„“β‹…nβ„“(ΞΌe+Οƒ2β‹…nβ„“)))\cdot\left((s-1)n_{s}(\mu_{e}+\sigma^{2}(2n_{s}-n_{\ell}))+\ell\cdot n_{\ell}(\mu_{e}+\sigma^{2}\cdot n_{\ell}))\right)

This term is negative whenever:

\ell>-\frac{(s-1)\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))}{n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot n_{\ell})}

So, for sufficiently large β„“\ell, the derivative of the small player’s error is always negative. Note that if the term on the RHS is negative, then the derivative is negative for any β„“β‰₯0\ell\geq 0. ∎

Lemma D.3.

For optimal coarse-grained federation, there exists some β„“1\ell_{1} such that for all β„“>β„“1\ell>\ell_{1}, the large players always see their error decrease with the addition of more large players. If there are no small players (s=0s=0), then large players would most prefer to all federate together in π​(0,L)\pi(0,L).

Proof.

The error that the large players experience is gL​(s,β„“)hL​(s,β„“)\frac{g_{L}(s,\ell)}{h_{L}(s,\ell)} where

gL​(s,β„“)=ΞΌeβ‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)g_{L}(s,\ell)=\mu_{e}\cdot\left(s\cdot n_{s}+(\ell-1)\cdot n_{\ell}\right)
+(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)β‹…Οƒ2+\left(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+\left(s\cdot n_{s}+(\ell-1)\cdot n_{\ell}\right)^{2}\right)\cdot\sigma^{2}
h_{L}(s,\ell)=(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2}+n_{\ell}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
+nβ„“β‹…Οƒ2ΞΌeβ‹…(sβ‹…ns2+(β„“βˆ’1)β‹…nβ„“2+(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)2)+n_{\ell}\cdot\frac{\sigma^{2}}{\mu_{e}}\cdot(s\cdot n_{s}^{2}+(\ell-1)\cdot n_{\ell}^{2}+(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})^{2})

The derivative of the error with respect to β„“\ell, the number of large players, is negative whenever:

dd​ℓ​[gL​(s,β„“)]β‹…hL​(s,β„“)βˆ’gL​(s,β„“)β‹…dd​ℓ​[hL​(s,β„“)]<0\frac{d}{d\ell}\left[g_{L}(s,\ell)\right]\cdot h_{L}(s,\ell)-g_{L}(s,\ell)\cdot\frac{d}{d\ell}\left[h_{L}(s,\ell)\right]<0

Calculating and simplifying the lefthand side of this inequality gives:

βˆ’nβ„“β‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)-n_{\ell}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
\cdot\left((\ell-1)\cdot n_{\ell}\cdot(\mu_{e}+n_{\ell}\cdot\sigma^{2})+s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))\right)

This term is negative whenever:

\ell>1-\frac{s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{s}-n_{\ell}))}{n_{\ell}\cdot(\mu_{e}+n_{\ell}\cdot\sigma^{2})}

Note that if s=0s=0, this reduces to β„“>1\ell>1, which says that whenever large players are arranged without small players, they would most prefer to all be together. ∎

Lemma D.4.

For optimal coarse-grained federation, large players always see their error decrease with the addition of more small players. Because of this, π​(s,β„“)≻Lπ​(0,β„“)\pi(s,\ell)\succ_{L}\pi(0,\ell) for all s>0s>0: large players would always prefer federating with more small players.

s2>s1⇒π​(s2,β„“)≻Lπ​(s1,β„“)s_{2}>s_{1}\quad\Rightarrow\quad\pi(s_{2},\ell)\succ_{L}\pi(s_{1},\ell)
Proof.

The error that the large players experience can be given by the ratio gL​(β‹…)/hL​(β‹…)g_{L}(\cdot)/h_{L}(\cdot) as before. In this case, we are interested in the derivative with respect to ss, which is negative when:

dd​s​[gL​(s,β„“)]β‹…hL​(s,β„“)βˆ’gL​(s,β„“)β‹…dd​s​[hL​(s,β„“)]<0\frac{d}{ds}\left[g_{L}(s,\ell)\right]\cdot h_{L}(s,\ell)-g_{L}(s,\ell)\cdot\frac{d}{ds}\left[h_{L}(s,\ell)\right]<0

Gathering and simplifying the lefthand side of this inequality gives:

βˆ’nsβ‹…(sβ‹…ns+(β„“βˆ’1)β‹…nβ„“)-n_{s}\cdot(s\cdot n_{s}+(\ell-1)\cdot n_{\ell})
\cdot\left(s\cdot n_{s}\cdot(\mu_{e}+\sigma^{2}\cdot n_{s})+(\ell-1)\cdot n_{\ell}\cdot(\mu_{e}+\sigma^{2}\cdot(2n_{\ell}-n_{s}))\right)

Each individual term is nonnegative, and the expression is nonzero whenever the coalition contains at least one player besides the large player in question: note that n_{\ell}>n_{s}, so 2n_{\ell}-n_{s}>0. The overall term is therefore negative, meaning that the derivative of the error is always negative. ∎
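
The monotonicity claims of Lemmas D.1 and D.4 can be spot-checked numerically using the optimal coarse-grained error from Lemma 6.3. The sketch below uses hypothetical parameter values and checks a single setting; it illustrates the lemmas but is not a substitute for the proofs.

```python
# Spot-check of Lemmas D.1 and D.4 for one hypothetical parameter setting, using the
# optimal coarse-grained error from Lemma 6.3:
#   err_j = (mu_e*(N - n_j) + B*sigma2) / (N*(N - n_j) + B*sigma2*n_j/mu_e).

MU_E, SIGMA2 = 10.0, 2.0
N_SMALL, N_LARGE = 5, 50     # hypothetical small / large sample sizes

def err_opt(n_j, others):
    if not others:           # a player alone simply uses its local model: error mu_e / n_j
        return MU_E / n_j
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 * n_j / MU_E)

def err_small(s, l):
    return err_opt(N_SMALL, [N_SMALL] * (s - 1) + [N_LARGE] * l)

def err_large(s, l):
    return err_opt(N_LARGE, [N_SMALL] * s + [N_LARGE] * (l - 1))

for l in range(0, 6):
    # Lemma D.1: a small player's error strictly decreases as s grows (l fixed).
    vals = [err_small(s, l) for s in range(1, 40)]
    assert all(a > b for a, b in zip(vals, vals[1:]))
for l in range(1, 6):
    # Lemma D.4: a large player's error strictly decreases as s grows (l fixed).
    vals = [err_large(s, l) for s in range(0, 40)]
    assert all(a > b for a, b in zip(vals, vals[1:]))
print("monotonicity checks passed")
```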

Given these lemmas, we will divide the error curves into regions. Note that err_{S}(s,\ell=\ell^{\prime}) and err_{L}(s,\ell=\ell^{\prime}) are both monotonically decreasing in s. We will refer to the section of err_{S}(s,\ell=\ell^{\prime}) with error greater than the error the small players get in \pi(S,0) as B and the region with error lower than this as C. Similarly, we will define the region of err_{L}(s,\ell=\ell^{\prime}) where the error is higher than the error the large players get in \pi(0,L) as B, and the region where the error is lower as C.

Note that, by Lemma D.2, err_{S}(s=s^{\prime},\ell) is decreasing for all \ell>\ell_{0}; before that, it is increasing. We will call the region where the error is increasing A, the region where the error is decreasing but still greater than the error the small player gets in \pi(S,0) B, and the final region, where the error is decreasing and lower than or equal to the error in \pi(S,0), C.

We will not divide the regions of curve e​r​rL​(s=sβ€²,β„“)err_{L}(s=s^{\prime},\ell) because the proof below will not depend on those results.

We will use \pi_{d}=\{\pi(S,0),\pi(0,L)\} as a shorthand.

See 6.5

Proof.

In this proof, we will use the results of Lemmas D.1, D.2, D.3, and D.4. The statement of the theorem has three parts (including the equality Ο€g=Sπ​(S,0)\pi_{g}=_{S}\pi(S,0)), which we will prove in sequence. We note that Lemma D.4 indicates that π​(S,L)≻Lπ​(0,L)\pi(S,L)\succ_{L}\pi(0,L) always: the large players always prefer federating with the smaller players to being alone.

Case: π​(S,L)β‰ΊSπ​(S,0)\pi(S,L)\prec_{S}\pi(S,0)
In this case, we will show that Ο€d={π​(S,0),π​(0,L)}\pi_{d}=\{\pi(S,0),\pi(0,L)\} is strictly core stable.

We know from the precondition that the point \pi(S,L) is in either region A or B of err_{S}(s=S,\ell). We will show that any other arrangement \pi(s^{\prime},\ell^{\prime}) is one where the small players get higher error than in \pi(S,0). We will assume \ell^{\prime}>0: if \ell^{\prime}=0 and s^{\prime}<S, then we know \pi(s^{\prime},0)\prec_{S}\pi(S,0) by Lemma D.1. We will do this by starting at the point \pi(s^{\prime},\ell^{\prime}) and moving along the relevant error curves towards \pi(S,0) and \pi(S,L) to show that the small players experience strictly greater error than in \pi(S,0).

First, we will consider the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}). From Lemma D.1, we know that it is decreasing in ss. We will move ss from sβ€²s^{\prime} to SS, which decreases error. (If sβ€²=Ss^{\prime}=S, then this leaves error unchanged.) Next, we will consider the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) on the curve e​r​rS​(s=S,β„“)err_{S}(s=S,\ell). If it is in region AA or BB, then we know

π​(sβ€²,β„“β€²)βͺ―Sπ​(S,β„“β€²)β‰ΊSπ​(S,0)\pi(s^{\prime},\ell^{\prime})\preceq_{S}\pi(S,\ell^{\prime})\prec_{S}\pi(S,0)

We will show that it is impossible for it to be in region CC. If it were, then moving along this curve from β„“β€²\ell^{\prime} to LL would only decrease the error the small player experiences, because the slope is decreasing with β„“\ell in region CC. But then that would imply that π​(S,L)\pi(S,L) is in region CC of this graph, and so π​(S,L)≻Sπ​(S,0)\pi(S,L)\succ_{S}\pi(S,0), which contradicts the precondition of this case.

This shows that the small players minimize their error in π​(S,0)\pi(S,0), so they will not even weakly prefer any other arrangement. Given this, any deviation from Ο€d\pi_{d} could only involve large players. However, by Lemma D.3, out of any arrangement excluding small players, the large players minimize their error in π​(0,L)\pi(0,L) (their current arrangement) so they would not even weakly prefer any other arrangement.

This implies that \pi_{d} is strictly core stable given \pi(S,L)\prec_{S}\pi(S,0).

Case: π​(S,L)=Sπ​(S,0)\pi(S,L)=_{S}\pi(S,0)
In this case, we assume that the small players are indifferent between Ο€g\pi_{g} and Ο€d\pi_{d} and will show that Ο€g\pi_{g} is strictly core stable.

The key part of this proof is to show that, if the small players are indifferent between Ο€g\pi_{g} and π​(S,0)\pi(S,0), then they get strictly greater error in every other arrangement π​(sβ€²,β„“β€²)\pi(s^{\prime},\ell^{\prime}) that isn’t equal to π​(S,0)\pi(S,0) or π​(S,L)\pi(S,L).

The precondition means that \pi(S,L) is in region C of both graphs. To move to \pi(s^{\prime},\ell^{\prime}), first we move along err_{S}(s=S,\ell) from L to \ell^{\prime}. In doing so, we end up in either region A or B of the curve: note that we cannot remain in region C unless \ell^{\prime}=L. This is because, by definition of the regions, region C has error that decreases with \ell, which means that error strictly increases from \pi(S,L) as \ell is decreased. This means that if \ell^{\prime}<L, \pi(S,\ell^{\prime}) gives the small player greater error than it gets in \pi(S,0) (or equivalently, in \pi_{g}).

If the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) is in region AA or BB of e​r​rS​(s=S,β„“)err_{S}(s=S,\ell), then it is also in region BB of the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}). Reducing ss from SS to sβ€²s^{\prime} can only increase the error the small player experiences. This shows that the small players get strictly greater error in any arrangement that is not π​(S,0)\pi(S,0) or π​(S,L)\pi(S,L).

Note: if β„“β€²=L\ell^{\prime}=L, then point π​(S,β„“β€²)=π​(S,L)\pi(S,\ell^{\prime})=\pi(S,L) is in region CC of both curves, but because π​(sβ€²,β„“β€²)≠π​(S,L)\pi(s^{\prime},\ell^{\prime})\neq\pi(S,L), we must have sβ€²<Ss^{\prime}<S, so π​(sβ€²,β„“β€²)β‰ΊSπ​(S,L)=Sπ​(S,0)\pi(s^{\prime},\ell^{\prime})\prec_{S}\pi(S,L)=_{S}\pi(S,0).

Next, we will show that Ο€g\pi_{g} is strictly core stable.

As shown above, the small players get higher error in any arrangement besides Ο€g\pi_{g} and π​(S,0)\pi(S,0). If they are in Ο€g\pi_{g}, they cannot be convinced to deviate to any arrangement except π​(S,0)\pi(S,0), and then only weakly (they get identical error). Conditional on the small players all being in Ο€g\pi_{g}, the large players would most prefer to be in Ο€g\pi_{g}, since by Lemma D.3 they strictly prefer it to π​(0,L)\pi(0,L). This implies that there does not exist a group of players where all weakly would like to defect and at least one strictly wishes to defect, the condition for strict core stability.

Case: π​(S,L)≻Sπ​(S,0)\pi(S,L)\succ_{S}\pi(S,0):
We will show that Ο€g\pi_{g} is strictly core stable.

The precondition means that π​(S,L)\pi(S,L) is in section CC of both small player graphs. Consider an arbitrary other coalition π​(sβ€²,β„“β€²)\pi(s^{\prime},\ell^{\prime}): we will show that the small players get strictly greater error here than in π​(S,L)\pi(S,L).

First, we start at the arrangement \pi(S,L). Consider moving along the curve err_{S}(s=S,\ell): we hold s constant and reduce \ell from L to \ell^{\prime}. We could end up in region C, B, or A of this curve.

First, assume we end up in section C. This means that, as we have reduced \ell, our error has been monotonically increasing (unless \ell^{\prime}=L, in which case it is unchanged). We also know that, on the curve err_{S}(s,\ell=\ell^{\prime}), the point \pi(S,\ell^{\prime}) is in region C. Next, we similarly move along this curve to reduce s from S to s^{\prime}. From Lemma D.1, we know err_{S}(s,\ell=\ell^{\prime}) is monotonically decreasing in s, which means reducing s monotonically increases the small player's error (unless s^{\prime}=S, in which case the error is unchanged). Because we have assumed either s^{\prime}\neq S or \ell^{\prime}\neq L, we have produced a path of monotonically increasing error from \pi(S,L) to \pi(s^{\prime},\ell^{\prime}), so we know the small player experiences greater error there.

Next, we will instead assume that we ended up in region AA or BB of the e​r​rS​(s=S,β„“)err_{S}(s=S,\ell) curve when we reduced β„“\ell. By definition of the AA and BB regions, we know that the small player experiences larger error in π​(S,β„“β€²)\pi(S,\ell^{\prime}) than it would in π​(S,0)\pi(S,0). (The one exception is the point π​(S,0)\pi(S,0), which is in region AA and which obviously gives equal error to π​(S,0)\pi(S,0)). Then, we know that in the curve e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}), the point π​(S,β„“β€²)\pi(S,\ell^{\prime}) must be in region BB. Because e​r​rS​(s,β„“=β„“β€²)err_{S}(s,\ell=\ell^{\prime}) is monotonically decreasing in ss, reducing ss from SS to sβ€²s^{\prime} monotonically increases error. We then know that:

π​(sβ€²,β„“β€²)βͺ―Sπ​(S,β„“β€²)βͺ―Sπ​(S,0)β‰ΊSπ​(S,L)\pi(s^{\prime},\ell^{\prime})\preceq_{S}\pi(S,\ell^{\prime})\preceq_{S}\pi(S,0)\prec_{S}\pi(S,L)

where the last inequality comes from the premise of this case. The first and second inequalities are strict if s^{\prime}<S or \ell^{\prime}\neq 0, respectively.

We have just shown that the small player prefers π​(S,L)\pi(S,L) to any other arrangement. If there were going to be a defecting group, it would have to involve only large players. However, the arrangement where large players (federating by themselves) get lowest error is in π​(0,L)\pi(0,L), which by Lemma D.4 gives them higher error than π​(S,L)\pi(S,L).

By this reasoning, Ο€g\pi_{g} is strictly core stable. ∎
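
For a concrete instance, the case distinction above can be evaluated directly from the Lemma 6.3 error expression. The sketch below (with hypothetical parameter values) computes the small players' error in \pi(S,L) and in \pi(S,0) and reports which case of the theorem applies.

```python
# Evaluating the case condition of the theorem for a concrete (hypothetical) instance,
# using the optimal coarse-grained error from Lemma 6.3.

MU_E, SIGMA2 = 10.0, 2.0
N_SMALL, N_LARGE = 5, 50   # hypothetical sample sizes
S, L = 8, 3                # hypothetical numbers of small and large players

def err_opt(n_j, others):
    N = n_j + sum(others)
    B = sum(n * n for n in others) + (N - n_j) ** 2
    return (MU_E * (N - n_j) + B * SIGMA2) / (N * (N - n_j) + B * SIGMA2 * n_j / MU_E)

err_small_grand = err_opt(N_SMALL, [N_SMALL] * (S - 1) + [N_LARGE] * L)  # pi(S, L)
err_small_split = err_opt(N_SMALL, [N_SMALL] * (S - 1))                  # pi(S, 0)
if err_small_grand > err_small_split:
    print("pi(S, L) gives small players higher error: pi_d is strictly core stable")
else:
    print("pi(S, L) gives small players weakly lower error: pi_g is strictly core stable")
```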

Appendix E Supporting proofs for fine-grained federation

See 7.1

Proof.

To minimize, we take the derivative of player j's error with respect to the weight v_{jk}. Note that v_{jj}=1-\sum_{i\neq j}v_{ji}=1-v_{jk}-\sum_{i\neq j,i\neq k}v_{ji}, so v_{jk} appears twice in the component involving \mu_{e}. Rewriting the error gives:

ΞΌeβ€‹βˆ‘iβ‰ jMvj​i2ni+ΞΌe​(1βˆ’βˆ‘iβ‰ jvj​i)2nj\mu_{e}\sum_{i\neq j}^{M}\frac{v_{ji}^{2}}{n_{i}}+\mu_{e}\frac{(1-\sum_{i\neq j}v_{ji})^{2}}{n_{j}}
+(βˆ‘iβ‰ jvj​i2+(βˆ‘iβ‰ jvj​i)2)β‹…Οƒ2+\left(\sum_{i\neq j}v_{ji}^{2}+\left(\sum_{i\neq j}v_{ji}\right)^{2}\right)\cdot\sigma^{2}

Taking the derivative with respect to vj​kv_{jk} gives:

ΞΌe​2​vj​knkβˆ’2​μe​1βˆ’βˆ‘iβ‰ jvj​inj+Οƒ2​(2β€‹βˆ‘iβ‰ jvj​i+2​vj​k)\mu_{e}\frac{2v_{jk}}{n_{k}}-2\mu_{e}\frac{1-\sum_{i\neq j}v_{ji}}{n_{j}}+\sigma^{2}\left(2\sum_{i\neq j}v_{ji}+2v_{jk}\right)

To confirm that we are finding a minimum rather than a maximum, we note that the second derivative is always positive:

ΞΌe​2nk+ΞΌe​2nj+Οƒ2​(2+2)>0\mu_{e}\frac{2}{n_{k}}+\mu_{e}\frac{2}{n_{j}}+\sigma^{2}\left(2+2\right)>0

We first simplify the derivative by substituting in for vj​jv_{jj}:

ΞΌe​2​vj​knkβˆ’2​μe​vj​jnj+2​σ2​(1βˆ’vj​j+vj​k)=0\mu_{e}\frac{2v_{jk}}{n_{k}}-2\mu_{e}\frac{v_{jj}}{n_{j}}+2\sigma^{2}\left(1-v_{jj}+v_{jk}\right)=0

And then solve for vj​kv_{jk} to obtain:

vj​k=vj​jβ‹…(Οƒ2+ΞΌenj)βˆ’Οƒ2Οƒ2+ΞΌenkv_{jk}=\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)-\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{k}}}

To find vj​jv_{jj}, we use that all of the weights sum up to 1:

v_{jj}+\sum_{i\neq j}\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)-\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}+\sum_{i\neq j}\frac{v_{jj}\cdot\left(\sigma^{2}+\frac{\mu_{e}}{n_{j}}\right)}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}-\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}\left(1+\sum_{i\neq j}\frac{\sigma^{2}+\frac{\mu_{e}}{n_{j}}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}\right)-\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}=1
v_{jj}=\frac{1+\sum_{i\neq j}\frac{\sigma^{2}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}}{1+\sum_{i\neq j}\frac{\sigma^{2}+\frac{\mu_{e}}{n_{j}}}{\sigma^{2}+\frac{\mu_{e}}{n_{i}}}}

Next, we define Vi=Οƒ2+ΞΌeniV_{i}=\sigma^{2}+\frac{\mu_{e}}{n_{i}}. This allows us to rewrite the term as:

vj​j=1+Οƒ2β€‹βˆ‘iβ‰ j1Vi1+Vjβ€‹βˆ‘iβ‰ j1Viv_{jj}=\frac{1+\sigma^{2}\sum_{i\neq j}\frac{1}{V_{i}}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}

Similarly, we can rewrite:

vj​k=1Vkβ‹…Vjβˆ’Οƒ21+Vjβ€‹βˆ‘iβ‰ j1Viv_{jk}=\frac{1}{V_{k}}\cdot\frac{V_{j}-\sigma^{2}}{1+V_{j}\sum_{i\neq j}\frac{1}{V_{i}}}

∎
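
As a final sanity check, the sketch below (with hypothetical parameter values) computes the fine-grained weights v_{jj} and v_{jk} from the expressions above, confirms that they sum to one, and verifies that small perturbations of the weights that preserve the sum do not decrease player j's error.

```python
# Numerical check of the optimal fine-grained weights derived above, using the error
#   err_j(v) = mu_e * sum_i v_ji^2 / n_i
#              + sigma2 * (sum_{i != j} v_ji^2 + (sum_{i != j} v_ji)^2).
# Parameter values are hypothetical.

MU_E, SIGMA2 = 10.0, 2.0
sizes = [5, 20, 50, 100]       # hypothetical sample sizes
j = 1                          # player whose weights we compute

V = [SIGMA2 + MU_E / n for n in sizes]                     # V_i = sigma^2 + mu_e / n_i
inv_sum = sum(1 / V[i] for i in range(len(sizes)) if i != j)
v_jj = (1 + SIGMA2 * inv_sum) / (1 + V[j] * inv_sum)
v = [v_jj if k == j else (1 / V[k]) * (V[j] - SIGMA2) / (1 + V[j] * inv_sum)
     for k in range(len(sizes))]

assert abs(sum(v) - 1) < 1e-12                             # the weights sum to one

def err(weights):
    # mu_e * sum_i v_ji^2 / n_i + sigma^2 * (sum_{i != j} v_ji^2 + (sum_{i != j} v_ji)^2)
    off = [w for k, w in enumerate(weights) if k != j]
    return (MU_E * sum(w * w / n for w, n in zip(weights, sizes))
            + SIGMA2 * (sum(w * w for w in off) + sum(off) ** 2))

base = err(v)
for k in range(len(sizes)):
    if k == j:
        continue
    for eps in (-1e-3, 1e-3):
        pert = list(v)
        pert[k] += eps         # perturb one off-diagonal weight
        pert[j] -= eps         # let v_jj absorb the change so the weights still sum to one
        assert err(pert) >= base - 1e-12
print(v, base)
```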