Collaborative Machine Learning with Incentive-Aware Model Rewards

Rachael Hwee Ling Sim Yehong Zhang Mun Choon Chan Bryan Kian Hsiang Low

Abstract

Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties. However, these parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions. This motivates the need for measuring a party’s contribution and designing an incentive-aware reward scheme accordingly. This paper proposes to value a party’s reward based on Shapley value and information gain on model parameters given its data. Subsequently, we give each party a model as a reward. To formally incentivize the collaboration, we define some desirable properties (e.g., fairness and stability) which are inspired by cooperative game theory but adapted for our model reward that is uniquely freely replicable. Then, we propose a novel model reward scheme to satisfy fairness and trade off between the desirable properties via an adjustable parameter. The value of each party’s model reward determined by our scheme is attained by injecting Gaussian noise to the aggregated training data with an optimized noise variance. We empirically demonstrate interesting properties of our scheme and evaluate its performance using synthetic and real-world datasets.

Data Valuation, Incentives in Machine Learning

1 Introduction

Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models. While an individual party may have limited data, it is possible to build improved, high-quality ML models by training on the aggregated data from many parties. For example, in healthcare, a hospital or healthcare firm whose data diversity and quantity are limited due to its small patient base can draw on data from other hospitals and firms to improve the prediction of some disease progression (e.g., diabetes) (Center for Open Data Enterprise, 2019). This collaboration can be encouraged by a government agency, such as the National Institute of Health in the United States. In precision agriculture, a farmer with limited land area and sensors can combine his collected data with the other farmers to improve the modeling of the effect of various influences (e.g., weather, pest) on his crop yield (Claver, 2019). Such data sharing also benefits other application domains, including real estate in which a property agency can pool together its limited transactional data with that of the other agencies to improve the prediction of property prices (Conway, 2018).

However, any party would have incurred some nontrivial cost to collect its data. So, they would not altruistically donate their data and risk losing their competitive edge. These parties will be motivated to share their data when given enough incentives, such as a guaranteed benefit from the collaboration and a fair higher reward from contributing more valuable data. To this end, we propose to value each party’s contributed data and design an incentive-aware reward scheme to give each party a separate ML model as a reward (in short, model reward) accordingly. We use only model rewards and exclude monetary compensations as (a) in the above-mentioned applications, every party such as a hospital is mainly interested in improving model quality for unlimited future test predictions; (b) there may not be a feasible and clear source of monetary revenue to compensate participants (e.g., due to restrictions, the government may not be able to pay firms using tax revenues); and (c) if parties have to pay to participate in the collaboration, then the total payment is debatable and many may lack funding while still having valuable data.

How then should we value a party’s data and its effect on model quality? To answer this first question, we propose a valuation method based on the informativeness of data. In particular, more informative data induces a greater reduction in the uncertainty of the model parameters, hence improving model quality. In contrast, existing data valuation methods (Ghorbani & Zou, 2019; Ghorbani et al., 2020; Jia et al., 2019a, b; Yoon et al., 2020) measure model quality via its validation accuracy, which calls for a tedious or even impossible process of needing all parties to agree on a common validation dataset. Inaccurate valuation can also arise due to how a party’s test queries, which are likely to change over time, differ from the validation dataset. Our data valuation method does not make any assumption about the current or future distribution of test queries.

Next, how should we design a reward scheme to decide the values of model rewards for incentivizing a collaboration? Intuitively, a party will be motivated to collaborate if it can receive a better ML model than others who have contributed less valuable data, and than what it can build alone or get from collaborating separately with some parties. Also, the parties often like to maximize the total benefit from the collaboration. These incentives appear related to solution concepts (fairness, individual rationality, stability, and group welfare) from cooperative game theory (CGT), respectively. However, as CGT assumptions are restrictive for the uniquely freely replicable¹¹1Data and model reward, like digital goods, can be replicated at zero marginal cost and given to more parties. nature of our model reward, these CGT concepts have to be adapted for defining incentives for our model reward. We then design a novel model reward scheme to provide these incentives.

Finally, how can we realize the values of the model rewards decided by our scheme? How should we modify the model rewards or the data used to train them? An obvious approach to control the values of the model rewards is to select and only train on subsets of the aggregated data. However, this requires considering an exponential number of discrete subsets of data, which is intractable for large datasets such as medical records. We avoid this issue by injecting noise into the aggregated data from multiple parties instead. The value of each party’s model reward can then be realized by simply optimizing the continuous noise variance parameter.

The specific contributions of our work here include:

$\bullet$

Proposing a data valuation method using the information gain (IG) on model parameters given the data (Sec. 3);
$\bullet$

Defining new conditions for incentives (i.e., Shapley fairness, stability, individual rationality, and group welfare) that are suitable for the freely replicable nature of our model reward (Sec. 4.1). As these incentives cannot be provided all at the same time, we design a novel model reward scheme with an adjustable parameter to trade off between them while maintaining fairness (Sec. 4.2);
$\bullet$

Injecting Gaussian noise into the aggregated data from multiple parties and optimizing the noise variance parameter for realizing the values of the model rewards decided by our scheme (Sec. 4.3); and
$\bullet$

Demonstrating interesting properties of our model reward scheme empirically and evaluating its performance with synthetic and real-world datasets (Sec. 5).

To the best of our knowledge, our work here is the first to propose a collaborative ML scheme that formally considers incentives beyond fairness and relies solely on model rewards to realize them. Existing works (Jia et al., 2019a, b; Ohrimenko et al., 2019) have only looked at fairness and have to resort to monetary compensations if considered.

2 Problem Formulation

In our problem setting, we consider $n$ honest and non-malicious parties, each of whom owns some data and assume the availability of a trusted central party²²2In reality, such a central party can be found in established data sharing platforms like Ocean Protocol (Ocean Protocol Foundation, 2019) and Data Republic (https://datarepublic.com). who aggregates data from these parties, measures the value of their data, and distributes a resulting trained ML model to each party. We first introduce the notations and terminologies used in this work: Let $N\triangleq\{1,\ldots,n\}$ denote a set of $n$ parties. Any subset $C\subseteq N$ is called a coalition of parties. The grand coalition is the set $N$ of all parties. Parties will team up and partition themselves into a coalition structure $CS$ . Formally, $CS$ is a set of coalitions such that $\bigcup_{C\in CS}C=N$ and $C\cap C^{\prime}=\emptyset$ for any $C,C^{\prime}\in CS$ and $C\neq C^{\prime}$ . The data of party $i\in N$ is represented by $D_{i}\triangleq({\rm{\mathbf{X}}}_{i},{\rm{\mathbf{y}}}_{i})$ where ${\rm{\mathbf{X}}}_{i}$ and ${\rm{\mathbf{y}}}_{i}$ are the input matrix and output vector, respectively. Let $v_{C}$ denote the value of the (aggregated) data $D_{C}\triangleq\{D_{i}\}_{i\in C}$ of any coalition $C\subseteq N$ . We use $v_{i}$ to represent $v_{\{i\}}$ to ease notation. For each party $i\in N$ , $r_{i}$ denotes the value of its received model reward.

The objective is to design a collaborative ML scheme for the central party to decide and realize the values $(r_{i})_{i\in N}$ of model rewards distributed to parties $1,\ldots,n$ . The scheme should satisfy certain incentives (e.g., fairness and stability) to encourage the collaboration. Some works (Ghorbani & Zou, 2019; Jia et al., 2019a, b) have considered similar problems and can fairly partition the (monetary) value $v_{N}$ of the entire aggregated data into $r_{i}$ for $i\in N$ . They achieved this using results from cooperative game theory (CGT). Our problem, however, differs and cannot be addressed directly using CGT. This is due to the freely replicable nature of our model reward – the total value $\sum_{i\in N}r_{i}$ of received model rewards over all parties $i\in N$ can exceed $v_{N}$ .

We will next show how to assess the value of data (Sec. 3) and, more importantly, how to design the reward scheme to decide the values $(r_{i})_{i\in N}$ of model rewards accordingly and realize these values for achieving our incentive-aware objective (Sec. 4).

3 Data Valuation with Information Gain

A set $D_{C}$ of data is considered more valuable (i.e., higher $v_{C}$ ) if it can be used to train a higher-quality (hence more valuable) ML model. Existing data valuation methods (Ghorbani & Zou, 2019; Ghorbani et al., 2020; Jia et al., 2019b; Yoon et al., 2020) measure the quality of a trained ML model via its validation accuracy. However, these methods require the central party to carefully select a validation dataset that all parties must agree on. This is often a tedious or even impossible process, especially if every party’s test queries, which are likely to change over time, differ from the validation dataset.³³3In the unlikely event that all parties can agree on a common validation dataset, validation accuracy would be the preferred measure of model quality and hence data valuation method. For example, two private hospitals $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ aggregate their data for diabetes prediction. $\mathcal{H}_{1}$ and $\mathcal{H}_{2}$ prefer accurate test predictions for female and young patients, respectively. Due to the differences in their data sizes, data qualities, and preferred test queries, it is difficult for the central party to decide the demographics of the patients in the validation dataset such that the data valuation is unbiased and accurate.

To circumvent the above-mentioned issues, our proposed data valuation method instead considers an information-theoretic measure of the quality of a trained model in terms of the reduction in uncertainty of the model parameters, denoted by vector $\boldsymbol{\theta}$ , after training on data $D_{C}$ . We use the prior entropy $\mathbb{H}(\boldsymbol{\theta})$ and posterior entropy $\mathbb{H}(\boldsymbol{\theta}|D_{C})$ to represent the uncertainty of $\boldsymbol{\theta}$ before and after training on $D_{C}$ , respectively. So, if the data $D_{C}$ for $C\subseteq N$ can induce a greater reduction in the uncertainty/entropy of $\boldsymbol{\theta}$ or, equivalently, information gain (IG) $\mathbb{I}(\boldsymbol{\theta};D_{C})$ on $\boldsymbol{\theta}$ :

v_{C}\triangleq\mathbb{I}(\boldsymbol{\theta};D_{C})=\mathbb{H}(\boldsymbol{\theta})-\mathbb{H}(\boldsymbol{\theta}|D_{C})\ ,

(1)

then a higher-quality (hence more valuable) model can be trained using this more valuable/informative data $D_{C}$ . IG is an appropriate data valuation method as it is often used as a surrogate measure of the test/predictive accuracy of a trained model (Krause & Guestrin, 2007; Kruschke, 2008) since the test queries are usually not known a priori. We empirically demonstrate such a surrogate effect in Appendix F.3. The predictive distribution of the output $y^{*}$ at the test input ${\rm{\mathbf{x}}}^{*}$ given data $D_{C}$ is calculated by averaging over all possible model parameters $\boldsymbol{\theta}$ weighted by their posterior belief $p(\boldsymbol{\theta}|D_{C})$ , i.e., $p(y^{*}|{\rm{\mathbf{x}}}^{*},D_{C})=\int p(y^{*}|{\rm{\mathbf{x}}}^{*},\boldsymbol{\theta})\ p(\boldsymbol{\theta}|D_{C})\ \text{d}\boldsymbol{\theta}.$ By reducing the uncertainty in $\boldsymbol{\theta}$ , we can further rule out models that are unlikely given $D_{C}$ and place higher weights (i.e., by increasing $p(\boldsymbol{\theta}|D_{C})$ ) on models that are closer to the true model parameters, thus improving the predictive accuracy for any test query in expectation. The value $v_{C}$ (1) of data $D_{C}$ has the following properties:

$\bullet$

Data of an empty coalition has no value: $v_{\emptyset}=0\ .$
$\bullet$

Data of any coalition $C\subseteq N$ has non-negative value: $\forall C\subseteq N\ \ v_{C}\geq 0\ .$
$\bullet$

Monotonicity. Adding more parties to a coalition cannot decrease the value of its data: $\forall C\subseteq C^{\prime}\subseteq N\ \ v_{C^{\prime}}\geq v_{C}\ .$
$\bullet$

Submodularity. Data of any party $i$ is less valuable to a larger coalition which has more parties and data:

$\forall i\in N\ \ \forall C\subseteq C^{\prime}\subseteq N\setminus\{i\}\ \ v_{C^{\prime}\cup\{i\}}-v_{C^{\prime}}\leq v_{C\cup\{i\}}-v_{C}\ .$

The second and third properties are due to the “information never hurts” bound for entropy (Cover & Thomas, 1991). The submodular property of IG (1) assumes conditional independence of the data $D_{i}$ and $D_{j}$ given $\boldsymbol{\theta}$ for any $i,j\in N$ and $i\neq j$ ; its proof is in Appendix A.

The first two properties fulfill standard assumptions of CGT. The latter two properties will influence the design and properties of our model reward scheme in Sec. 4. For example, the monotonic property ensures that the value $r_{i}$ of any party $i$ ’s model reward is never negative (Sec. 4).

4 Incentive-Aware Reward Scheme with Model Rewards

Recall our problem setting in Sec. 2 that the central party will train a model for each party as a reward. The reward should be decided based on the value of each party’s data relative to that of the other parties’. Let $D_{i}^{r}$ be the data (i.e., derived from $D_{N}$ ) used to train party $i$ ’s model reward. The value of party $i$ ’s model reward is $r_{i}\triangleq\mathbb{I}(\boldsymbol{\theta};D_{i}^{r})$ according to Sec. 3. In this section, we will first present the incentives to encourage collaboration (Sec. 4.1), then describe how our incentive-aware reward scheme will satisfy them (Sec. 4.2), and finally show how to vary $D_{i}^{r}$ to realize the values of the model rewards decided by our scheme (Sec. 4.3).

We desire a collaboration involving the grand coalition (i.e., $CS=\{N\}$ ) as it results in the largest aggregated data $D_{N}$ and hence allows the highest value of model reward to be achieved, which eliminates the tedious process of deciding how the parties should team up and partition themselves. In Sec. 4.1.2, we will discuss how to incentivize the grand coalition formation and increase the total benefit from the collaboration.

4.1 Incentives

Our reward scheme has to be valid, fair to all the parties, and guarantee an improved model for each party. To achieve these, we exploit and adapt key assumptions and constraints about rewards in CGT. As has been mentioned in Sec. 2, the modifications are necessary as the model reward is uniquely freely replicable. We require the following incentive conditions to hold for the values $(r_{i})_{i\in N}$ of model rewards based on the chosen coalition structure $CS$ :

R1

Non-negativity. $\forall i\in N\ \ r_{i}\geq 0\ .$
R2

Feasibility. The model reward received by each party in any coalition $C\in CS$ cannot be more valuable than the model trained on their aggregated data $D_{C}$ : ${\forall C\in CS}\ \ {\forall i\in C}\ \ {r_{i}\leq v_{C}}\ .$
R3

Weak Efficiency. In each coalition $C\in CS$ , the model reward received by at least a party $i\in C$ is as valuable as the model trained on the aggregated data $D_{C}$ of $C$ : ${\forall C\in CS}\ \ {\exists i\in C}\ \ {r_{i}=v_{C}}\ .$
R4

Individual Rationality. Each party should receive a model reward that is at least as valuable as the model trained on its own data: $\forall i\in N\ \ r_{i}\geq v_{i}\ .$

R1 and R4 are the same as the solution concepts in CGT while R2 and R3 have been adapted.⁴⁴4R2 and R3 are, respectively, adapted from ${\forall C\in CS}\ {\sum_{i\in C}{r_{i}}\leq v_{C}}$ and ${\forall C\in CS}\ {\sum_{i\in C}{r_{i}}=v_{C}}$ in CGT (Chalkiadakis et al., 2011). When $CS=\{N\}$ (i.e., grand coalition), R2 and R3 become ${\forall i\in N}\ \ {r_{i}\leq v_{N}}$ and ${\exists i\in N}\ \ {r_{i}=v_{N}}$ , respectively. So, each party cannot receive a more valuable model reward than the model trained on $D_{N}$ as it would involve creating data artificially.

4.1.1 Fairness

In addition, when $CS=\{N\}$ (i.e., grand coalition), to guarantee that the reward scheme is fair to all $n$ parties, the values $(r_{i})_{i\in N}$ of their model rewards must satisfy the following properties which are inspired by the fairness concepts in CGT (Maschler & Peleg, 1966; Shapley, 1953; Young, 1985) and have also been experimentally studied from a normative perspective (de Clippel & Rozen, 2013):

F1

Uselessness.⁵⁵5These properties are axioms of Shapley Value (Shapley, 1953) and have been widely used in existing ML works (Ghorbani & Zou, 2019; Jia et al., 2019b; Ohrimenko et al., 2019) for data valuation. If including the data of party $i$ does not improve the quality of a model trained on the aggregated data of any coalition (e.g., when $D_{i}=\emptyset$ ), then party $i$ should receive a valueless model reward: For all $i\in N$ ,

$(\forall C\subseteq N\setminus\{i\}\ \ v_{C\cup\{i\}}=v_{C})\Rightarrow r_{i}=0\ .$
F2

Symmetry.⁵ If including the data of party $i$ yields the same improvement as that of party $j$ in the quality of a model trained on the aggregated data of any coalition (e.g., when $D_{i}=D_{j}$ ), then they should receive equally valuable model rewards: For all $i,j\in N$ s.t. $i\neq j$ ,

$(\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}=v_{C\cup\{j\}})\Rightarrow r_{i}=r_{j}\ .$

Strict Desirability (Maschler & Peleg, 1966). If the quality of a model trained on the aggregated data of at least a coalition improves more by including the data of party $i$ than that of party $j$ , but the reverse is not true, then party $i$ should receive a more valuable model reward than party $j$ : For all $i,j\in N$ s.t. $i\neq j$ ,

\begin{array}[]{l}(\exists B\subseteq N\setminus\{i,j\}\ \ v_{B\cup\{i\}}>v_{B\cup\{j\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}\geq v_{C\cup\{j\}})\Rightarrow r_{i}>r_{j}\ .\end{array}

Strict Monotonicity.⁶⁶6Our definition is similar to coalitional monotonicity in (Young, 1985) except that we consider $>$ instead of $\geq$ . This rules out the scenario where the value of party $i$ ’s data improves but not that of its model reward. We further check the feasibility of improvement in the value of its model reward. If the quality of a model trained on the aggregated data of at least a coalition containing party $i$ improves (e.g., by including more data of party $i$ ), ceteris paribus, then party $i$ should receive a more valuable model reward than before: Let $\{v_{C}\}_{C\in 2^{N}}$ and $\{v^{\prime}_{C}\}_{C\in 2^{N}}$ denote any two sets of values of data over all coalitions $C\subseteq N$ , and $r_{i}$ and $r^{\prime}_{i}$ be the corresponding values of model rewards received by party $i$ . For all $i\in N$ ,

\begin{array}[]{l}(\exists B\subseteq N\setminus\{i\}\ \ v^{\prime}_{B\cup\{i\}}>v_{B\cup\{i\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i\}\ \ v^{\prime}_{C\cup\{i\}}\geq v_{C\cup\{i\}})\ \land\vspace{1mm}\\ (\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A})\land(v^{\prime}_{N}>r_{i})\Rightarrow r^{\prime}_{i}>r_{i}\ .\end{array}

We have the following incentive condition:

R5

Fairness. The values $(r_{i})_{i\in N}$ of model rewards must satisfy F1 to F4.

Both F3 and F4 imply that marginal contribution ( $v_{C\cup\{i\}}-v_{C}$ ) matters. F4 is the only property guaranteeing that if party $i$ adds more valuable/informative data to a coalition containing it, ceteris paribus, then it should receive a more valuable model reward than before. Additionally, F3 establishes a relationship between parties $i$ and $j$ that if their marginal contributions only differ for coalition $C$ (i.e., $v_{C\cup\{i\}}>v_{C\cup\{j\}}$ w.l.o.g.), then party $i$ should receive a more valuable model reward than party $j$ .

To illustrate their significance, we consider two simpler reward schemes where we directly set the value of model reward received by every party $i\in N$ as (a) the value of its data (i.e., $r_{i}\triangleq v_{i}$ ) or (b) the decrease in the value of its data if it leaves the grand coalition (i.e., $r_{i}\triangleq v_{N}-v_{N\setminus\{i\}}$ ). Both schemes violate F3 and F4 as they ignore the values of the other parties’ data: The value of model reward received by party $i$ does not change when its marginal contribution to any non-empty coalition $C\subset N\setminus\{i\}$ increases. Intuitively, a party with a small value $v_{i}$ of data (e.g., few data points) can be highly valuable to the other parties if its data is distinct. Conversely, a party with a high value $v_{j}$ of data should be less valuable to the other parties with similar data as its marginal contribution is lower.

Hence, we consider the Shapley value which captures the idea of marginal contribution precisely as it uses the expected marginal contribution of a party $i$ when it joins the parties preceding it in any permutation:

\text{Shapley}_{v}(i)=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}

(2)

where $\Pi_{N}$ is the set of all possible permutations of $N$ and $S_{\pi,i}$ is the coalition of parties preceding $i$ in permutation $\pi$ .

Proposition 1.

$r_{i}=\text{Shapley}_{v}(i)$ for all $i\in N$ satisfy R5.

Its proof is in Appendix B. A party will have a larger Shapley value and hence value $r_{i}$ of model reward when its data is highly valuable on its own (e.g., with low inherent noise) and/or to the other parties (e.g., due to low correlation).

Fairness with Weak Efficiency (R3). We simplify the notation of $\text{Shapley}_{v}(i)$ to $\phi_{i}$ . Such a reward scheme may not satisfy R3 as the total value $\sum_{i\in N}r_{i}$ of model rewards is only $v_{N}$ (Chalkiadakis et al., 2011). Due to the freely replicable nature of our model reward, we want the total value to exceed $v_{N}$ and the value of some party’s model reward to be $v_{N}$ . To satisfy other incentive conditions, we consider a function $g$ to map $(\phi_{i})_{i\in N}$ to $(r_{i})_{i\in N}$ (i.e., $r_{i}\triangleq g(\phi_{i})$ ). To achieve fairness in R5, $g$ must be strictly increasing with $g(0)=0$ . These motivate us to propose the following:

Definition 1 (Shapley Fairness).

Given $\{v_{C}\}_{C\in 2^{N}}$ , if there exists a constant $k>0$ s.t. $r_{i}=k\phi_{i}$ for all $i\in N$ , then the values $(r_{i})_{i\in N}$ of model rewards are Shapley fair.

Note that CGT sets $k=1$ (Proposition 1). Also, we can control constant $k$ to satisfy other incentive conditions such as R3. A consequence of Definition 1 is that $\phi_{i}/\phi_{j}=r_{i}/r_{j}$ . So, increasing the ratio of expected marginal contributions $\phi_{i}$ of party $i$ vs. $\phi_{j}$ of party $j$ (e.g., by including more valuable/informative data of party $i$ , ceteris paribus) results in the same increase in the ratio of values of model rewards $r_{i}$ received by party $i$ vs. $r_{j}$ received by party $j$ .

Fairness with Individual Rationality (R4). However, another issue persists in the above modified definition of fairness (Definition 1): R4 may not be satisfied when the value of data is submodular (e.g., IG). We show an example below and leave the detailed discussion to Appendix C:

Example 1.

Suppose that there is a coalition of two parties whose values of data are $v_{1}=7$ , $v_{2}=5$ , and $v_{\{1,2\}}=8$ . Using (2), the Shapley values of the two parties are $\phi_{1}=5$ and $\phi_{2}=3$ . To satisfy R3, we set $r_{1}=8$ . Then, to achieve Shapley fairness, $r_{2}=(8/5)\times 3=4.8$ . Since $r_{2}<v_{2}$ , R4 is not satisfied.

Since our model reward is freely replicable, it is possible to give all parties in a larger coalition more valuable model rewards. How then should we redefine $g$ to derive larger values $r_{i}$ of model rewards to achieve R4 while still satisfying R5 and R3 (i.e., $g(\max_{i\in N}\phi_{i})=v_{N}$ )? To answer this question, we further modify the above definition of Shapley fairness (Definition 1) to the following:

Definition 2 ( $\rho$ -Shapley Fairness).

Given $\{v_{C}\}_{C\in 2^{N}}$ , if there exist constants $k>0$ and $\rho>0$ s.t. $r_{i}=k\phi_{i}^{\rho}$ for all $i\in N$ , then the values $(r_{i})_{i\in N}$ of model rewards are $\rho$ -Shapley fair.

It follows from Definition 2 that when $\rho<1$ , $\phi_{i}>\phi_{j}$ implies $\phi_{i}/\phi_{j}>r_{i}/r_{j}>1$ . Furthermore, reducing $\rho$ from $1$ “weakens” proportionality of $r_{i}$ to $\phi_{i}$ and decreases the ratio of $r_{i}$ vs. $r_{j}$ while preserving $r_{i}>r_{j}$ . So, if the value of a party’s model reward is $v_{N}$ (hence satisfying R3), then the values of the other parties’ model rewards will become closer to $v_{N}$ , thereby potentially satisfying R4. We will discuss the choice of $k$ and the effect of $\rho$ in Sec. 4.2.

4.1.2 Stability and Group Welfare

At the beginning of Sec. 4, we have provided justifications for desiring the grand coalition. We will now introduce two other solution concepts in CGT (i.e., stability and group welfare) to incentivize its formation and increase the total benefit from the collaboration, respectively. To the best of our knowledge, these solution concepts have not been considered in existing collaborative ML works.

A coalition structure $CS$ with a given set $(r_{i})_{i\in N}$ of values of model rewards is said to be stable if no subset of parties has a common incentive to abandon it to form another coalition on their own. With stability, parties can be assured that the collaboration will not fall apart and they cannot get a more valuable model reward from adding or removing parties. Similar to other solution concepts in CGT introduced previously, the definition of stability in CGT⁷⁷7In CGT, a coalition structure $CS$ with given $(r_{i})_{i\in N}$ is stable if $\forall C\subseteq N\ \ v_{C}\leq\sum_{i\in C}r_{i}\ .$ does not suit the freely replicable nature of our model reward and the redefined incentive condition on feasibility R2. Therefore, we propose the following new definition of stability:

Definition 3 (Stability).

A coalition structure $CS$ with a given set $(r_{i})_{i\in N}$ of values of model rewards is stable if $\forall C\subseteq N\ \ \exists i\in C\ \ {v_{C}\leq r_{i}}\ .$

Conversely, supposing $\exists C\subseteq N\ \ \forall i\in C\ \ r_{i}<v_{C}$ , all parties in $C$ may be willing to deviate to form coalition $C$ as they can feasibly increase the values of their model rewards (up) to $v_{C}$ . The condition in Definition 3 is computationally costly to check as it involves an exponential (in $n$ ) number of constraints. To ease computation and since we are mainly interested in the grand coalition (i.e., $CS=\{N\}$ ), we look at the following sufficient condition instead:

R6

Stability of Grand Coalition. Suppose that the value of data is monotonic. The grand coalition is stable if for every coalition $C$ , the value of the model reward received by the party with largest Shapley value is at least $v_{C}$ :
$\forall C\subseteq N\ \ \forall i\in C\ \ \phi_{i}=\max_{j\in C}\phi_{j}\Rightarrow\ v_{C}\leq r_{i}\ .$

The values $(r_{i})_{i\in N}$ of model rewards that satisfy R6 will also satisfy Definition 3. To ensure R6, for each party $i\in N$ , we will only need to check one constraint involving $r_{i}$ : $v_{C_{i}}\leq r_{i}$ where $C_{i}$ includes all parties whose Shapley value is at most $\phi_{i}$ . The monotonic property of the value of data guarantees that $v_{C}\leq v_{C_{i}}\leq r_{i}$ for any $C\subseteq C_{i}$ . The total number of constraints is linear in the number $n$ of parties. Unlike R1 to R5, R6 is an optional incentive condition.

The last incentive condition stated below does not have to be fully provided:

R7

Group Welfare. The values $(r_{i})_{i\in N}$ of model rewards should maximize the group welfare $\sum_{i\in N}r_{i}$ .

4.2 Reward Scheme Considering All Incentives

In this subsection, we will present a reward scheme which considers all the incentives in Sec. 4.1 and assumes that the grand coalition will form. It uses parameter $\rho$ in Definition 2 to trade off between achieving Shapley fairness vs. satisfying the other incentives, as detailed below:

Theorem 1.

Let $0<\rho\leq 1$ . For each party $i\in N$ , let $\phi_{i}\triangleq\text{Shapley}_{v}(i)$ and reward ${r_{i}\triangleq{{({\phi_{i}}/{\phi^{*}})}^{\rho}}\times v_{N}}$ where $\phi^{*}=\max_{i\in N}\phi_{i}$ .⁸⁸8In practice, $\phi_{i}$ can be based on other solution concepts in CGT that satisfy F1 to F4. The values $(r_{i})_{i\in N}$ of model rewards are $\rho$ -Shapley fair and satisfy R1 to R3 and R5 when $\rho>0$ . Also, when

$\bullet$

$\rho=1$ , $(r_{i})_{i\in N}$ are (pure) Shapley fair (Definition 1);
$\bullet$

$\rho\leq\rho_{r}\triangleq\min_{i\in N}{{\log(v_{i}/v_{N})}/{\log(\phi_{i}/\phi^{*})}}$ , $(r_{i})_{i\in N}$ satisfy individual rationality (R4);
$\bullet$

$\rho\leq\rho_{s}\triangleq\min_{i\in N}{{\log(v_{C_{i}}/v_{N})}/{\log(\phi_{i}/\phi^{*})}}$ where coalition $C_{i}\triangleq\{j\in N\mid\phi_{j}\leq\phi_{i}\}$ , $(r_{i})_{i\in N}$ achieve stability of the grand coalition (R6) and individual rationality (R4) as $\rho_{s}\leq\rho_{r}$ ;
$\bullet$

$\rho=0$ , $(r_{i})_{i\in N}$ provide maximum group welfare (R7) but do not satisfy fairness (R5).

Its proof is in Appendix D. As $\rho$ decreases from 1, the values $(r_{i})_{i\in N}$ of model rewards deviate further from pure Shapley fairness (i.e., less proportional to $(\phi_{i})_{i\in N}$ ) and the value $r_{i}$ of any party $i$ ’s model reward with $0<\phi_{i}/\phi^{*}<1$ will increase. This increases group welfare (R7) and the values $(r_{i})_{i\in N}$ of model rewards can potentially satisfy individual rationality (R4) and stability (R6). Thus, a smaller $\rho$ addresses the limitations of pure Shapley fairness. We do not consider (a) $\rho>1$ as it reduces group welfare and is not needed to satisfy other incentives, nor (b) $\rho<0$ as it demands that more valuable/informative data and higher Shapley value lead to less valuable model reward, hence not satisfying fairness (R5).

Fig. 1 gives an overview of the incentive conditions and how Theorem 1 uses $\rho$ to satisfy them and trade off between the various incentives. Note that R6 guarantees R4. Ideally, we want the values $(r_{i})_{i\in N}$ of model rewards to satisfy fairness (R5), stability (R6), and individual rationality (R4) (i.e., shaded region in Fig. 1). However, the values $(r_{i})_{i\in N}$ of model rewards that are purely Shapley fair may not always lie in the desired shaded region. So, a smaller $\rho$ is needed.

Refer to caption — Figure 1: An overview of the incentive conditions and how they are satisfied by our reward scheme in Theorem 1. There may be alternative desirable reward schemes not using Definition 2. Note that in some scenarios, $1$ may be less than $\rho_{r}$ or $\rho_{s}$ . As a result, the values $(r_{i})_{i\in N}$ of model rewards that are purely Shapley fair naturally satisfy individual rationality and stability, respectively.

Consider how much each party $i$ benefits from the collaboration by choosing a $\rho$ smaller than $1$ , specifically, by measuring the ratio of the values $r_{i}$ of its received model reward that are $\rho$ -Shapley fair ( $\rho<1$ ) vs. purely Shapley fair ( $\rho=1$ ). Such a ratio can be evaluated to $(\phi^{*}/\phi_{i})^{1-\rho}\geq 1$ and is smaller (larger) for a party $i$ with a larger (smaller) $\phi_{i}$ . So, when a smaller $\rho$ is chosen, a party $i$ with $\phi_{i}$ close to $\phi^{*}$ needs to be “altruistic” as it cannot benefit as much due to a smaller ratio than any party $j$ with a smaller $\phi_{j}$ . However, note that party $i$ is already getting close to the maximum value $v_{N}$ of model reward. Regardless of the chosen $\rho$ , the party $i$ with the largest $\phi_{i}$ (i.e., $\phi^{*}$ ) always receives the most valuable model reward with the highest value $v_{N}$ . In practice, the parties may agree to use a smaller $\rho$ if they want to increase their total benefit from the collaboration or if they do not know their relative expected marginal contributions beforehand.

4.3 Realization of Model Rewards

Finally, we will have to train a model for each party as a reward to realize the values $(r^{*}_{i})_{i\in N}$ of model rewards decided by the above reward scheme. Recall from the beginning of Sec. 4 that the value $r_{i}=\mathbb{I}(\boldsymbol{\theta};D^{r}_{i})$ of model reward received by each party $i$ is the IG on model parameters $\theta$ given the data $D^{r}_{i}$ . How is the training data $D^{r}_{i}$ for each party $i$ selected to realize $r_{i}=r^{*}_{i}$ ?

A direct approach is to select and only train on a subset of the aggregated data from all parties: $D^{r}_{i}\subseteq D_{N}$ . However, this is infeasible due to the need to consider an exponential number of discrete subsets of data, all of which may not be able to realize the decided value $r^{*}_{i}$ of model reward exactly.

To avoid the above issue, we instead consider injecting Gaussian noise $\mathcal{N}({\rm{\mathbf{0}}},\eta_{i}{\rm{\mathbf{I}}})$ into the aggregated data from multiple parties and optimizing the continuous noise variance parameter $\eta_{i}$ for realizing the decided value $r^{*}_{i}$ of model reward. Specifically, the central party collects data $D_{i}\triangleq({\rm{\mathbf{X}}}_{i},{\rm{\mathbf{y}}}_{i})$ from every party $i\in N$ and constructs $D^{r}_{i}$ by concatenating the input matrices ${\rm{\mathbf{X}}}_{i}$ for $i\in N$ and the output vector ${\rm{\mathbf{y}}}_{i}$ with ${\rm{\mathbf{z}}}_{i}\triangleq({\rm{\mathbf{y}}}_{j})_{j\in N\setminus\{i\}}+\mathcal{N}({\rm{\mathbf{0}}},\eta_{i}{\rm{\mathbf{I}}})$ . When training a model for party $i$ , we use the original ${\rm{\mathbf{y}}}_{i}$ so that each party gets all the information from its own data $D_{i}$ and cannot improve its received model reward by subsequently training on it. We inject noise with variance $\eta_{i}$ only to the other parties’ data (i.e., $({\rm{\mathbf{y}}}_{j})_{j\in N\setminus\{i\}}$ ) to reduce the information from parties $N\setminus\{i\}$ conveyed to party $i$ . In particular, $r_{i}=v_{N}$ when $\eta_{i}=0$ and $r_{i}=v_{i}$ when $\eta_{i}=\infty$ . By varying $\eta_{i}$ , we can span different values of $r_{i}$ . We use an efficient root-finding algorithm to find the optimal $\eta_{i}$ such that $r_{i}=r^{*}_{i}$ for each party $i\in N$ . The effect of adding noise to the data will be reflected in the variance of party $i$ ’s model reward parameters and its predictive distribution. We will show that the added noise affects predictive accuracy reasonably in Sec. 5.

5 Experiments and Discussion

This section empirically evaluates the performance and properties of our reward scheme (Sec. 4.2) on Bayesian regression models with the (a) synthetic Friedman dataset with $6$ input features (Friedman, 1991), (b) diabetes progression (DiaP) dataset on the diabetes progression of $442$ patients with $9$ input features (Efron et al., 2004), and (c) Californian housing (CaliH) dataset on the value of $20640$ houses with $8$ input features (Pace & Barry, 1997). We use a Gaussian likelihood and assume that the model hyperparameters are known or learned using maximum likelihood estimation.

The performance of our reward scheme is evaluated using the IG and mean negative log probability (MNLP) metrics:

\begin{array}[]{rl}\text{MNLP}\triangleq&\displaystyle\frac{1}{|D^{*}|}\sum_{({\rm{\mathbf{x}}}^{*},y^{*})\in D^{*}}{-\log p(y^{*}|{\rm{\mathbf{x}}}^{*},D_{i}^{r})}\\ =&\displaystyle\frac{1}{|D^{*}|}\sum_{({\rm{\mathbf{x}}}^{*},y^{*})\in D^{*}}{\frac{1}{2}\left(\log(2\pi\sigma_{*}^{2})+\frac{(\mu_{*}-y^{*})^{2}}{\sigma_{*}^{2}}\right)\vspace{-1mm}}\end{array}

(3)

where $D^{*}$ denotes a test dataset, and $\mu_{*}$ and $\sigma_{*}^{2}$ denote, respectively, the predictive mean and variance of the predictive distribution $p(y^{*}|{\rm{\mathbf{x}}}^{*},D_{i}^{r})$ . We then use these metrics to show how an improvement in IG can affect the predictive accuracy of a trained model on a common test dataset.⁹⁹9We use a randomly sampled test dataset to illustrate how IG is a suitable surrogate measure of the test/predictive accuracy of a trained model. In Sec. 3, we have discussed that in practice, it is often tedious or even impossible for all parties to agree on a common validation dataset, let alone a common set of test queries. More details of the experimental settings such as how to select the test dataset are in Appendix E. The performance of our reward scheme cannot be directly compared against that of the existing works (Ghorbani & Zou, 2019; Jia et al., 2019b) as they mainly focus on non-Bayesian classification models and assume monetary compensations. For all experiments, we consider a collaboration of $n=3$ parties as it suffices to show interesting results.

Gaussian process (GP) regression with synthetic Friedman dataset. For each party, we generate $250$ data points from the Friedman function. We assign party $1$ the most valuable data by spanning the first input feature over the entire input domain $[0,1]$ . In contrast, for parties $2$ and $3$ , the same input feature spans only the non-overlapping ranges of $[0,0.5]$ and $[0.5,1]$ , respectively. This makes the data of parties $2$ and $3$ highly valuable to each other, i.e., $v_{\{2,3\}}$ is high relative to $v_{2}$ . Using (2), the Shapley values of the three parties are $\phi_{1}=34.57$ , $\phi_{2}=29.24$ , and $\phi_{3}=30.78$ . Party $1$ has the largest $\phi_{1}$ and will always receive the most valuable model reward with the highest IG or value $v_{N}$ .

Fig. 2a shows that our reward scheme indeed satisfies fairness (R5): A party $i$ with a larger $\phi_{i}$ always receives a higher $r_{i}$ (i.e., more valuable model reward). Also, since $\rho_{r}\geq 1$ here, our reward scheme always satisfies individual rationality (R4): Every party $i$ receives a more valuable model reward than that trained on its own data, i.e., its IG $r_{i}$ is higher than $v_{i}$ (see dotted lines in Fig. 2a). As $\rho$ decreases (rightwards), party $1$ ’s most valuable model reward is unaffected but it has to be “altruistic” to parties $2$ and $3$ with increasing $r_{2}$ and $r_{3}$ (i.e., increasingly valuable model rewards) but smaller $\phi_{2}$ and $\phi_{3}$ , as discussed in the last paragraph of Sec. 4.2. When $\rho$ decreases to $\rho_{s}$ , stability (R6) is reached: Party $3$ ’s model reward matches what it will receive by only collaborating with party $2$ . When $\rho=0$ , all parties receive an equally valuable model reward with the same IG/value $v_{N}$ despite their varying expected marginal contributions. This shows how $\rho$ can be reduced to trade off proportionality in pure Shapley fairness for satisfying other incentives such as stability (R6) and group welfare (R7).

In Fig. 2b, we report the averaged MNLP and shade the $95\%$ confidence interval over $20$ different realizations of ${\rm{\mathbf{z}}}_{i}$ . As $\rho$ decreases and each party receives an increasingly valuable model reward (i.e., increasing IG/ $r_{i}$ ), MNLP decreases, thus showing that an improvement in IG translates to a higher predictive accuracy of its trained model.

GP regression with DiaP dataset. We train a GP regression model with a composite kernel (i.e., squared exponential kernel $+$ exponential kernel) and partition the training dataset among $n=3$ parties, as detailed later. The results are shown in Fig. 3.

Neural network regression with CaliH dataset. We consider Bayesian linear regression (BLR) on the last layer of a neural network (NN). $60\%$ of the CaliH data is designated as the “public” dataset¹⁰¹⁰10 In practice, the “public” dataset can be provided by the government to kickstart the collaborative ML. Alternatively, it can be historical data that companies are more willing to share. and used to pretrain a NN. Since the correlation of the house value with the input features in the data of the parties may differ from that in the “public” dataset, we perform transfer learning and selectively retrain only the last layer using BLR with a standard Gaussian prior. So, IG is only computed based on BLR. From the remaining data, we select $400$ data points for the test dataset and $1600$ data points¹¹¹¹11We restrict the total number of data points so that every individual party cannot train a high-quality model alone. for the training dataset to be partitioned among $3$ parties, as detailed later. The results are shown in Fig. 5.

Sparse GP regression with synthetic Friedman dataset (10 parties). We also evaluate the performance of our reward scheme on the collaboration between a larger number $n=10$ of parties. We consider a larger synthetic Friedman dataset of size $5000$ and partition the training dataset among $10$ parties such that each party owns at least $5\%$ of the dataset. We train a sparse GP model based on the deterministic training conditional approximation (Hoang et al., 2015, 2016) and a squared exponential kernel. Since the exact Shapley value (2) is expensive to evaluate with a large number of parties, we approximate it using the simple random sampling algorithm in (Maleki et al., 2013). The results are shown in Fig. 6.

Partitioning Algorithm. To divide any training dataset among parties and vary their contributed data, we first choose an input feature $a$ uniformly at random from all the input features. Next, the data point $d$ is chosen uniformly at random. We partially sort the dataset and divide it into continuous blocks of random sizes, as detailed in Fig. 4.

Such a setup allows parties to have different quantities of unique or partially overlapping data. The input feature $a$ may have real-world significance: If $a$ is the age of patients, then this models the scenario where hospitals have data of patients with different demographics.

Summary of Observations. For any dataset, we consider different train-test splits and partitions of the training dataset using the above algorithm. For each partition, we compute the optimized noise variance $\eta_{i}$ to realize $r^{*}_{i}$ and consider $5$ realizations of ${\rm{\mathbf{z}}}_{i}$ for each party $i\in N$ . For a given partition and choice of $\rho$ , if the values $(r_{i})_{i\in N}$ of model rewards satisfy individual rationality (R4), then we compute the ratio $\phi_{i}/\phi^{*}$ , improvement in IG ( $r_{i}-v_{i}$ ) and MNLP, and maximal (possible) improvement in IG ( $v_{N}-v_{i}$ ) and MNLP for each party $i\in N$ . The improvement is measured relative to a model trained only on party $i$ ’s data. The best/lowest MNLP $m_{N}$ is assumed to be achieved by a model trained on the aggregated data of all parties. Each realization of ${\rm{\mathbf{z}}}_{i}$ (from each party $i$ ) will produce a point in the scatter graphs in Figs. 3, 5, and 6. If a point lies on the diagonal identity line, then the corresponding party (e.g., with the largest $\phi_{i}$ , i.e., $\phi^{*}$ ) receives a model reward with MNLP $m_{N}$ . We also report the number of points with positive improvement in IG and MNLP in the top left hand corner of each graph.

In Figs. 3c-d, 5c-d, and 6c-d, the improvement in MNLP is usually positive: The predictive performance of each party benefits from the collaboration. Occasionally but reasonably, this does not hold when the maximal improvement in MNLP is extremely small or even negative. Lighter colored points lie closer to the diagonal line: Parties with larger $\phi_{i}$ receive more valuable model rewards (Figs. 3a-b, 5a-b, 6a-b) which translates to lower MNLP (Figs. 3c-d, 5c-d, 6c-d).

In Figs. 3, 5 and 6, when $\rho$ decreases from $1$ to $0.5$ , darker colored points move closer to the diagonal line, which implies that parties with smaller $\phi_{i}$ can now receive more valuable model rewards with higher predictive accuracy. The number of points (reported in the top left corner) also increase as more of them satisfy R4.

In Fig. 6d with $\rho=0.5$ , most points are close to the diagonal identity line because it may be the case that not all training data points are needed to achieve the lowest MNLP. Instead, fewer or noisier data points are sufficient. In this scenario, a larger $\rho$ may be preferred to reward the parties differently. More experimental results for different Bayesian models and datasets are shown in Appendix F.1.

Limitations. Though IG is a suitable surrogate measure of the predictive accuracy of a trained model, a higher IG does not always translate to lower MNLP. In Figs. 3c-d and 5c-d, occasionally, the improvement in MNLP is negative (around $15\%$ of the time) and darker points lie above the diagonal line. The latter suggests that parties with smaller $\phi_{i}$ may receive model rewards with MNLP lower than $m_{N}$ (i.e., awarded to the party with largest $\phi_{i}$ , i.e., $\phi^{*}$ ). This may be purely due to randomness, but we have also investigated factors (e.g., model selection) leading to a weak relationship between IG and MNLP and reported the results in Appendix F.2. Our reward scheme works best when a suitable model is selected and the model prior is not sufficiently informative for any party to achieve a high predictive accuracy by training a model on its own data. In practice, this is reasonable as collaboration precisely happens only when every individual party cannot train a high-quality model alone but can do so from working together.

6 Conclusion

This paper describes a collaborative ML scheme that distributes only model rewards to the parties. We perform data valuation based on the IG on the model parameters given the data and compute each party’s marginal contribution using the Shapley value. To decide the appropriate rewards to incentivize the collaboration, we adapt solution concepts from CGT (i.e., fairness, Shapley fairness, stability, individual rationality, and group welfare) for our freely replicable model rewards. We propose a novel reward scheme (Theorem 1) with an adjustable parameter that can trade off between these incentives while maintaining fairness. Empirical evaluations show that a smaller $\rho$ and more valuable model rewards translate to higher predictive accuracy. Current efforts on designing incentive mechanisms for federated learning can build upon our work presented in this paper. For future work, we plan to address privacy preservation in our reward scheme. The noise injection method used for realizing the model rewards in this work is closely related to the Gaussian mechanism of differential privacy (Dwork & Roth, 2014). This motivates us to explore how the injected noise will affect privacy in the model rewards.

Acknowledgements

This research/project is supported by the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative and its AI Singapore Programme (Award Number: AISG-GC- $2019$ - $002$ ). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.

References

Center for Open Data Enterprise (2019) Center for Open Data Enterprise. Sharing and utilizing health data for AI applications. Roundtable report, 2019.
Chalkiadakis et al. (2011) Chalkiadakis, G., Elkind, E., and Wooldridge, M. Computational aspects of cooperative game theory. In Brachman, R. J., Cohen, W. W., and Dietterich, T. G. (eds.), Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011.
Claver (2019) Claver, H. Data sharing key for AI in agriculture. Future Farming, Feb. 2019. URL https://www.futurefarming.com/Tools-data/Articles/2019/2/Data-sharing-key-for-AI-in-agriculture-389844E/.
Conway (2018) Conway, J. Artificial Intelligence and Machine Learning: Current Applications in Real Estate. PhD thesis, Massachusetts Institute of Technology, 2018.
Cover & Thomas (1991) Cover, T. and Thomas, J. Elements of Information Theory. John Wiley & Sons, Inc., 1991.
de Clippel & Rozen (2013) de Clippel, G. and Rozen, K. Fairness through the lens of cooperative game theory: An experimental approach. Cowles Foundation discussion paper no. 1925, 2013.
Dwork & Roth (2014) Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
Friedman (1991) Friedman, J. H. Multivariate adaptive regression splines. Ann. Statist., 19(1):1–67, 1991.
Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In Proc. ICML, pp. 2242–2251, 2019.
Ghorbani et al. (2020) Ghorbani, A., Kim, M. P., and Zou, J. A distributional framework for data valuation. In Proc. ICML, 2020.
Harrison & Rubinfeld (1978) Harrison, D. and Rubinfeld, D. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
Hoang et al. (2015) Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp. 569–578, 2015.
Hoang et al. (2016) Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp. 382–391, 2016.
Jia et al. (2019a) Jia, R., Dao, D., Wang, B., Hubis, F. A., Gurel, N. M., Li, B., Zhang, C., Spanos, C., and Song, D. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endowment, 12(11):1610–1623, 2019a.
Jia et al. (2019b) Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gurel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. Towards efficient data valuation based on the Shapley value. In Proc. AISTATS, pp. 1167–1176, 2019b.
Krause & Guestrin (2007) Krause, A. and Guestrin, C. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proc. ICML, pp. 449–456, 2007.
Kruschke (2008) Kruschke, J. K. Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3):210–226, 2008.
Maleki et al. (2013) Maleki, S., Tran-Thanh, L., Hines, G., Rahwan, T., and Rogers, A. Bounding the estimation error of sampling-based Shapley value approximation. arXiv:1306.4265, 2013.
Maschler & Peleg (1966) Maschler, M. and Peleg, B. A characterization, existence proof and dimension bounds for the kernel of a game. Pacific J. Mathematics, 18(2):289–328, 1966.
Ocean Protocol Foundation (2019) Ocean Protocol Foundation. Ocean protocol: A decentralized substrate for AI data and services. Technical whitepaper, 2019.
Ohrimenko et al. (2019) Ohrimenko, O., Tople, S., and Tschiatschek, S. Collaborative machine learning markets with data-replication-robust payments. arXiv:1911.09052, 2019.
Pace & Barry (1997) Pace, R. K. and Barry, R. Sparse spatial auto-regressions. Statistics and Probability Letters, 33(3):291–297, 1997.
Shapley (1953) Shapley, L. S. A value for $n$ -person games. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games, volume 2, pp. 307–317. Princeton Univ. Press, 1953.
Yoon et al. (2020) Yoon, J., Arik, S. O., and Pfister, T. Data valuation using reinforcement learning. In Proc. ICML, 2020.
Young (1985) Young, H. P. Monotonic solutions of cooperative games. International Journal of Game Theory, 14(2):65–72, 1985.

Appendix A Proof of Submodularity of IG (1)

Firstly, we assume that the data $D_{i}$ and $D_{j}$ are conditionally independent given model parameters $\boldsymbol{\theta}$ for any $i,j\in N$ and $i\neq j$ . Then, $\forall i\in N\ \ \forall C\subseteq C^{\prime}\subseteq N\setminus\{i\}\ ,$

\begin{array}[]{l}v_{C\cup\{i\}}-v_{C}\vspace{1mm}\\ =\mathbb{I}(\boldsymbol{\theta};D_{C\cup\{i\}})-\mathbb{I}(\boldsymbol{\theta};D_{C})\vspace{1mm}\\ =\mathbb{H}(\boldsymbol{\theta}|D_{C})-\mathbb{H}(\boldsymbol{\theta}|D_{C\cup\{i\}})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C})-\mathbb{H}(D_{i}|\boldsymbol{\theta},D_{C})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C})-\mathbb{H}(D_{i}|\boldsymbol{\theta})\vspace{1mm}\\ \geq\mathbb{H}(D_{i}|D_{C^{\prime}})-\mathbb{H}(D_{i}|\boldsymbol{\theta})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C^{\prime}})-\mathbb{H}(D_{i}|\boldsymbol{\theta},D_{C^{\prime}})\vspace{1mm}\\ =\mathbb{H}(\boldsymbol{\theta}|D_{C^{\prime}})-\mathbb{H}(\boldsymbol{\theta}|D_{C^{\prime}\cup\{i\}})\vspace{1mm}\\ =\mathbb{I}(\boldsymbol{\theta};D_{C^{\prime}\cup\{i\}})-\mathbb{I}(\boldsymbol{\theta};D_{C^{\prime}})\vspace{1mm}\\ =v_{C^{\prime}\cup\{i\}}-v_{C^{\prime}}\end{array}

where the third and sixth equalities are due to the symmetric property of mutual information, the fourth and fifth equalities are due to the conditional independence of $D_{i}$ and $D_{C}$ given $\boldsymbol{\theta}$ , and the inequality is due to the property that conditioning on more information (i.e., $D_{C^{\prime}\setminus C}$ ) never increases the entropy (i.e., “information never hurts” bound for entropy) (Cover & Thomas, 1991).

Appendix B Proof of Proposition 1

We prove below that $r_{i}=\text{Shapley}_{v}(i)$ for all $i\in N$ satisfy each of the properties F1 to F4:

\bullet

F1 Uselessness. For all $i\in N$ , if

\forall C\subseteq N\setminus\{i\}\ \ v_{C\cup\{i\}}=v_{C}\ ,

then

	$\displaystyle r_{i}=\text{Shapley}_{v}(i)$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}$
		$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{0}$
		$\displaystyle=0\ .$

\bullet

F2 Symmetry. For all $i,j\in N$ s.t. $i\neq j$ , if

\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}=v_{C\cup\{j\}}\ ,

(4)

then

	$\displaystyle r_{i}=\text{Shapley}_{v}(i)$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}$
		$\displaystyle=\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{N}}{\left(v_{{S_{\pi^{\prime},j}}\cup\{j\}}-v_{S_{\pi^{\prime},j}}\right)}$
		$\displaystyle=\text{Shapley}_{v}(j)=r_{j}\ .$

Note that $\pi^{\prime}$ is obtained from $\pi$ by swapping the positions of $i$ and $j$ in $\pi$ . Then, we prove the third equality by considering the following two cases: (a) When $i$ precedes $j$ in $\pi$ , $S_{\pi,i}=S_{\pi^{\prime},j}$ (i.e., both excluding $i$ and $j$ ) and thus $v_{S_{\pi,i}}=v_{S_{\pi^{\prime},j}}$ . It also follows from (4) that by setting $C=S_{\pi,i}=S_{\pi^{\prime},j}$ , $v_{{S_{\pi,i}}\cup\{i\}}=v_{C\cup\{i\}}=v_{C\cup\{j\}}=v_{{S_{\pi^{\prime},j}}\cup\{j\}}$ ; (b) when $j$ precedes $i$ in $\pi$ , ${S_{\pi,i}\cup\{i\}}={S_{\pi^{\prime},j}\cup\{j\}}$ (i.e., both containing $i$ and $j$ ) and thus $v_{{S_{\pi,i}}\cup\{i\}}=v_{{S_{\pi^{\prime},j}}\cup\{j\}}$ . It also follows from (4) that by setting $C=S_{\pi,i}\setminus\{j\}=S_{\pi^{\prime},j}\setminus\{i\}$ , $v_{S_{\pi,i}}=v_{C\cup\{j\}}=v_{C\cup\{i\}}=v_{S_{\pi^{\prime},j}}$ .

\bullet

F3 Strict Desirability. For all $i,j\in N$ s.t. $i\neq j$ , if

\begin{array}[]{l}(\exists B\subseteq N\setminus\{i,j\}\ \ v_{B\cup\{i\}}>v_{B\cup\{j\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}\geq v_{C\cup\{j\}})\ ,\end{array}

(5)

then

	$\displaystyle r_{i}=\text{Shapley}_{v}(i)$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}$
		$\displaystyle>\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{N}}{\left(v_{{S_{\pi^{\prime},j}}\cup\{j\}}-v_{S_{\pi^{\prime},j}}\right)}$
		$\displaystyle=\text{Shapley}_{v}(j)=r_{j}\ .$

Note that $\pi^{\prime}$ is obtained from $\pi$ by swapping the positions of $i$ and $j$ in $\pi$ . Similar to the proof of F2, we prove the inequality by considering the following two cases: (a) When $i$ precedes $j$ in $\pi$ , $S_{\pi,i}=S_{\pi^{\prime},j}$ (i.e., both excluding $i$ and $j$ ) and thus $v_{S_{\pi,i}}=v_{S_{\pi^{\prime},j}}$ . It also follows from (5) that by setting $C=S_{\pi,i}=S_{\pi^{\prime},j}$ , $v_{{S_{\pi,i}}\cup\{i\}}=v_{C\cup\{i\}}\geq v_{C\cup\{j\}}=v_{{S_{\pi^{\prime},j}}\cup\{j\}}$ ; (b) when $j$ precedes $i$ in $\pi$ , ${S_{\pi,i}\cup\{i\}}={S_{\pi^{\prime},j}\cup\{j\}}$ (i.e., both containing $i$ and $j$ ) and thus $v_{{S_{\pi,i}}\cup\{i\}}=v_{{S_{\pi^{\prime},j}}\cup\{j\}}$ . It also follows from (5) that by setting $C=S_{\pi,i}\setminus\{j\}=S_{\pi^{\prime},j}\setminus\{i\}$ , $-v_{S_{\pi,i}}=-v_{C\cup\{j\}}\geq-v_{C\cup\{i\}}=-v_{S_{\pi^{\prime},j}}$ . By (5), a strict inequality must hold for either case a or b.

\bullet

F4 Strict Monotonicity. Let $\{v_{C}\}_{C\in 2^{N}}$ and $\{v^{\prime}_{C}\}_{C\in 2^{N}}$ denote any two sets of values of data over all coalitions $C\subseteq N$ , and $r_{i}$ and $r^{\prime}_{i}$ be the corresponding values of model rewards received by party $i$ . For all $i\in N$ , if

\begin{array}[]{l}(\exists B\subseteq N\setminus\{i\}\ \ v^{\prime}_{B\cup\{i\}}>v_{B\cup\{i\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i\}\ \ v^{\prime}_{C\cup\{i\}}\geq v_{C\cup\{i\}})\ \land\vspace{1mm}\\ (\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A})\land(v^{\prime}_{N}>r_{i})\ ,\end{array}

(6)

then

	$\displaystyle r_{i}=\text{Shapley}_{v}(i)$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}$
		$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v^{\prime}_{S_{\pi,i}}\right)}$
		$\displaystyle<\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v^{\prime}_{{S_{\pi,i}}\cup\{i\}}-v^{\prime}_{S_{\pi,i}}\right)}$
		$\displaystyle=\text{Shapley}_{v^{\prime}}(i)=r^{\prime}_{i}\ .$

Note that for any $\pi$ , $S_{\pi,i}$ does not contain $i$ . Hence, it follows from (6) that $v_{S_{\pi,i}}=v^{\prime}_{S_{\pi,i}}$ which results in the third equality. The inequality also follows from (6).

Appendix C Enforcing Shapley Fairness may not satisfy Individual Rationality (R4) for Submodular Value of Data

When the value of data is submodular, for any party $i\in N$ and any coalition $C\subseteq N\setminus\{i\}$ , ${v_{i}-v_{\emptyset}}=v_{i}\geq{v_{C\cup\{i\}}-v_{C}}$ . Then,

$\displaystyle\phi_{i}\triangleq\text{Shapley}_{v}(i)$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v_{{S_{\pi,i}}\cup\{i\}}-v_{S_{\pi,i}}\right)}$
	$\displaystyle\leq\frac{1}{n!}\sum_{\pi\in\Pi_{N}}v_{i}$
	$\displaystyle=v_{i}\ .$	(7)

Also, for any $i\in N$ , $\phi_{i}\leq\sum_{i\in N}{\phi_{i}}=v_{N}$ due to the efficiency of Shapley value (Chalkiadakis et al., 2011).

To satisfy R3 and R5 simultaneously, we need to set $k=v_{N}/\phi^{*}$ in Definition 1 where $\phi^{*}=\max_{i\in N}{\phi_{i}}$ . So, for each party $i\in N$ ,

r_{i}=(v_{N}/\phi^{*})\times\phi_{i}\

according to Definition 1 and the party with the largest $\phi_{i}$ (i.e., $\phi^{*}$ ) always receives the most valuable model reward with the highest value $v_{N}$ .

When $v_{i}/\phi_{i}>v_{N}/\phi^{*}$ , R4 will not be satisfied (i.e., $r_{i}=(v_{N}/\phi^{*})\times\phi_{i}<v_{i}$ ). Since $v_{N}/\phi^{*}\geq 1$ , this will never happen if $v_{i}/\phi_{i}<1$ . However, as the value of data is submodular, for each party $i\in N$ , $v_{i}\nless\phi_{i}$ due to (C). Thus, R4 may not be satisfied.

Appendix D Proof of Theorem 1

Firstly, the values $(r_{i})_{i\in N}$ of model rewards satisfy non-negativity (R1): For each party $i\in N$ , $\phi_{i}\geq 0$ due to (2) and the non-negative and monotonic value of data using IG (1).

Next, the values $(r_{i})_{i\in N}$ of model rewards satisfy feasibility (R2): For each party $i\in N$ , $r_{i}\triangleq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}\leq 1^{\rho}\times v_{N}=v_{N}$ when $\rho\geq 0$ . Also, weak efficiency (R3) is satisfied since $r_{i}=v_{N}$ for the party $i$ with the largest Shapley value $\phi^{*}$ .

When $\rho>0$ , the values $(r_{i})_{i\in N}$ of model rewards are $\rho$ -Shapley fair with $k=(1/\phi^{*})^{\rho}\times v_{N}$ in Definition 2.

D.1 Proof of Individual Rationality (R4)

For each party $i\in N$ , since

\rho\leq\rho_{r}\triangleq\min_{j\in N}{\frac{\log(v_{j}/v_{N})}{\log(\phi_{j}/\phi^{*})}}\leq\frac{\log(v_{i}/v_{N})}{\log(\phi_{i}/\phi^{*})}

and $\log(\phi_{i}/\phi^{*})\leq 0\ ,$

	$\displaystyle\log(v_{i}/v_{N})$	$\displaystyle\leq\log{\left(\phi_{i}/\phi^{*}\right)}^{\rho}$
	$\displaystyle v_{i}/v_{N}$	$\displaystyle\leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}$
	$\displaystyle v_{i}$	$\displaystyle\leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}=r_{i}\ .$

As $r_{i}\geq v_{i}$ for each party $i\in N$ , the values $(r_{i})_{i\in N}$ of model rewards satisfy R4.

For each party $i\in N$ , when $\rho=0$ , $r_{i}=v_{N}$ . Since the value of data is monotonic, $v_{i}\leq v_{N}$ . So, $\rho_{r}\geq 0$ . It is possible that $\rho_{r}>1$ .

D.2 Proof of Stability (R6)

For each party $i\in N$ , since

\rho\leq\rho_{s}\triangleq\min_{j\in N}{\frac{\log(v_{C_{j}}/v_{N})}{\log(\phi_{j}/\phi^{*})}}\leq\frac{\log(v_{{C_{i}}}/v_{N})}{\log(\phi_{i}/\phi^{*})}

and $\log(\phi_{i}/\phi^{*})\leq 0\ ,$

	$\displaystyle\log(v_{C_{i}}/v_{N})$	$\displaystyle\leq\log{\left(\phi_{i}/\phi^{*}\right)}^{\rho}$
	$\displaystyle v_{C_{i}}/v_{N}$	$\displaystyle\leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}$
	$\displaystyle v_{C_{i}}$	$\displaystyle\leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}=r_{i}\ .$

As $r_{i}\geq v_{C_{i}}$ for each party $i\in N$ , the values $(r_{i})_{i\in N}$ of model rewards satisfy R6.

For each party $i\in N$ , when $\rho=0$ , $r_{i}=v_{N}$ . Since the value of data is monotonic, $v_{i}\leq v_{C_{i}}\leq v_{N}$ . So, $\rho_{r}\geq\rho_{s}\geq 0$ . It is possible that $\rho_{s}>1$ .

D.3 Proof of Fairness (R5)

Firstly, we show that the Shapley value satisfies the following party monotonicity property,¹²¹²12Another solution concept that satisfies the party monotonicity property is the Banzhaf index. which will be used in our proof of F4: For all $i\in N$ , if

\begin{array}[]{l}\displaystyle(\exists B\subseteq N\setminus\{i\}\ \ v^{\prime}_{B\cup\{i\}}>v_{B\cup\{i\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i\}\ \ v^{\prime}_{C\cup\{i\}}\geq v_{C\cup\{i\}})\ \land\vspace{1mm}\\ (\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A})\ ,\end{array}

(8)

then $\forall j\in N\setminus\{i\}\ \ \phi^{\prime}_{i}-\phi_{i}\geq\phi^{\prime}_{j}-\phi_{j}\ .$

Proof.

Let $v^{\Delta}_{C}\triangleq\left(v^{\prime}_{C}-v_{C}\right)$ for any coalition $C\subseteq N$ . It follows from (2) that for any party $j\in N\setminus\{i\}$ ,

	$\displaystyle\phi^{\prime}_{j}-\phi_{j}$	$\displaystyle=\frac{1}{n!}\sum_{\pi\in\Pi_{i\prec j}}{\left(v^{\Delta}_{S_{\pi,j}\cup\{j\}}-v^{\Delta}_{S_{\pi,j}}\right)}$
		$\displaystyle\leq\frac{1}{n!}\sum_{\pi\in\Pi_{i\prec j}}{\left(v^{\Delta}_{S_{\pi,j}\cup\{j\}}\right)}$
		$\displaystyle=\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{j\prec i}}{\left(v^{\Delta}_{S_{\pi^{\prime},i}\cup\{i\}}\right)}$
		$\displaystyle\leq\frac{1}{n!}\sum_{\pi\in\Pi_{N}}{\left(v^{\Delta}_{S_{\pi,i}\cup\{i\}}\right)}$
		$\displaystyle=\phi^{\prime}_{i}-\phi_{i}$

where $\Pi_{i\prec j}$ is a subset of $\Pi_{N}$ containing all permutations with $i$ preceding $j$ . We consider only the subset $\Pi_{i\prec j}$ of $\Pi_{N}$ in the first equality due to (8) where the quality of a model trained on the aggregated data of a coalition can only change/increase when containing party $i$ . The first inequality is also due to (8) where $v^{\prime}_{C}\geq v_{C}$ for every $C\subseteq N$ and thus $v^{\Delta}_{S_{\pi,j}}\geq 0$ . The second equality can be understood by swapping the positions of $i$ and $j$ in permutation $\pi$ to obtain $\pi^{\prime}$ such that $S_{\pi,j}\cup\{j\}=S_{\pi^{\prime},i}\cup\{i\}$ . The second inequality holds because there may exist a permutation $\pi$ where $i\prec j\ \text{in}\ \pi$ (i.e., $j\notin S_{\pi,i}$ ) and $v^{\Delta}_{S_{\pi,i}\cup\{i\}}\geq 0$ . The last equality holds since $v^{\Delta}_{S_{\pi,i}}=0$ due to (8) where $\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A}$ . ∎

Let us now consider $r_{i}\triangleq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}$ . We will prove the following properties for R5:

\bullet

F1 Uselessness.

\begin{array}[]{l}\displaystyle(\forall C\subseteq N\setminus\{i\}\ \ v_{C\cup\{i\}}=v_{C})\vspace{1mm}\\ \Rightarrow\phi_{i}=0\vspace{1mm}\\ \Rightarrow r_{i}=0\ .\end{array}

\bullet

F2 Symmetry. Since each $\phi_{i}^{\rho}$ is multiplied by the same factor $k=v_{N}/{\phi^{*}}^{\rho}$ ,

\begin{array}[]{l}\displaystyle(\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}=v_{C\cup\{j\}})\vspace{1mm}\\ \Rightarrow\phi_{i}=\phi_{j}\vspace{1mm}\\ \Rightarrow r_{i}=r_{j}\ .\end{array}

\bullet

F3 Strict Desirability. Since each $\phi_{i}^{\rho}$ is multiplied by the same factor $k=v_{N}/{\phi^{*}}^{\rho}$ and $k>0$ ,

\begin{array}[]{l}\displaystyle(\exists B\subseteq N\setminus\{i,j\}\ \ v_{B\cup\{i\}}>v_{B\cup\{j\}})\ \land\vspace{1mm}\\ (\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}\geq v_{C\cup\{j\}})\vspace{1mm}\\ \Rightarrow\phi_{i}>\phi_{j}\vspace{1mm}\\ \Rightarrow r_{i}>r_{j}\ .\end{array}

The first implication in the proof of F1, F2, and F3 follows from Proposition 1.

\bullet

F4 Strict Monotonicity. Let $\{v_{C}\}_{C\in 2^{N}}$ and $\{v^{\prime}_{C}\}_{C\in 2^{N}}$ denote any two sets of values of data over all coalitions $C\subseteq N$ , and $\phi_{i}$ and $\phi^{\prime}_{i}$ be the corresponding Shapley values of party $i$ . Let $\ell\in N$ be the party with the largest Shapley value $\phi^{\prime}_{\ell}$ based on $\{v^{\prime}_{C}\}_{C\in 2^{N}}$ . For any party $i\in N$ satisfying (6) (i.e., premise of F4), $\phi^{\prime}_{\ell}\geq\phi^{\prime}_{i}$ due to definition of party $\ell$ and $\phi^{\prime}_{i}-\phi_{i}\geq\phi^{\prime}_{\ell}-\phi_{\ell}$ due to party monotonicity property when $\ell\in N\setminus\{i\}$ . Together, it follows that $\phi_{\ell}\geq\phi_{i}$ . Then,

$\displaystyle\phi^{\prime}_{i}-\phi_{i}$	$\displaystyle\geq\phi^{\prime}_{\ell}-\phi_{\ell}\vspace{1mm}$
$\displaystyle\phi_{\ell}(\phi^{\prime}_{i}-\phi_{i})$	$\displaystyle\geq\phi_{i}(\phi^{\prime}_{\ell}-\phi_{\ell})\vspace{1mm}$
$\displaystyle\phi_{\ell}\times\phi^{\prime}_{i}$	$\displaystyle\geq\phi_{i}\times\phi^{\prime}_{\ell}\vspace{1mm}$
$\displaystyle\phi^{\prime}_{i}/\phi^{\prime}_{\ell}$	$\displaystyle\geq\phi_{i}/\phi_{\ell}\ .$	(9)

If $\phi_{\ell}>\phi_{i}$ , then $\phi^{\prime}_{i}/\phi^{\prime}_{\ell}>\phi_{i}/\phi_{\ell}$ instead. The equality in (9) holds only if $\phi_{\ell}=\phi_{i}$ . Let $p\in N$ be the party with the largest Shapley value $\phi_{p}$ based on $\{v_{C}\}_{C\in 2^{N}}$ . Then,

\displaystyle\begin{array}[]{l}r_{i}^{\prime}\\ \displaystyle={\left({\phi_{i}^{\prime}}/{\phi^{\prime}_{\ell}}\right)}^{\rho}\times v^{\prime}_{N}\vspace{1mm}\\ \displaystyle\geq{\left({\phi_{i}^{\prime}}/{\phi^{\prime}_{\ell}}\right)}^{\rho}\times v_{N}\vspace{1mm}\\ \displaystyle\geq{\left({\phi_{i}}/{\phi_{\ell}}\right)}^{\rho}\times v_{N}\vspace{1mm}\\ \displaystyle\geq{\left({\phi_{i}}/{\phi_{p}}\right)}^{\rho}\times v_{N}\\ =r_{i}\end{array}

where the first inequality follows from (6), the second inequality is due to (9), and the last inequality is due to the definition of party $p$ . For equality in $r^{\prime}_{i}\geq r_{i}$ to possibly hold, it has to be the case that $\phi_{i}=\phi_{\ell}=\phi_{p}$ and $v^{\prime}_{N}=v_{N}$ . Based on the definitions of $r_{i}$ and $r^{\prime}_{i}$ in Theorem 1, this implies $r^{\prime}_{i}=v^{\prime}_{N}=v_{N}=r_{i}$ which does not satisfy $v^{\prime}_{N}>r_{i}$ in (6) (i.e., premise of F4). For all other cases, $r^{\prime}_{i}>r_{i}$ .

Appendix E Details of Experimental Settings

In all the experiments, we optimize $\eta_{i}$ for $i\in N$ (Sec. 4.3) by finding the root of $\mathbb{I}(\boldsymbol{\theta};D_{i}^{r})-r^{*}_{i}$ using a root-finding algorithm called the TOMS Algorithm $748$ method.¹³¹³13Alefeld, G. E., Potra, F. A., and Shi, Y. Algorithm $748$ : Enclosing zeros of continuous functions. ACM Trans. Math. Softw., $21(3)$ , $327$ – $344$ , $1995$ .

E.1 Gaussian Process (GP) Regression with Synthetic Friedman Dataset

The following Friedman function is used in our experiment:

\begin{array}[]{rl}y=&10\sin(\pi{\rm{\mathbf{x}}}_{[d,0]}{\rm{\mathbf{x}}}_{[d,1]})+20({\rm{\mathbf{x}}}_{[d,2]}-0.5)^{2}\vspace{1mm}\\ &+10{\rm{\mathbf{x}}}_{[d,3]}+5{\rm{\mathbf{x}}}_{[d,4]}+0{\rm{\mathbf{x}}}_{[d,5]}+\mathcal{N}(0,1)\end{array}

where $d$ is the index of data point $({\rm{\mathbf{x}}},y)$ and its $6$ input features (i.e., ${\rm{\mathbf{x}}}\triangleq({\rm{\mathbf{x}}}_{[d,a]})_{a=0,\ldots,5}$ ) are independent and uniformly distributed over the input domain $[0,1]$ . The last input feature ${\rm{\mathbf{x}}}_{[d,5]}$ is an independent variable which does not affect the output $y$ . We standardize the values of output $y$ and train a GP model¹⁴¹⁴14For all GP models, we use automatic relevance determination such that each input feature has a different lengthscale parameter. with a squared exponential kernel.

For each tested value of $\rho$ , we reset the random seed to a fixed value and draw $20$ samples of $\mathbf{z}_{i}$ from $\mathcal{N}(\mathbf{0},\eta_{i}\mathbf{I})$ with an optimized $\eta_{i}$ (Sec. 4.3). Though $\eta_{i}$ differs for varying $\rho$ , the fixed random seed makes the standardized samples of $\mathbf{z}_{i}$ the same across all values of $\rho$ , thus allowing a fair comparison. We evaluate the performance of our reward scheme on a test dataset which consists of $800$ points generated from the Friedman function.

E.2 Gaussian Process (GP) Regression with DiaP Dataset

We remove the gender feature and standardize the values of both input matrix ${\rm{\mathbf{X}}}$ and output vector ${\rm{\mathbf{y}}}$ . We train a GP model with a composite kernel comprising the squared exponential kernel and the exponential kernel.

We consider $5$ different train-test (i.e., $80\%$ - $20\%$ at random) splits. For each split, we create $10$ different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5.

E.3 Neural Network (NN) Regression with CaliH Dataset

We first standardize the values of both input matrix ${\rm{\mathbf{X}}}$ and output vector ${\rm{\mathbf{y}}}$ . Then, we use $60\%$ of the CaliH dataset to train a NN with $2$ dense and fully connected hidden layers of $100$ and $50$ units and use the rectified linear unit (ReLU) as the activation function. Subsequently, we perform BLR on the outputs from the last hidden layer.

E.4 Sparse GP Regression with Synthetic Friedman Dataset (10 Parties)

We consider a larger synthetic Friedman dataset of size $5000$ and $3$ different train-test (i.e., $80\%$ - $20\%$ at random) splits. For each split, we create $10$ different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5. But, we divide the training dataset into $10$ consecutive blocks instead with the constraint that each party owns at least $5\%$ of the dataset.

As in Appendix E.1, we standardize the values of output $y$ . Since the exact Shapley value (2) is expensive to evaluate with a large number of parties, we approximate it using $3000$ samples generated by the simple random sampling algorithm (Maleki et al., 2013).

E.5 Calculating IG for BLR and GP Models

Let $D\triangleq({\rm{\mathbf{X}}},{\rm{\mathbf{y}}})$ be the training data where ${\rm{\mathbf{X}}}$ is an input matrix and ${\rm{\mathbf{y}}}$ ( ${\rm{\mathbf{f}}}$ ) is the corresponding noisy (latent) output vector such that ${\rm{\mathbf{y}}}$ is perturbed from ${\rm{\mathbf{f}}}$ by an additive Gaussian noise with noise variance $\sigma^{2}$ .

For BLR with model parameters/weights $\boldsymbol{\theta}$ and prior belief $p(\boldsymbol{\theta})\triangleq\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{0})$ , the IG on $\boldsymbol{\theta}$ given $D$ is

\mathbb{I}(\boldsymbol{\theta};D)=0.5\log(|{\rm{\mathbf{I}}}+{\rm{\mathbf{\Sigma}}}_{0}{\rm{\mathbf{X}}}^{\top}{\rm{\mathbf{X}}}/\sigma^{2}|)\ .

We also consider modeling an unknown/latent function $f$ with a full-rank GP and setting $\boldsymbol{\theta}=f$ such that the latent output vector ${\rm{\mathbf{f}}}$ comprises components $f({\rm{\mathbf{x}}})$ for all ${\rm{\mathbf{x}}}$ in ${\rm{\mathbf{X}}}$ . Then, the IG on $f$ given $D$ is

\mathbb{I}(f;D)=\mathbb{I}({\rm{\mathbf{f}}};D)=0.5\log(|{\rm{\mathbf{I}}}+K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}}/\sigma^{2}|)

(10)

where $K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}}$ is a covariance matrix with components $k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime})$ for all ${\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime}$ in ${\rm{\mathbf{X}}}$ and $k$ is a kernel function. When we use a sparse GP model¹⁵¹⁵15We use a sparse GP model called the deterministic training conditional (DTC) approximation of the GP model. with inducing input matrix ${\rm{\mathbf{U}}}$ , we replace $K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}}$ in (10) with $K^{\top}_{{\rm{\mathbf{X}}}{\rm{\mathbf{U}}}}K_{{\rm{\mathbf{U}}}{\rm{\mathbf{U}}}}^{-1}K_{{\rm{\mathbf{U}}}{\rm{\mathbf{X}}}}$ where $K_{{\rm{\mathbf{U}}}{\rm{\mathbf{U}}}}$ is a covariance matrix with components $k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime})$ for all ${\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime}$ in ${\rm{\mathbf{U}}}$ and $K_{{\rm{\mathbf{U}}}{\rm{\mathbf{X}}}}$ is a covariance matrix with components $k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime})$ for all ${\rm{\mathbf{x}}}$ in ${\rm{\mathbf{U}}}$ and ${\rm{\mathbf{x}}}^{\prime}$ in ${\rm{\mathbf{X}}}$ .

When calculating the IG for heteroscedastic data (i.e., data points with different noise variances), instead of dividing by $\sigma^{2}$ , we multiply by the diagonal matrix ${\rm{\mathbf{K}}}_{\text{noise}}^{-1}$ such that each diagonal component of ${\rm{\mathbf{K}}}_{\text{noise}}$ represents the noise variance corresponding to a data point in $D$ .

Appendix F Additional Experimental Results

F.1 Additional Results for Synthetic Friedman and BosH Datasets

GP regression with synthetic Friedman dataset. We also evaluate the performance of our reward scheme on a synthetic Friedman dataset with $2000$ data points. We standardize the values of output $y$ and train a GP model with a squared exponential kernel. We consider $5$ different train-test (i.e., $80\%$ - $20\%$ at random) splits. For each split, we create $10$ different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5.

The results are shown in Fig. 7. As before, the improvement in MNLP is usually positive. When $\rho=0.5$ , most points move closer to the diagonal line, which implies that parties with smaller $\phi_{i}$ can now receive more valuable model rewards with higher predictive accuracy.

NN regression with BosH dataset. The Boston housing (BosH) dataset contains the value of $506$ houses with $10$ selected input features (Harrison & Rubinfeld, 1978). We train a NN with $2$ dense and fully connected hidden layers of $50$ units each and use the rectified linear unit (ReLU) as the activation function. Subsequently, we perform BLR on the outputs from the last hidden layer.

We remove $3$ input features (i.e., proportion of residential land zoned for lots over $25,000$ sq. ft., proportion of non-retail business acres per town, and the Charles River dummy variable) and standardize the values of both input matrix ${\rm{\mathbf{X}}}$ and output vector ${\rm{\mathbf{y}}}$ . Then, we use $80\%$ of the BosH dataset to train the first $2$ layers of the NN and divide the entire BosH dataset into a test dataset and among $3$ parties using our partitioning algorithm. Due to the small data size, there is overlap between the data used to train the “public” NN layers and the data that is divided among parties.

Fig. 8 shows results of our reward scheme for multiple partitions of BosH dataset among $n=3$ parties. The same observations follow: Most model rewards achieve positive improvement in MNLP.

When a smaller $\rho$ is used (i.e., Figs. 8b and 8d) or when the parties have higher Shapley values relative to the others (i.e., lighter color), they will receive more valuable model rewards which translates to lower MNLP (i.e., points lie closer to the diagonal identity line).

F.2 Empirical Analysis of Relationship between IG and MNLP

In this work, our proposed data valuation method and reward scheme are based on IG. Our experimental results have shown that IG is closely related to the predictive accuracy (i.e., MNLP) of a trained model. A model with higher IG usually achieves lower MNLP as it produces smaller predictive variances (i.e., related to the first term of (3)). However, the relationship between IG and MNLP is not strictly monotonic as it is also affected by the second term of (3), i.e., the scaled squared error. In this subsection, we will demonstrate some poor use cases of our scheme where improvement in IG fails to translate to higher predictive accuracy of the trained model more frequently.

The first poor use case is when an unsuitable model is selected. To demonstrate this, we train an alternative GP regression model with an unsuitable squared exponential kernel on the DiaP dataset. Even when ample training data is provided, the squared error remains high relative to the variance in the data. The results achieved using this unsuitable GP model are shown in Figs. 9c-d. For ease of comparison, we copy the results of the GP model with a suitable kernel in Figs. 3c-d to Figs. 9a-b. We can observe that the maximal and actual improvement in MNLP are mostly positive for the suitable model (Figs. 9a-b). However, for the unsuitable model, fewer positive values of the maximal and actual improvement in MNLP can be observed in Figs. 9c-d. In Figs. 9b and 9d, there are, respectively, $631$ and $388$ points with positive improvement in MNLP (out of a maximum of $750$ points). Improvement in IG fails to translate to a higher predictive accuracy of its trained model more frequently when an unsuitable model is selected.

The next poor use case is when the model prior, which determines its predictive performance with limited data, is sufficiently informative, hence leading to a high predictive accuracy even if the model is only trained on a small amount of data. An example of an informative model prior is when the prior mean of the model parameters is close to the true mean and the prior variance is small. In this case, although the first term of (3) is smaller for a model with higher IG, the small predictive variance will magnify the squared error, thus increasing the second term of (3). So, on the overall, the model with a higher IG may not have a significantly lower MNLP. Instead, it is possible that it has a worse/higher MNLP. We demonstrate this with the BosH dataset by setting the prior mean of the BLR parameters to be the NN last layer’s weights (instead of the zero vector as in Appendix F.1) and prior variance to be a small value of $0.01$ (instead of $1$ ). The MNLP results are shown in Fig. 10. It can be observed that the largest maximal improvement in MNLP in Fig. 10 is around $0.2$ which is significantly smaller than that (i.e., $1.0$ ) in Figs. 8c-d. In Fig. 10a, there are more instances with negative improvement in MNLP compared to that in Fig. 8c.

The last poor use case is related to the previous one. Even if the model prior is insufficiently informative, the phenomenon can also happen if each party has ample data to independently train a model with a reasonable predictive accuracy. More training data lead to a higher IG and reduce the first term of (3) but magnify the squared error, thus potentially increasing MNLP.

Summary. Ideally, improvement in IG by training on the additional data decided by our reward scheme should translate to a higher predictive accuracy of the model reward. This should occur when a suitable model is selected and the model prior is not sufficiently informative for any party to achieve a high predictive accuracy by training a model on its own data. In practice, this is reasonable as collaboration precisely happens only when every individual party cannot train a high-quality model alone but can do so from working together.

F.3 Comparing IG-Based vs. MNLP-Based Data Valuation Methods

In Sec. 3, our data valuation method is based on IG as it can avoid the need and difficulty of selecting a common validation dataset, unlike other data valuation methods (e.g., using MNLP). We also claim that IG is a suitable surrogate measure of the predictive accuracy of a trained model. In this subsection, we will demonstrate this by comparing the values of data and Shapley values when the data valuation is performed using IG vs. MNLP.

For the MNLP-based data valuation method, let the value of data $D_{C}$ for any $C\subseteq N$ be $v^{\prime}_{C}=\text{MNLP}_{\emptyset}-\text{MNLP}_{C}$ where $\text{MNLP}_{C}$ is defined in the same way as (3) except that $D_{i}^{r}$ in (3) is replaced by $D_{C}$ . For each party $i\in N$ , we can compute its alternative Shapley value $\phi_{i}^{\prime}$ using (2) by replacing $v_{C}$ with $v^{\prime}_{C}$ for all $C\subseteq N$ .

Since we are more concerned about the relative values of data and Shapley values between parties, Figs. 11a-b show results of the IG-based vs. MNLP-based normalized values of data (respectively, $v_{i}/\sum_{i\in N}{v_{i}}$ vs. $v^{\prime}_{i}/\sum_{i\in N}{v^{\prime}_{i}}$ ) of $5$ different partitions, while Figs. 11c-d show results of the IG-based vs. MNLP-based normalized Shapley values (respectively, $\phi_{i}/v_{N}$ vs. $\phi^{\prime}_{i}/v^{\prime}_{N}$ ) of the same $5$ partitions. These partitions are generated using the partitioning algorithm in Sec. 5 on a training dataset of $800$ data points. For each partition, we consider $10$ different validation datasets of size $200$ , which results in $10$ possible sets of $\{v^{\prime}_{C}\}_{C\in 2^{N}}$ and $\{\phi^{\prime}_{i}\}_{i\in N}$ . The $10$ validation datasets are either randomly selected or generated using the partitioning algorithm on the larger withheld test dataset (using the same feature $a$ ) so that it may overlap or exclude any party’s data range. We measure the similarity between the party’s data and the validation dataset using the Wasserstein distance $W(\cdot,\cdot)$ between both their values of data or Shapley values for feature $a$ . The smaller the distance, the more similar the party’s data and the validation dataset.

In Fig. 11, as the validation dataset varies, the normalized values of data and Shapley values for the MNLP-based data valuation method differ significantly (i.e, large spread along the y-axis). This shows that the MNLP-based data valuation method is highly sensitive to the choice of the validation dataset. Parties are valued significantly lower if their data is very different from the test dataset. This is reflected by the position of the darker colored points below the lighter colored points.

The IG-based data valuation method can serve as a good surrogate for the MNLP-based data valuation method as it will not favor any specific party. In particular, for the Friedman dataset, the normalized IG-based value of data/Shapley value (i.e., diagonal line) (additionally) closely approximates the normalized MNLP-based values of data/Shapley values over different validation datasets (i.e., scatter points) on average.


(a) IG, $\rho=1$	(b) IG, $\rho=0.5$

(c) MNLP, $\rho=1$	(d) MNLP, $\rho=0.5$


(a) IG, $\rho=1$	(b) IG, $\rho=0.5$

(c) MNLP, $\rho=1$	(d) MNLP, $\rho=0.5$


(a) IG, $\rho=1$	(b) IG, $\rho=0.5$

(c) MNLP, $\rho=1$	(d) MNLP, $\rho=0.5$


(a) IG, $\rho=1$	(b) IG, $\rho=0.5$

(c) MNLP, $\rho=1$	(d) MNLP, $\rho=0.5$


(a) IG, $\rho=1$	(b) IG, $\rho=0.5$

(c) MNLP, $\rho=1$	(d) MNLP, $\rho=0.5$


(a) $\rho=1$ , suitable model	(b) $\rho=0.5$ , suitable model

(c) $\rho=1$ , unsuitable model	(d) $\rho=0.5$ , unsuitable model

Collaborative Machine Learning with Incentive-Aware Model Rewards

Abstract

1 Introduction

2 Problem Formulation

3 Data Valuation with Information Gain

4 Incentive-Aware Reward Scheme with Model Rewards

4.1 Incentives

4.1.1 Fairness

Proposition 1.

Definition 1 (Shapley Fairness).

Example 1.

Definition 2 (ρ\rho-Shapley Fairness).

4.1.2 Stability and Group Welfare

Definition 3 (Stability).

4.2 Reward Scheme Considering All Incentives

Theorem 1.

4.3 Realization of Model Rewards

5 Experiments and Discussion

6 Conclusion

Acknowledgements

References

Appendix A Proof of Submodularity of IG (1)

Appendix B Proof of Proposition 1

Appendix C Enforcing Shapley Fairness may not satisfy Individual Rationality (R4) for Submodular Value of Data

Appendix D Proof of Theorem 1

D.1 Proof of Individual Rationality (R4)

D.2 Proof of Stability (R6)

D.3 Proof of Fairness (R5)

Proof.

Appendix E Details of Experimental Settings

E.1 Gaussian Process (GP) Regression with Synthetic Friedman Dataset

E.2 Gaussian Process (GP) Regression with DiaP Dataset

E.3 Neural Network (NN) Regression with CaliH Dataset

E.4 Sparse GP Regression with Synthetic Friedman Dataset (10 Parties)

E.5 Calculating IG for BLR and GP Models

Appendix F Additional Experimental Results

F.1 Additional Results for Synthetic Friedman and BosH Datasets

F.2 Empirical Analysis of Relationship between IG and MNLP

F.3 Comparing IG-Based vs. MNLP-Based Data Valuation Methods

Definition 2 ( $\rho$ -Shapley Fairness).