
Collaborative Machine Learning with Incentive-Aware Model Rewards

Rachael Hwee Ling Sim    Yehong Zhang    Mun Choon Chan    Bryan Kian Hsiang Low
Abstract

Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties. However, these parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions. This motivates the need for measuring a party’s contribution and designing an incentive-aware reward scheme accordingly. This paper proposes to value a party’s reward based on Shapley value and information gain on model parameters given its data. Subsequently, we give each party a model as a reward. To formally incentivize the collaboration, we define some desirable properties (e.g., fairness and stability) which are inspired by cooperative game theory but adapted for our model reward that is uniquely freely replicable. Then, we propose a novel model reward scheme to satisfy fairness and trade off between the desirable properties via an adjustable parameter. The value of each party’s model reward determined by our scheme is attained by injecting Gaussian noise to the aggregated training data with an optimized noise variance. We empirically demonstrate interesting properties of our scheme and evaluate its performance using synthetic and real-world datasets.

Data Valuation, Incentives in Machine Learning

1 Introduction

Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models. While an individual party may have limited data, it is possible to build improved, high-quality ML models by training on the aggregated data from many parties. For example, in healthcare, a hospital or healthcare firm whose data diversity and quantity are limited due to its small patient base can draw on data from other hospitals and firms to improve the prediction of some disease progression (e.g., diabetes) (Center for Open Data Enterprise, 2019). This collaboration can be encouraged by a government agency, such as the National Institute of Health in the United States. In precision agriculture, a farmer with limited land area and sensors can combine his collected data with that of other farmers to improve the modeling of the effect of various factors (e.g., weather, pests) on his crop yield (Claver, 2019). Such data sharing also benefits other application domains, including real estate, in which a property agency can pool together its limited transactional data with that of other agencies to improve the prediction of property prices (Conway, 2018).

However, any party would have incurred some nontrivial cost to collect its data. So, they would not altruistically donate their data and risk losing their competitive edge. These parties will be motivated to share their data when given enough incentives, such as a guaranteed benefit from the collaboration and a fair higher reward from contributing more valuable data. To this end, we propose to value each party’s contributed data and design an incentive-aware reward scheme to give each party a separate ML model as a reward (in short, model reward) accordingly. We use only model rewards and exclude monetary compensations as (a) in the above-mentioned applications, every party such as a hospital is mainly interested in improving model quality for unlimited future test predictions; (b) there may not be a feasible and clear source of monetary revenue to compensate participants (e.g., due to restrictions, the government may not be able to pay firms using tax revenues); and (c) if parties have to pay to participate in the collaboration, then the total payment is debatable and many may lack funding while still having valuable data.

How then should we value a party’s data and its effect on model quality? To answer this first question, we propose a valuation method based on the informativeness of data. In particular, more informative data induces a greater reduction in the uncertainty of the model parameters, hence improving model quality. In contrast, existing data valuation methods (Ghorbani & Zou, 2019; Ghorbani et al., 2020; Jia et al., 2019a, b; Yoon et al., 2020) measure model quality via its validation accuracy, which calls for a tedious or even impossible process of needing all parties to agree on a common validation dataset. Inaccurate valuation can also arise due to how a party’s test queries, which are likely to change over time, differ from the validation dataset. Our data valuation method does not make any assumption about the current or future distribution of test queries.

Next, how should we design a reward scheme to decide the values of model rewards for incentivizing a collaboration? Intuitively, a party will be motivated to collaborate if it can receive a better ML model than others who have contributed less valuable data, and than what it can build alone or get from collaborating separately with some parties. Also, the parties often like to maximize the total benefit from the collaboration. These incentives appear related to solution concepts (fairness, individual rationality, stability, and group welfare) from cooperative game theory (CGT), respectively. However, as CGT assumptions are restrictive for the uniquely freely replicable nature of our model reward (data and model rewards, like digital goods, can be replicated at zero marginal cost and given to more parties), these CGT concepts have to be adapted for defining incentives for our model reward. We then design a novel model reward scheme to provide these incentives.

Finally, how can we realize the values of the model rewards decided by our scheme? How should we modify the model rewards or the data used to train them? An obvious approach to control the values of the model rewards is to select and only train on subsets of the aggregated data. However, this requires considering an exponential number of discrete subsets of data, which is intractable for large datasets such as medical records. We avoid this issue by injecting noise into the aggregated data from multiple parties instead. The value of each party’s model reward can then be realized by simply optimizing the continuous noise variance parameter.

The specific contributions of our work here include:

  • Proposing a data valuation method using the information gain (IG) on model parameters given the data (Sec. 3);

  • Defining new conditions for incentives (i.e., Shapley fairness, stability, individual rationality, and group welfare) that are suitable for the freely replicable nature of our model reward (Sec. 4.1). As these incentives cannot all be provided at the same time, we design a novel model reward scheme with an adjustable parameter to trade off between them while maintaining fairness (Sec. 4.2);

  • Injecting Gaussian noise into the aggregated data from multiple parties and optimizing the noise variance parameter for realizing the values of the model rewards decided by our scheme (Sec. 4.3); and

  • Demonstrating interesting properties of our model reward scheme empirically and evaluating its performance with synthetic and real-world datasets (Sec. 5).

To the best of our knowledge, our work here is the first to propose a collaborative ML scheme that formally considers incentives beyond fairness and relies solely on model rewards to realize them. Existing works (Jia et al., 2019a, b; Ohrimenko et al., 2019) have only looked at fairness and, where further incentives are considered, have to resort to monetary compensation.

2 Problem Formulation

In our problem setting, we consider $n$ honest and non-malicious parties, each of whom owns some data, and assume the availability of a trusted central party (in reality, such a central party can be found in established data sharing platforms like Ocean Protocol (Ocean Protocol Foundation, 2019) and Data Republic (https://datarepublic.com)) who aggregates data from these parties, measures the value of their data, and distributes a resulting trained ML model to each party. We first introduce the notations and terminologies used in this work: Let $N \triangleq \{1,\ldots,n\}$ denote a set of $n$ parties. Any subset $C \subseteq N$ is called a coalition of parties. The grand coalition is the set $N$ of all parties. Parties will team up and partition themselves into a coalition structure $CS$. Formally, $CS$ is a set of coalitions such that $\bigcup_{C \in CS} C = N$ and $C \cap C' = \emptyset$ for any $C, C' \in CS$ with $C \neq C'$. The data of party $i \in N$ is represented by $D_i \triangleq (\mathbf{X}_i, \mathbf{y}_i)$ where $\mathbf{X}_i$ and $\mathbf{y}_i$ are the input matrix and output vector, respectively. Let $v_C$ denote the value of the (aggregated) data $D_C \triangleq \{D_i\}_{i \in C}$ of any coalition $C \subseteq N$. We write $v_i$ for $v_{\{i\}}$ to ease notation. For each party $i \in N$, $r_i$ denotes the value of its received model reward.

The objective is to design a collaborative ML scheme for the central party to decide and realize the values $(r_i)_{i \in N}$ of model rewards distributed to parties $1, \ldots, n$. The scheme should satisfy certain incentives (e.g., fairness and stability) to encourage the collaboration. Some works (Ghorbani & Zou, 2019; Jia et al., 2019a, b) have considered similar problems and can fairly partition the (monetary) value $v_N$ of the entire aggregated data into $r_i$ for $i \in N$. They achieved this using results from cooperative game theory (CGT). Our problem, however, differs and cannot be addressed directly using CGT due to the freely replicable nature of our model reward: the total value $\sum_{i \in N} r_i$ of received model rewards over all parties $i \in N$ can exceed $v_N$.

We will next show how to assess the value of data (Sec. 3) and, more importantly, how to design the reward scheme to decide the values $(r_i)_{i \in N}$ of model rewards accordingly and realize these values for achieving our incentive-aware objective (Sec. 4).

3 Data Valuation with Information Gain

A set $D_C$ of data is considered more valuable (i.e., higher $v_C$) if it can be used to train a higher-quality (hence more valuable) ML model. Existing data valuation methods (Ghorbani & Zou, 2019; Ghorbani et al., 2020; Jia et al., 2019b; Yoon et al., 2020) measure the quality of a trained ML model via its validation accuracy. However, these methods require the central party to carefully select a validation dataset that all parties must agree on. This is often a tedious or even impossible process, especially if every party's test queries, which are likely to change over time, differ from the validation dataset. (In the unlikely event that all parties can agree on a common validation dataset, validation accuracy would be the preferred measure of model quality and hence data valuation method.) For example, suppose two private hospitals $\mathcal{H}_1$ and $\mathcal{H}_2$ aggregate their data for diabetes prediction. $\mathcal{H}_1$ and $\mathcal{H}_2$ prefer accurate test predictions for female and young patients, respectively. Due to the differences in their data sizes, data qualities, and preferred test queries, it is difficult for the central party to decide the demographics of the patients in the validation dataset such that the data valuation is unbiased and accurate.

To circumvent the above-mentioned issues, our proposed data valuation method instead considers an information-theoretic measure of the quality of a trained model in terms of the reduction in uncertainty of the model parameters, denoted by the vector $\boldsymbol{\theta}$, after training on data $D_C$. We use the prior entropy $\mathbb{H}(\boldsymbol{\theta})$ and posterior entropy $\mathbb{H}(\boldsymbol{\theta}|D_C)$ to represent the uncertainty of $\boldsymbol{\theta}$ before and after training on $D_C$, respectively. So, if the data $D_C$ for $C \subseteq N$ can induce a greater reduction in the uncertainty/entropy of $\boldsymbol{\theta}$ or, equivalently, information gain (IG) $\mathbb{I}(\boldsymbol{\theta}; D_C)$ on $\boldsymbol{\theta}$:

$$v_C \triangleq \mathbb{I}(\boldsymbol{\theta}; D_C) = \mathbb{H}(\boldsymbol{\theta}) - \mathbb{H}(\boldsymbol{\theta}|D_C)\ , \qquad (1)$$

then a higher-quality (hence more valuable) model can be trained using this more valuable/informative data $D_C$. IG is an appropriate data valuation method as it is often used as a surrogate measure of the test/predictive accuracy of a trained model (Krause & Guestrin, 2007; Kruschke, 2008) since the test queries are usually not known a priori. We empirically demonstrate such a surrogate effect in Appendix F.3. The predictive distribution of the output $y^*$ at the test input $\mathbf{x}^*$ given data $D_C$ is calculated by averaging over all possible model parameters $\boldsymbol{\theta}$ weighted by their posterior belief $p(\boldsymbol{\theta}|D_C)$, i.e., $p(y^*|\mathbf{x}^*, D_C) = \int p(y^*|\mathbf{x}^*, \boldsymbol{\theta})\ p(\boldsymbol{\theta}|D_C)\ \text{d}\boldsymbol{\theta}$. By reducing the uncertainty in $\boldsymbol{\theta}$, we can further rule out models that are unlikely given $D_C$ and place higher weights (i.e., by increasing $p(\boldsymbol{\theta}|D_C)$) on models that are closer to the true model parameters, thus improving the predictive accuracy for any test query in expectation. The value $v_C$ (1) of data $D_C$ has the following properties:

  • Data of an empty coalition has no value: $v_\emptyset = 0$.

  • Data of any coalition $C \subseteq N$ has non-negative value: $\forall C \subseteq N\ \ v_C \geq 0$.

  • Monotonicity. Adding more parties to a coalition cannot decrease the value of its data: $\forall C \subseteq C' \subseteq N\ \ v_{C'} \geq v_C$.

  • Submodularity. Data of any party $i$ is less valuable to a larger coalition which has more parties and data: $\forall i \in N\ \ \forall C \subseteq C' \subseteq N \setminus \{i\}\ \ v_{C' \cup \{i\}} - v_{C'} \leq v_{C \cup \{i\}} - v_C$.

The second and third properties are due to the "information never hurts" bound for entropy (Cover & Thomas, 1991). The submodularity of IG (1) assumes conditional independence of the data $D_i$ and $D_j$ given $\boldsymbol{\theta}$ for any $i, j \in N$ with $i \neq j$; its proof is in Appendix A.

The first two properties fulfill standard assumptions of CGT. The latter two properties will influence the design and properties of our model reward scheme in Sec. 4. For example, the monotonicity property ensures that the value $r_i$ of any party $i$'s model reward is never negative (Sec. 4).
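As a concrete illustration (not part of the paper, which treats general Bayesian models), for Bayesian linear regression with a Gaussian prior $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_0)$ on the weights and likelihood noise variance $\sigma^2$, the IG in (1) has a closed form that depends only on the inputs. A minimal sketch:

```python
import numpy as np

def info_gain_blr(X, prior_cov, noise_var):
    """IG (Eq. (1)) on the weights of Bayesian linear regression after observing
    inputs X: 0.5 * logdet(I + prior_cov @ X^T X / noise_var), i.e., the drop in
    (differential) entropy from the Gaussian prior to the Gaussian posterior."""
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(d) + prior_cov @ (X.T @ X) / noise_var)
    return 0.5 * logdet

# v_C for a coalition C is info_gain_blr evaluated on the stacked inputs of C's parties.
```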

4 Incentive-Aware Reward Scheme with Model Rewards

Recall from our problem setting in Sec. 2 that the central party will train a model for each party as a reward. The reward should be decided based on the value of each party's data relative to that of the other parties'. Let $D_i^r$ be the data (i.e., derived from $D_N$) used to train party $i$'s model reward. The value of party $i$'s model reward is $r_i \triangleq \mathbb{I}(\boldsymbol{\theta}; D_i^r)$ according to Sec. 3. In this section, we will first present the incentives to encourage collaboration (Sec. 4.1), then describe how our incentive-aware reward scheme satisfies them (Sec. 4.2), and finally show how to vary $D_i^r$ to realize the values of the model rewards decided by our scheme (Sec. 4.3).

We desire a collaboration involving the grand coalition (i.e., $CS = \{N\}$) as it results in the largest aggregated data $D_N$ and hence allows the highest value of model reward to be achieved; it also eliminates the tedious process of deciding how the parties should team up and partition themselves. In Sec. 4.1.2, we will discuss how to incentivize the grand coalition formation and increase the total benefit from the collaboration.

4.1 Incentives

Our reward scheme has to be valid, fair to all the parties, and guarantee an improved model for each party. To achieve these, we exploit and adapt key assumptions and constraints about rewards in CGT. As has been mentioned in Sec. 2, the modifications are necessary as the model reward is uniquely freely replicable. We require the following incentive conditions to hold for the values $(r_i)_{i \in N}$ of model rewards based on the chosen coalition structure $CS$:

  R1. Non-negativity. $\forall i \in N\ \ r_i \geq 0$.

  R2. Feasibility. The model reward received by each party in any coalition $C \in CS$ cannot be more valuable than the model trained on their aggregated data $D_C$: $\forall C \in CS\ \ \forall i \in C\ \ r_i \leq v_C$.

  R3. Weak Efficiency. In each coalition $C \in CS$, the model reward received by at least one party $i \in C$ is as valuable as the model trained on the aggregated data $D_C$ of $C$: $\forall C \in CS\ \ \exists i \in C\ \ r_i = v_C$.

  R4. Individual Rationality. Each party should receive a model reward that is at least as valuable as the model trained on its own data: $\forall i \in N\ \ r_i \geq v_i$.

R1 and R4 are the same as the solution concepts in CGT while R2 and R3 have been adapted. (R2 and R3 are, respectively, adapted from $\forall C \in CS\ \ \sum_{i \in C} r_i \leq v_C$ and $\forall C \in CS\ \ \sum_{i \in C} r_i = v_C$ in CGT (Chalkiadakis et al., 2011).) When $CS = \{N\}$ (i.e., grand coalition), R2 and R3 become $\forall i \in N\ \ r_i \leq v_N$ and $\exists i \in N\ \ r_i = v_N$, respectively. So, each party cannot receive a more valuable model reward than the model trained on $D_N$, as that would involve creating data artificially.
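For the grand coalition, these conditions can be checked directly from the coalition values and the proposed reward values; the helper below is a hypothetical sketch (not from the paper):

```python
def check_incentives(r, v, n):
    """Check R1-R4 for the grand coalition CS = {N}.

    r: dict mapping party i -> reward value r_i.
    v: dict mapping frozenset coalitions -> data value v_C (must contain
       the grand coalition and every singleton).
    """
    N = frozenset(range(n))
    r1 = all(r[i] >= 0 for i in range(n))                    # non-negativity
    r2 = all(r[i] <= v[N] for i in range(n))                 # feasibility
    r3 = any(abs(r[i] - v[N]) < 1e-9 for i in range(n))      # weak efficiency
    r4 = all(r[i] >= v[frozenset({i})] for i in range(n))    # individual rationality
    return {"R1": r1, "R2": r2, "R3": r3, "R4": r4}
```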

4.1.1 Fairness

In addition, when $CS = \{N\}$ (i.e., grand coalition), to guarantee that the reward scheme is fair to all $n$ parties, the values $(r_i)_{i \in N}$ of their model rewards must satisfy the following properties, which are inspired by the fairness concepts in CGT (Maschler & Peleg, 1966; Shapley, 1953; Young, 1985) and have also been experimentally studied from a normative perspective (de Clippel & Rozen, 2013):

  F1. Uselessness. (F1 and F2 are axioms of the Shapley value (Shapley, 1953) and have been widely used in existing ML works (Ghorbani & Zou, 2019; Jia et al., 2019b; Ohrimenko et al., 2019) for data valuation.) If including the data of party $i$ does not improve the quality of a model trained on the aggregated data of any coalition (e.g., when $D_i = \emptyset$), then party $i$ should receive a valueless model reward: For all $i \in N$,

     $(\forall C \subseteq N \setminus \{i\}\ \ v_{C \cup \{i\}} = v_C) \Rightarrow r_i = 0$.

  F2. Symmetry. If including the data of party $i$ yields the same improvement as that of party $j$ in the quality of a model trained on the aggregated data of any coalition (e.g., when $D_i = D_j$), then they should receive equally valuable model rewards: For all $i, j \in N$ s.t. $i \neq j$,

     $(\forall C \subseteq N \setminus \{i, j\}\ \ v_{C \cup \{i\}} = v_{C \cup \{j\}}) \Rightarrow r_i = r_j$.

  F3. Strict Desirability (Maschler & Peleg, 1966). If the quality of a model trained on the aggregated data of at least one coalition improves more by including the data of party $i$ than that of party $j$, but the reverse is not true, then party $i$ should receive a more valuable model reward than party $j$: For all $i, j \in N$ s.t. $i \neq j$,

     $(\exists B \subseteq N \setminus \{i, j\}\ \ v_{B \cup \{i\}} > v_{B \cup \{j\}}) \land (\forall C \subseteq N \setminus \{i, j\}\ \ v_{C \cup \{i\}} \geq v_{C \cup \{j\}}) \Rightarrow r_i > r_j$.

  F4. Strict Monotonicity. (Our definition is similar to coalitional monotonicity in (Young, 1985) except that we consider $>$ instead of $\geq$. This rules out the scenario where the value of party $i$'s data improves but not that of its model reward. We further check the feasibility of the improvement in the value of its model reward.) If the quality of a model trained on the aggregated data of at least one coalition containing party $i$ improves (e.g., by including more data of party $i$), ceteris paribus, then party $i$ should receive a more valuable model reward than before: Let $\{v_C\}_{C \in 2^N}$ and $\{v'_C\}_{C \in 2^N}$ denote any two sets of values of data over all coalitions $C \subseteq N$, and $r_i$ and $r'_i$ be the corresponding values of model rewards received by party $i$. For all $i \in N$,

     $(\exists B \subseteq N \setminus \{i\}\ \ v'_{B \cup \{i\}} > v_{B \cup \{i\}}) \land (\forall C \subseteq N \setminus \{i\}\ \ v'_{C \cup \{i\}} \geq v_{C \cup \{i\}}) \land (\forall A \subseteq N \setminus \{i\}\ \ v'_A = v_A) \land (v'_N > r_i) \Rightarrow r'_i > r_i$.

We have the following incentive condition:

  R5. Fairness. The values $(r_i)_{i \in N}$ of model rewards must satisfy F1 to F4.

Both F3 and F4 imply that the marginal contribution ($v_{C \cup \{i\}} - v_C$) matters. F4 is the only property guaranteeing that if party $i$ adds more valuable/informative data to a coalition containing it, ceteris paribus, then it should receive a more valuable model reward than before. Additionally, F3 establishes a relationship between parties $i$ and $j$: if their marginal contributions only differ for coalition $C$ (i.e., $v_{C \cup \{i\}} > v_{C \cup \{j\}}$ w.l.o.g.), then party $i$ should receive a more valuable model reward than party $j$.

To illustrate their significance, we consider two simpler reward schemes where we directly set the value of the model reward received by every party $i \in N$ as (a) the value of its data (i.e., $r_i \triangleq v_i$) or (b) the decrease in the value of the grand coalition's data if party $i$ leaves it (i.e., $r_i \triangleq v_N - v_{N \setminus \{i\}}$). Both schemes violate F3 and F4 as they ignore the values of the other parties' data: The value of the model reward received by party $i$ does not change when its marginal contribution to any non-empty coalition $C \subset N \setminus \{i\}$ increases. Intuitively, a party with a small value $v_i$ of data (e.g., few data points) can be highly valuable to the other parties if its data is distinct. Conversely, a party with a high value $v_j$ of data should be less valuable to the other parties with similar data as its marginal contribution is lower.

Hence, we consider the Shapley value, which captures the idea of marginal contribution precisely as it uses the expected marginal contribution of a party $i$ when it joins the parties preceding it in any permutation:

$$\text{Shapley}_v(i) = \frac{1}{n!} \sum_{\pi \in \Pi_N} \left( v_{S_{\pi,i} \cup \{i\}} - v_{S_{\pi,i}} \right) \qquad (2)$$

where $\Pi_N$ is the set of all possible permutations of $N$ and $S_{\pi,i}$ is the coalition of parties preceding $i$ in permutation $\pi$.

Proposition 1.

Setting $r_i = \text{Shapley}_v(i)$ for all $i \in N$ satisfies R5.

Its proof is in Appendix B. A party will have a larger Shapley value and hence value $r_i$ of model reward when its data is highly valuable on its own (e.g., with low inherent noise) and/or to the other parties (e.g., due to low correlation).
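For a small number of parties, the Shapley value in (2) can be computed exactly by enumerating permutations. A minimal sketch (coalition values supplied as a dict keyed by frozensets; the example reproduces the values used in Example 1 below):

```python
from itertools import permutations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values (Eq. (2)) by enumerating all n! permutations of N.

    v: dict mapping frozenset coalitions to values v_C (must include frozenset()).
    Returns [phi_1, ..., phi_n] indexed by party 0..n-1.
    """
    phi = [0.0] * n
    for perm in permutations(range(n)):
        preceding = frozenset()
        for i in perm:  # marginal contribution of i when it joins the preceding parties
            phi[i] += v[preceding | {i}] - v[preceding]
            preceding = preceding | {i}
    return [p / factorial(n) for p in phi]

# Two-party values as in Example 1: v_1 = 7, v_2 = 5, v_{1,2} = 8.
v = {frozenset(): 0, frozenset({0}): 7, frozenset({1}): 5, frozenset({0, 1}): 8}
print(shapley_values(v, 2))  # [5.0, 3.0]
```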

Fairness with Weak Efficiency (R3). We simplify the notation of $\text{Shapley}_v(i)$ to $\phi_i$. Such a reward scheme may not satisfy R3 as the total value $\sum_{i \in N} r_i$ of model rewards is only $v_N$ (Chalkiadakis et al., 2011). Due to the freely replicable nature of our model reward, we want the total value to exceed $v_N$ and the value of some party's model reward to be $v_N$. To satisfy other incentive conditions, we consider a function $g$ to map $(\phi_i)_{i \in N}$ to $(r_i)_{i \in N}$ (i.e., $r_i \triangleq g(\phi_i)$). To achieve fairness in R5, $g$ must be strictly increasing with $g(0) = 0$. These motivate us to propose the following:

Definition 1 (Shapley Fairness).

Given $\{v_C\}_{C \in 2^N}$, if there exists a constant $k > 0$ s.t. $r_i = k \phi_i$ for all $i \in N$, then the values $(r_i)_{i \in N}$ of model rewards are Shapley fair.

Note that CGT sets $k = 1$ (Proposition 1). Also, we can control the constant $k$ to satisfy other incentive conditions such as R3. A consequence of Definition 1 is that $\phi_i / \phi_j = r_i / r_j$. So, increasing the ratio of the expected marginal contribution $\phi_i$ of party $i$ to $\phi_j$ of party $j$ (e.g., by including more valuable/informative data of party $i$, ceteris paribus) results in the same increase in the ratio of the value $r_i$ of the model reward received by party $i$ to $r_j$ received by party $j$.

Fairness with Individual Rationality (R4). However, another issue persists in the above modified definition of fairness (Definition 1): R4 may not be satisfied when the value of data is submodular (e.g., IG). We show an example below and leave the detailed discussion to Appendix C:

Example 1.

Suppose that there is a coalition of two parties whose values of data are $v_1 = 7$, $v_2 = 5$, and $v_{\{1,2\}} = 8$. Using (2), the Shapley values of the two parties are $\phi_1 = 5$ and $\phi_2 = 3$. To satisfy R3, we set $r_1 = 8$. Then, to achieve Shapley fairness, $r_2 = (8/5) \times 3 = 4.8$. Since $r_2 < v_2$, R4 is not satisfied.

Since our model reward is freely replicable, it is possible to give all parties in a larger coalition more valuable model rewards. How then should we redefine $g$ to derive larger values $r_i$ of model rewards to achieve R4 while still satisfying R5 and R3 (i.e., $g(\max_{i \in N} \phi_i) = v_N$)? To answer this question, we further modify the above definition of Shapley fairness (Definition 1) to the following:

Definition 2 ($\rho$-Shapley Fairness).

Given $\{v_C\}_{C \in 2^N}$, if there exist constants $k > 0$ and $\rho > 0$ s.t. $r_i = k \phi_i^\rho$ for all $i \in N$, then the values $(r_i)_{i \in N}$ of model rewards are $\rho$-Shapley fair.

It follows from Definition 2 that when $\rho < 1$, $\phi_i > \phi_j$ implies $\phi_i / \phi_j > r_i / r_j > 1$. Furthermore, reducing $\rho$ from $1$ "weakens" the proportionality of $r_i$ to $\phi_i$ and decreases the ratio of $r_i$ to $r_j$ while preserving $r_i > r_j$. So, if the value of a party's model reward is $v_N$ (hence satisfying R3), then the values of the other parties' model rewards will become closer to $v_N$, thereby potentially satisfying R4. We will discuss the choice of $k$ and the effect of $\rho$ in Sec. 4.2.

4.1.2 Stability and Group Welfare

At the beginning of Sec. 4, we have provided justifications for desiring the grand coalition. We will now introduce two other solution concepts in CGT (i.e., stability and group welfare) to incentivize its formation and increase the total benefit from the collaboration, respectively. To the best of our knowledge, these solution concepts have not been considered in existing collaborative ML works.

A coalition structure $CS$ with a given set $(r_i)_{i \in N}$ of values of model rewards is said to be stable if no subset of parties has a common incentive to abandon it and form another coalition on their own. With stability, parties can be assured that the collaboration will not fall apart and that they cannot get a more valuable model reward by adding or removing parties. Similar to the other solution concepts in CGT introduced previously, the definition of stability in CGT (a coalition structure $CS$ with given $(r_i)_{i \in N}$ is stable in CGT if $\forall C \subseteq N\ \ v_C \leq \sum_{i \in C} r_i$) does not suit the freely replicable nature of our model reward and the redefined incentive condition on feasibility (R2). Therefore, we propose the following new definition of stability:

Definition 3 (Stability).

A coalition structure $CS$ with a given set $(r_i)_{i \in N}$ of values of model rewards is stable if $\forall C \subseteq N\ \ \exists i \in C\ \ v_C \leq r_i$.

Conversely, supposing $\exists C \subseteq N\ \ \forall i \in C\ \ r_i < v_C$, all parties in $C$ may be willing to deviate to form coalition $C$ as they can feasibly increase the values of their model rewards (up) to $v_C$. The condition in Definition 3 is computationally costly to check as it involves an exponential (in $n$) number of constraints. To ease computation and since we are mainly interested in the grand coalition (i.e., $CS = \{N\}$), we look at the following sufficient condition instead:

  R6. Stability of the Grand Coalition. Suppose that the value of data is monotonic. The grand coalition is stable if, for every coalition $C$, the value of the model reward received by the party with the largest Shapley value is at least $v_C$: $\forall C \subseteq N\ \ \forall i \in C\ \ \phi_i = \max_{j \in C} \phi_j \Rightarrow v_C \leq r_i$.

The values $(r_i)_{i \in N}$ of model rewards that satisfy R6 will also satisfy Definition 3. To ensure R6, for each party $i \in N$, we only need to check one constraint involving $r_i$: $v_{C_i} \leq r_i$ where $C_i$ includes all parties whose Shapley value is at most $\phi_i$. The monotonicity property of the value of data guarantees that $v_C \leq v_{C_i} \leq r_i$ for any $C \subseteq C_i$. The total number of constraints is linear in the number $n$ of parties. Unlike R1 to R5, R6 is an optional incentive condition.
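To make the linear-in-$n$ check concrete, the hypothetical helper below (a sketch, not the paper's implementation) verifies R6 given the Shapley values, the coalition values (assumed monotonic), and the reward values:

```python
def grand_coalition_stable(phi, v, r):
    """Check R6: for each party i, v_{C_i} <= r_i where C_i = {j : phi_j <= phi_i}.

    phi: list of Shapley values; v: dict frozenset -> coalition value (monotonic);
    r: list of reward values r_i.
    """
    n = len(phi)
    for i in range(n):
        C_i = frozenset(j for j in range(n) if phi[j] <= phi[i])
        if v[C_i] > r[i]:
            return False
    return True
```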

The last incentive condition stated below does not have to be fully provided:

  R7. Group Welfare. The values $(r_i)_{i \in N}$ of model rewards should maximize the group welfare $\sum_{i \in N} r_i$.

4.2 Reward Scheme Considering All Incentives

In this subsection, we will present a reward scheme which considers all the incentives in Sec. 4.1 and assumes that the grand coalition will form. It uses the parameter $\rho$ in Definition 2 to trade off between achieving Shapley fairness and satisfying the other incentives, as detailed below:

Theorem 1.

Let $0 < \rho \leq 1$. For each party $i \in N$, let $\phi_i \triangleq \text{Shapley}_v(i)$ and reward $r_i \triangleq (\phi_i / \phi^*)^\rho \times v_N$ where $\phi^* = \max_{i \in N} \phi_i$. (In practice, $\phi_i$ can be based on other solution concepts in CGT that satisfy F1 to F4.) The values $(r_i)_{i \in N}$ of model rewards are $\rho$-Shapley fair and satisfy R1 to R3 and R5 when $\rho > 0$. Also, when

  • $\rho = 1$, $(r_i)_{i \in N}$ are (pure) Shapley fair (Definition 1);

  • $\rho \leq \rho_r \triangleq \min_{i \in N} \log(v_i / v_N) / \log(\phi_i / \phi^*)$, $(r_i)_{i \in N}$ satisfy individual rationality (R4);

  • $\rho \leq \rho_s \triangleq \min_{i \in N} \log(v_{C_i} / v_N) / \log(\phi_i / \phi^*)$ where coalition $C_i \triangleq \{j \in N \mid \phi_j \leq \phi_i\}$, $(r_i)_{i \in N}$ achieve stability of the grand coalition (R6) and individual rationality (R4) as $\rho_s \leq \rho_r$;

  • $\rho = 0$, $(r_i)_{i \in N}$ provide maximum group welfare (R7) but do not satisfy fairness (R5).

Its proof is in Appendix D. As $\rho$ decreases from $1$, the values $(r_i)_{i \in N}$ of model rewards deviate further from pure Shapley fairness (i.e., become less proportional to $(\phi_i)_{i \in N}$) and the value $r_i$ of any party $i$'s model reward with $0 < \phi_i / \phi^* < 1$ will increase. This increases group welfare (R7) and the values $(r_i)_{i \in N}$ of model rewards can potentially satisfy individual rationality (R4) and stability (R6). Thus, a smaller $\rho$ addresses the limitations of pure Shapley fairness. We do not consider (a) $\rho > 1$ as it reduces group welfare and is not needed to satisfy other incentives, nor (b) $\rho < 0$ as it would demand that more valuable/informative data and a higher Shapley value lead to a less valuable model reward, hence not satisfying fairness (R5).

Fig. 1 gives an overview of the incentive conditions and how Theorem 1 uses $\rho$ to satisfy them and trade off between the various incentives. Note that R6 guarantees R4. Ideally, we want the values $(r_i)_{i \in N}$ of model rewards to satisfy fairness (R5), stability (R6), and individual rationality (R4) (i.e., the shaded region in Fig. 1). However, the values $(r_i)_{i \in N}$ of model rewards that are purely Shapley fair may not always lie in the desired shaded region. So, a smaller $\rho$ is needed.

Figure 1: An overview of the incentive conditions and how they are satisfied by our reward scheme in Theorem 1. There may be alternative desirable reward schemes not using Definition 2. Note that in some scenarios, $1$ may be less than $\rho_r$ or $\rho_s$. As a result, the values $(r_i)_{i \in N}$ of model rewards that are purely Shapley fair naturally satisfy individual rationality and stability, respectively.
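To make Theorem 1 concrete, the sketch below (hypothetical helper names; it assumes the exact Shapley values $\phi_i$ and coalition values $v_C$ are available and positive) computes the reward values $r_i = (\phi_i/\phi^*)^\rho\, v_N$ and the thresholds $\rho_r$ and $\rho_s$:

```python
import math

def rho_shapley_rewards(phi, v, rho):
    """Rewards from Theorem 1: r_i = (phi_i / phi*)^rho * v_N."""
    n = len(phi)
    v_N = v[frozenset(range(n))]
    phi_star = max(phi)
    return [(p / phi_star) ** rho * v_N for p in phi]

def rho_thresholds(phi, v):
    """Thresholds rho_r (individual rationality) and rho_s (stability) from Theorem 1.

    Both may exceed 1, in which case pure Shapley fairness (rho = 1) already
    satisfies R4 / R6. Parties with phi_i = phi* are skipped since they always
    receive v_N. Assumes all phi_i > 0 and all v_i > 0.
    """
    n = len(phi)
    v_N = v[frozenset(range(n))]
    phi_star = max(phi)
    rho_r = rho_s = float("inf")
    for i in range(n):
        if phi[i] == phi_star:
            continue
        denom = math.log(phi[i] / phi_star)  # negative since phi_i < phi*
        C_i = frozenset(j for j in range(n) if phi[j] <= phi[i])
        rho_r = min(rho_r, math.log(v[frozenset({i})] / v_N) / denom)
        rho_s = min(rho_s, math.log(v[C_i] / v_N) / denom)
    return rho_r, rho_s
```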

Consider how much each party $i$ benefits from the collaboration by choosing a $\rho$ smaller than $1$, specifically, by measuring the ratio of the value $r_i$ of its received model reward that is $\rho$-Shapley fair ($\rho < 1$) to that which is purely Shapley fair ($\rho = 1$). Such a ratio evaluates to $(\phi^* / \phi_i)^{1-\rho} \geq 1$ and is smaller (larger) for a party $i$ with a larger (smaller) $\phi_i$. So, when a smaller $\rho$ is chosen, a party $i$ with $\phi_i$ close to $\phi^*$ needs to be "altruistic" as it cannot benefit as much due to a smaller ratio than any party $j$ with a smaller $\phi_j$. However, note that party $i$ is already getting close to the maximum value $v_N$ of model reward. Regardless of the chosen $\rho$, the party $i$ with the largest $\phi_i$ (i.e., $\phi^*$) always receives the most valuable model reward with the highest value $v_N$. In practice, the parties may agree to use a smaller $\rho$ if they want to increase their total benefit from the collaboration or if they do not know their relative expected marginal contributions beforehand.

4.3 Realization of Model Rewards

Finally, we will have to train a model for each party as a reward to realize the values $(r^*_i)_{i \in N}$ of model rewards decided by the above reward scheme. Recall from the beginning of Sec. 4 that the value $r_i = \mathbb{I}(\boldsymbol{\theta}; D^r_i)$ of the model reward received by each party $i$ is the IG on the model parameters $\boldsymbol{\theta}$ given the data $D^r_i$. How should the training data $D^r_i$ for each party $i$ be selected to realize $r_i = r^*_i$?

A direct approach is to select and only train on a subset of the aggregated data from all parties: $D^r_i \subseteq D_N$. However, this is infeasible due to the need to consider an exponential number of discrete subsets of data, none of which may realize the decided value $r^*_i$ of model reward exactly.

To avoid the above issue, we instead consider injecting Gaussian noise $\mathcal{N}(\mathbf{0}, \eta_i \mathbf{I})$ into the aggregated data from multiple parties and optimizing the continuous noise variance parameter $\eta_i$ to realize the decided value $r^*_i$ of model reward. Specifically, the central party collects the data $D_i \triangleq (\mathbf{X}_i, \mathbf{y}_i)$ from every party $i \in N$ and constructs $D^r_i$ by concatenating the input matrices $\mathbf{X}_i$ for $i \in N$ and the output vector $\mathbf{y}_i$ with $\mathbf{z}_i \triangleq (\mathbf{y}_j)_{j \in N \setminus \{i\}} + \mathcal{N}(\mathbf{0}, \eta_i \mathbf{I})$. When training a model for party $i$, we use the original $\mathbf{y}_i$ so that each party gets all the information from its own data $D_i$ and cannot improve its received model reward by subsequently training on it. We inject noise with variance $\eta_i$ only into the other parties' data (i.e., $(\mathbf{y}_j)_{j \in N \setminus \{i\}}$) to reduce the information from parties $N \setminus \{i\}$ conveyed to party $i$. In particular, $r_i = v_N$ when $\eta_i = 0$ and $r_i = v_i$ when $\eta_i = \infty$. By varying $\eta_i$, we can span different values of $r_i$. We use an efficient root-finding algorithm to find the optimal $\eta_i$ such that $r_i = r^*_i$ for each party $i \in N$. The effect of adding noise to the data will be reflected in the variance of party $i$'s model reward parameters and its predictive distribution. We will show in Sec. 5 that the added noise affects the predictive accuracy reasonably.
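A minimal sketch of this realization step for the Bayesian linear regression illustration from Sec. 3 (hypothetical helper names; the extra noise enters the likelihood of the other parties' outputs, and the target $r^*_i$ is assumed to lie in $[v_i, v_N]$ so that a bracketing root finder applies):

```python
import numpy as np
from scipy.optimize import brentq

def ig_with_noise(X_own, X_others, prior_cov, noise_var, eta):
    """IG on BLR weights when the other parties' outputs receive extra noise eta.

    Own points keep likelihood variance noise_var; the others get noise_var + eta.
    IG decreases monotonically in eta, from v_N (eta = 0) towards v_i (eta -> inf).
    """
    d = X_own.shape[1]
    A = X_own.T @ X_own / noise_var + X_others.T @ X_others / (noise_var + eta)
    _, logdet = np.linalg.slogdet(np.eye(d) + prior_cov @ A)
    return 0.5 * logdet

def solve_eta(X_own, X_others, prior_cov, noise_var, r_target, eta_max=1e6):
    """Find eta_i such that the realized IG matches the decided reward r_target."""
    f = lambda eta: ig_with_noise(X_own, X_others, prior_cov, noise_var, eta) - r_target
    if f(0.0) <= 0.0:          # target is at least v_N: no noise needed
        return 0.0
    return brentq(f, 0.0, eta_max)   # bracketing root finder on the scalar eta
```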

5 Experiments and Discussion

This section empirically evaluates the performance and properties of our reward scheme (Sec. 4.2) on Bayesian regression models with (a) the synthetic Friedman dataset with $6$ input features (Friedman, 1991), (b) the diabetes progression (DiaP) dataset on the diabetes progression of $442$ patients with $9$ input features (Efron et al., 2004), and (c) the Californian housing (CaliH) dataset on the value of $20640$ houses with $8$ input features (Pace & Barry, 1997). We use a Gaussian likelihood and assume that the model hyperparameters are known or learned using maximum likelihood estimation.

The performance of our reward scheme is evaluated using the IG and mean negative log probability (MNLP) metrics:

$$\text{MNLP} \triangleq \frac{1}{|D^*|} \sum_{(\mathbf{x}^*, y^*) \in D^*} -\log p(y^*|\mathbf{x}^*, D_i^r) = \frac{1}{|D^*|} \sum_{(\mathbf{x}^*, y^*) \in D^*} \frac{1}{2} \left( \log(2\pi\sigma_*^2) + \frac{(\mu_* - y^*)^2}{\sigma_*^2} \right) \qquad (3)$$

where $D^*$ denotes a test dataset, and $\mu_*$ and $\sigma_*^2$ denote, respectively, the mean and variance of the predictive distribution $p(y^*|\mathbf{x}^*, D_i^r)$. We then use these metrics to show how an improvement in IG can affect the predictive accuracy of a trained model on a common test dataset. (We use a randomly sampled test dataset to illustrate how IG is a suitable surrogate measure of the test/predictive accuracy of a trained model. In Sec. 3, we have discussed that in practice, it is often tedious or even impossible for all parties to agree on a common validation dataset, let alone a common set of test queries.) More details of the experimental settings, such as how the test dataset is selected, are in Appendix E. The performance of our reward scheme cannot be directly compared against that of existing works (Ghorbani & Zou, 2019; Jia et al., 2019b) as they mainly focus on non-Bayesian classification models and assume monetary compensations. Unless stated otherwise, we consider a collaboration of $n = 3$ parties as it suffices to show interesting results.
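For a Gaussian predictive distribution, (3) is straightforward to compute from the predictive means and variances at the test inputs; a minimal sketch:

```python
import numpy as np

def mnlp(y_test, pred_mean, pred_var):
    """Mean negative log probability (Eq. (3)) under Gaussian predictive distributions."""
    return np.mean(0.5 * (np.log(2 * np.pi * pred_var)
                          + (pred_mean - y_test) ** 2 / pred_var))
```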

Gaussian process (GP) regression with the synthetic Friedman dataset. For each party, we generate $250$ data points from the Friedman function. We assign party $1$ the most valuable data by spanning the first input feature over the entire input domain $[0, 1]$. In contrast, for parties $2$ and $3$, the same input feature spans only the non-overlapping ranges $[0, 0.5]$ and $[0.5, 1]$, respectively. This makes the data of parties $2$ and $3$ highly valuable to each other, i.e., $v_{\{2,3\}}$ is high relative to $v_2$. Using (2), the Shapley values of the three parties are $\phi_1 = 34.57$, $\phi_2 = 29.24$, and $\phi_3 = 30.78$. Party $1$ has the largest $\phi_1$ and will always receive the most valuable model reward with the highest IG/value $v_N$.

Figure 2: Graphs of (a) IG and (b) MNLP vs. the adjustable parameter $\rho$ for GP regression with the synthetic Friedman dataset, where $m_i$ and $m_C$ (with $|C| > 1$) denote the MNLPs of the models rewarded to party $i$ and trained using data $D_C$, respectively.

Fig. 2a shows that our reward scheme indeed satisfies fairness (R5): A party $i$ with a larger $\phi_i$ always receives a higher $r_i$ (i.e., a more valuable model reward). Also, since $\rho_r \geq 1$ here, our reward scheme always satisfies individual rationality (R4): Every party $i$ receives a more valuable model reward than that trained on its own data, i.e., its IG $r_i$ is higher than $v_i$ (see the dotted lines in Fig. 2a). As $\rho$ decreases (rightwards), party $1$'s most valuable model reward is unaffected, but it has to be "altruistic" to parties $2$ and $3$: their $r_2$ and $r_3$ increase (i.e., increasingly valuable model rewards) despite their smaller $\phi_2$ and $\phi_3$, as discussed in the last paragraph of Sec. 4.2. When $\rho$ decreases to $\rho_s$, stability (R6) is reached: Party $3$'s model reward matches what it would receive by collaborating only with party $2$. When $\rho = 0$, all parties receive an equally valuable model reward with the same IG/value $v_N$ despite their varying expected marginal contributions. This shows how $\rho$ can be reduced to trade off the proportionality of pure Shapley fairness for satisfying other incentives such as stability (R6) and group welfare (R7).

In Fig. 2b, we report the averaged MNLP and shade the $95\%$ confidence interval over $20$ different realizations of $\mathbf{z}_i$. As $\rho$ decreases and each party receives an increasingly valuable model reward (i.e., increasing IG $r_i$), MNLP decreases, thus showing that an improvement in IG translates to a higher predictive accuracy of the trained model.

Figure 3: Scatter graphs of (a-b) the improvement in IG ($r_i - v_i$) vs. the maximal improvement in IG ($v_N - v_i$) and (c-d) the improvement in MNLP vs. the maximal improvement in MNLP when $\rho = 1, 0.5$ for multiple partitions of the DiaP dataset among $n = 3$ parties.

GP regression with the DiaP dataset. We train a GP regression model with a composite kernel (i.e., squared exponential kernel $+$ exponential kernel) and partition the training dataset among $n = 3$ parties, as detailed later. The results are shown in Fig. 3.

Neural network regression with the CaliH dataset. We consider Bayesian linear regression (BLR) on the last layer of a neural network (NN). $60\%$ of the CaliH data is designated as the "public" dataset (in practice, the "public" dataset can be provided by the government to kickstart the collaborative ML; alternatively, it can be historical data that companies are more willing to share) and used to pretrain a NN. Since the correlation of the house value with the input features in the parties' data may differ from that in the "public" dataset, we perform transfer learning and selectively retrain only the last layer using BLR with a standard Gaussian prior. So, IG is only computed based on the BLR. From the remaining data, we select $400$ data points for the test dataset and $1600$ data points (we restrict the total number of data points so that no individual party can train a high-quality model alone) for the training dataset to be partitioned among $3$ parties, as detailed later. The results are shown in Fig. 5.

Sparse GP regression with the synthetic Friedman dataset (10 parties). We also evaluate the performance of our reward scheme on a collaboration among a larger number $n = 10$ of parties. We consider a larger synthetic Friedman dataset of size $5000$ and partition the training dataset among the $10$ parties such that each party owns at least $5\%$ of the dataset. We train a sparse GP model based on the deterministic training conditional approximation (Hoang et al., 2015, 2016) and a squared exponential kernel. Since the exact Shapley value (2) is expensive to evaluate with a large number of parties, we approximate it using the simple random sampling algorithm in (Maleki et al., 2013). The results are shown in Fig. 6.
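The permutation-sampling approximation can be sketched as follows (a hypothetical helper in the spirit of the simple random sampling of Maleki et al. (2013), not the paper's exact implementation):

```python
import numpy as np

def shapley_monte_carlo(v, n, num_perms=200, rng=None):
    """Approximate Shapley values by averaging marginal contributions over
    randomly sampled permutations.

    v: callable mapping a frozenset coalition to its value v_C.
    """
    rng = np.random.default_rng(rng)
    phi = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        preceding, v_prev = frozenset(), v(frozenset())
        for i in perm:
            coalition = preceding | {i}
            v_cur = v(coalition)
            phi[i] += v_cur - v_prev          # marginal contribution of party i
            preceding, v_prev = coalition, v_cur
    return phi / num_perms
```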

Partitioning Algorithm. To divide any training dataset among the parties and vary their contributed data, we first choose an input feature $a$ uniformly at random from all the input features. Next, a data point $d$ is chosen uniformly at random. We partially sort the dataset on feature $a$ around $d$ and divide it into consecutive blocks of random sizes, as detailed in Fig. 4.

Figure 4: Only data point $d$ will be at its sorted position based on input feature $a$. We then divide this partially sorted training dataset into $3$ consecutive blocks of random sizes with the constraint that each party owns at least $10\%$ of the dataset.

Such a setup allows parties to have different quantities of unique or partially overlapping data. The input feature $a$ may have real-world significance: If $a$ is the age of patients, then this models the scenario where hospitals have data of patients from different demographics.
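A minimal sketch of this partitioning step (hypothetical helper; NumPy's `argpartition` places only the chosen pivot at its sorted position, mirroring Fig. 4):

```python
import numpy as np

def partition_dataset(X, y, n_parties=3, min_frac=0.1, rng=None):
    """Partially sort the data on a random feature around a random pivot,
    then split it into consecutive blocks of random sizes (each >= min_frac)."""
    rng = np.random.default_rng(rng)
    m, num_features = X.shape
    a = rng.integers(num_features)          # random input feature
    d = rng.integers(m)                     # random pivot data point
    order = np.argpartition(X[:, a], d)     # only position d is fully sorted
    # random block sizes with each party owning at least min_frac of the data
    free = 1.0 - n_parties * min_frac
    w = rng.dirichlet(np.ones(n_parties)) * free + min_frac
    cuts = np.cumsum((w * m).astype(int))[:-1]
    return [(X[idx], y[idx]) for idx in np.split(order, cuts)]
```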

Summary of Observations. For any dataset, we consider different train-test splits and partitions of the training dataset using the above algorithm. For each partition, we compute the optimized noise variance $\eta_i$ to realize $r^*_i$ and consider $5$ realizations of $\mathbf{z}_i$ for each party $i \in N$. For a given partition and choice of $\rho$, if the values $(r_i)_{i \in N}$ of model rewards satisfy individual rationality (R4), then we compute the ratio $\phi_i / \phi^*$, the improvement in IG ($r_i - v_i$) and MNLP, and the maximal (possible) improvement in IG ($v_N - v_i$) and MNLP for each party $i \in N$. The improvement is measured relative to a model trained only on party $i$'s data. The best/lowest MNLP $m_N$ is assumed to be achieved by a model trained on the aggregated data of all parties. Each realization of $\mathbf{z}_i$ (for each party $i$) produces a point in the scatter graphs in Figs. 3, 5, and 6. If a point lies on the diagonal identity line, then the corresponding party (e.g., the one with the largest $\phi_i$, i.e., $\phi^*$) receives a model reward with MNLP $m_N$. We also report the number of points with positive improvement in IG and MNLP in the top left corner of each graph.

In Figs. 3c-d, 5c-d, and 6c-d, the improvement in MNLP is usually positive: The predictive performance of each party benefits from the collaboration. Occasionally but reasonably, this does not hold when the maximal improvement in MNLP is extremely small or even negative. Lighter colored points lie closer to the diagonal line: Parties with larger $\phi_i$ receive more valuable model rewards (Figs. 3a-b, 5a-b, 6a-b), which translates to lower MNLP (Figs. 3c-d, 5c-d, 6c-d).

In Figs. 3, 5, and 6, when $\rho$ decreases from $1$ to $0.5$, darker colored points move closer to the diagonal line, which implies that parties with smaller $\phi_i$ can now receive more valuable model rewards with higher predictive accuracy. The number of points (reported in the top left corner) also increases as more of them satisfy R4.

In Fig. 6d with $\rho = 0.5$, most points are close to the diagonal identity line, possibly because not all training data points are needed to achieve the lowest MNLP; fewer or noisier data points may suffice. In this scenario, a larger $\rho$ may be preferred to reward the parties differently. More experimental results for different Bayesian models and datasets are shown in Appendix F.1.

Figure 5: Scatter graphs of (a-b) the improvement in IG ($r_i - v_i$) vs. the maximal improvement in IG ($v_N - v_i$) and (c-d) the improvement in MNLP vs. the maximal improvement in MNLP when $\rho = 1, 0.5$ for multiple partitions of the CaliH dataset among $n = 3$ parties.

Figure 6: Scatter graphs of (a-b) the improvement in IG ($r_i - v_i$) vs. the maximal improvement in IG ($v_N - v_i$) and (c-d) the improvement in MNLP vs. the maximal improvement in MNLP when $\rho = 1, 0.5$ for multiple partitions of the Friedman dataset among $n = 10$ parties.

Limitations. Though IG is a suitable surrogate measure of the predictive accuracy of a trained model, a higher IG does not always translate to a lower MNLP. In Figs. 3c-d and 5c-d, the improvement in MNLP is occasionally negative (around $15\%$ of the time) and darker points lie above the diagonal line. The latter suggests that parties with smaller $\phi_i$ may receive model rewards with MNLP lower than $m_N$ (i.e., that awarded to the party with the largest $\phi_i$, i.e., $\phi^*$). This may be purely due to randomness, but we have also investigated factors (e.g., model selection) leading to a weak relationship between IG and MNLP and reported the results in Appendix F.2. Our reward scheme works best when a suitable model is selected and the model prior is not sufficiently informative for any party to achieve a high predictive accuracy by training a model on its own data. In practice, this is reasonable as collaboration happens precisely when every individual party cannot train a high-quality model alone but can do so by working together.

6 Conclusion

This paper describes a collaborative ML scheme that distributes only model rewards to the parties. We perform data valuation based on the IG on the model parameters given the data and compute each party's marginal contribution using the Shapley value. To decide the appropriate rewards to incentivize the collaboration, we adapt solution concepts from CGT (i.e., fairness, Shapley fairness, stability, individual rationality, and group welfare) for our freely replicable model rewards. We propose a novel reward scheme (Theorem 1) with an adjustable parameter that can trade off between these incentives while maintaining fairness. Empirical evaluations show that a smaller $\rho$ and more valuable model rewards translate to higher predictive accuracy. Current efforts on designing incentive mechanisms for federated learning can build upon our work presented in this paper. For future work, we plan to address privacy preservation in our reward scheme. The noise injection method used for realizing the model rewards in this work is closely related to the Gaussian mechanism of differential privacy (Dwork & Roth, 2014). This motivates us to explore how the injected noise will affect privacy in the model rewards.

Acknowledgements

This research/project is supported by the National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative and its AI Singapore Programme (Award Number: AISG-GC-2019-002). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.

References

  • Center for Open Data Enterprise (2019) Center for Open Data Enterprise. Sharing and utilizing health data for AI applications. Roundtable report, 2019.
  • Chalkiadakis et al. (2011) Chalkiadakis, G., Elkind, E., and Wooldridge, M. Computational aspects of cooperative game theory. In Brachman, R. J., Cohen, W. W., and Dietterich, T. G. (eds.), Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011.
  • Claver (2019) Claver, H. Data sharing key for AI in agriculture. Future Farming, Feb. 2019. URL https://www.futurefarming.com/Tools-data/Articles/2019/2/Data-sharing-key-for-AI-in-agriculture-389844E/.
  • Conway (2018) Conway, J. Artificial Intelligence and Machine Learning: Current Applications in Real Estate. PhD thesis, Massachusetts Institute of Technology, 2018.
  • Cover & Thomas (1991) Cover, T. and Thomas, J. Elements of Information Theory. John Wiley & Sons, Inc., 1991.
  • de Clippel & Rozen (2013) de Clippel, G. and Rozen, K. Fairness through the lens of cooperative game theory: An experimental approach. Cowles Foundation discussion paper no. 1925, 2013.
  • Dwork & Roth (2014) Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Ann. Statist., 32(2):407–499, 2004.
  • Friedman (1991) Friedman, J. H. Multivariate adaptive regression splines. Ann. Statist., 19(1):1–67, 1991.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In Proc. ICML, pp.  2242–2251, 2019.
  • Ghorbani et al. (2020) Ghorbani, A., Kim, M. P., and Zou, J. A distributional framework for data valuation. In Proc. ICML, 2020.
  • Harrison & Rubinfeld (1978) Harrison, D. and Rubinfeld, D. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
  • Hoang et al. (2015) Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp.  569–578, 2015.
  • Hoang et al. (2016) Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp.  382–391, 2016.
  • Jia et al. (2019a) Jia, R., Dao, D., Wang, B., Hubis, F. A., Gurel, N. M., Li, B., Zhang, C., Spanos, C., and Song, D. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endowment, 12(11):1610–1623, 2019a.
  • Jia et al. (2019b) Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gurel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. Towards efficient data valuation based on the Shapley value. In Proc. AISTATS, pp.  1167–1176, 2019b.
  • Krause & Guestrin (2007) Krause, A. and Guestrin, C. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proc. ICML, pp.  449–456, 2007.
  • Kruschke (2008) Kruschke, J. K. Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3):210–226, 2008.
  • Maleki et al. (2013) Maleki, S., Tran-Thanh, L., Hines, G., Rahwan, T., and Rogers, A. Bounding the estimation error of sampling-based Shapley value approximation. arXiv:1306.4265, 2013.
  • Maschler & Peleg (1966) Maschler, M. and Peleg, B. A characterization, existence proof and dimension bounds for the kernel of a game. Pacific J. Mathematics, 18(2):289–328, 1966.
  • Ocean Protocol Foundation (2019) Ocean Protocol Foundation. Ocean protocol: A decentralized substrate for AI data and services. Technical whitepaper, 2019.
  • Ohrimenko et al. (2019) Ohrimenko, O., Tople, S., and Tschiatschek, S. Collaborative machine learning markets with data-replication-robust payments. arXiv:1911.09052, 2019.
  • Pace & Barry (1997) Pace, R. K. and Barry, R. Sparse spatial auto-regressions. Statistics and Probability Letters, 33(3):291–297, 1997.
  • Shapley (1953) Shapley, L. S. A value for $n$-person games. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games, volume 2, pp. 307–317. Princeton Univ. Press, 1953.
  • Yoon et al. (2020) Yoon, J., Arik, S. O., and Pfister, T. Data valuation using reinforcement learning. In Proc. ICML, 2020.
  • Young (1985) Young, H. P. Monotonic solutions of cooperative games. International Journal of Game Theory, 14(2):65–72, 1985.

Appendix A Proof of Submodularity of IG (1)

Firstly, we assume that the data DiD_{i} and DjD_{j} are conditionally independent given model parameters 𝜽\boldsymbol{\theta} for any i,jNi,j\in N and iji\neq j. Then, iNCCN{i},\forall i\in N\ \ \forall C\subseteq C^{\prime}\subseteq N\setminus\{i\}\ ,

vC{i}vC=𝕀(𝜽;DC{i})𝕀(𝜽;DC)=(𝜽|DC)(𝜽|DC{i})=(Di|DC)(Di|𝜽,DC)=(Di|DC)(Di|𝜽)(Di|DC)(Di|𝜽)=(Di|DC)(Di|𝜽,DC)=(𝜽|DC)(𝜽|DC{i})=𝕀(𝜽;DC{i})𝕀(𝜽;DC)=vC{i}vC\begin{array}[]{l}v_{C\cup\{i\}}-v_{C}\vspace{1mm}\\ =\mathbb{I}(\boldsymbol{\theta};D_{C\cup\{i\}})-\mathbb{I}(\boldsymbol{\theta};D_{C})\vspace{1mm}\\ =\mathbb{H}(\boldsymbol{\theta}|D_{C})-\mathbb{H}(\boldsymbol{\theta}|D_{C\cup\{i\}})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C})-\mathbb{H}(D_{i}|\boldsymbol{\theta},D_{C})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C})-\mathbb{H}(D_{i}|\boldsymbol{\theta})\vspace{1mm}\\ \geq\mathbb{H}(D_{i}|D_{C^{\prime}})-\mathbb{H}(D_{i}|\boldsymbol{\theta})\vspace{1mm}\\ =\mathbb{H}(D_{i}|D_{C^{\prime}})-\mathbb{H}(D_{i}|\boldsymbol{\theta},D_{C^{\prime}})\vspace{1mm}\\ =\mathbb{H}(\boldsymbol{\theta}|D_{C^{\prime}})-\mathbb{H}(\boldsymbol{\theta}|D_{C^{\prime}\cup\{i\}})\vspace{1mm}\\ =\mathbb{I}(\boldsymbol{\theta};D_{C^{\prime}\cup\{i\}})-\mathbb{I}(\boldsymbol{\theta};D_{C^{\prime}})\vspace{1mm}\\ =v_{C^{\prime}\cup\{i\}}-v_{C^{\prime}}\end{array}

where the third and sixth equalities are due to the symmetric property of mutual information, the fourth and fifth equalities are due to the conditional independence of D_{i} and D_{C} (respectively, D_{C^{\prime}}) given \boldsymbol{\theta}, and the inequality is due to the property that conditioning on more information (i.e., D_{C^{\prime}\setminus C}) never increases the entropy (i.e., “information never hurts” bound for entropy) (Cover & Thomas, 1991).

Appendix B Proof of Proposition 1

We prove below that r_{i}=\text{Shapley}_{v}(i) for all i\in N satisfies each of the properties F1 to F4:

  • F1 Uselessness. For all i\in N, if

    \forall C\subseteq N\setminus\{i\}\ \ v_{C\cup\{i\}}=v_{C}\ ,

    then

    r_{i}=\text{Shapley}_{v}(i) =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v_{S_{\pi,i}}\right)
    =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}0
    =0\ .
  • F2 Symmetry. For all i,j\in N s.t. i\neq j, if

    \forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}=v_{C\cup\{j\}}\ , (4)

    then

    r_{i}=\text{Shapley}_{v}(i) =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v_{S_{\pi,i}}\right)
    =\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{N}}\left(v_{S_{\pi^{\prime},j}\cup\{j\}}-v_{S_{\pi^{\prime},j}}\right)
    =\text{Shapley}_{v}(j)=r_{j}\ .

    Note that \pi^{\prime} is obtained from \pi by swapping the positions of i and j in \pi. Then, we prove the third equality by considering the following two cases: (a) When i precedes j in \pi, S_{\pi,i}=S_{\pi^{\prime},j} (i.e., both excluding i and j) and thus v_{S_{\pi,i}}=v_{S_{\pi^{\prime},j}}. It also follows from (4) that by setting C=S_{\pi,i}=S_{\pi^{\prime},j}, v_{S_{\pi,i}\cup\{i\}}=v_{C\cup\{i\}}=v_{C\cup\{j\}}=v_{S_{\pi^{\prime},j}\cup\{j\}}; (b) when j precedes i in \pi, S_{\pi,i}\cup\{i\}=S_{\pi^{\prime},j}\cup\{j\} (i.e., both containing i and j) and thus v_{S_{\pi,i}\cup\{i\}}=v_{S_{\pi^{\prime},j}\cup\{j\}}. It also follows from (4) that by setting C=S_{\pi,i}\setminus\{j\}=S_{\pi^{\prime},j}\setminus\{i\}, v_{S_{\pi,i}}=v_{C\cup\{j\}}=v_{C\cup\{i\}}=v_{S_{\pi^{\prime},j}}.

  • F3 Strict Desirability. For all i,j\in N s.t. i\neq j, if

    \begin{array}{l}(\exists B\subseteq N\setminus\{i,j\}\ \ v_{B\cup\{i\}}>v_{B\cup\{j\}})\ \land\\(\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}\geq v_{C\cup\{j\}})\ ,\end{array} (5)

    then

    r_{i}=\text{Shapley}_{v}(i) =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v_{S_{\pi,i}}\right)
    >\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{N}}\left(v_{S_{\pi^{\prime},j}\cup\{j\}}-v_{S_{\pi^{\prime},j}}\right)
    =\text{Shapley}_{v}(j)=r_{j}\ .

    Note that \pi^{\prime} is obtained from \pi by swapping the positions of i and j in \pi. Similar to the proof of F2, we prove the inequality by considering the following two cases: (a) When i precedes j in \pi, S_{\pi,i}=S_{\pi^{\prime},j} (i.e., both excluding i and j) and thus v_{S_{\pi,i}}=v_{S_{\pi^{\prime},j}}. It also follows from (5) that by setting C=S_{\pi,i}=S_{\pi^{\prime},j}, v_{S_{\pi,i}\cup\{i\}}=v_{C\cup\{i\}}\geq v_{C\cup\{j\}}=v_{S_{\pi^{\prime},j}\cup\{j\}}; (b) when j precedes i in \pi, S_{\pi,i}\cup\{i\}=S_{\pi^{\prime},j}\cup\{j\} (i.e., both containing i and j) and thus v_{S_{\pi,i}\cup\{i\}}=v_{S_{\pi^{\prime},j}\cup\{j\}}. It also follows from (5) that by setting C=S_{\pi,i}\setminus\{j\}=S_{\pi^{\prime},j}\setminus\{i\}, -v_{S_{\pi,i}}=-v_{C\cup\{j\}}\geq-v_{C\cup\{i\}}=-v_{S_{\pi^{\prime},j}}. By (5), the inequality must be strict in case (a) or (b) for at least one permutation \pi, which yields the strict inequality above.

  • F4 Strict Monotonicity. Let \{v_{C}\}_{C\in 2^{N}} and \{v^{\prime}_{C}\}_{C\in 2^{N}} denote any two sets of values of data over all coalitions C\subseteq N, and r_{i} and r^{\prime}_{i} be the corresponding values of model rewards received by party i. For all i\in N, if

    \begin{array}{l}(\exists B\subseteq N\setminus\{i\}\ \ v^{\prime}_{B\cup\{i\}}>v_{B\cup\{i\}})\ \land\\(\forall C\subseteq N\setminus\{i\}\ \ v^{\prime}_{C\cup\{i\}}\geq v_{C\cup\{i\}})\ \land\\(\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A})\land(v^{\prime}_{N}>r_{i})\ ,\end{array} (6)

    then

    r_{i}=\text{Shapley}_{v}(i) =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v_{S_{\pi,i}}\right)
    =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v^{\prime}_{S_{\pi,i}}\right)
    <\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v^{\prime}_{S_{\pi,i}\cup\{i\}}-v^{\prime}_{S_{\pi,i}}\right)
    =\text{Shapley}_{v^{\prime}}(i)=r^{\prime}_{i}\ .

    Note that for any \pi, S_{\pi,i} does not contain i. Hence, it follows from (6) that v_{S_{\pi,i}}=v^{\prime}_{S_{\pi,i}}, which results in the third equality. The inequality also follows from (6).

Appendix C Enforcing Shapley Fairness may not satisfy Individual Rationality (R4) for Submodular Value of Data

When the value of data is submodular, for any party i\in N and any coalition C\subseteq N\setminus\{i\}, v_{i}-v_{\emptyset}=v_{i}\geq v_{C\cup\{i\}}-v_{C}. Then,

\phi_{i}\triangleq\text{Shapley}_{v}(i) =\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v_{S_{\pi,i}\cup\{i\}}-v_{S_{\pi,i}}\right)
\leq\frac{1}{n!}\sum_{\pi\in\Pi_{N}}v_{i}
=v_{i}\ . (7)

Also, for any i\in N, \phi_{i}\leq\sum_{j\in N}{\phi_{j}}=v_{N} due to the efficiency of the Shapley value (Chalkiadakis et al., 2011).

To satisfy R3 and R5 simultaneously, we need to set k=v_{N}/\phi^{*} in Definition 1 where \phi^{*}=\max_{i\in N}{\phi_{i}}. So, for each party i\in N,

r_{i}=(v_{N}/\phi^{*})\times\phi_{i}

according to Definition 1, and the party with the largest \phi_{i} (i.e., \phi^{*}) always receives the most valuable model reward with the highest value v_{N}.

When v_{i}/\phi_{i}>v_{N}/\phi^{*}, R4 will not be satisfied (i.e., r_{i}=(v_{N}/\phi^{*})\times\phi_{i}<v_{i}). Since v_{N}/\phi^{*}\geq 1, this will never happen if v_{i}/\phi_{i}<1. However, as the value of data is submodular, v_{i}\geq\phi_{i} for each party i\in N due to (7), so v_{i}/\phi_{i}<1 cannot be guaranteed. Thus, R4 may not be satisfied.

Appendix D Proof of Theorem 1

Firstly, the values (r_{i})_{i\in N} of model rewards satisfy non-negativity (R1): For each party i\in N, \phi_{i}\geq 0 due to (2) and the non-negative and monotonic value of data using IG (1).

Next, the values (r_{i})_{i\in N} of model rewards satisfy feasibility (R2): For each party i\in N, r_{i}\triangleq\left(\phi_{i}/\phi^{*}\right)^{\rho}\times v_{N}\leq 1^{\rho}\times v_{N}=v_{N} when \rho\geq 0. Also, weak efficiency (R3) is satisfied since r_{i}=v_{N} for the party i with the largest Shapley value \phi^{*}.

When \rho>0, the values (r_{i})_{i\in N} of model rewards are \rho-Shapley fair with k=(1/\phi^{*})^{\rho}\times v_{N} in Definition 2.

D.1 Proof of Individual Rationality (R4)

For each party i\in N, since

\rho\leq\rho_{r}\triangleq\min_{j\in N}{\frac{\log(v_{j}/v_{N})}{\log(\phi_{j}/\phi^{*})}}\leq\frac{\log(v_{i}/v_{N})}{\log(\phi_{i}/\phi^{*})}

and \log(\phi_{i}/\phi^{*})\leq 0\ ,

\log(v_{i}/v_{N}) \leq\log{\left(\phi_{i}/\phi^{*}\right)}^{\rho}
v_{i}/v_{N} \leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}
v_{i} \leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}=r_{i}\ .

As r_{i}\geq v_{i} for each party i\in N, the values (r_{i})_{i\in N} of model rewards satisfy R4.

For each party i\in N, when \rho=0, r_{i}=v_{N}. Since the value of data is monotonic, v_{i}\leq v_{N}. So, \rho_{r}\geq 0. It is possible that \rho_{r}>1.

D.2 Proof of Stability (R6)

For each party i\in N, since

\rho\leq\rho_{s}\triangleq\min_{j\in N}{\frac{\log(v_{C_{j}}/v_{N})}{\log(\phi_{j}/\phi^{*})}}\leq\frac{\log(v_{C_{i}}/v_{N})}{\log(\phi_{i}/\phi^{*})}

and \log(\phi_{i}/\phi^{*})\leq 0\ ,

\log(v_{C_{i}}/v_{N}) \leq\log{\left(\phi_{i}/\phi^{*}\right)}^{\rho}
v_{C_{i}}/v_{N} \leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}
v_{C_{i}} \leq{\left(\phi_{i}/\phi^{*}\right)}^{\rho}\times v_{N}=r_{i}\ .

As r_{i}\geq v_{C_{i}} for each party i\in N, the values (r_{i})_{i\in N} of model rewards satisfy R6.

For each party i\in N, when \rho=0, r_{i}=v_{N}. Since the value of data is monotonic, v_{i}\leq v_{C_{i}}\leq v_{N}. So, \rho_{r}\geq\rho_{s}\geq 0. It is possible that \rho_{s}>1.

D.3 Proof of Fairness (R5)

Firstly, we show that the Shapley value satisfies the following party monotonicity property (another solution concept that satisfies this property is the Banzhaf index), which will be used in our proof of F4: For all i\in N, if

\begin{array}{l}(\exists B\subseteq N\setminus\{i\}\ \ v^{\prime}_{B\cup\{i\}}>v_{B\cup\{i\}})\ \land\\(\forall C\subseteq N\setminus\{i\}\ \ v^{\prime}_{C\cup\{i\}}\geq v_{C\cup\{i\}})\ \land\\(\forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A})\ ,\end{array} (8)

then \forall j\in N\setminus\{i\}\ \ \phi^{\prime}_{i}-\phi_{i}\geq\phi^{\prime}_{j}-\phi_{j}\ .

Proof.

Let v^{\Delta}_{C}\triangleq\left(v^{\prime}_{C}-v_{C}\right) for any coalition C\subseteq N. It follows from (2) that for any party j\in N\setminus\{i\},

\phi^{\prime}_{j}-\phi_{j} =\frac{1}{n!}\sum_{\pi\in\Pi_{i\prec j}}\left(v^{\Delta}_{S_{\pi,j}\cup\{j\}}-v^{\Delta}_{S_{\pi,j}}\right)
\leq\frac{1}{n!}\sum_{\pi\in\Pi_{i\prec j}}\left(v^{\Delta}_{S_{\pi,j}\cup\{j\}}\right)
=\frac{1}{n!}\sum_{\pi^{\prime}\in\Pi_{j\prec i}}\left(v^{\Delta}_{S_{\pi^{\prime},i}\cup\{i\}}\right)
\leq\frac{1}{n!}\sum_{\pi\in\Pi_{N}}\left(v^{\Delta}_{S_{\pi,i}\cup\{i\}}\right)
=\phi^{\prime}_{i}-\phi_{i}

where \Pi_{i\prec j} is the subset of \Pi_{N} containing all permutations with i preceding j. We consider only the subset \Pi_{i\prec j} of \Pi_{N} in the first equality due to (8), where the quality of a model trained on the aggregated data of a coalition can only change/increase when the coalition contains party i. The first inequality is also due to (8), where v^{\prime}_{C}\geq v_{C} for every C\subseteq N and thus v^{\Delta}_{S_{\pi,j}}\geq 0. The second equality can be understood by swapping the positions of i and j in permutation \pi to obtain \pi^{\prime} such that S_{\pi,j}\cup\{j\}=S_{\pi^{\prime},i}\cup\{i\}. The second inequality holds because there may exist a permutation \pi with i\prec j in \pi (i.e., j\notin S_{\pi,i}) and v^{\Delta}_{S_{\pi,i}\cup\{i\}}\geq 0. The last equality holds since v^{\Delta}_{S_{\pi,i}}=0 due to (8) where \forall A\subseteq N\setminus\{i\}\ \ v^{\prime}_{A}=v_{A}. ∎

Let us now consider r_{i}\triangleq\left(\phi_{i}/\phi^{*}\right)^{\rho}\times v_{N}. We will prove the following properties for R5:

  • F1 Uselessness.

    \begin{array}{l}(\forall C\subseteq N\setminus\{i\}\ \ v_{C\cup\{i\}}=v_{C})\\ \Rightarrow\phi_{i}=0\\ \Rightarrow r_{i}=0\ .\end{array}
  • F2 Symmetry. Since each \phi_{i}^{\rho} is multiplied by the same factor k=v_{N}/{\phi^{*}}^{\rho},

    \begin{array}{l}(\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}=v_{C\cup\{j\}})\\ \Rightarrow\phi_{i}=\phi_{j}\\ \Rightarrow r_{i}=r_{j}\ .\end{array}
  • F3 Strict Desirability. Since each \phi_{i}^{\rho} is multiplied by the same factor k=v_{N}/{\phi^{*}}^{\rho} and k>0,

    \begin{array}{l}(\exists B\subseteq N\setminus\{i,j\}\ \ v_{B\cup\{i\}}>v_{B\cup\{j\}})\ \land\\(\forall C\subseteq N\setminus\{i,j\}\ \ v_{C\cup\{i\}}\geq v_{C\cup\{j\}})\\ \Rightarrow\phi_{i}>\phi_{j}\\ \Rightarrow r_{i}>r_{j}\ .\end{array}

    The first implication in the proof of F1, F2, and F3 follows from Proposition 1.

  • F4 Strict Monotonicity. Let \{v_{C}\}_{C\in 2^{N}} and \{v^{\prime}_{C}\}_{C\in 2^{N}} denote any two sets of values of data over all coalitions C\subseteq N, and \phi_{i} and \phi^{\prime}_{i} be the corresponding Shapley values of party i. Let \ell\in N be the party with the largest Shapley value \phi^{\prime}_{\ell} based on \{v^{\prime}_{C}\}_{C\in 2^{N}}. For any party i\in N satisfying (6) (i.e., the premise of F4), \phi^{\prime}_{\ell}\geq\phi^{\prime}_{i} by the definition of party \ell, and \phi^{\prime}_{i}-\phi_{i}\geq\phi^{\prime}_{\ell}-\phi_{\ell} by the party monotonicity property when \ell\in N\setminus\{i\}. Together, it follows that \phi_{\ell}\geq\phi_{i}. Then,

    \phi^{\prime}_{i}-\phi_{i} \geq\phi^{\prime}_{\ell}-\phi_{\ell}
    \phi_{\ell}(\phi^{\prime}_{i}-\phi_{i}) \geq\phi_{i}(\phi^{\prime}_{\ell}-\phi_{\ell})
    \phi_{\ell}\times\phi^{\prime}_{i} \geq\phi_{i}\times\phi^{\prime}_{\ell}
    \phi^{\prime}_{i}/\phi^{\prime}_{\ell} \geq\phi_{i}/\phi_{\ell}\ . (9)

    If \phi_{\ell}>\phi_{i}, then \phi^{\prime}_{i}/\phi^{\prime}_{\ell}>\phi_{i}/\phi_{\ell} instead. The equality in (9) holds only if \phi_{\ell}=\phi_{i}. Let p\in N be the party with the largest Shapley value \phi_{p} based on \{v_{C}\}_{C\in 2^{N}}. Then,

    r_{i}^{\prime} ={\left(\phi_{i}^{\prime}/\phi^{\prime}_{\ell}\right)}^{\rho}\times v^{\prime}_{N}
    \geq{\left(\phi_{i}^{\prime}/\phi^{\prime}_{\ell}\right)}^{\rho}\times v_{N}
    \geq{\left(\phi_{i}/\phi_{\ell}\right)}^{\rho}\times v_{N}
    \geq{\left(\phi_{i}/\phi_{p}\right)}^{\rho}\times v_{N}
    =r_{i}

    where the first inequality follows from (6), the second inequality is due to (9), and the last inequality is due to the definition of party p. For equality in r^{\prime}_{i}\geq r_{i} to possibly hold, it has to be the case that \phi_{i}=\phi_{\ell}=\phi_{p} and v^{\prime}_{N}=v_{N}. Based on the definitions of r_{i} and r^{\prime}_{i} in Theorem 1, this implies r^{\prime}_{i}=v^{\prime}_{N}=v_{N}=r_{i}, which does not satisfy v^{\prime}_{N}>r_{i} in (6) (i.e., the premise of F4). For all other cases, r^{\prime}_{i}>r_{i}.

Appendix E Details of Experimental Settings

In all the experiments, we optimize \eta_{i} for i\in N (Sec. 4.3) by finding the root of \mathbb{I}(\boldsymbol{\theta};D_{i}^{r})-r^{*}_{i} using a root-finding algorithm called the TOMS Algorithm 748 method (Alefeld, G. E., Potra, F. A., and Shi, Y. Algorithm 748: Enclosing zeros of continuous functions. ACM Trans. Math. Softw., 21(3), 327–344, 1995).
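As a concrete illustration of this root-finding step, the sketch below pairs SciPy's implementation of the TOMS 748 root finder with the BLR information gain formula from Appendix E.5. The function names (blr_ig, optimize_eta), the bracket eta_hi, and the simplifying assumption that injecting output noise of variance \eta merely inflates the effective noise variance to \sigma^{2}+\eta are illustrative and not the exact implementation used in our experiments.

import numpy as np
from scipy.optimize import toms748  # TOMS Algorithm 748 root finder


def blr_ig(X, Sigma0, noise_var):
    # IG on the BLR weights: 0.5 * log|I + Sigma0 X^T X / noise_var| (cf. Appendix E.5).
    d = X.shape[1]
    return 0.5 * np.linalg.slogdet(np.eye(d) + Sigma0 @ X.T @ X / noise_var)[1]


def optimize_eta(X, Sigma0, sigma2, r_target, eta_hi=1e8):
    # Find the injected noise variance eta_i such that the IG of the perturbed data
    # matches the target reward value r_i*; IG decreases monotonically as eta grows.
    f = lambda eta: blr_ig(X, Sigma0, sigma2 + eta) - r_target
    if f(0.0) <= 0.0:
        return 0.0  # even the unperturbed data does not exceed the target value
    return toms748(f, 0.0, eta_hi)  # assumes the sign change lies within [0, eta_hi]

In practice, the upper end of the bracket can be grown geometrically until a sign change is enclosed.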

E.1 Gaussian Process (GP) Regression with Synthetic Friedman Dataset

The following Friedman function is used in our experiment:

\begin{array}{rl}y=&10\sin(\pi{\rm{\mathbf{x}}}_{[d,0]}{\rm{\mathbf{x}}}_{[d,1]})+20({\rm{\mathbf{x}}}_{[d,2]}-0.5)^{2}\\&+10{\rm{\mathbf{x}}}_{[d,3]}+5{\rm{\mathbf{x}}}_{[d,4]}+0{\rm{\mathbf{x}}}_{[d,5]}+\mathcal{N}(0,1)\end{array}

where d is the index of data point ({\rm{\mathbf{x}}},y) and its 6 input features (i.e., {\rm{\mathbf{x}}}\triangleq({\rm{\mathbf{x}}}_{[d,a]})_{a=0,\ldots,5}) are independent and uniformly distributed over the input domain [0,1]. The last input feature {\rm{\mathbf{x}}}_{[d,5]} is an independent variable which does not affect the output y. We standardize the values of output y and train a GP model with a squared exponential kernel (for all GP models, we use automatic relevance determination such that each input feature has a different lengthscale parameter).
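For reference, a minimal NumPy sketch of this data-generating process is given below; the helper name make_friedman and the seeding convention are illustrative.

import numpy as np


def make_friedman(n, seed=0):
    # Draw n points from the Friedman function above; the last feature does not affect y.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 6))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + 0 * X[:, 5]
         + rng.normal(0.0, 1.0, size=n))
    return X, (y - y.mean()) / y.std()  # standardize the output values as described above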

For each tested value of \rho, we reset the random seed to a fixed value and draw 20 samples of \mathbf{z}_{i} from \mathcal{N}(\mathbf{0},\eta_{i}\mathbf{I}) with an optimized \eta_{i} (Sec. 4.3). Though \eta_{i} differs for varying \rho, the fixed random seed makes the standardized samples of \mathbf{z}_{i} the same across all values of \rho, thus allowing a fair comparison. We evaluate the performance of our reward scheme on a test dataset consisting of 800 points generated from the Friedman function.

E.2 Gaussian Process (GP) Regression with DiaP Dataset

We remove the gender feature and standardize the values of both the input matrix {\rm{\mathbf{X}}} and the output vector {\rm{\mathbf{y}}}. We train a GP model with a composite kernel comprising the squared exponential kernel and the exponential kernel.

We consider 5 different train-test (i.e., 80%-20% at random) splits. For each split, we create 10 different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5.

E.3 Neural Network (NN) Regression with CaliH Dataset

We first standardize the values of both the input matrix {\rm{\mathbf{X}}} and the output vector {\rm{\mathbf{y}}}. Then, we use 60% of the CaliH dataset to train a NN with 2 dense and fully connected hidden layers of 100 and 50 units, respectively, and use the rectified linear unit (ReLU) as the activation function. Subsequently, we perform BLR on the outputs from the last hidden layer.
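The sketch below is a rough illustration of this two-stage setup using scikit-learn: the hidden layers are trained first, and BLR is then performed on the last hidden layer's activations. The variable names (X_pub, y_pub, X_pool, y_pool), the zero-mean unit-variance prior, and the noise variance sigma2 are assumed placeholders rather than the exact settings of our experiments.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Stage 1: train the 2 hidden layers (100 and 50 ReLU units) on the "public" 60% split.
nn = MLPRegressor(hidden_layer_sizes=(100, 50), activation="relu", max_iter=1000)
nn.fit(X_pub, y_pub)


def last_hidden_features(X):
    # Propagate inputs through the trained hidden layers only (ReLU activations),
    # stopping before the linear output layer.
    h = X
    for W, b in zip(nn.coefs_[:-1], nn.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)
    return h


# Stage 2: BLR on the last hidden layer's outputs for the parties' aggregated data,
# assuming a zero-mean prior with unit variance on the BLR weights.
Phi = last_hidden_features(X_pool)
A = np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2        # posterior precision of the weights
w_mean = np.linalg.solve(A, Phi.T @ y_pool / sigma2)   # posterior mean of the weights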

We consider 5 different train-test (i.e., 80%-20% at random) splits. For each split, we create 10 different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5.

E.4 Sparse GP Regression with Synthetic Friedman Dataset (10 Parties)

We consider a larger synthetic Friedman dataset of size 5000 and 3 different train-test (i.e., 80%-20% at random) splits. For each split, we create 10 different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5, except that we divide the training dataset into 10 consecutive blocks with the constraint that each party owns at least 5% of the dataset.

As in Appendix E.1, we standardize the values of output y. Since the exact Shapley value (2) is expensive to evaluate with a large number of parties, we approximate it using 3000 samples generated by the simple random sampling algorithm (Maleki et al., 2013).
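For illustration, a simple permutation-sampling estimator in the spirit of Maleki et al. (2013) is sketched below; value_fn is an assumed callable returning v_C for a coalition of integer party indices, and in practice its outputs should be cached since coalitions recur across sampled permutations.

import numpy as np


def shapley_mc(parties, value_fn, num_samples=3000, seed=0):
    # Approximate each party's Shapley value by averaging its marginal contribution
    # over randomly sampled permutations of the parties.
    rng = np.random.default_rng(seed)
    phi = {i: 0.0 for i in parties}
    for _ in range(num_samples):
        perm = rng.permutation(list(parties))
        coalition, v_prev = [], value_fn(frozenset())
        for i in perm:
            coalition.append(int(i))
            v_curr = value_fn(frozenset(coalition))
            phi[int(i)] += v_curr - v_prev
            v_prev = v_curr
    return {i: phi[i] / num_samples for i in parties}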

E.5 Calculating IG for BLR and GP Models

Let D\triangleq({\rm{\mathbf{X}}},{\rm{\mathbf{y}}}) be the training data where {\rm{\mathbf{X}}} is an input matrix and {\rm{\mathbf{y}}} ({\rm{\mathbf{f}}}) is the corresponding noisy (latent) output vector such that {\rm{\mathbf{y}}} is perturbed from {\rm{\mathbf{f}}} by an additive Gaussian noise with noise variance \sigma^{2}.

For BLR with model parameters/weights \boldsymbol{\theta} and prior belief p(\boldsymbol{\theta})\triangleq\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{0}), the IG on \boldsymbol{\theta} given D is

\mathbb{I}(\boldsymbol{\theta};D)=0.5\log(|{\rm{\mathbf{I}}}+{\rm{\mathbf{\Sigma}}}_{0}{\rm{\mathbf{X}}}^{\top}{\rm{\mathbf{X}}}/\sigma^{2}|)\ .

We also consider modeling an unknown/latent function f with a full-rank GP and setting \boldsymbol{\theta}=f such that the latent output vector {\rm{\mathbf{f}}} comprises components f({\rm{\mathbf{x}}}) for all {\rm{\mathbf{x}}} in {\rm{\mathbf{X}}}. Then, the IG on f given D is

\mathbb{I}(f;D)=\mathbb{I}({\rm{\mathbf{f}}};D)=0.5\log(|{\rm{\mathbf{I}}}+K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}}/\sigma^{2}|) (10)

where K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}} is a covariance matrix with components k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime}) for all {\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime} in {\rm{\mathbf{X}}} and k is a kernel function. When we use a sparse GP model (specifically, the deterministic training conditional (DTC) approximation of the GP model) with inducing input matrix {\rm{\mathbf{U}}}, we replace K_{{\rm{\mathbf{X}}}{\rm{\mathbf{X}}}} in (10) with K^{\top}_{{\rm{\mathbf{U}}}{\rm{\mathbf{X}}}}K_{{\rm{\mathbf{U}}}{\rm{\mathbf{U}}}}^{-1}K_{{\rm{\mathbf{U}}}{\rm{\mathbf{X}}}} where K_{{\rm{\mathbf{U}}}{\rm{\mathbf{U}}}} is a covariance matrix with components k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime}) for all {\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime} in {\rm{\mathbf{U}}} and K_{{\rm{\mathbf{U}}}{\rm{\mathbf{X}}}} is a covariance matrix with components k({\rm{\mathbf{x}}},{\rm{\mathbf{x}}}^{\prime}) for all {\rm{\mathbf{x}}} in {\rm{\mathbf{U}}} and {\rm{\mathbf{x}}}^{\prime} in {\rm{\mathbf{X}}}.

When calculating the IG for heteroscedastic data (i.e., data points with different noise variances), instead of dividing by \sigma^{2}, we multiply by the diagonal matrix {\rm{\mathbf{K}}}_{\text{noise}}^{-1} such that each diagonal component of {\rm{\mathbf{K}}}_{\text{noise}} is the noise variance of the corresponding data point in D.
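Putting these expressions together, a minimal NumPy sketch of the IG computations in this subsection is given below; the function names and the small jitter added for numerical stability are illustrative.

import numpy as np


def gp_ig(K_XX, sigma2):
    # Full-rank GP: 0.5 * log|I + K_XX / sigma^2| as in (10).
    n = K_XX.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(n) + K_XX / sigma2)[1]


def dtc_ig(K_UX, K_UU, sigma2, jitter=1e-6):
    # Sparse GP (DTC): replace K_XX by K_UX^T K_UU^{-1} K_UX, where K_UX is m x n.
    m = K_UU.shape[0]
    Q_XX = K_UX.T @ np.linalg.solve(K_UU + jitter * np.eye(m), K_UX)
    return gp_ig(Q_XX, sigma2)


def heteroscedastic_ig(K_XX, noise_vars):
    # Heteroscedastic data: multiply by the diagonal K_noise^{-1} instead of dividing by sigma^2.
    n = K_XX.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(n) + K_XX @ np.diag(1.0 / noise_vars))[1]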

Appendix F Additional Experimental Results

F.1 Additional Results for Synthetic Friedman and BosH Datasets

GP regression with synthetic Friedman dataset. We also evaluate the performance of our reward scheme on a synthetic Friedman dataset with 2000 data points. We standardize the values of output y and train a GP model with a squared exponential kernel. We consider 5 different train-test (i.e., 80%-20% at random) splits. For each split, we create 10 different partitions of the training dataset using the partitioning algorithm detailed in Sec. 5.

The results are shown in Fig. 7. As before, the improvement in MNLP is usually positive. When \rho=0.5, most points move closer to the diagonal line, which implies that parties with smaller \phi_{i} can now receive more valuable model rewards with higher predictive accuracy.

Figure 7: Scatter graph of the (a-b) improvement in IG (r_{i}-v_{i}) vs. maximal improvement in IG (v_{N}-v_{i}) and (c-d) improvement in MNLP vs. maximal improvement in MNLP for multiple partitions of the Friedman dataset among n=3 parties. Panels: (a) IG, \rho=1; (b) IG, \rho=0.5; (c) MNLP, \rho=1; (d) MNLP, \rho=0.5.

NN regression with BosH dataset. The Boston housing (BosH) dataset contains the values of 506 houses with 10 selected input features (Harrison & Rubinfeld, 1978). We train a NN with 2 dense and fully connected hidden layers of 50 units each and use the rectified linear unit (ReLU) as the activation function. Subsequently, we perform BLR on the outputs from the last hidden layer.

We remove 3 input features (i.e., the proportion of residential land zoned for lots over 25,000 sq. ft., the proportion of non-retail business acres per town, and the Charles River dummy variable) and standardize the values of both the input matrix {\rm{\mathbf{X}}} and the output vector {\rm{\mathbf{y}}}. Then, we use 80% of the BosH dataset to train the first 2 layers of the NN and divide the entire BosH dataset into a test dataset and among 3 parties using our partitioning algorithm. Due to the small data size, there is overlap between the data used to train the “public” NN layers and the data that is divided among the parties.

Fig. 8 shows the results of our reward scheme for multiple partitions of the BosH dataset among n=3 parties. The same observations follow: Most model rewards achieve positive improvement in MNLP.

When a smaller \rho is used (i.e., Figs. 8b and 8d) or when the parties have higher Shapley values relative to the others (i.e., lighter color), they will receive more valuable model rewards, which translate to lower MNLP (i.e., points lie closer to the diagonal identity line).

Figure 8: Scatter graph of the (a-b) improvement in IG (r_{i}-v_{i}) vs. maximal improvement in IG (v_{N}-v_{i}) and (c-d) improvement in MNLP vs. maximal improvement in MNLP for multiple partitions of the BosH dataset among n=3 parties. Panels: (a) IG, \rho=1; (b) IG, \rho=0.5; (c) MNLP, \rho=1; (d) MNLP, \rho=0.5.

F.2 Empirical Analysis of Relationship between IG and MNLP

In this work, our proposed data valuation method and reward scheme are based on IG. Our experimental results have shown that IG is closely related to the predictive accuracy (i.e., MNLP) of a trained model. A model with higher IG usually achieves lower MNLP as it produces smaller predictive variances (i.e., related to the first term of (3)). However, the relationship between IG and MNLP is not strictly monotonic as MNLP is also affected by the second term of (3), i.e., the scaled squared error. In this subsection, we will demonstrate some poor use cases of our scheme in which an improvement in IG more frequently fails to translate into a higher predictive accuracy of the trained model.

The first poor use case is when an unsuitable model is selected. To demonstrate this, we train an alternative GP regression model with an unsuitable squared exponential kernel on the DiaP dataset. Even when ample training data is provided, the squared error remains high relative to the variance in the data. The results achieved using this unsuitable GP model are shown in Figs. 9c-d. For ease of comparison, we copy the results of the GP model with a suitable kernel in Figs. 3c-d to Figs. 9a-b. We can observe that the maximal and actual improvement in MNLP are mostly positive for the suitable model (Figs. 9a-b). However, for the unsuitable model, fewer positive values of the maximal and actual improvement in MNLP can be observed in Figs. 9c-d. In Figs. 9b and 9d, there are, respectively, 631 and 388 points with positive improvement in MNLP (out of a maximum of 750 points). Hence, when an unsuitable model is selected, an improvement in IG more frequently fails to translate into a higher predictive accuracy of the trained model.

Figure 9: Scatter graph of the improvement in MNLP vs. maximal improvement in MNLP when \rho=1,0.5 for (a-b) a suitable GP model and (c-d) an unsuitable GP model for multiple partitions of the DiaP dataset among n=3 parties. Panels: (a) \rho=1, suitable model; (b) \rho=0.5, suitable model; (c) \rho=1, unsuitable model; (d) \rho=0.5, unsuitable model.

The next poor use case is when the model prior, which determines the model's predictive performance with limited data, is sufficiently informative, hence leading to a high predictive accuracy even if the model is only trained on a small amount of data. An example of an informative model prior is when the prior mean of the model parameters is close to the true mean and the prior variance is small. In this case, although the first term of (3) is smaller for a model with higher IG, the small predictive variance will magnify the squared error, thus increasing the second term of (3). So, overall, the model with a higher IG may not have a significantly lower MNLP. Instead, it is possible that it has a worse/higher MNLP. We demonstrate this with the BosH dataset by setting the prior mean of the BLR parameters to be the NN last layer's weights (instead of the zero vector as in Appendix F.1) and the prior variance to be a small value of 0.01 (instead of 1). The MNLP results are shown in Fig. 10. It can be observed that the largest maximal improvement in MNLP in Fig. 10 is around 0.2, which is significantly smaller than that (i.e., 1.0) in Figs. 8c-d. In Fig. 10a, there are more instances with negative improvement in MNLP compared to Fig. 8c.

The last poor use case is related to the previous one. Even if the model prior is insufficiently informative, the same phenomenon can happen if each party has ample data to independently train a model with a reasonable predictive accuracy. More training data lead to a higher IG and reduce the first term of (3), but the resulting smaller predictive variance magnifies the scaled squared error, thus potentially increasing MNLP.

Figure 10: Scatter graph of the improvement in MNLP vs. maximal improvement in MNLP for multiple partitions of the BosH dataset among n=3 parties. The prior mean of the model parameters is close to the true mean and the prior variance is small. Panels: (a) MNLP, \rho=1; (b) MNLP, \rho=0.5.

Summary. Ideally, the improvement in IG from training on the additional data decided by our reward scheme should translate into a higher predictive accuracy of the model reward. This should occur when a suitable model is selected and the model prior is not sufficiently informative for any party to achieve a high predictive accuracy by training a model on its own data. In practice, this is reasonable as collaboration precisely happens only when every individual party cannot train a high-quality model alone but can do so by working together.

F.3 Comparing IG-Based vs. MNLP-Based Data Valuation Methods

In Sec. 3, our data valuation method is based on IG as it can avoid the need and difficulty of selecting a common validation dataset, unlike other data valuation methods (e.g., using MNLP). We also claim that IG is a suitable surrogate measure of the predictive accuracy of a trained model. In this subsection, we will demonstrate this by comparing the values of data and Shapley values when the data valuation is performed using IG vs. MNLP.

For the MNLP-based data valuation method, let the value of data D_{C} for any C\subseteq N be v^{\prime}_{C}=\text{MNLP}_{\emptyset}-\text{MNLP}_{C} where \text{MNLP}_{C} is defined in the same way as (3) except that D_{i}^{r} in (3) is replaced by D_{C}. For each party i\in N, we can compute its alternative Shapley value \phi_{i}^{\prime} using (2) by replacing v_{C} with v^{\prime}_{C} for all C\subseteq N.
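A small sketch of this alternative valuation is given below, assuming the standard definition of MNLP as the mean negative log predictive probability of the validation outputs under the model's Gaussian predictive distribution; the predictive means and variances are assumed to come from the model trained on D_C (an empty coalition falls back to the prior predictive distribution).

import numpy as np


def mnlp(y_val, pred_mean, pred_var):
    # Mean negative log predictive probability over the validation outputs under a
    # Gaussian predictive distribution (cf. the two terms of (3)).
    return np.mean(0.5 * np.log(2 * np.pi * pred_var)
                   + 0.5 * (y_val - pred_mean) ** 2 / pred_var)


def mnlp_value(mnlp_empty, mnlp_C):
    # MNLP-based value of coalition C's data: v'_C = MNLP_empty - MNLP_C.
    return mnlp_empty - mnlp_C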

Since we are more concerned about the relative values of data and Shapley values between parties, Figs. 11a-b show results of the IG-based vs. MNLP-based normalized values of data (respectively, v_{i}/\sum_{i\in N}{v_{i}} vs. v^{\prime}_{i}/\sum_{i\in N}{v^{\prime}_{i}}) of 5 different partitions, while Figs. 11c-d show results of the IG-based vs. MNLP-based normalized Shapley values (respectively, \phi_{i}/v_{N} vs. \phi^{\prime}_{i}/v^{\prime}_{N}) of the same 5 partitions. These partitions are generated using the partitioning algorithm in Sec. 5 on a training dataset of 800 data points. For each partition, we consider 10 different validation datasets of size 200, which results in 10 possible sets of \{v^{\prime}_{C}\}_{C\in 2^{N}} and \{\phi^{\prime}_{i}\}_{i\in N}. The 10 validation datasets are either randomly selected or generated using the partitioning algorithm on the larger withheld test dataset (using the same feature a) so that each may overlap with or exclude any party's data range. We measure the similarity between a party's data and the validation dataset using the Wasserstein distance W(\cdot,\cdot) between their respective values of feature a. The smaller the distance, the more similar the party's data and the validation dataset.
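With SciPy, this distance can be computed directly from the two one-dimensional samples of feature a; the array names below are placeholders.

from scipy.stats import wasserstein_distance

# party_feat_a: feature-a values of party i's data; val_feat_a: feature-a values of the
# validation dataset (both 1-D arrays). A smaller distance means more similar distributions.
dist = wasserstein_distance(party_feat_a, val_feat_a)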

Figure 11: Scatter graph of MNLP-based vs. IG-based normalized (a-b) values of data and (c-d) Shapley values for Bayesian regression models with various datasets. Each partition and choice of validation dataset generates multiple scatter points (one for each party). Panels: (a) GP Friedman; (b) NN CaliH; (c) GP Friedman; (d) NN CaliH.

In Fig. 11, as the validation dataset varies, the normalized values of data and Shapley values for the MNLP-based data valuation method differ significantly (i.e., large spread along the y-axis). This shows that the MNLP-based data valuation method is highly sensitive to the choice of the validation dataset. Parties are valued significantly lower if their data is very different from the validation dataset. This is reflected by the position of the darker colored points below the lighter colored points.

The IG-based data valuation method can serve as a good surrogate for the MNLP-based data valuation method as it will not favor any specific party. In particular, for the Friedman dataset, the normalized IG-based value of data/Shapley value (i.e., the diagonal line) closely approximates, on average over the different validation datasets, the normalized MNLP-based values of data/Shapley values (i.e., the scatter points).