
2D-Shapley: A Framework for Fragmented Data Valuation

Zhihong Liu    Hoang Anh Just    Xiangyu Chang    Xi Chen    Ruoxi Jia

Abstract

Data valuation—quantifying the contribution of individual data sources to certain predictive behaviors of a model—is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources with the shared feature or sample space. How to valuate fragmented data sources of which each only contains partial features and samples remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.


1 Introduction

Data are essential ingredients for building machine learning (ML) applications. The ability to quantify and measure the value of data is crucial to the entire lifecycle of ML: from cleaning poor-quality samples and tracking important ones to be collected during data preparation, to setting proper priorities over samples during training, to interpreting why certain behaviors of a model emerge during deployment. Determining the value of data is also central to designing incentive systems for data sharing and implementing current policies about the monetization of personal data.

Figure 1: Illustration of different data valuation settings based on how the training set is partitioned among different data contributors.

The current literature on data valuation (Jia et al., 2019b; Ghorbani & Zou, 2019) has focused exclusively on valuing horizontally partitioned data—in other words, each data source to be valued shares the same feature space. How to value vertically partitioned data, where each data source provides different features but shares the same sample space, has been studied in the context of ML interpretability (Covert et al., 2020). However, none of these abstractions fully captures the complexity of real-world scenarios, where data sources can have non-overlapping features and samples (termed fragmented data sources hereinafter).

Example 1.

Consider two banks, $B_1$ and $B_2$, and two e-commerce companies, $E_1$ and $E_2$, located in Regions 1 and 2. These four institutions are interested in collaboratively building an ML model to predict users’ credit scores with their data. Due to the geographical difference, $B_1$ and $E_1$ have a different user group from $B_2$ and $E_2$. Also, due to the difference in business, $B_1$ and $B_2$ provide different features than what $E_1$ and $E_2$ can offer. Overall, the four institutions partition the aggregated data horizontally and vertically, as illustrated by Figure 1(c). How to quantify each institution’s contribution to the joint model training?

Example 2.

Due to inevitable errors occurring during the data generation and collection processes, real-world data are seldom high quality. Suppose that a data analyst is interested in identifying some potentially erroneous entries in a dataset. Existing horizontal and vertical data valuation tools can help locate the rows or columns that could contain errors by returning the ones with the lowest values. Nevertheless, can we perform more fine-grained detection—e.g., how can we pinpoint the coordinates of erroneous entries?

Example 3.

Horizontal data valuation is now widely used to explain the importance of each sample to a learning outcome (Tang et al., 2021; Karlaš et al., 2022). But how can a data analyst further explain these sample importance scores—why does a sample receive a certain importance score? Is a sample “low-quality” because it contains several “moderately low-quality” features or one “exceptionally low-quality” feature?

Answering the above questions calls for a quantitative understanding of how each block in the data matrix (e.g. a sub-matrix as in Ex. 1 or a single entry as in Ex. 2 and 3) contributes to the outcome of learning.

Technical Challenges. The problem of block valuation requires rethinking fundamental aspects of data valuation. Existing data valuation theory consists of two basic modules at a conceptual level: (1) Counterfactual Analysis, where one calculates how the utility of a subset of data sources would change after the source to be valued is removed; and (2) Fair Attribution, where a data source is valued based on a weighted average of its marginal utilities for different subsets and the weights are set for the value to satisfy certain fairness properties. The fairness notion considered by past valuation schemes requires that permuting the order of different data sources does not change their value.

For horizontal and vertical valuation, the counterfactual can be simply calculated by taking the difference between the model performance trained on a subset of columns or rows and the performance with one column or row being removed. However, it is unclear how to calculate the counterfactual when a block is excluded because the remaining data matrix could be incomplete. Besides, the fairness notion of existing data value notions is no longer appropriate in the context of block valuation. As a concrete example to illustrate this point, consider Figure 1(c) and suppose the two blocks on the left provide temperature measurements as features and the ones on the right are humidity measurements. In this case, one should not expect the value to be unchanged when two blocks with different physical meanings (e.g., yellow and pink) are swapped.

Contributions. This paper presents the first focused study on data valuation without assuming shared feature space or sample space. Toward that end, we make the following contributions.

  • We present an approach that enables evaluation of the marginal contribution of a block within the data matrix to any other block with non-overlapping sample and feature spaces.

  • We abstract the block valuation problem into a two-dimensional (2D) cooperative game, where the utility function is invariant to column permutations and row permutations but not to any arbitrary entry permutations.

  • We propose axioms that a proper valuation scheme should satisfy in the 2D game and show that the axioms lead to a unique representation of the value assignment (referred to as 2D-Shapley). Particularly, this representation is a natural generalization of the Shapley value (Shapley, 1997)—a celebrated value attribution scheme widely used in data valuation among other applications.

  • We demonstrate that 2D-Shapley enables new applications, including selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.

2 Background and Related Work

In a typical setting, a set of data sources are used to learn an ML model, which achieves a certain performance score. The goal of data valuation is to quantify the contribution of each data source toward achieving the performance score. The definition of a data source depends on the context in which the data valuation results are utilized. For instance, when using data valuation to interpret how the global behavior of the ML model depends on individual samples or individual features, a sample or a feature in the training data is regarded as a data source; when using data valuation to inform the reward design for data sharing, the collection of all samples or all features contributed by the same entities is regarded as a data source.

Formally, let $N=\{1,\ldots,n\}$ denote the index set of $n$ training data sources. A data valuation scheme assigns a score to each training data source in a way that reflects their contribution. These scores are referred to as data values. To analyze a source’s “contribution”, we define a utility function $U:2^{N}\rightarrow\mathbb{R}$, which maps any subset of the data sources to a score indicating the usefulness of the subset. $2^{N}$ represents the power set of $N$, i.e., the set of all subsets of $N$, including the empty set and $N$ itself. For the classification task, a common choice for $U$ is the performance of a model trained on the input subset, i.e., $U(S)=\text{acc}(\mathcal{A}(S))$, where $\mathcal{A}$ is a learning algorithm that takes a set $S\subseteq N$ of sources as input and returns a model, and acc is a metric function to evaluate the performance of a given model, e.g., the accuracy of a model on a hold-out validation set.

Past research has proposed various ways to characterize data values given the utility function, among which the Shapley value is arguably the most widely used scheme for data valuation. The Shapley value is defined as

$$\psi_{i}^{1d}(U) := \frac{1}{n}\sum_{k=1}^{n}\binom{n-1}{k-1}^{-1}\sum_{\substack{S\subseteq N\setminus i\\ |S|=k-1}}\left[U(S\cup i)-U(S)\right]. \qquad (1)$$
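To make Eq. (1) concrete, the following minimal sketch (our own illustration, not code released with the paper) computes 1D-Shapley exactly by enumerating all subsets; the toy utility is hypothetical, and the computation is only feasible for very small $n$.

```python
from itertools import combinations
from math import comb

def shapley_1d(utility, n):
    """Exact 1D-Shapley via Eq. (1). `utility` maps a frozenset of
    source indices drawn from {0, ..., n-1} to a real-valued score."""
    values = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(1, n + 1):                      # |S| = k - 1 before adding i
            weight = 1.0 / (n * comb(n - 1, k - 1))
            for S in combinations(rest, k - 1):
                S = frozenset(S)
                phi += weight * (utility(S | {i}) - utility(S))
        values.append(phi)
    return values

# Toy utility with diminishing returns; the values are symmetric and sum to utility(N) = 2.
toy_utility = lambda S: len(S) ** 0.5
print(shapley_1d(toy_utility, n=4))                    # approximately [0.5, 0.5, 0.5, 0.5]
```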

To differentiate from the proposed work, we will refer to the Shapley value defined in Eq. (1) as 1D-Shapley. 1D-Shapley is popular due to its unique satisfaction of the following four axioms (Shapley, 1953):

  • Dummy: if $U(S\cup i)=U(S)+c$ for any $S\subseteq N\setminus i$ and some $c\in\mathbb{R}$, then $\psi_{i}^{1d}(U)=c$.

  • Symmetry: let $\pi:N\rightarrow N$ be any permutation of $N$ and $\pi U(S):=U(\pi(S))$; then $\psi_{\pi(i)}^{1d}(\pi U)=\psi_{i}^{1d}(U)$.

  • Linearity: for utility functions $U_{1},U_{2}$ and any $\alpha_{1},\alpha_{2}\in\mathbb{R}$, $\psi_{i}^{1d}(\alpha_{1}U_{1}+\alpha_{2}U_{2})=\alpha_{1}\psi_{i}^{1d}(U_{1})+\alpha_{2}\psi_{i}^{1d}(U_{2})$.

  • Efficiency: for every $U$, $\sum_{i\in N}\psi_{i}^{1d}(U)=U(N)$.

The symmetry axiom embodies fairness. In particular, $\pi U$ arises upon the reindexing of data sources $1,\ldots,n$ with the indices $\pi(1),\ldots,\pi(n)$; the symmetry axiom states that the evaluation of a particular position should not depend on the indices of the data sources.

Although the Shapley value was justified through these axioms in prior literature, the necessity of each axiom depends on the actual use case of data valuation results. Recent literature has studied new data value notions obtained by relaxing some of the aforementioned axioms, enabling improvements in the accuracy of bad data identification (Kwon & Zou, 2022), robustness to learning stochasticity (Wang & Jia, 2023; Wu et al., 2022a), and computational efficiency (Yan & Procaccia, 2021). For instance, relaxing the efficiency axiom gives rise to semi-values (Kwon & Zou, 2022; Wang & Jia, 2023); relaxing the linearity axiom gives rise to least cores (Yan & Procaccia, 2021). This paper focuses on generalizing 1D-Shapley to block valuation. As we will expound on later, 1D-Shapley faces two limitations in serving as a reasonable notion for block-wise values. Note that 1D-Shapley and the aforementioned relaxed notions share a similar structure: all of them are based on the marginal utility of a data source. Hence, our effort to generalize 1D-Shapley to new settings can be adapted to other more relaxed notions.

Another line of related work focuses on developing efficient algorithms for data valuation via Monte Carlo methods (Jia et al., 2019b; Lin et al., 2022), via surrogate utility functions such as $K$-nearest-neighbors (Jia et al., 2019a), neural tangent kernels (Wu et al., 2022b), and distributional distance measures (Just et al., 2023; Tay et al., 2022), and via reinforcement learning (Yoon et al., 2020). These ideas can also benefit the efficient computation of the proposed 2D-Shapley. As a concrete example, this paper builds upon Monte Carlo simulation and surrogate model approaches to improve the efficiency of 2D-Shapley.

Beyond data valuation, 1D-Shapley has been extensively used to gain feature-based interpretability for black-box models locally and globally. The local interpretability methods (Lundberg & Lee, 2017; Strumbelj & Kononenko, 2010) focus on analyzing the relative importance of features for each input separately; therefore, the importance scores of features across different samples are not comparable. By contrast, our work allows the comparison of feature importance across different samples. The global interpretability methods (Covert et al., 2020), on the other hand, explain the model’s behavior across the entire dataset; in the context of this paper, we regard them as vertical data valuation. Compared to global interpretability methods, our work provides a more fine-grained valuation by associating each entry of a feature with an importance score. Our work improves the interpretability of the global feature importance score in the sense that it reveals each individual sample’s contribution to the importance of a feature.

3 How to Value a Block?

This section starts by formulating the block valuation problem. Then, we discuss the challenges of using 1D-Shapley to tackle the block valuation problem in terms of both counterfactual analysis and fair attribution. Finally, we present our proposed framework for solving the block valuation problem.

3.1 Problem Formulation

Let $N=\{1,2,\ldots,n\}$ and $M=\{1,2,\ldots,m\}$ index the $n$ disjoint collections of samples and the $m$ disjoint collections of features contributed by $nm$ sources (or blocks). Each data source can be labeled by $(i,j)$ for $i\in N$ and $j\in M$, where we call $i$ the sample-wise index and $j$ the feature-wise index. To measure the contribution of a data source, we need to define a utility function, which measures the usefulness of a subset of data sources. The utility function $h(S,F)$ takes in two separate sets $S\subseteq N$ and $F\subseteq M$ as the variables and returns a real-valued score indicating the utility of $\{(i,j)\}_{i\in S,j\in F}$. Note that this paper focuses on valuing the relative importance of feature blocks; that is, we assume that each data contributor provides a block of features and then the aggregation of features will be annotated by a separate entity (e.g., a data labeling company) that does not share the profit generated from joint training. More formally, we define the utility function as follows:

$$h(S,F):=\text{Performance of the model trained on the feature blocks } \{(i,j)\}_{i\in S,j\in F} \text{ after annotation.}$$

One can potentially generalize our framework to jointly value feature and label blocks by redefining the utility function to be non-zero only when both features and labels are included in the input block, as in (Jia et al., 2019a; Yona et al., 2021), but an in-depth investigation is deferred to future work.

The benefit of this utility function definition is two-fold. First, its two-dimensional index always corresponds to a data fragment with the same feature space for all samples inside. As a result, one can calculate the utility in a straightforward manner by training on the matrix and evaluating the corresponding performance. This is an essential advantage over the one-dimensional index utilized by 1D-Shapley, as will be exemplified later. Second, constructed this way, the utility function is invariant to permutations of sample-wise indices in $S$ for any given $F$ and to permutations of feature-wise indices in $F$ for any given $S$, but not to permutations of the sample-wise and feature-wise indices combined. This is a desirable property: for many data types in ML, such as tabular data, one would expect that swapping samples or swapping features does not change the model performance (swapping features in an image dataset may lose certain local information, but it is rare that different pixel positions of an image dataset are contributed by different entities, so we do not consider this case), yet swapping any two entries in the matrix may introduce arbitrary errors and thus alter the model performance significantly.

Our goal is to assign a score to each block in $\{(i,j)\}_{i\in N,j\in M}$ that measures its contribution to the outcome of joint learning $h(N,M)$.

3.2 A Naive Baseline: 1D-Shapley

One idea to tackle the block valuation problem is to flatten the indices of blocks into one dimension and leverage 1D-Shapley to value each block. Specifically, we can reindex $\{(i,j)\}_{i\in N,j\in M}$ by $T=\{1,\ldots,nm\}$. Note that this step discards the structural information contained in the two-dimensional indices. Then, one can utilize Eq. (1) to value each $i\in T$.

The second step of applying Eq. (1) requires calculating $U(S\cup i)-U(S)$ for any $S\subseteq T\setminus i$. Both $S$ and $S\cup i$ could correspond to a data fragment with samples differing in their feature space (see the example in Figure 2); nevertheless, how to evaluate the utility of such a fragment is unclear. An ad hoc way of addressing this problem is to perform missing value imputation, e.g., filling out the missing values of a feature using the average of the feature values present.
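As a concrete illustration of this ad hoc fix, the snippet below (a sketch of our own, not part of the paper's released code) imputes the missing entries of a flattened fragment with per-column means before the utility is evaluated.

```python
import numpy as np

def impute_fragment(X):
    """Fill missing entries (np.nan) with the mean of the observed values
    in the same feature column, as in the 1D-Shapley baseline."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)        # mean over the entries that are present
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X
```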

In addition to the difficulty of evaluating the counterfactual, the symmetry axiom satisfied by 1D-Shapley no longer has the correct fairness interpretation when the input indices are flattened from 2D ones. In that case, the indices $1,\ldots,nm$ carry specific meanings entailed by the original 2D structure; e.g., some indices might correspond to temperature features, and others might correspond to humidity. Hence, the symmetry axiom, which requires unchanged data values after permuting the data sources’ indices, is neither sensible nor necessary, as the permutation might map the content of a data source from one meaning to an entirely different one.

We will use 1D-Shapley with missing value imputation as a baseline for our proposed approach. This simple baseline is still a useful benchmark to assess the extra (non-trivial) gains in different application scenarios that our approach can attain.

Figure 2: A visualization of the 1D-Shapley marginal contribution applied to sample-feature valuation.

3.3 Our Approach: 2D-Shapley

Here, we describe 2D-Shapley as a principled framework for valuing data blocks. We emphasize how 2D-Shapley overcomes the challenges of the 1D-Shapley baseline in terms of (1) calculating the counterfactual and (2) framing the correct fairness principles, and we then derive the representation of the data values based on the new counterfactual analysis and principles. Finally, we present efficient algorithms to compute 2D-Shapley.

3.3.1 Two-Dimensional Counterfactual Analysis

Given a two-dimensional utility function $h(\cdot,\cdot)$, we define the marginal contribution of a block $(i,j)$ to the collection of blocks $\{(i,j)\}_{i\in S,j\in F}$ as

$$M_{h}^{i,j}(S,F) := h(S\cup i,F\cup j) + h(S,F) - h(S\cup i,F) - h(S,F\cup j). \qquad (2)$$

The rationale behind the definition of $M_{h}^{i,j}(S,F)$ is illustrated in Figure 3. Starting from the area corresponding to $h(S\cup i,F\cup j)$, we subtract the areas of $(S\cup i,F)$ and $(S,F\cup j)$ and add back the area of $(S,F)$, which has been subtracted twice. The remaining area, labeled “marginal” in Figure 3, corresponds to the marginal influence of the block $(i,j)$.

Figure 3: Removal process and marginal influence of $(i,j)$.

The unique advantage is that each individual utility is well-defined, as it takes as input a collection of blocks within which the samples all share the same feature space.
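A minimal sketch of Eq. (2) is given below, assuming a user-supplied utility `h(S, F)` (a hypothetical name) that trains and scores a model on the blocks indexed by the sample set `S` and the feature set `F`.

```python
def marginal_2d(h, i, j, S, F):
    """Two-dimensional marginal contribution M_h^{i,j}(S, F) of block (i, j),
    following Eq. (2); assumes i not in S and j not in F."""
    return (h(S | {i}, F | {j}) + h(S, F)
            - h(S | {i}, F) - h(S, F | {j}))
```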

3.3.2 Axioms for Block Valuation

We start by redefining “dummy” for block valuation, where the underlying utility function is 2D.

Definition 3.1.

(2D-Dummy) We call a block $(i,j)$ a 2D-dummy under utility function $h$ if for all $S\subseteq N\backslash i$ and $F\subseteq M\backslash j$,

$$M_{h}^{i,j}(S,F)=c, \quad c\in\mathbb{R}. \qquad (3)$$

A 2D-dummy implies the canonical (one-dimensional) dummy mentioned in Section 2. Specifically, if sample $i$ is a sample dummy satisfying $h(S\cup i,F)=h(S,F)+c_{1}$ and $h(S\cup i,F\cup j)=h(S,F\cup j)+c_{2}$ for $S\subseteq N\backslash i$, $F\subseteq M\backslash j$, as in the dummy defined for 1D-Shapley, then Eq. (3) is satisfied with $c:=c_{2}-c_{1}$; similarly, if feature $j$ is a feature dummy satisfying $h(S,F\cup j)=h(S,F)+c_{1}'$ and $h(S\cup i,F\cup j)=h(S\cup i,F)+c_{2}'$ for $S\subseteq N\backslash i$, $F\subseteq M\backslash j$, then Eq. (3) is also satisfied with $c:=c_{2}'-c_{1}'$. However, Eq. (3) does not imply that sample $i$ is a sample dummy or that feature $j$ is a feature dummy.
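For concreteness, the sample-dummy case follows from a one-line substitution of the two identities into Eq. (2):

$$M_{h}^{i,j}(S,F) = \big[h(S,F\cup j)+c_{2}\big] + h(S,F) - \big[h(S,F)+c_{1}\big] - h(S,F\cup j) = c_{2}-c_{1}.$$

The feature-dummy case is analogous.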

We first define the utility function set $G$, which contains all possible utility functions, and define a value function $\psi:G\rightarrow\mathbb{R}^{n\times m}$, denoting the value of block $(i,j)$ as $\psi_{ij}(h)$, the $ij$-th element of the matrix $\psi(h)$. In order to build an equitable evaluation system, we provide the following axioms.

Axiom 1.

(2D-Linearity) For any two utility functions $h_{1},h_{2}\in G$ and any $\beta_{1},\beta_{2}\in\mathbb{R}$,

$$\psi_{ij}(\beta_{1}h_{1}+\beta_{2}h_{2})=\beta_{1}\psi_{ij}(h_{1})+\beta_{2}\psi_{ij}(h_{2}). \qquad (4)$$
Axiom 2.

(2D-Dummy) If the block $(i,j)$ is a 2D-dummy under $h$, i.e., it satisfies Eq. (3), then $\psi_{ij}(h)=c$.

Axiom 3.

(2D-Symmetry) Let $\pi_{1}:N\rightarrow N$ and $\pi_{2}:M\rightarrow M$ be two permutations; then

$$\psi_{\pi_{1}(i)\pi_{2}(j)}[(\pi_{1}\pi_{2})h]=\psi_{ij}(h), \qquad (5)$$

where for all $S\subseteq N$, $F\subseteq M$,

$$[(\pi_{1}\pi_{2})h](S,F):=[(\pi_{2}\pi_{1})h](S,F):=h(\pi_{1}(S),\pi_{2}(F)). \qquad (6)$$
Axiom 4.

(2D-Efficiency) For every utility function $h\in G$,

$$\sum_{\substack{i\in N\\ j\in M}}\psi_{ij}(h)=h(N,M). \qquad (7)$$

Let us discuss the rationale behind the four axioms.

The 2D-linearity axiom is inherited from 1D-Shapley: it implies that the value of the $(i,j)$-th block under the sum of two ML performance measures is the sum of its values under each performance measure.

The 2D-dummy axiom can be interpreted by taking $c=0$: if a block has no contribution to the ML task, no matter what the situation (i.e., for any $S\subseteq N\backslash i$ and $F\subseteq M\backslash j$), then its value is zero.

In the 2D-symmetry axiom, the rows and columns are permuted independently. As a result, the entries from the same feature always remain in the same column. The axiom states that such permutations do not change the value of individual data blocks, which is what we would expect in many ML applications. In Appendix A, we prove that Axiom 3 is implied by this interpretation.

The 2D-efficiency axiom is inherited from 1D-Shapley, requiring that the sum of the values of all the data blocks equals the performance of the whole data set.

Based on the axioms, we provide a definition:

Definition 3.2.

The value $\psi_{ij}(h)$ with respect to the utility function $h$ is a two-dimensional Shapley value (2D-Shapley for short), denoted by $\psi_{ij}^{2d}$, if $\psi_{ij}$ satisfies the 2d-linearity, 2d-dummy, 2d-symmetry, and 2d-efficiency axioms.

2D-Shapley can be seen as the two-dimensional extension of the Shapley value, inheriting its advantages with a natural adaptation of the dummy and symmetry axioms to the two-dimensional utility function scenario.

3.3.3 Representation Theory

We will show that there exists an analytic and unique solution for 2D-Shapley.

Theorem 3.3.

(Representation Theory of 2D-Shapley) The value $\psi^{2d}_{ij}$ has a unique solution:

$$\psi_{ij}^{2d}=\frac{1}{nm}\sum_{s=1}^{n}\sum_{f=1}^{m}\Delta_{sf}, \qquad (8)$$

where $i\in N$, $j\in M$,

$$\Delta_{sf}=\frac{1}{\binom{n-1}{s-1}\binom{m-1}{f-1}}\sum_{(S,F)\in D_{sf}^{ij}} M_{h}^{i,j}(S,F), \qquad (9)$$

$D_{sf}^{ij}=\{(S,F):S\subseteq N\backslash i,\, F\subseteq M\backslash j,\, |S|=s-1,\, |F|=f-1\}$, and $M_{h}^{i,j}(S,F)$ is defined in Eq. (2).

Theorem 3.3 indicates that $\psi_{ij}^{2d}$ is a weighted average of the two-dimensional counterfactual in Eq. (2). Theorem 3.3 is referred to as the representation theory of 2D-Shapley because the proof shows that $\psi_{ij}^{2d}$ has a basis expansion formulation (see Eq. (14) in Appendix B). To show the basis expansion, a series of basic utility functions in $G$ needs to be defined (e.g., Eq. (13)). Compared with the representation theory of 1D-Shapley by Roth (1988), one technical challenge is to define the basis and basic utility functions for the 2D case to handle the 2D counterfactual. Furthermore, the proof of the uniqueness of 2D-Shapley has to solve a complex high-dimensional linear system (see Eq. (18) in Appendix B). Our proof incorporates new techniques, unseen in the classic proof of 1D-Shapley, to deal with these unique technical challenges arising in the 2D context.
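To make Eqs. (8)–(9) concrete, the brute-force sketch below (our own illustration, exponential in $nm$ and intended only for tiny matrices) evaluates 2D-Shapley exactly; it assumes a utility `h(S, F)` as in Section 3.1 and reuses the `marginal_2d` helper sketched earlier.

```python
from itertools import combinations
from math import comb

def shapley_2d_exact(h, n, m):
    """Exact 2D-Shapley for every block (i, j) via Eq. (8)-(9)."""
    psi = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            rows = [r for r in range(n) if r != i]
            cols = [c for c in range(m) if c != j]
            total = 0.0
            for s in range(1, n + 1):                  # |S| = s - 1
                for f in range(1, m + 1):              # |F| = f - 1
                    weight = 1.0 / (comb(n - 1, s - 1) * comb(m - 1, f - 1))
                    delta = sum(
                        marginal_2d(h, i, j, set(S), set(F))
                        for S in combinations(rows, s - 1)
                        for F in combinations(cols, f - 1)
                    )
                    total += weight * delta            # Delta_{sf}
            psi[i][j] = total / (n * m)                # Eq. (8)
    return psi
```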

Moreover, the representation theory also implies that 2D-Shapley can be reduced to 1D-Shapley. The following corollary shows that summing the block values over all rows gives the 1D-Shapley values of features, and summing the block values over all columns gives the 1D-Shapley values of samples. Corollary 3.4 not only indicates that 2D-Shapley is a natural generalization of 1D-Shapley but is also useful for discussing the experimental results on how 2D values can explain 1D values (see Subsection 4.1).

Corollary 3.4.

For any $h\in G$, let $\psi^{1d}_{i\cdot}(h):=\sum_{j\in M}\psi^{2d}_{ij}(h)$ and $\psi^{1d}_{\cdot j}(h):=\sum_{i\in N}\psi^{2d}_{ij}(h)$; then

$$\psi^{1d}_{i\cdot}(h)=\frac{1}{n}\sum_{\substack{S\subseteq N\backslash i\\ |S|=s}}\frac{1}{\binom{n-1}{s}}\big[h(S\cup i,M)-h(S,M)\big], \qquad (10)$$

and

$$\psi^{1d}_{\cdot j}(h)=\frac{1}{m}\sum_{\substack{F\subseteq M\backslash j\\ |F|=f}}\frac{1}{\binom{m-1}{f}}\big[h(N,F\cup j)-h(N,F)\big], \qquad (11)$$

which are in the form of 1D-Shapley.
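Given exact block values `psi` from the sketch after Theorem 3.3, Corollary 3.4 can be checked numerically, e.g.:

```python
# psi = shapley_2d_exact(h, n, m)
feature_values = [sum(psi[i][j] for i in range(n)) for j in range(m)]   # 1D feature values (Eq. (11))
sample_values  = [sum(psi[i][j] for j in range(m)) for i in range(n)]   # 1D sample values (Eq. (10))
```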

Finally, having the analytical expression of 2D-Shapley in Eq. (8) at hand greatly facilitates the design of efficient algorithms.

3.3.4 Efficient Algorithm

The computational complexity of exactly calculating 2D-Shapley is exponential in $nm$ due to the summation over all possible subsets of rows and columns. To overcome this challenge, we develop a Monte Carlo approach to approximating 2D-Shapley. The key idea is that 2D-Shapley can be rewritten as an expectation, over random permutations of rows and columns, of the marginal contribution of $(i,j)$ to the blocks indexed by the row indices preceding $i$ and the column indices preceding $j$. As a result, we can approximate 2D-Shapley by averaging over randomly sampled permutations of rows and columns. We also design the algorithm so that utility function evaluations can be reused across different permutations, which gives rise to significant efficiency gains. The full details of the algorithm design are provided in Appendix E, and the pseudo-code is shown in Algorithm 1.
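A minimal Monte Carlo sketch of this idea is given below (our own illustration with a hypothetical utility `h(S, F)`; Algorithm 1 in Appendix E additionally reuses utility evaluations across permutations, which this sketch does not).

```python
import random

def shapley_2d_mc(h, n, m, num_perms=100, seed=0):
    """Monte Carlo estimate of 2D-Shapley: average the 2D marginal contribution
    of each block (i, j) to its predecessors under random row/column permutations."""
    rng = random.Random(seed)
    psi = [[0.0] * m for _ in range(n)]
    for _ in range(num_perms):
        pi_rows = list(range(n)); rng.shuffle(pi_rows)
        pi_cols = list(range(m)); rng.shuffle(pi_cols)
        for a, i in enumerate(pi_rows):
            for b, j in enumerate(pi_cols):
                S, F = set(pi_rows[:a]), set(pi_cols[:b])   # rows before i, columns before j
                psi[i][j] += (h(S | {i}, F | {j}) + h(S, F)
                              - h(S | {i}, F) - h(S, F | {j})) / num_perms
    return psi
```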

Evaluating the utility function requires retraining a model. For small-scale datasets, it might be possible to evaluate the utility function multiple times within a reasonable time, but for large-scale datasets, even evaluating it once might require days to finish. This would render our method impractical for many applications. Nonetheless, we can obviate all model training when computing our values by using $K$-nearest-neighbors (KNN) as a surrogate model. KNN-surrogate-based data valuation has shown great computational advantage while providing effective data quality identification (Jia et al., 2019a). In this work, we leverage a similar idea to reduce the computational complexity of 2D-Shapley for large models. First, observe from Eq. (8) and Corollary D.1 that, after rearranging inner terms, we have:

$$\psi_{ij}^{2d}=\frac{1}{n!\,m!}\sum_{\substack{\pi_{1}\in\Pi(N)\\ \pi_{2}\in\Pi(M)}}\big[h(P_{i}^{\pi_{1}}\cup i,\,P_{j}^{\pi_{2}}\cup j)-h(P_{i}^{\pi_{1}},\,P_{j}^{\pi_{2}}\cup j)\big]-\big[h(P_{i}^{\pi_{1}}\cup i,\,P_{j}^{\pi_{2}})-h(P_{i}^{\pi_{1}},\,P_{j}^{\pi_{2}})\big], \qquad (12)$$

where $\Pi(X)$ is the set of all permutations of $X$, $\pi\in\Pi(X)$ is a permutation of $X$, and $P_{i}^{\pi}$ is the set of elements preceding $i$ in $\pi$. The expression in the first bracket is the 1D marginal contribution of sample $i$ and is valid since both utilities are evaluated on the same features, $P_{j}^{\pi_{2}}\cup j$. Similarly, the second bracket also represents a valid 1D marginal contribution of sample $i$, but with features $P_{j}^{\pi_{2}}$. From this observation, we can apply the 1D-Shapley value approximated with nearest neighbors, $\phi^{\text{KNN}}$, defined recursively in Theorem 1 of (Jia et al., 2019a), and 2D-Shapley under the KNN surrogate can therefore be expressed as

$$\psi_{ij}^{\text{2d-KNN}}=\frac{1}{m!}\sum_{\pi_{2}\in\Pi(M)}\big[\phi^{\text{KNN}}(i,P_{j}^{\pi_{2}}\cup j)-\phi^{\text{KNN}}(i,P_{j}^{\pi_{2}})\big].$$

This new formulation is efficient as it requires no model training and removes the sum over all possible permutations of samples. We can further approximate the sum over all possible feature permutations by an average over sampled permutations. Our final complexity becomes $\mathcal{O}(PT|M||N|^{2}\log|N|)$, where $P$ is the number of sampled feature permutations, $T$ is the number of test points used for evaluating model performance, and $|N|,|M|$ are the cardinalities of $N$ and $M$, respectively. The pseudo-code for the overall KNN-based approximation is provided in Algorithm 2.
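A sketch of the resulting estimator is shown below. It assumes a routine `knn_shapley_1d(features)` (a hypothetical name) that returns the 1D KNN-Shapley value of every sample when only the given feature columns are used, corresponding to $\phi^{\text{KNN}}$ of Jia et al. (2019a); the internal KNN recursion and the convention for the empty feature set are left to that routine.

```python
import random

def shapley_2d_knn(knn_shapley_1d, n, m, num_perms=20, seed=0):
    """KNN-surrogate 2D-Shapley: over sampled feature permutations, accumulate the change
    in each sample's 1D KNN-Shapley value when feature j is appended to its predecessors."""
    rng = random.Random(seed)
    psi = [[0.0] * m for _ in range(n)]
    for _ in range(num_perms):
        pi_cols = list(range(m)); rng.shuffle(pi_cols)
        prev = knn_shapley_1d(frozenset())                    # phi^KNN with no features (convention, e.g., zeros)
        for b, j in enumerate(pi_cols):
            cur = knn_shapley_1d(frozenset(pi_cols[:b + 1]))  # phi^KNN on P_j together with j
            for i in range(n):
                psi[i][j] += (cur[i] - prev[i]) / num_perms
            prev = cur
    return psi
```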

4 Experiments

Figure 4: Performance comparison between 2D-Shapley and baselines on various use cases.

This section covers the two general application scenarios of 2D-Shapley. (1) Cell valuation, where each cell in the training data matrix is considered a data source and receives a score indicating its contribution to a learning task performed on the matrix. We mainly demonstrate this application scenario’s benefits in fine-grained data debugging and interpreting canonical sample-wise or feature-wise data values. (2) Sub-matrix valuation, where a sub-matrix containing multiple cells is considered a data source and receives a joint score. This scenario is closely related to data marketplaces, where each entity provides a dataset that appears as a submatrix in the aggregated data. Details about datasets, models, implementations, and ablation studies on a budget of inserted outliers are provided in Appendix F.

4.1 Cell Valuation

Sanity check of cell-wise values.

We first check whether the cell-wise values produced by our method make sense via the data removal experiments commonly used in the data valuation literature. Specifically, we would expect that removing the cells with the highest values from the training set leads to the most significant performance degradation; conversely, removing the cells with the lowest values should barely affect the model performance. To evaluate the model performance after removal, we “remove” a cell by refilling its content with the average of all other cells on the same feature column. In the previous section, we present two algorithms to calculate 2D-Shapley. We will label the values obtained from the Monte Carlo-based method as 2D-Shapley-MC and the ones from the KNN-surrogate-based method as 2D-Shapley-KNN. 1D-Shapley and random removal are used as our baselines. In particular, 1D-Shapley is estimated by the permutation sampling described in (Jia et al., 2019b). For each baseline, we remove a number of cells at a time based on their sample-feature value ranking in either descending or ascending order; then, we train a model on the reduced dataset and evaluate the model performance.
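A sketch of this removal protocol (our own illustration; model retraining and evaluation are omitted): cells are "removed" in ranked order by overwriting each with the mean of the other cells in its feature column.

```python
import numpy as np

def remove_cells_by_value(X, cell_values, k, ascending=True):
    """Return a copy of X with the k lowest- (or highest-) valued cells replaced
    by the mean of the remaining cells in the same feature column."""
    X = np.asarray(X, dtype=float).copy()
    order = np.argsort(cell_values, axis=None)           # flat cell indices, ascending value
    if not ascending:
        order = order[::-1]
    rows, cols = np.unravel_index(order[:k], X.shape)
    for r, c in zip(rows, cols):
        others = np.delete(X[:, c], r)                   # column entries excluding this cell
        X[r, c] = others.mean()
    return X
```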

As shown in Figure 4, when removing cells in ascending value order, 2D-Shapley can not only maintain the model performance but also improve it by at least $2\%$ for the Census, Credit, and Breast Cancer datasets, whereas 1D-Shapley degrades the model performance earlier than 2D-Shapley on all three datasets. When removing the highest-valued cells, we observe that 2D-Shapley can effectively detect contributing cells, and removing these cells causes the model performance to drop quickly. By contrast, removing cells according to 1D-Shapley is close to random removal. These results indicate that 2D-Shapley is more effective than 1D-Shapley at recognizing the contribution of cells and can better inform strategic data harnessing in ML.

Fine-Grained Outlier Localization.

Existing horizontal data valuation methods have demonstrated promising results in detecting abnormal samples (Ghorbani & Zou, 2019; Kwon & Zou, 2022; Wang & Jia, 2023) by finding the lowest-valued samples. However, it is rarely the case that every cell in such a sample is abnormal. For instance, a type of error in the Census data is "198x $\rightarrow$ 189x", where the years of birth are wrongly specified; this error could appear in a single feature column and, at the same time, only affect partial samples (or users) born in 198x. Existing horizontal valuation remains limited in localizing these erroneous entries.

To demonstrate the potential of 2D-Shapley in fine-grained entry-wise outlier detection, we first inject outlier cells into a clean dataset, the Breast Cancer dataset. Following a recent outlier generation technique (Du et al., 2022), we inject low-probability-density values into the dataset as outlier cells. We explain the outlier injection method in detail in Appendix F.3. We randomly place outlier cells in $2\%$ of all cells (50 cells in total). Afterward, we compute 2D-Shapley-KNN for each cell in the dataset with inserted outliers, as shown in Figure 10. Since we expect outliers not to be helpful for the model performance, the values of outlier cells should be low. Therefore, we sort the 2D-Shapley cell values in ascending order and prioritize human inspection toward the ones with the lowest values. We show the detection rate of the inserted outliers in Figure 5A). With 2D-Shapley values, we can detect $90\%$ of the inserted outliers within the first $5\%$ of all cells. By contrast, based on the values produced by 1D-Shapley, one would need to inspect over $90\%$ of the cells to screen out all the outlier cells.

Figure 5: A) Detection of the inserted outliers in the Breast Cancer dataset. B) Detection of the inserted outliers in the Age category of the Census dataset.
Figure 6: 2D-Shapley vs. model performance on various dataset splits.
Figure 7: Cell values of samples with similar 1D values in the Breast Cancer dataset.

We further examine a practical case of outliers caused by human error, where cells have been incorrectly typed, e.g., "18" became "81". In the Census dataset, for the feature "Age", we randomly swap 15 cells between "17" and "71", "18" and "81", and "19" and "91". Similarly, we sort the values of all cells in the dataset in ascending order. As we observe in Figure 5B), detection with 2D-Shapley outperforms 1D-Shapley. In particular, with 2D-Shapley we can detect $80\%$ of the added outliers with fewer than 1,800 inspected cells, while 1D-Shapley requires 4 times as many cells to achieve a comparable rate. The 1D-Shapley and 2D-Shapley heatmaps are provided in the Appendix. These results demonstrate the effectiveness of 2D-Shapley in locating outlier cells in a dataset.

Enabling Interpretation of 1D Valuation Results.

Apart from outlier detection, 2D-Shapley also brings new insights into horizontal sample valuation and vertical feature valuation, which we refer to as 1D valuation. For instance, 1D sample valuation produces an importance score for each sample, but we lack a deeper understanding of why a sample receives a certain value. Recall from Corollary 3.4 that the sums of 2D-Shapley values over rows and over columns give 1D feature values and 1D sample values, respectively. Hence, 2D-Shapley allows one to interpret the 1D value of a sample by further breaking it down into the contributions of different features in that sample. That is, 2D-Shapley gives insights into the relative importance of different features of a sample to the valuation result received by the sample. For example, in Figure 7A), we observe that two samples have similar 1D values and their cell values are also close. However, in Figure 7B), we observe a contrasting case: although both samples have close 1D values, their cell values are completely unrelated. More detailed results can be found in Appendix F.3.

4.2 Sub-matrix Valuation

We turn to the application of 2D-Shapley to informing dataset pricing in data marketplaces. 2D-Shapley enables a principled method to value fragmented data sources as illustrated in Figure 1(c), where each source is a sub-matrix of the aggregated training data matrix. A reasonable measure of a source's value should reflect its usefulness for ML. Hence, to verify the significance of the resulting values for sub-matrix valuation, we measure the performance of a model trained on each source and examine the correlation between its value and that performance. For this experiment, we use the Credit dataset with sources contributing fragmented data and consider multiple random splits of the dataset. The results are provided in Figure 6, where each line corresponds to a different split of the aggregated data into individual sources. Figure 6 shows that as the performance of the model trained on a block increases, its corresponding 2D-Shapley block value also increases.

5 Conclusion

This work aims to set the theoretical foundation for more realistic data valuation application scenarios. In particular, we investigate the block valuation problem and present 2D-Shapley, a new data value notion suitable for this problem. 2D-Shapley empowers a range of new use cases, such as informing the pricing of fragmented data, strategic data selection at a fine-grained scale, and interpreting 1D valuation results. Our work opens up many new avenues for future investigation. First, we can immediately adapt our proof technique to prove two-dimensional generalizations of other typical data value notions (Kwon & Zou, 2022; Wang & Jia, 2023). Second, it is interesting to build upon our framework to evaluate irregular-shaped data sources (Fang et al., 2019) and to incorporate label information for joint valuation in a principled way.

Acknowledgements

Xiangyu Chang’s work was partly supported by the National Natural Science Foundation for Outstanding Young Scholars of China under Grant 72122018 and partly by the Natural Science Foundation of Shaanxi Province under Grant 2021JC-01. Xi Chen would like to thank the support from NSF via the Grant IIS-1845444. RJ and the ReDS lab would like to thank the support from NSF via the Grant OAC-2239622.

References

  • Covert et al. (2020) Covert, I., Lundberg, S. M., and Lee, S.-I. Understanding global feature contributions with additive importance measures. Advances in Neural Information Processing Systems, 33:17212–17223, 2020.
  • Du et al. (2022) Du, X., Wang, Z., Cai, M., and Li, Y. Vos: Learning what you don’t know by virtual outlier synthesis. arXiv preprint arXiv:2202.01197, 2022.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Fang et al. (2019) Fang, F., Lan, W., Tong, J., and Shao, J. Model averaging for prediction with fragmentary data. Journal of Business & Economic Statistics, 37(3):517–527, 2019.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. PMLR, 2019.
  • Jia et al. (2019a) Jia, R., Dao, D., Wang, B., Hubis, F. A., Gürel, N. M., Li, B., Zhang, C., Spanos, C. J., and Song, D. Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment, 12(11):1610–1623, 2019a.
  • Jia et al. (2019b) Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. Towards efficient data valuation based on the Shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  1167–1176. PMLR, 2019b.
  • Just et al. (2023) Just, H. A., Kang, F., Wang, J. T., Zeng, Y., Ko, M., Jin, M., and Jia, R. Lava: Data valuation without pre-specified learning algorithms. In International Conference on Learning Representations, 2023.
  • Karlaš et al. (2022) Karlaš, B., Dao, D., Interlandi, M., Li, B., Schelter, S., Wu, W., and Zhang, C. Data debugging with shapley importance over end-to-end machine learning pipelines. arXiv preprint arXiv:2204.11131, 2022.
  • Kwon & Zou (2022) Kwon, Y. and Zou, J. Beta Shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pp.  8780–8802. PMLR, 2022.
  • Lin et al. (2022) Lin, J., Zhang, A., Lécuyer, M., Li, J., Panda, A., and Sen, S. Measuring the effect of training data on deep learning predictions via randomized experiments. In International Conference on Machine Learning, pp. 13468–13504. PMLR, 2022.
  • Lundberg & Lee (2017) Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30:4768–4777, 2017.
  • Roth (1988) Roth, A. E. The Shapley value: essays in honor of Lloyd S. Shapley. Cambridge University Press, 1988.
  • Shapley (1953) Shapley, L. S. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317, 1953.
  • Shapley (1997) Shapley, L. S. A value for n-person games. Classics in game theory, 69, 1997.
  • Strumbelj & Kononenko (2010) Strumbelj, E. and Kononenko, I. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
  • Tang et al. (2021) Tang, S., Ghorbani, A., Yamashita, R., Rehman, S., Dunnmon, J. A., Zou, J., and Rubin, D. L. Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset. Scientific Reports, 11(1):1–9, 2021.
  • Tay et al. (2022) Tay, S. S., Xu, X., Foo, C. S., and Low, B. K. H. Incentivizing collaboration in machine learning via synthetic data rewards. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  9448–9456, 2022.
  • Wang & Jia (2023) Wang, J. T. and Jia, R. A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
  • Wang & Jia (2022) Wang, T. and Jia, R. Data banzhaf: A data valuation framework with maximal robustness to learning stochasticity. arXiv preprint arXiv:2205.15466, 2022.
  • Wu et al. (2022a) Wu, M., Jia, R., Huang, W., Chang, X., et al. Robust data valuation via variance reduced data shapley. arXiv preprint arXiv:2210.16835, 2022a.
  • Wu et al. (2022b) Wu, Z., Shu, Y., and Low, B. K. H. Davinz: Data valuation using deep neural networks at initialization. In International Conference on Machine Learning, pp. 24150–24176. PMLR, 2022b.
  • Yan & Procaccia (2021) Yan, T. and Procaccia, A. D. If you like shapley then you’ll love the core. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  5751–5759, 2021.
  • Yona et al. (2021) Yona, G., Ghorbani, A., and Zou, J. Who’s responsible? jointly quantifying the contribution of the learning algorithm and data. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp.  1034–1041, 2021.
  • Yoon et al. (2020) Yoon, J., Arik, S., and Pfister, T. Data valuation using reinforcement learning. In International Conference on Machine Learning, pp. 10842–10851. PMLR, 2020.

Appendix A Proof of the fact that Axiom 3 is implied by its explanation

The explanation above is: for $i_{1},i_{2}\in N$, $j_{1},j_{2}\in M$, if for any $S\subseteq N\backslash\{i_{1},i_{2}\}$ and $F\subseteq M$, $h(S\cup i_{1},F)=h(S\cup i_{2},F)$, and for any $S\subseteq N$ and $F\subseteq M\backslash\{j_{1},j_{2}\}$, $h(S,F\cup j_{1})=h(S,F\cup j_{2})$, then $\psi_{i_{1}j_{1}}(h)=\psi_{i_{2}j_{2}}(h)$.

We prove in three steps that the explanation is equivalent to Axiom 3. Note that we assume Axioms 1, 2, and 4 already hold. For simplicity, we use lowercase letters to denote the cardinality of a set, for example, $|S|=s$.

We want to prove the following proposition.

Proposition A.1.

If Axioms 1, 2, and 4 hold, then Axiom 3 is equivalent to its explanation.

Proof.

For the direction that Axiom 3 is implied by its explanation, we prove in three steps.

  • Step 1: Define a utility function $h_{S,F}$:

    $$h_{S,F}(W_{1},W_{2})=\begin{cases}1, & \text{if } S\subseteq W_{1},\, F\subseteq W_{2},\\ 0, & \text{otherwise.}\end{cases} \qquad (13)$$

    For fixed $S\subseteq N$, $F\subseteq M$ and $i_{1},i_{2}\in S$, $j_{1},j_{2}\in F$, and for all $W_{1}\subseteq N\backslash\{i_{1},i_{2}\}$, $W_{2}\subseteq M\backslash\{j_{1},j_{2}\}$, we have $M_{h_{S,F}}^{i_{1},j_{1}}(W_{1},W_{2})=M_{h_{S,F}}^{i_{2},j_{2}}(W_{1},W_{2})$. This leads to the conclusion that $\psi_{i_{1}j_{1}}(h_{S,F})=\psi_{i_{2}j_{2}}(h_{S,F})$ according to the explanation.

    For $i^{*}\notin S$, $j\in M$ (or $j^{*}\notin F$, $i\in N$) and $W_{1}\subseteq N\backslash i^{*}$, $W_{2}\subseteq M\backslash j$ (resp. $W_{1}\subseteq N\backslash i$, $W_{2}\subseteq M\backslash j^{*}$), we have $M_{h_{S,F}}^{i^{*},j}(W_{1},W_{2})=0$ (resp. $M_{h_{S,F}}^{i,j^{*}}(W_{1},W_{2})=0$). By Axiom 2, this leads to $\psi_{i^{*}j}(h_{S,F})=0$ for all $j\in M$ (resp. $\psi_{ij^{*}}(h_{S,F})=0$ for all $i\in N$).

    In summary, the values $\psi_{ij}$ are the same for $i\in S$, $j\in F$, and zero otherwise. According to Axiom 4,

    $$1=h_{S,F}(N,M)=\sum_{\substack{i\in N\\ j\in M}}\psi_{ij}(h_{S,F})=\sum_{\substack{i\in S\\ j\in F}}\psi_{ij}(h_{S,F}),$$

    so $\psi_{ij}(h_{S,F})=1/(sf)$ for $i\in S$, $j\in F$.

  • Step 2: We prove a lemma that expresses any utility $h$ in terms of the $h_{S,F}$ defined above.

    Lemma A.2.

    $$h=\sum_{\substack{S\subseteq N\\ F\subseteq M}}C_{S,F}(h)\,h_{S,F},$$

    where $C_{S,F}(h)=\sum_{\substack{S'\subseteq S\\ F'\subseteq F}}(-1)^{s+f-s'-f'}h(S',F')$.

    Proof.

    We can directly verify the lemma.

    $$\begin{aligned}
    h(W_{1},W_{2}) &= \sum_{\substack{S\subseteq N\\ F\subseteq M}}C_{S,F}(h)\,h_{S,F}(W_{1},W_{2})\\
    &= \sum_{\substack{S\subseteq W_{1}\\ F\subseteq W_{2}}}\ \sum_{\substack{S'\subseteq S\\ F'\subseteq F}}(-1)^{s+f-s'-f'}h(S',F')\\
    &= \sum_{\substack{S'\subseteq W_{1}\\ F'\subseteq W_{2}}}\Big[\sum_{s=s'}^{w_{1}}(-1)^{s-s'}\binom{w_{1}-s'}{s-s'}\sum_{f=f'}^{w_{2}}(-1)^{f-f'}\binom{w_{2}-f'}{f-f'}\Big]h(S',F')\\
    &= h(W_{1},W_{2}).
    \end{aligned}$$

  • Step 3: Combining the first two steps, by Axiom 1,

    $$\psi_{ij}(h) = \sum_{\substack{S\subseteq N\\ F\subseteq M}}C_{S,F}(h)\,\psi_{ij}(h_{S,F}) = \sum_{\substack{i\in S\subseteq N\\ j\in F\subseteq M}}C_{S,F}(h)/(sf).$$

    Let $\pi_{1},\pi_{2}$ be two permutations on $N$ and $M$, respectively; then

    $$\begin{aligned}
    \psi_{\pi_{1}(i)\pi_{2}(j)}(\pi_{1}\pi_{2}h) &= \sum_{\substack{\pi_{1}(i)\in S\subseteq N\\ \pi_{2}(j)\in F\subseteq M}}C_{S,F}(\pi_{1}\pi_{2}h)/(sf)\\
    &= \sum_{\substack{i\in\pi_{1}(S)\subseteq N\\ j\in\pi_{2}(F)\subseteq M}}C_{\pi_{1}(S),\pi_{2}(F)}(h)/(sf)\\
    &= \psi_{ij}(h).
    \end{aligned}$$

For the other direction, that Axiom 3 implies its explanation: since we already assume Axioms 1, 2, 3, and 4 hold, we have the formula of 2D-Shapley, that is, Eq. (21). Clearly, the numerator is the same for both $i_{1}j_{1}$ and $i_{2}j_{2}$ under the same $S$ and $F$; hence $\psi_{i_{1}j_{1}}(h)=\psi_{i_{2}j_{2}}(h)$.

Appendix B Proof of the representation theory of 2D-Shapley

In this section, we justify the representation theory through a number of lemmas. The proof proceeds by adding the axioms one by one and showing what each axiom contributes to 2D-Shapley. We first add the linearity and dummy axioms to obtain a sum of weighted marginal contributions.

Lemma B.1.

For any value $\psi_{ij}$ satisfying the 2d-linearity and 2d-dummy axioms (Axioms 1 and 2), we have

$$\psi_{ij}(h)=\sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}p^{ij}_{S,F}\big[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)\big],$$

where $\sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}p^{ij}_{S,F}=1$.

Proof.

For any $h\in G$,

$$h=\sum_{\substack{S\subseteq N\\ F\subseteq M}}h(S,F)\,W_{S,F}, \qquad (14)$$

where

$$W_{S,F}(W_{1},W_{2})=\begin{cases}1, & \text{if } W_{1}=S,\, W_{2}=F,\\ 0, & \text{otherwise.}\end{cases}$$

By the 2d-linearity axiom,

$$\psi_{ij}(h)=\sum_{\substack{S\subseteq N\\ F\subseteq M}}h(S,F)\,\psi_{ij}(W_{S,F}).$$

Now define another utility function $W'_{S,F}$:

$$W'_{S,F}(W_{1},W_{2})=\begin{cases}1, & \text{if } S\subseteq W_{1},\, F=W_{2},\\ 0, & \text{otherwise.}\end{cases}$$

For any $S\subseteq N\backslash i$ and $F\subseteq M\backslash j$, we can check that block $(i,j)$ is a dummy for $W'_{S,F}$; then, by the 2d-dummy axiom, $\psi_{ij}(W'_{S,F})=0$. In particular, letting $S=N\backslash i$ and fixing any $F'\subseteq M\backslash j$, we have

$$\psi_{ij}(W_{N,F'})+\psi_{ij}(W_{N\backslash i,F'})=0.$$

For inductive purposes, assume it has been shown that $\psi_{ij}(W_{S,F'})+\psi_{ij}(W_{S\cup i,F'})=0$ for fixed $F'\subseteq M\backslash j$ and every $S\subseteq N\backslash i$ with $|S|\geq k\geq 2$. (The case $k=n-1$ has been proved.) Now take a fixed $S\subseteq N\backslash i$ with $|S|=k-1$; then

$$\begin{aligned}
0=\psi_{ij}(W'_{S,F'}) &= \sum_{S\subseteq S_{1}\subseteq N}\psi_{ij}(W_{S_{1},F'})\\
&= \psi_{ij}(W_{S\cup i,F'})+\psi_{ij}(W_{S,F'})+\sum_{\substack{S_{1}\subseteq N\backslash i\\ S\subsetneq S_{1}}}\big[\psi_{ij}(W_{S_{1}\cup i,F'})+\psi_{ij}(W_{S_{1},F'})\big]\\
&= \psi_{ij}(W_{S\cup i,F'})+\psi_{ij}(W_{S,F'}).
\end{aligned}$$

Therefore, $\psi_{ij}(W_{S\cup i,F'})+\psi_{ij}(W_{S,F'})=0$ for all $S\subseteq N\backslash i$ and fixed $F'\subseteq M\backslash j$ with $0<|S|\leq n-1$ and $0<|F'|\leq m-1$. Similarly, we obtain that $\psi_{ij}(W_{S',F})+\psi_{ij}(W_{S',F\cup j})=0$ for fixed $S'\subseteq N\backslash i$ and all $F\subseteq M\backslash j$ with $0<|S'|\leq n-1$ and $0<|F|\leq m-1$, by defining another similar utility function $W'_{S',F}$ and repeating the process above.

Using the results above,

$$\begin{aligned}
\psi_{ij}(h) &= \sum_{\substack{S\subseteq N\\ F\subseteq M}}h(S,F)\,\psi_{ij}(W_{S,F})\\
&= \sum_{F\subseteq M}\sum_{S\subseteq N\backslash i}h(S\cup i,F)\,\psi_{ij}(W_{S\cup i,F})+h(S,F)\,\psi_{ij}(W_{S,F})\\
&= \sum_{S\subseteq N\backslash i}\sum_{F\subseteq M}h(S\cup i,F)\,\psi_{ij}(W_{S\cup i,F})-h(S,F)\,\psi_{ij}(W_{S\cup i,F})\\
&= \sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}\psi_{ij}(W_{S\cup i,F\cup j})\big[h(S\cup i,F\cup j)-h(S,F\cup j)\big]\\
&\quad-\sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}\psi_{ij}(W_{S\cup i,F\cup j})\big[h(S\cup i,F)-h(S,F)\big]\\
&= \sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}\psi_{ij}(W_{S\cup i,F\cup j})\big[h(S\cup i,F\cup j)+h(S,F)-h(S,F\cup j)-h(S\cup i,F)\big].
\end{aligned}$$

For simplicity, denote $\psi_{ij}(W_{S\cup i,F\cup j})$ as $p^{ij}_{S,F}$; then

$$\psi_{ij}(h)=\sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}p^{ij}_{S,F}\big[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)\big].$$

Consider the utility function $h_{ij}$,

$$h_{ij}(W_{1},W_{2})=\begin{cases}1, & \text{if } i\in W_{1},\, j\in W_{2},\\ 0, & \text{otherwise,}\end{cases}$$

and we can check that $(i,j)$ is a dummy for $h_{ij}$ and $\psi_{ij}(h_{ij})=1$. Hence

$$1=\psi_{ij}(h_{ij})=\sum_{S\subseteq N\backslash i}\sum_{F\subseteq M\backslash j}p^{ij}_{S,F}.$$

Next, we add the 2d-symmetry axiom to Lemma B.1 and conclude that $p^{ij}_{S,F}$ depends only on the cardinalities of $S$ and $F$, not on the identities of the blocks.

Lemma B.2.

Assume Lemma B.1 holds. If $\psi_{ij}$ also satisfies the 2d-symmetry axiom, then

$$p^{ij}_{S,F}=p_{s,f},$$

where $p_{s,f}$ is some common value for $S\subseteq N\backslash i$, $F\subseteq M\backslash j$ with $0\leq|S|=s\leq n-1$ and $0\leq|F|=f\leq m-1$.

Proof.

Define a utility $\hat{h}_{S,F}$:

$$\hat{h}_{S,F}(W_{1},W_{2})=\begin{cases}1, & \text{if } S\subsetneq W_{1},\, F\subsetneq W_{2},\\ 0, & \text{otherwise.}\end{cases}$$
  1.

    For $i\in N$ and $j\in M$, let $S_{1}$, $F_{1}$ and $S_{2}$, $F_{2}$ be any two pairs of coalitions with $S_{1},S_{2}\subseteq N\backslash i$ and $F_{1},F_{2}\subseteq M\backslash j$, where $0<|S_{1}|=|S_{2}|<n-1$ and $0<|F_{1}|=|F_{2}|<m-1$, respectively. Consider two permutations $\pi_{1}$ and $\pi_{2}$ satisfying $\pi_{1}(S_{1})=S_{2}$, $\pi_{1}(i)=i$ and $\pi_{2}(F_{1})=F_{2}$, $\pi_{2}(j)=j$. Then,

    $$p_{S_{1},F_{1}}^{ij}=\psi_{ij}(\hat{h}_{S_{1},F_{1}})=\psi_{ij}(\hat{h}_{S_{2},F_{2}})=p_{S_{2},F_{2}}^{ij},$$

    where the central equality is a consequence of the 2d-symmetry axiom.

2.

    For distinct i1,i2Ni_{1},i_{2}\in N and j1,j2Mj_{1},j_{2}\in M, let SN\{i1,i2}S\subseteq N\backslash\{i_{1},i_{2}\} and FM\{j1,j2}F\subseteq M\backslash\{j_{1},j_{2}\}, and the permutations π1,π2\pi_{1},\pi_{2} respectively interchange i1,i2i_{1},i_{2} and j1,j2j_{1},j_{2} while leaving other elements fixed. Then,

    π1π2h^S,F=h^S,F,\pi_{1}\pi_{2}\hat{h}_{S,F}=\hat{h}_{S,F},
    pS,Fi1j1=ψi1j1(h^S,F)=ψi2j2(h^S,F)=pS,Fi2j2,p_{S,F}^{i_{1}j_{1}}=\psi_{i_{1}j_{1}}(\hat{h}_{S,F})=\psi_{i_{2}j_{2}}(\hat{h}_{S,F})=p_{S,F}^{i_{2}j_{2}},

    where the central equality is a consequence of the 2d-symmetry axiom. Combining with the previous result in Step 1, we find that for every 0<s<n10<s<n-1 and 0<f<m10<f<m-1, there is a ps,fp_{s,f} such that pS,Fij=ps,fp_{S,F}^{ij}=p_{s,f} for every iNi\in N and jMj\in M, SN\iS\subseteq N\backslash i and FM\jF\subseteq M\backslash j with |S|=s|S|=s, |F|=f|F|=f.

3.

    Similarly, by using different utility functions, we can find for iN,jM\forall i\in N,j\in M:

    • a pn1,fp_{n-1,f} such that pN\i,Fij=pn1,fp_{N\backslash i,F}^{ij}=p_{n-1,f} for FM\jF\subseteq M\backslash j and 0|F|=f<m10\leq|F|=f<m-1,

    • a ps,m1p_{s,m-1} such that pS,M\jij=ps,m1p_{S,M\backslash j}^{ij}=p_{s,m-1} for SN\iS\subseteq N\backslash i and 0|S|=s<n10\leq|S|=s<n-1,

    • a p0,fp_{0,f} such that p,Fij=p0,fp_{\emptyset,F}^{ij}=p_{0,f} for FM\jF\subseteq M\backslash j and 0<|F|=f<m10<|F|=f<m-1,

    • a ps,0p_{s,0} such that pS,ij=ps,0p_{S,\emptyset}^{ij}=p_{s,0}, for SN\iS\subseteq N\backslash i and 0<|S|=s<n10<|S|=s<n-1,

    • a pn1,m1p_{n-1,m-1} such that pN\i,M\jij=pn1,m1p_{N\backslash i,M\backslash j}^{ij}=p_{n-1,m-1},

    • a p_{0,0} such that p_{\emptyset,\emptyset}^{ij}=p_{0,0}, which makes the sum of all the weights equal to 1.

Finally, we add the 2d-efficiency axiom to obtain the uniqueness of 2D-Shapley.

Lemma B.3.

Assume Lemma B.1 holds. Then ψij(h)\psi_{ij}(h) satisfies the 2d-efficiency axiom if and only if

iNjMpN\i,M\jij=1,\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}p_{N\backslash i,M\backslash j}^{ij}=1, (15)
iSjFpS\i,F\jij+iSjFpS,FijiSjFpS,F\jijiSjFpS\i,Fij=0,\sum_{\begin{subarray}{c}i\in S\\ j\in F\end{subarray}}p_{S\backslash i,F\backslash j}^{ij}+\sum_{\begin{subarray}{c}i\notin S\\ j\notin F\end{subarray}}p_{S,F}^{ij}-\sum_{\begin{subarray}{c}i\notin S\\ j\in F\end{subarray}}p_{S,F\backslash j}^{ij}-\sum_{\begin{subarray}{c}i\in S\\ j\notin F\end{subarray}}p_{S\backslash i,F}^{ij}=0, (16)

where SNS\subsetneq N or FMF\subsetneq M.

Proof.

On the one hand, by Eq. (15) and Eq. (16),

h(N,M)\displaystyle h(N,M) =SNFMh(S,F)[iSjFpS\i,F\jij+iSjFpS,FijiSjFpS,F\jijiSjFpS\i,Fij]\displaystyle=\sum_{\begin{subarray}{c}S\subseteq N\\ F\subseteq M\end{subarray}}h(S,F)[\sum_{\begin{subarray}{c}i\in S\\ j\in F\end{subarray}}p_{S\backslash i,F\backslash j}^{ij}+\sum_{\begin{subarray}{c}i\notin S\\ j\notin F\end{subarray}}p_{S,F}^{ij}-\sum_{\begin{subarray}{c}i\notin S\\ j\in F\end{subarray}}p_{S,F\backslash j}^{ij}-\sum_{\begin{subarray}{c}i\in S\\ j\notin F\end{subarray}}p_{S\backslash i,F}^{ij}]
=iNjMSN\iFM\jpS,Fij[h(Si,Fj)+h(S,F)h(Si,F)h(S,Fj)]\displaystyle=\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ F\subseteq M\backslash j\end{subarray}}p_{S,F}^{ij}[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)]
=iNjMψij(h).\displaystyle=\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(h).

On the other hand, recall:

h^S,F(W1,W2)={1,ifSW1,FW2.0,otherwise.\hat{h}_{S,F}(W_{1},W_{2})=\left\{\begin{array}[]{lr}1,if\ S\subsetneq W_{1},F\subsetneq W_{2}.\\ 0,otherwise.\end{array}\right.

and

hS,F(W1,W2)={1,ifSW1,FW2.0,otherwise.h_{S,F}(W_{1},W_{2})=\left\{\begin{array}[]{lr}1,if\ S\subseteq W_{1},F\subseteq W_{2}.\\ 0,otherwise.\end{array}\right.

Consider two new utility functions

h~S,F(W1,W2)={1,ifSW1,FW2,0,otherwise.\tilde{h}_{S,F}(W_{1},W_{2})=\left\{\begin{array}[]{lr}1,if\ S\subsetneq W_{1},F\subseteq W_{2},\\ 0,otherwise.\end{array}\right.

and

h¯S,F(W1,W2)={1,ifSW1,FW2,0,otherwise.\bar{h}_{S,F}(W_{1},W_{2})=\left\{\begin{array}[]{lr}1,if\ S\subseteq W_{1},F\subsetneq W_{2},\\ 0,otherwise.\end{array}\right.

Then for any SNS\subseteq N, FMF\subseteq M,

iNjMψij(hS,F)+iNjMψij(h^S,F)iNjMψij(h~S,F)iNjMψij(h¯S,F)\displaystyle\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(h_{S,F})+\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\hat{h}_{S,F})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\tilde{h}_{S,F})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\bar{h}_{S,F})
=iSjFpS\i,F\jij+iSjFpS,FijiSjFpS,F\jijiSjFpS\i,Fij.\displaystyle=\sum_{\begin{subarray}{c}i\in S\\ j\in F\end{subarray}}p_{S\backslash i,F\backslash j}^{ij}+\sum_{\begin{subarray}{c}i\notin S\\ j\notin F\end{subarray}}p_{S,F}^{ij}-\sum_{\begin{subarray}{c}i\notin S\\ j\in F\end{subarray}}p_{S,F\backslash j}^{ij}-\sum_{\begin{subarray}{c}i\in S\\ j\notin F\end{subarray}}p_{S\backslash i,F}^{ij}.

When S=NS=N and F=MF=M,

iNjMψij(hN,M)+iNjMψij(h^N,M)iNjMψij(h~N,M)iNjMψij(h¯N,M)\displaystyle\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(h_{N,M})+\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\hat{h}_{N,M})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\tilde{h}_{N,M})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\bar{h}_{N,M})
=hN,M(N,M)+h^N,M(N,M)h~N,M(N,M)h¯N,M(N,M)\displaystyle=h_{N,M}(N,M)+\hat{h}_{N,M}(N,M)-\tilde{h}_{N,M}(N,M)-\bar{h}_{N,M}(N,M)
=1,\displaystyle=1,

Otherwise,

iNjMψij(hS,F)+iNjMψij(h^S,F)iNjMψij(h~S,F)iNjMψij(h¯S,F)\displaystyle\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(h_{S,F})+\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\hat{h}_{S,F})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\tilde{h}_{S,F})-\sum_{\begin{subarray}{c}i\in N\\ j\in M\end{subarray}}\psi_{ij}(\bar{h}_{S,F})
=hS,F(N,M)+h^S,F(N,M)h~S,F(N,M)h¯S,F(N,M)\displaystyle=h_{S,F}(N,M)+\hat{h}_{S,F}(N,M)-\tilde{h}_{S,F}(N,M)-\bar{h}_{S,F}(N,M)
=0.\displaystyle=0.

Hence, Eq. (15) and Eq. (16) can be easily obtained. ∎

Now we prove Theorem 3.3.

Proof of Theorem 3.3.

By Lemma B.2,

ψij(h)\displaystyle\psi_{ij}(h) =s=0n1f=0m1SN\i|S|=sFM\j|F|=fps,f[h(Si,Fj)+h(S,F)\displaystyle=\sum_{s=0}^{n-1}\sum_{f=0}^{m-1}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ |S|=s\end{subarray}}\sum_{\begin{subarray}{c}F\subseteq M\backslash j\\ |F|=f\end{subarray}}p_{s,f}[h(S\cup i,F\cup j)+h(S,F)
h(Si,F)h(S,Fj)].\displaystyle-h(S\cup i,F)-h(S,F\cup j)].

By Lemma B.1 and Lemma B.3, we have the following equations:

s=0n1f=0m1(n1s)(m1f)ps,f=1,\displaystyle\sum_{s=0}^{n-1}\sum_{f=0}^{m-1}\tbinom{n-1}{s}\tbinom{m-1}{f}p_{s,f}=1, (17)
sfps1,f1+(ns)(mf)ps,f=(ns)fps,f1\displaystyle sf\cdot p_{s-1,f-1}+(n-s)(m-f)\cdot p_{s,f}=(n-s)f\cdot p_{s,f-1}
+s(mf)ps1,f, 1sn1,1fm1,\displaystyle+s(m-f)p_{s-1,f},\ 1\leq s\leq n-1,1\leq f\leq m-1,
(mf)p0,f=fp0,f1, 1fm1,\displaystyle(m-f)\cdot p_{0,f}=f\cdot p_{0,f-1},\ 1\leq f\leq m-1,
(ns)ps,0=sps1,0, 1sn1,\displaystyle(n-s)\cdot p_{s,0}=s\cdot p_{s-1,0},\ 1\leq s\leq n-1,
nmpn1,m1=1.\displaystyle nm\cdot p_{n-1,m-1}=1.

In fact, we can omit the first equation, leaving the conditions:

sfps1,f1+(ns)(mf)ps,f=(ns)fps,f1\displaystyle sf\cdot p_{s-1,f-1}+(n-s)(m-f)\cdot p_{s,f}=(n-s)f\cdot p_{s,f-1} (18)
+s(mf)ps1,f, 1sn1,1fm1,\displaystyle+s(m-f)p_{s-1,f},\ 1\leq s\leq n-1,1\leq f\leq m-1,
(mf)p0,f=fp0,f1, 1fm1,\displaystyle(m-f)\cdot p_{0,f}=f\cdot p_{0,f-1},\ 1\leq f\leq m-1,
(ns)ps,0=sps1,0, 1sn1,\displaystyle(n-s)\cdot p_{s,0}=s\cdot p_{s-1,0},\ 1\leq s\leq n-1,
nmpn1,m1=1.\displaystyle nm\cdot p_{n-1,m-1}=1.

Hence, we have nmn\cdot m variables and (m1)(n1)+(m1)+(n1)+1=nm(m-1)(n-1)+(m-1)+(n-1)+1=n\cdot m equations.

Eq. (18) has a solution:

ps,f=s!(ns1)!n!f!(mf1)!m!.p_{s,f}=\frac{s!(n-s-1)!}{n!}\cdot\frac{f!(m-f-1)!}{m!}. (19)
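One can check directly that Eq. (19) satisfies this system, as well as Eq. (17). For 1\leq s\leq n-1 and 1\leq f\leq m-1,

sf\cdot p_{s-1,f-1}+(n-s)(m-f)\cdot p_{s,f}=2\cdot\frac{s!(n-s)!}{n!}\cdot\frac{f!(m-f)!}{m!}=(n-s)f\cdot p_{s,f-1}+s(m-f)\cdot p_{s-1,f};

the boundary equations follow from (m-f)\cdot p_{0,f}=\frac{f!(m-f)!}{n\cdot m!}=f\cdot p_{0,f-1} and the analogous identity in s; the corner condition holds since nm\cdot p_{n-1,m-1}=nm\cdot\frac{(n-1)!}{n!}\cdot\frac{(m-1)!}{m!}=1; and Eq. (17) holds because \sum_{s=0}^{n-1}\sum_{f=0}^{m-1}\tbinom{n-1}{s}\tbinom{m-1}{f}p_{s,f}=\big(\sum_{s=0}^{n-1}\tfrac{1}{n}\big)\big(\sum_{f=0}^{m-1}\tfrac{1}{m}\big)=1.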

Therefore,

ψij(h)\displaystyle\psi_{ij}(h) =s=0n1f=0m1SN\i|S|=sFM\j|F|=fs!(ns1)!n!f!(mf1)!m![h(Si,Fj)+h(S,F)\displaystyle=\sum_{s=0}^{n-1}\sum_{f=0}^{m-1}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ |S|=s\end{subarray}}\sum_{\begin{subarray}{c}F\subseteq M\backslash j\\ |F|=f\end{subarray}}\frac{s!(n-s-1)!}{n!}\cdot\frac{f!(m-f-1)!}{m!}[h(S\cup i,F\cup j)+h(S,F)
h(Si,F)h(S,Fj)]\displaystyle-h(S\cup i,F)-h(S,F\cup j)]
=1nms=1nf=1mSN\i|S|=s1FM\j|F|=f1(s1)!(ns)!(n1)!(f1)!(mf)!(m1)![h(Si,Fj)+h(S,F)\displaystyle=\frac{1}{nm}\sum_{s=1}^{n}\sum_{f=1}^{m}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ |S|=s-1\end{subarray}}\sum_{\begin{subarray}{c}F\subseteq M\backslash j\\ |F|=f-1\end{subarray}}\frac{(s-1)!(n-s)!}{(n-1)!}\cdot\frac{(f-1)!(m-f)!}{(m-1)!}[h(S\cup i,F\cup j)+h(S,F)
h(Si,F)h(S,Fj)]\displaystyle-h(S\cup i,F)-h(S,F\cup j)]
=1nms=1nf=1m1(n1s1)(m1f1)(S,F)Dsfij[h(Si,Fj)+h(S,F)h(Si,F)h(S,Fj)]\displaystyle=\frac{1}{nm}\sum_{s=1}^{n}\sum_{f=1}^{m}\frac{1}{\tbinom{n-1}{s-1}\tbinom{m-1}{f-1}}\sum_{(S,F)\in D_{sf}^{ij}}[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)]
=1nms=1nf=1mΔsf.\displaystyle=\frac{1}{nm}\sum_{s=1}^{n}\sum_{f=1}^{m}\Delta_{sf}.

Now we prove that the solution in Eq. (19) is unique.

We convert Eq. (18) into a matrix equation of the form

𝑨𝐱=𝐛,{\bm{A}}{\mathbf{x}}={\mathbf{b}},

where

𝐱T=(p0,0,p0,1,,p0,m1,p1,0,p1,1,,p1,m1,,pn1,0,,pn1,m1)1×nm,{\mathbf{x}}^{T}=(p_{0,0},p_{0,1},\dots,p_{0,m-1},p_{1,0},p_{1,1},\dots,p_{1,m-1},\dots,p_{n-1,0},\dots,p_{n-1,m-1})_{1\times nm},
𝐛T=(0,0,0,,0,1)1×nm,{\mathbf{b}}^{T}=(0,0,0,\dots,0,1)_{1\times nm},

and

𝑨=(𝑨1𝑨2)nm×nm,{\bm{A}}=\begin{pmatrix}{\bm{A}}_{1}\\ {\bm{A}}_{2}\end{pmatrix}_{nm\times nm}, (20)

where

𝑨1=(𝑨(m1)×m0𝑶(m1)×m𝑶(m1)×m𝑨m×m1𝑩m×m1𝑶m×m𝑶m×m𝑶m×m𝑨m×m2𝑩m×m2𝑶m×m𝑶m×m𝑶m×m𝑨m×mn1𝑩m×mn1)(nm1)×nm,\displaystyle{\bm{A}}_{1}=\begin{pmatrix}{\bm{A}}^{0}_{(m-1)\times m}&{\bm{O}}_{(m-1)\times m}&\cdots&\cdots&{\bm{O}}_{(m-1)\times m}\\ \\ {\bm{A}}^{1}_{m\times m}&{\bm{B}}^{1}_{m\times m}&{\bm{O}}_{m\times m}&\cdots&{\bm{O}}_{m\times m}\\ \\ {\bm{O}}_{m\times m}&{\bm{A}}^{2}_{m\times m}&{\bm{B}}^{2}_{m\times m}&\cdots&{\bm{O}}_{m\times m}\\ \\ \vdots&\vdots&\ddots&\ddots&\vdots\\ \\ {\bm{O}}_{m\times m}&{\bm{O}}_{m\times m}&\cdots&{\bm{A}}^{n-1}_{m\times m}&{\bm{B}}^{n-1}_{m\times m}\end{pmatrix}_{(nm-1)\times nm},
𝑨2=(0,0,,0,nm)1×nm.\displaystyle{\bm{A}}_{2}=\Big{(}0,0,\cdots,0,nm\Big{)}_{1\times nm}.

And

𝑨(m1)×m0\displaystyle{\bm{A}}^{0}_{(m-1)\times m} =(1(m1)0002(m2)00003(m3)0000m11)(m1)×m,\displaystyle=\begin{pmatrix}1&-(m-1)&0&\cdots&\cdots&0\\ \\ 0&2&-(m-2)&0&\cdots&0\\ \\ 0&0&3&-(m-3)&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ \\ 0&0&0&\cdots&m-1&-1\end{pmatrix}_{(m-1)\times m},
𝑨m×mj\displaystyle{\bm{A}}^{j}_{m\times m} =(j000jj(m1)00002jj(m2)00003jj(m3)0000j(m1)j)m×m, 1jn1,\displaystyle=\begin{pmatrix}j&0&0&\cdots&\cdots&0\\ \\ j&-j\cdot(m-1)&0&0&\cdots&0\\ \\ 0&2j&-j\cdot(m-2)&0&\cdots&0\\ \\ 0&0&3j&-j\cdot(m-3)&\cdots&0\\ \\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ \\ 0&0&0&\cdots&j\cdot(m-1)&-j\end{pmatrix}_{m\times m},\ 1\leq j\leq n-1,
𝑩m×mj\displaystyle{\bm{B}}^{j}_{m\times m} =((nj)000(nj)(nj)(m1)00002(nj)(nj)(m2)00003(nj)(nj)(m3)0000(m1)(nj)nj)m×m, 1jn1.\displaystyle=\begin{pmatrix}-(n-j)&0&0&\cdots&\cdots&0\\ \\ -(n-j)&(n-j)\cdot(m-1)&0&0&\cdots&0\\ \\ 0&-2\cdot(n-j)&(n-j)\cdot(m-2)&0&\cdots&0\\ \\ 0&0&-3\cdot(n-j)&(n-j)\cdot(m-3)&\cdots&0\\ \\ \vdots&\vdots&\vdots&\ddots&\ddots&\vdots\\ \\ 0&0&0&\cdots&-(m-1)\cdot(n-j)&n-j\end{pmatrix}_{m\times m},\ 1\leq j\leq n-1.

For example, if n=m=3n=m=3, then

𝑨=(120000000021000000100200000120240000021042000000200100000240120000042021000000009)9×9{\bm{A}}=\begin{pmatrix}\begin{array}[]{ccc:ccc:ccc}1&-2&0&0&0&0&0&0&0\\ 0&2&-1&0&0&0&0&0&0\\ \hline\cr 1&0&0&-2&0&0&0&0&0\\ 1&-2&0&-2&4&0&0&0&0\\ 0&2&-1&0&-4&2&0&0&0\\ \hline\cr 0&0&0&2&0&0&-1&0&0\\ 0&0&0&2&-4&0&-1&2&0\\ 0&0&0&0&4&-2&0&-2&1\\ \hline\cr 0&0&0&0&0&0&0&0&9\end{array}\end{pmatrix}_{9\times 9}

We convert {\bm{A}} to \hat{{\bm{A}}} using elementary row and column transformations,

𝑨^=(120240120021042021100000000120000000021000000000100000000120000000021000000000001)9×9.\hat{{\bm{A}}}=\begin{pmatrix}\begin{array}[]{ccc:ccc:ccc}1&-2&0&2&-4&0&1&-2&0\\ 0&2&-1&0&4&-2&0&2&-1\\ \hline\cr 1&0&0&0&0&0&0&0&0\\ 1&-2&0&0&0&0&0&0&0\\ 0&2&-1&0&0&0&0&0&0\\ \hline\cr 0&0&0&1&0&0&0&0&0\\ 0&0&0&1&-2&0&0&0&0\\ 0&0&0&0&2&-1&0&0&0\\ \hline\cr 0&0&0&0&0&0&0&0&1\end{array}\end{pmatrix}_{9\times 9}.

Since elementary row and column transformations preserve the rank,

Rank(𝑨)=Rank(𝑨^).Rank({\bm{A}})=Rank(\hat{{\bm{A}}}).

Consider the equation

𝑨^𝐱=𝟎,\hat{{\bm{A}}}{\mathbf{x}}=\mathbf{0},

whose only solution is {\mathbf{x}}=\mathbf{0}; hence

Rank(𝑨)=Rank(𝑨^)=9.Rank({\bm{A}})=Rank(\hat{{\bm{A}}})=9.

In general, we can show that Rank({\bm{A}})=nm holds for any n\geq 1 and m\geq 1 (apply elementary column transformations to [{\bm{A}}^{j}_{m\times m},{\bm{B}}^{j}_{m\times m}] within {\bm{A}} in the order j=1,2,\dots,n-1). Hence the solution of Eq. (18) is unique and is given by Eq. (19). Since Eq. (19) also satisfies Eq. (17), it is the unique solution of the full system as well. ∎

Appendix C Proof of Corollary 3.4

Proof.

We use the same technique as in the proof of Lemma B.3.

ψi1d(h)\displaystyle\psi_{i\cdot}^{1d}(h) =jMSN\iFM\jps,f[h(Si,Fj)+h(S,F)h(Si,F)h(S,Fj)]\displaystyle=\sum_{j\in M}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ F\subseteq M\backslash j\end{subarray}}p_{s,f}[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)]
=SN\iFMh(Si,F)[jFps,f1jFps,f]+h(S,F)[jFps,fjFps,f1]\displaystyle=\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ F\subseteq M\end{subarray}}h(S\cup i,F)[\sum_{j\in F}p_{s,f-1}-\sum_{j\notin F}p_{s,f}]+h(S,F)[\sum_{j\notin F}p_{s,f}-\sum_{j\in F}p_{s,f-1}]
=SN\iFM(jFps,f1jFps,f)[h(Si,F)h(S,F)]\displaystyle=\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ F\subseteq M\end{subarray}}(\sum_{j\in F}p_{s,f-1}-\sum_{j\notin F}p_{s,f})[h(S\cup i,F)-h(S,F)]
=SN\i(jMps,m1)[h(Si,M)h(S,M)].\displaystyle=\sum_{S\subseteq N\backslash i}(\sum_{j\in M}p_{s,m-1})[h(S\cup i,M)-h(S,M)].
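Substituting Eq. (19), the coefficient evaluates to

\sum_{j\in M}p_{s,m-1}=m\cdot\frac{s!(n-s-1)!}{n!}\cdot\frac{(m-1)!\,0!}{m!}=\frac{s!(n-s-1)!}{n!},

so that

\psi_{i\cdot}^{1d}(h)=\sum_{S\subseteq N\backslash i}\frac{|S|!\,(n-|S|-1)!}{n!}\big[h(S\cup i,M)-h(S,M)\big].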

This gives the conclusion. A similar argument applies to \psi_{\cdot j}^{1d}. ∎

Appendix D Permutation-based 2D-Shapley Formulation

To compute 2D-Shapley more efficiently, we propose the following corollary.

Corollary D.1.

Eq. (8) has an equivalent form as follows:

ψij2d=1nmSN\iFM\j[h(Si,Fj)+h(S,F)h(Si,F)h(S,Fj)](n1|S|)(m1|F|),\displaystyle\psi_{ij}^{2d}=\frac{1}{nm}\sum_{\begin{subarray}{c}S\subseteq N\backslash i\\ F\subseteq M\backslash j\end{subarray}}\frac{[h(S\cup i,F\cup j)+h(S,F)-h(S\cup i,F)-h(S,F\cup j)]}{\tbinom{n-1}{|S|}\tbinom{m-1}{|F|}}, (21)

or

ψij2d=1n!m!π1Π(N)π2Π(M)[h(Piπ1i,Pjπ2j)+h(Piπ1,Pjπ2)h(Piπ1i,Pjπ2)h(Piπ1,Pjπ2j)],\displaystyle\psi_{ij}^{2d}=\frac{1}{n!m!}\sum_{\begin{subarray}{c}\pi_{1}\in\Pi(N)\\ \pi_{2}\in\Pi(M)\end{subarray}}[h(P_{i}^{\pi_{1}}\cup i,P_{j}^{\pi_{2}}\cup j)+h(P_{i}^{\pi_{1}},P_{j}^{\pi_{2}})-h(P_{i}^{\pi_{1}}\cup i,P_{j}^{\pi_{2}})-h(P_{i}^{\pi_{1}},P_{j}^{\pi_{2}}\cup j)], (22)

where \Pi(A) denotes the set of all permutations of A and P_{k}^{\pi} the set of all elements of A that precede k\in A in the permutation \pi\in\Pi(A).

The formulation in Eq. (21) follows directly from Eq. (8) by summing marginal contributions over all subsets. The second formulation in Eq. (22) instead sums over all sample and feature permutations; grouping the permutations by the set of samples preceding i and the set of features preceding j recovers the weights in Eq. (21), which count all orderings of the samples appearing before and after sample i and of the features appearing before and after feature j. This corollary gives a simple expression for 2D-Shapley, and the permutation form is the basis of our efficient sampling-based implementations. A brute-force evaluation of the subset form in Eq. (21) is sketched below for reference.
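For toy-sized matrices, Eq. (21) can be evaluated exactly by brute force. The following Python sketch assumes a user-supplied utility h(S, F) and the convention that the utility of an empty sample or feature set is 0; its cost is exponential in n and m, so it is meant only as a reference implementation.

```python
from itertools import combinations
from math import comb

def exact_2d_shapley(h, n, m):
    """Brute-force evaluation of Eq. (21); exponential in n and m."""
    values = {}
    for i in range(n):
        for j in range(m):
            rest_n = [a for a in range(n) if a != i]
            rest_m = [b for b in range(m) if b != j]
            total = 0.0
            for s in range(n):
                for S in combinations(rest_n, s):
                    Si = set(S) | {i}
                    for f in range(m):
                        for F in combinations(rest_m, f):
                            Fj = set(F) | {j}
                            delta = (h(Si, Fj) + h(set(S), set(F))
                                     - h(Si, set(F)) - h(set(S), Fj))
                            total += delta / (comb(n - 1, s) * comb(m - 1, f))
            values[(i, j)] = total / (n * m)
    return values

# Toy sanity check with an additive utility h(S, F) = sum of the covered entries.
X = [[1.0, 2.0], [3.0, 4.0]]
h = lambda S, F: sum(X[a][b] for a in S for b in F)
print(exact_2d_shapley(h, 2, 2))  # {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
```

With this additive toy utility, each cell's value recovers the corresponding entry, and the values sum to h(N,M), consistent with the 2d-efficiency axiom.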

Appendix E Algorithm Details

Here, we explain the implementation of algorithms and explore ways to achieve efficient computation.

E.1 Saving Computation in 2D-Shapley-MC

First, we focus on 2D-Shapley-MC. Apart from Monte Carlo sampling over both sample and feature permutations to reduce complexity, we also reduce the number of model trainings to one per counterfactual evaluation, as opposed to the 4 required by Eq. (2). Observe that the marginal contribution involves 4 utility terms, but 3 of them have already been computed and can be reused. Take a pair (i,j) as an example. Its marginal contribution requires 4 utility terms: h(S\cup i,F\cup j), h(S,F\cup j), h(S\cup i,F), and h(S,F). However, h(S,F\cup j) was already computed for the pair (i-1,j), h(S\cup i,F) for the pair (i,j-1), and h(S,F) for (i-1,j-1). Therefore, by caching these evaluations, we reduce the total number of model trainings by 75%. Saving all model evaluations for every block might overflow memory; however, we only need to keep the utilities of the previous and current rows (columns) when looping horizontally downwards (vertically rightwards), which keeps memory usage modest. Additionally, our algorithm can be parallelized: every permutation can be computed independently and combined at the final stage, i.e., the “while loop” in Algorithm 1. A Python sketch following Algorithm 1 below illustrates this caching.

E.2 Limitations of 2D-Shapley-MC and Possible Improvements

One limitation of the Monte Carlo method is its time complexity, which scales with the number of rows and columns in the aggregate data matrix. To improve the efficiency of 2D-Shapley-MC, one can reduce the cost of model retraining. For example, there exist highly efficient methods for model re-training, such as FFCV [1,2], which has been applied in Datamodels [3] and can significantly reduce the computational burden. Another limitation is that 2D-Shapley-MC relies on the performance scores of models trained on different subsets to determine the cell values; these scores are susceptible to noise due to training stochasticity when the learning algorithm is randomized (e.g., SGD) (Wang & Jia, 2022). To overcome these limitations, we propose an efficient, nearest-neighbor-based method, 2D-Shapley-KNN, which involves no model training and only requires sorting data. This also avoids the training-stochasticity problem that 2D-Shapley-MC faces. Another advantage of 2D-Shapley-KNN is that it has an explicit formulation for sample values and only requires permuting over features. This method not only beats 2D-Shapley-MC by an order of magnitude in computational efficiency but is also straightforward to compute and only requires CPU resources.

E.3 Saving Computation in 2D-Shapley-KNN

Apart from removing the dependency on sample permutations and on all model training, the computation of 2D-Shapley-KNN can be reduced further. Similar to 2D-Shapley-MC, we also cache utility terms here, as shown in Algorithm 2. For each pair (i,j), we need to compute SV_{KNN}(i,P_{j}^{\pi}\cup j) and SV_{KNN}(i,P^{\pi}_{j}). However, the second term was already calculated for the feature preceding j in \pi. Thus, we can reduce the total number of SV_{KNN} evaluations by 50% (see the sketch following Algorithm 2 below).

Input: Training Set D, Learning Algorithm \mathcal{A}, Test Set T, Utility Function h.

Output: Sample-Feature 2D-Shapley Values \psi^{2d}.

Ensure: \psi_{ij}^{2d}=0 for all i,j; t=0.

while \psi^{2d} not converged do
       \pi_{N}\leftarrow random sample permutation
       \pi_{M}\leftarrow random feature permutation
       u\leftarrow\mathbf{0} // utility matrix
       for i,j in range(\pi_{N}), range(\pi_{M}) do
              s\leftarrow\pi_{N}(i), f\leftarrow\pi_{M}(j)
              u[s,f]\leftarrow h\left(P^{\pi_{N}}_{s}\cup\{s\},P^{\pi_{M}}_{f}\cup\{f\}\right)
              \psi_{sf}^{new}\leftarrow u[s,f]+u[\pi_{N}(i-1),\pi_{M}(j-1)]-u[\pi_{N}(i),\pi_{M}(j-1)]-u[\pi_{N}(i-1),\pi_{M}(j)]
              \psi_{sf}^{2d}\leftarrow\frac{t}{t+1}\psi_{sf}^{2d}+\frac{1}{t+1}\psi_{sf}^{new}
       end for
       Set t\leftarrow t+1
end while
Algorithm 1: 2D-Shapley-MC Valuation Algorithm.
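As a concrete reference, below is a minimal Python sketch of Algorithm 1 with the utility caching described in Appendix E.1. The function name two_d_shapley_mc and the convention that the utility of an empty sample or feature set is 0 are assumptions of this sketch; h is a user-supplied utility, e.g., the test accuracy of a model retrained on the given sub-matrix.

```python
import numpy as np

def two_d_shapley_mc(h, n, m, num_perms=500, seed=0):
    """Monte Carlo 2D-Shapley (sketch of Algorithm 1).

    h(S, F): utility of the sub-matrix restricted to sample set S and feature
    set F (assumed callable); h is taken to be 0 when S or F is empty.
    Within one permutation pair, u[a, b] caches the utility of the first a
    samples and first b features, so each cell needs only one new evaluation
    of h instead of four.
    """
    rng = np.random.default_rng(seed)
    psi = np.zeros((n, m))
    for t in range(num_perms):
        pi_n = rng.permutation(n)
        pi_m = rng.permutation(m)
        u = np.zeros((n + 1, m + 1))  # u[0, :] = u[:, 0] = 0 (empty-set utility)
        for a in range(1, n + 1):
            S = set(pi_n[:a])
            for b in range(1, m + 1):
                F = set(pi_m[:b])
                u[a, b] = h(S, F)
                delta = u[a, b] + u[a - 1, b - 1] - u[a, b - 1] - u[a - 1, b]
                s, f = pi_n[a - 1], pi_m[b - 1]
                # running average of the marginal contribution over permutation pairs
                psi[s, f] = t / (t + 1) * psi[s, f] + delta / (t + 1)
    return psi
```

For clarity, this sketch stores the full utility matrix for each permutation pair; as noted in Appendix E.1, only the previous and current rows are actually needed.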

Input: Training Set D, Test Set T, Top K.

Output: Sample-Feature 2D-Shapley Values \psi^{2d}.

Ensure: \psi_{ij}^{2d}=0 for all i,j; t=0.

while \psi^{2d} not converged do
       \pi_{M}\leftarrow random feature permutation
       u\leftarrow\mathbf{0} // SV_{KNN} values
       for j in range(\pi_{M}) do
              f\leftarrow\pi_{M}(j)
              u[f]\leftarrow SV_{KNN}(N,P_{f}^{\pi_{M}}\cup\{f\},T)
              \psi_{sf}^{new}\leftarrow u[f]_{s}-u[\pi_{M}(j-1)]_{s} for every sample s
              \psi_{sf}^{2d}\leftarrow\frac{t}{t+1}\psi_{sf}^{2d}+\frac{1}{t+1}\psi_{sf}^{new}
       end for
       Set t\leftarrow t+1
end while
Algorithm 2: 2D-Shapley-KNN Valuation Algorithm.
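Analogously, below is a minimal sketch of Algorithm 2. It assumes a helper sv_knn(F) that returns the length-n vector of KNN-Shapley sample values computed on the feature subset F, and that the sample values on the empty feature set are 0; both the helper name and that convention are assumptions of this sketch.

```python
import numpy as np

def two_d_shapley_knn(sv_knn, n, m, num_perms=500, seed=0):
    """2D-Shapley-KNN (sketch of Algorithm 2).

    sv_knn(F): assumed helper returning a length-n array of KNN-Shapley sample
    values computed with feature subset F; no model retraining is involved.
    Only feature permutations are sampled, and the values for the previous
    feature prefix are reused, halving the number of sv_knn evaluations.
    """
    rng = np.random.default_rng(seed)
    psi = np.zeros((n, m))
    for t in range(num_perms):
        pi_m = rng.permutation(m)
        prev = np.zeros(n)  # sample values on the empty feature set (assumed 0)
        for b in range(m):
            cur = sv_knn(set(pi_m[:b + 1]))  # values with features up to and including pi_m[b]
            f = pi_m[b]
            psi[:, f] = t / (t + 1) * psi[:, f] + (cur - prev) / (t + 1)
            prev = cur
    return psi
```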

E.4 Actual Runtime Complexity

Time complexity is an important aspect when evaluating the efficiency of algorithms. In our case, we measure the runtime of our methods for different numbers of cell valuations on the Census dataset until the values converge. Measuring the runtime of exact 2D-Shapley poses a challenge because the number of permutations grows exponentially with the size of the data matrix, making exact 2D-Shapley intractable to compute. To address this, we estimate the exact 2D-Shapley runtime by measuring the runtime of a single permutation and scaling it by the total number of permutations required. As we observe in Table 1, 2D-Shapley-KNN exhibits exceptional efficiency compared to 2D-Shapley-MC across various cell valuations on the Census dataset. At 1,000 cells, 2D-Shapley-KNN is at least 25 times faster than 2D-Shapley-MC, a substantial advantage. As the number of cells increases to 100,000, 2D-Shapley-KNN is approximately 300 times faster than 2D-Shapley-MC. These findings clearly establish the advantage of 2D-Shapley-KNN over 2D-Shapley-MC in terms of runtime efficiency. Moreover, both 2D-Shapley-KNN and 2D-Shapley-MC vastly outperform the exact 2D-Shapley method in terms of runtime. These results highlight the effectiveness and practicality of our approach for computing 2D-Shapley in real-world cases.

Method | 1K | 5K | 10K | 20K | 50K | 100K
2D-Shapley-Exact (Theoretical) | 1.5E+301 s | 2.0E+1505 s | 2.8E+3010 s | 5.6E+6020 s | 4.4E+15051 s | 1.4E+30103 s
2D-Shapley-MC | 280 s | 1,661 s | 3,127 s | 9,258 s | 17,786 s | 26,209 s
2D-Shapley-KNN | 11 s | 25 s | 37 s | 44 s | 53 s | 88 s
Table 1: Actual runtime comparison between 2D-Shapley methods.

Appendix F Implementation Details & Results

F.1 Details on Datasets and Models

For our experiments, we use the following datasets from the UCI Machine Learning Repository (Dua & Graff, 2017):

Dataset | Training Data | Test Data | Features
Census Income | 32561 | 16281 | 14
Default of Credit Card Clients | 18000 | 12000 | 24
Heart Failure | 512 | 513 | 13
Breast Cancer Wisconsin (Original) | 242 | 241 | 10
Wine Dataset | 106 | 72 | 13
Table 2: Details on datasets used in experiments.

In the Breast Cancer Wisconsin dataset, we removed “ID number” from the list of features as it is irrelevant for model training.

For the methods requiring model training (1D-Shapley, Random, and 2D-Shapley-MC), we used a decision tree classifier throughout.
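As an illustration, one possible form of the utility function h(S, F) used by the retraining-based methods is sketched below; the hyperparameters and the convention that empty sample or feature sets receive utility 0 are assumptions of this sketch, not a description of the released code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def make_utility(X_train, y_train, X_test, y_test, seed=0):
    """Returns h(S, F): test accuracy of a decision tree trained on the
    sub-matrix indexed by sample set S and feature set F. Empty sets are
    assigned utility 0 (an assumption of this sketch)."""
    def h(S, F):
        S, F = sorted(S), sorted(F)
        if not S or not F:
            return 0.0
        clf = DecisionTreeClassifier(random_state=seed)
        clf.fit(X_train[np.ix_(S, F)], y_train[S])
        return accuracy_score(y_test, clf.predict(X_test[:, F]))
    return h
```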

Empirically, we verified that for each of these methods the cell values converge within 500 permutations, which is the number of permutations we use when running them.

Because the datasets differ in size and number of features, we remove a different number of cells at a time: for the larger datasets, Census Income and Credit Default, we remove ten cells at a time, and for the smaller Breast Cancer dataset, we remove one cell at a time.

F.2 Additional Results on Sanity check of cell-wise values experiment

Figure 8: Sanity check of cell-wise values on the Heart Failure and Wine datasets, comparing 2D-Shapley with the 2D LOO and Random baselines.

We provide results on additional datasets, Heart Failure and Wine, to demonstrate the effectiveness of 2D-Shapley in cell-wise valuation. We additionally include the 2D LOO baseline for comparison. As we can observe in Figure 8, the performance of 2D LOO is comparable to or worse than the Random baseline. One of the main reasons is that 2D LOO only valuates a cell’s contribution when all other cells are present. This means that after the sequential removal of some cells, the values obtained from 2D LOO may no longer accurately represent the importance of the cells. In contrast, our method computes a cell’s value by averaging its contribution over various sample and feature subset sizes, which ensures our cell values remain informative even after a certain number of cells have been removed, thereby addressing the shortcomings of 2D LOO and leading to improved performance in cell-wise valuation.

F.3 Additional Details and Results on Fine-Grained Outlier Localization experiment

F.3.1 Outlier Value Generation

Our outlier generation technique is inspired by (Du et al., 2022). Specifically, for a random cell with sample index i and feature index j, we generate an outlier value based on its feature j. We first recreate the distribution of feature j and then sample a value from a low-probability-density region, below 5% in our experiment.
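A possible implementation of this procedure is sketched below; the use of a Gaussian kernel density estimate to recreate the feature distribution and the rejection-sampling loop over a widened value range are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def inject_outlier(X, i, j, density_quantile=0.05, seed=0):
    """Replace cell (i, j) with a value drawn from a low-density region of
    feature j: the candidate's estimated density must fall below the given
    quantile of the densities of the observed values."""
    rng = np.random.default_rng(seed)
    col = X[:, j].astype(float)
    kde = gaussian_kde(col)
    threshold = np.quantile(kde(col), density_quantile)
    lo, hi = col.min(), col.max()
    span = hi - lo
    while True:
        candidate = rng.uniform(lo - span, hi + span)  # search a widened range
        if kde([candidate])[0] < threshold:
            X_out = X.copy()
            X_out[i, j] = candidate
            return X_out
```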

F.3.2 Heatmaps Comparison

To better understand the detection of outlier values, we visualize them through heatmaps. In Figure 9, we provide a 2D-Shapley heatmap of the original dataset before outlier injection and compare it with the 2D-Shapley heatmap in Figure 10 after injecting outliers. For layout reasons, we transpose the heatmaps so that rows represent features and columns represent samples.

On the breast cancer dataset, we observe that the cells with injected outliers have changed their values and lie mostly in the lower range of 2D-Shapley values. However, we also notice that other cells are affected by the outliers and that the overall range of values has widened in both directions.

Figure 9: 2D-Shapley values for benign patients in the original breast cancer dataset. The green border denotes a cell before an outlier value has been injected into that cell.
Figure 10: 2D-Shapley values for benign patients in the breast cancer dataset with randomly inserted outliers. The green border denotes a cell after an outlier value has been injected into that cell.
Figure 11: 1D-Shapley values for benign patients in the breast cancer dataset with randomly inserted outliers. The green border denotes a cell after an outlier value has been injected into that cell.

In addition, we present a 1D-Shapley heatmap of the dataset with injected outliers to provide insight into the 1D-Shapley detection performance reported in Figure 5A). As we can observe from the 1D-Shapley heatmap in Figure 11, the values of the injected outliers are scattered, which explains why the detection rate of 1D-Shapley was suboptimal.

F.3.3 Ablation Study on the Budget of Inserted Outliers

In Figure 5A), we injected outlier values into 2% of the total cells. Here, we explore whether our 2D-Shapley method can still detect outliers under different outlier budgets. Thus, we randomly inject outlier values into 1%, 2%, 5%, 10%, and 15% of the cells of the original breast cancer dataset and plot the detection rate.

Figure 12: 2D-Shapley detection rate of randomly inserted outliers in the breast cancer dataset over various injection rates.

As we observe in Figure 12, the detection rate of outliers is very high within the first 200 inspected cells for every injection rate. Further, we observe that the detection rate decreases slightly as more outliers are added to the dataset. This is reasonable: the more outliers we inject into the dataset, the less uncommon they become.

F.4 Additional Details on Sub-matrix Valuation experiment

For the plots in Figure 6, we randomly split the Credit Default dataset into blocks. One such random split is pictured in Figure 13. We randomly move the horizontal and vertical cut positions and permute rows and columns independently to create different block splits; a sketch of this procedure is given after Figure 13.

Figure 13: An example of a dataset split into blocks.
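For reference, one way to generate such splits is sketched below; the function name and the exact randomization are assumptions of this sketch.

```python
import numpy as np

def random_block_split(n, m, row_blocks, col_blocks, seed=0):
    """Permute rows and columns independently, then cut the permuted index
    ranges at random positions into contiguous blocks; returns the list of
    (row indices, column indices) pairs defining each block."""
    rng = np.random.default_rng(seed)
    row_perm, col_perm = rng.permutation(n), rng.permutation(m)
    row_cuts = np.sort(rng.choice(np.arange(1, n), row_blocks - 1, replace=False))
    col_cuts = np.sort(rng.choice(np.arange(1, m), col_blocks - 1, replace=False))
    return [(r, c)
            for r in np.split(row_perm, row_cuts)
            for c in np.split(col_perm, col_cuts)]
```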

F.5 Hardware

In this work, we used a server with an 8-core Intel Xeon E5-2620 v4 processor @ 2.20GHz as the hardware platform.

F.6 Code

The code repository is available at https://github.com/ruoxi-jia-group/2dshapley.