
$^{1}$Princeton University   $^{2}$Virginia Tech
{tianhaowang,pmittal}@princeton.edu, [email protected]

Efficient Data Shapley for Weighted Nearest Neighbor Algorithms

Jiachen T. Wang$^{\star}$  Prateek Mittal  Ruoxi Jia$^{\star}$
Abstract

This work aims to address an open problem in the data valuation literature concerning the efficient computation of Data Shapley for the weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^{K})$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.

$^{\star}$Correspondence to Jiachen T. Wang and Ruoxi Jia.

1 Introduction

Data is the backbone of machine learning (ML) models, but not all data is created equal. In real-world scenarios, data often carries noise and bias, sourced from diverse origins and labeling processes [Northcutt et al., 2021]. Against this backdrop, data valuation emerges as a growing research field, aiming to quantify the quality of individual data sources for ML training. Data valuation techniques are critical in explainable ML to diagnose influential training instances and in data marketplaces for fair compensation. The importance of data valuation is highlighted by legislative efforts [Warner, 2019] and vision statements from leading tech companies [OpenAI, 2023]. For instance, OpenAI listed "how to fairly distribute the benefits the AI systems generate" as an important question to be explored.

Data Valuation via the Shapley Value. Drawing on cooperative game theory, the technique of using the Shapley value for data valuation was pioneered by [Ghorbani and Zou, 2019, Jia et al., 2019b]. The Shapley value is a renowned solution concept in game theory for fair profit attribution [Shapley, 1953]. In the context of data valuation, individual data points or sources are regarded as “players” in a cooperative game, and Data Shapley refers to the suite of data valuation techniques that use the Shapley value as the contribution measure for each data owner.

KNN-Shapley. Despite offering a principled approach to data valuation with a solid theoretical foundation, the exact calculation of the Shapley value has a time complexity of $O(2^{N})$ in general, where $N$ refers to the number of data points/sources. Fortunately, [Jia et al., 2019a] discovered an efficient $O(N\log N)$ algorithm to compute the exact Data Shapley for unweighted $K$-Nearest Neighbors (KNN), one of the oldest yet still popular ML algorithms. KNN-Shapley refers to the technique of assessing data value for any learning algorithm based on KNN's Data Shapley score. Here, KNN serves as a proxy model for the original, perhaps complicated learning algorithm. KNN-Shapley can be applied to large, high-dimensional datasets by calculating the value scores on features extracted from neural network embeddings. Due to its superior efficiency and effectiveness in discerning data quality, KNN-Shapley has become one of the most popular data valuation techniques [Pandl et al., 2021].

Open question from [Jia et al., 2019a]: efficient computation of weighted KNN-Shapley. While [Jia et al., 2019a] showed unweighted KNN-Shapley can be computed efficiently, they did not develop a practical algorithm for the more general weighted KNN-Shapley (WKNN-Shapley). Despite presenting a polynomial-time algorithm, its computational complexity of $O(N^{K})$ becomes impractical even for modest $K$, such as 5. Closing this efficiency gap is important, especially given the inherent advantages and wider application of weighted KNN. Compared with its unweighted counterpart, weighted KNN considers the distance between data points, assigning varying levels of importance to neighbors based on their proximity to the query. Consequently, the Data Shapley of weighted KNN can potentially better discriminate between low- and high-quality data points. Moreover, weighted KNN is applied more widely in practice. For example, recent research discovered weighted KNN's capability to improve language models' performance [Khandelwal et al., 2019].

Our contributions are summarized as follows:

Making KNN Configurations "Shapley-friendly" (Section 3). Our preliminary investigations suggest that improving the computational efficiency of WKNN-Shapley for soft-label KNN classifiers with continuous weight values (the setting considered in [Jia et al., 2019a]) poses considerable challenges. Consequently, we make necessary modifications to the specific KNN classifiers' configuration and shift our focus to hard-label KNN classifiers with discrete weight values. The justification for these changes and their practical relevance is detailed in Section 3. In particular, discretizing weights does not change the data value scores significantly, even with few bits. We emphasize that making proper tweaks to the problem setup is important for developing an efficient Shapley computation algorithm, a strategy frequently adopted in the literature [Dall'Aglio et al., 2019].

A quadratic-time algorithm for computing exact WKNN-Shapley (Section 4.1). Given the adjusted "Shapley-friendly" configurations of the weighted KNN, we can reframe the Shapley value computation as a counting problem. We develop an algorithm with a quadratic runtime for solving the counting problem and computing the exact WKNN-Shapley, which greatly improves the baseline $O(N^{K})$ algorithm.

A subquadratic-time deterministic approximation algorithm that preserves fairness properties (Section 4.2). To further improve the computational efficiency, we propose a deterministic approximation algorithm by making minor changes to the exact WKNN-Shapley implementation. In particular, our approximation algorithm retains the crucial fairness properties of the original Shapley value.

Empirical Evaluations (Section 5). We experiment on benchmark datasets and assess the efficiency and efficacy of our exact and approximation algorithms for WKNN-Shapley. Here are the key takeaways: (1) Our exact and approximation algorithm for WKNN-Shapley significantly improves computational efficiency compared to the baseline exact algorithm and Monte Carlo approximation, respectively. (2) WKNN-Shapley outperforms the unweighted KNN-Shapley in discerning data quality for critical downstream tasks such as detecting mislabeled/noisy data and data selection. Remarkably, the approximated WKNN-Shapley matches the performance of the exact WKNN-Shapley on many benchmark datasets, attributable to its deterministic nature and the preservation of fairness properties.

Overall, with proper changes to KNN configurations, we show that WKNN-Shapley can be efficiently calculated and approximated. This facilitates its wider adoption, offering a more effective data valuation method compared to unweighted KNN-Shapley.

2 Preliminaries

In this section, we formalize the setup of data valuation for ML, and revisit relevant techniques.

Setup & Goal. Given a labeled dataset $D:=\{z_{i}\}_{i=1}^{N}$ where each data point $z_{i}:=(x_{i},y_{i})$, data valuation aims to assign a score to each training data point $z_{i}$, reflecting its importance for the trained ML model's performance. Formally, we seek a score vector $(\phi_{z_{i}})_{i=1}^{N}$ where each $\phi_{z_{i}}\in\mathbb{R}$ represents the "value" of $z_{i}$.

2.1 Data Shapley

The Shapley value (SV) [Shapley, 1953], originating from game theory, stands out as a distinguished method for equitably distributing total profit among all participating players. Before diving into its definition, we first discuss a fundamental concept: the utility function.

Utility Function. A utility function maps an input dataset to a score indicating the utility of the dataset for model training. Often, this function is chosen as the validation accuracy of a model trained on the given dataset. That is, given a training set $S$, the utility function is $v(S):=\texttt{ValAcc}(\mathcal{A}(S))$, where $\mathcal{A}$ represents a learning algorithm that trains a model on dataset $S$, and $\texttt{ValAcc}(\cdot)$ is a function assessing the model's performance, e.g., its accuracy on a validation set.

Definition 1 (Shapley value [Shapley, 1953]).

Let $v(\cdot)$ denote a utility function and $D$ represent a training set of $N$ data points. The Shapley value $\phi_{z}(v)$ assigned to a data point $z\in D$ is defined as

\phi_{z}(v) := \frac{1}{N}\sum_{k=1}^{N}{N-1\choose k-1}^{-1}\sum_{S\subseteq D_{-z},|S|=k-1}\left[v(S\cup\{z\})-v(S)\right]

where $D_{-z}=D\setminus\{z\}$.

In simple terms, the Shapley value is a weighted average of the marginal contribution $v(S\cup\{z\})-v(S)$, i.e., the utility change when the point $z$ is added to different subsets $S$. For simplicity, we often write $\phi_{z}$ when the utility function is clear from the context. The Shapley value uniquely satisfies several important axioms, including key fairness requirements like the null player and symmetry axioms, lending justification to its popularity. See Appendix A for detailed axiom definitions.
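To make Definition 1 concrete, here is a minimal brute-force sketch, assuming a user-supplied utility function `v`; it enumerates all subsets and is therefore only usable for very small $N$, and the toy utility shown is purely illustrative rather than one used in this paper.

```python
from itertools import combinations
from math import comb

def exact_shapley(D, v):
    """Brute-force Shapley values per Definition 1; O(2^N), so only for tiny N.

    D: list of data points (the "players"); v: utility function on subsets of D.
    """
    N = len(D)
    phi = [0.0] * N
    for i, z in enumerate(D):
        rest = D[:i] + D[i + 1:]
        for k in range(1, N + 1):                      # coalition size after adding z
            weight = 1.0 / (N * comb(N - 1, k - 1))
            for S in combinations(rest, k - 1):        # all subsets of D \ {z} of size k-1
                phi[i] += weight * (v(list(S) + [z]) - v(list(S)))
    return phi

# Toy usage: utility = 1 if the subset contains at least one correctly labeled point.
if __name__ == "__main__":
    D = [("a", 1), ("b", 1), ("c", 0)]                 # (feature, "is the label correct?")
    v = lambda S: float(any(flag == 1 for _, flag in S))
    print(exact_shapley(D, v))                         # the "null player" c receives value 0
```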

2.2 KNN-Shapley

A well-known challenge of using the Shapley value is that its exact calculation is computationally infeasible in general, as it usually requires evaluating $v(S)$ for all possible subsets $S\subseteq D$. A surprising result in [Jia et al., 2019a] showed that when the learning algorithm $\mathcal{A}$ is unweighted KNN, there exists a highly efficient algorithm for computing its exact Data Shapley score.

K Nearest Neighbor Classifier. Given a validation data point $z^{(\mathrm{val})}=(x^{(\mathrm{val})},y^{(\mathrm{val})})$ and a distance metric $d(\cdot,\cdot)$, we sort the training set $D=\{z_{i}=(x_{i},y_{i})\}_{i=1}^{N}$ according to the distance to the validation point $d(x_{i},x^{(\mathrm{val})})$ in non-descending order. Throughout the paper, we assume that $d(x_{i},x^{(\mathrm{val})})\leq d(x_{j},x^{(\mathrm{val})})$ for any $i\leq j$ unless otherwise specified. A KNN classifier makes a prediction for the query $x^{(\mathrm{val})}$ based on (weighted) majority voting among $x^{(\mathrm{val})}$'s $K$ nearest neighbors in the training set. Weight of a data point: in KNN, each data point $z_{i}$ is associated with a weight $w_{i}$. The weight is usually determined based on the distance between $x_{i}$ and the query $x^{(\mathrm{val})}$. For example, a popular weight function is the RBF kernel $w_{i}:=\exp(-d(x_{i},x^{(\mathrm{val})}))$. If $w_{i}$ is the same for all $z_{i}$, the classifier becomes unweighted KNN.
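As a small illustration of this setup, the sketch below sorts a toy training set by Euclidean distance to the query and assigns RBF-kernel weights; the specific distance metric and kernel here are just one common choice, not the only one supported by the framework.

```python
import numpy as np

def rbf_knn_weights(X_train, x_val):
    """Sort training points by distance to the query and compute RBF weights w_i = exp(-d_i)."""
    d = np.linalg.norm(X_train - x_val, axis=1)   # distance of each training point to the query
    order = np.argsort(d, kind="stable")          # non-descending distances: d_1 <= d_2 <= ...
    return order, np.exp(-d[order])               # indices sorted by proximity, and their weights

# Toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
x_val = np.array([0.0, 0.1])
order, w = rbf_knn_weights(X_train, x_val)
print(order, w)   # closest points come first; weights decay with distance
```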

[Jia et al., 2019a] considers the utility function for the weighted, soft-label KNN:

v(S;z^{(\mathrm{val})}) := \frac{\sum_{j=1}^{\min(K,|S|)} w_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}} \mathbbm{1}\left[y_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}=y^{(\mathrm{val})}\right]}{\sum_{j=1}^{\min(K,|S|)} w_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}}    (1)

where $\alpha_{x^{(\mathrm{val})}}^{(S,j)}$ denotes the index (among $D$) of the $j$-th closest data point in $S$ to $x^{(\mathrm{val})}$. "Soft-label" refers to classifiers that output confidence scores. The main result in [Jia et al., 2019a] shows that for unweighted KNN, we can compute the exact Shapley value $\phi_{z_{i}}\left(v(\cdot;z^{(\mathrm{val})})\right)$ for all $z_{i}\in D$ within a total runtime of $O(N\log N)$ (see Appendix B for details).
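The utility in (1) can be implemented directly. The sketch below assumes the training set has already been sorted by distance to $x^{(\mathrm{val})}$ (per the convention above) and represents a subset by its indices into that sorted order; the value returned for the empty set is a placeholder convention, not one specified in this excerpt.

```python
def soft_label_wknn_utility(S_idx, y, w, y_val, K):
    """Utility in Eq. (1): weighted fraction of the (up to) K nearest neighbors of the
    query within S that carry the correct label.

    S_idx: indices of the subset S into the training set, which is assumed to be
           sorted by distance to the query; y, w: labels and weights of the full set.
    """
    nearest = sorted(S_idx)[: min(K, len(S_idx))]   # smallest indices = nearest neighbors in S
    if not nearest:
        return 0.0                                  # placeholder convention for the empty set
    num = sum(w[j] for j in nearest if y[j] == y_val)
    den = sum(w[j] for j in nearest)
    return num / den
```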

Remark 1.

In practice, the model performance is assessed based on a validation set $D^{(\mathrm{val})}$. After computing $\phi_{z_{i}}\left(v(\cdot;z^{(\mathrm{val})})\right)$ for each $z^{(\mathrm{val})}\in D^{(\mathrm{val})}$, one can compute the Shapley value corresponding to the utility function on the full validation set, $v(S;D^{(\mathrm{val})}):=\sum_{z^{(\mathrm{val})}\in D^{(\mathrm{val})}}v(S;z^{(\mathrm{val})})$, by simply taking the sum $\phi_{z_{i}}\left(v(\cdot;D^{(\mathrm{val})})\right)=\sum_{z^{(\mathrm{val})}\in D^{(\mathrm{val})}}\phi_{z_{i}}\left(v(\cdot;z^{(\mathrm{val})})\right)$, due to the linearity property of the Shapley value.

Remark 2.

In alignment with the existing literature [Jia et al., 2019a, Wang et al., 2023], our discussion of the time complexity of KNN-Shapley refers to the total runtime needed to calculate all data value scores $\left(\phi_{z_{1}}(v(\cdot;z^{(\mathrm{val})})),\ldots,\phi_{z_{N}}(v(\cdot;z^{(\mathrm{val})}))\right)$, given that standard applications of data valuation, such as profit allocation and bad data detection, all require computing the data value scores for all data points in the training set. Furthermore, the stated runtime is with respect to $v(\cdot;z^{(\mathrm{val})})$; the overall runtime with respect to $v(\cdot;D^{(\mathrm{val})})$ is multiplied by the size of the validation set $D^{(\mathrm{val})}$. Runtime is typically presented this way because SV computations with respect to different $v(\cdot;z^{(\mathrm{val})})$ are independent and readily benefit from parallel computing.

Since its introduction, KNN-Shapley has rapidly gained popularity in data valuation for its efficiency and effectiveness, and is being advocated as the ‘most practical technique for effectively evaluating large-scale data’ in recent studies [Pandl et al., 2021, Karlaš et al., 2022].

2.3 Baseline Algorithm for Computing and Approximating WKNN-Shapley

[Jia et al., 2019a] developed an efficient $O(N\log N)$ algorithm to calculate the exact unweighted KNN-Shapley. However, when it comes to the more general weighted KNN-Shapley, only an $O(N^{K})$ algorithm is given. While still polynomial time (if $K$ is considered a constant), the runtime is impractically large even for small $K$ (e.g., 5). Here, we review the high-level idea of the baseline algorithms from [Jia et al., 2019a].

An $O(N^{K})$ algorithm for exact WKNN-Shapley computation. From Definition 1, the Shapley value for $z_{i}$ is a weighted average of the marginal contribution (MC) $v(S\cup\{z_{i}\})-v(S)$; hence, we only need to study those $S$ whose utility might change due to the inclusion of $z_{i}$. For KNN, those are the subsets $S$ where $z_{i}$ is within the $K$ nearest neighbors of $x^{(\mathrm{val})}$ after being added into $S$. Note that for KNN, the utility of any $S$ only depends on the $K$ nearest neighbors of $x^{(\mathrm{val})}$ in $S$. Given that there are only $\sum_{j=0}^{K}{N\choose j}$ unique subsets of size $\leq K$, we can simply query the MC value $v(S\cup\{z_{i}\})-v(S)$ for all $S$ of size $\leq K$. For any larger $S$, the MC must be the same as that of its subset of $K$ nearest neighbors. We can then compute the Shapley value as a weighted average of these MC values by counting the number of subsets that share the same MC values through simple combinatorial analysis. Such an algorithm results in a runtime of $\sum_{j=0}^{K}{N\choose j}=O(N^{K})$. The algorithm details can be found in Appendix B.

Monte Carlo Approximation. Given the large runtime of this exact algorithm, [Jia et al., 2019a] further proposes an approximation algorithm based on Monte Carlo techniques. However, Monte Carlo-based approximation is randomized and may not preserve the fairness property of the exact Shapley value. Additionally, the sample complexity of the Monte Carlo estimator is derived from concentration inequalities, which, while suitable for asymptotic analysis, may provide loose bounds in practical applications.

3 Making KNN Configurations Shapley-friendly

We point out the major challenges associated with directly improving the computational efficiency for the soft-label KNN configuration considered in [Jia et al., 2019a], and propose proper changes that enable more efficient algorithms for computing WKNN-Shapley.

Challenge #1: the weight normalization term. The key principle behind the $O(N\log N)$ algorithm for unweighted KNN-Shapley from [Jia et al., 2019a] is that the MC can only take few distinct values. For example, for any $|S|\geq K$, we have $v(S\cup\{z_{i}\})-v(S)=\frac{1}{K}\left(\mathbbm{1}[y_{i}=y^{(\mathrm{val})}]-\mathbbm{1}[y_{\alpha_{x^{(\mathrm{val})}}^{(S,K)}}=y^{(\mathrm{val})}]\right)$. To avoid the task of evaluating $v(S)$ for all $S\subseteq D$, one can just count the subsets $S\subseteq D\setminus\{z_{i}\}$ such that $z_{i}$ is among the $K$ nearest neighbors of $x^{(\mathrm{val})}$ in $S\cup\{z_{i}\}$, as well as the subsets that share the same $K$-th nearest neighbor to $z^{(\mathrm{val})}$. However, for weighted soft-label KNN with the utility function in (1), there is little chance that any two of $v(S_{1}\cup\{z_{i}\})-v(S_{1})$ and $v(S_{2}\cup\{z_{i}\})-v(S_{2})$ have the same value, due to the weight normalization term $\left(\sum_{j=1}^{K}w_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}\right)^{-1}$ (note that this term is $1/K$, a constant, in the unweighted setting). Solution #1: hard-label KNN. In this work, we instead consider the utility function for weighted hard-label KNN. "Hard-label" refers to classifiers that output the predicted class instead of confidence scores (see (2) in Section 4). In practice, user-facing applications usually only output a class prediction instead of the entire confidence vector. More importantly, hard-label KNN's prediction only depends on the weight comparison between different classes, and hence its utility function does not have a normalization term.

Challenge #2: continuous weights. If the weights lie in a continuous space, there are infinitely many possible weighted voting scores of the $K$ nearest neighbors. This makes it difficult to analyze which pairs $S_{1},S_{2}$ share the same MC value. Solution #2: discretize weights. Therefore, we consider a more tractable setting where the weights lie in a discrete space. Such a change is reasonable since the weights are stored in finite bits (and hence in a discrete space) in practice. Moreover, rounding is a deterministic operation and does not reverse the original order of weights. In Appendix C, we show that the Shapley value computed based on the discrete weights has the same ranking order as the Shapley value computed on the continuous weights (discretization might create ties but will not reverse the original order). In Appendix LABEL:sec:eval-discretization, we empirically verify that weight discretization does not cause a large deviation in the Shapley value.

We emphasize that proper adjustments to the underlying utility function are important for efficient Shapley computation, and such a strategy is frequently applied in the game theory literature [Dall'Aglio et al., 2019].

4 Data Shapley for Weighted KNN

In this section, we develop efficient solutions for computing and approximating Data Shapley scores for weighted, hard-label KNN binary classifiers, where the weight values used in KNN are discretized. Without loss of generality, in this paper we assume every weight $w_{i}\in[0,1]$ (if this does not hold, one can simply normalize the weights to $[0,1]$ and the KNN classifier remains the same). We use $\mathbf{W}$ to denote the discretized space of $[0,1]$, where we create $2^{b}$ equally spaced points within the interval when we use $b$ bits for discretization. We denote by $W:=|\mathbf{W}|=2^{b}$ the size of the weight space.
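A minimal sketch of the discretization step, assuming each weight is simply rounded to the nearest of the $2^{b}$ equally spaced grid points in $\mathbf{W}$; nearest-point rounding is one natural realization of the scheme, and any monotone rounding preserves the weight ordering up to ties, as discussed in Section 3.

```python
import numpy as np

def discretize_weights(w, b=3):
    """Round weights in [0,1] to the nearest point of a grid of 2^b equally spaced values."""
    grid = np.linspace(0.0, 1.0, 2 ** b)                          # the discrete space W, |W| = 2^b
    idx = np.argmin(np.abs(w[:, None] - grid[None, :]), axis=1)   # nearest grid point per weight
    return grid[idx]

w = np.array([0.03, 0.51, 0.97])
print(discretize_weights(w, b=3))   # each weight snapped to the 8-point grid
```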

Utility Function for Weighted Hard-Label KNN Classifiers. The utility function of weighted hard-label KNN, i.e., the correctness of weighted KNN on the queried example $z^{(\mathrm{val})}$, can be written as

v(S;z^{(\mathrm{val})}) = \mathbbm{1}\left[y^{(\mathrm{val})}\in\operatorname*{\arg\!\max}_{c\in\mathbf{C}}\sum_{j=1}^{\min(K,|S|)} w_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}\times\mathbbm{1}[y_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}=c]\right]    (2)

where $\mathbf{C}=\{1,\ldots,C\}$ is the space of classes and $C$ is the number of classes (if multiple classes have the same top counts, we take the utility to be 1 as long as $y^{(\mathrm{val})}$ is among the majority classes). We omit the input $z^{(\mathrm{val})}$ and simply write $v(S)$ when the validation point is clear from the context. For the KNN binary classifier, we can rewrite the utility function in a more compact form:

v(S) = \mathbbm{1}\left[\sum_{j=1}^{\min(K,|S|)}\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}\geq 0\right] \quad \text{where} \quad \widetilde{w}_{j} := \begin{cases} w_{j} & y_{j}=y^{(\mathrm{val})} \\ -w_{j} & y_{j}\neq y^{(\mathrm{val})} \end{cases}    (3)

For ease of presentation, we present the algorithms for the KNN binary classifier here, and defer the extension to multi-class classifiers to Appendix LABEL:appendix:multiclass.
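For the binary case, the compact form (3) can be implemented directly with the signed weights $\widetilde{w}$; as before, the training set is assumed to be pre-sorted by distance to the query, and a subset is given by indices into that order.

```python
def hard_label_wknn_utility(S_idx, y, w, y_val, K):
    """Utility in Eq. (3) for a binary hard-label weighted KNN: 1 iff the signed weights of
    the (up to) K nearest neighbors of the query within S sum to a non-negative value.
    """
    nearest = sorted(S_idx)[: min(K, len(S_idx))]   # indices are sorted by distance to the query
    signed = sum(w[j] if y[j] == y_val else -w[j] for j in nearest)   # \tilde{w}_j
    return 1.0 if signed >= 0 else 0.0              # empty sum is 0, so v(empty set) = 1 per (3)
```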

4.1 Exact WKNN-Shapley Calculation

4.1.1 Computing SV is a Counting Problem

Given that the Shapley value is a weighted average of the marginal contribution $v(S\cup\{z_{i}\})-v(S)$, we first study the expression of $v(S\cup\{z_{i}\})-v(S)$ for a fixed subset $S\subseteq D\setminus\{z_{i}\}$ with the utility function in (3).

Theorem 2.

For any data point $z_{i}\in D$ and any subset $S\subseteq D\setminus\{z_{i}\}$, the marginal contribution has the following expression:

v(S\cup\{z_{i}\})-v(S) = \begin{cases} 1 & \text{if } y_{i}=y^{(\mathrm{val})}, \texttt{Cond}_{K\text{NN}}, \texttt{Cond}_{\text{0to1}} \\ -1 & \text{if } y_{i}\neq y^{(\mathrm{val})}, \texttt{Cond}_{K\text{NN}}, \texttt{Cond}_{\text{1to0}} \\ 0 & \text{otherwise} \end{cases}    (4)

where

\texttt{Cond}_{K\text{NN}} := z_{i}~\text{is within the}~K~\text{nearest neighbors of}~x^{(\mathrm{val})}~\text{among}~S\cup\{z_{i}\}

\texttt{Cond}_{\text{0to1}} := \begin{cases} \sum_{z_{j}\in S}\widetilde{w}_{j}\in[-\widetilde{w}_{i},\,0) & \text{if}~|S|\leq K-1 \\ \sum_{j=1}^{K-1}\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}\in\left[-w_{i},\,-\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,K)}}\right) & \text{if}~|S|\geq K \end{cases}

\texttt{Cond}_{\text{1to0}} := \begin{cases} \sum_{z_{j}\in S}\widetilde{w}_{j}\in[0,\,-\widetilde{w}_{i}) & \text{if}~|S|\leq K-1 \\ \sum_{j=1}^{K-1}\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}\in\left[-\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,K)}},\,-w_{i}\right) & \text{if}~|S|\geq K \end{cases}

In words, the condition $\texttt{Cond}_{K\text{NN}}$ means that $z_{i}$ should be among the $K$ nearest neighbors of the query sample when it is added to $S$. The conditions $\texttt{Cond}_{\text{0to1}}$ and $\texttt{Cond}_{\text{1to0}}$ cover situations where adding $z_{i}$ to the set $S$ changes the prediction of the weighted KNN classifier. In greater detail, $\texttt{Cond}_{\text{0to1}}$ captures the condition under which incorporating $z_{i}$ shifts the "effective sum of signed weights $\widetilde{w}$" from a negative to a non-negative value, thereby incrementing the utility from 0 to 1. $\texttt{Cond}_{\text{1to0}}$ can be interpreted similarly. From Theorem 2 and the formula of the Shapley value (Definition 1), we can reframe the problem of computing hard-label WKNN-Shapley as a counting problem. Specifically, this involves counting the quantity defined as follows:

Definition 3.

Let $\texttt{G}_{i,\ell}$ denote the count of subsets $S\subseteq D\setminus\{z_{i}\}$ of size $\ell$ that satisfy (1) $\texttt{Cond}_{K\text{NN}}$, and (2) $\texttt{Cond}_{\text{0to1}}$ if $y_{i}=y^{(\mathrm{val})}$, or $\texttt{Cond}_{\text{1to0}}$ if $y_{i}\neq y^{(\mathrm{val})}$.

Theorem 4.

For a weighted, hard-label KNN binary classifier using the utility function given by (3), the Shapley value of a data point $z_{i}$ can be expressed as:

\phi_{z_{i}} = \frac{2\mathbbm{1}[y_{i}=y^{(\mathrm{val})}]-1}{N}\sum_{\ell=0}^{N-1}{N-1\choose\ell}^{-1}\texttt{G}_{i,\ell}    (5)

Figure 1 illustrates the counting problem we try to solve here.

Figure 1: Illustration of the subsets targeted in the counting problem. When $K=3$, both $S_{1}$ and $S_{2}$ have a utility of 0, as both of them contain 2 dogs and 1 cat. Adding $z_{i}$ to $S_{1}$ and $S_{2}$ alters the $3$ nearest neighbors to the query image $x^{(\mathrm{val})}$, which now contain 1 dog and 2 cats, raising the utility to 1. In contrast, $S_{3}$'s utility remains unchanged with the addition of $z_{i}$ since it solely contains cat images. To compute the WKNN-Shapley of $z_{i}$, we count the subsets $S$ where adding $z_{i}$ changes the utility, as seen with $S_{1}$ and $S_{2}$.
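As a sanity check on this counting reformulation, the following naive sketch enumerates all subsets to obtain $\texttt{G}_{i,\ell}$ (via the marginal-contribution characterization in Theorem 2 and Definition 3) and then evaluates (5); it runs in exponential time and is only meant to validate the quadratic-time algorithm of Section 4.1.2 on tiny inputs.

```python
from itertools import combinations
from math import comb

def wknn_shapley_naive(y, w, y_val, K):
    """Exponential-time WKNN-Shapley via the counting formulation (Definition 3 / Theorem 4).

    y, w: labels and (discretized) weights of the training set, sorted by distance to the query.
    Returns the Shapley value of every training point under the utility in Eq. (3).
    """
    N = len(y)

    def v(S_idx):                                       # utility in Eq. (3)
        nearest = sorted(S_idx)[: min(K, len(S_idx))]
        signed = sum(w[j] if y[j] == y_val else -w[j] for j in nearest)
        return 1.0 if signed >= 0 else 0.0

    phi = []
    for i in range(N):
        rest = [j for j in range(N) if j != i]
        sign = 1.0 if y[i] == y_val else -1.0           # equals 2*1[y_i = y_val] - 1
        total = 0.0
        for ell in range(N):                            # subset size ell = 0, ..., N-1
            # G_{i,ell}: number of size-ell subsets whose utility changes when z_i is added
            G = sum(1 for S in combinations(rest, ell)
                    if v(list(S) + [i]) - v(list(S)) == sign)
            total += G / comb(N - 1, ell)
        phi.append(sign * total / N)                    # Eq. (5)
    return phi

# Toy usage (points already sorted by distance to the query)
print(wknn_shapley_naive(y=[1, 0, 1], w=[1.0, 0.75, 0.5], y_val=1, K=2))
```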

4.1.2 Dynamic Programming Solution for Computing $\texttt{G}_{i,\ell}$

The multiple, intricate conditions wrapped within $\texttt{G}_{i,\ell}$'s definition can pose a formidable challenge for direct and efficient counting. The main rationale behind our solution is to break down the complex counting problems into smaller, more manageable subproblems, thereby making them amenable to algorithmic solutions like dynamic programming. Before delving into the specifics of the algorithm, we introduce an intermediary quantity, $\texttt{F}_{i}$, that becomes the building block of our dynamic programming formulation.

Definition 5.

Let $\texttt{F}_{i}[m,\ell,s]$ denote the count of subsets $S\subseteq D\setminus\{z_{i}\}$ of size $\ell$ that satisfy (1) $\texttt{Cond}_{K\text{NN}}$, as well as the following conditions: (2) within $S$, the data point $x_{m}$ is the $\min(\ell,K)$-th closest to the query example $x^{(\mathrm{val})}$, and (3) $\sum_{j=1}^{\min(\ell,K-1)}\widetilde{w}_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}=s$.

We can relate this auxiliary quantity to our desired $\texttt{G}_{i,\ell}$ as follows:

Theorem 6 (Relation between $\texttt{G}_{i,\ell}$ and $\texttt{F}_{i}$).

For $y_{i}=y^{(\mathrm{val})}$, we can compute $\texttt{G}_{i,\ell}$ from $\texttt{F}_{i}$ as follows:

\texttt{G}_{i,\ell} = \begin{cases} \sum_{m\in[N]\setminus i}\sum_{s\in[-\widetilde{w}_{i},\,0)}\texttt{F}_{i}[m,\ell,s] & \text{for } \ell\leq K-1, \\ \sum_{m\in[N]\setminus i}\sum_{s\in[-\widetilde{w}_{i},\,-\widetilde{w}_{m})}\texttt{F}_{i}[m,\ell,s] & \text{for } \ell\geq K. \end{cases}    (6)

For $y_{i}\neq y^{(\mathrm{val})}$, we have:

\texttt{G}_{i,\ell} = \begin{cases} \sum_{m\in[N]\setminus i}\sum_{s\in[0,\,-\widetilde{w}_{i})}\texttt{F}_{i}[m,\ell,s] & \text{for } \ell\leq K-1, \\ \sum_{m\in[N]\setminus i}\sum_{s\in[-\widetilde{w}_{m},\,-\widetilde{w}_{i})}\texttt{F}_{i}[m,\ell,s] & \text{for } \ell\geq K. \end{cases}    (7)

The auxiliary quantity $\texttt{F}_{i}$ thus serves as a pivot, allowing us to explore the search space of possible subsets $S$ more systematically. We next exploit the computational advantage of $\texttt{F}_{i}$. Specifically, $\texttt{F}_{i}$ can be conveniently computed with (recursive) formulas, which further enables us to compute $\texttt{G}_{i,\ell}$ with reduced computational demand.

Theorem 7 (simplified version).

For $\ell\leq K-1$, $\texttt{F}_{i}[m,\ell,s]$ can be computed from $\texttt{F}_{i}[t,\ell-1,\cdot]$ with $t\leq m-1$. For $\ell\geq K$, $\texttt{F}_{i}[m,\ell,s]$ can be computed from $\texttt{F}_{i}[t,K-1,\cdot]$ with $t\leq m-1$.

Leveraging the results from Theorem 7, a direct method for calculating $\texttt{G}_{i,\ell}$ for all $\ell\geq 1$ is as follows: we first use a recursive formula to compute $\texttt{F}_{i}[\cdot,\ell,\cdot]$ for $\ell\leq K-1$, and then use an explicit formula to compute $\texttt{F}_{i}[\cdot,\ell,\cdot]$ for $\ell\geq K$. With $\texttt{F}_{i}$ computed, we apply Theorem 6 to compute $\texttt{G}_{i,\ell}$. This direct approach renders an $O(N^{3})$ runtime, as it necessitates computing $\texttt{F}_{i}[m,\ell,\cdot]$ for each of $i$, $m$, and $\ell$ in the range $1,\ldots,N$.

Further Improvement of Efficiency Through a Short-cut Formula. While the above direct approach offers a clear path, there exists a more optimized algorithm that expedites computation by circumventing the explicit calculation of $\texttt{F}_{i}[\cdot,\ell,\cdot]$ for $\ell\geq K$. Specifically, we discover a short-cut formula that allows us to directly calculate the summation $\sum_{\ell=K}^{N-1}\frac{\texttt{G}_{i,\ell}}{{N-1\choose\ell}}$ from Theorem 4 once we obtain $\texttt{F}_{i}[\cdot,K-1,\cdot]$.

Theorem 8.

For a weighted, hard-label KNN binary classifier using the utility function given by (3), the Shapley value of data point $z_{i}$ can be expressed as:

\phi_{z_{i}} = \texttt{sign}(\widetilde{w}_{i})\left[\frac{1}{N}\sum_{\ell=0}^{K-1}\frac{\texttt{G}_{i,\ell}}{{N-1\choose\ell}} + \sum_{m=\max(i+1,K+1)}^{N}\frac{\texttt{R}_{i,m}}{m{m-1\choose K}}\right]

where

\texttt{R}_{i,m} := \begin{cases} \sum_{t=1}^{m-1}\sum_{s\in[-\widetilde{w}_{i},\,-\widetilde{w}_{m})}\texttt{F}_{i}[t,K-1,s] & \text{for } y_{i}=y^{(\mathrm{val})} \\ \sum_{t=1}^{m-1}\sum_{s\in[-\widetilde{w}_{m},\,-\widetilde{w}_{i})}\texttt{F}_{i}[t,K-1,s] & \text{for } y_{i}\neq y^{(\mathrm{val})} \end{cases}

Crucially, the quantity $\texttt{R}_{i,m}$ in the above expression can be efficiently calculated using a clever caching technique. Based on the above findings, we can eliminate a factor of $N$ in the final time complexity, thereby obtaining a quadratic-time algorithm to compute the exact WKNN-Shapley. The comprehensive pseudocode for the full algorithm can be found in Appendix LABEL:appendix:pseudocode.

Theorem 9.

Algorithm LABEL:alg:efficient-wknn (in Appendix LABEL:appendix:pseudocode) computes the exact WKNN-Shapley $\phi_{z_{i}}$ for all $i=1,\ldots,N$ and achieves a total runtime of $O(WK^{2}N^{2})$.

Remark 3 (Runtime Dependency on $K$ and $W$).

While the time complexity in Theorem 9 also depends on $K$ and $W$, these variables can be effectively treated as constants in our context. In our ablation study, we found that the error caused by weight discretization decreases quickly as the number of bits $b$ for discretization grows. Hence, across all experiments in Section 5, we set the number of bits for discretization to $b=3$ and therefore $W=2^{b}=8$. Additionally, the selection of $K$ in the KNN-Shapley literature commonly stabilizes around values of $5$ or $10$, irrespective of the dataset size [Jia et al., 2019a, Wang et al., 2023]. This stability arises because, in the context of KNN classifiers, increasing $K$ can easily result in underfitting even when $N$ is large. Throughout our experiments, we fix $K$ at $5$. A detailed ablation study examining different choices of $K$ and $W$ is available in Appendix LABEL:appendix:eval.

Remark 4 (Novelty in the derivation of WKNN-Shapley).

Our derivation of WKNN-Shapley starts similarly to [Jia et al., 2019a] by examining the marginal contribution, but the methodologies significantly differ afterward. Unweighted KNN-Shapley benefits from the simplicity of its utility function, allowing for a relatively straightforward derivation. Specifically, [Jia et al., 2019a] plugs the unweighted KNN utility function into the formula for the difference of the Shapley values between two data points, and then simplifies the expression using known equalities in combinatorial analysis. In contrast, the same approach does not apply to WKNN-Shapley due to the complexity introduced by weights. Therefore, we develop a novel dynamic programming solution, which marks a substantial departure from the techniques used in [Jia et al., 2019a]. We would like to clarify that there is no correlation between the "recursion" in [Jia et al., 2019a] and ours. In [Jia et al., 2019a], each $\phi_{z_{j}}$ is recursively computed from $\phi_{z_{j+1}}$. On the contrary, in our method the computation of different data points' WKNN-Shapley scores $\phi_{z_{j}}$ is independent of each other; recursion is instead a fundamental component of the dynamic programming used to solve the complex counting problem.

4.2 Deterministic Approximation for Weighted KNN-Shapley

While the algorithm introduced in Section 4.1 for calculating the exact WKNN-Shapley achieves $O(N^{2})$ runtime, a huge improvement over the original $O(N^{K})$ algorithm from [Jia et al., 2019a], there remains room for further improving efficiency if we only require an approximation of the Shapley value. Contrary to the prevalent use of Monte Carlo techniques in the existing literature, in this section we develop a deterministic approximation algorithm for WKNN-Shapley.

Intuition. From Theorem 7, we know that in order to compute $\texttt{F}_{i}[m,\ell,\cdot]$ with $\ell\leq K-1$, we only need to know $\texttt{F}_{i}[t,\ell-1,\cdot]$ with $t\leq m-1$. Moreover, observe that the building blocks for $\texttt{G}_{i,\ell}$ (or $\texttt{R}_{i,m}$), namely $\sum_{s\in[-\widetilde{w}_{i},0)}\texttt{F}_{i}[t,\ell,s]$ (or $\sum_{s\in[-\widetilde{w}_{i},-\widetilde{w}_{m})}\texttt{F}_{i}[t,K-1,s]$), can be quite small, as they only take a summation over a small range of the weight space. Hence, we can use $\widehat{\texttt{F}}_{i}[m,\cdot,\cdot]=0$ as an approximation for $\texttt{F}_{i}[m,\cdot,\cdot]$ for all $m\geq M^{\star}+1$ with some prespecified threshold $M^{\star}$. Similarly, we can use $\widehat{\texttt{R}}_{i,m}=0$ as an approximation for $\texttt{R}_{i,m}$ for all $m\geq M^{\star}+1$. The resulting approximation for the Shapley value $\phi_{z_{i}}$ is stated as follows:

Definition 10.

We define the approximation $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ as

\widehat{\phi}_{z_{i}}^{(M^{\star})} := \texttt{sign}(\widetilde{w}_{i})\left[\frac{1}{N}\sum_{\ell=0}^{K-1}\frac{\widetilde{\texttt{G}}_{i,\ell}^{(M^{\star})}}{{N-1\choose\ell}} + \sum_{m=\max(i+1,K+1)}^{M^{\star}}\frac{\texttt{R}_{i,m}}{m{m-1\choose K}}\right]

where

\widetilde{\texttt{G}}_{i,\ell}^{(M^{\star})} := \begin{cases} \sum_{m=1}^{M^{\star}}\sum_{s\in[-\widetilde{w}_{i},\,0)}\texttt{F}_{i}[m,\ell,s] & \text{for } y_{i}=y^{(\mathrm{val})} \\ \sum_{m=1}^{M^{\star}}\sum_{s\in[0,\,-\widetilde{w}_{i})}\texttt{F}_{i}[m,\ell,s] & \text{for } y_{i}\neq y^{(\mathrm{val})} \end{cases}

To calculate $\widehat{\phi}_{z_{i}}^{(M^{\star})}$, we only need to compute $\texttt{F}_{i}[m,\cdot,\cdot]$ and $\texttt{R}_{i,m}$ for $m$ from $1$ to $M^{\star}$ instead of $N$, thereby reducing the runtime of Algorithm LABEL:alg:efficient-wknn to $O(NM^{\star})$ with minimal modification to the exact algorithm's implementation.

Theorem 11.

Algorithm LABEL:alg:efficient-wknn-approx (in Appendix LABEL:appendix:pseudocode-approx) computes the approximated WKNN-Shapley $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ for all $z_{i}\in D$ and achieves a total runtime of $O(WK^{2}NM^{\star})$.

In particular, when $M^{\star}=\sqrt{N}$, we achieve a runtime of $O(N^{1.5})$. The selection of $M^{\star}$ is discussed in Remark 5. In the following, we derive the error bound and point out two nice properties of this approximation.

Theorem 12.

For any $z_{i}\in D$, the approximated Shapley value $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ (1) shares the same sign as $\phi_{z_{i}}$, (2) ensures $\left|\widehat{\phi}_{z_{i}}^{(M^{\star})}\right|\leq\left|\phi_{z_{i}}\right|$, and (3) has its approximation error bounded by $\left|\widehat{\phi}_{z_{i}}^{(M^{\star})}-\phi_{z_{i}}\right|\leq\varepsilon(M^{\star})$, where

\varepsilon(M^{\star}) := \sum_{m=M^{\star}+1}^{N}\left(\frac{1}{m-K}-\frac{1}{m}\right) + \sum_{\ell=1}^{K-1}\frac{{N\choose\ell}-{M^{\star}\choose\ell}}{N{N-1\choose\ell}} = O\left(K/M^{\star}\right)

Leveraging the error bound $\varepsilon(M^{\star})$ alongside the additional nice properties of $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ stated in Theorem 12, we can obtain a deterministic interval within which $\phi_{z_{i}}$ always resides. Specifically, when $y_{i}=y^{(\mathrm{val})}$, we have $\phi_{z_{i}}\in\left[\widehat{\phi}_{z_{i}}^{(M^{\star})},\widehat{\phi}_{z_{i}}^{(M^{\star})}+\varepsilon(M^{\star})\right]$, and when $y_{i}\neq y^{(\mathrm{val})}$, we have $\phi_{z_{i}}\in\left[\widehat{\phi}_{z_{i}}^{(M^{\star})}-\varepsilon(M^{\star}),\widehat{\phi}_{z_{i}}^{(M^{\star})}\right]$. Unlike the commonly used Monte Carlo method for approximating the Shapley value, which only offers a high-probability interval and allows for a failure possibility that the exact value might fall outside of it, our deterministic approximation ensures that the exact value always lies within the corresponding interval.
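Since the bound $\varepsilon(M^{\star})$ is fully explicit, it is cheap to evaluate; a small sketch for computing it and reading off the resulting deterministic interval is given below.

```python
from math import comb

def approx_error_bound(N, K, M_star):
    """Evaluate eps(M*) from Theorem 12 (deterministic approximation error bound)."""
    tail = sum(1.0 / (m - K) - 1.0 / m for m in range(M_star + 1, N + 1))     # first summation over m
    head = sum((comb(N, ell) - comb(M_star, ell)) / (N * comb(N - 1, ell))
               for ell in range(1, K))                                        # second summation, ell = 1..K-1
    return tail + head

# Example: the bound decays roughly like K / M*.
N, K = 1000, 5
eps = approx_error_bound(N, K, M_star=int(N ** 0.5))
# Deterministic interval: [phi_hat, phi_hat + eps] if y_i = y_val, else [phi_hat - eps, phi_hat].
print(eps)
```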

The approximated WKNN-Shapley preserves the fairness axioms. The Shapley value’s axiomatic properties, particularly the symmetry and null player axioms, are of great importance for ensuring fairness when attributing value to individual players (see Appendix A for the formal definitions of the two axioms). These fundamental axioms have fostered widespread adoption of the Shapley value. An ideal approximation of the Shapley value, therefore, should preserve at least the symmetry and null player axioms to ensure that the principal motivations for employing the Shapley value—fairness and equity—are not diminished. The prevalent Monte Carlo-based approximation techniques give randomized results and necessarily muddy the clarity of fairness axioms. In contrast, our deterministic approximation preserves both important axioms.

Theorem 13.

The approximated Shapley values $\{\widehat{\phi}_{z_{i}}^{(M^{\star})}\}_{z_{i}\in D}$ satisfy the symmetry and null player axioms.

Remark 5 (Selection of $M^{\star}$).

Ideally, we would like to pick the smallest $M^{\star}$ such that $\varepsilon(M^{\star})$ is significantly smaller than $|\phi_{z_{i}}|$ for a significant portion of the $z_{i}$s. However, determining a universally applicable heuristic for setting $M^{\star}$ is challenging due to the varying magnitude of $\phi_{z_{i}}$ across different datasets, which is difficult to anticipate. For example, in a case where all but one data point are "null players", that single data point will possess a value of $v(D)$, while all others will be valued at 0. On the other hand, if all data points are identical, each will receive a value of $v(D)/N$. Therefore, we suggest selecting $M^{\star}$ in an adaptive way. Specifically, for each $M^{\star}\in\{K+1,\ldots,N\}$, we calculate $(\widehat{\phi}_{z_{i}}^{(M^{\star})})_{i\in[N]}$, halting the computation when the magnitude of $\varepsilon(M^{\star})$ is substantially smaller than (e.g., $<10\%$ of) the magnitude of $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ for a majority of the $z_{i}$s. This approach does not increase the overall runtime since $\widehat{\phi}_{z_{i}}^{(M^{\star}+1)}$ can be easily computed from $\widehat{\phi}_{z_{i}}^{(M^{\star})}$ and the additionally computed $\texttt{F}_{i}[M^{\star}+1,\cdot,\cdot]$. Details of this approach are in Appendix LABEL:appendix:how-to-select-mstar. Furthermore, we highlight that our deterministic approximation algorithm maintains the fairness properties of exact WKNN-Shapley. Hence, in practice, we can potentially use a smaller $M^{\star}$ and still get satisfactory performance in discerning data quality. Throughout our experiments in Section 5, we find that setting $M^{\star}=\sqrt{N}$ consistently works well across all benchmark datasets.
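Below is a minimal sketch of the adaptive stopping rule just described. The routine `approx_wknn_shapley(D, M_star)` is a hypothetical interface to the approximation algorithm (an efficient implementation would update the values incrementally when moving from $M^{\star}$ to $M^{\star}+1$ rather than recomputing them), and `error_bound` evaluates $\varepsilon(M^{\star})$ from Theorem 12.

```python
import numpy as np

def select_m_star(D, N, K, approx_wknn_shapley, error_bound, tol=0.1, frac=0.5):
    """Grow M* until eps(M*) < tol * |phi_hat_i| holds for at least a `frac` fraction of points.

    approx_wknn_shapley(D, M_star) -> array of approximated values (hypothetical interface);
    error_bound(N, K, M_star)      -> eps(M*) as in Theorem 12.
    """
    for M_star in range(K + 1, N + 1):
        phi_hat = np.asarray(approx_wknn_shapley(D, M_star))
        if np.mean(error_bound(N, K, M_star) < tol * np.abs(phi_hat)) >= frac:
            return M_star, phi_hat
    return N, np.asarray(approx_wknn_shapley(D, N))   # fall back to the exact computation (M* = N)
```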

5 Numerical Experiments

We systematically evaluate the performance of our WKNN-Shapley computation and approximation algorithms. Our experiments aim to demonstrate the following assertions: (1) Our exact and deterministic approximation algorithms for WKNN-Shapley significantly improve computational efficiency compared to the baseline $O(N^{K})$ algorithm and Monte Carlo approximation, respectively. (2) Compared to unweighted KNN-Shapley, WKNN-Shapley (2-1) achieves better performance in discerning data quality in several important downstream tasks, including mislabeled/noisy data detection and data selection, and (2-2) demonstrates more stable performance across different choices of $K$. (3) The approximated WKNN-Shapley, while being more efficient, consistently achieves performance comparable to the exact WKNN-Shapley on most of the benchmark datasets.

5.1 Runtime Comparison

We empirically assess the computational efficiency of our exact and deterministic approximation algorithms for WKNN-Shapley, comparing them to the $O(N^{K})$ exact algorithm and the Monte Carlo approximation presented in [Jia et al., 2019a]. We examine various training data sizes $N$ and compare the execution clock time of different algorithms at each $N$. In data size regimes where the baseline algorithms from [Jia et al., 2019a] are infeasible to execute ($>10$ hours), we fit a polynomial curve to smaller data size regimes and plot the predicted extrapolation for larger sizes.

Figure 2 shows that the exact algorithm from [Jia et al., 2019a] requires $\geq 10^{3}$ hours to run even for $N=100$, rendering it impractical for actual use. In contrast, our exact algorithm for computing WKNN-Shapley achieves significantly better computational efficiency (e.g., almost $\mathbf{10^{6}}$ times faster at $\mathbf{N=10^{5}}$). Our deterministic approximation algorithm is compared to the Monte Carlo approximation from [Jia et al., 2019a]. For a fair comparison, both algorithms are aligned to the same theoretical error bounds. Note that the error bound for the Monte Carlo algorithm is a high-probability bound, subject to a small failure probability. In contrast, our error bound always holds, offering a more robust guarantee than the Monte Carlo technique. Nonetheless, Figure 2 shows that our deterministic approximation algorithm not only provides a stronger approximation guarantee but also achieves significantly greater efficiency (again almost $\mathbf{10^{6}}$ times faster at $\mathbf{N=10^{5}}$). This showcases the remarkable improvements of our techniques.

Figure 2: Runtime comparison between our exact and approximation algorithms for WKNN-Shapley from Section 4 and those from [Jia et al., 2019a], across varying training data sizes $N$. We set $K=5$ and the weights are discretized to 3 bits here. In Appendix LABEL:appendix:eval, we provide additional experiments for different $K$ and $b$. For our deterministic approximation algorithm, we set $M^{\star}=\sqrt{N}$ (so that the time complexity is $O(N^{1.5})$). For the Monte Carlo approximation from [Jia et al., 2019a], we align the error bounds with ours for a fair comparison; we set the failure probability for the Monte Carlo method to $\delta=0.1$. The plot shows the average runtime over 5 independent runs.

5.2 Discerning Data Quality

Due to the superior computational efficiency of our newly developed algorithms, WKNN-Shapley has become a practical data valuation technique for actual use. In this section, we evaluate its effectiveness in discerning data quality for common real-world applications. Tasks: we consider three applications that are commonly used for evaluating the performance of data valuation techniques in prior works [Ghorbani and Zou, 2019, Kwon and Zou, 2022, Wang and Jia, 2023a]: mislabeled data detection, noisy data detection, and data selection. Due to space constraints, we only present the results for mislabeled data detection here, and defer the results for the other two tasks to Appendix LABEL:appendix:eval. Mislabeled data usually detrimentally impacts model performance. Hence, a reasonable data valuation technique should assign low values to these data points. In our experiments for mislabeled data detection, we randomly select 10% of the data points and flip their labels. Baselines & Settings & Hyperparameters: We evaluate the performance of both the exact and approximated WKNN-Shapley. We use the $\ell_{2}$ distance and the popular RBF kernel $w_{i}=\exp(-\lVert x_{i}-x^{(\mathrm{val})}\rVert)$ to determine the weights of training points. We discretize the weights to 3 bits, as we find this level of precision offers a balance between good performance and computational efficiency, with a weight space size $W$ of merely $2^{3}=8$. In Appendix LABEL:appendix:eval, we conduct ablation studies on the choice of the number of bits for discretization. For approximated WKNN-Shapley, we set $M^{\star}=\sqrt{N}$. Our primary baseline is the unweighted, soft-label KNN-Shapley from [Jia et al., 2019a]. Since our WKNN-Shapley corresponds to hard-label KNN, we also include unweighted, hard-label KNN-Shapley in the comparison for completeness. Note that it can be computed by simply setting the weights of all data points to a constant.

Results. We use AUROC as the performance metric for the mislabeled data detection task. Unweighted vs. weighted KNN-Shapley: Table 1 shows the AUROC scores across the 13 benchmark datasets we experimented on when $K=5$. Notably, both exact and approximated WKNN-Shapley markedly outperform the unweighted KNN-Shapley (either soft-label or hard-label) across most datasets. This can likely be attributed to WKNN-Shapley's ability to more accurately differentiate between bad and good data based on proximity to the queried example. In Appendix LABEL:appendix:eval-comparison, we present a qualitative study highlighting why WKNN-Shapley outperforms unweighted KNN-Shapley in discerning data quality. Exact vs. approximated WKNN-Shapley: From Table 1, we can see the encouraging result that the approximated WKNN-Shapley achieves performance comparable to (and sometimes even slightly better than) the exact WKNN-Shapley across the majority of datasets. This is likely attributable to its favorable property of preserving the fairness properties of its exact counterpart. Robustness to the choice of $K$: In Figure 3, we show that, compared to unweighted KNN-Shapley, WKNN-Shapley maintains notably stable performance across various choices of $K$, particularly for larger values. This is because benign data points that fall within the $K$ nearest neighbors of the query example yet possess different labels may not receive a very low value, due to their likely distant positioning from the query example. Conversely, unweighted KNN-Shapley tends to assign these benign points lower values.

Dataset    | Unweighted KNN-Shapley (Soft-label) | Unweighted KNN-Shapley (Hard-label, this work) | Exact WKNN-Shapley (this work) | Approximated WKNN-Shapley (this work)
2DPlanes   | 0.849 | 0.800 | 0.884 | 0.831
CPU        | 0.867 | 0.929 | 0.956 | 0.956
Phoneme    | 0.707 | 0.724 | 0.773 | 0.778
Fraud      | 0.556 | 0.547 | 0.751 | 0.596
Creditcard | 0.698 | 0.676 | 0.842 | 0.747
Vehicle    | 0.689 | 0.724 | 0.800 | 0.813
Click      | 0.627 | 0.600 | 0.751 | 0.693
Wind       | 0.836 | 0.849 | 0.858 | 0.880
Pol        | 0.907 | 0.862 | 1.000 | 0.991
MNIST      | 0.724 | 0.471 | 0.831 | 0.836
CIFAR10    | 0.684 | 0.760 | 0.760 | 0.756
AGNews     | 0.953 | 0.978 | 0.991 | 0.988
DBPedia    | 0.968 | 0.902 | 1.000 | 1.000
Table 1: AUROC scores of different variants of KNN-Shapley for mislabeled data detection on benchmark datasets. The higher, the better.
Figure 3: AUROC scores of different variants of KNN-Shapley for mislabeled data detection with different values of $K$. The higher the curve, the better the method.

6 Conclusion

In this study, we addressed the computation and approximation of WKNN-Shapley when using the accuracy of hard-label KNN with discretized weights as the utility function. Future work should explore polynomial-time computation of exact Data Shapley for other learning algorithms. In particular, it is worth investigating whether more complicated learning algorithms, e.g., neural networks, can be modified so that their exact Data Shapley can be computed efficiently.

Acknowledgments

This work was supported in part by the National Science Foundation under grants CNS-2131938, CNS-1553437, CNS-1704105, IIS-2312794, IIS-2313130, OAC-2239622, the ARL’s Army Artificial Intelligence Innovation Institute (A2I2), the Office of Naval Research Young Investigator Award, the Army Research Office Young Investigator Prize, Schmidt DataX award, Princeton E-ffiliates Award, Amazon-Virginia Tech Initiative in Efficient and Robust Machine Learning, the Commonwealth Cyber Initiative, and a Princeton’s Gordon Y. S. Wu Fellowship. We are grateful to anonymous reviewers at AISTATS for their valuable feedback.

References

  • [Anonymous, 2021] Anonymous (2021). Towards general robustness to bad training data.
  • [Auer et al., 2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007+ ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings, pages 722–735. Springer.
  • [Belaid et al., 2023] Belaid, M. K., Mekki, D. E., Rabus, M., and Hüllermeier, E. (2023). Optimizing data Shapley interaction calculation from O(2^n) to O(t n^2) for KNN models. arXiv preprint arXiv:2304.01224.
  • [Bian et al., 2021] Bian, Y., Rong, Y., Xu, T., Wu, J., Krause, A., and Huang, J. (2021). Energy-based learning for cooperative games, with applications to valuation problems in machine learning. arXiv preprint arXiv:2106.02938.
  • [Courtnage and Smirnov, 2021] Courtnage, C. and Smirnov, E. (2021). Shapley-value data valuation for semi-supervised learning. In Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11–13, 2021, Proceedings 24, pages 94–108. Springer.
  • [Dal Pozzolo et al., 2015] Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence, pages 159–166. IEEE.
  • [Dall’Aglio et al., 2019] Dall’Aglio, M., Fragnelli, V., and Moretti, S. (2019). Sometimes the computation of the shapley value is simple. Handbook of the Shapley value, 441.
  • [Ghorbani et al., 2020] Ghorbani, A., Kim, M., and Zou, J. (2020). A distributional framework for data valuation. In International Conference on Machine Learning, pages 3535–3544. PMLR.
  • [Ghorbani and Zou, 2019] Ghorbani, A. and Zou, J. (2019). Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242–2251. PMLR.
  • [Ghorbani et al., 2022] Ghorbani, A., Zou, J., and Esteva, A. (2022). Data shapley valuation for efficient batch active learning. In 2022 56th Asilomar Conference on Signals, Systems, and Computers, pages 1456–1462. IEEE.
  • [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • [Jia et al., 2019a] Jia, R., Dao, D., Wang, B., Hubis, F. A., Gurel, N. M., Li, B., Zhang, C., Spanos, C. J., and Song, D. (2019a). Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment.
  • [Jia et al., 2019b] Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. (2019b). Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176. PMLR.
  • [Just et al., 2022] Just, H. A., Kang, F., Wang, T., Zeng, Y., Ko, M., Jin, M., and Jia, R. (2022). Lava: Data valuation without pre-specified learning algorithms. In The Eleventh International Conference on Learning Representations.
  • [Karlaš et al., 2022] Karlaš, B., Dao, D., Interlandi, M., Li, B., Schelter, S., Wu, W., and Zhang, C. (2022). Data debugging with shapley importance over end-to-end machine learning pipelines. arXiv preprint arXiv:2204.11131.
  • [Kenton and Toutanova, 2019] Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  • [Khandelwal et al., 2019] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. (2019). Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
  • [Krizhevsky et al., 2009] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
  • [Kwon and Zou, 2022] Kwon, Y. and Zou, J. (2022). Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 8780–8802. PMLR.
  • [Kwon and Zou, 2023] Kwon, Y. and Zou, J. (2023). Data-oob: Out-of-bag estimate as a simple and efficient data value. ICML.
  • [LeCun, 1998] LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [Liang et al., 2021] Liang, W., Liang, K.-H., and Yu, Z. (2021). Herald: An annotation efficient method to detect user disengagement in social conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3652–3665.
  • [Liang et al., 2020] Liang, W., Zou, J., and Yu, Z. (2020). Beyond user self-reported likert scale ratings: A comparison model for automatic dialog evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1363–1374.
  • [Lin et al., 2022] Lin, J., Zhang, A., Lécuyer, M., Li, J., Panda, A., and Sen, S. (2022). Measuring the effect of training data on deep learning predictions via randomized experiments. In International Conference on Machine Learning, pages 13468–13504. PMLR.
  • [Liu et al., 2023] Liu, Z., Just, H. A., Chang, X., Chen, X., and Jia, R. (2023). 2d-shapley: A framework for fragmented data valuation. arXiv preprint arXiv:2306.10473.
  • [Maleki, 2015] Maleki, S. (2015). Addressing the computational issues of the Shapley value with applications in the smart grid. PhD thesis, University of Southampton.
  • [Northcutt et al., 2021] Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
  • [OpenAI, 2023] OpenAI (2023). Planning for AGI and beyond. https://openai.com/blog/planning-for-agi-and-beyond.
  • [Pandl et al., 2021] Pandl, K. D., Feiland, F., Thiebes, S., and Sunyaev, A. (2021). Trustworthy machine learning for health care: scalable data valuation with the shapley value. In Proceedings of the Conference on Health, Inference, and Learning, pages 47–57.
  • [Reimers and Gurevych, 2019] Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • [Shapley, 1953] Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.
  • [Shim et al., 2021] Shim, D., Mai, Z., Jeong, J., Sanner, S., Kim, H., and Jang, J. (2021). Online class-incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9630–9638.
  • [Wang et al., 2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • [Wang and Jia, 2023a] Wang, J. T. and Jia, R. (2023a). Data banzhaf: A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 6388–6421. PMLR.
  • [Wang and Jia, 2023b] Wang, J. T. and Jia, R. (2023b). A note on "efficient task-specific data valuation for nearest neighbor algorithms". arXiv preprint arXiv:2304.04258.
  • [Wang and Jia, 2023c] Wang, J. T. and Jia, R. (2023c). A note on "towards efficient data valuation based on the shapley value". arXiv preprint arXiv:2302.11431.
  • [Wang et al., 2023] Wang, J. T., Zhu, Y., Wang, Y.-X., Jia, R., and Mittal, P. (2023). Threshold knn-shapley: A linear-time and privacy-friendly approach to data valuation. arXiv preprint arXiv:2308.15709.
  • [Wang et al., 2020] Wang, T., Rausch, J., Zhang, C., Jia, R., and Song, D. (2020). A principled approach to data valuation for federated learning. In Federated Learning, pages 153–167. Springer.
  • [Wang et al., 2021] Wang, Z., Shan, X., Zhang, X., and Yang, J. (2021). N24news: A new dataset for multimodal news classification. arXiv preprint arXiv:2108.13327.
  • [Warner, 2019] Warner, M. (2019). Warner & hawley introduce bill to force social media companies to disclose how they are monetizing user data. Government Document.
  • [Wu et al., 2022] Wu, Z., Shu, Y., and Low, B. K. H. (2022). Davinz: Data valuation using deep neural networks at initialization. In International Conference on Machine Learning, pages 24150–24176. PMLR.
  • [Yan and Procaccia, 2021] Yan, T. and Procaccia, A. D. (2021). If you like shapley then you’ll love the core. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5751–5759.
  • [Yeh and Lien, 2009] Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert systems with applications, 36(2):2473–2480.

Appendix A Extended Related Works

A.1 Data Shapley

Data Shapley is one of the first principled approaches to data valuation and has become increasingly popular [Ghorbani and Zou, 2019, Jia et al., 2019b]. Data Shapley is based on the Shapley value, a famous solution concept from the game theory literature, usually justified as the unique value notion satisfying the following four axioms:

  1. (1) Null player: if v(S{zi})=v(S)v(S\cup\{z_{i}\})=v(S) for all SD{zi}S\subseteq D\setminus\{z_{i}\}, then ϕzi(v)=0\phi_{z_{i}}(v)=0.

  2. (2) Symmetry: if v(S{zi})=v(S{zj})v(S\cup\{z_{i}\})=v(S\cup\{z_{j}\}) for all SD{zi,zj}S\subseteq D\setminus\{z_{i},z_{j}\}, then ϕzi(v)=ϕzj(v)\phi_{z_{i}}(v)=\phi_{z_{j}}(v).

  3. (3) Linearity: for utility functions v1,v2v_{1},v_{2} and any α1,α2\alpha_{1},\alpha_{2}\in\mathbb{R}, ϕzi(α1v1+α2v2)=α1ϕzi(v1)+α2ϕzi(v2)\phi_{z_{i}}(\alpha_{1}v_{1}+\alpha_{2}v_{2})=\alpha_{1}\phi_{z_{i}}(v_{1})+\alpha_{2}\phi_{z_{i}}(v_{2}).

  4. (4) Efficiency: for every v,ziDϕzi(v)=v(D)v,\sum_{z_{i}\in D}\phi_{z_{i}}(v)=v(D).

Since its introduction, numerous variants of Data Shapley have been developed [Jia et al., 2019a, Ghorbani et al., 2020, Wang et al., 2020, Bian et al., 2021, Kwon and Zou, 2022, Lin et al., 2022, Wu et al., 2022, Karlaš et al., 2022, Wang and Jia, 2023a, Wang and Jia, 2023c, Liu et al., 2023, Wang et al., 2023], reflecting its effectiveness as a principled approach for quantifying data point contributions to ML model training. However, the community does not regard all of the above axioms as necessary for a reasonable data valuation technique. For example, [Kwon and Zou, 2022] argue that (4) Efficiency is not necessarily required for data valuation, and [Yan and Procaccia, 2021] argue that (3) Linearity is mainly a technical requirement without a natural interpretation in the context of machine learning. That said, the (1) Null player and (2) Symmetry axioms are often viewed as the “fairness axioms” and are deemed fundamental for any sound data valuation technique. A reasonable Shapley approximation should therefore uphold these two axioms so that the Shapley value's core advantages, fairness and equity, are not diminished. The deterministic approximation algorithm we develop in Section 4.2 meets these criteria.
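As a quick illustration, the Shapley value of a small toy game can be computed directly from its definition (the average marginal contribution over all subsets of the remaining players), and the fairness axioms can then be checked numerically. The following brute-force sketch uses our own function names, is exponential in the number of players, and is meant only for illustration:

```python
from itertools import combinations
from math import comb

def shapley_values(players, v):
    """Brute-force Shapley value from the definition:
    phi_i = (1/N) * sum_k (1 / C(N-1, k)) * sum_{|S|=k, i not in S} [v(S u {i}) - v(S)].
    Exponential in N; only meant for checking the axioms on toy games."""
    N = len(players)
    phi = {}
    for i in players:
        rest = [p for p in players if p != i]
        total = 0.0
        for k in range(N):
            marg = sum(v(set(S) | {i}) - v(set(S)) for S in combinations(rest, k))
            total += marg / comb(N - 1, k)
        phi[i] = total / N
    return phi

# Toy 3-player game: the coalition succeeds iff it contains player 1 and at least one of {2, 3}.
players = [1, 2, 3]
v = lambda S: 1.0 if {1, 2} <= S or {1, 3} <= S else 0.0
phi = shapley_values(players, v)
assert abs(sum(phi.values()) - v({1, 2, 3})) < 1e-9   # (4) Efficiency
assert abs(phi[2] - phi[3]) < 1e-9                    # (2) Symmetry: players 2 and 3 are interchangeable
```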

A.2 KNN-Shapley

The computation of the Shapley value is notoriously resource-intensive. To the best of our knowledge, unweighted KNN stands as the only frequently deployed ML model where the exact Data Shapley can be efficiently computed. Due to its exceptional computational efficiency coupled with its capability to discern data quality, KNN-Shapley has become one of the most popular and practical data valuation techniques. For instance, [Ghorbani et al., 2022] extends KNN-Shapley to active learning, and [Shim et al., 2021] applies it in a continual learning setting. Additionally, studies such as [Liang et al., 2020, Liang et al., 2021] have leveraged KNN-Shapley to eliminate ambiguous samples in NLP tasks, and [Courtnage and Smirnov, 2021] has endorsed its use in semi-supervised learning data valuation. [Belaid et al., 2023] extends the analysis of KNN-Shapley to the calculation of the Shapley interaction index. KNN-Shapley has also shown its practicality in real-world scenarios. For example, [Pandl et al., 2021] shows that KNN-Shapley is the only practical data valuation technique for valuing large amounts of healthcare data. A concurrent work [Wang et al., 2023] considers a simple variant of unweighted KNN termed Threshold KNN, and develops an alternative to KNN-Shapley that can be easily incorporated with differential privacy. Both [Wang et al., 2023] and our work demonstrate the importance of proper adjustments to the underlying KNN’s configuration (i.e., the utility function) in developing new data valuation techniques with desired properties (computational efficiency, privacy compatibility, etc.).

All the studies mentioned above focus on unweighted KNN. On the other hand, weighted KNN incorporates more information about the underlying dataset. Consequently, the Data Shapley score derived from weighted KNN might offer a better assessment of individual data point quality (as demonstrated in our experiments).

Finally, we note that some alternative data valuation techniques are as efficient as KNN-Shapley [Just et al., 2022, Kwon and Zou, 2023]. However, these methods lack the formal theoretical justification enjoyed by Shapley value-based approaches.

Appendix B Additional Background of KNN-Shapley

Notation Review. Recall that we use αx(val)(S,j)\alpha_{x^{(\mathrm{val})}}^{(S,j)} to denote the index (among DD) of the jjth closest data point in SS to x(val)x^{(\mathrm{val})}.

B.1 Unweighted KNN-Shapley

The unweighted KNN-Shapley was originally proposed in [Jia et al., 2019a] and was later refined in [Wang and Jia, 2023b]. Specifically, [Wang and Jia, 2023b] considers the utility function for unweighted, soft-label KNN on a validation point z(val)z^{(\mathrm{val})}:

v(S;z(val))\displaystyle v(S;z^{(\mathrm{val})}) :=j=1min(K,|S|)𝟙[yαx(val)(S,j)=y(val)]min(|S|,K)\displaystyle:=\frac{\sum_{j=1}^{\min(K,|S|)}\mathbbm{1}[y_{\alpha_{x^{(\mathrm{val})}}^{(S,j)}}=y^{(\mathrm{val})}]}{\min(|S|,K)} (8)

which is a special case of the utility function for weighted KNN in (1). It can be interpreted as the probability that a soft-label KNN classifier predicts the correct label for a validation point z(val)=(x(val),y(val))D(val)z^{(\mathrm{val})}=(x^{(\mathrm{val})},y^{(\mathrm{val})})\in D^{(\mathrm{val})}. When |S|=0|S|=0, vz(val)(S)v_{z^{(\mathrm{val})}}(S) is set to the accuracy of random guessing (i.e., v()=1/#classv(\emptyset)=1/\texttt{\#class}). Here is the main result of [Wang and Jia, 2023b]:

Theorem 14 (KNN-Shapley [Wang and Jia, 2023b]).

Consider the utility function in (8). Given a validation data point z(val)=(x(val),y(val))z^{(\mathrm{val})}=(x^{(\mathrm{val})},y^{(\mathrm{val})}) and a distance metric d(,)d(\cdot,\cdot), if we sort the training set D={zi=(xi,yi)}i=1ND=\{z_{i}=(x_{i},y_{i})\}_{i=1}^{N} according to d(xi,x(val))d(x_{i},x^{(\mathrm{val})}) in ascending order, then the Shapley value of each data point ϕzi\phi_{z_{i}} corresponding to utility function vz(val)v_{z^{(\mathrm{val})}} can be computed recursively as follows:

ϕzN\displaystyle\phi_{z_{N}} =𝟙[N2]N(𝟙[yN=y(val)]i=1N1𝟙[yi=y(val)]N1)(j=1min(K,N)11j+1)+1N(𝟙[yN=y(val)]1C)\displaystyle=\frac{\mathbbm{1}[N\geq 2]}{N}\left(\mathbbm{1}[y_{N}=y^{(\mathrm{val})}]-\frac{\sum_{i=1}^{N-1}\mathbbm{1}[y_{i}=y^{(\mathrm{val})}]}{N-1}\right)\left(\sum_{j=1}^{\min(K,N)-1}\frac{1}{j+1}\right)+\frac{1}{N}\left(\mathbbm{1}[y_{N}=y^{(\mathrm{val})}]-\frac{1}{C}\right)
ϕzi\displaystyle\phi_{z_{i}} =ϕzi+1+𝟙[yi=y(val)]𝟙[yi+1=y(val)]N1[j=1min(K,N)1j+𝟙[NK]K(min(i,K)(N1)iK)]\displaystyle=\phi_{z_{i+1}}+\frac{\mathbbm{1}[y_{i}=y^{(\mathrm{val})}]-\mathbbm{1}[y_{i+1}=y^{(\mathrm{val})}]}{N-1}\left[\sum_{j=1}^{\min(K,N)}\frac{1}{j}+\frac{\mathbbm{1}[N\geq K]}{K}\left(\frac{\min(i,K)\cdot(N-1)}{i}-K\right)\right]

where CC denotes the number of classes for the classification task.

The computation of all unweighted KNN-Shapley values (ϕz1,,ϕzN)(\phi_{z_{1}},\ldots,\phi_{z_{N}}) can be achieved in O(NlogN)O(N\log N) runtime in total, as the runtime is dominated by sorting the data points in DD.
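The recursion in Theorem 14 translates directly into code. Below is a minimal NumPy sketch for a single validation point; the function and argument names are ours, and this is not the authors' released implementation:

```python
import numpy as np

def unweighted_knn_shapley(distances, labels, y_val, K, num_classes):
    """Transcription of the recursion in Theorem 14.

    distances[i] = d(x_i, x_val), labels[i] = y_i; returns the Shapley value of
    each training point, aligned with the input order."""
    N = len(labels)
    order = np.argsort(distances)                                # sort by distance to x_val, ascending
    match = (np.asarray(labels)[order] == y_val).astype(float)   # 1[y_i = y_val] after sorting

    phi = np.zeros(N)
    # Base case: the farthest point z_N.
    s1 = sum(1.0 / (j + 1) for j in range(1, min(K, N)))         # sum_{j=1}^{min(K,N)-1} 1/(j+1)
    if N >= 2:
        phi[N - 1] = (match[N - 1] - match[:N - 1].mean()) / N * s1
    phi[N - 1] += (match[N - 1] - 1.0 / num_classes) / N

    # Recursion from z_{N-1} down to z_1 (i below is the 1-based rank of z_i).
    s2 = sum(1.0 / j for j in range(1, min(K, N) + 1))           # sum_{j=1}^{min(K,N)} 1/j
    for i in range(N - 1, 0, -1):
        extra = (min(i, K) * (N - 1) / i - K) / K if N >= K else 0.0
        phi[i - 1] = phi[i] + (match[i - 1] - match[i]) / (N - 1) * (s2 + extra)

    out = np.zeros(N)
    out[order] = phi                                             # undo the sort
    return out

# Example: 5 training points, K = 2, binary labels.
values = unweighted_knn_shapley([0.1, 0.4, 0.2, 0.9, 0.5], [1, 0, 1, 1, 0], y_val=1, K=2, num_classes=2)
```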

B.2 Baseline Algorithm for Computing and Approximating WKNN-Shapley

Exact Computation Algorithm.

[Jia et al., 2019a] shows that the Data Shapley for weighted KNN can be computed exactly with a runtime of O(NK)O(N^{K}). The high-level idea, as described in Section 2.3 of the main text, is that we only need to evaluate v(S)v(S) for those subsets SS where adding the target data point ziz_{i} may change the prediction of KNN, and there are at most O(NK)O(N^{K}) such subsets. For completeness, we state the specific expression for computing the exact WKNN-Shapley from [Jia et al., 2019a]. We note that the original theorem statement in [Jia et al., 2019a] contains minor errors, which we fix here.

Theorem 15 ([Jia et al., 2019a]).

Consider the utility function in (1). Let Bk(zi)={S:|S|=k,zi∉S,S⊆D}B_{k}(z_{i})=\{S:|S|=k,z_{i}\notin S,S\subseteq D\}. Let r()r(\cdot) be a function that maps the set of training data to their ranks of similarity to x(val)x^{(\mathrm{val})}. Then, the Shapley value ϕzi\phi_{z_{i}} of each training point ziz_{i} can be calculated recursively as follows:

ϕzN=1Nk=0K11(N1k)SBk(zN)[v(S{zN})v(S)]\displaystyle\phi_{z_{N}}=\frac{1}{N}\sum_{k=0}^{K-1}\frac{1}{{N-1\choose k}}\!\!\sum_{S\in B_{k}(z_{N})}\!\!\left[v(S\cup\{z_{N}\})-v(S)\right] (9)
ϕzi=ϕzi+1+1N1k=0N21(N2k)SDi,kAi,k[v(S{zi})v(S{zi+1})]\displaystyle\phi_{z_{i}}=\phi_{z_{i+1}}+\frac{1}{N-1}\sum_{k=0}^{N-2}\frac{1}{{N-2\choose k}}\sum_{S\in D_{i,k}}A_{i,k}\left[v(S\cup\{z_{i}\})-v(S\cup\{z_{i+1}\})\right] (10)

where

Di,k={Bk(zi)Bk(zi+1)0kK2BK1(zi)BK1(zi+1)K1kN2\displaystyle D_{i,k}=\begin{cases}B_{k}(z_{i})\cap B_{k}(z_{i+1})&0\leq k\leq K-2\\ B_{K-1}(z_{i})\cap B_{K-1}(z_{i+1})&K-1\leq k\leq N-2\end{cases}

and

Ai,k={10kK2(Nmax(i+1,αx(val)(S,|S|))kK+1)K1kN2\displaystyle A_{i,k}=\begin{cases}1&0\leq k\leq K-2\\ {N-\max\left(i+1,\alpha_{x^{(\mathrm{val})}}^{(S,|S|)}\right)\choose k-K+1}&K-1\leq k\leq N-2\end{cases}
Proof.

The proof proceeds by standard combinatorial analysis and we refer readers to Appendix E.2 of [Jia et al., 2019a] for details. We note that αx(val)(S,|S|)\alpha_{x^{(\mathrm{val})}}^{(S,|S|)} denotes the index of the data point in SS farthest from x(val)x^{(\mathrm{val})}. ∎

Remark 6.

While [Jia et al., 2019a] state the above results for soft-label weighted KNN, this algorithm is in fact fairly general and also applies to hard-label weighted KNN.
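On small datasets, the expression in Theorem 15 can be cross-checked against a brute-force enumeration of the Shapley definition (e.g., the sketch in Appendix A.1). The only extra ingredient is the utility itself; below is a sketch of a hard-label weighted KNN utility for a single validation point. The function name, the tie-breaking rule, and the value assigned to the empty set are our own assumptions rather than the exact conventions of utility (1):

```python
def hard_label_wknn_utility(S, dist, y, weights, y_val, K):
    """v(S): 1 if the weighted majority vote among the (up to) K points of S
    closest to x_val equals y_val, else 0.  S is a set of training indices,
    dist[i] = d(x_i, x_val), weights[i] = w_i."""
    if not S:                                    # assumed convention for the empty set
        return 0.0
    nearest = sorted(S, key=lambda i: dist[i])[:K]
    votes = {}
    for i in nearest:
        votes[y[i]] = votes.get(y[i], 0.0) + weights[i]
    # Ties are broken toward the smallest class label here (an assumption of this sketch).
    pred = max(sorted(votes), key=votes.get)
    return 1.0 if pred == y_val else 0.0
```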

Monte Carlo Algorithm.

The Monte Carlo approximation algorithm for WKNN-Shapley from [Jia et al., 2019a] is a simple adaptation of the famous permutation sampling algorithm [Maleki, 2015]. We refer the reader to Section 5 of [Jia et al., 2019a] for the detailed description and error analysis.
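For completeness, here is a minimal sketch of the generic permutation sampling estimator in the spirit of [Maleki, 2015]; all names are ours, and this is not the exact implementation of [Jia et al., 2019a]:

```python
import random

def permutation_sampling_shapley(N, v, num_perms=1000, seed=0):
    """Monte Carlo Shapley estimate: average each point's marginal contribution
    over random permutations of the training set.  v maps a set of training
    indices to a utility value."""
    rng = random.Random(seed)
    phi = [0.0] * N
    for _ in range(num_perms):
        perm = list(range(N))
        rng.shuffle(perm)
        S, prev = set(), v(set())
        for i in perm:
            S = S | {i}                 # add the next point in the permutation
            cur = v(S)
            phi[i] += cur - prev
            prev = cur
    return [p / num_perms for p in phi]
```

Plugging in a KNN utility (e.g., the hard-label weighted KNN utility sketched above) yields a Monte Carlo baseline for WKNN-Shapley; an error analysis of this type of estimator is given in Section 5 of [Jia et al., 2019a].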

Appendix C Additional Discussion of Weights Discretization

Value Rank Preservation.

We show that the Shapley value computed based on the discretized weights has the same ranking order as the Shapley value computed on the continuous weights (discretization might create ties but will not reverse the original order).

Here, we consider the binary classification setting since our approach for computing WKNN-Shapley for multi-class classification (in Appendix LABEL:appendix:multiclass) makes a reduction to the binary classification setting. We make a very mild assumption that if d(xi,x(val))d(xj,x(val))d(x_{i},x^{(\mathrm{val})})\leq d(x_{j},x^{(\mathrm{val})}), then wiwjw_{i}\geq w_{j}. That is, the closer the training point is to the query, the higher the weight the training point will have. This is natural for any reasonable weighting scheme for KNN.

Lemma 16.

Under the assumption of weight function stated above, for any weight assignment function and any two data points zi,zjz_{i},z_{j}, if ϕzi(v)ϕzj(v)\phi_{z_{i}}(v)\geq\phi_{z_{j}}(v), then we have ϕzi(vdisc)ϕzj(vdisc)\phi_{z_{i}}(v^{\texttt{disc}})\geq\phi_{z_{j}}(v^{\texttt{disc}}) where vdiscv^{\texttt{disc}} refers to the new utility function after weights discretization.

Proof.

The following cases will result in ϕzi(v)ϕzj(v)\phi_{z_{i}}(v)\geq\phi_{z_{j}}(v):

Case 1: if yi=y(val)y_{i}=y^{(\mathrm{val})} and yjy(val)y_{j}\neq y^{(\mathrm{val})}, it is easy to see that for any SDS\subseteq D we have v(Szi)v(S)v(S\cup z_{i})\geq v(S) while v(Szj)v(S)v(S\cup z_{j})\leq v(S), and hence we have ϕzi0\phi_{z_{i}}\geq 0 while ϕzj0\phi_{z_{j}}\leq 0, regardless of the specific magnitude of the weights. Hence, we still have ϕzi(vdisc)0\phi_{z_{i}}(v^{\texttt{disc}})\geq 0 and ϕzj(vdisc)0\phi_{z_{j}}(v^{\texttt{disc}})\leq 0 after weights discretization.

Case 2: if yi=yj=y(val)y_{i}=y_{j}=y^{(\mathrm{val})}, then ϕzi(v)ϕzj(v)\phi_{z_{i}}(v)\geq\phi_{z_{j}}(v) implies that d(xi,x(val))d(xj,x(val))d(x_{i},x^{(\mathrm{val})})\leq d(x_{j},x^{(\mathrm{val})}) and wiwjw_{i}\geq w_{j}. This is because for any SD{zi,zj}S\subseteq D\setminus\{z_{i},z_{j}\}, if v(S{zj})v(S)=1v(S\cup\{z_{j}\})-v(S)=1, we must also have v(S{zi})v(S)=1v(S\cup\{z_{i}\})-v(S)=1. Since weight discretization does not change the rank order of weights wiw_{i} and wjw_{j}, we still have ϕzi(vdisc)ϕzj(vdisc)\phi_{z_{i}}(v^{\texttt{disc}})\geq\phi_{z_{j}}(v^{\texttt{disc}}).

Case 3: if yi=yjy(val)y_{i}=y_{j}\neq y^{(\mathrm{val})}, then ϕzi(v)ϕzj(v)\phi_{z_{i}}(v)\geq\phi_{z_{j}}(v) implies that d(xi,x(val))d(xj,x(val))d(x_{i},x^{(\mathrm{val})})\geq d(x_{j},x^{(\mathrm{val})}) and wiwjw_{i}\leq w_{j}. This is because for any SD{zi,zj}S\subseteq D\setminus\{z_{i},z_{j}\}, if v(S{zi})v(S)=1v(S\cup\{z_{i}\})-v(S)=-1, we must also have v(S{zj})v(S)=1v(S\cup\{z_{j}\})-v(S)=-1. Since weight discretization does not change the rank order of weights wiw_{i} and wjw_{j}, we still have ϕzi(vdisc)ϕzj(vdisc)\phi_{z_{i}}(v^{\texttt{disc}})\geq\phi_{z_{j}}(v^{\texttt{disc}}). ∎

Value Deviation and Performance on Downstream Tasks.

It is difficult to analytically derive or upper bound the deviation of WKNN-Shapley score caused by weight discretization. Therefore, in Appendix LABEL:sec:eval-discretization, we empirically investigate such value deviation. We also evaluate the impact of weight discretization on WKNN-Shapley’s performance on downstream tasks such as mislabeled/noisy data detection. Overall, we conclude that the impact from weight discretization is very small even when the number of discretization bits bb is as small as 33.
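To make the discretization concrete, the following sketch snaps continuous weights in [0,1][0,1] onto a grid of 2b2^{b} equally spaced points (the exact grid endpoints and the rounding rule are assumptions of this sketch; the discretized weight space is described in Appendix D). Because rounding to the nearest grid point is monotone, the weight ordering, and hence by Lemma 16 the value ordering, can only gain ties, never reversals:

```python
import numpy as np

def discretize_weights(w, b=3):
    """Snap continuous weights in [0, 1] onto a grid of 2^b equally spaced points.
    Rounding to the nearest grid point is monotone, so the relative order of the
    weights is preserved (ties may be created, but never reversed)."""
    w = np.clip(np.asarray(w, dtype=float), 0.0, 1.0)
    grid = 2 ** b
    return np.round(w * grid) / grid

w = np.array([0.91, 0.47, 0.46, 0.12])
w_disc = discretize_weights(w, b=3)        # -> [0.875, 0.5, 0.5, 0.125]
# Monotone check: whenever w_i >= w_j, the discretized weights satisfy the same inequality.
n = len(w)
assert all(w_disc[i] >= w_disc[j] for i in range(n) for j in range(n) if w[i] >= w[j])
```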

Appendix D Additional Details for Exact and Approximation Algorithm for WKNN-Shapley

We state the full version of Theorem 7 here, and defer the proof to Appendix LABEL:appendix:proof.

Theorem 17 (Full version of Theorem 7).

When K>1K>1 (since [Jia et al., 2019a] has shown that weighted KNN-Shapley can be computed in O(NK)O(N^{K}) time, which is already efficient for K=1K=1, we focus on the setting where K>1K>1), for =1\ell=1 we have Fi[m,1,s]={1s=wm0swm\texttt{F}_{i}[m,1,s]=\begin{cases}1&s=w_{m}\\ 0&s\neq w_{m}\end{cases}. We can then compute Fi[m,,s]\texttt{F}_{i}[m,\ell,s] for 2\ell\geq 2 with the following relations:

If K1\ell\leq K-1, we have

Fi[m,,s]=t=1m1Fi[t,1,swm]\displaystyle\texttt{F}_{i}[m,\ell,s]=\sum_{t=1}^{m-1}\texttt{F}_{i}[t,\ell-1,s-w_{m}] (11)

and if K\ell\geq K, we have

Fi[m,,s]={0m<it=1m1Fi[t,K1,s](NmK)m>i\displaystyle\texttt{F}_{i}[m,\ell,s]=\begin{cases}0&m<i\\ \sum_{t=1}^{m-1}\texttt{F}_{i}[t,K-1,s]{N-m\choose\ell-K}&m>i\end{cases} (12)

Note that we set Fi[i,,]=0\texttt{F}_{i}[i,\cdot,\cdot]=0 for mathematical convenience.
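The recursion in Theorem 17 can be transcribed directly into a table-filling routine. The following sketch is a readable (not runtime-optimized) transcription of Eqs. (11) and (12); the function name and the integer encoding of the discretized weights are our own choices:

```python
from math import comb
import numpy as np

def build_F(i, w_int, N, K, b):
    """Readable transcription of the recursion in Theorem 17 (Eqs. (11)-(12)).

    i     : 1-based index of the target point z_i (points sorted by distance to x_val).
    w_int : integer-coded discretized weights, w_int[m-1] = w_m * 2^b, so every
            achievable weight sum is an integer in [0, K * 2^b].
    Returns the table F with F[m, l, s] = F_i[m, l, s] (m and l are 1-based)."""
    S_max = K * (2 ** b)                          # integer-coded size of the space [0, K]
    F = np.zeros((N + 1, N, S_max + 1))           # zero-initialized; F_i[i, ., .] stays 0
    # Base case l = 1: F_i[m, 1, s] = 1 iff s = w_m.
    for m in range(1, N + 1):
        if m != i:
            F[m, 1, w_int[m - 1]] = 1.0
    for l in range(2, N):                         # subset sizes l = 2, ..., N - 1
        for m in range(1, N + 1):
            if m == i:
                continue                          # convention F_i[i, ., .] = 0
            for s in range(S_max + 1):
                if l <= K - 1:
                    # Eq. (11): the l - 1 closer points must carry weight s - w_m.
                    if s - w_int[m - 1] >= 0:
                        F[m, l, s] = F[1:m, l - 1, s - w_int[m - 1]].sum()
                elif m > i:
                    # Eq. (12): the l - K points beyond the K nearest are chosen
                    # freely among the N - m points farther than z_m.
                    if l - K <= N - m:
                        F[m, l, s] = F[1:m, K - 1, s].sum() * comb(N - m, l - K)
                # Eq. (12) gives F_i[m, l, s] = 0 when l >= K and m < i.
    return F
```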

In the following, we present the pseudocode for our exact computation and approximation algorithms for WKNN-Shapley. For clarity of presentation, we first show a reader-friendly but inefficient version of the pseudocode for the exact computation algorithm from Section 4.1 in Appendix D.1. We then show the runtime-optimized (but less readable) pseudocode in Appendix LABEL:appendix:pseudocode, and the pseudocode for our deterministic approximation algorithm in Appendix LABEL:appendix:pseudocode-approx.

Notation. Recall that we use 𝐖\mathbf{W} to denote the discretized space of [0,1][0,1], where we create 2b2^{b} equally spaced points within the interval when we use bb bits for discretization. We denote by W:=|𝐖|=2bW:=|\mathbf{W}|=2^{b} the size of the weight space. Furthermore, we use 𝐖(K)\mathbf{W}_{(K)} to denote the discretized space of [0,K][0,K] (where we create K2bK2^{b} equally spaced points within the interval).

D.1 Reader-friendly Pseudo-code

Here, we show a reader-friendly version of the pseudo-code for our algorithms for computing WKNN-Shapley in the binary classification setting. The runtime-optimized version of the pseudo-code is shown in Appendix LABEL:appendix:pseudocode.

Algorithm 1 Weighted KNN-Shapley for binary classification (reader-friendly version)
1: Input: