Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
{tianhaowang,pmittal}@princeton.edu, [email protected]
Abstract
This work aims to address an open problem in the data valuation literature concerning the efficient computation of Data Shapley for weighted nearest neighbor algorithms (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley’s computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.††Correspondence to Jiachen T. Wang and Ruoxi Jia.
1 Introduction
Data is the backbone of machine learning (ML) models, but not all data is created equal. In real-world scenarios, data often carries noise and bias, sourced from diverse origins and labeling processes [Northcutt et al., 2021]. Against this backdrop, data valuation emerges as a growing research field, aiming to quantify the quality of individual data sources for ML training. Data valuation techniques are critical in explainable ML to diagnose influential training instances and in data marketplaces for fair compensation. The importance of data valuation is highlighted by legislative efforts [Warner, 2019] and vision statements from leading tech companies [OpenAI, 2023]. For instance, OpenAI listed “how to fairly distribute the benefits the AI systems generate” as an important question to be explored.
Data Valuation via the Shapley Value. Drawing on cooperative game theory, the technique of using the Shapley value for data valuation was pioneered by [Ghorbani and Zou, 2019, Jia et al., 2019b]. The Shapley value is a renowned solution concept in game theory for fair profit attribution [Shapley, 1953]. In the context of data valuation, individual data points or sources are regarded as “players” in a cooperative game, and Data Shapley refers to the suite of data valuation techniques that use the Shapley value as the contribution measure for each data owner.
KNN-Shapley. Despite offering a principled approach to data valuation with a solid theoretical foundation, the exact calculation of the Shapley value has a time complexity of $O(2^N)$ in general, where $N$ refers to the number of data points/sources. Fortunately, [Jia et al., 2019a] discovered an efficient algorithm to compute the exact Data Shapley for unweighted K-Nearest Neighbors (KNN), one of the oldest yet still popular ML algorithms. KNN-Shapley refers to the technique of assessing data value for any learning algorithm based on KNN’s Data Shapley score. Here, KNN serves as a proxy model for the original, perhaps complicated learning algorithm. KNN-Shapley can be applied to large, high-dimensional datasets by calculating the value scores on the features extracted from neural network embeddings. Due to its superior efficiency and effectiveness in discerning data quality, KNN-Shapley has become one of the most popular data valuation techniques [Pandl et al., 2021].
Open question from [Jia et al., 2019a]: efficient computation of weighted KNN-Shapley. While [Jia et al., 2019a] showed unweighted KNN-Shapley can be computed efficiently, they did not develop a practical algorithm for the more general weighted KNN-Shapley (WKNN-Shapley). Despite presenting a polynomial-time algorithm, its computational complexity of $O(N^K)$ becomes impractical even for modest $K$, such as 5. Closing this efficiency gap is important, especially given the inherent advantages and wider application of weighted KNN. Compared with its unweighted counterpart, weighted KNN considers the distance between data points, assigning varying levels of importance to neighbors based on their proximity to the query. Consequently, the Data Shapley of weighted KNN can potentially better discriminate between low- and high-quality data points. Moreover, weighted KNN is applied more widely in practice. For example, recent research discovered weighted KNN’s capability to improve language models’ performance [Khandelwal et al., 2019].
Our contributions are summarized as follows:
Making KNN Configurations “Shapley-friendly” (Section 3). Our preliminary investigations suggest that improving the computational efficiency of WKNN-Shapley for soft-label KNN classifiers with continuous weight values (the setting considered in [Jia et al., 2019a]) poses considerable challenges. Consequently, we make necessary modifications to the specific KNN classifiers’ configuration and shift our focus to hard-label KNN classifiers with discrete weight values. The justification for these changes and their practical relevance is detailed in Section 3. In particular, discretizing weights does not change the data value score significantly, even with only a few bits. We emphasize that making proper tweaks to the problem setup is important for developing an efficient Shapley computation algorithm, a strategy frequently adopted in the literature [Dall’Aglio et al., 2019].
A quadratic-time algorithm for computing exact WKNN-Shapley (Section 4.1). Given the adjusted “Shapley-friendly” configurations of the weighted KNN, we can reframe the Shapley value computation as a counting problem. We develop an algorithm with a quadratic runtime for solving the counting problem and computing the exact WKNN-Shapley, which greatly improves upon the $O(N^K)$ baseline algorithm.
A subquadratic-time deterministic approximation algorithm that preserves fairness properties (Section 4.2). To further improve the computational efficiency, we propose a deterministic approximation algorithm by making minor changes to the exact WKNN-Shapley implementation. In particular, our approximation algorithm retains the crucial fairness properties of the original Shapley value.
Empirical Evaluations (Section 5). We experiment on benchmark datasets and assess the efficiency and efficacy of our exact and approximation algorithms for WKNN-Shapley. Here are the key takeaways: (1) Our exact and approximation algorithm for WKNN-Shapley significantly improves computational efficiency compared to the baseline exact algorithm and Monte Carlo approximation, respectively. (2) WKNN-Shapley outperforms the unweighted KNN-Shapley in discerning data quality for critical downstream tasks such as detecting mislabeled/noisy data and data selection. Remarkably, the approximated WKNN-Shapley matches the performance of the exact WKNN-Shapley on many benchmark datasets, attributable to its deterministic nature and the preservation of fairness properties.
Overall, with proper changes to KNN configurations, we show that WKNN-Shapley can be efficiently calculated and approximated. This facilitates its wider adoption, offering a more effective data valuation method compared to unweighted KNN-Shapley.
2 Preliminaries
In this section, we formalize the setup of data valuation for ML, and revisit relevant techniques.
Setup & Goal. Given a labeled dataset $D = \{z_i\}_{i=1}^{N}$ where each data point $z_i = (x_i, y_i)$, data valuation aims to assign a score to each training data point $z_i$, reflecting its importance for the trained ML model’s performance. Formally, we seek a score vector $(\phi_{z_1}, \ldots, \phi_{z_N})$ where each $\phi_{z_i}$ represents the “value” of $z_i$.
2.1 Data Shapley
The Shapley value (SV) [Shapley, 1953], originating from game theory, stands out as a distinguished method for equitably distributing total profit among all participating players. Before diving into its definition, we first discuss a fundamental concept: the utility function.
Utility Function. A utility function $v: 2^D \rightarrow \mathbb{R}$ maps an input dataset to a score indicating the utility of the dataset for model training. Often, this function is chosen as the validation accuracy of a model trained on the given dataset. That is, given a training set $S$, the utility function is $v(S) := \mathrm{acc}(\mathcal{A}(S))$, where $\mathcal{A}$ represents a learning algorithm that trains a model on dataset $S$, and $\mathrm{acc}$ is a function assessing the model’s performance, e.g., its accuracy on a validation set.
Definition 1 (Shapley value [Shapley, 1953]).
Let $v$ denote a utility function and $D$ represent a training set of $N$ data points. The Shapley value $\phi_z$ assigned to a data point $z \in D$ is defined as
$$\phi_z(v) := \frac{1}{N} \sum_{k=1}^{N} \binom{N-1}{k-1}^{-1} \sum_{S \subseteq D \setminus \{z\},\, |S| = k-1} \left[ v(S \cup \{z\}) - v(S) \right].$$
In simple terms, the Shapley value is a weighted average of the marginal contribution $v(S \cup \{z\}) - v(S)$, i.e., the utility change when the point $z$ is added to different subsets $S$. For simplicity, we often write $\phi_z$ when the utility function is clear from the context. The Shapley value uniquely satisfies several important axioms, including key fairness requirements like the null player and symmetry axioms, lending justification to its popularity. See Appendix A for detailed axiom definitions.
2.2 KNN-Shapley
A well-known challenge of using the Shapley value is that its exact calculation is computationally infeasible in general, as it usually requires evaluating $v(S)$ for all $2^N$ possible subsets $S \subseteq D$. A surprising result in [Jia et al., 2019a] showed that when the learning algorithm is unweighted KNN, there exists a highly efficient algorithm for computing its exact Data Shapley score.
K Nearest Neighbor Classifier. Given a validation data point $z_{val} = (x_{val}, y_{val})$ and a distance metric $d(\cdot, \cdot)$, we sort the training set $D$ according to the distance to the validation point $d(x_i, x_{val})$ in non-descending order. Throughout the paper, we assume that $d(x_i, x_{val}) \ne d(x_j, x_{val})$ for any $i \ne j$ unless otherwise specified. A KNN classifier makes a prediction for the query $x_{val}$ based on the (weighted) majority voting among $x_{val}$’s $K$ nearest neighbors in the training set. Weight of a data point: in weighted KNN, each data point $z_i$ is associated with a weight $w_i$. The weight is usually determined based on the distance between $x_i$ and the query $x_{val}$. For example, a popular weight function is the RBF kernel $w_i = \exp\left(-\|x_i - x_{val}\|^2 / \sigma^2\right)$. If $w_i$ is the same for all $i$, it becomes unweighted KNN.
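To make the weighted voting concrete, below is a minimal sketch (not the paper's implementation) of a hard-label weighted KNN prediction with RBF-kernel weights; the bandwidth `sigma` and the $\ell_2$ distance are assumptions for illustration.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_val, K=5, sigma=1.0):
    """Hard-label weighted KNN: weighted majority vote among the K nearest neighbors."""
    dists = np.linalg.norm(X_train - x_val, axis=1)      # distance of every training point to the query
    nn = np.argsort(dists)[:K]                           # indices of the K nearest neighbors
    weights = np.exp(-dists[nn] ** 2 / sigma ** 2)       # RBF-kernel weights (closer -> larger weight)
    classes = np.unique(y_train)
    votes = [weights[y_train[nn] == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]                # class with the largest total weight
```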
[Jia et al., 2019a] considers the utility function for the weighted, soft-label KNN:
$$U(S; z_{val}) = \frac{\sum_{j=1}^{\min(K, |S|)} w_{\alpha_j^{(S)}} \cdot \mathbb{1}[y_{\alpha_j^{(S)}} = y_{val}]}{\sum_{j=1}^{\min(K, |S|)} w_{\alpha_j^{(S)}}} \qquad (1)$$
where $\alpha_j^{(S)}$ denotes the index (among $D$) of the $j$th closest data point in $S$ to $x_{val}$. “Soft-label” refers to classifiers that output confidence scores. The main result in [Jia et al., 2019a] shows that for unweighted KNN, we can compute the exact Shapley value for all $z_i \in D$ within a total runtime of $O(N \log N)$ (see Appendix B for details).
Remark 1.
In practice, the model performance is assessed based on a validation set $D_{val}$. After computing the Shapley value $\phi_{z_i}(v^{(z_{val})})$ with respect to each individual validation point $z_{val} \in D_{val}$, one can compute the Shapley value corresponding to the utility function on the full validation set by simply taking the sum $\sum_{z_{val} \in D_{val}} \phi_{z_i}(v^{(z_{val})})$, due to the linearity property of the Shapley value.
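For illustration, the aggregation in Remark 1 amounts to a simple sum; `value_fn` below is a hypothetical routine returning the $N$ scores with respect to a single validation point.

```python
import numpy as np

def values_over_validation_set(X_train, y_train, X_val, y_val, value_fn):
    """Shapley values w.r.t. a validation set = sum of values w.r.t. each validation point
    (linearity of the Shapley value). `value_fn` is a placeholder for any per-point routine."""
    total = np.zeros(len(X_train))
    for x_v, y_v in zip(X_val, y_val):
        total += value_fn(X_train, y_train, x_v, y_v)
    return total
```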
Remark 2.
In alignment with the existing literature [Jia et al., 2019a, Wang et al., 2023], our discussion of the time complexity of KNN-Shapley refers to the total runtime needed to calculate all $N$ data value scores, given that standard applications of data valuation, such as profit allocation and bad data detection, all require computing the data value scores for every data point in the training set. Furthermore, the stated runtime is with respect to a single validation point, and the overall runtime will be multiplied by the size of the validation set $|D_{val}|$. Runtime is typically presented this way because SV computations with respect to different validation points are independent and readily benefit from parallel computing.
Since its introduction, KNN-Shapley has rapidly gained popularity in data valuation for its efficiency and effectiveness, and is being advocated as the ‘most practical technique for effectively evaluating large-scale data’ in recent studies [Pandl et al., 2021, Karlaš et al., 2022].
2.3 Baseline Algorithm for Computing and Approximating WKNN-Shapley
[Jia et al., 2019a] developed an efficient algorithm to calculate the exact unweighted KNN-Shapley. However, when it comes to the more general weighted KNN-Shapley, only an $O(N^K)$ algorithm is given. While still polynomial time (if $K$ is considered a constant), the runtime is impractically large even for small $K$ (e.g., 5). Here, we review the high-level idea of the baseline algorithms from [Jia et al., 2019a].
An $O(N^K)$ algorithm for exact WKNN-Shapley computation. From Definition 1, the Shapley value for $z_i$ is a weighted average of the marginal contribution (MC) $v(S \cup \{z_i\}) - v(S)$; hence, we only need to study those $S$ whose utility might change due to the inclusion of $z_i$. For KNN, those are the subsets $S$ where $z_i$ is within the $K$ nearest neighbors of $x_{val}$ after being added into $S$. Note that for KNN, the utility of any $S$ only depends on the $K$ nearest neighbors of $x_{val}$ in $S$. Given that there are only $O(N^K)$ unique subsets of size at most $K$, we can simply query the MC value for all $S$ of size at most $K$. For any larger $S$, the MC must be the same as that of the subset formed by its $K$ nearest neighbors. We can then compute the Shapley value as a weighted average of these MC values by counting the number of subsets that share the same MC values through simple combinatorial analysis. Such an algorithm results in a runtime of $O(N^K)$. The algorithm details can be found in Appendix B.
Monte Carlo Approximation. Given the large runtime of this exact algorithm, [Jia et al., 2019a] further proposes an approximation algorithm based on Monte Carlo techniques. However, Monte Carlo-based approximation is randomized and may not preserve the fairness property of the exact Shapley value. Additionally, the sample complexity of the Monte Carlo estimator is derived from concentration inequalities, which, while suitable for asymptotic analysis, may provide loose bounds in practical applications.
3 Making KNN Configurations Shapley-friendly
We point out the major challenges associated with directly improving the computational efficiency for the soft-label KNN configuration considered in [Jia et al., 2019a], and propose proper changes that enable more efficient algorithms for computing WKNN-Shapley.
Challenge #1: weights normalization term. The key principle behind the algorithm for unweighted KNN-Shapley from [Jia et al., 2019a] is that the MC $v(S \cup \{z_i\}) - v(S)$ can only take a few distinct values. For example, for any $S$ with $|S| \ge K$ in which $z_i$ is among the $K$ nearest neighbors of $x_{val}$, the MC is determined solely by the label of $z_i$ and the label of the $K$th nearest neighbor it displaces. To avoid the task of evaluating the MC for all $S$, one can just count the subsets $S$ such that $z_i$ is among the $K$ nearest neighbors of $x_{val}$ in $S \cup \{z_i\}$, as well as the subsets that share the same $K$th nearest neighbor to $x_{val}$. However, for weighted soft-label KNN with the utility function in (1), there is little chance that any two subsets have the same MC value due to the weights normalization term (note that this term is a constant in the unweighted setting). Solution #1: hard-label KNN. In this work, we instead consider the utility function for weighted hard-label KNN. “Hard-label” refers to classifiers that output the predicted class instead of the confidence scores (see (2) in Section 4). In practice, user-facing applications usually only output a class prediction instead of the entire confidence vector. More importantly, hard-label KNN’s prediction only depends on the weight comparison between different classes, and hence its utility function does not have a normalization term.
Challenge #2: continuous weights. If the weights lie in a continuous space, there are infinitely many possible weighted voting scores of the $K$ nearest neighbors. This makes it difficult to analyze which subsets share the same MC value. Solution #2: discretize weights. Therefore, we consider a more tractable setting where the weights lie in a discrete space. Such a change is reasonable since, in practice, weights are stored with finitely many bits (and hence already lie in a discrete space). Moreover, rounding is a deterministic operation and does not reverse the original order of weights. In Appendix C, we show that the Shapley value computed based on the discrete weights has the same ranking order as the Shapley value computed on the continuous weights (discretization might create ties but will not reverse the original order). In Appendix LABEL:sec:eval-discretization, we empirically verify that weight discretization does not cause a large deviation in the Shapley value.
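A minimal sketch of the discretization step we have in mind (function and parameter names are ours): each weight in $[0,1]$ is rounded to the nearest of $2^b$ equally spaced grid points.

```python
import numpy as np

def discretize_weights(weights, num_bits=3):
    """Round weights in [0, 1] to the nearest of 2**num_bits equally spaced grid points.
    Rounding is deterministic and never swaps the order of two weights (it may create ties)."""
    grid = np.linspace(0.0, 1.0, 2 ** num_bits)                    # the discrete weight space
    idx = np.abs(weights[:, None] - grid[None, :]).argmin(axis=1)  # nearest grid point per weight
    return grid[idx]
```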
We emphasize that proper adjustments to the underlying utility function are important for efficient Shapley computation, and such a strategy is frequently applied in the game theory literature [Dall’Aglio et al., 2019].
4 Data Shapley for Weighted KNN
In this section, we develop efficient solutions for computing and approximating Data Shapley scores for weighted, hard-label KNN binary classifiers, where the weight values used in KNN are discretized. Without loss of generality, in this paper we assume every weight $w_i \in [0, 1]$ (if this does not hold, one can simply normalize the weights to $[0, 1]$ and the KNN classifier remains the same). We use $\mathbf{W}$ to denote the discretized space of $[0, 1]$, where we create $2^b$ equally spaced points within the interval when we use $b$ bits for discretization. We denote by $W := |\mathbf{W}| = 2^b$ the size of the weight space.
Utility Function for Weighted Hard-Label KNN Classifiers. The utility function of weighted hard-label KNN, i.e., the correctness of weighted KNN on the queried example $z_{val} = (x_{val}, y_{val})$, can be written as
$$U(S; z_{val}) = \mathbb{1}\left[ y_{val} \in \arg\max_{c \in \mathcal{C}} \sum_{j=1}^{\min(K, |S|)} w_{\alpha_j^{(S)}} \cdot \mathbb{1}[y_{\alpha_j^{(S)}} = c] \right] \qquad (2)$$
where $\mathcal{C}$ is the space of classes and $C := |\mathcal{C}|$ is the number of classes (if multiple classes have the same top counts, we take the utility as 1 as long as $y_{val}$ is among the majority classes). We omit the input $z_{val}$ and simply write $U(S)$ when the validation point is clear from the context. For the KNN binary classifier, we can rewrite the utility function in a more compact form:
$$U(S; z_{val}) = \mathbb{1}\left[ \sum_{j=1}^{\min(K, |S|)} \widetilde{w}_{\alpha_j^{(S)}} \ge 0 \right], \quad \text{where } \widetilde{w}_i := w_i \left( 2 \cdot \mathbb{1}[y_i = y_{val}] - 1 \right) \text{ is the signed weight of } z_i. \qquad (3)$$
For ease of presentation, we present the algorithms for the KNN binary classifier here, and defer the extension to the multi-class classifier to Appendix LABEL:appendix:multiclass.
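The compact form above can be evaluated directly by summing signed weights, as in the following sketch (variable names and the handling of the empty subset are our assumptions; ties are counted as correct, consistent with the tie-breaking convention above).

```python
import numpy as np

def wknn_binary_utility(dists, weights, labels, y_val, K=5):
    """0/1 correctness of weighted hard-label KNN on one query for the subset described
    by (dists, weights, labels): the prediction is correct iff the sum of signed weights
    (+w for a matching label, -w otherwise) over the K nearest neighbors is non-negative."""
    if len(dists) == 0:
        return 1  # empty subset: both classes tie, counted as correct (assumption)
    nn = np.argsort(dists)[:K]
    signed = weights[nn] * (2.0 * (labels[nn] == y_val) - 1.0)
    return int(signed.sum() >= 0)
```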
4.1 Exact WKNN-Shapley Calculation
4.1.1 Computing SV is a Counting Problem
Given that the Shapley value is a weighted average of the marginal contribution $U(S \cup \{z_i\}) - U(S)$, we first study the expression of the marginal contribution for a fixed subset $S \subseteq D \setminus \{z_i\}$ with the utility function in (3).
Theorem 2.
For any data point $z_i$ and any subset $S \subseteq D \setminus \{z_i\}$, the marginal contribution has the following expression:
$$U(S \cup \{z_i\}) - U(S) = \mathbb{1}[\mathcal{C}_0] \left( \mathbb{1}[\mathcal{C}_1] - \mathbb{1}[\mathcal{C}_2] \right) \qquad (4)$$
where $\mathcal{C}_0$ denotes the event that $z_i$ is among the $K$ nearest neighbors of $x_{val}$ in $S \cup \{z_i\}$, $\mathcal{C}_1$ denotes the event that adding $z_i$ changes the weighted KNN prediction from incorrect to correct, and $\mathcal{C}_2$ denotes the event that adding $z_i$ changes the prediction from correct to incorrect.
In words, the condition $\mathcal{C}_0$ means that $z_i$ should be among the $K$ nearest neighbors to the query sample when it is added to $S$. The conditions $\mathcal{C}_1$ and $\mathcal{C}_2$ cover the situations where adding $z_i$ to the set changes the prediction of the weighted KNN classifier. In greater detail, $\mathcal{C}_1$ captures the condition under which incorporating $z_i$ shifts the effective sum of signed weights from a negative to a non-negative value, thereby incrementing the utility from $0$ to $1$. $\mathcal{C}_2$ can be interpreted similarly. From Theorem 2 and the formula of the Shapley value (Definition 1), we can reframe the problem of computing hard-label WKNN-Shapley as a counting problem. Specifically, this involves counting the quantity $G_{i,\ell}$ defined as follows:
Definition 3.
Let $G_{i,\ell}$ denote the count of subsets $S \subseteq D \setminus \{z_i\}$ of size $\ell$ that satisfy (1) $\mathcal{C}_0$, and (2) $\mathcal{C}_1$ if $y_i = y_{val}$, or $\mathcal{C}_2$ if $y_i \ne y_{val}$.
Theorem 4.
For a weighted, hard-label KNN binary classifier using the utility function given by (3), the Shapley value of a data point $z_i$ can be expressed as:
$$\phi_{z_i} = \left( 2 \cdot \mathbb{1}[y_i = y_{val}] - 1 \right) \sum_{\ell=0}^{N-1} \frac{G_{i,\ell}}{N \binom{N-1}{\ell}} \qquad (5)$$
Figure 1 illustrates the counting problem we try to solve here.
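As a sanity check for tiny $N$, the counting reformulation can also be evaluated by brute force: for each subset size, count the subsets whose 0/1 utility flips when $z_i$ is added, and combine the counts with the Shapley weights. The sketch below is an exponential-time reference for verifying faster algorithms only; `utility` can be any 0/1 utility, such as the signed-weight sketch of (3) above adapted to index tuples.

```python
from itertools import combinations
from math import comb

def shapley_via_counts(N, i, utility):
    """Exact Shapley value of point i via the counting view: for each subset size l,
    count size-l subsets of the remaining points whose 0/1 utility increases (g_plus)
    or decreases (g_minus) when i is added, then take the Shapley-weighted sum.
    `utility` maps a tuple of training-point indices to {0, 1}. Exponential time."""
    others = [j for j in range(N) if j != i]
    phi = 0.0
    for l in range(N):                                  # subset size |S| = l
        g_plus = g_minus = 0
        for S in combinations(others, l):
            before, after = utility(S), utility(S + (i,))
            g_plus += int(after > before)
            g_minus += int(after < before)
        phi += (g_plus - g_minus) / (N * comb(N - 1, l))
    return phi
```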

4.1.2 Dynamic Programming Solution for Computing $G_{i,\ell}$
The multiple, intricate conditions wrapped within $G_{i,\ell}$’s definition can pose a formidable challenge for direct and efficient counting. The main rationale behind our solution is to break down the complex counting problem into smaller, more manageable subproblems, thereby making them amenable to algorithmic solutions like dynamic programming. Before delving into the specifics of the algorithm, we introduce an intermediary quantity that becomes the building block of our dynamic programming formulation.
Definition 5.
Let denote the count of subsets of size that satisfy (1) , as well as the following conditions: (2) Within , the data point is the -th closest to the query example , (3) .
We can relate this auxiliary quantity to our desired $G_{i,\ell}$ as follows:
Theorem 6 (Relation between $G_{i,\ell}$ and the auxiliary quantity).
For , we can compute from as follows:
(6) |
For , we have:
(7) |
The auxiliary quantity thus serves as a pivot, allowing us to explore the search space of possible subsets more systematically. We next exploit its computational advantage. Specifically, it can be conveniently computed with (recursive) formulas, which further enables us to compute $G_{i,\ell}$ with reduced computational demand.
Theorem 7 (simplified version).
For , can be computed from with . For , can be computed from with .
Leveraging the results from Theorem 7, a direct method for calculating $G_{i,\ell}$ for all $i$ and $\ell$ is as follows: we first use the recursive formula to compute the auxiliary quantity in one regime of indices, and then use the explicit formula for the remaining regime. With the auxiliary quantities computed, we apply Theorem 6 to compute $G_{i,\ell}$. This direct approach, however, is more expensive than necessary, as it requires computing the auxiliary quantity for every data point and every index in its range.
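The exact recursion is deferred to the appendix; the sketch below only conveys the flavor of the dynamic program: over a pool of discretized signed weights, it counts how many subsets of each size attain each possible signed-weight sum. All names are ours and the recursion is a generic subset-counting update, not the paper's formula.

```python
import numpy as np

def count_subsets_by_signed_sum(signed_weights, max_size, grid_step):
    """T[l, span + s] = number of subsets of size l whose signed-weight sum is s * grid_step.
    A standard 0/1 counting DP: process one point at a time and shift the sum axis."""
    offsets = np.rint(np.asarray(signed_weights) / grid_step).astype(int)
    span = int(np.abs(offsets).sum())                    # all sums lie in [-span, span] grid units
    T = np.zeros((max_size + 1, 2 * span + 1), dtype=np.int64)
    T[0, span] = 1                                       # the empty subset has sum 0
    for o in offsets:
        for l in range(max_size, 0, -1):                 # descending l keeps the update one-pass
            shifted = np.roll(T[l - 1], o)               # including this point shifts the sum by o
            if o > 0:
                shifted[:o] = 0                          # discard wrap-around
            elif o < 0:
                shifted[o:] = 0
            T[l] += shifted
    return T, span
```

Summing `T[l, span:]`, for instance, counts the size-$l$ subsets on which a hard-label vote would be correct; the paper's recursion additionally conditions on the rank of $z_i$ among the neighbors (Definition 5) and exploits the short-cut of Theorem 8.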
Further Improvement of Efficiency Through a Short-cut Formula. While the above direct approach offers a clear path, there exists a more optimized algorithm that expedites computational efficiency by circumventing part of the explicit calculation of the auxiliary quantities. Specifically, we discover a short-cut formula that allows us to directly calculate the summation from Theorem 4 once the auxiliary quantities are obtained.
Theorem 8.
For a weighted, hard-label KNN binary classifier using the utility function given by (3), the Shapley value of data point can be expressed as:
where
Crucially, the quantities in the above expression can be efficiently calculated using a clever caching technique. Based on the above findings, we can eliminate an extra factor in the final time complexity, thereby obtaining a quadratic-time algorithm to compute the exact WKNN-Shapley. The comprehensive pseudocode for the full algorithm can be found in Appendix LABEL:appendix:pseudocode.
Theorem 9.
Algorithm LABEL:alg:efficient-wknn (in Appendix LABEL:appendix:pseudocode) computes the exact WKNN-Shapley for all $z_i \in D$ and achieves a total runtime that is quadratic in $N$ (the dependence on $K$ and $W$ is discussed in Remark 3).
Remark 3 (Runtime Dependency on $K$ and $W$).
While the time complexity in Theorem 9 also depends on $W$ and $K$, these variables can be effectively treated as constants in our context. In our ablation study, we found that the error caused by weight discretization shrinks quickly as the number of bits for discretization grows. Hence, across all experiments in Section 5, we set the number of bits for discretization to 3 and therefore $W = 8$. Additionally, the selection of $K$ in the KNN-Shapley literature commonly stabilizes around small values such as 5 or 10, irrespective of the dataset size [Jia et al., 2019a, Wang et al., 2023]. This stability arises because, in the context of KNN classifiers, increasing $K$ can easily result in underfitting even when $N$ is large. Throughout our experiments, we fix $K = 5$. A detailed ablation study examining different choices of $K$ and $W$ is available in Appendix LABEL:appendix:eval.
Remark 4 (Novelty in the derivation of WKNN-Shapley).
Our derivation of WKNN-Shapley starts similarly to [Jia et al., 2019a] by examining the marginal contribution, but the methodologies significantly differ afterward. Unweighted KNN-Shapley benefits from the simplicity of its utility function, allowing for a relatively straightforward derivation. Specifically, [Jia et al., 2019a] plugs unweighted KNN’s utility function into the formula for the difference of the Shapley values between two data points, and then simplifies the expression using known equalities from combinatorial analysis. In contrast, the same approach does not apply to WKNN-Shapley due to the complexity introduced by weights. Therefore, we develop a novel dynamic programming solution, which marks a substantial departure from the techniques used in [Jia et al., 2019a]. We would like to clarify that there is no correlation between the “recursion” in [Jia et al., 2019a] and the recursion in our paper. In [Jia et al., 2019a], the value of each data point is recursively computed from the value of the next data point in the distance rank. On the contrary, in our method, the computation of different data points’ WKNN-Shapley scores is independent of each other; recursion is instead a fundamental component of the dynamic program used to solve the complex counting problem.
4.2 Deterministic Approximation for Weighted KNN-Shapley
While the algorithm introduced in Section 4.1 for calculating the exact WKNN-Shapley achieves an $O(N^2)$ runtime with respect to $N$, a huge improvement over the original $O(N^K)$ algorithm from [Jia et al., 2019a], there remains room for further improving the efficiency if we only require an approximation of the Shapley value. Contrary to the prevalent use of Monte Carlo techniques in the existing literature, in this section we develop a deterministic approximation algorithm for WKNN-Shapley.
Intuition. From Theorem 7, in order to compute the auxiliary quantity at a given index, we only need to know the auxiliary quantities at neighboring indices. Moreover, observe that the building blocks of these quantities can be quite small, as each only takes a summation over a small range of the weight space. Hence, beyond some prespecified threshold $M^\star$, we can reuse the quantities computed up to $M^\star$ as approximations for all of the remaining ones. The resultant approximation for the Shapley value is stated as follows:
Definition 10.
We define the approximation as
where
To calculate the approximation, we only need to compute the auxiliary quantities for indices from $1$ to $M^\star$ instead of over their full range, thereby reducing the runtime of Algorithm LABEL:alg:efficient-wknn with minimal modification to the exact algorithm’s implementation.
Theorem 11.
Algorithm LABEL:alg:efficient-wknn-approx (in Appendix LABEL:appendix:pseudocode-approx) computes the approximated WKNN-Shapley for all and achieves a total runtime of .
In particular, when $M^\star$ is small, we achieve a subquadratic runtime. The selection of $M^\star$ is discussed in Remark 5. In the following, we derive the error bound and point out two nice properties of this approximation.
Theorem 12.
For any , the approximated Shapley value (1) shares the same sign as , (2) ensures , and (3) has the approximation error bounded by where
Leveraging the error bound alongside the additional properties of the approximation stated in Theorem 12, we can obtain a deterministic interval within which the exact value $\phi_{z_i}$ always resides: the approximation shares the sign of the exact value, and the exact value lies within the stated error bound of the approximation. Unlike the commonly used Monte Carlo method for approximating the Shapley value, which only offers a high-probability interval and allows for a failure possibility that the exact value might fall outside of it, our deterministic approximation ensures that the exact value is always within the corresponding interval.
The approximated WKNN-Shapley preserves the fairness axioms. The Shapley value’s axiomatic properties, particularly the symmetry and null player axioms, are of great importance for ensuring fairness when attributing value to individual players (see Appendix A for the formal definitions of the two axioms). These fundamental axioms have fostered widespread adoption of the Shapley value. An ideal approximation of the Shapley value, therefore, should preserve at least the symmetry and null player axioms to ensure that the principal motivations for employing the Shapley value—fairness and equity—are not diminished. The prevalent Monte Carlo-based approximation techniques give randomized results and necessarily muddy the clarity of fairness axioms. In contrast, our deterministic approximation preserves both important axioms.
Theorem 13.
The approximated Shapley value satisfies the symmetry and null player axioms.
Remark 5 (Selection of $M^\star$).
Ideally, we would like to pick the smallest $M^\star$ such that the resulting truncation error is significantly smaller than the magnitude of the value score for a significant portion of data points. However, determining a universally applicable heuristic for setting $M^\star$ is challenging due to the varying magnitude of value scores across different datasets, which is difficult to anticipate. For example, in a case where all but one data point are “null players”, that single data point will possess the entire value, while all others will be valued at 0. On the other hand, if all data points are identical, each will receive an equal share of the total value. Therefore, we suggest selecting $M^\star$ in an adaptive way. Specifically, for each data point, we compute the value estimate incrementally, halting the computation when the magnitude of the latest increment is substantially smaller than (e.g., a small fraction of) the magnitude of the running estimate for a majority of data points. This approach does not increase the overall runtime since the refined estimate can be easily computed from the previous one and the additionally computed terms. Details of this approach are in Appendix LABEL:appendix:how-to-select-mstar. Furthermore, we highlight that our deterministic approximation algorithm maintains the fairness properties of exact WKNN-Shapley. Hence, in practice, we can potentially use a smaller $M^\star$ and still get satisfactory performance in discerning data quality. Throughout our experiments in Section 5, we find that a fixed small choice of $M^\star$ consistently works well across all benchmark datasets.
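A sketch of the adaptive stopping rule described above (the relative tolerance and the "majority of points" criterion below are our own illustrative choices):

```python
import numpy as np

def adaptive_truncation(increments, rel_tol=0.01, majority=0.5):
    """Pick a truncation index: stop once the latest per-point increment is at most
    rel_tol times the running estimate for at least a `majority` fraction of points.
    `increments` has shape (num_points, max_terms)."""
    running = np.zeros(increments.shape[0])
    for m in range(increments.shape[1]):
        running += increments[:, m]
        small = np.abs(increments[:, m]) <= rel_tol * np.abs(running)
        if small.mean() >= majority:
            return m + 1, running          # truncate after m + 1 terms
    return increments.shape[1], running    # never triggered: no truncation
```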
5 Numerical Experiments
We systematically evaluate the performance of our WKNN-Shapley computation and approximation algorithms. Our experiments aim to demonstrate the following assertions: (1) Our exact and deterministic approximation algorithms for WKNN-Shapley significantly improve computational efficiency compared to the baseline algorithm and Monte Carlo approximation, respectively. (2) Compared to unweighted KNN-Shapley, WKNN-Shapley (2-1) achieves a better performance in discerning data quality in several important downstream tasks, including mislabeled/noisy data detection and data selection, and (2-2) demonstrates more stable performance across different choices of $K$. (3) The approximated WKNN-Shapley, while being more efficient, consistently achieves performance comparable to the exact WKNN-Shapley on most of the benchmark datasets.
5.1 Runtime Comparison
We empirically assess the computational efficiency of our exact and deterministic approximation algorithms for WKNN-Shapley, comparing them to the exact algorithm and the Monte Carlo approximation presented in [Jia et al., 2019a]. We examine various training data sizes $N$ and compare the execution clock time of different algorithms at each $N$. In data size regimes where the baseline algorithms from [Jia et al., 2019a] are infeasible to execute (requiring many hours), we fit a polynomial curve to the smaller data size regimes and plot the predicted extrapolation for larger sizes.
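The extrapolation step can be reproduced with a least-squares polynomial fit; the measured timings and the degree below are placeholders, not the paper's numbers.

```python
import numpy as np

sizes = np.array([100, 200, 400, 800, 1600])                 # training set sizes N (placeholders)
seconds = np.array([0.05, 0.21, 0.83, 3.4, 13.9])            # hypothetical measured clock times

coeffs = np.polyfit(sizes, seconds, deg=2)                   # degree 2 mirrors a quadratic-time method
extrapolated = np.polyval(coeffs, np.array([5000, 10000]))   # predicted runtime at larger N
print(extrapolated)
```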
Figure 2 shows that the exact algorithm from [Jia et al., 2019a] requires hours to run even at small training data sizes, rendering it impractical for actual use. In contrast, our exact algorithm for computing WKNN-Shapley achieves significantly better computational efficiency. Our deterministic approximation algorithm is compared to the Monte Carlo approximation from [Jia et al., 2019a]. For a fair comparison, both algorithms are aligned to the same theoretical error bounds. Note that the error bound for the Monte Carlo algorithm is a high-probability bound, subject to a small failure probability. In contrast, our error bound always holds, offering a more robust guarantee compared to the Monte Carlo technique. Nonetheless, from Figure 2 we can see that our deterministic approximation algorithm not only provides a stronger approximation guarantee but also achieves significantly greater efficiency. This showcases the remarkable improvements of our techniques.

5.2 Discerning Data Quality
Due to the superior computational efficiency of our newly developed algorithms, WKNN-Shapley has become a practical data valuation technique for actual use. In this section, we evaluate its effectiveness in discerning data quality for common real-world applications. Tasks: we consider three applications that are commonly used for evaluating the performance of data valuation techniques in prior works [Ghorbani and Zou, 2019, Kwon and Zou, 2022, Wang and Jia, 2023a]: mislabeled data detection, noisy data detection, and data selection. Due to space constraints, we only present the results for mislabeled data detection here, and defer the results for the other two tasks to Appendix LABEL:appendix:eval. Mislabeled data usually detrimentally impacts model performance. Hence, a reasonable data valuation technique should assign low values to these data points. In our experiments for mislabeled data detection, we randomly select 10% of the data points and flip their labels. Baselines & Settings & Hyperparameters: We evaluate the performance of both the exact and approximated WKNN-Shapley. We use the $\ell_2$ distance and the popular RBF kernel to determine the weights of training points. We discretize the weights to 3 bits, as we find this level of precision offers a balance between good performance and computational efficiency, with a weight space size $W$ of merely 8. In Appendix LABEL:appendix:eval, we conduct ablation studies on the choice of the number of bits for discretization. For approximated WKNN-Shapley, we set $M^\star$ as described in Remark 5. Our primary baseline is the unweighted, soft-label KNN-Shapley from [Jia et al., 2019a]. Since our WKNN-Shapley corresponds to hard-label KNN, we also include the unweighted, hard-label KNN-Shapley in the comparison for completeness. Note that it can be computed by simply setting the weights of all data points to a constant.
Results. We use AUROC as the performance metric on the mislabeled data detection task. Unweighted vs weighted KNN-Shapley: Table 1 shows the AUROC scores across the 13 benchmark datasets we experimented on. Notably, both exact and approximated WKNN-Shapley markedly outperform the unweighted KNN-Shapley (either soft-label or hard-label) across most datasets. This can likely be attributed to WKNN-Shapley’s ability to more accurately differentiate between bad and good data based on the proximity to the queried example. In Appendix LABEL:appendix:eval-comparison, we present a qualitative study highlighting why WKNN-Shapley outperforms unweighted KNN-Shapley in discerning data quality. Exact vs Approximated WKNN-Shapley: From Table 1, we can see the encouraging result that the approximated WKNN-Shapley achieves performance comparable to (and sometimes even slightly better than) the exact WKNN-Shapley across the majority of datasets. This is likely attributable to its favorable property of preserving the fairness properties of its exact counterpart. Robustness to the choice of $K$: In Figure 3, we show that, compared to unweighted KNN-Shapley, WKNN-Shapley maintains notably stable performance across various choices of $K$, particularly for larger values of $K$. This is because benign data points—though within the $K$ nearest neighbors of the query example and possessing different labels—may not receive a very low value due to their likely distant positioning from the query example. Conversely, unweighted KNN-Shapley tends to assign these benign points lower values.
Table 1: AUROC scores of mislabeled data detection for unweighted KNN-Shapley (soft-label and hard-label) and our exact and approximated WKNN-Shapley.

| Dataset | Unweighted KNN-Shapley (soft-label) | Unweighted KNN-Shapley (hard-label) | WKNN-Shapley (exact) | WKNN-Shapley (approx.) |
| 2DPlanes | 0.849 | 0.8 | 0.884 | 0.831 |
| CPU | 0.867 | 0.929 | 0.956 | 0.956 |
| Phoneme | 0.707 | 0.724 | 0.773 | 0.778 |
| Fraud | 0.556 | 0.547 | 0.751 | 0.596 |
| Creditcard | 0.698 | 0.676 | 0.842 | 0.747 |
| Vehicle | 0.689 | 0.724 | 0.8 | 0.813 |
| Click | 0.627 | 0.6 | 0.751 | 0.693 |
| Wind | 0.836 | 0.849 | 0.858 | 0.88 |
| Pol | 0.907 | 0.862 | 1 | 0.991 |
| MNIST | 0.724 | 0.471 | 0.831 | 0.836 |
| CIFAR10 | 0.684 | 0.76 | 0.76 | 0.756 |
| AGNews | 0.953 | 0.978 | 0.991 | 0.988 |
| DBPedia | 0.968 | 0.902 | 1 | 1 |
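The mislabeled-data detection protocol used above can be summarized by the following sketch; `compute_values` stands in for any data valuation method, and the 10% flip rate follows the setup described earlier.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mislabel_detection_auroc(X, y, compute_values, num_classes, flip_rate=0.1, seed=0):
    """Flip the labels of a random fraction of points, score every point with a data
    valuation method, and report the AUROC of (-value) as a mislabeled-data detector."""
    rng = np.random.default_rng(seed)
    flip_idx = rng.choice(len(y), size=int(flip_rate * len(y)), replace=False)
    y_noisy = y.copy()
    y_noisy[flip_idx] = (y_noisy[flip_idx] + rng.integers(1, num_classes, len(flip_idx))) % num_classes
    is_flipped = np.zeros(len(y), dtype=int)
    is_flipped[flip_idx] = 1
    values = compute_values(X, y_noisy)        # lower value should indicate worse data
    return roc_auc_score(is_flipped, -values)
```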

6 Conclusion
In this study, we addressed WKNN-Shapley computation and approximation when using the accuracy of hard-label KNN with discretized weights as the utility function. Future work should explore polynomial time computation of exact Data Shapley for other learning algorithms. It is especially important to think about whether we can make some modifications to more complicated learning algorithms, e.g., neural networks, so that their exact Data Shapley can be computed efficiently.
Acknowledgments
This work was supported in part by the National Science Foundation under grants CNS-2131938, CNS-1553437, CNS-1704105, IIS-2312794, IIS-2313130, OAC-2239622, the ARL’s Army Artificial Intelligence Innovation Institute (A2I2), the Office of Naval Research Young Investigator Award, the Army Research Office Young Investigator Prize, Schmidt DataX award, Princeton E-ffiliates Award, Amazon-Virginia Tech Initiative in Efficient and Robust Machine Learning, the Commonwealth Cyber Initiative, and a Princeton’s Gordon Y. S. Wu Fellowship. We are grateful to anonymous reviewers at AISTATS for their valuable feedback.
References
- [Anonymous, 2021] Anonymous (2021). Towards general robustness to bad training data.
- [Auer et al., 2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007+ ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings, pages 722–735. Springer.
- [Belaid et al., 2023] Belaid, M. K., Mekki, D. E., Rabus, M., and Hüllermeier, E. (2023). Optimizing data Shapley interaction calculation from O(2^n) to O(t n^2) for KNN models. arXiv preprint arXiv:2304.01224.
- [Bian et al., 2021] Bian, Y., Rong, Y., Xu, T., Wu, J., Krause, A., and Huang, J. (2021). Energy-based learning for cooperative games, with applications to valuation problems in machine learning. arXiv preprint arXiv:2106.02938.
- [Courtnage and Smirnov, 2021] Courtnage, C. and Smirnov, E. (2021). Shapley-value data valuation for semi-supervised learning. In Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11–13, 2021, Proceedings 24, pages 94–108. Springer.
- [Dal Pozzolo et al., 2015] Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence, pages 159–166. IEEE.
- [Dall’Aglio et al., 2019] Dall’Aglio, M., Fragnelli, V., and Moretti, S. (2019). Sometimes the computation of the shapley value is simple. Handbook of the Shapley value, 441.
- [Ghorbani et al., 2020] Ghorbani, A., Kim, M., and Zou, J. (2020). A distributional framework for data valuation. In International Conference on Machine Learning, pages 3535–3544. PMLR.
- [Ghorbani and Zou, 2019] Ghorbani, A. and Zou, J. (2019). Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242–2251. PMLR.
- [Ghorbani et al., 2022] Ghorbani, A., Zou, J., and Esteva, A. (2022). Data shapley valuation for efficient batch active learning. In 2022 56th Asilomar Conference on Signals, Systems, and Computers, pages 1456–1462. IEEE.
- [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- [Jia et al., 2019a] Jia, R., Dao, D., Wang, B., Hubis, F. A., Gurel, N. M., Li, B., Zhang, C., Spanos, C. J., and Song, D. (2019a). Efficient task-specific data valuation for nearest neighbor algorithms. Proceedings of the VLDB Endowment.
- [Jia et al., 2019b] Jia, R., Dao, D., Wang, B., Hubis, F. A., Hynes, N., Gürel, N. M., Li, B., Zhang, C., Song, D., and Spanos, C. J. (2019b). Towards efficient data valuation based on the shapley value. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1167–1176. PMLR.
- [Just et al., 2022] Just, H. A., Kang, F., Wang, T., Zeng, Y., Ko, M., Jin, M., and Jia, R. (2022). Lava: Data valuation without pre-specified learning algorithms. In The Eleventh International Conference on Learning Representations.
- [Karlaš et al., 2022] Karlaš, B., Dao, D., Interlandi, M., Li, B., Schelter, S., Wu, W., and Zhang, C. (2022). Data debugging with shapley importance over end-to-end machine learning pipelines. arXiv preprint arXiv:2204.11131.
- [Kenton and Toutanova, 2019] Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- [Khandelwal et al., 2019] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. (2019). Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
- [Krizhevsky et al., 2009] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.
- [Kwon and Zou, 2022] Kwon, Y. and Zou, J. (2022). Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 8780–8802. PMLR.
- [Kwon and Zou, 2023] Kwon, Y. and Zou, J. (2023). Data-oob: Out-of-bag estimate as a simple and efficient data value. ICML.
- [LeCun, 1998] LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- [Liang et al., 2021] Liang, W., Liang, K.-H., and Yu, Z. (2021). Herald: An annotation efficient method to detect user disengagement in social conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3652–3665.
- [Liang et al., 2020] Liang, W., Zou, J., and Yu, Z. (2020). Beyond user self-reported likert scale ratings: A comparison model for automatic dialog evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1363–1374.
- [Lin et al., 2022] Lin, J., Zhang, A., Lécuyer, M., Li, J., Panda, A., and Sen, S. (2022). Measuring the effect of training data on deep learning predictions via randomized experiments. In International Conference on Machine Learning, pages 13468–13504. PMLR.
- [Liu et al., 2023] Liu, Z., Just, H. A., Chang, X., Chen, X., and Jia, R. (2023). 2d-shapley: A framework for fragmented data valuation. arXiv preprint arXiv:2306.10473.
- [Maleki, 2015] Maleki, S. (2015). Addressing the computational issues of the Shapley value with applications in the smart grid. PhD thesis, University of Southampton.
- [Northcutt et al., 2021] Northcutt, C. G., Athalye, A., and Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- [OpenAI, 2023] OpenAI, F. (2023). Planning for agi and beyond. https://openai.com/blog/planning-for-agi-and-beyond.
- [Pandl et al., 2021] Pandl, K. D., Feiland, F., Thiebes, S., and Sunyaev, A. (2021). Trustworthy machine learning for health care: scalable data valuation with the shapley value. In Proceedings of the Conference on Health, Inference, and Learning, pages 47–57.
- [Reimers and Gurevych, 2019] Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- [Shapley, 1953] Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.
- [Shim et al., 2021] Shim, D., Mai, Z., Jeong, J., Sanner, S., Kim, H., and Jang, J. (2021). Online class-incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9630–9638.
- [Wang et al., 2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- [Wang and Jia, 2023a] Wang, J. T. and Jia, R. (2023a). Data banzhaf: A robust data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, pages 6388–6421. PMLR.
- [Wang and Jia, 2023b] Wang, J. T. and Jia, R. (2023b). A note on “Efficient task-specific data valuation for nearest neighbor algorithms”. arXiv preprint arXiv:2304.04258.
- [Wang and Jia, 2023c] Wang, J. T. and Jia, R. (2023c). A note on “Towards efficient data valuation based on the Shapley value”. arXiv preprint arXiv:2302.11431.
- [Wang et al., 2023] Wang, J. T., Zhu, Y., Wang, Y.-X., Jia, R., and Mittal, P. (2023). Threshold knn-shapley: A linear-time and privacy-friendly approach to data valuation. arXiv preprint arXiv:2308.15709.
- [Wang et al., 2020] Wang, T., Rausch, J., Zhang, C., Jia, R., and Song, D. (2020). A principled approach to data valuation for federated learning. In Federated Learning, pages 153–167. Springer.
- [Wang et al., 2021] Wang, Z., Shan, X., Zhang, X., and Yang, J. (2021). N24news: A new dataset for multimodal news classification. arXiv preprint arXiv:2108.13327.
- [Warner, 2019] Warner, M. (2019). Warner & hawley introduce bill to force social media companies to disclose how they are monetizing user data. Government Document.
- [Wu et al., 2022] Wu, Z., Shu, Y., and Low, B. K. H. (2022). Davinz: Data valuation using deep neural networks at initialization. In International Conference on Machine Learning, pages 24150–24176. PMLR.
- [Yan and Procaccia, 2021] Yan, T. and Procaccia, A. D. (2021). If you like shapley then you’ll love the core. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5751–5759.
- [Yeh and Lien, 2009] Yeh, I.-C. and Lien, C.-h. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert systems with applications, 36(2):2473–2480.
Appendix A Extended Related Works
A.1 Data Shapley
Data Shapley is one of the first principled approaches to data valuation that became increasingly popular [Ghorbani and Zou, 2019, Jia et al., 2019b]. Data Shapley is based on the Shapley value, a famous solution concept from game theory literature which is usually justified as the unique value notion satisfying the following four axioms:
- (1) Null player: if $v(S \cup \{z_i\}) = v(S)$ for all $S \subseteq D \setminus \{z_i\}$, then $\phi_{z_i}(v) = 0$.
- (2) Symmetry: if $v(S \cup \{z_i\}) = v(S \cup \{z_j\})$ for all $S \subseteq D \setminus \{z_i, z_j\}$, then $\phi_{z_i}(v) = \phi_{z_j}(v)$.
- (3) Linearity: for utility functions $v_1, v_2$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, $\phi_{z_i}(\alpha_1 v_1 + \alpha_2 v_2) = \alpha_1 \phi_{z_i}(v_1) + \alpha_2 \phi_{z_i}(v_2)$.
- (4) Efficiency: $\sum_{z_i \in D} \phi_{z_i}(v) = v(D)$ for every utility function $v$.
Since its introduction, numerous variants of Data Shapley have been developed [Jia et al., 2019a, Ghorbani et al., 2020, Wang et al., 2020, Bian et al., 2021, Kwon and Zou, 2022, Lin et al., 2022, Wu et al., 2022, Karlaš et al., 2022, Wang and Jia, 2023a, Wang and Jia, 2023c, Liu et al., 2023, Wang et al., 2023], reflecting its effectiveness as a principled approach for quantifying data point contributions to ML model training. However, not all axioms mentioned above are considered necessary for a reasonable data valuation technique by the community. For example, [Kwon and Zou, 2022] argue that (4) Efficiency is not necessarily required for data valuation, and [Yan and Procaccia, 2021] argue that (3) Linearity is mainly a technical requirement and does not have a natural interpretation in the context of machine learning. That said, the (1) Null player and (2) Symmetry axioms are often seen as the “fairness axioms” and are deemed fundamental for any sound data valuation technique. Therefore, a reasonable Shapley approximation should uphold these two axioms to ensure the Shapley value’s core advantages—fairness and equity—are not diminished. The deterministic approximation algorithm we develop in Section 4.2 meets these criteria.
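As a small illustration of the two fairness axioms, the brute-force check below computes exact Shapley values for a toy three-player game in which players 0 and 1 are symmetric and player 2 is a null player (exponential time, for intuition only).

```python
from itertools import combinations
from math import comb

def shapley_bruteforce(n, utility):
    """Exact Shapley values for an n-player game; `utility` maps a frozenset of players to a float."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for l in range(n):
            for S in combinations(others, l):
                S = frozenset(S)
                phi[i] += (utility(S | {i}) - utility(S)) / (n * comb(n - 1, l))
    return phi

# Additive toy game: players 0 and 1 each contribute 1, player 2 contributes nothing.
u = lambda S: float(0 in S) + float(1 in S)
print(shapley_bruteforce(3, u))   # approximately [1.0, 1.0, 0.0]: symmetry and null player hold
```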
A.2 KNN-Shapley
The computation of the Shapley value is notoriously resource-intensive. To the best of our knowledge, unweighted KNN stands as the only frequently deployed ML model where the exact Data Shapley can be efficiently computed. Due to its exceptional computational efficiency coupled with its capability to discern data quality, KNN-Shapley has become one of the most popular and practical data valuation techniques. For instance, [Ghorbani et al., 2022] extends KNN-Shapley to active learning, and [Shim et al., 2021] applies it in a continual learning setting. Additionally, studies such as [Liang et al., 2020, Liang et al., 2021] have leveraged KNN-Shapley to eliminate ambiguous samples in NLP tasks, and [Courtnage and Smirnov, 2021] has endorsed its use in semi-supervised learning data valuation. [Belaid et al., 2023] extends the analysis of KNN-Shapley to the calculation of the Shapley interaction index. KNN-Shapley has also shown its practicality in real-world scenarios. For example, [Pandl et al., 2021] shows that KNN-Shapley is the only practical data valuation technique for valuing large amounts of healthcare data. A concurrent work [Wang et al., 2023] considers a simple variant of unweighted KNN termed Threshold KNN, and develops an alternative of KNN-Shapley that can be easily incorporated with differential privacy. Both [Wang et al., 2023] and our work demonstrate the importance of proper adjustments to the underlying KNN’s configuration (i.e., the utility function) in developing new data valuation techniques with desired properties (computational efficiency, privacy compatibility, etc.).
All the studies mentioned above focus on unweighted KNN. On the other hand, weighted KNN incorporates more information about the underlying dataset. Consequently, the Data Shapley score derived from weighted KNN might offer a better assessment of individual data point quality (as demonstrated in our experiments).
Finally, we note that some alternative data valuation techniques are as efficient as KNN-Shapley [Just et al., 2022, Kwon and Zou, 2023]. However, these methods lack the formal theoretical justification of the Shapley value-based approaches.
Appendix B Additional Background of KNN-Shapley
Notation Review. Recall that $\alpha_j^{(S)}$ denotes the index (among $D$) of the $j$th closest data point in $S$ to $x_{val}$.
B.1 Unweighted KNN-Shapley
The unweighted KNN-Shapley was originally proposed in [Jia et al., 2019a] and was later refined in [Wang and Jia, 2023b]. Specifically, [Wang and Jia, 2023b] considers the following utility function for unweighted, soft-label KNN on a validation point $z_{val}$:
$$U(S; z_{val}) = \frac{1}{\min(K, |S|)} \sum_{j=1}^{\min(K, |S|)} \mathbb{1}[y_{\alpha_j^{(S)}} = y_{val}] \qquad (8)$$
which is a special case of the utility function for weighted KNN in (1). It can be interpreted as the probability of a soft-label KNN classifier predicting the correct label for a validation point $z_{val}$. When $S = \emptyset$, $U(\emptyset)$ is set to the accuracy of random guessing (i.e., $1/C$). Here is the main result of [Wang and Jia, 2023b]:
Theorem 14 (KNN-Shapley [Wang and Jia, 2023b]).
Consider the utility function in (8). Given a validation data point $z_{val}$ and a distance metric $d(\cdot, \cdot)$, if we sort the training set according to $d(x_i, x_{val})$ in ascending order, then the Shapley value of each data point corresponding to the utility function (8) can be computed recursively as follows:
where $C$ denotes the number of classes for the classification task.
The computation of unweighted KNN-Shapley for all data points can be achieved in $O(N \log N)$ total runtime, as the runtime is dominated by sorting the data points in $D$.
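For reference, a sketch of the classic recursion of [Jia et al., 2019a] for unweighted soft-label KNN-Shapley with respect to a single validation point (the refined constants of [Wang and Jia, 2023b], such as the random-guess correction, are omitted here).

```python
import numpy as np

def unweighted_knn_shapley(X_train, y_train, x_val, y_val, K=5):
    """O(N log N) recursion for unweighted soft-label KNN-Shapley w.r.t. one validation point,
    following the recursion style of Jia et al. (2019a). Returns one value per training point."""
    N = len(y_train)
    order = np.argsort(np.linalg.norm(X_train - x_val, axis=1))  # indices sorted by distance
    match = (y_train[order] == y_val).astype(float)              # 1 if the label agrees with y_val
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N                                  # value of the farthest point
    for j in range(N - 2, -1, -1):                               # recurse from far to near
        s[j] = s[j + 1] + (match[j] - match[j + 1]) / K * min(K, j + 1) / (j + 1)
    values = np.zeros(N)
    values[order] = s                                            # map back to the original indexing
    return values
```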
B.2 Baseline Algorithm for Computing and Approximating WKNN-Shapley
Exact Computation Algorithm.
[Jia et al., 2019a] shows that the Data Shapley for weighted KNN can be computed exactly with a runtime of $O(N^K)$. The high-level idea, as described in Section 2.3 in the main text, is that we only need to consider evaluating the utility for those subsets where the addition of the target data point may change the prediction of KNN. Moreover, there are at most $O(N^K)$ such subsets. For completeness, we state the specific expression for computing the exact WKNN-Shapley from [Jia et al., 2019a]. We note that the original theorem statement in [Jia et al., 2019a] has minor errors, which we also fix here.
Theorem 15 ([Jia et al., 2019a]).
Consider the utility function in (1). Let . Let be a function that maps the set of training data to their ranks of similarity to . Then, the Shapley value of each training point can be calculated recursively as follows:
(9) | |||
(10) |
where
and
Proof.
The proof proceeds by standard combinatorial analysis and we refer readers to Appendix E.2. in [Jia et al., 2019a] for details. We note that means the index of the farthest data point to in . ∎
Remark 6.
While [Jia et al., 2019a] state the above results for soft-label weighted KNN, this algorithm is in fact fairly general and also applies to hard-label weighted KNN.
Monte Carlo Algorithm.
The Monte Carlo approximation algorithm for WKNN-Shapley from [Jia et al., 2019a] is a simple adaptation of the famous permutation sampling algorithm [Maleki, 2015]. We refer the reader to Section 5 of [Jia et al., 2019a] for the detailed description and error analysis.
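For completeness, a generic permutation-sampling estimator in the style of [Maleki, 2015] is sketched below (`utility` is any set function; this is the standard Monte Carlo scheme rather than the specific variant of [Jia et al., 2019a]).

```python
import numpy as np

def permutation_sampling_shapley(n, utility, num_perms=1000, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal contribution over
    uniformly random permutations of the n players."""
    rng = np.random.default_rng(seed)
    est = np.zeros(n)
    for _ in range(num_perms):
        perm = rng.permutation(n)
        S, prev_u = set(), utility(frozenset())
        for i in perm:
            S.add(int(i))
            cur_u = utility(frozenset(S))
            est[i] += cur_u - prev_u
            prev_u = cur_u
    return est / num_perms
```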
Appendix C Additional Discussion of Weights Discretization
Value Rank Preservation.
We show that the Shapley value computed based on the discrete weights has the same ranking order compared with the Shapley value computed on the continuous weights (it might create ties but will not reverse the original order).
Here, we consider the binary classification setting since our approach for computing WKNN-Shapley for multi-class classification (in Appendix LABEL:appendix:multiclass) makes a reduction to the binary classification setting. We make the very mild assumption that if $d(x_i, x_{val}) \le d(x_j, x_{val})$, then $w_i \ge w_j$. That is, the closer the training point is to the query, the higher the weight the training point will have. This is natural for any reasonable weighting scheme for KNN.
Lemma 16.
Under the assumption on the weight function stated above, for any weight assignment and any two data points $z_i, z_j \in D$, if $\phi_{z_i}(v) > \phi_{z_j}(v)$, then we have $\phi_{z_i}(\widetilde{v}) \ge \phi_{z_j}(\widetilde{v})$, where $\widetilde{v}$ refers to the new utility function after weight discretization.
Proof.
The following cases will result in :
Case 1: if , , it is easy to see that for any we have while , and hence we have while , regardless of the specific magnitude of the weights. Hence, we still have and after weights discretization.
Case 2: if , then implies that and . This is because for any , if , we must also have . Since weight discretization does not change the rank order of weights and , we still have .
Case 3: if , then implies that and . This is because for any , if , we must also have . Since weight discretization does not change the rank order of weights and , we still have . ∎
Value Deviation and Performance on Downstream Tasks.
It is difficult to analytically derive or upper bound the deviation of the WKNN-Shapley score caused by weight discretization. Therefore, in Appendix LABEL:sec:eval-discretization, we empirically investigate this value deviation. We also evaluate the impact of weight discretization on WKNN-Shapley’s performance on downstream tasks such as mislabeled/noisy data detection. Overall, we conclude that the impact of weight discretization is very small even when only a few bits are used for discretization.
Appendix D Additional Details for Exact and Approximation Algorithm for WKNN-Shapley
We state the full version of Theorem 7 here, and defer the proof to Appendix LABEL:appendix:proof.
Theorem 17 (Full version of Theorem 7).
When ,333Since [Jia et al., 2019a] has shown that weighted KNN-Shapley can be computed in time complexity, we focus on the setting where . for we have . We can then compute for with the following relations:
If , we have
(11) |
and if , we have
(12) |
Note that we set for mathematical convenience.
In the following, we present the pseudocode for our exact computation and approximation algorithm for WKNN-Shapley. For the clarity of presentation, we first show a reader-friendly version but inefficient version of the pseudo-code for the exact computation algorithm from Section 4.1 in Appendix D.1. We then show the pseudo-code that optimizes the runtime (but less readable) in Appendix LABEL:appendix:pseudocode. We further show the pseudo-code for our deterministic approximation algorithm in Appendix LABEL:appendix:pseudocode-approx.
Notation. Recall that we use $\mathbf{W}$ to denote the discretized space of $[0, 1]$, where we create $2^b$ equally spaced points within the interval when we use $b$ bits for discretization. We denote by $W := |\mathbf{W}| = 2^b$ the size of the weight space. Furthermore, we use an analogous notation for the discretized space of attainable weight sums (where we likewise create equally spaced points within the corresponding interval).
D.1 Reader-friendly Pseudo-code
Here, we show a reader-friendly version of the pseudo-code for our algorithms for computing WKNN-Shapley for binary classification setting. The runtime-optimized version of the pseudo-code is shown in Appendix LABEL:appendix:pseudocode.
Algorithm 1 Weighted KNN-Shapley for binary classification (reader-friendly version) 1:Input: