Predictive Power of Nearest Neighbors Algorithm under Random Perturbation
Abstract
We consider a data corruption scenario in the classical k-Nearest Neighbors (k-NN) algorithm: the testing data are randomly perturbed. Under this scenario, we carefully characterize the impact of the corruption level on the asymptotic regret. In particular, our theoretical analysis reveals a phase transition phenomenon: when the corruption level is below a critical order (the small-perturbation regime), the asymptotic regret remains unchanged; when it is beyond that order (the large-perturbation regime), the asymptotic regret deteriorates polynomially. Surprisingly, we obtain a negative result: the classical noise-injection approach does not improve the testing performance in the beginning stage of the large-perturbation regime, even at the level of the multiplicative constant of the asymptotic regret. As a technical by-product, we prove that under our model assumptions, the pre-processed 1-NN proposed in [27] achieves at most a sub-optimal rate when the data dimension is large, even if the number of neighbors is chosen optimally in the pre-processing step.
1 Introduction
While there has been a great deal of success in the development of machine learning algorithms, much of the success is in relatively restricted domains with limited structural variation or few system constraints. Those algorithms can be quite fragile in broader real-world scenarios, especially when the testing data are contaminated. For example, in image classification, when the input data are slightly altered due to a minor optical sensor malfunction, a deep neural network may yield a totally different classification result [11]; more seriously, attackers can feed well-designed malicious adversarial input to the system and induce wrong decision making [30, 17]. One strand of existing research along this line focuses on methodology development, including the generation of corrupted testing samples for which machine learning algorithms fail [18, 19, 13] and the design of robust training algorithms [12, 14, 23, 16]. The other strand of research focuses on theoretical investigation of how data corruption affects algorithm performance [25, 28, 10, 9].
This work aims to study the robustness of the k-Nearest Neighbors (k-NN) algorithm and its variants from a theoretical perspective. There is a rich literature on the same topic, but under different setups. For example, Cannings et al. [5] and Reeve and Kaban [21, 20] study the case where the labels of the training data are contaminated and investigate the overall excess risk of the trained classifier; Wang et al. [25] and Yang et al. [28] study the case where the testing data are contaminated, but are only concerned with testing robustness when the testing data belong to a certain subset of the support rather than the whole support. In contrast to these existing works, the present study addresses a different question: how is the overall regret of the k-NN classifier, trained on uncontaminated training data, affected when the testing features are corrupted by random perturbation?
Our main theoretical result (derived in the framework of [22]) characterizes the asymptotic regret for randomly perturbed testing data (with an explicit form of the multiplicative constant) of k-NN, with respect to the choice of k and the level of testing data corruption. There are several interesting implications. First, there exists a critical contamination level, (a) below which the asymptotic order of the regret is not affected, and (b) above which the asymptotic order of the regret deteriorates polynomially. Second, although the regret of k-NN deteriorates polynomially with respect to the corruption level, it actually achieves the best possible accuracy for testing randomly perturbed data (under a finely tuned choice of k). Hence the k-NN classifier is rate-minimax for both the clean data testing task [2, 22, 4] and the randomly perturbed data testing task.
Similar to adversarial training, one strategy to robustify a learning algorithm is to inject the same random noise into the training data. However, our theoretical analysis reveals that such a noise-injection approach does not improve the performance of the k-NN algorithm in the beginning stage of the polynomial deterioration regime, in the sense that it leads to exactly the same asymptotic regret.
Adapting the analysis to the adversarial attack scenario (i.e., testing data contaminated with worst-case perturbation), we compare the regret of k-NN under adversarial attack and under random perturbation. Interestingly, when the level of data contamination is beyond the aforementioned critical order, adversarial attack leads to a worse regret than random perturbation only in terms of the multiplicative constant of the convergence rate.
Our developed theory may also be used to evaluate the asymptotic performance of variants of the k-NN algorithm. For example, Xue and Kpotufe [27] applied 1-NN to pre-processed data (relabelled by k-NN) in order to achieve the same accuracy as k-NN. Interestingly, this algorithm can be translated into the classical k-NN algorithm applied to a type of perturbed samples, to which our theory naturally applies. In particular, we prove that, under our model assumption framework, the above algorithm attains only a sub-optimal regret rate (strictly worse than k-NN) when the data dimension is large.
As far as we are aware, the only related work in the context of k-NN is [25], which evaluated the adversarial robustness of k-NN based on the concept of "robust radius," the maximum allowed attack intensity that does not affect the testing performance. However, their analysis ignores the area near the decision boundary where mis-classification is most likely to occur. By filling this gap, our work presents a more comprehensive regret analysis of the robustness of k-NN.
2 Effect of Random Perturbation on k-NN
In this section, we formally introduce the model setup and present our main theorems which characterize the asymptotic regret for randomly perturbed testing samples.
2.1 Model Setup
Denote the regression function as $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$, and its k-NN estimator as the average of the labels of the k nearest neighbors of $x$ among the $n$ training samples. The corresponding Bayes classifier and k-NN classifier are defined as the indicators that $\eta(x)$ and its k-NN estimate exceed $1/2$, respectively.
Define $\omega \ge 0$ as the level of perturbation. For any intended testing data point $x$, we only observe its randomly perturbed version
$$\widetilde{x} \sim \mathrm{Unif}\big(\partial B(x, \omega)\big), \qquad (1)$$
that is, $\widetilde{x}$ is uniformly distributed over $\partial B(x, \omega)$, the boundary of the Euclidean ball centered at $x$ with radius $\omega$.
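For concreteness, a minimal sketch of the perturbation mechanism in (1) is given below; the function name perturb_on_sphere, the variable omega, and the use of NumPy are our own choices for illustration rather than the paper's.

```python
import numpy as np

def perturb_on_sphere(x, omega, rng=None):
    """Return a copy of x perturbed uniformly on the sphere of radius omega.

    Drawing a standard Gaussian vector and normalizing it gives a direction
    uniform on the unit sphere; scaling by omega places the point on the
    boundary of the Euclidean ball B(x, omega).
    """
    rng = np.random.default_rng() if rng is None else rng
    direction = rng.standard_normal(np.shape(x))
    direction /= np.linalg.norm(direction)
    return np.asarray(x, dtype=float) + omega * direction
```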
In this case, we define the so-called "perturbed" regret as the excess misclassification risk of the k-NN classifier evaluated at the perturbed testing point over the Bayes risk,
$$\mathrm{Regret}(\omega) = \mathbb{P}\big(\widehat{g}_{k,n}(\widetilde{X}) \neq Y\big) - \mathbb{P}\big(g(X) \neq Y\big),$$
where the probability is taken over the training samples, the testing pair, and the random perturbation. Note that the k-NN classifier is trained on uncontaminated training samples. Obviously, when $\omega = 0$, the above definition reduces to the common regret used in the statistical classification literature.
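The perturbed misclassification probability that enters this regret can be estimated empirically by Monte Carlo. The sketch below is our own illustration built on scikit-learn's KNeighborsClassifier; it returns the test error of a k-NN classifier trained on clean data and evaluated at spherically perturbed testing points (the Bayes risk term of the regret is omitted, since it does not depend on the classifier).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def perturbed_test_error(X_train, y_train, X_test, y_test, k, omega, rng=None):
    """Estimate the misclassification rate of k-NN (trained on clean data)
    when each testing point is perturbed uniformly on a radius-omega sphere."""
    rng = np.random.default_rng() if rng is None else rng
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    z = rng.standard_normal(X_test.shape)
    u = z / np.linalg.norm(z, axis=1, keepdims=True)  # one uniform direction per testing point
    return np.mean(clf.predict(X_test + omega * u) != y_test)
```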
The following assumptions are imposed on the distribution of X and the underlying regression function η, to facilitate our theoretical analysis.
- A.1 X is a random variable on a compact d-dimensional manifold with boundary. The density function of X is differentiable, finite, and bounded away from 0.
- A.2 The decision boundary set {x : η(x) = 1/2} is non-empty, and there exists an open subset of the support containing it on which η is continuous.
- A.3 There exists some constant such that, in a neighborhood of the decision boundary, η has a bounded fourth-order derivative and the norm of its gradient is bounded away from zero. In addition, the derivative of η restricted to the boundary of the support is non-zero.
Assumption A.1 ensures that, for any x in the support, all of its k nearest neighbors are close to x with high probability; this is because, if the density at a point is positive and finite, the distance from that point to its k-th nearest neighbor is of order (k/n)^{1/d}. Assumption A.2 ensures the existence of the decision boundary within the support and the continuity of η around it. Assumption A.3 is slightly stronger than the corresponding condition imposed in [22], due to the consideration of testing data contamination. Specifically, the additional smoothness imposed in Assumption A.3 guarantees that some higher-order terms in the Taylor expansion of η are negligible.
2.2 Asymptotic Regret and Phase Transition Phenomenon
We are now ready to conduct the regret analysis of k-NN in the presence of randomly perturbed testing examples. For any x, define the expected squared distance from x to any of its k nearest neighbors. We also denote the joint density of (X, Y) and the marginal density of X, together with several constants that appear in the theorem below.
We first characterize the asymptotic perturbed regret.
Theorem 1.
Define the constant term appearing below. Under [A.1] to [A.3], if the testing data are randomly perturbed, then it follows that
(2)
where the Remainder term is of smaller order as n grows. The constant term depends on the true regression function η and the distribution of X, and does not change with respect to k and n:
Here the displayed quantities represent the gradient of η, the Hessian of η, and the gradient of the marginal density of X, respectively; subscripts denote the corresponding coordinate elements.
Our result (2) decomposes the asymptotic regret into a squared bias term, a data corruption effect term, a variance term, and a remainder term arising from the higher-order Taylor expansion. The squared bias and variance terms are of order (k/n)^{4/d} and 1/k, respectively, while the corruption effect term depends on the perturbation level ω; the remainder term is technically derived from the higher-order Taylor expansion and the Berry-Esseen theorem. When k is within a reasonable range, the remainder term is negligible compared with the other three terms. When ω = 0, (2) reduces to the bias-variance decomposition commonly observed in the nonparametric regression literature.
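For reference, in the clean case (ω = 0) the decomposition from [22] takes, up to constants, the form sketched below, where $B_1$ and $B_2$ are generic constants of our own (not the paper's notation); balancing the two terms recovers the familiar optimal order of k.

$$
\mathrm{Regret}(0) \;\approx\; \frac{B_1}{k} + B_2\left(\frac{k}{n}\right)^{4/d},
\qquad k_{\mathrm{opt}} \asymp n^{4/(d+4)},
\qquad \mathrm{Regret}(0) \asymp n^{-4/(d+4)}.
$$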
Based on Theorem 1, by varying the perturbation level ω, we have the following observations:
Phase Transition Phenomenon
We discover a phase transition phenomenon for the regret with respect to the level of testing data contamination in general.
1. When ω is below the critical order,¹ the data corruption has almost no effect on the asymptotic regret;
2. When ω is exactly of the critical order, the regret is of the same order as in the clean case, but with a different multiplicative constant depending on ω and the underlying distribution;
3. When ω is beyond the critical order, the regret deteriorates polynomially in ω, and the optimal choice of k changes accordingly.

¹To prevent a conflict with the definition of ω, modified symbols are used in place of Ω and ω in the asymptotic notation; moreover, writing that two sequences are of the same order means that each is bounded by a constant multiple of the other for all large n.
Impact on the Regret and the Choice of k
The value of k plays an important role in the k-NN algorithm, so it is essential to examine how the perturbation level ω affects the corresponding optimal choice. From Theorem 1, one can derive the optimal regret rate in the randomly perturbed testing scenario for both the small-ω and large-ω regimes, together with the corresponding optimal choices of k: below the critical order of ω, the optimal k and the regret rate coincide with those of the clean case, while above it they depend explicitly on ω.
Robustness Trade-off in k-NN
Theorem 1 reveals that the data corruption has no impact on the regret as long as the perturbation level stays below a threshold determined by k and n. In other words, if k is chosen optimally to minimize the regret on uncontaminated testing samples, this threshold is small and k-NN is more sensitive to an increase of ω. On the other hand, if k is a sub-optimal choice for which this threshold is larger, then k-NN is more robust to testing data corruption. Hence there is a trade-off between accuracy on the uncontaminated testing task and robustness with respect to random perturbation.
Effect of Metric of Noise
Note that the perturbation can be defined on a ball or a sphere under different metrics. Theorem 1 reveals that the effect of the perturbation is independent of k and n (when the remainder term is not dominant) as long as the perturbation is uniformly distributed in a ball or on a sphere, so that the first-order terms with respect to the noise are zero. As a result, if the perturbation is instead uniformly distributed in a ball or on a sphere, the regret changes only through the corresponding second-moment quantity, which replaces the one appearing in Theorem 1.
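As a quick numerical check of this point (our own illustration, not the paper's code): for a uniform perturbation on a radius-ω sphere the mean squared norm is exactly ω², while for a uniform perturbation in the radius-ω ball it is ω²·d/(d+2); the mean of the perturbation is zero in both cases, so only this second moment enters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, omega, m = 5, 0.3, 200_000

g = rng.standard_normal((m, d))
sphere = omega * g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
ball = sphere * rng.uniform(size=(m, 1)) ** (1.0 / d)          # uniform in the ball

print(np.mean(np.sum(sphere ** 2, axis=1)))  # close to omega**2
print(np.mean(np.sum(ball ** 2, axis=1)))    # close to omega**2 * d / (d + 2)
```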
Minimax Result under General Smoothness Conditions
To relax the strong assumptions A.1 to A.3, we follow [6] and impose smoothness and margin conditions. The observations are basically similar to the results above, and one can also show that k-NN reaches the optimal rate of convergence.
Theorem 2 (Informal Statement under General Smoothness Conditions).
If the distribution of (X, Y) satisfies
1. a smoothness condition on the regression function η, and
2. a margin condition near η = 1/2,
together with some other general assumptions, then, for a suitably chosen k, the regret attains a rate that is proven to be minimax.
Formal assumptions and results for Theorem 2 are postponed to Section C in the Appendix (Theorem 8 for the convergence rate and Theorem 9 for the minimax rate). From Theorem 2, the rate of the regret is dominated by the larger of the random perturbation effect and the minimax rate for clean data. Similarly to [22], our regret result derived under conditions A.1-A.3 matches the minimax rate of Theorem 2 by taking the smoothness and margin parameters to be 2 and 1, respectively.
Remark 1 (Adversarial Data Corruption).
So far, we have focused on the case of random perturbation. As a by-product, we also analyze the effect of a particular non-random data corruption. Due to the page limit, the detailed results and discussion are postponed to Section A in the appendix. In general, compared with worst-case data corruption, the effect of random perturbation mostly cancels out across perturbation directions, leading to a smaller corruption effect term. Therefore, unsurprisingly, k-NN is more robust to random perturbation than to worst-case adversarial data corruption. However, our rigorous analysis shows that the regret under adversarial data corruption (defined formally in the Appendix) is still of the same order as under random perturbation, but with a larger multiplicative constant when the corruption level is beyond the critical order (below that order, the effect of either adversarial corruption or random perturbation is negligible).
2.3 Comparison with Noise-Injected k-NN
In an iterative adversarial training algorithm (e.g., [23]), an attack is designed in each iteration based on the current model, and the model is updated based on the attacked training data. Similarly, for random perturbation, a natural idea to enhance the robustness of k-NN is to inject noise into the training data so that training and testing data share the same distribution: one randomly perturbs the training samples and trains k-NN on the perturbed training set. Comparing this noise-injection k-NN with the traditional k-NN, we find that there is no performance improvement for the former, even when the corruption level is in the early stage of the polynomial deterioration regime.
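A minimal sketch of this noise-injection strategy is given below (our own code, reusing scikit-learn and the same spherical noise as in the perturbation model; it is not the authors' implementation).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def noise_injected_knn(X_train, y_train, k, omega, rng=None):
    """Fit k-NN on training features perturbed uniformly on a radius-omega sphere,
    so that the training inputs mimic the distribution of the corrupted testing data."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(X_train.shape)
    u = z / np.linalg.norm(z, axis=1, keepdims=True)
    return KNeighborsClassifier(n_neighbors=k).fit(X_train + omega * u, y_train)
```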
Denote one estimator as the k-NN classifier trained on the original training data and the other as the k-NN classifier trained using randomly perturbed training data, and let both adopt their respective best choices of k. Then we have
Theorem 3.
Under [A.1] to [A.3], when the perturbation level ω is appropriately small,
(3)
Although it is intuitive to consider perturbing the training data so that they match the distribution of the corrupted testing data, result (19) implies that the two estimators asymptotically share the same predictive performance on randomly perturbed testing data, so the noise-injection strategy is futile. Note that this result holds when ω is small. Combined with our analysis in Theorem 1, in the beginning stage of the polynomial deterioration regime the regret deteriorates polynomially due to the impact of testing data corruption, and this deterioration cannot be remedied by noise-injection adversarial training at all. One heuristic explanation is that the injected perturbation may introduce additional noise into the estimation procedure and change some underlying properties (e.g., smoothness); consequently, perturbing the training data does not necessarily help to achieve a smaller regret, especially when ω is small.
Remark 2.
The above robustness result can be applied once we know, at testing time, whether the testing data are corrupted. In reality, we do not know whether the testing data are corrupted or not; in such a case, other techniques, e.g., hypothesis testing, need to be applied at testing time.
3 Application to other Nearest Neighbor-Type Algorithms
Our theoretical analysis may be adapted to other NN-type algorithms: pre-processed 1NN [27] and distributed NN [8]. In particular, we prove that the regret of the former is sub-optimal for some classes of distributions, and explain why the regret of the latter converges at the optimal rate, both from the random perturbation viewpoint.
3.1 Pre-processed 1NN
In some literature [27, 25], the algorithm runs 1NN for prediction on pre-processed data instead of running k-NN on the raw data. The pre-processing (or de-noising) step is reviewed in Algorithm 1. Specifically, we first run k-NN to predict labels for the training data set, and then replace the original labels with the predicted labels. In this way, applying 1NN to the relabelled data can achieve good accuracy while the computational cost at prediction time is much smaller than that of k-NN.
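A minimal sketch of this pre-processing step (our own rendering of the idea behind Algorithm 1, using scikit-learn; not the authors' code) is:

```python
from sklearn.neighbors import KNeighborsClassifier

def preprocessed_1nn(X_train, y_train, k):
    """Pre-processed 1NN: relabel the training set with k-NN predictions,
    then answer queries with 1-NN on the relabelled data."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_denoised = knn.predict(X_train)  # de-noising / relabelling step (done once)
    return KNeighborsClassifier(n_neighbors=1).fit(X_train, y_denoised)
```

At prediction time only a single nearest neighbor has to be found, which is where the computational saving over k-NN comes from.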
This can in fact be treated as an application of random perturbation of the testing data in k-NN, in the sense that the pre-processed 1NN classifier can be equivalently represented as the k-NN classifier evaluated at a perturbed sample: the prediction at a testing point x equals the k-NN prediction at the training point nearest to x. Although this perturbation does not exactly match our definition (1), the nearest training point can still be viewed as a randomly perturbed version of x, with a contamination level of the order of the expected distance from x to its nearest neighbor under Assumption A.1.
From this point of view, Theorem 1 can be applied to derive the regret of the pre-processed 1NN algorithm, whose rate of convergence turns out to be slower than the optimal rate of k-NN when the data dimension is relatively high.
Theorem 4.
Under [A.1] to [A.3], the regret of pre-processed 1NN under un-corrupted testing data decomposes, as in Theorem 1, into squared bias, corruption, variance, and remainder terms, where the Corruption and Remainder terms are evaluated at a contamination level of the order of the nearest-neighbor distance, under suitable growth conditions on k and n. As a result, pre-processed 1NN is sub-optimal when the data dimension is large (compared with the optimal rate of k-NN).
The result in Theorem 4 reveals a sub-optimal rate for pre-processed 1NN, in contrast with the optimal rate claimed by [27]. This is not a contradiction, but a consequence of the different assumptions imposed on η and the distribution of X: Xue and Kpotufe [27] assume a smoothness condition that is more general than our smoothness assumption A.3.
3.2 Distributed-NN
The computational complexity of k-NN is substantial when the sample size is large, so we consider a distributed NN algorithm: randomly partition the original data into equal-size parts; then, given a query point, each machine selects the nearest neighbors of the query within its own part and computes a local estimate; finally, the local estimates are averaged to obtain the overall estimate. The algorithm is shown as Algorithm 2, as in [8]; a sketch is given below.
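A rough sketch of this distributed scheme (our own simplified rendering; the implementation in [8] may differ in details such as the per-machine number of neighbors) is:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def distributed_nn_estimate(x, X_train, y_train, s, k_local, rng=None):
    """Distributed NN regression estimate at a query x: average the label means of the
    k_local nearest neighbors found independently on each of s random data partitions."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(X_train))
    local_means = []
    for part in np.array_split(idx, s):  # random equal-size partition
        nn = NearestNeighbors(n_neighbors=k_local).fit(X_train[part])
        _, nbrs = nn.kneighbors(np.asarray(x).reshape(1, -1))
        local_means.append(y_train[part][nbrs[0]].mean())  # local neighbor-label average
    return np.mean(local_means)  # classify as 1 if this exceeds 1/2
```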
Distributed NN differs in practice from k-NN on a single machine, since the neighbors aggregated from the data subsets are not necessarily the same nearest neighbors that would be selected on a single machine. Therefore, an additional assumption is imposed to ensure that the neighborhood set selected by distributed NN behaves similarly to the one selected by single-machine k-NN, in the sense that the distance from the query to any point in the distributed neighborhood set is of the same order as the corresponding single-machine distance. Based on Theorem 1, we then obtain that the regret of distributed NN is of the same order as that of k-NN. Formally, we have the following corollary:
Corollary 5.
Under [A.1] to [A.3], under the stated conditions on k and n and on the number of machines, we have
where the two quantities denote the (clean testing data) regrets of the distributed NN and k-NN algorithms, respectively.
4 Numerical Experiments
In Section 4.1, we evaluate the empirical performance of the k-NN algorithm on randomly perturbed testing data, comparing the k-NN classifier trained on raw un-corrupted training data with the one trained on noise-injected training data (i.e., the two classifiers defined in Section 2.3). In Section 4.2, we also conduct experiments comparing k-NN with pre-processed 1NN on un-corrupted testing data. These numerical experiments are intended to show: (i) k-NN trained on un-corrupted training data has similar testing performance to k-NN trained on noise-injected training data when ω is small, which validates Theorem 3; (ii) for un-corrupted testing data, the regret of pre-processed 1NN is much worse than that of k-NN when the dimension is high, which validates Theorem 4. In general, k-NN is expected to be always better than pre-processed 1NN.
Note that in all real data applications except for the MNIST data set, we normalize all attributes.
4.1 Tackling Random Perturbation
4.1.1 Simulation
The random variable X is d-dimensional, and each dimension independently follows an exponential distribution with mean 0.5. The conditional mean of Y given X is defined as
(4)
For each pair of sample size and perturbation level, we generate training samples and 10,000 testing samples, and repeat the experiment 50 times to calculate the average regret. In each repetition, 5-fold cross-validation is used to select k; then, following [22], we rescale the selected number of neighbors, since the cross-validated value is optimal for the cross-validation sample size rather than the full sample size (a sketch of this protocol is given below). The two classifiers, trained on un-corrupted and corrupted training data respectively, are then used to predict corrupted testing data.
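A sketch of this tuning protocol (our own code; the exact rescaling used in the paper is not fully shown here, so we assume the optimal growth k ∝ n^{4/(d+4)} from [22] and rescale from the fold training size 0.8n to the full sample size n).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def tune_and_rescale_k(X_train, y_train, k_grid):
    """Select k by 5-fold cross-validation, then rescale it to the full sample size
    using the optimal growth rate k ~ n^{4/(d+4)} (an assumption based on [22])."""
    d = X_train.shape[1]
    search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(k_grid)}, cv=5)
    search.fit(X_train, y_train)
    k_cv = search.best_params_["n_neighbors"]
    return max(1, int(round(k_cv * (1 / 0.8) ** (4.0 / (d + 4)))))
```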
From Figure 1, as the number of training samples increases, the regret of both k-NN classifiers decreases at the same speed. This verifies that the two classifiers do not differ much when ω is small, in line with Theorem 3. Empirically, the regret of the k-NN trained on corrupted training data is worse than that of the one trained on un-corrupted training data. On the other hand, when ω is large (so that the condition required in Theorem 3 fails), the two classifiers may perform significantly differently; for example, for one large value of ω and a particular sample size, the reported value is -2.86 using uncontaminated training data and -3.11 using corrupted training data.
4.1.2 Real Data
We use two real data sets to compare the two k-NN classifiers: Abalone [7] and HTRU2 [15]. The Abalone data set contains 4,177 samples, and two attributes (Length and Diameter) are used in this experiment; the classification label indicates whether an abalone is older than 10.5 years. The HTRU2 data set [15] has 17,898 samples with 8 continuous attributes. For each data set, 25% of the samples are contaminated by random noise and then used as testing data.
As shown in Figure 2, for both data sets, the error rates (misclassification rates) of the two k-NN classifiers do not differ much when ω is small.
4.2 Performance of 1NN with Pre-processed Data
4.2.1 Simulation
To observe a clear difference, instead of the setting used with (4), we use a model where each dimension of X follows a uniform distribution, keeping the form of (4), and vary the data dimension d to compare the performance of k-NN and pre-processed 1NN; X now follows a d-dimensional uniform distribution. Figure 3 shows that the order of the regret of pre-processed 1NN differs from that of k-NN when d is large.
We also replace the uniform distribution with an exponential distribution with mean 0.5 for each dimension of X. Figure 4 shows that, for the two larger dimensions considered, the increasing trend of the regret ratio is obvious, indicating a sub-optimal rate for pre-processed 1NN.
4.2.2 Real Data
We use four data sets to compare k-NN and pre-processed 1NN: MNIST, Abalone, HTRU2, and Credit [29].
The MNIST data set contains 70,000 samples with data dimension 784. We randomly pick 25% of the samples as testing data, and randomly pick training subsets of varying size to train the k-NN classifier and the pre-processed 1NN classifier, where the choice of k for both algorithms is determined by 5-fold cross-validation. We repeat this procedure 50 times with different random seeds to obtain the mean testing error rates. As shown in Figure 5, as the number of training samples increases, the error rate ratio between the pre-processed 1NN classifier and the k-NN classifier stays stably above 1 and is around 1.17.
For the Abalone data set, we conduct the experiment in the same way as for MNIST, and observe that the error rate of pre-processed 1NN is always greater than that of k-NN. As shown in Figure 6, while the error rates of both pre-processed 1NN and k-NN decrease with the number of training samples, their difference changes little as the sample size grows.
The Credit data set contains 30,000 samples (25% used as testing data) with 23 attributes. For the HTRU2 and Credit data sets, the mean and standard error of the error rates over the 50 repetitions are summarized in Table 1. From Table 1, k-NN yields a slightly smaller error rate on average.
| Data Set | Error Rate (k-NN) | Error Rate (Pre-1NN) |
|---|---|---|
| Credit | 0.18879 | 0.1899 |
| HTRU2 | 0.02143 | 0.0221 |
5 Conclusion and Discussion
In this work, we conduct an asymptotic regret analysis of k-NN classification for randomly perturbed testing data. In particular, a phase transition phenomenon is observed: when the corruption level is below a threshold order, it does not affect the asymptotic regret; when the corruption level is beyond this threshold order, the asymptotic regret deteriorates polynomially. Moreover, when the level of corruption is small, there is no benefit in performing the noise-injection adversarial training approach.
Moreover, using the idea of random perturbation, we can further explain why pre-processed 1NN converges at a sub-optimal rate: it can be treated as k-NN with perturbed testing data, where the perturbation level, namely the distance from a point to its nearest neighbor, is large when the dimension is high. It is worth mentioning that our analysis can also be applied to distributed NN to verify the optimal rate obtained in [8].
An interesting observation from the numerical experiments is that the traditional k-NN can even achieve better performance than the k-NN trained via the noise-injection method. This observation contradicts the common belief that injecting attacks into the training algorithm yields an adversarially robust algorithm (e.g., the optimization method in [23]). Therefore, it deserves further theoretical investigation to understand under which circumstances one can indeed benefit from the noise-injection strategy.
References
- Audibert [2004] Audibert, J.-Y. (2004), “Classification under polynomial entropy and margin assumptions and randomized estimators,” .
- Audibert et al. [2007] Audibert, J.-Y., Tsybakov, A. B., et al. (2007), “Fast learning rates for plug-in classifiers,” The Annals of statistics, 35, 608–633.
- Belkin et al. [2018] Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161.
- Cannings et al. [2017] Cannings, T. I., Berrett, T. B., and Samworth, R. J. (2017), “Local nearest neighbour classification with applications to semi-supervised learning,” arXiv preprint arXiv:1704.00642.
- Cannings et al. [2018] Cannings, T. I., Fan, Y., and Samworth, R. J. (2018), “Classification with imperfect training labels,” arXiv preprint arXiv:1805.11505.
- Chaudhuri and Dasgupta [2014] Chaudhuri, K. and Dasgupta, S. (2014), “Rates of convergence for nearest neighbor classification,” in Advances in Neural Information Processing Systems, pp. 3437–3445.
- Dua and Graff [2017] Dua, D. and Graff, C. (2017), “UCI Machine Learning Repository,” .
- Duan et al. [2018] Duan, J., Qiao, X., and Cheng, G. (2018), “Distributed Nearest Neighbor Classification,” arXiv preprint arXiv:1812.05005.
- Fawzi et al. [2018] Fawzi, A., Fawzi, H., and Fawzi, O. (2018), “Adversarial vulnerability for any classifier,” in Advances in Neural Information Processing Systems, pp. 1178–1187.
- Fawzi et al. [2016] Fawzi, A., Moosavi-Dezfooli, S.-M., and Frossard, P. (2016), “Robustness of classifiers: from adversarial to random noise,” in Advances in Neural Information Processing Systems, pp. 1632–1640.
- Goodfellow et al. [2014] Goodfellow, I., Shlens, J., and Szegedy, C. (2014), “Explaining and Harnessing Adversarial Examples,” arXiv preprint arXiv:1412.6572.
- Goodfellow et al. [2015] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015), “Explaining and harnessing adversarial examples. CoRR,” .
- Grosse et al. [2017] Grosse, K., Papernot, N., Manoharan, P., Backes, M., and McDaniel, P. (2017), “Adversarial examples for malware detection,” in European Symposium on Research in Computer Security, Springer, pp. 62–79.
- Kurakin et al. [2016] Kurakin, A., Goodfellow, I., and Bengio, S. (2016), “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236.
- Lyon et al. [2016] Lyon, R. J., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. (2016), “Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach,” Monthly Notices of the Royal Astronomical Society, 459, 1104–1123.
- Madry et al. [2017] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017), “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083.
- Papernot et al. [2017] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. (2017), “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia conference on computer and communications security, ACM, pp. 506–519.
- Papernot et al. [2016a] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. (2016a), “The limitations of deep learning in adversarial settings,” in Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, IEEE, pp. 372–387.
- Papernot et al. [2016b] Papernot, N., McDaniel, P., Swami, A., and Harang, R. (2016b), “Crafting adversarial input sequences for recurrent neural networks,” in Military Communications Conference, MILCOM 2016-2016 IEEE, IEEE, pp. 49–54.
- Reeve and Kaban [2019a] Reeve, H. W. and Kaban, A. (2019a), “Classification with unknown class conditional label noise on non-compact feature spaces,” arXiv preprint arXiv:1902.05627.
- Reeve and Kaban [2019b] — (2019b), “Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise,” arXiv preprint arXiv:1906.04542.
- Samworth et al. [2012] Samworth, R. J. et al. (2012), “Optimal weighted nearest neighbour classifiers,” The Annals of Statistics, 40, 2733–2763.
- Sinha et al. [2018] Sinha, A., Namkoong, H., and Duchi, J. (2018), “Certifying some distributional robustness with principled adversarial training,” .
- Sun et al. [2016] Sun, W. W., Qiao, X., and Cheng, G. (2016), “Stabilized nearest neighbor classifier and its statistical properties,” Journal of the American Statistical Association, 111, 1254–1265.
- Wang et al. [2017] Wang, Y., Jha, S., and Chaudhuri, K. (2017), “Analyzing the robustness of nearest neighbors to adversarial examples,” arXiv preprint arXiv:1706.03922.
- Xing et al. [2018] Xing, Y., Song, Q., and Cheng, G. (2018), “Statistical Optimality of Interpolated Nearest Neighbor Algorithms,” arXiv preprint arXiv:1810.02814.
- Xue and Kpotufe [2017] Xue, L. and Kpotufe, S. (2017), “Achieving the time of 1-NN, but the accuracy of k-NN,” arXiv preprint arXiv:1712.02369.
- Yang et al. [2019] Yang, Y.-Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. (2019), “Adversarial Examples for Non-Parametric Methods: Attacks, Defenses and Large Sample Limits,” arXiv preprint arXiv:1906.03310.
- Yeh and Lien [2009] Yeh, I.-C. and Lien, C.-h. (2009), “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, 36, 2473–2480.
- Zhang et al. [2017] Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., and Xu, W. (2017), “Dolphinattack: Inaudible voice commands,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM, pp. 103–117.
Appendix A Comparison between Random Perturbation and Non-random Perturbation
We use the following adversarial attack as the non-random perturbation:
(5)
When , if is differentiable, the length of the attack converges to as well.
The proposed attack scheme (5) is also called a “white-box attack,” as the adversary has knowledge of the underlying model. On the other hand, unlike the “white-box attack” considered in [25], the perturbation and attack we focus on are independent of the training samples.
Theorem 6.
Under [A.1] to [A.3], if the testing data are adversarially attacked and the attack level is within an appropriate range, then
(6)
where .
From Theorem 6, one can see that the regret under adversarial attack is larger than that under random perturbation when the corruption effect term is dominant.
Appendix B Proof of Regret Analysis in Section 2 and 3
B.1 Theorem 1
This section contains the proofs of Theorem 1 and Theorem S.6. The two proofs are similar, so for the proof of Theorem S.6 we only present the parts that differ from the proof of Theorem 1.
Define the unsorted distances from the k nearest neighbors to the testing data point, and the distance from the exact k-th nearest neighbor to the point itself. As in [6], conditional on the distance of the k-th neighbor, the first k−1 neighbors are i.i.d. random variables distributed within the corresponding ball.
In addition to , , and , we further denote as the density of .
Proposition 7 (Lemma S.1 in [24]).
For any distribution function with density ,
Now we start our proof of Theorem 1.
Proof of Theorem 1.
The idea of the proof follows [22], and there are five steps in total:
-
•
Step 0: We give some definitions to prepare for the proof.
-
•
Step 1: Given a fixed (unobserved) testing sample and conditional on the perturbation random variable , we obtain the mean and variance of . In particular, for any satisfying , let , we have that
and
-
•
Step 2: use tube theory to construct a tube based on . The remainder of regret outside the tube is of for some :
for some remainder term .
-
•
Step 3: use Berry-Esseen Theorem to transform the probability
to a Gaussian probability.
(7) -
•
Step 4: plug in the mean and variance of from Step 1 into (7), integrate in the formula in Step 2 on the tube (integrate over ).
Step 0: In this step we introduce some definitions.
Denote as the random perturbation, i.e., . Denote to be the unsorted neighbors of in the training samples and be the value for the corresponding . When no confusion is caused, we drop the argument and use and for abbreviation. Then the probability of classifying the contaminated sample as 0 becomes
If we directly investigate in , the possible values of can be , but those ’s which are far from will have little contribution on the regret. Hence we consider in where
for .
Further define
for some large constant .
Step 1: For the scenario of random perturbation, is a random variable uniformly distributed on sphere , we first evaluate and for given and .
where is a remainder term due to the Taylor’s expansion. Before discussing , we consider the dominant part of . Conditional on and , then the distribution of is on the sphere of . Denote the density of this distribution as . Also define as the gradient of . For simplicity, rewrite as . Then based on (A.1) and (A.3) for the smoothness of and , rewrite as a Taylor expansion at , and we have
where the integral denotes integration over the sphere, and the first-order and third-order terms become 0.
In addition,
The term in can be tackled in a similar manner and . Hence taking
we have
Denote , using arguments in Lemma 1 and Theorem 2 of [26], take , we obtain
Further denote , we obtain
In terms of , fixing , the neighbors are i.i.d. random variables in ,
when . Moreover, as [6] and [3] mentioned, the probability of is an exponential tail, hence the overall variance becomes
This also implies that
Step 2: Since the density of is 0 when is not in support,
is equal to
where is the support of . Taking , we have
(8) | |||||
The result in (8) adopts tube theory to transform the integration from to . Denote the map , then the pullback of the -form is given at by
For , it is composed of four parts: (1) the integral outside , (2) the difference between and , (3) the difference between and the tube generated using , and (4) the remainder of :
(9) |
In , is of since ; is of since the difference of volume is of . For in :
From the definition of , we know that for any , for some . Using Bernstein's inequality, we obtain an upper bound
for .
Similar result can be obtained for .
For in ,
Finally .
Step 3: we continue the derivation of (where denotes testing sample random variable) using
Since is obtained from i.i.d. samples (though for some samples their weight is 0), by non-uniform Berry-Esseen Theorem,
where
hence
Following the analysis in Step 4, we can also obtain that
Step 4: Finally we integrate on Gaussian probabilities:
The last step follows Proposition 7. For the small order terms,
Through definition of we have
Finally we take expectation on :
∎
B.2 Theorem 2
Proof of Theorem 2.
When for some large constant , and will always be the same, thus
(10) | |||||
(11) |
Moreover,
(12) |
As a result,
(13) |
Plugging in into regret, we obtain that
(14) | |||||
(15) | |||||
(16) | |||||
(17) | |||||
(18) |
From this derivation, the dominant terms in the denominator and numerator of the quantity
(19) |
are both when ’s are chosen to be optimal respectively. Note that the multiplicative constants for numerator and denominator are both determined by and density of , and converges to each other when . As when , the difference on the densities vanishes, thus (19) converges to 1. ∎
B.3 Theorem S.1
Proof of Theorem S.1.
The proof is similar to that of Theorem 1. Since the forms of the other terms are unchanged, one can show that they are small-order terms in Theorem 3 as well. What changes in the proof of Theorem 3 is:
When , we have
while for ,
Therefore,
The remainder is not a small order term, but we can show that it is positive, and calculate its rate.
For ,
From the format of and , we know that they are positive. When and both , is an exponential tail (so we just ignore it) and for we have:
∎
B.4 Theorem 4
Proof of Theorem 4.
First, it is easy to see that the effective contamination level is of the claimed order, since the nearest neighbor has an average distance of that order.
Second, there is a difference between pre-processed 1NN and our random perturbation model: in pre-processed 1NN, the nearest neighbor is distributed approximately uniformly around the testing point, while the other neighbors have distances to the testing point larger than that of the nearest neighbor. However, this difference only affects the remainder term of the regret; that is, whether or not the other neighbors are assumed to be uniformly distributed in the ball does not affect our result.
As a result, taking expectation on the direction of ,
When , i.e. , the dominant part of regret becomes . ∎
Appendix C Regret Convergence under General Smoothness Condition and Margin Condition
C.1 Model and Theorem
In this section, we relax the conditions on the distribution of X and the smoothness of η; as a consequence, we only obtain the rate of the regret (without an explicit form for the multiplicative constant). Technically, we adopt the framework of [6], and the following assumptions on the smoothness of η and the density of X are used instead of conditions [A.1]-[A.3].
- B.1 Let λ be the Lebesgue measure on ℝ^d. There exists a positive pair of constants such that, for every point in the support and every sufficiently small radius, the measure of the support inside the ball around that point is at least a constant fraction of the ball's Lebesgue measure.
- B.2 The support of X is compact.
- B.3 Margin condition: the probability that η(X) lies within distance t of 1/2 is bounded by a constant multiple of a power of t.
- B.4 Smoothness of η: there exist an exponent and a constant such that the change in η between any two points is bounded by the constant times the corresponding power of their distance.
- B.5 The density of X is finite and bounded away from 0.
Remark 3.
Assumption B.3 is weaker than the corresponding condition in [6], but this in fact does not affect the convergence.
The following theorem provides a general upper bound on the regret for both perturbed and attacked data:
Theorem 8 (Convergence of Regret).
Under [B.1] to [B.5], if for some , , taking
the regret becomes
where the second term is the minimax rate of regret for k-NN.
Theorem 8 also reveals a sufficient condition for k-NN to be consistent, i.e., for the regret to eventually converge to 0: for both perturbed and attacked data, as long as the corruption level converges to 0 suitably fast, k-NN is still consistent using these two types of corrupted testing data.
Theorem 9 (Minimax Rate of Regret).
Let the classifier be any estimator of the Bayes rule and let the class of distributions consist of those satisfying [B.1] to [B.5]. Under the stated condition on the corruption level, there exists some constant such that
(20)
The constant depends only on the assumption parameters.
C.2 Proofs
Proof of Theorem S.8.
Let . Denote and , and as the excess risk. Define
with as the distance from to its th nearest neighbor, and the decision boundary area:
Given , , and , similarly to Lemma 8 in [6], the event can be covered as:
When for all , and , assume , then
The other two events are easy to figure out.
In addition, from the definition of regret, assume ,
similarly, when , we have
As a result, the regret can be represented as
For simplicity, denote . We then follow the proof of Lemma 20 of [6]. Without loss of generality assume . For perturbation , define
then we have
hence .
From the definition of and , when , we also have
To address the problem that the upper bound can be much greater than 1 when is small, we define , and taking , by Bernstein's inequality it becomes
When , the exponential tail will diminish fast, leading to
Recall that and , hence when , we can obtain the minimum upper bound
∎
Proof of Theorem S.9.
The proof is similar to [2], using technical details from [1] for Assouad's method. There are two scenarios to consider. Defining , and as suitable constants, we first show that, for any ,
(21) |
Further, when , our target is to show that
(22) |
Case 1: when , the basic idea is to construct a distribution of X and two conditional label distributions such that the Bayes classifiers from these two distributions are reversed relative to each other, yet from the sampled points we cannot distinguish which distribution the samples are drawn from. For example, given samples from a normal distribution, we cannot statistically determine whether the data are sampled from a zero-mean distribution or from a distribution with a small nonzero mean; thus any estimator based on the data (using either clean or corrupted testing data) can make a false prediction.
Assume distributed within a compact set in . For an integer , consider the regular grid as
(23) |
For any point , denote as the closest grid point in , and define as a partition of such that and are in the same if and only if . Among all the ’s, select of them as , and .
Take as the center of for . When , set the density of as for some , and the density of in is set to be 0. Assume uniformly distributes in .
Let be a nonincreasing infinitely differentiable function starting from 0 and satisfying -smoothness condition. Moreover, is in . Denote and as
(24) |
and
(25) |
Through the above construction, if we take or , and let , then when , margin condition is also satisfied.
The construction above will also be applied in Case 2 (with difference on the choice of ).
Now we apply Assouad’s method to find the lower bound of regret. Denote as a distribution such that when , , and when , , then we have for any estimator with as data,
(26) | |||||
(27) | |||||
(28) | |||||
(29) | |||||
(30) | |||||
(31) | |||||
(32) | |||||
(33) |
As a result, when , for all , and , there exists some such that
(36) | |||||
(37) | |||||
(38) | |||||
(39) |
The regret is lower bounded as when taking . Note that can be any classifier, which also includes those “random" estimators when is perturbed / attacked.
Case 2: when , we construct a distribution such that, after injecting noise into it, there are some sets on which the two conditional class probabilities are comparable; thus, no matter which label the estimator outputs, it has a constant-level probability of making a false decision there.
The construction is similar to Case 1, and we take . For the function , we now let it increase from 0 and become 1 in . For each pair , take when and when . The support of is . Take and ; then both the smoothness condition and the margin condition are satisfied.
After injecting random noise on , consider the boundary region between the two sets; then, when the testing point comes from this region, the two conditional class probabilities are within a constant of each other. Thus the probability that any estimator makes a false decision at such a point is larger than a constant. In addition, the probability measure of this region is bounded below by a constant. Thus the regret is bounded below accordingly.
∎