
Predictive Power of Nearest Neighbors Algorithm under Random Perturbation

Yue Xing
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
&Qifan Song
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
\ANDGuang Cheng
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
Abstract

We consider a data corruption scenario in the classical $k$ Nearest Neighbors ($k$-NN) algorithm, namely that the testing data are randomly perturbed. Under such a scenario, the impact of the corruption level on the asymptotic regret is carefully characterized. In particular, our theoretical analysis reveals a phase transition phenomenon: when the corruption level $\omega$ is below a critical order (i.e., the small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., the large-$\omega$ regime), the asymptotic regret deteriorates polynomially. Surprisingly, we obtain the negative result that the classical noise-injection approach does not help improve the testing performance in the beginning stage of the large-$\omega$ regime, even at the level of the multiplicative constant of the asymptotic regret. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in [27] achieves at most a sub-optimal rate when the data dimension $d>4$, even if $k$ is chosen optimally in the pre-processing step.

1 Introduction

While there has been a great deal of success in the development of machine learning algorithms, much of the success is in relatively restricted domains with limited structural variation or few system constraints. Those algorithms would be quite fragile in broader real-world scenarios, especially when the testing data are contaminated. For example, in image classification, when the input data are slightly altered due to a minor optical sensor system malfunction, a deep neural network may yield a totally different classification result [11]; more seriously, attackers can feed well-designed malicious adversarial input to the system and induce wrong decision making [30, 17]. One strand of existing research along this line focuses on methodology development including generation of corrupted testing samples for which machine learning algorithms fail [18, 19, 13] and design of robust training algorithms [12, 14, 23, 16]. The other strand of research focuses on theoretical investigation on how the data corruption affects the algorithm performance [25, 28, 10, 9].

This work aims to study the robustness of the $k$ Nearest Neighbors ($k$-NN) algorithm and its variants from a theoretical perspective. There is a rich literature on the same topic, but under different setups. For example, Cannings et al. [5] and Reeve and Kaban [21, 20] study the case where labels of the training data are contaminated, and investigate the overall excess risk of the trained classifier; Wang et al. [25] and Yang et al. [28] study the case where the testing data are contaminated, and are only concerned with testing robustness when the testing data belong to a certain subset of the support, rather than the whole support. In contrast to these existing works, the present study addresses a different question: how is the overall regret of the $k$-NN classifier, trained on uncontaminated training data, affected when the testing features are corrupted by random perturbation?

Our main theoretical result (derived in Section 2) characterizes the asymptotic regret of $k$-NN for randomly perturbed testing data (with an explicit form of the multiplicative constant) with respect to the choice of $k$ and the level of testing data corruption. There are several interesting implications. First, there exists a critical contamination level: (a) below it, the asymptotic order of the regret is not affected; (b) above it, the asymptotic order of the regret deteriorates polynomially. Second, although the regret of $k$-NN deteriorates polynomially with respect to the corruption level, it actually achieves the best possible accuracy for testing randomly perturbed data (under a finely tuned choice of $k$). Hence the $k$-NN classifier is rate-minimax for both the clean data testing task [2, 22, 4] and the randomly perturbed data testing task.

Similar to adversarial training, one strategy to robustify a learning algorithm is to inject the same random noise into the training data. However, our theoretical analysis reveals that such a noise injection approach does not improve the performance of the $k$-NN algorithm in the beginning stage of the polynomial deterioration regime, in the sense that it leads to exactly the same asymptotic regret.

Adapting the analysis to the adversarial attack scenario (i.e., testing data contaminated with worst-case perturbation), we compare the regret of $k$-NN under adversarial attack and under random perturbation. Interestingly, when the level of data contamination is beyond the aforementioned critical order, adversarial attack leads to a worse regret than random perturbation only in terms of the multiplicative constant of the convergence rate.

Our developed theory may also be used to evaluate the asymptotic performance of variants of the $k$-NN algorithm. For example, Xue and Kpotufe [27] applied 1NN to pre-processed data (relabelled by $k$-NN) in order to achieve the same accuracy as $k$-NN. Interestingly, this algorithm can be translated into the classical $k$-NN algorithm applied to a type of perturbed sample, to which our theory naturally applies. In particular, we prove that the above algorithm, under our model assumption framework, only obtains a sub-optimal rate of regret (at least worse than $k$-NN) when $d>4$.

As far as we are aware, the only related work in the context of $k$-NN is [25], which evaluated the adversarial robustness of $k$-NN based on the concept of the "robust radius", i.e., the maximum allowed attack intensity that does not affect the testing performance. However, their analysis ignores the area near the decision boundary where mis-classification is most likely to occur. By filling these gaps, our work attempts to present a more comprehensive regret analysis of the robustness of $k$-NN.

2 Effect of Random Perturbation on $k$-NN

In this section, we formally introduce the model setup and present our main theorems which characterize the asymptotic regret for randomly perturbed testing samples.

2.1 Model Setup

Denote $P(Y=1|X=x)$ as $\eta(x)$, and its $k$-NN estimator as $\widehat{\eta}_{k,n}(x)$, an average over the $k$ nearest neighbors among the $n$ training samples. The corresponding Bayes classifier and $k$-NN classifier are defined as $g(x)=1\{\eta(x)>1/2\}$ and $\widehat{g}_{n,k}(x)=1\{\widehat{\eta}_{k,n}(x)>1/2\}$, respectively.

Define $\omega$ as the level of perturbation. For any intended testing data point $x$, we only observe its randomly perturbed version:

\widetilde{x}\sim\mbox{Unif}(\partial B(x,\omega)), \qquad (1)

that is, $\widetilde{x}$ is uniformly distributed over $\partial B(x,\omega)$, the boundary of a Euclidean ball centered at $x$ with radius $\omega$.
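For concreteness, a draw from (1) can be simulated by normalizing a standard Gaussian vector; the short Python sketch below (the function name is ours, not part of the paper) produces such a perturbed observation.

  import numpy as np

  def perturb_on_sphere(x, omega, rng=None):
      # Draw x_tilde ~ Unif(boundary of B(x, omega)) as in (1):
      # a normalized Gaussian direction is uniform on the unit sphere.
      rng = np.random.default_rng() if rng is None else rng
      direction = rng.standard_normal(x.shape[-1])
      direction /= np.linalg.norm(direction)
      return x + omega * direction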

In this case, we define the so-called "perturbed" regret as

\mbox{Regret}(k,n,\omega)=P(Y\neq\widehat{g}_{n,k}(\widetilde{X}))-P(Y\neq g(X)),

and

\mbox{Regret}(n,\omega)=\min\limits_{k=1,\ldots,n}\mbox{Regret}(k,n,\omega).

Note that the $k$-NN classifier $\widehat{g}_{n,k}$ is trained on uncontaminated training samples. Obviously, when $\omega=0$, the above definition reduces to the regret commonly used in the statistical classification literature.
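When $\eta$ is known (e.g., in a simulation), the perturbed regret can be approximated by Monte Carlo. The following is a minimal sketch under that assumption, using scikit-learn's KNeighborsClassifier; all names are ours.

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def perturbed_regret(eta, X_train, y_train, X_test, k, omega, rng):
      # Excess risk of k-NN on randomly perturbed test points, relative to
      # the Bayes classifier applied to the clean test points.
      clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
      noise = rng.standard_normal(X_test.shape)
      noise *= omega / np.linalg.norm(noise, axis=1, keepdims=True)
      eta_test = eta(X_test)                      # true P(Y=1|X=x) at the test points
      pred = clf.predict(X_test + noise)
      knn_err = np.mean(np.where(pred == 1, 1 - eta_test, eta_test))
      bayes_err = np.mean(np.minimum(eta_test, 1 - eta_test))
      return knn_err - bayes_err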

The following assumptions on $X$ and the underlying $\eta$ are imposed to facilitate our theoretical analysis.

  1. A.1

    $X$ is a random variable on a compact $d$-dimensional manifold $\mathcal{X}$ with boundary $\partial\mathcal{X}$. The density function of $X$ is differentiable, finite, and bounded away from 0.

  2. A.2

    The set $\mathcal{S}=\{x\,|\,\eta(x)=1/2\}$ is non-empty. There exists an open subset $U_{0}$ of $\mathbb{R}^{d}$ containing $\mathcal{S}$ such that, with $U$ denoting an open set containing $\mathcal{X}$, $\eta$ is continuous on $U\backslash U_{0}$.

  3. A.3

    There exists some constant $c_{x}>0$ such that when $|\eta(x)-1/2|\leq c_{x}$, $\eta$ has bounded fourth-order derivatives; when $\eta(x)=1/2$, $\dot{\eta}(x)\neq 0$, where $\dot{\eta}$ is the gradient of $\eta$ in $x$. In addition, the derivative of $\eta(x)$ restricted to the boundary of the support is non-zero.

Assumption A.1 ensures that for any $x\in\mathcal{X}$, all of its $k$ nearest neighbors are close to $x$ with high probability. This is due to the fact that if the density at a point $x$ is positive and finite, the distance from $x$ to its $k$th nearest neighbor is of order $O_{p}((k/n)^{1/d})=o_{p}(1)$. Assumption A.2 ensures the existence of $x$ in $\{x\in\mathcal{X}\,|\,\eta(x)=1/2\}$ and that $\eta(x)$ is continuous in other regions of $\mathcal{X}$. Assumption A.3 on $\eta(x)$ is slightly stronger than that imposed in [22], due to the consideration of testing data contamination. Specifically, the additional smoothness of $\eta(x)$ imposed in Assumption A.3 guarantees that some higher-order terms in the Taylor expansion of $\mathbb{E}\{\widehat{\eta}_{k,n}(\widetilde{x})-\eta(x)\}$ are negligible.

2.2 Asymptotic Regret and Phase Transition Phenomenon

We are now ready to conduct the regret analysis for $k$-NN in the presence of randomly perturbed testing examples. For any $x\in\mathcal{X}$, define

t(x)=\mathbb{E}\big{(}\|X_{i}-x\|_{2}^{2}\,\big{|}\,X_{i}\mbox{ is among the $k$ nearest neighbors of }x\big{)}.

Therefore, $t(x)$ represents the expected squared distance from $x$ to any of its $k$ nearest neighbors, and we take $t=\max_{x}t(x)$. Also denote $\bar{f}(x,y)$ and $\bar{f}(x)$ as the joint density of $(x,y)$ and the marginal density of $x$, respectively. Let $f_{1}(x):=\bar{f}(x,0)$, $f_{2}(x):=\bar{f}(x,1)$, and $\Psi(x):=f_{1}(x)-f_{2}(x)$.
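In practice, $t(x)$ can be approximated by its empirical analogue, the average squared distance from $x$ to its $k$ nearest training points; the following sketch (names ours) uses scikit-learn's NearestNeighbors.

  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def t_hat(X_train, x, k):
      # Empirical analogue of t(x): mean squared Euclidean distance from x
      # to its k nearest neighbors among the training samples.
      nn = NearestNeighbors(n_neighbors=k).fit(X_train)
      dist, _ = nn.kneighbors(x.reshape(1, -1))
      return float(np.mean(dist ** 2))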

We first characterize the asymptotic perturbed regret.

Theorem 1.

Define $\epsilon_{k,n,\omega}=\max(\log k/\sqrt{k},\,t+\omega)$. Under [A.1] to [A.3], if the testing data are randomly perturbed, then it follows that

\begin{split}\mbox{Regret}(k,n,\omega)=&\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})t(x_{0})\right)^{2}d\text{Vol}^{d-1}(x_{0})}_{Bias}+\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{\omega^{2}}{d}\|\dot{\Psi}(x_{0})\|d\text{Vol}^{d-1}(x_{0})}_{Corruption}\\ &+\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{1}{4k}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}d\text{Vol}^{d-1}(x_{0})}_{Variance}+\text{Remainder},\end{split} \qquad (2)

where Remainder $=O(\epsilon_{k,n,\omega}^{3})$ as $k,n\to\infty$. The term $b(\cdot)$ depends on the true $\eta(x)$ and the distribution of $X$, and does not change with respect to $k$ and $n$:

b(x) = \frac{1}{\bar{f}(x)d}\left\{\sum_{j=1}^{d}[\dot{\eta}_{j}(x)\dot{\bar{f}}_{j}(x)+\ddot{\eta}_{j,j}(x)\bar{f}(x)/2]\right\}.

Here $\dot{\eta}$, $\ddot{\eta}$, and $\dot{\bar{f}}$ denote the gradient of $\eta$, the Hessian of $\eta$, and the gradient of $\bar{f}$, respectively. The subscript $j$ denotes the $j$th element of $\dot{\eta}$ or $\dot{\bar{f}}$, and the subscript $j,j$ denotes the $(j,j)$th element of $\ddot{\eta}$.

Our result (2) clearly decomposes the asymptotic regret into a squared bias term, a data corruption effect term, a variance term, and a remainder term due to higher-order Taylor expansion. The first three terms are of order $O((k/n)^{4/d})$, $O(\omega^{2})$ and $O(1/k)$ respectively, and the remainder term is technically derived from higher-order Taylor expansion and the Berry-Esseen theorem. When $k$ is within a reasonable range, the remainder term is negligible compared with the other three terms. When $\omega=0$, (2) reduces to the bias-variance decomposition commonly observed in the nonparametric regression literature.

Based on Theorem 1, by varying $\omega$, we have the following observations:

Phase Transition Phenomenon

We discover a phase transition phenomenon for the regret with respect to the level of testing data contamination in general.

  1. 1.

    When $\omega^{2}\preccurlyeq(1/k\wedge t^{2})$ (footnote: to prevent a conflict with the definition of $\omega$, we use $\preccurlyeq$ and $\succcurlyeq$ in place of the $o(\cdot)$ and $\omega(\cdot)$ from the $O/\Omega$ notation; moreover, for $a(n)\preccurlyeq b(n)\preccurlyeq 1$, we mean that $b(n)/1<n^{-\varepsilon_{1}}$ and $a(n)/b(n)<n^{-\varepsilon_{2}}$ for some $\varepsilon_{1},\varepsilon_{2}>0$ as $n\rightarrow\infty$), the data corruption has almost no effect: $\mbox{Regret}(k,n,\omega)/\mbox{Regret}(k,n,0)\rightarrow 1$;

  2. 2.

    When $\omega^{2}=\Theta(1/k\vee t^{2})$, the regret is of the same order as $\mbox{Regret}(k,n,0)$ but with a different multiplicative constant depending on $\bar{f}$ and $\eta$;

  3. 3.

    When $\omega^{2}\succcurlyeq(1/k\vee t^{2})$, $\mbox{Regret}(k,n,\omega)=\Theta(\omega^{2})$ and $\mbox{Regret}(k,n,\omega)\succcurlyeq\mbox{Regret}(k,n,0)$.

Impact on $\mbox{Regret}(n,\omega)$ and the choice of $k$

The value of $k$ plays an important role in the $k$-NN algorithm. It is essential to examine how the intensity level $\omega$ affects the corresponding $\mbox{Regret}(n,\omega)$. From Theorem 1, one can derive that for the randomly perturbed testing scenario, if $\omega\preccurlyeq n^{-2/(d+4)}$, then $\mbox{Regret}(n,\omega)=\Theta(n^{-4/(d+4)})$; if $\omega\succcurlyeq n^{-2/(d+4)}$, then $\mbox{Regret}(n,\omega)=\Theta(\omega^{2})$. In other words, $\mbox{Regret}(n,\omega)=\Theta(\omega^{2}\vee n^{-4/(d+4)})$. The above rate can be achieved if we choose $k=\Theta(n^{4/(4+d)})$ when $\omega\preccurlyeq n^{-4/3(4+d)}$, and $1/\omega^{2}\preccurlyeq k\preccurlyeq n\omega^{d/2}$ when $\omega\succcurlyeq n^{-4/3(4+d)}$.

Robustness Trade-off in $k$-NN

Theorem 1 reveals that the regret is of order $O((k/n)^{4/d}+1/k)+O(\omega^{2})$; therefore, the data corruption has no impact as long as $\omega^{2}=o((k/n)^{4/d}+1/k)$. In other words, if $k$ is chosen optimally so as to minimize the regret for uncontaminated testing samples, i.e., $(k/n)^{4/d}+1/k$ is small, then $k$-NN is more sensitive to an increase of $\omega^{2}$. On the other hand, if $k$ is some sub-optimal choice such that $(k/n)^{4/d}+1/k$ is larger, then $k$-NN is more robust to testing data corruption. Hence there is a trade-off between accuracy on the uncontaminated testing task and robustness to random perturbation corruption.

Effect of Metric of Noise

Note that $\widetilde{x}$ can be defined on an $\mathcal{L}_{p}$ ball/sphere for different $p\geq 1$. Theorem 1 reveals that the effect of $\omega$ is independent of $t$ and $1/k$ when $\epsilon_{k,n,\omega}^{3}$ is not dominant and $\widetilde{x}$ is uniformly distributed in a ball/sphere (so that the first-order terms with respect to the noise are zero). As a result, defining $\varepsilon_{p}$ as a random variable uniformly distributed in an $\mathcal{L}_{p}$ ball/sphere, the regret changes accordingly by replacing $\omega^{2}/d$ in Theorem 1 with $\frac{1}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(\varepsilon_{p}^{\top}\dot{\eta}(x_{0}))^{2}$.
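For the Euclidean sphere in (1), this replacement is consistent with the $\omega^{2}/d$ factor in Theorem 1, since (a short verification, not part of the original statement), for $\varepsilon$ uniform on $\partial B(0,\omega)$,

\mathbb{E}[\varepsilon\varepsilon^{\top}]=\frac{\omega^{2}}{d}I_{d}\quad\Longrightarrow\quad\frac{1}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(\varepsilon^{\top}\dot{\eta}(x_{0}))^{2}=\frac{\dot{\eta}(x_{0})^{\top}\mathbb{E}[\varepsilon\varepsilon^{\top}]\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|^{2}}=\frac{\omega^{2}}{d}.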

Minimax Result under General Smoothness Conditions

To relax the strong assumptions A.1 to A.3, we follow [6] and impose smoothness and margin conditions. The observations are broadly similar to the results above. One can also show that $k$-NN attains the optimal rate of convergence.

Theorem 2 (Informal Statement under General Smoothness Conditions).

If the distribution of $(X,Y)$ satisfies

  1. 1.

    $|\eta(x)-\eta(x^{\prime})|\leq A\|x-x^{\prime}\|^{\alpha}$ for all $x,x^{\prime}$;

  2. 2.

    $P(|\eta(X)-1/2|<t)\leq Bt^{\beta}$ for some $\beta>0$;

together with some other general assumptions, then when taking

k\asymp O(n^{2\alpha/(2\alpha+d)}\wedge(n^{2\alpha/d}\omega^{-2\alpha\beta})^{1/(2\alpha/d+\beta+1)}),

the regret becomes

\mbox{Regret}(n,\omega)=O\left(\omega^{\alpha(\beta+1)}\vee n^{-\alpha(\beta+1)/(2\alpha+d)}\right),

which is proven to be the minimax rate.

Formal assumptions and results for Theorem 2 are postponed to Section C of the Appendix (Theorem S.3 for the convergence rate and Theorem S.4 for the minimax rate). From Theorem 2, the rate of regret is dominated by the larger of the random perturbation effect ($\omega^{\alpha(\beta+1)}$) and the minimax rate for clean data ($n^{-\alpha(\beta+1)/(2\alpha+d)}$). Similar to [22], our regret result derived under conditions A.1-A.3 matches the minimax rate of Theorem 2 by taking $\alpha=2$ and $\beta=1$.

Remark 1 (Adversarial Data Corruption).

So far, we have focused on the case of random perturbation. As a by-product, we analyze the effect of some special non-random data corruption. Due to the page limit, the detailed results and discussions are postponed to Section A in the appendix. In general, compared with worst-case data corruption, the effect of random perturbation mostly cancels out across perturbation directions, leading to a smaller corruption effect term. Therefore, unsurprisingly, $k$-NN is more robust to random perturbation than to worst-case adversarial data corruption. However, our rigorous analysis shows that the regret under adversarial data corruption (defined formally in the Appendix) is still of the same order as in the case of random perturbation, but with a larger multiplicative constant, when $\omega\succcurlyeq n^{-2/(d+4)}$ (when $\omega\preccurlyeq n^{-2/(d+4)}$, the effect of either adversarial corruption or random perturbation is negligible).

2.3 Comparison with Noise-Injected $k$-NN

In an iterative adversarial training algorithm (e.g., [23]), an attack is designed in each iteration based on the current model, and the model is updated based on the attacked training data. Similarly, for random perturbation, a natural idea to enhance the robustness of $k$-NN is to inject noise into the training data so that training and testing data share the same distribution. One can randomly perturb the training samples and train $k$-NN using the perturbed training data set. Comparing this noise-injected $k$-NN with the traditional $k$-NN, we find that there is no performance improvement for the former, even when the corruption level is in the early stage of the polynomial deterioration regime.
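A minimal sketch of the two classifiers compared here, assuming the same spherical perturbation as in (1) and using scikit-learn (function names are ours):

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def fit_knn(X, y, k, omega=0.0, rng=None):
      # omega = 0 gives the usual k-NN; omega > 0 gives its noise-injected
      # counterpart, trained on features perturbed uniformly on the sphere
      # of radius omega (mimicking the corrupted test distribution).
      rng = np.random.default_rng() if rng is None else rng
      if omega > 0:
          noise = rng.standard_normal(X.shape)
          noise *= omega / np.linalg.norm(noise, axis=1, keepdims=True)
          X = X + noise
      return KNeighborsClassifier(n_neighbors=k).fit(X, y)

  # g_hat       = fit_knn(X_train, y_train, k)              # raw training data
  # g_hat_prime = fit_knn(X_train, y_train, k, omega=0.05)  # noise-injected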

Denote $\widetilde{g}(\widetilde{x}):=P(Y=1|\widetilde{x}\text{ is observed})$ as the Bayes estimator and $\widehat{g}^{\prime}_{n}$ as the estimator trained using randomly perturbed training data. Let both estimators $\widehat{g}_{n}$ and $\widehat{g}^{\prime}_{n}$ adopt their respective best choices of $k$. Then we have

Theorem 3.

Under [A.1] to [A.3], when $0<\omega^{3}\preccurlyeq n^{-4/(d+4)}$,

\frac{P(Y\neq\widehat{g}_{n}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}{P(Y\neq\widehat{g}_{n}^{\prime}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}\rightarrow 1. \qquad (3)

Although it is intuitive to perturb the training data so that they match the distribution of the corrupted testing data, result (3) implies that the estimators $\widehat{g}_{n}$ and $\widehat{g}_{n}^{\prime}$ asymptotically share the same predictive performance for randomly perturbed testing data, so the noise injection strategy is futile. Note that this result holds when $\omega$ is small. Combined with our analysis in Theorem 1, within the range $n^{-2/(d+4)}\preccurlyeq\omega\preccurlyeq n^{-4/3(d+4)}$, the regret deteriorates polynomially due to the impact of testing data corruption, and this deterioration cannot be remedied by noise-injection adversarial training at all. One heuristic explanation is that such an injected perturbation may introduce additional noise into the estimation procedure and change some underlying properties (e.g., smoothness); consequently, this strategy of perturbing the training data does not necessarily help achieve a smaller regret, especially when $\omega$ is small.

Remark 2.

The above robustness result can be applied once we know, at testing time, whether the data are corrupted. In reality, we do not know whether a testing point is corrupted or not; in such a case, other techniques, e.g., hypothesis testing, need to be applied at testing time.

3 Application to other Nearest Neighbor-Type Algorithms

Our theoretical analysis may be adapted to other NN-type algorithms: pre-processed 1NN [27] and distributed-NN [8]. In particular, we prove that the regret of the former is sub-optimal for some class of distributions, and explain why the regret of the latter converges at the optimal rate, both from the random perturbation viewpoint.

3.1 Pre-processed 1NN

In some of the literature [27, 25], the algorithms run 1NN for prediction using pre-processed data instead of running $k$-NN on the raw data. The pre-processing step (also called the de-noising step) is reviewed in Algorithm 1. Specifically, we first run $k$-NN to predict labels for the training data set, then replace the original labels with the predicted labels $\widehat{y}_{i}$'s. In this way, applying 1NN on the data $(x_{1},\widehat{y}_{1}),\ldots,(x_{n},\widehat{y}_{n})$ can achieve good accuracy while the computational cost is much smaller than that of $k$-NN.

Algorithm 1 Data Pre-processing
  Input: data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, number of neighbors $k$.
  for $i=1$ to $n$ do
     Find the $k$ nearest neighbors of $x_{i}$ in $x_{1},\ldots,x_{n}$, excluding $x_{i}$ itself. Denote the index set of these $k$ neighbors as $N_{i}$.
     Estimate a label for $x_{i}$ as
\widehat{\eta}(x_{i}) = \frac{1}{k}\sum_{j\in N_{i}}y_{j},
\widehat{y}_{i} = 1_{\{\widehat{\eta}(x_{i})>1/2\}}.
  end for
  Output: $(x_{1},\widehat{y}_{1}),\ldots,(x_{n},\widehat{y}_{n})$.
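As an illustration, Algorithm 1 followed by 1NN can be written in a few lines of Python (a sketch under the assumption that the feature vectors are distinct, so that each point's closest neighbor in the fitted index is itself; names are ours):

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def preprocessed_1nn(X, y, k):
      # Relabel each training point by the majority vote of its k nearest
      # neighbors (excluding itself), then fit 1-NN on the relabelled data.
      # X: (n, d) numpy array, y: (n,) numpy array of 0/1 labels.
      knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
      _, idx = knn.kneighbors(X)            # idx[:, 0] is the point itself
      y_hat = (y[idx[:, 1:]].mean(axis=1) > 0.5).astype(int)
      return KNeighborsClassifier(n_neighbors=1).fit(X, y_hat)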

This in fact can be treated as an application of random perturbation of testing data in $k$-NN, in the sense that this classifier can be equivalently represented as $k$-NN under a perturbed sample:

\widehat{g}_{1\rm NN}(x)=\widehat{g}_{n,k}(\widetilde{x}),

where $\widehat{g}_{1\rm NN}$ is the pre-processed 1NN classifier, and $\widetilde{x}$ is the perturbed observation of $x$, namely the nearest neighbor of $x$ among the training samples. Although this $\widetilde{x}$ does not exactly match our definition (1), it can still be viewed as a randomly perturbed $x$ with contamination level $\omega=\Theta(n^{-1/d})$, which is the order of the expected distance from $x$ to its nearest neighbor under Assumption A.1.

From this point of view, Theorem 1 can be applied to derive the regret of the pre-processed 1NN algorithm, whose rate of convergence turns out to be slower than the optimal rate $\Theta(n^{-4/(d+4)})$ of $k$-NN when the data dimension $d$ is relatively high, say $d>4$.

Theorem 4.

Under [A.1] to [A.3], the regret of pre-processed 1NN under un-corrupted testing data is

\begin{split}\textit{Regret}_{\textit{1NN}}(k,n)=&\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})t(x_{0})\right)^{2}d\text{Vol}^{d-1}(x_{0})+\frac{1}{2}\int_{\mathcal{S}}\frac{1}{4k}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}d\text{Vol}^{d-1}(x_{0})\\ &+\text{Corruption}+\text{Remainder},\end{split}

where

Corruption = \Theta(n^{-2/d}),
Remainder = o(n^{-2/d})

when both $1/k$ and $(k/n)^{4/d}$ are of order $O(n^{-1/d})$, and $k=O(n^{6/d})$. As a result, pre-processed 1NN is sub-optimal when $d>4$ (compared with the optimal rate $n^{-4/(d+4)}$).

The result in Theorem 4 reveals a sub-optimal rate for the pre-processed 1NN, in comparison with the optimal rate claimed by [27]. This is not a contradiction, but is due to the different assumptions imposed on $\eta$ and the distribution of $X$. Xue and Kpotufe [27] assume the $\alpha$-smoothness condition $|\eta(x)-\eta(x^{\prime})|\leq A\|x-x^{\prime}\|^{\alpha}$, which is more general than our smoothness assumption A.3.

3.2 Distributed-NN

The computational complexity of $k$-NN is huge when $n$ is large; therefore, we consider a distributed NN algorithm: we randomly partition the original data into $s$ equal-sized parts; then, given $x$, each machine selects the $k/s$ nearest neighbors of $x$ and computes $\widehat{\eta}_{j}(x)$ for $j=1,\ldots,s$; finally, we average $\widehat{\eta}_{1}(x),\ldots,\widehat{\eta}_{s}(x)$ to obtain $\widehat{\eta}(x)$. The algorithm is shown in Algorithm 2, as in [8].

Algorithm 2 Distributed-NN
  Input: data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, number of neighbors $k$, number of slaves $s$, a point $x$ for prediction.
  Randomly divide the whole data set into $s$ parts, with index sets $S_{1},\ldots,S_{s}$.
  for $i=1$ to $s$ do
     Find the $k/s$ nearest neighbors of $x$ in $\{x_{j}\,|\,j\in S_{i}\}$. Denote the index set of these $k/s$ neighbors as $N_{i}$.
     Estimate
\widehat{\eta}_{i}(x) = \frac{1}{k/s}\sum_{j\in N_{i}}y_{j}.
  end for
  Estimate the label of $x$ as
\widehat{\eta}(x) = \frac{1}{s}\sum_{i=1}^{s}\widehat{\eta}_{i}(x),
\widehat{y} = 1_{\{\widehat{\eta}(x)>1/2\}}.
  Output: $(x,\widehat{y})$.
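The following sketch mirrors Algorithm 2 for a single query point, assuming $k$ is a multiple of $s$ (names are ours):

  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def distributed_nn_predict(X, y, x, k, s, rng=None):
      # Split the data into s random parts, average the (k/s)-NN estimates of
      # eta(x) from each part, and threshold the average at 1/2.
      rng = np.random.default_rng() if rng is None else rng
      perm = rng.permutation(len(X))
      eta_parts = []
      for part in np.array_split(perm, s):
          nn = NearestNeighbors(n_neighbors=k // s).fit(X[part])
          _, idx = nn.kneighbors(x.reshape(1, -1))
          eta_parts.append(y[part][idx[0]].mean())
      eta_hat = float(np.mean(eta_parts))
      return int(eta_hat > 0.5), eta_hat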

Distributed-NN differs in practice from $k$-NN on a single machine, since the $k$ selected neighbors aggregated from the $s$ subsets of data are not necessarily the same as the $k$ nearest neighbors selected on a single machine. Therefore, an additional assumption $k/s\rightarrow\infty$ is imposed to ensure that the neighborhood set selected by distributed NN behaves similarly to the neighborhood set selected by single-machine $k$-NN, in the sense that $E\|X_{i}-x\|^{2}$, where $X_{i}$ belongs to the distributed NN neighborhood set, is of the same order as $t(x)$. Therefore, based on Theorem 1, we obtain that the regret of Distributed-NN is in fact of the same order as that of $k$-NN. Formally, we have the following corollary:

Corollary 5.

Under [A.1] to [A.3], when $\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|d\text{Vol}^{d-1}(x_{0})\neq 0$ and $P(b(X)>0|\eta(X)=1/2)>0$, if the number of machines $s\preccurlyeq k$, then

\mbox{Regret}_{\rm DNN}(k,n)=\Theta(\mbox{Regret}_{\rm kNN}(k,n)),

where $\mbox{Regret}_{\rm DNN}$ and $\mbox{Regret}_{\rm kNN}$ denote the (clean testing data) regret of the distributed NN and $k$-NN algorithms, respectively.

4 Numerical Experiments

In Section 4.1, we evaluate the empirical performance of the $k$-NN algorithm for randomly perturbed testing data, comparing the $k$-NN classifiers trained on the raw, un-corrupted training data and on noise-injected training data (i.e., $\widehat{g}_{n}$ versus $\widehat{g}_{n}^{\prime}$ defined in Section 2.3). In Section 4.2, we also conduct experiments to compare $k$-NN with pre-processed 1NN for un-corrupted testing data. These numerical experiments are intended to show: (i) $k$-NN trained on un-corrupted training data has a similar testing performance to $k$-NN trained on noise-injected training data when $\omega$ is small, which validates our Theorem 3; (ii) for un-corrupted testing data, the regret of pre-processed 1NN is much worse than that of $k$-NN when $d>4$, which validates our Theorem 4. In general, $k$-NN is expected to always be better than pre-processed 1NN.

Note that in all our real data applications except for MNIST data set, we normalized all attributes.

4.1 Tackling Random Perturbation

4.1.1 Simulation

The random variable $X$ is of dimension 5, and each dimension independently follows an exponential distribution with mean 0.5. The conditional mean of $Y$ is defined as

\eta(x) = \frac{e^{x^{\top}w}}{e^{x^{\top}w}+e^{-x^{\top}w}}, \qquad w_{i} = i-d/2, \quad i=1,\ldots,d. \qquad (4)
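For reference, the simulated data can be generated as follows (a minimal sketch of the model above; the function name is ours):

  import numpy as np

  def simulate(n, d=5, rng=None):
      # Each coordinate of X is Exponential with mean 0.5; Y | X ~ Bernoulli(eta(X))
      # with eta as in (4) and w_i = i - d/2.
      rng = np.random.default_rng() if rng is None else rng
      X = rng.exponential(scale=0.5, size=(n, d))
      w = np.arange(1, d + 1) - d / 2
      eta = 1.0 / (1.0 + np.exp(-2.0 * X @ w))   # = e^{xw} / (e^{xw} + e^{-xw})
      Y = rng.binomial(1, eta)
      return X, Y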

For each pair of $(k,n)$, we use $2^{6},\ldots,2^{11}$ training samples and 10000 testing samples, and repeat the experiment 50 times to calculate the average regret. In each repetition, 5-fold cross validation is used to obtain $\widetilde{k}$. Then, based on [22], we adjust the number of neighbors to $\widehat{k}=\widetilde{k}(5/4)^{4/(4+d)}$, since $\widetilde{k}$ is the best $k$ value for $4n/5$ samples instead of $n$ samples. The two classifiers, trained on un-corrupted training data and on corrupted training data respectively, are used to predict corrupted testing data.

From Figure 1, as the number of training samples increases, the regret of both $k$-NNs decreases at the same speed for $0<\omega\leq 0.05$. This verifies that these two $k$-NNs in fact do not differ much when $\omega$ is small, i.e., Theorem 3. Empirically, the regret of $k$-NN trained on corrupted training data is worse than that of the one trained on un-corrupted training data when $\omega\leq 0.5$. On the other hand, when $\omega$ is large (so that the required condition in Theorem 3 fails), the two $k$-NNs may perform significantly differently. For example, for $\omega=3$ and sample size $n=64$, $\log_{2}(\mbox{Regret})$ is -2.86 using uncontaminated training data and -3.11 using corrupted training data.

Figure 1: Comparison between $k$-NN trained on raw training data (solid line) and $k$-NN trained on noise-injected training data (dashed line) in the simulation.

4.1.2 Real Data

We use two real data sets for the comparison of the two $k$-NNs: Abalone [7] and HTRU2 [15]. The Abalone data set contains 4177 samples, and two attributes (Length and Diameter) are used in this experiment. The classification label is whether an abalone is older than 10.5 years. The HTRU2 data set [15] has a size of 17,898 with 8 continuous attributes. For each data set, 25% of the samples are contaminated by random noise and thereafter used as testing data.

As shown in Figure 2, when $\omega$ is small, the error rates (misclassification rates) of the two $k$-NNs do not differ much for either data set.

Figure 2: $k$-NN trained on raw training data vs. noise-injected training data.

4.2 Performance of 1NN with Pre-processed Data

4.2.1 Simulation

Figure 3: Simulation comparison between $k$-NN and pre-processed 1NN; the $y$ axis denotes $\log_{2}(\mbox{Regret of pre-processed 1NN})-\log_{2}(\mbox{Regret of $k$-NN})$.

To observe a clearer difference, instead of $w_{i}$ in (4), we use a model where each dimension of $x$ follows the uniform $[0,1]$ distribution, with $\eta$ as in (4), and

w_{i} = i-d/2-0.5, \quad i=1,\ldots,d,

for different values of $d$, to compare the performance of $k$-NN and pre-processed 1NN; $X$ now follows the $d$-dimensional uniform $(0,1)$ distribution. From Figure 3, the order of the regret of pre-processed 1NN differs from that of $k$-NN when $d\geq 4$.

We also replace the uniform distribution with an exponential distribution with mean 0.5 for each dimension of $x$. Figure 4 shows that for $d=5$ and $d=10$, the increasing trend of the regret ratio is obvious, indicating a sub-optimal rate for pre-processed 1NN.

Figure 4: Comparison between $k$-NN and pre-processed 1NN. Each dimension of $X$ independently follows an exponential distribution with mean 0.5. The $y$ axis represents the regret ratio (instead of the logarithm of the regret ratio).

4.2.2 Real Data

We use four data sets to compare $k$-NN and pre-processed 1NN: MNIST, Abalone, HTRU2, and Credit [29].

The MNIST data set contains 70000 samples and the data dimension is 784. We randomly pick 25% of the samples as testing data, and randomly pick $2^{i}$ ($i=7,8,\ldots,12$) samples to train the $k$-NN classifier and the pre-processed 1NN classifier, where the choices of $k$ for both algorithms are determined by 5-fold cross validation. We repeat this procedure 50 times with different random seeds to obtain the mean testing error rates. As shown in Figure 5, as the number of training samples increases, the error rate ratio between the pre-processed 1NN classifier and the $k$-NN classifier stays above 1, at around 1.17.

Figure 5: MNIST, $k$-NN vs. pre-processed 1NN.

For the Abalone data set, we conduct the experiment in the same way as for MNIST, and observe that the error rate of pre-processed 1NN is always greater than that of $k$-NN. As shown in Figure 6, while the error rates of both pre-processed 1NN and $k$-NN decrease in $n$, their difference changes little when $n\leq 2^{11}$.

Figure 6: Comparison between $k$-NN and pre-processed 1NN on the Abalone data set.

The Credit data set contains 30000 samples (25% used as testing data) with 23 attributes. For the HTRU2 and Credit data sets, the mean and standard error of the error rates over the 50 repetitions are summarized in Table 1. From Table 1, $k$-NN obtains a slightly smaller error rate on average.

Data Set   Error Rate ($k$-NN)   Error Rate (Pre-1NN)
Credit     0.18879               0.1899
HTRU2      0.02143               0.0221

Table 1: Comparison between $k$-NN and pre-processed 1NN (Pre-1NN) on the HTRU2 and Credit data sets.

5 Conclusion and Discussion

In this work, we conduct an asymptotic regret analysis of $k$-NN classification for randomly perturbed testing data. In particular, a phase transition phenomenon is observed: when the corruption level is below a threshold order, it does not affect the asymptotic regret; when the corruption level is beyond this threshold order, the asymptotic regret grows polynomially. Moreover, when the level of corruption is small, there is no benefit in performing the noise-injected adversarial training approach.

Moreover, using the idea of random perturbation, we can further explain why pre-processed 1NN converges at a sub-optimal rate: it can be treated as $k$-NN with perturbed testing data, whose perturbation level $\omega$, the distance from $x$ to its nearest neighbor, is large when $d>4$. It is worth mentioning that our analysis can also be applied to Distributed-NN to verify the optimal rate obtained in [8].

An interesting observation from the numerical experiments is that the traditional $k$-NN can even outperform the $k$-NN trained via the noise injection method. This observation contradicts the common belief that injecting attacks into the training algorithm yields an adversarially robust algorithm (e.g., the optimization method in [23]). Therefore, it deserves further theoretical investigation to understand under which circumstances one can indeed benefit from the noise injection strategy.

References

  • Audibert [2004] Audibert, J.-Y. (2004), “Classification under polynomial entropy and margin assumptions and randomized estimators,” .
  • Audibert et al. [2007] Audibert, J.-Y., Tsybakov, A. B., et al. (2007), “Fast learning rates for plug-in classifiers,” The Annals of statistics, 35, 608–633.
  • Belkin et al. [2018] Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161.
  • Cannings et al. [2017] Cannings, T. I., Berrett, T. B., and Samworth, R. J. (2017), “Local nearest neighbour classification with applications to semi-supervised learning,” arXiv preprint arXiv:1704.00642.
  • Cannings et al. [2018] Cannings, T. I., Fan, Y., and Samworth, R. J. (2018), “Classification with imperfect training labels,” arXiv preprint arXiv:1805.11505.
  • Chaudhuri and Dasgupta [2014] Chaudhuri, K. and Dasgupta, S. (2014), “Rates of convergence for nearest neighbor classification,” in Advances in Neural Information Processing Systems, pp. 3437–3445.
  • Dua and Graff [2017] Dua, D. and Graff, C. (2017), “UCI Machine Learning Repository,” .
  • Duan et al. [2018] Duan, J., Qiao, X., and Cheng, G. (2018), “Distributed Nearest Neighbor Classification,” arXiv preprint arXiv:1812.05005.
  • Fawzi et al. [2018] Fawzi, A., Fawzi, H., and Fawzi, O. (2018), “Adversarial vulnerability for any classifier,” in Advances in Neural Information Processing Systems, pp. 1178–1187.
  • Fawzi et al. [2016] Fawzi, A., Moosavi-Dezfooli, S.-M., and Frossard, P. (2016), “Robustness of classifiers: from adversarial to random noise,” in Advances in Neural Information Processing Systems, pp. 1632–1640.
  • Goodfellow et al. [2014] Goodfellow, I., Shlens, J., and Szegedy, C. (2014), “Explaining and Harnessing Adversarial Examples,” arXiv preprint arXiv:1412.6572.
  • Goodfellow et al. [2015] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015), “Explaining and harnessing adversarial examples. CoRR,” .
  • Grosse et al. [2017] Grosse, K., Papernot, N., Manoharan, P., Backes, M., and McDaniel, P. (2017), “Adversarial examples for malware detection,” in European Symposium on Research in Computer Security, Springer, pp. 62–79.
  • Kurakin et al. [2016] Kurakin, A., Goodfellow, I., and Bengio, S. (2016), “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236.
  • Lyon et al. [2016] Lyon, R. J., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. (2016), “Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach,” Monthly Notices of the Royal Astronomical Society, 459, 1104–1123.
  • Madry et al. [2017] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017), “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083.
  • Papernot et al. [2017] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. (2017), “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia conference on computer and communications security, ACM, pp. 506–519.
  • Papernot et al. [2016a] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. (2016a), “The limitations of deep learning in adversarial settings,” in Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, IEEE, pp. 372–387.
  • Papernot et al. [2016b] Papernot, N., McDaniel, P., Swami, A., and Harang, R. (2016b), “Crafting adversarial input sequences for recurrent neural networks,” in Military Communications Conference, MILCOM 2016-2016 IEEE, IEEE, pp. 49–54.
  • Reeve and Kaban [2019a] Reeve, H. W. and Kaban, A. (2019a), “Classification with unknown class conditional label noise on non-compact feature spaces,” arXiv preprint arXiv:1902.05627.
  • Reeve and Kaban [2019b] — (2019b), “Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise,” arXiv preprint arXiv:1906.04542.
  • Samworth et al. [2012] Samworth, R. J. et al. (2012), “Optimal weighted nearest neighbour classifiers,” The Annals of Statistics, 40, 2733–2763.
  • Sinha et al. [2018] Sinha, A., Namkoong, H., and Duchi, J. (2018), “Certifying some distributional robustness with principled adversarial training,” .
  • Sun et al. [2016] Sun, W. W., Qiao, X., and Cheng, G. (2016), “Stabilized nearest neighbor classifier and its statistical properties,” Journal of the American Statistical Association, 111, 1254–1265.
  • Wang et al. [2017] Wang, Y., Jha, S., and Chaudhuri, K. (2017), “Analyzing the robustness of nearest neighbors to adversarial examples,” arXiv preprint arXiv:1706.03922.
  • Xing et al. [2018] Xing, Y., Song, Q., and Cheng, G. (2018), “Statistical Optimality of Interpolated Nearest Neighbor Algorithms,” arXiv preprint arXiv:1810.02814.
  • Xue and Kpotufe [2017] Xue, L. and Kpotufe, S. (2017), “Achieving the time of 11-NN, but the accuracy of kk-NN,” arXiv preprint arXiv:1712.02369.
  • Yang et al. [2019] Yang, Y.-Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. (2019), “Adversarial Examples for Non-Parametric Methods: Attacks, Defenses and Large Sample Limits,” arXiv preprint arXiv:1906.03310.
  • Yeh and Lien [2009] Yeh, I.-C. and Lien, C.-h. (2009), “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, 36, 2473–2480.
  • Zhang et al. [2017] Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., and Xu, W. (2017), “Dolphinattack: Inaudible voice commands,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM, pp. 103–117.

Appendix A Comparison between Random Perturbation and Non-random Perturbation

We use the following adversarial attack as the non-random perturbation:

\widetilde{x}=\begin{cases}\mathop{\rm argmin}\limits_{z\in B(x,\omega)}\eta(z)\qquad\mbox{if }\eta(x)>1/2\\ \mathop{\rm argmax}\limits_{z\in B(x,\omega)}\eta(z)\qquad\mbox{if }\eta(x)\leq 1/2\end{cases}. \qquad (5)

When $\omega\rightarrow 0$, if $\eta$ is differentiable, the length of the attack converges to $\omega$ as well.

The proposed attack scheme (5) is also called a "white-box attack", since the adversary has knowledge of $\eta(x)$. On the other hand, unlike the "white-box attack" mentioned in [25], the perturbation and attack we focus on are independent of the training samples.

Theorem 6.

Under [A.1] to [A.3], if the testing data are adversarially attacked and $1/\sqrt{k},\zeta\ll\omega$, then

\begin{split}&\mbox{Regret}(k,n,\omega)=\frac{B_{1}}{4k}+\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})^{2}\zeta(x_{0})^{2}+2\omega^{2}\|\dot{\eta}(x_{0})\|^{2}\right)d\text{Vol}^{d-1}(x_{0})+Rem,\end{split} \qquad (6)

where $Rem:=O(\omega/\sqrt{k}+\omega\zeta)+o((1/k)\vee(\zeta+\omega)^{2})$.

From Theorem 6, one can see that the regret under adversarial attack is larger than the one under random perturbation if the $\omega^{2}$ term is dominant.

Appendix B Proof of Regret Analysis in Section 2 and 3

B.1 Theorem 1

This section contains the proofs of Theorem 1 and Theorem S.6. The two proofs are similar, so for Theorem S.6 we only present the parts that differ from the proof of Theorem 1.

Define $R_{1}(x)$ to $R_{k}(x)$ as the unsorted distances from the $k$ nearest neighbors to the testing data point $x$, and $R_{k+1}(x)$ as the distance from the exact $(k+1)$-th nearest neighbor to $x$ itself. Similar to [6], conditional on the distance of the $(k+1)$-th neighbor, the first $k$ neighbors are i.i.d. random variables distributed within $B(x,R_{k+1}(x))$.

In addition to $f_{1}$, $f_{2}$, and $\Psi$, we further denote $\bar{f}(x)$ as the density of $X$.

Proposition 7 (Lemma S.1 in [24]).

For any distribution function $G$ with density $g$,

\int_{\mathbb{R}}[G(-bu-a)-1_{\{u<0\}}]du = -\frac{1}{b}\left\{a+\int_{\mathbb{R}}tg(t)dt\right\},
\int_{\mathbb{R}}u[G(-bu-a)-1_{\{u<0\}}]du = \frac{1}{b^{2}}\left\{\frac{a^{2}}{2}+\frac{1}{2}\int_{\mathbb{R}}t^{2}g(t)dt+a\int_{\mathbb{R}}tg(t)dt\right\}.

Now we start our proof of Theorem 1.

Proof of Theorem 1.

The idea of the proof follows [22], and there are five steps in total:

  • Step 0: We give some definitions in preparation for the proof.

  • Step 1: Given a fixed (unobserved) testing sample $x$ and conditional on the perturbation random variable $\delta$, we obtain the mean and variance of $\widehat{\eta}_{k,n}(x+\delta)$. In particular, for any $x_{0}$ satisfying $\eta(x_{0})=1/2$, letting $x^{t}_{0}=x_{0}+t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}$, we have that

    \mathbb{E}[\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta]=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})R_{1}^{2}+O(t^{2}+\omega^{2}+R_{1}^{4}),

    and

    Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta)=\frac{1}{4k}+O(\epsilon^{2}/k):=\frac{1}{s_{k,n}^{2}}+O(\epsilon^{2}/k).

  • Step 2: Use tube theory to construct a tube based on $\mathcal{S}$. The remainder of the regret outside the tube is of $O(\epsilon^{3})$ for some $\epsilon$:

    \begin{split}2*\mbox{Regret} =&\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ =&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dt\,d\text{Vol}^{d-1}(x_{0})+r_{1},\end{split}

    for some remainder term $r_{1}$.

  • Step 3: Use the Berry-Esseen theorem to approximate the probability

    P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)

    by a Gaussian probability,

    \Phi\left(\frac{\mathbb{E}(1/2-\widehat{\eta}_{k,n}(x_{0}+\delta))}{\sqrt{Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))}}\bigg{|}\delta\right)+o. \qquad (7)

  • Step 4: Plug the mean and variance of $\widehat{\eta}_{k,n}$ from Step 1 into (7), and integrate the formula in Step 2 over the tube (integrate over $t$).

Step 0: In this step we introduce some definitions.

Denote $\delta$ as the random perturbation, i.e., $\delta=\widetilde{x}-x$. Denote $X_{1}(x)$ to $X_{k}(x)$ as the $k$ unsorted neighbors of $x$ among the training samples and $Y_{i}(x)$ as the $Y$ value of the corresponding $X_{i}(x)$. When no confusion is caused, we drop the argument $x$ and write $X_{i}$ and $Y_{i}$ for short. Then the probability of classifying a contaminated sample as 0 becomes

P\left(\widehat{\eta}_{k,n}(x+\delta)\leq\frac{1}{2}\bigg{|}\delta\right) = P\left(\sum_{i=1}^{k}(Y_{i}(x+\delta)-1/2)<0\bigg{|}\delta\right).

If we directly investigate $P(\widehat{\eta}_{k,n}(x)\leq 1/2)$, the possible values of $\eta(x)$ range over $[0,1]$, but those $\eta(x)$'s that are far from $1/2$ contribute little to the regret. Hence we consider $x$ in $\mathcal{S}^{\epsilon}$, where

\mathcal{S}^{\epsilon}=\left\{x_{0}^{t}:=x_{0}+t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}:x_{0}\in\mathcal{S},|t|\leq\epsilon\right\}

for $\epsilon=\epsilon_{k,n,\omega}$.

Further define

\epsilon_{k,n,\omega}=\max_{x\in\mathcal{R}}C\left(\frac{\log k}{\sqrt{k}}\vee(R_{1}^{2}(x)+\omega)\right)

for some large constant $C$.

Step 1: For the scenario of random perturbation, $\delta$ is a random variable uniformly distributed on the sphere $\partial B(x_{0}^{t},\omega)$; we first evaluate $\mathbb{E}(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))$ and $Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))$ for given $x_{0}$ and $\delta$.

\begin{split}&\mathbb{E}[\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta]=\mathbb{E}(Y_{1}(x_{0}^{t}+\delta)|\delta)=\mathbb{E}\eta(X_{1}|\delta)\\ =&\mathbb{E}\left(\eta(x_{0}^{t}+\delta)+(X_{1}-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)+1/2(X_{1}-x_{0}^{t}-\delta)^{\top}\ddot{\eta}(x_{0}^{t}+\delta)(X_{1}-x_{0}^{t}-\delta)\bigg{|}\delta\right)\\ &+rem,\end{split}

where $rem$ is a remainder term from the Taylor expansion. Before discussing $rem$, we consider the dominant part of $\mathbb{E}\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)$. Conditional on $\delta$ and $R_{1}(x_{0}^{t}+\delta)=\|X_{1}-x_{0}^{t}-\delta\|$, the distribution of $X_{1}$ is on the sphere of $B(x_{0}^{t}+\delta,R_{1})$. Denote the density of this distribution as $\bar{f}(x|x_{0}^{t}+\delta,R_{1}(x_{0}^{t}+\delta))$, and define $\bar{f}^{\prime}(x|x_{0}^{t}+\delta,R_{1}(x_{0}^{t}+\delta))$ as its gradient. For simplicity, rewrite $R_{1}(x_{0}^{t}+\delta)$ as $R_{1}$. Then, based on (A.1) and (A.3) for the smoothness of $\bar{f}$ and $\eta$, expanding $\bar{f}(x|x_{0}^{t}+\delta,R_{1})$ in a Taylor series at $x_{0}^{t}+\delta$, we have

\begin{split}&\mathbb{E}((X_{1}-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)|\delta,R_{1})\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bar{f}(x|x_{0}^{t}+\delta,R_{1})dx\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bigg{[}\bar{f}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})+\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}(x-x_{0}^{t}-\delta)\\ &\qquad\qquad+\frac{1}{2}(x-x_{0}^{t}-\delta)^{\top}\bar{f}^{\prime\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})(x-x_{0}^{t}-\delta)+O(\|x-x_{0}^{t}-\delta\|_{2}^{3})\bigg{]}dx\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}(x-x_{0}^{t}-\delta)dx+o\\ =&\,tr\left(\dot{\eta}(x_{0}^{t}+\delta)\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}dx\right)+O(R_{1}^{4}),\end{split}

where $\int_{\partial B}$ denotes integration over the sphere $\partial B(x_{0}^{t}+\delta,R_{1})$, and the first-order and third-order terms become 0.

In addition,

\begin{split}&tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\mathbb{E}\left((X_{1}-x_{0}^{t}-\delta)(X_{1}-x_{0}^{t}-\delta)^{\top}|R_{1}\right)\right)\\ =&\,tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}\bar{f}(x|x_{0}^{t}+\delta,R_{1})dx\right)\\ =&\,tr\left(\frac{\bar{f}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}dx\right)\\ &+tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}(x-x_{0}^{t}-\delta)^{\top}\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})dx\right)+O(R_{1}^{4}).\end{split}

The term $rem$ in $\mathbb{E}(\widehat{\eta}_{k,n})$ can be handled in a similar manner, and $rem=O(R_{1}^{4})$. Hence, taking

b(x)=\frac{1}{\bar{f}(x)d}\left\{\sum_{j=1}^{d}[\dot{\eta}_{j}(x)\dot{\bar{f}}_{j}(x)+\ddot{\eta}_{j,j}(x)\bar{f}(x)/2]\right\},

we have

\begin{split}\mathbb{E}(\widehat{\eta}_{k,n}|\delta,R_{1}) =&\,\eta(x_{0}^{t}+\delta)+b(x_{0}^{t}+\delta)R_{1}^{2}+O(R_{1}^{4})\\ =&\,\eta(x_{0})+\frac{t}{\|\dot{\eta}(x_{0})\|}\dot{\eta}(x_{0})^{\top}\dot{\eta}(x_{0})+\delta^{\top}\dot{\eta}(x_{0})+O(t^{2}+\omega^{2})\\ &+b(x_{0})R_{1}^{2}+R_{1}^{2}\frac{t}{\|\dot{\eta}(x_{0})\|}\dot{\eta}(x_{0})^{\top}\dot{b}(x_{0})+R_{1}^{2}\delta^{\top}\dot{b}(x_{0})+O(R_{1}^{4})\\ =&\,\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})R_{1}^{2}+O(t^{2}+\omega^{2}+R_{1}^{4}).\end{split}

Denote $t_{k,n}(x_{0}^{t}+\delta)=\mathbb{E}R_{1}^{2}$. Using the arguments in Lemma 1 and Theorem 2 of [26], and taking $a_{d}=2^{d}\Gamma(1+1/2)^{d}/\Gamma(1+d/2)$, we obtain

\begin{split}t_{k,n}(x_{0}^{t}+\delta) =&\,\frac{1}{a_{d}^{2/d}\bar{f}(x_{0}^{t}+\delta)^{2/d}}\left(\frac{k}{n}\right)^{2/d}+o(t_{k,n}^{2}(x_{0}^{t}+\delta))\\ =&\,t_{k,n}(x_{0})+O\left(t\left(\frac{k}{n}\right)^{2/d}\right)+O\left(\omega\left(\frac{k}{n}\right)^{2/d}\right)+o(t_{k,n}^{2}(x_{0}^{t}+\delta)).\end{split}

Further denoting ${\mu}_{k,n,\omega}(x_{0}^{t},\delta)=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})t_{k,n}(x_{0})$, we obtain

\mathbb{E}(\widehat{\eta}_{k,n})={\mu}_{k,n,\omega}(x_{0}^{t},\delta)+O(t^{2}+\omega^{2}+t_{k,n}^{2})={\mu}_{k,n,\omega}(x_{0}^{t},\delta)+O(\epsilon_{k,n,\omega}^{2}).

In terms of $Var(\widehat{\eta}_{k,n}(x_{0}^{t},\delta))$: fixing $R_{k+1}$, the $k$ neighbors are i.i.d. random variables in $B(x_{0}^{t}+\delta,R_{k+1})$, so

Var(Y_{1}|R_{k+1},\delta)=\mathbb{E}(Y_{1}|R_{k+1},\delta)(1-\mathbb{E}(Y_{1}|R_{k+1},\delta))=\frac{1}{4}+O\left(\epsilon_{k,n,\omega}^{2}\right),

when $R_{k+1}^{2}=O(t_{k,n}(x_{0}))$. Moreover, as mentioned in [6] and [3], the probability of $R_{k+1}\gg t_{k,n}(x_{0})$ has an exponential tail, hence the overall variance becomes

Var(Y_{1}|\delta)=\frac{1}{4}+O\left(\epsilon_{k,n,\omega}^{2}\right).

This also implies that

|\sqrt{Var(Y_{1}|\delta)}-\sqrt{1/4}|=O(\epsilon_{k,n,\omega}).

Step 2: Since the density of $x$ is 0 when $x$ is not in the support,

\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)

is equal to

\int_{\mathcal{R}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\right)-1_{\{\eta(x)<1/2\}}\right)dP(x),

where $\mathcal{R}$ is the support of $X$. Taking $\epsilon_{k,n,\omega}\geq-s_{k,n}\log s_{k,n}$, we have

\begin{split}&\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ =&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dt\,d\text{Vol}^{d-1}(x_{0})+r_{1}.\end{split} \qquad (8)

The result in (8) uses tube theory to transform the integration from $\mathbb{R}^{d}$ to $\mathbb{R}\times\mathcal{S}$. Denote the map $\phi\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)=x_{0}^{t}$; then the pullback of the $d$-form $dx$ is given at $(x_{0},t{\dot{\eta}(x_{0})}/{\|\dot{\eta}(x_{0})\|})$ by

det(Ψ˙(x0,tη˙(x0)η˙(x0)))dtdVold1(x0).\displaystyle det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)dtd\text{Vol}^{d-1}(x_{0}).

For r1r_{1}, it is composed of four parts: (1) the integral outside 𝒮ϵk,n,ω\mathcal{S}^{\epsilon_{k,n,\omega}}, (2) the difference between Ψ(t)\Psi(t) and tΨ˙(x)t\|\dot{\Psi}(x)\|, (3) the difference between 𝒮ϵk,n,ω\mathcal{S}^{\epsilon_{k,n,\omega}} and the tube generated using 𝒮\mathcal{S}, and (4) the remainder of det(Ψ˙(x0,tη˙(x0)η˙(x0)))det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right):

r1=d\𝒮ϵk,n,ω(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑P(x)+𝒮ϵk,n,ω(Ψ(x)tΨ˙(x))(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑x+[𝒮ϵk,n,ωtΨ˙(x)(P(i=1k1kYi12|δ)1{η(x)<1/2})dx𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0})det(Ψ˙(x0,tη˙(x0)η˙(x0)))dtdVold1(x0)]+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0})[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]𝑑t𝑑Vold1(x0):=r11+r12+r13+r14.\begin{split}&r_{1}\\ =&\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ +&\int_{\mathcal{S}^{\epsilon_{k,n,\omega}}}(\Psi(x)-t\|\dot{\Psi}(x)\|)\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dx\\ +&\bigg{[}\int_{\mathcal{S}^{\epsilon_{k,n,\omega}}}t\|\dot{\Psi}(x)\|\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dx\\ &\qquad-\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)dtd\text{Vol}^{d-1}(x_{0})\bigg{]}\\ +&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})\\ :=&r_{11}+r_{12}+r_{13}+r_{14}.\end{split} (9)

In r_{1}, r_{12} is O(\epsilon_{k,n,\omega}^{3}) since \Psi(x)-t\|\dot{\Psi}(x)\|=O(t^{2}); r_{13} is O(\epsilon_{k,n,\omega}^{3}) since the difference in volume is O(\epsilon_{k,n,\omega}^{2}). For r_{11} in r_{1}:

0\displaystyle 0 \displaystyle\geq d\𝒮ϵk,n,ω{x|η(x)<1/2}(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)
=\displaystyle= d\𝒮ϵk,n,ω{x|η(x)<1/2}(P(i=1k1k(Yi12)𝔼(Y112)𝔼(Y112)|δ)1{η(x)<1/2})𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}\left(Y_{i}-\frac{1}{2}\right)-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\leq-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)
=\displaystyle= d\𝒮ϵk,n,ω{x|η(x)<1/2}P(i=1k1k(Yi12)𝔼(Y112)>𝔼(Y112)|δ)𝑑P(x).\displaystyle-\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}P\left(\sum_{i=1}^{k}\frac{1}{k}\left(Y_{i}-\frac{1}{2}\right)-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)>-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\bigg{|}\delta\right)dP(x).

From the definition of \epsilon_{k,n,\omega}, we know that for any \delta, \inf_{x\in\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}}|\mathbb{E}Y(\widetilde{x}_{\omega})-1/2|\geq c_{1}\epsilon_{k,n,\omega} for some c_{1}>0. Using Bernstein's inequality, we obtain the upper bound

d\𝒮ϵk,n,ω{x|η(x)<1/2}P(i=1k1k(Yi12)𝔼(Y112)>𝔼(Y112)|δ)𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}P\left(\sum_{i=1}^{k}\frac{1}{k}(Y_{i}-\frac{1}{2})-\mathbb{E}(Y_{1}-\frac{1}{2})>-\mathbb{E}(Y_{1}-\frac{1}{2})\bigg{|}\delta\right)dP(x)
\displaystyle\leq O(exp(c2kϵk,n,ω2))=o(1/k3/2),\displaystyle O(\exp(-c_{2}k\epsilon_{k,n,\omega}^{2}))=o(1/k^{3/2}),

for some c_{2}>0; the last equality holds since k\epsilon_{k,n,\omega}^{2}\gtrsim(\log k)^{2} under the choice \epsilon_{k,n,\omega}\geq-s_{k,n}\log s_{k,n} with s_{k,n}^{2}\asymp 1/k.

A similar result can be obtained for \mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)>1/2\}.

For r14r_{14} in r1r_{1},

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0t+δ)<1/2)1{t<0})\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)[tη˙(x0)η˙(x0)x0𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})]\displaystyle+\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left[t\frac{\dot{\eta}(x_{0})^{\top}}{\|\dot{\eta}(x_{0})\|}\frac{\partial}{\partial x_{0}}\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)\right]
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
+o\displaystyle+o
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)[tη˙(x0)η˙(x0)x0𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})]\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left[t\frac{\dot{\eta}(x_{0})^{\top}}{\|\dot{\eta}(x_{0})\|}\frac{\partial}{\partial x_{0}}\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)\right]
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)+o\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})+o
=\displaystyle= O(ϵk,n,ω4).\displaystyle O(\epsilon_{k,n,\omega}^{4}).

Finally r1=O(ϵk,n,ω3)r_{1}=O(\epsilon_{k,n,\omega}^{3}).

Step 3: we continue the derivation of P(\widehat{\eta}_{k,n}(X+\delta)\leq 1/2,X\in\mathcal{S}^{\epsilon_{k,n,\omega}}|\delta) (where X denotes the random testing sample) using

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0t+δ)<1/2)1{t<0})𝑑t𝑑Vold1(x0).\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0}).

Since η^k,n\widehat{\eta}_{k,n} is obtained from nn i.i.d. samples (though for some samples their weight is 0), by non-uniform Berry-Esseen Theorem,

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0}|δ)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}|\delta\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)1{t<0})𝑑t𝑑Vold1(x0)+r2\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+r_{2}
+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(𝔼Rk+1Φ(k𝔼(1/2Y1|Rk+1)kVar(Y1|Rk+1)|δ)Φ(k𝔼(1/2Y1)kVar(Y1)|δ))𝑑t𝑑Vold1(x0).\displaystyle+\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\mathbb{E}_{R_{k+1}}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1}|R_{k+1})}{\sqrt{kVar(Y_{1}|R_{k+1})}}\bigg{|}\delta\right)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right)dtd\text{Vol}^{d-1}(x_{0}).

where

|P(η^k,n(x0t+δ)<1/2)Φ(k𝔼(1/2Y1)kVar(Y1)|δ)|\displaystyle\left|P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right| \displaystyle\leq c3k11+k3/2|𝔼1/2Y1|3,\displaystyle\frac{c_{3}}{\sqrt{k}}\frac{1}{1+k^{3/2}\left|\mathbb{E}1/2-Y_{1}\right|^{3}},

hence

r_{2} \leq \int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\frac{c_{3}}{\sqrt{k}}\frac{1}{1+k^{3/2}\left|\mathbb{E}(1/2-Y_{1})\right|^{3}}\,dt\,d\text{Vol}^{d-1}(x_{0})
= \frac{c_{4}}{\sqrt{k}}\int_{\mathcal{S}}\int_{|t|<s_{k,n}}t\|\dot{\Psi}(x_{0})\|\,dt\,d\text{Vol}^{d-1}(x_{0})
+ \frac{c_{5}}{\sqrt{k}}\int_{\mathcal{S}}\int_{s_{k,n}<|t|<\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\frac{1}{1+k^{3/2}t^{3}}\,dt\,d\text{Vol}^{d-1}(x_{0})
\leq O\left(\frac{s^{2}_{k,n}}{\sqrt{k}}\right)+\frac{c_{5}}{k^{3/2}}\int_{\mathcal{S}}\int_{\sqrt{k}s_{k,n}<u<\sqrt{k}\epsilon_{k,n,\omega}}u\|\dot{\Psi}(x_{0})\|\frac{1}{1+u^{3}}\,du\,d\text{Vol}^{d-1}(x_{0})\qquad(u=\sqrt{k}t)
= O\left(\frac{1}{k^{3/2}}\right).

Following the analysis in Step 4, we can also obtain that

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(𝔼Rk+1Φ(k𝔼(1/2Y1|Rk+1)kVar(Y1|Rk+1)|δ)Φ(k𝔼(1/2Y1)kVar(Y1)|δ))𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\mathbb{E}_{R_{k+1}}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1}|R_{k+1})}{\sqrt{kVar(Y_{1}|R_{k+1})}}\bigg{|}\delta\right)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= O(ϵk,n,ω5k)+O(ϵk,n,ω3).\displaystyle O(\epsilon_{k,n,\omega}^{5}\sqrt{k})+O(\epsilon_{k,n,\omega}^{3}).

Step 4: Finally we integrate on Gaussian probabilities:

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0t))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}^{t}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
+r3\displaystyle+r_{3}
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
+r3+r4\displaystyle+r_{3}+r_{4}
=\displaystyle= B1sk,n2+SΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)+r3+r4.\displaystyle B_{1}s_{k,n}^{2}+\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})+r_{3}+r_{4}.

The last step follows from Proposition 7. For the small-order terms,

r3\displaystyle r_{3} =\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\bigg{(}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)
Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0t))sk,n2))dtdVold1(x0)\displaystyle\qquad\qquad\qquad-\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}^{t}))}{\sqrt{s_{k,n}^{2}}}\right)\bigg{)}dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= O(ϵk,n,ω3).\displaystyle O(\epsilon_{k,n,\omega}^{3}).

By the definition of \epsilon_{k,n,\omega}, we have

r4\displaystyle r_{4} =\displaystyle= o(1/k3/2).\displaystyle o(1/k^{3/2}).

Finally, we take the expectation over \delta:

𝔼δSΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)\displaystyle\mathbb{E}_{\delta}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2𝔼δ(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}_{\delta}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b(x0)2tk,n2(x0)+2b(x0)tk,n(x0)𝔼δ(δη˙(x0))+𝔼δ(δη˙(x0))2)𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})^{2}t_{k,n}^{2}(x_{0})+2b(x_{0})t_{k,n}(x_{0})\mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))+\mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))^{2}\right)d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b2(x0)tk,n2(x0)+η˙(x0)2dω2)𝑑Vold1(x0).\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b^{2}(x_{0})t_{k,n}^{2}(x_{0})+\frac{\|\dot{\eta}(x_{0})\|^{2}}{d}\omega^{2}\right)d\text{Vol}^{d-1}(x_{0}).
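The last equality uses \mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))^{2}=\|\dot{\eta}(x_{0})\|^{2}\omega^{2}/d, which holds when \delta has mean zero and \mathbb{E}\delta\delta^{\top}=(\omega^{2}/d)I_{d}, e.g., when \delta is uniformly distributed on the radius-\omega sphere. A minimal Monte Carlo sketch of this identity, under that assumption:

```python
import numpy as np

# Check E[(delta^T v)^2] = ||v||^2 * omega^2 / d for delta uniform on the
# radius-omega sphere in R^d (assumed perturbation model; illustration only).
rng = np.random.default_rng(1)
d, omega, reps = 5, 0.3, 200000
v = rng.normal(size=d)                    # arbitrary fixed vector, stand-in for eta-dot(x0)

g = rng.normal(size=(reps, d))
delta = omega * g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere

lhs = float(np.mean((delta @ v) ** 2))
rhs = float(np.sum(v ** 2) * omega ** 2 / d)
print(lhs, rhs)   # the two values agree up to Monte Carlo error
```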

B.2 Theorem 2

Proof of Theorem 2.

When |\eta(x)-1/2|\geq C\omega for some large constant C>0, g and \widetilde{g} always coincide. Thus, the regret difference is contributed only by the C\omega-neighborhood of the decision boundary:

P(g~(x~)Y)P(g(x)Y)\displaystyle P(\widetilde{g}(\widetilde{x})\neq Y)-P(g(x)\neq Y) (10)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{η~(x0t+δ)<1/2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4).\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{\widetilde{\eta}(x_{0}^{t}+\delta)<1/2\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}). (11)

Moreover,

η~(x~)=𝔼(η(x)|x~ is observed)=η(x~)+b(x~)ω2+O(ω3).\displaystyle\widetilde{\eta}(\widetilde{x})=\mathbb{E}(\eta(x)|\widetilde{x}\text{ is observed})=\eta(\widetilde{x})+b(\widetilde{x})\omega^{2}+O(\omega^{3}). (12)

As a result,

η~(x0t+δ)=η(x0)+tη˙(x0)+η˙(x0)δ+b(x0)ω2+O(tω2)+O(ω3).\displaystyle\widetilde{\eta}(x_{0}^{t}+\delta)=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\dot{\eta}(x_{0})^{\top}\delta+b(x_{0})\omega^{2}+O(t\omega^{2})+O(\omega^{3}). (13)

Plugging in η~(x0t+δ)\widetilde{\eta}(x_{0}^{t}+\delta) into regret, we obtain that

P(g~(x~)Y)P(g(x)Y)\displaystyle P(\widetilde{g}(\widetilde{x})\neq Y)-P(g(x)\neq Y) (14)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{η~(x0t+δ)<1/2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{\widetilde{\eta}(x_{0}^{t}+\delta)<1/2\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (15)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{t<η˙(x0)δ/η˙(x0)b(x0)ω2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{t<-\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|-b(x_{0})\omega^{2}\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (16)
=\displaystyle= 𝔼[𝒮Ψ˙(x0)η˙(x0)δ/η˙(x0)b(x0)ω20t𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}\left[\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\int_{-\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|-b(x_{0})\omega^{2}}^{0}tdtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (17)
=\displaystyle= 𝒮Ψ˙(x0)ω22d𝑑Vold1(x0)+O(ω4).\displaystyle\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\frac{\omega^{2}}{2d}d\text{Vol}^{d-1}(x_{0})+O(\omega^{4}). (18)
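For the step from (17) to (18), write a(\delta)=\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|+b(x_{0})\omega^{2}; the inner integral over t in (16) contributes a(\delta)^{2}/2 regardless of the sign of a(\delta). Assuming, as at the end of the proof of Theorem 1, that \mathbb{E}\delta=0 and \mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)^{2}=\|\dot{\eta}(x_{0})\|^{2}\omega^{2}/d,

\mathbb{E}_{\delta}\frac{a(\delta)^{2}}{2}=\frac{\mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)^{2}}{2\|\dot{\eta}(x_{0})\|^{2}}+\frac{b(x_{0})\omega^{2}\,\mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)}{\|\dot{\eta}(x_{0})\|}+\frac{b(x_{0})^{2}\omega^{4}}{2}=\frac{\omega^{2}}{2d}+O(\omega^{4}),

which yields (18) after integrating \|\dot{\Psi}(x_{0})\| over \mathcal{S}.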

From this derivation, the dominant terms in the denominator and numerator of the quantity

P(Yg^n(x~))P(Yg~(x~))P(Yg^n(x~))P(Yg~(x~))\displaystyle\frac{P(Y\neq\widehat{g}_{n}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}{P(Y\neq\widehat{g}_{n}^{\prime}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))} (19)

are both \Theta(n^{-4/(d+4)}) when each k is chosen optimally. Note that the multiplicative constants of the numerator and the denominator are both determined by \delta and the density of X, and they converge to each other as \omega\rightarrow 0. Since \omega\rightarrow 0 as n\rightarrow\infty, the difference in the densities vanishes, and thus (19) converges to 1. ∎

B.3 Theorem S.1

Proof of Theorem S.1.

The proof is similar to that of Theorem 1. Since the forms of r_{1} to r_{4} are unchanged, one can show that they remain small-order terms here as well. What changes in this proof is {\mu}_{k,n,\omega}(x):

When t<0t<0, we have

μk,n,ω(x0t)=η(x0)+tη˙(x0)+ωη˙(x0)+b(x0)tk,n(x)+o,\displaystyle{\mu}_{k,n,\omega}(x_{0}^{t})=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|+b(x_{0})t_{k,n}(x)+o,

while for t>0t>0,

μk,n,ω(x0t)=η(x0)+tη˙(x0)ωη˙(x0)+b(x0)tk,n(x)+o.\displaystyle{\mu}_{k,n,\omega}(x_{0}^{t})=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|-\omega\|\dot{\eta}(x_{0})\|+b(x_{0})t_{k,n}(x)+o.

Therefore,

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1))1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)sign(t)ωη˙(x0)sk,n2b(x0)tk,n(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)+o\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|-sign(t)\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+o
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)+r5+o\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+r_{5}+o
=\displaystyle= B14k+12SΨ˙(x0)η˙(x0)2(b(x0)𝔼R1(x)2+ωη˙(x0))2𝑑Vold1(x0)+r5+o.\displaystyle\frac{B_{1}}{4k}+\frac{1}{2}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})\mathbb{E}R_{1}(x)^{2}+\omega\|\dot{\eta}(x_{0})\|\right)^{2}d\text{Vol}^{d-1}(x_{0})+r_{5}+o.

The remainder r_{5} is not a small-order term, but we can show that it is positive and calculate its rate:

r5=O(B14k+SΨ˙(x0)η˙(x0)2(b(x0)𝔼R1(x)2+ωη˙(x0))2𝑑Vold1(x0)).\displaystyle r_{5}=O\left(\frac{B_{1}}{4k}+\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})\mathbb{E}R_{1}(x)^{2}+\omega\|\dot{\eta}(x_{0})\|\right)^{2}d\text{Vol}^{d-1}(x_{0})\right).

For r5r_{5},

r5\displaystyle r_{5} =\displaystyle= 𝒮0+tΨ˙(x0)Φ(tη˙(x0)ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|-\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
𝒮0+tΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle-\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮2ω+(t+2ω)Ψ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-2\omega}^{+\infty}(t+2\omega)\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
𝒮0+tΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle-\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮02ωΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{0}^{\infty}2\omega\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
+𝒮2ω0(t+2ω)Ψ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle+\int_{\mathcal{S}}\int_{-2\omega}^{0}(t+2\omega)\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
:=\displaystyle:= A+B.\displaystyle A+B.

From the forms of A and B, we know that both are positive. When t_{k,n}(x_{0}) and 1/\sqrt{k} are both \ll\omega, A is exponentially small (so we ignore it), and for B we have:

B=𝒮Ψ˙(x0)ω2𝑑Vold1(x0)+O(ωtk,n(x0)+ω/k).\displaystyle B=\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\omega^{2}d\text{Vol}^{d-1}(x_{0})+O(\omega t_{k,n}(x_{0})+\omega/\sqrt{k}).

B.4 Theorem 4

Proof of Theorem 4.

First, it is easy to see that \omega=O((1/n)^{1/d}), since the nearest neighbor is at an average distance of O((1/n)^{1/d}) from x.

Second, there is a difference between pre-processed 1-NN and random perturbation: in pre-processed 1-NN, the nearest neighbor is distributed approximately uniformly around x, while the other neighbors must be farther from x than the nearest neighbor. However, this difference only affects the remainder term of the regret; that is, whether or not the other neighbors are assumed to be uniformly distributed in the ball B(x,R_{k+1}) does not affect our result. (A schematic sketch of this pre-processing is given below.)
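A minimal Python sketch, assuming the pre-processing relabels each training point by the k-NN majority vote over the training data and then classifies a test point by 1-NN on the relabeled sample; whether a point votes for itself is a detail fixed arbitrarily here:

```python
import numpy as np

def preprocess_then_1nn(X_train, y_train, X_test, k):
    """Sketch of pre-processed 1-NN: relabel each training point by the k-NN
    majority vote among the training data, then classify each test point by
    the label of its single nearest relabeled neighbor."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # pairwise squared distances among training points
    D = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(D, np.inf)              # our convention: a point does not vote for itself
    knn_idx = np.argsort(D, axis=1)[:, :k]
    y_denoised = (y_train[knn_idx].mean(axis=1) > 0.5).astype(int)
    # 1-NN on the relabeled sample
    D_test = ((np.asarray(X_test, dtype=float)[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_denoised[np.argmin(D_test, axis=1)]

# Toy usage: eta(x) depends on the first coordinate only.
rng = np.random.default_rng(2)
X_tr = rng.random((500, 2))
y_tr = (rng.random(500) < 0.2 + 0.6 * (X_tr[:, 0] > 0.5)).astype(int)
X_te = rng.random((5, 2))
print(preprocess_then_1nn(X_tr, y_tr, X_te, k=25))
```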

As a result, taking expectation on the direction of δ\delta,

𝔼SΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δ(x0)η˙(x0))2𝑑Vold1(x0)\displaystyle\mathbb{E}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}(x_{0})\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2𝔼(b(x0)tk,n(x0)+δ(x0)η˙(x0))2𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}(x_{0})\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b2(x0)tk,n2(x0))𝑑Vold1(x0)+Θ(ω2).\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b^{2}(x_{0})t_{k,n}^{2}(x_{0})\right)d\text{Vol}^{d-1}(x_{0})+\Theta(\omega^{2}).

When n1/dn2/(4+d)n^{-1/d}\gg n^{-2/(4+d)}, i.e. d>4d>4, the dominant part of regret becomes n2/dn^{-2/d}. ∎
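A trivial numerical check of the exponent comparison in the last step:

```python
# The perturbation term omega^2 = Theta(n^{-2/d}) dominates the k-NN term n^{-4/(4+d)}
# exactly when 2/d < 4/(4+d), i.e. when d > 4.
for d in range(1, 9):
    if 2 / d < 4 / (4 + d):
        msg = "omega^2 = n^{-2/d} dominates (d > 4)"
    elif 2 / d > 4 / (4 + d):
        msg = "n^{-4/(4+d)} dominates (d < 4)"
    else:
        msg = "both terms are of the same order (d = 4)"
    print(d, msg)
```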

Appendix C Regret Convergence under General Smoothness Condition and Margin Condition

C.1 Model and Theorem

In this section, we relax the conditions on the distribution of X and the smoothness of \eta; as a consequence, we only obtain the rate of the regret (without an explicit form of the multiplicative constant). Technically, we adopt the framework of [6], and the following assumptions on the smoothness of \eta and the density of X are used instead of conditions [A.1]-[A.3].

  1. B.1

    Let λ\lambda be the Lebesgue measure on d\mathbb{R}^{d}. There exists a positive pair (c0,r0)(c_{0},r_{0}) such that for any x𝒳x\in\mathcal{X},

    λ(𝒳B(x,r))c0λ(B(x,r)),\lambda(\mathcal{X}\cap B(x,r))\geq c_{0}\lambda(B(x,r)),

    for any 0<rr00<r\leq r_{0}.

  2. B.2

    The support of XX is compact.

  3. B.3

    Margin condition: P(0<|η(x)1/2|<t)BtβP(0<|\eta(x)-1/2|<t)\leq Bt^{\beta}.

  4. B.4

    Smoothness of \eta: there exist some \alpha>0 and c_{r}>0 such that |\eta(x+r)-\eta(x)|\leq\|r\|^{\alpha} for any x and \|r\|\leq c_{r}.

  5. B.5

    The density of XX is finite and bounded away from 0.

Remark 3.

Assumption B.3 is weaker than the corresponding condition in [6] (P(|\eta(x)-1/2|<t)\leq Bt^{\beta}), but in fact this does not affect the convergence.

In [6], the smoothness assumption is imposed on |\mathbb{E}(\eta(x^{\prime})|x^{\prime}\in B(x,r))-\eta(x)|, which is weaker than our B.4. However, under either random perturbation or adversarial attack, given a direction \delta used to obtain \widetilde{x}, the assumption in [6] cannot be directly applied.

The following theorem provides a general upper bound on the regret for both perturbed and attacked data:

Theorem 8 (Convergence of Regret).

Under [B.1] to [B.5], if for some δ>0\delta>0, k/nδk/n^{\delta}\rightarrow\infty, taking

k\displaystyle k \displaystyle\asymp O(n2α/(2α+d)(n2α/dω2αβ)1/(2α/d+β+1)),\displaystyle O(n^{2\alpha/(2\alpha+d)}\wedge(n^{2\alpha/d}\omega^{-2\alpha\beta})^{1/(2\alpha/d+\beta+1)}),

the regret becomes

Regret(n,ω)=O(ωα(β+1)nα(β+1)/(2α+d)),\displaystyle\mbox{Regret}(n,\omega)=O\left(\omega^{\alpha(\beta+1)}\vee n^{-\alpha(\beta+1)/(2\alpha+d)}\right),

where nα(β+1)/(2α+d)n^{-\alpha(\beta+1)/(2\alpha+d)} is the minimax rate of regret in kk-NN.

Theorem 8 also reveals a sufficient condition for k-NN to be consistent, i.e., for the regret to eventually converge to 0: for both perturbed and attacked data, when \omega=o(1), k-NN remains consistent on these two types of corrupted testing data. (The displayed formulas are illustrated numerically below.)
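A small illustration of the displayed formulas (the constants omitted by \asymp are arbitrary):

```python
def theorem8_k_and_rate(n, omega, alpha, beta, d):
    # k from Theorem 8 (up to constants) and the regret rate it yields.
    k_clean = n ** (2 * alpha / (2 * alpha + d))
    k_corrupt = (n ** (2 * alpha / d) * omega ** (-2 * alpha * beta)) ** (1 / (2 * alpha / d + beta + 1))
    k = min(k_clean, k_corrupt)
    rate = max(omega ** (alpha * (beta + 1)), n ** (-alpha * (beta + 1) / (2 * alpha + d)))
    return k, rate

# Example: the two terms of the rate balance when omega is of order n^{-1/(2 alpha + d)}.
n, alpha, beta, d = 10 ** 5, 1.0, 1.0, 3
for omega in (1e-4, n ** (-1 / (2 * alpha + d)), 3e-1):
    k, rate = theorem8_k_and_rate(n, omega, alpha, beta, d)
    print(f"omega={omega:.2e}  k~{k:.1f}  regret rate~{rate:.3e}")
```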

Theorem 9 (Minimax Rate of Regret).

Let \widehat{g}_{n} be an estimator of g and let \mathcal{P}_{\alpha,\beta} be a set of distributions satisfying [B.1] to [B.5]. When \alpha\leq 1, there exists some C>0 such that

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C(ωα(β+1)nα(β+1)2α+d).\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C(\omega^{\alpha(\beta+1)}\vee n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}). (20)

The constant CC depends on α,β,d\alpha,\beta,d only.

Theorem 9 reveals that, for any estimator of g, under either random perturbation or adversarial attack, the worst-case regret is at least C(\omega^{\alpha(\beta+1)}\vee n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}). Theorems 8 and 9 together show that the k-NN estimator attains the optimal rate of regret.

C.2 Proofs

Proof of Theorem 8.

Let p=k/np=k/n. Denote Rk,n(x)=P(g^k,n(x)Y|x)R_{k,n}(x)=P(\widehat{g}_{k,n}(x)\neq Y|x) and R(x)=P(g(x)Y)R^{*}(x)=P(g(x)\neq Y), and 𝔼Rk,n(x)R(x)\mathbb{E}R_{k,n}(x)-R^{*}(x) as the excess risk. Define

𝒳p,Δ,ω+={x𝒳|η(x)>12,xB(x,ω),η(x+r)\displaystyle\mathcal{X}^{+}_{p,\Delta,\omega}=\{x\in\mathcal{X}|\eta(x)>\frac{1}{2},\forall x^{\prime}\in B(x,\omega),{\eta}(x^{\prime}+r)\geq 12+Δ,r<r2p(x)},\displaystyle\frac{1}{2}+\Delta,\forall\|r\|<r_{2p}(x)\},
𝒳p,Δ,ω={x𝒳|η(x)<12,xB(x,ω),η(x+r)\displaystyle\mathcal{X}^{-}_{p,\Delta,\omega}=\{x\in\mathcal{X}|\eta(x)<\frac{1}{2},\forall x^{\prime}\in B(x,\omega),{\eta}(x^{\prime}+r)\leq 12Δ,r<r2p(x)},\displaystyle\frac{1}{2}-\Delta,\forall\|r\|<r_{2p}(x)\},

with r2pr_{2p} as the distance from xx to its 2pn2pnth nearest neighbor, and the decision boundary area:

p,Δ,ω=𝒳(𝒳p,Δ,ω+𝒳p,Δ,ω).\partial_{p,\Delta,\omega}=\mathcal{X}\setminus(\mathcal{X}^{+}_{p,\Delta,\omega}\cup\mathcal{X}^{-}_{p,\Delta,\omega}).

Given \partial_{p,\Delta,\omega}, \mathcal{X}^{+}_{p,\Delta,\omega}, and \mathcal{X}^{-}_{p,\Delta,\omega}, similarly to Lemma 8 in [6], the event g(x)\neq\widehat{g}_{k,n}(x) can be covered as follows:

1{g(x)g^k,n(x)}\displaystyle 1_{\{g(x)\neq\widehat{g}_{k,n}(x)\}} \displaystyle\leq 1{xp,Δ,ω}\displaystyle 1_{\{x\in\partial_{p,\Delta,\omega}\}}
+1{maxi=1,,kRi(x~)r2p(x)}\displaystyle+1_{\{\max\limits_{i=1,...,k}R_{i}(\widetilde{x})\geq r_{2p}(x)\}}
+1{|η^k,n(x)η(x+r)|Δ}.\displaystyle+1_{\{|\widehat{\eta}_{k,n}(x)-\eta(x^{\prime}+r)|\geq\Delta\}}.

When η(x+r)>1/2{\eta}(x^{\prime}+r)>1/2 for all rr2p(x)\|r\|\leq r_{2p}(x), and x𝒳p,Δ+x\in\mathcal{X}_{p,\Delta}^{+}, assume η^k,n(x)<1/2\widehat{\eta}_{k,n}(x)<1/2, then

η(x+r)η^k,n(x)>η(x+r)1/2Δ.{\eta}(x^{\prime}+r)-\widehat{\eta}_{k,n}(x^{\prime})>{\eta}(x^{\prime}+r)-1/2\geq\Delta.

The coverage by the other two events is straightforward.

By [6] and [3], P(maxi=1,,kRi(x)r2p(x))P(\max\limits_{i=1,...,k}R_{i}(x)\geq r_{2p}(x)) is of O(exp(ck2))O(\exp(-ck^{2})) for some c>0c>0, hence it becomes a smaller order term if for some δ>0\delta>0, k/nδk/n^{\delta}\rightarrow\infty.

In addition, from the definition of regret, assume η(x)<1/2\eta(x)<1/2,

P(g^(x)Y|X=x)η(x)\displaystyle P(\widehat{g}(x)\neq Y|X=x)-\eta(x)
=\displaystyle= η(x)P(g^(x)=0|X=x)+(1η(x))P(g^(x)=1|X=x)η(x)\displaystyle\eta(x)P(\widehat{g}(x)=0|X=x)+(1-\eta(x))P(\widehat{g}(x)=1|X=x)-\eta(x)
=\displaystyle= η(x)P(g^(x)=g(x)|X=x)+(1η(x))P(g^(x)g(x)|X=x)η(x)\displaystyle\eta(x)P(\widehat{g}(x)=g(x)|X=x)+(1-\eta(x))P(\widehat{g}(x)\neq g(x)|X=x)-\eta(x)
=\displaystyle= η(x)η(x)P(g^(x)g(x)|X=x)+(1η(x))P(g^(x)g(x)|X=x)η(x)\displaystyle\eta(x)-\eta(x)P(\widehat{g}(x)\neq g(x)|X=x)+(1-\eta(x))P(\widehat{g}(x)\neq g(x)|X=x)-\eta(x)
=\displaystyle= (12η(x))P(g^(x)g(x)|X=x),\displaystyle(1-2\eta(x))P(\widehat{g}(x)\neq g(x)|X=x),

similarly, when η(x)>1/2\eta(x)>1/2, we have

P(g^(x)Y|X=x)1+η(x)\displaystyle P(\widehat{g}(x)\neq Y|X=x)-1+\eta(x) =\displaystyle= (2η(x)1)P(g^(x)g(x)|X=x).\displaystyle(2\eta(x)-1)P(\widehat{g}(x)\neq g(x)|X=x).

As a result, the regret can be represented as

Regret(k,n,ω)\displaystyle Regret(k,n,\omega) =\displaystyle= 𝔼(|12η(X)|P(g(X)g^k,n(X))).\displaystyle\mathbb{E}\left(|1-2\eta(X)|P(g(X)\neq\widehat{g}_{k,n}(X))\right).

Recall p=k/n. We then follow the proof of Lemma 20 of [6]. Without loss of generality, assume \eta(x)>1/2. For a perturbation \delta\in\mathbb{R}^{d}, define

Δ0\displaystyle\Delta_{0} =\displaystyle= supx,δ,r<r2p(x)|η(x+δ+r)η(x)|=O(ωα)+O((k/n)α/d),\displaystyle\sup\limits_{x,\delta,\|r\|<r_{2p}(x)}|\eta(x+\delta+r)-\eta(x)|=O(\omega^{\alpha})+O((k/n)^{\alpha/d}),
Δ(x)\displaystyle\Delta(x) =\displaystyle= |η(x)1/2|,\displaystyle|\eta(x)-1/2|,

then we have

η(x+δ+r)η(x)Δ0=12+(Δ(x)Δ0),\eta(x+\delta+r)\geq\eta(x)-\Delta_{0}=\frac{1}{2}+(\Delta(x)-\Delta_{0}),

hence x𝒳p,Δ(x)Δ0,ω+x\in\mathcal{X}_{p,\Delta(x)-\Delta_{0},\omega}^{+}.

From the definition of Rk,nR_{k,n} and RR^{*}, when Δ(x)>Δ0\Delta(x)>\Delta_{0}, we also have

𝔼Rk,n(x)R(x)\displaystyle\mathbb{E}R_{k,n}(x)-R^{*}(x)
\displaystyle\leq 2Δ(x)[P(r(k+1)>v2p)+P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0)]\displaystyle 2\Delta(x)\bigg{[}P(r_{(k+1)}>v_{2p})+P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{)}\bigg{]}
\displaystyle\leq 2Δ(x)P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0)+o\displaystyle 2\Delta(x)P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{)}+o
=\displaystyle= 2Δ(x)𝔼δ[P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0|δ)]+o\displaystyle 2\Delta(x)\mathbb{E}_{\delta}\left[P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{|}\delta\bigg{)}\right]+o

Since the above upper bound can be much greater than 1 when \Delta(x) is small, we define \Delta_{i}=2^{i}\Delta_{0} and take i_{0}=\min\{i\geq 1|\;(\Delta_{i}-\Delta_{0})^{2}>1/k\}. Using Bernstein's inequality, it becomes

𝔼Rk,n(X)R(X)\displaystyle\mathbb{E}R_{k,n}(X)-R^{*}(X) =\displaystyle= 𝔼(Rk,n(X)R(X))1{Δ(X)Δi0}\displaystyle\mathbb{E}(R_{k,n}(X)-R^{*}(X))1_{\{\Delta(X)\leq\Delta_{i_{0}}\}}
+𝔼(Rk,n(X)R(X))1{Δ(X)>Δi0}\displaystyle+\mathbb{E}(R_{k,n}(X)-R^{*}(X))1_{\{\Delta(X)>\Delta_{i_{0}}\}}
\leq 2\Delta_{i_{0}}P(\Delta(X)\leq\Delta_{i_{0}})+\exp(-k/8)
+c_{2}\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i_{0}}<\Delta(X)\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right].

When i0=min{i1|(ΔiΔ0)2>1/k}i_{0}=\min\{i\geq 1|\;(\Delta_{i}-\Delta_{0})^{2}>1/k\}, the exponential tail will diminish fast, leading to

𝔼[Δ(X)1{Δi0<Δ(X)}exp(c1k(Δ(x)Δ0)2)]\displaystyle\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i_{0}}<\Delta(X)\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right]
=\displaystyle= i=i0𝔼[Δ(X)1{Δi<Δ(X)<Δi+1}exp(c1k(Δ(x)Δ0)2)]\displaystyle\sum_{i=i_{0}}^{\infty}\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i}<\Delta(X)<\Delta_{i+1}\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right]
\displaystyle\leq i=i0Δi+1β+1exp(c1k(ΔiΔ0)2)\displaystyle\sum_{i=i_{0}}^{\infty}\Delta_{i+1}^{\beta+1}\exp(-c_{1}k(\Delta_{i}-\Delta_{0})^{2})
=\displaystyle= i=i0Δ0β+12(i+1)(β+1)exp(c1kΔ02(2i1)2)\displaystyle\sum_{i=i_{0}}^{\infty}\Delta_{0}^{\beta+1}2^{(i+1)(\beta+1)}\exp(-c_{1}k\Delta_{0}^{2}(2^{i}-1)^{2})
\displaystyle\leq c3Δ0β+1.\displaystyle c_{3}\Delta_{0}^{\beta+1}.

Recall that Δi0>Δ0\Delta_{i_{0}}>\Delta_{0} and Δi02>1/k\Delta_{i_{0}}^{2}>1/k, hence when Δi02=O(1/k)\Delta_{i_{0}}^{2}=O(1/k), we can obtain the minimum upper bound

𝔼Rk,n(X)R(X)=O(Δ0β+1)+O((1k)(β+1)/2).\displaystyle\mathbb{E}R_{k,n}(X)-R^{*}(X)=O(\Delta_{0}^{\beta+1})+O\left(\left(\frac{1}{k}\right)^{(\beta+1)/2}\right).

Proof of Theorem 9.

The proof is similar to that of [2], using the technical details of [1] for Assouad's method. We consider two scenarios. Let C_{0}, C_{1}, and C_{2} be suitable constants. We first show that for any \omega\geq 0,

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C1nα(β+1)2α+d.\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C_{1}n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}. (21)

Further, when ω>C0n12α+d\omega>C_{0}n^{-\frac{1}{2\alpha+d}}, our target is to show that

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C2ωα(β+1).\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C_{2}\omega^{\alpha(\beta+1)}. (22)

Case 1: when \omega\leq C_{0}n^{-\frac{1}{2\alpha+d}}, the basic idea is to construct a distribution of x and two distributions of y|x such that the Bayes classifiers under the two distributions of y|x are opposite to each other, yet from n sampled points one cannot distinguish which of the two distributions generated the sample. For example, given n samples from a normal distribution, we cannot statistically determine whether the data come from a zero-mean distribution or from a distribution with mean 1/\sqrt{n}; thus any estimator based on the data (using either clean or corrupted testing data) can make a false prediction.

Assume X is distributed within a compact set in [0,1]^{d}. For an integer q\geq 1, consider the regular grid

Gq:={(2k1+12q,,2kd+12q):ki{0,,q1},i=1,,d}.G_{q}:=\left\{\left(\frac{2k_{1}+1}{2q},...,\frac{2k_{d}+1}{2q}\right):k_{i}\in\{0,...,q-1\},i=1,...,d\right\}. (23)

For any point xx, denote nq(x)n_{q}(x) as the closest grid point in GqG_{q}, and define 𝒳1,,𝒳qd\mathcal{X}^{\prime}_{1},...,\mathcal{X}^{\prime}_{q^{d}} as a partition of [0,1]d[0,1]^{d} such that xx and xx^{\prime} are in the same 𝒳i\mathcal{X}^{\prime}_{i} if and only if nq(x)=nq(x)n_{q}(x)=n_{q}(x^{\prime}). Among all the 𝒳i\mathcal{X}^{\prime}_{i}’s, select mm of them as 𝒳1,,𝒳m\mathcal{X}_{1},...,\mathcal{X}_{m}, and 𝒳0:=[0,1]d\i=1m𝒳i\mathcal{X}_{0}:=[0,1]^{d}\backslash\cup_{i=1}^{m}\mathcal{X}_{i}.
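A short Python sketch of the grid G_{q} and of the map n_{q}(\cdot); the clipping at the boundary of [0,1]^{d} is our own convention:

```python
import numpy as np
from itertools import product

def grid_Gq(q, d):
    """Regular grid G_q with points ((2k_1+1)/(2q), ..., (2k_d+1)/(2q))."""
    coords = [(2 * k + 1) / (2 * q) for k in range(q)]
    return np.array(list(product(coords, repeat=d)))

def n_q(x, q):
    """Closest grid point n_q(x): each coordinate is mapped to the center of its cell."""
    x = np.asarray(x, dtype=float)
    cell = np.clip(np.floor(q * x).astype(int), 0, q - 1)
    return (2 * cell + 1) / (2 * q)

G = grid_Gq(q=4, d=2)
print(G.shape)               # (16, 2): q^d grid points
print(n_q([0.30, 0.95], 4))  # [0.375 0.875]
```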

Take z_{i} as the center of \mathcal{X}_{i} for i=1,...,m. When x\in B(z_{i},1/4q), set the density of x to \epsilon/\lambda[B(z_{i},1/4q)] for some \epsilon>0, and set the density of x in \mathcal{X}_{i}\backslash B(z_{i},1/4q) to 0. Assume x is uniformly distributed in \mathcal{X}_{0}.

Let u:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+} be a nondecreasing, infinitely differentiable function that starts from 0, satisfies the \alpha-smoothness condition, and equals 1 on [1/2,\infty). Denote \psi and \phi as

ψ(x):=Cψu(x),\psi(x):=C_{\psi}u(\|x\|), (24)

and

ϕ(x):=qαψ(q(xnq(x))).\phi(x):=q^{-\alpha}\psi(q(x-n_{q}(x))). (25)

Through the above construction, if we take \eta(x)=(1+\phi(x))/2 or \eta(x)=(1-\phi(x))/2 and let m=O(q^{d-\alpha\beta}), then when \alpha\beta\leq d, the \beta-margin condition is also satisfied.

The construction above will also be used in Case 2 (with a different choice of q, \epsilon, and u).

Now we apply Assouad's method to derive the lower bound of the regret. Denote by P_{jk} a distribution such that \eta(x)=(1+\phi(x))/2 when k=0 and x\in\mathcal{X}_{j}, and \eta(x)=(1-\phi(x))/2 when k=1 and x\in\mathcal{X}_{j}. Then, for any estimator \widehat{g}(x,Z_{n}) with data Z_{n}=(X_{n},Y_{n}), we have

supk=0,1𝔼X,Zn,Pjk1{g^(X,Zn)g(X)}1{X𝒳j}\displaystyle\sup\limits_{k=0,1}\mathbb{E}_{X,Z_{n},P_{jk}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}} (26)
\displaystyle\geq 12𝔼X,Zn,Pj01{g^(X,Zn)g(X)}1{X𝒳j}+12𝔼Zn,Pj11{g^(X,Zn)g(X)}1{X𝒳j}\displaystyle\frac{1}{2}\mathbb{E}_{X,Z_{n},P_{j0}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}}+\frac{1}{2}\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}} (27)
=\displaystyle= 12𝔼X,Zn,Pj01{g^(X,Zn)0}1{X𝒳j}+12𝔼Zn,Pj11{g^(X,Zn)1}1{X𝒳j}\displaystyle\frac{1}{2}\mathbb{E}_{X,Z_{n},P_{j0}}1_{\{\widehat{g}(X,Z_{n})\neq 0\}}1_{\{X\in\mathcal{X}_{j}\}}+\frac{1}{2}\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(X,Z_{n})\neq 1\}}1_{\{X\in\mathcal{X}_{j}\}} (28)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[𝔼Zn,Pj01{g^(x,Zn)0}+𝔼Zn,Pj11{g^(x,Zn)1}|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\mathbb{E}_{Z_{n},P_{j0}}1_{\{\widehat{g}(x,Z_{n})\neq 0\}}+\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(x,Z_{n})\neq 1\}}\bigg{|}X=x\right] (29)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[1{g^(x,Zn)0}𝑑Pj0(Zn)+1{g^(x,Zn)1}𝑑Pj1(Zn)|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int 1_{\{\widehat{g}(x,Z_{n})\neq 0\}}dP_{j0}(Z_{n})+\int 1_{\{\widehat{g}(x,Z_{n})\neq 1\}}dP_{j1}(Z_{n})\bigg{|}X=x\right] (30)
\displaystyle\geq 12𝔼X1{X𝒳j}𝔼[1{g^(x,Zn)0}+1{g^(x,Zn)1}(dPj0(Zn)dPj1(Zn))|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int 1_{\{\widehat{g}(x,Z_{n})\neq 0\}}+1_{\{\widehat{g}(x,Z_{n})\neq 1\}}(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))\bigg{|}X=x\right] (31)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[(dPj0(Zn)dPj1(Zn))|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))\bigg{|}X=x\right] (32)
=\displaystyle= 12𝔼X1{X𝒳j}(dPj0(Zn)dPj1(Zn)).\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n})). (33)

Denote

bj:=[1𝔼2(1ϕ2(X)|X𝒳j)]1/2,b_{j}:=\left[1-\mathbb{E}^{2}(\sqrt{1-\phi^{2}(X)}|X\in\mathcal{X}_{j})\right]^{1/2}, (34)

and

bj:=(𝔼ϕ(X)|X𝒳j),b^{\prime}_{j}:=(\mathbb{E}\phi(X)|X\in\mathcal{X}_{j}), (35)

then (dPj0(Zn)dPj1(Zn))=Θ(1)\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))=\Theta(1) through our design of 𝒳j\mathcal{X}_{j} when bj=O(1/nϵ)b_{j}=O(1/\sqrt{n\epsilon}) by Lemma 5.1 in [1].

As a result, when bj=bb_{j}=b, bj=bb^{\prime}_{j}=b^{\prime} for all j=1,,mj=1,...,m, and b=O(1/nϵ)b=O(1/\sqrt{n\epsilon}), there exists some C3>0C_{3}>0 such that

supP𝒫P(g^(X,Zn)Y)P(g(X)Y)\displaystyle\sup\limits_{P\in\mathcal{P}}P(\widehat{g}(X,Z_{n})\neq Y)-P(g(X)\neq Y) (36)
=\displaystyle= supP𝒫𝔼|2η(X)1|P(g^(X,Zn)g(X))\displaystyle\sup\limits_{P\in\mathcal{P}}\mathbb{E}|2\eta(X)-1|P(\widehat{g}(X,Z_{n})\neq g(X)) (37)
=\displaystyle= supP𝒫j=1m𝔼|2η(X)1|P(g^(X,Zn)g(X))1{X𝒳j}\displaystyle\sup\limits_{P\in\mathcal{P}}\sum_{j=1}^{m}\mathbb{E}|2\eta(X)-1|P(\widehat{g}(X,Z_{n})\neq g(X))1_{\{X\in\mathcal{X}_{j}\}} (38)
\displaystyle\geq C3mbϵ.\displaystyle C_{3}mb^{\prime}\epsilon. (39)

The regret is lower bounded by C_{1}n^{-\alpha(\beta+1)/(2\alpha+d)} when taking q=O(n^{1/(2\alpha+d)}). Note that \widehat{g}(x,Z_{n}) can be any classifier, including "randomized" estimators applied when x is perturbed or attacked.

Case 2: when \omega>C_{0}n^{-\frac{1}{2\alpha+d}}, we construct a distribution of (x,y) such that, after noise injection, there are sets of \widetilde{x} on which P(g(x)=1|\widetilde{x}) and P(g(x)=0|\widetilde{x}) are comparable; thus, whichever label the estimator outputs, it makes a false decision at such \widetilde{x} with constant probability.

The construction is similar to that of Case 1, and we take q=\lfloor 2/\omega\rfloor. For the function u, here we let it increase from 0 and become 1 on [1/4,\infty). For each pair (\mathcal{X}_{j0},\mathcal{X}_{j1}), take \eta(x)=(1+\phi(x))/2 when x\in\mathcal{X}_{j0} and \eta(x)=(1-\phi(x))/2 when x\in\mathcal{X}_{j1}. The support of x is \mathcal{X}_{0}\cup(\bigcup_{i=1}^{m}B(z_{i},3\omega/4)). Take m=O(\omega^{\alpha\beta-d}) and \epsilon=O(\omega^{d}); then both the \alpha-smoothness condition and the \beta-margin condition are satisfied.

After injecting random noise on xx, consider ξj\xi_{j} as the boundary between 𝒳j0\mathcal{X}_{j0} and 𝒳j1\mathcal{X}_{j1}, then when x~\tilde{x} is from {z|dist(z,ξj)<ω/4,z𝒳j0𝒳j1}\{z\;|\;dist(z,\xi_{j})<\omega/4,\;z\in\mathcal{X}_{j0}\cup\mathcal{X}_{j1}\}, P(g(x)=1|x~)P(g(x)=1|\tilde{x}) and P(g(x)=0|x~)P(g(x)=0|\tilde{x}) are in [C4,1C4][C_{4},1-C_{4}] for some constant C4>0C_{4}>0. Thus the probability of any estimator to make a false decision at this x~\tilde{x} is larger than C4C_{4}. In addition, the probability measure of j=1m{z|dist(z,ξj)<ω/4,z𝒳j0𝒳j1}\cup_{j=1}^{m}\{z\;|\;dist(z,\xi_{j})<\omega/4,\;z\in\mathcal{X}_{j0}\cup\mathcal{X}_{j1}\} is greater than C5ωαβC_{5}\omega^{\alpha\beta} for some constant C5>0C_{5}>0. Thus the regret is greater than C5ωαβCϕωαC4=C6ωα(β+1)C_{5}\omega^{\alpha\beta}C_{\phi}\omega^{\alpha}C_{4}=C_{6}\omega^{\alpha(\beta+1)}.