
Predictive Power of Nearest Neighbors Algorithm under Random Perturbation

Yue Xing
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
&Qifan Song
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
\ANDGuang Cheng
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA
Abstract

We consider a data corruption scenario in the classical $k$ Nearest Neighbors ($k$-NN) algorithm, namely that the testing data are randomly perturbed. Under such a scenario, the impact of the corruption level on the asymptotic regret is carefully characterized. In particular, our theoretical analysis reveals a phase transition phenomenon: when the corruption level $\omega$ is below a critical order (i.e., the small-$\omega$ regime), the asymptotic regret remains the same; when it is beyond that order (i.e., the large-$\omega$ regime), the asymptotic regret deteriorates polynomially. Surprisingly, we obtain the negative result that the classical noise-injection approach does not help improve the testing performance in the beginning stage of the large-$\omega$ regime, even at the level of the multiplicative constant of the asymptotic regret. As a technical by-product, we prove that under different model assumptions, the pre-processed 1-NN proposed in [27] achieves at most a sub-optimal rate when the data dimension $d>4$, even if $k$ is chosen optimally in the pre-processing step.

1 Introduction

While there has been a great deal of success in the development of machine learning algorithms, much of the success is in relatively restricted domains with limited structural variation or few system constraints. Those algorithms would be quite fragile in broader real-world scenarios, especially when the testing data are contaminated. For example, in image classification, when the input data are slightly altered due to a minor optical sensor system malfunction, a deep neural network may yield a totally different classification result [11]; more seriously, attackers can feed well-designed malicious adversarial input to the system and induce wrong decision making [30, 17]. One strand of existing research along this line focuses on methodology development including generation of corrupted testing samples for which machine learning algorithms fail [18, 19, 13] and design of robust training algorithms [12, 14, 23, 16]. The other strand of research focuses on theoretical investigation on how the data corruption affects the algorithm performance [25, 28, 10, 9].

This work aims to study the robustness of the $k$ Nearest Neighbors ($k$-NN) algorithm and its variants from a theoretical perspective. There is a rich literature on the same topic, but under different setups. For example, Cannings et al. [5] and Reeve and Kaban [21, 20] study the case where labels of the training data are contaminated, and investigate the overall excess risk of the trained classifier; Wang et al. [25] and Yang et al. [28] study the case where the testing data are contaminated, and are only concerned with testing robustness when the testing data belong to a certain subset of the support, rather than the whole support. In contrast to these existing works, the present study addresses a different question: how is the overall regret of the $k$-NN classifier, trained on uncontaminated training data, affected when the testing features are corrupted by random perturbation?

Our main theoretical result (derived in Section 2) characterizes the asymptotic regret of $k$-NN for randomly perturbed testing data (with an explicit form of the multiplicative constant) with respect to the choice of $k$ and the level of testing data corruption. There are several interesting implications. First, there exists a critical contamination level: (a) below it, the asymptotic order of the regret is not affected; (b) above it, the asymptotic order of the regret deteriorates polynomially. Second, although the regret of $k$-NN deteriorates polynomially with respect to the corruption level, it actually achieves the best possible accuracy for testing randomly perturbed data (under a finely tuned choice of $k$). Hence the $k$-NN classifier is rate-minimax for both the clean data testing task [2, 22, 4] and the randomly perturbed data testing task.

Similar to adversarial training, one strategy to robustify a learning algorithm is to inject the same random noise into the training data. However, our theoretical analysis reveals that such a noise injection approach does not improve the performance of the $k$-NN algorithm in the beginning stage of the polynomial deterioration regime, in the sense that it leads to exactly the same asymptotic regret.

Adapting the analysis to the adversarial attack scenario (i.e., testing data contaminated with worst-case perturbation), we compare the regret of $k$-NN under adversarial attack and under random perturbation. Interestingly, when the level of data contamination is beyond the aforementioned critical order, adversarial attack leads to a worse regret than random perturbation only in terms of the multiplicative constant of the convergence rate.

Our developed theory may also be used to evaluate the asymptotic performance of variants of the $k$-NN algorithm. For example, Xue and Kpotufe [27] applied 1NN to pre-processed data (relabelled by $k$-NN) in order to achieve the same accuracy as $k$-NN. Interestingly, this algorithm can be translated into the classical $k$-NN algorithm applied to a type of perturbed sample, to which our theory naturally applies. In particular, we prove that the above algorithm, under our model assumption framework, only obtains a sub-optimal rate of regret (at least worse than $k$-NN) when $d>4$.

As far as we are aware, the only related work in the context of $k$-NN is [25], which evaluated the adversarial robustness of $k$-NN based on the concept of the "robust radius", i.e., the maximum allowed attack intensity that does not affect the testing performance. However, their analysis ignores the area near the decision boundary where mis-classification is most likely to occur. By filling these gaps, our work attempts to present a more comprehensive regret analysis of the robustness of $k$-NN.

2 Effect of Random Perturbation on $k$-NN

In this section, we formally introduce the model setup and present our main theorems which characterize the asymptotic regret for randomly perturbed testing samples.

2.1 Model Setup

Denote $P(Y=1|X=x)$ as $\eta(x)$, and its $k$-NN estimator as $\widehat{\eta}_{k,n}(x)$, an average over the $k$ nearest neighbors among the $n$ training samples. The corresponding Bayes classifier and $k$-NN classifier are defined as $g(x)=1\{\eta(x)>1/2\}$ and $\widehat{g}_{n,k}(x)=1\{\widehat{\eta}_{k,n}(x)>1/2\}$, respectively.

Define $\omega$ as the level of perturbation. For any intended testing data point $x$, we only observe its randomly perturbed version:

\widetilde{x}\sim\mbox{Unif}(\partial B(x,\omega)), \qquad (1)

that is, $\widetilde{x}$ is uniformly distributed over $\partial B(x,\omega)$, the boundary of a Euclidean ball centered at $x$ with radius $\omega$.
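For concreteness, a draw from (1) can be simulated by normalizing a standard Gaussian vector; the short Python sketch below (the function name is ours, not part of the paper) produces such a perturbed observation.

  import numpy as np

  def perturb_on_sphere(x, omega, rng=None):
      # Draw x_tilde ~ Unif(boundary of B(x, omega)) as in (1):
      # a normalized Gaussian direction is uniform on the unit sphere.
      rng = np.random.default_rng() if rng is None else rng
      direction = rng.standard_normal(x.shape[-1])
      direction /= np.linalg.norm(direction)
      return x + omega * direction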

In this case, we define the so-called "perturbed" regret as

\mbox{Regret}(k,n,\omega)=P(Y\neq\widehat{g}_{n,k}(\widetilde{X}))-P(Y\neq g(X)),

and

\mbox{Regret}(n,\omega)=\min\limits_{k=1,\ldots,n}\mbox{Regret}(k,n,\omega).

Note that the $k$-NN classifier $\widehat{g}_{n,k}$ is trained on uncontaminated training samples. Obviously, when $\omega=0$, the above definition reduces to the regret commonly used in the statistical classification literature.
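When $\eta$ is known (e.g., in a simulation), the perturbed regret can be approximated by Monte Carlo. The following is a minimal sketch under that assumption, using scikit-learn's KNeighborsClassifier; all names are ours.

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def perturbed_regret(eta, X_train, y_train, X_test, k, omega, rng):
      # Excess risk of k-NN on randomly perturbed test points, relative to
      # the Bayes classifier applied to the clean test points.
      clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
      noise = rng.standard_normal(X_test.shape)
      noise *= omega / np.linalg.norm(noise, axis=1, keepdims=True)
      eta_test = eta(X_test)                      # true P(Y=1|X=x) at the test points
      pred = clf.predict(X_test + noise)
      knn_err = np.mean(np.where(pred == 1, 1 - eta_test, eta_test))
      bayes_err = np.mean(np.minimum(eta_test, 1 - eta_test))
      return knn_err - bayes_err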

The following assumptions on $X$ and the underlying $\eta$ are imposed to facilitate our theoretical analysis.

  1. A.1

    $X$ is a random variable on a compact $d$-dimensional manifold $\mathcal{X}$ with boundary $\partial\mathcal{X}$. The density function of $X$ is differentiable, finite, and bounded away from 0.

  2. A.2

    The set $\mathcal{S}=\{x\,|\,\eta(x)=1/2\}$ is non-empty. There exists an open subset $U_{0}$ of $\mathbb{R}^{d}$ containing $\mathcal{S}$ such that, with $U$ denoting an open set containing $\mathcal{X}$, $\eta$ is continuous on $U\backslash U_{0}$.

  3. A.3

    There exists some constant $c_{x}>0$ such that when $|\eta(x)-1/2|\leq c_{x}$, $\eta$ has bounded fourth-order derivatives; when $\eta(x)=1/2$, $\dot{\eta}(x)\neq 0$, where $\dot{\eta}$ is the gradient of $\eta$ in $x$. In addition, the derivative of $\eta(x)$ restricted to the boundary of the support is non-zero.

Assumption A.1 ensures that for any $x\in\mathcal{X}$, all of its $k$ nearest neighbors are close to $x$ with high probability. This is due to the fact that if the density at a point $x$ is positive and finite, the distance from $x$ to its $k$th nearest neighbor is of order $O_{p}((k/n)^{1/d})=o_{p}(1)$. Assumption A.2 ensures the existence of $x$ in $\{x\in\mathcal{X}\,|\,\eta(x)=1/2\}$ and that $\eta(x)$ is continuous in other regions of $\mathcal{X}$. Assumption A.3 on $\eta(x)$ is slightly stronger than that imposed in [22], due to the consideration of testing data contamination. Specifically, the additional smoothness of $\eta(x)$ imposed in Assumption A.3 guarantees that some higher-order terms in the Taylor expansion of $\mathbb{E}\{\widehat{\eta}_{k,n}(\widetilde{x})-\eta(x)\}$ are negligible.

2.2 Asymptotic Regret and Phase Transition Phenomenon

We are now ready to conduct the regret analysis for $k$-NN in the presence of randomly perturbed testing examples. For any $x\in\mathcal{X}$, define

t(x)=\mathbb{E}\big{(}\|X_{i}-x\|_{2}^{2}\,\big{|}\,X_{i}\mbox{ is among the $k$ nearest neighbors of }x\big{)}.

Therefore, $t(x)$ represents the expected squared distance from $x$ to any of its $k$ nearest neighbors, and we take $t=\max_{x}t(x)$. Also denote $\bar{f}(x,y)$ and $\bar{f}(x)$ as the joint density of $(x,y)$ and the marginal density of $x$, respectively. Let $f_{1}(x):=\bar{f}(x,0)$, $f_{2}(x):=\bar{f}(x,1)$, and $\Psi(x):=f_{1}(x)-f_{2}(x)$.
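In practice, $t(x)$ can be approximated by its empirical analogue, the average squared distance from $x$ to its $k$ nearest training points; the following sketch (names ours) uses scikit-learn's NearestNeighbors.

  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def t_hat(X_train, x, k):
      # Empirical analogue of t(x): mean squared Euclidean distance from x
      # to its k nearest neighbors among the training samples.
      nn = NearestNeighbors(n_neighbors=k).fit(X_train)
      dist, _ = nn.kneighbors(x.reshape(1, -1))
      return float(np.mean(dist ** 2))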

We first characterize the asymptotic perturbed regret.

Theorem 1.

Define $\epsilon_{k,n,\omega}=\max(\log k/\sqrt{k},\,t+\omega)$. Under [A.1] to [A.3], if the testing data are randomly perturbed, then it follows that

\begin{split}\mbox{Regret}(k,n,\omega)=&\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})t(x_{0})\right)^{2}d\text{Vol}^{d-1}(x_{0})}_{Bias}+\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{\omega^{2}}{d}\|\dot{\Psi}(x_{0})\|d\text{Vol}^{d-1}(x_{0})}_{Corruption}\\ &+\underbrace{\frac{1}{2}\int_{\mathcal{S}}\frac{1}{4k}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}d\text{Vol}^{d-1}(x_{0})}_{Variance}+\text{Remainder},\end{split} \qquad (2)

where Remainder $=O(\epsilon_{k,n,\omega}^{3})$ as $k,n\to\infty$. The term $b(\cdot)$ depends on the true $\eta(x)$ and the distribution of $X$, and does not change with respect to $k$ and $n$:

b(x) = \frac{1}{\bar{f}(x)d}\left\{\sum_{j=1}^{d}[\dot{\eta}_{j}(x)\dot{\bar{f}}_{j}(x)+\ddot{\eta}_{j,j}(x)\bar{f}(x)/2]\right\}.

Here $\dot{\eta}$, $\ddot{\eta}$, and $\dot{\bar{f}}$ denote the gradient of $\eta$, the Hessian of $\eta$, and the gradient of $\bar{f}$, respectively. The subscript $j$ denotes the $j$th element of $\dot{\eta}$ or $\dot{\bar{f}}$, and the subscript $j,j$ denotes the $(j,j)$th element of $\ddot{\eta}$.

Our result (2) clearly decomposes the asymptotic regret into a squared bias term, a data corruption effect term, a variance term, and a remainder term due to higher-order Taylor expansion. The first three terms are of order $O((k/n)^{4/d})$, $O(\omega^{2})$ and $O(1/k)$ respectively, and the remainder term is technically derived from higher-order Taylor expansion and the Berry-Esseen theorem. When $k$ is within a reasonable range, the remainder term is negligible compared with the other three terms. When $\omega=0$, (2) reduces to the bias-variance decomposition commonly observed in the nonparametric regression literature.

Based on Theorem 1, by varying $\omega$, we have the following observations:

Phase Transition Phenomenon

We discover a phase transition phenomenon for the regret with respect to the level of testing data contamination in general.

  1. 1.

    When $\omega^{2}\preccurlyeq(1/k\wedge t^{2})$ (footnote: to prevent a conflict with the definition of $\omega$, we use $\preccurlyeq$ and $\succcurlyeq$ in place of the $o(\cdot)$ and $\omega(\cdot)$ from the $O/\Omega$ notation; moreover, for $a(n)\preccurlyeq b(n)\preccurlyeq 1$, we mean that $b(n)/1<n^{-\varepsilon_{1}}$ and $a(n)/b(n)<n^{-\varepsilon_{2}}$ for some $\varepsilon_{1},\varepsilon_{2}>0$ as $n\rightarrow\infty$), the data corruption has almost no effect: $\mbox{Regret}(k,n,\omega)/\mbox{Regret}(k,n,0)\rightarrow 1$;

  2. 2.

    When $\omega^{2}=\Theta(1/k\vee t^{2})$, the regret is of the same order as $\mbox{Regret}(k,n,0)$ but with a different multiplicative constant depending on $\bar{f}$ and $\eta$;

  3. 3.

    When $\omega^{2}\succcurlyeq(1/k\vee t^{2})$, $\mbox{Regret}(k,n,\omega)=\Theta(\omega^{2})$ and $\mbox{Regret}(k,n,\omega)\succcurlyeq\mbox{Regret}(k,n,0)$.

Impact on $\mbox{Regret}(n,\omega)$ and the choice of $k$

The value of $k$ plays an important role in the $k$-NN algorithm. It is essential to examine how the intensity level $\omega$ affects the corresponding $\mbox{Regret}(n,\omega)$. From Theorem 1, one can derive that for the randomly perturbed testing scenario, if $\omega\preccurlyeq n^{-2/(d+4)}$, then $\mbox{Regret}(n,\omega)=\Theta(n^{-4/(d+4)})$; if $\omega\succcurlyeq n^{-2/(d+4)}$, then $\mbox{Regret}(n,\omega)=\Theta(\omega^{2})$. In other words, $\mbox{Regret}(n,\omega)=\Theta(\omega^{2}\vee n^{-4/(d+4)})$. The above rate can be achieved if we choose $k=\Theta(n^{4/(4+d)})$ when $\omega\preccurlyeq n^{-4/3(4+d)}$, and $1/\omega^{2}\preccurlyeq k\preccurlyeq n\omega^{d/2}$ when $\omega\succcurlyeq n^{-4/3(4+d)}$.

Robustness Trade-off in $k$-NN

Theorem 1 reveals that the regret is of order $O((k/n)^{4/d}+1/k)+O(\omega^{2})$; therefore, the data corruption has no impact as long as $\omega^{2}=o((k/n)^{4/d}+1/k)$. In other words, if $k$ is chosen optimally so as to minimize the regret for uncontaminated testing samples, i.e., $(k/n)^{4/d}+1/k$ is small, then $k$-NN is more sensitive to an increase of $\omega^{2}$. On the other hand, if $k$ is some sub-optimal choice such that $(k/n)^{4/d}+1/k$ is larger, then $k$-NN is more robust to testing data corruption. Hence there is a trade-off between accuracy on the uncontaminated testing task and robustness to random perturbation corruption.

Effect of Metric of Noise

Note that $\widetilde{x}$ can be defined on an $\mathcal{L}_{p}$ ball/sphere for different $p\geq 1$. Theorem 1 reveals that the effect of $\omega$ is independent of $t$ and $1/k$ when $\epsilon_{k,n,\omega}^{3}$ is not dominant and $\widetilde{x}$ is uniformly distributed in a ball/sphere (so that the first-order terms with respect to the noise are zero). As a result, defining $\varepsilon_{p}$ as a random variable uniformly distributed in an $\mathcal{L}_{p}$ ball/sphere, the regret changes accordingly by replacing $\omega^{2}/d$ in Theorem 1 with $\frac{1}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(\varepsilon_{p}^{\top}\dot{\eta}(x_{0}))^{2}$.
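For the Euclidean sphere in (1), this replacement is consistent with the $\omega^{2}/d$ factor in Theorem 1, since (a short verification, not part of the original statement), for $\varepsilon$ uniform on $\partial B(0,\omega)$,

\mathbb{E}[\varepsilon\varepsilon^{\top}]=\frac{\omega^{2}}{d}I_{d}\quad\Longrightarrow\quad\frac{1}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(\varepsilon^{\top}\dot{\eta}(x_{0}))^{2}=\frac{\dot{\eta}(x_{0})^{\top}\mathbb{E}[\varepsilon\varepsilon^{\top}]\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|^{2}}=\frac{\omega^{2}}{d}.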

Minimax Result under General Smoothness Conditions

To relax the strong assumptions A.1 to A.3, we follow [6] and impose smoothness and margin conditions. The observations are broadly similar to the results above. One can also show that $k$-NN attains the optimal rate of convergence.

Theorem 2 (Informal Statement under General Smoothness Conditions).

If the distribution of $(X,Y)$ satisfies

  1. 1.

    $|\eta(x)-\eta(x^{\prime})|\leq A\|x-x^{\prime}\|^{\alpha}$ for all $x,x^{\prime}$;

  2. 2.

    $P(|\eta(X)-1/2|<t)\leq Bt^{\beta}$ for some $\beta>0$;

together with some other general assumptions, then when taking

k\asymp O(n^{2\alpha/(2\alpha+d)}\wedge(n^{2\alpha/d}\omega^{-2\alpha\beta})^{1/(2\alpha/d+\beta+1)}),

the regret becomes

\mbox{Regret}(n,\omega)=O\left(\omega^{\alpha(\beta+1)}\vee n^{-\alpha(\beta+1)/(2\alpha+d)}\right),

which is proven to be the minimax rate.

Formal assumptions and results for Theorem 2 are postponed to Section C of the Appendix (Theorem S.3 for the convergence rate and Theorem S.4 for the minimax rate). From Theorem 2, the rate of regret is dominated by the larger of the random perturbation effect ($\omega^{\alpha(\beta+1)}$) and the minimax rate for clean data ($n^{-\alpha(\beta+1)/(2\alpha+d)}$). Similar to [22], our regret result derived under conditions A.1-A.3 matches the minimax rate of Theorem 2 by taking $\alpha=2$ and $\beta=1$.

Remark 1 (Adversarial Data Corruption).

So far, we have focused on the case of random perturbation. As a by-product, we analyze the effect of some special non-random data corruption. Due to the page limit, the detailed results and discussions are postponed to Section A in the appendix. In general, compared with worst-case data corruption, the effect of random perturbation mostly cancels out across perturbation directions, leading to a smaller corruption effect term. Therefore, unsurprisingly, $k$-NN is more robust to random perturbation than to worst-case adversarial data corruption. However, our rigorous analysis shows that the regret under adversarial data corruption (defined formally in the Appendix) is still of the same order as in the case of random perturbation, but with a larger multiplicative constant, when $\omega\succcurlyeq n^{-2/(d+4)}$ (when $\omega\preccurlyeq n^{-2/(d+4)}$, the effect of either adversarial corruption or random perturbation is negligible).

2.3 Comparison with Noise-Injected $k$-NN

In an iterative adversarial training algorithm (e.g., [23]), an attack is designed in each iteration based on the current model, and the model is updated based on the attacked training data. Similarly, for random perturbation, a natural idea to enhance the robustness of $k$-NN is to inject noise into the training data so that training and testing data share the same distribution. One can randomly perturb the training samples and train $k$-NN using the perturbed training data set. Comparing this noise-injected $k$-NN with the traditional $k$-NN, we find that there is no performance improvement for the former, even when the corruption level is in the early stage of the polynomial deterioration regime.
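A minimal sketch of the two classifiers compared here, assuming the same spherical perturbation as in (1) and using scikit-learn (function names are ours):

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def fit_knn(X, y, k, omega=0.0, rng=None):
      # omega = 0 gives the usual k-NN; omega > 0 gives its noise-injected
      # counterpart, trained on features perturbed uniformly on the sphere
      # of radius omega (mimicking the corrupted test distribution).
      rng = np.random.default_rng() if rng is None else rng
      if omega > 0:
          noise = rng.standard_normal(X.shape)
          noise *= omega / np.linalg.norm(noise, axis=1, keepdims=True)
          X = X + noise
      return KNeighborsClassifier(n_neighbors=k).fit(X, y)

  # g_hat       = fit_knn(X_train, y_train, k)              # raw training data
  # g_hat_prime = fit_knn(X_train, y_train, k, omega=0.05)  # noise-injected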

Denote $\widetilde{g}(\widetilde{x}):=P(Y=1|\widetilde{x}\text{ is observed})$ as the Bayes estimator and $\widehat{g}^{\prime}_{n}$ as the estimator trained using randomly perturbed training data. Let both estimators $\widehat{g}_{n}$ and $\widehat{g}^{\prime}_{n}$ adopt their respective best choices of $k$. Then we have

Theorem 3.

Under [A.1] to [A.3], when $0<\omega^{3}\preccurlyeq n^{-4/(d+4)}$,

\frac{P(Y\neq\widehat{g}_{n}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}{P(Y\neq\widehat{g}_{n}^{\prime}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}\rightarrow 1. \qquad (3)

Although it is intuitive to perturb the training data so that they match the distribution of the corrupted testing data, result (3) implies that the estimators $\widehat{g}_{n}$ and $\widehat{g}_{n}^{\prime}$ asymptotically share the same predictive performance for randomly perturbed testing data, so the noise injection strategy is futile. Note that this result holds when $\omega$ is small. Combined with our analysis in Theorem 1, within the range $n^{-2/(d+4)}\preccurlyeq\omega\preccurlyeq n^{-4/3(d+4)}$, the regret deteriorates polynomially due to the impact of testing data corruption, and this deterioration cannot be remedied by noise-injection adversarial training at all. One heuristic explanation is that such an injected perturbation may introduce additional noise into the estimation procedure and change some underlying properties (e.g., smoothness); consequently, this strategy of perturbing the training data does not necessarily help achieve a smaller regret, especially when $\omega$ is small.

Remark 2.

The above robustness result can be applied once we know, at testing time, whether the data are corrupted. In reality, we do not know whether a testing point is corrupted or not; in such a case, other techniques, e.g., hypothesis testing, need to be applied at testing time.

3 Application to other Nearest Neighbor-Type Algorithms

Our theoretical analysis may be adapted to other NN-type algorithms: pre-processed 1NN [27] and distributed-NN [8]. In particular, we prove that the regret of the former is sub-optimal for some class of distributions, and explain why the regret of the latter converges at the optimal rate, both from the random perturbation viewpoint.

3.1 Pre-processed 1NN

In some of the literature [27, 25], the algorithms run 1NN for prediction using pre-processed data instead of running $k$-NN on the raw data. The pre-processing step (also called the de-noising step) is reviewed in Algorithm 1. Specifically, we first run $k$-NN to predict labels for the training data set, then replace the original labels with the predicted labels $\widehat{y}_{i}$'s. In this way, applying 1NN on the data $(x_{1},\widehat{y}_{1}),\ldots,(x_{n},\widehat{y}_{n})$ can achieve good accuracy while the computational cost is much smaller than that of $k$-NN.

Algorithm 1 Data Pre-processing
  Input: data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, number of neighbors $k$.
  for $i=1$ to $n$ do
     Find the $k$ nearest neighbors of $x_{i}$ in $x_{1},\ldots,x_{n}$, excluding $x_{i}$ itself. Denote the index set of these $k$ neighbors as $N_{i}$.
     Estimate a label for $x_{i}$ as
\widehat{\eta}(x_{i}) = \frac{1}{k}\sum_{j\in N_{i}}y_{j},
\widehat{y}_{i} = 1_{\{\widehat{\eta}(x_{i})>1/2\}}.
  end for
  Output: $(x_{1},\widehat{y}_{1}),\ldots,(x_{n},\widehat{y}_{n})$.
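As an illustration, Algorithm 1 followed by 1NN can be written in a few lines of Python (a sketch under the assumption that the feature vectors are distinct, so that each point's closest neighbor in the fitted index is itself; names are ours):

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def preprocessed_1nn(X, y, k):
      # Relabel each training point by the majority vote of its k nearest
      # neighbors (excluding itself), then fit 1-NN on the relabelled data.
      # X: (n, d) numpy array, y: (n,) numpy array of 0/1 labels.
      knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
      _, idx = knn.kneighbors(X)            # idx[:, 0] is the point itself
      y_hat = (y[idx[:, 1:]].mean(axis=1) > 0.5).astype(int)
      return KNeighborsClassifier(n_neighbors=1).fit(X, y_hat)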

This in fact can be treated as an application of random perturbation of testing data in $k$-NN, in the sense that this classifier can be equivalently represented as $k$-NN under a perturbed sample:

\widehat{g}_{1\rm NN}(x)=\widehat{g}_{n,k}(\widetilde{x}),

where $\widehat{g}_{1\rm NN}$ is the pre-processed 1NN classifier, and $\widetilde{x}$ is the perturbed observation of $x$, namely the nearest neighbor of $x$ among the training samples. Although this $\widetilde{x}$ does not exactly match our definition (1), it can still be viewed as a randomly perturbed $x$ with contamination level $\omega=\Theta(n^{-1/d})$, which is the order of the expected distance from $x$ to its nearest neighbor under Assumption A.1.

From this point of view, Theorem 1 can be applied to derive the regret of the pre-processed 1NN algorithm, whose rate of convergence turns out to be slower than the optimal rate $\Theta(n^{-4/(d+4)})$ of $k$-NN when the data dimension $d$ is relatively high, say $d>4$.

Theorem 4.

Under [A.1] to [A.3], the regret of pre-processed 1NN under un-corrupted testing data is

\begin{split}\textit{Regret}_{\textit{1NN}}(k,n)=&\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})t(x_{0})\right)^{2}d\text{Vol}^{d-1}(x_{0})+\frac{1}{2}\int_{\mathcal{S}}\frac{1}{4k}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}d\text{Vol}^{d-1}(x_{0})\\ &+\text{Corruption}+\text{Remainder},\end{split}

where

Corruption = \Theta(n^{-2/d}),
Remainder = o(n^{-2/d})

when both $1/k$ and $(k/n)^{4/d}$ are of order $O(n^{-1/d})$, and $k=O(n^{6/d})$. As a result, pre-processed 1NN is sub-optimal when $d>4$ (compared with the optimal rate $n^{-4/(d+4)}$).

The result in Theorem 4 reveals a sub-optimal rate for the pre-processed 1NN, in comparison with the optimal rate claimed by [27]. This is not a contradiction, but is due to the different assumptions imposed on $\eta$ and the distribution of $X$. Xue and Kpotufe [27] assume the $\alpha$-smoothness condition $|\eta(x)-\eta(x^{\prime})|\leq A\|x-x^{\prime}\|^{\alpha}$, which is more general than our smoothness assumption A.3.

3.2 Distributed-NN

The computational complexity of $k$-NN is huge when $n$ is large; therefore, we consider a distributed NN algorithm: we randomly partition the original data into $s$ equal-sized parts; then, given $x$, each machine selects the $k/s$ nearest neighbors of $x$ and computes $\widehat{\eta}_{j}(x)$ for $j=1,\ldots,s$; finally, we average $\widehat{\eta}_{1}(x),\ldots,\widehat{\eta}_{s}(x)$ to obtain $\widehat{\eta}(x)$. The algorithm is shown in Algorithm 2, as in [8].

Algorithm 2 Distributed-NN
  Input: data $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$, number of neighbors $k$, number of slaves $s$, a point $x$ for prediction.
  Randomly divide the whole data set into $s$ parts, with index sets $S_{1},\ldots,S_{s}$.
  for $i=1$ to $s$ do
     Find the $k/s$ nearest neighbors of $x$ in $\{x_{j}\,|\,j\in S_{i}\}$. Denote the index set of these $k/s$ neighbors as $N_{i}$.
     Estimate
\widehat{\eta}_{i}(x) = \frac{1}{k/s}\sum_{j\in N_{i}}y_{j}.
  end for
  Estimate the label of $x$ as
\widehat{\eta}(x) = \frac{1}{s}\sum_{i=1}^{s}\widehat{\eta}_{i}(x),
\widehat{y} = 1_{\{\widehat{\eta}(x)>1/2\}}.
  Output: $(x,\widehat{y})$.
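The following sketch mirrors Algorithm 2 for a single query point, assuming $k$ is a multiple of $s$ (names are ours):

  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def distributed_nn_predict(X, y, x, k, s, rng=None):
      # Split the data into s random parts, average the (k/s)-NN estimates of
      # eta(x) from each part, and threshold the average at 1/2.
      rng = np.random.default_rng() if rng is None else rng
      perm = rng.permutation(len(X))
      eta_parts = []
      for part in np.array_split(perm, s):
          nn = NearestNeighbors(n_neighbors=k // s).fit(X[part])
          _, idx = nn.kneighbors(x.reshape(1, -1))
          eta_parts.append(y[part][idx[0]].mean())
      eta_hat = float(np.mean(eta_parts))
      return int(eta_hat > 0.5), eta_hat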

Distributed-NN differs in practice from $k$-NN on a single machine, since the $k$ selected neighbors aggregated from the $s$ subsets of data are not necessarily the same as the $k$ nearest neighbors selected on a single machine. Therefore, an additional assumption $k/s\rightarrow\infty$ is imposed to ensure that the neighborhood set selected by distributed NN behaves similarly to the neighborhood set selected by single-machine $k$-NN, in the sense that $E\|X_{i}-x\|^{2}$, where $X_{i}$ belongs to the distributed NN neighborhood set, is of the same order as $t(x)$. Therefore, based on Theorem 1, we obtain that the regret of Distributed-NN is in fact of the same order as that of $k$-NN. Formally, we have the following corollary:

Corollary 5.

Under [A.1] to [A.3], when $\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|d\text{Vol}^{d-1}(x_{0})\neq 0$ and $P(b(X)>0|\eta(X)=1/2)>0$, if the number of machines $s\preccurlyeq k$, then

\mbox{Regret}_{\rm DNN}(k,n)=\Theta(\mbox{Regret}_{\rm kNN}(k,n)),

where $\mbox{Regret}_{\rm DNN}$ and $\mbox{Regret}_{\rm kNN}$ denote the (clean testing data) regret of the distributed NN and $k$-NN algorithms, respectively.

4 Numerical Experiments

In Section 4.1, we evaluate the empirical performance of the $k$-NN algorithm for randomly perturbed testing data, comparing the $k$-NN classifiers trained on the raw, un-corrupted training data and on noise-injected training data (i.e., $\widehat{g}_{n}$ versus $\widehat{g}_{n}^{\prime}$ defined in Section 2.3). In Section 4.2, we also conduct experiments to compare $k$-NN with pre-processed 1NN for un-corrupted testing data. These numerical experiments are intended to show: (i) $k$-NN trained on un-corrupted training data has a similar testing performance to $k$-NN trained on noise-injected training data when $\omega$ is small, which validates our Theorem 3; (ii) for un-corrupted testing data, the regret of pre-processed 1NN is much worse than that of $k$-NN when $d>4$, which validates our Theorem 4. In general, $k$-NN is expected to always be better than pre-processed 1NN.

Note that in all our real data applications except for MNIST data set, we normalized all attributes.

4.1 Tackling Random Perturbation

4.1.1 Simulation

The random variable $X$ is of dimension 5, and each dimension independently follows an exponential distribution with mean 0.5. The conditional mean of $Y$ is defined as

\eta(x) = \frac{e^{x^{\top}w}}{e^{x^{\top}w}+e^{-x^{\top}w}}, \qquad w_{i} = i-d/2, \quad i=1,\ldots,d. \qquad (4)
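For reference, the simulated data can be generated as follows (a minimal sketch of the model above; the function name is ours):

  import numpy as np

  def simulate(n, d=5, rng=None):
      # Each coordinate of X is Exponential with mean 0.5; Y | X ~ Bernoulli(eta(X))
      # with eta as in (4) and w_i = i - d/2.
      rng = np.random.default_rng() if rng is None else rng
      X = rng.exponential(scale=0.5, size=(n, d))
      w = np.arange(1, d + 1) - d / 2
      eta = 1.0 / (1.0 + np.exp(-2.0 * X @ w))   # = e^{xw} / (e^{xw} + e^{-xw})
      Y = rng.binomial(1, eta)
      return X, Y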

For each pair of $(k,n)$, we use $2^{6},\ldots,2^{11}$ training samples and 10000 testing samples, and repeat the experiment 50 times to calculate the average regret. In each repetition, 5-fold cross validation is used to obtain $\widetilde{k}$. Then, based on [22], we adjust the number of neighbors to $\widehat{k}=\widetilde{k}(5/4)^{4/(4+d)}$, since $\widetilde{k}$ is the best $k$ value for $4n/5$ samples instead of $n$ samples. The two classifiers, trained on un-corrupted training data and on corrupted training data respectively, are used to predict corrupted testing data.

From Figure 1, as the number of training samples increases, the regret of both $k$-NNs decreases at the same speed for $0<\omega\leq 0.05$. This verifies that these two $k$-NNs in fact do not differ much when $\omega$ is small, i.e., Theorem 3. Empirically, the regret of $k$-NN trained on corrupted training data is worse than that of the one trained on un-corrupted training data when $\omega\leq 0.5$. On the other hand, when $\omega$ is large (so that the required condition in Theorem 3 fails), the two $k$-NNs may perform significantly differently. For example, for $\omega=3$ and sample size $n=64$, $\log_{2}(\mbox{Regret})$ is -2.86 using uncontaminated training data and -3.11 using corrupted training data.

Figure 1: Comparison between $k$-NN trained on raw training data (solid line) and $k$-NN trained on noise-injected training data (dashed line) in the simulation.

4.1.2 Real Data

We use two real data sets for the comparison of the two $k$-NNs: Abalone [7] and HTRU2 [15]. The Abalone data set contains 4177 samples, and two attributes (Length and Diameter) are used in this experiment. The classification label is whether an abalone is older than 10.5 years. The HTRU2 data set [15] has a size of 17,898 with 8 continuous attributes. For each data set, 25% of the samples are contaminated by random noise and thereafter used as testing data.

As shown in Figure 2, when $\omega$ is small, the error rates (misclassification rates) of the two $k$-NNs do not differ much for either data set.

Figure 2: $k$-NN trained on raw training data vs. noise-injected training data.

4.2 Performance of 1NN with Pre-processed Data

4.2.1 Simulation

Figure 3: Simulation comparison between $k$-NN and pre-processed 1NN; the $y$ axis denotes $\log_{2}(\mbox{Regret of pre-processed 1NN})-\log_{2}(\mbox{Regret of $k$-NN})$.

To observe a clearer difference, instead of $w_{i}$ in (4), we use a model where each dimension of $x$ follows the uniform $[0,1]$ distribution, with $\eta$ as in (4), and

w_{i} = i-d/2-0.5, \quad i=1,\ldots,d,

for different values of $d$, to compare the performance of $k$-NN and pre-processed 1NN; $X$ now follows the $d$-dimensional uniform $(0,1)$ distribution. From Figure 3, the order of the regret of pre-processed 1NN differs from that of $k$-NN when $d\geq 4$.

We also replace the uniform distribution with an exponential distribution with mean 0.5 for each dimension of $x$. Figure 4 shows that for $d=5$ and $d=10$, the increasing trend of the regret ratio is obvious, indicating a sub-optimal rate for pre-processed 1NN.

Figure 4: Comparison between $k$-NN and pre-processed 1NN. Each dimension of $X$ independently follows an exponential distribution with mean 0.5. The $y$ axis represents the regret ratio (instead of the logarithm of the regret ratio).

4.2.2 Real Data

We use four data sets to compare $k$-NN and pre-processed 1NN: MNIST, Abalone, HTRU2, and Credit [29].

The MNIST data set contains 70000 samples and the data dimension is 784. We randomly pick 25% of the samples as testing data, and randomly pick $2^{i}$ ($i=7,8,\ldots,12$) samples to train the $k$-NN classifier and the pre-processed 1NN classifier, where the choices of $k$ for both algorithms are determined by 5-fold cross validation. We repeat this procedure 50 times with different random seeds to obtain the mean testing error rates. As shown in Figure 5, as the number of training samples increases, the error rate ratio between the pre-processed 1NN classifier and the $k$-NN classifier stays above 1, at around 1.17.

Figure 5: MNIST, $k$-NN vs. pre-processed 1NN.

For the Abalone data set, we conduct the experiment in the same way as for MNIST, and observe that the error rate of pre-processed 1NN is always greater than that of $k$-NN. As shown in Figure 6, while the error rates of both pre-processed 1NN and $k$-NN decrease in $n$, their difference changes little when $n\leq 2^{11}$.

Figure 6: Comparison between $k$-NN and pre-processed 1NN on the Abalone data set.

The Credit data set contains 30000 samples (25% used as testing data) with 23 attributes. For the HTRU2 and Credit data sets, the mean and standard error of the error rates over the 50 repetitions are summarized in Table 1. From Table 1, $k$-NN obtains a slightly smaller error rate on average.

Data Set   Error Rate ($k$-NN)   Error Rate (Pre-1NN)
Credit     0.18879               0.1899
HTRU2      0.02143               0.0221

Table 1: Comparison between $k$-NN and pre-processed 1NN (Pre-1NN) on the HTRU2 and Credit data sets.

5 Conclusion and Discussion

In this work, we conduct an asymptotic regret analysis of $k$-NN classification for randomly perturbed testing data. In particular, a phase transition phenomenon is observed: when the corruption level is below a threshold order, it does not affect the asymptotic regret; when the corruption level is beyond this threshold order, the asymptotic regret grows polynomially. Moreover, when the level of corruption is small, there is no benefit in performing the noise-injected adversarial training approach.

Moreover, using the idea of random perturbation, we can further explain why pre-processed 1NN converges at a sub-optimal rate: it can be treated as $k$-NN with perturbed testing data, whose perturbation level $\omega$, the distance from $x$ to its nearest neighbor, is large when $d>4$. It is worth mentioning that our analysis can also be applied to Distributed-NN to verify the optimal rate obtained in [8].

An interesting observation from the numerical experiments is that the traditional $k$-NN can even outperform the $k$-NN trained via the noise injection method. This observation contradicts the common belief that injecting attacks into the training algorithm yields an adversarially robust algorithm (e.g., the optimization method in [23]). Therefore, it deserves further theoretical investigation to understand under which circumstances one can indeed benefit from the noise injection strategy.

References

  • Audibert [2004] Audibert, J.-Y. (2004), “Classification under polynomial entropy and margin assumptions and randomized estimators,” .
  • Audibert et al. [2007] Audibert, J.-Y., Tsybakov, A. B., et al. (2007), “Fast learning rates for plug-in classifiers,” The Annals of statistics, 35, 608–633.
  • Belkin et al. [2018] Belkin, M., Hsu, D., and Mitra, P. (2018), “Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate,” arXiv preprint arXiv:1806.05161.
  • Cannings et al. [2017] Cannings, T. I., Berrett, T. B., and Samworth, R. J. (2017), “Local nearest neighbour classification with applications to semi-supervised learning,” arXiv preprint arXiv:1704.00642.
  • Cannings et al. [2018] Cannings, T. I., Fan, Y., and Samworth, R. J. (2018), “Classification with imperfect training labels,” arXiv preprint arXiv:1805.11505.
  • Chaudhuri and Dasgupta [2014] Chaudhuri, K. and Dasgupta, S. (2014), “Rates of convergence for nearest neighbor classification,” in Advances in Neural Information Processing Systems, pp. 3437–3445.
  • Dua and Graff [2017] Dua, D. and Graff, C. (2017), “UCI Machine Learning Repository,” .
  • Duan et al. [2018] Duan, J., Qiao, X., and Cheng, G. (2018), “Distributed Nearest Neighbor Classification,” arXiv preprint arXiv:1812.05005.
  • Fawzi et al. [2018] Fawzi, A., Fawzi, H., and Fawzi, O. (2018), “Adversarial vulnerability for any classifier,” in Advances in Neural Information Processing Systems, pp. 1178–1187.
  • Fawzi et al. [2016] Fawzi, A., Moosavi-Dezfooli, S.-M., and Frossard, P. (2016), “Robustness of classifiers: from adversarial to random noise,” in Advances in Neural Information Processing Systems, pp. 1632–1640.
  • Goodfellow et al. [2014] Goodfellow, I., Shlens, J., and Szegedy, C. (2014), “Explaining and Harnessing Adversarial Examples,” arXiv preprint arXiv:1412.6572.
  • Goodfellow et al. [2015] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015), “Explaining and harnessing adversarial examples. CoRR,” .
  • Grosse et al. [2017] Grosse, K., Papernot, N., Manoharan, P., Backes, M., and McDaniel, P. (2017), “Adversarial examples for malware detection,” in European Symposium on Research in Computer Security, Springer, pp. 62–79.
  • Kurakin et al. [2016] Kurakin, A., Goodfellow, I., and Bengio, S. (2016), “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236.
  • Lyon et al. [2016] Lyon, R. J., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. (2016), “Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach,” Monthly Notices of the Royal Astronomical Society, 459, 1104–1123.
  • Madry et al. [2017] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017), “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083.
  • Papernot et al. [2017] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. (2017), “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia conference on computer and communications security, ACM, pp. 506–519.
  • Papernot et al. [2016a] Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., and Swami, A. (2016a), “The limitations of deep learning in adversarial settings,” in Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, IEEE, pp. 372–387.
  • Papernot et al. [2016b] Papernot, N., McDaniel, P., Swami, A., and Harang, R. (2016b), “Crafting adversarial input sequences for recurrent neural networks,” in Military Communications Conference, MILCOM 2016-2016 IEEE, IEEE, pp. 49–54.
  • Reeve and Kaban [2019a] Reeve, H. W. and Kaban, A. (2019a), “Classification with unknown class conditional label noise on non-compact feature spaces,” arXiv preprint arXiv:1902.05627.
  • Reeve and Kaban [2019b] — (2019b), “Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise,” arXiv preprint arXiv:1906.04542.
  • Samworth et al. [2012] Samworth, R. J. et al. (2012), “Optimal weighted nearest neighbour classifiers,” The Annals of Statistics, 40, 2733–2763.
  • Sinha et al. [2018] Sinha, A., Namkoong, H., and Duchi, J. (2018), “Certifying some distributional robustness with principled adversarial training,” .
  • Sun et al. [2016] Sun, W. W., Qiao, X., and Cheng, G. (2016), “Stabilized nearest neighbor classifier and its statistical properties,” Journal of the American Statistical Association, 111, 1254–1265.
  • Wang et al. [2017] Wang, Y., Jha, S., and Chaudhuri, K. (2017), “Analyzing the robustness of nearest neighbors to adversarial examples,” arXiv preprint arXiv:1706.03922.
  • Xing et al. [2018] Xing, Y., Song, Q., and Cheng, G. (2018), “Statistical Optimality of Interpolated Nearest Neighbor Algorithms,” arXiv preprint arXiv:1810.02814.
  • Xue and Kpotufe [2017] Xue, L. and Kpotufe, S. (2017), “Achieving the time of 11-NN, but the accuracy of kk-NN,” arXiv preprint arXiv:1712.02369.
  • Yang et al. [2019] Yang, Y.-Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. (2019), “Adversarial Examples for Non-Parametric Methods: Attacks, Defenses and Large Sample Limits,” arXiv preprint arXiv:1906.03310.
  • Yeh and Lien [2009] Yeh, I.-C. and Lien, C.-h. (2009), “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, 36, 2473–2480.
  • Zhang et al. [2017] Zhang, G., Yan, C., Ji, X., Zhang, T., Zhang, T., and Xu, W. (2017), “Dolphinattack: Inaudible voice commands,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM, pp. 103–117.

Appendix A Comparison between Random Perturbation and Non-random Perturbation

We use the following adversarial attack as the non-random perturbation:

\widetilde{x}=\begin{cases}\mathop{\rm argmin}\limits_{z\in B(x,\omega)}\eta(z)\qquad\mbox{if }\eta(x)>1/2\\ \mathop{\rm argmax}\limits_{z\in B(x,\omega)}\eta(z)\qquad\mbox{if }\eta(x)\leq 1/2\end{cases}. \qquad (5)

When $\omega\rightarrow 0$, if $\eta$ is differentiable, the length of the attack converges to $\omega$ as well.

The proposed attack scheme (5) is also called a "white-box attack", since the adversary has knowledge of $\eta(x)$. On the other hand, unlike the "white-box attack" mentioned in [25], the perturbation and attack we focus on are independent of the training samples.

Theorem 6.

Under [A.1] to [A.3], if the testing data are adversarially attacked and $1/\sqrt{k},\zeta\ll\omega$, then

\begin{split}&\mbox{Regret}(k,n,\omega)=\frac{B_{1}}{4k}+\frac{1}{2}\int_{\mathcal{S}}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})^{2}\zeta(x_{0})^{2}+2\omega^{2}\|\dot{\eta}(x_{0})\|^{2}\right)d\text{Vol}^{d-1}(x_{0})+Rem,\end{split} \qquad (6)

where $Rem:=O(\omega/\sqrt{k}+\omega\zeta)+o((1/k)\vee(\zeta+\omega)^{2})$.

From Theorem 6, one can see that the regret under adversarial attack is larger than the one under random perturbation if the $\omega^{2}$ term is dominant.

Appendix B Proof of Regret Analysis in Section 2 and 3

B.1 Theorem 1

This section contains the proofs of Theorem 1 and Theorem S.6. The two proofs are similar, so for Theorem S.6 we only present the parts that differ from the proof of Theorem 1.

Define $R_{1}(x)$ to $R_{k}(x)$ as the unsorted distances from the $k$ nearest neighbors to the testing data point $x$, and $R_{k+1}(x)$ as the distance from the exact $(k+1)$-th nearest neighbor to $x$ itself. Similar to [6], conditional on the distance of the $(k+1)$-th neighbor, the first $k$ neighbors are i.i.d. random variables distributed within $B(x,R_{k+1}(x))$.

In addition to $f_{1}$, $f_{2}$, and $\Psi$, we further denote $\bar{f}(x)$ as the density of $X$.

Proposition 7 (Lemma S.1 in [24]).

For any distribution function $G$ with density $g$,

\int_{\mathbb{R}}[G(-bu-a)-1_{\{u<0\}}]du = -\frac{1}{b}\left\{a+\int_{\mathbb{R}}tg(t)dt\right\},
\int_{\mathbb{R}}u[G(-bu-a)-1_{\{u<0\}}]du = \frac{1}{b^{2}}\left\{\frac{a^{2}}{2}+\frac{1}{2}\int_{\mathbb{R}}t^{2}g(t)dt+a\int_{\mathbb{R}}tg(t)dt\right\}.

Now we start our proof of Theorem 1.

Proof of Theorem 1.

The idea of the proof follows [22], and there are five steps in total:

  • Step 0: We give some definitions in preparation for the proof.

  • Step 1: Given a fixed (unobserved) testing sample $x$ and conditional on the perturbation random variable $\delta$, we obtain the mean and variance of $\widehat{\eta}_{k,n}(x+\delta)$. In particular, for any $x_{0}$ satisfying $\eta(x_{0})=1/2$, letting $x^{t}_{0}=x_{0}+t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}$, we have that

    \mathbb{E}[\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta]=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})R_{1}^{2}+O(t^{2}+\omega^{2}+R_{1}^{4}),

    and

    Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta)=\frac{1}{4k}+O(\epsilon^{2}/k):=\frac{1}{s_{k,n}^{2}}+O(\epsilon^{2}/k).

  • Step 2: Use tube theory to construct a tube based on $\mathcal{S}$. The remainder of the regret outside the tube is of $O(\epsilon^{3})$ for some $\epsilon$:

    \begin{split}2*\mbox{Regret} =&\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ =&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dt\,d\text{Vol}^{d-1}(x_{0})+r_{1},\end{split}

    for some remainder term $r_{1}$.

  • Step 3: Use the Berry-Esseen theorem to approximate the probability

    P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)

    by a Gaussian probability,

    \Phi\left(\frac{\mathbb{E}(1/2-\widehat{\eta}_{k,n}(x_{0}+\delta))}{\sqrt{Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))}}\bigg{|}\delta\right)+o. \qquad (7)

  • Step 4: Plug the mean and variance of $\widehat{\eta}_{k,n}$ from Step 1 into (7), and integrate the formula in Step 2 over the tube (integrate over $t$).

Step 0: In this step we introduce some definitions.

Denote $\delta$ as the random perturbation, i.e., $\delta=\widetilde{x}-x$. Denote $X_{1}(x)$ to $X_{k}(x)$ as the $k$ unsorted neighbors of $x$ among the training samples and $Y_{i}(x)$ as the $Y$ value of the corresponding $X_{i}(x)$. When no confusion is caused, we drop the argument $x$ and write $X_{i}$ and $Y_{i}$ for short. Then the probability of classifying a contaminated sample as 0 becomes

P\left(\widehat{\eta}_{k,n}(x+\delta)\leq\frac{1}{2}\bigg{|}\delta\right) = P\left(\sum_{i=1}^{k}(Y_{i}(x+\delta)-1/2)<0\bigg{|}\delta\right).

If we directly investigate $P(\widehat{\eta}_{k,n}(x)\leq 1/2)$, the possible values of $\eta(x)$ range over $[0,1]$, but those $\eta(x)$'s that are far from $1/2$ contribute little to the regret. Hence we consider $x$ in $\mathcal{S}^{\epsilon}$, where

\mathcal{S}^{\epsilon}=\left\{x_{0}^{t}:=x_{0}+t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}:x_{0}\in\mathcal{S},|t|\leq\epsilon\right\}

for $\epsilon=\epsilon_{k,n,\omega}$.

Further define

\epsilon_{k,n,\omega}=\max_{x\in\mathcal{R}}C\left(\frac{\log k}{\sqrt{k}}\vee(R_{1}^{2}(x)+\omega)\right)

for some large constant $C$.

Step 1: For the scenario of random perturbation, $\delta$ is a random variable uniformly distributed on the sphere $\partial B(x_{0}^{t},\omega)$; we first evaluate $\mathbb{E}(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))$ and $Var(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta))$ for given $x_{0}$ and $\delta$.

\begin{split}&\mathbb{E}[\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)|\delta]=\mathbb{E}(Y_{1}(x_{0}^{t}+\delta)|\delta)=\mathbb{E}\eta(X_{1}|\delta)\\ =&\mathbb{E}\left(\eta(x_{0}^{t}+\delta)+(X_{1}-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)+1/2(X_{1}-x_{0}^{t}-\delta)^{\top}\ddot{\eta}(x_{0}^{t}+\delta)(X_{1}-x_{0}^{t}-\delta)\bigg{|}\delta\right)\\ &+rem,\end{split}

where $rem$ is a remainder term from the Taylor expansion. Before discussing $rem$, we consider the dominant part of $\mathbb{E}\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)$. Conditional on $\delta$ and $R_{1}(x_{0}^{t}+\delta)=\|X_{1}-x_{0}^{t}-\delta\|$, the distribution of $X_{1}$ is on the sphere of $B(x_{0}^{t}+\delta,R_{1})$. Denote the density of this distribution as $\bar{f}(x|x_{0}^{t}+\delta,R_{1}(x_{0}^{t}+\delta))$, and define $\bar{f}^{\prime}(x|x_{0}^{t}+\delta,R_{1}(x_{0}^{t}+\delta))$ as its gradient. For simplicity, rewrite $R_{1}(x_{0}^{t}+\delta)$ as $R_{1}$. Then, based on (A.1) and (A.3) for the smoothness of $\bar{f}$ and $\eta$, expanding $\bar{f}(x|x_{0}^{t}+\delta,R_{1})$ in a Taylor series at $x_{0}^{t}+\delta$, we have

\begin{split}&\mathbb{E}((X_{1}-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)|\delta,R_{1})\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bar{f}(x|x_{0}^{t}+\delta,R_{1})dx\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bigg{[}\bar{f}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})+\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}(x-x_{0}^{t}-\delta)\\ &\qquad\qquad+\frac{1}{2}(x-x_{0}^{t}-\delta)^{\top}\bar{f}^{\prime\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})(x-x_{0}^{t}-\delta)+O(\|x-x_{0}^{t}-\delta\|_{2}^{3})\bigg{]}dx\\ =&\int_{\partial B}(x-x_{0}^{t}-\delta)^{\top}\dot{\eta}(x_{0}^{t}+\delta)\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}(x-x_{0}^{t}-\delta)dx+o\\ =&\,tr\left(\dot{\eta}(x_{0}^{t}+\delta)\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})^{\top}\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}dx\right)+O(R_{1}^{4}),\end{split}

where $\int_{\partial B}$ denotes integration over the sphere $\partial B(x_{0}^{t}+\delta,R_{1})$, and the first-order and third-order terms become 0.

In addition,

\begin{split}&tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\mathbb{E}\left((X_{1}-x_{0}^{t}-\delta)(X_{1}-x_{0}^{t}-\delta)^{\top}|R_{1}\right)\right)\\ =&\,tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}\bar{f}(x|x_{0}^{t}+\delta,R_{1})dx\right)\\ =&\,tr\left(\frac{\bar{f}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}dx\right)\\ &+tr\left(\frac{1}{2}\ddot{\eta}(x_{0}^{t}+\delta)\int_{\partial B}(x-x_{0}^{t}-\delta)(x-x_{0}^{t}-\delta)^{\top}(x-x_{0}^{t}-\delta)^{\top}\bar{f}^{\prime}(x_{0}^{t}+\delta|x_{0}^{t}+\delta,R_{1})dx\right)+O(R_{1}^{4}).\end{split}

The term $rem$ in $\mathbb{E}(\widehat{\eta}_{k,n})$ can be handled in a similar manner, and $rem=O(R_{1}^{4})$. Hence, taking

b(x)=\frac{1}{\bar{f}(x)d}\left\{\sum_{j=1}^{d}[\dot{\eta}_{j}(x)\dot{\bar{f}}_{j}(x)+\ddot{\eta}_{j,j}(x)\bar{f}(x)/2]\right\},

we have

\begin{split}\mathbb{E}(\widehat{\eta}_{k,n}|\delta,R_{1}) =&\,\eta(x_{0}^{t}+\delta)+b(x_{0}^{t}+\delta)R_{1}^{2}+O(R_{1}^{4})\\ =&\,\eta(x_{0})+\frac{t}{\|\dot{\eta}(x_{0})\|}\dot{\eta}(x_{0})^{\top}\dot{\eta}(x_{0})+\delta^{\top}\dot{\eta}(x_{0})+O(t^{2}+\omega^{2})\\ &+b(x_{0})R_{1}^{2}+R_{1}^{2}\frac{t}{\|\dot{\eta}(x_{0})\|}\dot{\eta}(x_{0})^{\top}\dot{b}(x_{0})+R_{1}^{2}\delta^{\top}\dot{b}(x_{0})+O(R_{1}^{4})\\ =&\,\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})R_{1}^{2}+O(t^{2}+\omega^{2}+R_{1}^{4}).\end{split}

Denote $t_{k,n}(x_{0}^{t}+\delta)=\mathbb{E}R_{1}^{2}$. Using the arguments in Lemma 1 and Theorem 2 of [26], and taking $a_{d}=2^{d}\Gamma(1+1/2)^{d}/\Gamma(1+d/2)$, we obtain

\begin{split}t_{k,n}(x_{0}^{t}+\delta) =&\,\frac{1}{a_{d}^{2/d}\bar{f}(x_{0}^{t}+\delta)^{2/d}}\left(\frac{k}{n}\right)^{2/d}+o(t_{k,n}^{2}(x_{0}^{t}+\delta))\\ =&\,t_{k,n}(x_{0})+O\left(t\left(\frac{k}{n}\right)^{2/d}\right)+O\left(\omega\left(\frac{k}{n}\right)^{2/d}\right)+o(t_{k,n}^{2}(x_{0}^{t}+\delta)).\end{split}

Further denoting ${\mu}_{k,n,\omega}(x_{0}^{t},\delta)=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\delta^{\top}\dot{\eta}(x_{0})+b(x_{0})t_{k,n}(x_{0})$, we obtain

\mathbb{E}(\widehat{\eta}_{k,n})={\mu}_{k,n,\omega}(x_{0}^{t},\delta)+O(t^{2}+\omega^{2}+t_{k,n}^{2})={\mu}_{k,n,\omega}(x_{0}^{t},\delta)+O(\epsilon_{k,n,\omega}^{2}).

In terms of $Var(\widehat{\eta}_{k,n}(x_{0}^{t},\delta))$: fixing $R_{k+1}$, the $k$ neighbors are i.i.d. random variables in $B(x_{0}^{t}+\delta,R_{k+1})$, so

Var(Y_{1}|R_{k+1},\delta)=\mathbb{E}(Y_{1}|R_{k+1},\delta)(1-\mathbb{E}(Y_{1}|R_{k+1},\delta))=\frac{1}{4}+O\left(\epsilon_{k,n,\omega}^{2}\right),

when $R_{k+1}^{2}=O(t_{k,n}(x_{0}))$. Moreover, as mentioned in [6] and [3], the probability of $R_{k+1}\gg t_{k,n}(x_{0})$ has an exponential tail, hence the overall variance becomes

Var(Y_{1}|\delta)=\frac{1}{4}+O\left(\epsilon_{k,n,\omega}^{2}\right).

This also implies that

|\sqrt{Var(Y_{1}|\delta)}-\sqrt{1/4}|=O(\epsilon_{k,n,\omega}).

Step 2: Since the density of $x$ is 0 when $x$ is not in the support,

\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)

is equal to

\int_{\mathcal{R}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\right)-1_{\{\eta(x)<1/2\}}\right)dP(x),

where $\mathcal{R}$ is the support of $X$. Taking $\epsilon_{k,n,\omega}\geq-s_{k,n}\log s_{k,n}$, we have

\begin{split}&\int_{\mathbb{R}^{d}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ =&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dt\,d\text{Vol}^{d-1}(x_{0})+r_{1}.\end{split} \qquad (8)

The result in (8) uses tube theory to transform the integration from $\mathbb{R}^{d}$ to $\mathbb{R}\times\mathcal{S}$. Denote the map $\phi\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)=x_{0}^{t}$; then the pullback of the $d$-form $dx$ is given at $(x_{0},t{\dot{\eta}(x_{0})}/{\|\dot{\eta}(x_{0})\|})$ by

det(Ψ˙(x0,tη˙(x0)η˙(x0)))dtdVold1(x0).\displaystyle det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)dtd\text{Vol}^{d-1}(x_{0}).

For r1r_{1}, it is composed of four parts: (1) the integral outside 𝒮ϵk,n,ω\mathcal{S}^{\epsilon_{k,n,\omega}}, (2) the difference between Ψ(t)\Psi(t) and tΨ˙(x)t\|\dot{\Psi}(x)\|, (3) the difference between 𝒮ϵk,n,ω\mathcal{S}^{\epsilon_{k,n,\omega}} and the tube generated using 𝒮\mathcal{S}, and (4) the remainder of det(Ψ˙(x0,tη˙(x0)η˙(x0)))det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right):

r1=d\𝒮ϵk,n,ω(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑P(x)+𝒮ϵk,n,ω(Ψ(x)tΨ˙(x))(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑x+[𝒮ϵk,n,ωtΨ˙(x)(P(i=1k1kYi12|δ)1{η(x)<1/2})dx𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0})det(Ψ˙(x0,tη˙(x0)η˙(x0)))dtdVold1(x0)]+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0})[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]𝑑t𝑑Vold1(x0):=r11+r12+r13+r14.\begin{split}&r_{1}\\ =&\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)\\ +&\int_{\mathcal{S}^{\epsilon_{k,n,\omega}}}(\Psi(x)-t\|\dot{\Psi}(x)\|)\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dx\\ +&\bigg{[}\int_{\mathcal{S}^{\epsilon_{k,n,\omega}}}t\|\dot{\Psi}(x)\|\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dx\\ &\qquad-\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)dtd\text{Vol}^{d-1}(x_{0})\bigg{]}\\ +&\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})\\ :=&r_{11}+r_{12}+r_{13}+r_{14}.\end{split} (9)

In r_{1}, r_{12} is O(\epsilon_{k,n,\omega}^{3}) since \Psi(x)-t\|\dot{\Psi}(x)\|=O(t^{2}); r_{13} is O(\epsilon_{k,n,\omega}^{3}) since the difference in volume is O(\epsilon_{k,n,\omega}^{2}). For r_{11} in r_{1}:

0\displaystyle 0 \displaystyle\geq d\𝒮ϵk,n,ω{x|η(x)<1/2}(P(i=1k1kYi12|δ)1{η(x)<1/2})𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}Y_{i}\leq\frac{1}{2}\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)
=\displaystyle= d\𝒮ϵk,n,ω{x|η(x)<1/2}(P(i=1k1k(Yi12)𝔼(Y112)𝔼(Y112)|δ)1{η(x)<1/2})𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}\left(P\left(\sum_{i=1}^{k}\frac{1}{k}\left(Y_{i}-\frac{1}{2}\right)-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\leq-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\bigg{|}\delta\right)-1_{\{\eta(x)<1/2\}}\right)dP(x)
=\displaystyle= d\𝒮ϵk,n,ω{x|η(x)<1/2}P(i=1k1k(Yi12)𝔼(Y112)>𝔼(Y112)|δ)𝑑P(x).\displaystyle-\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}P\left(\sum_{i=1}^{k}\frac{1}{k}\left(Y_{i}-\frac{1}{2}\right)-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)>-\mathbb{E}\left(Y_{1}-\frac{1}{2}\right)\bigg{|}\delta\right)dP(x).

From the definition of \epsilon_{k,n,\omega}, we know that for any \delta, \inf_{x\in\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}}|\mathbb{E}Y(\widetilde{x}_{\omega})-1/2|\geq c_{1}\epsilon_{k,n,\omega} for some c_{1}>0. Using Bernstein's inequality, we obtain the upper bound

d\𝒮ϵk,n,ω{x|η(x)<1/2}P(i=1k1k(Yi12)𝔼(Y112)>𝔼(Y112)|δ)𝑑P(x)\displaystyle\int_{\mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)<1/2\}}P\left(\sum_{i=1}^{k}\frac{1}{k}(Y_{i}-\frac{1}{2})-\mathbb{E}(Y_{1}-\frac{1}{2})>-\mathbb{E}(Y_{1}-\frac{1}{2})\bigg{|}\delta\right)dP(x)
\displaystyle\leq O(exp(c2kϵk,n,ω2))=o(1/k3/2),\displaystyle O(\exp(-c_{2}k\epsilon_{k,n,\omega}^{2}))=o(1/k^{3/2}),

for some c_{2}>0; the last equality holds since k\epsilon_{k,n,\omega}^{2}\gtrsim(\log k)^{2} under the choice \epsilon_{k,n,\omega}\geq-s_{k,n}\log s_{k,n} with s_{k,n}^{2}\asymp 1/k.

A similar result can be obtained for \mathbb{R}^{d}\backslash\mathcal{S}^{\epsilon_{k,n,\omega}}\cap\{x|\eta(x)>1/2\}.

For r14r_{14} in r1r_{1},

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0t+δ)<1/2)1{t<0})\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)[tη˙(x0)η˙(x0)x0𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})]\displaystyle+\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left[t\frac{\dot{\eta}(x_{0})^{\top}}{\|\dot{\eta}(x_{0})\|}\frac{\partial}{\partial x_{0}}\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)\right]
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})
+o\displaystyle+o
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)[tη˙(x0)η˙(x0)x0𝔼δ(P(η^k,n(x0+δ)<1/2)1{t<0})]\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left[t\frac{\dot{\eta}(x_{0})^{\top}}{\|\dot{\eta}(x_{0})\|}\frac{\partial}{\partial x_{0}}\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}+\delta)<1/2)-1_{\{t<0\}}\right)\right]
[det(Ψ˙(x0,tη˙(x0)η˙(x0)))1]dtdVold1(x0)+o\displaystyle\qquad\qquad\qquad\left[det\left(\dot{\Psi}\left(x_{0},t\frac{\dot{\eta}(x_{0})}{\|\dot{\eta}(x_{0})\|}\right)\right)-1\right]dtd\text{Vol}^{d-1}(x_{0})+o
=\displaystyle= O(ϵk,n,ω4).\displaystyle O(\epsilon_{k,n,\omega}^{4}).

Finally r1=O(ϵk,n,ω3)r_{1}=O(\epsilon_{k,n,\omega}^{3}).

Step 3: we continue the derivation of P(\widehat{\eta}_{k,n}(X+\delta)\leq 1/2,X\in\mathcal{S}^{\epsilon_{k,n,\omega}}|\delta) (where X denotes the random testing sample) using

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)𝔼δ(P(η^k,n(x0t+δ)<1/2)1{t<0})𝑑t𝑑Vold1(x0).\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\mathbb{E}_{\delta}\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0}).

Since η^k,n\widehat{\eta}_{k,n} is obtained from nn i.i.d. samples (though for some samples their weight is 0), by non-uniform Berry-Esseen Theorem,

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(P(η^k,n(x0t+δ)<1/2)1{t<0}|δ)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-1_{\{t<0\}}|\delta\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)1{t<0})𝑑t𝑑Vold1(x0)+r2\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+r_{2}
+𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(𝔼Rk+1Φ(k𝔼(1/2Y1|Rk+1)kVar(Y1|Rk+1)|δ)Φ(k𝔼(1/2Y1)kVar(Y1)|δ))𝑑t𝑑Vold1(x0).\displaystyle+\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\mathbb{E}_{R_{k+1}}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1}|R_{k+1})}{\sqrt{kVar(Y_{1}|R_{k+1})}}\bigg{|}\delta\right)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right)dtd\text{Vol}^{d-1}(x_{0}).

where

|P(η^k,n(x0t+δ)<1/2)Φ(k𝔼(1/2Y1)kVar(Y1)|δ)|\displaystyle\left|P(\widehat{\eta}_{k,n}(x_{0}^{t}+\delta)<1/2)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right| \displaystyle\leq c3k11+k3/2|𝔼1/2Y1|3,\displaystyle\frac{c_{3}}{\sqrt{k}}\frac{1}{1+k^{3/2}\left|\mathbb{E}1/2-Y_{1}\right|^{3}},

hence

r_{2} \leq \int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\frac{c_{3}}{\sqrt{k}}\frac{1}{1+k^{3/2}\left|\mathbb{E}(1/2-Y_{1})\right|^{3}}\,dt\,d\text{Vol}^{d-1}(x_{0})
= \frac{c_{4}}{\sqrt{k}}\int_{\mathcal{S}}\int_{|t|<s_{k,n}}t\|\dot{\Psi}(x_{0})\|\,dt\,d\text{Vol}^{d-1}(x_{0})
+ \frac{c_{5}}{\sqrt{k}}\int_{\mathcal{S}}\int_{s_{k,n}<|t|<\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\frac{1}{1+k^{3/2}t^{3}}\,dt\,d\text{Vol}^{d-1}(x_{0})
\leq O\left(\frac{s^{2}_{k,n}}{\sqrt{k}}\right)+\frac{c_{5}}{k^{3/2}}\int_{\mathcal{S}}\int_{\sqrt{k}s_{k,n}<u<\sqrt{k}\epsilon_{k,n,\omega}}u\|\dot{\Psi}(x_{0})\|\frac{1}{1+u^{3}}\,du\,d\text{Vol}^{d-1}(x_{0})\qquad(u=\sqrt{k}t)
= O\left(\frac{1}{k^{3/2}}\right).

Following the analysis in Step 4, we can also obtain that

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(𝔼Rk+1Φ(k𝔼(1/2Y1|Rk+1)kVar(Y1|Rk+1)|δ)Φ(k𝔼(1/2Y1)kVar(Y1)|δ))𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\mathbb{E}_{R_{k+1}}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1}|R_{k+1})}{\sqrt{kVar(Y_{1}|R_{k+1})}}\bigg{|}\delta\right)-\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= O(ϵk,n,ω5k)+O(ϵk,n,ω3).\displaystyle O(\epsilon_{k,n,\omega}^{5}\sqrt{k})+O(\epsilon_{k,n,\omega}^{3}).

Step 4: Finally we integrate on Gaussian probabilities:

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0t))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}^{t}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
+r3\displaystyle+r_{3}
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
+r3+r4\displaystyle+r_{3}+r_{4}
=\displaystyle= B1sk,n2+SΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)+r3+r4.\displaystyle B_{1}s_{k,n}^{2}+\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})+r_{3}+r_{4}.

The last step follows from Proposition 7. For the small-order terms,

r3\displaystyle r_{3} =\displaystyle= 𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1)|δ)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\bigg{(}\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\bigg{|}\delta\right)
Φ(tη˙(x0)sk,n2(b(x0)tk,n(x0)+δη˙(x0t))sk,n2))dtdVold1(x0)\displaystyle\qquad\qquad\qquad-\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}^{t}))}{\sqrt{s_{k,n}^{2}}}\right)\bigg{)}dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= O(ϵk,n,ω3).\displaystyle O(\epsilon_{k,n,\omega}^{3}).

By the definition of \epsilon_{k,n,\omega}, we have

r4\displaystyle r_{4} =\displaystyle= o(1/k3/2).\displaystyle o(1/k^{3/2}).

Finally, we take the expectation over \delta:

𝔼δSΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)\displaystyle\mathbb{E}_{\delta}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2𝔼δ(b(x0)tk,n(x0)+δη˙(x0))2𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}_{\delta}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b(x0)2tk,n2(x0)+2b(x0)tk,n(x0)𝔼δ(δη˙(x0))+𝔼δ(δη˙(x0))2)𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})^{2}t_{k,n}^{2}(x_{0})+2b(x_{0})t_{k,n}(x_{0})\mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))+\mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))^{2}\right)d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b2(x0)tk,n2(x0)+η˙(x0)2dω2)𝑑Vold1(x0).\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b^{2}(x_{0})t_{k,n}^{2}(x_{0})+\frac{\|\dot{\eta}(x_{0})\|^{2}}{d}\omega^{2}\right)d\text{Vol}^{d-1}(x_{0}).
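The last equality uses \mathbb{E}_{\delta}(\delta^{\top}\dot{\eta}(x_{0}))^{2}=\|\dot{\eta}(x_{0})\|^{2}\omega^{2}/d, which holds when \delta has mean zero and \mathbb{E}\delta\delta^{\top}=(\omega^{2}/d)I_{d}, e.g., when \delta is uniformly distributed on the radius-\omega sphere. A minimal Monte Carlo sketch of this identity, under that assumption:

```python
import numpy as np

# Check E[(delta^T v)^2] = ||v||^2 * omega^2 / d for delta uniform on the
# radius-omega sphere in R^d (assumed perturbation model; illustration only).
rng = np.random.default_rng(1)
d, omega, reps = 5, 0.3, 200000
v = rng.normal(size=d)                    # arbitrary fixed vector, stand-in for eta-dot(x0)

g = rng.normal(size=(reps, d))
delta = omega * g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the sphere

lhs = float(np.mean((delta @ v) ** 2))
rhs = float(np.sum(v ** 2) * omega ** 2 / d)
print(lhs, rhs)   # the two values agree up to Monte Carlo error
```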

B.2 Theorem 2

Proof of Theorem 2.

When |\eta(x)-1/2|\geq C\omega for some large constant C>0, g and \widetilde{g} always coincide. Thus, the regret difference is contributed only by the C\omega-neighborhood of the decision boundary:

P(g~(x~)Y)P(g(x)Y)\displaystyle P(\widetilde{g}(\widetilde{x})\neq Y)-P(g(x)\neq Y) (10)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{η~(x0t+δ)<1/2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4).\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{\widetilde{\eta}(x_{0}^{t}+\delta)<1/2\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}). (11)

Moreover,

η~(x~)=𝔼(η(x)|x~ is observed)=η(x~)+b(x~)ω2+O(ω3).\displaystyle\widetilde{\eta}(\widetilde{x})=\mathbb{E}(\eta(x)|\widetilde{x}\text{ is observed})=\eta(\widetilde{x})+b(\widetilde{x})\omega^{2}+O(\omega^{3}). (12)

As a result,

η~(x0t+δ)=η(x0)+tη˙(x0)+η˙(x0)δ+b(x0)ω2+O(tω2)+O(ω3).\displaystyle\widetilde{\eta}(x_{0}^{t}+\delta)=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\dot{\eta}(x_{0})^{\top}\delta+b(x_{0})\omega^{2}+O(t\omega^{2})+O(\omega^{3}). (13)

Plugging in η~(x0t+δ)\widetilde{\eta}(x_{0}^{t}+\delta) into regret, we obtain that

P(g~(x~)Y)P(g(x)Y)\displaystyle P(\widetilde{g}(\widetilde{x})\neq Y)-P(g(x)\neq Y) (14)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{η~(x0t+δ)<1/2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{\widetilde{\eta}(x_{0}^{t}+\delta)<1/2\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (15)
=\displaystyle= 𝔼δ[𝒮CωCωtΨ˙(x0)(1{t<η˙(x0)δ/η˙(x0)b(x0)ω2}1{t<0})𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}_{\delta}\left[\int_{\mathcal{S}}\int_{-C\omega}^{C\omega}t\|\dot{\Psi}(x_{0})\|\left(1_{\{t<-\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|-b(x_{0})\omega^{2}\}}-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (16)
=\displaystyle= 𝔼[𝒮Ψ˙(x0)η˙(x0)δ/η˙(x0)b(x0)ω20t𝑑t𝑑Vold1(x0)]+O(ω4)\displaystyle\mathbb{E}\left[\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\int_{-\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|-b(x_{0})\omega^{2}}^{0}tdtd\text{Vol}^{d-1}(x_{0})\right]+O(\omega^{4}) (17)
=\displaystyle= 𝒮Ψ˙(x0)ω22d𝑑Vold1(x0)+O(ω4).\displaystyle\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\frac{\omega^{2}}{2d}d\text{Vol}^{d-1}(x_{0})+O(\omega^{4}). (18)
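For the step from (17) to (18), write a(\delta)=\dot{\eta}(x_{0})^{\top}\delta/\|\dot{\eta}(x_{0})\|+b(x_{0})\omega^{2}; the inner integral over t in (16) contributes a(\delta)^{2}/2 regardless of the sign of a(\delta). Assuming, as at the end of the proof of Theorem 1, that \mathbb{E}\delta=0 and \mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)^{2}=\|\dot{\eta}(x_{0})\|^{2}\omega^{2}/d,

\mathbb{E}_{\delta}\frac{a(\delta)^{2}}{2}=\frac{\mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)^{2}}{2\|\dot{\eta}(x_{0})\|^{2}}+\frac{b(x_{0})\omega^{2}\,\mathbb{E}_{\delta}(\dot{\eta}(x_{0})^{\top}\delta)}{\|\dot{\eta}(x_{0})\|}+\frac{b(x_{0})^{2}\omega^{4}}{2}=\frac{\omega^{2}}{2d}+O(\omega^{4}),

which yields (18) after integrating \|\dot{\Psi}(x_{0})\| over \mathcal{S}.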

From this derivation, the dominant terms in the denominator and numerator of the quantity

P(Yg^n(x~))P(Yg~(x~))P(Yg^n(x~))P(Yg~(x~))\displaystyle\frac{P(Y\neq\widehat{g}_{n}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))}{P(Y\neq\widehat{g}_{n}^{\prime}(\widetilde{x}))-P(Y\neq\widetilde{g}(\widetilde{x}))} (19)

are both \Theta(n^{-4/(d+4)}) when each k is chosen optimally. Note that the multiplicative constants of the numerator and the denominator are both determined by \delta and the density of X, and they converge to each other as \omega\rightarrow 0. Since \omega\rightarrow 0 as n\rightarrow\infty, the difference in the densities vanishes, and thus (19) converges to 1. ∎

B.3 Theorem S.1

Proof of Theorem S.1.

The proof is similar to that of Theorem 1. Since the forms of r_{1} to r_{4} are unchanged, one can show that they remain small-order terms here as well. What changes in this proof is {\mu}_{k,n,\omega}(x):

When t<0t<0, we have

μk,n,ω(x0t)=η(x0)+tη˙(x0)+ωη˙(x0)+b(x0)tk,n(x)+o,\displaystyle{\mu}_{k,n,\omega}(x_{0}^{t})=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|+b(x_{0})t_{k,n}(x)+o,

while for t>0t>0,

μk,n,ω(x0t)=η(x0)+tη˙(x0)ωη˙(x0)+b(x0)tk,n(x)+o.\displaystyle{\mu}_{k,n,\omega}(x_{0}^{t})=\eta(x_{0})+t\|\dot{\eta}(x_{0})\|-\omega\|\dot{\eta}(x_{0})\|+b(x_{0})t_{k,n}(x)+o.

Therefore,

𝒮ϵk,n,ωϵk,n,ωtΨ˙(x0)(Φ(k𝔼(1/2Y1)kVar(Y1))1{t<0})𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-\epsilon_{k,n,\omega}}^{\epsilon_{k,n,\omega}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(\frac{k\mathbb{E}(1/2-Y_{1})}{\sqrt{kVar(Y_{1})}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)sign(t)ωη˙(x0)sk,n2b(x0)tk,n(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)+o\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|-sign(t)\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+o
=\displaystyle= 𝒮tΨ˙(x0)(Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0))sk,n2)1{t<0})𝑑t𝑑Vold1(x0)+r5+o\displaystyle\int_{\mathcal{S}}\int_{\mathbb{R}}t\|\dot{\Psi}(x_{0})\|\left(\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0}))}{\sqrt{s_{k,n}^{2}}}\right)-1_{\{t<0\}}\right)dtd\text{Vol}^{d-1}(x_{0})+r_{5}+o
=\displaystyle= B14k+12SΨ˙(x0)η˙(x0)2(b(x0)𝔼R1(x)2+ωη˙(x0))2𝑑Vold1(x0)+r5+o.\displaystyle\frac{B_{1}}{4k}+\frac{1}{2}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})\mathbb{E}R_{1}(x)^{2}+\omega\|\dot{\eta}(x_{0})\|\right)^{2}d\text{Vol}^{d-1}(x_{0})+r_{5}+o.

The remainder r_{5} is not a small-order term, but we can show that it is positive and calculate its rate:

r5=O(B14k+SΨ˙(x0)η˙(x0)2(b(x0)𝔼R1(x)2+ωη˙(x0))2𝑑Vold1(x0)).\displaystyle r_{5}=O\left(\frac{B_{1}}{4k}+\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b(x_{0})\mathbb{E}R_{1}(x)^{2}+\omega\|\dot{\eta}(x_{0})\|\right)^{2}d\text{Vol}^{d-1}(x_{0})\right).

For r5r_{5},

r5\displaystyle r_{5} =\displaystyle= 𝒮0+tΨ˙(x0)Φ(tη˙(x0)ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|-\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
𝒮0+tΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle-\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮2ω+(t+2ω)Ψ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{-2\omega}^{+\infty}(t+2\omega)\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
𝒮0+tΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle-\int_{\mathcal{S}}\int_{0}^{+\infty}t\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
=\displaystyle= 𝒮02ωΨ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle\int_{\mathcal{S}}\int_{0}^{\infty}2\omega\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
+𝒮2ω0(t+2ω)Ψ˙(x0)Φ(tη˙(x0)+ωη˙(x0)sk,n2b(x0)tk,n(x0)sk,n2)𝑑t𝑑Vold1(x0)\displaystyle+\int_{\mathcal{S}}\int_{-2\omega}^{0}(t+2\omega)\|\dot{\Psi}(x_{0})\|\Phi\left(-\frac{t\|\dot{\eta}(x_{0})\|+\omega\|\dot{\eta}(x_{0})\|}{\sqrt{s_{k,n}^{2}}}-\frac{b(x_{0})t_{k,n}(x_{0})}{\sqrt{s_{k,n}^{2}}}\right)dtd\text{Vol}^{d-1}(x_{0})
:=\displaystyle:= A+B.\displaystyle A+B.

From the forms of A and B, we know that both are positive. When t_{k,n}(x_{0}) and 1/\sqrt{k} are both \ll\omega, A is exponentially small (so we ignore it), and for B we have:

B=𝒮Ψ˙(x0)ω2𝑑Vold1(x0)+O(ωtk,n(x0)+ω/k).\displaystyle B=\int_{\mathcal{S}}\|\dot{\Psi}(x_{0})\|\omega^{2}d\text{Vol}^{d-1}(x_{0})+O(\omega t_{k,n}(x_{0})+\omega/\sqrt{k}).

B.4 Theorem 4

Proof of Theorem 4.

First, it is easy to see that \omega=O((1/n)^{1/d}), since the nearest neighbor is at an average distance of O((1/n)^{1/d}) from x.

Second, there is a difference between pre-processed 1-NN and random perturbation: in pre-processed 1-NN, the nearest neighbor is distributed approximately uniformly around x, while the other neighbors must be farther from x than the nearest neighbor. However, this difference only affects the remainder term of the regret; that is, whether or not the other neighbors are assumed to be uniformly distributed in the ball B(x,R_{k+1}) does not affect our result. (A schematic sketch of this pre-processing is given below.)
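A minimal Python sketch, assuming the pre-processing relabels each training point by the k-NN majority vote over the training data and then classifies a test point by 1-NN on the relabeled sample; whether a point votes for itself is a detail fixed arbitrarily here:

```python
import numpy as np

def preprocess_then_1nn(X_train, y_train, X_test, k):
    """Sketch of pre-processed 1-NN: relabel each training point by the k-NN
    majority vote among the training data, then classify each test point by
    the label of its single nearest relabeled neighbor."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # pairwise squared distances among training points
    D = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(D, np.inf)              # our convention: a point does not vote for itself
    knn_idx = np.argsort(D, axis=1)[:, :k]
    y_denoised = (y_train[knn_idx].mean(axis=1) > 0.5).astype(int)
    # 1-NN on the relabeled sample
    D_test = ((np.asarray(X_test, dtype=float)[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_denoised[np.argmin(D_test, axis=1)]

# Toy usage: eta(x) depends on the first coordinate only.
rng = np.random.default_rng(2)
X_tr = rng.random((500, 2))
y_tr = (rng.random(500) < 0.2 + 0.6 * (X_tr[:, 0] > 0.5)).astype(int)
X_te = rng.random((5, 2))
print(preprocess_then_1nn(X_tr, y_tr, X_te, k=25))
```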

As a result, taking expectation on the direction of δ\delta,

𝔼SΨ˙(x0)η˙(x0)2(b(x0)tk,n(x0)+δ(x0)η˙(x0))2𝑑Vold1(x0)\displaystyle\mathbb{E}\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}(x_{0})\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2𝔼(b(x0)tk,n(x0)+δ(x0)η˙(x0))2𝑑Vold1(x0)\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\mathbb{E}(b(x_{0})t_{k,n}(x_{0})+\delta^{\top}(x_{0})\dot{\eta}(x_{0}))^{2}d\text{Vol}^{d-1}(x_{0})
=\displaystyle= SΨ˙(x0)η˙(x0)2(b2(x0)tk,n2(x0))𝑑Vold1(x0)+Θ(ω2).\displaystyle\int_{S}\frac{\|\dot{\Psi}(x_{0})\|}{\|\dot{\eta}(x_{0})\|^{2}}\left(b^{2}(x_{0})t_{k,n}^{2}(x_{0})\right)d\text{Vol}^{d-1}(x_{0})+\Theta(\omega^{2}).

When n1/dn2/(4+d)n^{-1/d}\gg n^{-2/(4+d)}, i.e. d>4d>4, the dominant part of regret becomes n2/dn^{-2/d}. ∎
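A trivial numerical check of the exponent comparison in the last step:

```python
# The perturbation term omega^2 = Theta(n^{-2/d}) dominates the k-NN term n^{-4/(4+d)}
# exactly when 2/d < 4/(4+d), i.e. when d > 4.
for d in range(1, 9):
    if 2 / d < 4 / (4 + d):
        msg = "omega^2 = n^{-2/d} dominates (d > 4)"
    elif 2 / d > 4 / (4 + d):
        msg = "n^{-4/(4+d)} dominates (d < 4)"
    else:
        msg = "both terms are of the same order (d = 4)"
    print(d, msg)
```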

Appendix C Regret Convergence under General Smoothness Condition and Margin Condition

C.1 Model and Theorem

In this section, we relax the conditions on the distribution of X and the smoothness of \eta; as a consequence, we only obtain the rate of the regret (without an explicit form of the multiplicative constant). Technically, we adopt the framework of [6], and the following assumptions on the smoothness of \eta and the density of X are used instead of conditions [A.1]-[A.3].

  1. B.1

    Let λ\lambda be the Lebesgue measure on d\mathbb{R}^{d}. There exists a positive pair (c0,r0)(c_{0},r_{0}) such that for any x𝒳x\in\mathcal{X},

    λ(𝒳B(x,r))c0λ(B(x,r)),\lambda(\mathcal{X}\cap B(x,r))\geq c_{0}\lambda(B(x,r)),

    for any 0<rr00<r\leq r_{0}.

  2. B.2

    The support of XX is compact.

  3. B.3

    Margin condition: P(0<|η(x)1/2|<t)BtβP(0<|\eta(x)-1/2|<t)\leq Bt^{\beta}.

  4. B.4

    Smoothness of \eta: there exist some \alpha>0 and c_{r}>0 such that |\eta(x+r)-\eta(x)|\leq\|r\|^{\alpha} for any x and \|r\|\leq c_{r}.

  5. B.5

    The density of XX is finite and bounded away from 0.

Remark 3.

Assumption B.3 is weaker than the corresponding condition in [6] (P(|\eta(x)-1/2|<t)\leq Bt^{\beta}), but in fact this does not affect the convergence.

In [6], the smoothness assumption is imposed on |\mathbb{E}(\eta(x^{\prime})|x^{\prime}\in B(x,r))-\eta(x)|, which is weaker than our B.4. However, under either random perturbation or adversarial attack, given a direction \delta used to obtain \widetilde{x}, the assumption in [6] cannot be directly applied.

The following theorem provides a general upper bound on the regret for both perturbed and attacked data:

Theorem 8 (Convergence of Regret).

Under [B.1] to [B.5], if for some δ>0\delta>0, k/nδk/n^{\delta}\rightarrow\infty, taking

k\displaystyle k \displaystyle\asymp O(n2α/(2α+d)(n2α/dω2αβ)1/(2α/d+β+1)),\displaystyle O(n^{2\alpha/(2\alpha+d)}\wedge(n^{2\alpha/d}\omega^{-2\alpha\beta})^{1/(2\alpha/d+\beta+1)}),

the regret becomes

Regret(n,ω)=O(ωα(β+1)nα(β+1)/(2α+d)),\displaystyle\mbox{Regret}(n,\omega)=O\left(\omega^{\alpha(\beta+1)}\vee n^{-\alpha(\beta+1)/(2\alpha+d)}\right),

where nα(β+1)/(2α+d)n^{-\alpha(\beta+1)/(2\alpha+d)} is the minimax rate of regret in kk-NN.

Theorem 8 also reveals a sufficient condition for k-NN to be consistent, i.e., for the regret to eventually converge to 0: for both perturbed and attacked data, when \omega=o(1), k-NN remains consistent on these two types of corrupted testing data. (The displayed formulas are illustrated numerically below.)
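A small illustration of the displayed formulas (the constants omitted by \asymp are arbitrary):

```python
def theorem8_k_and_rate(n, omega, alpha, beta, d):
    # k from Theorem 8 (up to constants) and the regret rate it yields.
    k_clean = n ** (2 * alpha / (2 * alpha + d))
    k_corrupt = (n ** (2 * alpha / d) * omega ** (-2 * alpha * beta)) ** (1 / (2 * alpha / d + beta + 1))
    k = min(k_clean, k_corrupt)
    rate = max(omega ** (alpha * (beta + 1)), n ** (-alpha * (beta + 1) / (2 * alpha + d)))
    return k, rate

# Example: the two terms of the rate balance when omega is of order n^{-1/(2 alpha + d)}.
n, alpha, beta, d = 10 ** 5, 1.0, 1.0, 3
for omega in (1e-4, n ** (-1 / (2 * alpha + d)), 3e-1):
    k, rate = theorem8_k_and_rate(n, omega, alpha, beta, d)
    print(f"omega={omega:.2e}  k~{k:.1f}  regret rate~{rate:.3e}")
```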

Theorem 9 (Minimax Rate of Regret).

Let \widehat{g}_{n} be an estimator of g and let \mathcal{P}_{\alpha,\beta} be a set of distributions satisfying [B.1] to [B.5]. When \alpha\leq 1, there exists some C>0 such that

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C(ωα(β+1)nα(β+1)2α+d).\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C(\omega^{\alpha(\beta+1)}\vee n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}). (20)

The constant CC depends on α,β,d\alpha,\beta,d only.

Theorem 9 reveals that, for any estimator of g, under either random perturbation or adversarial attack, the worst-case regret is at least C(\omega^{\alpha(\beta+1)}\vee n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}). Theorems 8 and 9 together show that the k-NN estimator attains the optimal rate of regret.

C.2 Proofs

Proof of Theorem 8.

Let p=k/np=k/n. Denote Rk,n(x)=P(g^k,n(x)Y|x)R_{k,n}(x)=P(\widehat{g}_{k,n}(x)\neq Y|x) and R(x)=P(g(x)Y)R^{*}(x)=P(g(x)\neq Y), and 𝔼Rk,n(x)R(x)\mathbb{E}R_{k,n}(x)-R^{*}(x) as the excess risk. Define

𝒳p,Δ,ω+={x𝒳|η(x)>12,xB(x,ω),η(x+r)\displaystyle\mathcal{X}^{+}_{p,\Delta,\omega}=\{x\in\mathcal{X}|\eta(x)>\frac{1}{2},\forall x^{\prime}\in B(x,\omega),{\eta}(x^{\prime}+r)\geq 12+Δ,r<r2p(x)},\displaystyle\frac{1}{2}+\Delta,\forall\|r\|<r_{2p}(x)\},
𝒳p,Δ,ω={x𝒳|η(x)<12,xB(x,ω),η(x+r)\displaystyle\mathcal{X}^{-}_{p,\Delta,\omega}=\{x\in\mathcal{X}|\eta(x)<\frac{1}{2},\forall x^{\prime}\in B(x,\omega),{\eta}(x^{\prime}+r)\leq 12Δ,r<r2p(x)},\displaystyle\frac{1}{2}-\Delta,\forall\|r\|<r_{2p}(x)\},

with r2pr_{2p} as the distance from xx to its 2pn2pnth nearest neighbor, and the decision boundary area:

p,Δ,ω=𝒳(𝒳p,Δ,ω+𝒳p,Δ,ω).\partial_{p,\Delta,\omega}=\mathcal{X}\setminus(\mathcal{X}^{+}_{p,\Delta,\omega}\cup\mathcal{X}^{-}_{p,\Delta,\omega}).

Given \partial_{p,\Delta,\omega}, \mathcal{X}^{+}_{p,\Delta,\omega}, and \mathcal{X}^{-}_{p,\Delta,\omega}, similarly to Lemma 8 in [6], the event g(x)\neq\widehat{g}_{k,n}(x) can be covered as follows:

1{g(x)g^k,n(x)}\displaystyle 1_{\{g(x)\neq\widehat{g}_{k,n}(x)\}} \displaystyle\leq 1{xp,Δ,ω}\displaystyle 1_{\{x\in\partial_{p,\Delta,\omega}\}}
+1{maxi=1,,kRi(x~)r2p(x)}\displaystyle+1_{\{\max\limits_{i=1,...,k}R_{i}(\widetilde{x})\geq r_{2p}(x)\}}
+1{|η^k,n(x)η(x+r)|Δ}.\displaystyle+1_{\{|\widehat{\eta}_{k,n}(x)-\eta(x^{\prime}+r)|\geq\Delta\}}.

When η(x+r)>1/2{\eta}(x^{\prime}+r)>1/2 for all rr2p(x)\|r\|\leq r_{2p}(x), and x𝒳p,Δ+x\in\mathcal{X}_{p,\Delta}^{+}, assume η^k,n(x)<1/2\widehat{\eta}_{k,n}(x)<1/2, then

η(x+r)η^k,n(x)>η(x+r)1/2Δ.{\eta}(x^{\prime}+r)-\widehat{\eta}_{k,n}(x^{\prime})>{\eta}(x^{\prime}+r)-1/2\geq\Delta.

The coverage by the other two events is straightforward.

By [6] and [3], P(maxi=1,,kRi(x)r2p(x))P(\max\limits_{i=1,...,k}R_{i}(x)\geq r_{2p}(x)) is of O(exp(ck2))O(\exp(-ck^{2})) for some c>0c>0, hence it becomes a smaller order term if for some δ>0\delta>0, k/nδk/n^{\delta}\rightarrow\infty.

In addition, from the definition of regret, assume η(x)<1/2\eta(x)<1/2,

P(g^(x)Y|X=x)η(x)\displaystyle P(\widehat{g}(x)\neq Y|X=x)-\eta(x)
=\displaystyle= η(x)P(g^(x)=0|X=x)+(1η(x))P(g^(x)=1|X=x)η(x)\displaystyle\eta(x)P(\widehat{g}(x)=0|X=x)+(1-\eta(x))P(\widehat{g}(x)=1|X=x)-\eta(x)
=\displaystyle= η(x)P(g^(x)=g(x)|X=x)+(1η(x))P(g^(x)g(x)|X=x)η(x)\displaystyle\eta(x)P(\widehat{g}(x)=g(x)|X=x)+(1-\eta(x))P(\widehat{g}(x)\neq g(x)|X=x)-\eta(x)
=\displaystyle= η(x)η(x)P(g^(x)g(x)|X=x)+(1η(x))P(g^(x)g(x)|X=x)η(x)\displaystyle\eta(x)-\eta(x)P(\widehat{g}(x)\neq g(x)|X=x)+(1-\eta(x))P(\widehat{g}(x)\neq g(x)|X=x)-\eta(x)
=\displaystyle= (12η(x))P(g^(x)g(x)|X=x),\displaystyle(1-2\eta(x))P(\widehat{g}(x)\neq g(x)|X=x),

similarly, when η(x)>1/2\eta(x)>1/2, we have

P(g^(x)Y|X=x)1+η(x)\displaystyle P(\widehat{g}(x)\neq Y|X=x)-1+\eta(x) =\displaystyle= (2η(x)1)P(g^(x)g(x)|X=x).\displaystyle(2\eta(x)-1)P(\widehat{g}(x)\neq g(x)|X=x).

As a result, the regret can be represented as

Regret(k,n,ω)\displaystyle Regret(k,n,\omega) =\displaystyle= 𝔼(|12η(X)|P(g(X)g^k,n(X))).\displaystyle\mathbb{E}\left(|1-2\eta(X)|P(g(X)\neq\widehat{g}_{k,n}(X))\right).

Recall p=k/n. We then follow the proof of Lemma 20 of [6]. Without loss of generality, assume \eta(x)>1/2. For a perturbation \delta\in\mathbb{R}^{d}, define

Δ0\displaystyle\Delta_{0} =\displaystyle= supx,δ,r<r2p(x)|η(x+δ+r)η(x)|=O(ωα)+O((k/n)α/d),\displaystyle\sup\limits_{x,\delta,\|r\|<r_{2p}(x)}|\eta(x+\delta+r)-\eta(x)|=O(\omega^{\alpha})+O((k/n)^{\alpha/d}),
Δ(x)\displaystyle\Delta(x) =\displaystyle= |η(x)1/2|,\displaystyle|\eta(x)-1/2|,

then we have

η(x+δ+r)η(x)Δ0=12+(Δ(x)Δ0),\eta(x+\delta+r)\geq\eta(x)-\Delta_{0}=\frac{1}{2}+(\Delta(x)-\Delta_{0}),

hence x𝒳p,Δ(x)Δ0,ω+x\in\mathcal{X}_{p,\Delta(x)-\Delta_{0},\omega}^{+}.

From the definition of Rk,nR_{k,n} and RR^{*}, when Δ(x)>Δ0\Delta(x)>\Delta_{0}, we also have

𝔼Rk,n(x)R(x)\displaystyle\mathbb{E}R_{k,n}(x)-R^{*}(x)
\displaystyle\leq 2Δ(x)[P(r(k+1)>v2p)+P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0)]\displaystyle 2\Delta(x)\bigg{[}P(r_{(k+1)}>v_{2p})+P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{)}\bigg{]}
\displaystyle\leq 2Δ(x)P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0)+o\displaystyle 2\Delta(x)P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{)}+o
=\displaystyle= 2Δ(x)𝔼δ[P(i=1k1kY(Xi)η(x+δ+r)>Δ(x)Δ0|δ)]+o\displaystyle 2\Delta(x)\mathbb{E}_{\delta}\left[P\bigg{(}\sum_{i=1}^{k}\frac{1}{k}Y(X_{i})-{\eta}(x^{\prime}+\delta+r)>\Delta(x)-\Delta_{0}\bigg{|}\delta\bigg{)}\right]+o

Since the above upper bound can be much greater than 1 when \Delta(x) is small, we define \Delta_{i}=2^{i}\Delta_{0} and take i_{0}=\min\{i\geq 1|\;(\Delta_{i}-\Delta_{0})^{2}>1/k\}. Using Bernstein's inequality, it becomes

𝔼Rk,n(X)R(X)\displaystyle\mathbb{E}R_{k,n}(X)-R^{*}(X) =\displaystyle= 𝔼(Rk,n(X)R(X))1{Δ(X)Δi0}\displaystyle\mathbb{E}(R_{k,n}(X)-R^{*}(X))1_{\{\Delta(X)\leq\Delta_{i_{0}}\}}
+𝔼(Rk,n(X)R(X))1{Δ(X)>Δi0}\displaystyle+\mathbb{E}(R_{k,n}(X)-R^{*}(X))1_{\{\Delta(X)>\Delta_{i_{0}}\}}
\leq 2\Delta_{i_{0}}P(\Delta(X)\leq\Delta_{i_{0}})+\exp(-k/8)
+c_{2}\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i_{0}}<\Delta(X)\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right].

When i0=min{i1|(ΔiΔ0)2>1/k}i_{0}=\min\{i\geq 1|\;(\Delta_{i}-\Delta_{0})^{2}>1/k\}, the exponential tail will diminish fast, leading to

𝔼[Δ(X)1{Δi0<Δ(X)}exp(c1k(Δ(x)Δ0)2)]\displaystyle\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i_{0}}<\Delta(X)\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right]
=\displaystyle= i=i0𝔼[Δ(X)1{Δi<Δ(X)<Δi+1}exp(c1k(Δ(x)Δ0)2)]\displaystyle\sum_{i=i_{0}}^{\infty}\mathbb{E}\left[\Delta(X)1_{\{\Delta_{i}<\Delta(X)<\Delta_{i+1}\}}\exp(-c_{1}k(\Delta(x)-\Delta_{0})^{2})\right]
\displaystyle\leq i=i0Δi+1β+1exp(c1k(ΔiΔ0)2)\displaystyle\sum_{i=i_{0}}^{\infty}\Delta_{i+1}^{\beta+1}\exp(-c_{1}k(\Delta_{i}-\Delta_{0})^{2})
=\displaystyle= i=i0Δ0β+12(i+1)(β+1)exp(c1kΔ02(2i1)2)\displaystyle\sum_{i=i_{0}}^{\infty}\Delta_{0}^{\beta+1}2^{(i+1)(\beta+1)}\exp(-c_{1}k\Delta_{0}^{2}(2^{i}-1)^{2})
\displaystyle\leq c3Δ0β+1.\displaystyle c_{3}\Delta_{0}^{\beta+1}.

Recall that Δi0>Δ0\Delta_{i_{0}}>\Delta_{0} and Δi02>1/k\Delta_{i_{0}}^{2}>1/k, hence when Δi02=O(1/k)\Delta_{i_{0}}^{2}=O(1/k), we can obtain the minimum upper bound

𝔼Rk,n(X)R(X)=O(Δ0β+1)+O((1k)(β+1)/2).\displaystyle\mathbb{E}R_{k,n}(X)-R^{*}(X)=O(\Delta_{0}^{\beta+1})+O\left(\left(\frac{1}{k}\right)^{(\beta+1)/2}\right).

Proof of Theorem 9.

The proof is similar to that of [2], using the technical details of [1] for Assouad's method. We consider two scenarios. Let C_{0}, C_{1}, and C_{2} be suitable constants. We first show that for any \omega\geq 0,

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C1nα(β+1)2α+d.\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C_{1}n^{-\frac{\alpha(\beta+1)}{2\alpha+d}}. (21)

Further, when ω>C0n12α+d\omega>C_{0}n^{-\frac{1}{2\alpha+d}}, our target is to show that

supP𝒫α,βP(g^n(X~)Y)P(g(X)Y)C2ωα(β+1).\sup\limits_{P\in\mathcal{P}_{\alpha,\beta}}P(\widehat{g}_{n}(\widetilde{X})\neq Y)-P(g(X)\neq Y)\geq C_{2}\omega^{\alpha(\beta+1)}. (22)

Case 1: when \omega\leq C_{0}n^{-\frac{1}{2\alpha+d}}, the basic idea is to construct a distribution of x and two distributions of y|x such that the Bayes classifiers under the two distributions of y|x are opposite to each other, yet from n sampled points one cannot distinguish which of the two distributions generated the sample. For example, given n samples from a normal distribution, we cannot statistically determine whether the data come from a zero-mean distribution or from a distribution with mean 1/\sqrt{n}; thus any estimator based on the data (using either clean or corrupted testing data) can make a false prediction.

Assume X is distributed within a compact set in [0,1]^{d}. For an integer q\geq 1, consider the regular grid

Gq:={(2k1+12q,,2kd+12q):ki{0,,q1},i=1,,d}.G_{q}:=\left\{\left(\frac{2k_{1}+1}{2q},...,\frac{2k_{d}+1}{2q}\right):k_{i}\in\{0,...,q-1\},i=1,...,d\right\}. (23)

For any point xx, denote nq(x)n_{q}(x) as the closest grid point in GqG_{q}, and define 𝒳1,,𝒳qd\mathcal{X}^{\prime}_{1},...,\mathcal{X}^{\prime}_{q^{d}} as a partition of [0,1]d[0,1]^{d} such that xx and xx^{\prime} are in the same 𝒳i\mathcal{X}^{\prime}_{i} if and only if nq(x)=nq(x)n_{q}(x)=n_{q}(x^{\prime}). Among all the 𝒳i\mathcal{X}^{\prime}_{i}’s, select mm of them as 𝒳1,,𝒳m\mathcal{X}_{1},...,\mathcal{X}_{m}, and 𝒳0:=[0,1]d\i=1m𝒳i\mathcal{X}_{0}:=[0,1]^{d}\backslash\cup_{i=1}^{m}\mathcal{X}_{i}.
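A short Python sketch of the grid G_{q} and of the map n_{q}(\cdot); the clipping at the boundary of [0,1]^{d} is our own convention:

```python
import numpy as np
from itertools import product

def grid_Gq(q, d):
    """Regular grid G_q with points ((2k_1+1)/(2q), ..., (2k_d+1)/(2q))."""
    coords = [(2 * k + 1) / (2 * q) for k in range(q)]
    return np.array(list(product(coords, repeat=d)))

def n_q(x, q):
    """Closest grid point n_q(x): each coordinate is mapped to the center of its cell."""
    x = np.asarray(x, dtype=float)
    cell = np.clip(np.floor(q * x).astype(int), 0, q - 1)
    return (2 * cell + 1) / (2 * q)

G = grid_Gq(q=4, d=2)
print(G.shape)               # (16, 2): q^d grid points
print(n_q([0.30, 0.95], 4))  # [0.375 0.875]
```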

Take z_{i} as the center of \mathcal{X}_{i} for i=1,...,m. When x\in B(z_{i},1/4q), set the density of x to \epsilon/\lambda[B(z_{i},1/4q)] for some \epsilon>0, and set the density of x in \mathcal{X}_{i}\backslash B(z_{i},1/4q) to 0. Assume x is uniformly distributed in \mathcal{X}_{0}.

Let u:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+} be a nondecreasing, infinitely differentiable function that starts from 0, satisfies the \alpha-smoothness condition, and equals 1 on [1/2,\infty). Denote \psi and \phi as

ψ(x):=Cψu(x),\psi(x):=C_{\psi}u(\|x\|), (24)

and

ϕ(x):=qαψ(q(xnq(x))).\phi(x):=q^{-\alpha}\psi(q(x-n_{q}(x))). (25)

Through the above construction, if we take \eta(x)=(1+\phi(x))/2 or \eta(x)=(1-\phi(x))/2 and let m=O(q^{d-\alpha\beta}), then when \alpha\beta\leq d, the \beta-margin condition is also satisfied.

The construction above will also be used in Case 2 (with a different choice of q, \epsilon, and u).

Now we apply Assouad's method to derive the lower bound of the regret. Denote by P_{jk} a distribution such that \eta(x)=(1+\phi(x))/2 when k=0 and x\in\mathcal{X}_{j}, and \eta(x)=(1-\phi(x))/2 when k=1 and x\in\mathcal{X}_{j}. Then, for any estimator \widehat{g}(x,Z_{n}) with data Z_{n}=(X_{n},Y_{n}), we have

supk=0,1𝔼X,Zn,Pjk1{g^(X,Zn)g(X)}1{X𝒳j}\displaystyle\sup\limits_{k=0,1}\mathbb{E}_{X,Z_{n},P_{jk}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}} (26)
\displaystyle\geq 12𝔼X,Zn,Pj01{g^(X,Zn)g(X)}1{X𝒳j}+12𝔼Zn,Pj11{g^(X,Zn)g(X)}1{X𝒳j}\displaystyle\frac{1}{2}\mathbb{E}_{X,Z_{n},P_{j0}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}}+\frac{1}{2}\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(X,Z_{n})\neq g(X)\}}1_{\{X\in\mathcal{X}_{j}\}} (27)
=\displaystyle= 12𝔼X,Zn,Pj01{g^(X,Zn)0}1{X𝒳j}+12𝔼Zn,Pj11{g^(X,Zn)1}1{X𝒳j}\displaystyle\frac{1}{2}\mathbb{E}_{X,Z_{n},P_{j0}}1_{\{\widehat{g}(X,Z_{n})\neq 0\}}1_{\{X\in\mathcal{X}_{j}\}}+\frac{1}{2}\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(X,Z_{n})\neq 1\}}1_{\{X\in\mathcal{X}_{j}\}} (28)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[𝔼Zn,Pj01{g^(x,Zn)0}+𝔼Zn,Pj11{g^(x,Zn)1}|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\mathbb{E}_{Z_{n},P_{j0}}1_{\{\widehat{g}(x,Z_{n})\neq 0\}}+\mathbb{E}_{Z_{n},P_{j1}}1_{\{\widehat{g}(x,Z_{n})\neq 1\}}\bigg{|}X=x\right] (29)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[1{g^(x,Zn)0}𝑑Pj0(Zn)+1{g^(x,Zn)1}𝑑Pj1(Zn)|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int 1_{\{\widehat{g}(x,Z_{n})\neq 0\}}dP_{j0}(Z_{n})+\int 1_{\{\widehat{g}(x,Z_{n})\neq 1\}}dP_{j1}(Z_{n})\bigg{|}X=x\right] (30)
\displaystyle\geq 12𝔼X1{X𝒳j}𝔼[1{g^(x,Zn)0}+1{g^(x,Zn)1}(dPj0(Zn)dPj1(Zn))|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int 1_{\{\widehat{g}(x,Z_{n})\neq 0\}}+1_{\{\widehat{g}(x,Z_{n})\neq 1\}}(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))\bigg{|}X=x\right] (31)
=\displaystyle= 12𝔼X1{X𝒳j}𝔼[(dPj0(Zn)dPj1(Zn))|X=x]\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\mathbb{E}\left[\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))\bigg{|}X=x\right] (32)
=\displaystyle= 12𝔼X1{X𝒳j}(dPj0(Zn)dPj1(Zn)).\displaystyle\frac{1}{2}\mathbb{E}_{X}1_{\{X\in\mathcal{X}_{j}\}}\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n})). (33)

Denote

bj:=[1𝔼2(1ϕ2(X)|X𝒳j)]1/2,b_{j}:=\left[1-\mathbb{E}^{2}(\sqrt{1-\phi^{2}(X)}|X\in\mathcal{X}_{j})\right]^{1/2}, (34)

and

bj:=(𝔼ϕ(X)|X𝒳j),b^{\prime}_{j}:=(\mathbb{E}\phi(X)|X\in\mathcal{X}_{j}), (35)

then (dPj0(Zn)dPj1(Zn))=Θ(1)\int(dP_{j0}(Z_{n})\wedge dP_{j1}(Z_{n}))=\Theta(1) through our design of 𝒳j\mathcal{X}_{j} when bj=O(1/nϵ)b_{j}=O(1/\sqrt{n\epsilon}) by Lemma 5.1 in [1].

As a result, when bj=bb_{j}=b, bj=bb^{\prime}_{j}=b^{\prime} for all j=1,,mj=1,...,m, and b=O(1/nϵ)b=O(1/\sqrt{n\epsilon}), there exists some C3>0C_{3}>0 such that

supP𝒫P(g^(X,Zn)Y)P(g(X)Y)\displaystyle\sup\limits_{P\in\mathcal{P}}P(\widehat{g}(X,Z_{n})\neq Y)-P(g(X)\neq Y) (36)
=\displaystyle= supP𝒫𝔼|2η(X)1|P(g^(X,Zn)g(X))\displaystyle\sup\limits_{P\in\mathcal{P}}\mathbb{E}|2\eta(X)-1|P(\widehat{g}(X,Z_{n})\neq g(X)) (37)
=\displaystyle= supP𝒫j=1m𝔼|2η(X)1|P(g^(X,Zn)g(X))1{X𝒳j}\displaystyle\sup\limits_{P\in\mathcal{P}}\sum_{j=1}^{m}\mathbb{E}|2\eta(X)-1|P(\widehat{g}(X,Z_{n})\neq g(X))1_{\{X\in\mathcal{X}_{j}\}} (38)
\displaystyle\geq C3mbϵ.\displaystyle C_{3}mb^{\prime}\epsilon. (39)

The regret is lower bounded by C_{1}n^{-\alpha(\beta+1)/(2\alpha+d)} when taking q=O(n^{1/(2\alpha+d)}). Note that \widehat{g}(x,Z_{n}) can be any classifier, including "randomized" estimators applied when x is perturbed or attacked.

Case 2: when \omega>C_{0}n^{-\frac{1}{2\alpha+d}}, we construct a distribution of (x,y) such that, after noise injection, there are sets of \widetilde{x} on which P(g(x)=1|\widetilde{x}) and P(g(x)=0|\widetilde{x}) are comparable; thus, whichever label the estimator outputs, it makes a false decision at such \widetilde{x} with constant probability.

The construction is similar to that of Case 1, and we take q=\lfloor 2/\omega\rfloor. For the function u, here we let it increase from 0 and become 1 on [1/4,\infty). For each pair (\mathcal{X}_{j0},\mathcal{X}_{j1}), take \eta(x)=(1+\phi(x))/2 when x\in\mathcal{X}_{j0} and \eta(x)=(1-\phi(x))/2 when x\in\mathcal{X}_{j1}. The support of x is \mathcal{X}_{0}\cup(\bigcup_{i=1}^{m}B(z_{i},3\omega/4)). Take m=O(\omega^{\alpha\beta-d}) and \epsilon=O(\omega^{d}); then both the \alpha-smoothness condition and the \beta-margin condition are satisfied.

After injecting random noise on xx, consider ξj\xi_{j} as the boundary between 𝒳j0\mathcal{X}_{j0} and 𝒳j1\mathcal{X}_{j1}, then when x~\tilde{x} is from {z|dist(z,ξj)<ω/4,z𝒳j0𝒳j1}\{z\;|\;dist(z,\xi_{j})<\omega/4,\;z\in\mathcal{X}_{j0}\cup\mathcal{X}_{j1}\}, P(g(x)=1|x~)P(g(x)=1|\tilde{x}) and P(g(x)=0|x~)P(g(x)=0|\tilde{x}) are in [C4,1C4][C_{4},1-C_{4}] for some constant C4>0C_{4}>0. Thus the probability of any estimator to make a false decision at this x~\tilde{x} is larger than C4C_{4}. In addition, the probability measure of j=1m{z|dist(z,ξj)<ω/4,z𝒳j0𝒳j1}\cup_{j=1}^{m}\{z\;|\;dist(z,\xi_{j})<\omega/4,\;z\in\mathcal{X}_{j0}\cup\mathcal{X}_{j1}\} is greater than C5ωαβC_{5}\omega^{\alpha\beta} for some constant C5>0C_{5}>0. Thus the regret is greater than C5ωαβCϕωαC4=C6ωα(β+1)C_{5}\omega^{\alpha\beta}C_{\phi}\omega^{\alpha}C_{4}=C_{6}\omega^{\alpha(\beta+1)}.