
Austen Z. Fan, Department of Computer Science, University of Wisconsin-Madison, United States. [email protected], https://orcid.org/0000-0001-7714-2195
Paraschos Koutris, Department of Computer Science, University of Wisconsin-Madison, United States. http://pages.cs.wisc.edu/~paris/, [email protected], https://orcid.org/0000-0001-6309-1702
Copyright: Austen Z. Fan and Paraschos Koutris.
ACM CCS: Theory of computation → Database theory; Theory of computation → Incomplete, inconsistent, and uncertain databases.
Funding: This research was supported in part by National Science Foundation grants CRII-1850348 and III-1910014, as well as a gift by Google.
Event: 25th International Conference on Database Theory (ICDT 2022), March 29–April 1, 2022, Edinburgh, UK. Editors: Dan Olteanu and Nils Vortmeier. Series Volume 220, Article No. 18.

Certifiable Robustness for Nearest Neighbor Classifiers

Austen Z. Fan    Paraschos Koutris
Abstract

ML models are typically trained using large datasets of high quality. However, training datasets often contain inconsistent or incomplete data. To tackle this issue, one solution is to develop algorithms that can check whether a prediction of a model is certifiably robust. Given a learning algorithm that produces a classifier and given an example at test time, a classification outcome is certifiably robust if it is predicted by every model trained across all possible worlds (repairs) of the uncertain (inconsistent) dataset. This notion of robustness falls naturally under the framework of certain answers. In this paper, we study the complexity of certifying robustness for a simple but widely deployed classification algorithm, $k$-Nearest Neighbors ($k$-NN). Our main focus is on inconsistent datasets when the integrity constraints are functional dependencies (FDs). For this setting, we establish a dichotomy in the complexity of certifying robustness w.r.t. the set of FDs: the problem either admits a polynomial time algorithm, or it is coNP-hard. Additionally, we exhibit a similar dichotomy for the counting version of the problem, where the goal is to count the number of possible worlds that predict a certain label. As a byproduct of our study, we also establish the complexity of a problem related to finding an optimal subset repair that may be of independent interest.

keywords:
Inconsistent databases, $k$-NN classification, certifiable robustness

1 Introduction

Machine Learning (ML) has been widely adopted as a central tool in business analytics, medical decisions, autonomous driving, and many other domains. In supervised learning settings, ML models are typically trained using large datasets of high quality. However, real-world training datasets often contain incorrect or incomplete data. For example, attribute values may be missing from the dataset, attribute values may be wrong, or the dataset can violate integrity constraints. Several approaches to tackle this problem have been proposed in the literature, including data cleaning [17, 23] and robust learning methods [4, 26, 5].

In this work, we study an alternative but complementary approach using the framework of certain answers, which has been a focus of the database community in the last two decades [1, 2]. Under this framework, an inconsistent or incomplete training dataset is modeled as a set of possible worlds called repairs. We can think of a repair as a way to clean the dataset such that it is consistent (w.r.t. a set of integrity constraints) and complete. In the context of query answering, certain answers are the output tuples that exist in the query result of every possible world. In other words, we will obtain a certain answer in the output no matter how we choose to repair the dataset.

The notion of certain answers in the context of ML is known as certifiable (or certified) robustness [3, 26]. Given a learning algorithm that produces a classifier and a tuple at test time, we say that a predicted label is certifiably robust if it is predicted by every model trained across all possible worlds. In other words, certifiably robust predictions come with a formal guarantee that the label is robust to how the training dataset is fixed. Such a guarantee can be beneficial to decide whether we should trust the predictions of the ML model or whether we should spend resources to clean the dataset before feeding it to the learning algorithm.

As a first step towards certifying robustness for more complex ML algorithms, we focus in this work on $k$-Nearest Neighbor classification ($k$-NN). In this problem, we start with a $d$-dimensional dataset equipped with a distance function. Given a test point $x$, the classifier first finds the $k$ points closest to $x$ w.r.t. the given distance function, and assigns to $x$ the label that is a plurality among these. Certified robustness for $k$-NNs was recently explored by Karlas et al. [13] under the uncertain model where tuples (training points) with the same label are grouped into blocks, and a possible world is constructed by picking independently exactly one tuple from each block. This setting is equivalent to considering the subset repairs of an inconsistent dataset where the only integrity constraint is a primary key, with the additional restriction that training points with the same key must have the same label.

        A   B   C   label
$t_1$   1   0   a   0
$t_2$   1   2   b   0
$t_3$   2   0   a   2
$t_4$   2   5   c   1
$t_5$   3   1   a   0
$t_6$   4   2   d   2

In this paper, we generalize the study of certifying robustness for $k$-NNs to other inconsistent and uncertain models. Specifically, we consider subset repairs of a dataset where the integrity constraints can be any set of functional dependencies (FDs) and not only a primary key as in [13].

Example 1.1.

Consider the inconsistent dataset in the table above, where the integrity constraint is the FD $A\rightarrow B$. Let the distance function between two tuples $s,t$ be $f(s,t)=|s[A]-t[A]|+|s[B]-t[B]|$. Suppose that we want to label the test point $x=(0,0,d)$ using a 3-NN classifier. This induces the following ordering of the tuples w.r.t. their distance from $x$ (for convenience, we include the labels as well):

$t_{1}:\mathbf{0} \;<\; t_{3}:\mathbf{2} \;<\; t_{2}:\mathbf{0} \;<\; t_{5}:\mathbf{0} \;<\; t_{6}:\mathbf{2} \;<\; t_{4}:\mathbf{1}$

A repair for this inconsistent dataset has to choose one tuple from $\{t_{1},t_{2}\}$ and one tuple from $\{t_{3},t_{4}\}$. There are 4 possible repairs, which form the following 3-neighborhoods around $x$:

$\{t_{1}:\mathbf{0},\,t_{5}:\mathbf{0},\,t_{6}:\mathbf{2}\},\quad\{t_{2}:\mathbf{0},\,t_{5}:\mathbf{0},\,t_{6}:\mathbf{2}\}$
$\{t_{1}:\mathbf{0},\,t_{3}:\mathbf{2},\,t_{5}:\mathbf{0}\},\quad\{t_{3}:\mathbf{2},\,t_{2}:\mathbf{0},\,t_{5}:\mathbf{0}\}$

In all four repairs, label 0 occurs twice, and hence it is always the majority label. We can therefore certify that 0 is a robust label for the test point $x$.
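For intuition, the following short Python sketch enumerates the four repairs of this example by brute force and recomputes the 3-NN prediction in each. The tuple values, labels, and distance function come from the example above; the brute-force enumeration is exponential in general and serves only as an illustration.

from collections import Counter
from itertools import product

# (A, B, C) values and labels; the FD is A -> B.
tuples = {
    "t1": ((1, 0, "a"), 0), "t2": ((1, 2, "b"), 0),
    "t3": ((2, 0, "a"), 2), "t4": ((2, 5, "c"), 1),
    "t5": ((3, 1, "a"), 0), "t6": ((4, 2, "d"), 2),
}
x = (0, 0, "d")
dist = lambda t: abs(t[0] - x[0]) + abs(t[1] - x[1])  # f from the example

# The conflicting blocks under A -> B: {t1,t2} (A=1) and {t3,t4} (A=2).
labels = set()
for choice in product(["t1", "t2"], ["t3", "t4"]):
    repair = list(choice) + ["t5", "t6"]
    repair.sort(key=lambda name: dist(tuples[name][0]))  # order by distance
    top3 = [tuples[name][1] for name in repair[:3]]      # the 3-neighborhood
    labels.add(Counter(top3).most_common(1)[0][0])       # plurality label

print(labels)  # {0}: every repair predicts 0, so x is certifiably robust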

We show that for general sets of FDs the complexity landscape for certified robustness admits a dichotomy: it is computationally intractable for some sets of FDs, and is in polynomial time for the other sets of FDs. We also investigate certifying robustness for other widely used uncertain models, including Codd-tables [11], or-sets and ?-sets [25]. We establish that in these settings the problem can always be solved in polynomial time.

Our work shows that the logical structure of the errors (or missing values) in a training set can be exploited to construct fast algorithms for certifiable robustness. Tools developed in the database theory community can facilitate the design of these algorithms. We also demonstrate that, even for the relatively simple $k$-NN classifier, the complexity landscape exhibits a complex behavior that is related to other problems in consistent query answering.

Our Contribution. We now present in more detail the contributions of this work:

  • We establish a complexity dichotomy for certifying robustness for $k$-NNs under subset repairs (Section 4) into P and coNP-complete. The dichotomy depends on the structure of the FDs. More precisely, the syntactic condition for the dichotomy is based on the notion of lhs chains. In fact, it is the same as the one used for the complexity classification of the problem of counting the number of subset repairs under a set of FDs [21]. In the case where the only FD constraint is a primary key, we show that we can design an even faster algorithm that runs in linear time in the size of the data, improving upon the running time of the algorithm in [13] (Section 5).

  • In addition to certified robustness, we study the related problem of counting the number of repairs that predict a given label for a test point (Section 8). We establish a dichotomy into the complexity classes FP and #P-complete with the same syntactic condition as the decision problem. The polynomial time algorithm here depends exponentially on the number of classification labels, in contrast to our algorithm for certifiable robustness which has a linear dependence on the number of labels.

  • We show that certifying robustness for $k$-NNs is tightly connected to the problem of finding the subset repair with the smallest total weight, when each tuple is assigned a positive weight (Section 7). As a consequence, we obtain a dichotomy result for that problem as well. Note that this problem is a symmetric variant of the problem in [20], which asks instead for the repair with the maximum total weight.

  • Finally, we investigate the complexity of certifiable robustness for $k$-NNs for three widely used uncertain models: Codd-tables, ?-sets and or-sets (Section 9). We show that for all the above models, certifying robustness for $k$-NN classification admits a polynomial time algorithm.

2 Related Work

Consistent Query Answering. There has been a long line of research on Consistent Query Answering (CQA) in the database community. Data consistency may be violated, for example, during data integration or in a data warehouse. It is then natural to ask: given a query and such inconsistent data, can we answer the query with a certain answer? Arenas, Bertossi, and Chomicki [1] first defined the notion of a repair, which is a consistent subinstance that is minimally different from the inconsistent data. A certain answer to the query is an answer that results from every repair. Beginning with the work of Fuxman and Miller [8], more general dichotomy results in CQA have been proven [14, 15, 16]. A dichotomy theorem for a class of queries and integrity constraints says that CQA is either in polynomial time or coNP-complete, usually depending on an efficiently checkable criterion for tractability. Certain answers have also been studied in the context of incomplete models of data [19, 22, 10].

Subset Repairs. Livshits et al. [21] studied the problem of counting the number of subset repairs w.r.t. a given set of FDs, establishing a complexity dichotomy. The syntactic condition for tractability (existence of an lhs chain) is the same as the one we establish for certifiable robustness in $k$-NN classification. Livshits et al. [20] also studied the problem of finding a maximum-weight subset repair w.r.t. a given set of FDs, and showed that its complexity observes a dichotomy. In this paper we study the symmetric problem of finding a minimum-weight subset repair, and show that it also exhibits a complexity dichotomy, where the condition for tractability is again the existence of an lhs chain.

Certifiable Robustness in ML. Robust learning methods are used to learn over inconsistent or incomplete datasets. For example, [5] discusses a sound verification technique that proves whether a prediction is robust to data poisoning in decision-tree models. There is also a line of work on smoothing techniques for ML robustness [3, 12, 24, 18], where random noise, usually Gaussian, is added to boost the robustness of the model against adversarial attacks such as label-flipping. Our approach is different in that we prove a dichotomy for $k$-NN certifiable robustness: either we can efficiently assert that the dataset will always lead to the same prediction, or it is coNP-complete to do so, with an efficiently testable criterion separating the two cases. We also show how to extend our model to capture scenarios including uncertain labels, weighted tuples, and data poisoning.

3 Preliminaries

In this paper, we consider relation schemas of the form $R(A_{1},\dots,A_{d})$ with arity $d$. The attributes $A_{1},\dots,A_{d}$ take values from a (possibly infinite) domain $\mathbb{D}$. Given a tuple $t$ in an instance $I$ over $R$, we will use $t[A_{i}]$ to denote its value at attribute $A_{i}$. It will be convenient to interpret an instance $I$ as a training set of points in the $d$-dimensional space $\mathbb{D}^{d}$. We will use the terms point/tuple interchangeably in the paper.

An uncertain instance $\mathcal{I}$ over a schema $R(A_{1},\dots,A_{d})$ is a set of instances over the schema. We will often refer to each instance in $\mathcal{I}$ as a possible world. We will see later different ways in which we can define uncertain instances implicitly.

For each tuple $t$ that occurs in some possible world in $\mathcal{I}$, we associate a label $L(t)$ which takes values from a finite set $\mathcal{Y}$. We will say that the uncertain instance $\mathcal{I}$ equipped with the labeling function $L$ is a labeled uncertain instance over the schema $R(A_{1},\dots,A_{d})$. We similarly define a labeled instance $I$.

Certifiable Robustness. In this work, we focus on the classification task. Let $\mathcal{L}$ be a learning algorithm that takes as training set a labeled instance $I$ over the schema $R(A_{1},\dots,A_{d})$ and returns a classifier, which is a total function $\mathcal{L}_{I}:\mathbb{D}^{d}\rightarrow\mathcal{Y}$.

Definition 3.1 (Certifiable Robustness).

Let $\mathcal{I}$ be a labeled uncertain instance over $R(A_{1},\dots,A_{d})$ and let $\mathcal{L}$ be a classification learning algorithm with labels in $\mathcal{Y}$. We say that a (test) point $x\in\mathbb{D}^{d}$ is certifiably robust in $\mathcal{I}$ if there exists a label $\ell\in\mathcal{Y}$ such that for every possible world $I\in\mathcal{I}$, $\mathcal{L}_{I}(x)=\ell$. The label $\ell$ is called a certain label for $x$.

In other words, if we call $\ell$ a possible label for $x$ whenever there exists some possible world $I\in\mathcal{I}$ for which $\mathcal{L}_{I}(x)=\ell$, then certifiable robustness simply means that there is only one possible label for $x$.

Nearest Neighbor Classifiers. In $k$-NN classification, we are given a labeled instance $I$ over $R(A_{1},\dots,A_{d})$, along with a distance function $f$ over $\mathbb{D}^{d}$. Given a point $x\in\mathbb{D}^{d}$, the classifier first finds the $k$-neighborhood $\mathcal{N}_{k}(x,I)$, which consists of the $k$ points closest to $x$ w.r.t. the distance function $f$. Then, the classifier assigns to $x$ the label that is a plurality among $\mathcal{N}_{k}(x,I)$. When $k=1$, the classifier returns the label of the point that is closest to $x$ w.r.t. $f$. When $|\mathcal{Y}|=2$, we are performing binary classification. We will also consider the generalization of $k$-NN where each tuple has a positive weight, and the classifier assigns the label with the largest total weight.

For this work, we adopt the following tie-breaking conventions: (i) if two labels occur a maximum number of times in $\mathcal{N}_{k}(x,I)$, we say that $x$ is not certifiably robust for $k$-NN classification, and (ii) if several tuples are at the same distance from the test point, so that $\mathcal{N}_{k}(x,I)$ is not well-defined, we break ties according to a predefined ordering of the tuples in the instance. By a slight abuse of notation, throughout the proofs, when we write $\mathcal{L}_{I}(x)=\ell$ we mean that the number of tuples labeled $\ell$ is strictly larger than that of any other label, for every choice made in picking $\mathcal{N}_{k}(x,I)$.
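As a concrete reference point, here is a minimal Python sketch of this classifier following convention (i) above (returning None when there is no strict plurality); the function name and the list-of-pairs input format are our own, for illustration only.

from collections import Counter

def knn_predict(instance, x, k, dist):
    # instance: list of (point, label) pairs; dist: distance function.
    # Sorting by distance; rule (ii)'s predefined tuple order would be
    # used to break exact distance ties.
    neighborhood = sorted(instance, key=lambda t: dist(t[0], x))[:k]
    counts = Counter(label for _, label in neighborhood).most_common()
    # Rule (i): a tie for the maximum count means no strict plurality.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]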

Functional Dependencies. A functional dependency (FD) is an expression of the form $X\rightarrow Y$, where $X$ and $Y$ are sets of attributes from $R$. An instance $I$ over $R$ satisfies $X\rightarrow Y$ if for every two tuples in $I$, whenever they agree on $X$ they also agree on $Y$. We say that $I$ satisfies a set of functional dependencies $\Sigma$ if it satisfies every functional dependency in $\Sigma$. For an attribute $A$ and a set of FDs $\Sigma$, we define $\Sigma-A$ to be the FD set obtained by removing the attribute $A$ from every FD in $\Sigma$. An FD schema $\mathbf{R}$ is a pair $(R(A_{1},\dots,A_{d}),\Sigma)$, where $\Sigma$ is a set of FDs defined over $R$.

Repairs. Given $\Sigma$, assume that we have an inconsistent instance $D$ that violates the functional dependencies in $\Sigma$. We say that $D^{\prime}$ is a repair of $D$ if it is a maximal subset of $D$ that satisfies $\Sigma$. In other words, we can create a repair by removing an inclusion-minimal set of tuples from $D$ such that all the integrity constraints are satisfied. We will use $I_{\Sigma}(D)$ to denote the set of all possible repairs of $D$ w.r.t. $\Sigma$. If the instance $D$ is consistent, that is, it violates no functional dependency, then $I_{\Sigma}(D)$ is defined to be $\{D\}$.
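To make the definition concrete, the following Python sketch enumerates all subset repairs by brute force (exponential time, so useful only for intuition on tiny instances). It relies on the fact that FD violations are pairwise, so every subset of a consistent instance is itself consistent; tuples are dicts and each FD is a pair (X, Y) of attribute lists, a representation we choose for illustration.

from itertools import combinations

def satisfies(instance, fds):
    # Pairwise check of every FD X -> Y on every pair of tuples.
    for X, Y in fds:
        for s in instance:
            for t in instance:
                if all(s[a] == t[a] for a in X) and any(s[a] != t[a] for a in Y):
                    return False
    return True

def repairs(instance, fds):
    # All maximal consistent subsets: scan subsets from largest to
    # smallest; a consistent subset is maximal iff it is not contained
    # in an already-found (larger) consistent subset.
    found = []
    for size in range(len(instance), -1, -1):
        for idx in combinations(range(len(instance)), size):
            sub = [instance[i] for i in idx]
            if satisfies(sub, fds) and not any(set(idx) <= r for r, _ in found):
                found.append((set(idx), sub))
    return [sub for _, sub in found]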

3.1 Problem Definitions

Although our algorithms work for any distance function such that $f(x,x^{\prime})$ can be computed in time $O(1)$ (assuming the dimension $d$ is fixed), for the hardness results we need a more precise formalization. We consider two variants of the problem. In the first variant, we consider a specific distance function, the $p$-norm, when the domain is $\mathbb{D}=\mathbb{R}$. Recall that for any $p\geq 1$, the $p$-norm is

\|x-x^{\prime}\|_{p}=\left(\sum_{i=1}^{d}|x[A_{i}]-x^{\prime}[A_{i}]|^{p}\right)^{1/p}

For the following definitions, we fix the dimension $d$ and the label set $\mathcal{Y}$. Formally, we can now define the following decision problem, parameterized by an FD schema $\mathbf{R}$ and an integer $k>0$.

Definition 3.2 ($\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$).

Given an inconsistent labeled instance $D$ over an FD schema $\mathbf{R}$ and a test point $x$, is $x$ certifiably robust in $I_{\Sigma}(D)$ for $k$-NN classification w.r.t. the $p$-norm?

Note that $k$ is fixed in the above problem. We can also define the decision problem where $k$ is instead an input parameter, denoted as $\textsc{CR-NN}_{p}\langle\mathbf{R}\rangle$.

In the second variant of the problem, instead of fixing a distance function, we directly provide as input a strict ordering of the points in the dataset $D$ w.r.t. their distance from the test point $x$. From an algorithmic point of view this makes little difference, since we can compute the ordering in time $O(|D|\log|D|)$ for any distance function computable in time $O(1)$.

Definition 3.3 ($\textsc{CR-NN}_{<}\langle\mathbf{R},k\rangle$).

Given an inconsistent labeled instance $D$ over an FD schema $\mathbf{R}$ and a strict ordering of the points in $D$ w.r.t. their distance from a test point $x$, is $x$ certifiably robust in $I_{\Sigma}(D)$ for $k$-NN classification?

Similarly, we define the problem $\textsc{CR-NN}_{<}\langle\mathbf{R}\rangle$ with the parameter $k$ as part of the input. Note that there is a straightforward many-one polynomial time reduction from $\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$ to $\textsc{CR-NN}_{<}\langle\mathbf{R},k\rangle$.

Finally, we define the counting variant of the problem. Given an inconsistent instance $D$ and a label $\ell\in\mathcal{Y}$, we want to count how many repairs of $D$ predict label $\ell$.

Definition 3.4 ($\textsc{\#CR-NN}_{<}\langle\mathbf{R}\rangle$).

Given an inconsistent labeled instance $D$ over an FD schema $\mathbf{R}$, a strict ordering of the points in $D$ w.r.t. their distance from a test point $x$, and a label $\ell\in\mathcal{Y}$, output the number of repairs in $I_{\Sigma}(D)$ for which the $k$-NN classifier assigns label $\ell$ to $x$.

Similarly, one can define the counting problem $\textsc{\#CR-NN}_{p}\langle\mathbf{R}\rangle$.

4 Main Results

In this section, we present and discuss our key results. The main dichotomy theorem relies on the notion of lhs chains for an FD schema, as defined in [21].

Definition 4.1 (lhs Chain).

A set of FDs $\Sigma$ has a left-hand-side chain (lhs chain for short) if for every two FDs $X_{1}\rightarrow Y_{1}$ and $X_{2}\rightarrow Y_{2}$ in $\Sigma$, either $X_{1}\subseteq X_{2}$ or $X_{2}\subseteq X_{1}$.

One can determine efficiently whether an FD schema is equivalent to one with an lhs chain or not [21].

Example 4.2.

Consider the relational schema $R(A,B,C,D)$. The FD set $\{A\rightarrow C,B\rightarrow C\}$ does not have an lhs chain, since neither of the two left-hand sides is contained in the other. The FD set $\{AB\rightarrow C,B\rightarrow D\}$, on the other hand, has an lhs chain.
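The syntactic condition of Definition 4.1 is easy to test directly; a small Python sketch (our own function name) follows. Note that Theorem 4.3 below asks whether the schema is equivalent to one with an lhs chain, which takes more work to decide [21]; this check only inspects the FD set as written.

def has_lhs_chain(fds):
    # fds: list of (X, Y) pairs of attribute sets.
    lhss = [frozenset(X) for X, _ in fds]
    return all(a <= b or b <= a for a in lhss for b in lhss)

print(has_lhs_chain([({"A"}, {"C"}), ({"B"}, {"C"})]))       # False
print(has_lhs_chain([({"A", "B"}, {"C"}), ({"B"}, {"D"})]))  # True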

We are now ready to state our main theorem.

Theorem 4.3 (Main Theorem).

Let $\mathbf{R}$ be an FD schema. Then, the following hold:

  • If $\mathbf{R}$ is equivalent to an FD schema with an lhs chain, then $\textsc{CR-NN}_{<}\langle\mathbf{R}\rangle$ (and thus $\textsc{CR-NN}_{p}\langle\mathbf{R}\rangle$) is in P.

  • Otherwise, for any integer $k\geq 1$, $\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$ (and thus $\textsc{CR-NN}_{<}\langle\mathbf{R},k\rangle$) is coNP-complete.

Moreover, it can be decided in polynomial time which of the two cases holds.

We show the polynomial time algorithm in Section 5 and the hardness proof in Section 6. Three remarks are in order. First, the polynomial time algorithm works for any distance function, as long as we can compute the distance between any two points in time $O(1)$: the distance function is only needed to order the tuples in the dataset according to their distance from the test point. Second, we show the intractability result for the $p$-norm distance function, which is widely used in practice; the problem likely remains hard for other popular distance functions as well. Third, as we will see in the next section, the tractable cases are polynomial in $k$, the size of the neighborhood. This is important, since $k$ is often set to be a function of the training set size, for example $\sqrt{n}$. For the intractable cases, the problem is already hard even for $k=1$.

Uncertain Labels. The above dichotomy theorem holds even when we allow inconsistent labels. We can model inconsistent labels by modifying the labelling function $L(t)$ to take values from $\mathcal{P}(\mathcal{Y})$, the power set of the finite label set $\mathcal{Y}$. The set of possible worlds is then defined to be the set of possible worlds of the inconsistent instance $D$ paired with a labelling function $L^{\prime}$ such that $L^{\prime}(t)\in L(t)$ for all $t\in D$. The definition of certifiable robustness carries over to this set of possible worlds.

We can simulate uncertain labels by adding an extra attribute label to the schema together with the FD $A_{1},\dots,A_{d}\rightarrow\textsf{label}$. It is easy to see that the new schema is equivalent to one with an lhs chain if and only if the original one is. Thus, we conclude that uncertain labels do not change the complexity with respect to certifiable robustness.

Counting. For the counting variant of certifying robustness for $k$-NNs, we show an analogous dichotomy result.

Theorem 4.4.

Let $\mathbf{R}$ be an FD schema. Then, the following hold:

  • If $\mathbf{R}$ is equivalent to an FD schema with an lhs chain, then $\textsc{\#CR-NN}_{<}\langle\mathbf{R}\rangle$ is in FP.

  • Otherwise, even $\textsc{\#CR-NN}_{<}\langle\mathbf{R},1\rangle$ is #P-complete.

Moreover, it can be decided in P which of the two cases holds.

5 Tractable Cases

In this section, we prove that if the FD schema $\mathbf{R}$ has an lhs chain, then there is an algorithm for $\textsc{CR-NN}_{<}\langle\mathbf{R}\rangle$ that runs in polynomial time in the size of the inconsistent dataset, the parameter $k$, and the number of possible labels. Then, we show that when $\mathbf{R}$ consists of a single primary key, we can construct an even faster algorithm that runs in linear time w.r.t. the number of tuples and the number of labels.

For this section, let $D$ be an inconsistent labeled instance and $x$ be the test point. Assume w.l.o.g. that $\mathcal{Y}=\{1,2,\dots,m\}$ and let $n$ be the number of tuples in $D$. We assume that the points in $D$ are already in strict order w.r.t. their distance from $x$: $t_{1}<t_{2}<\dots<t_{n}$.

5.1 An Algorithm for Certifiable Robustness

Note that in the following analysis we fix an FD schema $\mathbf{R}$ with an lhs chain. The algorithm first constructs a repair $I$ by greedily choosing points in the given ordering, as long as they do not conflict with the points chosen so far. This step can be implemented in time $O(n)$ by, say, building one hash map per FD, which maps the value of the lhs attribute(s) of each chosen tuple to the value of its rhs attribute(s). Suppose w.l.o.g. that $\mathcal{L}_{I}(x)=1$.
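A minimal Python sketch of this greedy step, assuming tuples are dicts keyed by attribute name and already sorted by distance from $x$ (the function name and input format are ours):

def greedy_repair(sorted_tuples, fds):
    maps = [dict() for _ in fds]          # one hash map per FD
    repair = []
    for t in sorted_tuples:
        ok = True
        for (X, Y), m in zip(fds, maps):
            lhs = tuple(t[a] for a in X)
            rhs = tuple(t[a] for a in Y)
            if m.get(lhs, rhs) != rhs:    # conflicts with a kept tuple
                ok = False
                break
        if ok:
            repair.append(t)
            for (X, Y), m in zip(fds, maps):
                m.setdefault(tuple(t[a] for a in X), tuple(t[a] for a in Y))
    return repair

Since FD violations are pairwise, every rejected tuple conflicts with some kept tuple, so the output is indeed a maximal consistent subset, i.e., a repair.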

As a second step, for every label $\ell\in\{2,\dots,m\}$, we attempt to construct a repair $I^{\prime}$ of $D$ such that, in the $k$-neighborhood of $x$, the number of $\ell$-labeled points is at least the number of 1-labeled points. Such a repair is a witness that some other label is possible for $x$, hence $x$ is not certifiably robust.

It will be helpful to define the following terms for a subinstance $I\subseteq D$, $\tau\in\{1,\dots,n\}$, and a label $\ell\in\mathcal{Y}$:

\mathcal{N}^{\leq}_{\tau}(I)=\{t_{j}\in I\mid j\leq\tau\}
C^{\leq}_{\tau}(\ell,I)=|\{t\in\mathcal{N}^{\leq}_{\tau}(I)\mid L(t)=\ell\}|

We are now ready to present the core component of our algorithm. This component will be executed for every label $\ell>1$ and every $\tau\in\{1,\dots,n\}$; thus, it will run $O(|\mathcal{Y}|\cdot n)$ times. Define the following quantity for a subinstance $J\subseteq D$, an FD set $\Delta$, and $i$ with $0\leq i\leq k$:

M_{i}[J,\Delta]=\max\{C^{\leq}_{\tau}(\ell,K)-C^{\leq}_{\tau}(1,K)\mid K\in I_{\Delta}(J)\text{ s.t. }|\mathcal{N}^{\leq}_{\tau}(K)|=i\}.

Here, with a slight abuse of notation, $M_{i}[J,\Delta]$ depends on $\tau$ even though $\tau$ does not appear explicitly in the notation. If there is no repair $K$ of $J$ such that $|\mathcal{N}^{\leq}_{\tau}(K)|=i$, we define $M_{i}[J,\Delta]=-\infty$. The key observation is that if $M_{k}[D,\Sigma]\geq 0$, then $\ell$ is a possible label for $x$ and hence robustness is falsified. The algorithm computes this quantity using a combination of dynamic programming and recursion on the structure of the FD set.

The Recursive Algorithm. Given $J\subseteq D$ and a set of FDs $\Delta$, we want to compute $M_{i}[J,\Delta]$ for every $i=0,\dots,k$. First, we remove from $\Delta$ any trivial FDs. Then we distinguish three cases:

Base Case:

the set of FDs is empty. In this case, $I_{\Delta}(J)=\{J\}$. For every $i\neq|\mathcal{N}^{\leq}_{\tau}(J)|$, $M_{i}[J,\Delta]=-\infty$. For $i=|\mathcal{N}^{\leq}_{\tau}(J)|$ (if $|\mathcal{N}^{\leq}_{\tau}(J)|\leq k$), we have $M_{i}[J,\Delta]=C^{\leq}_{\tau}(\ell,J)-C^{\leq}_{\tau}(1,J)$, which can be computed in a straightforward way.

Consensus FD:

there exists an FD $\emptyset\rightarrow A$. In this case, we recursively call the algorithm to compute $M_{i}[\sigma_{A=a}(J),\Delta-A]$ for every $a\in\pi_{A}(J)$. Then, for every $i=0,\dots,k$:

M_{i}[J,\Delta]=\max_{a\in\pi_{A}(J)}M_{i}[\sigma_{A=a}(J),\Delta-A]
Common Attribute:

there exists a common attribute $A$ in the lhs of all FDs. In this case, we recursively call the algorithm to compute $M_{i}[\sigma_{A=a}(J),\Delta-A]$ for every $a\in\pi_{A}(J)$. Let $S=\pi_{A}(J)=\{a_{1},\dots,a_{|S|}\}$. Then, for every $i=0,\dots,k$:

M_{i}[J,\Delta]=\max_{\sum_{a\in S}i_{a}=i}\;\sum_{a\in S}M_{i_{a}}[\sigma_{A=a}(J),\Delta-A]

We next discuss how to perform the above computation using dynamic programming. We can view the values $M_{i_{a}}[\sigma_{A=a}(J),\Delta-A]$ as a matrix whose rows are indexed by the values $a$ of attribute $A$ and whose columns are indexed by $i_{a}$, for $0\leq i_{a}\leq k$. The task is to pick one entry from each row so that the sum of the picked entries is maximized while their column indices sum to $i$. The dynamic program fills an $|S|\times(k+1)$ matrix $\mathcal{M}$ whose $(i,j)$-entry is the maximum of $\sum_{u=1}^{i}M_{i_{u}}[\sigma_{A=a_{u}}(J),\Delta-A]$ subject to $\sum_{u=1}^{i}i_{u}=j$, and finally returns the entries $\mathcal{M}[|S|,i]$.

for $j=0,\dots,k$ do
    $\mathcal{M}[1,j]\leftarrow M_{j}[\sigma_{A=a_{1}}(J),\Delta-A]$
for $i=2,\dots,|S|$ do
    for $j=0,\dots,k$ do
        $\mathcal{M}[i,j]\leftarrow\max_{u}\{\mathcal{M}[i-1,u]+M_{j-u}[\sigma_{A=a_{i}}(J),\Delta-A]\}$

Algorithm 1: Dynamic Programming
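A Python sketch of this dynamic program may help. It assumes the per-value vectors have already been computed recursively, and represents $-\infty$ by float('-inf'); the names are ours.

NEG = float("-inf")

def combine(rows, k):
    # rows: one (k+1)-vector per value a in S, i.e. rows[r][i] holds
    # M_i[sigma_{A=a_r}(J), Delta - A]; returns M_i[J, Delta] for i = 0..k.
    best = list(rows[0])                   # the row M[1, j], j = 0..k
    for row in rows[1:]:
        new = [NEG] * (k + 1)
        for j in range(k + 1):
            # take u tuples below the threshold from earlier rows and
            # j - u from the current row
            new[j] = max(best[u] + row[j - u] for u in range(j + 1))
        best = new
    return best

Combining all $|S|$ rows this way costs $O(|S|\cdot k^{2})$, matching the analysis in Lemma 5.1 below.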
Lemma 5.1.

The Recursive Algorithm outputs the correct result and runs in polynomial time with respect to the size of $D$, the parameter $k$, and the number of labels $|\mathcal{Y}|$.

Proof 5.2.

To prove the correctness of the algorithm, we use the observation that if the functional dependencies form an lhs chain, they can be arranged in an order $X_{1}\rightarrow Y_{1},\dots,X_{n}\rightarrow Y_{n}$ such that $X_{i}\subseteq X_{j}$ for all $i<j$. Note that the three cases are mutually exclusive, meaning that at every execution step the algorithm enters exactly one of them. Note also that the algorithm eventually terminates, since the second and third cases each eliminate at least one lhs or rhs attribute of an FD and the FD schema $\mathbf{R}$ is fixed.

We now argue that the algorithm returns the correct result for $\textsc{CR-NN}_{<}\langle\mathbf{R}\rangle$. The crucial observation is that $M_{i}[\sigma_{A=a}(J),\Delta-A]$ is the maximum difference between the number of $\ell$-labeled and 1-labeled tuples over all repairs of $J$ in which $A=a$ and exactly $i$ tuples are among the $\tau$ closest to the test point. Thus, the recurrence correctly computes this maximum difference among all repairs of $J\subseteq D$ by combining, with suitable values $i_{a}$, sub-repairs $K\subseteq J$ that fix the value of attribute $A$. The fact that a repair can be "reassembled" from sub-repairs partitioned according to the values of an attribute hinges on the assumption that the FD schema $\mathbf{R}$ has an lhs chain. We conclude that the algorithm correctly returns $M_{k}[D,\Sigma]$, whose non-negativity is equivalent to non-robustness.

Regarding the running time, observe that for every label $\ell$ and every threshold $\tau=1,\dots,n$, the number of recursive calls is bounded by a constant, since the FD schema $\mathbf{R}$ is fixed. Furthermore, the dynamic programming in the third case runs in $O(n\cdot k^{2})$ time, since the matrix $\mathcal{M}$ has size $O(n\cdot k)$ and each entry takes $O(k)$ time to compute. Thus, for each $\ell$ and $\tau$, the algorithm runs in polynomial time with respect to $n$ and the parameter $k$. Since there are $O(n)$ thresholds, we conclude that the algorithm runs in polynomial time with respect to the size of $D$, the parameter $k$, and the number of labels $|\mathcal{Y}|$.

Weighted Majority. The algorithm can also handle the case where the predicted label is computed by weighted majority, with each tuple $t$ assigned a weight $w_{t}$. The only modification needed is in the definition of $C^{\leq}_{\tau}(\ell,I)$, which now becomes $\sum_{t\in\mathcal{N}^{\leq}_{\tau}(I):L(t)=\ell}w_{t}$.

5.2 A Faster Algorithm for Single Primary Key

The algorithm of the previous section, though polynomial, is not very efficient. In this section, we give a faster algorithm for certifiable robustness when the FD schema is equivalent to one with a single primary key. Recall that in this case a repair picks exactly one tuple from each set of tuples that share the same key.

As in the previous section, we first run $k$-NN on an arbitrarily chosen repair to obtain a possible label for $x$ (this needs only linear time). Without loss of generality, assume that the predicted label for $x$ is 1. For every target label $\ell\in\{2,\dots,m\}$, we attempt to construct a repair such that $\ell$ occurs at least as many times as 1 in the $k$-neighborhood of $x$. If such a repair exists, then robustness is falsified.

[Figure 1: Running example for the single primary key algorithm; tuples with the same color/shape belong in the same block. (a) Ordering of the tuples in instance $D$. (b) Result of $\textsc{Prune}(D,2,1)$. (c) Result of $\textsc{Prune}(D,3,1)$.]

For a tuple $t$, we denote by $\mathsf{key}(t)$ its key. The block of a tuple $t$ is the set of tuples with the same key. Further, define $C(\ell,I)=|\{t\in\mathcal{N}_{k}(x,I)\mid L(t)=\ell\}|$.

Suppose now that we want to find a repair $I$ such that $C(\ell_{2},I)\geq C(\ell_{1},I)$. Define $\textsc{Prune}(D,\ell_{2},\ell_{1})$ to be the dataset obtained from $D$ by removing every tuple $t\in D$ for which there exists another tuple $t^{\prime}\in D$ in the same block satisfying one of the two conditions below (a code sketch of the pruning follows them):

  1. $t<t^{\prime}$ and $L(t)=\ell_{1}$; or

  2. $t>t^{\prime}$ and $L(t^{\prime})=\ell_{2}$.
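A linear-time Python sketch of $\textsc{Prune}$, assuming $D$ is a list of (key, label, rank) records sorted by distance from $x$, where rank is the position in that order (the record format is ours):

def prune(D, l2, l1):
    last_rank = {}   # key -> rank of the furthest tuple in the block
    best_l2 = {}     # key -> smallest rank of an l2-labeled tuple
    for key, label, rank in D:
        last_rank[key] = max(last_rank.get(key, rank), rank)
        if label == l2:
            best_l2[key] = min(best_l2.get(key, rank), rank)
    kept = []
    for key, label, rank in D:
        # Condition 1: an l1-labeled t with a later t' in its block.
        if label == l1 and rank < last_rank[key]:
            continue
        # Condition 2: a tuple t preceded by an l2-labeled t' in its block.
        if rank > best_l2.get(key, float("inf")):
            continue
        kept.append((key, label, rank))
    return kept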

Lemma 5.3.

Let $\ell_{1},\ell_{2}\in\mathcal{Y}$ and $D^{\star}=\textsc{Prune}(D,\ell_{2},\ell_{1})$. Then, there exists a repair $I$ of $D$ s.t. $C(\ell_{2},I)\geq C(\ell_{1},I)$ if and only if there exists a repair $I^{\prime}$ of $D^{\star}$ with $C(\ell_{2},I^{\prime})\geq C(\ell_{1},I^{\prime})$.

Proof 5.4.

For the proof of Lemma 5.3, we need the following helper Sublemma.

Sublemma 1.

Let $t,t^{\prime}\in D$ such that $t<t^{\prime}$ and $\mathsf{key}(t)=\mathsf{key}(t^{\prime})$. Consider any repair $I$ of $D$ with $t\in I$, and let $I^{\prime}=(I\setminus\{t\})\cup\{t^{\prime}\}$ be another repair of $D$. Then for any label $\ell$:

C(\ell,I^{\prime})-C(L(t),I^{\prime})\geq C(\ell,I)-C(L(t),I).
Proof 5.5 (Proof of Sublemma 1).

We distinguish two cases. Suppose $t\notin\mathcal{N}_{k}(x,I)$. Since $t^{\prime}>t$, the $k$-neighborhood of $x$ remains the same, and thus the inequality holds with equality. Suppose now that $t\in\mathcal{N}_{k}(x,I)$. In this case, $t$ is replaced by some other tuple $t^{\prime\prime}$ in the $k$-neighborhood. If $L(t)=L(t^{\prime\prime})$, the inequality again holds with equality. Otherwise, $C(L(t),I^{\prime})=C(L(t),I)-1$. Since $|C(\ell,I)-C(\ell,I^{\prime})|\leq 1$, the inequality follows.

Since $D^{\star}\subseteq D$, one direction is straightforward. For the other direction, suppose that there exists a repair $I$ of $D$ s.t. $C(\ell_{2},I)\geq C(\ell_{1},I)$. We will transform $I$ into a repair $I^{\prime}$ of $D^{\star}$ with the same property by constructing a sequence of datasets $(D=)\,D_{0}\supset D_{1}\supset\cdots$ such that for every $i$ there exists a repair $I_{i}$ of $D_{i}$ with $C(\ell_{2},I_{i})\geq C(\ell_{1},I_{i})$. The invariant holds for $i=0$. For some $i>0$, if $I_{i}\subseteq D^{\star}$, then $I_{i}$ is a repair of $D^{\star}$ with the desired property. Otherwise, let $t\in I_{i}\setminus D^{\star}$. Then, there are two cases:

  • $L(t)=\ell_{1}$ and there exists a tuple $t^{\prime}\in D_{i}$ in the same block as $t$ with $t<t^{\prime}$. Let $D_{i+1}=D_{i}\setminus\{t\}$ with repair $I_{i+1}=(I_{i}\setminus\{t\})\cup\{t^{\prime}\}$. Then, applying Sublemma 1 with $\ell=\ell_{2}$, $I=I_{i}$ and $I^{\prime}=I_{i+1}$, we obtain $C(\ell_{2},I_{i+1})-C(\ell_{1},I_{i+1})\geq C(\ell_{2},I_{i})-C(\ell_{1},I_{i})\geq 0$, where the last inequality comes from the invariant.

  • There exists a tuple $t^{\prime}\in D_{i}$ in the same block as $t$ with $t^{\prime}<t$ and $L(t^{\prime})=\ell_{2}$. Let $D_{i+1}=D_{i}\setminus\{t\}$ with repair $I_{i+1}=(I_{i}\setminus\{t\})\cup\{t^{\prime}\}$. Then, applying Sublemma 1 with $\ell=\ell_{1}$, $I=I_{i+1}$ and $I^{\prime}=I_{i}$, we obtain $C(\ell_{2},I_{i+1})-C(\ell_{1},I_{i+1})\geq C(\ell_{2},I_{i})-C(\ell_{1},I_{i})\geq 0$, where the last inequality comes from the invariant.

Finally, note that the sequence must terminate, since the datasets $D_{i}$ strictly decrease in size.

Lemma 5.3 tells us that it suffices to consider $D^{\star}$ instead of $D$. $D^{\star}$ has the following nice properties:

  • every block has at most one tuple with a label from $\{\ell_{1},\ell_{2}\}$;

  • any tuple with label in $\{\ell_{1},\ell_{2}\}$ is always the last tuple in its block (i.e., the one furthest away from $x$).

The pruning procedure can be implemented in linear time $O(n)$. Algorithm 2 (FastScan) then attempts to find the desired repair. It runs in linear time with respect to the size of $D$ and the number of labels $|\mathcal{Y}|$, and its time complexity does not depend on $k$.

Input: instance $D$, test point $x$, labels $\ell_{1},\ell_{2}$
Output: whether there exists a repair $I$ s.t. $C(\ell_{2},I)\geq C(\ell_{1},I)$

$D\leftarrow\textsc{Prune}(D,\ell_{2},\ell_{1})$
$n_{1},n_{2}\leftarrow 0$
$\textsf{B},\textsf{B}^{\square}\leftarrow\{\}$
for $i\leftarrow 1$ to $|D|$ do
    $\textsf{B}\leftarrow\textsf{B}\cup\{\mathsf{key}(t_{i})\}$
    if $L(t_{i})=\ell_{2}$ then $n_{2}\leftarrow n_{2}+1$
    if $t_{i}$ is the only tuple of its block and $L(t_{i})=\ell_{1}$ then $n_{1}\leftarrow n_{1}+1$
    if $t_{i}$ is the last tuple of its block then $\textsf{B}^{\square}\leftarrow\textsf{B}^{\square}\cup\{\mathsf{key}(t_{i})\}$
    if $|\textsf{B}^{\square}|\leq k\leq|\textsf{B}|$ and $n_{2}\geq n_{1}$ then return true
return false

Algorithm 2: FastScan
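A runnable Python sketch of FastScan, reusing the prune sketch above and the same (key, label, rank) record format (again, our own names for illustration):

from collections import Counter

def fast_scan(D, k, l1, l2):
    D = prune(D, l2, l1)                     # sketched earlier
    last = {}                                # key -> rank of last kept tuple
    for key, _, rank in D:
        last[key] = max(last.get(key, rank), rank)
    sizes = Counter(key for key, _, _ in D)  # block sizes after pruning
    n1 = n2 = 0
    seen, closed = set(), set()              # B and B-square
    for key, label, rank in D:
        seen.add(key)
        if label == l2:
            n2 += 1
        if sizes[key] == 1 and label == l1:
            n1 += 1
        if rank == last[key]:
            closed.add(key)
        if len(closed) <= k <= len(seen) and n2 >= n1:
            return True
    return False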
Theorem 5.6.

There exists an $O(|\mathcal{Y}|\cdot n)$ algorithm for $\textsc{CR-NN}_{<}\langle\mathbf{R}\rangle$ when the FD schema $\mathbf{R}$ is equivalent to one with a single primary key.

Proof 5.7.

We first show the correctness of Algorithm 2. The algorithm falsifies robustness if there exists a pair of labels $(\ell,1)$ for which FastScan returns true. Suppose that FastScan terminates at iteration $i$. Then, we can construct a repair $I$ of $D^{\star}$ (and hence of $D$) as follows. For each block in $\textsf{B}^{\square}$ we pick the last tuple, unless its label is $\ell_{1}$, in which case we arbitrarily pick another tuple of that block. For $(k-|\textsf{B}^{\square}|)$ of the blocks in $\textsf{B}\setminus\textsf{B}^{\square}$ we pick a tuple $\leq t_{i}$, and for the remaining blocks in $\textsf{B}\setminus\textsf{B}^{\square}$ we pick a tuple $>t_{i}$. For all remaining blocks we make an arbitrary choice. By construction, $I$ has exactly $k$ tuples $\leq t_{i}$, which form the $k$-neighborhood. Since we have pruned $D$, all occurrences of tuples with labels $\ell_{1},\ell_{2}$ are at the last positions of their blocks. Hence, $C(\ell_{2},I)-C(\ell_{1},I)=n_{2}-n_{1}\geq 0$.

For the opposite direction, suppose that there exists a repair $I$ that falsifies robustness. Then, for some label $\ell=\ell_{2}$ (with $\ell_{1}=1$) we must have $C(\ell_{2},I)-C(\ell_{1},I)\geq 0$. By Lemma 5.3, there exists a repair $I^{\prime}$ of $D^{\star}$ for which $C(\ell_{2},I^{\prime})\geq C(\ell_{1},I^{\prime})$. Let $t_{i}$ be the last tuple in the $k$-neighborhood of $I^{\prime}$. Then, the exit condition is satisfied at the $i$-th iteration of the loop of FastScan, and hence the algorithm returns true.

We now discuss the time complexity. First, we spend $O(n)$ time to find a possible label for $x$. The algorithm then runs FastScan $(m-1)$ times, once for every pair of labels $(\ell,1)$ with $\ell=2,\dots,m$. Finally, each iteration of the loop in FastScan requires $O(1)$ time: $\textsf{B}$ and $\textsf{B}^{\square}$ can be implemented as hash-sets with constant-time insertions, and checking whether a tuple is the last one of its block takes $O(1)$ time after a preprocessing linear pass that marks the last tuple of every block.

When $|\mathcal{Y}|=2$, the algorithm essentially reduces to the MinMax algorithm in [13]. For $|\mathcal{Y}|\geq 3$ it outperforms the SortScan algorithm [13], since the latter has an exponential dependence on $|\mathcal{Y}|$ and $k$. Our algorithm can also handle the case where two tuples in the same block have different labels, which the MinMax and SortScan algorithms cannot.

Example 5.8.

We now illustrate the algorithm with a simple example where $k=3$ and $\mathcal{Y}=\{1,2,3\}$. Figure 1(a) represents an inconsistent instance $D$, where nodes with the same shape/color are in the same block; their distances to a given test point increase from left to right. A repair that chooses the first two tuples assigns label 1 to $x$, hence 1 is a possible label. Figures 1(b) and 1(c) illustrate the pruned instances $\textsc{Prune}(D,2,1)$ and $\textsc{Prune}(D,3,1)$, respectively. Take Figure 1(c) for example: when FastScan reaches the iteration $i=3$, we have $n_{2}=2$, $n_{1}=1$, $|\textsf{B}^{\square}|=3$ and $|\textsf{B}|=3$, so $|\textsf{B}^{\square}|=k=|\textsf{B}|$ and $n_{2}\geq n_{1}$. Indeed, by choosing the first, second, third, and last two tuples in Figure 1(c), we see that label 1 is not robust (against label 3).

6 Hardness Result

In this section, we establish the main intractability result.

Theorem 6.1.

Let $\mathbf{R}$ be an FD schema that is not equivalent to any FD schema with an lhs chain. Then, the problem $\textsc{CR-NN}_{p}\langle\mathbf{R},1\rangle$ is coNP-complete for any $p>1$.

Proof 6.2.

The coNP membership of $\textsc{CR-NN}_{p}\langle\mathbf{R},1\rangle$ follows from the observation that if $x$ is not certifiably robust, then one can check efficiently that two given repairs (the certificate) indeed lead to two different predicted labels. To prove hardness, we describe a reduction from the SAT-3-restricted problem, inspired by the construction of [27] for the edge dominating set problem. In this variant of SAT, each clause has at most three literals, while every variable occurs positively twice and negatively once.

Let $\phi$ be a SAT-3-restricted formula. Suppose that $\phi$ has $m$ clauses $C_{1},C_{2},\dots,C_{m}$ and $n$ variables $x_{1},x_{2},\dots,x_{n}$. Our construction consists of three steps.

Step 1: In the first step, we construct a directed labeled graph $G=(V,E)$ with edge labels in $\{0,1\}$:

  • The set of vertices is $V=\{C_{i}^{+},C_{i}^{-}\mid 1\leq i\leq m\}\cup\{x_{j}^{k},y_{j}^{k}\mid 1\leq j\leq n,\,0\leq k\leq 2\}$.

  • For each clause $C_{i}$, where $i=1,\dots,m$, we add the labeled edge

    $(C_{i}^{+},C_{i}^{-})\rightarrow 0$

    That is, we add the directed edge which points from $C_{i}^{+}$ to $C_{i}^{-}$ to the set of edges $E$ and label it 0.

  • Suppose that variable $x_{j}$, where $j=1,\dots,n$, appears positively in clauses $C_{\kappa},C_{\lambda}$, and negatively in clause $C_{\mu}$. Then, we introduce the following labeled edges:

    $(C_{\kappa}^{+},x_{j}^{0}),\;(y_{j}^{0},x_{j}^{0})\rightarrow 1$
    $(C_{\lambda}^{+},x_{j}^{1}),\;(y_{j}^{1},x_{j}^{1})\rightarrow 1$
    $(x_{j}^{2},C_{\mu}^{-}),\;(x_{j}^{2},y_{j}^{2})\rightarrow 1$
    $(y_{j}^{0},y_{j}^{2}),\;(y_{j}^{1},y_{j}^{2})\rightarrow 0$

Figure 2 shows the above construction. Note that $G$ is a directed bipartite graph, since no vertex has both incoming and outgoing edges. Hence, one can equivalently view each maximal matching of $G$ as a subset repair of an instance with FD schema $(R(A,B),\{A\rightarrow B,B\rightarrow A\})$ and vice versa (attributes $A$ and $B$ correspond to the two sides of the bipartite graph).

[Figure 2: Variable gadget for the hardness reduction from the SAT-3-restricted problem.]
Sublemma 2.

$\phi$ is satisfiable if and only if there exists a maximal matching of $G$ that includes only edges with label 1.

Proof 6.3 (Proof of Sublemma 2).

$\Rightarrow$ Assume that the variable assignment $\psi$ makes $\phi$ satisfiable. Fix any order of the variables $x_{1},\dots,x_{n}$. We form a set of edges $M$ as follows. For each variable $x_{j}$, visited in this order, we distinguish two cases:

  • $\psi(x_{j})=\textsf{true}$: we pick $(x_{j}^{2},y_{j}^{2})$. If $C_{\kappa}^{+}$ is not yet matched, pick $(C_{\kappa}^{+},x_{j}^{0})$; otherwise pick $(y_{j}^{0},x_{j}^{0})$. Similarly for $C_{\lambda}^{+}$.

  • $\psi(x_{j})=\textsf{false}$: pick $(y_{j}^{0},x_{j}^{0})$ and $(y_{j}^{1},x_{j}^{1})$. If $C_{\mu}^{-}$ is not yet matched, pick $(x_{j}^{2},C_{\mu}^{-})$; otherwise pick $(x_{j}^{2},y_{j}^{2})$.

By construction, $M$ contains only edges with label 1.

Claim 1: $M$ is a matching. Since the vertices $x_{j}^{\nu},y_{j}^{\nu}$ occur only in the gadget of variable $x_{j}$, they have at most one adjacent edge in $M$. By construction, each clause vertex $C_{\kappa}^{+},C_{\mu}^{-}$ also has at most one adjacent edge in $M$.

Claim 2: $M$ is a maximal matching. First, observe that by our construction, all of $x_{j}^{0},x_{j}^{1},x_{j}^{2}$ are matched for every $j=1,\dots,n$. Second, notice that if $\psi(x_{j})=\textsf{true}$, the edge $(x_{j}^{2},y_{j}^{2})$ is chosen, and if $\psi(x_{j})=\textsf{false}$, the edges $(y_{j}^{0},x_{j}^{0})$ and $(y_{j}^{1},x_{j}^{1})$ are chosen; thus the edges $(y_{j}^{0},y_{j}^{2}),(y_{j}^{1},y_{j}^{2})$ cannot be added to $M$. Finally, consider the edge $(C_{i}^{+},C_{i}^{-})$ corresponding to clause $C_{i}$. If there exists a positive literal that satisfies $C_{i}$, consider the earliest such $x_{j}$ in the linear order of variables; by construction, $(C_{i}^{+},x_{j}^{\nu})$ is in the matching for some $\nu\in\{0,1\}$. Otherwise, $C_{i}$ is satisfied by a negative literal; consider the earliest such $x_{k}$ in the linear order. Then $(x_{k}^{2},C_{i}^{-})$ is in $M$. In either case, $(C_{i}^{+},C_{i}^{-})$ cannot be added to increase the size of the matching.

$\Leftarrow$ Assume a maximal matching $M$ that avoids 0-labeled edges. For every variable $x_{j}$, if $M$ contains $(x_{j}^{2},y_{j}^{2})$, we assign $\psi(x_{j})=\textsf{true}$; otherwise $\psi(x_{j})=\textsf{false}$. We claim that $\psi$ is a satisfying assignment for $\phi$. Indeed, take any clause $C_{i}$. Since $(C_{i}^{+},C_{i}^{-})\notin M$, there exists some edge in $M$ that conflicts with it. If this edge is of the form $(C_{i}^{+},x_{j}^{\nu})$, then it must be that $(x_{j}^{2},y_{j}^{2})\in M$. But then $\psi(x_{j})=\textsf{true}$, and since $x_{j}$ occurs positively in $C_{i}$, the clause is satisfied. If this edge is of the form $(x_{j}^{2},C_{i}^{-})$, then $(x_{j}^{2},y_{j}^{2})\notin M$. Thus, $\psi(x_{j})=\textsf{false}$, and since $x_{j}$ occurs negatively in $C_{i}$, the clause is again satisfied.

Step 2: A maximal matching of $G$ can be viewed equivalently as a repair of a labeled instance $D_{0}$ with FD schema $\mathbf{S}=(R(A,B),\{A\rightarrow B,B\rightarrow A\})$. In the second step, we transform the instance $D_{0}$ of $\mathbf{S}$ into a labeled instance $D_{1}$ of the target FD schema $\mathbf{R}$. We do this using the concept of fact-wise reductions. A fact-wise reduction from $\mathbf{S}$ to $\mathbf{R}$ is a function $\Pi$ that maps a tuple from an instance of $\mathbf{S}$ to a tuple of an instance of $\mathbf{R}$ such that (i) it is injective, (ii) it preserves consistency and inconsistency (i.e., a set of tuples in $D_{0}$ violates $\mathbf{S}$ if and only if the corresponding set of tuples in $D_{1}$ violates $\mathbf{R}$), and (iii) it can be computed in polynomial time. In fact, we use exactly the same fact-wise reduction as the one used in [21] to reduce an instance of $\mathbf{S}$ to one of $\mathbf{R}$, where $\mathbf{R}$ is not equivalent to an FD schema with an lhs chain. It is necessary to present this reduction in detail, since its structure is needed in the third step of our reduction.

W.l.o.g., we can assume that the FD schema is minimal. Since it does not have an lhs chain, there are two FDs $X\rightarrow A$ and $X^{\prime}\rightarrow A^{\prime}$ such that $X\not\subseteq X^{\prime}$ and $X^{\prime}\not\subseteq X$. Let $\oplus$ be a fresh constant. We map $t=R(u,v)$ with label $\ell$ to a tuple $\Pi(t)$ with the same label $\ell$ such that:

\Pi(t)[A_{i}]=\begin{cases}\oplus&\text{if }A_{i}\in(X\cap X^{\prime})^{+,\Sigma}\\ u&\text{if }A_{i}\in X\setminus(X\cap X^{\prime})^{+,\Sigma}\\ v&\text{if }A_{i}\in X^{\prime}\setminus(X\cap X^{\prime})^{+,\Sigma}\\ (u,v)&\text{otherwise.}\end{cases}

Here, $X^{+,\Sigma}$ denotes the closure of an attribute set $X$ w.r.t. the FD set $\Sigma$. By [21], we know that $\Pi$ is a fact-wise reduction. Let $D_{1}$ be the resulting instance of $\mathbf{R}$.
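For completeness, the attribute closure used here is the standard fixpoint computation; a small Python sketch (function name ours):

def closure(X, fds):
    # X^{+,Sigma}: repeatedly apply every FD whose lhs is already covered.
    closed = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed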

Sublemma 3.

$\phi$ is satisfiable if and only if there exists a repair of $D_{1}$ in $\mathbf{R}$ that includes only tuples with label 1.

Proof 6.4 (Proof of Sublemma 3).

This follows from Sublemma 2 and the fact that $\Pi$ is a fact-wise reduction.

Step 3: In the last step of the reduction, we present an encoding $\llbracket\cdot\rrbracket$ that embeds the values of the attributes in $D_{1}$ into $\mathbb{N}$, hence allowing us to compute distances with the $p$-norm. The resulting tuples are again labeled from $\mathcal{Y}=\{0,1\}$.

Let $\alpha=d\cdot(2m+8n)$, where $d$ is the number of attributes. First, let $\llbracket\oplus\rrbracket=0$. For a vertex $u\in V$, let

\llbracket u\rrbracket=\begin{cases}i&\text{if }u=C_{i}^{+}\\ m+i&\text{if }u=C_{i}^{-}\\ 2m+3j+\nu&\text{if }u=y_{j}^{\nu}\\ \alpha+3j+\nu&\text{if }u=x_{j}^{\nu}\end{cases}

It is easy to see that the above embedding is injective, meaning that if $\llbracket u\rrbracket=\llbracket v\rrbracket$ then $u=v$. As for the edges, consider any ordering $e_{1},e_{2},\dots$ and simply let $\llbracket e_{i}\rrbracket=i$; note that the number of edges in $G$ is $|E|=8n+2m$. This embedding is also injective. Let $D_{2}=\llbracket D_{1}\rrbracket$ denote the instance we obtain by encoding every value of $D_{1}$ as above. Since the encoding is injective, it is trivially a fact-wise reduction, hence Sublemma 3 holds for $D_{2}$ as well.

Sublemma 4.

$\phi$ is satisfiable if and only if $x=R(0,0,\dots,0)$ has no certain label in $D_{2}$.

Proof 6.5 (Proof of Sublemma 4).

We first need the following two claims.

Claim 1: Any tuple with label 0 is closer to $x$ than any tuple with label 1. Indeed, a tuple has label 1 if and only if some attribute contains a value of the form $\llbracket x_{j}^{\nu}\rrbracket$; hence, any tuple with label 1 has distance $>\alpha$ from $x$. On the other hand, each attribute of a 0-labeled tuple has value at most $2m+8n$; hence, the distance of any tuple with label 0 from $x$ is bounded by $d^{1/p}\cdot(2m+8n)\leq d\cdot(2m+8n)=\alpha$.

Claim 2: 0 is a possible label for $x$. Indeed, the tuple corresponding to the edge $(C_{1}^{+},C_{1}^{-})$ is the closest one to $x$ and has label 0. Hence, any repair that includes this tuple assigns the label 0 to $x$.

$\Rightarrow$ Assume that the variable assignment $\psi$ makes $\phi$ satisfiable. Then, by Sublemma 3, there exists a repair of $D_{2}$ that avoids all tuples with label 0. This repair assigns label 1 to $x$, which implies that $x$ has no certain label, since by Claim 2 label 0 is also possible for $x$.

$\Leftarrow$ Assume a repair that assigns label 1 to $x$, hence making $x$ have no certain label. Since by Claim 1 all 0-labeled tuples are closer to $x$ than the 1-labeled tuples, all tuples in this repair must have label 1. But then, by Sublemma 3, $\phi$ is satisfiable.

This completes the proof.

Finally, we extend the intractability result from $\textsc{CR-NN}_{p}\langle\mathbf{R},1\rangle$ to $\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$ for any integer $k\geq 1$.

Theorem 6.6.

Let $\mathbf{R}$ be an FD schema that is not equivalent to any FD schema with an lhs chain. Then, for any integer $k\geq 1$, $\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$ is coNP-hard for any $p>1$.

Proof 6.7.

To show this hardness result, we exhibit a reduction from $\textsc{CR-NN}_{p}\langle\mathbf{R},1\rangle$ to $\textsc{CR-NN}_{p}\langle\mathbf{R},k\rangle$ for $k>1$, for any FD schema $\mathbf{R}$ and binary labels $\mathcal{Y}=\{0,1\}$. We also restrict the inputs so that the test point $x$ does not coincide with any existing tuple in the dataset (this assumption does not affect the hardness claim).

Let $D$ be the input labeled instance with values in $\mathbb{N}$, and let $x\notin D$ be the test point. Assume w.l.o.g. that $x=(0,\dots,0)$; otherwise, we can shift the coordinates so that distances are maintained and $x$ becomes the origin. Find the tuple $t_{0}$ in $D$ that is closest to $x$ and let $d_{0}=f(x,t_{0})$ (this can be done in polynomial time). Since $x\neq t_{0}$, $d_{0}>0$. W.l.o.g. assume $L(t_{0})=0$. Let $\lambda=d^{1/p}k/d_{0}$.

We construct an instance $D^{\prime}$ as follows. First, we map each tuple $R(a_{1},\dots,a_{d})$ in $D$ to $R(\lambda a_{1},\dots,\lambda a_{d})$. Second, we add the tuples $T=\{R(1,\dots,1),R(2,\dots,2),\dots,R(k-1,\dots,k-1)\}$ with alternating labels $0,1,0,\dots$.

By construction, every new point in $T$ has distance at most $d^{1/p}(k-1)$ from $x$. On the other hand, all the other points have distance at least $d^{1/p}k$. Moreover, the tuples in $T$ do not conflict with the existing tuples or with each other. From this, we obtain that the $k$-neighborhood of $x$ in every repair of $D^{\prime}$ contains $T$.

We now claim that $x$ is certifiably robust in $D$ for 1-NN classification if and only if it is certifiably robust in $D^{\prime}$ for $k$-NN classification.

$\Rightarrow$ Suppose that $x$ is certifiably robust in $D$ for 1-NN classification. Since 0 is a possible label for $x$, this means that for every repair of $D$ the closest point has label 0. But then the $k$-neighborhood of $x$ in every repair of $D^{\prime}$ consists of $T$ plus one point with label 0, so the number of 0-labeled tuples is at least $\lceil(k-1)/2\rceil+1$, which constitutes a majority.

$\Leftarrow$ Suppose that $x$ is not certifiably robust in $D$ for 1-NN classification. Since 0 is a possible label for $x$, there exists a repair of $D$ in which the closest tuple has label 1. The corresponding repair of $D^{\prime}$ then has a $k$-neighborhood consisting of $T$ plus one point with label 1, so the number of 1-labeled tuples is at least $\lfloor(k-1)/2\rfloor+1$, which yields either a majority or a tie; in either case, robustness is falsified.
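The padding step of this reduction is easy to express in code; a sketch under an assumed list-of-tuples representation (all names ours):

def pad_instance(D, labels, k, d, p, d0):
    # Scale the original tuples so they land at distance >= d^(1/p) * k.
    lam = (d ** (1.0 / p)) * k / d0
    scaled = [tuple(lam * a for a in t) for t in D]
    # The k-1 padding tuples R(c,...,c), c = 1..k-1, with labels 0,1,0,...
    T = [tuple(float(c) for _ in range(d)) for c in range(1, k)]
    T_labels = [i % 2 for i in range(k - 1)]
    return scaled + T, labels + T_labels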

7 Optimal Repairs Revisited

In this section, we investigate the complexity of a problem related to certifying robustness for k-NN classification, which may be of independent interest. In [20], the authors studied the Opt-Repair problem: each tuple t is associated with a weight w_t ≥ 0, and we want to find the optimal subset repair, i.e., the repair that removes the set of tuples with the smallest total weight. Note that this is equivalent to finding the repair with the largest total weight.

Here, we consider the symmetric problem, Min-Repair: we want to find the subset repair whose tuples have the smallest total weight. In this case, we interpret the weight of a tuple as a measure of how "wrong" we believe the tuple to be. We parametrize this problem by a given FD schema R, written Min-Repair⟨R⟩. Min-Repair captures as a special case the following problem, denoted Forbidden-Repair⟨R⟩: given an inconsistent instance D over R and a subinstance S ⊆ D, does there exist a subset repair I ⊆ D such that I ∩ S = ∅?

Lemma 7.1.

There exists a many-one polynomial time reduction from Forbidden-Repair⟨R⟩ to Min-Repair⟨R⟩.

Proof 7.2.

Set the weight of every tuple in S to 1 and the weight of every other tuple to 0. Then, there exists a repair that avoids the forbidden set S if and only if there exists a repair with total weight 0.
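In code, this reduction is a one-liner over any tuple representation (a sketch; the dictionary-based weight map is our own choice):

def forbidden_to_min_repair(D, S):
    # weight 1 on the forbidden set S, weight 0 elsewhere; a repair avoiding
    # S exists iff the minimum-weight repair has total weight 0
    forbidden = set(S)
    return {t: 1 if t in forbidden else 0 for t in D}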

From the hardness proof of Theorem 6.1, we immediately obtain the following intractability result.

Theorem 7.3.

Let R be an FD schema that is not equivalent to any FD schema with an lhs chain. Then, Forbidden-Repair⟨R⟩ is NP-hard. As a result, Min-Repair⟨R⟩ is also NP-hard.

It turns out that the forbidden set repair problem is directly connected with certifying robustness for 1-NN classification.

Lemma 7.4.

There exists a many-one polynomial time reduction from Forbidden-Repair⟨R⟩ to the complement of CR-NN_<⟨R,1⟩.

Proof 7.5.

Let D and S ⊆ D be the inputs to the Forbidden-Repair problem. We construct a labeled instance for the classification problem using only two labels, 𝒴 = {0,1}. The input instance is exactly D. For the labeling, if t ∈ S then L(t) = 0, otherwise L(t) = 1. Finally, we construct an ordering of the tuples in D such that t < t′ whenever t ∈ S and t′ ∈ D ∖ S.

We first claim that any repair of D that avoids S assigns label 1 to x. Indeed, the 1-neighborhood of x in such a repair consists of a tuple with label 1 by construction. On the other hand, any repair that includes a tuple from S assigns label 0 to x. Since at least one repair includes a tuple from S, 0 is always a possible label, and hence there exists no repair that avoids the forbidden set S if and only if x is certifiably robust (with label 0).

The above reduction implies a polynomial time algorithm for Forbidden-Repair⟨R⟩ whenever R is equivalent to an FD schema with an lhs chain. We now present Algorithm 3, which solves the more general Min-Repair problem.

if Σ is trivial then
    return D
remove trivial FDs from Σ;
if Σ has a common lhs attribute A then
    return ⋃_{a ∈ π_A(D)} Min-Rep(Σ − A, σ_{A=a}(D))
if Σ has a consensus FD ∅ → A then
    m ← argmin_{a ∈ π_A(D)} w(Min-Rep(Σ − A, σ_{A=a}(D)));
    return Min-Rep(Σ − A, σ_{A=m}(D))
Algorithm 3 Min-Rep(Σ, D)

The algorithm works similarly to the OptSRepair algorithm in [20], with two differences. First, since the schema is equivalent to one with an lhs chain, we only need to consider the cases where the FD schema has a common lhs attribute or a consensus FD (i.e., an FD with an empty lhs). Second, in the case of a consensus FD, where we partition the instance by the value of the consensus attribute, we return the repair with the minimum total weight instead of the one with the largest.
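For concreteness, the following Python sketch implements Algorithm 3 under representation assumptions of our own: an FD is a pair (lhs, rhs) of attribute-index sets, the instance is a list of tuples, and w maps each tuple to its weight. It mirrors the pseudocode and is not an optimized implementation.

def min_rep(fds, D, w):
    # assumes the schema is equivalent to one with an lhs chain, so after
    # dropping trivial FDs one of the two cases below always applies
    fds = [(set(l), set(r)) for l, r in fds if not set(r) <= set(l)]
    if not fds:                             # trivial schema: D is consistent
        return list(D)
    common = set.intersection(*(l for l, _ in fds))
    if common:                              # case 1: common lhs attribute A
        A = min(common)
        rest = [(l - {A}, r - {A}) for l, r in fds]
        blocks = {}
        for t in D:
            blocks.setdefault(t[A], []).append(t)
        # blocks with distinct A-values never conflict: repair independently
        return [t for b in blocks.values() for t in min_rep(rest, b, w)]
    # case 2: a consensus FD {} -> A exists; a repair fixes a single A-value
    A = min(next(r for l, r in fds if not l))
    rest = [(l - {A}, r - {A}) for l, r in fds]
    blocks = {}
    for t in D:
        blocks.setdefault(t[A], []).append(t)
    candidates = [min_rep(rest, b, w) for b in blocks.values()]
    return min(candidates, key=lambda rep: sum(w[t] for t in rep))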

Theorem 7.6.

Let R be an FD schema that is equivalent to an FD schema with an lhs chain. Then, Min-Repair⟨R⟩ is in P.

The dichotomy we obtain for Min-Repair coincides with the one we obtain for CR-NN. However, it differs from the dichotomy observed for the Opt-Repair problem in [20]; in fact, every FD schema that admits a polynomial time algorithm for Min-Repair also admits a polynomial time algorithm for Opt-Repair, but not vice versa. Specifically, the FD schema (R(A,B), {A → B, B → A}) is hard for Min-Repair but tractable for Opt-Repair. Here a tuple (a,b) can be viewed as an edge between the values a and b in a bipartite graph, under which repairs correspond to maximal matchings; finding a maximum-weight (maximal) matching in a bipartite graph is polynomially solvable, whereas finding a minimum-weight maximal matching is NP-hard [27].

The next lemma might be of independent interest.

Lemma 7.7.

There exists a Cook reduction from CR-NN_<⟨R,1⟩ to Forbidden-Repair⟨R⟩.

Proof 7.8.

We are given an instance D and a test point x. Let t_1, t_2, …, t_n be the tuples of D in increasing order of their distance to x. Observe that we can always choose a repair that includes the closest tuple t_1. Hence, the label ℓ = L(t_1) is a possible label for x. To show that x is certifiably robust, it suffices to prove that no label in 𝒴 ∖ {ℓ} is possible.

Let ℓ′ ∈ 𝒴 ∖ {ℓ}. For every tuple t_i such that L(t_i) = ℓ′, we construct an input to Forbidden-Repair⟨R⟩ as follows. First, construct D′_i to be the instance obtained from D by removing every tuple that conflicts with t_i; this guarantees that every repair of D′_i contains t_i. Then, the input to Forbidden-Repair is the instance D′_i together with the forbidden set {t_1, …, t_{i−1}} ∩ D′_i. In other words, we ask: is there a repair that includes t_i but avoids all the remaining tuples that are closer to x? It is now easy to observe that ℓ′ is a possible label for x – and thus x is not certifiably robust – if and only if at least one of the (at most n) Forbidden-Repair instances returns true.
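A Python sketch of this Cook reduction, assuming a hypothetical oracle forbidden_repair(D, S) for Forbidden-Repair⟨R⟩ and a helper conflicts(t, s) that tests whether two tuples jointly violate some FD (both names are ours):

def certifiably_robust_1nn(order, L, conflicts, forbidden_repair):
    # order: the tuples of D sorted by increasing distance to x
    ell = L(order[0])            # t_1 lies in some repair, so ell is possible
    for i, t_i in enumerate(order):
        if L(t_i) == ell:
            continue             # only other labels can break robustness
        # drop t_i's conflict partners, so every repair of D_i contains t_i
        D_i = [t for t in order if t == t_i or not conflicts(t, t_i)]
        # forbid the surviving tuples that are closer to x than t_i
        S_i = [t for t in order[:i] if not conflicts(t, t_i)]
        if forbidden_repair(D_i, S_i):
            return False         # some repair makes t_i the nearest tuple, so
                                 # L(t_i) is a second possible label for x
    return True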

8 Certifiable Robustness by Counting

In this section, we consider the counting version of certifiable robustness for k-NN classifiers. The counting problem asks the following: among all possible repairs, how many predict label ℓ for a fixed ℓ ∈ 𝒴?

We show that the counting problem remains solvable in polynomial time when the FD schema R is equivalent to an FD schema with an lhs chain, by generalizing the algorithm in Section 5.1. Formally, FP is the class of function problems computable in polynomial time.

Theorem 8.1.

If the FD schema R is equivalent to some FD schema with an lhs chain, then the counting problem #CR-NN_<⟨R⟩ is in FP.

We now show how to generalize the algorithm in Section 5.1 to perform counting. Let x be the test point and 𝒴 = {1, 2, …, m}. It suffices to show how to count for the label ℓ = 1. Recall the definitions of 𝒩^≤_τ(x,I) and C^≤_τ(ℓ,I). As in Section 5.1, we will compute a high-dimensional matrix M "layer by layer" and read off the answer from it.

We now make our key definition. Fix a threshold τ > 0, and define the following quantity for a (possibly inconsistent) subinstance J ⊆ D and integers i, c_2, c_3, …, c_m, where 0 ≤ i ≤ k and −k ≤ c_j ≤ k for all j ∈ {2, 3, …, m}:

M_i[J, c_2, …, c_m, Δ] = |{ K ∈ I_Δ(J) s.t. |𝒩^≤_τ(x,K)| = i and C^≤_τ(1,K) − C^≤_τ(j,K) = c_j for all j ∈ {2, 3, …, m} }|.

For simplicity of notation, let 𝐜 denote the vector (c_2, c_3, …, c_m), with the understanding that 𝐜_j may denote another such vector (c_{2_j}, c_{3_j}, …, c_{m_j}). We sometimes suppress the FD set Δ and write M_i[J, 𝐜] when the context is clear. The entry M_i[J, 𝐜] records the number of repairs of J that have exactly i tuples within distance τ of x, and in which the difference between the number of 1-labeled and the number of j-labeled tuples (within distance τ) is exactly c_j for every j ∈ {2, 3, …, m}. If there is no such K ∈ I_Δ(J), we define M_i[J, 𝐜] = 0. The algorithm computes the entries of M by a combination of dynamic programming and recursion on the FD schema R. Note that after computing M, the answer to #CR-NN_<⟨R⟩ is exactly the sum of all entries M_k[D, 𝐜] with c_j ≥ 1 for all j ∈ {2, 3, …, m}: these are precisely the repairs in which label 1 strictly wins the vote in the k-neighborhood of x. The algorithm distinguishes three disjoint cases when computing M_i[J, 𝐜]:

Base Case:

the set of FDs is empty. In this case, J is consistent and its only repair is J itself, so M_i[J, 𝐜] = 0 unless |𝒩^≤_τ(x,J)| = i ≤ k and C^≤_τ(1,J) − C^≤_τ(j,J) = c_j for all j ∈ {2, 3, …, m}, in which case the entry is 1. This step can clearly be computed efficiently.

Consensus FD:

there exists a consensus FD ∅ → A. In this case, every repair keeps a single A-value, so we recursively compute M_i[σ_{A=a}(J), 𝐜, Δ − A] for every value a ∈ π_A(J) and calculate M_i[J, 𝐜] = Σ_{a ∈ π_A(J)} M_i[σ_{A=a}(J), 𝐜, Δ − A].

Common Attribute:

there exists a common attribute A in the lhs of all FDs. In this case, we recursively call the algorithm to compute M_j[σ_{A=a}(J), 𝐜_j, Δ − A] for every value a ∈ π_A(J) and all j, 𝐜_j such that 0 ≤ j ≤ i and c_{l_j} ∈ K for 2 ≤ l ≤ m, where K = {k, k−1, …, −k}. Let S = π_A(J) = {a_1, …, a_{|S|}}. Since blocks with distinct A-values never conflict, every repair of J is the disjoint union of one repair per block, and we compute

M_i[J, 𝐜] = Σ Π_{j=1}^{|S|} M_{i_j}[σ_{A=a_j}(J), 𝐜_j, Δ − A]

where the summation ranges over all decompositions into i_j's and c_{l_j}'s such that Σ_{j=1}^{|S|} i_j = i and Σ_{j=1}^{|S|} c_{l_j} = c_l for all l ∈ {2, 3, …, m}. We now show how to compute M_i[J, 𝐜] by a direct generalization of Algorithm 1. The dynamic programming computes an (m+1)-dimensional matrix 𝓜, whose (p, q, 𝐜)-entry records the sum of the products Π_{j=1}^{p} M_{i_j}[σ_{A=a_j}(J), 𝐜_j] over all decompositions such that Σ_{j=1}^{p} i_j = q and Σ_{j=1}^{p} c_{l_j} = c_l for 2 ≤ l ≤ m; in other words, it aggregates the counts over the first p blocks. Note that 𝓜 has |S| · (k+1) · (2k+1)^{m−1} = O(n · k^m) entries. Define 𝐜_i − 𝐜_j := (c_{2_i} − c_{2_j}, …, c_{m_i} − c_{m_j}). We are now ready to present our dynamic programming algorithm, where the inner sum ranges over 0 ≤ p ≤ j and 𝐜_q ∈ K^{m−1}:

for j = 0, …, k and 𝐜 ∈ K^{m−1} do
    𝓜_j[1, 𝐜] ← M_j[σ_{A=a_1}(J), 𝐜, Δ − A];
for i = 2, …, |S| do
    for j = 0, …, k and 𝐜 ∈ K^{m−1} do
        𝓜_j[i, 𝐜] ← Σ_{p, 𝐜_q} 𝓜_p[i−1, 𝐜_q] · M_{j−p}[σ_{A=a_i}(J), 𝐜 − 𝐜_q, Δ − A];
Algorithm 4 Dynamic Programming (Counting)
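The update in the inner loop is an (m−1)-dimensional convolution. The following Python sketch spells it out for the two-label case m = 2, where 𝐜 collapses to a single integer c = C^≤_τ(1,K) − C^≤_τ(2,K); the function name, the dictionary encoding of the tables, and the list M_blocks of per-block tables are our own illustration.

def combine_blocks(M_blocks, k):
    # M_blocks[s] maps (i, c) -> number of repairs of block s with i tuples
    # in the tau-neighborhood of x and label-count difference c
    acc = dict(M_blocks[0])                  # table for the first block
    for Mb in M_blocks[1:]:                  # fold in one block at a time
        nxt = {}
        for (p, cq), ways1 in acc.items():
            for (j, c), ways2 in Mb.items():
                if p + j > k:                # more than k near tuples: prune
                    continue
                key = (p + j, cq + c)
                # independent blocks: counts multiply, decompositions add up
                nxt[key] = nxt.get(key, 0) + ways1 * ways2
        acc = nxt
    return acc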

We now show that the counting algorithm is correct and runs in polynomial time with respect to the size of D and the parameter k. However, the running time depends exponentially on the number of labels |𝒴|; it remains open whether this exponential dependence is essential.

Proof 8.2 (Proof of Theorem 8.1).

The correctness of the counting algorithm follows similarly to the proof of Lemma 5.1. We now argue that the counting algorithm runs in polynomial time. For every label ℓ and every threshold τ, the recursion has constant depth, since the FD schema R is fixed, and each level spawns at most polynomially many subcalls. Furthermore, the dynamic programming step runs in time O(n · k^{2m}), where n = |D|, since the matrix 𝓜 has O(n · k^m) entries and computing each entry takes O(k^m) time. Thus, for each ℓ and τ, the algorithm runs in time polynomial in n and the parameter k. Since there are O(n) boundary thresholds, we conclude that the algorithm runs in polynomial time with respect to the size of D and the parameter k.

The hardness part for counting is an immediate corollary of a previous result in [21].

Theorem 8.3.

If the FD schema R is not equivalent to some FD schema with an lhs chain, then #CR-NN_<⟨R⟩ is #P-complete.

Proof 8.4.

By Theorem 3.2 in [21], it is #P-hard to count the number of repairs for the FD schema R. Now, given any instance D, we can pick an arbitrary ordering of the points and assign the same label ℓ to every tuple. Then, the number of repairs that predict label ℓ equals the total number of repairs. Membership in #P is immediate, since we can verify in polynomial time that a given subinstance is a repair that predicts label ℓ.

Approximate counting for certifiable robustness of k-NN classifiers is also hard.

Theorem 8.5.

If the FD schema R is not equivalent to some FD schema with an lhs chain, then #CR-NN_<⟨R⟩ ≡_AP #SAT.

Here, ≡_AP denotes equivalence under approximation-preserving reductions [6].

Proof 8.6.

By the same fact-wise reduction as in Lemma 3 and the labeling of Theorem 8.3, we have #MaximalBIS ≤ #SRep ≤ #CR-NN_<⟨R⟩, where #MaximalBIS is the problem of counting the maximal independent sets of a bipartite graph and #SRep is the problem of counting the subset repairs of an inconsistent instance. By Theorem 1 in [9], #MaximalBIS ≡_AP #SAT, and thus we obtain #CR-NN_<⟨R⟩ ≡_AP #SAT.

9 Other Uncertainty Models

In this section, we study the complexity of certifying robustness for k-NNs under three simple uncertainty models: ?-sets, or-sets, and Codd tables. We show that for these models we can certify robustness in polynomial time. Throughout the section, we fix the relational schema to be R(A_1, …, A_d).

?-Sets with Size Constraints. For a given instance D over the schema, we mark an uncertain subset D_? of the tuples of D. Then, for a positive integer m ≥ 1, we define the set of possible worlds as:

ℐ_? = { I ∣ D ∖ D_? ⊆ I ⊆ D, |D ∖ I| ≤ m }.

In other words, we can construct a possible world by removing any number of tuples – but at most m – from D_?. When m = |D_?|, this definition captures the notion of ?-tables as defined in [25]. When D_? = D, it captures data poisoning scenarios, where the goal is to protect against attacks that can poison up to m tuples of the training set.

For this setting, we show that certifiable robustness for k-NNs can be decided in almost linear time in the size of the input.

Lemma 9.1.

We can certify robustness in ℐ_? for k-NNs in time O((|𝒴| + log n) · n), where n is the size of the dataset.

Proof 9.2.

For the sake of simplicity, let 𝒴 = {1, 2, …}.

Let x be the test point. As a first step, we order the tuples in increasing order of f(x, t) – this can be done in time O(n log n). We assume w.l.o.g. that all distances are distinct. Let the resulting order be t_1, t_2, …, t_n.

Since it always holds that D ∈ ℐ_?, we can first compute the predicted label of x in D in time O(k). W.l.o.g. assume that ℒ_D(x) = 1. For every label ℓ > 1, the algorithm then tries to construct a possible world where ℓ occurs at least as many times as 1 in the k-neighborhood of x. To simplify the exposition, define for every tuple t ∈ D_? a priority value ρ(t) as follows.

ρ(t) = 2 if L(t) = ℓ;  ρ(t) = 0 if L(t) = 1;  ρ(t) = 1 otherwise.
Input: test point x, label ℓ
Output: true iff some possible world in ℐ_? has at least as many ℓ-labeled as 1-labeled tuples in the k-neighborhood of x

J ← {t_1, …, t_k};
Δ ← 0;
for i = k+1 to n do
    Δ ← Δ + 1;
    if |{t ∈ J ∣ L(t) = ℓ}| ≥ |{t ∈ J ∣ L(t) = 1}| then
        return true;
    if J ∩ D_? = ∅ or Δ > m then
        return false;
    J ← J ∪ {t_i};
    remove from J the tuple argmin_{t ∈ J ∩ D_?} ρ(t);
Algorithm 5 Certifiable Robustness for ?-Sets

We now explain Algorithm 5. Intuitively, it iterates over the tuples in order of proximity to x and, for each tuple t_i, attempts to construct the most promising k-neighborhood that includes tuples from {t_1, …, t_i}. The test point x is certifiably robust if and only if the algorithm returns false for every label ℓ > 1. Each iteration of the loop can be executed in time O(1): we can maintain three sets of tuples, partitioned according to their labels (ℓ, 1, or other), each supporting insertion and deletion in constant time.

Note that the running time of Algorithm 5 is independent of k, m, and |D_?|, but it depends linearly on the number of labels.
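A direct transcription of Algorithm 5 into Python may also be useful; this is a sketch under our own naming, and for readability it recounts the labels in O(k) per iteration instead of maintaining the three constant-time sets described above. Certifying robustness then amounts to calling it once for every label ℓ > 1 and checking that all calls return False.

def exists_world_where_ell_ties(order, L, D_unc, k, m, ell):
    # order: tuples sorted by distance to x; D_unc: the set D_?; m: budget
    def rho(t):                      # the priority value from the text
        if L(t) == ell: return 2     # never evict an ell-labeled tuple
        if L(t) == 1:   return 0     # evict 1-labeled tuples first
        return 1
    J = list(order[:k])              # current candidate k-neighborhood
    removals = 0
    for t_i in order[k:]:
        removals += 1
        if sum(L(t) == ell for t in J) >= sum(L(t) == 1 for t in J):
            return True
        if not any(t in D_unc for t in J) or removals > m:
            return False
        J.append(t_i)                # t_i enters by evicting the removable
        J.remove(min((t for t in J if t in D_unc), key=rho))  # lowest priority
    # test the final configuration as well
    return sum(L(t) == ell for t in J) >= sum(L(t) == 1 for t in J)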

Or-Sets. In this uncertainty model, each attribute value of a tuple is an or-set of finitely many values (e.g., ⟨2, 5, 10⟩). Each possible world in ℐ_or is formed by choosing exactly one value from each or-set, independently of the choices for all other or-sets. We can express ℐ_or via subset repairs over the following schema: add to R an extra attribute id, assign a unique id value to each tuple t, and create one tuple for every "realization" of t, all sharing the same id. This increases the input size of the problem by at most a polynomial factor (since d is fixed). Moreover, the resulting FD schema is clearly equivalent to an lhs chain; in fact, this is the single primary key case. As a consequence, we have the following proposition.

Proposition 9.3.

We can certify robustness in ℐ_or for k-NNs in P.
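To illustrate the encoding, here is a Python sketch; the function name and the representation of or-sets as Python tuples of candidate values are our own choices.

from itertools import product

def orset_to_keyed_instance(D):
    # every realization of tuple number i gets the same id i, so the primary
    # key id -> (all attributes) forces each repair to pick one realization
    keyed = []
    for tid, t in enumerate(D):
        domains = [v if isinstance(v, tuple) else (v,) for v in t]
        for realization in product(*domains):
            keyed.append((tid,) + realization)
    return keyed

# e.g. one tuple whose first attribute is the or-set <2, 5, 10>:
# orset_to_keyed_instance([((2, 5, 10), 7)])
# -> [(0, 2, 7), (0, 5, 7), (0, 10, 7)]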

Codd tables. In a Codd table, a missing value is represented as Null, paired with a domain Dom from which the value can be chosen. A repair is any possible completion of the table. The first observation is that, by adding a new identifier attribute id, we can view a Codd table as an inconsistent instance with a single primary key, where each block has a consistent label (i.e., every tuple in the block has the same label). However, if Dom is given as an infinite interval, then Algorithm 2 does not apply directly, since it may not be possible to enumerate all tuple completions in increasing order of distance (there are uncountably many). Indeed, in [13] Karlas et al. consider only the situation where Dom is given as a finite discrete set.

Formally, the distance between a tuple t of the Codd table and a test point x is given by f(x, t) = g_t(y_1, y_2, …, y_n), where the y_i's are the missing entries of t. In order to certify robustness, we need to assume that the minimum and maximum of this function can be computed efficiently. This is a mild assumption: for the p-norm, for example, the minimum and maximum are achieved when each summand is minimized or maximized, respectively.
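For instance, when f is the p-norm and each Dom is an interval, both extremes decompose coordinate-wise; the sketch below (our own helper, with interval domains encoded as (lo, hi) pairs and known values as plain numbers) computes them in one pass.

def dist_range(x, t, p):
    # smallest and largest possible L_p distance between the test point x
    # and a Codd-table tuple t whose Nulls carry interval domains (lo, hi)
    lo_sum, hi_sum = 0.0, 0.0
    for xi, ti in zip(x, t):
        if isinstance(ti, tuple):            # Null with domain [lo, hi]
            lo, hi = ti
            nearest = min(max(xi, lo), hi)   # clamp xi into the interval
            farthest = max(abs(xi - lo), abs(xi - hi))   # always an endpoint
            lo_sum += abs(xi - nearest) ** p
            hi_sum += farthest ** p
        else:                                # known value: fixed contribution
            lo_sum += abs(xi - ti) ** p
            hi_sum += abs(xi - ti) ** p
    return lo_sum ** (1.0 / p), hi_sum ** (1.0 / p)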

Now, we observe that since every block has the same label, we can modify Algorithm 2 so that every resulting repair contains only the first or the last tuple of each block, without changing its correctness properties. With this observation in hand, we can define the extremal set S that contains, for every tuple t ∈ D, one completion achieving min f(x, t) and one achieving max f(x, t), and then run Algorithm 2 on S. Since S is at most twice the size of D, this implies a linear time algorithm (w.r.t. the number of tuples, labels, and k) for certifiable robustness on Codd tables. We have shown the following proposition.

Proposition 9.4.

We can certify robustness in a Codd table for k-NNs in linear time, even if the domain Dom is given as an infinite interval.

10 Conclusion

In this paper, we study the complexity of certifiable robustness for k-NN classification under FD constraints. We show a dichotomy in the complexity of the problem depending on the structure of the FDs, and we prove that the same dichotomy condition also holds for the counting version of the problem. We envision this work as a first step towards the long-term goal of investigating the complexity of certifying robustness for other widely used classification algorithms, such as decision trees, Naive Bayes classifiers, and linear classifiers.

References

  • [1] Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79. ACM Press, 1999. doi:10.1145/303976.303983.
  • [2] Jan Chomicki. Consistent query answering: Five easy pieces. In ICDT, volume 4353 of Lecture Notes in Computer Science, pages 1–17. Springer, 2007. doi:10.1007/11965893_1.
  • [3] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 1310–1320. PMLR, 2019. URL: http://proceedings.mlr.press/v97/cohen19c.html.
  • [4] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput., 48(2):742–864, 2019. doi:10.1137/17M1126680.
  • [5] Samuel Drews, Aws Albarghouthi, and Loris D’Antoni. Proving data-poisoning robustness in decision trees. In PLDI, pages 1083–1097. ACM, 2020. doi:10.1145/3385412.3385975.
  • [6] Martin E. Dyer, Leslie Ann Goldberg, Catherine S. Greenhill, and Mark Jerrum. The relative complexity of approximate counting problems. Algorithmica, 38(3):471–500, 2004. doi:10.1007/s00453-003-1073-y.
  • [7] Austen Z. Fan and Paraschos Koutris. Certifiable robustness for nearest neighbor classifiers. CoRR, abs/2201.04770, 2022. URL: http://arxiv.org/abs/2201.04770.
  • [8] Ariel Fuxman and Renée J. Miller. First-order query rewriting for inconsistent databases. J. Comput. Syst. Sci., 73(4):610–635, 2007. doi:10.1016/j.jcss.2006.10.013.
  • [9] Leslie Ann Goldberg, Rob Gysel, and John Lapinskas. Approximately counting locally-optimal structures. J. Comput. Syst. Sci., 82(6):1144–1160, 2016. doi:10.1016/j.jcss.2016.04.001.
  • [10] Sergio Greco, Cristian Molinaro, and Francesca Spezzano. Incomplete Data and Data Dependencies in Relational Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012. doi:10.2200/S00435ED1V01Y201207DTM029.
  • [11] Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984. doi:10.1145/1634.1886.
  • [12] Jinyuan Jia, Xiaoyu Cao, Binghui Wang, and Neil Zhenqiang Gong. Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. In ICLR. OpenReview.net, 2020. URL: https://openreview.net/forum?id=BkeWw6VFwr.
  • [13] Bojan Karlas, Peng Li, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, and Ce Zhang. Nearest neighbor classifiers over incomplete information: From certain answers to certain predictions. Proc. VLDB Endow., 14(3):255–267, 2020. doi:10.5555/3430915.3442426.
  • [14] Phokion G. Kolaitis and Enela Pema. A dichotomy in the complexity of consistent query answering for queries with two atoms. Inf. Process. Lett., 112(3):77–85, 2012. doi:10.1016/j.ipl.2011.10.018.
  • [15] Paraschos Koutris and Dan Suciu. A dichotomy on the complexity of consistent query answering for atoms with simple keys. In ICDT, pages 165–176. OpenProceedings.org, 2014. doi:10.5441/002/icdt.2014.19.
  • [16] Paraschos Koutris and Jef Wijsen. The data complexity of consistent query answering for self-join-free conjunctive queries under primary key constraints. In PODS, pages 17–29. ACM, 2015. doi:10.1145/2745754.2745769.
  • [17] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. Activeclean: Interactive data cleaning for statistical modeling. Proc. VLDB Endow., 9(12):948–959, 2016. doi:10.14778/2994509.2994514.
  • [18] Aounon Kumar, Alexander Levine, Tom Goldstein, and Soheil Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 5458–5467. PMLR, 2020. URL: http://proceedings.mlr.press/v119/kumar20b.html.
  • [19] Leonid Libkin. Incomplete information and certain answers in general data models. In PODS, pages 59–70. ACM, 2011. doi:10.1145/1989284.1989294.
  • [20] Ester Livshits, Benny Kimelfeld, and Sudeepa Roy. Computing optimal repairs for functional dependencies. ACM Trans. Database Syst., 45(1):4:1–4:46, 2020. doi:10.1145/3196959.3196980.
  • [21] Ester Livshits, Benny Kimelfeld, and Jef Wijsen. Counting subset repairs with functional dependencies. J. Comput. Syst. Sci., 117:154–164, 2021. doi:10.1016/j.jcss.2020.10.001.
  • [22] Simon Razniewski and Werner Nutt. Completeness of queries over incomplete databases. Proc. VLDB Endow., 4(11):749–760, 2011. URL: http://www.vldb.org/pvldb/vol4/p749-razniewski.pdf.
  • [23] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow., 10(11):1190–1201, 2017. doi:10.14778/3137628.3137631.
  • [24] Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and J. Zico Kolter. Certified robustness to label-flipping attacks via randomized smoothing. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 8230–8241. PMLR, 2020. URL: http://proceedings.mlr.press/v119/rosenfeld20b.html.
  • [25] Anish Das Sarma, Omar Benjelloun, Alon Y. Halevy, and Jennifer Widom. Working models for uncertain data. In ICDE, page 7. IEEE Computer Society, 2006. doi:10.1109/ICDE.2006.174.
  • [26] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. In NIPS, pages 3517–3529, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/9d7311ba459f9e45ed746755a32dcd11-Abstract.html.
  • [27] M. Yannakakis and F. Gavril. Edge dominating sets in graphs. SIAM Journal on Applied Mathematics, 38(3):364–372, 1980. doi:10.1137/0138030.