
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Matthew Fahrbach*, Srikumar Ramalingam*, Morteza Zadimoghaddam*, Sara Ahmadian, Gui Citovsky, Giulia DeSalvo
*Equal contribution. Emails: {fahrbach,rsrikumar,zadim,sahmadian,gcitovsky,giuliad}@google.com
Abstract

We propose a novel subset selection task called min-distance diverse data summarization (MDDS), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \leq k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the GIST algorithm, which achieves a $2/3$-approximation guarantee for MDDS by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study that demonstrates GIST outperforms existing methods for MDDS on synthetic data, and also for a real-world image classification experiment that studies single-shot subset selection for ImageNet.

1 Introduction

Subset selection is a challenging optimization problem with a wide variety of applications in machine learning, including feature selection, recommender systems, news aggregation, drug discovery, data summarization, and designing pretraining sets for large language models (Anil et al., 2023). Data sampling in particular is a salient problem due to unprecedented and continuous data collection. For example, LiDAR and imaging devices in one self-driving vehicle can easily capture ~80 terabytes of data per day (Kazhamiaka et al., 2021).

In most subset selection tasks, we rely on the weight (or utility) of the objects to rank one over the other, and also to avoid selecting duplicate or near-duplicate objects. If we select a small subset, then we also want to ensure that the selected subset is a good representation of the original set. These utility, diversity, and coverage criteria can be expressed through objective functions, and the interesting research lies in developing efficient algorithms with strong approximation guarantees. The underlying machinery used in constrained subset selection algorithms shares many similarities with techniques from other areas of combinatorial optimization such as submodular maximization, $k$-center clustering, and convex hull approximations.

In this work, we study the problem of selecting a set of points in a metric space that maximizes an objective that combines their utility and a minimum pairwise-distance diversity measure. Concretely, given a set of points $V$, weight function $w : V \rightarrow \mathbb{R}_{\geq 0}$, and distance function $\textnormal{dist} : V \times V \rightarrow \mathbb{R}_{\geq 0}$, our goal is to solve the cardinality-constrained optimization problem:

$$S^{*} = \operatorname*{arg\,max}_{S \subseteq V}\ \frac{1}{k}\sum_{v \in S} w(v) + \texttt{div}(S) \quad \text{subject to} \quad |S| \leq k,$$

where the diversity term $\texttt{div}(S) \coloneqq \min_{u,v \in S : u \neq v} \textnormal{dist}(u,v)$ encourages maximizing the minimum distance between all pairs of distinct points in the selected subset. Bhaskara et al. (2016) call $\texttt{div}(S)$ the min-min diversity objective and Meinl et al. (2011) call it $p$-dispersion maximization, where it is known to admit a $1/2$-approximation ratio subject to the strict cardinality constraint $|S| = k$.

Other diversity objectives such as the sum-of-min-distances $\sum_{u \in S} \min_{v \in S \setminus \{u\}} \textnormal{dist}(u,v)$ and sum-of-all-distances $\sum_{u \in S} \sum_{v \in S} \textnormal{dist}(u,v)$ have also been studied (Meinl et al., 2011; Bhaskara et al., 2016). While previous works have looked at maximizing the diversity objective $\texttt{div}(S)$, our formulation that introduces the utility term has not been studied before. In particular, Bhaskara et al. (2016) observe that min-min is a powerful diversity objective, but point out a major challenge: it is highly non-monotonic, i.e., adding a single new point can cause a stark decrease in the objective value. Our work shows that the use of a utility term along with the min-min objective can alleviate some of these pitfalls, and allows us to inherit the strong benefits of this strict diversity objective.

1.1 Our contributions

We summarize our main contributions below:

  • We propose a novel subset selection problem called min-distance diverse data summarization (MDDS), where the goal is to maximize an objective function that combines a linear weight term with the min-min distance diversity term $\texttt{div}(S)$ subject to the cardinality constraint $|S| \leq k$. This problem has a wide variety of applications in machine learning, e.g., data denoising and active learning.

  • We give a $2/3$-approximation algorithm for MDDS called GIST that approximates a series of maximum independent set problems on geometric intersection graphs with a bicriteria greedy algorithm and returns the best solution.

  • We prove a complementary $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$, for general metric spaces by reducing from approximate maximum clique. We also prove that MDDS is APX-complete for Euclidean metrics.

  • Our experiments show that GIST overcomes the shortcomings of existing methods for MDDS on synthetic data. We also demonstrate how GIST can be used to create better single-shot subsets of training data for an image classification benchmark on ImageNet (Russakovsky et al., 2015) compared to margin sampling and $k$-center algorithms.

1.2 Background and related work

Diversity objectives.

A closely related problem is $k$-center, which aims to find a subset $S$ of $k$ centers that minimizes the largest distance of any point to its closest center in $S$. Concretely, the objective is $\max_{v \in V}\min_{c \in S}\textnormal{dist}(v,c)$, and a greedy $2$-approximation algorithm is widely used in active learning applications (Sener and Savarese, 2018). The main difference between our objective function and $k$-center is the ability to benefit from the utility term, e.g., uncertainty values in an active learning setting. The second difference is that we relax the need to cover every point in the original dataset, which is particularly beneficial for large datasets, where we prefer to avoid instability due to certain corner or outlier points.

Bhaskara et al. (2016) study various distance-based diversity metrics subject to a cardinality constraint. In particular, they give a $1/8$-approximation algorithm for the sum-min diversity objective defined as $\sum_{u \in S}\min_{v \in S \setminus \{u\}}\textnormal{dist}(u,v)$ by rounding a linear programming relaxation, and they complement this result by proving a conditional $1/2$-hardness of approximation under the planted clique assumption. They also study the min-min objective $\min_{u,v \in S : u \neq v}\textnormal{dist}(u,v)$, which is the same as $\texttt{div}(S)$. They do not have a linear weight function in their objective, so they enforce a strict cardinality constraint $|S| = k$; otherwise, it is always better to output at most one or two points. Meinl et al. (2011) study the same objective subject to $|S| = k$ under the name $p$-dispersion (also called max-min or remote-edge) diversity maximization. This cardinality issue does not arise in our formulation since we need to optimize the trade-off between competing linear weight (utility) and diversity terms.

Moumoulidou et al. (2020), Addanki et al. (2022), and Mahabadi and Trajanovski (2023) initiated the study of diversity maximization under fairness constraints. Given a partitioning of the dataset into $m$ groups, they ensure that a prespecified number of points $k_i$ is selected from each group $i \in [m]$. This fairness constraint (i.e., $|S| = k = \sum_{i=1}^{m} k_i$) considerably differentiates their problem setting from ours. Moumoulidou et al. (2020) and Addanki et al. (2022) maximize the $\texttt{div}(S)$ objective, and Mahabadi and Trajanovski (2023) consider the sum-of-pairwise-distances $\sum_{u \in S}\sum_{v \in S}\textnormal{dist}(u,v)$ and sum-min objectives. None of these works, however, add a linear weight function to the objective.

Data sampling.

In the realm of dataset curation and active learning, several lines of work study the combination of utility and diversity terms. Ash et al. (2020) invoke $k$-means++ seeding over the gradient space of a model in order to balance uncertainty and diversity. Wei et al. (2015) introduce several submodular objectives, e.g., facility location, and use them to diversify a set of uncertain examples in each active learning iteration. Citovsky et al. (2021) cluster examples represented by embeddings extracted from the penultimate layer of a partially trained DNN, and then use these clusters to diversify uncertain examples in each iteration. Our work differs from these and several others (e.g., Kirsch et al. (2019); Zhdanov (2019)) in that we directly incorporate both the utility and diversity terms in our objective function.

2 Preliminaries

We are given a set of $n$ points $V \subseteq M$ in a metric space $(M, \textnormal{dist})$, where $\textnormal{dist} : M \times M \rightarrow \mathbb{R}_{\geq 0}$ is the distance function. We are also given weights $w : V \rightarrow \mathbb{R}_{\geq 0}$ representing the utility of each point. We extend these functions to take subsets $S \subseteq V$ as input in the standard way: $w(S) = \sum_{v \in S} w(v)$ and $\textnormal{dist}(u, S) = \min_{v \in S}\textnormal{dist}(u,v)$. For the sake of completeness, we define $\textnormal{dist}(u, \varnothing) = \infty$.

Intersection graph.

Let $G_d(V)$ denote the intersection graph of $V$ for distance threshold $d$, i.e., the graph with node set $V$ and edge set

$$E = \{(u,v) \in V^{2} : \textnormal{dist}(u,v) < d\}.$$

This graph is well-defined for any metric space and distance $d \in \mathbb{R}$. Note that for any independent set $S$ of $G_d(V)$ and pair of distinct nodes $u, v \in S$, we have $\textnormal{dist}(u,v) \geq d$.
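As a concrete illustration, the edge set of $G_d(V)$ can be enumerated directly from a pairwise distance matrix. The following Python sketch assumes a precomputed dense matrix and is quadratic in $n$; the function name is illustrative and not part of any released implementation.

import numpy as np

def intersection_graph_edges(dist_matrix, d):
    """Edge set of G_d(V): all pairs (u, v) with dist(u, v) < d."""
    n = dist_matrix.shape[0]
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if dist_matrix[u, v] < d]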

Definition 2.1.

This work focuses on the min-distance diversity reward function

$$\texttt{div}(S) = \begin{cases}\min_{u,v \in S : u \neq v}\textnormal{dist}(u,v) & \text{if } |S| \geq 2,\\ \max_{u,v \in V}\textnormal{dist}(u,v) & \text{if } |S| \leq 1.\end{cases}$$

This reward function encourages selecting a well-separated set of points (generalizing the notion of data deduplication). To make $\texttt{div}(S)$ well-defined if $S$ is empty or consists of a single point, we set the diversity to be the diameter of the whole ground set $V$, implying that it is always beneficial to select at least two points to maximize the value of the selected set.

For any $\alpha \in [0,1]$, we define the objective function

$$f(S) = \alpha\frac{1}{k}\sum_{v \in S} w(v) + (1-\alpha)\,\texttt{div}(S), \tag{1}$$

which allows us to control the trade-off between the average utility and the diversity of the set $S$. The goal of the min-distance diverse data summarization problem (MDDS) is to maximize $f(S)$ subject to a cardinality constraint $k$:

$$S^{*} = \operatorname*{arg\,max}_{S \subseteq V : |S| \leq k} f(S).$$

Let $\textnormal{OPT} = f(S^{*})$ denote the optimal objective value.
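For concreteness, the following Python sketch evaluates the objective in Equation (1) from a precomputed pairwise distance matrix, with the diameter fallback of Definition 2.1 when $|S| \leq 1$; it is an illustrative helper, not the authors' code.

import itertools
import numpy as np

def objective(S, weights, dist_matrix, k, alpha):
    """Evaluate f(S) = alpha * w(S) / k + (1 - alpha) * div(S)."""
    S = list(S)
    if len(S) >= 2:
        div = min(dist_matrix[u, v] for u, v in itertools.combinations(S, 2))
    else:
        div = dist_matrix.max()  # diameter of the ground set V (Definition 2.1)
    return alpha * sum(weights[v] for v in S) / k + (1 - alpha) * div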

Simple $1/2$-approximation algorithm.

Equation (1) has two parts, each of which is easy to optimize in isolation. To maximize the $w(S)$ term, we simply select the $k$ heaviest points. In contrast, to maximize the $\texttt{div}(S)$ term, we select two points whose distance is the diameter of $V$. Outputting a subset that corresponds to the maximum of these two terms gives a $1/2$-approximation guarantee. To see this, observe that

$$\begin{aligned}
\textnormal{ALG}_{\textnormal{simple}} &= \max\left\{\alpha\frac{1}{k}\max_{|S|\leq k} w(S),\; (1-\alpha)\max_{u,v\in V}\textnormal{dist}(u,v)\right\} \\
&\geq \frac{1}{2}\cdot\alpha\frac{1}{k}\max_{|S|\leq k} w(S) + \frac{1}{2}\cdot(1-\alpha)\max_{u,v\in V}\textnormal{dist}(u,v) \\
&\geq \frac{1}{2}\left(\alpha\frac{1}{k} w(S^{*}) + (1-\alpha)\,\texttt{div}(S^{*})\right) = \frac{1}{2}\cdot\textnormal{OPT},
\end{aligned}$$

where $\texttt{div}(S)$ is defined in Definition 2.1. Our main goals are to improve over this simple baseline and prove complementary hardness results.
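A minimal sketch of this baseline, reusing the objective helper above and assuming a dense distance matrix:

import numpy as np

def simple_baseline(weights, dist_matrix, k, alpha):
    """1/2-approximation: the better of the k heaviest points and a diametral pair."""
    heaviest = list(np.argsort(weights)[::-1][:k])
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    candidates = [heaviest] + ([[int(u), int(v)]] if k >= 2 else [])
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))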

3 Algorithm

We now present the GIST algorithm, which achieves a $(2/3-\varepsilon)$-approximation for the MDDS problem. It works by repeatedly calling the GreedyIndependentSet heuristic to find an independent set of the intersection graph $G_d(V)$ for various distance thresholds $d$. The subroutine GreedyIndependentSet sorts the points by weight in non-increasing order, breaking ties arbitrarily. It starts with an empty set of selected points, and then it scans the points according to the sorted order. For each point, GreedyIndependentSet selects the point as long as its distance to the set of selected points so far is at least $d$ and the number of selected points does not exceed $k$.

The GIST algorithm computes multiple candidate solutions and returns the one with the maximum value $f(S)$. It starts by picking the $k$ points with the largest weight as the initial solution, which can be done by calling GreedyIndependentSet with the distance threshold set to zero. We then consider a set of distance thresholds as follows. Let $d_{\max}$ be the diameter of the ground set $V$, i.e., $d_{\max} = \max_{u,v\in V}\textnormal{dist}(u,v)$. Define the set of distance thresholds $D \leftarrow \{(1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 : (1+\varepsilon)^{i} \leq 2/\varepsilon \text{ and } i \in \mathbb{Z}_{\geq 0}\}$. For each $d \in D$, we call the subroutine $\textnormal{GreedyIndependentSet}(V, w, d, k)$ to find a set $T$ of size at most $k$. If the newly generated set $T$ has a larger objective value than our best solution so far, then we update $S \leftarrow T$. After considering all distance thresholds in $D$, the highest-value solution among all candidates is the output of the GIST algorithm.

Algorithm 1 Diversified data summarization using greedy-weighted maximum independent sets.
1: function GIST(points $V$, weights $w : V \rightarrow \mathbb{R}_{\geq 0}$, budget $k$, error $\varepsilon$)
2:     Initialize $S \leftarrow \textnormal{GreedyIndependentSet}(V, w, 0, k)$  ▷ $k$ heaviest points
3:     Let $d_{\max} = \max_{u,v\in V}\textnormal{dist}(u,v)$ be the diameter of $V$
4:     Let $T \leftarrow \{u, v\}$ be two points such that $\textnormal{dist}(u,v) = d_{\max}$
5:     if $f(T) > f(S)$ and $k \geq 2$ then
6:         Update $S \leftarrow T$
7:     Let $D \leftarrow \{(1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 : (1+\varepsilon)^{i} \leq 2/\varepsilon \text{ and } i \in \mathbb{Z}_{\geq 0}\}$  ▷ distance thresholds
8:     for threshold $d \in D$ do
9:         Set $T \leftarrow \textnormal{GreedyIndependentSet}(V, w, d, k)$
10:         if $f(T) \geq f(S)$ then
11:             Update $S \leftarrow T$
12:     return $S$

1: function GreedyIndependentSet(points $V$, weights $w : V \rightarrow \mathbb{R}_{\geq 0}$, distance $d$, budget $k$)
2:     Initialize $S \leftarrow \varnothing$
3:     for $v \in V$ sorted by non-increasing weight do
4:         if $\textnormal{dist}(v, S) \geq d$ and $|S| < k$ then
5:             Update $S \leftarrow S \cup \{v\}$
6:     return $S$
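The following Python sketch mirrors Algorithm 1 with brute-force distance-to-set checks and reuses the objective helper from Section 2; it is an illustration of the pseudocode above rather than the implementation used in the experiments.

import numpy as np

def greedy_independent_set(weights, dist_matrix, d, k):
    """Scan points by non-increasing weight; keep each point at distance >= d from S."""
    S = []
    for v in np.argsort(weights)[::-1]:
        if len(S) >= k:
            break
        if not S or min(dist_matrix[v, u] for u in S) >= d:
            S.append(int(v))
    return S

def gist(weights, dist_matrix, k, alpha, eps):
    d_max = dist_matrix.max()
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    candidates = [greedy_independent_set(weights, dist_matrix, 0.0, k)]  # k heaviest points
    if k >= 2:
        candidates.append([int(u), int(v)])                              # diametral pair
    d = eps * d_max / 2                      # thresholds (1 + eps)^i * eps * d_max / 2
    while 0 < d <= d_max:
        candidates.append(greedy_independent_set(weights, dist_matrix, d, k))
        d *= 1 + eps
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))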
Theorem 3.1.

For any $\varepsilon > 0$, GIST outputs a set $S \subseteq V$ with $|S| \leq k$ and $f(S) \geq (2/3-\varepsilon)\cdot\textnormal{OPT}$.

The main building block in our design and analysis of GIST is the following lemma, which is inspired by the greedy $2$-approximation algorithm for metric $k$-center.

Lemma 3.2.

Let $S_d^{*}$ be a max-weight independent set of the intersection graph $G_d(V)$ of size at most $k$. If $T$ is the output of $\textnormal{GreedyIndependentSet}(V, w, d', k)$, for $d' \leq d/2$, then $w(T) \geq w(S_d^{*})$.

Proof.

Let $k' = |T|$ and $t_1, t_2, \dots, t_{k'}$ be the points in $T$ in the order that GreedyIndependentSet selected them. Let $B_i = \{v \in V : \textnormal{dist}(t_i, v) < d'\}$ be the points in $V$ contained in the radius-$d'$ open ball around $t_i$. First, we show that each $B_i$ contains at most one point in $S_d^{*}$. If this is not true, then some $B_i$ contains two distinct points $u, v \in S_d^{*}$. Since $\textnormal{dist}(\cdot,\cdot)$ is a metric, this means

$$\textnormal{dist}(u,v) \leq \textnormal{dist}(u, t_i) + \textnormal{dist}(t_i, v) < d' + d' \leq d/2 + d/2 = d,$$

which contradicts the assumption that $S_d^{*}$ is an independent set of $G_d(V)$. Note that it is possible to have $B_i \cap B_j \neq \varnothing$, for $i \neq j$, since these balls consider all points in $V$.

Now let $C_i \subseteq V$ be the set of uncovered (by the open balls) points that become covered when GreedyIndependentSet selects $t_i$. Concretely, $C_i = B_i \setminus (B_1 \cup \cdots \cup B_{i-1})$. Each $C_i$ contains at most one point in $S_d^{*}$ since $|B_i \cap S_d^{*}| \leq 1$. Moreover, if $s \in C_i \cap S_d^{*}$, then $w(t_i) \geq w(s)$ because the points are sorted in non-increasing order of weight and selected if uncovered.

Let $A = C_1 \cup \cdots \cup C_{k'}$ be the set of points covered by the algorithm. For each point $s \in S_d^{*} \cap A$, there is exactly one covering set $C_i$ corresponding to $s$. It follows that

$$\sum_{s \in S_d^{*} \cap A} w(s) \leq \sum_{i \in [k'] : S_d^{*} \cap C_i \neq \varnothing} w(t_i). \tag{2}$$

It remains to account for the points in $S_d^{*} \setminus A$. Note that if we have any such points, then $|T| = k$ since the points in $S_d^{*} \setminus A$ are uncovered at the end of the algorithm. Further, for any $t_i \in T$ and $s \in S_d^{*} \setminus A$, we have $w(t_i) \geq w(s)$ since $t_i$ was selected and the points are sorted by non-increasing weight. Therefore, we can assign each $s \in S_d^{*} \setminus A$ to a unique $C_i$ such that $C_i \cap S_d^{*} = \varnothing$. It follows that

$$\sum_{s \in S_d^{*} \setminus A} w(s) \leq \sum_{i \in [k'] : S_d^{*} \cap C_i = \varnothing} w(t_i). \tag{3}$$

Adding Equations (2) and (3) together gives $w(S_d^{*}) \leq \sum_{i \in [k']} w(t_i) = w(T)$, which completes the proof. ∎

Equipped with this bicriteria approximation guarantee, we can now analyze the approximation ratio of the GIST algorithm.

Proof of Theorem 3.1.

Let $d^{*}$ be the minimum distance between two distinct points in $S^{*}$. There are two cases: $d^{*} < \varepsilon d_{\max}$ or $d^{*} \geq \varepsilon d_{\max}$. For the first case, outputting the $k$ heaviest points (Line 2) achieves a $(1-\varepsilon)$-approximation. To see this, observe that

$$\textnormal{OPT} \geq (1-\alpha)d_{\max} > (1-\alpha)\cdot\frac{d^{*}}{\varepsilon} \implies \varepsilon\cdot\textnormal{OPT} > (1-\alpha)d^{*}.$$

The total weight of the $k$ heaviest points upper bounds $w(S^{*})$, so it follows that

$$\textnormal{ALG} \geq \alpha\frac{1}{k}w(S^{*}) = \textnormal{OPT} - (1-\alpha)d^{*} > (1-\varepsilon)\textnormal{OPT}.$$

Therefore, we focus on the second case where $d^{*} \geq \varepsilon d_{\max}$. It follows that GIST considers a threshold $d = (1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 \in D$ such that

$$\frac{d^{*}}{2} \in [d, (1+\varepsilon)d) \implies \frac{1}{1+\varepsilon}\cdot\frac{d^{*}}{2} < d \leq \frac{d^{*}}{2}.$$

Using Lemma 3.2, $\textnormal{GreedyIndependentSet}(V, w, d, k)$ outputs a set $T$ such that

$$f(T) \geq \alpha\frac{1}{k}w(S^{*}) + (1-\alpha)d \geq \alpha\frac{1}{k}w(S^{*}) + (1-\alpha)\frac{d^{*}}{2(1+\varepsilon)}. \tag{4}$$

The potential max-diameter update to $S$ on Line 5 gives us

$$\textnormal{ALG} \geq (1-\alpha)d_{\max} \geq (1-\alpha)d^{*}. \tag{5}$$

Putting Equations (4) and (5) together, for any $p \in [0,1]$, it holds that

$$\begin{aligned}
\textnormal{ALG} &\geq p\left[\alpha\frac{1}{k}w(S^{*}) + (1-\alpha)\frac{d^{*}}{2(1+\varepsilon)}\right] + (1-p)(1-\alpha)d^{*} \\
&= p\cdot\alpha\frac{1}{k}w(S^{*}) + \left(1 - p + \frac{p}{2(1+\varepsilon)}\right)\cdot(1-\alpha)d^{*}.
\end{aligned}$$

To maximize the approximation ratio as $\varepsilon \rightarrow 0$, solve $p = 1 - p/2$ to get $p = 2/3$. It follows that

$$\begin{aligned}
\textnormal{ALG} &\geq \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \left(1 - \frac{2}{3} + \frac{1}{3(1+\varepsilon)}\right)\cdot(1-\alpha)d^{*} \qquad (6) \\
&= \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{1}{3}\left(1 + \frac{1}{1+\varepsilon}\right)\cdot(1-\alpha)d^{*} \\
&\geq \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{1}{3}\left(2 - \varepsilon\right)\cdot(1-\alpha)d^{*} \\
&\geq \frac{2}{3}\left(1-\varepsilon\right)\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{2}{3}\left(1-\varepsilon\right)\cdot(1-\alpha)d^{*} \\
&\geq \left(\frac{2}{3} - \varepsilon\right)\cdot\textnormal{OPT},
\end{aligned}$$

which completes the proof. ∎

Remark 3.3.

We can modify GIST to achieve an exact $2/3$-approximation by trying all pairwise distance thresholds in $D = \{\textnormal{dist}(u,v)/2 : u, v \in V\}$. This requires $O(n^2)$ calls to GreedyIndependentSet compared to $O(\log_{1+\varepsilon}(1/\varepsilon))$, but otherwise the same analysis holds for $\varepsilon = 0$.
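Under the same assumptions as the sketch after Algorithm 1, the exact-threshold variant only changes the set of thresholds that are tried:

def gist_exact(weights, dist_matrix, k, alpha):
    """Remark 3.3: try every half pairwise distance (plus threshold 0) as d."""
    n = dist_matrix.shape[0]
    thresholds = {0.0} | {dist_matrix[u, v] / 2
                          for u in range(n) for v in range(u + 1, n)}
    candidates = [greedy_independent_set(weights, dist_matrix, d, k) for d in thresholds]
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    if k >= 2:
        candidates.append([int(u), int(v)])
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))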

Finally, we show how to implement GIST in nearly linear time for low-dimensional Euclidean metrics using cover trees for exact nearest-neighbor queries (Beygelzimer et al., 2006; Izbicki and Shelton, 2015) and a $(1-\varepsilon)$-approximation algorithm for the point-set diameter $d_{\max}$ (Barequet and Har-Peled, 2001; Imanparast et al., 2018).

Theorem 3.4.

For $V \subseteq \mathbb{R}^{d}$ and Euclidean distance, GIST runs in time $O(n^3 k + n^2 d)$. Furthermore, for fixed dimension $d$, the running time can be improved to $O(n\log n\cdot\log_{1+\varepsilon}(1/\varepsilon)\cdot\textnormal{poly}(d) + 1/\varepsilon^{d-1})$.

We give the proof in Section A.1 and note that the key property is that cover trees allow us to insert a point into a nearest-neighbor index in $O(c^{12}\log n)$ time and find a nearest neighbor to a query point in $O(c^{6}\log n)$ time, where $c = \Theta(d)$ is the doubling dimension of $d$-dimensional Euclidean space.

4 Hardness of approximation

We start by summarizing our hardness results for the MDDS problem. Assuming $\textnormal{P} \neq \textnormal{NP}$, we prove that for any $\varepsilon > 0$:

  1. There is no polynomial-time $(2/3+\varepsilon)$-approximation algorithm for general metric spaces.

  2. There is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for general metric spaces, even if $k \geq |V|/4$.

  3. APX-completeness for the Euclidean metric, i.e., there is no polynomial-time approximation scheme (PTAS) for this objective.

4.1 General metric spaces

Our first result builds on the hardness of approximating the size of the maximum clique, and it tightly complements the $2/3$-approximation guarantee that we can achieve with GIST.

Theorem 4.1.

For any $\varepsilon > 0$, there is no polynomial-time $(2/3+\varepsilon)$-approximation algorithm for the MDDS problem, unless $\textnormal{P} = \textnormal{NP}$.

Proof.

First, recall that a clique is a subset of vertices in an undirected graph such that there is an edge between every pair of its vertices. Håstad (1996) and Zuckerman (2006) showed that the maximum clique problem does not admit an $n^{1-\theta}$-approximation for any constant $\theta > 0$, unless $\textnormal{NP} = \textnormal{P}$. This implies that there is no constant-factor approximation algorithm for maximum clique. In other words, for any constant $0 < \delta \leq 1$, there exists a graph $G$ and a threshold integer value $k$ such that it is NP-hard to distinguish between the following two cases:

  • YES instance: Graph $G$ has a clique of size $k$.

  • NO instance: Graph $G$ does not have a clique of size greater than $\delta k$.

We reduce this instance of the maximum clique decision problem to MDDS with objective function (1) as follows. Represent each vertex of graph $G$ with a point in our ground set. The distance between a pair of points is $2$ if there is an edge between their corresponding vertices in $G$, and it is $1$ otherwise.
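The reduction's metric is easy to construct explicitly; below is a small sketch (with illustrative names) that maps edges to distance $2$ and non-edges to distance $1$, which satisfies the triangle inequality since $2 \leq 1 + 1$.

import numpy as np

def clique_reduction_metric(n, edges):
    """Theorem 4.1 reduction: distance 2 between adjacent vertices of G, 1 otherwise."""
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for u, v in edges:
        dist[u, v] = dist[v, u] = 2.0
    return dist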

Use the same threshold value of $k$ (in the YES and NO instances above) for the cardinality constraint on set $S$, and set each weight $w(v) = 1$. In a YES instance, selecting a clique of size $k$ as the set $S$ results in the maximum possible value of the objective:

$$\textnormal{OPT} = \alpha\cdot\frac{1}{k}\cdot k + (1-\alpha)\cdot 2 = 2 - \alpha. \tag{7}$$

In a NO instance, the best objective value that can be achieved in polynomial time is the maximum of the following two scenarios: (a) selecting $k$ points with minimum distance $1$, or (b) selecting at most $\delta k$ vertices forming a clique with minimum distance $2$. The maximum value obtained by any polynomial-time algorithm is then

$$\textnormal{ALG} = \max\{\alpha + (1-\alpha)\cdot 1,\; \alpha\cdot\delta + (1-\alpha)\cdot 2\} = \max\{1,\; 2 - (2-\delta)\alpha\}.$$

These two terms become equal for $\alpha = 1/(2-\delta)$. Thus, the gap between the maximum value any algorithm can achieve in the NO case and the optimum value in the YES case is

$$\frac{1}{2-\alpha} = \frac{1}{2 - 1/(2-\delta)} = \frac{2-\delta}{3-2\delta}.$$

To complete the proof, it suffices to show that the ratio above is at most $2/3 + \varepsilon$. We separate the $2/3$ term as follows:

$$\frac{2-\delta}{3-2\delta} = \frac{\frac{2}{3}(3-2\delta) + \frac{\delta}{3}}{3-2\delta} = \frac{2}{3} + \frac{\delta}{9-6\delta}.$$

Therefore, we must choose a value of $\delta$ satisfying $\delta/(9-6\delta) \leq \varepsilon$. Since $\delta \leq 1$, the denominator $9-6\delta$ is positive. Equivalently, we want to satisfy:

$$\frac{9-6\delta}{\delta} = \frac{9}{\delta} - 6 \geq \frac{1}{\varepsilon}.$$

By setting $\delta < 9\varepsilon/(1+6\varepsilon)$, we satisfy the required inequality and achieve the inapproximability gap in the theorem statement. ∎

4.2 Large subsets

In many data curation problems, we are interested in instances of MDDS where $k = \Omega(|V|)$. Even in this restricted case, we can show hardness of approximation.

Theorem 4.2.

For any $\varepsilon > 0$, there is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for the MDDS problem, even if $k \geq |V|/4$, unless $\textnormal{P} = \textnormal{NP}$.

The proof of this theorem uses the vertex cover instance constructed in the proof of Håstad (2001, Theorem 8.1) and is given in Section B.1.

4.3 Euclidean metrics

Our final result is specific to the Euclidean metric, i.e., $S \subseteq \mathbb{R}^{d}$ with the $L^2$ distance between points. It builds on a result of Alimonti and Kann (2000), which says that the size of a maximum independent set in a bounded-degree graph cannot be approximated to within a factor of $1-\varepsilon$, for some constant $\varepsilon > 0$, unless $\textnormal{P} = \textnormal{NP}$. In other words, it does not admit a PTAS.

Lemma 4.3 (Alimonti and Kann (2000, Theorem 3.2)).

The maximum independent set problem for graphs with degree at most $3$ is APX-complete.

Similar to our reduction from maximum clique in Section 4.1 for general metric spaces, this construction uses a node embedding function $h_G(v)$ to encode graph adjacency in Euclidean space.

Lemma 4.4.

Let $G = (V, E)$ be a simple undirected graph with $n = |V|$, $m = |E|$, and maximum degree $\Delta$. There exists an embedding $h_G : V \rightarrow \mathbb{R}^{n+m}$ such that if $\{u,v\} \in E$ then

$$\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \frac{1}{2(\Delta+1)},$$

and if $\{u,v\} \not\in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

Theorem 4.5.

The MDDS problem is APX-complete for Euclidean metrics.

We sketch the main idea of our reduction and defer the complete proofs of Lemma 4.4 and Theorem 4.5 to Appendix B. Let $\mathcal{I}(G)$ be the set of independent sets of $G$. A set of embedded nodes $S \subseteq V$ (with at least two nodes) of a graph $G$ with maximum degree $\Delta \leq 3$ has the property:

  • If $S \in \mathcal{I}(G)$: $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

  • If $S \not\in \mathcal{I}(G)$: $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - 1/(2(\Delta+1)) \leq 1 - 1/8$.

This gap between the two cases is at least $1/8$ (i.e., a universal constant) for all such graphs, and it allows us to show that MDDS inherits the APX-completeness of bounded-degree maximum independent set.

5 Experiments

5.1 Warm-up: Synthetic dataset

Setup.

We start with a simple dataset of weighted random points to see the shortcomings of existing methods for MDDS. For a fixed dimension $d$, we generate $n = 1000$ points $\mathbf{x}_i \in \mathbb{R}^{d}$, where each $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and each point is assigned an i.i.d. uniform continuous weight $w_i \sim \mathcal{U}_{[0,1]}$.
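A sketch of this setup in numpy (the random seed is arbitrary and not specified in the paper):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)            # arbitrary seed
n, d = 1000, 64
X = rng.standard_normal((n, d))           # x_i ~ N(0, I_d)
weights = rng.uniform(0.0, 1.0, size=n)   # w_i ~ U[0, 1]
dist_matrix = cdist(X, X)                 # pairwise Euclidean distances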

We consider three baseline algorithms: random, simple, and greedy. The random algorithm chooses a random set of $k$ points, permutes them, and returns the best prefix. simple is the $1/2$-approximation algorithm in Section 2. greedy builds a set $S \subseteq V$ one element at a time by selecting the point with index $i_t^{*} = \operatorname*{arg\,max}_{i \in V \setminus S_{t-1}} f(S_{t-1} \cup \{i\}) - f(S_{t-1})$ at each step $t \in [k]$. Then it returns the best prefix $S = \operatorname*{arg\,max}_{t \in [k]} f(S_t)$, as this sequence of objective values is not necessarily monotone.
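A sketch of the greedy baseline under the same assumptions, reusing the objective helper from Section 2 (since $f(S_{t-1})$ is constant within a step, maximizing the marginal gain is the same as maximizing $f(S_{t-1} \cup \{i\})$):

def greedy_baseline(weights, dist_matrix, k, alpha):
    """Greedily add the point with the largest f(S + {i}); return the best prefix."""
    n = dist_matrix.shape[0]
    S, remaining = [], set(range(n))
    best_prefix, best_value = [], float("-inf")
    for _ in range(min(k, n)):
        values = {i: objective(S + [i], weights, dist_matrix, k, alpha) for i in remaining}
        i_star = max(values, key=values.get)
        S.append(i_star)
        remaining.remove(i_star)
        if values[i_star] > best_value:
            best_value, best_prefix = values[i_star], list(S)
    return best_prefix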

Figure 1: Synthetic data with $n = 1000$ and $d = 64$. Plots of $f(S_{\text{ALG}})$ for $k \in [n]$ and $\alpha \in \{0.90, 0.95\}$.

Results.

We plot the values of $f(S)$ for our baseline algorithms and GIST for all $k \in [n]$ in Figure 1. We set $\alpha \in \{0.90, 0.95\}$ to balance the two competing objective terms, which are easy to optimize in isolation: if $\mathbf{x}, \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ are i.i.d., then $\mathbb{E}[\lVert\mathbf{x}-\mathbf{y}\rVert_2^2] = 2d$, so $\frac{\alpha}{2} \approx (1-\alpha)\sqrt{2d} \implies \alpha \approx 1 - \frac{1}{2\sqrt{2d}+1} \approx 0.957$. There is a clear separation in solution quality for each $k \in [n]$ when $\alpha = 0.95$. Note that these objective values are not directly comparable as a function of $k$ due to the $1/k$ normalizing factor (hence the discontinuities), but both terms of the objective (the mean utility and the minimum distance of the selected points) decrease on average in $k$ for a fixed $\alpha$. GIST noticeably outperforms simple, greedy, and random. The poor performance of greedy is somewhat surprising since it is normally competitive for submodular-style objectives. simple also outperforms greedy for all but the extreme values of $k$, i.e., very small or close to $n$, but is always worse relative to GIST.

5.2 Image classification

The following data sampling experiment compares the top-1 image classification accuracy achieved by different single-shot subset selection algorithms.

Setup.

We use the standard vision dataset ImageNet (Russakovsky et al., 2015), which contains ~1.3 million images and 1000 classes. We select 10% of the data points at random and use them to train an initial ResNet-56 model $\bm{\theta}_0$ (He et al., 2016). Then we use this initial model to compute an uncertainty value and a 2048-dimensional embedding for each example. The uncertainty value of $\mathbf{x}_i$ is $u_i = 1 - \Pr(y = b \mid \mathbf{x}_i;\ \bm{\theta}_0)$, where $b$ is the best class label according to the initial model. Finally, we use the fast maximum inner product search of Guo et al. (2020) to build a $\Delta$-nearest neighbor graph in the embedding space using $\Delta = 10$ and cosine distance. We present all of the model training hyperparameters in Appendix C.
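The paper builds the $\Delta$-nearest neighbor graph with the accelerated maximum inner product search of Guo et al. (2020); as a rough stand-in, a brute-force cosine-distance nearest-neighbor index in scikit-learn looks like the following (the embeddings array and its path are hypothetical).

import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.load("embeddings.npy")   # hypothetical path: (num_examples, 2048) penultimate-layer features
index = NearestNeighbors(n_neighbors=10 + 1, metric="cosine").fit(embeddings)
distances, neighbors = index.kneighbors(embeddings)
# Column 0 is each point's self-match; drop it to get the Delta = 10 neighbor graph.
knn_dist, knn_idx = distances[:, 1:], neighbors[:, 1:]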

We compare GIST with three state-of-the-art benchmarks:

  • random: We select samples from the dataset uniformly at random (and without replacement). While being a simple and lightweight approach, this implicitly encodes diversity in many settings and provides good solutions.

  • margin (Roth and Small, 2006): Margin sampling selects the top-$k$ points based on how hard they are to classify, i.e., using the margin scores $m_i = 1 - (\Pr(y = b \mid \mathbf{x}_i;\ \bm{\theta}_0) - \Pr(y = b' \mid \mathbf{x}_i;\ \bm{\theta}_0))$, which measure the difference between the probabilities of the best class label $b$ and the second-best class label $b'$ for an example.

  • $k$-center (Sener and Savarese, 2018): We run the classic greedy algorithm for $k$-center on the $\Delta$-nearest neighbor graph. The distance between non-adjacent nodes is implicitly assumed to be very large, which is acceptable in the large-budget regime $k = \Omega(n)$.

For this task, GIST uses the uncertainty values as its weights $w_i = u_i$ and the same graph as $k$-center, and it sets $\alpha = 0.85$.
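The uncertainty and margin scores above are simple functions of the initial model's softmax output; a sketch assuming a hypothetical array probs of per-example class probabilities:

import numpy as np

# probs: hypothetical (num_examples, num_classes) softmax outputs of the initial model theta_0
top2 = np.sort(probs, axis=1)[:, -2:]      # columns: second-best and best class probabilities
uncertainty = 1.0 - top2[:, 1]             # u_i = 1 - Pr(y = b | x_i; theta_0)
margin = 1.0 - (top2[:, 1] - top2[:, 0])   # m_i = 1 - (Pr(best) - Pr(second best))
gist_weights = uncertainty                 # GIST uses w_i = u_i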

Results.

We run each algorithm with cardinality constraint $k$ on the full dataset to get a subset of examples that we use to train a new ResNet-56 model. We report the top-1 classification accuracy of these models in Table 1.

$k$ (% of data) random margin $k$-center GIST
30 65.02 67.64 67.32 67.70
40 68.92 70.45 70.98 71.20
50 70.86 73.41 72.51 73.67
60 72.39 74.70 74.78 75.16
70 72.92 75.57 75.62 75.85
80 74.36 75.39 76.26 76.49
90 75.10 76.11 76.02 76.50
Table 1: Top-1 classification accuracy (%) on ImageNet for different single-shot data downsampling algorithms. The cardinality constraint $k$ is expressed as a fraction of the ~1.3 million examples.

Conclusion

This work proposes a novel subset selection problem called MDDS that combines the total utility of the selected points with the $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\textnormal{dist}(u,v)$ diversity objective. We design and analyze the GIST algorithm, which achieves a $2/3$-approximation guarantee for general metric spaces by solving a series of maximum independent set instances on geometric intersection graphs with the GreedyIndependentSet bicriteria approximation algorithm. We also give a nearly linear-time implementation of GIST for points in low-dimensional Euclidean space. We complement our algorithm with a $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$, in general metric spaces. We also prove hardness of approximation in the practical cases of Euclidean metrics or $k = \Omega(n)$. Finally, we present an empirical study showing that GIST steadily outperforms baseline methods for MDDS (in particular greedy) on a simple set of Gaussian points with random weights. We conclude with an experiment comparing the top-1 image classification accuracy achieved by GIST and three other state-of-the-art data sampling methods for ImageNet.

References

  • Addanki et al. (2022) R. Addanki, A. McGregor, A. Meliou, and Z. Moumoulidou. Improved approximation and scalability for fair max-min diversification. In Proceedings of the 25th International Conference on Database Theory, pages 7:1–7:21, 2022.
  • Alimonti and Kann (2000) P. Alimonti and V. Kann. Some APX-completeness results for cubic graphs. Theoretical Computer Science, 237(1-2):123–134, 2000.
  • Anil et al. (2023) R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Ash et al. (2020) J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Barequet and Har-Peled (2001) G. Barequet and S. Har-Peled. Efficiently approximating the minimum-volume bounding box of a point set in three dimensions. Journal of Algorithms, 38(1):91–109, 2001.
  • Beygelzimer et al. (2006) A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104, 2006.
  • Bhaskara et al. (2016) A. Bhaskara, M. Ghadiri, V. Mirrokni, and O. Svensson. Linear relaxations for finding diverse elements in metric spaces. Advances in Neural Information Processing Systems, 29:4098–4106, 2016.
  • Citovsky et al. (2021) G. Citovsky, G. DeSalvo, C. Gentile, L. Karydas, A. Rajagopalan, A. Rostamizadeh, and S. Kumar. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
  • Guo et al. (2020) R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pages 3887–3896. PMLR, 2020.
  • Håstad (1996) J. Håstad. Clique is hard to approximate within $n^{1-\varepsilon}$. In Proceedings of the 37th Conference on Foundations of Computer Science, pages 627–636, 1996.
  • Håstad (2001) J. Håstad. Some optimal inapproximability results. Journal of the ACM, 48(4):798–859, 2001.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Imanparast et al. (2018) M. Imanparast, S. N. Hashemi, and A. Mohades. An efficient approximation for point-set diameter in higher dimensions. In Proceedings of the 30th Canadian Conference on Computational Geometry, pages 72–77, 2018.
  • Izbicki and Shelton (2015) M. Izbicki and C. Shelton. Faster cover trees. In International Conference on Machine Learning, pages 1162–1170. PMLR, 2015.
  • Kazhamiaka et al. (2021) F. Kazhamiaka, M. Zaharia, and P. Bailis. Challenges and opportunities for autonomous vehicle query systems. In Proceedings of the Conference on Innovative Data Systems Research, 2021.
  • Kirsch et al. (2019) A. Kirsch, J. van Amersfoort, and Y. Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Mahabadi and Trajanovski (2023) S. Mahabadi and S. Trajanovski. Core-sets for fair and diverse data summarization. Advances in Neural Information Processing Systems, 2023.
  • Meinl et al. (2011) T. Meinl, C. Ostermann, and M. R. Berthold. Maximum-score diversity selection for early drug discovery. Journal of Chemical Information and Modeling, 51(2):237–247, 2011.
  • Moumoulidou et al. (2020) Z. Moumoulidou, A. McGregor, and A. Meliou. Diverse data selection under fairness constraints. In Proceedings of the 24th International Conference on Database Theory, pages 13:1–13:25, 2020.
  • Roth and Small (2006) D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the 17th European Conference on Machine Learning, 2006.
  • Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
  • Sener and Savarese (2018) O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • Wei et al. (2015) K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963. PMLR, 2015.
  • Zhdanov (2019) F. Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
  • Zuckerman (2006) D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 681–690, 2006.

Appendix A Missing analysis for Section 3

A.1 Proof of Theorem 3.4

Theorem 3.4 (restated). For $V \subseteq \mathbb{R}^{d}$ and Euclidean distance, GIST runs in time $O(n^3 k + n^2 d)$. Furthermore, for fixed dimension $d$, the running time can be improved to $O(n\log n\cdot\log_{1+\varepsilon}(1/\varepsilon)\cdot\textnormal{poly}(d) + 1/\varepsilon^{d-1})$.

Proof.

First, observe that we can compute and store all pairwise distances in $O(n^2 d)$ time and $O(n^2)$ space. This is useful in the naive implementation because we can then look up the values $\textnormal{dist}(u,v)$ in constant time in any subroutine.

In the nearly linear-time implementation, we skip this precomputation step and use cover trees for fast exact nearest-neighbor search in low-dimensional Euclidean space (Beygelzimer et al., 2006): inserting a point into the index takes $O(c^{12}\log n)$ time and finding a nearest neighbor in the index to a query point takes $O(c^{6}\log n)$ time, where $c = \Theta(d)$ is the doubling dimension of $d$-dimensional Euclidean space.

Thus, we can speed up the following subroutines in Algorithm 1 to have near-linear running times in a fixed dimension:

  • Computing $d_{\min}$: $O(n^2) \rightarrow O(n\cdot d^{12}\log n)$ after discarding duplicate points.

  • GreedyIndependentSet: $O(nk) \rightarrow O(n\cdot d^{12}\log n)$.

  • Evaluating $f(S)$: $O(k^2) \rightarrow O(k\cdot d^{12}\log k)$.

Taking advantage of the error $\varepsilon$ in our $(2/3-\varepsilon)$-approximation guarantee, we use the approximate point-set diameter algorithm in Imanparast et al. (2018, Theorem 1) to compute a $(1-\varepsilon)$-approximate diameter, together with two points realizing it. This speeds up the naive running time of Lines 3–4 in Algorithm 1 as follows:

  • Computing a $(1-\varepsilon)$-approximate $d_{\max}$: $O(n^2) \rightarrow O\left(n\cdot 2^{d} d + \frac{(2\sqrt{d})^{2d}}{\varepsilon^{d-1}}\right)$.

Since we have a smaller diameter $(1-\varepsilon)d_{\max}$, we need to show that it is compatible with the proof of Theorem 3.1. Concretely, this means modifying Equations (5) and (6), and observing that the inequality

$$\left(1 - \frac{2}{3} + \frac{1}{3(1+\varepsilon)}\right)(1-\alpha)\cdot(1-\varepsilon)d^{*} \geq \left(\frac{2}{3} - \varepsilon\right)(1-\alpha)d^{*}$$

still holds. Lastly, the number of calls to GreedyIndependentSet (i.e., distance thresholds that GIST tries) is $\min\{\lceil\log_{1+\varepsilon}(2/\varepsilon)\rceil, n^2\}$. Putting everything together proves the claim. ∎

Appendix B Missing analysis for Section 4

B.1 Proof of Theorem 4.2

Theorem 4.2 (restated). For any $\varepsilon > 0$, there is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for the MDDS problem, even if $k \geq |V|/4$, unless $\textnormal{P} = \textnormal{NP}$.

Proof.

We use the vertex cover instance constructed in Håstad (2001, Theorem 8.1). For any $\varepsilon_0 > 0$, they construct a graph $G$ with $2^{r+2}$ nodes for some integer $r$ and show that it is NP-hard to distinguish between the following two cases:

  • YES case: Graph $G$ has an independent set of size $2^{r}(1-\varepsilon_0)$.

  • NO case: Graph $G$ has no independent set of size larger than $2^{r}(1/2+\varepsilon_0)$.

We mimic the general structure of the proof of Theorem 4.1. Add a point for each vertex in graph $G$ and set its weight to $1$. Set the distance between two points to $1$ if they share an edge in the graph, and set it to $2$ otherwise. Lastly, let $k = 2^{r}$. Since there are $2^{r+2}$ nodes in the graph, $k = |V|/4$ in this case. In the YES case, objective function (1) can achieve the following value by selecting the independent set of size $2^{r}(1-\varepsilon_0)$:

$$\textnormal{OPT} \geq \alpha\cdot\frac{1}{2^{r}}\cdot 2^{r}(1-\varepsilon_0) + (1-\alpha)\cdot 2 = 2 - (1+\varepsilon_0)\cdot\alpha.$$

In the NO case, the largest value that can be achieved is the best of two scenarios: (a) selecting $k$ points with minimum distance $1$, or (b) selecting an independent set of size at most $2^{r}(1/2+\varepsilon_0)$ and achieving minimum distance $2$. This results in the following upper bound for the objective:

$$\textnormal{ALG} \leq \max\{\alpha + (1-\alpha)\cdot 1,\; \alpha(1/2+\varepsilon_0) + (1-\alpha)\cdot 2\} = \max\{1,\; 2 - (3/2-\varepsilon_0)\cdot\alpha\}.$$

The largest hardness gap occurs when the two terms of the $\max$ expression become equal, which happens at $\alpha = 1/(3/2-\varepsilon_0)$ and yields the upper bound $\textnormal{ALG} \leq 1$. The optimum value OPT is then lower bounded by

$$\textnormal{OPT} \geq 2 - (1+\varepsilon_0)\cdot\alpha = 2 - \frac{1+\varepsilon_0}{3/2-\varepsilon_0} = \frac{2-3\varepsilon_0}{3/2-\varepsilon_0}.$$

The desired inapproximability gap of $3/4+\varepsilon$ in Theorem 4.2 is achieved when the ratio between the upper bound on ALG and the lower bound on OPT drops below it. In other words, we want to set $\varepsilon_0$ such that the following inequality holds:

$$\frac{2-3\varepsilon_0}{3/2-\varepsilon_0} > \frac{1}{3/4+\varepsilon}.$$

We set $\varepsilon_0 \coloneqq \varepsilon$ and prove that the above inequality holds. We first note that both the numerator and denominator of the ratio on the left side are positive. Therefore, we can invert both ratios and swap their places to obtain the following equivalent inequality:

$$\frac{3}{4}+\varepsilon > \frac{3/2-\varepsilon_0}{2-3\varepsilon_0} \iff \varepsilon > \frac{3/2-\varepsilon_0}{2-3\varepsilon_0} - \frac{3}{4} = \frac{6-4\varepsilon_0-6+9\varepsilon_0}{8-12\varepsilon_0} = \frac{5\varepsilon_0}{8-12\varepsilon_0}.$$

The right-hand side inequality holds for our choice $\varepsilon_0 = \varepsilon$: assuming without loss of generality that $\varepsilon < 1/4$ (hardness for smaller $\varepsilon$ implies hardness for larger $\varepsilon$), we have $5\varepsilon/(8-12\varepsilon) < \varepsilon$. ∎

B.2 Proof of Lemma 4.4

Let $G = (V, E)$ be a simple undirected graph. Our goal is to embed the vertices of $V$ into $\mathbb{R}^{d}$, for some $d \geq 1$, in a way that encodes the adjacency structure of $G$. Concretely, we want to construct a function $h_G : V \rightarrow \mathbb{R}^{d}$ such that:

  • $\lVert h_G(u) - h_G(v)\rVert_2 = 1$ if $\{u,v\} \not\in E$, and

  • $\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \varepsilon_G$ if $\{u,v\} \in E$,

for the largest possible value of $\varepsilon_G \in (0, 1]$.

Construction.

First, let $n = |V|$ and $m = |E|$. Augment $G$ by adding a self-loop to each node to get $G' = (V, E')$. We embed $V$ using $G'$ since each node now has positive degree. Let $\deg'(v)$ be the degree of $v$ in $G'$ and $N'(v)$ be the neighborhood of $v$ in $G'$.

Define a total ordering on $E'$ (e.g., lexicographically by sorted endpoints $\{u,v\}$). Each edge $e \in E'$ corresponds to an index in the embedding dimension $d \coloneqq |E'| = m + n$. We consider the embedding function that acts as a degree-normalized adjacency vector:

$$h_G(v)_e = \begin{cases}\sqrt{\frac{1}{2\deg'(v)}} & \text{if } v \in e,\\ 0 & \text{if } v \not\in e.\end{cases} \tag{8}$$
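A direct numpy sketch of this construction (illustrative, with vertices labeled $0, \dots, n-1$ and edges given as pairs): each coordinate corresponds to one edge of $G'$, including the self-loops.

import numpy as np

def embed_graph(n, edges):
    """Map each vertex of G to the degree-normalized incidence vector of Equation (8)."""
    edges_aug = list(edges) + [(v, v) for v in range(n)]   # G': add a self-loop at every node
    deg = np.zeros(n, dtype=int)                           # deg'(v) = deg(v) + 1
    for u, v in edges_aug:
        deg[u] += 1
        if u != v:
            deg[v] += 1
    h = np.zeros((n, len(edges_aug)))
    for idx, (u, v) in enumerate(edges_aug):
        h[u, idx] = np.sqrt(1.0 / (2 * deg[u]))
        h[v, idx] = np.sqrt(1.0 / (2 * deg[v]))
    return h   # rows are h_G(v); non-adjacent pairs are at distance exactly 1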

Lemma 4.4 (restated). Let $G = (V, E)$ be a simple undirected graph with $n = |V|$, $m = |E|$, and maximum degree $\Delta$. There exists an embedding $h_G : V \rightarrow \mathbb{R}^{n+m}$ such that if $\{u,v\} \in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \frac{1}{2(\Delta+1)}$, and if $\{u,v\} \not\in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

Proof.

If $\{u,v\} \not\in E$, then we have

$$\begin{aligned}
\lVert h_G(u) - h_G(v)\rVert_2^2 &= \sum_{e \in N'(u)}\left(\sqrt{\frac{1}{2\deg'(u)}} - 0\right)^{2} + \sum_{e \in N'(v)}\left(\sqrt{\frac{1}{2\deg'(v)}} - 0\right)^{2} \\
&= \left(\frac{1}{2}\sum_{e \in N'(u)}\frac{1}{\deg'(u)}\right) + \left(\frac{1}{2}\sum_{e \in N'(v)}\frac{1}{\deg'(v)}\right) \\
&= \frac{1}{2} + \frac{1}{2} = 1.
\end{aligned}$$

This follows because the only index where both embeddings can be nonzero is the one corresponding to $\{u,v\}$, if it exists.

Now suppose that $\{u,v\} \in E$. It follows that

$$\begin{aligned}
\lVert h_G(u) - h_G(v)\rVert_2^2 &= \sum_{e \in N'(u)\setminus\{v\}}\frac{1}{2\deg'(u)} + \sum_{e \in N'(v)\setminus\{u\}}\frac{1}{2\deg'(v)} + \left(\sqrt{\frac{1}{2\deg'(u)}} - \sqrt{\frac{1}{2\deg'(v)}}\right)^{2} \\
&= \frac{\deg'(u)-1}{2\deg'(u)} + \frac{\deg'(v)-1}{2\deg'(v)} + \left(\frac{1}{2\deg'(u)} + \frac{1}{2\deg'(v)} - 2\sqrt{\frac{1}{4\deg'(u)\deg'(v)}}\right) \\
&= \frac{1}{2} + \frac{1}{2} - \sqrt{\frac{1}{\deg'(u)\deg'(v)}} \\
&\leq 1 - \frac{1}{\Delta+1}.
\end{aligned}$$

The previous inequality follows from $\deg'(v) = \deg(v) + 1 \leq \Delta + 1$. For any $x \in [0,1]$, we have

$$\sqrt{1-x} \leq 1 - \frac{x}{2},$$

so it follows that

$$\lVert h_G(u) - h_G(v)\rVert_2 \leq \sqrt{1 - \frac{1}{\Delta+1}} \leq 1 - \frac{1}{2(\Delta+1)},$$

which completes the proof. ∎

B.3 Proof of Theorem 4.5

Theorem 4.5 (restated). The MDDS problem is APX-complete for Euclidean metrics.

Proof.

We build on the hardness of approximation for the maximum independent set problem for graphs with maximum degree $\Delta = 3$. Alimonti and Kann (2000, Theorem 3.2) showed that this problem is APX-complete, so there exists some $\varepsilon_0 > 0$ such that there is no polynomial-time $(1-\varepsilon_0)$-approximation algorithm unless $\textnormal{NP} = \textnormal{P}$. Hence, there exists a graph $G$ with maximum degree $\Delta = 3$ and a threshold integer value $k$ such that it is NP-hard to distinguish between the following two cases:

  • YES instance: Graph $G$ has an independent set of size $k$.

  • NO instance: Graph $G$ does not have an independent set of size greater than $(1-\varepsilon_0)k$.

We reduce this instance of bounded-degree maximum independent set to MDDS with objective function (1) as follows. Embed each node of the graph $G$ into Euclidean space using the function $h_G(v)$ from Lemma 4.4. We use the same threshold value of $k$ (between the YES and NO instances above) for the cardinality constraint on set $S$, and we set each weight $w(v) = 1$. In a YES instance, selecting an independent set of size $k$ as the set $S$ results in the maximum value of objective (1):

$$\textnormal{OPT} = \alpha\cdot\frac{1}{k}\cdot k + (1-\alpha)\cdot 1 = 1,$$

since $\lVert h_G(u) - h_G(v)\rVert_2 = 1$ for any two distinct points $u, v \in S$, as there is no edge between $u$ and $v$ in graph $G$.

In a NO instance, the best objective value that can be achieved in polynomial time is the maximum of the following two scenarios: (a) selecting $k$ points with minimum distance at most $1 - 1/(2(\Delta+1)) = 1 - 1/8$, or (b) selecting at most $(1-\varepsilon_0)k$ vertices forming an independent set with minimum distance equal to $1$. The maximum value obtained by any polynomial-time algorithm is then

$$\textnormal{ALG} = \max\{\alpha(1-\varepsilon_0) + (1-\alpha)\cdot 1,\; \alpha + (1-\alpha)(1 - 1/8)\} = \max\{1 - \varepsilon_0\cdot\alpha,\; (7+\alpha)/8\}.$$

These two terms become equal for $\alpha = 1/(1+8\varepsilon_0)$. Therefore, the gap between the maximum value any algorithm can achieve in the NO case and the optimum value in the YES case is upper bounded by

$$1 - \varepsilon_0\cdot\alpha = 1 - \frac{\varepsilon_0}{1+8\varepsilon_0} = 1 - \varepsilon_1.$$

Since $\varepsilon_0 > 0$ is a constant, $\varepsilon_1 \coloneqq \varepsilon_0/(1+8\varepsilon_0) > 0$ is also a positive constant. This completes the proof of APX-completeness. ∎

Appendix C Missing details for Section 5

Hyperparameters for ImageNet classification.

We generate predictions and embeddings for all points using a coarsely-trained ResNet-56 model (He et al., 2016) trained on a random 10% subset of ImageNet (Russakovsky et al., 2015). We use SGD with Nesterov momentum 0.9 and 450/90 epochs. The base learning rate is 0.1 and is reduced by a tenth at epochs 5, 30, 69, and 80. We extract the penultimate-layer features to produce 2048-dimensional embeddings of each image.