
GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Matthew Fahrbach*, Srikumar Ramalingam*, Morteza Zadimoghaddam*, Sara Ahmadian, Gui Citovsky, Giulia DeSalvo
*Equal contribution. Emails: {fahrbach,rsrikumar,zadim,sahmadian,gcitovsky,giuliad}@google.com
Abstract

We propose a novel subset selection task called min-distance diverse data summarization (MDDS), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \leq k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the GIST algorithm, which achieves a $2/3$-approximation guarantee for MDDS by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study that demonstrates GIST outperforms existing methods for MDDS on synthetic data, and also for a real-world image classification experiment that studies single-shot subset selection for ImageNet.

1 Introduction

Subset selection is a challenging optimization problem with a wide variety of applications in machine learning, including feature selection, recommender systems, news aggregation, drug discovery, data summarization, and designing pretraining sets for large language models (Anil et al., 2023). Data sampling in particular is a salient problem due to unprecedented and continuous data collection. For example, LiDAR and imaging devices in one self-driving vehicle can easily capture ~80 terabytes of data per day (Kazhamiaka et al., 2021).

In most subset selection tasks, we rely on the weight (or utility) of the objects to rank one over the other, and also to avoid selecting duplicate or near-duplicate objects. If we select a small subset, then we also want to ensure that the selected subset is a good representation of the original set. These utility, diversity, and coverage criteria can be expressed through objective functions, and the interesting research lies in developing efficient algorithms with strong approximation guarantees. The underlying machinery used in constrained subset selection algorithms shares many similarities with techniques from other areas of combinatorial optimization such as submodular maximization, $k$-center clustering, and convex hull approximations.

In this work, we study the problem of selecting a set of points in a metric space that maximizes an objective that combines their utility and a minimum pairwise-distance diversity measure. Concretely, given a set of points $V$, weight function $w : V \rightarrow \mathbb{R}_{\geq 0}$, and distance function $\textnormal{dist} : V \times V \rightarrow \mathbb{R}_{\geq 0}$, our goal is to solve the cardinality-constrained optimization problem:

$$S^{*} = \operatorname*{arg\,max}_{S \subseteq V}\ \frac{1}{k}\sum_{v \in S} w(v) + \texttt{div}(S) \quad \text{subject to} \quad |S| \leq k,$$

where the diversity term $\texttt{div}(S) \coloneqq \min_{u,v \in S : u \neq v} \textnormal{dist}(u,v)$ encourages maximizing the minimum distance between all pairs of distinct points in the selected subset. Bhaskara et al. (2016) call $\texttt{div}(S)$ the min-min diversity objective and Meinl et al. (2011) call it $p$-dispersion maximization, where it is known to admit a $1/2$-approximation ratio subject to the strict cardinality constraint $|S| = k$.

Other diversity objectives such as the sum-of-min-distances $\sum_{u \in S} \min_{v \in S \setminus \{u\}} \textnormal{dist}(u,v)$ and sum-of-all-distances $\sum_{u \in S} \sum_{v \in S} \textnormal{dist}(u,v)$ have also been studied (Meinl et al., 2011; Bhaskara et al., 2016). While previous works have looked at maximizing the diversity objective $\texttt{div}(S)$, our formulation that introduces the utility term has not been studied before. In particular, Bhaskara et al. (2016) observe that min-min is a powerful diversity objective, but point out a major challenge: it is highly non-monotonic, i.e., adding a single new point can cause a stark decrease in the objective value. Our work shows that the use of a utility term along with the min-min objective can alleviate some of these pitfalls, and allows us to inherit the strong benefits of this strict diversity objective.

1.1 Our contributions

We summarize our main contributions below:

  • We propose a novel subset selection problem called min-distance diverse data summarization (MDDS), where the goal is to maximize an objective function that combines a linear weight term with the min-min distance diversity term $\texttt{div}(S)$ subject to the cardinality constraint $|S| \leq k$. This problem has a wide variety of applications in machine learning, e.g., data denoising and active learning.

  • We give a $2/3$-approximation algorithm for MDDS called GIST that approximates a series of maximum independent set problems on geometric intersection graphs with a bicriteria greedy algorithm and returns the best solution.

  • We prove a complementary $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$, for general metric spaces by reducing from approximate maximum clique. We also prove that MDDS is APX-complete for Euclidean metrics.

  • Our experiments show that GIST overcomes the shortcomings of existing methods for MDDS on synthetic data. We also demonstrate how GIST can be used to create better single-shot subsets of training data for an image classification benchmark on ImageNet (Russakovsky et al., 2015) compared to margin sampling and $k$-center algorithms.

1.2 Background and related work

Diversity objectives.

A closely related problem is $k$-center, which aims to find a subset $S$ of $k$ centers that minimizes the largest distance of any point to its closest center in $S$. Concretely, the objective is $\max_{v \in V}\min_{c \in S}\textnormal{dist}(v,c)$, and a greedy $2$-approximation algorithm is widely used in active learning applications (Sener and Savarese, 2018). The main difference between our objective function and $k$-center is the ability to benefit from the utility term, e.g., uncertainty values in an active learning setting. The second difference is that we relax the need to cover every point in the original dataset, which is particularly beneficial for large datasets, where we prefer to avoid instability due to certain corner or outlier points.

Bhaskara et al. (2016) study various distance-based diversity metrics subject to a cardinality constraint. In particular, they give a $1/8$-approximation algorithm for the sum-min diversity objective defined as $\sum_{u \in S}\min_{v \in S \setminus \{u\}}\textnormal{dist}(u,v)$ by rounding a linear programming relaxation, and they complement this result by proving a conditional $1/2$-hardness of approximation under the planted clique assumption. They also study the min-min objective $\min_{u,v \in S : u \neq v}\textnormal{dist}(u,v)$, which is the same as $\texttt{div}(S)$. They do not have a linear weight function in their objective, so they enforce a strict cardinality constraint $|S| = k$; otherwise, it is always better to output at most one or two points. Meinl et al. (2011) study the same objective subject to $|S| = k$ under the name $p$-dispersion (also called max-min or remote-edge) diversity maximization. This cardinality issue does not arise in our formulation since we need to optimize the trade-off between competing linear weight (utility) and diversity terms.

Moumoulidou et al. (2020), Addanki et al. (2022), and Mahabadi and Trajanovski (2023) initiated the study of diversity maximization under fairness constraints. Given a partitioning of the dataset into $m$ groups, they ensure that a prespecified number of points $k_i$ is selected from each group $i \in [m]$. This fairness constraint (i.e., $|S| = k = \sum_{i=1}^{m} k_i$) considerably differentiates their problem setting from ours. Moumoulidou et al. (2020) and Addanki et al. (2022) maximize the $\texttt{div}(S)$ objective, and Mahabadi and Trajanovski (2023) consider the sum-of-pairwise-distances $\sum_{u \in S}\sum_{v \in S}\textnormal{dist}(u,v)$ and sum-min objectives. None of these works, however, add a linear weight function to the objective.

Data sampling.

In the realm of dataset curation and active learning, several lines of work study the combination of utility and diversity terms. Ash et al. (2020) invoke $k$-means++ seeding over the gradient space of a model in order to balance uncertainty and diversity. Wei et al. (2015) introduce several submodular objectives, e.g., facility location, and use them to diversify a set of uncertain examples in each active learning iteration. Citovsky et al. (2021) cluster examples represented by embeddings extracted from the penultimate layer of a partially trained DNN, and then use these clusters to diversify uncertain examples in each iteration. Our work differs from these and several others (e.g., Kirsch et al. (2019); Zhdanov (2019)) in that we directly incorporate both the utility and diversity terms in our objective function.

2 Preliminaries

We are given a set of $n$ points $V \subseteq M$ in a metric space $(M, \textnormal{dist})$, where $\textnormal{dist} : M \times M \rightarrow \mathbb{R}_{\geq 0}$ is the distance function. We are also given weights $w : V \rightarrow \mathbb{R}_{\geq 0}$ representing the utility of each point. We extend these functions to take subsets $S \subseteq V$ as input in the standard way: $w(S) = \sum_{v \in S} w(v)$ and $\textnormal{dist}(u, S) = \min_{v \in S}\textnormal{dist}(u,v)$. For the sake of completeness, we define $\textnormal{dist}(u, \varnothing) = \infty$.

Intersection graph.

Let $G_d(V)$ denote the intersection graph of $V$ for distance threshold $d$, i.e., the graph with node set $V$ and edge set

$$E = \{(u,v) \in V^{2} : \textnormal{dist}(u,v) < d\}.$$

This graph is well-defined for any metric space and distance $d \in \mathbb{R}$. Note that for any independent set $S$ of $G_d(V)$ and pair of distinct nodes $u, v \in S$, we have $\textnormal{dist}(u,v) \geq d$.
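As a concrete illustration, the edge set of $G_d(V)$ can be enumerated directly from a pairwise distance matrix. The following Python sketch assumes a precomputed dense matrix and is quadratic in $n$; the function name is illustrative and not part of any released implementation.

import numpy as np

def intersection_graph_edges(dist_matrix, d):
    """Edge set of G_d(V): all pairs (u, v) with dist(u, v) < d."""
    n = dist_matrix.shape[0]
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if dist_matrix[u, v] < d]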

Definition 2.1.

This work focuses on the min-distance diversity reward function

$$\texttt{div}(S) = \begin{cases}\min_{u,v \in S : u \neq v}\textnormal{dist}(u,v) & \text{if } |S| \geq 2,\\ \max_{u,v \in V}\textnormal{dist}(u,v) & \text{if } |S| \leq 1.\end{cases}$$

This reward function encourages selecting a well-separated set of points (generalizing the notion of data deduplication). To make $\texttt{div}(S)$ well-defined if $S$ is empty or consists of a single point, we set the diversity to be the diameter of the whole ground set $V$, implying that it is always beneficial to select at least two points to maximize the value of the selected set.

For any $\alpha \in [0,1]$, we define the objective function

$$f(S) = \alpha\frac{1}{k}\sum_{v \in S} w(v) + (1-\alpha)\,\texttt{div}(S), \tag{1}$$

which allows us to control the trade-off between the average utility and the diversity of the set $S$. The goal of the min-distance diverse data summarization problem (MDDS) is to maximize $f(S)$ subject to a cardinality constraint $k$:

$$S^{*} = \operatorname*{arg\,max}_{S \subseteq V : |S| \leq k} f(S).$$

Let $\textnormal{OPT} = f(S^{*})$ denote the optimal objective value.
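For concreteness, the following Python sketch evaluates the objective in Equation (1) from a precomputed pairwise distance matrix, with the diameter fallback of Definition 2.1 when $|S| \leq 1$; it is an illustrative helper, not the authors' code.

import itertools
import numpy as np

def objective(S, weights, dist_matrix, k, alpha):
    """Evaluate f(S) = alpha * w(S) / k + (1 - alpha) * div(S)."""
    S = list(S)
    if len(S) >= 2:
        div = min(dist_matrix[u, v] for u, v in itertools.combinations(S, 2))
    else:
        div = dist_matrix.max()  # diameter of the ground set V (Definition 2.1)
    return alpha * sum(weights[v] for v in S) / k + (1 - alpha) * div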

Simple $1/2$-approximation algorithm.

Equation (1) has two parts, each of which is easy to optimize in isolation. To maximize the $w(S)$ term, we simply select the $k$ heaviest points. In contrast, to maximize the $\texttt{div}(S)$ term, we select two points whose distance is the diameter of $V$. Outputting a subset that corresponds to the maximum of these two terms gives a $1/2$-approximation guarantee. To see this, observe that

$$\begin{aligned}
\textnormal{ALG}_{\textnormal{simple}} &= \max\left\{\alpha\frac{1}{k}\max_{|S|\leq k} w(S),\; (1-\alpha)\max_{u,v\in V}\textnormal{dist}(u,v)\right\} \\
&\geq \frac{1}{2}\cdot\alpha\frac{1}{k}\max_{|S|\leq k} w(S) + \frac{1}{2}\cdot(1-\alpha)\max_{u,v\in V}\textnormal{dist}(u,v) \\
&\geq \frac{1}{2}\left(\alpha\frac{1}{k} w(S^{*}) + (1-\alpha)\,\texttt{div}(S^{*})\right) = \frac{1}{2}\cdot\textnormal{OPT},
\end{aligned}$$

where $\texttt{div}(S)$ is defined in Definition 2.1. Our main goals are to improve over this simple baseline and prove complementary hardness results.
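A minimal sketch of this baseline, reusing the objective helper above and assuming a dense distance matrix:

import numpy as np

def simple_baseline(weights, dist_matrix, k, alpha):
    """1/2-approximation: the better of the k heaviest points and a diametral pair."""
    heaviest = list(np.argsort(weights)[::-1][:k])
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    candidates = [heaviest] + ([[int(u), int(v)]] if k >= 2 else [])
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))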

3 Algorithm

We now present the GIST algorithm, which achieves a $(2/3-\varepsilon)$-approximation for the MDDS problem. It works by repeatedly calling the GreedyIndependentSet heuristic to find an independent set of the intersection graph $G_d(V)$ for various distance thresholds $d$. The subroutine GreedyIndependentSet sorts the points by weight in non-increasing order, breaking ties arbitrarily. It starts with an empty set of selected points, and then it scans the points according to the sorted order. For each point, GreedyIndependentSet selects the point as long as its distance to the set of selected points so far is at least $d$ and the number of selected points does not exceed $k$.

The GIST algorithm computes multiple candidate solutions and returns the one with the maximum value $f(S)$. It starts by picking the $k$ points with the largest weight as the initial solution, which can be done by calling GreedyIndependentSet with the distance threshold set to zero. We then consider a set of distance thresholds as follows. Let $d_{\max}$ be the diameter of the ground set $V$, i.e., $d_{\max} = \max_{u,v\in V}\textnormal{dist}(u,v)$. Define the set of distance thresholds $D \leftarrow \{(1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 : (1+\varepsilon)^{i} \leq 2/\varepsilon \text{ and } i \in \mathbb{Z}_{\geq 0}\}$. For each $d \in D$, we call the subroutine $\textnormal{GreedyIndependentSet}(V, w, d, k)$ to find a set $T$ of size at most $k$. If the newly generated set $T$ has a larger objective value than our best solution so far, then we update $S \leftarrow T$. After considering all distance thresholds in $D$, the highest-value solution among all candidates is the output of the GIST algorithm.

Algorithm 1 Diversified data summarization using greedy-weighted maximum independent sets.
1: function GIST(points $V$, weights $w : V \rightarrow \mathbb{R}_{\geq 0}$, budget $k$, error $\varepsilon$)
2:     Initialize $S \leftarrow \textnormal{GreedyIndependentSet}(V, w, 0, k)$  ▷ $k$ heaviest points
3:     Let $d_{\max} = \max_{u,v\in V}\textnormal{dist}(u,v)$ be the diameter of $V$
4:     Let $T \leftarrow \{u, v\}$ be two points such that $\textnormal{dist}(u,v) = d_{\max}$
5:     if $f(T) > f(S)$ and $k \geq 2$ then
6:         Update $S \leftarrow T$
7:     Let $D \leftarrow \{(1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 : (1+\varepsilon)^{i} \leq 2/\varepsilon \text{ and } i \in \mathbb{Z}_{\geq 0}\}$  ▷ distance thresholds
8:     for threshold $d \in D$ do
9:         Set $T \leftarrow \textnormal{GreedyIndependentSet}(V, w, d, k)$
10:         if $f(T) \geq f(S)$ then
11:             Update $S \leftarrow T$
12:     return $S$

1: function GreedyIndependentSet(points $V$, weights $w : V \rightarrow \mathbb{R}_{\geq 0}$, distance $d$, budget $k$)
2:     Initialize $S \leftarrow \varnothing$
3:     for $v \in V$ sorted by non-increasing weight do
4:         if $\textnormal{dist}(v, S) \geq d$ and $|S| < k$ then
5:             Update $S \leftarrow S \cup \{v\}$
6:     return $S$
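The following Python sketch mirrors Algorithm 1 with brute-force distance-to-set checks and reuses the objective helper from Section 2; it is an illustration of the pseudocode above rather than the implementation used in the experiments.

import numpy as np

def greedy_independent_set(weights, dist_matrix, d, k):
    """Scan points by non-increasing weight; keep each point at distance >= d from S."""
    S = []
    for v in np.argsort(weights)[::-1]:
        if len(S) >= k:
            break
        if not S or min(dist_matrix[v, u] for u in S) >= d:
            S.append(int(v))
    return S

def gist(weights, dist_matrix, k, alpha, eps):
    d_max = dist_matrix.max()
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    candidates = [greedy_independent_set(weights, dist_matrix, 0.0, k)]  # k heaviest points
    if k >= 2:
        candidates.append([int(u), int(v)])                              # diametral pair
    d = eps * d_max / 2                      # thresholds (1 + eps)^i * eps * d_max / 2
    while 0 < d <= d_max:
        candidates.append(greedy_independent_set(weights, dist_matrix, d, k))
        d *= 1 + eps
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))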
Theorem 3.1.

For any $\varepsilon > 0$, GIST outputs a set $S \subseteq V$ with $|S| \leq k$ and $f(S) \geq (2/3-\varepsilon)\cdot\textnormal{OPT}$.

The main building block in our design and analysis of GIST is the following lemma, which is inspired by the greedy $2$-approximation algorithm for metric $k$-center.

Lemma 3.2.

Let $S_d^{*}$ be a max-weight independent set of the intersection graph $G_d(V)$ of size at most $k$. If $T$ is the output of $\textnormal{GreedyIndependentSet}(V, w, d', k)$, for $d' \leq d/2$, then $w(T) \geq w(S_d^{*})$.

Proof.

Let $k' = |T|$ and $t_1, t_2, \dots, t_{k'}$ be the points in $T$ in the order that GreedyIndependentSet selected them. Let $B_i = \{v \in V : \textnormal{dist}(t_i, v) < d'\}$ be the points in $V$ contained in the radius-$d'$ open ball around $t_i$. First, we show that each $B_i$ contains at most one point in $S_d^{*}$. If this is not true, then some $B_i$ contains two distinct points $u, v \in S_d^{*}$. Since $\textnormal{dist}(\cdot,\cdot)$ is a metric, this means

$$\textnormal{dist}(u,v) \leq \textnormal{dist}(u, t_i) + \textnormal{dist}(t_i, v) < d' + d' \leq d/2 + d/2 = d,$$

which contradicts the assumption that $S_d^{*}$ is an independent set of $G_d(V)$. Note that it is possible to have $B_i \cap B_j \neq \varnothing$, for $i \neq j$, since these balls consider all points in $V$.

Now let $C_i \subseteq V$ be the set of uncovered (by the open balls) points that become covered when GreedyIndependentSet selects $t_i$. Concretely, $C_i = B_i \setminus (B_1 \cup \cdots \cup B_{i-1})$. Each $C_i$ contains at most one point in $S_d^{*}$ since $|B_i \cap S_d^{*}| \leq 1$. Moreover, if $s \in C_i \cap S_d^{*}$, then $w(t_i) \geq w(s)$ because the points are sorted in non-increasing order of weight and selected if uncovered.

Let $A = C_1 \cup \cdots \cup C_{k'}$ be the set of points covered by the algorithm. For each point $s \in S_d^{*} \cap A$, there is exactly one covering set $C_i$ corresponding to $s$. It follows that

$$\sum_{s \in S_d^{*} \cap A} w(s) \leq \sum_{i \in [k'] : S_d^{*} \cap C_i \neq \varnothing} w(t_i). \tag{2}$$

It remains to account for the points in $S_d^{*} \setminus A$. Note that if we have any such points, then $|T| = k$ since the points in $S_d^{*} \setminus A$ are uncovered at the end of the algorithm. Further, for any $t_i \in T$ and $s \in S_d^{*} \setminus A$, we have $w(t_i) \geq w(s)$ since $t_i$ was selected and the points are sorted by non-increasing weight. Therefore, we can assign each $s \in S_d^{*} \setminus A$ to a unique $C_i$ such that $C_i \cap S_d^{*} = \varnothing$. It follows that

$$\sum_{s \in S_d^{*} \setminus A} w(s) \leq \sum_{i \in [k'] : S_d^{*} \cap C_i = \varnothing} w(t_i). \tag{3}$$

Adding Equations (2) and (3) together gives $w(S_d^{*}) \leq \sum_{i \in [k']} w(t_i) = w(T)$, which completes the proof. ∎

Equipped with this bicriteria approximation guarantee, we can now analyze the approximation ratio of the GIST algorithm.

Proof of Theorem 3.1.

Let $d^{*}$ be the minimum distance between two distinct points in $S^{*}$. There are two cases: $d^{*} < \varepsilon d_{\max}$ or $d^{*} \geq \varepsilon d_{\max}$. For the first case, outputting the $k$ heaviest points (Line 2) achieves a $(1-\varepsilon)$-approximation. To see this, observe that

$$\textnormal{OPT} \geq (1-\alpha)d_{\max} > (1-\alpha)\cdot\frac{d^{*}}{\varepsilon} \implies \varepsilon\cdot\textnormal{OPT} > (1-\alpha)d^{*}.$$

The total weight of the $k$ heaviest points upper bounds $w(S^{*})$, so it follows that

$$\textnormal{ALG} \geq \alpha\frac{1}{k}w(S^{*}) = \textnormal{OPT} - (1-\alpha)d^{*} > (1-\varepsilon)\textnormal{OPT}.$$

Therefore, we focus on the second case where $d^{*} \geq \varepsilon d_{\max}$. It follows that GIST considers a threshold $d = (1+\varepsilon)^{i}\cdot\varepsilon d_{\max}/2 \in D$ such that

$$\frac{d^{*}}{2} \in [d, (1+\varepsilon)d) \implies \frac{1}{1+\varepsilon}\cdot\frac{d^{*}}{2} < d \leq \frac{d^{*}}{2}.$$

Using Lemma 3.2, $\textnormal{GreedyIndependentSet}(V, w, d, k)$ outputs a set $T$ such that

$$f(T) \geq \alpha\frac{1}{k}w(S^{*}) + (1-\alpha)d \geq \alpha\frac{1}{k}w(S^{*}) + (1-\alpha)\frac{d^{*}}{2(1+\varepsilon)}. \tag{4}$$

The potential max-diameter update to $S$ on Line 5 gives us

$$\textnormal{ALG} \geq (1-\alpha)d_{\max} \geq (1-\alpha)d^{*}. \tag{5}$$

Putting Equations (4) and (5) together, for any $p \in [0,1]$, it holds that

$$\begin{aligned}
\textnormal{ALG} &\geq p\left[\alpha\frac{1}{k}w(S^{*}) + (1-\alpha)\frac{d^{*}}{2(1+\varepsilon)}\right] + (1-p)(1-\alpha)d^{*} \\
&= p\cdot\alpha\frac{1}{k}w(S^{*}) + \left(1 - p + \frac{p}{2(1+\varepsilon)}\right)\cdot(1-\alpha)d^{*}.
\end{aligned}$$

To maximize the approximation ratio as $\varepsilon \rightarrow 0$, solve $p = 1 - p/2$ to get $p = 2/3$. It follows that

$$\begin{aligned}
\textnormal{ALG} &\geq \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \left(1 - \frac{2}{3} + \frac{1}{3(1+\varepsilon)}\right)\cdot(1-\alpha)d^{*} \qquad (6) \\
&= \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{1}{3}\left(1 + \frac{1}{1+\varepsilon}\right)\cdot(1-\alpha)d^{*} \\
&\geq \frac{2}{3}\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{1}{3}\left(2 - \varepsilon\right)\cdot(1-\alpha)d^{*} \\
&\geq \frac{2}{3}\left(1-\varepsilon\right)\cdot\alpha\frac{1}{k}w(S^{*}) + \frac{2}{3}\left(1-\varepsilon\right)\cdot(1-\alpha)d^{*} \\
&\geq \left(\frac{2}{3} - \varepsilon\right)\cdot\textnormal{OPT},
\end{aligned}$$

which completes the proof. ∎

Remark 3.3.

We can modify GIST to achieve an exact $2/3$-approximation by trying all pairwise distance thresholds in $D = \{\textnormal{dist}(u,v)/2 : u, v \in V\}$. This requires $O(n^2)$ calls to GreedyIndependentSet compared to $O(\log_{1+\varepsilon}(1/\varepsilon))$, but otherwise the same analysis holds for $\varepsilon = 0$.
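Under the same assumptions as the sketch after Algorithm 1, the exact-threshold variant only changes the set of thresholds that are tried:

def gist_exact(weights, dist_matrix, k, alpha):
    """Remark 3.3: try every half pairwise distance (plus threshold 0) as d."""
    n = dist_matrix.shape[0]
    thresholds = {0.0} | {dist_matrix[u, v] / 2
                          for u in range(n) for v in range(u + 1, n)}
    candidates = [greedy_independent_set(weights, dist_matrix, d, k) for d in thresholds]
    u, v = np.unravel_index(np.argmax(dist_matrix), dist_matrix.shape)
    if k >= 2:
        candidates.append([int(u), int(v)])
    return max(candidates, key=lambda S: objective(S, weights, dist_matrix, k, alpha))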

Finally, we show how to implement GIST in nearly linear time for low-dimensional Euclidean metrics using cover trees for exact nearest-neighbor queries (Beygelzimer et al., 2006; Izbicki and Shelton, 2015) and a $(1-\varepsilon)$-approximation algorithm for the point-set diameter $d_{\max}$ (Barequet and Har-Peled, 2001; Imanparast et al., 2018).

Theorem 3.4.

For $V \subseteq \mathbb{R}^{d}$ and Euclidean distance, GIST runs in time $O(n^3 k + n^2 d)$. Furthermore, for fixed dimension $d$, the running time can be improved to $O(n\log n\cdot\log_{1+\varepsilon}(1/\varepsilon)\cdot\textnormal{poly}(d) + 1/\varepsilon^{d-1})$.

We give the proof in Section A.1 and note that the key property is that cover trees allow us to insert a point into a nearest-neighbor index in $O(c^{12}\log n)$ time and find a nearest neighbor to a query point in $O(c^{6}\log n)$ time, where $c = \Theta(d)$ is the doubling dimension of $d$-dimensional Euclidean space.

4 Hardness of approximation

We start by summarizing our hardness results for the MDDS problem. Assuming $\textnormal{P} \neq \textnormal{NP}$, we prove that for any $\varepsilon > 0$:

  1. There is no polynomial-time $(2/3+\varepsilon)$-approximation algorithm for general metric spaces.

  2. There is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for general metric spaces, even if $k \geq |V|/4$.

  3. APX-completeness for the Euclidean metric, i.e., there is no polynomial-time approximation scheme (PTAS) for this objective.

4.1 General metric spaces

Our first result builds on the hardness of approximating the size of the maximum clique, and it tightly complements the $2/3$-approximation guarantee that we can achieve with GIST.

Theorem 4.1.

For any $\varepsilon > 0$, there is no polynomial-time $(2/3+\varepsilon)$-approximation algorithm for the MDDS problem, unless $\textnormal{P} = \textnormal{NP}$.

Proof.

First, recall that a clique is a subset of vertices in an undirected graph such that there is an edge between every pair of its vertices. Håstad (1996) and Zuckerman (2006) showed that the maximum clique problem does not admit an $n^{1-\theta}$-approximation for any constant $\theta > 0$, unless $\textnormal{NP} = \textnormal{P}$. This implies that there is no constant-factor approximation algorithm for maximum clique. In other words, for any constant $0 < \delta \leq 1$, there exists a graph $G$ and a threshold integer value $k$ such that it is NP-hard to distinguish between the following two cases:

  • YES instance: Graph $G$ has a clique of size $k$.

  • NO instance: Graph $G$ does not have a clique of size greater than $\delta k$.

We reduce this instance of the maximum clique decision problem to MDDS with objective function (1) as follows. Represent each vertex of graph $G$ with a point in our ground set. The distance between a pair of points is $2$ if there is an edge between their corresponding vertices in $G$, and it is $1$ otherwise.
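The reduction's metric is easy to construct explicitly; below is a small sketch (with illustrative names) that maps edges to distance $2$ and non-edges to distance $1$, which satisfies the triangle inequality since $2 \leq 1 + 1$.

import numpy as np

def clique_reduction_metric(n, edges):
    """Theorem 4.1 reduction: distance 2 between adjacent vertices of G, 1 otherwise."""
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for u, v in edges:
        dist[u, v] = dist[v, u] = 2.0
    return dist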

Use the same threshold value of $k$ (in the YES and NO instances above) for the cardinality constraint on set $S$, and set each weight $w(v) = 1$. In a YES instance, selecting a clique of size $k$ as the set $S$ results in the maximum possible value of the objective:

$$\textnormal{OPT} = \alpha\cdot\frac{1}{k}\cdot k + (1-\alpha)\cdot 2 = 2 - \alpha. \tag{7}$$

In a NO instance, the best objective value that can be achieved in polynomial time is the maximum of the following two scenarios: (a) selecting $k$ points with minimum distance $1$, or (b) selecting at most $\delta k$ vertices forming a clique with minimum distance $2$. The maximum value obtained by any polynomial-time algorithm is then

$$\textnormal{ALG} = \max\{\alpha + (1-\alpha)\cdot 1,\; \alpha\cdot\delta + (1-\alpha)\cdot 2\} = \max\{1,\; 2 - (2-\delta)\alpha\}.$$

These two terms become equal for $\alpha = 1/(2-\delta)$. Thus, the gap between the maximum value any algorithm can achieve in the NO case and the optimum value in the YES case is

$$\frac{1}{2-\alpha} = \frac{1}{2 - 1/(2-\delta)} = \frac{2-\delta}{3-2\delta}.$$

To complete the proof, it suffices to show that the ratio above is at most $2/3 + \varepsilon$. We separate the $2/3$ term as follows:

$$\frac{2-\delta}{3-2\delta} = \frac{\frac{2}{3}(3-2\delta) + \frac{\delta}{3}}{3-2\delta} = \frac{2}{3} + \frac{\delta}{9-6\delta}.$$

Therefore, we must choose a value of $\delta$ satisfying $\delta/(9-6\delta) \leq \varepsilon$. Since $\delta \leq 1$, the denominator $9-6\delta$ is positive. Equivalently, we want to satisfy:

$$\frac{9-6\delta}{\delta} = \frac{9}{\delta} - 6 \geq \frac{1}{\varepsilon}.$$

By setting $\delta < 9\varepsilon/(1+6\varepsilon)$, we satisfy the required inequality and achieve the inapproximability gap in the theorem statement. ∎

4.2 Large subsets

In many data curation problems, we are interested in instances of MDDS where $k = \Omega(|V|)$. Even in this restricted case, we can show hardness of approximation.

Theorem 4.2.

For any $\varepsilon > 0$, there is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for the MDDS problem, even if $k \geq |V|/4$, unless $\textnormal{P} = \textnormal{NP}$.

The proof of this theorem uses the vertex cover instance constructed in the proof of Håstad (2001, Theorem 8.1) and is given in Section B.1.

4.3 Euclidean metrics

Our final result is specific to the Euclidean metric, i.e., $S \subseteq \mathbb{R}^{d}$ with the $L^2$ distance between points. It builds on a result of Alimonti and Kann (2000), which says that the size of a maximum independent set in a bounded-degree graph cannot be approximated to within a factor of $1-\varepsilon$, for some constant $\varepsilon > 0$, unless $\textnormal{P} = \textnormal{NP}$. In other words, it does not admit a PTAS.

Lemma 4.3 (Alimonti and Kann (2000, Theorem 3.2)).

The maximum independent set problem for graphs with degree at most $3$ is APX-complete.

Similar to our reduction from maximum clique in Section 4.1 for general metric spaces, this construction uses a node embedding function $h_G(v)$ to encode graph adjacency in Euclidean space.

Lemma 4.4.

Let $G = (V, E)$ be a simple undirected graph with $n = |V|$, $m = |E|$, and maximum degree $\Delta$. There exists an embedding $h_G : V \rightarrow \mathbb{R}^{n+m}$ such that if $\{u,v\} \in E$ then

$$\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \frac{1}{2(\Delta+1)},$$

and if $\{u,v\} \not\in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

Theorem 4.5.

The MDDS problem is APX-complete for Euclidean metrics.

We sketch the main idea of our reduction and defer the complete proofs of Lemma 4.4 and Theorem 4.5 to Appendix B. Let $\mathcal{I}(G)$ be the set of independent sets of $G$. A set of embedded nodes $S \subseteq V$ (with at least two nodes) of a graph $G$ with maximum degree $\Delta \leq 3$ has the property:

  • If $S \in \mathcal{I}(G)$: $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

  • If $S \not\in \mathcal{I}(G)$: $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - 1/(2(\Delta+1)) \leq 1 - 1/8$.

This gap between the two cases is at least $1/8$ (i.e., a universal constant) for all such graphs, and it allows us to show that MDDS inherits the APX-completeness of bounded-degree maximum independent set.

5 Experiments

5.1 Warm-up: Synthetic dataset

Setup.

We start with a simple dataset of weighted random points to see the shortcomings of existing methods for MDDS. For a fixed dimension $d$, we generate $n = 1000$ points $\mathbf{x}_i \in \mathbb{R}^{d}$, where each $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ and each point is assigned an i.i.d. uniform continuous weight $w_i \sim \mathcal{U}_{[0,1]}$.
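A sketch of this setup in numpy (the random seed is arbitrary and not specified in the paper):

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)            # arbitrary seed
n, d = 1000, 64
X = rng.standard_normal((n, d))           # x_i ~ N(0, I_d)
weights = rng.uniform(0.0, 1.0, size=n)   # w_i ~ U[0, 1]
dist_matrix = cdist(X, X)                 # pairwise Euclidean distances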

We consider three baseline algorithms: random, simple, and greedy. The random algorithm chooses a random set of $k$ points, permutes them, and returns the best prefix. simple is the $1/2$-approximation algorithm in Section 2. greedy builds a set $S \subseteq V$ one element at a time by selecting the point with index $i_t^{*} = \operatorname*{arg\,max}_{i \in V \setminus S_{t-1}} f(S_{t-1} \cup \{i\}) - f(S_{t-1})$ at each step $t \in [k]$. Then it returns the best prefix $S = \operatorname*{arg\,max}_{t \in [k]} f(S_t)$, as this sequence of objective values is not necessarily monotone.
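A sketch of the greedy baseline under the same assumptions, reusing the objective helper from Section 2 (since $f(S_{t-1})$ is constant within a step, maximizing the marginal gain is the same as maximizing $f(S_{t-1} \cup \{i\})$):

def greedy_baseline(weights, dist_matrix, k, alpha):
    """Greedily add the point with the largest f(S + {i}); return the best prefix."""
    n = dist_matrix.shape[0]
    S, remaining = [], set(range(n))
    best_prefix, best_value = [], float("-inf")
    for _ in range(min(k, n)):
        values = {i: objective(S + [i], weights, dist_matrix, k, alpha) for i in remaining}
        i_star = max(values, key=values.get)
        S.append(i_star)
        remaining.remove(i_star)
        if values[i_star] > best_value:
            best_value, best_prefix = values[i_star], list(S)
    return best_prefix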

Figure 1: Synthetic data with $n = 1000$ and $d = 64$. Plots of $f(S_{\text{ALG}})$ for $k \in [n]$ and $\alpha \in \{0.90, 0.95\}$.

Results.

We plot the values of $f(S)$ for our baseline algorithms and GIST for all $k \in [n]$ in Figure 1. We set $\alpha \in \{0.90, 0.95\}$ to balance the two competing objective terms, which are easy to optimize in isolation: if $\mathbf{x}, \mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$ are i.i.d., then $\mathbb{E}[\lVert\mathbf{x}-\mathbf{y}\rVert_2^2] = 2d$, so $\frac{\alpha}{2} \approx (1-\alpha)\sqrt{2d} \implies \alpha \approx 1 - \frac{1}{2\sqrt{2d}+1} \approx 0.957$. There is a clear separation in solution quality for each $k \in [n]$ when $\alpha = 0.95$. Note that these objective values are not directly comparable as a function of $k$ due to the $1/k$ normalizing factor (hence the discontinuities), but both terms of the objective (the mean utility and the minimum distance of the selected points) decrease on average in $k$ for a fixed $\alpha$. GIST noticeably outperforms simple, greedy, and random. The poor performance of greedy is somewhat surprising since it is normally competitive for submodular-style objectives. simple also outperforms greedy for all but the extreme values of $k$, i.e., very small or close to $n$, but is always worse relative to GIST.

5.2 Image classification

The following data sampling experiment compares the top-1 image classification accuracy achieved by different single-shot subset selection algorithms.

Setup.

We use the standard vision dataset ImageNet (Russakovsky et al., 2015), which contains ~1.3 million images and 1000 classes. We select 10% of the data points at random and use them to train an initial ResNet-56 model $\bm{\theta}_0$ (He et al., 2016). Then we use this initial model to compute an uncertainty value and a 2048-dimensional embedding for each example. The uncertainty value of $\mathbf{x}_i$ is $u_i = 1 - \Pr(y = b \mid \mathbf{x}_i;\ \bm{\theta}_0)$, where $b$ is the best class label according to the initial model. Finally, we use the fast maximum inner product search of Guo et al. (2020) to build a $\Delta$-nearest neighbor graph in the embedding space using $\Delta = 10$ and cosine distance. We present all of the model training hyperparameters in Appendix C.
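The paper builds the $\Delta$-nearest neighbor graph with the accelerated maximum inner product search of Guo et al. (2020); as a rough stand-in, a brute-force cosine-distance nearest-neighbor index in scikit-learn looks like the following (the embeddings array and its path are hypothetical).

import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.load("embeddings.npy")   # hypothetical path: (num_examples, 2048) penultimate-layer features
index = NearestNeighbors(n_neighbors=10 + 1, metric="cosine").fit(embeddings)
distances, neighbors = index.kneighbors(embeddings)
# Column 0 is each point's self-match; drop it to get the Delta = 10 neighbor graph.
knn_dist, knn_idx = distances[:, 1:], neighbors[:, 1:]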

We compare GIST with three state-of-the-art benchmarks:

  • random: We select samples from the dataset uniformly at random (and without replacement). While being a simple and lightweight approach, this implicitly encodes diversity in many settings and provides good solutions.

  • margin (Roth and Small, 2006): Margin sampling selects the top-$k$ points based on how hard they are to classify, i.e., using the margin scores $m_i = 1 - (\Pr(y = b \mid \mathbf{x}_i;\ \bm{\theta}_0) - \Pr(y = b' \mid \mathbf{x}_i;\ \bm{\theta}_0))$, which measure the difference between the probabilities of the best class label $b$ and the second-best class label $b'$ for an example.

  • $k$-center (Sener and Savarese, 2018): We run the classic greedy algorithm for $k$-center on the $\Delta$-nearest neighbor graph. The distance between non-adjacent nodes is implicitly assumed to be very large, which is acceptable in the large-budget regime $k = \Omega(n)$.

For this task, GIST uses the uncertainty values as its weights $w_i = u_i$ and the same graph as $k$-center, and it sets $\alpha = 0.85$.
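The uncertainty and margin scores above are simple functions of the initial model's softmax output; a sketch assuming a hypothetical array probs of per-example class probabilities:

import numpy as np

# probs: hypothetical (num_examples, num_classes) softmax outputs of the initial model theta_0
top2 = np.sort(probs, axis=1)[:, -2:]      # columns: second-best and best class probabilities
uncertainty = 1.0 - top2[:, 1]             # u_i = 1 - Pr(y = b | x_i; theta_0)
margin = 1.0 - (top2[:, 1] - top2[:, 0])   # m_i = 1 - (Pr(best) - Pr(second best))
gist_weights = uncertainty                 # GIST uses w_i = u_i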

Results.

We run each algorithm with cardinality constraint $k$ on the full dataset to get a subset of examples that we use to train a new ResNet-56 model. We report the top-1 classification accuracy of these models in Table 1.

$k$ (% of data) random margin $k$-center GIST
30 65.02 67.64 67.32 67.70
40 68.92 70.45 70.98 71.20
50 70.86 73.41 72.51 73.67
60 72.39 74.70 74.78 75.16
70 72.92 75.57 75.62 75.85
80 74.36 75.39 76.26 76.49
90 75.10 76.11 76.02 76.50
Table 1: Top-1 classification accuracy (%) on ImageNet for different single-shot data downsampling algorithms. The cardinality constraint $k$ is expressed as a fraction of the ~1.3 million examples.

Conclusion

This work proposes a novel subset selection problem called MDDS that combines the total utility of the selected points with the $\texttt{div}(S) = \min_{u,v \in S : u \neq v}\textnormal{dist}(u,v)$ diversity objective. We design and analyze the GIST algorithm, which achieves a $2/3$-approximation guarantee for general metric spaces by solving a series of maximum independent set instances on geometric intersection graphs with the GreedyIndependentSet bicriteria approximation algorithm. We also give a nearly linear-time implementation of GIST for points in low-dimensional Euclidean space. We complement our algorithm with a $(2/3+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$, in general metric spaces. We also prove hardness of approximation in the practical cases of Euclidean metrics or $k = \Omega(n)$. Finally, we present an empirical study showing that GIST steadily outperforms baseline methods for MDDS (in particular greedy) on a simple set of Gaussian points with random weights. We conclude with an experiment comparing the top-1 image classification accuracy achieved by GIST and three other state-of-the-art data sampling methods for ImageNet.

References

  • Addanki et al. (2022) R. Addanki, A. McGregor, A. Meliou, and Z. Moumoulidou. Improved approximation and scalability for fair max-min diversification. In Proceedings of the 25th International Conference on Database Theory, pages 7:1–7:21, 2022.
  • Alimonti and Kann (2000) P. Alimonti and V. Kann. Some APX-completeness results for cubic graphs. Theoretical Computer Science, 237(1-2):123–134, 2000.
  • Anil et al. (2023) R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Ash et al. (2020) J. T. Ash, C. Zhang, A. Krishnamurthy, J. Langford, and A. Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In Proceedings of the 8th International Conference on Learning Representations, 2020.
  • Barequet and Har-Peled (2001) G. Barequet and S. Har-Peled. Efficiently approximating the minimum-volume bounding box of a point set in three dimensions. Journal of Algorithms, 38(1):91–109, 2001.
  • Beygelzimer et al. (2006) A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104, 2006.
  • Bhaskara et al. (2016) A. Bhaskara, M. Ghadiri, V. Mirrokni, and O. Svensson. Linear relaxations for finding diverse elements in metric spaces. Advances in Neural Information Processing Systems, 29:4098–4106, 2016.
  • Citovsky et al. (2021) G. Citovsky, G. DeSalvo, C. Gentile, L. Karydas, A. Rajagopalan, A. Rostamizadeh, and S. Kumar. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944, 2021.
  • Guo et al. (2020) R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pages 3887–3896. PMLR, 2020.
  • Håstad (1996) J. Håstad. Clique is hard to approximate within $n^{1-\varepsilon}$. In Proceedings of the 37th Conference on Foundations of Computer Science, pages 627–636, 1996.
  • Håstad (2001) J. Håstad. Some optimal inapproximability results. Journal of the ACM, 48(4):798–859, 2001.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Imanparast et al. (2018) M. Imanparast, S. N. Hashemi, and A. Mohades. An efficient approximation for point-set diameter in higher dimensions. In Proceedings of the 30th Canadian Conference on Computational Geometry, pages 72–77, 2018.
  • Izbicki and Shelton (2015) M. Izbicki and C. Shelton. Faster cover trees. In International Conference on Machine Learning, pages 1162–1170. PMLR, 2015.
  • Kazhamiaka et al. (2021) F. Kazhamiaka, M. Zaharia, and P. Bailis. Challenges and opportunities for autonomous vehicle query systems. In Proceedings of the Conference on Innovative Data Systems Research, 2021.
  • Kirsch et al. (2019) A. Kirsch, J. van Amersfoort, and Y. Gal. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Mahabadi and Trajanovski (2023) S. Mahabadi and S. Trajanovski. Core-sets for fair and diverse data summarization. Advances in Neural Information Processing Systems, 2023.
  • Meinl et al. (2011) T. Meinl, C. Ostermann, and M. R. Berthold. Maximum-score diversity selection for early drug discovery. Journal of Chemical Information and Modeling, 51(2):237–247, 2011.
  • Moumoulidou et al. (2020) Z. Moumoulidou, A. McGregor, and A. Meliou. Diverse data selection under fairness constraints. In Proceedings of the 24th International Conference on Database Theory, pages 13:1–13:25, 2020.
  • Roth and Small (2006) D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proceedings of the 17th European Conference on Machine Learning, 2006.
  • Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
  • Sener and Savarese (2018) O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set approach. In Proceedings of the 6th International Conference on Learning Representations, 2018.
  • Wei et al. (2015) K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pages 1954–1963. PMLR, 2015.
  • Zhdanov (2019) F. Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
  • Zuckerman (2006) D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 681–690, 2006.

Appendix A Missing analysis for Section 3

A.1 Proof of Theorem 3.4

Theorem 3.4 (restated). For $V \subseteq \mathbb{R}^{d}$ and Euclidean distance, GIST runs in time $O(n^3 k + n^2 d)$. Furthermore, for fixed dimension $d$, the running time can be improved to $O(n\log n\cdot\log_{1+\varepsilon}(1/\varepsilon)\cdot\textnormal{poly}(d) + 1/\varepsilon^{d-1})$.

Proof.

First, observe that we can compute and store all pairwise distances in $O(n^2 d)$ time and $O(n^2)$ space. This is useful in the naive implementation because we can then look up the values $\textnormal{dist}(u,v)$ in constant time in any subroutine.

In the nearly linear-time implementation, we skip this precomputation step and use cover trees for fast exact nearest-neighbor search in low-dimensional Euclidean space (Beygelzimer et al., 2006): inserting a point into the index takes $O(c^{12}\log n)$ time and finding a nearest neighbor in the index to a query point takes $O(c^{6}\log n)$ time, where $c = \Theta(d)$ is the doubling dimension of $d$-dimensional Euclidean space.

Thus, we can speed up the following subroutines in Algorithm 1 to have near-linear running times in a fixed dimension:

  • Computing $d_{\min}$: $O(n^2) \rightarrow O(n\cdot d^{12}\log n)$ after discarding duplicate points.

  • GreedyIndependentSet: $O(nk) \rightarrow O(n\cdot d^{12}\log n)$.

  • Evaluating $f(S)$: $O(k^2) \rightarrow O(k\cdot d^{12}\log k)$.

Taking advantage of the error $\varepsilon$ in our $(2/3-\varepsilon)$-approximation guarantee, we use the approximate point-set diameter algorithm in Imanparast et al. (2018, Theorem 1) to compute a $(1-\varepsilon)$-approximate diameter, together with two points realizing it. This speeds up the naive running time of Lines 3–4 in Algorithm 1 as follows:

  • Computing a $(1-\varepsilon)$-approximate $d_{\max}$: $O(n^2) \rightarrow O\left(n\cdot 2^{d} d + \frac{(2\sqrt{d})^{2d}}{\varepsilon^{d-1}}\right)$.

Since we have a smaller diameter $(1-\varepsilon)d_{\max}$, we need to show that it is compatible with the proof of Theorem 3.1. Concretely, this means modifying Equations (5) and (6), and observing that the inequality

$$\left(1 - \frac{2}{3} + \frac{1}{3(1+\varepsilon)}\right)(1-\alpha)\cdot(1-\varepsilon)d^{*} \geq \left(\frac{2}{3} - \varepsilon\right)(1-\alpha)d^{*}$$

still holds. Lastly, the number of calls to GreedyIndependentSet (i.e., distance thresholds that GIST tries) is $\min\{\lceil\log_{1+\varepsilon}(2/\varepsilon)\rceil, n^2\}$. Putting everything together proves the claim. ∎

Appendix B Missing analysis for Section 4

B.1 Proof of Theorem 4.2

Theorem 4.2 (restated). For any $\varepsilon > 0$, there is no polynomial-time $(3/4+\varepsilon)$-approximation algorithm for the MDDS problem, even if $k \geq |V|/4$, unless $\textnormal{P} = \textnormal{NP}$.

Proof.

We use the vertex cover instance constructed in Håstad (2001, Theorem 8.1). For any $\varepsilon_0 > 0$, they construct a graph $G$ with $2^{r+2}$ nodes for some integer $r$ and show that it is NP-hard to distinguish between the following two cases:

  • YES case: Graph $G$ has an independent set of size $2^{r}(1-\varepsilon_0)$.

  • NO case: Graph $G$ has no independent set of size larger than $2^{r}(1/2+\varepsilon_0)$.

We mimic the general structure of the proof of Theorem 4.1. Add a point for each vertex in graph $G$ and set its weight to $1$. Set the distance between two points to $1$ if they share an edge in the graph, and set it to $2$ otherwise. Lastly, let $k = 2^{r}$. Since there are $2^{r+2}$ nodes in the graph, $k = |V|/4$ in this case. In the YES case, objective function (1) can achieve the following value by selecting the independent set of size $2^{r}(1-\varepsilon_0)$:

$$\textnormal{OPT} \geq \alpha\cdot\frac{1}{2^{r}}\cdot 2^{r}(1-\varepsilon_0) + (1-\alpha)\cdot 2 = 2 - (1+\varepsilon_0)\cdot\alpha.$$

In the NO case, the largest value that can be achieved is the best of two scenarios: (a) selecting $k$ points with minimum distance $1$, or (b) selecting an independent set of size at most $2^{r}(1/2+\varepsilon_0)$ and achieving minimum distance $2$. This results in the following upper bound for the objective:

$$\textnormal{ALG} \leq \max\{\alpha + (1-\alpha)\cdot 1,\; \alpha(1/2+\varepsilon_0) + (1-\alpha)\cdot 2\} = \max\{1,\; 2 - (3/2-\varepsilon_0)\cdot\alpha\}.$$

The largest hardness gap occurs when the two terms of the $\max$ expression become equal, which happens at $\alpha = 1/(3/2-\varepsilon_0)$ and yields the upper bound $\textnormal{ALG} \leq 1$. The optimum value OPT is then lower bounded by

$$\textnormal{OPT} \geq 2 - (1+\varepsilon_0)\cdot\alpha = 2 - \frac{1+\varepsilon_0}{3/2-\varepsilon_0} = \frac{2-3\varepsilon_0}{3/2-\varepsilon_0}.$$

The desired inapproximability gap of $3/4+\varepsilon$ in Theorem 4.2 is achieved when the ratio between the upper bound on ALG and the lower bound on OPT drops below it. In other words, we want to set $\varepsilon_0$ such that the following inequality holds:

$$\frac{2-3\varepsilon_0}{3/2-\varepsilon_0} > \frac{1}{3/4+\varepsilon}.$$

We set $\varepsilon_0 \coloneqq \varepsilon$ and prove that the above inequality holds. We first note that both the numerator and denominator of the ratio on the left side are positive. Therefore, we can invert both ratios and swap their places to obtain the following equivalent inequality:

$$\frac{3}{4}+\varepsilon > \frac{3/2-\varepsilon_0}{2-3\varepsilon_0} \iff \varepsilon > \frac{3/2-\varepsilon_0}{2-3\varepsilon_0} - \frac{3}{4} = \frac{6-4\varepsilon_0-6+9\varepsilon_0}{8-12\varepsilon_0} = \frac{5\varepsilon_0}{8-12\varepsilon_0}.$$

The right-hand side inequality holds for our choice $\varepsilon_0 = \varepsilon$: assuming without loss of generality that $\varepsilon < 1/4$ (hardness for smaller $\varepsilon$ implies hardness for larger $\varepsilon$), we have $5\varepsilon/(8-12\varepsilon) < \varepsilon$. ∎

B.2 Proof of Lemma 4.4

Let $G = (V, E)$ be a simple undirected graph. Our goal is to embed the vertices of $V$ into $\mathbb{R}^{d}$, for some $d \geq 1$, in a way that encodes the adjacency structure of $G$. Concretely, we want to construct a function $h_G : V \rightarrow \mathbb{R}^{d}$ such that:

  • $\lVert h_G(u) - h_G(v)\rVert_2 = 1$ if $\{u,v\} \not\in E$, and

  • $\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \varepsilon_G$ if $\{u,v\} \in E$,

for the largest possible value of $\varepsilon_G \in (0, 1]$.

Construction.

First, let $n = |V|$ and $m = |E|$. Augment $G$ by adding a self-loop to each node to get $G' = (V, E')$. We embed $V$ using $G'$ since each node now has positive degree. Let $\deg'(v)$ be the degree of $v$ in $G'$ and $N'(v)$ be the neighborhood of $v$ in $G'$.

Define a total ordering on $E'$ (e.g., lexicographically by sorted endpoints $\{u,v\}$). Each edge $e \in E'$ corresponds to an index in the embedding dimension $d \coloneqq |E'| = m + n$. We consider the embedding function that acts as a degree-normalized adjacency vector:

$$h_G(v)_e = \begin{cases}\sqrt{\frac{1}{2\deg'(v)}} & \text{if } v \in e,\\ 0 & \text{if } v \not\in e.\end{cases} \tag{8}$$
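A direct numpy sketch of this construction (illustrative, with vertices labeled $0, \dots, n-1$ and edges given as pairs): each coordinate corresponds to one edge of $G'$, including the self-loops.

import numpy as np

def embed_graph(n, edges):
    """Map each vertex of G to the degree-normalized incidence vector of Equation (8)."""
    edges_aug = list(edges) + [(v, v) for v in range(n)]   # G': add a self-loop at every node
    deg = np.zeros(n, dtype=int)                           # deg'(v) = deg(v) + 1
    for u, v in edges_aug:
        deg[u] += 1
        if u != v:
            deg[v] += 1
    h = np.zeros((n, len(edges_aug)))
    for idx, (u, v) in enumerate(edges_aug):
        h[u, idx] = np.sqrt(1.0 / (2 * deg[u]))
        h[v, idx] = np.sqrt(1.0 / (2 * deg[v]))
    return h   # rows are h_G(v); non-adjacent pairs are at distance exactly 1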

Lemma 4.4 (restated). Let $G = (V, E)$ be a simple undirected graph with $n = |V|$, $m = |E|$, and maximum degree $\Delta$. There exists an embedding $h_G : V \rightarrow \mathbb{R}^{n+m}$ such that if $\{u,v\} \in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 \leq 1 - \frac{1}{2(\Delta+1)}$, and if $\{u,v\} \not\in E$ then $\lVert h_G(u) - h_G(v)\rVert_2 = 1$.

Proof.

If $\{u,v\} \not\in E$, then we have

$$\begin{aligned}
\lVert h_G(u) - h_G(v)\rVert_2^2 &= \sum_{e \in N'(u)}\left(\sqrt{\frac{1}{2\deg'(u)}} - 0\right)^{2} + \sum_{e \in N'(v)}\left(\sqrt{\frac{1}{2\deg'(v)}} - 0\right)^{2} \\
&= \left(\frac{1}{2}\sum_{e \in N'(u)}\frac{1}{\deg'(u)}\right) + \left(\frac{1}{2}\sum_{e \in N'(v)}\frac{1}{\deg'(v)}\right) \\
&= \frac{1}{2} + \frac{1}{2} = 1.
\end{aligned}$$

This follows because the only index where both embeddings can be nonzero is the one corresponding to $\{u,v\}$, if it exists.

Now suppose that $\{u,v\} \in E$. It follows that

$$\begin{aligned}
\lVert h_G(u) - h_G(v)\rVert_2^2 &= \sum_{e \in N'(u)\setminus\{v\}}\frac{1}{2\deg'(u)} + \sum_{e \in N'(v)\setminus\{u\}}\frac{1}{2\deg'(v)} + \left(\sqrt{\frac{1}{2\deg'(u)}} - \sqrt{\frac{1}{2\deg'(v)}}\right)^{2} \\
&= \frac{\deg'(u)-1}{2\deg'(u)} + \frac{\deg'(v)-1}{2\deg'(v)} + \left(\frac{1}{2\deg'(u)} + \frac{1}{2\deg'(v)} - 2\sqrt{\frac{1}{4\deg'(u)\deg'(v)}}\right) \\
&= \frac{1}{2} + \frac{1}{2} - \sqrt{\frac{1}{\deg'(u)\deg'(v)}} \\
&\leq 1 - \frac{1}{\Delta+1}.
\end{aligned}$$

The previous inequality follows from $\deg'(v) = \deg(v) + 1 \leq \Delta + 1$. For any $x \in [0,1]$, we have

$$\sqrt{1-x} \leq 1 - \frac{x}{2},$$

so it follows that

$$\lVert h_G(u) - h_G(v)\rVert_2 \leq \sqrt{1 - \frac{1}{\Delta+1}} \leq 1 - \frac{1}{2(\Delta+1)},$$

which completes the proof. ∎

B.3 Proof of Theorem 4.5

Theorem 4.5 (restated). The MDDS problem is APX-complete for Euclidean metrics.

Proof.

We build on the hardness of approximation for the maximum independent set problem for graphs with maximum degree $\Delta = 3$. Alimonti and Kann (2000, Theorem 3.2) showed that this problem is APX-complete, so there exists some $\varepsilon_0 > 0$ such that there is no polynomial-time $(1-\varepsilon_0)$-approximation algorithm unless $\textnormal{NP} = \textnormal{P}$. Hence, there exists a graph $G$ with maximum degree $\Delta = 3$ and a threshold integer value $k$ such that it is NP-hard to distinguish between the following two cases:

  • YES instance: Graph $G$ has an independent set of size $k$.

  • NO instance: Graph $G$ does not have an independent set of size greater than $(1-\varepsilon_0)k$.

We reduce this instance of bounded-degree maximum independent set to MDDS with objective function (1) as follows. Embed each node of the graph $G$ into Euclidean space using the function $h_G(v)$ from Lemma 4.4. We use the same threshold value of $k$ (between the YES and NO instances above) for the cardinality constraint on set $S$, and we set each weight $w(v) = 1$. In a YES instance, selecting an independent set of size $k$ as the set $S$ results in the maximum value of objective (1):

$$\textnormal{OPT} = \alpha\cdot\frac{1}{k}\cdot k + (1-\alpha)\cdot 1 = 1,$$

since $\lVert h_G(u) - h_G(v)\rVert_2 = 1$ for any two distinct points $u, v \in S$, as there is no edge between $u$ and $v$ in graph $G$.

In a NO instance, the best objective value that can be achieved in polynomial time is the maximum of the following two scenarios: (a) selecting $k$ points with minimum distance at most $1 - 1/(2(\Delta+1)) = 1 - 1/8$, or (b) selecting at most $(1-\varepsilon_0)k$ vertices forming an independent set with minimum distance equal to $1$. The maximum value obtained by any polynomial-time algorithm is then

$$\textnormal{ALG} = \max\{\alpha(1-\varepsilon_0) + (1-\alpha)\cdot 1,\; \alpha + (1-\alpha)(1 - 1/8)\} = \max\{1 - \varepsilon_0\cdot\alpha,\; (7+\alpha)/8\}.$$

These two terms become equal for $\alpha = 1/(1+8\varepsilon_0)$. Therefore, the gap between the maximum value any algorithm can achieve in the NO case and the optimum value in the YES case is upper bounded by

$$1 - \varepsilon_0\cdot\alpha = 1 - \frac{\varepsilon_0}{1+8\varepsilon_0} = 1 - \varepsilon_1.$$

Since $\varepsilon_0 > 0$ is a constant, $\varepsilon_1 \coloneqq \varepsilon_0/(1+8\varepsilon_0) > 0$ is also a positive constant. This completes the proof of APX-completeness. ∎

Appendix C Missing details for Section 5

Hyperparameters for ImageNet classification.

We generate predictions and embeddings for all points using a coarsely-trained ResNet-56 model (He et al., 2016) trained on a random 10% subset of ImageNet (Russakovsky et al., 2015). We use SGD with Nesterov momentum 0.9 and 450/90 epochs. The base learning rate is 0.1 and is reduced by a tenth at epochs 5, 30, 69, and 80. We extract the penultimate-layer features to produce 2048-dimensional embeddings of each image.