
SLOSH: Set LOcality Sensitive Hashing
via Sliced-Wasserstein Embeddings

Yuzhe Lu
Computer Science Department,
Vanderbilt University,
Nashville, TN, 37235
[email protected]
   Xinran Liu
Computer Science Department,
Vanderbilt University
Nashville, TN, 37235
[email protected]
   Andrea Soltoggio
School of Computer Science,
Loughborough University,
Leicestershire, UK
[email protected]
   Soheil Kolouri
Computer Science Department,
Vanderbilt University,
Nashville, TN, 37235
[email protected]
Abstract

Learning from set-structured data is an essential problem with many applications in machine learning and computer vision. This paper focuses on non-parametric and data-independent learning from set-structured data using approximate nearest neighbor (ANN) solutions, particularly locality-sensitive hashing. We consider the problem of set retrieval from an input set query. Such a retrieval problem requires: 1) an efficient mechanism to calculate the distances/dissimilarities between sets, and 2) an appropriate data structure for fast nearest neighbor search. To that end, we propose Sliced-Wasserstein set embedding as a computationally efficient "set-2-vector" mechanism that enables downstream ANN, with theoretical guarantees. The set elements are treated as samples from an unknown underlying distribution, and the Sliced-Wasserstein distance is used to compare sets. We demonstrate the effectiveness of our algorithm, denoted as Set-LOcality Sensitive Hashing (SLOSH), on various set retrieval datasets, compare our proposed embedding with standard set embedding approaches, including Generalized Mean (GeM) embedding/pooling, Featurewise Sort Pooling (FSPool), and Covariance Pooling, and show consistent improvement in retrieval results. The code for replicating our results is available here: https://github.com/mint-vu/SLOSH.

1 Introduction

The nearest neighbor search problem is at the heart of many nonparametric learning approaches in classification, regression, and density estimation, with many applications in machine learning, computer vision, and other related fields [3, 55, 6]. The exhaustive search solution to the nearest neighbor problem for $N$ given objects (e.g., images, vectors, etc.) requires $N$ evaluations of (dis)similarities (or distances), which could be problematic when: 1) the number of objects, $N$, is large, or 2) (dis)similarity evaluation is expensive. Approximate Nearest Neighbor (ANN) [5] approaches have been proposed as an efficient alternative for similarity search on massive datasets. ANN approaches leverage data structures like random projections, e.g., Locality-Sensitive Hashing (LSH) [21, 17], or tree-based structures, e.g., kd-trees [10, 63], to reduce the complexity of nearest neighbor search. Ideally, ANN approaches must address both of these challenges, i.e., decreasing the number of similarity evaluations and reducing the computational complexity of similarity calculations, while providing theoretical guarantees on ANN retrievals.

Despite the great strides in developing ANN methods, the majority of the existing approaches are designed for objects living in Hilbert spaces. Recently, however, there has been an increasing interest in set-structured data with many applications in point cloud processing, graph learning, image/video recognition, and object detection, to name a few [70, 36, 62]. Even when the input data itself is not a set, in many applications, the complex input data (e.g., a natural image or a graph) is decomposed into a set of more abstract components (e.g., objects or node embeddings). Similarity search for large databases of set-structured data remains an active field of research, with many real-world applications. In this paper, we focus on developing a data-independent ANN method for set-structured data. We leverage insights from computational optimal transport [61, 13, 30, 48] and propose a novel LSH algorithm, which relies on Sliced-Wasserstein Embeddings and enables efficient set retrieval.

Defining (dis)similarities for set-structured data comes with unique challenges: i) the sets could have different cardinalities, and ii) the set elements do not necessarily have an inherent ordering. Hence, a similarity measure for set-structured data must handle inputs of varied sizes and should be invariant to permutations, i.e., the (dis)similarity score should not change under any permutation of the input set elements. Generally, existing approaches for defining similarities between sets rely on one of two strategies. The first solves an assignment problem (via optimization) to find corresponding elements between two sets and aggregates the (dis)similarities between corresponding elements, e.g., using the Hungarian algorithm, Wasserstein distances, or the Chamfer loss. These approaches are at best quadratic and at worst cubic in the set cardinalities.

The second family of approaches relies on embedding the sets into a vector space and leveraging common similarities in the embedded space. The set embedding could be explicit (e.g., deep set networks) [70, 36] or implicit (e.g., kernel methods) [22, 19, 11, 51, 50, 42, 69, 32]. Also, the embedding process could be data-dependent (i.e., learning-based), as in deep set learning approaches, which leverage a composition of a permutation-equivariant backbone followed by a permutation-invariant global pooling mechanism that defines a parametric permutation-invariant set embedding into a Hilbert space [70, 36, 72]. Or it can be data-independent, as is the case for global average/max/sum/covariance pooling, variations of Janossy pooling [43], and variations of Wasserstein embedding [27], among others. Recently, there has been a lot of interest in learning-based embeddings using deep neural networks and, in particular, transformer networks. However, data-independent embedding approaches (e.g., global poolings) have received less attention.

Contributions. Our paper focuses on non-parametric learning from set-structured data using data-independent set embeddings. Precisely, we consider the problem where our training data is a set of sets, i.e., $\mathcal{X}=\{X_{i}\,|\,X_{i}=\{x^{i}_{n}\in\mathbb{R}^{d}\}_{n=0}^{N_{i}-1}\}_{i=1}^{I}$ (e.g., a set of point clouds), and for a query set $X$ we would like to retrieve the $K$-Nearest Neighbors (KNN) from $\mathcal{X}$. To solve this problem, we require a fast and reliable (dis)similarity measure between sets, and a computationally efficient nearest neighbor search. We propose Sliced-Wasserstein Embedding as a computationally efficient and powerful tool that provides a set-2-vector operator with computational complexity of $\mathcal{O}(LN(d+\log(N)))$ with sequential processing, or $\mathcal{O}(N(d+\log(N)))$ with parallel processing, where $L$ is of the same order as $d$. Treating sets as empirical distributions, Sliced-Wasserstein Embedding embeds sets in a vector space in which the Euclidean distance between two embedded vectors is equal to the Sliced-Wasserstein distance between their corresponding empirical distributions. Such an embedding enables the application of fast ANN approaches, like Locality Sensitive Hashing (LSH), to sets while providing collision probabilities with respect to the Sliced-Wasserstein distance. Finally, we provide extensive numerical results analyzing and comparing our approach with various data-independent embedding methods in the literature.

2 Related Work

Set embeddings (set-2-vector): Machine learning on set-structured data is challenging due to: 1) the permutation-invariant nature of sets, and 2) their varying cardinalities. Hence, any model (parametric or non-parametric) designed for analyzing set-structured data has to be permutation invariant and allow for inputs of various sizes. Today, a common approach for learning from sets is to use a permutation-equivariant parametric function, e.g., fully connected networks [70] or transformer networks [36], composed with a permutation-invariant function, i.e., a global pooling, e.g., global average pooling, or pooling by multi-head attention [36]. One can view this process as embedding a set into a fixed-dimensional representation through a parametric embedding that could be learned using the training data and an objective function (e.g., classification).

A major open question concerns non-parametric learning from set-structured data. In other words, what would be a good data-independent set embedding that one can use for generic applications, including K-Nearest Neighbor classification/regression/density estimation? Given that a data-independent global pooling could be viewed as a set-2-vector process, we surveyed the existing set-2-vector mechanisms in the literature. In particular, global average/max/sum and covariance pooling [64] could be considered the simplest such processes. Generalized Mean (GeM) [53] is another pooling mechanism commonly used in image retrieval applications, which captures higher statistical moments of the underlying distributions. Other notable approaches include VLAD [23, 7], CroW [25], and FSPool [72], among others.

Locality Sensitive Hashing (LSH): An LSH function hashes two "similar" objects into the same bucket with "high" probability, while ensuring that "dissimilar" objects end up in the same bucket with "low" probability. Originally presented in [21] and extended in [17], LSH uses random projections of high-dimensional data to hash samples into different buckets. The LSH algorithm forms the foundation of many ANN search methods, which provide theoretical guarantees and have been extensively studied since its conception [3, 33, 4, 6].

Here, we are interested in nearest neighbor retrieval for sets. More precisely, given a training set of sets (think of a set of point clouds) and a test set (a point cloud representation of an object), we would like to retrieve the "nearest" sets in our training data in an efficient manner. To that end, we extend LSH to enable its application to set retrieval. While there have been a few recent works [45, 26] on the topic of LSH for set queries, our proposed approach differs significantly from them. In contrast to [45, 26], we provide a Euclidean embedding for sets, which allows for a direct utilization of the LSH algorithm and provides collision probabilities as a function of the set metric.

Wasserstein-based learning: Wasserstein distances are rooted in the optimal transportation problem [61, 30, 48], and they provide a robust mathematical framework for comparing probability distributions that respect the underlying geometry of the space. Wasserstein distances have recently received abundant interest from the Machine Learning and Computer Vision communities. These distances and their variations (e.g., Sliced-Wasserstein distances [52, 13] and subspace robust Wasserstein distances [47]) have been extensively studied in the context of deep generative modeling [8, 20, 60, 31, 38], domain adaptation [15, 16, 9, 35], transfer learning [2], adversarial attacks [66, 67], and adversarial robustness [37, 58].

More recently, Wasserstein distances and optimal transport have been used in the context of comparing set-structured data. The main idea behind these recent approaches is to treat sets (with possibly variable cardinalities) as empirical distributions and use transport-based distances for comparing/modeling these distributions. For instance, Togninalli et al. [59] proposed to compare node embeddings of two graphs (treated as sets) via the Wasserstein distance. Later, Mialon et al. [40] and Kolouri et al. [27] concurrently proposed Wasserstein embedding frameworks for extracting fixed-dimensional representations from set-structured data. Here, we further extend this direction and propose Sliced-Wasserstein Embedding as a computationally efficient approach that allows us to perform data-independent, non-parametric learning from set-structured data.

3 Preliminaries

We denote an input set with $N_{i}$ elements living in $\mathbb{R}^{d}$ by $X_{i}=\{x_{n}^{i}\in\mathbb{R}^{d}\}_{n=0}^{N_{i}-1}$. We view sets as empirical probability measures, $\mu_{i}$, defined on $\mathbb{X}\subseteq\mathbb{R}^{d}$ with probability density $d\mu_{i}(x)=p_{i}(x)dx$, where $p_{i}(x)=\frac{1}{N_{i}}\sum_{n=1}^{N_{i}}\delta(x-x_{n}^{i})$ and $\delta(\cdot)$ is the Dirac delta function. The main idea is then to define the distance between two sets, $X_{i}$ and $X_{j}$, as a probability metric between their empirical distributions.

3.1 Sliced-Wasserstein distances

We use the Sliced-Wasserstein (SW) probability metric as a distance measure between the sets (viewed as empirical probability distributions). Later we will see that this choice allows us to effectively embed sets into a vector space such that the Euclidean distance between embedded sets is equal to the SW-distance between their corresponding empirical distributions. But first, let us briefly define the Wasserstein and Sliced-Wasserstein distances.

Let $\mu_{i}$ and $\mu_{j}$ be one-dimensional probability measures defined on $\mathbb{R}$. Then the $p$-Wasserstein distance between these measures can be written as:

\mathcal{W}_{p}(\mu_{i},\mu_{j})=\left(\int_{0}^{1}\left(F^{-1}_{\mu_{i}}(\tau)-F^{-1}_{\mu_{j}}(\tau)\right)^{p}d\tau\right)^{\frac{1}{p}} \quad (1)

where $F^{-1}_{\mu}$ is the inverse of the cumulative distribution function (c.d.f.) $F_{\mu}$ of $\mu$, i.e., its quantile function. The one-dimensional $p$-Wasserstein distance between empirical distributions with $N$ and $M$ samples can be computed in $\mathcal{O}(N\log(N)+M\log(M))$, which is in stark contrast to the generally cubic cost for $d$-dimensional ($d>1$) distributions. The one-dimensional case motivates the concept of Sliced-Wasserstein distances [52, 28]. For the rest of this paper we consider only the case $p=2$, and for brevity we refer to the 2-Wasserstein and 2-Sliced-Wasserstein distances as the Wasserstein and Sliced-Wasserstein distances.
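To make the closed form in Eq. (1) concrete, the following is a minimal NumPy sketch of the one-dimensional 2-Wasserstein distance between two empirical samples; the shared grid of quantile levels and the linear interpolation of the empirical quantile functions are illustrative implementation choices, not part of the method described above.

```python
import numpy as np

def wasserstein_1d(x, y, p=2, n_quantiles=200):
    """Empirical 1D p-Wasserstein distance via the quantile-function formula (Eq. (1)).

    x, y: 1D arrays of samples, possibly of different sizes. The quantile
    functions are approximated by sorting and linear interpolation on a
    shared grid of quantile levels.
    """
    x, y = np.sort(x), np.sort(y)
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles          # levels in (0, 1)
    qx = np.interp(taus, (np.arange(len(x)) + 0.5) / len(x), x)  # F_x^{-1}(taus)
    qy = np.interp(taus, (np.arange(len(y)) + 0.5) / len(y), y)  # F_y^{-1}(taus)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

# Two Gaussian samples whose means differ by 1: the distance is close to 1.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0, 1, 300), rng.normal(1, 1, 500)))
```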

The main idea behind SW distances is to slice $d$-dimensional distributions into infinite sets of their one-dimensional slices/marginals and then calculate the expected Wasserstein distance between their slices. Let $g_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a parametric function with parameters $\theta\in\Omega_{\theta}\subseteq\mathbb{R}^{d_{\theta}}$, satisfying the regularity conditions in both inputs and parameters presented in [28]. A common choice is $g_{\theta}(x)=\theta^{T}x$, where $\theta\in\mathbb{S}^{d-1}$ is a unit vector in $\mathbb{R}^{d}$ and $\mathbb{S}^{d-1}$ denotes the unit hypersphere in $\mathbb{R}^{d}$. The slice of a probability measure, $\mu$, with respect to $g_{\theta}$ is the one-dimensional probability measure $g_{\theta\#}\mu$, with the density

p^{\theta}_{i}(t):=\int_{\mathbb{X}}\delta(t-g_{\theta}(x))\,d\mu_{i}(x) \quad \forall t\in\mathbb{R}. \quad (2)

The generalized Sliced-Wasserstein distance is defined as

\mathcal{GSW}_{2}(\mu_{i},\mu_{j})=\left(\int_{\Omega_{\theta}}\mathcal{W}^{2}_{2}(g_{\theta\#}\mu_{i},g_{\theta\#}\mu_{j})\,d\sigma(\theta)\right)^{\frac{1}{2}}, \quad (3)

where $\sigma(\theta)$ is the uniform measure on $\Omega_{\theta}$; once again, for $g_{\theta}(x)=\theta^{T}x$ and $\Omega_{\theta}=\mathbb{S}^{d-1}$, the generalized Sliced-Wasserstein distance is simply the Sliced-Wasserstein distance. Equation (3) is the expected value of the Wasserstein distances between uniformly distributed slices (i.e., on $\Omega_{\theta}$) of the distributions $\mu_{i}$ and $\mu_{j}$.
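As an illustration of Eq. (3) with linear slices $g_{\theta}(x)=\theta^{T}x$, the sketch below estimates the Sliced-Wasserstein distance by Monte Carlo sampling of $\mathbb{S}^{d-1}$, reusing the `wasserstein_1d` helper from the previous sketch; the number of slices `L` is an illustrative choice.

```python
import numpy as np

def sliced_wasserstein(X, Y, L=50, seed=0):
    """Monte Carlo estimate of SW_2 between point sets X (N x d) and Y (M x d),
    using L linear slices drawn uniformly from the unit sphere."""
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(L, X.shape[1]))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # unit projection vectors
    # Average the squared 1D Wasserstein distances over the slices.
    sq = [wasserstein_1d(X @ th, Y @ th) ** 2 for th in thetas]
    return float(np.sqrt(np.mean(sq)))
```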

3.2 Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) [21, 17] strives to hash nearby points in the high-dimensional input space into the same hash bucket with a higher probability than distant ones, effectively solving the $(R,c)$-Near Neighbor problem. More precisely, let $H:=\{h:\mathbb{R}^{d}\rightarrow U\}$ denote an LSH function family, with $U$ denoting the set of hash values. The LSH function family $H$ is called $(R,c,P_{1},P_{2})$-sensitive if for any two points $u,v\in\mathbb{R}^{d}$ and $\forall h\in H$ it satisfies the following conditions:

  • If $\|u-v\|_{2}\leq R$, then $\Pr[h(u)=h(v)]\geq P_{1}$, and

  • If $\|u-v\|_{2}>cR$, then $\Pr[h(u)=h(v)]\leq P_{2}$,

where for a proper LSH, $c>1$ and $P_{1}>P_{2}$. The original LSH for the Euclidean distance uses

h_{a,b}(u)=\left\lfloor\frac{a^{T}u+b}{\omega}\right\rfloor

where $\lfloor\cdot\rfloor$ is the floor function, $a\in\mathbb{R}^{d}$ is a random vector generated from a $p$-stable distribution (e.g., the Normal or Cauchy distribution) [17], and $b$ is a real number chosen uniformly from $[0,\omega)$, where $\omega$ is the width of the hash bucket. Let $r=\|u-v\|_{p}$ and let $f_{p}(t)$ denote the probability density function of the absolute value of the $p$-stable distribution; then this family of hash functions leads to the following probability of collision:

\Pr[h_{a,b}(u)=h_{a,b}(v)]=\int_{0}^{\omega}\frac{1}{r}f_{p}\!\left(\frac{t}{r}\right)\left(1-\frac{t}{\omega}\right)dt \quad (4)

To amplify the gap between $P_{1}$ and $P_{2}$, one can concatenate several such hash functions. Let $\mathcal{G}:=\{g:\mathbb{R}^{d}\rightarrow U^{k}\}$ denote the family of concatenations of $k$ randomly selected hash functions, i.e., $g(u)=(h_{1}(u),\ldots,h_{k}(u))$. In practice, multiple such hash functions, $g_{1},\ldots,g_{T}$, are often used. For points $u$ and $v$ within $R$-distance from one another, the probability that they collide in at least one of the $g_{j}$s is $1-(1-P_{1}^{k})^{T}$.
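The following is a hedged sketch of this construction: $T$ hash tables, each keyed by $k$ concatenated floor hashes with Gaussian (2-stable) projections. The bucket width `w`, code length `k`, and number of tables `T` are illustrative hyper-parameters, and candidates returned by a query would typically be re-ranked by exact distance.

```python
import numpy as np
from collections import defaultdict

class EuclideanLSH:
    """E2LSH-style index: T tables, each keyed by k concatenated hashes
    h_{a,b}(u) = floor((a.u + b) / w) with Gaussian projections."""

    def __init__(self, dim, k=8, T=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(T, k, dim))       # 2-stable random projections
        self.b = rng.uniform(0, w, size=(T, k))     # offsets drawn from [0, w)
        self.w = w
        self.tables = [defaultdict(list) for _ in range(T)]

    def _codes(self, u):
        return np.floor((self.A @ u + self.b) / self.w).astype(int)  # shape (T, k)

    def add(self, u, label):
        for t, code in enumerate(self._codes(u)):
            self.tables[t][tuple(code)].append(label)

    def query(self, u):
        candidates = set()
        for t, code in enumerate(self._codes(u)):
            candidates.update(self.tables[t].get(tuple(code), []))
        return candidates   # re-rank these by exact distance if needed
```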

Another commonly used and related family of hash functions consists of random projections and thresholding, i.e., $h_{a,b}(u)=\operatorname{sgn}(a^{T}u+b)$ [14]. Using $k$ such random projections (i.e., binary codes of length $k$) provides the following collision probability:

\Pr[g(u)=g(v)]=\left[1-\frac{\cos^{-1}(u^{T}v)}{\pi}\right]^{k}
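A quick numerical sanity check of this collision probability (for the common special case $b=0$ and unit-norm $u$, $v$, which is what the closed form above assumes) can be done by sampling random sign codes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 16, 8, 20000
u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)

A = rng.normal(size=(trials, k, d))          # 'trials' independent k-bit sign codes
empirical = np.all(np.sign(A @ u) == np.sign(A @ v), axis=1).mean()
predicted = (1 - np.arccos(np.clip(u @ v, -1, 1)) / np.pi) ** k
print(empirical, predicted)                  # the two values should be close
```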

Since its conception [21, 17], many variants of LSH have been proposed. However, these approaches are not designed to handle set queries. Here, we extend LSH to set-structured data via Sliced-Wasserstein Embeddings.

Figure 1: A graphical depiction of Sliced-Wasserstein Embedding (SWE) for a given reference set and a chosen number of slices, $L$. The input and reference sets are sliced via $L$ random projections $\{g_{\theta_{l}}\}_{l=1}^{L}$. The projections are then sorted, and the Monge couplings between the input and reference sets' slices are calculated following Eq. (13). Finally, the embedding is obtained by weighting and concatenating the $\phi_{\theta_{l}}$s. The Euclidean distance between two embedded sets is equal to their corresponding $\widehat{\mathcal{GSW}}_{2,L}$ distance/dissimilarity, i.e., $\|\phi(X_{i})-\phi(X_{j})\|_{2}=\widehat{\mathcal{GSW}}_{2,L}(X_{i},X_{j})\approx\mathcal{GSW}_{2}(X_{i},X_{j})$.

4 Problem Formulation and Method

4.1 Sliced-Wasserstein Embedding

The idea of Sliced-Wasserstein Embeddings (SWE) is rooted in Linear Optimal Transport [65, 30, 41] and was first introduced in the context of pattern recognition from 2D probability density functions (e.g., images) [29] and, more recently, in [56]. Our work extends the framework to $d$-dimensional distributions with the specific application of set retrieval in mind. Consider a set of probability measures $\{\mu_{i}\}_{i=1}^{I}$ with densities $\{p_{i}\}_{i=1}^{I}$, and recall that we use $\mu_{i}$ to represent the $i$'th set $X_{i}=\{x_{n}^{i}\}_{n=0}^{N_{i}-1}$. At a high level, SWE can be thought of as a set-2-vector operator, $\phi$, such that:

\|\phi(\mu_{i})-\phi(\mu_{j})\|_{2}=\mathcal{GSW}_{2}(\mu_{i},\mu_{j}). \quad (5)

For convenience, we use $\mu_{i}^{\theta}:=g_{\theta\#}\mu_{i}$ to denote the slice of measure $\mu_{i}$ with respect to $g_{\theta}$. Also, let $\mu_{0}$ denote a reference measure, with $\mu_{0}^{\theta}$ being its corresponding slice. The optimal transport coupling (i.e., Monge coupling) between $\mu_{i}^{\theta}$ and $\mu_{0}^{\theta}$ can be written as

T^{\theta}_{i}=F_{\mu_{i}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}}, \quad (6)

and recall that $F_{\mu_{i}^{\theta}}^{-1}$ is the quantile function of $\mu_{i}^{\theta}$. Now, letting $id$ denote the identity function, we define the so-called cumulative distribution transform (CDT) [46] of $\mu_{i}^{\theta}$ as

\phi_{\theta}(\mu_{i})\coloneqq(T_{i}^{\theta}-id). \quad (7)

For a fixed $\theta$, we can show that $\phi_{\theta}(\mu_{i})$ satisfies the following conditions (see the supplementary material):

  1. C1. The weighted $\ell_{2}$-norm of the embedded slice, $\phi_{\theta}(\mu_{i})$, satisfies:

     \|\phi_{\theta}(\mu_{i})\|_{\mu_{0}^{\theta},2}=\left(\int_{\mathbb{R}}\|\phi_{\theta}(\mu_{i})(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)\right)^{\frac{1}{2}}=\mathcal{W}_{2}(\mu_{i}^{\theta},\mu_{0}^{\theta}).

     As a corollary, we have $\|\phi_{\theta}(\mu_{0})\|_{\mu_{0}^{\theta},2}=0$.

  2. C2. The weighted $\ell_{2}$ distance between two embedded slices satisfies:

     \|\phi_{\theta}(\mu_{i})-\phi_{\theta}(\mu_{j})\|_{\mu_{0}^{\theta},2}=\mathcal{W}_{2}(\mu_{i}^{\theta},\mu_{j}^{\theta}). \quad (8)

It follows from C1 and C2 that:

\mathcal{GSW}_{2}(\mu_{i},\mu_{j})=\left(\int_{\Omega_{\theta}}\|\phi_{\theta}(\mu_{i})-\phi_{\theta}(\mu_{j})\|_{\mu_{0}^{\theta},2}^{2}\,d\sigma(\theta)\right)^{\frac{1}{2}}. \quad (9)

For a probability measure $\mu_{i}$, we then define the mapping to the embedding space via

\phi(\mu_{i})\coloneqq\{\phi_{\theta}(\mu_{i})~|~\theta\in\Omega_{\theta}\}.

Finally, we can re-weight the embedding (according to $d\mu^{\theta}_{0}$) such that the weighted $\ell_{2}$ distance in Eq. (8) becomes the $\ell_{2}$ distance in Eq. (5). Next, we describe this re-weighting and other implementation considerations in more detail.

4.2 Monte Carlo Integral Approximation

The GSW distance in Eq. (3) relies on an integration over $\Omega_{\theta}$ (e.g., $\mathbb{S}^{d-1}$ for linear slices), which cannot be directly calculated. Following the common practice in the literature [52, 12, 32, 31, 39], we approximate the integration over $\theta$ via Monte Carlo (MC) sampling of $\Omega_{\theta}$. Let $\Theta_{L}=\{\theta_{l}\sim\sigma(\theta)\}_{l=1}^{L}$ denote a set of $L$ parameters sampled independently and uniformly from $\Omega_{\theta}$. We assume an empirical reference measure, $\mu_{0}=\frac{1}{M}\sum_{m=1}^{M}\delta(x-x_{m}^{0})$, with $M$ samples. The MC approximation is written as:

\widehat{\mathcal{GSW}}^{2}_{2,L}(\mu_{i},\mu_{j})=\frac{1}{LM}\sum_{l=1}^{L}\|\phi_{\theta_{l}}(\mu_{i})-\phi_{\theta_{l}}(\mu_{j})\|^{2}_{2}. \quad (10)

Finally, our SWE embedding is calculated via:

\phi(\mu_{i})=\left[\frac{\phi_{\theta_{1}}(\mu_{i})}{\sqrt{LM}};\ldots;\frac{\phi_{\theta_{L}}(\mu_{i})}{\sqrt{LM}}\right]\in\mathbb{R}^{LM\times 1}, \quad (11)

which satisfies:

\|\phi(\mu_{i})-\phi(\mu_{j})\|_{2}=\widehat{\mathcal{GSW}}_{2,L}(\mu_{i},\mu_{j})\approx\mathcal{GSW}_{2}(\mu_{i},\mu_{j}).

As for the approximation error, we rely on Theorem 6 in [44], which uses Hölder's inequality and results on the moments of the Monte Carlo estimation error to obtain:

\mathbb{E}\big[\,|\widehat{\mathcal{GSW}}^{2}_{2,L}(\mu_{i},\mu_{j})-\mathcal{GSW}^{2}_{2}(\mu_{i},\mu_{j})|\,\big]\leq\sqrt{\frac{\mathrm{var}\left(\mathcal{W}_{2}^{2}(\mu^{\theta}_{i},\mu^{\theta}_{j})\right)}{L}} \quad (12)

The upper bound indicates that the approximation error decreases at a rate of $1/\sqrt{L}$. The numerator, however, depends implicitly on the dimensionality of the input space, meaning that a larger number of slices, $L$, is needed in higher dimensions.

4.3 SWE Algorithm

Here we review the algorithmic steps for obtaining SWE. Let $X_{i}=\{x^{i}_{n}\}_{n=0}^{N_{i}-1}$ be the input set with $N_{i}$ elements, and let $X_{0}=\{x^{0}_{m}\}_{m=0}^{M-1}$ denote the reference set of $M$ samples, where in general $M\neq N_{i}$. For a fixed slicer $g_{\theta}$ we calculate $\{g_{\theta}(x^{i}_{n})\}_{n=0}^{N_{i}-1}$ and $\{g_{\theta}(x^{0}_{m})\}_{m=0}^{M-1}$ and sort them in increasing order. Let $\pi_{i}$ and $\pi_{0}$ denote the permutation indices (obtained from argsort). Also, let $\pi_{0}^{-1}$ denote the ordering that permutes the sorted reference set back to its original ordering. Then we numerically calculate the Monge coupling $T^{\theta}_{i}$ via:

T^{\theta}_{i}[m]=F_{\mu_{i}^{\theta}}^{-1}\left(\frac{\pi_{0}^{-1}[m]+1}{M}\right) \quad (13)

where $F_{\mu_{0}^{\theta}}(x_{m}^{0})=\frac{\pi^{-1}_{0}[m]+1}{M}$, assuming that the indices start from 0. Here, $F_{\mu_{i}^{\theta}}^{-1}$ is calculated via interpolation; in our experiments we used linear interpolation, similar to [38]. Note that the dimensionality of the Monge coupling is only a function of the reference cardinality, i.e., $T_{i}^{\theta}\in\mathbb{R}^{M}$. Consequently, we write:

\phi_{\theta}(X_{i})[m]=T_{i}^{\theta}[m]-g_{\theta}(x^{0}_{m}) \quad (14)

and repeat this process for $\theta\in\{\theta_{l}\sim\sigma(\theta)\}_{l=1}^{L}$; we emphasize that this process can be parallelized. The final embedding is achieved by weighting and concatenating the $\phi_{\theta_{l}}$s as in Eq. (11), where the coefficient $\frac{1}{\sqrt{LM}}$ allows us to simplify the weighted Euclidean distance, $\|\cdot\|_{\mu_{0},2}$, to the Euclidean distance, $\|\cdot\|_{2}$. Algorithm 1 summarizes the embedding process, and Figure 1 provides a graphical depiction of it. Lastly, SWE's computational complexity for a set with cardinality $|X|=N$ is $\mathcal{O}(LN(d+\log N))$, where we assume the cardinality of the reference set is of the same order as $N$; $\mathcal{O}(LNd)$ is the cost of slicing and $\mathcal{O}(LN\log(N))$ is the sorting and interpolation cost to calculate Eq. (13).

Algorithm 1 Sliced-Wasserstein Embedding
procedure SWE($X_{i}=\{x_{n}^{i}\}_{n=0}^{N_{i}-1}$, $X_{0}=\{x_{m}^{0}\}_{m=0}^{M-1}$, $L$)
     Generate a set of $L$ samples $\Theta_{L}=\{\theta_{l}\sim\mathcal{U}_{\Omega_{\theta}}\}_{l=1}^{L}$
     Calculate $g_{\Theta_{L}}(X_{0}):=\{g_{\theta_{l}}(x_{m}^{0})\}_{m,l}\in\mathbb{R}^{M\times L}$
     Calculate $\pi_{0}=\mathrm{argsort}(g_{\Theta_{L}}(X_{0}))$ and $\pi_{0}^{-1}$ (on the $m$-axis)
     Calculate $g_{\Theta_{L}}(X_{i}):=\{g_{\theta_{l}}(x_{n}^{i})\}_{n,l}\in\mathbb{R}^{N_{i}\times L}$
     Calculate $\pi_{i}=\mathrm{argsort}(g_{\Theta_{L}}(X_{i}))$ (on the $n$-axis)
     for $l=1$ to $L$ do
          Calculate the Monge coupling $T_{i}^{\theta_{l}}\in\mathbb{R}^{M}$ (Eq. (13))
     end for
     Calculate the embedding $\phi(X_{i})\in\mathbb{R}^{M\times L}$ (Eq. (11))
     return $\phi(X_{i})$
end procedure
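Below is a minimal NumPy sketch of Algorithm 1 with linear slices $g_{\theta}(x)=\theta^{T}x$; the rank computation and the linear interpolation that approximates $F_{\mu_{i}^{\theta}}^{-1}$ in Eq. (13) are illustrative implementation choices, and a fixed seed is used so that all sets share the same slices.

```python
import numpy as np

def swe(X, X0, L=64, seed=0):
    """Sliced-Wasserstein Embedding of a set X (N x d) against a reference set
    X0 (M x d) with L linear slices. Returns a vector of length L*M whose
    pairwise Euclidean distances approximate GSW_2 (cf. Eq. (11))."""
    rng = np.random.default_rng(seed)          # fixed seed: all sets share the slices
    N, d = X.shape
    M = X0.shape[0]
    thetas = rng.normal(size=(L, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)

    emb = np.empty((L, M))
    for l, th in enumerate(thetas):
        s_in = np.sort(X @ th)                                  # sorted input slice
        s_ref = X0 @ th                                         # reference slice (original order)
        taus_ref = (np.argsort(np.argsort(s_ref)) + 1) / M      # F_{mu_0^theta}(x_m^0)
        taus_in = (np.arange(N) + 1) / N                        # CDF levels of the sorted input
        T = np.interp(taus_ref, taus_in, s_in)                  # Monge coupling, Eq. (13)
        emb[l] = T - s_ref                                      # Eq. (14)
    return emb.flatten() / np.sqrt(L * M)                       # weighting of Eq. (11)
```

With this sketch, `np.linalg.norm(swe(Xi, X0) - swe(Xj, X0))` plays the role of $\widehat{\mathcal{GSW}}_{2,L}(X_{i},X_{j})$.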

4.4 SLOSH

Our proposed Set LOcality Sensitive Hashing (SLOSH) leverages the Sliced-Wasserstein Embedding (SWE) to embed training sets, $\mathcal{X}_{train}=\{X_{i}\,|\,X_{i}=\{x^{i}_{n}\in\mathbb{R}^{d}\}_{n=0}^{N_{i}-1}\}$, into a vector space where we can use Locality Sensitive Hashing (LSH) to perform ANN search. We treat each input set $X_{i}$ as a probability measure $\mu_{i}(x)=\frac{1}{N_{i}}\sum_{n=1}^{N_{i}}\delta(x-x_{n}^{i})$. For a reference set $X_{0}$ with cardinality $|X_{0}|=M$ and $L$ slices, we embed the input set $X_{i}$ into an $LM$-dimensional vector space using Algorithm 1. With a slight abuse of notation, we use $\phi(\mu_{i})$ and $\phi(X_{i})$ interchangeably.

Given SWE, a family of $(R,c,P_{1},P_{2})$-sensitive LSH functions, $H$, induces the following conditions:

  • If $\widehat{\mathcal{GSW}}_{2,L}(X_{i},X_{j})\leq R$, then $\Pr[h(\phi(X_{i}))=h(\phi(X_{j}))]\geq P_{1}$, and

  • If $\widehat{\mathcal{GSW}}_{2,L}(X_{i},X_{j})>cR$, then $\Pr[h(\phi(X_{i}))=h(\phi(X_{j}))]\leq P_{2}$.

As mentioned in Section 2, to amplify the gap between $P_{1}$ and $P_{2}$, one uses $g(X_{i})=[h_{1}(\phi(X_{i})),\ldots,h_{k}(\phi(X_{i}))]$, which results in a code of length $k$ for each input set $X_{i}$. Finally, if $\widehat{\mathcal{GSW}}_{2,L}(X_{i},X_{j})\leq R$, then by using $T$ such codes, $g_{t}$ for $t\in\{1,\ldots,T\}$, each of length $k$, we can ensure a collision in at least one of the $g_{t}$s with probability $1-(1-P_{1}^{k})^{T}$.
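As an end-to-end illustration (and only a sketch under our assumptions), the snippet below embeds every training set with the `swe` helper above and indexes the vectors with FAISS; the experiments use FAISS for LSH, but the specific choice of `faiss.IndexLSH` and the hyper-parameters here are illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_slosh_index(train_sets, X0, L=64, nbits=1024):
    """Embed each training set with SWE and index the embeddings with binary LSH."""
    embs = np.stack([swe(Xi, X0, L=L) for Xi in train_sets]).astype('float32')
    index = faiss.IndexLSH(embs.shape[1], nbits)   # random-projection binary codes
    index.add(embs)
    return index

def retrieve(index, query_set, X0, L=64, k=8):
    """Return the indices of the k approximate nearest training sets."""
    q = swe(query_set, X0, L=L).astype('float32')[None, :]
    _, neighbors = index.search(q, k)
    return neighbors[0]
```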

Table 1: Retrieval results (Precision@k / Accuracy) of the baselines and our approach on the three set retrieval datasets.

Method        |        Point MNIST 2D         |          ModelNet 40          |            Oxford 5k
              |   k=4    |   k=8    |  k=16   |   k=4    |   k=8    |  k=16   |   k=4    |   k=8    |  k=16
GeM-1 (GAP)   | 0.10/0.11| 0.10/0.10| 0.10/0.10| 0.16/0.19| 0.16/0.22| 0.16/0.25| 0.29/0.35| 0.25/0.31| 0.22/0.29
GeM-2         | 0.29/0.32| 0.28/0.35| 0.29/0.37| 0.29/0.34| 0.26/0.34| 0.23/0.33| 0.38/0.53| 0.31/0.40| 0.27/0.38
GeM-4         | 0.39/0.45| 0.39/0.47| 0.38/0.49| 0.31/0.37| 0.28/0.38| 0.25/0.37| 0.35/0.41| 0.33/0.51| 0.29/0.42
Cov           | 0.25/0.27| 0.25/0.28| 0.25/0.28| 0.37/0.42| 0.35/0.44| 0.32/0.44| 0.35/0.55| 0.30/0.37| 0.26/0.33
FSPool        | 0.75/0.80| 0.74/0.81| 0.73/0.81| 0.50/0.57| 0.47/0.58| 0.42/0.56| 0.47/0.53| 0.39/0.60| 0.32/0.49
SLOSH (L<d)   | 0.52/0.59| 0.51/0.61| 0.49/0.61| 0.22/0.25| 0.20/0.27| 0.19/0.27| 0.33/0.42| 0.30/0.49| 0.24/0.36
SLOSH (L=d)   | 0.67/0.73| 0.64/0.74| 0.62/0.73| 0.39/0.44| 0.36/0.45| 0.33/0.44| 0.43/0.58| 0.39/0.60| 0.34/0.55
SLOSH (L>d)   | 0.90/0.92| 0.88/0.92| 0.87/0.91| 0.55/0.61| 0.51/0.60| 0.46/0.57| 0.53/0.69| 0.45/0.65| 0.38/0.64

5 Experiments

We evaluated our set locality sensitive hashing via Sliced-Wasserstein Embedding algorithm against Generalized Mean (GeM) pooling [53], Global Covariance (Cov) pooling [64], and Featurewise Sort Pooling (FSPool) [71] on various set-structured datasets. We note that while FSPool was proposed as a data-dependent embedding, here we devise a data-independent variation of it for a fair comparison. Interestingly, FSPool can be thought of as a special case of our SWE embedding where $L=d$ and $\Theta_{L}$ is chosen as the identity matrix. To evaluate these methods, we tested all approaches on the Point Cloud MNIST dataset (2D) [34, 18], the ModelNet40 dataset (3D) [68], and the Oxford Buildings dataset [49] (i.e., Oxford 5k).

5.1 Baselines

Let $X_{i}=\{x^{i}_{n}\in\mathbb{R}^{d}\}_{n=0}^{N_{i}-1}$ be the input set with $N_{i}$ elements. We denote by $[X_{i}]_{k}=\{[x^{i}_{n}]_{k}\}_{n=0}^{N_{i}-1}$ the set of all elements along the $k$'th dimension, $k\in\{1,2,\ldots,d\}$. Below we provide a quick overview of the baseline approaches, which provide different set-2-vector mechanisms.

Generalized-Mean Pooling (GeM) [53] was originally proposed as a generalization of global mean and global max pooling on Convolutional Neural Network (CNN) features to boost image retrieval performance. Given the input $X_{i}$, GeM calculates the $p$-th moment of each feature, $f^{(p)}\in\mathbb{R}^{d}$, as:

[f^{(p)}]_{k}=\left(\frac{1}{N_{i}}\sum_{n=1}^{N_{i}}([x^{i}_{n}]_{k})^{p}\right)^{\frac{1}{p}} \quad (15)

When the pooling parameter is $p=1$, we end up with average pooling, while as $p\to\infty$ we get max pooling. In practice, we found that a concatenation of higher-order GeM features, i.e., $\phi_{GeM}(X_{i})=[f^{(1)};\ldots;f^{(p)}]\in\mathbb{R}^{pd}$, leads to the best performance, where $p$ is GeM's hyper-parameter.
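A hedged sketch of this concatenated GeM embedding applied to raw set elements is given below; the sign-preserving $p$-th root is an implementation choice to keep odd moments of negative features well defined, not something prescribed by [53].

```python
import numpy as np

def gem_embedding(X, ps=(1, 2, 4)):
    """Concatenated GeM features of Eq. (15) for a set X (N x d);
    ps = (1,) is average pooling, while larger p approaches max pooling."""
    feats = []
    for p in ps:
        m = np.mean(X ** p, axis=0)                        # p-th moment per feature
        feats.append(np.sign(m) * np.abs(m) ** (1.0 / p))  # sign-preserving p-th root
    return np.concatenate(feats)                           # length len(ps) * d
```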

Covariance Pooling [1, 64] presents another way to capture second-order statistics and provide more informative representations. It was shown that this mechanism can be applied to CNN features as an alternative to global mean/max pooling to generate state-of-the-art results on facial expression recognition tasks [1]. Given the input set $X_{i}$, the unbiased covariance matrix is computed by:

C=\frac{1}{N_{i}-1}\sum_{n=1}^{N_{i}}(x^{i}_{n}-\overline{\mu}_{i})(x^{i}_{n}-\overline{\mu}_{i})^{T}, \quad (16)

where $\overline{\mu}_{i}=\frac{1}{N_{i}}\sum_{n=1}^{N_{i}}x^{i}_{n}$. The output matrix can be further regularized by adding a multiple of the trace to the diagonal entries of the covariance matrix to ensure symmetric positive definiteness (SPD), $C_{\lambda}=C+\lambda\,\mathrm{trace}(C)\,\mathbf{I}$, where $\lambda$ is a regularization hyper-parameter and $\mathbf{I}$ is the identity matrix. Covariance pooling then uses $\phi_{Cov}(X_{i})=\mathrm{flatten}(C_{\lambda})$.
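A minimal sketch of this covariance pooling, assuming each set has at least two elements:

```python
import numpy as np

def cov_embedding(X, lam=1e-3):
    """Regularized covariance pooling: flatten C + lam * trace(C) * I (Eq. (16))."""
    C = np.cov(X, rowvar=False)                       # unbiased d x d covariance
    C = C + lam * np.trace(C) * np.eye(C.shape[0])    # enforce positive definiteness
    return C.flatten()                                # (d*d,)-dimensional embedding
```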

Featurewise Sort Pooling (FSPool) [71] is a powerful technique for learning representations from set-structured data. In short, this approach is based on sorting each feature along all elements of a set, $[X_{i}]_{k}$:

f=[\mathrm{Sorted}([X_{i}]_{1}),\ldots,\mathrm{Sorted}([X_{i}]_{d})]\in\mathbb{R}^{N_{i}\times d} \quad (17)

The fixed-dimensional representation is then obtained via an interpolation along the $N_{i}$ dimension of $f$. More precisely, a continuous linear operator $W$ is learned, and the inner product between this continuous operator and $f$ is evaluated at $M$ fixed points (i.e., leading to a weighted summation over $d$), resulting in an $M$-dimensional embedding.

Given that we are interested in a data-independent set representation, we cannot rely on learning the continuous linear operator $W$. Instead, we perform interpolation along the $N_{i}$ axis at $M$ points and drop the inner product altogether to obtain an $(M\times d)$-dimensional data-independent set representation. This variation of FSPool is similar to the Sliced-Wasserstein Embedding when $L=d$ and $\Theta_{L}=I_{d\times d}$, i.e., axis-aligned projections.
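The data-independent variant can be sketched as follows; the linear interpolation grid with M points along the set axis mirrors the description above, and M is an illustrative choice.

```python
import numpy as np

def fspool_embedding(X, M=32):
    """Data-independent FSPool: sort each feature column, then interpolate each
    sorted column onto M equally spaced points along the set dimension."""
    N, d = X.shape
    grid = np.linspace(0.0, 1.0, M)
    src = np.linspace(0.0, 1.0, N)
    cols = [np.interp(grid, src, np.sort(X[:, k])) for k in range(d)]
    return np.stack(cols, axis=1).flatten()           # (M * d,) vector
```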

5.2 Datasets

Point Cloud MNIST 2D [34, 18] consists of 60,000 training samples and 10,000 testing samples. Each sample is a 2-dimensional point cloud derived from an image in the original MNIST dataset. The sets have varying cardinalities in the range $|X_{i}|\in[34,351]$.

ModelNet40 [68] contains 3-dimensional point clouds converted from 12,311 CAD models in 40 common object categories. We used the official split with 9,843 samples for training and 2,468 samples for testing. We sample $N_{i}$ points uniformly at random from the mesh faces of each object, where $N_{i}=\lfloor n_{i}\rfloor$, $n_{i}\sim\mathcal{N}(512,64)$. To avoid any orientation bias, we randomly rotate the point clouds by $\rho\in[0,45]$ degrees around the x-axis. Finally, we normalize each sample to fit within the unit cube to avoid any scale bias and to force the methods to rely on shape.

The Oxford Buildings Dataset [49] has 5,062 images containing eleven different Oxford landmarks. Each landmark has five corresponding queries, leading to 55 queries over which an object retrieval system can be evaluated. In our experiments, we use a VGG16 [57] pretrained on ImageNet1k [54] as a feature extractor and use the features of the last convolutional layer as the set representation of an input image. We resize the images without cropping, which leads to inputs of varied sizes and therefore set representations with varied cardinalities. We further perform dimensionality reduction on these features to obtain sets of features in a $(d=8)$-dimensional space.
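The sketch below illustrates how one image can be turned into a set of local descriptors with a pretrained VGG16, assuming a recent torchvision; resizing without cropping and the PCA that reduces the descriptors to $d=8$ dimensions (which must be fit on the whole database) are only indicated in comments.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Last-conv-layer features of an ImageNet-pretrained VGG16 (assumes torchvision >= 0.13).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
preprocess = T.Compose([T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def image_to_set(pil_image):
    """Map an image to a set of 512-dim local descriptors, one per spatial
    location of the last convolutional feature map. A PCA fit on the whole
    database would then reduce these to d = 8 dimensions, as in the text."""
    x = preprocess(pil_image).unsqueeze(0)           # no cropping, original aspect ratio
    with torch.no_grad():
        fmap = vgg(x)                                # shape (1, 512, H', W')
    return fmap.squeeze(0).flatten(1).T.numpy()      # (H'*W', 512) set of features
```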

Figure 2: Example sample retrievals.
Figure 3: Sensitivity to hash code length.

5.3 Results

For all datasets, we first calculate the set-2-vector embeddings for all baselines and for SWE. Then, we apply Locality-Sensitive Hashing (LSH) to the embedded sets and report Precision@k and accuracy (based on majority voting) for all approaches on the test sets. We use the FAISS library [24], developed by Facebook AI Research, for LSH. For all methods, we use a hash code length of 1024, and we report our results for $k=4$, $8$, and $16$. For SLOSH, we consider three different settings for the number of slices, namely, $L<d$, $L=d$, and $L>d$. We repeat the experiments five times per method and report the mean Precision@k and accuracy in Table 1. In all experiments, the best hyper-parameters are selected for each approach based on cross-validation. We see that SLOSH provides a consistent lead on all datasets, especially when $L>d$. Figure 2 provides set retrieval examples on the three datasets. Next, we provide a sensitivity analysis of our approach.

Sensitivity to code length. For all datasets, we study the sensitivity of the different embeddings to the hashing code length. We vary the code length from 16 to 1024 and report the average Precision@k over five runs. Figure 3 shows the outcome of this study. It can be seen that methods that encode higher statistical orders of the underlying distribution gain performance as a function of the hash code length. In addition, we see a strong performance gap between SWE and the other set-2-vector approaches, which points to the more descriptive nature of our proposed method.

Sensitivity to the number of slices. Next, we study the sensitivity of SLOSH to the number of slices, $L$, for the three datasets. We measure the average Precision@k over five runs for various numbers of slices and different code lengths and report our results in Figure 4. As expected, we see a performance increase with increasing $L$.

Sensitivity to the reference. Finally, we study the sensitivity to the choice of the reference set, $X_{0}$. We measure the performance of SLOSH on the three datasets and for various code lengths, when the reference set is: 1) calculated using K-Means on the elements of all sets, 2) a random set from the dataset, 3) sampled from a uniform distribution, and 4) sampled from a normal distribution (Figure 5). We see that SLOSH's performance depends on the reference choice. However, we point out that our formulation implies that as $L\rightarrow\infty$ the embedding becomes independent of the choice of the reference, and the observed sensitivity could be due to using a finite $L$.

Figure 4: Sensitivity to the number of slices.
Figure 5: Sensitivity to the reference set.

6 Conclusion

We described a novel data-independent approach for Approximate Nearest Neighbor (ANN) search on set-structured data, with applications in set retrieval. We treat set elements as samples from an underlying distribution and embed sets into a vector space in which the Euclidean distance approximates the Sliced-Wasserstein (SW) distance between the input distributions. We show that for a set $X$ with cardinality $|X|=N$, our framework requires $\mathcal{O}(LN(d+\log(N)))$ (sequential processing) or $\mathcal{O}(N(d+\log(N)))$ (parallel processing) calculations to obtain the embedding. We then use Locality Sensitive Hashing (LSH) for fast retrieval of the nearest sets in our proposed embedding. We demonstrate a significant boost over other data-independent approaches, including Generalized Mean (GeM), on three different set retrieval tasks, namely, Point Cloud MNIST, ModelNet40, and the Oxford Buildings datasets. Finally, our proposed method is readily extendable to data-dependent settings by allowing optimization over the slices and the reference set.

7 Acknowledgements

This research was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR00112190132.

References

  • [1] Dinesh Acharya, Zhiwu Huang, Danda Pani Paudel, and Luc Van Gool. Covariance pooling for facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 367–374, 2018.
  • [2] David Alvarez Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33, 2020.
  • [3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pages 459–468. IEEE, 2006.
  • [4] Alexandr Andoni, Piotr Indyk, Huy L Nguyen, and Ilya Razenshteyn. Beyond locality-sensitive hashing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1018–1028. SIAM, 2014.
  • [5] Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. Approximate nearest neighbor search in high dimensions. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3287–3318. World Scientific, 2018.
  • [6] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 793–801, 2015.
  • [7] Relja Arandjelovic and Andrew Zisserman. All about vlad. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013.
  • [8] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [9] Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Normalized wasserstein for mixture distributions with applications in adversarial learning and domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6500–6508, 2019.
  • [10] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
  • [11] Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearest-neighbor based image classification. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [12] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
  • [13] Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Université Paris 11, France, 2013.
  • [14] Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380–388, 2002.
  • [15] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2016.
  • [16] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and Nicolas Courty. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 447–463, 2018.
  • [17] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262, 2004.
  • [18] Cristian Garcia. Point cloud mnist 2d, 2020.
  • [19] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. A kernel method for the two-sample-problem. Advances in neural information processing systems, 19:513–520, 2006.
  • [20] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
  • [21] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
  • [22] Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. The Journal of Machine Learning Research, 5:819–844, 2004.
  • [23] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3304–3311. IEEE, 2010.
  • [24] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
  • [25] Yannis Kalantidis, Clayton Mellina, and Simon Osindero. Cross-dimensional weighting for aggregated deep convolutional features. In European conference on computer vision, pages 685–701. Springer, 2016.
  • [26] Haim Kaplan and Jay Tenenbaum. Locality sensitive hashing for set-queries, motivated by group recommendations. In 17th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [27] Soheil Kolouri, Navid Naderializadeh, Gustavo K. Rohde, and Heiko Hoffmann. Wasserstein embedding for graph learning. In International Conference on Learning Representations, 2021.
  • [28] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. In Advances in Neural Information Processing Systems, pages 261–272, 2019.
  • [29] Soheil Kolouri, Se Rim Park, and Gustavo K. Rohde. The Radon cumulative distribution transform and its application to image classification. Image Processing, IEEE Transactions on, 25(2):920–934, 2016.
  • [30] Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
  • [31] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.
  • [32] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2016.
  • [33] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In 2009 IEEE 12th international conference on computer vision, pages 2130–2137. IEEE, 2009.
  • [34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [35] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10285–10295, 2019.
  • [36] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.
  • [37] Alexander Levine and Soheil Feizi. Wasserstein smoothing: Certified robustness against wasserstein adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 3938–3947. PMLR, 2020.
  • [38] Antoine Liutkus, Umut Simsekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stöter. Sliced-wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, pages 4104–4113. PMLR, 2019.
  • [39] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stoter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.
  • [40] Grégoire Mialon, Dexiong Chen, Alexandre d’Aspremont, and Julien Mairal. A trainable optimal transport embedding for feature aggregation and its relationship to attention. In International Conference on Learning Representations, 2021.
  • [41] Caroline Moosmüller and Alexander Cloninger. Linear optimal transport embedding: Provable fast wasserstein distance computation and classification for nonlinear problems. arXiv preprint arXiv:2008.09165, 2020.
  • [42] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1, pages 10–18, 2012.
  • [43] Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In International Conference on Learning Representations, 2019.
  • [44] Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, and Umut Şimşekli. Statistical and topological properties of sliced probability divergences. In Advances in Neural Information Processing Systems, 2020.
  • [45] Parth Nagarkar and K Selçuk Candan. Pslsh: An index structure for efficient execution of set queries in high-dimensional spaces. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 477–486, 2018.
  • [46] Se Rim Park, Soheil Kolouri, Shinjini Kundu, and Gustavo K Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 45(3):616–641, 2018.
  • [47] François-Pierre Paty and Marco Cuturi. Subspace robust wasserstein distances. In International Conference on Machine Learning, 2019.
  • [48] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567, 2018.
  • [49] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [50] Barnabás Póczos and Jeff Schneider. Nonparametric estimation of conditional information and divergences. In Artificial Intelligence and Statistics, pages 914–923. PMLR, 2012.
  • [51] Barnabás Póczos, Liang Xiong, and Jeff Schneider. Nonparametric divergence estimation with applications to machine learning on distributions. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 599–608, 2011.
  • [52] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2012.
  • [53] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
  • [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [55] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-neighbor methods in learning and vision. IEEE Trans. Neural Networks, 19(2):377, 2008.
  • [56] Mohammad Shifat-E-Rabbi, Xuwang Yin, Abu Hasnat Mohammad Rubaiyat, Shiying Li, Soheil Kolouri, Akram Aldroubi, Jonathan M Nichols, and Gustavo K Rohde. Radon cumulative distribution transform subspace modeling for image classification. Journal of Mathematical Imaging and Vision, 63(9):1185–1203, 2021.
  • [57] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [58] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
  • [59] Matteo Togninalli, Elisabetta Ghisu, Felipe Llinares-López, Bastian Rieck, and Karsten Borgwardt. Wasserstein weisfeiler-lehman graph kernels. Advances in Neural Information Processing Systems, 32:6439–6449, 2019.
  • [60] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
  • [61] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [62] Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne. On the limitations of representing functions on sets. In International Conference on Machine Learning, pages 6487–6494. PMLR, 2019.
  • [63] Ingo Wald and Vlastimil Havran. On building fast kd-trees for ray tracing, and on doing that in o (n log n). In 2006 IEEE Symposium on Interactive Ray Tracing, pages 61–69. IEEE, 2006.
  • [64] Qilong Wang, Jiangtao Xie, Wangmeng Zuo, Lei Zhang, and Peihua Li. Deep cnns meet global covariance pooling: Better representation and generalization. IEEE transactions on pattern analysis and machine intelligence, 2020.
  • [65] Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269, 2013.
  • [66] Eric Wong, Frank Schmidt, and Zico Kolter. Wasserstein adversarial examples via projected sinkhorn iterations. In International Conference on Machine Learning, pages 6808–6817. PMLR, 2019.
  • [67] Kaiwen Wu, Allen Wang, and Yaoliang Yu. Stronger and faster wasserstein adversarial attacks. In International Conference on Machine Learning, pages 10377–10387. PMLR, 2020.
  • [68] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [69] Liang Xiong and Jeff Schneider. Learning from point sets with observational bias. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 898–906, 2014.
  • [70] Manzil Zaheer, Satwik Kottur, Siamak Ravanbhakhsh, Barnabás Póczos, Ruslan Salakhutdinov, and Alexander J Smola. Deep sets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3394–3404, 2017.
  • [71] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Fspool: Learning set representations with featurewise sort pooling. arXiv preprint arXiv:1906.02795, 2019.
  • [72] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Fspool: Learning set representations with featurewise sort pooling. In International Conference on Learning Representations, 2020.

8 Supplementary Materials

8.1 Proofs

Here we include the proofs of the C1 and C2 conditions covered in Section 4. Recall that $\mu_{i}$ represents a probability measure, $\mu_{i}^{\theta}$ represents $g_{\theta\#}\mu_{i}$, where $g_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ satisfies certain regularity constraints, and that we define the cumulative distribution transform (CDT) [46] of $\mu_{i}^{\theta}$ as

\phi_{\theta}(\mu_{i})\coloneqq(T_{i}^{\theta}-id),

where $T_{i}^{\theta}$ is the Monge map/coupling and $id$ denotes the identity function. For a fixed $\theta$, we prove that $\phi_{\theta}(\mu_{i})$ satisfies the following conditions:

Method   | Complexity
GeM-p    | $\mathcal{O}(Np^{2}d)$
Cov.     | $\mathcal{O}(Nd^{2})$
FSPool   | $\mathcal{O}(Nd\log N)$
SLOSH    | $\mathcal{O}(LN(d+\log N))$
Table 2: Computational complexities.

C1. The weighted $\ell_{2}$-norm of the embedded slice, $\phi_{\theta}(\mu_{i})$, satisfies:

\|\phi_{\theta}(\mu_{i})\|_{\mu_{0}^{\theta},2}=\left(\int_{\mathbb{R}}\|\phi_{\theta}(\mu_{i})(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)\right)^{\frac{1}{2}}=\mathcal{W}_{2}(\mu_{i}^{\theta},\mu_{0}^{\theta}).
Proof.

We start by writing the squared distance:

\|\phi_{\theta}(\mu_{i})\|^{2}_{\mu_{0}^{\theta},2}=\int_{\mathbb{R}}\|\phi_{\theta}(\mu_{i})(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int_{\mathbb{R}}\|T_{i}^{\theta}(t)-t\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int_{\mathbb{R}}\|(F_{\mu_{i}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}})(t)-t\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int^{1}_{0}\|F_{\mu_{i}^{\theta}}^{-1}(\tau)-F_{\mu_{0}^{\theta}}^{-1}(\tau)\|_{2}^{2}\,d\tau
=\mathcal{W}_{2}^{2}(\mu_{i}^{\theta},\mu_{0}^{\theta}),

where we used the definition of the one-dimensional Monge map, $T^{\theta}_{i}=F_{\mu_{i}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}}$, and the change of variables $\tau=F_{\mu_{0}^{\theta}}(t)$. The corollary, $\|\phi_{\theta}(\mu_{0})\|_{\mu_{0}^{\theta},2}=0$, is trivial since $\mathcal{W}_{2}^{2}(\mu_{0}^{\theta},\mu_{0}^{\theta})=0$. ∎

C2. The weighted $\ell_{2}$ distance satisfies:

\|\phi_{\theta}(\mu_{i})-\phi_{\theta}(\mu_{j})\|_{\mu_{0}^{\theta},2}=\mathcal{W}_{2}(\mu_{i}^{\theta},\mu_{j}^{\theta}). \quad (18)
Proof.

Similar to the previous proof:

\|\phi_{\theta}(\mu_{i})-\phi_{\theta}(\mu_{j})\|^{2}_{\mu_{0}^{\theta},2}=\int_{\mathbb{R}}\|\phi_{\theta}(\mu_{i})(t)-\phi_{\theta}(\mu_{j})(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int_{\mathbb{R}}\|T_{i}^{\theta}(t)-T_{j}^{\theta}(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int_{\mathbb{R}}\|(F_{\mu_{i}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}})(t)-(F_{\mu_{j}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}})(t)\|_{2}^{2}\,d\mu_{0}^{\theta}(t)
=\int^{1}_{0}\|F_{\mu_{i}^{\theta}}^{-1}(\tau)-F_{\mu_{j}^{\theta}}^{-1}(\tau)\|_{2}^{2}\,d\tau=\mathcal{W}_{2}^{2}(\mu_{i}^{\theta},\mu_{j}^{\theta}),

where again we used the definition of the one-dimensional Monge map, $T^{\theta}_{i}=F_{\mu_{i}^{\theta}}^{-1}\circ F_{\mu_{0}^{\theta}}$, and the change of variables $\tau=F_{\mu_{0}^{\theta}}(t)$. ∎

As a corollary of C1 and C2 we have:

\mathcal{GSW}^{2}_{2}(\mu_{i},\mu_{j})=\int_{\Omega_{\theta}}\mathcal{W}_{2}^{2}(\mu_{i}^{\theta},\mu_{j}^{\theta})\,d\sigma(\theta)=\int_{\Omega_{\theta}}\|\phi_{\theta}(\mu_{i})-\phi_{\theta}(\mu_{j})\|_{\mu_{0}^{\theta},2}^{2}\,d\sigma(\theta).

8.2 Computational complexities

For the sake of completeness, we include here the computational complexities of the baseline methods used in our paper. Table 2 provides the computational complexity of embedding a set $X=\{x_{n}\in\mathbb{R}^{d}\}_{n=1}^{N}$ into a vector space.