
Wasserstein $k$-means with sparse simplex projection

Takumi Fukunaga Department of Communications and Computer Engineering, School of Fundamental Science and Engineering, WASEDA University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan (e-mail: [email protected])    Hiroyuki Kasai Department of Communications and Computer Engineering, School of Fundamental Science and Engineering, WASEDA University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan (e-mail: [email protected])
Abstract

This paper proposes a faster Wasserstein $k$-means algorithm for histogram data that reduces Wasserstein distance computations and exploits sparse simplex projection. We shrink data samples, centroids, and the ground cost matrix, which leads to a considerable reduction of the computations used to solve optimal transport problems without loss of clustering quality. Furthermore, we dynamically reduce the computational complexity by removing lower-valued histogram elements and harnessing sparse simplex projection while keeping the degradation of clustering quality small. We designate this proposed algorithm as sparse simplex projection based Wasserstein $k$-means, or SSPW $k$-means. Numerical evaluations conducted in comparison with the standard Wasserstein $k$-means algorithm demonstrate the effectiveness of the proposed SSPW $k$-means for real-world datasets. (This paper has been accepted at ICPR 2020 [1].)

1 Introduction

The $k$-means clustering algorithm is a simple yet powerful algorithm that has been widely used in various application areas including, for example, computer vision, statistics, robotics, bioinformatics, machine learning, signal processing, and medical image engineering. The standard Euclidean $k$-means algorithm proposed by Lloyd [2] minimizes the total squared distance between all data samples and their assigned cluster centroids. This algorithm consists of two iterative steps, the assignment step and the update step, when clustering $q$ data samples into $k$ clusters. Because the assignment step requires calculation cost $\mathcal{O}(qk)$, this step becomes prohibitively expensive when $q$ is large. To address this issue, many authors have proposed efficient approaches [3, 4, 5, 6, 7, 8, 9, 10]. It should also be noted that the $k$-means algorithm is equivalent to non-negative matrix factorization (NMF) [11], and NMF has been applied to many fields [12, 13, 14, 15]. In another direction of research on the $k$-means clustering algorithm, some efforts have aimed at improving the clustering quality. Some of them address the point that the $k$-means clustering algorithm ignores the latent structure of data samples, such as histogram data [10]. One approach in this category achieves better performance than the standard algorithm in terms of clustering quality by exploiting the Wasserstein distance [16], a metric between probability distributions of histograms that respects the latent geometry of the space. This is called the Wasserstein $k$-means algorithm hereinafter. The Wasserstein distance, which originates from optimal transport theory [17], has many favorable properties [18]. Therefore, it is applied to a wide variety of problems in machine learning [19], statistical learning, and signal processing. Nevertheless, calculation of the Wasserstein distance requires a much higher computational cost than that of the standard Euclidean $k$-means algorithm because it requires solving an optimization problem at every iteration, which makes the Wasserstein $k$-means infeasible for larger-scale data.

To this end, this paper proposes a faster Wasserstein $k$-means algorithm for histogram data that reduces the Wasserstein distance computations by exploiting sparse simplex projection. More concretely, we shrink data samples, centroids, and the ground cost matrix, which drastically reduces the computations needed to solve the linear programming optimization problems arising in optimal transport without loss of clustering quality. Furthermore, we dynamically reduce the computational complexity by removing lower-valued histogram elements and harnessing the sparse simplex projection while keeping the degradation of the clustering quality small. We designate this proposed algorithm as sparse simplex projection based Wasserstein $k$-means, or SSPW $k$-means. Numerical evaluations are then conducted against the standard Wasserstein $k$-means algorithm to demonstrate the effectiveness of the proposed SSPW $k$-means for real-world datasets.

The source code will be made available at https://github.com/hiroyuki-kasai.

2 Preliminaries

This section first presents the notation used in the remainder of this paper. It then introduces the optimal transport problem, followed by explanation of the Wasserstein kk-means algorithm.

2.1 Notation

We denote scalars with lower-case letters $(a, b, \ldots)$, vectors with bold lower-case letters $(\bm{a}, \bm{b}, \ldots)$, and matrices with bold-face capitals $(\mathbf{A}, \mathbf{B}, \ldots)$. The $i$-th element of $\bm{a}$ and the element at the $(i,j)$ position of $\mathbf{A}$ are represented as $\bm{a}_i$ and $\mathbf{A}_{ij}$, respectively. $\bm{1}_d$ is the $d$-dimensional vector of ones. The operator $(\cdot)^T$ stands for the matrix transpose. The operator $[a]_+$ represents $\max(a, 0)$, which outputs $a$ when $a \geq 0$, and $0$ otherwise. Given a set $\mathcal{S} \subseteq \mathcal{N} = \{1, \ldots, n\}$, the complement $\mathcal{S}^c$ is defined with respect to $\mathcal{N}$, and the cardinality is $|\mathcal{S}|$. The support set of $\bm{a}$ is ${\rm supp}(\bm{a}) = \{i : \bm{a}_i \neq 0\}$. $\bm{a}_{|\mathcal{S}}$ extracts the elements of $\bm{a}$ indexed by $\mathcal{S}$, and its size is $|\mathcal{S}|$. $\mathbb{R}_+^{n \times m}$ represents the set of nonnegative matrices of size $n \times m$. The unit simplex, simply called the simplex, is denoted by $\Delta_n$; it is the subset of $\mathbb{R}^n$ comprising all nonnegative vectors whose elements sum to 1. $\lfloor a \rfloor$ represents the floor function, which outputs the greatest integer less than or equal to $a$.

2.2 Euclidean $k$-means and speed-up methods

The goal of the standard Euclidean $k$-means algorithm is to partition $q$ given data samples $\mathbf{X} = \{\bm{x}_1, \ldots, \bm{x}_q\} \subset \mathbb{R}^d$ into $k$ separated groups, i.e., $k$ clusters, $\mathcal{C} = \{C_1, \ldots, C_k\}$, such that the following $k$-means cost function is minimized:

$$f(\mathbf{C}, \mathbf{X}) = \sum_{j=1}^{k} \sum_{\bm{x}_i \in C_j} \|\bm{x}_i - \bm{c}_j\|_2^2, \qquad (1)$$

where $\mathbf{C} = \{\bm{c}_1, \ldots, \bm{c}_k\} \subset \mathbb{R}^d$ is a set of $k$ cluster centers, designated as centroids hereinafter. Also, $\|\cdot\|_2^2$ represents the squared Euclidean distance. Consequently, our aim is to find the $k$ cluster centroids $\mathbf{C}$. It is noteworthy that the problem defined in (1) is non-convex and is known to be NP-hard [20, 21]. Therefore, many efforts have been undertaken to seek a locally optimal solution of (1). Among them, the most popular yet simple and powerful algorithm is the one proposed by Lloyd [2], called Lloyd's $k$-means. This algorithm consists of two iterative steps: the first step, called the assignment step, finds the closest cluster centroid $\bm{c}_j$, $j \in [k]$, for each data sample $\bm{x}_i$, $i \in [q]$; the second step, called the update step, then updates the cluster centroids $\mathbf{C}$. These two steps continue until the clusters do not change anymore. The algorithm is presented in Algorithm 1.

Algorithm 1 Lloyd's algorithm for $k$-means [2]
1: Input: data $\mathbf{X} = \{\bm{x}_1, \ldots, \bm{x}_q\} \subset \mathbb{R}^d$, cluster number $k \in \mathbb{N}$.
2: Randomly initialize centroids $\mathbf{C} = \{\bm{c}_1, \ldots, \bm{c}_k\} \subset \mathbb{R}^d$.
3: repeat
4:     Find closest centroids (assignment step):
5:         $s_i = \text{argmin}_{j=1,\ldots,k}\ \|\bm{x}_i - \bm{c}_j\|_2^2,\ \forall i \in [q]$.
6:     Update centroids (update step):
7:         $\bm{c}_j = \mathrm{mean}(\{\bm{x}_i \in \mathbf{X} \mid s_i = j\}),\ \forall j \in [k]$.
8: until cluster centroids stop changing.
9: Output: cluster centers $\mathbf{C} = \{\bm{c}_1, \ldots, \bm{c}_k\}$.
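For concreteness, the following is a minimal MATLAB sketch of Algorithm 1. It is not the implementation used in the experiments of Section 4; the random initialization, the fixed iteration budget, and the empty-cluster guard are simplifying assumptions, and pdist2 requires the Statistics and Machine Learning Toolbox.

```matlab
% Minimal sketch of Lloyd's k-means (Algorithm 1) for data X (d x q).
function [C, s] = lloyd_kmeans(X, k, num_iter)
    [~, q] = size(X);
    C = X(:, randperm(q, k));                       % random initial centroids
    for it = 1:num_iter
        D = pdist2(C', X', 'squaredeuclidean');     % k x q squared distances
        [~, s] = min(D, [], 1);                     % assignment step
        for j = 1:k                                 % update step
            if any(s == j)                          % guard against empty clusters
                C(:, j) = mean(X(:, s == j), 2);
            end
        end
    end
end
```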

Although Lloyd's $k$-means is simple and powerful, the obtained result is not guaranteed to reach a global optimum because the problem is not convex. For that reason, several efforts have been made to find good initializations for Lloyd's algorithm [22, 23, 24]. Apart from the lack of a global guarantee, Lloyd's algorithm suffers from high computational complexity. The highest complexity comes from the assignment step, where the Euclidean distances between all data samples and all cluster centroids must be calculated. This complexity is $\mathcal{O}(qk)$, which is infeasible for numerous data samples or clusters, and it is also problematic for higher dimensions $d$. For this particular problem, many authors have proposed efficient approaches. One category of approaches generates a hierarchical tree structure of the data samples to make the assignment step more efficient [3, 4, 5]. However, this also entails high costs to maintain the structure, and it becomes prohibitively expensive for higher-dimensional data samples. Another category reduces redundant distance calculations at the assignment step. For example, if the centroid assigned to a data sample $\bm{x}$ moves closer to $\bm{x}$ in successive iterations, then no centroid that has not moved can become closer to $\bm{x}$ than the assigned one; consequently, the distance calculations to those centroids can be omitted. In addition, the triangle inequality is useful to skip calculations. This category includes, for example, [6, 7, 8, 9, 10]. However, it remains unclear whether these approaches are applicable to the Wasserstein $k$-means described later in Section 2.4.

2.3 Optimal transport [17]

Let $\Omega$ be an arbitrary space, $d(\cdot,\cdot)$ a metric on that space, and $P(\Omega)$ the set of Borel probability measures on $\Omega$. For any point $\nu \in \Omega$, $\delta_\nu$ is the Dirac unit mass at $\nu$. Here, we consider two families $\bm{\nu} = (\nu_1, \ldots, \nu_m)$ and $\bm{\mu} = (\mu_1, \ldots, \mu_n)$ of points in $\Omega$, with associated weight vectors $\bm{a} = (a_1, \ldots, a_m)^T \in \mathbb{R}_+^m$ and $\bm{b} = (b_1, \ldots, b_n)^T \in \mathbb{R}_+^n$ satisfying $\sum_{i=1}^m a_i = \sum_{j=1}^n b_j = 1$. The ground cost matrix $\mathbf{C}_{\bm{\nu\mu}}$, or simply $\mathbf{C}$, collects the distances between elements of $\bm{\nu}$ and $\bm{\mu}$ raised to the power $p$, i.e., $\mathbf{C}_{\bm{\nu\mu}} = \mathbf{C} := [d(\nu_u, \mu_v)^p]_{uv} \in \mathbb{R}_+^{m \times n}$. The transportation polytope $\mathcal{U}_{mn}$ of $\bm{a} \in \Delta_m$ and $\bm{b} \in \Delta_n$ is the feasible set defined as the set of $m \times n$ nonnegative matrices whose row and column marginals are equal to $\bm{a}$ and $\bm{b}$, respectively. It is formally defined as

$$\mathcal{U}_{mn} := \{\mathbf{T} \in \mathbb{R}^{m \times n}_+ : \mathbf{T}\bm{1}_n = \bm{a},\ \mathbf{T}^T \bm{1}_m = \bm{b}\}.$$

Consequently, the optimal transport matrix $\mathbf{T}^*$ is defined as a solution of the following optimal transport problem [17]:

$$\mathbf{T}^* = \mathop{\rm arg\,min}_{\mathbf{T} \in \mathcal{U}_{mn}} \ \langle \mathbf{T}, \mathbf{C} \rangle. \qquad (2)$$
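To make (2) concrete, the following MATLAB sketch solves the optimal transport LP directly with linprog (here the stock MATLAB linprog; the experiments in Section 4 use the Mosek linprog). The column-wise vectorization of $\mathbf{T}$ and the constraint construction are our own choices for illustration.

```matlab
% Minimal sketch: exact optimal transport between histograms a (m x 1)
% and b (n x 1) with ground cost C (m x n). The plan T is vectorized
% column-wise as t = T(:).
function [T, W] = ot_lp(a, b, C)
    [m, n] = size(C);
    Arow = kron(ones(1, n), speye(m));     % row sums:    T*1_n   = a
    Acol = kron(speye(n), ones(1, m));     % column sums: T'*1_m  = b
    Aeq  = [Arow; Acol];
    beq  = [a(:); b(:)];
    lb   = zeros(m * n, 1);                % T >= 0
    t    = linprog(C(:), [], [], Aeq, beq, lb, []);
    T    = reshape(t, m, n);
    W    = C(:)' * t;                      % optimal value <T, C>
end
```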

2.4 Wasserstein distance and Wasserstein $k$-means [16]

The Wasserstein $k$-means algorithm replaces the distance metric $\|\bm{x}_i - \bm{c}_j\|_2^2$ in (1) with the $p$-Wasserstein distance $W_p(\bm{\mu}, \bm{\nu})$, which is formally defined as shown below.

Definition 1 ($p$-Wasserstein distance). For $\bm{\mu}$ and $\bm{\nu}$, the Wasserstein distance of order $p$ is defined as

$$W_p(\bm{\mu}, \bm{\nu}) = \min_{\mathbf{T} \in \mathcal{U}_{mn}} \ \langle \mathbf{T}, \mathbf{C} \rangle = \langle \mathbf{T}^*, \mathbf{C} \rangle.$$

In addition, the update step of Algorithm 1 is replaced with the Wasserstein barycenter calculation, which is defined as presented below.

Definition 2 (Wasserstein barycenter [25]). A Wasserstein barycenter of $q$ measures $\{\bm{\nu}_1, \ldots, \bm{\nu}_q\} \subset \mathcal{P} \subset P(\Omega)$ is a minimizer of $g(\bm{\mu})$ over $\mathcal{P}$, where

$$g(\bm{\mu}) := \frac{1}{q} \sum_{i=1}^{q} W_p(\bm{\mu}, \bm{\nu}_i).$$

Because $m = n$ holds in this case, we assume $\{\mathbf{T}, \mathbf{C}\} \subset \mathbb{R}^{n \times n}_+$ hereinafter. The overall algorithm is summarized in Algorithm 2.

Algorithm 2 Wasserstein $k$-means
1: Input: data $\{\bm{\nu}_1, \ldots, \bm{\nu}_q\}$, cluster number $k \in \mathbb{N}$, ground cost matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$.
2: Initialize centroids $\{\bm{c}_1, \ldots, \bm{c}_k\}$.
3: repeat
4:     Find closest centroids (assignment step):
5:         $s_i = \text{argmin}_{j=1,\ldots,k}\ W_p(\bm{\nu}_i, \bm{c}_j),\ \forall i \in [q]$.
6:     Update centroids (update step):
7:         $\bm{c}_j = \text{barycenter}(\{\bm{\nu}_i \mid s_i = j\}),\ \forall j \in [k]$.
8: until cluster centroids stop changing.
9: Output: cluster centers $\{\bm{c}_1, \ldots, \bm{c}_k\}$.
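The cost of Algorithm 2 is dominated by the assignment step, which solves one optimal transport problem per sample-centroid pair. The MATLAB sketch below illustrates that step only; it is our own illustration, not the authors' implementation, and it reuses the ot_lp helper sketched in Section 2.3 (an assumption of this sketch).

```matlab
% Minimal sketch of the assignment step of Algorithm 2.
% nu   : n x q matrix whose columns are the data histograms
% Cent : n x k matrix whose columns are the current centroids
% Cm   : n x n ground cost matrix
function s = assign_to_centroids(nu, Cent, Cm)
    [~, q] = size(nu);
    [~, k] = size(Cent);
    W = zeros(k, q);
    for i = 1:q
        for j = 1:k
            [~, W(j, i)] = ot_lp(nu(:, i), Cent(:, j), Cm);  % W_p^p(nu_i, c_j)
        end
    end
    [~, s] = min(W, [], 1);    % closest centroid for each sample
end
```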

The computation of Wasserstein barycenters has recently gained increasing attention in the literature because it has a wide range of applications. Many fast algorithms have been proposed, including, for example, [26, 27, 28, 16, 29, 30]. Although they are effective at the update step of the Wasserstein $k$-means algorithm, they do not directly tackle the computational issue at the assignment step that this paper particularly addresses. [31] proposes a mini-batch algorithm for the Wasserstein $k$-means. However, it is not clear whether it gives better performance than the original Wasserstein $k$-means algorithm because of the lack of comprehensive experiments. In another direction, directly computing a Wasserstein barycenter with sparse support might give better performance on the problem of interest in this paper. However, as suggested and proved in [32], finding such a sparse barycenter is a hard problem, even in dimension 2 and for only 3 measures.

3 Sparse simplex projection-based Wasserstein $k$-means: SSPW $k$-means

This section first describes the motivation of the proposed method, and then elaborates the algorithm of the proposed sparse simplex projection based Wasserstein $k$-means.

3.1 Motivation

The optimal transport problem in (2) described in Section 2.3 is reformulated as

$$\begin{aligned} \mathbf{T}^* = \mathop{\rm arg\,min}_{\mathbf{T} \in \mathcal{U}_{mn}} \ & \sum_{u=1}^{m} \sum_{v=1}^{n} [\mathbf{T}]_{uv} \|\nu_u - \mu_v\|_p^p, \\ {\rm s.t.} \quad & \sum_{u=1}^{m} [\mathbf{T}]_{uv} = b_v,\ v \in [n], \quad \sum_{v=1}^{n} [\mathbf{T}]_{uv} = a_u,\ u \in [m]. \end{aligned} \qquad (3)$$

Consequently, this problem reduces to a linear programming (LP) problem with $(m+n)$ linear equality constraints. An LP problem is an optimization problem with a linear objective function and linear equality and inequality constraints. The problem is solvable using general LP solvers such as the simplex method, the interior-point method, and their variants. The computational complexity of those methods, however, scales at best cubically in the size of the input data. Here, if the measures have $n\,(=m)$ bins, then the number of unknowns $[\mathbf{T}]_{uv}$ is $n^2$, and the computational complexity of the solvers is at best $\mathcal{O}(n^3 \log n)$ [33, 34]. To alleviate this difficulty, an entropy-regularized algorithm, known as the Sinkhorn algorithm, has been proposed for the entropy-regularized Wasserstein distance [33]. This algorithm regularizes the Wasserstein metric by the entropy of the transport plan and enables much faster numerical schemes with complexity $\mathcal{O}(n^2)$ or $\mathcal{O}(n \log n)$. It is nevertheless difficult to obtain a high-accuracy solution. In addition, it can become unstable when the regularization parameter is reduced to a smaller value to improve the accuracy.
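For illustration only, a minimal sketch of such a Sinkhorn iteration is given below; the regularization parameter epsilon and the fixed iteration count are assumptions of this sketch, and this regularized solver is not the one used in our experiments, which rely on the exact LP.

```matlab
% Minimal sketch of the entropy-regularized (Sinkhorn) approximation of
% the Wasserstein distance between histograms a and b with cost C.
function W = sinkhorn_distance(a, b, C, epsilon, num_iter)
    K = exp(-C / epsilon);                   % Gibbs kernel
    u = ones(size(a)) / numel(a);
    for it = 1:num_iter
        u = a ./ (K * (b ./ (K' * u)));      % alternating scaling updates
    end
    v = b ./ (K' * u);
    T = diag(u) * K * diag(v);               % regularized transport plan
    W = sum(sum(T .* C));                    % approximate <T, C>
end
```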

Subsequently, this paper focuses on the original LP problem in (3). We attempt to reduce the size of the measures. More specifically, we adopt the following two approaches: (i) we make the input data samples $\bm{\nu}$ and the centroids $\bm{c}$ sparser than the originals by projecting them onto a sparser simplex, and (ii) we shrink the projected data sample $\bm{\nu}$, the centroid $\bm{c}$, and the corresponding ground cost matrix $\mathbf{C}_{\bm{\nu c}}$ by removing their zero elements. It is noteworthy that the first approach speeds up the computation while keeping the degradation of the clustering quality as small as possible, whereas the second approach yields no degradation at all.

Algorithm 3 Proposed sparse simplex projection Wasserstein $k$-means (SSPW $k$-means)
1: Input: data $\{\bm{\nu}_1, \ldots, \bm{\nu}_q\}$, cluster number $k \in \mathbb{N}$, ground cost matrix $\mathbf{C} \in \mathbb{R}^{n \times n}$, maximum number of iterations $T_{\rm max}$, minimum sparsity ratio $\gamma_{\rm min}$.
2: Initialize centroids $\{\tilde{\bm{c}}_1, \ldots, \tilde{\bm{c}}_k\}$, set $t = 1$.
3: repeat
4:     Update the sparsity ratio $\gamma(t)$.
5:     Project $\bm{\nu}_i$ to $\hat{\bm{\nu}}_i$ on the sparse simplex:
6:         $\hat{\bm{\nu}}_i = \text{Proj}^{\gamma(t)}(\bm{\nu}_i),\ \forall i \in [q]$.
7:     Shrink $\hat{\bm{\nu}}_i$ into $\tilde{\bm{\nu}}_i$: $\tilde{\bm{\nu}}_i = \text{shrink}(\hat{\bm{\nu}}_i)$.
8:     Project $\bm{c}_j$ to $\hat{\bm{c}}_j$ on the sparse simplex:
9:         $\hat{\bm{c}}_j = \text{Proj}^{\gamma(t)}(\bm{c}_j),\ \forall j \in [k]$.
10:    Shrink $\hat{\bm{c}}_j$ into $\tilde{\bm{c}}_j$: $\tilde{\bm{c}}_j = \text{shrink}(\hat{\bm{c}}_j)$.
11:    Shrink the ground cost matrix $\mathbf{C}$ into $\tilde{\mathbf{C}}$: $\tilde{\mathbf{C}} = \text{Shrink}(\mathbf{C})$.
12:    Find closest centroids (assignment step):
13:        $s_i = \text{argmin}_{j=1,\ldots,k}\ W_p(\tilde{\bm{\nu}}_i, \tilde{\bm{c}}_j),\ \forall i \in [q]$.
14:    Update centroids (update step):
15:        $\bm{c}_j = \text{barycenter}(\{\bm{\nu}_i \mid s_i = j\}),\ \forall j \in [k]$.
16:    Update the iteration number: $t = t + 1$.
17: until cluster centroids stop changing.
18: Output: cluster centers $\{\bm{c}_1, \ldots, \bm{c}_k\}$.

3.2 Sparse simplex projection

We first consider the sparse simplex projection $\text{Proj}^{\gamma(t)}(\cdot)$, where $\gamma(t) \in (0, 1]$ is the control parameter of the sparsity described in Section 3.4. $\text{Proj}^{\gamma(t)}(\cdot)$ projects the $i$-th data sample $\bm{\nu}_i$ and the $j$-th centroid $\bm{c}_j$ into sparse $\hat{\bm{\nu}}_i$ and $\hat{\bm{c}}_j$ on $\Delta_n$, respectively. For this purpose, we exploit the projection method proposed in [35], called GSHP.

More concretely, denoting the original $\bm{\nu}_i$ or $\bm{c}_j$ as $\bm{\beta} \in \Delta_n$, and the projected $\hat{\bm{\nu}}_i$ or $\hat{\bm{c}}_j$ as $\hat{\bm{\beta}} \in \Delta_n$, $\hat{\bm{\beta}}$ is generated as

$$\hat{\bm{\beta}} = \text{Proj}^{\gamma(t)}(\bm{\beta}) = \left\{ \begin{array}{rcl} \hat{\bm{\beta}}_{|\mathcal{S}^\star} &=& \mathcal{P}_{\Delta_\kappa}(\bm{\beta}_{|\mathcal{S}^\star}) \\ \hat{\bm{\beta}}_{|(\mathcal{S}^\star)^c} &=& 0, \end{array} \right.$$

where $\kappa = \lfloor n \cdot \gamma(t) \rfloor$ and $\mathcal{S}^\star = {\rm supp}(\mathcal{P}_\kappa(\bm{\beta}))$, where $\mathcal{P}_\kappa(\bm{\beta})$ is the operator that keeps the $\kappa$ largest elements of $\bm{\beta}$ and sets the rest to zero. Additionally, the $v\,(\in [n])$-th element of $\mathcal{P}_{\Delta_\kappa}(\bm{\beta}_{|\mathcal{S}^\star})$ is defined as

$$(\mathcal{P}_{\Delta_\kappa}(\bm{\beta}_{|\mathcal{S}^\star}))_v = [(\bm{\beta}_{|\mathcal{S}^\star})_v + \tau]_+,$$

where $\tau$ is defined as

$$\tau := \frac{1}{\kappa}\left(1 - \sum_{v=1}^{|\mathcal{S}^\star|} (\bm{\beta}_{|\mathcal{S}^\star})_v\right).$$

The computational complexity of this projection is $\mathcal{O}(n \min(\kappa, \log n))$.
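A minimal MATLAB sketch of this projection is given below. It follows the description above (keep the $\kappa$ largest entries, shift them back onto the simplex, zero the rest); the function and variable names are ours, beta is assumed to be a column vector on the simplex, and maxk (MATLAB R2017b or later) is used for the top-$\kappa$ selection.

```matlab
% Minimal sketch of the sparse simplex projection Proj^gamma (GSHP-style).
function beta_hat = sparse_simplex_proj(beta, gamma)
    n     = numel(beta);
    kappa = floor(n * gamma);                 % number of entries to keep
    [~, idx] = maxk(beta, kappa);             % support of the kappa largest entries
    tau      = (1 - sum(beta(idx))) / kappa;  % shift back onto the simplex
    beta_hat = zeros(n, 1);
    beta_hat(idx) = max(beta(idx) + tau, 0);  % [.]_+ on the kept entries
end
```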

Table 1: Averaged clustering performance results on the COIL-100 dataset (5 runs). Purity, NMI, and Accuracy are reported in [×10²]; Time is in [×10² sec]. For the proposed settings, the two values in each cell correspond to $\gamma_{\rm min}$ = 0.7 / 0.8.

| method | shrink oper. | proj. $\tilde{\bm{\nu}}_i$ | proj. $\tilde{\bm{c}}_j$ | Purity | NMI | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| $k$-means | | | | 38.0 | 41.2 | 35.4 | 1.49×10⁻⁴ |
| baseline | | | | 59.2 | 65.5 | 58.2 | 6.51 |
| shrink | ✓ | | | – | – | – | 5.01 |
| FIX | ✓ | ✓ | | 56.5 / 57.7 | 65.2 / 64.4 | 55.0 / 55.9 | 2.50 / 3.28 |
| FIX | ✓ | | ✓ | 61.3 / 60.9 | 66.0 / 66.6 | 59.1 / 59.7 | 2.73 / 3.60 |
| FIX | ✓ | ✓ | ✓ | 60.7 / 57.3 | 66.8 / 65.5 | 59.0 / 55.8 | 1.62 / 2.40 |
| DEC | ✓ | ✓ | | 59.5 / 59.7 | 65.0 / 65.6 | 58.4 / 58.6 | 2.62 / 3.43 |
| DEC | ✓ | | ✓ | 60.7 / 59.1 | 66.7 / 65.5 | 59.7 / 58.0 | 4.15 / 4.00 |
| DEC | ✓ | ✓ | ✓ | 60.0 / 59.5 | 66.0 / 65.4 | 59.0 / 58.5 | 2.83 / 3.45 |
| INC | ✓ | ✓ | | 58.4 / 58.5 | 65.0 / 65.5 | 57.0 / 57.3 | 6.20 / 6.65 |
| INC | ✓ | | ✓ | 63.5 / 61.2 | 68.1 / 66.8 | 62.5 / 59.9 | 6.27 / 6.76 |
| INC | ✓ | ✓ | ✓ | 58.3 / 57.2 | 65.4 / 64.5 | 56.5 / 55.9 | 5.12 / 5.82 |
Figure 1: Performance results of 1-D histogram data on the COIL-100 dataset. (a) Purity, (b) NMI, (c) Accuracy, (d) Computation time.

3.3 Shrinking operations according to zero elements

The sparse simplex projection in the previous section increases the number of zeros in the centroids and data samples. We reduce the computational complexity by harnessing this sparse structure. The vector shrinking operator, denoted as $\text{shrink}(\cdot)$, removes the zero elements from the projected sample $\hat{\bm{\nu}}_i$ and centroid $\hat{\bm{c}}_j$, and generates $\tilde{\bm{\nu}}_i$ and $\tilde{\bm{c}}_j$, respectively. It should be noted that this operation does not produce any degradation of the solution of the LP optimization problem. More specifically, $\tilde{\bm{\nu}}_i$ is calculated as

$$\tilde{\bm{\nu}}_i = \text{shrink}(\hat{\bm{\nu}}_i) = (\hat{\bm{\nu}}_i)_{|\mathcal{S}_{\rm samp}} \in \mathbb{R}^{|\mathcal{S}_{\rm samp}|},$$

where $\mathcal{S}_{\rm samp} = {\rm supp}(\hat{\bm{\nu}}_i)$. Similarly, we calculate

$$\tilde{\bm{c}}_j = \text{shrink}(\hat{\bm{c}}_j) = (\hat{\bm{c}}_j)_{|\mathcal{S}_{\rm cent}} \in \mathbb{R}^{|\mathcal{S}_{\rm cent}|},$$

where $\mathcal{S}_{\rm cent} = {\rm supp}(\hat{\bm{c}}_j)$.

Accordingly, we must also shrink the ground cost matrix $\mathbf{C}_{\bm{\nu c}}$ according to the changed sizes of $\tilde{\bm{\nu}}_i$ and $\tilde{\bm{c}}_j$. For this purpose, we newly introduce the matrix shrinking operator $\text{Shrink}(\cdot)$, which removes from $\mathbf{C}_{\bm{\nu c}}$ the rows indexed by $(\mathcal{S}_{\rm samp})^c$ and the columns indexed by $(\mathcal{S}_{\rm cent})^c$. Consequently, $\mathbf{C}_{\bm{\nu c}}$ is compressed into $\tilde{\mathbf{C}}$ using $\text{Shrink}(\cdot)$ as

$$\tilde{\mathbf{C}} = \text{Shrink}(\mathbf{C}_{\bm{\nu c}}) = (\mathbf{C}_{\bm{\nu c}})_{\mathcal{S}_{\rm samp}, \mathcal{S}_{\rm cent}} \in \mathbb{R}^{|\mathcal{S}_{\rm samp}| \times |\mathcal{S}_{\rm cent}|}.$$

The size of $\tilde{\mathbf{C}}$ is thus $|\mathcal{S}_{\rm samp}| \times |\mathcal{S}_{\rm cent}|$. These sizes change according to the sparsity ratio control parameter $\gamma(t)$, which is described in the next subsection.
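The following MATLAB fragment sketches the shrinking step for one sample-centroid pair; the variable names are ours, and C denotes the full ground cost matrix $\mathbf{C}_{\bm{\nu c}}$.

```matlab
% Minimal sketch of shrink(.) and Shrink(.) for one projected sample
% nu_hat and one projected centroid c_hat (both n x 1), with ground
% cost matrix C (n x n).
S_samp   = find(nu_hat > 0);        % support of the projected sample
S_cent   = find(c_hat  > 0);        % support of the projected centroid
nu_tilde = nu_hat(S_samp);          % shrunk sample,   length |S_samp|
c_tilde  = c_hat(S_cent);           % shrunk centroid, length |S_cent|
C_tilde  = C(S_samp, S_cent);       % shrunk cost matrix, |S_samp| x |S_cent|
% The LP (3) on (nu_tilde, c_tilde, C_tilde) has the same optimal value
% as on (nu_hat, c_hat, C), since only zero-mass rows and columns are removed.
```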

3.4 Control parameter of the sparsity ratio $\gamma(t)$

As described in Section 3.2, the control parameter of the sparsity is denoted as $\gamma(t)$, where $t$ represents the iteration number. We propose three $\gamma(t)$ control algorithms, denoted as 'FIX' (fixed), 'DEC' (decrease), and 'INC' (increase). They are mathematically formulated as

$$\gamma(t) := \left\{ \begin{array}{ll} \gamma_{\rm min} & \text{(FIX)} \\[4pt] 1 - \dfrac{1 - \gamma_{\rm min}}{T_{\rm max}}\, t & \text{(DEC)} \\[4pt] \gamma_{\rm min} + \dfrac{1 - \gamma_{\rm min}}{T_{\rm max}}\, t & \text{(INC)}, \end{array} \right.$$

where $\gamma_{\rm min} \in \mathbb{R}$ is the minimum sparsity ratio and $T_{\rm max} \in \mathbb{N}$ is the maximum number of iterations. The overall algorithm of the proposed SSPW $k$-means is presented in Algorithm 3.
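The three schedules can be written compactly as follows; this is a direct transcription of the formula above, and the function name is ours.

```matlab
% Minimal sketch of the FIX / DEC / INC sparsity-ratio schedules.
function gamma = sparsity_ratio(t, T_max, gamma_min, mode)
    switch mode
        case 'FIX', gamma = gamma_min;
        case 'DEC', gamma = 1 - (1 - gamma_min) / T_max * t;
        case 'INC', gamma = gamma_min + (1 - gamma_min) / T_max * t;
    end
end
```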

Table 2: Averaged clustering performance results on the USPS dataset (10 runs). Purity, NMI, and Accuracy are reported in [×10²]; Time is in [×10² sec]. For the proposed settings, the four values in each cell correspond to $\gamma_{\rm min}$ = 0.5 / 0.6 / 0.7 / 0.8.

| method | shrink oper. | proj. $\tilde{\bm{\nu}}_i$ | proj. $\tilde{\bm{c}}_j$ | Purity | NMI | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| $k$-means | | | | 65.5 | 65.8 | 64.1 | 2.13×10⁻⁴ |
| baseline | | | | 66.6 | 65.0 | 65.5 | 8.29 |
| shrink | ✓ | | | – | – | – | 7.02 |
| FIX | ✓ | ✓ | | 67.8 / 67.0 / 66.9 / 66.8 | 66.2 / 65.5 / 65.9 / 65.2 | 66.4 / 65.8 / 65.8 / 65.8 | 3.11 / 3.25 / 3.99 / 5.04 |
| FIX | ✓ | | ✓ | 66.0 / 67.2 / 66.2 / 66.4 | 64.6 / 65.4 / 65.0 / 64.9 | 65.0 / 66.1 / 65.1 / 65.3 | 1.87 / 2.98 / 3.91 / 4.60 |
| FIX | ✓ | ✓ | ✓ | 68.1 / 67.2 / 66.3 / 66.6 | 66.3 / 65.6 / 65.0 / 65.0 | 67.0 / 65.9 / 65.2 / 65.5 | 0.98 / 1.68 / 2.48 / 3.46 |
| DEC | ✓ | ✓ | | 66.8 / 66.8 / 66.8 / 66.8 | 65.4 / 65.4 / 65.3 / 65.3 | 65.7 / 65.7 / 65.7 / 65.7 | 4.12 / 4.45 / 5.65 / 6.29 |
| DEC | ✓ | | ✓ | 67.0 / 67.0 / 66.8 / 66.6 | 65.3 / 65.3 / 65.3 / 65.0 | 65.8 / 65.8 / 65.7 / 65.5 | 5.22 / 5.82 / 6.13 / 7.37 |
| DEC | ✓ | ✓ | ✓ | 66.8 / 66.8 / 66.8 / 66.8 | 65.4 / 65.4 / 65.4 / 65.3 | 65.7 / 65.7 / 65.7 / 65.7 | 3.51 / 3.91 / 4.89 / 5.62 |
| INC | ✓ | ✓ | | 66.3 / 65.7 / 66.9 / 66.9 | 65.0 / 64.8 / 65.4 / 65.2 | 65.1 / 64.6 / 65.7 / 65.9 | 9.74 / 10.47 / 11.29 / 12.04 |
| INC | ✓ | | ✓ | 66.8 / 66.4 / 66.3 / 66.2 | 65.3 / 65.0 / 64.9 / 65.0 | 65.6 / 65.3 / 65.2 / 65.1 | 7.11 / 6.64 / 6.97 / 6.63 |
| INC | ✓ | ✓ | ✓ | 66.8 / 65.8 / 66.3 / 66.5 | 65.4 / 64.8 / 65.2 / 64.9 | 65.5 / 64.7 / 65.2 / 65.5 | 6.06 / 7.05 / 7.48 / 9.71 |
Figure 2: Performance results of 2-D histogram data on the USPS dataset. (a) Purity, (b) NMI, (c) Accuracy, (d) Computation time.
Figure 3: Convergence performance results for different projection data with different $\gamma_{\rm min} = \{0.5, 0.6, 0.7, 0.8\}$. (a) projection of sample, (b) projection of centroid, (c) projection of centroid and sample.

4 Numerical evaluation

We conducted numerical experiments addressing computational efficiency and clustering quality on real-world datasets to demonstrate the effectiveness of the proposed SSPW $k$-means. For the Wasserstein barycenter computation, we use the algorithm proposed in [25]. The initial centroids are calculated using the Euclidean $k$-means with litekmeans (http://www.cad.zju.edu.cn/home/dengcai/Data/code/litekmeans.m). The linprog solver of Mosek [36] (https://www.mosek.com/) is used to solve the LP problem. Throughout the following experiments, the standard Wasserstein $k$-means described in Section 2.4 is referred to as the baseline method. The Euclidean $k$-means algorithm is also compared on the clustering quality metrics, although its computational cost is much lower than that of the Wasserstein algorithms. We set $T_{\rm max} = 10$. The experiments were performed on a quad-core Intel Core i5 computer at 3.6 GHz with 64 GB DDR2 2667 MHz RAM. All code is implemented in MATLAB.

4.1 1-D histogram evaluation

We first conducted a preliminary experiment using the COIL-100 object image dataset (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php) [37], which includes 7,200 images of 100 objects. From this dataset, we randomly select 10 classes and 10 images per class. We first convert the pixel information into intensity in the range of 0 to 255. Then we generate a one-dimensional histogram with 255 bins by removing the intensity value of zero. We set $\gamma_{\rm min} = \{0.7, 0.8\}$.
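For reference, a possible MATLAB transcription of this preprocessing for a single COIL-100 image is shown below; it is our reading of the description above, not the authors' exact script, the file name is illustrative, and rgb2gray requires the Image Processing Toolbox.

```matlab
% Sketch of the 1-D histogram preprocessing described above (assumption:
% mirrors the text only). The file name is illustrative.
img    = rgb2gray(imread('obj1__0.png'));           % COIL-100 image -> grayscale
edges  = 0.5:1:255.5;                               % 255 bins for intensities 1..255
counts = histcounts(double(img(:)), edges);         % intensity 0 is excluded
nu     = counts(:) / sum(counts);                   % histogram on the simplex
```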

The averaged clustering performance in terms of Purity, NMI, and Accuracy over 5 runs is summarized in TABLE 1. For ease of comparison, the differences against the baseline method are presented in Fig. 1, of which panels (a), (b), and (c) respectively show Purity, NMI, and Accuracy. Here, positive values represent improvements over the baseline method. Additionally, Fig. 1(d) presents the speed-up ratio of the computation time against that of the baseline method, where values greater than 1.0 indicate faster computation than the baseline. The figures summarize the results for the combinations of the different $\gamma(t)$ algorithms and different $\gamma_{\rm min}$. They also show the differences among the projection types, i.e., projection of the centroid, of the data sample, and of both the centroid and the data sample. From Fig. 1, one can find that the case with the centroid projection stably outperforms the other cases. In particular, the INC algorithm with the centroid projection stably outperforms the others. However, it requires more computation, and its effectiveness in terms of reducing the computational complexity is lower. On the other hand, the FIX algorithm with $\gamma_{\rm min} = 0.7$ yields comparable improvements while keeping the computation time much lower.

4.2 2-D histogram evaluation

In this experiment, more comprehensive evaluations were conducted. We investigate a wider range of $\gamma_{\rm min}$ than in the earlier experiment, i.e., $\gamma_{\rm min} = \{0.5, 0.6, 0.7, 0.8\}$. For this purpose, we used the USPS handwritten digit dataset (http://www.gaussianprocess.org/gpml/data/), which includes 9,298 handwritten single digits between 0 and 9, each consisting of a $16 \times 16$ pixel image. Pixel values are normalized to the range $[-1, 1]$. From this dataset, we randomly select 10 images per class. We first convert the pixel information into intensity in the range of 0 to 255. Then we generate a two-dimensional histogram for which the sum of the intensities of all pixels is equal to one.

The averaged clustering qualities over 10 runs are presented in TABLE 2. As shown there, the FIX algorithm outperforms the others in terms of both the clustering quality and the computation time. Similarly to the earlier experiment, for ease of comparison, the differences against the baseline method are also shown in Fig. 2. It is surprising that the proposed algorithm keeps the clustering quality higher than the baseline method across all metrics, even while reducing the computation time. Regarding individual algorithms, although the INC algorithm exhibits degradation in some cases, the DEC algorithm outperforms the baseline method in all settings. Furthermore, unexpectedly, the performance of the algorithms with lower $\gamma_{\rm min}$, i.e., $\gamma_{\rm min} = \{0.5, 0.6\}$, is comparable with or better than that with $\gamma_{\rm min} = \{0.7, 0.8\}$.

Second, the computation time under the same settings is presented in Fig. 2(d). From this result, it is apparent that the computation time is approximately proportional to $\gamma_{\rm min}$ for each $\gamma(t)$ control algorithm. In other words, the case with $\gamma_{\rm min} = 0.5$ requires the lowest computation time of all. Moreover, regarding the different projection types, the case in which the sparse simplex projection is performed on both the centroid and the data sample requires the shortest computation time, as expected.

4.3 Convergence performance

This subsection investigates the convergence performance of the proposed algorithm. The same settings as in the earlier experiments are used. We first address the convergence speed in terms of computing time with respect to the different projection types as well as different $\gamma_{\rm min}$. For this purpose, we examine the DEC algorithm; the two remaining algorithms behave very similarly. Fig. 3 shows the convergence of Accuracy in terms of computation time for the first of the 10 runs in the earlier experiment. Figs. 3(a)-3(c) respectively show the results obtained when the projection is performed on the data sample, the centroid, and both the centroid and the data sample. Regarding the different $\gamma_{\rm min}$, smaller $\gamma_{\rm min}$ leads to faster convergence. Apart from that, the convergence behaviors of the three projection types are similar. However, the case with the projection of both the centroid and the data sample yields the fastest results of all. This is also demonstrated explicitly in Fig. 4 under $\gamma_{\rm min} = 0.5$.

We also address the performance differences among the different $\gamma(t)$ control algorithms. Fig. 5 presents the convergence speeds obtained when projecting both the centroid and the data sample with $\gamma_{\rm min} = 0.5$, where the plots explicitly show the individual iterations. As is readily apparent, the FIX and INC control algorithms require much less time than the DEC algorithm because these two algorithms can process much smaller centroids and data samples at the beginning of the iterations. However, the computation time of the INC control algorithm increases gradually as the iterations proceed, whereas that of the DEC control algorithm decreases gradually. Overall, the FIX control algorithm outperforms the other algorithms.

Figure 4: Convergence with different projection data using $\gamma_{\rm min} = 0.5$.
Figure 5: Convergence comparisons of the different algorithms of $\gamma(t)$.

4.4 Performance comparison on different ratios

Finally, we compared performance for different sparsity ratios $\gamma_{\rm min} \in \{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$ under the same configuration as the earlier experiment. We specifically address the FIX algorithm for $\gamma(t)$ and the case in which the projection is performed on both the centroid and the sample. For the time comparison, we use the same speed-up metric against the baseline method as in the earlier experiment; therefore, the values are positive, and a value greater than 1.0 means a speed-up over the baseline method. For this experiment, we re-ran the same simulation; thereby, the obtained values differ slightly from those of the earlier experiment because of randomness.

The results are summarized in TABLE 3, and Fig. 6 presents them for ease of comparison. From both results, the cases with $\gamma_{\rm min} = \{0.3, 0.4\}$ yield higher performance than the other cases with respect to both the clustering quality and the computational efficiency. Surprisingly, we obtain around 4 points higher values in the clustering quality metrics with a more than 20-fold speed-up. On the other hand, $\gamma_{\rm min} = 0.2$ results in the worst performance. Therefore, the optimization of $\gamma_{\rm min}$ to give the best performance is an important topic for future research.

Table 3: Averaged clustering performance results on the USPS dataset (10 runs). Purity, NMI, and Accuracy are reported in [×10²]; Time is in [×10² sec].

| method | $\gamma_{\rm min}$ | Purity | NMI | Accuracy | Time |
|---|---|---|---|---|---|
| baseline | 0.00 | 69.7 | 69.6 | 66.9 | 7.98 |
| shrink | – | – | – | – | 6.77 |
| FIX with shrink & projection of $\tilde{\bm{\nu}}_i$ & $\tilde{\bm{c}}_j$ | 0.90 | 69.9 | 69.7 | 67.1 | 4.68 |
| | 0.80 | 69.9 | 69.7 | 67.1 | 3.14 |
| | 0.70 | 69.9 | 69.3 | 67.0 | 2.34 |
| | 0.60 | 70.4 | 70.0 | 67.7 | 1.54 |
| | 0.50 | 71.8 | 71.8 | 69.4 | 0.88 |
| | 0.40 | 72.4 | 72.4 | 69.6 | 0.56 |
| | 0.30 | 73.9 | 72.7 | 71.4 | 0.35 |
| | 0.20 | 68.8 | 67.6 | 65.8 | 0.23 |
Figure 6: Performance comparison on different ratios on the USPS dataset.

4.5 Discussions

From the exhaustive experiments above, it can be seen that the proposed SSPW $k$-means algorithm dynamically reduces the computational complexity by removing lower-valued histogram elements and harnessing the sparse simplex projection while keeping the degradation of the clustering quality small. Although the performance of each setting of the proposed algorithm depends on the dataset, we conclude that the FIX algorithm with the projection of both the centroid and the sample under lower sparsity ratios $\gamma_{\rm min}$ is the best option because it stably gives better clustering quality with the fastest computing times among all settings. In fact, this setting gives performance comparable to the best one in TABLE 1 and outperforms the others in TABLE 2. It also yields around 4 points higher values in the clustering quality metrics with a more than 20-fold speed-up, as seen in TABLE 3.

5 Conclusions

This paper proposed a faster Wasserstein $k$-means algorithm that reduces the computational complexity of the Wasserstein distance by exploiting sparse simplex projection operations. Numerical evaluations demonstrated the effectiveness of the proposed SSPW $k$-means algorithm. As a future avenue, we would like to extend the proposed approach to a stochastic setting [38].

References

  • [1] T. Fukunaga and H. Kasai. Wasserstein k-means with sparse simplex projection. In 25th International Conference on Pattern Recognition (ICPR), 2020.
  • [2] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–137, 1982.
  • [3] K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In IPPS/SPDP Workshop on High Performance Data Mining, 1998.
  • [4] D. Pelleg and A. Moore. Accelerating exact $k$-means algorithms with geometric reasoning. In Proc. 5th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining (KDD), 1999.
  • [5] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881–892, 2002.
  • [6] C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, 2003.
  • [7] G. Hamerly. Making k-means even faster. In SIAM Int. Conf. Data Mining (SDM), 2010.
  • [8] J. Drake and G. Hamerly. Accelerated $k$-means with adaptive distance bounds. In NIPS Work. Optim., 2012.
  • [9] Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz. Yinyang k-means: A drop-in replacement of the classic k-means with consistent speedup. In ICML, 2015.
  • [10] T. Bottesch, T. Buhler, and M. Kachele. Speeding up k-means by approximating euclidean distances via block vectors. In ICML, 2016.
  • [11] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM), 2005.
  • [12] S. Innami and H. Kasai. NMF-based environmental sound source separation using time-variant gain features. Computers & Mathematics with Applications, 64(5):1333–1342, 2012.
  • [13] H. Kasai. Stochastic variance reduced multiplicative update for nonnegative matrix factorization. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
  • [14] H. Kasai. Accelerated stochastic multiplicative update with gradient averaging for nonnegative matrix factorization. In 26th European Signal Processing Conference (EUSIPCO), 2018.
  • [15] R. Hashimoto and H. Kasai. Sequential semi-orthogonal multi-level NMF with negative residual reduction for network embedding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
  • [16] Y. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using wasserstein barycenter with sparse support. IEEE Trans. Signal Process, 65(9):2317–2332, 2017.
  • [17] G. Peyre and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [18] C. Villani. Optimal transport: Old and new. Springer, 2008.
  • [19] H. Kasai. Multi-view wasserstein discriminant analysis with entropic regularized wasserstein distance. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [20] S. Dasgupta. The hardness of k-means clustering. Technical Report CS2007-0890, UC San Diego, 2007.
  • [21] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is np-hard. In Proc. 3rd Int. Work. Alg. Comput. (WALCOM), 2009.
  • [22] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proc. 18th Ann. ACM-SIAM Symp. Discr. Alg. (SODA), 2007.
  • [23] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of lloyd-type methods for the k-means problem. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.
  • [24] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. In Proc. VLDB Endow, 2012.
  • [25] J. D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • [26] M. Cuturi and A. Doucet. Fast computation of wasserstein barycenters. In ICML, 2014.
  • [27] N. Bonneel, J. Rabin, and G. Peyré. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51:22–45, 2015.
  • [28] E. Anderes, S. Borgwardt, and J. Miller. Discrete wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84:389–409, 2016.
  • [29] S. Claici, E. Chien, and J. Solomon. Stochastic wasserstein barycenters. In ICML, 2018.
  • [30] G. Puccetti, L. Rüschendorf, and S. Vanduffel. On the computation of wasserstein barycenters. Journal of Multivariate Analysis, 176:1–16, 2020.
  • [31] M. Staib and S. Jegelka. Wasserstein k-means++ for cloud regime histogram clustering. In Proceedings of the Seventh International Workshop on Climate Informatics (CI), 2017.
  • [32] S. Borgwardt and S. Patterson. On the computational complexity of finding a sparse wasserstein barycenter. arXiv preprint: arXiv:1910.07568, 2019.
  • [33] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Annual Conference on Neural Information Processing Systems (NIPS), 2013.
  • [34] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
  • [35] A. Kyrillidis, S. Becker, V. Cevher, and C. Koch. Sparse projections onto the simplex. In ICML, 2013.
  • [36] E. D. Andersen, C. Roos, and T. Terlaky. High Performance Optimization, volume 33, chapter The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. Springer, Boston, MA, 2000.
  • [37] Columbia university image library (COIL-100).
  • [38] H. Kasai. SGDLibrary: A MATLAB library for stochastic optimization algorithms. Journal of Machine Learning Research (JMLR), 18, 2018.