
Hierarchical Low-Rank Approximation of Regularized Wasserstein Distance

Mohammad Motamed [email protected] Department of Mathematics and Statistics, The University of New Mexico, Albuquerque, USA
Abstract

Sinkhorn divergence is a measure of dissimilarity between two probability measures. It is obtained through adding an entropic regularization term to Kantorovich’s optimal transport problem and can hence be viewed as an entropically regularized Wasserstein distance. Given two discrete probability vectors in the $n$-simplex and supported on two bounded subsets of ${\mathbb{R}}^{d}$, we present a fast method for computing Sinkhorn divergence when the cost matrix can be decomposed into a $d$-term sum of asymptotically smooth Kronecker product factors. The method combines Sinkhorn’s matrix scaling iteration with a low-rank hierarchical representation of the scaling matrices to achieve a near-linear complexity ${\mathcal{O}}(n\log^{3}n)$. This provides a fast and easy-to-implement algorithm for computing Sinkhorn divergence, enabling its applicability to large-scale optimization problems, where the computation of the classical Wasserstein metric is not feasible. We present a numerical example related to signal processing to demonstrate the applicability of quadratic Sinkhorn divergence in comparison with quadratic Wasserstein distance and to verify the accuracy and efficiency of the proposed method.

keywords optimal transport, Wasserstein metric, Sinkhorn divergence, hierarchical matrices

1 Introduction

Wasserstein distance is a measure of dissimilarity between two probability measures, arising from optimal transport; see e.g. [16, 17]. Due to its desirable convexity and convergence features, Wasserstein distance is considered an important measure of dissimilarity in many applications, ranging from computer vision and machine learning to seismic and Bayesian inversion; see e.g. [13, 5, 18, 11]. The application of the Wasserstein metric may, however, be limited to cases where the probability measures are supported on low-dimensional spaces, because its numerical computation can quickly become prohibitive as the dimension increases; see e.g. [13].

Sinkhorn divergence [3] is another measure of dissimilarity related to Wasserstein distance. It is obtained through adding an entropic regularization term to the Kantorovich formulation of the optimal transport problem. It can hence be viewed as a regularized Wasserstein distance with desirable convexity and regularity properties. A main advantage of Sinkhorn divergence over Wasserstein distance lies in its computability by an iterative algorithm known as Sinkhorn’s matrix scaling algorithm [14], where each iteration involves two matrix-vector products. Consequently, Sinkhorn divergence may serve as a feasible alternative to the classical Wasserstein distance, particularly when probability measures are supported on multi-dimensional spaces in ${\mathbb{R}}^{d}$ with $d\geq 2$.

Given a cost function and two $n$-dimensional discrete probability vectors supported on two measurable subsets of the Euclidean space ${\mathbb{R}}^{d}$, it is known that Sinkhorn’s algorithm can compute an $\varepsilon$-approximation of the original Kantorovich optimal transport problem within ${\mathcal{O}}(\log n\,\varepsilon^{-2})$ iterations [4]. If the matrix-vector products needed in each iteration of the algorithm are performed with cost ${\mathcal{O}}(n^{2})$, the overall complexity of the algorithm for computing Sinkhorn divergence will be ${\mathcal{O}}(n^{2}\,\log n\,\varepsilon^{-2})$. While theoretically attainable, this quadratic cost may still prohibit the application of Sinkhorn divergence to large-scale optimization problems that require many evaluations of Sinkhorn-driven loss functions.

In the present work, we develop a hierarchical low-rank strategy that significantly reduces the quadratic cost of Sinkhorn’s algorithm to a near-linear complexity ${\mathcal{O}}(n\,\log^{3}n\,\varepsilon^{-2})$. We consider a class of cost matrices in optimal transport that can be decomposed into a $d$-term sum of Kronecker product factors, where each term is asymptotically smooth. Importantly, such a class of cost matrices induces a wide range of optimal transport distances, including the quadratic Wasserstein metric and its corresponding Sinkhorn divergence. We then propose a two-step strategy that reduces the complexity of each Sinkhorn iteration from ${\mathcal{O}}(n^{2})$ to ${\mathcal{O}}(n\,\log^{2}n)$. We first exploit the block structure of Kronecker product matrices and turn the original (and possibly high-dimensional) matrix-vector product problems into several smaller one-dimensional problems. The smaller problems are then computed by the hierarchical matrix technique [7], thanks to their asymptotic smoothness, with a near-linear complexity. We further present a thorough error-complexity analysis of the proposed hierarchical low-rank Sinkhorn’s algorithm.

The rest of the paper is organized as follows. In Section 2 we review the basics of Wasserstein and Sinkhorn dissimilarity measures in optimal transport and include a short discussion of the original Sinkhorn’s algorithm. We then present and analyze the proposed hierarchical low-rank Sinkhorn’s algorithm for computing Sinkhorn divergence in Section 3. In Section 4 we present a numerical example related to a signal processing optimization problem to demonstrate the applicability, accuracy, and efficiency of the proposed method. Finally, in Section 5 we summarize conclusions and outline future works.

2 Optimal transport dissimilarity measures

In this section, we present the basics of the Kantorovich formulation of optimal transport and review two related dissimilarity measures; the Wasserstein distance and the Sinkhorn divergence. We focus on probability vectors, rather than working with more general probability measures.

2.1 Kantorovich’s optimal transport problem and Wasserstein distance

Let ${\mathcal{X}}$ and ${\mathcal{Y}}$ be two measurable subsets of the Euclidean space ${\mathbb{R}}^{d}$, with $d\in{\mathbb{N}}$. Let further ${\bf f}$ and ${\bf g}$ be two $n$-dimensional probability vectors, i.e. two vectors in the probability simplex $\Sigma_{n}:=\{{\bf f}\in{\mathbb{R}}_{+}^{n}:\,\,{\bf f}^{\top}\mathds{1}_{n}=1\}$, defined on ${\mathcal{X}}$ and ${\mathcal{Y}}$, respectively. Here, $\mathds{1}_{n}$ is the $n$-dimensional vector of ones. The two probability vectors ${\bf f}$ and ${\bf g}$ are assumed to be given at two sets of $n$ discrete points $\{{\bf x}_{1},\dotsc,{\bf x}_{n}\}\subset{\mathcal{X}}$ and $\{{\bf y}_{1},\dotsc,{\bf y}_{n}\}\subset{\mathcal{Y}}$ in ${\mathbb{R}}^{d}$, respectively. Figure 1 shows a schematic representation of probability vectors in $d=1$ and $d=2$ dimensions.

Figure 1: A schematic representation of a probability vector ${\bf f}\in\Sigma_{n}$ given at a set of $n$ discrete points $\{{\bf x}_{1},\dotsc,{\bf x}_{n}\}\subset{\mathcal{X}}\subset{\mathbb{R}}^{d}$ in one ($d=1$) and two ($d=2$) dimensions. Left: a probability vector with $n=5$ components supported in ${\mathbb{R}}$ and represented by a lollipop chart. The height of each bar is equal to a component ${\bf f}_{i}$, and the location of each bar corresponds to a point ${\bf x}_{i}\in{\mathcal{X}}\subset{\mathbb{R}}$. Right: a probability vector with $n=12$ components supported in ${\mathbb{R}}^{2}$ and represented by a bubble chart, where the area and the center of each bubble correspond to ${\bf f}_{i}$ and ${\bf x}_{i}\in{\mathcal{X}}\subset{\mathbb{R}}^{2}$, respectively.

We denote by $U({\bf f},{\bf g})$ the transport polytope of ${\bf f}$ and ${\bf g}$, i.e. the set of all nonnegative $n\times n$ matrices with row and column sums equal to ${\bf f}$ and ${\bf g}$, respectively, that is,

U({\bf f},{\bf g}):=\{P\in{\mathbb{R}}_{+}^{n\times n}:\,\,P\mathds{1}_{n}={\bf f},\,P^{\top}\mathds{1}_{n}={\bf g}\}.

Each matrix $P=[P_{ij}]\in{\mathbb{R}}_{+}^{n\times n}$ in the transport polytope $U({\bf f},{\bf g})$ is a transport (or coupling) matrix that encodes a transport plan, that is, each element $P_{ij}$ of $P$ describes the amount of mass to be transported from a source point ${\bf x}_{i}\in{\mathcal{X}}$ to a target point ${\bf y}_{j}\in{\mathcal{Y}}$, with $i,j=1,\dotsc,n$. The two constraints $P\mathds{1}_{n}={\bf f}$ and $P^{\top}\mathds{1}_{n}={\bf g}$ are necessary for a transport plan $P$ to be admissible; all the mass taken from a point ${\bf x}_{i}$, i.e. $\sum_{j=1}^{n}P_{ij}$, must be equal to the mass at point ${\bf x}_{i}$, i.e. ${\bf f}_{i}$, and all the mass transported to a point ${\bf y}_{j}$, i.e. $\sum_{i=1}^{n}P_{ij}$, must be equal to the mass at point ${\bf y}_{j}$, i.e. ${\bf g}_{j}$. We say that ${\bf f}$ and ${\bf g}$ are the marginals of $P$.

Let further $c:{\mathcal{X}}\times{\mathcal{Y}}\rightarrow{\mathbb{R}}_{+}$ be a non-negative cost function defined on ${\mathcal{X}}\times{\mathcal{Y}}$. For any pair of points $({\bf x},{\bf y})\in{\mathcal{X}}\times{\mathcal{Y}}$, the value $c({\bf x},{\bf y})$ represents the cost of transporting one unit of mass from a source point ${\bf x}\in{\mathcal{X}}$ to a target point ${\bf y}\in{\mathcal{Y}}$. Correspondingly, we consider the cost matrix

C=[C_{ij}]\in{\mathbb{R}}_{+}^{n\times n},\qquad C_{ij}=c({\bf x}_{i},{\bf y}_{j}),\qquad i,j=1,\dotsc,n.

The optimal transport problem can now be formulated as follows. We first note that for any given transport plan $P\in U({\bf f},{\bf g})$, the total transport cost associated with $P$ is given by the Frobenius inner product $\langle P,C\rangle=\sum_{i,j}P_{ij}C_{ij}$. Kantorovich’s optimal transport problem then aims at minimizing the total transport cost over all admissible transport plans. It reads

T_{C}({\bf f},{\bf g}):=\min_{P\in U({\bf f},{\bf g})}\langle P,C\rangle, \qquad (1)

where $T_{C}({\bf f},{\bf g})$ is the optimal total cost of transporting ${\bf f}$ onto ${\bf g}$.

In some applications the main goal of solving Kantorovich’s optimal transport problem (1) is to find the optimal transport plan, i.e. the transport matrix $P$ that minimizes the total cost. In some other applications the goal is to directly find the optimal cost $T_{C}$ without the need to explicitly form the optimal transport matrix. An example of the latter application appears in the construction of loss functions in optimization problems, where loss functions are built based on metrics induced by the optimal cost $T_{C}$. An important type of such metrics is the Wasserstein metric.

For the optimal cost $T_{C}$ in (1) to induce a metric, we need the cost matrix $C\in{\mathbb{R}}_{+}^{n\times n}$ to be a metric matrix belonging to the cone of distance matrices [17], i.e. we need $C\in{\mathcal{M}}^{n\times n}$, where

{\mathcal{M}}^{n\times n}:=\{C=[C_{ij}]\in{\mathbb{R}}_{+}^{n\times n}:C_{ij}=0\Leftrightarrow i=j,\,\,C_{ij}\leq C_{ik}+C_{kj}\}.

Throughout this paper, we always assume that the cost matrix $C$ is a metric matrix: $C\in{\mathcal{M}}^{n\times n}$. An important class of metric matrices is driven by cost functions of the form $c({\bf x},{\bf y})=d({\bf x},{\bf y})^{p}$, where $p\in[1,\infty)$, and $d$ is a distance function (or metric). In this case the cost matrix reads

C=[C_{ij}]\in{\mathbb{R}}_{+}^{n\times n},\qquad C_{ij}=d({\bf x}_{i},{\bf y}_{j})^{p},\qquad i,j=1,\dotsc,n,\qquad p\in[1,\infty). \qquad (2)

Intuitively, this choice implies that the cost of transporting one unit of mass from a source point ${\bf x}\in{\mathcal{X}}$ to a target point ${\bf y}\in{\mathcal{Y}}$ corresponds to the distance between the two points. The optimal cost $T_{C}({\bf f},{\bf g})$ in (1) corresponding to the cost matrix (2) induces a metric known as the Wasserstein distance of order $p$ [16, 17],

W_{p}({\bf f},{\bf g})=\bigl{(}T_{C}({\bf f},{\bf g})\bigr{)}^{1/p}. \qquad (3)

Due to its desirable convexity and convergence features, Wasserstein distance is considered an important measure of dissimilarity in many applications, ranging from computer vision and machine learning to seismic and Bayesian inversion; see e.g. [13, 5, 18, 11]. Unfortunately, the numerical computation of the Wasserstein metric is in general infeasible in high dimensions and is mainly limited to low dimensions; see e.g. [13]. For instance, in the case $c({\bf x},{\bf y})=||{\bf x}-{\bf y}||^{2}$, the quadratic Wasserstein distance between two continuous probability density functions $f:{\mathcal{X}}\rightarrow{\mathbb{R}}_{+}$ and $g:{\mathcal{Y}}\rightarrow{\mathbb{R}}_{+}$ is given by [2]

W_{2}^{2}(f,g)=\int_{\mathcal{X}}|{\bf x}-M({\bf x})|^{2}\,f({\bf x})\,d{\bf x},

where $M$ is the optimal map. In one dimension, i.e. when ${\mathcal{X}},{\mathcal{Y}}\subset{\mathbb{R}}$, the optimal map is given by $M=G^{-1}\circ F$, where $F$ and $G$ are the cumulative distribution functions corresponding to $f$ and $g$, respectively. The discrete $W_{2}$ metric between the corresponding probability vectors ${\bf f}=[f({\bf x}_{i})]$ and ${\bf g}=[g({\bf y}_{i})]$ can then be efficiently approximated by a quadrature rule,

W_{2}^{2}({\bf f},{\bf g})\approx\Delta{\bf x}\,\sum_{i=1}^{n}|{\bf x}_{i}-G^{-1}\circ F({\bf x}_{i})|^{2}\,{\bf f}_{i},\qquad F({\bf x}_{i})=\sum_{j=1}^{i}{\bf f}_{j},

where $G^{-1}$ may be computed by interpolation and an efficient search algorithm, such as binary search. The overall cost of computing the discrete $W_{2}$ metric in one dimension is then ${\mathcal{O}}(n\,\log n)$. In multiple dimensions, i.e. when ${\mathcal{X}},{\mathcal{Y}}\subset{\mathbb{R}}^{d}$ with $d\geq 2$, however, the computation of the optimal map $M$, and hence the $W_{2}$ metric, may be either expensive or infeasible. The map is indeed given by the gradient of a convex function that is the solution to a constrained nonlinear Monge-Ampère equation. We refer to [5] for more discussion on the finite difference approximation of the Monge-Ampère equation in two dimensions.
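As a concrete illustration of the one-dimensional procedure just described, the following sketch (NumPy; function and variable names are ours, and the probability vectors are assumed to sum to one so that the quadrature weight is absorbed into them) approximates the discrete $W_{2}$ metric by inverting the cumulative sums with a binary search (np.searchsorted).

```python
import numpy as np

def w2_squared_1d(f, g, x):
    """Approximate W_2^2(f, g) for two probability vectors f, g (each summing
    to one) given at the same sorted 1D points x, via the map M = G^{-1} o F."""
    F = np.cumsum(f)                       # discrete CDF of f at the points x
    G = np.cumsum(g)                       # discrete CDF of g at the points x
    # G^{-1}(F(x_i)): index of the first point where G reaches F(x_i)
    idx = np.clip(np.searchsorted(G, F, side="left"), 0, len(x) - 1)
    T = x[idx]                             # approximate optimal map M(x_i)
    return np.sum((x - T) ** 2 * f)        # quadrature weight absorbed in f

# usage: two discrete Gaussians on [0, 1]
x = np.linspace(0.0, 1.0, 1000)
f = np.exp(-((x - 0.3) / 0.05) ** 2); f /= f.sum()
g = np.exp(-((x - 0.7) / 0.05) ** 2); g /= g.sum()
print(w2_squared_1d(f, g, x))              # close to (0.7 - 0.3)^2 = 0.16
```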

2.2 Entropy regularized optimal transport and Sinkhorn divergence

One approach to computing Kantorovich’s optimal transport problem (1) is to regularize the original problem and seek approximate solutions to the regularized problem. One possibility is to add an entropic penalty term to the total transport cost and arrive at the regularized problem

T_{C}^{\lambda}({\bf f},{\bf g}):=\min_{P\in U({\bf f},{\bf g})}\langle P,C\rangle-\frac{1}{\lambda}H(P), \qquad (4)

where $\lambda>0$ is a regularization parameter, and $H(P)$ is the discrete entropy of the transport matrix,

H(P)=-\sum_{i,j}P_{ij}(\log P_{ij}-1). \qquad (5)

The regularized problem (4)-(5) with $C\in{\mathcal{M}}^{n\times n}$ has a unique solution, say $P_{\lambda}$ [3]. Indeed, the existence and uniqueness follows from the boundedness of $U({\bf f},{\bf g})$ and the strict convexity of the negative entropy. On the one hand, as $\lambda$ increases, the unique solution $P_{\lambda}$ converges to the solution with maximum entropy within the set of all optimal solutions of the original Kantorovich problem; see Proposition 4.1 in [13],

P_{\lambda}\stackrel{{\scriptstyle\lambda\rightarrow\infty}}{{\longrightarrow}}\underset{P\in\{P\in U({\bf f},{\bf g}),\,\langle P,C\rangle=T_{C}({\bf f},{\bf g})\}}{\mathrm{argmin}}-H(P).

In particular, we have $T_{C}^{\lambda}({\bf f},{\bf g})\stackrel{{\scriptstyle\lambda\rightarrow\infty}}{{\longrightarrow}}T_{C}({\bf f},{\bf g})$. On the other hand, as $\lambda$ decreases, the unique solution $P_{\lambda}$ converges to the transport matrix with maximum entropy between the two marginals ${\bf f}$ and ${\bf g}$, i.e. the outer product of ${\bf f}$ and ${\bf g}$; see Proposition 4.1 in [13],

P_{\lambda}\stackrel{{\scriptstyle\lambda\rightarrow 0}}{{\longrightarrow}}\,{\bf f}\,{\bf g}^{\top}=({\bf f}_{i}\,{\bf g}_{j}).

It is to be noted that in this latter case, as $\lambda\rightarrow 0$, the optimal solution $P_{\lambda}$ becomes less and less sparse (or smoother), i.e. with more and more entries larger than a prescribed small threshold.

With $P_{\lambda}$ being the optimal solution to the regularized problem (4)-(5) corresponding to the cost matrix (2), the Sinkhorn divergence of order $p$ between ${\bf f}$ and ${\bf g}$ is defined as

S_{p,\lambda}({\bf f},{\bf g})=\langle P_{\lambda},C\rangle^{1/p}. \qquad (6)

Sinkhorn divergence (6) has several important properties. First, it satisfies all axioms for a metric with the exception of the identity of indiscernibles (or the coincidence axiom); that is, it satisfies the non-negativity, symmetry, and triangle inequality axioms [3]. Second, it is convex and smooth with respect to the input probability vectors and can be differentiated using automatic differentiation [13]. Third, as discussed below, it can be computed by a simple alternating minimization algorithm, known as Sinkhorn’s algorithm, where each iteration involves two matrix-vector products. Consequently, Sinkhorn divergence may serve as a feasible alternative to the classical Wasserstein distance, particularly when probability measures are supported on multi-dimensional metric spaces.

Proposition 1.

Sinkhorn divergence (6) is an upper bound for the Wasserstein distance (3),

S_{p,\lambda}({\bf f},{\bf g})\geq W_{p}({\bf f},{\bf g}). \qquad (7)

Proof.

Since $P_{\lambda}$ is the solution to (4)-(5), we have $P_{\lambda}\in U({\bf f},{\bf g})$. By (1) we therefore get $\langle P_{\lambda},C\rangle\geq T_{C}({\bf f},{\bf g})$, and hence the inequality (7) follows by (3) and (6). ∎

Sinkhorn’s algorithm. Adding an entropic penalty term to the optimal transport problem enforces a simple structure on the regularized optimal transport matrix $P_{\lambda}$. Introducing two dual variables $\hat{\bf f}\in{\mathbb{R}}^{n}$ and $\hat{\bf g}\in{\mathbb{R}}^{n}$ for the two marginal constraints $P\,\mathds{1}_{n}={\bf f}$ and $P^{\top}\,\mathds{1}_{n}={\bf g}$ in $U({\bf f},{\bf g})$, the Lagrangian of (4)-(5) reads

{\mathcal{L}}(P,\hat{\bf f},\hat{\bf g})=\langle P,C\rangle-\frac{1}{\lambda}H(P)-\hat{\bf f}^{\top}\,(P\,\mathds{1}_{n}-{\bf f})-\hat{\bf g}^{\top}\,(P^{\top}\,\mathds{1}_{n}-{\bf g}).

Setting $\partial_{P_{ij}}{\mathcal{L}}=C_{ij}+\frac{1}{\lambda}\log P_{ij}-\hat{\bf f}_{i}-\hat{\bf g}_{j}=0$, we obtain

P_{ij}={\bf u}_{i}\,Q_{ij}\,{\bf v}_{j},\qquad Q_{ij}:=\exp({-\lambda\,C_{ij}}),\qquad{\bf u}_{i}:=\exp({\lambda\,\hat{\bf f}_{i}}),\qquad{\bf v}_{j}:=\exp(\lambda\,\hat{\bf g}_{j}). \qquad (8)

We can also write (8) in matrix factorization form,

P_{\lambda}=U\,Q\,V,\qquad U=\text{diag}({\bf u}_{1},\dotsc,{\bf u}_{n}),\qquad Q=[Q_{ij}],\qquad V=\text{diag}({\bf v}_{1},\dotsc,{\bf v}_{n}). \qquad (9)

A direct consequence of the form (8), or equivalently (9), is that $P_{\lambda}$ is non-negative, i.e. $P_{\lambda}\in{\mathbb{R}}_{+}^{n\times n}$. Moreover, it implies that the computation of $P_{\lambda}$ amounts to computing two non-negative vectors $({\bf u},{\bf v})\in{\mathbb{R}}_{+}^{n}\times{\mathbb{R}}_{+}^{n}$, known as scaling vectors. These two vectors can be obtained from the two marginal constraints,

U\,Q\,V\,\mathds{1}_{n}={\bf f},\qquad V\,Q^{\top}\,U\,\mathds{1}_{n}={\bf g}. \qquad (10)

Noting that $U\,\mathds{1}_{n}={\bf u}$ and $V\,\mathds{1}_{n}={\bf v}$, we arrive at the following nonlinear equations for $({\bf u},{\bf v})$,

{\bf u}\odot(Q\,{\bf v})={\bf f},\qquad{\bf v}\odot(Q^{\top}{\bf u})={\bf g}, \qquad (11)

where $\odot$ denotes the Hadamard (or entrywise) product. Problem (10), or equivalently (11), is known as the non-negative matrix scaling problem (see e.g. [9, 12]) and can be solved by an iterative method known as Sinkhorn’s algorithm [14],

{\bf u}^{(i)}={\bf f}\,\oslash\,(Q\,{\bf v}^{(i-1)}),\qquad{\bf v}^{(i)}={\bf g}\,\oslash\,(Q^{\top}\,{\bf u}^{(i)}),\qquad i=1,\dotsc,K. \qquad (12)

Here, $\oslash$ denotes the Hadamard (or entrywise) division. We note that in practice we will need to use a stopping criterion. One such criterion may be defined by monitoring the difference between the true marginals $({\bf f},{\bf g})$ and the marginals of the most updated solutions. Precisely, given a small tolerance $\varepsilon_{\text{S}}>0$, we continue Sinkhorn iterations $i=1,2,\dotsc$ until we obtain

\max\{||{\bf u}^{(i)}\odot(Q\,{\bf v}^{(i)})-{\bf f}||_{\infty},\,||{\bf v}^{(i)}\odot(Q^{\top}{\bf u}^{(i)})-{\bf g}||_{\infty}\}\leq\varepsilon_{\text{S}}. \qquad (13)

After computing the scaling vectors $({\bf u},{\bf v})$, the $p$-Sinkhorn divergence can be computed as

S_{p,\lambda}=({\bf u}^{\top}\hat{Q}\,{\bf v})^{1/p},\qquad\hat{Q}:=Q\odot C. \qquad (14)

We refer to Remark 4.5 of [13] for a historic perspective of the matrix scaling problem (11) and the iterative algorithm (12).
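For concreteness, a minimal dense implementation of the iteration (12) with the stopping rule (13) and the evaluation (14) could read as follows (NumPy; the quadratic cost and all parameter values are illustrative choices, not prescriptions of the paper).

```python
import numpy as np

def sinkhorn_divergence(f, g, C, lam, p=2, eps_s=1e-9, max_iter=10_000):
    """Plain Sinkhorn iteration (12)-(14): returns S_{p,lam}(f, g)."""
    Q = np.exp(-lam * C)          # kernel Q_ij = exp(-lam * C_ij)
    Qhat = C * Q                  # \hat{Q} = Q ⊙ C
    u = np.full_like(f, 1.0 / f.size)
    v = g / (Q.T @ u)
    for _ in range(max_iter):
        u = f / (Q @ v)
        v = g / (Q.T @ u)
        # marginal violation, stopping criterion (13)
        err = max(np.abs(u * (Q @ v) - f).max(),
                  np.abs(v * (Q.T @ u) - g).max())
        if err <= eps_s:
            break
    return (u @ (Qhat @ v)) ** (1.0 / p)

# usage: two 1D probability vectors, quadratic cost C_ij = |x_i - y_j|^2
x = np.linspace(0.0, 1.0, 200)
C = (x[:, None] - x[None, :]) ** 2
f = np.exp(-((x - 0.4) / 0.1) ** 2); f /= f.sum()
g = np.exp(-((x - 0.6) / 0.1) ** 2); g /= g.sum()
print(sinkhorn_divergence(f, g, C, lam=100.0))   # an upper bound on W_2(f, g)
```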

Convergence and complexity. Suppose that our goal is to compute an $\varepsilon$-approximation $P^{\ast}$ of the optimal transport plan,

{\text{find}}\,\,P^{\ast}\in U({\bf f},{\bf g})\ \ \text{s.t.}\ \ \langle P^{\ast},C\rangle\leq\min_{P\in U({\bf f},{\bf g})}\langle P,C\rangle+\varepsilon,

where $\varepsilon>0$ is a small positive tolerance. This can be achieved within $K={\mathcal{O}}(||C||_{\infty}^{2}\log n\,\varepsilon^{-2})$ Sinkhorn iterations if we set $\lambda=4\,\varepsilon^{-1}\log n$ [4]. See also [1] for a less sharp bound on the number of iterations. The bounds presented in [4, 1] suggest that the number of iterations depends weakly on $n$. The main computational bottleneck of Sinkhorn’s algorithm is, however, the two matrix-vector multiplications with the kernels $Q$ and $Q^{\top}$ needed in each iteration, with complexity ${\mathcal{O}}(n^{2})$ if implemented naively. Overall, Sinkhorn’s algorithm can compute an $\varepsilon$-approximate solution of the unregularized OT problem in ${\mathcal{O}}(Kn^{2})={\mathcal{O}}(n^{2}\log n)$ operations. We also refer to Section 4.3 in [13] for a few strategies that may improve this complexity. In particular, the authors in [15] exploited possible separability of cost metrics (closely related to assumption A1 in Section 3.1 of the present work) to reduce the quadratic complexity to ${\mathcal{O}}(n^{1+1/d}\,\log n)$; see also Remark 4.17 in [13]. It is to be noted that although this improved cost is no longer quadratic (except when $d=1$), the resulting algorithm is still a polynomial time algorithm whose complexity includes a term $n^{\gamma}$ with $\gamma=1+1/d>1$. Our goal here is to exploit other possible structures in the kernels and further reduce the cost to a log-linear cost, thereby delivering an algorithm that runs faster than any polynomial time algorithm.

Remark 1.

(Selection of $\lambda$) On the one hand, the regularization parameter $\lambda$ needs to be large enough for the regularized OT problem to be close to the original OT problem. This is indeed reflected in the choice $\lambda=4\,\varepsilon^{-1}\log n$ that enables achieving an $\varepsilon$-approximation of the optimal transport plan: the smaller the tolerance $\varepsilon$, the larger $\lambda$. On the other hand, the convergence of Sinkhorn’s algorithm deteriorates as $\lambda\rightarrow\infty$; see e.g. [6, 10]. Moreover, as $\lambda\rightarrow\infty$, more and more entries of the kernel $Q$ (and hence entries of $Q\,{\bf v}$ and $Q^{\top}{\bf u}$) may become “essentially” zero. In this case, Sinkhorn’s algorithm may become numerically unstable due to division by zero. The selection of the regularization parameter $\lambda$ should therefore be based on a trade-off between accuracy, computational cost, and robustness.

3 A fast algorithm for computing Sinkhorn divergence

We consider an important class of cost matrices for which the complexity of each Sinkhorn iteration can be significantly reduced from ${\mathcal{O}}(n^{2})$ to ${\mathcal{O}}(n\,\log^{2}n)$. Precisely, we consider cost matrices that can be decomposed into a $d$-term sum of Kronecker product factors, where each term is asymptotically smooth; see Section 3.1. The new Sinkhorn’s algorithm after $K={\mathcal{O}}(\log n)$ iterations will then have a near-linear complexity ${\mathcal{O}}(n\,\log^{3}n)$; see Section 3.2 and Section 3.3. Importantly, such a class of cost matrices induces a wide range of optimal transport distances, including the quadratic Wasserstein metric and its corresponding Sinkhorn divergence.

3.1 Main assumptions

To make the assumptions precise, we first introduce a special Kronecker sum of matrices.

Definition 1.

The “all-ones Kronecker sum” of two matrices $A\in{\mathbb{R}}^{p\times p}$ and $B\in{\mathbb{R}}^{q\times q}$, denoted by $A\oplus B$, is defined as

A\oplus B:=A\otimes J_{q}+J_{p}\otimes B,

where $J_{p}$ is the all-ones matrix of size $p\times p$, and $\otimes$ denotes the standard Kronecker product.

It is to be noted that the all-ones Kronecker sum is different from the common Kronecker sum of two matrices, in which identity matrices $I_{p}$ and $I_{q}$ replace the all-ones matrices $J_{p}$ and $J_{q}$. This new operation facilitates working with elementwise matrix operations. We also note that the two-term sum in Definition 1 can be recursively extended to define all-ones Kronecker sums with an arbitrary number of terms, thanks to the following associative property inherited from matrix algebra,

A_{1}\oplus A_{2}\oplus A_{3}:=(A_{1}\oplus A_{2})\oplus A_{3}=A_{1}\oplus(A_{2}\oplus A_{3}).
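A small sketch of Definition 1 (NumPy; the helper name allones_kron_sum is ours):

```python
import numpy as np

def allones_kron_sum(A, B):
    """All-ones Kronecker sum A ⊕ B = A ⊗ J_q + J_p ⊗ B (Definition 1)."""
    p, q = A.shape[0], B.shape[0]
    return np.kron(A, np.ones((q, q))) + np.kron(np.ones((p, p)), B)

A = np.array([[0., 1.], [1., 0.]])
B = np.array([[0., 4.], [4., 0.]])
# entry at block index (i1, i2), (j1, j2) equals A[i1, j1] + B[i2, j2]
print(allones_kron_sum(A, B))
```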

We next consider kernel matrices generated by kernel functions with a special type of regularity, known as “asymptotic smoothness” [7].

Definition 2.

A kernel function $\kappa:{\mathbb{R}}\times{\mathbb{R}}\rightarrow{\mathbb{R}}$ on the real line is said to be “asymptotically smooth” if there exist constants $c_{0},\alpha,\beta$ such that for some $s\in{\mathbb{R}}$ the following holds,

|\partial_{x}^{m}\kappa(x,y)|\leq c_{0}\,m!\,\alpha^{m}\,m^{\beta}\,|x-y|^{-m-s},\quad(x,y)\in{\mathbb{R}}^{2},\ \ \ x\neq y,\ \ \ m\in{\mathbb{N}}. \qquad (15)

A kernel matrix $A=[\kappa(x_{i},y_{j})]\in{\mathbb{R}}^{q\times q}$, generated by an asymptotically smooth kernel function $\kappa$, is called an asymptotically smooth kernel matrix.

We now make the following two assumptions.

  • A1.

    The cost matrix $C\in{\mathcal{M}}^{n\times n}$ can be decomposed into a $d$-term all-ones Kronecker sum

    C=C_{d}\oplus\dotsi\oplus C_{1},\qquad C_{k}\in{\mathcal{M}}^{n_{k}\times n_{k}},\ \ \ k=1,\dotsc,d,\ \ \ n=\prod_{k=1}^{d}n_{k}, \qquad (16)

    where each matrix $C_{k}\in{\mathcal{M}}^{n_{k}\times n_{k}}$ is generated by a metric $c_{k}:{\mathbb{R}}\times{\mathbb{R}}\rightarrow{\mathbb{R}}_{+}$ on the real line.

  • A2.

    The kernel matrices $Q^{(k)}=\exp[-\lambda C^{(k)}]$ and $\hat{Q}^{(k)}=C^{(k)}\odot Q^{(k)}$, with $k=1,\dotsc,d$, generated by the kernels $\kappa_{k}(x,y)=\exp(-\lambda\,c_{k}(x,y))$ and $\hat{\kappa}_{k}(x,y)=c_{k}(x,y)\,\kappa_{k}(x,y)$, where $(x,y)\in{\mathbb{R}}^{2}$, are asymptotically smooth.

It is to be noted that by assumption A2 all $2\,d$ kernels $\kappa_{k}$ and $\hat{\kappa}_{k}$, with $k=1,\dotsc,d$, satisfy estimates of the form (15) with possibly different sets of constants $(c_{0},\alpha,\beta,s)$.

$L^{p}$ cost functions. An important class of cost functions for which assumptions A1-A2 hold includes the $L^{p}$ cost functions, given by

c({\bf x},{\bf y})=||{\bf x}-{\bf y}||_{p}^{p}=\sum_{k=1}^{d}|x^{(k)}-y^{(k)}|^{p},\quad{\bf x}=(x^{(1)},\dotsc,x^{(d)})\in{\mathbb{R}}^{d},\ \ \ {\bf y}=(y^{(1)},\dotsc,y^{(d)})\in{\mathbb{R}}^{d}. \qquad (17)

Here, $||\cdot||_{p}$ denotes the $L^{p}$-norm in ${\mathbb{R}}^{d}$, with $p\in[1,\infty)$. Importantly, such cost functions take the form of (2) with $d({\bf x},{\bf y})=||{\bf x}-{\bf y}||_{p}$, for which the $p$-Wasserstein metric and $p$-Sinkhorn divergence are well defined. In particular, the case $p=2$ corresponds to the quadratic Wasserstein metric.

We first show that assumption A1 holds for $L^{p}$ cost functions. Let the two sets of discrete points $\{{\bf x}_{i}\}_{i=1}^{n}\subset{\mathcal{X}}$ and $\{{\bf y}_{j}\}_{j=1}^{n}\subset{\mathcal{Y}}$, at which the two probability vectors $({\bf f},{\bf g})$ are defined, be given on two regular grids in ${\mathbb{R}}^{d}$. We note that if the original probability vectors are given on irregular grids, we may find their values on regular grids by interpolation with a cost ${\mathcal{O}}(n)$ that would not change the near-linearity of our overall target cost. Set $I_{n}:=\{1,\dotsc,n\}$. Then, to each pair of indices $i\in I_{n}$ and $j\in I_{n}$ of the cost matrix $C=[c({\bf x}_{i},{\bf y}_{j})]$, we can assign a pair of $d$-tuples $(i_{1},\dotsc,i_{d})$ and $(j_{1},\dotsc,j_{d})$ in the Cartesian product of $d$ finite sets $I_{n_{1}}\times\dotsc\times I_{n_{d}}$, where $n=\prod_{k=1}^{d}n_{k}$. Consequently, the additive representation of $c$ in (17) implies (16).
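As a concrete check of (16)-(17) for the quadratic cost on a two-dimensional tensor grid, the following sketch (NumPy; grid sizes and names are ours) verifies that the full cost matrix coincides with the all-ones Kronecker sum of the two one-dimensional cost matrices, for one particular enumeration of the grid points.

```python
import numpy as np

# 1D grids and 1D quadratic cost matrices C^{(k)}_{ij} = |x_i - x_j|^2
x1 = np.linspace(0.0, 1.0, 4)            # n1 = 4 points in the first coordinate
x2 = np.linspace(0.0, 1.0, 3)            # n2 = 3 points in the second coordinate
C1 = (x1[:, None] - x1[None, :]) ** 2
C2 = (x2[:, None] - x2[None, :]) ** 2

# full quadratic cost on the tensor grid; first coordinate varies slowest
X = np.array([(a, b) for a in x1 for b in x2])        # n = n1 * n2 points in R^2
C = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# all-ones Kronecker sum of the 1D factors (Definition 1), in the matching order
J1, J2 = np.ones_like(C1), np.ones_like(C2)
C_sum = np.kron(C1, J2) + np.kron(J1, C2)
print(np.allclose(C, C_sum))                          # True
```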

It is also easy to see that assumption A2 holds for $L^{p}$ cost functions, with corresponding kernels,

\kappa(x,y)=\exp(-\lambda\,|x-y|^{p}),\qquad\hat{\kappa}(x,y)=|x-y|^{p}\,\exp(-\lambda\,|x-y|^{p}),\qquad(x,y)\in{\mathbb{R}}^{2}.

Indeed, these kernels are asymptotically smooth due to the exponential decay of $\exp(-\lambda\,|x-y|^{p})$ as $|x-y|$ increases. For instance, in the particular case when $p=2$, the estimate (15) holds for both $\kappa$ and $\hat{\kappa}$ with $c_{0}=1$, $\alpha=2$, $\beta=0$, and $s=0$. To show this, we note that

\partial_{x}^{m}\exp(-\lambda\,(x-y)^{2})=(-\sqrt{\lambda})^{m}H_{m}(\sqrt{\lambda}(x-y))\,\exp(-\lambda\,(x-y)^{2}),

where $H_{m}(z)$ is the Hermite polynomial of degree $m$, satisfying the following inequality [8],

H_{m}(z)\leq(2^{m}\,m!)^{1/2}\exp(z^{2}/2),\qquad z\in{\mathbb{R}}.

Setting $z:=\sqrt{\lambda}(x-y)\neq 0$, and noting that $(2^{m}\,m!)^{1/2}\leq 2^{m}\,m!$ for $m\in{\mathbb{N}}$, we obtain

|\partial_{x}^{m}\exp(-z^{2})|=(\sqrt{\lambda})^{m}|H_{m}(z)|\,\exp(-z^{2})\leq(2\sqrt{\lambda})^{m}\,m!\,\exp(-z^{2}/2)\leq(2\sqrt{\lambda})^{m}\,m!\,|z|^{-m}.

Hence we arrive at

|\partial_{x}^{m}\exp(-\lambda\,(x-y)^{2})|\leq 2^{m}\,m!\,|x-y|^{-m}.

Similarly, we can show that the same estimate holds for $|\partial_{x}^{m}(x-y)^{2}\exp(-\lambda\,(x-y)^{2})|$.

3.2 Hierarchical low-rank approximation of Sinkhorn iterations

As noted in Section 2.2, the main computational bottleneck of Sinkhorn’s algorithm is the matrix-vector multiplications by the kernels $Q$ and $Q^{\top}$ in each iteration (12) and by the kernel $\hat{Q}$ in (14). We present a strategy that reduces the complexity of each Sinkhorn iteration to ${\mathcal{O}}(n\,\log^{2}n)$, enabling Sinkhorn’s algorithm to achieve a near-linear overall complexity ${\mathcal{O}}(n\,\log^{3}n)$. The proposed strategy takes two steps, following two observations made based on assumptions A1 and A2.

3.2.1 Step 1: decomposition into 1D problems

A direct consequence of assumption A1 is that both kernel matrices $Q$ and $\hat{Q}$ have separable multiplicative structures.

Proposition 2.

Under assumption A1 on the cost matrix $C$, the matrices $Q=\exp[-\lambda C]$ and $\hat{Q}=Q\odot C$ will have Kronecker product structures,

Q=\exp[-\lambda C]=Q^{(d)}\otimes\dotsi\otimes Q^{(1)},\qquad Q^{(k)}=\exp[-\lambda C^{(k)}],\ \ \ k=1,\dotsc,d, \qquad (18)

\hat{Q}=Q\odot C=\sum_{k=1}^{d}A_{k}^{(d)}\otimes\dotsi\otimes A_{k}^{(1)},\qquad A_{k}^{(m)}=\left\{\begin{array}[]{ll}\hat{Q}^{(k)}:=C^{(m)}\odot Q^{(m)},&k=m\\ Q^{(m)},&k\neq m\end{array}\right.. \qquad (19)
Proof.

The first expression (18) follows from Lemma 1 in the appendix and (16). The second expression (19) follows from Lemma 2 in the appendix and (16) and (18). ∎

Exploiting the block structure of the Kronecker product matrices $Q$ and $\hat{Q}$, one can compute the matrix-vector products $Q{\bf w}$, $Q^{\top}{\bf w}$, and $\hat{Q}{\bf w}$ without explicitly forming $Q$ and $\hat{Q}$. This can be achieved using the vector operator $\text{vec}(.)$ that transforms a matrix into a vector by stacking its columns beneath one another. We denote by ${\bf a}=\text{vec}(A)$ the vectorization of a matrix $A\in{\mathbb{R}}^{p\times q}$ formed by stacking its columns into a single column vector ${\bf a}\in{\mathbb{R}}^{pq\times 1}$.

Proposition 3.

Let $A=A^{(d)}\otimes\dotsi\otimes A^{(1)}\in{\mathbb{R}}^{n\times n}$, where $A^{(k)}\in{\mathbb{R}}^{n_{k}\times n_{k}}$ and $n=\prod_{k=1}^{d}n_{k}$. Then the matrix-vector multiplication $A{\bf w}$, where ${\bf w}\in{\mathbb{R}}^{n\times 1}$, amounts to $\tilde{n}_{k}=\prod_{i=1,\,i\neq k}^{d}n_{i}$ matrix-vector multiplications $A^{(k)}{\bf z}$ for $\tilde{n}_{k}$ different vectors ${\bf z}\in{\mathbb{R}}^{n_{k}\times 1}$, with $k=1,\dotsc,d$.

Proof.

The case $d=1$ is trivial, noting that $\tilde{n}_{1}=1$. Consider the case $d=2$. Then we have $A=A^{(2)}\otimes A^{(1)}\in{\mathbb{R}}^{n\times n}$, with $A^{(1)}\in{\mathbb{R}}^{n_{1}\times n_{1}}$, $A^{(2)}\in{\mathbb{R}}^{n_{2}\times n_{2}}$, and $n=n_{1}n_{2}$. Let $W\in{\mathbb{R}}^{n_{1}\times n_{2}}$ be the matrix whose vectorization is the vector ${\bf w}\in{\mathbb{R}}^{n_{1}n_{2}\times 1}$, i.e. ${\bf w}=\text{vec}(W)$. Then, we have

A{\bf w}=(A^{(2)}\otimes A^{(1)})\,{\text{vec}}(W)=\text{vec}(A^{(1)}WA^{(2)\top}).

Hence, computing $A{\bf w}$ amounts to computing $A^{(1)}WA^{(2)\top}$. This can be done in two steps. We first compute $G:=WA^{(2)\top}$,

G:=WA^{(2)\top}=\left(\begin{array}[]{c}{\bf w}_{1}^{\top}\\ \vdots\\ {\bf w}_{n_{1}}^{\top}\end{array}\right)A^{(2)\top}=\left(\begin{array}[]{c}{\bf w}_{1}^{\top}A^{(2)\top}\\ \vdots\\ {\bf w}_{n_{1}}^{\top}A^{(2)\top}\end{array}\right)=\left(\begin{array}[]{c}(A^{(2)}{\bf w}_{1})^{\top}\\ \vdots\\ (A^{(2)}{\bf w}_{n_{1}})^{\top}\end{array}\right),

where ${\bf w}_{k}\in{\mathbb{R}}^{n_{2}\times 1}$ is the $k$-th row of matrix $W$ in the form of a column vector. For this we need to perform $n_{1}$ matrix-vector multiplies $A^{(2)}{\bf w}_{k}$, with $k=1,\dotsc,n_{1}$. We then compute $A^{(1)}G$,

A^{(1)}G=A^{(1)}[{\bf g}_{1}\dotsi{\bf g}_{n_{2}}]=[A^{(1)}{\bf g}_{1}\dotsi A^{(1)}{\bf g}_{n_{2}}],

where ${\bf g}_{k}\in{\mathbb{R}}^{n_{1}\times 1}$ is the $k$-th column of matrix $G$. We therefore need to perform $n_{2}$ matrix-vector multiplies $A^{(1)}{\bf g}_{k}$, with $k=1,\dotsc,n_{2}$. Overall, we need $n_{1}$ matrix-vector multiplies $A^{(2)}{\bf z}$ for $n_{1}$ different vectors ${\bf z}$, and $n_{2}$ matrix-vector multiplies $A^{(1)}{\bf z}$ for $n_{2}$ different vectors ${\bf z}$, as desired. This can be recursively extended to any dimension $d\geq 3$. Setting $\tilde{A}:=A^{(d-1)}\otimes\dotsi\otimes A^{(1)}$, we can write

A{\bf w}=(A^{(d)}\otimes A^{(d-1)}\otimes\dotsi\otimes A^{(1)}){\bf w}=(A^{(d)}\otimes\tilde{A})\,{\text{vec}}(W)=\text{vec}(\tilde{A}WA^{(d)\top}). \qquad (20)

This can again be done in two steps. We first compute $G:=WA^{(d)\top}$, where $W\in{\mathbb{R}}^{\tilde{n}\times n_{d}}$ with $\tilde{n}=\prod_{k=1}^{d-1}n_{k}$,

G:=WA^{(d)\top}=\left(\begin{array}[]{c}{\bf w}_{1}^{\top}\\ \vdots\\ {\bf w}_{\tilde{n}}^{\top}\end{array}\right)A^{(d)\top}=\left(\begin{array}[]{c}{\bf w}_{1}^{\top}A^{(d)\top}\\ \vdots\\ {\bf w}_{\tilde{n}}^{\top}A^{(d)\top}\end{array}\right)=\left(\begin{array}[]{c}(A^{(d)}{\bf w}_{1})^{\top}\\ \vdots\\ (A^{(d)}{\bf w}_{\tilde{n}})^{\top}\end{array}\right),

where ${\bf w}_{k}\in{\mathbb{R}}^{n_{d}\times 1}$ is the $k$-th row of matrix $W$ in the form of a column vector. To compute $G$ we need to perform $\tilde{n}$ matrix-vector multiplies $A^{(d)}{\bf w}_{k}$, with $k=1,\dotsc,\tilde{n}$. We then compute $\tilde{A}G$,

\tilde{A}G=\tilde{A}[{\bf g}_{1}\dotsi{\bf g}_{n_{d}}]=[\tilde{A}{\bf g}_{1}\dotsi\tilde{A}{\bf g}_{n_{d}}],

where ${\bf g}_{k}\in{\mathbb{R}}^{\tilde{n}\times 1}$ is the $k$-th column of matrix $G$. We therefore need to perform $n_{d}$ matrix-vector multiplies $\tilde{A}{\bf g}_{k}$, with $k=1,\dotsc,n_{d}$. Noting that each multiply $\tilde{A}{\bf g}_{k}$ is of the same form as the multiply (20) but in $d-1$ dimensions, the proposition follows easily by induction. ∎
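In code, the reshaping argument of the proof can be written compactly; a sketch for general $d$ (NumPy; the helper name kron_matvec and the test sizes are ours) is given below.

```python
import numpy as np

def kron_matvec(factors, w):
    """Compute (A^(d) ⊗ ... ⊗ A^(1)) w without forming the Kronecker product.

    factors = [A1, ..., Ad] with A_k of size n_k x n_k, and w of length
    n = n_1 * ... * n_d, indexed so that the A^(d) index varies slowest."""
    shape = [A.shape[0] for A in factors]          # (n_1, ..., n_d)
    W = w.reshape(shape[::-1])                     # axes ordered (n_d, ..., n_1)
    for k, A in enumerate(factors):
        axis = len(shape) - 1 - k                  # axis holding the n_k index
        W = np.moveaxis(np.tensordot(A, W, axes=([1], [axis])), 0, axis)
    return W.reshape(-1)

# check against the explicit Kronecker product for d = 3
rng = np.random.default_rng(0)
As = [rng.standard_normal((n, n)) for n in (3, 4, 5)]
w = rng.standard_normal(3 * 4 * 5)
A_full = np.kron(np.kron(As[2], As[1]), As[0])     # A^(3) ⊗ A^(2) ⊗ A^(1)
print(np.allclose(A_full @ w, kron_matvec(As, w)))  # True
```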

Reduction in complexity due to step 1. Proposition 2 and Proposition 3 imply

\text{cost}(Q{\bf w})=\sum_{k=1}^{d}\tilde{n}_{k}\,\text{cost}(Q^{(k)}{\bf z}),\qquad\text{cost}(\hat{Q}{\bf w})=d\,\text{cost}(Q{\bf w}). \qquad (21)

Indeed, the original $d$-dimensional problems $Q{\bf w}$ and $\hat{Q}{\bf w}$ turn into several smaller one-dimensional problems of either $Q^{(k)}{\bf z}$ or $\hat{Q}^{(k)}{\bf z}$ form. This already amounts to a significant gain in efficiency. For instance, if we compute each of these smaller problems by performing ${\mathcal{O}}(n_{k}^{2})$ floating-point arithmetic operations, then the complexity of $Q{\bf w}$ becomes

\text{cost}(Q{\bf w})=\sum_{k=1}^{d}\tilde{n}_{k}\,{\mathcal{O}}(n_{k}^{2})={\mathcal{O}}(n\sum_{k=1}^{d}n_{k}).

And in the particular case when $n_{1}=\dotsi=n_{d}=n^{1/d}$, we will have

\text{cost}(Q{\bf w})=d\,\mathcal{O}(n^{1+1/d}).

It is to be noted that this improvement in efficiency (for $d\geq 2$) was also achieved in [15]. Although this improved cost is no longer quadratic (except in 1D problems when $d=1$), the resulting algorithm is still a polynomial time algorithm whose complexity includes a term $n^{\gamma}$ with $\gamma>1$. For instance, for 2D and 3D problems (considered in [15]), the overall complexity of Sinkhorn’s algorithm after $K=\mathcal{O}(\log n)$ iterations would be $\mathcal{O}(n^{3/2}\,\log n)$ and $\mathcal{O}(n^{4/3}\,\log n)$, respectively. In what follows, we further apply the technique of hierarchical matrices to the smaller one-dimensional problems, thereby delivering an algorithm that achieves log-linear complexity and hence runs faster than any polynomial time algorithm.

3.2.2 Step 2: approximation of 1D problems by hierarchical matrices

We next present a fast method for computing the smaller 1D problems $Q^{(k)}{\bf z}$ and $\hat{Q}^{(k)}{\bf z}$ obtained in step 1. We will show that, thanks to assumption A2, the hierarchical matrix technique [7] can be employed to compute each one-dimensional problem with a near-linear complexity ${\mathcal{O}}(n_{k}\,\log^{2}n_{k})$.

Since by assumption A2 the kernels $\kappa_{k}$ and $\hat{\kappa}_{k}$ that generate $Q^{(k)}$ and $\hat{Q}^{(k)}$, with $k=1,\dotsc,d$, are all asymptotically smooth and satisfy similar estimates of the form (15), the numerical treatment and complexity of all $2\,d$ one-dimensional problems will be the same. Hence, in what follows, we drop the dependence on $k$ and consider a general one-dimensional problem $A{\bf z}$, where $A\in{\mathbb{R}}^{n_{k}\times n_{k}}$ is a kernel matrix generated by a one-dimensional kernel $\kappa:{\mathbb{R}}\times{\mathbb{R}}\rightarrow{\mathbb{R}}$ satisfying (15). Precisely, let $I_{n_{k}}:=\{1,\dotsc,n_{k}\}$ be an index set with cardinality $\#I_{n_{k}}=n_{k}$. Consider the kernel matrix

A:=[\kappa(x_{i},y_{j})]\in{\mathbb{R}}^{n_{k}\times n_{k}},\qquad i,j\in I_{n_{k}},

uniquely determined by two sets of discrete points $\{x_{i}\}_{i\in I_{n_{k}}}\subset{\mathbb{R}}$ and $\{y_{j}\}_{j\in I_{n_{k}}}\subset{\mathbb{R}}$ and an asymptotically smooth kernel $\kappa(x,y)$, with $(x,y)\in{\mathbb{R}}^{2}$, satisfying (15). Let $A\arrowvert_{\tau\times\sigma}$ denote a matrix block of the square kernel matrix $A$, characterized by two index sets $\tau\subset I_{n_{k}}$ and $\sigma\subset I_{n_{k}}$. We define the diameter of $\tau$ and the distance between $\tau$ and $\sigma$ by

\text{diam}(\tau)=|\max_{i\in{\tau}}x_{i}-\min_{i\in{\tau}}x_{i}|,\qquad\text{dist}(\tau,\sigma)=\min_{i\in\tau,\,j\in\sigma}|x_{i}-y_{j}|.

The two sets of discrete points and the position of the matrix block are illustrated in Figure 2. Note that when the cardinalities of the index sets $\tau$ and $\sigma$ are equal ($\#\tau=\#\sigma$), the matrix block $A\arrowvert_{\tau\times\sigma}$ is a square matrix.

Figure 2: Left: the sets of discrete points $\{x_{i}\}_{i\in I_{n_{k}}}$ (purple circles) and $\{y_{j}\}_{j\in I_{n_{k}}}$ (blue circles) and two index sets $\tau\subset I_{n_{k}}$ and $\sigma\subset I_{n_{k}}$. Right: position of the matrix block $A\arrowvert_{\tau\times\sigma}$ in the kernel matrix $A=[\kappa(x_{i},y_{j})]$ corresponding to the two index sets $\tau$ and $\sigma$.

The hierarchical matrix technique for computing $A{\bf z}$ consists of three parts (see details below):

  • I.

    construction of a hierarchical block structure that partitions $A$ into low-rank (admissible) and full-rank (inadmissible) blocks;

  • II.

    low-rank approximation of admissible blocks by separable expansions;

  • III.

    assembling the output $A{\bf z}$ through a loop over all admissible and inadmissible blocks.

I. Hierarchical block structure. We employ a quadtree structure, i.e. a tree structure in which each node/vertex (square block) has either four or no children (square sub-blocks). Starting with the original matrix $A$ as the tree’s root node with $\tau=\sigma=I_{n_{k}}$, we recursively split each square block $A\arrowvert_{\tau\times\sigma}$ into four square sub-blocks provided $\tau$ and $\sigma$ do not satisfy the following “admissibility” condition,

\eta:=\frac{\text{diam}(\tau)}{\text{dist}(\tau,\sigma)}\leq\eta_{0}. \qquad (22)

Here, $\eta_{0}>0$ is a fixed number to be determined later; see Section 3.2.3. If the above admissibility condition holds, we do not split the block $A\arrowvert_{\tau\times\sigma}$ and label it as an admissible leaf. We continue this process until the size of inadmissible blocks reaches a minimum size $n_{\text{min}}\times n_{\text{min}}$. This imposes a condition on the cardinality of $\tau$:

\#\tau\geq n_{\text{min}}.

Indeed, we do not want the leaves of the tree to be smaller than a minimum size, below which the low-rank approximation of a block would no longer be favorable, in terms of complexity, compared to a full-rank computation. It is suggested to take $n_{\text{min}}=32$ in order to avoid the overhead caused by recursions; see [7]. Eventually, we collect all leaves of the tree and split them into two partitions:

  • $P_{\text{admissible}}$: a partition of admissible blocks

  • $P_{\text{inadmissible}}$: a partition of inadmissible blocks

Note that by this construction, all blocks in $P_{\text{inadmissible}}$ are of size $n_{\text{min}}\times n_{\text{min}}$ and are hence small. Such blocks will be treated as full-rank matrices, while the admissible blocks will be treated as low-rank matrices. Figure 3 shows a schematic representation of the hierarchical block structure obtained by a quadtree of depth five, that is, five recursive levels of splitting are performed to obtain the block structure. Gray blocks are admissible, while red blocks are inadmissible.

Figure 3: A schematic representation of the hierarchical block structure based on a quadtree of depth five. Matrix blocks are partitioned into two sets: admissible blocks $P_{\text{admissible}}$ (gray) and inadmissible blocks $P_{\text{inadmissible}}$ (red).

II. Low-rank approximation by separable expansions. Locally on each admissible block $A\arrowvert_{\tau\times\sigma}\in{\mathbb{R}}^{b\times b}$, with $b=\#\tau=\#\sigma$, we approximate the kernel $\kappa$ by a separable expansion, resulting in a low-rank approximation of the block. Precisely, we approximate $\kappa(x,y)$, where $x\in[\min_{i\in{\tau}}x_{i},\max_{i\in{\tau}}x_{i}]$ and $y\in[\min_{j\in{\sigma}}y_{j},\max_{j\in{\sigma}}y_{j}]$, using barycentric Lagrange interpolation in $x$ on a set of $r\in{\mathbb{N}}$ Chebyshev points $\{\hat{x}_{k}\}_{k=1}^{r}\subset[\min_{i\in{\tau}}x_{i},\max_{i\in{\tau}}x_{i}]$, where $r\leq b$ is a number to be determined; see Section 3.2.3. We write

\kappa(x,y)\approx\kappa^{(r)}(x,y):=\ell(x)\sum_{k=1}^{r}\frac{w_{k}}{x-\hat{x}_{k}}\,\kappa(\hat{x}_{k},y),\qquad\ell(x):=\prod_{k=1}^{r}(x-\hat{x}_{k}),\qquad w_{k}=1/\ell^{\prime}(\hat{x}_{k}).

Due to the separability of the kernel estimator $\kappa^{(r)}$ in the $(x,y)$ variables, the matrix block $A\arrowvert_{\tau\times\sigma}$ can be approximated by a rank-$r$ matrix:

A\arrowvert_{\tau\times\sigma}\approx A^{(r)}\arrowvert_{\tau\times\sigma}:=LR^{\top},\qquad L,\,R\in{\mathbb{R}}^{b\times r},

where

L=[L_{ik}]=[\ell(x_{i})\frac{w_{k}}{x_{i}-\hat{x}_{k}}],\qquad R=[R_{jk}]=[\kappa(\hat{x}_{k},y_{j})],\quad i\in\tau,\quad j\in\sigma,\quad k\in I_{r}.

The number of Chebyshev points $r$ is hence referred to as the local rank.

Pre-computing the barycentric weights $\{w_{1},\dotsc,w_{r}\}$, we note that the cost of computing each row of $L$ and $R$ (for a fixed $i\in\tau$ and a fixed $j\in\sigma$ and for all $k=1,\dotsc,r$) is ${\mathcal{O}}(r)$. It therefore takes ${\mathcal{O}}(rb)$ floating-point arithmetic operations to form $L$ and $R$. Hence, the total cost of a matrix-vector multiplication $A^{(r)}\arrowvert_{\tau\times\sigma}{\bf w}=LR^{\top}{\bf w}$, where ${\bf w}\in{\mathbb{R}}^{b\times 1}$, is ${\mathcal{O}}(rb)$, where we first compute $\hat{\bf w}:=R^{\top}{\bf w}$ and then $L\hat{\bf w}$.

It is to be noted that only admissible blocks are approximated by low-rank matrices. All inadmissible blocks are treated as full-rank $n_{\text{min}}\times n_{\text{min}}$ matrices whose components are directly given by the kernel $\kappa$.

III. Loop over all admissible and inadmissible blocks. By the above procedure, we have approximated $A$ by a hierarchical matrix $A_{\text{H}}$ whose blocks are either of rank at most $r$ or of small size $n_{\text{min}}\times n_{\text{min}}$. We then assemble all low-rank and full-rank block matrix-vector multiplications and approximate ${\bf s}=A{\bf z}$ by ${\bf s}_{\text{H}}=A_{\text{H}}{\bf z}$, as follows. We first set ${\bf s}_{\text{H}}={\bf 0}$, and loop over all blocks $A\arrowvert_{\tau\times\sigma}$. Let ${\bf s}_{\text{H}}\arrowvert_{\tau}$ and ${\bf z}\arrowvert_{\sigma}$ denote the parts of the vectors ${\bf s}_{\text{H}}$ and ${\bf z}$ corresponding to the index sets $\tau$ and $\sigma$, respectively. We then proceed as follows (a code sketch of the full procedure is given after the list below):

  • if $A\arrowvert_{\tau\times\sigma}\in P_{\text{admissible}}$, then $A_{\text{H}}\arrowvert_{\tau\times\sigma}=LR^{\top}$ and ${\bf s}_{\text{H}}\arrowvert_{\tau}={\bf s}_{\text{H}}\arrowvert_{\tau}+LR^{\top}{\bf z}\arrowvert_{\sigma}$.

  • if $A\arrowvert_{\tau\times\sigma}\in P_{\text{inadmissible}}$, then ${\bf s}_{\text{H}}\arrowvert_{\tau}={\bf s}_{\text{H}}\arrowvert_{\tau}+A\arrowvert_{\tau\times\sigma}{\bf z}\arrowvert_{\sigma}$.
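Putting parts I-III together, a compact recursive sketch of the one-dimensional hierarchical matrix-vector product is given below (NumPy; the fixed local rank, the admissibility parameter, and all helper names are ours, the points are assumed sorted, and a production code would build the block tree and the low-rank factors once rather than at every call).

```python
import numpy as np

def cheb_points(a, b, r):
    """r Chebyshev points on the interval [a, b]."""
    k = np.arange(r)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * k + 1) * np.pi / (2 * r))

def hmatvec(kernel, x, y, z, r=8, eta0=0.5, n_min=32):
    """Approximate s = A z with A = [kernel(x_i, y_j)], using the quadtree
    block partition with admissibility diam(tau) <= eta0 * dist(tau, sigma)."""
    s = np.zeros_like(z, dtype=float)

    def recurse(tau, sigma):
        xt, ys = x[tau], y[sigma]
        diam = xt.max() - xt.min()
        # distance between bounding intervals: a lower bound on the pointwise
        # distance, which only makes the admissibility test more conservative
        dist = max(0.0, max(xt.min() - ys.max(), ys.min() - xt.max()))
        if dist > 0 and diam <= eta0 * dist:
            # admissible leaf: rank-r factors L, R via barycentric interpolation in x
            xhat = cheb_points(xt.min(), xt.max(), r)
            diff = xt[:, None] - xhat[None, :]          # assumes x_i != xhat_k
            w = 1.0 / np.array([np.prod(xhat[k] - np.delete(xhat, k)) for k in range(r)])
            L = np.prod(diff, axis=1)[:, None] * w[None, :] / diff
            R = kernel(xhat[None, :], ys[:, None])      # R_jk = kernel(xhat_k, y_j)
            s[tau] += L @ (R.T @ z[sigma])
        elif len(tau) <= n_min or len(sigma) <= n_min:
            # inadmissible leaf: dense block
            s[tau] += kernel(xt[:, None], ys[None, :]) @ z[sigma]
        else:
            # split the block into four children
            mt, ms = len(tau) // 2, len(sigma) // 2
            for t in (tau[:mt], tau[mt:]):
                for sg in (sigma[:ms], sigma[ms:]):
                    recurse(t, sg)

    idx = np.arange(len(x))
    recurse(idx, idx)
    return s

# usage: Gaussian kernel Q^{(k)} on a 1D grid
lam = 100.0
kernel = lambda xx, yy: np.exp(-lam * (xx - yy) ** 2)
x = np.linspace(0.0, 1.0, 1024)
z = np.random.default_rng(1).standard_normal(x.size)
s_h = hmatvec(kernel, x, x, z)
s_dense = kernel(x[:, None], x[None, :]) @ z
print(np.abs(s_h - s_dense).max())    # agrees up to the low-rank approximation error
```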

3.2.3 Accuracy and adaptive selection of local ranks

We will now discuss the error in the approximation of the kernel matrix $A=[\kappa(x_{i},y_{j})]\in{\mathbb{R}}^{n_{k}\times n_{k}}$ by the hierarchical matrix $A_{\text{H}}$. We will consider the error in the Frobenius norm $||A-A_{\text{H}}||_{\text{F}}$, where

||A||_{\text{F}}^{2}:=\sum_{i,j=1}^{n_{k}}|A_{ij}|^{2}.

It is to be noted that other matrix norms, such as the row-sum norm $||.||_{\infty}$ associated with the maximum vector norm and the spectral norm $||.||_{2}$ associated with the Euclidean vector norm, can similarly be considered.

We first write the global error as a sum of all local errors [7],

||A-A_{\text{H}}||_{\text{F}}^{2}=\sum_{\tau\times\sigma\in P_{\text{admissible}}}||A\arrowvert_{\tau\times\sigma}-A_{\text{H}}\arrowvert_{\tau\times\sigma}||_{F}^{2}, \qquad (23)

where for each block $\tau\times\sigma\in P_{\text{admissible}}$, the local error in approximating $A\arrowvert_{\tau\times\sigma}\in{\mathbb{R}}^{b\times b}$, with $b=\#\tau=\#\sigma$, reads

||A\arrowvert_{\tau\times\sigma}-A_{\text{H}}\arrowvert_{\tau\times\sigma}||_{F}^{2}=\sum_{i,j=1}^{b}|\kappa(x_{i},y_{j})-\kappa^{(r)}(x_{i},y_{j})|^{2}. \qquad (24)

Each term in the right-hand side of (24) is given by the interpolation error (see e.g. [7]),

|\kappa(x,y)-\kappa^{(r)}(x,y)|=\frac{|\omega(x)|}{r!}\,|\partial_{x}^{r}\kappa(\xi,y)|,\qquad\xi\in[\min_{i\in{\tau}}x_{i},\max_{i\in{\tau}}x_{i}],

where

|\omega(x)|=\prod_{k=1}^{r}|x-\hat{x}_{k}|\leq\Bigl{(}\frac{\text{diam}(\tau)}{4}\Bigr{)}^{r},

and the $x$-derivatives of $\kappa$ satisfy the estimate (15), thanks to the asymptotic smoothness of $\kappa$. We further note that for $x\in[\min_{i\in{\tau}}x_{i},\max_{i\in{\tau}}x_{i}]$ and $y\in[\min_{j\in{\sigma}}y_{j},\max_{j\in{\sigma}}y_{j}]$, we have

|x-y|\geq\text{dist}(\tau,\sigma).

Hence, assuming $r+s\geq 0$, we get

|\kappa-\kappa^{(r)}|\leq c_{0}\,r^{\beta}\,\text{dist}(\tau,\sigma)^{-s}\,\left(\frac{\alpha}{4}\,\eta\right)^{r}=:c\,\left(\frac{\alpha}{4}\,\eta\right)^{r},

which implies local spectral convergence provided $\eta<4/\alpha$. This can be achieved if we take $\eta_{0}<4/\alpha$ in the admissibility condition (22). For instance, in the numerical examples in Section 4 we take $\eta_{0}=2/\alpha$.

Importantly, as we will show here, the above global and local estimates provide a simple strategy for a priori selection of local ranks, i.e. the number of Chebyshev points for each block, to achieve a desired small tolerance $\varepsilon_{\tiny{\text{TOL}}}>0$ for the global error,

||A-A_{\text{H}}||_{\text{F}}\leq\varepsilon_{\tiny{\text{TOL}}}. \qquad (25)

We first observe that (25) holds if

|\kappa-\kappa^{(r)}|\leq\varepsilon_{\tiny{\text{TOL}}}/n_{k}.

This simply follows by (23)-(24) and noting that $\sum_{\tau\times\sigma\in P_{\text{admissible}}}b^{2}<n_{k}^{2}$. Hence, we can achieve (25) if we set

c\,\left(\frac{\alpha}{4}\,\eta\right)^{r}\leq\varepsilon_{\tiny{\text{TOL}}}/n_{k}.

We hence obtain an optimal local rank

r=\Biggl{\lceil}\frac{\log(\varepsilon_{\tiny{\text{TOL}}}/(c\,n_{k}))}{\log(\alpha\,\eta/4)}\Biggr{\rceil}. \qquad (26)

We note that each block may have a different $\eta$ and hence a different local rank $r$. We also note that the optimal local rank depends logarithmically on $n_{k}$.
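A one-line helper implementing (26) (names and default constants are ours; $c$ and $\alpha$ are the constants of the estimate (15) for the kernel at hand) might read:

```python
import numpy as np

def local_rank(eps_tol, n_k, eta, c=1.0, alpha=2.0):
    """Optimal local rank r from (26) for a block with admissibility ratio eta;
    requires alpha * eta / 4 < 1, e.g. eta0 = 2 / alpha."""
    return int(np.ceil(np.log(eps_tol / (c * n_k)) / np.log(alpha * eta / 4.0)))

print(local_rank(1e-8, n_k=4096, eta=1.0))   # rank needed for a block with eta = 1
```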

3.2.4 Computational cost

We now discuss the overall complexity of the proposed strategy (steps 1 and 2) applied to each iteration of Sinkhorn’s algorithm.

As discussed earlier in Section 3.2.1, the complexity of the original $d$-dimensional problems $Q{\bf w}$ and $\hat{Q}{\bf w}$ reduces to that of several smaller one-dimensional problems of either $Q^{(k)}{\bf z}$ or $\hat{Q}^{(k)}{\bf z}$ form, given by (21). We now study the complexity of computing the one-dimensional problems to obtain $\text{cost}(Q^{(k)}{\bf z})$. We follow [7] and assume $n_{k}=2^{q}$, where $q$ is a non-negative integer that may be different for different $k$. Let ${\mathcal{H}}_{q}$ denote the family of $2^{q}\times 2^{q}$ hierarchical matrices constructed above. Corresponding to the quadtree block structure, the representation of ${\mathcal{H}}_{q}$ for $q\geq 1$ can be given recursively,

{\mathcal{H}}_{q}=\left(\begin{array}[]{@{}c|c@{}}{\mathcal{H}}_{q-1}&{\mathcal{N}}_{q-1}\\ \hline\cr{\mathcal{N}}_{q-1}^{*}&{\mathcal{H}}_{q-1}\end{array}\right),\qquad{\mathcal{N}}_{q}=\left(\begin{array}[]{@{}c|c@{}}{\mathcal{R}}_{q-1}&{\mathcal{R}}_{q-1}\\ \hline\cr{\mathcal{N}}_{q-1}&{\mathcal{R}}_{q-1}\end{array}\right), \qquad (27)

where ${\mathcal{R}}_{q}$ is the family of rank-$r$ square matrices of size $2^{q}\times 2^{q}$. Note that for $q=0$, the matrices ${\mathcal{H}}_{q}$, ${\mathcal{N}}_{q}$, ${\mathcal{R}}_{q}$ are $1\times 1$ matrices and hence are simply scalars.

Now let $W_{{\mathcal{H}}}(q)$ be the cost of multiplying ${\mathcal{H}}_{q}$ by a vector ${\bf z}$. The recursive forms (27) imply

W_{{\mathcal{H}}}(q)=2\,W_{{\mathcal{H}}}(q-1)+2\,W_{{\mathcal{N}}}(q-1),\qquad W_{{\mathcal{N}}}(q)=W_{{\mathcal{N}}}(q-1)+3\,W_{{\mathcal{R}}}(q-1),

where $W_{{\mathcal{N}}}(q)$ and $W_{{\mathcal{R}}}(q)$ are the costs of multiplying a vector by ${\mathcal{N}}_{q}$ and ${\mathcal{R}}_{q}$, respectively. We first note that since ${\mathcal{R}}_{q}$ is a rank-$r$ matrix, we have

W_{{\mathcal{R}}}(q)=c\,r\,2^{q},\qquad c>0.

We can then solve the recursive equation $W_{{\mathcal{N}}}(q)=W_{{\mathcal{N}}}(q-1)+3\,c\,r\,2^{q-1}$ with $W_{{\mathcal{N}}}(0)=1$ to get

W_{{\mathcal{N}}}(q)=3\,c\,r\,(2^{q}-1)+1\sim 3\,c\,r\,2^{q}.

We finally solve the recursive equation $W_{{\mathcal{H}}}(q)=2\,W_{{\mathcal{H}}}(q-1)+6\,c\,r\,2^{q-1}$ with $W_{{\mathcal{H}}}(0)=1$ to obtain

W_{{\mathcal{H}}}(q)=(1+3\,c\,r\,q)\,2^{q}.

If we choose the local ranks $r$ according to (26), then $r={\mathcal{O}}(\log n_{k})$, and hence the cost of computing the one-dimensional problems $Q^{(k)}{\bf z}$ is

\text{cost}(Q^{(k)}{\bf z})={\mathcal{O}}(n_{k}\log^{2}n_{k}). \qquad (28)

Combining (21) and (28), we obtain

\text{cost}(Q{\bf w})=\sum_{k=1}^{d}\tilde{n}_{k}\,{\mathcal{O}}(n_{k}\log^{2}n_{k})={\mathcal{O}}(n\sum_{k=1}^{d}\log^{2}n_{k})\leq{\mathcal{O}}(n(\sum_{k=1}^{d}\log n_{k})^{2})={\mathcal{O}}(n\log^{2}n).

We note that with such a reduction in the complexity of each Sinkhorn iteration, the overall cost of computing Sinkhorn divergence after $K={\mathcal{O}}(\log n)$ iterations reduces to ${\mathcal{O}}(n\,\log^{3}n)$.

3.3 Algorithm

Algorithm 1 outlines the hierarchical low-rank computation of Sinkhorn divergence.

Algorithm 1 Hierarchical low-rank computation of Sinkhorn divergence
Input:
  • cost matrix $C\in{\mathbb{R}}_{+}^{n\times n}$ generated by metric functions $c_{k}:{\mathbb{R}}\times{\mathbb{R}}\rightarrow{\mathbb{R}}_{+}$, $k=1,\dotsc,d$
  • probability vectors ${\bf f},{\bf g}\in\Sigma_{n}$ supported on two bounded subsets of ${\mathbb{R}}^{d}$
  • tolerances $\varepsilon_{\text{TOL}}$ and $\varepsilon_{\text{S}}$
Output: Sinkhorn divergence $S_{p,\lambda}({\bf f},{\bf g})$
1. Select $\lambda>0$; see Remark 1.
2. For every $k=1,\dotsc,d$, construct two hierarchical matrices (within tolerance $\varepsilon_{\text{TOL}}$):
\[
Q_{\text{H}}^{(k)}\approx Q^{(k)}=[\exp(-\lambda\,c_{k})],\qquad
\hat{Q}_{\text{H}}^{(k)}\approx\hat{Q}^{(k)}=[c_{k}\,\exp(-\lambda\,c_{k})].
\]
3. Implement two procedures, following steps 1-2 in Section 3.2:
    Procedure 1.  ${\bf s}=\text{MVM1}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},{\bf z})$, producing the result ${\bf s}\approx Q\,{\bf z}$
    Procedure 2.  ${\bf s}=\text{MVM2}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},\hat{Q}_{\text{H}}^{(1)},\dotsc,\hat{Q}_{\text{H}}^{(d)},{\bf z})$, producing the result ${\bf s}\approx\hat{Q}\,{\bf z}$
4. Sinkhorn’s loop:
    ${\bf u}^{(0)}=\mathds{1}_{n}/n$
    ${\bf v}^{(0)}={\bf g}\,\oslash\,\text{MVM1}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},{\bf u}^{(0)})$
    $i=0$
    while $\langle$ stopping criterion (13) does not hold $\rangle$ do
        ${\bf u}^{(i+1)}={\bf f}\,\oslash\,\text{MVM1}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},{\bf v}^{(i)})$
        ${\bf v}^{(i+1)}={\bf g}\,\oslash\,\text{MVM1}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},{\bf u}^{(i+1)})$
        $i=i+1$
    end while
    ${\bf s}=\text{MVM2}(Q_{\text{H}}^{(1)},\dotsc,Q_{\text{H}}^{(d)},\hat{Q}_{\text{H}}^{(1)},\dotsc,\hat{Q}_{\text{H}}^{(d)},{\bf v}^{(i)})$
    $S_{p,\lambda}=\big({\bf s}^{\top}{\bf u}^{(i)}\big)^{1/p}$
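For reference, a minimal Python sketch of the Sinkhorn loop in Algorithm 1 follows. The callables `apply_Q` and `apply_Qhat` stand in for the procedures MVM1 and MVM2 (realized here by dense matrix-vector products for illustration only), a simple $\ell_{1}$ change test stands in for the stopping criterion (13), which is not reproduced here, and the example data at the end are hypothetical.

```python
# Sketch of the Sinkhorn loop in Algorithm 1; apply_Q and apply_Qhat stand in
# for MVM1 and MVM2 (hierarchical low-rank matvecs in the proposed method,
# dense matvecs here). The stopping test is a stand-in for criterion (13).
import numpy as np

def sinkhorn_divergence(f, g, apply_Q, apply_Qhat, p=2, eps_S=1e-2, max_iter=10_000):
    n = f.size
    u = np.ones(n) / n                      # u^(0) = 1_n / n
    v = g / apply_Q(u)                      # v^(0) = g ./ (Q u^(0))
    for _ in range(max_iter):
        u_new = f / apply_Q(v)              # u^(i+1) = f ./ (Q v^(i))
        v_new = g / apply_Q(u_new)          # v^(i+1) = g ./ (Q u^(i+1))
        done = np.abs(u_new - u).sum() + np.abs(v_new - v).sum() < eps_S
        u, v = u_new, v_new
        if done:
            break
    s = apply_Qhat(v)                       # s ~ Q_hat v
    return (s @ u) ** (1.0 / p)             # S_{p,lambda} = (s^T u)^(1/p)

# Hypothetical usage on a small one-dimensional problem with cost |x - y|^2.
lam, n = 50.0, 256
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :]) ** 2
Q, Qhat = np.exp(-lam * C), C * np.exp(-lam * C)
f = np.exp(-((x - 0.4) / 0.05) ** 2); f /= f.sum()
g = np.exp(-((x - 0.6) / 0.05) ** 2); g /= g.sum()
print(sinkhorn_divergence(f, g, lambda z: Q @ z, lambda z: Qhat @ z))
```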

4 Numerical illustration

In this section we present a numerical example to verify the theoretical and numerical results of the previous sections. In particular, we consider an example studied in [11] and demonstrate the applicability of Sinkhorn divergence and the proposed algorithm to solving optimization problems.

Problem statement. Let ${\bf f}=[f(x_{i})]\in{\mathbb{R}}^{n}$ be a three-pulse signal given at $n$ discrete points $x_{i}=(i-1)/(n-1)$, with $i=1,\dotsc,n$, where

\[
f(x)=e^{-\left(\frac{x-0.4}{\sigma}\right)^{2}}-e^{-\left(\frac{x-0.5}{\sigma}\right)^{2}}+e^{-\left(\frac{x-0.6}{\sigma}\right)^{2}},\qquad x\in[0,1].
\]

Here, $\sigma>0$ is a given positive constant that controls the spread of the three pulses; the smaller $\sigma$, the narrower the pulses. Let further ${\bf g}(s)=[g(x_{i};s)]\in{\mathbb{R}}^{n}$ be a shifted version of ${\bf f}$ with shift $s$, given at the same discrete points $x_{i}=(i-1)/(n-1)$, with $i=1,\dotsc,n$, where

\[
g(x;s)=e^{-\left(\frac{x-s-0.4}{\sigma}\right)^{2}}-e^{-\left(\frac{x-s-0.5}{\sigma}\right)^{2}}+e^{-\left(\frac{x-s-0.6}{\sigma}\right)^{2}},\qquad x\in[0,1].
\]

Assuming $-0.3\leq s\leq 0.3$, our goal is to find the “optimal” shift $s^{*}$ that minimizes the “distance” between the original signal ${\bf f}$ and the shifted signal ${\bf g}(s)$,

\[
s^{*}=\operatorname*{argmin}_{s\in[-0.3,0.3]}\,\text{d}({\bf f},{\bf g}(s)).
\]

Here, $\text{d}(\cdot,\cdot)$ is a “loss” function that measures the dissimilarity between ${\bf f}$ and ${\bf g}(s)$; see the discussion below. Although this is a simple problem with the known solution $s^{*}=0$, it serves as an illustrative example exhibiting important challenges that arise in solving optimization problems. Figure 4 shows the original signal ${\bf f}$ and two shifted signals ${\bf g}(s)$ for the shifts $s=-0.3$ and $s=0.3$. Wide signals (left) correspond to width $\sigma=0.05$, and narrow signals (right) correspond to width $\sigma=0.01$.

Figure 4: The original signal ${\bf f}$ and the shifted signal ${\bf g}(s)$ for two different shifts, $s=-0.3$ and $s=0.3$. The left panel shows wide signals ($\sigma=0.05$), and the right panel shows narrow signals ($\sigma=0.01$).
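For completeness, a minimal NumPy sketch for generating the discrete signals ${\bf f}$ and ${\bf g}(s)$ is given below; the helper name `three_pulse` is ours and purely illustrative.

```python
# Sketch: discretize the three-pulse signals f(x) and g(x; s) on n grid points.
import numpy as np

def three_pulse(x, sigma, shift=0.0):
    """Three pulses centered at 0.4, 0.5, 0.6, translated by `shift`."""
    return (np.exp(-((x - shift - 0.4) / sigma) ** 2)
            - np.exp(-((x - shift - 0.5) / sigma) ** 2)
            + np.exp(-((x - shift - 0.6) / sigma) ** 2))

n, sigma = 2 ** 12, 0.05
x = np.linspace(0.0, 1.0, n)            # x_i = (i - 1) / (n - 1), i = 1, ..., n
f = three_pulse(x, sigma)               # original signal f
g = three_pulse(x, sigma, shift=-0.3)   # shifted signal g(s) with s = -0.3
```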

Loss functions. A first step in tackling an optimization problem is to select an appropriate loss function. Here, we consider and compare three loss functions:

  • Euclidean distance (or $L^{2}$ norm): $\text{d}_{\text{E}}({\bf f},{\bf g}(s)):=\|{\bf f}-{\bf g}(s)\|_{2}$.

  • Quadratic Wasserstein distance: $\text{d}_{\text{W}}({\bf f},{\bf g}(s))$, defined in (29).

  • Quadratic Sinkhorn divergence: $\text{d}_{\text{S}}({\bf f},{\bf g}(s))$, defined in (30).

Note that since the two signals are not necessarily probability vectors, i.e. positivity and mass balance are not guaranteed, the signals need to be pre-processed before we can measure their Wasserstein distance and Sinkhorn divergence. While there are various techniques for this, here we employ a sign-splitting and normalization technique that treats the non-negative parts ${\bf f}^{+}$ and ${\bf g}^{+}$ and the non-positive parts ${\bf f}^{-}$ and ${\bf g}^{-}$ of the two signals separately. Precisely, we define

\[
d_{\text{W}}({\bf f},{\bf g}(s)):=W_{2}\left(\frac{{\bf f}^{+}}{{\bf f}^{+\top}\mathds{1}},\frac{{\bf g}(s)^{+}}{{\bf g}(s)^{+\top}\mathds{1}}\right)+W_{2}\left(\frac{{\bf f}^{-}}{{\bf f}^{-\top}\mathds{1}},\frac{{\bf g}(s)^{-}}{{\bf g}(s)^{-\top}\mathds{1}}\right),
\tag{29}
\]
\[
d_{\text{S}}({\bf f},{\bf g}(s)):=S_{2,\lambda}\left(\frac{{\bf f}^{+}}{{\bf f}^{+\top}\mathds{1}},\frac{{\bf g}(s)^{+}}{{\bf g}(s)^{+\top}\mathds{1}}\right)+S_{2,\lambda}\left(\frac{{\bf f}^{-}}{{\bf f}^{-\top}\mathds{1}},\frac{{\bf g}(s)^{-}}{{\bf g}(s)^{-\top}\mathds{1}}\right).
\tag{30}
\]

Here, $W_{2}$ and $S_{2,\lambda}$ are given by (3) and (6), respectively, with $p=2$.
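A minimal sketch of this sign-splitting and normalization step is given below; the callable `div` stands in for either $W_{2}$ or $S_{2,\lambda}$, and the non-positive part is handled through its magnitude, which yields the same probability vectors as the normalization in (29)–(30).

```python
# Sketch of the sign-splitting pre-processing in (29)-(30): split a signal into
# its non-negative part and the magnitude of its non-positive part, normalize
# each to a probability vector, and sum the two divergences. Assumes each part
# has nonzero mass; `div` stands in for W_2 or S_{2,lambda}.
import numpy as np

def split_and_normalize(h):
    h_plus = np.maximum(h, 0.0)
    h_minus = np.maximum(-h, 0.0)
    return h_plus / h_plus.sum(), h_minus / h_minus.sum()

def sign_split_loss(f, g, div):
    f_p, f_m = split_and_normalize(f)
    g_p, g_m = split_and_normalize(g)
    return div(f_p, g_p) + div(f_m, g_m)
```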

Two desirable properties of a loss function are its convexity with respect to the unknown parameter (here $s$) and its efficient computability for many different values of $s$. While the Euclidean distance $\text{d}_{\text{E}}$ is widely used and can be efficiently computed with cost ${\mathcal{O}}(n)$, it may exhibit multiple local extrema that can result in convergence to wrong solutions. The Wasserstein metric, on the other hand, features better convexity and hence better convergence to correct solutions; see [11] and the references therein. However, the application of the Wasserstein metric is limited to low-dimensional signals due to its high computational cost, which becomes infeasible in high dimensions. This makes the Wasserstein loss function $\text{d}_{\text{W}}$ particularly suitable for one-dimensional problems, where it can be efficiently computed with cost ${\mathcal{O}}(n\,\log n)$, as discussed in Section 2.1. As a promising alternative to $\text{d}_{\text{E}}$ and $\text{d}_{\text{W}}$, we may consider the Sinkhorn loss function $\text{d}_{\text{S}}$, which possesses both desirable properties. In terms of convexity, Sinkhorn divergence behaves similarly to the Wasserstein metric. In terms of computational cost, as shown in the present work, Sinkhorn divergence can be efficiently computed in multiple dimensions with a near-linear cost ${\mathcal{O}}(n\,\log^{3}n)$.

Numerical results and discussion. Figure 5 shows the loss functions $d_{\text{E}}$, $d_{\text{W}}$, and $d_{\text{S}}$ versus $s$. Here, we set $n=2^{12}$ and consider two different values for the width of the signals, $\sigma=0.05$ (left figure) and $\sigma=0.01$ (right figure). We compute the Sinkhorn loss function with $\lambda=50$ both by the original Sinkhorn algorithm (labeled $d_{\text{S}}$ in the figure) and by the proposed fast algorithm outlined in Algorithm 1 (labeled $d_{\text{S}}^{\text{H}}$ in the figure). In computing $d_{\text{S}}^{\text{H}}$, we obtain the optimal local ranks by (26), with parameters $c=1$ and $\alpha=2$ corresponding to the $L^{p}$ cost function with $p=2$; see the discussion on $L^{p}$ cost functions in Section 3.1 for this choice of parameters. We set the accuracy tolerances to $\varepsilon_{\text{TOL}}=\varepsilon_{\text{S}}=0.01$.

Figure 5: Comparison of different loss functions measuring the dissimilarity between ${\bf f}$ and ${\bf g}(s)$ as a function of the shift $s$. While the Euclidean loss function has local extrema, the Wasserstein and Sinkhorn loss functions are convex. The left panel is for wide signals ($\sigma=0.05$), and the right panel is for narrow signals ($\sigma=0.01$).

As we observe, the Euclidean loss function has local extrema, while the Wasserstein and Sinkhorn loss functions are convex. We also observe that $d_{\text{S}}^{\text{H}}$ closely approximates $d_{\text{S}}$, verifying the accuracy of the proposed algorithm in computing Sinkhorn divergence.

Figure 6 shows the CPU time for computing $d_{\text{S}}^{\text{H}}$ as a function of $n$. We observe that the cost is ${\mathcal{O}}(n\,\log^{3}n)$, as expected.

Figure 6: CPU time versus $n$.

5 Conclusion

This work presents a fast method for computing the Sinkhorn divergence between two discrete probability vectors supported on possibly high-dimensional spaces. The cost matrix in the optimal transport problem is assumed to be decomposable into a sum of asymptotically smooth Kronecker product factors. The method combines Sinkhorn’s matrix scaling iteration with a low-rank hierarchical representation of the scaling matrices to achieve near-linear complexity. This provides a fast and easy-to-implement algorithm for computing Sinkhorn divergence, enabling its application to large-scale optimization problems, where the computation of the classical Wasserstein metric is not feasible.

Future directions include exploring the application of the proposed hierarchical low-rank Sinkhorn algorithm to large-scale Bayesian inversion and other data science and machine learning problems.


Acknowledgments. I would like to thank Klas Modin for a question he raised during a seminar that I gave at Chalmers University of Technology in Gothenburg, Sweden. Klas’ question on the computation of Wasserstein metric in multiple dimensions triggered my curiosity and persuaded me to work further on the subject.

Appendix

In this appendix we collect a number of auxiliary lemmas.

Lemma 1.

For a real matrix $A=[A_{ij}]$, let $\exp[A]$ denote the matrix obtained by the entrywise application of the exponential function to $A$, i.e. $\exp[A]:=[\exp(A_{ij})]$. Then we have

\[
\exp[C^{(d)}\oplus\dotsi\oplus C^{(1)}]=\exp[C^{(d)}]\otimes\dotsi\otimes\exp[C^{(1)}].
\]
Proof.

The case $d=1$ is trivial. Consider the case $d=2$. By the definition of the Kronecker sum and the fact that the entrywise exponential maps sums to Hadamard products, $\exp[A+B]=\exp[A]\odot\exp[B]$, we have

\begin{align*}
\exp[C^{(2)}\oplus C^{(1)}]&=\exp[C^{(2)}\otimes J_{1}+J_{2}\otimes C^{(1)}]\\
&=\left(\begin{array}{c|c|c}
\exp(C_{11}^{(2)})\,J_{1} & \cdots & \exp(C_{1n_{2}}^{(2)})\,J_{1}\\ \hline
\vdots & \ddots & \vdots\\ \hline
\exp(C_{n_{2}1}^{(2)})\,J_{1} & \cdots & \exp(C_{n_{2}n_{2}}^{(2)})\,J_{1}
\end{array}\right)\odot
\left(\begin{array}{c|c|c}
\exp[C^{(1)}] & \cdots & \exp[C^{(1)}]\\ \hline
\vdots & \ddots & \vdots\\ \hline
\exp[C^{(1)}] & \cdots & \exp[C^{(1)}]
\end{array}\right)\\
&=\left(\begin{array}{c|c|c}
\exp(C_{11}^{(2)})\exp[C^{(1)}] & \cdots & \exp(C_{1n_{2}}^{(2)})\exp[C^{(1)}]\\ \hline
\vdots & \ddots & \vdots\\ \hline
\exp(C_{n_{2}1}^{(2)})\exp[C^{(1)}] & \cdots & \exp(C_{n_{2}n_{2}}^{(2)})\exp[C^{(1)}]
\end{array}\right)\\
&=\exp[C^{(2)}]\otimes\exp[C^{(1)}].
\end{align*}

Now assume that it is true for $d-1$:

\[
\exp[C^{(d-1)}\oplus\dotsi\oplus C^{(1)}]=\exp[C^{(d-1)}]\otimes\dotsi\otimes\exp[C^{(1)}].
\]

Then, by the case $d=2$ and the induction hypothesis,

\begin{align*}
\exp[C^{(d)}\oplus\dotsi\oplus C^{(1)}]&=\exp[C^{(d)}]\otimes\exp[C^{(d-1)}\oplus\dotsi\oplus C^{(1)}]\\
&=\exp[C^{(d)}]\otimes\dotsi\otimes\exp[C^{(1)}].
\end{align*}
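A quick numerical check of Lemma 1 for $d=2$ is given below (a sketch assuming NumPy, with $J_{1}$ and $J_{2}$ taken as all-ones matrices, consistent with the block expansions in the proof above).

```python
# Numerical check of Lemma 1 for d = 2: the entrywise exponential of the
# Kronecker sum C2 (+) C1 = C2 x J1 + J2 x C1 equals exp[C2] (x) exp[C1].
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 5, 3
C1 = rng.random((n1, n1))
C2 = rng.random((n2, n2))
J1, J2 = np.ones((n1, n1)), np.ones((n2, n2))

kron_sum = np.kron(C2, J1) + np.kron(J2, C1)
assert np.allclose(np.exp(kron_sum), np.kron(np.exp(C2), np.exp(C1)))
```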

Lemma 2.

We have

\[
[C^{(d)}\oplus\dotsi\oplus C^{(1)}]\odot[Q^{(d)}\otimes\dotsi\otimes Q^{(1)}]=\sum_{k=1}^{d}A_{k}^{(d)}\otimes\dotsi\otimes A_{k}^{(1)},\qquad
A_{k}^{(m)}=\begin{cases}
C^{(m)}\odot Q^{(m)}, & k=m,\\
Q^{(m)}, & k\neq m.
\end{cases}
\]
Proof.

The case $d=1$ is trivial. Consider the case $d=2$. By the definition of the Kronecker sum, we have

\begin{align*}
[C^{(2)}\oplus C^{(1)}]\odot[Q^{(2)}\otimes Q^{(1)}]&=[C^{(2)}\otimes J_{1}]\odot[Q^{(2)}\otimes Q^{(1)}]+[J_{2}\otimes C^{(1)}]\odot[Q^{(2)}\otimes Q^{(1)}]\\
&=\left(\begin{array}{c|c|c}
C_{11}^{(2)}J_{1} & \cdots & C_{1n_{2}}^{(2)}J_{1}\\ \hline
\vdots & \ddots & \vdots\\ \hline
C_{n_{2}1}^{(2)}J_{1} & \cdots & C_{n_{2}n_{2}}^{(2)}J_{1}
\end{array}\right)\odot
\left(\begin{array}{c|c|c}
Q_{11}^{(2)}Q^{(1)} & \cdots & Q_{1n_{2}}^{(2)}Q^{(1)}\\ \hline
\vdots & \ddots & \vdots\\ \hline
Q_{n_{2}1}^{(2)}Q^{(1)} & \cdots & Q_{n_{2}n_{2}}^{(2)}Q^{(1)}
\end{array}\right)\\
&\quad+\left(\begin{array}{c|c|c}
C^{(1)} & \cdots & C^{(1)}\\ \hline
\vdots & \ddots & \vdots\\ \hline
C^{(1)} & \cdots & C^{(1)}
\end{array}\right)\odot
\left(\begin{array}{c|c|c}
Q_{11}^{(2)}Q^{(1)} & \cdots & Q_{1n_{2}}^{(2)}Q^{(1)}\\ \hline
\vdots & \ddots & \vdots\\ \hline
Q_{n_{2}1}^{(2)}Q^{(1)} & \cdots & Q_{n_{2}n_{2}}^{(2)}Q^{(1)}
\end{array}\right)\\
&=\left(\begin{array}{c|c|c}
C_{11}^{(2)}Q_{11}^{(2)}Q^{(1)} & \cdots & C_{1n_{2}}^{(2)}Q_{1n_{2}}^{(2)}Q^{(1)}\\ \hline
\vdots & \ddots & \vdots\\ \hline
C_{n_{2}1}^{(2)}Q_{n_{2}1}^{(2)}Q^{(1)} & \cdots & C_{n_{2}n_{2}}^{(2)}Q_{n_{2}n_{2}}^{(2)}Q^{(1)}
\end{array}\right)
+\left(\begin{array}{c|c|c}
Q_{11}^{(2)}(C^{(1)}\odot Q^{(1)}) & \cdots & Q_{1n_{2}}^{(2)}(C^{(1)}\odot Q^{(1)})\\ \hline
\vdots & \ddots & \vdots\\ \hline
Q_{n_{2}1}^{(2)}(C^{(1)}\odot Q^{(1)}) & \cdots & Q_{n_{2}n_{2}}^{(2)}(C^{(1)}\odot Q^{(1)})
\end{array}\right)\\
&=[C^{(2)}\odot Q^{(2)}]\otimes Q^{(1)}+Q^{(2)}\otimes[C^{(1)}\odot Q^{(1)}].
\end{align*}

Now assume that it is true for $d-1$:

\[
[C^{(d-1)}\oplus\dotsi\oplus C^{(1)}]\odot[Q^{(d-1)}\otimes\dotsi\otimes Q^{(1)}]=\sum_{k=1}^{d-1}A_{k}^{(d-1)}\otimes\dotsi\otimes A_{k}^{(1)}.
\]

Then, using the case $d=2$ and the induction hypothesis,

\begin{align*}
[C^{(d)}\oplus\dotsi\oplus C^{(1)}]\odot[Q^{(d)}\otimes\dotsi\otimes Q^{(1)}]
&=[C^{(d)}\odot Q^{(d)}]\otimes[Q^{(d-1)}\otimes\dotsi\otimes Q^{(1)}]
+Q^{(d)}\otimes\left([C^{(d-1)}\oplus\dotsi\oplus C^{(1)}]\odot[Q^{(d-1)}\otimes\dotsi\otimes Q^{(1)}]\right)\\
&=[C^{(d)}\odot Q^{(d)}]\otimes Q^{(d-1)}\otimes\dotsi\otimes Q^{(1)}
+Q^{(d)}\otimes\sum_{k=1}^{d-1}A_{k}^{(d-1)}\otimes\dotsi\otimes A_{k}^{(1)}\\
&=A_{d}^{(d)}\otimes Q^{(d-1)}\otimes\dotsi\otimes Q^{(1)}
+\sum_{k=1}^{d-1}Q^{(d)}\otimes A_{k}^{(d-1)}\otimes\dotsi\otimes A_{k}^{(1)}\\
&=\sum_{k=1}^{d}A_{k}^{(d)}\otimes\dotsi\otimes A_{k}^{(1)},
\end{align*}

noting that $A_{d}^{(d)}=C^{(d)}\odot Q^{(d)}$, and $A_{d}^{(k)}=Q^{(k)}$ and $A_{k}^{(d)}=Q^{(d)}$ for $k=1,\dotsc,d-1$.
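Analogously, Lemma 2 can be checked numerically for $d=2$; the sketch below makes the same assumptions as the check after Lemma 1.

```python
# Numerical check of Lemma 2 for d = 2 (Hadamard product written as *):
# [C2 (+) C1] * [Q2 (x) Q1] = (C2 * Q2) (x) Q1 + Q2 (x) (C1 * Q1).
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 4, 6
C1, Q1 = rng.random((n1, n1)), rng.random((n1, n1))
C2, Q2 = rng.random((n2, n2)), rng.random((n2, n2))
J1, J2 = np.ones((n1, n1)), np.ones((n2, n2))

kron_sum = np.kron(C2, J1) + np.kron(J2, C1)          # C2 (+) C1
lhs = kron_sum * np.kron(Q2, Q1)                       # entrywise (Hadamard) product
rhs = np.kron(C2 * Q2, Q1) + np.kron(Q2, C1 * Q1)
assert np.allclose(lhs, rhs)
```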

References

  • [1] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1961–1971, 2017.
  • [2] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Comm. Pure Appl. Math., 44:375–417, 1991.
  • [3] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
  • [4] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pages 1367–1376, 2018.
  • [5] B. Engquist, B. D. Froese, and Y. Yang. Optimal transport for seismic full waveform inversion. Communications in Mathematical Sciences, 14(8):2309–2330, 2016.
  • [6] J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.
  • [7] W. Hackbusch. Hierarchical Matrices: Algorithms and Analysis, volume 49 of Springer Series in Computational Mathematics. Springer-Verlag, 2015.
  • [8] J. Indritz. An inequality for Hermite polynomials. Proc. Amer. Math. Soc., 12:981–983, 1961.
  • [9] B. Kalantari and L. Khachiyan. On the complexity of nonnegative-matrix scaling. Linear Algebra and its Applications, 240:87–103, 1996.
  • [10] P. A. Knight. The Sinkhorn–Knopp algorithm: convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30:261–275, 2008.
  • [11] M. Motamed and D. Appelö. Wasserstein metric-driven Bayesian inversion with applications to signal processing. International Journal for Uncertainty Quantification, 9:395–414, 2019.
  • [12] A. Nemirovski and U. Rothblum. On complexity of matrix scaling. Linear Algebra and its Applications, 302:435–460, 1999.
  • [13] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11:355–607, 2019.
  • [14] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35:876–879, 1964.
  • [15] J. Solomon, F. De Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34:66:1–66:11, 2015.
  • [16] C. Villani. Topics in Optimal Transportation, volume 58 of Graduate Studies in Mathematics. American Mathematical Society, 2003.
  • [17] C. Villani. Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer Verlag, 2009.
  • [18] Y. Yang, B. Engquist, J. Sun, and B. F. Hamfeldt. Application of optimal transport and the quadratic Wasserstein metric to full-waveform inversion. Geophysics, 83(1):R43–R62, 2018.