
Yue Xie: Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. Email: [email protected]
Zhongjian Wang: School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore 637371. Email: [email protected]
Zhiwen Zhang: Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. Email: [email protected]

Randomized methods for computing optimal transport without regularization and their convergence analysis

Yue Xie    Zhongjian Wang    Zhiwen Zhang
(Received: date / Accepted: date)
Abstract

The optimal transport (OT) problem can be reduced to a linear programming (LP) problem through discretization. In this paper, we introduce the random block coordinate descent (RBCD) methods to directly solve this LP problem. Our approach involves restricting the potentially large-scale optimization problem to small LP subproblems constructed via randomly chosen working sets. By using a random Gauss-Southwell-q rule to select these working sets, we equip the vanilla version (RBCD0) with almost sure convergence and a linear convergence rate to solve general standard LP problems. To further improve the efficiency of the RBCD0 method, we explore the special structure of constraints in the OT problems and leverage the theory of linear systems to propose several approaches for refining the random working set selection and accelerating the vanilla method. Inexact versions of the RBCD methods are also discussed. Our preliminary numerical experiments demonstrate that the accelerated random block coordinate descent (ARBCD) method compares well with other solvers including Sinkhorn’s algorithm when seeking solutions with relatively high accuracy, and offers the advantage of saving memory.

Keywords:
Optimal transport · deep particle method · convex optimization · random block coordinate descent · convergence analysis.
MSC:
65C35 · 68W20 · 90C08 · 90C25

1 Introduction

Background and motivation

The optimal transport problem, which aims to find the most cost-efficient way to transport mass from a set of sources to a set of sinks, was first introduced by Monge in 1781. Later, the theory was modernized and revolutionized by Kantorovich in 1942, who found a key link between optimal transport and linear programming. In recent years, optimal transport has become a popular and powerful tool in data science, especially in image processing, machine learning, and deep learning areas, where it provides a very natural way to compare and interpolate probability distributions. For instance, in generative models arjovsky2017wasserstein ; lei2019geometric ; wang2022deepparticle , a natural penalty function is the Wasserstein distance (a concept closely related to OT) between the data and the generated distribution. In image processing, the optimal transport plan, which minimizes the transportation cost, provides solutions to image registration haker2004optimal and seamless copy perrot2016mapping . Apart from data science, in the past three decades, there has been an explosion of research interest in optimal transport because of the deep connections between the optimal transport problems with quadratic cost functions and a diverse class of partial differential equations (PDEs) arising in statistical mechanics and fluid mechanics; see e.g. brenier1991polar ; benamou2000computational ; otto2001geometry ; jordan1998variational ; villani2021topics for just a few of the most prominent results and references therein.

Inspired by this research progress, we have developed efficient numerical methods to solve multi-scale PDE problems using the optimal transport approach. Specifically, in our recent paper, we proposed a deep particle method for learning and computing invariant measures of parameterized stochastic dynamical systems wang2022deepparticle . To achieve this goal, we designed a deep neural network (DNN) to map a uniform distribution (source) to an invariant measure (target), where the Péclet number is an input parameter for the DNN. The network is trained by minimizing the 2-Wasserstein distance (W2W_{2}) between the measure of network output μ\mu and target measure ν\nu. We consider a discrete version of W2W_{2} for finitely many samples of μ\mu and ν\nu, which involves a linear program (LP) optimized over doubly stochastic matrices sinkhorn1964relationship .

Solving the LP directly using the interior point method wright1997primal is too costly. Motivated by the domain decomposition method toselli2004domain in scientific computing, which solves partial differential equations using subroutines that solve problems on subdomains and has the advantage of saving memory (i.e., using the same computational resource, it can compute a larger problem), we devised a mini-batch interior point method. This approach involves sampling smaller sub-matrices while preserving row and column sums. It has proven to be highly efficient and integrates seamlessly with the stochastic gradient descent (SGD) method for overall network training. However, we did not obtain convergence analysis for this mini-batch interior point method in our previous work wang2022deepparticle .

The objectives of this paper are twofold. First, we aim to provide rigorous convergence analysis for the mini-batch interior point method presented in wang2022deepparticle , with minimal modifications. Second, we seek to enhance the mini-batch selection strategy, thereby achieving improved and more robust performance in computing optimal transport problems. We recognize that the mini-batch interior point method aligns with the random block coordinate descent (RBCD) method in optimization terminology. Specifically, it applies the block coordinate descent (BCD) method to the LP problem directly, selects the working set randomly, and solves subproblems using the primal-dual interior point method wright1997primal or any other efficient linear programming solver. Encouraged by the demonstrated efficiency of this approach, we will develop theoretical results for solving LP with RBCD methods and explore various strategies for selecting working sets.

Theoretical contributions

In this work, we first introduce an expected Gauss-Southwell-qq rule to guide the selection of the working set. It enables almost sure convergence and a linear convergence rate in expectation when solving a general standard LP. Based on this rule, we develop a vanilla RBCD method - RBCD0, which selects the working set with complete randomness. Then, we investigate the special linear system present in the LP formulation of OT. We characterize all the elementary vectors of the null space and provide a strategy for finding the conformal realization of any given vector in the null space at a low computational cost. Based on these findings, we propose various approaches to refine the working set selection and improve the performance of RBCD0. A better estimation of the constant in the linear convergence rate is shown. Moreover, we incorporate an acceleration technique inspired by the momentum concept to improve the algorithm’s efficiency. Inexact versions of the RBCD methods are also discussed.

Numerical experiments

We perform numerical experiments to evaluate the performance of the proposed methods. Synthetic data sets of various shapes/dimensions and invariant measures generated from IPM methods are utilized to create distributions. Our experiments first compare different RBCD methods proposed in this paper, demonstrating the benefits of refining working set selection and verifying the effectiveness of the acceleration technique. We also illustrate the gap between theory and practice regarding the convergence rate, examine the sparse solutions generated by the proposed RBCD methods, and discuss the choice of subproblem size. In the second set of experiments, we compare the best-performing method, ARBCD, with Sinkhorn’s algorithm. Preliminary numerical results show that ARBCD is comparable to Sinkhorn’s algorithm in computation time when seeking solutions with relatively high accuracy. ARBCD is also comparable to a recently proposed interior-point-inspired algorithm in memory-saving settings. We also test ARBCD on a large-scale OT problem, where Gurobi runs out of memory. This further justifies the memory-saving advantage of ARBCD.

Previous research on (R)BCD

BCD and RBCD are well-studied for essentially unconstrained smooth optimization (sometimes allowing separable constraints or nonsmooth separable objective functions): beck2013convergence ; gurbuzbalaban2017cyclic ; sun2021worst investigate BCD with cyclic coordinate search; nesterov2012efficiency ; lu2015complexity ; richtarik2014iteration study RBCD to address problems with possibly nonsmooth separable objective functions; other related works include theoretical speedup of RBCD (richtarik2016parallel ; necoara2016parallel ) and second-order sketching (qu2016sdna ; berahas2020investigation ). However, much less is known about their convergence properties when applied to problems with nonseparable nonsmooth functions as summands or coupled constraints. To the best of our knowledge, RBCD has not previously been considered for solving general LP, and the related theoretical guarantees are absent. In necoara2017random , the authors studied the RBCD method to tackle problems with a convex smooth objective and coupled linear equality constraints x_{1}+x_{2}+\ldots+x_{N}=0; a similar algorithm, named the random sketch descent method necoara2021randomized , is investigated to solve problems with a general smooth objective and general coupled linear equality constraints Ax=b. However, after adding the simple bound constraints x\geq 0, the analysis in necoara2017random ; necoara2021randomized may not work anymore, nor can it be easily generalized. Beck beck20142 studied a greedy coordinate descent method but focused on a single linear equality constraint and bound constraints. In Paul Tseng and his collaborators’ work tseng2009block ; tseng2009coordinate ; tseng2010coordinate , a block coordinate gradient descent method is proposed to solve linearly constrained optimization problems including general LP. In these works, a Gauss-Southwell-q rule is proposed to guide the selection of the working set in each iteration. Consequently, the working set, selected in a deterministic fashion, can only be decided after solving a quadratic program of a size similar to that of the original problem. In contrast, our proposed mini-batch interior point/RBCD approach selects the working set through a combination of randomness and low computational cost. Another research direction that addresses separable functions, linearly coupled constraints, and additional separable constraints involves using the alternating direction method of multipliers (ADMM) chen2016direct ; he20121 ; xie2019si ; xie2021tractable . This method updates blocks of primal variables in a Gauss-Seidel fashion and incorporates multiplier updates as well.

Existing algorithms for OT

Encouraged by the success in applying Sinkhorn’s algorithm to the dual of entropy regularized OT cuturi2013sinkhorn , researchers have conducted extensive studies in this area, including other types of regularization blondel2018smooth ; gasnikov2016efficient , acceleration guminov2021combination ; lin2022efficiency , and numerical stability mandad2017variance . In xie2020fast , a proximal point algorithm (PPA) is considered to solve the discrete LP. The entropy-inducing distance function is introduced to construct the proximal term, and each subproblem has the same formulation as the entropy regularized OT. This approach is found to stabilize the numerical computation while maintaining the efficiency of Sinkhorn’s algorithm. In schmitzer2019stabilized , techniques such as log-domain stabilization and epsilon scaling are discussed to further improve the performance of Sinkhorn’s algorithm. The approach of iterative Bregman projection is discussed in benamou2015iterative . It is discovered that Sinkhorn’s iteration is equivalent to a special case of this method. In pmlr-v84-genevay18a , a notion of Sinkhorn distance is proposed, which allows computable differentiation when serving as the loss function for training generative models. Sinkhorn’s algorithm can also be generalized to solve unbalanced OT in chizat2018scaling , and applied to solve OT over geometric domains in solomon2015convolutional . In huang2021riemannian , a Riemannian block coordinate descent method is applied to solve the projection robust Wasserstein distance. The problem is to calculate the Wasserstein distance after projecting the distributions onto lower-dimensional subspaces. The proposed approach employs entropy regularization, deterministic block coordinate descent, and techniques in Riemannian optimization. A domain decomposition method is considered in bonafini2021domain . Unlike our method, the decomposition is deterministic, focusing on one pair of distributions, and the authors tackle entropy-regularized OT.

Other works that significantly deviate from the entropy regularization framework include li2018computations , which computes the Schrödinger bridge problem (equivalent to OT with Fisher information regularization), and multiscale strategies such as gerber2017multiscale , liu2022multiscale and schmitzer2016sparse . In liu2022multiscale , the problem size is reduced using a multiscale strategy, and a semi-smooth Newton method is applied to solve smaller-scale subproblems. The RBCD method employed in this study is a regularization-free method. As a result, it avoids dealing with inaccurate solutions and numerical stability issues introduced by the regularization term. Furthermore, each subproblem in RBCD is a small-size LP, allowing for flexible resolution choices. Interior-point methods and simplex methods are also revisited and enhanced zanetti2023interior ; wijesinghe2023matrix ; natale2021computation ; gottschlich2014shortlist . In particular, the authors of zanetti2023interior propose an interior-point-inspired algorithm with reduced subproblems to address large-scale OT. The authors of facca2021fast consider an equivalent formulation of minimizing an energy functional to solve OT over graphs. Backward Euler (equivalent to the proximal point method) is used in the outer loop of the algorithm and Newton-Raphson is used in the inner loop. Other approaches include numerical solution of the Monge–Ampère equation benamou2014numerical ; benamou2016monotone , stochastic gradient descent to resolve OT in discrete, semi-discrete and continuous formulations genevay2016stochastic , and an alternating direction method of multipliers (ADMM) approach to solve the more general Wasserstein barycenter yang2021fast .

Organization

The rest of the paper is organized as follows. In Section 2, we review the basic idea of optimal transport and Wasserstein distance. In Section 3, we introduce the expected Gauss-Southwell-qq rule and a vanilla RBCD (RBCD0) method for solving general LP problems. An inexact version of RBCD0 is also discussed. In Section 4, we investigate the properties of the linear system in OT and propose several approaches to refine and accelerate the RBCD0 method. In Section 5, preliminary numerical results are presented to demonstrate the performance of our proposed methods. Finally, concluding remarks are made in Section 6.

Notation. For any matrix X, let X(i,j) denote its element in the ith row and jth column, and let X(:,j) represent its jth column vector. For a vector v, we usually use superscripts to denote its copies (e.g., v^{k} in the kth iteration of an algorithm) and use subscripts to denote its components (e.g., v_{i}); for a scalar, we usually use subscripts to denote its copies. Occasional inconsistent cases will be declared in context. \operatorname{mod}(k,n) means k modulo n. For any vector v, we define \operatorname{supp}(v)\triangleq\{i\in\{1,\ldots,n\}\mid v_{i}\neq 0\}. Given a matrix X\in\mathbb{R}^{n\times n}, we define its vectorization as follows:

\operatorname{vec}(X)\triangleq(X(:,1)^{T},X(:,2)^{T},\ldots,X(:,n)^{T})^{T}.

For any positive integer k\geq 2, we denote [1,k]\triangleq\{1,\ldots,k\}. {\bf 1}_{n\times n} represents the n\times n matrix of all ones.
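For concreteness, this column-stacking convention corresponds to column-major (Fortran-order) flattening. The following minimal sketch, assuming NumPy and using our own helper names vec and unvec, illustrates the map and its inverse:

import numpy as np

# vec(X) stacks the columns X(:,1), ..., X(:,n) on top of each other,
# which matches NumPy's Fortran (column-major) ordering.
def vec(X):
    return X.flatten(order="F")

def unvec(x, n):
    # inverse map: reshape a length-n^2 vector back into an n-by-n matrix
    return x.reshape((n, n), order="F")

X = np.arange(9).reshape(3, 3)
assert np.allclose(unvec(vec(X), 3), X)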

2 Optimal transport problems and Wasserstein distance

The Kantorovich formulation of optimal transport can be described as follows,

infγΓ(μ,ν)X×YC(x,y)dγ(x,y)\displaystyle\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times Y}C(x,y)\mathrm{~{}d}\gamma(x,y) (1)

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of all measures on X×YX\times Y whose marginal distribution on XX is μ\mu and marginal distribution on YY is ν\nu, C(x,y)C(x,y) is the transportation cost. In this article, we refer to the Kantorovich formulation when we mention optimal transport.

Wasserstein distances are metrics on probability distributions inspired by the problem of optimal mass transport. They measure the minimal effort required to reconfigure the probability mass of one distribution in order to recover the other distribution. They are ubiquitous in mathematics, especially in fluid mechanics, PDEs, optimal transport, and probability theory villani2021topics . One can define the pp-Wasserstein distance between probability measures μ\mu and ν\nu on a metric space YY with distance function distdist by

Wp(μ,ν):=(infγΓ(μ,ν)Y×Y𝑑ist(y~,y)pdγ(y~,y))1/p\displaystyle W_{p}(\mu,\nu):=\left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{Y\times Y}dist(\tilde{y},y)^{p}\mathrm{~{}d}\gamma(\tilde{y},y)\right)^{1/p} (2)

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of probability measures γ\gamma on Y×YY\times Y satisfying γ(A×Y)=μ(A)\gamma(A\times Y)=\mu(A) and γ(Y×B)=ν(B)\gamma(Y\times B)=\nu(B) for all Borel subsets A,BYA,B\subset Y. Elements γΓ(μ,ν)\gamma\in\Gamma(\mu,\nu) are called couplings of the measures μ\mu and ν\nu, i.e., joint distributions on Y×YY\times Y with marginals μ\mu and ν\nu on each axis. pp-Wasserstein distance is a special case of optimal transport when X=YX=Y and the cost function c(y~,y)=dist(y~,y)pc(\tilde{y},y)=dist(\tilde{y},y)^{p}.

In the discrete case, the definition (2) has a simple intuitive interpretation: given a γΓ(μ,ν)\gamma\in\Gamma(\mu,\nu) and any pair of locations (y~,y)(\tilde{y},y), the value of γ(y~,y)\gamma(\tilde{y},y) tells us what proportion of μ\mu mass at y~\tilde{y} should be transferred to yy, in order to reconfigure μ\mu into ν\nu. Computing the effort of moving a unit of mass from y~\tilde{y} to yy by dist(y~,y)pdist(\tilde{y},y)^{p} yields the interpretation of Wp(μ,ν)W_{p}(\mu,\nu) as the minimal effort required to reconfigure μ\mu mass distribution into that of ν\nu.

In a practical setting COTFNT , referred to as a point cloud, the closed-form expressions of μ\mu and ν\nu may be unknown; instead, only n independent and identically distributed (i.i.d.) samples of μ\mu and n i.i.d. samples of ν\nu are available. In the following discussion, n refers to the size of the problem. We approximate the probability measures μ\mu and ν\nu by empirical distribution functions:

μ=1ni=1nδy~iandν=1nj=1nδyj,\displaystyle\mu=\frac{1}{n}\sum_{i=1}^{n}\delta_{\tilde{y}^{i}}\quad\text{and}\quad\nu=\frac{1}{n}\sum_{j=1}^{n}\delta_{y^{j}}, (3)

where δx\delta_{x} is the Dirac measure. Any element in Γ(μ,ν)\Gamma(\mu,\nu) can clearly be represented by a transition matrix, denoted as γ=(γi,j)i,j\gamma=(\gamma_{i,j})_{i,j} satisfying:

γi,j0;j,i=1nγi,j=1n;i,j=1nγi,j=1n.\displaystyle\gamma_{i,j}\geq 0;\quad\quad\forall j,~{}\sum_{i=1}^{n}\gamma_{i,j}=\frac{1}{n};\quad\quad\forall i,~{}\sum_{j=1}^{n}\gamma_{i,j}=\frac{1}{n}. (4)

Then γi,j\gamma_{i,j} means the mass of y~i\tilde{y}^{i} that is transferring to yjy^{j}.

We denote all matrices in n×n\mathbb{R}^{n\times n} satisfying (4) as Γn\Gamma^{n}, then (2) becomes

W^(f):=(infγΓni,j=1n,ndist(y~i,yj)pγi,j)1/p.\displaystyle\hat{W}(f):=\left(\inf_{\gamma\in\Gamma^{n}}\sum_{i,j=1}^{n,n}\,dist(\tilde{y}^{i},y^{j})^{p}\gamma_{i,j}\right)^{1/p}. (5)
Remark 1

\Gamma^{n} is in fact the set of n\times n doubly stochastic matrices sinkhorn1964relationship divided by n.

Another practical setting, which is commonly used in fields of computer vision peleg1989unified ; ling2007efficient , is to compute the Wasserstein distance between two histograms. To compare two grey-scale figures (2D, size n0×n0n_{0}\times n_{0}), we first normalize the grey scale such that the values of cells of each picture sum to one. We denote centers of the cell as {yi}i=1n\{y^{i}\}_{i=1}^{n} and {y~i}i=1n\{\tilde{y}^{i}\}_{i=1}^{n}, then we can use two probability measures to represent the two figures:

μ=i=1nr1,iδy~iandν=j=1nr2,jδyj,\displaystyle\mu=\sum_{i=1}^{n}r_{1,i}\delta_{\tilde{y}^{i}}\quad\text{and}\quad\nu=\sum_{j=1}^{n}r_{2,j}\delta_{y^{j}},

where r1,i,r2,j0,1i,jnr_{1,i},r_{2,j}\geq 0,\forall 1\leq i,j\leq n, i=1nr1,i=j=1nr2,j=1\sum\limits_{i=1}^{n}r_{1,i}=\sum\limits_{j=1}^{n}r_{2,j}=1. The discrete Wasserstein distance (5) keeps the same form while the transition matrix follows different constraints:

γi,j0;j,i=1nγi,j=r2,j;i,j=1nγi,j=r1,i.\displaystyle\gamma_{i,j}\geq 0;\quad\quad\forall j,~{}\sum_{i=1}^{n}\gamma_{i,j}=r_{2,j};\quad\quad\forall i,~{}\sum_{j=1}^{n}\gamma_{i,j}=r_{1,i}. (6)

Note that in both settings, the computation of Wasserstein distance is reduced to an LP, i.e.,

\begin{array}{rl}\min&\sum\limits_{1\leq i,j\leq n}C_{i,j}\gamma_{i,j}\\ \mbox{subject to}&\sum\limits_{j=1}^{n}\gamma_{i,j}=r_{1,i},\quad\sum\limits_{i=1}^{n}\gamma_{i,j}=r_{2,j},\quad\gamma_{i,j}\geq 0,\end{array} (9)

where r^{1}\triangleq(r_{1,1},\ldots,r_{1,n})^{T} and r^{2}\triangleq(r_{2,1},\ldots,r_{2,n})^{T} are two probability distributions, and C_{i,j}=dist(\tilde{y}^{i},y^{j})^{p}. More generally, we can let r^{1} and r^{2} be two nonnegative vectors and C_{i,j}=C(\tilde{y}^{i},y^{j}) be any appropriate transportation cost from \tilde{y}^{i} to y^{j}, so (9) also captures the discrete OT.

However, when the number of particles nn becomes large, the number of variables (entries of γ\gamma) scales like n2n^{2}, which leads to costly computation. Therefore, we will discuss random block coordinate descent methods to keep the computational workload in each iteration reasonable.
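To make the formulation concrete, the following minimal sketch assembles c=vec(C), the row/column-sum constraints, and b for problem (9) and hands them to a generic dense LP solver. It assumes NumPy and SciPy (linprog with the HiGHS backend); the helper name discrete_ot_lp is ours, and this generic solve is only an illustration of the formulation, not one of the randomized methods developed below.

import numpy as np
from scipy.optimize import linprog

def discrete_ot_lp(C, r1, r2):
    # C : (n, n) cost matrix, C[i, j] = cost of moving mass from y~^i to y^j
    # r1: source marginal (row sums of gamma), r2: target marginal (column sums)
    n = C.shape[0]
    c = C.flatten(order="F")                      # c = vec(C)
    A_row = np.kron(np.ones((1, n)), np.eye(n))   # sum_j gamma_{i,j} = r1_i
    A_col = np.kron(np.eye(n), np.ones((1, n)))   # sum_i gamma_{i,j} = r2_j
    A = np.vstack([A_row, A_col])                 # 2n equality constraints in total
    b = np.concatenate([r1, r2])
    res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.x.reshape((n, n), order="F"), res.fun

# tiny usage example in the point-cloud setting (4), with uniform marginals
rng = np.random.default_rng(0)
n = 5
Y_src, Y_tgt = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
C = ((Y_src[:, None, :] - Y_tgt[None, :, :]) ** 2).sum(-1)   # squared distances, p = 2
gamma, cost = discrete_ot_lp(C, np.full(n, 1 / n), np.full(n, 1 / n))

Even in this toy example the solver receives n^2 variables at once; the block coordinate methods below instead work with small subsets of these variables in each iteration.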

3 Random block coordinate descent for standard LP

In this section, we first generalize the LP problem (9) to a standard LP (see Eq.(10)). Then we propose a random block coordinate descent algorithm for resolution. Its almost sure convergence and linear convergence rate in expectation are analyzed. Last we introduce an inexact version that is implementable.

We consider the following standard LP problem:

minxNcTxsubject toAx=b,x0,\displaystyle\begin{aligned} \min_{x\in\mathbb{R}^{N}}\quad&c^{T}x\\ \mbox{subject to}\quad&Ax=b,\;x\geq 0,\end{aligned} (10)

where A\in\mathbb{R}^{M\times N}, b\in\mathbb{R}^{M}, c\in\mathbb{R}^{N}; hence M is the number of constraints and N is the total number of degrees of freedom. Assume throughout that M\leq N. Let \mathcal{N}\triangleq\{1,\ldots,N\} and denote \mathcal{X}\triangleq\{x\in\mathbb{R}^{N}\mid Ax=b,x\geq 0\} as the feasible set. Assume that the optimal value of (10) is finite and that an optimal solution exists. For any x\in\mathcal{X} and \mathcal{I}\subseteq\mathcal{N}, denote

\mathcal{D}(x;\mathcal{I})\triangleq\mbox{argmin}_{d\in\mathbb{R}^{N}}\left\{c^{T}d\mid x+d\geq 0,\,Ad=0,\,d_{i}=0,\;\forall i\in\mathcal{N}\setminus\mathcal{I}\right\}. (11)
q(x;\mathcal{I})\triangleq\min_{d\in\mathbb{R}^{N}}\left\{c^{T}d\mid x+d\geq 0,\,Ad=0,\,d_{i}=0,\;\forall i\in\mathcal{N}\setminus\mathcal{I}\right\}. (12)

Namely, 𝒟(x;)\mathcal{D}(x;\mathcal{I}) is the optimal solution set of the linear program in (11) and q(x;)q(x;\mathcal{I}) is the optimal function value. We have that q(x;)=cTdq(x;\mathcal{I})=c^{T}d for any d𝒟(x;)d\in\mathcal{D}(x;\mathcal{I}). Denote 𝒳\mathcal{X}^{*} as the optimal solution set of (10). Then the following equations hold for any x𝒳x\in\mathcal{X}:

𝒳\displaystyle\mathcal{X}^{*} =x+𝒟(x;𝒩),\displaystyle=x+\mathcal{D}(x;\mathcal{N}), (13)
q(x;𝒩)\displaystyle q(x;\mathcal{N}) =cTxcTx,x𝒳.\displaystyle=c^{T}x^{*}-c^{T}x,\quad\forall x^{*}\in\mathcal{X}^{*}. (14)
Remark 2

It is worth mentioning that since d=0 satisfies the constraints in (12), we have q(x;\mathcal{I})\leq 0 for all (x,\mathcal{I}). Furthermore, q(x;\mathcal{N})\leq q(x;\mathcal{I}).
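As an illustration of (11)-(12), the following sketch computes an element of \mathcal{D}(x;\mathcal{I}) together with q(x;\mathcal{I}) by passing only the |\mathcal{I}| free coordinates to a generic LP solver. It assumes NumPy and SciPy; the helper name solve_subproblem is ours, and any LP solver could play the same role.

import numpy as np
from scipy.optimize import linprog

def solve_subproblem(c, A, x, I):
    # Restricted LP (11)-(12): only the coordinates listed in I may move,
    # all other components of d are fixed to zero.
    I = np.asarray(I)
    A_I = A[:, I]                                   # columns of A indexed by I
    lb = -x[I]                                      # x_I + d_I >= 0  <=>  d_I >= -x_I
    res = linprog(c[I], A_eq=A_I, b_eq=np.zeros(A.shape[0]),
                  bounds=list(zip(lb, [None] * len(I))), method="highs")
    d = np.zeros_like(x, dtype=float)
    d[I] = res.x
    return d, res.fun                               # d in D(x; I), q(x; I) = c^T d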

Consider the block coordinate descent (BCD) method for (10):

finddk𝒟(xk,k),xk+1:=xk+dk,\displaystyle\begin{aligned} &\mbox{find}\;d^{k}\in\mathcal{D}(x^{k},\mathcal{I}_{k}),\\ &x^{k+1}:=x^{k}+d^{k},\end{aligned} (15)

where k𝒩\mathcal{I}_{k}\subset\mathcal{N} is the working set chosen at iteration kk. Next, we describe several approaches to select the working set k\mathcal{I}_{k}.

Gauss-Southwell-q rule

Motivated by the Gauss-Southwell-q rule introduced in tseng2009coordinate , we desire to select k\mathcal{I}_{k} such that

q(xk;k)vq(xk;𝒩),\displaystyle q(x^{k};\mathcal{I}_{k})\leq vq(x^{k};\mathcal{N}), (16)

for some constant v(0,1]v\in(0,1]. Note that by (14), we have

q(xk;𝒩)=cT(xxk),\displaystyle q(x^{k};\mathcal{N})=c^{T}(x^{*}-x^{k}), (17)

where xx^{*} is an optimal solution of (10). Therefore, (12)-(17) imply that

cTdk\displaystyle c^{T}d^{k} vcT(xxk)\displaystyle\leq vc^{T}(x^{*}-x^{k})
(15)cT(xk+1xk)\displaystyle\overset{\eqref{alg: bcd}}{\implies}c^{T}(x^{k+1}-x^{k}) vcT(xxk)\displaystyle\leq vc^{T}(x^{*}-x^{k})
cT(xk+1x)\displaystyle\implies c^{T}(x^{k+1}-x^{*}) (1v)cT(xkx).\displaystyle\leq(1-v)c^{T}(x^{k}-x^{*}). (18)

(18) indicates that the function value gap decays exponentially with rate 1-v, as long as we choose \mathcal{I}_{k} according to the Gauss-Southwell-q rule (16) at each iteration k. A trivial choice of \mathcal{I}_{k} satisfying (16) is \mathcal{N} with v=1. However, this choice results in a potentially large-scale subproblem in the BCD method (15), contradicting the purpose of using BCD. Instead, we should set an upper bound on the cardinality of \mathcal{I}_{k}, namely, a reasonable batch size that balances the computational effort in each iteration and the convergence performance of BCD. Next, we discuss the existence of such an \mathcal{I}_{k} given an upper bound l on |\mathcal{I}_{k}|, which necessitates the following concept.

Definition 1

A vector d¯N\bar{d}\in\mathbb{R}^{N} is conformal to dNd\in\mathbb{R}^{N} if

supp(d¯)supp(d),d¯idi0,i𝒩.\displaystyle\operatorname{supp}(\bar{d})\subseteq\operatorname{supp}(d),\;\bar{d}_{i}d_{i}\geq 0,\forall i\in\mathcal{N}.

The following theorem confirms the existence of such an \mathcal{I}_{k} satisfying (16); the proof closely follows that of (tseng2009block, , Proposition 6.1).

Theorem 3.1

Suppose that \mathop{\hbox{\rm rank}}(A)+1\leq N. Given any x\in\mathcal{X}, l\in\{\mathop{\hbox{\rm rank}}(A)+1,\ldots,N\} and d\in\mathcal{D}(x;\mathcal{N}), there exist a set \mathcal{I}\subseteq\mathcal{N} satisfying |\mathcal{I}|\leq l and a vector \bar{d}\in\operatorname{null}(A) conformal to d such that

\displaystyle\mathcal{I} =supp(d¯).\displaystyle=\operatorname{supp}(\bar{d}). (19)
q(x;)\displaystyle q(x;\mathcal{I}) 1Nl+1q(x;𝒩).\displaystyle\leq\frac{1}{N-l+1}q(x;\mathcal{N}). (20)
Proof

If d=0d=0, then let d¯=0\bar{d}=0 and =\mathcal{I}=\emptyset. We have q(x;)=q(x;𝒩)=0q(x;\mathcal{I})=q(x;\mathcal{N})=0. Therefore, both (19) and (20) are satisfied. If d0d\neq 0 and |supp(d)|l|\operatorname{supp}(d)|\leq l, then let d¯=d\bar{d}=d. Thus, =supp(d¯)\mathcal{I}=\operatorname{supp}(\bar{d}) satisfies ||l|\mathcal{I}|\leq l and q(x;)=q(x;𝒩)q(x;\mathcal{I})=q(x;\mathcal{N}). If |supp(d)|>l|\operatorname{supp}(d)|>l, then similar to the discussion in (tseng2009block, , Proposition 6.1), we have that

d=d(1)++d(r),\displaystyle d=d^{(1)}+\ldots+d^{(r)},

for some r|supp(d)|l+1r\leq|\operatorname{supp}(d)|-l+1 and some nonzero d(s)null(A)d^{(s)}\in\operatorname{null}(A) conformal to dd with |supp(d(s))|l|\operatorname{supp}(d^{(s)})|\leq l, s=1,,rs=1,...,r. Since |supp(d)|N|\operatorname{supp}(d)|\leq N, we have rNl+1r\leq N-l+1. Since Ad(s)=0Ad^{(s)}=0 and xi+di(s)xi+di0,s=1,,rx_{i}+d^{(s)}_{i}\geq x_{i}+d_{i}\geq 0,\forall s=1,...,r and i{idi<0}\forall i\in\{i\mid d_{i}<0\}, we have that x+d(s)𝒳,s=1,,rx+d^{(s)}\in\mathcal{X},\forall s=1,...,r. Therefore,

q(x;𝒩)\displaystyle q(x;\mathcal{N}) =cTd=s=1rcTd(s)rmins=1,,r{cTd(s)}.\displaystyle=c^{T}d=\sum_{s=1}^{r}c^{T}d^{(s)}\geq r\min_{s=1,\ldots,r}\{c^{T}d^{(s)}\}.

Denote s¯argmins=1,,r{cTd(s)}\bar{s}\in\mbox{argmin}_{s=1,\ldots,r}\{c^{T}d^{(s)}\} and let =supp(d(s¯))\mathcal{I}=\operatorname{supp}(d^{(\bar{s})}), then ||l|\mathcal{I}|\leq l and

q(x;𝒩)rcTd(s¯)rq(x;)(Nl+1)q(x;),\displaystyle q(x;\mathcal{N})\geq rc^{T}d^{(\bar{s})}\geq rq(x;\mathcal{I})\geq(N-l+1)q(x;\mathcal{I}),

where the last inequality holds because r\leq N-l+1 and q(x;\mathcal{I})\leq 0.

Therefore (19) and (20) hold for this \mathcal{I} and d¯=d(s¯)\bar{d}=d^{(\bar{s})}.  

However, it is not clear how to identify the set \mathcal{I} described in Theorem 3.1 with little computational effort for a general A. Therefore, we introduce the following rule.

Expected Gauss-Southwell-q rule

We introduce randomness in the selection of k\mathcal{I}_{k} to reduce the potential computation burden in identifying an k\mathcal{I}_{k} that satisfies (16). Consider an expected Gauss-Southwell-q rule:

𝔼[q(xk;k)k]vq(xk;𝒩),\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid\mathcal{F}_{k}]\leq vq(x^{k};\mathcal{N}), (21)

where v(0,1]v\in(0,1] is a constant, and k{x0,,xk}\mathcal{F}_{k}\triangleq\{x^{0},\ldots,x^{k}\} denotes the history of the algorithm. Therefore, using the notations of LP (10) and BCD method (15):

(12)(17)(21)\displaystyle\eqref{def: sd}\eqref{qkN}\eqref{EGSq} 𝔼[cTdkk]vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}d^{k}\mid\mathcal{F}_{k}]\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1xk)k]vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{k})\mid\mathcal{F}_{k}]\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1x)k]cT(xkx)vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{*})\mid\mathcal{F}_{k}]-c^{T}(x^{k}-x^{*})\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1x)k](1v)cT(xkx),\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{*})\mid\mathcal{F}_{k}]\leq(1-v)c^{T}(x^{k}-x^{*}), (22)

where xx^{*} is an optimal solution of (10). According to (polyak1987introduction, , Lemma 10, page 49), cT(xkx)0c^{T}(x^{k}-x^{*})\to 0 almost surely. Moreover, if we take expectations on both sides of (22),

𝔼[cT(xk+1x)]\displaystyle\mathbb{E}[c^{T}(x^{k+1}-x^{*})] (1v)𝔼[cT(xkx)]\displaystyle\leq(1-v)\mathbb{E}[c^{T}(x^{k}-x^{*})]
𝔼[cT(xkx)]\displaystyle\implies\mathbb{E}[c^{T}(x^{k}-x^{*})] (1v)k𝔼[cT(x0x)].\displaystyle\leq(1-v)^{k}\mathbb{E}[c^{T}(x^{0}-x^{*})].

i.e., the expectation of function value gap converges to 0 exponentially with rate 1v1-v.

Vanilla random block coordinate descent

Based on the expected Gauss-Southwell-qq rule, we formally propose a vanilla random block coordinate descent (RBCD0) algorithm (Algorithm 1) to solve the LP (10). Specifically, we choose the working set k\mathcal{I}_{k} with full randomness, that is, randomly choose an index set of cardinality ll out of 𝒩\mathcal{N}. Then with probability at least 1(Nl)\frac{1}{\binom{N}{l}}, the index set will be the same as or cover the working set suggested by Theorem 3.1. As a result, (21) will be satisfied with v1(Nl)(Nl+1)v\geq\frac{1}{\binom{N}{l}(N-l+1)}.

Algorithm 1 Vanilla random block coordinate descent (RBCD0)
  (Initialization) Choose feasible x0Nx^{0}\in\mathbb{R}^{N} and the batch size ll such that rank(A)+1lN\mathop{\hbox{\rm rank}}(A)+1\leq l\leq N.
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} uniformly randomly from 𝒩\mathcal{N} with |k|=l|\mathcal{I}_{k}|=l.
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

Based on the previous discussions, Algorithm 1 generates a sequence \{x^{k}\} such that the value of c^{T}x^{k} converges to the optimal value with probability 1. Moreover, the expectation of the optimality gap converges to 0 exponentially. It is important to note that \frac{1}{\binom{N}{l}(N-l+1)} is only a loose lower bound of v. This bound can become quite small when N grows large due to the binomial coefficient \binom{N}{l}. However, in our numerical experiments (cf. Sec. 5), this lower bound is rarely reached. In Section 4, we will discuss how to further improve this bound given the specific structure of the OT problem. Before that, we first investigate an inexact extension of RBCD0.
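For illustration, a minimal sketch of the RBCD0 loop follows, reusing the hypothetical solve_subproblem helper sketched earlier in this section; it is a schematic of Algorithm 1, not the exact implementation used in our experiments.

import numpy as np

def rbcd0(c, A, x0, l, num_iters, seed=0):
    # Algorithm 1 (RBCD0): draw a working set of cardinality l uniformly at
    # random, solve the restricted LP over it, then apply the block update.
    rng = np.random.default_rng(seed)
    N = len(c)
    x = x0.copy()
    for _ in range(num_iters):
        I = rng.choice(N, size=l, replace=False)    # Step 1: random working set
        d, _ = solve_subproblem(c, A, x, I)         # Step 2: d^k in D(x^k; I_k)
        x = x + d                                   # Step 3
    return x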

Inexact extension of Algorithm 1.

For any x and \mathcal{I}\subseteq\mathcal{N}, still denote \mathcal{D}(x;\mathcal{I}) and q(x;\mathcal{I}) as in (11) and (12). However, note that now x does not have to be feasible. We consider the inexact version of Algorithm 1, where Step 2 of Algorithm 1 is only approximately solved. We compute d^{k} such that for any d^{k}_{*}\in\mathcal{D}(x^{k};\mathcal{I}_{k}),

cTdk(1ϵk)cTdk,xk+dk0,Adkϵ^k,dik=0,ik.\displaystyle c^{T}d^{k}\leq(1-\epsilon_{k})c^{T}d^{k}_{*},\quad x^{k}+d^{k}\geq 0,\quad\|Ad^{k}\|\leq\hat{\epsilon}_{k},\quad d^{k}_{i}=0,\;\forall i\notin\mathcal{I}_{k}. (23)

where the inexactness sequences {ϵk}\{\epsilon_{k}\} and {ϵ^k}\{\hat{\epsilon}_{k}\} should be nonnegative. The inexact algorithm is described as follows.

Algorithm 1(a) Inexact random block coordinate descent (IRBCD)
  (Initialization) Choose feasible x0Nx^{0}\in\mathbb{R}^{N}, the batch size ll such that rank(A)+1lN\mathop{\hbox{\rm rank}}(A)+1\leq l\leq N, and inexactness sequences {ϵk}\{\epsilon_{k}\} and {ϵ^k}\{\hat{\epsilon}_{k}\}.
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} uniformly randomly from 𝒩\mathcal{N} with |k|=l|\mathcal{I}_{k}|=l.
     Step 2. Find dkd^{k} such that (23) holds.
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

Next, we analyze the sequence generated by Algorithm 1(a) and summarize the results in the following theorem.

Theorem 3.2

Consider Algorithm 1(a). Given δ>0\delta>0, suppose that 0ϵkϵ<10\leq\epsilon_{k}\leq\epsilon<1 and k=1ϵ^kδ\sum_{k=1}^{\infty}\hat{\epsilon}_{k}\leq\delta. Then we have that

1.

    Axkbδ\|Ax^{k}-b\|\leq\delta and xk0x^{k}\geq 0 a.s. for any k0k\geq 0,

2.

    cTxkcTxκδc^{T}x^{k}\geq c^{T}x^{*}-\kappa\delta for any k0k\geq 0,

3.

    𝔼[cTxk(cTx+κδ)](1(1ϵ)v)k𝔼[cTx0(cTx+κδ)]\mathbb{E}[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)]\leq(1-(1-\epsilon)v)^{k}\mathbb{E}[c^{T}x^{0}-(c^{T}x^{*}+\kappa\delta)],

where κ\kappa is a constant only depending on the matrix AA and vector cc, and vv is the constant in the expected Gauss-Southwell-q rule.

Proof

The proof of 1 is straightforward. Note that for any k\geq 0, we have in the almost sure sense that

Axkb=Ax0b+t=0k1AdtAx0b+t=0k1Adt(23)t=0k1ϵ^tδ.\displaystyle\|Ax^{k}-b\|=\left\|Ax^{0}-b+\sum_{t=0}^{k-1}Ad^{t}\right\|\leq\|Ax^{0}-b\|+\sum_{t=0}^{k-1}\|Ad^{t}\|\overset{\eqref{def: dk-inex}}{\leq}\sum_{t=0}^{k-1}\hat{\epsilon}_{t}\leq\delta.

x^{k}\geq 0 a.s. is also a direct result of (23) (x^{k}+d^{k}\geq 0). Now we prove 2. First, we argue that \mathcal{D}(x^{k};\mathcal{N})\neq\emptyset. Indeed, the linear program defining \mathcal{D}(x^{k};\mathcal{N}) is feasible (d=0 is feasible), and it is bounded below because the dual feasible set \{p\mid A^{T}p\leq c\} of (10) is nonempty; hence, by the duality theory of linear programming, an optimal solution exists. Suppose that x^{k+1}_{*}\in x^{k}+\mathcal{D}(x^{k};\mathcal{N}). Then we have

Axk+1b=Axkbδ.\displaystyle\|Ax^{k+1}_{*}-b\|=\|Ax^{k}-b\|\leq\delta.

We want to estimate cTxk+1c^{T}x^{k+1}_{*}. Suppose that p1,,psp^{1},\ldots,p^{s} are all the extreme points of the dual feasible region {pATpc}\{p\mid A^{T}p\leq c\} and denote bk=Axkb^{k}=Ax^{k}. Then by the theory of LP, we have

c^{T}x^{k+1}_{*}=\max_{i=1,\ldots,s}(b^{k})^{T}p^{i}=(b^{k})^{T}p^{i^{*}},

where i^{*} denotes an index attaining the maximum.

Note that we also have cTx=maxi=1,,sbTpic^{T}x^{*}=\max_{i=1,\ldots,s}b^{T}p^{i}, then

cTxk+1cTx(bkb)Tpibkbpi(maxi=1,,spi)δ.\displaystyle c^{T}x^{k+1}_{*}-c^{T}x^{*}\leq(b^{k}-b)^{T}p^{i^{*}}\leq\|b^{k}-b\|\|p^{i^{*}}\|\leq\left(\max_{i=1,\ldots,s}\|p^{i}\|\right)\delta. (24)

Similarly we can show

cTxkcTxcTxk+1cTx(maxi=1,,spi)δ,\displaystyle c^{T}x^{k}-c^{T}x^{*}\geq c^{T}x^{k+1}_{*}-c^{T}x^{*}\geq-\left(\max_{i=1,\ldots,s}\|p^{i}\|\right)\delta,

which concludes 2 if we let κmaxi=1,,spi\kappa\triangleq\max_{i=1,\ldots,s}\|p^{i}\|. According to Theorem 3.1 and discussion about the exact algorithm, we still have that the expected Gauss-Southwell-q rule (21) holds with v1(Nl)(Nl+1)v\geq\frac{1}{\binom{N}{l}(N-l+1)}. Then

𝔼[cTdkk](23)(1ϵk)𝔼[cTdkk](21)(1ϵk)vq(xk;𝒩)\displaystyle\mathbb{E}[c^{T}d^{k}\mid\mathcal{F}_{k}]\overset{\eqref{def: dk-inex}}{\leq}(1-\epsilon_{k})\mathbb{E}[c^{T}d^{k}_{*}\mid\mathcal{F}_{k}]\overset{\eqref{EGSq}}{\leq}(1-\epsilon_{k})vq(x^{k};\mathcal{N})
\displaystyle\implies 𝔼[cTxk+1cTxkk](1ϵ)v(cTxk+1cTxk)\displaystyle\mathbb{E}[c^{T}x^{k+1}-c^{T}x^{k}\mid\mathcal{F}_{k}]\leq(1-\epsilon)v(c^{T}x^{k+1}_{*}-c^{T}x^{k})
(24)\displaystyle\overset{\eqref{ineq: objerr}}{\implies} 𝔼[cTxk+1cTxkk](1ϵ)v(cTx+κδcTxk)\displaystyle\mathbb{E}[c^{T}x^{k+1}-c^{T}x^{k}\mid\mathcal{F}_{k}]\leq(1-\epsilon)v(c^{T}x^{*}+\kappa\delta-c^{T}x^{k})
\displaystyle\implies 𝔼[cTxk+1(cTx+κδ)k](1(1ϵ)v)[cTxk(cTx+κδ)],\displaystyle\mathbb{E}[c^{T}x^{k+1}-(c^{T}x^{*}+\kappa\delta)\mid\mathcal{F}_{k}]\leq(1-(1-\epsilon)v)[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)],
\displaystyle\implies 𝔼[cTxk(cTx+κδ)](1(1ϵ)v)k𝔼[cTx0(cTx+κδ)], for any k0.\displaystyle\mathbb{E}[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)]\leq(1-(1-\epsilon)v)^{k}\mathbb{E}[c^{T}x^{0}-(c^{T}x^{*}+\kappa\delta)],\mbox{ for any }k\geq 0.

 

Remark 3

(i). Theorem 3.2 justifies the inexact algorithm. In particular, if we let \delta be small, then the expected objective function value sequence also converges linearly to a vicinity of the optimal one. In the implementation, we could either let \hat{\epsilon}_{k} be a sequence proportional to 1/k^{\alpha} with \alpha>1 (for example, \hat{\epsilon}_{k}=6\delta/(\pi^{2}k^{2}) gives \sum_{k\geq 1}\hat{\epsilon}_{k}=\delta), or choose all of them equal to a sufficiently small number so that their accumulated sum is also insignificant. The choice of the sequence \hat{\epsilon}_{k} is less stringent if we occasionally project the iterates onto the feasible region so that the accumulated error is offset.
(ii). In the proof we implicitly assume that the dual feasible region has at least one extreme point. This holds if A has linearly independent rows. In fact, we can eliminate redundant rows of A. After this operation, the feasible region remains the same, and the analysis and implementation of the algorithms are also similar.
(iii). Inexact versions of the algorithms proposed in Section 4 follow a similar fashion, and we omit detailed and repetitive explanations.

4 Random block coordinate descent and optimal transport

Denote the cost matrix C(Ci,j)i,jC\triangleq(C_{i,j})_{i,j} in (9). Then calculating the OT between two measures with finite support (problem (9)) is a special case of (10), where c=vec(C)c=\operatorname{vec}(C), and N=n2N=n^{2}. The constraint matrix AA has the following structure:

A\triangleq\underbrace{\left(\begin{array}{llll}I_{n}&I_{n}&\ldots&I_{n}\\ {\bf 1}_{n}^{T}&&&\\ &{\bf 1}_{n}^{T}&&\\ &&\ddots&\\ &&&{\bf 1}_{n}^{T}\end{array}\right)}_{\text{$n$ blocks}}, (30)

where InI_{n} is an n×nn\times n identity matrix, 𝟏n{\bf 1}_{n} is an nn dimensional vector of all 11’s (then M=2nM=2n). Right hand side bb in (10) has the form b((r1)T,(r2)T)Tb\triangleq((r^{1})^{T},(r^{2})^{T})^{T}, where r1,r2+nr^{1},r^{2}\in\mathbb{R}^{n}_{+} can be two discrete probability distributions. Next, we discuss the property of matrix AA and null(A)\operatorname{null}(A).

Property of matrix AA

A nonzero dNd\in\mathbb{R}^{N} is an elementary vector of null(A)\operatorname{null}(A) if dnull(A)d\in\operatorname{null}(A) and there is no nonzero dnull(A)d^{\prime}\in\operatorname{null}(A) that is conformal to dd and supp(d)supp(d)\operatorname{supp}(d^{\prime})\neq\operatorname{supp}(d). According to the definition in (30), we say that a nonzero matrix XX is an elementary matrix of null(A)\operatorname{null}(A) if vec(X)\operatorname{vec}(X) is an elementary vector of null(A)\operatorname{null}(A). For simplicity, a matrix M1M^{1} being conformal to M2M^{2} means vec(M1)\operatorname{vec}(M^{1}) being conformal to vec(M2)\operatorname{vec}(M^{2}) for the rest of this paper.

Now we define a set A\mathcal{E}_{A}:

X\in\mathcal{E}_{A}\subseteq\mathbb{R}^{n\times n}\Longleftrightarrow X\neq 0, and after row and column permutations, X is
a multiple of one of the following matrices:
E^{2}=\begin{pmatrix}1&-1&&&\\ -1&1&&&\\ &&0&&\\ &&&\ddots&\\ &&&&0\end{pmatrix}_{n\times n},\quad E^{3}=\begin{pmatrix}1&&-1&&&\\ -1&1&&&&\\ &-1&1&&&\\ &&&0&&\\ &&&&\ddots&\\ &&&&&0\end{pmatrix}_{n\times n},\;\ldots,
E^{n-1}=\begin{pmatrix}1&&&&-1&\\ -1&1&&&&\\ &-1&1&&&\\ &&&\ddots&&\\ &&&-1&1&\\ &&&&&0\end{pmatrix}_{n\times n},\quad E^{n}=\begin{pmatrix}1&&&&-1\\ -1&1&&&\\ &-1&1&&\\ &&&\ddots&\\ &&&-1&1\end{pmatrix}_{n\times n}.
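As a quick sanity check, the following sketch (NumPy assumed; the helper name elementary_Et is ours) constructs E^t as defined above and verifies that its row and column sums vanish, which is precisely the condition vec(E^t) ∈ null(A) for A in (30).

import numpy as np

def elementary_Et(n, t):
    # E^t: +1 at (1,1),...,(t,t); -1 at (2,1),...,(t,t-1) and at (1,t);
    # all remaining entries are zero (1-based indexing as in the text).
    E = np.zeros((n, n))
    for i in range(t):
        E[i, i] = 1.0
        E[(i + 1) % t, i] = -1.0
    return E

n = 6
for t in range(2, n + 1):
    E = elementary_Et(n, t)
    assert np.allclose(E.sum(axis=0), 0) and np.allclose(E.sum(axis=1), 0)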

First, we state a Lemma about A\mathcal{E}_{A}, the proof of which is trivial and thus omitted.

Lemma 1

Every matrix in A\mathcal{E}_{A} is an elementary matrix of null(A)\operatorname{null}(A).

Then we show every element in the null space of AA is related to a matrix in A\mathcal{E}_{A}.

Lemma 2

For any nonzero DD such that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A), there exists XAX\in\mathcal{E}_{A} such that XX is conformal to DD.

Proof

We prove this by contradiction and induction. We assume a nonzero DD such that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A) and no XAX\in\mathcal{E}_{A} is conformal to DD.

Note that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A) implies

\sum_{i=1}^{n}D(i,\bar{j})=\sum_{j=1}^{n}D(\bar{i},j)=0,\quad\forall\,\bar{i},\bar{j}. (31)

Then without loss of generality, suppose that

D(1,1)>0.

Otherwise, since D is nonzero, we can permute rows/columns so that D(1,1)\neq 0; by (31), the entries in the first row/column after permutation cannot all have the same sign, so a further permutation makes D(1,1)>0.

By (31), the first column of D must have one negative element; suppose D(2,1)<0 WLOG. The second row of D must have one positive element, so suppose D(2,2)>0 WLOG. Since no X\in\mathcal{E}_{A} is conformal to D, we must have D(1,2)\geq 0; otherwise, a permuted copy of E^{2} would be conformal to D. Therefore, the 2\times 2 principal matrix of D has the following sign arrangement (after appropriate row/column permutations),

\begin{pmatrix}+&+/0\\ -&+\end{pmatrix},

where we use ++, +/0+/0, -, and /0-/0 to indicate that the corresponding entry is positive, nonnegative, negative, and nonpositive respectively.

If n=2, then the above pattern is impossible, as the first row sums to a positive number, contradicting (31). Suppose that n\geq 3. For mathematical induction, we assume the following:

(\mathbf{H}^{k}): after appropriate row/column permutations, the k\times k principal matrix of D has the following sign arrangement (2\leq k\leq n-1),

\begin{pmatrix}+&+/0&+/0&\ldots&+/0\\ -&+&+/0&\ddots&\vdots\\ -/0&-&+&\ddots&+/0\\ \vdots&\ddots&\ddots&\ddots&+/0\\ -/0&\ldots&-/0&-&+\end{pmatrix}, (32)

namely,

\left\{\begin{array}{ll}D(i,i)>0,&1\leq i\leq k;\\ D(i+1,i)<0,&1\leq i\leq k-1;\\ D(i,j)\geq 0,&1\leq i\leq j\leq k;\\ D(i,j)\leq 0,&1\leq j<i\leq k.\end{array}\right.

The kth column of D needs to have at least one negative element; otherwise, it contradicts (31). WLOG, we suppose D(k+1,k)<0.

We now claim that the rest of the elements in the (k+1)th row have to be nonpositive, namely D(k+1,i)\leq 0 for all i=1,\ldots,k-1. Otherwise, let i_{0} be the largest index in \{1,\ldots,k-1\} such that D(k+1,i_{0})>0. Then the submatrix D(i_{0}+1:k+1,\,i_{0}:k) takes the form

\begin{pmatrix}-&+&+/0&\ldots&+/0\\ -/0&-&+&\ddots&\vdots\\ \vdots&\ddots&\ddots&\ddots&+/0\\ -/0&\ldots&-/0&-&+\\ +&-/0&\ldots&-/0&-\end{pmatrix}, (33)

and the corresponding entries of D then support a permuted multiple of E^{k-i_{0}+1} that is conformal to D, contradicting our assumption. To see this, we can move the first column of (33) to the last position. For the whole matrix D, this is equivalent to moving the i_{0}th column, inserting it between the kth and (k+1)th columns, and shifting the resulting submatrix to the upper left corner through permutation operations.

Meanwhile, due to (31), the (k+1)th row of D needs to have at least one positive element, and we have just shown that D(k+1,i)\leq 0 for all i=1,\ldots,k, so WLOG we suppose D(k+1,k+1)>0. A similar argument shows that, since no X\in\mathcal{E}_{A} is conformal to D, we must have D(i,k+1)\geq 0 for all i=1,\ldots,k.

Therefore, the (k+1)×(k+1)(k+1)\times(k+1) principal matrix of DD has exactly the same sign pattern as indicated by (32), after appropriate row/column permutations. So we have the induction (𝐇k)(𝐇k+1)(\mathbf{H}^{k})\Rightarrow(\mathbf{H}^{k+1}).

However, when the n\times n matrix D satisfies (\mathbf{H}^{n}), it has the sign pattern (32) after row/column permutations. Then the sum of the last column is strictly positive, as indicated by the sign pattern, which contradicts (31).

Finally, we show that A\mathcal{E}_{A} characterizes all the elementary matrices.

Theorem 4.1

Given any nonzero D\in\mathbb{R}^{n\times n}, if \operatorname{vec}(D)\in\operatorname{null}(A), then D has a conformal realization (rockafellar1999network, , Section 10B), namely:

D=D(1)+D(2)++D(s),\displaystyle D=D^{(1)}+D^{(2)}+\ldots+D^{(s)}, (34)

where D(1),,D(s)D^{(1)},\ldots,D^{(s)} are elementary matrices of null(A)\operatorname{null}(A) and D(i)D^{(i)} is conformal to DD, for all i=1,,si=1,\ldots,s. In particular, D(i)AD^{(i)}\in\mathcal{E}_{A}, i=1,,s\forall i=1,...,s. Therefore, A\mathcal{E}_{A} includes all the elementary matrices of null(A)\operatorname{null}(A).

Proof

By Lemma 2, we suppose that X(1)AX^{(1)}\in\mathcal{E}_{A} and X(1)X^{(1)} is conformal to DD. Then X(1)X^{(1)} can be scaled properly by α1=min(i,j)supp(X(1))supp(D)|D(i,j)/X(1)(i,j)|>0\alpha_{1}=\min_{(i,j)\in\operatorname{supp}(X^{(1)})\cap\operatorname{supp}(D)}|D(i,j)/X^{(1)}(i,j)|>0 such that |supp(Dα1X(1))|<|supp(D)||\operatorname{supp}(D-\alpha_{1}X^{(1)})|<|\operatorname{supp}(D)| and Dα1X(1)D-\alpha_{1}X^{(1)} is conformal to DD. Denote D(1)α1X(1)D^{(1)}\triangleq\alpha_{1}X^{(1)} and D¯(1)=DD(1)\bar{D}^{(1)}=D-D^{(1)}, both of them are conformal to DD.

By applying the same procedure to \bar{D}^{(1)}, we get \bar{D}^{(2)} such that |\operatorname{supp}(\bar{D}^{(2)})|<|\operatorname{supp}(\bar{D}^{(1)})|. We can repeat this process, and eventually the conformal realization (34) holds, since |\operatorname{supp}(D)|\leq n^{2} and |\operatorname{supp}(\bar{D}^{(k)})| is strictly decreasing.

If D is an elementary matrix, then in the previous construction of (34) we have D=D^{(1)}+\bar{D}^{(1)}, and both terms are conformal to D. If \bar{D}^{(1)}\neq 0, then \bar{D}^{(1)} is a nonzero element of \operatorname{null}(A) conformal to D with |\operatorname{supp}(\bar{D}^{(1)})|<|\operatorname{supp}(D)|, so \operatorname{supp}(\bar{D}^{(1)})\neq\operatorname{supp}(D). This contradicts the assumption that D is an elementary matrix. When \bar{D}^{(1)}=0, D=D^{(1)} is a multiple of one of the special matrices in the description of \mathcal{E}_{A} after a certain row/column permutation. Summing up, \mathcal{E}_{A} includes all the elementary matrices of \operatorname{null}(A).

Remark 4

For a given nonzero D such that \operatorname{vec}(D)\in\operatorname{null}(A), a simple algorithm following the proof of Theorem 4.1 to find an elementary matrix X conformal to D will cost at most O(n^{2}) operations. We can select an appropriate \alpha>0 such that D-\alpha X is conformal to D and |\operatorname{supp}(D-\alpha X)|<|\operatorname{supp}(D)|. By repeating this process, we can find the conformal realization in at most |\operatorname{supp}(D)|\leq n^{2} steps. Therefore, the total number of operations needed to find the conformal realization is O(n^{4}). In comparison, the approach proposed by tseng2010coordinate finds a conformal realization with support cardinality less than l (usually, l is much smaller than n^{2}) and requires O(n^{3}(n^{2}-l)^{2}) operations.
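The following minimal sketch, assuming NumPy and using our own function name and tolerance handling, illustrates the cycle search behind this remark: starting from a positive entry of D, it alternately moves to a negative entry in the same column and a positive entry in the same row, and closes a cycle as soon as a row or column index repeats. The resulting ±1 cycle matrix is a permuted copy of some E^t, hence an elementary matrix of null(A), conformal to D.

import numpy as np

def find_elementary_conformal(D, tol=1e-12):
    # Input: a nonzero D whose row and column sums all vanish (entries well
    # above tol in magnitude). Output: a {+1, 0, -1} simple-cycle matrix X,
    # a permuted copy of some E^t, that is conformal to D.
    n = D.shape[0]
    i0, j0 = map(int, np.argwhere(D > tol)[0])      # start at a positive entry
    rows, cols = [i0], [j0]                         # positive corners visited: (rows[s], cols[s])
    while True:
        # the current column has a negative entry because its sum is zero
        i_new = next(r for r in range(n) if D[r, cols[-1]] < -tol)
        if i_new in rows:                           # a row repeats: close the cycle
            k = rows.index(i_new)
            rows, cols = rows[k:], cols[k:]
            break
        rows.append(i_new)
        # the new row has a positive entry because its sum is zero
        j_new = next(c for c in range(n) if D[i_new, c] > tol)
        if j_new in cols:                           # a column repeats: close the cycle
            m = cols.index(j_new)
            rows, cols = rows[m + 1:], cols[m + 1:] + [j_new]
            break
        cols.append(j_new)
    L = len(rows)
    X = np.zeros_like(D, dtype=float)
    for s in range(L):
        X[rows[s], cols[s]] = 1.0                   # positive corner of the cycle
        X[rows[(s + 1) % L], cols[s]] = -1.0        # negative corner in the same column
    return X

Peeling off a suitable multiple α of the returned matrix, as in the proof of Theorem 4.1, and repeating on the remainder yields the conformal realization (34).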

Working set selection

By analyzing the structure of the elementary matrices of \operatorname{null}(A), we gain a better idea of potential directions along which the transport cost is reduced by a large amount. This is supported by the following theorem, where we continue using the notation introduced in Section 3.

Theorem 4.2

Consider the linear program (10) where A\in\mathbb{R}^{M\times N} and b\in\mathbb{R}^{M} are defined as in (30) (M=2n, N=n^{2}). Given any X\in\mathbb{R}^{n\times n} and D\in\mathbb{R}^{n\times n} such that \operatorname{vec}(X)\in\mathcal{X} and \operatorname{vec}(D)\in\mathcal{D}(\operatorname{vec}(X);\mathcal{N}), there exists an elementary matrix \bar{D} of \operatorname{null}(A) conformal to D such that for any set \mathcal{I}\subseteq\mathcal{N} satisfying

\mathcal{I}\supseteq\operatorname{supp}(\operatorname{vec}(\bar{D})),

we have

q(vec(X);)(1n23)q(vec(X);𝒩).\displaystyle q(\operatorname{vec}(X);\mathcal{I})\leq\left(\frac{1}{n^{2}-3}\right)q(\operatorname{vec}(X);\mathcal{N}). (35)
Proof

Since vec(D)𝒟(vec(X);𝒩)\operatorname{vec}(D)\in\mathcal{D}(\operatorname{vec}(X);\mathcal{N}), vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A). Then based on Theorem 4.1, we have the conformal realization:

D=D(1)+D(2)++D(s).\displaystyle D=D^{(1)}+D^{(2)}+...+D^{(s)}.

Moreover, proof of Theorem 4.1 indicates that we can construct this realization with sn23s\leq n^{2}-3, because the support of D(i)D^{(i)} has cardinality at least 44. Then similar to discussion in Theorem 3.1, we may find s¯{1,,s}\bar{s}\in\{1,\ldots,s\} such that D¯=D(s¯)\bar{D}=D^{(\bar{s})}, supp(vec(D(s¯)))\mathcal{I}\supseteq\operatorname{supp}(\operatorname{vec}(D^{(\bar{s})})), and

q(vec(X);𝒩)(n23)q(vec(X);).\displaystyle q(\operatorname{vec}(X);\mathcal{N})\geq(n^{2}-3)q(\operatorname{vec}(X);\mathcal{I}).

 

Now we discuss two approaches to carefully select the working set \mathcal{I}_{k} at iteration k of the block coordinate descent method (15) (a short code sketch illustrating both rules follows the list):

1.

    Diagonal band. Given 3pn3\leq p\leq n, denote

\mathcal{G}\triangleq\left\{(i,j)\in\mathbb{Z}^{2}\;\Big{|}\;\begin{array}{ll}i\in[j,j+p-1]&\text{if}\;j\in[1,n-p+1]\\ i\in[1,j+p-n-1]\cup[j,n]&\text{if}\;j\in[n-p+2,n]\end{array}\right\},

    and construct matrix Gn×nG\in\mathbb{R}^{n\times n} such that

G(i,j)=\begin{cases}1,&\text{if }(i,j)\in\mathcal{G},\\ 0,&\text{otherwise.}\end{cases} (36)

Therefore, G has the following structure, where each column contains p ones and the band wraps around into the top rows of the last p-1 columns:

G=\begin{pmatrix}1&&&&1&\ldots&1\\ \vdots&1&&&&\ddots&\vdots\\ 1&\vdots&\ddots&&&&1\\ 1&1&&1&&&\\ &1&\ddots&\vdots&1&&\\ &&\ddots&1&\vdots&\ddots&\\ &&&1&1&\ldots&1\end{pmatrix}_{n\times n}

It is like a band of width p across the diagonal, hence the name. Then we may construct \bar{D}^{k}\in\mathbb{R}^{n\times n} and \mathcal{I}_{k} as follows:

\begin{aligned}&\text{Obtain $\bar{D}^{k}$ by uniformly randomly permuting all columns and rows of $G$.}\\ &\text{Let }\mathcal{I}_{k}\triangleq\operatorname{supp}(\operatorname{vec}(\bar{D}^{k})).\end{aligned} (37)

    Note that |k|=np|\mathcal{I}_{k}|=np.

2.

    Submatrix. Given m<nm<n, obtain D¯k\bar{D}^{k} and k\mathcal{I}_{k} such that

\begin{aligned}&\text{Uniformly randomly pick two sets of $m$ different numbers out of $\{1,\ldots,n\}$:}\\ &\quad i_{1},\ldots,i_{m}\text{ and }j_{1},\ldots,j_{m}.\\ &\text{Let }\bar{D}^{k}(i,j)=\begin{cases}1&\text{if }i\in\{i_{1},\ldots,i_{m}\}\text{ and }j\in\{j_{1},\ldots,j_{m}\},\\ 0&\text{otherwise.}\end{cases}\\ &\text{Let }\mathcal{I}_{k}\triangleq\operatorname{supp}(\operatorname{vec}(\bar{D}^{k})).\end{aligned} (38)

    In this case, the support of D¯k\bar{D}^{k} is a submatrix of size m×mm\times m. Therefore, |k|=m2|\mathcal{I}_{k}|=m^{2}.
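Both selection rules can be realized with a few lines of code. The following sketch assumes NumPy, uses our own helper names, and returns the working set as indices into the column-stacked vector vec(·) used throughout.

import numpy as np

def band_mask(n, p):
    # 0/1 matrix G of (36): a band of width p starting at the diagonal and
    # wrapping around, so each column contains exactly p ones.
    G = np.zeros((n, n))
    for j in range(n):
        G[(j + np.arange(p)) % n, j] = 1.0
    return G

def working_set_band(n, p, rng):
    # Rule (37): uniformly random row and column permutations of G.
    Dbar = band_mask(n, p)[rng.permutation(n), :][:, rng.permutation(n)]
    return np.flatnonzero(Dbar.flatten(order="F"))     # |I_k| = n p

def working_set_submatrix(n, m, rng):
    # Rule (38): an m-by-m submatrix with random row and column index sets.
    I_rows = rng.choice(n, size=m, replace=False)
    J_cols = rng.choice(n, size=m, replace=False)
    Dbar = np.zeros((n, n))
    Dbar[np.ix_(I_rows, J_cols)] = 1.0
    return np.flatnonzero(Dbar.flatten(order="F"))     # |I_k| = m^2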

Next, we discuss two random block coordinate descent algorithms to solve the LP problem (10) with AA given in (30) whose working set selections are based on the two approaches discussed above.

Algorithm 2 Random block coordinate descent - diagonal band (RBCD-DB)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n} and band width p[3,n]p\in[3,n]. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}).
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} according to (37).
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

The following result describes the convergence property of Algorithm 2.

Theorem 4.3

Consider the LP problem (10) with A given in (30). Then the sequences \{x^{k}\} and \{\mathcal{I}_{k}\} generated by Algorithm 2 satisfy the expected Gauss-Southwell-q rule (21), i.e.,

𝔼[q(xk;k)k]vq(xk;𝒩),\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid\mathcal{F}_{k}]\leq vq(x^{k};\mathcal{N}),

with vn(p2)(n23)(n!)2v\geq\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}. Therefore, cT(xkx)0c^{T}(x^{k}-x^{*})\to 0 almost surely and 𝔼[cT(xkx)]\mathbb{E}[c^{T}(x^{k}-x^{*})] converges to 0 exponentially with rate 1v1-v.

Proof

Given xkx^{k}, Theorem 4.2 guarantees that there exists DkAD^{k}\in\mathcal{E}_{A} such that if ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})), then (35) holds for =k\mathcal{I}=\mathcal{I}_{k} and vec(X)=xk\operatorname{vec}(X)=x^{k}, i.e.,

q(xk;k)(1n23)q(xk;𝒩).\displaystyle q(x^{k};\mathcal{I}_{k})\leq\left(\frac{1}{n^{2}-3}\right)q(x^{k};\mathcal{N}). (39)

Next, we will estimate the probability that ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})) holds.

Suppose that after row/column permutations and scaling of DkD^{k}, we obtain EtE^{t}, 2tn2\leq t\leq n. Then after appropriate row and column swapping, DkD^{k} can be written as

\begin{pmatrix}0&&&&&&&&\\ *&*&&&&&&&\\ *&&*&&&&&&\\ &*&&\ddots&&&&&\\ &&*&&*&&&&\\ &&&\ddots&&*&&&\\ &&&&*&*&0&&\\ &&&&&&&\ddots&\\ &&&&&&&&0\end{pmatrix}_{n\times n}. (52)

That is, elements (2,1) and (3,1) are nonzero; elements (j,j) and (\operatorname{mod}(j+2,n),j) are nonzero for all j=2,\ldots,t-1; elements (t,t) and (\operatorname{mod}(t+1,n),t) are nonzero; all other elements are zero. Obviously, the support of this matrix is covered by the support of G in (36). Moreover, by moving the whole support in matrix (52) downwards or to the bottom right corner, we can create at least n(p-2)-1 more different matrices whose supports are all covered by that of G. These n(p-2) matrices can be obtained by permuting rows and columns of D^{k} in n(p-2) different ways. Therefore, the probability that \mathcal{I}_{k} covers the support of D^{k} is at least \frac{n(p-2)}{(n!)^{2}}, and we have that

𝔼[q(xk;k)xk]\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid x^{k}] =supp(vec(Dk))q(xk;)P(k=)+supp(vec(Dk))q(xk;)P(k=)\displaystyle=\sum_{\operatorname{supp}(\operatorname{vec}(D^{k}))\subseteq\mathcal{I}}q(x^{k};\mathcal{I})P(\mathcal{I}_{k}=\mathcal{I})+\sum_{\operatorname{supp}(\operatorname{vec}(D^{k}))\nsubseteq\mathcal{I}}q(x^{k};\mathcal{I})P(\mathcal{I}_{k}=\mathcal{I})
(1n23)q(xk;𝒩)P(supp(vec(Dk))k)+0\displaystyle\leq\left(\frac{1}{n^{2}-3}\right)q(x^{k};\mathcal{N})P(\operatorname{supp}(\operatorname{vec}(D^{k}))\subseteq\mathcal{I}_{k})+0
(n(p2)(n23)(n!)2)q(xk;𝒩)\displaystyle\leq\left(\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}\right)q(x^{k};\mathcal{N})

Therefore, the expected Gauss-Southwell-qq rule (21) holds with vv at least n(p2)(n23)(n!)2\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}.  

Remark 5

It can be shown that if $n$ is large enough and $p$ is chosen between $O(\log(n))$ and $O(n)$, then the lower bound for the constant $v$ derived in Theorem 4.3 is better than the one estimated for Algorithm 1, i.e., $\frac{1}{\binom{N}{l}(N-l+1)}$. In fact, we have the following result.

Lemma 3

Suppose that K¯2\bar{K}\geq 2 and η>0\eta>0 satisfies

2K¯32(K¯1)+log(K¯2)>2/η,\displaystyle\frac{2\bar{K}-3}{2(\bar{K}-1)}+\log\left(\frac{\bar{K}}{2}\right)>2/\eta,

and nn satisfies

n4(2K¯32(K¯1)+log(K¯2))η2,nlog(n)ηK¯.\displaystyle n\geq\frac{4}{\left(\frac{2\bar{K}-3}{2(\bar{K}-1)}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta-2},\quad\frac{n}{\log(n)}\geq\eta\bar{K}.

Then for any p[ηlog(n),nK¯]p\in[\eta\log(n),\frac{n}{\bar{K}}], and p3p\geq 3, we have n(p2)(n23)(n!)21(n2np)(n2np+1)\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}\geq\frac{1}{\binom{n^{2}}{np}(n^{2}-np+1)}.

Proof

See Appendix.  

Let $n\geq 30$, $\eta=1$, $\bar{K}=8$. Then, according to Lemma 3, for $\log(n)\leq p\leq n/8$, the lower bound $\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}$ is larger. We believe that this is a fairly reasonable range of $p$ when $n$ grows large. This lower bound is improved because we exploit the structure of the elementary matrices when solving OT problems.

As for the submatrix approach, we often find it quite efficient in numerical experiments. The optimal solution obtained with the submatrix approach can also be decomposed in $\mathcal{E}_{A}$. More precisely, suppose that $\mathcal{I}_{k}$ is chosen according to (38) and is associated with a submatrix of size $m\times m$. Then the decomposition of $d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k})$ (more rigorously, of its matrix form) only involves multiples and permutations of the elementary matrices in $\{E^{2},E^{3},\cdots,E^{m}\}$. However, global convergence is not guaranteed when the working set is restricted to a fixed-size submatrix; in fact, there is a counterexample (see Appendix B). Therefore, we propose an algorithm that combines these two approaches.

Algorithm 3 Random block coordinate descent - submatrix and diagonal Band (RBCD-SDB)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n}, submatrix row/column dimension mm, band width p[3,n]p\in[3,n] and selection parameter s[0,1]s\in[0,1]. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}).
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. With probability ss, choose k\mathcal{I}_{k} according to (37); otherwise, choose k\mathcal{I}_{k} according to (38).
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

The convergence of Algorithm 3 is guaranteed by the next theorem.

Theorem 4.4

Consider the LP problem (10) with $A$ given in (30). Then the sequences $\{x^{k}\}$ and $\{\mathcal{I}_{k}\}$ generated by Algorithm 3 satisfy the expected Gauss-Southwell-$q$ rule (21), with $v\geq\frac{sn(p-2)}{(n^{2}-3)(n!)^{2}}$. Therefore, $c^{T}(x^{k}-x^{*})\to 0$ almost surely and $\mathbb{E}[c^{T}(x^{k}-x^{*})]$ converges to 0 exponentially with rate $1-v$.

Proof

Given xkx^{k}, Theorem 4.2 shows that there exists DkAD^{k}\in\mathcal{E}_{A} such that if ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})), then (35) holds with =k\mathcal{I}=\mathcal{I}_{k} and vec(X)=xk\operatorname{vec}(X)=x^{k}. We estimate the probability that ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})).

First, consider the case where, after row/column permutations and scaling of $D^{k}$, we obtain $E^{t}$ with $2\leq t\leq m$. If $\mathcal{I}_{k}$ is chosen according to (37), then, similar to the discussion in Theorem 4.3, $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability at least $\frac{n(p-2)}{(n!)^{2}}$. If $\mathcal{I}_{k}$ is chosen according to (38), then $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability

(ntmt)2(nm)2=((nt)!/((mt)!(nm)!)n!/(m!(nm)!))2=(m!/(mt)!n!/(nt)!)2.\displaystyle\frac{\binom{n-t}{m-t}^{2}}{\binom{n}{m}^{2}}=\left(\frac{(n-t)!/((m-t)!(n-m)!)}{n!/(m!(n-m)!)}\right)^{2}=\left(\frac{m!/(m-t)!}{n!/(n-t)!}\right)^{2}.

Therefore, in this case, the probability $p_{t}$ that $\mathcal{I}_{k}$ covers the support of $D^{k}$ satisfies

ptsn(p2)(n!)2+(1s)(m!/(mt)!n!/(nt)!)2.\displaystyle p_{t}\geq\frac{sn(p-2)}{(n!)^{2}}+(1-s)\left(\frac{m!/(m-t)!}{n!/(n-t)!}\right)^{2}.

Next, consider the case where we obtain $E^{t}$, $m+1\leq t\leq n$, after row/column permutations and rescaling of $D^{k}$. In this case, if $\mathcal{I}_{k}$ is chosen according to (37), $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability at least $\frac{n(p-2)}{(n!)^{2}}$; if $\mathcal{I}_{k}$ is chosen according to (38), this probability is 0. Therefore, in this case we have $p_{t}\geq\frac{sn(p-2)}{(n!)^{2}}$. In general, the probability that $\mathcal{I}_{k}$ covers the support of $D^{k}$ is at least $\min_{t}\{p_{t}\}\geq\frac{sn(p-2)}{(n!)^{2}}$. Similar to the discussion in the proof of Theorem 4.3, (21) then holds with $v\geq\frac{sn(p-2)}{(n^{2}-3)(n!)^{2}}$.

Accelerated random block coordinate descent

Algorithm 4 is an accelerated random block coordinate descent (ARBCD) algorithm. It intermittently selects the working set $\mathcal{I}_{k}$ in a different way from Algorithm 3 for acceleration purposes. At certain iterations, we construct $\mathcal{I}_{k}$ based on the iterates generated by the algorithm in the past, i.e., on $x^{end}-x^{start}$. This vector reflects the progress achieved by running RBCD-SDB for a few iterations and predicts the direction in which the algorithm could potentially make further improvements. This choice is analogous to the momentum concept often employed in acceleration techniques in optimization, such as the heavy ball method and Nesterov acceleration. Algorithm 4 has a convergence rate similar to that of Algorithm 3 (note that the acceleration iterations occur only occasionally). We will verify its improved performance in the numerical experiments.

Algorithm 4 Accelerated random block coordinate descent (ARBCD)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n}, submatrix row/column dimension mm, band width p[3,n]p\in[3,n], selection parameter s[0,1]s\in[0,1], and acceleration interval TT. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}), xstart=xend=x0x^{start}=x^{end}=x^{0}. Binary variable accacc.
  for k=0,1,2,k=0,1,2,\ldots do
     Step 1. Choose k\mathcal{I}_{k} as following.
     if mod(k+1,T)0\operatorname{mod}(k+1,T)\neq 0 or |supp(xendxstart)|m2|\operatorname{supp}(x^{end}-x^{start})|\leq m^{2} then
        acc=falseacc=\texttt{false}. With probability ss, choose k\mathcal{I}_{k} according to (37); otherwise, choose k\mathcal{I}_{k} according to (38).
     else
        acc=trueacc=\texttt{true}. Choose k\mathcal{I}_{k} uniformly randomly from supp(xendxstart)\operatorname{supp}(x^{end}-x^{start}) so that |k|=m2|\mathcal{I}_{k}|=m^{2}.
     end if
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. Update xk+1:=xk+dk;x^{k+1}:=x^{k}+d^{k};
     Step 4. Update xend=xk+1.x^{end}=x^{k+1}.
     if acc=trueacc=\texttt{true}then
        Update xstart=xk+1.x^{start}=x^{k+1}.
     end if
  end for
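The acceleration branch of Step 1 can be sketched as follows (a rough Matlab illustration; the variable names and the fallback branch are placeholders):

% Step 1 of ARBCD: every T-th iteration, if the momentum vector xend - xstart
% has a large enough support, draw the working set from that support.
S = find(xend ~= xstart);              % support of the progress direction
if mod(k+1, T) == 0 && numel(S) > m^2
    acc = true;
    Ik = S(randperm(numel(S), m^2));   % |Ik| = m^2 indices from supp(xend - xstart)
else
    acc = false;
    % choose Ik by the diagonal-band rule (37) with probability s, else by (38)
end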

5 Numerical experiments

In this section, we conduct numerical experiments on various examples of OT problems. (All experiments are conducted using Matlab R2021b on a Dell OptiPlex 7090 with CPU: Intel(R) Core(TM) i9-10900 @ 2.80GHz (20 CPUs) and RAM: 65536 MB. Data and codes are uploaded to https://github.com/yue-xie/RBCDforOT.) In Section 5.1, we compare the random block coordinate descent methods with the different working set selections proposed in this article. Then, we compare the best-performing method, ARBCD, with Sinkhorn in Section 5.2.1 and with an interior point-inspired algorithm in Section 5.2.2. Finally, a large-scale OT problem is solved using ARBCD in Section 5.3.

5.1 Comparison of random block coordinate descent methods with different working set selection rules

In this subsection, we apply the proposed random block coordinate descent methods (Alg. 1 - Alg. 4) to calculate the Wasserstein distance between three pairs of distributions. We compare these algorithms to illustrate the difference between various working set selection rules. Additionally, we inspect the differences between the theoretical and actual convergence rates, as well as the solution sparsity.

Experiment settings

We compute the Wasserstein distance between a pair of 1-dim probability distributions (standard normal and uniform over $[-1,1]$), a pair of 2-dim probability distributions (uniform over $[-\pi,\pi]^{2}$ and an empirical invariant measure obtained from an interacting particle method simulation of reaction-diffusion particles in advection flows; detailed configurations can be found in wang2022deepparticle , Section 4.2, 2D cellular flow, $\kappa=2^{-4}$), and a pair of 3-dim distributions (uniform over $[-1,1]^{3}$ and a 3-dimensional multivariate normal distribution). When computing the Wasserstein distance between the pair of 1-dim probability distributions, we utilize their histograms (c.f. Section 2): Let $n=1001$. Centers of the cells are $y^{i}=\tilde{y}^{i}=\frac{i-501}{500}$, $i=1,\ldots,1001$; $C_{i,j}=\operatorname{dist}(\tilde{y}^{i},y^{j})^{2}$, $1\leq i,j\leq 1001$; $r^{1}_{i}=\frac{\phi(y^{i})}{\sum_{i=1}^{1001}\phi(y^{i})}$, $i=1,\ldots,1001$, where $\phi(y)$ is the pdf of the standard normal; $r^{2}=(1/1001,\ldots,1/1001)^{T}\in\mathbb{R}^{1001}$. When calculating the Wasserstein distance for the second and third pairs, we apply the point cloud setting (c.f. Section 2): Let $n=1000$. For each pair, use i.i.d. samples $\{\tilde{y}^{i}\}$ and $\{y^{j}\}$, $1\leq i,j\leq 1000$, to approximate the two continuous probability measures respectively. Let $C_{i,j}=\operatorname{dist}(\tilde{y}^{i},y^{j})^{2}$ and $r^{1}=r^{2}=(1/1000,\ldots,1/1000)^{T}\in\mathbb{R}^{1000}$. In all cases, we normalize the cost matrix $C$ such that its maximal element is $1$. Figure 1 depicts these three pairs of distributions. For all cases, we first use linprog in Matlab to find a solution with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-10).
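For reference, a minimal sketch of how such a benchmark solve can be set up in Matlab is given below; it assumes $x=\operatorname{vec}(X)$ in column-major ordering and that $C$, $r^{1}$, $r^{2}$ have already been constructed (the exact ordering used in (30) may differ):

% Benchmark: solve min c'*x s.t. X*1 = r1, X'*1 = r2, x >= 0 at high precision.
c    = C(:);                                    % vectorized cost matrix
Aeq  = [kron(ones(1,n), speye(n));              % row sums of X equal r1
        kron(speye(n), ones(1,n))];             % column sums of X equal r2
beq  = [r1; r2];
opts = optimoptions('linprog', 'Algorithm', 'dual-simplex', ...
                    'ConstraintTolerance', 1e-9, 'OptimalityTolerance', 1e-10);
xstar = linprog(c, [], [], Aeq, beq, zeros(n^2,1), [], opts);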

Figure 1: Three pairs of distributions

In the 1-d case, we compare the histograms of the two distributions; in the 2-d and 3-d settings, we compare the samples/point clouds of the two distributions.

Methods

We specify the settings of the four algorithms. All algorithms are started at the same feasible x0=vec(r1(r2)T)x^{0}=\operatorname{vec}(r^{1}(r^{2})^{T}) in each experiment. We solve the LP subproblems via linprog in Matlab with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-7).
RBCD0. Algorithm 1: Vanilla random block coordinate descent. Let l=1002l=100^{2}. Stop the algorithm after 50005000 iterations.
RBCD-DB. Algorithm 2: Random block coordinate descent - diagonal band. Let p=1002/np=\lfloor 100^{2}/n\rfloor. Stop the algorithm after 50005000 iterations.
RBCD-SDB. Algorithm 3: Random block coordinate descent - submatrix and diagonal band. Let m=100m=100, p=m2/np=\lfloor m^{2}/n\rfloor and s=0.1s=0.1. Stop the algorithm after 50005000 iterations.
ARBCD. Algorithm 4: Accelerated random block coordinate descent. Let m=100m=100, p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. Stop the algorithm after 50005000 iterations. Note that the degree of freedom of the subproblem per iteration is 1002100^{2}, about 1/1001/100 the size of the original one.

Figure 2: Comparison of algorithms to compute Wasserstein distance I

The x-axis is the number of iterations. The y-axis is the optimality gap $f_{k}-f^{*}=c^{T}x^{k}-c^{T}x^{*}$ (first row) or the corresponding feasibility error $\|X^{k}\cdot{\bf 1}-r^{1}\|+\|(X^{k})^{T}\cdot{\bf 1}-r^{2}\|$ (second row). The first row of subplots in this figure shows the trajectory/progress of Alg. 1, Alg. 2, Alg. 3 and Alg. 4 when computing the Wasserstein distance between the three pairs of probability distributions in 1-d, 2-d, and 3-d respectively. Each algorithm is run 5 times and the curves showcase the average behavior. The second row of subplots shows the corresponding average feasibility error against computation time for the algorithms in this experiment.

Comments on Figure 2

We can see from Figure 2 that different approaches to choosing a working set of the same size can significantly affect the performance of random BCD type methods. The curves of RBCD-DB are below those of RBCD0 in the long run, demonstrating that RBCD-DB has a better average performance. The reason for this is that RBCD0 generates the working set with full randomness, while RBCD-DB takes the structure of the elementary matrices into account. The latter makes an educated guess at a working set that decreases the objective function by a large amount. The submatrix approach (38) works very well in practice, as illustrated by the better performance of RBCD-SDB and ARBCD compared to RBCD-DB. In the long run, ARBCD outperforms RBCD-SDB, verifying the acceleration effect. It is important to note that the algorithm parameters are set to their default values; we expect, and have observed, similar behaviors of the algorithms when these settings are changed. Also note that the feasibility errors of the algorithms are kept at a low level according to the figure. On the other hand, the curves in these numerical experiments suggest sublinear convergence rates. This observation does not contradict the theoretical linear convergence rate as long as $v$ is small enough. We will verify that the numerical experiments do not violate the lower bounds we derived for the constant $v$ in the linear convergence rates.

1-d case 2-d case 3-d case
iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v}
0 0.6237 N/A 0 9.3538 N/A 0 3.3456 N/A
1000 0.0083 4.3e-3 1000 0.6254 2.7e-3 1000 0.4746 2.0e-3
2000 0.0054 4.3e-4 2000 0.4773 2.7e-4 2000 0.3821 2.2e-4
3000 0.0045 1.8e-4 3000 0.4183 1.3e-4 3000 0.3438 1.1e-4
4000 0.0039 1.4e-4 4000 0.3846 8.3e-5 4000 0.3371 2.0e-5
5000 0.0036 8.0e-5 5000 0.3762 2.2e-5 5000 0.3350 6.2e-6
Table 1: Data of RBCD0, CmaxC_{\max} is the maximal value of elements in CC
1-d case 2-d case 3-d case
iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v}
0 0.6237 N/A 0 9.3538 N/A 0 3.3456 N/A
1000 0.0130 3.9e-3 1000 0.6002 2.7e-3 1000 0.4601 2.0e-3
2000 0.0057 8.2e-4 2000 0.3920 4.3e-4 2000 0.3414 3.0e-4
3000 0.0036 4.6e-4 3000 0.3074 2.4e-4 3000 0.2878 1.7e-4
4000 0.0026 3.3e-4 4000 0.2592 1.7e-4 4000 0.2544 1.2e-4
5000 0.0020 2.6e-4 5000 0.2276 1.3e-4 5000 0.2316 9.4e-5
Table 2: Data of RBCD-DB
About Table 1 & 2

In these two tables we record the optimality gap $\bar{f}_{k}-f^{*}$ every 1000 iterations for both RBCD0 and RBCD-DB. Here $\bar{f}_{k}$ is the average function value at iteration $k$, since we run each algorithm 5 times. The column $\hat{v}$ denotes the estimate of the constant $v$ in the expected Gauss-Southwell-$q$ rule (21). It is calculated by the formula $\hat{v}=1-\sqrt[1000]{(\bar{f}_{k}-f^{*})/(\bar{f}_{k-1000}-f^{*})}$. Values of $\hat{v}$ in both Tables 1 and 2 are far larger than the lower bounds for $v$: $\frac{1}{\binom{N}{l}(N-l+1)}$ and $\frac{(n-1)(p-2)}{(n^{2}-3)(n!)^{2}}$, corresponding to RBCD0 and RBCD-DB respectively, where $N=n^{2}$. They also decrease as we run more iterations, indicating that the per-iteration reduction of the optimality gap becomes smaller as the iterate approaches the solution. We intend to study this phenomenon in our future work.
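In other words, over a window of 1000 iterations we assume the gap contracts by a constant factor $1-\hat{v}$ per iteration, so that $\bar{f}_{k}-f^{*}\approx(1-\hat{v})^{1000}(\bar{f}_{k-1000}-f^{*})$; in Matlab (with hypothetical variable names) this reads

% gap_k and gap_prev are the average optimality gaps recorded 1000 iterations apart.
vhat = 1 - (gap_k / gap_prev)^(1/1000);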

Figure 3: Sparsity of solutions

Y-axis records xk0\|x_{k}\|_{0}, i.e., the number of nonzero elements in xkx_{k}. This figure shows the sparsity of xkx_{k} in RBCD-SDB and ARBCD when computing the Wasserstein distance given the three pairs of probability distributions. Each curve represents the average over 5 repetitions.

Sparse solutions

We can observe from Figure 3 that the iterates in RBCD-SDB and ARBCD become sparse quickly and remain so. The reason for this is that the solutions of OT problems are usually sparse (for the point cloud setting, at least one of the optimal solutions satisfies $\|x^{*}\|_{0}=n$ because the extreme points of the LP in this setting are permutation matrices divided by $n$), and these two algorithms can locate solutions with high accuracy relatively fast. As a result, the storage needed by these two algorithms is considerably reduced after they have been run for a while. In the point cloud setting, the storage complexity is typically expected to decrease from $O(n^{2})$ to $O(n)$ (note that the degree of freedom of the subproblem per iteration is typically chosen as $O(n)$ because of the diagonal band approach with $p\geq 3$).

Choice of mm

We run RBCD-SDB and ARBCD with different $m$ for a fixed $n$ to test the optimal setting of the subproblem size. As shown in Figure 4, the best choice of $m$ occurs at a smaller value (10 or 30 percent of $n$) in the 1-d case. In Figure 5, we observe a similar phenomenon in the 2-d case. The optimal setting of $m$ shifts to a larger value in the 3-d case. We conclude that the optimal setting of $m$ may vary depending on many factors, such as the specific problem and the efficiency of the subproblem solver. Based on the given results, we suggest choosing a smaller $m$ in practice, especially when computational resources are limited.

Figure 4: Choice of $m$ under various $n$

Wall-clock time of the algorithms RBCD-SDB and ARBCD to compute solutions at accuracy level $f^{*}\times 10^{-3}$. The algorithms are stopped once the solution accuracy is within this tolerance. Each experiment is repeated 3 times and the average time is reported. $p=\lfloor m^{2}/n\rfloor$.

n=501n=501
mm 50 150 250 350 450
RBCD-SDB iter. 2.00e+04 3.55e+02 55.0 17.7 7.67
err. 7.28e-07 4.95e-07 4.88e-07 4.01e-07 1.71e-07
ARBCD iter. 1.95e+4 3.01e+02 58.0 22.3 8.33
err. 5.68e-07 4.90e-07 4.92e-07 9.08e-08 9.49e-08
n=1001n=1001
mm 60 100 300 500 700 900
RBCD-SDB iter. 2.00e+04 1.14e+04 2.21e+02 47.7 20 7.67
err. 3.55e-06 4.93e-07 4.93e-07 2.96e-07 3.51e-07 1.65e-07
ARBCD iter. 2.00e+04 9.67e+03 1.74e+02 47.3 22.3 8.33
err. 2.06e-06 4.94e-07 4.60e-07 4.09e-07 3.39e-07 1.17e-08
n=1501n=1501
mm 75 150 450 750 1050 1350
RBCD-SDB iter. 2.00e+04 6.80e+03 1.69e+02 50.3 19.3 7.67
err. 3.90e-06 4.93e-07 4.89e-07 4.18e-07 3.63e-07 1.70e-07
ARBCD iter. 2.00e+04 3.11e+03 1.67e+02 46.3 20 8
err. 1.85e-06 4.93e-07 4.86e-07 4.15e-07 3.08e-07 2.35e-07
Table 3: Data of Figure 4

The average iterations and solution errors are recorded for reference. Feasibility errors are kept at a low level (below 1e-15) and omitted.

Figure 5: Choice of $m$ in various dimensions

Numerical experiments from 1-d to 3-d cases. Solutions of accuracy level f×103f^{*}\times 10^{-3}. Repetition is 5 and the box plot of time is shown. p=m2/np=\lfloor m^{2}/n\rfloor.

5.2 Comparison between ARBCD and algorithms in the literature

Experiment settings

We generated 8 pairs of distributions/patterns based on synthetic and real datasets. Descriptions are as follows. Note that we use histogram settings (c.f. Section 2) for datasets 1 and 2, and point cloud settings (c.f. Section 2) for other datasets. We use cost function C(x,y)=xy2C(x,y)=\|x-y\|^{2}.
Dataset 1: Uniform distribution to standard normal distribution over [1,1][-1,1]. Similar to the 1-d case in Section 5.1. n=200.n=200.
Dataset 2: Uniform distribution to a randomly shuffled standard normal distribution over $[-1,1]$ (similar to the 1-d case in Section 5.1; we randomly shuffle the weights of the normal distribution histogram). $n=1000$.
Dataset 3: Uniform distribution over [π,π]2[-\pi,\pi]^{2} to an empirical invariant measure. Similar to the 2-d case in Section 5.1. n=1000.n=1000.
Dataset 4: Distribution of Σu\sqrt{\Sigma}u to distribution of 2Σv(1;1;1)2\sqrt{\Sigma}v-(1;1;1), where Σ=(10.50.250.510.50.250.51)\Sigma=\begin{pmatrix}1&0.5&0.25\\ 0.5&1&0.5\\ 0.25&0.5&1\end{pmatrix}, uu and vv conform uniform distributions on [0,1]3[0,1]^{3} and are independent. n=1000.n=1000.
Dataset 5: Similar to Dataset 4, with Σ=(10.80.640.810.80.640.81)\Sigma=\begin{pmatrix}1&0.8&0.64\\ 0.8&1&0.8\\ 0.64&0.8&1\end{pmatrix}. n=1000.n=1000.
Dataset 6: Distribution of Σu\Sigma u to distribution of Σv\Sigma v, where Σ=(10110111)T\Sigma=\begin{pmatrix}1&0&1&1\\ 0&1&1&-1\end{pmatrix}^{T}, uu conforms a uniform distribution on [0,2π]2[0,2\pi]^{2} and vv conforms a uniform distribution on [1,1]2[-1,1]^{2}. n=1000.n=1000.
Dataset 7: Distribution of (1;1;;1)T10u\underbrace{(1;1;\ldots;1)^{T}}_{10}u to distribution of (1;2;3;;10)Tv+(1;1;;1)T(1;2;3;\ldots;10)^{T}v+(1;1;\ldots;1)^{T}, where uu conforms uniform distribution over [0,2π][0,2\pi] and vv conforms uniform distribution over [1,1][-1,1]. n=1000.n=1000.
Dataset 8: Distribution of a “cylinder” to a “spiral”, see Figure 6. $n=1000$.
In all cases, we normalize the cost matrix $C$ such that its maximal element is $1$. For all cases, we use linprog in Matlab to find a solution with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-10).
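A minimal sketch of the cost-matrix construction for the point-cloud datasets is given below; it assumes the two samples are stored as the rows of matrices Yt and Y (illustrative names):

% C(i,j) = ||Yt(i,:) - Y(j,:)||^2, normalized so that the largest entry is 1.
C = zeros(n, n);
for i = 1:n
    for j = 1:n
        C(i,j) = sum((Yt(i,:) - Y(j,:)).^2);   % squared Euclidean distance
    end
end
C = C / max(C(:));                       % normalize: maximal element equals 1
r1 = ones(n,1)/n;  r2 = ones(n,1)/n;     % uniform weights in the point cloud setting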

Figure 6: Visualization of dataset 8

5.2.1 Comparison with Sinkhorn

Methods

Implementation of Sinkhorn and ARBCD are specified as follows.
Sinkhorn. The algorithm proposed in cuturi2013sinkhorn to compute the Wasserstein distance. Let $\gamma$ be the coefficient of the entropy term. We let $\gamma=\epsilon/(4\log n)$ as suggested in dvurechensky2018computational . We consider the settings $\epsilon=10^{-4},10^{-3},0.01,0.1$. Iterates of Sinkhorn are projected onto the feasible region using a rounding procedure (Algorithm 2 in altschuler2017near ). Note that this projection step is added only for evaluation purposes, because Sinkhorn does not provide feasible solutions if stopped early; it does not affect Sinkhorn’s main steps or its convergence at all. A similar approach is used for evaluation in jambulapati2019direct . In addition, we carry out all the updates in log space and use the LogSumExp function to avoid numerical instability issues (a minimal sketch of this log-domain update is given after this list). We stop Sinkhorn after 300000 iterations when $n=200$ and after 100000 iterations when $n=1000$.
ARBCD. Algorithm 4: Accelerated random block coordinate descent. Let m=40m=40 when n=200n=200 and m=100m=100 when n=1000n=1000. Let p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. Stop the algorithm after 1000010000 iterations. To be fair, we also project the solution in each iteration onto the feasible region via the rounding procedure. LP subproblems are solved via linprog in Matlab with high precision (dual-simplex, constraint tolerance 1e-9).
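For completeness, a minimal sketch of the kind of log-domain Sinkhorn update we have in mind is shown below (our actual implementation may differ in details); here gamma is the entropy coefficient, f and g are the dual potentials, and LogSumExp is applied row-wise:

% Log-domain Sinkhorn: avoids forming exp(-C/gamma) explicitly.
f = zeros(n,1); g = zeros(n,1);
lse = @(M) max(M,[],2) + log(sum(exp(M - max(M,[],2)), 2));  % row-wise LogSumExp
for it = 1:maxit
    f = gamma*log(r1) - gamma*lse((g' - C)/gamma);   % row scaling in log space
    g = gamma*log(r2) - gamma*lse((f' - C')/gamma);  % column scaling in log space
end
P = exp((f + g' - C)/gamma);   % transport plan (before the rounding procedure)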

Figure 7: Comparison of algorithms to compute Wasserstein distance II

The x-axis is the wall-clock time in seconds. The y-axis is the optimality gap $f_{k}-f^{*}=c^{T}x^{k}-c^{T}x^{*}$. This figure shows the trajectory/progress of Algorithm 4 (ARBCD) and Sinkhorn with different settings when computing the Wasserstein distance between eight pairs of probability distributions. ARBCD is run 5 times in each experiment and the curves showcase the average behavior.

Figure 8: Transport plan given by different algorithms (dataset 1)

This figure shows the transport plan for dataset 1. In each plot, the bottom distribution is uniform and the top is standard normal. Each line segment in between represents mass transported between a pair of points; the darker the line, the more mass is transported. To plot the plans more clearly, we select every other mass point from $-1$ to $0$ (so only 50 mass points are included). The overall transport plans from $-1$ to $1$ are symmetric, so we display them in this way for presentation clarity.

Dataset #\# method time(s) iter. gap feas.err. subproblem size
ARBCD 82.63 2326.7 4.9386e-07 4.1753e-17 22500
IPM1 7.539 198 3.36713e-08 1.8924e-17 20890
1 IPM2 342.6 2000 4.3408e-04 2.0889e-17 96550
IPM3 5.044 19 1.2319e-07 5.7525e-17 485194
IPM4 11.01 21 2.8982e-07 8.3340e-17 1000000
ARBCD 240.3 10000 8.6011e-09 8.4518e-17 22500
IPM1 413.9 2000 1.0562e-04 2.5704e-17 20890
2 IPM2 0.8815 16 1.1066e-10 2.6584e-17 96550
IPM3 5.923 20 1.7153e-09 5.5638e-17 485194
IPM4 11.02 20 3.0871e-09 7.6026e-17 1000000
ARBCD 30.80 375 1.2787e-04 2.3256e-17 22500
IPM1 N/A N/A N/A N/A 21055
3 IPM2 67.43 2000 0.0146 2.1107e-17 92666
IPM3 313.7 2000 0.0110 1.3510e-17 465075
IPM4 6.696 13 1.2655e-04 6.3473e-17 1000000
ARBCD 145.2 4404.3 2.0519e-05 1.8927e-17 22500
IPM1 N/A N/A N/A N/A 21312
4 IPM2 N/A N/A N/A N/A 95639
IPM3 345.7 2000 0.0083 1.5114e-17 474798
IPM4 7.716 15 2.0397e-05 7.2654e-17 1000000
ARBCD 56.24 569.3 2.0412e-05 2.1927e-17 22500
IPM1 N/A N/A N/A N/A 19988
5 IPM2 N/A N/A N/A N/A 94468
IPM3 320.6 2000 0.0050 1.4802e-17 492856
IPM4 7.616 15 8.6357e-06 6.6523e-17 1000000
ARBCD 22.20 265 2.3475e-04 2.3622e-17 22500
IPM1 N/A N/A N/A N/A 20143
6 IPM2 417.6 2000 0.0032 1.4814e-17 94072
IPM3 337.9 2000 0.0065 1.3566e-17 482269
IPM4 5.438 11 1.4044e-04 6.2801e-17 1000000
ARBCD 63.88 342.3 7.4215e-05 2.0341e-17 22500
IPM1 N/A N/A N/A N/A 20139
7 IPM2 N/A N/A N/A N/A 99792
IPM3 278.4 2000 0.0221 1.9256e-17 486651
IPM4 5.136 11 5.7172e-05 5.5542e-17 1000000
ARBCD 71.44 439.3 1.9715e-05 2.4507e-17 22500
IPM1 N/A N/A N/A N/A 10702
8 IPM2 N/A N/A N/A N/A 97305
IPM3 219.1 2000 9.3439e-04 2.3205e-17 499786
IPM4 5.303 10 1.7165e-05 7.5243e-17 1000000
Table 4: Comparison between ARBCD and IPM

For ARBCD, the average iterations/time/gap/feasibility error are recorded for reference. For IPM, results of four different settings are recorded. The initial reduced problem size is reflected by the subproblem size column.

Comments on Figure 7

We can observe the following from Figure 7: although Sinkhorn with larger ϵ\epsilon may converge fast, the solution accuracy is also lower. In fact, this is true for all Sinkhorn-based algorithms because the optimization problem is not exact - it has an extra entropy term. Therefore, the larger γ\gamma or ϵ\epsilon is chosen, the less accurate the solution becomes. The discrepancy in accuracy does matter, as can be seen by inspecting the solution quality in Figure 8. On the other hand, when ϵ\epsilon is set smaller, the convergence of Sinkhorn becomes slower. As can be seen from the plots, when ϵ=0.1\epsilon=0.1 or 0.010.01, Sinkhorn converges faster than ARBCD; when ϵ=103\epsilon=10^{-3}, Sinkhorn is comparable to ARBCD; when ϵ=104\epsilon=10^{-4}, Sinkhorn is slower than ARBCD. In conclusion, if relatively higher precision is desired, ARBCD is comparable with Sinkhorn. Moreover, note that here we solve the subproblems in ARBCD using Matlab built-in solver linprog. ARBCD can be faster if more efficient subproblem solvers are applied.

5.2.2 Comparison with an interior point-inspired algorithm

In this experiment we continue using the 8 datasets, except that Dataset 1 has n=1000n=1000.

Methods

Implementation of ARBCD and IPM are specified as follows.
ARBCD. Algorithm 4. Let m=150m=150, p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. We stop the algorithm when (fkf)/f103(f_{k}-f^{*})/f^{*}\leq 10^{-3} or when iteration reaches the maximum 10000. Other settings are similar to the experiment in Section 5.2.1. The algorithm outputs the last iterate.
IPM. A recent interior point-inspired algorithm proposed in zanetti2023interior . The Matlab code is taken directly from the GitHub page of that paper (https://github.com/INFORMSJoC/2022.0184); we have made minimal modifications for comparison. We stop the algorithm when $(f_{k}-f^{*})/f^{*}\leq 10^{-3}$ or when the iteration count hits 2000. We run four different settings separately, where the initial reduced problem sizes are approximately $11(2n-1)$, $n^{2}/10$, $n^{2}/2$ and $n^{2}$ ($n=1000$), respectively. For fairness, we also project the solution in each iteration onto the feasible region via the rounding procedure. The algorithm outputs the solution with the best function value gap.

Comments on Table 4

We can see that the performance of ARBCD is better than that of IPM1 and comparable to IPM2 and IPM3. IPM4 has the best performance, but its subproblem size is also large. In fact, our observation of the memory storage is IPM4 $\geq$ IPM3 $\geq$ IPM2 $\geq$ IPM1 $\approx$ ARBCD. This is consistent with the subproblem size column. In practice, the memory consumption of IPM4 is large, contradicting the goal of the authors in zanetti2023interior to save memory. In zanetti2023interior , the authors suggest the setting of IPM1. We also find that, when running IPM1, IPM2 and IPM3, the algorithms develop numerical instabilities on some datasets, which is reported as N/A in the table. This is consistent with the results in zanetti2023interior (Figure 4), where the ratio of runs in which the IPM algorithm finds an ideal solution is below $1$.

Figure 9: Solving a large-scale problem via ARBCD

We apply ARBCD to solve the large-scale 1-d problem ($n=12800$). The y-axis of the left plot shows the optimality gap $f_{k}-f^{*}$, the y-axis of the right plot shows the feasibility error $\|X^{k}\cdot{\bf 1}-r^{1}\|+\|(X^{k})^{T}\cdot{\bf 1}-r^{2}\|$, and the x-axis records the iteration number. ARBCD is repeated 3 times and average results are reported.

5.3 Test on a large-scale OT problem

In this subsection, we generate a pair of 1-dim probability distributions with large discrete support sets ($n=12800$). For the first distribution, the locations of the discrete support ($x^{i}$, $i=1,\ldots,n$) are evenly spaced in $[-1,1]$, and their weights/probabilities are uniform (i.e., $1/n$). For the other distribution, the locations of the discrete support are determined as $\tilde{x}^{i}=x^{\sigma(i)}+u^{i}$, where $\sigma(i)$ is a random permutation of $i=1,\ldots,n$, and $u^{i}$ is a random variable that conforms to a uniform distribution over $[-0.5,0.5]$. The weights/probabilities are determined as $w_{i}=\frac{\phi(\tilde{x}^{i})}{\sum_{i=1}^{n}\phi(\tilde{x}^{i})}$, where $\phi(x)$ is the pdf of the standard normal. The benchmark optimal solution is quickly computed via a closed-form formula for the 1-d OT problem ($f^{*}=6.236\times 10^{-3}$). For ARBCD, we use the settings $m=\lceil\sqrt{10n}\rceil$, $p=\lfloor m^{2}/n\rfloor$, $s=0.1$ and $T=10$.
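The closed-form benchmark exploits the fact that, for the squared-distance cost in one dimension, an optimal plan matches the two sorted supports monotonically. A minimal Matlab sketch (assuming xs, ws and ys, wt hold the supports and weights of the two discrete distributions; the names are illustrative):

% North-west corner rule on the sorted supports (optimal for 1-d convex costs).
[xs, ix] = sort(xs);  ws = ws(ix);
[ys, iy] = sort(ys);  wt = wt(iy);
i = 1; j = 1; fstar = 0;
while i <= numel(xs) && j <= numel(ys)
    mass  = min(ws(i), wt(j));                  % mass moved between current pair
    fstar = fstar + mass * (xs(i) - ys(j))^2;   % accumulate transport cost
    ws(i) = ws(i) - mass;  wt(j) = wt(j) - mass;
    if ws(i) <= 0, i = i + 1; end
    if wt(j) <= 0, j = j + 1; end
end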

Comments on Figure 9

The figure showcases the average behavior of ARBCD within 10000 iterations. It is able to locate a solution such that $(f_{k}-f^{*})/f^{*}\leq 0.1$. The convergence appears linear from the trajectory. We also want to point out that Gurobi 10.01 (academic license) runs out of memory on the desktop we use for the numerical experiments. Indeed, saving memory is one of the merits that motivate us to consider RBCD methods.

6 Conclusion

In this paper, we investigate the RBCD method to solve LP problems, including OT problems. In particular, an expected Gauss-Southwell-$q$ rule is proposed to select the working set $\mathcal{I}_{k}$ at iteration $k$. It guarantees almost sure convergence and a linear convergence rate, and it is satisfied by all algorithms proposed in this work. We first develop a vanilla RBCD, called RBCD0, to solve general LP problems. Its inexact version is also investigated. Then, by examining the structure of the matrix $A$ in the linear system of OT, characterizing elementary matrices of $\operatorname{null}(A)$ and identifying a conformal realization of any matrix $D\in\operatorname{null}(A)$, we refine the working set selection. We use two approaches - diagonal band and submatrix - for constructing $\mathcal{I}_{k}$ and employ an acceleration technique inspired by the momentum concept to improve the performance of RBCD0. In our numerical experiments, we compare all proposed RBCD methods and verify the acceleration effects as well as the sparsity of solutions. We also demonstrate the gap between the theoretical convergence rate and the practical one, and we discuss the choice of the subproblem size. Furthermore, we compare ARBCD, the best-performing of the proposed methods, with existing algorithms and apply it to a large-scale OT problem. The results show the advantages of our method in finding relatively accurate solutions to OT problems and in saving memory. For future work, we plan to extend our method to handle continuous measures and further improve it through parallelization and multiscale strategies, among other approaches.

Acknowledgements

The research of YX is supported by Guangdong Province Fundamental and Applied Fundamental Research Regional Joint Fund, (Project 2022B1515130009), and the start-up fund of HKU-IDS. The research of ZZ is supported by Hong Kong RGC grants (Projects 17300318 and 17307921), the National Natural Science Foundation of China (Project 12171406), Seed Fund for Basic Research (HKU), an R&D Funding Scheme from the HKU-SCF FinTech Academy, and Seed Funding for Strategic Interdisciplinary Research Scheme 2021/22 (HKU).

Declarations

The authors have no competing interests to declare that are relevant to the content of this article.

Data availability

The datasets generated during the current study are available in the GitHub repository,
https://github.com/yue-xie/RBCDforOT.

Appendix A Proof of Lemma 3

Proof

Suppose that

\displaystyle\log\left((n^{2})!/(n^{2}-np)!\right)\geq 2\log(n!)+\log((np)!). (53)

Then

\displaystyle\begin{array}[]{rrl}&\left((n^{2})!/(n^{2}-np)!\right)/(np)!&\geq(n!)^{2}\\ \implies&\binom{n^{2}}{np}/(n!)^{2}&\geq 1\\ \implies&\frac{\binom{n^{2}}{np}}{(n!)^{2}}\cdot\frac{n(p-2)(n^{2}-np+1)}{n^{2}-3}&\geq 1\\ \implies&\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}&\geq\frac{1}{\binom{n^{2}}{np}(n^{2}-np+1)},\end{array}

where the third inequality holds because pn/2p\leq n/2 and npK¯6n\geq p\bar{K}\geq 6. So we only need to prove (53). Note that

log(n2)!(n2np)!\displaystyle\log\frac{(n^{2})!}{(n^{2}-np)!} =x=n2np+1n2log(x)n2npn2(logx)𝑑x\displaystyle=\sum_{x=n^{2}-np+1}^{n^{2}}\log(x)\geq\int_{n^{2}-np}^{n^{2}}(\log x)dx
=n2log(n2)n2((n2np)log(n2np)n2+np)\displaystyle=n^{2}\log(n^{2})-n^{2}-\left((n^{2}-np)\log(n^{2}-np)-n^{2}+np\right)
=n2log(n2)(n2np)log(n2np)np\displaystyle=n^{2}\log(n^{2})-(n^{2}-np)\log(n^{2}-np)-np
=(p=n/K)2nplogn+K1Kn2logKK1np\displaystyle\overset{(p=n/K)}{=}2np\log n+\frac{K-1}{K}\cdot n^{2}\cdot\log\frac{K}{K-1}-np
2nplogn+2K32K2npnp.\displaystyle\geq 2np\log n+\frac{2K-3}{2K-2}\cdot np-np. (54)

The last inequality holds because log(1+x)xx2/2\log(1+x)\geq x-x^{2}/2 for x(0,1)x\in(0,1) and p=n/Kp=n/K. Meanwhile, right hand side of (53) satisfies the following:

\displaystyle\quad 2\log(n!)+\log((np)!)
2(n+1)log(n+1)2n+(np+1)log(np+1)np\displaystyle\leq 2(n+1)\log(n+1)-2n+(np+1)\log(np+1)-np
2(n+1)(logn+log2)2n+(np+1)(log(np)+log2)np\displaystyle\leq 2(n+1)(\log n+\log 2)-2n+(np+1)(\log(np)+\log 2)-np
=(np+2n+3)logn+(np+1)logp+2(n+1)log22nnp+(log2)(np+1)\displaystyle=(np+2n+3)\log n+(np+1)\log p+2(n+1)\log 2-2n-np+(\log 2)(np+1)
=(p=nK)2nplogn+(2n+4)logn+(log4)n+(log2)np+log82n(1+logK)nplogK\displaystyle\overset{\left(p=\frac{n}{K}\right)}{=}2np\log n+(2n+4)\log n+(\log 4)n+(\log 2)np+\log 8-2n-(1+\log K)np-\log K
(KK¯2,n6)2nplogn+(2n+4)logn+(log2)np(1+logK)np\displaystyle\overset{(K\geq\bar{K}\geq 2,n\geq 6)}{\leq}2np\log n+(2n+4)\log n+(\log 2)np-(1+\log K)np (55)

In order to show (53), we only need to confirm (55)(54)\eqref{ineq3: pr}\leq\eqref{ineq2: pr}. By observation, this is equivalent to

(2K32K2+log(K2))np(2n+4)logn(pηlogn,K¯K)(2K¯32K¯2+log(K¯2))ηn2n+44(2K¯32K¯2+log(K¯2))η2n.\displaystyle\begin{array}[]{crl}&\left(\frac{2K-3}{2K-2}+\log\left(\frac{K}{2}\right)\right)np&\geq(2n+4)\log n\\ \overset{(p\geq\eta\log n,\bar{K}\leq K)}{\Longleftarrow}&\left(\frac{2\bar{K}-3}{2\bar{K}-2}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta n&\geq 2n+4\\ \Longleftrightarrow&\frac{4}{\left(\frac{2\bar{K}-3}{2\bar{K}-2}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta-2}&\leq n.\end{array}

The last inequality is assumed.  

Appendix B A counterexample of interest

Example 1

Consider LP problem (9) with n=3n=3, r1=r2=(1/3,1/3,1/3)Tr^{1}=r^{2}=(1/3,1/3,1/3)^{T}. Let

C=((1+ϵ1)2(2ϵ3)2(1ϵ2)2(1ϵ2)2(1+ϵ1)2(2ϵ3)2(2ϵ3)2(1ϵ2)2(1+ϵ1)2),C=\begin{pmatrix}(1+\epsilon_{1})^{2}&(2-\epsilon_{3})^{2}&(1-\epsilon_{2})^{2}\\ (1-\epsilon_{2})^{2}&(1+\epsilon_{1})^{2}&(2-\epsilon_{3})^{2}\\ (2-\epsilon_{3})^{2}&(1-\epsilon_{2})^{2}&(1+\epsilon_{1})^{2}\end{pmatrix},

where 0<ϵi10<\epsilon_{i}\ll 1, i=1,2,3i=1,2,3 such that 2(1+ϵ1)2<(1ϵ2)2+(2ϵ3)22(1+\epsilon_{1})^{2}<(1-\epsilon_{2})^{2}+(2-\epsilon_{3})^{2}. It can be easily seen that the optimal solution is γ=13(001100010)\gamma^{*}=\frac{1}{3}\begin{pmatrix}0&0&1\\ 1&0&0\\ 0&1&0\end{pmatrix}. Suppose that γ0=13(100010001)\gamma^{0}=\frac{1}{3}\begin{pmatrix}1&0&0\\ 0&1&0\\ 0&0&1\end{pmatrix}. If we use the submatrix approach (38) with m=2m=2 (largest number less than nn) to select a working set k\mathcal{I}_{k}, then the algorithm will be stuck at γ0\gamma^{0}. It is not globally convergent. This cost matrix corresponds to the following case of transporting a three-point distribution to another one:

Figure 10: A counterexample of interest

The dashed hexagon has an edge length of 11. Transport point distribution y~1\tilde{y}_{1}, y~2\tilde{y}_{2} and y~3\tilde{y}_{3} to that of y1y_{1}, y2y_{2} and y3y_{3}.
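To see why the iterate cannot leave $\gamma^{0}$ when $m=2$, note that a $2\times 2$ working set admits a nonzero feasible update only if it contains two diagonal entries of $\gamma^{0}$, i.e., its rows and columns form the same pair $\{i,j\}$. The only feasible direction then transfers mass $\delta\in(0,1/3]$ from $(i,i)$ and $(j,j)$ to $(i,j)$ and $(j,i)$, changing the objective by

\displaystyle\delta\left(C_{ij}+C_{ji}-C_{ii}-C_{jj}\right)=\delta\left((1-\epsilon_{2})^{2}+(2-\epsilon_{3})^{2}-2(1+\epsilon_{1})^{2}\right)>0,

so every admissible move strictly increases the cost and the method stalls at $\gamma^{0}$.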

References

  • (1) Altschuler, J., Niles-Weed, J., Rigollet, P.: Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems 30 (2017)
  • (2) Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223. PMLR (2017)
  • (3) Beck, A.: The 2-coordinate descent method for solving double-sided simplex constrained minimization problems. Journal of Optimization Theory and Applications 162(3), 892–919 (2014)
  • (4) Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM Journal on Optimization 23(4), 2037–2060 (2013)
  • (5) Benamou, J., Brenier, Y.: A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)
  • (6) Benamou, J.D., Carlier, G., Cuturi, M., Nenna, L., Peyré, G.: Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37(2), A1111–A1138 (2015)
  • (7) Benamou, J.D., Collino, F., Mirebeau, J.M.: Monotone and consistent discretization of the Monge–Ampère operator. Mathematics of computation 85(302), 2743–2775 (2016)
  • (8) Benamou, J.D., Froese, B.D., Oberman, A.M.: Numerical solution of the optimal transportation problem using the Monge–Ampère equation. Journal of Computational Physics 260, 107–126 (2014)
  • (9) Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 661–680 (2020)
  • (10) Blondel, M., Seguy, V., Rolet, A.: Smooth and sparse optimal transport. In: International Conference on Artificial Intelligence and Statistics, pp. 880–889. PMLR (2018)
  • (11) Bonafini, M., Schmitzer, B.: Domain decomposition for entropy regularized optimal transport. Numerische Mathematik 149(4), 819–870 (2021)
  • (12) Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics 44(4), 375–417 (1991)
  • (13) Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming 155(1), 57–79 (2016)
  • (14) Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation 87(314), 2563–2609 (2018)
  • (15) Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26 (2013)
  • (16) Dvurechensky, P., Gasnikov, A., Kroshnin, A.: Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In: International Conference on Machine Learning, pp. 1367–1376. PMLR (2018)
  • (17) Facca, E., Benzi, M.: Fast iterative solution of the optimal transport problem on graphs. SIAM Journal on Scientific Computing 43(3), A2295–A2319 (2021)
  • (18) Gasnikov, A.V., Gasnikova, E., Nesterov, Y.E., Chernov, A.: Efficient numerical methods for entropy-linear programming problems. Computational Mathematics and Mathematical Physics 56(4), 514–524 (2016)
  • (19) Genevay, A., Cuturi, M., Peyré, G., Bach, F.: Stochastic optimization for large-scale optimal transport. Advances in neural information processing systems 29 (2016)
  • (20) Genevay, A., Peyre, G., Cuturi, M.: Learning generative models with sinkhorn divergences. In: A. Storkey, F. Perez-Cruz (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 84, pp. 1608–1617. PMLR (2018)
  • (21) Gerber, S., Maggioni, M.: Multiscale strategies for computing optimal transport. Journal of Machine Learning Research 18 (2017)
  • (22) Gottschlich, C., Schuhmacher, D.: The shortlist method for fast computation of the earth mover’s distance and finding optimal solutions to transportation problems. PloS one 9(10), e110214 (2014)
  • (23) Guminov, S., Dvurechensky, P., Tupitsa, N., Gasnikov, A.: On a combination of alternating minimization and Nesterov’s momentum. In: International Conference on Machine Learning, pp. 3886–3898. PMLR (2021)
  • (24) Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N.: When cyclic coordinate descent outperforms randomized coordinate descent. Advances in Neural Information Processing Systems 30 (2017)
  • (25) Haker, S., Zhu, L., Tannenbaum, A., Angenent, S.: Optimal mass transport for registration and warping. International Journal of Computer Vision 60(3), 225–240 (2004)
  • (26) He, B., Yuan, X.: On the 𝒪(1/n)\mathcal{O}(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis 50(2), 700–709 (2012)
  • (27) Huang, M., Ma, S., Lai, L.: A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In: International Conference on Machine Learning, pp. 4446–4455. PMLR (2021)
  • (28) Jambulapati, A., Sidford, A., Tian, K.: A direct O~(1/ϵ)\tilde{O}(1/\epsilon) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems 32 (2019)
  • (29) Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29(1), 1–17 (1998)
  • (30) Lei, N., Su, K., Cui, L., Yau, S.T., Gu, X.D.: A geometric view of optimal transportation and generative model. Computer Aided Geometric Design 68, 1–21 (2019)
  • (31) Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. Journal of Scientific Computing 75(3), 1581–1595 (2018)
  • (32) Lin, T., Ho, N., Jordan, M.I.: On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research 23(137), 1–42 (2022)
  • (33) Ling, H., Okada, K.: An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE transactions on pattern analysis and machine intelligence 29(5), 840–853 (2007)
  • (34) Liu, Y., Wen, Z., Yin, W.: A multiscale semi-smooth Newton method for optimal transport. Journal of Scientific Computing 91(2), 39 (2022)
  • (35) Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming 152(1), 615–642 (2015)
  • (36) Mandad, M., Cohen-Steiner, D., Kobbelt, L., Alliez, P., Desbrun, M.: Variance-minimizing transport plans for inter-surface mapping. ACM Transactions on Graphics (ToG) 36(4), 1–14 (2017)
  • (37) Natale, A., Todeschi, G.: Computation of optimal transport with finite volumes. ESAIM: Mathematical Modelling and Numerical Analysis 55(5), 1847–1871 (2021)
  • (38) Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM Journal on Optimization 26(1), 197–226 (2016)
  • (39) Necoara, I., Nesterov, Y., Glineur, F.: Random block coordinate descent methods for linearly constrained optimization over networks. Journal of Optimization Theory and Applications 173(1), 227–254 (2017)
  • (40) Necoara, I., Takáč, M.: Randomized sketch descent methods for non-separable linearly constrained optimization. IMA Journal of Numerical Analysis 41(2), 1056–1092 (2021)
  • (41) Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
  • (42) Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Taylor & Francis (2001)
  • (43) Peleg, S., Werman, M., Rom, H.: A unified approach to the change of resolution: Space and gray-level. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 739–742 (1989)
  • (44) Perrot, M., Courty, N., Flamary, R., Habrard, A.: Mapping estimation for discrete optimal transport. Advances in Neural Information Processing Systems 29 (2016)
  • (45) Peyré, G., Cuturi, M.: Computational optimal transport. Foundations and Trends in Machine Learning 11(5-6), 355–607 (2019)
  • (46) Polyak, B.T.: Introduction to optimization. Optimization Software, Inc., Publications Division, New York (1987)
  • (47) Qu, Z., Richtárik, P., Takác, M., Fercoq, O.: SDNA: stochastic dual Newton ascent for empirical risk minimization. In: International Conference on Machine Learning, pp. 1823–1832. PMLR (2016)
  • (48) Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming 144(1), 1–38 (2014)
  • (49) Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Mathematical Programming 156(1), 433–484 (2016)
  • (50) Rockafellar, R.T.: Network flows and monotropic optimization, vol. 9. Athena scientific (1999)
  • (51) Schmitzer, B.: A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision 56, 238–259 (2016)
  • (52) Schmitzer, B.: Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3), A1443–A1481 (2019)
  • (53) Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics 35(2), 876–879 (1964)
  • (54) Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., Guibas, L.: Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG) 34(4), 1–11 (2015)
  • (55) Sun, R., Ye, Y.: Worst-case complexity of cyclic coordinate descent: 𝒪(n2)\mathcal{O}(n^{2}) gap with randomized version. Mathematical Programming 185(1), 487–520 (2021)
  • (56) Toselli, A., Widlund, O.: Domain decomposition methods-algorithms and theory, vol. 34. Springer Science & Business Media (2004)
  • (57) Tseng, P., Yun, S.: Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications 140(3), 513–535 (2009)
  • (58) Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 117(1), 387–423 (2009)
  • (59) Tseng, P., Yun, S.: A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Computational Optimization and Applications 47(2), 179–206 (2010)
  • (60) Villani, C.: Topics in optimal transportation, vol. 58. American Math. Soc. (2021)
  • (61) Wang, Z., Xin, J., Zhang, Z.: DeepParticle: learning invariant measure by a deep neural network minimizing Wasserstein distance on data generated from an interacting particle method. Journal of Computational Physics p. 111309 (2022)
  • (62) Wijesinghe, J., Chen, P.: Matrix balancing based interior point methods for point set matching problems. SIAM Journal on Imaging Sciences 16(3), 1068–1105 (2023)
  • (63) Wright, S.: Primal-dual interior-point methods. SIAM (1997)
  • (64) Xie, Y., Shanbhag, U.V.: SI-ADMM: A stochastic inexact ADMM framework for stochastic convex programs. IEEE Transactions on Automatic Control 65(6), 2355–2370 (2019)
  • (65) Xie, Y., Shanbhag, U.V.: Tractable ADMM schemes for computing KKT points and local minimizers for 0\ell_{0}-minimization problems. Computational Optimization and Applications 78(1), 43–85 (2021)
  • (66) Xie, Y., Wang, X., Wang, R., Zha, H.: A fast proximal point method for computing exact Wasserstein distance. In: Uncertainty in artificial intelligence, pp. 433–453. PMLR (2020)
  • (67) Yang, L., Li, J., Sun, D., Toh, K.C.: A fast globally linearly convergent algorithm for the computation of wasserstein barycenters. The Journal of Machine Learning Research 22(1), 984–1020 (2021)
  • (68) Zanetti, F., Gondzio, J.: An interior point–inspired algorithm for linear programs arising in discrete optimal transport. INFORMS Journal on Computing (2023)