
Yue Xie: Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. Email: [email protected]
Zhongjian Wang: School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore 637371. Email: [email protected]
Zhiwen Zhang: Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong SAR, China. Email: [email protected]

Randomized methods for computing optimal transport without regularization and their convergence analysis

Yue Xie    Zhongjian Wang    Zhiwen Zhang
(Received: date / Accepted: date)
Abstract

The optimal transport (OT) problem can be reduced to a linear programming (LP) problem through discretization. In this paper, we introduce the random block coordinate descent (RBCD) methods to directly solve this LP problem. Our approach involves restricting the potentially large-scale optimization problem to small LP subproblems constructed via randomly chosen working sets. By using a random Gauss-Southwell-q rule to select these working sets, we equip the vanilla version (RBCD0) with almost sure convergence and a linear convergence rate to solve general standard LP problems. To further improve the efficiency of the RBCD0 method, we explore the special structure of constraints in the OT problems and leverage the theory of linear systems to propose several approaches for refining the random working set selection and accelerating the vanilla method. Inexact versions of the RBCD methods are also discussed. Our preliminary numerical experiments demonstrate that the accelerated random block coordinate descent (ARBCD) method compares well with other solvers including Sinkhorn’s algorithm when seeking solutions with relatively high accuracy, and offers the advantage of saving memory.

Keywords:
Optimal transport · deep particle method · convex optimization · random block coordinate descent · convergence analysis.
MSC:
65C35 · 68W20 · 90C08 · 90C25

1 Introduction

Background and motivation

The optimal transport problem, which aims to find the most cost-efficient way to transport mass from a set of sources to a set of sinks, was first introduced by Monge in 1781. Later, the theory was modernized and revolutionized by Kantorovich in 1942, who found a key link between optimal transport and linear programming. In recent years, optimal transport has become a popular and powerful tool in data science, especially in image processing, machine learning, and deep learning areas, where it provides a very natural way to compare and interpolate probability distributions. For instance, in generative models arjovsky2017wasserstein ; lei2019geometric ; wang2022deepparticle , a natural penalty function is the Wasserstein distance (a concept closely related to OT) between the data and the generated distribution. In image processing, the optimal transport plan, which minimizes the transportation cost, provides solutions to image registration haker2004optimal and seamless copy perrot2016mapping . Apart from data science, in the past three decades, there has been an explosion of research interest in optimal transport because of the deep connections between the optimal transport problems with quadratic cost functions and a diverse class of partial differential equations (PDEs) arising in statistical mechanics and fluid mechanics; see e.g. brenier1991polar ; benamou2000computational ; otto2001geometry ; jordan1998variational ; villani2021topics for just a few of the most prominent results and references therein.

Inspired by this research progress, we have developed efficient numerical methods to solve multi-scale PDE problems using the optimal transport approach. Specifically, in our recent paper, we proposed a deep particle method for learning and computing invariant measures of parameterized stochastic dynamical systems wang2022deepparticle . To achieve this goal, we designed a deep neural network (DNN) to map a uniform distribution (source) to an invariant measure (target), where the Péclet number is an input parameter for the DNN. The network is trained by minimizing the 2-Wasserstein distance (W2W_{2}) between the measure of network output μ\mu and target measure ν\nu. We consider a discrete version of W2W_{2} for finitely many samples of μ\mu and ν\nu, which involves a linear program (LP) optimized over doubly stochastic matrices sinkhorn1964relationship .

Solving the LP directly using the interior point method wright1997primal is too costly. Motivated by the domain decomposition method toselli2004domain in scientific computing, which solves partial differential equations using subroutines that solve problems on subdomains and has the advantage of saving memory (i.e., using the same computational resource, it can compute a larger problem), we devised a mini-batch interior point method. This approach involves sampling smaller sub-matrices while preserving row and column sums. It has proven to be highly efficient and integrates seamlessly with the stochastic gradient descent (SGD) method for overall network training. However, we did not obtain convergence analysis for this mini-batch interior point method in our previous work wang2022deepparticle .

The objectives of this paper are twofold. First, we aim to provide rigorous convergence analysis for the mini-batch interior point method presented in wang2022deepparticle , with minimal modifications. Second, we seek to enhance the mini-batch selection strategy, thereby achieving improved and more robust performance in computing optimal transport problems. We recognize that the mini-batch interior point method aligns with the random block coordinate descent (RBCD) method in optimization terminology. Specifically, it applies the block coordinate descent (BCD) method to the LP problem directly, selects the working set randomly, and solves subproblems using the primal-dual interior point method wright1997primal or any other efficient linear programming solver. Encouraged by the demonstrated efficiency of this approach, we will develop theoretical results for solving LP with RBCD methods and explore various strategies for selecting working sets.

Theoretical contributions

In this work, we first introduce an expected Gauss-Southwell-qq rule to guide the selection of the working set. It enables almost sure convergence and a linear convergence rate in expectation when solving a general standard LP. Based on this rule, we develop a vanilla RBCD method - RBCD0, which selects the working set with complete randomness. Then, we investigate the special linear system present in the LP formulation of OT. We characterize all the elementary vectors of the null space and provide a strategy for finding the conformal realization of any given vector in the null space at a low computational cost. Based on these findings, we propose various approaches to refine the working set selection and improve the performance of RBCD0. A better estimation of the constant in the linear convergence rate is shown. Moreover, we incorporate an acceleration technique inspired by the momentum concept to improve the algorithm’s efficiency. Inexact versions of the RBCD methods are also discussed.

Numerical experiments

We perform numerical experiments to evaluate the performance of the proposed methods. Synthetic data sets of various shapes/dimensions and invariant measures generated from IPM methods are utilized to create distributions. Our experiments first compare different RBCD methods proposed in this paper, demonstrating the benefits of refining working set selection and verifying the effectiveness of the acceleration technique. We also illustrate the gap between theory and practice regarding the convergence rate, examine the sparse solutions generated by the proposed RBCD methods, and discuss the choice of subproblem size. In the second set of experiments, we compare the best-performing method, ARBCD, with Sinkhorn’s algorithm. Preliminary numerical results show that ARBCD is comparable to Sinkhorn’s algorithm in computation time when seeking solutions with relatively high accuracy. ARBCD is also comparable to a recently proposed interior-point-inspired algorithm in memory-saving settings. We also test ARBCD on a large-scale OT problem, where Gurobi runs out of memory. This further justifies the memory-saving advantage of ARBCD.

Previous research on (R)BCD

BCD and RBCD are well-studied for essentially unconstrained smooth optimization (sometimes allowing separable constraints or nonsmooth separable objective functions): beck2013convergence ; gurbuzbalaban2017cyclic ; sun2021worst investigate BCD with cyclic coordinate search; nesterov2012efficiency ; lu2015complexity ; richtarik2014iteration study RBCD to address problems with possibly nonsmooth separable objective functions; other related works include theoretical speedup of RBCD (richtarik2016parallel ; necoara2016parallel ) and second-order sketching (qu2016sdna ; berahas2020investigation ). However, much less is known about their convergence properties when applied to problems with nonseparable nonsmooth functions as summands or coupled constraints. To the best of our knowledge, RBCD has not previously been considered for solving general LP, and the related theoretical guarantees are absent. In necoara2017random , the authors studied the RBCD method to tackle problems with a convex smooth objective and coupled linear equality constraints x_{1}+x_{2}+\ldots+x_{N}=0; a similar algorithm, named the random sketch descent method necoara2021randomized , is investigated to solve problems with a general smooth objective and general coupled linear equality constraints Ax=b. However, after adding the simple bound constraints x\geq 0, the analysis in necoara2017random ; necoara2021randomized may not work anymore, nor can it be easily generalized. Beck beck20142 studied a greedy coordinate descent method but focused on a single linear equality constraint and bound constraints. In Paul Tseng and his collaborators’ work tseng2009block ; tseng2009coordinate ; tseng2010coordinate , a block coordinate gradient descent method is proposed to solve linearly constrained optimization problems including general LP. In these works, a Gauss-Southwell-q rule is proposed to guide the selection of the working set in each iteration. Consequently, the working set, selected in a deterministic fashion, can only be decided after solving a quadratic program of a size similar to that of the original problem. In contrast, our proposed mini-batch interior point/RBCD approach selects the working set through a combination of randomness and low computational cost. Another research direction that addresses separable functions, linearly coupled constraints, and additional separable constraints involves using the alternating direction method of multipliers (ADMM) chen2016direct ; he20121 ; xie2019si ; xie2021tractable . This method updates blocks of primal variables in a Gauss-Seidel fashion and incorporates multiplier updates as well.

Existing algorithms for OT

Encouraged by the success in applying Sinkhorn’s algorithm to the dual of entropy regularized OT cuturi2013sinkhorn , researchers have conducted extensive studies in this area, including other types of regularization blondel2018smooth ; gasnikov2016efficient , acceleration guminov2021combination ; lin2022efficiency , and numerical stability mandad2017variance . In xie2020fast , a proximal point algorithm (PPA) is considered to solve the discrete LP. The entropy-inducing distance function is introduced to construct the proximal term, and each subproblem has the same formulation as the entropy regularized OT. This approach is found to stabilize the numerical computation while maintaining the efficiency of Sinkhorn’s algorithm. In schmitzer2019stabilized , techniques such as log-domain stabilization and epsilon scaling are discussed to further improve the performance of Sinkhorn’s algorithm. The approach of iterative Bregman projection is discussed in benamou2015iterative . It is discovered that Sinkhorn’s iteration is equivalent to a special case of this method. In pmlr-v84-genevay18a , a notion of Sinkhorn distance is proposed, which allows computable differentiation when serving as the loss function for training generative models. Sinkhorn’s algorithm can also be generalized to solve unbalanced OT in chizat2018scaling , and applied to solve OT over geometric domains in solomon2015convolutional . In huang2021riemannian , a Riemannian block coordinate descent method is applied to solve the projection robust Wasserstein distance. The problem is to calculate the Wasserstein distance after projecting the distributions onto lower-dimensional subspaces. The proposed approach employs entropy regularization, deterministic block coordinate descent, and techniques in Riemannian optimization. A domain decomposition method is considered in bonafini2021domain . Unlike our method, the decomposition is deterministic, focusing on one pair of distributions, and the authors tackle entropy-regularized OT.

Other works that significantly deviate from the entropy regularization framework include li2018computations , which computes the Schrödinger bridge problem (equivalent to OT with Fisher information regularization), and multiscale strategies such as gerber2017multiscale , liu2022multiscale and schmitzer2016sparse . In liu2022multiscale , the problem size is reduced using a multiscale strategy, and a semi-smooth Newton method is applied to solve smaller-scale subproblems. The RBCD method employed in this study is a regularization-free method. As a result, it avoids dealing with inaccurate solutions and numerical stability issues introduced by the regularization term. Furthermore, each subproblem in RBCD is a small-size LP, allowing for flexible resolution choices. Interior-point methods and simplex methods are also revisited and enhanced zanetti2023interior ; wijesinghe2023matrix ; natale2021computation ; gottschlich2014shortlist . In particular, the authors of zanetti2023interior propose an interior-point-inspired algorithm with reduced subproblems to address large-scale OT. The authors of facca2021fast consider an equivalent formulation of minimizing an energy functional to solve OT over graphs. Backward Euler (equivalent to the proximal point method) is used in the outer loop of the algorithm and Newton-Raphson is used in the inner loop. Other approaches include numerical solution of the Monge–Ampère equation benamou2014numerical ; benamou2016monotone , stochastic gradient descent to resolve OT in discrete, semi-discrete and continuous formulations genevay2016stochastic , and an alternating direction method of multipliers (ADMM) approach to solve the more general Wasserstein barycenter yang2021fast .

Organization

The rest of the paper is organized as follows. In Section 2, we review the basic idea of optimal transport and Wasserstein distance. In Section 3, we introduce the expected Gauss-Southwell-qq rule and a vanilla RBCD (RBCD0) method for solving general LP problems. An inexact version of RBCD0 is also discussed. In Section 4, we investigate the properties of the linear system in OT and propose several approaches to refine and accelerate the RBCD0 method. In Section 5, preliminary numerical results are presented to demonstrate the performance of our proposed methods. Finally, concluding remarks are made in Section 6.

Notation. For any matrix X, let X(i,j) denote its element in the ith row and jth column, and let X(:,j) represent its jth column vector. For a vector v, we usually use superscripts to denote its copies (e.g., v^{k} in the kth iteration of an algorithm) and use subscripts to denote its components (e.g., v_{i}); for a scalar, we usually use subscripts to denote its copies. Occasional inconsistent cases will be declared in context. \operatorname{mod}(k,n) means k modulo n. For any vector v, we define \operatorname{supp}(v)\triangleq\{i\in\{1,\ldots,n\}\mid v_{i}\neq 0\}. Given a matrix X\in\mathbb{R}^{n\times n}, we define its vectorization as follows:

\operatorname{vec}(X)\triangleq(X(:,1)^{T},X(:,2)^{T},\ldots,X(:,n)^{T})^{T}.

For any positive integer k\geq 2, we denote [1,k]\triangleq\{1,\ldots,k\}. {\bf 1}_{n\times n} represents the n\times n matrix of all ones.
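For concreteness, this column-stacking convention corresponds to column-major (Fortran-order) flattening. The following minimal sketch, assuming NumPy and using our own helper names vec and unvec, illustrates the map and its inverse:

import numpy as np

# vec(X) stacks the columns X(:,1), ..., X(:,n) on top of each other,
# which matches NumPy's Fortran (column-major) ordering.
def vec(X):
    return X.flatten(order="F")

def unvec(x, n):
    # inverse map: reshape a length-n^2 vector back into an n-by-n matrix
    return x.reshape((n, n), order="F")

X = np.arange(9).reshape(3, 3)
assert np.allclose(unvec(vec(X), 3), X)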

2 Optimal transport problems and Wasserstein distance

The Kantorovich formulation of optimal transport can be described as follows,

infγΓ(μ,ν)X×YC(x,y)dγ(x,y)\displaystyle\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times Y}C(x,y)\mathrm{~{}d}\gamma(x,y) (1)

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of all measures on X×YX\times Y whose marginal distribution on XX is μ\mu and marginal distribution on YY is ν\nu, C(x,y)C(x,y) is the transportation cost. In this article, we refer to the Kantorovich formulation when we mention optimal transport.

Wasserstein distances are metrics on probability distributions inspired by the problem of optimal mass transport. They measure the minimal effort required to reconfigure the probability mass of one distribution in order to recover the other distribution. They are ubiquitous in mathematics, especially in fluid mechanics, PDEs, optimal transport, and probability theory villani2021topics . One can define the pp-Wasserstein distance between probability measures μ\mu and ν\nu on a metric space YY with distance function distdist by

Wp(μ,ν):=(infγΓ(μ,ν)Y×Y𝑑ist(y~,y)pdγ(y~,y))1/p\displaystyle W_{p}(\mu,\nu):=\left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{Y\times Y}dist(\tilde{y},y)^{p}\mathrm{~{}d}\gamma(\tilde{y},y)\right)^{1/p} (2)

where Γ(μ,ν)\Gamma(\mu,\nu) is the set of probability measures γ\gamma on Y×YY\times Y satisfying γ(A×Y)=μ(A)\gamma(A\times Y)=\mu(A) and γ(Y×B)=ν(B)\gamma(Y\times B)=\nu(B) for all Borel subsets A,BYA,B\subset Y. Elements γΓ(μ,ν)\gamma\in\Gamma(\mu,\nu) are called couplings of the measures μ\mu and ν\nu, i.e., joint distributions on Y×YY\times Y with marginals μ\mu and ν\nu on each axis. pp-Wasserstein distance is a special case of optimal transport when X=YX=Y and the cost function c(y~,y)=dist(y~,y)pc(\tilde{y},y)=dist(\tilde{y},y)^{p}.

In the discrete case, the definition (2) has a simple intuitive interpretation: given a γΓ(μ,ν)\gamma\in\Gamma(\mu,\nu) and any pair of locations (y~,y)(\tilde{y},y), the value of γ(y~,y)\gamma(\tilde{y},y) tells us what proportion of μ\mu mass at y~\tilde{y} should be transferred to yy, in order to reconfigure μ\mu into ν\nu. Computing the effort of moving a unit of mass from y~\tilde{y} to yy by dist(y~,y)pdist(\tilde{y},y)^{p} yields the interpretation of Wp(μ,ν)W_{p}(\mu,\nu) as the minimal effort required to reconfigure μ\mu mass distribution into that of ν\nu.

In a practical setting COTFNT , referred to as a point cloud, the closed-form expressions of μ\mu and ν\nu may be unknown; instead, only n independent and identically distributed (i.i.d.) samples of μ\mu and n i.i.d. samples of ν\nu are available. In the following discussion, n refers to the size of the problem. We approximate the probability measures μ\mu and ν\nu by empirical distribution functions:

μ=1ni=1nδy~iandν=1nj=1nδyj,\displaystyle\mu=\frac{1}{n}\sum_{i=1}^{n}\delta_{\tilde{y}^{i}}\quad\text{and}\quad\nu=\frac{1}{n}\sum_{j=1}^{n}\delta_{y^{j}}, (3)

where δx\delta_{x} is the Dirac measure. Any element in Γ(μ,ν)\Gamma(\mu,\nu) can clearly be represented by a transition matrix, denoted as γ=(γi,j)i,j\gamma=(\gamma_{i,j})_{i,j} satisfying:

γi,j0;j,i=1nγi,j=1n;i,j=1nγi,j=1n.\displaystyle\gamma_{i,j}\geq 0;\quad\quad\forall j,~{}\sum_{i=1}^{n}\gamma_{i,j}=\frac{1}{n};\quad\quad\forall i,~{}\sum_{j=1}^{n}\gamma_{i,j}=\frac{1}{n}. (4)

Then γi,j\gamma_{i,j} means the mass of y~i\tilde{y}^{i} that is transferring to yjy^{j}.

We denote all matrices in n×n\mathbb{R}^{n\times n} satisfying (4) as Γn\Gamma^{n}, then (2) becomes

W^(f):=(infγΓni,j=1n,ndist(y~i,yj)pγi,j)1/p.\displaystyle\hat{W}(f):=\left(\inf_{\gamma\in\Gamma^{n}}\sum_{i,j=1}^{n,n}\,dist(\tilde{y}^{i},y^{j})^{p}\gamma_{i,j}\right)^{1/p}. (5)
Remark 1

\Gamma^{n} is in fact the set of n\times n doubly stochastic matrices sinkhorn1964relationship divided by n.

Another practical setting, which is commonly used in fields of computer vision peleg1989unified ; ling2007efficient , is to compute the Wasserstein distance between two histograms. To compare two grey-scale figures (2D, size n0×n0n_{0}\times n_{0}), we first normalize the grey scale such that the values of cells of each picture sum to one. We denote centers of the cell as {yi}i=1n\{y^{i}\}_{i=1}^{n} and {y~i}i=1n\{\tilde{y}^{i}\}_{i=1}^{n}, then we can use two probability measures to represent the two figures:

μ=i=1nr1,iδy~iandν=j=1nr2,jδyj,\displaystyle\mu=\sum_{i=1}^{n}r_{1,i}\delta_{\tilde{y}^{i}}\quad\text{and}\quad\nu=\sum_{j=1}^{n}r_{2,j}\delta_{y^{j}},

where r1,i,r2,j0,1i,jnr_{1,i},r_{2,j}\geq 0,\forall 1\leq i,j\leq n, i=1nr1,i=j=1nr2,j=1\sum\limits_{i=1}^{n}r_{1,i}=\sum\limits_{j=1}^{n}r_{2,j}=1. The discrete Wasserstein distance (5) keeps the same form while the transition matrix follows different constraints:

γi,j0;j,i=1nγi,j=r2,j;i,j=1nγi,j=r1,i.\displaystyle\gamma_{i,j}\geq 0;\quad\quad\forall j,~{}\sum_{i=1}^{n}\gamma_{i,j}=r_{2,j};\quad\quad\forall i,~{}\sum_{j=1}^{n}\gamma_{i,j}=r_{1,i}. (6)

Note that in both settings, the computation of Wasserstein distance is reduced to an LP, i.e.,

\begin{array}{rl}\min&\sum\limits_{1\leq i,j\leq n}C_{i,j}\gamma_{i,j}\\ \mbox{subject to}&\sum\limits_{j=1}^{n}\gamma_{i,j}=r_{1,i},\quad\sum\limits_{i=1}^{n}\gamma_{i,j}=r_{2,j},\quad\gamma_{i,j}\geq 0,\end{array} (9)

where r^{1}\triangleq(r_{1,1},\ldots,r_{1,n})^{T} and r^{2}\triangleq(r_{2,1},\ldots,r_{2,n})^{T} are two probability distributions, and C_{i,j}=dist(\tilde{y}^{i},y^{j})^{p}. More generally, we can let r^{1} and r^{2} be two nonnegative vectors and C_{i,j}=C(\tilde{y}^{i},y^{j}) be any appropriate transportation cost from \tilde{y}^{i} to y^{j}, so (9) also captures the discrete OT.

However, when the number of particles nn becomes large, the number of variables (entries of γ\gamma) scales like n2n^{2}, which leads to costly computation. Therefore, we will discuss random block coordinate descent methods to keep the computational workload in each iteration reasonable.
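To make the formulation concrete, the following minimal sketch assembles c=vec(C), the row/column-sum constraints, and b for problem (9) and hands them to a generic dense LP solver. It assumes NumPy and SciPy (linprog with the HiGHS backend); the helper name discrete_ot_lp is ours, and this generic solve is only an illustration of the formulation, not one of the randomized methods developed below.

import numpy as np
from scipy.optimize import linprog

def discrete_ot_lp(C, r1, r2):
    # C : (n, n) cost matrix, C[i, j] = cost of moving mass from y~^i to y^j
    # r1: source marginal (row sums of gamma), r2: target marginal (column sums)
    n = C.shape[0]
    c = C.flatten(order="F")                      # c = vec(C)
    A_row = np.kron(np.ones((1, n)), np.eye(n))   # sum_j gamma_{i,j} = r1_i
    A_col = np.kron(np.eye(n), np.ones((1, n)))   # sum_i gamma_{i,j} = r2_j
    A = np.vstack([A_row, A_col])                 # 2n equality constraints in total
    b = np.concatenate([r1, r2])
    res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.x.reshape((n, n), order="F"), res.fun

# tiny usage example in the point-cloud setting (4), with uniform marginals
rng = np.random.default_rng(0)
n = 5
Y_src, Y_tgt = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
C = ((Y_src[:, None, :] - Y_tgt[None, :, :]) ** 2).sum(-1)   # squared distances, p = 2
gamma, cost = discrete_ot_lp(C, np.full(n, 1 / n), np.full(n, 1 / n))

Even in this toy example the solver receives n^2 variables at once; the block coordinate methods below instead work with small subsets of these variables in each iteration.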

3 Random block coordinate descent for standard LP

In this section, we first generalize the LP problem (9) to a standard LP (see Eq.(10)). Then we propose a random block coordinate descent algorithm for resolution. Its almost sure convergence and linear convergence rate in expectation are analyzed. Last we introduce an inexact version that is implementable.

We consider the following standard LP problem:

minxNcTxsubject toAx=b,x0,\displaystyle\begin{aligned} \min_{x\in\mathbb{R}^{N}}\quad&c^{T}x\\ \mbox{subject to}\quad&Ax=b,\;x\geq 0,\end{aligned} (10)

where A\in\mathbb{R}^{M\times N}, b\in\mathbb{R}^{M}, c\in\mathbb{R}^{N}; hence M is the number of constraints and N is the total number of degrees of freedom. Assume throughout that M\leq N. Let \mathcal{N}\triangleq\{1,\ldots,N\} and denote \mathcal{X}\triangleq\{x\in\mathbb{R}^{N}\mid Ax=b,x\geq 0\} as the feasible set. Assume that the optimal value of (10) is finite and that an optimal solution exists. For any x\in\mathcal{X} and \mathcal{I}\subseteq\mathcal{N}, denote

\mathcal{D}(x;\mathcal{I})\triangleq\mbox{argmin}_{d\in\mathbb{R}^{N}}\left\{c^{T}d\mid x+d\geq 0,\,Ad=0,\,d_{i}=0,\;\forall i\in\mathcal{N}\setminus\mathcal{I}\right\}. (11)
q(x;\mathcal{I})\triangleq\min_{d\in\mathbb{R}^{N}}\left\{c^{T}d\mid x+d\geq 0,\,Ad=0,\,d_{i}=0,\;\forall i\in\mathcal{N}\setminus\mathcal{I}\right\}. (12)

Namely, 𝒟(x;)\mathcal{D}(x;\mathcal{I}) is the optimal solution set of the linear program in (11) and q(x;)q(x;\mathcal{I}) is the optimal function value. We have that q(x;)=cTdq(x;\mathcal{I})=c^{T}d for any d𝒟(x;)d\in\mathcal{D}(x;\mathcal{I}). Denote 𝒳\mathcal{X}^{*} as the optimal solution set of (10). Then the following equations hold for any x𝒳x\in\mathcal{X}:

𝒳\displaystyle\mathcal{X}^{*} =x+𝒟(x;𝒩),\displaystyle=x+\mathcal{D}(x;\mathcal{N}), (13)
q(x;𝒩)\displaystyle q(x;\mathcal{N}) =cTxcTx,x𝒳.\displaystyle=c^{T}x^{*}-c^{T}x,\quad\forall x^{*}\in\mathcal{X}^{*}. (14)
Remark 2

It is worth mentioning that since d=0 satisfies the constraints in (12), we have q(x;\mathcal{I})\leq 0 for all (x,\mathcal{I}). Furthermore, q(x;\mathcal{N})\leq q(x;\mathcal{I}).
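As an illustration of (11)-(12), the following sketch computes an element of \mathcal{D}(x;\mathcal{I}) together with q(x;\mathcal{I}) by passing only the |\mathcal{I}| free coordinates to a generic LP solver. It assumes NumPy and SciPy; the helper name solve_subproblem is ours, and any LP solver could play the same role.

import numpy as np
from scipy.optimize import linprog

def solve_subproblem(c, A, x, I):
    # Restricted LP (11)-(12): only the coordinates listed in I may move,
    # all other components of d are fixed to zero.
    I = np.asarray(I)
    A_I = A[:, I]                                   # columns of A indexed by I
    lb = -x[I]                                      # x_I + d_I >= 0  <=>  d_I >= -x_I
    res = linprog(c[I], A_eq=A_I, b_eq=np.zeros(A.shape[0]),
                  bounds=list(zip(lb, [None] * len(I))), method="highs")
    d = np.zeros_like(x, dtype=float)
    d[I] = res.x
    return d, res.fun                               # d in D(x; I), q(x; I) = c^T d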

Consider the block coordinate descent (BCD) method for (10):

finddk𝒟(xk,k),xk+1:=xk+dk,\displaystyle\begin{aligned} &\mbox{find}\;d^{k}\in\mathcal{D}(x^{k},\mathcal{I}_{k}),\\ &x^{k+1}:=x^{k}+d^{k},\end{aligned} (15)

where k𝒩\mathcal{I}_{k}\subset\mathcal{N} is the working set chosen at iteration kk. Next, we describe several approaches to select the working set k\mathcal{I}_{k}.

Gauss-Southwell-q rule

Motivated by the Gauss-Southwell-q rule introduced in tseng2009coordinate , we desire to select k\mathcal{I}_{k} such that

q(xk;k)vq(xk;𝒩),\displaystyle q(x^{k};\mathcal{I}_{k})\leq vq(x^{k};\mathcal{N}), (16)

for some constant v(0,1]v\in(0,1]. Note that by (14), we have

q(xk;𝒩)=cT(xxk),\displaystyle q(x^{k};\mathcal{N})=c^{T}(x^{*}-x^{k}), (17)

where xx^{*} is an optimal solution of (10). Therefore, (12)-(17) imply that

cTdk\displaystyle c^{T}d^{k} vcT(xxk)\displaystyle\leq vc^{T}(x^{*}-x^{k})
(15)cT(xk+1xk)\displaystyle\overset{\eqref{alg: bcd}}{\implies}c^{T}(x^{k+1}-x^{k}) vcT(xxk)\displaystyle\leq vc^{T}(x^{*}-x^{k})
cT(xk+1x)\displaystyle\implies c^{T}(x^{k+1}-x^{*}) (1v)cT(xkx).\displaystyle\leq(1-v)c^{T}(x^{k}-x^{*}). (18)

(18) indicates that the function value gap decays exponentially with rate 1-v, as long as we choose \mathcal{I}_{k} according to the Gauss-Southwell-q rule (16) at each iteration k. A trivial choice of \mathcal{I}_{k} satisfying (16) is \mathcal{N} with v=1. However, this choice results in a potentially large-scale subproblem in the BCD method (15), contradicting the purpose of using BCD. Instead, we should set an upper bound on the cardinality of \mathcal{I}_{k}, namely, a reasonable batch size that balances the computational effort in each iteration and the convergence performance of BCD. Next, we discuss the existence of such an \mathcal{I}_{k} given an upper bound l on |\mathcal{I}_{k}|, which necessitates the following concept.

Definition 1

A vector d¯N\bar{d}\in\mathbb{R}^{N} is conformal to dNd\in\mathbb{R}^{N} if

supp(d¯)supp(d),d¯idi0,i𝒩.\displaystyle\operatorname{supp}(\bar{d})\subseteq\operatorname{supp}(d),\;\bar{d}_{i}d_{i}\geq 0,\forall i\in\mathcal{N}.

The following theorem confirms the existence of such an \mathcal{I}_{k} satisfying (16); the proof closely follows that of (tseng2009block, , Proposition 6.1).

Theorem 3.1

Suppose that \mathop{\hbox{\rm rank}}(A)+1\leq N. Given any x\in\mathcal{X}, l\in\{\mathop{\hbox{\rm rank}}(A)+1,\ldots,N\} and d\in\mathcal{D}(x;\mathcal{N}), there exist a set \mathcal{I}\subseteq\mathcal{N} satisfying |\mathcal{I}|\leq l and a vector \bar{d}\in\operatorname{null}(A) conformal to d such that

\displaystyle\mathcal{I} =supp(d¯).\displaystyle=\operatorname{supp}(\bar{d}). (19)
q(x;)\displaystyle q(x;\mathcal{I}) 1Nl+1q(x;𝒩).\displaystyle\leq\frac{1}{N-l+1}q(x;\mathcal{N}). (20)
Proof

If d=0d=0, then let d¯=0\bar{d}=0 and =\mathcal{I}=\emptyset. We have q(x;)=q(x;𝒩)=0q(x;\mathcal{I})=q(x;\mathcal{N})=0. Therefore, both (19) and (20) are satisfied. If d0d\neq 0 and |supp(d)|l|\operatorname{supp}(d)|\leq l, then let d¯=d\bar{d}=d. Thus, =supp(d¯)\mathcal{I}=\operatorname{supp}(\bar{d}) satisfies ||l|\mathcal{I}|\leq l and q(x;)=q(x;𝒩)q(x;\mathcal{I})=q(x;\mathcal{N}). If |supp(d)|>l|\operatorname{supp}(d)|>l, then similar to the discussion in (tseng2009block, , Proposition 6.1), we have that

d=d(1)++d(r),\displaystyle d=d^{(1)}+\ldots+d^{(r)},

for some r|supp(d)|l+1r\leq|\operatorname{supp}(d)|-l+1 and some nonzero d(s)null(A)d^{(s)}\in\operatorname{null}(A) conformal to dd with |supp(d(s))|l|\operatorname{supp}(d^{(s)})|\leq l, s=1,,rs=1,...,r. Since |supp(d)|N|\operatorname{supp}(d)|\leq N, we have rNl+1r\leq N-l+1. Since Ad(s)=0Ad^{(s)}=0 and xi+di(s)xi+di0,s=1,,rx_{i}+d^{(s)}_{i}\geq x_{i}+d_{i}\geq 0,\forall s=1,...,r and i{idi<0}\forall i\in\{i\mid d_{i}<0\}, we have that x+d(s)𝒳,s=1,,rx+d^{(s)}\in\mathcal{X},\forall s=1,...,r. Therefore,

q(x;𝒩)\displaystyle q(x;\mathcal{N}) =cTd=s=1rcTd(s)rmins=1,,r{cTd(s)}.\displaystyle=c^{T}d=\sum_{s=1}^{r}c^{T}d^{(s)}\geq r\min_{s=1,\ldots,r}\{c^{T}d^{(s)}\}.

Denote s¯argmins=1,,r{cTd(s)}\bar{s}\in\mbox{argmin}_{s=1,\ldots,r}\{c^{T}d^{(s)}\} and let =supp(d(s¯))\mathcal{I}=\operatorname{supp}(d^{(\bar{s})}), then ||l|\mathcal{I}|\leq l and

q(x;𝒩)rcTd(s¯)rq(x;)(Nl+1)q(x;),\displaystyle q(x;\mathcal{N})\geq rc^{T}d^{(\bar{s})}\geq rq(x;\mathcal{I})\geq(N-l+1)q(x;\mathcal{I}),

where the last inequality holds because r\leq N-l+1 and q(x;\mathcal{I})\leq 0.

Therefore (19) and (20) hold for this \mathcal{I} and d¯=d(s¯)\bar{d}=d^{(\bar{s})}.  

However, it is not clear how to identify the set \mathcal{I} described in Theorem 3.1 with little computational effort for a general A. Therefore, we introduce the following rule.

Expected Gauss-Southwell-q rule

We introduce randomness in the selection of k\mathcal{I}_{k} to reduce the potential computation burden in identifying an k\mathcal{I}_{k} that satisfies (16). Consider an expected Gauss-Southwell-q rule:

𝔼[q(xk;k)k]vq(xk;𝒩),\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid\mathcal{F}_{k}]\leq vq(x^{k};\mathcal{N}), (21)

where v(0,1]v\in(0,1] is a constant, and k{x0,,xk}\mathcal{F}_{k}\triangleq\{x^{0},\ldots,x^{k}\} denotes the history of the algorithm. Therefore, using the notations of LP (10) and BCD method (15):

(12)(17)(21)\displaystyle\eqref{def: sd}\eqref{qkN}\eqref{EGSq} 𝔼[cTdkk]vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}d^{k}\mid\mathcal{F}_{k}]\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1xk)k]vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{k})\mid\mathcal{F}_{k}]\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1x)k]cT(xkx)vcT(xxk)\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{*})\mid\mathcal{F}_{k}]-c^{T}(x^{k}-x^{*})\leq vc^{T}(x^{*}-x^{k})
𝔼[cT(xk+1x)k](1v)cT(xkx),\displaystyle\implies\mathbb{E}[c^{T}(x^{k+1}-x^{*})\mid\mathcal{F}_{k}]\leq(1-v)c^{T}(x^{k}-x^{*}), (22)

where xx^{*} is an optimal solution of (10). According to (polyak1987introduction, , Lemma 10, page 49), cT(xkx)0c^{T}(x^{k}-x^{*})\to 0 almost surely. Moreover, if we take expectations on both sides of (22),

𝔼[cT(xk+1x)]\displaystyle\mathbb{E}[c^{T}(x^{k+1}-x^{*})] (1v)𝔼[cT(xkx)]\displaystyle\leq(1-v)\mathbb{E}[c^{T}(x^{k}-x^{*})]
𝔼[cT(xkx)]\displaystyle\implies\mathbb{E}[c^{T}(x^{k}-x^{*})] (1v)k𝔼[cT(x0x)].\displaystyle\leq(1-v)^{k}\mathbb{E}[c^{T}(x^{0}-x^{*})].

i.e., the expectation of function value gap converges to 0 exponentially with rate 1v1-v.

Vanilla random block coordinate descent

Based on the expected Gauss-Southwell-qq rule, we formally propose a vanilla random block coordinate descent (RBCD0) algorithm (Algorithm 1) to solve the LP (10). Specifically, we choose the working set k\mathcal{I}_{k} with full randomness, that is, randomly choose an index set of cardinality ll out of 𝒩\mathcal{N}. Then with probability at least 1(Nl)\frac{1}{\binom{N}{l}}, the index set will be the same as or cover the working set suggested by Theorem 3.1. As a result, (21) will be satisfied with v1(Nl)(Nl+1)v\geq\frac{1}{\binom{N}{l}(N-l+1)}.

Algorithm 1 Vanilla random block coordinate descent (RBCD0)
  (Initialization) Choose feasible x0Nx^{0}\in\mathbb{R}^{N} and the batch size ll such that rank(A)+1lN\mathop{\hbox{\rm rank}}(A)+1\leq l\leq N.
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} uniformly randomly from 𝒩\mathcal{N} with |k|=l|\mathcal{I}_{k}|=l.
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

Based on the previous discussions, Algorithm 1 generates a sequence \{x^{k}\} such that the value of c^{T}x^{k} converges to the optimal value with probability 1. Moreover, the expectation of the optimality gap converges to 0 exponentially. It is important to note that \frac{1}{\binom{N}{l}(N-l+1)} is only a loose lower bound of v. This bound can become quite small when N grows large due to the binomial coefficient \binom{N}{l}. However, in our numerical experiments (cf. Sec. 5), this lower bound is rarely reached. In Section 4, we will discuss how to further improve this bound given the specific structure of the OT problem. Before that, we first investigate an inexact extension of RBCD0.
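For illustration, a minimal sketch of the RBCD0 loop follows, reusing the hypothetical solve_subproblem helper sketched earlier in this section; it is a schematic of Algorithm 1, not the exact implementation used in our experiments.

import numpy as np

def rbcd0(c, A, x0, l, num_iters, seed=0):
    # Algorithm 1 (RBCD0): draw a working set of cardinality l uniformly at
    # random, solve the restricted LP over it, then apply the block update.
    rng = np.random.default_rng(seed)
    N = len(c)
    x = x0.copy()
    for _ in range(num_iters):
        I = rng.choice(N, size=l, replace=False)    # Step 1: random working set
        d, _ = solve_subproblem(c, A, x, I)         # Step 2: d^k in D(x^k; I_k)
        x = x + d                                   # Step 3
    return x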

Inexact extension of Algorithm 1.

For any x and \mathcal{I}\subseteq\mathcal{N}, still denote \mathcal{D}(x;\mathcal{I}) and q(x;\mathcal{I}) as in (11) and (12). However, note that now x does not have to be feasible. We consider the inexact version of Algorithm 1, where Step 2 of Algorithm 1 is only approximately solved. We compute d^{k} such that for any d^{k}_{*}\in\mathcal{D}(x^{k};\mathcal{I}_{k}),

cTdk(1ϵk)cTdk,xk+dk0,Adkϵ^k,dik=0,ik.\displaystyle c^{T}d^{k}\leq(1-\epsilon_{k})c^{T}d^{k}_{*},\quad x^{k}+d^{k}\geq 0,\quad\|Ad^{k}\|\leq\hat{\epsilon}_{k},\quad d^{k}_{i}=0,\;\forall i\notin\mathcal{I}_{k}. (23)

where the inexactness sequences {ϵk}\{\epsilon_{k}\} and {ϵ^k}\{\hat{\epsilon}_{k}\} should be nonnegative. The inexact algorithm is described as follows.

Algorithm 1(a) Inexact random block coordinate descent (IRBCD)
  (Initialization) Choose feasible x0Nx^{0}\in\mathbb{R}^{N}, the batch size ll such that rank(A)+1lN\mathop{\hbox{\rm rank}}(A)+1\leq l\leq N, and inexactness sequences {ϵk}\{\epsilon_{k}\} and {ϵ^k}\{\hat{\epsilon}_{k}\}.
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} uniformly randomly from 𝒩\mathcal{N} with |k|=l|\mathcal{I}_{k}|=l.
     Step 2. Find dkd^{k} such that (23) holds.
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

Next, we analyze the sequence generated by Algorithm 1(a) and summarize the results in the following theorem.

Theorem 3.2

Consider Algorithm 1(a). Given δ>0\delta>0, suppose that 0ϵkϵ<10\leq\epsilon_{k}\leq\epsilon<1 and k=1ϵ^kδ\sum_{k=1}^{\infty}\hat{\epsilon}_{k}\leq\delta. Then we have that

1.

    Axkbδ\|Ax^{k}-b\|\leq\delta and xk0x^{k}\geq 0 a.s. for any k0k\geq 0,

2.

    cTxkcTxκδc^{T}x^{k}\geq c^{T}x^{*}-\kappa\delta for any k0k\geq 0,

3.

    𝔼[cTxk(cTx+κδ)](1(1ϵ)v)k𝔼[cTx0(cTx+κδ)]\mathbb{E}[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)]\leq(1-(1-\epsilon)v)^{k}\mathbb{E}[c^{T}x^{0}-(c^{T}x^{*}+\kappa\delta)],

where κ\kappa is a constant only depending on the matrix AA and vector cc, and vv is the constant in the expected Gauss-Southwell-q rule.

Proof

The proof of 1 is straightforward. Note that for any k\geq 0, we have in the almost sure sense that

Axkb=Ax0b+t=0k1AdtAx0b+t=0k1Adt(23)t=0k1ϵ^tδ.\displaystyle\|Ax^{k}-b\|=\left\|Ax^{0}-b+\sum_{t=0}^{k-1}Ad^{t}\right\|\leq\|Ax^{0}-b\|+\sum_{t=0}^{k-1}\|Ad^{t}\|\overset{\eqref{def: dk-inex}}{\leq}\sum_{t=0}^{k-1}\hat{\epsilon}_{t}\leq\delta.

x^{k}\geq 0 a.s. is also a direct result of (23) (x^{k}+d^{k}\geq 0). Now we prove 2. First, we argue that \mathcal{D}(x^{k};\mathcal{N})\neq\emptyset. Indeed, the linear program defining \mathcal{D}(x^{k};\mathcal{N}) is feasible (d=0 is feasible), and it is bounded below because the dual feasible set \{p\mid A^{T}p\leq c\} of (10) is nonempty; hence, by the duality theory of linear programming, an optimal solution exists. Suppose that x^{k+1}_{*}\in x^{k}+\mathcal{D}(x^{k};\mathcal{N}). Then we have

Axk+1b=Axkbδ.\displaystyle\|Ax^{k+1}_{*}-b\|=\|Ax^{k}-b\|\leq\delta.

We want to estimate cTxk+1c^{T}x^{k+1}_{*}. Suppose that p1,,psp^{1},\ldots,p^{s} are all the extreme points of the dual feasible region {pATpc}\{p\mid A^{T}p\leq c\} and denote bk=Axkb^{k}=Ax^{k}. Then by the theory of LP, we have

c^{T}x^{k+1}_{*}=\max_{i=1,\ldots,s}(b^{k})^{T}p^{i}=(b^{k})^{T}p^{i^{*}},

where i^{*} denotes an index attaining the maximum.

Note that we also have cTx=maxi=1,,sbTpic^{T}x^{*}=\max_{i=1,\ldots,s}b^{T}p^{i}, then

cTxk+1cTx(bkb)Tpibkbpi(maxi=1,,spi)δ.\displaystyle c^{T}x^{k+1}_{*}-c^{T}x^{*}\leq(b^{k}-b)^{T}p^{i^{*}}\leq\|b^{k}-b\|\|p^{i^{*}}\|\leq\left(\max_{i=1,\ldots,s}\|p^{i}\|\right)\delta. (24)

Similarly we can show

cTxkcTxcTxk+1cTx(maxi=1,,spi)δ,\displaystyle c^{T}x^{k}-c^{T}x^{*}\geq c^{T}x^{k+1}_{*}-c^{T}x^{*}\geq-\left(\max_{i=1,\ldots,s}\|p^{i}\|\right)\delta,

which concludes 2 if we let κmaxi=1,,spi\kappa\triangleq\max_{i=1,\ldots,s}\|p^{i}\|. According to Theorem 3.1 and discussion about the exact algorithm, we still have that the expected Gauss-Southwell-q rule (21) holds with v1(Nl)(Nl+1)v\geq\frac{1}{\binom{N}{l}(N-l+1)}. Then

𝔼[cTdkk](23)(1ϵk)𝔼[cTdkk](21)(1ϵk)vq(xk;𝒩)\displaystyle\mathbb{E}[c^{T}d^{k}\mid\mathcal{F}_{k}]\overset{\eqref{def: dk-inex}}{\leq}(1-\epsilon_{k})\mathbb{E}[c^{T}d^{k}_{*}\mid\mathcal{F}_{k}]\overset{\eqref{EGSq}}{\leq}(1-\epsilon_{k})vq(x^{k};\mathcal{N})
\displaystyle\implies 𝔼[cTxk+1cTxkk](1ϵ)v(cTxk+1cTxk)\displaystyle\mathbb{E}[c^{T}x^{k+1}-c^{T}x^{k}\mid\mathcal{F}_{k}]\leq(1-\epsilon)v(c^{T}x^{k+1}_{*}-c^{T}x^{k})
(24)\displaystyle\overset{\eqref{ineq: objerr}}{\implies} 𝔼[cTxk+1cTxkk](1ϵ)v(cTx+κδcTxk)\displaystyle\mathbb{E}[c^{T}x^{k+1}-c^{T}x^{k}\mid\mathcal{F}_{k}]\leq(1-\epsilon)v(c^{T}x^{*}+\kappa\delta-c^{T}x^{k})
\displaystyle\implies 𝔼[cTxk+1(cTx+κδ)k](1(1ϵ)v)[cTxk(cTx+κδ)],\displaystyle\mathbb{E}[c^{T}x^{k+1}-(c^{T}x^{*}+\kappa\delta)\mid\mathcal{F}_{k}]\leq(1-(1-\epsilon)v)[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)],
\displaystyle\implies 𝔼[cTxk(cTx+κδ)](1(1ϵ)v)k𝔼[cTx0(cTx+κδ)], for any k0.\displaystyle\mathbb{E}[c^{T}x^{k}-(c^{T}x^{*}+\kappa\delta)]\leq(1-(1-\epsilon)v)^{k}\mathbb{E}[c^{T}x^{0}-(c^{T}x^{*}+\kappa\delta)],\mbox{ for any }k\geq 0.

 

Remark 3

(i). Theorem 3.2 justifies the inexact algorithm. In particular, if we let \delta be small, then the expected objective function value sequence also converges linearly to a vicinity of the optimal one. In the implementation, we could either let \hat{\epsilon}_{k} be a sequence proportional to 1/k^{\alpha} with \alpha>1 (for example, \hat{\epsilon}_{k}=6\delta/(\pi^{2}k^{2}) gives \sum_{k\geq 1}\hat{\epsilon}_{k}=\delta), or choose all of them equal to a sufficiently small number so that their accumulated sum is also insignificant. The choice of the sequence \hat{\epsilon}_{k} is less stringent if we occasionally project the iterates onto the feasible region so that the accumulated error is offset.
(ii). In the proof we implicitly assume that the dual feasible region has at least one extreme point. This holds if A has linearly independent rows. In fact, we can eliminate redundant rows of A. After this operation, the feasible region remains the same, and the analysis and implementation of the algorithms are also similar.
(iii). Inexact versions of the algorithms proposed in Section 4 follow a similar fashion, and we omit detailed and repetitive explanations.

4 Random block coordinate descent and optimal transport

Denote the cost matrix C(Ci,j)i,jC\triangleq(C_{i,j})_{i,j} in (9). Then calculating the OT between two measures with finite support (problem (9)) is a special case of (10), where c=vec(C)c=\operatorname{vec}(C), and N=n2N=n^{2}. The constraint matrix AA has the following structure:

A\triangleq\underbrace{\left(\begin{array}{llll}I_{n}&I_{n}&\ldots&I_{n}\\ {\bf 1}_{n}^{T}&&&\\ &{\bf 1}_{n}^{T}&&\\ &&\ddots&\\ &&&{\bf 1}_{n}^{T}\end{array}\right)}_{\text{$n$ blocks}}, (30)

where InI_{n} is an n×nn\times n identity matrix, 𝟏n{\bf 1}_{n} is an nn dimensional vector of all 11’s (then M=2nM=2n). Right hand side bb in (10) has the form b((r1)T,(r2)T)Tb\triangleq((r^{1})^{T},(r^{2})^{T})^{T}, where r1,r2+nr^{1},r^{2}\in\mathbb{R}^{n}_{+} can be two discrete probability distributions. Next, we discuss the property of matrix AA and null(A)\operatorname{null}(A).

Property of matrix AA

A nonzero dNd\in\mathbb{R}^{N} is an elementary vector of null(A)\operatorname{null}(A) if dnull(A)d\in\operatorname{null}(A) and there is no nonzero dnull(A)d^{\prime}\in\operatorname{null}(A) that is conformal to dd and supp(d)supp(d)\operatorname{supp}(d^{\prime})\neq\operatorname{supp}(d). According to the definition in (30), we say that a nonzero matrix XX is an elementary matrix of null(A)\operatorname{null}(A) if vec(X)\operatorname{vec}(X) is an elementary vector of null(A)\operatorname{null}(A). For simplicity, a matrix M1M^{1} being conformal to M2M^{2} means vec(M1)\operatorname{vec}(M^{1}) being conformal to vec(M2)\operatorname{vec}(M^{2}) for the rest of this paper.

Now we define a set A\mathcal{E}_{A}:

X\in\mathcal{E}_{A}\subseteq\mathbb{R}^{n\times n}\Longleftrightarrow X\neq 0, and after row and column permutations, X is
a multiple of one of the following matrices:
E^{2}=\begin{pmatrix}1&-1&&&\\ -1&1&&&\\ &&0&&\\ &&&\ddots&\\ &&&&0\end{pmatrix}_{n\times n},\quad E^{3}=\begin{pmatrix}1&&-1&&&\\ -1&1&&&&\\ &-1&1&&&\\ &&&0&&\\ &&&&\ddots&\\ &&&&&0\end{pmatrix}_{n\times n},\;\ldots,
E^{n-1}=\begin{pmatrix}1&&&&-1&\\ -1&1&&&&\\ &-1&1&&&\\ &&&\ddots&&\\ &&&-1&1&\\ &&&&&0\end{pmatrix}_{n\times n},\quad E^{n}=\begin{pmatrix}1&&&&-1\\ -1&1&&&\\ &-1&1&&\\ &&&\ddots&\\ &&&-1&1\end{pmatrix}_{n\times n}.
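As a quick sanity check, the following sketch (NumPy assumed; the helper name elementary_Et is ours) constructs E^t as defined above and verifies that its row and column sums vanish, which is precisely the condition vec(E^t) ∈ null(A) for A in (30).

import numpy as np

def elementary_Et(n, t):
    # E^t: +1 at (1,1),...,(t,t); -1 at (2,1),...,(t,t-1) and at (1,t);
    # all remaining entries are zero (1-based indexing as in the text).
    E = np.zeros((n, n))
    for i in range(t):
        E[i, i] = 1.0
        E[(i + 1) % t, i] = -1.0
    return E

n = 6
for t in range(2, n + 1):
    E = elementary_Et(n, t)
    assert np.allclose(E.sum(axis=0), 0) and np.allclose(E.sum(axis=1), 0)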

First, we state a Lemma about A\mathcal{E}_{A}, the proof of which is trivial and thus omitted.

Lemma 1

Every matrix in A\mathcal{E}_{A} is an elementary matrix of null(A)\operatorname{null}(A).

Then we show every element in the null space of AA is related to a matrix in A\mathcal{E}_{A}.

Lemma 2

For any nonzero DD such that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A), there exists XAX\in\mathcal{E}_{A} such that XX is conformal to DD.

Proof

We prove this by contradiction and induction. We assume a nonzero DD such that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A) and no XAX\in\mathcal{E}_{A} is conformal to DD.

Note that vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A) implies

\sum_{i=1}^{n}D(i,\bar{j})=\sum_{j=1}^{n}D(\bar{i},j)=0,\quad\forall\,\bar{i},\bar{j}. (31)

Then without loss of generality, suppose that

D(1,1)>0.

Otherwise, since D is nonzero, we can permute rows/columns so that D(1,1)\neq 0; by (31), the entries in the first row/column after permutation cannot all have the same sign, so a further permutation makes D(1,1)>0.

By (31), the first column of D must have one negative element; suppose D(2,1)<0 WLOG. The second row of D must have one positive element, so suppose D(2,2)>0 WLOG. Since no X\in\mathcal{E}_{A} is conformal to D, we must have D(1,2)\geq 0; otherwise, a permuted copy of E^{2} would be conformal to D. Therefore, the 2\times 2 principal matrix of D has the following sign arrangement (after appropriate row/column permutations),

\begin{pmatrix}+&+/0\\ -&+\end{pmatrix},

where we use ++, +/0+/0, -, and /0-/0 to indicate that the corresponding entry is positive, nonnegative, negative, and nonpositive respectively.

If n=2, then the above pattern is impossible, as the first row sums to a positive number, contradicting (31). Suppose that n\geq 3. For mathematical induction, we assume the following:

(\mathbf{H}^{k}): after appropriate row/column permutations, the k\times k principal matrix of D has the following sign arrangement (2\leq k\leq n-1),

\begin{pmatrix}+&+/0&+/0&\ldots&+/0\\ -&+&+/0&\ddots&\vdots\\ -/0&-&+&\ddots&+/0\\ \vdots&\ddots&\ddots&\ddots&+/0\\ -/0&\ldots&-/0&-&+\end{pmatrix}, (32)

namely,

\left\{\begin{array}{ll}D(i,i)>0,&1\leq i\leq k;\\ D(i+1,i)<0,&1\leq i\leq k-1;\\ D(i,j)\geq 0,&1\leq i\leq j\leq k;\\ D(i,j)\leq 0,&1\leq j<i\leq k.\end{array}\right.

The kth column of D needs to have at least one negative element; otherwise, it contradicts (31). WLOG, we suppose D(k+1,k)<0.

We now claim that the rest of the elements in the (k+1)th row have to be nonpositive, namely D(k+1,i)\leq 0 for all i=1,\ldots,k-1. Otherwise, let i_{0} be the largest index in \{1,\ldots,k-1\} such that D(k+1,i_{0})>0. Then the submatrix D(i_{0}+1:k+1,\,i_{0}:k) takes the form

\begin{pmatrix}-&+&+/0&\ldots&+/0\\ -/0&-&+&\ddots&\vdots\\ \vdots&\ddots&\ddots&\ddots&+/0\\ -/0&\ldots&-/0&-&+\\ +&-/0&\ldots&-/0&-\end{pmatrix}, (33)

and the corresponding entries of D then support a permuted multiple of E^{k-i_{0}+1} that is conformal to D, contradicting our assumption. To see this, we can move the first column of (33) to the last position. For the whole matrix D, this is equivalent to moving the i_{0}th column, inserting it between the kth and (k+1)th columns, and shifting the resulting submatrix to the upper left corner through permutation operations.

Meanwhile, due to (31), the (k+1)th row of D needs to have at least one positive element, and we have just shown that D(k+1,i)\leq 0 for all i=1,\ldots,k, so WLOG we suppose D(k+1,k+1)>0. A similar argument shows that, since no X\in\mathcal{E}_{A} is conformal to D, we must have D(i,k+1)\geq 0 for all i=1,\ldots,k.

Therefore, the (k+1)×(k+1)(k+1)\times(k+1) principal matrix of DD has exactly the same sign pattern as indicated by (32), after appropriate row/column permutations. So we have the induction (𝐇k)(𝐇k+1)(\mathbf{H}^{k})\Rightarrow(\mathbf{H}^{k+1}).

However, when the n\times n matrix D satisfies (\mathbf{H}^{n}), it has the sign pattern (32) after row/column permutations. Then the sum of the last column is strictly positive, as indicated by the sign pattern, which contradicts (31).

Finally, we show that A\mathcal{E}_{A} characterizes all the elementary matrices.

Theorem 4.1

Given any nonzero D\in\mathbb{R}^{n\times n}, if \operatorname{vec}(D)\in\operatorname{null}(A), then D has a conformal realization (rockafellar1999network, , Section 10B), namely:

D=D(1)+D(2)++D(s),\displaystyle D=D^{(1)}+D^{(2)}+\ldots+D^{(s)}, (34)

where D(1),,D(s)D^{(1)},\ldots,D^{(s)} are elementary matrices of null(A)\operatorname{null}(A) and D(i)D^{(i)} is conformal to DD, for all i=1,,si=1,\ldots,s. In particular, D(i)AD^{(i)}\in\mathcal{E}_{A}, i=1,,s\forall i=1,...,s. Therefore, A\mathcal{E}_{A} includes all the elementary matrices of null(A)\operatorname{null}(A).

Proof

By Lemma 2, we suppose that X(1)AX^{(1)}\in\mathcal{E}_{A} and X(1)X^{(1)} is conformal to DD. Then X(1)X^{(1)} can be scaled properly by α1=min(i,j)supp(X(1))supp(D)|D(i,j)/X(1)(i,j)|>0\alpha_{1}=\min_{(i,j)\in\operatorname{supp}(X^{(1)})\cap\operatorname{supp}(D)}|D(i,j)/X^{(1)}(i,j)|>0 such that |supp(Dα1X(1))|<|supp(D)||\operatorname{supp}(D-\alpha_{1}X^{(1)})|<|\operatorname{supp}(D)| and Dα1X(1)D-\alpha_{1}X^{(1)} is conformal to DD. Denote D(1)α1X(1)D^{(1)}\triangleq\alpha_{1}X^{(1)} and D¯(1)=DD(1)\bar{D}^{(1)}=D-D^{(1)}, both of them are conformal to DD.

By applying the same procedure to \bar{D}^{(1)}, we get \bar{D}^{(2)} such that |\operatorname{supp}(\bar{D}^{(2)})|<|\operatorname{supp}(\bar{D}^{(1)})|. We can repeat this process, and eventually the conformal realization (34) holds, since |\operatorname{supp}(D)|\leq n^{2} and |\operatorname{supp}(\bar{D}^{(k)})| is strictly decreasing.

If D is an elementary matrix, then in the previous construction of (34) we have D=D^{(1)}+\bar{D}^{(1)}, and both terms are conformal to D. If \bar{D}^{(1)}\neq 0, then \bar{D}^{(1)} is a nonzero element of \operatorname{null}(A) conformal to D with |\operatorname{supp}(\bar{D}^{(1)})|<|\operatorname{supp}(D)|, so \operatorname{supp}(\bar{D}^{(1)})\neq\operatorname{supp}(D). This contradicts the assumption that D is an elementary matrix. When \bar{D}^{(1)}=0, D=D^{(1)} is a multiple of one of the special matrices in the description of \mathcal{E}_{A} after a certain row/column permutation. Summing up, \mathcal{E}_{A} includes all the elementary matrices of \operatorname{null}(A).

Remark 4

For a given nonzero D such that \operatorname{vec}(D)\in\operatorname{null}(A), a simple algorithm following the proof of Theorem 4.1 to find an elementary matrix X conformal to D will cost at most O(n^{2}) operations. We can select an appropriate \alpha>0 such that D-\alpha X is conformal to D and |\operatorname{supp}(D-\alpha X)|<|\operatorname{supp}(D)|. By repeating this process, we can find the conformal realization in at most |\operatorname{supp}(D)|\leq n^{2} steps. Therefore, the total number of operations needed to find the conformal realization is O(n^{4}). In comparison, the approach proposed by tseng2010coordinate finds a conformal realization with support cardinality less than l (usually, l is much smaller than n^{2}) and requires O(n^{3}(n^{2}-l)^{2}) operations.
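The following minimal sketch, assuming NumPy and using our own function name and tolerance handling, illustrates the cycle search behind this remark: starting from a positive entry of D, it alternately moves to a negative entry in the same column and a positive entry in the same row, and closes a cycle as soon as a row or column index repeats. The resulting ±1 cycle matrix is a permuted copy of some E^t, hence an elementary matrix of null(A), conformal to D.

import numpy as np

def find_elementary_conformal(D, tol=1e-12):
    # Input: a nonzero D whose row and column sums all vanish (entries well
    # above tol in magnitude). Output: a {+1, 0, -1} simple-cycle matrix X,
    # a permuted copy of some E^t, that is conformal to D.
    n = D.shape[0]
    i0, j0 = map(int, np.argwhere(D > tol)[0])      # start at a positive entry
    rows, cols = [i0], [j0]                         # positive corners visited: (rows[s], cols[s])
    while True:
        # the current column has a negative entry because its sum is zero
        i_new = next(r for r in range(n) if D[r, cols[-1]] < -tol)
        if i_new in rows:                           # a row repeats: close the cycle
            k = rows.index(i_new)
            rows, cols = rows[k:], cols[k:]
            break
        rows.append(i_new)
        # the new row has a positive entry because its sum is zero
        j_new = next(c for c in range(n) if D[i_new, c] > tol)
        if j_new in cols:                           # a column repeats: close the cycle
            m = cols.index(j_new)
            rows, cols = rows[m + 1:], cols[m + 1:] + [j_new]
            break
        cols.append(j_new)
    L = len(rows)
    X = np.zeros_like(D, dtype=float)
    for s in range(L):
        X[rows[s], cols[s]] = 1.0                   # positive corner of the cycle
        X[rows[(s + 1) % L], cols[s]] = -1.0        # negative corner in the same column
    return X

Peeling off a suitable multiple α of the returned matrix, as in the proof of Theorem 4.1, and repeating on the remainder yields the conformal realization (34).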

Working set selection

By analyzing the structure of the elementary matrices of \operatorname{null}(A), we gain a better idea of potential directions along which the transport cost is reduced by a large amount. This is supported by the following theorem, where we continue using the notation introduced in Section 3.

Theorem 4.2

Consider the linear program (10) where A\in\mathbb{R}^{M\times N} and b\in\mathbb{R}^{M} are defined as in (30) (M=2n, N=n^{2}). Given any X\in\mathbb{R}^{n\times n} and D\in\mathbb{R}^{n\times n} such that \operatorname{vec}(X)\in\mathcal{X} and \operatorname{vec}(D)\in\mathcal{D}(\operatorname{vec}(X);\mathcal{N}), there exists an elementary matrix \bar{D} of \operatorname{null}(A) conformal to D such that for any set \mathcal{I}\subseteq\mathcal{N} satisfying

\mathcal{I}\supseteq\operatorname{supp}(\operatorname{vec}(\bar{D})),

we have

q(vec(X);)(1n23)q(vec(X);𝒩).\displaystyle q(\operatorname{vec}(X);\mathcal{I})\leq\left(\frac{1}{n^{2}-3}\right)q(\operatorname{vec}(X);\mathcal{N}). (35)
Proof

Since vec(D)𝒟(vec(X);𝒩)\operatorname{vec}(D)\in\mathcal{D}(\operatorname{vec}(X);\mathcal{N}), vec(D)null(A)\operatorname{vec}(D)\in\operatorname{null}(A). Then based on Theorem 4.1, we have the conformal realization:

D=D(1)+D(2)++D(s).\displaystyle D=D^{(1)}+D^{(2)}+...+D^{(s)}.

Moreover, proof of Theorem 4.1 indicates that we can construct this realization with sn23s\leq n^{2}-3, because the support of D(i)D^{(i)} has cardinality at least 44. Then similar to discussion in Theorem 3.1, we may find s¯{1,,s}\bar{s}\in\{1,\ldots,s\} such that D¯=D(s¯)\bar{D}=D^{(\bar{s})}, supp(vec(D(s¯)))\mathcal{I}\supseteq\operatorname{supp}(\operatorname{vec}(D^{(\bar{s})})), and

q(vec(X);𝒩)(n23)q(vec(X);).\displaystyle q(\operatorname{vec}(X);\mathcal{N})\geq(n^{2}-3)q(\operatorname{vec}(X);\mathcal{I}).

 

Now we discuss two approaches to carefully select the working set \mathcal{I}_{k} at iteration k of the block coordinate descent method (15) (a short code sketch illustrating both rules follows the list):

1.

    Diagonal band. Given 3pn3\leq p\leq n, denote

\mathcal{G}\triangleq\left\{(i,j)\in\mathbb{Z}^{2}\;\Big{|}\;\begin{array}{ll}i\in[j,j+p-1]&\text{if}\;j\in[1,n-p+1]\\ i\in[1,j+p-n-1]\cup[j,n]&\text{if}\;j\in[n-p+2,n]\end{array}\right\},

    and construct matrix Gn×nG\in\mathbb{R}^{n\times n} such that

G(i,j)=\begin{cases}1,&\text{if }(i,j)\in\mathcal{G},\\ 0,&\text{otherwise.}\end{cases} (36)

Therefore, G has the following structure, where each column contains p ones and the band wraps around into the top rows of the last p-1 columns:

G=\begin{pmatrix}1&&&&1&\ldots&1\\ \vdots&1&&&&\ddots&\vdots\\ 1&\vdots&\ddots&&&&1\\ 1&1&&1&&&\\ &1&\ddots&\vdots&1&&\\ &&\ddots&1&\vdots&\ddots&\\ &&&1&1&\ldots&1\end{pmatrix}_{n\times n}

It is like a band of width p across the diagonal, hence the name. Then we may construct \bar{D}^{k}\in\mathbb{R}^{n\times n} and \mathcal{I}_{k} as follows:

\begin{aligned}&\text{Obtain $\bar{D}^{k}$ by uniformly randomly permuting all columns and rows of $G$.}\\ &\text{Let }\mathcal{I}_{k}\triangleq\operatorname{supp}(\operatorname{vec}(\bar{D}^{k})).\end{aligned} (37)

    Note that |k|=np|\mathcal{I}_{k}|=np.

2.

    Submatrix. Given m<nm<n, obtain D¯k\bar{D}^{k} and k\mathcal{I}_{k} such that

\begin{aligned}&\text{Uniformly randomly pick two sets of $m$ different numbers out of $\{1,\ldots,n\}$:}\\ &\quad i_{1},\ldots,i_{m}\text{ and }j_{1},\ldots,j_{m}.\\ &\text{Let }\bar{D}^{k}(i,j)=\begin{cases}1&\text{if }i\in\{i_{1},\ldots,i_{m}\}\text{ and }j\in\{j_{1},\ldots,j_{m}\},\\ 0&\text{otherwise.}\end{cases}\\ &\text{Let }\mathcal{I}_{k}\triangleq\operatorname{supp}(\operatorname{vec}(\bar{D}^{k})).\end{aligned} (38)

    In this case, the support of D¯k\bar{D}^{k} is a submatrix of size m×mm\times m. Therefore, |k|=m2|\mathcal{I}_{k}|=m^{2}.
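Both selection rules can be realized with a few lines of code. The following sketch assumes NumPy, uses our own helper names, and returns the working set as indices into the column-stacked vector vec(·) used throughout.

import numpy as np

def band_mask(n, p):
    # 0/1 matrix G of (36): a band of width p starting at the diagonal and
    # wrapping around, so each column contains exactly p ones.
    G = np.zeros((n, n))
    for j in range(n):
        G[(j + np.arange(p)) % n, j] = 1.0
    return G

def working_set_band(n, p, rng):
    # Rule (37): uniformly random row and column permutations of G.
    Dbar = band_mask(n, p)[rng.permutation(n), :][:, rng.permutation(n)]
    return np.flatnonzero(Dbar.flatten(order="F"))     # |I_k| = n p

def working_set_submatrix(n, m, rng):
    # Rule (38): an m-by-m submatrix with random row and column index sets.
    I_rows = rng.choice(n, size=m, replace=False)
    J_cols = rng.choice(n, size=m, replace=False)
    Dbar = np.zeros((n, n))
    Dbar[np.ix_(I_rows, J_cols)] = 1.0
    return np.flatnonzero(Dbar.flatten(order="F"))     # |I_k| = m^2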

Next, we discuss two random block coordinate descent algorithms to solve the LP problem (10) with AA given in (30) whose working set selections are based on the two approaches discussed above.

Algorithm 2 Random block coordinate descent - diagonal band (RBCD-DB)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n} and band width p[3,n]p\in[3,n]. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}).
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. Choose k\mathcal{I}_{k} according to (37).
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

The following result describes the convergence property of Algorithm 2.

Theorem 4.3

Consider the LP problem (10) with A given in (30). Then the sequences \{x^{k}\} and \{\mathcal{I}_{k}\} generated by Algorithm 2 satisfy the expected Gauss-Southwell-q rule (21), i.e.,

𝔼[q(xk;k)k]vq(xk;𝒩),\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid\mathcal{F}_{k}]\leq vq(x^{k};\mathcal{N}),

with vn(p2)(n23)(n!)2v\geq\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}. Therefore, cT(xkx)0c^{T}(x^{k}-x^{*})\to 0 almost surely and 𝔼[cT(xkx)]\mathbb{E}[c^{T}(x^{k}-x^{*})] converges to 0 exponentially with rate 1v1-v.

Proof

Given xkx^{k}, Theorem 4.2 guarantees that there exists DkAD^{k}\in\mathcal{E}_{A} such that if ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})), then (35) holds for =k\mathcal{I}=\mathcal{I}_{k} and vec(X)=xk\operatorname{vec}(X)=x^{k}, i.e.,

q(xk;k)(1n23)q(xk;𝒩).\displaystyle q(x^{k};\mathcal{I}_{k})\leq\left(\frac{1}{n^{2}-3}\right)q(x^{k};\mathcal{N}). (39)

Next, we will estimate the probability that ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})) holds.

Suppose that after row/column permutations and scaling of DkD^{k}, we obtain EtE^{t}, 2tn2\leq t\leq n. Then after appropriate row and column swapping, DkD^{k} can be written as

\begin{pmatrix}0&&&&&&&&\\ *&*&&&&&&&\\ *&&*&&&&&&\\ &*&&\ddots&&&&&\\ &&*&&*&&&&\\ &&&\ddots&&*&&&\\ &&&&*&*&0&&\\ &&&&&&&\ddots&\\ &&&&&&&&0\end{pmatrix}_{n\times n}. (52)

That is, elements (2,1) and (3,1) are nonzero; elements (j,j) and (\operatorname{mod}(j+2,n),j) are nonzero for all j=2,\ldots,t-1; elements (t,t) and (\operatorname{mod}(t+1,n),t) are nonzero; all other elements are zero. Obviously, the support of this matrix is covered by the support of G in (36). Moreover, by moving the whole support in matrix (52) downwards or to the bottom right corner, we can create at least n(p-2)-1 more different matrices whose supports are all covered by that of G. These n(p-2) matrices can be obtained by permuting rows and columns of D^{k} in n(p-2) different ways. Therefore, the probability that \mathcal{I}_{k} covers the support of D^{k} is at least \frac{n(p-2)}{(n!)^{2}}, and we have that

𝔼[q(xk;k)xk]\displaystyle\mathbb{E}[q(x^{k};\mathcal{I}_{k})\mid x^{k}] =supp(vec(Dk))q(xk;)P(k=)+supp(vec(Dk))q(xk;)P(k=)\displaystyle=\sum_{\operatorname{supp}(\operatorname{vec}(D^{k}))\subseteq\mathcal{I}}q(x^{k};\mathcal{I})P(\mathcal{I}_{k}=\mathcal{I})+\sum_{\operatorname{supp}(\operatorname{vec}(D^{k}))\nsubseteq\mathcal{I}}q(x^{k};\mathcal{I})P(\mathcal{I}_{k}=\mathcal{I})
(1n23)q(xk;𝒩)P(supp(vec(Dk))k)+0\displaystyle\leq\left(\frac{1}{n^{2}-3}\right)q(x^{k};\mathcal{N})P(\operatorname{supp}(\operatorname{vec}(D^{k}))\subseteq\mathcal{I}_{k})+0
(n(p2)(n23)(n!)2)q(xk;𝒩)\displaystyle\leq\left(\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}\right)q(x^{k};\mathcal{N})

Therefore, the expected Gauss-Southwell-qq rule (21) holds with vv at least n(p2)(n23)(n!)2\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}.  

Remark 5

It can be shown that if $n$ is large enough and $p$ is chosen between $O(\log(n))$ and $O(n)$, then the lower bound for the constant $v$ derived in Theorem 4.3 is better than the one estimated for Algorithm 1, i.e., $\frac{1}{\binom{N}{l}(N-l+1)}$. In fact, we have the following result.

Lemma 3

Suppose that K¯2\bar{K}\geq 2 and η>0\eta>0 satisfies

2K¯32(K¯1)+log(K¯2)>2/η,\displaystyle\frac{2\bar{K}-3}{2(\bar{K}-1)}+\log\left(\frac{\bar{K}}{2}\right)>2/\eta,

and nn satisfies

n4(2K¯32(K¯1)+log(K¯2))η2,nlog(n)ηK¯.\displaystyle n\geq\frac{4}{\left(\frac{2\bar{K}-3}{2(\bar{K}-1)}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta-2},\quad\frac{n}{\log(n)}\geq\eta\bar{K}.

Then for any p[ηlog(n),nK¯]p\in[\eta\log(n),\frac{n}{\bar{K}}], and p3p\geq 3, we have n(p2)(n23)(n!)21(n2np)(n2np+1)\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}\geq\frac{1}{\binom{n^{2}}{np}(n^{2}-np+1)}.

Proof

See Appendix.  

Let $n\geq 30$, $\eta=1$, $\bar{K}=8$. Then, according to Lemma 3, for $\log(n)\leq p\leq n/8$, the lower bound $\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}$ is larger. We believe that this is a fairly reasonable range of $p$ when $n$ grows large. This lower bound is improved because we exploit the structure of the elementary matrices when solving OT problems.

As for the submatrix approach, we often find it quite efficient in numerical experiments. The optimal solution obtained with the submatrix approach can also be decomposed in $\mathcal{E}_{A}$. More precisely, suppose that $\mathcal{I}_{k}$ is chosen according to (38) and is associated with a submatrix of size $m\times m$. Then the decomposition of $d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k})$ (more rigorously, of its matrix form) only involves multiples and permutations of the elementary matrices in $\{E^{2},E^{3},\cdots,E^{m}\}$. However, global convergence is not guaranteed when the working set is restricted to a fixed-size submatrix; in fact, there is a counterexample (see Appendix B). Therefore, we propose an algorithm that combines these two approaches.

Algorithm 3 Random block coordinate descent - submatrix and diagonal Band (RBCD-SDB)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n}, submatrix row/column dimension mm, band width p[3,n]p\in[3,n] and selection parameter s[0,1]s\in[0,1]. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}).
  for k=0,1,2,k=0,1,2,\dotsc do
     Step 1. With probability ss, choose k\mathcal{I}_{k} according to (37); otherwise, choose k\mathcal{I}_{k} according to (38).
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. xk+1:=xk+dk.\;\;x^{k+1}:=x^{k}+d^{k}.
  end for

The convergence of Algorithm 3 is guaranteed by the next theorem.

Theorem 4.4

Consider the LP problem (10) with $A$ given in (30). Then the sequences $\{x^{k}\}$ and $\{\mathcal{I}_{k}\}$ generated by Algorithm 3 satisfy the expected Gauss-Southwell-$q$ rule (21), with $v\geq\frac{sn(p-2)}{(n^{2}-3)(n!)^{2}}$. Therefore, $c^{T}(x^{k}-x^{*})\to 0$ almost surely and $\mathbb{E}[c^{T}(x^{k}-x^{*})]$ converges to 0 exponentially with rate $1-v$.

Proof

Given xkx^{k}, Theorem 4.2 shows that there exists DkAD^{k}\in\mathcal{E}_{A} such that if ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})), then (35) holds with =k\mathcal{I}=\mathcal{I}_{k} and vec(X)=xk\operatorname{vec}(X)=x^{k}. We estimate the probability that ksupp(vec(Dk))\mathcal{I}_{k}\supseteq\operatorname{supp}(\operatorname{vec}(D^{k})).

First, consider the case where, after row/column permutations and scaling of $D^{k}$, we obtain $E^{t}$ with $2\leq t\leq m$. If $\mathcal{I}_{k}$ is chosen according to (37), then, similar to the discussion in Theorem 4.3, $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability at least $\frac{n(p-2)}{(n!)^{2}}$. If $\mathcal{I}_{k}$ is chosen according to (38), then $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability

(ntmt)2(nm)2=((nt)!/((mt)!(nm)!)n!/(m!(nm)!))2=(m!/(mt)!n!/(nt)!)2.\displaystyle\frac{\binom{n-t}{m-t}^{2}}{\binom{n}{m}^{2}}=\left(\frac{(n-t)!/((m-t)!(n-m)!)}{n!/(m!(n-m)!)}\right)^{2}=\left(\frac{m!/(m-t)!}{n!/(n-t)!}\right)^{2}.

Therefore, in this case, the probability $p_{t}$ that $\mathcal{I}_{k}$ covers the support of $D^{k}$ satisfies

ptsn(p2)(n!)2+(1s)(m!/(mt)!n!/(nt)!)2.\displaystyle p_{t}\geq\frac{sn(p-2)}{(n!)^{2}}+(1-s)\left(\frac{m!/(m-t)!}{n!/(n-t)!}\right)^{2}.

Next, consider the case where we obtain $E^{t}$, $m+1\leq t\leq n$, after row/column permutations and rescaling of $D^{k}$. In this case, if $\mathcal{I}_{k}$ is chosen according to (37), $\mathcal{I}_{k}$ covers the support of $D^{k}$ with probability at least $\frac{n(p-2)}{(n!)^{2}}$; if $\mathcal{I}_{k}$ is chosen according to (38), this probability is 0. Therefore, in this case we have $p_{t}\geq\frac{sn(p-2)}{(n!)^{2}}$. In general, the probability that $\mathcal{I}_{k}$ covers the support of $D^{k}$ is at least $\min_{t}\{p_{t}\}\geq\frac{sn(p-2)}{(n!)^{2}}$. Similar to the discussion in the proof of Theorem 4.3, (21) then holds with $v\geq\frac{sn(p-2)}{(n^{2}-3)(n!)^{2}}$.

Accelerated random block coordinate descent

Algorithm 4 is an accelerated random block coordinate descent (ARBCD) algorithm. It intermittently selects the working set $\mathcal{I}_{k}$ in a different way from Algorithm 3 for acceleration purposes. At certain iterations, we construct $\mathcal{I}_{k}$ based on the iterates generated by the algorithm in the past, i.e., on $x^{end}-x^{start}$. This vector reflects the progress achieved by running RBCD-SDB for a few iterations and predicts the direction in which the algorithm could potentially make further improvements. This choice is analogous to the momentum concept often employed in acceleration techniques in optimization, such as the heavy ball method and Nesterov acceleration. Algorithm 4 has a convergence rate similar to that of Algorithm 3 (note that the acceleration iterations occur only occasionally). We will verify its improved performance in the numerical experiments.

Algorithm 4 Accelerated random block coordinate descent (ARBCD)
  (Initialization) Choose feasible X0n×nX^{0}\in\mathbb{R}^{n\times n}, submatrix row/column dimension mm, band width p[3,n]p\in[3,n], selection parameter s[0,1]s\in[0,1], and acceleration interval TT. Let x0=vec(X0)x^{0}=\operatorname{vec}(X^{0}), xstart=xend=x0x^{start}=x^{end}=x^{0}. Binary variable accacc.
  for k=0,1,2,k=0,1,2,\ldots do
     Step 1. Choose k\mathcal{I}_{k} as following.
     if mod(k+1,T)0\operatorname{mod}(k+1,T)\neq 0 or |supp(xendxstart)|m2|\operatorname{supp}(x^{end}-x^{start})|\leq m^{2} then
        acc=falseacc=\texttt{false}. With probability ss, choose k\mathcal{I}_{k} according to (37); otherwise, choose k\mathcal{I}_{k} according to (38).
     else
        acc=trueacc=\texttt{true}. Choose k\mathcal{I}_{k} uniformly randomly from supp(xendxstart)\operatorname{supp}(x^{end}-x^{start}) so that |k|=m2|\mathcal{I}_{k}|=m^{2}.
     end if
     Step 2. Find dk𝒟(xk;k).d^{k}\in\mathcal{D}(x^{k};\mathcal{I}_{k}).
     Step 3. Update xk+1:=xk+dk;x^{k+1}:=x^{k}+d^{k};
     Step 4. Update xend=xk+1.x^{end}=x^{k+1}.
     if acc=trueacc=\texttt{true}then
        Update xstart=xk+1.x^{start}=x^{k+1}.
     end if
  end for
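The acceleration branch of Step 1 can be sketched as follows (a rough Matlab illustration; the variable names and the fallback branch are placeholders):

% Step 1 of ARBCD: every T-th iteration, if the momentum vector xend - xstart
% has a large enough support, draw the working set from that support.
S = find(xend ~= xstart);              % support of the progress direction
if mod(k+1, T) == 0 && numel(S) > m^2
    acc = true;
    Ik = S(randperm(numel(S), m^2));   % |Ik| = m^2 indices from supp(xend - xstart)
else
    acc = false;
    % choose Ik by the diagonal-band rule (37) with probability s, else by (38)
end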

5 Numerical experiments

In this section, we conduct numerical experiments on various examples of OT problems. (All experiments are conducted using Matlab R2021b on a Dell OptiPlex 7090 with CPU: Intel(R) Core(TM) i9-10900 @ 2.80GHz (20 CPUs) and RAM: 65536 MB. Data and codes are uploaded to https://github.com/yue-xie/RBCDforOT.) In Section 5.1, we compare the random block coordinate descent methods with the different working set selections proposed in this article. Then, we compare the best-performing method, ARBCD, with Sinkhorn in Section 5.2.1 and with an interior point-inspired algorithm in Section 5.2.2. Finally, a large-scale OT problem is solved using ARBCD in Section 5.3.

5.1 Comparison of random block coordinate descent methods with different working set selection rules

In this subsection, we apply the proposed random block coordinate descent methods (Alg. 1 - Alg. 4) to calculate the Wasserstein distance between three pairs of distributions. We compare these algorithms to illustrate the difference between various working set selection rules. Additionally, we inspect the differences between the theoretical and actual convergence rates, as well as the solution sparsity.

Experiment settings

We compute the Wasserstein distance between a pair of 1-dim probability distributions (standard normal and uniform over $[-1,1]$), a pair of 2-dim probability distributions (uniform over $[-\pi,\pi]^{2}$ and an empirical invariant measure obtained from an interacting particle method simulation of reaction-diffusion particles in advection flows; detailed configurations can be found in wang2022deepparticle , Section 4.2, 2D cellular flow, $\kappa=2^{-4}$), and a pair of 3-dim distributions (uniform over $[-1,1]^{3}$ and a 3-dimensional multivariate normal distribution). When computing the Wasserstein distance between the pair of 1-dim probability distributions, we utilize their histograms (c.f. Section 2): Let $n=1001$. Centers of the cells are $y^{i}=\tilde{y}^{i}=\frac{i-501}{500}$, $i=1,\ldots,1001$; $C_{i,j}=\operatorname{dist}(\tilde{y}^{i},y^{j})^{2}$, $1\leq i,j\leq 1001$; $r^{1}_{i}=\frac{\phi(y^{i})}{\sum_{i=1}^{1001}\phi(y^{i})}$, $i=1,\ldots,1001$, where $\phi(y)$ is the pdf of the standard normal; $r^{2}=(1/1001,\ldots,1/1001)^{T}\in\mathbb{R}^{1001}$. When calculating the Wasserstein distance for the second and third pairs, we apply the point cloud setting (c.f. Section 2): Let $n=1000$. For each pair, use i.i.d. samples $\{\tilde{y}^{i}\}$ and $\{y^{j}\}$, $1\leq i,j\leq 1000$, to approximate the two continuous probability measures respectively. Let $C_{i,j}=\operatorname{dist}(\tilde{y}^{i},y^{j})^{2}$ and $r^{1}=r^{2}=(1/1000,\ldots,1/1000)^{T}\in\mathbb{R}^{1000}$. In all cases, we normalize the cost matrix $C$ such that its maximal element is $1$. Figure 1 depicts these three pairs of distributions. For all cases, we first use linprog in Matlab to find a solution with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-10).
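For reference, a minimal sketch of how such a benchmark solve can be set up in Matlab is given below; it assumes $x=\operatorname{vec}(X)$ in column-major ordering and that $C$, $r^{1}$, $r^{2}$ have already been constructed (the exact ordering used in (30) may differ):

% Benchmark: solve min c'*x s.t. X*1 = r1, X'*1 = r2, x >= 0 at high precision.
c    = C(:);                                    % vectorized cost matrix
Aeq  = [kron(ones(1,n), speye(n));              % row sums of X equal r1
        kron(speye(n), ones(1,n))];             % column sums of X equal r2
beq  = [r1; r2];
opts = optimoptions('linprog', 'Algorithm', 'dual-simplex', ...
                    'ConstraintTolerance', 1e-9, 'OptimalityTolerance', 1e-10);
xstar = linprog(c, [], [], Aeq, beq, zeros(n^2,1), [], opts);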

Figure 1: Three pairs of distributions

In the 1-d case, we compare the histograms of the two distributions; in the 2-d and 3-d settings, we compare the samples/point clouds of the two distributions.

Methods

We specify the settings of the four algorithms. All algorithms are started at the same feasible x0=vec(r1(r2)T)x^{0}=\operatorname{vec}(r^{1}(r^{2})^{T}) in each experiment. We solve the LP subproblems via linprog in Matlab with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-7).
RBCD0. Algorithm 1: Vanilla random block coordinate descent. Let l=1002l=100^{2}. Stop the algorithm after 50005000 iterations.
RBCD-DB. Algorithm 2: Random block coordinate descent - diagonal band. Let p=1002/np=\lfloor 100^{2}/n\rfloor. Stop the algorithm after 50005000 iterations.
RBCD-SDB. Algorithm 3: Random block coordinate descent - submatrix and diagonal band. Let m=100m=100, p=m2/np=\lfloor m^{2}/n\rfloor and s=0.1s=0.1. Stop the algorithm after 50005000 iterations.
ARBCD. Algorithm 4: Accelerated random block coordinate descent. Let m=100m=100, p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. Stop the algorithm after 50005000 iterations. Note that the degree of freedom of the subproblem per iteration is 1002100^{2}, about 1/1001/100 the size of the original one.

Figure 2: Comparison of algorithms to compute Wasserstein distance I

The x-axis is the number of iterations. The y-axis is the optimality gap $f_{k}-f^{*}=c^{T}x^{k}-c^{T}x^{*}$ (first row) or the corresponding feasibility error $\|X^{k}\cdot{\bf 1}-r^{1}\|+\|(X^{k})^{T}\cdot{\bf 1}-r^{2}\|$ (second row). The first row of subplots in this figure shows the trajectory/progress of Alg. 1, Alg. 2, Alg. 3 and Alg. 4 when computing the Wasserstein distance between the three pairs of probability distributions in 1-d, 2-d, and 3-d respectively. Each algorithm is run 5 times and the curves showcase the average behavior. The second row of subplots shows the corresponding average feasibility error against computation time for the algorithms in this experiment.

Comments on Figure 2

We can see from Figure 2 that different approaches to choosing a working set of the same size can significantly affect the performance of random BCD type methods. The curves of RBCD-DB are below those of RBCD0 in the long run, demonstrating that RBCD-DB has a better average performance. The reason for this is that RBCD0 generates the working set with full randomness, while RBCD-DB takes the structure of the elementary matrices into account. The latter makes an educated guess at a working set that decreases the objective function by a large amount. The submatrix approach (38) works very well in practice, as illustrated by the better performance of RBCD-SDB and ARBCD compared to RBCD-DB. In the long run, ARBCD outperforms RBCD-SDB, verifying the acceleration effect. It is important to note that the algorithm parameters are set to their default values; we expect, and have observed, similar behaviors of the algorithms when these settings are changed. Also note that the feasibility errors of the algorithms are kept at a low level according to the figure. On the other hand, the curves in these numerical experiments suggest sublinear convergence rates. This observation does not contradict the theoretical linear convergence rate as long as $v$ is small enough. We will verify that the numerical experiments do not violate the lower bounds we derived for the constant $v$ in the linear convergence rates.

1-d case 2-d case 3-d case
iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v}
0 0.6237 N/A 0 9.3538 N/A 0 3.3456 N/A
1000 0.0083 4.3e-3 1000 0.6254 2.7e-3 1000 0.4746 2.0e-3
2000 0.0054 4.3e-4 2000 0.4773 2.7e-4 2000 0.3821 2.2e-4
3000 0.0045 1.8e-4 3000 0.4183 1.3e-4 3000 0.3438 1.1e-4
4000 0.0039 1.4e-4 4000 0.3846 8.3e-5 4000 0.3371 2.0e-5
5000 0.0036 8.0e-5 5000 0.3762 2.2e-5 5000 0.3350 6.2e-6
Table 1: Data of RBCD0, CmaxC_{\max} is the maximal value of elements in CC
1-d case 2-d case 3-d case
iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v} iter. (f¯kf)Cmax(\bar{f}_{k}-f^{*})C_{\max} v^\hat{v}
0 0.6237 N/A 0 9.3538 N/A 0 3.3456 N/A
1000 0.0130 3.9e-3 1000 0.6002 2.7e-3 1000 0.4601 2.0e-3
2000 0.0057 8.2e-4 2000 0.3920 4.3e-4 2000 0.3414 3.0e-4
3000 0.0036 4.6e-4 3000 0.3074 2.4e-4 3000 0.2878 1.7e-4
4000 0.0026 3.3e-4 4000 0.2592 1.7e-4 4000 0.2544 1.2e-4
5000 0.0020 2.6e-4 5000 0.2276 1.3e-4 5000 0.2316 9.4e-5
Table 2: Data of RBCD-DB
About Table 1 & 2

In these two tables we record the optimality gap $\bar{f}_{k}-f^{*}$ every 1000 iterations for both RBCD0 and RBCD-DB. Here $\bar{f}_{k}$ is the average function value at iteration $k$, since we run each algorithm 5 times. The column $\hat{v}$ denotes the estimate of the constant $v$ in the expected Gauss-Southwell-$q$ rule (21). It is calculated by the formula $\hat{v}=1-\sqrt[1000]{(\bar{f}_{k}-f^{*})/(\bar{f}_{k-1000}-f^{*})}$. Values of $\hat{v}$ in both Tables 1 and 2 are far larger than the lower bounds for $v$: $\frac{1}{\binom{N}{l}(N-l+1)}$ and $\frac{(n-1)(p-2)}{(n^{2}-3)(n!)^{2}}$, corresponding to RBCD0 and RBCD-DB respectively, where $N=n^{2}$. They also decrease as we run more iterations, indicating that the per-iteration reduction of the optimality gap becomes smaller as the iterate approaches the solution. We intend to study this phenomenon in our future work.
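In other words, over a window of 1000 iterations we assume the gap contracts by a constant factor $1-\hat{v}$ per iteration, so that $\bar{f}_{k}-f^{*}\approx(1-\hat{v})^{1000}(\bar{f}_{k-1000}-f^{*})$; in Matlab (with hypothetical variable names) this reads

% gap_k and gap_prev are the average optimality gaps recorded 1000 iterations apart.
vhat = 1 - (gap_k / gap_prev)^(1/1000);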

Figure 3: Sparsity of solutions

Y-axis records xk0\|x_{k}\|_{0}, i.e., the number of nonzero elements in xkx_{k}. This figure shows the sparsity of xkx_{k} in RBCD-SDB and ARBCD when computing the Wasserstein distance given the three pairs of probability distributions. Each curve represents the average over 5 repetitions.

Sparse solutions

We can observe from Figure 3 that the iterates in RBCD-SDB and ARBCD become sparse quickly and remain so. The reason for this is that the solutions of OT problems are usually sparse (for the point cloud setting, at least one of the optimal solutions satisfies $\|x^{*}\|_{0}=n$ because the extreme points of the LP in this setting are permutation matrices divided by $n$), and these two algorithms can locate solutions with high accuracy relatively fast. As a result, the storage needed by these two algorithms is considerably reduced after they have been run for a while. In the point cloud setting, the storage complexity is typically expected to decrease from $O(n^{2})$ to $O(n)$ (note that the degree of freedom of the subproblem per iteration is typically chosen as $O(n)$ because of the diagonal band approach with $p\geq 3$).

Choice of mm

We run RBCD-SDB and ARBCD with different $m$ for a fixed $n$ to test the optimal setting of the subproblem size. As shown in Figure 4, the best choice of $m$ occurs at a smaller value (10 or 30 percent of $n$) in the 1-d case. In Figure 5, we observe a similar phenomenon in the 2-d case. The optimal setting of $m$ shifts to a larger value in the 3-d case. We conclude that the optimal setting of $m$ may vary depending on many factors, such as the specific problem and the efficiency of the subproblem solver. Based on the given results, we suggest choosing a smaller $m$ in practice, especially when computational resources are limited.

Figure 4: Choice of $m$ under various $n$

Wall-clock time of the algorithms RBCD-SDB and ARBCD to compute solutions at accuracy level $f^{*}\times 10^{-3}$. The algorithms are stopped once the solution accuracy is within this tolerance. Each experiment is repeated 3 times and the average time is reported. $p=\lfloor m^{2}/n\rfloor$.

n=501n=501
mm 50 150 250 350 450
RBCD-SDB iter. 2.00e+04 3.55e+02 55.0 17.7 7.67
err. 7.28e-07 4.95e-07 4.88e-07 4.01e-07 1.71e-07
ARBCD iter. 1.95e+4 3.01e+02 58.0 22.3 8.33
err. 5.68e-07 4.90e-07 4.92e-07 9.08e-08 9.49e-08
n=1001n=1001
mm 60 100 300 500 700 900
RBCD-SDB iter. 2.00e+04 1.14e+04 2.21e+02 47.7 20 7.67
err. 3.55e-06 4.93e-07 4.93e-07 2.96e-07 3.51e-07 1.65e-07
ARBCD iter. 2.00e+04 9.67e+03 1.74e+02 47.3 22.3 8.33
err. 2.06e-06 4.94e-07 4.60e-07 4.09e-07 3.39e-07 1.17e-08
n=1501n=1501
mm 75 150 450 750 1050 1350
RBCD-SDB iter. 2.00e+04 6.80e+03 1.69e+02 50.3 19.3 7.67
err. 3.90e-06 4.93e-07 4.89e-07 4.18e-07 3.63e-07 1.70e-07
ARBCD iter. 2.00e+04 3.11e+03 1.67e+02 46.3 20 8
err. 1.85e-06 4.93e-07 4.86e-07 4.15e-07 3.08e-07 2.35e-07
Table 3: Data of Figure 4

The average iterations and solution errors are recorded for reference. Feasibility errors are kept at a low level (below 1e-15) and omitted.

Figure 5: Choice of $m$ in various dimensions

Numerical experiments from 1-d to 3-d cases. Solutions of accuracy level f×103f^{*}\times 10^{-3}. Repetition is 5 and the box plot of time is shown. p=m2/np=\lfloor m^{2}/n\rfloor.

5.2 Comparison between ARBCD and algorithms in the literature

Experiment settings

We generated 8 pairs of distributions/patterns based on synthetic and real datasets. Descriptions are as follows. Note that we use histogram settings (c.f. Section 2) for datasets 1 and 2, and point cloud settings (c.f. Section 2) for other datasets. We use cost function C(x,y)=xy2C(x,y)=\|x-y\|^{2}.
Dataset 1: Uniform distribution to standard normal distribution over [1,1][-1,1]. Similar to the 1-d case in Section 5.1. n=200.n=200.
Dataset 2: Uniform distribution to a randomly shuffled standard normal distribution over $[-1,1]$ (similar to the 1-d case in Section 5.1; we randomly shuffle the weights of the normal distribution histogram). $n=1000$.
Dataset 3: Uniform distribution over [π,π]2[-\pi,\pi]^{2} to an empirical invariant measure. Similar to the 2-d case in Section 5.1. n=1000.n=1000.
Dataset 4: Distribution of Σu\sqrt{\Sigma}u to distribution of 2Σv(1;1;1)2\sqrt{\Sigma}v-(1;1;1), where Σ=(10.50.250.510.50.250.51)\Sigma=\begin{pmatrix}1&0.5&0.25\\ 0.5&1&0.5\\ 0.25&0.5&1\end{pmatrix}, uu and vv conform uniform distributions on [0,1]3[0,1]^{3} and are independent. n=1000.n=1000.
Dataset 5: Similar to Dataset 4, with Σ=(10.80.640.810.80.640.81)\Sigma=\begin{pmatrix}1&0.8&0.64\\ 0.8&1&0.8\\ 0.64&0.8&1\end{pmatrix}. n=1000.n=1000.
Dataset 6: Distribution of Σu\Sigma u to distribution of Σv\Sigma v, where Σ=(10110111)T\Sigma=\begin{pmatrix}1&0&1&1\\ 0&1&1&-1\end{pmatrix}^{T}, uu conforms a uniform distribution on [0,2π]2[0,2\pi]^{2} and vv conforms a uniform distribution on [1,1]2[-1,1]^{2}. n=1000.n=1000.
Dataset 7: Distribution of (1;1;;1)T10u\underbrace{(1;1;\ldots;1)^{T}}_{10}u to distribution of (1;2;3;;10)Tv+(1;1;;1)T(1;2;3;\ldots;10)^{T}v+(1;1;\ldots;1)^{T}, where uu conforms uniform distribution over [0,2π][0,2\pi] and vv conforms uniform distribution over [1,1][-1,1]. n=1000.n=1000.
Dataset 8: Distribution of a “cylinder” to a “spiral”, see Figure 6. $n=1000$.
In all cases, we normalize the cost matrix $C$ such that its maximal element is $1$. For all cases, we use linprog in Matlab to find a solution with high precision (dual-simplex, constraint tolerance 1e-9, optimality tolerance 1e-10).
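A minimal sketch of the cost-matrix construction for the point-cloud datasets is given below; it assumes the two samples are stored as the rows of matrices Yt and Y (illustrative names):

% C(i,j) = ||Yt(i,:) - Y(j,:)||^2, normalized so that the largest entry is 1.
C = zeros(n, n);
for i = 1:n
    for j = 1:n
        C(i,j) = sum((Yt(i,:) - Y(j,:)).^2);   % squared Euclidean distance
    end
end
C = C / max(C(:));                       % normalize: maximal element equals 1
r1 = ones(n,1)/n;  r2 = ones(n,1)/n;     % uniform weights in the point cloud setting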

Figure 6: Visualization of dataset 8

5.2.1 Comparison with Sinkhorn

Methods

Implementation of Sinkhorn and ARBCD are specified as follows.
Sinkhorn. The algorithm proposed in cuturi2013sinkhorn to compute the Wasserstein distance. Let $\gamma$ be the coefficient of the entropy term. We let $\gamma=\epsilon/(4\log n)$ as suggested in dvurechensky2018computational . We consider the settings $\epsilon=10^{-4},10^{-3},0.01,0.1$. Iterates of Sinkhorn are projected onto the feasible region using a rounding procedure (Algorithm 2 in altschuler2017near ). Note that this projection step is added only for evaluation purposes, because Sinkhorn does not provide feasible solutions if stopped early; it does not affect Sinkhorn’s main steps or its convergence at all. A similar approach is used for evaluation in jambulapati2019direct . In addition, we carry out all the updates in log space and use the LogSumExp function to avoid numerical instability issues (a minimal sketch of this log-domain update is given after this list). We stop Sinkhorn after 300000 iterations when $n=200$ and after 100000 iterations when $n=1000$.
ARBCD. Algorithm 4: Accelerated random block coordinate descent. Let m=40m=40 when n=200n=200 and m=100m=100 when n=1000n=1000. Let p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. Stop the algorithm after 1000010000 iterations. To be fair, we also project the solution in each iteration onto the feasible region via the rounding procedure. LP subproblems are solved via linprog in Matlab with high precision (dual-simplex, constraint tolerance 1e-9).
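For completeness, a minimal sketch of the kind of log-domain Sinkhorn update we have in mind is shown below (our actual implementation may differ in details); here gamma is the entropy coefficient, f and g are the dual potentials, and LogSumExp is applied row-wise:

% Log-domain Sinkhorn: avoids forming exp(-C/gamma) explicitly.
f = zeros(n,1); g = zeros(n,1);
lse = @(M) max(M,[],2) + log(sum(exp(M - max(M,[],2)), 2));  % row-wise LogSumExp
for it = 1:maxit
    f = gamma*log(r1) - gamma*lse((g' - C)/gamma);   % row scaling in log space
    g = gamma*log(r2) - gamma*lse((f' - C')/gamma);  % column scaling in log space
end
P = exp((f + g' - C)/gamma);   % transport plan (before the rounding procedure)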

Figure 7: Comparison of algorithms to compute Wasserstein distance II

The x-axis is the wall-clock time in seconds. The y-axis is the optimality gap $f_{k}-f^{*}=c^{T}x^{k}-c^{T}x^{*}$. This figure shows the trajectory/progress of Algorithm 4 (ARBCD) and Sinkhorn with different settings when computing the Wasserstein distance between eight pairs of probability distributions. ARBCD is run 5 times in each experiment and the curves showcase the average behavior.

Figure 8: Transport plan given by different algorithms (dataset 1)

This figure shows the transport plan for dataset 1. In each plot, the bottom distribution is uniform and the top is standard normal. Each line segment in between represents mass transported between a pair of points; the darker the line, the more mass is transported. To plot the plans more clearly, we select every other mass point from $-1$ to $0$ (so only 50 mass points are included). The overall transport plans from $-1$ to $1$ are symmetric, so we display them in this way for presentation clarity.

Dataset #\# method time(s) iter. gap feas.err. subproblem size
ARBCD 82.63 2326.7 4.9386e-07 4.1753e-17 22500
IPM1 7.539 198 3.36713e-08 1.8924e-17 20890
1 IPM2 342.6 2000 4.3408e-04 2.0889e-17 96550
IPM3 5.044 19 1.2319e-07 5.7525e-17 485194
IPM4 11.01 21 2.8982e-07 8.3340e-17 1000000
ARBCD 240.3 10000 8.6011e-09 8.4518e-17 22500
IPM1 413.9 2000 1.0562e-04 2.5704e-17 20890
2 IPM2 0.8815 16 1.1066e-10 2.6584e-17 96550
IPM3 5.923 20 1.7153e-09 5.5638e-17 485194
IPM4 11.02 20 3.0871e-09 7.6026e-17 1000000
ARBCD 30.80 375 1.2787e-04 2.3256e-17 22500
IPM1 N/A N/A N/A N/A 21055
3 IPM2 67.43 2000 0.0146 2.1107e-17 92666
IPM3 313.7 2000 0.0110 1.3510e-17 465075
IPM4 6.696 13 1.2655e-04 6.3473e-17 1000000
ARBCD 145.2 4404.3 2.0519e-05 1.8927e-17 22500
IPM1 N/A N/A N/A N/A 21312
4 IPM2 N/A N/A N/A N/A 95639
IPM3 345.7 2000 0.0083 1.5114e-17 474798
IPM4 7.716 15 2.0397e-05 7.2654e-17 1000000
ARBCD 56.24 569.3 2.0412e-05 2.1927e-17 22500
IPM1 N/A N/A N/A N/A 19988
5 IPM2 N/A N/A N/A N/A 94468
IPM3 320.6 2000 0.0050 1.4802e-17 492856
IPM4 7.616 15 8.6357e-06 6.6523e-17 1000000
ARBCD 22.20 265 2.3475e-04 2.3622e-17 22500
IPM1 N/A N/A N/A N/A 20143
6 IPM2 417.6 2000 0.0032 1.4814e-17 94072
IPM3 337.9 2000 0.0065 1.3566e-17 482269
IPM4 5.438 11 1.4044e-04 6.2801e-17 1000000
ARBCD 63.88 342.3 7.4215e-05 2.0341e-17 22500
IPM1 N/A N/A N/A N/A 20139
7 IPM2 N/A N/A N/A N/A 99792
IPM3 278.4 2000 0.0221 1.9256e-17 486651
IPM4 5.136 11 5.7172e-05 5.5542e-17 1000000
ARBCD 71.44 439.3 1.9715e-05 2.4507e-17 22500
IPM1 N/A N/A N/A N/A 10702
8 IPM2 N/A N/A N/A N/A 97305
IPM3 219.1 2000 9.3439e-04 2.3205e-17 499786
IPM4 5.303 10 1.7165e-05 7.5243e-17 1000000
Table 4: Comparison between ARBCD and IPM

For ARBCD, the average iterations/time/gap/feasibility error are recorded for reference. For IPM, results of four different settings are recorded. The initial reduced problem size is reflected by the subproblem size column.

Comments on Figure 7

We can observe the following from Figure 7: although Sinkhorn with larger ϵ\epsilon may converge fast, the solution accuracy is also lower. In fact, this is true for all Sinkhorn-based algorithms because the optimization problem is not exact - it has an extra entropy term. Therefore, the larger γ\gamma or ϵ\epsilon is chosen, the less accurate the solution becomes. The discrepancy in accuracy does matter, as can be seen by inspecting the solution quality in Figure 8. On the other hand, when ϵ\epsilon is set smaller, the convergence of Sinkhorn becomes slower. As can be seen from the plots, when ϵ=0.1\epsilon=0.1 or 0.010.01, Sinkhorn converges faster than ARBCD; when ϵ=103\epsilon=10^{-3}, Sinkhorn is comparable to ARBCD; when ϵ=104\epsilon=10^{-4}, Sinkhorn is slower than ARBCD. In conclusion, if relatively higher precision is desired, ARBCD is comparable with Sinkhorn. Moreover, note that here we solve the subproblems in ARBCD using Matlab built-in solver linprog. ARBCD can be faster if more efficient subproblem solvers are applied.

5.2.2 Comparison with an interior point-inspired algorithm

In this experiment we continue using the 8 datasets, except that Dataset 1 has n=1000n=1000.

Methods

Implementation of ARBCD and IPM are specified as follows.
ARBCD. Algorithm 4. Let m=150m=150, p=m2/np=\lfloor m^{2}/n\rfloor, s=0.1s=0.1 and T=10T=10. We stop the algorithm when (fkf)/f103(f_{k}-f^{*})/f^{*}\leq 10^{-3} or when iteration reaches the maximum 10000. Other settings are similar to the experiment in Section 5.2.1. The algorithm outputs the last iterate.
IPM. A recent interior point-inspired algorithm proposed in zanetti2023interior . The Matlab code is taken directly from the GitHub page of that paper (https://github.com/INFORMSJoC/2022.0184); we have made minimal modifications for comparison. We stop the algorithm when $(f_{k}-f^{*})/f^{*}\leq 10^{-3}$ or when the iteration count hits 2000. We run four different settings separately, where the initial reduced problem sizes are approximately $11(2n-1)$, $n^{2}/10$, $n^{2}/2$ and $n^{2}$ ($n=1000$), respectively. For fairness, we also project the solution in each iteration onto the feasible region via the rounding procedure. The algorithm outputs the solution with the best function value gap.

Comments on Table 4

We can see that the performance of ARBCD is better than that of IPM1 and comparable to IPM2 and IPM3. IPM4 has the best performance, but its subproblem size is also large. In fact, our observation of the memory storage is IPM4 $\geq$ IPM3 $\geq$ IPM2 $\geq$ IPM1 $\approx$ ARBCD. This is consistent with the subproblem size column. In practice, the memory consumption of IPM4 is large, contradicting the goal of the authors in zanetti2023interior to save memory. In zanetti2023interior , the authors suggest the setting of IPM1. We also find that, when running IPM1, IPM2 and IPM3, the algorithms develop numerical instabilities on some datasets, which is reported as N/A in the table. This is consistent with the results in zanetti2023interior (Figure 4), where the ratio of runs in which the IPM algorithm finds an ideal solution is below $1$.

Figure 9: Solving a large-scale problem via ARBCD

We apply ARBCD to solve the large-scale 1-d problem ($n=12800$). The y-axis of the left plot shows the optimality gap $f_{k}-f^{*}$, the y-axis of the right plot shows the feasibility error $\|X^{k}\cdot{\bf 1}-r^{1}\|+\|(X^{k})^{T}\cdot{\bf 1}-r^{2}\|$, and the x-axis records the iteration number. ARBCD is repeated 3 times and average results are reported.

5.3 Test on a large-scale OT problem

In this subsection, we generate a pair of 1-dim probability distributions with large discrete support sets ($n=12800$). For the first distribution, the locations of the discrete support ($x^{i}$, $i=1,\ldots,n$) are evenly spaced in $[-1,1]$, and their weights/probabilities are uniform (i.e., $1/n$). For the other distribution, the locations of the discrete support are determined as $\tilde{x}^{i}=x^{\sigma(i)}+u^{i}$, where $\sigma(i)$ is a random permutation of $i=1,\ldots,n$, and $u^{i}$ is a random variable that conforms to a uniform distribution over $[-0.5,0.5]$. The weights/probabilities are determined as $w_{i}=\frac{\phi(\tilde{x}^{i})}{\sum_{i=1}^{n}\phi(\tilde{x}^{i})}$, where $\phi(x)$ is the pdf of the standard normal. The benchmark optimal solution is quickly computed via a closed-form formula for the 1-d OT problem ($f^{*}=6.236\times 10^{-3}$). For ARBCD, we use the settings $m=\lceil\sqrt{10n}\rceil$, $p=\lfloor m^{2}/n\rfloor$, $s=0.1$ and $T=10$.
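The closed-form benchmark exploits the fact that, for the squared-distance cost in one dimension, an optimal plan matches the two sorted supports monotonically. A minimal Matlab sketch (assuming xs, ws and ys, wt hold the supports and weights of the two discrete distributions; the names are illustrative):

% North-west corner rule on the sorted supports (optimal for 1-d convex costs).
[xs, ix] = sort(xs);  ws = ws(ix);
[ys, iy] = sort(ys);  wt = wt(iy);
i = 1; j = 1; fstar = 0;
while i <= numel(xs) && j <= numel(ys)
    mass  = min(ws(i), wt(j));                  % mass moved between current pair
    fstar = fstar + mass * (xs(i) - ys(j))^2;   % accumulate transport cost
    ws(i) = ws(i) - mass;  wt(j) = wt(j) - mass;
    if ws(i) <= 0, i = i + 1; end
    if wt(j) <= 0, j = j + 1; end
end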

Comments on Figure 9

The figure showcases the average behavior of ARBCD within 10000 iterations. It is able to locate a solution such that $(f_{k}-f^{*})/f^{*}\leq 0.1$. The convergence appears linear from the trajectory. We also want to point out that Gurobi 10.01 (academic license) runs out of memory on the desktop we use for the numerical experiments. Indeed, saving memory is one of the merits that motivate us to consider RBCD methods.

6 Conclusion

In this paper, we investigate the RBCD method to solve LP problems, including OT problems. In particular, an expected Gauss-Southwell-$q$ rule is proposed to select the working set $\mathcal{I}_{k}$ at iteration $k$. It guarantees almost sure convergence and a linear convergence rate, and it is satisfied by all algorithms proposed in this work. We first develop a vanilla RBCD, called RBCD0, to solve general LP problems. Its inexact version is also investigated. Then, by examining the structure of the matrix $A$ in the linear system of OT, characterizing elementary matrices of $\operatorname{null}(A)$ and identifying a conformal realization of any matrix $D\in\operatorname{null}(A)$, we refine the working set selection. We use two approaches - diagonal band and submatrix - for constructing $\mathcal{I}_{k}$ and employ an acceleration technique inspired by the momentum concept to improve the performance of RBCD0. In our numerical experiments, we compare all proposed RBCD methods and verify the acceleration effects as well as the sparsity of solutions. We also demonstrate the gap between the theoretical convergence rate and the practical one, and we discuss the choice of the subproblem size. Furthermore, we compare ARBCD, the best-performing of the proposed methods, with existing algorithms and apply it to a large-scale OT problem. The results show the advantages of our method in finding relatively accurate solutions to OT problems and in saving memory. For future work, we plan to extend our method to handle continuous measures and further improve it through parallelization and multiscale strategies, among other approaches.

Acknowledgements

The research of YX is supported by Guangdong Province Fundamental and Applied Fundamental Research Regional Joint Fund, (Project 2022B1515130009), and the start-up fund of HKU-IDS. The research of ZZ is supported by Hong Kong RGC grants (Projects 17300318 and 17307921), the National Natural Science Foundation of China (Project 12171406), Seed Fund for Basic Research (HKU), an R&D Funding Scheme from the HKU-SCF FinTech Academy, and Seed Funding for Strategic Interdisciplinary Research Scheme 2021/22 (HKU).

Declarations

The authors have no competing interests to declare that are relevant to the content of this article.

Data availability

The datasets generated during the current study are available in the GitHub repository,
https://github.com/yue-xie/RBCDforOT.

Appendix A Proof of Lemma 3

Proof

Suppose that

\displaystyle\log\left((n^{2})!/(n^{2}-np)!\right)\geq 2\log(n!)+\log((np)!). (53)

Then

\displaystyle\begin{array}[]{rrl}&\left((n^{2})!/(n^{2}-np)!\right)/(np)!&\geq(n!)^{2}\\ \implies&\binom{n^{2}}{np}/(n!)^{2}&\geq 1\\ \implies&\frac{\binom{n^{2}}{np}}{(n!)^{2}}\cdot\frac{n(p-2)(n^{2}-np+1)}{n^{2}-3}&\geq 1\\ \implies&\frac{n(p-2)}{(n^{2}-3)(n!)^{2}}&\geq\frac{1}{\binom{n^{2}}{np}(n^{2}-np+1)},\end{array}

where the third inequality holds because pn/2p\leq n/2 and npK¯6n\geq p\bar{K}\geq 6. So we only need to prove (53). Note that

log(n2)!(n2np)!\displaystyle\log\frac{(n^{2})!}{(n^{2}-np)!} =x=n2np+1n2log(x)n2npn2(logx)𝑑x\displaystyle=\sum_{x=n^{2}-np+1}^{n^{2}}\log(x)\geq\int_{n^{2}-np}^{n^{2}}(\log x)dx
=n2log(n2)n2((n2np)log(n2np)n2+np)\displaystyle=n^{2}\log(n^{2})-n^{2}-\left((n^{2}-np)\log(n^{2}-np)-n^{2}+np\right)
=n2log(n2)(n2np)log(n2np)np\displaystyle=n^{2}\log(n^{2})-(n^{2}-np)\log(n^{2}-np)-np
=(p=n/K)2nplogn+K1Kn2logKK1np\displaystyle\overset{(p=n/K)}{=}2np\log n+\frac{K-1}{K}\cdot n^{2}\cdot\log\frac{K}{K-1}-np
2nplogn+2K32K2npnp.\displaystyle\geq 2np\log n+\frac{2K-3}{2K-2}\cdot np-np. (54)

The last inequality holds because log(1+x)xx2/2\log(1+x)\geq x-x^{2}/2 for x(0,1)x\in(0,1) and p=n/Kp=n/K. Meanwhile, right hand side of (53) satisfies the following:

\displaystyle\quad 2\log(n!)+\log((np)!)
2(n+1)log(n+1)2n+(np+1)log(np+1)np\displaystyle\leq 2(n+1)\log(n+1)-2n+(np+1)\log(np+1)-np
2(n+1)(logn+log2)2n+(np+1)(log(np)+log2)np\displaystyle\leq 2(n+1)(\log n+\log 2)-2n+(np+1)(\log(np)+\log 2)-np
=(np+2n+3)logn+(np+1)logp+2(n+1)log22nnp+(log2)(np+1)\displaystyle=(np+2n+3)\log n+(np+1)\log p+2(n+1)\log 2-2n-np+(\log 2)(np+1)
=(p=nK)2nplogn+(2n+4)logn+(log4)n+(log2)np+log82n(1+logK)nplogK\displaystyle\overset{\left(p=\frac{n}{K}\right)}{=}2np\log n+(2n+4)\log n+(\log 4)n+(\log 2)np+\log 8-2n-(1+\log K)np-\log K
(KK¯2,n6)2nplogn+(2n+4)logn+(log2)np(1+logK)np\displaystyle\overset{(K\geq\bar{K}\geq 2,n\geq 6)}{\leq}2np\log n+(2n+4)\log n+(\log 2)np-(1+\log K)np (55)

In order to show (53), we only need to confirm (55)(54)\eqref{ineq3: pr}\leq\eqref{ineq2: pr}. By observation, this is equivalent to

(2K32K2+log(K2))np(2n+4)logn(pηlogn,K¯K)(2K¯32K¯2+log(K¯2))ηn2n+44(2K¯32K¯2+log(K¯2))η2n.\displaystyle\begin{array}[]{crl}&\left(\frac{2K-3}{2K-2}+\log\left(\frac{K}{2}\right)\right)np&\geq(2n+4)\log n\\ \overset{(p\geq\eta\log n,\bar{K}\leq K)}{\Longleftarrow}&\left(\frac{2\bar{K}-3}{2\bar{K}-2}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta n&\geq 2n+4\\ \Longleftrightarrow&\frac{4}{\left(\frac{2\bar{K}-3}{2\bar{K}-2}+\log\left(\frac{\bar{K}}{2}\right)\right)\eta-2}&\leq n.\end{array}

The last inequality is assumed.  

Appendix B A counterexample of interest

Example 1

Consider LP problem (9) with n=3n=3, r1=r2=(1/3,1/3,1/3)Tr^{1}=r^{2}=(1/3,1/3,1/3)^{T}. Let

C=((1+ϵ1)2(2ϵ3)2(1ϵ2)2(1ϵ2)2(1+ϵ1)2(2ϵ3)2(2ϵ3)2(1ϵ2)2(1+ϵ1)2),C=\begin{pmatrix}(1+\epsilon_{1})^{2}&(2-\epsilon_{3})^{2}&(1-\epsilon_{2})^{2}\\ (1-\epsilon_{2})^{2}&(1+\epsilon_{1})^{2}&(2-\epsilon_{3})^{2}\\ (2-\epsilon_{3})^{2}&(1-\epsilon_{2})^{2}&(1+\epsilon_{1})^{2}\end{pmatrix},

where 0<ϵi10<\epsilon_{i}\ll 1, i=1,2,3i=1,2,3 such that 2(1+ϵ1)2<(1ϵ2)2+(2ϵ3)22(1+\epsilon_{1})^{2}<(1-\epsilon_{2})^{2}+(2-\epsilon_{3})^{2}. It can be easily seen that the optimal solution is γ=13(001100010)\gamma^{*}=\frac{1}{3}\begin{pmatrix}0&0&1\\ 1&0&0\\ 0&1&0\end{pmatrix}. Suppose that γ0=13(100010001)\gamma^{0}=\frac{1}{3}\begin{pmatrix}1&0&0\\ 0&1&0\\ 0&0&1\end{pmatrix}. If we use the submatrix approach (38) with m=2m=2 (largest number less than nn) to select a working set k\mathcal{I}_{k}, then the algorithm will be stuck at γ0\gamma^{0}. It is not globally convergent. This cost matrix corresponds to the following case of transporting a three-point distribution to another one:

Figure 10: A counterexample of interest

The dashed hexagon has an edge length of 11. Transport point distribution y~1\tilde{y}_{1}, y~2\tilde{y}_{2} and y~3\tilde{y}_{3} to that of y1y_{1}, y2y_{2} and y3y_{3}.
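To see why the iterate cannot leave $\gamma^{0}$ when $m=2$, note that a $2\times 2$ working set admits a nonzero feasible update only if it contains two diagonal entries of $\gamma^{0}$, i.e., its rows and columns form the same pair $\{i,j\}$. The only feasible direction then transfers mass $\delta\in(0,1/3]$ from $(i,i)$ and $(j,j)$ to $(i,j)$ and $(j,i)$, changing the objective by

\displaystyle\delta\left(C_{ij}+C_{ji}-C_{ii}-C_{jj}\right)=\delta\left((1-\epsilon_{2})^{2}+(2-\epsilon_{3})^{2}-2(1+\epsilon_{1})^{2}\right)>0,

so every admissible move strictly increases the cost and the method stalls at $\gamma^{0}$.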

References

  • (1) Altschuler, J., Niles-Weed, J., Rigollet, P.: Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Advances in Neural Information Processing Systems 30 (2017)
  • (2) Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223. PMLR (2017)
  • (3) Beck, A.: The 2-coordinate descent method for solving double-sided simplex constrained minimization problems. Journal of Optimization Theory and Applications 162(3), 892–919 (2014)
  • (4) Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM Journal on Optimization 23(4), 2037–2060 (2013)
  • (5) Benamou, J., Brenier, Y.: A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik 84(3), 375–393 (2000)
  • (6) Benamou, J.D., Carlier, G., Cuturi, M., Nenna, L., Peyré, G.: Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37(2), A1111–A1138 (2015)
  • (7) Benamou, J.D., Collino, F., Mirebeau, J.M.: Monotone and consistent discretization of the Monge–Ampère operator. Mathematics of computation 85(302), 2743–2775 (2016)
  • (8) Benamou, J.D., Froese, B.D., Oberman, A.M.: Numerical solution of the optimal transportation problem using the Monge–Ampère equation. Journal of Computational Physics 260, 107–126 (2014)
  • (9) Berahas, A.S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 661–680 (2020)
  • (10) Blondel, M., Seguy, V., Rolet, A.: Smooth and sparse optimal transport. In: International Conference on Artificial Intelligence and Statistics, pp. 880–889. PMLR (2018)
  • (11) Bonafini, M., Schmitzer, B.: Domain decomposition for entropy regularized optimal transport. Numerische Mathematik 149(4), 819–870 (2021)
  • (12) Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics 44(4), 375–417 (1991)
  • (13) Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming 155(1), 57–79 (2016)
  • (14) Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation 87(314), 2563–2609 (2018)
  • (15) Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26 (2013)
  • (16) Dvurechensky, P., Gasnikov, A., Kroshnin, A.: Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In: International Conference on Machine Learning, pp. 1367–1376. PMLR (2018)
  • (17) Facca, E., Benzi, M.: Fast iterative solution of the optimal transport problem on graphs. SIAM Journal on Scientific Computing 43(3), A2295–A2319 (2021)
  • (18) Gasnikov, A.V., Gasnikova, E., Nesterov, Y.E., Chernov, A.: Efficient numerical methods for entropy-linear programming problems. Computational Mathematics and Mathematical Physics 56(4), 514–524 (2016)
  • (19) Genevay, A., Cuturi, M., Peyré, G., Bach, F.: Stochastic optimization for large-scale optimal transport. Advances in neural information processing systems 29 (2016)
  • (20) Genevay, A., Peyre, G., Cuturi, M.: Learning generative models with sinkhorn divergences. In: A. Storkey, F. Perez-Cruz (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 84, pp. 1608–1617. PMLR (2018)
  • (21) Gerber, S., Maggioni, M.: Multiscale strategies for computing optimal transport. Journal of Machine Learning Research 18 (2017)
  • (22) Gottschlich, C., Schuhmacher, D.: The shortlist method for fast computation of the earth mover’s distance and finding optimal solutions to transportation problems. PloS one 9(10), e110214 (2014)
  • (23) Guminov, S., Dvurechensky, P., Tupitsa, N., Gasnikov, A.: On a combination of alternating minimization and Nesterov’s momentum. In: International Conference on Machine Learning, pp. 3886–3898. PMLR (2021)
  • (24) Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N.: When cyclic coordinate descent outperforms randomized coordinate descent. Advances in Neural Information Processing Systems 30 (2017)
  • (25) Haker, S., Zhu, L., Tannenbaum, A., Angenent, S.: Optimal mass transport for registration and warping. International Journal of Computer Vision 60(3), 225–240 (2004)
  • (26) He, B., Yuan, X.: On the 𝒪(1/n)\mathcal{O}(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis 50(2), 700–709 (2012)
  • (27) Huang, M., Ma, S., Lai, L.: A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In: International Conference on Machine Learning, pp. 4446–4455. PMLR (2021)
  • (28) Jambulapati, A., Sidford, A., Tian, K.: A direct O~(1/ϵ)\tilde{O}(1/\epsilon) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems 32 (2019)
  • (29) Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29(1), 1–17 (1998)
  • (30) Lei, N., Su, K., Cui, L., Yau, S.T., Gu, X.D.: A geometric view of optimal transportation and generative model. Computer Aided Geometric Design 68, 1–21 (2019)
  • (31) Li, W., Yin, P., Osher, S.: Computations of optimal transport distance with Fisher information regularization. Journal of Scientific Computing 75(3), 1581–1595 (2018)
  • (32) Lin, T., Ho, N., Jordan, M.I.: On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research 23(137), 1–42 (2022)
  • (33) Ling, H., Okada, K.: An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE transactions on pattern analysis and machine intelligence 29(5), 840–853 (2007)
  • (34) Liu, Y., Wen, Z., Yin, W.: A multiscale semi-smooth Newton method for optimal transport. Journal of Scientific Computing 91(2), 39 (2022)
  • (35) Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming 152(1), 615–642 (2015)
  • (36) Mandad, M., Cohen-Steiner, D., Kobbelt, L., Alliez, P., Desbrun, M.: Variance-minimizing transport plans for inter-surface mapping. ACM Transactions on Graphics (ToG) 36(4), 1–14 (2017)
  • (37) Natale, A., Todeschi, G.: Computation of optimal transport with finite volumes. ESAIM: Mathematical Modelling and Numerical Analysis 55(5), 1847–1871 (2021)
  • (38) Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimization: Convergence analysis and error bounds. SIAM Journal on Optimization 26(1), 197–226 (2016)
  • (39) Necoara, I., Nesterov, Y., Glineur, F.: Random block coordinate descent methods for linearly constrained optimization over networks. Journal of Optimization Theory and Applications 173(1), 227–254 (2017)
  • (40) Necoara, I., Takáč, M.: Randomized sketch descent methods for non-separable linearly constrained optimization. IMA Journal of Numerical Analysis 41(2), 1056–1092 (2021)
  • (41) Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2), 341–362 (2012)
  • (42) Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Taylor & Francis (2001)
  • (43) Peleg, S., Werman, M., Rom, H.: A unified approach to the change of resolution: Space and gray-level. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 739–742 (1989)
  • (44) Perrot, M., Courty, N., Flamary, R., Habrard, A.: Mapping estimation for discrete optimal transport. Advances in Neural Information Processing Systems 29 (2016)
  • (45) Peyré, G., Cuturi, M.: Computational optimal transport. Foundations and Trends in Machine Learning 11(5-6), 355–607 (2019)
  • (46) Polyak, B.T.: Introduction to optimization. Optimization Software, Inc., Publications Division, New York (1987)
  • (47) Qu, Z., Richtárik, P., Takác, M., Fercoq, O.: SDNA: stochastic dual Newton ascent for empirical risk minimization. In: International Conference on Machine Learning, pp. 1823–1832. PMLR (2016)
  • (48) Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming 144(1), 1–38 (2014)
  • (49) Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Mathematical Programming 156(1), 433–484 (2016)
  • (50) Rockafellar, R.T.: Network flows and monotropic optimization, vol. 9. Athena scientific (1999)
  • (51) Schmitzer, B.: A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision 56, 238–259 (2016)
  • (52) Schmitzer, B.: Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3), A1443–A1481 (2019)
  • (53) Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics 35(2), 876–879 (1964)
  • (54) Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., Guibas, L.: Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG) 34(4), 1–11 (2015)
  • (55) Sun, R., Ye, Y.: Worst-case complexity of cyclic coordinate descent: 𝒪(n2)\mathcal{O}(n^{2}) gap with randomized version. Mathematical Programming 185(1), 487–520 (2021)
  • (56) Toselli, A., Widlund, O.: Domain decomposition methods-algorithms and theory, vol. 34. Springer Science & Business Media (2004)
  • (57) Tseng, P., Yun, S.: Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications 140(3), 513–535 (2009)
  • (58) Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 117(1), 387–423 (2009)
  • (59) Tseng, P., Yun, S.: A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Computational Optimization and Applications 47(2), 179–206 (2010)
  • (60) Villani, C.: Topics in optimal transportation, vol. 58. American Math. Soc. (2021)
  • (61) Wang, Z., Xin, J., Zhang, Z.: DeepParticle: learning invariant measure by a deep neural network minimizing Wasserstein distance on data generated from an interacting particle method. Journal of Computational Physics p. 111309 (2022)
  • (62) Wijesinghe, J., Chen, P.: Matrix balancing based interior point methods for point set matching problems. SIAM Journal on Imaging Sciences 16(3), 1068–1105 (2023)
  • (63) Wright, S.: Primal-dual interior-point methods. SIAM (1997)
  • (64) Xie, Y., Shanbhag, U.V.: SI-ADMM: A stochastic inexact ADMM framework for stochastic convex programs. IEEE Transactions on Automatic Control 65(6), 2355–2370 (2019)
  • (65) Xie, Y., Shanbhag, U.V.: Tractable ADMM schemes for computing KKT points and local minimizers for 0\ell_{0}-minimization problems. Computational Optimization and Applications 78(1), 43–85 (2021)
  • (66) Xie, Y., Wang, X., Wang, R., Zha, H.: A fast proximal point method for computing exact Wasserstein distance. In: Uncertainty in artificial intelligence, pp. 433–453. PMLR (2020)
  • (67) Yang, L., Li, J., Sun, D., Toh, K.C.: A fast globally linearly convergent algorithm for the computation of wasserstein barycenters. The Journal of Machine Learning Research 22(1), 984–1020 (2021)
  • (68) Zanetti, F., Gondzio, J.: An interior point–inspired algorithm for linear programs arising in discrete optimal transport. INFORMS Journal on Computing (2023)