
Flows, Scaling, and Entropy Revisited:
A Unified Perspective via Optimizing Joint Distributions

Jason Altschuler (Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge, MA 02139)

In this short expository note, we describe a unified algorithmic perspective on several classical problems which have traditionally been studied in different communities. This perspective views the main characters—the problems of Optimal Transport, Minimum Mean Cycle, Matrix Scaling, and Matrix Balancing—through the same lens of optimization problems over joint probability distributions $P(x,y)$ with constrained marginals. While this is how Optimal Transport is typically introduced, this lens is markedly less conventional for the other three problems. This perspective leads to a simple and unified framework spanning problem formulation, algorithm development, and runtime analysis.

Some fragments of this story are classical—for example, the approach for solving Optimal Transport via the Sinkhorn algorithm for Matrix Scaling dates back at least to the 1960s [57] and by now is well-known in multiple communities spanning economics, statistics, machine learning, operations research, and scientific computing [49, 35, 45, 25].

Yet, the story described in this note was only recently developed in full—for example, the use of probabilistic inequalities to prove near-optimal runtimes [11, 14], and the parallels between Optimal Transport and Minimum Mean Cycle which provide a framework for applying popular algorithmic techniques for the former problem to the latter [12]. These developments all hinge on the perspective of optimizing distributions highlighted in this note. For some problems, this leads to rigorous guarantees that justify algorithms that have long been used in practice and are nowadays the default algorithms in many numerical software packages (e.g., POT, OTT, OTJulia, GeomLoss, MATLAB, R, Lapack, and Eispack); for other problems, this leads to even faster algorithms in practice and/or theory.

The goal of this note is to explain this story with an emphasis on the unifying connections as summarized in Table 1. There are several ways to tell this story. We start by introducing Optimal Transport and Minimum Mean Cycle as graph problems (§1) before re-formulating them as optimization problems over joint distributions (§2), which makes clear their connections to Matrix Scaling and Balancing (§3), and naturally leads to the development (§4) and analysis (§5) of scalable algorithms based on entropic regularization. Throughout this narrative, we keep two threads as equals: the fixed marginal setting (Table 1, left) and the symmetric marginal setting (Table 1, right). As the parallels between these two threads are nearly exact(!), the mental overhead of two threads is hopefully minimal and outweighed by the pedagogical benefit of unifying these two halves of the story—which have often been studied in separate papers and sometimes even separate communities.

The presentation of this article is based on Part I of the author’s thesis [6], and in the interest of brevity, we refer the reader there for proofs and further references (we make no attempt to be comprehensive here given the short length of this note and the immense literature around each of the four problems, e.g., thousands of papers in the past decade just on Optimal Transport).

                          Fixed marginals            Symmetric marginals
Linear program            Optimal Transport          Minimum Mean Cycle
Entropic regularization   Matrix Scaling             Matrix Balancing
Polytope                  Transportation polytope    Circulation polytope
Simple algorithm          Sinkhorn algorithm         Osborne algorithm
Algorithm in dual         Block coordinate descent   Entrywise coordinate descent
Per-iteration progress    KL divergence              Hellinger divergence

Table 1: Although not traditionally viewed in this way, the four bolded optimization problems can be viewed under the same lens: each optimizes a joint probability distribution $P(x,y)$ with similar constraints and objectives. The purpose of this note is to describe the unifying connections in this table and how they can be exploited to obtain practical algorithms with near-optimal runtimes for all four problems.

Notation.

We associate the set of probability distributions on $n$ atoms with the simplex $\Delta_{n}:=\{p\in\mathbb{R}_{\geqslant 0}^{n}:\sum_{i}p_{i}=1\}$, and the set of joint probability distributions on $n\times n$ atoms with $\Delta_{n\times n}:=\{P\in\mathbb{R}_{\geqslant 0}^{n\times n}:\sum_{ij}P_{ij}=1\}$. We write $\mathbf{1}$ to denote the all-ones vector in $\mathbb{R}^{n}$, and $G=(V,E,c)$ to denote a graph with vertex set $V$, edge set $E$, and edge weights $c:E\to\mathbb{R}$. One caution: we write $\exp[A]$ with brackets to denote the entrywise exponential of a matrix $A$.

1 Two classical graph problems

Here we introduce the first two characters in our story: the problems of Optimal Transport (OT) and Minimum Mean Cycle (MMC). We begin by introducing both in the language of graphs as this helps make the parallels clearer when we transition to the language of probability afterward.

1.1 Optimal Transport

In the language of graphs, the OT problem is to find a flow on a bipartite graph that routes “supplies” from one vertex set to “demands” in the other vertex set in a minimum-cost way. For convenience, we renormalize the supply/demand to view them as distributions and abuse notation by denoting both vertex sets by $[n]=\{1,\dots,n\}$.

Definition 1 (Optimal Transport).

Given a weighted bipartite graph $G=([n]\cup[n],E,c)$ and distributions $\mu,\nu\in\Delta_{n}$, the OT problem is

$$\min_{\substack{f:E\to\mathbb{R}_{\geqslant 0}\\ \sum_{j\in[n]}f(i,j)=\mu_{i},\;\forall i\in[n]\\ \sum_{i\in[n]}f(i,j)=\nu_{j},\;\forall j\in[n]}}\;\sum_{e\in E}f(e)\,c(e)\,. \tag{1.1}$$

OT dates back to Monge in the 18th century when he asked: what is the least-effort way to move a mound of dirt into a nearby ditch of equal volume [39]? In operations research, OT appears in textbook problems where one seeks to, e.g., find the minimum-cost way to ship widgets from factories to stores [16]. Recently, OT has become central to diverse applications in data science—ranging from machine learning to computer vision to the natural sciences—due to the ability of OT to compare and morph complex data distributions beyond just dirt mounds and widget allocations. Prototypical examples include data distributions arising from point clouds in statistics, images or 3D meshes in computer graphics, document embeddings in natural language processing, or cell phenotypes or fMRI brain scans in biology. For details on the many applications of OT, we refer to the recent monograph [45].

A central challenge in all these applications is scalable computation. Indeed, data-driven applications require computing OT when the number of data points $n$ in each distribution is large. Although OT is a linear program (LP), it is a very large LP when $n$ is in, say, the many thousands or millions, and it is a longstanding challenge to develop algorithms that can compute OT (even approximately) in a reasonable amount of time for large $n$. See, e.g., the standard texts [16, 50, 45, 3] for a discussion of the extensive literature on OT algorithms, which dates back to Jacobi in the 19th century.

1.2 Minimum Mean Cycle

                 Fixed marginals            Symmetric marginals
Graphs           Bipartite flows            Circulations
Matrices         Transportation polytope    Circulation polytope
Distributions    Couplings                  Self-couplings

Table 2: Informal dictionary for translating between graphs, matrices, and distributions. See §2 for details.

We now turn to the second character in our story. Below, recall that a cycle is a sequence of edges that starts and ends at the same vertex.

Definition 2 (Minimum-Mean-Cycle).

Given a weighted directed graph $G=(V,E,c)$, the MMC problem is

$$\min_{\text{cycle }\sigma\text{ in }G}\;\frac{1}{|\sigma|}\sum_{e\in\sigma}c(e)\,. \tag{1.2}$$

MMC is a classical problem in algorithmic graph theory which has received significant attention over the past half century due to its many applications. A canonical textbook example is that, at least in an idealized world, finding arbitrage opportunities on Wall Street is equivalent to solving an MMC problem [23, §24]. Other classic applications range from periodic optimization (e.g., MMC is equivalent to finding an optimal policy for a deterministic Markov Decision Process [58]), to algorithm design (e.g., MMC provides a tractable alternative for the bottleneck step in the Network Simplex algorithm [32]), to control theory (e.g., MMC characterizes the spectral quantities in Max-Plus algebra [18]).

Just as for OT, a central challenge for MMC is large-scale computation. The first polynomial-time algorithm, due to Karp in 1972, was based on dynamic programming [36]. (It is worth remarking that MMC is polynomial-time solvable, while the seemingly similar problem of finding a cycle with minimum total weight is NP-hard [50, §8.6b].) However, its runtime is $O(n^{3})$, and an extensive literature has sought to reduce this cubic runtime in both theory and practice; see, e.g., the references within the recent papers [12, 20, 31].
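
For concreteness, here is a minimal sketch (in Python/NumPy, not from the original note) of Karp's dynamic program. The function name min_mean_cycle is ours, and for simplicity it assumes every vertex is reachable from vertex 0 (e.g., the graph is strongly connected), with missing edges encoded as np.inf.

import numpy as np

def min_mean_cycle(W):
    # Karp's O(n^3) dynamic program. W[u, v] is the weight of edge (u, v), np.inf if absent.
    # Assumes every vertex is reachable from vertex 0 (e.g., G is strongly connected).
    n = W.shape[0]
    # D[k, v] = minimum weight of a walk from vertex 0 to v using exactly k edges.
    D = np.full((n + 1, n), np.inf)
    D[0, 0] = 0.0
    for k in range(1, n + 1):
        for v in range(n):
            D[k, v] = np.min(D[k - 1, :] + W[:, v])
    # Karp's formula: lambda* = min_v max_k (D[n, v] - D[k, v]) / (n - k).
    best = np.inf
    for v in range(n):
        if not np.isfinite(D[n, v]):
            continue
        ratios = [(D[n, v] - D[k, v]) / (n - k) for k in range(n) if np.isfinite(D[k, v])]
        best = min(best, max(ratios))
    return best

# Example: the cycle 0 -> 1 -> 2 -> 0 has mean (2 + 1 + 3) / 3 = 2.
W = np.full((4, 4), np.inf)
W[0, 1], W[1, 2], W[2, 0] = 2.0, 1.0, 3.0
W[2, 3], W[3, 0] = 5.0, 5.0
print(min_mean_cycle(W))  # 2.0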

2 Graphs, matrices, and distributions

As written, the optimization problems (1.1) and (1.2) defining OT and MMC appear quite different. Here we describe reformulations that are strikingly parallel. This hinges on a simple but useful connection between graph flows, non-negative matrices, and joint distributions, as summarized in Table 2. Below, let $C$ denote the $n\times n$ matrix whose $ij$-th entry is the cost $c(i,j)$ if $(i,j)$ is an edge, and $\infty$ otherwise.

OT is linear optimization over joint distributions with fixed marginals.

Consider a feasible flow for the OT problem in (1.1), i.e., a flow $f:E\to\mathbb{R}_{\geqslant 0}$ routing the supply distribution $\mu$ to the demand distribution $\nu$. This flow is naturally associated with a matrix $P\in\mathbb{R}_{\geqslant 0}^{n\times n}$ whose $ij$-th entry is the flow $f(i,j)$ on that edge. The netflow constraints on $f$ then simply amount to constraints on the row and column sums of $P$, namely $P\mathbf{1}=\mu$ and $P^{T}\mathbf{1}=\nu$. Thus, OT can be re-written as the LP

$$\min_{P\in\Delta_{n\times n}\;:\;P\mathbf{1}=\mu,\;P^{T}\mathbf{1}=\nu}\langle P,C\rangle\,. \tag{2.1}$$

This decision set $\{P\in\Delta_{n\times n}:P\mathbf{1}=\mu,\;P^{T}\mathbf{1}=\nu\}$ is called the transportation polytope and can be equivalently viewed as the space of “couplings”—a.k.a., joint distributions $P(x,y)$ with first marginal $\mu$ and second marginal $\nu$.
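
For illustration, the LP (2.1) can be handed directly to a generic solver. Below is a minimal sketch using scipy.optimize.linprog (our choice of solver, not part of the note), purely to make the formulation concrete; generic LP solvers do not scale to large $n$.

import numpy as np
from scipy.optimize import linprog

def solve_ot_lp(C, mu, nu):
    # Solve (2.1): variables are the n^2 entries of P (flattened row-major),
    # with equality constraints P 1 = mu and P^T 1 = nu, and P >= 0 entrywise.
    n = C.shape[0]
    A_rows = np.kron(np.eye(n), np.ones(n))   # row-sum constraints
    A_cols = np.kron(np.ones(n), np.eye(n))   # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(n, n), res.fun

rng = np.random.default_rng(0)
n = 5
C = rng.random((n, n))
mu = np.full(n, 1.0 / n)
nu = rng.random(n); nu /= nu.sum()
P, val = solve_ot_lp(C, mu, nu)
print(val, np.abs(P @ np.ones(n) - mu).max(), np.abs(P.T @ np.ones(n) - nu).max())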

MMC is linear optimization over joint distributions with symmetric marginals.

MMC admits a similar formulation by taking an LP relaxation. Briefly, the idea is to re-write the objective in terms of matrices, as above, and then take the convex hull of the discrete decision set. Specifically, re-write the objective $\frac{1}{|\sigma|}\sum_{e\in\sigma}c(e)$ as $\langle P_{\sigma},C\rangle$ by associating to a cycle $\sigma$ the $n\times n$ matrix $P_{\sigma}$ with $ij$-th entry $1/|\sigma|$ if the edge $(i,j)\in\sigma$, and zero otherwise. By the Circulation Decomposition Theorem (see, e.g., [16, Problem 7.14]), the convex hull of the set $\{P_{\sigma}:\sigma\text{ a cycle}\}$ of normalized cycles is the set $\{P\in\Delta_{n\times n}:P\mathbf{1}=P^{T}\mathbf{1}\}$ of normalized circulations, and so the LP relaxation of MMC is

$$\min_{P\in\Delta_{n\times n}\;:\;P\mathbf{1}=P^{T}\mathbf{1}}\langle P,C\rangle\,, \tag{2.2}$$

and moreover this LP relaxation is exact. For details see, e.g., [3, Problem 5.47]. This decision set $\{P\in\Delta_{n\times n}:P\mathbf{1}=P^{T}\mathbf{1}\}$ can be equivalently viewed as the space of “self-couplings”—a.k.a., joint distributions $P(x,y)$ with symmetric marginals.
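
Analogously, the relaxation (2.2) is a small LP over normalized circulations. Here is a minimal sketch, again with scipy.optimize.linprog as an illustrative (and non-scalable) solver, where missing edges are given a large finite cost as a stand-in for infinity.

import numpy as np
from scipy.optimize import linprog

def solve_mmc_lp(C):
    # Solve (2.2): min <P, C> over P >= 0 with P 1 = P^T 1 and sum of entries = 1.
    n = C.shape[0]
    A_eq = np.vstack([
        np.kron(np.eye(n), np.ones(n)) - np.kron(np.ones(n), np.eye(n)),  # P 1 - P^T 1 = 0
        np.ones(n * n),                                                   # total mass = 1
    ])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Same graph as in the earlier Karp example, with BIG as a stand-in for "no edge"
# (np.inf would break the LP objective).
BIG = 1e3
C = np.full((4, 4), BIG)
C[0, 1], C[1, 2], C[2, 0] = 2.0, 1.0, 3.0
C[2, 3], C[3, 0] = 5.0, 5.0
print(solve_mmc_lp(C))  # ~2.0: the relaxation is exact, recovering the minimum cycle mean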

3 Entropic regularization and matrix pre-conditioning

Together, (2.1) and (2.2) put OT and MMC on equal footing in that they are both LPs over spaces of joint distributions $P(x,y)$ with constrained marginals—fixed marginals for OT, and symmetric marginals for MMC. We now move from problem formulation to algorithm development, continuing in a parallel way.

The approach discussed in this note, motivated by the interpretation of OT and MMC as optimizing distributions, is to use entropic regularization. Namely, add $-\eta^{-1}H(P)$ to the objectives in (2.1) and (2.2), where $H(P)=-\sum_{ij}P_{ij}\log P_{ij}$ denotes the Shannon entropy of $P$. See Tables 3 and 4, bottom left. This regularization is convex because the entropy function is concave (in fact, strongly concave by Pinsker’s inequality). The regularization parameter $\eta>0$ has a natural tradeoff: intuitively, smaller $\eta$ makes the regularized problem “more convex” and thus easier to optimize, but less accurate for the original problem.

But let’s step back. Why use entropic regularization? The modern optimization toolbox has many other convex regularizers. The key benefit of entropic regularization is that it reduces the problems of OT and MMC to the problems of Matrix Scaling and Matrix Balancing, respectively. This enables the application of classical algorithms for the latter two problems to the former two problems. Below we introduce these two matrix pre-conditioning problems in §3.1 and then explain this reduction in §3.2.

                     Primal                                                                              Dual
Optimal Transport    $\min_{P\in\Delta_{n\times n}\,:\,P\mathbf{1}=\mu,\,P^{T}\mathbf{1}=\nu}\;\langle P,C\rangle$                        $\max_{x,y\in\mathbb{R}^{n}}\;\min_{ij}(C_{ij}-x_{i}-y_{j})+\langle\mu,x\rangle+\langle\nu,y\rangle$
Matrix Scaling       $\min_{P\in\Delta_{n\times n}\,:\,P\mathbf{1}=\mu,\,P^{T}\mathbf{1}=\nu}\;\langle P,C\rangle-\tfrac{1}{\eta}H(P)$    $\max_{x,y\in\mathbb{R}^{n}}\;\operatorname{smin}_{ij}(C_{ij}-x_{i}-y_{j})+\langle\mu,x\rangle+\langle\nu,y\rangle$

Table 3: Primal/dual LP formulations of OT (top) and its regularization (bottom). The regularization is entropic in the primal and softmin smoothing in the dual. The regularized problem is a convex formulation of the Matrix Scaling problem for the matrix $K=\exp[-\eta C]$.

                     Primal                                                                              Dual
Minimum Mean Cycle   $\min_{P\in\Delta_{n\times n}\,:\,P\mathbf{1}=P^{T}\mathbf{1}}\;\langle P,C\rangle$                                  $\max_{x\in\mathbb{R}^{n}}\;\min_{ij}\,C_{ij}+x_{i}-x_{j}$
Matrix Balancing     $\min_{P\in\Delta_{n\times n}\,:\,P\mathbf{1}=P^{T}\mathbf{1}}\;\langle P,C\rangle-\tfrac{1}{\eta}H(P)$              $\max_{x\in\mathbb{R}^{n}}\;\operatorname{smin}_{ij}\,C_{ij}+x_{i}-x_{j}$

Table 4: Analog to Table 3 in the setting of symmetric marginals rather than fixed marginals. The story is mirrored, with OT and Matrix Scaling replaced by MMC and Matrix Balancing, respectively.

3.1 Matrix pre-conditioning

We now introduce the final two characters in our story: Matrix Scaling and Matrix Balancing. In words, these two problems seek to left- and right-multiply a given matrix $K$ by diagonal matrices in order to satisfy certain marginal constraints—fixed marginals for Matrix Scaling, and symmetric marginals for Matrix Balancing. For the former, the scaling is of the form $XKY$; for the latter, it is a similarity transform $XKX^{-1}$.

Definition 3 (Matrix Scaling).

Given $K\in\mathbb{R}_{>0}^{n\times n}$ and $\mu,\nu\in\Delta_{n}$, find positive diagonal matrices $X,Y$ such that $P=XKY$ has marginals $P\mathbf{1}=\mu$ and $P^{T}\mathbf{1}=\nu$.

Definition 4 (Matrix Balancing).

Given $K\in\mathbb{R}_{>0}^{n\times n}$, find a positive diagonal matrix $X$ such that $P=XKX^{-1}$ has symmetric marginals $P\mathbf{1}=P^{T}\mathbf{1}$.

(Both problems are defined here in a simplified way that suffices for the purposes of this note. See the discussion section for details.)

Matrix Scaling and Matrix Balancing are classical problems in their own right and have been studied in many communities over many decades under many names. See the review [35] for a historical account. The most famous application of these problems is their use as pre-conditioning subroutines before numerical linear algebra computations [52, 41, 46]. For example, Matrix Balancing is nowadays used by default before eigenvalue decomposition and matrix exponentiation in standard numerical packages such as R, MATLAB, Lapack, and Eispack.

3.2 Reduction

As alluded to above, entropic regularization leads to the following reductions. Below, $K=\exp[-\eta C]$ denotes the matrix with entries $K_{ij}=\exp(-\eta C_{ij})$. For simplicity, assume henceforth that $G$ is complete so that $K$ is strictly positive, which ensures existence and uniqueness of the Matrix Scaling/Balancing solutions $P$. The general case is similar but requires combinatorial properties of the sparsity pattern [28, 48].

Lemma 3.1 (Entropic OT is Matrix Scaling).

For any $\eta>0$, the entropic OT problem has a unique solution. It is the solution $P=XKY$ to the Matrix Scaling problem for $K=\exp[-\eta C]$.

Lemma 3.2 (Entropic MMC is Matrix Balancing).

For any $\eta>0$, the entropic MMC problem has a unique solution. It is the solution $P=XKX^{-1}$ to the Matrix Balancing problem for $K=\exp[-\eta C]$, modulo normalizing this Matrix Balancing solution $P$ by a constant so that the sum of its entries is $1$.

Both lemmas are immediate from first-order optimality conditions and are classical facts that have been re-discovered many times; see [35] for a historical account. There is also an elegant dual interpretation. Briefly, entropic regularization in the primal is equivalent to softmin smoothing in the dual, i.e., replacing $\min_{i}a_{i}$ by $\operatorname{smin}_{i}a_{i}:=-\eta^{-1}\log\sum_{i}\exp(-\eta a_{i})$ when writing the dual LP in saddle-point form. See Tables 3 and 4, bottom right. Modulo a simple transformation, the optimal scaling matrices $X,Y$ are in correspondence with the optimal solutions to these dual regularized problems—a fact that will be exploited and explained further below.
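
To make the softmin smoothing concrete, here is a minimal numerical sketch (ours, using scipy.special.logsumexp): the softmin lower-bounds the min and is off by at most $(\log n)/\eta$, which is consistent with the choice $\eta\approx\varepsilon^{-1}\log n$ appearing in Algorithms 1 and 2 below.

import numpy as np
from scipy.special import logsumexp

def softmin(a, eta):
    # smin_i a_i := -(1/eta) * log sum_i exp(-eta * a_i)
    return -logsumexp(-eta * np.asarray(a)) / eta

a = np.array([1.0, 1.3, 2.0])
for eta in [1.0, 10.0, 100.0]:
    print(eta, softmin(a, eta))                               # approaches min(a) = 1.0 as eta grows
# The softmin is a lower bound on the min, off by at most log(n) / eta:
print(a.min() - softmin(a, 100.0) <= np.log(a.size) / 100.0)  # True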

4 Simple scalable algorithms

Since entropic OT and entropic MMC are respectively equivalent to Matrix Scaling and Matrix Balancing (Lemmas 3.1 and 3.2), it suffices to solve the latter two problems. For both, there is a simple algorithm that has long been the practitioner’s algorithm of choice. For Matrix Scaling, this is the Sinkhorn algorithm; for Matrix Balancing, this is the Osborne algorithm. These algorithms have been re-invented many times under different names; see the survey [35, §3.1]. Pseudocode is in Algorithms 1 and 2. For shorthand, we write $\mathbb{D}(v)$ to denote the diagonal matrix with diagonal $v$, we write $r(P)=P\mathbf{1}$ and $c(P)=P^{T}\mathbf{1}$ to denote row and column sums, and we write $\odot$ and $./$ to denote entrywise multiplication and division.

Both algorithms have natural geometric interpretations as alternating projection in the primal and coordinate descent in the dual. We explain both interpretations as they provide complementary insights.

Algorithm 1 (Sinkhorn’s algorithm): scale a matrix $K$ to have marginals $\mu,\nu$. To solve OT to $\pm\varepsilon$, run on $K=\exp[-\eta C]$ where $\eta\approx\varepsilon^{-1}\log n$.
1: Initialize $X,Y\leftarrow I$   ▷ No scaling
2: Until convergence:
3:   $X\leftarrow X\odot\mathbb{D}(\mu./r(XKY))$   ▷ Fix rows
4:   $Y\leftarrow Y\odot\mathbb{D}(\nu./c(XKY))$   ▷ Fix columns

Algorithm 2 (Osborne’s algorithm): balance a matrix $K$ to have symmetric marginals. To solve MMC to $\pm\varepsilon$, run on $K=\exp[-\eta C]$ where $\eta\approx\varepsilon^{-1}\log n$.
1: Initialize $X\leftarrow I$   ▷ No balancing
2: Until convergence:
3:   Choose coordinate $i\in[n]$ to fix
4:   $X_{ii}\leftarrow X_{ii}\cdot\sqrt{c_{i}(XKX^{-1})/r_{i}(XKX^{-1})}$
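
For concreteness, here is a minimal NumPy sketch of both algorithms, written in terms of the diagonals x, y of the scaling matrices X, Y (a simplified rendering of Algorithms 1 and 2, not an official implementation). It uses a fixed iteration count and a cyclic coordinate choice for Osborne; for large η, a log-domain implementation is preferable for numerical stability.

import numpy as np

def sinkhorn(K, mu, nu, num_iters=500):
    # Algorithm 1: find x, y > 0 so that P = diag(x) K diag(y) has marginals mu, nu.
    n = K.shape[0]
    x, y = np.ones(n), np.ones(n)
    for _ in range(num_iters):
        x = mu / (K @ y)      # fix row sums:    r(XKY)_i = x_i * (K y)_i
        y = nu / (K.T @ x)    # fix column sums: c(XKY)_j = y_j * (K^T x)_j
    return x, y

def osborne(K, num_iters=2000):
    # Algorithm 2: find x > 0 so that P = diag(x) K diag(1/x) has equal row and column sums.
    # (Recomputes P each iteration for clarity; real implementations update the sums incrementally.)
    n = K.shape[0]
    x = np.ones(n)
    for t in range(num_iters):
        i = t % n                                  # cyclic coordinate choice; random also works
        P = K * x[:, None] / x[None, :]
        x[i] *= np.sqrt(P[:, i].sum() / P[i, :].sum())
    return x

# Demo: approximate OT and check marginals / balance.
rng = np.random.default_rng(0)
n, eta = 50, 20.0
C = rng.random((n, n))
mu = np.full(n, 1.0 / n)
nu = rng.random(n); nu /= nu.sum()
K = np.exp(-eta * C)

x, y = sinkhorn(K, mu, nu)
P = x[:, None] * K * y[None, :]
print("OT estimate:", (P * C).sum(), " max marginal error:", np.abs(P.sum(axis=1) - mu).max())

xb = osborne(K)
B = xb[:, None] * K / xb[None, :]
print("max row/column imbalance after balancing:", np.abs(B.sum(axis=1) - B.sum(axis=0)).max())

Note that Sinkhorn touches $K$ only through the matrix-vector products $Ky$ and $K^{T}x$; this is precisely the structure exploited by the scalable implementations discussed in §6.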

Primal interpretation: alternating projection.

The most direct interpretation of the Sinkhorn algorithm is that it alternately projects the current matrix $P=XKY$ onto either the subspace $\{P\in\Delta_{n\times n}:P\mathbf{1}=\mu\}$ with correct row marginals, or the subspace $\{P\in\Delta_{n\times n}:P^{T}\mathbf{1}=\nu\}$ with correct column marginals. (This projection is to the closest point in KL divergence rather than Euclidean distance; in the language of information geometry, it is an I-projection.) Note that when the algorithm corrects one constraint, it potentially violates the other. Nevertheless, it converges to the unique solution at the subspaces’ intersection; see Figure 1 for a cartoon illustration. The Osborne algorithm is analogous, except that it alternately projects $P=XKX^{-1}$ onto $n$ subspaces: the subspaces $\{P\in\Delta_{n\times n}:(P\mathbf{1})_{i}=(P^{T}\mathbf{1})_{i}\}$ defined by equal $i$-th row and column sums, for all $i\in[n]$. Here there is a choice for the order of subspaces to project onto; see [11] for a detailed discussion.

Figure 1: In the primal, the Sinkhorn algorithm alternately projects the iterate $P=XKY\in\mathbb{R}^{n\times n}$ onto the two subspaces corresponding to the row/column marginal constraints. The Osborne algorithm is analogous, but with $P=XKX^{-1}$ and $n$ subspaces.

Dual interpretation: coordinate descent.

The Sinkhorn algorithm also admits an appealing dual interpretation. In the dual, entropic OT has $2n$ variables—the Lagrange multipliers $x,y\in\mathbb{R}^{n}$ respectively corresponding to the row and column marginal constraints in the primal. See Table 3, bottom right. Via the transformation $X_{ii}=e^{\eta x_{i}}$ and $Y_{ii}=e^{\eta y_{i}}$, these $2n$ dual variables are in correspondence with the $2n$ diagonal entries of the scaling matrices $X$ and $Y$ that the Sinkhorn algorithm seeks to find. One can verify that the Sinkhorn algorithm’s row update (Line 3 of Algorithm 1) is an exact block coordinate descent step that maximizes the dual regularized objective over all $x\in\mathbb{R}^{n}$ given that $y$ is fixed. Vice versa for the column update. See Figure 2 for a cartoon illustration. The Osborne algorithm is analogous, except that now there are only $n$ dual variables $x\in\mathbb{R}^{n}$ since there are only $n$ marginal constraints in the primal; see Table 4, bottom right. When the Osborne algorithm updates $X_{ii}=e^{\eta x_{i}}$ to equate the $i$-th entry of the row and column marginals, this corresponds to an exact coordinate descent step on $x_{i}$.
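
The block coordinate descent claim can be checked numerically. Below is a small sketch (ours) that evaluates the regularized dual from Table 3 with scipy.special.logsumexp, performs one Sinkhorn row update written in the dual variables $x_{i}=\eta^{-1}\log X_{ii}$, and verifies that the partial gradient in $x$, which one can check equals $\mu-r(P)$ for the normalized iterate $P$, vanishes afterward, i.e., the update is an exact blockwise maximizer.

import numpy as np
from scipy.special import logsumexp

def dual_objective(C, mu, nu, x, y, eta):
    # Regularized dual from Table 3: smin_{ij}(C_ij - x_i - y_j) + <mu, x> + <nu, y>.
    M = C - x[:, None] - y[None, :]
    return -logsumexp(-eta * M) / eta + mu @ x + nu @ y

rng = np.random.default_rng(1)
n, eta = 20, 10.0
C = rng.random((n, n))
mu = np.full(n, 1.0 / n)
nu = rng.random(n); nu /= nu.sum()

x, y = np.zeros(n), np.zeros(n)
before = dual_objective(C, mu, nu, x, y, eta)
P = np.exp(-eta * (C - x[:, None] - y[None, :]))   # unnormalized iterate diag(e^{eta x}) K diag(e^{eta y})
x = x + np.log(mu / P.sum(axis=1)) / eta           # Sinkhorn row update, in the dual variables
after = dual_objective(C, mu, nu, x, y, eta)
P = np.exp(-eta * (C - x[:, None] - y[None, :]))
grad_x = mu - P.sum(axis=1) / P.sum()              # dual partial gradient in x at the new point
print(after >= before, np.abs(grad_x).max())       # True, ~0 (exact blockwise maximizer)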

Figure 2: In the dual, the Sinkhorn algorithm performs exact block coordinate descent by iteratively updating the scaling matrices $X$ or $Y$ so as to optimally improve the dual objective given that all other entries are fixed. The Osborne algorithm is analogous but updates individual entries rather than blocks.

5 Runtime analysis

We now turn to convergence analysis. The key challenge is how to measure progress. Indeed, an iteration of the Sinkhorn algorithm corrects either the row or the column marginals, but messes up the other. Similarly, an iteration of the Osborne algorithm corrects one row-column pair, but potentially messes up the $n-1$ others. From the cartoons in Figures 1 and 2, we might hope to make significant progress in each iteration—but how do we quantify this?

There’s an entire literature on this question. (Or at least for Matrix Scaling; for Matrix Balancing, things were much less clear until quite recently: even polynomial convergence was unknown for half a century until the breakthrough paper [42], let alone near-linear time convergence [11].) For example, in the 1980s, Franklin and Lorenz established that Sinkhorn iterations contract in the Hilbert projective metric [29]; however, the contraction rate incurs large factors of $n$, which leads to similarly large factors of $n$ in the final runtime. Another approach uses auxiliary Lyapunov functions to measure progress, for example the permanent [37]; however, this too incurs extraneous factors of $n$. See the survey [35].

As it turns out, there is a short, simple, and strikingly parallel analysis approach that leads to runtimes for both the Sinkhorn and Osborne algorithms that scale in the dimension $n$ as $\tilde{O}(n^{2})$ [11, 14]. These runtimes are near-optimal in $n$ (the “near” refers to the logarithmic factors suppressed by the $\tilde{O}$ notation) in the sense that, in the absence of further structure, it takes $\Theta(n^{2})$ time to even read the input for any of the matrix/graph problems discussed in this note, let alone solve them.

At a high level, this analysis hinges on using the regularized dual objective as a Lyapunov function (an idea dating back to the 1990s for Matrix Scaling [33]), and observing that each iteration of the Osborne/Sinkhorn algorithm significantly improves this Lyapunov function by an amount related to how violated the marginal constraints are for the current iterate $P$ (“progress lemma”). In its simplest form, the analysis argues that if the current iterate has very violated marginal constraints, then the Lyapunov function improves significantly; and since the Lyapunov function is bounded within a small range (“initialization lemma”), this can only continue for a small number of iterations before we arrive at an iterate with reasonably accurate marginals—and this must be a reasonably accurate solution (“termination lemma”).

The progress lemma is itself composed of two steps, both of which leverage the probabilistic perspective highlighted in this note. The first step is a direct calculation which shows that an iteration improves the dual objective by the current imbalance between the marginal distributions, as measured in the KL divergence for Sinkhorn, or the Hellinger divergence for Osborne. The second step uses a probabilistic inequality to analyze this imbalance. The point is that since probabilistic inequalities apply for (infinite-dimensional) continuous distributions, they are independent of the dimension $n$. Operationally, this allows switching between the dual (where progress is measured) and the primal (where solutions are desired) without incurring factors of $n$, thereby enabling a final runtime without extraneous factors of $n$. Full details can be found in [14, 27] for the Sinkhorn algorithm and [11] for the Osborne algorithm.
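
The first step of the progress lemma can be verified numerically for the Sinkhorn algorithm. The sketch below (ours; it follows the conventions of Table 3, where the iterate is normalized to be a probability distribution) checks that each row update improves the regularized dual by $\eta^{-1}\,\mathrm{KL}(\mu\,\|\,r(P))$, where $r(P)$ is the row marginal of the normalized iterate; the column update is analogous with $\nu$ and the column marginal.

import numpy as np
from scipy.special import logsumexp

def dual_objective(C, mu, nu, x, y, eta):
    M = C - x[:, None] - y[None, :]
    return -logsumexp(-eta * M) / eta + mu @ x + nu @ y

def kl(p, q):
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(2)
n, eta = 15, 8.0
C = rng.random((n, n))
mu = rng.random(n); mu /= mu.sum()
nu = rng.random(n); nu /= nu.sum()

x, y = np.zeros(n), np.zeros(n)
for _ in range(5):
    P = np.exp(-eta * (C - x[:, None] - y[None, :]))
    r = P.sum(axis=1) / P.sum()                    # row marginal of the normalized iterate
    before = dual_objective(C, mu, nu, x, y, eta)
    x = x + np.log(mu / P.sum(axis=1)) / eta       # Sinkhorn row update (dual form)
    after = dual_objective(C, mu, nu, x, y, eta)
    print(after - before, kl(mu, r) / eta)         # the two printed quantities match
    P = np.exp(-eta * (C - x[:, None] - y[None, :]))
    y = y + np.log(nu / P.sum(axis=0)) / eta       # column update before the next round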

6 Discussion

For each of the four problems in this note, much more is known and also many questions remain. Here we briefly mention a few selected topics.

Algorithm comparisons and the nuances therein.

Each optimization problem in this note has been studied for many decades by many communities, which, as already mentioned, has led to the development of many approaches besides the Sinkhorn and Osborne algorithms. Which algorithm is best? Currently there is no consensus. We suspect that the true answer is nuanced because different algorithms are often effective for different types of problem instances. For example, specialized combinatorial solvers blow competitors out of the water for small-to-medium problem sizes by easily achieving high-accuracy solutions [26], whereas Sinkhorn and Osborne are the default algorithms in most numerical software packages for larger problems where moderate accuracy is acceptable. Both parameter regimes are important, but typically for different application domains. For example, high precision may be relevant for safety-critical or scientific computing applications, whereas the moderate-accuracy regime is typically relevant for modern data-science applications of OT, since there is no need to solve beyond the inherent modeling error (OT is usually just a proxy loss in machine learning applications) and discretization error ($\mu,\nu$ are often thought of as samples from underlying true distributions). This latter regime is also relevant for pre-conditioning matrices before eigenvalue computation, since Matrix Scaling/Balancing optimize objectives that are just proxies for fast convergence of downstream eigenvalue algorithms [41, 46].

Theory vs practice.

Comparisons between algorithms are further muddled by discrepancies between theory and practice. Sometimes algorithms are more effective in practice than our best theoretical guarantees suggest—is this because current analysis techniques are lacking, or because the input is an easy benchmark, or because of practical considerations not captured by a runtime theorem (e.g., the Sinkhorn algorithm is “embarrassingly parallelizable” and interfaces well with modern GPU hardware [25])? On the flipside, some algorithms with incredibly fast theoretical runtimes are less practical due to large constant or polylogarithmic factors hidden in the $\tilde{O}$ runtime. At least for now, this includes the elegant line of work (e.g., [54, 22, 4, 55]) based on the Laplacian paradigm, which very recently culminated in the incredible theoretical breakthrough [20] that solves the more general problem of minimum-cost flow in time that is almost-linear in the input sparsity and polylogarithmic in $1/\varepsilon$. I am excited to see to what extent theory and practice are bridged in the upcoming years.

Exploiting structure.

All algorithms discussed so far work for generic inputs. This robustness comes at an unavoidable $\Omega(n^{2})$ cost in runtime/memory from just reading/storing the input. This precludes scaling beyond $n$ in the several tens of thousands, say, on a laptop. For larger problems, it is essential to exploit “structure” in the input. What structure? This is a challenging question in itself because it is tied to the applications and pipelines relevant to practice. For OT, typically $\mu,\nu$ are distributions over $\mathbb{R}^{d}$ and the cost $C$ is given by pairwise distances, raised to some power $p\in[1,\infty)$. This is the $p$-Wasserstein distance, which plays an analogous role to the $\ell_{p}$ distance [56]. Then the $n\times n$ matrix $C$ is implicit from the $2n$ points in $\mu,\nu$. In low dimension $d$, this at least enables reading the input in $O(n)$ time—can OT also be solved in $O(n)$ time? A beautiful line of work in the computational geometry community has worked towards this goal, a recent breakthrough being $n\cdot(\varepsilon^{-1}\log n)^{O(d)}$ runtimes for $(1\pm\varepsilon)$ multiplicatively approximating OT for $p=1$ [47, 1]. We refer to those papers for a detailed account of this literature and the elegant ideas therein. Low-dimensional structure can also be exploited by the Sinkhorn algorithm. The basic idea is that the $n^{2}$ runtime arises only through multiplying the $n\times n$ kernel matrix $K=\exp[-\eta C]$ by a vector—a well-studied task in scientific computing related to Fast Multipole Methods [15]—and this can be done efficiently if the distributions lie on low-dimensional grids [53], geometric domains arising in computer graphics [53], manifolds [5], or algebraic varieties [13]. Preliminary numerics suggest that in these structured geometric settings, Sinkhorn can scale to millions of data points $n$ while maintaining reasonable accuracy [5]. Many questions remain open in this vein, for example graceful performance degradation in the dimension $d$, instance-optimality, and average-case complexity for “real-world inputs”.
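
As a minimal illustration of this structured matrix-vector product (our example, not from the note): for points on a regular 2D grid with a separable cost such as squared Euclidean distance, the kernel factors as a Kronecker product, so $Kv$ can be applied via two small matrix products without ever forming the $n\times n$ matrix $K$.

import numpy as np

# Points on a d1 x d2 grid with cost C = |x_i - x_j|^2 + |y_i - y_j|^2 give K = kron(K1, K2),
# so K @ v costs O(d1 * d2 * (d1 + d2)) instead of O((d1 * d2)^2).
d1, d2, eta = 30, 40, 5.0
xs, ys = np.linspace(0, 1, d1), np.linspace(0, 1, d2)
K1 = np.exp(-eta * (xs[:, None] - xs[None, :]) ** 2)
K2 = np.exp(-eta * (ys[:, None] - ys[None, :]) ** 2)

v = np.random.default_rng(0).random(d1 * d2)
dense = np.kron(K1, K2) @ v                        # the generic O(n^2) product, n = d1 * d2
fast = (K1 @ v.reshape(d1, d2) @ K2.T).ravel()     # the structured product, never forms K
print(np.allclose(dense, fast))                    # True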

Optimizing joint distributions with many constrained marginals.

This note focuses on optimizing joint distributions $P(x,y)$ with $k=2$ constrained marginals, and it is natural to ask about $k\geqslant 2$. These problems are called Multimarginal OT in the case of fixed marginals and linear objectives, and arise in diverse applications in fluid dynamics [17], barycentric averaging [2], graphical models [34], distributionally robust optimization [21], and much more; see the monographs [45, 44, 40]. The setting of symmetric marginals arises in quantum chemistry via Density Functional Theory [24, 19]. A central challenge in all these problems is that in general, it is intractable even to store a $k$-variate probability distribution, let alone solve for the optimal one. Indeed, a $k$-variate joint distribution in which each variable takes $n$ values is in correspondence with a $k$-order tensor that has $n^{k}$ entries—an astronomical number even for tiny $n,k=20$, say. As such, a near-linear runtime in the size of $P$ (the goal in this note for $k=2$) is effectively useless for large $k$, and it is essential to go beyond this by exploiting structure via implicit representations. See Part II of the author’s thesis [6] and the papers upon which it is based [9, 7, 8, 10] for a systematic investigation of what structure enables $\mathrm{poly}(n,k)$-time algorithms, and for pointers to the extensive surrounding literature.

Matrix Balancing in $\ell_{p}$ norms.

The original papers [43, 41] studied Matrix Balancing in the following setting: given $K\in\mathbb{C}^{n\times n}$, find diagonal $X$ such that the $i$-th row and column of $XKX^{-1}$ have equal $\ell_{p}$ norm, for all $i\in[n]$. Definition 4 is equivalent for any finite $p$ (which suffices for this note) and elucidates the connection to optimizing distributions; see [11, §1.4] for details. For the case $p=\infty$, similar algorithms have been developed and were recently shown to converge in polynomial time in the breakthrough [51]. It would be interesting to reconcile the case $p=\infty$, as the analysis techniques there seem different and the connection to optimizing distributions seems unclear.

Entropic OT.

In this note, entropically regularized OT was motivated as a means to an end for computing OT efficiently. However, it has emerged as an interesting quantity in its own right and is now used in lieu of OT in many applications due to its better statistical and computational properties [25, 30, 38], as well as its ability to interface with complex deep learning architectures [45]. Understanding these improved properties is an active area of research bridging optimization, statistics, and applications. We are not aware of an analogous study of entropic MMC and believe this may be interesting.

Acknowledgments.

I am grateful to Jonathan Niles-Weed, Pablo Parrilo, and Philippe Rigollet for a great many stimulating conversations about these topics. This work was partially supported by NSF Graduate Research Fellowship 1122374 and a TwoSigma PhD Fellowship.

References

  • Agarwal et al. [2022] Pankaj K Agarwal, Hsien-Chih Chang, Sharath Raghvendra, and Allen Xiao. Deterministic, near-linear $\varepsilon$-approximation algorithm for geometric bipartite matching. In Symposium on Theory of Computing, pages 1052–1065, 2022.
  • Agueh and Carlier [2011] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Ahuja et al. [1988] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network flows. 1988.
  • Allen-Zhu et al. [2017] Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. In Symposium on the Foundations of Computer Science. IEEE, 2017.
  • Altschuler et al. [2019] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In Conference on Neural Information Processing Systems, pages 4429–4439, 2019.
  • Altschuler [2022] Jason M Altschuler. Transport and beyond: efficient optimization over probability distributions. PhD thesis, MIT, 2022.
  • Altschuler and Boix-Adserà [2021a] Jason M Altschuler and Enric Boix-Adserà. Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research, 22:44–1, 2021a.
  • Altschuler and Boix-Adserà [2021b] Jason M Altschuler and Enric Boix-Adserà. Hardness results for Multimarginal Optimal Transport problems. Discrete Optimization, 42:100669, 2021b.
  • Altschuler and Boix-Adserà [2022a] Jason M Altschuler and Enric Boix-Adserà. Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure. Mathematical Programming, pages 1–72, 2022a.
  • Altschuler and Boix-Adserà [2022b] Jason M Altschuler and Enric Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179–203, 2022b.
  • Altschuler and Parrilo [2022a] Jason M Altschuler and Pablo A Parrilo. Near-linear convergence of the Random Osborne algorithm for Matrix Balancing. Mathematical Programming, pages 1–35, 2022a.
  • Altschuler and Parrilo [2022b] Jason M Altschuler and Pablo A Parrilo. Approximating Min-Mean-Cycle for low-diameter graphs in near-optimal time and memory. SIAM Journal on Optimization, 32(3):1791–1816, 2022b.
  • Altschuler and Parrilo [2022, to appear] Jason M Altschuler and Pablo A Parrilo. Kernel approximation on algebraic varieties. SIAM Journal on Applied Algebra and Geometry, 2022, to appear.
  • Altschuler et al. [2017] Jason M Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Conference on Neural Information Processing Systems, 2017.
  • Beatson and Greengard [1997] Rick Beatson and Leslie Greengard. A short course on fast multipole methods. Wavelets, multilevel methods and elliptic PDEs, 1:1–37, 1997.
  • Bertsimas and Tsitsiklis [1997] Dimitris Bertsimas and John N Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
  • Brenier [2008] Yann Brenier. Generalized solutions and hydrostatic approximation of the Euler equations. Physica D: Nonlinear Phenomena, 237(14-17):1982–1988, 2008.
  • Butkovič [2010] Peter Butkovič. Max-linear systems: theory and algorithms. Springer Science & Business Media, 2010.
  • Buttazzo et al. [2012] Giuseppe Buttazzo, Luigi De Pascale, and Paola Gori-Giorgi. Optimal-transport formulation of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
  • Chen et al. [2022a] Li Chen, Rasmus Kyng, Yang P Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva. Maximum flow and minimum-cost flow in almost-linear time. arXiv preprint arXiv:2203.00671, 2022a.
  • Chen et al. [2022b] Louis Chen, Will Ma, Karthik Natarajan, David Simchi-Levi, and Zhenzhen Yan. Distributionally robust linear and discrete optimization with marginals. Operations Research, 2022b.
  • Cohen et al. [2017] Michael B Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton’s method and interior point methods. In Symposium on the Foundations of Computer Science, pages 902–913. IEEE, 2017.
  • Cormen et al. [2009] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2009.
  • Cotar et al. [2013] Codina Cotar, Gero Friesecke, and Claudia Klüppelberg. Density functional theory and optimal transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66(4):548–599, 2013.
  • Cuturi [2013] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. Conference on Neural Information Processing Systems, 26:2292–2300, 2013.
  • Dong et al. [2020] Yihe Dong, Yu Gao, Richard Peng, Ilya Razenshteyn, and Saurabh Sawlani. A study of performance of optimal transport. Preprint at arXiv:2005.01182, 2020.
  • Dvurechensky et al. [2018] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In International Conference on Machine Learning, 2018.
  • Eaves et al. [1985] B Curtis Eaves, Alan J Hoffman, Uriel G Rothblum, and Hans Schneider. Line-sum-symmetric scalings of square nonnegative matrices. Mathematical Programming, pages 124–141, 1985.
  • Franklin and Lorenz [1989] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.
  • Genevay et al. [2019] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1574–1583, 2019.
  • Georgiadis et al. [2009] Loukas Georgiadis, Andrew V Goldberg, Robert E Tarjan, and Renato F Werneck. An experimental study of minimum mean cycle algorithms. In Workshop on Algorithm Engineering and Experiments, pages 1–13. SIAM, 2009.
  • Goldberg and Tarjan [1989] Andrew V Goldberg and Robert E Tarjan. Finding minimum-cost circulations by canceling negative cycles. Journal of the ACM, 36(4):873–886, 1989.
  • Gurvits and Yianilos [1998] Leonid Gurvits and Peter N Yianilos. The deflation-inflation method for certain semidefinite programming and maximum determinant completion problems. Technical report, NECI, 1998.
  • Haasler et al. [2021] Isabel Haasler, Rahul Singh, Qinsheng Zhang, Johan Karlsson, and Yongxin Chen. Multi-marginal optimal transport and probabilistic graphical models. IEEE Transactions on Information Theory, 2021.
  • Idel [2016] Martin Idel. A review of matrix scaling and Sinkhorn’s normal form for matrices and positive maps. Preprint at arXiv:1609.06349, 2016.
  • Karp [1972] Richard M Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer, 1972.
  • Linial et al. [1998] Nathan Linial, Alex Samorodnitsky, and Avi Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Symposium on Theory of Computing, pages 644–652, 1998.
  • Mena and Niles-Weed [2019] Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Conference on Neural Information Processing Systems, pages 4541–4551, 2019.
  • Monge [1781] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
  • Natarajan [2021] Karthik Natarajan. Optimization with Marginals and Moments. Dynamic Ideas LLC, 2021.
  • Osborne [1960] EE Osborne. On pre-conditioning of matrices. Journal of the ACM, 7(4):338–345, 1960.
  • Ostrovsky et al. [2017] Rafail Ostrovsky, Yuval Rabani, and Arman Yousefi. Matrix balancing in $l_{p}$ norms: bounding the convergence rate of Osborne’s iteration. In Symposium on Discrete Algorithms, pages 154–169. SIAM, 2017.
  • Parlett and Reinsch [1969] Beresford N Parlett and Christian Reinsch. Balancing a matrix for calculation of eigenvalues and eigenvectors. Numerische Mathematik, 13(4):293–304, 1969.
  • Pass [2015] Brendan Pass. Multi-marginal optimal transport: theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1771–1790, 2015.
  • Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • Press et al. [2007] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical recipes 3rd edition: The art of scientific computing. Cambridge University Press, 2007.
  • Raghvendra and Agarwal [2020] Sharath Raghvendra and Pankaj K Agarwal. A near-linear time $\varepsilon$-approximation algorithm for geometric bipartite matching. Journal of the ACM, 67(3):1–19, 2020.
  • Rothblum and Schneider [1989] Uriel G Rothblum and Hans Schneider. Scalings of matrices which have prespecified row sums and column sums via optimization. Linear Algebra and its Applications, 114:737–764, 1989.
  • Schneider and Zenios [1990] Michael H Schneider and Stavros A Zenios. A comparative study of algorithms for matrix balancing. Operations Research, 38(3):439–455, 1990.
  • Schrijver [2003] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24. Springer Science & Business Media, 2003.
  • Schulman and Sinclair [2017] Leonard J Schulman and Alistair Sinclair. Analysis of a classical matrix preconditioning algorithm. Journal of the ACM, 64(2):9, 2017.
  • Sinkhorn [1967] Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
  • Solomon et al. [2015] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):1–11, 2015.
  • Spielman and Teng [2004] Daniel A Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Symposium on Theory of Computing, pages 81–90, 2004.
  • van den Brand et al. [2020] Jan van den Brand, Yin-Tat Lee, Danupon Nanongkai, Richard Peng, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Bipartite matching in nearly-linear time on moderately dense graphs. In Symposium on Foundations of Computer Science, pages 919–930. IEEE, 2020.
  • Villani [2003] Cédric Villani. Topics in optimal transportation. Number 58. American Mathematical Society, 2003.
  • Wilson [1969] Alan Geoffrey Wilson. The use of entropy maximising models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pages 108–126, 1969.
  • Zwick and Paterson [1996] Uri Zwick and Mike Paterson. The complexity of mean payoff games on graphs. Theoretical Computer Science, 158(1-2):343–359, 1996.