
Monotonicity in Quadratically Regularized Linear Programs

Alberto González-Sanz (Department of Statistics, Columbia University, [email protected]), Marcel Nutz (Departments of Mathematics and Statistics, Columbia University, [email protected]; research supported by NSF Grants DMS-1812661, DMS-2106056, DMS-2407074), Andrés Riveros Valdevenito (Department of Statistics, Columbia University, [email protected]).
Abstract

In optimal transport, quadratic regularization is a sparse alternative to entropic regularization: the solution measure tends to have small support. Computational experience suggests that the support decreases monotonically to the unregularized counterpart as the regularization parameter is relaxed. We find it useful to investigate this monotonicity more abstractly for linear programs over polytopes, regularized with the squared norm. Here, monotonicity can be stated as an invariance property of the curve mapping the regularization parameter to the solution: once the curve enters a face of the polytope, does it remain in that face forever? We show that this invariance is equivalent to a geometric property of the polytope, namely that each face contains the minimum norm point of its affine hull. Returning to the optimal transport problem and its associated Birkhoff polytope, we verify this property for low dimensions, but show that it fails for dimension $d\geq 25$. As a consequence, the conjectured monotonicity of the support fails in general, even if experiments suggest that monotonicity holds for many cost matrices.

Keywords Linear Program, Quadratic Regularization, Optimal Transport, Sparsity

AMS 2020 Subject Classification 49N10; 49N05; 90C25

1 Introduction

Let $\mathcal{P}\subset\mathbb{R}^{d}$ be a polytope and $c\in\mathbb{R}^{d}$. The regularized linear program

$x^{\delta}=\operatorname*{argmin}_{x\in\mathcal{P}:\,\|x\|\leq\delta}\langle c,x\rangle$ (1)

has a unique solution as long as $\delta\geq 0$ belongs to a certain interval $[\delta_{\min},\delta_{\max}]$, hence we can consider the curve $\delta\mapsto x^{\delta}\in\mathcal{P}$ which travels across various faces of $\mathcal{P}$ as $\delta$ increases (i.e., as the regularizing constraint relaxes). We are interested in the following invariance property: once $\delta\mapsto x^{\delta}\in\mathcal{P}$ enters a given face, does it ever leave that face? (Cf. Figure 1.) In the optimal transport applications that motivate our study (see below), this property corresponds to the monotonicity of the optimal support wrt. the regularization strength. Our abstract result, Theorem 3.2, geometrically characterizes all polytopes such that the invariance property holds (for any cost $c$). We show that this property holds for the set of probability measures (unit simplex), but fails for the set of transport plans (Birkhoff polytope) when the dimension $d\geq 25$. As a consequence (which may be surprising given numerical experience), the optimal support in quadratically regularized optimal transport problems is not always monotone.

Figure 1: A non-monotone (left) and a monotone (right) polytope with cost $c$ (red). The curve $[\delta_{\min},\delta_{\max}]\ni\delta\mapsto x^{\delta}$ follows the blue arrows and has the invariance property only in the right example.

1.1 Motivation

This study is motivated by optimal transport and related minimization problems over probability measures. In its simplest form, the transport problem between probability measures $\mu$ and $\nu$ is

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,\gamma(dx,dy),$ (OT)

where $\hat{c}(x,y)$ is a given “cost” function and $\Gamma(\mu,\nu)$ denotes the set of couplings; i.e., joint probability measures $\gamma$ with marginals $(\mu,\nu)$. See [34] for a detailed discussion. In many applications such as machine learning or statistics (see [24, 31] for surveys), the marginals encode observed data points $X_{1},\dots,X_{N}$ and $Y_{1},\dots,Y_{N}$ which are represented by their empirical measures $\mu=\frac{1}{N}\sum_{i}\delta_{X_{i}}$ and $\nu=\frac{1}{N}\sum_{i}\delta_{Y_{i}}$. Denoting by $c_{ij}=\hat{c}(X_{i},Y_{j})$ the resulting $N\times N$ cost matrix, (OT) then becomes a linear program

$\inf_{x\in\mathcal{P}}\langle c,x\rangle$ (LP)

where the polytope $\mathcal{P}$ is (up to a constant factor) the Birkhoff polytope of doubly stochastic matrices of size $N\times N$. For different choices of polytope, (LP) includes other problems of recent interest, such as multi-marginal optimal transport and Wasserstein barycenters [2] or adapted Wasserstein distances [3]. The optimal transport problem is computationally costly when $N$ is large. The impactful paper [13] proposed to regularize (OT) by penalizing with Kullback–Leibler divergence (entropy). Then, solutions can be computed using the Sinkhorn–Knopp algorithm, which has led to an explosion of high-dimensional applications (see [32]). More generally, [14] introduced regularized optimal transport with regularization by a divergence. Different divergences give rise to different properties of the solution. Entropic regularization always leads to couplings whose support contains all data pairs $(X_{i},Y_{j})$, even though (OT) typically has a sparse solution. In some applications that is undesirable; for instance, it may correspond to blurrier images in an image processing task [6]. The second-most prominent regularization is $\chi^{2}$-divergence or equivalently the squared norm, as proposed in [6], which gives rise to sparse solutions. (See also [17] for a similar formulation of minimum-cost flow problems, and the predecessors referenced therein. Our model with a general polytope includes such problems, and many others.) In the Eulerian formulation of [14], this is exactly our problem (1) with $\mathcal{P}$ being the (scaled) Birkhoff polytope. Alternately, the problem can be stated in Lagrangian form as in [6], making the squared-norm penalty explicit:

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,\gamma(dx,dy)+\frac{1}{2\eta}\left\|\frac{d\gamma}{d(\mu\otimes\nu)}\right\|_{L^{2}(\mu\otimes\nu)}^{2}$ (QOT)

where $d\gamma/d(\mu\otimes\nu)$ denotes the density of $\gamma$ wrt. the product measure $\mu\otimes\nu$ and the regularization strength is now parameterized by $\eta\in[0,\infty]$. In the general setting, this corresponds to

$\inf_{x\in\mathcal{P}}\langle c,x\rangle+\frac{1}{2\eta}\|x\|^{2}$ (2)

which has a unique solution $x_{\eta}$ for any $\eta\in[0,\infty]$. The curves $(x^{\delta})$ and $(x_{\eta})$ of solutions to (1) and (2), respectively, coincide up to a simple reparametrization.

In optimal transport, the regularized problem is often solved to approximate the linear problem (OT). The latter has a generically unique (see [12]) solution $x$, which is recovered as $\delta\to\delta_{\max}$ (or $\eta\to\infty$): $x=x^{\delta_{\max}}=x_{\infty}$. (For problems with non-unique solution, $x^{\delta_{\max}}$ recovers the minimum-norm solution.) In particular, the support of $x^{\delta}$ converges to the support of $x$, which is generically sparse ($N$ out of $N^{2}$ possible points, again by [12]). In numerous experiments, it has been observed not only that $\operatorname*{spt}x^{\delta}$ is sparse when $\delta$ is large, but also that $\operatorname*{spt}x^{\delta}$ monotonically decreases to $\operatorname*{spt}x$ (e.g., [6]). If this monotonicity holds, then in particular $\operatorname*{spt}x\subset\operatorname*{spt}x^{\delta}$, meaning that $\operatorname*{spt}x^{\delta}$ can be used as a multivariate confidence band for the (unknown) solution $x$.

1.2 Summary of Results

When $\mathcal{P}$ is a set of measures, the support of $x\in\mathcal{P}$ can be defined as usual. When $\mathcal{P}$ is a general polytope, the set of vertices of the minimal face containing $x$ yields a similar notion. With that in mind, we can ask whether the support of $x^{\delta}$ is monotone wrt. the strength $\delta$ of regularization. (The two notions of support yield the same result when $\mathcal{P}$ is the simplex or the Birkhoff polytope; cf. Lemma 4.2.) Geometrically, this corresponds to the aforementioned invariance property: if $x^{\delta}\in F$ for some face $F$, does it follow that $x^{\delta^{\prime}}\in F$ for all $\delta^{\prime}\geq\delta$? The answer may of course depend on the cost $c$; we call $\mathcal{P}$ monotone if the invariance holds for any $c\in\mathbb{R}^{d}$.

We show that monotonicity can be characterized by the geometry of $\mathcal{P}$. Namely, monotonicity of $\mathcal{P}$ is equivalent to two properties: for any proper face $F$, the minimum-norm point of the affine hull of $F$ must lie in $F$, and moreover, the minimum-norm point of $\mathcal{P}$ must lie in the relative interior $\operatorname*{ri}\mathcal{P}$. See Theorem 3.2. Once the right point of view is taken, the proof is quite elementary. An example satisfying both properties, and hence monotonicity, is the unit simplex $\Delta\subset\mathbb{R}^{d}$, for any $d\geq 1$. For that choice of polytope, $x^{\delta}$ is a sparse soft-min of $c=(c_{1},\dots,c_{d})$, converging to $1_{\{\operatorname*{argmin}c\}}$ in the unregularized limit. On the other hand, we show that the Birkhoff polytope violates the first condition whenever the marginals have $N\geq 5$ data points (meaning that $d\geq 25$). As a result, the optimal support in quadratically regularized optimal transport problems is not always monotone, even if it appears so in numerous experiments. In fact, we exhibit a particularly egregious failure of monotonicity where the support of the limiting (unregularized) optimal transport $x$ is not contained in $\operatorname*{spt}x^{\delta}$ for some reasonably large $\delta$. Our counterexample is constructed by hand, based on theoretical considerations. Our numerical experiments using random cost matrices have failed to locate counterexamples, suggesting that there are, in some sense, “few” faces violating our condition on the minimum-norm point. (See also the proof of Lemma 4.10.) This is an interesting problem for further study in combinatorics, where the Birkhoff polytope has remained an active topic even after decades of research (e.g., [30] and the references therein).

The remainder of this note is organized as follows. Section 2 introduces the regularized linear program and its solution curve in detail. Section 3 contains the main abstract result, whereas Section 4 reports the applications to optimal transport and soft-min.

2 Preliminaries

This section collects notation and well-known or straightforward results for ease of reference. Let $\langle\cdot,\cdot\rangle$ be an inner product on $\mathbb{R}^{d}$ and $\|\cdot\|$ the induced norm. Let $\emptyset\neq\mathcal{P}\subseteq\mathbb{R}^{d}$ be a polytope; i.e., the convex hull of finitely many points. The minimal set of such points, called vertices or extreme points, is denoted $\operatorname*{ext}\mathcal{P}$. A face of $\mathcal{P}$ is a subset $F\subset\mathcal{P}$ such that any open line segment $L=(x,y)\subset\mathcal{P}$ with $L\cap F\neq\emptyset$ satisfies $\overline{L}\subset F$, where $\overline{L}=[x,y]$ denotes closure. Alternately, a nonempty face $F$ of $\mathcal{P}$ is the intersection $F=\mathcal{P}\cap H$ of $\mathcal{P}$ with a tangent hyperplane $H$. Here a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle=b\}$ is called tangent if $H\cap\mathcal{P}\neq\emptyset$ and $\mathcal{P}\subset\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle\leq b\}$. We denote by $\operatorname*{ri}K$ and $\operatorname*{rbd}K=K\setminus\operatorname*{ri}K$ the relative interior and relative boundary of a set $K$, and by $\operatorname*{aff}K$ its affine span. See [8] for background on polytopes and related notions.

Next, we introduce the regularized linear program and the interval of parameters $\delta$ where the set of feasible points is nonempty and the constraint is binding.

Lemma 2.1 (Eulerian Formulation).

Denote $\mathcal{P}(\delta):=\{x\in\mathcal{P}:\,\|x\|\leq\delta\}$ for $\delta\geq 0$ and

$\delta_{\min}:=\min\{\delta\geq 0:\,\mathcal{P}(\delta)\neq\emptyset\},\qquad\delta_{\max}:=\min\left\{\delta\geq 0:\,\min_{x\in\mathcal{P}(\delta)}\langle x,c\rangle=\min_{x\in\mathcal{P}}\langle x,c\rangle\right\}.$

Then

$D:=\left\{\delta>\delta_{\min}:\ \operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle\cap\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle=\emptyset\right\}=(\delta_{\min},\delta_{\max}).$

For each $\delta\in\overline{D}=[\delta_{\min},\delta_{\max}]$, the problem

$\inf_{x\in\mathcal{P}(\delta)}\langle x,c\rangle\qquad\text{has a unique minimizer }x^{\delta}.$ (3)

Moreover, $[\delta_{\min},\delta_{\max}]\ni\delta\mapsto x^{\delta}$ is continuous and $\|x^{\delta}\|=\delta$. In particular, $\{x^{\delta_{\min}}\}$ is the singleton $\mathcal{P}(\delta_{\min})$, and $x^{\delta_{\max}}$ is the minimum-norm solution of $\min_{x\in\mathcal{P}}\langle x,c\rangle$.

Proof.

The identity $D=(\delta_{\min},\delta_{\max})$ holds as $\mathcal{P}(\delta)$ is increasing in $\delta$. Fix $\delta\in D$. Let $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$ and $x^{*}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle$. If $\|x^{\delta}\|<\delta$, then $x:=\lambda x^{*}+(1-\lambda)x^{\delta}\in\mathcal{P}(\delta)$ for sufficiently small $\lambda\in(0,1)$. By optimality it follows that $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle$, meaning that $\delta\notin D$. Thus, for $\delta\in D$, we have $\|x^{\delta}\|=\delta$. In particular, the set $\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$ is contained in the sphere $\{x:\,\|x\|=\delta\}$. As the set is also convex, it must be a singleton by the strict convexity of $\|\cdot\|$. Continuity of $\delta\mapsto x^{\delta}$ is straightforward.

It remains to deal with the boundary cases. For $\delta=\delta_{\min}$, we note that $\|x^{\delta}\|<\delta$ is trivially ruled out, hence we can conclude as above. Clearly $\{x^{\delta_{\min}}\}=\mathcal{P}(\delta_{\min})$. Define $x^{*}$ as the minimum-norm solution of $\min_{x\in\mathcal{P}}\langle x,c\rangle$, which is unique by strict convexity. Clearly $\|x^{*}\|=\delta_{\max}$. Thus when $\delta=\delta_{\max}$, we must have $x^{\delta}=x^{*}$ for any $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$. Continuity and $\|x^{\delta}\|=\delta$ are again straightforward. ∎

The next lemma recalls the standard projection theorem (e.g., [7, Theorem 5.2]).

Lemma 2.2 (Projection).

Let $\emptyset\neq K\subseteq\mathbb{R}^{d}$ be closed and convex. Given $x\in\mathbb{R}^{d}$, there exists a unique $x_{K}\in K$, called the projection of $x$ onto $K$ and denoted $x_{K}=\operatorname{proj}_{K}(x)$, such that

$\|x-x_{K}\|=\inf_{x^{\prime}\in K}\|x-x^{\prime}\|.$

Moreover, $x_{K}$ is characterized within $K$ by

$\langle x-x_{K},x^{\prime}-x_{K}\rangle\leq 0\qquad\text{for all }x^{\prime}\in K.$ (4)

If $x_{K}\in\operatorname*{ri}K$, then $x_{K}=\operatorname{proj}_{\operatorname*{aff}K}(x)$ and (4) can be sharpened to

$\langle x-x_{K},x^{\prime}-x_{K}\rangle=0\qquad\text{for all }x^{\prime}\in K.$ (5)

In particular, (5) holds when $K$ is an affine subspace. In that case, $x\mapsto\operatorname{proj}_{K}(x)$ is affine.

Below, we often use $\operatorname{proj}_{K}(0)$ as a convenient notation for $\operatorname*{argmin}_{x\in K}\|x\|$, the minimum-norm point of $K$. In the Lagrangian formulation of our regularized linear program, the solution can be expressed as a projection onto $\mathcal{P}$ as follows.

Lemma 2.3 (Lagrangian Formulation).

Define

$x_{\eta}:=\operatorname{proj}_{\mathcal{P}}(-\eta c),\quad\eta\in[0,\infty),\qquad x_{\infty}:=\lim_{\eta\to\infty}x_{\eta}.$ (6)

Then

$x_{\eta}=\operatorname*{argmin}_{x\in\mathcal{P}}\,\langle x,c\rangle+\frac{1}{2\eta}\|x\|^{2},\quad\eta\in(0,\infty),$ (7)

$x_{0}=\operatorname*{argmin}_{x\in\mathcal{P}}\|x\|,\qquad x_{\infty}=\operatorname*{argmin}_{x^{\prime}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle}\|x^{\prime}\|.$ (8)

The limit $x_{\eta}\to x_{\infty}$ is stationary; i.e., there exists $\bar{\eta}\in\mathbb{R}_{+}$ such that $x_{\eta}=x_{\infty}$ for all $\eta\geq\bar{\eta}$.

Proof.

Let $\eta\in(0,\infty)$. Then (7) follows from

$\frac{1}{2\eta}\|-\eta c-x\|^{2}=\frac{\eta}{2}\|c\|^{2}+\langle x,c\rangle+\frac{1}{2\eta}\|x\|^{2}.$

The first claim in (8) is trivial. For the second claim and the stationary convergence $x_{\eta}\to x_{\infty}$ (which will not be used directly), see [28, Theorem 2.1], or [21] for the exact threshold $\bar{\eta}$. ∎

The algorithm of [23] solves the problem of projecting a point onto a polyhedron, hence can be used to find $x_{\eta}=\operatorname{proj}_{\mathcal{P}}(-\eta c)$ numerically. The solutions $x^{\delta}$ of the Eulerian formulation (3) and $x_{\eta}$ of the Lagrangian formulation (6) are related as follows.
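As a minimal illustration (not the active-set algorithm of [23]), the projection can also be computed by Dykstra's alternating projection method when $\mathcal{P}=\{x:Ax\leq b\}$ is given by halfspaces; the following numpy sketch assumes such a representation with $\mathcal{P}\neq\emptyset$ and is a simple, slower alternative.

```python
import numpy as np

def proj_polyhedron(y, A, b, n_iter=5000):
    """Euclidean projection of y onto {x : A x <= b} via Dykstra's algorithm.
    Each row of A defines one halfspace; the projection onto a single
    halfspace is available in closed form, and Dykstra's correction terms
    make the cyclic scheme converge to the true projection."""
    m, _ = A.shape
    x = y.astype(float).copy()
    corr = np.zeros((m, len(y)))  # one correction term per halfspace
    for _ in range(n_iter):
        for i in range(m):
            z = x + corr[i]
            viol = A[i] @ z - b[i]
            x = z - (max(viol, 0.0) / (A[i] @ A[i])) * A[i]
            corr[i] = z - x
    return x

# Illustrative polytope: the box P = [1,2] x [0,1] in R^2.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([-1.0, 2.0, 0.0, 1.0])
c = np.array([-1.0, -0.3])
for eta in (0.5, 2.0, 4.0):
    print(eta, np.round(proj_polyhedron(-eta * c, A, b), 4))
```

As $\eta$ increases, the printed points $x_{\eta}$ travel through the interior toward the face $\{x_{1}=2\}$ and finally the vertex $(2,1)$, tracing the curve of Lemma 2.4.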

Lemma 2.4 (Euler \leftrightarrow Lagrange).

Given $\eta\in[0,\infty]$, there exists a unique $\delta=\delta(\eta)\in\overline{D}$ such that $x_{\eta}=x^{\delta}$, namely $\delta=\|x_{\eta}\|$. The function $\eta\mapsto\delta(\eta)$ is nondecreasing.

Conversely, let $\delta\in\overline{D}$. Then there exists $\eta\in[0,\infty)$ (possibly non-unique) such that $x_{\eta}=x^{\delta}$. Define $\eta(\delta)$ as the minimal $\eta$ with $x_{\eta}=x^{\delta}$. (This is merely for concreteness; any choice of selector will do.) Then $\delta\mapsto\eta(\delta)$ is strictly increasing on $\overline{D}$.

Proof.

Let $\delta\in D$ and recall that $\|x^{\delta}\|=\delta$. Note also that $\eta\mapsto x_{\eta}$ is continuous and that $\eta\mapsto\|x_{\eta}\|$ is nondecreasing, with range $[\|x_{0}\|,\|x_{\infty}\|]=\overline{D}$. Hence there exists $\eta=\eta(\delta)$ such that $\|x_{\eta}\|=\delta$. We see from (7) that $x_{\eta}$ minimizes $\langle x,c\rangle$ among all $x\in\mathcal{P}(\delta)$, which is to say that $x_{\eta}=x^{\delta}$. The converse follows. ∎

3 Abstract Result

Recall that $x^{\delta}$ and $x_{\eta}$ denote the solutions of the Eulerian (3) and Lagrangian (6) formulation, respectively. Lemma 2.4 shows that the curves $(x^{\delta})_{\delta_{\min}\leq\delta\leq\delta_{\max}}$ and $(x_{\eta})_{\eta\geq 0}$ are reparametrizations of one another. In particular, they trace out the same trajectory, and only the trajectory matters for the subsequent definition.

Definition 3.1.

Let $\mathcal{P}\subset\mathbb{R}^{d}$ be a polytope and $c\in\mathbb{R}^{d}$. We say that $\mathcal{P}$ is $c$-monotone if for any face $F$ of $\mathcal{P}$,

$x^{\delta}\in F\quad\implies\quad x^{\delta^{\prime}}\in F\quad\text{for all }\delta^{\prime}\geq\delta\text{ in }[\delta_{\min},\delta_{\max}],$ (9)

or equivalently if

$x_{\eta}\in F\quad\implies\quad x_{\eta^{\prime}}\in F\quad\text{for all }\eta^{\prime}\geq\eta\text{ in }[0,\infty].$ (10)

We say that $\mathcal{P}$ is monotone if it is $c$-monotone for all $c\in\mathbb{R}^{d}$.

This definition means that the “support” of $x^{\delta}$ is monotone decreasing in $\delta$, in the following sense. For any $x\in\mathcal{P}$, there is a unique face $F=F(x)$ such that $x\in\operatorname*{ri}F$; moreover, $x\in\operatorname*{ri}F$ if and only if $x$ is a convex combination of the vertices $\operatorname*{ext}F$ with strictly positive weights [8, Exercise 3.1, Theorem 5.6]. Thus $\operatorname*{ext}F(x)$ is a notion of support for $x$, and then the property (9) indeed means that the support of $x^{\delta}$ is monotone decreasing for inclusion. When $\mathcal{P}$ is the unit simplex, $\operatorname*{ext}F(x)$ boils down to the usual measure-theoretic notion of support, namely $\{i:\,x_{i}>0\}$ for $x=(x_{1},\dots,x_{d})\in\mathcal{P}$. This identification breaks down when $\mathcal{P}$ is the Birkhoff polytope, but monotonicity is nevertheless equivalent for the two notions of support (see Lemma 4.2). (Clearly this equivalence does not hold in general. E.g., when $\mathcal{P}\subset(0,1)^{d}$, all points have the same measure-theoretic support.) Next, we characterize monotonicity in geometric terms.

Theorem 3.2.

A polytope $\mathcal{P}$ is monotone (Definition 3.1) if and only if

(H1) $\operatorname{proj}_{\mathcal{P}}(0)\in\operatorname*{ri}\mathcal{P}$, and

(H2) $\operatorname{proj}_{\operatorname*{aff}F}(0)\in F$ for each face $\emptyset\neq F\subset\mathcal{P}$.

The two conditions are similar, with the requirement for the proper faces $F$ being less stringent than for the improper face $\mathcal{P}$: while $\operatorname{proj}_{F}(0)\in\operatorname*{ri}F$ is a sufficient condition for (H2), the condition (H2) includes the boundary case where the projection lands on $\operatorname*{rbd}F$ without the constraint $F$ being binding. We shall see that (H1) and (H2) play different roles in the proof. Simple examples show that changing $\operatorname*{ri}\mathcal{P}$ to $\mathcal{P}$ in (H1) or $F$ to $\operatorname*{ri}F$ in (H2) invalidates the equivalence in Theorem 3.2.
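For a concrete illustration of the role of (H1), consider the segment $\mathcal{P}=\operatorname*{conv}\{(1,0),(2,1)\}\subset\mathbb{R}^{2}$. Minimizing $\|(1,0)+t(1,1)\|^{2}$ over $t\in\mathbb{R}$ gives $t=-1/2<0$, so $\operatorname{proj}_{\mathcal{P}}(0)=(1,0)$ is a vertex and (H1) fails. Accordingly, for $c=(-1,-1)$ the curve starts at $x^{\delta_{\min}}=(1,0)$ with $\delta_{\min}=1$, so it lies in the face $\{(1,0)\}$ at $\delta_{\min}$ and leaves that face for every $\delta>\delta_{\min}$; the segment is not monotone.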

3.1 Proof of Theorem 3.2

We first prove the “if” implication, using the Lagrangian formulation (Lemma 2.3). This parametrization is convenient because $\eta\mapsto x_{\eta}$ is piecewise affine. While the latter is well known (even for some more general norms; see [18] and the references therein), we detail the statement for completeness.

Lemma 3.3.

The curve $[0,\infty]\ni\eta\mapsto x_{\eta}$ is piecewise affine. The affine pieces correspond to faces of $\mathcal{P}$ as follows. Fix $\eta_{0}\in[0,\infty]$ and let $F$ be the unique face of $\mathcal{P}$ such that $x_{\eta_{0}}\in\operatorname*{ri}F$. Let $I=\{\eta:x_{\eta}\in\operatorname*{ri}F\}$. Then $I$ is an interval containing $\eta_{0}$ and $\overline{I}\ni\eta\mapsto x_{\eta}\in F$ is affine. In particular, the curve $\eta\mapsto x_{\eta}$ does not return to $\operatorname*{ri}F$ after leaving it.

Proof.

Consider $\eta_{1}<\eta_{2}$ in $I$ and let $H=\operatorname*{aff}F$. As $x_{\eta_{i}}\in\operatorname*{ri}F$, we have $x_{\eta_{i}}=\operatorname{proj}_{F}(-\eta_{i}c)=\operatorname{proj}_{H}(-\eta_{i}c)$. Since $H$ is an affine subspace, the curve $\eta\mapsto\operatorname{proj}_{H}(-\eta c)$ is affine. In particular, convexity of $\operatorname*{ri}F$ implies that $\operatorname{proj}_{H}(-\eta c)\in\operatorname*{ri}F$ for all $\eta\in[\eta_{1},\eta_{2}]$. Thus $I$ is an interval and $I\ni\eta\mapsto x_{\eta}\in\operatorname*{ri}F$ is affine, and this extends to the closure by continuity. ∎

Recall that $\operatorname*{rbd}K:=K\setminus\operatorname*{ri}K$ denotes the relative boundary of $K\subset\mathbb{R}^{d}$. Consider the first time the curve $\eta\mapsto x_{\eta}$ touches a given face $F$ of $\mathcal{P}$. That point may lie in $\operatorname*{ri}F$ or in $\operatorname*{rbd}F$. As seen in Lemma 3.5 below, the boundary situation indicates that $\eta\mapsto x_{\eta}$ left another face $F_{*}$ when it entered $F\not\subset F_{*}$, meaning that $\mathcal{P}$ is not monotone. Hence, we analyze that situation in detail in the next lemma.

Lemma 3.4.

Let $F$ be a face of $\mathcal{P}$ and $I=\{\eta:x_{\eta}\in\operatorname*{ri}F\}\neq\emptyset$. Let $\eta_{0}:=\inf I$. If $x_{\eta_{0}}\in\operatorname*{rbd}F$, then exactly one of the following holds true:

(i) $\operatorname{proj}_{\operatorname*{aff}F}(0)\notin F$;

(ii) $\eta_{0}=0$, and then $x_{\eta_{0}}=x_{0}=\operatorname{proj}_{\mathcal{P}}(0)=\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$. In particular, $x_{0}\notin\operatorname*{ri}\mathcal{P}$.

Proof.

Recall from Lemma 3.3 that $I$ is an interval and $\overline{I}\ni\eta\mapsto x_{\eta}\in F$ is affine. Let $H=\operatorname*{aff}F$ and consider $z_{\eta}:=\operatorname{proj}_{H}(-\eta c)$. Recalling Lemma 2.2, we have for $\eta\in I$ that $x_{\eta}=\operatorname{proj}_{\mathcal{P}}(-\eta c)=\operatorname{proj}_{F}(-\eta c)=\operatorname{proj}_{H}(-\eta c)=z_{\eta}$ and in particular $z_{\eta}\in\operatorname*{ri}F$.

As $x_{\eta}=z_{\eta}$ for $\eta\in I$ and both curves are continuous, it follows that $x_{\eta_{0}}=z_{\eta_{0}}$. Therefore, our assumption that $x_{\eta_{0}}\in\operatorname*{rbd}F$ implies that $z_{\eta_{0}}\in\operatorname*{rbd}F$.

Since $H$ is an affine space, $\mathbb{R}_{+}\ni\eta\mapsto z_{\eta}$ is affine. We have $z_{\eta_{0}}\in\operatorname*{rbd}F$ and $z_{\eta}\in\operatorname*{ri}F$ for all $\eta\in(\eta_{0},\eta_{1})$, where $\eta_{1}:=\sup I>\eta_{0}$. (Note that $I$ cannot be a singleton when $z_{\eta_{0}}\in\operatorname*{rbd}F$.) As $F$ is convex, it follows that $z_{\eta}\notin F$ for $0\leq\eta<\eta_{0}$. If $z_{0}\notin F$, we are in the first case. Whereas if $z_{0}\in F$, it follows that $\eta_{0}=0$. Thus $x_{\eta_{0}}=x_{0}=\operatorname{proj}_{\mathcal{P}}(0)=\operatorname{proj}_{F}(0)$. Moreover, $z_{\eta_{0}}=x_{\eta_{0}}\in F$ implies $x_{\eta_{0}}=\operatorname{proj}_{\operatorname*{aff}F}(0)$. ∎

Lemma 3.5.

Let $\eta_{*}\geq 0$ and let $F_{*}$ be the unique face of $\mathcal{P}$ with $x_{\eta_{*}}\in\operatorname*{ri}F_{*}$. Suppose there exists $\eta>\eta_{*}$ such that $x_{\eta}\notin F_{*}$ and let $\eta_{0}=\max\{\eta\geq\eta_{*}:\,x_{\eta}\in F_{*}\}$. For sufficiently small $\eta_{1}>\eta_{0}$, there is a unique face $F$ such that $x_{\eta}\in\operatorname*{ri}F$ for all $\eta\in(\eta_{0},\eta_{1})$. (When traveling along the curve $(x_{\eta})_{\eta\geq\eta_{*}}$, $F$ is the first face outside $F_{*}$ whose interior is visited.) If $x_{0}:=\operatorname{proj}_{\mathcal{P}}(0)\in\operatorname*{ri}\mathcal{P}$, then $F$ satisfies $\operatorname{proj}_{\operatorname*{aff}F}(0)\notin F$.

Proof.

For $x\in\mathcal{P}$, let $F(x)$ denote the unique face such that $x\in\operatorname*{ri}F(x)$. As seen in Lemma 3.3, the map $(\eta_{0},\infty)\ni\eta\mapsto F(x_{\eta})$ has finitely many values and the preimage of each value is an interval. Hence the map must be constant on $(\eta_{0},\eta_{1})$ for sufficiently small $\eta_{1}>\eta_{0}$. Let $F=F(x_{\eta})$, $\eta\in(\eta_{0},\eta_{1})$ be the corresponding face. As $F_{*}$ and $F$ are faces with $F_{*}\not\subset F$, we have $F\cap F_{*}\subset\operatorname*{rbd}F$. Thus $x_{\eta_{0}}\in F\cap F_{*}$ implies $x_{\eta_{0}}\in\operatorname*{rbd}F$, and now Lemma 3.4 applies. ∎

Proof of Theorem 3.2.

Step 1: (H1), (H2) $\Rightarrow$ monotone. If $\mathcal{P}$ is not monotone, then Lemma 3.5 shows that either (H1) or (H2) must be violated.

Step 2: Not (H1) $\Rightarrow$ not monotone. Suppose that $x_{0}:=\operatorname{proj}_{\mathcal{P}}(0)\notin\operatorname*{ri}\mathcal{P}$. As $x_{0}\in\mathcal{P}\setminus(\operatorname*{ri}\mathcal{P})$, there exists a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle=b\}$ with

$x_{0}\in H,\qquad\mathcal{P}\subseteq\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle\leq b\},\qquad\operatorname*{ri}\mathcal{P}\subseteq\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle<b\}.$

Define $c=a$ and $F=H\cap\mathcal{P}$ and $\delta_{0}=\|x_{0}\|$, and denote $\overline{B}_{\delta}=\{x\in\mathbb{R}^{d}:\ \|x\|\leq\delta\}$. Then $\mathcal{P}\cap\overline{B}_{\delta_{0}}=\{x_{0}\}$ and in particular $x^{\delta_{0}}=x_{0}$. Moreover,

$\langle x,c\rangle=b\quad\text{for all }x\in F,\qquad\text{while}\qquad\langle x,c\rangle<b\quad\text{for all }x\in\operatorname*{ri}\mathcal{P}.$ (11)

For any $\delta>\delta_{0}$, we have $\overline{B}_{\delta}\cap\operatorname*{ri}\mathcal{P}\neq\emptyset$, and then (11) implies $x^{\delta}\notin F$. In particular, $\mathcal{P}$ is not monotone for the cost $c$.

Step 3: Not (H2) $\Rightarrow$ not monotone. Given the assumption, there exists a face $F$ of $\mathcal{P}$ such that $\operatorname{proj}_{\operatorname*{aff}F}(0)$ does not belong to $F$. Set $x_{\min}:=\operatorname{proj}_{F}(0)$. As $F$ is a face of $\mathcal{P}$, there is a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle=b\}$ such that

$F=H\cap\mathcal{P}\quad\text{and}\quad\mathcal{P}\subset\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle\geq b\}.$

Define $c=a-x_{\min}$ and $\delta_{1}=\|x_{\min}\|$. Then for every $x\in\mathcal{P}\cap\overline{B}_{\delta_{1}}$,

$\langle x,c\rangle=\langle x,a\rangle-\langle x,x_{\min}\rangle\geq b-\|x\|\|x_{\min}\|\geq b-\|x_{\min}\|^{2}=\langle x_{\min},c\rangle,$

showing that $x^{\delta_{1}}=x_{\min}$. Since $x_{\min}\neq\operatorname{proj}_{\operatorname*{aff}F}(0)$, Lemma 2.2 implies that

$F^{\prime}:=\{x\in\mathbb{R}^{d}:\ \langle x_{\min},x-x_{\min}\rangle=0\}\cap F$

is a face of $F$ (and a fortiori a face of $\mathcal{P}$) such that

$F\setminus F^{\prime}\subset\{x\in\mathbb{R}^{d}:\ \langle x_{\min},x-x_{\min}\rangle>0\}.$

We claim that $x^{\delta}\notin F^{\prime}$ for all $\delta>\delta_{1}$. Indeed, for any $x^{\prime}\in F^{\prime}$ it holds that

$\langle x^{\prime},c\rangle=\langle x^{\prime},a\rangle-\langle x^{\prime},x_{\min}\rangle=b-\langle x^{\prime},x_{\min}\rangle=\langle x_{\min},c\rangle,$

whereas for any $x\in F\setminus F^{\prime}$, it holds that

$\langle x,c\rangle=b-\langle x,x_{\min}\rangle=\langle x_{\min},c\rangle-\langle x-x_{\min},x_{\min}\rangle<\langle x_{\min},c\rangle.$

In summary, $\langle x,c\rangle<\langle x^{\prime},c\rangle$ for all $x\in F\setminus F^{\prime}$ and $x^{\prime}\in F^{\prime}$. In view of $x^{\delta_{1}}=x_{\min}\in F^{\prime}$, it follows that $x^{\delta}\notin F^{\prime}$ for all $\delta>\delta_{1}$: if $x^{\delta}\in F^{\prime}$, then $\langle x^{\delta},c\rangle=\langle x_{\min},c\rangle$ by the above, so that $x_{\min}\in\mathcal{P}(\delta)$ would be a second minimizer, contradicting the uniqueness in Lemma 2.1 (recall $\|x^{\delta}\|=\delta>\delta_{1}=\|x_{\min}\|$). In particular, $\mathcal{P}$ is not monotone for the cost $c$. ∎

4 Applications

4.1 Soft-Min

To begin with a straightforward example, let $\mathcal{P}=\Delta$ be the unit simplex in Euclidean space $\mathbb{R}^{d}$ and $c=(c_{1},\dots,c_{d})\in\mathbb{R}^{d}$. Then

$\min_{x\in\Delta}\langle c,x\rangle=\min_{1\leq i\leq d}c_{i}$

corresponds to finding the minimum value of $c$. More specifically, $x=(x_{1},\dots,x_{d})$ is a minimizer if and only if $x$ is supported in $\operatorname*{argmin}c$; i.e., $\operatorname*{spt}x=\{i:\,x_{i}\neq 0\}\subset\operatorname*{argmin}c$. When this linear program is regularized with the entropy of $x$, we obtain the usual soft-min (counterpart of log-sum-exp) which gives large weights to the small values of $c$ but non-zero weights to all values. The quadratic regularization,

$x^{\delta}=\operatorname*{argmin}_{x\in\Delta:\,\|x\|\leq\delta}\langle c,x\rangle\qquad\text{or}\qquad x_{\eta}=\operatorname*{argmin}_{x\in\Delta}\langle c,x\rangle+\frac{1}{2\eta}\|x\|^{2},$ (12)

yields a sparse soft-min: as $\delta\to\delta_{\max}$ (or as $\eta\to\infty$), the support of the solution tends to $\operatorname*{argmin}c$. In this example, $\delta_{\min}=d^{-1/2}$ (the norm of the uniform vector) and $\delta_{\max}=(\#\operatorname*{argmin}c)^{-1/2}$.
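As an illustration, $x_{\eta}=\operatorname{proj}_{\Delta}(-\eta c)$ can be computed exactly with the well-known sort-based routine for the Euclidean projection onto the simplex; the following minimal numpy sketch (the projection routine is an assumption of this sketch, not part of the paper) shows the support shrinking toward $\operatorname*{argmin}c$ as $\eta$ grows.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the unit simplex (sort-based routine)."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)        # threshold
    return np.maximum(y - tau, 0.0)

def sparse_softmin(c, eta):
    """x_eta = proj_Delta(-eta c), cf. Lemma 2.3 with P = Delta."""
    return proj_simplex(-eta * np.asarray(c, dtype=float))

c = [0.0, 0.1, 0.2, 1.0]
for eta in (0.1, 1.0, 10.0):
    x = sparse_softmin(c, eta)
    print(f"eta={eta:5.1f}  x={np.round(x, 3)}  support={np.flatnonzero(x > 1e-12)}")
```

For small $\eta$ all coordinates are positive, and as $\eta$ increases the weights of the larger costs drop to zero one after another, in line with Corollary 4.1 below.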

Corollary 4.1.

The unit simplex $\Delta\subset\mathbb{R}^{d}$ is monotone for all $d\geq 1$; i.e., the support of the soft-min $x^{\delta}$ (or $x_{\eta}$) defined in (12) decreases monotonically to $\operatorname*{argmin}c$ as $\delta\in[\delta_{\min},\delta_{\max}]$ (or $\eta\geq 0$) increases.

Proof.

We have $\operatorname*{ext}\Delta=\{e_{1},\dots,e_{d}\}$, the canonical basis of $\mathbb{R}^{d}$. Let $F\subset\Delta$ be a face; then there exist indices $\{i_{1},\dots,i_{k}\}\subseteq\{1,\dots,d\}$ such that $F=\operatorname*{conv}(e_{i_{1}},\dots,e_{i_{k}})$. Note that $\operatorname{proj}_{F}(0)=\sum_{j=1}^{k}\lambda_{j}e_{i_{j}}$, where $\lambda=(\lambda_{1},\dots,\lambda_{k})$ is the solution of

$\min_{\lambda\in\mathbb{R}^{k}}\Big\|\sum_{j=1}^{k}\lambda_{j}e_{i_{j}}\Big\|^{2}\quad\text{ s.t. }\quad\sum_{j=1}^{k}\lambda_{j}=1,$

namely $\lambda=(k^{-1},\dots,k^{-1})$. In particular, $\operatorname{proj}_{F}(0)=k^{-1}\sum_{j=1}^{k}e_{i_{j}}\in\operatorname*{ri}F$. Now Theorem 3.2 applies. ∎

4.2 Optimal Transport

We recall the quadratically regularized optimal transport problem

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,d\gamma(x,y)+\frac{1}{2\eta}\left\|\frac{d\gamma}{d(\mu\otimes\nu)}\right\|_{L^{2}(\mu\otimes\nu)}^{2}$ (QOT)

where $\Gamma(\mu,\nu)$ denotes the set of couplings of the probability measures $(\mu,\nu)$. We will be concerned with the discrete case as introduced in [6, 14, 17]. See also [26, 27, 29] for more general theory, [4, 16, 19] for asymptotic aspects, [15, 20, 22, 25, 33] for computational approaches and [25, 35] for some recent applications.

Fix $N\in\mathbb{N}$ and two sets of distinct points, $\{X_{1},\dots,X_{N}\}\subset\mathbb{R}^{D}$ and $\{Y_{1},\dots,Y_{N}\}\subset\mathbb{R}^{D}$. Let $\mu=\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{i}}$ and $\nu=\frac{1}{N}\sum_{i=1}^{N}\delta_{Y_{i}}$ denote the associated empirical measures, and let $\hat{c}:\mathbb{R}^{D}\times\mathbb{R}^{D}\to\mathbb{R}$ be a (cost) function. We note that $\mu\otimes\nu=\frac{1}{N^{2}}\sum_{i,j=1}^{N}\delta_{(X_{i},Y_{j})}$ while $d\gamma/d(\mu\otimes\nu)$ is simply the ratio of the probability mass functions. Any coupling $\gamma\in\Gamma(\mu,\nu)$ can be identified with its matrix of probability weights

$(\gamma_{i,j})_{i,j=1}^{N}\in\Gamma_{N}=\left\{\gamma\in\mathbb{R}^{N\times N}:\,\sum_{i=1}^{N}\gamma_{i,j}=\frac{1}{N},\;\sum_{j=1}^{N}\gamma_{i,j}=\frac{1}{N},\;\gamma_{i,j}\geq 0\right\}$

via $\gamma=\sum_{i,j=1}^{N}\gamma_{i,j}\delta_{(X_{i},Y_{j})}$. Writing similarly $c_{ij}=\hat{c}(X_{i},Y_{j})$, (QOT) can be identified with the problem

$\inf_{\gamma\in\Gamma_{N}}\langle c,\gamma\rangle+\frac{N^{2}}{2\eta}\|\gamma\|^{2}.$

Clearly $\Gamma_{N}=\frac{1}{N}\Pi_{N}$ where

$\Pi_{N}=\left\{\pi\in\mathbb{R}^{N\times N}:\,\sum_{i=1}^{N}\pi_{i,j}=1,\;\sum_{j=1}^{N}\pi_{i,j}=1,\;\pi_{i,j}\geq 0\right\}$

denotes the Birkhoff polytope of doubly stochastic matrices; see, e.g., [9]. The monotonicity (in the sense of Definition 3.1) of $\Pi_{N}$ is equivalent to that of $\Gamma_{N}$.

The following result connects the measure-theoretic support $\operatorname*{spt}\pi=\{(i,j):\,\pi_{ij}>0\}$ with the geometric notion discussed below Definition 3.1.

Lemma 4.2.

Let $\pi,\pi^{\prime}\in\Pi_{N}$ and denote by $F(\pi)$ the unique face $F\subset\Pi_{N}$ such that $\pi\in\operatorname*{ri}F$. The following are equivalent:

(i) $\operatorname*{spt}\pi^{\prime}\subset\operatorname*{spt}\pi$,

(ii) $F(\pi^{\prime})\subset F(\pi)$,

(iii) if $\pi\in F$ for some face $F\subset\Pi_{N}$, then $\pi^{\prime}\in F$.

Proof.

The equivalence of (ii) and (iii) follows from the fact that $F(\pi)$ is the smallest face containing $\pi$ [8, Theorem 5.6]. The equivalence of (i) and (iii) is deferred to Lemma 4.8 below. (The implication (ii) $\Rightarrow$ (i) is straightforward, and holds for any polytope $\mathcal{P}$ of measures.) ∎

As a consequence of Lemma 4.2, we have the following equivalence.

Corollary 4.3.

Let $\gamma_{\eta,\hat{c}}$ denote the solution of (QOT) for $\hat{c}:\mathbb{R}^{D}\times\mathbb{R}^{D}\to\mathbb{R}$. The Birkhoff polytope is monotone if and only if the optimal support $\operatorname*{spt}\gamma_{\eta,\hat{c}}$ is monotone decreasing in $\eta$ for all $\hat{c}$.

Theorem 4.4.

The Birkhoff polytope $\Pi_{N}$ is monotone for $1\leq N\leq 4$ but not monotone for $N\geq 5$. In particular, when $N\geq 5$, the optimal support $\eta\mapsto\operatorname*{spt}(\gamma_{\eta,\hat{c}})$ is not monotone for some costs $\hat{c}$.

Example 4.5.

We exhibit an example for $N=5$ where the support is not monotone. (This is not easily achieved by brute-force numerical experiment.) Our starting point is the matrix $A$ in (14) below, which is used to describe a face where (H2) fails. Step 3 in the proof of Theorem 3.2 then suggests perturbing $-A$ in a suitable direction to find a cost $c$ exhibiting non-monotonicity. With some geometric considerations, this leads us to propose the cost matrix

$c=\begin{bmatrix}-1.1&-1&-1&-1&-1\\ -1&-1.1&0&0&0\\ -1&0&-1.1&0&0\\ -1&0&0&-1.1&0\\ -1&0&0&0&-1.1\end{bmatrix}.$

For $\eta=2.5$, or equivalently regularization strength $1/(2\eta)=0.2$, the corresponding problem (QOT) has an exact solution coupling $\gamma_{\eta}$ with probability weights given by

$(\gamma_{\eta=2.5})=\begin{bmatrix}0&0.05&0.05&0.05&0.05\\ 0.05&0.15&0&0&0\\ 0.05&0&0.15&0&0\\ 0.05&0&0&0.15&0\\ 0.05&0&0&0&0.15\end{bmatrix}.$

(This can be verified by noticing that the stated coupling is induced by the dual potentials $f=g=(-0.575,-0.175,-0.175,-0.175,-0.175)$; see for instance [29, Theorem 2.2 and Remark 2.3].) We observe in particular that the location $(X_{1},Y_{1})$ is not in the support. On the other hand, because the diagonal of $c$ features the smallest costs, the solution $\gamma_{\infty}$ of the unregularized transport problem is to put all mass on the diagonal; i.e., to transport all mass from $X_{i}$ to $Y_{i}$ for each $i$. Because of the stationary convergence mentioned in Lemma 2.3, that is also the solution of the regularized problem for large enough values of $\eta$ (e.g., $\eta=100$ will do):

$(\gamma_{\eta=100})=(\gamma_{\infty})=\begin{bmatrix}0.2&0&0&0&0\\ 0&0.2&0&0&0\\ 0&0&0.2&0&0\\ 0&0&0&0.2&0\\ 0&0&0&0&0.2\end{bmatrix}.$

In particular, $(X_{1},Y_{1})$ is part of the support for large $\eta$, completing the example. Figure 2 shows in more detail the weight at $(X_{1},Y_{1})$ as a function of (the inverse of) $\eta$.

Figure 2: Probability mass $\gamma_{\eta}(X_{1},Y_{1})$ plotted against $1/(2\eta)$, showing that $(X_{1},Y_{1})$ is in the support for small and large values of $\eta$ but outside for an intermediate interval.
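The non-monotonicity can also be checked numerically via Lemma 2.3, by projecting $-\eta^{\prime}c$ onto $\Gamma_{5}$ with Dykstra's algorithm. The parameter $\eta^{\prime}$ below refers to the penalty $\frac{1}{2\eta^{\prime}}\|\gamma\|^{2}$ on the weight matrix itself, which differs from the $\eta$ in (QOT) by dimensional normalization; this sketch and its scaling are an illustration added here, not the computation behind the figure.

```python
import numpy as np

def proj_gamma(y, N=5, n_iter=20000):
    """Dykstra projection onto Gamma_N = {g >= 0, all row/col sums = 1/N}.
    The two marginal constraints are affine, so only the nonnegativity
    constraint carries a Dykstra correction term p."""
    x = y.copy()
    p = np.zeros_like(y)
    for _ in range(n_iter):
        x = x + (1.0 / N - x.sum(axis=1, keepdims=True)) / N  # fix row sums
        x = x + (1.0 / N - x.sum(axis=0, keepdims=True)) / N  # fix col sums
        z = x + p
        x = np.maximum(z, 0.0)                                # clip negatives
        p = z - x
    return x

c = np.array([[-1.1, -1.0, -1.0, -1.0, -1.0],
              [-1.0, -1.1,  0.0,  0.0,  0.0],
              [-1.0,  0.0, -1.1,  0.0,  0.0],
              [-1.0,  0.0,  0.0, -1.1,  0.0],
              [-1.0,  0.0,  0.0,  0.0, -1.1]])

for eta in (0.05, 0.2, 2.5):  # strong, intermediate, weak regularization
    g = proj_gamma(-eta * c)
    print(f"eta'={eta}: gamma_11 = {g[0, 0]:.4f}")
# Expected output (approximately): 0.0200, 0.0000, 0.2000 -- the weight at
# (X_1, Y_1) is positive, then zero, then positive again.
```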
Remark 4.6.

The early work of [17] considered a minimum-cost flow problem with quadratic regularization that predates the optimal transport literature for this regularization. The authors point out that the solution can have non-monotone support already in the minimal setting of $2\times 2$ points [17, Figure 1]. Similarly, it is immediate to obtain non-monotonicity if instead of $\mu\otimes\nu$ we use a different measure to define the $L^{2}$-norm in (QOT), as in [29]. In those examples, the mechanism causing non-monotonicity is straightforward and quite different from the one in the present work, where non-monotonicity arises only in higher dimensions.

4.3 Proof of Lemma 4.2 and Theorem 4.4

Birkhoff’s celebrated theorem [5] states that the vertices $\operatorname*{ext}\Pi_{N}$ are the permutation matrices of $\{1,\dots,N\}$; that is, the elements of $\Pi_{N}$ with binary entries. Following [11], the faces of $\Pi_{N}$ can be described using the so-called permanent function. If $A$ is a binary $N\times N$ matrix, its permanent ${\rm per}(A)\in\mathbb{N}$ is defined as the number of permutation matrices $P$ with $P\leq A$ (meaning that $P_{i,j}\leq A_{i,j}$ for all $i,j$). Denoting by $\operatorname*{conv}(\cdot)$ the convex hull, the following characterization is contained in [11, Theorem 2.1].

Lemma 4.7.

Let $t\in\mathbb{N}$ and $P^{(1)},\dots,P^{(t)}\in\operatorname*{ext}(\Pi_{N})$. Let $A=(A_{i,j})_{i,j=1}^{N}$ be the matrix such that $A_{i,j}=1$ if there exists $s\in\{1,\dots,t\}$ with $P^{(s)}_{i,j}=1$ and $A_{i,j}=0$ otherwise. Then $\operatorname*{conv}(\{P^{(1)},\dots,P^{(t)}\})$ is a face of $\Pi_{N}$ if and only if ${\rm per}(A)=t$.
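For small $N$, the permanent in this criterion can be evaluated by brute force; the following sketch (an illustration added here, not the paper's code) counts the permutation matrices dominated by a binary matrix and checks the matrix $A$ of (14) below:

```python
import numpy as np
from itertools import permutations

def permanent_binary(A):
    """per(A) for a binary matrix A: the number of permutations sigma
    with A[i, sigma(i)] = 1 for all i, i.e. permutation matrices P <= A."""
    n = A.shape[0]
    return sum(all(A[i, s[i]] for i in range(n)) for s in permutations(range(n)))

# The matrix A of (14) for N = 5: ones on the diagonal, first row, first column.
N = 5
A = np.eye(N, dtype=int)
A[0, :] = 1
A[:, 0] = 1
print(permanent_binary(A))  # prints 5: the identity plus the transpositions (1 k)
```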

We use Lemma 4.7 to relate faces with the distribution of zeros.

Lemma 4.8.

Let $\pi,\pi^{\prime}\in\Pi_{N}$. The following are equivalent:

(i) there exists $(i,j)$ such that $\pi_{i,j}=0$ and $\pi_{i,j}^{\prime}>0$;

(ii) there exists a face $F$ of $\Pi_{N}$ such that $\pi\in F$ and $\pi^{\prime}\notin F$.

Proof.

(i) $\Rightarrow$ (ii): Let $(i,j)$ be such that $\pi_{i,j}=0$ and $\pi_{i,j}^{\prime}>0$. Assume w.l.o.g. that $(i,j)=(1,1)$. Then $\pi$ belongs to the set

$F=\{\pi\in\Pi_{N}:\langle\pi,A\rangle=N\},$

where $A$ is the matrix

$A=\begin{pmatrix}0&1&1&\cdots&1\\ 1&1&1&\cdots&1\\ 1&1&1&\cdots&1\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&1&1&\cdots&1\end{pmatrix}.$

As $\langle\pi,A\rangle\leq N$ for all $\pi\in\Pi_{N}$, we see that $F$ is a face. Clearly $\pi^{\prime}$ does not belong to $F$, proving the claim.

(ii) $\Rightarrow$ (i): Let $P^{(1)},\dots,P^{(t)}\in\operatorname*{ext}\Pi_{N}$ be such that $F=\operatorname*{conv}(\{P^{(1)},\dots,P^{(t)}\})$ and let $A=(A_{i,j})_{i,j=1}^{N}$ be the matrix such that $A_{i,j}=1$ if there exists $s\in\{1,\dots,t\}$ such that $P^{(s)}_{i,j}=1$ and $A_{i,j}=0$ otherwise. We have ${\rm per}(A)=t$ by Lemma 4.7. As an element of $\Pi_{N}$, $\pi^{\prime}$ can be written as a convex combination of permutation matrices, i.e.,

$\pi^{\prime}=\sum_{P\in\operatorname*{ext}(\Pi_{N})}\lambda_{P}P,\quad\text{with}\quad\lambda_{P}\in[0,1]\ \text{and}\ \sum_{P\in\operatorname*{ext}(\Pi_{N})}\lambda_{P}=1.$

As $\pi^{\prime}\notin F$, there exists $P\in\operatorname*{ext}(\Pi_{N})\setminus\{P^{(1)},\dots,P^{(t)}\}$ with $\lambda_{P}>0$. Using the fact that ${\rm per}(A)=t$, we derive the existence of $(i,j)$ such that $P_{i,j}=1$ but $A_{i,j}=0$. In particular, $\pi_{i,j}^{\prime}\geq\lambda_{P}>0$ but $\pi_{i,j}=0$, proving the claim. ∎

Lemma 4.9.

Let $N\geq 1$. Define the permutation matrices $P^{k}\in\mathbb{R}^{N\times N}$, $1\leq k\leq N$, by

$P^{k}_{i,j}=\begin{cases}1&\text{ if }i=1\text{ and }j=k,\\ 1&\text{ if }i=k\text{ and }j=1,\\ 1&\text{ if }1\neq i=j\neq k,\\ 0&\text{ otherwise.}\end{cases}$

That is, $P^{k}$ permutes the first with the $k$-th element. Then $F:=\operatorname*{conv}(P^{1},\dots,P^{N})$ is a face of $\Pi_{N}$ and

$\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k=1}^{N}\lambda_{k}P^{k}\quad\text{for}\quad\lambda_{1}=\frac{4-N}{N+2},\quad\lambda_{2}=\cdots=\lambda_{N}=\frac{2}{N+2}.$ (13)

As a consequence,

$\operatorname{proj}_{\operatorname*{aff}F}(0)\in\begin{cases}\operatorname*{ri}F,&\text{ if }N=1,2,3,\\ \operatorname*{rbd}F,&\text{ if }N=4,\\ \mathbb{R}^{N\times N}\setminus F,&\text{ if }N\geq 5,\end{cases}$

and in particular (H2) is violated for $N\geq 5$.

Proof.

Define $A\in\mathbb{R}^{N\times N}$ by

$A_{i,j}=\begin{cases}1&\text{ if }i=j,\\ 1&\text{ if }i=1\text{ or }j=1,\\ 0&\text{ otherwise,}\end{cases}$ (14)

so that $A$ is the entry-wise maximum $A=\max(P^{1},\dots,P^{N})$. One readily verifies that ${\rm per}(A)=N$. Hence by Lemma 4.7, $F:=\operatorname*{conv}(P^{1},\dots,P^{N})$ is a face of $\Pi_{N}$. To determine $\operatorname{proj}_{\operatorname*{aff}F}(0)$, we consider the minimization problem

$\min_{\lambda=(\lambda_{1},\dots,\lambda_{N})\in\mathbb{R}^{N}}\Big\|\sum_{i=1}^{N}\lambda_{i}P^{i}\Big\|^{2}\quad\text{ s.t. }\quad\sum_{i=1}^{N}\lambda_{i}=1.$

The Lagrangian for this problem is

$L(\lambda,\rho)=\lambda_{1}^{2}+2\sum_{j=2}^{N}\lambda_{j}^{2}+\sum_{j=2}^{N}\Big(\sum_{k\neq j}\lambda_{k}\Big)^{2}+\rho\Big(1-\sum_{j=1}^{N}\lambda_{j}\Big)$

and the resulting optimality equations are

$\rho=2N\lambda_{1}+2(N-2)\sum_{j=2}^{N}\lambda_{j},\qquad\sum_{j=1}^{N}\lambda_{j}=1,$

$\rho=2N\lambda_{i}+2(N-2)\lambda_{1}+2(N-3)\sum_{j=2,\,j\neq i}^{N}\lambda_{j}\quad\text{ for }2\leq i\leq N.$

By symmetry, the unique optimal $\lambda$ satisfies $\lambda_{i}=\lambda_{j}=:\lambda_{0}$ for all $i,j\geq 2$, so that

$2N\lambda_{0}+2(N-2)\lambda_{1}+2(N-2)(N-3)\lambda_{0}=2N\lambda_{1}+2(N-1)(N-2)\lambda_{0},$

$\lambda_{1}+(N-1)\lambda_{0}=1,$

and finally

$\lambda_{1}=\lambda_{0}\frac{4-N}{2},\qquad\lambda_{1}+(N-1)\lambda_{0}=1.$

Solving for $\lambda_{1}$ and $\lambda_{0}$ yields (13). Note that $\lambda_{0}>0$ for any $N$, whereas $\lambda_{1}>0$ for $N\in\{1,2,3\}$, $\lambda_{1}=0$ for $N=4$ and $\lambda_{1}<0$ for $N\geq 5$, implying the last claim. ∎

The matrix $A$ of (14) is inspired by the $4\times 4$ matrix in [10, Example 1.5] where the authors are interested in a different problem (and, in turn, credit [1]). (In [10], the conclusion is that “not every zero pattern of a fully indecomposable $(0,1)$-matrix is realizable as the zero pattern of a doubly stochastic matrix whose diagonal sums avoiding the 0’s are constant.”) For the present question of monotonicity, this matrix yields a counterexample only for $N\geq 5$. As an aside, the subsequent proof shows that for $N=4$, the face $F$ considered in Lemma 4.9 is (up to symmetries) the only face where $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$, whereas all other faces $F^{\prime}$ satisfy $\operatorname{proj}_{\operatorname*{aff}F^{\prime}}(0)\in\operatorname*{ri}F^{\prime}$.

Lemma 4.10.

Let $1\leq N\leq 4$. Then $\Pi_{N}$ is monotone.

Proof.

We verify (H1) and (H2). Note that $\operatorname{proj}_{\Pi_{N}}(0)$ is the matrix with all entries equal to $1/N$ (corresponding to the product measure). This matrix is clearly in the relative interior of $\Pi_{N}$, showing (H1). The property (H2) is trivial for $N=1,2$. For $N=3$ and $N=4$, we give a computer-assisted proof in the interest of brevity; the code is available at https://github.com/marcelnutz/birkhoff-polytope-faces. (An analytic proof is also available, but requires us to go through 52 different cases for $N=4$. The cases $N\leq 3$ can also be obtained as a corollary of the case $N=4$: one can check directly that if the Birkhoff polytope $\Pi_{N}$ is (not) monotone for some $N\in\mathbb{N}$, then $\Pi_{N^{\prime}}$ is also (not) monotone for all $N^{\prime}\leq(\geq)N$.) Specifically, we generate all $N\times N$ permutation matrices and determine all families $\{P_{1},\dots,P_{m}\}$ of permutation matrices that form the vertices of a nonempty face $F$ by using the permanent function as in Lemma 4.7. There are 49 nonempty faces for $N=3$ and 7443 for $N=4$. For each face $F$, we can numerically compute $\operatorname{proj}_{\operatorname*{aff}F}(0)$, or more specifically scalar coefficients $(\lambda_{k})_{1\leq k\leq m}$ such that $\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k}\lambda_{k}P_{k}$ and $\sum_{k}\lambda_{k}=1$. The coefficients $\lambda_{k}$ are non-unique for some faces; in that case, we choose positive weights if possible. Note that as $P_{k}$ are binary matrices and $N\leq 4$, the computation can be done with accuracy close to machine precision. It turns out that for $N=3$, all coefficients satisfy $\lambda_{k}>0.01$ (much larger than machine precision), establishing that $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{ri}F$ and in particular (H2). For $N=4$, most faces have coefficients $\lambda_{k}>0.01$, whereas for 96 faces, one coefficient is numerically close to zero, hence requiring an analytic argument. We verify that all those 96 faces are equivalent up to permutations of rows and columns, corresponding to a relabeling of the points $X_{i}$ and $Y_{j}$. Specifically, they are all equivalent to the particular face analyzed in Lemma 4.9, where we have seen that $\operatorname{proj}_{\operatorname*{aff}F}(0)\in F$ for $N\leq 4$ (and in fact $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$ for $N=4$). We conclude that (H2) holds for all faces $F$ when $N=4$, completing the proof. ∎
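The key computation for each face reduces to a small linear system: writing $G_{k,l}=\langle P_{k},P_{l}\rangle$ for the Gram matrix of the vertices, the coefficients of $\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k}\lambda_{k}P_{k}$ satisfy $G\lambda=\tfrac{\rho}{2}\mathbf{1}$ and $\mathbf{1}^{\top}\lambda=1$. The following sketch (an independent illustration, not the repository code) carries this out for the face of Lemma 4.9 and reproduces (13):

```python
import numpy as np

def min_norm_affine_coeffs(vertices):
    """Coefficients lambda with proj_{aff F}(0) = sum_k lambda_k P_k.
    From the KKT system G lambda = (rho/2) 1 and 1^T lambda = 1 we get
    lambda proportional to G^{-1} 1, then normalize; assumes the vertex
    matrices are linearly independent (true for the face of Lemma 4.9)."""
    G = np.array([[float(np.sum(P * Q)) for Q in vertices] for P in vertices])
    lam = np.linalg.solve(G, np.ones(len(vertices)))
    return lam / lam.sum()

def transpositions(N):
    """The vertices P^1, ..., P^N of the face in Lemma 4.9."""
    mats = []
    for k in range(N):
        P = np.eye(N)
        P[[0, k]] = P[[k, 0]]  # swap rows 0 and k: the transposition (1 k)
        mats.append(P)
    return mats

for N in (3, 4, 5):
    lam = min_norm_affine_coeffs(transpositions(N))
    print(N, np.round(lam, 4), "(13) predicts:", (4 - N) / (N + 2), 2 / (N + 2))
```

For $N=5$ the first coefficient is negative, exhibiting the violation of (H2) behind Theorem 4.4.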

References

  • [1] E. Achilles. Doubly stochastic matrices with some equal diagonal sums. Linear Algebra Appl., 22:293–296, 1978.
  • [2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924, 2011.
  • [3] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder. All adapted topologies are equal. Probab. Theory Related Fields, 178(3-4):1125–1172, 2020.
  • [4] E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regularized optimal transport. Preprint arXiv:2212.00367v1, 2022.
  • [5] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán. Revista A., 5:147–151, 1946.
  • [6] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In Proceedings of Machine Learning Research, volume 84, pages 880–889, 2018.
  • [7] H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011.
  • [8] A. Brøndsted. An introduction to convex polytopes, volume 90 of Graduate Texts in Mathematics. Springer-Verlag, New York-Berlin, 1983.
  • [9] R. A. Brualdi. Combinatorial matrix classes, volume 108 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 2006.
  • [10] R. A. Brualdi and G. Dahl. Diagonal sums of doubly stochastic matrices. Linear Multilinear Algebra, 70(20):4946–4972, 2022.
  • [11] R. A. Brualdi and P. M. Gibson. Convex polyhedra of doubly stochastic matrices. I. Applications of the permanent function. J. Combinatorial Theory Ser. A, 22(2):194–230, 1977.
  • [12] J. A. Cuesta-Albertos and A. Tuero-Díaz. A characterization for the solution of the Monge-Kantorovich mass transference problem. Statist. Probab. Lett., 16(2):147–152, 1993.
  • [13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. 2013.
  • [14] A. Dessein, N. Papadakis, and J.-L. Rouas. Regularized optimal transport and the rot mover’s distance. J. Mach. Learn. Res., 19(15):1–53, 2018.
  • [15] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Appl. Math. Optim., 83(2):639–667, 2021.
  • [16] S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024.
  • [17] M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs. SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018.
  • [18] M. Finzel and W. Li. Piecewise affine selections for piecewise polyhedral multifunctions and metric projections. J. Convex Anal., 7(1):73–94, 2000.
  • [19] A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadratically regularized optimal transport and its relation with the porous medium equation. Preprint arXiv:2407.21528v1, 2024.
  • [20] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016.
  • [21] A. González-Sanz and M. Nutz. Quantitative convergence of quadratically regularized linear programs. Preprint arXiv:2408.04088v1, 2024.
  • [22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017.
  • [23] W. W. Hager and H. Zhang. Projection onto a polyhedron that exploits sparsity. SIAM J. Optim., 26(3):1773–1798, 2016.
  • [24] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
  • [25] L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycenters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020.
  • [26] D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022.
  • [27] D. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021.
  • [28] O. L. Mangasarian. Normal solutions of linear programs. Math. Programming Stud., 22:206–216, 1984. Mathematical programming at Oberwolfach, II (Oberwolfach, 1983).
  • [29] M. Nutz. Quadratically regularized optimal transport: Existence and multiplicity of potentials. Preprint arXiv:2404.06847v1, 2024.
  • [30] A. Paffenholz. Faces of Birkhoff polytopes. Electron. J. Combin., 22(1):Paper 1.67, 36, 2015.
  • [31] V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annu. Rev. Stat. Appl., 6:405–431, 2019.
  • [32] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [33] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018.
  • [34] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.
  • [35] S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised optimal transport. Preprint arXiv:2307.09816v1, 2023.