
Monotonicity in Quadratically Regularized Linear Programs

Alberto González-Sanz (Department of Statistics, Columbia University, [email protected]), Marcel Nutz (Departments of Mathematics and Statistics, Columbia University, [email protected]; research supported by NSF Grants DMS-1812661, DMS-2106056, DMS-2407074), Andrés Riveros Valdevenito (Department of Statistics, Columbia University, [email protected]).
Abstract

In optimal transport, quadratic regularization is a sparse alternative to entropic regularization: the solution measure tends to have small support. Computational experience suggests that the support decreases monotonically to the unregularized counterpart as the regularization parameter is relaxed. We find it useful to investigate this monotonicity more abstractly for linear programs over polytopes, regularized with the squared norm. Here, monotonicity can be stated as an invariance property of the curve mapping the regularization parameter to the solution: once the curve enters a face of the polytope, does it remain in that face forever? We show that this invariance is equivalent to a geometric property of the polytope, namely that each face contains the minimum norm point of its affine hull. Returning to the optimal transport problem and its associated Birkhoff polytope, we verify this property for low dimensions, but show that it fails for dimension $d\geq 25$. As a consequence, the conjectured monotonicity of the support fails in general, even if experiments suggest that monotonicity holds for many cost matrices.

Keywords Linear Program, Quadratic Regularization, Optimal Transport, Sparsity

AMS 2020 Subject Classification 49N10; 49N05; 90C25

1 Introduction

Let $\mathcal{P}\subset\mathbb{R}^{d}$ be a polytope and $c\in\mathbb{R}^{d}$. The regularized linear program

$x^{\delta}=\operatorname*{argmin}_{x\in\mathcal{P}:\,\|x\|\leq\delta}\langle c,x\rangle$ (1)

has a unique solution as long as $\delta\geq 0$ belongs to a certain interval $[\delta_{\min},\delta_{\max}]$, hence we can consider the curve $\delta\mapsto x^{\delta}\in\mathcal{P}$ which travels across various faces of $\mathcal{P}$ as $\delta$ increases (i.e., as the regularizing constraint relaxes). We are interested in the following invariance property: once $\delta\mapsto x^{\delta}\in\mathcal{P}$ enters a given face, does it ever leave that face? (Cf. Figure 1.) In the optimal transport applications that motivate our study (see below), this property corresponds to the monotonicity of the optimal support wrt. the regularization strength. Our abstract result, Theorem 3.2, geometrically characterizes all polytopes such that the invariance property holds (for any cost $c$). We show that this property holds for the set of probability measures (unit simplex), but fails for the set of transport plans (Birkhoff polytope) when the dimension $d\geq 25$. As a consequence (which may be surprising given numerical experience), the optimal support in quadratically regularized optimal transport problems is not always monotone.

Figure 1: A non-monotone (left) and a monotone (right) polytope with cost $c$ (red). The curve $[\delta_{\min},\delta_{\max}]\ni\delta\mapsto x^{\delta}$ follows the blue arrows and has the invariance property only in the right example.

1.1 Motivation

This study is motivated by optimal transport and related minimization problems over probability measures. In its simplest form, the transport problem between probability measures $\mu$ and $\nu$ is

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,\gamma(dx,dy),$ (OT)

where $\hat{c}(x,y)$ is a given “cost” function and $\Gamma(\mu,\nu)$ denotes the set of couplings; i.e., joint probability measures $\gamma$ with marginals $(\mu,\nu)$. See [34] for a detailed discussion. In many applications such as machine learning or statistics (see [24, 31] for surveys), the marginals encode observed data points $X_{1},\dots,X_{N}$ and $Y_{1},\dots,Y_{N}$ which are represented by their empirical measures $\mu=\frac{1}{N}\sum_{i}\delta_{X_{i}}$ and $\nu=\frac{1}{N}\sum_{i}\delta_{Y_{i}}$. Denoting by $c_{ij}=\hat{c}(X_{i},Y_{j})$ the resulting $N\times N$ cost matrix, (OT) then becomes a linear program

$\inf_{x\in\mathcal{P}}\langle c,x\rangle$ (LP)

where the polytope $\mathcal{P}$ is (up to a constant factor) the Birkhoff polytope of doubly stochastic matrices of size $N\times N$. For different choices of polytope, (LP) includes other problems of recent interest, such as multi-marginal optimal transport and Wasserstein barycenters [2] or adapted Wasserstein distances [3]. The optimal transport problem is computationally costly when $N$ is large. The impactful paper [13] proposed to regularize (OT) by penalizing with Kullback–Leibler divergence (entropy). Then, solutions can be computed using the Sinkhorn–Knopp algorithm, which has led to an explosion of high-dimensional applications (see [32]). More generally, [14] introduced regularized optimal transport with regularization by a divergence. Different divergences give rise to different properties of the solution. Entropic regularization always leads to couplings whose support contains all data pairs $(X_{i},Y_{j})$, even though (OT) typically has a sparse solution. In some applications that is undesirable; for instance, it may correspond to blurrier images in an image processing task [6]. The second-most prominent regularization is $\chi^{2}$-divergence or equivalently the squared norm, as proposed in [6], which gives rise to sparse solutions. (See also [17] for a similar formulation of minimum-cost flow problems, and the predecessors referenced therein. Our model with a general polytope includes such problems, and many others.) In the Eulerian formulation of [14], this is exactly our problem (1) with $\mathcal{P}$ being the (scaled) Birkhoff polytope. Alternately, the problem can be stated in Lagrangian form as in [6], making the squared-norm penalty explicit:

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,\gamma(dx,dy)+\frac{1}{2\eta}\left\|\frac{d\gamma}{d(\mu\otimes\nu)}\right\|_{L^{2}(\mu\otimes\nu)}^{2}$ (QOT)

where $d\gamma/d(\mu\otimes\nu)$ denotes the density of $\gamma$ wrt. the product measure $\mu\otimes\nu$ and the regularization strength is now parameterized by $\eta\in[0,\infty]$. In the general setting, this corresponds to

$\inf_{x\in\mathcal{P}}\langle c,x\rangle+\frac{1}{2\eta}\|x\|^{2}$ (2)

which has a unique solution $x_{\eta}$ for any $\eta\in[0,\infty]$. The curves $(x^{\delta})$ and $(x_{\eta})$ of solutions to (1) and (2), respectively, coincide up to a simple reparametrization.

In optimal transport, the regularized problem is often solved to approximate the linear problem (OT). The latter has a generically unique (see [12]) solution $x$, which is recovered as $\delta\to\delta_{\max}$ (or $\eta\to\infty$): $x=x^{\delta_{\max}}=x_{\infty}$. (For problems with non-unique solution, $x^{\delta_{\max}}$ recovers the minimum-norm solution.) In particular, the support of $x^{\delta}$ converges to the support of $x$, which is generically sparse ($N$ out of $N^{2}$ possible points, again by [12]). In numerous experiments, it has been observed not only that $\operatorname*{spt}x^{\delta}$ is sparse when $\delta$ is large, but also that $\operatorname*{spt}x^{\delta}$ monotonically decreases to $\operatorname*{spt}x$ (e.g., [6]). If this monotonicity holds, then in particular $\operatorname*{spt}x\subset\operatorname*{spt}x^{\delta}$, meaning that $\operatorname*{spt}x^{\delta}$ can be used as a multivariate confidence band for the (unknown) solution $x$.

1.2 Summary of Results

When $\mathcal{P}$ is a set of measures, the support of $x\in\mathcal{P}$ can be defined as usual. When $\mathcal{P}$ is a general polytope, the set of vertices of the minimal face containing $x$ yields a similar notion. With that in mind, we can ask whether the support of $x^{\delta}$ is monotone wrt. the strength $\delta$ of regularization. (The two notions of support yield the same result when $\mathcal{P}$ is the simplex or the Birkhoff polytope; cf. Lemma 4.2.) Geometrically, this corresponds to the aforementioned invariance property: if $x^{\delta}\in F$ for some face $F$, does it follow that $x^{\delta^{\prime}}\in F$ for all $\delta^{\prime}\geq\delta$? The answer may of course depend on the cost $c$; we call $\mathcal{P}$ monotone if the invariance holds for any $c\in\mathbb{R}^{d}$.

We show that monotonicity can be characterized by the geometry of $\mathcal{P}$. Namely, monotonicity of $\mathcal{P}$ is equivalent to two properties: for any proper face $F$, the minimum-norm point of the affine hull of $F$ must lie in $F$, and moreover, the minimum-norm point of $\mathcal{P}$ must lie in the relative interior $\operatorname*{ri}\mathcal{P}$. See Theorem 3.2. Once the right point of view is taken, the proof is quite elementary. An example satisfying both properties, and hence monotonicity, is the unit simplex $\Delta\subset\mathbb{R}^{d}$, for any $d\geq 1$. For that choice of polytope, $x^{\delta}$ is a sparse soft-min of $c=(c_{1},\dots,c_{d})$, converging to $1_{\{\operatorname*{argmin}c\}}$ in the unregularized limit. On the other hand, we show that the Birkhoff polytope violates the first condition whenever the marginals have $N\geq 5$ data points (meaning that $d\geq 25$). As a result, the optimal support in quadratically regularized optimal transport problems is not always monotone, even if it appears so in numerous experiments. In fact, we exhibit a particularly egregious failure of monotonicity where the support of the limiting (unregularized) optimal transport $x$ is not contained in $\operatorname*{spt}x^{\delta}$ for some reasonably large $\delta$. Our counterexample is constructed by hand, based on theoretical considerations. Our numerical experiments using random cost matrices have failed to locate counterexamples, suggesting that there are, in some sense, “few” faces violating our condition on the minimum-norm point. (See also the proof of Lemma 4.10.) This is an interesting problem for further study in combinatorics, where the Birkhoff polytope has remained an active topic even after decades of research (e.g., [30] and the references therein).

The remainder of this note is organized as follows. Section 2 introduces the regularized linear program and its solution curve in detail. Section 3 contains the main abstract result, whereas Section 4 reports the applications to optimal transport and soft-min.

2 Preliminaries

This section collects notation and well-known or straightforward results for ease of reference. Let $\langle\cdot,\cdot\rangle$ be an inner product on $\mathbb{R}^{d}$ and $\|\cdot\|$ the induced norm. Let $\emptyset\neq\mathcal{P}\subseteq\mathbb{R}^{d}$ be a polytope; i.e., the convex hull of finitely many points. The minimal set of such points, called vertices or extreme points, is denoted $\operatorname*{ext}\mathcal{P}$. A face of $\mathcal{P}$ is a subset $F\subset\mathcal{P}$ such that any open line segment $L=(x,y)\subset\mathcal{P}$ with $L\cap F\neq\emptyset$ satisfies $\overline{L}\subset F$, where $\overline{L}=[x,y]$ denotes closure. Alternately, a nonempty face $F$ of $\mathcal{P}$ is the intersection $F=\mathcal{P}\cap H$ of $\mathcal{P}$ with a tangent hyperplane $H$. Here a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle=b\}$ is called tangent if $H\cap\mathcal{P}\neq\emptyset$ and $\mathcal{P}\subset\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle\leq b\}$. We denote by $\operatorname*{ri}K$ and $\operatorname*{rbd}K=K\setminus\operatorname*{ri}K$ the relative interior and relative boundary of a set $K$, and by $\operatorname*{aff}K$ its affine span. See [8] for background on polytopes and related notions.

Next, we introduce the regularized linear program and the interval of parameters $\delta$ where the set of feasible points is nonempty and the constraint is binding.

Lemma 2.1 (Eulerian Formulation).

Denote $\mathcal{P}(\delta):=\{x\in\mathcal{P}:\,\|x\|\leq\delta\}$ for $\delta\geq 0$ and

$\delta_{\min}:=\min\{\delta\geq 0:\,\mathcal{P}(\delta)\neq\emptyset\},\qquad\delta_{\max}:=\min\left\{\delta\geq 0:\,\min_{x\in\mathcal{P}(\delta)}\langle x,c\rangle=\min_{x\in\mathcal{P}}\langle x,c\rangle\right\}.$

Then

$D:=\left\{\delta>\delta_{\min}:\ \operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle\cap\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle=\emptyset\right\}=(\delta_{\min},\delta_{\max}).$

For each $\delta\in\overline{D}=[\delta_{\min},\delta_{\max}]$, the problem

$\inf_{x\in\mathcal{P}(\delta)}\langle x,c\rangle\qquad\text{has a unique minimizer }x^{\delta}.$ (3)

Moreover, $[\delta_{\min},\delta_{\max}]\ni\delta\mapsto x^{\delta}$ is continuous and $\|x^{\delta}\|=\delta$. In particular, $\{x^{\delta_{\min}}\}$ is the singleton $\mathcal{P}(\delta_{\min})$, and $x^{\delta_{\max}}$ is the minimum-norm solution of $\min_{x\in\mathcal{P}}\langle x,c\rangle$.

Proof.

The identity $D=(\delta_{\min},\delta_{\max})$ holds as $\mathcal{P}(\delta)$ is increasing in $\delta$. Fix $\delta\in D$. Let $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$ and $x^{*}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle$. If $\|x^{\delta}\|<\delta$, then $x:=\lambda x^{*}+(1-\lambda)x^{\delta}\in\mathcal{P}(\delta)$ for sufficiently small $\lambda\in(0,1)$. By optimality it follows that $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle$, meaning that $\delta\notin D$. Thus, for $\delta\in D$, we have $\|x^{\delta}\|=\delta$. In particular, the set $\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$ is contained in the sphere $\{x:\,\|x\|=\delta\}$. As the set is also convex, it must be a singleton by the strict convexity of $\|\cdot\|$. Continuity of $\delta\mapsto x^{\delta}$ is straightforward.

It remains to deal with the boundary cases. For $\delta=\delta_{\min}$, we note that $\|x^{\delta}\|<\delta$ is trivially ruled out, hence we can conclude as above. Clearly $\{x^{\delta_{\min}}\}=\mathcal{P}(\delta_{\min})$. Define $x^{*}$ as the minimum-norm solution of $\min_{x\in\mathcal{P}}\langle x,c\rangle$, which is unique by strict convexity. Clearly $\|x^{*}\|=\delta_{\max}$. Thus when $\delta=\delta_{\max}$, we must have $x^{\delta}=x^{*}$ for any $x^{\delta}\in\operatorname*{argmin}_{x\in\mathcal{P}(\delta)}\langle x,c\rangle$. Continuity and $\|x^{\delta}\|=\delta$ are again straightforward. ∎

The next lemma recalls the standard projection theorem (e.g., [7, Theorem 5.2]).

Lemma 2.2 (Projection).

Let $\emptyset\neq K\subseteq\mathbb{R}^{d}$ be closed and convex. Given $x\in\mathbb{R}^{d}$, there exists a unique $x_{K}\in K$, called the projection of $x$ onto $K$ and denoted $x_{K}=\operatorname{proj}_{K}(x)$, such that

$\|x-x_{K}\|=\inf_{x^{\prime}\in K}\|x-x^{\prime}\|.$

Moreover, $x_{K}$ is characterized within $K$ by

$\langle x-x_{K},x^{\prime}-x_{K}\rangle\leq 0\qquad\text{for all }x^{\prime}\in K.$ (4)

If $x_{K}\in\operatorname*{ri}K$, then $x_{K}=\operatorname{proj}_{\operatorname*{aff}K}(x)$ and (4) can be sharpened to

$\langle x-x_{K},x^{\prime}-x_{K}\rangle=0\qquad\text{for all }x^{\prime}\in K.$ (5)

In particular, (5) holds when $K$ is an affine subspace. In that case, $x\mapsto\operatorname{proj}_{K}(x)$ is affine.

Below, we often use $\operatorname{proj}_{K}(0)$ as a convenient notation for $\operatorname*{argmin}_{x\in K}\|x\|$, the minimum-norm point of $K$. In the Lagrangian formulation of our regularized linear program, the solution can be expressed as a projection onto $\mathcal{P}$ as follows.

Lemma 2.3 (Lagrangian Formulation).

Define

$x_{\eta}:=\operatorname{proj}_{\mathcal{P}}(-\eta c),\quad\eta\in[0,\infty),\qquad x_{\infty}:=\lim_{\eta\to\infty}x_{\eta}.$ (6)

Then

$x_{\eta}=\operatorname*{argmin}_{x\in\mathcal{P}}\,\langle x,c\rangle+\frac{1}{2\eta}\|x\|^{2},\quad\eta\in(0,\infty),$ (7)

$x_{0}=\operatorname*{argmin}_{x\in\mathcal{P}}\|x\|,\qquad x_{\infty}=\operatorname*{argmin}_{x^{\prime}\in\operatorname*{argmin}_{x\in\mathcal{P}}\langle x,c\rangle}\|x^{\prime}\|.$ (8)

The limit $x_{\eta}\to x_{\infty}$ is stationary; i.e., there exists $\bar{\eta}\in\mathbb{R}_{+}$ such that $x_{\eta}=x_{\infty}$ for all $\eta\geq\bar{\eta}$.

Proof.

Let $\eta\in(0,\infty)$. Then (7) follows from

$\frac{1}{2\eta}\|-\eta c-x\|^{2}=\frac{\eta}{2}\|c\|^{2}+\langle x,c\rangle+\frac{1}{2\eta}\|x\|^{2}.$

The first claim in (8) is trivial. For the second claim and the stationary convergence $x_{\eta}\to x_{\infty}$ (which will not be used directly), see [28, Theorem 2.1], or [21] for the exact threshold $\bar{\eta}$. ∎

The algorithm of [23] solves the problem of projecting a point onto a polyhedron, hence can be used to find $x_{\eta}=\operatorname{proj}_{\mathcal{P}}(-\eta c)$ numerically. The solutions $x^{\delta}$ of the Eulerian formulation (3) and $x_{\eta}$ of the Lagrangian formulation (6) are related as follows.
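As a minimal illustration (not the active-set algorithm of [23]), the projection can also be computed by Dykstra's alternating projection method when $\mathcal{P}=\{x:Ax\leq b\}$ is given by halfspaces; the following numpy sketch assumes such a representation with $\mathcal{P}\neq\emptyset$ and is a simple, slower alternative.

```python
import numpy as np

def proj_polyhedron(y, A, b, n_iter=5000):
    """Euclidean projection of y onto {x : A x <= b} via Dykstra's algorithm.
    Each row of A defines one halfspace; the projection onto a single
    halfspace is available in closed form, and Dykstra's correction terms
    make the cyclic scheme converge to the true projection."""
    m, _ = A.shape
    x = y.astype(float).copy()
    corr = np.zeros((m, len(y)))  # one correction term per halfspace
    for _ in range(n_iter):
        for i in range(m):
            z = x + corr[i]
            viol = A[i] @ z - b[i]
            x = z - (max(viol, 0.0) / (A[i] @ A[i])) * A[i]
            corr[i] = z - x
    return x

# Illustrative polytope: the box P = [1,2] x [0,1] in R^2.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([-1.0, 2.0, 0.0, 1.0])
c = np.array([-1.0, -0.3])
for eta in (0.5, 2.0, 4.0):
    print(eta, np.round(proj_polyhedron(-eta * c, A, b), 4))
```

As $\eta$ increases, the printed points $x_{\eta}$ travel through the interior toward the face $\{x_{1}=2\}$ and finally the vertex $(2,1)$, tracing the curve of Lemma 2.4.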

Lemma 2.4 (Euler \leftrightarrow Lagrange).

Given $\eta\in[0,\infty]$, there exists a unique $\delta=\delta(\eta)\in\overline{D}$ such that $x_{\eta}=x^{\delta}$, namely $\delta=\|x_{\eta}\|$. The function $\eta\mapsto\delta(\eta)$ is nondecreasing.

Conversely, let $\delta\in\overline{D}$. Then there exists $\eta\in[0,\infty)$ (possibly non-unique) such that $x_{\eta}=x^{\delta}$. Define $\eta(\delta)$ as the minimal $\eta$ with $x_{\eta}=x^{\delta}$. (This is merely for concreteness; any choice of selector will do.) Then $\delta\mapsto\eta(\delta)$ is strictly increasing on $\overline{D}$.

Proof.

Let $\delta\in D$ and recall that $\|x^{\delta}\|=\delta$. Note also that $\eta\mapsto x_{\eta}$ is continuous and that $\eta\mapsto\|x_{\eta}\|$ is nondecreasing, with range $[\|x_{0}\|,\|x_{\infty}\|]=\overline{D}$. Hence there exists $\eta=\eta(\delta)$ such that $\|x_{\eta}\|=\delta$. We see from (7) that $x_{\eta}$ minimizes $\langle x,c\rangle$ among all $x\in\mathcal{P}(\delta)$, which is to say that $x_{\eta}=x^{\delta}$. The converse follows. ∎

3 Abstract Result

Recall that $x^{\delta}$ and $x_{\eta}$ denote the solutions of the Eulerian (3) and Lagrangian (6) formulation, respectively. Lemma 2.4 shows that the curves $(x^{\delta})_{\delta_{\min}\leq\delta\leq\delta_{\max}}$ and $(x_{\eta})_{\eta\geq 0}$ are reparametrizations of one another. In particular, they trace out the same trajectory, and only the trajectory matters for the subsequent definition.

Definition 3.1.

Let $\mathcal{P}\subset\mathbb{R}^{d}$ be a polytope and $c\in\mathbb{R}^{d}$. We say that $\mathcal{P}$ is $c$-monotone if for any face $F$ of $\mathcal{P}$,

$x^{\delta}\in F\quad\implies\quad x^{\delta^{\prime}}\in F\quad\text{for all }\delta^{\prime}\geq\delta\text{ in }[\delta_{\min},\delta_{\max}],$ (9)

or equivalently if

$x_{\eta}\in F\quad\implies\quad x_{\eta^{\prime}}\in F\quad\text{for all }\eta^{\prime}\geq\eta\text{ in }[0,\infty].$ (10)

We say that $\mathcal{P}$ is monotone if it is $c$-monotone for all $c\in\mathbb{R}^{d}$.

This definition means that the “support” of $x^{\delta}$ is monotone decreasing in $\delta$, in the following sense. For any $x\in\mathcal{P}$, there is a unique face $F=F(x)$ such that $x\in\operatorname*{ri}F$; moreover, $x\in\operatorname*{ri}F$ if and only if $x$ is a convex combination of the vertices $\operatorname*{ext}F$ with strictly positive weights [8, Exercise 3.1, Theorem 5.6]. Thus $\operatorname*{ext}F(x)$ is a notion of support for $x$, and then the property (9) indeed means that the support of $x^{\delta}$ is monotone decreasing for inclusion. When $\mathcal{P}$ is the unit simplex, $\operatorname*{ext}F(x)$ boils down to the usual measure-theoretic notion of support, namely $\{i:\,x_{i}>0\}$ for $x=(x_{1},\dots,x_{d})\in\mathcal{P}$. This identification breaks down when $\mathcal{P}$ is the Birkhoff polytope, but monotonicity is nevertheless equivalent for the two notions of support (see Lemma 4.2). (Clearly this equivalence does not hold in general. E.g., when $\mathcal{P}\subset(0,1)^{d}$, all points have the same measure-theoretic support.) Next, we characterize monotonicity in geometric terms.

Theorem 3.2.

A polytope $\mathcal{P}$ is monotone (Definition 3.1) if and only if

(H1) $\operatorname{proj}_{\mathcal{P}}(0)\in\operatorname*{ri}\mathcal{P}$, and

(H2) $\operatorname{proj}_{\operatorname*{aff}F}(0)\in F$ for each face $\emptyset\neq F\subset\mathcal{P}$.

The two conditions are similar, with the requirement for the proper faces $F$ being less stringent than for the improper face $\mathcal{P}$: while $\operatorname{proj}_{F}(0)\in\operatorname*{ri}F$ is a sufficient condition for (H2), the condition (H2) includes the boundary case where the projection lands on $\operatorname*{rbd}F$ without the constraint $F$ being binding. We shall see that (H1) and (H2) play different roles in the proof. Simple examples show that changing $\operatorname*{ri}\mathcal{P}$ to $\mathcal{P}$ in (H1) or $F$ to $\operatorname*{ri}F$ in (H2) invalidates the equivalence in Theorem 3.2.
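For a concrete illustration of the role of (H1), consider the segment $\mathcal{P}=\operatorname*{conv}\{(1,0),(2,1)\}\subset\mathbb{R}^{2}$. Minimizing $\|(1,0)+t(1,1)\|^{2}$ over $t\in\mathbb{R}$ gives $t=-1/2<0$, so $\operatorname{proj}_{\mathcal{P}}(0)=(1,0)$ is a vertex and (H1) fails. Accordingly, for $c=(-1,-1)$ the curve starts at $x^{\delta_{\min}}=(1,0)$ with $\delta_{\min}=1$, so it lies in the face $\{(1,0)\}$ at $\delta_{\min}$ and leaves that face for every $\delta>\delta_{\min}$; the segment is not monotone.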

3.1 Proof of Theorem 3.2

We first prove the “if” implication, using the Lagrangian formulation (Lemma 2.3). This parametrization is convenient because $\eta\mapsto x_{\eta}$ is piecewise affine. While the latter is well known (even for some more general norms; see [18] and the references therein), we detail the statement for completeness.

Lemma 3.3.

The curve $[0,\infty]\ni\eta\mapsto x_{\eta}$ is piecewise affine. The affine pieces correspond to faces of $\mathcal{P}$ as follows. Fix $\eta_{0}\in[0,\infty]$ and let $F$ be the unique face of $\mathcal{P}$ such that $x_{\eta_{0}}\in\operatorname*{ri}F$. Let $I=\{\eta:x_{\eta}\in\operatorname*{ri}F\}$. Then $I$ is an interval containing $\eta_{0}$ and $\overline{I}\ni\eta\mapsto x_{\eta}\in F$ is affine. In particular, the curve $\eta\mapsto x_{\eta}$ does not return to $\operatorname*{ri}F$ after leaving it.

Proof.

Consider $\eta_{1}<\eta_{2}$ in $I$ and let $H=\operatorname*{aff}F$. As $x_{\eta_{i}}\in\operatorname*{ri}F$, we have $x_{\eta_{i}}=\operatorname{proj}_{F}(-\eta_{i}c)=\operatorname{proj}_{H}(-\eta_{i}c)$. Since $H$ is an affine subspace, the curve $\eta\mapsto\operatorname{proj}_{H}(-\eta c)$ is affine. In particular, convexity of $\operatorname*{ri}F$ implies that $\operatorname{proj}_{H}(-\eta c)\in\operatorname*{ri}F$ for all $\eta\in[\eta_{1},\eta_{2}]$. Thus $I$ is an interval and $I\ni\eta\mapsto x_{\eta}\in\operatorname*{ri}F$ is affine, and this extends to the closure by continuity. ∎

Recall that $\operatorname*{rbd}K:=K\setminus\operatorname*{ri}K$ denotes the relative boundary of $K\subset\mathbb{R}^{d}$. Consider the first time the curve $\eta\mapsto x_{\eta}$ touches a given face $F$ of $\mathcal{P}$. That point may lie in $\operatorname*{ri}F$ or in $\operatorname*{rbd}F$. As seen in Lemma 3.5 below, the boundary situation indicates that $\eta\mapsto x_{\eta}$ left another face $F_{*}$ when it entered $F\not\subset F_{*}$, meaning that $\mathcal{P}$ is not monotone. Hence, we analyze that situation in detail in the next lemma.

Lemma 3.4.

Let $F$ be a face of $\mathcal{P}$ and $I=\{\eta:x_{\eta}\in\operatorname*{ri}F\}\neq\emptyset$. Let $\eta_{0}:=\inf I$. If $x_{\eta_{0}}\in\operatorname*{rbd}F$, then exactly one of the following holds true:

(i) $\operatorname{proj}_{\operatorname*{aff}F}(0)\notin F$;

(ii) $\eta_{0}=0$, and then $x_{\eta_{0}}=x_{0}=\operatorname{proj}_{\mathcal{P}}(0)=\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$. In particular, $x_{0}\notin\operatorname*{ri}\mathcal{P}$.

Proof.

Recall from Lemma 3.3 that $I$ is an interval and $\overline{I}\ni\eta\mapsto x_{\eta}\in F$ is affine. Let $H=\operatorname*{aff}F$ and consider $z_{\eta}:=\operatorname{proj}_{H}(-\eta c)$. Recalling Lemma 2.2, we have for $\eta\in I$ that $x_{\eta}=\operatorname{proj}_{\mathcal{P}}(-\eta c)=\operatorname{proj}_{F}(-\eta c)=\operatorname{proj}_{H}(-\eta c)=z_{\eta}$ and in particular $z_{\eta}\in\operatorname*{ri}F$.

As $x_{\eta}=z_{\eta}$ for $\eta\in I$ and both curves are continuous, it follows that $x_{\eta_{0}}=z_{\eta_{0}}$. Therefore, our assumption that $x_{\eta_{0}}\in\operatorname*{rbd}F$ implies that $z_{\eta_{0}}\in\operatorname*{rbd}F$.

Since $H$ is an affine space, $\mathbb{R}_{+}\ni\eta\mapsto z_{\eta}$ is affine. We have $z_{\eta_{0}}\in\operatorname*{rbd}F$ and $z_{\eta}\in\operatorname*{ri}F$ for all $\eta\in(\eta_{0},\eta_{1})$, where $\eta_{1}:=\sup I>\eta_{0}$. (Note that $I$ cannot be a singleton when $z_{\eta_{0}}\in\operatorname*{rbd}F$.) As $F$ is convex, it follows that $z_{\eta}\notin F$ for $0\leq\eta<\eta_{0}$. If $z_{0}\notin F$, we are in the first case. Whereas if $z_{0}\in F$, it follows that $\eta_{0}=0$. Thus $x_{\eta_{0}}=x_{0}=\operatorname{proj}_{\mathcal{P}}(0)=\operatorname{proj}_{F}(0)$. Moreover, $z_{\eta_{0}}=x_{\eta_{0}}\in F$ implies $x_{\eta_{0}}=\operatorname{proj}_{\operatorname*{aff}F}(0)$. ∎

Lemma 3.5.

Let $\eta_{*}\geq 0$ and let $F_{*}$ be the unique face of $\mathcal{P}$ with $x_{\eta_{*}}\in\operatorname*{ri}F_{*}$. Suppose there exists $\eta>\eta_{*}$ such that $x_{\eta}\notin F_{*}$ and let $\eta_{0}=\max\{\eta\geq\eta_{*}:\,x_{\eta}\in F_{*}\}$. For sufficiently small $\eta_{1}>\eta_{0}$, there is a unique face $F$ such that $x_{\eta}\in\operatorname*{ri}F$ for all $\eta\in(\eta_{0},\eta_{1})$. (When traveling along the curve $(x_{\eta})_{\eta\geq\eta_{*}}$, $F$ is the first face outside $F_{*}$ whose interior is visited.) If $x_{0}:=\operatorname{proj}_{\mathcal{P}}(0)\in\operatorname*{ri}\mathcal{P}$, then $F$ satisfies $\operatorname{proj}_{\operatorname*{aff}F}(0)\notin F$.

Proof.

For $x\in\mathcal{P}$, let $F(x)$ denote the unique face such that $x\in\operatorname*{ri}F(x)$. As seen in Lemma 3.3, the map $(\eta_{0},\infty)\ni\eta\mapsto F(x_{\eta})$ has finitely many values and the preimage of each value is an interval. Hence the map must be constant on $(\eta_{0},\eta_{1})$ for sufficiently small $\eta_{1}>\eta_{0}$. Let $F=F(x_{\eta})$, $\eta\in(\eta_{0},\eta_{1})$ be the corresponding face. As $F_{*}$ and $F$ are faces with $F_{*}\not\subset F$, we have $F\cap F_{*}\subset\operatorname*{rbd}F$. Thus $x_{\eta_{0}}\in F\cap F_{*}$ implies $x_{\eta_{0}}\in\operatorname*{rbd}F$, and now Lemma 3.4 applies. ∎

Proof of Theorem 3.2.

Step 1: (H1), (H2) $\Rightarrow$ monotone. If $\mathcal{P}$ is not monotone, then Lemma 3.5 shows that either (H1) or (H2) must be violated.

Step 2: Not (H1) $\Rightarrow$ not monotone. Suppose that $x_{0}:=\operatorname{proj}_{\mathcal{P}}(0)\notin\operatorname*{ri}\mathcal{P}$. As $x_{0}\in\mathcal{P}\setminus(\operatorname*{ri}\mathcal{P})$, there exists a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle=b\}$ with

$x_{0}\in H,\qquad\mathcal{P}\subseteq\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle\leq b\},\qquad\operatorname*{ri}\mathcal{P}\subseteq\{x\in\mathbb{R}^{d}:\ \langle a,x\rangle<b\}.$

Define $c=a$ and $F=H\cap\mathcal{P}$ and $\delta_{0}=\|x_{0}\|$, and denote $\overline{B}_{\delta}=\{x\in\mathbb{R}^{d}:\ \|x\|\leq\delta\}$. Then $\mathcal{P}\cap\overline{B}_{\delta_{0}}=\{x_{0}\}$ and in particular $x^{\delta_{0}}=x_{0}$. Moreover,

$\langle x,c\rangle=b\quad\text{for all }x\in F,\qquad\text{while}\qquad\langle x,c\rangle<b\quad\text{for all }x\in\operatorname*{ri}\mathcal{P}.$ (11)

For any $\delta>\delta_{0}$, we have $\overline{B}_{\delta}\cap\operatorname*{ri}\mathcal{P}\neq\emptyset$, and then (11) implies $x^{\delta}\notin F$. In particular, $\mathcal{P}$ is not monotone for the cost $c$.

Step 3: Not (H2) $\Rightarrow$ not monotone. Given the assumption, there exists a face $F$ of $\mathcal{P}$ such that $\operatorname{proj}_{\operatorname*{aff}F}(0)$ does not belong to $F$. Set $x_{\min}:=\operatorname{proj}_{F}(0)$. As $F$ is a face of $\mathcal{P}$, there is a hyperplane $H=\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle=b\}$ such that

$F=H\cap\mathcal{P}\quad\text{and}\quad\mathcal{P}\subset\{x\in\mathbb{R}^{d}:\ \langle x,a\rangle\geq b\}.$

Define $c=a-x_{\min}$ and $\delta_{1}=\|x_{\min}\|$. Then for every $x\in\mathcal{P}\cap\overline{B}_{\delta_{1}}$,

$\langle x,c\rangle=\langle x,a\rangle-\langle x,x_{\min}\rangle\geq b-\|x\|\|x_{\min}\|\geq b-\|x_{\min}\|^{2}=\langle x_{\min},c\rangle,$

showing that $x^{\delta_{1}}=x_{\min}$. Since $x_{\min}\neq\operatorname{proj}_{\operatorname*{aff}F}(0)$, Lemma 2.2 implies that

$F^{\prime}:=\{x\in\mathbb{R}^{d}:\ \langle x_{\min},x-x_{\min}\rangle=0\}\cap F$

is a face of $F$ (and a fortiori a face of $\mathcal{P}$) such that

$F\setminus F^{\prime}\subset\{x\in\mathbb{R}^{d}:\ \langle x_{\min},x-x_{\min}\rangle>0\}.$

We claim that $x^{\delta}\notin F^{\prime}$ for all $\delta>\delta_{1}$. Indeed, for any $x^{\prime}\in F^{\prime}$ it holds that

$\langle x^{\prime},c\rangle=\langle x^{\prime},a\rangle-\langle x^{\prime},x_{\min}\rangle=b-\langle x^{\prime},x_{\min}\rangle=\langle x_{\min},c\rangle,$

whereas for any $x\in F\setminus F^{\prime}$, it holds that

$\langle x,c\rangle=b-\langle x,x_{\min}\rangle=\langle x_{\min},c\rangle-\langle x-x_{\min},x_{\min}\rangle<\langle x_{\min},c\rangle.$

In summary, $\langle x,c\rangle<\langle x^{\prime},c\rangle$ for all $x\in F\setminus F^{\prime}$ and $x^{\prime}\in F^{\prime}$. In view of $x^{\delta_{1}}=x_{\min}\in F^{\prime}$, it follows that $x^{\delta}\notin F^{\prime}$ for all $\delta>\delta_{1}$: if $x^{\delta}\in F^{\prime}$, then $\langle x^{\delta},c\rangle=\langle x_{\min},c\rangle$ by the above, so that $x_{\min}\in\mathcal{P}(\delta)$ would be a second minimizer, contradicting the uniqueness in Lemma 2.1 (recall $\|x^{\delta}\|=\delta>\delta_{1}=\|x_{\min}\|$). In particular, $\mathcal{P}$ is not monotone for the cost $c$. ∎

4 Applications

4.1 Soft-Min

To begin with a straightforward example, let $\mathcal{P}=\Delta$ be the unit simplex in Euclidean space $\mathbb{R}^{d}$ and $c=(c_{1},\dots,c_{d})\in\mathbb{R}^{d}$. Then

$\min_{x\in\Delta}\langle c,x\rangle=\min_{1\leq i\leq d}c_{i}$

corresponds to finding the minimum value of $c$. More specifically, $x=(x_{1},\dots,x_{d})$ is a minimizer if and only if $x$ is supported in $\operatorname*{argmin}c$; i.e., $\operatorname*{spt}x=\{i:\,x_{i}\neq 0\}\subset\operatorname*{argmin}c$. When this linear program is regularized with the entropy of $x$, we obtain the usual soft-min (counterpart of log-sum-exp) which gives large weights to the small values of $c$ but non-zero weights to all values. The quadratic regularization,

$x^{\delta}=\operatorname*{argmin}_{x\in\Delta:\,\|x\|\leq\delta}\langle c,x\rangle\qquad\text{or}\qquad x_{\eta}=\operatorname*{argmin}_{x\in\Delta}\langle c,x\rangle+\frac{1}{2\eta}\|x\|^{2},$ (12)

yields a sparse soft-min: as $\delta\to\delta_{\max}$ (or as $\eta\to\infty$), the support of the solution tends to $\operatorname*{argmin}c$. In this example, $\delta_{\min}=d^{-1/2}$ (the norm of the uniform vector) and $\delta_{\max}=(\#\operatorname*{argmin}c)^{-1/2}$.
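As an illustration, $x_{\eta}=\operatorname{proj}_{\Delta}(-\eta c)$ can be computed exactly with the well-known sort-based routine for the Euclidean projection onto the simplex; the following minimal numpy sketch (the projection routine is an assumption of this sketch, not part of the paper) shows the support shrinking toward $\operatorname*{argmin}c$ as $\eta$ grows.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the unit simplex (sort-based routine)."""
    u = np.sort(y)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)        # threshold
    return np.maximum(y - tau, 0.0)

def sparse_softmin(c, eta):
    """x_eta = proj_Delta(-eta c), cf. Lemma 2.3 with P = Delta."""
    return proj_simplex(-eta * np.asarray(c, dtype=float))

c = [0.0, 0.1, 0.2, 1.0]
for eta in (0.1, 1.0, 10.0):
    x = sparse_softmin(c, eta)
    print(f"eta={eta:5.1f}  x={np.round(x, 3)}  support={np.flatnonzero(x > 1e-12)}")
```

For small $\eta$ all coordinates are positive, and as $\eta$ increases the weights of the larger costs drop to zero one after another, in line with Corollary 4.1 below.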

Corollary 4.1.

The unit simplex $\Delta\subset\mathbb{R}^{d}$ is monotone for all $d\geq 1$; i.e., the support of the soft-min $x^{\delta}$ (or $x_{\eta}$) defined in (12) decreases monotonically to $\operatorname*{argmin}c$ as $\delta\in[\delta_{\min},\delta_{\max}]$ (or $\eta\geq 0$) increases.

Proof.

We have $\operatorname*{ext}\Delta=\{e_{1},\dots,e_{d}\}$, the canonical basis of $\mathbb{R}^{d}$. Let $F\subset\Delta$ be a face; then there exist indices $\{i_{1},\dots,i_{k}\}\subseteq\{1,\dots,d\}$ such that $F=\operatorname*{conv}(e_{i_{1}},\dots,e_{i_{k}})$. Note that $\operatorname{proj}_{F}(0)=\sum_{j=1}^{k}\lambda_{j}e_{i_{j}}$, where $\lambda=(\lambda_{1},\dots,\lambda_{k})$ is the solution of

$\min_{\lambda\in\mathbb{R}^{k}}\Big\|\sum_{j=1}^{k}\lambda_{j}e_{i_{j}}\Big\|^{2}\quad\text{ s.t. }\quad\sum_{j=1}^{k}\lambda_{j}=1,$

namely $\lambda=(k^{-1},\dots,k^{-1})$. In particular, $\operatorname{proj}_{F}(0)=k^{-1}\sum_{j=1}^{k}e_{i_{j}}\in\operatorname*{ri}F$. Now Theorem 3.2 applies. ∎

4.2 Optimal Transport

We recall the quadratically regularized optimal transport problem

$\inf_{\gamma\in\Gamma(\mu,\nu)}\int\hat{c}(x,y)\,d\gamma(x,y)+\frac{1}{2\eta}\left\|\frac{d\gamma}{d(\mu\otimes\nu)}\right\|_{L^{2}(\mu\otimes\nu)}^{2}$ (QOT)

where $\Gamma(\mu,\nu)$ denotes the set of couplings of the probability measures $(\mu,\nu)$. We will be concerned with the discrete case as introduced in [6, 14, 17]. See also [26, 27, 29] for more general theory, [4, 16, 19] for asymptotic aspects, [15, 20, 22, 25, 33] for computational approaches and [25, 35] for some recent applications.

Fix $N\in\mathbb{N}$ and two sets of distinct points, $\{X_{1},\dots,X_{N}\}\subset\mathbb{R}^{D}$ and $\{Y_{1},\dots,Y_{N}\}\subset\mathbb{R}^{D}$. Let $\mu=\frac{1}{N}\sum_{i=1}^{N}\delta_{X_{i}}$ and $\nu=\frac{1}{N}\sum_{i=1}^{N}\delta_{Y_{i}}$ denote the associated empirical measures, and let $\hat{c}:\mathbb{R}^{D}\times\mathbb{R}^{D}\to\mathbb{R}$ be a (cost) function. We note that $\mu\otimes\nu=\frac{1}{N^{2}}\sum_{i,j=1}^{N}\delta_{(X_{i},Y_{j})}$ while $d\gamma/d(\mu\otimes\nu)$ is simply the ratio of the probability mass functions. Any coupling $\gamma\in\Gamma(\mu,\nu)$ can be identified with its matrix of probability weights

$(\gamma_{i,j})_{i,j=1}^{N}\in\Gamma_{N}=\left\{\gamma\in\mathbb{R}^{N\times N}:\,\sum_{i=1}^{N}\gamma_{i,j}=\frac{1}{N},\;\sum_{j=1}^{N}\gamma_{i,j}=\frac{1}{N},\;\gamma_{i,j}\geq 0\right\}$

via $\gamma=\sum_{i,j=1}^{N}\gamma_{i,j}\delta_{(X_{i},Y_{j})}$. Writing similarly $c_{ij}=\hat{c}(X_{i},Y_{j})$, (QOT) can be identified with the problem

$\inf_{\gamma\in\Gamma_{N}}\langle c,\gamma\rangle+\frac{N^{2}}{2\eta}\|\gamma\|^{2}.$

Clearly $\Gamma_{N}=\frac{1}{N}\Pi_{N}$ where

$\Pi_{N}=\left\{\pi\in\mathbb{R}^{N\times N}:\,\sum_{i=1}^{N}\pi_{i,j}=1,\;\sum_{j=1}^{N}\pi_{i,j}=1,\;\pi_{i,j}\geq 0\right\}$

denotes the Birkhoff polytope of doubly stochastic matrices; see, e.g., [9]. The monotonicity (in the sense of Definition 3.1) of $\Pi_{N}$ is equivalent to that of $\Gamma_{N}$.

The following result connects the measure-theoretic support $\operatorname*{spt}\pi=\{(i,j):\,\pi_{ij}>0\}$ with the geometric notion discussed below Definition 3.1.

Lemma 4.2.

Let $\pi,\pi^{\prime}\in\Pi_{N}$ and denote by $F(\pi)$ the unique face $F\subset\Pi_{N}$ such that $\pi\in\operatorname*{ri}F$. The following are equivalent:

(i) $\operatorname*{spt}\pi^{\prime}\subset\operatorname*{spt}\pi$,

(ii) $F(\pi^{\prime})\subset F(\pi)$,

(iii) if $\pi\in F$ for some face $F\subset\Pi_{N}$, then $\pi^{\prime}\in F$.

Proof.

The equivalence of (ii) and (iii) follows from the fact that $F(\pi)$ is the smallest face containing $\pi$ [8, Theorem 5.6]. The equivalence of (i) and (iii) is deferred to Lemma 4.8 below. (The implication (ii) $\Rightarrow$ (i) is straightforward, and holds for any polytope $\mathcal{P}$ of measures.) ∎

As a consequence of Lemma 4.2, we have the following equivalence.

Corollary 4.3.

Let $\gamma_{\eta,\hat{c}}$ denote the solution of (QOT) for $\hat{c}:\mathbb{R}^{D}\times\mathbb{R}^{D}\to\mathbb{R}$. The Birkhoff polytope is monotone if and only if the optimal support $\operatorname*{spt}\gamma_{\eta,\hat{c}}$ is monotone decreasing in $\eta$ for all $\hat{c}$.

Theorem 4.4.

The Birkhoff polytope $\Pi_{N}$ is monotone for $1\leq N\leq 4$ but not monotone for $N\geq 5$. In particular, when $N\geq 5$, the optimal support $\eta\mapsto\operatorname*{spt}(\gamma_{\eta,\hat{c}})$ is not monotone for some costs $\hat{c}$.

Example 4.5.

We exhibit an example for $N=5$ where the support is not monotone. (This is not easily achieved by brute-force numerical experiment.) Our starting point is the matrix $A$ in (14) below, which is used to describe a face where (H2) fails. Step 3 in the proof of Theorem 3.2 then suggests perturbing $-A$ in a suitable direction to find a cost $c$ exhibiting non-monotonicity. With some geometric considerations, this leads us to propose the cost matrix

$c=\begin{bmatrix}-1.1&-1&-1&-1&-1\\ -1&-1.1&0&0&0\\ -1&0&-1.1&0&0\\ -1&0&0&-1.1&0\\ -1&0&0&0&-1.1\end{bmatrix}.$

For $\eta=2.5$, or equivalently regularization strength $1/(2\eta)=0.2$, the corresponding problem (QOT) has an exact solution coupling $\gamma_{\eta}$ with probability weights given by

$(\gamma_{\eta=2.5})=\begin{bmatrix}0&0.05&0.05&0.05&0.05\\ 0.05&0.15&0&0&0\\ 0.05&0&0.15&0&0\\ 0.05&0&0&0.15&0\\ 0.05&0&0&0&0.15\end{bmatrix}.$

(This can be verified by noticing that the stated coupling is induced by the dual potentials $f=g=(-0.575,-0.175,-0.175,-0.175,-0.175)$; see for instance [29, Theorem 2.2 and Remark 2.3].) We observe in particular that the location $(X_{1},Y_{1})$ is not in the support. On the other hand, because the diagonal of $c$ features the smallest costs, the solution $\gamma_{\infty}$ of the unregularized transport problem is to put all mass on the diagonal; i.e., to transport all mass from $X_{i}$ to $Y_{i}$ for each $i$. Because of the stationary convergence mentioned in Lemma 2.3, that is also the solution of the regularized problem for large enough values of $\eta$ (e.g., $\eta=100$ will do):

$(\gamma_{\eta=100})=(\gamma_{\infty})=\begin{bmatrix}0.2&0&0&0&0\\ 0&0.2&0&0&0\\ 0&0&0.2&0&0\\ 0&0&0&0.2&0\\ 0&0&0&0&0.2\end{bmatrix}.$

In particular, $(X_{1},Y_{1})$ is part of the support for large $\eta$, completing the example. Figure 2 shows in more detail the weight at $(X_{1},Y_{1})$ as a function of (the inverse of) $\eta$.

Figure 2: Probability mass $\gamma_{\eta}(X_{1},Y_{1})$ plotted against $1/(2\eta)$, showing that $(X_{1},Y_{1})$ is in the support for small and large values of $\eta$ but outside for an intermediate interval.
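The non-monotonicity can also be checked numerically via Lemma 2.3, by projecting $-\eta^{\prime}c$ onto $\Gamma_{5}$ with Dykstra's algorithm. The parameter $\eta^{\prime}$ below refers to the penalty $\frac{1}{2\eta^{\prime}}\|\gamma\|^{2}$ on the weight matrix itself, which differs from the $\eta$ in (QOT) by dimensional normalization; this sketch and its scaling are an illustration added here, not the computation behind the figure.

```python
import numpy as np

def proj_gamma(y, N=5, n_iter=20000):
    """Dykstra projection onto Gamma_N = {g >= 0, all row/col sums = 1/N}.
    The two marginal constraints are affine, so only the nonnegativity
    constraint carries a Dykstra correction term p."""
    x = y.copy()
    p = np.zeros_like(y)
    for _ in range(n_iter):
        x = x + (1.0 / N - x.sum(axis=1, keepdims=True)) / N  # fix row sums
        x = x + (1.0 / N - x.sum(axis=0, keepdims=True)) / N  # fix col sums
        z = x + p
        x = np.maximum(z, 0.0)                                # clip negatives
        p = z - x
    return x

c = np.array([[-1.1, -1.0, -1.0, -1.0, -1.0],
              [-1.0, -1.1,  0.0,  0.0,  0.0],
              [-1.0,  0.0, -1.1,  0.0,  0.0],
              [-1.0,  0.0,  0.0, -1.1,  0.0],
              [-1.0,  0.0,  0.0,  0.0, -1.1]])

for eta in (0.05, 0.2, 2.5):  # strong, intermediate, weak regularization
    g = proj_gamma(-eta * c)
    print(f"eta'={eta}: gamma_11 = {g[0, 0]:.4f}")
# Expected output (approximately): 0.0200, 0.0000, 0.2000 -- the weight at
# (X_1, Y_1) is positive, then zero, then positive again.
```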
Remark 4.6.

The early work of [17] considered a minimum-cost flow problem with quadratic regularization that predates the optimal transport literature for this regularization. The authors point out that the solution can have non-monotone support already in the minimal setting of $2\times 2$ points [17, Figure 1]. Similarly, it is immediate to obtain non-monotonicity if instead of $\mu\otimes\nu$ we use a different measure to define the $L^{2}$-norm in (QOT), as in [29]. In those examples, the mechanism causing non-monotonicity is straightforward and quite different from the one in the present work, where non-monotonicity arises only in higher dimensions.

4.3 Proof of Lemma 4.2 and Theorem 4.4

Birkhoff’s celebrated theorem [5] states that the vertices $\operatorname*{ext}\Pi_{N}$ are the permutation matrices of $\{1,\dots,N\}$; that is, the elements of $\Pi_{N}$ with binary entries. Following [11], the faces of $\Pi_{N}$ can be described using the so-called permanent function. If $A$ is a binary $N\times N$ matrix, its permanent ${\rm per}(A)\in\mathbb{N}$ is defined as the number of permutation matrices $P$ with $P\leq A$ (meaning that $P_{i,j}\leq A_{i,j}$ for all $i,j$). Denoting by $\operatorname*{conv}(\cdot)$ the convex hull, the following characterization is contained in [11, Theorem 2.1].

Lemma 4.7.

Let $t\in\mathbb{N}$ and $P^{(1)},\dots,P^{(t)}\in\operatorname*{ext}(\Pi_{N})$. Let $A=(A_{i,j})_{i,j=1}^{N}$ be the matrix such that $A_{i,j}=1$ if there exists $s\in\{1,\dots,t\}$ with $P^{(s)}_{i,j}=1$ and $A_{i,j}=0$ otherwise. Then $\operatorname*{conv}(\{P^{(1)},\dots,P^{(t)}\})$ is a face of $\Pi_{N}$ if and only if ${\rm per}(A)=t$.
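For small $N$, the permanent in this criterion can be evaluated by brute force; the following sketch (an illustration added here, not the paper's code) counts the permutation matrices dominated by a binary matrix and checks the matrix $A$ of (14) below:

```python
import numpy as np
from itertools import permutations

def permanent_binary(A):
    """per(A) for a binary matrix A: the number of permutations sigma
    with A[i, sigma(i)] = 1 for all i, i.e. permutation matrices P <= A."""
    n = A.shape[0]
    return sum(all(A[i, s[i]] for i in range(n)) for s in permutations(range(n)))

# The matrix A of (14) for N = 5: ones on the diagonal, first row, first column.
N = 5
A = np.eye(N, dtype=int)
A[0, :] = 1
A[:, 0] = 1
print(permanent_binary(A))  # prints 5: the identity plus the transpositions (1 k)
```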

We use Lemma 4.7 to relate faces with the distribution of zeros.

Lemma 4.8.

Let $\pi,\pi^{\prime}\in\Pi_{N}$. The following are equivalent:

(i) there exists $(i,j)$ such that $\pi_{i,j}=0$ and $\pi_{i,j}^{\prime}>0$;

(ii) there exists a face $F$ of $\Pi_{N}$ such that $\pi\in F$ and $\pi^{\prime}\notin F$.

Proof.

(i) $\Rightarrow$ (ii): Let $(i,j)$ be such that $\pi_{i,j}=0$ and $\pi_{i,j}^{\prime}>0$. Assume w.l.o.g. that $(i,j)=(1,1)$. Then $\pi$ belongs to the set

$F=\{\pi\in\Pi_{N}:\langle\pi,A\rangle=N\},$

where $A$ is the matrix

$A=\begin{pmatrix}0&1&1&\cdots&1\\ 1&1&1&\cdots&1\\ 1&1&1&\cdots&1\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&1&1&\cdots&1\end{pmatrix}.$

As $\langle\pi,A\rangle\leq N$ for all $\pi\in\Pi_{N}$, we see that $F$ is a face. Clearly $\pi^{\prime}$ does not belong to $F$, proving the claim.

(ii) $\Rightarrow$ (i): Let $P^{(1)},\dots,P^{(t)}\in\operatorname*{ext}\Pi_{N}$ be such that $F=\operatorname*{conv}(\{P^{(1)},\dots,P^{(t)}\})$ and let $A=(A_{i,j})_{i,j=1}^{N}$ be the matrix such that $A_{i,j}=1$ if there exists $s\in\{1,\dots,t\}$ such that $P^{(s)}_{i,j}=1$ and $A_{i,j}=0$ otherwise. We have ${\rm per}(A)=t$ by Lemma 4.7. As an element of $\Pi_{N}$, $\pi^{\prime}$ can be written as a convex combination of permutation matrices, i.e.,

$\pi^{\prime}=\sum_{P\in\operatorname*{ext}(\Pi_{N})}\lambda_{P}P,\quad\text{with}\quad\lambda_{P}\in[0,1]\ \text{and}\ \sum_{P\in\operatorname*{ext}(\Pi_{N})}\lambda_{P}=1.$

As $\pi^{\prime}\notin F$, there exists $P\in\operatorname*{ext}(\Pi_{N})\setminus\{P^{(1)},\dots,P^{(t)}\}$ with $\lambda_{P}>0$. Using the fact that ${\rm per}(A)=t$, we derive the existence of $(i,j)$ such that $P_{i,j}=1$ but $A_{i,j}=0$. In particular, $\pi_{i,j}^{\prime}\geq\lambda_{P}>0$ but $\pi_{i,j}=0$, proving the claim. ∎

Lemma 4.9.

Let $N\geq 1$. Define the permutation matrices $P^{k}\in\mathbb{R}^{N\times N}$, $1\leq k\leq N$, by

$P^{k}_{i,j}=\begin{cases}1&\text{ if }i=1\text{ and }j=k,\\ 1&\text{ if }i=k\text{ and }j=1,\\ 1&\text{ if }1\neq i=j\neq k,\\ 0&\text{ otherwise.}\end{cases}$

That is, $P^{k}$ permutes the first with the $k$-th element. Then $F:=\operatorname*{conv}(P^{1},\dots,P^{N})$ is a face of $\Pi_{N}$ and

$\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k=1}^{N}\lambda_{k}P^{k}\quad\text{for}\quad\lambda_{1}=\frac{4-N}{N+2},\quad\lambda_{2}=\cdots=\lambda_{N}=\frac{2}{N+2}.$ (13)

As a consequence,

$\operatorname{proj}_{\operatorname*{aff}F}(0)\in\begin{cases}\operatorname*{ri}F,&\text{ if }N=1,2,3,\\ \operatorname*{rbd}F,&\text{ if }N=4,\\ \mathbb{R}^{N\times N}\setminus F,&\text{ if }N\geq 5,\end{cases}$

and in particular (H2) is violated for $N\geq 5$.

Proof.

Define $A\in\mathbb{R}^{N\times N}$ by

$A_{i,j}=\begin{cases}1&\text{ if }i=j,\\ 1&\text{ if }i=1\text{ or }j=1,\\ 0&\text{ otherwise,}\end{cases}$ (14)

so that $A$ is the entry-wise maximum $A=\max(P^{1},\dots,P^{N})$. One readily verifies that ${\rm per}(A)=N$. Hence by Lemma 4.7, $F:=\operatorname*{conv}(P^{1},\dots,P^{N})$ is a face of $\Pi_{N}$. To determine $\operatorname{proj}_{\operatorname*{aff}F}(0)$, we consider the minimization problem

$\min_{\lambda=(\lambda_{1},\dots,\lambda_{N})\in\mathbb{R}^{N}}\Big\|\sum_{i=1}^{N}\lambda_{i}P^{i}\Big\|^{2}\quad\text{ s.t. }\quad\sum_{i=1}^{N}\lambda_{i}=1.$

The Lagrangian for this problem is

$L(\lambda,\rho)=\lambda_{1}^{2}+2\sum_{j=2}^{N}\lambda_{j}^{2}+\sum_{j=2}^{N}\Big(\sum_{k\neq j}\lambda_{k}\Big)^{2}+\rho\Big(1-\sum_{j=1}^{N}\lambda_{j}\Big)$

and the resulting optimality equations are

$\rho=2N\lambda_{1}+2(N-2)\sum_{j=2}^{N}\lambda_{j},\qquad\sum_{j=1}^{N}\lambda_{j}=1,$

$\rho=2N\lambda_{i}+2(N-2)\lambda_{1}+2(N-3)\sum_{j=2,\,j\neq i}^{N}\lambda_{j}\quad\text{ for }2\leq i\leq N.$

By symmetry, the unique optimal $\lambda$ satisfies $\lambda_{i}=\lambda_{j}=:\lambda_{0}$ for all $i,j\geq 2$, so that

$2N\lambda_{0}+2(N-2)\lambda_{1}+2(N-2)(N-3)\lambda_{0}=2N\lambda_{1}+2(N-1)(N-2)\lambda_{0},$

$\lambda_{1}+(N-1)\lambda_{0}=1,$

and finally

$\lambda_{1}=\lambda_{0}\frac{4-N}{2},\qquad\lambda_{1}+(N-1)\lambda_{0}=1.$

Solving for $\lambda_{1}$ and $\lambda_{0}$ yields (13). Note that $\lambda_{0}>0$ for any $N$, whereas $\lambda_{1}>0$ for $N\in\{1,2,3\}$, $\lambda_{1}=0$ for $N=4$ and $\lambda_{1}<0$ for $N\geq 5$, implying the last claim. ∎

The matrix $A$ of (14) is inspired by the $4\times 4$ matrix in [10, Example 1.5] where the authors are interested in a different problem (and, in turn, credit [1]). (In [10], the conclusion is that “not every zero pattern of a fully indecomposable $(0,1)$-matrix is realizable as the zero pattern of a doubly stochastic matrix whose diagonal sums avoiding the 0’s are constant.”) For the present question of monotonicity, this matrix yields a counterexample only for $N\geq 5$. As an aside, the subsequent proof shows that for $N=4$, the face $F$ considered in Lemma 4.9 is (up to symmetries) the only face where $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$, whereas all other faces $F^{\prime}$ satisfy $\operatorname{proj}_{\operatorname*{aff}F^{\prime}}(0)\in\operatorname*{ri}F^{\prime}$.

Lemma 4.10.

Let $1\leq N\leq 4$. Then $\Pi_{N}$ is monotone.

Proof.

We verify (H1) and (H2). Note that $\operatorname{proj}_{\Pi_{N}}(0)$ is the matrix with all entries equal to $1/N$ (corresponding to the product measure). This matrix is clearly in the relative interior of $\Pi_{N}$, showing (H1). The property (H2) is trivial for $N=1,2$. For $N=3$ and $N=4$, we give a computer-assisted proof in the interest of brevity; the code is available at https://github.com/marcelnutz/birkhoff-polytope-faces. (An analytic proof is also available, but requires us to go through 52 different cases for $N=4$. The cases $N\leq 3$ can also be obtained as a corollary of the case $N=4$: one can check directly that if the Birkhoff polytope $\Pi_{N}$ is (not) monotone for some $N\in\mathbb{N}$, then $\Pi_{N^{\prime}}$ is also (not) monotone for all $N^{\prime}\leq(\geq)N$.) Specifically, we generate all $N\times N$ permutation matrices and determine all families $\{P_{1},\dots,P_{m}\}$ of permutation matrices that form the vertices of a nonempty face $F$ by using the permanent function as in Lemma 4.7. There are 49 nonempty faces for $N=3$ and 7443 for $N=4$. For each face $F$, we can numerically compute $\operatorname{proj}_{\operatorname*{aff}F}(0)$, or more specifically scalar coefficients $(\lambda_{k})_{1\leq k\leq m}$ such that $\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k}\lambda_{k}P_{k}$ and $\sum_{k}\lambda_{k}=1$. The coefficients $\lambda_{k}$ are non-unique for some faces; in that case, we choose positive weights if possible. Note that as $P_{k}$ are binary matrices and $N\leq 4$, the computation can be done with accuracy close to machine precision. It turns out that for $N=3$, all coefficients satisfy $\lambda_{k}>0.01$ (much larger than machine precision), establishing that $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{ri}F$ and in particular (H2). For $N=4$, most faces have coefficients $\lambda_{k}>0.01$, whereas for 96 faces, one coefficient is numerically close to zero, hence requiring an analytic argument. We verify that all those 96 faces are equivalent up to permutations of rows and columns, corresponding to a relabeling of the points $X_{i}$ and $Y_{j}$. Specifically, they are all equivalent to the particular face analyzed in Lemma 4.9, where we have seen that $\operatorname{proj}_{\operatorname*{aff}F}(0)\in F$ for $N\leq 4$ (and in fact $\operatorname{proj}_{\operatorname*{aff}F}(0)\in\operatorname*{rbd}F$ for $N=4$). We conclude that (H2) holds for all faces $F$ when $N=4$, completing the proof. ∎
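The key computation for each face reduces to a small linear system: writing $G_{k,l}=\langle P_{k},P_{l}\rangle$ for the Gram matrix of the vertices, the coefficients of $\operatorname{proj}_{\operatorname*{aff}F}(0)=\sum_{k}\lambda_{k}P_{k}$ satisfy $G\lambda=\tfrac{\rho}{2}\mathbf{1}$ and $\mathbf{1}^{\top}\lambda=1$. The following sketch (an independent illustration, not the repository code) carries this out for the face of Lemma 4.9 and reproduces (13):

```python
import numpy as np

def min_norm_affine_coeffs(vertices):
    """Coefficients lambda with proj_{aff F}(0) = sum_k lambda_k P_k.
    From the KKT system G lambda = (rho/2) 1 and 1^T lambda = 1 we get
    lambda proportional to G^{-1} 1, then normalize; assumes the vertex
    matrices are linearly independent (true for the face of Lemma 4.9)."""
    G = np.array([[float(np.sum(P * Q)) for Q in vertices] for P in vertices])
    lam = np.linalg.solve(G, np.ones(len(vertices)))
    return lam / lam.sum()

def transpositions(N):
    """The vertices P^1, ..., P^N of the face in Lemma 4.9."""
    mats = []
    for k in range(N):
        P = np.eye(N)
        P[[0, k]] = P[[k, 0]]  # swap rows 0 and k: the transposition (1 k)
        mats.append(P)
    return mats

for N in (3, 4, 5):
    lam = min_norm_affine_coeffs(transpositions(N))
    print(N, np.round(lam, 4), "(13) predicts:", (4 - N) / (N + 2), 2 / (N + 2))
```

For $N=5$ the first coefficient is negative, exhibiting the violation of (H2) behind Theorem 4.4.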

References

  • [1] E. Achilles. Doubly stochastic matrices with some equal diagonal sums. Linear Algebra Appl., 22:293–296, 1978.
  • [2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal., 43(2):904–924, 2011.
  • [3] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder. All adapted topologies are equal. Probab. Theory Related Fields, 178(3-4):1125–1172, 2020.
  • [4] E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regularized optimal transport. Preprint arXiv:2212.00367v1, 2022.
  • [5] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán. Revista A., 5:147–151, 1946.
  • [6] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In Proceedings of Machine Learning Research, volume 84, pages 880–889, 2018.
  • [7] H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011.
  • [8] A. Brøndsted. An introduction to convex polytopes, volume 90 of Graduate Texts in Mathematics. Springer-Verlag, New York-Berlin, 1983.
  • [9] R. A. Brualdi. Combinatorial matrix classes, volume 108 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 2006.
  • [10] R. A. Brualdi and G. Dahl. Diagonal sums of doubly stochastic matrices. Linear Multilinear Algebra, 70(20):4946–4972, 2022.
  • [11] R. A. Brualdi and P. M. Gibson. Convex polyhedra of doubly stochastic matrices. I. Applications of the permanent function. J. Combinatorial Theory Ser. A, 22(2):194–230, 1977.
  • [12] J. A. Cuesta-Albertos and A. Tuero-Díaz. A characterization for the solution of the Monge-Kantorovich mass transference problem. Statist. Probab. Lett., 16(2):147–152, 1993.
  • [13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. 2013.
  • [14] A. Dessein, N. Papadakis, and J.-L. Rouas. Regularized optimal transport and the rot mover’s distance. J. Mach. Learn. Res., 19(15):1–53, 2018.
  • [15] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Appl. Math. Optim., 83(2):639–667, 2021.
  • [16] S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024.
  • [17] M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs. SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018.
  • [18] M. Finzel and W. Li. Piecewise affine selections for piecewise polyhedral multifunctions and metric projections. J. Convex Anal., 7(1):73–94, 2000.
  • [19] A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadratically regularized optimal transport and its relation with the porous medium equation. Preprint arXiv:2407.21528v1, 2024.
  • [20] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016.
  • [21] A. González-Sanz and M. Nutz. Quantitative convergence of quadratically regularized linear programs. Preprint arXiv:2408.04088v1, 2024.
  • [22] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017.
  • [23] W. W. Hager and H. Zhang. Projection onto a polyhedron that exploits sparsity. SIAM J. Optim., 26(3):1773–1798, 2016.
  • [24] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
  • [25] L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycenters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020.
  • [26] D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022.
  • [27] D. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021.
  • [28] O. L. Mangasarian. Normal solutions of linear programs. Math. Programming Stud., 22:206–216, 1984. Mathematical programming at Oberwolfach, II (Oberwolfach, 1983).
  • [29] M. Nutz. Quadratically regularized optimal transport: Existence and multiplicity of potentials. Preprint arXiv:2404.06847v1, 2024.
  • [30] A. Paffenholz. Faces of Birkhoff polytopes. Electron. J. Combin., 22(1):Paper 1.67, 36, 2015.
  • [31] V. M. Panaretos and Y. Zemel. Statistical aspects of Wasserstein distances. Annu. Rev. Stat. Appl., 6:405–431, 2019.
  • [32] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [33] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. In International Conference on Learning Representations, 2018.
  • [34] C. Villani. Optimal transport, old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.
  • [35] S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised optimal transport. Preprint arXiv:2307.09816v1, 2023.