
Linear optimal partial transport embedding

Yikun Bai [email protected] Ivan Medri [email protected] Rocio Diaz Martin [email protected] Rana Muhammad Shahroz Khan [email protected] Soheil Kolouri [email protected]
Abstract

Optimal transport (OT) has gained popularity due to its various applications in fields such as machine learning, statistics, and signal processing. However, the balanced mass requirement limits its performance in practical problems. To address this limitation, variants of the OT problem have been proposed, including unbalanced OT, optimal partial transport (OPT), and the Hellinger–Kantorovich (HK) distance. In this paper, we propose the Linear Optimal Partial Transport (LOPT) embedding, which extends the (local) linearization technique from OT and HK to the OPT problem. The proposed embedding allows for faster computation of the OPT distance between pairs of positive measures. Besides our theoretical contributions, we demonstrate the LOPT embedding technique in point-cloud interpolation and PCA analysis. Our code is available at https://github.com/Baio0/LinearOPT.

1 Introduction

The Optimal Transport (OT) problem has found numerous applications in machine learning (ML), computer vision, and graphics. The probability metrics and dissimilarity measures emerging from OT theory, e.g., Wasserstein distances and their variations, are used in diverse applications, including training generative models [3, 24, 33], domain adaptation [16, 15], Bayesian inference [28], regression [26], clustering [42], learning from graphs [29] and point sets [35, 36], to name a few. These metrics define a powerful geometry for comparing probability measures with numerous desirable properties, for instance, parameterized geodesics [2], barycenters [18], and a weak Riemannian structure [40].

In large-scale machine learning applications, optimal transport approaches face two main challenges. First, the OT problem is computationally expensive. This has motivated many approximations that lead to significant speedups [17, 13, 39]. Second, while OT is designed for comparing probability measures, many ML problems require comparing non-negative measures with varying total amounts of mass. This has led to the recent advances in unbalanced optimal transport [14, 12, 32] and optimal partial transport [8, 20, 21]. Such unbalanced/partial optimal transport formulations have been recently used to improve minibatch optimal transport [37] and for point-cloud registration [4].

Comparing $K$ (probability) measures requires the pairwise calculation of transport-based distances, which, despite significant recent computational speed-ups, remains relatively expensive. To address this problem, [41] proposed the Linear Optimal Transport (LOT) framework, which linearizes the 2-Wasserstein distance utilizing its weak Riemannian structure. In short, the probability measures are embedded into the tangent space at a fixed reference measure (e.g., the measures’ Wasserstein barycenter) through a logarithmic map. The Euclidean distances between the embedded measures then approximate the 2-Wasserstein distance between the probability measures. The LOT framework is computationally attractive as it only requires the computation of one optimal transport problem per input measure, reducing the otherwise quadratic cost to linear. Moreover, the framework provides theoretical guarantees on convexifying certain sets of probability measures [34, 1], which is critical in supervised and unsupervised learning from sets of probability measures. The LOT embedding has recently found diverse applications, from comparing collider events in physics [9] and comparing medical images [5, 30] to permutation invariant pooling for comparing graphs [29] and point sets [35].

Many applications in ML involve comparing non-negative measures (often empirical measures) with varying total amounts of mass, e.g., domain adaptation [19]. Moreover, OT distances (or dissimilarity measures) are often not robust against outliers and noise, resulting in potentially high transportation costs for outliers. Many recent publications have focused on variants of the OT problem that allow for comparing non-negative measures with unequal mass; examples include the optimal partial transport (OPT) problem [8, 20, 21], the Kantorovich–Rubinstein norm [25, 31], and the Hellinger–Kantorovich distance [11, 32]. These methods fall under the broad category of “unbalanced optimal transport” [12, 32]. The existing solvers for unbalanced optimal transport problems are generally as expensive as, or more expensive than, the OT solvers. Hence, computation time remains a main bottleneck of these approaches.

To reduce the computational burden of comparing unbalanced measures, [10] proposed a clever extension of the LOT framework to unbalanced nonnegative measures by linearizing the Hellinger–Kantorovich distance, denoted as the Linearized Hellinger–Kantorovich (LHK) distance, which enjoys many desirable theoretical properties. However, an unintuitive caveat of the HK and LHK formulations is that the geodesic for the transported portion of the mass does not resemble the OT geodesic. In particular, the transported portion does not maintain a constant mass as it is transported (please see Figure 1). In contrast, OPT behaves exactly like OT for the transported mass, with the trade-off of losing the Riemannian structure of HK.

Contributions: In this paper, inspired by OT geodesics, we provide an OPT interpolation technique using its dynamic formulation and explain how to compute it for empirical distributions using barycentric projections. We use this interpolation to embed the space of measures into a Euclidean space using optimal partial transport with respect to a reference measure. This allows us to extend the LOT framework to LOPT, a linearized version of OPT. Thus, we reduce the computational burden of OPT while maintaining the decoupling between noise (created and destroyed mass) and signal (transported mass) that OPT provides. We propose a LOPT discrepancy measure and a LOPT interpolating curve and contrast them with their OPT counterparts. Finally, we demonstrate applications of the new framework in point cloud interpolation and PCA analysis, showing that the new technique is more robust to noise.

Organization: In Section 2, we review optimal transport theory and the Linear Optimal Transport framework to set the basis and intuitions on which we build our new techniques. In Section 3, we review optimal partial transport theory and present an explicit solution to its dynamic formulation, which we use to introduce the Linear Optimal Partial Transport (LOPT) framework. We define the LOPT embedding, LOPT discrepancy, and LOPT interpolation, and give explicit ways to work with empirical data. In Section 4, we show applications of the LOPT framework to approximate OPT distances, to interpolate between point cloud datasets, and to preprocess data for PCA analysis. In the appendix, we provide proofs for all the results, new or old, for which we could not find a proof in the literature.

Figure 1: The depiction of the HK and OPT geodesics between two measures, at times $t\in\{0,0.25,0.5,0.75,1\}$. The top row (blue) represents two initial deltas of mass one located at positions $-1.2$ and $-1$. The bottom row (purple) shows two final deltas of mass one located at $1$ and $1.2$. At intermediate time steps $t=0.25,0.5,0.75$, the transported part (middle delta moving from $-1$ to $1$) changes mass for HK while its mass remains constant for OPT. Outer masses (located at $-1.2$ for initial time $t=0$, and at $1.2$ for final time $t=1$) are being destroyed and created, so mass changes are expected. Notably, mass is created/destroyed at a linear rate for OPT and a nonlinear rate for HK. See Appendix H.4 for further analysis.

2 Background: OT and LOT

2.1 Static Formulation of Optimal Transport

Let $\mathcal{P}(\Omega)$ be the set of Borel probability measures defined in a convex compact subset $\Omega$ of $\mathbb{R}^{d}$, and consider $\mu^{0},\mu^{j}\in\mathcal{P}(\Omega)$. The Optimal Transport (OT) problem between $\mu^{0}$ and $\mu^{j}$ is that of finding the cheapest way to transport all the mass distributed according to the reference measure $\mu^{0}$ onto a new distribution of mass determined by the target measure $\mu^{j}$. Mathematically, it was stated by Kantorovich as the minimization problem

OT(\mu^{0},\mu^{j}):=\inf_{\gamma\in\Gamma(\mu^{0},\mu^{j})}C(\gamma;\mu^{0},\mu^{j}) (1)
\text{for}\quad C(\gamma;\mu^{0},\mu^{j}):=\int_{\Omega^{2}}\|x^{0}-x^{j}\|^{2}\,d\gamma(x^{0},x^{j}), (2)

where $\Gamma(\mu^{0},\mu^{j})$ is the set of all joint probability measures in $\Omega^{2}$ with marginals $\mu^{0}$ and $\mu^{j}$. A measure $\gamma\in\Gamma(\mu^{0},\mu^{j})$ is called a transportation plan, and given measurable sets $A,B\subset\Omega$, the coupling $\gamma(A\times B)$ describes how much mass originally in the set $A$ is transported into the set $B$. The square of the Euclidean distance, $\|x^{0}-x^{j}\|^{2}$, is interpreted as the cost of transporting a unit mass located at $x^{0}$ to $x^{j}$ (more general cost functions might be used, but they are beyond the scope of this article). Therefore, $C(\gamma;\mu^{0},\mu^{j})$ represents the total cost of moving $\mu^{0}$ to $\mu^{j}$ according to $\gamma$. Finally, we will denote the set of all plans that achieve the infimum in (1), which is non-empty [40], as $\Gamma^{*}(\mu^{0},\mu^{j})$.

Under certain conditions (e.g., when $\mu^{0}$ has a continuous density), an optimal plan $\gamma$ can be induced by a rule/map $T$ that takes all the mass at each position $x$ to a unique point $T(x)$. If that is the case, we say that $\gamma$ does not split mass and that it is induced by the map $T$. In fact, it is concentrated on the graph of $T$ in the sense that for all measurable sets $A,B\subset\Omega$, $\gamma(A\times B)=\mu^{0}(\{x\in A:\,T(x)\in B\})$, and we will write it as the pushforward $\gamma=(\mathrm{id}\times T)_{\#}\mu^{0}$. Hence, (1) reads as

OT(\mu^{0},\mu^{j})=\int_{\Omega}\|x-T(x)\|^{2}\,d\mu^{0}(x). (3)

The function $T:\Omega\to\Omega$ is called a Monge map, and when $\mu^{0}$ is absolutely continuous it is unique [7].

Finally, the square root of the optimal value $OT(\cdot,\cdot)$ is exactly the so-called Wasserstein distance, $W_{2}$, on $\mathcal{P}(\Omega)$ [40, Th. 7.3], and we will also refer to $OT(\cdot,\cdot)$ as the OT squared distance. In addition, with this distance, $\mathcal{P}(\Omega)$ is not only a metric space but also a Riemannian manifold [40]. In particular, the tangent space at any $\mu\in\mathcal{P}(\Omega)$ is $\mathcal{T}_{\mu}=L^{2}(\Omega;\mathbb{R}^{d},\mu)=\{u:\Omega\to\mathbb{R}^{d}:\|u\|_{\mu}^{2}<\infty\}$, where

\|u\|_{\mu}^{2}:=\int_{\Omega}\|u(x)\|^{2}\,d\mu(x). (4)

2.2 Dynamic Formulation of Optimal Transport

To understand the framework of Linear Optimal Transport (LOT) we will use the dynamic formulation of the OT problem. Optimal plans and maps can be viewed as a static way of matching two distributions. They tell us where each unit of mass in the initial distribution should end up, but they do not tell the full story of how the system evolves from the initial to the final configuration.

In the dynamic formulation, we consider a curve of measures $\rho\in\mathcal{P}([0,1]\times\Omega)$ parametrized in time that describes the distribution of mass $\rho_{t}:=\rho(t,\cdot)\in\mathcal{P}(\Omega)$ at each instant $0\leq t\leq 1$. We will require the curve to be sufficiently smooth, to have boundary conditions $\rho_{0}=\mu^{0}$, $\rho_{1}=\mu^{j}$, and to satisfy the conservation of mass law. Then, it is well known that there exists a velocity vector field $v_{t}:=v(t,\cdot)$ such that $\rho_{t}$ satisfies the continuity equation (weakly, or in the sense of distributions; see [40, 38]) with boundary conditions

\partial_{t}\rho+\nabla\cdot\rho v=0,\qquad\rho_{0}=\mu^{0},\quad\rho_{1}=\mu^{j}. (5)

The length of the curve can be stated as $\int_{[0,1]\times\Omega}\|v\|^{2}\,d\rho:=\int_{0}^{1}\|v_{t}\|_{\rho_{t}}^{2}\,dt$, for $\|\cdot\|_{\rho_{t}}$ as in (4) (precisely, the length of the curve $\rho$ with respect to the Wasserstein distance should be $\int_{0}^{1}\|v_{t}\|_{\rho_{t}}\,dt$, but this makes no difference in the solutions of (6) since they are constant speed geodesics), and $OT(\mu^{0},\mu^{j})$ coincides with the length of the shortest curve between $\mu^{0}$ and $\mu^{j}$ [6]. Hence, the dynamical formulation of the OT problem (1) reads as

OT(\mu^{0},\mu^{j})=\inf_{(\rho,v)\in\mathcal{CE}(\mu^{0},\mu^{j})}\int_{[0,1]\times\Omega}\|v\|^{2}\,d\rho, (6)

where $\mathcal{CE}(\mu^{0},\mu^{j})$ is the set of pairs $(\rho,v)$, with $\rho\in\mathcal{P}([0,1]\times\Omega)$ and $v:[0,1]\times\Omega\to\mathbb{R}^{d}$, satisfying (5).

Under the assumption that an optimal Monge map $T$ exists, an optimal solution for (6) can be given explicitly and is quite intuitive. If a particle starts at position $x$ and finishes at position $T(x)$, then for $0<t<1$ it will be at the point

T_{t}(x):=(1-t)x+tT(x). (7)

Then, varying both the time $t$ and $x\in\Omega$, the mapping (7) can be interpreted as a flow whose time velocity (well defined for each $(t,x)\in(0,1)\times\Omega$ since $T_{t}$ is invertible; see [38, Lemma 5.29]) is

v_{t}(x)=T(x_{0})-x_{0},\qquad\text{ for }x=T_{t}(x_{0}). (8)

To obtain the curve of probability measures $\rho_{t}$, one can evolve $\mu^{0}$ through the flow $T_{t}$ using the formula $\rho_{t}(A)=\mu^{0}(T_{t}^{-1}(A))$ for any measurable set $A$. That is, $\rho_{t}$ is the push-forward of $\mu^{0}$ by $T_{t}$:

\rho_{t}=(T_{t})_{\#}\mu^{0},\qquad 0\leq t\leq 1. (9)

The pair $(\rho,v)$ defined by (9) and (8) satisfies the continuity equation (5) and solves (6). Moreover, the curve $\rho_{t}$ is a constant speed geodesic in $\mathcal{P}(\Omega)$ between $\mu^{0}$ and $\mu^{j}$ [22], i.e., it satisfies that for all $0\leq s\leq t\leq 1$

\sqrt{OT(\rho_{s},\rho_{t})}=(t-s)\sqrt{OT(\rho_{0},\rho_{1})}. (10)

A confirmation of this comes from combining the OT cost (3) with (8), which gives

OT(\mu^{0},\mu^{j})=\int_{\Omega}\|v_{0}(x)\|^{2}\,d\mu^{0}(x), (11)

which tells us that we only need the speed at the initial time to compute the total length of the curve. Moreover, $OT(\mu^{0},\mu^{j})$ coincides with the squared norm of the tangent vector $v_{0}$ in the tangent space $\mathcal{T}_{\mu^{0}}$ of $\mathcal{P}(\Omega)$ at $\mu^{0}$.

2.3 Linear Optimal Transport Embedding

Inspired by the Riemannian geometry induced by the OT squared distance, [41] proposed the so-called Linear Optimal Transportation (LOT) framework. Given two target measures $\mu^{i},\mu^{j}$, the main idea relies on considering a reference measure $\mu^{0}$ and embedding the target measures into the tangent space $\mathcal{T}_{\mu^{0}}$. This is done by identifying each measure $\mu^{j}$ with the curve (9) minimizing $OT(\mu^{0},\mu^{j})$ and computing its velocity (tangent vector) at $t=0$ using (8).

Formally, let us fix a continuous probability reference measure $\mu^{0}$. Then, the LOT embedding [34] is defined as

\mu^{j}\mapsto u^{j}:=T^{j}-\mathrm{id}\qquad\forall\ \mu^{j}\in\mathcal{P}(\Omega) (12)

where $T^{j}$ is the optimal Monge map between $\mu^{0}$ and $\mu^{j}$. Notice that by (3), (4) and (11) we have

\|u^{j}\|^{2}_{\mu^{0}}=OT(\mu^{0},\mu^{j}). (13)

After this embedding, one can use the distance in $\mathcal{T}_{\mu^{0}}$ between the projected measures to define a new distance in $\mathcal{P}(\Omega)$ that can be used to approximate $OT(\mu^{i},\mu^{j})$. The LOT squared distance is defined as

LOT_{\mu^{0}}(\mu^{i},\mu^{j}):=\|u^{i}-u^{j}\|_{\mu^{0}}^{2}. (14)

2.4 LOT in the Discrete Setting

For discrete probability measures $\mu^{0},\mu^{j}$ of the form

\mu^{0}=\sum_{n=1}^{N_{0}}p^{0}_{n}\delta_{x^{0}_{n}},\qquad\mu^{j}=\sum_{n=1}^{N_{j}}p^{j}_{n}\delta_{x^{j}_{n}}, (15)

a Monge map $T^{j}$ for $OT(\mu^{0},\mu^{j})$ may not exist. Following [41], in this setting, the target measure $\mu^{j}$ can be replaced by a new measure $\hat{\mu}^{j}$ for which an optimal transport Monge map exists. For that, an optimal plan $\gamma^{j}\in\Gamma^{*}(\mu^{0},\mu^{j})$ can be viewed as an $N_{0}\times N_{j}$ matrix whose value at position $(n,m)$ represents how much mass from $x^{0}_{n}$ should be taken to $x^{j}_{m}$. Then, we define the OT barycentric projection (we refer to [2] for the rigorous definition) of $\mu^{j}$ with respect to $\mu^{0}$ as

\hat{\mu}^{j}:=\sum_{n=1}^{N_{0}}p^{0}_{n}\delta_{\hat{x}_{n}^{j}},\quad\text{where}\quad\hat{x}_{n}^{j}:=\frac{1}{p_{n}^{0}}\sum_{m=1}^{N_{j}}\gamma^{j}_{n,m}x_{m}^{j}. (16)

The new measure $\hat{\mu}^{j}$ is regarded as an $N_{0}$-point representation of the target measure $\mu^{j}$. The following lemma guarantees the existence of a Monge map between $\mu^{0}$ and $\hat{\mu}^{j}$.

Lemma 2.1.

Let $\mu^{0}$ and $\mu^{j}$ be two discrete probability measures as in (15), and consider an OT barycentric projection $\hat{\mu}^{j}$ of $\mu^{j}$ with respect to $\mu^{0}$ as in (16). Then, the map $x_{n}^{0}\mapsto\hat{x}_{n}^{j}$ given by (16) solves the OT problem $OT(\mu^{0},\hat{\mu}^{j})$.

It is easy to show that if the optimal transport plan $\gamma^{j}$ is induced by a Monge map, then $\hat{\mu}^{j}=\mu^{j}$. As a consequence, the OT barycentric projection is an actual projection in the sense that it is idempotent.

Similar to the continuous case (12), given a discrete reference measure $\mu^{0}$, we can define the LOT embedding for a discrete measure $\mu^{j}$ as the rule

\mu^{j}\mapsto u^{j}:=[(\hat{x}_{1}^{j}-x^{0}_{1}),\ldots,(\hat{x}_{N_{0}}^{j}-x^{0}_{N_{0}})]. (17)

The range $\mathcal{T}_{\mu^{0}}$ of this mapping is identified with $\mathbb{R}^{d\times N_{0}}$ with the norm given by $\|u\|_{\mu^{0}}^{2}:=\sum_{n=1}^{N_{0}}\|u(n)\|^{2}p_{n}^{0}$, where $u(n)\in\mathbb{R}^{d}$ denotes the $n$th entry of $u$. We call $(\mathbb{R}^{d\times N_{0}},\|\cdot\|_{\mu^{0}})$ the embedding space.
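For concreteness, computing (16)–(17) for empirical data amounts to one call to an exact OT solver followed by a matrix product. Below is a minimal sketch, assuming the POT library [23] (`ot.dist` and `ot.emd` are POT functions; the helper name `lot_embed` is ours and purely illustrative):

```python
import numpy as np
import ot  # Python Optimal Transport (POT) [23]

def lot_embed(x0, p0, xj, pj):
    """LOT embedding of (xj, pj) with respect to the reference (x0, p0), cf. (16)-(17).

    x0: (N0, d) support of the reference, p0: (N0,) weights summing to 1.
    xj: (Nj, d) support of the target,    pj: (Nj,) weights summing to 1.
    Returns uj, an (N0, d) array with uj[n] = \hat{x}^j_n - x^0_n.
    """
    M = ot.dist(x0, xj)                   # squared Euclidean cost matrix
    gamma = ot.emd(p0, pj, M)             # an optimal plan gamma^j, shape (N0, Nj)
    xj_hat = (gamma @ xj) / p0[:, None]   # OT barycentric projection (16)
    return xj_hat - x0                    # tangent vector u^j, cf. (17)
```

By (18) below, `np.sum(np.sum(uj**2, axis=1) * p0)` then recovers $OT(\mu^{0},\hat{\mu}^{j})$.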

By the discussion above, if the optimal plan $\gamma^{j}$ for the problem $OT(\mu^{0},\mu^{j})$ is induced by a Monge map, then the discrete embedding is consistent with (13) in the sense that

\|u^{j}\|_{\mu^{0}}^{2}=OT(\mu^{0},\hat{\mu}^{j})=OT(\mu^{0},\mu^{j}). (18)

Hence, as in Section 2.3, we can use the distance between embedded measures in $(\mathbb{R}^{d\times N_{0}},\|\cdot\|_{\mu^{0}})$ to define a discrepancy in the space of discrete probabilities that can be used to approximate $OT(\mu^{i},\mu^{j})$. The LOT discrepancy is defined as (in [41], LOT is defined by the infimum over all possible optimal pairs $(\gamma^{i},\gamma^{j})$; we do not distinguish these two formulations in this paper, and (19) is determined by the choice of $(\gamma^{i},\gamma^{j})$)

LOT_{\mu^{0}}(\mu^{i},\mu^{j}):=\|u^{i}-u^{j}\|_{\mu^{0}}^{2}. (19)

We call it a discrepancy because it is not a squared metric between discrete measures: it may vanish, i.e., $LOT_{\mu^{0}}(\mu^{i},\mu^{j})=0$, for distinct $\mu^{i},\mu^{j}$. Nevertheless, $\|u^{i}-u^{j}\|_{\mu^{0}}$ is a metric in the embedding space.
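Once the measures are embedded, evaluating (19) is just a weighted sum of squared Euclidean distances. A minimal sketch, with `ui`, `uj` as returned by the hypothetical `lot_embed` helper above:

```python
import numpy as np

def lot_discrepancy(ui, uj, p0):
    """LOT discrepancy (19): squared weighted norm of u^i - u^j in the embedding space."""
    return np.sum(np.sum((ui - uj) ** 2, axis=1) * p0)
```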

2.5 OT and LOT Geodesics in Discrete Settings

Let $\mu^{i}$, $\mu^{j}$ be discrete probability measures as in (15) (with ‘$i$’ in place of $0$). If an optimal Monge map $T$ for $OT(\mu^{i},\mu^{j})$ exists, a constant speed geodesic $\rho_{t}$ between $\mu^{i}$ and $\mu^{j}$, for the OT squared distance, can be found by mimicking (9). Explicitly, with $T_{t}$ as in (7),

\rho_{t}=(T_{t})_{\#}\mu^{i}=\sum_{n=1}^{N_{i}}p_{n}^{i}\delta_{(1-t)x_{n}^{i}+tT(x_{n}^{i})}. (20)

In practice, one replaces $\mu^{j}$ by its OT barycentric projection with respect to $\mu^{i}$ (and so, the existence of an optimal Monge map is guaranteed by Lemma 2.1).

Now, given a discrete reference $\mu^{0}$, the LOT discrepancy provides a new structure on the space of discrete probability measures. Therefore, we can provide a substitute for the OT geodesic (20) between $\mu^{i}$ and $\mu^{j}$. Assume we have the embeddings $\mu^{i}\mapsto u^{i}$, $\mu^{j}\mapsto u^{j}$ as in (17). The geodesic between $u^{i}$ and $u^{j}$ in the LOT embedding space $\mathbb{R}^{d\times N_{0}}$ has the simple form $u_{t}=(1-t)u^{i}+tu^{j}$. This corresponds to the curve $\hat{\rho}_{t}$ in $\mathcal{P}(\Omega)$ induced by the map $\hat{T}:\hat{x}_{n}^{i}\mapsto\hat{x}_{n}^{j}$ (this map can be understood as the one that transports $\hat{\mu}^{i}$ onto $\hat{\mu}^{j}$ pivoting on the reference: $\hat{\mu}^{i}\mapsto\mu^{0}\mapsto\hat{\mu}^{j}$), namely

\hat{\rho}_{t}:=(\hat{T}_{t})_{\#}\hat{\mu}^{i}=\sum_{n=1}^{N_{0}}p_{n}^{0}\delta_{x_{n}^{0}+u_{t}(n)}. (21)

By abuse of notation, we call this curve the LOT geodesic between $\mu^{i}$ and $\mu^{j}$. Nevertheless, it is a geodesic between their barycentric projections since it satisfies the following.

Proposition 2.2.

Let $\hat{\rho}_{t}$ be defined as in (21), with $\hat{\rho}_{0}=\hat{\mu}^{i},\hat{\rho}_{1}=\hat{\mu}^{j}$. Then for all $0\leq s\leq t\leq 1$

\sqrt{LOT_{\mu^{0}}(\hat{\rho}_{s},\hat{\rho}_{t})}=(t-s)\sqrt{LOT_{\mu^{0}}(\hat{\rho}_{0},\hat{\rho}_{1})}. (22)
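In practice, sampling the LOT geodesic (21) only requires linear interpolation in the embedding space. A minimal sketch, under the same assumptions as the previous snippets (`ui`, `uj` are the embeddings of $\mu^{i}$, $\mu^{j}$ with respect to the reference supported on `x0` with weights `p0`):

```python
def lot_geodesic(x0, p0, ui, uj, t):
    """Support and weights of the LOT geodesic (21) at time t."""
    ut = (1 - t) * ui + t * uj   # straight line u_t in the embedding space
    return x0 + ut, p0           # points x^0_n + u_t(n), carrying the fixed weights p^0_n
```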

3 Linear Optimal Partial Transport Embedding

3.1 Static Formulation of Optimal Partial Transport

In addition to mass transportation, the OPT problem allows mass destruction at the source and mass creation at the target. Let $\mathcal{M}_{+}(\Omega)$ denote the set of all positive finite Borel measures defined on $\Omega$. For $\lambda\geq 0$, the OPT problem between $\mu^{0},\mu^{j}\in\mathcal{M}_{+}(\Omega)$ can be formulated as

OPT_{\lambda}(\mu^{0},\mu^{j}):=\inf_{\gamma\in\Gamma_{\leq}(\mu^{0},\mu^{j})}C(\gamma;\mu^{0},\mu^{j},\lambda) (23)
\text{for}\quad C(\gamma;\mu^{0},\mu^{j},\lambda):=\int_{\Omega^{2}}\|x^{0}-x^{j}\|^{2}\,d\gamma(x^{0},x^{j})+\lambda(|\mu^{0}-\gamma_{0}|+|\mu^{j}-\gamma_{1}|) (24)

where $|\mu^{0}-\gamma_{0}|$ is the total mass of $\mu^{0}-\gamma_{0}$ (resp. $|\mu^{j}-\gamma_{1}|$), and $\Gamma_{\leq}(\mu^{0},\mu^{j})$ denotes the set of all measures on $\Omega^{2}$ whose marginals $\gamma_{0}$ and $\gamma_{1}$ satisfy $\gamma_{0}\leq\mu^{0}$ (i.e., $\gamma_{0}(E)\leq\mu^{0}(E)$ for every measurable set $E$) and $\gamma_{1}\leq\mu^{j}$. Here, the mass destruction and creation penalty is linear, parametrized by $\lambda$. The set of minimizers $\Gamma_{\leq}^{*}(\mu^{0},\mu^{j})$ of (23) is non-empty [20]. One can further restrict $\Gamma_{\leq}(\mu^{0},\mu^{j})$ to the set of partial transport plans $\gamma$ such that $\|x^{0}-x^{j}\|^{2}<2\lambda$ for all $(x^{0},x^{j})\in\operatorname{supp}(\gamma)$ [4, Lemma 3.2]. This means that if the usual transportation cost is greater than $2\lambda$, it is better to create/destroy mass.

3.2 Dynamic Formulation of Optimal Partial Transport

Adding a forcing term $\zeta$ to the continuity equation (5), one can take into account curves that allow creation and destruction of mass, that is, curves that break the conservation of mass law. Thus, it is natural that the minimization problem (23) can be rewritten [12, Th. 5.2] in a dynamic formulation as

OPT_{\lambda}(\mu^{0},\mu^{j})=\inf_{(\rho,v,\zeta)\in\mathcal{FCE}(\mu^{0},\mu^{j})}\int_{[0,1]\times\Omega}\|v\|^{2}\,d\rho+\lambda|\zeta| (25)

where $\mathcal{FCE}(\mu^{0},\mu^{j})$ is the set of tuples $(\rho,v,\zeta)$ such that $\rho\in\mathcal{M}_{+}([0,1]\times\Omega)$, $\zeta\in\mathcal{M}([0,1]\times\Omega)$ (where $\mathcal{M}$ stands for signed measures), and $v:[0,1]\times\Omega\to\mathbb{R}^{d}$, satisfying

\partial_{t}\rho+\nabla\cdot\rho v=\zeta,\qquad\rho_{0}=\mu^{0},\quad\rho_{1}=\mu^{j}. (26)

As in the case of OT, under certain conditions on the minimizers $\gamma$ of (23), a curve $\rho_{t}$ that minimizes the dynamic formulation (25) is quite intuitive. We show in the next proposition that it consists of three parts, $\gamma_{t}$, $(1-t)\nu_{0}$ and $t\nu^{j}$ (see (27), (28), and (29) below). The first is a curve that only transports mass, while the second and third destroy and create mass at constant rates $|\nu_{0}|$ and $|\nu^{j}|$, respectively.

Proposition 3.1.

Let $\gamma^{*}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{j})$ be of the form $\gamma^{*}=(\mathrm{id}\times T)_{\#}\gamma_{0}^{*}$ for $T:\Omega\to\Omega$ a (measurable) map. Let

\nu_{0}:=\mu^{0}-\gamma_{0}^{*},\quad\nu^{j}:=\mu^{j}-\gamma_{1}^{*}, (27)
T_{t}(x):=(1-t)x+tT(x),\quad\gamma_{t}:=(T_{t})_{\#}\gamma_{0}^{*}. (28)

Then, an optimal solution $(\rho,v,\zeta)$ for (25) is given by

\rho_{t}:=\gamma_{t}+(1-t)\nu_{0}+t\nu^{j}, (29)
v_{t}(x):=T(x_{0})-x_{0},\qquad\text{ if }x=T_{t}(x_{0}), (30)
\zeta_{t}:=\nu^{j}-\nu_{0}. (31)

Moreover, plugging $(\rho,v,\zeta)$ into (25), it holds that

{OPT}_{\lambda}(\mu^{0},\mu^{j})=\|v_{0}\|^{2}_{\gamma_{0}^{*},2\lambda}+\lambda(|\nu_{0}|+|\nu^{j}|), (32)

where $v_{0}(x)=T(x)-x$ (i.e., $v_{t}$ at time $t=0$), and

\|v\|_{\mu,2\lambda}^{2}:=\int_{\Omega}\min(\|v\|^{2},2\lambda)\,d\mu,\qquad\text{for }v:\Omega\to\mathbb{R}^{d}.

In analogy with the OT squared distance, we also refer to the optimal partial cost (32) as the OPT squared distance.

3.3 Linear Optimal Partial Transport Embedding

Definition 3.2.

Let $\mu^{0}$, $\mu^{j}\in\mathcal{M}_{+}(\Omega)$ be such that $OPT_{\lambda}(\mu^{0},\mu^{j})$ is solved by a plan induced by a map. The LOPT embedding of $\mu^{j}$ with respect to $\mu^{0}$ is defined as

\mu^{j}\mapsto(u^{j},\bar{\mu}^{j},\nu^{j}):=(v_{0},\gamma_{0},\nu^{j}) (33)

where $v_{0},\gamma_{0},\nu^{j}$ are defined as in Proposition 3.1.

Let us compare the LOPT (33) and LOT (12) embeddings. The first component $v_{0}$ represents the tangent of the curve that transports mass from the reference to the target. This is exactly the same as in the LOT embedding. In contrast to LOT, the second component $\gamma_{0}$ is necessary since we need to specify what part of the reference is being transported. The third component $\nu^{j}$ can be thought of as the tangent vector of the part that creates mass. There is no need to save the destroyed mass because it can be inferred from the other quantities.

Now, let $\mu^{0}\wedge\mu^{j}$ be the minimum measure between $\mu^{0}$ and $\mu^{j}$ (formally, $\mu^{0}\wedge\mu^{j}(B):=\inf\{\mu^{0}(B_{1})+\mu^{j}(B_{2})\}$ for every Borel set $B$, where the infimum is taken over all partitions $B=B_{1}\cup B_{2}$, $B_{1}\cap B_{2}=\emptyset$, into Borel sets $B_{1}$, $B_{2}$). By the above definition, $\mu^{0}\mapsto(u^{0},\bar{\mu}^{0},\nu^{0})=(0,\mu^{0},0)$. Therefore, (32) can be rewritten as

OPT_{\lambda}(\mu^{0},\mu^{j})=\|u^{0}-u^{j}\|^{2}_{\bar{\mu}^{0}\wedge\bar{\mu}^{j},2\lambda}+\lambda(|\bar{\mu}^{0}-\bar{\mu}^{j}|+|\nu_{0}|+|\nu^{j}|). (34)

This motivates the definition of the LOPT discrepancy (note that $LOPT_{\lambda}$ is not a rigorous metric).

Definition 3.3.

Consider a reference $\mu^{0}\in\mathcal{M}_{+}(\Omega)$ and target measures $\mu^{i},\mu^{j}\in\mathcal{M}_{+}(\Omega)$ such that $OPT_{\lambda}(\mu^{0},\mu^{i})$ and $OPT_{\lambda}(\mu^{0},\mu^{j})$ can be solved by plans induced by mappings as in the hypothesis of Proposition 3.1. Let $(u^{i},\bar{\mu}^{i},\nu^{i})$ and $(u^{j},\bar{\mu}^{j},\nu^{j})$ be the LOPT embeddings of $\mu^{i}$ and $\mu^{j}$ with respect to $\mu^{0}$. The LOPT discrepancy between $\mu^{i}$ and $\mu^{j}$ with respect to $\mu^{0}$ is defined as

LOPT_{\mu^{0},\lambda}(\mu^{i},\mu^{j}):=\|u^{i}-u^{j}\|^{2}_{\bar{\mu}^{i}\wedge\bar{\mu}^{j},2\lambda}+\lambda(|\bar{\mu}^{i}-\bar{\mu}^{j}|+|\nu^{i}|+|\nu^{j}|) (35)

Similar to the LOT framework, by equation (34), LOPT recovers OPT when $\mu^{i}=\mu^{0}$. That is,

LOPT_{\mu^{0},\lambda}(\mu^{0},\mu^{j})=OPT_{\lambda}(\mu^{0},\mu^{j}).

3.4 LOPT in the Discrete Setting

If $\mu^{0},\mu^{j}$ are discrete non-negative measures with $N_{0}$ and $N_{j}$ points as in (15) (but not necessarily with total mass 1), the OPT problem (23) can be written as

\min_{\gamma\in\Gamma_{\leq}(\mu^{0},\mu^{j})}\sum_{n,m}\|x_{n}^{0}-x_{m}^{j}\|^{2}\gamma_{n,m}+\lambda(|p^{0}|+|p^{j}|-2|\gamma|)

where the set $\Gamma_{\leq}(\mu^{0},\mu^{j})$ can be viewed as the subset of $N_{0}\times N_{j}$ matrices with non-negative entries

\Gamma_{\leq}(\mu^{0},\mu^{j}):=\{\gamma\in\mathbb{R}_{+}^{N_{0}\times N_{j}}:\gamma 1_{N_{j}}\leq p^{0},\ \gamma^{T}1_{N_{0}}\leq p^{j}\},

where $1_{N_{0}}$ denotes the $N_{0}\times 1$ vector whose entries are all $1$ (resp. $1_{N_{j}}$), $p^{0}=[p_{1}^{0},\ldots,p_{N_{0}}^{0}]$ is the vector of weights of $\mu^{0}$ (resp. $p^{j}$), $\gamma 1_{N_{j}}\leq p^{0}$ means that the inequality ‘$\leq$’ holds component-wise (resp. $\gamma^{T}1_{N_{0}}\leq p^{j}$, where $\gamma^{T}$ is the transpose of $\gamma$), and $|p^{0}|=\sum_{n=1}^{N_{0}}|p^{0}_{n}|$ is the total mass of $\mu^{0}$ (resp. $|p^{j}|,|\gamma|$). The marginals are $\gamma_{0}:=\gamma 1_{N_{j}}$ and $\gamma_{1}:=\gamma^{T}1_{N_{0}}$.
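This finite linear program can be handed to a generic LP solver; one convenient route, based on the conversion of OPT into an OT problem [8] mentioned in Section 4, augments each measure with a dummy point reachable at cost $\lambda$ and calls a balanced OT solver. A minimal sketch, assuming the POT library [23] (the helper name `opt_partial` is ours):

```python
import numpy as np
import ot  # Python Optimal Transport (POT) [23]

def opt_partial(x0, p0, xj, pj, lam):
    """Discrete OPT_lambda via the reduction to balanced OT with dummy points [8].

    Returns the optimal partial plan gamma (N0 x Nj) and the value OPT_lambda(mu^0, mu^j).
    """
    N0, Nj = len(p0), len(pj)
    M = np.full((N0 + 1, Nj + 1), float(lam))  # transporting to/from a dummy point costs lambda
    M[:N0, :Nj] = ot.dist(x0, xj)              # real-to-real: squared Euclidean cost
    M[N0, Nj] = 0.0                            # dummy-to-dummy transport is free
    a = np.append(p0, pj.sum())                # augmented marginals: both sum to |p^0| + |p^j|
    b = np.append(pj, p0.sum())
    plan = ot.emd(a, b, M)                     # balanced OT on the augmented problem
    cost = np.sum(plan * M)                    # transport cost + lambda * (destroyed + created mass)
    return plan[:N0, :Nj], cost
```

By construction, mass sent to the target dummy point is destroyed, mass received from the source dummy point is created, and the dummy-to-dummy block absorbs the remainder at zero cost, so the augmented OT value equals (23).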

Similar to OT, when an optimal plan $\gamma^{j}$ for $OPT_{\lambda}(\mu^{0},\mu^{j})$ is not induced by a map, we can replace the target measure $\mu^{j}$ by an OPT barycentric projection $\hat{\mu}^{j}$ for which a map exists, therefore allowing us to apply the LOPT embedding (see (33) and (40) below).

Definition 3.4.

Let $\mu^{0}$ and $\mu^{j}$ be positive discrete measures, and let $\gamma^{j}\in\Gamma^{*}_{\leq}(\mu^{0},\mu^{j})$. The OPT barycentric projection of $\mu^{j}$ with respect to $\mu^{0}$ is defined as follows (notice that in (16) we had $p_{n}^{0}=\sum_{m=1}^{N_{j}}\gamma^{j}_{n,m}$; this leads to introducing $\hat{p}_{n}^{j}$ as in (37), that is, $\hat{p}_{n}^{j}$ plays the role of $p_{n}^{0}$ in the OPT framework; however, here $\hat{p}_{n}^{j}$ depends on $\gamma^{j}$ (on its first marginal $\gamma_{0}^{j}$) and not only on $\mu^{0}$, and so we add a superscript ‘$j$’):

\hat{\mu}^{j}:=\sum_{n=1}^{N_{0}}\hat{p}_{n}^{j}\delta_{\hat{x}^{j}_{n}},\qquad\text{where} (36)
\hat{p}_{n}^{j}:=\sum_{m=1}^{N_{j}}\gamma^{j}_{n,m},\qquad 1\leq n\leq N_{0}, (37)
\hat{x}_{n}^{j}:=\begin{cases}\frac{1}{\hat{p}_{n}^{j}}\sum_{m=1}^{N_{j}}\gamma^{j}_{n,m}x_{m}^{j}&\text{if }\hat{p}_{n}^{j}>0\\ x_{n}^{0}&\text{if }\hat{p}_{n}^{j}=0.\end{cases} (38)
Theorem 3.5.

In the same setting of Definition 3.4, the map $x_{n}^{0}\mapsto\hat{x}_{n}^{j}$ given by (38) solves the problem $OPT_{\lambda}(\mu^{0},\hat{\mu}^{j})$, in the sense that it induces the optimal partial plan $\hat{\gamma}^{j}=\operatorname{diag}(\hat{p}_{1}^{j},\ldots,\hat{p}_{N_{0}}^{j})$.

It is worth noting that when we take the barycentric projection of a measure, some information is lost. Specifically, the information about the part of $\mu^{j}$ that is not transported from the reference $\mu^{0}$. This has some minor consequences.

First, unlike (18), the optimal partial transport cost $OPT_{\lambda}(\mu^{0},\mu^{j})$ changes when we replace $\mu^{j}$ by $\hat{\mu}^{j}$. Nevertheless, the following relation holds.

Theorem 3.6.

In the same setting of Definition 3.4, if $\gamma^{j}$ is induced by a map, then

OPT_{\lambda}(\mu^{0},\mu^{j})=OPT_{\lambda}(\mu^{0},\hat{\mu}^{j})+\lambda(|\mu^{j}|-|\hat{\mu}^{j}|). (39)

The second consequence (which is in fact an advantage, since it allows the range of the embedding to always have the same dimension $N_{0}\times(d+1)$) is that the LOPT embedding of $\hat{\mu}^{j}$ always has a null third component. That is,

\hat{\mu}^{j}\mapsto\left([\hat{x}_{1}^{j}-x_{1}^{0},\ldots,\hat{x}_{N_{0}}^{j}-x_{N_{0}}^{0}],\,\sum_{n=1}^{N_{0}}\hat{p}_{n}^{j}\delta_{x_{n}^{0}},\,0\right). (40)

Therefore, we represent this embedding as $\hat{\mu}^{j}\mapsto(u^{j},\hat{p}^{j})$, for $u^{j}=[\hat{x}_{1}^{j}-x_{1}^{0},\ldots,\hat{x}_{N_{0}}^{j}-x_{N_{0}}^{0}]$ and $\hat{p}^{j}=[\hat{p}^{j}_{1},\ldots,\hat{p}^{j}_{N_{0}}]$. A minimal computational sketch of this embedding is given below; the last consequence is stated in Proposition 3.7.
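Computationally, the LOPT embedding of a discrete measure reduces to one OPT solve followed by (36)–(38). A minimal sketch, reusing the hypothetical `opt_partial` helper sketched above:

```python
import numpy as np

def lopt_embed(x0, p0, xj, pj, lam):
    """LOPT embedding (u^j, \hat{p}^j) of (xj, pj) w.r.t. the reference (x0, p0), cf. (36)-(40)."""
    gamma, _ = opt_partial(x0, p0, xj, pj, lam)     # optimal partial plan gamma^j
    pj_hat = gamma.sum(axis=1)                      # transported mass at each reference point (37)
    xj_hat = np.array(x0, dtype=float)              # points receiving no mass stay put (38)
    idx = pj_hat > 0
    xj_hat[idx] = (gamma[idx] @ xj) / pj_hat[idx][:, None]
    return xj_hat - x0, pj_hat                      # (u^j, \hat{p}^j), cf. (40)
```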

Proposition 3.7.

If $\mu^{0},\mu^{i},\mu^{j}$ are discrete and satisfy the conditions of Definition 3.3, then

LOPT_{\mu^{0},\lambda}(\mu^{i},\mu^{j})=LOPT_{\mu^{0},\lambda}(\hat{\mu}^{i},\hat{\mu}^{j})+\lambda C_{i,j} (41)

where $C_{i,j}=|\mu^{i}|-|\hat{\mu}^{i}|+|\mu^{j}|-|\hat{\mu}^{j}|$.

As a byproduct, we can define the LOPT discrepancy for any pair of discrete measures $\mu^{i},\mu^{j}$ as the right-hand side of (41). In practice, unless we want to approximate $OPT_{\lambda}(\mu^{i},\mu^{j})$, we set $C_{i,j}=0$ in (41). That is,

LOPT_{\mu^{0},\lambda}(\mu^{i},\mu^{j}):=LOPT_{\mu^{0},\lambda}(\hat{\mu}^{i},\hat{\mu}^{j}). (42)
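For embedded discrete measures, the right-hand side of (42) is evaluated entry-wise on the reference support: the (truncated) transport term uses the minimum weights $\hat{p}^{i}\wedge\hat{p}^{j}$, and the remaining mass difference is penalized by $\lambda$. A minimal sketch, with $(u^{i},\hat{p}^{i})$, $(u^{j},\hat{p}^{j})$ as returned by the hypothetical `lopt_embed` helper above:

```python
import numpy as np

def lopt_discrepancy(ui, pi_hat, uj, pj_hat, lam):
    """LOPT discrepancy (42) between barycentric projections, from their embeddings."""
    common = np.minimum(pi_hat, pj_hat)              # \hat{p}^i wedge \hat{p}^j
    d2 = np.sum((ui - uj) ** 2, axis=1)              # ||u^i(n) - u^j(n)||^2
    transport = np.sum(np.minimum(d2, 2 * lam) * common)
    penalty = lam * np.sum(np.abs(pi_hat - pj_hat))  # mass created/destroyed between projections
    return transport + penalty
```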

3.5 OPT and LOPT Interpolation

Inspired by the OT and LOT geodesics defined in Section 2.5, but lacking the Riemannian structure provided by the OT squared distance, we propose an OPT interpolation curve and its LOPT approximation.

For the OPT interpolation between two measures $\mu^{i}$, $\mu^{j}$ for which there exists $\gamma\in\Gamma_{\leq}^{*}(\mu^{i},\mu^{j})$ of the form $\gamma=(\mathrm{id}\times T)_{\#}\gamma_{0}$, a natural candidate is the solution $\rho_{t}$ of the dynamic formulation of $OPT_{\lambda}(\mu^{i},\mu^{j})$. The exact expression is given by Proposition 3.1. When working with general discrete measures $\mu^{i}$, $\mu^{j}$ (as in (15), with ‘$i$’ in place of $0$), such a $\gamma$ is not guaranteed to exist. Then, we replace $\mu^{j}$ with its OPT barycentric projection with respect to $\mu^{i}$. By Theorem 3.5, the map $T:x_{n}^{i}\mapsto\hat{x}_{n}^{j}$ solves $OPT_{\lambda}(\mu^{i},\hat{\mu}^{j})$, and the OPT interpolating curve is (here $\hat{p}_{n}^{j}$ are the coefficients of $\hat{\mu}^{j}$ with respect to $\mu^{i}$, analogous to (36))

t\mapsto\sum_{n=1}^{N_{i}}\hat{p}_{n}^{j}\delta_{(1-t)x_{n}^{i}+tT(x_{n}^{i})}+(1-t)\sum_{n=1}^{N_{i}}(p_{n}^{i}-\hat{p}_{n}^{j})\delta_{x_{n}^{i}}.

When working with a multitude of measures, it is convenient to consider a reference $\mu^{0}$ and embed the measures into $\mathbb{R}^{(d+1)\times N_{0}}$ using LOPT, so that computations take place in a simpler space. Below we provide the LOPT interpolation.

Definition 3.8.

Given discrete measures $\mu^{0},\mu^{i},\mu^{j}$, with $\mu^{0}$ as the reference, let $(u^{i},\hat{p}^{i}),(u^{j},\hat{p}^{j})$ be the LOPT embeddings of $\mu^{i},\mu^{j}$. Let $\hat{p}^{ij}:=\hat{p}^{i}\wedge\hat{p}^{j}$, and $u_{t}:=(1-t)u^{i}+tu^{j}$. We define the LOPT interpolating curve between $\mu^{i}$ and $\mu^{j}$ by

t\mapsto\sum_{k\in D_{T}}\hat{p}^{ij}_{k}\delta_{x_{k}^{0}+u_{t}(k)}+(1-t)\sum_{k\in D_{D}}(\hat{p}_{k}^{i}-\hat{p}^{ij}_{k})\delta_{x_{k}^{0}+u_{k}^{i}}+t\sum_{k\in D_{C}}(\hat{p}_{k}^{j}-\hat{p}_{k}^{ij})\delta_{x^{0}_{k}+u_{k}^{j}}

where $D_{T}=\{k:\hat{p}_{k}^{ij}>0\}$, $D_{D}=\{k:\hat{p}_{k}^{i}>\hat{p}_{k}^{ij}\}$ and $D_{C}=\{k:\hat{p}_{k}^{ij}<\hat{p}_{k}^{j}\}$ are, respectively, the sets where we transport, destroy and create mass.
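A minimal sketch of Definition 3.8, returning the support points and weights of the interpolating measure at time $t$ (indices outside $D_{T}$, $D_{D}$, $D_{C}$ carry zero mass and are dropped); it assumes the embeddings come from the hypothetical `lopt_embed` helper above:

```python
import numpy as np

def lopt_interpolate(x0, ui, pi_hat, uj, pj_hat, t):
    """Support points and weights of the LOPT interpolating curve (Definition 3.8) at time t."""
    pij = np.minimum(pi_hat, pj_hat)        # \hat{p}^{ij}: mass that is transported
    ut = (1 - t) * ui + t * uj
    pts = [x0 + ut, x0 + ui, x0 + uj]       # transported / destroyed / created locations
    wts = [pij, (1 - t) * (pi_hat - pij), t * (pj_hat - pij)]
    X, w = np.concatenate(pts), np.concatenate(wts)
    keep = w > 0                            # drop zero-mass atoms
    return X[keep], w[keep]
```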

4 Applications

Approximation of OPT Distance: Similar to LOT [41] and the Linearized Hellinger–Kantorovich (LHK) distance [10], we test how well LOPT approximates OPT. Given $K$ empirical measures $\{\mu^{i}\}_{i=1}^{K}$, for each pair $(\mu^{i},\mu^{j})$ we compute $OPT_{\lambda}(\mu^{i},\mu^{j})$ and $LOPT_{\mu^{0},\lambda}(\mu^{i},\mu^{j})$, and report the mean or median over all pairs $(\mu^{i},\mu^{j})$ of the relative error defined as

\frac{|OPT_{\lambda}(\mu^{i},\mu^{j})-LOPT_{\mu^{0},\lambda}(\mu^{i},\mu^{j})|}{OPT_{\lambda}(\mu^{i},\mu^{j})}.

Similar to LOT and LHK, the choice of $\mu^{0}$ is critical for an accurate approximation of OPT. If $\mu^{0}$ is far away from $\{\mu^{i}\}_{i=1}^{K}$, the linearization is a poor approximation because the mass in $\mu^{i}$ and $\mu^{0}$ would only be destroyed or created. In practice, one candidate for $\mu^{0}$ is the barycenter of the set of measures $\{\mu^{i}\}$. The OPT problem can be converted into an OT problem [8], and one can use the OT barycenter [18] to find $\mu^{0}$.
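As a hedged illustration of this choice of reference, the sketch below builds $\mu^{0}$ as a free-support OT barycenter of the input point clouds. It assumes POT exposes `ot.lp.free_support_barycenter` with the signature shown (please verify against your POT version); the helper name is ours:

```python
import numpy as np
import ot

def reference_from_barycenter(point_clouds, n_ref=500, seed=0):
    """Free-support OT barycenter of equally weighted point clouds, used as the reference mu^0.

    point_clouds: list of (N_i, d) arrays. Returns an (n_ref, d) support, each point
    carrying mass 1/n_ref. Assumes ot.lp.free_support_barycenter(locations, weights, X_init).
    """
    rng = np.random.default_rng(seed)
    weights = [np.full(len(X), 1.0 / len(X)) for X in point_clouds]
    X_init = rng.standard_normal((n_ref, point_clouds[0].shape[1]))  # random initial support
    return ot.lp.free_support_barycenter(point_clouds, weights, X_init)
```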

For our experiments, we created $K$ point sets of size $N=500$ from $K$ different Gaussian distributions in $\mathbb{R}^{2}$. In particular, $\mu^{i}\sim\mathcal{N}(m^{i},I)$, where $m^{i}$ is randomly selected such that $\|m^{i}\|=\sqrt{3}$ for $i=1,\ldots,K$. For the reference, we picked an $N$-point representation of $\mu^{0}\sim\mathcal{N}(\overline{m},I)$ with $\overline{m}=\sum m^{i}/K$. We repeated each experiment $10$ times. To exhibit the effect of the parameter $\lambda$ on the approximation, the relative errors are shown in Figure 2. For the histograms of the relative errors for each value of $\lambda$ and each number of measures $K$, we refer to Figure 6 in Appendix H. For large $\lambda$, most mass is transported and $OT(\mu^{i},\mu^{j})\approx OPT_{\lambda}(\mu^{i},\mu^{j})$; the performance of LOPT is then close to that of LOT, and the relative error is small.

(a) Mean of the error. (b) Median of the error.
Figure 2: Graphs of the mean and median relative errors between $OPT_{\lambda}$ and $LOPT_{\mu^{0},\lambda}$ as a function of the parameter $\lambda$.

In Figure 3 we report wall-clock times of OPT vs. LOPT for $\lambda=5$. We use linear programming [27] to solve each OPT problem, with a cost of $\mathcal{O}(N^{3}\log(N))$ each. Thus, computing the OPT distance pairwise for $\{\mu^{i}\}_{i=1}^{K}$ requires $\mathcal{O}(K^{2}N^{3}\log(N))$. In contrast, to compute LOPT, we only need to solve $K$ optimal partial transport problems for the embeddings (see (33) or (40)). Computing LOPT discrepancies after the embeddings is linear. Thus, the total computational cost is $\mathcal{O}(KN^{3}\log(N)+K^{2}N)$. The experiment was conducted on a Linux computer with an AMD EPYC 7702P CPU with 64 cores and 256GB DDR4 RAM.

Figure 3: Wall-clock time of OPT vs. LOPT. The LP solver in PythonOT [23] is applied to each individual OPT problem, with $100N$ as the maximum number of iterations.
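The $\mathcal{O}(KN^{3}\log(N)+K^{2}N)$ structure described above translates directly into code: $K$ embeddings (one OPT solve each) followed by cheap pairwise comparisons. A minimal sketch, reusing the hypothetical `lopt_embed` and `lopt_discrepancy` helpers from Section 3:

```python
import numpy as np

def pairwise_lopt(point_clouds, masses, x0, p0, lam):
    """Pairwise LOPT matrix: K OPT solves for the embeddings, then O(K^2 N_0) comparisons."""
    emb = [lopt_embed(x0, p0, X, p, lam) for X, p in zip(point_clouds, masses)]  # K solves
    K = len(emb)
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            ui, pi_hat = emb[i]
            uj, pj_hat = emb[j]
            D[i, j] = D[j, i] = lopt_discrepancy(ui, pi_hat, uj, pj_hat, lam)
    return D
```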

Point Cloud Interpolation: We test the OT geodesic, LOT geodesic, OPT interpolation, and LOPT interpolation on the point cloud MNIST dataset. We compute different transport curves between point sets of the digits 0 and 9. Each digit is a weighted point set $\{x_{n}^{j},p_{n}^{j}\}_{n=1}^{N_{j}}$, $j=1,2$, that we consider as a discrete measure of the form $\mu^{j}=\sum_{n=1}^{N_{j}}p_{n}^{j}\delta_{x^{j}_{n}}+\frac{1}{N_{j}}\sum_{m=1}^{\eta N_{j}}\delta_{y_{m}^{j}}$, where the first sum corresponds to the clean data normalized to have total mass 1, and the second sum is constructed with samples from a uniform distribution acting as noise with total mass $\eta$. For HK, OPT, LHK and LOPT, we use the distributions $\mu^{j}$ without re-normalization, while for OT and LOT, we re-normalize them. The reference in LOT, LHK and LOPT is taken as the OT barycenter of a sample of the digits 0, 1, and 9, not including the ones used for interpolation, and normalized to have unit total mass. We test for $\eta=0,0.5,0.75$ (see Figure 8 in Appendix H). The results for $\eta=0.5$ are shown in Figure 4. We can see that OT and LOT do not eliminate noise points. HK and OPT still retain much of the noise because the interpolation is essentially between $\mu^{1}$ and $\hat{\mu}^{2}$ (with respect to $\mu^{1}$), so $\mu^{1}$ acts as a reference that still has a lot of noise. In LHK and LOPT, by selecting the same reference as LOT, we see that the noise decreases significantly. In the HK and LHK cases, we notice not only how the masses vary, but also how their relative positions change, yielding a very different configuration at the end of the interpolation. OPT and LOPT instead return a more natural interpolation because of the mass preservation of the transported portion and the decoupling between transport and destruction/creation of mass.

Figure 4: We demonstrate the OT geodesic, OPT interpolation, LOT geodesic and LOPT interpolation on the MNIST dataset. For the LOT geodesic and LOPT interpolation, we use the same reference measure. The percentage of noise $\eta$ is set to $0.5$. For OPT and LOPT interpolation, we set $\lambda=20$; for HK and LHK, we set the scaling to be $2.5$.

PCA analysis: We compare the results of performing PCA on the embedding spaces of LOT, LHK and LOPT for point cloud MNIST. We take 900 digits from the dataset corresponding to the digits 0, 1 and 3 in equal proportions. Each element is a point set $\{x^{j}_{n}\}_{n=1}^{N_{j}}$ that we consider as a discrete measure with added noise. The reference $\mu^{0}$ is set to the OT barycenter of 30 samples from the clean data. For LOT we re-normalize each $\mu^{j}$ to have a total mass of 1, while we do not re-normalize for LOPT. Let $S_{\eta}:=\{\mu^{j}:\text{noise level}=\eta\}_{j=1}^{900}$. We embed $S_{\eta}$ using LOT, LHK and LOPT and apply PCA to the embedded vectors $\{u^{j}\}$. In Figure 5 we show the first two principal components of the set of embedded vectors based on LOT, LHK and LOPT for noise levels $\eta=0,0.75$. It can be seen that when there is no noise, the PCA dimension reduction technique works well for all three embedding methods. When $\eta=0.75$, the method fails for the LOT embedding, but the dimension-reduced data is still separable for LOPT and LHK. Regarding running time, LOT and LOPT require 60-80 seconds, while LHK requires about 300-350 seconds. The experiments were conducted on a Linux computer with an AMD EPYC 7702P CPU with 64 cores and 256GB DDR4 RAM.

Figure 5: We plot the first two principal components of each $u^{j}$ based on LOT and LOPT. For LOPT, we set $\lambda=20.0$, and for LHK, we set the scaling to be $2.5$.

We refer the reader to Appendix H for further details and analysis.

5 Summary

We proposed the Linear Optimal Partial Transport (LOPT) embedding, which allows us to embed distributions with different total masses into a fixed-dimensional space in which several calculations are significantly simplified. We show how to implement it for real data distributions, reducing the computational cost in applications that benefit from optimal (partial) transport. We finally provide comparisons with previous techniques and show some concrete applications. In particular, we show that LOPT is more robust and computationally efficient than previous methods in the presence of noise. For future work, we will continue to investigate the comparison between LHK and LOPT, and the potential applications of LOPT in other machine learning and data science tasks, such as barycenter problems, graph embedding, and task similarity measurement in transfer learning.

Acknowledgements

This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contracts No. HR00112190132 and No. HR00112190135. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force, DARPA, or other funding agencies.

References

  • [1] Akram Aldroubi, Shiying Li, and Gustavo K Rohde. Partitioning signal classes using transport transforms for data analysis and machine learning. Sampling Theory, Signal Processing, and Data Analysis, 19(1):1–25, 2021.
  • [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
  • [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [4] Yikun Bai, Bernard Schmitzer, Mathew Thorpe, and Soheil Kolouri. Sliced optimal partial transport. arXiv preprint arXiv:2212.08049, 2022.
  • [5] Saurav Basu, Soheil Kolouri, and Gustavo K Rohde. Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry. Proceedings of the National Academy of Sciences, 111(9):3448–3453, 2014.
  • [6] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
  • [7] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
  • [8] Luis A Caffarelli and Robert J McCann. Free boundaries in optimal transport and monge-ampere obstacle problems. Annals of mathematics, pages 673–730, 2010.
  • [9] Tianji Cai, Junyi Cheng, Nathaniel Craig, and Katy Craig. Linearized optimal transport for collider events. Physical Review D, 102(11):116019, 2020.
  • [10] Tianji Cai, Junyi Cheng, Bernhard Schmitzer, and Matthew Thorpe. The linearized hellinger–kantorovich distance. SIAM Journal on Imaging Sciences, 15(1):45–83, 2022.
  • [11] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and fisher–rao metrics. Foundations of Computational Mathematics, 18(1):1–44, 2018.
  • [12] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and kantorovich formulations. Journal of Functional Analysis, 274(11):3090–3123, 2018.
  • [13] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster wasserstein distance estimation with the sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257–2269, 2020.
  • [14] Lenaic Chizat, Bernhard Schmitzer, Gabriel Peyre, and Francois-Xavier Vialard. An interpolating distance between optimal transport and fisher-rao. arXiv:1506.06430 [math], 2015.
  • [15] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems, 30, 2017.
  • [16] Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 274–289. Springer, 2014.
  • [17] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
  • [18] Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International conference on machine learning, pages 685–693. PMLR, 2014.
  • [19] Kilian Fatras, Thibault Séjourné, Rémi Flamary, and Nicolas Courty. Unbalanced minibatch optimal transport; applications to domain adaptation. In International Conference on Machine Learning, pages 3186–3197. PMLR, 2021.
  • [20] Alessio Figalli. The optimal partial transport problem. Archive for rational mechanics and analysis, 195(2):533–560, 2010.
  • [21] Alessio Figalli and Nicola Gigli. A new transportation distance between non-negative measures, with applications to gradients flows with dirichlet boundary conditions. Journal de mathématiques pures et appliquées, 94(2):107–130, 2010.
  • [22] Alessio Figalli and Federico Glaudo. An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows. European Mathematical Society, 2021.
  • [23] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. Pot: Python optimal transport. Journal of Machine Learning Research, 22(78):1–8, 2021.
  • [24] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Gan and vae from an optimal transport point of view. arXiv preprint arXiv:1706.01807, 2017.
  • [25] Kevin Guittet. Extended Kantorovich norms: a tool for optimization. PhD thesis, INRIA, 2002.
  • [26] Hicham Janati, Marco Cuturi, and Alexandre Gramfort. Wasserstein regularization for sparse multi-task regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1407–1416. PMLR, 2019.
  • [27] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the sixteenth annual ACM symposium on Theory of computing, pages 302–311, 1984.
  • [28] Sanggyun Kim, Rui Ma, Diego Mesa, and Todd P Coleman. Efficient bayesian inference methods via convex optimization and optimal transport. In 2013 IEEE International Symposium on Information Theory, pages 2259–2263. IEEE, 2013.
  • [29] Soheil Kolouri, Navid Naderializadeh, Gustavo K Rohde, and Heiko Hoffmann. Wasserstein embedding for graph learning. In International Conference on Learning Representations, 2020.
  • [30] Shinjini Kundu, Soheil Kolouri, Kirk I Erickson, Arthur F Kramer, Edward McAuley, and Gustavo K Rohde. Discovery and visualization of structural biomarkers from mri using transport-based morphometry. NeuroImage, 167:256–275, 2018.
  • [31] Jan Lellmann, Dirk A Lorenz, Carola Schonlieb, and Tuomo Valkonen. Imaging with kantorovich–rubinstein discrepancy. SIAM Journal on Imaging Sciences, 7(4):2833–2859, 2014.
  • [32] Matthias Liero, Alexander Mielke, and Giuseppe Savare. Optimal entropy-transport problems and a new hellinger–kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969–1117, 2018.
  • [33] Huidong Liu, Xianfeng Gu, and Dimitris Samaras. Wasserstein gan with quadratic transport cost. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4832–4841, 2019.
  • [34] Caroline Moosmüller and Alexander Cloninger. Linear optimal transport embedding: Provable wasserstein classification for certain rigid transformations and perturbations. Information and Inference: A Journal of the IMA, 12(1):363–389, 2023.
  • [35] Navid Naderializadeh, Joseph F Comer, Reed Andrews, Heiko Hoffmann, and Soheil Kolouri. Pooling by sliced-wasserstein embedding. Advances in Neural Information Processing Systems, 34:3389–3400, 2021.
  • [36] Khai Nguyen, Dang Nguyen, and Nhat Ho. Self-attention amortized distributional projection optimization for sliced wasserstein point-cloud reconstruction. arXiv preprint arXiv:2301.04791, 2023.
  • [37] Khai Nguyen, Dang Nguyen, Tung Pham, Nhat Ho, et al. Improving mini-batch optimal transport via partial transportation. In International Conference on Machine Learning, pages 16656–16690. PMLR, 2022.
  • [38] F. Santambrogio. Optimal Transport for Applied Mathematicians. Calculus of Variations, PDEs and Modeling. Birkhäuser, 2015.
  • [39] Meyer Scetbon and Marco Cuturi. Low-rank optimal transport: Approximation, statistics and debiasing. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [40] Cedric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Society, 2003.
  • [41] Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269, 2013.
  • [42] Jianbo Ye, Panruo Wu, James Z Wang, and Jia Li. Fast discrete distribution clustering using wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.

Appendix

We refer to the main text for references.

Appendix A Notation

  • $(\mathbb{R}^{d},\|\cdot\|)$, $d$-dimensional Euclidean space endowed with the standard Euclidean norm $\|x\|=\sqrt{\sum_{k=1}^{d}|x(k)|^{2}}$. $d(\cdot,\cdot)$ is the associated distance, i.e., $d(x,y)=\|x-y\|$.

  • $x\cdot y$, canonical inner product in $\mathbb{R}^{d}$.

  • $1_{N}$, column vector of size $N\times 1$ with all entries equal to 1.

  • $\operatorname{diag}(p_{1},\ldots,p_{N})$, diagonal matrix.

  • $w^{T}$, transpose of a matrix $w$.

  • $\mathbb{R}_{+}$, non-negative real numbers.

  • $\Omega$, convex and compact subset of $\mathbb{R}^{d}$. $\Omega^{2}=\Omega\times\Omega$.

  • $\mathcal{P}(\Omega)$, set of Borel probability measures defined on $\Omega$.

  • $\mathcal{M}_{+}(\Omega)$, set of positive finite Borel measures defined on $\Omega$.

  • $\mathcal{M}$, set of signed measures.

  • $\pi_{0}:\Omega^{2}\to\Omega$, $\pi_{0}(x^{0},x^{1}):=x^{0}$; $\pi_{1}:\Omega^{2}\to\Omega$, $\pi_{1}(x^{0},x^{1}):=x^{1}$, standard projections.

  • $F_{\#}\mu$, push-forward of the measure $\mu$ by the function $F$: $F_{\#}\mu(B)=\mu(F^{-1}(B))$ for every measurable set $B$, where $F^{-1}(B)=\{x:\,F(x)\in B\}$.

  • $\delta_{x}$, Dirac measure concentrated at $x$.

  • $\mu=\sum_{n=1}^{N}p_{n}\delta_{x_{n}}$, discrete measure ($x_{n}\in\Omega$, $p_{n}\in\mathbb{R}_{+}$). The coefficients $p_{n}$ are called the weights of $\mu$.

  • $\operatorname{supp}(\mu)$, support of the measure $\mu$.

  • $\mu^{i}\wedge\mu^{j}$, minimum measure between $\mu^{i}$ and $\mu^{j}$; $p^{i}\wedge p^{j}$, vector having at each entry $n$ the minimum of the values $p^{i}_{n}$ and $p^{j}_{n}$.

  • μ0\mu^{0} reference measure. μi,μj\mu^{i},\mu^{j} target measures. In the OT framework, they are in 𝒫(Ω)\mathcal{P}(\Omega). In the OPT framework, they are in +(Ω)\mathcal{M}_{+}(\Omega).

  • Γ(μ0,μj)={γ𝒫(Ω2):π0#γ=μ0,π1#γ=μj}\Gamma(\mu^{0},\mu^{j})=\{\gamma\in\mathcal{P}(\Omega^{2}):\,\pi_{0\#}\gamma=\mu^{0},\,\pi_{1\#}\gamma=\mu^{j}\}, set of Kantorovich transport plans.

  • C(γ;μ0,μj)=Ω2x0xj2𝑑γ(x0,xj)C(\gamma;\mu^{0},\mu^{j})=\int_{\Omega^{2}}\|x^{0}-x^{j}\|^{2}d\gamma(x^{0},x^{j}), Kantorovich cost given by the transportation plan γ\gamma between μ0\mu^{0} and μj\mu^{j}.

  • Γ(μ0,μj)\Gamma^{*}(\mu^{0},\mu^{j}), set of optimal Kantorovich transport plans.

  • TT, optimal transport Monge map.

  • id\mathrm{id}, identity map id(x)=x\mathrm{id}(x)=x.

  • Tt(x)=(1t)x+tT(x)T_{t}(x)=(1-t)x+tT(x) for T:ΩΩT:\Omega\to\Omega. Viewed as a function of (t,x)(t,x), it is a flow.

  • In section 2: ρ𝒫([0,1]×Ω)\rho\in\mathcal{P}([0,1]\times\Omega), curve of measures. At each 0t10\leq t\leq 1, ρt𝒫(Ω)\rho_{t}\in\mathcal{P}(\Omega). In section 3: analogous, replacing 𝒫\mathcal{P} by +\mathcal{M}_{+}.

  • v:[0,1]×Ωdv:[0,1]\times\Omega\to\mathbb{R}^{d}. At each time tt, vt:Ωdv_{t}:\Omega\to\mathbb{R}^{d} is a vector field. v0v_{0}, initial velocity.

  • V\nabla\cdot V, divergence of the vector field VV with respect to the spatial variable xx.

  • tρ+ρv=0\partial_{t}\rho+\nabla\cdot\rho v=0, continuity equation.

  • 𝒞(μ0,μj)\mathcal{CE}(\mu^{0},\mu^{j}), set of solutions of the continuity equation with boundary conditions μ0\mu^{0} and μj\mu^{j}.

  • tρ+ρv=ζ\partial_{t}\rho+\nabla\cdot\rho v=\zeta, continuity equation with forcing term ζ\zeta.

  • 𝒞(μ0,μj)\mathcal{FCE}(\mu^{0},\mu^{j}), set of solutions (ρ,v,ζ)(\rho,v,\zeta) of the continuity equation with forcing term, with boundary conditions μ0\mu^{0} and μj\mu^{j}.

  • OT(μ0,μj)OT(\mu^{0},\mu^{j}), optimal transport minimization problem. See (1) for the Kantorovich formulation, and (6) for the dynamic formulation.

  • W2(μ0,μj)=OT(μ0,μj)W_{2}(\mu^{0},\mu^{j})=\sqrt{OT(\mu^{0},\mu^{j})}, 22-Wasserstein distance.

  • 𝒯μ=L2(Ω;d,μ)\mathcal{T}_{\mu}=L^{2}(\Omega;\mathbb{R}^{d},\mu), where uμ2=Ωu(x)2𝑑μ(x)\|u\|_{\mu}^{2}=\int_{\Omega}\|u(x)\|^{2}d\mu(x) (if μ\mu is discrete, 𝒯μ\mathcal{T}_{\mu} is identified with d×N\mathbb{R}^{d\times N} and uμ2=n=1Nu(n)2pn\|u\|_{\mu}^{2}=\sum_{n=1}^{N}\|u(n)\|^{2}p_{n}).

  • LOTμ0(μi,μj)LOT_{\mu^{0}}(\mu^{i},\mu^{j}), see (14) for continuous densities, and (19) for discrete measures.

  • λ>0\lambda>0, penalization in OPT.

  • μν\mu\leq\nu if μ(E)ν(E)\mu(E)\leq\nu(E) for all measurable sets EE; in this case we say that μ\mu is dominated by ν\nu.

  • Γ(μ0,μj)={γ+(Ω2):π0#γμ0,π1#γμj}\Gamma_{\leq}(\mu^{0},\mu^{j})=\{\gamma\in\mathcal{M}_{+}(\Omega^{2}):\,\pi_{0\#}\gamma\leq\mu^{0},\,\pi_{1\#}\gamma\leq\mu^{j}\}, set of partial transport plans.

  • γ0:=π0#γ\gamma_{0}:=\pi_{0\#}\gamma, γ1:=π1#γ\gamma_{1}:=\pi_{1\#}\gamma, marginals of γ+(Ω2)\gamma\in\mathcal{M}_{+}(\Omega^{2}).

  • ν0=μ0γ0\nu_{0}=\mu^{0}-\gamma_{0} (for the reference μ0\mu^{0}), νj=μjγ1\nu^{j}=\mu^{j}-\gamma_{1} (for the target μj\mu^{j}).

  • C(γ;μ0,μj,λ)=x0xj2𝑑γ(x0,xj)+λ(|μ0γ0|+|μjγ1|)C(\gamma;\mu^{0},\mu^{j},\lambda)=\int\|x^{0}-x^{j}\|^{2}d\gamma(x^{0},x^{j})+\lambda(|\mu^{0}-\gamma_{0}|+|\mu^{j}-\gamma_{1}|), partial transport cost given by the partial transportation plan γ\gamma between μ0\mu^{0} and μj\mu^{j} with penalization λ\lambda (a short numerical sketch of this discrete cost is given right after this list).

  • Γ(μ0,μj)\Gamma_{\leq}^{*}(\mu^{0},\mu^{j}), set of optimal partial transport plans.

  • OPT(μ0,μj)OPT(\mu^{0},\mu^{j}), partial optimal transport minimization problem between μ0\mu^{0} and μj\mu^{j}. See (23) for the static formulation, and (25) for the dynamic formulation.

  • vμ,2λ2:=Ωmin(v2,2λ)𝑑μ\|v\|_{\mu,2\lambda}^{2}:=\int_{\Omega}\min(\|v\|^{2},2\lambda)d\mu

  • LOPTλ(μi,μj)LOPT_{\lambda}(\mu^{i},\mu^{j}), see (35), and (42) and the discussion above.

  • μjuj\mu^{j}\mapsto u^{j}, LOT embedding (fixing first a reference μ0𝒫(Ω)\mu^{0}\in\mathcal{P}(\Omega)). If μ0\mu^{0} has continuous density, uj:=v0j=Tjidu^{j}:=v_{0}^{j}=T^{j}-\mathrm{id}, and so it is a map from measures μj𝒫(Ω)\mu^{j}\in\mathcal{P}(\Omega) to vector fields uju^{j} defined in Ω\Omega, see (12). For discrete measures (μ0\mu^{0}, μj\mu^{j} discrete), the barycentric projection (16) is needed first, and in this case the embedding is a map from discrete probability measures to d×N0\mathbb{R}^{d\times N_{0}}, see (17).

  • μ^\hat{\mu}, OT and OPT barycentric projection of μ\mu with respect to a reference μ0\mu^{0} (in 𝒫\mathcal{P} or +\mathcal{M}_{+}, resp.), see (16) and (36), respectively. x^n\hat{x}_{n} denote the new locations where μ^\hat{\mu} is concentrated. pn0p_{n}^{0} and p^n\hat{p}_{n} denote the weights of μ^\hat{\mu} for OT and OPT, respectively.

  • ut=(1t)ui+tuju_{t}=(1-t)u^{i}+tu^{j}, geodesic in LOT embedding space.

  • ρ^t\hat{\rho}_{t}, LOT geodesic.

  • LOPT embedding, see (33) and (40).

  • OPT interpolating curve, and LOPT interpolating curve are defined in section 3.5.
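For concreteness, the two discrete quantities used repeatedly below — the partial transport cost C(γ;μ0,μj,λ)C(\gamma;\mu^{0},\mu^{j},\lambda) and the truncated norm vμ,2λ2\|v\|_{\mu,2\lambda}^{2} — can be evaluated as in the following NumPy sketch (all names are illustrative; this is not part of the released code).

```python
import numpy as np

def partial_transport_cost(gamma, x0, xj, p0, pj, lam):
    """Discrete C(gamma; mu^0, mu^j, lambda): transport term plus mass penalties.

    gamma : (N0, Nj) nonnegative plan with row sums <= p0 and column sums <= pj.
    x0, xj: (N0, d) and (Nj, d) support points; p0, pj: weights; lam: penalty.
    """
    sq_dist = ((x0[:, None, :] - xj[None, :, :]) ** 2).sum(-1)   # ||x^0_n - x^j_m||^2
    transport = (sq_dist * gamma).sum()
    destroyed = np.abs(p0 - gamma.sum(axis=1)).sum()             # |mu^0 - gamma_0|
    created = np.abs(pj - gamma.sum(axis=0)).sum()               # |mu^j - gamma_1|
    return transport + lam * (destroyed + created)

def truncated_norm_sq(v, p, lam):
    """Discrete ||v||^2_{mu, 2 lambda} = int min(||v||^2, 2 lambda) d mu."""
    return (np.minimum((v ** 2).sum(axis=-1), 2 * lam) * p).sum()
```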

Appendix B Proof of lemma 2.1

Proof.

We fix γΓ(μ0,μj)\gamma^{*}\in\Gamma^{*}(\mu^{0},\mu^{j}) (although in subsection 2.4 we use the notation γjΓ(μ0,μj)\gamma^{j}\in\Gamma^{*}(\mu^{0},\mu^{j}), here we write γ\gamma^{*} to emphasize that we are fixing an optimal plan), and then we compute the map xn0x^njx_{n}^{0}\mapsto\hat{x}_{n}^{j} according to (16). This map induces a transportation plan γ^:=diag(p10,,pN00)+N0×N0\hat{\gamma}^{*}:=\operatorname{diag}({p}_{1}^{0},\ldots,p_{N_{0}}^{0})\in\mathbb{R}_{+}^{N_{0}\times N_{0}}. We will prove that γ^Γ(μ0,μ^j)\hat{\gamma}^{*}\in\Gamma^{*}(\mu^{0},\hat{\mu}^{j}). By definition, its marginals are μ0\mu^{0} and μ^j\hat{\mu}^{j}; the delicate part is to prove optimality. Let γ^Γ(μ0,μ^j)\hat{\gamma}\in\Gamma(\mu^{0},\hat{\mu}^{j}) be an arbitrary transportation plan between μ0\mu^{0} and μ^j\hat{\mu}^{j} (i.e., \hat{\gamma}1_{N_{0}}=\hat{\gamma}^{T}1_{N_{0}}=p^{0}). We need to show

C(γ^;μ0,μ^j)C(γ^;μ0,μ^j).\displaystyle C(\hat{\gamma}^{*};\mu^{0},\hat{\mu}^{j})\leq C(\hat{\gamma};\mu^{0},\hat{\mu}^{j}). (43)

Since γ^\hat{\gamma}^{*} is a diagonal matrix with positive diagonal, its inverse matrix is (γ^)1=diag(1/p10,,1/pN00)(\hat{\gamma}^{*})^{-1}=\operatorname{diag}(1/p_{1}^{0},\ldots,1/p_{N_{0}}^{0}). We define

γ:=γ^(γ^)1γ+N0×Nj.\gamma:=\hat{\gamma}(\hat{\gamma}^{*})^{-1}\gamma^{*}\in\mathbb{R}_{+}^{N_{0}\times N_{j}}.

We claim γΓ(μ0,μj)\gamma\in\Gamma(\mu^{0},\mu^{j}). Indeed,

γ1Nj\displaystyle\gamma 1_{N_{j}} =γ^(γ^)1γ1Nj=γ^(γ^)1p0=γ^1N0=p0\displaystyle=\hat{\gamma}(\hat{\gamma}^{*})^{-1}\gamma^{*}1_{N_{j}}=\hat{\gamma}(\hat{\gamma}^{*})^{-1}p^{0}=\hat{\gamma}1_{N_{0}}=p^{0} (44)
γT1N0\displaystyle\gamma^{T}1_{N_{0}} =(γ)T(γ^)1γ^T1N0=(γ)T(γ^)1p0=(γ)T1N0=pj,\displaystyle=(\gamma^{*})^{T}(\hat{\gamma}^{*})^{-1}\hat{\gamma}^{T}1_{N_{0}}=(\gamma^{*})^{T}(\hat{\gamma}^{*})^{-1}p^{0}=(\gamma^{*})^{T}1_{N_{0}}=p^{j}, (45)

where the second equality in (44) and the fourth equality in (45) follow from the fact that γΓ(μ0,μj)\gamma^{*}\in\Gamma(\mu^{0},\mu^{j}), the fourth equality in (44) and the second equality in (45) hold since γ^Γ(μ0,μ^j)\hat{\gamma}\in\Gamma(\mu^{0},\hat{\mu}^{j}), and the remaining equalities use the definitions of γ\gamma and γ^\hat{\gamma}^{*}.

Since γ\gamma^{*} is optimal for OT(μ0,μj)OT(\mu^{0},\mu^{j}), we have C(γ;μ0,μj)C(γ;μ0,μj)C(\gamma^{*};\mu^{0},\mu^{j})\leq C(\gamma;\mu^{0},\mu^{j}), where C(;μ0,μj)C(\cdot\,;\mu^{0},\mu^{j}) denotes the transportation cost. We write

C(γ;μ0,μj)\displaystyle C(\gamma^{*};\mu^{0},\mu^{j}) =n=1N0m=1Njxn0xmj2γn,m\displaystyle=\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}\|x_{n}^{0}-x_{m}^{j}\|^{2}\gamma^{*}_{n,m}
\displaystyle=\sum_{n=1}^{N_{0}}\|x_{n}^{0}\|^{2}p_{n}^{0}+\sum_{m=1}^{N_{j}}\|x_{m}^{j}\|^{2}p_{m}^{j}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}x_{n}^{0}\cdot x_{m}^{j}\,\gamma_{n,m}^{*}
\displaystyle=K_{1}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}x_{n}^{0}\cdot x_{m}^{j}\,\gamma_{n,m}^{*},

where K1K_{1} is a constant (which only depends on μ0,μj\mu^{0},\mu^{j}). Similarly,

C(γ;μ0,μj)\displaystyle C(\gamma;\mu^{0},\mu^{j}) =K12n=1N0m=1Njxn0xmjγn,m\displaystyle=K_{1}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}x_{n}^{0}\cdot x_{m}^{j}\,\gamma_{n,m}
=K12n=1N0m=1Nj=1N0γ^n,γ,mp0xn0xmj\displaystyle=K_{1}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}\sum_{\ell=1}^{N_{0}}\frac{\hat{\gamma}_{n,\ell}\gamma^{*}_{\ell,m}}{p^{0}_{\ell}}\,x_{n}^{0}\cdot x_{m}^{j}

Analogously, we have

\displaystyle C(\hat{\gamma}^{*};\mu^{0},\hat{\mu}^{j}) =\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{0}}\|x_{n}^{0}-\hat{x}_{m}^{j}\|^{2}\hat{\gamma}_{n,m}^{*}
=n=1N0xn02pn0+m=1N0x^mj2pm02n=1N0m=1N0xn0x^mjγ^n,m\displaystyle=\sum_{n=1}^{N_{0}}\|x_{n}^{0}\|^{2}p_{n}^{0}+\sum_{m=1}^{N_{0}}\|\hat{x}_{m}^{j}\|^{2}p_{m}^{0}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{0}}x_{n}^{0}\cdot\hat{x}_{m}^{j}\,\hat{\gamma}^{*}_{n,m}
=K22n=1N0m=1Njxn0xmjγn,m\displaystyle=K_{2}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}x^{0}_{n}\cdot x_{m}^{j}\,\gamma_{n,m}^{*}
=K2K1+C(γ;μ0,μj)\displaystyle=K_{2}-K_{1}+C(\gamma^{*};\mu^{0},\mu^{j}) (46)

where K2K_{2} is constant (which only depends on μ0\mu^{0} and μ^j\hat{\mu}^{j}) and similarly

C(γ^;μ0,μ^j)\displaystyle C(\hat{\gamma};\mu^{0},\hat{\mu}^{j}) =K22n=1N0m=1Nj=1N0γ^n,γ,mp0xn0xmj\displaystyle=K_{2}-2\sum_{n=1}^{N_{0}}\sum_{m=1}^{N_{j}}\sum_{\ell=1}^{N_{0}}\frac{\hat{\gamma}_{n,\ell}\gamma^{*}_{\ell,m}}{p^{0}_{\ell}}\,x_{n}^{0}\cdot x_{m}^{j}
\displaystyle=K_{2}-K_{1}+C(\gamma;\mu^{0},\mu^{j}) (47)

Therefore, by (46) and (47), we have that C(γ;μ0,μj)C(γ;μ0,μj)C(\gamma^{*};\mu^{0},\mu^{j})\leq C(\gamma;\mu^{0},\mu^{j}) if and only if C(γ^;μ0,μ^j)C(γ^;μ0,μ^j)C(\hat{\gamma}^{*};\mu^{0},\hat{\mu}^{j})\leq C(\hat{\gamma};\mu^{0},\hat{\mu}^{j}). So, we conclude the proof by the fact that γ\gamma^{*} is optimal for OT(μ0,μj)OT(\mu^{0},\mu^{j}). ∎
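The statement can also be verified numerically. The sketch below assumes the POT library (`ot.dist`, `ot.emd`, `ot.emd2`); it computes an optimal plan γ\gamma^{*}, forms the barycentric projection of (16), and compares the cost of the induced diagonal plan with the exact value of OT(μ0,μ^j)OT(\mu^{0},\hat{\mu}^{j}). All variable names are illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
N0, Nj, d = 30, 40, 2
x0, xj = rng.normal(size=(N0, d)), rng.normal(size=(Nj, d)) + 1.0
p0, pj = np.full(N0, 1 / N0), np.full(Nj, 1 / Nj)

M = ot.dist(x0, xj)                       # squared Euclidean cost matrix
gamma_star = ot.emd(p0, pj, M)            # optimal plan for OT(mu^0, mu^j)

# Barycentric projection (16): x_hat_n = (1/p^0_n) * sum_m gamma*_{n,m} x^j_m
x_hat = (gamma_star @ xj) / p0[:, None]

# Cost of the induced diagonal plan diag(p^0) versus the exact OT(mu^0, mu_hat^j)
diag_cost = (p0 * ((x0 - x_hat) ** 2).sum(axis=1)).sum()
exact = ot.emd2(p0, p0, ot.dist(x0, x_hat))
print(diag_cost, exact)                   # the two values coincide, as Lemma 2.1 asserts
```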

Appendix C Proof of Proposition 2.2

Lemma C.1.

Given discrete measures μ0=k=1N0pk0δxk0,μ1=k=1N0pk0δxk1,μ2=k=1N0pk0δxk2\mu^{0}=\sum_{k=1}^{N_{0}}p^{0}_{k}\delta_{x^{0}_{k}},\mu^{1}=\sum_{k=1}^{N_{0}}p^{0}_{k}\delta_{x^{1}_{k}},\mu^{2}=\sum_{k=1}^{N_{0}}p^{0}_{k}\delta_{x^{2}_{k}}, suppose that the maps xk0xk1x_{k}^{0}\mapsto x_{k}^{1} and xk0xk2x^{0}_{k}\mapsto x_{k}^{2} solve OT(μ0,μ1)OT(\mu^{0},\mu^{1}) and OT(μ0,μ2)OT(\mu^{0},\mu^{2}), respectively. For t[0,1]t\in[0,1], define μt:=k=1N0pk0δ(1t)xk1+txk2\mu_{t}:=\sum_{k=1}^{N_{0}}p_{k}^{0}\delta_{(1-t)x_{k}^{1}+tx_{k}^{2}}. Then, the mapping Tt:xk0(1t)xk1+txk2T_{t}:x_{k}^{0}\mapsto(1-t)x_{k}^{1}+tx_{k}^{2} solves the problem OT(μ0,μt)OT(\mu^{0},\mu_{t}).

Proof.

Let γ=diag(p10,,pN00)\gamma^{*}=\operatorname{diag}(p_{1}^{0},\ldots,p_{N_{0}}^{0}) be the corresponding transportation plan induced by TtT_{t}. Consider an arbitrary γ+N0×N0\gamma\in\mathbb{R}_{+}^{N_{0}\times N_{0}} such that γΓ(μ0,μt)\gamma\in\Gamma(\mu^{0},\mu_{t}). We need to show

C(γ;μ0,μt)C(γ;μ0,μt).C(\gamma^{*};\mu^{0},\mu_{t})\leq C(\gamma;\mu^{0},\mu_{t}).

We have

C(γ;μ0,μt)\displaystyle C(\gamma^{*};\mu^{0},\mu_{t}) =k,k=1N0xk0(1t)xk1txk22γk,k\displaystyle=\sum_{k,k^{\prime}=1}^{N_{0}}\|x^{0}_{k}-(1-t)x^{1}_{k^{\prime}}-tx_{k^{\prime}}^{2}\|^{2}\,\gamma^{*}_{k,k^{\prime}}
\displaystyle=\sum_{k=1}^{N_{0}}\|x_{k}^{0}\|^{2}p_{k}^{0}+\sum_{k^{\prime}=1}^{N_{0}}\|(1-t)x_{k^{\prime}}^{1}+tx_{k^{\prime}}^{2}\|^{2}p_{k^{\prime}}^{0}-2\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot((1-t)x_{k^{\prime}}^{1}+tx_{k^{\prime}}^{2})\,\gamma^{*}_{k,k^{\prime}}
\displaystyle=K-2\left[(1-t)\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{1}\,\gamma_{k,k^{\prime}}^{*}+t\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{2}\,\gamma_{k,k^{\prime}}^{*}\right], (48)

where, in the third equality, KK is a constant which only depends on the marginals μ0,μt\mu^{0},\mu_{t}. Similarly,

\displaystyle C(\gamma;\mu^{0},\mu_{t}) =K-2\left[(1-t)\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{1}\,\gamma_{k,k^{\prime}}+t\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{2}\,\gamma_{k,k^{\prime}}\right]. (49)

By the fact that γ,γΓ(μ0,μt)=Γ(μ0,μ1)=Γ(μ0,μ2)\gamma,\gamma^{*}\in\Gamma(\mu^{0},\mu_{t})=\Gamma(\mu^{0},\mu^{1})=\Gamma(\mu^{0},\mu^{2}), and that γ\gamma^{*} is optimal for OT(μ0,μ1)OT(\mu^{0},\mu^{1}) and OT(μ0,μ2)OT(\mu^{0},\mu^{2}), we have

\displaystyle\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{1}\,\gamma_{k,k^{\prime}}^{*}\geq\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{1}\,\gamma_{k,k^{\prime}}\qquad\text{ and }\qquad\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{2}\,\gamma_{k,k^{\prime}}^{*}\geq\sum_{k,k^{\prime}=1}^{N_{0}}x_{k}^{0}\cdot x_{k^{\prime}}^{2}\,\gamma_{k,k^{\prime}}.

Thus, by (48) and (49), we have C(γ;μ0,μt)C(γ;μ0,μt)C(\gamma^{*};\mu^{0},\mu_{t})\leq C(\gamma;\mu^{0},\mu_{t}), and this completes the proof. ∎

Proof of Proposition 2.2.

Consider 0st10\leq s\leq t\leq 1. By Lemma 2.1, the maps Ti:xk0x^kiT^{i}:x^{0}_{k}\mapsto\hat{x}^{i}_{k}, Tj:xk0x^kjT^{j}:x^{0}_{k}\mapsto\hat{x}^{j}_{k} solve OT(μ0,μ^i),OT(μ0,μ^j)OT(\mu^{0},\hat{\mu}^{i}),OT(\mu^{0},\hat{\mu}^{j}), respectively. Moreover, by Lemma C.1 (under the appropriate renaming), we have the mapping

xk0(1s)x^ki+sx^kj=xk0+(1s)uki+sukj,1kN0x^{0}_{k}\mapsto(1-s)\hat{x}^{i}_{k}+s\hat{x}_{k}^{j}=x_{k}^{0}+(1-s)u^{i}_{k}+su^{j}_{k},\qquad 1\leq k\leq N_{0}

solves OT(μ0,ρ^s)OT(\mu^{0},\hat{\rho}_{s}), and similarly

xk0(1t)x^ki+tx^kj=xk0+(1t)uki+tukj,1kN0x^{0}_{k}\mapsto(1-t)\hat{x}^{i}_{k}+t\hat{x}_{k}^{j}=x_{k}^{0}+(1-t)u^{i}_{k}+tu^{j}_{k},\qquad 1\leq k\leq N_{0}

solves OT(μ0,ρ^t)OT(\mu^{0},\hat{\rho}_{t}).

Thus,

LOTμ0(ρ^s,ρ^t)\displaystyle LOT_{\mu^{0}}(\hat{\rho}_{s},\hat{\rho}_{t}) =k=1N0((1s)x^ki+sx^kj)((1t)x^ki+tx^kj)2pk0\displaystyle=\sum_{k=1}^{N_{0}}\|((1-s)\hat{x}^{i}_{k}+s\hat{x}_{k}^{j})-((1-t)\hat{x}^{i}_{k}+t\hat{x}_{k}^{j})\|^{2}\,p_{k}^{0}
=k=1N0(ts)2x^kix^kj2pk0\displaystyle=\sum_{k=1}^{N_{0}}(t-s)^{2}\|\hat{x}_{k}^{i}-\hat{x}_{k}^{j}\|^{2}\,p_{k}^{0}
=(ts)2LOTμ0(μi,μj),\displaystyle=(t-s)^{2}LOT_{\mu^{0}}(\mu^{i},\mu^{j}),

and this finishes the proof. ∎

Appendix D Proof of proposition 3.1

This result relies on the definition (29) of the curve ρt\rho_{t}, which is inspired by the fact that solutions of non-homogeneous equations are given by solutions of the associated homogeneous equation plus a particular solution of the non-homogeneous one. We choose the first term of ρt\rho_{t} as a solution of a (homogeneous) continuity equation (γt\gamma_{t} defined in (28)), and the second term (γ¯t:=(1t)ν0+tνj\overline{\gamma}_{t}:=(1-t)\nu_{0}+t\nu^{j}) as an appropriate particular solution of the equation (26) with forcing term ζ\zeta defined in (31). Moreover, the curve ρt\rho_{t} will become optimal for (25) since both γt\gamma_{t} and γ¯t\overline{\gamma}_{t} are ‘optimal’ in different senses. On the one hand, γt\gamma_{t} is optimal for a classical optimal transport problem (here we strongly use that the fixed partial optimal transport plan γΓ(μ0,μj)\gamma^{*}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{j}) in the hypothesis of the proposition has the form γ=(id×T)#γ0\gamma^{*}=(\mathrm{id}\times T)_{\#}\gamma_{0}^{*} for a map T:ΩΩT:\Omega\to\Omega). On the other hand, γ¯t\overline{\gamma}_{t} is defined as a line interpolating the new part introduced by the framework of partial transportation, namely the ‘destruction and creation of mass’.

Although this is the core idea behind the proposition, to prove it we need several lemmas to deal with the details and subtleties.

First, we mention that the push-forward measure F#μF_{\#}\mu can be defined so that it satisfies the change of variables formula

Ag(x)𝑑F#μ(x)=F1(A)g(F(x))𝑑μ(x)\int_{A}g(x)\,dF_{\#}\mu(x)=\int_{F^{-1}(A)}g(F(x))\,d\mu(x)

for all measurable set AA, and all measurable function gg, where F1(A)={x:F(x)A}F^{-1}(A)=\{x:\,F(x)\in A\}. This is a general fact, and as an immediate consequence, we have conservation of mass

|F#μ|=|μ|.|F_{\#}\mu|=|\mu|.

Then, in our case, the second term in the cost function (24) can be written as

\lambda(|\mu^{0}-\gamma_{0}|+|\mu^{j}-\gamma_{1}|)=\lambda(|\mu^{0}|+|\mu^{j}|-2|\gamma|)

since γ0\gamma_{0} and γ1\gamma_{1} are dominated by μ0\mu^{0} and μj\mu^{j}, respectively, in the sense that γ0μ0\gamma_{0}\leq\mu^{0} and γ1μj\gamma_{1}\leq\mu^{j}.

Finally, we would like to point out that when we say that (ρ,v,ζ)(\rho,v,\zeta) is a solution for equation (26), we mean that it is a weak solution or, equivalently, a solution in the distributional sense [38, Section 4]. That is, for any continuously differentiable test function ψ:Ω\psi:\Omega\to\mathbb{R} with compact support, (ρ,v,ζ)(\rho,v,\zeta) satisfies

t(Ωψ𝑑ρt)Ωψvtdρt=Ωψ𝑑ζt,\displaystyle\partial_{t}\left(\int_{\Omega}\psi\,d\rho_{t}\right)-\int_{\Omega}\nabla\psi\cdot v_{t}\,d\rho_{t}=\int_{\Omega}\psi\,d\zeta_{t},

or, equivalently, for any continuously differentiable test function ϕ:[0,1]×Ω\phi:[0,1]\times\Omega\to\mathbb{R} with compact support, (ρ,v,ζ)(\rho,v,\zeta) satisfies

\displaystyle\int_{[0,1]\times\Omega}\partial_{t}\phi\,d\rho+\int_{[0,1]\times\Omega}\nabla\phi\cdot v\,d\rho+\int_{[0,1]\times\Omega}\phi\,d\zeta=\int_{\Omega}\phi(1,\cdot)\,d\mu^{j}-\int_{\Omega}\phi(0,\cdot)\,d\mu^{0}.
Lemma D.1.

If γ\gamma^{*} is optimal for OPTλ(μ0,μ1)OPT_{\lambda}(\mu^{0},\mu^{1}), and has marginals γ0\gamma_{0}^{*} and γ1\gamma_{1}^{*}, then γ\gamma^{*} is optimal for OPTλ(μ0,γ1)OPT_{\lambda}(\mu^{0},\gamma_{1}^{*}), OPTλ(γ0,μ1)OPT_{\lambda}(\gamma_{0}^{*},\mu^{1}) and OT(γ0,γ1)OT(\gamma_{0}^{*},\gamma_{1}^{*}).

Proof.

Let C(γ;μ0,μ1,λ)C(\gamma;\mu^{0},\mu^{1},\lambda) denote the objective function of the minimization problem OPTλ(μ0,μ1){OPT}_{\lambda}(\mu^{0},\mu^{1}), and similarly consider C(γ;γ0,μ1,λ),C(γ;μ0,γ1,λ)C(\gamma;\gamma_{0}^{*},\mu^{1},\lambda),C(\gamma;\mu^{0},\gamma_{1}^{*},\lambda). Also, let C(γ,γ0,γ1)C(\gamma,\gamma_{0}^{*},\gamma_{1}^{*}) be the objective function of OT(γ0,γ1){OT}(\gamma_{0}^{*},\gamma_{1}^{*}).

First, given an arbitrary plan γΓ(γ0,γ1)Γ(μ0,μ1)\gamma\in\Gamma(\gamma_{0}^{*},\gamma_{1}^{*})\subset\Gamma_{\leq}(\mu^{0},\mu^{1}), we have

C(γ;μ0,μ1,λ)\displaystyle C(\gamma;\mu^{0},\mu^{1},\lambda) =Ω2x0x12𝑑γ(x0,x1)+λ(|μ0γ0|+|μ1γ1|)\displaystyle=\int_{\Omega^{2}}\|x^{0}-x^{1}\|^{2}d\gamma(x^{0},x^{1})+\lambda(|\mu^{0}-\gamma_{0}|+|\mu^{1}-\gamma_{1}|)
=C(γ;γ0,γ1)+λ(|μ0γ0|+|μ1γ1|).\displaystyle=C(\gamma;\gamma_{0}^{*},\gamma_{1}^{*})+\lambda(|\mu^{0}-\gamma_{0}|+|\mu^{1}-\gamma_{1}|).

Since C(γ;μ0,μ1,λ)C(γ;μ0,μ1,λ)C(\gamma^{*};\mu^{0},\mu^{1},\lambda)\leq C(\gamma;\mu^{0},\mu^{1},\lambda), we have C(γ;γ0,γ1)C(γ;γ0,γ1)C(\gamma^{*};\gamma_{0}^{*},\gamma_{1}^{*})\leq C(\gamma;\gamma_{0}^{*},\gamma_{1}^{*}). Thus, γ\gamma^{*} is optimal for OT(γ0,γ1){OT}(\gamma_{0}^{*},\gamma_{1}^{*}).

Similarly, for every γΓ(γ0,μ1)\gamma\in\Gamma_{\leq}(\gamma_{0}^{*},\mu^{1}), we have C(γ;μ0,μ1,λ)=C(γ;γ0,μ1)+λ|μ0γ0|C(\gamma;\mu^{0},\mu^{1},\lambda)=C(\gamma;\gamma_{0}^{*},\mu^{1})+\lambda|\mu^{0}-\gamma^{*}_{0}| and thus γ\gamma^{*} is optimal for OPTλ(γ0,μ1){OPT}_{\lambda}(\gamma_{0}^{*},\mu^{1}). Analogously, γ\gamma^{*} is optimal for OPTλ(μ0,γ1){OPT}_{\lambda}(\mu^{0},\gamma_{1}^{*}). ∎

Lemma D.2.

If γΓ(μ0,μ1)\gamma^{*}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{1}), then d(supp(ν0),supp(ν1))2λd\left({\operatorname{supp}}(\nu^{0}),{\operatorname{supp}}(\nu^{1})\right)\geq\sqrt{2\lambda}, where ν0=μ0γ0\nu^{0}=\mu^{0}-\gamma_{0}, ν1=μ1γ1\nu^{1}=\mu^{1}-\gamma_{1}, and dd is the Euclidean distance in d\mathbb{R}^{d}.

Proof.

We will proceed by contradiction. Assume that d(supp(ν0),supp(ν1))<2λd\left(\operatorname{supp}(\nu^{0}),{\operatorname{supp}}(\nu^{1})\right)<\sqrt{2\lambda}. Then, there exist x~0supp(ν0)\widetilde{x}^{0}\in{\operatorname{supp}}(\nu^{0}) and x~1supp(ν1)\widetilde{x}^{1}\in{\operatorname{supp}}(\nu^{1}) such that x~0x~1<2λ\|\widetilde{x}^{0}-\widetilde{x}^{1}\|<\sqrt{2\lambda}. Moreover, we can choose ε>0\varepsilon>0 such that ν0(B(x~0,ε))>0\nu^{0}(B(\widetilde{x}^{0},\varepsilon))>0 and ν1(B(x~1,ε))>0\nu^{1}(B(\widetilde{x}^{1},\varepsilon))>0, where B(x~i,ε)B(\widetilde{x}^{i},\varepsilon) denotes the ball in d\mathbb{R}^{d} of radius ε\varepsilon centered at x~i\widetilde{x}^{i}.

We will construct γ~Γ(μ0,μ1)\widetilde{\gamma}\in\Gamma_{\leq}(\mu^{0},\mu^{1}) with transportation cost strictly less than OPTλ(μ0,μ1)OPT_{\lambda}(\mu^{0},\mu^{1}), leading to a contradiction. For i=0,1i=0,1 we denote by ν~i\widetilde{\nu}^{i} the probability measure given by 1νi(B(x~i,ε))νi\frac{1}{\nu^{i}(B(\widetilde{x}^{i},\varepsilon))}\nu^{i} restricted to B(x~i,ε)B(\widetilde{x}^{i},\varepsilon). Let δ=min{ν0(B(x~0,ε)),ν1(B(x~1,ε))}\delta=\min\{\nu^{0}(B(\widetilde{x}^{0},\varepsilon)),\nu^{1}(B(\widetilde{x}^{1},\varepsilon))\}. Now, we consider

γ~:=γ+δ(ν~0×ν~1).\widetilde{\gamma}:=\gamma^{*}+\delta(\,\widetilde{\nu}^{0}\times\widetilde{\nu}^{1}).

Then, γ~Γ(μ0,μ1)\widetilde{\gamma}\in\Gamma_{\leq}(\mu^{0},\mu^{1}) because

πi#γ~=γi+δν~iγi+νi=μi for i=0,1.\pi_{i\#}\widetilde{\gamma}=\gamma_{i}+\delta\widetilde{\nu}^{i}\leq\gamma_{i}+\nu^{i}=\mu^{i}\qquad\text{ for }i=0,1.

The partial transport cost of γ~\widetilde{\gamma} is

C(γ~,μ0,μ1,λ)\displaystyle C(\widetilde{\gamma},\mu^{0},\mu^{1},\lambda)
=Ω2x0x12𝑑γ~+λ(|μ0|+|μ1|2|γ~|)\displaystyle=\int_{\Omega^{2}}\|x^{0}-x^{1}\|^{2}\,d\widetilde{\gamma}+\lambda(|\mu^{0}|+|\mu^{1}|-2|\widetilde{\gamma}|)
\displaystyle=\int_{\Omega^{2}}\|x^{0}-x^{1}\|^{2}\,d\gamma^{*}+\delta\int_{\Omega^{2}}\|x^{0}-x^{1}\|^{2}\,d(\widetilde{\nu}^{0}\times\widetilde{\nu}^{1})+\lambda(|\mu^{0}|+|\mu^{1}|-2|\gamma^{*}|-2\delta|\widetilde{\nu}^{0}\times\widetilde{\nu}^{1}|)
=OPTλ(μ0,μ1)+δB(x~0,ε)×B(x~1,ε)x0x12d(ν~0×ν~1)2λδ\displaystyle=OPT_{\lambda}(\mu^{0},\mu^{1})+\delta\int_{B(\widetilde{x}^{0},\varepsilon)\times B(\widetilde{x}^{1},\varepsilon)}\|x^{0}-x^{1}\|^{2}\,d(\widetilde{\nu}^{0}\times\widetilde{\nu}^{1})-2\lambda\delta
<OPTλ(μ0,μ1)+2λδ2λδ=OPTλ(μ0,μ1),\displaystyle<OPT_{\lambda}(\mu^{0},\mu^{1})+2\lambda\delta-2\lambda\delta=OPT_{\lambda}(\mu^{0},\mu^{1}),

which is a contradiction.

Lemma D.3.

Using the notation and hypothesis of Proposition 3.1, for t(0,1)t\in(0,1) let Dt:={x:vt(x)0}D_{t}:=\{x:v_{t}(x)\neq 0\}. Then, γtν0γtνj0\gamma_{t}\wedge\nu_{0}\equiv\gamma_{t}\wedge\nu^{j}\equiv 0 over DtD_{t}.

Proof.

We will prove this for γtν0\gamma_{t}\wedge\nu_{0}. The other case is analogous. Suppose by contradiction that γtν0\gamma_{t}\wedge\nu_{0} is not the null measure. By Lemma D.1, we know that γ\gamma^{*} is optimal for OT(γ0,γ1)OT(\gamma_{0}^{*},\gamma_{1}^{*}). Since we are also assuming that γ\gamma^{*} is induced by a map TT, i.e., γ=(id×T)#γ0\gamma^{*}=(\mathrm{id}\times T)_{\#}\gamma_{0}^{*}, then TT is an optimal Monge map between γ0\gamma_{0}^{*} and γ1\gamma_{1}^{*}. Therefore, the map Tt:ΩΩT_{t}:\Omega\to\Omega is invertible (see for example [38, Lemma 5.29]). For ease of notation we will denote ν~0:=γtν0\widetilde{\nu}^{0}:=\gamma_{t}\wedge\nu_{0} as a measure in DtD_{t}.

The idea of the proof is to exhibit a plan γ~\widetilde{\gamma} with smaller OPTλOPT_{\lambda} cost than γ\gamma^{*}, which will be a contradiction since γΓ(μ0,μj)\gamma^{*}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{j}). Let us define

γ~:=γ+(I,TTt1)#ν~0(I,T)#(Tt1)#ν~0.\widetilde{\gamma}:=\gamma^{*}+(I,T\circ T_{t}^{-1})_{\#}\widetilde{\nu}^{0}-(I,T)_{\#}(T_{t}^{-1})_{\#}\widetilde{\nu}^{0}.

Then, γ~Γ(μ0,μj)\widetilde{\gamma}\in\Gamma_{\leq}(\mu^{0},\mu^{j}) since

π0#γ~=π0#γ+ν~0(Tt1)#ν~0γ0+ν~0μ0,\displaystyle\pi_{0\#}\widetilde{\gamma}=\pi_{0\#}\gamma^{*}+\widetilde{\nu}^{0}-(T_{t}^{-1})_{\#}\widetilde{\nu}^{0}\leq\gamma_{0}^{*}+\widetilde{\nu}^{0}\leq\mu^{0},
π1#γ~=π1#γ+(TTt1)#ν~0(TTt1)#ν~0=γ1.(and we know γ1μj).\displaystyle\pi_{1\#}\widetilde{\gamma}=\pi_{1\#}\gamma^{*}+(T\circ T_{t}^{-1})_{\#}\widetilde{\nu}^{0}-(T\circ T_{t}^{-1})_{\#}\widetilde{\nu}^{0}=\gamma_{1}^{*}.\qquad(\text{and we know }\gamma_{1}^{*}\leq\mu^{j}).

From the last equation we can also conclude that |γ~|=|γ1|=|γ||\widetilde{\gamma}|=|\gamma_{1}^{*}|=|\gamma^{*}|. Therefore, the partial transportation cost for γ~\widetilde{\gamma} is

Ω2xy2𝑑γ~(x,y)+λ(|μ0|+|μj|2|γ~|)\displaystyle\int_{\Omega^{2}}\|x-y\|^{2}d\widetilde{\gamma}(x,y)+\lambda(|\mu^{0}|+|\mu^{j}|-2|\widetilde{\gamma}|)
\displaystyle\qquad=\underbrace{\int_{\Omega^{2}}\|x-y\|^{2}d\gamma^{*}(x,y)+\lambda(|\mu^{0}|+|\mu^{j}|-2|\gamma^{*}|)}_{OPT_{\lambda}(\mu^{0},\mu^{j})}
+Dt(xT(Tt1(x))2Tt1(x)T(Tt1(x))2)𝑑ν~0(x)\displaystyle\qquad\qquad+\int_{D_{t}}\left(\|x-T(T_{t}^{-1}(x))\|^{2}-\|T_{t}^{-1}(x)-T(T_{t}^{-1}(x))\|^{2}\right)d\widetilde{\nu}^{0}(x)
\displaystyle\qquad<OPT_{\lambda}(\mu^{0},\mu^{j})

where in the last inequality we used that if y=Tt1(x)y=T_{t}^{-1}(x) we have

Tt(y)T(y)=(1t)y+tT(y)T(y)=(1t)yT(y)<yT(y) for t(0,1).\|T_{t}(y)-T(y)\|=\|(1-t)y+tT(y)-T(y)\|=(1-t)\|y-T(y)\|<\|y-T(y)\|\qquad\text{ for }t\in(0,1).

Lemma D.4.

Using the notation and hypothesis of Proposition 3.1, for t(0,1)t\in(0,1) the vector-valued measures vtν0v_{t}\cdot\nu_{0} and vtνjv_{t}\cdot\nu^{j} are exactly zero.

Proof.

We prove this for vtν0v_{t}\cdot\nu_{0}. The other case is analogous. We recall that the measure vtν0v_{t}\cdot\nu_{0} is determined by

ΩΦ(x)vt(x)𝑑ν0(x)\int_{\Omega}\Phi(x)\cdot v_{t}(x)\,d\nu_{0}(x)

for all measurable functions Φ:Ωd\Phi:\Omega\to\mathbb{R}^{d}, where Φ(x)vt(x)\Phi(x)\cdot v_{t}(x) denotes the usual dot product in d\mathbb{R}^{d} between the vectors Φ(x)\Phi(x) and vt(x)v_{t}(x).

From Lemma D.3 we know that γtν00\gamma_{t}\wedge\nu_{0}\equiv 0 over DtD_{t}. Therefore, γt\gamma_{t} and ν0\nu_{0} are mutually singular in that set. This implies that we can decompose DtD_{t} into two disjoint sets D1D_{1}, D2D_{2} such that γt(D1)=0=ν0(D2)\gamma_{t}(D_{1})=0=\nu_{0}(D_{2}). Since vtv_{t} is a function defined γt\gamma_{t}–almost everywhere (up to sets of null measure with respect to γt\gamma_{t}), we can assume without loss of generality that vt0v_{t}\equiv 0 on D1D_{1}.

Let Φ:Ωd\Phi:\Omega\to\mathbb{R}^{d} be a measurable vector-valued function over Ω\Omega. Using that vt0v_{t}\equiv 0 in ΩDt\Omega\setminus D_{t} and in D1D_{1}, and that ν0(D2)=0\nu_{0}(D_{2})=0, we obtain

ΩΦ(x)vt(x)𝑑ν0(x)=ΩDtΦ(x)vt(x)𝑑ν0(x)+D1Φ(x)vt(x)𝑑ν0(x)+D2Φ(x)vt(x)𝑑ν0(x)=0.\int_{\Omega}\Phi(x)\cdot v_{t}(x)\,d\nu_{0}(x)=\int_{\Omega\setminus D_{t}}\Phi(x)\cdot v_{t}(x)\,d\nu_{0}(x)+\int_{D_{1}}\Phi(x)\cdot v_{t}(x)\,d\nu_{0}(x)+\int_{D_{2}}\Phi(x)\cdot v_{t}(x)\,d\nu_{0}(x)=0.

Since Φ\Phi was arbitrary, we can conclude that vtν00v_{t}\cdot\nu_{0}\equiv 0. ∎

Lemma D.5.

Using the notation and hypothesis of Proposition 3.1, the measure γ¯t=(1t)ν0+tν1\overline{\gamma}_{t}=(1-t)\nu_{0}+t\nu^{1} satisfies the equation

{tγ¯+γ¯v=ζγ¯0=ν0,γ¯1=ν1.\begin{cases}\partial_{t}\overline{\gamma}+\nabla\cdot\overline{\gamma}v=\zeta\\ \overline{\gamma}_{0}=\nu_{0},\quad\overline{\gamma}_{1}=\nu^{1}.\end{cases} (50)
Proof.

From Lemma D.4 we have that vtγ¯t=(1t)vtν0+tvtν1=0v_{t}\cdot\overline{\gamma}_{t}=(1-t)v_{t}\cdot\nu_{0}+tv_{t}\cdot\nu^{1}=0, then γ¯v=0\nabla\cdot\overline{\gamma}v=0. It is easy to see that γ¯\overline{\gamma} satisfies the boundary conditions by replacing t=0t=0 and t=1t=1 in its definition. Also, tγ¯=ν1ν0\partial_{t}\overline{\gamma}=\nu^{1}-\nu_{0}. Then, since ζt=ν1ν0\zeta_{t}=\nu^{1}-\nu_{0} for every tt, we get tγ¯+γ¯v=ν1ν0+0=ζt\partial_{t}\overline{\gamma}+\nabla\cdot\overline{\gamma}v=\nu^{1}-\nu_{0}+0=\zeta_{t}. ∎

Proof of Proposition 3.1.

First, we address that v:[0,1]×Ωdv:[0,1]\times\Omega\to\mathbb{R}^{d} is well defined. Indeed, by Lemma D.1, we have that γ=(id×T)#γ0\gamma^{*}=(\mathrm{id}\times T)_{\#}\gamma_{0}^{*} is optimal for OT(γ0,γ1)OT(\gamma_{0}^{*},\gamma_{1}^{*}). Thus, for each (t,x)[0,1]×Ω(t,x)\in[0,1]\times\Omega, by the so-called cyclical monotonicity property of the support of classical optimal transport plans, there exists at most one x0supp(γ0)x_{0}\in\operatorname{supp}(\gamma_{0}^{*}) such that Tt(x0)=xT_{t}(x_{0})=x, see [38, Lemma 5.29]. So, vt(x)v_{t}(x) in (30) is well defined by understanding its definition as

vt(x)={T(x0)x0if x=Tt(x0), for x0supp(γ0)0elsewhere.v_{t}(x)=\begin{cases}T(x_{0})-x_{0}&\text{if }x=T_{t}(x_{0}),\quad\text{ for }x_{0}\in\operatorname{supp}(\gamma_{0}^{*})\\ 0&\text{elsewhere}.\end{cases}

Now, we check that (ρ,v,ζ)(\rho,v,\zeta) defined in (29), (30), (31) is a solution for (26). In fact,

ρt=γt+γ¯t\rho_{t}=\gamma_{t}+\overline{\gamma}_{t}

where γt\gamma_{t} is given by (28), and γ¯t\overline{\gamma}_{t} is as in Lemma D.5. Then, from section 2.2, (γt,v)(\gamma_{t},v) solves the continuity equation (5) since γ=(id×T)#γ0Γ(γ0,γ1)\gamma^{*}=(\mathrm{id}\times T)_{\#}\gamma_{0}^{*}\in\Gamma^{*}(\gamma_{0}^{*},\gamma_{1}^{*}) (Lemma D.1); from its definition, γt=(Tt)#γ0\gamma_{t}=(T_{t})_{\#}\gamma_{0}^{*} coincides with (9) in this context, and vtv_{t} coincides with (8) (the support of vtv_{t} lies on the support of γ0\gamma_{0}^{*}). On the other hand, from Lemma D.5, γ¯t\overline{\gamma}_{t} solves (50). Therefore, by linearity,

tρt+ρtvt=tγt+γtvt0+tγ¯t+γ¯tvtζt=ζt\displaystyle\partial_{t}\rho_{t}+\nabla\cdot\rho_{t}v_{t}=\underbrace{\partial_{t}\gamma_{t}+\nabla\cdot\gamma_{t}v_{t}}_{0}+\underbrace{\partial_{t}\overline{\gamma}_{t}+\nabla\cdot\overline{\gamma}_{t}v_{t}}_{\zeta_{t}}=\zeta_{t}

Finally, by plugging in (ρ,v,ζ)(\rho,v,\zeta) into the objective function in (25), we have:

[0,1]×Ωv2𝑑ρ+λ|ζ|\displaystyle\int_{[0,1]\times\Omega}\|v\|^{2}d\rho+\lambda|\zeta| =[0,1]×Ωv2𝑑γt𝑑t+λ|ζ|\displaystyle=\int_{[0,1]\times\Omega}\|v\|^{2}\,d\gamma_{t}\,dt+\lambda|\zeta|
=ΩT(x0)x02𝑑γ0(x0)+λ(|ν0|+|ν1|)\displaystyle=\int_{\Omega}\|T(x^{0})-x^{0}\|^{2}\,d\gamma_{0}^{*}(x^{0})+\lambda(|\nu_{0}|+|\nu^{1}|)
=Ω2x0xj2𝑑γ(x0,xj)+λ(|μ0γ0|+|μjγ1|)\displaystyle=\int_{\Omega^{2}}\|x^{0}-x^{j}\|^{2}\,d\gamma^{*}(x^{0},x^{j})+\lambda(|\mu^{0}-\gamma_{0}^{*}|+|\mu^{j}-\gamma_{1}^{*}|)
=OPTλ(μ0,μj)\displaystyle=OPT_{\lambda}(\mu^{0},\mu^{j})

since γΓ(μ0,μj)\gamma^{*}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{j}). So, this shows that (ρ,v,ζ)(\rho,v,\zeta) is minimum for (25).

The ‘moreover’ part holds from the above identities, using that

Ωv02𝑑γ0(x0)+λ(|ν0|+|ν1|)=Ωmin(v02,2λ)𝑑γ0(x0)+λ(|ν0|+|ν1|)\displaystyle\int_{\Omega}\|v_{0}\|^{2}d\gamma_{0}^{*}(x_{0})+\lambda(|\nu_{0}|+|\nu^{1}|)=\int_{\Omega}\min(\|v_{0}\|^{2},2\lambda)d\gamma_{0}^{*}(x_{0})+\lambda(|\nu_{0}|+|\nu^{1}|)
for v0(x0)=T(x0)x0,\displaystyle\text{ for }v_{0}(x^{0})=T(x^{0})-x^{0},

since γ\gamma^{*} is such that x0xj2<2λ\|x^{0}-x^{j}\|^{2}<2\lambda for all (x0,xj)supp(γ)(x^{0},x^{j})\in\operatorname{supp}(\gamma^{*}) [4, Lemma 3.2]. ∎

Appendix E Proof of Theorem 3.5

The proof will be similar to that of Lemma 2.1. To simplify the notation, in this proof, we use μ1\mu^{1} (and μ^1\hat{\mu}^{1}) to replace μj\mu^{j} (and μ^j\hat{\mu}^{j}).

E.1 Notation setup

We choose γ1Γ(μ0,μ1)\gamma^{1}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{1}). We will understand the second marginal distribution π1#γ1\pi_{1\#}\gamma^{1} induced by γ1\gamma^{1}, either as the vector γ11:=(γ1)T1N0\gamma^{1}_{1}:=(\gamma^{1})^{T}1_{N_{0}} or as the measure m=1N1(γ11)mδxm1\sum_{m=1}^{N_{1}}(\gamma^{1}_{1})_{m}\delta_{x_{m}^{1}}, and, by abuse of notation, we will write π1#γ1=(γ1)T1N0=γ11\pi_{1\#}\gamma^{1}=(\gamma^{1})^{T}1_{N_{0}}=\gamma_{1}^{1} (analogously for π0#γ1\pi_{0\#}\gamma^{1}).

Let γ^1:=diag(p^11,,p^N01)\hat{\gamma}^{1}:=\operatorname{diag}(\hat{p}_{1}^{1},\ldots,\hat{p}_{N_{0}}^{1}) denote the transportation plan induced by mapping xk0x^k1x_{k}^{0}\mapsto\hat{x}_{k}^{1}.

Let

D:={k{1,N0}:p^k1>0}.D:=\{k\in\{1,\dots N_{0}\}:\hat{p}_{k}^{1}>0\}.

We select γ^Γ(μ0,μ^1)\hat{\gamma}\in\Gamma_{\leq}(\mu^{0},\hat{\mu}^{1}).

With a slight abuse of notation, we define

\gamma:=\hat{\gamma}(\hat{\gamma}^{1})^{-1}\gamma^{1}

where we mean that (γ^1)1N0×N0(\hat{\gamma}^{1})^{-1}\in\mathbb{R}^{N_{0}\times N_{0}} is a diagonal matrix with:

((\hat{\gamma}^{1})^{-1})_{kk}=\begin{cases}\frac{1}{\hat{p}_{k}^{1}}&\text{if }\hat{p}_{k}^{1}>0\\ 0&\text{otherwise}\end{cases},\qquad\forall k\in[1:N_{0}].
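In matrix form, and with illustrative names, this composition of plans reads as in the following NumPy sketch (rows with p^k1=0\hat{p}_{k}^{1}=0 are simply annihilated by the pseudo-inverse):

```python
import numpy as np

def compose_plans(gamma_hat, p_hat1, gamma1):
    """Return gamma = gamma_hat (gamma_hat^1)^{-1} gamma^1 with the convention above.

    gamma_hat : (N0, N0) plan in Gamma_<=(mu^0, mu_hat^1)
    p_hat1    : (N0,) weights of mu_hat^1, i.e. the diagonal of gamma_hat^1
    gamma1    : (N0, N1) optimal partial plan between mu^0 and mu^1
    """
    inv_diag = np.zeros(len(p_hat1))
    np.divide(1.0, p_hat1, out=inv_diag, where=p_hat1 > 0)   # 1/p_hat1 on D, 0 elsewhere
    return gamma_hat @ (inv_diag[:, None] * gamma1)
```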

The goal is to show

C(\hat{\gamma}^{1};\mu^{0},\hat{\mu}^{1},\lambda)\leq C(\hat{\gamma};\mu^{0},\hat{\mu}^{1},\lambda). (51)

E.2 Connect OPTλ(μ0,μ1)OPT_{\lambda}(\mu^{0},\mu^{1}) and OPTλ(μ0,μ^1).OPT_{\lambda}(\mu^{0},\hat{\mu}^{1}).

First, we claim that γΓ(μ0,γ11)Γ(μ0,μ1)\gamma\in\Gamma_{\leq}(\mu^{0},\gamma_{1}^{1})\subset\Gamma_{\leq}(\mu^{0},\mu^{1}). Indeed, we have

\displaystyle\gamma 1_{N_{1}} =\hat{\gamma}(\hat{\gamma}^{1})^{-1}\gamma^{1}1_{N_{1}}=\hat{\gamma}(\hat{\gamma}^{1})^{-1}\hat{p}^{1}=\hat{\gamma}1_{D}=\hat{\gamma}_{0}\leq p^{0} (52)
γT1N0\displaystyle\gamma^{T}1_{N_{0}} =(γ1)T(γ^1)1γ^T1N0(γ1)T(γ^1)1p^1=(γ1)T1D=γ11,\displaystyle=(\gamma^{1})^{T}(\hat{\gamma}^{1})^{-1}\hat{\gamma}^{T}1_{N_{0}}\leq(\gamma^{1})^{T}(\hat{\gamma}^{1})^{-1}\hat{p}^{1}=(\gamma^{1})^{T}1_{D}=\gamma^{1}_{1}, (53)

where 1DN01_{D}\in\mathbb{R}^{N_{0}} is defined with (1D)k=1(1_{D})_{k}=1 if kDk\in D and (1D)k=0(1_{D})_{k}=0 elsewhere.

In (52), the equality γ^1D=γ^0\hat{\gamma}1_{D}=\hat{\gamma}_{0} follows from the following: since γ^T1N0p^1\hat{\gamma}^{T}1_{N_{0}}\leq\hat{p}^{1} and p^k1=0\hat{p}^{1}_{k^{\prime}}=0 for every kDk^{\prime}\notin D, we have γ^k,k=0\hat{\gamma}_{k,k^{\prime}}=0 for all kDk^{\prime}\notin D and k{1,,N0}k\in\{1,\ldots,N_{0}\}. In (53), the inequality follows from the fact that γ^Γ(μ0,μ^1)\hat{\gamma}\in\Gamma_{\leq}(\mu^{0},\hat{\mu}^{1}).

Note that, from (52), we also have γ01=γ^01\gamma^{1}_{0}=\hat{\gamma}^{1}_{0}.

We compute the transportation costs induced by γ1,γ,γ^1,γ^\gamma^{1},\gamma,\hat{\gamma}^{1},\hat{\gamma}:

C(γ1;μ0,μ1,λ)\displaystyle C(\gamma^{1};\mu^{0},\mu^{1},\lambda)
=i=1N1kDxk0xi12γk,i1+λ(|p0|+|p1|2|γ1|)\displaystyle=\sum_{i=1}^{N_{1}}\sum_{k\in D}\|x_{k}^{0}-x_{i}^{1}\|^{2}\gamma_{k,i}^{1}+\lambda(|p^{0}|+|p^{1}|-2|\gamma^{1}|)
=kDxk02(γ01)k+i=1N1xi12(γ11)i2kDi=1N1xk0xi1γk,i1+λ(|p0|+|p1|2|γ1|)\displaystyle=\sum_{k\in D}\|x^{0}_{k}\|^{2}(\gamma_{0}^{1})_{k}+\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}(\gamma_{1}^{1})_{i}-2\sum_{k\in D}\sum_{i=1}^{N_{1}}x_{k}^{0}\cdot x_{i}^{1}\,\gamma_{k,i}^{1}+\lambda(|p^{0}|+|p^{1}|-2|\gamma^{1}|)

Similarly,

C(γ^1;μ0,μ^1,λ)\displaystyle C(\hat{\gamma}^{1};\mu^{0},\hat{\mu}^{1},\lambda)
=kDxk02(γ^01)k+kDx^k12(γ^11)k2kDkDxk0x^k1γ^k,k1+λ(|p0|+|p^1|2|γ^1|)\displaystyle=\sum_{k\in D}\|x^{0}_{k}\|^{2}(\hat{\gamma}_{0}^{1})_{k}+\sum_{k^{\prime}\in D}\|\hat{x}_{k^{\prime}}^{1}\|^{2}(\hat{\gamma}^{1}_{1})_{k^{\prime}}-2\sum_{k\in D}\sum_{k^{\prime}\in D}x_{k}^{0}\cdot\hat{x}_{k^{\prime}}^{1}\,\hat{\gamma}_{k,k^{\prime}}^{1}+\lambda(|p^{0}|+|\hat{p}^{1}|-2|\hat{\gamma}^{1}|)
=kDxk02(γ^01)k+kDx^k12(γ^11)k2kDkDi=1N1xk0xi1γk,i1γ^k,k1γ^k,k1+λ(|p0|+|p^1|2|γ^1|)\displaystyle=\sum_{k\in D}\|x^{0}_{k}\|^{2}(\hat{\gamma}_{0}^{1})_{k}+\sum_{k^{\prime}\in D}\|\hat{x}_{k^{\prime}}^{1}\|^{2}(\hat{\gamma}^{1}_{1})_{k^{\prime}}-2\sum_{k\in D}\sum_{k^{\prime}\in D}\sum_{i=1}^{N_{1}}x_{k}^{0}\cdot x_{i}^{1}\,\gamma^{1}_{k^{\prime},i}\frac{\hat{\gamma}^{1}_{k,k^{\prime}}}{\hat{\gamma}^{1}_{k^{\prime},k^{\prime}}}+\lambda(|p^{0}|+|\hat{p}^{1}|-2|\hat{\gamma}^{1}|)
=kDxk02(γ^01)k+kDx^k12(γ^11)k2kDi=1N1xk0xi1γk,i1+λ(|p0|+|p^1|2|γ^1|)\displaystyle=\sum_{k\in D}\|x^{0}_{k}\|^{2}(\hat{\gamma}_{0}^{1})_{k}+\sum_{k^{\prime}\in D}\|\hat{x}_{k^{\prime}}^{1}\|^{2}(\hat{\gamma}^{1}_{1})_{k^{\prime}}-2\sum_{k\in D}\sum_{i=1}^{N_{1}}x_{k}^{0}\cdot x_{i}^{1}\,\gamma^{1}_{k,i}+\lambda(|p^{0}|+|\hat{p}^{1}|-2|\hat{\gamma}^{1}|)

Therefore, we obtain

C(γ^1;μ0,μ^1,λ)=C(γ1;μ0,γ11,λ)i=1N1xi12(γ11)i+kDx^k12(γ^11)k+λ(|p^1||p1|)\displaystyle C(\hat{\gamma}^{1};\mu^{0},\hat{\mu}^{1},\lambda)=C(\gamma^{1};\mu^{0},\gamma_{1}^{1},\lambda)-\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}(\gamma^{1}_{1})_{i}+\sum_{k^{\prime}\in D}\|\hat{x}_{k^{\prime}}^{1}\|^{2}(\hat{\gamma}^{1}_{1})_{k^{\prime}}+\lambda(|\hat{p}^{1}|-|p^{1}|) (54)

And also,

C(γ;μ0,μ1,λ)\displaystyle C(\gamma;\mu^{0},\mu^{1},\lambda)
=k=1N0xk02(γ0)k+i=1N1xi12(γ1)i2k=1N0i=1N1kDxk0xi1γ^k,kγk,i1γ^k,k1+λ(|p0|+|p1|2|γ|),\displaystyle=\sum_{k=1}^{N_{0}}\|x_{k}^{0}\|^{2}(\gamma_{0})_{k}+\sum_{i=1}^{N_{1}}\|x^{1}_{i}\|^{2}(\gamma_{1})_{i}-2\sum_{k=1}^{N_{0}}\sum_{i=1}^{N_{1}}\sum_{k^{\prime}\in D}x_{k}^{0}\cdot x_{i}^{1}\,\hat{\gamma}_{k,k^{\prime}}\frac{\gamma^{1}_{k^{\prime},i}}{\hat{\gamma}_{k^{\prime},k^{\prime}}^{1}}+\lambda(|p^{0}|+|p^{1}|-2|\gamma|),
C(γ^;μ0,μ^1,λ)\displaystyle C(\hat{\gamma};\mu^{0},\hat{\mu}^{1},\lambda)
=k=1N0xk02(γ^0)k+kDx^k12(γ^1)k2k=1N0kDi=1N0xk0xi1γ^k,k1γk,i1γ^k,k1+λ(|p0|+|p^1|2|γ^|)\displaystyle=\sum_{k=1}^{N_{0}}\|x_{k}^{0}\|^{2}(\hat{\gamma}_{0})_{k}+\sum_{k^{\prime}\in D}\|\hat{x}_{k^{\prime}}^{1}\|^{2}(\hat{\gamma}_{1})_{k^{\prime}}-2\sum_{k=1}^{N_{0}}\sum_{k^{\prime}\in D}\sum_{i=1}^{N_{0}}x_{k}^{0}\cdot x_{i}^{1}\,\hat{\gamma}_{k,k^{\prime}}^{1}\frac{\gamma^{1}_{k^{\prime},i}}{\hat{\gamma}^{1}_{k^{\prime},k^{\prime}}}+\lambda(|p^{0}|+|\hat{p}^{1}|-2|\hat{\gamma}|)

Thus we obtain

C(γ^;μ0,μ^1,λ)\displaystyle C(\hat{\gamma};\mu^{0},\hat{\mu}^{1},\lambda) =C(γ;μ0,γ11,λ)i=1N1xi12(γ1)i+kDx^k12(γ^1)k+λ(|p^1||p1|)\displaystyle=C(\gamma;\mu^{0},\gamma_{1}^{1},\lambda)-\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}(\gamma_{1})_{i}+\sum_{k^{\prime}\in D}\|\hat{x}^{1}_{k^{\prime}}\|^{2}(\hat{\gamma}_{1})_{k^{\prime}}+\lambda(|\hat{p}^{1}|-|p^{1}|) (55)

Combining with (54) and (55), we have

C(γ^1;μ0,μ^1,λ)C(γ^;μ0,μ^1,λ)\displaystyle C(\hat{\gamma}^{1};\mu^{0},\hat{\mu}^{1},\lambda)-C(\hat{\gamma};\mu^{0},\hat{\mu}^{1},\lambda)
\displaystyle=C(\gamma^{1};\mu^{0},\gamma^{1}_{1},\lambda)-C(\gamma;\mu^{0},\gamma^{1}_{1},\lambda)+\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}((\gamma_{1})_{i}-(\gamma^{1}_{1})_{i})+\sum_{k\in D}\|\hat{x}_{k}^{1}\|^{2}((\hat{\gamma}^{1}_{1})_{k}-(\hat{\gamma}_{1})_{k})
\displaystyle\leq\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}((\gamma_{1})_{i}-(\gamma^{1}_{1})_{i})-\sum_{k\in D}\|\hat{x}_{k}^{1}\|^{2}((\hat{\gamma}_{1})_{k}-(\hat{\gamma}^{1}_{1})_{k}) (56)

where the inequality holds since γ1\gamma^{1} is optimal for OPTλ(μ0,γ11)OPT_{\lambda}(\mu^{0},\gamma^{1}_{1}) (Lemma D.1) and γΓ(μ0,γ11)\gamma\in\Gamma_{\leq}(\mu^{0},\gamma_{1}^{1}).

E.3 Verification of the inequality

It remains to show that (56) is at most 0. By Jensen’s inequality, for each kDk\in D, we obtain

x^k12i=1N1γk,i1p^k1xi12.\|\hat{x}^{1}_{k}\|^{2}\leq\sum_{i=1}^{N_{1}}\frac{\gamma^{1}_{k,i}}{\hat{p}^{1}_{k}}\|x_{i}^{1}\|^{2}.

Combined with the fact that γ^1γ^11\hat{\gamma}_{1}\leq\hat{\gamma}^{1}_{1}, we obtain:

(56)\displaystyle\eqref{pf: C(gamma_hat)-C(gamma)} i=1N1xi12((γ1)i(γ11)i)kDi=1N1γk,i1p^k1xi12((γ^1)k(γ^11)k)\displaystyle\leq\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}((\gamma_{1})_{i}-(\gamma^{1}_{1})_{i})-\sum_{k\in D}\sum_{i=1}^{N_{1}}\frac{\gamma^{1}_{k,i}}{\hat{p}^{1}_{k}}\|x_{i}^{1}\|^{2}((\hat{\gamma}_{1})_{k}-(\hat{\gamma}^{1}_{1})_{k})
=i=1N1xi12((γ1)i(γ11)ikDγk,i1((γ^1)kp^k1)p^k1)\displaystyle=\sum_{i=1}^{N_{1}}\|x_{i}^{1}\|^{2}\left((\gamma_{1})_{i}-(\gamma^{1}_{1})_{i}-\sum_{k\in D}\frac{\gamma_{k,i}^{1}((\hat{\gamma}_{1})_{k}-\hat{p}^{1}_{k})}{\hat{p}^{1}_{k}}\right) (57)

Fix i[1:N1]i\in[1:N_{1}]; we have:

(γ1)i(γ11)ikDγk,i1((γ^1)kp^k1)p^k1\displaystyle(\gamma_{1})_{i}-(\gamma^{1}_{1})_{i}-\sum_{k\in D}\frac{\gamma_{k,i}^{1}((\hat{\gamma}_{1})_{k}-\hat{p}^{1}_{k})}{\hat{p}^{1}_{k}}
\displaystyle=(\gamma_{1})_{i}-(\gamma^{1}_{1})_{i}-\sum_{k\in D}\frac{\gamma_{k,i}^{1}(\hat{\gamma}_{1})_{k}}{\hat{p}^{1}_{k}}+(\gamma^{1}_{1})_{i}
\displaystyle=\sum_{k^{\prime}=1}^{N_{0}}\gamma_{k^{\prime},i}-\sum_{k\in D}\frac{\gamma_{k,i}^{1}(\hat{\gamma}_{1})_{k}}{\hat{p}^{1}_{k}}
=k=1N0kDγ^k,kγk,i1p^k1kDk=1N0γk,i1γ^k,kp^k1\displaystyle=\sum_{k^{\prime}=1}^{N_{0}}\sum_{k\in D}\frac{\hat{\gamma}_{k^{\prime},k}\gamma^{1}_{k,i}}{\hat{p}_{k}^{1}}-\sum_{k\in D}\sum_{k^{\prime}=1}^{N_{0}}\frac{\gamma^{1}_{k,i}\hat{\gamma}_{k^{\prime},k}}{\hat{p}_{k}^{1}} (58)
=0\displaystyle=0

where (58) holds by the fact: for each k[1:N0],i[1:N1]k^{\prime}\in[1:N_{0}],i\in[1:N_{1}], we have

γk,i\displaystyle\gamma_{k^{\prime},i} =(γ^(γ^1)1[k,:])γ1[:,i]\displaystyle=(\hat{\gamma}(\hat{\gamma}^{1})^{-1}[k^{\prime},:])\gamma^{1}[:,i]
=[γ^k,11p^11,γ^k,N01p^N0j]T[γ1,i1,γN0,i1]\displaystyle=\left[\hat{\gamma}_{k^{\prime},1}\frac{1}{\hat{p}^{1}_{1}},\ldots\hat{\gamma}_{k^{\prime},N^{0}}\frac{1}{\hat{p}^{j}_{N_{0}}}\right]^{T}[\gamma^{1}_{1,i},\ldots\gamma^{1}_{N_{0},i}]
=kDγ^k,kγk,i1p^k1\displaystyle=\sum_{k\in D}\frac{\hat{\gamma}_{k^{\prime},k}\gamma^{1}_{k,i}}{\hat{p}_{k}^{1}}

Thus (57) is 0. Therefore, (56) is upper bounded by 0, and we complete the proof.

Appendix F Proof of Theorem 3.6

Proof.

Without loss of generality, we assume that γjΓ(μ0,μj)\gamma^{j}\in\Gamma_{\leq}^{*}(\mu^{0},\mu^{j}) is induced by a 1-1 map. We can suppose this because we will end up with an N0N_{0}-point representation of μj\mu^{j} when we construct μ^j\hat{\mu}^{j}: if two points xn0x_{n}^{0} and xm0x_{m}^{0} are mapped to the same point T(xn0)=T(xm0)T(x_{n}^{0})=T(x_{m}^{0}), when performing the barycentric projection we can split that target point into two copies with the corresponding labels and masses. (The moral is that the labels are important, rather than where the mass is located.) Therefore, γjN0×Nj\gamma^{j}\in\mathbb{R}^{N_{0}\times N_{j}} is such that in each row and each column there exists at most one positive entry, and all others are zero. Let γ^:=diag(p^1j,,p^N0j)\hat{\gamma}:=\operatorname{diag}(\hat{p}_{1}^{j},\ldots,\hat{p}_{N_{0}}^{j}). Then, by the definition of μ^j\hat{\mu}^{j}, we have C(γj;μ0,μj,λ)=C(γ^;μ0,μ^j,λ)C(\gamma^{j};\mu^{0},\mu^{j},\lambda)=C(\hat{\gamma};\mu^{0},\hat{\mu}^{j},\lambda). Thus,

OPTλ(μ0,μj)\displaystyle OPT_{\lambda}(\mu^{0},\mu^{j}) =C(γj;μ0,μj,λ)\displaystyle=C(\gamma^{j};\mu^{0},\mu^{j},\lambda)
=C(γj;μ0,γ1j,λ)+λ(|μj||γ1j|)\displaystyle=C(\gamma^{j};\mu^{0},\gamma^{j}_{1},\lambda)+\lambda(|\mu^{j}|-|\gamma_{1}^{j}|)
=C(γ^;μ0,μ^j,λ)+λ(|μj||μ^j|)\displaystyle=C(\hat{\gamma};\mu^{0},\hat{\mu}^{j},\lambda)+\lambda(|\mu^{j}|-|\hat{\mu}^{j}|)
=OPTλ(μ0,μ^j)+λ(|μj||μ^j|),\displaystyle=OPT_{\lambda}(\mu^{0},\hat{\mu}^{j})+\lambda(|\mu^{j}|-|\hat{\mu}^{j}|),

where the last equality holds from Theorem 3.5. This concludes the proof.

Moreover, from the above identities (for discrete measures) we can express

OPT_{\lambda}(\mu^{0},\mu^{j})=\sum_{k=1}^{N_{0}}\hat{p}_{k}^{j}\|x_{k}^{0}-\hat{x}_{k}^{j}\|^{2}+\lambda|p^{0}-\hat{p}^{j}|+\lambda(|p^{j}|-|\hat{p}^{j}|).

Appendix G Proof of Proposition 3.7

Proof.

The result follows from (40) and (35). Indeed, we have

\displaystyle LOPT_{\lambda,\mu^{0}}(\hat{\mu}^{i},\hat{\mu}^{j})=\|u^{i}-u^{j}\|^{2}_{\hat{p}^{i}\wedge\hat{p}^{j},2\lambda}+\lambda(|\hat{p}^{i}-\hat{p}^{j}|)\qquad\text{and}
\displaystyle LOPT_{\lambda,\mu^{0}}(\mu^{i},\mu^{j})=\|u^{i}-u^{j}\|^{2}_{\hat{p}^{i}\wedge\hat{p}^{j},2\lambda}+\lambda(|\hat{p}^{i}-\hat{p}^{j}|)+\lambda(|\nu^{i}|+|\nu^{j}|),

since the LOPT barycentric projection of μi\mu^{i} is, in essence, μi\mu^{i} without the mass that needs to be created from the reference (and analogously for μj\mu^{j}). ∎
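In the discrete setting, the right-hand side above can be evaluated directly from the embedded data. The following NumPy sketch (illustrative names; the truncated norm is the one recalled in Appendix A) computes LOPTλ,μ0(μi,μj)LOPT_{\lambda,\mu^{0}}(\mu^{i},\mu^{j}) from the pairs (ui,p^i)(u^{i},\hat{p}^{i}), (uj,p^j)(u^{j},\hat{p}^{j}) and the masses |νi|,|νj||\nu^{i}|,|\nu^{j}|:

```python
import numpy as np

def lopt_distance(ui, pi_hat, nui_mass, uj, pj_hat, nuj_mass, lam):
    """LOPT_{lambda, mu^0}(mu^i, mu^j) evaluated from the embeddings (sketch).

    ui, uj             : (N0, d) tangent vectors u^i, u^j
    pi_hat, pj_hat     : (N0,) weights of the barycentric projections mu_hat^i, mu_hat^j
    nui_mass, nuj_mass : total masses |nu^i|, |nu^j| not matched to the reference
    """
    w = np.minimum(pi_hat, pj_hat)                               # p_hat^i wedge p_hat^j
    truncated = np.minimum(((ui - uj) ** 2).sum(axis=-1), 2 * lam)
    return (truncated * w).sum() + lam * np.abs(pi_hat - pj_hat).sum() \
           + lam * (nui_mass + nuj_mass)
```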

Appendix H Applications

For completeness, we will expand on the experiments and discussion presented in Section 4, as well as on Figure 1 which contrasts the HK technique with OPT from the point of view of interpolation of measures.

First, we recall that similar to LOT [41] and [34], the goal of LOPT is not exclusively to approximate OPT, but to propose new transport-based metrics (or discrepancy measures) that are computationally less expensive and easier to work with than OT or OPT, specifically when many measures must be compared.

Also, one of the main advantages of the linearization step is that it allows us to embed sets of probability (resp. positive finite) measures into a linear space (the 𝒯μ0\mathcal{T}_{\mu^{0}} space). Moreover, it does so in a way that allows us to use the 𝒯μ0\mathcal{T}_{\mu^{0}}-metric in that space as a proxy (or replacement) for more complicated transport metrics while preserving the natural properties of transport theory. As a consequence, data analysis can be performed using the Euclidean metric in a simple vector space.

H.1 Approximation of OPT Distance

For a better understanding of the errors plotted in Figure 2, the following Figure 6 shows the histograms of the relative errors for different values of λ\lambda and each number of measures K=5,10,15K=5,10,15.

Refer to caption
Figure 6: Histogram of relative errors (depicted in Figure 2) between OPTλOPT_{\lambda} and LOPTλ,μ0LOPT_{\lambda,\mu_{0}}. (Number of samples N=500N=500. Number of repetitions =10=10. Dimension =2=2, i.e., measures on 2\mathbb{R}^{2}.)

As stated in Section 4, for these experiments we created KK point sets of size NN from KK different Gaussian distributions in 2\mathbb{R}^{2}. In particular, μi𝒩(mi,I)\mu^{i}\sim\mathcal{N}(m^{i},I), where mim^{i} is randomly selected such that mi=3\|m^{i}\|=\sqrt{3} for i=1,,Ki=1,\ldots,K. For the reference, we picked an NN-point representation of μ0𝒩(m¯,I)\mu^{0}\sim\mathcal{N}(\overline{m},I) with m¯=mi/K\overline{m}=\sum m^{i}/K. For Figures 2 and 6, the sample size NN was set equal to 500500.

In what follows, we include tests for N=200,250,300,350,,900,950,1000N=200,250,300,350,\ldots,900,950,1000 and K=2,4K=2,4. For each (N,K)(N,K), we repeated each experiment 1010 times. The relative errors are shown in Figure 7. For large λ\lambda, most mass is transported and OT(μi,μj)OPTλ(μi,μj)OT(\mu^{i},\mu^{j})\approx OPT_{\lambda}(\mu^{i},\mu^{j}); the performance of LOPT is then close to that of LOT, and the relative error is small. For small λ\lambda, almost no mass is transported, OPTλ(μi,μj)λ(|μi|+|μj|)λ(|μi||μ^i|+|μj||μ^j|)LOPTμ0,λ(μi,μj)OPT_{\lambda}(\mu^{i},\mu^{j})\approx\lambda(|\mu^{i}|+|\mu^{j}|)\approx\lambda(|\mu^{i}|-|\hat{\mu}^{i}|+|\mu^{j}|-|\hat{\mu}^{j}|)\approx{LOPT}_{\mu^{0},\lambda}(\mu^{i},\mu^{j}), and we still have a small error. In between, e.g., for λ=5\lambda=5, we observe the largest relative error. Similar results were obtained by setting the reference as the OT barycenter.
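For completeness, we sketch how such an experiment can be reproduced. One standard way to evaluate OPTλOPT_{\lambda} exactly for discrete measures is to augment each measure with a dummy point that absorbs the destroyed/created mass at cost λ\lambda per unit; combined with the embedding of Section 3 and the `lopt_distance` helper sketched after Proposition 3.7, this gives relative errors as in Figure 7. The code below assumes the POT library (`ot.dist`, `ot.emd`) and uses illustrative names; it is a simplification, not the released implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def opt_lambda(x, p, y, q, lam):
    """OPT_lambda between discrete measures via a dummy-point augmentation.

    Mass sent to/received from the dummy point costs lam per unit, which reproduces
    the penalty lam(|mu - gamma_0| + |nu - gamma_1|).  Returns the OPT value and
    the (real x real) partial plan.
    """
    C = ot.dist(x, y)                                        # squared Euclidean costs
    C_aug = np.block([[C, lam * np.ones((len(p), 1))],
                      [lam * np.ones((1, len(q))), np.zeros((1, 1))]])
    a = np.concatenate([p, [q.sum()]])
    b = np.concatenate([q, [p.sum()]])
    plan = ot.emd(a, b, C_aug)
    return (plan * C_aug).sum(), plan[:-1, :-1]

def lopt_embed(x0, p0, xj, pj, lam):
    """Embed mu^j against the reference mu^0: returns (u^j, p_hat^j, |nu^j|)."""
    _, gamma = opt_lambda(x0, p0, xj, pj, lam)
    p_hat = gamma.sum(axis=1)                                # transported mass per reference point
    x_hat = np.where(p_hat[:, None] > 0,
                     (gamma @ xj) / np.maximum(p_hat, 1e-16)[:, None], x0)
    return x_hat - x0, p_hat, pj.sum() - gamma.sum()

# Toy version of the setting of Figure 7: two Gaussian point clouds and a reference.
rng = np.random.default_rng(0)
N, lam = 200, 5.0
mi, mj = np.array([np.sqrt(3.0), 0.0]), np.array([0.0, np.sqrt(3.0)])
xi, xj = rng.normal(size=(N, 2)) + mi, rng.normal(size=(N, 2)) + mj
x0 = rng.normal(size=(N, 2)) + (mi + mj) / 2
pi = pj = p0 = np.full(N, 1.0 / N)

opt_val, _ = opt_lambda(xi, pi, xj, pj, lam)
ui, pi_hat, nui = lopt_embed(x0, p0, xi, pi, lam)
uj, pj_hat, nuj = lopt_embed(x0, p0, xj, pj, lam)
lopt_val = lopt_distance(ui, pi_hat, nui, uj, pj_hat, nuj, lam)  # sketch from Appendix G
print(abs(opt_val - lopt_val) / opt_val)                         # relative error
```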

Refer to caption
Figure 7: The average relative error between OPTλOPT_{\lambda} and LOPTλ,μ0LOPT_{\lambda,\mu_{0}} between all pairs of discrete measures μi\mu^{i} and μj\mu^{j}.

H.2 PCA analysis

For problems where pair-wise comparisons between KK distributions are needed, in the classical optimal (partial) transport setting we have to solve (K2)\binom{K}{2} OT (resp. OPT) problems. In the LOT (resp. LOPT) framework, however, one only needs to solve KK OT (resp. OPT) problems (matching each distribution with a reference measure).

A ubiquitous example is clustering or classification of measures. In this case, the number of target measures KK representing different samples is usually very large. For the particular case of data analysis on Point Cloud MNIST, after using PCA in the embedding space (see Figure 5), it can be observed that the LOT framework is a natural option for separating different digits, but the equal-mass requirement is too restrictive in the presence of noise. LOPT performs better since it does not have this restriction. This is one example where the number of measures, K=900K=900, is much larger than 2.
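Concretely, once every measure has been embedded against the same reference, each measure is represented by one fixed-length Euclidean vector, and off-the-shelf tools such as PCA apply directly. A minimal sketch follows, assuming scikit-learn and using random stand-ins for the embeddings; the feature construction (weight-scaled displacements concatenated with the weights) is one simple choice and not necessarily the one used for Figure 5.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
K, N0, d = 20, 50, 2                         # toy sizes; in the paper K = 900 digits
U = rng.normal(size=(K, N0, d))              # stand-ins for the tangent vectors u^j
P = rng.uniform(0.5, 1.0, size=(K, N0))      # stand-ins for the weights p_hat^j

# One Euclidean feature vector per measure: weight-scaled displacements plus weights.
features = np.concatenate([(P[..., None] * U).reshape(K, -1), P], axis=1)
coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)                          # (K, 2) coordinates, ready for plotting
```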

H.3 Point Cloud Interpolation

Here, we add the results of the experiments mentioned in Section 4 which complete Figure 4; in the new Figure 8, Figure 4 of the main text corresponds to subfigure 8(b). We conducted experiments using three digits (0, 1, and 9) from the PointCloud MNIST dataset, with 300 point sets per digit. We utilized the LOPT embedding to calculate and visualize the interpolation between pairs of digits. We chose the barycenter between 0, 1, and 9 as the reference for our experiment. However, to avoid redundant results, in the main paper we only demonstrated the interpolation between 0 and 9. The remaining plots for the other digit pairs, using different levels of noise η\eta, are included here for completeness. Later, in Section H.4, Figure 11 will add the plots for the HK technique and its linearized version.
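The basic mechanism behind these plots can be sketched in the embedding space: the transported part moves along the straight lines xk0+(1t)uki+tukjx_{k}^{0}+(1-t)u^{i}_{k}+tu^{j}_{k}. The snippet below (NumPy only, illustrative names) produces such an intermediate point cloud; it only displays the commonly transported mass and omits the handling of created/destroyed mass prescribed by the LOPT interpolating curve of Section 3.5.

```python
import numpy as np

def interpolate_embeddings(x0, ui, pi_hat, uj, pj_hat, t):
    """Simplified interpolation of two LOPT embeddings at time t in [0, 1].

    Returns support points and weights of an intermediate point cloud; only the
    mass transported under both embeddings, min(p_hat^i, p_hat^j), is kept here.
    """
    locations = x0 + (1 - t) * ui + t * uj        # straight lines in the embedding space
    weights = np.minimum(pi_hat, pj_hat)
    keep = weights > 0
    return locations[keep], weights[keep]
```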

Refer to caption
(a) η=0\eta=0
Refer to caption
(b) η=0.5\eta=0.5
Refer to caption
(c) η=0.75\eta=0.75
Refer to caption
(d) η=0\eta=0
Refer to caption
(e) η=0.5\eta=0.5
Refer to caption
(f) η=0.75\eta=0.75
Figure 8: Interpolation between two point-clouds at different times t{0,0.25,0.5,0.75,1}t\in\{0,0.25,0.5,0.75,1\}. Different values of noise η\eta were considered for the different interpolation approaches (OT, LOT, OPT, LOPT). For LOT and LOPT the reference measure is the barycenter between the point clouds 0, 11, and 99 with no noise. Top row: interpolation between digits 0 and 99. Bottom row: interpolation between digits 0 and 11.

H.4 Preliminary comparisons between LHK and LOPT

The contrasts between the Linearized Hellinger-Kantorovich (LHK) [10] and the LOPT approaches come from the differences between HK (Hellinger-Kantorovich) and OPT distances.

The main issue is that for HK transportation between two measures, the transported portion of the mass does not resemble the OT-geodesic where mass is preserved. In other words, HK changes the mass as it transports it, while OPT preserves it.

This issue is depicted in Figure 1. In that figure, both the initial (blue) and final (purple) distributions of mass are two unit-mass delta measures at different locations. For HK, the mass decreases and then increases while being transported, whereas for OPT it remains constant.

To illustrate better this point, we incorporate here Figure 9 which is the two-dimensional analog of Figure 1.

Refer to caption
Figure 9: HK vs. OPT interpolation between two delta measures of unit mass located at different positions. Two scenarios are shown for each transport framework, varying the respective parameters (ss for HK, and λ\lambda for OPT). Blue circles stand for the cases when the geodesic transports mass. Triangular shapes represent destruction and creation of mass: triangles pointing down in orange indicate the locations where mass will be destroyed, and triangles pointing up in green indicate the locations where mass will be created. On top of that, shadows are added to emphasize the change of mass from the initial and final configurations. The top two plots exhibit two extreme cases when performing the HK geodesic: when ss is small, everything is created/destroyed; when ss is large, everything is transported without mass preservation. On the other side, the bottom two plots show the two analogous cases for OPT geodesics: when λ\lambda is small we observe creation/destruction, and when λ\lambda is large we have mass-preserving transportation. The intermediate cases are treated in the following.

In addition, in Figure 10 we not only compare HK and OPT, but also LHK and LOPT. The top subfigure 10(a) shows the measures μ1\mu_{1} (blue dots) and μ2\mu_{2} (orange crosses) to be interpolated with the different techniques in the next subfigures 10(b) and 10(c). Also, in 10(a) we plot the reference μ0\mu_{0} (green triangles) which is going to be used only in experiment 10(c). The measures μ1\mu_{1} and μ2\mu_{2} are interpreted as follows. The three masses on the interior, forming a triangle, can be considered as the signal information and the two masses on the corners can be considered noise. That is, we can assume they are just two noisy point cloud representations of the same distribution shifted. We want to transport the three masses in the middle without affecting their mass and relative positions too much. Subfigure 10(b) shows HK and OPT interpolation between the two measures μ1\mu_{1} and μ2\mu_{2}. In the HK cases, we notice not only how the masses vary, but also how their relative positions change obtaining a very different configuration at the end of the interpolation. OPT instead returns a more natural interpolation because of the mass preservation of the transported portion and the decoupling between transport and destruction/creation of mass. Finally, subfigure 10(c) shows the interpolation of μ1\mu_{1} and μ2\mu_{2} for LHK and LOPT. The reference measure μ0\mu_{0} is the same for both and we can see the denoising effect due to the fact that, for the three measures, the mass is concentrated in the triangular region in the center. However, while mass and structure-preserving transportation can be seen for LOPT, for LHK the shape of the configuration changes.

On top of that, as is often the case in quantization, point cloud representations of measures are given as samples with uniform mass. OPT/LOPT interpolation will not change the mass of each transported point. Therefore, the intermediate steps of an algorithm using OPT/LOPT transport benefit from preserving the same type of representation, that is, a uniformly weighted set of points.

Refer to caption
(a) Plot of two measures (μ1\mu_{1} as blue circles and μ2\mu_{2} as orange crosses, where each point has mass one), and a reference measure μ0\mu_{0} (concentrated on the three green triangular locations, unit mass each).
Refer to caption
(b) HK and OPT interpolation between the two measures μ1\mu_{1} and μ2\mu_{2} at different times. The color code is the same as for Figure 9.
Refer to caption
(c) LHK and LOPT interpolation between the two measures μ1\mu_{1} and μ2\mu_{2} at different times using for both cases the same reference μ0\mu_{0}.
Figure 10: HK vs OPT and LHK vs LOPT

For a comparison on point cloud interpolation using MNIST, we include Figure 11, which illustrates OPT, LOPT, HK, and LHK interpolation for the digit pairs (0,1) and (0,9). In the visualizations, the size of each circle is proportional to the amount of mass at each location.

Refer to caption
(a) Samples of digits 0, 1, and 9 from MNIST Data Set. Top row: clean point cloud. Bottom row: 50%50\% of noise.
Refer to caption
(b) Noise level η=0\eta=0.
Refer to caption
(c) Noise level η=0.5\eta=0.5
Refer to caption
(d) Noise level η=0\eta=0.
Refer to caption
(e) Noise level η=0.5\eta=0.5.
Figure 11: Point cloud interpolation using all the techniques unbalanced transportation HK, OPT, LHK, and LOPT.

However, we do not claim that the presented LOPT tool is a one-size-fits-all solution. We are still working on the subtle differences between LHK and LOPT and expect to have a complete and clear picture in the future. The aim of this article is to present a new tool, with an intuitive introduction and motivation, so that the OT community can benefit from it.

H.5 Barycenter computation

Can the linear embedding technique be used to compute the barycenter of a set of measures, e.g., by computing the barycenter of the embeddings and performing an inversion to the original space?

One can first calculate the mean of the embedded measures and recover a measure from this mean. The recovered measure is not necessarily the OPT barycenter; however, one can repeat this process to obtain better approximations of the barycenter. Similar to LOT, we have numerically observed that such a process converges to a measure that is close to the barycenter. However, there are several technical considerations that one needs to pay close attention to. For instance, the choices of λ\lambda and of the initial reference measure are critical in this process.
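A minimal sketch of this fixed-point scheme follows. It assumes the `lopt_embed` helper from the sketch in Section H.1; the recovery step shown here — moving each reference point by the mean displacement and averaging the projected weights — is a simplification and does not return the exact OPT barycenter.

```python
import numpy as np

def approximate_barycenter(measures, x_ref, p_ref, lam, n_iter=10):
    """Iteratively average the LOPT embeddings and recover a new reference measure.

    measures : list of (x, p) discrete measures; x_ref, p_ref : initial reference.
    """
    for _ in range(n_iter):
        embeddings = [lopt_embed(x_ref, p_ref, x, p, lam) for x, p in measures]
        mean_u = np.mean([u for u, _, _ in embeddings], axis=0)
        mean_p = np.mean([p_hat for _, p_hat, _ in embeddings], axis=0)
        x_ref, p_ref = x_ref + mean_u, mean_p     # recovered measure becomes the new reference
    return x_ref, p_ref
```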

Refer to caption
Figure 12: Depiction of barycenters between digits 0 and 11, and between 0 and 99, using the LOPT technique. Left panel: original measures (point-clouds) μ1\mu_{1} (digit 0), μ2\mu_{2} (digit 11), and μ3\mu_{3} (digit 99). We considered them with no noise on the top left panel (η=0\eta=0) and corrupted by noise on the bottom left panel (noise level η=0.5\eta=0.5). Right panel: the first row shows the result when both the initial measure and the data μ1,μ2,μ3\mu_{1},\mu_{2},\mu_{3} are clean; the second row, the result for a clean initial measure and noise-corrupted data; the third row, the result for a noise-corrupted initial measure and data.