
GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Abstract

Efficient computation of optimal transport distances between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state of the art for such computations, but require $O(n^2)$ computation. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold-structured scientific data, it is often desirable to consider a geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn, based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O(n \log n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high-dimensional single-cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.

1 Introduction

Optimal Transport (OT) distances, or Wasserstein distances, are computed by lifting ground distances between points to distances between measures. This distance is computed relative to a ground distance on the support of the distributions, making it more informative than distances based only on a pointwise comparison of the densities. However, to compute the Wasserstein distance one needs to find the optimal transport plan from the source distribution to a target distribution; this is a linear programming problem requiring $O(n^3 \log n)$ time for discrete distributions of size $n$ [peyre_computational_2020].

An efficient modification of the optimal transport problem is to consider an entropy-regularized transportation. This formulation is solved with the Sinkhorn algorithm [sinkhorn1967concerning] by iteratively rescaling a Gaussian kernel based on the distance matrix. It is equivalent to the Schrödinger bridge problem, for which similar algorithms were developed [fortet1940resolution, kullback1968probability, knight2013fast]. In the discrete case, it requires $O(n^2)$ computation for distributions of size $n$, since it relies on matrix-vector products. Furthermore, this formulation allows for fast computation of the discrete barycenter with fixed support (the average distribution w.r.t. the Sinkhorn distance). An important drawback of the Sinkhorn algorithm is the necessity to store and multiply the pairwise distance matrix with a vector.

Additionally, the ground distance is commonly chosen as the Euclidean distance. The Euclidean distance is often suboptimal for high-dimensional datasets over larger distances according to the manifold hypothesis, which says that observations lie near a low-dimensional (curved) manifold in high-dimensional space [moon_manifold_2018]. For high-dimensional datasets assumed to be sampled from a lower-dimensional manifold, using a ground distance closer to the manifold geodesic for OT has shown promising results [chen2020uncovering, huguet2022manifold, solomon2015convolutional, tong2022embedding, tong2021diffusion].

In this work we present Geodesic Sinkhorn (code: https://anonymous.4open.science/r/geo_sinkh_anon-61F7/README.rst), a Sinkhorn-based method for fast optimal transport with a heat-geodesic ground distance. Our method is based on the geometry of the dataset, constructed from a common graph, and uses the heat kernel on the graph to define a heat-geodesic distance. Key to this approach, we never need to construct or operate on an $n \times n$ distance matrix; we only use the sparse Laplacian matrix and sparse matrix-vector products. For sparse graphs, this allows $O(n \log n)$ computation of the Sinkhorn distance with a manifold ground distance, improving on the $O(n^2)$ implementations based on dense matrices.

Increasing the state-of-the-art efficiency of Sinkhorn computation opens the door to complex operations on large groups of datasets. In particular, we consider interpolating between datasets: 1) the McCann interpolant between two datasets, and 2) the barycenter between $m \geq 2$ datasets. We show that using our heat-geodesic distance improves interpolation accuracy compared to OT with the Euclidean distance. The barycenter corresponds to the average distribution of a set of distributions. Our method allows for finer-grained barycenters on a data manifold, which motivates us to define a novel notion of dissimilarity between families of distributions called the barycentric distance.

We use Geodesic Sinkhorn for interpolation of dynamic distributions on ten single-cell RNA sequencing (scRNA-seq) time series datasets. Further, we apply the barycentric distance to single-cell data from patient-derived cancer organoids (PDOs) to assess the effect of treatments (such as drugs and chemotherapy). Here we have one set of PDOs from control conditions, and another set that is treated. The true treatment effect is thus the distance between these barycenters. In addition, we use Geodesic Sinkhorn's barycenter to compare the effect of treatment on one family of distributions versus another.

Our main contributions include:

  • A new method for computing optimal transport distances on a manifold called Geodesic Sinkhorn, which is highly efficient in time and memory.

  • Defining the barycentric distance, a novel distance between families of distributions, and showing its utility in deriving treatment effect from control and treated patient samples.

  • Presenting new large single-cell RNA-sequencing datasets to the community as benchmarks for optimal transport.

2 Related Work

There are few methods for computing optimal transport efficiently between large datasets. The current state of the art is the Sinkhorn algorithm [cuturi2013sinkhorn], which computes a regularized OT map between discrete distributions of $n$ points in $O(n^2)$ time. The Low-Rank (LR) Sinkhorn algorithm [scetbon2021low] speeds up the computation by representing the cost matrix with a low-rank factorization. Similarly, [scetbon2020linear] uses a kernel representation of the cost matrix for faster computation. Other approximations are based on the dual formulation of the Wasserstein distance, like DiffusionEMD [tong2021diffusion], which approximates a distance equivalent to the Wasserstein distance with geodesic cost, and tree-based methods [le2019tree, backurs_scalable_2020, sato_fast_2020], which solve the dual formulation using a tree metric. These methods admit a closed form and scale almost linearly with $n$. The tree formulation is generalized to graphs in [le2022sobolev]; however, edges in the graph are interpreted as costs rather than affinities, which leads to a different problem setup than manifold approximation methods.

Geodesic Sinkhorn is related to prior work linking the entropy-regularized optimal transport problem on a triangular mesh with the heat operator [crane_geodesics_2013, solomon2015convolutional], but uses different graph filtering techniques. These approaches approximate the application of the heat kernel to a vector by discretizing the heat equation and solving systems of linear equations; this technique was used in different contexts, either with the cotangent Laplacian [solomon2015convolutional] or to learn a ground metric [heitz2021ground]. Solving these systems at each Sinkhorn iteration can be done efficiently with a sparse Cholesky decomposition. However, this method's efficiency depends mainly on the efficiency of the Cholesky decomposition, which can be slow depending on the sparsity pattern and is $O(n^3)$ in the worst case for an $n \times n$ matrix, and it necessitates solving $2K$ systems of linear equations per Sinkhorn iteration.

In Table 1, we summarize the advantages of the previous methods, where we denote Sinkhorn with the heat kernel approximated by implicit Euler discretization as Sinkhorn Euler.

Table 1: Summary of comparable OT methods. Geodesic Sinkhorn’s complexity benefits from the sparse matrix representation while providing efficient computation for the barycenter problem and the McCann interpolation.
Method | Sparse Matrix | Barycenter | McCann | Complexity per iteration
Exact OT | No | Yes | Yes | $O(n^3 \log n)$
Sinkhorn [cuturi2013sinkhorn] | No | Yes | Yes | $O(n^2)$
LR Sinkhorn [scetbon2021low] | No | Yes | Yes | $O(N(n+m)r + Nnmr)$, $N,m,r \ll n$
Diffusion EMD [tong2021diffusion] | Yes | No | No | $O(n \log n)$
Sinkhorn Euler [solomon2015convolutional] | Yes | Yes | Yes | Cholesky $O(n^3)$ + linear system $O(n^2)$
Geodesic Sinkhorn (ours) | Yes | Yes | Yes | $O(Kmn)$, $K,m \ll n$

3 Preliminaries

In this section, we start by reviewing the basics of OT and the Wasserstein distance, as well as the Sinkhorn distance. We then review two notions fundamental to our method: the heat equation on a graph, and the Chebyshev approximation of the heat kernel.

Wasserstein Distance

In the following, we assume that all distributions admit a density or a probability mass function, and we use the same notation for both. Let $\mu, \nu$ be two probability distributions on a measurable space ${\mathcal{X}} \subseteq \mathbb{R}^d$ with metric $d(\cdot,\cdot)$, and let $\Pi(\mu,\nu)$ be the set of joint probability distributions $\pi$ on the space ${\mathcal{X}} \times {\mathcal{X}}$ such that for any measurable subset $\omega \subset {\mathcal{X}}$, $\pi(\omega \times {\mathcal{X}}) = \mu(\omega)$ and $\pi({\mathcal{X}} \times \omega) = \nu(\omega)$. The $p$-Wasserstein distance is defined as:

$W_d^p(\mu,\nu) := \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{{\mathcal{X}}^2} d(x,y)^p \,\mathrm{d}\pi(x,y)\right)^{1/p}.$ (1)

In this work we consider $p=2$, as it is a valid metric between distributions and has other attractive theoretical properties. An exact algorithm based on linear programming can solve this problem in $O(n^3 \log n)$ time for discrete distributions of size $n$.

Sinkhorn Distances

The Kullback-Leibler (KL) divergence between $\pi$ and some strictly positive $K$ on ${\mathcal{X}} \times {\mathcal{X}}$ is defined as

$D_{\mathrm{KL}}(\pi|K) := \int_{{\mathcal{X}}^2}\left(\ln\frac{\pi(x,y)}{K(x,y)} - 1\right)\mathrm{d}\pi(x,y).$ (2)

The Sinkhorn distance (with a slight abuse of language we use the term distance, although the entropy-regularized formulation does not respect the identity of indiscernibles) is a relaxation of equation 1, where the infimum is over all couplings in $\{\pi \in \Pi(\mu,\nu) \,|\, D_{\mathrm{KL}}(\pi|\mu\times\nu) \leq \xi\}$ for $\xi > -1$. Introduced in [cuturi2013sinkhorn], this distance can be computed by solving the entropy-regularized transport problem

$W_{d,\lambda}^2(\mu,\nu) := \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{{\mathcal{X}}^2} d(x,y)^2 \,\mathrm{d}\pi(x,y) - \lambda H(\pi)\right)^{1/2},$ (3)

where we define the entropy of a coupling $\pi$ as $H(\pi) := -\int \ln\pi(x,y)\,\mathrm{d}\pi(x,y)$, and $\lambda > 0$.

This formulation converges to the Wasserstein distance as $\lambda \to 0$, and can be solved with the Sinkhorn algorithm with complexity of the order $O(n^2/\epsilon)$ for discrete distributions of size $n$ [benamou2015iterative, cuturi2013sinkhorn]. In the discrete case, the transport matrix $\bm{\pi}$ admits the form $\mathrm{diag}({\bm{v}}) K_\lambda \mathrm{diag}({\bm{w}})$, where ${\bm{v}}, {\bm{w}}$ are vectors of size $n$. The Sinkhorn algorithm iteratively updates these vectors as $({\bm{v}}, {\bm{w}}) \leftarrow (\mu./K_\lambda{\bm{w}},\ \nu./K_\lambda^T{\bm{v}})$, where $(K_\lambda)_{ij} := e^{-d(x_i,x_j)^2/\lambda}$.
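To make these updates concrete, the following is a minimal dense-kernel sketch of the Sinkhorn iterations (names and defaults are ours; this is precisely the $O(n^2)$ dense formulation that Geodesic Sinkhorn avoids):

```python
import numpy as np

def sinkhorn_dense(mu, nu, D, lam, n_iter=100):
    # Dense Gaussian kernel built from the pairwise distance matrix D.
    K = np.exp(-D**2 / lam)
    v, w = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        v = mu / (K @ w)    # enforce the first marginal
        w = nu / (K.T @ v)  # enforce the second marginal
    return v[:, None] * K * w[None, :]  # transport plan diag(v) K diag(w)
```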

Following [solomon2015convolutional], using the kernel $K_\lambda$ gives an alternative interpretation of the Sinkhorn distance as

$W_{d,\lambda}^2(\mu,\nu) = \lambda^{1/2}\left(1 + \min_{\pi\in\Pi(\mu,\nu)} D_{\mathrm{KL}}(\pi|K_\lambda)\right)^{1/2}.$ (4)

The problem in equation 3 is strictly convex and continuous, yielding a unique minimizer. In the discrete case, this leads to an algorithm for the entropy-regularized Wasserstein distance based on the Sinkhorn algorithm, enforcing the marginal constraints on the kernel $K_\lambda$ while minimizing the distance as quantified by $D_{\mathrm{KL}}$.

The underlying metric $d(\cdot,\cdot)$ is generally unknown, thus the kernel $K_\lambda$ cannot be evaluated. The authors of [solomon2015convolutional] proposed to approximate $K_\lambda$ with the heat kernel $\mathcal{H}_t(x,y)$ on ${\mathcal{X}}$. According to Varadhan's formula [varadhan_behavior_1967], the geodesic distance on a manifold can be recovered from the heat transfer at small timescales as

$d(x,y)^2 = \lim_{t\to 0} -4t \ln \mathcal{H}_t(x,y).$ (5)

This motivates the use of the heat-geodesic distance $d_{\mathcal{H}}^2(x,y) := -4t\ln\mathcal{H}_t(x,y)$, with associated kernel

$K_\lambda(x,y) = \mathcal{H}_{\lambda/4}(x,y).$ (6)
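To make the identification in equation 6 explicit: substituting the heat-geodesic distance into the definition of $K_\lambda$ and choosing the diffusion time $t = \lambda/4$ gives

$K_\lambda(x,y) = e^{-d_{\mathcal{H}}^2(x,y)/\lambda} = e^{(4t/\lambda)\ln\mathcal{H}_t(x,y)} = \mathcal{H}_t(x,y)^{4t/\lambda} = \mathcal{H}_{\lambda/4}(x,y),$

so a single application of the heat kernel at time $\lambda/4$ plays the role of the Gaussian kernel in the Sinkhorn iterations.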

Interestingly, Sinkhorn-based methods admit an efficient algorithm to solve the barycenter problem, which we present next.

Interpolation with discrete support

By constraining the support to a set ${\mathcal{X}}$ (or a graph), we can efficiently interpolate between more than two distributions. The barycenter problem [benamou2015iterative, solomon2015convolutional, peyre_computational_2020, cuturi2014fast] generalizes the notion of an average between points to an average between distributions. For a set of $m$ distributions $\{\mu_1,\dotsc,\mu_m\}$ supported on ${\mathcal{X}}$, the objective is to find a distribution minimizing the weighted average distance

$\mu^* := \operatorname*{arg\,min}_{\mu\in{\mathcal{P}}({\mathcal{X}})} \sum_{i=1}^m \alpha_i W_d^p(\mu,\mu_i)^p,$

where ${\mathcal{P}}({\mathcal{X}})$ denotes the space of probability distributions supported on ${\mathcal{X}}$, and $\{\alpha_1,\dotsc,\alpha_m\}$ are non-negative weights. Finding the barycenter is a challenging optimization problem; however, the barycenter for Sinkhorn-based methods admits an efficient computation (see Algorithm LABEL:alg:sinkhorn_bary in the appendix, sketched below). It involves updating $m$ pairs of vectors ${\bm{v}}_i, {\bm{w}}_i$, which define a transport plan from $\mu_i$ to the barycenter $\mu^*$. The barycenter of two distributions supported on a (discrete) semi-circle is depicted in Fig. 1. Here the two input distributions, at times 0 and 1, have support over 5,000 points each. In the case where ${\mathcal{X}}$ is the union of these point sets, the barycenter is a discrete distribution over these 10,000 points, with density concentrated at the top of the semi-circle. The support of the barycenter is constrained to ${\mathcal{X}}$; for most Sinkhorn-based methods the size of ${\mathcal{X}}$ needs to be small for computational reasons. Our method does not suffer from such a limitation. Hence, we can consider barycenters with greater expressivity, and interpolate between large sets of distributions.
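The appendix algorithm is not reproduced here, but the iterative Bregman projection scheme of [benamou2015iterative] that it follows can be sketched as follows, where apply_K stands for any routine applying the (symmetric) kernel to a vector and all names are ours:

```python
import numpy as np

def sinkhorn_barycenter(apply_K, mus, alphas, n_iter=100):
    # mus: list of m distributions on a shared support of size n.
    m, n = len(mus), len(mus[0])
    w = [np.ones(n) for _ in range(m)]
    for _ in range(n_iter):
        v = [mus[i] / apply_K(w[i]) for i in range(m)]
        # Barycenter: weighted geometric mean of the projected marginals.
        mu = np.exp(sum(a * np.log(apply_K(vi)) for a, vi in zip(alphas, v)))
        w = [mu / apply_K(vi) for vi in v]
    return mu
```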

Interpolation in $\mathbb{R}^d$

Calculating the barycenter with ${\mathcal{X}} = \mathbb{R}^d$ is computationally challenging; for $m > 2$ it is intractable. However, for the special case of $m = 2$, the $\mathbb{R}^d$-barycenter can be computed efficiently using the McCann interpolant [mccann1997convexity]. For an optimal transport matrix $\bm{\pi}$ between $\mu_0$ and $\mu_1$ with weights $\alpha_0 = 1-t$, $\alpha_1 = t$, the McCann interpolant at time $t$ is

$\mu_t := \sum_{i,j}\bm{\pi}_{ij}\,\delta_{(1-t)x_i + ty_j} = \operatorname*{arg\,min}_{\mu\in{\mathcal{P}}(\mathbb{R}^d)}\sum_{i=0}^1 \alpha_i W_d^p(\mu,\mu_i)^p,$

where $x_i \sim \mu_0$, $y_j \sim \mu_1$, and $\delta$ is the Dirac distribution. The McCann interpolant is often used for interpolation between consecutive distributions [schiebinger_optimal-transport_2019].
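Given a discrete plan $\bm{\pi}$, sampling from the McCann interpolant amounts to drawing index pairs with probability $\bm{\pi}_{ij}$ and linearly interpolating the matched points; a minimal sketch (names ours):

```python
import numpy as np

def sample_mccann(pi, X0, X1, t, n_samples, rng=np.random.default_rng(0)):
    # Draw (i, j) pairs with probability proportional to the plan entries.
    p = pi.ravel() / pi.sum()
    idx = rng.choice(p.size, size=n_samples, p=p)
    i, j = np.unravel_index(idx, pi.shape)
    # Displacement interpolation: move each matched pair a fraction t of the way.
    return (1 - t) * X0[i] + t * X1[j]
```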

Next we consider the ground distance; following Varadhan's formula, we consider the approximation of heat diffusion for OT with a heat-geodesic distance.

Heat Diffusion on a Graph

Consider an undirected graph ${\mathcal{G}} = (V,E)$ with a set $V$ of $n$ vertices and a set of edges $E$, its weighted adjacency matrix ${\bm{A}}$ with non-negative edge weights, and the diagonal degree matrix ${\bm{D}}$, where ${\bm{D}}_{ii} := \sum_k {\bm{A}}_{ik}$. We define the combinatorial Laplacian as ${\bm{L}} := {\bm{D}} - {\bm{A}}$; for any function $f: V \to \mathbb{R}$ we have $({\bm{L}}f)(v) = \sum_u a_{u,v}(f(v) - f(u))$. The combinatorial Laplacian is a symmetric positive semi-definite matrix, and has an eigendecomposition ${\bm{L}} = \Psi\Lambda\Psi^T$ with orthonormal eigenvectors $\Psi$ and diagonal eigenvalue matrix $\Lambda = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_n)$, such that $0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. The combinatorial Laplacian is a natural extension of the negative of the Laplacian operator to graphs. For a signal ${\bm{f}}_0 \in \mathbb{R}^n$ on ${\mathcal{G}}$, the diffusion of ${\bm{f}}_0$ on the graph evolves according to the heat equation

$\frac{d}{dt}{\bm{f}}(t) + {\bm{L}}{\bm{f}}(t) = \bm{0}, \quad \text{s.t.}\ {\bm{f}}(0) = {\bm{f}}_0,\ t \in \mathbb{R}^+.$

This ordinary differential equation describes how the heat, represented by the signal ${\bm{f}}$, propagates on the graph over time. Since the eigenvectors of ${\bm{L}}$ form an orthonormal basis of $\mathbb{R}^n$, we have the solution

${\bm{f}}(t) = \sum_k \langle{\bm{f}}_0,\psi_k\rangle e^{-t\lambda_k}\psi_k.$

The heat operator on ${\mathcal{G}}$ is defined by the matrix exponential ${\bm{H}}_t := e^{-t{\bm{L}}}$. By orthogonality of the eigenvectors of ${\bm{L}}$, we can write ${\bm{H}}_t = \Psi e^{-t\Lambda}\Psi^T$ and ${\bm{f}}(t) = {\bm{H}}_t{\bm{f}}_0$. Computing ${\bm{H}}_t$ by eigendecomposition would require $O(n^3)$ operations. Recall that, for the Sinkhorn algorithm, we are only concerned with the application of the heat operator ${\bm{H}}_t$ to a signal ${\bm{f}} \in \mathbb{R}^n$. In Geodesic Sinkhorn we use Chebyshev polynomials [shuman_chebyshev_2011, marcotte2022fast] to approximate the application of the heat operator to a signal. For a short timescale $t$, using the heat kernel amounts to using the geodesic distance as the ground distance in the entropy-regularized optimal transport formulation of equation 3.
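For small graphs, the eigendecomposition route just described gives an exact reference implementation (our naming; $O(n^3)$, so useful mainly to validate faster approximations such as the Chebyshev one below):

```python
import numpy as np

def heat_exact(L_dense, f0, t):
    # H_t f0 = Psi exp(-t Lambda) Psi^T f0 via a full eigendecomposition.
    lam, Psi = np.linalg.eigh(L_dense)
    return Psi @ (np.exp(-t * lam) * (Psi.T @ f0))
```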

Chebyshev Polynomials

Polynomial sequences are often used to approximate functions or operators. With Chebyshev polynomials, we can approximate the application of the matrix exponential ${\bm{H}}_t = e^{-t{\bm{L}}}$ to a signal ${\bm{f}}$ on the graph. An attractive property of Chebyshev polynomials is that the approximation error decays exponentially with the maximum degree $K$. They are defined by the recursive relation $\{T_k\}_{k\in\mathbb{N}}$ with $T_0(y) = 1$, $T_1(y) = y$, and $T_k(y) = 2yT_{k-1}(y) - T_{k-2}(y)$ for $k \geq 2$. On $[-1,1]$ these polynomials are orthogonal w.r.t. the weight $(1-y^2)^{-1/2}$, and can be used to express the operator ${\bm{H}}_t$. Assuming the largest eigenvalue satisfies $\lambda_n \leq 2$, we can write

${\bm{H}}_t = \frac{b_0}{2} + \sum_{k=1}^\infty b_k T_k({\bm{L}} - \bm{Id}),$

where the scalar coefficients $\{b_k\}$ depend on the time $t$ and can be evaluated with Bessel functions. The approximation of ${\bm{H}}_t$ is based on the first $K+1$ terms of the series, and we denote it $p_K({\bm{L}},t)$. The approximation of the application of the heat operator to a signal ${\bm{f}}$ is

$p_K({\bm{L}},t){\bm{f}} = \frac{b_0}{2}{\bm{f}} + \sum_{k=1}^K b_k T_k({\bm{L}} - \bm{Id}){\bm{f}}.$

This results in $K$ matrix-vector products, which can be efficient since in general ${\bm{L}}$ is a sparse matrix. On an $m$-nearest neighbor graph, this can be $O(Kmn/\lambda)$, where $\lambda$ is a regularization parameter. The assumption that $\lambda_n \leq 2$ does not always hold for the combinatorial Laplacian, but the Laplacian can be rescaled as $2{\bm{L}}/\lambda_n$ to ensure it, or one can use the normalized Laplacian, whose eigenvalues lie in $[0,2]$, which also avoids computing the largest eigenvalue. Chebyshev polynomials admit interesting theoretical properties, in particular the existence of approximation bounds depending on the largest degree $K$ [marcotte2022fast]; hence $K$ can be determined for a given approximation error tolerance.
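A minimal sketch of this approximation, assuming a sparse normalized Laplacian with spectrum in $[0,2]$ so that ${\bm{L}} - \bm{Id}$ has spectrum in $[-1,1]$; the coefficients come from the classical modified-Bessel expansion of the exponential, and all names are ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.special import ive  # exponentially scaled modified Bessel function

def cheb_heat_apply(L, f, t, K=30):
    # Approximate exp(-t L) @ f with a degree-K Chebyshev expansion.
    A = L - sp.identity(L.shape[0], format="csr")  # spectrum in [-1, 1]
    # e^{-t(1+y)} = b_0/2 + sum_k b_k T_k(y) with b_k = 2 (-1)^k e^{-t} I_k(t);
    # ive(k, t) evaluates I_k(t) e^{-t} without overflow.
    b = np.array([2 * (-1) ** k * ive(k, t) for k in range(K + 1)])
    T_prev, T_curr = f, A @ f  # T_0(A) f and T_1(A) f
    out = 0.5 * b[0] * T_prev + b[1] * T_curr
    for k in range(2, K + 1):
        T_prev, T_curr = T_curr, 2 * (A @ T_curr) - T_prev  # Chebyshev recurrence
        out = out + b[k] * T_curr
    return out
```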

4 Geodesic Sinkhorn Distances

We define the Geodesic Sinkhorn distance between any signals or distributions on a graph ${\mathcal{G}}$ by the entropy-regularized OT with the heat kernel ${\bm{H}}_t$ on the graph. This construction is also valid between point cloud datasets. In that case, for $m$ datasets $\{\mathsf{X}_1,\dotsc,\mathsf{X}_m\}$ sampled from a set of distributions $\{\mu_1,\dotsc,\mu_m\}$, we construct a common graph using an affinity kernel on the $m$ datasets and compare two samples by taking the distance between the corresponding indicator functions on the graph. We approximate the heat kernel ${\bm{H}}_t$ with Chebyshev polynomials $p_K({\bm{L}},t)$ of order $K$. In Algorithm 4.1, we present the main steps to evaluate Geodesic Sinkhorn. It is based upon Sinkhorn iterations [sinkhorn1967concerning, cuturi2013sinkhorn], where $\oslash$ and $\odot$ denote respectively elementwise division and multiplication. Note that, as opposed to the usual Sinkhorn algorithm, we never have to store a dense $n \times n$ distance matrix, only the usually sparse graph Laplacian.

Definition 4.1.

The Geodesic Sinkhorn distance between two distributions $\mu$, $\nu$ on a graph ${\mathcal{G}}$ is

$W_{{\bm{H}}_t}(\mu,\nu) := (4t)^{1/2}\left(1 + \min_{\pi\in\Pi(\mu,\nu)} D_{\mathrm{KL}}(\pi|{\bm{H}}_t)\right)^{1/2}.$
Algorithm 4.1: Geodesic Sinkhorn

Input: $n \times n$ Laplacian ${\bm{L}}$, distributions (signals) ${\bm{\mu}}$, ${\bm{\nu}}$ on ${\mathcal{G}}$, maximum Chebyshev degree $K$, regularization $\lambda$ (diffusion time $t = \lambda/4$), vertex weights ${\bm{a}}$.
Output: $W_{d_{\bm{H}},\lambda}^{2}({\bm{\mu}},{\bm{\nu}})$
// Initialization
${\bm{w}} \leftarrow \bm{1}$
// Sinkhorn iterations
for $j = 1, 2, \dotsc$ do
    ${\bm{v}} \leftarrow {\bm{\mu}} \oslash p_K({\bm{L}},t)({\bm{a}} \odot {\bm{w}})$
    ${\bm{w}} \leftarrow {\bm{\nu}} \oslash p_K({\bm{L}},t)({\bm{a}} \odot {\bm{v}})$
end for
Return: $4t \sum [{\bm{a}} \odot ({\bm{\mu}} \odot \ln{\bm{v}} + {\bm{\nu}} \odot \ln{\bm{w}})]$
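The iterations above transcribe directly to Python; a sketch reusing the cheb_heat_apply routine sketched in Section 3 (our naming, not the reference implementation):

```python
import numpy as np

def geodesic_sinkhorn(L, mu, nu, t, K=30, a=None, n_iter=100):
    # L: sparse graph Laplacian with spectrum in [0, 2]; mu, nu: distributions on G.
    n = L.shape[0]
    a = np.full(n, 1.0 / n) if a is None else a  # uniform vertex weights by default
    w = np.ones(n)
    for _ in range(n_iter):
        v = mu / cheb_heat_apply(L, a * w, t, K)
        w = nu / cheb_heat_apply(L, a * v, t, K)
    # Squared Geodesic Sinkhorn distance, matching the Return statement above.
    return 4 * t * np.sum(a * (mu * np.log(v) + nu * np.log(w)))
```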

Fig. 1: Two distributions at times 0 and 1 on a 1D manifold embedded in 2D, comparing the McCann interpolant (center, green) with the barycenter interpolant with fixed support, colored by density (right), for a left-out time 1/2.

In the following proposition, we identify the ground distance implicitly used in the optimal transport problem defined by Geodesic Sinkhorn. We recall that two distances $d_1(x,y), d_2(x,y)$ are equivalent if for all $x,y$ there exist two positive constants $c, C$ such that $cd_2(x,y) \leq d_1(x,y) \leq Cd_2(x,y)$, and we write $d_1 \simeq d_2$.

Proposition 4.2.

There exists a maximum Chebyshev polynomial degree $K$ such that the ground distance in Geodesic Sinkhorn is equivalent to the one based on the heat kernel on ${\mathcal{G}}$:

$-4t\log(p_K({\bm{L}},t))_{ij} \simeq -4t\log({\bm{H}}_t)_{ij} \simeq -4t\log(\Psi e^{-t\Lambda}\Psi^T)_{ij}.$

In particular, the Wasserstein distances with these ground distances are equivalent.

Proof.

Provided in Appendix LABEL:sec:appendix_theory. ∎

In [solomon2015convolutional, heitz2021ground], using the implicit Euler discretization results in a ground cost of the form $-\epsilon\ln((\mathbf{Id} - \frac{\epsilon}{4K}{\bm{L}})^{-K})$, where $\mathbf{Id}$ is the identity matrix; this can be seen as another approximation of the matrix exponential.

The efficiency of Geodesic Sinkhorn improves the notion of barycenter, as it is possible to consider a much larger graph ${\mathcal{G}}$, and thus a finer-grained support for the barycenter. This leads us to define a novel distance between families of distributions.

Definition 4.3.

For two finite families of distributions ${\mathcal{T}}$ and ${\mathcal{C}}$ supported on ${\mathcal{G}}$, we define the barycentric distance between the families ${\mathcal{T}}, {\mathcal{C}}$ as

$\gamma({\mathcal{T}},{\mathcal{C}}) := W_{{\bm{H}}_t}(\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}),$

where $\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}$ are respectively the barycenters of ${\mathcal{T}}$ and ${\mathcal{C}}$.

This distance follows a similar intuition as the distances in clustering algorithms like average-linkage, where a distance between clusters is defined via an aggregation of distances. The previous definition is valid for any distance between distributions or barycenters; however, OT barycenters are known to be more informative than others [cuturi2014fast]. We further explore this comparison in our experiments. We use it to distinguish between two groups in a medical setting where a set of patients received a treatment (defining the family ${\mathcal{T}}$), and another set acts as the control family ${\mathcal{C}}$. Following this idea, we define a notion of effect between two families.
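Combining the earlier sketches, the barycentric distance takes a few lines (again a sketch with our naming, assuming uniform weights within each family):

```python
import numpy as np

def barycentric_distance(L, family_T, family_C, t, K=30):
    apply_K = lambda f: cheb_heat_apply(L, f, t, K)
    a_T = np.full(len(family_T), 1.0 / len(family_T))
    a_C = np.full(len(family_C), 1.0 / len(family_C))
    mu_T = sinkhorn_barycenter(apply_K, family_T, a_T)
    mu_C = sinkhorn_barycenter(apply_K, family_C, a_C)
    # geodesic_sinkhorn returns the squared distance.
    return np.sqrt(geodesic_sinkhorn(L, mu_T, mu_C, t, K))
```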

Definition 4.4.

For two families of distributions ${\mathcal{T}}$ and ${\mathcal{C}}$ supported on ${\mathcal{G}}$, we define the Expected Barycenter Effect of ${\mathcal{T}}$ as

$\tau({\mathcal{T}}) := \mathbb{E}_{\mu_{\mathcal{T}}^*}(Y_t) - \mathbb{E}_{\mu_{\mathcal{C}}^*}(Y_c),$

where $\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}$ are respectively the barycenters of ${\mathcal{T}}$ and ${\mathcal{C}}$, and $Y_t \sim \mu_{\mathcal{T}}^*$, $Y_c \sim \mu_{\mathcal{C}}^*$ are the features.

Note that we compute the average on the family of distributions instead of the average on their support; hence we evaluate these expectations in closed form. This definition also extends to a conditional equivalent where families of distributions can be subdivided by discrete covariate variables. When the barycenters are computed with the total variation, this definition is equivalent to the naive Average Treatment Effect (ATE) [imbens2015causal], i.e. the difference of empirical means of the two groups.
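Since a barycenter is a probability vector over the graph nodes, the expectations in Definition 4.4 are weighted means of the node features; a one-line sketch (names ours):

```python
def expected_barycenter_effect(X, mu_T_star, mu_C_star):
    # X: (n, d) node features; the barycenters are probability vectors of length n.
    return X.T @ mu_T_star - X.T @ mu_C_star  # closed-form E[Y_t] - E[Y_c]
```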

In Fig. 2 we present the focus of this section and the three types of interpolation we consider: between two distributions without constraint on the support, barycenters between multiple distributions, and a distance between families of distributions.

Fig. 2: Schematic and applications of Geodesic Sinkhorn: time interpolation, barycenters, and the barycentric distance.

5 Results

We demonstrate the accuracy and efficiency of the Geodesic Sinkhorn distance on three tasks:

  1. Time-series interpolation (Section 5.1), similar to the setup of [schiebinger_optimal-transport_2019], on ten scRNA-seq datasets.

  2. Nearest-Wasserstein-neighbor distribution calculation on simulated data with manifold structure, similar to the setup of [tong2021diffusion].

  3. A newly defined barycentric distance between families of distributions, applied to quantify the effect of a treatment on patient-derived organoids.

In all our experiments, we constructed the weighted adjacency matrix ${\bm{A}}$ of the graph with the $\alpha$-decay kernel [moon_visualizing_2019]. This kernel is based on a k-nearest neighbor graph, attributes weights with exponential decay, and is extensively used in single-cell analysis. We used the default parameters, that is, five nearest neighbors and $\alpha = 40$. A sketch of this construction is given below.
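For concreteness, a dense illustrative sketch of the $\alpha$-decay kernel (adaptive-bandwidth affinities in the spirit of [moon_visualizing_2019]; practical implementations restrict it to a sparse kNN graph, and the helper names are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def alpha_decay_kernel(X, knn=5, alpha=40):
    # Adaptive bandwidth: distance from each point to its knn-th neighbor.
    dists, _ = NearestNeighbors(n_neighbors=knn + 1).fit(X).kneighbors(X)
    eps = dists[:, -1]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Symmetrized exponential-decay affinities; the diagonal self-affinities
    # do not affect the combinatorial Laplacian L = D - A.
    return 0.5 * (np.exp(-(D / eps[:, None]) ** alpha)
                  + np.exp(-(D / eps[None, :]) ** alpha))
```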

Table 2: Time series interpolation task comparing mean and standard deviation $(\mu \pm \sigma)$ across 5 seeds of the 2-Wasserstein metric averaged across time, $\frac{1}{T-2}\sum_{t\in[2,\ldots,T-1]} W_2(\hat{\mu}_t,\mu_t)$, for 10 single-cell time series datasets. Sinkhorn Euler ran out of memory (OOM) on two datasets. Lower is better; Geo Sinkhorn achieves the best performance on each dataset.
Dataset | $L^2$ Sinkhorn | Sinkhorn Euler | Geo Sinkhorn (ours)
Cite Donor0 | 48.545 ± 0.057 | 46.254 ± 3.192 | 44.440 ± 0.108
Cite Donor1 | 48.220 ± 0.055 | 45.897 ± 3.254 | 44.165 ± 0.103
Cite Donor2 | 50.281 ± 0.016 | 47.773 ± 3.958 | 45.673 ± 0.092
Cite Donor3 | 49.339 ± 0.081 | 46.565 ± 3.553 | 45.022 ± 0.146
Clark | 13.500 ± 0.003 | OOM | 13.288 ± 0.008
EB | 12.415 ± 0.008 | 12.298 ± 0.140 | 12.133 ± 0.011
Multiome Donor0 | 56.648 ± 0.048 | 55.373 ± 7.234 | 53.431 ± 0.077
Multiome Donor1 | 54.028 ± 0.126 | 52.396 ± 4.394 | 50.238 ± 0.022
Multiome Donor2 | 58.798 ± 0.155 | 57.182 ± 5.511 | 55.041 ± 0.058
WOT | 8.096 ± 0.003 | OOM | 7.397 ± 0.106

5.1 Time series Interpolation

To evaluate Geodesic Sinkhorn's performance on inferring dynamics, we test it on a time series interpolation task. In this setting the most used datasets are Embryoid Body [moon_visualizing_2019] and WOT [schiebinger_optimal-transport_2019]. We curated ten scRNA-seq datasets: WOT [schiebinger_optimal-transport_2019], Clark [clark_single-cell_2019], Embryoid Body [moon_visualizing_2019], and seven more from the 2022 multimodal single-cell integration challenge (https://www.kaggle.com/competitions/open-problems-multimodal/). The observations are the gene expression of single cells from a distribution evolving through time. The goal is to interpolate the distribution between two timepoints. The number of timepoints in each dataset ranges from 4 to 40. For a dataset with $T$ single-cell distributions $\{\mu_1,\mu_2,\ldots,\mu_T\}$ over time, for each $t \in [2,\ldots,T-1]$ we compute the exact Euclidean 2-Wasserstein distance $W_2(\hat{\mu}_t,\mu_t)$ between the interpolated distribution $\hat{\mu}_t$ at time $t$ and the ground truth distribution $\mu_t$ (see Appendix LABEL:sec:_appendix_additional_res for details). Since we interpolate between two distributions, we use the McCann interpolant, as its support is $\mathbb{R}^d$. In Tab. 2, we compare our Geodesic Sinkhorn interpolation with the exact McCann interpolant with Euclidean distance (Exact McCann), the Sinkhorn McCann interpolant with Euclidean ground distance (Sinkhorn McCann), and Sinkhorn Euler McCann with the Euler heat approximation. Sinkhorn with the Euler approximation ran out of memory on the Clark and WOT datasets. For space reasons, we leave comparisons to additional interpolation methods to the appendix (Tab. LABEL:tab:full_time_interp, Tab. LABEL:tab:time_serie_kernel). Across all 10 datasets, the Geodesic Sinkhorn interpolation with the McCann interpolant outperforms all other methods, showcasing the importance of the heat-geodesic distance and our kernel approximation. A sketch of the evaluation metric is given below.
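As an illustration of the evaluation metric, the exact Euclidean 2-Wasserstein distance between an interpolated and a held-out empirical distribution could be computed with the POT library as follows (a sketch; the solver configuration used for the reported numbers may differ):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_exact(X_hat, X_true):
    # Uniform empirical distributions; ot.dist defaults to squared Euclidean cost.
    a = np.full(len(X_hat), 1.0 / len(X_hat))
    b = np.full(len(X_true), 1.0 / len(X_true))
    M = ot.dist(X_hat, X_true)
    return np.sqrt(ot.emd2(a, b, M))  # exact 2-Wasserstein distance
```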

5.2 Nearest-Wasserstein-neighbor distributions

Computation time of heat approximation

We compare the time required to approximate the heat kernel using either Chebyshev polynomials ($K=30$) or the implicit Euler discretization (30 steps) with Cholesky decomposition [solomon2015convolutional, heitz2021ground]. In Fig. 3, we report the time in seconds required to diffuse a uniform signal on a graph where we sample 10k observations from a number of distributions ranging from 5 to 45. We also report the time without the steps that can be prefactored: the computation of the normalized Laplacian for the Chebyshev method (Cheb w/o pre), and the Cholesky decomposition for the Euler method (Euler w/o pre). Using Chebyshev polynomials can be more than two times faster than the Euler method. Recall that this step needs to be applied twice per Sinkhorn iteration; our method can thus be drastically faster. In the appendix, Fig. LABEL:fig:time_swiss_roll_W1 and Fig. LABEL:fig:time_swiss_roll show more comparative results on the time efficiency of our method against Sinkhorn, Diffusion EMD, and LR Sinkhorn; our method can be 20 times faster than Sinkhorn and LR Sinkhorn.

Computation time and accuracy

In this experiment, we compare our method against Sinkhorn [cuturi2013sinkhorn] and LR Sinkhorn [scetbon2021low], both with Euclidean and squared Euclidean ground distances, as well as DiffusionEMD [tong2021diffusion] and Sinkhorn with the Euler approximation of the heat filter. We created 15 Gaussian distributions sampled randomly on a Swiss roll dataset, and sampled 10k observations from each distribution. We consider a k-nearest-neighbor task on these distributions. We evaluate the methods against the ground truth, since we know the exact geodesic distance on the manifold. We test two implementations of our method: either using the Chebyshev approximation from the Python library PyGSP [michael_defferrard_2017_1003158] (Geodesic Sinkhorn), with the graph from PHATE (P) and UMAP (S), or using the Chebyshev implementation from [marcotte2022fast] (Geodesic Sinkhorn*). The latter implementation can be easily sectioned, thus avoiding superfluous computation; for instance, we compute the coefficients $b_k$ only once on the graph, then use them for all 15 distributions. In Tab. 3, we report the average and standard deviation over 10 seeds of the Spearman and Pearson correlations to the ground truth, and the runtime in seconds with and without the computation of the graph. Our method is always more accurate than Sinkhorn and LR Sinkhorn, while being about 6 to 20 times faster. Our method's accuracy is similar to the approximation with Euler, but can be almost four times faster. DiffusionEMD and Geodesic Sinkhorn performances are similar, but DiffusionEMD provides neither a transport plan nor meaningful barycenters.


Fig. 3: Time in seconds to diffuse a uniform vector on a graph either using Euler with Cholesky or Chebyshev polynomials. Using Chebyshev polynomials is faster with and without taking into account the steps that can be computed only once.
Table 3: KNN task for 15 distributions. Geodesic Sinkhorn's accuracy is better than or similar to other Sinkhorn-based methods, while being much faster.
Method | SpearmanR | PearsonR | P@5 | time(s) no graph | time(s)
Diffusion EMD | 0.62 ± 0.097 | 0.736 ± 0.023 | 0.66 ± 0.072 | 2.845 ± 0.135 | 7.877 ± 0.531
Sinkhorn $L^1$ | 0.387 ± 0.044 | 0.523 ± 0.036 | 0.471 ± 0.028 | 112.406 ± 0.206 | 112.406 ± 0.206
Sinkhorn $L^2$ | 0.411 ± 0.036 | 0.485 ± 0.027 | 0.492 ± 0.053 | 133.686 ± 5.234 | 133.686 ± 5.234
LR Sinkhorn $L^1$ | -0.31 ± 0.07 | -0.131 ± 0.086 | 0.237 ± 0.037 | 578.631 ± 107.82 | 578.631 ± 107.82
LR Sinkhorn $L^2$ | 0.366 ± 0.048 | 0.379 ± 0.051 | 0.447 ± 0.023 | 204.191 ± 3.656 | 204.191 ± 3.656
Geodesic Sinkhorn | 0.847 ± 0.023 | 0.754 ± 0.016 | 0.833 ± 0.034 | 10.176 ± 1.249 | 16.682 ± 1.705

5.3 Barycentric distance

Synthetic data

We test whether we can identify a linear treatment effect with the Expected Barycenter Effect (EBE). In this experiment, we create a control family of distributions ${\mathcal{C}}$ of ten standard Gaussian distributions. The treatment group consists of nine Gaussian distributions $\mathcal{N}(5,1)$ and one outlier centered at different means. For each distribution, we sample 500 observations, and we repeat the experiment over ten seeds. In Tab. 4, we report the EBE and its standard deviation with Geodesic Sinkhorn, the total variation (TV) distance, and Sinkhorn. Since the TV only compares the means, it is sensitive to the outlier, whereas our method can identify the true treatment effect.

Table 4: Expected Barycenter Effect (EBE) with one outlier distribution centered either at -60, -30, 0, or 5 (an outlier at 5 coincides with the treatment mean, i.e., no outlier). Comparison using the barycenter from Sinkhorn, total variation (TV), or Geodesic Sinkhorn. Values closer to the real treatment effect of 5 are better.
Outlier | EBE Geo Sinkhorn | EBE Sinkhorn | EBE TV
-60 | 5.016 ± 0.226 | -0.103 ± 0.005 | -1.429 ± 0.144
-30 | 5.053 ± 0.196 | 0.355 ± 0.049 | 1.571 ± 0.144
0 | 4.917 ± 0.315 | 4.954 ± 0.157 | 4.571 ± 0.144
No outlier | 5.059 ± 0.159 | 5.054 ± 0.16 | 5.071 ± 0.144

We also compare distances used to compute the barycenter for retrieving an underlying distribution. We suppose two ground truth distributions, a standard Gaussian and a Gaussian with mean five and unit variance, but we only observe two groups of noisy distributions of unequal sizes:

${\mathcal{C}} = \{{\mathcal{N}}(\theta_i, 1) : \theta_i \sim {\mathcal{N}}(0,1),\ i \in [10]\}$
${\mathcal{T}} = \{{\mathcal{N}}(\theta_i, 1) : \theta_i \sim {\mathcal{N}}(5,1),\ i \in [5]\}.$

We compute $\mu^*_{\mathcal{C}}$ and $\mu^*_{\mathcal{T}}$, then the (exact) 2-Wasserstein distance between the barycenters $\mu^*_{\mathcal{C}}$ and $\mu^*_{\mathcal{T}}$, which we want to be close to the known 2-Wasserstein distance between the ground truth distributions. In Tab. 5, we report the score $|1 - \widehat{W_2}/W_2|$, where $\widehat{W}_2$ is the exact 2-Wasserstein distance between barycenters and $W_2$ the ground truth. A score close to zero is better; it implies that the barycenters are close to the ground truth distributions. We report results for distributions in 2 and 10 dimensions, as well as for different diffusion times $t$ and regularization parameters $\lambda$. In the Appendix (Tab. LABEL:tab:ATE_perturb_appendix, Tab. LABEL:tab:_gaussian_fam_appendix) we report additional results for Geodesic Sinkhorn with the graph used in UMAP [wolf2018scanpy, mcinnes2018umap].

Table 5: Approximation score for the barycenters of ten distributions centered around ${\mathcal{N}}(0,1)$ and five distributions centered around ${\mathcal{N}}(5,1)$. Scores closer to zero are better.
Dim | Reg. $(t, \lambda)$ | Geo Sinkhorn | Sinkhorn | TV
2 | (25, 0.005) | 0.144 ± 0.104 | 0.142 ± 0.094 | 0.636 ± 0.065
2 | (50, 0.05) | 0.141 ± 0.107 | 0.222 ± 0.192 | 0.636 ± 0.065
2 | (100, 0.5) | 0.133 ± 0.135 | 0.945 ± 0.019 | 0.636 ± 0.065
10 | (25, 0.005) | 0.191 ± 0.086 | 0.186 ± 0.081 | 0.541 ± 0.094
10 | (50, 0.05) | 0.192 ± 0.089 | 0.426 ± 0.309 | 0.541 ± 0.094
10 | (100, 0.5) | 0.187 ± 0.111 | 0.894 ± 0.035 | 0.541 ± 0.094

Single-cell signalling data

In this experiment, we show an application of the barycentric distance and the expected barycenter effect. We use single-cell signalling data produced by mass cytometry (MC) in a screening study comparing the treatment effect of different chemotherapies on 10 colorectal cancer (CRC) patient-derived organoids (PDOs) [zapatero2022pdos]. These PDOs can be grouped into chemoresistant PDOs, which show little-to-no effect when treated with chemotherapies, and chemosensitive PDOs, which present strong shifts in their phenotypes upon treatment. The observations include single-cell information on the cell cycle and signalling changes upon treatment of PDOs with different CRC treatments at a range of concentrations. To compute the barycentric distance, we define signals on the graph as indicator functions for the treatments and PDOs respecting the condition (chemoresistant or chemosensitive), and compute the barycenter of each treatment. For example, the barycenter corresponding to treatment S (SN-38, the active metabolite of Irinotecan) is $\mu_S^*$. In Fig. 4, we present the barycentric distance matrices between treatments (a) and between four concentrations of treatment SN-38 (b). In both cases, the control groups correspond to AH and DMSO, the two rightmost columns. We compare the distance matrices between Sinkhorn (left) and our method (right). Our method provides a finer distinction between treatments and concentrations, especially for the chemosensitive group. As observed in [zapatero2022pdos], chemosensitive PDOs show little-to-no response to lower concentrations of SN-38 (S1), but their phenotype shifts very strongly upon treatment with higher concentrations (S2, S3, and S4) (Fig. 4b). When comparing combinations of different treatments with SN-38, Geodesic Sinkhorn better resolves the difference between SN-38 alone and in combination with Cetuximab (C), showing that S is the main agent creating the treatment effect and that the combination with C does not result in a synergistic effect (Fig. 4a). Note that we only consider the relative magnitude of the distances, since the two algorithms use different ground distances. In Fig. 5, we show the Expected Barycenter Effect (Def. 4.4) of SN-38 for the two conditions; our method correctly detects the chemosensitive marker cPARP.

Fig. 4: a) Barycentric distance matrices for the Sinkhorn algorithm (left) and our method Geodesic Sinkhorn (right). b) Barycentric distance matrices between concentrations of treatment SN-38, for four concentrations S1, S2, S3, S4. Control groups correspond to AH and DMSO. Geodesic Sinkhorn provides a clearer distinction between treatments and concentrations.
Fig. 5: Expected conditional barycenter effect of treatment SN-38, for the chemoresistant (left) and chemosensitive (right) conditions. The increase in cPARP is coherent with chemosensitivity.

6 Conclusion

In this work, we considered the use of optimal transport for graphs and large high-dimensional datasets potentially sampled from a lower-dimensional manifold. We proposed Geodesic Sinkhorn, a fast implementation of the Sinkhorn algorithm using the graph Laplacian and Chebyshev polynomials. Our method is well adapted to large and high-dimensional datasets, as it is defined with a geodesic ground distance, which takes into account the underlying geometry of the data, and requires less computation time and memory. We showed on 10 scRNA-seq time series datasets that our method has greater interpolation accuracy than Sinkhorn and the exact McCann interpolant. On a synthetic dataset, we showed that Geodesic Sinkhorn can be about 23 times faster than Sinkhorn, even when taking into account the construction of the graph. On the classification task, our method was faster than other Sinkhorn-based algorithms while achieving greater accuracy. With the Wasserstein barycenter, we defined the barycentric distance to compare entire families of distributions, and the expected barycenter effect, and applied both to a large PDO drug screen dataset. In future work, we aim to study the theoretical implications of our new definitions.
