
GEODESIC SINKHORN FOR FAST AND ACCURATE OPTIMAL TRANSPORT ON MANIFOLDS

Abstract

Efficient computation of optimal transport distances between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state of the art for such computations, but require $O(n^2)$ computation. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold-structured scientific data, it is often desirable to consider a geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn, based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only $O(n \log n)$ computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high-dimensional single-cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.

1 Introduction

Optimal Transport (OT) distances, or Wasserstein distances, are computed by lifting ground distances between points to distances between measures. This distance is computed relative to a ground distance on the support of the distributions, making it more informative than distances based only on a pointwise comparison of the densities. However, to compute the Wasserstein distance one needs to find the optimal transport plan from the source distribution to a target distribution; this is a linear programming problem requiring $O(n^3 \log n)$ time for discrete distributions of size $n$ [peyre_computational_2020].

An efficient modification of the optimal transport problem is to consider an entropy-regularized transportation. This formulation is solved with the Sinkhorn algorithm [sinkhorn1967concerning] by iteratively rescaling a Gaussian kernel based on the distance matrix. It is equivalent to the Schrödinger bridge problem, for which similar algorithms were developed [fortet1940resolution, kullback1968probability, knight2013fast]. In the discrete case, it requires $O(n^2)$ computation for distributions of size $n$, since it relies on matrix-vector products. Furthermore, this formulation allows for fast computation of the discrete barycenter with fixed support (the average distribution w.r.t. the Sinkhorn distance). An important drawback of the Sinkhorn algorithm is the necessity to store and multiply the pairwise distance matrix with a vector.

Additionally, the ground distance is commonly chosen as the Euclidean distance. The Euclidean distance is often suboptimal for high-dimensional datasets over larger distances according to the manifold hypothesis, which says that observations lie near a low-dimensional (curved) manifold in high-dimensional space [moon_manifold_2018]. For high-dimensional datasets assumed to be sampled from a lower-dimensional manifold, using a ground distance closer to the manifold geodesic for OT has shown promising results [chen2020uncovering, huguet2022manifold, solomon2015convolutional, tong2022embedding, tong2021diffusion].

In this work we present Geodesic Sinkhorn (code: https://anonymous.4open.science/r/geo_sinkh_anon-61F7/README.rst), a Sinkhorn-based method for fast optimal transport with a heat-geodesic ground distance. Our method is based on the geometry of the dataset, constructed from a common graph, and uses the heat kernel on the graph to define a heat-geodesic distance. Key to this approach, we never need to construct or operate on an $n \times n$ distance matrix; we only use the sparse Laplacian matrix and sparse matrix-vector products. For sparse graphs, this allows $O(n \log n)$ computation of the Sinkhorn distance with a manifold ground distance, improving on the $O(n^2)$ implementations based on dense matrices.

Increasing the state-of-the-art efficiency of Sinkhorn computation opens the door to complex operations on large groups of datasets. In particular, we consider interpolating between datasets: 1) the McCann interpolant between two datasets, and 2) the barycenter between $m \geq 2$ datasets. We show that using our heat-geodesic distance improves interpolation accuracy compared to OT with the Euclidean distance. The barycenter corresponds to the average distribution of a set of distributions. Our method allows for finer-grained barycenters on a data manifold, which motivates us to define a novel notion of dissimilarity between families of distributions called the barycentric distance.

We use Geodesic Sinkhorn for interpolation of dynamic distributions on ten single-cell RNA sequencing (scRNA-seq) time series datasets. Further, we apply the barycentric distance to single-cell data from patient-derived cancer organoids (PDOs) to assess the effect of treatments (such as drugs and chemotherapy). Here we have one set of PDOs from control conditions, and another set that is treated. The true treatment effect is thus the distance between these barycenters. In addition, we use Geodesic Sinkhorn's barycenter to compare the effect of treatment on one family of distributions versus another.

Our main contributions include:

  • A new method for computing optimal transport distances on a manifold called Geodesic Sinkhorn, which is highly efficient in time and memory.

  • Defining the barycentric distance, a novel distance between families of distributions, and showing its utility in deriving treatment effect from control and treated patient samples.

  • Presenting new large single-cell RNA-sequencing datasets to the community as benchmarks for optimal transport.

2 Related Work

There are few methods for computing optimal transport efficiently between large datasets. The current state of the art is the Sinkhorn algorithm [cuturi2013sinkhorn], which computes a regularized OT map between discrete distributions of $n$ points in $O(n^2)$ time. The Low-Rank (LR) Sinkhorn algorithm [scetbon2021low] speeds up the computation by representing the cost matrix with a low-rank factorization. Similarly, [scetbon2020linear] uses a kernel representation of the cost matrix for faster computation. Other approximations are based on the dual formulation of the Wasserstein distance, like DiffusionEMD [tong2021diffusion], which approximates a distance equivalent to the Wasserstein distance with geodesic cost, and tree-based methods [le2019tree, backurs_scalable_2020, sato_fast_2020], which solve the dual formulation using a tree metric. These methods admit a closed form and scale almost linearly with $n$. The tree formulation is generalized to graphs in [le2022sobolev]; however, edges in the graph are interpreted as costs rather than affinities, which leads to a different problem setup than manifold approximation methods.

Geodesic Sinkhorn is related to prior work linking the entropy-regularized optimal transport problem on a triangular mesh with the heat operator [crane_geodesics_2013, solomon2015convolutional], but uses different graph filtering techniques. These approaches approximate the application of the heat kernel to a vector by discretizing the heat equation and solving systems of linear equations; this technique was used in different contexts, either with the cotangent Laplacian [solomon2015convolutional] or to learn a ground metric [heitz2021ground]. Solving these systems at each Sinkhorn iteration can be done efficiently with a sparse Cholesky decomposition. However, this method's efficiency depends mainly on the efficiency of the Cholesky decomposition, which can be slow depending on the sparsity pattern and is $O(n^3)$ in the worst case for an $n \times n$ matrix, and it necessitates solving $2K$ systems of linear equations per Sinkhorn iteration.

In Table 1, we summarize the advantages of the previous methods, where we denote Sinkhorn with the heat kernel approximated by implicit Euler discretization as Sinkhorn Euler.

Table 1: Summary of comparable OT methods. Geodesic Sinkhorn’s complexity benefits from the sparse matrix representation while providing efficient computation for the barycenter problem and the McCann interpolation.
Method | Sparse Matrix | Barycenter | McCann | Complexity per iteration
Exact OT | No | Yes | Yes | $O(n^3 \log n)$
Sinkhorn [cuturi2013sinkhorn] | No | Yes | Yes | $O(n^2)$
LR Sinkhorn [scetbon2021low] | No | Yes | Yes | $O(N(n+m)r + Nnmr)$, $N,m,r \ll n$
Diffusion EMD [tong2021diffusion] | Yes | No | No | $O(n \log n)$
Sinkhorn Euler [solomon2015convolutional] | Yes | Yes | Yes | Cholesky $O(n^3)$ + linear system $O(n^2)$
Geodesic Sinkhorn (ours) | Yes | Yes | Yes | $O(Kmn)$, $K,m \ll n$

3 Preliminaries

In this section, we start by reviewing the basics of OT and the Wasserstein distance, as well as the Sinkhorn distance. We then review two notions fundamental to our method: the heat equation on a graph, and the Chebyshev approximation of the heat kernel.

Wasserstein Distance

In the following, we assume that all distributions admit a density or a probability mass function, and we use the same notation for both. Let $\mu, \nu$ be two probability distributions on a measurable space ${\mathcal{X}} \subseteq \mathbb{R}^d$ with metric $d(\cdot,\cdot)$, and let $\Pi(\mu,\nu)$ be the set of joint probability distributions $\pi$ on the space ${\mathcal{X}} \times {\mathcal{X}}$ such that for any measurable subset $\omega \subset {\mathcal{X}}$, $\pi(\omega \times {\mathcal{X}}) = \mu(\omega)$ and $\pi({\mathcal{X}} \times \omega) = \nu(\omega)$. The $p$-Wasserstein distance is defined as:

$W_d^p(\mu,\nu) := \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{{\mathcal{X}}^2} d(x,y)^p \,\mathrm{d}\pi(x,y)\right)^{1/p}.$ (1)

In this work we consider $p=2$, as it is a valid metric between distributions and has other attractive theoretical properties. An exact algorithm based on linear programming can solve this problem in $O(n^3 \log n)$ time for discrete distributions of size $n$.

Sinkhorn Distances

The Kullback-Leibler (KL) divergence between $\pi$ and some strictly positive $K$ on ${\mathcal{X}} \times {\mathcal{X}}$ is defined as

$D_{\mathrm{KL}}(\pi|K) := \int_{{\mathcal{X}}^2}\left(\ln\frac{\pi(x,y)}{K(x,y)} - 1\right)\mathrm{d}\pi(x,y).$ (2)

The Sinkhorn distance (with a slight abuse of language we use the term distance, although the entropy-regularized formulation does not respect the identity of indiscernibles) is a relaxation of equation 1, where the infimum is over all couplings in $\{\pi \in \Pi(\mu,\nu) \,|\, D_{\mathrm{KL}}(\pi|\mu\times\nu) \leq \xi\}$ for $\xi > -1$. Introduced in [cuturi2013sinkhorn], this distance can be computed by solving the entropy-regularized transport problem

$W_{d,\lambda}^2(\mu,\nu) := \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{{\mathcal{X}}^2} d(x,y)^2 \,\mathrm{d}\pi(x,y) - \lambda H(\pi)\right)^{1/2},$ (3)

where we define the entropy of a coupling $\pi$ as $H(\pi) := -\int \ln\pi(x,y)\,\mathrm{d}\pi(x,y)$, and $\lambda > 0$.

This formulation converges to the Wasserstein distance as $\lambda \to 0$, and can be solved with the Sinkhorn algorithm with complexity of the order $O(n^2/\epsilon)$ for discrete distributions of size $n$ [benamou2015iterative, cuturi2013sinkhorn]. In the discrete case, the transport matrix $\bm{\pi}$ admits the form $\mathrm{diag}({\bm{v}}) K_\lambda \mathrm{diag}({\bm{w}})$, where ${\bm{v}}, {\bm{w}}$ are vectors of size $n$. The Sinkhorn algorithm iteratively updates these vectors as $({\bm{v}}, {\bm{w}}) \leftarrow (\mu./K_\lambda{\bm{w}},\ \nu./K_\lambda^T{\bm{v}})$, where $(K_\lambda)_{ij} := e^{-d(x_i,x_j)^2/\lambda}$.
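To make these updates concrete, the following is a minimal dense-kernel sketch of the Sinkhorn iterations (names and defaults are ours; this is precisely the $O(n^2)$ dense formulation that Geodesic Sinkhorn avoids):

```python
import numpy as np

def sinkhorn_dense(mu, nu, D, lam, n_iter=100):
    # Dense Gaussian kernel built from the pairwise distance matrix D.
    K = np.exp(-D**2 / lam)
    v, w = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        v = mu / (K @ w)    # enforce the first marginal
        w = nu / (K.T @ v)  # enforce the second marginal
    return v[:, None] * K * w[None, :]  # transport plan diag(v) K diag(w)
```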

Following [solomon2015convolutional], using the kernel $K_\lambda$ gives an alternative interpretation of the Sinkhorn distance as

$W_{d,\lambda}^2(\mu,\nu) = \lambda^{1/2}\left(1 + \min_{\pi\in\Pi(\mu,\nu)} D_{\mathrm{KL}}(\pi|K_\lambda)\right)^{1/2}.$ (4)

The problem in equation 3 is strictly convex and continuous, yielding a unique minimizer. In the discrete case, this leads to an algorithm for the entropy-regularized Wasserstein distance based on the Sinkhorn algorithm, enforcing the marginal constraints on the kernel $K_\lambda$ while minimizing the distance as quantified by $D_{\mathrm{KL}}$.

The underlying metric $d(\cdot,\cdot)$ is generally unknown, thus the kernel $K_\lambda$ cannot be evaluated. The authors of [solomon2015convolutional] proposed to approximate $K_\lambda$ with the heat kernel $\mathcal{H}_t(x,y)$ on ${\mathcal{X}}$. According to Varadhan's formula [varadhan_behavior_1967], the geodesic distance on a manifold can be recovered from the heat transfer at small timescales as

$d(x,y)^2 = \lim_{t\to 0} -4t \ln \mathcal{H}_t(x,y).$ (5)

This motivates the use of the heat-geodesic distance $d_{\mathcal{H}}^2(x,y) := -4t\ln\mathcal{H}_t(x,y)$, with associated kernel

$K_\lambda(x,y) = \mathcal{H}_{\lambda/4}(x,y).$ (6)
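To make the identification in equation 6 explicit: substituting the heat-geodesic distance into the definition of $K_\lambda$ and choosing the diffusion time $t = \lambda/4$ gives

$K_\lambda(x,y) = e^{-d_{\mathcal{H}}^2(x,y)/\lambda} = e^{(4t/\lambda)\ln\mathcal{H}_t(x,y)} = \mathcal{H}_t(x,y)^{4t/\lambda} = \mathcal{H}_{\lambda/4}(x,y),$

so a single application of the heat kernel at time $\lambda/4$ plays the role of the Gaussian kernel in the Sinkhorn iterations.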

Interestingly, Sinkhorn-based methods admit an efficient algorithm to solve the barycenter problem, which we present next.

Interpolation with discrete support

By constraining the support to a set ${\mathcal{X}}$ (or a graph), we can efficiently interpolate between more than two distributions. The barycenter problem [benamou2015iterative, solomon2015convolutional, peyre_computational_2020, cuturi2014fast] generalizes the notion of an average between points to an average between distributions. For a set of $m$ distributions $\{\mu_1,\dotsc,\mu_m\}$ supported on ${\mathcal{X}}$, the objective is to find a distribution minimizing the weighted average distance

$\mu^* := \operatorname*{arg\,min}_{\mu\in{\mathcal{P}}({\mathcal{X}})} \sum_{i=1}^m \alpha_i W_d^p(\mu,\mu_i)^p,$

where ${\mathcal{P}}({\mathcal{X}})$ denotes the space of probability distributions supported on ${\mathcal{X}}$, and $\{\alpha_1,\dotsc,\alpha_m\}$ are non-negative weights. Finding the barycenter is a challenging optimization problem; however, the barycenter for Sinkhorn-based methods admits an efficient computation (see Algorithm LABEL:alg:sinkhorn_bary in the appendix, sketched below). It involves updating $m$ pairs of vectors ${\bm{v}}_i, {\bm{w}}_i$, which define a transport plan from $\mu_i$ to the barycenter $\mu^*$. The barycenter of two distributions supported on a (discrete) semi-circle is depicted in Fig. 1. Here the two input distributions, at times 0 and 1, have support over 5,000 points each. In the case where ${\mathcal{X}}$ is the union of these point sets, the barycenter is a discrete distribution over these 10,000 points, with density concentrated at the top of the semi-circle. The support of the barycenter is constrained to ${\mathcal{X}}$; for most Sinkhorn-based methods the size of ${\mathcal{X}}$ needs to be small for computational reasons. Our method does not suffer from such a limitation. Hence, we can consider barycenters with greater expressivity, and interpolate between large sets of distributions.
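The appendix algorithm is not reproduced here, but the iterative Bregman projection scheme of [benamou2015iterative] that it follows can be sketched as follows, where apply_K stands for any routine applying the (symmetric) kernel to a vector and all names are ours:

```python
import numpy as np

def sinkhorn_barycenter(apply_K, mus, alphas, n_iter=100):
    # mus: list of m distributions on a shared support of size n.
    m, n = len(mus), len(mus[0])
    w = [np.ones(n) for _ in range(m)]
    for _ in range(n_iter):
        v = [mus[i] / apply_K(w[i]) for i in range(m)]
        # Barycenter: weighted geometric mean of the projected marginals.
        mu = np.exp(sum(a * np.log(apply_K(vi)) for a, vi in zip(alphas, v)))
        w = [mu / apply_K(vi) for vi in v]
    return mu
```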

Interpolation in $\mathbb{R}^d$

Calculating the barycenter with ${\mathcal{X}} = \mathbb{R}^d$ is computationally challenging; for $m > 2$ it is intractable. However, for the special case of $m = 2$, the $\mathbb{R}^d$-barycenter can be computed efficiently using the McCann interpolant [mccann1997convexity]. For an optimal transport matrix $\bm{\pi}$ between $\mu_0$ and $\mu_1$ with weights $\alpha_0 = 1-t$, $\alpha_1 = t$, the McCann interpolant at time $t$ is

$\mu_t := \sum_{i,j}\bm{\pi}_{ij}\,\delta_{(1-t)x_i + ty_j} = \operatorname*{arg\,min}_{\mu\in{\mathcal{P}}(\mathbb{R}^d)}\sum_{i=0}^1 \alpha_i W_d^p(\mu,\mu_i)^p,$

where $x_i \sim \mu_0$, $y_j \sim \mu_1$, and $\delta$ is the Dirac distribution. The McCann interpolant is often used for interpolation between consecutive distributions [schiebinger_optimal-transport_2019].
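Given a discrete plan $\bm{\pi}$, sampling from the McCann interpolant amounts to drawing index pairs with probability $\bm{\pi}_{ij}$ and linearly interpolating the matched points; a minimal sketch (names ours):

```python
import numpy as np

def sample_mccann(pi, X0, X1, t, n_samples, rng=np.random.default_rng(0)):
    # Draw (i, j) pairs with probability proportional to the plan entries.
    p = pi.ravel() / pi.sum()
    idx = rng.choice(p.size, size=n_samples, p=p)
    i, j = np.unravel_index(idx, pi.shape)
    # Displacement interpolation: move each matched pair a fraction t of the way.
    return (1 - t) * X0[i] + t * X1[j]
```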

Next we consider the ground distance; following Varadhan's formula, we consider the approximation of heat diffusion for OT with a heat-geodesic distance.

Heat Diffusion on a Graph

Consider an undirected graph ${\mathcal{G}} = (V,E)$ with a set $V$ of $n$ vertices and a set of edges $E$, its weighted adjacency matrix ${\bm{A}}$ with non-negative edge weights, and the diagonal degree matrix ${\bm{D}}$, where ${\bm{D}}_{ii} := \sum_k {\bm{A}}_{ik}$. We define the combinatorial Laplacian as ${\bm{L}} := {\bm{D}} - {\bm{A}}$; for any function $f: V \to \mathbb{R}$ we have $({\bm{L}}f)(v) = \sum_u a_{u,v}(f(v) - f(u))$. The combinatorial Laplacian is a symmetric positive semi-definite matrix, and has an eigendecomposition ${\bm{L}} = \Psi\Lambda\Psi^T$ with orthonormal eigenvectors $\Psi$ and diagonal eigenvalue matrix $\Lambda = \mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_n)$, such that $0 \leq \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. The combinatorial Laplacian is a natural extension of the negative of the Laplacian operator to graphs. For a signal ${\bm{f}}_0 \in \mathbb{R}^n$ on ${\mathcal{G}}$, the diffusion of ${\bm{f}}_0$ on the graph evolves according to the heat equation

$\frac{d}{dt}{\bm{f}}(t) + {\bm{L}}{\bm{f}}(t) = \bm{0}, \quad \text{s.t.}\ {\bm{f}}(0) = {\bm{f}}_0,\ t \in \mathbb{R}^+.$

This ordinary differential equation describes how the heat, represented by the signal ${\bm{f}}$, propagates on the graph over time. Since the eigenvectors of ${\bm{L}}$ form an orthonormal basis of $\mathbb{R}^n$, we have the solution

${\bm{f}}(t) = \sum_k \langle{\bm{f}}_0,\psi_k\rangle e^{-t\lambda_k}\psi_k.$

The heat operator on ${\mathcal{G}}$ is defined by the matrix exponential ${\bm{H}}_t := e^{-t{\bm{L}}}$. By orthogonality of the eigenvectors of ${\bm{L}}$, we can write ${\bm{H}}_t = \Psi e^{-t\Lambda}\Psi^T$ and ${\bm{f}}(t) = {\bm{H}}_t{\bm{f}}_0$. Computing ${\bm{H}}_t$ by eigendecomposition would require $O(n^3)$ operations. Recall that, for the Sinkhorn algorithm, we are only concerned with the application of the heat operator ${\bm{H}}_t$ to a signal ${\bm{f}} \in \mathbb{R}^n$. In Geodesic Sinkhorn we use Chebyshev polynomials [shuman_chebyshev_2011, marcotte2022fast] to approximate the application of the heat operator to a signal. For a short timescale $t$, using the heat kernel amounts to using the geodesic distance as the ground distance in the entropy-regularized optimal transport formulation of equation 3.
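For small graphs, the eigendecomposition route just described gives an exact reference implementation (our naming; $O(n^3)$, so useful mainly to validate faster approximations such as the Chebyshev one below):

```python
import numpy as np

def heat_exact(L_dense, f0, t):
    # H_t f0 = Psi exp(-t Lambda) Psi^T f0 via a full eigendecomposition.
    lam, Psi = np.linalg.eigh(L_dense)
    return Psi @ (np.exp(-t * lam) * (Psi.T @ f0))
```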

Chebyshev Polynomials

Polynomial sequences are often used to approximate functions or operators. With Chebyshev polynomials, we can approximate the application of the matrix exponential ${\bm{H}}_t = e^{-t{\bm{L}}}$ to a signal ${\bm{f}}$ on the graph. An attractive property of Chebyshev polynomials is that the approximation error decays exponentially with the maximum degree $K$. They are defined by the recursive relation $\{T_k\}_{k\in\mathbb{N}}$ with $T_0(y) = 1$, $T_1(y) = y$, and $T_k(y) = 2yT_{k-1}(y) - T_{k-2}(y)$ for $k \geq 2$. On $[-1,1]$ these polynomials are orthogonal w.r.t. the weight $(1-y^2)^{-1/2}$, and can be used to express the operator ${\bm{H}}_t$. Assuming the largest eigenvalue satisfies $\lambda_n \leq 2$, we can write

${\bm{H}}_t = \frac{b_0}{2} + \sum_{k=1}^\infty b_k T_k({\bm{L}} - \bm{Id}),$

where the scalar coefficients $\{b_k\}$ depend on the time $t$ and can be evaluated with Bessel functions. The approximation of ${\bm{H}}_t$ is based on the first $K+1$ terms of the series, and we denote it $p_K({\bm{L}},t)$. The approximation of the application of the heat operator to a signal ${\bm{f}}$ is

$p_K({\bm{L}},t){\bm{f}} = \frac{b_0}{2}{\bm{f}} + \sum_{k=1}^K b_k T_k({\bm{L}} - \bm{Id}){\bm{f}}.$

This results in $K$ matrix-vector products, which can be efficient since in general ${\bm{L}}$ is a sparse matrix. On an $m$-nearest neighbor graph, this can be $O(Kmn/\lambda)$, where $\lambda$ is a regularization parameter. The assumption that $\lambda_n \leq 2$ does not always hold for the combinatorial Laplacian, but the Laplacian can be rescaled as $2{\bm{L}}/\lambda_n$ to ensure it, or one can use the normalized Laplacian, whose eigenvalues lie in $[0,2]$, which also avoids computing the largest eigenvalue. Chebyshev polynomials admit interesting theoretical properties, in particular the existence of approximation bounds depending on the largest degree $K$ [marcotte2022fast]; hence $K$ can be determined for a given approximation error tolerance.
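A minimal sketch of this approximation, assuming a sparse normalized Laplacian with spectrum in $[0,2]$ so that ${\bm{L}} - \bm{Id}$ has spectrum in $[-1,1]$; the coefficients come from the classical modified-Bessel expansion of the exponential, and all names are ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.special import ive  # exponentially scaled modified Bessel function

def cheb_heat_apply(L, f, t, K=30):
    # Approximate exp(-t L) @ f with a degree-K Chebyshev expansion.
    A = L - sp.identity(L.shape[0], format="csr")  # spectrum in [-1, 1]
    # e^{-t(1+y)} = b_0/2 + sum_k b_k T_k(y) with b_k = 2 (-1)^k e^{-t} I_k(t);
    # ive(k, t) evaluates I_k(t) e^{-t} without overflow.
    b = np.array([2 * (-1) ** k * ive(k, t) for k in range(K + 1)])
    T_prev, T_curr = f, A @ f  # T_0(A) f and T_1(A) f
    out = 0.5 * b[0] * T_prev + b[1] * T_curr
    for k in range(2, K + 1):
        T_prev, T_curr = T_curr, 2 * (A @ T_curr) - T_prev  # Chebyshev recurrence
        out = out + b[k] * T_curr
    return out
```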

4 Geodesic Sinkhorn Distances

We define the Geodesic Sinkhorn distance between any signals or distributions on a graph ${\mathcal{G}}$ by the entropy-regularized OT with the heat kernel ${\bm{H}}_t$ on the graph. This construction is also valid between point cloud datasets. In that case, for $m$ datasets $\{\mathsf{X}_1,\dotsc,\mathsf{X}_m\}$ sampled from a set of distributions $\{\mu_1,\dotsc,\mu_m\}$, we construct a common graph using an affinity kernel on the $m$ datasets and compare two samples by taking the distance between the corresponding indicator functions on the graph. We approximate the heat kernel ${\bm{H}}_t$ with Chebyshev polynomials $p_K({\bm{L}},t)$ of order $K$. In Algorithm 4.1, we present the main steps to evaluate Geodesic Sinkhorn. It is based upon Sinkhorn iterations [sinkhorn1967concerning, cuturi2013sinkhorn], where $\oslash$ and $\odot$ denote respectively elementwise division and multiplication. Note that, as opposed to the usual Sinkhorn algorithm, we never have to store a dense $n \times n$ distance matrix, only the usually sparse graph Laplacian.

Definition 4.1.

The Geodesic Sinkhorn distance between two distributions $\mu$, $\nu$ on a graph ${\mathcal{G}}$ is

$W_{{\bm{H}}_t}(\mu,\nu) := (4t)^{1/2}\left(1 + \min_{\pi\in\Pi(\mu,\nu)} D_{\mathrm{KL}}(\pi|{\bm{H}}_t)\right)^{1/2}.$
Algorithm 4.1: Geodesic Sinkhorn

Input: $n \times n$ Laplacian ${\bm{L}}$, distributions (signals) ${\bm{\mu}}$, ${\bm{\nu}}$ on ${\mathcal{G}}$, maximum Chebyshev degree $K$, regularization $\lambda$ (diffusion time $t = \lambda/4$), vertex weights ${\bm{a}}$.
Output: $W_{d_{\bm{H}},\lambda}^{2}({\bm{\mu}},{\bm{\nu}})$
// Initialization
${\bm{w}} \leftarrow \bm{1}$
// Sinkhorn iterations
for $j = 1, 2, \dotsc$ do
    ${\bm{v}} \leftarrow {\bm{\mu}} \oslash p_K({\bm{L}},t)({\bm{a}} \odot {\bm{w}})$
    ${\bm{w}} \leftarrow {\bm{\nu}} \oslash p_K({\bm{L}},t)({\bm{a}} \odot {\bm{v}})$
end for
Return: $4t \sum [{\bm{a}} \odot ({\bm{\mu}} \odot \ln{\bm{v}} + {\bm{\nu}} \odot \ln{\bm{w}})]$
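The iterations above transcribe directly to Python; a sketch reusing the cheb_heat_apply routine sketched in Section 3 (our naming, not the reference implementation):

```python
import numpy as np

def geodesic_sinkhorn(L, mu, nu, t, K=30, a=None, n_iter=100):
    # L: sparse graph Laplacian with spectrum in [0, 2]; mu, nu: distributions on G.
    n = L.shape[0]
    a = np.full(n, 1.0 / n) if a is None else a  # uniform vertex weights by default
    w = np.ones(n)
    for _ in range(n_iter):
        v = mu / cheb_heat_apply(L, a * w, t, K)
        w = nu / cheb_heat_apply(L, a * v, t, K)
    # Squared Geodesic Sinkhorn distance, matching the Return statement above.
    return 4 * t * np.sum(a * (mu * np.log(v) + nu * np.log(w)))
```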

Fig. 1: Two distributions at times 0 and 1 on a 1D manifold embedded in 2D, comparing the McCann interpolant (center, green) with the barycenter interpolant with fixed support, colored by density (right), for a left-out time 1/2.

In the following proposition, we identify the ground distance implicitly used in the optimal transport problem defined by Geodesic Sinkhorn. We recall that two distances $d_1(x,y), d_2(x,y)$ are equivalent if for all $x,y$ there exist two positive constants $c, C$ such that $cd_2(x,y) \leq d_1(x,y) \leq Cd_2(x,y)$, and we write $d_1 \simeq d_2$.

Proposition 4.2.

There exists a maximum Chebyshev polynomial degree $K$ such that the ground distance in Geodesic Sinkhorn is equivalent to the one based on the heat kernel on ${\mathcal{G}}$:

$-4t\log(p_K({\bm{L}},t))_{ij} \simeq -4t\log({\bm{H}}_t)_{ij} \simeq -4t\log(\Psi e^{-t\Lambda}\Psi^T)_{ij}.$

In particular, the Wasserstein distances with these ground distances are equivalent.

Proof.

Provided in Appendix LABEL:sec:appendix_theory. ∎

In [solomon2015convolutional, heitz2021ground], using the implicit Euler discretization results in a ground cost of the form $-\epsilon\ln((\mathbf{Id} - \frac{\epsilon}{4K}{\bm{L}})^{-K})$, where $\mathbf{Id}$ is the identity matrix; this can be seen as another approximation of the matrix exponential.

The efficiency of Geodesic Sinkhorn improves the notion of barycenter, as it is possible to consider a much larger graph ${\mathcal{G}}$, and thus a finer-grained support for the barycenter. This leads us to define a novel distance between families of distributions.

Definition 4.3.

For two finite families of distributions ${\mathcal{T}}$ and ${\mathcal{C}}$ supported on ${\mathcal{G}}$, we define the barycentric distance between the families ${\mathcal{T}}, {\mathcal{C}}$ as

$\gamma({\mathcal{T}},{\mathcal{C}}) := W_{{\bm{H}}_t}(\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}),$

where $\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}$ are respectively the barycenters of ${\mathcal{T}}$ and ${\mathcal{C}}$.

This distance follows a similar intuition as the distances in clustering algorithms like average-linkage, where a distance between clusters is defined via an aggregation of distances. The previous definition is valid for any distance between distributions or barycenters; however, OT barycenters are known to be more informative than others [cuturi2014fast]. We further explore this comparison in our experiments. We use it to distinguish between two groups in a medical setting where a set of patients received a treatment (defining the family ${\mathcal{T}}$), and another set acts as the control family ${\mathcal{C}}$. Following this idea, we define a notion of effect between two families.
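Combining the earlier sketches, the barycentric distance takes a few lines (again a sketch with our naming, assuming uniform weights within each family):

```python
import numpy as np

def barycentric_distance(L, family_T, family_C, t, K=30):
    apply_K = lambda f: cheb_heat_apply(L, f, t, K)
    a_T = np.full(len(family_T), 1.0 / len(family_T))
    a_C = np.full(len(family_C), 1.0 / len(family_C))
    mu_T = sinkhorn_barycenter(apply_K, family_T, a_T)
    mu_C = sinkhorn_barycenter(apply_K, family_C, a_C)
    # geodesic_sinkhorn returns the squared distance.
    return np.sqrt(geodesic_sinkhorn(L, mu_T, mu_C, t, K))
```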

Definition 4.4.

For two families of distributions ${\mathcal{T}}$ and ${\mathcal{C}}$ supported on ${\mathcal{G}}$, we define the Expected Barycenter Effect of ${\mathcal{T}}$ as

$\tau({\mathcal{T}}) := \mathbb{E}_{\mu_{\mathcal{T}}^*}(Y_t) - \mathbb{E}_{\mu_{\mathcal{C}}^*}(Y_c),$

where $\mu^*_{\mathcal{T}}, \mu^*_{\mathcal{C}}$ are respectively the barycenters of ${\mathcal{T}}$ and ${\mathcal{C}}$, and $Y_t \sim \mu_{\mathcal{T}}^*$, $Y_c \sim \mu_{\mathcal{C}}^*$ are the features.

Note that we compute the average on the family of distributions instead of the average on their support; hence we evaluate these expectations in closed form. This definition also extends to a conditional equivalent where families of distributions can be subdivided by discrete covariate variables. When the barycenters are computed with the total variation, this definition is equivalent to the naive Average Treatment Effect (ATE) [imbens2015causal], i.e. the difference of empirical means of the two groups.
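Since a barycenter is a probability vector over the graph nodes, the expectations in Definition 4.4 are weighted means of the node features; a one-line sketch (names ours):

```python
def expected_barycenter_effect(X, mu_T_star, mu_C_star):
    # X: (n, d) node features; the barycenters are probability vectors of length n.
    return X.T @ mu_T_star - X.T @ mu_C_star  # closed-form E[Y_t] - E[Y_c]
```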

In Fig. 2 we present the focus of this section and the three types of interpolation we consider: between two distributions without constraint on the support, barycenters between multiple distributions, and a distance between families of distributions.

Fig. 2: Schematic and applications of Geodesic Sinkhorn: time interpolation, barycenters, and the barycentric distance.

5 Results

We demonstrate the accuracy and efficiency of the Geodesic Sinkhorn distance on three tasks:

  1. Time-series interpolation (Section 5.1), similar to the setup of [schiebinger_optimal-transport_2019], on ten scRNA-seq datasets.

  2. Nearest-Wasserstein-neighbor distribution calculation on simulated data with manifold structure, similar to the setup of [tong2021diffusion].

  3. A newly defined barycentric distance between families of distributions, applied to quantify the effect of a treatment on patient-derived organoids.

In all our experiments, we constructed the weighted adjacency matrix ${\bm{A}}$ of the graph with the $\alpha$-decay kernel [moon_visualizing_2019]. This kernel is based on a k-nearest neighbor graph, attributes weights with exponential decay, and is extensively used in single-cell analysis. We used the default parameters, that is, five nearest neighbors and $\alpha = 40$. A sketch of this construction is given below.
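For concreteness, a dense illustrative sketch of the $\alpha$-decay kernel (adaptive-bandwidth affinities in the spirit of [moon_visualizing_2019]; practical implementations restrict it to a sparse kNN graph, and the helper names are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def alpha_decay_kernel(X, knn=5, alpha=40):
    # Adaptive bandwidth: distance from each point to its knn-th neighbor.
    dists, _ = NearestNeighbors(n_neighbors=knn + 1).fit(X).kneighbors(X)
    eps = dists[:, -1]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Symmetrized exponential-decay affinities; the diagonal self-affinities
    # do not affect the combinatorial Laplacian L = D - A.
    return 0.5 * (np.exp(-(D / eps[:, None]) ** alpha)
                  + np.exp(-(D / eps[None, :]) ** alpha))
```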

Table 2: Time series interpolation task comparing mean and standard deviation $(\mu \pm \sigma)$ across 5 seeds of the 2-Wasserstein metric averaged across time, $\frac{1}{T-2}\sum_{t\in[2,\ldots,T-1]} W_2(\hat{\mu}_t,\mu_t)$, for 10 single-cell time series datasets. Sinkhorn Euler ran out of memory (OOM) on two datasets. Lower is better; Geo Sinkhorn achieves the best performance on each dataset.
Dataset | $L^2$ Sinkhorn | Sinkhorn Euler | Geo Sinkhorn (ours)
Cite Donor0 | 48.545 ± 0.057 | 46.254 ± 3.192 | 44.440 ± 0.108
Cite Donor1 | 48.220 ± 0.055 | 45.897 ± 3.254 | 44.165 ± 0.103
Cite Donor2 | 50.281 ± 0.016 | 47.773 ± 3.958 | 45.673 ± 0.092
Cite Donor3 | 49.339 ± 0.081 | 46.565 ± 3.553 | 45.022 ± 0.146
Clark | 13.500 ± 0.003 | OOM | 13.288 ± 0.008
EB | 12.415 ± 0.008 | 12.298 ± 0.140 | 12.133 ± 0.011
Multiome Donor0 | 56.648 ± 0.048 | 55.373 ± 7.234 | 53.431 ± 0.077
Multiome Donor1 | 54.028 ± 0.126 | 52.396 ± 4.394 | 50.238 ± 0.022
Multiome Donor2 | 58.798 ± 0.155 | 57.182 ± 5.511 | 55.041 ± 0.058
WOT | 8.096 ± 0.003 | OOM | 7.397 ± 0.106

5.1 Time series Interpolation

To evaluate Geodesic Sinkhorn's performance on inferring dynamics, we test it on a time series interpolation task. In this setting the most used datasets are Embryoid Body [moon_visualizing_2019] and WOT [schiebinger_optimal-transport_2019]. We curated ten scRNA-seq datasets: WOT [schiebinger_optimal-transport_2019], Clark [clark_single-cell_2019], Embryoid Body [moon_visualizing_2019], and seven more from the 2022 multimodal single-cell integration challenge (https://www.kaggle.com/competitions/open-problems-multimodal/). The observations are the gene expression of single cells from a distribution evolving through time. The goal is to interpolate the distribution between two timepoints. The number of timepoints in each dataset ranges from 4 to 40. For a dataset with $T$ single-cell distributions $\{\mu_1,\mu_2,\ldots,\mu_T\}$ over time, for each $t \in [2,\ldots,T-1]$ we compute the exact Euclidean 2-Wasserstein distance $W_2(\hat{\mu}_t,\mu_t)$ between the interpolated distribution $\hat{\mu}_t$ at time $t$ and the ground truth distribution $\mu_t$ (see Appendix LABEL:sec:_appendix_additional_res for details). Since we interpolate between two distributions, we use the McCann interpolant, as its support is $\mathbb{R}^d$. In Tab. 2, we compare our Geodesic Sinkhorn interpolation with the exact McCann interpolant with Euclidean distance (Exact McCann), the Sinkhorn McCann interpolant with Euclidean ground distance (Sinkhorn McCann), and Sinkhorn Euler McCann with the Euler heat approximation. Sinkhorn with the Euler approximation ran out of memory on the Clark and WOT datasets. For space reasons, we leave comparisons to additional interpolation methods to the appendix (Tab. LABEL:tab:full_time_interp, Tab. LABEL:tab:time_serie_kernel). Across all 10 datasets, the Geodesic Sinkhorn interpolation with the McCann interpolant outperforms all other methods, showcasing the importance of the heat-geodesic distance and our kernel approximation. A sketch of the evaluation metric is given below.
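As an illustration of the evaluation metric, the exact Euclidean 2-Wasserstein distance between an interpolated and a held-out empirical distribution could be computed with the POT library as follows (a sketch; the solver configuration used for the reported numbers may differ):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_exact(X_hat, X_true):
    # Uniform empirical distributions; ot.dist defaults to squared Euclidean cost.
    a = np.full(len(X_hat), 1.0 / len(X_hat))
    b = np.full(len(X_true), 1.0 / len(X_true))
    M = ot.dist(X_hat, X_true)
    return np.sqrt(ot.emd2(a, b, M))  # exact 2-Wasserstein distance
```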

5.2 Nearest-Wasserstein-neighbor distributions

Computation time of heat approximation

We compare the time required to approximate the heat kernel using either Chebyshev polynomials ($K=30$) or the implicit Euler discretization (30 steps) with Cholesky decomposition [solomon2015convolutional, heitz2021ground]. In Fig. 3, we report the time in seconds required to diffuse a uniform signal on a graph where we sample 10k observations from a number of distributions ranging from 5 to 45. We also report the time without the steps that can be prefactored: the computation of the normalized Laplacian for the Chebyshev method (Cheb w/o pre), and the Cholesky decomposition for the Euler method (Euler w/o pre). Using Chebyshev polynomials can be more than two times faster than the Euler method. Recall that this step needs to be applied twice per Sinkhorn iteration; our method can thus be drastically faster. In the appendix, Fig. LABEL:fig:time_swiss_roll_W1 and Fig. LABEL:fig:time_swiss_roll show more comparative results on the time efficiency of our method against Sinkhorn, Diffusion EMD, and LR Sinkhorn; our method can be 20 times faster than Sinkhorn and LR Sinkhorn.

Computation time and accuracy

In this experiment, we compare our method against Sinkhorn [cuturi2013sinkhorn] and LR Sinkhorn [scetbon2021low], both with Euclidean and squared Euclidean ground distances, as well as DiffusionEMD [tong2021diffusion] and Sinkhorn with the Euler approximation of the heat filter. We created 15 Gaussian distributions sampled randomly on a Swiss roll dataset, and sampled 10k observations from each distribution. We consider a k-nearest-neighbor task on these distributions. We evaluate the methods against the ground truth, since we know the exact geodesic distance on the manifold. We test two implementations of our method: either using the Chebyshev approximation from the Python library PyGSP [michael_defferrard_2017_1003158] (Geodesic Sinkhorn), with the graph from PHATE (P) and UMAP (S), or using the Chebyshev implementation from [marcotte2022fast] (Geodesic Sinkhorn*). The latter implementation can be easily sectioned, thus avoiding superfluous computation; for instance, we compute the coefficients $b_k$ only once on the graph, then use them for all 15 distributions. In Tab. 3, we report the average and standard deviation over 10 seeds of the Spearman and Pearson correlations to the ground truth, and the runtime in seconds with and without the computation of the graph. Our method is always more accurate than Sinkhorn and LR Sinkhorn, while being about 6 to 20 times faster. Our method's accuracy is similar to the approximation with Euler, but can be almost four times faster. DiffusionEMD and Geodesic Sinkhorn performances are similar, but DiffusionEMD provides neither a transport plan nor meaningful barycenters.


Fig. 3: Time in seconds to diffuse a uniform vector on a graph either using Euler with Cholesky or Chebyshev polynomials. Using Chebyshev polynomials is faster with and without taking into account the steps that can be computed only once.
Table 3: KNN task for 15 distributions. Geodesic Sinkhorn's accuracy is better than or similar to other Sinkhorn-based methods, while being much faster.
Method | SpearmanR | PearsonR | P@5 | time(s) no graph | time(s)
Diffusion EMD | 0.62 ± 0.097 | 0.736 ± 0.023 | 0.66 ± 0.072 | 2.845 ± 0.135 | 7.877 ± 0.531
Sinkhorn $L^1$ | 0.387 ± 0.044 | 0.523 ± 0.036 | 0.471 ± 0.028 | 112.406 ± 0.206 | 112.406 ± 0.206
Sinkhorn $L^2$ | 0.411 ± 0.036 | 0.485 ± 0.027 | 0.492 ± 0.053 | 133.686 ± 5.234 | 133.686 ± 5.234
LR Sinkhorn $L^1$ | -0.31 ± 0.07 | -0.131 ± 0.086 | 0.237 ± 0.037 | 578.631 ± 107.82 | 578.631 ± 107.82
LR Sinkhorn $L^2$ | 0.366 ± 0.048 | 0.379 ± 0.051 | 0.447 ± 0.023 | 204.191 ± 3.656 | 204.191 ± 3.656
Geodesic Sinkhorn | 0.847 ± 0.023 | 0.754 ± 0.016 | 0.833 ± 0.034 | 10.176 ± 1.249 | 16.682 ± 1.705

5.3 Barycentric distance

Synthetic data

We test whether we can identify a linear treatment effect with the Expected Barycenter Effect (EBE). In this experiment, we create a control family of distributions ${\mathcal{C}}$ of ten standard Gaussian distributions. The treatment group consists of nine Gaussian distributions $\mathcal{N}(5,1)$ and one outlier centered at different means. For each distribution, we sample 500 observations, and we repeat the experiment over ten seeds. In Tab. 4, we report the EBE and its standard deviation with Geodesic Sinkhorn, the total variation (TV) distance, and Sinkhorn. Since the TV only compares the means, it is sensitive to the outlier, whereas our method can identify the true treatment effect.

Table 4: Expected Barycenter Effect (EBE) with one outlier distribution centered either at -60, -30, 0, or 5 (an outlier at 5 coincides with the treatment mean, i.e., no outlier). Comparison using the barycenter from Sinkhorn, total variation (TV), or Geodesic Sinkhorn. Values closer to the real treatment effect of 5 are better.
Outlier | EBE Geo Sinkhorn | EBE Sinkhorn | EBE TV
-60 | 5.016 ± 0.226 | -0.103 ± 0.005 | -1.429 ± 0.144
-30 | 5.053 ± 0.196 | 0.355 ± 0.049 | 1.571 ± 0.144
0 | 4.917 ± 0.315 | 4.954 ± 0.157 | 4.571 ± 0.144
No outlier | 5.059 ± 0.159 | 5.054 ± 0.16 | 5.071 ± 0.144

We also compare distances used to compute the barycenter for retrieving an underlying distribution. We suppose two ground truth distributions, a standard Gaussian and a Gaussian with mean five and unit variance, but we only observe two groups of noisy distributions of unequal sizes:

${\mathcal{C}} = \{{\mathcal{N}}(\theta_i, 1) : \theta_i \sim {\mathcal{N}}(0,1),\ i \in [10]\}$
${\mathcal{T}} = \{{\mathcal{N}}(\theta_i, 1) : \theta_i \sim {\mathcal{N}}(5,1),\ i \in [5]\}.$

We compute $\mu^*_{\mathcal{C}}$ and $\mu^*_{\mathcal{T}}$, then the (exact) 2-Wasserstein distance between the barycenters $\mu^*_{\mathcal{C}}$ and $\mu^*_{\mathcal{T}}$, which we want to be close to the known 2-Wasserstein distance between the ground truth distributions. In Tab. 5, we report the score $|1 - \widehat{W_2}/W_2|$, where $\widehat{W}_2$ is the exact 2-Wasserstein distance between barycenters and $W_2$ the ground truth. A score close to zero is better; it implies that the barycenters are close to the ground truth distributions. We report results for distributions in 2 and 10 dimensions, as well as for different diffusion times $t$ and regularization parameters $\lambda$. In the Appendix (Tab. LABEL:tab:ATE_perturb_appendix, Tab. LABEL:tab:_gaussian_fam_appendix) we report additional results for Geodesic Sinkhorn with the graph used in UMAP [wolf2018scanpy, mcinnes2018umap].

Table 5: Approximation score for the barycenters of ten distributions centered around ${\mathcal{N}}(0,1)$ and five distributions centered around ${\mathcal{N}}(5,1)$. Scores closer to zero are better.
Dim | Reg. $(t, \lambda)$ | Geo Sinkhorn | Sinkhorn | TV
2 | (25, 0.005) | 0.144 ± 0.104 | 0.142 ± 0.094 | 0.636 ± 0.065
2 | (50, 0.05) | 0.141 ± 0.107 | 0.222 ± 0.192 | 0.636 ± 0.065
2 | (100, 0.5) | 0.133 ± 0.135 | 0.945 ± 0.019 | 0.636 ± 0.065
10 | (25, 0.005) | 0.191 ± 0.086 | 0.186 ± 0.081 | 0.541 ± 0.094
10 | (50, 0.05) | 0.192 ± 0.089 | 0.426 ± 0.309 | 0.541 ± 0.094
10 | (100, 0.5) | 0.187 ± 0.111 | 0.894 ± 0.035 | 0.541 ± 0.094

Single-cell signalling data

In this experiment, we show an application of the barycentric distance and the expected barycenter effect. We use single-cell signalling data produced by mass cytometry (MC) in a screening study comparing the treatment effect of different chemotherapies on 10 colorectal cancer (CRC) patient-derived organoids (PDOs) [zapatero2022pdos]. These PDOs can be grouped into chemoresistant PDOs, which show little-to-no effect when treated with chemotherapies, and chemosensitive PDOs, which present strong shifts in their phenotypes upon treatment. The observations include single-cell information on the cell cycle and signalling changes upon treatment of PDOs with different CRC treatments at a range of concentrations. To compute the barycentric distance, we define signals on the graph as indicator functions for the treatments and PDOs respecting the condition (chemoresistant or chemosensitive), and compute the barycenter of each treatment. For example, the barycenter corresponding to treatment S (SN-38, the active metabolite of Irinotecan) is $\mu_S^*$. In Fig. 4, we present the barycentric distance matrices between treatments (a) and between four concentrations of treatment SN-38 (b). In both cases, the control groups correspond to AH and DMSO, the two rightmost columns. We compare the distance matrices between Sinkhorn (left) and our method (right). Our method provides a finer distinction between treatments and concentrations, especially for the chemosensitive group. As observed in [zapatero2022pdos], chemosensitive PDOs show little-to-no response to lower concentrations of SN-38 (S1), but their phenotype shifts very strongly upon treatment with higher concentrations (S2, S3, and S4) (Fig. 4b). When comparing combinations of different treatments with SN-38, Geodesic Sinkhorn better resolves the difference between SN-38 alone and in combination with Cetuximab (C), showing that S is the main agent creating the treatment effect and that the combination with C does not result in a synergistic effect (Fig. 4a). Note that we only consider the relative magnitude of the distances, since the two algorithms use different ground distances. In Fig. 5, we show the Expected Barycenter Effect (Def. 4.4) of SN-38 for the two conditions; our method correctly detects the chemosensitive marker cPARP.

Fig. 4: a) Barycentric distance matrices for the Sinkhorn algorithm (left) and our method Geodesic Sinkhorn (right). b) Barycentric distance matrices between concentrations of treatment SN-38, for four concentrations S1, S2, S3, S4. Control groups correspond to AH and DMSO. Geodesic Sinkhorn provides a clearer distinction between treatments and concentrations.
Fig. 5: Expected conditional barycenter effect of treatment SN-38, for the chemoresistant (left) and chemosensitive (right) conditions. The increase in cPARP is coherent with chemosensitivity.

6 Conclusion

In this work, we considered the use of optimal transport for graphs and large high-dimensional datasets potentially sampled from a lower-dimensional manifold. We proposed Geodesic Sinkhorn, a fast implementation of the Sinkhorn algorithm using the graph Laplacian and Chebyshev polynomials. Our method is well adapted to large and high-dimensional datasets, as it is defined with a geodesic ground distance, which takes into account the underlying geometry of the data, and requires less computation time and memory. We showed on 10 scRNA-seq time series datasets that our method has greater interpolation accuracy than Sinkhorn and the exact McCann interpolant. On a synthetic dataset, we showed that Geodesic Sinkhorn can be about 23 times faster than Sinkhorn, even when taking into account the construction of the graph. On the classification task, our method was faster than other Sinkhorn-based algorithms while achieving greater accuracy. With the Wasserstein barycenter, we defined the barycentric distance to compare entire families of distributions, and the expected barycenter effect, and applied both to a large PDO drug screen dataset. In future work, we aim to study the theoretical implications of our new definitions.
