This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Modeling Cell Type Developmental Trajectory using Multinomial Unbalanced Optimal Transport

Junhao Zhu, Kevin Zhang
Department of Statistical Sciences, University of Toronto,
Zhaolei Zhang
Department of Molecular Genetics,
Donnelly Centre for Cellular and Biomolecular Research,
University of Toronto,
and
Dehan Kong
Department of Statistical Sciences, University of Toronto   
The first two authors contribute equally to the paper. The last two authors are joint corresponding authors.
Abstract

Single-cell trajectory analysis aims to reconstruct the biological developmental processes of cells as they evolve over time, leveraging temporal correlations in gene expression. During cellular development, gene expression patterns typically change and vary across different cell types. A significant challenge in this analysis is that RNA sequencing destroys the cell, making it impossible to track gene expression across multiple stages for the same cell. Recent advances have introduced the use of optimal transport tools to model the trajectory of individual cells. In this paper, our focus shifts to a question of greater practical importance: we examine the differentiation of cell types over time. Specifically, we propose a novel method based on discrete unbalanced optimal transport to model the developmental trajectory of cell types. Our method detects biological changes in cell types and infers their transitions to different states by analyzing the transport matrix. We validated our method using single-cell RNA sequencing data from mouse embryonic fibroblasts. The results accurately identified major developmental changes in cell types, which were corroborated by experimental evidence. Furthermore, the inferred transition probabilities between cell types are highly congruent to biological ground truth.


Keywords: cellular differentiation, change point detection, growth rate, single-cell RNA sequencing, transport matrix

1 Introduction

All living organisms are subject to the effects of genetics. The genetic information stored within cells dictates the way of life, functioning like a sequence of commands that shape every organism into its eventual form. Since the discovery of DNA, the study of molecular genetics has revealed the dynamic nature of cellular development. A key aspect of this dynamic process is the variation in gene expression among different cell types and at different stages of development. Building on this understanding, biologists have studied the developmental processes of stem cells and differentiated cells (Keller, 2005), as well as stem cell reprogramming in the pursuit of new methods for stem cell therapy (Mothe et al., 2012). Although much research has been conducted on the molecular mechanisms underlying stem cell differentiation (Dixon et al., 2015), there is a growing interest for mathematical models that are more efficient and adaptable to heterogeneous datasets, in order to better understand these complex developmental processes. Waddington (2014, 2015) modeled the changes in gene expression as a landscape that influences the developmental trajectory of cells. In this analogy, various potential developmental pathways available to a cell are similar to different railway tracks, with each track representing a unique developmental trajectory. The observed gene expression patterns and the developmental costs are similar to the layout and travel costs of railway tracks, each guiding the cell’s developmental journey. At the population level, hundreds and thousands of cells undergo changes in gene expression based on their individual trajectories. In experimental settings, observations are typically made at multiple time points during cellular development. Single-cell RNA sequencing (scRNA-seq) enables the measurement of gene expression levels of individual cells at different time points (Tanay and Regev, 2017). However, it is often challenging to model the gene expression profile of a single cell and recover its developmental trajectory by analyzing scRNA-seq data from different time points, since for each cell we can only observe its expression profiles at one particular time instead of the whole trajectory. Several existing methods (Kester and Van Oudenaarden, 2018; Farrell et al., 2018; Wagner et al., 2018; Cao et al., 2019) aim to model such trajectories. Nevertheless, these methods do not directly address two main challenges. Firstly, they often focus on inferring trajectories for individual cells, potentially missing the broader developmental dynamics of the entire cell population. This can result in scenarios where individual optimization is not feasible for the whole population. Interactions and resource allocation among cells must be considered. For example, while a cell may differentiate into its expected optimal type under ideal conditions with ample resource, limited resources due to large number of cells can prevent all cells from reaching their expected state. Secondly, different cell types may proliferate at varying growth rates, and this heterogeneity in cellular growth rates affects the relative proportions of different cell types over time.

To tackle the two challenges mentioned above, Schiebinger et al. (2019), introduced Waddington-OT, a novel method that applies the concept of optimal transport to single-cell RNA sequencing (scRNA-seq) data analysis. Optimal transport, initially proposed by Monge (1781), was designed to model the transition process between different distributions. The Waddington-OT method models the developmental trajectory of cells in two key steps. Firstly, it employs optimal transport to infer temporal couplings between adjacent time points, effectively tracing the transition of the cell population from one time point to the next. Secondly, it introduces “unbalancedness” into the optimal transport framework. Traditional optimal transport models assume the conservation of total “mass” during transitions between states. However, in many biological processes, such as cellular growth, this conservation assumption is not valid, as the total mass can either increase or decrease by absorbing external nutrients or by death of cells. Unbalanced optimal transport addresses this by accommodating changes in mass, thus allowing the model to incorporate variations in cellular growth rates. This adaptation makes the Waddington-OT method a more realistic and flexible tool for modeling cell population dynamics over time.

The ability to model trajectories of cell types, which are groups of cells with similar expression profiles and functions, is particularly important in fields such as developmental biology, cell replacement therapy, and drug discovery (Keller, 2005). As opposed to modelling individual cells, analyzing different cell types reduces the model complexity compared to modelling individual cells while still preserving biologically meaningful heterogeneity. For example, in cell replacement therapy, trajectory modeling can help identify progenitor cell types with the highest potential for differentiation into the required cell types for therapeutic purposes. In this paper, we introduce Trajectory Inference with Multinomial Optimal Transport (TIMO), a novel approach that shifts the focus from individual cells to modeling trajectories of cell types. TIMO incorporates a unique statistic based on optimal transport, leveraging the benefits of a multinomial framework.

The use of a multinomial model offers several practical advantages over continuous models. Firstly, in Multinomial Optimal Transport, as detailed in Section 3, the cost functions are specifically designed to assign zero cost to cells that remain within the same cell type. In contrast, in Schiebinger et al. (2019), it is impossible for two cells of the same type to have identical gene expression profiles, leading to a non-zero cost between these two cells. This distinction makes the multinomial approach more effective in accurately detecting biological change points and in studying cellular sub-populations, i.e., cell types. Additionally, by focusing on the trajectories of cell types rather than individual cells, we significantly reduce the computational complexity from 𝒪(n2)\mathcal{O}(n^{2}) to 𝒪(d2)\mathcal{O}(d^{2}), where nn is the number of cells, typically in the thousands, and dd is the number of cell types, usually only around 10 (Pham et al., 2020).

To showcase the advantages of the TIMO framework, we demonstrate that TIMO can effectively detect biological change points and quantify cellular differentiation events during developmental trajectories. To empirically validate our approach, we applied TIMO to a comprehensive scRNA-seq dataset containing over 250,000 cells containing multiple cell types originated from mouse embryo fibroblast (MEF). MEF cells are commonly used in research related to modern medicine. Our model yielded the following key findings:

  1. 1.

    We detected two biological change points corresponding to the emergence of mesenchymal epithelial transition (MET) cells and stromal cells, both of which have been confirmed by wet lab results in Schiebinger et al. (2019) as underlying points of cellular differentiation.

  2. 2.

    We quantified cell type developmental trajectory and identified that MEF cells are the most potent cell type capable of differentiating into MET cells, and that both MEF and MET cells differentiate into stromal cells. These findings highly match the biological results from the original experiment (Schiebinger et al., 2019).

  3. 3.

    TIMO successfully distinguishes the effects of growth from the effects of cellular differentiation. Notably, we detected no change during the known period of induced pluripotent stem (IPS) cell proliferation (i.e., cellular growth/division), unlike other modern statistical change point detection methods, which incorrectly identified growth periods as change points.

To the best of our knowledge, TIMO is the first method that can detect biological change points while accounting for the effects of cellular growth.

The rest of this paper is structured as follows. In Section 2, we provide definition on cell type developmental trajectories. In Section 3, we provide mathematical details on trajectory inference using TIMO. In Section 3.1, we introduce the test statistic and its application in biological change point detection. In Section 3.2, we present procedures on trajectory inference given a finite sample. Numerical results on simulated and real data sets are presented in Section 4 and Section 5. We end with a discussion in Section 6.

2 Background

The study of cellular differentiation dynamics, encompassing landscapes, fates, and trajectories, poses formidable challenges for scientists. Deciphering these developmental pathways is a key area of study in scRNA-seq. To capture the gene expression profile of a cell, we employ a GG-dimensional vector in gene expression space, where GG represents the total number of genes, where GG is often around 20,00020,000. Additionally, we assume that the observed developmental trajectory is limited to a finite number of time points, specifically t=0,1,2,,Tt=0,1,2,\ldots,T, where TT is a finite positive integer.

2.1 Single-cell developmental trajectory

To lay the background, we first introduce the following definition of single-cell developmental trajectory (Schiebinger et al., 2019):

Definition 1 (Single-cell developmental trajectory).

The single-cell developmental trajectory is characterized as a sequence of random vectors in G\mathbb{R}^{G}, denoted as {Yt}0tT\{Y_{t}\}_{0\leq t\leq T}, where each YtGY_{t}\in\mathbb{R}^{G} represents the gene expression profile of a cell at time tt. This sequence captures the temporal evolution of a particular cell over the time horizon t=0,1,2,,Tt=0,1,2,\ldots,T.

Since single-cell RNA sequencing (scRNA-seq) is a destructive measurement process, the gene expression levels of a cell cannot be observed at later time points after it has been measured. Consequently, the full single-cell developmental trajectory cannot be observed. Schiebinger et al. (2019) proposed an alternative approach to learn probable trajectories of individual cells by measuring snapshots from an evolving population consisting of many single-cell developmental trajectories. They introduced Waddington-optimal transport, a method to study developmental time courses, infer ancestor-descendant fates, and model the underlying regulatory programs. Optimal transport was first introduced by Monge (1781) and formalized by Kantorovich (1942) to model the transitional process between two distributions. The rationale of the optimal transport process is that it assumes the population evolves in an energy-efficient fashion. For a detailed review of the application of optimal transport to single-cell trajectory inference proposed in Schiebinger et al. (2019), see LABEL:sec:ot_cont of the supplementary material.

2.2 Cell type developmental trajectory

In the context of studying biological systems, it is often more informative to study the composition of cell types in a tissue or sample, or the development of the cell types, other than developmental trajectory of individual cells. As opposed to Schiebinger et al. (2019), which focuses on modelling individual cell trajectories, our primary goal is to model trajectories of distinct cell types. We explore the real-world scRNA-seq data obtained from mouse embryonic fibroblasts, which have the potential to differentiate into different cell types with promising therapeutic possibilities. For instance, mesenchymal stromal cells, which can be derived from mouse embryonic fibroblasts, have exhibited significant potential in regenerative medicine (Krampera and Le Blanc, 2021). By employing trajectory modelling to discern their ancestors, we can suggest more efficient avenues for generating mesenchymal stromal cells.

To conduct cell type trajectory inference, we first define the concept of cell type developmental trajectory. Suppose there are in total dd different types of cells across all stages of the developmental trajectory. Each cell can be represented by its type, denoted by Xt{1,2,,d}X_{t}\in\{1,2,\ldots,d\}. In alignment with Definition 1, we define the following cell type developmental trajectory.

Definition 2 (Cell type developmental trajectory).

The cell type developmental trajectory is characterized as a sequence of random multinomial labels in d={1,2,,d}\mathcal{I}_{d}=\{1,2,\ldots,d\}, denoted by {Xt}0tT\{X_{t}\}_{0\leq t\leq T}, where each XtdX_{t}\in\mathcal{I}_{d} represents the cell type at time tt. This sequence captures the temporal evolution of a particular cell over the time horizon t=0,1,2,,Tt=0,1,2,\ldots,T.

Definition 2 introduces the concept of a cell type developmental trajectory originating from a single cell at t=0t=0. However, in a population context, events such as cellular division and deaths should also be taken into account. To illustrate this idea, we present a simplified example of a single-cell population in Figure 1. The example involves four different cells observed at various time points, each representing a potential fate a single cell within the population might undertake. At the initial time t=0t=0, the population contains only two cells: Cell A belonging to cell type 2 and Cell D belonging to cell type 3. As time progresses to t=1t=1, Cell A undergoes cellular division, giving rise to Cell B, which differentiates into cell type 3. Meanwhile, Cell A itself differentiates into cell type 1, and Cell D dies following its differentiation into cell type 2. Advancing to t=2t=2, Cell A further differentiates into cell type 3, while Cell B divides, generating Cell C, which remains in cell type 3. And Cell B then dies upon differentiation into cell type 1. Finally, at t=3t=3, Cell A remains in cell type 3, and Cell C differentiates into cell type 1. This example demonstrates how cellular events such as division, death, and differentiation impact population dynamics. Specifically, division can increase the proportion of certain cell types, while death decreases it. Differentiation, on the other hand, can increase the proportion of one cell type while decreasing that of another, reflecting the complex interplay of factors that drive changes within a cell population.

Refer to caption
Figure 1: Illustration of a four-time-point observation of a cellular population (t=0,1,2,3t=0,1,2,3) event. Each colored circle represents a cell type, with crosses indicating deceased cells. These four cells demonstrate various fates, showcasing potential trajectories within the population. At t=0t=0, Cell A, initially of type 2, undergoes cellular division at t=1t=1, giving rise to Cell B (type 3) and simultaneously differentiating into type 1. Concurrently, at t=1t=1, Cell D, initially of type 3, differentiates into type 2 and then dies. By t=2t=2, Cell A further differentiates into type 3, and at the same time, Cell B divides, generating Cell C, which remains in cell type 3. And Cell B then dies upon differentiation into cell type 1. By t=3t=3, Cell A remains in cell type 3, and Cell C differentiates into cell type 1.

Based on the single-cell population in Figure 1, we can describe the cell type trajectories for Cells 1–4, visually presented in Figure 2.

  1. Cell A: This cell remains alive throughout the observation timeline. It originates as cell type 2 at t=0t=0, differentiates into cell type 1 at t=1t=1, further differentiates into cell type 3 at t=2t=2, and remains as cell type 3 at t=3t=3.

  2. Cell B: This is a newborn cell at t=1t=1. It belongs to cell type 3 at t=1t=1 and differentiates into cell type 1 at t=2t=2. However, the cell then dies at t=2t=2. Therefore, we plot an imaginary trajectory afterwards until the end of the observed time if the cell was still alive.

  3. Cell C: This is a newborn cell at t=2t=2. It appears as cell type 3 at t=2t=2, then differentiates into cell type 1 at t=3t=3 and persists throughout the observed time.

  4. Cell D: This cell is of type 3 at t=0t=0 and differentiates into cell type 2 at t=1t=1. Then it dies at t=1t=1. To visualize its possible continuation, we draw an imaginary trajectory from this moment, illustrating what might have been its path if it had survived the entirety of the observation period.

Figure 2 demonstrates how cellular growth or deaths impact trajectories, with some cells experiencing death, resulting in imaginary trajectories that represent possible paths these deceased cells might have followed if they had not been sequenced, while others born by cellular division, generating new trajectories.

Refer to caption
Figure 2: Trajectories of Cell Types in a Population. The circles represent individual cells, with colors indicating their cell types. Circles with a cross represent deceased cells due to cellular death. The observed time points are t=0,1,2,3t=0,1,2,3. Solid lines depict trajectories for living cells, while dashed lines represent possible trajectories for deceased cells, as if they were alive. Unfilled circles indicate hypothetical descendants. The figure showcases four numbered trajectories exemplifying possible cell fates within the population. Cell A: The cell remains alive throughout the timeline. It initiates as cell type 2 at t=0t=0, differentiates into cell type 1 at t=1t=1, and further evolves to cell type 3 at t=2t=2, maintaining this type at t=3t=3. Cell B: The cell emerges as cell type 3 at t=1t=1, and differentiates into cell type 1 at t=2t=2, but ceases to exist at that same point. Its potential path, had it survived, is represented by an imaginary trajectory from there. Cell C: The cell is a newborn cell, starting as cell type 3 at t=2t=2, then differentiating into cell type 1, and remaining alive throughout the observed time. Cell D: The cell starts as cell type 3 at t=0t=0, then changes to cell type 2 at t=1t=1, and is recorded as deceased at t=1t=1. An imagined trajectory is sketched post this point to depict its potential path.

In scRNA-seq experiments, the sampling inevitably results in the death of cells, which adds complexity to observing trajectories. Once a cell is taken for observation, we can not see its future trajectory. Consequently, complete trajectories are never observable. Instead, we can only observe snapshots of the cell population at each time point. These snapshots consist of samples from each time point and are used to discover the underlying distributions of cell type within the population at these time points.

In practice, these snapshots are obtained by sampling a finite number of cells from each time point. Figure 3 provides a visual representation of the sampling scheme for a cell population. At each time point, the only information observed is the empirical distribution of cellular gene expression levels. For instance, as depicted in Figure 3, the cells sampled at t=0t=0 consist of one cell from cell type 1, two cells from cell type 2, and one cell from cell type 3. The empirical marginal distribution at t=0t=0 is then represented as Q^0=(0.25,0.5,0.25)T\widehat{Q}_{0}=(0.25,0.5,0.25)^{\mathrm{\scriptscriptstyle T}}, indicating the proportions of the three cell types. Similarly, the distribution Q^1\widehat{Q}_{1} at t=1t=1 is (0.25,0,0.75)T(0.25,0,0.75)^{\mathrm{\scriptscriptstyle T}}. The distributions at subsequent time points can be calculated following a similar approach.

Refer to caption
Figure 3: Example of the sampling scheme in scRNA-seq data. Circles represent cells, with colors indicating cell types. Circles with a star denote the sampled cells. The observation spans four time points (t=0,1,2,3t=0,1,2,3). At t=0t=0, one cell from cell type 1, two cells from cell type 2, and one cell from cell type 3 are sampled, resulting in an empirical distribution Q^0\widehat{Q}_{0} of (0.25,0.5,0.25)T(0.25,0.5,0.25)^{\mathrm{\scriptscriptstyle T}}. At t=1t=1, one cell from cell type 1 and three cells from cell type 3 are sampled, with the empirical distribution Q^1\widehat{Q}_{1} being (0.25,0,0.75)T(0.25,0,0.75)^{\mathrm{\scriptscriptstyle T}}. At t=2t=2, two cells from cell type 2 and two cells from cell type 3 are sampled, leading to an empirical distribution Q^2\widehat{Q}_{2} of (0,0.5,0.5)T(0,0.5,0.5)^{\mathrm{\scriptscriptstyle T}}. Finally, at t=3t=3, one cell from cell type 1, two cells from cell type 2, and one cell from cell type 3 are sampled, resulting in an empirical distribution Q^3\widehat{Q}_{3} of (0.25,0.5,0.25)T(0.25,0.5,0.25)^{\mathrm{\scriptscriptstyle T}}.

Hence, our observations are limited to the empirical probability distributions of cell types at distinct observed time points. There exists a discrepancy between recovering the cell type trajectory and these observed empirical probability distributions at each observed time points.

To bridge this gap, we mathematically formulate the cell type trajectory problem. Recall that Xtd={1,2,,d}X_{t}\in\mathcal{I}_{d}=\{1,2,\ldots,d\} denotes the cell type of a cell at time tt. We strive to represent the trajectory of cell types as illustrated in Definition 2. Given that we only have a snapshot of each cell, it remains uncertain when the cell came into existence or ceased to be. Consequently, our model of the cell type trajectory encompasses both actual and imaginary trajectories, since we lack definitive knowledge about the trajectory’s authenticity.

We assume that XtX_{t} is a random variable that follows a multinomial distribution Qt=(qt,1,qt,2,,qt,d)TdQ_{t}=(q_{t,1},q_{t,2},\ldots,q_{t,d})^{{\mathrm{\scriptscriptstyle T}}}\in\mathbb{R}^{d} with dd categories, where qt,k=P(Xt=k)q_{t,k}=P(X_{t}=k) and k=1dqt,k=1\sum_{k=1}^{d}q_{t,k}=1. We also assume that XtX_{t} is a Markov process of order 11. As stated in Definition 2, the cell type developmental trajectory is a sequence of the random variables XtX_{t} from initial time t=0t=0 to terminal time t=Tt=T. We aim to model the development trajectory based on the marginal distribution QtQ_{t}. For any developmental process from time ss to time tt, the process can be described by the conditional probability of XtX_{t} given XsX_{s}, denoted by Ht|sH^{t|s}. In other words, Ht|sH^{t|s} is a d×dd\times d transition probability matrix, where the kjkj-th entry, denoted by hkjt|sh_{kj}^{t|s}, represents the probability of a cell belonging to the kkth type at time tt given that it belongs to the jjth type at time ss. Given a fixed distribution QτQ_{\tau} of XτX_{\tau}, the probability matrix enables us to perform trajectory inference by predicting or backtracking the developmental landscape. For s<τs<\tau, we can obtain the ancestor distribution Qsτ=Hs|s+1Hs+1|s+2Hτ1|τQτQ_{s\leftarrow\tau}=H^{s|s+1}H^{s+1|s+2}\cdots H^{\tau-1|\tau}Q_{\tau}, where QsτQ_{s\leftarrow\tau} is its ancestor distribution at time ss, where its kkth entry denotes the probability of a cell belonging to the kkth type at time ss given the distribution QτQ_{\tau} at time τ\tau. For t>τt>\tau, Qτt=Ht|t1Ht1|t2Hτ+1|τQτQ_{\tau\rightarrow t}=H^{t|t-1}H^{t-1|t-2}\cdots H^{\tau+1|\tau}Q_{\tau}, where QτtQ_{\tau\rightarrow t} is referred to as the descendent distribution and its kkth entry denotes the probability of a cell belonging to the kkth type at time ss given the distribution QτQ_{\tau} at time τ\tau. Unlike the marginal distributions Q0,Q1,,QTQ_{0},Q_{1},\ldots,Q_{T}, which represent the population snapshots at individual time points, the ancestor and descendent distributions rely on the transition probability matrices HH and enable trajectory inference. Therefore, if we can obtain Ht|t1H^{t|t-1} for t=1,,Tt=1,\ldots,T and Hs|s+1H^{s|s+1} for s=0,,T1s=0,\ldots,T-1, the complete cell type developmental trajectory, that is, the sequence X0,X1,X2,,XTX_{0},X_{1},X_{2},\ldots,X_{T} from all time points can be recovered using the observed marginal distribution QτQ_{\tau}. This complete trajectory portrays the differentiation process as though the cell persists throughout the entire timeline. In essence, if the cell indeed survives all time points, this trajectory reflects the actual trajectory. However, if the cell either dies or is newborn at certain time points, its imaginary trajectory after or before these events can still be characterized through these transition probability matrices.

We illustrate this by a simple example shown in Figure 4. Given a cell with type 2 at t=3t=3, we can infer that its ancestor at t=2t=2 belongs to cell type 3 with probability h322|3h_{32}^{2|3}. Further, given its ancestor at t=2t=2 belongs to cell type 3, the cell’s grandparent at t=1t=1 belongs to cell type 2 with probability h231|2h_{23}^{1|2}. So the cell’s ancestor at t=1t=1 belongs to cell type 2 with a probability of h231|2h322|3h_{23}^{1|2}h_{32}^{2|3}. On the other hand, its descendent at t=4t=4 belongs to cell type 2 with probability h224|3h_{22}^{4|3}. Following similar computation, a cell type developmental trajectory can be simulated as the sequence {X0=1,X1=2,X2=3,X3=2,X4=2,X5=1,X6=1}\{X_{0}=1,X_{1}=2,X_{2}=3,X_{3}=2,X_{4}=2,X_{5}=1,X_{6}=1\}. Specifically, the probability of this trajectory given the cell belong to type 2 at t=3t=3 equals (h120|1h231|2h322|3)(h224|3h125|4h116|5)(h_{12}^{0|1}h_{23}^{1|2}h_{32}^{2|3})(h_{22}^{4|3}h_{12}^{5|4}h_{11}^{6|5}). Therefore, given prior information QtQ_{t} at time tt, the process of trajectory modelling can be described as the computation of probabilities for all possible trajectories.

Refer to caption
Figure 4: Sample trajectory inference. The coloured squares indicate cell types. There are six observed time points t=0,1,2,3,4,5,6t=0,1,2,3,4,5,6. The black star indicates the cell type of interest at time t=3t=3 for which the trajectory is to be inferred. The blue arrows indicate the trajectory coming from the ancestors of the black star. The red arrows indicate the trajectory going forwards to the descendents of the black star. The hh’s indicate the conditional probabilities of cell types at the former time point in the superscript given the latter. The trajectory is one possible realization generated by conditional probabilities.

To obtain these trajectory probabilities, we need to obtain the transition probability matrices Ht|sH^{t|s}. Suppose XtX_{t} and Xt+1X_{t+1} are distributed by QtQ_{t} and Qt+1Q_{t+1}. In order to recover the conditional probability Ht|sH^{t|s} for all 0s<tT0\leq s<t\leq T, it is sufficient to find the conditional distribution Ht+1|tH^{t+1|t} for all 0t<T0\leq t<T. Given the marginal distributions QtQ_{t} for 0tT0\leq t\leq T, it is also equivalent to finding the joint probability matrix between tt and t+1t+1, denoted by by πt,t+1d×d\pi^{t,t+1}\in\mathbb{R}^{d\times d}, where the jkjk-th entry πjkt,t+1\pi^{t,t+1}_{jk} is equal to P(Xt=j,Xt+1=k).P(X_{t}=j,X_{t+1}=k).

3 Modeling trajectories of different cell types via discrete unbalanced optimal transport

In this section, we develop a new method, Trajectory Inference with Multinomial Unbalanced Optimal Transport (TIMO), to model the trajectories of different cell types under a discrete unbalanced optimal transport framework.

We first consider an ideal case, where the cells do not divide into multiple cells or different cell types divide at the same rate across time. If there is no differentiation among all cell types from time tt to t+1t+1, one has Qt=Qt+1Q_{t}=Q_{t+1}. We then have πjkt,t+1=0\pi_{jk}^{t,t+1}=0 (or equivalently hkjt+1|t=0h_{kj}^{t+1|t}=0) for all jkj\neq k, and πjjt,t+1=qt,j=qt+1,j\pi_{jj}^{t,t+1}=q_{t,j}=q_{t+1,j} (or equivalently hjjt+1|t=1h_{jj}^{t+1|t}=1) for all jj. In other words, the transition probability matrix πt+1|t\pi^{t+1|t} is an identity matrix, indicating that any cell type will not differentiate to other types from time tt to t+1t+1.

We now utilize the optimal transport tool to find the joint probability matrix πt,t+1\pi^{t,t+1}. Define Γd,dμ,ν{π0;πd×d,j=1dk=1dπjk=1,π1d=μ,πT1d=ν}\Gamma_{d,d}^{\mu,\nu}\coloneqq\{\pi\geq 0;\pi\in\mathbb{R}^{d}\times\mathbb{R}^{d},\sum_{j=1}^{d}\sum_{k=1}^{d}\pi_{jk}=1,\pi 1_{d}=\mu,\pi^{{\mathrm{\scriptscriptstyle T}}}1_{d}=\nu\}, where μ\mu and ν\nu are probability measures on {1,2,,d}\{1,2,\ldots,d\} and 1d1_{d} is a dd-dimensional column vector with each element 11.

A preliminary approach involves determining the joint distribution πt,t+1\pi^{t,t+1} by solving the following optimal transport problem,

πt,t+1argminπΓd,dQt,Qt+1j=1dk=1dmjkπjk,\pi^{t,t+1}\coloneqq\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{Q_{t},Q_{t+1}}}\sum_{j=1}^{d}\sum_{k=1}^{d}m_{jk}\pi_{jk}, (1)

where mjkm_{jk} is the transport cost from cell type jj to cell type kk. Here, the costs mjk=0m_{jk}=0 if j=kj=k, and mjk>0m_{jk}>0 otherwise. The πt,t+1\pi^{t,t+1} is called the transport map and W(Qt,Qt+1)=minπΓd,dQt,Qt+1j=1dk=1dmjkπjkW(Q_{t},Q_{t+1})=\min_{\pi\in\Gamma_{d,d}^{Q_{t},Q_{t+1}}}\sum_{j=1}^{d}\sum_{k=1}^{d}m_{jk}\pi_{jk} is the transport cost from QtQ_{t} to Qt+1Q_{t+1}. We require πt,t+1Γd,dQt,Qt+1\pi^{t,t+1}\in\Gamma_{d,d}^{Q_{t},Q_{t+1}} because for any joint distributions of XtX_{t} and Xt+1X_{t+1}, their marginal distributions must be QtQ_{t} and Qt+1Q_{t+1}.

From a biological perspective, the costs represented by mjkm_{jk} can be interpreted as the energy or resources required for cellular differentiation from type jj to type kk. In practice, numerous studies have revealed that differentiating cells undergo a substantial energy acquisition process (Fard et al., 2016; Guo et al., 2017; Banerji et al., 2013). In our context, these distinctions are reflected in the varying values of mjkm_{jk}, which are determined by difference in gene expressions among various cell types. We will discuss how to choose mjkm_{jk}’s at the end of this subsection.

Recent work by Shi et al. (2022) has demonstrated that cellular differentiation is intricately linked to energy availability. Building upon these biological insights, the differentiation process of a single-cell population can be conceptualized as an efficient allocation of available energy. This allocation aligns with the optimization problem described in (1).

However, in the real world, the optimal transport problem (1) can not be directly applied to model the cell type developmental trajectory. The main reason is that different cell types may grow at different rates (Yang et al., 2014), which makes the marginal distribution constraints invalid. To illustrate the idea, define gt,jg_{t,j} the growth rate of cell type jj at time tt. We have qt+1,j=qt,jgt,jl=1dqt,lgt,lq_{t+1,j}=\frac{q_{t,j}g_{t,j}}{\sum_{l=1}^{d}q_{t,l}g_{t,l}} and

Qt+1=gt(Qt)=(gt,1qt,1k=1gt,kqt,k,gt,2qt,2k=1gt,kqt,k,,gt,dqt,dk=1gt,dqt,d).Q_{t+1}=g_{t}(Q_{t})=\left(\frac{g_{t,1}q_{t,1}}{\sum_{k=1}g_{t,k}q_{t,k}},\frac{g_{t,2}q_{t,2}}{\sum_{k=1}g_{t,k}q_{t,k}},\ldots,\frac{g_{t,d}q_{t,d}}{\sum_{k=1}g_{t,d}q_{t,d}}\right). (2)

Based on Zhang et al. (2021), the developmental process between tt and t+1t+1 can be considered as a two-phase procedure. The first phase is the growth phase, where QtQ_{t} grows to gt(Qt)g_{t}(Q_{t}). The second phase is the transport phase, where gt(Qt)g_{t}(Q_{t}) changes to Qt+1Q_{t+1} following the optimal transport principle. If the growth function gt()g_{t}(\cdot) is known, by modifying the constraint in (1), one can obtain the transport map πt,t+1\pi^{t,t+1} by solving the following optimal transport problem

πt,t+1=argminπΓd,dgt(Qt),Qt+1j=1dk=1dmjkπjk.\pi^{t,t+1}=\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{g_{t}(Q_{t}),Q_{t+1}}}\sum_{j=1}^{d}\sum_{k=1}^{d}m_{jk}\pi_{jk}.

One key challenge is that the growth rates gt,jg_{t,j}’s cannot be directly identified based on the observed marginal distributions QtQ_{t} and Qt+1Q_{t+1}, making the above approach infeasible.

To overcome this challenge, we utilize the Kullback-Leibler (KL) divergence penalties to relax the marginal distribution constraints, effectively mimicing the effects of cellular growth on probability distributions. In particular, we can infer the cellular development trajectory by solving the following discrete unbalanced optimal transport problem

πt,t+1=argminπΓd,d+,Qt+1{j=1dk=1dmjkπjk+λKL(π1d||Qt)},\pi^{t,t+1}=\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{+,Q_{t+1}}}\{\sum_{j=1}^{d}\sum_{k=1}^{d}m_{jk}\pi_{jk}+\lambda KL\left(\pi 1_{d}||Q_{t}\right)\}, (3)

where Γd,d+,Qt+1{π>0;πd×d,i=1dj=1dπij=1,πT1d=Qt+1}\Gamma_{d,d}^{+,Q_{t+1}}\coloneqq\{\pi>0;\pi\in\mathbb{R}^{d}\times\mathbb{R}^{d},\sum_{i=1}^{d}\sum_{j=1}^{d}\pi_{ij}=1,\pi^{\mathrm{\scriptscriptstyle T}}1_{d}=Q_{t+1}\}. Here KLKL denotes the KL divergence, which is defined as KL(P||Q)=k=1dpklog(pkqk)KL(P||Q)=\sum_{k=1}^{d}p_{k}\log\left(\frac{p_{k}}{q_{k}}\right) for any two multinomial probability measures with dd categories P=(p1,p2,,pd)TP=(p_{1},p_{2},\ldots,p_{d})^{\mathrm{\scriptscriptstyle T}} and Q=(q1,q2,,qd)TQ=(q_{1},q_{2},\ldots,q_{d})^{\mathrm{\scriptscriptstyle T}}. In LABEL:sec:comp_algo of the supplementary material, we provide a detailed algorithm for solving the problem in (3).

In problem (3), the equality constraint only holds for the column sum constraints πT1d=Qt+1\pi^{\mathrm{\scriptscriptstyle T}}1_{d}=Q_{t+1}, whereas the KL divergence penalization relaxes the row sum constraints π1d=Qt\pi 1_{d}=Q_{t} to account for different growth rates of cell types. When λ=0\lambda=0, the model does not enforce the row sum constraints at all. The optimal solution is selected from Γd,d+,Qt+1\Gamma_{d,d}^{+,Q_{t+1}}. As λ\lambda increases, the row sum of the optimal solution πt,t+1\pi^{t,t+1} more closely resembles QtQ_{t}, that is, the discrepancy in growth rates among cell types approaches 0. When λ=\lambda=\infty, it is equivalent to imposing the row sum constraints π1d=Qt\pi 1_{d}=Q_{t}, which corresponds to the case that every cell grows or dies at the same rate.

3.1 Detection of cell types differentiation

One application of our proposed unbalanced optimal transport is to pinpoint the moments when significant changes in cell types occur, which is a key challenge in trajectory inference. At the initial stage, cells like stem cells often exhibit similar expression profiles. However, as development continues, these cells diverge and evolve into their final states (Kalinka et al., 2010). Identifying these pivotal shifts in genetic profiles can offer valuable insights into the cellular developmental journey within a target population.

Recall we have multinomial distributions Q0,Q1,,QTQ_{0},Q_{1},\ldots,Q_{T} characterizing the marginal distribution of different cell types. To formulate the problem, we first consider the ideal case where all the cells have identical growth rates (i.e., gt(Qt)=Qt+1g_{t}(Q_{t})=Q_{t+1}). In this ideal case, the cell types differentiate at time tt if Qt+1QtQ_{t+1}\neq Q_{t}, namely a change point at tt. And we aim to identify all the change points 0t1<t2<<tK<T0\leq t_{1}<t_{2}<\ldots<t_{K}<T such that Q0=Q1==Qt1Qt1+1==Qt2Qt2+1=QtlQtl+1==QTQ_{0}=Q_{1}=\cdots=Q_{t_{1}}\neq Q_{t_{1}+1}=\cdots=Q_{t_{2}}\neq Q_{t_{2}+1}=\cdots\cdots Q_{t_{l}}\neq Q_{t_{l}+1}=\cdots=Q_{T}.

In reality, each cell type grows at a different rate, even if there are no changes, we would expect Qt+1=gt(Qt)QtQ_{t+1}=g_{t}(Q_{t})\neq Q_{t}. Therefore, we need to detect those biological change point tt, where Qt+1gt(Qt)Q_{t+1}\neq g_{t}(Q_{t}).

Following (3), we can define the transport cost for unbalanced optimal transport by

Wλ(Qt,Qt+1)minπΓd,d+,Qt+1{j=1dk=1dmjkπjk+λKL(π1d||Qt)}.\begin{split}W^{\lambda}(Q_{t},Q_{t+1})\coloneqq\min_{\pi\in\Gamma_{d,d}^{+,Q_{t+1}}}\{\sum_{j=1}^{d}\sum_{k=1}^{d}m_{jk}\pi_{jk}+\lambda KL\left(\pi 1_{d}||Q_{t}\right)\}\end{split}. (4)

To identify the biological change points, we define Wt=Wλ(Qt,Qt+1)W_{t}=W^{\lambda}(Q_{t},Q_{t+1}) and seek the local maxima among the statistics {W0,W1,,WT1}\{W_{0},W_{1},\ldots,W_{T-1}\} using the peak detection method by Zhang et al. (2015). The process is detailed in LABEL:sec:peak of the supplementary material. The locations of these peaks indicate the biological change points.

3.2 Practical implementation

In practice, the marginal distributions QtQ_{t} and Qt+1Q_{t+1} as well as the centroids of each cell type are unknown and need to be estimated from the data. Denote yt,iGy_{t,i}\in\mathbb{R}^{G} the gene expression of the iith observed cell at time tt and xt,id={1,2,,d}x_{t,i}\in\mathcal{I}_{d}=\{1,2,\ldots,d\} the cell type of the iith cell at time tt. The marginal distribution QtQ_{t} can be estimated by its empirical distribution Q^t=(q^t,1,,q^t,d)T\widehat{Q}_{t}=(\widehat{q}_{t,1},\ldots,\widehat{q}_{t,d})^{{\mathrm{\scriptscriptstyle T}}}, where q^t,k=nt1i=1nt𝟙(xt,i=k)\widehat{q}_{t,k}=n_{t}^{-1}\sum_{i=1}^{n_{t}}\mathbbm{1}(x_{t,i}=k).

To solve the optimization problem (3), the transport cost mjkm_{jk} is calculated based on the distances between centroids of the gene expression profiles of those cells from the jjth and the kkth cell types. Naively, the centroids can be calculated based on the original GG-dimensional expression data, however, the number of genes GG can be very large. In this case, the Euclidean distances between cells become similar to each other, making it very hard to distinguish different cell types (Feng et al., 2020). To deal with this problem, we adopt the Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE) (Moon et al., 2019), a commonly used dimension reduction approach in single-cell RNA sequencing analysis, to reduce the original GG-dimensional gene expression profiles to the two-dimensional data. Specifically, the cost m^jk=ζ^jζ^k22\widehat{m}_{jk}=\|\widehat{\zeta}_{j}-\widehat{\zeta}_{k}\|_{2}^{2}, where ζ^j2\widehat{\zeta}_{j}\in\mathbb{R}^{2} and ζ^k2\widehat{\zeta}_{k}\in\mathbb{R}^{2} are centroids of the PHATE-reduced two-dimensional gene expression profiles from the jjth and the kkth cell types at all time points, respectively.

We then plug Q^t\widehat{Q}_{t}, Q^t+1\widehat{Q}_{t+1}, and m^jk\widehat{m}_{jk} into (3), and estimate transport plan πt,t+1\pi^{t,t+1} by solving

π^t,t+1=argminπΓd,d+,Q^t+1j=1dk=1dm^jkπjk+λKL(π1d||Q^t).\widehat{\pi}^{t,t+1}=\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{+,\hat{Q}_{t+1}}}\sum_{j=1}^{d}\sum_{k=1}^{d}\widehat{m}_{jk}\pi_{jk}+\lambda KL\left(\pi 1_{d}||\widehat{Q}_{t}\right). (5)

The above optimization problem involves a regularization parameter λ\lambda. Choosing the appropriate λ\lambda remains a challenging and unresolved issue in unbalanced optimal transport problems. Recent research commonly recommends using a value of λ=1\lambda=1 (Lee et al., 2019; Séjourné et al., 2023), which we adopt in all our experiments and data applications.

4 Simulation

In this section, we conduct a simulation study to evaluate the performance of TIMO. We consider a population of cells spanning a time horizon from t=0,1,,Tt=0,1,\ldots,T, with T=50T=50. In this population, there are a total of d=10d=10 distinct cell types, each expressing G=50G=50 different genes.

We generate the cell type XtX_{t} from a multinomial distribution Qt=(qt,1,qt,2,,qt,10)TQ_{t}=(q_{t,1},q_{t,2},\ldots,q_{t,10})^{{\mathrm{\scriptscriptstyle T}}}. Initially, Q0Q_{0} is set as (0.1,0.1,,0.1)T(0.1,0.1,\ldots,0.1)^{\mathrm{\scriptscriptstyle T}}. We then define gt(Qt)g_{t}(Q_{t}) for t=0,1,,T1t=0,1,\ldots,T-1 based on Equation 2. Here, the relative growth rate of the jjth cell type at time tt is set as gt,j=exp(νsin(t+j1d)π)g_{t,j}=\exp(\nu\sin(\frac{t+j-1}{d})\pi), where we consider two settings ν=0.1\nu=0.1 and ν=0.25\nu=0.25.

Given Q0Q_{0} and gt,jg_{t,j}, we generate Q1,Q2,,Q50Q_{1},Q_{2},\ldots,Q_{50} as follows:

Qt+1={c1Tgt(Qt)c1Tgt(Qt)1t{10,30}c2Tgt(Qt)c2Tgt(Qt)1t{20,40}gt(Qt)t{10,20,30,40},Q_{t+1}=\begin{cases}\frac{c_{1}^{{\mathrm{\scriptscriptstyle T}}}g_{t}(Q_{t})}{\|c_{1}^{{\mathrm{\scriptscriptstyle T}}}g_{t}(Q_{t})\|_{1}}&t\in\{10,30\}\\ \frac{c_{2}^{{\mathrm{\scriptscriptstyle T}}}g_{t}(Q_{t})}{\|c_{2}^{{\mathrm{\scriptscriptstyle T}}}g_{t}(Q_{t})\|_{1}}&t\in\{20,40\}\\ g_{t}(Q_{t})&t\notin\{10,20,30,40\},\end{cases}

where c1=eη(15,15)Tc_{1}=e^{\eta}(1_{5},-1_{5})^{{\mathrm{\scriptscriptstyle T}}} and c2=eη(15,15)Tc_{2}=e^{\eta}(-1_{5},1_{5})^{{\mathrm{\scriptscriptstyle T}}} with 1d1_{d} denotes a dd-dimensional vector with all entries equal 11. Here, we generate QtQ_{t} differently at time points t=10,20,30,40t=10,20,30,40 to mimic the cell type differentiation at these time points. Given Q0,Q1,,QTQ_{0},Q_{1},\ldots,Q_{T}, we generate the gene expression Yt50Y_{t}\in\mathbb{R}^{50} based on the conditional distribution Yt|XtY_{t}|X_{t}. In particular, Yt|Xt=jY_{t}|X_{t}=j is generated from the multivariate normal distribution 𝒩G(μj,Σj)\mathcal{N}_{G}(\mu_{j},\Sigma_{j}), with μj={0.5(jd)1}1G\mu_{j}=\{0.5(j-d)-1\}1_{G} and Σj=IG\Sigma_{j}=I_{G}.

To define the underlying cost, we first reduce YtY_{t} to two dimensions by the PHATE procedure. The underlying cost mjk=ζjζk22m_{jk}=||\zeta_{j}-\zeta_{k}||_{2}^{2}, where ζj2\zeta_{j}\in\mathbb{R}^{2} and ζk2\zeta_{k}\in\mathbb{R}^{2} are centroids of reduced two-dimensional expressions of the jjth and kkth cell types, respectively. Given the underlying costs mjkm_{jk}’s and the marginal distributions QtQ_{t}’s, the underlying transport plan πt,t+1\pi^{t,t+1} is defined as: πt,t+1argminπΓd,d+,Qt+1{j,kdπjkmjk+λKL(Π1d||Qt)},\pi^{t,t+1}\coloneqq\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{+,Q_{t+1}}}\{\sum_{j,k}^{d}\pi_{jk}m_{jk}+\lambda KL(\Pi 1_{d}||Q_{t})\}, with default value λ=1\lambda=1.

To generate the data, we sample nn cells at each time point, where we consider n=1000n=1000 and 20002000. Denote xt,ix_{t,i} the cell type of the iith sampled cell at time tt and yt,iy_{t,i} its gene expression. Each xt,ix_{t,i} is independent and identically distributed (i.i.d) from QtQ_{t}, and yt,iy_{t,i} is i.i.d. from 𝒩G(μxt,i,Σxt,i)\mathcal{N}_{G}(\mu_{x_{t,i}},\Sigma_{x_{t,i}}). We can obtain the empirical distributions Q^t={q^t,j}1jdT\widehat{Q}_{t}=\{\widehat{q}_{t,j}\}_{1\leq j\leq d}^{\mathrm{\scriptscriptstyle T}}, where q^t,j=n1i=1n1(xt,i=j)\widehat{q}_{t,j}=n^{-1}\sum_{i=1}^{n}1(x_{t,i}=j). To estimate the centroid ζj\zeta_{j}, we first project yt,iy_{t,i} to the two-dimensional space using the PHATE procedure, denoted by y~t,i\tilde{y}_{t,i}. For each cell type, its centroid ζ^j\widehat{\zeta}_{j} is estimated as ζ^j={t=0Ti=1n1(xt,i=j)}1t=0Ti=1ny~t,i1(xt,i=j)\widehat{\zeta}_{j}=\{\sum_{t=0}^{T}\sum_{i=1}^{n}1(x_{t,i}=j)\}^{-1}\sum_{t=0}^{T}\sum_{i=1}^{n}\tilde{y}_{t,i}1(x_{t,i}=j). We can then obtain the cost estimate m^jk=ζ^jζ^k22\widehat{m}_{jk}=||\widehat{\zeta}_{j}-\widehat{\zeta}_{k}||_{2}^{2}. Finally, we estimate the transport plan πt,t+1\pi^{t,t+1} by solving the following optimization problem

π^t,t+1argminπΓd,d+,Q^t+1{j,kdπjkm^jk+λKL(Π1d||Q^t)}.\widehat{\pi}^{t,t+1}\coloneqq\operatorname*{arg\,min}_{\pi\in\Gamma_{d,d}^{+,\widehat{Q}_{t+1}}}\{\sum_{j,k}^{d}\pi_{jk}\widehat{m}_{jk}+\lambda KL(\Pi 1_{d}||\widehat{Q}_{t})\}.

We first examine the estimation error for the transport plan πt,t+1\pi^{t,t+1}. We conduct 10001000 Monte Carlo runs and we report the average of 14t𝒞π^t,t+1πt,t+1F2\frac{1}{4}\sum_{t\in\mathcal{C}}\|\widehat{\pi}^{t,t+1}-\pi^{t,t+1}\|_{F}^{2} at four change points t𝒞={10,20,30,40}t\in\mathcal{C}=\{10,20,30,40\} and 146t𝒞π^t,t+1πt,t+1F2\frac{1}{46}\sum_{t\notin\mathcal{C}}\|\widehat{\pi}^{t,t+1}-\pi^{t,t+1}\|_{F}^{2} at all 4646 non-change points t𝒞t\notin\mathcal{C} for different settings of η\eta and ν\nu in Table 1. The results show the estimation errors become smaller as nn increases, suggesting consistency of the proposed estimator.

Table 1: Estimation Errors ×10000\times 10000 (Standard error ×10000\times 10000) for change points and non-change points. The results are based on 10001000 Monte Carlo runs.
Change Points Non-Change Points
ν=0.1\nu=0.1 ν=0.25\nu=0.25 ν=0.1\nu=0.1 ν=0.25\nu=0.25
n=1000n=1000 η=0.5\eta=0.5 39.87(0.06) 76.08(0.05) 14.37(0.01) 20.81(0.01)
n=1000n=1000 η=1\eta=1 20.83(0.06) 22.70(0.18) 13.27(0.01) 13.15(0.01)
n=2000n=2000 η=0.5\eta=0.5 32.93(0.09) 70.58(0.05) 7.51(0.01) 15.40(0.01)
n=2000n=2000 η=1\eta=1 13.75(0.17) 15.81(0.29) 6.68(0.01) 7.16(0.01)

We next evaluate the proposed change point detection procedure in Section 3.1. In this setting, we have biological change points at time t=10,20,30,40t=10,20,30,40. We also compare the performance with two other change point detection methods for multinomial data. In particular, we consider the method in Wang et al. (2018), which is designed for detection of multiple change points for multinomial time series data. We also compare with Matteson and James (2014), which is designed for change point detection for multivariate time series data. Here, we treat the probability weight vector at every time point as one vector observations in the time series and perform their method.

Suppose the underlying change points are 𝒞={t1,t2,,tL}\mathcal{C}=\{t_{1},t_{2},\ldots,t_{L}\} and the estimated change points are 𝒞^={t^1,t^2,,t^L}\widehat{\mathcal{C}}=\{\widehat{t}_{1},\widehat{t}_{2},\ldots,\widehat{t}_{L^{\prime}}\}. Similar to Aminikhanghahi and Cook (2017) and Truong et al. (2020), we report the following quantities to evaluate the performance:

precision=|𝒞𝒞^||𝒞^|,recall=|𝒞𝒞^||𝒞|,F-score=2×precision×recallprecision+recall.\text{precision}=\frac{|\mathcal{C}\cap\widehat{\mathcal{C}}|}{|\widehat{\mathcal{C}}|},\quad\text{recall}=\frac{|\mathcal{C}\cap\widehat{\mathcal{C}}|}{|\mathcal{C}|},\quad\text{F-score}=\frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}.

Among these three quantities, F-score is considered the most important measure for change point detection. The results are summarized in Tables 2 and 3.

Table 2: Change Point Detection Results for n=1000n=1000. The results are based on 10001000 Monte Carlo runs. TIMO is the proposed method, MN denotes the method in Wang et al. (2018), and ECP denotes the method in Matteson and James (2014). The sample means are reported alongside standard errors (×100\times 100) in brackets. The best results among all methods for each setting are shaded with grey color.
ν=0.1\nu=0.1 ν=0.25\nu=0.25
precision recall F-score precision recall F-score
TIMO 0.93(0.32) 0.99(0.14) 0.95(0.19) 0.82(0.48) 0.81(0.56) 0.80(0.45)
η=0.25\eta=0.25 ECP 0.43(0.26) 0.91(0.44) 0.58(0.26) 0.26(0.53) 0.54(1.19) 0.35(0.73)
MN 0.23(0.10) 0.96(0.32) 0.37(0.15) 0.07(0.14) 0.39(0.78) 0.12(0.24)
TIMO 0.95(0.28) 1.00(0.05) 0.97(0.16) 0.91(0.34) 1.00(0.12) 0.95(0.21)
η=1\eta=1 ECP 0.45(0.21) 1.00(0.00) 0.62(0.16) 0.44(0.11) 0.96(0.33) 0.60(0.14)
MN 0.21(0.21) 0.74(0.71) 0.33(0.32) 0.10(0.14) 0.58(0.79) 0.17(0.24)
Table 3: Simulation Results for n=2000n=2000. MN: multinomial change point detection by Wang et al. (2018). ECP: multivariate time series change point detection by Matteson and James (2014).
ν=0.1\nu=0.1 ν=0.25\nu=0.25
precision recall F-score precision recall F-score
TIMO 0.99(0.16) 1.00(0.08) 0.99(0.10) 0.87(0.45) 0.84(0.52) 0.84(0.42)
η=0.25\eta=0.25 ECP 0.44(0.16) 0.98(0.24) 0.60(0.16) 0.28(0.57) 0.61(1.28) 0.39(0.79)
MN 0.07(0.16) 0.29(0.70) 0.11(0.26) 0.08(0.12) 0.57(0.80) 0.14(0.20)
TIMO 0.99(0.13) 1.00(0.04) 0.99(0.08) 0.97(0.23) 1.00(0.04) 0.98(0.13)
η=1\eta=1 ECP 0.45(0.15) 1.00(0.00) 0.62(0.12) 0.44(0.07) 0.98(0.24) 0.61(0.09)
MN 0.10(0.18) 0.44(0.77) 0.17(0.29) 0.09(0.11) 0.61(0.76) 0.15(0.19)

Our proposed method TIMO showcases outstanding performance across various scenarios. It is evident that TIMO signficantly leads in precision and F-score in every instance. While ECP excels in recall, TIMO consistently attains values near the value 11.

5 Analysis of scRNA-seq Data from Mouse Embryonic Fibroblasts

We analyze the scRNA-seq data (Schiebinger et al., 2019) which can be downloaded from NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122662). The cells were originated from mouse embryonic fibroblasts (MEFs) in serum conditions, and this cellular populations contained multiple cell types due to differentiation. The experiment ran over 18 days during which a total of 259,155259,155 cells were collected. We pre-processed the scRNA-seq data using R package Seurat (Satija et al., 2015). Specifically, we first remove low-quality cells (e.g. cells with very few genes or cell doublets/multiplets with an aberrantly high gene count). Next, the gene expression levels of each cell were normalized by the total expression levels. The normalization is followed by multiplicative scaling and log transformation. Finally, the Seurat recipe selects only the genes that demonstrate high cell-to-cell variation, i.e., the potentially biologically significant genes. After pre-processing, there are a total of 182,908182,908 cells collected every 1212 hours from day 0 to day 1818. The number of cells at each time point is provided in LABEL:fig:samplesize in the supplementary material. For each cell, we obtaine its gene expression level over G=216G=216 genes, i.e., Yt216Y_{t}\in\mathbb{R}^{216}. For simplicity, we denote these observation time {0,1,2,,36}\{0,1,2,\ldots,36\}, where the time point tt denotes the 0.5t0.5t-th day.

Refer to caption
Figure 5: PHATE-embedded scRNA Data Visualization. This figure displays scRNA sequencing data, where the gene expression profile of each cell contains G=216G=216 genes. These high-dimensional data points have been dimensionally reduced to a two-dimensional space via the PHATE procedure. The dataset includes d=7d=7 distinct cell types: MEF, MET cells, stromal cells, trophoblasts, neural cells, epithelial cells, and IPS cells. In the two panels, the left one displays cells color-coded to indicate the day of collection, ranging from dark blue (early days) to light yellow (later days), while the right panel color-codes cells by their specific cell types.

We leverage the cell sets provided by Schiebinger et al. (2019) to assign labels to each cell based on its corresponding cell type. Our dataset encompasses the following d=7d=7 distinct cell types: Mouse Embryonic Fibroblasts (MEFs), Mesenchymal-Epithelial Transition (MET) Cells, Induced Pluripotent Stem (IPS) Cells, Stromal Cells, Epithelial Cells, Neural Cells and Trophoblasts.

We first perform dimensionality reduction using PHATE. Figure 5 displays the PHATE-embedded data, featuring labels for both time and cell types. To outline the observed patterns in Figure 5, we can observe the dominance of MEFs in the population during the earlier days of development. As developmental progresses, distinct branches corresponding to the final stages of development emerge. These branches are populated by various cell types, including epithelial cells, trophoblasts, and induced pluripotent stem cells (IPS cells). Notably, the middle region is predominantly occupied by mesenchymal-epithelial transition (MET) cells, signifying transitional states on the path to becoming epithelial cells.

We now demonstrate the performance of our proposed method, TIMO, on the MEF dataset. TIMO detects two biological change points on days 4.5 and 10 (see LABEL:fig:ot_cost in the supplementary material for a plot of the statistics {W0,W1,,WT1}\{W_{0},W_{1},\ldots,W_{T-1}\}). Transport maps for these two time points are presented in Figure 6 and Figure 7. From day 4.5 to 5, the transport map highlights the early stage of cells transitioning into the MET state. According to Schiebinger et al. (2019), the first significant divergence occurs around day 1.5 and intensifies over the subsequent days, notably around day 4. From day 10 to 10.5, the transport map illustrates a substantial portion of the population undergoing differentiation, with many cell types transitioning into stromal cells. This aligns with the peak population of stromal cells around days 10-11, as reported by Schiebinger et al. (2019). In Figures LABEL:fig:tempora0LABEL:fig:tempora17_5 of the supplementary material, we also display all the transport maps corresponding to the remaining non-change points. We can see that the transport maps predominantly take the form of diagonal matrices, except at the two detected change points. Importantly, our model does not identify the increase in IPS cell proportion and decrease in stromal cell proportion as change points, further highlighting that our model effectively distinguishes the effects of growth from differentiation.

Next, we compare our results with the detailed biological findings from Schiebinger et al. (2019). From a biological standpoint, this entire timeline corresponds to the reprogramming of IPS cells. The developmental process of this population reveals two critical branches. In the initial phases of IPS cell reprogramming, the original MEF population bifurcates into two branches: MET and stromal. Additionally, Schiebinger et al. (2019) established that IPS cells, along with other cell types like epithelial cells, neural cells, and trophoblasts, also originate from the MET branch. However, only a very small percentage (less than 10%) of MET cells undergo differentiation into IPS cells or neural cells. The increase in the proportion of IPS cells is primarily driven by rapid cell growth and proliferation rather than differentiation. In contrast, the stromal branch emerges at a later stage, with its population peaking around days 10-11. However, beyond this peak, the proportion of stromal cells starts to decline due to reduced cellular growth and increased apoptosis (i.e., cellular death). TIMO successfully distinguishes the effects of cellular growth from cellular differentiation.

Refer to caption
Figure 6: Illustration of the First Change Point between Days 4.5 (D4.5, t=9t=9) and 5 (D5, t=10t=10). This heatmap represents the transport plan π9,10\pi^{9,10}, with colors indicating the values of πjk9,10\pi_{jk}^{9,10}, the joint probabilities between Q9Q_{9} (D4.5) and Q10Q_{10} (D5). The accompanying horizontal and vertical barplots show the marginal multinomial distributions Q9Q_{9} and Q10Q_{10}, respectively. Notably, the heatmap reveal a significant shift in the cell population, with a considerable number of MEFs at D4.5 differentiating into MET cells by D5, as evidenced by the increased proportion of MET cells.
Refer to caption
Figure 7: Illustration of the Second Change Point between Days 10 (D10, t=20t=20) and 10.5 (D10.5, t=21t=21). This heatmap depicts the transport plan π20,21\pi^{20,21}, with colors representing the values of πjk20,21\pi_{jk}^{20,21}, which are the joint probabilities between Q20Q_{20} (D10) and Q21Q_{21} (D10.5). The marginal multinomial distributions Q20Q_{20} and Q21Q_{21} are shown in the horizontal and vertical barplots, respectively. The heatmap indicates a substantial transition of MEF and MET cells from D10 into stromal cells by D10.5, as evidenced by the increased proportion of stromal cells at D10.5 in the barplot.

Finally, we compare the change points detected by TIMO with those detected by the other two methods. The ECP method detected change points on days 4.5, 8.5, 10, and 15.5, while the MN method detected change points on days 4.5, 8, 10, 11.5, 14.5, and 16.5. Both methods fail to distinguish non-change points on days 14-16, whereas TIMO succeeds by accounting for growth rates with unbalanced regularity in its minimization problem.

6 Discussion

Our study introduces TIMO, an innovative method for modeling the development trajectory of cell types using unbalanced optimal transport. Our method can be used to identify key biological change points and elucidating the temporal evolution of various cell types. We applied TIMO to a real scRNA-seq dataset from mouse embryonic fibroblasts. The results accurately pinpointed significant changes in cell type developmental process, supported by experimental observations. The transition probabilities among different cell types are consistent with the experimental observations.

While our method has shown promising results in cell type trajectory modeling, it also has some limitations and provides opportunities for future research. First, we computed the transition matrix at the cell type level. While this approach is computationally more efficient than optimal transport analysis at the single-cell level, it overlooks the heterogeneity among cells within the same cell type. Second, we applied TIMO to datasets with known cell types for each individual cell. Future improvements could involve integrating clustering techniques to make our method applicable to a broader array of datasets without predefined labels. Additionally, our approach for detecting multiple change points currently relies on constructing statistics using only two adjacent time points. This has been sufficient for our real-world applications due to significant changes in the underlying biological process. However, in scenarios with more moderate changes, it might be beneficial to incorporate moving window techniques to enhance stability and accuracy. Finally, in the field of unbalanced optimal transport, determining the appropriate regularization parameter λ\lambda through a data-adaptive method remains an unsolved challenge.

References

  • Aminikhanghahi and Cook (2017) Aminikhanghahi, S. and Cook, D. J. (2017), “A survey of methods for time series change point detection,” Knowledge and Information Systems, 51, 339–367.
  • Banerji et al. (2013) Banerji, C. R., Miranda-Saavedra, D., Severini, S., Widschwendter, M., Enver, T., Zhou, J. X., and Teschendorff, A. E. (2013), “Cellular network entropy as the energy potential in Waddington’s differentiation landscape,” Scientific Reports, 3, 3039.
  • Cao et al. (2019) Cao, J., Spielmann, M., Qiu, X., Huang, X., Ibrahim, D. M., Hill, A. J., Zhang, F., Mundlos, S., Christiansen, L., Steemers, F. J., et al. (2019), “The single-cell transcriptional landscape of mammalian organogenesis,” Nature, 566, 496–502.
  • Dixon et al. (2015) Dixon, J. R., Jung, I., Selvaraj, S., Shen, Y., Antosiewicz-Bourget, J. E., Lee, A. Y., Ye, Z., Kim, A., Rajagopal, N., Xie, W., et al. (2015), “Chromatin architecture reorganization during stem cell differentiation,” Nature, 518, 331–336.
  • Fard et al. (2016) Fard, A. T., Srihari, S., Mar, J. C., and Ragan, M. A. (2016), “Not just a colourful metaphor: modelling the landscape of cellular development using Hopfield networks,” NPJ Systems Biology and Applications, 2, 1–9.
  • Farrell et al. (2018) Farrell, J. A., Wang, Y., Riesenfeld, S. J., Shekhar, K., Regev, A., and Schier, A. F. (2018), “Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis,” Science, 360, eaar3131.
  • Feng et al. (2020) Feng, C., Liu, S., Zhang, H., Guan, R., Li, D., Zhou, F., Liang, Y., and Feng, X. (2020), “Dimension reduction and clustering models for single-cell RNA sequencing data: a comparative study,” International Journal of Molecular Sciences, 21, 2181.
  • Guo et al. (2017) Guo, M., Bao, E. L., Wagner, M., Whitsett, J. A., and Xu, Y. (2017), “SLICE: determining cell differentiation and lineage based on single cell entropy,” Nucleic Acids Research, 45, e54–e54.
  • Kalinka et al. (2010) Kalinka, A. T., Varga, K. M., Gerrard, D. T., Preibisch, S., Corcoran, D. L., Jarrells, J., Ohler, U., Bergman, C. M., and Tomancak, P. (2010), “Gene expression divergence recapitulates the developmental hourglass model,” Nature, 468, 811–814.
  • Kantorovich (1942) Kantorovich, L. V. (1942), “On the translocation of masses,” in Dokl. Akad. Nauk. USSR (NS), vol. 37, pp. 199–201.
  • Keller (2005) Keller, G. (2005), “Embryonic stem cell differentiation: emergence of a new era in biology and medicine,” Genes & Development, 19, 1129–1155.
  • Kester and Van Oudenaarden (2018) Kester, L. and Van Oudenaarden, A. (2018), “Single-cell transcriptomics meets lineage tracing,” Cell Stem Cell, 23, 166–179.
  • Krampera and Le Blanc (2021) Krampera, M. and Le Blanc, K. (2021), “Mesenchymal stromal cells: Putative microenvironmental modulators become cell therapy,” Cell Stem Cell, 28, 1708–1725.
  • Lee et al. (2019) Lee, J., Bertrand, N. P., and Rozell, C. J. (2019), “Parallel unbalanced optimal transport regularization for large scale imaging problems,” arXiv preprint arXiv:1909.00149.
  • Matteson and James (2014) Matteson, D. S. and James, N. A. (2014), “A nonparametric approach for multiple change point analysis of multivariate data,” Journal of the American Statistical Association, 109, 334–345.
  • Monge (1781) Monge, G. (1781), “Mémoire sur la théorie des déblais et des remblais,” Mem. Math. Phys. Acad. Royale Sci., 666–704.
  • Moon et al. (2019) Moon, K. R., van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., et al. (2019), “Visualizing structure and transitions in high-dimensional biological data,” Nature Biotechnology, 37, 1482–1492.
  • Mothe et al. (2012) Mothe, A. J., Tator, C. H., et al. (2012), “Advances in stem cell therapy for spinal cord injury,” Journal of Clinical Investigation, 122, 3824–3834.
  • Pham et al. (2020) Pham, K., Le, K., Ho, N., Pham, T., and Bui, H. (2020), “On unbalanced optimal transport: An analysis of sinkhorn algorithm,” in International Conference on Machine Learning, PMLR, pp. 7673–7682.
  • Satija et al. (2015) Satija, R., Farrell, J. A., Gennert, D., Schier, A. F., and Regev, A. (2015), “Spatial reconstruction of single-cell gene expression data,” Nature Biotechnology, 33, 495–502.
  • Schiebinger et al. (2019) Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. (2019), “Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming,” Cell, 176, 928–943.
  • Séjourné et al. (2023) Séjourné, T., Peyré, G., and Vialard, F.-X. (2023), “Unbalanced Optimal Transport, from theory to numerics,” Handbook of Numerical Analysis, 24, 407–471.
  • Shi et al. (2022) Shi, J., Aihara, K., Li, T., and Chen, L. (2022), “Energy landscape decomposition for cell differentiation with proliferation effect,” National Science Review, 9, nwac116.
  • Tanay and Regev (2017) Tanay, A. and Regev, A. (2017), “Scaling single-cell genomics from phenomenology to mechanism,” Nature, 541, 331–338.
  • Truong et al. (2020) Truong, C., Oudre, L., and Vayatis, N. (2020), “Selective review of offline change point detection methods,” Signal Processing, 167, 107299.
  • Waddington (2014) Waddington, C. H. (2014), The strategy of the genes, Routledge.
  • Waddington (2015) — (2015), How animals develop, Routledge.
  • Wagner et al. (2018) Wagner, D. E., Weinreb, C., Collins, Z. M., Briggs, J. A., Megason, S. G., and Klein, A. M. (2018), “Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo,” Science, 360, 981–987.
  • Wang et al. (2018) Wang, G., Zou, C., and Yin, G. (2018), “Change-point detection in multinomial data with a large number of categories,” Annals of Statistics, 46, 2020–2044.
  • Yang et al. (2014) Yang, N., Ray, S., and Krafts, K. (2014), “Cell proliferation,” in Encyclopedia of Toxicology: Third Edition, Elsevier, pp. 761–765.
  • Zhang et al. (2021) Zhang, S., Afanassiev, A., Greenstreet, L., Matsumoto, T., and Schiebinger, G. (2021), “Optimal transport analysis reveals trajectories in steady-state systems,” PLoS Computational Biology, 17, e1009466.
  • Zhang et al. (2015) Zhang, Z.-M., Tong, X., Peng, Y., Ma, P., Zhang, M.-J., Lu, H.-M., Chen, X.-Q., and Liang, Y.-Z. (2015), “Multiscale peak detection in wavelet space,” Analyst, 140, 7955–7964.