
Elastic Coupled Co-clustering for Single-Cell Genomic Data

Pengcheng Zeng, Department of Statistics, The Chinese University of Hong Kong, HK. Zhixiang Lin (corresponding author: [email protected]), Department of Statistics, The Chinese University of Hong Kong, HK.
Abstract

The recent advances in single-cell technologies have enabled us to profile genomic features at unprecedented resolution, and datasets from multiple domains are available, including datasets that profile different types of genomic features and datasets that profile the same type of genomic features across different species. These datasets typically differ in their power to identify the unknown cell types through clustering, and data integration can potentially lead to better clustering performance. In this work, we formulate the problem in an unsupervised transfer learning framework, which utilizes knowledge learned from an auxiliary dataset to improve the clustering performance on a target dataset. The degree of shared information between the target and auxiliary datasets can vary, and their distributions can also differ. To address these challenges, we propose an elastic coupled co-clustering based transfer learning algorithm that elastically propagates clustering knowledge obtained from the auxiliary dataset to the target dataset. Implementation on single-cell genomic datasets shows that our algorithm greatly improves clustering performance over traditional learning algorithms. The source code and datasets are available at https://github.com/cuhklinlab/elasticC3


Keywords: Co-clustering, Unsupervised transfer learning, Single-cell genomics

1 Introduction

Clustering aims at grouping a set of objects such that objects in the same cluster are more similar to each other than to objects in other clusters. It has wide applications in many areas, including genomics, where single-cell sequencing technologies have recently been developed. For the analysis of single-cell genomic data, most clustering methods focus on one data type: SIMLR (Wang et al., 2017), SC3 (Kiselev et al., 2017), DIMM-SC (Sun et al., 2017), SAFE-clustering (Yang et al., 2018) and SOUP (Zhu et al., 2019) are developed for scRNA-seq data, and chromVAR (Schep et al., 2017), scABC (Zamanighomi et al., 2018), SCALE (Xiong et al., 2019) and cisTopic (Gonzalez-Blas et al., 2019) are developed for scATAC-seq data. A more comprehensive discussion is presented in Lin et al. (2019). Some methods are developed for the integrative analysis of single-cell genomic data, including Seurat (Butler et al., 2018; Stuart et al., 2019), MOFA (Argelaguet et al., 2018), coupleNMF (Duren et al., 2018), scVDMC (Zhang et al., 2018), Harmony (Korsunsky et al., 2019), scACE (Lin et al., 2019) and MOFA+ (Argelaguet et al., 2020). David et al. (2020) presented a more comprehensive discussion on integration of single-cell data across samples, experiments, and types of measurement. In real-world applications, for instance, we may better cluster scATAC-seq data by using knowledge from scRNA-seq data, or better cluster scRNA-seq data from mouse by inference from human data. This raises a critical question: how can we apply knowledge learned from a dataset in one domain to cluster another dataset from a different domain?

In this paper, we focus on the problem of clustering single-cell genomic data across different domains. For example, we typically have an auxiliary unlabeled dataset A from one domain (say, scRNA-seq data from human) and a target unlabeled dataset T from a different domain (say, scRNA-seq data from mouse). The two datasets follow different distributions. Target data T may consist of a collection of unlabeled data from which it is hard to learn a good feature representation, so clustering directly on T may perform poorly. It may be easier to learn a good feature representation from auxiliary data A, for example because of its larger sample size or lower noise level. Therefore, by incorporating the auxiliary data A, we may achieve better clustering on the target data T. This problem falls in the context of transfer learning, which utilizes knowledge obtained from one learning task to improve the performance of another (Caruana, 1997; Pan and Yang, 2009), and it can be considered an instance of unsupervised transfer learning (Teh et al., 2006), since all of the data are unlabeled.


Figure 1: The model of our elastic coupled co-clustering.

In this work, we propose a novel co-clustering-based transfer learning model to address this problem. A schematic plot of our proposed model is shown in Figure 1. Auxiliary data A and target data T can be regarded as two matrices with cells in the rows and genomic features in the columns. The co-clustering framework (Dhillon et al.,, 2003), which clusters cells and features simultaneously, is utilized in this work. Our proposed approach is composed of two steps: in Step 1, we co-cluster auxiliary data A and obtain the optimal clustering results for cells and features; in Step 2, we co-cluster target data T by transferring knowledge from the clusters of cells and features learned from A. The degree of cluster propagation is elastically controlled by learning adaptively from the data, and we refer to our model as elastic coupled co-clustering (elasticC3). If auxiliary data A and target data T are highly related, the degree of cluster propagation will be higher. On the contrary, if A and T are less related, the degree of knowledge transfer will be lower. The contributions of this paper are as follows:

  • To the best of our knowledge, this is the first work introducing unsupervised transfer learning for clustering single-cell genomic data across different data types.

  • To ensure wide applicability, the proposed model elastically controls the degree of knowledge transfer and remains applicable when the numbers of cell clusters in the auxiliary and target data differ.

  • Our algorithm significantly boosts clustering performance on single-cell genomic data over traditional learning algorithms.

2 Problem Formulation


Figure 2: Toy example of elastic coupled co-clustering for single-cell genomic data. Red means active and yellow means inactive. $n_{\textsc{A}}=n_{\textsc{T}}=5$, $k=6$, $N_{\textsc{A}}=N_{\textsc{T}}=2$, $K=3$.

We first use a toy example (Figure 2) to illustrate our method. We regard two $5\times 6$ matrices as the auxiliary data (denoted as A) and the target data (denoted as T), respectively, with all matrix elements being either 1 or 0. Let $X$ and $Y$ be discrete random variables taking values from the sets of cell indexes $\{x_{1},...,x_{n_{\textsc{A}}}\}$ in the auxiliary data A and $\{y_{1},...,y_{n_{\textsc{T}}}\}$ in the target data T, respectively. Let $W$ and $Z$ be discrete random variables for the respective feature spaces of these data, taking values from the sets of feature indexes $\{w_{1},...,w_{k}\}$ and $\{z_{1},...,z_{k}\}$. Take the auxiliary data in Figure 2(a) as an example: $X=x_{i}$, $i=1,\dots,5$, means the selected cell is the $i$-th cell among the 5 cells, and $W=w_{j}$, $j=1,\dots,6$, means the selected feature is the $j$-th feature among the 6 features. The meanings of $Y$ and $Z$ (Figure 2(c)) are the same as those of $X$ and $W$.

Let $p(X,W)$ be the joint probability distribution of $X$ and $W$, which can be represented by an $n_{\textsc{A}}\times k$ matrix. $p(X=x_{i},W=w_{j})$ represents the probability of the $j$-th feature being active in the $i$-th cell: the $j$-th gene is expressed in scRNA-seq data, or the $j$-th genomic region is accessible in scATAC-seq data. The probability is estimated from the observed auxiliary data A as $p(X=x_{i},W=w_{j})=\frac{\textsc{A}_{ij}}{\sum_{u=1}^{n_{\textsc{A}}}\sum_{v=1}^{k}\textsc{A}_{uv}}$, where the $\textsc{A}_{uv}$ are the observed elements of the auxiliary data: $\textsc{A}_{uv}=1$ if the $v$-th feature is active in the $u$-th cell, and $\textsc{A}_{uv}=0$ otherwise. The marginal probability distributions are then $p(X=x_{i})=\frac{\sum_{j=1}^{k}\textsc{A}_{ij}}{\sum_{u=1}^{n_{\textsc{A}}}\sum_{v=1}^{k}\textsc{A}_{uv}}$ and $p(W=w_{j})=\frac{\sum_{i=1}^{n_{\textsc{A}}}\textsc{A}_{ij}}{\sum_{u=1}^{n_{\textsc{A}}}\sum_{v=1}^{k}\textsc{A}_{uv}}$. For the target data, $\textsc{T}=(\textsc{T}_{ij})_{n_{\textsc{T}}\times k}$ is the observed data matrix, $q(Y,Z)$ is the joint probability distribution of $Y$ and $Z$, and $q(Y)$ and $q(Z)$ are the marginal probabilities, calculated similarly to those for the auxiliary data.
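
As a minimal sketch (our illustration, not the authors' code), the estimates above amount to normalizing the binary matrix; the toy $5\times 6$ matrix below is an arbitrary stand-in for the auxiliary data in Figure 2:

```python
# Estimating p(X, W) and its marginals from a binary auxiliary matrix A,
# following the formulas above. The toy matrix is an assumed example.
import numpy as np

A = np.array([[1, 0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [1, 1, 0, 0, 1, 1]], dtype=float)

p_XW = A / A.sum()        # p(X = x_i, W = w_j), an n_A x k matrix
p_X = p_XW.sum(axis=1)    # marginal p(X = x_i)
p_W = p_XW.sum(axis=0)    # marginal p(W = w_j)
assert np.isclose(p_XW.sum(), 1.0)
```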

Our goal is to group similar cells and features into clusters (Figure 2(b) and Figure 2(e)). Suppose we want to cluster the cells in auxiliary data A and target data T into $N_{\textsc{A}}$ and $N_{\textsc{T}}$ clusters, respectively (the number of clusters is often unknown; in practice, we usually combine clustering with other exploratory analyses, including visualization, to determine it), and cluster the features in A and T into $K$ clusters. Let $X^{\ast}$ and $Y^{\ast}$ be discrete random variables that take values from the sets of cell cluster indexes $\{x_{1}^{\ast},...,x_{N_{\textsc{A}}}^{\ast}\}$ and $\{y_{1}^{\ast},...,y_{N_{\textsc{T}}}^{\ast}\}$, respectively. Let $W^{\ast}$ and $Z^{\ast}$ be discrete random variables that take values from the sets of feature cluster indexes $\{w_{1}^{\ast},...,w_{K}^{\ast}\}$ and $\{z_{1}^{\ast},...,z_{K}^{\ast}\}$, respectively. We use $C_{X}(\cdot)$ and $C_{W}(\cdot)$ to represent the clustering functions for the auxiliary data: $C_{X}(x)=x_{i}^{\ast}$ ($i=1,...,N_{\textsc{A}}$) indicates that cell $x$ belongs to cluster $x_{i}^{\ast}$, and $C_{W}(w)=w_{i}^{\ast}$ ($i=1,...,K$) indicates that feature $w$ belongs to cluster $w_{i}^{\ast}$. For the target data, the clustering functions $C_{Y}(\cdot)$ and $C_{Z}(\cdot)$ are defined in the same way. The tuples $(C_{X},C_{W})$ and $(C_{Y},C_{Z})$ are referred to as co-clusterings (Dhillon et al., 2003).

Let $p(X^{\ast},W^{\ast})$ be the joint probability distribution of $X^{\ast}$ and $W^{\ast}$, which can be represented as an $N_{\textsc{A}}\times K$ matrix. This distribution can be expressed as

$$p(X^{\ast}=x^{\ast}_{i},W^{\ast}=w^{\ast}_{j})=\sum_{x\in\{C_{X}(x)=x^{\ast}_{i}\}}\sum_{w\in\{C_{W}(w)=w^{\ast}_{j}\}}p(X=x,W=w). \quad (1)$$

The marginal probability distributions are then $p(X^{\ast}=x^{\ast}_{i})=\sum_{j=1}^{K}p(X^{\ast}=x^{\ast}_{i},W^{\ast}=w^{\ast}_{j})$ and $p(W^{\ast}=w^{\ast}_{j})=\sum_{i=1}^{N_{\textsc{A}}}p(X^{\ast}=x^{\ast}_{i},W^{\ast}=w^{\ast}_{j})$. For the target data, $q(Y^{\ast},Z^{\ast})$, $q(Y^{\ast})$ and $q(Z^{\ast})$ are defined and calculated similarly to those for the auxiliary data.
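
A sketch of Equation (1) follows; `cluster_joint` is our hypothetical helper (not from the paper), and the uniform toy joint and assignments are assumed for illustration:

```python
# Aggregating the cell-level joint p(X, W) into the cluster-level joint
# p(X*, W*) of Equation (1), given cluster assignments C_X and C_W.
import numpy as np

def cluster_joint(p_xw, C_X, C_W, N, K):
    """p_xw: n x k joint; C_X in {0,...,N-1}^n; C_W in {0,...,K-1}^k."""
    p_star = np.zeros((N, K))
    for i in range(N):
        for j in range(K):
            p_star[i, j] = p_xw[np.ix_(C_X == i, C_W == j)].sum()
    return p_star

p_xw = np.full((5, 6), 1 / 30)                  # uniform toy joint
C_X = np.array([0, 0, 1, 1, 1])                 # N_A = 2 cell clusters
C_W = np.array([0, 0, 1, 1, 2, 2])              # K = 3 feature clusters
p_star = cluster_joint(p_xw, C_X, C_W, N=2, K=3)
p_Xstar, p_Wstar = p_star.sum(axis=1), p_star.sum(axis=0)  # marginals
```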

The goal of elastic coupled co-clustering in this work is to find the optimal cell clustering function $C_{Y}$ on the target data T by co-clustering the target data while utilizing the information of $(C_{X},C_{W})$ learned from auxiliary data A (Figure 2(d)).

3 Elastic Coupled Co-clustering Algorithm

In this section, we first present our elastic coupled co-clustering (elasticC3) algorithm, and then discuss its theoretical properties.

3.1 Objective Function

Based on information-theoretic co-clustering (Dhillon et al., 2003), the objective of co-clustering instances and features is to minimize the loss in mutual information incurred by co-clustering. For auxiliary data A, the objective function can be expressed as

$$\ell_{\textsc{A}}(C_{X},C_{W})=I(X,W)-I(X^{\ast},W^{\ast}), \quad (2)$$

where $I(\cdot,\cdot)$ denotes the mutual information between two random variables: $I(C,D)=\sum_{c\in C}\sum_{d\in D}g(c,d)\log\frac{g(c,d)}{g(c)g(d)}$ (Cover and Thomas, 1991). In practice, we only consider the elements satisfying $g(c,d)\neq 0$. We propose the following objective function for elastic coupled co-clustering of the target data T:

$$\ell_{\textsc{T}}(C_{Y},C_{Z}|\tilde{C}_{X},\tilde{C}_{W})=I(Y,Z)-I(Y^{\ast},Z^{\ast})+\alpha\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}D_{\textsc{KL}}(q(Y^{\ast})||p(X^{\ast}))+\beta D_{\textsc{KL}}(q(Z^{\ast})||p(W^{\ast})), \quad (3)$$

where $\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}$ equals 1 if $N_{\textsc{A}}=N_{\textsc{T}}$ and 0 otherwise, $(\tilde{C}_{X},\tilde{C}_{W})=\operatorname{argmin}_{C_{X},C_{W}}\ell_{\textsc{A}}(C_{X},C_{W})$, and $D_{\textsc{KL}}(\cdot||\cdot)$ denotes the Kullback-Leibler divergence between two probability distributions (Cover and Thomas, 1991), $D_{\textsc{KL}}(g||h)=\sum_{x}g(x)\log\frac{g(x)}{h(x)}$. The term $I(Y,Z)-I(Y^{\ast},Z^{\ast})$ measures the loss in mutual information after co-clustering the target data T. $D_{\textsc{KL}}(q(Y^{\ast})||p(X^{\ast}))$ and $D_{\textsc{KL}}(q(Z^{\ast})||p(W^{\ast}))$ are two distribution-matching terms between A and T after co-clustering, for the row dimension (cells) and the column dimension (features), respectively. When the numbers of cell clusters differ between auxiliary data A and target data T ($N_{\textsc{A}}\neq N_{\textsc{T}}$), the distribution-matching term for the row dimension vanishes. $\alpha$ and $\beta$ are non-negative hyper-parameters that elastically control how much information should be transferred; both tend to be larger when the auxiliary data are more similar to the target data. Tuning of $\alpha$ and $\beta$ is discussed in the following section.
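
The pieces of objective (3) translate directly into code. The sketch below is ours (not the authors' implementation); `q_YZ`, `q_star` and `p_star` are assumed precomputed, e.g. with the helper shown earlier:

```python
# Mutual information, KL divergence, and the elasticC3 objective (3).
import numpy as np

def mutual_info(joint):
    """I(C, D) for a joint distribution given as a 2-D array; zero
    entries are skipped, as in the text."""
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())

def kl_div(g, h):
    """D_KL(g || h); assumes h > 0 wherever g > 0."""
    nz = g > 0
    return float((g[nz] * np.log(g[nz] / h[nz])).sum())

def elastic_objective(q_YZ, q_star, p_star, alpha, beta, same_N=True):
    """Objective (3); the row-matching term is kept only if N_A == N_T."""
    loss = mutual_info(q_YZ) - mutual_info(q_star)
    if same_N:
        loss += alpha * kl_div(q_star.sum(axis=1), p_star.sum(axis=1))
    return loss + beta * kl_div(q_star.sum(axis=0), p_star.sum(axis=0))
```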

3.2 Optimization

The optimization of the objective function in Equation (3) can be divided into two separate steps. In Step 1, as shown at the top of Figure 2(d), we use the co-clustering algorithm of Dhillon et al. (2003) to solve the following optimization problem:

$$(\tilde{C}_{X},\tilde{C}_{W})=\operatorname{argmin}_{(C_{X},C_{W})}\ell_{\textsc{A}}(C_{X},C_{W}). \quad (4)$$

Details are given in Appendix A. In Step 2, we transfer the estimated $\tilde{C}_{X}$ and $\tilde{C}_{W}$ from Step 1, as shown at the bottom of Figure 2(d), and solve the following optimization problem:

$$(\tilde{C}_{Y},\tilde{C}_{Z})=\operatorname{argmin}_{(C_{Y},C_{Z})}\ell_{\textsc{T}}(C_{Y},C_{Z}|\tilde{C}_{X},\tilde{C}_{W}). \quad (5)$$

We first rewrite the term $I(Y,Z)-I(Y^{\ast},Z^{\ast})$ in Equation (3) in the same manner as $I(X,W)-I(X^{\ast},W^{\ast})$ is rewritten in Equations (11), (12) and (13) in Appendix A, which gives

$$D_{\textsc{KL}}(q(Y,Z)||q^{\ast}(Y,Z))=\sum_{y^{\ast}\in\{y_{1}^{\ast},...,y_{N_{\textsc{T}}}^{\ast}\}}\sum_{y\in\{y:C_{Y}(y)=y^{\ast}\}}q(y)D_{\textsc{KL}}(q(Z|y)||q^{\ast}(Z|y^{\ast}))=\sum_{z^{\ast}\in\{z_{1}^{\ast},...,z_{K}^{\ast}\}}\sum_{z\in\{z:C_{Z}(z)=z^{\ast}\}}q(z)D_{\textsc{KL}}(q(Y|z)||q^{\ast}(Y|z^{\ast})), \quad (6)$$

where $q^{\ast}(z|y^{\ast})\triangleq\frac{q^{\ast}(y,z)}{q(y)}=\frac{q(y^{\ast},z^{\ast})}{q(y^{\ast})}\frac{q(z)}{q(z^{\ast})}$ and $q^{\ast}(y|z^{\ast})\triangleq\frac{q^{\ast}(y,z)}{q(z)}=\frac{q(y^{\ast},z^{\ast})}{q(z^{\ast})}\frac{q(y)}{q(y^{\ast})}$.

We can iteratively update $C_{Y}$ and $C_{Z}$ to minimize $\ell_{\textsc{T}}$. First, given $C_{Z}$, minimizing $\ell_{\textsc{T}}$ is equivalent to minimizing $\sum_{y^{\ast}\in\{y_{1}^{\ast},...,y_{N_{\textsc{T}}}^{\ast}\}}\sum_{y\in\{y:C_{Y}(y)=y^{\ast}\}}q(y)Q_{\alpha}(Y^{\ast}|C_{Z},\tilde{C}_{X})$, where

$$Q_{\alpha}(Y^{\ast}|C_{Z},\tilde{C}_{X})\triangleq D_{\textsc{KL}}(q(Z|y)||q^{\ast}(Z|y^{\ast}))+\frac{\alpha D_{\textsc{KL}}(q(Y^{\ast})||p(X^{\ast}))}{n_{\textsc{T}}N_{\textsc{T}}q(y)}\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}.$$

We iteratively update the cluster assignment $y^{\ast}$ for each cell $y$ in the target data, fixing the cluster assignments of the other cells:

$$C_{Y}(y)=\operatorname{argmin}_{y^{\ast}\in\{y_{1}^{\ast},...,y_{N_{\textsc{T}}}^{\ast}\}}Q_{\alpha}(Y^{\ast}|C_{Z},\tilde{C}_{X}). \quad (7)$$

Second, given $C_{Y}$, minimizing $\ell_{\textsc{T}}$ is equivalent to minimizing $\sum_{z^{\ast}\in\{z_{1}^{\ast},...,z_{K}^{\ast}\}}\sum_{z\in\{z:C_{Z}(z)=z^{\ast}\}}q(z)R_{\beta}(Z^{\ast}|C_{Y},\tilde{C}_{W})$, where

$$R_{\beta}(Z^{\ast}|C_{Y},\tilde{C}_{W})\triangleq D_{\textsc{KL}}(q(Y|z)||q^{\ast}(Y|z^{\ast}))+\frac{\beta D_{\textsc{KL}}(q(Z^{\ast})||p(W^{\ast}))}{n_{\textsc{T}}N_{\textsc{T}}q(z)}.$$

We iteratively update the cluster assignment $z^{\ast}$ for each feature $z$ in the target data, fixing the cluster assignments of the other features:

$$C_{Z}(z)=\operatorname{argmin}_{z^{\ast}\in\{z_{1}^{\ast},...,z_{K}^{\ast}\}}R_{\beta}(Z^{\ast}|C_{Y},\tilde{C}_{W}). \quad (8)$$

Steps 1 and 2 are summarized in Algorithm 1, with a code sketch given after the listing. Note that our model has three hyper-parameters: the non-negative $\alpha$ and $\beta$, and $K$ (the number of clusters in the feature spaces $W$ and $Z$). We perform a grid search to choose the optimal combination of parameters, and we show in Section 4 that grid search performs well on both simulated and real data.

Input: A, T, $N_{\textsc{A}}$, $N_{\textsc{T}}$, $\alpha$, $\beta$, $K$ and the numbers of iterations $I_{\textsc{A}}$ and $I_{\textsc{T}}$
Output: $C_{Y}^{(I_{\textsc{T}})}$
/* Initialization */
1 Calculate $p$ and $q$, and initialize $C_{X}^{(0)}$, $C_{W}^{(0)}$, $C_{Y}^{(0)}$ and $C_{Z}^{(0)}$;
2 Initialize $p^{\ast(0)}$ based on $p$, $C_{X}^{(0)}$, $C_{W}^{(0)}$, and initialize $q^{\ast(0)}$ based on $q$, $C_{Y}^{(0)}$, $C_{Z}^{(0)}$.
/* Step 1: details in Appendix A */
3 while $i<I_{\textsc{A}}$ do
4       Update $C_{X}^{(i)}$ based on $p$, $p^{\ast(i-1)}$ and Equation (14).
5       Update $C_{W}^{(i)}$ based on $p$, $p^{\ast(i-1)}$ and Equation (15).
6       Update $p^{\ast(i)}$ based on $C_{X}^{(i)}$, $C_{W}^{(i)}$ and Equation (12).
7 Return $C_{X}^{(I_{\textsc{A}})}$ and $C_{W}^{(I_{\textsc{A}})}$ as the final clustering functions for the auxiliary data A.
/* Step 2 */
8 while $i<I_{\textsc{T}}$ do
9       Update $C_{Y}^{(i)}$ based on $q$, $q^{\ast(i-1)}$, $C_{X}^{(I_{\textsc{A}})}$ and Equation (7).
10      Update $C_{Z}^{(i)}$ based on $q$, $q^{\ast(i-1)}$, $C_{W}^{(I_{\textsc{A}})}$ and Equation (8).
11      Update $q^{\ast(i)}$ based on $C_{Y}^{(i)}$, $C_{Z}^{(i)}$.
Return $C_{Y}^{(I_{\textsc{T}})}$ as the final clustering function for the target data T.
Algorithm 1 The elasticC3 Algorithm
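
The following is a minimal Python sketch of Algorithm 1 (ours, for illustration; the authors' implementation is at https://github.com/cuhklinlab/elasticC3). Step 1 is plain information-theoretic co-clustering, obtained here by calling the same routine with $\alpha=\beta=0$ and no reference marginals; the names `co_cluster`, `pX_star` and `pW_star` are our own, and rows and columns with all-zero entries are assumed absent:

```python
import numpy as np

def kl(g, h):
    nz = g > 0
    return float((g[nz] * np.log(g[nz] / np.maximum(h[nz], 1e-12))).sum())

def cluster_joint(q, CY, CZ, N, K):
    return np.array([[q[np.ix_(CY == a, CZ == b)].sum() for b in range(K)]
                     for a in range(N)])

def co_cluster(M, N, K, alpha=0.0, beta=0.0, pX_star=None, pW_star=None,
               n_iter=10, seed=0):
    """Co-cluster a binary matrix M into N row and K column clusters; with
    alpha, beta > 0, elastically match the reference cluster marginals
    pX_star = p(X*), pW_star = p(W*) learned from the auxiliary data."""
    rng = np.random.default_rng(seed)
    n, k = M.shape
    q = M / M.sum()                         # q(Y, Z)
    qy, qz = q.sum(1), q.sum(0)             # q(y), q(z)
    CY, CZ = rng.integers(N, size=n), rng.integers(K, size=k)
    for _ in range(n_iter):
        # row update, Equation (7)
        qs = cluster_joint(q, CY, CZ, N, K)
        qY, qZ = qs.sum(1), qs.sum(0)
        # q*(z | y* = a) for every feature z, an N x k array
        qZgiven = (qs[:, CZ] / np.maximum(qY[:, None], 1e-12)) * \
                  (qz / np.maximum(qZ[CZ], 1e-12))
        for i in range(n):
            cost = np.array([kl(q[i] / qy[i], qZgiven[a]) for a in range(N)])
            if alpha > 0 and pX_star is not None and len(pX_star) == N:
                for a in range(N):          # penalty with cell i moved to a
                    CY_a = CY.copy(); CY_a[i] = a
                    qY_a = np.bincount(CY_a, weights=qy, minlength=N)
                    cost[a] += alpha * kl(qY_a, pX_star) / (n * N * qy[i])
            CY[i] = int(np.argmin(cost))
        # column update, Equation (8)
        qs = cluster_joint(q, CY, CZ, N, K)
        qY, qZ = qs.sum(1), qs.sum(0)
        # q*(y | z* = b) for every cell y, an n x K array
        qYgiven = (qs[CY, :] / np.maximum(qZ[None, :], 1e-12)) * \
                  (qy[:, None] / np.maximum(qY[CY, None], 1e-12))
        for j in range(k):
            cost = np.array([kl(q[:, j] / qz[j], qYgiven[:, b])
                             for b in range(K)])
            if beta > 0 and pW_star is not None:
                for b in range(K):
                    CZ_b = CZ.copy(); CZ_b[j] = b
                    qZ_b = np.bincount(CZ_b, weights=qz, minlength=K)
                    cost[b] += beta * kl(qZ_b, pW_star) / (n * N * qz[j])
            CZ[j] = int(np.argmin(cost))
    return CY, CZ
```

In this sketch, Step 1 corresponds to `CX, CW = co_cluster(A, N_A, K)`; the cluster marginals of `A / A.sum()` under `(CX, CW)` then give `pX_star` and `pW_star` for the Step 2 call `co_cluster(T, N_T, K, alpha, beta, pX_star, pW_star)`.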

3.3 Theoretical Properties

First, we establish that the objective function of the elasticC3 algorithm is monotonically decreasing:

Theorem 1.

Let the value of the objective function $\ell_{\textsc{T}}$ in the $i$-th iteration be

$$\ell_{\textsc{T}}(C_{Y}^{\ast(i)},C_{Z}^{\ast(i)}|\tilde{C}_{X},\tilde{C}_{W})=I(Y,Z)-I(Y^{\ast(i)},Z^{\ast(i)})+\alpha\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}D_{\textsc{KL}}(q(Y^{\ast(i)})||p(X^{\ast}))+\beta D_{\textsc{KL}}(q(Z^{\ast(i)})||p(W^{\ast})). \quad (9)$$

Then, we have

$$\ell_{\textsc{T}}(C_{Y}^{\ast(i)},C_{Z}^{\ast(i)}|\tilde{C}_{X},\tilde{C}_{W})\geq\ell_{\textsc{T}}(C_{Y}^{\ast(i+1)},C_{Z}^{\ast(i+1)}|\tilde{C}_{X},\tilde{C}_{W}). \quad (10)$$

The proof of Theorem 1 is given in Appendix B. Because the search space is finite and the objective function is non-increasing over the iterations, Algorithm 1 converges in a finite number of iterations. Note that the algorithm converges to a local minimum; finding the globally optimal solution is NP-hard.

Finally, we analyze the computational complexity of our algorithm. Suppose the total number of cell-feature co-occurrences (i.e., entries equal to 1 in the data matrices) is $d_{\textsc{A}}$ in the auxiliary dataset and $d_{\textsc{T}}$ in the target dataset. In each iteration, updating $C_{X}$ and $C_{W}$ in Step 1 takes $\mathcal{O}((N_{\textsc{A}}+K)\cdot d_{\textsc{A}})$, while updating $C_{Y}$ and $C_{Z}$ in Step 2 takes $\mathcal{O}((N_{\textsc{T}}+K)\cdot d_{\textsc{T}})$. The number of iterations is $I_{\textsc{A}}$ in Step 1 and $I_{\textsc{T}}$ in Step 2. Therefore, the time complexity of our elasticC3 algorithm is $\mathcal{O}((N_{\textsc{A}}+K)\cdot d_{\textsc{A}}\cdot I_{\textsc{A}}+(N_{\textsc{T}}+K)\cdot d_{\textsc{T}}\cdot I_{\textsc{T}})$. The experiments show that $I_{\textsc{A}}=I_{\textsc{T}}=10$ is enough for convergence (Figure 3). Treating the numbers of clusters $N_{\textsc{A}}$, $N_{\textsc{T}}$ and $K$ as constants, the time complexity of elasticC3 is $\mathcal{O}(d_{\textsc{A}}+d_{\textsc{T}})$. Because our algorithm needs to store all of the cell-feature co-occurrences, its space complexity is also $\mathcal{O}(d_{\textsc{A}}+d_{\textsc{T}})$.

4 Experiments

To evaluate the performance of our algorithm, we conduct experiments on simulated datasets and two single-cell genomic datasets.

4.1 Data Sets

To generate the simulated single-cell data, we follow the simulation setup of Lin et al. (2019). We use the scATAC-seq data as the auxiliary data A and the scRNA-seq data as the target data T, and set the number of clusters $N_{\textsc{A}}=N_{\textsc{T}}=2$. Details for generating A and T are given in Appendix C. We set the number of cells in both A and T to $n_{\textsc{A}}=n_{\textsc{T}}=100$ and the number of features to $k=100$. We assume that only a subset of features are highly correlated between the scRNA-seq and scATAC-seq data, while the remaining features are no more correlated than random. We vary the percentage of highly correlated features (percentage = 0.1, 0.5, and 0.9; details in the first column of Table 1).

Our experiments also include two real single-cell genomic datasets. Real data 1 uses human scRNA-seq data as the auxiliary data and human scATAC-seq data as the target data: 233 K562 and 91 HL60 scATAC-seq cells are obtained from Buenrostro et al. (2015), and 42 K562 and 54 HL60 deeply sequenced scRNA-seq cells are obtained from Pollen et al. (2014). True cell labels are used as a benchmark for evaluating the performance of the clustering methods. Real data 2 uses human scRNA-seq data as the auxiliary data and mouse scRNA-seq data as the target data. There are three cell types, and the datasets are downloaded from panglaodb.se (Fran et al., 2019). For the mouse data, 179 pulmonary alveolar type 2 cells, 99 clara cells and 14 ependymal cells are obtained under accession number SRS4237518. For the human data, 193 pulmonary alveolar type 2 cells, 113 clara cells and 58 ependymal cells are obtained under accession number SRS4660846. We use the cell-type annotation (Angelidis et al., 2019) as a benchmark for evaluating the performance of the clustering methods. For each real dataset, we also consider the setting where $N_{\textsc{A}}\neq N_{\textsc{T}}$ by removing one cell type from the auxiliary data, referred to as Setting 2 (details in the first column of Table 2).

4.2 Experimental Results

For the simulation analysis, we compare our method with the classic unsupervised transfer clustering method STC (Dai et al., 2008) and with $k$-means clustering. For the real data analysis, we perform variable selection before clustering and select the 100 most variable features for each real dataset (Appendix D). After variable selection, we binarize the data matrices by setting the non-zero entries to 1. For the real data analysis, we compare our method with STC (Dai et al., 2008), co-clustering (Dhillon et al., 2003) and two commonly used clustering methods for single-cell genomic data, SC3 (Kiselev et al., 2017) and SIMLR (Wang et al., 2017). For our method elasticC3, we use grid search to tune the hyper-parameters $\alpha$, $\beta$ and $K$, with search domains $\alpha\in[0,1]$, $\beta\in[0,1]$ and $K\in(0,10)$. The methods STC and co-clustering require binary input. Both SC3 and SIMLR use log2(TPM+1) as input for the real scRNA-seq data. The hyper-parameters in STC and co-clustering are also tuned by grid search, as presented in their original publications. Note that co-clustering is equivalent to elasticC3 with $\alpha=0$ and $\beta=0$, where no knowledge is transferred between the auxiliary and target data. We use four criteria to evaluate the clustering results: normalized mutual information (NMI), adjusted Rand index (ARI), Rand index (RI) and purity.
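
A sketch of the four evaluation criteria follows (our code, assuming scikit-learn >= 0.24, which provides `rand_score`); purity is computed from the contingency table, and the labels are hypothetical:

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, rand_score)
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()  # dominant true label per cluster

truth = np.array([0, 0, 1, 1, 2, 2])        # hypothetical ground truth
pred = np.array([0, 0, 1, 2, 2, 2])         # hypothetical clustering
print("NMI", normalized_mutual_info_score(truth, pred),
      "ARI", adjusted_rand_score(truth, pred),
      "RI", rand_score(truth, pred),
      "purity", purity_score(truth, pred))
```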

Table 1: Clustering results for simulated scRNA-seq data (target data) over 50 independent runs. The values of the hyper-parameters $\alpha$, $\beta$ and $K$ for the elasticC3 algorithm, tuned by grid search, are shown.
Simulated Data Setting Methods NMI ARI RI purity
Setting 1 elasticC3 0.214 0.262 0.631 0.752
percentage = 0.1 STC 0.142 0.172 0.586 0.701
$\alpha=0.1,\beta=0,K=3$ $k$-means 0.116 0.136 0.568 0.671
Setting 2 elasticC3 0.282 0.345 0.674 0.796
percentage = 0.5 STC 0.208 0.255 0.628 0.753
$\alpha=0.5,\beta=0.01,K=3$ $k$-means 0.229 0.263 0.632 0.748
Setting 3 elasticC3 0.379 0.454 0.727 0.837
percentage = 0.9 STC 0.374 0.440 0.720 0.830
$\alpha=0.9,\beta=0.04,K=3$ $k$-means 0.211 0.241 0.621 0.735
Table 2: Clustering results for the target data in the real datasets. Grid search gives $\alpha=0.1,\beta=0.1,K=3$ on real data 1 and $\alpha=0.05,\beta=0.01,K=3$ on real data 2 for our elasticC3 method.
Real Data Sets Methods NMI ARI RI purity
Real data 1 elasticC3 (Setting 1) 0.610 0.743 0.875 0.933
A: human scRNA-seq data elasticC3 (Setting 2) 0.535 0.658 0.832 0.908
T: human scATAC-seq data STC (Setting 1) 0.495 0.616 0.811 0.895
Setting 1: complete data STC (Setting 2) 0.472 0.587 0.796 0.885
($N_{\textsc{A}}=N_{\textsc{T}}=2$) Co-clustering 0.502 0.617 0.811 0.895
Setting 2: K562 removed from A SC3 0.000 0.001 0.504 0.710
($N_{\textsc{A}}=1,N_{\textsc{T}}=2$) SIMLR 0.025 0.046 0.525 0.710
Real data 2 elasticC3 (Setting 1) 0.564 0.682 0.841 0.880
A: human scRNA-seq data elasticC3 (Setting 2) 0.515 0.629 0.815 0.863
T: mouse scRNA-seq data STC (Setting 1) 0.454 0.434 0.718 0.836
Setting 1: complete data STC (Setting 2) 0.397 0.425 0.714 0.788
($N_{\textsc{A}}=N_{\textsc{T}}=3$) Co-clustering 0.510 0.625 0.813 0.860
Setting 2: alveolar type 2 removed from A SC3 0.378 0.318 0.661 0.849
($N_{\textsc{A}}=2,N_{\textsc{T}}=3$) SIMLR 0.405 0.244 0.618 0.709

4.2.1 Performance

Table 1 presents the clustering results for the simulated scRNA-seq data (target data T). Across the different settings, the trends are similar for the four clustering criteria (NMI, ARI, RI and purity). When the percentage of highly correlated features increases from 0.1 to 0.9, more information is shared between the feature space $W$ of the auxiliary data and the feature space $Z$ of the target data. The clustering results for elasticC3 and STC improve accordingly, because they can transfer more informative knowledge from the auxiliary data to improve clustering of the target data. When the features $W$ and $Z$ are less similar (percentage = 0.1 and 0.5), STC does not perform as well as elasticC3, because STC assumes that $W$ and $Z$ are the same, while elasticC3 allows $W$ and $Z$ to differ and elastically controls the degree of knowledge transfer through the term $\beta D_{\textsc{KL}}(q(Z^{\ast})||p(W^{\ast}))$ in Equation (3). When $W$ and $Z$ are more similar (percentage = 0.9), STC is comparable to elasticC3. We also note that elasticC3 adaptively learns the degree of knowledge transfer from the data: the parameters $\alpha$ and $\beta$ increase as the similarity between the auxiliary and target data increases. In Setting 1, the features $W$ and $Z$ are less related, and the tuning parameter $\beta=0$ in elasticC3. Finally, because $k$-means clustering cannot transfer knowledge from the auxiliary data to the target data, it does not perform as well as elasticC3 and STC.

The clustering results for the target data in the two real single-cell genomic datasets are shown in Table 2. ElasticC3 performs better in Setting 1 than in Setting 2 on both datasets, because one cell type is removed from the auxiliary data in Setting 2, which loses information that could be transferred. Knowledge transfer improves clustering of the target data: elasticC3 outperforms the methods that do not incorporate knowledge transfer (co-clustering, SC3 and SIMLR) on both datasets. The distributions of the features are likely different between the auxiliary and target data: in real data 1, the distribution of gene expression is likely different from the distribution of promoter accessibility; in real data 2, the distributions of gene expression may differ between human and mouse. STC treats the features in the auxiliary and target data as the same when transferring knowledge. On both datasets, STC performs worse than elasticC3, and even worse than co-clustering, especially on real data 2. The degree of shared information is likely lower in real data 2, as indicated by the smaller values of $\alpha$ and $\beta$ selected when we run elasticC3. In summary, elasticC3 adaptively learns the degree of knowledge transfer between the auxiliary and target data, and it outperforms both the methods that do not allow knowledge transfer (co-clustering, SC3 and SIMLR) and the method that does not control the degree of knowledge transfer (STC). In addition, elasticC3 works well even when the numbers of cell clusters differ between the auxiliary and target data, which ensures its wider applicability.

4.2.2 Convergence


Figure 3: The purity curves after each iteration.

We proved the convergence of elasticC3 in Theorem 1; here we examine its convergence empirically. Figure 3 shows purity versus the number of iterations on the seven datasets (for the simulated data, we randomly select one run for each setting). Our algorithm elasticC3 converges within 10 iterations. The other three clustering criteria give similar convergence results.

5 Conclusion and Future Work

In this paper, we have developed elasticC3 for the integrative analysis of multiple single-cell genomic datasets. ElasticC3 is developed under the unsupervised transfer learning framework, where the knowledge learned from an auxiliary dataset is utilized to improve the clustering results on a target dataset. Our algorithm consists of two separate steps: in Step 1, we cluster both the cells and the features (i.e., co-cluster) in the auxiliary data, and in Step 2 we co-cluster the target data by elastically transferring the knowledge learned in Step 1. We prove the convergence of elasticC3 to a local optimum. Our algorithm outperforms other commonly used clustering methods in single-cell genomics. Because the framework of elasticC3 is general, we plan to explore its application to other areas, including text mining.

Broader Impact

This work does not present any foreseeable societal consequences and ethical issues.

Acknowledgement

This work has been supported by the Chinese University of Hong Kong direct grant 2018/2019 No. 4053360, the Chinese University of Hong Kong startup grant No. 4930181, and Hong Kong Research Grant Council Grant ECS No. CUHK 24301419.

References

  • Angelidis, I., Simon, L. M., Fernandez, I. E., Strunz, M., and Mayr, C. H. (2019). An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics. Nat. Commun., 10(963).
  • Argelaguet, R., Arnol, D., Bredikhin, D., et al. (2020). MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol., 21(111).
  • Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Marioni, J. C., et al. (2018). Multi-omics factor analysis: a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol., 14.
  • Buenrostro, J. D., Wu, B., Litzenburger, U. M., Ruff, D., Gonzales, M. L., Snyder, M. P., Chang, H. Y., and Greenleaf, W. J. (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature, 523:486–490.
  • Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies and species. Nat. Biotechnol., 36:411–420.
  • Caruana, R. (1997). Multitask learning. Machine Learning, 28:41–75.
  • Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience.
  • Dai, W. Y., Yang, Q., Xue, G. R., and Yu, Y. (2008). Self-taught clustering. Proceedings of the 25th International Conference on Machine Learning.
  • David, L., Johannes, K., Ewa, S., et al. (2020). Eleven grand challenges in single-cell data science. Genome Biol., 21(31).
  • Dhillon, I. S., Mallela, S., and Modha, D. S. (2003). Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89–98.
  • Duren, Z., Chen, X., Zamanighomi, M., Zeng, W., Satpathy, A., Chang, H., Wang, Y., and Wong, W. H. (2018). Integrative analysis of single cell genomics data by coupled non-negative matrix factorizations. Proc. Natl. Acad. Sci., 115:7723–7728.
  • Fran, O., Gan, G. M., and Johan, L. M. B. (2019). PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database.
  • Gonzalez-Blas, C. B., et al. (2019). cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods, 16:397–400.
  • Kiselev, V. Y., Kirschner, K., Schaub, M. T., Andrews, T., Yiu, A., Chandra, T., Natarajan, K. N., Reik, W., Barahona, M., et al. (2017). SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods, 14(483).
  • Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods, 16(12):1289–1296.
  • Lin, Z. X., Zamanighomi, M., Daley, T., Ma, S., and Wong, W. H. (2019). Model-based approach to the joint analysis of single-cell data on chromatin accessibility and gene expression. Stat. Sci.
  • Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol., 15(550).
  • Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering.
  • Pollen, A. A., Nowakowski, T. J., Shuga, J., Wang, X., and Leyrat, A. A. (2014). Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol., 32:1053–1058.
  • Schep, N. A., Wu, B., Buenrostro, J. D., and Greenleaf, W. J. (2017). chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods, 14:975–978.
  • Stuart, T., Butler, A., Hoffman, P., et al. (2019). Comprehensive integration of single-cell data. Cell, 177(7):1888–1902.
  • Sun, Z., Wang, T., Deng, K., Wang, X. F., Lafyatis, R., Ding, Y., Hu, M., and Chen, W. (2017). DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics, 34:139–146.
  • Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Am. Stat. Assoc., 101:1566–1581.
  • Wang, B., Zhu, J., Pierson, E., Ramazzotti, D., and Batzoglou, S. (2017). Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods, 14:414–416.
  • Xiong, L., Xu, K., Tian, K., Shao, Y., Tang, L., Gao, G., Zhang, M., Jiang, T., and Zhang, Q. C. (2019). SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun., 10(4576).
  • Yang, Y., Huh, R., Culpepper, H. W., Lin, Y., Love, M. I., and Li, Y. (2018). SAFE-clustering: single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data. Bioinformatics.
  • Zamanighomi, M., Lin, Z., Daley, T., Chen, X., Duren, Z., Schep, A., Greenleaf, W. J., and Wong, W. H. (2018). Unsupervised clustering and epigenetic classification of single cells. Nat. Commun., 9(2410).
  • Zhang, H., Lee, C. A. A., Li, Z., et al. (2018). A multitask clustering approach for single-cell RNA-seq analysis in recessive dystrophic epidermolysis bullosa. PLoS Comput. Biol., 14(4).
  • Zhu, L., Lei, J., Klei, L., Devlin, B., and Roeder, K. (2019). Semisoft clustering of single-cell data. Proc. Natl. Acad. Sci. USA, 116:466–471.

Appendix A Details of Step 1 in elasticC3 algorithm

The optimization problem

$$(\tilde{C}_{X},\tilde{C}_{W})=\operatorname{argmin}_{(C_{X},C_{W})}\ell_{\textsc{A}}(C_{X},C_{W})$$

is non-convex and challenging to solve. We rewrite the objective function in the form of a KL divergence, since the reformulated objective is easier to optimize. Specifically,

$$\ell_{\textsc{A}}(C_{X},C_{W})=D_{\textsc{KL}}(p(X,W)||p^{\ast}(X,W)), \quad (11)$$

where $p^{\ast}(X,W)$ is expressed as

$$p^{\ast}(x,w)=p(x^{\ast},w^{\ast})\frac{p(x)}{p(x^{\ast})}\frac{p(w)}{p(w^{\ast})}. \quad (12)$$

The distributions $p(x^{\ast},w^{\ast})$, $p(x)$, $p(x^{\ast})$, $p(w)$ and $p(w^{\ast})$ are estimated as in Section 2. Further, we have

$$D_{\textsc{KL}}(p(X,W)||p^{\ast}(X,W))=\sum_{x^{\ast}\in\{x_{1}^{\ast},...,x_{N_{\textsc{A}}}^{\ast}\}}\sum_{x\in\{x:C_{X}(x)=x^{\ast}\}}p(x)D_{\textsc{KL}}(p(W|x)||p^{\ast}(W|x^{\ast}))=\sum_{w^{\ast}\in\{w_{1}^{\ast},...,w_{K}^{\ast}\}}\sum_{w\in\{w:C_{W}(w)=w^{\ast}\}}p(w)D_{\textsc{KL}}(p(X|w)||p^{\ast}(X|w^{\ast})), \quad (13)$$

where $p^{\ast}(w|x^{\ast})\triangleq\frac{p^{\ast}(x,w)}{p(x)}=\frac{p(x^{\ast},w^{\ast})}{p(x^{\ast})}\frac{p(w)}{p(w^{\ast})}$ and $p^{\ast}(x|w^{\ast})\triangleq\frac{p^{\ast}(x,w)}{p(w)}=\frac{p(x^{\ast},w^{\ast})}{p(w^{\ast})}\frac{p(x)}{p(x^{\ast})}$. The relations in Equations (11) and (13) were proven by Dhillon et al. (2003) and Dai et al. (2008).

To minimize $D_{\textsc{KL}}(p(X,W)||p^{\ast}(X,W))$, we can iteratively update $C_{X}$ and $C_{W}$ as follows.

  • Fix $C_{W}$ and iteratively update the cluster assignment $x^{\ast}$ for each cell $x$ in the auxiliary data, while fixing the cluster assignments of the other cells:

    $$C_{X}(x)=\operatorname{argmin}_{x^{\ast}\in\{x_{1}^{\ast},...,x_{N_{\textsc{A}}}^{\ast}\}}D_{\textsc{KL}}(p(W|x)||p^{\ast}(W|x^{\ast})). \quad (14)$$
  • Fix $C_{X}$ and iteratively update the cluster assignment $w^{\ast}$ for each feature $w$ in the auxiliary data, while fixing the cluster assignments of the other features:

    $$C_{W}(w)=\operatorname{argmin}_{w^{\ast}\in\{w_{1}^{\ast},...,w_{K}^{\ast}\}}D_{\textsc{KL}}(p(X|w)||p^{\ast}(X|w^{\ast})). \quad (15)$$

The above procedure monotonically decreases the objective function (2) (Dhillon et al., 2003) and converges to a local minimum.
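
The identity behind this reformulation can be checked numerically. The sketch below is our code (with an arbitrary toy matrix and arbitrary assignments), verifying that the loss in mutual information equals $D_{\textsc{KL}}(p(X,W)||p^{\ast}(X,W))$ as in Equations (11) and (12):

```python
# Numerical check: I(X,W) - I(X*,W*) == D_KL(p(X,W) || p*(X,W))
# for any co-clustering (C_X, C_W).
import numpy as np

A = np.array([[1, 0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [0, 1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [1, 1, 0, 0, 1, 1]], dtype=float)
p = A / A.sum()
CX = np.array([0, 0, 1, 1, 1])                 # arbitrary assignments
CW = np.array([0, 0, 1, 1, 2, 2])
ps = np.array([[p[np.ix_(CX == a, CW == b)].sum() for b in range(3)]
               for a in range(2)])             # p(X*, W*)
px, pw, pX, pW = p.sum(1), p.sum(0), ps.sum(1), ps.sum(0)
# p*(x, w) = p(x*, w*) * p(x)/p(x*) * p(w)/p(w*), Equation (12)
p_star = ps[np.ix_(CX, CW)] * (px[:, None] / pX[CX, None]) * (pw / pW[CW])

def mi(j):
    nz = j > 0
    return (j[nz] * np.log(j[nz] / np.outer(j.sum(1), j.sum(0))[nz])).sum()

nz = p > 0
assert np.isclose(mi(p) - mi(ps),
                  (p[nz] * np.log(p[nz] / p_star[nz])).sum())
```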

Appendix B Proof of Theorem 1

Proof. We rewrite $\ell_{\textsc{T}}(C_{Y}^{\ast(i)},C_{Z}^{\ast(i)}|\tilde{C}_{X},\tilde{C}_{W})$ as

$$\begin{split}\ell_{\textsc{T}}(C_{Y}^{\ast(i)},C_{Z}^{\ast(i)}|\tilde{C}_{X},\tilde{C}_{W})=&\;D_{\textsc{KL}}(q(Y,Z)||q^{\ast(i)}(Y,Z))+\alpha\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}D_{\textsc{KL}}(q(Y^{\ast(i)})||p(X^{\ast}))+\beta D_{\textsc{KL}}(q(Z^{\ast(i)})||p(W^{\ast}))\\=&\;\tfrac{1}{2}\big(D_{\textsc{KL}}(q(Y,Z)||q^{\ast(i)}(Y,Z))+2\alpha\mathds{1}_{\{N_{\textsc{A}}=N_{\textsc{T}}\}}D_{\textsc{KL}}(q(Y^{\ast(i)})||p(X^{\ast}))\big)\\&+\tfrac{1}{2}\big(D_{\textsc{KL}}(q(Y,Z)||q^{\ast(i)}(Y,Z))+2\beta D_{\textsc{KL}}(q(Z^{\ast(i)})||p(W^{\ast}))\big)\\=&\;\tfrac{1}{2}\sum_{y^{\ast}\in\{y_{1}^{\ast(i)},...,y_{N_{\textsc{T}}}^{\ast(i)}\}}\sum_{y\in\{y:C_{Y}(y)=y^{\ast}\}}q(y)Q_{2\alpha}(Y^{\ast(i)}|C_{Z}^{\ast(i-1)},\tilde{C}_{X})\\&+\tfrac{1}{2}\sum_{z^{\ast}\in\{z_{1}^{\ast(i)},...,z_{K}^{\ast(i)}\}}\sum_{z\in\{z:C_{Z}(z)=z^{\ast}\}}q(z)R_{2\beta}(Z^{\ast(i)}|C_{Y}^{\ast(i-1)},\tilde{C}_{W})\\\triangleq&\;\tfrac{1}{2}\ell_{\textsc{T}1}(C_{Y}^{\ast(i)}|C_{Z}^{\ast(i-1)},\tilde{C}_{X})+\tfrac{1}{2}\ell_{\textsc{T}2}(C_{Z}^{\ast(i)}|C_{Y}^{\ast(i-1)},\tilde{C}_{W})\\\geq&\;\tfrac{1}{2}\ell_{\textsc{T}1}(C_{Y}^{\ast(i+1)}|C_{Z}^{\ast(i)},\tilde{C}_{X})+\tfrac{1}{2}\ell_{\textsc{T}2}(C_{Z}^{\ast(i+1)}|C_{Y}^{\ast(i)},\tilde{C}_{W})\\=&\;\ell_{\textsc{T}}(C_{Y}^{\ast(i+1)},C_{Z}^{\ast(i+1)}|\tilde{C}_{X},\tilde{C}_{W}).\end{split}$$

Note that

$$\ell_{\textsc{T}1}(C_{Y}^{\ast(i)}|C_{Z}^{\ast(i-1)},\tilde{C}_{X})\geq\ell_{\textsc{T}1}(C_{Y}^{\ast(i+1)}|C_{Z}^{\ast(i)},\tilde{C}_{X})$$

and

$$\ell_{\textsc{T}2}(C_{Z}^{\ast(i)}|C_{Y}^{\ast(i-1)},\tilde{C}_{W})\geq\ell_{\textsc{T}2}(C_{Z}^{\ast(i+1)}|C_{Y}^{\ast(i)},\tilde{C}_{W})$$

follow directly from Equations (7) and (8).

Appendix C Steps of data generation in simulation study

Following the simulation scheme in Lin et al. (2019), we generate the data in the simulation study as follows:

  1. Generate $\textbf{w}^{acc}$ and $\textbf{w}^{exp}$:
     $$w^{acc}_{rj}=\begin{cases}w,&r=1,\ j=1,\dots,k(1-w)/2;\quad r=2,\ j=k(1-w)/2+1,\dots,k(1-w)\\1-w,&r=2,\ j=1,\dots,k(1-w)/2;\quad r=1,\ j=k(1-w)/2+1,\dots,k(1-w);\end{cases}$$
     $w^{acc}_{1j}=w^{acc}_{2j}\sim Beta(0.5,2)$, $j=k(1-w)+1,\dots,k$; $w^{exp}_{cj}\sim Beta(w^{acc}_{cj},10)$, $j=1,\dots,k(1-w)$; $w^{exp}_{1j}=w^{exp}_{2j}\sim Beta(w^{acc}_{1j},10)$, $j=k(1-w)+1,\dots,k$.

  2. Generate $z^{acc}$ and $z^{exp}$. The cluster labels are generated with equal probability 0.5.

  3. Generate $u^{acc}$ and $\tilde{u}^{acc}$: $\tilde{u}^{acc}_{ij}\sim Bernoulli(0.5)$ if $u^{acc}_{ij}=1$, where $u^{acc}_{ij}\sim Bernoulli(w^{acc}_{cj})$ if $z^{acc}_{ic}=1$, $i=1,\dots,m$; $\tilde{u}^{acc}_{ij}=0$ otherwise.

  4. Generate $u^{exp}$ and $\tilde{v}^{exp}$: $\tilde{v}^{exp}_{lj}\sim Bernoulli(0.8)$ if $u^{exp}_{lj}=1$, where $u^{exp}_{lj}\sim Bernoulli(w^{exp}_{cj})$ if $z^{exp}_{lc}=1$; $\tilde{v}^{exp}_{lj}\sim Bernoulli(0.1)$ otherwise; $l=1,\dots,n$.

  5. Generate $C$ and $G$: $C_{ij}\sim N(0,0.6^{2})$ if $\tilde{u}^{acc}_{ij}=0$ and $C_{ij}\sim N(2,0.6^{2})$ if $\tilde{u}^{acc}_{ij}=1$; $G_{lj}\sim N(0,\sigma^{2})$ if $\tilde{v}^{exp}_{lj}=0$ and $G_{lj}\sim N(2,\sigma^{2})$ if $\tilde{v}^{exp}_{lj}=1$.

  6. Generate target data T and auxiliary data A: $\textsc{T}_{lj}=1$ if $G_{lj}>0$ and $\textsc{T}_{lj}=0$ otherwise; $\textsc{A}_{ij}=1$ if $C_{ij}>0$ and $\textsc{A}_{ij}=0$ otherwise.

More details on the notation and the simulation scheme are presented in Lin et al. (2019). The only difference between our data-generation steps and those in Lin et al. (2019) is the addition of Step 6, which generates binary data: $\textsc{T}_{lj}=1$ means gene $j$ is expressed in cell $l$, and $\textsc{T}_{lj}=0$ otherwise; $\textsc{A}_{ij}=1$ means the promoter region of feature $j$ is accessible in cell $i$, and $\textsc{A}_{ij}=0$ otherwise.
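
A condensed sketch of the six generation steps follows (our code; the specific values of $w$, $\sigma$ and the random seed are assumed for illustration, and the exact design is in Lin et al. (2019)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, w, sigma = 100, 100, 0.5, 0.6        # assumed parameter values
m = int(k * (1 - w))                       # cluster-specific features

# Step 1: cluster-level activity probabilities (2 clusters x k features)
w_acc = np.empty((2, k))
w_acc[0, :m // 2], w_acc[1, :m // 2] = w, 1 - w
w_acc[0, m // 2:m], w_acc[1, m // 2:m] = 1 - w, w
w_acc[0, m:] = w_acc[1, m:] = rng.beta(0.5, 2, size=k - m)
w_exp = rng.beta(w_acc, 10)                # coupled expression probabilities
w_exp[1, m:] = w_exp[0, m:]

# Step 2: cell cluster labels with equal probability
z_acc, z_exp = rng.integers(2, size=n), rng.integers(2, size=n)

# Steps 3-4: Bernoulli activity states with noise
u_acc = rng.random((n, k)) < w_acc[z_acc]
u_tilde = u_acc & (rng.random((n, k)) < 0.5)
u_exp = rng.random((n, k)) < w_exp[z_exp]
v_tilde = np.where(u_exp, rng.random((n, k)) < 0.8,
                   rng.random((n, k)) < 0.1)

# Steps 5-6: Gaussian observations, then binarize into A and T
A = (rng.normal(2 * u_tilde, 0.6) > 0).astype(int)
T = (rng.normal(2 * v_tilde, sigma) > 0).astype(int)
```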

Appendix D Details of feature selection for real data

In the experiments, we perform variable selection on the real single-cell genomic datasets before clustering, both to speed up computation and to balance the number of variables across data types. For real data 1, we apply clustering to the individual datasets (Zamanighomi et al., 2018) and then select cluster-specific features (Love et al., 2014; Zamanighomi et al., 2018). Using the R toolkit Seurat (Butler et al., 2018; Stuart et al., 2019), we select the 100 most variable cluster-specific genes in each of the scATAC-seq and scRNA-seq data. For real data 2, we also use Seurat to select the 100 most variable homologs from each of the mouse and human scRNA-seq data.