This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Compositional Graphical Lasso Resolves the Impact of Parasitic Infection on Gut Microbial Interaction Networks in a Zebrafish Model

Chuan Tiana{}^{\text{a}}, Duo Jianga{}^{\text{a}}, Austin Hammerb{}^{\text{b}},
Thomas Sharptona,b{}^{\text{a,b}}, and Yuan Jianga{}^{\text{a}}
a{}^{\text{a}}Department of Statistics, Oregon State University, Corvallis, OR
b{}^{\text{b}}Department of Microbiology, Oregon State University, Corvallis, OR
Yuan Jiang is the corresponding author (E-mail: [email protected]). This research is supported in part by the grant R01 GM126549 from the National Institutes of Health (NIH).
Abstract

Understanding how microbes interact with each other is key to revealing the underlying role that microorganisms play in the host or environment and to identifying microorganisms as an agent that can potentially alter the host or environment. For example, understanding how the microbial interactions associate with parasitic infection can help resolve potential drug or diagnostic test for parasitic infection. To unravel the microbial interactions, existing tools often rely on graphical models to infer the conditional dependence of microbial abundances to represent their interactions. However, current methods do not simultaneously account for the discreteness, compositionality, and heterogeneity inherent to microbiome data. Thus, we build a new approach called “compositional graphical lasso” upon existing tools by incorporating the above characteristics into the graphical model explicitly. We illustrate the advantage of compositional graphical lasso over current methods under a variety of simulation scenarios and on a benchmark study, the Tara Oceans Project. Moreover, we present our results from the analysis of a dataset from the Zebrafish Parasite Infection Study, which aims to gain insight into how the gut microbiome and parasite burden covary during infection, thus uncovering novel putative methods of disrupting parasite success. Our approach identifies changes in interaction degree between infected and uninfected individuals for three taxa, Photobacterium, Gemmobacter, and Paucibacter, which are inversely predicted by other methods. Further investigation of these method-specific taxa interaction changes reveals their biological plausibility. In particular, we speculate on the potential pathobiotic roles of Photobacterium and Gemmobacter in the zebrafish gut, and the potential probiotic role of Paucibacter. Collectively, our analyses demonstrate that compositional graphical lasso provides a powerful means of accurately resolving interactions between microbiota and can thus drive novel biological discovery.


Keywords: Additive log-ratio transformation; Heteroscedasticity; Intestinal pathobiont; Logistic normal multinomial distribution; Parasite infection

1 Introduction

1.1 Background

Microorganisms are ubiquitous in nature and manage critical ecosystem services, ranging from global nutrient cycling to human health. Microbes do not exist in a vacuum in nature, but instead are members of diverse communities known as microbiomes, wherein microbes may interact with other microbes of different species. These microbial interactions may be favorable or antagonistic and are crucial for the successful establishment and maintenance of a microbial community, and frequently result in important pathogenic or beneficial effect to the host or environment (Braga et al., 2016). A thorough understanding of how microbe interact with one another is critical to uncovering the underlying role microorganisms play in the host or environment. However, it is a highly challenging biological task as it is estimated that only 1% of bacteria are cultivatable. The inability to culture the majority of microbial species has motivated the use of culture-independent methods for microbiome studies in different environments (Faust and Raes, 2012; Tang, 2019).

Fortunately, recent innovations in in situ DNA sequencing provide opportunities to infer how microbes interact with one another or their environments. Modern studies on microbial interactions frequently rely on DNA sequencing techniques through the bioinformatic analysis of taxonomically diagnostic genetic markers (e.g., 16S rRNA) sequenced directly from a sample. The counts of the taxonomically diagnostic genetic markers can be used to represent the abundance of microbial species, e.g., Operational Taxonomic Units (OTUs) or phylotypes (e.g., genera), in a sample. Here, the frequency with which a taxon’s marker is observed in a sequence library represents its relative abundance in the community. When such abundance data are available from many communities, interactions among microbiota can be inferred through statistical correlation analysis (Faust and Raes, 2012). For example, if the relative abundances of two microbial taxa are statistically correlated, then it is inferred that they interact on some level. This approach has been used to document interaction networks in the healthy human microbiome (Faust and Raes, 2012), as well as free-living microbial communities (Freilich et al., 2010), and has been useful for generating hypotheses of host-microbiome interaction (Morgan et al., 2015).

1.2 Methodological Innovations

Despite the great potential of microbiome interaction networks as a tool to advance microbiome research, the power of this approach has been limited by the availability of effective statistical methods and computationally efficient estimation techniques. Microbial abundance data possess a few important features that pose tremendous challenges to standard statistical tools. First, the data are represented as compositional counts of the 16S rRNA sequences because the total count of sequences per sample is predetermined by how deeply the sequencing is conducted, a concept named sequencing depth. The counts only carry information about the relative abundances of the taxa instead of their absolute abundances. Second, the sequencing depth is always finite and often varies considerably across samples in a microbiome dataset. Thus, the observed relative abundance of a taxon in a sample is only an estimator of its true relative abundance with the variance depending on sample-specific sequencing depth, causing the “heteroscedasticity” issue (McMurdie and Holmes, 2014). Third, the data are high-dimensional in nature. It is likely that the number of taxa is far more than the number of samples in any biological experiment.

When such abundance data are available, one common strategy to resolve the interactions among microbial taxa is to use correlation-type analyses. For example, after the sample correlations are calculated between the relative abundances of each pair of microbial taxa, a threshold is then applied such that an interaction is deemed present if the sample correlation exceeds the threshold. More recently developed methods have started to account for the compositional feature and aim to construct sparse networks for the absolute abundances instead of relative abundances, including SparCC (Friedman and Alm, 2012), CCLasso (Fang et al., 2015), and REBACCA (Ban et al., 2015), among others.

The above-mentioned network methods based on marginal correlations could lead to spurious correlations that are caused by confounding factors such as other species in the same community. To eliminate the detection of spurious correlations, interactions among taxa can be modeled through their conditional dependencies given the other taxa. The Gaussian graphical model is a classical approach to modeling the conditional dependency, in which the conditional dependency is determined by the nonzero entries of the inverse covariance matrix of a multivariate normal distribution used to model the data. Graphical lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008) and neighborhood selection (Meinshausen et al., 2006) are two commonly used methods to estimate sparse inverse covariance matrix under the high-dimensional Gaussian graphical model. However, when applied to microbiome abundance data, its multivariate normality assumption was violated by either the count or the compositional features of the data.

Several methods have been proposed to infer microbial conditional dependence networks based on Gaussian graphical models, such as SPIEC-EASI (Kurtz et al., 2015), gCoda (Fang et al., 2017), CD-trace (Yuan et al., 2019), and SPRING (Yoon et al., 2019). In order to transform the discrete counts to continuous variables and to remove the compositionality constraint, all these methods take the centered log-ratio transformation (Aitchison, 1986) on the observed counts as their first step. However, the centered log-ratio transformation suffers from an undefined inverse covariance matrix of the transformed data. To partially address this issue, these methods impose a sparsity assumption on the inverse covariance matrix and adding an L1L_{1}-norm penalty to their objective functions. Additionally, these methods ignore the heteroscedasticity issue as they simply treat the observed relative abundances as the truth. Ignoring the heteroscedasticity issue could impact downstream analysis including constructing microbial interaction networks (McMurdie and Holmes, 2014).

In this article, we provide a new statistical tool to help unleash the full potential of microbiome interaction networks as a research tool in the microbiome field. We adopt the logistic normal multinomial distribution to model the compositional count data (Aitchison, 1986; Billheimer et al., 2001; Xia et al., 2013). Compared to previous methods, this model accounts for the heteroscedasticity issue as the sequencing depth is treated as the number of trials in the multinomial distribution. Additionally, the additive log-ratio transformation applied to the multinomial probabilities results in a well-defined inverse covariance matrix in contrast to the centered log-ratio transformation. Based on this model, we develop an efficient algorithm that iterates between Newton-Raphson and graphical lasso for estimating a sparse inverse covariance matrix. We call this new approach “compositional graphical lasso”. We establish the theoretical convergence of the algorithm and illustrate the advantage of compositional graphical lasso in comparison to current methods under a variety of simulation scenarios. We further apply the developed method to the data from the Zebrafish Parasite Infection study (Gaulke et al., 2019) (see Section 1.3) to investigate how microbial interactions associate with parasite infection.

1.3 Zebrafish Parasite Infection Study

Helminth parasites represent a significant threat to the health of human and animal populations, and there is a growing need for tools to treat, diagnose, and prevent these infections. A growing body of evidence points to the gut microbiome as an agent that interacts with parasites to influence their success in the gut. To clarify how the gut microbiome varies in accordance with parasitic infection dynamics, the Zebrafish Parasite Infection Study was a recent effort (Gaulke et al., 2019) conducted at Oregon State University, which assessed the association of an intestinal helminth of zebrafish, Pseudocapillaria tomentosa, and the gut microbiome of 210 4-month-old 5D line zebrafish. Among these fish, 105 were exposed to P. tomentosa and the remaining 105 were unexposed controls. At each of the seven time points after exposure, a randomly selected group of 30 fish (15 exposed and 15 unexposed) were euthanized and fecal samples were collected. The parasite burden and tissue damage in P. tomentosa-infected fish were also monitored over 12 weeks of infection.

Previous analyses (Gaulke et al., 2019) of the Zebrafish Parasite Infection Study data have revealed that parasite exposure, burden, and intestinal lesions were correlated with gut microbial diversity. They also identified individual taxa whose abundance associated with parasite burden, suggesting that gut microbiota may influence P. tomentosa success. Numerous associations between taxon abundance, burden, and gut pathologic changes were also observed, indicating that the magnitude of microbiome disruption during infection varies with infection severity. However, it remains unclear how parasite success may disrupt or be modulated by the microbial interactions in the gut. Understanding how the microbial interactions associate with parasitic infection can help resolve potential drug or diagnostic test for parasitic infection.

1.4 Benchmarking and Novel Biological Discoveries

To evaluate the performance of our method and to benchmark it against previously available tools, we take advantage of a unique data resource provided by the Tara Oceans Project on ocean planktons. This data set is particularly suitable for method comparison because it enjoys an experimentally validated sub-network of plankton interactome that has served as a gold standard for method benchmarking. Compared to other methods, compositional graphical lasso performs better in reconstructing the microbial interactions that are validated by the literature. In addition, it performs the best in picking up the keystone taxa, which are those with an excessive number of interactions with other taxa in the literature, such as Amoebophyra (Chambouvet et al., 2008), Blastodinium (Skovgaard et al., 2012), Phaeocystis (Verity et al., 2007), and Syndinium (Skovgaard et al., 2005). All these genera are well-described keystone organisms in marine ecosystems. Finally, compositional graphical lasso affords an opportunity to resolve novel modulators of community composition. For example, as one of a few described genera within the syndinean dinoflagellates—an enigmatic lineage with abundant diversity in marine environmental clone libraries, Euduboscquella (Bachvaroff et al., 2012) ranked high in the degree distribution uniquely by compositional graphical lasso.

To investigate the role of the gut microbiome plays in the parasite infections, we apply compositional graphical lasso to the Zebrafish Parasite Infection Study data. Interestingly, compositional graphical lasso identifies changes in interaction degree between infected and uninfected individuals for three taxa, Photobacterium, Gemmobacter, and Paucibacter, which are inversely predicted by other methods. Further investigation of these method-specific taxa interaction changes reveals their biological plausibility, and provides insight into their relevance in the context of parasite-linked changes in the zebrafish gut microbiome. In particular, based on our observations, we speculate on the potential pathobiotic roles of Photobacterium and Gemmobacter in the zebrafish gut, and the potential probiotic role of Paucibacter. Future studies should seek to experimentally validate the ecological roles of Photobacterium, Gemmobacter, and Paucibacter in the zebrafish gut, including their impacts on the rest of the microbial community and their roles in infection induced tissue damage.

2 Compositional Graphical Lasso

2.1 Logistic Normal Multinomial Model

Consider a microbiome abundance dataset with nn independent samples, each of which is composed of the observed counts of K+1K+1 taxa, denoted by 𝐱i=(xi,1,,xi,K+1)\mathbf{x}_{i}=(x_{i,1},\ldots,x_{i,K+1})^{\prime} for the ii-th sample, i=1,,ni=1,\ldots,n. Due to the compositional property of the data, the total count of all taxa for each sample ii is a fixed number, denoted by MiM_{i}. Naturally, a multinomial distribution is imposed on the observed counts:

𝐱i|𝐩iMultinomial(Mi;pi,1,,pi,K+1),\mathbf{x}_{i}|\mathbf{p}_{i}\sim\text{Multinomial}(M_{i};p_{i,1},\ldots,p_{i,K+1}), (1)

where 𝐩i=(pi,1,,pi,K+1)\mathbf{p}_{i}=(p_{i,1},\ldots,p_{i,K+1})^{\prime} are the multinomial probabilities for all taxa and k=1K+1pi,k=1\sum_{k=1}^{K+1}p_{i,k}=1.

To apply the additive log-ratio transformation (Aitchison, 1986) on the multinomial probabilities, we choose one taxon, without loss of generality the (K+1)(K+1)-th taxon, as a reference to which all the other tax are compared. The transformed multinomial probabilities are given by

zi,k=log(pi,kpi,K+1),i=1,,n,k=1,,K.z_{i,k}=\log(\frac{p_{i,k}}{p_{i,K+1}}),\ i=1,\ldots,n,\ k=1,\ldots,K. (2)

Let 𝐳i=(zi,1,,zi,K)\mathbf{z}_{i}=(z_{i,1},\ldots,z_{i,K})^{\prime} for i=1,,ni=1,\ldots,n, and further assume that they follow an i.i.d. multivariate normal distribution

𝐳1,,𝐳ni.i.d.N(𝝁,𝚺),\mathbf{z}_{1},\ldots,\mathbf{z}_{n}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}N(\boldsymbol{\mu},\boldsymbol{\Sigma}), (3)

where 𝝁\boldsymbol{\mu} is the mean and 𝚺\boldsymbol{\Sigma} is the covariance matrix. Let 𝛀=𝚺1\boldsymbol{\Omega}=\boldsymbol{\Sigma}^{-1} be the inverse covariance matrix or the precision matrix.

The above model given in (1)–(3) is often referred to as the logistic normal multinomial model. In this model, a multinomial distribution is imposed on the compositional counts, which is the distribution of the observed data given the multinomial probabilities. In addition, to capture the variation of the multinomial probabilities across samples, we impose a logistic normal distribution on the multinomial probabilities as a prior distribution. We thereby obtain as our final model the logistic normal multinomial model, which is a hierarchical model with two levels.

The logistic normal multinomial model has a long history in modeling compositional count data and it has also been applied to analyze microbiome abundance data. For example, Xia et al. (2013) proposed a penalized regression under this model to identify a subset of covariates that are associated with the taxon composition. Our objective is different from Xia et al. (2013) as we aim to reveal the microbial interaction network by finding a sparse estimator of the inverse covariance matrix 𝛀\boldsymbol{\Omega} in (3). It is also noteworthy that Jiang et al. (2020) has the same objective as ours. However, Jiang et al. (2020) did not make full use of the logistic normal multinomial model as it focused on correcting the bias of a naive estimator of the 𝚺\boldsymbol{\Sigma} that does not require the logistic normal part of the model. By contrast, we aim to find an estimator of 𝛀\boldsymbol{\Omega} directly based on the logistic normal multinomial model.

2.2 Objective Function

From the logistic normal multinomial model in (1)–(3), we aim to derive an objective function of 𝛀\boldsymbol{\Omega} to estimate the microbial interaction network. To this end, we take a two-step procedure similar to SPIEC-EASI (Kurtz et al., 2015). In the first step, we find estimated values of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} based on the logistic normal multinomial model given μ\mu and 𝚺\boldsymbol{\Sigma}; in the second step, we find an estimate of 𝛀\boldsymbol{\Omega} based on the estimated values of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}.

In the first step, we consider the posterior distribution of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} given the data 𝐱1,,𝐱n\mathbf{x}_{1},\ldots,\mathbf{x}_{n} and find the maximum a posteriori (MAP) estimates of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}. For i=1,,ni=1,\ldots,n, the logarithm of the posterior density function of 𝐳i\mathbf{z}_{i} given 𝐱i\mathbf{x}_{i} is

log[f𝝁,𝛀(𝐳i|𝐱i)]log[f𝝁,𝛀(𝐱i,𝐳i)]\displaystyle\log[f_{\boldsymbol{\mu},\boldsymbol{\Omega}}(\mathbf{z}_{i}|\mathbf{x}_{i})]\propto\log[f_{\boldsymbol{\mu},\boldsymbol{\Omega}}(\mathbf{x}_{i},\mathbf{z}_{i})]
\displaystyle\propto{} k=1K+1xi,klogpi,k+12log[det(𝛀)]12(𝐳i𝝁)𝛀(𝐳i𝝁)\displaystyle\sum_{k=1}^{K+1}x_{i,k}\log p_{i,k}+\frac{1}{2}\log[\det(\boldsymbol{\Omega})]-\frac{1}{2}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu})
=\displaystyle={} k=1Kxi,kzi,kMilog(k=1Kezi,k+1)+12log[det(𝛀)]12(𝐳i𝝁)𝛀(𝐳i𝝁),\displaystyle\sum_{k=1}^{K}x_{i,k}z_{i,k}-M_{i}\log(\sum_{k=1}^{K}e^{z_{i,k}}+1)+\frac{1}{2}\log[\det(\boldsymbol{\Omega})]-\frac{1}{2}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu}),

where \propto denotes that two quantities are equal up to a term not depending on 𝐳i\mathbf{z}_{i}. In the above derivation, we ignored the marginal density function of 𝐱i\mathbf{x}_{i} that does not depend on 𝐳i\mathbf{z}_{i}, which does not affect the estimation of 𝐳i\mathbf{z}_{i}.

By independence between all the samples, the logarithm of the posterior density function of (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) given the data (𝐱1,,𝐱n)(\mathbf{x}_{1},\ldots,\mathbf{x}_{n}) can be written as (again, ignoring a term independent of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}):

i=1n[k=1Kxi,kzi,kMilog(k=1Kezi,k+1)]+n2log[det(𝛀)]12i=1n(𝐳i𝝁)𝛀(𝐳i𝝁).\sum_{i=1}^{n}\left[\sum_{k=1}^{K}x_{i,k}z_{i,k}-M_{i}\log(\sum_{k=1}^{K}e^{z_{i,k}}+1)\right]+\frac{n}{2}\log[\det(\boldsymbol{\Omega})]-\frac{1}{2}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu}). (4)

Given the values of the multivariate normal parameters 𝝁\boldsymbol{\mu} and 𝛀\boldsymbol{\Omega}, one can maximize (4) with respect to (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}). This leads to the MAP estimator (𝐳^1,,𝐳^n)(\hat{\mathbf{z}}_{1},\ldots,\hat{\mathbf{z}}_{n}).

In the second step, we find a sparse inverse covariance estimator of 𝛀\boldsymbol{\Omega} based on the estimate values (𝐳^1,,𝐳^n)(\hat{\mathbf{z}}_{1},\ldots,\hat{\mathbf{z}}_{n}). Hereby, we use the graphical lasso estimator, which minimizes the L1L_{1} penalized negative log-likelihood function as follows:

12log[det(𝛀)]+12ni=1n(𝐳^i𝝁)𝛀(𝐳^i𝝁)+λ𝛀1.-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2n}\sum_{i=1}^{n}(\hat{\mathbf{z}}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\hat{\mathbf{z}}_{i}-\boldsymbol{\mu})+\lambda\|\boldsymbol{\Omega}\|_{1}. (5)

It turns out that the above two-step procedure is equivalent to minimizing an overall objective function with respect to both 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} and (𝝁,𝛀)(\boldsymbol{\mu},\boldsymbol{\Omega}):

(𝐳1,,𝐳n,𝝁,𝛀)=\displaystyle\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})={} 1ni=1n[k=1Kxi,kzi,kMilog(k=1Kezi,k+1)]\displaystyle-\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{k=1}^{K}x_{i,k}z_{i,k}-M_{i}\log(\sum_{k=1}^{K}e^{z_{i,k}}+1)\right]
12log[det(𝛀)]+12ni=1n(𝐳i𝝁)𝛀(𝐳i𝝁)+λ𝛀1.\displaystyle-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu})+\lambda\|\boldsymbol{\Omega}\|_{1}. (6)

In other words, 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}, 𝝁\boldsymbol{\mu}, and 𝛀\boldsymbol{\Omega} are all treated as unknown parameters in the minimization of the objective function (6).

It is noteworthy that similar ideas to the above two-step procedure have been used in existing microbial interaction network estimation methods. For example, SPIEC-EASI is also a two-step procedure. In its first step, the abundance counts are converted into their centered log-ratio transformed data; in its second step, either graphical lasso or neighborhood selection is applied to estimate a sparse inverse covariance matrix based on the centered log-ratio transformed data in the first step.

Although both our method and SPIEC-EASI can be regarded as two-step procedures, we underline two important distinctions between them. First, our method is based on the additive log-ratio transformation and SPIEC-EASI uses the centered log-ratio transformation. As mentioned in the Introduction, the centered log-ratio transformation suffers from an undefined inverse covariance matrix of the transformed data while the additive log-ratio transformation does not. Second, in the first step, our method accounts for the sequencing depths [MiM_{i}’s in (1)] and further the uncertainty of the observed relative abundances, and thus addresses the heteroscedasticity issue. However, the heteroscedasticity issue is ignored in SPIEC-EASI.

2.3 Computational Algorithm

The objective function (6) naturally includes three sets of parameters (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), 𝝁\boldsymbol{\mu}, and 𝛀\boldsymbol{\Omega}, which motivates us to apply a block coordinate descent algorithm. A block coordinate descent algorithm minimizes the objective function iteratively for each set of parameters given current values of the other sets. Given the initial values (𝐳1(0),,𝐳n(0))(\mathbf{z}_{1}^{(0)},\ldots,\mathbf{z}_{n}^{(0)}), 𝝁(0)\boldsymbol{\mu}^{(0)}, and 𝛀(0)\boldsymbol{\Omega}^{(0)}, a block coordinate algorithm repeats the following steps cyclically for iteration t=0,1,2,t=0,1,2,\ldots until the algorithm converges.

  1. 1.

    Given 𝝁(t)\boldsymbol{\mu}^{(t)} and 𝛀(t)\boldsymbol{\Omega}^{(t)}, find (𝐳1(t+1),,𝐳n(t+1))(\mathbf{z}_{1}^{(t+1)},\ldots,\mathbf{z}_{n}^{(t+1)}) that maximizes (6).

  2. 2.

    Given (𝐳1(t+1),,𝐳n(t+1))(\mathbf{z}_{1}^{(t+1)},\ldots,\mathbf{z}_{n}^{(t+1)}) and 𝛀(t)\boldsymbol{\Omega}^{(t)}, find 𝝁(t+1)\boldsymbol{\mu}^{(t+1)} that maximizes (6).

  3. 3.

    Given (𝐳1(t+1),,𝐳n(t+1))(\mathbf{z}_{1}^{(t+1)},\ldots,\mathbf{z}_{n}^{(t+1)}) and 𝝁(t+1)\boldsymbol{\mu}^{(t+1)}, find 𝛀(t+1)\boldsymbol{\Omega}^{(t+1)} that maximizes (6).

Below, we will present the details of this algorithm in each iteration. For the initial values (𝐳1(0),,𝐳n(0))(\mathbf{z}_{1}^{(0)},\ldots,\mathbf{z}_{n}^{(0)}), we use their maximum likelihood estimators from the multinomial distribution, i.e.,

zi,k(0)=log(xi,kxi,K+1),i=1,,n,k=1,,K.{z}_{i,k}^{(0)}=\log(\frac{x_{i,k}}{x_{i,K+1}}),\ i=1,\ldots,n,\ k=1,\ldots,K.

If xi,k=0x_{i,k}=0 for some ii and kk, we add a small constant to it to evaluate the log ratio. For the initial value 𝝁(0)\boldsymbol{\mu}^{(0)}, we have a closed form minimizer of 𝝁\boldsymbol{\mu} for (6) given the values of (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), which is 𝝁=𝐳¯=1ni=1n𝐳i\boldsymbol{\mu}=\bar{\mathbf{z}}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}. Therefore, we set the initial value as 𝝁(0)=1ni=1n𝐳i(0)\boldsymbol{\mu}^{(0)}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{(0)}. Finally, for the initial value 𝛀(0)\boldsymbol{\Omega}^{(0)}, we use the estimate of the graphical lasso algorithm taking the sample covariance matrix computed from 𝐳1(0),,𝐳n(0)\mathbf{z}_{1}^{(0)},\ldots,\mathbf{z}_{n}^{(0)} as input.

In step 1, given 𝝁(t)\boldsymbol{\mu}^{(t)} and 𝛀(t)\boldsymbol{\Omega}^{(t)}, minimizing the objective function (6) with respect to (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) is equivalent to minimizing the following objective function with respect to each 𝐳i\mathbf{z}_{i} separately, for i=1,,ni=1,\ldots,n:

i(t)(𝐳i)=12(𝐳i𝝁(t))𝛀(t)(𝐳i𝝁(t))[k=1Kxi,kzi,kMilog(k=1Kezi,k+1)].\ell_{i}^{(t)}(\mathbf{z}_{i})=\frac{1}{2}(\mathbf{z}_{i}-\boldsymbol{\mu}^{(t)})^{\prime}\boldsymbol{\Omega}^{(t)}(\mathbf{z}_{i}-\boldsymbol{\mu}^{(t)})-\left[\sum_{k=1}^{K}x_{i,k}z_{i,k}-M_{i}\log(\sum_{k=1}^{K}e^{z_{i,k}}+1)\right]. (7)

The above objective function is a smooth and convex function in 𝐳i\mathbf{z}_{i} with the following positive definite Hessian matrix

𝛀(t)+Mi(k=1Kezi,k+1)2{(k=1Kezi,k+1)diag(e𝐳i)(e𝐳i)(e𝐳i)},\boldsymbol{\Omega}^{(t)}+\frac{M_{i}}{\left(\sum_{k=1}^{K}e^{z_{i,k}}+1\right)^{2}}\left\{\left(\sum_{k=1}^{K}e^{z_{i,k}}+1\right)\text{diag}(e^{\mathbf{z}_{i}})-(e^{\mathbf{z}_{i}})(e^{\mathbf{z}_{i}})^{\prime}\right\},

where e𝐳i=(ezi,1,,ezi,K)e^{\mathbf{z}_{i}}=(e^{z_{i,1}},\ldots,e^{z_{i,K}})^{\prime} and diag(e𝐳i)\text{diag}(e^{\mathbf{z}_{i}}) is the diagonal matrix with the diagonal elements ezi,1,,ezi,Ke^{z_{i,1}},\ldots,e^{z_{i,K}}. Therefore, we apply the Newton-Raphson algorithm to find the minimizer numerically. In addition, we implement a line search procedure in each Newton-Raphson iteration following the Armijo rule (Armijo, 1966). This procedure ensures sufficient decrease in the objective function at each iteration to prevent possible divergence of the algorithm.

Step 2 is similar to the initialization step, in which 𝝁\boldsymbol{\mu} has a closed-form solution and is updated as 𝐳¯(t+1)=1ni=1n𝐳i(t+1)\bar{\mathbf{z}}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{z}_{i}^{(t+1)} from the current numerical values of (𝐳1(t+1),,𝐳n(t+1))(\mathbf{z}_{1}^{(t+1)},\ldots,\mathbf{z}_{n}^{(t+1)}) that are computed from the Newton-Raphson algorithm in step 1.

In step 3, given (𝐳1(t+1),,𝐳n(t+1))(\mathbf{z}_{1}^{(t+1)},\ldots,\mathbf{z}_{n}^{(t+1)}) and 𝝁(t+1)=𝐳¯(t+1)\boldsymbol{\mu}^{(t+1)}=\bar{\mathbf{z}}^{(t+1)}, the objective function for 𝛀\boldsymbol{\Omega} can be simplified as

(t)(𝛀)\displaystyle\ell^{(t)}(\boldsymbol{\Omega}) =12log[det(𝛀)]+12ni=1n(𝐳i(t+1)𝝁(t+1))𝛀(𝐳i(t+1)𝝁(t+1))+λ𝛀1,\displaystyle=-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}^{(t+1)}-\boldsymbol{\mu}^{(t+1)})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}^{(t+1)}-\boldsymbol{\mu}^{(t+1)})+\lambda\|\boldsymbol{\Omega}\|_{1},
=12log[det(𝛀)]+12tr(𝐒(t+1)𝛀)+λ𝛀1,\displaystyle=-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2}\mathrm{tr}(\mathbf{S}^{(t+1)}\boldsymbol{\Omega})+\lambda\|\boldsymbol{\Omega}\|_{1}, (8)

where 𝐒(t+1)=1ni=1n(𝐳i(t+1)𝐳¯(t+1))(𝐳i(t+1)𝐳¯(t+1))\mathbf{S}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}(\mathbf{z}_{i}^{(t+1)}-\bar{\mathbf{z}}^{(t+1)})(\mathbf{z}_{i}^{(t+1)}-\bar{\mathbf{z}}^{(t+1)})^{\prime}. It is obvious that minimizing the objective function (8) becomes a graphical lasso problem (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008). It is well known that the graphical lasso objective function is a convex function in 𝛀\boldsymbol{\Omega} (Banerjee et al., 2008) and efficient algorithms have been developed for its optimization (Friedman et al., 2008). In this paper, we implement this step using the graphical lasso algorithm included in the huge (Zhao et al., 2012) package in R.

The above block coordinate descent algorithm iterates between Newton-Raphson and graphical lasso and is designed specifically to optimize the objective function (6) for compositional count data. Therefore, we name this algorithm the compositional graphical lasso algorithm, and the entire approach the compositional graphical lasso method including both the model and the algorithm for the analysis of microbiome abundance data.

2.4 Theoretical Convergence

Unfortunately, the objective function (6) is not necessarily a convex function jointly in (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), 𝝁\boldsymbol{\mu}, and 𝛀\boldsymbol{\Omega}. However, we have shown that it is convex in each of the three subsets of its parameters. The convergence property of such an optimization problem has been studied in the literature. For example, Tseng (2001) studied the convergence property of a block coordinate descent method applied to minimize a nonconvex function with certain separability and regularity properties. We will establish the convergence property of the compositional graphical lasso algorithm following Tseng (2001).

Recall that our algorithm treats the three sets of parameters (𝐳1,,𝐳n)(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), 𝝁\boldsymbol{\mu}, and 𝛀\boldsymbol{\Omega} as three blocks and optimizes for each block iteratively. In addition, as in Tseng (2001), the objective function (6) can be regarded as the sum of two parts, the first of which is an inseparable but differentiable function given by

0(𝐳1,,𝐳n,𝝁,𝛀)=12ni=1n(𝐳i𝝁)𝛀(𝐳i𝝁),\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})=\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu}), (9)

and the second of which is a sum of separable and differentiable functions given by 1(𝐳1,,𝐳n)+2(𝛀)\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n})+\ell_{2}(\boldsymbol{\Omega}), where

1(𝐳1,,𝐳n)\displaystyle\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) =1ni=1n[k=1Kxi,kzi,kMilog(k=1Kezi,k+1)],\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\left[\sum_{k=1}^{K}x_{i,k}z_{i,k}-M_{i}\log(\sum_{k=1}^{K}e^{z_{i,k}}+1)\right], (10)
2(𝛀)\displaystyle\ell_{2}(\boldsymbol{\Omega}) =12log[det(𝛀)]+λ𝛀1.\displaystyle=-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\lambda\|\boldsymbol{\Omega}\|_{1}. (11)

Tseng (2001) established the convergence property of a block coordinate descent algorithm under regularity conditions on 0\ell_{0}, 1\ell_{1}, and 2\ell_{2}.

To present the main convergence property of the compositional graphical lasso algorithm, let’s review the definition of a cluster point in real analysis. A cluster point of a set 𝒜n\mathcal{A}\subset\mathbb{R}^{n} is a real vector 𝐚n\mathbf{a}\in\mathbb{R}^{n} such that for every δ>0\delta>0, there exists a point 𝐱\mathbf{x} in 𝒜{𝐚}\mathcal{A}\setminus\{\mathbf{a}\} satisfying that 𝐱𝐚2<δ\|\mathbf{x}-\mathbf{a}\|_{2}<\delta. Obviously, any limit point of the set 𝒜\mathcal{A} is a cluster point. Furthermore, define a cluster point of the compositional graphical lasso algorithm to be a cluster point of the set {(𝐳1(t),,𝐳n(t),𝝁(t),𝛀(t)):t=0,1,2,}\{(\mathbf{z}_{1}^{(t)},\ldots,\mathbf{z}_{n}^{(t)},\boldsymbol{\mu}^{(t)},\boldsymbol{\Omega}^{(t)}):t=0,1,2,\ldots\}, which are minimizers found at each iteration tt. Then, the following theorem presents a theoretical property for every cluster point of our algorithm as follows.

Theorem 1.

Any cluster point of the compositional graphical lasso algorithm is a stationary point of the objective function (6).

The proof of Theorem 1 can be found in the supplementary materials. This theorem guarantees that a cluster point, usually a limit point, of the compositional graphical lasso algorithm, is at least a stationary point. It is noteworthy that there exists a global minimizer for the objective function in (6) because we have proved the coerciveness of the objective function in the proof (Calafiore and El Ghaoui, 2014, Lemma 8.4). Therefore, to achieve global optimization in practice, one can run the algorithm multiple times starting with different initial values and choose the one solution that yields the smallest objective function among the multiple ones.

In addition, the values of the objective function at each iteration, i.e.,
{(𝐳1(t),,𝐳n(t),𝝁(t),𝛀(t)):t=0,1,2,}\{\ell(\mathbf{z}_{1}^{(t)},\ldots,\mathbf{z}_{n}^{(t)},\boldsymbol{\mu}^{(t)},\boldsymbol{\Omega}^{(t)}):t=0,1,2,\ldots\}, will always converge. This is because that the objective function is bounded below (as 0+2\ell_{0}+\ell_{2} is bounded below as shown in the proof and 1>0\ell_{1}>0 by definition) and our algorithm results in non-increasing objective function values between two iterations. Therefore, the values of objective function will always converge to a limit point. In practice, we have always observed the numerical convergence of both the minimizers and the values of the objective function after a certain number of iterations.

2.5 Tuning Parameter Selection

There is a large body of literature on the selection of a tuning parameter in the variable selection framework. Common approaches can be broadly categorized into three types: criteirion-based methods, prediction-based methods, and stability-based methods. Criterion-based methods such as Akaike information criterion (AIC) (Akaike, 1974) and Bayesian information criteria (BIC) (Schwarz et al., 1978) balance the model complexity and the goodness of fit, prediction-based methods such as cross validation (Stone, 1974; Geisser, 1975) and generalized cross validation (Golub et al., 1979) aim to minimize the expected prediction error of the selected model on independent datasets. Stability-based methods such as stability selection (Meinshausen and Bühlmann, 2010) and the Stability Approach to Regularization Selection (StARS) (Liu et al., 2010) select a model with high stability under subsampling or bootstrapping the original data.

In this work, we apply StARS to select the tuning parameter λ\lambda in our objective function (6). In StARS, we draw NN subsamples without replacement from the original dataset with nn observations, each subsample of size bb. For each value of the tuning parameter λ\lambda, we obtain an estimate of 𝛀\boldsymbol{\Omega}, i.e., a network for each subsample. Then, we measure the total instability of these resultant networks across the NN subsamples. The total instability of these networks is defined by averaging the instabilities of each edge across the NN subsamples over all possible edges, where the instability of each edge is estimated as the twice the sample variance of the Bernoulli indicator of whether this edge is selected or not in each of the NN subsamples.

Starting from a large penalty which corresponds to the empty network, the instability of networks increases as λ\lambda decreases. StARS stops and selects the tuning parameter to be the minimum value of λ\lambda’s with which the instability of the resultant networks is less than a threshold β>0\beta>0. In principle, StARS selects a tuning parameter so that the resultant network is the densest among networks with a total instability less than a threshold β\beta without violating some sparsity assumption. The selected network is the “densest on the sparse side,” as it starts with the empty network and stops when the instability first across the threshold.

3 Simulation Studies

3.1 Simulation Settings

To evaluate the performance of compositional graphical lasso, we conduct simulation studies and compare it with other network estimation methods.

Given that our goal is to estimate the true network, i.e., 𝛀\boldsymbol{\Omega} in (3), we consider the following three types of true precision matrices 𝛀=(ωkl)1k,lK\boldsymbol{\Omega}=(\omega_{kl})_{1\leq k,l\leq K}, which are different in the pattern of edge distributions as well as the degree of connectedness.

  1. 1.

    Chain network: ωkk=1.5\omega_{kk}=1.5, ωkl=0.5\omega_{kl}=0.5 if |kl|=1|k-l|=1, and ωkl=0\omega_{kl}=0 if |kl|>1|k-l|>1. A node is designed to be connected to its adjacent nodes, and the connectedness of nodes is balanced.

  2. 2.

    Random network: ωkl=1\omega_{kl}=1 with probability 3/K3/K for klk\neq l. A node is connected to all other nodes randomly with a fixed probability, set to be 3/K3/K. Similar to the chain structure, the connectedness of nodes is balanced.

  3. 3.

    Hub network: All nodes are randomly split into K/20\lceil K/20\rceil disjoint groups, and a hub node kk is selected from each group. For any other node ll in the same group, ωkl=1\omega_{kl}=1. All the remaining entries of 𝛀\boldsymbol{\Omega} are zero. Here, nodes are partitioned into the same group at random, but is then designated to be connected to the hub node at certain. The degree of connectedness among nodes is extremely unbalanced in this case: the hub nodes are connected to all the other nodes in its group (around 20 nodes) and all the other nodes are only connected to the hub node in its group, i.e., just one node.

In addition to the true network, we also consider two other factors that are expected to influence the result. The first factor is the sequencing depth, MiM_{i}, in the multinomial distribution (1). We simulate MiM_{i} from uniform distributions, the detail of which will be discussed in the following subsections. The second factor is the degree of variation in the logistic normal distribution (3). For each of the three types of precision matrices, we consider an additional factor by multiplying a positive constant cc to 𝛀\boldsymbol{\Omega} so that the true precision matrix is c𝛀c\boldsymbol{\Omega}. We choose c=1c=1 and c=1/5c=1/5 separately and call the two settings low and high compositional variation, respectively.

The data are simulated following the logistic normal multinomial model in (1)–(3). We first simulate 𝐳iN(𝝁,𝚺)\mathbf{z}_{i}\sim N(\boldsymbol{\mu},\boldsymbol{\Sigma}) independently for i=1,,ni=1,\ldots,n; then, we perform the inverse log-ratio transformation (also know as the softmax transformation, the inverse transformation of (2)) to obtain the multinomial probabilities 𝐩i\mathbf{p}_{i} for i=1,,ni=1,\ldots,n; last, we simulate multinomial counts 𝐱i\mathbf{x}_{i} from a multinomial distribution with sequencing depth MiM_{i} and probabilities 𝐩i\mathbf{p}_{i}. Throughout this simulation study, we fix n=100n=100 and K=200K=200.

The simulation results are based on 100 replicates. For each replicate, we apply compositional graphical lasso, neighborhood selection, and graphical lasso separately to obtain a sparse estimator of 𝛀\boldsymbol{\Omega}. For neighborhood selection and graphical lasso, we first obtain an estimate of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} from the multinomial distribution via the additive log-ratio transformation

z~i,k=log(xi,kxi,K+1),i=1,,n,k=1,,K,\tilde{z}_{i,k}=\log(\frac{x_{i,k}}{x_{i,K+1}}),\ i=1,\ldots,n,\ k=1,\ldots,K, (12)

and then apply neighborhood selection and graphical lasso directly on the estimates 𝐳~1,,𝐳~n\tilde{\mathbf{z}}_{1},\ldots,\tilde{\mathbf{z}}_{n} by treating them as surrogates for their true counterparts, i.e., 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}. These methods are almost identical to SPIEC-EASI although we replaced the centered log-ratio transformation in SPIEC-EASI with additive log-ratio transformation for a fair comparison.

To compare the performance of the three methods in terms of network recovery, all three methods are applied with a sequence of tuning parameter values, and their true positive rates (TPR) and false positive rates (FPR) in terms of edge selection are recorded for each value of λ\lambda. An ROC curve is plotted from the average TPR and the average FPR over the 100 replicates for each of the tuning parameters.

In addition, we apply StARS to select an optimal tuning parameter λ\lambda. Following the recommendation in Liu et al. (2010), we set the threshold for the total instability to be β=0.05\beta=0.05, the size of each subsample b=7nb=7\sqrt{n}, and the number of subsamples N=50N=50. Once the optimal tuning parameter is determined by StARS, we fit the whole dataset with the selected tuning parameter and evaluate the resultant network using three criteria: precision, recall, and F1 score, which are defined as

Precision=TPTP+FP,Recall=TPTP+FN,F1=2×Precision×RecallPrecision+Recall,\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}},\quad\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}},\quad\text{F1}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}},

where TP, FP, and FN are numbers of true positives, false positives, and false negatives, respectively.

3.2 Simulation Results for Dense Data

In this subsection, we evaluate the performance of compositional graphical lasso on dense data, in which most of the simulated counts are nonzero. This corresponds to a simulation setting where the sequencing depths MiM_{i}’s are relatively larger than the number of taxa K+1K+1. Still, to evaluate the effect of the sequencing depth on the performance of network estimation methods, we simulate MiM_{i} from two uniform distributions, Uniform(20K,40K)(20K,40K) and Uniform(100K,200K)(100K,200K), and call the two settings low and high sequencing depth in this subsection, respectively.

Figure 1 presents the ROC curves for compositional graphical lasso (Comp-gLASSO), neighborhood selection (MB), and graphical lasso (gLASSO), from which we can see that compositional graphical lasso dominates its competitors in terms of edge selection in all settings. In particular, the advantage of the compositional graphical lasso over neighborhood selection and graphical lasso is the most obvious when the compositional variation is high and the sequencing depth is low, no matter which type of network structure is considered. On the contrary, the three methods perform very similarly for all types of network structures when the compositional variation is low and the sequencing depth is high. The difference between compositional graphical lasso and the rest is intermediate for the other two settings when both compositional variation and sequencing depth are high or when both are low. Comparing graphical lasso and neighborhood selection, they tend to perform more similarly although graphical lasso seems to outperform neighborhood selection in some settings with a small margin.

Refer to caption
Figure 1: ROC curves for compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO) and neighborhood selection (MB). Solid blue: Comp-gLASSO; dashed red: gLASSO; dotted black: MB. 𝐡/𝐥,𝐡/𝐥\mathbf{h/l,h/l}: high/low sequencing depth, high/low compositional variation.
Refer to caption
Figure 2: Recall, precision and F1 score for the network selected by StARS for compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO) and neighborhood selection (MB). Red (left): Comp-gLASSO; green (middle): gLASSO; blue (right): MB. 𝐡/𝐥,𝐡/𝐥\mathbf{h/l,h/l}: high/low sequencing depth, high/low compositional variation.

The above observations agree with our expectation about how the two factors, compositional variation and sequencing depth, affect the comparison between the methods. Recall that neighborhood selection and graphical lasso replace the true values of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} by their estimates/surrogates 𝐳~1,,𝐳~n\tilde{\mathbf{z}}_{1},\ldots,\tilde{\mathbf{z}}_{n} as in (12) without taking into account the estimation accuracy or uncertainty of these surrogates. First, a higher sequencing depth leads to more accurate surrogates 𝐳~1,,𝐳~n\tilde{\mathbf{z}}_{1},\ldots,\tilde{\mathbf{z}}_{n}; therefore, it is not surprising to see that the three methods perform more similarly when the sequencing depth is high. Second, a higher compositional variation results in a higher variation in 𝐳i\mathbf{z}_{i}’s and further in 𝐩i\mathbf{p}_{i}’s that are multinomial probabilities. Since neighborhood selection and graphical lasso ignore the multinomial component in the model, it is also not surprising to see that their performance is deteriorated by a high compositional variation.

Figure 2 presents the recall, precision, and F1 score from 50 replicates of the estimated network resulted from the tuning parameter selected by StARS. The first observation would be that the precisions of both compositional graphical lasso and graphical lasso are much worse than their recalls, whereas the precisions and recalls are more comparable for neighborhood selection. Interestingly, StARS results in a much more sparse network for neighborhood selection than the other methods under the same stability threshold, suggesting that fewer edges selected by neighborhood selection are stable enough (within the instability threshold). When it comes to method comparison, compositional graphical lasso has much higher recall than neighborhood selection in most settings, but have comparable or lower precision in most of the settings with high sequencing depth. The network from compositional graphical lasso has higher F1 score than the ones from neighborhood selection in most settings, except when sequencing depth is high and compositional variation is low for chain and hub networks. In addition, the network from compositional graphical lasso has higher or comparable precision, recall, and F1 score than the ones from graphical lasso in all settings. Similar to the observations from the ROC curves, the advantage of compositional graphical lasso is more obvious with a low sequencing depth or a high compositional variation.

3.3 Simulation Results for Sparse Data

In this subsection, we evaluate the performance of compositional graphical lasso on sparse data, in which a range of proportions of the compositional counts are simulated as zero. We only present the simulation results for the chain network, given that the simulation results for the other network types are similar. To examine how our method performs for different sparsity levels, we simulate MiM_{i} from four uniform distributions, Uniform(8K,16K)(8K,16K), Uniform(4K,8K)(4K,8K), Uniform(2K,4K)(2K,4K), and Uniform(K,2K)(K,2K), respectively. Note that the sparsity level of the simulated data also depends on the compositional variation (see Section 3.1). In a typically simulated dataset with high compositional variation, the sparsity level is around 40%, 50%, 60%, and 70% with the four uniform distributions for MiM_{i}; in a typically simulated dataset with low compositional variation, the sparsity level is around 5%, 10%, 25%, and 40% with the four uniform distributions for MiM_{i}. In both cases, we refer to these four sparsity levels as 1, 2, 3, and 4, with 1 the least sparse and 4 the most sparse. Due to the limited space, we place the figures resulted from this simulation in the supplementary materials.

Figure S1 presents the ROC curves for compositional graphical lasso (Comp-gLASSO), neighborhood selection (MB), and graphical lasso (gLASSO). Similar to Figure 1, we can see that compositional graphical lasso dominates its competitors in terms of edge selection across all sparsity levels. As the sequencing depths for sparse data are relatively small compared to the ones used for dense data, the advantage of the compositional graphical lasso over neighborhood selection and graphical lasso is obviously and consistently observed in all simulation settings. In addition, graphical lasso and neighborhood selection tend to perform more similarly although graphical lasso seems to outperform neighborhood selection with a small margin. This result is consistent with that observed for dense data in Section 3.2.

Figure S2 presents the recall, precision, and F1 score from 50 replicates of the estimated network resulted from the tuning parameter selected by StARS. The first observation is that all methods perform worse for sparse data than for dense data from the comparison between Figure S2 and Figure 2; all methods perform worse as the sparsity level increases. Similar to Figure 2, Figure S2 implies that compositional graphical lasso outperforms graphical lasso and neighborhood selection, with a higher recall, precision, and F1 score consistently observed in most settings. The only exception is that, when data are extremely sparse with a low compositional variation, graphical lasso seems to outperform compositional graphical lasso by a very small margin as both methods are not working well. For sparse data, StARS results in an almost empty network for neighborhood selection, resulting in zero recall, precision, and F1 score for most settings. This agrees with our observation from the simulation results for dense data that neighborhood selection tends to produce a much sparser network than compositional graphical lasso and graphical lasso when its tuning parameter is selected by StARS.

4 Real Data

4.1 Benchmark Study: Tara Oceans Project

To better understand the ocean, the largest ecosystem on the earth, the Tara Oceans Project aims to build the global ocean interactome that can be used to predict the dynamics and structure of ocean ecosystems. To achieve this, the Tara Oceans Consortium sampled both plankton and environmental data at 210 sites from the world’s oceans using the 110-foot research schooner Tara during the Tara Oceans Expedition (2009-2013). The data collected was later processed using sequencing and imaging techniques. One unique advantage of the Tara Oceans Project is that it has generated a list of 91 genus-level marine planktonic interactions that have been validated in the literature (Lima-Mendez et al., 2015). Though this list only comprises interactions between microbes that represent a small fraction of the total marine eukaryotic diversity and is therefore far from complete, it could serve as partial ground truth for us to evaluate the interactions identified by different methods. Thus, our major goal is to use the Tara Oceans Project as a benchmark study in order to compare the performance of different methods in constructing the planktonic interactions.

As the partial ground truth is a list of genus-level interactions, we choose to analyze the genus-level abundance data, which are aggregated from the original OTU abundance data downloadable from the Tara Oceans Project data repository (http://taraoceans.sb-roscoff.fr/EukDiv/). As a benchmark study, we only include the 81 genera that are involved in the list of gold-standard interactions in our analysis. In addition, we discard the samples with too few sequence reads (less than 100), resulting in 324 samples left in our analysis.

Similar to the simulation study, we apply compositional graphical lasso, graphical lasso, and neighborhood selection to estimate the interaction network among the 81 genera. We first pick the genus Acrosphaera, which has the largest average relative abundance among those genera not involved in the gold-standard list, and use this genus as the reference taxon for all three methods. Then, we apply each method with a sequence of 70 decreasing tuning parameter values, resulting in a sequence of interaction networks starting from an empty network. Finally, we apply StARS to find the optimal tuning parameter, in which the parameters β\beta, bb, and NN are set the same as in the simulation study.

Refer to caption
(a)
Refer to caption
(b)
Figure 3: (a): Number of identified literature interactions versus number of edges of the estimated network from the Tara dataset. (b): The degree distribution of vertices from the networks selected by StARS. Solid red: compositional graphical lasso; dashed green: graphical lasso; dashed dotted blue: neighborhood selection.

First, we compare the three methods in terms of how highly they rank the literature validated interactions amongst their top reported edges. Specifically, we start with a large value of the tuning parameter that results in an empty network, then decrease the tuning parameter so that the network becomes denser, and stop until the network has about 200 edges (out of a total of 3240 possible edges). At each value of the tuning parameter, we plot the number of literature validated interactions included in the network versus the total number of edges of the network, resulting in a step function for each method (Figure 3(a)). From Figure 3(a), we observe that compositional graphical lasso identifies slightly more literature validated interactions than graphical lasso until the total number of edges arrives 175 and graphical lasso identifies one more literature validated interaction afterwards. Neighborhood selection selects much fewer literature validated interactions than either compositional graphical lasso or graphical lasso. These observations imply that compositional graphical lasso slightly outperforms graphical lasso in reconstructing the literature validated interactions, while its advantage over neighborhood selection is much more obvious.

Second, we compare the overall topologies of the three interaction networks with tuning parameters selected by StARS. We find that compositional graphical lasso, graphical lasso, and neighborhood selection identify 749, 921, and 190 edges, respectively, with the same instability threshold used in StARS. This agrees with our observation in the simulation study that the network from neighborhood selection is much sparser than those from compositional graphical lasso and graphical lasso. The degree distributions from the networks estimated by the three methods are shown in Figure 3(b). The center of the three degree distributions are ranked as neighborhood selection, compositional graphical lasso, and graphical lasso in the ascending order, which is also reflected in the densities of the three interaction networks.

Third, we investigate the high-degree nodes, i.e., hub genera, in the interaction networks. High-degree nodes are often thought to represent taxa that elicit an effect on a large number of other members of their community, such as keystone taxa and generalistic parasites. It is observed that there are a few hub genera that have an excessive number of interactions with other genera in the literature, such as Amoebophrya, Blastodinium, and Parvilucifera, which are referred to as benchmark hub genera. Although the literature validated interactions are rather incomplete, it is still of interest to evaluate how well the three methods pick up those benchmark hub genera. Since the density of networks from the three methods are rather different, it is hard to compare the degrees of the hub genera from the three networks directly, but it is reasonable to compare the ranks of those degrees within each degree distribution. The method that generates lower ranks (degree of genera ranked in descending order) for those hubs in their degree distribution are believed to pick up the benchmark hub genera better.

A list of 7 benchmark hubs (which has degree 5\geq 5) along with their degrees from the incomplete network constructed from the literature is shown in Table 1, followed by the corresponding ranks of those genera in the degree distributions from each of the three methods and their corresponding degrees in the parentheses. We can see that compositional graphical lasso generates lower ranks than graphical lasso for all 7 genera, while neighborhood selection generates lower ranks than compositional graphical lasso for 3 genera, and the opposite for the other 4 genera. Overall, compositional graphical lasso performs the best in picking up the benchmark hub genera among the three methods.

Literature Comp-gLASSO gLASSO MB
Amoebophrya 1 (21) 31 (19) 47 (21) 2 (9)
Blastodinium 2 (12) 13 (23) 26 (24) 30 (5)
Parvilucifera 2 (12) 46 (17) 67 (19) 58 (3)
Syndinium 4 (7) 14 (23) 27 (24) 19 (6)
Vampyrophrya 4 (7) 34 (19) 60 (20) 6 (8)
Phaeocystis 6 (6) 1 (31) 3 (29) 17 (6)
Pirsonia 7 (5) 64 (15) 68 (19) 33 (5)
Table 1: For the hub genera, their ranks in each degree distribution (in descending order) from the literature, compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO) and neighborhood selection (MB). The numbers in the parentheses are the corresponding degrees of the genera.

We find that compositional graphical lasso predicts a high degree for genera that are known to act as keystone species or parasitize other taxa, many of which are also high-degree in the literature validated network. For example, Phaeocystis elicits a high degree in both literature validated and compositional graphical lasso networks. This genus is a well-described keystone organism in some marine ecosystems (Verity et al., 2007), where it causes large phytoplankton blooms and plays an important role in the global cycling of carbon and sulfur (Verity and Smetacek, 1996). Similarly, these two networks predict a high degree for taxa that are known parasites, such as Blastodinium (Skovgaard et al., 2012), Amoebophyra (Chambouvet et al., 2008), and Syndinium (Skovgaard et al., 2005). In some instances, compositional graphical lasso uniquely reveals high-degree genera even when compared to the literature validated network, such as in the case of the parasite Euduboscquella (Bachvaroff et al., 2012), which is ranked the 5th by compositional graphical lasso but only has 1 interaction in the literature. Given the fact that the literature validated network only captures a subset of interactions that exist in nature (i.e., those interactions that have been explicitly tested), we posit that compositional graphical lasso affords an opportunity to resolve novel modulators of community composition, such as keystone taxa and generalistic parasites, that follow-up experiments can validate.

4.2 Zebrafish Parasite Infection Study

The Zebrafish Parasite Infection Study, recently conducted at Oregon State University, used a zebrafish helminth intestinal infection model to resolve how the gut microbiome and parasite burden covary during infection (Gaulke et al., 2019). This study quantified the infection burden of an intestinal helminth of zebrafish, Pseudocapillaria tomentosa, and the gut microbiome in 210 4-month-old zebrafish. Half of these fish were exposed to the parasite, P. tomentosa, and the other half were unexposed. Given that not all exposed fish would be ultimately infected, parasite burden was measured more accurately as the total number of worms in their fecal samples. In addition, the same fecal samples were used to measure the abundance of the gut microbiome species via 16S rRNA sequencing, which resulted in gut microbiome data for 207 fish. Among these fish, 81 were infected after being exposed to P. tomentosa and the other 126 fish were not infected, as indicated by the total number of worms. Our major goal is to evaluate how the microbial interaction network associates with successful parasite infection in the gut.

Similar to the Tara Oceans Project, we choose to analyze the genus-level abundance data. We discard those genera that have a nonzero abundance in less than 5% of the samples, resulting in 42 genera left in our analysis. In addition, we follow the same strategy to define a reference genus as in Jiang et al. (2020), i.e., combining those OTUs that do not have a genus-level taxonomic classification into a pseudo-genus and using it as the reference genus. The analysis is conducted in the same way as in the Tara Oceans Project, except that the methods (compositional graphical lasso, graphical lasso, and neighborhood selection) are applied separately for the uninfected and the infected fish, resulting in three interaction networks for each group of fish. Analyzing the uninfected and infected fish separately allows us to compare the microbial interaction networks between the two groups.

First, we compare the overall topologies of the interaction networks with tuning parameters selected by StARS. We find that compositional graphical lasso, graphical lasso, and neighborhood selection identify 262, 312, and 78 edges, respectively, for uninfected fish, and 241, 259, and 60 edges, respectively, for infected fish. Comparing the two groups of fish, the interaction networks for the uninfected fish are slightly denser than the infected fish. The comparison among the three methods agrees with our observation in the simulation study and in the Tara Oceans Project that the network from neighborhood selection tends to be much sparser than those from compositional graphical lasso and graphical lasso. The degree distributions based on the three methods are shown in Figure 4, which again suggests high similarity between the two groups of fish. Similar to the Tara Oceans Project, neighborhood selection results in the lowest median degree, followed by composition graphical lasso and graphical lasso, regardless of whether the fish are infected or not.

Refer to caption
(a)
Refer to caption
(b)
Figure 4: (a): The degree distribution of vertices from the networks selected by StARS for uninfected fish. (b): The degree distribution of vertices from the networks selected by StARS for infected fish. Solid red: compositional graphical lasso; dashed green: graphical lasso; dashed dotted blue: neighborhood selection.

Second, we further investigate the degree of each node in the interaction networks. This analysis helps identify those high-degree nodes, i.e., hub genera, which are indicative of keystone taxa in the microbial community. Due to the limited space, we refer to Table S2 in the supplementary materials that presents all 42 genera with their degrees in each network in descending order. Although the networks are similar in density between the two groups of fish, we have observed a substantial difference between the degrees of individual nodes. To see this we sort the nodes in each network based on their degrees and compare the top ten nodes between networks. Taking compositional graphical lasso as an example, only three out of the top ten nodes overlap between uninfected and infected fish: Paicibacter, Photobacterium, Fusobacterium. This suggests that the network structures between the uninfected and infected fish are more distinctive than what their overall topologies might appear. The hub genera identified by different methods are quite different from each other too. For the group of uninfected fish, only three out of the top ten nodes overlap between the networks from the three methods: Aeromonas, Photobacterium, and Rheinheimera; similarly, for the group of infected fish, only three out of the top ten nodes overlap between the networks from the three methods: Fusobacterium, Paucibacter, and Yersinia.

Third, we compare the networks generated from uninfected and infected fish to clarify the relationship between infection and the interactions that exist in the gut microbiome, motivated by the hypothesis that intestinal dysbiosis is defined not only by changes in community composition, but also by how members of the community ecologically interact. In order to understand this relationship, we identify microbes whose interaction set changes between infected and uninfected hosts then compare these results across methods. One of the bacterial taxa that compositional graphical lasso identifies as increasing in interaction degree among infected individuals, while other methods estimate a decrease in detected interaction degree, is the genus Gemmobacter. Our earlier work (Gaulke et al., 2019) showed that the abundance of this taxon positively links to parasite exposure, indicating that it is more ecologically successful in infected versus uninfected intestines. The analysis from compositional graphical lasso indicates that this increase in Gemmobacter relative abundance in infected individuals is coincident with an increase in the ecological interactions between Gemmobacter and other members of the gut microbiome, as it interacts with a greater number of other taxa in infected individuals. While Gemmobacter is not a particularly well studied genus, prior work links it to dysbiotic diseases in a variety of hosts (Bates et al., 2022; Ni et al., 2020). Our observation of an increase in its interaction with other taxa in infected individuals strengthens the hypothesis that Gemmobacter may act as an agent of dysbiosis, as its increase in its relative abundance not only links to a disease context, but its more integrated role in the microbial community (i.e., it potentially impacts a larger number of taxa in the gut).

The results from compositional graphical lasso also uniquely suggest that the Paucibacter genus in uninfected individuals displays a higher number of identified interactions relative to infected individuals. The other methods we evaluated find that Paucibacter’s interactions are relatively consistent between infected and uninfected individuals. The results from compositional graphical lasso are notable because the genus Paucibacter includes clades that have been identified as being conserved in the healthy zebrafish gut microbiome (Sharpton et al., 2021), presumably because these taxa are critical to the commensal microbial community. Based on our observations, namely that disruption to the broader microbiome by parasite infection appears to disrupt the set of interactions between Paucibacter and other taxa in the community, we hypothesize that Paucibacter is a keystone member of the healthy zebrafish gut microbiome and that parasite exposure induced dysbiosis occurs in part by disrupting Paucibacter in the gut. Collectively, our observations reveal patterns of interaction between infected and uninfected individuals that clarify the potential ecological role of Gemmobacter and Paucibacter, which future research should seek to empirically test.

In addition to these observations about Gemmobacter and Paucibacter, compositional graphical lasso also finds insightful patterns about Photobacterium, a genus which contains pathogens of marine fish (Romalde, 2002; Osorio et al., 2018), though its role as a pathogen in zebrafish is less clear. However, as prior work has cultured Photobacterium from the zebrafish gut (Cantas et al., 2012) and other studies have demonstrated that Photobacterium elicits toxicity to zebrafish cells in culture (Osorio et al., 2018), observations collectively point to Photobacterium as a potential pathogen or pathobiont of zebrafish. Consistent with this speculation, prior work found that Photobacterium statistically explained variation in P. tomentosa worm burden as well as worm infection induced intestinal inflammation (Gaulke et al., 2019). We sought to determine if our interaction network analysis could clarify the role of Photobacterium in the zebrafish gut microbiome, especially as it relates to parasite infection.

Interestingly, Photobacterium is identified as an interaction hub (i.e., a relatively high degree node) in all networks assembled from all methods we evaluated. However, compositional graphical lasso uniquely reveals that the number of interactions linked to Photobacterium increases in infected fish (19 edges) as compared to uninfected fish (16 edges), whereas the other approaches observe the opposite pattern (graphical lasso: 11 versus 17 edges; neighborhood selection: 3 versus 5 edges). These differences in the Photobacterium subgraph across approaches also includes variation in the specific taxa that Photobacterium is inferred to interact with. To better understand changes in the set of Photobacterium interactions, we extracted the first-degree sub-graph connected to Photobacterium that each of the three methods produced for both the infected and uninfected host interaction networks. Compositional graphical lasso alone estimates eight interactions in the Photobacterium-specific infected host network, and six of these eight genera have previously been linked to parasite exposure, disease burden, or histopathology score. Two of these parasite etiology-linked taxa are Aeromonas and Gemmobacter, genera which are notable because they contain microbes that elicit pathogenic or pathobiotic phenotypes. For example, the Aeromonas genus includes well-characterized and abundant opportunistic pathogens, such as A. hydrophila (Saraceni et al., 2016), and the Gemmobacter genus, as noted above, has been shown to opportunistically increase abundance (Huang et al., 2020) in dysbiotic fish gut communities which display impeded overall gut function. Given that Photobacterium proliferates in the infected host gut, these observations suggest that Photobacterium may work alongside and even promote the growth of other pathobiotic taxa to disrupt the composition of the gut microbiome and induce dysbiosis upon infection.

Based on these collective observations, we hypothesize that Photobacterium is an intestinal pathobiont of the zebrafish gut and contributes to dysbiosis in infected fish. Photobacterium is a relatively prevalent genus in the zebrafish intestines, even in uninfected fish. However, P. tomentosa infection links to an increase in the relative abundance of Photobacterium, which also positively associates with intestinal hyperplasia (Gaulke et al., 2019), possibly due to the cytotoxic effect of Photobacterium. Given that Photobacterium is also an interaction hub whose influence on the community increases in infected fish, at least according to compositional graphic lasso, infection induced changes in Photobacterium relative abundance may drive additional changes in the success of other taxa in the microbiome, including opportunistic pathogens and pathobionts like members of Aeromonas and Gemmobacter. Going forward, future studies should seek to experimentally validate our novel hypothesis about the pathobiotic role of Photobacterium in the zebrafish gut, including its impact on the rest of the microbial community and its role in infection induced tissue damage. If our hypothesis is accurate, Photobacterium may serve as an important model taxon for discerning how the gut microbiome and helminths interact to impact infection outcomes.

Collectively, these results provide evidence for our hypothesis that in the case of intestinal helminth infection, dysbiosis may be defined by not only a change in the composition of the gut microbiome, but a restructuring in how key members of the microbial community interact with the rest of the microbiota. Notably, these patterns were only observed by compositional graphical lasso, as the other methods did not observe unique evidence of such infection-associated inversions for Gemmobacter, Paucibacter, or Photobacterium.

Finally, we visualize the six genus interaction networks, three for infected fish and three for uninfected fish (Figure 5). For a better visualization, we only keep the top 100100 edges for the networks resulted from compositional graphical lasso and graphical lasso as they are much denser than the ones from neighborhood selection. Their edges are ranked in the exactly same way as in the Tara Oceans Study, first by selection probability and then by edge weight (see Figure S3 in the supplementary materials for details). For all networks, darker blue implies higher magnitude in the absolute value of partial correlation.

Refer to caption
Figure 5: Inferred networks from each method with edges filtered by selection probability and ranked by edge weight separately for the groups of uninfected and infected fish. In each network, darker blue implies stronger (larger in absolute value) edge weight.

5 Discussion

A growing body of work points to the gut microbiome as an agent to treat, diagnose, and prevent human diseases. The innovation of these utilities requires foundational knowledge about the pathobiotic or probiotic role that the gut microbiome play in the development and progression of the diseases. While most studies have focused on investigating how the abundance of gut microbes covary with a disease, such as infection (Gaulke et al., 2019), how the microbial interactions associate with the diseases is largely unknown. Understanding how the microbial interactions associate with a disease is critical in identifying microbes as a pathobiont or probiotic, which are candidates for potential drug or diagnostic test.

This work focuses on gaining a better understanding of the underlying role that microorganisms play in their communities by constructing microbial interaction networks. We propose a novel method called compositional graphical lasso that simultaneously accounts for the following key features of microbiome abundance data: (a) the data are compositional counts and only carry information about relative abundances of the taxa; (b) the observed relative abundance is subject to heteroscedasticity as its variance depends on the varying sequence depth across samples; (c) the data are high-dimensional as the number of taxa are often larger than the number of samples. We have demonstrated the advantages of our approach over previously proposed methods in simulations and on a benchmark dataset.

We apply compositional graphical lasso to the data from the Zebrafish Parasite Infection Study, which used a parasite infection model to identify gut microbiota that positively and negatively associate with infection burden (Gaulke et al., 2019). Our approach identified method-specific changes in interaction degree between infected and uninfected individuals for three taxa, Photobacterium, Gemmobacter, and Paucibacter. Further investigation of these method-specific taxa interaction changes reveals their biological plausibility, and provides insight into their relevance in the context of parasite-linked changes in the zebrafish gut microbiome. In particular, we hypothesize that Photobacterium and Gemmobacter are pathobionts of zebrafish gut for parasitic infection and that Paucibacter is a probiotic. Future studies should seek to experimentally validate their ecological roles in the zebrafish gut, including their impacts on the rest of the microbial community and their roles in infection induced tissue damage.

It is noteworthy that compositional graphical lasso requires one to choose a reference taxon. As a general rule of thumb, we recommend choosing a “common” taxon that has a high average relative abundance and relatively few zero counts across samples, as its true relative abundance serves as the denominator in the additive log-ratio transformation. In practice, we also suggest that a user choose a taxon that is not of direct interest to investigate, as the reference taxon will not be represented in the resultant network. It is also of interest to study how much the choice of the reference taxon may affect the estimated network, and whether some robustness could be guaranteed across different choices of the reference. Because the robustness against the choice of the reference is so important not just for compositional graphical lasso, but for many other network estimation methods as well, we have chosen to devote a major effort to this issue in a separate ongoing project. In this ongoing work, we have theoretically established the reference-invariance property of the inverse covariance matrix of a class of additive log-ratio transformation-based models, including the logistic normal multinomial model as a special case. However, as it is beyond the scope of this paper, this invariance property will be detailed in the future.

References

  • Aitchison (1986) Aitchison, J. (1986), The statistical analysis of compositional data, Monographs on statistics and applied probability, Chapman and Hall.
  • Akaike (1974) Akaike, H. (1974), “A new look at the statistical model identification,” IEEE transactions on automatic control, 19, 716–723.
  • Armijo (1966) Armijo, L. (1966), “Minimization of functions having Lipschitz continuous first partial derivatives,” Pacific Journal of mathematics, 16, 1–3.
  • Bachvaroff et al. (2012) Bachvaroff, T. R., Kim, S., Guillou, L., Delwiche, C. F., and Coats, D. W. (2012), “Molecular diversity of the syndinean genus Euduboscquella based on single-cell PCR analysis,” Applied and environmental microbiology, 78, 334–345.
  • Ban et al. (2015) Ban, Y., An, L., and Jiang, H. (2015), “Investigating microbial co-occurrence patterns based on metagenomic compositional data,” Bioinformatics, 31, 3322–3329.
  • Banerjee et al. (2008) Banerjee, O., Ghaoui, L. E., and d’Aspremont, A. (2008), “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine learning research, 9, 485–516.
  • Bates et al. (2022) Bates, K. A., Sommer, U., Hopkins, K. P., Shelton, J. M., Wierzbicki, C., Sergeant, C., Tapley, B., Michaels, C. J., Schmeller, D. S., Loyau, A., et al. (2022), “Microbiome function predicts amphibian chytridiomycosis disease dynamics,” Microbiome, 10, 1–16.
  • Billheimer et al. (2001) Billheimer, D., Guttorp, P., and Fagan, W. F. (2001), “Statistical interpretation of species composition,” Journal of the American statistical Association, 96, 1205–1214.
  • Braga et al. (2016) Braga, R. M., Dourado, M. N., and Araújo, W. L. (2016), “Microbial interactions: ecology in a molecular perspective,” brazilian journal of microbiology, 47, 86–98.
  • Calafiore and El Ghaoui (2014) Calafiore, G. C. and El Ghaoui, L. (2014), Optimization models, Cambridge university press.
  • Cantas et al. (2012) Cantas, L., Sørby, J. R. T., Aleström, P., and Sørum, H. (2012), “Culturable gut microbiota diversity in zebrafish,” Zebrafish, 9, 26–37.
  • Chambouvet et al. (2008) Chambouvet, A., Morin, P., Marie, D., and Guillou, L. (2008), “Control of toxic marine dinoflagellate blooms by serial parasitic killers,” Science, 322, 1254–1257.
  • Fang et al. (2015) Fang, H., Huang, C., Zhao, H., and Deng, M. (2015), “CCLasso: correlation inference for compositional data through Lasso,” Bioinformatics, 31, 3172–3180.
  • Fang et al. (2017) — (2017), “gCoda: conditional dependence network inference for compositional data,” Journal of Computational Biology, 24, 699–708.
  • Faust and Raes (2012) Faust, K. and Raes, J. (2012), “Microbial interactions: from networks to models,” Nature Reviews Microbiology, 10, 538.
  • Freilich et al. (2010) Freilich, S., Kreimer, A., Meilijson, I., Gophna, U., Sharan, R., and Ruppin, E. (2010), “The large-scale organization of the bacterial network of ecological co-occurrence interactions,” Nucleic acids research, 38, 3857–3868.
  • Friedman and Alm (2012) Friedman, J. and Alm, E. J. (2012), “Inferring correlation networks from genomic survey data,” PLoS computational biology, 8, e1002687.
  • Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008), “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, 9, 432–441.
  • Gaulke et al. (2019) Gaulke, C. A., Martins, M. L., Watral, V. G., Humphreys, I. R., Spagnoli, S. T., Kent, M. L., and Sharpton, T. J. (2019), “A longitudinal assessment of host-microbe-parasite interactions resolves the zebrafish gut microbiome’s link to Pseudocapillaria tomentosa infection and pathology,” Microbiome, 7, 1–16.
  • Geisser (1975) Geisser, S. (1975), “The predictive sample reuse method with applications,” Journal of the American statistical Association, 70, 320–328.
  • Golub et al. (1979) Golub, G. H., Heath, M., and Wahba, G. (1979), “Generalized cross-validation as a method for choosing a good ridge parameter,” Technometrics, 21, 215–223.
  • Huang et al. (2020) Huang, J.-N., Wen, B., Zhu, J.-G., Zhang, Y.-S., Gao, J.-Z., and Chen, Z.-Z. (2020), “Exposure to microplastics impairs digestive performance, stimulates immune response and induces microbiota dysbiosis in the gut of juvenile guppy (Poecilia reticulata),” Science of The Total Environment, 733, 138929.
  • Jiang et al. (2020) Jiang, D., Sharpton, T., and Jiang, Y. (2020), “Microbial Interaction Network Estimation via Bias-Corrected Graphical Lasso,” Statistics in Biosciences, 1–22.
  • Kurtz et al. (2015) Kurtz, Z. D., Müller, C. L., Miraldi, E. R., Littman, D. R., Blaser, M. J., and Bonneau, R. A. (2015), “Sparse and compositionally robust inference of microbial ecological networks,” PLoS computational biology, 11, e1004226.
  • Lima-Mendez et al. (2015) Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J. C., Roux, S., Vincent, F., et al. (2015), “Determinants of community structure in the global plankton interactome,” Science, 348, 1262073.
  • Liu et al. (2010) Liu, H., Roeder, K., and Wasserman, L. (2010), “Stability approach to regularization selection (stars) for high dimensional graphical models,” in Advances in neural information processing systems, pp. 1432–1440.
  • McMurdie and Holmes (2014) McMurdie, P. J. and Holmes, S. (2014), “Waste not, want not: why rarefying microbiome data is inadmissible,” PLoS Comput Biol, 10, e1003531.
  • Meinshausen and Bühlmann (2010) Meinshausen, N. and Bühlmann, P. (2010), “Stability selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 417–473.
  • Meinshausen et al. (2006) Meinshausen, N., Bühlmann, P., et al. (2006), “High-dimensional graphs and variable selection with the lasso,” The annals of statistics, 34, 1436–1462.
  • Morgan et al. (2015) Morgan, X. C., Kabakchiev, B., Waldron, L., Tyler, A. D., Tickle, T. L., Milgrom, R., Stempak, J. M., Gevers, D., Xavier, R. J., Silverberg, M. S., et al. (2015), “Associations between host gene expression, the mucosal microbiome, and clinical outcome in the pelvic pouch of patients with inflammatory bowel disease,” Genome biology, 16, 1–15.
  • Ni et al. (2020) Ni, Q., Ye, Z., Wang, Y., Chen, J., Zhang, W., Ma, C., Li, K., Liu, Y., Liu, L., Han, Z., et al. (2020), “Gut microbial dysbiosis and plasma metabolic profile in individuals with vitiligo,” Frontiers in microbiology, 11, 592248.
  • Osorio et al. (2018) Osorio, C., Vences, A., Matanza, X., and Terceti, M. (2018), “Photobacterium damselae subsp. damselae, a generalist pathogen with unique virulence factors and high genetic diversity.” Journal of Bacteriology, 200, e00002–18.
  • Rockafellar (1970) Rockafellar, R. T. (1970), Convex analysis, no. 28, Princeton university press.
  • Romalde (2002) Romalde, J. L. (2002), “Photobacterium damselae subsp. piscicida: an integrated view of a bacterial fish pathogen,” International Microbiology, 5, 3–9.
  • Saraceni et al. (2016) Saraceni, P. R., Romero, A., Figueras, A., and Novoa, B. (2016), “Establishment of infection models in zebrafish larvae (Danio rerio) to study the pathogenesis of Aeromonas hydrophila,” Frontiers in microbiology, 7, 1219.
  • Schwarz et al. (1978) Schwarz, G. et al. (1978), “Estimating the dimension of a model,” The annals of statistics, 6, 461–464.
  • Sharpton et al. (2021) Sharpton, T. J., Stagaman, K., Sieler Jr, M. J., Arnold, H. K., and Davis, E. W. (2021), “Phylogenetic integration reveals the zebrafish core microbiome and its sensitivity to environmental exposures,” Toxics, 9, 10.
  • Skovgaard et al. (2012) Skovgaard, A., Karpov, S. A., and Guillou, L. (2012), “The parasitic dinoflagellates Blastodinium spp. inhabiting the gut of marine, planktonic copepods: morphology, ecology, and unrecognized species diversity,” Frontiers in microbiology, 3, 305.
  • Skovgaard et al. (2005) Skovgaard, A., Massana, R., Balague, V., and Saiz, E. (2005), “Phylogenetic position of the copepod-infesting parasite Syndinium turbo (Dinoflagellata, Syndinea),” Protist, 156, 413–423.
  • Stone (1974) Stone, M. (1974), “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society: Series B (Methodological), 36, 111–133.
  • Tang (2019) Tang, L. (2019), “Microbial interactions,” Nature Methods, 16, 19–19.
  • Tseng (2001) Tseng, P. (2001), “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of optimization theory and applications, 109, 475–494.
  • Verity et al. (2007) Verity, P. G., Brussaard, C. P., Nejstgaard, J. C., van Leeuwe, M. A., Lancelot, C., and Medlin, L. K. (2007), “Current understanding of Phaeocystis ecology and biogeochemistry, and perspectives for future research,” Biogeochemistry, 83, 311–330.
  • Verity and Smetacek (1996) Verity, P. G. and Smetacek, V. (1996), “Organism life cycles, predation, and the structure of marine pelagic ecosystems,” Marine Ecology Progress Series, 130, 277–293.
  • Xia et al. (2013) Xia, F., Chen, J., Fung, W. K., and Li, H. (2013), “A logistic normal multinomial regression model for microbiome compositional data analysis,” Biometrics, 69, 1053–1063.
  • Yoon et al. (2019) Yoon, G., Gaynanova, I., and Müller, C. L. (2019), “Microbial networks in SPRING-Semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data,” Frontiers in genetics, 10, 516.
  • Yuan et al. (2019) Yuan, H., He, S., and Deng, M. (2019), “Compositional data network analysis via lasso penalized D-trace loss,” Bioinformatics, 35, 3404–3411.
  • Yuan and Lin (2007) Yuan, M. and Lin, Y. (2007), “Model selection and estimation in the Gaussian graphical model,” Biometrika, 94, 19–35.
  • Zhao et al. (2012) Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012), “The huge package for high-dimensional undirected graph estimation in R,” Journal of Machine Learning Research, 13, 1059–1062.

Supplementary Materials

Proof of Theorem 1

Proof.

The conclusion in Theorem 1 is directly implied by Theorem 4.1(c) in Tseng (2001), for which we need to verify a few regularity conditions as follows.

First, 0(𝐳1,,𝐳n,𝝁,𝛀)\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}) in (9) is regular at each point in its domain. This is true because dom(0)\mathrm{dom}(\ell_{0}) is open and 0\ell_{0} is differentiable and all its partial derivatives exists.

Second, the sublevel set {(𝐳1,,𝐳n,𝝁,𝛀):(𝐳1,,𝐳n,𝝁,𝛀)(𝐳1(0),,𝐳n(0),𝝁(0),𝛀(0))}\{(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}):\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})\leq\ell(\mathbf{z}_{1}^{(0)},\ldots,\mathbf{z}_{n}^{(0)},\boldsymbol{\mu}^{(0)},\boldsymbol{\Omega}^{(0)})\} is compact and that \ell in (6) is continuous on this sublevel set. The continuity part is obvious and we just need to argue the compactness of the sublevel set. Since (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}) is a continuous function, the compactness of this sublevel set is actually implied by the coerciveness of (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}) (Calafiore and El Ghaoui, 2014, Lemma 8.3), which we are going to argue as follows.

As argued previously, (𝐳1,,𝐳n,𝝁,𝛀)=0(𝐳1,,𝐳n,𝝁,𝛀)+1(𝐳1,,𝐳n)+2(𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})=\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n})+\ell_{2}(\boldsymbol{\Omega}) as in (9)–(11). On the one hand, let’s argue that 0(𝐳1,,𝐳n,𝝁,𝛀)+2(𝛀)\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{2}(\boldsymbol{\Omega}) is bounded below as follows.

0(𝐳1,,𝐳n,𝝁,𝛀)+2(𝛀)\displaystyle\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{2}(\boldsymbol{\Omega}) =12log[det(𝛀)]+12ni=1n(𝐳i𝝁)𝛀(𝐳i𝝁)+λ𝛀1\displaystyle=-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu})+\lambda\|\boldsymbol{\Omega}\|_{1}
12log[det(𝛀^)]+12ni=1n(𝐳i𝝁)𝛀^(𝐳i𝝁)+λ𝛀^1\displaystyle\geq-\frac{1}{2}\log[\det(\hat{\boldsymbol{\Omega}})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\hat{\boldsymbol{\Omega}}(\mathbf{z}_{i}-\boldsymbol{\mu})+\lambda\|\hat{\boldsymbol{\Omega}}\|_{1}
12log[det(𝛀^)]\displaystyle\geq-\frac{1}{2}\log[\det(\hat{\boldsymbol{\Omega}})]
12KlogKλ,\displaystyle\geq-\frac{1}{2}K\log\frac{K}{\lambda},

where

𝛀^=argmin𝛀12log[det(𝛀^)]+12ni=1n(𝐳i𝝁)𝛀^(𝐳i𝝁)+λ𝛀^1,\hat{\boldsymbol{\Omega}}=\operatorname*{arg\,min}_{\boldsymbol{\Omega}}-\frac{1}{2}\log[\det(\hat{\boldsymbol{\Omega}})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\hat{\boldsymbol{\Omega}}(\mathbf{z}_{i}-\boldsymbol{\mu})+\lambda\|\hat{\boldsymbol{\Omega}}\|_{1},

and the last inequality follows because 𝛀^\hat{\boldsymbol{\Omega}} is unique and has positive eigenvalues bounded above by Kλ\frac{K}{\lambda} (Banerjee et al., 2008).

On the other hand, we argue that 1(𝐳1,,𝐳n)\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) is a coercive function of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}. This is because 1(𝐳1,,𝐳n)\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) is proper, closed (as implied by continuity), strictly convex and has a unique minimizer (with a positive definite Hessian matrix, see Section 2.3), thus every sublevel set of 1\ell_{1} is bounded, following Corollary 8.7.1 from Rockafellar (1970). In addition, every sublevel set of 1\ell_{1} is closed due to the fact that 1\ell_{1} is continuous (Rockafellar, 1970, Theorem 7.1). Therefore, every sublevel set of 1\ell_{1} is compact, implying the coerciveness of 1\ell_{1} (Calafiore and El Ghaoui, 2014, Lemma 8.3).

Combining that 0(𝐳1,,𝐳n,𝝁,𝛀)+2(𝛀)\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{2}(\boldsymbol{\Omega}) is bounded below and that 1(𝐳1,,𝐳n)\ell_{1}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}) is a coercive function of 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}, then if (𝐳1,,𝐳n)\|(\mathbf{z}_{1}^{\prime},\ldots,\mathbf{z}_{n}^{\prime})\|\to\infty, (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})\to\infty. Thus, to show that (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}) is a coercive function, we just need to show that g(𝝁,𝛀)=0(𝐳1,,𝐳n,𝝁,𝛀)+2(𝛀)g(\boldsymbol{\mu},\boldsymbol{\Omega})=\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{2}(\boldsymbol{\Omega}) is a coercive function of 𝝁\boldsymbol{\mu} and 𝛀\boldsymbol{\Omega} for fixed and finite 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}.

Note that for fixed and finite 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n},

g(𝝁,𝛀)\displaystyle g(\boldsymbol{\mu},\boldsymbol{\Omega}) =12log[det(𝛀)]+12ni=1n(𝐳i𝝁)𝛀(𝐳i𝝁)+λ𝛀1\displaystyle=-\frac{1}{2}\log[\det(\boldsymbol{\Omega})]+\frac{1}{2n}\sum_{i=1}^{n}(\mathbf{z}_{i}-\boldsymbol{\mu})^{\prime}\boldsymbol{\Omega}(\mathbf{z}_{i}-\boldsymbol{\mu})+\lambda\|\boldsymbol{\Omega}\|_{1}
12j=1Klog[κj(𝛀)]+12nκmin(𝛀)i=1n𝐳i𝝁2+λκmax(𝛀)\displaystyle\geq-\frac{1}{2}\sum_{j=1}^{K}\log[\kappa_{j}(\boldsymbol{\Omega})]+\frac{1}{2n}\kappa_{\min}(\boldsymbol{\Omega})\sum_{i=1}^{n}\|\mathbf{z}_{i}-\boldsymbol{\mu}\|^{2}+\lambda\kappa_{\max}(\boldsymbol{\Omega})
max(I1,I2),\displaystyle\geq\max(I_{1},I_{2}),

where κj(𝛀)\kappa_{j}(\boldsymbol{\Omega}), κmin(𝛀)\kappa_{\min}(\boldsymbol{\Omega}), and κmax(𝛀)\kappa_{\max}(\boldsymbol{\Omega}) denote the jjth, minimum, and maximum eigenvalues of 𝛀\boldsymbol{\Omega}, respectively, and

I1\displaystyle I_{1} =12nκmin(𝛀)i=1n𝐳i𝝁2+λκmax(𝛀)12Klog[κmax(𝛀)],\displaystyle=\frac{1}{2n}\kappa_{\min}(\boldsymbol{\Omega})\sum_{i=1}^{n}\|\mathbf{z}_{i}-\boldsymbol{\mu}\|^{2}+\lambda\kappa_{\max}(\boldsymbol{\Omega})-\frac{1}{2}K\log[\kappa_{\max}(\boldsymbol{\Omega})],
I2\displaystyle I_{2} =12log[κmin(𝛀)]+λκmax(𝛀)12(K1)log[κmax(𝛀)].\displaystyle=-\frac{1}{2}\log[\kappa_{\min}(\boldsymbol{\Omega})]+\lambda\kappa_{\max}(\boldsymbol{\Omega})-\frac{1}{2}(K-1)\log[\kappa_{\max}(\boldsymbol{\Omega})].

For L=KL=K or L=K1L=K-1, λκmax(𝛀)12Llog[κmax(𝛀)]0\lambda\kappa_{\max}(\boldsymbol{\Omega})-\frac{1}{2}L\log[\kappa_{\max}(\boldsymbol{\Omega})]\geq 0 when κmax(𝛀)\kappa_{\max}(\boldsymbol{\Omega})\to\infty and λκmax(𝛀)12Llog[κmax(𝛀)]\lambda\kappa_{\max}(\boldsymbol{\Omega})-\frac{1}{2}L\log[\kappa_{\max}(\boldsymbol{\Omega})] is bounded below when κmax(𝛀)\kappa_{\max}(\boldsymbol{\Omega}) is bounded above. Therefore, if κmin(𝛀)\kappa_{\min}(\boldsymbol{\Omega}) is bounded away from 0, then I1I_{1}\to\infty when 𝝁\|\boldsymbol{\mu}\|\to\infty because 12nκmin(𝛀)i=1n𝐳i𝝁2\frac{1}{2n}\kappa_{\min}(\boldsymbol{\Omega})\sum_{i=1}^{n}\|\mathbf{z}_{i}-\boldsymbol{\mu}\|^{2}\to\infty; if κmin(𝛀)0\kappa_{\min}(\boldsymbol{\Omega})\to 0, then I2I_{2}\to\infty because 12log[κmin(𝛀)]-\frac{1}{2}\log[\kappa_{\min}(\boldsymbol{\Omega})]\to\infty. In summary, for fixed and finite 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}, we have g(𝝁,𝛀)g(\boldsymbol{\mu},\boldsymbol{\Omega})\to\infty when 𝝁\|\boldsymbol{\mu}\|\to\infty regardless of 𝛀\boldsymbol{\Omega}.

Thus, to show that g(𝝁,𝛀)g(\boldsymbol{\mu},\boldsymbol{\Omega}) is a coercive function of 𝝁\boldsymbol{\mu} and 𝛀\boldsymbol{\Omega} for fixed and finite 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n}, it suffices to show that g(Ω)=0(𝐳1,,𝐳n,𝝁,𝛀)+2(𝛀)g(\Omega)=\ell_{0}(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega})+\ell_{2}(\boldsymbol{\Omega}) is a coercive function of 𝛀\boldsymbol{\Omega} for fixed and finite 𝐳1,,𝐳n\mathbf{z}_{1},\ldots,\mathbf{z}_{n} and μ\mu. As g(Ω)g(\Omega) is the graphical lasso objective function, it is proper, closed, convex, and has a unique minimizer (Banerjee et al., 2008; Friedman et al., 2008), thus every sublevel set of gg is bounded, again following Corollary 8.7.1 from Rockafellar (1970). In addition, every sublevel set of gg is closed due to the continuity of gg. Therefore, every sublevel set of gg is compact, implying the coerciveness of gg (Calafiore and El Ghaoui, 2014, Lemma 8.3). This concludes the proof of the coerciveness of the function (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}).

Third, (𝐳1,,𝐳n,𝝁,𝛀)\ell(\mathbf{z}_{1},\ldots,\mathbf{z}_{n},\boldsymbol{\mu},\boldsymbol{\Omega}) has at most one minimum in its second block of parameters, i.e., 𝝁\boldsymbol{\mu}. This is because the Hessian matrix for 𝝁\boldsymbol{\mu} is 𝛀\boldsymbol{\Omega} which is positive definite.

The conclusion of Theorem 1 is thus proved as we have verified all regularity conditions in Theorem 4.1(c) in Tseng (2001). ∎

Simulations: Figures S1 and S2

Refer to caption
Figure S1: ROC curves for compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO) and neighborhood selection (MB). Solid blue: Comp-gLASSO; dashed red: gLASSO; dotted black: MB. Sparsity levels 1–4: from the least sparse to the most sparse.
Refer to caption
Figure S2: Recall, precision and F1 score for the network selected by StARS for compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO) and neighborhood selection (MB). Red (left): Comp-gLASSO; green (middle): gLASSO; blue (right): MB. 1–4: sparsity levels 1–4 from the least sparse to the most sparse.

Tara Oceans Project: Figure S3 and Table S1

We visualize the three genus interaction networks (from compositional graphical lasso, graphical lasso, and neighborhood selection) in company with the network of the literature validated interactions. For a better visualization, we only keep the top 100100 edges that are ranked by the following two criteria: (a) selection probability, the proportion of times that an edge is selected from the NN subsamples in StARS and (b) edge weight, the absolute value of the partial correlation that is defined as |ω^ij|/ω^iiω^jj|\hat{\omega}_{ij}|/\sqrt{\hat{\omega}_{ii}\hat{\omega}_{jj}} where ω^ij\hat{\omega}_{ij} is the (i,j)(i,j) entry of the estimated inverse covariance matrix Ω^\hat{\Omega}. Specifically, the edges are first ranked by selection probability, and the edges with the same selection probabilities are further ranked by edge weight. For the networks from all three methods, darker blue implies higher magnitude in the absolute value of partial correlation.

We can see from Figure S3 that, though still different, the networks estimated by the three algorithms have apparent similarity in the predicted edges and their edge weights, e.g., the genus pairs “Centropages - Thalassicolla” and “Acanthometra - Hexaconus” are in dark blue for all three methods. On the other hand, there are very few overlaps between those top 100 edges and the known interactions from literature. Since our current knowledge of the genus-level interactions are still limited, the edges that have not been reported from literature but enjoys higher selection probability and larger weight might suggest promising new eukaryotic interactions that deserve biological validations. There are actually 39 common edges from the top 100 edges from the three estimated networks. We provide the list of these common genus-genus interactions (Table S1) for the interested readers.

Refer to caption
Figure S3: Inferred networks from each method with edges filtered by selection probability and ranked by edge weight, in comparison with the 91 interactions reported in literature. In each network, darker blue implies stronger (larger in absolute value) edge weight.
Genus 1 Genus 2
1 Gonyaulax Alexandrium
2 Thalassicolla Centropages
3 Sphaerozoum Collozoum
4 Hexaconus Acanthometra
5 Pedinomonas Karenia
6 Lonchostaurus Aulacantha
7 Eucampia Coscinodiscus
8 Phagomyxa Noctiluca
9 Temora Centropages
10 Oblea Coscinodiscus
11 Oithona Alexandrium
12 Orbulina Globigerinoides
13 Vampyrophrya Syndinium
14 Paracineta Euchaeta
15 Tortanus Calanus
16 Scrippsiella Gyrodinium
17 Pseudopirsonia Pirsonia
18 Heterocapsa Akashiwo
19 Thalassicolla Collozoum
20 Rhynchopus Phaeocystis
21 Tintinnophagus Pirsonia
22 Thalassiosira Rhizosolenia
23 Prorocentrum Hexaconus
24 Pseudopirsonia Paradinium
25 Metschnikowia Acartia
26 Leptocylindrus Eucampia
27 Odontella Chaetoceros
28 Globigerinoides Corycaeus
29 Tintinnophagus Rhynchopus
30 Metschnikowia Aulacantha
31 Pelagodinium Favella
32 Prorocentrum Euduboscquella
33 Tintinnopsis Thalassiosira
34 Collozoum Acartia
35 Subeucalanus Sphaerozoum
36 Vampyrophrya Pseudohimantidium
37 Oxyrrhis Arthracanthida
38 Temora Acrocalanus
39 Tintinnophagus Amoebophrya
Table S1: Genus-genus interactions that are commonly identified by compositional graphical lasso, graphical lasso, and neighborhood selection.

Zebrafish Parasite Infection Study: Table S2

Uninfected Infected Uninfected Infected Uninfected Infected
Comp-gLASSO gLASSO MB
Plesiomonas (21) Photobacterium (19) Silvanigrella (21) Paucibacter (21) Rheinheimera (9) Fluviicola (7)
Paraclostridium (18) Weissella (18) Peptostreptococcus (21) Phreatobacter (18) Aeromonas (8) Aeromonas (6)
Aeromonas (17) Gemmobacter (18) Aeromonas (20) Fusobacterium (17) Cetobacterium (8) Paucibacter (5)
Paucibacter (17) Fusobacterium (17) Plesiomonas (20) Cetobacterium (16) Runella (6) Runella (5)
Photobacterium (16) Paucibacter (15) Paraclostridium (19) Cloacibacterium (16) Peptostreptococcus (6) Cetobacterium (4)
Fusobacterium (16) Yersinia (15) Pseudoduganella (18) Weissella (16) Pseudomonas (5) Paraclostridium (4)
Lactobacillus (16) Crenobacter (14) Rheinheimera (18) ZOR0006 (15) Cloacibacterium (5) Yersinia (4)
Rheinheimera (15) Aurantisolimonas (14) Bacteroides (18) Flavobacterium (15) Photobacterium (5) Fusobacterium (4)
Yersinia (15) Paracoccus (14) Undibacterium (17) Yersinia (15) Haliscomenobacter (5) Paracoccus (4)
Silvanigrella (15) Ignatzschineria (14) Photobacterium (17) Shewanella (14) Tychonema (5) Ignatzschineria (4)
Bacteroides (15) Lysinibacillus (14) Runella (17) Aurantisolimonas (14) Plesiomonas (4) Bosea (4)
Chitinibacter (14) Plesiomonas (13) Paucibacter (16) Aeromonas (13) Acinetobacter (4) Bacteroides (4)
Weissella (14) Acinetobacter (13) Chitinimonas (16) Acinetobacter (13) Paucibacter (4) Pseudomonas (3)
Paracoccus (14) Chitinibacter (13) Yersinia (16) Crenobacter (13) Shewanella (4) Chitinibacter (3)
Peptostreptococcus (14) Rheinheimera (13) Paracoccus (16) A-N-P-R (13) Pseudoduganella (4) Crenobacter (3)
Pseudomonas (13) Cetobacterium (12) Caenimonas (16) Pseudoduganella (12) Chitinimonas (4) ZOR0006 (3)
Acinetobacter (13) Cloacibacterium (12) Lactobacillus (16) Caenimonas (12) Fluviicola (4) Pseudoduganella (3)
Pseudoduganella (13) Phreatobacter (12) Luteimonas (15) Gemmobacter (12) Aurantisolimonas (4) Cloacibacterium (3)
Cloacibacterium (13) Tychonema (12) Ignatzschineria (15) Ignatzschineria (12) Fusobacterium (4) Photobacterium (3)
Chitinimonas (13) Bacteroides (12) Lysinibacillus (15) Lysinibacillus (12) Caenimonas (4) Chitinimonas (3)
Ignatzschineria (13) Aeromonas (11) Cetobacterium (14) Lactobacillus (12) Defluviimonas (4) Aurantisolimonas (3)
Lysinibacillus (13) Shewanella (11) Chitinibacter (14) Pseudomonas (11) Ensifer (4) Phreatobacter (3)
Shewanella (12) Undibacterium (11) Shewanella (14) Chitinibacter (11) Bosea (4) Defluviimonas (3)
ZOR0006 (12) Chitinimonas (11) Cloacibacterium (14) Mycoplasma (11) Chitinibacter (3) Tychonema (3)
Flavobacterium (12) Lactobacillus (11) Phreatobacter (14) Photobacterium (11) Crenobacter (3) Peptostreptococcus (3)
Mycoplasma (12) ZOR0006 (10) Gemmobacter (14) Fluviicola (11) Flavobacterium (3) Plesiomonas (2)
Tychonema (12) Pseudoduganella (10) Bosea (14) Rheinheimera (11) A-N-P-R (3) Acinetobacter (2)
Undibacterium (11) A-N-P-R (10) Tychonema (14) Paracoccus (11) Luteimonas (3) Undibacterium (2)
Crenobacter (10) Mycoplasma (9) Pseudomonas (13) Runella (11) Paracoccus (3) Flavobacterium (2)
Fluviicola (10) Fluviicola (9) Acinetobacter (13) Haliscomenobacter (11) Gemmobacter (3) Weissella (2)
Defluviimonas (10) Luteimonas (9) A-N-P-R (13) Bosea (11) Lysinibacillus (3) Rheinheimera (2)
Ensifer (10) Defluviimonas (9) Fluviicola (13) Plesiomonas (10) ZOR0006 (2) Luteimonas (2)
Cetobacterium (9) Ensifer (9) Weissella (13) Undibacterium (10) Undibacterium (2) Gemmobacter (2)
A-N-P-R (9) Peptostreptococcus (9) ZOR0006 (12) Paraclostridium (10) Weissella (2) Ensifer (2)
Aurantisolimonas (9) Pseudomonas (8) Flavobacterium (12) Chitinimonas (10) Yersinia (2) Haliscomenobacter (2)
Luteimonas (9) Flavobacterium (8) Mycoplasma (12) Defluviimonas (10) Phreatobacter (2) Lactobacillus (2)
Runella (9) Caenimonas (8) Defluviimonas (12) Ensifer (10) Ignatzschineria (2) Mycoplasma (1)
Haliscomenobacter (9) Haliscomenobacter (8) Haliscomenobacter (12) Tychonema (10) Lactobacillus (2) A-N-P-R (1)
Caenimonas (8) Bosea (8) Crenobacter (11) Silvanigrella (10) Mycoplasma (1) Lysinibacillus (1)
Gemmobacter (8) Silvanigrella (8) Aurantisolimonas (10) Luteimonas (9) Paraclostridium (1) Silvanigrella (1)
Bosea (8) Paraclostridium (6) Fusobacterium (10) Peptostreptococcus (9) Silvanigrella (1) Shewanella (0)
Phreatobacter (7) Runella (5) Ensifer (9) Bacteroides (9) Bacteroides (1) Caenimonas (0)
Table S2: The ranks of genera in each degree distribution (in descending order) for uninfected fish and infected fish separately, from compositional graphical lasso (Comp-gLASSO), graphical lasso (gLASSO), and neighborhood selection (MB). The numbers in the parentheses are the corresponding degrees of the genera.