Bayes Factor of Zero Inflated Models under Jeffreys Prior
Abstract.
Microbiome omics data, including 16S rRNA sequencing data, reveal intriguing dynamic associations between the human microbiome and various disease states. Drastic changes in the microbiota can be associated with factors such as diet, hormonal cycles, diseases, and medical interventions. Along with the identification of specific bacterial taxa associated with diseases, recent advances give evidence that metabolism, genetics, and environmental factors can modulate these microbial effects. However, the current analytic methods for integrating microbiome data are not fully developed to address the main challenges of longitudinal metagenomics data, such as high dimensionality, intra-sample dependence, and zero-inflation of observed counts. Hence, we propose a Bayes factor approach for model selection among negative binomial, Poisson, zero-inflated negative binomial, and zero-inflated Poisson models with the non-informative Jeffreys prior. We find that, in both simulation studies and real data analysis, our Bayes factor outperforms the traditional Akaike information criterion and Vuong’s test. A new R package, BFZINBZIP, is introduced to carry out the simulation study and real data analysis and to facilitate Bayesian model selection based on the Bayes factor.
Key words:
Negative binomial distribution; zero-inflated negative binomial distribution; Poisson distribution; zero-inflated Poisson distribution; Bayes factor; non-informative Jeffreys prior; microbiome.
1 Introduction
The human microbiome consists of a collection of bacteria, whose number has been estimated by Sender et al. (2016), and their genes (Qin et al., 2010; Jiang et al., 2021). The human microbiome has a substantial impact on human health and disease (Ursell et al., 2012). In recent years, microbiome studies have successfully identified disease-associated bacterial taxa in type 2 diabetes (Karlsson et al., 2013), liver cirrhosis (Qin et al., 2014), inflammatory bowel disease (Halfvarson et al., 2017; Kakkat et al., 2023), and melanoma patients responsive to cancer immunotherapy (Frankel et al., 2017; Jiang et al., 2021; Dasgupta et al., 2023; Hertweck et al., 2023; Khan et al., 2023; Vikramdeo et al., 2023). Quantification of the human microbiome usually proceeds by 16S rRNA sequencing or metagenomic shotgun sequencing, where sequence read counts are often summarized into a taxa count table (Jiang et al., 2023). Bioinformatics tools such as Quantitative Insights Into Microbial Ecology (QIIME) and mothur are used for analyzing raw 16S rRNA sequencing data (Jovel et al., 2016; Zhang and Yi, 2020). In this literature, the word taxa refers to operational taxonomic units or other taxonomic or functional groups of bacterial sequences (Jiang et al., 2023; Altaweel et al., 2022). Although innovations in sequencing technology continue to advance microbiome studies, the statistical methods used in this field have not kept pace with these advances (Jiang et al., 2021). For example, metagenomic shotgun sequencing generates an increasingly large number of sequence reads, which give species- or strain-level taxonomic resolution (Segata et al., 2012). The subsequent statistical analysis assesses whether a species is linked to a phenotypic state or experimental condition (Jiang et al., 2021; Polansky and Pramanik, 2021; Maity et al., 2018b).
One commonly used statistical approach in the microbiome community involves comparing multiple taxa (Chen et al., 2012; Kelly et al., 2015; Wu et al., 2016; Jiang et al., 2021). These approaches do not identify differentially abundant species, which makes clinical interpretation, mechanistic insight, and biological validation difficult (Jiang et al., 2021; Pramanik and Polansky, 2023a). An alternative approach is to interrogate each individual bacterial taxon across different groups or conditions. La Rosa et al. (2015) use Wilcoxon rank-sum or Kruskal-Wallis tests for groupwise comparisons on microbiome compositional data. In more recent years, RNA-seq methods have been adapted to microbiome studies, such as the negative binomial regression model in DESeq2 (Love et al., 2014) and the overdispersed Poisson model in edgeR (Robinson et al., 2010; Maity et al., 2018a). However, these approaches are not optimized for microbiome datasets (Jiang et al., 2021). Microbial abundance is influenced by covariates such as metabolites, antibiotics, and host genetics (Pramanik, 2022a; Pramanik, 2022b; Pramanik, 2023c). To account for these confounding variables, the association between the microbiome and clinical confounders must be quantified. Pairwise correlations between all taxa and covariates are commonly used, but this method may be underpowered (Kinross et al., 2011; Maier et al., 2018; Zhu et al., 2018). Other approaches have been proposed to detect covariate-taxa associations, but these ignore the taxon-outcome associations (Pramanik, 2023a; Pramanik, 2023b). Recently, Li et al. (2018) and Schweizer et al. (2022) developed a multivariate zero-inflated logistic normal model to quantify the associations between microbiome abundances and multiple factors based on microbiome compositional data instead of count data (Jiang et al., 2021; Maity et al., 2020; Altaweel et al., 2019).
Recent studies have investigated the relationship between diseases and the human microbiome over time. For example, Vatanen et al. (2016) followed 222 infants from birth to age 3 to study their gut microbiome development and its associations with the increasing incidence of autoimmune diseases. Additionally, Romero et al. (2014) compared the vaginal microbiota of pregnant women who delivered preterm with that of women who delivered at term. Longitudinal metagenomic count data are often overdispersed (Pramanik, 2016; Hua et al., 2019) and sparse (Pramanik, 2021b). Two main categories of methods exist for handling such data sets. The first applies logarithmic or other transformations to the count data and then analyzes the transformed data with linear mixed models (Benson et al., 2010; La Rosa et al., 2014; Leamy et al., 2014; Wang et al., 2015). This approach performs poorly under certain conditions and fails to address the sparsity issue (O’Hara and Kotze, 2010). Zero-inflated Gaussian mixed models were introduced to address sparsity in longitudinal metagenomics data and can be used to analyze both transformed and untransformed metagenomic data (Zhang and Yi, 2020; Pramanik, 2020; Pramanik, 2021c; Pramanik and Polansky, 2021). The second category is generalized linear mixed models, which enable direct analysis of longitudinal metagenomic count data. Metagenomic count data can typically be analyzed similarly to RNA-Seq data, assuming a negative binomial distribution (Pramanik and Polansky, 2020; Zhang and Yi, 2020; Pramanik, 2021a; Pramanik and Polansky, 2023b).
Here, we propose a Bayesian integrative approach that computes Bayes factors to analyze microbiome count data (Polansky and Pramanik, 2021). Our approach jointly identifies differentially abundant taxa between two groups of women (i.e., pregnant and non-pregnant). The data comprise 16S rRNA gene sequence-based vaginal microbiota profiles, with samples collected from each subject at intervals of weeks, resulting in 143 taxa and N = 900 longitudinal samples (139 measurements from pregnant women and 761 measurements from non-pregnant women). Our experiments use the data sets of Romero et al. (2014) and Jiang et al. (2023). Count data with a large number of zeros (i.e., zero-inflation) are encountered in different fields such as medicine (Böhning et al., 1999), public health (Zhou and Tu, 2000; Maity et al., 2021a; Maity et al., 2020), environmental sciences (Agarwal et al., 2002), agriculture (Hall, 2000), manufacturing (Lambert, 1992), and counts of Orange-crowned Warblers in ponderosa (Garay et al., 2011; Maity and Paul, 2022; White and Bennetts, 1996). Zero-inflation, a common manifestation of overdispersion, refers to an incidence of zero counts that is higher than expected under the baseline count distribution (Garay et al., 2011). Since zero counts frequently have a special status in the statistical literature, they motivate the research in this paper. For example, a production engineer might count the number of defective items selected at random from a production process (Bayarri et al., 2008). If overdispersion in the raw data is caused by zero-inflation, then the zero-inflated Poisson (ZIP) model provides a standard framework for fitting the data (Garay et al., 2011; Lambert, 1992; Maity and Basu, 2023). According to Ghosh and Samanta (2002), when some production processes are in absolute states, zero defects occur more frequently (Bayarri et al., 2008).
One approach to address this issue is to use a two-parameter distribution in which the extra parameter permits a larger variance (Bhattacharya et al., 2008). The double exponential family, a two-parameter modification of a standard one-parameter exponential family, allows more variability than permitted by the single-parameter version (Bhattacharya et al., 2008; Efron, 1986). This is reasonable for count data distributions such as the Poisson, but it is not useful for modeling data inflated with zeros (Bhattacharya et al., 2008). The fundamental idea of the ZIP model is to mix a distribution degenerate at zero with a Poisson distribution (Garay et al., 2011). In other words, ZIP assumes that the population consists of two types of individuals: the first type always gives a zero count, and the second type gives a Poisson-distributed count (Ridout et al., 2001; Maity et al., 2019b; Maity, 2016).
If a data set with zero-inflation also exhibits overdispersion, a zero-inflated negative binomial (ZINB) model, a mixture of a distribution degenerate at zero with a baseline negative binomial distribution, is preferred over the ZIP model (Garay et al., 2011). Since overdispersion is a ramification of excess zeros, the data have excess variability and the ZIP model might not be a good fit (Garay et al., 2011). A multivariate random-parameter ZINB regression model for crash counts has been developed by Dong et al. (2014). A score test of ZIP regression models against ZINB alternatives has been proposed in Ridout et al. (2001) and Paul et al. (2018). A ZINB framework with a Gaussian process has been introduced by Li et al. (2021) to analyze spatial transcriptomics data under a Bayesian framework (Nam et al., 2022; Calvo et al., 2023). Jiang et al. (2021) and Maity et al. (2019a) used a ZINB regression model to perform an integrative analysis of microbiome data (Nam et al., 2022). Nam et al. (2022) discuss statistical inference for a zero-inflated binomial distribution, using objective Bayesian and frequentist approaches to obtain point and interval estimators of the model parameters. They also test for excess zeros in a zero-inflated binomial distribution and use Monte Carlo simulation to investigate the performance of the estimation and hypothesis testing procedures (Nam et al., 2022).
Since the baseline Poisson distribution fails to capture the overdispersion that remains after accounting for zero-inflation, and negative binomial models are more flexible than their Poisson counterparts in dealing with overdispersion, the ZIP model is not a good fit for such data (Garay et al., 2011; Lawless, 1987; Maity and Dey, 2018; Maity and Paul, 2023; Maity et al., 2021c; Maity et al., 2021b). Moreover, it is well known that the ZIP parameter estimators can be significantly biased when the non-zero counts are overdispersed relative to the Poisson distribution (Garay et al., 2011). Although there is considerable interest in testing for the presence of overdispersion in a given dataset, our main focus in this paper is on circumstances where the data exhibit overdispersion. Furthermore, we discuss Bayesian methodologies when a negative binomial (NB), ZINB, Poisson or ZIP model is fitted to the dataset. We investigate the effectiveness of our theoretical results through simulation and real data analysis based on the data sets of Romero et al. (2014), Ghosh et al. (2023) and Jiang et al. (2023). We introduce a new R package, BFZINBZIP, to facilitate model selection among the Poisson, NB, ZIP, and ZINB distributions.
A popular way to estimate parameters is to maximize the likelihood, or its natural logarithm, and various Bayesian approaches have also been developed (Maity and Paul, 2022; Sommerhalder et al., 2023). For example, a Poisson scale representation of the NB with a Gamma distribution as the mixing density has been discussed in Burrell (1990), Roy Sarkar et al. (2019), Beck et al. (2023) and Maity (2022), while a polynomial expansion and a power series expansion have been considered in Bradlow et al. (2002) and Bhattacharya et al. (2008), respectively. However, little attention has been given to Bayesian analysis of ZIP versus ZINB models. In particular, to the best of our knowledge, no work exists on posterior analysis under non-informative priors for these two models. We further compare our data-driven results for the Poisson versus NB, Poisson versus ZIP, NB versus ZINB, and ZIP versus ZINB models.
Let be a vector of observed count data such that each of the elements are independent and identically distributed, where ′ represents transposition of the vector. If follows an NB distribution then for all positive and , the probability density function (pdf) is defined as
or, if follows a ZINB then for all the pdf is
with mean and variance , where is represented as the zero-inflation parameter. On the other hand, if follows a ZIP distribution then for and a zero-inflation parameter the pdf is
where be an indicator function and is the Poisson density so that
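As a rough numerical sketch of the four count distributions discussed above, the following code evaluates Poisson, NB, ZIP and ZINB probability mass functions. The symbol names (`lam`, `r`, `q`, `p`) and the NB parametrization by a size `r` and a success probability `q` are assumptions made for illustration; the parametrization used in the paper may differ.

```python
import math

def poisson_pmf(y, lam):
    # Poisson mass function exp(-lam) * lam^y / y!, computed on the log scale
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def nb_pmf(y, r, q):
    # Assumed NB form: Gamma(y + r) / (Gamma(r) * y!) * q^r * (1 - q)^y
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(q) + y * math.log(1 - q))
    return math.exp(log_pmf)

def zero_inflated(base_pmf, p):
    # Mix a point mass at zero (weight p) with a baseline count distribution
    def pmf(y):
        return p * (1.0 if y == 0 else 0.0) + (1 - p) * base_pmf(y)
    return pmf

# Illustrative (hypothetical) parameter values
zip_pmf = zero_inflated(lambda y: poisson_pmf(y, 2.0), p=0.3)
zinb_pmf = zero_inflated(lambda y: nb_pmf(y, 1.5, 0.5), p=0.3)

total_zip = sum(zip_pmf(y) for y in range(200))
total_zinb = sum(zinb_pmf(y) for y in range(200))
```

Both totals should be numerically close to one, and the probability of a zero under the ZIP model is `p + (1 - p) * exp(-lam)`, which exceeds the Poisson probability of zero whenever `p > 0`.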
Many studies have used ZIP distributions, with and without covariates, to model count data (Bayarri et al., 2008). For instance, Lambert (1992) and Ghosh and Samanta (2002) used frequentist and Bayesian approaches, respectively, to explore industrial data sets through a ZIP regression model (Bayarri et al., 2008). A Bayesian score test has been developed in Bhattacharya et al. (2008) to test the null hypothesis against the alternative hypothesis (Bayarri et al., 2008). As in the frequentist score tests of Deng and Paul (2000, 2005) and Van den Broek (1995), the zero-inflation parameter is permitted to be negative in Bhattacharya et al. (2008), as long as the resulting density remains valid.
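In a Bayesian framework we are interested in testing
$H_0: Y_i \sim f_{P}$ versus $H_1: Y_i \sim f_{NB}$, (1)
$H_0: Y_i \sim f_{P}$ versus $H_1: Y_i \sim f_{ZIP}$, (2)
$H_0: Y_i \sim f_{NB}$ versus $H_1: Y_i \sim f_{ZINB}$, (3)
$H_0: Y_i \sim f_{ZIP}$ versus $H_1: Y_i \sim f_{ZINB}$, (4)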
where and are the densities of Poisson, NB, ZIP and ZINB distributions, respectively, and has density with being the parameters in model for all and represents any of Poisson, NB, ZIP and ZINB distributions based on the testing of hypotheses in 1-4.
The article is structured as follows: we first discuss convergence properties of the posterior distribution in Section 2. Next, we determine objective priors for the four distributions previously mentioned. Finally, we compute Bayes factors for hypotheses 1-4 and evaluate model performance on simulated data in Section 3. Two real-data analyses are presented in Section 4, and our conclusions are in Section 5.
2 Framework
The Bayesian methodology for choosing between two competing models proceeds by assessing the prior probability of each model and the prior distributions for the model parameters, and then computing the posterior probability of each model (Bayarri et al., 2008). The posterior probabilities are calculated from the prior probabilities and the Bayes factor, the ratio of the marginal likelihoods of the two models, which is a standard tool in Bayesian testing and model selection and is related to the Schwarz Bayesian information criterion (BIC) (Bayarri et al., 2008; Maity and Paul, 2022). Often, owing to scarce resources or lack of time, it is impossible to assess all the priors carefully in a subjective manner (Berger, 2006). In this setting, an objective Bayesian approach, which uses no external information beyond the formulation of the problem, gives a competent answer (Bayarri et al., 2008; Berger, 2006).
2.1 An overview of the Bayes factor in model selection
Let there be models so that and , and these models contend with each other in determining the most relevant model. If model holds, then for a parametric space of such that , is a probability measure on a measurable space such that for each , is Borel measurable (Ghosh and Ramamoorthi,, 2010), follows a parametric distribution with pdf . It is convenient that as the coordinate random variable defined on the sample space and as the i.i.d. product measure defined on (Ghosh and Ramamoorthi,, 2010). Define the space and be the -fold product of . Bayesian model selection proceeds by choosing a prior density under model for a set of parameters , and the prior model probability of before data are observed so that . Therefore, the marginal or predictive likelihood corresponding to model is defined as
where is the likelihood function under model . Therefore, the posterior probability under the assumption that model is true can be written as
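A minimal sketch of this step, under the standard form of Bayes' theorem for models: the posterior probability of each model is its prior model probability times its marginal likelihood, normalized over all candidate models. The function name and the numerical inputs below are hypothetical, and the computation is done on the log scale to avoid underflow.

```python
import math

def posterior_model_probs(log_marginals, prior_probs):
    # log(prior_k * m_k(y)) for each candidate model k
    log_terms = [math.log(p) + lm for p, lm in zip(prior_probs, log_marginals)]
    # Subtract the maximum before exponentiating (log-sum-exp trick)
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Two models with equal prior probabilities and hypothetical log marginal likelihoods
probs = posterior_model_probs([-120.0, -118.0], [0.5, 0.5])
```

With equal prior probabilities, the ratio of the two posterior probabilities equals the Bayes factor, here `exp(2)` in favor of the second model.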
Definition 1.
For all n, let be a posterior probability for given values . The sequence
is said to be consistent at if there exists a with so that if , then for every neighborhood of ,
Remark 1.
When the metric space constructs a base for the neighborhood of , and therefore the set of measure 1 can be allowed to depend upon (Ghosh and Ramamoorthi, 2010). Hence, it is enough to show that for each of , almost everywhere of .
Lemma 1.
Let be a probability measure dominated by a -finite measure and be the density of . Assume be an interior point of (i.e. ) and , be two continuous prior densities w.r.t. a measure . If posterior densities , are both consistent at then
Proof.
See the Appendix. ∎
Lemma 2.
Let and be posterior probabilities for given values of such that for some constants , ,
Then their convolution can be written as
as .
Proof.
See the Appendix. ∎
Following Ghosh and Ramamoorthi (2010), we discuss the Bernstein-von Mises theorem on the asymptotic normality of the posterior distribution . If a consistent global likelihood estimator exists, then under differentiability it is easy to verify that for all , it is a consistent solution of the likelihood equation a.s. (Ghosh and Ramamoorthi, 2010). To show the consistency of the posterior distribution we need the following Assumption 1 on the density function .
Assumption 1.
(i). For model , takes the same value for all .
(ii). Suppose the likelihood function under model is defined as for all is thrice differentiable with respect to in the neighborhood of so that
Then the expectations at corresponding to likelihood are and with
(iii). The order of expectation w.r.t. and differentiation w.r.t. can be interchanged, so that and .
(iv). Fisher information set .
(v). Define . Then for all , there so that
(vi). The prior density under model is Lebesgue measurable, continuous and positive at .
Proposition 1.
For model consider the density for all satisfies Assumption 1. Let be the posterior density of under model . Then
Proof.
See the Appendix. ∎
For a given data set the model with the largest posterior probability is the most favorable model (Nam et al.,, 2022). Moreover, the Bayes factor for model with respect to can be expressed as
Although we have four models corresponding to the Poisson, NB, ZIP and ZINB distributions, we test two models at a time, as written in hypotheses 1-4. Therefore, throughout this paper (hence, ). The Bayes factor of model with respect to is
Since each hypothesis consists of two models, we have and
Furthermore, we choose model as the true model if
and choose model as the true model if . Following Kass and Vaidyanathan (1992) and Wasserman (2000), using a non-informative prior (discussed in the next section) yields the general interpretation of Bayes factors given in Table 1.
Table 1: Bayes factors with their meanings.
Bayes Factor | Description |
---|---|
 | Strong presence of model . |
 | Moderate presence of model . |
 | Weak presence of model . |
 | Weak presence of model . |
 | Moderate presence of model . |
 | Strong presence of model . |
2.2 Objective priors in models with Poisson, NB, ZIP and ZINB distributions
A central difficulty in Bayesian analysis is choosing an appropriate prior for the parameters of each model. Subjective Bayesian inference theory suggests that the prior should be based on a person’s prior opinion about the parameters (Wasserman, 2000). A more common approach to Bayesian model selection is based on objective theory, where the prior is chosen to be noninformative in some sense (Wasserman, 2000). The philosophical thinking behind this approach can be found in Kass and Wasserman (1996). It is well known that if the common parameters are orthogonal to the rest of the parameters in each of the models, that is, if the Fisher information matrix is block diagonal, they can be assigned the same prior distribution (Bayarri et al., 2008; Jeffreys, 1961; Kass and Vaidyanathan, 1992). Because the common parameters then receive the same improper prior under both models, the arbitrary constant cancels in the Bayes factor, and we can use a noninformative (or improper) prior in our case. A widely recognized noninformative prior is Jeffreys’ prior, defined as $\pi(\theta) \propto |I(\theta)|^{1/2}$, where $I(\theta)$ is the Fisher information matrix as defined in Assumption 1. For example, if $Y \sim \mathcal{N}(\theta, \mathbb{I})$, then Jeffreys’ prior is the flat prior $\pi(\theta) \propto 1$, where $\mathbb{I}$ and $\mathcal{N}$ represent an identity matrix and a multivariate normal distribution, respectively (Wasserman, 2000).
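As a brief worked example of a Jeffreys prior (a standard derivation, with $\lambda$ as an assumed symbol for the Poisson mean), the prior for a single Poisson observation follows from the Fisher information:

```latex
\begin{aligned}
\log f(y \mid \lambda) &= -\lambda + y \log \lambda - \log y!, \\
I(\lambda) &= -\mathbb{E}\left[\frac{\partial^2}{\partial \lambda^2} \log f(Y \mid \lambda)\right]
            = \mathbb{E}\left[\frac{Y}{\lambda^2}\right] = \frac{1}{\lambda}, \\
\pi(\lambda) &\propto |I(\lambda)|^{1/2} = \lambda^{-1/2}.
\end{aligned}
```

This prior is improper, since $\lambda^{-1/2}$ does not integrate to a finite value over $(0, \infty)$, which is why the cancellation of the arbitrary constant in the Bayes factor matters.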
Since and in ZIP are not orthogonal, following Bayarri et al., (2008) with the density function can be reparametrized as
(5) |
where is the zero-truncated Poisson distribution with parameter such that . Therefore, the expression for model is
(6) |
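The idea behind the orthogonal reparametrization can be checked numerically. The sketch below assumes the reparametrization takes the form `p_star = p + (1 - p) * exp(-lam)`, so that `p_star` is the total probability of a zero and the positive counts follow a zero-truncated Poisson distribution; the symbol names are assumptions, since the displayed equations are not reproduced here.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y, p, lam):
    # Original parametrization: point mass at zero mixed with a Poisson
    return p * (1.0 if y == 0 else 0.0) + (1 - p) * poisson_pmf(y, lam)

def zip_pmf_orthogonal(y, p_star, lam):
    # Assumed orthogonal form: p_star is the total probability of a zero,
    # and positive counts follow a zero-truncated Poisson
    if y == 0:
        return p_star
    truncated = poisson_pmf(y, lam) / (1 - math.exp(-lam))
    return (1 - p_star) * truncated

p, lam = 0.3, 2.0  # hypothetical values
p_star = p + (1 - p) * math.exp(-lam)
max_gap = max(abs(zip_pmf(y, p, lam) - zip_pmf_orthogonal(y, p_star, lam))
              for y in range(50))
```

The two parametrizations describe the same distribution, so `max_gap` should be at the level of floating-point rounding error.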
According to the suggestions in Maity and Paul, (2022) with and for all the density function can be represented as
(7) |
where
is the zero-truncated negative binomial distributions with parameter such that the expression for model is
(8) |
As suggested by Bayarri et al., (2008), Jeffreys prior can be used for the common parameter and a proper prior for the extra parameters. It is well known that the Jeffreys prior for in Poisson model, and for and in negative binomial model are and , respectively (Bayarri et al.,, 2008; Maity and Paul,, 2022). The Jeffreys prior for orthogonal ZIP (i.e., the Jeffreys prior of ) can be expressed as
In a similar fashion, the Jeffreys prior for the orthogonal ZINB (i.e., the Jeffreys prior of ) can be expressed as
where,
(9) |
for all . The derivation of Equation (9) is presented in the Appendix. Since we need to choose a single prior for both the NB and ZINB models, and since Maity and Paul (2022) show that working with either of and adds negligible error to the computation, we choose the simpler prior version of for both the NB and ZINB cases. In a similar fashion, the simpler prior can be used for the Poisson and ZIP cases (Bayarri et al., 2008). Under the orthogonal ZIP model, a proper prior for given is the uniform distribution over , which is
Similarly, for the orthogonal ZINB model, a proper prior for given is the uniform distribution over the interval , which can be expressed as
2.3 Objective Bayes factor in models with Poisson, NB, ZIP and ZINB distributions
In this section we are going to determine objective Bayes factors for each of the models explained in 1-4. For a sample of counts let be the number of zero observations, and be the total count. It is important to note that (Bayarri et al.,, 2008). Therefore, by Bayarri et al., (2008) for given data set
and by Maity and Paul, (2022)
For the marginal likelihood function of Poisson and ZIP distributions with Jeffreys priors and respectively are
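To illustrate how such a marginal likelihood is obtained, the sketch below treats the Poisson case only, assuming the Jeffreys prior enters with unit constant as `lam ** -0.5`. Under that assumption the integral has the closed form `Gamma(s + 1/2) / (n ** (s + 1/2) * prod(y_i!))`, where `n` is the sample size and `s` the total count; the code compares it with a simple numerical integration. The function names and the data are hypothetical.

```python
import math

def log_marginal_poisson(counts):
    # Closed form under the assumed prior lam^(-1/2):
    # Gamma(s + 1/2) / (n^(s + 1/2) * prod(y_i!))
    n, s = len(counts), sum(counts)
    return (math.lgamma(s + 0.5) - (s + 0.5) * math.log(n)
            - sum(math.lgamma(y + 1) for y in counts))

def log_marginal_poisson_numeric(counts, upper=60.0, steps=20000):
    # Midpoint rule after substituting lam = u^2, so that
    # lam^(-1/2) d(lam) = 2 du and the singularity at zero disappears
    n, s = len(counts), sum(counts)
    log_const = -sum(math.lgamma(y + 1) for y in counts)
    h = math.sqrt(upper) / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        lam = u * u
        total += 2.0 * math.exp(-n * lam + s * math.log(lam)) * h
    return log_const + math.log(total)

counts = [0, 2, 1, 0, 3]  # hypothetical sample
closed = log_marginal_poisson(counts)
numeric = log_marginal_poisson_numeric(counts)
```

A Bayes factor is then the ratio of two such marginal likelihoods, or equivalently the exponential of the difference of their logarithms.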
On the other hand, following Maity and Paul (2022), the marginal likelihood function of the NB distribution with Jeffreys prior is
where “Beta” represents a beta integration. Similarly by Maity and Paul, (2022), the marginal likelihood function of ZINB distribution with Jeffreys prior is
The Bayes factor of the NB model against the Poisson model (i.e., Hypothesis 1) is
(10) |
It is important to note that, the Bayes factor is increasing in total count for any given and . When or equivalently all counts are zero (), . Following Bayarri et al., (2008) the Bayes factor of the ZIP model against the Poisson model (i.e., Hypothesis 2) is
Bayarri et al., (2008) suggests that when , and which implies . In this case, for a given , is increasing in for any fixed , and is increasing in for any given (Bayarri et al.,, 2008). Now the Bayes factor of the ZINB model against the NB model (i.e., Hypothesis 3) is
For any given and , if then . Finally, the Bayes factor of the ZINB model against the ZIP model (i.e., Hypothesis 4) is
It can be easily verified that for any given and , is strictly increasing in . Furthermore, when , .
3 Simulation Study
In this section we carry out a series of simulation studies to estimate some operating characteristics of the Bayes factors derived in the previous Section.
3.1 Bayes factor of Negative Binomial against Poisson
In the first experiment, we generate 1000 simulated datasets from either the NB distribution or the Poisson distribution with different parameter settings. The exact values of the parameters are given in Table 2. For each simulation, we compute the Bayes factor derived in Section 2.3, that is, the evidence for the NB distribution against the Poisson distribution. Note that, when computing the Bayes factor, has been assumed to be fixed as discussed in Section 2. Empirically, it has been noted that offers the best outcome.
In the following, the Bayes factor is said to favor the NB model against the Poisson model if the computed log(Bayes factor) exceeds log(3.2) or log(10). If it exceeds log(3.2), the evidence is substantial, and if it exceeds log(10), there is strong evidence that the model under consideration is an NB model (see Table 1). On the other hand, if the computed log(Bayes factor) is less than -log(3.2) or -log(10), then the evidence is substantial or strong, respectively, in favor of the Poisson model. In terms of the notation introduced in Section 2, if we denote the NB and Poisson models by and , then Table 1 is directly applicable to draw the inference.
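The decision rule can be sketched as a small helper. The thresholds log(3.2) and log(10) come from the text; mirroring them on the negative side for the Poisson model, the label "inconclusive" for values in between, and the function name are assumptions made for illustration.

```python
import math

def interpret_log_bayes_factor(log_bf, model_1="NB", model_0="Poisson"):
    # log_bf is the log Bayes factor of model_1 against model_0
    if log_bf > math.log(10):
        return f"strong evidence for {model_1}"
    if log_bf > math.log(3.2):
        return f"substantial evidence for {model_1}"
    # Assumed symmetric thresholds for evidence in favor of model_0
    if log_bf < -math.log(10):
        return f"strong evidence for {model_0}"
    if log_bf < -math.log(3.2):
        return f"substantial evidence for {model_0}"
    return "inconclusive"

labels = [interpret_log_bayes_factor(x)
          for x in (math.log(20), math.log(5), 0.0, -math.log(5), -math.log(20))]
```

In this sketch, BF3 and BF10 correspond to counting the datasets whose label is at least "substantial" or "strong", respectively, for the data-generating model.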
Data Generating Model | BF3 | BF10 | Vuong | AIC | |||
---|---|---|---|---|---|---|---|
0.5 | 1.5 | 0.5 | NB | 969 | 900 | 59 | 628 |
Pois | 66 | 17 | 123 | 928 | |||
0.5 | 0.5 | NB | 1000 | 998 | 518 | 979 | |
Pois | 68 | 14 | 126 | 945 | |||
0.5 | 1.5 | NB | 1000 | 1000 | 999 | 1000 | |
Pois | 75 | 19 | 126 | 941 | |||
1 | 1.5 | 0.5 | NB | 972 | 897 | 46 | 597 |
Pois | 431 | 304 | 116 | 933 | |||
0.5 | 0.5 | NB | 999 | 995 | 517 | 973 | |
Pois | 426 | 301 | 116 | 942 | |||
0.5 | 1.5 | NB | 1000 | 1000 | 995 | 1000 | |
Pois | 393 | 263 | 100 | 935 | |||
3 | 1.5 | 0.5 | NB | 961 | 893 | 63 | 608 |
Pois | 988 | 980 | 102 | 933 | |||
0.5 | 0.5 | NB | 999 | 996 | 533 | 976 | |
Pois | 989 | 982 | 106 | 934 | |||
0.5 | 1.5 | NB | 1000 | 1000 | 996 | 1000 | |
Pois | 992 | 986 | 58 | 942 | |||
5 | 1.5 | 0.5 | NB | 963 | 929 | 63 | 619 |
Pois | 1000 | 1000 | 84 | 935 | |||
0.5 | 0.5 | NB | 1000 | 998 | 519 | 978 | |
Pois | 1000 | 1000 | 104 | 936 | |||
0.5 | 1.5 | NB | 1000 | 1000 | 996 | 1000 | |
Pois | 1000 | 1000 | 111 | 927 |
Table 2 summarizes how many times, out of 1000 simulations, each criterion selects the data-generating model. In addition to the Bayes factor comparisons, we include the outcomes of Vuong’s test (Vuong, 1989) and the Akaike information criterion (AIC; Akaike, 1998). The R package nonnest2 (Merkle and You, 2020; R Core Team, 2021) was used to carry out Vuong’s test.
It is evident from Table 2 that the Bayes factor is superior in selecting the correct model when the data-generating model is an NB distribution. It is also superior when the data-generating model is a Poisson model, provided the mean of the Poisson distribution is high. Moreover, the criterion log(Bayes factor) greater than log(3.2) (BF3) selects the data-generating model at least as often as the criterion log(Bayes factor) greater than log(10) (BF10), because its threshold is less stringent. For instance, in the row of Table 2 whose parameter entries are 5, 1.5 and 0.5, when the sample is simulated from an NB distribution, BF3 and BF10 recover the NB distribution 963 and 929 times, respectively, whereas AIC indicates the NB model for 619 of the 1000 simulated datasets. With the same data-generating parameters, when the data are simulated from a Poisson distribution, BF3 and BF10 recover the Poisson model 1000 times each, outperforming AIC, which favors the Poisson model 935 times. Note that the performance of Vuong’s test remains inferior throughout the simulation studies.
3.2 Bayes factor of Zero Inflated Poisson against Poisson
In the second experiment, we generate 1000 simulated datasets from either the zero-inflated Poisson (ZIP) distribution or the Poisson distribution with different parameter settings. The exact values of the parameters are given in Table 3. The data generation and the inference follow the same path as in the first simulated example.
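A minimal sketch of how one ZIP dataset might be generated and its proportion of zeros checked is given below. The sample size, the seed, the inflation probability `p = 0.5` and the helper names are all hypothetical; the inflation probabilities actually used in the study are not stated in this section.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's multiplication method; adequate for small lam
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_zip(n, p, lam, rng):
    # With probability p emit a structural zero, otherwise a Poisson draw
    return [0 if rng.random() < p else sample_poisson(lam, rng) for _ in range(n)]

def expected_zero_fraction(p, lam):
    # Structural zeros plus Poisson zeros
    return p + (1 - p) * math.exp(-lam)

rng = random.Random(2024)
data = sample_zip(20000, p=0.5, lam=0.5, rng=rng)
observed = sum(1 for y in data if y == 0) / len(data)
expected = expected_zero_fraction(0.5, 0.5)
```

Under these assumed values the expected zero fraction is about 0.803, which is close to one of the percentages of zeros reported for the smallest Poisson mean in Table 3, and the simulated fraction should agree with it up to sampling noise.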
Poisson mean | Percentage (%) of Zeros | Data Generating Model | BF3 | BF10 | Vuong | Inflation | AIC |
---|---|---|---|---|---|---|---|
0.5 | 97.7 | ZIP | 415 | 294 | 36 | 2 | 415 |
0.5 | 60.9 | Pois | 559 | 28 | 45 | 814 | 939 |
0.5 | 90.3 | ZIP | 590 | 362 | 80 | 50 | 644 |
0.5 | 60.8 | Pois | 567 | 35 | 51 | 790 | 935 |
0.5 | 80.3 | ZIP | 390 | 191 | 36 | 195 | 519 |
0.5 | 60.6 | Pois | 578 | 23 | 38 | 789 | 929 |
0.5 | 70.6 | ZIP | 135 | 48 | 9 | 178 | 258 |
0.5 | 60.5 | Pois | 560 | 18 | 44 | 799 | 943 |
1 | 96.8 | ZIP | 765 | 633 | 173 | 75 | 765 |
1 | 37 | Pois | 810 | 349 | 46 | 388 | 922 |
1 | 84.3 | ZIP | 937 | 859 | 508 | 743 | 958 |
1 | 36.8 | Pois | 815 | 320 | 47 | 409 | 944 |
1 | 68.2 | ZIP | 869 | 715 | 419 | 935 | 949 |
1 | 36.6 | Pois | 795 | 322 | 47 | 385 | 926 |
1 | 52.5 | ZIP | 390 | 220 | 77 | 806 | 643 |
1 | 36.7 | Pois | 803 | 338 | 50 | 390 | 918 |
3 | 95.2 | ZIP | 995 | 989 | 810 | 868 | 995 |
3 | 4.9 | Pois | 959 | 853 | 72 | 207 | 926 |
3 | 76.5 | ZIP | 1000 | 1000 | 999 | 1000 | 1000 |
3 | 4.9 | Pois | 963 | 845 | 66 | 194 | 937 |
3 | 52.5 | ZIP | 1000 | 1000 | 1000 | 1000 | 1000 |
3 | 4.9 | Pois | 959 | 845 | 72 | 203 | 934 |
3 | 28.6 | ZIP | 1000 | 999 | 995 | 1000 | 1000 |
3 | 5 | Pois | 956 | 849 | 80 | 228 | 934 |
5 | 95 | ZIP | 1000 | 1000 | 1000 | 1000 | 1000 |
5 | 1.3 | Pois | 961 | 884 | 0 | 642 | 884 |
5 | 74.9 | ZIP | 1000 | 1000 | 1000 | 1000 | 1000 |
5 | 1.4 | Pois | 962 | 879 | 0 | 710 | 881 |
5 | 50.3 | ZIP | 1000 | 1000 | 1000 | 1000 | 1000 |
5 | 1.4 | Pois | 964 | 892 | 0 | 620 | 891 |
5 | 25.5 | ZIP | 1000 | 1000 | 1000 | 1000 | 1000 |
5 | 1.4 | Pois | 964 | 898 | 0 | 652 | 896 |
Table 3 summarizes how many times, out of 1000 simulations, the data-generating model is selected using the Bayes factor comparisons, Vuong’s test and the AIC criterion. An additional outcome (the Inflation column) is obtained with the R package performance, written by Lüdecke et al. (2021). This package offers functionality to check whether an excessive number of zeros is present in the data. If the number of observed zeros is determined to be larger than expected, one can conclude that the data follow a zero-inflated distribution.
It is evident from Table 3 that the Bayes factor remains superior in selecting the correct model, particularly when the mean of the Poisson distribution is large. The other inferences remain similar to those of the first simulated example.
3.3 Bayes factor of Zero Inflated Negative Binomial against Negative Binomial
In the third experiment, we generate 1000 simulated datasets from either the zero-inflated Negative Binomial (ZINB) distribution or the Negative Binomial (NB) distribution under different parameter settings. The exact parameter values are given in Table 4. Data generation and inference follow the same steps as in the previous examples.
| | | Percentage (%) of Zeros | Data Generating Model | BF3 | BF10 | Vuong | Inflation | AIC |
|---|---|---|---|---|---|---|---|---|
| 1.5 | 0.5 | 96.9 | ZINB | 1000 | 1000 | 54 | 32.6 | 848 |
| | | 45.5 | NB | 40 | 0 | 20 | 910 | 470 |
| | | 86.8 | ZINB | 1000 | 1000 | 60 | 0 | 820 |
| | | 46.5 | NB | 50 | 0 | 30 | 909 | 485 |
| | | 73.8 | ZINB | 1000 | 1000 | 120 | 0 | 870 |
| | | 46.2 | NB | 60 | 10 | 20 | 930 | 440 |
| | | 59.8 | ZINB | 1000 | 990 | 40 | 0 | 720 |
| | | 47.0 | NB | 40 | 10 | 0 | 879 | 374 |
| 0.5 | 0.5 | 97.6 | ZINB | 1000 | 1000 | 34 | 966 | 896 |
| | | 57.3 | NB | 0 | 0 | 20 | 1000 | 450 |
| | | 92.4 | ZINB | 1000 | 1000 | 20 | 0 | 808 |
| | | 71.0 | NB | 0 | 0 | 0 | 1000 | 410 |
| | | 85.2 | ZINB | 1000 | 1000 | 20 | 0 | 737 |
| | | 71.0 | NB | 0 | 0 | 10 | 1000 | 420 |
| | | 77.6 | ZINB | 1000 | 1000 | 20 | 0 | 600 |
| | | 58.5 | NB | 0 | 0 | 10 | 1000 | 440 |
| 5 | 5 | 94.9 | ZINB | 1000 | 1000 | 646 | 545 | 1000 |
| | | 3.3 | NB | 1000 | 1000 | 20 | 265 | 510 |
| | | 75.9 | ZINB | 1000 | 1000 | 900 | 200 | 1000 |
| | | 3.2 | NB | 1000 | 10000 | 0 | 280 | 480 |
| | | 51.4 | ZINB | 1000 | 1000 | 987 | 953 | 1000 |
| | | 3.3 | NB | 1000 | 1000 | 41 | 301 | 499 |
| | | 27.7 | ZINB | 910 | 860 | 930 | 1000 | 1000 |
| | | 3.3 | NB | 1000 | 1000 | 20.8 | 271 | 521 |
Table 4 reports how many times the zero-inflated model is selected out of 1000 simulations by the Bayes factor comparisons, Vuong’s test, the inflation check, and the AIC. It is evident from Table 4 that the Bayes factor remains superior in selecting the correct model, particularly when the parameters of the NB distribution are large. The other inferences remain similar to those of the previous simulated examples.
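The data-generating step of this experiment can be sketched as follows. This is a minimal illustration and not the code used for Table 4: it assumes the NB distribution is parametrized by a size `r` and a success probability `p` (so that the mean is `r * (1 - p) / p`), draws NB counts as a gamma-Poisson mixture, and uses hypothetical values `pi = 0.3`, `r = 5`, `p = 0.5` that are not taken from the table. Under these assumptions the probability of a zero is `pi + (1 - pi) * p**r`, which the sketch compares with the simulated percentage of zeros.

```python
import math
import random


def rpois(lam, rng):
    """One Poisson(lam) draw via Knuth's multiplication method (moderate lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def rnbinom(r, p, rng):
    """One NB(r, p) draw as a gamma-Poisson mixture (gamma shape r, scale (1 - p) / p)."""
    return rpois(rng.gammavariate(r, (1 - p) / p), rng)


def rzinb(n, pi, r, p, rng):
    """n draws from ZINB: a structural zero with probability pi, else NB(r, p)."""
    return [0 if rng.random() < pi else rnbinom(r, p, rng) for _ in range(n)]


def zinb_zero_prob(pi, r, p):
    """Probability of observing a zero under the assumed ZINB parametrization."""
    return pi + (1 - pi) * p ** r


rng = random.Random(7)
pi, r, p = 0.3, 5.0, 0.5  # hypothetical values for illustration only
y = rzinb(20000, pi, r, p, rng)
observed = y.count(0) / len(y)
expected = zinb_zero_prob(pi, r, p)
print(f"simulated share of zeros: {observed:.3f}, theoretical: {expected:.3f}")
```

Setting `pi = 0` in `rzinb` yields plain NB data, so the same function covers both data-generating models of the experiment.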
3.4 Bayes factor of Zero Inflated Negative Binomial against Zero Inflated Poisson
In the last experiment, we generate 1000 simulated datasets from either the zero-inflated Negative Binomial (ZINB) distribution or the zero-inflated Poisson (ZIP) distribution under different parameter settings. The exact parameter values are given in Table 5. Data generation and inference follow the same steps as in the previous examples.
| | | | Percentage (%) of Zeros | Data Generating Model | BF3 | BF10 | Vuong | AIC |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0.5 | 84 | ZINB | 587 | 587 | 49 | 665 |
| | | | 68.1 | ZIP | 990 | 990 | 61 | 624 |
| 1 | 5 | 5 | 53 | ZINB | 1000 | 1000 | 278 | 994 |
| | | | 69.4 | ZIP | 989 | 989 | 45 | 603 |
| 3 | 0.5 | 0.5 | 86.6 | ZINB | 579 | 579 | 41 | 665 |
| | | | 52.6 | ZIP | 518 | 518 | 101 | 556 |
| 3 | 5 | 5 | 53.2 | ZINB | 999 | 999 | 281 | 994 |
| | | | 54.4 | ZIP | 518 | 518 | 84 | 573 |
4 Model Selection in Microbiome Data
In this section, we apply the Bayes factor computations discussed above to real data from a case-control study. The objective of the original experiment was to gain knowledge of the vaginal microbiota throughout pregnancy. Toward this end, a longitudinal case-control study was designed with 22 pregnant women who delivered at term (38 to 42 weeks) without complications and 32 non-pregnant women. Serial samples of vaginal fluid were collected from both non-pregnant and pregnant patients. The data comprise 16S rRNA gene sequence-based vaginal microbiota profiles, with samples collected from each subject at intervals of weeks, resulting in 143 taxa and N = 900 longitudinal samples (139 measurements from pregnant women and 761 measurements from non-pregnant women). For more details on the experiment, see Romero et al. (2014); see also Jiang et al. (2023).
For the analysis, we focused on two specific phylotypes: Lactobacillus.iners and Atopobium. Each dataset contained 900 observations, with the first dataset having 15.1% zeros and the second dataset having 66.3% zeros. We computed the log(Bayes factor) and the AIC for four models: Negative Binomial (NB), Poisson, Zero-Inflated Negative Binomial (ZINB), and Zero-Inflated Poisson (ZIP).
Table 6 presents the computed log(Bayes factor) values for the microbiome data, while Table 7 displays the AIC values for each model. Note that a Negative Binomial model cannot be fitted to the data because the underlying maximization process does not converge. For the same reason, the Bayes factor of the NB against the Poisson model cannot be computed.
For the first dataset, the log(Bayes factor) values of ZINB against NB and of ZIP against Poisson are 829.0 and 171854.9, respectively, which favor a zero-inflated model for the data. Moreover, the log(Bayes factor) of ZINB against ZIP is -13686110.0, which implies that one should fit a zero-inflated Poisson model to the data. On the other hand, the AIC supports fitting a zero-inflated Negative Binomial model to the data. Romero et al. (2014) concluded in favor of fitting a negative binomial model.
| Example | Model | log(Bayes factor) |
|---|---|---|
| 1 | NB vs. Poisson | – |
| | ZINB vs. NB | 829.0 |
| | ZIP vs. Poisson | 171854.9 |
| | ZINB vs. ZIP | -13686110.0 |
| 2 | NB vs. Poisson | – |
| | ZINB vs. NB | 1172.6 |
| | ZIP vs. Poisson | 5073.6 |
| | ZINB vs. ZIP | 120266.8 |

| Example | Model | AIC |
|---|---|---|
| 1 | NB | – |
| | Poisson | 1667918.0 |
| | ZINB | 12513.0 |
| | ZIP | 1324204.0 |
| 2 | NB | – |
| | Poisson | 24913.9 |
| | ZINB | 3342.2 |
| | ZIP | 14763.3 |
A very similar analysis concludes that a ZINB model is the appropriate one for the second dataset. This conclusion is supported by both the Bayes factor and the AIC. Furthermore, this inference accords with the findings of Romero et al. (2014).
Overall, the Bayes factor and AIC analyses provide insights into selecting the appropriate models for further analysis of the vaginal microbiota data obtained from the case-control study.
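The decisions described in this section can be retraced from the reported numbers with a short Python sketch. The values are copied from Tables 6 and 7; the decision rule is an assumption made for illustration, namely that a model is selected when the Bayes factor in its favor exceeds a threshold (3 by default here, in the spirit of BF3) and that the reported logarithms are natural logarithms. With values of this magnitude, neither assumption affects the outcome. The function names are hypothetical.

```python
import math

# log(Bayes factor) of the first model against the second, copied from Table 6.
# The NB vs. Poisson entries are omitted because they could not be computed.
LOG_BF = {
    1: {("ZINB", "NB"): 829.0, ("ZIP", "Poisson"): 171854.9, ("ZINB", "ZIP"): -13686110.0},
    2: {("ZINB", "NB"): 1172.6, ("ZIP", "Poisson"): 5073.6, ("ZINB", "ZIP"): 120266.8},
}

# AIC values copied from Table 7 (no NB entry, since the NB fit did not converge).
AIC = {
    1: {"Poisson": 1667918.0, "ZINB": 12513.0, "ZIP": 1324204.0},
    2: {"Poisson": 24913.9, "ZINB": 3342.2, "ZIP": 14763.3},
}


def choose_by_bf(log_bf, first, second, threshold=3.0):
    """Select a model from the log Bayes factor of `first` against `second`."""
    cut = math.log(threshold)
    if log_bf > cut:
        return first
    if log_bf < -cut:
        return second
    return "undecided"


def choose_by_aic(aic_by_model):
    """Select the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


for example in (1, 2):
    for (first, second), value in LOG_BF[example].items():
        pick = choose_by_bf(value, first, second)
        print(f"Example {example}: {first} vs. {second} -> {pick}")
    print(f"Example {example}: smallest AIC -> {choose_by_aic(AIC[example])}")
```

For the first dataset the sketch reproduces the disagreement noted above (the Bayes factor points to ZIP, the AIC to ZINB), whereas for the second dataset both criteria point to ZINB.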
5 Discussion
In recent years, significant effort has been made in the longitudinal metagenomics literature to investigate dynamic associations between microbial symbiosis and the development of many diseases, such as inflammatory bowel diseases (Sharpton et al., 2017; Minar, 2018a), colorectal cancers (Liang et al., 2014), Parkinson’s disease (Yang et al., 2018; Minar, 2018b; Minar, 2019), preterm birth (Stewart et al., 2017), and autoimmune diseases (Vatanen et al., 2016; Zhang and Yi, 2020; Roy et al., 2023c; Roy et al., 2023a). The studies discussed above used either 16S rRNA or whole-metagenome shotgun sequencing technologies to generate longitudinal metagenomics data (Zhang and Yi, 2020; Roy et al., 2023b). While the bioinformatics tools for processing 16S rRNA sequencing data give counts, whole-metagenome shotgun sequencing data give either counts or proportions. In this article, we considered longitudinal metagenomic count data generated from 16S rRNA sequencing-based vaginal microbiota. Since the objective was to gain knowledge of the vaginal microbiota throughout pregnancy, a longitudinal case-control study was designed with 22 pregnant women who delivered at term (38 to 42 weeks) without complications and 32 non-pregnant women, and serial samples of vaginal fluid were collected from both non-pregnant and pregnant patients. We analyzed two specific phylotypes: Lactobacillus.iners and Atopobium. Each dataset contained 900 observations, with the first dataset having 15.1% zeros and the second dataset having 66.3% zeros. We computed the log(Bayes factor) and the AIC for four models: Negative Binomial (NB), Poisson, Zero-Inflated Negative Binomial (ZINB), and Zero-Inflated Poisson (ZIP).
In this article, we presented the Poisson, NB, ZIP, and ZINB distributions for analyzing high-throughput sequencing microbiome data. First, we verified some convergence and measurability properties of the posterior distribution. Second, the Jeffreys prior was calculated for the ZINB distribution. Then the presence of over-dispersion was tested using Bayesian methodology. We introduced the Bayes factor for ZINB and ZIP and used it for model selection in the presence of over-dispersed data. For each of the four distributions, we used the non-informative Jeffreys prior and determined the Bayes factors corresponding to hypotheses 1-4. We conducted simulation studies of the distributions with different parameters to assess the effectiveness of the Bayes factors (i.e., BF3 and BF10) compared with the traditional AIC and Vuong’s test. We showed that BF3 and BF10 outperformed AIC and Vuong’s test in every case. For example, in the case of NB versus Poisson with , , and , when a sample is generated from an NB distribution, BF3 and BF10 recover the NB distribution 972 and 897 times, respectively (see Table 2). On the other hand, AIC indicates that 597 of the 1000 simulated datasets follow the NB distribution. In this case, Vuong’s test gives the poorest result, and its performance was the worst throughout our simulation studies. The quantitative analysis was conducted with the R package BFZINBZIP, which is available from the authors’ GitHub account.
Our method is novel in differentially identifying 143 taxa for two patient groups (i.e., pregnant and non-pregnant women) under a single statistical framework, which allows for an integrative analysis of the microbiome and other omics data sets. The proposed method can lead to proper clinical decisions corresponding to the precision shaping of the microbiome data. Furthermore, BF3 and BF10 proposed in this article perform better than AIC and Vuong’s test throughout our simulation studies and real data analysis. In the real data analysis, an NB distribution cannot be fitted because the underlying maximization process does not converge for the 16S rRNA data. As a result, the Bayes factor of the NB against the Poisson model cannot be determined. In Table 6, the log(Bayes factor) values of ZINB against NB and of ZIP against Poisson are 829.0 and 171854.9, respectively, which support the zero-inflated model for our data set. On the other hand, the log(Bayes factor) of ZINB against ZIP is -13686110.0, which supports fitting a ZIP model to the data. Furthermore, the AIC in Table 7 favors fitting a ZINB model to the data. Tables 6 and 7 give similar results for the second data set, which favor a ZINB model, as the log(Bayes factor) and AIC are 120266.8 and 3342.2, respectively. This inference is similar to the results obtained in Romero et al. (2014).
The framework of the proposed method allows for several extensions. For example, the current model supports two groups (i.e., pregnant and non-pregnant women). We can extend our current model to multiple phenotype groups, including intermediate phenotypes. In this case, our method can incorporate group-specific parameters while holding other parameters fixed, and the same posterior inference can be applied. We can also extend our proposed model to a regression framework in which the normalized microbiome abundance is used as the response and metabolite compounds are integrated as predictors (Jiang et al., 2021). Another extension would be to accommodate correlated covariates such as longitudinal clinical measurements (Zhang et al., 2017; Jiang et al., 2021).
Conflict of Interest
None declared.
Supplementary Material
An R package BFZINBZIP is available on GitHub:
https://github.com/arnabkrmaity/BFZINBZIP/tree/main. This package was used for the simulations in Section 3 and the real data analysis in Section 4.
Data Availability Statement
The Romero data are available in Romero et al. (2014) or directly from the R package NBZIMM.
Appendix
Proof of Lemma 1
In order to prove this lemma we will show that for the probability measure at of model denoted as
where be a posterior density distribution of model so that . Using the continuity at point , there so that for all and there exists a neighborhood of such that , and ,
By consistency there exists a sample space , , so that for each , we have the posterior probability at neighborhood of as
For all there exists such that for all the posterior probability is
Furthermore, the ratio of two posterior distributions is
For all and yields
and by the choice of ,
(11) |
For the inequality Proof of Lemma 1 yields
so that for small values of and we have
Finally, for ,
This completes the proof. ∎
Proof of Lemma 2
Consider two independent random variables , with posterior probability distribution functions , respectively. Then by Piterbarg, (1996),
(12) |
where with represents the transposition of the matrix.
Now let us analyze the asymptotic properties of the finite integral
(13) |
There exists such that . For a positive integer define so that is an integer. Therefore,
(14) |
By condition
there exist two monotonically decreasing functions , since so that for all we have,
Our main objective is to determine the estimate of the upper and the lower bounds of in condition Proof of Lemma 2. The upper bound is
where .
The lower bound is,
Therefore,
(15) |
and
(16) |
Define . First sum of the right hand side of Equation (Proof of Lemma 2) yields,
as . The last inequality is obtained by using the monotone convergence theorem. The estimate from below for the first sum on the right hand side of condition (Proof of Lemma 2) becomes,
as . Now consider the second sums and , on the right hand side of condition (Proof of Lemma 2). For all , -th summand in those sums is determined by the corresponding summand in the first sum by multiplying by . Dividing left hand side and right hand sides of condition (Proof of Lemma 2) and (Proof of Lemma 2) by
and letting yields,
For an arbitrary , and definition of , the above inequality shows the asymptotic behavior of I. Therefore, the statement of the lemma follows by condition (Proof of Lemma 2). This completes the proof. ∎
Proof of Proposition 1
To prove this proposition, we follow the argument of Ghosh and Ramamoorthi (2010). Since , under model we can write,
It is sufficient to show that
(17) |
or,
(18) |
In order to understand conditions 17 and 18 define
Thus, expression in condition 17 becomes
Let us denote two integrals as
Since condition 18 implies , it is sufficient to show that integral inside the parenthesis converges to zero in probability and this term is less than . Now by condition 18 and the expression in ,
as . For further simplicity of this problem, define a function
Clearly almost surely in probability as . To check condition 18 it is sufficient to show that
(19) |
For any let us break into three regions so that , and . For the region ,
By Assumption (v) in 1 the first integral goes to zero and by the tail estimate of a normal distribution the second integral converges to zero (Ghosh and Ramamoorthi,, 2010). Since, for , then a Taylor series expansion yields,
where . Now for region
Since the prior density is continuous at , the second integral converges to zero a.s. in probability . The first integral of the above expression is,
(20) |
Since
the condition 20 satisfies
Finally, for the region ,
For a large constant the second integral of the above inequality satisfies
Since and , first integral yields . Therefore,
Small values of ensures
(21) |
as . The condition 21 can be written as,
(22) |
Therefore, with probability greater than ,
as . Now first choosing a to ensure condition 21 and then by working with the in first and second steps yields the final expression. This completes the proof. ∎
Derivation of Equation 9
Note that the probability mass function (pmf) of the zero-truncated negative binomial random variable is
The expected value of is
Note that
Since
and
we have the Fisher information matrix as
Therefore, the Jeffreys prior is readily available by taking the square root of .
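As a numerical sanity check on the first two steps of this derivation, the Python sketch below evaluates the zero-truncated negative binomial pmf and its mean. It assumes one common parametrization, with size `r` and success probability `p`, which may differ from the one used in the derivation above: the NB pmf is Γ(y + r) / (Γ(r) y!) p^r (1 − p)^y, the zero-truncated pmf divides it by 1 − p^r for y ≥ 1, and the mean is r(1 − p) / (p(1 − p^r)). The function names and the values `r = 1.5`, `p = 0.5` are hypothetical.

```python
import math


def nb_logpmf(y, r, p):
    """log pmf of NB(r, p): Gamma(y + r) / (Gamma(r) y!) * p**r * (1 - p)**y."""
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(p) + y * math.log(1 - p))


def ztnb_pmf(y, r, p):
    """pmf of NB(r, p) conditioned on y >= 1 (zero-truncated NB)."""
    if y < 1:
        return 0.0
    return math.exp(nb_logpmf(y, r, p)) / (1 - p ** r)


def ztnb_mean(r, p):
    """Closed-form mean of the zero-truncated NB under this parametrization."""
    return r * (1 - p) / (p * (1 - p ** r))


r, p = 1.5, 0.5  # illustrative values only
support = range(1, 2000)  # the tail beyond this range is numerically negligible
total = sum(ztnb_pmf(y, r, p) for y in support)
mean = sum(y * ztnb_pmf(y, r, p) for y in support)
print(f"pmf sums to {total:.6f}; numerical mean {mean:.6f}; closed form {ztnb_mean(r, p):.6f}")
```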
References
- Agarwal et al., (2002) Agarwal, D. K., Gelfand, A. E., and Citron-Pousty, S. (2002). Zero-inflated models with application to spatial count data. Environmental and Ecological statistics, 9(4):341–355.
- Akaike, (1998) Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer.
- Altaweel et al., (2019) Altaweel, A., Stoleru, R., Gu, G., and Maity, A. K. (2019). Collusivehijack: A new route hijacking attack and countermeasures in opportunistic networks. In 2019 IEEE Conference on Communications and Network Security (CNS), pages 73–81. IEEE.
- Altaweel et al., (2022) Altaweel, A., Stoleru, R., Gu, G., Maity, A. K., and Bhunia, S. (2022). On detecting route hijacking attack in opportunistic mobile networks. IEEE Transactions on Dependable and Secure Computing.
- Bayarri et al., (2008) Bayarri, M., Berger, J. O., Datta, G. S., et al. (2008). Objective bayes testing of poisson versus inflated poisson models. Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 3:105–121.
- Beck et al., (2023) Beck, J. T. T., McKean, M., Gadgeel, S. M., Bowles, D. W., Haq, R., Yaeger, R., Taylor, M. H., Maity, A. K., Drescher, S., Oliver, C., et al. (2023). A phase 1, open-label, dose escalation and dose expansion study to evaluate the safety, tolerability, pharmacokinetics, and antitumor activity of pf-07799933 (arry-440) as a single agent and in combination therapy in participants 16 years and older with advanced solid tumors with braf alterations.
- Benson et al., (2010) Benson, A. K., Kelly, S. A., Legge, R., Ma, F., Low, S. J., Kim, J., Zhang, M., Oh, P. L., Nehrenberg, D., Hua, K., et al. (2010). Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proceedings of the National Academy of Sciences, 107(44):18933–18938.
- Berger, (2006) Berger, J. (2006). The case for objective bayesian analysis. Bayesian analysis, 1(3):385–402.
- Bhattacharya et al., (2008) Bhattacharya, A., Clarke, B. S., Datta, G. S., et al. (2008). A bayesian test for excess zeros in a zero-inflated power series distribution. IMS collections, Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K Sen, 1:89–104.
- Böhning et al., (1999) Böhning, D., Dietz, E., Schlattmann, P., Mendonca, L., and Kirchner, U. (1999). The zero-inflated poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(2):195–209.
- Bradlow et al., (2002) Bradlow, E. T., Hardie, B. G. S., and Fader, P. S. (2002). Bayesian inference for the negative binomial distribution via polynomial expansions. Journal of Computational and Graphical Statistics, 11(1):189–201.
- Burrell, (1990) Burrell, Q. L. (1990). Using the gamma-poisson model to predict library circulations. Journal of the American Society for Information Science, 41(3):164–170.
- Calvo et al., (2023) Calvo, M., Penkov, K., Spira, A. I., Moreno Candilejo, I., Shore, N. D., Zhang, T., Mellado-Gonzalez, B., Alonso Gordoa, T., Paz-Ares Rodriguez, L., Tarantolo, S. R., et al. (2023). A multi-center, open-label, randomized dose expansion study of pf-06821497, a potent and selective inhibitor of enhancer of zeste homolog 2 (ezh2), in patients with metastatic castration-resistant prostate cancer (mcrpc).
- Chen et al., (2012) Chen, J., Bittinger, K., Charlson, E. S., Hoffmann, C., Lewis, J., Wu, G. D., Collman, R. G., Bushman, F. D., and Li, H. (2012). Associating microbiome composition with environmental covariates using generalized unifrac distances. Bioinformatics, 28(16):2106–2113.
- Dasgupta et al., (2023) Dasgupta, S., Acharya, S., Khan, M. A., Pramanik, P., Marbut, S. M., Yunus, F., Galeas, J. N., Singh, S., Singh, A. P., and Dasgupta, S. (2023). Frequent loss of cacna1c, a calcium voltage-gated channel subunit is associated with lung adenocarcinoma progression and poor prognosis. Cancer Research, 83(7_Supplement):3318–3318.
- Deng and Paul, (2000) Deng, D. and Paul, S. R. (2000). Score tests for zero inflation in generalized linear models. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 563–570.
- Deng and Paul, (2005) Deng, D. and Paul, S. R. (2005). Score tests for zero-inflation and over-dispersion in generalized linear models. Statistica Sinica, pages 257–276.
- Dong et al., (2014) Dong, C., Clarke, D. B., Yan, X., Khattak, A., and Huang, B. (2014). Multivariate random-parameters zero-inflated negative binomial regression model: An application to estimate crash frequencies at intersections. Accident Analysis & Prevention, 70:320–329.
- Efron, (1986) Efron, B. (1986). Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 81(395):709–721.
- Frankel et al., (2017) Frankel, A. E., Coughlin, L. A., Kim, J., Froehlich, T. W., Xie, Y., Frenkel, E. P., and Koh, A. Y. (2017). Metagenomic shotgun sequencing and unbiased metabolomic profiling identify specific human gut microbiota and metabolites associated with immune checkpoint therapy efficacy in melanoma patients. Neoplasia, 19(10):848–855.
- Garay et al., (2011) Garay, A. M., Hashimoto, E. M., Ortega, E. M., and Lachos, V. H. (2011). On estimation and influence diagnostics for zero-inflated negative binomial regression models. Computational Statistics & Data Analysis, 55(3):1304–1318.
- Ghosh and Ramamoorthi, (2010) Ghosh, J. and Ramamoorthi, R. (2010). Bayesian nonparametrics. Springer Series in Statistics.
- Ghosh and Samanta, (2002) Ghosh, J. K. and Samanta, T. (2002). Nonsubjective bayes testing—an overview. Journal of statistical planning and inference, 103(1-2):205–223.
- Ghosh et al., (2023) Ghosh, R. P., Maity, A. K., Pourahmadi, M., and Mallick, B. K. (2023). Adaptive bayesian variable clustering via structural learning of breast cancer data. Genetic Epidemiology, 47(1):95–104.
- Halfvarson et al., (2017) Halfvarson, J., Brislawn, C. J., Lamendella, R., Vázquez-Baeza, Y., Walters, W. A., Bramer, L. M., D’amato, M., Bonfiglio, F., McDonald, D., Gonzalez, A., et al. (2017). Dynamics of the human gut microbiome in inflammatory bowel disease. Nature microbiology, 2(5):1–7.
- Hall, (2000) Hall, D. B. (2000). Zero-inflated poisson and binomial regression with random effects: a case study. Biometrics, 56(4):1030–1039.
- Hertweck et al., (2023) Hertweck, K. L., Vikramdeo, K. S., Galeas, J. N., Marbut, S. M., Pramanik, P., Yunus, F., Singh, S., Singh, A. P., and Dasgupta, S. (2023). Clinicopathological significance of unraveling mitochondrial pathway alterations in non-small-cell lung cancer. The FASEB Journal, 37(7):e23018.
- Hua et al., (2019) Hua, L., Polansky, A., and Pramanik, P. (2019). Assessing bivariate tail non-exchangeable dependence. Statistics & Probability Letters, 155:108556.
- Jeffreys, (1961) Jeffreys, H. (1961). The theory of probability (3rd ed.). OUP Oxford.
- Jiang et al., (2023) Jiang, R., Zhan, X., and Wang, T. (2023). A flexible zero-inflated poisson-gamma model with application to microbiome sequence count data. Journal of the American Statistical Association, 118(542):792–804.
- Jiang et al., (2021) Jiang, S., Xiao, G., Koh, A. Y., Kim, J., Li, Q., and Zhan, X. (2021). A bayesian zero-inflated negative binomial regression model for the integrative analysis of microbiome data. Biostatistics, 22(3):522–540.
- Jovel et al., (2016) Jovel, J., Patterson, J., Wang, W., Hotte, N., O’Keefe, S., Mitchel, T., Perry, T., Kao, D., Mason, A. L., Madsen, K. L., et al. (2016). Characterization of the gut microbiome using 16s or shotgun metagenomics. Frontiers in microbiology, 7:459.
- Kakkat et al., (2023) Kakkat, S., Pramanik, P., Singh, S., Singh, A. P., Sarkar, C., and Chakroborty, D. (2023). Cardiovascular complications in patients with prostate cancer: Potential molecular connections. International Journal of Molecular Sciences, 24(8):6984.
- Karlsson et al., (2013) Karlsson, F. H., Tremaroli, V., Nookaew, I., Bergström, G., Behre, C. J., Fagerberg, B., Nielsen, J., and Bäckhed, F. (2013). Gut metagenome in european women with normal, impaired and diabetic glucose control. Nature, 498(7452):99–103.
- Kass and Vaidyanathan, (1992) Kass, R. E. and Vaidyanathan, S. K. (1992). Approximate bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society: Series B (Methodological), 54(1):129–144.
- Kass and Wasserman, (1996) Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American statistical Association, 91(435):1343–1370.
- Kelly et al., (2015) Kelly, B. J., Gross, R., Bittinger, K., Sherrill-Mix, S., Lewis, J. D., Collman, R. G., Bushman, F. D., and Li, H. (2015). Power and sample-size estimation for microbiome studies using pairwise distances and permanova. Bioinformatics, 31(15):2461–2468.
- Khan et al., (2023) Khan, M. A., Acharya, S., Anand, S., Sameeta, F., Pramanik, P., Keel, C., Singh, S., Carter, J. E., Dasgupta, S., and Singh, A. P. (2023). Myb exhibits racially disparate expression, clinicopathologic association, and predictive potential for biochemical recurrence in prostate cancer. Iscience, 26(12).
- Kinross et al., (2011) Kinross, J. M., Darzi, A. W., and Nicholson, J. K. (2011). Gut microbiome-host interactions in health and disease. Genome medicine, 3:1–12.
- La Rosa et al., (2014) La Rosa, P. S., Warner, B. B., Zhou, Y., Weinstock, G. M., Sodergren, E., Hall-Moore, C. M., Stevens, H. J., Bennett Jr, W. E., Shaikh, N., Linneman, L. A., et al. (2014). Patterned progression of bacterial populations in the premature infant gut. Proceedings of the National Academy of Sciences, 111(34):12522–12527.
- La Rosa et al., (2015) La Rosa, P. S., Zhou, Y., Sodergren, E., Weinstock, G., and Shannon, W. D. (2015). Hypothesis testing of metagenomic data. In Metagenomics for microbiology, pages 81–96. Elsevier.
- Lambert, (1992) Lambert, D. (1992). Zero-inflated poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1–14.
- Lawless, (1987) Lawless, J. F. (1987). Negative binomial and mixed poisson regression. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 209–225.
- Lüdecke et al., (2021) Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., and Makowski, D. (2021). Performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60):3139.
- Leamy et al., (2014) Leamy, L. J., Kelly, S. A., Nietfeldt, J., Legge, R. M., Ma, F., Hua, K., Sinha, R., Peterson, D. A., Walter, J., Benson, A. K., et al. (2014). Host genetics and diet, but not immunoglobulin a expression, converge to shape compositional features of the gut microbiome in an advanced intercross population of mice. Genome biology, 15:1–20.
- Li et al., (2021) Li, Q., Zhang, M., Xie, Y., and Xiao, G. (2021). Bayesian modeling of spatial molecular profiling data via gaussian process. Bioinformatics, 37(22):4129–4136.
- Li et al., (2018) Li, Z., Lee, K., Karagas, M. R., Madan, J. C., Hoen, A. G., O’malley, A. J., and Li, H. (2018). Conditional regression based on a multivariate zero-inflated logistic-normal model for microbiome relative abundance data. Statistics in biosciences, 10:587–608.
- Liang et al., (2014) Liang, X., Li, H., Tian, G., and Li, S. (2014). Dynamic microbe and molecule networks in a mouse model of colitis-associated colorectal cancer. Scientific reports, 4(1):4985.
- Love et al., (2014) Love, M. I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome biology, 15(12):1–21.
- Maier et al., (2018) Maier, L., Pruteanu, M., Kuhn, M., Zeller, G., Telzerow, A., Anderson, E. E., Brochado, A. R., Fernandez, K. C., Dose, H., Mori, H., et al. (2018). Extensive impact of non-antibiotic drugs on human gut bacteria. Nature, 555(7698):623–628.
- Maity, (2022) Maity, A. (2022). sahpm: Variable Selection using Simulated Annealing. R package version 1.0.1.
- Maity et al., (2019a) Maity, A., Chakraborty, A., Bhattacharya, A., Carroll, R., Mallick, B. K., and Maity, M. A. (2019a). Package ‘intsurvbin’.
- Maity et al., (2018a) Maity, A., Sharma, J., Sarkar, A., More, A. K., Pal, R. K., Nagane, V. P., and Maity, A. (2018a). Salicylic acid mediated multi-pronged strategy to combat bacterial blight disease (xanthomonas axonopodis pv. punicae) in pomegranate. European Journal of Plant Pathology, 150:923–937.
- Maity, (2016) Maity, A. K. (2016). Bayesian variable selection in linear and non-linear models. Northern Illinois University.
- Maity and Basu, (2023) Maity, A. K. and Basu, S. (2023). Highest posterior model computation and variable selection via simulated annealing. The New England Journal of Statistics in Data Science, 1(2):200–207.
- Maity et al., (2021a) Maity, A. K., Basu, S., and Ghosh, S. (2021a). Bayesian criterion-based variable selection. Journal of the Royal Statistical Society Series C: Applied Statistics, 70(4):835–857.
- Maity et al., (2020) Maity, A. K., Bhattacharya, A., Mallick, B. K., and Baladandayuthapani, V. (2020). Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics, 76(1):316–325.
- Maity et al., (2019b) Maity, A. K., Carroll, R. J., and Mallick, B. K. (2019b). Integration of survival and binary data for variable selection and prediction: a bayesian approach. Journal of the Royal Statistical Society Series C: Applied Statistics, 68(5):1577–1595.
- Maity et al., (2021b) Maity, A. K., Chan Lee, S., K. Mallick, B., Bhattacharjee, S., and K. Biswas, N. (2021b). semmcmc: Bayesian Structural Equation Modeling in Multiple Omics Data Integration. R package version 0.0.6.
- Maity and Dey, (2018) Maity, A. K. and Dey, J. (2018). Power analysis of collapsed ordered categories with application to cancer data. Calcutta Statistical Association Bulletin, 70(2):87–95.
- Maity et al., (2021c) Maity, A. K., Lee, S. C., Hu, L., Bell-pederson, D., Mallick, B. K., and Sarkar, T. R. (2021c). Circadian gene selection for time-to-event phenotype by integrating cnv and rnaseq data. Chemometrics and Intelligent Laboratory Systems, 212:104276.
- Maity and Paul, (2022) Maity, A. K. and Paul, E. (2022). Jeffreys prior for negative binomial and zero inflated negative binomial distributions. Sankhya A, pages 1–15.
- Maity and Paul, (2023) Maity, A. K. and Paul, E. (2023). Jeffreys prior for negative binomial and zero inflated negative binomial distributions. Sankhya A, 85(1):999–1013.
- Maity et al., (2018b) Maity, A. K., Pradhan, V., and Das, U. (2018b). Bias reduction in logistic regression with missing responses when the missing data mechanism is nonignorable. The American Statistician.
- Merkle and You, (2020) Merkle, E. and You, D. (2020). nonnest2: Tests of Non-Nested Models. R package version 0.5-5.
- Minar, (2018a) Minar, S. J. (2018a). Evaluating the effectiveness of the united nations organizations: the limits of theories and need for a new analytical framework. International Journal of Advanced Research, 6(7):457–462.
- Minar, (2018b) Minar, S. J. (2018b). Grand strategy and foreign policy: How grand strategy can aid bangladesh’s foreign policy rethinking. Journal of Social Studies, 4(1):20–27.
- Minar, (2019) Minar, S. J. (2019). Tatmadaw’s crackdown on the rohingyas: A swot analysis. Journal of Social Studies, 5(1):1–5.
- Nam et al., (2022) Nam, S. J., Kim, S., and Ng, H. K. T. (2022). Bayesian and frequentist approaches on estimation and testing for a zero-inflated binomial distribution. Hacettepe Journal of Mathematics and Statistics, 37(3):1–23.
- O’Hara and Kotze, (2010) O’Hara, R. and Kotze, J. (2010). Do not log-transform count data. Nature Precedings, pages 1–1.
- Paul et al., (2018) Paul, E., Maity, A. K., and Maiti, R. (2018). Bayesian comparative study on binary time series. Journal of Statistical Computation and Simulation, 88(14):2811–2826.
- Piterbarg, (1996) Piterbarg, V. I. (1996). Asymptotic methods in the theory of Gaussian processes and fields, volume 148. American Mathematical Soc.
- Polansky and Pramanik, (2021) Polansky, A. M. and Pramanik, P. (2021). A motif building process for simulating random networks. Computational Statistics & Data Analysis, 162:107263.
- Pramanik, (2016) Pramanik, P. (2016). Tail non-exchangeability. Northern Illinois University.
- Pramanik, (2020) Pramanik, P. (2020). Optimization of market stochastic dynamics. In SN Operations Research Forum, volume 1, page 31. Springer.
- Pramanik, (2021a) Pramanik, P. (2021a). Consensus as a Nash equilibrium of a stochastic differential game. arXiv preprint arXiv:2107.05183.
- Pramanik, (2021b) Pramanik, P. (2021b). Effects of water currents on fish migration through a Feynman-type path integral approach under 8/3 Liouville-like quantum gravity surfaces. Theory in Biosciences, 140(2):205–223.
- Pramanik, (2021c) Pramanik, P. (2021c). Optimization of dynamic objective functions using path integrals. PhD thesis, Northern Illinois University.
- Pramanik, (2022a) Pramanik, P. (2022a). On lock-down control of a pandemic model. arXiv preprint arXiv:2206.04248.
- Pramanik, (2022b) Pramanik, P. (2022b). Stochastic control of a SIR model with non-linear incidence rate through Euclidean path integral. arXiv preprint arXiv:2209.13733.
- Pramanik, (2023a) Pramanik, P. (2023a). Optimal lock-down intensity: A stochastic pandemic control approach of path integral. Computational and Mathematical Biophysics, 11(1):20230110.
- Pramanik, (2023b) Pramanik, P. (2023b). Path integral control in infectious disease modeling. arXiv preprint arXiv:2311.02113.
- Pramanik, (2023c) Pramanik, P. (2023c). Path integral control of a stochastic multi-risk SIR pandemic model. Theory in Biosciences, 142(2):107–142.
- Pramanik and Polansky, (2020) Pramanik, P. and Polansky, A. M. (2020). Motivation to run in one-day cricket. arXiv preprint arXiv:2001.11099.
- Pramanik and Polansky, (2021) Pramanik, P. and Polansky, A. M. (2021). Optimal estimation of Brownian penalized regression coefficients. arXiv preprint arXiv:2107.02291.
- Pramanik and Polansky, (2023a) Pramanik, P. and Polansky, A. M. (2023a). Optimization of a dynamic profit function using Euclidean path integral. SN Business & Economics, 4(1):8.
- Pramanik and Polansky, (2023b) Pramanik, P. and Polansky, A. M. (2023b). Scoring a goal optimally in a soccer game under Liouville-like quantum gravity action. In Operations Research Forum, volume 4, page 66. Springer.
- Qin et al., (2010) Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., et al. (2010). A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285):59–65.
- Qin et al., (2014) Qin, N., Yang, F., Li, A., Prifti, E., Chen, Y., Shao, L., Guo, J., Le Chatelier, E., Yao, J., Wu, L., et al. (2014). Alterations of the human gut microbiome in liver cirrhosis. Nature, 513(7516):59–64.
- R Core Team, (2021) R Core Team (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
- Ridout et al., (2001) Ridout, M., Hinde, J., and Demétrio, C. G. (2001). A score test for testing a zero-inflated Poisson regression model against zero-inflated negative binomial alternatives. Biometrics, 57(1):219–223.
- Robinson et al., (2010) Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140.
- Romero et al., (2014) Romero, R., Hassan, S. S., Gajer, P., Tarca, A. L., Fadrosh, D. W., Nikita, L., Galuppi, M., Lamont, R. F., Chaemsaithong, P., Miranda, J., et al. (2014). The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. Microbiome, 2(1):1–19.
- Roy et al., (2023a) Roy, M., Das, S., and Protity, A. T. (2023a). Obeseye: Interpretable diet recommender for obesity management using machine learning and explainable AI. arXiv preprint arXiv:2308.02796.
- Roy et al., (2023b) Roy, M., Minar, S. J., Dhar, P., and Faruq, A. (2023b). Machine learning applications in healthcare: The state of knowledge and future directions. arXiv preprint arXiv:2307.14067.
- Roy et al., (2023c) Roy, M., Protity, A. T., Das, S., and Dhar, P. (2023c). Prevalence and major risk factors of non-communicable diseases: a machine learning based cross-sectional study. EUREKA: Health Sciences, (3):28–45.
- Roy Sarkar et al., (2019) Roy Sarkar, T., Maity, A. K., Niu, Y., and Mallick, B. K. (2019). Multiple omics data integration to identify long noncoding RNA responsible for breast cancer–related mortality. Cancer Informatics, 18:1176935119871933.
- Schweizer et al., (2022) Schweizer, M., Penkov, K., Tolcher, A., Choudhury, A., Doronin, V., Aljumaily, R., Calvo, E., Frank, R., Hamm, J., Garcia, V. M., et al. (2022). 488P Phase I trial of PF-06821497, a potent and selective inhibitor of enhancer of zeste homolog 2 (EZH2), in follicular lymphoma (FL), small cell lung cancer (SCLC) and castration-resistant prostate cancer (CRPC). Annals of Oncology, 33:S763–S764.
- Segata et al., (2012) Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., and Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature methods, 9(8):811–814.
- Sender et al., (2016) Sender, R., Fuchs, S., and Milo, R. (2016). Revised estimates for the number of human and bacteria cells in the body. PLoS biology, 14(8):e1002533.
- Sharpton et al., (2017) Sharpton, T., Lyalina, S., Luong, J., Pham, J., Deal, E. M., Armour, C., Gaulke, C., Sanjabi, S., and Pollard, K. S. (2017). Development of inflammatory bowel disease is linked to a longitudinal restructuring of the gut metagenome in mice. mSystems, 2(5).
- Sommerhalder et al., (2023) Sommerhalder, D., Hamilton, E. P., Mukohara, T., Yonemori, K., Mita, M. M., Yamashita, T., Zheng, J., Liu, L., Maity, A. K., Homji Mishra, N., et al. (2023). First-in-human phase 1 dose escalation study of the KAT6 inhibitor PF-07248144 in patients with advanced solid tumors.
- Stewart et al., (2017) Stewart, C. J., Embleton, N. D., Marrs, E. C., Smith, D. P., Fofanova, T., Nelson, A., Skeath, T., Perry, J. D., Petrosino, J. F., Berrington, J. E., et al. (2017). Longitudinal development of the gut microbiome and metabolome in preterm neonates with late onset sepsis and healthy controls. Microbiome, 5(1):1–11.
- Ursell et al., (2012) Ursell, L. K., Metcalf, J. L., Parfrey, L. W., and Knight, R. (2012). Defining the human microbiome. Nutrition reviews, 70(suppl_1):S38–S44.
- Van den Broek, (1995) Van den Broek, J. (1995). A score test for zero inflation in a Poisson distribution. Biometrics, pages 738–743.
- Vatanen et al., (2016) Vatanen, T., Kostic, A. D., d’Hennezel, E., Siljander, H., Franzosa, E. A., Yassour, M., Kolde, R., Vlamakis, H., Arthur, T. D., Hämäläinen, A.-M., et al. (2016). Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell, 165(4):842–853.
- Vikramdeo et al., (2023) Vikramdeo, K. S., Anand, S., Sudan, S. K., Pramanik, P., Singh, S., Godwin, A. K., Singh, A. P., and Dasgupta, S. (2023). Profiling mitochondrial dna mutations in tumors and circulating extracellular vesicles of triple-negative breast cancer patients for potential biomarker development. FASEB BioAdvances, 5(10):412.
- Vuong, (1989) Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: journal of the Econometric Society, pages 307–333.
- Wang et al., (2015) Wang, J., Kalyan, S., Steck, N., Turner, L. M., Harr, B., Künzel, S., Vallier, M., Häsler, R., Franke, A., Oberg, H.-H., et al. (2015). Analysis of intestinal microbiota in hybrid house mice reveals evolutionary divergence in a vertebrate hologenome. Nature communications, 6(1):6440.
- Wasserman, (2000) Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of mathematical psychology, 44(1):92–107.
- White and Bennetts, (1996) White, G. C. and Bennetts, R. E. (1996). Analysis of frequency count data using the negative binomial distribution. Ecology, 77(8):2549–2557.
- Wu et al., (2016) Wu, C., Chen, J., Kim, J., and Pan, W. (2016). An adaptive association test for microbiome data. Genome medicine, 8:1–12.
- Yang et al., (2018) Yang, X., Qian, Y., Xu, S., Song, Y., and Xiao, Q. (2018). Longitudinal analysis of fecal microbiome and pathologic processes in a rotenone induced mice model of Parkinson’s disease. Frontiers in aging neuroscience, 9:441.
- Zhang et al., (2017) Zhang, X., Mallick, H., Tang, Z., Zhang, L., Cui, X., Benson, A. K., and Yi, N. (2017). Negative binomial mixed models for analyzing microbiome count data. BMC bioinformatics, 18:1–10.
- Zhang and Yi, (2020) Zhang, X. and Yi, N. (2020). Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data. Bioinformatics, 36(8):2345–2351.
- Zhou and Tu, (2000) Zhou, X.-H. and Tu, W. (2000). Confidence intervals for the mean of diagnostic test charge data containing zeros. Biometrics, 56(4):1118–1125.
- Zhu et al., (2018) Zhu, W., Winter, M. G., Byndloss, M. X., Spiga, L., Duerkop, B. A., Hughes, E. R., Büttner, L., de Lima Romão, E., Behrendt, C. L., Lopez, C. A., et al. (2018). Precision editing of the gut microbiota ameliorates colitis. Nature, 553(7687):208–211.