
Sparse Bayesian Factor Models with Mass-Nonlocal Factor Scores

Dafne Zorzetto (Data Science Institute, Brown University), Yingjie Huang (Data Science Institute, Brown University), and Roberta De Vito (Department of Biostatistics, Brown University). Equally contributing co-first authors.
(2024)
Abstract

Bayesian factor models are widely used for dimensionality reduction and pattern discovery in high-dimensional datasets across diverse fields. These models typically focus on imposing priors on the factor loadings to induce sparsity and improve interpretability. However, the factor scores, which play a critical role in individual-level associations with factors, have received less attention and are usually assumed to follow a standard multivariate normal distribution. This oversimplification fails to capture the heterogeneity observed in real-world applications. We propose the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN), a novel framework that addresses these limitations by introducing a mass-nonlocal prior for the factor scores. This prior provides a more flexible posterior distribution that captures individual heterogeneity while assigning positive probability to exactly zero values. The zero entries in the score matrix characterize the sparsity, offering a robust and novel approach for determining the optimal number of factors. Model parameters are estimated using a fast and efficient Gibbs sampler. Extensive simulations demonstrate that BFMAN outperforms standard Bayesian sparse factor models in factor recovery, sparsity detection, and score estimation. We apply BFMAN to the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) and identify dietary patterns and their associations with cardiovascular outcomes, showcasing the model's ability to uncover meaningful insights in diet.

Keywords: factor selection, nutritional data, pMOM distribution, spike and non-local prior.

1 Introduction

Bayesian factor models are crucial in numerous disciplines, including social sciences [12], genomics [45], nutrition [22], and many other fields [38, 10]. These models are particularly advantageous when analyzing high-dimensional data, as they provide a structured approach to dimensionality reduction, improving interpretability and facilitating a deeper understanding of underlying phenomena [6]. However, when working with large-scale datasets, sparsity or penalization techniques must often be introduced for two critical reasons: first, to enhance interpretability and achieve meaningful insights into the data-generating process, and second, to ensure the covariance matrix is estimable [32].

The primary focus has remained on imposing priors on the factor loading matrices, using approaches such as shrinkage priors [7, 30], sparsity priors [9, 24], spike-and-slab priors [37, 4], and non-local mass priors [3]. However, little attention has been given to the factor score matrix, which continues to rely on standard priors [44, 18], typically assuming a multivariate normal distribution centered at zero with unit variance. Some studies have assumed a full covariance structure to accommodate potential correlations among factors [34].

Factor scores play a critical role as they represent the score that each individual attains on the corresponding factors, making them particularly valuable across various applications [22, 15, 11]. For example, in dietary research, factor analysis is often used to identify dietary patterns (i.e., factor loadings) and then estimate factor scores for each individual in order to assess the association between these patterns and disease outcomes [21]. A more refined estimation and an appropriate prior specification for the factor scores are essential to accurately model the relationship between each factor and health outcomes. For instance, in diet-related studies, individuals may exhibit substantial heterogeneity in adherence to dietary patterns: some may follow a particular pattern strictly, others moderately, and some minimally. Current methods fail to capture such heterogeneity effectively.

In this work, we address these limitations by proposing a novel sparse Bayesian factor model that incorporates a mass-nonlocal prior for the latent factor scores, termed the Sparse Bayesian Factor Model with Mass-Nonlocal Factor Scores (BFMAN). This framework introduces a more flexible posterior distribution for the factor scores, characterizing the heterogeneity in subject-level associations with the latent factors. This is enabled by the mass-nonlocal prior, which assigns a non-null probability to exact zero values and uses a non-local slab prior that does not overlap with the spike. Additionally, the model addresses the critical challenge of determining the number of factors from a different angle, an open problem in Bayesian factor analysis. Specifically, while existing approaches typically focus on sparsity in the factor loading matrix, i.e., the number of shrunk [7] or sparse elements [9] in each column of the loading matrix, or the proportion of variance explained by the factor loadings [15], our method takes a novel approach by leveraging the level of sparsity in the factor score matrix to infer the optimal number of factors. This unique perspective allows for more precise identification of the factors and better captures the structural complexity of the data.

We introduce a fast and efficient Gibbs sampler, which guarantees computational feasibility even for large-scale datasets.

We conducted extensive simulation studies to evaluate the performance of BFMAN, comparing it against standard Bayesian sparse factor models. Our results demonstrate that BFMAN consistently outperforms existing methods in recovering both factor loadings and factor scores, while providing more accurate estimates of the number of latent factors. Additionally, the flexibility of the proposed mass-nonlocal prior for factor scores allows for better characterization of individual-level heterogeneity, making the model particularly well-suited for complex, high-dimensional applications.

To further showcase the utility of our approach, we apply BFMAN to the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) [35], a multi-center epidemiologic study designed to investigate critical components impacting the health of Hispanic/Latino populations [43]. A central focus of the study is the role of diet in cardiovascular disease risk [13], including outcomes such as hypertension and high cholesterol. By employing our method, we uncover robust dietary patterns and their associations with health outcomes, providing novel insights into diet-disease relationships.

The paper is organized as follows. Section 2 introduces the BFMAN framework, detailing the mass-nonlocal prior, and the procedure for selecting the optimal number of factors using sparsity in the factor score matrix. Section 3 presents extensive simulation studies, illustrating the superiority of BFMAN compared to standard Bayesian sparse factor models in terms of accuracy and computational efficiency. Section 4 applies BFMAN to the HCHS/SOL data. Finally, Section 5 includes a discussion of our findings and their implications.

2 Bayesian mass-nonlocal factor analysis

2.1 Model and prior specification

Let ${\bf Y}\in\mathbb{R}^{n\times p}$ be the $p$-dimensional data matrix, where $n$ represents the number of observations and $p$ the number of observed variables. Assuming observations are independent and identically distributed, we can define a generic latent factor model, for each unit $i\in\{1,\dots,n\}$, as follows:

\[
y_{i}=\Lambda\eta_{i}+\epsilon_{i}, \qquad (1)
\]

where $\Lambda\in\mathbb{R}^{p\times k}$ is the factor loading matrix, $\eta_{i}\in\mathbb{R}^{k}$ is the latent factor score for the $i$-th observation, and $\epsilon_{i}\sim\mathcal{N}_{p}(0,\Sigma)$ represents the idiosyncratic error, with $\Sigma=\mathrm{diag}(\sigma_{1}^{2},\dots,\sigma_{p}^{2})$.

Traditional factor models assume that the factor scores are standard normally distributed, but this assumption often fails to capture sparsity and heterogeneity in the factor scores. To address this, we define a mixture distribution for $\eta$, assigning a nonzero probability to exact zeros through a Dirac distribution with mass at zero.

To ensure a flexible distribution that avoids overlap with the spike while maintaining tails similar to a normal distribution, we adopt a pMOM prior (Figure 1) as the slab component of the mixture [26, 27]. Hence, we define a mass-nonlocal prior, introduced by Shi et al. [42], for $\eta_{ih}$, for each unit $i$ and factor $h$, as follows:

\[
\{\eta_{ih}\mid\theta_{h},\psi_{h}\}\sim(1-\theta_{h})\,\delta_{0}(\eta_{ih})+\theta_{h}\,\mathrm{pMOM}(\eta_{ih}\mid\psi_{h}), \qquad (2)
\]

where $\delta_{0}(\cdot)$ is a Dirac measure with mass at zero, and the probability density of the pMOM distribution [26, 27] is defined as follows:

\[
p(\eta_{ih}\mid\psi_{h})=\frac{1}{\sqrt{2\pi\psi_{h}^{3}}}\exp\left(-\frac{\eta_{ih}^{2}}{2\psi_{h}}\right)\eta_{ih}^{2},
\]

where $\psi_{h}>0$ is the scale parameter.
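
To make the pMOM slab concrete, the following minimal Python sketch (not part of the paper; NumPy assumed, function names ours) evaluates the log-density above and draws samples, using the fact that under the pMOM $\eta^{2}/\psi$ follows a chi-square distribution with three degrees of freedom.

import numpy as np

def pmom_logpdf(eta, psi):
    # log of (2*pi*psi^3)^(-1/2) * exp(-eta^2 / (2*psi)) * eta^2
    return (-0.5 * np.log(2 * np.pi * psi ** 3)
            - eta ** 2 / (2 * psi)
            + 2 * np.log(np.abs(eta)))

def pmom_sample(psi, size=1, rng=None):
    # |eta| = sqrt(psi * chi^2_3), with a random sign:
    # the change of variable u = eta^2 / psi shows that u ~ chi-square with 3 df
    rng = np.random.default_rng() if rng is None else rng
    u = rng.chisquare(df=3, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * np.sqrt(psi * u)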

Figure 1: pMOM densities with different dispersion parameters $\psi$. These densities correspond to the non-local prior used in our model.

The prior elicitation for the hyperparameters in the distributions involved in (1) and (2) is specified as follows:

\[
\theta_{h}\sim\mathrm{Beta}(a_{\theta},b_{\theta})\quad\forall h\geq 1, \qquad
\sigma_{j}^{-2}\sim\mathrm{Ga}(a_{\sigma},b_{\sigma})\quad\forall j\in\{1,\dots,p\}. \qquad (3)
\]

This formulation allows us to introduce a latent variable $Z_{ih}$ for each unit $i\in\{1,\dots,n\}$ and factor $h\in\{1,\dots,H\}$, indicating whether the factor score belongs to the spike or the slab component. Specifically, $Z_{ih}$ is defined as a binary random variable with the following distribution:

\[
Z_{ih}\sim\mathrm{Bern}(\theta_{h}),
\]

where $\theta_{h}$ is the probability that the corresponding factor score $\eta_{ih}$ follows the pMOM distribution, such that

\[
\{\eta_{ih}\mid Z_{ih}=1,\psi_{h}\}\sim\mathrm{pMOM}(\psi_{h})
\quad\mbox{ and }\quad
\{\eta_{ih}\mid Z_{ih}=0\}\sim\delta_{0}.
\]

We adopt the multiplicative gamma process shrinkage (MGPS) prior introduced by Bhattacharya and Dunson [7] for the factor loading matrix $\Lambda$. Possible alternatives are the Bayesian cumulative shrinkage prior introduced by Legramanti et al. [30] or the generalization of Bhattacharya and Dunson [7]'s prior proposed by Schiavon et al. [39].

The MGPS prior [7] is defined for each entry of the factor loading matrix $\Lambda$ as follows:

\[
\begin{aligned}
&\lambda_{jh}\mid\phi_{jh},\tau_{h}\sim\mathcal{N}(0,\phi_{jh}^{-1}\tau_{h}^{-1}),\\
&\phi_{jh}\sim\mathrm{Ga}(\nu/2,\nu/2),\\
&\tau_{h}=\prod_{l=1}^{h}\delta_{l},\quad \delta_{1}\sim\mathrm{Ga}(a_{1},1),\quad \delta_{l}\sim\mathrm{Ga}(a_{2},1),\quad l\geq 2,
\end{aligned} \qquad (4)
\]

where $\{\phi_{jh}\}_{h\geq 1,\,j\in\{1,\dots,p\}}$ are local shrinkage parameters following a gamma distribution with $\nu$ degrees of freedom, while $\{\tau_{h}\}_{h\geq 1}$ are global shrinkage parameters for the corresponding $h$-th column, obtained from the cumulative product of $\{\delta_{h}\}_{h\geq 1}$. Each $\delta_{h}$ follows a gamma distribution. Under suitable settings, $\{\tau_{h}\}_{h\geq 1}$ is a stochastically increasing process, favoring stronger shrinkage for higher-indexed factors and facilitating sparsity in $\Lambda$. We adopt the hyperparameter values suggested by [7, 19].
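
For illustration, a minimal Python sketch of a draw from the MGPS prior in (4) for a $p\times k$ loading matrix; the hyperparameter defaults below are placeholders, not necessarily the values used in the paper.

import numpy as np

def draw_mgps_loadings(p, k, a1=2.1, a2=3.1, nu=3.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # local shrinkage: phi_{jh} ~ Ga(nu/2, rate = nu/2)
    phi = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=(p, k))
    # column-wise global shrinkage: tau_h = prod_{l<=h} delta_l,
    # with delta_1 ~ Ga(a1, 1) and delta_l ~ Ga(a2, 1) for l >= 2
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=k - 1)])
    tau = np.cumprod(delta)
    # lambda_{jh} ~ N(0, 1 / (phi_{jh} * tau_h))
    return rng.normal(0.0, 1.0 / np.sqrt(phi * tau), size=(p, k))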

2.2 Posterior Computation

A straightforward Gibbs sampler can be used for posterior computation, leveraging conjugate priors, with the exception of the pMOM distribution, which requires a Metropolis-Hastings step.

Following the steps in Algorithm 2, in each iteration $r=1,\dots,R$, we use the observed data $y$ to update the following parameters and random variables: the factor loadings $\Lambda$, denoting with $\lambda_{j}$ the $j$-th row, for $j\in\{1,\dots,p\}$; the factor scores $\eta$, denoting with $\eta_{i}$ the $i$-th row, for $i\in\{1,\dots,n\}$; the diagonal matrix used in the MGPS prior (4), $D_{j}^{-1}=\mathrm{diag}(\phi_{j1}\tau_{1},\dots,\phi_{jk}\tau_{k})$; the parameters $\{\theta_{h}\}_{h\in\{1,\dots,k\}}$; and the pMOM parameters $\{\psi_{h}\}_{h\in\{1,\dots,k\}}$. We denote with $y^{(j)}=(y_{1j},\dots,y_{nj})^{\intercal}$, for $j=1,\dots,p$, the $n$-vector for the $j$-th outcome variable, and with $k$ the number of factors.

The steps for posterior sampling are as follows:

  1.

    The rows of the loading matrix $\{\lambda_{j}\}_{j\in\{1,\dots,p\}}$ are sampled from the following posterior distribution:

    \[
    f(\lambda_{j}\mid\tau,\Lambda,D,\sigma_{y})\sim\mathcal{N}_{k}\Big\{\left(D_{j}^{-1}+\sigma_{j}^{-2}\eta^{\intercal}\eta\right)^{-1}\eta^{\intercal}\sigma_{j}^{-2}y^{(j)},\;\left(D_{j}^{-1}+\sigma_{j}^{-2}\eta^{\intercal}\eta\right)^{-1}\Big\}.
    \]
  2.

    The MGPS prior introduces two sets of parameters. First, the local shrinkage parameter $\phi_{jh}$, with the following posterior distribution, for $j\in\{1,\dots,p\}$ and $h\in\{1,\dots,k\}$:

    \[
    f(\phi_{jh}\mid\nu,\Lambda,\tau)\sim\mathrm{Ga}\left(\frac{\nu+1}{2},\frac{\nu+\tau_{h}\lambda_{jh}^{2}}{2}\right).
    \]
  3.

    Second, the global shrinkage parameter $\delta_{h}$, with posterior distribution defined as follows:

    \[
    f(\delta_{h}\mid a,\tau,\phi,\Lambda)\sim\mathrm{Ga}\Big\{a_{h}+\frac{p}{2}\left(k-h+1\right),\;1+\frac{1}{2}\sum_{l=h}^{k}\tau_{l}^{(h)}\sum_{j=1}^{p}\phi_{jl}\lambda_{jl}^{2}\Big\},
    \]

    where $\tau_{l}^{(h)}=\prod_{t=1,t\neq h}^{l}\delta_{t}$ for $h=1,\dots,k$.

  4.

    The factor scores $\eta_{ih}$, conditional on the latent variable $Z_{ih}$, are sampled from:

    \[
    f(\eta_{ih}\mid Z_{ih})=\begin{cases}\pi(\eta_{ih}\mid c,d)&\mbox{if }Z_{ih}=1,\\ 0&\mbox{if }Z_{ih}=0;\end{cases}
    \]

    where $\pi(\eta_{ih}\mid-)$ indicates the following distribution:

    \[
    \pi(\eta_{ih}\mid c,d)\propto\exp\bigg\{-c\left(\eta_{ih}-\frac{d}{c}\right)^{2}\bigg\}\eta_{ih}^{2},
    \]
    with
    \[
    c=\frac{1}{2\psi}+\sum_{j=1}^{p}\frac{1}{2\sigma_{j}^{2}}\lambda_{jh}^{2}
    \quad\mbox{ and }\quad
    d=\sum_{j=1}^{p}\frac{1}{2\sigma_{j}^{2}}\lambda_{jh}\Big(y_{ij}-\sum_{l\neq h}^{k}\lambda_{jl}\eta_{il}\Big).
    \]

    Due to the non-conjugacy of $\pi(\eta_{ih}\mid c,d)$, we embed a Metropolis-Hastings step, which is implemented as follows (a schematic code sketch of this update is given after Algorithm 2):

    Input: Probability density $\pi(\eta_{ih}\mid c,d)$, initial state $\eta_{ih}^{0}$
    Output: Posterior samples from $\pi(\eta_{ih}\mid c,d)$
    for $m=1$ to $M$ do
           Generate a random candidate $\eta_{ih}^{\ast}\sim\mathcal{N}(\mu=\eta_{ih}^{m-1},\sigma)$;
           Calculate the acceptance ratio $r=\exp\left(\log\pi(\eta_{ih}^{\ast}\mid c,d)-\log\pi(\eta_{ih}^{m-1}\mid c,d)\right)$;
           Accept or reject:
           $\alpha=\min(1,r)$, $Z\sim\mathrm{Bern}(\alpha)$;
           $\eta_{ih}^{m}=Z\,\eta_{ih}^{\ast}+(1-Z)\,\eta_{ih}^{m-1}$.
    end for
    Algorithm 1: Metropolis-Hastings algorithm
  5.

    Sample the latent variable $Z_{ih}$, for $i\in\{1,\dots,n\}$ and $h\in\{1,\dots,k\}$, from $\mathrm{Bern}\big(1-f(Z_{ih}=0\mid-)\big)$, where:

    \[
    f(Z_{ih}=0\mid-)=\frac{f(Z_{ih}=0)\,f(y_{i}\mid Z_{ih}=0,-)}{f(Z_{ih}=1)\,f(y_{i}\mid Z_{ih}=1,-)+f(Z_{ih}=0)\,f(y_{i}\mid Z_{ih}=0,-)}
    =\frac{f(Z_{ih}=0)}{f(Z_{ih}=0)+f(Z_{ih}=1)\,T},
    \]

    where $T=K\sqrt{2\pi}\,H^{-\frac{1}{2}}(H^{-1}+M^{2})$, $H=\frac{1}{\psi}+\lambda_{h}^{\intercal}\Sigma^{-1}\lambda_{h}$, $M=\frac{1}{H}(y_{i}-\Lambda_{(-h)}\eta_{i(-h)})^{\intercal}\Sigma^{-1}\lambda_{h}$, and $K=2\pi\psi^{-\frac{3}{2}}\exp\{\frac{1}{2}HM^{2}\}$. Here $\Lambda_{(-h)}$ denotes the $p\times(k-1)$ matrix with the $h$-th column dropped, and $\eta_{i(-h)}$ denotes the $(k-1)$-vector with the $\eta_{ih}$ entry deleted.

  6.

    Sample the probability parameter $\theta_{h}$, for each factor $h\in\{1,\dots,k\}$, from:

    \[
    f(\theta_{h}\mid Z_{h},a_{\theta},b_{\theta})=\mathrm{Beta}\left(\sum_{i=1}^{n}Z_{ih}+a_{\theta},\;n-\sum_{i=1}^{n}Z_{ih}+b_{\theta}\right),
    \]

    where $a_{\theta}$ and $b_{\theta}$ are the hyperparameters defined in (3).

  7.

    The residual variance $\sigma_{j}^{2}$, for $j\in\{1,\dots,p\}$, is drawn from the posterior distribution:

    \[
    f(\sigma_{j}^{-2}\mid a_{\sigma},b_{\sigma},y,\Lambda,\eta)=\mathrm{Ga}\Big\{a_{\sigma}+\frac{n}{2},\;b_{\sigma}+\frac{1}{2}\sum_{i=1}^{n}\left(y_{ij}-\lambda_{j}^{\intercal}\eta_{i}\right)^{2}\Big\}.
    \]
Input: Outcome matrix $\mathbf{Y}$
Output: Posterior distribution of each parameter
for $r=1$ to $R$ do
       Sample the factor loadings $\lambda_{j}$, for $j\in\{1,\dots,p\}$;
       Sample $\phi_{jh}$, for $j\in\{1,\dots,p\}$ and $h\in\{1,\dots,k\}$;
       Sample $\delta_{h}$, for $h\in\{1,\dots,k\}$;
       Sample the factor scores $\eta_{ih}$ given the latent variable $Z_{ih}$, for $i\in\{1,\dots,n\}$ and $h\in\{1,\dots,k\}$;
       Sample the latent variable $Z_{ih}$, for $i\in\{1,\dots,n\}$ and $h\in\{1,\dots,k\}$;
       Sample $\theta_{h}\mid Z_{h}$, for $h\in\{1,\dots,k\}$;
       Sample the residual variance $\sigma_{j}^{2}$, for $j\in\{1,\dots,p\}$.
end for
Algorithm 2: Posterior computation.
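
As a complement to step 4 and Algorithm 1, the following is a minimal Python sketch (ours, not the authors' code) of the random-walk Metropolis-Hastings update for a single factor score $\eta_{ih}$, assuming the quantities $c$ and $d$ defined above have already been computed; the proposal standard deviation is an illustrative choice that would be tuned in practice.

import numpy as np

def log_target(eta, c, d):
    # log of the unnormalized conditional pi(eta | c, d) from step 4
    if eta == 0.0:
        return -np.inf
    return -c * (eta - d / c) ** 2 + 2 * np.log(abs(eta))

def mh_update_eta(eta_current, c, d, prop_sd=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eta_star = rng.normal(eta_current, prop_sd)        # Gaussian random-walk proposal
    log_r = log_target(eta_star, c, d) - log_target(eta_current, c, d)
    if np.log(rng.uniform()) < log_r:                  # accept with probability min(1, r)
        return eta_star
    return eta_current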

2.3 Model identification

A key challenge in factor analysis is the identifiability of the factor loading matrix $\Lambda$. For any orthogonal matrix $P$, $\Lambda^{\prime}=\Lambda P$ and $\eta_{i}^{\prime}=P^{\intercal}\eta_{i}$ satisfy Equation (1) ($\Lambda\eta_{i}=\Lambda^{\prime}\eta_{i}^{\prime}$). This rotational invariance creates a non-identifiability issue, as multiple equivalent solutions exist for the same model. A common solution is to constrain the loading matrix to be lower triangular, as described by Lopes and West [33]. Such constraints, however, can be restrictive and are not always practical. Moreover, each chain of the Gibbs sampler may converge to a different rotational solution, and averaging these solutions can lead to results that are not interpretable or meaningful. To address this challenge, Aßmann et al. [1] developed an orthogonal post-processing (OP) approach that resolves rotational ambiguity in unconstrained MCMC samples. This method generates samples without imposing rotational constraints and then aligns them by computing an orthogonal transformation for each iteration. When the model parameters are unconstrained, the Gibbs sampler produces orthogonally mixed MCMC samples, and each draw $\eta^{(r)}$ is associated with a potentially different orthogonal transformation $\mathbf{Q}^{(r)}$. Starting from a sequence of $R$ draws $\{\eta^{(r)}\}_{r=1}^{R}$ from the posterior distribution, the OP algorithm aligns the samples with a reference matrix by solving the following constrained optimization problem:

\[
\Big\{\big\{\tilde{\mathbf{Q}}^{(r)}\big\}_{r=1}^{R},\tilde{\eta}^{\ast}\Big\}=\arg\min_{\mathbf{Q}^{(r)},\eta^{\ast}}\sum_{r=1}^{R}L\Big\{\eta^{\ast},\eta^{(r)}\mathbf{Q}^{(r)}\Big\}\quad\text{s.t. }\mathbf{Q}^{(r)}\mathbf{Q}^{(r)\intercal}=\mathbf{I},\;r=1,\dots,R, \qquad (5)
\]

where LL is the loss function:

\[
L\Big\{\eta^{\ast},\eta^{(r)}\mathbf{Q}^{(r)}\Big\}=\mathrm{tr}\Big\{(\eta^{(r)}\mathbf{Q}^{(r)}-\eta^{\ast})^{\intercal}(\eta^{(r)}\mathbf{Q}^{(r)}-\eta^{\ast})\Big\}.
\]

The optimization is carried out in two steps, summarized in Algorithm 3; a detailed derivation of the solution can be found in Schönemann [40].

Input: Factor score draws $\big\{\eta^{(r)}:\eta^{(r)}\in\mathbb{R}^{n\times k^{\ast}}\big\}_{r=1}^{R}$
Output: Recovered factor score matrix $\tilde{\eta}^{\ast}$
Set the initial value of $\eta^{\ast}$ to the last iteration of the Gibbs sampler:
$\eta^{\ast}=\eta^{(R)}$
while $d_{r}>\epsilon$ do
       obtain the orthogonal transformations via SVD
       for $r=1$ to $R$ do
             $\Sigma_{\eta^{\ast}}=\eta^{(r)\intercal}\eta^{\ast}=\mathbf{U}_{r}\Sigma_{r}\mathbf{V}_{r}^{\intercal}$
             $\tilde{\mathbf{Q}}^{(r)}=\mathbf{U}_{r}\mathbf{V}_{r}^{\intercal}$
       end for
      compute the recovered factor score matrix:
       $\tilde{\eta}^{\ast}=\frac{1}{R}\sum_{r=1}^{R}\eta^{(r)}\tilde{\mathbf{Q}}^{(r)}$
       obtain the 2-norm between the estimate and the target:
       $d_{r}=\lVert\eta^{\ast}-\tilde{\eta}^{\ast}\rVert$
       update $\eta^{\ast}$ with $\tilde{\eta}^{\ast}$:
       $\eta^{\ast}=\tilde{\eta}^{\ast}$
end while
Algorithm 3: Recovering the factor score matrix with the effect of rotation removed.

As Aßmann et al. [1] suggested, the last iteration of the Gibbs sampler can be chosen as the initial value for $\eta^{\ast}$. This iterative procedure aligns the score matrices consistently across MCMC samples, ensuring interpretability and identifiability of the final results.

Convergence is assessed using the 2-norm to measure the distance between the current estimate and the final aligned loading matrix. Although the procedure is iterative, it typically requires only a single iteration to achieve alignment.
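
For concreteness, the following Python sketch re-implements Algorithm 3 schematically (our code, assuming the posterior draws of $\eta$ are stored in an array of shape $R\times n\times k$): each draw is aligned to the current reference via the orthogonal Procrustes solution obtained from an SVD, and the reference is updated until the 2-norm criterion is met.

import numpy as np

def op_align(eta_draws, tol=1e-8, max_iter=100):
    # eta_draws: array of shape (R, n, k) with the posterior draws of eta
    eta_star = eta_draws[-1].copy()          # initialize with the last Gibbs draw
    for _ in range(max_iter):
        aligned = np.empty_like(eta_draws)
        for r, eta_r in enumerate(eta_draws):
            # orthogonal Procrustes: Q_r = U V^T from the SVD of eta_r^T eta_star
            U, _, Vt = np.linalg.svd(eta_r.T @ eta_star)
            aligned[r] = eta_r @ (U @ Vt)
        eta_tilde = aligned.mean(axis=0)     # recovered (aligned) score matrix
        if np.linalg.norm(eta_star - eta_tilde) < tol:
            return eta_tilde
        eta_star = eta_tilde                 # update the reference and iterate
    return eta_star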

2.4 Choosing the number of factors

Determining the optimal number of factors is a critical challenge in factor analysis. The goal is to retain a small number of factors while capturing enough variance to preserve the structure of the data. Common approaches include threshold-based methods and information criteria.

Threshold-based methods, such as Kaiser’s criterion or scree plots, rely on subjective thresholds, such as retaining factors with eigenvalues greater than one in the Principal Component Analysis (PCA). Although easy to implement, these methods can be sensitive to data structure, leading to inconsistent results. Information criteria, such as the Bayesian Information Criterion (BIC), involve fitting models with varying numbers of factors and selecting the optimal model. However, these methods are computationally expensive, especially for high-dimensional data, as the process must be repeated for each potential number of factors. Furthermore, maximum likelihood estimates may not exist in such settings, reducing the reliability of these criteria.

Bhattacharya and Dunson [7] introduced an adaptive strategy to address the challenge of determining the optimal number of factors. This method begins with an overestimate of the number of factors and progressively shrinks redundant columns in the factor loading matrix. At each iteration, columns with elements near zero are pruned, while additional columns are introduced if needed. The effective number of factors stabilizes after a burn-in period, and the final estimate is determined using the median or mode of these stabilized values. This approach avoids the need for subjective thresholds and offers computational efficiency. However, simulations indicate that it may overestimate the number of factors in small sample sizes and low-dimensional settings.

In our approach, we have successfully leveraged the sparsity of the factor score matrix by specifying a mass-nonlocal prior. Building on this, instead of focusing on the loading matrix, we utilize the sparsity information inherent in the factor score matrix. Specifically, the parameter $\theta_{h}$ (see the model distribution (2) and the prior (3)) reflects the degree of sparsity for each factor $h$. Factors with higher sparsity are less likely to represent meaningful variation in the data and can be pruned from the candidate set of factors.

Unlike methods that apply a hard threshold on the proportion of non-zero entries in the loading matrix to filter out auxiliary factors, our approach incorporates the uncertainty of factor selection, which brings greater flexibility. The procedure starts with a conservative guess, treated as the working number of factors, and we then fit our model. The simulation results demonstrate that, even if the specified number is incorrect, our model remains robust and produces highly accurate estimates. To determine the final number of factors, we examine the posterior mean of $\theta_{h}$ for each factor $h$, which naturally reveals the probability that the factor is influential in capturing the variance structure of the data. Based on these probabilities, we can apply an appropriate criterion or threshold to determine the number of factors and truncate both the loading and the score matrices accordingly, as illustrated in the sketch below. This adaptive truncation balances robustness with flexibility, ensuring that only meaningful factors are retained while avoiding the pitfalls of overestimation.
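
As a minimal illustration of this truncation rule, the sketch below (our code; names are hypothetical) retains the factors whose posterior mean of $\theta_{h}$ exceeds a chosen threshold, such as the prior mean of the Beta distribution.

import numpy as np

def select_factors(theta_draws, threshold=0.17):
    # theta_draws: array of shape (iterations, k) with posterior draws of theta
    theta_mean = theta_draws.mean(axis=0)          # posterior mean of theta_h
    return np.where(theta_mean > threshold)[0]     # indices of retained factors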

3 Simulation study

In this section, we present simulation experiments designed to evaluate the ability of our proposed model to recover the sparse structure of the factor scores $\eta$ and factor loadings $\Lambda$, and to correctly impute the zero entries. To benchmark performance, we compare our method with the MGPS factor model proposed by [7], a well-established method in Bayesian factor analysis.

We construct four Scenarios with varying levels of complexity in the data-generating process, focusing on different sparsity schemes in the factor scores. The data-generating process differs across scenarios in terms of sample size $n$, number of variables $p$, number of factors $k^{\ast}$, and the sparsity of the factor scores induced by $\{\theta_{h}\}_{h=1}^{k^{\ast}}$. Here, the star denotes the fixed true value, in contrast to the tilde, which indicates the number of factors used in Algorithm 2 to estimate the model. Table 1 summarizes the scenario-specific parameters. Scenario 1 represents a moderate setup with a small sample size and a small number of observed variables. Scenario 2 retains the same dimensions as Scenario 1 but introduces heterogeneity in the sparsity of the factor scores. Scenarios 3 and 4 are specifically designed to closely mimic real-world problems. In Scenario 3, the dimension of the response variable is larger than the sample size, i.e., $p\gg n$. In Scenario 4, both $p$ and $n$ increase significantly, as seen in the nutritional data analyzed in Section 4. A common data-generating mechanism was employed across all scenarios, as described in Table 2 and in the code sketch that follows it.

Each scenario is replicated 50 times to ensure robust evaluation.

Table 1: Scenario-specific parameters used to generate the simulation experiments.
                                    Scenario 1         Scenario 2            Scenario 3                       Scenario 4
$n$                                 100                100                   30                               10000
$p$                                 20                 20                    60                               40
$k^{\ast}$                          3                  3                     5                                3
$\{\theta_{h}\}_{h=1}^{k^{\ast}}$   $0.4\;\forall h$   $\{0.8, 0.6, 0.4\}$   $\{0.8, 0.7, 0.6, 0.5, 0.4\}$    $\{0.6, 0.4, 0.2\}$
Table 2: Common data-generating mechanism across the four Scenarios, for each unit $i\in\{1,\dots,n\}$ and factor $h\in\{1,\dots,k^{\ast}\}$. Both $n$ and $k^{\ast}$ vary across Scenarios.
$Z_{ih}\sim\mathrm{Bern}(\theta_{h})$,
$\eta_{ih}\sim(1-Z_{ih})\,\delta_{0}+Z_{ih}\,\mathrm{pMOM}(\psi=0.5)$,
$\lambda_{h}\sim\mathcal{N}_{p}(0,\mathbb{I})$,
$\epsilon_{i}\sim\mathcal{N}_{p}(0,\Sigma)$ with $\Sigma=\mathrm{diag}(\sigma^{2}_{1},\dots,\sigma^{2}_{p})$,
$\sigma^{2}_{j}\sim\mathrm{Unif}(0,1)\;\forall j\in\{1,\dots,p\}$,
$Y_{i}=\Lambda\eta_{i}+\epsilon_{i}$.
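
As an illustration, a short Python sketch (ours) of one simulation replicate following Table 2, reusing the pmom_sample helper sketched in Section 2.1; the function name and signature are assumptions for illustration.

import numpy as np

def simulate_scenario(n, p, k_true, theta, psi=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)                   # sparsity level of each factor
    Z = rng.binomial(1, theta, size=(n, k_true))             # Z_{ih} ~ Bern(theta_h)
    eta = Z * pmom_sample(psi, size=(n, k_true), rng=rng)    # spike at zero or pMOM slab
    Lambda = rng.normal(0.0, 1.0, size=(p, k_true))          # lambda_h ~ N_p(0, I)
    sigma2 = rng.uniform(0.0, 1.0, size=p)                   # sigma_j^2 ~ Unif(0, 1)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(n, p))      # idiosyncratic errors
    Y = eta @ Lambda.T + eps                                 # Y_i = Lambda eta_i + eps_i
    return Y, Lambda, eta

# Example: Scenario 2, with n = 100, p = 20, k* = 3, theta = {0.8, 0.6, 0.4}
# Y, Lambda, eta = simulate_scenario(100, 20, 3, [0.8, 0.6, 0.4])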

3.1 Results

To evaluate the ability to recover the factor loading matrix $\Lambda$ and the factor score matrix $\eta$ in the four simulated scenarios, we use the RV coefficient [36]. The RV coefficient ranges from 0 to 1, where values closer to 1 indicate greater similarity between the estimated and true matrices.
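
For reference, a small sketch (our code) of the RV coefficient between two configuration matrices, following Robert and Escoufier [36]; applying it to the estimated and true loading (or score) matrices makes the comparison invariant to rotation.

import numpy as np

def rv_coefficient(A, B):
    # RV coefficient between configuration matrices A (m x k1) and B (m x k2)
    S, T = A @ A.T, B @ B.T
    return np.trace(S @ T) / np.sqrt(np.trace(S @ S) * np.trace(T @ T))

# Usage sketch: rv_lambda = rv_coefficient(Lambda_hat, Lambda_true)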

The results, shown in Figure 2, demonstrate that our proposed model (referred to as Mass-nonlocal) effectively recovers the factors. In Scenarios 1, 2, and 4, the RV coefficients for $\Lambda\Lambda^{\intercal}$ and $\eta\eta^{\intercal}$ are consistently above 0.95. In Scenario 3, where $p$ is twice the size of $n$, the RV coefficients are slightly lower, with median values close to 0.9 for $\Lambda\Lambda^{\intercal}$ and 0.85 for $\eta\eta^{\intercal}$, but the performance still exceeds that of the MGPS model.

The BFMAN model outperforms MGPS in all scenarios, particularly in Scenarios 3 and 4, where the MGPS RV coefficients are often below 0.85. These results highlight the superior ability of the proposed model to recover the structure of the data under varying levels of sparsity and dimensionality.

Figure 2: Results comparison: RV coefficients for $\eta\eta^{\intercal}$ and $\Lambda\Lambda^{\intercal}$ estimated with our proposed model (Mass-nonlocal, in red) and the MGPS model (in blue) across the four simulated scenarios.

The strong performance of our proposed model is attributed to its ability to effectively capture the sparsity of the factor score matrix. Figure 3 illustrates the distribution of the estimated probabilities of nonzero entries in the factor score matrix for each factor across the 50 replicates. The red lines in the figure indicate the true simulated values reported in Table 1. The estimated probabilities align well with the true values, as evidenced by their inclusion within the interquartile range of the boxplots in most cases.

Figure 3: Distribution, over the 50 replicates, of the probability of nonzero entries in the factor score matrix ($\theta$) estimated for each factor and each of the four simulated scenarios. The red line indicates the true value.

Moreover, the BFMAN model demonstrates superior performance in selecting the number of factors. Table 3 presents the results using two different thresholds for BFMAN, together with the MGPS results. Specifically, the threshold of 0.17 for $\theta$ (i.e., the probability of observing non-zero entries in the factor scores for a specific factor) is chosen as it corresponds to the mean of the assumed Beta prior with shape parameters equal to 1 and 5. Across all four scenarios, BFMAN consistently selects values closer to the true number of factors used in the data-generating process, whereas MGPS tends to select notably higher values.

Table 3: Comparison of the selected number of factors. For BFMAN we use the thresholds 0.20 and 0.17.
Scenario 1 Scenario 2 Scenario 3 Scenario 4
true 3 3 5 3
BFMAN thr.=0.20 3 4 6 4
BFMAN thr.=0.17 4 4 7 4
MGPS 10 9 10 10

4 Nutritional data and cardiovascular diseases

Diet plays an important role in diseases such as cancer [46], obesity [23], and cardiovascular disease [28], contributing to more than 500,000 deaths annually in the United States, including 280,000 linked to obesity [2]. Understanding eating patterns is crucial for assessing the diet-disease relationship and informing public health policies.

A flourishing literature in nutritional epidemiology has leveraged factor analysis [see e.g., 25, 41, 20, 16, 5, 17, 14] to estimate dietary patterns within a single population or across diverse ethnic groups.

In this section, we use the model proposed in Section 2.1 to analyze dietary patterns and their role in cholesterol levels in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) in the United States, the dataset used in De Vito and Avalos-Pacheco [14]'s work. In particular, the HCHS/SOL study aims to fill the gaps in understanding the health status and burden of disease among Hispanics/Latinos in the US, exploring the relationship between risk factors and disease incidence over time. Specifically, it examines food and diet as potential health risks, focusing on the Healthy Eating Index and cardiovascular disease. The study analyzed data from 16,415 adults aged 18-74 years, representing seven Hispanic/Latino ethnic groups (Cuban, Dominican, Mexican, Puerto Rican, Central and South American, and others), residing in four sites (Bronx, Chicago, Miami, and San Diego). Data were collected at baseline (2008-2011) using a stratified two-stage probability sampling design detailed by [29].

The original dataset includes a total of 14,002 participants; after removing individuals with missing information, the final sample size is 12,972. We use 42 nutrients that best represent the overall diet of Hispanics/Latinos in the considered population (see De Vito and Avalos-Pacheco [14] for details on the nutrient variable selection). The selected nutrients are log-transformed and standardized, following previous studies.

For each individual, we have socio-demographic characteristics such as gender, education, alcohol use, cigarette use, body mass index, and ethnic group. Moreover, although several disease measures are available, in this analysis we consider only hypercholesterolemia and hypertension, defined as blood pressure of 140/90 mmHg or higher.

In particular, following the nutritional literature [see e.g., 31, 14], the analysis is divided into two parts: the estimation of our proposed model to identify the dietary patterns, and a regression model to characterize the association between the dietary patterns and the cardiovascular outcomes, hypercholesterolemia and hypertension.

4.1 Results

We estimated the proposed model on the nutrient data, where the matrix $\mathbf{Y}$, defined in Section 2.1, is a $12{,}972 \times 42$ matrix. We set $k$, i.e., the number of factors used to run the model, equal to 6. However, the posterior means of the percentage of non-zero scores per factor are, in decreasing order, $(0.70, 0.42, 0.41, 0.38, 0.19, 0.05)$; therefore, we decided to retain only the first five factors and to remove the last one, since 95% of its factor scores are zero.

Figure 4 reports the estimated factor loadings associated with the nutrients. Consistently with the nutritional literature and with the results obtained for the same dataset by De Vito and Avalos-Pacheco [14], we can name the first factor 'plant food', due to the high-valued loadings, in particular on soluble and insoluble dietary fiber, vegetable protein, potassium, manganese, natural folate, and copper. The second factor, with high values for animal protein, zinc, vitamins B12, B6, B2, and B3, cholesterol, and short-chain saturated fatty acids (SCSFA), is named 'animal product'. The 'seafood' pattern is characterized by eicosapentaenoic acid (EPA), docosapentaenoic acid (DPA), docosahexaenoic acid (DHA), and arachidonic acid. Factor 4 can be defined as 'processed food', due to the substantial loadings on long-chain saturated fatty acids (LCSFA), long-chain monounsaturated fatty acids, linoleic and linolenic acids, and total trans fatty acids. Finally, factor 5, named 'dairy product', has high-valued loadings on short- and medium-chain saturated fatty acids (SCSFA and MCSFA), calcium, retinol, and vitamin D.

Figure 4: Heatmap of the factor loadings in the HCHS/SOL data estimated with BFMAN.

The five identified dietary patterns are incorporated into a logistic regression model to investigate their association with cardiovascular diseases while controlling for gender. The results indicate that individuals with higher consumption of processed foods exhibit a significantly higher probability of developing hypercholesterolemia.

5 Discussion

In this paper, we introduce a novel approach to Bayesian factor analysis that emphasizes the key role of the factor scores. By leveraging the non-local mass prior, our proposed BFMAN model effectively captures the heterogeneity within the factor scores, enabling the identification of their individual contributions. This capability is particularly evident in the nutritional data application, where each study participant exhibits varying degrees of influence in shaping distinct dietary patterns. Moreover, the sparsity helps reduce noise by using the zero values to identify individuals who do not follow specific dietary behaviors.

Sparsity plays a key role in (i) the methodological aspect of defining a robust and novel approach to determining the optimal number of factors, and (ii) the real-world application, where the score each individual holds for the corresponding factors is linked to different outcomes. BFMAN demonstrates ability (i) in the simulation study by accurately identifying the number of factors, which strengthens the robustness of our conclusions in the nutritional data application. Point (ii), in turn, allows us to address the question: how is a dietary pattern associated with cardiovascular diseases?

The simulation study demonstrates that BFMAN outperforms MGPS in recovering both the factor score matrix and the factor loading matrix, with RV coefficient values closer to 1, and in selecting a number of factors closer to that of the true generating process in all the simulated scenarios.

In the nutrient analysis, we identified five main dietary patterns: plant-based foods, animal products, seafood, processed foods, and dairy products. Notably, only individuals with higher consumption of processed foods demonstrated a significantly increased probability of developing hypercholesterolemia.

Our work aims to highlight the pivotal role of the factor scores in guiding future research directions, whereas much of the existing literature has primarily focused on the factor loadings. A related work is proposed by Bortolato and Canale [8], who introduce innovative shrinkage priors for the latent factor scores, enabling greater flexibility to address the diverse scenarios encountered in multi-study factor analysis. Different priors can nevertheless be adopted for the factor scores, depending on the specific methodological considerations and real-world data challenges.


Funding

RDV was supported by the US National Institutes of Health, NIGMS/NIH COBRE CBHD P20GM109035. This manuscript was prepared using HCHS/SOL Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the HCHS/SOL or the NHLBI.

References

  • Aßmann et al. [2016] Aßmann, C., Boysen-Hogrefe, J., and Pape, M. (2016). “Bayesian analysis of static and dynamic factor models: An ex-post approach towards the rotation problem.” Journal of Econometrics, 192(1): 190–206.
  • Aune et al. [2016] Aune, D., Keum, N., Giovannucci, E., Fadnes, L. T., Boffetta, P., Greenwood, D. C., Tonstad, S., Vatten, L. J., Riboli, E., and Norat, T. (2016). “Nut consumption and risk of cardiovascular disease, total cancer, all-cause and cause-specific mortality: a systematic review and dose-response meta-analysis of prospective studies.” BMC medicine, 14: 1–14.
  • Avalos-Pacheco et al. [2022] Avalos-Pacheco, A., Rossell, D., and Savage, R. S. (2022). “Heterogeneous large datasets integration using Bayesian factor regression.” Bayesian analysis, 17(1): 33–66.
  • Bai et al. [2021] Bai, R., Ročková, V., and George, E. I. (2021). “Spike-and-slab meets LASSO: A review of the spike-and-slab LASSO.” Handbook of Bayesian variable selection, 81–108.
  • Bennett et al. [2022] Bennett, G., Bardon, L. A., and Gibney, E. R. (2022). “A comparison of dietary patterns and factors influencing food choice among ethnic groups living in one locality: a systematic review.” Nutrients, 14(5): 941.
  • Bernardo et al. [2003] Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., and West, M. (2003). “Bayesian factor regression models in the “large p, small n” paradigm.” Bayesian Statistics, 7: 733–742.
  • Bhattacharya and Dunson [2011] Bhattacharya, A. and Dunson, D. B. (2011). “Sparse Bayesian infinite factor models.” Biometrika, 98(2): 291–306.
  • Bortolato and Canale [2024] Bortolato, E. and Canale, A. (2024). “Adaptive partition Factor Analysis.” arXiv preprint arXiv:2410.18939.
  • Carvalho et al. [2008] Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). “High-dimensional sparse factor modeling: applications in gene expression genomics.” Journal of the American Statistical Association, 103(484): 1438–1456.
  • Casa et al. [2022] Casa, A., O’Callaghan, T. F., and Murphy, T. B. (2022). “Parsimonious Bayesian factor analysis for modelling latent structures in spectroscopy data.” The Annals of Applied Statistics, 16(4): 2417–2436.
  • Castelló et al. [2016] Castelló, A., Ascunce, N., Salas-Trejo, D., Vidal, C., Sanchez-Contador, C., Santamarina, C., Pedraz-Pingarron, C., Moreno, M. P., Perez-Gomez, B., Lope, V., et al. (2016). “Association between Western and Mediterranean dietary patterns and mammographic density.” Obstetrics & Gynecology, 128(3): 574–581.
  • Chen et al. [1992] Chen, X., Rubin, K. H., and Sun, Y. (1992). “Social reputation and peer relationships in Chinese and Canadian children: A cross-cultural study.” Child development, 63(6): 1336–1343.
  • Daviglus et al. [2014] Daviglus, M. L., Pirzada, A., and Talavera, G. A. (2014). “Cardiovascular disease risk factors in the Hispanic/Latino population: lessons from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).” Progress in cardiovascular diseases, 57(3): 230–236.
  • De Vito and Avalos-Pacheco [2023] De Vito, R. and Avalos-Pacheco, A. (2023). “Multi-study factor regression model: an application in nutritional epidemiology.” arXiv preprint arXiv:2304.13077.
  • De Vito et al. [2021] De Vito, R., Bellio, R., Trippa, L., and Parmigiani, G. (2021). “Bayesian multistudy factor analysis for high-throughput biological data.” The annals of applied statistics, 15(4): 1723–1741.
  • De Vito et al. [2019] De Vito, R., Lee, Y. C. A., Parpinel, M., Serraino, D., Olshan, A. F., Zevallos, J. P., Levi, F., Zhang, Z. F., Morgenstern, H., Garavello, W., et al. (2019). “Shared and study-specific dietary patterns and head and neck cancer risk in an international consortium.” Epidemiology, 30(1): 93–102.
  • De Vito et al. [2022] De Vito, R., Stephenson, B., Sotres-Alvarez, D., Siega-Riz, A.-M., Mattei, J., Parpinel, M., Peters, B. A., Bainter, S. A., Daviglus, M. L., Van Horn, L., et al. (2022). “Shared and ethnic background site-specific dietary patterns in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL).” medRxiv, 2022–06.
  • DiStefano et al. [2019] DiStefano, C., Zhu, M., and Mindrila, D. (2019). “Understanding and using factor scores: Considerations for the applied researcher.” Practical assessment, research, and evaluation, 14(1): 20.
  • Durante [2017] Durante, D. (2017). “A note on the multiplicative gamma process.” Statistics & Probability Letters, 122: 198–204.
  • Edefonti et al. [2012a] Edefonti, V., Hashibe, M., Ambrogi, F., Parpinel, M., Bravi, F., Talamini, R., Levi, F., Yu, G., Morgenstern, H., Kelsey, K., et al. (2012a). “Nutrient-based dietary patterns and the risk of head and neck cancer: a pooled analysis in the International Head and Neck Cancer Epidemiology consortium.” Annals of Oncology, 23(7): 1869–1880.
  • Edefonti et al. [2009] Edefonti, V., Randi, G., La Vecchia, C., Ferraroni, M., and Decarli, A. (2009). “Dietary patterns and breast cancer: a review with focus on methodological issues.” Nutrition reviews, 67(6): 297–314.
  • Edefonti et al. [2012b] Edefonti, V. et al. (2012b). “Nutrient-based dietary patterns and the risk of head and neck cancer: a pooled analysis in the International Head and Neck Cancer Epidemiology consortium.” Annals of Oncology, 23(7): 1869–1880.
  • Foster et al. [2003] Foster, G. D., Wyatt, H. R., Hill, J. O., McGuckin, B. G., Brill, C., Mohammed, B. S., Szapary, P. O., Rader, D. J., Edman, J. S., and Klein, S. (2003). “A randomized trial of a low-carbohydrate diet for obesity.” New England Journal of Medicine, 348(21): 2082–2090.
  • Frühwirth-Schnatter et al. [2024] Frühwirth-Schnatter, S., Hosszejni, D., and Lopes, H. F. (2024). “Sparse Bayesian factor analysis when the number of factors is unknown.” Bayesian Analysis, 1(1): 1–31.
  • Haines et al. [1999] Haines, P. S., Siega-Riz, A. M., and Popkin, B. M. (1999). “The Diet Quality Index revised: a measurement instrument for populations.” Journal of the American Dietetic Association, 99(6): 697–704.
  • Johnson and Rossell [2010] Johnson, V. E. and Rossell, D. (2010). “On the use of non-local prior densities in Bayesian hypothesis tests.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(2): 143–170.
  • Johnson and Rossell [2012] — (2012). “Bayesian model selection in high-dimensional settings.” Journal of the American Statistical Association, 107(498): 649–660.
  • Kromhout [2001] Kromhout, D. (2001). “Diet and cardiovascular diseases.” The journal of nutrition, health & aging, 5(3): 144–149.
  • LaVange et al. [2010] LaVange, L. M., Kalsbeek, W. D., Sorlie, P. D., Avilés-Santa, L. M., Kaplan, R. C., Barnhart, J., Liu, K., Giachello, A., Lee, D. J., Ryan, J., et al. (2010). “Sample design and cohort selection in the Hispanic Community Health Study/Study of Latinos.” Annals of epidemiology, 20(8): 642–649.
  • Legramanti et al. [2020] Legramanti, S., Durante, D., and Dunson, D. B. (2020). “Bayesian cumulative shrinkage for infinite factorizations.” Biometrika, 107(3): 745–752.
  • Liese et al. [2015] Liese, A. D., Krebs-Smith, S. M., Subar, A. F., George, S. M., Harmon, B. E., Neuhouser, M. L., Boushey, C. J., Schap, T. E., and Reedy, J. (2015). “The Dietary Patterns Methods Project: synthesis of findings across cohorts and relevance to dietary guidance.” The Journal of nutrition, 145(3): 393–402.
  • Lopes and West [2004a] Lopes, H. F. and West, M. (2004a). “Bayesian model assessment in factor analysis.” Statistica Sinica, 14: 41–67.
  • Lopes and West [2004b] — (2004b). “Bayesian model assessment in factor analysis.” Statistica Sinica, 41–67.
  • McDonald and Burr [1967] McDonald, R. P. and Burr, E. (1967). “A comparison of four methods of constructing factor scores.” Psychometrika, 32(4): 381–401.
  • National Heart, Lung, and Blood Institute and others [2009] National Heart, Lung, and Blood Institute and others (2009). “Hispanic Community Health Study/Study of Latinos (HCHS/SOL).” Bethesda, MD: NHLBI.
  • Robert and Escoufier [1976] Robert, P. and Escoufier, Y. (1976). “A unifying tool for linear multivariate statistical methods: the RV-coefficient.” Journal of the Royal Statistical Society Series C: Applied Statistics, 25(3): 257–265.
  • Ročková and George [2016] Ročková, V. and George, E. I. (2016). “Fast Bayesian factor analysis via automatic rotations to sparsity.” Journal of the American Statistical Association, 111(516): 1608–1622.
  • Roy et al. [2021] Roy, A., Lavine, I., Herring, A. H., and Dunson, D. B. (2021). “Perturbed factor analysis: Accounting for group differences in exposure profiles.” The annals of applied statistics, 15(3): 1386.
  • Schiavon et al. [2022] Schiavon, L., Canale, A., and Dunson, D. B. (2022). “Generalized infinite factorization models.” Biometrika, 109(3): 817–835.
  • Schönemann [1966] Schönemann, P. H. (1966). “A generalized solution of the orthogonal procrustes problem.” Psychometrika, 31(1): 1–10.
  • Schulze et al. [2003] Schulze, M. B., Hoffmann, K., Kroke, A., and Boeing, H. (2003). “An approach to construct simplified measures of dietary patterns from exploratory factor analysis.” British Journal of Nutrition, 89(3): 409–418.
  • Shi et al. [2019] Shi, G., Lim, C. Y., and Maiti, T. (2019). “Model selection using mass-nonlocal prior.” Statistics & Probability Letters, 147: 36–44.
  • Sorlie et al. [2010] Sorlie, P. D. et al. (2010). “Design and implementation of the Hispanic community health study/study of Latinos.” Annals of epidemiology, 20(8): 629–641.
  • Tucker [1971] Tucker, L. R. (1971). “Relations of factor score estimates to their use.” Psychometrika, 36(4): 427–436.
  • Wang et al. [2011] Wang, X. V., Verhaak, R. G., Purdom, E., Spellman, P. T., and Speed, T. P. (2011). “Unifying gene expression measures from multiple platforms using factor analysis.” PloS One, 6(3).
  • Willett and MacMahon [1984] Willett, W. C. and MacMahon, B. (1984). “Diet and cancer–an overview.” New England Journal of Medicine, 310(10): 633–638.