
Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data

Yuqi Gu (Department of Statistics, Columbia University); Elena A. Erosheva (Department of Statistics, School of Social Work, and the Center for Statistics and the Social Sciences, University of Washington); Gongjun Xu (Department of Statistics, University of Michigan); David B. Dunson (Department of Statistical Science, Duke University)
Abstract

Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M3s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M3s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, for Dirichlet Gro-M3s we propose a Bayesian approach to inferring the variable grouping structure and estimating model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.

Keywords: Bayesian Methods, Grade of Membership Model, Identifiability, Mixed Membership Model, Multivariate Categorical Data, Probabilistic Tensor Decomposition.

1 Introduction

Mixed membership models (MMMs) are a popular family of latent structure models for complex multivariate data. Building on classical latent class and finite mixture models (McLachlan and Peel, 2000), which assign each subject to a single cluster, MMMs include a vector of probability weights characterizing partial membership. MMMs have seen many applications in a wide variety of fields, including social science surveys (Erosheva et al., 2007), topic modeling and text mining (Blei et al., 2003), population genetics and bioinformatics (Pritchard et al., 2000; Saddiki et al., 2015), biological and social networks (Airoldi et al., 2008b), collaborative filtering (Mackey et al., 2010), and data privacy (Manrique-Vallier and Reiter, 2012); see Airoldi et al. (2014) for more examples.

Although MMMs are conceptually appealing and very flexible, with the rich modeling capacity come challenges in identifying, accurately estimating, and interpreting the parameters. MMMs have been popular in many applications, yet key theoretical issues remain understudied. The handbook of Airoldi et al. (2014) emphasized theoretical difficulties of MMMs ranging from non-identifiability to multi-modality of the likelihood. Finite mixture models have related challenges, and the additional complexity of the individual-level mixed membership incurs extra difficulties. A particularly important case is MMMs for multivariate categorical data, such as survey responses (Woodbury et al., 1978; Erosheva et al., 2007; Manrique-Vallier and Reiter, 2012). In this setting, MMMs provide an attractive alternative to the latent class model of Goodman (1974). However, little is known about what is fundamentally identifiable and learnable from observed data under such models.

Identifiability is a key property of a statistical model, meaning that the model parameters can be uniquely obtained from the observables. An identifiable model is a prerequisite for reproducible statistical inferences and reliable applications. Indeed, interpreting parameters estimated from an unidentifiable model is meaningless, and may lead to misleading conclusions in practice. It is thus important to study the identifiability of MMMs and to provide theoretical support to back up the conceptual appeal. Even better would be to expand the MMM framework to allow variations that aid interpretability and identifiability. With this motivation, and focused on mixed membership modeling of multivariate categorical data, this paper makes the following key contributions.

We propose a new class of models for multivariate categorical data, which retains the same flexibility offered by MMMs, while favoring greater parsimony and interpretability. The key innovation is to allow the $p$-dimensional latent membership vector to belong to $G$ ($G\ll p$) groups; memberships are the same for different variables within a group but can differ across groups. We deem the new model the Dimension-Grouped Mixed Membership Model (Gro-M3). Gro-M3 improves interpretability by allowing the potentially high-dimensional observed variables to belong to a small number of meaningful groups. Theoretically, we show that both the continuous model parameters, and the discrete variable grouping structure, can be identified from the data for models in the Gro-M3 class under transparent conditions on how the variables are grouped. This challenging identifiability issue is addressed by carefully leveraging the dimension-grouping structure to write the model as certain structured tensor products, and then invoking Kruskal's fundamental theorem on the uniqueness of three-way tensor decompositions (Kruskal, 1977; Allman et al., 2009).

To illustrate the methodological usefulness of the proposed class of models, we consider a special case in which each subject's mixed membership proportion vector follows a Dirichlet distribution. This is among the most popular modeling assumptions underlying various MMMs (Blei et al., 2003; Erosheva et al., 2007; Manrique-Vallier and Reiter, 2012; Zhao et al., 2018). For such a Dirichlet Gro-M3, we employ a Bayesian inference procedure and develop a Metropolis-Hastings-within-Gibbs algorithm for posterior computation. The algorithm has excellent computational performance. Simulation results demonstrate that this approach can accurately learn the identifiable quantities of the model, including both the variable-grouping structure and the continuous model parameters. This also empirically confirms the model identifiability result.

The rest of this paper is organized as follows. Section 2 reviews existing mixed membership models, introduces the proposed Gro-M3, and provides an interesting probabilistic tensor decomposition perspective of the models. Section 3 is devoted to the study of the identifiability of the new model. Section 4 focuses on the Dirichlet distribution induced Gro-M3 and proposes a Bayesian inference procedure. Section 5 includes simulation studies and Section 6 applies the new model to reanalyze the NLTCS disability survey data. Section 7 provides discussions.

2 Dimension-Grouped Mixed Membership Models

2.1 Existing Mixed Membership Models

In this subsection, we briefly review the existing MMM literature to give our proposal appropriate context. Let $K$ be the number of extreme latent profiles. Denote the $K$-dimensional probability simplex by $\Delta^{K-1}=\{(\pi_{1},\ldots,\pi_{K}):\pi_{k}\geq 0 \text{ for all } k,\;\sum_{k=1}^{K}\pi_{k}=1\}$. Each subject $i$ has an individual proportion vector $\boldsymbol{\pi}_{i}=(\pi_{i,1},\ldots,\pi_{i,K})\in\Delta^{K-1}$, which indicates the degrees to which subject $i$ is a member of the $K$ extreme profiles. The general mixed membership models summarized in Airoldi et al. (2014) have the following distribution,

\[
p\left(\left\{y^{(r)}_{i,1},\ldots,y_{i,p}^{(r)}\right\}_{r=1}^{R}\right)=\int_{\Delta^{K-1}}\prod_{j=1}^{p}\prod_{r=1}^{R}\left(\sum_{k=1}^{K}\pi_{i,k}\,f(y_{i,j}^{(r)}\mid\boldsymbol{\lambda}_{j,k})\right)dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i}), \tag{1}
\]

where $\boldsymbol{\pi}_{i}=(\pi_{i,1},\ldots,\pi_{i,K})$ follows the distribution $D_{\boldsymbol{\alpha}}$ and is integrated out; $\boldsymbol{\alpha}$ refers to generic population parameters depending on the specific model. The hierarchical Bayesian representation for the model in (1) can be written as follows:

\begin{align*}
y_{ij}^{(1)},\ldots,y_{ij}^{(R)}\mid z_{ij}=k &\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}\left([d_{j}];\,\boldsymbol{\lambda}_{j,k}\right),\quad j\in[p];\\
z_{i1},\ldots,z_{ip}\mid\boldsymbol{\pi}_{i} &\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}\left([K];\,\boldsymbol{\pi}_{i}\right),\quad i\in[n];\\
\boldsymbol{\pi}_{1},\ldots,\boldsymbol{\pi}_{n} &\stackrel{\text{i.i.d.}}{\sim}D_{\boldsymbol{\alpha}},
\end{align*}

where "i.i.d." is short for "independent and identically distributed". The number $p$ in (1) is the number of "characteristics", and $R$ is the number of "replications" per characteristic. As shown in (1), for each characteristic $j$, there is a corresponding set of $K$ conditional distributions indexed by parameter vectors $\{\boldsymbol{\lambda}_{j,k}:\,k=1,\ldots,K\}$. Many different mixed membership models are special cases of the general setup (1). For example, the popular Latent Dirichlet Allocation (LDA) (Blei et al., 2003; Blei, 2012; Anandkumar et al., 2014) for topic modeling takes a document $i$ as a subject, and assumes there is only $p=1$ distinct characteristic (one single set of $K$ topics, which are distributions over the word vocabulary) with $R>1$ replications (a document $i$ contains $R$ words which are conditionally i.i.d. given $\boldsymbol{\pi}_{i}$); LDA further specifies $D_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i})$ to be the Dirichlet distribution with parameters $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$.

Focusing on MMMs for multivariate categorical data, there are generally many characteristics with $p\gg 1$ and one replication of each characteristic with $R=1$ in (1). Each variable $y_{i,j}\in\{1,\ldots,d_{j}\}$ takes one of $d_{j}$ unordered categories. For each subject $i$, the observables $\boldsymbol{y}_{i}=(y_{i,1},\ldots,y_{i,p})^{\top}$ are a vector of $p$ categorical variables. MMMs for such data are traditionally called Grade of Membership models (GoMs) (Woodbury et al., 1978). GoMs have been extensively used in applications, such as disability survey data (Erosheva et al., 2007), scholarly publication data (Erosheva et al., 2004), and data disclosure risk and privacy (Manrique-Vallier and Reiter, 2012). GoMs are also useful for psychological measurements where data are Likert scale responses to psychology survey items, and educational assessments where data are students' correct/wrong answers to test questions (e.g., Shang et al., 2021).

In GoMs, the conditional distribution $f(y_{i,j}\mid\boldsymbol{\lambda}_{j,k})$ in (1) can be written as $\mathbb{P}(y_{i,j}\mid\boldsymbol{\lambda}_{j,k})=\prod_{c_{j}=1}^{d_{j}}\lambda_{j,c_{j},k}^{\mathbb{I}(y_{i,j}=c_{j})}$. Hence, the probability mass function of $\boldsymbol{y}_{i}$ in a GoM is

\[
p^{\text{GoM}}\left(y_{i,1},\ldots,y_{i,p}\mid\boldsymbol{\Lambda},\boldsymbol{\alpha}\right)=\int_{\Delta^{K-1}}\prod_{j=1}^{p}\left[\sum_{k=1}^{K}\pi_{i,k}\prod_{c_{j}=1}^{d_{j}}\lambda_{j,c_{j},k}^{\mathbb{I}(y_{i,j}=c_{j})}\right]dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i}). \tag{2}
\]

The hierarchical Bayesian representation for the model in (2) can be written as follows.

\begin{align*}
y_{ij}\mid z_{ij}=k &\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}\left([d_{j}];\,\boldsymbol{\lambda}_{j,k}\right),\quad j\in[p];\\
z_{i1},\ldots,z_{ip}\mid\boldsymbol{\pi}_{i} &\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}\left([K];\,\boldsymbol{\pi}_{i}\right),\quad i\in[n];\\
\boldsymbol{\pi}_{1},\ldots,\boldsymbol{\pi}_{n} &\stackrel{\text{i.i.d.}}{\sim}D_{\boldsymbol{\alpha}}.
\end{align*}

See a graphical model representation of the GoM with sample size $n$ in Figure 1(b), where individual latent indicator variables $(z_{i,1},\ldots,z_{i,p})\in[K]^{p}$ are introduced to better describe the data generative process.
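To make the generative process in (2) concrete, the following is a minimal simulation sketch in Python/numpy. The sample size, the numbers of items and categories, and the choice of a symmetric Dirichlet for $D_{\boldsymbol{\alpha}}$ are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): n subjects, p items, d categories, K profiles
n, p, d, K = 500, 10, 4, 3
alpha = np.full(K, 0.5)                            # example: D_alpha taken to be a symmetric Dirichlet
Lambda = rng.dirichlet(np.ones(d), size=(p, K))    # Lambda[j, k, :] = P(y_j = . | latent profile k)

Pi = rng.dirichlet(alpha, size=n)                  # individual membership vectors pi_i
Y = np.zeros((n, p), dtype=int)
for i in range(n):
    z = rng.choice(K, size=p, p=Pi[i])             # GoM: one latent profile per item
    for j in range(p):
        Y[i, j] = rng.choice(d, p=Lambda[j, z[j]])
```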

[Figure 1 here: graphical model diagrams for four panels. (a) Latent Class Model (probabilistic CP decomposition); (b) Grade of Membership Model, $z_{i,j}\in[K]$ (probabilistic Tucker decomposition); (c) the new Gro-M3 with logit-normal $f(\boldsymbol{\eta}_{i})$ (probabilistic hybrid decomposition); (d) the new Gro-M3 with Dirichlet $\boldsymbol{\pi}_{i}$ (probabilistic hybrid decomposition).]

Figure 1: Graphical model representations of LCMs in (a), GoMs in (b), and the proposed family of Gro-M3s with two examples in (c), (d). Shaded nodes $\{y_{i,j}\}$ are observed variables, white nodes are latent variables, and quantities outside each solid box are population parameters. In (c) and (d), the dotted red box is the key dimension-grouping structure, where the red edges from $\{z_{i,g}\}$ to $\{y_{i,j}\}$ correspond to entries of "1" in the grouping matrix $\mathbf{L}$.

We emphasize that the case with $p>1$ and $R=1$ is fundamentally different from topic models with $p=1$ and $R>1$, with the former typically involving many more parameters. This is because the "bag-of-words" assumption in the topic model with $R>1$ disregards word order in a document and assumes all words in a document are exchangeable. In contrast, our mixed-membership model for multivariate categorical data does not assume a subject's responses to the $p$ items in a survey/questionnaire are exchangeable. In other words, given a subject's mixed membership vector $\boldsymbol{\pi}_{i}$, his/her responses to the $p$ items are independent but not identically distributed (because they follow categorical distributions governed by $p$ different sets of parameters $\{\boldsymbol{\lambda}_{j,k}\in\mathbb{R}^{d_{j}}:\,k\in[K]\}$ for $j=1,\ldots,p$); whereas in a topic model, given a document's latent topic proportion vector $\boldsymbol{\pi}_{i}$, the $R$ words in the document are independent and identically distributed, following the categorical distribution with the same set of parameters $\{\boldsymbol{\lambda}_{k}\in\mathbb{R}^{V}:\,k\in[K]\}$ (here $V$ denotes the vocabulary size). In this sense, the GoM model has greater modeling flexibility than topic models and is more suitable for modeling item response data, where it is inappropriate to assume that the items in the survey/questionnaire are exchangeable or share the same set of parameters. This fact is also clear from Figure 1(b), where for each $j\in[p]$ there is a population quantity, the parameter node $\boldsymbol{\Lambda}_{j,:,:}$ (also denoted by $\boldsymbol{\Lambda}_{j}$ for simplicity), that governs its distribution. Thus identifiability is a much greater challenge for GoM models. To the best of our knowledge, the identifiability of the grade-of-membership (GoM) models for item response data considered in Woodbury et al. (1978) and Erosheva et al. (2007) has not been rigorously investigated so far. Motivated by the difficulty of identifying the GoM model in its original setting due to its large parameter complexity, we next propose a new variable-grouping modeling component to enhance identifiability. Our resulting model still does not make any exchangeability assumption about the items, but rather leverages the variable grouping structure to reduce model complexity.

2.2 New Modeling Component: the Variable Grouping Structure

Generalizing Grade of Membership models for multivariate categorical data, we propose a new structure that groups the $p$ observed variables in the following sense: any subject's latent membership is the same for variables within a group but can differ across groups. To represent the key structure of how the $p$ variables are partitioned into $G$ groups, we introduce the grouping matrix $\mathbf{L}=(\ell_{j,g})$, a $p\times G$ matrix with binary entries, with rows indexed by the $p$ variables and columns by the $G$ groups. Each row $j$ of $\mathbf{L}$ has exactly one entry of "1", indicating group assignment. In particular,

\[
\mathbf{L}=(\ell_{j,g})_{p\times G},\qquad \ell_{j,g}=\begin{cases}1,&\text{if the $j$th variable belongs to the $g$th group};\\ 0,&\text{otherwise}.\end{cases} \tag{3}
\]

Our key specification is the following generative process in the form of a hierarchical Bayesian representation,

\begin{align}
\text{Gro-M}^3:\quad \left\{y_{i,j}\right\}_{\ell_{j,g}=1}\mid z_{i,g}=k &\stackrel{\text{ind.}}{\sim}\text{Categorical}\left([d_{j}];\,\left(\lambda_{j,1,k},\ldots,\lambda_{j,d_{j},k}\right)\right),\quad g\in[G];\nonumber\\
z_{i,1},\ldots,z_{i,G}\mid\boldsymbol{\pi}_{i} &\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}\left([K];\,\boldsymbol{\pi}_{i}\right); \tag{4}\\
\boldsymbol{\pi}_{1},\ldots,\boldsymbol{\pi}_{n} &\stackrel{\text{i.i.d.}}{\sim}D_{\boldsymbol{\alpha}},\nonumber
\end{align}

where "ind." is short for "independent", meaning that conditional on $z_{i,g}=k$, subject $i$'s observed responses to items in group $g$ are independently generated. Hence, given the population parameters $(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\alpha})$, the probability distribution of $\boldsymbol{y}_{i}$ can be written as

\[
p^{\text{Gro-M}^3}\left(y_{i,1},\ldots,y_{i,p}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\alpha}\right)=\int_{\Delta^{K-1}}\prod_{g=1}^{G}\left[\sum_{k=1}^{K}\pi_{i,k}\prod_{j:\,\ell_{j,g}=1}\prod_{c_{j}=1}^{d_{j}}\lambda_{j,c_{j},k}^{\mathbb{I}(y_{i,j}=c_{j})}\right]dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i}).
\]

For a sample with $n$ subjects, assume the observed responses $\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{n}$ are independent and identically distributed according to the above model.
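As a counterpart to the GoM sketch above, the following minimal sketch simulates data from the Gro-M3 generative process (4): items in the same group share one latent profile indicator per subject. The grouping vector, dimensions, and the Dirichlet choice for $D_{\boldsymbol{\alpha}}$ are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (not from the paper)
n, p, G, d, K = 500, 12, 3, 4, 3
s = np.repeat(np.arange(G), p // G)              # grouping vector: s[j] = g iff ell_{j,g} = 1
alpha = np.full(K, 0.5)                          # example choice of D_alpha (Dirichlet)
Lambda = rng.dirichlet(np.ones(d), size=(p, K))  # Lambda[j, k, :] = P(y_j = . | profile k)

Pi = rng.dirichlet(alpha, size=n)
Y = np.zeros((n, p), dtype=int)
for i in range(n):
    z = rng.choice(K, size=G, p=Pi[i])           # one latent profile per variable group
    for j in range(p):
        Y[i, j] = rng.choice(d, p=Lambda[j, z[s[j]]])   # items in group g share z_{i,g}
```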

We visualize the proposed model as a probabilistic graphical model to highlight connections to and differences from existing latent structure models for multivariate categorical data. In Figure 1, we show the graphical model representations of two popular latent structure models for multivariate categorical data in (a) and (b), and of the newly proposed Gro-M3 in (c) and (d). The $\boldsymbol{\Lambda}_{j}$ for $j\in[p]$ denotes a $d_{j}\times K$ matrix with entries $\lambda_{j,c_{j},k}$. Each column of $\boldsymbol{\Lambda}_{j}$ characterizes a conditional probability distribution of variable $y_{j}$ given a particular extreme latent profile. We emphasize that the variable grouping is done at the level of the latent allocation variables $z$, and that the $\boldsymbol{\Lambda}_{j}$ parameters remain free without constraints, just as they are in traditional LCMs or GoMs. The visualizations in Figure 1 also easily distinguish our proposed model from another popular family of methods, the co-clustering methods (Dhillon et al., 2003; Govaert and Nadif, 2013). Co-clustering usually refers to simultaneously clustering the subjects and clustering the variables, where subjects within a cluster exhibit similar behaviors and variables within a cluster also share similar characteristics. Our Gro-M3, however, does not restrict the $p$ variables to have similar characteristics within groups, but rather allows them to have entirely free parameters $\boldsymbol{\Lambda}_{1},\ldots,\boldsymbol{\Lambda}_{p}$. The "dimension-grouping" happens at the latent level by constraining the latent allocations behind the $p$ variables to be grouped into $G$ statuses. Such groupings give rise to a novel probabilistic hybrid tensor decomposition visualized in Figure 1(c)–(d); see Section 2.3 for details.

Other than facilitating model identifiability (see Section 3), our dimension-grouping modeling assumption is also motivated by real-world applications. In general, our new model Gro-M3 with the variable grouping component can be applied to any multivariate categorical data to simultaneously model individual mixed membership and capture variable similarity. For example, Gro-M3 can be applied to survey/questionnaire response data in social sciences, where it is not only of interest to model subjects' partial membership in several extreme latent profiles, but also of interest to identify blocks of items that share similar measurement goals within each block. We next provide numerical evidence to demonstrate the merit of the variable grouping modeling component. For a dataset simulated from Gro-M3 (in the setting of Table 2 below) and also the real-world IPIP personality test dataset (analyzed in Section 6), we calculate the sample Cramer's V between item pairs. Cramer's V is a classical measure of association between two categorical variables, which gives a value between 0 and 1, with larger values indicating stronger association. Figure 2 presents the plots of the sample Cramer's V matrix for the simulated data and the real IPIP data. This figure shows that the pairwise item dependence for the Gro-M3-simulated data looks quite similar to that of the real-world personality test data. Indeed, after fitting the Gro-M3 to the IPIP personality test dataset, the estimated model-based Cramer's V shown in Figure 2(c) nicely and more clearly recovers the item block structure. We conjecture that many real-world datasets in other applied domains exhibit similar grouped dependence. A sketch of the Cramer's V computation is given below.
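For completeness, here is a small Python sketch of the pairwise sample Cramer's V computation; it uses the standard chi-squared-based definition without bias correction, which is an assumption since the paper does not specify the exact variant used.

```python
import numpy as np

def cramers_v(x, y):
    """Sample Cramer's V between two categorical vectors (no bias correction)."""
    cats_x, cats_y = np.unique(x), np.unique(y)
    table = np.array([[np.sum((x == a) & (y == b)) for b in cats_y] for a in cats_x])
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = np.sum((table - expected) ** 2 / expected)
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

def cramers_v_matrix(Y):
    """Pairwise Cramer's V matrix for an n x p matrix of categorical responses."""
    p = Y.shape[1]
    V = np.eye(p)
    for j in range(p):
        for l in range(j + 1, p):
            V[j, l] = V[l, j] = cramers_v(Y[:, j], Y[:, l])
    return V
```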

[Figure 2 here: three heatmap panels of pairwise Cramer's V.]

Figure 2: (a) Sample Cramer's V (abbreviated as CRV) for a simulated dataset. (b) Sample Cramer's V for the IPIP data. (c) Gro-M3-based Cramer's V for the IPIP data.

2.3 Probabilistic Tensor Decomposition Perspective

The Gro-M3 class has interesting connections to popular tensor decompositions. For a subject $i$, the observed vector $\boldsymbol{y}_{i}$ resides in a contingency table with $\prod_{j=1}^{p}d_{j}$ cells. Since MMMs for multivariate categorical data (both the traditional GoM and the newly proposed Gro-M3) induce a probability of $\boldsymbol{y}_{i}$ being in each of these cells, such probabilities $\{p(y_{i,1}=c_{1},\ldots,y_{i,p}=c_{p}\mid-);\,c_{j}\in[d_{j}]~\text{for each}~j\in[p]\}$ can be arranged as a $p$-way $d_{1}\times d_{2}\times\cdots\times d_{p}$ array. This array is a tensor with $p$ modes, which we denote by $\mathbf{P}$; Kolda and Bader (2009) provided a review of tensors. The tensor $\mathbf{P}$ has nonnegative entries that sum to one, so we call it a probability tensor. We next describe in detail the tensor decomposition perspective of our model; this perspective will turn out to be useful in the study of identifiability.

The probability mass function of $\boldsymbol{y}_{i}$ under the traditional GoM model can be written as follows by exchanging the order of product and summation,

\begin{align}
p^{\text{GoM}}\left(y_{i,1}=c_{1},\ldots,y_{i,p}=c_{p}\mid\boldsymbol{\Lambda},\boldsymbol{\alpha}\right)&=\int_{\Delta^{K-1}}\prod_{j=1}^{p}\left[\sum_{k=1}^{K}\pi_{i,k}\lambda_{j,c_{j},k}\right]dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i})\nonumber\\
&=\sum_{k_{1}=1}^{K}\cdots\sum_{k_{p}=1}^{K}\prod_{j=1}^{p}\lambda_{j,c_{j},k_{j}}\underbrace{\int_{\Delta^{K-1}}\pi_{i,k_{1}}\cdots\pi_{i,k_{p}}\,dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i})}_{=:\;\phi^{\text{GoM}}_{k_{1},\ldots,k_{p}}}. \tag{5}
\end{align}

Then $\mathbf{\Phi}^{\text{GoM}}:=\left(\phi^{\text{GoM}}_{k_{1},\ldots,k_{p}};\;k_{j}\in[K]\right)$ forms a tensor with $p$ modes, each of dimension $K$. Further, this tensor $\mathbf{\Phi}^{\text{GoM}}$ is a probability tensor, because $\phi_{k_{1},\ldots,k_{p}}\geq 0$ and it is not hard to see that its entries sum to one. Viewed from a tensor decomposition perspective, this is the popular Tucker decomposition (Tucker, 1966); more specifically, it is the nonnegative and probabilistic version of the Tucker decomposition. The $\mathbf{\Phi}^{\text{GoM}}$ represents the Tucker tensor core, and the products of the $\{\lambda_{j,c_{j},k}\}$ form the Tucker tensor arms.

It is useful to compare our modeling assumption to that of the Latent Class Model (LCM; Goodman, 1974), which follows the graphical model shown in Figure 1(a). The LCM is essentially a finite mixture model assuming each subject $i$ belongs to a single cluster. The distribution of $\boldsymbol{y}_{i}$ under an LCM is

\[
p^{\text{LC}}\left(y_{i,1}=c_{1},\ldots,y_{i,p}=c_{p}\mid\boldsymbol{\Lambda},\boldsymbol{\nu}\right)=\sum_{k=1}^{K}\mathbb{P}(z_{i}=k)\prod_{j=1}^{p}\mathbb{P}(y_{i,j}\mid z_{i}=k)=\sum_{k=1}^{K}\nu_{k}\prod_{j=1}^{p}\lambda_{j,c_{j},k}. \tag{6}
\]

Based on the above definition, each subject $i$ has a single variable $z_{i}\in[K]$ indicating which latent class it belongs to, rather than a mixed membership proportion vector $\boldsymbol{\pi}_{i}$. Denoting $\boldsymbol{\nu}^{\text{LC}}=(\nu_{k};\,k\in[K])$, then (6) corresponds to the popular CP decomposition of tensors (Hitchcock, 1927), where the CP rank is at most $K$.
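The CP form (6) can be verified numerically by assembling the LCM probability tensor from its weights and arms; below is a small numpy sketch with illustrative dimensions (not taken from the paper).

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(2)

# Illustrative sizes: p = 3 items with d_j categories, K latent classes
d, K = [3, 3, 4], 3
nu = rng.dirichlet(np.ones(K))                                  # class proportions nu_k
Lambdas = [rng.dirichlet(np.ones(dj), size=K).T for dj in d]    # each Lambda_j is d_j x K

# CP form (6): P[c1,...,cp] = sum_k nu_k * prod_j Lambda_j[c_j, k]
P = np.zeros(d)
for k in range(K):
    arms = [L[:, k] for L in Lambdas]
    P += nu[k] * reduce(np.multiply.outer, arms)

print(P.sum())   # sums to 1, i.e., a valid probability tensor
```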

Finally, consider our proposed Gro-M3,

\begin{align}
p^{\text{Gro-M}^3}\left(y_{i,1},\ldots,y_{i,p}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\alpha}\right)&=\int_{\Delta^{K-1}}\prod_{g=1}^{G}\left[\sum_{k=1}^{K}\pi_{i,k}\prod_{j:\,\ell_{j,g}=1}f(y_{i,j}\mid\lambda_{j,c_{j},k})\right]dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i})\nonumber\\
&=\sum_{k_{1}=1}^{K}\cdots\sum_{k_{G}=1}^{K}\prod_{g=1}^{G}\prod_{j:\,\ell_{j,g}=1}f(y_{i,j}\mid\lambda_{j,c_{j},k_{g}})\underbrace{\int_{\Delta^{K-1}}\pi_{i,k_{1}}\cdots\pi_{i,k_{G}}\,dD_{\boldsymbol{\alpha}}(\boldsymbol{\pi}_{i})}_{=:\;\phi^{\text{Gro-M}^3}_{k_{1},\ldots,k_{G}}}, \tag{7}
\end{align}

where $f(y_{i,j}\mid\lambda_{j,c_{j},k})$ generally denotes the conditional distribution of variable $y_{i,j}$ given parameter $\lambda_{j,c_{j},k}$. In our Gro-M3, the $\lambda_{j,c_{j},k}$ specifically refer to the categorical distribution parameters for the random variable $y_{i,j}$; that is, $\lambda_{j,c_{j},k}=\mathbb{P}(y_{i,j}=c_{j}\mid z_{i,j}=k)$ denotes the probability of responding $c_{j}$ to item $j$ given that the subject's realization of the latent profile for item $j$ is the $k$th extreme latent profile. In this case, $\mathbf{\Phi}^{\text{Gro-M}^3}:=\left(\phi^{\text{Gro-M}^3}_{k_{1},\ldots,k_{G}};\;k_{g}\in[K]\right)$ forms a tensor with $G$ modes, each of dimension $K$. We still have $\sum_{k_{1}=1}^{K}\cdots\sum_{k_{G}=1}^{K}\phi_{k_{1},\ldots,k_{G}}^{\text{Gro-M}^3}=1$. This reduces the size of the core tensor in the classical Tucker decomposition because $G<p$. The Gro-M3 incorporates aspects of both the CP and Tucker decompositions, providing a probabilistic hybrid decomposition of probability tensors. The CP decomposition is obtained when all variables are in the same group, while the Tucker decomposition is obtained when each variable is in its own group; see Figure 1 for a clear illustration of this fact.
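The hybrid decomposition (7) can likewise be assembled numerically: group-specific arms are combined through the $G$-mode core $\boldsymbol{\Phi}$, whose entries are moments of $D_{\boldsymbol{\alpha}}$. The sketch below approximates $\boldsymbol{\Phi}$ by Monte Carlo under an illustrative Dirichlet choice for $D_{\boldsymbol{\alpha}}$; all dimensions are assumptions made only for illustration.

```python
import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(3)

# Illustrative sizes: p = 4 binary items in G = 2 groups, K = 3 profiles
p, G, K = 4, 2, 3
d = [2, 2, 2, 2]
s = np.array([0, 0, 1, 1])                            # grouping vector
alpha = np.full(K, 0.5)                               # example D_alpha: symmetric Dirichlet
Lambdas = [rng.dirichlet(np.ones(dj), size=K).T for dj in d]

# Monte Carlo approximation of the core Phi[k_1,...,k_G] = E[pi_{k_1} ... pi_{k_G}]
Pi = rng.dirichlet(alpha, size=200_000)
Phi = np.zeros((K,) * G)
for ks in product(range(K), repeat=G):
    Phi[ks] = np.mean(np.prod(Pi[:, list(ks)], axis=1))

# Hybrid decomposition (7): group-wise CP-like arms combined with the Tucker-like core Phi
P = np.zeros(d)
for ks in product(range(K), repeat=G):
    arms = [Lambdas[j][:, ks[s[j]]] for j in range(p)]
    P += Phi[ks] * reduce(np.multiply.outer, arms)

print(P.sum())   # approximately 1
```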

Gro-M3 is conceptually related to the collapsed Tucker decomposition (c-Tucker) of Johndrow et al. (2017), though they did not model mixed memberships, used a very different model for the core tensor $\boldsymbol{\Phi}$, and did not consider identifiability. Nonetheless and interestingly, our identifiability results can be applied to establish identifiability of the c-Tucker decomposition (see Remark 1 in Section 4). Another work related to our dimension-grouping assumption is Anandkumar et al. (2015), which considered the case of overcomplete topic modeling with the number of topics exceeding the vocabulary size. For such scenarios, the authors proposed a "persistent topic model" which assumes the latent topic assignment persists locally through multiple words, and established identifiability. Our dimension-grouped mixed membership assumption is similar in spirit to this topic persistence assumption. However, the setting we consider here for general multivariate categorical data has a multi-characteristic, single-replication nature ($p>1$ and $R=1$); as mentioned before, this is fundamentally different from topic models with a single characteristic and multiple replications ($p=1$ and $R>1$).

3 Identifiability of Dimension-Grouped MMMs

Identifiability is an important property of a statistical model, generally meaning that model parameters can be uniquely recovered from the observables. Identifiability serves as a fundamental prerequisite for valid statistical estimation and inference. The study of identifiability, however, can be challenging for complicated models and especially latent variable models, including the Gro-M3s considered here. In subsections 3.1 and 3.2, we propose easily checkable and practically useful identifiability conditions for Gro-M3s by carefully inspecting the inherent algebraic structures. Specifically, we will exploit the variable groupings to write the model as certain highly structured mixed tensor products, and then prove identifiability by invoking Kruskal's theorem on the uniqueness of tensor decompositions (Kruskal, 1977). We point out that such proof procedures share a similar spirit to those in Allman et al. (2009), but the complicated structure of the new Gro-M3s requires some special care to handle. We provide a high-level summary of our proof approach. First, we write the probability mass function of the observed $p$-dimensional multivariate categorical vector as a probabilistic tensor with $p$ modes. Second, we unfold this tensor into a $G$-way tensor with each mode corresponding to a variable group. Third, we further concatenate the transformed tensor and leverage Kruskal's Theorem on the uniqueness of three-way tensor decompositions to establish the identifiability of the model parameters under our proposed Gro-M3. Our theoretical developments provide a solid foundation for performing estimation of the latent quantities and drawing valid conclusions from data.

3.1 Strict Identifiability Conditions

For LDA and closely related topic models, there is a rich literature investigating identifiability under different assumptions (Anandkumar et al., 2012; Arora et al., 2012; Nguyen, 2015; Wang, 2019). Typically, when there is only one characteristic ($p=1$), $R\geq 2$ is necessary for identifiability; see Example 2 in Wang (2019). However, there has been limited consideration of identifiability of mixed membership models with multiple characteristics and one replication, i.e., $p>1$ and $R=1$. GoM models belong to this category, as do the proposed Gro-M3s, with GoM being a special case of Gro-M3.

We consider the general setup in (1), where $\boldsymbol{\Phi}$ denotes the $G$-mode tensor core induced by any distribution $D(\boldsymbol{\pi}_{i})$ on the probability simplex $\Delta^{K-1}$. The following definition formally defines the concept of strict identifiability for the proposed model.

Definition 1 (Strict Identifiability of Gro-M3s).

A parameter space $\boldsymbol{\Theta}$ of a Gro-M3 is said to be strictly identifiable if, for any valid set of parameters $(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})\in\boldsymbol{\Theta}$, the following equations hold if and only if $(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})$ and the alternative $(\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}})$ are identical up to permutations of the $K$ extreme latent profiles and permutations of the $G$ variable groups,

\[
\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})=\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}}),\quad\forall\boldsymbol{c}\in\times_{j=1}^{p}[d_{j}]. \tag{8}
\]

Definition 1 gives the strongest possible notion of identifiability of the considered population quantities $(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})$ in the model. In particular, the strict identifiability notion in Definition 1 requires identification of both the continuous parameters $\boldsymbol{\Lambda}$ and $\boldsymbol{\Phi}$, and the discrete latent grouping structure of variables in $\mathbf{L}$. The following theorem proposes sufficient conditions for the strict identifiability of Gro-M3s.

Theorem 1.

Under a Gro-M3, the following two identifiability conclusions hold.

  • (a)

    Suppose each column of $\mathbf{L}$ contains at least three entries of "1", and the corresponding conditional probability table $\boldsymbol{\Lambda}_{j}=(\lambda_{j,c_{j},k})_{d_{j}\times K}$ for each of these three $j$ has full column rank. Then $\boldsymbol{\Lambda}$ and $\boldsymbol{\Phi}$ are strictly identifiable.

  • (b)

    In addition to the conditions in (a), if $\boldsymbol{\Lambda}$ satisfies that for each $j\in[p]$, not all the column vectors of $\boldsymbol{\Lambda}_{j}$ are identical, then $\mathbf{L}$ is also identifiable.

Example 1.

Denote by $\mathbf{I}_{G}$ a $G\times G$ identity matrix. Suppose $p=3G$ and the matrix $\mathbf{L}$ takes the following form,

\[
\mathbf{L}=(\mathbf{I}_{G}\ \mathbf{I}_{G}\ \mathbf{I}_{G})^{\top}. \tag{9}
\]

Also suppose for each $j\in\{1,\ldots,3G\}$, the $\boldsymbol{\Lambda}_{j}$ of size $d_{j}\times K$ has full column rank $K$. Then the conditions in Theorem 1 hold, so $\boldsymbol{\Lambda}$, $\mathbf{L}$, and $\boldsymbol{\Phi}$ are identifiable. Theorem 1 implies that if $\mathbf{L}$ contains any additional row vectors other than those in (9), the model is still identifiable.

Theorem 1 requires that each of the $G$ variable groups contains at least three variables, and that for each of these $3G$ variables, the corresponding conditional probability table $\boldsymbol{\Lambda}_{j}$ has linearly independent columns. Theorem 1 guarantees that not only the continuous parameters are identifiable, but also the discrete variable grouping structure summarized by $\mathbf{L}$ is identifiable. This is important practically, as the appropriate variable grouping structure is typically unknown and hence needs to be inferred from the data.

The conditions in Theorem 1 essentially require at least $3G$ conditional probability tables, each being a matrix of size $d_{j}\times K$, to have full column rank. This implicitly requires $d_{j}\geq K$. Tan and Mukherjee (2017) proposed a moment-based estimation approach for traditional mixed membership models and briefly discussed the identifiability issue, also assuming $d_{j}\geq K$ with some full-rank requirements. However, cases where the numbers of categories $d_{j}$ are small but the number of extreme latent profiles $K$ is much larger can arise in applications; for example, the disability survey data analyzed in Erosheva et al. (2007) and Manrique-Vallier (2014) have binary responses with $d_{1}=\cdots=d_{p}=2$, while the considered $K$ ranges from 2 to 10. Our next theoretical result establishes weaker conditions for identifiability that accommodate $d_{j}<K$, by taking advantage of the dimension-grouping property of our proposed model class.

Before stating the theorem, we first introduce two useful notions of matrix products. Denote by $\bigotimes$ the Kronecker product of matrices and by $\bigodot$ the Khatri-Rao product. Consider two matrices $\mathbf{A}=(a_{i,j})\in\mathbb{R}^{m\times r}$ and $\mathbf{B}=(b_{i,j})\in\mathbb{R}^{s\times t}$, and another two matrices $\mathbf{C}=(c_{i,j})=(\boldsymbol{c}_{:,1}\mid\cdots\mid\boldsymbol{c}_{:,k})\in\mathbb{R}^{n\times k}$ and $\mathbf{D}=(d_{i,j})=(\boldsymbol{d}_{:,1}\mid\cdots\mid\boldsymbol{d}_{:,k})\in\mathbb{R}^{\ell\times k}$; then $\mathbf{A}\bigotimes\mathbf{B}\in\mathbb{R}^{ms\times rt}$ and $\mathbf{C}\bigodot\mathbf{D}\in\mathbb{R}^{n\ell\times k}$ with

\[
\mathbf{A}\bigotimes\mathbf{B}=\begin{pmatrix}a_{1,1}\mathbf{B}&\cdots&a_{1,r}\mathbf{B}\\ \vdots&\ddots&\vdots\\ a_{m,1}\mathbf{B}&\cdots&a_{m,r}\mathbf{B}\end{pmatrix},\qquad \mathbf{C}\bigodot\mathbf{D}=\begin{pmatrix}\boldsymbol{c}_{:,1}\bigotimes\boldsymbol{d}_{:,1}\mid\cdots\mid\boldsymbol{c}_{:,k}\bigotimes\boldsymbol{d}_{:,k}\end{pmatrix}.
\]

The above definitions show that the Khatri-Rao product is the column-wise Kronecker product. The Khatri-Rao product of matrices plays an important role in the technical definition of the proposed dimension-grouped MMM. The following Theorem 2 exploits the grouping structure in $\mathbf{L}$ to relax the identifiability conditions in the previous Theorem 1.
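For readers who prefer code, the two products can be computed in numpy as follows; the `khatri_rao` helper is written here only for illustration (numpy provides `np.kron` but no built-in Khatri-Rao routine in this form).

```python
import numpy as np

def khatri_rao(C, D):
    """Column-wise Kronecker product: returns an (n*l) x k matrix for C (n x k), D (l x k)."""
    assert C.shape[1] == D.shape[1]
    return np.vstack([np.kron(C[:, j], D[:, j]) for j in range(C.shape[1])]).T

A = np.arange(6).reshape(2, 3)
B = np.arange(4).reshape(2, 2)
print(np.kron(A, B).shape)        # Kronecker product: (2*2) x (3*2) = 4 x 6

C = np.random.rand(3, 4)
D = np.random.rand(5, 4)
print(khatri_rao(C, D).shape)     # Khatri-Rao product: (3*5) x 4 = 15 x 4
```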

Theorem 2.

Denote by $\mathcal{A}_{g}=\{j\in[p]:\,\ell_{j,g}=1\}$ the set of variables that belong to group $g$. Suppose each $\mathcal{A}_{g}$ can be partitioned into three sets $\mathcal{A}_{g}=\cup_{m=1}^{3}\mathcal{A}_{g,m}$, and for each $g\in[G]$ and $m\in\{1,2,3\}$ the matrix $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ defined below has full column rank $K$:

\[
\widetilde{\boldsymbol{\Lambda}}_{g,m}:=\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}. \tag{10}
\]

Also suppose for each $j\in[p]$, not all the column vectors of $\boldsymbol{\Lambda}_{j}$ are identical. Then the model parameters $\mathbf{L}$, $\boldsymbol{\Lambda}$, and $\boldsymbol{\Phi}$ are strictly identifiable.

Compared to Theorem 1, Theorem 2 relaxes the identifiability conditions by lifting the full-rank requirement on the individual matrices $\boldsymbol{\Lambda}_{j}$. Rather, as long as the Khatri-Rao products of several different $\boldsymbol{\Lambda}_{j}$'s have full column rank as specified in (10), identifiability can be guaranteed. Recall that the Khatri-Rao product of two matrices $\boldsymbol{\Lambda}_{j_{1}}$ of size $d_{j_{1}}\times K$ and $\boldsymbol{\Lambda}_{j_{2}}$ of size $d_{j_{2}}\times K$ has size $(d_{j_{1}}d_{j_{2}})\times K$. So intuitively, requiring $\boldsymbol{\Lambda}_{j_{1}}\bigodot\boldsymbol{\Lambda}_{j_{2}}$ to have full column rank $K$ is weaker than requiring each of $\boldsymbol{\Lambda}_{j_{1}}$ and $\boldsymbol{\Lambda}_{j_{2}}$ to have full column rank $K$. The following Example 2 formalizes this intuition.

Example 2.

Consider $d_{1}=d_{2}=2$ and $K=3$ with the following conditional probability tables

\[
\boldsymbol{\Lambda}_{1}=\begin{pmatrix}a_{1}&a_{2}&a_{3}\\ 1-a_{1}&1-a_{2}&1-a_{3}\end{pmatrix};\quad \boldsymbol{\Lambda}_{2}=\begin{pmatrix}b_{1}&b_{2}&b_{3}\\ 1-b_{1}&1-b_{2}&1-b_{3}\end{pmatrix}.
\]

Suppose variables $j=1,2$ belong to the same group, i.e., $\boldsymbol{\ell}_{1,:}=\boldsymbol{\ell}_{2,:}$. Then since $K>d_{1}=d_{2}$, neither $\boldsymbol{\Lambda}_{1}$ nor $\boldsymbol{\Lambda}_{2}$ can have full column rank $K$. However, if we consider their Khatri-Rao product, it has size $4\times 3$ and takes the following form

\[
\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2}=\begin{pmatrix}a_{1}b_{1}&a_{2}b_{2}&a_{3}b_{3}\\ a_{1}(1-b_{1})&a_{2}(1-b_{2})&a_{3}(1-b_{3})\\ (1-a_{1})b_{1}&(1-a_{2})b_{2}&(1-a_{3})b_{3}\\ (1-a_{1})(1-b_{1})&(1-a_{2})(1-b_{2})&(1-a_{3})(1-b_{3})\end{pmatrix}.
\]

Indeed, $\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2}$ has full column rank for "generic" parameters $\boldsymbol{\theta}:=(a_{1},a_{2},a_{3},b_{1},b_{2},b_{3})$; precisely speaking, for $\boldsymbol{\theta}$ varying almost everywhere in the parameter space $[0,1]^{6}$ (the 6-dimensional hypercube), the subset of $\boldsymbol{\theta}$ that renders $\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2}$ rank-deficient has Lebesgue measure zero in $\mathbb{R}^{6}$. To see this, let $\boldsymbol{x}=(x_{1},x_{2},x_{3})^{\top}\in\mathbb{R}^{3}$ be such that $(\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2})\boldsymbol{x}=\boldsymbol{0}$; then

\begin{align*}
&\begin{cases}a_{1}b_{1}x_{1}+a_{2}b_{2}x_{2}+a_{3}b_{3}x_{3}=0;\\ a_{1}(1-b_{1})x_{1}+a_{2}(1-b_{2})x_{2}+a_{3}(1-b_{3})x_{3}=0;\\ (1-a_{1})b_{1}x_{1}+(1-a_{2})b_{2}x_{2}+(1-a_{3})b_{3}x_{3}=0;\\ (1-a_{1})(1-b_{1})x_{1}+(1-a_{2})(1-b_{2})x_{2}+(1-a_{3})(1-b_{3})x_{3}=0\end{cases}\\
\stackrel{\text{invertible transform}}{\iff}\quad &\begin{cases}a_{1}b_{1}x_{1}+a_{2}b_{2}x_{2}+a_{3}b_{3}x_{3}=0;\\ a_{1}x_{1}+a_{2}x_{2}+a_{3}x_{3}=0;\\ b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{3}=0;\\ x_{1}+x_{2}+x_{3}=0.\end{cases}
\end{align*}

Based on the last four equations above, one can use basic algebra to obtain the following set of equations about $(x_{1},x_{2},x_{3})$,

\[
\begin{pmatrix}b_{1}-b_{3}&b_{3}-b_{2}\\ a_{1}-a_{3}&a_{3}-a_{2}\end{pmatrix}\begin{pmatrix}x_{1}\\ x_{2}\end{pmatrix}=\begin{pmatrix}b_{2}-b_{1}&b_{1}-b_{3}\\ a_{2}-a_{1}&a_{1}-a_{3}\end{pmatrix}\begin{pmatrix}x_{2}\\ x_{3}\end{pmatrix}=\begin{pmatrix}0\\ 0\end{pmatrix}.
\]

This implies that as long as the following inequalities hold, there must be $x_{1}=x_{2}=x_{3}=0$:

\[
\begin{cases}(b_{1}-b_{3})(a_{3}-a_{2})-(a_{1}-a_{3})(b_{3}-b_{2})\neq 0;\\ (b_{2}-b_{1})(a_{1}-a_{3})-(a_{2}-a_{1})(b_{1}-b_{3})\neq 0.\end{cases} \tag{11}
\]

Now note that the subset of the parameter space $\{(a_{1},a_{2},a_{3},b_{1},b_{2},b_{3})\in[0,1]^{6}:\,\text{Eq.~(11) fails to hold}\}$ has Lebesgue measure zero in $[0,1]^{6}$. This means that for "generic" parameters varying almost everywhere in the parameter space $[0,1]^{6}$, $(\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2})\boldsymbol{x}=\boldsymbol{0}$ implies $\boldsymbol{x}=\boldsymbol{0}$, which means $\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2}$ has full column rank $K=3$.
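The measure-zero argument above can also be checked numerically: drawing $(a_{1},a_{2},a_{3},b_{1},b_{2},b_{3})$ uniformly from $[0,1]^{6}$, the Khatri-Rao product $\boldsymbol{\Lambda}_{1}\bigodot\boldsymbol{\Lambda}_{2}$ essentially always has rank 3. The sketch below (with an arbitrary number of random draws) is one way to run this check.

```python
import numpy as np

rng = np.random.default_rng(4)

def khatri_rao(C, D):
    """Column-wise Kronecker product of two matrices with the same number of columns."""
    return np.vstack([np.kron(C[:, j], D[:, j]) for j in range(C.shape[1])]).T

full_rank = 0
for _ in range(10_000):
    a, b = rng.random(3), rng.random(3)          # random (a1,a2,a3), (b1,b2,b3) in [0,1]^3
    L1 = np.vstack([a, 1 - a])                   # Lambda_1, size 2 x 3
    L2 = np.vstack([b, 1 - b])                   # Lambda_2, size 2 x 3
    if np.linalg.matrix_rank(khatri_rao(L1, L2)) == 3:
        full_rank += 1
print(full_rank)   # essentially always 10000, matching the measure-zero argument
```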

Example 2 shows that the Khatri-Rao product of two matrices seems to have full column rank under fairly mild conditions. This indicates that the conditions in Theorem 2 are much weaker than those in Theorem 1, by imposing the full-rankness requirement only on certain Khatri-Rao products of the $\boldsymbol{\Lambda}_{j}$ matrices, instead of on individual $\boldsymbol{\Lambda}_{j}$'s. To be more concrete, the next Example 3 illustrates Theorem 2, as a counterpart of Example 1.

Example 3.

Consider the following grouping matrix $\mathbf{L}$ with $G=3$ and $p=6G=18$,

\[
\mathbf{L}=\begin{pmatrix}\mathbf{L}_{1}\\ \mathbf{L}_{1}\\ \mathbf{L}_{1}\end{pmatrix},\quad\text{where}~~\mathbf{L}_{1}=\begin{pmatrix}1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1\end{pmatrix}. \tag{12}
\]

Then $\mathbf{L}$ contains six copies of the identity matrix $\mathbf{I}_{G}$ after a row permutation. Thanks to greater variable grouping compared to the previous Example 1, we can use Theorem 2 (instead of Theorem 1) to establish identifiability. Specifically, consider binary responses with $d_{1}=\cdots=d_{18}=:d=2$ and $K=3$ extreme latent profiles. For $g=1$, define sets $\mathcal{A}_{g,1}=\{1,2\}$, $\mathcal{A}_{g,2}=\{7,8\}$, $\mathcal{A}_{g,3}=\{13,14\}$; for $g=2$, define sets $\mathcal{A}_{g,1}=\{3,4\}$, $\mathcal{A}_{g,2}=\{9,10\}$, $\mathcal{A}_{g,3}=\{15,16\}$; and for $g=3$, define sets $\mathcal{A}_{g,1}=\{5,6\}$, $\mathcal{A}_{g,2}=\{11,12\}$, $\mathcal{A}_{g,3}=\{17,18\}$. Then for each $(g,m)\in\{1,\ldots,G\}\times\{1,2,3\}$, the $\widetilde{\boldsymbol{\Lambda}}_{g,m}=\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}$ defined in Theorem 2 has size $d^{2}\times K$, which is $4\times 3$, similar to the structure in Example 2. Now from the derivation and discussion in Example 2, we know such a $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full rank for almost all the valid parameters in the parameter space. So the conditions in Theorem 2 are easily satisfied, and for almost all the valid parameters of such a Gro-M3, the identifiability conclusion follows. A numerical sketch of this check is given after this example.
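As a sanity check of Example 3, the sketch below builds the grouping matrix in (12), draws random binary-response tables $\boldsymbol{\Lambda}_{j}$, and verifies that each Khatri-Rao product $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ (formed by pairing the variables within each group as above) has full column rank $K=3$; the random draws are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def khatri_rao(C, D):
    """Column-wise Kronecker product of two matrices with the same number of columns."""
    return np.vstack([np.kron(C[:, j], D[:, j]) for j in range(C.shape[1])]).T

G, K, d = 3, 3, 2
L1 = np.repeat(np.eye(G, dtype=int), 2, axis=0)      # the 6 x 3 block L_1 in (12)
L = np.vstack([L1, L1, L1])                          # 18 x 3 grouping matrix
p = L.shape[0]

Lambdas = [rng.dirichlet(np.ones(d), size=K).T for _ in range(p)]   # each is 2 x 3

# For each group g, split its six variables into the three pairs A_{g,1}, A_{g,2}, A_{g,3}
# and check that each Khatri-Rao product (size 4 x 3) has full column rank K = 3.
for g in range(G):
    members = np.flatnonzero(L[:, g])
    for pair in members.reshape(3, 2):
        prod = khatri_rao(Lambdas[pair[0]], Lambdas[pair[1]])
        assert np.linalg.matrix_rank(prod) == K
```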

3.2 Generic Identifiability Conditions

Example 2 shows that the Khatri-Rao product of conditional probability tables easily has full column rank in a toy case, and Example 3 leverages this observation to establish identifiability for almost all parameters in the parameter space using Theorem 2. We next generalize this observation to derive more practical identifiability conditions, under the generic identifiability notion introduced by Allman et al. (2009). Generic identifiability generally means that the unidentifiable parameters belong to a set of Lebesgue measure zero with respect to the parameter space. Its definition adapted to the current Gro-M3s is given as follows.

Definition 2 (Generic Identifiability of Gro-M3s).

Under a Gro-M3, a parameter space $\mathcal{T}$ for $(\boldsymbol{\Lambda},\boldsymbol{\Phi})$ is said to be generically identifiable if there exists a subset $\mathcal{N}\subseteq\mathcal{T}$ that has Lebesgue measure zero with respect to $\mathcal{T}$ such that for any $(\boldsymbol{\Lambda},\boldsymbol{\Phi})\in\mathcal{T}\setminus\mathcal{N}$ and an associated $\mathbf{L}$ matrix, the following holds if and only if $(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})$ and the alternative $(\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}})$ are identical up to permutations of the $K$ extreme latent profiles and those of the $G$ variable groups,

\[
\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})=\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}}),\quad\forall\boldsymbol{c}\in\times_{j=1}^{p}[d_{j}].
\]

Compared to the strict identifiability notion in Definition 1, the generic identifiability notion in Definition 2 is less stringent in allowing the existence of a measure-zero set of parameters where identifiability does not hold; see the previous Example 2 for an instance of a measure-zero set. Such an identifiability notion usually suffices for real data analyses (Allman et al., 2009). In the following Theorem 3, we propose simple conditions to ensure generic identifiability of Gro-M3s.

Theorem 3.

For the notation $\mathcal{A}_{g}=\{j\in[p]:\ell_{j,g}=1\}$ defined in Theorem 2, suppose each $\mathcal{A}_{g}$ can be partitioned into three non-overlapping sets $\mathcal{A}_{g}=\cup_{m=1}^{3}\mathcal{A}_{g,m}$, such that for each $g$ and $m$ the following holds,

\[
\prod_{j\in\mathcal{A}_{g,m}}d_{j}\geq K. \tag{13}
\]

Then the matrix $\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}$ has full column rank $K$ for generic parameters. Further, $\boldsymbol{\Lambda}$, $\mathbf{L}$, and $\boldsymbol{\Phi}$ are generically identifiable.

Compared to Theorem 2, Theorem 3 lifts the explicit full-rank requirements on any matrix. Rather, Theorem 3 only requires that certain products of the $d_{j}$'s should not be smaller than the number of extreme latent profiles, which in turn guarantees that the Khatri-Rao products of matrices have full column rank for generic parameters. Intuitively, the more variables belonging to a group and the more categories each variable has, the easier the identifiability conditions are to satisfy. This illustrates the benefit of dimension-grouping to model identifiability.

4 Dirichlet Gro-M3 and Bayesian Inference

4.1 Dirichlet model and identifiability

The previous section studies identifiability of general Gro-M3s, not restricting the distribution $D_{\boldsymbol{\alpha}}(\cdot)$ of the mixed membership scores to a specific form. Next we focus on an interesting special case where $D_{\boldsymbol{\alpha}}(\cdot)$ is a Dirichlet distribution with unknown parameters $\boldsymbol{\alpha}$. Among all the possible distributions for the individual mixed-membership proportions, the Dirichlet distribution is the most popular. It is widely used in applications including social science survey data (Erosheva et al., 2007; Wang et al., 2015), topic modeling (Blei et al., 2003; Griffiths and Steyvers, 2004), and data privacy (Manrique-Vallier and Reiter, 2012). We term the Gro-M3 with $\boldsymbol{\pi}_{i}$ following a Dirichlet distribution the Dirichlet Gro-M3, and propose a Bayesian inference procedure to estimate both the discrete variable groupings and the continuous parameters. Such a Dirichlet Gro-M3 has the graphical model representation in Figure 1(d).

For an unknown vector $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$ with $\alpha_{k}>0$ for all $k\in[K]$, suppose

\[
\text{Dirichlet Gro-M}^3:\quad \boldsymbol{\pi}_{i}=(\pi_{i,1},\ldots,\pi_{i,K})\stackrel{\text{i.i.d.}}{\sim}\text{Dirichlet}(\alpha_{1},\ldots,\alpha_{K}). \tag{14}
\]

The vector $\boldsymbol{\alpha}$ characterizes the distribution of membership scores. As $\alpha_{k}\to 0$, the model simplifies to a latent class model in which each individual belongs to a single latent class. For larger $\alpha_{k}$'s, there will tend to be multiple elements of $\boldsymbol{\pi}_{i}$ that are not close to 0 or 1.
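A quick numerical illustration of this behavior: drawing membership vectors from symmetric Dirichlet distributions with different $\alpha_{k}$ values, small $\alpha_{k}$ concentrates $\boldsymbol{\pi}_{i}$ near a single extreme profile, while larger $\alpha_{k}$ yields genuinely mixed memberships. The $\alpha_{k}$ values and the 0.95 cutoff below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
K = 3

for a in [0.05, 0.5, 5.0]:
    Pi = rng.dirichlet(np.full(K, a), size=5000)
    # fraction of subjects whose largest membership weight exceeds 0.95
    nearly_pure = np.mean(Pi.max(axis=1) > 0.95)
    print(f"alpha_k = {a}: fraction with max_k pi_k > 0.95 is about {nearly_pure:.2f}")
```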

Recall that the previous identifiability conclusions in Theorems 1–3 generally apply to $\mathbf{L}$, $\boldsymbol{\Lambda}$, and $\boldsymbol{\Phi}$, where $\boldsymbol{\Phi}$ is the core tensor with $K^{G}$ entries in our hybrid tensor decomposition. Now with the core tensor $\boldsymbol{\Phi}$ parameterized by the Dirichlet distribution in particular, we can further investigate the identifiability of the Dirichlet parameters $\boldsymbol{\alpha}$. The following proposition establishes the identifiability of $\boldsymbol{\alpha}$ for Dirichlet Gro-M3s.

Proposition 1.

Consider a Dirichlet Gro-M3. If $G\geq 2$, then the following conclusions hold.

  • (a)

    If the conditions in Theorem 1 or Theorem 2 are satisfied, then the Dirichlet parameters $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$ are strictly identifiable.

  • (b)

    If the conditions in Theorem 3 are satisfied, then the Dirichlet parameters $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$ are generically identifiable.

Remark 1.

Our identifiability results have implications for the collapsed Tucker (c-Tucker) decomposition for multivariate categorical data (Johndrow et al., 2017). Our assumption that the latent memberships underlying several variables are in one state is similar to that in c-Tucker. However, c-Tucker does not model mixed memberships, and the c-Tucker tensor core, $\boldsymbol{\Phi}$ in our notation, is assumed to arise from a CP decomposition (Goodman, 1974) with $\phi_{k_{1},\ldots,k_{G}}=\sum_{v=1}^{r}w_{v}\prod_{g=1}^{G}\psi_{g,k_{g},v}$. We can invoke the uniqueness of the CP decomposition (e.g., Kruskal, 1977; Allman et al., 2009) to obtain identifiability of parameters $\boldsymbol{w}=(w_{v};\,v\in[r])$ and $\boldsymbol{\psi}=(\psi_{g,k,v};\,g\in[G],\,k\in[K],\,v\in[r])$. Hence, under our assumptions on the variable grouping structure in Section 3, imposing existing mild conditions on $\boldsymbol{w}$ and $\boldsymbol{\psi}$ will yield identifiability of all the c-Tucker parameters.

4.2 Bayesian inference

Considering the complexity of our latent structure model, we adopt a Bayesian approach. We next describe the prior specification for $\mathbf{L}$, $\boldsymbol{\Lambda}$, and $\boldsymbol{\alpha}$ in Dirichlet Gro-M3s. The number of variable groups $G$ and the number of extreme latent profiles $K$ are assumed known; we relax this assumption in Section 5. Recall the indicators $s_{1},\ldots,s_{p}\in[G]$ are defined as $s_{j}=g$ if and only if $\ell_{j,g}=1$, so there is a one-to-one correspondence between the matrix $\mathbf{L}$ and the vector $\boldsymbol{s}=(s_{1},\ldots,s_{p})$. We adopt the following prior for the $s_{j}$'s,

\[
s_{1},\ldots,s_{p}\stackrel{\text{i.i.d.}}{\sim}\text{Categorical}([G],\;\xi_{1},\ldots,\xi_{G}),
\]

where $\text{Categorical}([G],\;\xi_{1},\ldots,\xi_{G})$ is a categorical distribution over $G$ categories with proportions $\xi_{g}\geq 0$ and $\sum_{g=1}^{G}\xi_{g}=1$. We choose uniform priors over the probability simplex for $(\xi_{1},\ldots,\xi_{G})$ and for each column of $\boldsymbol{\Lambda}_{j}$. We remark that if certain prior knowledge about the variable groups is available for the data, then it is also possible to employ informative priors such as those in Paganin et al. (2021) for the $s_{j}$'s. For the Dirichlet parameters $\boldsymbol{\alpha}$, defining $\alpha_{0}=\sum_{k=1}^{K}\alpha_{k}$ and $\boldsymbol{\eta}=(\alpha_{1}/\alpha_{0},\ldots,\alpha_{K}/\alpha_{0})$, we choose the hyperpriors $\alpha_{0}\sim\text{Gamma}(a_{\alpha},b_{\alpha})$ and $\boldsymbol{\eta}$ uniform over the $(K-1)$-dimensional probability simplex.

Given a sample of size $n$, denote the observed data by $\mathbf{Y}=\{\boldsymbol{y}_{i};\,i=1,\ldots,n\}$. We propose a Metropolis-Hastings-within-Gibbs sampler and also a Gibbs sampler for posterior inference of $\mathbf{L}$, $\boldsymbol{\Lambda}$, and $\boldsymbol{\alpha}$ based on the data $\mathbf{Y}$.

Metropolis-Hastings-within-Gibbs Sampler.

This sampler cycles through the following steps.

Steps 1–3.

Sample each column of the conditional probability tables 𝚲j\boldsymbol{\Lambda}_{j}’s, the individual mixed-membership proportions 𝝅i\boldsymbol{\pi}_{i}’s, and the individual latent assignments zi,gz_{i,g}’s from their full conditional posterior distributions. Define indicator variables yi,j,c=𝕀(yi,j=c)y_{i,j,c}=\mathbb{I}(y_{i,j}=c) and zi,g,k=𝕀(zi,g=k)z_{i,g,k}=\mathbb{I}(z_{i,g}=k). These posteriors are

{𝝀j,:,k}sj=g\displaystyle\{\boldsymbol{\lambda}_{j,\boldsymbol{:},k}\mid-\}_{s_{j}=g} \displaystyle\sim Dirichlet(1+i=1nzi,g,kyi,j,1,,1+i=1nzi,g,kyi,j,dj);\displaystyle\text{Dirichlet}\left(1+\sum_{i=1}^{n}z_{i,g,k}y_{i,j,1},~{}\ldots,~{}1+\sum_{i=1}^{n}z_{i,g,k}y_{i,j,d_{j}}\right);
𝝅i\displaystyle\boldsymbol{\pi}_{i}\mid- \displaystyle\sim Dirichlet(α1+g=1Gzi,g,1,,αK+g=1Gzi,g,K);\displaystyle\text{Dirichlet}\left(\alpha_{1}+\sum_{g=1}^{G}z_{i,g,1},~{}\ldots,~{}\alpha_{K}+\sum_{g=1}^{G}z_{i,g,K}\right);
(zi,g=k)\displaystyle\mathbb{P}(z_{i,g}=k\mid-) =\displaystyle= πi,kj:sj=gc=1djλj,c,kyi,j,ck=1Kπi,kj:sj=gc=1djλj,c,kyi,j,c,k[K].\displaystyle\frac{\pi_{i,k}\prod_{j:\,s_{j}=g}\prod_{c=1}^{d_{j}}\lambda_{j,c,k}^{y_{i,j,c}}}{\sum_{k^{\prime}=1}^{K}\pi_{i,k^{\prime}}\prod_{j:\,s_{j}=g}\prod_{c=1}^{d_{j}}\lambda_{j,c,k^{\prime}}^{y_{i,j,c}}},\quad k\in[K].
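For concreteness, the following is a minimal numpy sketch of one sweep of Steps 1–3. The array names (Y_onehot, Z, Pi, Lam, s, alpha) and the assumption of a common number of response categories d across items are ours for illustration; they are not part of the algorithm description above.

```python
import numpy as np

def update_steps_1_to_3(Y_onehot, Z, Pi, Lam, s, alpha, rng):
    """One sweep of Steps 1-3: update the Lambda_j columns, the pi_i's,
    and the z_{i,g}'s from their full conditionals.

    Y_onehot : (n, p, d) one-hot responses y_{i,j,c} (common d assumed)
    Z        : (n, G) latent assignments z_{i,g} in {0, ..., K-1}
    Pi       : (n, K) mixed-membership vectors pi_i
    Lam      : (p, d, K) conditional probability tables lambda_{j,c,k}
    s        : (p,) group labels s_j in {0, ..., G-1}
    alpha    : (K,) Dirichlet parameters
    """
    n, p, d = Y_onehot.shape
    G, K = Z.shape[1], Pi.shape[1]

    # Step 1: lambda_{j,:,k} | - ~ Dirichlet(1 + sum_i z_{i,s_j,k} y_{i,j,c})
    for j in range(p):
        zk = (Z[:, s[j]][:, None] == np.arange(K)).astype(float)  # (n, K)
        counts = zk.T @ Y_onehot[:, j, :]                          # (K, d)
        for k in range(K):
            Lam[j, :, k] = rng.dirichlet(1.0 + counts[k])

    # Step 2: pi_i | - ~ Dirichlet(alpha_k + sum_g 1{z_{i,g} = k})
    for i in range(n):
        Pi[i] = rng.dirichlet(alpha + np.bincount(Z[i], minlength=K))

    # Step 3: z_{i,g} | - proportional to pi_{i,k} prod_{j: s_j=g} lambda_{j,y_{ij},k}
    logLam = np.log(Lam)
    for g in range(G):
        items = np.where(s == g)[0]
        loglik = np.einsum('ijc,jck->ik', Y_onehot[:, items, :], logLam[items])
        logw = np.log(Pi) + loglik
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        Z[:, g] = [rng.choice(K, p=w[i]) for i in range(n)]
    return Z, Pi, Lam
```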
Step 4.

Sample the variable grouping structure (s1,,sp)(s_{1},\ldots,s_{p}). The posterior of each sjs_{j} is

(sj=g)=ξgi=1nλj,yi,j,zi,gg=1Gξgi=1nλj,yi,j,zi,g,g[G].\displaystyle\mathbb{P}(s_{j}=g\mid-)=\frac{\xi_{g}\prod_{i=1}^{n}\lambda_{j,y_{i,j},z_{i,g}}}{\sum_{g^{\prime}=1}^{G}\xi_{g^{\prime}}\prod_{i=1}^{n}\lambda_{j,y_{i,j},z_{i,g^{\prime}}}},\quad g\in[G].

The posterior of (ξ1,,ξG)(\xi_{1},\ldots,\xi_{G}) is

(ξ1,,ξG)Dirichlet(1+j=1p𝕀(sj=1),,1+j=1p𝕀(sj=G)).\displaystyle(\xi_{1},\ldots,\xi_{G})\mid-\sim\text{Dirichlet}\left(1+\sum_{j=1}^{p}\mathbb{I}(s_{j}=1),\ldots,1+\sum_{j=1}^{p}\mathbb{I}(s_{j}=G)\right).
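A corresponding sketch of Step 4, resampling the group labels s_j and the proportions \xi under the same illustrative array names:

```python
import numpy as np

def update_step_4(Y_idx, Z, Lam, xi, rng):
    """Step 4: resample s_j and (xi_1, ..., xi_G) from their full conditionals.

    Y_idx : (n, p) observed category indices y_{i,j}
    Z     : (n, G) latent assignments z_{i,g}
    Lam   : (p, d, K) conditional probability tables
    xi    : (G,) group proportions
    """
    n, p = Y_idx.shape
    G = Z.shape[1]
    s = np.empty(p, dtype=int)
    for j in range(p):
        # log prod_i lambda_{j, y_{ij}, z_{ig}} for each candidate group g
        loglik = np.array([np.log(Lam[j, Y_idx[:, j], Z[:, g]]).sum() for g in range(G)])
        logw = np.log(xi) + loglik
        w = np.exp(logw - logw.max())
        s[j] = rng.choice(G, p=w / w.sum())
    xi = rng.dirichlet(1.0 + np.bincount(s, minlength=G))
    return s, xi
```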
Step 5.

Sample the Dirichlet parameters 𝜶=(α1,,αK)\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K}) via Metropolis-Hastings sampling. The conditional posterior distribution of 𝜶\boldsymbol{\alpha} (or equivalently, α0\alpha_{0} and 𝜼\boldsymbol{\eta}) is

p(𝜶)\displaystyle p(\boldsymbol{\alpha}\mid-) Gamma(α0a,b)×Dirichlet(𝜼𝟏K)×i=1nDirichlet(𝝅i𝜶)\displaystyle\propto\text{Gamma}(\alpha_{0}\mid a,b)\times\text{Dirichlet}(\boldsymbol{\eta}\mid\mathbf{1}_{K})\times\prod_{i=1}^{n}\text{Dirichlet}(\boldsymbol{\pi}_{i}\mid\boldsymbol{\alpha})
α0aα1exp(bαα0)×[Γ(α0)k=1KΓ(αk)]n×k=1K[i=1nπi,k]αk,\displaystyle\propto\alpha_{0}^{a_{\alpha}-1}\exp(-b_{\alpha}\alpha_{0})\times\left[\frac{\Gamma(\alpha_{0})}{\prod_{k=1}^{K}\Gamma(\alpha_{k})}\right]^{n}\times\prod_{k=1}^{K}\left[\prod_{i=1}^{n}\pi_{i,k}\right]^{\alpha_{k}},

which is not easy to sample from directly. We adopt the Metropolis-Hastings sampling strategy of Manrique-Vallier and Reiter, (2012). The steps are detailed as follows; a code sketch follows the list.

  • Sample each entry of 𝜶=(α1,,αK)\boldsymbol{\alpha}^{\star}=(\alpha_{1}^{\star},\ldots,\alpha_{K}^{\star}) from independent lognormal distributions (proposal distribution g(𝜶𝜶)g(\boldsymbol{\alpha}^{\star}\mid\boldsymbol{\alpha})) as

    αkind.lognormal(logαk,σα2),\displaystyle\alpha_{k}^{\star}\stackrel{{\scriptstyle\text{ind.}}}{{\sim}}\text{lognormal}(\log\alpha_{k},\sigma_{\alpha}^{2}), (15)

where \sigma_{\alpha} is a tuning parameter that affects the acceptance ratio of the draw. Based on our preliminary simulations, \sigma_{\alpha} should be relatively small to prevent the acceptance ratio from always being very close to zero.

  • Let α0=k=1Kαk\alpha_{0}^{\star}=\sum_{k=1}^{K}\alpha_{k}^{\star}. Define

    r=\displaystyle r^{\star}= p(𝜶)g(𝜶𝜶)p(𝜶)g(𝜶𝜶)\displaystyle~{}\frac{p(\boldsymbol{\alpha}^{\star}\mid-)g(\boldsymbol{\alpha}\mid\boldsymbol{\alpha}^{\star})}{p(\boldsymbol{\alpha}\mid-)g(\boldsymbol{\alpha}^{\star}\mid\boldsymbol{\alpha})}
    =\displaystyle= (α0α0)aα1exp(bα(α0α0))×(Γ(α0)Γ(α0)k=1KΓ(αk)k=1KΓ(αk))n\displaystyle~{}\left(\frac{\alpha_{0}^{\star}}{\alpha_{0}}\right)^{a_{\alpha}-1}\exp\left(-b_{\alpha}(\alpha_{0}^{\star}-\alpha_{0})\right)\times\left(\frac{\Gamma(\alpha_{0}^{*})}{\Gamma(\alpha_{0})}\cdot\frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\prod_{k=1}^{K}\Gamma(\alpha_{k}^{\star})}\right)^{n}
    ×k=1K(i=1nπi,k)αkαk×k=1Kαkαk\displaystyle~{}\times\prod_{k=1}^{K}\left(\prod_{i=1}^{n}\pi_{i,k}\right)^{\alpha_{k}^{\star}-\alpha_{k}}\times\prod_{k=1}^{K}\frac{\alpha_{k}^{\star}}{\alpha_{k}}

    The Metropolis-Hastings acceptance ratio of the proposed 𝜶\boldsymbol{\alpha}^{\star} is r=min{1,r}r=\min\left\{1,\;r^{\star}\right\}.
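The following is a minimal sketch of this Metropolis-Hastings step; the default hyperparameter values mirror those used later in the simulations, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def mh_update_alpha(alpha, Pi, a_alpha=2.0, b_alpha=1.0, sigma_alpha=0.02, rng=None):
    """Propose alpha* from independent lognormals and accept with probability min(1, r*)."""
    rng = rng or np.random.default_rng()
    n, K = Pi.shape
    log_pi_sums = np.log(Pi).sum(axis=0)  # sum_i log pi_{ik}

    def log_post(a):
        # log of the (unnormalized) conditional posterior displayed above
        a0 = a.sum()
        return ((a_alpha - 1.0) * np.log(a0) - b_alpha * a0
                + n * (gammaln(a0) - gammaln(a).sum())
                + np.dot(a, log_pi_sums))

    alpha_star = np.exp(np.log(alpha) + sigma_alpha * rng.standard_normal(K))
    # the last product in r* is the lognormal proposal correction prod_k alpha*_k / alpha_k
    log_r = (log_post(alpha_star) - log_post(alpha)
             + np.sum(np.log(alpha_star) - np.log(alpha)))
    return alpha_star if np.log(rng.uniform()) < min(0.0, log_r) else alpha
```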

We track the acceptance ratio of the Metropolis-Hastings step along the MCMC iterations in a simulation study. Figure 3 shows boxplots of the average acceptance ratios for various sample sizes under the same simulation setting as Table 3 below. The figure shows that the Metropolis-Hastings acceptance ratio is generally high and mostly exceeds 80%.

Figure 3: Metropolis-Hastings average acceptance ratio in the simulation setting (p,G,K)=(30,6,4), corresponding to the first setting in Table 3.

Gibbs Sampler.

We also develop a full Gibbs sampling algorithm for our Gro-M3, leveraging the auxiliary variable method in Zhou, (2018) to sample the Dirichlet parameters \boldsymbol{\alpha}. In particular, since Proposition 1 shows that the entire Dirichlet parameter vector \boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K}) is identifiable from the observed data distribution, we sample all the entries \alpha_{1},\ldots,\alpha_{K} freely and separately instead of constraining these K entries to be equal as in Zhou, (2018). Recall that for each subject i, z_{i,g}\in[K] for g\in[G] denotes the latent profile realization for the gth group of items. Introduce the notation Z^{\text{mult}}_{ik}=\sum_{g=1}^{G}\mathbbm{1}(z_{i,g}=k) for i\in[n] and k\in[K]. Then (Z^{\text{mult}}_{i1},\ldots,Z^{\text{mult}}_{iK}) follows the Dirichlet-Multinomial distribution with parameters G and (\alpha_{1},\ldots,\alpha_{K}). We introduce auxiliary Beta variables q_{i} for i\in[n] and auxiliary Chinese Restaurant Table (CRT) variables t_{ik} for i\in[n] and k\in[K]. Endowing each Dirichlet parameter \alpha_{k} with the prior \alpha_{k}\sim\text{Gamma}(a_{0},\,b_{0}), we have the following Gibbs updates for sampling \alpha_{k}.

Step 5′.

Sample the auxiliary variables qiq_{i}, tikt_{ik} and the Dirichlet parameters αk\alpha_{k} from the following full conditional posteriors:

qi\displaystyle q_{i} Beta(k=1KZikmult,k=1Kαk),i[n];\displaystyle\sim\text{Beta}\left(\sum_{k=1}^{K}Z^{\text{mult}}_{ik},~{}\sum_{k=1}^{K}\alpha_{k}\right),\qquad i\in[n];
tik\displaystyle t_{ik} CRT(Zikmult,αk),i[n],k[K];\displaystyle\sim\text{CRT}(Z^{\text{mult}}_{ik},~{}\alpha_{k}),\qquad i\in[n],~{}k\in[K];
αk\displaystyle\alpha_{k} Gamma(a0+i=1ntik,b0i=1nlog(1qi)),k[K].\displaystyle\sim\text{Gamma}\left(a_{0}+\sum_{i=1}^{n}t_{ik},~{}b_{0}-\sum_{i=1}^{n}\log(1-q_{i})\right),\qquad k\in[K].

Replacing Step 5 in the Metropolis-Hastings-within-Gibbs sampler with the above Step 5′ gives a full Gibbs sampling algorithm for Gro-M3; a sketch of Step 5′ is given below.
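The sketch below uses the standard representation of a CRT(m, a) variable as a sum of independent Bernoulli(a/(a+l-1)) variables for l = 1, ..., m; the default hyperparameters (a_0, b_0) are placeholders and the names are illustrative.

```python
import numpy as np

def gibbs_update_alpha(Zmult, alpha, a0=2.0, b0=1.0, rng=None):
    """Auxiliary-variable Gibbs update (Step 5') for the Dirichlet parameters.

    Zmult : (n, K) counts Z^mult_{ik} = #{g : z_{i,g} = k}; each row sums to G
    alpha : (K,) current Dirichlet parameters
    """
    rng = rng or np.random.default_rng()
    n, K = Zmult.shape

    # q_i ~ Beta(sum_k Z^mult_{ik}, sum_k alpha_k)
    q = rng.beta(Zmult.sum(axis=1), alpha.sum())

    # t_{ik} ~ CRT(Z^mult_{ik}, alpha_k) via sums of Bernoulli(alpha_k / (alpha_k + l - 1))
    t = np.zeros((n, K), dtype=int)
    for i in range(n):
        for k in range(K):
            m = Zmult[i, k]
            if m > 0:
                t[i, k] = rng.binomial(1, alpha[k] / (alpha[k] + np.arange(m))).sum()

    # alpha_k | - ~ Gamma(a0 + sum_i t_{ik}, b0 - sum_i log(1 - q_i))   (shape, rate)
    rate = b0 - np.log1p(-q).sum()
    return rng.gamma(a0 + t.sum(axis=0), 1.0 / rate)
```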

Our simulations reveal the following empirical comparisons between the Gibbs sampler and the Metropolis-Hastings-within-Gibbs (MH-within-Gibbs) sampler. In terms of Markov chain mixing, the Gibbs sampler mixes faster than the MH-within-Gibbs sampler, as expected, and requires fewer MCMC iterations to generate quality posterior samples when initialized well. However, in terms of estimation accuracy, the MH-within-Gibbs sampler tends to estimate the identifiable model parameters more accurately. This is likely because the MH-within-Gibbs sampler explores the posterior space more thoroughly through its proposal distributions, whereas the Gibbs sampler is more heavily influenced by the initial parameter values and can converge to suboptimal distributions if not initialized well. We next provide the experimental evidence behind these observations.

Figure 4 provides typical traceplots for the MH-within-Gibbs sampler (left) and the Gibbs sampler (middle and right) in one simulation trial under the same setting as Table 3 below. The four horizontal lines in each panel denote the true parameter values \boldsymbol{\alpha}=(\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4})=(0.4,0.5,0.6,0.7). The left and middle panels of Figure 4 are traceplots of \alpha_{k} in MCMC chains initialized randomly with the same initial value, whereas the right panel corresponds to a chain initialized at the true parameter value \boldsymbol{\alpha}. Figure 4 shows that when initialized randomly with the same value, the MH-within-Gibbs chain converges to distributions much closer to the truth than the Gibbs sampler; in contrast, the Gibbs chain only converges to the desirable posteriors when initialized at the true \boldsymbol{\alpha}. Furthermore, Figure 5 plots the root mean squared error quantiles (25%, 50%, 75%) of \boldsymbol{\alpha} estimated by the two samplers across the 50 simulation replicates in each setting. The parameter initialization in each replicate is random and identical for the two samplers. Figure 5 clearly shows that the MH-within-Gibbs sampler has lower estimation error for \boldsymbol{\alpha}. In summary, when initialized randomly using the same mechanism, the MH-within-Gibbs sampler attains higher parameter estimation accuracy even though the Gibbs sampler mixes faster. Therefore, we present the estimation results of the MH-within-Gibbs sampler in Section 5.

Figure 4: Traceplots of the MH-within-Gibbs sampler (left) and the Gibbs sampler (middle and right) applied to one simulated dataset with (n,p,G,K)=(500,30,6,4)(n,p,G,K)=(500,30,6,4). The horizontal lines in each panel denote the true 𝜶=(α1,α2,α3,α4)=(0.4,0.5,0.6,0.7)\boldsymbol{\alpha}=(\alpha_{1},\alpha_{2},\alpha_{3},\alpha_{4})=(0.4,0.5,0.6,0.7). The left and middle panels correspond to chains initialized randomly with the same initial value, whereas the right panel corresponds to a chain initialized with the true parameter value 𝜶\boldsymbol{\alpha}.
Figure 5: Root mean squared errors (RMSE) quantiles (25%, 50%, 75%) for the MH-within-Gibbs sampler and the Gibbs sampler obtained from 50 simulation replicates for each sample size. In each simulation replicate, the initializations of the Gibbs chain and the MH-within-Gibbs chain are identical.

After collecting posterior samples from the MCMC algorithm, we compute posterior means as point estimates for the continuous parameters in the model. For the discrete variable grouping structure, we use the posterior mode of each s_{j}. That is, given the T posterior samples \boldsymbol{s}^{(t)}=(s^{(t)}_{1},\ldots,s^{(t)}_{p}) for t=1,\ldots,T, we define point estimates \overline{\boldsymbol{s}} and \overline{\mathbf{L}} with entries

s¯j\displaystyle\overline{s}_{j} =argmaxg[G]t=1T𝕀(sj(t)=g);¯j,g={1,if s¯j=g;0,otherwise.\displaystyle=\operatorname*{arg\,max}_{g\in[G]}\sum_{t=1}^{T}\mathbb{I}(s^{(t)}_{j}=g);\qquad\overline{\ell}_{j,g}=\begin{cases}1,&\text{if }\overline{s}_{j}=g;\\ 0,&\text{otherwise}.\end{cases} (16)
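In code, the point estimates in (16) can be obtained from the retained draws as in the following minimal sketch; the function and variable names are illustrative.

```python
import numpy as np

def posterior_mode_grouping(S_samples, G):
    """Compute s_bar and L_bar in (16) from a (T, p) array of posterior draws s^(t)."""
    counts = np.stack([(S_samples == g).sum(axis=0) for g in range(G)], axis=1)  # (p, G)
    s_bar = counts.argmax(axis=1)          # most frequent group label for each item j
    L_bar = np.eye(G, dtype=int)[s_bar]    # one-hot encoding of s_bar
    return s_bar, L_bar
```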

5 Simulation Studies

In this section, we carry out simulation studies to assess the performance of the proposed Bayesian estimation approach, while verifying that identifiable parameters are indeed estimated more accurately as sample size grows. In Section 5.1, we perform a simulation study to assess the estimation accuracy of the model parameters, assuming the number of extreme latent profiles KK and the number of variable groups GG are known. This is the same assumption as in many existing estimation methods of traditional MMMs (e.g., Manrique-Vallier and Reiter, , 2012). In Section 5.2, to facilitate the use of our estimation method in applications, we propose data-driven criteria to select KK and GG and perform a corresponding simulation study.

5.1 Estimation of Grouping Structure and Model Parameters

In this simulation study, we assess the proposed algorithm’s performance in estimating the (𝐋,𝚲,𝜶)(\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\alpha}) in Dirichlet Gro-M3s. We consider various simulation settings, with K=2,3K=2,3, or 44, and (p,G)=(30,6)(p,G)=(30,6), (60,12)(60,12), or (90,15)(90,15). The number of categories of each yjy_{j} is specified to be three, i.e., d1==dp=3d_{1}=\cdots=d_{p}=3. The true 𝚲\boldsymbol{\Lambda}-parameters are specified as follows: in the most challenging case with K=4K=4 and (p,G)=(90,15)(p,G)=(90,15), for u=0,1,,p/61u=0,1,\ldots,p/6-1 we specify

𝚲6u+1=(0.10.70.30.10.80.20.40.10.10.10.30.8);𝚲6u+2=(0.10.80.10.20.20.10.60.50.70.10.30.3);𝚲6u+3=(0.10.80.20.90.20.10.50.050.70.10.30.05);{\small\boldsymbol{\Lambda}_{6u+1}=\begin{pmatrix}0.1&0.7&0.3&0.1\\ 0.8&0.2&0.4&0.1\\ 0.1&0.1&0.3&0.8\end{pmatrix};~{}\boldsymbol{\Lambda}_{6u+2}=\begin{pmatrix}0.1&0.8&0.1&0.2\\ 0.2&0.1&0.6&0.5\\ 0.7&0.1&0.3&0.3\end{pmatrix};~{}\boldsymbol{\Lambda}_{6u+3}=\begin{pmatrix}0.1&0.8&0.2&0.9\\ 0.2&0.1&0.5&0.05\\ 0.7&0.1&0.3&0.05\end{pmatrix};}
𝚲6u+4=(0.10.10.80.30.80.20.10.60.10.70.10.1);𝚲6u+5=(0.20.70.30.10.60.20.40.10.20.10.30.8);𝚲6u+6=(0.10.80.10.20.20.10.10.60.70.10.80.2).{\small\boldsymbol{\Lambda}_{6u+4}=\begin{pmatrix}0.1&0.1&0.8&0.3\\ 0.8&0.2&0.1&0.6\\ 0.1&0.7&0.1&0.1\end{pmatrix};~{}\boldsymbol{\Lambda}_{6u+5}=\begin{pmatrix}0.2&0.7&0.3&0.1\\ 0.6&0.2&0.4&0.1\\ 0.2&0.1&0.3&0.8\end{pmatrix};~{}\boldsymbol{\Lambda}_{6u+6}=\begin{pmatrix}0.1&0.8&0.1&0.2\\ 0.2&0.1&0.1&0.6\\ 0.7&0.1&0.8&0.2\end{pmatrix}.}

As for the other simulation settings with smaller K and (p,G), we specify the \boldsymbol{\Lambda}_{j}'s by taking a subset of the above matrices and retaining a subset of columns from each of them. The true Dirichlet parameters \boldsymbol{\alpha} are set to (0.4,0.5) for K=2, (0.4,0.5,0.6) for K=3, and (0.4,0.5,0.6,0.7) for K=4. The true grouping matrix \mathbf{L} of size p\times G is specified to contain p/G copies of the identity submatrix \mathbf{I}_{G} up to a row permutation. Under these specifications, our identifiability conditions in Theorem 2 are satisfied. We consider sample sizes n=250,500,1000,1500. In each scenario, 50 independent datasets are generated and fitted with the proposed MCMC algorithm described in Section 4; a sketch of the data-generating step is given below. In all simulation settings, we take hyperparameters (a_{\alpha},b_{\alpha})=(2,1) and \sigma_{\alpha}=0.02. The MCMC sampler is run for 15000 iterations, with the first 10000 iterations as burn-in; after burn-in, every fifth sample is collected to thin the chain.
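For reference, the following is a minimal sketch of how one dataset can be generated from the Dirichlet Gro-M3 given the grouping labels, conditional probability tables, and Dirichlet parameters; the function and variable names are illustrative.

```python
import numpy as np

def simulate_gro_m3(n, s, Lam, alpha, rng=None):
    """Generate n response vectors given group labels s (length p),
    conditional probability tables Lam (p, d, K), and Dirichlet parameters alpha (K,)."""
    rng = rng or np.random.default_rng()
    p, d, K = Lam.shape
    G = s.max() + 1
    Pi = rng.dirichlet(alpha, size=n)                      # pi_i ~ Dirichlet(alpha)
    Y = np.empty((n, p), dtype=int)
    for i in range(n):
        z = rng.choice(K, size=G, p=Pi[i])                 # z_{i,g} ~ Categorical(pi_i)
        for j in range(p):
            Y[i, j] = rng.choice(d, p=Lam[j, :, z[s[j]]])  # y_{ij} given profile z_{i,s_j}
    return Y
```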

We observed good mixing and convergence behavior of the model parameters from examining the trace plots. In particular, the simulations show that the estimation of the discrete variable grouping structure in matrix \mathbf{L} (equivalently, vector \boldsymbol{s}) is generally quite accurate, and the posterior means of the continuous \boldsymbol{\Lambda} and \boldsymbol{\alpha} are also close to their true values. We first present details of two typical simulation trials as an illustration, before presenting summaries across the independent simulation replicates.

Two random simulation trials were taken from the settings (n,p,G,K)=(500,30,6,2) and (n,p,G,K)=(500,90,15,2). All parameters were randomly initialized from their prior distributions. In Figure 6, the left three plots in each of the first two rows show the sampled \mathbf{L}_{\text{iter.}} in the MCMC algorithm after the 1st, 201st, and 401st iterations, respectively; the fourth plot shows the posterior mode \overline{\mathbf{L}} defined in (16), and the last plot shows the simulation truth \mathbf{L}. If an \widetilde{\mathbf{L}} equals the true \mathbf{L} after a column permutation, then \widetilde{\mathbf{L}} and \mathbf{L} induce identical variable groupings. The bottom two plots in Figure 6 show the Adjusted Rand Index (ARI, Rand, , 1971) of the variable groupings of \mathbf{L}_{\text{iter.}} (\boldsymbol{s}_{\text{iter.}}) with respect to the true \mathbf{L} (true \boldsymbol{s}) along the first 1000 MCMC iterations. The ARI measures the similarity between two clusterings, and it is appropriate for comparing the true \boldsymbol{s} and an estimated \overline{\boldsymbol{s}} because each summarizes a clustering of the p variables into G groups. The ARI is at most 1, with \text{ARI}=1 indicating perfect agreement between two clusterings. The bottom row of Figure 6 shows that in each simulation trial, the ARI starts around 0 due to the random MCMC initialization, and within a few hundred iterations it increases to a distribution over much larger values. For the simulation with (n,p,G,K)=(500,90,15,2), the posterior mode of \mathbf{L} exactly equals the truth, and the corresponding plot on the bottom right of Figure 6 shows that the ARI is distributed very close to 1 after only about 500 MCMC iterations. In general, our MCMC algorithm has excellent performance in inferring \mathbf{L} from randomly initialized runs; also see Tables 1–3 for more details.

Figure 6: Estimation of 𝐋\mathbf{L} (from 𝒔\boldsymbol{s}) in two random simulation trials, one under (n,p,G,K)=(500,30,6,2)(n,p,G,K)=(500,30,6,2) and the other under (n,p,G,K)=(500,90,15,2)(n,p,G,K)=(500,90,15,2). In each of the first two rows, the left three plots record the sampled 𝐋iter.\mathbf{L}_{\text{iter.}} after the 1st, 201st, and 401st MCMC iteration, respectively. The fourth plot shows the posterior mode 𝐋¯\overline{\mathbf{L}} and the last shows the true 𝐋\mathbf{L}. The two plots in the bottom row record the ARI of the clustering of pp variables given by 𝐋iter.\mathbf{L}_{\text{iter.}} along the first 1000 MCMC iterations, for each of the two simulation scenarios.

We next present estimation accuracy results for both \mathbf{L} and (\boldsymbol{\Lambda},\boldsymbol{\alpha}), summarized across the 50 simulation replicates in each setting. For the continuous parameters (\boldsymbol{\Lambda},\boldsymbol{\alpha}), we calculate Root Mean Squared Errors (RMSEs) to evaluate estimation accuracy. To obtain the estimation error of (\boldsymbol{\Lambda},\boldsymbol{\alpha}) after collecting posterior samples, we need to find an appropriate permutation of the K extreme latent profiles in order to compare (\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\alpha}}) with the true (\boldsymbol{\Lambda},\boldsymbol{\alpha}). To this end, we first reshape each of \overline{\boldsymbol{\Lambda}} and \boldsymbol{\Lambda} into a (\sum_{j=1}^{p}d_{j})\times K matrix, denoted \overline{\boldsymbol{\Lambda}}_{\text{mat}} and \boldsymbol{\Lambda}_{\text{mat}}, calculate the inner product matrix (\boldsymbol{\Lambda}_{\text{mat}})^{\top}\overline{\boldsymbol{\Lambda}}_{\text{mat}}, and then find the index i_{k} of the largest entry in each kth row of this inner product matrix. The resulting vector of indices (i_{1},\ldots,i_{K}) gives a permutation of the K profiles, and we compare \overline{\boldsymbol{\Lambda}}_{j,:,(i_{1},\ldots,i_{K})} to \boldsymbol{\Lambda}_{j} and \overline{\boldsymbol{\alpha}}_{(i_{1},\ldots,i_{K})} to \boldsymbol{\alpha}; a sketch of this alignment step is given below. In Tables 1–3, we present the RMSEs of (\boldsymbol{\Lambda},\boldsymbol{\alpha}) and the ARIs of \mathbf{L} under the aforementioned 36 simulation settings. The median and interquartile range of the ARIs or RMSEs across the simulation replicates are shown in these tables.
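The alignment step can be sketched as follows; the greedy row-wise argmax mirrors the description above and does not enforce a one-to-one assignment. The names are illustrative.

```python
import numpy as np

def align_profiles(Lam_true, Lam_hat):
    """Match each true profile k to the estimated profile with the largest inner product.

    Lam_true, Lam_hat : (p, d, K) conditional probability tables
    """
    p, d, K = Lam_true.shape
    A = Lam_true.reshape(p * d, K)   # Lambda_mat
    B = Lam_hat.reshape(p * d, K)    # Lambda_bar_mat
    return (A.T @ B).argmax(axis=1)  # index of the largest entry in each row

# usage sketch:
#   perm = align_profiles(Lam_true, Lam_hat)
#   rmse_Lam = np.sqrt(np.mean((Lam_hat[:, :, perm] - Lam_true) ** 2))
#   rmse_alpha = np.sqrt(np.mean((alpha_hat[perm] - alpha_true) ** 2))
```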

Tables 1–3 show that under each setting of true parameters with a fixed (p,G,K), the ARIs of the variable grouping \mathbf{L} generally increase as the sample size n increases, and the RMSEs of \boldsymbol{\Lambda} and \boldsymbol{\alpha} decrease as n increases. This demonstrates the improved estimation accuracy with larger sample sizes. In particular, the estimation accuracy of the variable grouping structure is quite high across the considered settings. The estimation errors are slightly larger for the larger K in Table 3 than for the smaller values of K in Tables 1 and 2. Overall, the simulation results empirically confirm the identifiability and estimability of the model parameters in our Dirichlet Gro-M3.

(p,G)(p,\,G) nn ARI of 𝐋\mathbf{L} RMSE of 𝚲{\boldsymbol{\Lambda}} RMSE of 𝜶\boldsymbol{\alpha}
Median (IQR) Median (IQR) Median (IQR)
(30, 6)(30,\,6) 250 0.74 (0.18) 0.042 (0.005) 0.064 (0.056)
500 0.88 (0.17) 0.030 (0.004) 0.031 (0.043)
1000 0.91 (0.29) 0.023 (0.014) 0.027 (0.028)
1500 0.91 (0.31) 0.018 (0.022) 0.026 (0.045)
K=2K=2 (60, 12)(60,\,12) 250 0.73 (0.13) 0.042 (0.004) 0.039 (0.041)
500 0.79 (0.14) 0.032 (0.003) 0.031 (0.021)
1000 0.85 (0.20) 0.027 (0.010) 0.018 (0.029)
1500 0.81 (0.21) 0.028 (0.016) 0.024 (0.025)
(90, 15)(90,\,15) 250 0.95 (0.05) 0.042 (0.003) 0.045 (0.045)
500 1.00 (0.00) 0.026 (0.002) 0.032 (0.023)
1000 1.00 (0.00) 0.018 (0.001) 0.019 (0.021)
1500 1.00 (0.08) 0.015 (0.010) 0.017 (0.017)
Table 1: Simulation results of the Dirichlet Gro-M3 for K=2K=2. “ARI” of 𝐋\mathbf{L} is the Adjusted Rand Index of the estimated variable groupings with respect to the truth. “RMSE” of 𝚲{\boldsymbol{\Lambda}} and 𝜶{\boldsymbol{\alpha}} are Root Mean Squared Errors. “Median” and “IQR” are based on 50 replicates in each simulation setting.
(p,G)(p,\,G) nn ARI of 𝐋\mathbf{L} RMSE of 𝚲{\boldsymbol{\Lambda}} RMSE of 𝜶\boldsymbol{\alpha}
Median (IQR) Median (IQR) Median (IQR)
(30, 6)(30,\,6) 250 1.00 (0.00) 0.045 (0.004) 0.046 (0.048)
500 1.00 (0.00) 0.033 (0.003) 0.046 (0.059)
1000 1.00 (0.00) 0.023 (0.022) 0.039 (0.037)
1500 1.00 (0.00) 0.019 (0.023) 0.029 (0.032)
K=3K=3 (60, 12)(60,\,12) 250 1.00 (0.00) 0.045 (0.004) 0.044 (0.030)
500 1.00 (0.00) 0.032 (0.002) 0.030 (0.018)
1000 1.00 (0.00) 0.023 (0.002) 0.021 (0.017)
1500 1.00 (0.00) 0.018 (0.002) 0.020 (0.017)
(90, 15)(90,\,15) 250 1.00 (0.00) 0.045 (0.002) 0.047 (0.036)
500 1.00 (0.00) 0.031 (0.002) 0.026 (0.022)
1000 1.00 (0.00) 0.022 (0.001) 0.021 (0.013)
1500 1.00 (0.21) 0.019 (0.024) 0.024 (0.023)
Table 2: Simulation results of the Dirichlet Gro-M3 for K=3K=3. See the caption of Table 1 for the meanings of columns.
(p,G)(p,\,G) nn ARI of 𝐋\mathbf{L} RMSE of 𝚲{\boldsymbol{\Lambda}} RMSE of 𝜶\boldsymbol{\alpha}
Median (IQR) Median (IQR) Median (IQR)
(30, 6)(30,\,6) 250 1.00 (0.00) 0.064 (0.007) 0.078 (0.056)
500 1.00 (0.00) 0.046 (0.006) 0.062 (0.072)
1000 1.00 (0.00) 0.032 (0.004) 0.043 (0.046)
1500 1.00 (0.00) 0.026 (0.004) 0.032 (0.036)
K=4K=4 (60, 12)(60,\,12) 250 1.00 (0.00) 0.064 (0.005) 0.060 (0.031)
500 1.00 (0.00) 0.043 (0.003) 0.047 (0.027)
1000 1.00 (0.00) 0.031 (0.002) 0.032 (0.014)
1500 1.00 (0.00) 0.025 (0.001) 0.023 (0.017)
(90, 15)(90,\,15) 250 1.00 (0.00) 0.046 (0.004) 0.053 (0.036)
500 1.00 (0.00) 0.041 (0.003) 0.037 (0.022)
1000 1.00 (0.00) 0.029 (0.001) 0.026 (0.027)
1500 1.00 (0.00) 0.024 (0.001) 0.026 (0.020)
Table 3: Simulation results of the Dirichlet Gro-M3 for K=4K=4. See the caption of Table 1 for the meanings of columns.

Our MCMC algorithm can also be viewed as a novel algorithm for Bayesian factorization of probability tensors. To see this, note that the observed response vector ranges over the p-way contingency table \boldsymbol{y}_{i}\in[d_{1}]\times[d_{2}]\times\cdots\times[d_{p}], so the marginal probabilities of a random vector \boldsymbol{y}_{i} falling into each of the \prod_{j=1}^{p}d_{j} cells form a probability tensor with p modes. Our Gro-M3 provides a general and interpretable hybrid tensor factorization; it reduces to the nonnegative CP decomposition when the grouping matrix equals the p\times 1 all-ones vector and to the nonnegative Tucker decomposition when the grouping matrix equals the p\times p identity matrix. Specifically, our estimated Dirichlet parameters \boldsymbol{\alpha} help define the tensor core and our estimated conditional probability parameters \boldsymbol{\lambda}_{j,k} constitute the tensor arms; a sketch of evaluating one cell of this tensor is given below. In this regard, we view our proposed MCMC algorithm as contributing a new tensor factorization method with uniqueness (i.e., identifiability) guarantees and good empirical performance.
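As an illustration of this hybrid factorization, the following sketch evaluates one cell of the implied probability tensor from the grouping labels, conditional probability tables, and Dirichlet parameters, using the closed-form Dirichlet moment for the core entries. It enumerates all K^G core entries and is therefore feasible only for small K and G; the function name is ours.

```python
import numpy as np
from itertools import product
from scipy.special import gammaln

def cell_probability(y, s, Lam, alpha):
    """Marginal probability of one response pattern y under the Dirichlet Gro-M3.

    y : (p,) observed categories; s : (p,) group labels;
    Lam : (p, d, K) conditional probability tables; alpha : (K,) Dirichlet parameters.
    """
    p, d, K = Lam.shape
    G = s.max() + 1
    a0 = alpha.sum()
    total = 0.0
    for ks in product(range(K), repeat=G):      # profile k_g for each group g
        m = np.bincount(ks, minlength=K)        # number of groups assigned to profile k
        # core entry phi_{k_1..k_G} = E[prod_g pi_{k_g}] under Dirichlet(alpha)
        log_phi = (gammaln(alpha + m).sum() - gammaln(alpha).sum()
                   + gammaln(a0) - gammaln(a0 + G))
        # tensor arms: prod_j lambda_{j, y_j, k_{s_j}}
        log_arms = sum(np.log(Lam[j, y[j], ks[s[j]]]) for j in range(p))
        total += np.exp(log_phi + log_arms)
    return total
```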

We also conduct a simulation study to empirically verify the theoretical identifiability results. Specifically, in the simulation setting (p,G,K)=(30,6,4), corresponding to the first setting in Table 3, we now consider more sample sizes n\in\{250, 500, 750, 1000, 1250, 1500\}. For each sample size, we conduct 50 independent simulation replicates and calculate the average root mean squared errors (RMSEs) of the model parameters \boldsymbol{\Lambda} and \boldsymbol{\alpha}. Figure 7 plots the RMSEs against the sample size n and shows that the RMSEs decrease steadily as n increases. This trend provides an empirical verification of identifiability and corroborates the conclusion that, under an identifiable model, the parameters can be estimated increasingly accurately as more samples are collected.

Figure 7: Empirical verification of identifiability. Root mean square errors (RMSEs) of model parameters averaged across simulation replicates decrease as sample size increases. The simulation setting is (p,G,K)=(30,6,4)(p,G,K)=(30,6,4), which is the first setting in Table 3.

5.2 Selecting GG and KK from Data

In Section 3, model identifiability is established under the assumption that G and K are known, as is common for latent structure models; for example, the generic identifiability of latent class models in Allman et al., (2009) is established assuming the number of latent classes is known. To provide a practical estimation pipeline for real-world applications, we next briefly discuss how to select G and K in a data-driven way.

Our basic rationale is to use a practically useful criterion that favors a model with good out-of-sample predictive performance while remaining parsimonious. Gelman et al., (2014) provide a comprehensive review of predictive information criteria for evaluating Bayesian models. We first considered the Deviance Information Criterion (DIC, Spiegelhalter et al., , 2002), a traditional model selection criterion for Bayesian models. However, our preliminary simulations suggest that DIC does not work well for selecting the latent dimensions in Gro-M3s. In particular, we observed that DIC sometimes severely overselects the latent dimensions in our model, whereas the WAIC (Widely Applicable Information Criterion, Watanabe, , 2010) has better performance in our simulation studies (see below for details). Our observation about DIC agrees with previous studies on the inconsistency of DIC in several different settings (Gelman et al., , 2014; Hooten and Hobbs, , 2015; Piironen and Vehtari, , 2017).

Watanabe, (2010) proved that WAIC is asymptotically equal to Bayesian leave-one-out cross validation and provided a solid theoretical justification for using WAIC to choose models with relatively good predictive ability. WAIC is particularly useful for models with hierarchical and mixture structures, making it well suited to selecting the latent profile dimension KK and variable group dimension GG in our proposed model. Denote the posterior samples by 𝜽(t)\boldsymbol{\theta}^{(t)}, t=1,,Tt=1,\ldots,T. For each i[n]i\in[n] and t[T]t\in[T], denote

p(\boldsymbol{y}_{i}\mid\boldsymbol{\theta}^{(t)})=\prod_{m=1}^{G}\left[\sum_{k=1}^{K}\pi_{ik}^{(t)}\prod_{j:\,\ell_{j,m}^{(t)}=1}\prod_{c=1}^{d_{j}}\left(\lambda_{j,c,k}^{(t)}\right)^{y_{i,j,c}}\right].

In particular, Gelman et al., (2014) recommended using the following version of the WAIC, where “lppd” refers to log pointwise predictive density and pWAIC2p_{\text{WAIC}_{2}} measures the model complexity through the variance,

WAIC =2(lppdpWAIC2)\displaystyle=-2\left(\text{lppd}-p_{\text{WAIC}_{2}}\right) (17)
=2i=1nlog(1Tt=1Tp(𝒚i𝜽(t)))+2i=1nvart=1T(logp(𝒚i𝜽(t))),\displaystyle=-2\sum_{i=1}^{n}\log\left(\frac{1}{T}\sum_{t=1}^{T}p(\boldsymbol{y}_{i}\mid\boldsymbol{\theta}^{(t)})\right)+2\sum_{i=1}^{n}\text{var}_{t=1}^{T}\left(\log p\left(\boldsymbol{y}_{i}\mid\boldsymbol{\theta}^{(t)}\right)\right),

where \text{var}_{t=1}^{T} denotes the sample variance over the T posterior samples, defined as \text{var}_{t=1}^{T}(a_{t})=\frac{1}{T-1}\sum_{t=1}^{T}\big{(}a_{t}-\sum_{t^{\prime}=1}^{T}a_{t^{\prime}}/T\big{)}^{2}. The WAIC can thus be easily calculated from posterior samples, as sketched below. The model with a smaller WAIC is favored.
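For concreteness, given a T-by-n matrix of pointwise log-likelihoods \log p(\boldsymbol{y}_{i}\mid\boldsymbol{\theta}^{(t)}) stored during MCMC, the WAIC in (17) can be computed as in the following minimal sketch.

```python
import numpy as np

def waic(log_lik):
    """Compute WAIC (17) from a (T, n) array of pointwise log-likelihoods."""
    T, n = log_lik.shape
    # lppd: sum_i log( (1/T) sum_t p(y_i | theta^(t)) ), computed stably on the log scale
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(T))
    # p_WAIC2: sum_i of the sample variance (over t) of log p(y_i | theta^(t))
    p_waic2 = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic2)
```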

We carried out a simulation study to evaluate how WAIC performs in selecting G and K, focusing on the previous setting where 50 independent datasets are generated from (n,p,G,K)=(1000,30,6,3). When fixing the candidate K at the truth K=3 and varying the candidate G_{\text{candi}}\in\{4,5,6,7,8\}, the percentages of datasets for which G=4,5,6,7,8 is selected are 0%, 0%, 74% (true G), 20%, and 6%, respectively. When fixing the candidate G at the truth G=6 and varying K_{\text{candi}}\in\{2,3,4,5,6\}, the percentages of datasets for which K=2,3,4,5,6 is selected are 0%, 80% (true K), 6%, 4%, and 10%, respectively. Further, when varying (K,G) over the grid of 25 possible pairs \{2,3,4,5,6\}\times\{4,5,6,7,8\}, the true pair (K,G)=(3,6) is selected by WAIC for 58% of the datasets, and neither K nor G is ever underselected. In general, our simulations show that the WAIC does not tend to underselect the latent dimensions K and G and that it selects the truth with reasonably good accuracy. We remark that our goal here was to pick a practical selection criterion that can be readily applied in real-world applications. Developing a selection strategy for the number of latent dimensions with rigorous theoretical guarantees under the proposed models is left for future investigation.

6 Real Data Applications

6.1 NLTCS Disability Survey Data

In this section we apply Gro-M3 methodology to a functional disability dataset extracted from the National Long Term Care Survey (NLTCS), created by the former Center for Demographic Studies at Duke University. This dataset has been widely analyzed, both with mixed membership models (Erosheva et al., , 2007; Manrique-Vallier, , 2014), and with other models for multivariate categorical data (Dobra and Lenkoski, , 2011; Johndrow et al., , 2017). Here we reanalyze this dataset as an illustration of our dimension-grouped mixed membership approach.

The NLTCS dataset was downloaded from http://lib.stat.cmu.edu/datasets/. It is an extract containing responses from n=21574 community-dwelling elderly Americans aged 65 and above, pooled over the 1982, 1984, 1989, and 1994 survey waves. The disability survey contains p=16 items, with each respondent coded as either healthy (level 0) or disabled (level 1) on each item. Each respondent provides a 16-dimensional response vector \boldsymbol{y}_{i}=(y_{i,1},\ldots,y_{i,16})\in\{0,1\}\times\cdots\times\{0,1\}, where each variable y_{i,j} follows a categorical distribution with two categories, i.e., a Bernoulli distribution, with parameters specific to item j. Among the p=16 NLTCS disability items, functional disability researchers distinguish six activities of daily living (ADLs) and ten instrumental activities of daily living (IADLs). Specifically, the first six ADL items are more basic and relate to hygiene and personal care: eating, getting in/out of bed, getting around inside, dressing, bathing, and getting to the bathroom or using a toilet. The remaining ten IADL items relate to activities needed to live without dedicated professional care: doing heavy housework, doing light housework, doing laundry, cooking, grocery shopping, getting about outside, traveling, managing money, taking medicine, and telephoning.

Here, we apply the MCMC algorithm developed for the Dirichlet Gro-M3 to the data; the Dirichlet distribution was also used to model the mixed membership scores in Erosheva et al., (2007). Our preliminary analysis of the NLTCS data indicates the Dirichlet parameters 𝜶\boldsymbol{\alpha} are relatively small, so we adopt a small σα=0.002\sigma_{\alpha}=0.002 in the lognormal proposal distribution in Eq. (15) in the Metropolis-Hastings sampling step. For each setting of (G,K)(G,K), we run the MCMC for 40000 iterations and consider the first 20000 as burn-in to be conservative. We retain every 10th sample after the burn-in. The candidate values for the (G,K)(G,K) are all the combinations of G{2,3,,15,16}G\in\{2,3,\ldots,15,16\} and K{6,7,,11,12}K\in\{6,7,\ldots,11,12\}.

For selecting the latent dimensions (G,K) in practice, we recommend picking the (G^{\star},K^{\star}) that attain the lowest WAIC value while not containing any empty groups of variables. In particular, for certain pairs of (G,K) (in our case, for all G>10) under the NLTCS data, we observe that the posterior mode of the grouping matrix, \overline{\mathbf{L}}, has some all-zero columns. If \widetilde{G} denotes the number of columns of \overline{\mathbf{L}} that are not all zero, this means that, after model fitting, the number of groups occupied by the p variables is \widetilde{G}<G. Models with \widetilde{G}<G are difficult to interpret, because empty groups that do not contain any variables cannot be assigned meaning. Therefore, we focus only on models where \overline{\mathbf{L}} does not contain any all-zero columns and pick the one with the smallest WAIC among them. Using this criterion, the model with G^{\star}=10 and K^{\star}=9 is selected for the NLTCS data. We observed reasonably good convergence and mixing of our MCMC algorithm for these data. The proposed dimension-grouping model provides a better fit in terms of WAIC and a more parsimonious alternative to traditional MMMs.

Figure 8: Estimated variable grouping structure \boldsymbol{s} (i.e., \mathbf{L}) for the NLTCS data with (G^{\star},K^{\star})=(10,9). The first six items are ADL (“activities of daily living”) items and the remaining ten items are IADL (“instrumental activities of daily living”) items. Out of the G^{\star}=10 variable groups, the three groups containing multiple items are color coded in blue (j=1,2,4,5), red (j=9,10,16), and yellow (j=12,13) for better visualization.

We provide the estimated \overline{\mathbf{L}} under the selected model with G^{\star}=10 and K^{\star}=9 in Figure 8. Out of the G^{\star}=10 groups, three contain multiple items; in Figure 8, the item labels of these three groups are colored in blue (j=1,2,4,5), red (j=9,10,16), and yellow (j=12,13) for better visualization. The groupings obtained by our model lead to several observations. First, four out of six ADL variables (j=1,2,4,5) are categorized into one group; these items are basic self-care activities that require limited mobility. Second, the three IADL variables (j=9,10,16) in one group may be related to traditional gender roles, as these items correspond to activities performed more frequently by women than by men. Finally, the two items j=12 “getting about outside” and j=13 “traveling”, which require a high level of mobility, form another group. Note that such a model-based grouping of the items differs from the established groups (ADL and IADL) and could not have been obtained with previous mixed membership models (Erosheva et al., , 2007).

In addition to the variable grouping structure, we plot the posterior means of the positive response probabilities \overline{\boldsymbol{\Lambda}}_{:,1,:} in Figure 9 for the selected model. For each survey item j\in[p] and each extreme latent profile k\in[K], the entry \boldsymbol{\Lambda}_{j,1,k} records the conditional probability of giving a positive (disabled) response on this item given the kth latent profile. The K^{\star}=9 profiles are quite well separated and can be interpreted as usual in mixed membership analysis. For example, in Figure 9, the leftmost column (k=1) represents a relatively healthy latent profile, while the rightmost column (k=9) represents a relatively severely disabled latent profile. As for the Dirichlet parameters \boldsymbol{\alpha}, their posterior means are \overline{\boldsymbol{\alpha}}=(0.0245, 0.0289, 0.0074, 0.0176, 0.0231, 0.0193, 0.0001, 0.0001, 0.0242). Such small values of the Dirichlet parameters imply that the membership score vectors of a majority of individuals tend to be dominated by one component. This observation is consistent with Erosheva et al., (2007). Meanwhile, we obtain a simpler model than that in Erosheva et al., (2007), as each subject realizes at most G distinct latent profiles according to the grouping of variables, rather than up to p=16 as in traditional MMMs.

Figure 9: Estimated positive response probabilities 𝚲:,1,:\boldsymbol{\Lambda}_{:,1,:} for the NLTCS data with (G,K)=(10,9)(G^{\star},K^{\star})=(10,9). Each column represents one extreme latent profile. Entries are conditional probabilities of giving a positive response (1=disabled1=\text{disabled}) to each item given that latent profile.

We emphasize again that bag-of-words topic models make the exchangeability assumption, which is fundamentally different from, and in fact more restrictive than, our Gro-M3 when modeling non-exchangeable item response data. Specifically, the exchangeability assumption would force the item parameters \{\boldsymbol{\lambda}_{j,k}\in\mathbb{R}^{d}:~k\in[K]\} to be identical across all items j\in[p], which is unrealistic for survey response data (or the personality test data analyzed in Section 6.2) in which different items clearly have different characteristics. For example, if one were to use a topic model such as LDA to analyze the NLTCS disability survey data, then a plot of the 16\times K conditional probability table like Figure 9 would not be possible, because all p=16 items would share the same K-dimensional vector of conditional Bernoulli probabilities.

6.2 International Personality Item Pool (IPIP) Personality Test Data

We also apply the proposed method to analyze a personality test dataset containing multivariate polytomous responses: the International Personality Item Pool (IPIP) personality test data. This dataset is publicly available from the Open-Source Psychometrics Project website https://openpsychometrics.org/_rawdata/. The dataset contains n_{\text{all}}=1005 subjects' responses to p=40 Likert-scale personality test items in the International Personality Item Pool. After dropping subjects with missing entries in their responses, n=901 complete response vectors remain. Each subject's observed response vector is 40-dimensional, where each dimension ranges in \{1,2,3,4,5\}, so d_{1}=d_{2}=\cdots=d_{p}=5 categories. Each of these 40 items was designed to measure one of four personality factors: Assertiveness (“AS”), Social confidence (“SC”), Adventurousness (“AD”), and Dominance (“DO”). Specifically, items 1–10 measure AS, items 11–20 measure SC, items 21–30 measure AD, and items 31–40 measure DO. The responses to the reverse-worded items (i.e., items 7–10, 16–20, 25–30) are preprocessed to \widetilde{y}_{ij}=6-y_{ij}. We apply our new model to this dataset for numbers of variable groups G\in\{3,4,5,6,7\} and K=4 extreme latent profiles, and the WAIC selects the model with G=5 groups. We plot the posterior mode of the estimated grouping matrix in Figure 10, together with the content of each item.

Figure 10 shows that our new variable-grouping component is able to reveal the item blocks that measure different personality factors in an entirely unsupervised manner. Moreover, the estimated variable grouping cuts across the four established personality factors to uncover a more nuanced structure. For example, the group of items {AS1, SC4, SC10} concerns the verbal-expression aspect of a person, while the group of items {AS3–AS10, SC5, SC7} concerns a person's intention to lead and influence other people. In summary, for this personality test dataset, the proposed Gro-M3 not only provides a better model fit than the usual GoM model (since G=5\ll p=40 is selected by WAIC), but also enjoys interpretability and uncovers meaningful subgroups of the observed variables.

Figure 10: IPIP personality test items grouping structure estimated from our Gro-M3. Item type abbreviations are: “AS” represents “Assertiveness”, “SC” represents “Social confidence”, “AD” represents “Adventurousness”, and “DO” represents “Dominance”.

We also conduct experiments comparing our probabilistic hybrid decomposition Gro-M3 with the probabilistic CP decomposition (the latent class model in Dunson and Xing, (2009)) and the probabilistic Tucker decomposition (the GoM model in Erosheva et al., (2007)) on the IPIP personality test data. After fitting each tensor decomposition method to the data, we calculate the model-based Cramer's V measure between each pair of items and examine how well each method recovers meaningful item dependence structure; a sketch of this computation is given below. Figure 11 presents the model-based pairwise Cramer's V calculated using the three tensor decompositions, along with the model-free Cramer's V calculated directly from the data. Figure 11 shows that our Gro-M3 decomposition clearly outperforms the probabilistic CP and Tucker decompositions in recovering the meaningful block structure of the personality test items.
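The comparison metric can be sketched as follows: given the joint probability table for a pair of items (either implied by a fitted decomposition or computed from empirical frequencies), the population Cramer's V is computed as in this minimal, illustrative sketch.

```python
import numpy as np

def cramers_v_from_joint(P):
    """Cramer's V for a pair of categorical items from their joint probability table P (d1 x d2).

    Assumes strictly positive marginals; P may be model-implied or empirical.
    """
    pa = P.sum(axis=1, keepdims=True)   # marginal of the first item
    pb = P.sum(axis=0, keepdims=True)   # marginal of the second item
    phi2 = np.sum((P - pa * pb) ** 2 / (pa * pb))
    return np.sqrt(phi2 / (min(P.shape) - 1))
```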

Figure 11: Upper two panels: Cramer’s V posterior means for item pairs obtained using the usual CP decomposition (latent class model) and the usual Tucker decomposition (grade of membership model). Bottom left: Cramer’s V posterior means for item pairs obtained using the Gro-M3. Bottom right: Sample Cramer’s V for item pairs calculated directly from data.

7 Discussion

We have proposed a new class of mixed membership models for multivariate categorical data, dimension-grouped mixed membership models (Gro-M3s), studied their identifiability, and developed a Bayesian inference procedure for Dirichlet Gro-M3s. On the methodological side, the new model strikes a balance between flexibility and parsimony. Among popular latent structure models for multivariate categorical data, the Gro-M3 bridges the parsimonious yet insufficiently flexible Latent Class Model (corresponding to the CP decomposition of probability tensors) and the very flexible yet not parsimonious Grade of Membership Model (corresponding to the Tucker decomposition of probability tensors). On the theoretical side, we establish the identifiability of the population parameters that govern the distribution of Gro-M3s. The quantities shown to be identifiable include not only the continuous model parameters, but also the key discrete structure: how the variables' latent assignments are partitioned into groups. These identifiability results lay a solid foundation for reliable statistical analysis and real-world applications. We perform Bayesian estimation for the new model using a Metropolis-Hastings-within-Gibbs sampler. Numerical studies show that the method can accurately estimate the quantities of interest, empirically validating the identifiability results.

For the special case of binary responses with d_{1}=\cdots=d_{p}=2, as pointed out by a reviewer, models with a Bernoulli-to-latent-Poisson link in Zhou et al., (2016) and a Bernoulli-to-latent-Gaussian link in multivariate item response theory models in Embretson and Reise, (2013) are useful tools that can capture certain lower-dimensional latent constructs. Our model differs from these models in terms of statistical and practical interpretation. In our Gro-M3, each subject's latent variable is a mixed membership vector \boldsymbol{\pi}_{i} ranging in the probability simplex \Delta^{K-1}, which can be interpreted as the subject being a partial member of each of the K extreme latent profiles. For k\in[K], the kth extreme latent profile can also be directly interpreted by inspecting the estimated item parameters \{\boldsymbol{\lambda}_{j,k}:~j\in[p]\}. Geometrically, the entry \pi_{ik} captures the relative proximity of each subject to the kth extreme latent behavioral profile. Such an interpretation of individual-level mixtures is highly desirable in applications such as social science surveys (Erosheva, , 2003) and medical diagnosis (Woodbury et al., , 1978), where each extreme latent profile represents a prototypical response pattern. Therefore, in these applications, mixed membership modeling is more interpretable and preferable to modeling binary matrix data through a nonlinear transformation of underlying Gaussian or Poisson latent variables (such as the Bernoulli-to-latent-Poisson or Bernoulli-to-latent-Gaussian link).

We remark that our proposed Gro-M3 covers the usual GoM model as a special case. In fact, the GoM model can be readily recovered by setting the grouping matrix \mathbf{L} to the p\times p identity matrix (i.e., \mathbf{L}=\mathbf{I}_{p}). In terms of practical estimation, we can simply fix \mathbf{L}=\mathbf{I}_{p} throughout the MCMC iterations and estimate the other quantities in the same way as in our current algorithm. Using this approach, we have compared the performance of our flexible Gro-M3 and the classical GoM model in the real data analyses. Specifically, for both the NLTCS disability survey data and the IPIP personality test data, fixing \mathbf{L}=\mathbf{I}_{p} with G=p variable groups gives larger WAIC values than the selected, more parsimonious model with G\ll p. This indicates that the traditional GoM model is not favored by the information criterion and gives a poorer fit to the data. We also point out that our MCMC algorithm can be viewed as a novel Bayesian factorization algorithm for probability tensors, in a similar spirit to existing Bayesian tensor factorization methods such as Dunson and Xing, (2009) and Zhou et al., (2015). Our Bayesian Gro-M3 factorization outperforms the usual probabilistic tensor factorizations in recovering the item dependence structure in the IPIP personality data analysis. Therefore, we view our proposed MCMC algorithm as contributing a new type of tensor factorization approach with uniqueness (i.e., identifiability) guarantees and good empirical performance.

Our modeling assumption of a variable grouping structure can also be useful for related models. For example, Manrique-Vallier, (2014) proposed a longitudinal MMM to capture heterogeneous pathways of disability and cognitive trajectories of the elderly population in disability survey data. The proposed dimension-grouping assumption can provide an interesting new interpretation in such longitudinal settings. Specifically, when survey items are answered at multiple time points, it may be plausible to assume that a subject's latent profile persists locally for a block of items before potentially switching to a different profile for the next block. This can be readily accommodated by the dimension-grouping assumption, with the slight modification that items belonging to the same group should be constrained to be close in time. Our identifiability results apply to this setup, and similar computational procedures can be developed. Furthermore, although this work focuses on modeling multivariate categorical data, the applicability of the new dimension-grouping assumption is not limited to such data. A similar assumption may be made in other mixed membership models; examples include the generalized latent Dirichlet models for mixed data types studied in Zhao et al., (2018).

In terms of identifiability, the current work has focused on the population quantities, including the variable grouping matrix 𝐋\mathbf{L}, the conditional probability tables 𝚲\boldsymbol{\Lambda}, and the Dirichlet parameters 𝜶\boldsymbol{\alpha}. In addition to these population parameters, an interesting future question is the identification of individual mixed membership proportions {𝝅i;i=1,,n}\{\boldsymbol{\pi}_{i};\,i=1,\ldots,n\} for subjects in the sample. Studying the identification and accurate estimation of 𝝅i\boldsymbol{\pi}_{i}’s presumably requires quite different conditions from ours. A recent work (Mao et al., , 2020) considered a similar problem for mixed membership stochastic block models for network data. Finally, in terms of estimation procedures, in this work we have employed a Bayesian approach to Dirichlet Gro-M3s, and the developed MCMC sampler shows excellent computational performance. In the future, it would also be interesting to consider method-of-moments estimation for the proposed models related to Zhao et al., (2018) and Tan and Mukherjee, (2017).

This work has focused on proposing a new interpretable and identifiable mixed membership model for multivariate categorical data, and our MCMC algorithm has satisfactory performance in real data applications. In the future, it would be interesting to develop scalable and online variational inference methods, which would make the model more applicable to massive real-world datasets. We expect that it is possible to develop variational inference algorithms similar in spirit to Blei et al., (2003) for topic models and Airoldi et al., 2008a for mixed membership stochastic block models to scale up computation. In addition, just as the hierarchical Dirichlet process (Teh et al., , 2006) is a natural nonparametric generalization of the parametric latent Dirichlet allocation model (Blei et al., , 2003), it would also be interesting to generalize our Gro-M3 to the nonparametric Bayesian setting to automatically infer K and G. Developing a method to automatically infer K and G would be of great practical value, because in many situations there may not be enough prior knowledge about these quantities. We leave these directions for future work.


Acknowledgements

This work was partially supported by NIH grants R01ES027498 and R01ES028804, NSF grants DMS-2210796, SES-2150601 and SES-1846747, and IES grant R305D200015. This project also received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 856506). The authors thank the Action Editor Dr. Mingyuan Zhou and three anonymous reviewers for constructive and helpful comments that helped improve this manuscript.


References

  • Airoldi et al., (2014) Airoldi, E. M., Blei, D., Erosheva, E. A., and Fienberg, S. E. (2014). Handbook of mixed membership models and their applications. CRC press.
  • (2) Airoldi, E. M., Blei, D., Fienberg, S., and Xing, E. (2008a). Mixed membership stochastic blockmodels. Advances in Neural Information Processing Systems, 21.
  • (3) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008b). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.
  • Allman et al., (2009) Allman, E. S., Matias, C., and Rhodes, J. A. (2009). Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132.
  • Anandkumar et al., (2012) Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Liu, Y.-K. (2012). A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925.
  • Anandkumar et al., (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832.
  • Anandkumar et al., (2015) Anandkumar, A., Hsu, D., Janzamin, M., and Kakade, S. (2015). When are overcomplete topic models identifiable? Uniqueness of tensor tucker decompositions with structured sparsity. The Journal of Machine Learning Research, 16(1):2643–2694.
  • Arora et al., (2012) Arora, S., Ge, R., and Moitra, A. (2012). Learning topic models–going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 1–10. IEEE.
  • Blei, (2012) Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.
  • Blei et al., (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  • Dhillon et al., (2003) Dhillon, I. S., Mallela, S., and Modha, D. S. (2003). Information-theoretic co-clustering. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 89–98.
  • Dobra and Lenkoski, (2011) Dobra, A. and Lenkoski, A. (2011). Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5(2A):969–993.
  • Dunson and Xing, (2009) Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051.
  • Embretson and Reise, (2013) Embretson, S. E. and Reise, S. P. (2013). Item response theory. Psychology Press.
  • Erosheva, (2003) Erosheva, E. A. (2003). Bayesian estimation of the grade of membership model. Bayesian Statistics, 7:501–510.
  • Erosheva et al., (2007) Erosheva, E. A., Fienberg, S. E., and Joutard, C. (2007). Describing disability through individual-level mixture models for multivariate binary data. Annals of Applied Statistics, 1(2):346.
  • Erosheva et al., (2004) Erosheva, E. A., Fienberg, S. E., and Lafferty, J. (2004). Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101(suppl 1):5220–5227.
  • Gelman et al., (2014) Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016.
  • Goodman, (1974) Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215–231.
  • Govaert and Nadif, (2013) Govaert, G. and Nadif, M. (2013). Co-clustering: models, algorithms and applications. John Wiley & Sons.
  • Griffiths and Steyvers, (2004) Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.
  • Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
  • Hooten and Hobbs, (2015) Hooten, M. B. and Hobbs, N. T. (2015). A guide to Bayesian model selection for ecologists. Ecological monographs, 85(1):3–28.
  • Johndrow et al., (2017) Johndrow, J. E., Bhattacharya, A., and Dunson, D. B. (2017). Tensor decompositions and sparse log-linear models. Annals of Statistics, 45(1):1–38.
  • Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
  • Kruskal, (1977) Kruskal, J. B. (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138.
  • Mackey et al., (2010) Mackey, L. W., Weiss, D. J., and Jordan, M. I. (2010). Mixed membership matrix factorization. In International Conference on Machine Learning.
  • Manrique-Vallier, (2014) Manrique-Vallier, D. (2014). Longitudinal mixed membership trajectory models for disability survey data. Annals of Applied Statistics, 8(4):2268.
  • Manrique-Vallier and Reiter, (2012) Manrique-Vallier, D. and Reiter, J. P. (2012). Estimating identification disclosure risk using mixed membership models. Journal of the American Statistical Association, 107(500):1385–1394.
  • Mao et al., (2020) Mao, X., Sarkar, P., and Chakrabarti, D. (2020). Estimating mixed memberships with sharp eigenvector deviations. Journal of the American Statistical Association, pages 1–13.
  • McLachlan and Peel, (2000) McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.
  • Nguyen, (2015) Nguyen, X. (2015). Posterior contraction of the population polytope in finite admixture models. Bernoulli, 21(1):618–646.
  • Paganin et al., (2021) Paganin, S., Herring, A. H., Olshan, A. F., and Dunson, D. B. (2021). Centered partition processes: Informative priors for clustering (with discussion). Bayesian Analysis, 16(1):301–370.
  • Piironen and Vehtari, (2017) Piironen, J. and Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735.
  • Pritchard et al., (2000) Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.
  • Rand, (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
  • Saddiki et al., (2015) Saddiki, H., McAuliffe, J., and Flaherty, P. (2015). GLAD: a mixed-membership model for heterogeneous tumor subtype classification. Bioinformatics, 31(2):225–232.
  • Shang et al., (2021) Shang, Z., Erosheva, E. A., and Xu, G. (2021). Partial-mastery cognitive diagnosis models. Annals of Applied Statistics, 15(3):1529–1555.
  • Spiegelhalter et al., (2002) Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639.
  • Tan and Mukherjee, (2017) Tan, Z. and Mukherjee, S. (2017). Partitioned tensor factorizations for learning mixed membership models. In International Conference on Machine Learning, pages 3358–3367.
  • Teh et al., (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
  • Tucker, (1966) Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.
  • Wang, (2019) Wang, Y. (2019). Convergence rates of latent topic models under relaxed identifiability conditions. Electronic Journal of Statistics, 13(1):37–66.
  • Wang et al., (2015) Wang, Y. S., Matsueda, R. L., and Erosheva, E. A. (2015). A variational em method for mixed membership models with multivariate rank data: an analysis of public policy preferences. arXiv: Methodology, pages 1452–1480.
  • Watanabe, (2010) Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(12).
  • Woodbury et al., (1978) Woodbury, M. A., Clive, J., and Garson Jr, A. (1978). Mathematical typology: a grade of membership technique for obtaining disease definition. Computers and Biomedical Research, 11(3):277–298.
  • Zhao et al., (2018) Zhao, S., Engelhardt, B. E., Mukherjee, S., and Dunson, D. B. (2018). Fast moment estimation for generalized latent Dirichlet models. Journal of the American Statistical Association, 113(524):1528–1540.
  • Zhou et al., (2015) Zhou, J., Bhattacharya, A., Herring, A. H., and Dunson, D. B. (2015). Bayesian factorizations of big sparse tensors. Journal of the American Statistical Association, 110(512):1562–1576.
  • Zhou, (2018) Zhou, M. (2018). Nonparametric Bayesian negative binomial factor analysis. Bayesian Analysis, 13(4):1065–1093.
  • Zhou et al., (2016) Zhou, M., Cong, Y., and Chen, B. (2016). Augmentable Gamma belief networks. Journal of Machine Learning Research, 17(1):5656–5699.

Supplementary Material

This Supplementary Material contains two sections. Supplement A contains the proofs of all theoretical results in the paper. Supplement B presents a note on the pairwise Cramer's V measures between categorical variables.

Supplement A: Proofs of Theoretical Results

7.1 Proof of Theorem 1

For notational simplicity, from now on we omit the subscript $i$ of subject-specific random variables without loss of generality; all such variables, including $\boldsymbol{y}$ and $\boldsymbol{z}$, should be understood as associated with a random subject. Denote by $\boldsymbol{z}=(z_{1},\ldots,z_{G})\in[K]^{G}$ a configuration of the latent profiles realized for the $G$ groups of variables. Recall that given a fixed grouping matrix $\mathbf{L}$, the associated group assignment vector $\boldsymbol{s}=(s_{1},\ldots,s_{p})$ is defined by $s_{j}=g$ if and only if $\ell_{j,g}=1$. We next introduce some new notation. For each variable $j\in[p]$, each category $c_{j}\in[d_{j}]$, and each possible latent profile configuration $\boldsymbol{z}\in\{1,\ldots,K\}^{G}$, define a new parameter $\gamma_{j,c_{j},\boldsymbol{z}}$ by

\gamma_{j,c_{j},\boldsymbol{z}}=\lambda_{j,c_{j},z_{s_{j}}}. \qquad (18)

Collect all the $\gamma$-parameters in $\boldsymbol{\Gamma}=(\gamma_{j,c_{j},\boldsymbol{z}})$; then $\boldsymbol{\Gamma}$ is a three-way array (a tensor of size $p\times d\times K^{G}$ if $d_{1}=\cdots=d_{p}=d$), since $j\in[p]$, $c_{j}\in[d_{j}]$, and $\boldsymbol{z}\in[K]^{G}$. For each $j\in[p]$, we denote the $d_{j}\times K^{G}$ matrix $\boldsymbol{\Gamma}_{j,:,:}$ by $\boldsymbol{\Gamma}_{j}$ for simplicity. The representation in (18) implies that many entries in $\boldsymbol{\Gamma}$ are equal. Specifically, for two arbitrary latent assignment vectors $\boldsymbol{z}=(z_{1},\ldots,z_{G})$ and $\boldsymbol{z}^{\prime}=(z^{\prime}_{1},\ldots,z^{\prime}_{G})$ with $\boldsymbol{z}\neq\boldsymbol{z}^{\prime}$, as long as $z_{s_{j}}=z^{\prime}_{s_{j}}$ we have $\gamma_{j,c_{j},\boldsymbol{z}}=\gamma_{j,c_{j},\boldsymbol{z}^{\prime}}$. We choose the over-parameterization in (18) because this notation facilitates the study of identifiability through the underlying tensor decomposition structure, as will become clear shortly. In particular, the $\boldsymbol{\Gamma}_{j}$'s have the following property.

Lemma 1.

Under the definition in (18), for any set of indices $S\subseteq[p]$ such that $\{s_{j}:\,j\in S\}\supseteq[G]$, we have

\bigotimes_{g=1}^{G}\left(\bigodot_{j\in S:\,s_{j}=g}\boldsymbol{\Lambda}_{j}\right)=\bigodot_{j\in S}\boldsymbol{\Gamma}_{j}. \qquad (19)
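
To make the over-parameterization in (18) and the identity (19) concrete, the following minimal numerical sketch (ours, in Python with NumPy; helper names such as rand_cpt and khatri_rao are chosen for illustration and are not part of the paper's code) constructs the $\boldsymbol{\Gamma}_{j}$'s from randomly generated conditional probability tables and a block-diagonal grouping, and verifies (19) entrywise under one fixed, consistent ordering of the latent configurations $\boldsymbol{z}$ and the response patterns.

import itertools, functools
import numpy as np

rng = np.random.default_rng(0)
K, G = 3, 2                       # K latent profiles, G variable groups (illustrative sizes)
d = [2, 3, 2, 4]                  # numbers of categories d_j
s = [0, 0, 1, 1]                  # group memberships s_j (0-indexed), block-diagonal grouping

def rand_cpt(d_j, K):
    # random d_j x K conditional probability table Lambda_j with columns summing to one
    M = rng.random((d_j, K))
    return M / M.sum(axis=0)

def khatri_rao(A, B):
    # column-wise Khatri-Rao product: the k-th column is kron(A[:, k], B[:, k])
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

Lam = [rand_cpt(dj, K) for dj in d]

# Gamma_j as in (18): a d_j x K^G matrix whose column indexed by z equals Lambda_j[:, z_{s_j}]
zs = list(itertools.product(range(K), repeat=G))          # z_1 varies slowest, z_G fastest
Gam = [np.stack([Lam[j][:, z[s[j]]] for z in zs], axis=1) for j in range(len(d))]

# left-hand side of (19): Kronecker product over groups of within-group Khatri-Rao products
lhs = functools.reduce(np.kron,
                       [functools.reduce(khatri_rao, [Lam[j] for j in range(len(d)) if s[j] == g])
                        for g in range(G)])
# right-hand side of (19): Khatri-Rao product over all variables of the Gamma_j's
rhs = functools.reduce(khatri_rao, Gam)
print(np.allclose(lhs, rhs))      # True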

Now we can equivalently rewrite the previous model specification (7) as follows,

\mathbb{P}^{\text{C-M}^{3}}(y_{1}=c_{1},\ldots,y_{p}=c_{p}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})=\pi_{c_{1},\ldots,c_{p}}
=\sum_{z_{1}=1}^{K}\cdots\sum_{z_{G}=1}^{K}\phi_{z_{1},\ldots,z_{G}}\prod_{j=1}^{p}\lambda_{j,c_{j},z_{s_{j}}}=\sum_{z_{1}=1}^{K}\cdots\sum_{z_{G}=1}^{K}\phi_{z_{1},\ldots,z_{G}}\prod_{j=1}^{p}\gamma_{j,c_{j},\boldsymbol{z}}
=\sum_{\boldsymbol{z}\in[K]^{G}}\phi_{\boldsymbol{z}}\prod_{j=1}^{p}\gamma_{j,c_{j},\boldsymbol{z}}=\mathbb{P}(y_{1}=c_{1},\ldots,y_{p}=c_{p}\mid\boldsymbol{\Gamma},\boldsymbol{\Phi}), \qquad (20)

where $\boldsymbol{c}=(c_{1},\ldots,c_{p})\in\times_{j=1}^{p}[d_{j}]$. Denote by $\boldsymbol{\Phi}=(\phi_{z_{1},\ldots,z_{G}})$ the $G$-th order tensor of size $K\times\cdots\times K$, and denote by $\text{vec}(\boldsymbol{\Pi})$ the vectorized version of $\boldsymbol{\Pi}$, so that $\text{vec}(\boldsymbol{\Pi})$ is a vector of length $\prod_{j=1}^{p}d_{j}$; in particular, this vector has entries defined as follows:

\text{vec}(\boldsymbol{\Pi})_{c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{p}-1)d_{1}\cdots d_{p-1}}=\pi_{c_{1},c_{2},\ldots,c_{p}} \qquad (21)

for any $\boldsymbol{c}=(c_{1},\ldots,c_{p})\in\times_{j=1}^{p}[d_{j}]$. Suppose alternative parameters $(\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}})$ lead to the same distribution of the observed variables; that is, $\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\mathbf{L},\boldsymbol{\Lambda},\boldsymbol{\Phi})=\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\overline{\mathbf{L}},\overline{\boldsymbol{\Lambda}},\overline{\boldsymbol{\Phi}})$ holds for every possible response pattern $\boldsymbol{c}\in\times_{j=1}^{p}[d_{j}]$. Then by the equivalence in (20), we also have $\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\boldsymbol{\Gamma},\boldsymbol{\Phi})=\mathbb{P}(\boldsymbol{y}=\boldsymbol{c}\mid\overline{\boldsymbol{\Gamma}},\overline{\boldsymbol{\Phi}})$ for all $\boldsymbol{c}\in\times_{j=1}^{p}[d_{j}]$.

The following two lemmas will be useful.

Lemma 2.

Without loss of generality, suppose the first $\sum_{j=1}^{p}\ell_{j,1}$ variables belong to the first group, the next $\sum_{j=1}^{p}\ell_{j,2}$ variables belong to the second group, and so on; that is, the matrix $\mathbf{L}$ takes a block-diagonal form. The Gro-M$^{3}$ in (20) implies the following identity:

\text{vec}(\boldsymbol{\Pi})=\left\{\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\right\}\cdot\text{vec}(\boldsymbol{\Phi}). \qquad (22)
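
Before stating the next lemma, here is a small numerical check (ours, under the same illustrative conventions as the sketch above) that the matrix form (22) reproduces the brute-force probabilities in (20). The row and column orderings used below are one consistent choice and differ from the specific vectorization in (21) only by a fixed permutation, which does not affect the identity.

import itertools, functools
import numpy as np

rng = np.random.default_rng(1)
K, G = 2, 2
d = [2, 3, 2]                     # numbers of categories d_j
s = [0, 0, 1]                     # block-diagonal grouping: variables 1,2 in group 1, variable 3 in group 2

def rand_cpt(d_j, K):
    M = rng.random((d_j, K))
    return M / M.sum(axis=0)

def khatri_rao(A, B):
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

Lam = [rand_cpt(dj, K) for dj in d]
Phi = rng.random((K,) * G)
Phi /= Phi.sum()                  # proportion tensor with entries summing to one

# brute-force probability tensor Pi from the mixture representation in (20)
Pi = np.zeros(d)
for z in itertools.product(range(K), repeat=G):
    cols = [Lam[j][:, z[s[j]]] for j in range(len(d))]
    Pi += Phi[z] * functools.reduce(np.multiply.outer, cols)

# matrix form of (22), built with the same (consistent) orderings on both sides
M = functools.reduce(np.kron,
                     [functools.reduce(khatri_rao, [Lam[j] for j in range(len(d)) if s[j] == g])
                      for g in range(G)])
print(np.allclose(Pi.reshape(-1), M @ Phi.reshape(-1)))   # True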
Lemma 3.

Suppose there are two disjoint sets of $G$ observed variables, $S^{(1)}=\{j_{1}^{(1)},\ldots,j_{G}^{(1)}\}$ and $S^{(2)}=\{j_{1}^{(2)},\ldots,j_{G}^{(2)}\}$, satisfying $s_{j^{(1)}_{g}}=s_{j^{(2)}_{g}}=g$ for each $g=1,\ldots,G$. Then

\bigotimes_{g=1}^{G}\left\{\boldsymbol{\Lambda}_{j^{(1)}_{g}}\bigodot\boldsymbol{\Lambda}_{j^{(2)}_{g}}\right\}=\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(1)}_{g}}\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(2)}_{g}}\right\}\quad\text{up to a permutation of rows}. \qquad (23)

If additionally $s_{j^{(3)}_{G}}=G$ for some $j^{(3)}_{G}\in[p]$, then up to a permutation of rows,

\bigotimes_{g=1}^{G-1}\left\{\boldsymbol{\Lambda}_{j^{(1)}_{g}}\bigodot\boldsymbol{\Lambda}_{j^{(2)}_{g}}\right\}\bigotimes\left\{\boldsymbol{\Lambda}_{j^{(1)}_{G}}\bigodot\boldsymbol{\Lambda}_{j^{(2)}_{G}}\bigodot\boldsymbol{\Lambda}_{j^{(3)}_{G}}\right\}
=\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(1)}_{g}}\right\}\bigodot\left\{\bigotimes_{g=1}^{G-1}\boldsymbol{\Lambda}_{j^{(2)}_{g}}\bigotimes\left(\boldsymbol{\Lambda}_{j^{(2)}_{G}}\bigodot\boldsymbol{\Lambda}_{j^{(3)}_{G}}\right)\right\}. \qquad (24)

We continue with the proof of Theorem 1. Under the conditions of the theorem, without loss of generality we can assume that for each $g\in[G]$, the first three variables (among the $p$ variables) belonging to the $g$th group have corresponding matrices $\boldsymbol{\Lambda}_{j}$ with full column rank; denote the indices of these three variables by $j^{(1)}_{g},j^{(2)}_{g},j^{(3)}_{g}$. For example, if $\mathbf{L}$ takes the block-diagonal form, then for $g=1$ these three variables are indexed by $\{j^{(1)}_{1},\,j^{(2)}_{1},\,j^{(3)}_{1}\}=\{1,2,3\}$; for $g=2$ they are indexed by $\{j^{(1)}_{2},\,j^{(2)}_{2},\,j^{(3)}_{2}\}=\{\sum_{j=1}^{p}\ell_{j,1}+1,\,\sum_{j=1}^{p}\ell_{j,1}+2,\,\sum_{j=1}^{p}\ell_{j,1}+3\}$, and so on. Define the following sets of variable indices:

S^{(m)}=\left\{j^{(m)}_{1},\ldots,j^{(m)}_{G}\right\}\text{ for }m=1,2,3;\qquad S^{(0)}=\{1,\ldots,p\}\setminus\bigcup_{m=1}^{3}S^{(m)}.

Lemmas 2 and 3 imply that we can write $\text{vec}(\boldsymbol{\Pi})$ under the true parameters as follows:

\text{vec}(\boldsymbol{\Pi})=\left\{\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\right\}\cdot\text{vec}(\boldsymbol{\Phi})
=\left[\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j_{g}^{(1)}}\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j_{g}^{(2)}}\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\left(\boldsymbol{\Lambda}_{j_{g}^{(3)}}\bigodot\Bigl(\bigodot_{j\in S^{(0)}:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\Bigr)\right)\right\}\right]\cdot\text{vec}(\boldsymbol{\Phi})
\stackrel{(\star)}{=}\left[\left\{\bigodot_{g=1}^{G}\boldsymbol{\Gamma}_{j_{g}^{(1)}}\right\}\bigodot\left\{\bigodot_{g=1}^{G}\boldsymbol{\Gamma}_{j_{g}^{(2)}}\right\}\bigodot\left\{\bigodot_{j\in S^{(3)}\cup S^{(0)}}\boldsymbol{\Gamma}_{j}\right\}\right]\cdot\text{vec}(\boldsymbol{\Phi}). \qquad (25)

The equality $(\star)$ above follows from Lemma 1 together with the mixed-product property of the Khatri-Rao and Kronecker products, by noting that for each $m=1,2,3$ the variables $\{j_{1}^{(m)},\ldots,j_{G}^{(m)}\}$ cover all $G$ groups, i.e., $\{s_{j_{g}^{(m)}}:\,g\in[G]\}=[G]$, and hence so does $S^{(3)}\cup S^{(0)}$. Define

f^{(1)}(\boldsymbol{\Gamma}):=\bigodot_{g=1}^{G}\boldsymbol{\Gamma}_{j_{g}^{(1)}}=\bigodot_{j\in S^{(1)}}\boldsymbol{\Gamma}_{j},\qquad f^{(2)}(\boldsymbol{\Gamma}):=\bigodot_{g=1}^{G}\boldsymbol{\Gamma}_{j_{g}^{(2)}}=\bigodot_{j\in S^{(2)}}\boldsymbol{\Gamma}_{j},\qquad f^{(3)}(\boldsymbol{\Gamma}):=\bigodot_{j\in S^{(3)}\cup S^{(0)}}\boldsymbol{\Gamma}_{j}.

The definitions of the three functions $f^{(1)}(\cdot)$, $f^{(2)}(\cdot)$, $f^{(3)}(\cdot)$ of $\boldsymbol{\Gamma}$ depend only on the two sets of variable indices $S^{(1)}$ and $S^{(2)}$ (the set $S^{(3)}\cup S^{(0)}$ being their complement in $[p]$), which in turn are determined by the true grouping matrix $\mathbf{L}$. Now (25) can be further written as

\text{vec}(\boldsymbol{\Pi})=\left(f^{(1)}(\boldsymbol{\Gamma})\bigodot f^{(2)}(\boldsymbol{\Gamma})\bigodot f^{(3)}(\boldsymbol{\Gamma})\right)\cdot\text{vec}(\boldsymbol{\Phi})=\left(\bigodot_{j=1}^{p}\boldsymbol{\Gamma}_{j}\right)\cdot\text{vec}(\boldsymbol{\Phi}).

So for the true parameters $(\boldsymbol{\Gamma},\boldsymbol{\Phi})$ and alternative parameters $(\overline{\boldsymbol{\Gamma}},\overline{\boldsymbol{\Phi}})$ that lead to the same distribution of the observed $\boldsymbol{y}$, we have

\text{vec}(\boldsymbol{\Pi})=\left(f^{(1)}(\boldsymbol{\Gamma})\bigodot f^{(2)}(\boldsymbol{\Gamma})\bigodot f^{(3)}(\boldsymbol{\Gamma})\right)\cdot\text{vec}(\boldsymbol{\Phi})
=\left(f^{(1)}(\overline{\boldsymbol{\Gamma}})\bigodot f^{(2)}(\overline{\boldsymbol{\Gamma}})\bigodot f^{(3)}(\overline{\boldsymbol{\Gamma}})\right)\cdot\text{vec}(\overline{\boldsymbol{\Phi}}). \qquad (26)

Recall that under the assumptions in the theorem and the current notation, for each $j\in S^{(1)}\cup S^{(2)}\cup S^{(3)}$ the matrix $\boldsymbol{\Lambda}_{j}$ has full column rank $K$. By the property of the Kronecker product, the matrices $\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j_{g}^{(1)}}$ and $\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j_{g}^{(2)}}$ each have full column rank $K^{G}$. Further, since $\boldsymbol{\Lambda}_{j_{g}^{(3)}}$ has full column rank $K$, the Khatri-Rao product $\boldsymbol{\Lambda}_{j_{g}^{(3)}}\bigodot\bigl(\bigodot_{j\in S^{(0)}:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\bigr)$ must also have full column rank $K$ (the columns of the second factor are probability vectors and hence nonzero). Therefore the matrix $\bigotimes_{g=1}^{G}\bigl(\boldsymbol{\Lambda}_{j_{g}^{(3)}}\bigodot\bigl(\bigodot_{j\in S^{(0)}:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\bigr)\bigr)$ also has full column rank $K^{G}$. By the definition of the $f^{(m)}(\boldsymbol{\Gamma})$'s, these full-rank assertions mean that $f^{(1)}(\boldsymbol{\Gamma})$, $f^{(2)}(\boldsymbol{\Gamma})$, and $f^{(3)}(\boldsymbol{\Gamma})$ all have full column rank $K^{G}$.

We next invoke a useful result on the uniqueness of three-way tensor decompositions, Kruskal's theorem established in Kruskal, (1977), and then proceed similarly to the arguments in Allman et al., (2009). For a matrix $\mathbf{M}$, its Kruskal rank is defined to be the largest number $r$ such that any $r$ columns of $\mathbf{M}$ are linearly independent. Denote the Kruskal rank of $\mathbf{M}$ by $\text{rank}_{K}(\mathbf{M})$.
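
As a concrete illustration of this definition (a brute-force sketch of ours, feasible only for small matrices since it enumerates all column subsets):

import itertools
import numpy as np

def kruskal_rank(M, tol=1e-10):
    # largest r such that every set of r columns of M is linearly independent
    n_cols = M.shape[1]
    for r in range(n_cols, 0, -1):
        if all(np.linalg.matrix_rank(M[:, list(cols)], tol=tol) == r
               for cols in itertools.combinations(range(n_cols), r)):
            return r
    return 0

A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])
print(kruskal_rank(A))   # 3: a full-column-rank matrix has Kruskal rank equal to its number of columns
B = np.array([[1., 1., 0.],
              [0., 0., 1.]])
print(kruskal_rank(B))   # 1: two identical columns drop the Kruskal rank to 1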

Lemma 4 (Kruskal’s Theorem).

Suppose $\mathbf{M}_{1},\mathbf{M}_{2},\mathbf{M}_{3}$ are three matrices of dimensions $a_{m}\times K$ for $m=1,2,3$, $\mathbf{N}_{1},\mathbf{N}_{2},\mathbf{N}_{3}$ are three matrices each with $K$ columns, and $\bigodot_{m=1}^{3}\mathbf{M}_{m}=\bigodot_{m=1}^{3}\mathbf{N}_{m}$. If $\text{rank}_{K}(\mathbf{M}_{1})+\text{rank}_{K}(\mathbf{M}_{2})+\text{rank}_{K}(\mathbf{M}_{3})\geq 2K+2$, then there exist a permutation matrix $\mathbf{P}$ and three invertible diagonal matrices $\mathbf{D}_{m}$ with $\mathbf{D}_{1}\mathbf{D}_{2}\mathbf{D}_{3}=\mathbf{I}_{K}$ such that $\mathbf{N}_{m}=\mathbf{M}_{m}\mathbf{D}_{m}\mathbf{P}$ for each $m=1,2,3$.

If a matrix has full column rank $K$, then it also has Kruskal rank $K$ by definition. As a corollary of Lemma 4, if the three matrices $\mathbf{M}_{1},\mathbf{M}_{2},\mathbf{M}_{3}$ all have full column rank $K$, then the condition $\text{rank}_{K}(\mathbf{M}_{1})+\text{rank}_{K}(\mathbf{M}_{2})+\text{rank}_{K}(\mathbf{M}_{3})=3K\geq 2K+2$ is satisfied and the uniqueness conclusion follows. We now take $\mathbf{M}_{m}=f^{(m)}(\boldsymbol{\Gamma})$ and $\mathbf{N}_{m}=f^{(m)}(\overline{\boldsymbol{\Gamma}})$ for $m=1,2$, and define

\mathbf{M}_{3}=f^{(3)}(\boldsymbol{\Gamma})\cdot\text{diag}(\text{vec}(\boldsymbol{\Phi})),\qquad\mathbf{N}_{3}=f^{(3)}(\overline{\boldsymbol{\Gamma}})\cdot\text{diag}(\text{vec}(\overline{\boldsymbol{\Phi}})).

Then $\text{vec}(\boldsymbol{\Pi})=\bigodot_{m=1}^{3}\mathbf{M}_{m}=\bigodot_{m=1}^{3}\mathbf{N}_{m}$. According to the argument right after (26), $\text{rank}_{K}(\mathbf{M}_{m})=\text{rank}_{K}(f^{(m)}(\boldsymbol{\Gamma}))=K^{G}$ for $m=1,2$. As for $\mathbf{M}_{3}$, since $f^{(3)}(\boldsymbol{\Gamma})$ has full column rank $K^{G}$ and the entries of $\boldsymbol{\Phi}$ are positive, $\mathbf{M}_{3}$ also has full column rank $K^{G}$. Therefore, we can invoke Lemma 4 (with $K$ there replaced by $K^{G}$) to establish that there exist a permutation matrix $\mathbf{P}$ and three invertible diagonal matrices $\mathbf{D}_{m}$ with $\mathbf{D}_{1}\mathbf{D}_{2}\mathbf{D}_{3}=\mathbf{I}_{K^{G}}$ such that

\mathbf{N}_{m}=\mathbf{M}_{m}\mathbf{D}_{m}\mathbf{P}

for $m=1,2,3$; in particular, $f^{(m)}(\overline{\boldsymbol{\Gamma}})=f^{(m)}(\boldsymbol{\Gamma})\mathbf{D}_{m}\mathbf{P}$ for $m=1,2$.

The next step is to show that the diagonal matrices $\mathbf{D}_{m}$ are all identity matrices. Note that each column of the $\prod_{j\in S^{(1)}}d_{j}\times K^{G}$ matrix $f^{(1)}(\boldsymbol{\Gamma})=\bigodot_{g=1}^{G}\boldsymbol{\Gamma}_{j_{g}^{(1)}}=\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j_{g}^{(1)}}$ characterizes the conditional joint distribution of $\{y_{j}:\,j\in S^{(1)}\}$ given the latent assignment vector $\boldsymbol{z}\in[K]^{G}$ under the true $\boldsymbol{\Lambda}$-parameters. Similarly, each column of $f^{(1)}(\overline{\boldsymbol{\Gamma}})=\bigodot_{g=1}^{G}\overline{\boldsymbol{\Gamma}}_{j_{g}^{(1)}}=\bigotimes_{g=1}^{G}\overline{\boldsymbol{\Lambda}}_{j_{g}^{(1)}}$ characterizes the conditional joint distribution of $\{y_{j}:\,j\in S^{(1)}\}$ given $\boldsymbol{z}\in[K]^{G}$ under the alternative $\overline{\boldsymbol{\Lambda}}$. Therefore each column of $f^{(1)}(\boldsymbol{\Gamma})$ and each column of $f^{(1)}(\overline{\boldsymbol{\Gamma}})$ sums to one, which implies that the diagonal matrix $\mathbf{D}_{m}$ is the identity for $m=1$; the same argument applied to $S^{(2)}$ gives the conclusion for $m=2$. Since Lemma 4 ensures $\mathbf{D}_{1}\mathbf{D}_{2}\mathbf{D}_{3}=\mathbf{I}_{K^{G}}$, we also obtain $\mathbf{D}_{3}=\mathbf{I}_{K^{G}}$. So far we have obtained $f^{(m)}(\overline{\boldsymbol{\Gamma}})=f^{(m)}(\boldsymbol{\Gamma})\mathbf{P}$ for $m=1,2$ and

f^{(3)}(\overline{\boldsymbol{\Gamma}})\cdot\text{diag}(\text{vec}(\overline{\boldsymbol{\Phi}}))=f^{(3)}(\boldsymbol{\Gamma})\cdot\text{diag}(\text{vec}(\boldsymbol{\Phi}))\,\mathbf{P}. \qquad (27)

Note that the permutation matrix $\mathbf{P}$ has rows and columns both indexed by latent assignment vectors $\boldsymbol{z}\in[K]^{G}$. For $m=3$, consider an arbitrary $\boldsymbol{z}_{1}\in[K]^{G}$ and let $\boldsymbol{z}_{2}\in[K]^{G}$ be the configuration for which the $(\boldsymbol{z}_{2},\boldsymbol{z}_{1})$th entry of $\mathbf{P}$ equals one, $\mathbf{P}_{\boldsymbol{z}_{2},\boldsymbol{z}_{1}}=1$. Then the $\boldsymbol{z}_{1}$th column of the matrix equality (27) takes the form

f^{(3)}(\overline{\boldsymbol{\Gamma}})_{\boldsymbol{c},\boldsymbol{z}_{1}}\cdot\text{vec}(\overline{\boldsymbol{\Phi}})_{\boldsymbol{z}_{1}}=f^{(3)}(\boldsymbol{\Gamma})_{\boldsymbol{c},\boldsymbol{z}_{2}}\cdot\text{vec}(\boldsymbol{\Phi})_{\boldsymbol{z}_{2}}

for each $\boldsymbol{c}\in\times_{j\in S^{(3)}\cup S^{(0)}}[d_{j}]$; summing this equality over all such $\boldsymbol{c}$ gives $\text{vec}(\overline{\boldsymbol{\Phi}})_{\boldsymbol{z}_{1}}=\text{vec}(\boldsymbol{\Phi})_{\boldsymbol{z}_{2}}$, because each column of $f^{(3)}(\boldsymbol{\Gamma})$ and of $f^{(3)}(\overline{\boldsymbol{\Gamma}})$ sums to one. We have thus established $\text{vec}(\overline{\boldsymbol{\Phi}})_{\boldsymbol{z}_{1}}=\text{vec}(\boldsymbol{\Phi})_{\boldsymbol{z}_{2}}$ whenever $\mathbf{P}_{\boldsymbol{z}_{2},\boldsymbol{z}_{1}}=1$, which implies $\text{vec}(\overline{\boldsymbol{\Phi}})^{\top}=\text{vec}(\boldsymbol{\Phi})^{\top}\mathbf{P}$. This shows the identifiability of the tensor core in our hybrid tensor decomposition formulation of the Gro-M$^{3}$. Further, $\text{vec}(\overline{\boldsymbol{\Phi}})^{\top}=\text{vec}(\boldsymbol{\Phi})^{\top}\mathbf{P}$ implies that

\text{diag}(\text{vec}(\overline{\boldsymbol{\Phi}}))=\text{diag}(\text{vec}(\boldsymbol{\Phi})^{\top}\mathbf{P})=\mathbf{P}^{\top}\,\text{diag}(\text{vec}(\boldsymbol{\Phi}))\,\mathbf{P}.

Plugging the above display into (27), and using that $\mathbf{P}$ is a permutation matrix (so $\mathbf{P}^{-1}=\mathbf{P}^{\top}$) and that $\text{diag}(\text{vec}(\boldsymbol{\Phi}))$ is a diagonal matrix with positive diagonal entries (hence invertible), we can cancel these factors on both sides of (27); this gives $f^{(3)}(\overline{\boldsymbol{\Gamma}})=f^{(3)}(\boldsymbol{\Gamma})\mathbf{P}$.
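
As a quick sanity check of the diagonal-permutation identity used above, the following tiny sketch (ours) verifies $\text{diag}(\boldsymbol{v}^{\top}\mathbf{P})=\mathbf{P}^{\top}\text{diag}(\boldsymbol{v})\mathbf{P}$ numerically for a random permutation matrix.

import numpy as np

rng = np.random.default_rng(2)
n = 4
v = rng.random(n)                                   # plays the role of vec(Phi)
P = np.eye(n)[:, rng.permutation(n)]                # a random n x n permutation matrix
print(np.allclose(np.diag(v @ P), P.T @ np.diag(v) @ P))   # True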

Our final step in proving the theorem is to show that the established relations $f^{(m)}(\overline{\boldsymbol{\Gamma}})=f^{(m)}(\boldsymbol{\Gamma})\mathbf{P}$ for $m=1,2,3$ imply the identifiability of $\boldsymbol{\Lambda}$ and $\mathbf{L}$. First, since $f^{(m)}(\boldsymbol{\Gamma})$ is defined as a Khatri-Rao product of the individual $\boldsymbol{\Gamma}_{j}$'s, we claim that $f^{(m)}(\overline{\boldsymbol{\Gamma}})=f^{(m)}(\boldsymbol{\Gamma})\mathbf{P}$ indeed implies $\overline{\boldsymbol{\Gamma}}_{j}=\boldsymbol{\Gamma}_{j}\mathbf{P}$ for each $j\in[p]$. To see this, note that each column of $f^{(m)}(\overline{\boldsymbol{\Gamma}})$ and of $f^{(m)}(\boldsymbol{\Gamma})\mathbf{P}$ characterizes the conditional joint distribution of the corresponding subset of variables ($\{y_{j}:\,j\in S^{(m)}\}$ for $m=1,2$, and $\{y_{j}:\,j\in S^{(3)}\cup S^{(0)}\}$ for $m=3$) given $\boldsymbol{z}$. So the conditional marginal distribution $\boldsymbol{\Gamma}_{j}$ can be obtained by summing appropriate rows of $f^{(m)}(\overline{\boldsymbol{\Gamma}})$ and of $f^{(m)}(\boldsymbol{\Gamma})\mathbf{P}$, corresponding to marginalizing out all variables other than the $j$th one. Now without loss of generality we can assume $\mathbf{P}=\mathbf{I}_{K^{G}}$, so that $\overline{\boldsymbol{\Gamma}}_{j}=\boldsymbol{\Gamma}_{j}\mathbf{P}$ gives $\overline{\gamma}_{j,c_{j},\boldsymbol{z}}=\gamma_{j,c_{j},\boldsymbol{z}}$ for all $\boldsymbol{z}\in[K]^{G}$. It only remains to show that $\boldsymbol{\Gamma}$ uniquely determines $\boldsymbol{\Lambda}$ and $\mathbf{L}$. By definition (18), $\gamma_{j,c_{j},\boldsymbol{z}}=\lambda_{j,c_{j},z_{s_{j}}}$. For the group memberships $s_{j}$ and $\overline{s}_{j}$ under the two sets of parameters, first consider an arbitrary latent assignment $\boldsymbol{z}\in[K]^{G}$ such that $z_{s_{j}}=z_{\overline{s}_{j}}=k$; then

\lambda_{j,c_{j},k}=\lambda_{j,c_{j},z_{s_{j}}}=\gamma_{j,c_{j},\boldsymbol{z}}=\overline{\gamma}_{j,c_{j},\boldsymbol{z}}=\overline{\lambda}_{j,c_{j},z_{\overline{s}_{j}}}=\overline{\lambda}_{j,c_{j},k}.

The above reasoning proves the identifiability of $\boldsymbol{\Lambda}$. Thus far we have proved part (a) of Theorem 1.

We next prove part (b) of the theorem, using proof by contradiction to show the identifiability of the grouping matrix $\mathbf{L}$ (or equivalently, of the vector $\boldsymbol{s}$). Suppose there exists some $j\in[p]$ such that the $j$th rows of $\mathbf{L}$ and $\overline{\mathbf{L}}$ differ, so that $s_{j}\neq\overline{s}_{j}$; denote $s_{j}=:g$ and $\overline{s}_{j}=:g^{\prime}$. For two arbitrary distinct indices $k,k^{\prime}\in[K]$ with $k\neq k^{\prime}$, consider a latent assignment $\boldsymbol{z}\in[K]^{G}$ such that $z_{g}=k$ and $z_{g^{\prime}}=k^{\prime}$. Then

\lambda_{j,c_{j},k}=\lambda_{j,c_{j},z_{s_{j}}}=\gamma_{j,c_{j},\boldsymbol{z}}=\overline{\gamma}_{j,c_{j},\boldsymbol{z}}=\overline{\lambda}_{j,c_{j},z_{\overline{s}_{j}}}=\overline{\lambda}_{j,c_{j},k^{\prime}}\quad\text{ for all }c_{j}\in[d_{j}].

Since $k\neq k^{\prime}$, and since part (a) already gives $\overline{\boldsymbol{\Lambda}}_{j}=\boldsymbol{\Lambda}_{j}$, the above equality means the $k$th and $k^{\prime}$th columns of $\boldsymbol{\Lambda}_{j}$ are identical. Since $k$ and $k^{\prime}$ are arbitrary, all the column vectors of $\boldsymbol{\Lambda}_{j}$ are identical, which contradicts the assumption in part (b) of the theorem. Therefore $s_{j}=\overline{s}_{j}$ must hold for every $j\in[p]$. This proves the identifiability of $\mathbf{L}$ from $\boldsymbol{\Gamma}$ and completes the proof of Theorem 1. ∎

7.2 Proof of Theorem 2

The proof of this theorem is similar in spirit to that of Theorem 1 in exploiting the inherent tensor decomposition structure, but it differs in taking advantage of the finer dimension-grouping structure available under the present assumptions. Recall $\mathcal{A}_{g}=\{j\in[p]:\,\ell_{j,g}=1\}=\cup_{m=1}^{3}\mathcal{A}_{g,m}$. We write $\text{vec}(\boldsymbol{\Pi})$ under the true parameters as

\text{vec}(\boldsymbol{\Pi})=\left\{\bigotimes_{g=1}^{G}\bigodot_{j\in\mathcal{A}_{g}}\boldsymbol{\Lambda}_{j}\right\}\cdot\text{vec}(\boldsymbol{\Phi})
=\left\{\bigotimes_{g=1}^{G}\left(\bigodot_{m=1}^{3}\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}\right)\right\}\cdot\text{vec}(\boldsymbol{\Phi})
=\left[\left\{\bigotimes_{g=1}^{G}\left(\bigodot_{j\in\mathcal{A}_{g,1}}\boldsymbol{\Lambda}_{j}\right)\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\left(\bigodot_{j\in\mathcal{A}_{g,2}}\boldsymbol{\Lambda}_{j}\right)\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\left(\bigodot_{j\in\mathcal{A}_{g,3}}\boldsymbol{\Lambda}_{j}\right)\right\}\right]\cdot\text{vec}(\boldsymbol{\Phi}).

Since each $\mathcal{A}_{g,m}$ is nonempty under the assumption in the theorem, the set $\cup_{g=1}^{G}\mathcal{A}_{g,m}$ contains at least one variable from every group, so we can apply Lemma 1 to obtain

\bigodot_{j\in\cup_{g=1}^{G}\mathcal{A}_{g,m}}\boldsymbol{\Gamma}_{j}=\bigotimes_{g=1}^{G}\left(\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}\right)=\bigotimes_{g=1}^{G}\widetilde{\boldsymbol{\Lambda}}_{g,m},

where the second equality above follows from the definition of $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ in the theorem. We now define

f^{(m)}(\boldsymbol{\Gamma}):=\bigodot_{j\in\cup_{g=1}^{G}\mathcal{A}_{g,m}}\boldsymbol{\Gamma}_{j},\qquad m=1,2,3.

Since the theorem assumes that each $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full column rank $K$, the Kronecker product structure above implies that $f^{(m)}(\boldsymbol{\Gamma})$ has full column rank $K^{G}$. Note that each $f^{(m)}(\cdot)$ is a Khatri-Rao product of certain $\boldsymbol{\Gamma}_{j}$'s and that $f^{(m)}$ depends on the true grouping matrix $\mathbf{L}$. Also note that

\text{vec}(\boldsymbol{\Pi})=\left(f^{(1)}(\boldsymbol{\Gamma})\bigodot f^{(2)}(\boldsymbol{\Gamma})\bigodot f^{(3)}(\boldsymbol{\Gamma})\right)\cdot\text{vec}(\boldsymbol{\Phi})
=\left(f^{(1)}(\overline{\boldsymbol{\Gamma}})\bigodot f^{(2)}(\overline{\boldsymbol{\Gamma}})\bigodot f^{(3)}(\overline{\boldsymbol{\Gamma}})\right)\cdot\text{vec}(\overline{\boldsymbol{\Phi}}).

Now the problem has exactly the same formulation as in the proof of Theorem 1, so we can proceed in the same way to establish the identifiability of $\boldsymbol{\Phi}$ and of the individual $\boldsymbol{\Gamma}_{j}$'s. The identifiability of $\boldsymbol{\Gamma}$ further gives the identifiability of $\boldsymbol{\Lambda}$ and $\mathbf{L}$. This finishes the proof of Theorem 2. ∎

7.3 Proof of Theorem 3

We first prove the following claim:

Claim 1. Under condition (13) in the theorem that $\prod_{j\in\mathcal{A}_{g,m}}d_{j}\geq K$, the following matrix $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full column rank $K$ for generic parameters:

\widetilde{\boldsymbol{\Lambda}}_{g,m}=\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}_{j}.

The $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ above has the same definition as in Theorem 2. The proof of this claim is similar in spirit to that of Lemma 13 in Allman et al., (2009). Note that the statement that the $\prod_{j\in\mathcal{A}_{g,m}}d_{j}\times K$ matrix $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ does not have full column rank is equivalent to the statement that the maps sending $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ to its $K\times K$ minors are all zero maps. There are

\binom{\prod_{j\in\mathcal{A}_{g,m}}d_{j}}{K}

such maps, and each of these maps is a polynomial in the indeterminates $\lambda_{j,c_{j},k}$. To show that $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full column rank $K$ for generic parameters, we just need to show that these polynomials are not all identically zero. By the properties of polynomial maps, it suffices to find one particular set of $\{\boldsymbol{\Lambda}_{j};\,j\in\mathcal{A}_{g,m}\}$ such that the resulting Khatri-Rao product $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full column rank.

Consider a set of distinct prime numbers $\{a_{j,c};\,j=1,\ldots,p,\;c=1,\ldots,d_{j}\}$. Define

\boldsymbol{\Lambda}_{j}^{\star}=\begin{pmatrix}1&a_{j,1}&a^{2}_{j,1}&\cdots&a^{K-1}_{j,1}\\ 1&a_{j,2}&a^{2}_{j,2}&\cdots&a^{K-1}_{j,2}\\ \vdots&\vdots&\vdots&&\vdots\\ 1&a_{j,d_{j}}&a^{2}_{j,d_{j}}&\cdots&a^{K-1}_{j,d_{j}}\end{pmatrix}, \qquad (28)

so that $\boldsymbol{\Lambda}_{j}^{\star}$ is a $d_{j}\times K$ Vandermonde-type matrix. Generally, for a $d$-dimensional vector $\boldsymbol{b}$, let $\text{VDM}(\boldsymbol{b})=\text{VDM}(b_{1},\ldots,b_{d})$ denote the $d\times d$ Vandermonde matrix with $(i,c)$th entry $b_{i}^{c-1}$; the matrix $\boldsymbol{\Lambda}_{j}^{\star}$ in (28) has the same structure, with $(c,k)$th entry $a_{j,c}^{k-1}$. Now consider $\widetilde{\boldsymbol{\Lambda}}^{\star}_{g,m}$ defined as

\widetilde{\boldsymbol{\Lambda}}^{\star}_{g,m}=\bigodot_{j\in\mathcal{A}_{g,m}}\boldsymbol{\Lambda}^{\star}_{j}.

Under assumption (13) in the theorem that $\prod_{j\in\mathcal{A}_{g,m}}d_{j}\geq K$, the $K$ columns of $\widetilde{\boldsymbol{\Lambda}}^{\star}_{g,m}$ are exactly the first $K$ columns of the following Vandermonde matrix:

\mathbf{V}_{g,m}=\text{VDM}\left(\Bigl(\prod_{j\in\mathcal{A}_{g,m}}a_{j,c_{j}}\Bigr)_{\boldsymbol{c}\in\times_{j\in\mathcal{A}_{g,m}}[d_{j}]}\right),

whose nodes are the $\prod_{j\in\mathcal{A}_{g,m}}d_{j}$ products $\prod_{j\in\mathcal{A}_{g,m}}a_{j,c_{j}}$ ranging over all category combinations $\boldsymbol{c}\in\times_{j\in\mathcal{A}_{g,m}}[d_{j}]$. Since by construction the $a_{j,c}$'s are distinct primes, by unique prime factorization these products are distinct numbers. Therefore the $\mathbf{V}_{g,m}$ defined above has full rank $\prod_{j\in\mathcal{A}_{g,m}}d_{j}$. Since $\prod_{j\in\mathcal{A}_{g,m}}d_{j}\geq K$ and the columns of $\widetilde{\boldsymbol{\Lambda}}^{\star}_{g,m}$ are the first $K$ columns of $\mathbf{V}_{g,m}$, the matrix $\widetilde{\boldsymbol{\Lambda}}^{\star}_{g,m}$ has full column rank $K$ for this particular choice of parameters. Note that the $\boldsymbol{\Lambda}_{j}^{\star}$ defined in (28) does not have each column summing to one, as required in the parameterization of the probability tensor; but rescaling each column of $\boldsymbol{\Lambda}_{j}^{\star}$ by a positive constant to obtain a conditional probability table $\boldsymbol{\Lambda}_{j}$ does not change the above reasoning and conclusion about matrix rank. So we have proved Claim 1 that each $\widetilde{\boldsymbol{\Lambda}}_{g,m}$ has full column rank $K$ for generic parameters. Given this claim, for generic parameters in the parameter space the situation reduces to that of Theorem 2, so the identifiability conclusion there carries over and identifiability holds for generic parameters. This completes the proof of Theorem 3. ∎
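
To illustrate the particular choice of parameters used in the proof of Claim 1, the following sketch (ours; the primes and sizes are arbitrary illustrative choices) builds Vandermonde-type matrices $\boldsymbol{\Lambda}_{j}^{\star}$ as in (28) for two variables in one set $\mathcal{A}_{g,m}$ and confirms numerically that their Khatri-Rao product has full column rank $K$.

import functools
import numpy as np

def khatri_rao(A, B):
    # column-wise Khatri-Rao product
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

K = 4
primes = [[2.0, 3.0], [5.0, 7.0, 11.0]]     # distinct primes a_{j,c}; here d = (2, 3), so prod(d) = 6 >= K
Lam_star = [np.vander(p, N=K, increasing=True) for p in primes]   # d_j x K matrices as in (28)
Lam_tilde_star = functools.reduce(khatri_rao, Lam_star)
print(Lam_tilde_star.shape, np.linalg.matrix_rank(Lam_tilde_star))   # (6, 4) 4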

7.4 Proof of Proposition 1

Recall that under the conditions of the previous theorem, we already have the conclusion that $\boldsymbol{\Lambda}$ and $\boldsymbol{\Phi}$ are identifiable. The question then boils down to whether $(\alpha_{1},\ldots,\alpha_{K})$ is identifiable from $\boldsymbol{\Phi}=(\phi_{k_{1},\ldots,k_{G}})$. By definition, we have

\phi_{k_{1},\ldots,k_{G}}=\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k_{1}}\cdots\pi_{k_{G}}\right]=\int_{\Delta^{K-1}}\pi_{k_{1}}\cdots\pi_{k_{G}}\,d\text{Dir}_{\boldsymbol{\alpha}}(\boldsymbol{\pi}).

First we consider the case $G=2$. Denote $\alpha_{0}=\sum_{k=1}^{K}\alpha_{k}$. According to the moment properties of the Dirichlet distribution,

\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{\ell}\right]=\begin{cases}\dfrac{\alpha_{k}\alpha_{\ell}}{\alpha_{0}(\alpha_{0}+1)},&\text{if }k\neq\ell;\\[1.5ex]\dfrac{\alpha_{k}(\alpha_{k}+1)}{\alpha_{0}(\alpha_{0}+1)},&\text{if }k=\ell.\end{cases}

Therefore, for $k\neq\ell$, consider $x$ and $y$ defined as follows:

x:=\dfrac{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}^{2}\right]}{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{\ell}\right]}=\dfrac{\alpha_{k}+1}{\alpha_{\ell}},\qquad
y:=\dfrac{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{\ell}^{2}\right]}{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{\ell}\right]}=\dfrac{\alpha_{\ell}+1}{\alpha_{k}}.

Since $x$ and $y$ are already identified, we can solve for $\alpha_{k}$ and $\alpha_{\ell}$ as follows:

\alpha_{k}=\frac{x+1}{xy-1},\qquad\alpha_{\ell}=\frac{y+1}{xy-1}.

Since the above reasoning holds for arbitrary pairs $(k,\ell)$ with $k\neq\ell$, we obtain the identifiability of the entire vector $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$ when $G=2$.
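
A quick numerical check of these inversion formulas (ours; the $\boldsymbol{\alpha}$ below is a hypothetical example) computes the exact second moments of a Dirichlet vector and recovers $\alpha_{k}$ and $\alpha_{\ell}$:

import numpy as np

alpha = np.array([0.6, 1.5, 2.3])            # hypothetical true Dirichlet parameters
a0 = alpha.sum()
k, l = 0, 1

# exact second moments of pi ~ Dirichlet(alpha)
E_kk = alpha[k] * (alpha[k] + 1) / (a0 * (a0 + 1))
E_ll = alpha[l] * (alpha[l] + 1) / (a0 * (a0 + 1))
E_kl = alpha[k] * alpha[l] / (a0 * (a0 + 1))

x, y = E_kk / E_kl, E_ll / E_kl              # the identified ratios x and y above
alpha_k = (x + 1) / (x * y - 1)
alpha_l = (y + 1) / (x * y - 1)
print(alpha_k, alpha_l)                      # recovers 0.6 and 1.5 up to floating-point error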

Next consider the general case $G>2$. For arbitrary $1\leq k\neq\ell\leq K$, consider the index sequences $(k,k,k_{3},\ldots,k_{G})$, $(\ell,\ell,k_{3},\ldots,k_{G})$, and $(k,\ell,k_{3},\ldots,k_{G})\in[K]^{G}$. According to the moment properties of the Dirichlet distribution, we have

\frac{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{k}\pi_{k_{3}}\cdots\pi_{k_{G}}\right]}{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{\ell}\pi_{k_{3}}\cdots\pi_{k_{G}}\right]}=\frac{\alpha_{k}+1}{\alpha_{\ell}}, \qquad (29)
\frac{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{\ell}\pi_{\ell}\pi_{k_{3}}\cdots\pi_{k_{G}}\right]}{\mathbb{E}_{\boldsymbol{\pi}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\pi_{k}\pi_{\ell}\pi_{k_{3}}\cdots\pi_{k_{G}}\right]}=\frac{\alpha_{\ell}+1}{\alpha_{k}}. \qquad (30)

The left-hand sides of the above two equations are identified by the previous theorem; denote them by $u:=\text{LHS of (29)}$ and $v:=\text{LHS of (30)}$, so $u$ and $v$ are identified constants. Solving for $\alpha_{k}$ and $\alpha_{\ell}$ gives

\alpha_{k}=\frac{u+1}{uv-1},\qquad\alpha_{\ell}=\frac{v+1}{uv-1}.

Since $k$ and $\ell$ are arbitrary, we have shown that the entire vector $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{K})$ is identifiable. This completes the proof of Proposition 1. ∎

7.5 Proof of Supporting Lemmas

Proof of Lemma 1.

First note that the $\bigodot_{j\in S}\boldsymbol{\Gamma}_{j}$ on the right-hand side of (19) has size $\prod_{j\in S}d_{j}\times K^{G}$. Further, since $\{s_{j}:\,j\in S\}\supseteq[G]$, the set $\{j\in S:\,s_{j}=g\}$ is nonempty for each $g$. So $\bigodot_{j\in S:\,s_{j}=g}\boldsymbol{\Lambda}_{j}$ has $K$ columns and hence the left-hand side of (19) also has size $\prod_{j\in S}d_{j}\times K^{G}$. Without loss of generality, suppose $S=\{1,2,\ldots,|S|\}$, where $|S|$ denotes the cardinality of $S$. The $(c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{|S|}-1)d_{1}\cdots d_{|S|-1},\;z_{1}+(z_{2}-1)K+\cdots+(z_{G}-1)K^{G-1})$th entry of the right-hand side of (19) is $\prod_{j\in S}\gamma_{j,c_{j},\boldsymbol{z}}$, which by definition equals $\prod_{j\in S}\lambda_{j,c_{j},z_{s_{j}}}=\prod_{g=1}^{G}\prod_{j\in S:\,s_{j}=g}\lambda_{j,c_{j},z_{g}}$; this is exactly the corresponding entry of the left-hand side of (19). This completes the proof of Lemma 1. ∎

Proof of Lemma 2.

First note that both sides of (22) are vectors of size $\prod_{j=1}^{p}d_{j}\times 1$. To see this for the right-hand side of (22), note that the matrix $\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}$ has size $\prod_{j:\,\ell_{j,g}=1}d_{j}\times K$, and hence the matrix $\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}$ has size

\prod_{g=1}^{G}\prod_{j:\,\ell_{j,g}=1}d_{j}\;\times\;K^{G},

which is just $\prod_{j=1}^{p}d_{j}\times K^{G}$. Further, the vector $\text{vec}(\boldsymbol{\Phi})$ has size $K^{G}\times 1$, so the product $\{\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\}\,\text{vec}(\boldsymbol{\Phi})$ on the right-hand side of (22) has size $\prod_{j=1}^{p}d_{j}\times 1$, matching the size of the left-hand side. Next consider the individual entries on both sides of (22). First, by definition of the $\text{vec}(\cdot)$ operator, the $[c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{p}-1)d_{1}\cdots d_{p-1}]$th entry of the left-hand side of (22) is $\pi_{c_{1},\ldots,c_{p}}$. Next, according to (20), $\pi_{c_{1},\ldots,c_{p}}$ can be written as

\pi_{c_{1},\ldots,c_{p}}=\sum_{z_{1}=1}^{K}\cdots\sum_{z_{G}=1}^{K}\phi_{z_{1},\ldots,z_{G}}\prod_{j=1}^{p}\gamma_{j,c_{j},\boldsymbol{z}}
=\sum_{z_{1}=1}^{K}\cdots\sum_{z_{G}=1}^{K}\text{vec}(\boldsymbol{\Phi})_{z_{1}+(z_{2}-1)K+\cdots+(z_{G}-1)K^{G-1}}\times\prod_{g=1}^{G}\prod_{j:\,\ell_{j,g}=1}\lambda_{j,c_{j},z_{g}}
=\sum_{z_{1}=1}^{K}\cdots\sum_{z_{G}=1}^{K}\text{vec}(\boldsymbol{\Phi})_{z_{1}+(z_{2}-1)K+\cdots+(z_{G}-1)K^{G-1}}\times\left\{\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\right\}_{c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{p}-1)d_{1}\cdots d_{p-1},\;z_{1}+(z_{2}-1)K+\cdots+(z_{G}-1)K^{G-1}}
=\left\{\bigotimes_{g=1}^{G}\bigodot_{j:\,\ell_{j,g}=1}\boldsymbol{\Lambda}_{j}\right\}_{c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{p}-1)d_{1}\cdots d_{p-1},\,:}\cdot\text{vec}(\boldsymbol{\Phi}).

The last line in the above display is exactly the $[c_{1}+(c_{2}-1)d_{1}+\cdots+(c_{p}-1)d_{1}\cdots d_{p-1}]$th entry of the right-hand side of (22). This proves the equality in (22) and completes the proof of Lemma 2. ∎

Proof of Lemma 3.

In the left-hand side of (23), the term $\boldsymbol{\Lambda}_{j^{(1)}_{g}}\bigodot\boldsymbol{\Lambda}_{j^{(2)}_{g}}$ has size $d_{j_{g}^{(1)}}d_{j_{g}^{(2)}}\times K$, and hence the Kronecker product $\bigotimes_{g=1}^{G}\{\boldsymbol{\Lambda}_{j^{(1)}_{g}}\bigodot\boldsymbol{\Lambda}_{j^{(2)}_{g}}\}$ has size $\prod_{j\in S^{(1)}\cup S^{(2)}}d_{j}\times K^{G}$. In the right-hand side of (23), the term $\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(1)}_{g}}$ has size $\prod_{g=1}^{G}d_{j_{g}^{(1)}}\times K^{G}$, and hence the Khatri-Rao product of two such terms

\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(1)}_{g}}\right\}\bigodot\left\{\bigotimes_{g=1}^{G}\boldsymbol{\Lambda}_{j^{(2)}_{g}}\right\}

has size $\bigl(\prod_{g=1}^{G}d_{j_{g}^{(1)}}d_{j_{g}^{(2)}}\bigr)\times K^{G}$. So both sides of (23) have size $\prod_{j\in S^{(1)}\cup S^{(2)}}d_{j}\times K^{G}$. The equality (23) can then be shown as in the proof of Lemma 2, by writing out and checking the individual entries of the matrices on the two sides of (23).

Similarly, the left- and right-hand sides of (24) both have size $\prod_{j\in S^{(1)}\cup S^{(2)}\cup\{j_{G}^{(3)}\}}d_{j}\times K^{G}$, and the equality can be shown as in the proof of Lemma 2. ∎

8 Supplement B: Pairwise Cramer’s V between Categorical Variables

For two discrete variables $y_{j}\in[d_{j}]$ and $y_{m}\in[d_{m}]$, their (population) Cramer's V is

\text{CRV}(y_{j},y_{m})=\left\{\frac{1}{\min(d_{j},\,d_{m})}\sum_{c_{1}\in[d_{j}]}\sum_{c_{2}\in[d_{m}]}\frac{\bigl(p_{(y_{j},y_{m})}(c_{1},c_{2})-p_{y_{j}}(c_{1})\,p_{y_{m}}(c_{2})\bigr)^{2}}{p_{y_{j}}(c_{1})\,p_{y_{m}}(c_{2})}\right\}^{1/2}, \qquad (31)

where $p_{y_{j}}(c_{1})=\mathbb{P}(y_{j}=c_{1})$ denotes the marginal distribution of $y_{j}$ and $p_{(y_{j},y_{m})}(c_{1},c_{2})=\mathbb{P}(y_{j}=c_{1},y_{m}=c_{2})$ denotes the joint distribution of $y_{j}$ and $y_{m}$. Cramer's V quantifies the dependence expressed in the joint distribution of the two variables relative to the product of their marginal distributions under independence. It equals zero if and only if the two variables are independent, and is positive otherwise.

The expression of Cramer's V in (31) is the population version. Given a sample $\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{n}$ with $\boldsymbol{y}_{i}=(y_{i,1},\ldots,y_{i,p})$, the population quantities of the marginal and joint distributions in (31) can be replaced by their sample estimates. That is, $p_{y_{j}}(c_{1})$ and $p_{(y_{j},y_{m})}(c_{1},c_{2})$ above are replaced by

p^{\text{samp}}_{y_{j}}(c_{1})=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(y_{i,j}=c_{1}),\qquad p^{\text{samp}}_{(y_{j},y_{m})}(c_{1},c_{2})=\frac{1}{n}\sum_{i=1}^{n}\mathbb{I}(y_{i,j}=c_{1},\,y_{i,m}=c_{2}).
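
For concreteness, here is a minimal sketch (ours, not the paper's implementation) of the sample-based Cramer's V in (31) for a single pair of variables, with categories coded as 0, 1, ..., d-1:

import numpy as np

def cramers_v(yj, ym):
    # sample Cramer's V following (31), with sample proportions in place of population probabilities
    yj, ym = np.asarray(yj), np.asarray(ym)
    n = len(yj)
    dj, dm = yj.max() + 1, ym.max() + 1
    joint = np.zeros((dj, dm))
    np.add.at(joint, (yj, ym), 1.0)           # joint sample proportions
    joint /= n
    pj, pm = joint.sum(axis=1), joint.sum(axis=0)
    expected = np.outer(pj, pm)               # product of the marginal proportions
    chi2_term = ((joint - expected) ** 2 / expected).sum()
    return np.sqrt(chi2_term / min(dj, dm))

# two independent simulated variables: the value should be close to zero
rng = np.random.default_rng(0)
print(cramers_v(rng.integers(0, 3, size=5000), rng.integers(0, 4, size=5000)))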

Using the sample-based Cramer's V measure, we calculate Cramer's V for all pairs of variables, with $j$ and $m$ each ranging from 1 to $p$. For two randomly chosen simulated datasets from the simulation settings $p=30$, $G=6$, $K=3$, $n=1000$ and $p=90$, $G=15$, $K=3$, $n=1000$ described in Section 5 of the main text, the pairwise Cramer's V values are displayed in Figure 12.

Figure 12: Cramer's V of item pairs for two simulated datasets: (a) $p=30$, $G=6$, $K=3$, $n=1000$; (b) $p=90$, $G=15$, $K=3$, $n=1000$.

By visual inspection, Figure 12 shows a block-diagonal structure of the $p\times p$ pairwise Cramer's V matrix in both simulation settings. In each setting, the true grouping matrix $\mathbf{L}$ used to generate the data is such that the first $p/G$ variables belong to one group, the next $p/G$ variables belong to another group, and so on. Therefore, Figure 12 implies that in the simulations, variables belonging to the same group tend to exhibit stronger dependence than variables belonging to different groups.