
Covariate Adjusted Functional Mixed Membership Models

Nicholas Marco
Department of Statistical Science, Duke University, USA.
and
Damla Şentürk
Department of Biostatistics, University of California, Los Angeles, USA.
and
Shafali Jeste
Division of Neurology and Neurological Institute,
Children’s Hospital Los Angeles, Los Angeles, USA.
and
Charlotte DiStefano
Division of Psychiatry, Children’s Hospital Los Angeles, Los Angeles, USA.
and
Abigail Dickinson
Department of Psychiatry and Biobehavioral Sciences,
University of California, Los Angeles, USA.
and
Donatello Telesca   
Department of Biostatistics, University of California, Los Angeles, USA.
Corresponding author. [email protected]
Abstract

Mixed membership models are a flexible class of probabilistic data representations used for unsupervised and semi-supervised learning, allowing each observation to partially belong to multiple clusters or features. In this manuscript, we extend the framework of functional mixed membership models to allow for covariate-dependent adjustments. The proposed model utilizes a multivariate Karhunen-Loève decomposition, which allows for a scalable and flexible model. Within this framework, we establish a set of sufficient conditions ensuring the identifiability of the mean, covariance, and allocation structure up to a permutation of the labels. This manuscript is primarily motivated by studies on functional brain imaging through electroencephalography (EEG) of children with autism spectrum disorder (ASD). Specifically, we are interested in characterizing the heterogeneity of alpha oscillations for typically developing (TD) children and children with ASD. Since alpha oscillations are known to change as children develop, we aim to characterize the heterogeneity of alpha oscillations conditionally on the age of the child. Using the proposed framework, we were able to gain novel information on the developmental trajectories of alpha oscillations for children with ASD and how the developmental trajectories differ between TD children and children with ASD.


Keywords: Bayesian Methods, EEG, Functional Data, Mixed Membership Models, Neuroimaging

1 Introduction

Cluster analysis is an unsupervised exploratory task that aims to group similar observations into “clusters” to explain the heterogeneity found in a particular dataset. When an outcome of interest is indexed by covariate information, it is often informative to adjust clustering procedures to account for covariate-dependent patterns of variation. Covariate-dependent clustering has found applications in post hoc analyses of clinical trial data (Müller et al., 2011), where conditioning on factors such as dose response, tumor grade, and other clinical factors is often of interest when defining patient response subgroups. In addition to clinical trial settings, covariate-dependent clustering has also become popular in fields such as genetics (Qin and Self, 2006), flow cytometry (Hyun et al., 2023), neuroimaging (Guha and Guhaniyogi, 2022; Binkiewicz et al., 2017), and spatial statistics (Page and Quintana, 2016).

In statistics and machine learning, covariate-dependent clustering procedures have gone by a number of related names, including finite mixture of regressions, mixture of experts, and covariate adjusted product partition models, which tend to refer to specific ways in which covariate information is used in data grouping decisions. The term finite mixture of regressions (McLachlan et al., 2019; Faria and Soromenho, 2010; Grün et al., 2007; Khalili and Chen, 2007; Hyun et al., 2023; Devijver, 2015) refers to the fitting of a mixture model where the mean structure depends on the covariates of interest through a regression framework. The mixture of experts model (Jordan and Jacobs, 1994; Bishop and Svensén, 2002) is similar to a finite mixture of regressions model in that it assumes that the likelihood is a weighted combination of probability distribution functions. However, in the mixture of experts model, the weights depend on the covariates of interest, providing an additional layer of flexibility beyond that of traditional finite mixture of regressions models. Although the mixture of experts model was originally designed for supervised learning settings, recent developments have expanded its application to unsupervised settings (Kopf et al., 2021; Tsai et al., 2020). Lastly, covariate adjusted product partition models (Müller et al., 2011; Park and Dunson, 2010) are a Bayesian non-parametric version of covariate adjusted clustering. Analogous to mixture of experts models, covariate adjusted product partition models allow the cluster partitions to depend on the covariates of interest.

In this manuscript, we extend the class of functional mixed membership models proposed by Marco et al. (2024a) to allow latent features to depend on covariate information. Mixed membership models (Erosheva et al., 2004; Heller et al., 2008; Gruhl and Erosheva, 2014; Marco et al., 2024a, b), sometimes referred to as partial membership models, can be thought of as a generalization of traditional clustering, where each observation is allowed to partially belong to multiple clusters or features. Although there have been many advancements in covariate adjusted clustering models for multivariate data, to the best of our knowledge, little work has been done on incorporating covariate information in functional mixed membership models or functional clustering models. Two important exceptions are Yao et al. (2011), who specified a finite mixture of regressions model where the covariates are functional data and the clustered data are scalar, and Gaffney and Smyth (1999), who proposed a functional mixture of regressions model where the function is modeled by a deterministic linear function of the covariates. Here, we consider the case where we have a covariate vector in \mathbb{R}^{p} and the outcome we cluster is functional.

Our work is primarily motivated by functional brain imaging studies on children with autism spectrum disorder (ASD) through electroencephalography (EEG). Specifically, Marco et al. (2024a) analyzed how alpha oscillations differ between children with ASD and typically developing (TD) children. Since alpha oscillations are known to change as children develop, an age-dependent mixed membership model is crucial to ensure that shifts in the power spectrum of alpha oscillations do not confound measurements of alpha power (Haegens et al., 2014). Unlike mixture of experts models and covariate adjusted product partition models, we aim to specify a mixed membership model in which the allocation structure does not depend on the covariates of interest. Although previous studies have shown that the alpha peak shifts to a higher frequency and becomes more attenuated (Scheffler et al., 2019; Rodríguez-Martínez et al., 2017), the results can be confounded if this effect is not observed in all individuals. In our model, since we assume that each individual’s allocation parameters do not change with age, we can infer how alpha oscillations change with age at the individual level, making a covariate adjusted mixed membership model ideal for this case study.

This manuscript starts with a brief review of functional clustering and covariate-dependent clustering frameworks, such as the mixture of experts model and the finite mixture of regressions model. Using these previous frameworks as a reference, we derive the general form of our covariate adjusted mixed membership model. In Section 2.2, we review the multivariate Karhunen-Loève (KL) theorem, which allows us to have a concise representation of the K latent functional features. In Section 2.3, we leverage the KL decomposition to fully specify the covariate adjusted functional mixed membership model. A review of the identifiability issues that occur in mixed membership models, as well as a set of sufficient conditions to ensure identifiability in a covariate adjusted mixed membership model, can be found in Section 2.4. Section 3 covers two simulation studies, which explore the empirical convergence properties of the proposed model and evaluate the performance of various information criteria in choosing the number of features in a covariate adjusted functional mixed membership model. Section 4 illustrates the usefulness of the covariate adjusted functional mixed membership model by analyzing EEG data from children with ASD and TD children. Lastly, we conclude this manuscript with a discussion of some of the challenges of fitting these models, as well as possible theoretical challenges when working with covariate adjusted mixed membership models.

2 Covariate Adjusted Functional Mixed Membership Model

Functional data analysis (FDA) focuses on the statistical analysis of sample paths from a continuous stochastic process f:\mathcal{T}\rightarrow\mathbb{R}, where \mathcal{T} is a compact subset of \mathbb{R}^{d} (Wang et al., 2016). In FDA, one commonly assumes that the random functions are elements of a Hilbert space, or more specifically that the random functions are square-integrable functions (f\in L^{2}, or \int_{\mathcal{T}}\mid f(t)\mid^{2}\text{d}t<\infty). In this manuscript, we assume that the observed sample paths are generated from a Gaussian process (GP), fully characterized by a mean function, \mu(t)=\mathbb{E}\left(f(t)\right), and a covariance function, C(s,t)=\text{Cov}\left(f(s),f(t)\right), for t,s\in\mathcal{T}. Since mixed membership models can be considered a generalization of finite mixture models, we will show how the finite mixture of regressions and mixture of experts models relate to our proposed mixed membership models. We relate our contribution to covariate adjusted product partition models only superficially, due to the significant differences in both theory and implementation for this model class (Müller et al., 2011; Park and Dunson, 2010). For the theoretical developments discussed in this section, we will assume that the number of clusters or features, K, is known a-priori. Although the number of clusters or features is often unknown, Section 3.1 shows that the use of information criteria or simple heuristic methods such as the “elbow” method can be informative in choosing the number of features.

2.1 Mixture of Regressions, Mixture of Experts and Covariate Adjusted Mixed Membership

Functional clustering generally assumes that each sample path is drawn from one of K underlying cluster-specific sub-processes (James and Sugar, 2003; Chiou and Li, 2007; Jacques and Preda, 2014). In the GP framework, assuming that f^{(1)},\dots,f^{(K)} are the K underlying cluster-specific sub-processes with corresponding mean \left(\mu^{(1)},\dots,\mu^{(K)}\right) and covariance functions \left(C^{(1)},\dots,C^{(K)}\right), the general form of a GP finite mixture model is typically expressed as:

P\left(f_{i}\mid\rho^{(1:K)},\mu^{(1:K)},C^{(1:K)}\right)=\sum_{k=1}^{K}\rho^{(k)}\;\mathcal{GP}\left(f_{i}\mid\mu^{(k)},\,C^{(k)}\right),   (1)

where \rho^{(k)} (\sum_{k=1}^{K}\rho^{(k)}=1) are the mixing proportions and f_{i} are the sample paths, for i=1,\dots,N. Let \mathbf{x}_{i}=\left(X_{i1},\dots,X_{iR}\right)^{\prime}\in\mathbb{R}^{R}:=\mathcal{X} be the covariates of interest associated with the i^{th} observation. Extending the multivariate finite mixture of regressions model (McLachlan et al., 2019; Faria and Soromenho, 2010; Grün et al., 2007; Khalili and Chen, 2007; Hyun et al., 2023; Devijver, 2015) to a functional setting, the covariate adjustments would be encoded through the mean by defining a set of cluster-specific regression functions \mu^{(k)}(\mathbf{x}_{i}):\mathcal{X}\times\mathcal{T}\rightarrow\mathbb{R}. The mixture of GPs in (1) would then be generalized to include covariate information as follows:

P\left(f_{i}\mid\mathbf{x}_{i},\rho^{(1:K)},\mu^{(1:K)},C^{(1:K)}\right)=\sum_{k=1}^{K}\rho^{(k)}\,\mathcal{GP}\left(f_{i}\mid\mu^{(k)}(\mathbf{x}_{i}),\,C^{(k)}\right),   (2)

where cluster-specific GPs are now assumed to depend on mean functions \mu^{(k)}(\mathbf{x}), which are allowed to vary over the covariate space. Alternative representations of covariate adjusted mixtures have been proposed to allow the mixing proportions to depend on covariate information through generalized regression functions \rho^{(k)}(\mathbf{x}_{i}) (Jordan and Jacobs, 1994) or implicitly defining covariate adjusted partitions through a cohesion prior (Park and Dunson, 2010).
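To make the distinction concrete, the following minimal sketch simulates sample paths from the covariate adjusted mixture in (2); the cluster means, mixing proportions, and noise level are illustrative choices of ours, not quantities from any model in this paper.

```python
import numpy as np

# Minimal sketch of the covariate adjusted finite mixture in Equation (2):
# each path is generated by exactly one cluster, whose mean shifts with a
# scalar covariate x. All settings here are illustrative assumptions.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)                        # observation grid
mu = [lambda x: np.sin(2 * np.pi * t) + 0.5 * x * t,
      lambda x: np.cos(2 * np.pi * t) - 0.5 * x * t]
rho = np.array([0.4, 0.6])                       # mixing proportions

def draw_mixture_path(x):
    """Membership is all-or-none, in contrast to mixed membership."""
    k = rng.choice(2, p=rho)                     # latent cluster label
    return mu[k](x) + 0.1 * rng.standard_normal(t.size), k

y, k = draw_mixture_path(x=0.8)
```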

In both the finite mixture of regressions and mixture of experts settings, covariate adjustment does not change the overall interpretation of the finite mixture modeling framework, and the underlying assumption is that a function f_{i} exhibits uncertain membership to only one of K well-defined sub-processes f^{(1:K)}. Following Marco et al. (2024a), we contrast this representation of functional data with one where a sample path f_{i} is allowed mixed membership to one or more well-defined sub-processes f^{(1:K)}. In the mixed membership framework, the underlying sub-processes f^{(1:K)} are best interpreted as latent functional features, which may partially manifest in a specific path f_{i}.

Our representation of a functional mixed membership model hinges on the introduction of a new set of latent variables \mathbf{z}_{i}=(Z_{i1},\ldots,Z_{iK})^{\prime}, where Z_{ik}\in[0,1] and \sum_{k=1}^{K}Z_{ik}=1. Given the latent mixing variables \mathbf{z}_{i}, each sample path is assumed to follow the sampling structure:

f_{i}\mid\mathbf{Z}_{1},\dots,\mathbf{Z}_{N}=_{d}\sum_{k=1}^{K}Z_{ik}f^{(k)}.   (3)

Thus, we can see that under the functional mixed membership model, each sample path is assumed to come from a convex combination of the underlying GPs, f^{(k)}. To avoid unwarranted restrictions on the implied covariance structure, the functional mixed membership model does not assume that the underlying GPs are mutually independent. We will therefore let C^{(k,j)} denote the cross-covariance function between the k^{th} GP and the j^{th} GP, for 1\leq k\neq j\leq K. Letting \boldsymbol{\mathcal{C}} be the collection of covariance and cross-covariance functions, we can specify the sampling model of the functional mixed membership model as

f_{i}\mid\mathbf{z}_{i},\mu^{(1:K)},\boldsymbol{\mathcal{C}}\sim\mathcal{GP}\left(\sum_{k=1}^{K}Z_{ik}\mu^{(k)},\;\sum_{k=1}^{K}Z_{ik}^{2}C^{(k)}+\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}Z_{ik}Z_{ik^{\prime}}C^{(k,k^{\prime})}\right).   (4)

Leveraging the mixture of regressions framework in (2), the proposed covariate adjusted functional mixed membership model becomes:

f_{i}\mid\mathbf{X},\mathbf{z}_{1},\dots,\mathbf{z}_{N},\mu^{(1:K)}(\mathbf{x}_{i}),\boldsymbol{\mathcal{C}}\sim\mathcal{GP}\left(\sum_{k=1}^{K}Z_{ik}\mu^{(k)}(\mathbf{x}_{i}),\;\sum_{k=1}^{K}Z_{ik}^{2}C^{(k)}+\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}Z_{ik}Z_{ik^{\prime}}C^{(k,k^{\prime})}\right).   (5)
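In contrast with the hard-clustering sketch above, the following sketch draws from the sampling model in Equation (5) under simplifying assumptions of our own: K=2 features, a single covariate, squared-exponential feature covariances, and cross-covariances set to zero, so that each path is a single GP draw whose mean and covariance are both blended by its allocations.

```python
import numpy as np

# Sketch of the covariate adjusted mixed membership model in Equation (5).
# Illustrative assumptions: K = 2, squared-exponential covariances C^(k),
# zero cross-covariance between the two features.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)

def sq_exp(t, scale, length):
    d = t[:, None] - t[None, :]
    return scale**2 * np.exp(-0.5 * (d / length) ** 2)

mu = [lambda x: np.sin(2 * np.pi * t) + 0.5 * x * t,   # mu^(1)(x, t)
      lambda x: np.cos(2 * np.pi * t) - 0.3 * x * t]   # mu^(2)(x, t)
C = [sq_exp(t, 1.0, 0.2), sq_exp(t, 0.5, 0.1)]

def draw_mm_path(z, x):
    """Mean and covariance are both blended by the allocations z."""
    mean = z[0] * mu[0](x) + z[1] * mu[1](x)
    cov = z[0] ** 2 * C[0] + z[1] ** 2 * C[1]    # cross-covariances omitted
    cov += 1e-8 * np.eye(t.size)                 # jitter for numerical PSD
    return rng.multivariate_normal(mean, cov)

y = draw_mm_path(z=np.array([0.7, 0.3]), x=1.2)
```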

In Section 2.2, we review the multivariate Karhunen-Loève (KL) construction (Ramsay and Silverman, 2005a; Happ and Greven, 2018), which allows us to have a joint approximation of the covariance structure of the K underlying GPs. Using the joint approximation, we are able to concisely represent the K GPs, facilitating inference on higher-dimensional functions such as surfaces. Using the KL decomposition, we are able to fully specify our proposed covariate adjusted functional mixed membership model (Equation 5) in Section 2.3.

2.2 Multivariate Karhunen-Loève Characterization

The proposed modeling framework in Equation (5) depends on a set of mean and covariance functions, which are naturally represented as infinite-dimensional parameters. Thus, in order to work in finite dimensions, we assume that the underlying GPs are smooth and lie in the P-dimensional subspace, \boldsymbol{\mathcal{S}}\subset L^{2}(\mathcal{T}), spanned by a set of linearly independent square-integrable basis functions, \{b_{1},\dots,b_{P}\} (b_{p}:\mathcal{T}\rightarrow\mathbb{R}). Although the choice of basis functions is user-defined, the basis functions must be uniformly continuous in order to satisfy Lemma 2.2 of Marco et al. (2024a). In this manuscript, we will utilize B-splines for all case studies and simulation studies.

The assumption that f^{(k)}\in\boldsymbol{\mathcal{S}} allows us to turn an infinite-dimensional problem into a finite-dimensional problem, making traditional inference tractable. Although tractable, accounting for the covariance and cross-covariance functions separately leads to a model that needs \mathcal{O}(K^{2}P^{2}) parameters to represent the overall covariance structure. While the number of clusters, K, supported in applications is typically small, the number of basis functions, P, tends to follow the observational grid resolution, and may therefore be quite large. This is particularly true when one considers higher-dimensional functional data, such as surfaces, which necessitate a substantial number of basis functions, making models computationally expensive to fit. Thus, we will use the multivariate KL decomposition (Ramsay and Silverman, 2005a; Happ and Greven, 2018) to reduce the number of parameters needed to estimate the covariance surface to \mathcal{O}(KPM), where M is the number of eigenfunctions used to approximate the covariance structure. A more detailed derivation of a KL decomposition for functions in \boldsymbol{\mathcal{S}} can be found in Marco et al. (2024a).

To achieve a concise joint representation of the K latent GPs, we define a multivariate GP, which we denote by \mathbf{f}(\mathbf{t}), such that

\mathbf{f}(\mathbf{t}):=\left\{f^{(1)}\left(t^{(1)}\right),f^{(2)}\left(t^{(2)}\right),\dots,f^{(K)}\left(t^{(K)}\right)\right\},

where \mathbf{t}=\left(t^{(1)},t^{(2)},\dots,t^{(K)}\right) and t^{(1)},t^{(2)},\dots,t^{(K)}\in\mathcal{T}. Since \boldsymbol{\mathcal{S}}\subset L^{2} is a Hilbert space, with the inner product defined as the L^{2} inner product, we can see that \mathbf{f}\in\boldsymbol{\mathcal{H}}:=\boldsymbol{\mathcal{S}}\times\boldsymbol{\mathcal{S}}\times\dots\times\boldsymbol{\mathcal{S}}:=\boldsymbol{\mathcal{S}}^{K}, where \boldsymbol{\mathcal{H}} is defined as the direct sum of the Hilbert spaces \boldsymbol{\mathcal{S}}, making \boldsymbol{\mathcal{H}} also a Hilbert space. Since \boldsymbol{\mathcal{H}} is a Hilbert space and the covariance operator of \mathbf{f}, denoted \boldsymbol{\mathcal{K}}, is a positive, bounded, self-adjoint, and compact operator, we know that there exists a complete set of eigenfunctions \boldsymbol{\Psi}_{1},\dots,\boldsymbol{\Psi}_{KP}\in\boldsymbol{\mathcal{H}} and corresponding eigenvalues \lambda_{1}\geq\dots\geq\lambda_{KP}\geq 0 such that \boldsymbol{\mathcal{K}}\boldsymbol{\Psi}_{p}=\lambda_{p}\boldsymbol{\Psi}_{p}, for p=1,\dots,KP. Using the eigenpairs of the covariance operator, we can rewrite f^{(k)} as

f^{(k)}(t)=\mu^{(k)}(\mathbf{x},t)+\sum_{m=1}^{KP}\chi_{m}\left(\sqrt{\lambda_{m}}\Psi_{m}^{(k)}(t)\right),

where \chi_{m}\sim\mathcal{N}(0,1) and \Psi_{m}^{(k)}(t) is the k^{th} element of \boldsymbol{\Psi}_{m}(\mathbf{t}). Since \Psi_{m}^{(k)}(t)\in\boldsymbol{\mathcal{S}}, we know that there exists \boldsymbol{\phi}_{km}\in\mathbb{R}^{P} such that \sqrt{\lambda_{m}}\Psi_{m}^{(k)}(t)=\boldsymbol{\phi}_{km}^{\prime}\mathbf{B}(t), where \mathbf{B}^{\prime}(t):=[b_{1}(t),b_{2}(t),\dots,b_{P}(t)]. Similarly, since \mu^{(k)}(\mathbf{x},t)\in\boldsymbol{\mathcal{S}}, we can introduce a mapping \mathbf{g}_{k}:\mathbb{R}^{R}\rightarrow\mathbb{R}^{P} such that \mu^{(k)}(\mathbf{x},t)=\mathbf{g}_{k}(\mathbf{x})^{\prime}\mathbf{B}(t). Therefore, we arrive at the general form of our decomposition:

f^{(k)}(t)=\mathbf{g}_{k}(\mathbf{x})^{\prime}\mathbf{B}(t)+\sum_{m=1}^{KP}\chi_{m}\boldsymbol{\phi}_{km}^{\prime}\mathbf{B}(t).   (6)

Using this decomposition, our covariance and mean structures can be recovered such that C^{(k,k^{\prime})}(s,t)=\mathbf{B}^{\prime}(s)\left(\sum_{m=1}^{KP}\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right)\mathbf{B}(t) and \mu^{(k)}(\mathbf{x},t)=\mathbf{g}_{k}(\mathbf{x})^{\prime}\mathbf{B}(t), for 1\leq k,k^{\prime}\leq K and s,t\in\mathcal{T}. To reduce the dimensionality of the problem, we will only use the first M eigenpairs to approximate the K stochastic processes. Although traditional functional principal component analysis (FPCA) would choose the number of eigenfunctions based on the proportion of variance explained, the same strategy cannot be employed in functional mixed membership models because the allocation parameters of the model are typically not known. Therefore, we suggest selecting a large value of M and then proceeding with stable estimation through prior regularization.
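As a concrete illustration of this reconstruction, the sketch below evaluates C^{(k,k^{\prime})}(s,t) on a grid using a cubic B-spline basis with equidistant knots and randomly generated coefficients \boldsymbol{\phi}_{km} (all dimensions here are made up), and then computes the spectrum of one feature's covariance, the kind of eigen-analysis one might use when deciding how large M should be.

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch of recovering C^(k,k')(s,t) = B'(s)(sum_m phi_km phi'_k'm)B(t) on a
# grid; the dimensions (P, M, K) and the coefficients phi are illustrative.
P, M, K = 8, 4, 2
grid = np.linspace(0, 1, 100)

# Cubic B-spline design: row i holds B(t_i)' = [b_1(t_i), ..., b_P(t_i)]
knots = np.r_[[0.0] * 3, np.linspace(0, 1, P - 2), [1.0] * 3]
design = BSpline(knots, np.eye(P), 3)(grid)          # shape (n, P)

rng = np.random.default_rng(2)
phi = rng.normal(size=(K, M, P))                     # phi_km in R^P

def cross_cov(k, kp):
    H = sum(np.outer(phi[k, m], phi[kp, m]) for m in range(M))
    return design @ H @ design.T                     # C^(k,k') on the grid

# Spectrum of one feature's covariance surface (descending eigenvalues)
evals = np.linalg.eigvalsh(cross_cov(0, 0))[::-1]
```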

2.3 Model and Prior Specification

In this section, we fully specify the covariate adjusted functional mixed membership model utilizing a truncated version of the KL decomposition specified in Equation 6. We start by specifying how the covariates of interest influence the mean function. Under the standard function-on-scalar regression framework (Faraway, 1997; Brumback and Rice, 1998), we have \mu^{(k)}(\mathbf{x}_{i},t)=\beta_{k0}(t)+\sum_{r=1}^{R}X_{ir}\beta_{kr}(t) for k=1,\dots,K. Since we assumed that \mu^{(k)}(\mathbf{x}_{i},t)\in\boldsymbol{\mathcal{S}} for k=1,\dots,K, we know that \beta_{k0},\dots,\beta_{kR}\in\boldsymbol{\mathcal{S}}. Therefore, there exist \boldsymbol{\nu}_{k}\in\mathbb{R}^{P} and \boldsymbol{\eta}_{k}\in\mathbb{R}^{P\times R} such that

\mu^{(k)}(\mathbf{x}_{i},t)=\boldsymbol{\nu}_{k}^{\prime}\mathbf{B}(t)+\left(\boldsymbol{\eta}_{k}\mathbf{x}_{i}\right)^{\prime}\mathbf{B}(t).   (7)

Under a standardized set of covariates, \boldsymbol{\nu}_{k} specifies the population-level mean of the k^{th} feature and \boldsymbol{\eta}_{k} encodes the covariate dependence of the k^{th} feature. Thus, assuming the mean structure specified in Equation 7 and using a truncated version of the KL decomposition specified in Equation 6, we can specify the sampling model of our covariate adjusted functional mixed membership model.

Let \{\mathbf{Y}_{i}(\cdot)\}_{i=1}^{N} be the observed sample paths that we wish to model with the K latent features, f^{(k)}, conditionally on the covariates of interest, \mathbf{x}_{i}. Since the sample paths are observed at only a finite number of points, we will let \mathbf{t}_{i}=[t_{i1},\dots,t_{in_{i}}]^{\prime} denote the time points at which the i^{th} function was observed. Without loss of generality, we will define the sampling distribution over the finite-dimensional marginals of \mathbf{Y}_{i}(\mathbf{t}_{i}). Using the general form of our proposed model defined in Equation 5, we have

\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\Theta},\mathbf{X}\sim\mathcal{N}\left\{\sum_{k=1}^{K}Z_{ik}\left(\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}\right)+\sum_{m=1}^{M}\chi_{im}\mathbf{S}^{\prime}(\mathbf{t}_{i})\boldsymbol{\phi}_{km}\right),\;\sigma^{2}\mathbf{I}_{n_{i}}\right\},   (8)

where \mathbf{S}(\mathbf{t}_{i})=[\mathbf{B}(t_{i1})\cdots\mathbf{B}(t_{in_{i}})]\in\mathbb{R}^{P\times n_{i}} and \boldsymbol{\Theta} is the collection of the model parameters. As defined in Section 2, the Z_{ik} are variables that lie on the unit simplex, such that Z_{ik}\in(0,1) and \sum_{k=1}^{K}Z_{ik}=1. From this characterization, we can see that each observation is modeled as a convex combination of realizations from the K features, with additional Gaussian noise of variance \sigma^{2}. If we integrate out the \chi_{im} variables, for i=1,\dots,N and m=1,\dots,M, we arrive at the following likelihood:

\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\Theta}_{-\chi},\mathbf{X}\sim\mathcal{N}\left\{\sum_{k=1}^{K}Z_{ik}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}\right),\;\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}\right\},   (9)

where \boldsymbol{\Theta}_{-\chi} is the collection of the model parameters excluding the \chi_{im} parameters (i=1,\dots,N and m=1,\dots,M) and \mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\left\{\mathbf{S}^{\prime}(\mathbf{t}_{i})\sum_{m=1}^{M}\left(\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right)\mathbf{S}(\mathbf{t}_{i})\right\}. Equation 9 illustrates that the proposed covariate adjusted functional mixed membership model can be expressed as an additive model: the mean structure is a convex combination of the feature-specific means, while the covariance can be written as a weighted sum of covariance functions and cross-covariance functions.
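To fix ideas, the following sketch assembles the mean and covariance in Equation (9) for a single observation and evaluates the corresponding Gaussian log-likelihood; every parameter value below is a placeholder of ours, not an estimate from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.stats import multivariate_normal

# Sketch of the marginal likelihood in Equation (9) for one observation,
# with illustrative dimensions and randomly generated parameters.
P, M, K, R = 8, 4, 2, 1
t_i = np.linspace(0, 1, 60)
knots = np.r_[[0.0] * 3, np.linspace(0, 1, P - 2), [1.0] * 3]
S = BSpline(knots, np.eye(P), 3)(t_i).T              # S(t_i) in R^{P x n_i}

rng = np.random.default_rng(3)
nu = rng.normal(size=(K, P))                         # nu_k
eta = rng.normal(size=(K, P, R))                     # eta_k
phi = rng.normal(scale=0.3, size=(K, M, P))          # phi_km
z = np.array([0.6, 0.4])                             # allocations z_i
x = np.array([1.0])                                  # covariate x_i
sigma2 = 0.05

mean = sum(z[k] * S.T @ (nu[k] + eta[k] @ x) for k in range(K))
V = sum(z[k] * z[kp] *
        S.T @ sum(np.outer(phi[k, m], phi[kp, m]) for m in range(M)) @ S
        for k in range(K) for kp in range(K))
cov = V + sigma2 * np.eye(t_i.size)
y_i = rng.multivariate_normal(mean, cov)             # simulate one path
loglik = multivariate_normal.logpdf(y_i, mean, cov)
```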

To have an adequately expressive and scalable model, we approximate the covariance surface of the K features using M scaled pseudo-eigenfunctions. In this framework, orthogonality is not imposed on the \boldsymbol{\phi}_{km}^{\prime}\mathbf{B}(t) parameters, making them pseudo-eigenfunctions rather than the eigenfunctions described in Section 2.2. From a modeling perspective, this allows us to sample on an unconstrained space, facilitating better Markov chain mixing and simpler sampling schemes. Although direct inference on the eigenfunctions is no longer available, a formal analysis can still be conducted by reconstructing the posterior samples of the covariance surface and calculating eigenfunctions from the posterior samples. To avoid overfitting of the covariance functions, we follow Marco et al. (2024a) by using the multiplicative gamma process shrinkage prior proposed by Bhattacharya and Dunson (2011) to achieve adaptive regularized estimation of the covariance structure. Therefore, letting \phi_{kpm} be the p^{th} element of \boldsymbol{\phi}_{km}, we have the following:

\phi_{kpm}\mid\gamma_{kpm},\tilde{\tau}_{mk}\sim\mathcal{N}\left(0,\gamma_{kpm}^{-1}\tilde{\tau}_{mk}^{-1}\right),\;\;\;\gamma_{kpm}\sim\Gamma\left(\nu_{\gamma}/2,\nu_{\gamma}/2\right),\;\;\;\tilde{\tau}_{mk}=\prod_{n=1}^{m}\delta_{nk},
\delta_{1k}\mid a_{1k}\sim\Gamma(a_{1k},1),\;\;\;\delta_{jk}\mid a_{2k}\sim\Gamma(a_{2k},1),\;\;\;a_{1k}\sim\Gamma(\alpha_{1},\beta_{1}),\;\;\;a_{2k}\sim\Gamma(\alpha_{2},\beta_{2}),

where 1\leq k\leq K, 1\leq p\leq P, 1\leq m\leq M, and 2\leq j\leq M. By letting \alpha_{2}>\beta_{2}, we can show that \mathbb{E}(\tilde{\tau}_{mk})<\mathbb{E}(\tilde{\tau}_{m^{\prime}k}) for 1\leq m<m^{\prime}\leq M, leading to the prior on \phi_{kpm} having stochastically decreasing variance as m increases. This will have a regularizing effect on the posterior draws of \phi_{kpm}, making \phi_{kpm} more likely to be close to zero as m increases.
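The following sketch draws the \phi_{kpm} scale parameters for one feature from the shrinkage prior above, using illustrative hyperparameter values with \alpha_{2}>\beta_{2}; the per-column variances tend to shrink as m grows, which is the regularizing behavior described above.

```python
import numpy as np

# Sketch of one draw from the multiplicative gamma process shrinkage prior
# for feature k; hyperparameter values are illustrative (alpha_2 > beta_2).
rng = np.random.default_rng(4)
P, M = 8, 6
nu_gamma, alpha1, beta1, alpha2, beta2 = 3.0, 2.0, 1.0, 3.0, 1.0

a_1k = rng.gamma(alpha1, 1.0 / beta1)            # Gamma(alpha, rate beta)
a_2k = rng.gamma(alpha2, 1.0 / beta2)
delta = np.r_[rng.gamma(a_1k, 1.0), rng.gamma(a_2k, 1.0, size=M - 1)]
tau_tilde = np.cumprod(delta)                    # tau~_mk = prod_{n<=m} delta_nk
gamma = rng.gamma(nu_gamma / 2, 2 / nu_gamma, size=(P, M))
phi_k = rng.normal(0.0, 1.0 / np.sqrt(gamma * tau_tilde))   # (P, M) draw
print((1.0 / (gamma * tau_tilde)).mean(axis=0))  # variances tend to shrink in m
```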

In functional data analysis, we often desire smooth mean functions to protect against overfit models. Therefore, we use the first-order random walk penalty proposed by Lang and Brezger (2004) on the \boldsymbol{\nu}_{k} and \boldsymbol{\eta}_{k} parameters to promote adaptive smoothing of the mean functions of the features in our model. Specifically, we have P(\boldsymbol{\nu}_{k}\mid\tau_{\boldsymbol{\nu}_{k}})\propto\exp\left(-\frac{\tau_{\boldsymbol{\nu}_{k}}}{2}\sum_{p=1}^{P-1}\left(\nu_{pk}-\nu_{(p+1)k}\right)^{2}\right), for k=1,\dots,K, where \tau_{\boldsymbol{\nu}_{k}}\sim\Gamma(\alpha_{\boldsymbol{\nu}},\beta_{\boldsymbol{\nu}}) and \nu_{pk} is the p^{th} element of \boldsymbol{\nu}_{k}. Similarly, we have P(\{\eta_{prk}\}_{p=1}^{P}\mid\tau_{\boldsymbol{\eta}_{rk}})\propto\exp\left(-\frac{\tau_{\boldsymbol{\eta}_{rk}}}{2}\sum_{p=1}^{P-1}\left(\eta_{prk}-\eta_{(p+1)rk}\right)^{2}\right), for k=1,\dots,K and r=1,\dots,R, where \tau_{\boldsymbol{\eta}_{rk}}\sim\Gamma(\alpha_{\boldsymbol{\eta}},\beta_{\boldsymbol{\eta}}) and \eta_{prk} is the element in the p^{th} row and r^{th} column of \boldsymbol{\eta}_{k}. Following previous mixed membership models (Heller et al., 2008; Marco et al., 2024a, b), we assume that \mathbf{z}_{i}\mid\boldsymbol{\pi},\alpha_{3}\sim_{iid}\text{Dir}(\alpha_{3}\boldsymbol{\pi}), \boldsymbol{\pi}\sim\text{Dir}(\mathbf{c}_{\pi}), and \alpha_{3}\sim\text{Exp}(b). Lastly, we assume that \sigma^{2}\sim\text{IG}(\alpha_{0},\beta_{0}). The posterior distributions of the parameters in our model, as well as a sampling scheme with tempered transitions, can be found in Section 2 of the Supplementary Materials. A covariate adjusted model in which the covariance also depends on the covariates of interest can be found in Section 4 of the Supplementary Materials.
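For concreteness, the sketch below evaluates the log kernel of this first-order random walk prior for a made-up coefficient vector; larger values of the precision penalize differences between adjacent B-spline coefficients more strongly, yielding smoother mean functions.

```python
import numpy as np

# Sketch of the first-order random walk smoothness penalty: the log prior
# kernel is -(tau/2) * sum_p (coef_p - coef_{p+1})^2, up to a constant.
def rw1_log_kernel(coef, tau):
    return -0.5 * tau * np.sum(np.diff(coef) ** 2)

nu_k = np.array([0.10, 0.30, 0.35, 0.20, 0.00])   # made-up coefficients
print(rw1_log_kernel(nu_k, tau=10.0))             # larger tau => more smoothing
```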

2.4 Model Identifiability

Mixed membership models face multiple identifiability problems due to their increased flexibility over traditional clustering models (Chen et al., 2022; Marco et al., 2024a, b). Like traditional clustering models, mixed membership models also face the common label switching problem, where an equivalent model can be formulated by permuting the labels or allocation parameters. Although this is one source of non-identifiability, relabelling algorithms can be formulated from the work of Stephens (2000). More complex identifiability problems arise since the allocation parameters are now continuous variables on the unit simplex, rather than binary variables like in clustering. Specifically, an equivalent model can be constructed by rotating and/or changing the volume of the convex polytope constructed by the allocation parameters through linear transformations of the allocation parameters (Chen et al., 2022). Therefore, geometric assumptions must be made on the allocation parameters to ensure that the mixed membership models are identifiable.

One way to ensure that we have an identifiable model up to a permutation of the labels is to assume that the separability condition holds (Pettit, 1990; Donoho and Stodden, 2003; Arora et al., 2012; Azar et al., 2001; Chen et al., 2022). The separability condition assumes that at least one observation belongs completely to each of the K features. These observations that belong to only one feature act as “anchors”, ensuring that the allocation parameters are identifiable up to a permutation of the labels. Although the separability condition is conceptually simple and relatively easy to implement, it makes strong geometric assumptions on the data-generating process for mixed membership models with three or more features. A weaker geometric assumption that ensures identifiability in models with three or more features, known as the sufficiently scattered condition, is discussed in Chen et al. (2022). Although this geometric assumption is relatively weak, implementing the corresponding constraints is nontrivial. When extending multivariate mixed membership models to functional data and introducing a covariate-dependent mean structure, ensuring an identifiable mean and covariance structure requires further assumptions.

Lemma 2.1

Consider a K-feature covariate adjusted functional mixed membership model as specified in Equation 9. The parameters \boldsymbol{\nu}_{k}, \boldsymbol{\eta}_{k}, Z_{ik}, \sum_{m=1}^{M}\left(\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right), and \sigma^{2} are identifiable up to a permutation of the labels (i.e., label switching), for k,k^{\prime}=1,\dots,K, i=1,\dots,N, and m=1,\dots,M, given the following assumptions:

  1. \mathbf{X} is full column rank with \mathbf{1} not in the column space of \mathbf{X}.

  2. The separability condition or the sufficiently scattered condition holds on the allocation matrix. Moreover, there exist at least \frac{K^{2}+K}{2} observations (N\geq\frac{K^{2}+K}{2}) with allocation parameters such that the matrix \mathbf{C}\in\mathbb{R}^{N\times\frac{K^{2}+K}{2}} is full column rank, where the i^{th} row of \mathbf{C} is specified as \mathbf{c}_{i}=[Z_{i1}^{2},\dots,Z_{iK}^{2},2Z_{i1}Z_{i2},\dots,2Z_{i1}Z_{iK},2Z_{i2}Z_{i3},\dots,2Z_{i(K-1)}Z_{iK}].

  3. The sample paths \mathbf{Y}_{i}(\mathbf{t}_{i}) are regularly sampled so that n_{i}>P. Moreover, we assume that the basis functions are B-splines with equidistant knots.

Lemma 2.1 states a set of sufficient conditions that lead to an identifiable mean and covariance structure up to a permutation of the labels. The proof of Lemma 2.1 can be found in Section 1 of the Supporting Materials. The first assumption is similar to the assumption needed in the linear regression setting to ensure identifiability; however, the intercept term is not included in the design matrix, as the \boldsymbol{\nu}_{k} parameters already account for the intercept. The second assumption imposes geometric constraints on the allocation parameters, provides a lower bound on the number of functional observations, and requires that the matrix \mathbf{C} be full column rank. The matrix \mathbf{C} is constructed from the allocation parameters, and although the condition on \mathbf{C} may seem restrictive, it often holds in practice; a simple numerical check is sketched below. We note that this condition may not hold if there are not enough unique allocation parameter values, or if most of the allocation parameters lie on a lower-dimensional subspace of the unit (K-1)-simplex. Although the individual \boldsymbol{\phi}_{km} parameters are not identifiable in our model, an eigen analysis can still be performed by constructing posterior draws of the covariance structure and calculating the eigenvalues and eigenfunctions of the posterior draws. Section 3 provides empirical evidence that the mean and covariance structure converge to the truth as more observations are accrued.
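A minimal version of that numerical check might look as follows: build \mathbf{C} column by column from the allocation matrix and verify that its rank equals \frac{K^{2}+K}{2}; the allocation parameters here are randomly generated for illustration.

```python
import numpy as np
from itertools import combinations

# Sketch of checking assumption 2 of Lemma 2.1: construct C from allocations
# Z (rows on the unit simplex) and test for full column rank K(K+1)/2.
rng = np.random.default_rng(5)
N, K = 50, 3
Z = rng.dirichlet(np.ones(K), size=N)           # illustrative allocations

quad = Z**2                                      # columns Z_ik^2
cross = np.column_stack([2 * Z[:, k] * Z[:, kp]
                         for k, kp in combinations(range(K), 2)])
C = np.hstack([quad, cross])                     # N x K(K+1)/2
print(np.linalg.matrix_rank(C) == K * (K + 1) // 2)
```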

2.5 Relationship to Function-on-Scalar Regression

Function-on-scalar regression is a common method in FDA that allows the mean structure of a continuous stochastic process to be dependent on scalar covariates. In function-on-scalar regression, we often assume that the response is a GP, and that the covariates of interest are vector-valued. A comprehensive review of the broader area of functional regression can be found in Ramsay and Silverman (2005b) and Morris (2015). While there have been many advancements and generalizations (Krafty et al., 2008; Reiss et al., 2010; Goldsmith et al., 2015; Kowal and Bourgeois, 2020) of function-on-scalar regression since the initial papers of Faraway (1997) and Brumback and Rice (1998), the general form of function-on-scalar regression can be expressed as follows:

Y(t)=\mu(t)+\sum_{r=1}^{R}X_{r}\beta_{r}(t)+\epsilon(t);\;\;\;\;t\in\mathcal{T},   (10)

where Y(t) is the response function evaluated at t, \beta_{r}(\cdot) is the functional coefficient representing the effect that the r^{th} covariate (X_{r}) has on the mean structure, and \epsilon is a mean-zero Gaussian process with covariance function \mathcal{C}. The function \mu:\mathcal{T}\rightarrow\mathbb{R} in Equation (10) represents the mean of the GP when all of the covariates, X_{r}, are set to zero. Unlike the traditional setting for multiple linear regression in finite-dimensional vector spaces, function-on-scalar regression requires the estimation of the infinite-dimensional functions \mu and \beta_{1},\dots,\beta_{R} from a finite number of sample paths observed at a finite number of points (\mathbf{Y}_{i}(\mathbf{t}_{i}) for i=1,\dots,N and \mathbf{t}_{i}=[t_{i1},\dots,t_{in_{i}}]^{\prime}).

To make inference tractable, we assume that the data lie in the span of a finite set of basis functions, which allows us to expand \mu and \beta_{1},\dots,\beta_{R} as a finite sum of the basis functions. The set of basis functions can be specified using data-driven basis functions or by specifying the basis functions a-priori. If the basis functions are specified a-priori, common choices are B-splines and wavelets, due to their flexibility, as well as Fourier series for periodic functions. Alternatively, if data-driven basis functions are desired, a common choice is to use the eigenfunctions of the covariance operator. To estimate the eigenfunctions, functional principal component analysis (Shang, 2014) is often performed and the resulting estimates of the eigenfunctions are used. Functional principal component analysis faces a similar problem, in that the objective is to estimate eigenfunctions using only a finite number of sample paths observed at a finite number of points. To solve this problem, Rice and Silverman (1991) propose using a spline basis to estimate smooth eigenfunctions, while Yao et al. (2005) propose using local linear smoothers to estimate the smooth eigenfunctions. Therefore, even data-driven basis functions require smoothing assumptions, suggesting similar results between data-driven basis functions and a reasonable set of a-priori specified basis functions paired with a penalty to prevent overfitting.

Specifying the basis functions a-priori (b_{1}(t),\dots,b_{P}(t)), and letting \mathbf{Y}_{i}(\mathbf{t}_{i}) be the observed sample paths at points \mathbf{t}_{i}=[t_{i1},\dots,t_{in_{i}}]^{\prime} (i=1,\dots,N), we can simplify Equation (10) to get

\mathbf{Y}_{i}(\mathbf{t}_{i})=\mathbf{S}^{\prime}(\mathbf{t}_{i})\tilde{\boldsymbol{\nu}}+\mathbf{S}^{\prime}(\mathbf{t}_{i})\tilde{\boldsymbol{\eta}}\mathbf{x}_{i}+\boldsymbol{\epsilon}_{i}(\mathbf{t}_{i}),   (11)

where \tilde{\boldsymbol{\nu}}\in\mathbb{R}^{P}, \tilde{\boldsymbol{\eta}}\in\mathbb{R}^{P\times R}, and \mathbf{S}(\mathbf{t}_{i})=[\mathbf{B}(t_{i1})\cdots\mathbf{B}(t_{in_{i}})]\in\mathbb{R}^{P\times n_{i}} is the set of basis functions evaluated at the time points of interest. As specified in the previous sections, \mathbf{B}^{\prime}(t):=[b_{1}(t),b_{2}(t),\dots,b_{P}(t)]. Equation (11) shows that the function \mu(\cdot) evaluated at the points \mathbf{t}_{i} can be represented by \mathbf{S}^{\prime}(\mathbf{t}_{i})\tilde{\boldsymbol{\nu}}, and similarly the functional coefficients \beta_{1}(\cdot),\dots,\beta_{R}(\cdot) can be represented by \mathbf{S}^{\prime}(\mathbf{t}_{i})\tilde{\boldsymbol{\eta}}. Therefore, we are left to estimate \tilde{\boldsymbol{\nu}}, \tilde{\boldsymbol{\eta}}, and the parameters associated with the covariance function of \epsilon(\cdot), denoted \mathcal{C}.
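As a simple illustration of this estimation problem, the sketch below simulates from the model in (11) with iid noise and recovers \tilde{\boldsymbol{\nu}} and \tilde{\boldsymbol{\eta}} using a two-stage least-squares heuristic (project each curve onto the basis, then regress the basis coefficients on the covariates); this is a toy estimator of our own for illustration, not the procedure used in the references above.

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch of function-on-scalar regression, Equation (11), with iid noise.
# Data and true coefficients are simulated placeholders.
rng = np.random.default_rng(6)
N, P, R, n = 40, 8, 2, 50
t = np.linspace(0, 1, n)
knots = np.r_[[0.0] * 3, np.linspace(0, 1, P - 2), [1.0] * 3]
S = BSpline(knots, np.eye(P), 3)(t)              # n x P basis evaluations

X = rng.normal(size=(N, R))                      # scalar covariates
coef_true = rng.normal(size=(P, R + 1))          # columns: [nu~, eta~]
Y = np.stack([S @ (coef_true @ np.r_[1.0, X[i]]) +
              0.1 * rng.standard_normal(n) for i in range(N)])

# Stage 1: project each observed curve onto the basis
A = np.linalg.lstsq(S, Y.T, rcond=None)[0]       # P x N basis coefficients
# Stage 2: regress basis coefficients on [1, x_i]
D = np.column_stack([np.ones(N), X])             # N x (R+1) design
coef_hat = np.linalg.lstsq(D, A.T, rcond=None)[0].T   # estimate of [nu~, eta~]
```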

The covariance function \mathcal{C} represents the within-function covariance structure of the data. In the simplest case, we often assume that \boldsymbol{\epsilon}_{i}(\mathbf{t}_{i})\sim\mathcal{N}(\mathbf{0},\tilde{\sigma}^{2}\mathbf{I}_{n_{i}}), which means that we only need \tilde{\sigma}^{2} to specify \mathcal{C}. In more complex models (Faraway, 1997; Krafty et al., 2008), one may make less restrictive assumptions, such as \boldsymbol{\epsilon}_{i}(\mathbf{t}_{i})\sim\mathcal{N}(\mathbf{0}_{n_{i}},\tilde{V}(\mathbf{t}_{i})+\tilde{\sigma}^{2}\mathbf{I}_{n_{i}}), where \tilde{V}(\cdot) is a low-dimensional approximation of a smooth covariance surface using a truncated eigendecomposition. Although functional regression usually assumes that the functions are independent, functional regression models have been proposed to model between-function variation, or cases where observations can be correlated (Morris and Carroll, 2006; Staicu et al., 2010).

Assuming a relatively general covariance structure, as in Krafty et al. (2008), the function-on-scalar model places the following distributional assumptions on our sample paths:

\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\tilde{\boldsymbol{\nu}},\tilde{\boldsymbol{\eta}},\tilde{\mathbf{V}}(\mathbf{t}_{i}),\tilde{\sigma}^{2},\mathbf{X}\sim\mathcal{N}\left\{\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\tilde{\boldsymbol{\nu}}+\tilde{\boldsymbol{\eta}}\mathbf{x}_{i}\right),\;\tilde{\mathbf{V}}(\mathbf{t}_{i})+\tilde{\sigma}^{2}\mathbf{I}_{n_{i}}\right\}.   (12)

From Equation (12), it is apparent that the proposed covariate adjusted mixed membership model specified in Equation (9) is closely related to function-on-scalar regression. The key difference between the two models is that the covariate adjusted mixed membership model does not assume a common mean and covariance structure across all observations conditionally on the covariates of interest. Instead, the covariate adjusted mixed membership model allows each observation to be modeled as a convex combination of K underlying features. Each feature is allowed to have different mean and covariance structures, meaning that the covariates are not assumed to have the same effect on all observations. By allowing this type of heterogeneity in our model, we can conduct a more granular analysis and identify subgroups that interact differently with the covariates of interest. Alternatively, if there are subgroups of the population that interact differently with the covariates of interest, then the results from a function-on-scalar regression model will be confounded, as the effects will likely be averaged out in the analysis. The finite mixture of experts model and the mixture of regressions model offer a reasonably granular level of insight, permitting inference at the sub-population level. However, they do not provide the individual-level inference achievable with the proposed covariate adjusted functional mixed membership model.

3 Simulation Studies

In this section, we discuss the results of two simulation studies. The first simulation study explores the empirical convergence properties of the mean, covariance, and allocation structures of the proposed covariate adjusted functional mixed membership model. Along with correctly specified models, this simulation study also examines the convergence properties of over-specified and under-specified models. The second simulation study explores the use of information criteria in choosing the number of features in a covariate adjusted functional mixed membership model.

3.1 Simulation Study 1: Structure Recovery

In the first simulation study, we explore the empirical convergence properties of the proposed covariate adjusted functional mixed membership model. Specifically, we generate data from a covariate adjusted functional mixed membership model and assess how well the proposed framework can recover the true mean, covariance, and allocation structures. To evaluate how well we can recover the mean, covariance, and cross-covariance functions, we calculate the relative mean integrated squared error (R-MISE), defined as \text{R-MISE}=\frac{\int\{f(t)-\hat{f}(t)\}^{2}\text{d}t}{\int f(t)^{2}\text{d}t}\times 100\%, or \text{R-MISE}=\frac{\int_{t}\int_{x}\{f(t,x)-\hat{f}(t,x)\}^{2}\text{d}x\text{d}t}{\int_{t}\int_{x}f(t,x)^{2}\text{d}x\text{d}t}\times 100\% in the case of a covariate adjusted model. In this simulation study, \hat{f}(t) is the posterior median obtained from our posterior samples. To measure how well we recover the allocation structure, Z_{ik}, we calculate the root-mean-square error (RMSE). In addition to studying the empirical convergence properties of correctly specified models, we also include a scenario where we fit a covariate adjusted functional mixed membership model when the generating truth had no covariate dependence. Conversely, we also study the scenario where we fit a functional mixed membership model with no covariates when the data were generated from a covariate adjusted functional mixed membership model with one covariate.
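Both metrics are straightforward to compute on an evaluation grid; a minimal sketch is given below (the grid spacing cancels in the R-MISE ratio, so it is omitted).

```python
import numpy as np

# Sketch of the performance metrics used in this simulation study; the
# inputs are arrays of true and estimated values on a common grid.
def r_mise(f_true, f_hat):
    """Relative mean integrated squared error, as a percentage."""
    return 100.0 * np.sum((f_true - f_hat) ** 2) / np.sum(f_true ** 2)

def rmse(Z_true, Z_hat):
    """Root-mean-square error for the allocation parameters."""
    return np.sqrt(np.mean((Z_true - Z_hat) ** 2))
```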

Figure 1 contains a visualization of the performance metrics for each of the five scenarios considered in this simulation study. We can see that when the model is correctly specified, it does a good job of recovering the mean structure, even with relatively few observations. On the other hand, a relatively large number of observations is needed to recover the covariance structure when the model is correctly specified. This simulation study also shows that we pay a penalty in terms of statistical efficiency when we over-specify a model; however, the over-specified model still shows signs of convergence to the true mean, covariance, and allocation structures. Conversely, an under-specified model shows no signs of converging to the true mean and covariance structures as we accrue more observations. Additional details on how the simulation was conducted, as well as more detailed visualizations of the simulation results, can be found in Section 3.1 of the Supplementary Materials.

Figure 1: Recovery of the mean, covariance, and allocation structure under the following three data-generating scenarios: (1) data generated from a two-feature covariate adjusted functional mixed membership model with two covariates (Top Row), (2) data generated from a two-feature covariate adjusted functional mixed membership model with one covariate (Middle Row), and (3) data generated from a two-feature functional mixed membership model (Bottom Row).

3.2 Simulation Study 2: Information Criteria

When considering the use of a mixed membership model to gain insight into the heterogeneity found in a dataset, a practitioner must specify the number of features to use in the mixed membership model. Correctly specifying the number of features is crucial, as over-specification will lead to challenges in interpreting the results, and under-specification will lead to a model that is not flexible enough to capture the heterogeneity in the data. Although information criteria such as BIC and heuristics such as the elbow method have been shown to be informative in choosing the number of features in mixed membership models (Marco et al., 2024a, b), it is unclear whether these results extend to covariate adjusted models. In this case, we consider the performance of AIC (Akaike, 1974), BIC (Schwarz, 1978), DIC (Spiegelhalter et al., 2002; Celeux et al., 2006), and the elbow method in choosing the number of features in a covariate adjusted functional mixed membership model with one covariate. The information criteria used in this section are specified in Section 3.2 of the Supplementary Materials. As specified, the optimal model should have the smallest AIC, the largest BIC, and the smallest DIC.

Figure 2: Estimated values of AIC, BIC, DIC, and the log-likelihood for the 50 simulated datasets. The datasets were generated from a covariate adjusted functional mixed membership model with three features.

To evaluate the performance of the information criteria and the elbow method, 50 datasets were generated from a covariate adjusted functional mixed membership model with three features and one covariate. Each dataset was analyzed using four covariate adjusted functional mixed membership models, each with a single covariate and a varying number of features (K=2,3,4,5). Figure 2 shows the performance of the three information criteria. Similarly to Marco et al. (2024a) and Marco et al. (2024b), we can see that BIC had the best performance, picking the three-feature model all 50 times. From the simulation study, we found that AIC and DIC tended to choose more complex models, with AIC and DIC choosing the correctly specified model 68% and 26% of the time, respectively. Looking at the log-likelihood plot, we can see that there is a distinct elbow at K=3, indicating that heuristic methods such as the elbow method tend to be informative in choosing the number of features in covariate adjusted functional mixed membership models. Thus, we recommend using BIC and the elbow method in conjunction to choose the number of features; even though BIC correctly identified the model every time in this simulation, there were cases where the BIC values of the three-feature and four-feature models were similar. Additional details on how this simulation study was conducted can be found in Section 3.2 of the Supplementary Materials.
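A minimal sketch of the elbow heuristic, applied to illustrative (not actual) log-likelihood values over a grid of candidate K, is given below: the chosen K is the one after which the marginal gain from adding another feature drops off.

```python
import numpy as np

# Sketch of the elbow heuristic on the log-likelihood; values are made up.
K_grid = np.array([2, 3, 4, 5])
loglik = np.array([-410.0, -250.0, -245.0, -243.0])
gain = np.diff(loglik)                                # gain per added feature
elbow = K_grid[np.argmax(gain[:-1] - gain[1:]) + 1]   # K = 3 for these values
```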

4 A Case Study in EEG Imaging for Autism Spectrum Disorder

Autism spectrum disorder (ASD) is a developmental disorder characterized by social communication deficits and restrictive and/or repetitive behaviors (Edition, 2013). Although once more narrowly defined, autism is now seen as a spectrum, ranging from individuals with very mild symptoms to those who require lifelong support (Lord et al., 2018). In this case study, we analyze electroencephalogram (EEG) data obtained in a resting-state EEG study conducted by Dickinson et al. (2018). The study consisted of 58 children diagnosed with ASD, between 2 and 12 years of age, and 39 age-matched typically developing (TD) children, i.e., children who have never been diagnosed with ASD. The children were instructed to view bubbles on a monitor in a dark, sound-attenuated room for 2 minutes while EEG recordings were taken. The EEG recordings were obtained using a 128-channel HydroCel Geodesic Sensor Net and then interpolated to match the international 10-20 system 25-channel montage. The data were filtered using a band pass of 0.1 to 100 Hz and then transformed into the frequency domain using a fast Fourier transform. To obtain the relative power, the spectral densities were scaled so that they integrate to 1. Lastly, the relative power was averaged across the 25 channels to obtain a measure of the average relative power. Visualizations of the functional data that characterize the average relative power in the alpha band of frequencies can be seen in Figure 3.

Figure 3: (Left Panel) Alpha frequency patterns averaged over the 25 channels for a sample of EEG recordings from 30 individuals (ASD and TD) with varying ages. (Right Panel) Estimated effects of age on alpha oscillations obtained by fitting a function-on-scalar model.

In this case study, we will specifically analyze the alpha band of frequencies (neural activity between 6 Hz and 14 Hz) and compare how alpha oscillations differ between children with ASD and TD children. The alpha band of frequencies has been shown to play a role in neural coordination and communication between distributed brain regions (Fries, 2005; Klimesch et al., 2007). Alpha oscillations are composed of periodic and aperiodic neural activity patterns that coexist in the EEG spectra. Neuroscientists are primarily interested in periodic signals, specifically in the location of a single prominent peak in the spectral density located in the alpha band of frequencies, called the peak alpha frequency (PAF). Peak alpha frequency has been shown to be a biomarker of neural development in typically developing children (Rodríguez-Martínez et al., 2017). Studies have shown that the alpha peak becomes more prominent and shifts to a higher frequency within the first years of life for TD children (Rodríguez-Martínez et al., 2017; Scheffler et al., 2019). Compared to TD children, the emergence of a distinct alpha peak and developmental shifts have been shown to be atypical in children with ASD (Dickinson et al., 2018; Scheffler et al., 2019; Marco et al., 2024a, b).

In Section 4.1, we analyze how age affects alpha oscillations by fitting a covariate adjusted functional mixed membership model with the log transformation of age as the only covariate. This model provides insight into how alpha frequencies change as children age, regardless of diagnostic group. Additionally, the use of a covariate adjusted functional mixed membership model allows us to directly compare the age-adjusted relative strength of the PAF biomarker between individuals in the study, including between diagnostic groups. This analysis is extended in Section 4.2, where we analyze how developmental shifts differ based on diagnostic group. To conduct this analysis, a covariate adjusted functional mixed membership model was fit with the log transformation of age, diagnostic group, and an interaction between the log transformation of age and diagnostic group as covariates. This analysis provides novel insight into the developmental shifts for children with ASD and TD children. Although these two models may be considered nested models, each model provides unique insight into the data.

4.1 Modeling Alpha Oscillations Conditionally on Age

In the first part of this case study, we are primarily interested in modeling the developmental changes in alpha frequencies. As stated in Haegens et al. (2014), shifts in PAF can confound alpha power measures, demonstrating the need for a model that controls for the age of the child. In this subsection, we fit a covariate adjusted functional mixed membership model using the log transformation of age as the covariate. Using conditional predictive ordinates and comparing the pseudomarginal likelihoods, we found that a log transformation of age provided a better fitting model than using age untransformed. Although this model does not use information on the diagnostic group, we are also interested in characterizing the differences in the observed alpha oscillations between individuals across diagnostic groups.

From Figure 4, we can see that the first feature consists mainly of aperiodic neural activity patterns, commonly referred to as a 1/f trend or pink noise. On the other hand, the second feature can be interpreted as a distinct alpha peak, which is considered periodic neural activity. For children who load heavily on the second feature, we can see that as they age, the alpha peak increases in magnitude and the PAF shifts to a higher frequency, which has been observed in many other studies (Haegens et al., 2014; Rodríguez-Martínez et al., 2017; Scheffler et al., 2019). From a clinical perspective, it is also valuable to compare children’s alpha power conditional on age, since we know that there are developmental changes in alpha oscillations as children age. From Figure 4, we can also see that, on average, children with ASD have less attenuated alpha peaks compared to their age-adjusted TD counterparts.

Figure 4: (Top Panels) Estimates of the mean functions of the two functional features conditional on age. (Bottom Panel) Estimates of the allocation parameters found by fitting a covariate adjusted functional mixed membership model. Diagnostic group level means and medians of the allocation to the first feature are depicted as red triangles and green diamonds, respectively.

One key feature of this model is that it allows us to directly compare individuals in the TD group and the ASD group on the same spectrum. However, to do this, we had to assume that the alpha oscillations of children with ASD and TD children can be represented as a continuous mixture of the same two features shown in Figure 4. Just as developmental shifts in PAF can confound measures of alpha power (Haegens et al., 2014), the assumption that the alpha oscillations of children with ASD and TD children can be represented by the same two features can also confound the results in this section if that assumption does not hold. In Section 4.2, we relax this assumption and allow the functional features to differ by diagnostic group.

4.2 Modeling Alpha Oscillations Conditionally on Age and Diagnostic Group

Previous studies have shown that the emergence of alpha peaks and the developmental shifts in frequency are atypical in children with ASD (Dickinson et al., 2018; Scheffler et al., 2019; Marco et al., 2024a, b). Therefore, it may not be realistic to assume that the alpha oscillations of children with ASD and TD children can be represented by the same two features. In this section, we will fit a covariate adjusted functional mixed membership model using the log transformation of age, clinical diagnosis, and an interaction between the log transformation of age and clinical diagnosis as covariates of interest. By including the interaction between the log transformation of age and diagnostic group, we allow for differences in the developmental changes of alpha oscillations between diagnostic groups.

Figure 5: (Top Panels) Estimates of the developmental trajectories conditional on percentiles of the estimated allocation parameters by clinical diagnosis. (Bottom Panel) Estimates of the allocation parameters stratified by clinical diagnosis, with red squares representing the group-specific means, green diamonds the group-specific medians, blue triangles the group-specific 10th percentiles, and inverted blue triangles the group-specific 90th percentiles.

Compared to the model used in Section 4.1, this model is more flexible; it allows different features and different developmental trajectories for each diagnostic group. Although this model is more expressive, the comparison of individuals across diagnostic groups is no longer straightforward. The model in Section 4.1 did not use any diagnostic group information, which allowed us to easily compare TD children with children with ASD, as the distribution of the developmental trajectory depended only on the allocation parameters. In the present model, comparing individuals across diagnostic groups is not as simple as looking at the inferred allocation parameters, as the underlying features are group-specific. Therefore, two individuals with the same allocation parameters who are in different diagnostic groups will likely have completely different developmental trajectories. Moreover, in the model used in Section 4.1, we knew that at least one observation belonged fully to each feature, making the features interpretable as the extreme cases observed. For the more flexible model used in this subsection, however, there need not be an observation belonging fully to each feature within each diagnostic group. Indeed, in the TD group no individual loads more than 50% on the first feature, so the first TD-specific feature cannot be interpreted as the most extreme case observed in the TD group. Therefore, direct interpretation of the group-specific features is less useful, and we instead focus on the estimated trajectories of individuals conditional on empirical percentiles of the estimated allocation structure. Visualizations of the mean structure for the two features can be found in Section 3.3 of the Supplementary Materials.

From Figure 5, we can see the estimated developmental trajectories of TD children and children with ASD, conditional on different empirical percentiles of the estimated allocation structure. Looking at the estimated developmental trajectories of children with a group-specific median allocation structure, we can see that both TD children and children diagnosed with ASD have a discernible alpha peak and that the expected PAF shifts to higher frequencies as the child develops. However, compared to TD children, children with ASD exhibit more of a 1/f trend, and the shift in PAF is smaller in magnitude. Similar differences between TD children and children diagnosed with ASD can be seen at the 10th percentile of allocations to the first group-specific feature. However, at the group-specific 90th percentile, children with ASD do not have a discernible alpha peak at 25 months old, and a clear alpha peak never emerges as they develop. Alternatively, TD children at the 90th percentile still have a discernible alpha peak that becomes more prominent and shifts to higher frequencies as they develop. Compared to TD individuals at the 10th or 50th percentile, however, there is more aperiodic signal (1/f trend) present in the alpha power of TD children at the 90th percentile.

As discussed in Section 2.5, the proposed covariate adjusted mixed membership model can be thought of as a generalization of function-on-scalar regression. Figure 6 contains the estimated developmental trajectories obtained by fitting a function-on-scalar regression model using the "refund" package in R (Goldsmith et al., 2016). The results of the function-on-scalar regression coincide with the estimated developmental trajectories of the alpha oscillations obtained from our covariate adjusted mixed membership model, conditional on the group-specific mean allocation (group-specific mean allocations are represented by red squares in the bottom panel of Figure 5). As discussed in Section 2.5, function-on-scalar regression assumes a common mean for all individuals, controlling for age and diagnostic group. This is a relatively strong assumption in this case, as ASD is broadly defined, with diagnosed individuals exhibiting varying degrees of symptoms. Indeed, from the function-on-scalar results we can only make population-level inferences, which give no insight into the heterogeneity of alpha oscillation developmental trajectories for children with ASD (Figure 5). Overall, the added flexibility of covariate adjusted mixed membership models allows scientists to gain greater insight into the developmental changes of alpha oscillations. Technical details on how the models in Sections 4.1 and 4.2 were fit can be found in Section 3.3 of the Supplementary Materials, which also discusses model comparison using conditional predictive ordinates (Pettit, 1990; Chen et al., 2012; Lewis et al., 2014) and the pseudomarginal likelihood.
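For readers who wish to reproduce the population-level benchmark, the sketch below shows a minimal function-on-scalar fit with "refund". The data layout is assumed: dat$Y is an N x T matrix of alpha power curves evaluated on a grid freq, and log_age and group are hypothetical variable names; the interaction is entered as an explicit product to keep the formula simple.

```r
# Minimal sketch of the benchmark function-on-scalar regression (assumed data
# layout: dat$Y is an N x T matrix of curves observed on the grid `freq`).
library(refund)

dat$age_by_group <- dat$log_age * as.numeric(dat$group == "ASD")
fit <- pffr(Y ~ log_age + group + age_by_group, yind = freq, data = dat)
summary(fit)  # smooth coefficient functions over frequency, one per covariate
```

Note that pffr estimates a smooth coefficient function for each scalar covariate, mirroring the common-mean assumption discussed above.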

Figure 6: (Left Panel) Estimates of the developmental trajectories stratified by diagnostic group, conditional on the group-specific mean allocation (red squares in Figure 5). (Right Panel) Estimated population-level developmental trajectories stratified by diagnostic group, obtained by fitting a function-on-scalar regression model.

5 Discussion

In this manuscript, we extend the framework of the functional mixed membership model proposed by Marco et al. (2024a) to allow for covariate dependence. This work was primarily motivated by a neurodevelopmental EEG study on alpha oscillations, which are known to change as children age. Although mixed membership models provided a novel way to quantify the emergence and presence of a developmental biomarker known as an alpha peak (Marco et al., 2024a, b), it has been shown that not accounting for developmental shifts in the alpha peak can confound measures of the peak alpha frequency (Haegens et al., 2014), leading to the need for a covariate adjusted functional mixed membership model. In Section 2, we derive the covariate adjusted functional mixed membership model and compare the framework with common covariate-dependent clustering frameworks, such as the mixture of experts model and the mixture of regressions model. After specifying the proposed Bayesian model, we provide a set of sufficient conditions in Lemma 2.1 to ensure that the proposed covariate adjusted mixed membership model has identifiable mean, covariance, and allocation structures. Section 3 contains a simulation study exploring the empirical convergence properties of our model. We conclude by revisiting the neurodevelopmental case study on alpha oscillations in Section 4, where we fit two covariate adjusted functional mixed membership models to gain novel information on the developmental differences in alpha oscillations between children with ASD and TD children. In Section 4.1, we fit a covariate adjusted functional mixed membership model adjusting for age, which allowed us to directly compare children with ASD to TD children on the same spectrum while controlling for age. In Section 4.2, we fit a covariate adjusted functional mixed membership model adjusting for age and diagnostic group, which no longer allowed us to directly compare children with ASD to TD children; however, we were able to gain novel insight into the developmental trajectory of alpha oscillations for children with ASD.

Although the proposed model and Lemma 2.1 were introduced for an arbitrary number of features, there are computational challenges to working with models with more than three features. As discussed in Section 2.4, we typically have to assume either the separability condition or the sufficiently scattered condition. In a two-feature model, these two assumptions are equivalent and can be enforced by post-processing the Markov chain constructed through MCMC. When considering a model with three or more features, the separability condition becomes a relatively strong geometric assumption on the generating process, making the sufficiently scattered condition more theoretically appealing. For a three-feature model, we can similarly post-process the Markov chain utilizing the work of Pârvu and Gilbert (2016) to arrive at posterior samples that satisfy the sufficiently scattered assumption. However, ensuring that the sufficiently scattered assumption holds for models with four or more features remains an open problem. Similarly, more work is needed to derive a sampler that obeys the geometric constraints set forth by the sufficiently scattered condition.

As discussed in Section 2.5, covariate adjusted functional mixed membership models can be thought of as a more granular version of function-on-scalar regression. Although function-on-scalar models can typically handle a large number of covariates, with regularized function-on-scalar models (Fan and Reimherr, 2017; Kowal and Bourgeois, 2020) becoming more popular, the model proposed in this manuscript can handle only a few covariates. Although fitting a covariate adjusted functional mixed membership model with more covariates may be computationally feasible, the simulation study in Section 3.1 shows that we need significantly more observations as we add covariates and lose considerable statistical efficiency when over-specifying the model. Although one may be tempted to use information criteria to perform variable selection, we urge caution, as the inclusion of additional covariates can greatly impact the interpretation of the allocation parameters, as seen in Section 4.2. Thus, the incorporation of a covariate in the model should be decided based on the modeling goals, prior knowledge of how the covariates affect the observed functions, and computational and statistical efficiency considerations.

As observed in unadjusted mixed membership models (Marco et al., 2024a, b), the posterior distribution often has multiple modes, leading to poor posterior exploration using traditional sampling methods. Thus, we use an algorithm similar to Algorithm 1 described in the supplement of Marco et al. (2024a) to pick a good starting point for our Markov chain. In addition to finding good initial starting points, we also outline a tempered-transition sampling scheme (Pritchard et al., 2000; Behrens et al., 2012) in Section 2.2 of the Supplementary Materials, which allows us to traverse areas of low posterior probability. An R package for the proposed covariate adjusted functional mixed membership model is available for download at https://github.com/ndmarco/BayesFMMM.

6 Funding Acknowledgments

The authors gratefully acknowledge funding from the NIH/NIMH R01MH122428-01 (DS,DT).

7 Disclosure

The authors report that there are no competing interests to declare.

References

  • Müller et al. [2011] Peter Müller, Fernando Quintana, and Gary L Rosner. A product partition model with regression on covariates. Journal of Computational and Graphical Statistics, 20(1):260–278, 2011.
  • Qin and Self [2006] Li-Xuan Qin and Steven G Self. The clustering of regression models method with applications in gene expression data. Biometrics, 62(2):526–533, 2006.
  • Hyun et al. [2023] Sangwon Hyun, Mattias Rolf Cape, Francois Ribalet, and Jacob Bien. Modeling cell populations measured by flow cytometry with covariates using sparse mixture of regressions. The Annals of Applied Statistics, 17(1):357–377, 2023.
  • Guha and Guhaniyogi [2022] Sharmistha Guha and Rajarshi Guhaniyogi. Bayesian covariate-dependent clustering of undirected networks with brain-imaging data. 2022.
  • Binkiewicz et al. [2017] Norbert Binkiewicz, Joshua T Vogelstein, and Karl Rohe. Covariate-assisted spectral clustering. Biometrika, 104(2):361–377, 2017.
  • Page and Quintana [2016] Garritt L Page and Fernando A Quintana. Spatial product partition models. 2016.
  • McLachlan et al. [2019] Geoffrey J McLachlan, Sharon X Lee, and Suren I Rathnayake. Finite mixture models. Annual review of statistics and its application, 6:355–378, 2019.
  • Faria and Soromenho [2010] Susana Faria and Gilda Soromenho. Fitting mixtures of linear regressions. Journal of Statistical Computation and Simulation, 80(2):201–225, 2010.
  • Grün et al. [2007] Bettina Grün, Friedrich Leisch, et al. Applications of finite mixtures of regression models. URL: http://cran.r-project.org/web/packages/flexmix/vignettes/regression-examples.pdf, 2007.
  • Khalili and Chen [2007] Abbas Khalili and Jiahua Chen. Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479):1025–1038, 2007.
  • Devijver [2015] Emilie Devijver. Finite mixture regression: a sparse variable selection by model selection for clustering. 2015.
  • Jordan and Jacobs [1994] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
  • Bishop and Svenskn [2002] Christopher M Bishop and Markus Svenskn. Bayesian hierarchical mixtures of experts. In Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence, pages 57–64, 2002.
  • Kopf et al. [2021] Andreas Kopf, Vincent Fortuin, Vignesh Ram Somnath, and Manfred Claassen. Mixture-of-experts variational autoencoder for clustering and generating from similarity-based representations on single cell data. PLoS computational biology, 17(6):e1009086, 2021.
  • Tsai et al. [2020] Tsung Wei Tsai, Chongxuan Li, and Jun Zhu. Mice: Mixture of contrastive experts for unsupervised image clustering. In International conference on learning representations, 2020.
  • Park and Dunson [2010] Ju-Hyun Park and David B Dunson. Bayesian generalized product partition model. Statistica Sinica, pages 1203–1226, 2010.
  • Marco et al. [2024a] Nicholas Marco, Damla Şentürk, Shafali Jeste, Charlotte DiStefano, Abigail Dickinson, and Donatello Telesca. Functional mixed membership models. Journal of Computational and Graphical Statistics, pages 1–11, 2024a.
  • Erosheva et al. [2004] E Erosheva, S Fienberg, and J Lafferty. Mixed membership models of scientific publications. PNAS, 2004.
  • Heller et al. [2008] Katherine A Heller, Sinead Williamson, and Zoubin Ghahramani. Statistical models for partial membership. In Proceedings of the 25th International Conference on Machine learning, pages 392–399, 2008.
  • Gruhl and Erosheva [2014] Jonathan Gruhl and Elena Erosheva. A Tale of Two (Types of) Memberships: Comparing Mixed and Partial Membership with a Continuous Data Example, chapter 2. CRC Press, 2014.
  • Marco et al. [2024b] Nicholas Marco, Damla Şentürk, Shafali Jeste, Charlotte C DiStefano, Abigail Dickinson, and Donatello Telesca. Flexible regularized estimation in high-dimensional mixed membership models. Computational Statistics & Data Analysis, page 107931, 2024b.
  • Yao et al. [2011] Fang Yao, Yuejiao Fu, and Thomas CM Lee. Functional mixture regression. Biostatistics, 12(2):341–353, 2011.
  • Gaffney and Smyth [1999] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 63–72, 1999.
  • Haegens et al. [2014] Saskia Haegens, Helena Cousijn, George Wallis, Paul J Harrison, and Anna C Nobre. Inter-and intra-individual variability in alpha peak frequency. Neuroimage, 92:46–55, 2014.
  • Scheffler et al. [2019] Aaron W Scheffler, Donatello Telesca, Catherine A Sugar, Shafali Jeste, Abigail Dickinson, Charlotte DiStefano, and Damla Şentürk. Covariate-adjusted region-referenced generalized functional linear model for eeg data. Statistics in medicine, 38(30):5587–5602, 2019.
  • Rodríguez-Martínez et al. [2017] EI Rodríguez-Martínez, FJ Ruiz-Martínez, CI Barriga Paulino, and Carlos M Gómez. Frequency shift in topography of spontaneous brain rhythms from childhood to adulthood. Cognitive neurodynamics, 11(1):23–33, 2017.
  • Wang et al. [2016] Jane-Ling Wang, Jeng-Min Chiou, and Hans-Georg Müller. Functional data analysis. Annual Review of Statistics and its application, 3(1):257–295, 2016.
  • James and Sugar [2003] Gareth M James and Catherine A Sugar. Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462):397–408, 2003.
  • Chiou and Li [2007] Jeng-Min Chiou and Pai-Ling Li. Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):679–699, 2007.
  • Jacques and Preda [2014] Julien Jacques and Cristian Preda. Model-based clustering for multivariate functional data. Computational Statistics & Data Analysis, 71:92–106, 2014.
  • Ramsay and Silverman [2005a] JO Ramsay and BW Silverman. Principal components analysis for functional data. Functional data analysis, pages 147–172, 2005a.
  • Happ and Greven [2018] Clara Happ and Sonja Greven. Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, 113(522):649–659, 2018.
  • Faraway [1997] Julian J Faraway. Regression analysis for a functional response. Technometrics, 39(3):254–261, 1997.
  • Brumback and Rice [1998] Babette A Brumback and John A Rice. Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association, 93(443):961–976, 1998.
  • Bhattacharya and Dunson [2011] Anirban Bhattacharya and David B Dunson. Sparse bayesian infinite factor models. Biometrika, pages 291–306, 2011.
  • Lang and Brezger [2004] Stefan Lang and Andreas Brezger. Bayesian p-splines. Journal of computational and graphical statistics, 13(1):183–212, 2004.
  • Chen et al. [2022] Yinyin Chen, Shishuang He, Yun Yang, and Feng Liang. Learning topic models: Identifiability and finite-sample analysis. Journal of the American Statistical Association, pages 1–16, 2022.
  • Stephens [2000] Matthew Stephens. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):795–809, 2000.
  • Pettit [1990] LI Pettit. The conditional predictive ordinate for the normal distribution. Journal of the Royal Statistical Society: Series B (Methodological), 52(1):175–184, 1990.
  • Donoho and Stodden [2003] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in neural information processing systems, 16, 2003.
  • Arora et al. [2012] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going beyond svd. In 2012 IEEE 53rd annual symposium on foundations of computer science, pages 1–10. IEEE, 2012.
  • Azar et al. [2001] Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry, and Jared Saia. Spectral analysis of data. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 619–626, 2001.
  • Ramsay and Silverman [2005b] J. Ramsay and B.W. Silverman. Functional Data Analysis. Springer Series in Statistics. Springer, 2005b. ISBN 9780387400808. URL https://books.google.com/books?id=mU3dop5wY_4C.
  • Morris [2015] Jeffrey S Morris. Functional regression. Annual Review of Statistics and Its Application, 2:321–359, 2015.
  • Krafty et al. [2008] Robert T Krafty, Phyllis A Gimotty, David Holtz, George Coukos, and Wensheng Guo. Varying coefficient model with unknown within-subject covariance for analysis of tumor growth curves. Biometrics, 64(4):1023–1031, 2008.
  • Reiss et al. [2010] Philip T Reiss, Lei Huang, and Maarten Mennes. Fast function-on-scalar regression with penalized basis expansions. The international journal of biostatistics, 6(1), 2010.
  • Goldsmith et al. [2015] Jeff Goldsmith, Vadim Zipunnikov, and Jennifer Schrack. Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics, 71(2):344–353, 2015.
  • Kowal and Bourgeois [2020] Daniel R Kowal and Daniel C Bourgeois. Bayesian function-on-scalars regression for high-dimensional data. Journal of Computational and Graphical Statistics, 29(3):629–638, 2020.
  • Shang [2014] Han Lin Shang. A survey of functional principal component analysis. AStA Advances in Statistical Analysis, 98:121–142, 2014.
  • Rice and Silverman [1991] John A Rice and Bernard W Silverman. Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society: Series B (Methodological), 53(1):233–243, 1991.
  • Yao et al. [2005] Fang Yao, Hans-Georg Müller, and Jane-Ling Wang. Functional data analysis for sparse longitudinal data. Journal of the American statistical association, 100(470):577–590, 2005.
  • Morris and Carroll [2006] Jeffrey S Morris and Raymond J Carroll. Wavelet-based functional mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(2):179–199, 2006.
  • Staicu et al. [2010] Ana-Maria Staicu, Ciprian M Crainiceanu, and Raymond J Carroll. Fast methods for spatially correlated multilevel functional data. Biostatistics, 11(2):177–194, 2010.
  • Akaike [1974] Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
  • Schwarz [1978] Gideon Schwarz. Estimating the dimension of a model. The annals of statistics, pages 461–464, 1978.
  • Spiegelhalter et al. [2002] David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the royal statistical society: Series B (statistical methodology), 64(4):583–639, 2002.
  • Celeux et al. [2006] Gilles Celeux, Florence Forbes, Christian P Robert, and D Mike Titterington. Deviance information criteria for missing data models. Bayesian analysis, 1(4):651–673, 2006.
  • American Psychiatric Association [2013] Diagnostic and statistical manual of mental disorders, fifth edition. American Psychiatric Association, 2013.
  • Lord et al. [2018] Catherine Lord, Mayada Elsabbagh, Gillian Baird, and Jeremy Veenstra-Vanderweele. Autism spectrum disorder. The Lancet, 392(10146):508–520, 2018.
  • Dickinson et al. [2018] Abigail Dickinson, Charlotte DiStefano, Damla Senturk, and Shafali Spurling Jeste. Peak alpha frequency is a neural marker of cognitive function across the autism spectrum. European Journal of Neuroscience, 47(6):643–651, 2018.
  • Fries [2005] Pascal Fries. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence. Trends in cognitive sciences, 9(10):474–480, 2005.
  • Klimesch et al. [2007] Wolfgang Klimesch, Paul Sauseng, and Simon Hanslmayr. Eeg alpha oscillations: the inhibition–timing hypothesis. Brain research reviews, 53(1):63–88, 2007.
  • Goldsmith et al. [2016] Jeff Goldsmith, Fabian Scheipl, Lei Huang, Julia Wrobel, J Gellar, J Harezlak, MW McLean, B Swihart, L Xiao, C Crainiceanu, et al. Refund: Regression with functional data. R package version 0.1-16, 572, 2016.
  • Chen et al. [2012] Ming-Hui Chen, Qi-Man Shao, and Joseph G Ibrahim. Monte Carlo methods in Bayesian computation. Springer Science & Business Media, 2012.
  • Lewis et al. [2014] Paul O Lewis, Wangang Xie, Ming-Hui Chen, Yu Fan, and Lynn Kuo. Posterior predictive bayesian phylogenetic model selection. Systematic biology, 63(3):309–321, 2014.
  • Pârvu and Gilbert [2016] Ovidiu Pârvu and David Gilbert. Implementation of linear minimum area enclosing triangle algorithm: Application note. Computational and Applied Mathematics, 35:423–438, 2016.
  • Fan and Reimherr [2017] Zhaohu Fan and Matthew Reimherr. High-dimensional adaptive function-on-scalar regression. Econometrics and statistics, 1:167–183, 2017.
  • Pritchard et al. [2000] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
  • Behrens et al. [2012] Gundula Behrens, Nial Friel, and Merrilee Hurn. Tuning tempered transitions. Statistics and computing, 22(1):65–78, 2012.
  • Roeder and Wasserman [1997] Kathryn Roeder and Larry Wasserman. Practical bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 92(439):894–902, 1997.

Supplemental Materials: Covariate Adjusted Functional Mixed Membership Models

1 Proof of Lemma 2.1

We will start by defining identifiability and the notation used in this section. Let $\boldsymbol{\omega}=\{\boldsymbol{\nu}_{1},\dots,\boldsymbol{\nu}_{K},\boldsymbol{\eta}_{1},\dots,\boldsymbol{\eta}_{K},\{Z_{j1},\dots,Z_{jK}\}_{j=1}^{N},\{\boldsymbol{\Sigma}_{jk}\}_{1\leq j\leq k\leq K},\sigma^{2}\}$, where $\boldsymbol{\Sigma}_{jk}=\sum_{m=1}^{M}\left(\boldsymbol{\phi}_{jm}\boldsymbol{\phi}^{\prime}_{km}\right)$. We will say that the parameters $\boldsymbol{\omega}$ are unidentifiable if there exists at least one $\boldsymbol{\omega}^{*}\neq\boldsymbol{\omega}$ such that $\mathcal{L}(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega},\mathbf{x}_{i})=\mathcal{L}(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega}^{*},\mathbf{x}_{i})$ for all sets of observations $\{\mathbf{Y}_{i}(\mathbf{t}_{i})\}_{i=1}^{N}$ following Assumptions (1)-(3); otherwise, the parameters $\boldsymbol{\omega}$ are called identifiable. Here, $\mathcal{L}(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega},\mathbf{x}_{i})$ is the likelihood specified in Equation 12 of the main text.

From Equation 12 in the main text, we have that

\[
\mathcal{L}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega},\mathbf{x}_{i}\right)\propto\exp\left\{-\frac{1}{2}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})-\mu_{i}(\mathbf{x}_{i},\mathbf{t}_{i})\right)^{\prime}\left(\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}\right)^{-1}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})-\mu_{i}(\mathbf{x}_{i},\mathbf{t}_{i})\right)\right\}, \tag{1}
\]

where

\[
\mu_{i}(\mathbf{x}_{i},\mathbf{t}_{i})=\sum_{k=1}^{K}Z_{ik}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)
\]

and

\[
\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\left\{\mathbf{S}^{\prime}(\mathbf{t}_{i})\sum_{m=1}^{M}\left(\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right)\mathbf{S}(\mathbf{t}_{i})\right\}.
\]

Assume that $\mathcal{L}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega},\mathbf{x}_{i}\right)=\mathcal{L}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega}^{*},\mathbf{x}_{i}\right)$ for all sets of observations $\{\mathbf{Y}_{i}(\mathbf{t}_{i})\}_{i=1}^{N}$ that follow Assumptions (1)-(3). We would like to prove that $\boldsymbol{\omega}^{*}=\boldsymbol{\omega}$ must necessarily hold. Since $\mathcal{L}\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\omega},\mathbf{x}_{i}\right)$ is written as a quadratic form in $\mathbf{Y}_{i}(\mathbf{t}_{i})$ and $\left(\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}\right)$ is full rank, we see that the following must necessarily be true:

  1. $\mu_{i}^{*}(\mathbf{x}_{i},\mathbf{t}_{i})=\mu_{i}(\mathbf{x}_{i},\mathbf{t}_{i})$,

  2. $\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})+(\sigma^{2})^{*}\mathbf{I}_{n_{i}}=\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}$,

for $i=1,\dots,N$. By (1), we have that

\[
\sum_{k=1}^{K}Z_{ik}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)=\sum_{k=1}^{K}Z_{ik}^{*}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}^{*}+\boldsymbol{\eta}_{k}^{*}\mathbf{x}_{i}^{\prime}\right)\quad(i=1,\dots,N).
\]

Letting $\boldsymbol{\mu}_{k}=[\boldsymbol{\nu}_{k}\;\boldsymbol{\eta}_{k}]\in\mathbb{R}^{P\times(R+1)}$ and $\tilde{\mathbf{x}}_{i}=[1\;\mathbf{x}_{i}]$ (where $\tilde{\mathbf{X}}\in\mathbb{R}^{N\times(R+1)}$ is the design matrix with $i^{th}$ row $\tilde{\mathbf{x}}_{i}$), we have

\[
\sum_{k=1}^{K}Z_{ik}\mathbf{S}^{\prime}(\mathbf{t}_{i})\boldsymbol{\mu}_{k}\tilde{\mathbf{x}}_{i}^{\prime}=\sum_{k=1}^{K}Z_{ik}^{*}\mathbf{S}^{\prime}(\mathbf{t}_{i})\boldsymbol{\mu}_{k}^{*}\tilde{\mathbf{x}}_{i}^{\prime}\;(i=1,\dots,N)\iff\sum_{k=1}^{K}Z_{ik}\boldsymbol{\mu}_{k}\tilde{\mathbf{x}}_{i}^{\prime}=\sum_{k=1}^{K}Z_{ik}^{*}\boldsymbol{\mu}_{k}^{*}\tilde{\mathbf{x}}_{i}^{\prime}\;(i=1,\dots,N), \tag{2}
\]

since $n_{i}\geq P$ by Assumption (3). Since $\tilde{\mathbf{X}}$ is full column rank by Assumption (1), we have that

\[
\sum_{k=1}^{K}Z_{ik}\boldsymbol{\mu}_{k}=\sum_{k=1}^{K}Z_{ik}^{*}\boldsymbol{\mu}_{k}^{*}\quad(i=1,\dots,N). \tag{3}
\]

It is important to note that if Assumption (1) did not hold and $\tilde{\mathbf{X}}$ were not full column rank, we could add any vector in the nullspace of $\tilde{\mathbf{X}}$ to any row of $\boldsymbol{\mu}_{k}^{*}$ ($k=1,\dots,K$) and Equation 2 would still hold.

We can rewrite Equation 3 in matrix form such that

\[
\mathbf{Z}\tilde{\boldsymbol{\mu}}=\mathbf{Z}^{*}\tilde{\boldsymbol{\mu}}^{*}, \tag{4}
\]

where $\tilde{\boldsymbol{\mu}}=[\text{vec}(\boldsymbol{\mu}_{1}),\dots,\text{vec}(\boldsymbol{\mu}_{K})]^{\prime}\in\mathbb{R}^{K\times P(R+1)}$ and $\mathbf{Z}\in\mathbb{R}^{N\times K}$ is the matrix of allocation parameters with $\mathbf{z}_{i}$ as its $i^{th}$ row. From this we can directly apply the results of Chen et al. [2022] to show that $\mathbf{Z}=\mathbf{Z}^{*}$ and $\tilde{\boldsymbol{\mu}}=\tilde{\boldsymbol{\mu}}^{*}$ up to a permutation of the labels. Specifically, if the separability condition holds, this follows from Proposition 1 of Chen et al. [2022]; if the sufficiently scattered condition holds, it follows from Theorem 2 of Chen et al. [2022]. Therefore, under Assumptions (1)-(3), we have $Z_{ik}=Z_{ik}^{*}$, $\boldsymbol{\nu}_{k}=\boldsymbol{\nu}_{k}^{*}$, and $\boldsymbol{\eta}_{k}=\boldsymbol{\eta}_{k}^{*}$ up to a permutation of the labels, for $k=1,\dots,K$ and $i=1,\dots,N$.

From condition (2), we have that

\[
\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})+(\sigma^{2})^{*}\mathbf{I}_{n_{i}}=\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}\iff\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})-\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\left((\sigma^{2})^{*}-\sigma^{2}\right)\mathbf{I}_{n_{i}}.
\]

Suppose that $(\sigma^{2})^{*}-\sigma^{2}\neq 0$; then we have that

\[
\text{rank}\left(\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})-\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})\right)=\text{rank}\left(\left((\sigma^{2})^{*}-\sigma^{2}\right)\mathbf{I}_{n_{i}}\right)>P,
\]

by Assumption (3) ($n_{i}>P$). Writing $\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})$ as

\[
\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\mathbf{S}^{\prime}(\mathbf{t}_{i})\left\{\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\sum_{m=1}^{M}\left(\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right)\right\}\mathbf{S}(\mathbf{t}_{i}),
\]

we can see that $\text{rank}\left(\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})\right)\leq P$. Since $\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})$ admits the same representation with the same basis matrix $\mathbf{S}(\mathbf{t}_{i})\in\mathbb{R}^{P\times n_{i}}$, the difference $\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})-\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})$ is also a quadratic form in $\mathbf{S}(\mathbf{t}_{i})$, which implies that $\text{rank}\left(\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})-\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})\right)\leq P$, leading to a contradiction. Therefore, we have $(\sigma^{2})^{*}=\sigma^{2}$ and $\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})=\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})$ up to a permutation of the labels. Assuming no permutation of the labels ($\mathbf{Z}=\mathbf{Z}^{*}$), we have that

\[
\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\mathbf{V}^{*}(\mathbf{t}_{i},\mathbf{z}_{i}^{*})\iff\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\boldsymbol{\Sigma}_{kk^{\prime}}=\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\boldsymbol{\Sigma}_{kk^{\prime}}^{*},
\]

since $n_{i}>P$ for all $i$. Therefore, we have the following system of equations:

\[
\sum_{k=1}^{K}Z_{ik}^{2}\left(\boldsymbol{\Sigma}_{kk}-\boldsymbol{\Sigma}_{kk}^{*}\right)+2\sum_{k=1}^{K}\sum_{k^{\prime}>k}Z_{ik}Z_{ik^{\prime}}\left(\boldsymbol{\Sigma}_{kk^{\prime}}-\boldsymbol{\Sigma}_{kk^{\prime}}^{*}\right)=\mathbf{0},
\]

for $i=1,\dots,N$. Equality can be established if this system contains $\frac{K^{2}+K}{2}$ linearly independent equations, which holds if the coefficient matrix, denoted $\mathbf{C}$, has rank $\frac{K^{2}+K}{2}$. Assumption (2) gives us that $\mathbf{C}$ has full column rank, so the system does contain $\frac{K^{2}+K}{2}$ linearly independent equations. Therefore, we have that $\boldsymbol{\Sigma}_{kk^{\prime}}=\boldsymbol{\Sigma}_{kk^{\prime}}^{*}$ for $1\leq k,k^{\prime}\leq K$, up to a permutation of the labels. Thus, for $i=1,\dots,N$ and $k,k^{\prime}=1,\dots,K$, the parameters $\boldsymbol{\nu}_{k}$, $\boldsymbol{\eta}_{k}$, $Z_{ik}$, $\sum_{m=1}^{M}\left(\boldsymbol{\phi}_{km}\boldsymbol{\phi}^{\prime}_{k^{\prime}m}\right)$, and $\sigma^{2}$ are identifiable up to a permutation of the labels given Assumptions (1)-(3).

2 Computation

2.1 Posterior Distributions

In this subsection, we specify the posterior distributions for the functional covariate adjusted mixed membership model proposed in the main manuscript. We will start with the $\boldsymbol{\phi}_{jm}$ parameters, for $j=1,\dots,K$ and $m=1,\dots,M$. Let $\mathbf{D}_{\boldsymbol{\phi}_{jm}}=\tilde{\tau}_{\boldsymbol{\phi}_{mj}}^{-1}\,\text{diag}\left(\gamma_{\boldsymbol{\phi}_{j1m}}^{-1},\dots,\gamma_{\boldsymbol{\phi}_{jPm}}^{-1}\right)$. By letting

\[
\mathbf{m}_{\boldsymbol{\phi}_{jm}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\Bigg(B(t_{il})\chi_{im}\Bigg(y_{i}(t_{il})Z_{ij}-Z_{ij}^{2}\left(\boldsymbol{\nu}_{j}+\boldsymbol{\eta}_{j}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})-Z_{ij}^{2}\sum_{n\neq m}\chi_{in}\boldsymbol{\phi}_{jn}^{\prime}B(t_{il})-\sum_{k\neq j}Z_{ij}Z_{ik}\Big[\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}\chi_{in}\boldsymbol{\phi}_{kn}^{\prime}B(t_{il})\Big]\Bigg)\Bigg)
\]

and

\[
\mathbf{M}_{\boldsymbol{\phi}_{jm}}^{-1}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(Z_{ij}^{2}\chi_{im}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\mathbf{D}_{\boldsymbol{\phi}_{jm}}^{-1},
\]

we have that

\[
\boldsymbol{\phi}_{jm}\mid\boldsymbol{\Theta}_{-\boldsymbol{\phi}_{jm}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\mathbf{m}_{\boldsymbol{\phi}_{jm}},\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right).
\]
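Each Gaussian full conditional above is parameterized by its precision $\mathbf{M}_{\boldsymbol{\phi}_{jm}}^{-1}$ and the vector $\mathbf{m}_{\boldsymbol{\phi}_{jm}}$, so the draw can be implemented through one Cholesky factorization without explicitly inverting the precision. A minimal R sketch, where Minv and m stand in for the quantities assembled above:

```r
# Draw from N(M m, M) given the precision Minv = M^{-1}; Minv and m are
# placeholders for the sums assembled in the two displays above.
draw_from_precision <- function(Minv, m) {
  R  <- chol(Minv)                           # Minv = R'R with R upper triangular
  mu <- backsolve(R, forwardsolve(t(R), m))  # solves Minv mu = m, i.e. mu = M m
  mu + backsolve(R, rnorm(length(m)))        # R^{-1} z has covariance Minv^{-1} = M
}
```

The same routine applies to the $\boldsymbol{\nu}_{j}$ and $\boldsymbol{\eta}_{jd}$ updates below, which share this precision-based form.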

The posterior distribution of $\delta_{1k}$, for $k=1,\dots,K$, is

\[
\delta_{1k}\mid\boldsymbol{\Theta}_{-\delta_{1k}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\Bigg(a_{1k}+\frac{PM}{2},\;1+\frac{1}{2}\sum_{r=1}^{P}\gamma_{k,r,1}\phi_{k,r,1}^{2}+\frac{1}{2}\sum_{m=2}^{M}\sum_{r=1}^{P}\gamma_{k,r,m}\phi_{k,r,m}^{2}\Big(\prod_{j=2}^{m}\delta_{jk}\Big)\Bigg).
\]

The posterior distribution of $\delta_{ik}$, for $i=2,\dots,M$ and $k=1,\dots,K$, is

\[
\delta_{ik}\mid\boldsymbol{\Theta}_{-\delta_{ik}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\Bigg(a_{2k}+\frac{P(M-i+1)}{2},\;1+\frac{1}{2}\sum_{m=i}^{M}\sum_{r=1}^{P}\gamma_{k,r,m}\phi_{k,r,m}^{2}\Big(\prod_{j=1;\,j\neq i}^{m}\delta_{jk}\Big)\Bigg).
\]

The posterior distribution of $a_{1k}$ ($k=1,\dots,K$) is not a commonly known distribution; however, we have that

\[
P(a_{1k}\mid\boldsymbol{\Theta}_{-a_{1k}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{1k})}\delta_{1k}^{a_{1k}-1}a_{1k}^{\alpha_{1}-1}\exp\left\{-a_{1k}\beta_{1}\right\}.
\]

Since this is not the kernel of a known distribution, we use the Metropolis-Hastings algorithm. Consider the proposal distribution $Q(a_{1k}^{\prime}\mid a_{1k})=\mathcal{N}\left(a_{1k},\epsilon_{1}\beta_{1}^{-1},0,+\infty\right)$ (truncated normal) for some small $\epsilon_{1}>0$. Thus the probability of accepting any step is

\[
A(a_{1k}^{\prime},a_{1k})=\min\left\{1,\frac{P\left(a_{1k}^{\prime}\mid\boldsymbol{\Theta}_{-a_{1k}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{1k}\mid\boldsymbol{\Theta}_{-a_{1k}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{1k}\mid a_{1k}^{\prime}\right)}{Q\left(a_{1k}^{\prime}\mid a_{1k}\right)}\right\}.
\]
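A minimal sketch of this update, assuming the truncnorm package and treating $\epsilon_{1}\beta_{1}^{-1}$ as the proposal variance; the function and argument names are illustrative only:

```r
library(truncnorm)

# Unnormalized log posterior of a_1k from the display above
log_post_a1k <- function(a, delta_1k, alpha1, beta1) {
  -lgamma(a) + (a - 1) * log(delta_1k) + (alpha1 - 1) * log(a) - beta1 * a
}

mh_update_a1k <- function(a, delta_1k, alpha1, beta1, eps1 = 0.1) {
  s     <- sqrt(eps1 / beta1)  # proposal standard deviation
  a_new <- rtruncnorm(1, a = 0, b = Inf, mean = a, sd = s)
  log_A <- log_post_a1k(a_new, delta_1k, alpha1, beta1) -
           log_post_a1k(a,     delta_1k, alpha1, beta1) +
           log(dtruncnorm(a,     a = 0, b = Inf, mean = a_new, sd = s)) -
           log(dtruncnorm(a_new, a = 0, b = Inf, mean = a,     sd = s))
  if (log(runif(1)) < log_A) a_new else a
}
```

The update for $a_{2k}$ below is identical apart from the target density.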

Similarly, for $a_{2k}$ ($k=1,\dots,K$), we have

\[
P(a_{2k}\mid\boldsymbol{\Theta}_{-a_{2k}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{2k})^{M-1}}\left(\prod_{i=2}^{M}\delta_{ik}^{a_{2k}-1}\right)a_{2k}^{\alpha_{2k}-1}\exp\left\{-a_{2k}\beta_{2}\right\}.
\]

We use a similar proposal distribution, $Q(a_{2k}^{\prime}\mid a_{2k})=\mathcal{N}\left(a_{2k},\epsilon_{2}\beta_{2}^{-1},0,+\infty\right)$ for some small $\epsilon_{2}>0$. Thus the probability of accepting any step is

\[
A(a_{2k}^{\prime},a_{2k})=\min\left\{1,\frac{P\left(a_{2k}^{\prime}\mid\boldsymbol{\Theta}_{-a_{2k}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{2k}\mid\boldsymbol{\Theta}_{-a_{2k}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{2k}\mid a_{2k}^{\prime}\right)}{Q\left(a_{2k}^{\prime}\mid a_{2k}\right)}\right\}.
\]

The posterior distribution of the $\mathbf{z}_{i}$ parameters is not a commonly known distribution, so we use the Metropolis-Hastings algorithm. We know that

\[
p(\mathbf{z}_{i}\mid\boldsymbol{\Theta}_{-\mathbf{z}_{i}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}\times\prod_{l=1}^{n_{i}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{m=1}^{M}\chi_{im}\boldsymbol{\phi}_{km}^{\prime}B(t_{il})\right)\right)^{2}\right\}.
\]

We use $Q(\mathbf{z}_{i}^{\prime}\mid\mathbf{z}_{i})=\text{Dir}(a_{\mathbf{z}}\mathbf{z}_{i})$ for some large $a_{\mathbf{z}}\in\mathbb{R}^{+}$ as the proposal distribution. Thus the probability of accepting a proposed step is

\[
A(\mathbf{z}_{i}^{\prime},\mathbf{z}_{i})=\min\left\{1,\frac{P\left(\mathbf{z}_{i}^{\prime}\mid\boldsymbol{\Theta}_{-\mathbf{z}_{i}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\mathbf{z}_{i}\mid\boldsymbol{\Theta}_{-\mathbf{z}_{i}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\mathbf{z}_{i}\mid\mathbf{z}_{i}^{\prime}\right)}{Q\left(\mathbf{z}_{i}^{\prime}\mid\mathbf{z}_{i}\right)}\right\}.
\]
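A minimal sketch of this update, where log_post_zi stands in for the unnormalized log conditional posterior displayed above; the Dirichlet routines are written out so the sketch is self-contained:

```r
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }
ldirichlet  <- function(x, alpha) {
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(x))
}

mh_update_zi <- function(z, log_post_zi, a_z = 1000) {
  z_new <- rdirichlet1(a_z * z)  # proposal concentrated around the current z
  log_A <- log_post_zi(z_new) - log_post_zi(z) +
           ldirichlet(z, a_z * z_new) - ldirichlet(z_new, a_z * z)
  if (log(runif(1)) < log_A) z_new else z
}
```

Larger values of $a_{\mathbf{z}}$ yield more local proposals and higher acceptance rates; the update for $\boldsymbol{\pi}$ below uses the same mechanics.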

Similarly, a Gibbs update is not available for the $\boldsymbol{\pi}$ parameters. We have that

\[
p(\boldsymbol{\pi}\mid\boldsymbol{\Theta}_{-\boldsymbol{\pi}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}\pi_{k}^{c_{k}-1}\times\prod_{i=1}^{N}\frac{1}{B(\alpha_{3}\boldsymbol{\pi})}\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}.
\]

Letting our proposal distribution be $Q(\boldsymbol{\pi}^{\prime}\mid\boldsymbol{\pi})=\text{Dir}(a_{\boldsymbol{\pi}}\boldsymbol{\pi})$, for some large $a_{\boldsymbol{\pi}}\in\mathbb{R}^{+}$, the probability of accepting any proposal is

\[
A(\boldsymbol{\pi}^{\prime},\boldsymbol{\pi})=\min\left\{1,\frac{P\left(\boldsymbol{\pi}^{\prime}\mid\boldsymbol{\Theta}_{-\boldsymbol{\pi}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\boldsymbol{\pi}\mid\boldsymbol{\Theta}_{-\boldsymbol{\pi}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\boldsymbol{\pi}\mid\boldsymbol{\pi}^{\prime}\right)}{Q\left(\boldsymbol{\pi}^{\prime}\mid\boldsymbol{\pi}\right)}\right\}.
\]

The posterior distribution of $\alpha_{3}$ is also not a commonly known distribution, so we again use the Metropolis-Hastings algorithm. We have that

\[
p(\alpha_{3}\mid\boldsymbol{\Theta}_{-\alpha_{3}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto e^{-b\alpha_{3}}\times\prod_{i=1}^{N}\frac{1}{B(\alpha_{3}\boldsymbol{\pi})}\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}.
\]

Using a proposal distribution $Q(\alpha_{3}^{\prime}\mid\alpha_{3})=\mathcal{N}(\alpha_{3},\sigma^{2}_{\alpha_{3}},0,+\infty)$ (truncated normal), the probability of accepting a proposed state is

\[
A(\alpha_{3}^{\prime},\alpha_{3})=\min\left\{1,\frac{P\left(\alpha_{3}^{\prime}\mid\boldsymbol{\Theta}_{-\alpha_{3}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\alpha_{3}\mid\boldsymbol{\Theta}_{-\alpha_{3}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\alpha_{3}\mid\alpha_{3}^{\prime}\right)}{Q\left(\alpha_{3}^{\prime}\mid\alpha_{3}\right)}\right\}.
\]

Let $\mathbf{P}$ be the following tridiagonal matrix:

\[
\mathbf{P}=\begin{bmatrix}1&-1&0&&\\ -1&2&-1&&\\ &\ddots&\ddots&\ddots&\\ &&-1&2&-1\\ &&0&-1&1\end{bmatrix}.
\]
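This matrix is the first-order random walk penalty familiar from Bayesian P-splines (Lang and Brezger [2004]), and it can be constructed from the first-difference matrix. A minimal sketch, where basis_dim plays the role of $P$:

```r
# Build the tridiagonal penalty P = D'D from the first-difference matrix D
make_penalty <- function(basis_dim) {
  D <- diff(diag(basis_dim))  # (basis_dim - 1) x basis_dim first differences
  crossprod(D)                # t(D) %*% D: diagonal 1, 2, ..., 2, 1
}
make_penalty(5)
```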

Thus, letting

\[
\mathbf{B}_{\boldsymbol{\nu}_{j}}=\left(\tau_{\boldsymbol{\nu}_{j}}\mathbf{P}+\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}
\]

and

\[
\mathbf{b}_{\boldsymbol{\nu}_{j}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{k\neq j}Z_{ik}\boldsymbol{\nu}^{\prime}_{k}B(t_{il})\right)-\left(\sum_{k=1}^{K}Z_{ik}\left[\mathbf{x}_{i}\boldsymbol{\eta}_{k}^{\prime}B(t_{il})+\sum_{m=1}^{M}\chi_{im}\boldsymbol{\phi}_{km}^{\prime}B(t_{il})\right]\right)\right],
\]

we have that

\[
\boldsymbol{\nu}_{j}\mid\boldsymbol{\Theta}_{-\boldsymbol{\nu}_{j}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\mathbf{b}_{\boldsymbol{\nu}_{j}},\mathbf{B}_{\boldsymbol{\nu}_{j}}\right).
\]

Let $\boldsymbol{\eta}_{jd}$ denote the $d^{th}$ column of the matrix $\boldsymbol{\eta}_{j}$. Thus, letting

\[
\mathbf{B}_{\boldsymbol{\eta}_{jd}}=\left(\tau_{\boldsymbol{\eta}_{jd}}\mathbf{P}+\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}
\]

and

\[
\mathbf{b}_{\boldsymbol{\eta}_{jd}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}x_{id}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{r\neq d}Z_{ij}x_{ir}\boldsymbol{\eta}_{jr}^{\prime}B(t_{il})\right)-\left(\sum_{k\neq j}Z_{ik}\mathbf{x}_{i}\boldsymbol{\eta}_{k}^{\prime}B(t_{il})\right)-\left(\sum_{k=1}^{K}Z_{ik}\left[\boldsymbol{\nu}_{k}^{\prime}B(t_{il})+\sum_{m=1}^{M}\chi_{im}\boldsymbol{\phi}_{km}^{\prime}B(t_{il})\right]\right)\right],
\]

we have that

\[
\boldsymbol{\eta}_{jd}\mid\boldsymbol{\Theta}_{-\boldsymbol{\eta}_{jd}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\mathbf{b}_{\boldsymbol{\eta}_{jd}},\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right).
\]

Thus samples from the posteriors of the parameters controlling the mean structure can be drawn using Gibbs updates. Similarly, we can use Gibbs updates to draw samples from the posterior distributions of $\tau_{\boldsymbol{\eta}_{jd}}$ and $\tau_{\boldsymbol{\nu}_{j}}$:

\[
\tau_{\boldsymbol{\nu}_{j}}\mid\boldsymbol{\Theta}_{-\tau_{\boldsymbol{\nu}_{j}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\alpha_{\boldsymbol{\nu}}+P/2,\;\beta_{\boldsymbol{\nu}}+\frac{1}{2}\boldsymbol{\nu}^{\prime}_{j}\mathbf{P}\boldsymbol{\nu}_{j}\right)
\]

and

\[
\tau_{\boldsymbol{\eta}_{jd}}\mid\boldsymbol{\Theta}_{-\tau_{\boldsymbol{\eta}_{jd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\alpha_{\boldsymbol{\eta}}+P/2,\;\beta_{\boldsymbol{\eta}}+\frac{1}{2}\boldsymbol{\eta}^{\prime}_{jd}\mathbf{P}\boldsymbol{\eta}_{jd}\right),
\]

for $j=1,\dots,K$ and $d=1,\dots,R$. The parameter $\sigma^{2}$ can be updated with a Gibbs update. If we let

\[
\beta_{\sigma}=\frac{1}{2}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}\chi_{in}\boldsymbol{\phi}_{kn}^{\prime}B(t_{il})\right)\right)^{2},
\]

then we have

\[
\sigma^{2}\mid\boldsymbol{\Theta}_{-\sigma^{2}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim IG\left(\alpha_{0}+\frac{\sum_{i=1}^{N}n_{i}}{2},\;\beta_{0}+\beta_{\sigma}\right).
\]
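In R, this Gibbs update reduces to the reciprocal of a gamma variate; a minimal sketch, where resid stands in for the stacked residuals inside $\beta_{\sigma}$:

```r
# sigma^2 ~ IG(alpha0 + sum(n_i)/2, beta0 + beta_sigma); resid is a placeholder
# vector of residuals y_i(t_il) minus the fitted mixed membership mean.
update_sigma2 <- function(resid, alpha0, beta0) {
  shape <- alpha0 + length(resid) / 2
  rate  <- beta0 + 0.5 * sum(resid^2)
  1 / rgamma(1, shape = shape, rate = rate)
}
```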

Lastly, we can update the $\chi_{im}$ parameters, for $i=1,\dots,N$ and $m=1,\dots,M$, using a Gibbs update. If we let

\[
\mathbf{w}_{im}=\frac{1}{\sigma^{2}}\left[\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}Z_{ik}\boldsymbol{\phi}_{km}^{\prime}B(t_{il})\right)\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n\neq m}\chi_{in}\boldsymbol{\phi}_{kn}^{\prime}B(t_{il})\right)\right)\right]
\]

and

\[
\mathbf{W}_{im}^{-1}=1+\frac{1}{\sigma^{2}}\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}Z_{ik}\boldsymbol{\phi}_{km}^{\prime}B(t_{il})\right)^{2},
\]

then we have that

\[
\chi_{im}\mid\boldsymbol{\Theta}_{-\chi_{im}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}(\mathbf{W}_{im}\mathbf{w}_{im},\mathbf{W}_{im}).
\]

2.2 Tempered Transitions

One of the main computational problems we face in these flexible, unsupervised models is a multi-modal posterior distribution. In order to help the Markov chain move across modes, or traverse areas of low posterior probability, we can utilize tempered transitions.

In this paper, we follow the works of Behrens et al. [2012] and Pritchard et al. [2000] and temper only the likelihood. The target distribution to be tempered is typically written as

\[
p(x)\propto\pi(x)\exp\left(-\beta_{h}h(x)\right),
\]

where $\beta_{h}$ controls how much the distribution is tempered ($1=\beta_{0}>\dots>\beta_{h}>\dots>\beta_{N_{t}}$). In this setting, we assume that the hyperparameters $N_{t}$ and $\beta_{N_{t}}$ are user specified and depend on the complexity of the model; for more complex or larger models, $N_{t}$ should be set relatively high. In this implementation, we assume the $\beta_{h}$ parameters follow a geometric scheme, and in more complex models $\beta_{N_{t}}$ may need to be relatively small.
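For concreteness, one way to implement the geometric scheme, assuming $N_{t}$ and $\beta_{N_{t}}\in(0,1)$ are user specified:

```r
# Geometric tempering ladder: beta_h = beta_Nt^(h / Nt), so beta_0 = 1 and the
# ladder decays geometrically to the user-specified beta_Nt.
temper_ladder <- function(Nt, beta_Nt) beta_Nt^((0:Nt) / Nt)
temper_ladder(Nt = 10, beta_Nt = 0.05)
```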

We can rewrite our likelihood for the functional covariate adjusted model to fit the above form:

\[
p_{h}(y_{i}(t)\mid\boldsymbol{\Theta},\mathbf{X})\propto\exp\left\{-\beta_{h}\left(\frac{1}{2}\log(\sigma^{2})+\frac{1}{2\sigma^{2}}\left(y_{i}(t)-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t)+\sum_{n=1}^{M}\chi_{in}\boldsymbol{\phi}_{kn}^{\prime}B(t)\right)\right)^{2}\right)\right\}=\left(\sigma^{2}\right)^{-\beta_{h}/2}\exp\left\{-\frac{\beta_{h}}{2\sigma^{2}}\left(y_{i}(t)-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t)+\sum_{n=1}^{M}\chi_{in}\boldsymbol{\phi}_{kn}^{\prime}B(t)\right)\right)^{2}\right\}.
\]

Let $\boldsymbol{\Theta}_{h}$ be the set of parameters generated from the model using the tempered likelihood associated with $\beta_{h}$. The tempered transition algorithm can be summarized by the following steps (a schematic sketch follows the list):

  1. Start with the initial state $\boldsymbol{\Theta}_{0}$.

  2. Transition from $\boldsymbol{\Theta}_{0}$ to $\boldsymbol{\Theta}_{1}$ using the tempered likelihood associated with $\beta_{1}$.

  3. Continue in this manner until we transition from $\boldsymbol{\Theta}_{N_{t}-1}$ to $\boldsymbol{\Theta}_{N_{t}}$ using the tempered likelihood associated with $\beta_{N_{t}}$.

  4. Transition from $\boldsymbol{\Theta}_{N_{t}}$ to $\boldsymbol{\Theta}_{N_{t}+1}$ using the tempered likelihood associated with $\beta_{N_{t}}$.

  5. Continue in this manner until we transition from $\boldsymbol{\Theta}_{2N_{t}-1}$ to $\boldsymbol{\Theta}_{2N_{t}}$ using $\beta_{1}$.

  6. Accept the transition from $\boldsymbol{\Theta}_{0}$ to $\boldsymbol{\Theta}_{2N_{t}}$ with probability
\[
\min\left\{1,\;\prod_{h=0}^{N_{t}-1}\frac{\prod_{i=1}^{N}\prod_{l=1}^{n_{i}}p_{h+1}(y_{i}(t_{il})\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}{\prod_{i=1}^{N}\prod_{l=1}^{n_{i}}p_{h}(y_{i}(t_{il})\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}\prod_{h=N_{t}+1}^{2N_{t}}\frac{\prod_{i=1}^{N}\prod_{l=1}^{n_{i}}p_{h}(y_{i}(t_{il})\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}{\prod_{i=1}^{N}\prod_{l=1}^{n_{i}}p_{h+1}(y_{i}(t_{il})\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}\right\}
\]
  in the functional case, or
\[
\min\left\{1,\;\prod_{h=0}^{N_{t}-1}\frac{\prod_{i=1}^{N}p_{h+1}(\mathbf{y}_{i}\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}{\prod_{i=1}^{N}p_{h}(\mathbf{y}_{i}\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}\prod_{h=N_{t}+1}^{2N_{t}}\frac{\prod_{i=1}^{N}p_{h}(\mathbf{y}_{i}\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}{\prod_{i=1}^{N}p_{h+1}(\mathbf{y}_{i}\mid\boldsymbol{\Theta}_{h},\mathbf{X}_{i})}\right\}
\]
  in the multivariate case.
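The promised sketch of steps 1-6 follows, under the simplifying assumption that $\log p_{h}(y\mid\boldsymbol{\Theta})=\beta_{h}\,\ell(\boldsymbol{\Theta})$ up to constants that cancel between the up and down sweeps; transition_kernel(theta, beta) (one MCMC sweep that leaves the $\beta$-tempered posterior invariant) and loglik(theta) (the summed log likelihood $\ell$) are user-supplied placeholders:

```r
tempered_transition <- function(theta, ladder, transition_kernel, loglik) {
  Nt    <- length(ladder) - 1              # ladder = c(beta_0 = 1, ..., beta_Nt)
  kern  <- c(ladder[-1], rev(ladder[-1]))  # kernel temperatures: up sweep, then down
  start <- theta
  lvl   <- ladder[1]
  log_A <- 0
  for (h in seq_along(kern)) {
    # accumulate log{ p_next(theta) / p_current(theta) } before each move
    log_A <- log_A + (kern[h] - lvl) * loglik(theta)
    theta <- transition_kernel(theta, kern[h])
    lvl   <- kern[h]
  }
  if (log(runif(1)) < log_A) theta else start  # accept the excursion or reset
}
```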

Since we only temper the likelihood, many of the posterior distributions derived in Section 2.1 can still be utilized; the following posteriors are the only ones that change due to the tempering of the likelihood. Starting with the $\boldsymbol{\Phi}$ parameters, letting

\[
\left(\mathbf{m}_{\boldsymbol{\phi}_{jm}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\Bigg(B(t_{il})(\chi_{im})_{h}\Bigg(y_{i}(t_{il})(Z_{ij})_{h}-(Z_{ij})_{h}^{2}\left((\boldsymbol{\nu}_{j})_{h}+(\boldsymbol{\eta}_{j})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})-(Z_{ij})_{h}^{2}\sum_{n\neq m}(\chi_{in})_{h}(\boldsymbol{\phi}_{jn})_{h}^{\prime}B(t_{il})-\sum_{k\neq j}(Z_{ij})_{h}(Z_{ik})_{h}\Big[\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}(\chi_{in})_{h}(\boldsymbol{\phi}_{kn})_{h}^{\prime}B(t_{il})\Big]\Bigg)\Bigg),
\]

and

\[
\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}^{-1}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left((Z_{ij})_{h}^{2}(\chi_{im})_{h}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\left(\mathbf{D}_{\boldsymbol{\phi}_{jm}}\right)_{h}^{-1},
\]

we have that

\[
\left(\boldsymbol{\phi}_{jm}\right)_{h}\mid\boldsymbol{\Theta}_{-\left(\boldsymbol{\phi}_{jm}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}\left(\mathbf{m}_{\boldsymbol{\phi}_{jm}}\right)_{h},\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}\right).
\]

As in the untempered case, the posterior distribution of the $\mathbf{Z}$ parameters under the tempered likelihood is not a commonly known distribution; therefore, we use the Metropolis-Hastings algorithm. We have that

\[
p((\mathbf{z}_{i})_{h}\mid\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}(Z_{ik})_{h}^{(\alpha_{3})_{h}(\pi_{k})_{h}-1}\times\prod_{l=1}^{n_{i}}\exp\left\{-\frac{\beta_{h}}{2(\sigma^{2})_{h}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{m=1}^{M}(\chi_{im})_{h}(\boldsymbol{\phi}_{km})_{h}^{\prime}B(t_{il})\right)\right)^{2}\right\}.
\]

We will use $Q((\mathbf{z}_{i})_{h}^{\prime}|(\mathbf{z}_{i})_{h})=Dir(a_{\mathbf{z}}(\mathbf{z}_{i})_{h})$ for some large $a_{\mathbf{z}}\in\mathbb{R}^{+}$ as the proposal distribution. Thus, the probability of accepting a proposed step is

A((\mathbf{z}_{i})_{h}^{\prime},(\mathbf{z}_{i})_{h})=\min\left\{1,\frac{P\left((\mathbf{z}_{i})_{h}^{\prime}|\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left((\mathbf{z}_{i})_{h}|\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left((\mathbf{z}_{i})_{h}|(\mathbf{z}_{i})_{h}^{\prime}\right)}{Q\left((\mathbf{z}_{i})_{h}^{\prime}|(\mathbf{z}_{i})_{h}\right)}\right\}.
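For illustration, a minimal Python sketch of this Metropolis-Hastings step is given below. The function log_target is a hypothetical stand-in that evaluates the log of the tempered full conditional displayed above up to an additive constant.

    import numpy as np
    from scipy.stats import dirichlet

    def mh_update_z(z, log_target, a_z=1000.0, rng=np.random.default_rng()):
        """One Metropolis-Hastings update of an allocation vector z_i.

        z          : current allocation vector on the simplex.
        log_target : callable giving the log full conditional up to a constant.
        a_z        : concentration of the Dirichlet proposal Q(z' | z).
        """
        z_prop = rng.dirichlet(a_z * z)                     # draw from Q(. | z)
        log_accept = (log_target(z_prop) - log_target(z)
                      + dirichlet.logpdf(z, a_z * z_prop)   # Q(z | z')
                      - dirichlet.logpdf(z_prop, a_z * z))  # Q(z' | z)
        if np.log(rng.uniform()) < min(0.0, log_accept):
            return z_prop
        return z

A large $a_{\mathbf{z}}$ keeps proposals close to the current state, trading exploration for a higher acceptance rate.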

Letting

\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}=\left(\left(\tau_{\boldsymbol{\nu}_{j}}\right)_{h}\mathbf{P}+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\left(\mathbf{b}_{\boldsymbol{\nu}_{j}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{k\neq j}(Z_{ik})_{h}(\boldsymbol{\nu}^{\prime}_{k})_{h}B(t_{il})\right)\right.
\left.-\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left[\mathbf{x}_{i}(\boldsymbol{\eta}_{k})_{h}^{\prime}B(t_{il})+\sum_{m=1}^{M}(\chi_{im})_{h}(\boldsymbol{\phi}_{km})_{h}^{\prime}B(t_{il})\right]\right)\right],

we have that

\left(\boldsymbol{\nu}_{j}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\nu}_{j}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}\left(\mathbf{b}_{\boldsymbol{\nu}_{j}}\right)_{h},\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}\right).

Let $\left(\boldsymbol{\eta}_{jd}\right)_{h}$ denote the $d^{th}$ column of the matrix $(\boldsymbol{\eta}_{j})_{h}$. Thus, letting

\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}=\left(\left(\tau_{\boldsymbol{\eta}_{jd}}\right)_{h}\mathbf{P}+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\left(\mathbf{b}_{\boldsymbol{\eta}_{jd}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}x_{id}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{r\neq d}(Z_{ij})_{h}x_{ir}(\boldsymbol{\eta}_{jr})_{h}^{\prime}B(t_{il})\right)\right.
-\left(\sum_{k\neq j}(Z_{ik})_{h}\mathbf{x}_{i}(\boldsymbol{\eta}_{k})_{h}^{\prime}B(t_{il})\right)
\left.-\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left[(\boldsymbol{\nu}_{k})_{h}^{\prime}B(t_{il})+\sum_{m=1}^{M}(\chi_{im})_{h}(\boldsymbol{\phi}_{km})_{h}^{\prime}B(t_{il})\right]\right)\right],

we have that

\left(\boldsymbol{\eta}_{jd}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\eta}_{jd}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}\left(\mathbf{b}_{\boldsymbol{\eta}_{jd}}\right)_{h},\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}\right).

If we let

\left(\beta_{\sigma}\right)_{h}=\frac{\beta_{h}}{2}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right.\right.
\left.\left.+\sum_{n=1}^{M}(\chi_{in})_{h}(\boldsymbol{\phi}_{kn})_{h}^{\prime}B(t_{il})\right)\right)^{2},

then we have

(\sigma^{2})_{h}|\boldsymbol{\Theta}_{-(\sigma^{2})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim IG\left(\alpha_{0}+\frac{\beta_{h}\sum_{i=1}^{N}n_{i}}{2},\beta_{0}+\left(\beta_{\sigma}\right)_{h}\right).

Lastly, we can update the $\chi_{im}$ parameters, for $i=1,\dots,N$ and $m=1,\dots,M$, using a Gibbs update. If we let

\left(\mathbf{w}_{im}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\left[\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}(Z_{ik})_{h}(\boldsymbol{\phi}_{km})_{h}^{\prime}B(t_{il})\right)\Big(y_{i}(t_{il})\right.
\left.-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n\neq m}(\chi_{in})_{h}(\boldsymbol{\phi}_{kn})_{h}^{\prime}B(t_{il})\right)\Big)\right]

and

\left(\mathbf{W}_{im}\right)_{h}^{-1}=1+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}(Z_{ik})_{h}(\boldsymbol{\phi}_{km})_{h}^{\prime}B(t_{il})\right)^{2},

then we have that

(\chi_{im})_{h}|\boldsymbol{\zeta}_{-(\chi_{im})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{W}_{im}\right)_{h}\left(\mathbf{w}_{im}\right)_{h},\left(\mathbf{W}_{im}\right)_{h}\right).

3 Simulation Study and Case Studies

3.1 Simulation Study 1

This subsection contains detailed information on how the simulation study in Section 3 of the main text was conducted. The study primarily examined how well we could recover the true mean, covariance, and allocation structures. We simulated datasets from 3 scenarios at 3 different sample sizes each. Once the datasets were generated, we fit a variety of covariate adjusted and unadjusted functional mixed membership models to the datasets to assess how well we could recover the mean, covariance, and allocation structures.

The first scenario we considered was a covariate adjusted functional mixed membership model with 2 true covariates. To generate all of the datasets, we assumed that the observations were in the span of a B-spline basis with 8 basis functions. For this scenario, we generated 3 datasets with sample sizes of 60, 120, and 240 functional observations, all observed on a grid of 25 time points. Each dataset was generated by first drawing the model parameters (as described below) and then simulating data from the likelihood specified in Equation 11 of the main text. The model parameters were generated as follows:

\boldsymbol{\nu}_{1}\sim\mathcal{N}\left((6,4,\dots,-6,-8)^{\prime},4\mathbf{P}\right),
\boldsymbol{\nu}_{2}\sim\mathcal{N}\left((-8,-6,\dots,4,6)^{\prime},4\mathbf{P}\right),
\boldsymbol{\eta}_{k1}\sim\mathcal{N}\left(\mathbf{1},\mathbf{P}\right)\;\;\;k=1,2,
\boldsymbol{\eta}_{k2}\sim\mathcal{N}\left((3,2,\dots,-4)^{\prime},\mathbf{P}\right)\;\;\;k=1,2.

The $\boldsymbol{\Phi}$ parameters were drawn according to the following distributions:

\boldsymbol{\phi}_{km}=\mathbf{q}_{km}\;\;\;k=1,2\;\;\;m=1,2,3,

where $\mathbf{q}_{k1}\sim\mathcal{N}(\mathbf{0}_{8},2.25\mathbf{I}_{8})$, $\mathbf{q}_{k2}\sim\mathcal{N}(\mathbf{0}_{8},\mathbf{I}_{8})$, and $\mathbf{q}_{k3}\sim\mathcal{N}(\mathbf{0}_{8},0.49\mathbf{I}_{8})$. The $\chi_{im}$ parameters were drawn from a standard normal distribution. The $\mathbf{z}_{i}$ parameters were drawn from a mixture of Dirichlet distributions: roughly 30% were drawn from a Dirichlet distribution with $\alpha_{1}=10$ and $\alpha_{2}=1$, another roughly 30% from a Dirichlet distribution with $\alpha_{1}=1$ and $\alpha_{2}=10$, and the rest from a Dirichlet distribution with $\alpha_{1}=\alpha_{2}=1$. The covariates, $\mathbf{X}$, were drawn from a standard normal distribution. Models in this scenario were run for 500,000 MCMC iterations.
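For reference, a short Python sketch of how the allocation parameters, scores, and covariates of this scenario could be simulated; the seed and variable names are illustrative rather than those used in the study code.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 60  # smallest sample size considered in this scenario

    # Mixture of Dirichlet distributions for the allocation parameters:
    # ~30% concentrated on feature 1, ~30% on feature 2, the rest uniform.
    alphas = np.array([[10.0, 1.0], [1.0, 10.0], [1.0, 1.0]])
    groups = rng.choice(3, size=N, p=[0.3, 0.3, 0.4])
    Z = np.vstack([rng.dirichlet(alphas[g]) for g in groups])

    chi = rng.standard_normal((N, 3))  # chi_im, with M = 3 in this scenario
    X = rng.standard_normal((N, 2))    # two standard normal covariates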

For the second scenario, we considered data drawn from a covariate adjusted functional mixed membership model with one covariate, using sample sizes of 50, 100, and 200 functional samples observed on a grid of 25 time points. The model parameters for these datasets were generated as follows:

\boldsymbol{\nu}_{1}\sim\mathcal{N}\left((6,4,\dots,-6,-8)^{\prime},4\mathbf{P}\right),
\boldsymbol{\nu}_{2}\sim\mathcal{N}\left((-8,-6,\dots,4,6)^{\prime},4\mathbf{P}\right),
\boldsymbol{\eta}_{11}\sim\mathcal{N}\left(\mathbf{2},\mathbf{P}\right),
\boldsymbol{\eta}_{21}\sim\mathcal{N}\left(-\mathbf{2},\mathbf{P}\right).

The $\boldsymbol{\Phi}$ parameters were drawn according to the following distributions:

\boldsymbol{\phi}_{km}=\mathbf{q}_{km}\;\;\;k=1,2\;\;\;m=1,2,3,

where $\mathbf{q}_{k1}\sim\mathcal{N}(\mathbf{0}_{8},2.25\mathbf{I}_{8})$, $\mathbf{q}_{k2}\sim\mathcal{N}(\mathbf{0}_{8},\mathbf{I}_{8})$, and $\mathbf{q}_{k3}\sim\mathcal{N}(\mathbf{0}_{8},0.49\mathbf{I}_{8})$. The $\chi_{im}$ parameters were drawn from a standard normal distribution. The $\mathbf{z}_{i}$ parameters were drawn from a mixture of Dirichlet distributions: roughly 30% were drawn from a Dirichlet distribution with $\alpha_{1}=10$ and $\alpha_{2}=1$, another roughly 30% from a Dirichlet distribution with $\alpha_{1}=1$ and $\alpha_{2}=10$, and the rest from a Dirichlet distribution with $\alpha_{1}=\alpha_{2}=1$. The covariates, $\mathbf{X}$, were drawn from a mean-zero normal distribution with variance nine. Models in this scenario were run for 500,000 MCMC iterations.

For the third scenario, we generated data from an unadjusted functional mixed membership model, using sample sizes of 40, 80, and 160 functional samples observed on a grid of 25 time points. The model parameters for these datasets were generated as follows:

\boldsymbol{\nu}_{1}\sim\mathcal{N}\left((6,4,\dots,-6,-8)^{\prime},4\mathbf{P}\right),
\boldsymbol{\nu}_{2}\sim\mathcal{N}\left((-8,-6,\dots,4,6)^{\prime},4\mathbf{P}\right).

The $\boldsymbol{\Phi}$ parameters were drawn according to the following distributions:

\boldsymbol{\phi}_{km}=\mathbf{q}_{km}\;\;\;k=1,2\;\;\;m=1,2,

where $\mathbf{q}_{k1}\sim\mathcal{N}(\mathbf{0}_{8},2.25\mathbf{I}_{8})$, $\mathbf{q}_{k2}\sim\mathcal{N}(\mathbf{0}_{8},\mathbf{I}_{8})$, and $\mathbf{q}_{k3}\sim\mathcal{N}(\mathbf{0}_{8},0.49\mathbf{I}_{8})$. The $\chi_{im}$ parameters were drawn from a standard normal distribution. The $\mathbf{z}_{i}$ parameters were drawn from a mixture of Dirichlet distributions: approximately 30% were drawn from a Dirichlet distribution with $\alpha_{1}=10$ and $\alpha_{2}=1$, another roughly 30% from a Dirichlet distribution with $\alpha_{1}=1$ and $\alpha_{2}=10$, and the rest from a Dirichlet distribution with $\alpha_{1}=\alpha_{2}=1$. The models in this scenario were run for 500,000 MCMC iterations. The code for running this simulation study can be found on GitHub.

The following plots are more detailed visualizations of the results obtained in the first simulation study.

Figure 1: RMISE and RMSE results from the simulation models fit with two covariates, where the true data was generated with two covariates.
Figure 2: RMISE and RMSE results from the simulation models fit with one covariate, where the true data was generated with one covariate.
Figure 3: RMISE and RMSE results from the simulation models fit with no covariates, where the true data was generated with one covariate.
Figure 4: RMISE and RMSE results from the simulation models fit with one covariate, where the true data was generated with no covariates.
Figure 5: RMISE and RMSE results from the simulation models fit with no covariates, where the true data was generated with no covariates.

3.2 Simulation Study 2

In this simulation study, we evaluated the performance of AIC, BIC, DIC, and the elbow method in choosing the number of features in a covariate adjusted mixed membership model with only one continuous covariate. We generated 50 different datasets with 150 functional observations, observed along a uniform grid of 25 time points. The data was generated from a covariate adjusted mixed membership model with model parameters generated as follows:

\boldsymbol{\nu}_{1}\sim\mathcal{N}\left((6,4,\dots,-6,-8)^{\prime},4\mathbf{P}\right),
\boldsymbol{\nu}_{2}\sim\mathcal{N}\left((-8,-6,\dots,4,6)^{\prime},4\mathbf{P}\right),
\boldsymbol{\eta}_{11}\sim\mathcal{N}\left(\mathbf{2},\mathbf{P}\right),
\boldsymbol{\eta}_{21}\sim\mathcal{N}\left(-\mathbf{2},\mathbf{P}\right),
\boldsymbol{\eta}_{31}\sim\mathcal{N}\left(\mathbf{1},\mathbf{P}\right).

The $\boldsymbol{\Phi}$ parameters were drawn according to the following distributions:

\boldsymbol{\phi}_{km}=\mathbf{q}_{km}\;\;\;k=1,2\;\;\;m=1,2,

where $\mathbf{q}_{k1}\sim\mathcal{N}(\mathbf{0}_{8},2.25\mathbf{I}_{8})$, $\mathbf{q}_{k2}\sim\mathcal{N}(\mathbf{0}_{8},\mathbf{I}_{8})$, and $\mathbf{q}_{k3}\sim\mathcal{N}(\mathbf{0}_{8},0.49\mathbf{I}_{8})$. The $\chi_{im}$ parameters were drawn from a standard normal distribution. The $\mathbf{z}_{i}$ parameters were drawn from a mixture of Dirichlet distributions: approximately 20% were drawn from a Dirichlet distribution with $\alpha_{1}=30$, $\alpha_{2}=1$, and $\alpha_{3}=1$; another roughly 20% with $\alpha_{1}=1$, $\alpha_{2}=30$, and $\alpha_{3}=1$; another roughly 20% with $\alpha_{1}=1$, $\alpha_{2}=1$, and $\alpha_{3}=30$; and the rest with $\alpha_{1}=\alpha_{2}=\alpha_{3}=1$. Four models were then fit for each dataset, with $K=2,3,4,5$. The models in this scenario were run for 200,000 MCMC iterations.

The Bayesian Information Criterion (BIC), proposed by Schwarz [1978], is defined as:

\text{BIC}=2\log P\left(\mathbf{Y}|\hat{\boldsymbol{\Theta}},\mathbf{X}\right)-d\log(n)

where $d$ is the number of parameters and $\hat{\boldsymbol{\Theta}}$ is the maximum likelihood estimator (MLE) of our parameters. In the case of our proposed model, we have that

\text{BIC}=2\log P\left(\mathbf{Y}|\hat{\boldsymbol{\nu}}_{1:K},\hat{\boldsymbol{\eta}}_{1:K},\hat{\boldsymbol{\Phi}}_{1:KM},\hat{\sigma}^{2},\hat{\mathbf{Z}},\hat{\boldsymbol{\chi}},\mathbf{X}\right)-d\log(\tilde{N}) (5)

where $\tilde{N}=\sum_{i}n_{i}$ ($n_{i}$ being the number of time points at which the $i^{th}$ function was observed), and $d=(N+P)K+2MKP+4K+(N+K)M+2+PRK+KR$.

Similarly, the AIC, proposed by Akaike [1974], can be written as

\text{AIC}=-2\log P\left(\mathbf{Y}|\hat{\boldsymbol{\nu}}_{1:K},\hat{\boldsymbol{\eta}}_{1:K},\hat{\boldsymbol{\Phi}}_{1:KM},\hat{\sigma}^{2},\hat{\mathbf{Z}},\hat{\boldsymbol{\chi}},\mathbf{X}\right)+2d. (6)

Following the work of Roeder and Wasserman [1997], we will use the posterior mean instead of the MLE for our estimates of BIC and AIC.
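A minimal sketch of evaluating the two criteria once the log-likelihood has been computed at the posterior-mean parameter estimates (the inputs are assumptions supplied by the fitted model):

    import numpy as np

    def information_criteria(loglik_at_post_mean, d, n_total):
        """BIC (Equation 5) and AIC (Equation 6) from a fitted model.

        loglik_at_post_mean : log P(Y | posterior-mean parameters, X)
        d                   : number of model parameters
        n_total             : total number of observed time points, sum_i n_i
        """
        bic = 2.0 * loglik_at_post_mean - d * np.log(n_total)
        aic = -2.0 * loglik_at_post_mean + 2.0 * d
        return bic, aic

Note that, under the sign conventions of Equations 5 and 6, larger values of BIC and smaller values of AIC indicate a better fit.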

The modified DIC, proposed by Celeux et al. [2006], has advantages over the original DIC proposed by Spiegelhalter et al. [2002] when the posterior distribution has multiple modes and when identifiability may be a problem. The modified DIC (referred to as $\text{DIC}_{3}$ in Celeux et al. [2006]) is specified as

\text{DIC}=-4\mathbb{E}_{\boldsymbol{\Theta}}[\log f(\mathbf{Y}|\boldsymbol{\Theta},\mathbf{X})|\mathbf{Y}]+2\log\hat{f}(\mathbf{Y}) (7)

where $\hat{f}(y_{ij})=\frac{1}{N_{MC}}\sum_{l=1}^{N_{MC}}P\left(y_{ij}|\boldsymbol{\nu}^{(l)}_{1:K},\boldsymbol{\eta}^{(l)}_{1:K},\boldsymbol{\Phi}^{(l)}_{1:KM},\left(\sigma^{2}\right)^{(l)},\mathbf{Z}^{(l)},\mathbf{x}_{i}\right)$, $\hat{f}(\mathbf{Y})=\prod_{i=1}^{N}\prod_{j=1}^{n_{i}}\hat{f}(y_{ij})$, and $N_{MC}$ is the number of MCMC samples used for estimating $\hat{f}(y_{ij})$. We can approximate $\mathbb{E}_{\boldsymbol{\Theta}}[\log f(\mathbf{Y}|\boldsymbol{\Theta},\mathbf{X})|\mathbf{Y}]$ using the MCMC samples, such that

\mathbb{E}_{\boldsymbol{\Theta}}[\log f(\mathbf{Y}|\boldsymbol{\Theta},\mathbf{X})|\mathbf{Y}]\approx\frac{1}{N_{MC}}\sum_{l=1}^{N_{MC}}\sum_{i=1}^{N}\sum_{j=1}^{n_{i}}\log\left[P\left(y_{ij}|\boldsymbol{\nu}^{(l)}_{1:K},\boldsymbol{\eta}^{(l)}_{1:K},\boldsymbol{\Phi}^{(l)}_{1:KM},\left(\sigma^{2}\right)^{(l)},\mathbf{Z}^{(l)},\mathbf{x}_{i}\right)\right].
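The following sketch shows how $\text{DIC}_{3}$ could be computed from stored pointwise log-likelihoods, assuming an array of shape $(N_{MC},\text{number of observations})$ was saved during sampling; logsumexp keeps the average of the pointwise likelihoods numerically stable.

    import numpy as np
    from scipy.special import logsumexp

    def dic3(pointwise_loglik):
        """Modified DIC (DIC_3 of Celeux et al., 2006) from MCMC output.

        pointwise_loglik : array of shape (N_MC, n_obs) holding
                           log P(y_ij | parameters at MCMC draw l).
        """
        # E_Theta[log f(Y | Theta, X) | Y], approximated by the MCMC average
        e_loglik = pointwise_loglik.sum(axis=1).mean()
        # log f_hat(Y) = sum_ij log( (1/N_MC) sum_l f(y_ij | Theta^(l)) )
        n_mc = pointwise_loglik.shape[0]
        log_fhat = (logsumexp(pointwise_loglik, axis=0) - np.log(n_mc)).sum()
        return -4.0 * e_loglik + 2.0 * log_fhat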

3.3 Case Study

As discussed in the main manuscript, we fit two covariate adjusted functional mixed membership models using resting-state EEG data from Dickinson et al. [2018]. The first model used only the log transformation of age as the covariate, and the second model used the log transformation of age, diagnostic group (ASD vs TD), and an interaction between the log transformation of age and diagnostic group as the covariates.

The group-specific mean structures for each of the two features can be seen in Figure 6. As stated in the manuscript, these feature means cannot be interpreted as the expected trajectories of the most extreme observations and are thus harder to directly interpret.

Figure 6: (Top four panels) Estimated mean structures for ASD and TD individuals, for each of the two features used in the covariate adjusted functional mixed membership model fit in Section 4.2 of the main manuscript. (Bottom panel) Estimates of the allocation parameters stratified by clinical diagnosis, where the red triangles depict the group level means and the green diamonds depict the group level medians.

3.4 Comparison Between Mixed Membership Models

In the main manuscript, we extended the analysis of alpha oscillations conducted by Marco et al. [2024a] to allow for a covariate-dependent mixed membership model. While previous studies [Haegens et al., 2014, Rodríguez-Martínez et al., 2017, Scheffler et al., 2019] have shown that alpha oscillations are dependent on age, little is known about how alpha oscillations differ between children with ASD and TD children conditional on age. Although it is attractive to add more covariates to our model, the small sample sizes often found in neurodevelopmental studies limit our ability to fit models with a large number of covariates. Thus, to guard against overfitting, we can use cross-validation-based measures such as the conditional predictive ordinate (CPO) [Pettit, 1990, Chen et al., 2012, Lewis et al., 2014]. CPO for our model can be defined as $P\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\{\mathbf{Y}_{j}(\mathbf{t}_{j})\}_{j\neq i}\right)$. Unlike traditional cross-validation methods, CPO requires no extra sampling. Following Chen et al. [2012] and Lewis et al. [2014], an estimate of CPO for our model can be obtained using the following MCMC approximation:

\hat{CPO}_{i}=\left(\frac{1}{N_{MC}}\sum_{r=1}^{N_{MC}}\frac{1}{P\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\hat{\boldsymbol{\Theta}}_{-\chi}^{r},\mathbf{x}_{i}\right)}\right)^{-1}, (8)

where $\hat{\boldsymbol{\Theta}}_{-\chi}^{r}$ are the samples from the $r^{th}$ MCMC iteration, $N_{MC}$ is the number of MCMC iterations (not including burn-in), and $P\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\hat{\boldsymbol{\Theta}}_{-\chi}^{r},\mathbf{x}_{i}\right)$ is specified in Equation 12 of the main manuscript. While CPO is a measure of how well the model fits each individual observation, the pseudomarginal likelihood (PML), defined as $\hat{PML}=\prod_{i=1}^{N}\hat{CPO}_{i}$, is an overall measure of how well the model fits the entire dataset. Using CPO and PML, we will compare the two covariate adjusted functional mixed membership models fit in this section.
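A sketch of this estimator, again assuming the per-draw log-likelihoods $\log P\left(\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\hat{\boldsymbol{\Theta}}_{-\chi}^{r},\mathbf{x}_{i}\right)$ were stored during sampling; working in log space guards against underflow in the harmonic mean.

    import numpy as np
    from scipy.special import logsumexp

    def log_cpo(pointwise_loglik):
        """Harmonic-mean CPO estimate (Equation 8) on the log scale.

        pointwise_loglik : array of shape (N_MC, N) with the log-likelihood
                           of observation i under the draw from iteration r.
        Returns log CPO_i for i = 1, ..., N; the sum is the log PML.
        """
        n_mc = pointwise_loglik.shape[0]
        return -(logsumexp(-pointwise_loglik, axis=0) - np.log(n_mc))

    # log PML = log_cpo(stored_logliks).sum()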

In this section, we will let $M_{0}$ denote the covariate adjusted model with age as the covariate, $M_{1}$ denote the covariate adjusted functional mixed membership model using the log transform of age as the covariate, and $M_{2}$ denote the covariate adjusted functional mixed membership model using the log transform of age, diagnostic group, and the interaction between the log transform of age and diagnostic group as the covariates. Figure 7 contains the CPO values from all 3 of the models considered. Although the fit is similar between $M_{0}$ and $M_{1}$, both the pseudomarginal likelihood and the likelihood were higher in the log transformed age model ($M_{1}$: $\log(\hat{\text{PML}})=7391.9$, $\log L=7833.2$; $M_{0}$: $\log(\hat{\text{PML}})=7390.6$, $\log L=7826.3$). Thus, the analysis in the main manuscript was performed using the log of age as the covariate. From Figure 7, we can see that $M_{1}$ tends to fit the data slightly better than $M_{2}$ ($M_{1}$: $\log(\hat{\text{PML}})=7391.9$; $M_{2}$: $\log(\hat{\text{PML}})=7303.9$). Although the fit may be slightly worse for the covariate adjusted model with age and diagnostic group as covariates, this model gives us useful information about how the two features differ between children with ASD and TD children.

Figure 7: CPO comparisons between different models on the log scale. $M_{0}$ denotes the covariate adjusted model with age as the covariate, $M_{1}$ denotes the covariate adjusted functional mixed membership model using the log transform of age as the covariate, and $M_{2}$ denotes the covariate adjusted functional mixed membership model using the log transform of age, diagnostic group, and the interaction between the log transform of age and diagnostic group as the covariates.

4 Mean and Covariance Covariate-dependent Mixed Membership Model

4.1 Model Specification

In this section, we completely specify a mixed membership model where the mean and covariance structures depend on the covariates of interest. As in the main text of this manuscript, we will let $\{\mathbf{Y}_{i}(\cdot)\}_{i=1}^{N}$ be the observed sample paths and $\mathbf{t}_{i}=[t_{i1},\dots,t_{in_{i}}]^{\prime}$ denote the time points at which the $i^{th}$ function was observed. We will also let $\mathbf{X}\in\mathbb{R}^{N\times R}$ denote the design matrix and $\mathbf{x}_{i}=[X_{i1}\;\dots\;X_{iR}]$ denote the $i^{th}$ row of the design matrix (the covariates associated with the $i^{th}$ observation). By introducing covariate-dependent pseudo-eigenfunctions, we arrive at the likelihood of our mixed membership model where the mean and covariance structures are dependent on the covariates of interest:

\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\Theta},\mathbf{X}\sim\mathcal{N}\left\{\sum_{k=1}^{K}Z_{ik}\left(\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)+\sum_{m=1}^{M}\chi_{im}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)\right),\;\sigma^{2}\mathbf{I}_{n_{i}}\right\}. (9)

From Equation 9, we can see that $\boldsymbol{\xi}_{km}\in\mathbb{R}^{P\times R}$ directly controls the effect that the covariates have on the pseudo-eigenfunctions for $k=1,\dots,K$ and $m=1,\dots,M$. By integrating out the $\chi_{im}$ parameters ($i=1,\dots,N$ and $m=1,\dots,M$), we get a model of the following form:

\mathbf{Y}_{i}(\mathbf{t}_{i})\mid\boldsymbol{\Theta}_{-\chi},\mathbf{X}\sim\mathcal{N}\left\{\sum_{k=1}^{K}Z_{ik}\mathbf{S}^{\prime}(\mathbf{t}_{i})\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right),\;\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})+\sigma^{2}\mathbf{I}_{n_{i}}\right\}, (10)

where $\boldsymbol{\Theta}_{-\chi}$ is the collection of our model parameters excluding the $\chi_{im}$ variables, and the error-free mixed membership covariance is

\mathbf{V}(\mathbf{t}_{i},\mathbf{z}_{i})=\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}Z_{ik}Z_{ik^{\prime}}\left\{\mathbf{S}^{\prime}(\mathbf{t}_{i})\sum_{m=1}^{M}\left[\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)\left(\boldsymbol{\phi}_{k^{\prime}m}+\boldsymbol{\xi}_{k^{\prime}m}\mathbf{x}_{i}^{\prime}\right)^{\prime}\right]\mathbf{S}(\mathbf{t}_{i})\right\}. (11)

As with the pseudo-eigenfunctions in the unadjusted model, we will utilize the multiplicative gamma process prior as our prior on the $\boldsymbol{\xi}_{km}$ variables. Let $\xi_{(krm)_{p}}$ denote the element in the $p^{th}$ row and $r^{th}$ column of $\boldsymbol{\xi}_{km}$. Thus, we have:

\xi_{(krm)_{p}}\mid\gamma_{\boldsymbol{\xi}_{krmp}},\tilde{\tau}_{\boldsymbol{\xi}_{mkr}}\sim\mathcal{N}\left(0,\gamma_{\boldsymbol{\xi}_{krmp}}^{-1}\tilde{\tau}_{\boldsymbol{\xi}_{mkr}}^{-1}\right),\;\;\;\gamma_{\boldsymbol{\xi}_{krmp}}\sim\Gamma\left(\nu_{\gamma}/2,\nu_{\gamma}/2\right),\;\;\;\tilde{\tau}_{\boldsymbol{\xi}_{mkr}}=\prod_{n=1}^{m}\delta_{\boldsymbol{\xi}_{nkr}},
\delta_{\boldsymbol{\xi}_{1kr}}\mid a_{\boldsymbol{\xi}_{1kr}}\sim\Gamma(a_{\boldsymbol{\xi}_{1kr}},1),\;\;\;\delta_{\boldsymbol{\xi}_{jkr}}\mid a_{\boldsymbol{\xi}_{2kr}}\sim\Gamma(a_{\boldsymbol{\xi}_{2kr}},1),\;\;\;a_{\boldsymbol{\xi}_{1kr}}\sim\Gamma(\alpha_{1},\beta_{1}),\;\;\;a_{\boldsymbol{\xi}_{2kr}}\sim\Gamma(\alpha_{2},\beta_{2}),

for $k=1,\dots,K$, $r=1,\dots,R$, $m=1,\dots,M$, and $p=1,\dots,P$. The rest of the parameters in the model have the same prior distributions as the model with covariate-dependence on the mean structure only in the main text. Specifically, we have

\phi_{kpm}|\gamma_{kpm},\tilde{\tau}_{mk}\sim\mathcal{N}\left(0,\gamma_{kpm}^{-1}\tilde{\tau}_{mk}^{-1}\right),\;\;\;\gamma_{kpm}\sim\Gamma\left(\nu_{\gamma}/2,\nu_{\gamma}/2\right),\;\;\;\tilde{\tau}_{mk}=\prod_{n=1}^{m}\delta_{nk},
\delta_{1k}|a_{1k}\sim\Gamma(a_{1k},1),\;\;\;\delta_{jk}|a_{2k}\sim\Gamma(a_{2k},1),\;\;\;a_{1k}\sim\Gamma(\alpha_{1},\beta_{1}),\;\;\;a_{2k}\sim\Gamma(\alpha_{2},\beta_{2}),

for $k=1,\dots,K$, $m=1,\dots,M$, and $p=1,\dots,P$. Similarly, we have

P(\boldsymbol{\nu}_{k}|\tau_{\boldsymbol{\nu}_{k}})\propto\exp\left(-\frac{\tau_{\boldsymbol{\nu}_{k}}}{2}\sum_{p=1}^{P-1}\left(\nu_{pk}-\nu_{(p+1)k}\right)^{2}\right),

for $k=1,\dots,K$, where $\tau_{\boldsymbol{\nu}_{k}}\sim\Gamma(\alpha_{\boldsymbol{\nu}},\beta_{\boldsymbol{\nu}})$ and $\nu_{pk}$ is the $p^{th}$ element of $\boldsymbol{\nu}_{k}$. Likewise, we have that

P(\{\eta_{prk}\}_{p=1}^{P}|\tau_{\boldsymbol{\eta}_{rk}})\propto\exp\left(-\frac{\tau_{\boldsymbol{\eta}_{rk}}}{2}\sum_{p=1}^{P-1}\left(\eta_{prk}-\eta_{(p+1)rk}\right)^{2}\right),

for $k=1,\dots,K$ and $r=1,\dots,R$, where $\tau_{\boldsymbol{\eta}_{rk}}\sim\Gamma(\alpha_{\boldsymbol{\eta}},\beta_{\boldsymbol{\eta}})$ and $\eta_{prk}$ is the element in the $p^{th}$ row and $r^{th}$ column of $\boldsymbol{\eta}_{k}$. Lastly, we assume that $\mathbf{z}_{i}\mid\boldsymbol{\pi},\alpha_{3}\sim_{iid}Dir(\alpha_{3}\boldsymbol{\pi})$, $\boldsymbol{\pi}\sim Dir(\mathbf{c}_{\pi})$, $\alpha_{3}\sim Exp(b)$, and $\sigma^{2}\sim IG(\alpha_{0},\beta_{0})$.
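To illustrate the shrinkage behavior of this prior, the sketch below draws the cumulative-product precisions $\tilde{\tau}_{\boldsymbol{\xi}_{mkr}}$ for a single $(k,r)$ pair, with the shape parameters fixed at illustrative values rather than drawn from their Gamma hyperpriors (an assumption made only for the sketch).

    import numpy as np

    def draw_mgp_precisions(M, a1=2.0, a2=3.0, rng=np.random.default_rng()):
        """Draw tau_tilde_1, ..., tau_tilde_M from the multiplicative
        gamma process, with fixed shapes a1 and a2 (an assumption; in the
        model these shapes have their own Gamma priors)."""
        delta = np.empty(M)
        delta[0] = rng.gamma(a1, 1.0)               # delta_1 ~ Gamma(a1, 1)
        delta[1:] = rng.gamma(a2, 1.0, size=M - 1)  # delta_j ~ Gamma(a2, 1)
        return np.cumprod(delta)  # tau_tilde_m = prod_{n <= m} delta_n

When $a_{2}>1$, the precisions grow stochastically in $m$, so in expectation the higher-order pseudo-eigenfunction deviations are shrunk more aggressively toward zero.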

4.2 Posterior Distributions

In this subsection, we specify the posterior distributions for the functional covariate adjusted mixed membership model where the covariance is covariate-dependent. We will first start with the $\boldsymbol{\phi}_{jm}$ parameters, for $j=1,\dots,K$ and $m=1,\dots,M$. Let $\mathbf{D}_{\boldsymbol{\phi}_{jm}}=\tilde{\tau}_{\boldsymbol{\phi}_{mj}}^{-1}diag\left(\gamma_{\boldsymbol{\phi}_{j1m}}^{-1},\dots,\gamma_{\boldsymbol{\phi}_{jPm}}^{-1}\right)$. By letting

\mathbf{m}_{\boldsymbol{\phi}_{jm}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(B(t_{il})\chi_{im}\left(y_{i}(t_{il})Z_{ij}-Z_{ij}^{2}\left(\boldsymbol{\nu}_{j}+\boldsymbol{\eta}_{j}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})-Z_{ij}^{2}\sum_{n\neq m}\chi_{in}\boldsymbol{\phi}_{jn}^{\prime}B(t_{il})\right.\right.
\left.\left.-Z_{ij}^{2}\sum_{n=1}^{M}\chi_{in}\mathbf{x}_{i}\boldsymbol{\xi}_{jn}^{\prime}B(t_{il})-\sum_{k\neq j}Z_{ij}Z_{ik}\left[\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}\chi_{in}\left(\boldsymbol{\phi}_{kn}+\boldsymbol{\xi}_{kn}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\right)\right),

and

\mathbf{M}_{\boldsymbol{\phi}_{jm}}^{-1}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(Z_{ij}^{2}\chi_{im}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\mathbf{D}_{\boldsymbol{\phi}_{jm}}^{-1},

we have that

\boldsymbol{\phi}_{jm}|\boldsymbol{\Theta}_{-\boldsymbol{\phi}_{jm}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\mathbf{m}_{\boldsymbol{\phi}_{jm}},\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right).

Let $\boldsymbol{\xi}_{krm}$ be the $r^{th}$ column of the matrix $\boldsymbol{\xi}_{km}$, and let $\mathbf{D}_{\boldsymbol{\xi}_{krm}}=\tilde{\tau}_{\boldsymbol{\xi}_{mkr}}^{-1}diag\left(\gamma_{\boldsymbol{\xi}_{krm1}}^{-1},\dots,\gamma_{\boldsymbol{\xi}_{krmP}}^{-1}\right)$. We will also let $x_{ir}$ denote the $r^{th}$ element of $\mathbf{x}_{i}$. Thus, letting

\mathbf{m}_{\boldsymbol{\xi}_{kdm}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(B(t_{il})\chi_{im}x_{id}Z_{ik}\left(y_{i}(t_{il})-\sum_{j=1}^{K}Z_{ij}\left[\left(\boldsymbol{\nu}_{j}+\boldsymbol{\eta}_{j}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}\chi_{in}\boldsymbol{\phi}_{jn}^{\prime}B(t_{il})\right]\right.\right.
\left.\left.-\sum_{(j,n,r)\neq(k,m,d)}Z_{ij}\chi_{in}x_{ir}\boldsymbol{\xi}_{jrn}^{\prime}B(t_{il})\right)\right)

and

\mathbf{M}_{\boldsymbol{\xi}_{kdm}}^{-1}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(Z_{ik}^{2}\chi_{im}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\mathbf{D}_{\boldsymbol{\xi}_{kdm}}^{-1},

we have that

\boldsymbol{\xi}_{kdm}|\boldsymbol{\Theta}_{-\boldsymbol{\xi}_{kdm}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{M}_{\boldsymbol{\xi}_{kdm}}\mathbf{m}_{\boldsymbol{\xi}_{kdm}},\mathbf{M}_{\boldsymbol{\xi}_{kdm}}\right).

The posterior distribution of $\delta_{\boldsymbol{\phi}_{1k}}$, for $k=1,\dots,K$, is

\delta_{\boldsymbol{\phi}_{1k}}|\boldsymbol{\Theta}_{-\delta_{\boldsymbol{\phi}_{1k}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(a_{\boldsymbol{\phi}_{1k}}+(PM/2),1+\frac{1}{2}\sum_{r=1}^{P}\gamma_{\boldsymbol{\phi}_{k,r,1}}\phi_{k,r,1}^{2}+\frac{1}{2}\sum_{m=2}^{M}\sum_{r=1}^{P}\gamma_{\boldsymbol{\phi}_{k,r,m}}\phi_{k,r,m}^{2}\left(\prod_{j=2}^{m}\delta_{\boldsymbol{\phi}_{jk}}\right)\right).

The posterior distribution for $\delta_{\boldsymbol{\phi}_{ik}}$, for $i=2,\dots,M$ and $k=1,\dots,K$, is

\delta_{\boldsymbol{\phi}_{ik}}|\boldsymbol{\Theta}_{-\delta_{\boldsymbol{\phi}_{ik}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(a_{\boldsymbol{\phi}_{2k}}+(P(M-i+1)/2),1+\frac{1}{2}\sum_{m=i}^{M}\sum_{r=1}^{P}\gamma_{\boldsymbol{\phi}_{k,r,m}}\phi_{k,r,m}^{2}\left(\prod_{j=1;j\neq i}^{m}\delta_{\boldsymbol{\phi}_{jk}}\right)\right).

The posterior distribution of $\delta_{\boldsymbol{\xi}_{1kd}}$, for $k=1,\dots,K$ and $d=1,\dots,R$, is

\delta_{\boldsymbol{\xi}_{1kd}}|\boldsymbol{\Theta}_{-\delta_{\boldsymbol{\xi}_{1kd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(a_{\boldsymbol{\xi}_{1kd}}+(PM/2),1+\frac{1}{2}\sum_{r=1}^{P}\gamma_{\boldsymbol{\xi}_{kdr1}}\xi_{kdr1}^{2}+\frac{1}{2}\sum_{m=2}^{M}\sum_{r=1}^{P}\gamma_{\boldsymbol{\xi}_{kdrm}}\xi_{kdrm}^{2}\left(\prod_{j=2}^{m}\delta_{\boldsymbol{\xi}_{jkd}}\right)\right).

The posterior distribution for $\delta_{\boldsymbol{\xi}_{ikd}}$, for $i=2,\dots,M$, $k=1,\dots,K$, and $d=1,\dots,R$, is

\delta_{\boldsymbol{\xi}_{ikd}}|\boldsymbol{\Theta}_{-\delta_{\boldsymbol{\xi}_{ikd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(a_{\boldsymbol{\xi}_{2kd}}+(P(M-i+1)/2),1+\frac{1}{2}\sum_{m=i}^{M}\sum_{r=1}^{P}\gamma_{\boldsymbol{\xi}_{kdrm}}\xi_{kdrm}^{2}\left(\prod_{j=1;j\neq i}^{m}\delta_{\boldsymbol{\xi}_{jkd}}\right)\right).

The posterior distribution for $a_{\boldsymbol{\phi}_{1k}}$ ($k=1,\dots,K$) is not a commonly known distribution; however, we have that

P(a_{\boldsymbol{\phi}_{1k}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{1k}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{\boldsymbol{\phi}_{1k}})}\delta_{\boldsymbol{\phi}_{1k}}^{a_{\boldsymbol{\phi}_{1k}}-1}a_{\boldsymbol{\phi}_{1k}}^{\alpha_{1}-1}\exp\left\{-a_{\boldsymbol{\phi}_{1k}}\beta_{1}\right\}.

Since this is not a known kernel of a distribution, we will have to use the Metropolis-Hastings algorithm. Consider the proposal distribution $Q(a_{\boldsymbol{\phi}_{1k}}^{\prime}|a_{\boldsymbol{\phi}_{1k}})=\mathcal{N}\left(a_{\boldsymbol{\phi}_{1k}},\epsilon_{1}\beta_{1}^{-1},0,+\infty\right)$ (truncated normal) for some small $\epsilon_{1}>0$. Thus, the probability of accepting any step is

A(a_{\boldsymbol{\phi}_{1k}}^{\prime},a_{\boldsymbol{\phi}_{1k}})=\min\left\{1,\frac{P\left(a_{\boldsymbol{\phi}_{1k}}^{\prime}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{1k}}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{\boldsymbol{\phi}_{1k}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{1k}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{\boldsymbol{\phi}_{1k}}|a_{\boldsymbol{\phi}_{1k}}^{\prime}\right)}{Q\left(a_{\boldsymbol{\phi}_{1k}}^{\prime}|a_{\boldsymbol{\phi}_{1k}}\right)}\right\}.

Similarly, for $a_{\boldsymbol{\phi}_{2k}}$ ($k=1,\dots,K$), we have

P(a_{\boldsymbol{\phi}_{2k}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{2k}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{\boldsymbol{\phi}_{2k}})^{M-1}}\left(\prod_{i=2}^{M}\delta_{\boldsymbol{\phi}_{ik}}^{a_{\boldsymbol{\phi}_{2k}}-1}\right)a_{\boldsymbol{\phi}_{2k}}^{\alpha_{2}-1}\exp\left\{-a_{\boldsymbol{\phi}_{2k}}\beta_{2}\right\}.

We will use a similar proposal distribution, such that $Q(a_{\boldsymbol{\phi}_{2k}}^{\prime}|a_{\boldsymbol{\phi}_{2k}})=\mathcal{N}\left(a_{\boldsymbol{\phi}_{2k}},\epsilon_{2}\beta_{2}^{-1},0,+\infty\right)$ for some small $\epsilon_{2}>0$. Thus, the probability of accepting any step is

A(a_{\boldsymbol{\phi}_{2k}}^{\prime},a_{\boldsymbol{\phi}_{2k}})=\min\left\{1,\frac{P\left(a_{\boldsymbol{\phi}_{2k}}^{\prime}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{2k}}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{\boldsymbol{\phi}_{2k}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\phi}_{2k}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{\boldsymbol{\phi}_{2k}}|a_{\boldsymbol{\phi}_{2k}}^{\prime}\right)}{Q\left(a_{\boldsymbol{\phi}_{2k}}^{\prime}|a_{\boldsymbol{\phi}_{2k}}\right)}\right\}.

Similarly, the posterior distribution for $a_{\boldsymbol{\xi}_{1kd}}$ ($k=1,\dots,K$ and $d=1,\dots,R$) is not a commonly known distribution; however, we have that

P(a_{\boldsymbol{\xi}_{1kd}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{1kd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{\boldsymbol{\xi}_{1kd}})}\delta_{\boldsymbol{\xi}_{1kd}}^{a_{\boldsymbol{\xi}_{1kd}}-1}a_{\boldsymbol{\xi}_{1kd}}^{\alpha_{1}-1}\exp\left\{-a_{\boldsymbol{\xi}_{1kd}}\beta_{1}\right\}.

We will use a similar proposal distribution, such that $Q(a_{\boldsymbol{\xi}_{1kd}}^{\prime}|a_{\boldsymbol{\xi}_{1kd}})=\mathcal{N}\left(a_{\boldsymbol{\xi}_{1kd}},\epsilon_{1}\beta_{1}^{-1},0,+\infty\right)$ for some small $\epsilon_{1}>0$. Thus, the probability of accepting any step is

A(a_{\boldsymbol{\xi}_{1kd}}^{\prime},a_{\boldsymbol{\xi}_{1kd}})=\min\left\{1,\frac{P\left(a_{\boldsymbol{\xi}_{1kd}}^{\prime}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{1kd}}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{\boldsymbol{\xi}_{1kd}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{1kd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{\boldsymbol{\xi}_{1kd}}|a_{\boldsymbol{\xi}_{1kd}}^{\prime}\right)}{Q\left(a_{\boldsymbol{\xi}_{1kd}}^{\prime}|a_{\boldsymbol{\xi}_{1kd}}\right)}\right\}.

Similarly, for $a_{\boldsymbol{\xi}_{2kd}}$ ($k=1,\dots,K$ and $d=1,\dots,R$), we have

P(a_{\boldsymbol{\xi}_{2kd}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{2kd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\frac{1}{\Gamma(a_{\boldsymbol{\xi}_{2kd}})^{M-1}}\left(\prod_{i=2}^{M}\delta_{\boldsymbol{\xi}_{ikd}}^{a_{\boldsymbol{\xi}_{2kd}}-1}\right)a_{\boldsymbol{\xi}_{2kd}}^{\alpha_{2}-1}\exp\left\{-a_{\boldsymbol{\xi}_{2kd}}\beta_{2}\right\}.

We will use a similar proposal distribution, such that $Q(a_{\boldsymbol{\xi}_{2kd}}^{\prime}|a_{\boldsymbol{\xi}_{2kd}})=\mathcal{N}\left(a_{\boldsymbol{\xi}_{2kd}},\epsilon_{2}\beta_{2}^{-1},0,+\infty\right)$ for some small $\epsilon_{2}>0$. Thus, the probability of accepting any step is

A(a_{\boldsymbol{\xi}_{2kd}}^{\prime},a_{\boldsymbol{\xi}_{2kd}})=\min\left\{1,\frac{P\left(a_{\boldsymbol{\xi}_{2kd}}^{\prime}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{2kd}}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(a_{\boldsymbol{\xi}_{2kd}}|\boldsymbol{\Theta}_{-a_{\boldsymbol{\xi}_{2kd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(a_{\boldsymbol{\xi}_{2kd}}|a_{\boldsymbol{\xi}_{2kd}}^{\prime}\right)}{Q\left(a_{\boldsymbol{\xi}_{2kd}}^{\prime}|a_{\boldsymbol{\xi}_{2kd}}\right)}\right\}.

For the $\gamma_{\boldsymbol{\phi}_{jpm}}$ parameters, for $j=1,\dots,K$, $p=1,\dots,P$, and $m=1,\dots,M$, we have

\gamma_{\boldsymbol{\phi}_{jpm}}|\boldsymbol{\Theta}_{-\gamma_{\boldsymbol{\phi}_{jpm}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\frac{\nu_{\gamma}+1}{2},\frac{\phi_{jpm}^{2}\tilde{\tau}_{\boldsymbol{\phi}_{mj}}+\nu_{\gamma}}{2}\right).

Similarly, for the $\gamma_{\boldsymbol{\xi}_{jrpm}}$ parameters, we have

\gamma_{\boldsymbol{\xi}_{jrpm}}|\boldsymbol{\Theta}_{-\gamma_{\boldsymbol{\xi}_{jrpm}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\frac{\nu_{\gamma}+1}{2},\frac{\xi_{jrpm}^{2}\tilde{\tau}_{\boldsymbol{\xi}_{mjr}}+\nu_{\gamma}}{2}\right),

for $j=1,\dots,K$, $r=1,\dots,R$, $p=1,\dots,P$, and $m=1,\dots,M$. The posterior distribution for the $\mathbf{z}_{i}$ parameters is not a commonly known distribution, so we will use the Metropolis-Hastings algorithm. We know that

p(\mathbf{z}_{i}|\boldsymbol{\Theta}_{-\mathbf{z}_{i}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}
\times\prod_{l=1}^{n_{i}}\exp\left\{-\frac{1}{2\sigma^{2}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right.\right.\right.
\left.\left.\left.+\sum_{m=1}^{M}\chi_{im}\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right)^{2}\right\}.

We will use $Q(\mathbf{z}_{i}^{\prime}|\mathbf{z}_{i})=Dir(a_{\mathbf{z}}\mathbf{z}_{i})$ for some large $a_{\mathbf{z}}\in\mathbb{R}^{+}$ as the proposal distribution. Thus, the probability of accepting a proposed step is

A(\mathbf{z}_{i}^{\prime},\mathbf{z}_{i})=\min\left\{1,\frac{P\left(\mathbf{z}_{i}^{\prime}|\boldsymbol{\Theta}_{-\mathbf{z}_{i}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\mathbf{z}_{i}|\boldsymbol{\Theta}_{-\mathbf{z}_{i}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\mathbf{z}_{i}|\mathbf{z}_{i}^{\prime}\right)}{Q\left(\mathbf{z}_{i}^{\prime}|\mathbf{z}_{i}\right)}\right\}.

Similarly, a Gibbs update is not available for the $\boldsymbol{\pi}$ parameters. We have that

p(\boldsymbol{\pi}|\boldsymbol{\Theta}_{-\boldsymbol{\pi}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}\pi_{k}^{c_{k}-1}\times\prod_{i=1}^{N}\frac{1}{B(\alpha_{3}\boldsymbol{\pi})}\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}.

Letting our proposal distribution be such that $Q(\boldsymbol{\pi}^{\prime}|\boldsymbol{\pi})=Dir(a_{\boldsymbol{\pi}}\boldsymbol{\pi})$, for some large $a_{\boldsymbol{\pi}}\in\mathbb{R}^{+}$, we have that our probability of accepting any proposal is

A(\boldsymbol{\pi}^{\prime},\boldsymbol{\pi})=\min\left\{1,\frac{P\left(\boldsymbol{\pi}^{\prime}|\boldsymbol{\Theta}_{-\boldsymbol{\pi}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\boldsymbol{\pi}|\boldsymbol{\Theta}_{-\boldsymbol{\pi}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\boldsymbol{\pi}|\boldsymbol{\pi}^{\prime}\right)}{Q\left(\boldsymbol{\pi}^{\prime}|\boldsymbol{\pi}\right)}\right\}.

The posterior distribution of $\alpha_{3}$ is also not a commonly known distribution, so we will use the Metropolis-Hastings algorithm to sample from it. We have that

p(\alpha_{3}|\boldsymbol{\Theta}_{-\alpha_{3}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto e^{-b\alpha_{3}}\times\prod_{i=1}^{N}\frac{1}{B(\alpha_{3}\boldsymbol{\pi})}\prod_{k=1}^{K}Z_{ik}^{\alpha_{3}\pi_{k}-1}.

Using a proposal distribution such that $Q(\alpha_{3}^{\prime}|\alpha_{3})=\mathcal{N}(\alpha_{3},\sigma^{2}_{\alpha_{3}},0,+\infty)$ (truncated normal), we are left with the probability of accepting a proposed state as

A(\alpha_{3}^{\prime},\alpha_{3})=\min\left\{1,\frac{P\left(\alpha_{3}^{\prime}|\boldsymbol{\Theta}_{-\alpha_{3}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left(\alpha_{3}|\boldsymbol{\Theta}_{-\alpha_{3}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left(\alpha_{3}|\alpha_{3}^{\prime}\right)}{Q\left(\alpha_{3}^{\prime}|\alpha_{3}\right)}\right\}.
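A minimal sketch of this truncated-normal random-walk step; log_target is a hypothetical stand-in evaluating the log full conditional of $\alpha_{3}$ above up to a constant.

    import numpy as np
    from scipy.stats import truncnorm

    def mh_update_alpha3(alpha3, log_target, sd, rng=np.random.default_rng()):
        """Metropolis-Hastings update of alpha_3 with a truncated-normal
        proposal on (0, infinity)."""
        def logq(to, frm):
            # density of N(frm, sd^2) truncated to (0, inf), evaluated at `to`
            return truncnorm.logpdf(to, -frm / sd, np.inf, loc=frm, scale=sd)

        prop = truncnorm.rvs(-alpha3 / sd, np.inf, loc=alpha3, scale=sd,
                             random_state=rng)
        log_accept = (log_target(prop) - log_target(alpha3)
                      + logq(alpha3, prop) - logq(prop, alpha3))
        return prop if np.log(rng.uniform()) < min(0.0, log_accept) else alpha3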

Let $\mathbf{P}$ be the following tridiagonal matrix:

\mathbf{P}=\begin{bmatrix}1&-1&0&&\\ -1&2&-1&&\\ &\ddots&\ddots&\ddots&\\ &&-1&2&-1\\ &&0&-1&1\end{bmatrix}.
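This penalty matrix is the Gram matrix of the first-difference operator; the sketch below constructs it and can be checked against the display above.

    import numpy as np

    def penalty_matrix(P):
        """First-order random-walk penalty matrix of dimension P x P."""
        D = np.diff(np.eye(P), axis=0)  # (P-1) x P first-difference matrix
        return D.T @ D                  # tridiagonal, matching the display

    # penalty_matrix(5) reproduces the displayed structure for P = 5.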

Thus, letting

\mathbf{B}_{\boldsymbol{\nu}_{j}}=\left(\tau_{\boldsymbol{\nu}_{j}}\mathbf{P}+\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\mathbf{b}_{\boldsymbol{\nu}_{j}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{k\neq j}Z_{ik}\boldsymbol{\nu}^{\prime}_{k}B(t_{il})\right)\right.
\left.-\left(\sum_{k=1}^{K}Z_{ik}\left[\mathbf{x}_{i}\boldsymbol{\eta}_{k}^{\prime}B(t_{il})+\sum_{m=1}^{M}\chi_{im}\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\right)\right],

we have that

\boldsymbol{\nu}_{j}|\boldsymbol{\Theta}_{-\boldsymbol{\nu}_{j}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\mathbf{b}_{\boldsymbol{\nu}_{j}},\mathbf{B}_{\boldsymbol{\nu}_{j}}\right).

Let $\boldsymbol{\eta}_{jd}$ denote the $d^{th}$ column of the matrix $\boldsymbol{\eta}_{j}$. Thus, letting

\mathbf{B}_{\boldsymbol{\eta}_{jd}}=\left(\tau_{\boldsymbol{\eta}_{jd}}\mathbf{P}+\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\mathbf{b}_{\boldsymbol{\eta}_{jd}}=\frac{1}{\sigma^{2}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}Z_{ij}x_{id}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{r\neq d}Z_{ij}x_{ir}\boldsymbol{\eta}_{jr}^{\prime}B(t_{il})\right)-\left(\sum_{k\neq j}Z_{ik}\mathbf{x}_{i}\boldsymbol{\eta}_{k}^{\prime}B(t_{il})\right)\right.
\left.-\left(\sum_{k=1}^{K}Z_{ik}\left[\boldsymbol{\nu}_{k}^{\prime}B(t_{il})+\sum_{m=1}^{M}\chi_{im}\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\right)\right],

we have that

\boldsymbol{\eta}_{jd}|\boldsymbol{\Theta}_{-\boldsymbol{\eta}_{jd}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\mathbf{b}_{\boldsymbol{\eta}_{jd}},\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right).

Thus, the parameters controlling the mean structure can be sampled using a Gibbs sampler. Similarly, we can use a Gibbs sampler to draw samples from the posterior distributions of $\tau_{\boldsymbol{\eta}_{jd}}$ and $\tau_{\boldsymbol{\nu}_{j}}$. We have that the posterior distributions are

\tau_{\boldsymbol{\nu}_{j}}|\boldsymbol{\Theta}_{-\tau_{\boldsymbol{\nu}_{j}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\alpha_{\boldsymbol{\nu}}+P/2,\beta_{\boldsymbol{\nu}}+\frac{1}{2}\boldsymbol{\nu}^{\prime}_{j}\mathbf{P}\boldsymbol{\nu}_{j}\right)

and

\tau_{\boldsymbol{\eta}_{jd}}|\boldsymbol{\Theta}_{-\tau_{\boldsymbol{\eta}_{jd}}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\Gamma\left(\alpha_{\boldsymbol{\eta}}+P/2,\beta_{\boldsymbol{\eta}}+\frac{1}{2}\boldsymbol{\eta}^{\prime}_{jd}\mathbf{P}\boldsymbol{\eta}_{jd}\right),

for $j=1,\dots,K$ and $d=1,\dots,R$. The parameter $\sigma^{2}$ can be updated using a Gibbs update. If we let

\beta_{\sigma}=\frac{1}{2}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}\chi_{in}\left(\boldsymbol{\phi}_{kn}+\boldsymbol{\xi}_{kn}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right)^{2},

then we have

\sigma^{2}|\boldsymbol{\Theta}_{-\sigma^{2}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim IG\left(\alpha_{0}+\frac{\sum_{i=1}^{N}n_{i}}{2},\beta_{0}+\beta_{\sigma}\right).
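A sketch of this inverse-gamma Gibbs draw, assuming the residuals inside $\beta_{\sigma}$ have been computed from the current values of the other parameters (NumPy parameterizes the Gamma by shape and scale, so an inverse-gamma draw is the reciprocal of a Gamma draw with scale $1/\text{rate}$):

    import numpy as np

    def draw_sigma2(resid, alpha0, beta0, rng=np.random.default_rng()):
        """Gibbs draw of sigma^2 from its inverse-gamma full conditional.

        resid : concatenated residuals y_i(t_il) minus the model mean,
                over all i and l (assumed precomputed).
        """
        shape = alpha0 + resid.size / 2.0
        rate = beta0 + 0.5 * np.sum(resid ** 2)
        return 1.0 / rng.gamma(shape, 1.0 / rate)  # IG(shape, rate) draw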

Lastly, we can update the $\chi_{im}$ parameters, for $i=1,\dots,N$ and $m=1,\dots,M$, using a Gibbs update. If we let

\mathbf{w}_{im}=\frac{1}{\sigma^{2}}\left[\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}Z_{ik}\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right.
\left.\left(y_{i}(t_{il})-\sum_{k=1}^{K}Z_{ik}\left(\left(\boldsymbol{\nu}_{k}+\boldsymbol{\eta}_{k}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n\neq m}\chi_{in}\left(\boldsymbol{\phi}_{kn}+\boldsymbol{\xi}_{kn}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right)\right]

and

\mathbf{W}_{im}^{-1}=1+\frac{1}{\sigma^{2}}\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}Z_{ik}\left(\boldsymbol{\phi}_{km}+\boldsymbol{\xi}_{km}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)^{2},

then we have that

\chi_{im}|\boldsymbol{\zeta}_{-\chi_{im}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}(\mathbf{W}_{im}\mathbf{w}_{im},\mathbf{W}_{im}).

4.3 Tempered Transitions

Since we only temper the likelihood, many of the posterior distributions derived in Section 4.2 can be utilized. Starting with the $\boldsymbol{\Phi}$ parameters, we have

\left(\mathbf{m}_{\boldsymbol{\phi}_{jm}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\Bigg(B(t_{il})(\chi_{im})_{h}\Big(y_{i}(t_{il})(Z_{ij})_{h}-(Z_{ij})_{h}^{2}\left((\boldsymbol{\nu}_{j})_{h}+(\boldsymbol{\eta}_{j})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})
-(Z_{ij})_{h}^{2}\sum_{n\neq m}(\chi_{in})_{h}(\boldsymbol{\phi}_{jn})_{h}^{\prime}B(t_{il})-(Z_{ij})_{h}^{2}\sum_{n=1}^{M}(\chi_{in})_{h}\mathbf{x}_{i}(\boldsymbol{\xi}_{jn})_{h}^{\prime}B(t_{il})
-\sum_{k\neq j}(Z_{ij})_{h}(Z_{ik})_{h}\left[\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}(\chi_{in})_{h}\left((\boldsymbol{\phi}_{kn})_{h}+(\boldsymbol{\xi}_{kn})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\Big)\Bigg),

and

\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}^{-1}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left((Z_{ij})_{h}^{2}(\chi_{im})_{h}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\left(\mathbf{D}_{\boldsymbol{\phi}_{jm}}\right)_{h}^{-1},

we have that

\left(\boldsymbol{\phi}_{jm}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\phi}_{jm}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}\left(\mathbf{m}_{\boldsymbol{\phi}_{jm}}\right)_{h},\left(\mathbf{M}_{\boldsymbol{\phi}_{jm}}\right)_{h}\right).

Letting

\left(\mathbf{m}_{\boldsymbol{\xi}_{kdm}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\Bigg(B(t_{il})(\chi_{im})_{h}x_{id}(Z_{ik})_{h}\Bigg(y_{i}(t_{il})
-\sum_{j=1}^{K}(Z_{ij})_{h}\left[\left((\boldsymbol{\nu}_{j})_{h}+(\boldsymbol{\eta}_{j})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n=1}^{M}(\chi_{in})_{h}(\boldsymbol{\phi}_{jn})_{h}^{\prime}B(t_{il})\right]
-\sum_{(j,n,r)\neq(k,m,d)}(Z_{ij})_{h}(\chi_{in})_{h}x_{ir}(\boldsymbol{\xi}_{jrn})_{h}^{\prime}B(t_{il})\Bigg)\Bigg)

and

\left(\mathbf{M}_{\boldsymbol{\xi}_{kdm}}\right)_{h}^{-1}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left((Z_{ik})_{h}^{2}(\chi_{im})_{h}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)+\left(\mathbf{D}_{\boldsymbol{\xi}_{kdm}}\right)_{h}^{-1},

we have that

\left(\boldsymbol{\xi}_{kdm}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\xi}_{kdm}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{M}_{\boldsymbol{\xi}_{kdm}}\right)_{h}\left(\mathbf{m}_{\boldsymbol{\xi}_{kdm}}\right)_{h},\left(\mathbf{M}_{\boldsymbol{\xi}_{kdm}}\right)_{h}\right).

As in the untempered case, the posterior distribution of the $\mathbf{Z}$ parameters under the tempered likelihood is not a commonly known distribution; therefore, we will use the Metropolis-Hastings algorithm. We have that

p((\mathbf{z}_{i})_{h}|\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X})\propto\prod_{k=1}^{K}(Z_{ik})_{h}^{(\alpha_{3})_{h}(\pi_{k})_{h}-1}
\times\prod_{l=1}^{n_{i}}\exp\left\{-\frac{\beta_{h}}{2(\sigma^{2})_{h}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right.\right.\right.
\left.\left.\left.+\sum_{m=1}^{M}(\chi_{im})_{h}\left((\boldsymbol{\phi}_{km})_{h}+(\boldsymbol{\xi}_{km})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right)^{2}\right\}.

We will use Q((\mathbf{z}_{i})_{h}^{\prime}|(\mathbf{z}_{i})_{h})=Dir(a_{\mathbf{z}}(\mathbf{z}_{i})_{h}) for some large a_{\mathbf{z}}\in\mathbb{R}^{+} as the proposal distribution. Thus the probability of accepting a proposed step is

A((\mathbf{z}_{i})_{h}^{\prime},(\mathbf{z}_{i})_{h})=\min\left\{1,\frac{P\left((\mathbf{z}_{i})_{h}^{\prime}|\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}^{\prime}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}{P\left((\mathbf{z}_{i})_{h}|\boldsymbol{\Theta}_{-(\mathbf{z}_{i})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\right)}\frac{Q\left((\mathbf{z}_{i})_{h}|(\mathbf{z}_{i})_{h}^{\prime}\right)}{Q\left((\mathbf{z}_{i})_{h}^{\prime}|(\mathbf{z}_{i})_{h}\right)}\right\}.
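
A minimal sketch of this Metropolis-Hastings step, assuming a user-supplied function log_target that evaluates the logarithm of the tempered full conditional above (all names are illustrative):

import numpy as np
from scipy.stats import dirichlet

def update_z_i(z_curr, log_target, a_z, rng):
    # Propose z' ~ Dir(a_z * z_curr); a large a_z keeps proposals close to
    # the current state, so moves on the probability simplex stay local.
    z_prop = rng.dirichlet(a_z * z_curr)
    log_accept = (log_target(z_prop) - log_target(z_curr)
                  + dirichlet.logpdf(z_curr, a_z * z_prop)   # Q(z | z')
                  - dirichlet.logpdf(z_prop, a_z * z_curr))  # Q(z' | z)
    # Comparing log(u) to the log ratio implements min{1, .} automatically.
    return z_prop if np.log(rng.uniform()) < log_accept else z_curr

In finite precision the Dirichlet proposal can occasionally return components numerically equal to zero, so implementations typically clip the proposal slightly away from the boundary before evaluating the densities.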

Letting

\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}=\left(\left(\tau_{\boldsymbol{\nu}_{j}}\right)_{h}\mathbf{P}+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\displaystyle\left(\mathbf{b}_{\boldsymbol{\nu}_{j}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{k\neq j}(Z_{ik})_{h}(\boldsymbol{\nu}_{k})_{h}^{\prime}B(t_{il})\right)\right.
\displaystyle\left.-\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left[\mathbf{x}_{i}(\boldsymbol{\eta}_{k})_{h}^{\prime}B(t_{il})+\sum_{m=1}^{M}(\chi_{im})_{h}\left((\boldsymbol{\phi}_{km})_{h}+(\boldsymbol{\xi}_{km})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\right)\right],

we have that

\left(\boldsymbol{\nu}_{j}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\nu}_{j}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}\left(\mathbf{b}_{\boldsymbol{\nu}_{j}}\right)_{h},\left(\mathbf{B}_{\boldsymbol{\nu}_{j}}\right)_{h}\right).

Let \left(\boldsymbol{\eta}_{jd}\right)_{h} denote the d^{th} column of the matrix (\boldsymbol{\eta}_{j})_{h}. Thus, letting

\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}=\left(\left(\tau_{\boldsymbol{\eta}_{jd}}\right)_{h}\mathbf{P}+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}^{2}x_{id}^{2}B(t_{il})B^{\prime}(t_{il})\right)^{-1}

and

\displaystyle\left(\mathbf{b}_{\boldsymbol{\eta}_{jd}}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}(Z_{ij})_{h}x_{id}B(t_{il})\left[y_{i}(t_{il})-\left(\sum_{r\neq d}(Z_{ij})_{h}x_{ir}(\boldsymbol{\eta}_{jr})_{h}^{\prime}B(t_{il})\right)\right.
\displaystyle-\left(\sum_{k\neq j}(Z_{ik})_{h}\mathbf{x}_{i}(\boldsymbol{\eta}_{k})_{h}^{\prime}B(t_{il})\right)
\displaystyle\left.-\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left[(\boldsymbol{\nu}_{k})_{h}^{\prime}B(t_{il})+\sum_{m=1}^{M}(\chi_{im})_{h}\left((\boldsymbol{\phi}_{km})_{h}+(\boldsymbol{\xi}_{km})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right]\right)\right],

we have that

\left(\boldsymbol{\eta}_{jd}\right)_{h}|\boldsymbol{\Theta}_{-\left(\boldsymbol{\eta}_{jd}\right)_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}\left(\mathbf{b}_{\boldsymbol{\eta}_{jd}}\right)_{h},\left(\mathbf{B}_{\boldsymbol{\eta}_{jd}}\right)_{h}\right).

If we let

\displaystyle\left(\beta_{\sigma}\right)_{h}=\frac{\beta_{h}}{2}\sum_{i=1}^{N}\sum_{l=1}^{n_{i}}\left(y_{i}(t_{il})-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right.\right.
\displaystyle\left.\left.+\sum_{n=1}^{M}(\chi_{in})_{h}\left((\boldsymbol{\phi}_{kn})_{h}+(\boldsymbol{\xi}_{kn})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\right)^{2},

then we have

(\sigma^{2})_{h}|\boldsymbol{\Theta}_{-(\sigma^{2})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim IG\left(\alpha_{0}+\frac{\beta_{h}\sum_{i=1}^{N}n_{i}}{2},\beta_{0}+\left(\beta_{\sigma}\right)_{h}\right).
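
This conditional can be sampled directly. A minimal sketch, assuming n_total holds \sum_{i=1}^{N}n_{i} and beta_sigma_h holds (\beta_{\sigma})_{h} (illustrative names):

import numpy as np

def draw_sigma2_h(alpha_0, beta_0, beta_h, n_total, beta_sigma_h, rng):
    # Inverse-gamma draw via the reciprocal of a gamma variate: if
    # G ~ Gamma(shape, rate), then 1/G ~ IG(shape, rate). Note that
    # numpy parameterizes the gamma by shape and scale = 1 / rate.
    shape = alpha_0 + beta_h * n_total / 2.0
    rate = beta_0 + beta_sigma_h
    return 1.0 / rng.gamma(shape, 1.0 / rate)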

Lastly, we can update the \chi_{im} parameters, for i=1,\dots,N and m=1,\dots,M, using a Gibbs update. If we let

\displaystyle\left(\mathbf{w}_{im}\right)_{h}=\frac{\beta_{h}}{(\sigma^{2})_{h}}\left[\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left((\boldsymbol{\phi}_{km})_{h}+(\boldsymbol{\xi}_{km})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\Bigg{(}y_{i}(t_{il})\right.
\displaystyle\left.-\sum_{k=1}^{K}(Z_{ik})_{h}\left(\left((\boldsymbol{\nu}_{k})_{h}+(\boldsymbol{\eta}_{k})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})+\sum_{n\neq m}(\chi_{in})_{h}\left((\boldsymbol{\phi}_{kn})_{h}+(\boldsymbol{\xi}_{kn})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)\Bigg{)}\right]

and

\left(\mathbf{W}_{im}\right)_{h}^{-1}=1+\frac{\beta_{h}}{(\sigma^{2})_{h}}\sum_{l=1}^{n_{i}}\left(\sum_{k=1}^{K}(Z_{ik})_{h}\left((\boldsymbol{\phi}_{km})_{h}+(\boldsymbol{\xi}_{km})_{h}\mathbf{x}_{i}^{\prime}\right)^{\prime}B(t_{il})\right)^{2},

then we have that

(\chi_{im})_{h}|\boldsymbol{\Theta}_{-(\chi_{im})_{h}},\mathbf{Y}_{1},\dots,\mathbf{Y}_{N},\mathbf{X}\sim\mathcal{N}\left(\left(\mathbf{W}_{im}\right)_{h}\left(\mathbf{w}_{im}\right)_{h},\left(\mathbf{W}_{im}\right)_{h}\right).
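
Since (\chi_{im})_{h} is a scalar, this final update is a univariate normal draw. A minimal sketch under illustrative names, where w_im and W_im_inv hold (\mathbf{w}_{im})_{h} and (\mathbf{W}_{im})_{h}^{-1}:

import numpy as np

def draw_chi_im(w_im, W_im_inv, rng):
    # The Gibbs update is N((W_im)_h (w_im)_h, (W_im)_h), with (W_im)_h
    # the reciprocal of the accumulated scalar precision (W_im)_h^{-1}.
    W_im = 1.0 / W_im_inv
    return rng.normal(loc=W_im * w_im, scale=np.sqrt(W_im))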