
Functional clustering methods for binary longitudinal data with temporal heterogeneity

Jinwon Sohn [email protected] Seonghyun Jeong [email protected] Young Min Cho [email protected] Taeyoung Park [email protected] Department of Statistics, Purdue University, IN 47907, USA Department of Applied Statistics, Yonsei University, Seoul 03722, Korea Department of Statistics and Data Science, Yonsei University, Seoul 03722, Korea Department of Computer and Information Science, University of Pennsylvania, PA 19104, USA
Abstract

In the analysis of binary longitudinal data, it is of interest to model a dynamic relationship between a response and covariates as a function of time, while also investigating similar patterns of time-dependent interactions. We present a novel generalized varying-coefficient model that accounts for within-subject variability and simultaneously clusters varying-coefficient functions, without restricting the number of clusters or overfitting the data. In the analysis of a heterogeneous series of binary data, the model extracts population-level fixed effects, cluster-level varying effects, and subject-level random effects. Various simulation studies show the validity and utility of the proposed method in correctly specifying cluster-specific varying coefficients when the number of clusters is unknown. The proposed method is applied to a heterogeneous series of binary data from the German Socioeconomic Panel (GSOEP) study, where we identify three major clusters demonstrating the different varying effects of socioeconomic predictors, as a function of age, on working status.

keywords:
Longitudinal data , Probit mixed models , Varying-coefficients , Partial collapsed Gibbs sampler , Dirichlet process.
journal: Computational Statistics & Data Analysis

1 Introduction

Mixed-effects models are commonly used in binary longitudinal studies in the social, behavioral, and health sciences. Their popularity stems from their ability to capture longitudinal effects generated by repeated-measurement processes. More specifically, random effects are introduced into linear models with only fixed effects to reflect the correlation between observations on the same subject. This extension also avoids some of the technical issues that can arise in the analysis of variance. For example, Stiratelli et al. (1984) showed that mixed-effects models have advantages over Markov models when dealing with a series of binary data because they better support interpretation of covariate effects and circumvent some of the difficult issues caused by unbalanced designs or missing values. Other approaches to dealing with serial effects in longitudinal data provide practical recommendations by combining mixed-effects models with other statistical methods (Varin and Czado, 2009; Guerra et al., 2012).

A varying-coefficient model has been shown to be extremely effective for modeling time-varying effects in longitudinal studies (Hastie and Tibshirani, 1993; Hoover et al., 1998; Wu et al., 1998; Lang and Brezger, 2004; Sun and Wu, 2005; Fan and Zhang, 2008; Lu and Zhang, 2009; Jeong and Park, 2016; Jeong et al., 2017; Park and Jeong, 2018). Such varying-coefficient functions can be easily modeled by Bayesian methods, e.g., Bayesian P-splines (Lang and Brezger, 2004), series priors (Shen and Ghosal, 2015), Gaussian process priors (Neal, 1998), Bayesian wavelets (Chipman et al., 1997), and free-knot splines with adaptive knot selection (Smith and Kohn, 1996; DiMatteo et al., 2001). The main advantage of the Bayesian approaches is that uncertainty quantification is naturally performed with credible sets obtained by Markov chain Monte Carlo (MCMC). The method of free-knot splines with adaptive knot selection, in particular, exhibits natural local adaptation to spatially inhomogeneous smoothness (Smith and Kohn, 1996; Kohn et al., 2001; Ruppert et al., 2003; Kang and Jeong, 2023).

Traditional varying-coefficient mixed models focus on exploring common varying coefficients shared across all subjects. However, because there are often various sources of heterogeneity among subjects, particularly in longitudinal studies, such a common structure of varying coefficients may be oversimplified, leading to incorrect conclusions. To uncover the heterogeneity of the population, many model-based clustering approaches have been proposed from both frequentist and Bayesian perspectives (Lenk and DeSarbo, 2000; James and Sugar, 2003; Heard et al., 2006; Shi and Wang, 2008; Aßmann and Boysen-Hogrefe, 2011; Coffey et al., 2014; Berrettini et al., 2022). For example, James and Sugar (2003) suggested modeling individual basis coefficients by random effects with the mean indexed by the cluster. Heard et al. (2006) developed a hierarchical Bayesian model that avoids MCMC through their particular model formulation. Coffey et al. (2014) developed a clustering method for longitudinal gene profiles via penalized splines. More recently, Berrettini et al. (2022) devised a semi-parametric mixture model with mixture weights and conditional means that are modeled as nonlinear functions of covariates. Although these frameworks clearly offer inferential advantages in the presence of heterogeneity, most require determining the appropriate number of clusters in advance and involve additional evaluation steps decoupled from parameter estimation, such as cross-validation.

In contrast, the Bayesian nonparametric framework naturally chooses the required number of clusters in a data-driven way by using stochastic process priors that randomly partition the sample space to be clustered. These priors include the Dirichlet process (DP) (Ferguson, 1973), the two-parameter Poisson-Dirichlet process (Pitman and Yor, 1997), and the generalized stick-breaking process (Ishwaran and James, 2001), to name a few. As they are essentially infinite-dimensional priors, they have become essential clustering tools for modeling an infinite number of clusters in various areas (Ishwaran and James, 2001; Teh et al., 2006; Lau and Green, 2007; Wallach et al., 2010; Canale and Dunson, 2011; Yerebakan et al., 2014; Kyung, 2015). In the context of nonparametric regression, Müller et al. (1996) employed the DP prior to jointly partition the support of response and predictor variables, which performs locally weighted regression estimation in terms of Bayesian predictive inference. Gelfand et al. (2005) incorporated the dependent DP prior (MacEachern et al., 2001) into the Gaussian process prior for spatial analysis. Ray and Mallick (2006) studied a Bayesian wavelet regression model in which the DP prior has a base measure that promotes sparsity of the wavelet coefficients. Petrone et al. (2009) further considered local heterogeneity in a subgroup of curves by proposing a hybrid Dirichlet prior that goes beyond global heterogeneity. Chib and Greenberg (2010) used the DP prior for modeling an error distribution while approximating nonlinear components via cubic splines with a smoothness prior regularizing differences of spline coefficients at the knot locations. Rodriguez and Dunson (2014) employed the generalized DP prior to cluster curves smoothed by the free-knot spline method. Suarez and Ghosal (2016) assigned the DP prior independently to each wavelet coefficient, rather than jointly to the set of coefficients, with the sparsity structure used in Ray and Mallick (2006). Margaritella et al. (2021) applied the DP prior to the clustering of functional principal component scores, which improves curve and correlation reconstruction.

While functional clustering for continuous responses has matured, there have been relatively few works for longitudinal binary responses with nonparametric components. Kuss et al. (2006) considered a parametric logistic model in which the cluster allocation is assumed to follow a multinomial distribution. Similarly, Aßmann and Boysen-Hogrefe (2011) designed a Bayesian probit regression model with multinomial label allocation. Hannah et al. (2011) proposed a DP mixture model for generalized linear models in the spirit of Müller et al. (1996), with the restriction of a linear relationship within each cluster. More recently, Zhu et al. (2021) devised a model-free clustering method for binary longitudinal data using a pairwise penalty on nearby clusters, but their model does not account for any functional effects. None of the aforementioned studies deal with both model-based clustering on a mixed-effects model (especially with the DP prior) and nonparametric function estimation with guaranteed smoothness.

Our contribution is threefold. First, we propose a flexible framework for simultaneously modeling population-level fixed effects, cluster-level varying effects, and subject-level random effects in the analysis of binary longitudinal data. The proposed model is a probit varying-coefficient mixed model that flexibly and adaptively identifies different subpopulations having their own varying-coefficient functions, which can be constant, linear, or nonlinear. Second, we devise new prior distributions for the posterior analysis and effective functional clustering of the proposed model. In particular, it is well known that the measurement scale of the data must be considered in choosing the base measure for the DP prior (Gelman et al., 2015). We carefully design a prior distribution with a reasonable scale so that it can readily start a new cluster within a sampler that accounts for an infinite number of clusters, while achieving suitable smoothing for function estimation with spatial adaptation. Third, we construct a partially collapsed Gibbs (PCG) sampler to address the varying-dimensional parameter issue of a standard Gibbs sampler and facilitate posterior computation via the method of partial collapse (van Dyk and Park, 2008; Park and van Dyk, 2009). To maintain its transition kernel, a PCG sampler, unlike a standard Gibbs sampler, requires its steps to be performed in a specific order. We thus develop a PCG sampler that can be used to fit the proposed model.

The remainder of this paper is organized as follows. In Section 2, we describe the probit varying-coefficient mixed model and discuss how the DP prior enables model-based clustering. Section 3 specifies prior distributions and constructs efficient sampling steps based on the method of partial collapse. In Section 4, simulation studies are presented to validate the proposed method. Section 5 applies the proposed model to the GSOEP data, and Section 6 discusses the results. Appendix A contains a detailed description of the proposed method, while Appendix B describes how to install the R package for the proposed method. The R package is currently available on the first author's GitHub (https://github.com/Jwsohn612/fvcc), so that all of the results in Sections 4 and 5 can be reproduced.

2 Probit Varying-Coefficient Mixed Models for Functional Clustering

Let $Y_{ij}$ represent a binary response observed at time $t_{ij}$ for observation $j$ on subject $i$, where $i=1,\dots,N$ and $j=1,\dots,n_{i}$. The outcome of the response $Y_{ij}$ can be expressed as an indicator function of the sign of a latent variable $L_{ij}$, i.e.,

Y_{ij} = I(L_{ij} > 0),

where the latent variable is introduced for computational convenience but can be interpreted as a utility difference between choosing $Y_{ij}=1$ or $Y_{ij}=0$.

In longitudinal studies, a relationship between the latent variable $L_{ij}$ and the available covariates is commonly specified by a generalized linear mixed model to account for between-subject variability. Specifically, the generalized linear mixed model is expressed as

L_{ij} = \mathbf{X}_{i}^{(j)}\boldsymbol{\beta} + \mathbf{Z}_{i}^{(j)}\mathbf{b}_{i} + \epsilon_{ij}, \qquad (1)

where $\mathbf{X}_{i}=(\mathbf{x}_{i1},\dots,\mathbf{x}_{iq})$ and $\mathbf{Z}_{i}=(\mathbf{z}_{i1},\dots,\mathbf{z}_{ir})$ are $n_{i}\times q$ and $n_{i}\times r$ design matrices for subject $i$, $\mathbf{X}_{i}^{(j)}$ and $\mathbf{Z}_{i}^{(j)}$ denote the $j$th row vectors of $\mathbf{X}_{i}$ and $\mathbf{Z}_{i}$, respectively, $\boldsymbol{\beta}$ is a $q\times 1$ vector of fixed effects, $\mathbf{b}_{i}$ is an $r\times 1$ vector of random effects for subject $i$, and $\epsilon_{ij}$ is an underlying error term that is assumed to follow a logistic or normal distribution in the logit or probit model, respectively. Note that the model in (1) involves multiple random effects on binary longitudinal data; for linear mixed models with multiple random effects, see Vines et al. (1996); Meng and van Dyk (1998); van Dyk (2000); Kim et al. (2013); Park and Min (2016).
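As a toy illustration (not the authors' code), data from the probit version of model (1) can be simulated by drawing latent utilities and thresholding them at zero; all dimensions and parameter values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: subjects, obs per subject, fixed and random effects.
N, n_i, q, r = 50, 10, 2, 1
beta = np.array([0.5, -1.0])          # fixed effects (hypothetical values)
Psi = np.eye(r)                       # random-effects covariance (hypothetical)

Y = []
for i in range(N):
    X_i = rng.normal(size=(n_i, q))                      # fixed-effect design
    Z_i = np.ones((n_i, r))                              # random intercept
    b_i = rng.multivariate_normal(np.zeros(r), Psi)      # b_i ~ N(0, Psi)
    eps = rng.normal(size=n_i)                           # probit: N(0, 1) errors
    L_i = X_i @ beta + Z_i @ b_i + eps                   # latent utilities
    Y.append((L_i > 0).astype(int))                      # Y_ij = I(L_ij > 0)

Y = np.array(Y)
print(Y.shape)  # (50, 10)
```

The balanced design ($n_i$ equal for all subjects) is only for brevity; the model itself allows unbalanced observation counts.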

In the presence of within-subject correlation over time, Jeong et al. (2017) extended the generalized linear mixed model in (1) to incorporate varying coefficients $\boldsymbol{\alpha}(t)$ that vary over time $t$, i.e.,

L_{ij} = \mathbf{W}_{i}^{(j)}\boldsymbol{\alpha}(t_{ij}) + \mathbf{X}_{i}^{(j)}\boldsymbol{\beta} + \mathbf{Z}_{i}^{(j)}\mathbf{b}_{i} + \epsilon_{ij}, \qquad (2)

where $\mathbf{W}_{i}=(\mathbf{w}_{i1},\dots,\mathbf{w}_{ip})$ is an $n_{i}\times p$ design matrix for subject $i$, $\mathbf{W}_{i}^{(j)}$ denotes the $j$th row vector of $\mathbf{W}_{i}$, $\boldsymbol{\alpha}(t)=(\alpha_{1}(t),\dots,\alpha_{p}(t))^{\top}$ is a $p\times 1$ vector of unknown smooth functions that vary over time $t$, and $t_{ij}$ is the $j$th time point of the $i$th subject. Note that $\boldsymbol{\alpha}(t_{ij})$ is a vector of real values of $\boldsymbol{\alpha}(t)$ evaluated at $t_{ij}$. The $l$th time-varying function in the vector $\boldsymbol{\alpha}(t)$, i.e., $\alpha_{l}(t)$, can be modeled with regression splines that use a linear combination of basis functions, e.g.,

\mathbf{B}_{l}(t) = \left(1,\, t,\, |t-\omega_{l1}|^{3},\, |t-\omega_{l2}|^{3},\, \dots,\, |t-\omega_{lM_{l}}|^{3}\right)^{\top}, \qquad (3)

where $\boldsymbol{\omega}_{l}=(\omega_{l1},\dots,\omega_{lM_{l}})$ is an ordered sequence of knot-candidate locations within the range of observed time points, for $l=1,\dots,p$. The amount of smoothness for the $l$th regression spline is controlled by the number $M_{l}$ and the locations $\boldsymbol{\omega}_{l}$.
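For concreteness, the basis in (3) can be evaluated as follows; the time grid and the number of knot candidates are illustrative choices, not values from the paper.

```python
import numpy as np

def cubic_basis(t, knots):
    """Evaluate B_l(t) of (3): (1, t, |t - w_1|^3, ..., |t - w_M|^3)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    cols = [np.ones_like(t), t] + [np.abs(t - w) ** 3 for w in knots]
    return np.column_stack(cols)   # shape (len(t), M_l + 2)

# Knot candidates at sample quantiles of the observed times
# (the paper recommends 20 to 30 candidates).
times = np.linspace(0.0, 1.0, 200)
knots = np.quantile(times, np.linspace(0.05, 0.95, 20))
B = cubic_basis(times, knots)
print(B.shape)  # (200, 22)
```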

To account for heterogeneity among the subjects while borrowing strength across subjects, we consider allocating each subject to its own cluster with different varying-coefficient functions. To do so, we represent the set of functions $\boldsymbol{\alpha}(t)$ in (2) as subject-level varying coefficients, i.e., $\boldsymbol{\alpha}_{i}(t)=(\alpha_{i1}(t),\dots,\alpha_{ip}(t))^{\top}$. Each of the unknown subject-specific varying-coefficient functions is assumed to fall in the linear span of a set of its own basis functions according to basis selection, i.e.,

\alpha_{il}(t) \approx (\mathbf{B}_{l}(t)\odot\boldsymbol{\gamma}_{il})^{\top}\boldsymbol{\phi}_{il},

where $\odot$ denotes element-wise multiplication of vectors, $\boldsymbol{\gamma}_{il}=(1,\gamma_{il0},\gamma_{il1},\dots,\gamma_{ilM_{l}})^{\top}$ denotes an $(M_{l}+2)\times 1$ vector of indicator variables for basis inclusion, $\gamma_{ilm}=1$ indicates that the $(m+2)$th element of $\mathbf{B}_{l}(t)$ is used as a basis function, and $\boldsymbol{\phi}_{il}$ denotes an $(M_{l}+2)\times 1$ vector of basis coefficients corresponding to the $l$th varying coefficient for subject $i$. The first element of $\boldsymbol{\gamma}_{il}$ equals one, so the constant basis function in (3) always remains in the model. When the corresponding covariate has no interaction with time, we have $\gamma_{ilm}=0$ for $m=0,\dots,M_{l}$, and the model in (2) reduces to the generalized linear mixed model in (1). That is, when the true varying-coefficient function is constant, our model can estimate it as a constant function by choosing $\gamma_{ilm}=0$ for $m=0,\dots,M_{l}$, reducing modeling bias and avoiding the possibility of overfitting. When the true varying-coefficient function is nonlinear, selecting appropriate knots allows the estimated function to adapt to the true function's curvature. That is, we use the data to adjust the spatially inhomogeneous smoothness of a varying-coefficient function, so that more knots are used in a high-curvature region and fewer knots in a low-curvature region. This implies that we do not need to determine a priori whether a varying-coefficient function is constant, linear, or nonlinear (Jeong and Park, 2016; Jeong et al., 2017, 2022). The value of $M_{l}$ for the knot candidates is not crucial as long as it is large enough to capture the global and local characteristics of a function. Following the literature (e.g., Kohn et al., 2001), we recommend using 20 to 30 knot candidates chosen by the sample quantiles of the time variable $t$. If the time variable is repeatedly observed at some points in time, $M_{l}$ should not be larger than the number of non-duplicated values of $t$.

Next, the individualized vector of functions $\boldsymbol{\alpha}_{i}(t)$ is given the DP prior, which induces functional clustering with respect to the functions. Section 3.1 describes in detail how the model leverages the DP prior to cluster varying coefficients. Let $C_{i}=k$ denote that subject $i$ belongs to cluster $k$ sharing identical basis functions for varying coefficients, for $i=1,\dots,N$ and $k=1,\dots,K$. Then, through the DP prior, we have $\boldsymbol{\alpha}_{i}(t)=\boldsymbol{\alpha}^{*}_{k}(t)$, which implies $\boldsymbol{\gamma}_{il}=\boldsymbol{\gamma}^{*}_{kl}$ and $\boldsymbol{\phi}_{il}=\boldsymbol{\phi}^{*}_{kl}$ as well. To be specific,

\alpha_{il}(t) = \alpha^{*}_{kl}(t) \approx (\mathbf{B}_{l}(t)\odot\boldsymbol{\gamma}^{*}_{kl})^{\top}\boldsymbol{\phi}^{*}_{kl}, \quad l=1,\dots,p, \qquad (4)

for the $i$th subject who is allocated to cluster $k$, having $C_{i}=k$. Thus, for the $i$th subject, the model in (2) can be represented in matrix form,

\mathbf{L}_{i} = \sum_{l=1}^{p}\left(\mathbf{w}_{il}\odot\alpha^{*}_{C_{i}l}(t_{ij})\big|_{j=1}^{n_{i}}\right) + \mathbf{X}_{i}\boldsymbol{\beta} + \mathbf{Z}_{i}\mathbf{b}_{i} + \boldsymbol{\epsilon}_{i}, \qquad (5)

where $\alpha^{*}_{C_{i}l}(t_{ij})\big|_{j=1}^{n_{i}}=\left(\alpha^{*}_{C_{i}l}(t_{i1}),\dots,\alpha^{*}_{C_{i}l}(t_{in_{i}})\right)^{\top}$ is an $n_{i}\times 1$ vector of real values, $\mathbf{L}_{i}$ is an $n_{i}\times 1$ vector of latent variables, and $\boldsymbol{\epsilon}_{i}$ is an $n_{i}\times 1$ error vector. This representation implies that the clustering process uses information from only the dynamic covariates.

To express (5) with the approximation in (4), we define an $n_{i}\times(M_{l}+2)$ matrix $\mathbf{B}^{*}_{il}=\left(\oplus_{j=1}^{n_{i}}(\mathbf{B}_{l}(t_{ij})\odot\boldsymbol{\gamma}^{*}_{C_{i}l})\right)^{\top}$, where $\oplus$ represents the direct sum of vectors or matrices, for the $l$th covariate of the $i$th subject. This matrix can have zero column vectors when the corresponding elements of $\boldsymbol{\gamma}^{*}_{C_{i}l}$ are zero. By removing the columns of zeros, we obtain an $n_{i}\times|\boldsymbol{\gamma}^{*}_{C_{i}l}|$ submatrix of $\mathbf{B}^{*}_{il}$, denoted by $\mathbf{B}^{\star}_{il}$, where $|\boldsymbol{\gamma}^{*}_{C_{i}l}|=\sum_{m}\gamma^{*}_{C_{i}lm}$. Then the model in (5) can be written as

\mathbf{L}_{i} = \mathbf{W}^{\star}_{i(\boldsymbol{\gamma}^{*}_{C_{i}})}\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{C_{i}})} + \mathbf{X}_{i}\boldsymbol{\beta} + \mathbf{Z}_{i}\mathbf{b}_{i} + \boldsymbol{\epsilon}_{i}, \qquad (6)

where $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{C_{i}})}$ is a vector of cluster-level basis coefficients whose size is the sum of all elements of $\boldsymbol{\gamma}^{*}_{C_{i}}$. It is

\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{C_{i}})} = \left(\boldsymbol{\phi}^{*}_{\boldsymbol{\gamma}^{*}_{C_{i}1}},\dots,\boldsymbol{\phi}^{*}_{\boldsymbol{\gamma}^{*}_{C_{i}p}}\right)^{\top} \in \mathbb{R}^{|\boldsymbol{\gamma}^{*}_{C_{i}}|\times 1},

where $|\boldsymbol{\gamma}^{*}_{C_{i}}|=\sum_{l,m}\gamma^{*}_{C_{i}lm}$ and $\boldsymbol{\phi}^{*}_{\boldsymbol{\gamma}^{*}_{C_{i}l}}$ is a $|\boldsymbol{\gamma}^{*}_{C_{i}l}|\times 1$ subvector of $\boldsymbol{\phi}^{*}_{C_{i}l}$ whose elements correspond to the nonzero columns of $\mathbf{B}^{*}_{il}$. Then, the design matrix $\mathbf{W}^{\star}_{i(\boldsymbol{\gamma}^{*}_{C_{i}})}$ is constructed by multiplying each set of selected basis terms by each column of $\mathbf{W}_{i}$, i.e.,

\mathbf{W}^{\star}_{i(\boldsymbol{\gamma}^{*}_{C_{i}})} = \left[\mathbf{w}_{i1}{\bf 1}^{\top}_{|\boldsymbol{\gamma}^{*}_{C_{i}1}|}\odot\mathbf{B}^{\star}_{i1},\dots,\mathbf{w}_{ip}{\bf 1}^{\top}_{|\boldsymbol{\gamma}^{*}_{C_{i}p}|}\odot\mathbf{B}^{\star}_{ip}\right] \in \mathbb{R}^{n_{i}\times|\boldsymbol{\gamma}^{*}_{C_{i}}|},

where ${\bf 1}_{|\boldsymbol{\gamma}^{*}_{C_{i}l}|}$ is a vector of ones in $\mathbb{R}^{|\boldsymbol{\gamma}^{*}_{C_{i}l}|\times 1}$.
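The construction of $\mathbf{W}^{\star}_{i(\boldsymbol{\gamma}^{*}_{C_{i}})}$ can be sketched as below; the helper names and toy values are ours, not part of the authors' package.

```python
import numpy as np

def cubic_basis(t, knots):
    """B_l(t) of (3): (1, t, |t - w_1|^3, ..., |t - w_M|^3)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.column_stack([np.ones_like(t), t] +
                           [np.abs(t - w) ** 3 for w in knots])

def selected_design(W_i, t_i, knot_list, gamma_list):
    """Build W*_{i(gamma)} of (6): for each dynamic covariate l, keep only the
    basis columns with gamma = 1 and multiply them elementwise by w_il."""
    blocks = []
    for l, (knots, gamma) in enumerate(zip(knot_list, gamma_list)):
        B = cubic_basis(t_i, knots)              # n_i x (M_l + 2)
        B_sel = B[:, np.asarray(gamma, bool)]    # drop unselected columns
        blocks.append(W_i[:, [l]] * B_sel)       # w_il 1^T ⊙ B_il^star
    return np.hstack(blocks)

# Toy example: p = 2 dynamic covariates, 5 time points (values hypothetical).
rng = np.random.default_rng(1)
t_i = np.linspace(0, 1, 5)
W_i = rng.normal(size=(5, 2))
knot_list = [np.array([0.3, 0.7]), np.array([0.5])]
gamma_list = [[1, 1, 0, 1], [1, 0, 0]]   # first entry always 1 (constant basis)
W_star = selected_design(W_i, t_i, knot_list, gamma_list)
print(W_star.shape)  # (5, 4), since |gamma| = 3 + 1
```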

3 Bayesian Analysis

3.1 Dirichlet Process Prior

In this paper, the DP is used as a prior distribution to cluster $\boldsymbol{\alpha}_{i}(t)$, and this clustering procedure is equivalent to clustering the set $(\boldsymbol{\phi}_{i},\boldsymbol{\gamma}_{i})$, where $\boldsymbol{\phi}_{i}=(\boldsymbol{\phi}_{i1},\dots,\boldsymbol{\phi}_{ip})^{\top}$ and $\boldsymbol{\gamma}_{i}=(\boldsymbol{\gamma}_{i1},\dots,\boldsymbol{\gamma}_{ip})^{\top}$, as implied in (4). The process of assigning the DP prior to the set $(\boldsymbol{\phi}_{i},\boldsymbol{\gamma}_{i})$ is expressed as

(\boldsymbol{\phi}_{i},\boldsymbol{\gamma}_{i})\,|\,\mathcal{H} \stackrel{\rm iid}{\sim} \mathcal{H}, \quad i=1,\dots,N,
\mathcal{H} \sim {\rm DP}(\nu,\mathcal{H}_{0}),

where $\nu>0$ and $\mathcal{H}_{0}$ is a base distribution that randomly generates cluster-level parameters for $(\boldsymbol{\phi}_{i},\boldsymbol{\gamma}_{i})$. As another representation of the DP, it is worthwhile to look at the stick-breaking process (Sethuraman, 1994; Ishwaran and James, 2001), which allows truncation of the infinite summation in the DP after a large number of components $K$, i.e.,

\mathcal{H}(\cdot) = \sum_{k=1}^{\infty}\pi_{k}\delta_{(\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k})}(\cdot) \approx \sum_{k=1}^{K}\pi_{k}\delta_{(\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k})}(\cdot), \quad (\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k}) \stackrel{\rm iid}{\sim} \mathcal{H}_{0}, \qquad (7)

where $\boldsymbol{\phi}^{*}_{k}=(\boldsymbol{\phi}^{*}_{k1},\dots,\boldsymbol{\phi}^{*}_{kp})^{\top}$ and $\boldsymbol{\gamma}^{*}_{k}=(\boldsymbol{\gamma}^{*}_{k1},\dots,\boldsymbol{\gamma}^{*}_{kp})^{\top}$ are the parameters for cluster $k$, $\delta_{(\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k})}(\cdot)$ is a Dirac measure at $(\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k})$, $\pi_{k}$ is the probability mass at atom $(\boldsymbol{\phi}^{*}_{k},\boldsymbol{\gamma}^{*}_{k})$, and $K$ is a finite truncation for the maximum number of clusters. The equation in (7) implies that the model in (6) explores latent subpopulations by limiting the maximum number of subpopulations to $K$ rather than infinity; see Ishwaran and James (2001) for theoretical arguments. Meanwhile, the set of cluster-level parameters, $\boldsymbol{\phi}^{*}_{k}$ and $\boldsymbol{\gamma}^{*}_{k}$, is drawn from the base distribution $\mathcal{H}_{0}$, and the random weight $\pi_{k}$ derives from a set of random variables that each follow a beta distribution, i.e.,

\pi_{k} = \pi_{k}(\mathbf{V}) = V_{k}\prod_{\ell<k}(1-V_{\ell}), \quad V_{k} \stackrel{\rm ind}{\sim} {\rm Beta}(1,\nu), \quad k=1,\dots,K-1,

where $\mathbf{V}=\{V_{1},\dots,V_{K}\}$ and $V_{K}=1$, which guarantees that the random weights sum to one. Then, we can write $P(C_{i}=k\,|\,\mathbf{V})=\pi_{k}(\mathbf{V})$. The specification of $\nu$ may affect clustering performance. As $\nu$ goes to $0$, the concentration toward the existing clusters becomes stronger, decreasing the probability that a vector $(\boldsymbol{\phi}_{i},\boldsymbol{\gamma}_{i})$ forms a new cluster. In this work, we set $\nu=1$ by default so that every subject has equal probability of forming a new cluster.
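A minimal sketch of the truncated stick-breaking construction above (the function name and truncation level are ours):

```python
import numpy as np

def stick_breaking(nu, K, rng):
    """Truncated stick-breaking: pi_k = V_k * prod_{l<k}(1 - V_l), with V_K = 1."""
    V = rng.beta(1.0, nu, size=K)
    V[-1] = 1.0                          # truncation forces the weights to sum to 1
    # prod_{l<k}(1 - V_l): leading 1 for k = 1, then cumulative products
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining

rng = np.random.default_rng(2)
pi = stick_breaking(nu=1.0, K=25, rng=rng)
print(round(pi.sum(), 6))  # 1.0
```

Smaller `nu` concentrates mass on the first few sticks, matching the behavior described above as $\nu \to 0$.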

3.2 Prior Specification for the Submodel Parameters

This section discusses the specification of prior distributions for each model component. The set of indicator variables $\boldsymbol{\gamma}^{*}_{kl}$ has a beta-binomial prior distribution, i.e.,

p(\boldsymbol{\gamma}^{*}_{kl}) \propto B(|\boldsymbol{\gamma}^{*}_{kl}|+a,\, M_{l}+1-|\boldsymbol{\gamma}^{*}_{kl}|+b), \qquad (8)

for $k=1,\dots,K$ and $l=1,\dots,p$, where $B(\cdot,\cdot)$ denotes the beta function. If $a=b=1$, this prior distribution allocates equal probabilities to the numbers of active knots (Scott and Berger, 2010). This choice has been shown to work well for function estimation with knot selection (Jeong and Park, 2016; Jeong et al., 2017).
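The uniform-over-model-size property can be checked numerically by summing the prior mass in (8) over all configurations with the same number of selected terms; here we take $s$ to count selections among the $M_l+1$ free indicators, a reading we assume for this sketch.

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    """log of the beta function B(a, b) via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# p(gamma) ∝ B(s + a, M_l + 1 - s + b); with a = b = 1 the induced
# distribution over the model size s is uniform on {0, ..., M_l + 1}.
M_l, a, b = 20, 1.0, 1.0
size_probs = [comb(M_l + 1, s) * exp(log_beta(s + a, M_l + 1 - s + b))
              for s in range(M_l + 2)]
total = sum(size_probs)
size_probs = [p / total for p in size_probs]
print(round(size_probs[0], 6), round(size_probs[10], 6))  # both 1/(M_l + 2)
```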

For the basis coefficients $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{k})}$, whose dimension varies across iterations, we consider the following mixture of $g$-priors,

\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{k})}\,|\,(\boldsymbol{\gamma}^{*}_{k},\tau_{k}) \stackrel{\rm ind}{\sim} {\rm N}_{|\boldsymbol{\gamma}^{*}_{k}|}\left(\mathbf{0},\,\tau_{k}\mathbf{R}^{-1}_{k(\boldsymbol{\gamma}^{*}_{k})}\right),
\mathbf{R}_{k(\boldsymbol{\gamma}^{*}_{k})} = \sum_{i=1}^{N}\mathbf{W}^{\star\top}_{i(\boldsymbol{\gamma}^{*}_{k})}\mathbf{W}^{\star}_{i(\boldsymbol{\gamma}^{*}_{k})}, \quad k=1,\dots,K,
\tau_{k} \stackrel{\rm iid}{\sim} {\rm IG}(1/2,\,N/2), \quad k=1,\dots,K. \qquad (9)

By characterizing the scale parameter with the total number of subjects, the prior in (9) corresponds to a Zellner-Siow prior, which is marginally a multivariate Cauchy prior for $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{k})}$ (Liang et al., 2008). Therefore, the base measure $\mathcal{H}_{0}$ is constructed by combining a beta-binomial distribution and a multivariate Cauchy distribution.
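The scale construction in (9) can be sketched as follows for a single cluster; all dimensions are illustrative, and the inverse-gamma draw uses the reciprocal of a gamma variate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sketch of (9): phi | gamma, tau ~ N(0, tau * R^{-1}) with
# R = sum_i W_i' W_i, and tau ~ IG(1/2, N/2).
N, n_i, d = 100, 8, 5                        # subjects, obs/subject, |gamma| (toy)
W = [rng.normal(size=(n_i, d)) for _ in range(N)]
R = sum(Wi.T @ Wi for Wi in W)               # population-level information matrix

# IG(1/2, N/2): reciprocal of Gamma(shape = 1/2, scale = 2/N)
tau = 1.0 / rng.gamma(shape=0.5, scale=2.0 / N)
phi = rng.multivariate_normal(np.zeros(d), tau * np.linalg.inv(R))
print(phi.shape)  # (5,)
```

Because $\mathbf{R}$ pools $\mathbf{W}^{\star\top}\mathbf{W}^{\star}$ over all $N$ subjects, this draw is well scaled even for an empty cluster, which is the property the text highlights.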

The prior in (9) has several desirable properties. First, the prior distribution is invariant to linear transformations of the design matrix (Zellner, 1986). This means that the posterior distribution of the varying coefficients is not affected by linear transformations of the basis functions in (3). More importantly, the prior in (9) utilizes the population-level covariance, which is determined by assuming that all individuals belong to the $k$th cluster. This specific structure enhances the convergence efficiency of MCMC because the cluster-level basis coefficients $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{k})}$ of empty clusters are sampled by taking advantage of the summed information of all subjects. As a result, the prior can naturally begin a new cluster to which a few subjects may belong. Indeed, it is well known that choosing a reasonable scale for the base measure $\mathcal{H}_{0}$ is very important when using the DP prior (Gelman et al., 2015). In this regard, our prior construction in (9) has a clear advantage over related studies that use the DP prior but do not fully account for the scale concern (Ray and Mallick, 2006; Rodriguez and Dunson, 2014). Furthermore, obtaining the right scale may be difficult with other penalty priors. For example, the Bayesian P-spline is a widely used Bayesian approach for nonparametric regression (Lang and Brezger, 2004). Since its covariance structure plays an important role in smoothing, however, that prior distribution cannot simply be modified to have a reasonable scale for the DP prior.

The remaining specification of priors for the fixed-dimensional parameters is standard. For the fixed effects $\boldsymbol{\beta}$, we assign a multivariate normal distribution whose covariance matrix is positive definite, i.e.,

\boldsymbol{\beta} \sim {\rm N}_{q}(\mathbf{0},\mathbf{P}).

For practical purposes, $\mathbf{P}$ can be chosen as a diagonal matrix with large diagonal entries. For the random effects, a multivariate normal distribution is used,

\mathbf{b}_{i}\,|\,\boldsymbol{\Psi} \stackrel{\rm iid}{\sim} {\rm N}_{r}(\mathbf{0},\boldsymbol{\Psi}), \quad i=1,\dots,N,

where $\boldsymbol{\Psi}$ is the covariance matrix of the random effects and has an inverse-Wishart prior,

\boldsymbol{\Psi} \sim {\rm IW}(u,\mathbf{D}).

In our study, $u$ and $\mathbf{D}$ are fixed in advance to make the prior distribution diffuse.

3.3 Partially Collapsed Gibbs Sampler

Given the prior distributions in Section 3.2, we propose a sampling algorithm used to simulate the target posterior distribution,

p(\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*})},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L}\,|\,\mathbf{Y}), \qquad (10)

where $\boldsymbol{\gamma}^{*}=\{\boldsymbol{\gamma}^{*}_{1},\dots,\boldsymbol{\gamma}^{*}_{K}\}$, $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*})}=\{\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{1})},\dots,\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{K})}\}$, $\mathbf{b}=\{\mathbf{b}_{1},\dots,\mathbf{b}_{N}\}$, $\boldsymbol{\tau}=\{\tau_{1},\dots,\tau_{K}\}$, $\mathbf{C}=\{C_{1},\dots,C_{N}\}$, $\mathbf{L}=\{\mathbf{L}_{1},\dots,\mathbf{L}_{N}\}$, and $\mathbf{Y}=\{\mathbf{Y}_{1},\dots,\mathbf{Y}_{N}\}$, with $\mathbf{Y}_{i}$ denoting the set of binary responses for subject $i$. To simulate the target posterior distribution in (10), a standard Gibbs sampler based on its full conditional distributions cannot be implemented because the dimension $|\boldsymbol{\gamma}^{*}_{k}|\times 1$ of $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*}_{k})}$ depends on another model component, $\boldsymbol{\gamma}^{*}$. In such a varying-dimensional problem, PCG sampling avoids the need to jump between spaces of different dimensions through marginalization, permutation, and trimming, thereby making it implementable with the expectation of faster convergence; refer to Section 4 in Park and van Dyk (2009). In this study, we consider marginalizing the random effects $\mathbf{b}$ and the basis coefficients $\boldsymbol{\phi}^{\star}_{(\boldsymbol{\gamma}^{*})}$ in (10), thereby producing the following marginal distributions,

\begin{align}
&p(\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L}\mid\mathbf{Y}), \tag{11}\\
&p(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L}\mid\mathbf{Y}). \tag{12}
\end{align}
Initialize: $(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L})$
for $k=1,2,\dots,K$ do
       for $l=1,2,\dots,p$ do
             Step 1: Draw $\boldsymbol{\gamma}_{kl}^{*}$ from $p(\boldsymbol{\gamma}_{kl}^{*}\mid\boldsymbol{\gamma}_{-kl}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
for $k=1,2,\dots,K$ do
       Step 2: Draw $V_{k}$ from $p(V_{k}\mid\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
for $k=1,2,\dots,K$ do
       Step 3: Draw $\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}$ from $p(\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}\mid\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
Step 4: Draw $\boldsymbol{\beta}$ from $p(\boldsymbol{\beta}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
for $i=1,2,\dots,N$ do
       Step 5: Draw $\mathbf{b}_{i}$ from $p(\mathbf{b}_{i}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
for $k=1,2,\dots,K$ do
       Step 6: Draw $\tau_{k}$ from $p(\tau_{k}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
Step 7: Draw $\boldsymbol{\Psi}$ from $p(\boldsymbol{\Psi}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\mathbf{L},\mathbf{Y})$
for $i=1,2,\dots,N$ do
       for $j=1,2,\dots,n_{i}$ do
             Step 8: Draw $L_{ij}$ from $p(L_{ij}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{Y})$
for $i=1,2,\dots,N$ do
       Step 9: Draw $C_{i}$ from $p(C_{i}\mid\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$
Algorithm 1: One iteration of the PCG sampler

One iteration of the PCG sampler is shown in Algorithm 1. Steps 1 and 2 are marginalized by using (11), while Steps 3 and 4 are marginalized by using (12). The sampling steps are permuted into a specific order that maintains the target stationary distribution, and trimming removes redundant draws of model components. Because stationarity is guaranteed only for this particular ordering, changing the order of the sampling steps may destroy the stationarity of the Markov chain, and care must be taken not to reorder them; refer to van Dyk and Park (2008). For more applications of PCG sampling, including other varying-dimensional cases, refer to Park and van Dyk (2009); Jeong and Park (2016); Jeong et al. (2017); Park and Jeong (2018); Park et al. (2019). The details of Algorithm 1 are given in Appendix A.

4 A Simulation Study

In this section, we examine the validity, robustness, and sensitivity of the proposed method through extensive simulation studies. All simulation results are based on 300 replicated datasets.

4.1 Simulation Setting

Throughout the simulation, we mainly consider three groups of varying coefficients; the total number $N$ of subjects and the number of subjects in each cluster are specified later for each simulation setup. The values of the underlying effect modifier $t$, $\{t_{ij}:i=1,\dots,N,\ j=1,\dots,n_{i}\}$, are randomly generated from a uniform distribution between 0 and 1, and the known covariates for subject $i$, i.e., $\mathbf{W}_{i}$, $\mathbf{X}_{i}$, and $\mathbf{Z}_{i}$, are independently generated from a standard normal distribution, except that the first column of both $\mathbf{W}_{i}$ and $\mathbf{Z}_{i}$ is set to a column vector of 1's. For $t$ between 0 and 1, $\alpha_{kl}(t)$ denotes the varying-coefficient function of the $l$th covariate in cluster $k$. The varying coefficients for the three clusters are constant, linear, or nonlinear, as described below:

\begin{align*}
\alpha_{11}(t) &= 2\exp\{-200(t-0.2)^{2}\}+\exp\{-10(t-0.6)^{2}\},\\
\alpha_{12}(t) &= \sin(2\pi t^{3}),\\
\alpha_{21}(t) &= \sin\{8(t-0.5)\}+1.5\exp\{-400(t-0.5)^{2}\},\\
\alpha_{22}(t) &= 2t,\\
\alpha_{31}(t) &= -2t,\\
\alpha_{32}(t) &= 0.
\end{align*}

The true values of the other model parameters are set to $\boldsymbol{\beta}=(\beta_{1},\beta_{2})^{\top}=(1,-1)^{\top}$ and

\[
\boldsymbol{\Psi} ~=~ \begin{pmatrix}\psi_{11}&\psi_{12}\\ \psi_{12}&\psi_{22}\end{pmatrix} ~=~ \begin{pmatrix}0.5&0.25\\ 0.25&0.8\end{pmatrix}.
\]

The latent response $L_{ij}$ of the probit varying-coefficient mixed model is drawn from

\[
L_{ij} ~\sim~ \mathrm{N}\left(\mathbf{W}_{i}^{(j)}\boldsymbol{\alpha}_{C_{i}}(t_{ij})+\mathbf{X}_{i}^{(j)}\boldsymbol{\beta}+\mathbf{Z}_{i}^{(j)}\mathbf{b}_{i},\,1\right),\quad i=1,\dots,N,\quad j=1,\dots,n_{i},
\]

where $\boldsymbol{\alpha}_{C_{i}}(t_{ij})=(\alpha_{C_{i}1}(t_{ij}),\alpha_{C_{i}2}(t_{ij}))^{\top}$. The latent responses are then used to generate a series of binary data such that $Y_{ij}=I(L_{ij}>0)$ for observation $j$ on subject $i$.
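The data-generating mechanism above can be sketched in Python/NumPy as follows. This is a minimal illustration, not the authors' code; the within-subject sample size `n_i = 7` and the use of a single non-constant column in each design matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cluster-specific varying-coefficient functions from Section 4.1.
alpha = {
    1: (lambda t: 2*np.exp(-200*(t-0.2)**2) + np.exp(-10*(t-0.6)**2),
        lambda t: np.sin(2*np.pi*t**3)),
    2: (lambda t: np.sin(8*(t-0.5)) + 1.5*np.exp(-400*(t-0.5)**2),
        lambda t: 2*t),
    3: (lambda t: -2*t,
        lambda t: np.zeros_like(t)),
}

beta = np.array([1.0, -1.0])                  # fixed effects
Psi = np.array([[0.5, 0.25], [0.25, 0.8]])    # random-effect covariance

def simulate_subject(cluster, n_i=7):
    """Generate (t, Y) for one subject assigned to the given cluster."""
    t = rng.uniform(0, 1, n_i)                # effect modifier on (0, 1)
    # First columns of W and Z are 1's; remaining entries standard normal.
    W = np.column_stack([np.ones(n_i), rng.standard_normal(n_i)])
    X = rng.standard_normal((n_i, 2))
    Z = np.column_stack([np.ones(n_i), rng.standard_normal(n_i)])
    b = rng.multivariate_normal(np.zeros(2), Psi)  # subject-level random effect
    a1, a2 = alpha[cluster]
    mean = W[:, 0]*a1(t) + W[:, 1]*a2(t) + X @ beta + Z @ b
    L = mean + rng.standard_normal(n_i)       # latent probit response, unit variance
    return t, (L > 0).astype(int)             # Y_ij = I(L_ij > 0)

t, Y = simulate_subject(cluster=1)
```

Repeating `simulate_subject` over the cluster allocation of a scenario yields one replicated dataset.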

4.2 Performance of the Proposed Method

In this section, we demonstrate the performance of the proposed method under various simulation setups. We consider three scenarios: Scenario I, where each cluster has 400 subjects ($N=1200$); Scenario II, where each cluster has 200 subjects ($N=600$); and Scenario III, where the three clusters have 600, 400, and 200 subjects, respectively ($N=1200$). For each scenario, three values of the concentration parameter are considered to examine the robustness of the DP prior: $\nu\in\{0.1,1,10\}$. The proposed method is applied to each combination of scenario and concentration parameter with 300 replicated datasets. We run 20,000 iterations of the PCG sampler, discarding the first half of the draws as burn-in and using the second half for posterior analysis.

Figure 1: Performance measures for the clustering procedure obtained from 300 replicated datasets with $\nu=0.1$ (red), $\nu=1$ (green), and $\nu=10$ (blue).
Figure 2: Coverage probabilities of the pointwise 95% credible bands for the varying coefficients obtained from 300 replicated datasets with $\nu=0.1$ (red), $\nu=1$ (green), and $\nu=10$ (blue).
Table 1: RMSE of the posterior median and coverage probabilities of the 95% credible intervals for the fixed-dimensional parameters obtained from 300 replicated datasets.
                            RMSE                          95% coverage
              Parameter     $\nu=0.1$  $\nu=1$  $\nu=10$  $\nu=0.1$  $\nu=1$  $\nu=10$
Scenario I    $\beta_1$      0.029      0.030    0.031     0.930      0.930    0.930
              $\beta_2$      0.032      0.033    0.033     0.913      0.920    0.909
              $\psi_{11}$    0.070      0.072    0.072     0.937      0.916    0.937
              $\psi_{12}$    0.042      0.043    0.043     0.955      0.944    0.948
              $\psi_{22}$    0.073      0.074    0.075     0.937      0.934    0.941
Scenario II   $\beta_1$      0.042      0.046    0.044     0.962      0.919    0.920
              $\beta_2$      0.041      0.047    0.046     0.941      0.912    0.934
              $\psi_{11}$    0.119      0.124    0.121     0.882      0.908    0.885
              $\psi_{12}$    0.067      0.069    0.069     0.948      0.951    0.944
              $\psi_{22}$    0.102      0.110    0.107     0.948      0.930    0.948
Scenario III  $\beta_1$      0.029      0.030    0.031     0.946      0.932    0.917
              $\beta_2$      0.032      0.031    0.034     0.907      0.929    0.897
              $\psi_{11}$    0.063      0.064    0.064     0.929      0.943    0.921
              $\psi_{12}$    0.041      0.041    0.042     0.946      0.957    0.952
              $\psi_{22}$    0.077      0.077    0.079     0.946      0.954    0.941

The side-by-side boxplots in Figure 1 illustrate clustering performance for the 300 replicated datasets in terms of precision, recall, and F1-score. The clustering labels of all subjects are chosen by the posterior modes. As expected, clusters with larger sample sizes perform better in clustering. The metrics show similar results across different values of the concentration parameter $\nu$, indicating that clustering performance is robust to the specification of the hyperparameter of the DP prior. We calculate coverage probabilities of the pointwise 95% credible bands over the 300 replicates, where the bands are specified by the 2.5% and 97.5% posterior quantiles. Figure 2 shows that the coverage probabilities are close to the nominal value of 0.95, which validates the uncertainty quantification through the posterior distribution. The coverage probabilities are consistent across different values of $\nu$, further supporting the robustness of the proposed method to the hyperparameter specification. Table 1 reports the root-mean-square errors (RMSE) of the posterior medians and coverage probabilities of the 95% credible intervals for the fixed-dimensional parameters over the 300 replicated datasets. The results indicate that the fixed-dimensional parameters are also insensitive to the hyperparameter specification.
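Precision, recall, and F1-score for a clustering can be computed in several ways; the sketch below uses pair counting, which sidesteps the arbitrariness of cluster label names. The paper's exact per-cluster metric definitions may differ, so treat this as one plausible implementation rather than the authors' procedure.

```python
from itertools import combinations

def pairwise_prf(true_labels, pred_labels):
    """Pair-counting precision/recall/F1 for a clustering.

    A pair of subjects is a 'positive' for a clustering if both subjects
    are placed in the same cluster; the true partition supplies the
    reference positives.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2*precision*recall/(precision+recall) if precision+recall else 0.0
    return precision, recall, f1
```

Because only co-membership matters, a predicted partition that relabels the true clusters still scores a perfect 1.0 on all three metrics.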

To assess the efficiency of our proposed PCG sampler, we calculate the multivariate effective sample size (ESS) (Vats et al., 2019) for all simulation settings, as shown in Figure 3. The second half of the chain with 20,000 iterations is used to calculate the multivariate ESS, along with the running time of the sampler on a server equipped with CentOS7 and two Sky Lake CPUs @ 2.60GHz. The target parameters have a dimension of 185, including two varying-coefficient functions with 30 knots for each of three clusters and five fixed-dimensional parameters. As shown in Figure 3, the multivariate ESS is approximately 4,000–5,000 out of 10,000 iterations, implying that the proposed sampling algorithm exhibits reasonable convergence characteristics. Figure 3 also shows the multivariate ESS divided by time (seconds), providing the number of independent draws obtained per unit time. The results demonstrate that roughly 1.5 to 2.5 independent draws are obtained per second. Furthermore, the concentration hyperparameter has no significant influence on the multivariate ESS, indicating the robustness of our proposed method to the specification of the hyperparameter.

Figure 3: Estimates of the multivariate ESS out of 10,000 PCG iterations and the multivariate ESS per second. The estimates of the multivariate ESS are replaced by 10,000 if they are larger than 10,000.

Figure 4 overlays the pointwise posterior medians of the varying coefficients for the 300 replicated datasets using the default concentration parameter value of $\nu=1$. Our results show that the estimated posterior medians become closer to the true functions as sample sizes increase. We also examined the posterior medians obtained with alternative values, $\nu=0.1$ and $\nu=10$, but the results were similar and are therefore omitted here.

Figure 4: Pointwise posterior medians of the varying coefficients for 300 replicated datasets (gray solid lines) and the true varying-coefficient functions (red solid lines).

4.3 Effects of Ignoring Subpopulation

To evaluate the necessity of subpopulation modeling, we compare our proposed method with two competing approaches that do not account for functional clustering: Jeong et al. (2017) and the mgcv package (Wood, 2017). Jeong et al. (2017) is a fully Bayesian method for estimating probit varying-coefficient mixed models with a homogeneous population using free-knot splines (Smith and Kohn, 1996). The mgcv package uses penalized quasi-likelihood to estimate the same model. To avoid redundancy, we compare the two competing methods with our proposed method using replicated datasets under Scenario I, as specified in Section 4.2.

Figure 5: Estimates of the varying coefficients obtained from 300 replicated datasets. The estimates are the pointwise posterior medians for Jeong et al. (2017) and the penalized quasi-likelihood estimates for mgcv.
Table 2: RMSE of the estimates and coverage probabilities of the 95% intervals for the fixed-dimensional parameters obtained from 300 replicated datasets. The coverage probabilities are obtained by the 95% credible intervals for Jeong et al. (2017) and the 95% confidence intervals approximated with standard errors for mgcv. The mgcv package does not provide standard errors for the covariance of random effects.
             Proposed method ($\nu=1$)     Jeong et al. (2017)       mgcv
Parameter     RMSE    95% coverage         RMSE    95% coverage      RMSE    95% coverage
$\beta_1$     0.030      0.930             0.205      0.000          0.190      0.000
$\beta_2$     0.033      0.920             0.204      0.000          0.190      0.000
$\psi_{11}$   0.072      0.916             0.220      0.000          0.199        -
$\psi_{12}$   0.043      0.944             0.097      0.170          0.099        -
$\psi_{22}$   0.074      0.934             0.299      0.000          0.351        -

In Figure 5, we present the estimates of the varying coefficients obtained by the two competing methods that do not account for functional clustering. The estimated trends appear to average the varying coefficients across the clusters in Scenario I, resulting in substantial estimation bias for the cluster-specific effects. Furthermore, ignoring subpopulations also biases the estimates of the fixed-dimensional parameters, as demonstrated in Table 2, even though these parameters are common across all clusters. This suggests that accounting for subpopulation effects is crucial even for parameters that are shared among clusters.

4.4 Homogeneous Population

We also investigate the performance of the proposed method in a homogeneous population with a single true cluster. To generate the simulation datasets, we set all subjects to have the same values of $\alpha_{11}(t)$ and $\alpha_{12}(t)$, as described in Section 4.1. Specifically, we set $\boldsymbol{\alpha}_{C_{i}}(t_{ij})=(\alpha_{11}(t_{ij}),\alpha_{12}(t_{ij}))^{\top}$ for all $i$, resulting in a single cluster. The number of subjects is set to $N=300$, and all other simulation settings are identical to those specified in Section 4.1.

Figure 6: Estimates of varying coefficients of 300 replications (solid gray lines) and the true functions (solid red lines). The estimates are the pointwise posterior medians for the proposed method and Jeong et al. (2017) and the penalized quasi-likelihood estimates for mgcv.
Figure 7: Coverage probabilities of the 95% intervals for the varying coefficients obtained from 300 replicated datasets: the proposed method (red), Jeong et al. (2017) (blue), and mgcv (green). The coverage probabilities are obtained by the pointwise 95% credible bands for the proposed method and Jeong et al. (2017) and the pointwise 95% confidence bands approximated with standard errors for mgcv.
Table 3: RMSE of the estimates and coverage probabilities of the 95% intervals for the fixed-dimensional parameters obtained from 300 replicated datasets. The coverage probabilities are obtained by the 95% credible intervals for Jeong et al. (2017) and the 95% confidence intervals approximated with standard errors for mgcv. The mgcv package does not provide standard errors for the covariance of random effects.
             Proposed method ($\nu=1$)     Jeong et al. (2017)       mgcv
Parameter     RMSE    95% coverage         RMSE    95% coverage      RMSE    95% coverage
$\beta_1$     0.055      0.954             0.048      0.960          0.058      0.916
$\beta_2$     0.055      0.944             0.050      0.953          0.058      0.923
$\psi_{11}$   0.111      0.923             0.089      0.950          0.678        -
$\psi_{12}$   0.073      0.958             0.066      0.960          0.093        -
$\psi_{22}$   0.143      0.944             0.135      0.943          0.148        -

Figure 6 displays the pointwise posterior medians of the varying coefficients using the proposed method and the two competitors under the homogeneous population assumption. The proposed method shows reasonably small estimation bias compared to Jeong et al. (2017), despite some deviation for incorrectly clustered subjects, implying its applicability when the population structure is unknown. In contrast, mgcv yields larger estimation bias than the other two methods. Figure 7 presents the coverage probabilities of the pointwise 95% credible or confidence bands of the varying coefficients, further demonstrating the poorer performance of mgcv. Table 3 summarizes the results for the fixed-dimensional parameters, indicating that the RMSEs of the proposed method are only slightly larger than those of Jeong et al. (2017). Considering the flexibility of the proposed method in accounting for potentially heterogeneous populations, however, it can be deemed more useful than Jeong et al. (2017).

5 Application to Binary Longitudinal Data

5.1 Data Description and Modeling Procedure

In this section, we consider the German Socioeconomic Panel (GSOEP) data (Riphahn et al., 2003). The dataset consists of repeated observations from 7,293 subjects in Germany for the years 1984–1988, 1991, and 1994. The response variable of interest is working status (employed=1; otherwise=0), and the covariates consist of $A_{ij}$ (age), $M_{ij}$ (marital status; married=1, otherwise=0), $K_{ij}$ (children under the age of 16 in the household; yes=1, no=0), $H_{ij}$ (degree of handicap; 0 to 100 percent), and $S_{ij}$ (personal health satisfaction; 0 to 10). We confine our sample to 893 subjects under the age of 53 with Abitur degrees in order to examine the varying effect of having young children in the household on working status as a function of age for people with secondary education.

Based on a preliminary analysis that assumed varying effects for all covariates, we treat handicap, personal health satisfaction, and marital status as constant effects in subsequent analyses. As a result, we model the varying effects of the intercept and the presence of children under the age of 16, while treating the remaining covariates as fixed effects. Our target model is then given by

\[
L_{ij} ~=~ \alpha_{C_{i}0}^{*}(A_{ij})+\alpha_{C_{i}K}^{*}(A_{ij})K_{ij}+\beta_{M}M_{ij}+\beta_{S}S_{ij}+\beta_{H}H_{ij}+b_{i}+\epsilon_{ij}.
\]

The reduced model complexity based on the preliminary analysis leads to faster computation and greater stability.

Similar to Section 4.3, we compare the proposed method with the model in Jeong et al. (2017) that ignores heterogeneity among subjects; mgcv is not considered because Jeong et al. (2017) outperforms it (see Section 4.4). The corresponding simple model is given by

\[
L_{ij} ~=~ \alpha_{0}(A_{ij})+\alpha_{K}(A_{ij})K_{ij}+\beta_{M}M_{ij}+\beta_{S}S_{ij}+\beta_{H}H_{ij}+b_{i}+\epsilon_{ij}.
\]

As shown in the next section, the simple model fails to account for heterogeneity among samples and can be obtained by pooling the results of the proposed target model.

5.2 Analysis and Results

Figure 8: The $R^{1/2}$ statistics for the fixed-dimensional parameters (left) and the fixed points of the varying coefficients (right).
Table 4: Characteristics of the clustered groups
Covariate              Overall   Group 1   Group 2
Female                  35.1%     33.3%     89.1%
Married                 57.7%     55.9%     71.2%
White-collar workers    38.7%     41.1%     20.8%
Civil servants          25.8%     29.0%      4.0%
University              57.2%     58.4%     42.0%
Figure 9: Posterior summaries of the varying coefficients for two major groups. Solid lines represent the pointwise posterior medians of each varying coefficient function and gray regions correspond to pointwise 95% posterior intervals.
Table 5: Posterior summaries of the fixed-dimensional parameters
              Proposed method                       Jeong et al. (2017)
Parameter     Mean     Median   95% interval       Mean     Median   95% interval
$\psi$         3.539    3.357   (1.573, 6.334)      3.949    3.893   (3.031, 5.106)
$\beta_S$      0.016    0.016   (-0.065, 0.097)     0.006    0.006   (-0.051, 0.063)
$\beta_H$     -0.002   -0.002   (-0.023, 0.020)    -0.010   -0.010   (-0.026, 0.006)
$\beta_M$      0.292    0.284   (-0.178, 0.793)     0.299    0.297   (-0.041, 0.647)

We ran the proposed PCG sampler with three over-dispersed initial values. Figure 8 shows the convergence characteristics of the sampler by using the $R^{1/2}$ diagnostic for the fixed-dimensional parameters and the fixed points of the varying coefficients (Gelman and Rubin, 1992). Because all $R^{1/2}$ statistics are below 1.1, we combine the second halves of the three chains, each with 150,000 iterations, through a label switching algorithm. Then, after thinning to every 50th sample, our posterior inference is based on 4,500 mixed samples.
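For reference, the potential scale reduction diagnostic of Gelman and Rubin (1992) for a scalar quantity can be sketched as below; implementation details such as chain splitting and degrees-of-freedom corrections vary across software, so this is one plain version, not the authors' exact computation.

```python
import numpy as np

def gelman_rubin(chains):
    """R^(1/2) statistic for one scalar parameter.

    `chains` has shape (m, n): m parallel chains of length n
    (here, the post burn-in halves of the three runs).
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled posterior variance estimate
    return np.sqrt(var_hat / W)             # values near 1 indicate convergence
```

A value below the conventional 1.1 threshold, as for all parameters in Figure 8, suggests the chains have mixed over a common stationary distribution.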

According to our posterior analysis, there are two main groups and a few minor groups. The group membership is determined by the posterior modes of the cluster labels, and the interpretation of the analysis focuses on the two main groups. The largest cluster, Group 1, accounts for 89.4% of subjects; the second largest, Group 2, accounts for 7.2%; and the remaining clusters account for 3.4%. Some characteristics of the two major groups are summarized in Table 4, demonstrating that these groups are made up of heterogeneous subjects. Specifically, Group 1 has a much lower proportion of females, who tended to bear more of the parenting responsibility, than Group 2. In addition, Group 1 has higher proportions of white-collar workers, civil servants, and university graduates with high job security than Group 2. Such differences in characteristics result in different posterior estimates of the varying-coefficient functions, as shown in Figure 9.

Figure 9 shows the posterior summaries of the varying-coefficient functions resulting from functional clustering. The first row of Figure 9 corresponds to the group-level varying-intercept functions. The intercept function of Group 1 is significantly positive and keeps increasing up to the early 50s, implying that the posterior probability of being employed increases with age, holding all covariates constant. In contrast, Group 2 has a slightly positive but constant intercept function at all ages. The second row of Figure 9 shows the varying-coefficient function for having children below age 16 in the household. The pointwise 95% posterior intervals for Group 1 include 0, which implies that having children below age 16 in the household does not significantly affect the probability of employment. Unlike Group 1, the presence of young children in a Group 2 household significantly decreases the probability of employment until the mid-40s, and the decrease is larger at younger ages. This is due in part to the fact that Group 2 has a higher proportion of females than Group 1, and females were more responsible for child care in the late twentieth century.

Figure 10: The first column shows the posterior summaries of the varying coefficients obtained by Jeong et al. (2017), and the second column shows those of the proposed method pooled by the weights corresponding to cluster assignments, with posterior medians (solid lines) and pointwise 95% posterior intervals (gray areas).

Table 5 shows the posterior summaries of the fixed-dimensional parameters, where $\mathrm{Var}(b_{i})=\psi$ represents the variance of a random effect, and $\beta_{H}$, $\beta_{S}$, and $\beta_{M}$ represent the coefficients of the fixed effects $H_{ij}$, $S_{ij}$, and $M_{ij}$, respectively. Because the 95% posterior intervals include zero, we conclude that the fixed effects have no significant influence on employment status once the varying effects of having young children in the household are accounted for across heterogeneous subpopulations. When the heterogeneous subpopulation assumption is ignored, the results obtained by Jeong et al. (2017) appear similar but have narrower 95% credible intervals, because Jeong et al. (2017) does not fully account for the variability of heterogeneous subpopulations.

Our proposed method identifies two major subpopulations with different characteristics, as shown in Table 4, and these subpopulations show different age-varying effects of having young children in a household on working status, as illustrated in Figure 9. When the heterogeneous subpopulation assumption is ignored, however, the single population model proposed by Jeong et al. (2017) estimates the varying-coefficient functions applied to the entire population, as shown in the first column of Figure 10. In the presence of heterogeneous subpopulations, such an approach would fail to separate subpopulations with different characteristics, leading to erroneous conclusions. This is confirmed by producing the pooled varying-coefficient functions estimated by the proposed method, as shown in the second column of Figure 10. These findings demonstrate the proposed model’s validity and utility in accounting for a heterogeneous population.

6 Discussion

In this paper, we propose a novel model-based functional clustering method for analyzing a heterogeneous series of binary data. Our proposed method models the varying effects of covariates on a series of binary responses as a function of an effect modifier, while accounting for heterogeneity among subjects using functional clustering and random effects. The proposed model estimates population-level fixed effects, cluster-level varying effects, and subject-level random effects. Even when the number of clusters is unknown, our proposed method accurately estimates cluster-specific varying coefficients with appropriate smoothness using a free-knot spline prior. We use the DP prior for functional clustering, which avoids specifying the exact number of clusters in advance. To perform posterior inference, we carefully develop a PCG sampler by specifying appropriate prior distributions and marginalizing some model components.

There are several directions for future research, such as extending the clustering methodology to other generalized semiparametric models using partitioning priors. Furthermore, a Gaussian process prior may be used instead of a free-knot spline for the functional clustering of varying coefficients, because it can easily achieve the right scale for the base measure of the DP prior by employing a suitable covariance kernel.

Appendix A Details of Algorithm 1

In this section, we describe the details of Algorithm 1. Let $\mathcal{C}_{d}$ denote the set of clusters containing at least one subject and $\mathcal{C}_{d}^{c}$ the set of clusters containing no subject in the $d$th sampling iteration. For the $d$th iteration:

  1. Step 1:

    Draw $\boldsymbol{\gamma}_{kl}^{*}$ from $p(\boldsymbol{\gamma}_{kl}^{*}\mid\boldsymbol{\gamma}_{-kl}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is a Bernoulli distribution with success probability

\[
\frac{f(\boldsymbol{\gamma}_{kl}^{*}=1,\boldsymbol{\gamma}_{-kl}^{*})}{f(\boldsymbol{\gamma}_{kl}^{*}=1,\boldsymbol{\gamma}_{-kl}^{*})+f(\boldsymbol{\gamma}_{kl}^{*}=0,\boldsymbol{\gamma}_{-kl}^{*})},\quad k\in\mathcal{C}_{d},\quad l=1,\dots,p,
\]

    where $\boldsymbol{\gamma}_{-kl}^{*}$ denotes all latent indicator variables in $\boldsymbol{\gamma}_{k}^{*}$ except $\boldsymbol{\gamma}_{kl}^{*}$,

\begin{align*}
f(\boldsymbol{\gamma}_{kl}^{*},\boldsymbol{\gamma}_{-kl}^{*}) &= B\left(|\boldsymbol{\gamma}_{kl}^{*}|+a,\,M_{l}+1-|\boldsymbol{\gamma}_{kl}^{*}|+b\right)\times\det\left(\tau_{k}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}^{-1}\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})}+\mathbf{I}_{|\boldsymbol{\gamma}_{k}^{*}|}\right)^{-1/2}\\
&\quad\times\exp\left\{\frac{1}{2}\,\xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi},\mathbf{L},\boldsymbol{\beta})}^{\top}\left(\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})}+\tau_{k}^{-1}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}\right)^{-1}\xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi},\mathbf{L},\boldsymbol{\beta})}\right\},\\
\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})} &= \sum_{i:C_{i}=k}\mathbf{W}_{i(\boldsymbol{\gamma}_{k}^{*})}^{\star\top}\mathbf{T}_{i}^{-1}\mathbf{W}_{i(\boldsymbol{\gamma}_{k}^{*})}^{\star},\\
\xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi},\mathbf{L},\boldsymbol{\beta})} &= \sum_{i:C_{i}=k}\mathbf{W}_{i(\boldsymbol{\gamma}_{k}^{*})}^{\star\top}\mathbf{T}_{i}^{-1}\left(\mathbf{L}_{i}-\mathbf{X}_{i}\boldsymbol{\beta}\right),
\end{align*}

    and 𝐈|𝜸k|\mathbf{I}_{|\boldsymbol{\gamma}_{k}^{*}|} and 𝐈ni\mathbf{I}_{n_{i}} are identity matrices whose sizes of each dimension are |𝜸k||\boldsymbol{\gamma}_{k}^{*}| and nin_{i} respectively; and 𝐓i=𝐈ni+𝐙i𝚿𝐙i\mathbf{T}_{i}=\mathbf{I}_{n_{i}}+\mathbf{Z}_{i}\boldsymbol{\Psi}\mathbf{Z}_{i}^{\top}. In the case of k𝒞dck\in\mathcal{C}_{d}^{c}, 𝜸kl\boldsymbol{\gamma}_{kl}^{*} is drawn from its prior distribution because Ξk(𝜸k,𝐂,𝚿)\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})} and ξk(𝜸k,𝐂,𝚿,𝐋,𝜷)\xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi},\mathbf{L},\boldsymbol{\beta})} do not exist.
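Because the determinant and exponential terms in $f$ can easily overflow in floating point, a practical implementation evaluates both indicator configurations on the log scale. The following Python sketch is illustrative only: the function name and the precomputed log-densities `log_f1` and `log_f0` (i.e., $\log f$ evaluated at $\boldsymbol{\gamma}_{kl}^{*}=1$ and $0$) are assumptions, not part of the fvcc package.

```python
import numpy as np

def bernoulli_indicator_draw(log_f1, log_f0, rng):
    """Draw gamma*_{kl} in {0,1} from its conditional Bernoulli.

    log_f1 and log_f0 are log f(gamma*_{kl}=1, .) and log f(gamma*_{kl}=0, .);
    subtracting the maximum before exponentiating keeps the ratio stable.
    """
    m = max(log_f1, log_f0)
    p = np.exp(log_f1 - m) / (np.exp(log_f1 - m) + np.exp(log_f0 - m))
    return int(rng.random() < p)
```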

  2. Step 2:

    Draw $V_{k}$ from $p(V_{k}\,|\,\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is a beta distribution, i.e.,

    $$V_{k}\,|\,(\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})\sim{\rm Beta}\left(1+m_{k},\,\nu+\sum_{h=k+1}^{K}m_{h}\right),\quad k=1,\dots,K-1,$$

    where $m_{k}=\sum_{i=1}^{N}I(C_{i}=k)$.
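Given the stick-breaking proportions $V_{1},\dots,V_{K-1}$ (with $V_{K}=1$ under truncation), the mixture weights used in Step 9 are $\pi_{k}(\mathbf{V})=V_{k}\prod_{h<k}(1-V_{h})$. A minimal Python sketch of this map (illustrative; not part of fvcc):

```python
import numpy as np

def stick_breaking_weights(V):
    """Map stick-breaking proportions V_1,...,V_{K-1} to mixture weights
    pi_k(V) = V_k * prod_{h<k} (1 - V_h), with V_K = 1 closing the stick."""
    V = np.append(np.asarray(V, dtype=float), 1.0)  # truncation: V_K = 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining
```

Setting $V_{K}=1$ guarantees the truncated weights sum to one.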

  3. Step 3:

    Draw $\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}$ from $p(\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}\,|\,\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is a multivariate normal distribution, i.e.,

    $$\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}\,|\,(\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})\sim{\rm N}_{|\boldsymbol{\gamma}_{k}^{*}|}\left(\left(\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})}+\tau_{k}^{-1}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}\right)^{-1}\xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi},\mathbf{L},\boldsymbol{\beta})},\ \left(\Xi_{k(\boldsymbol{\gamma}_{k}^{*},\mathbf{C},\boldsymbol{\Psi})}+\tau_{k}^{-1}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}\right)^{-1}\right),\quad k\in\mathcal{C}_{d},$$

    and

    $$\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}\,|\,(\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})\sim{\rm N}_{|\boldsymbol{\gamma}_{k}^{*}|}\left(\mathbf{0},\,\tau_{k}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}^{-1}\right),\quad k\in\mathcal{C}_{d}^{c}.$$
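For $k\in\mathcal{C}_{d}$ this is a multivariate normal in precision form, with precision $\Xi+\tau_{k}^{-1}\mathbf{R}$ and mean $(\Xi+\tau_{k}^{-1}\mathbf{R})^{-1}\xi$; one numerically convenient route factors the precision once and reuses the Cholesky factor for the noise. A hedged Python sketch (function name and argument layout are assumptions, not the fvcc implementation):

```python
import numpy as np

def draw_phi(Xi, xi, R, tau, rng):
    """Draw phi ~ N(Sigma @ xi, Sigma) with Sigma = (Xi + R/tau)^{-1},
    using a Cholesky factor of the precision instead of inverting it."""
    prec = Xi + R / tau                   # posterior precision
    L = np.linalg.cholesky(prec)          # prec = L @ L.T
    mean = np.linalg.solve(prec, xi)      # posterior mean Sigma @ xi
    z = rng.standard_normal(len(xi))
    # solving L.T x = z yields x with covariance (L L^T)^{-1} = Sigma
    return mean + np.linalg.solve(L.T, z)
```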
  4. Step 4:

    Draw $\boldsymbol{\beta}$ from $p(\boldsymbol{\beta}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is multivariate normal, i.e.,

    $$\boldsymbol{\beta}\,|\,(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})\sim{\rm N}_{q}\left(\boldsymbol{\Delta}^{-1}\sum_{i=1}^{N}\mathbf{X}_{i}^{\top}\mathbf{T}_{i}^{-1}\left(\mathbf{L}_{i}-\mathbf{W}_{i(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}\right),\,\boldsymbol{\Delta}^{-1}\right),$$

    where $\boldsymbol{\Delta}=\mathbf{P}^{-1}+\sum_{i=1}^{N}\mathbf{X}_{i}^{\top}\mathbf{T}_{i}^{-1}\mathbf{X}_{i}$.

  5. Step 5:

    Draw $\mathbf{b}_{i}$ from $p(\mathbf{b}_{i}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is multivariate normal, i.e.,

    $$\mathbf{b}_{i}\,|\,(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L}_{i},\mathbf{Y})\sim{\rm N}_{r}\left(\mathbf{U}_{i(\boldsymbol{\Psi},\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star},\boldsymbol{\gamma}_{C_{i}}^{*},\mathbf{L}_{i},\boldsymbol{\beta})},\,\mathbf{A}_{(\boldsymbol{\Psi})}\right),\quad i=1,\dots,N,$$

    where

    $$\begin{aligned}
    \mathbf{U}_{i(\boldsymbol{\Psi},\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star},\boldsymbol{\gamma}_{C_{i}}^{*},\mathbf{L}_{i},\boldsymbol{\beta})} &= \boldsymbol{\Psi}\mathbf{Z}_{i}^{\top}(\mathbf{I}_{n_{i}}+\mathbf{Z}_{i}\boldsymbol{\Psi}\mathbf{Z}_{i}^{\top})^{-1}(\mathbf{L}_{i}-\mathbf{W}_{i(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}-\mathbf{X}_{i}\boldsymbol{\beta}), \\
    \mathbf{A}_{(\boldsymbol{\Psi})} &= \boldsymbol{\Psi}-\boldsymbol{\Psi}\mathbf{Z}_{i}^{\top}(\mathbf{I}_{n_{i}}+\mathbf{Z}_{i}\boldsymbol{\Psi}\mathbf{Z}_{i}^{\top})^{-1}\mathbf{Z}_{i}\boldsymbol{\Psi}.
    \end{aligned}$$
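The posterior mean and covariance of $\mathbf{b}_{i}$ follow directly from $\boldsymbol{\Psi}$, $\mathbf{Z}_{i}$, and the latent residual $\mathbf{L}_{i}-\mathbf{W}_{i}^{\star}\boldsymbol{\phi}^{\star}-\mathbf{X}_{i}\boldsymbol{\beta}$, as the two formulas above share the common "gain" factor $\boldsymbol{\Psi}\mathbf{Z}_{i}^{\top}\mathbf{T}_{i}^{-1}$. An illustrative Python sketch (the function name is an assumption):

```python
import numpy as np

def random_effects_posterior(Z, Psi, resid):
    """Posterior mean U_i and covariance A of b_i given the latent
    residual resid = L_i - W_i phi - X_i beta (Step 5 formulas)."""
    n = Z.shape[0]
    T = np.eye(n) + Z @ Psi @ Z.T         # T_i = I + Z Psi Z^T
    K = Psi @ Z.T @ np.linalg.inv(T)      # shared "gain" matrix
    U = K @ resid                          # posterior mean
    A = Psi - K @ Z @ Psi                  # posterior covariance
    return U, A
```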
  6. Step 6:

    Draw $\tau_{k}$ from $p(\tau_{k}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is an inverse gamma, i.e.,

    $$\tau_{k}\,|\,(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})\sim{\rm IG}\left(\frac{1+|\boldsymbol{\gamma}_{k}^{*}|}{2},\,\frac{N+\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star\top}\mathbf{R}_{k(\boldsymbol{\gamma}_{k}^{*})}\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}}{2}\right),\quad k=1,\dots,K.$$
  7. Step 7:

    Draw $\boldsymbol{\Psi}$ from $p(\boldsymbol{\Psi}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\mathbf{L},\mathbf{Y})$, which is an inverse Wishart, i.e.,

    $$\boldsymbol{\Psi}\sim{\rm IW}\left(u+N,\,\mathbf{D}+\sum_{i=1}^{N}\mathbf{b}_{i}\mathbf{b}_{i}^{\top}\right).$$
  8. Step 8:

    Draw $L_{ij}$ from $p(L_{ij}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{Y})$, which is a truncated normal, i.e.,

    $$L_{ij}\,|\,(\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{C},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{Y})\sim\begin{cases}{\rm TN}_{(-\infty,0]}(\mu_{C_{i}}^{(j)},1)&\text{if }Y_{ij}=0,\\ {\rm TN}_{(0,\infty)}(\mu_{C_{i}}^{(j)},1)&\text{if }Y_{ij}=1,\end{cases}\quad i=1,\dots,N,\ j=1,\dots,n_{i},$$

    where $\mu_{C_{i}}^{(j)}$ denotes the $j$th element of $\boldsymbol{\mu}_{C_{i}}=\mathbf{W}_{i(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{C_{i}}^{*})}^{\star}+\mathbf{X}_{i}\boldsymbol{\beta}+\mathbf{Z}_{i}\mathbf{b}_{i}$.
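Each $L_{ij}$ is a univariate truncated normal with unit variance, which can be drawn exactly by inverse-CDF sampling (though tail draws lose accuracy when $|\mu|$ is very large). A Python sketch, assuming SciPy for the normal CDF and quantile function; `draw_latent` is an illustrative name, not an fvcc function:

```python
import numpy as np
from scipy.stats import norm

def draw_latent(mu, y, rng):
    """Draw L_ij ~ N(mu, 1) truncated to (-inf, 0] if y = 0
    and to (0, inf) if y = 1, by inverse-CDF sampling."""
    u = rng.random()
    p0 = norm.cdf(-mu)                         # P(L <= 0) for N(mu, 1)
    if y == 0:
        return mu + norm.ppf(u * p0)           # map u into the lower tail
    return mu + norm.ppf(p0 + u * (1.0 - p0))  # map u into the upper tail
```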

  9. Step 9:

    Draw $C_{i}$ from $p(C_{i}\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})$, which is a discrete distribution with probabilities

    $$P(C_{i}=k\,|\,\boldsymbol{\phi}_{(\boldsymbol{\gamma}^{*})}^{\star},\boldsymbol{\gamma}^{*},\boldsymbol{\beta},\mathbf{b},\mathbf{V},\boldsymbol{\tau},\boldsymbol{\Psi},\mathbf{L},\mathbf{Y})=\frac{\pi_{k}(\mathbf{V})\,{\rm N}_{n_{i}}\left(\mathbf{L}_{i};\boldsymbol{\mu}_{k},\mathbf{I}_{n_{i}}\right)}{\sum_{h=1}^{K}\pi_{h}(\mathbf{V})\,{\rm N}_{n_{i}}\left(\mathbf{L}_{i};\boldsymbol{\mu}_{h},\mathbf{I}_{n_{i}}\right)},\quad k=1,\dots,K,$$

    where $\boldsymbol{\mu}_{k}=\mathbf{W}_{i(\boldsymbol{\gamma}_{k}^{*})}^{\star}\boldsymbol{\phi}_{(\boldsymbol{\gamma}_{k}^{*})}^{\star}+\mathbf{X}_{i}\boldsymbol{\beta}+\mathbf{Z}_{i}\mathbf{b}_{i}$.
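Since the cluster weights involve $n_{i}$-dimensional normal densities, they are best accumulated on the log scale and normalized only at the end. A Python sketch of the categorical draw (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def draw_cluster(log_pi, L_i, mus, rng):
    """Draw C_i from the discrete conditional: the weight of cluster k is
    pi_k(V) * N(L_i; mu_k, I), computed on the log scale for stability.
    mus is a (K, n_i) array stacking mu_k over clusters k = 1,...,K."""
    # log N(L_i; mu_k, I) up to an additive constant common to all k
    log_lik = -0.5 * np.sum((L_i - mus) ** 2, axis=1)
    logw = log_pi + log_lik
    w = np.exp(logw - logw.max())          # shift max to 0 before exponentiating
    w /= w.sum()                           # normalize to probabilities
    return rng.choice(len(w), p=w)
```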

Appendix B R package fvcc

We provide an R package called fvcc for the proposed model. The package can be installed using the devtools package, available on CRAN, as follows.

    devtools::install_github("jwsohn612/fvcc")
    library(fvcc)
    help(fvcc)

The documentation of the main function fvcc includes a code script for reproducing the simulation results in Section 4.

Acknowledgements

S. Jeong’s research was supported by the Yonsei University Research Fund of 2022-22-0097. T. Park’s research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A1B03033536) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2020R1A2C1A01005949).


References

  • Stiratelli et al. (1984) R. Stiratelli, N. Laird, J. H. Ware, Random-effects models for serial observations with binary response, Biometrics (1984) 961–971. doi:10.2307/2531147.
  • Varin and Czado (2009) C. Varin, C. Czado, A mixed autoregressive probit model for ordinal longitudinal data, Biostatistics 11 (2009) 127–138. doi:10.1093/biostatistics/kxp042.
  • Guerra et al. (2012) M. W. Guerra, J. Shults, J. Amsterdam, T. Ten-Have, The analysis of binary longitudinal data with time-dependent covariates, Statistics in Medicine 31 (2012) 931–948. doi:10.1002/sim.4465.
  • Hastie and Tibshirani (1993) T. Hastie, R. Tibshirani, Varying-coefficient models, Journal of the Royal Statistical Society. Series B (Methodological) (1993) 757–796. doi:10.1111/j.2517-6161.1993.tb01939.x.
  • Hoover et al. (1998) D. R. Hoover, J. A. Rice, C. O. Wu, L.-P. Yang, Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data, Biometrika 85 (1998) 809–822. doi:10.1093/biomet/85.4.809.
  • Wu et al. (1998) C. O. Wu, C.-T. Chiang, D. R. Hoover, Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data, Journal of the American Statistical Association 93 (1998) 1388–1402. doi:10.1080/01621459.1998.10473800.
  • Lang and Brezger (2004) S. Lang, A. Brezger, Bayesian P-splines, Journal of Computational and Graphical Statistics 13 (2004) 183–212. doi:10.1198/1061860043010.
  • Sun and Wu (2005) Y. Sun, H. Wu, Semiparametric time-varying coefficients regression model for longitudinal data, Scandinavian Journal of Statistics 32 (2005) 21–47. doi:10.1111/j.1467-9469.2005.00413.x.
  • Fan and Zhang (2008) J. Fan, W. Zhang, Statistical methods with varying coefficient models, Statistics and its Interface 1 (2008) 179. doi:10.4310/sii.2008.v1.n1.a15.
  • Lu and Zhang (2009) Y. Lu, R. Zhang, Smoothing spline estimation of generalised varying-coefficient mixed model, Journal of Nonparametric Statistics 21 (2009) 815–825. doi:10.1080/10485250903151078.
  • Jeong and Park (2016) S. Jeong, T. Park, Bayesian semiparametric inference on functional relationships in linear mixed models, Bayesian Analysis 11 (2016) 1137–1163. doi:10.1214/15-BA987.
  • Jeong et al. (2017) S. Jeong, M. Park, T. Park, Analysis of binary longitudinal data with time-varying effects, Computational Statistics & Data Analysis 112 (2017) 145–153. doi:10.1016/j.csda.2017.03.007.
  • Park and Jeong (2018) T. Park, S. Jeong, Analysis of Poisson varying-coefficient models with autoregression, Statistics 52 (2018) 34–49. doi:10.1080/02331888.2017.1353514.
  • Shen and Ghosal (2015) W. Shen, S. Ghosal, Adaptive Bayesian procedures using random series priors, Scandinavian Journal of Statistics 42 (2015) 1194–1213.
  • Neal (1998) R. M. Neal, Regression and classification using Gaussian process priors, Bayesian Statistics 6 (1998) 475.
  • Chipman et al. (1997) H. A. Chipman, E. D. Kolaczyk, R. E. McCulloch, Adaptive Bayesian wavelet shrinkage, Journal of the American Statistical Association 92 (1997) 1413–1421.
  • Smith and Kohn (1996) M. Smith, R. Kohn, Nonparametric regression using Bayesian variable selection, Journal of Econometrics 75 (1996) 317–343. doi:10.1016/0304-4076(95)01763-1.
  • DiMatteo et al. (2001) I. DiMatteo, C. R. Genovese, R. E. Kass, Bayesian curve-fitting with free-knot splines, Biometrika 88 (2001) 1055–1071. doi:10.1093/biomet/88.4.1055.
  • Kohn et al. (2001) R. Kohn, M. Smith, D. Chan, Nonparametric regression using linear combinations of basis functions, Statistics and Computing 11 (2001) 313–322. doi:10.1023/A:1011916902934.
  • Ruppert et al. (2003) D. Ruppert, M. P. Wand, R. J. Carroll, Semiparametric Regression, 12, Cambridge University Press, 2003.
  • Kang and Jeong (2023) G. Kang, S. Jeong, Model selection-based estimation for generalized additive models using mixtures of g-priors: Towards systematization, arXiv preprint arXiv:2301.10468 (2023).
  • Lenk and DeSarbo (2000) P. J. Lenk, W. S. DeSarbo, Bayesian inference for finite mixtures of generalized linear models with random effects, Psychometrika 65 (2000) 93–119. doi:10.1007/BF02294188.
  • James and Sugar (2003) G. M. James, C. A. Sugar, Clustering for sparsely sampled functional data, Journal of the American Statistical Association 98 (2003) 397–408.
  • Heard et al. (2006) N. A. Heard, C. C. Holmes, D. A. Stephens, A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves, Journal of the American Statistical Association 101 (2006) 18–29.
  • Shi and Wang (2008) J. Q. Shi, B. Wang, Curve prediction and clustering with mixtures of Gaussian process functional regression models, Statistics and Computing 18 (2008) 267–283.
  • Aßmann and Boysen-Hogrefe (2011) C. Aßmann, J. Boysen-Hogrefe, A Bayesian approach to model-based clustering for binary panel probit models, Computational Statistics & Data Analysis 55 (2011) 261–279. doi:10.1016/j.csda.2010.04.016.
  • Coffey et al. (2014) N. Coffey, J. Hinde, E. Holian, Clustering longitudinal profiles using P-splines and mixed effects models applied to time-course gene expression data, Computational Statistics & Data Analysis 71 (2014) 14–29.
  • Berrettini et al. (2022) M. Berrettini, G. Galimberti, S. Ranciati, Semiparametric finite mixture of regression models with Bayesian P-splines, Advances in Data Analysis and Classification (2022) 1–31.
  • Ferguson (1973) T. S. Ferguson, A Bayesian analysis of some nonparametric problems, The Annals of Statistics (1973) 209–230.
  • Pitman and Yor (1997) J. Pitman, M. Yor, The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator, The Annals of Probability (1997) 855–900.
  • Ishwaran and James (2001) H. Ishwaran, L. F. James, Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96 (2001) 161–173. doi:10.1198/016214501750332758.
  • Teh et al. (2006) Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006) 1566–1581. doi:10.1198/016214506000000302.
  • Lau and Green (2007) J. W. Lau, P. J. Green, Bayesian model-based clustering procedures, Journal of Computational and Graphical Statistics 16 (2007) 526–558.
  • Wallach et al. (2010) H. Wallach, S. Jensen, L. Dicker, K. Heller, An alternative prior process for nonparametric Bayesian clustering, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
  • Canale and Dunson (2011) A. Canale, D. B. Dunson, Bayesian kernel mixtures for counts, Journal of the American Statistical Association 106 (2011) 1528–1539.
  • Yerebakan et al. (2014) H. Z. Yerebakan, B. Rajwa, M. Dundar, The infinite mixture of infinite Gaussian mixtures, in: Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014.
  • Kyung (2015) M. Kyung, Dirichlet process mixtures of linear mixed regressions, Communications for Statistical Applications and Methods 22 (2015) 625–637. doi:10.5351/CSAM.2015.22.6.625.
  • Müller et al. (1996) P. Müller, A. Erkanli, M. West, Bayesian curve fitting using multivariate normal mixtures, Biometrika 83 (1996) 67–79.
  • Gelfand et al. (2005) A. E. Gelfand, A. Kottas, S. N. MacEachern, Bayesian nonparametric spatial modeling with Dirichlet process mixing, Journal of the American Statistical Association 100 (2005) 1021–1035.
  • MacEachern et al. (2001) S. N. MacEachern, A. Kottas, A. E. Gelfand, Spatial nonparametric Bayesian models, in: Proceedings of the 2001 Joint Statistical Meetings, volume 3, Citeseer, 2001, p. 14.
  • Ray and Mallick (2006) S. Ray, B. Mallick, Functional clustering by Bayesian wavelet methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (2006) 305–332.
  • Petrone et al. (2009) S. Petrone, M. Guindani, A. E. Gelfand, Hybrid Dirichlet mixture models for functional data, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2009) 755–782.
  • Chib and Greenberg (2010) S. Chib, E. Greenberg, Additive cubic spline regression with Dirichlet process mixture errors, Journal of Econometrics 156 (2010) 322–336.
  • Rodriguez and Dunson (2014) A. Rodriguez, D. B. Dunson, Functional clustering in nested designs: modeling variability in reproductive epidemiology studies (2014).
  • Suarez and Ghosal (2016) A. J. Suarez, S. Ghosal, Bayesian clustering of functional data using local features (2016).
  • Margaritella et al. (2021) N. Margaritella, V. Inácio, R. King, Parameter clustering in Bayesian functional principal component analysis of neuroscientific data, Statistics in Medicine 40 (2021) 167–184.
  • Kuss et al. (2006) O. Kuss, C. Gromann, T. L. Diepgen, Model-based clustering of binary longitudinal atopic dermatitis disease histories by latent class mixture models, Biometrical Journal 48 (2006) 105–116.
  • Hannah et al. (2011) L. A. Hannah, D. M. Blei, W. B. Powell, Dirichlet process mixtures of generalized linear models., Journal of Machine Learning Research 12 (2011).
  • Zhu et al. (2021) X. Zhu, X. Tang, A. Qu, Longitudinal clustering for heterogeneous binary data, Statistica Sinica 31 (2021) 603–624.
  • Gelman et al. (2015) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin, Bayesian Data Analysis, CRC Press, 2015.
  • van Dyk and Park (2008) D. A. van Dyk, T. Park, Partially collapsed Gibbs samplers: Theory and methods, Journal of the American Statistical Association 103 (2008) 790–796. doi:10.1198/016214508000000409.
  • Park and van Dyk (2009) T. Park, D. A. van Dyk, Partially collapsed Gibbs samplers: Illustrations and applications, Journal of Computational and Graphical Statistics 18 (2009) 283–305. doi:10.1198/jcgs.2009.08108.
  • Vines et al. (1996) S. Vines, W. Gilks, P. Wild, Fitting Bayesian multiple random effects models, Statistics and Computing 6 (1996) 337–346. doi:10.1007/BF00143554.
  • Meng and van Dyk (1998) X.-L. Meng, D. van Dyk, Fast EM-type implementations for mixed effects models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 (1998) 559–578. doi:10.1111/1467-9868.00140.
  • van Dyk (2000) D. A. van Dyk, Fitting mixed-effects models using efficient EM-type algorithms, Journal of Computational and Graphical Statistics 9 (2000) 78–98. doi:10.1080/10618600.2000.10474867.
  • Kim et al. (2013) Y. Kim, Y.-K. Choi, S. Emery, Logistic regression with multiple random effects: A simulation study of estimation methods and statistical packages, The American Statistician 67 (2013) 171–182. doi:10.1080/00031305.2013.817357.
  • Park and Min (2016) T. Park, S. Min, Partially collapsed Gibbs sampling for linear mixed-effects models, Communications in Statistics-Simulation and Computation 45 (2016) 165–180.
  • Jeong et al. (2022) S. Jeong, T. Park, D. A. van Dyk, Bayesian model selection in additive partial linear models via locally adaptive splines, Journal of Computational and Graphical Statistics 31 (2022) 324–336.
  • Sethuraman (1994) J. Sethuraman, A constructive definition of Dirichlet priors, Statistica Sinica (1994) 639–650.
  • Scott and Berger (2010) J. G. Scott, J. O. Berger, Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem, The Annals of Statistics 38 (2010) 2587–2619. doi:10.1214/10-AOS792.
  • Liang et al. (2008) F. Liang, R. Paulo, G. Molina, M. A. Clyde, J. O. Berger, Mixtures of $g$ priors for Bayesian variable selection, Journal of the American Statistical Association 103 (2008) 410–423. doi:10.1198/016214507000001337.
  • Zellner (1986) A. Zellner, On assessing prior distributions and Bayesian regression analysis with g-prior distributions, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti 6 (1986) 233–243.
  • Park et al. (2019) H. Park, T. Park, Y.-S. Lee, Partially collapsed Gibbs sampling for latent Dirichlet allocation, Expert Systems with Applications 131 (2019) 208–218. doi:10.1016/j.eswa.2019.04.028.
  • Vats et al. (2019) D. Vats, J. M. Flegal, G. L. Jones, Multivariate output analysis for Markov chain Monte Carlo, Biometrika 106 (2019) 321–337.
  • Wood (2017) S. Wood, Generalized Additive Models: An Introduction with R, 2 ed., Chapman and Hall/CRC, 2017.
  • Riphahn et al. (2003) R. T. Riphahn, A. Wambach, A. Million, Incentive effects in the demand for health care: a bivariate panel count data estimation, Journal of Applied Econometrics 18 (2003) 387–405. doi:10.1002/jae.680.
  • Gelman and Rubin (1992) A. Gelman, D. B. Rubin, Inference from iterative simulation using multiple sequences, Statistical Science 7 (1992) 457–472. doi:10.1214/ss/1177011136.