Robust Fitting of Mixture Models using Weighted Complete Estimating Equations

Shonosuke Sugasawa1 and Genya Kobayashi2

1Center for Spatial Information Science, The University of Tokyo
2Graduate School of Social Science, Chiba University

Abstract

Mixture modeling, which considers the potential heterogeneity in data, is widely adopted for classification and clustering problems. Mixture models can be estimated using the Expectation-Maximization algorithm, which works with the complete estimating equations conditioned by the latent membership variables of the cluster assignment based on the hierarchical expression of mixture models. However, when the mixture components have light tails such as a normal distribution, the mixture model can be sensitive to outliers. This study proposes a method of weighted complete estimating equations (WCE) for the robust fitting of mixture models. Our WCE introduces weights to complete estimating equations such that the weights can automatically downweight the outliers. The weights are constructed similarly to the density power divergence for mixture models, but in our WCE, they depend only on the component distributions and not on the whole mixture. A novel expectation-estimating-equation (EEE) algorithm is also developed to solve the WCE. For illustrative purposes, a multivariate Gaussian mixture, a mixture of experts, and a multivariate skew normal mixture are considered, and how our EEE algorithm can be implemented for these specific models is described. The numerical performance of the proposed robust estimation method was examined using simulated and real datasets.


Key words: Clustering; Divergence; EEE algorithm; Mixture of experts; Skew normal mixture

Introduction

Mixture modeling (McLachlan and Peel, 2000) is very popular for distribution estimation, regression, and model-based clustering by taking account of potential subgroup structures of data. Typically, such mixture models are fitted by the maximum likelihood method using the well-known Expectation-Maximization (EM) algorithm. However, data often contain outliers and the maximum likelihood method can be highly affected by them. The presence of outliers would result in the biased and inefficient statistical inference of the parameters of interest and would thus make recovering the underlying clustering structure of the data very difficult. A typical remedy for this problem is to replace the normal component distributions with heavy-tailed components. For example, Bagnato et al. (2017), Forbes and Wraith (2014), Peel and McLachlan (2004), Punzo and Tortora (2021), Sun et al. (2010), and Zhang and Liang (2010) used symmetric component distributions, Basso et al. (2010), Lee and McLachlan (2014), Lin (2010), Morris et al. (2019), and Wang et al. (2009) used skewed distributions, and Galimberti and Soffritti (2014), Ingrassia et al. (2014), Mazza and Punzo (2020), Punzo et al. (2017), Song et al. (2014), Yao et al. (2014), and Zarei et al. (2019) considered robustifying the mixture of regressions. However, this approach cannot distinguish non-outliers from outliers straightforwardly because it fits a mixture model to all the observations, including outliers. Apart from using heavy-tailed distributions, a class of generalized likelihood is an effective alternative tool for the robust fitting of statistical models. In the context of mixture modeling, Fujisawa and Eguchi (2006) employed density power divergence (Basu et al., 1998) for robust estimation of Gaussian mixture models. However, a direct application of the divergence may suffer from computational problems because the objective functions may contain intractable integrals. For example, the objective function under the density power divergence includes the integral f(x;θ)γ𝑑x\int f(x;\theta)^{\gamma}dx, where f(x;θ)f(x;\theta) is the density function with the parameter θ\theta and γ\gamma is the tuning parameter. When f(x;θ)f(x;\theta) is the density function of a mixture model, the integral becomes intractable. Therefore, these approaches are not necessarily appealing in practice, despite their desirable robustness properties.

In this paper, we develop a new approach to robust mixture modeling by newly introducing the idea of the weighted complete estimating equations (WCE). Complete estimating equations appear in the EM algorithm and are conditioned on the latent membership variables of the cluster assignment. Our WCE introduces weights to the complete estimating equations such that the contribution of outliers to the complete estimating equations is automatically downweighted. The weight is defined based on the assumed component distributions. By conditioning the membership variables in WCE, only the weighted estimating equations for each component distribution must be considered separately instead of directly using the weighted estimating equations for a mixture model. Since the derived WCE depends on the unknown latent membership variables, these latent variables are augmented via their posterior expectations to solve the WCE, which provides an expectation-estimating-equation (EEE) algorithm (Elashoff and Ryan, 2004). The proposed EEE algorithm is general and can be applied to various mixture models. The proposed WCE method is then applied to three types of mixture models: a multivariate Gaussian mixture, a mixture of experts (Jacobs et al., 1991), and a multivariate skew-normal mixture (Lin et al., 2007; Lin, 2009). For the multivariate Gaussian mixture, the updating steps of the proposed EEE algorithm are obtained in closed form without requiring either numerical integration or optimization steps. A similar algorithm can be derived for a mixture of experts when the component distributions are Gaussian. Moreover, for the multivariate skew normal mixture, by introducing additional latent variables based on the stochastic representation of the skew normal distribution, a novel EEE algorithm can be obtained as a slight extension of the proposed general EEE algorithm in which all the updating steps proceed analytically.

The proposed framework of the weighted complete estimating equations unifies several existing approaches regarding the choice of the weight function, such as hard trimming (e.g. Garcia-Escudero et al., 2008), soft trimming (e.g. Campbell, 1984; Farcomeni and Greco, 2015; Greco and Agostinelli, 2020) and their hybrid (e.g. Farcomeni and Punzo, 2020). Among several designs for the weight function, this study adopts the density weight, which leads to theoretically valid and computationally tractable robust estimation procedures not only for the Gaussian mixture but also for a variety of mixture models.

In related work, Greco and Agostinelli (2020) employed a similar idea using the weighted likelihood for the robust fitting of the multivariate Gaussian mixture. They constructed weights by first obtaining a kernel density estimate of the Mahalanobis distances with respect to the component parameters and then computing the Pearson residuals. Since their method requires knowledge of the distribution of the quadratic form of the residuals, their approach cannot be directly extended to other mixture models. On the other hand, the proposed method can be applied to various mixture models because, under the proposed weight design, the weights are automatically determined by the component-wise density. It should be noted that introducing the density weight in estimating equations is closely related to the density power divergence (Basu et al., 1998) in that the density power divergence induces density-weighted likelihood equations. However, the novelty of the proposed approach is that the density weight is considered within the framework of complete estimating equations rather than likelihood equations, which leads to tractable robust estimation algorithms for a variety of mixture models.

The remainder of this paper is organized as follows. In Section 2, we first describe the proposed WCE method for a general mixture model and derive a general EEE algorithm for solving the WCE. Then, the general algorithm is applied to the specific mixture models, a multivariate Gaussian mixture (Section 3), a mixture of experts (Section 4), and a multivariate skew normal mixture (Section 5). In Section 6, the proposed method is demonstrated for a multivariate Gaussian mixture and a skew normal mixture through the simulation studies. Section 7 illustrates practical advantages of the proposed method using real data. Finally, conclusions and discussions are presented in Section 8. The R code for implementing the proposed method is available at the GitHub repository (https://github.com/sshonosuke/RobMixture).

Weighted Estimating Equations for Mixture Modeling

Weighted complete estimating equations

Let 𝒚1,,𝒚n{\text{\boldmath$y$}}_{1},\ldots,{\text{\boldmath$y$}}_{n} be the random variables on p\mathbb{R}^{p}. We consider the following mixture model with probability density function fMf_{M}:

f_{M}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\Psi$}})=\sum_{k=1}^{K}\pi_{k}f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}),\quad i=1,\dots,n, (1)

where 𝜽k{\text{\boldmath$\theta$}}_{k} is the set of model parameters in the kkth component, 𝝅=(π1,,πK){\text{\boldmath$\pi$}}=(\pi_{1},\ldots,\pi_{K}) is the vector of grouping or prior membership probabilities, and 𝚿=(𝝅,𝜽1,,𝜽K){\text{\boldmath$\Psi$}}=({\text{\boldmath$\pi$}},{\text{\boldmath$\theta$}}_{1},\ldots,{\text{\boldmath$\theta$}}_{K}) is the collection of all model parameters. To fit model (1), we introduce the latent membership variable ziz_{i} defined as P(zi=k)=πkP(z_{i}=k)=\pi_{k} for i=1,,ni=1,\dots,n; therefore, the conditional distribution of 𝒚i{\text{\boldmath$y$}}_{i} given zi=kz_{i}=k is f(𝒚i;𝜽k)f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}). For notational simplicity, we define uik=I(zi=k)u_{ik}=I(z_{i}=k), the indicator of the condition zi=kz_{i}=k. The complete estimating equations for 𝚿\Psi given ziz_{i} are given as follows:

\displaystyle\sum_{i=1}^{n}u_{ik}\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})=0,
\displaystyle\sum_{i=1}^{n}\frac{u_{ik}}{\pi_{k}}-\sum_{i=1}^{n}\frac{u_{iK}}{\pi_{K}}=0,

for k=1,,Kk=1,\dots,K.

To robustify the estimating equations against outliers, we introduce the component-specific weight w(𝒚i;𝜽k)w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}) for i=1,,ni=1,\ldots,n and k=1,,Kk=1,\ldots,K which controls the contribution from the iith observation. We then propose the following weighted complete estimating equations (WCE):

\begin{split}&\sum_{i=1}^{n}u_{ik}\left\{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})-C({\text{\boldmath$\theta$}}_{k})\right\}=0,\\ &\sum_{i=1}^{n}\frac{u_{ik}}{\pi_{k}}\frac{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})}{B({\text{\boldmath$\theta$}}_{k})}-\sum_{i=1}^{n}\frac{u_{iK}}{\pi_{K}}\frac{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{K})}{B({\text{\boldmath$\theta$}}_{K})}=0,\end{split} (2)

for k=1,,Kk=1,\dots,K, where w(𝒚i;𝜽k)w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}) is the weight function which may depend on the tuning parameter γ\gamma and

\displaystyle C({\text{\boldmath$\theta$}}_{k})=\int_{\mathbb{R}^{p}}f(t;{\text{\boldmath$\theta$}}_{k})w(t;{\text{\boldmath$\theta$}}_{k})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f(t;{\text{\boldmath$\theta$}}_{k})dt,
\displaystyle\ \ \ \ B({\text{\boldmath$\theta$}}_{k})=\int_{\mathbb{R}^{p}}f(t;{\text{\boldmath$\theta$}}_{k})w(t;{\text{\boldmath$\theta$}}_{k})dt.

Note that B(𝜽k)B({\text{\boldmath$\theta$}}_{k}) and C(𝜽k)C({\text{\boldmath$\theta$}}_{k}) are necessary to make WCE (2) unbiased. That is, the expectations of the estimating equations in (2) with respect to the joint distribution of 𝒚i{\text{\boldmath$y$}}_{i} and ziz_{i} are zero; otherwise, asymptotic properties such as consistency and asymptotic normality of the resulting estimator may not be guaranteed. In the unbiased WCE, we consider the specific form of the weight function based on the component density given by

w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})=f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})^{\gamma}

for γ>0\gamma>0. The weight is small if 𝒚i{\text{\boldmath$y$}}_{i} is an outlier, that is, if 𝒚i{\text{\boldmath$y$}}_{i} is located far in the tail of the kkth component distribution f(𝒚i;𝜽k)f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}). The weighted estimating equations reduce to the original complete estimating equations when γ=0\gamma=0. It should be noted that the proposed weight design is not a direct application of the density power divergence of Basu et al. (1998) to the mixture model (1), which would have used the whole mixture density (1) as the weight in the estimating equations.

EEE algorithm

Since the proposed WCE (2) includes the latent variables uiku_{ik}, they must be imputed with their conditional expectations given 𝒚i{\text{\boldmath$y$}}_{i} before the equations can be solved. Starting from some initial values, we propose an EEE algorithm that updates the estimates in the ssth iteration as follows:

  • -

    E-step:   Compute the posterior probability:

    u_{ik}^{(s)}=\frac{\pi_{k}^{(s)}f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}\pi_{\ell}^{(s)}f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{\ell}^{(s)})},\ \ \ \ k=1,\ldots,K. (3)
  • -

    EE-step:   Update the membership probabilities πk\pi_{k}’s as

    \pi_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}w_{ik}^{(s)}/B({\text{\boldmath$\theta$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}\sum_{i=1}^{n}u_{i\ell}^{(s)}w_{i\ell}^{(s)}/B({\text{\boldmath$\theta$}}_{\ell}^{(s)})}, (4)

    and the component-specific parameters 𝜽k{\text{\boldmath$\theta$}}_{k} by solving the estimating equations:

    \sum_{i=1}^{n}u_{ik}^{(s)}\left\{w_{ik}^{(s)}\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})-C({\text{\boldmath$\theta$}}_{k})\right\}=0,

    where w_{ik}^{(s)}=w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{(s)}).

Note that the updating process (4) can be obtained by solving the imputed version of the second equation in (2). To apply the general algorithm above to a particular mixture model such as a multivariate Gaussian mixture, the only requirement is to calculate the bias correction terms C(𝜽k)C({\text{\boldmath$\theta$}}_{k}) and B(𝜽k)B({\text{\boldmath$\theta$}}_{k}). As shown in Section 3, the bias correction terms are quite simple under a Gaussian distribution. Moreover, the algorithm can be easily modified to a case in which the distribution of each component admits a hierarchical or stochastic representation. For instance, the multivariate skew normal distribution has a hierarchical representation based on the multivariate normal distribution, which allows us to derive tractable weighted complete estimating equations to carry out the proposed robust EEE algorithm, as demonstrated in Section 5.

Regarding the setting of the starting values, we can use the estimation results of some existing robust methods. The detailed initialization strategy for each mixture model is discussed in Sections 3-5. We also note that the algorithm may converge to an unsuitable solution. To avoid this problem, mm initial points are randomly prepared, yielding mm solutions of the EEE algorithm, and the solution that minimizes the trimmed BIC criterion given in Section 2.4 is chosen. In our implementation, we simply set m=10m=10.

The augmented function used in the EE-step can be regarded as a bivariate function S(𝚿|𝚿)S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{\ast}) with the constraint 𝚿=𝚿{\text{\boldmath$\Psi$}}={\text{\boldmath$\Psi$}}^{\ast}, where S(𝚿|𝚿)S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{\ast}) is a collection of the augmented estimating equations given by

\begin{split}&\sum_{i=1}^{n}u_{ik}({\text{\boldmath$\Psi$}}^{\ast})\left\{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{\ast})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})-C({\text{\boldmath$\theta$}}_{k})\right\}=0,\\ &\sum_{i=1}^{n}\frac{u_{ik}({\text{\boldmath$\Psi$}}^{\ast})}{\pi_{k}}\frac{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{\ast})}{B({\text{\boldmath$\theta$}}_{k}^{\ast})}-\sum_{i=1}^{n}\frac{u_{iK}({\text{\boldmath$\Psi$}}^{\ast})}{\pi_{K}}\frac{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{K}^{\ast})}{B({\text{\boldmath$\theta$}}_{K}^{\ast})}=0,\end{split}

for k=1,,Kk=1,\ldots,K and uik(𝚿)=E[uik|𝒚i;𝚿]u_{ik}({\text{\boldmath$\Psi$}})=E[u_{ik}|{\text{\boldmath$y$}}_{i};{\text{\boldmath$\Psi$}}]. The EEE algorithm updates the estimate of 𝚿\Psi by solving S(𝚿|𝚿(s))=0S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{(s)})=0 in the ssth iteration. It is assumed that the component distribution f(;𝜽k)f(\cdot;{\text{\boldmath$\theta$}}_{k}) is continuous and differentiable with respect to 𝜽k{\text{\boldmath$\theta$}}_{k}. Since the density weight w(;𝜽k)w(\cdot;{\text{\boldmath$\theta$}}_{k}) inherits these properties, bivariate function S(𝚿|𝚿)S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{\ast}) is continuous with respect to 𝚿\Psi and 𝚿{\text{\boldmath$\Psi$}}^{\ast}. Hence, if the sequence of estimators from the EEE algorithm converges, it will converge to the solution of the augmented equation S(𝚿|𝚿)=0S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}})=0. From Lemma 1 in Tang and Qu (2016), the sequence of estimators {𝚿(s)}s=1,2,\{{\text{\boldmath$\Psi$}}^{(s)}\}_{s=1,2,\ldots} that solve S(𝚿(s+1)|𝚿(s))S({\text{\boldmath$\Psi$}}^{(s+1)}|{\text{\boldmath$\Psi$}}^{(s)}) converges to the solution 𝚿0{\text{\boldmath$\Psi$}}^{0} of S(𝚿|𝚿)=0S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}})=0 in the neighborhood of (𝚿0,𝚿0)({\text{\boldmath$\Psi$}}^{0},{\text{\boldmath$\Psi$}}^{0}) if S𝚿1S𝚿2<1\|S_{\text{\boldmath$\Psi$}}^{-1}S_{{\text{\boldmath$\Psi$}}^{\ast}}\|_{2}<1, where S𝚿=S(𝚿|𝚿)/𝚿S_{\text{\boldmath$\Psi$}}=\partial S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{\ast})/\partial{\text{\boldmath$\Psi$}} and S𝚿=S(𝚿|𝚿)/𝚿S_{{\text{\boldmath$\Psi$}}^{\ast}}=\partial S({\text{\boldmath$\Psi$}}|{\text{\boldmath$\Psi$}}^{\ast})/\partial{\text{\boldmath$\Psi$}}^{\ast}.

Classification and outlier detection

The observations are classified into clusters using the posterior probability (3). We adopt the standard method of classifying an observation into a cluster with the maximum posterior probability. It should be noted that the classification is applied to all observations, both non-outliers and outliers.

Given the cluster assignment, outlier detection is performed based on a fitted robust model. Although distance-based outlier detection rules (e.g. Cerioli and Farcomeni, 2011; Greco and Agostinelli, 2020) can be adopted under Gaussian mixture models, such approaches are not necessarily applicable to other general mixture models that we are concerned with. Here, we propose an alternative outlier detection method that uses the component density (or probability mass) function f(𝒚i;𝜽k)f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}). Suppose the iith observation is classified into the kkth cluster. We define qi=f(𝒚i;𝜽^k)q_{i}=f({\text{\boldmath$y$}}_{i};\widehat{{\text{\boldmath$\theta$}}}_{k}) as the value of the density function of 𝒚i{\text{\boldmath$y$}}_{i} under the kkth cluster. Our strategy computes the probability pr(qi;𝜽k)=P(f(Y;𝜽k)qi){\rm pr}(q_{i};{\text{\boldmath$\theta$}}_{k})={\rm P}(f(Y;{\text{\boldmath$\theta$}}_{k})\leq q_{i}) where YY is a random variable with a density f(;𝜽k)f(\cdot;{\text{\boldmath$\theta$}}_{k}). Under a general ff, this probability is not analytically available, but we can approximately compute pr(qi;𝜽^k){\rm pr}(q_{i};\widehat{{\text{\boldmath$\theta$}}}_{k}) using the Monte Carlo approximation given by

{\rm pr}(q_{i};\widehat{{\text{\boldmath$\theta$}}}_{k})\approx R^{-1}\sum_{r=1}^{R}I\{f(Y^{(r)};{\text{\boldmath$\theta$}}_{k})\leq q_{i}\},

where RR is the number of Monte Carlo samples and Y(r)Y^{(r)} is the rrth Monte Carlo sample generated from the distribution f(;𝜽k)f(\cdot;{\text{\boldmath$\theta$}}_{k}). Note that a smaller value of pr(qi;𝜽^k){\rm pr}(q_{i};\widehat{{\text{\boldmath$\theta$}}}_{k}) means that the iith observation is more likely to be an outlier. Then, our outlier detection strategy declares the iith observation an outlier when pr(qi;𝜽^k)α{\rm pr}(q_{i};\widehat{{\text{\boldmath$\theta$}}}_{k})\leq\alpha for some fixed α\alpha. Note that this approach may result in both type I errors (a non-outlier wrongly flagged as an outlier) and type II errors (a genuine outlier not flagged), and a larger α\alpha leads to a higher probability of type I errors. We consider α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}, following popular choices of the thresholding probability in distance-based outlier detection.
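
For illustration, the following is a minimal R sketch of this Monte Carlo rule for a single Gaussian component; the function name outlier_prob and the use of the mvtnorm package are illustrative choices and do not reproduce the implementation in the GitHub repository.

```r
# Minimal sketch of the Monte Carlo approximation of pr(q_i; theta_k) for a
# Gaussian component; mu and Sigma play the role of the fitted component
# parameters (illustrative names).
library(mvtnorm)

outlier_prob <- function(y, mu, Sigma, R = 10000) {
  qi  <- dmvnorm(y, mean = mu, sigma = Sigma)          # q_i = f(y_i; theta_k)
  Ymc <- rmvnorm(R, mean = mu, sigma = Sigma)          # Monte Carlo draws from f(.; theta_k)
  mean(dmvnorm(Ymc, mean = mu, sigma = Sigma) <= qi)   # proportion of draws with density below q_i
}

# An observation assigned to this component would be flagged as an outlier
# when outlier_prob(y, mu, Sigma) <= alpha (e.g. alpha = 0.01).
```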

Selection of the tuning parameters

In practice, the number of components KK is unknown and must be chosen reasonably. When the data contains outliers, they should not affect selecting KK; otherwise the selected KK can be different from the true KK. Here, we utilize the results of the outlier detection discussed in the previous section. Let SαS_{\alpha} be the index set of observations flagged as non-outliers under thresholding level α\alpha. We then propose the trimmed BIC criterion as follows:

{\rm BIC}(K)=-\frac{2n}{|S_{\alpha}|}\sum_{i\in S_{\alpha}}\log f_{M}({\text{\boldmath$y$}}_{i};\widehat{{\text{\boldmath$\Psi$}}})+|{\text{\boldmath$\Psi$}}|\log n, (5)

where |Sα||S_{\alpha}| is the cardinality of SαS_{\alpha} and |𝚿||{\text{\boldmath$\Psi$}}| is the number of elements of 𝚿\Psi. If |Sα|=n|S_{\alpha}|=n (no outliers are detected), the criterion reduces to the standard BIC criterion. The optimal KK value is selected as the minimizer of this criterion.
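
As a small illustration, a possible R sketch of criterion (5) is given below; loglik_i denotes the vector of fitted mixture log-densities log f_M(y_i; Psi-hat) and is_outlier the outlier flags from the previous subsection (both names are illustrative assumptions).

```r
# Sketch of the trimmed BIC (5): only observations flagged as non-outliers
# enter the log-likelihood term, rescaled by n / |S_alpha|.
trimmed_bic <- function(loglik_i, is_outlier, n_par) {
  n <- length(loglik_i)
  S <- which(!is_outlier)                       # index set S_alpha of non-outliers
  -2 * n / length(S) * sum(loglik_i[S]) + n_par * log(n)
}
```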

Regarding the choice of γ\gamma, we first note that the proposed EEE algorithm reduces to the standard EM algorithm when γ=0\gamma=0. With a larger γ\gamma, the observations with the small component densities are more strongly downweighted by the density weights w(𝒚i;𝜽k)w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}). Thus, the proposed EEE algorithm becomes more robust against outliers. However, when there are no outliers, using a large value of γ\gamma may result in a loss of efficiency by unnecessarily downweighting non-outliers. Specifically, the use of γ\gamma greater than 0.50.5 may lead to considerable efficiency loss and numerical instability, where this phenomenon is partly demonstrated under the simple models in Basu et al. (1998). Hence, we suggest using a small positive value for γ\gamma. Specifically, we recommend using γ=0.2\gamma=0.2 or 0.30.3 as in our numerical studies in Sections 6 and 7. Another practical idea is to monitor the trimmed BIC (5) for several values of γ\gamma and KK as demonstrated in Section 7.2.

Asymptotic variance-covariance matrix

Here, the asymptotic variance-covariance matrix of the estimator is considered. Let Gi(𝚿|zi)G_{i}^{\ast}({\text{\boldmath$\Psi$}}|z_{i}) denote the complete estimating functions for the iith observation given in (2) and let Gi(𝚿)=E[Gi(𝚿|zi)]G_{i}({\text{\boldmath$\Psi$}})=E[G_{i}^{\ast}({\text{\boldmath$\Psi$}}|z_{i})] denote the estimating equations in which the unobserved ziz_{i} (or uiku_{ik}) is imputed with its conditional expectation. Furthermore, let 𝚿^\widehat{{\text{\boldmath$\Psi$}}} denote the estimator which is the solution of i=1nGi(𝚿^)=0\sum_{i=1}^{n}G_{i}(\widehat{{\text{\boldmath$\Psi$}}})=0. Note that the augmented estimating equations are unbiased since the complete estimating equations (2) are unbiased. Then, under some regularity conditions, the asymptotic distribution of 𝚿^\widehat{{\text{\boldmath$\Psi$}}} is n(𝚿^𝚿)N(0,Vγ)\sqrt{n}(\widehat{{\text{\boldmath$\Psi$}}}-{\text{\boldmath$\Psi$}})\to N(0,V_{\gamma}) where the asymptotic variance-covariance matrix VγV_{\gamma} is given by

\displaystyle V_{\gamma}=\lim_{n\to\infty}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial G_{i}({\text{\boldmath$\Psi$}})}{\partial{\text{\boldmath$\Psi$}}}\right)^{-1}\left\{\frac{1}{n}\sum_{i=1}^{n}{\rm Var}(G_{i}({\text{\boldmath$\Psi$}}))\right\}\left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial G_{i}({\text{\boldmath$\Psi$}})^{\top}}{\partial{\text{\boldmath$\Psi$}}}\right)^{-1}.

This can be consistently estimated by replacing Var(Gi(𝚿)){\rm Var}(G_{i}({\text{\boldmath$\Psi$}})) with Gi(𝚿^)Gi(𝚿^)G_{i}(\widehat{{\text{\boldmath$\Psi$}}})G_{i}(\widehat{{\text{\boldmath$\Psi$}}})^{\top} and Gi(𝚿)/𝚿\partial G_{i}({\text{\boldmath$\Psi$}})/\partial{\text{\boldmath$\Psi$}} with Gi(𝚿)/𝚿|𝚿=𝚿^\partial G_{i}({\text{\boldmath$\Psi$}})/\partial{\text{\boldmath$\Psi$}}|_{{\text{\boldmath$\Psi$}}=\widehat{{\text{\boldmath$\Psi$}}}}. In practice, it is difficult to obtain an analytical expression for the derivative of Gi(𝚿)G_{i}({\text{\boldmath$\Psi$}}). Therefore, the derivative is numerically computed by using the outputs of the EEE algorithm. Let 𝚿(s){\text{\boldmath$\Psi$}}^{(s)} denote the final estimate of the EEE algorithm. Then the derivative Gi(𝚿)/𝚿j\partial G_{i}({\text{\boldmath$\Psi$}})/\partial{\text{\boldmath$\Psi$}}_{j} for j=1,,dim(𝚿)j=1,\ldots,{\rm dim}({\text{\boldmath$\Psi$}}) evaluated at 𝚿^\widehat{{\text{\boldmath$\Psi$}}} can be numerically approximated by {Gi(𝚿(s))Gi(𝚿(j))}/(𝚿j(s)𝚿j(s1))\{G_{i}({\text{\boldmath$\Psi$}}^{(s)})-G_{i}({\text{\boldmath$\Psi$}}^{\ast}_{(j)})\}/({\text{\boldmath$\Psi$}}_{j}^{(s)}-{\text{\boldmath$\Psi$}}_{j}^{(s-1)}), where 𝚿(j){\text{\boldmath$\Psi$}}^{\ast}_{(j)} is obtained by replacing the 𝚿j(s){\text{\boldmath$\Psi$}}_{j}^{(s)} in 𝚿(s){\text{\boldmath$\Psi$}}^{(s)} with 𝚿j(s1){\text{\boldmath$\Psi$}}_{j}^{(s-1)}.
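
The following R sketch assembles this sandwich-type estimate under the assumption that a function G(Psi) returning the n x dim(Psi) matrix of imputed estimating functions G_i(Psi) is available; Psi_s and Psi_prev denote the last two EEE iterates, and all names are illustrative.

```r
# Sketch of the sandwich estimate of V_gamma, with the numerical derivative
# {G_i(Psi^(s)) - G_i(Psi*_(j))} / (Psi_j^(s) - Psi_j^(s-1)) described above.
sandwich_var <- function(G, Psi_s, Psi_prev) {
  Gs <- G(Psi_s)                                 # n x d matrix of G_i at the final estimate
  n <- nrow(Gs); d <- length(Psi_s)
  meat <- crossprod(Gs) / n                      # (1/n) sum_i G_i G_i'
  bread <- matrix(0, d, d)
  for (j in 1:d) {
    Psi_j <- Psi_s
    Psi_j[j] <- Psi_prev[j]                      # replace the j-th element by the previous iterate
    bread[, j] <- colSums(Gs - G(Psi_j)) / (n * (Psi_s[j] - Psi_prev[j]))
  }
  Binv <- solve(bread)
  Binv %*% meat %*% t(Binv)                      # estimate of V_gamma; Var(Psi-hat) is roughly V_gamma / n
}
```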

Robust Gaussian mixture

Gaussian mixture models are among the most widely adopted mixture models for model-based clustering, even though their performance can be severely affected by outliers due to the light tails of the Gaussian (normal) distribution. Here, a new approach to the robust fitting of the Gaussian mixture model using the proposed weighted complete estimating equations is presented. Let f(𝒚i;𝜽k)=ϕp(𝒚i;𝝁k,𝚺k)f({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})=\phi_{p}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\mu$}}_{k},{\text{\boldmath$\Sigma$}}_{k}), where ϕp\phi_{p} denotes the pp-dimensional normal density. It holds that

B({\text{\boldmath$\theta$}}_{k})=|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}(2\pi)^{-p\gamma/2}(1+\gamma)^{-p/2},

and C(𝜽k)=(0p,vec(Ck2))C({\text{\boldmath$\theta$}}_{k})=(\textbf{0}_{p},{\rm vec}(C_{k2})), where 0p\textbf{0}_{p} is a pp-dimensional vector of 0 and

\displaystyle C_{k2}=-\frac{\gamma}{2}|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}{\text{\boldmath$\Sigma$}}_{k}^{-1}(2\pi)^{-p\gamma/2}(1+\gamma)^{-p/2-1}.

Then, the weighted complete estimating equations for 𝝁k{\text{\boldmath$\mu$}}_{k} and 𝚺k{\text{\boldmath$\Sigma$}}_{k} are given by

\displaystyle\sum_{i=1}^{n}u_{ik}w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})=0
\displaystyle\sum_{i=1}^{n}u_{ik}\Big{\{}w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}){\text{\boldmath$\Sigma$}}_{k}-w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})^{\top}\Big{\}}-g({\text{\boldmath$\Sigma$}}_{k})\Big{(}\sum_{i=1}^{n}u_{ik}\Big{)}{\text{\boldmath$\Sigma$}}_{k}=0,

for k=1,,Kk=1,\ldots,K, where g(𝚺k)=γ|𝚺k|γ/2(2π)pγ/2(1+γ)p/21g({\text{\boldmath$\Sigma$}}_{k})=\gamma|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}(2\pi)^{-p\gamma/2}(1+\gamma)^{-p/2-1}. Hence, starting from some initial values 𝚿(0)=(𝝅(0),𝜽1(0),,𝜽K(0)){\text{\boldmath$\Psi$}}^{(0)}=({\text{\boldmath$\pi$}}^{(0)},{\text{\boldmath$\theta$}}_{1}^{(0)},\ldots,{\text{\boldmath$\theta$}}_{K}^{(0)}), the proposed EEE algorithm repeats the following two steps in the ssth iteration:

  • -

    E-step:   The standard E-step is left unchanged as

    u_{ik}^{(s)}=\frac{\pi_{k}^{(s)}\phi_{p}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\mu$}}_{k}^{(s)},{\text{\boldmath$\Sigma$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}\pi_{\ell}^{(s)}\phi_{p}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\mu$}}_{\ell}^{(s)},{\text{\boldmath$\Sigma$}}_{\ell}^{(s)})}.
  • -

    EE-step:   Compute the density weight wik(s)=w(𝒚i;𝜽k(s))w_{ik}^{(s)}=w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{(s)}) and then update the membership probabilities πk\pi_{k}’s as described in Section 2. The component-specific parameters θk\theta_{k}’s are updated as follows:

    \displaystyle{\text{\boldmath$\mu$}}_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}w_{ik}^{(s)}{\text{\boldmath$y$}}_{i}}{\sum_{i=1}^{n}u_{ik}^{(s)}w_{ik}^{(s)}},
    \displaystyle{\text{\boldmath$\Sigma$}}_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}w_{ik}^{(s)}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})^{\top}}{\sum_{i=1}^{n}u_{ik}^{(s)}w_{ik}^{(s)}-g({\text{\boldmath$\Sigma$}}_{k}^{(s)})\sum_{i=1}^{n}u_{ik}^{(s)}},

    and the mixing proportion πk\pi_{k} is updated as (4).

Note that all the formulas in the above steps are obtained in closed form. This is one of the attractive features of the proposed method, as it does not require any computationally intensive methods such as Monte Carlo integration.
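
For concreteness, a minimal R sketch of one iteration (E-step and EE-step) is given below. It uses the mvtnorm package and illustrative names, omits the eigen-ratio constraint and other safeguards, and is a simplified sketch rather than the implementation available in the GitHub repository.

```r
# One EEE iteration for the robust Gaussian mixture with density weight
# w_ik = phi_p(y_i; mu_k, Sigma_k)^gamma. mu and Sigma are lists of length K.
library(mvtnorm)

eee_step_gauss <- function(y, pi_k, mu, Sigma, gamma) {
  n <- nrow(y); p <- ncol(y); K <- length(pi_k)
  dens <- sapply(1:K, function(k) dmvnorm(y, mu[[k]], Sigma[[k]]))   # n x K component densities
  u <- sweep(dens, 2, pi_k, "*"); u <- u / rowSums(u)                # E-step: posterior probabilities (3)
  w <- dens^gamma                                                    # density weights w_ik
  Bk <- sapply(1:K, function(k)
    det(Sigma[[k]])^(-gamma/2) * (2*pi)^(-p*gamma/2) * (1+gamma)^(-p/2))
  gk <- gamma * Bk / (1 + gamma)                                     # g(Sigma_k)
  pi_new <- colSums(u * w) / Bk; pi_new <- pi_new / sum(pi_new)      # update (4)
  for (k in 1:K) {
    uw <- u[, k] * w[, k]
    mu[[k]]    <- colSums(uw * y) / sum(uw)                          # mu_k update
    resid      <- sweep(y, 2, mu[[k]])
    Sigma[[k]] <- crossprod(resid * sqrt(uw)) /
                  (sum(uw) - gk[k] * sum(u[, k]))                    # Sigma_k update
  }
  list(pi = pi_new, mu = mu, Sigma = Sigma, u = u)
}
```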

To choose reasonable starting values 𝚿(0){\text{\boldmath$\Psi$}}^{(0)}, we employ existing robust estimation methods of Gaussian mixture models or clustering, such as trimmed robust clustering (TCL; Garcia-Escudero et al., 2008) and improper maximum likelihood (IML; Coretto and Hennig, 2016).

To stabilize the estimation of the variance-covariance matrix 𝚺k{\text{\boldmath$\Sigma$}}_{k}, it would be beneficial to impose the following eigen-ratio constraint:

\frac{\max_{j=1,\ldots,p}\max_{k=1,\ldots,K}\lambda_{j}({\text{\boldmath$\Sigma$}}_{k})}{\min_{j=1,\ldots,p}\min_{k=1,\ldots,K}\lambda_{j}({\text{\boldmath$\Sigma$}}_{k})}\leq c, (6)

where λj(𝚺k)\lambda_{j}({\text{\boldmath$\Sigma$}}_{k}) denotes the jjth eigenvalue of the covariance matrix 𝚺k{\text{\boldmath$\Sigma$}}_{k} in the kkth component and cc is a fixed constant. When c=1c=1, a spherical structure is imposed on 𝚺k{\text{\boldmath$\Sigma$}}_{k}, and a more flexible structure is allowed under a large value of cc. To reflect the eigen-ratio constraint in the EEE algorithm, the eigenvalues of 𝚺k(s){\text{\boldmath$\Sigma$}}_{k}^{(s)} can simply be replaced with the truncated versions λj(𝚺k)\lambda_{j}^{\ast}({\text{\boldmath$\Sigma$}}_{k}), where λj(𝚺k)=c\lambda_{j}^{\ast}({\text{\boldmath$\Sigma$}}_{k})=c if λj(𝚺k)>c\lambda_{j}({\text{\boldmath$\Sigma$}}_{k})>c, λj(𝚺k)=λj(𝚺k)\lambda_{j}^{\ast}({\text{\boldmath$\Sigma$}}_{k})=\lambda_{j}({\text{\boldmath$\Sigma$}}_{k}) if cθcλj(𝚺k)cc\theta_{c}\leq\lambda_{j}({\text{\boldmath$\Sigma$}}_{k})\leq c, and λj(𝚺k)=cθc\lambda_{j}^{\ast}({\text{\boldmath$\Sigma$}}_{k})=c\theta_{c} if λj(𝚺k)<cθc\lambda_{j}({\text{\boldmath$\Sigma$}}_{k})<c\theta_{c}, where θc\theta_{c} is an unknown constant that depends on cc. The procedure of Fritz et al. (2013) is employed in this study. Note that if cc is set to a small value, the constraint on 𝚺k{\text{\boldmath$\Sigma$}}_{k} may be so severe that the solution of the weighted estimating equations might not exist.
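
A simplified sketch of enforcing (6) in R is shown below: it only raises eigenvalues that fall below the largest eigenvalue divided by c, across all components. The actual procedure of Fritz et al. (2013), which determines the truncation level by optimization, is not reproduced here.

```r
# Crude enforcement of the eigen-ratio constraint (6): small eigenvalues are
# raised so that the largest-to-smallest ratio over all K components is at
# most c_max. This is a simplified stand-in for Fritz et al. (2013).
cap_eigen_ratio <- function(Sigma_list, c_max) {
  eig <- lapply(Sigma_list, eigen, symmetric = TRUE)
  lam_max <- max(sapply(eig, function(e) max(e$values)))
  lower <- lam_max / c_max                          # smallest admissible eigenvalue
  lapply(eig, function(e) {
    lam <- pmax(e$values, lower)
    e$vectors %*% diag(lam, nrow = length(lam)) %*% t(e$vectors)
  })
}
```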

Robust mixture of experts

A mixture of experts model (Jacobs et al., 1991; McLachlan and Peel, 2000) is a useful tool for modelling nonlinear regression relationships. Simple mixtures of normal regression models and their robust versions have been proposed in the literature (e.g. Bai et al., 2012; Song et al., 2014). In addition, robust versions of the mixture of experts based on non-normal distributions were considered by Nguyen and McLachlan (2016) and Chamroukhi (2016). The general model is described as

f_{M}(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\Psi$}})=\sum_{k=1}^{K}g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})f(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k}),

where 𝒙i{\text{\boldmath$x$}}_{i} is a vector of covariates including an intercept term and g(𝒙i;𝜼k)g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k}) is the mixing proportion such that k=1Kg(𝒙i;𝜼k)=1\sum_{k=1}^{K}g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})=1. For the identifiability of the parameters, 𝜼K{\text{\boldmath$\eta$}}_{K} is assumed to be an empty set, since g(𝒙i;𝜼K)g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{K}) is completely determined by 𝜼1,,𝜼K1{\text{\boldmath$\eta$}}_{1},\ldots,{\text{\boldmath$\eta$}}_{K-1}. A typical specification for continuous response variables adopts f(yi|𝒙i;𝜽k)=ϕ(yi;𝒙i𝜷k,σk2)f(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k})=\phi(y_{i};{\text{\boldmath$x$}}_{i}^{\top}{\text{\boldmath$\beta$}}_{k},\sigma_{k}^{2}) and g(𝒙i;𝜼k)=exp(𝒙i𝜼k)/k=1Kexp(𝒙i𝜼k)g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})=\exp({\text{\boldmath$x$}}_{i}^{\top}{\text{\boldmath$\eta$}}_{k})/\sum_{k=1}^{K}\exp({\text{\boldmath$x$}}_{i}^{\top}{\text{\boldmath$\eta$}}_{k}). Let 𝚿\Psi denote the set of all unknown parameters 𝜼k{\text{\boldmath$\eta$}}_{k} and 𝜽k{\text{\boldmath$\theta$}}_{k}. Compared to the standard mixture model (1), the mixing proportion g(𝒙i;𝜼k)g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k}) is a function of 𝒙i{\text{\boldmath$x$}}_{i} parameterized by 𝜼k{\text{\boldmath$\eta$}}_{k}.

As before, the latent variable ziz_{i} such that P(zi=k)=g(𝒙i;𝜼k)P(z_{i}=k)=g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k}) for k=1,,Kk=1,\ldots,K is introduced. Then, the weighted complete estimating equations are given by

\begin{split}&\sum_{i=1}^{n}u_{ik}\left\{w(y_{i};{\text{\boldmath$\theta$}}_{k})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f(y_{i};{\text{\boldmath$\theta$}}_{k})-C_{i}({\text{\boldmath$\theta$}}_{k})\right\}=0,\\ &\sum_{i=1}^{n}\frac{u_{ik}}{g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})}\frac{\partial g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})}{\partial{\text{\boldmath$\eta$}}_{k}}\frac{w(y_{i};{\text{\boldmath$\theta$}}_{k})}{B_{i}({\text{\boldmath$\theta$}}_{k})}-\sum_{i=1}^{n}\frac{u_{iK}}{g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{K})}\frac{\partial g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})}{\partial{\text{\boldmath$\eta$}}_{k}}\frac{w(y_{i};{\text{\boldmath$\theta$}}_{K})}{B_{i}({\text{\boldmath$\theta$}}_{K})}=0,\end{split}

where

\displaystyle C_{i}({\text{\boldmath$\theta$}}_{k})=\int f(t|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k})w(t;{\text{\boldmath$\theta$}}_{k})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f(t|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k})dt,
\displaystyle\ \ \ \ B_{i}({\text{\boldmath$\theta$}}_{k})=\int f(t|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k})w(t;{\text{\boldmath$\theta$}}_{k})dt.

Starting from some initial values 𝚿(0){\text{\boldmath$\Psi$}}^{(0)}, the proposed EEE algorithm repeats the following two steps in the ssth iteration:

  • -

    E-step:   The standard E-step is left unchanged as

    u_{ik}^{(s)}=\frac{g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k}^{(s)})f(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{\ell}^{(s)})f(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{\ell}^{(s)})}.
  • -

    EE-step:   Update 𝜼k{\text{\boldmath$\eta$}}_{k} and 𝜽k{\text{\boldmath$\theta$}}_{k} by solving the following equations:

    \begin{split}&\sum_{i=1}^{n}u_{ik}^{(s)}\left\{w(y_{i};{\text{\boldmath$\theta$}}_{k}^{(s)})\frac{\partial}{\partial{\text{\boldmath$\theta$}}_{k}}\log f(y_{i};{\text{\boldmath$\theta$}}_{k})-C_{i}({\text{\boldmath$\theta$}}_{k})\right\}=0,\\ &\sum_{i=1}^{n}\frac{u_{ik}^{(s)}}{g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k})}\frac{\partial g}{\partial{\text{\boldmath$\eta$}}_{k}}\frac{w(y_{i};{\text{\boldmath$\theta$}}_{k}^{(s+1)})}{B_{i}({\text{\boldmath$\theta$}}_{k}^{(s+1)})}-\sum_{i=1}^{n}\frac{u_{iK}^{(s)}}{g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{K})}\frac{\partial g}{\partial{\text{\boldmath$\eta$}}_{k}}\frac{w(y_{i};{\text{\boldmath$\theta$}}_{K}^{(s+1)})}{B_{i}({\text{\boldmath$\theta$}}_{K}^{(s+1)})}=0.\end{split} (7)

When the mixture components are the normal linear regression models given by f(yi|𝒙i;𝜽k)=ϕ(yi;𝒙i𝜷k,σk2)f(y_{i}|{\text{\boldmath$x$}}_{i};{\text{\boldmath$\theta$}}_{k})=\phi(y_{i};{\text{\boldmath$x$}}_{i}^{\top}{\text{\boldmath$\beta$}}_{k},\sigma_{k}^{2}) with 𝜽k=(𝜷k,σk2){\text{\boldmath$\theta$}}_{k}=({\text{\boldmath$\beta$}}_{k}^{\top},\sigma_{k}^{2}), the bias correction terms Ci(𝜽k)(0p,Ci2(𝜽k))C_{i}({\text{\boldmath$\theta$}}_{k})\equiv(\textbf{0}_{p},C_{i2}({\text{\boldmath$\theta$}}_{k})) and Bi(𝜽k)B_{i}({\text{\boldmath$\theta$}}_{k}) can be analytically obtained as

C_{i2}({\text{\boldmath$\theta$}}_{k})=-(\gamma/2)(\sigma_{k}^{2})^{-1-\gamma/2}(2\pi)^{-\gamma/2}(1+\gamma)^{-3/2},
B_{i}({\text{\boldmath$\theta$}}_{k})=(2\pi\sigma_{k}^{2})^{-\gamma/2}(1+\gamma)^{-1/2}.

The first equation in (7) can be solved analytically, and the closed-form updating steps similar to those in Section 3 can be obtained. On the other hand, the second equation is a function of 𝜼1,,𝜼K1{\text{\boldmath$\eta$}}_{1},\ldots,{\text{\boldmath$\eta$}}_{K-1} and cannot be solved analytically. Note that the solution corresponds to the maximizer of the weighted log-likelihood function of the multinomial distribution given by

\sum_{i=1}^{n}\sum_{k=1}^{K}u_{ik}^{(s)}\frac{w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}^{(s+1)})}{B_{i}({\text{\boldmath$\theta$}}_{k}^{(s+1)})}\log g({\text{\boldmath$x$}}_{i};{\text{\boldmath$\eta$}}_{k}),

since its first-order partial derivatives with respect to 𝜼k{\text{\boldmath$\eta$}}_{k} are reduced to the second equation in (7). Thus, the updating step for 𝜼k{\text{\boldmath$\eta$}}_{k} can be readily carried out.
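
As an illustration for the case of K=2 Gaussian experts, the eta-update amounts to a weighted binary logistic fit, sketched below in R; u holds the posteriors u_ik^(s), wB the ratios w(y_i; theta_k^(s+1))/B_i(theta_k^(s+1)), and all names are illustrative assumptions.

```r
# Sketch of the eta-update for K = 2 experts: maximize the weighted multinomial
# (here binary logistic) log-likelihood. eta_2 is empty by the identifiability
# constraint, so only eta_1 is estimated.
update_eta <- function(X, u, wB, eta_init) {
  negloglik <- function(eta) {
    g1 <- plogis(drop(X %*% eta))                 # g(x_i; eta_1)
    -sum(u[, 1] * wB[, 1] * log(g1) + u[, 2] * wB[, 2] * log(1 - g1))
  }
  optim(eta_init, negloglik, method = "BFGS")$par
}
```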

In setting the initial values of the EEE algorithm, we first randomly split the data into KK groups and apply some existing robust methods to estimate 𝜽k{\text{\boldmath$\theta$}}_{k} for k=1,,Kk=1,\ldots,K, which can be adopted for the initial values 𝜽k(0){\text{\boldmath$\theta$}}_{k}^{(0)} of 𝜽k{\text{\boldmath$\theta$}}_{k}. For example, when f(;𝜽k)f(\cdot;{\text{\boldmath$\theta$}}_{k}) is the normal linear regression as adopted in Section 7.1, we employ an M-estimator available in the rlm function in R package ‘MASS’ (Venables and Ripley, 2002). For the initial values of 𝜼k{\text{\boldmath$\eta$}}_{k}, we suggest simply using 𝜼k(0)=(0,,0){\text{\boldmath$\eta$}}_{k}^{(0)}=(0,\ldots,0).

Robust skew normal mixture

We next consider the use of the pp-dimensional skew normal distribution (Azzalini and Valle, 1996) for the kkth component. The mixture model based on skew normal distributions is more flexible than a multivariate Gaussian mixture, especially when the cluster-specific distributions are not symmetric but skewed. Several works have been conducted on the maximum likelihood estimation of a skew normal mixture (Lin et al., 2007; Lin, 2009). Despite this flexibility in terms of skewness, the skew normal distribution still has light tails like the normal distribution, so the skew normal mixture is also sensitive to outliers. Alternative mixture models using heavy-tailed and skewed distributions have also been proposed in the literature (e.g. Lee and McLachlan, 2017; Morris et al., 2019). Here, we consider the robust fitting of the skew normal mixture within the proposed framework.

We first note that the direct application of the skew normal distribution to the proposed EEE algorithm would be computationally intensive since the bias correction term C(𝜽k)C({\text{\boldmath$\theta$}}_{k}) cannot be obtained analytically, and a Monte Carlo approximation would be required in each iteration. Instead, we provide a novel algorithm that exploits the stochastic representation of the multivariate skew normal distribution. The resulting algorithm is a slight extension of the proposed algorithm in Section 2. Frühwirth-Schnatter and Pyne (2010) provided the following hierarchical representation for 𝒚i{\text{\boldmath$y$}}_{i} given ziz_{i}:

\begin{split}&{\text{\boldmath$y$}}_{i}|(z_{i}=k)={\text{\boldmath$\mu$}}_{k}+{\text{\boldmath$\psi$}}_{k}v_{ik}+{\varepsilon}_{ik},\\ &{\varepsilon}_{ik}\sim N(0,{\text{\boldmath$\Sigma$}}_{k}),\quad v_{ik}\sim N^{+}(0,1),\end{split} (8)

where 𝝁k{\text{\boldmath$\mu$}}_{k} is the p×1p\times 1 vector of location parameters, 𝝍k{\text{\boldmath$\psi$}}_{k} is the p×1p\times 1 vector of skewness parameters, and 𝚺k{\text{\boldmath$\Sigma$}}_{k} is the covariance matrix. Here, N+(a,b)N^{+}(a,b) denotes the truncated normal distribution on the positive real line with mean and variance parameters aa and bb, respectively, and its density function ϕ+(x;a,b2)\phi^{+}(x;a,b^{2}) is given by

\displaystyle\phi^{+}(x;a,b^{2})=\frac{1}{\Phi(a/b)}\phi(x;a,b^{2})I(x\geq 0),

where Φ()\Phi(\cdot) is the distribution function of the standard normal distribution. By defining

{\text{\boldmath$\Omega$}}_{k}={\text{\boldmath$\Sigma$}}_{k}+{\text{\boldmath$\psi$}}_{k}{\text{\boldmath$\psi$}}_{k}^{\top},\quad{\text{\boldmath$\alpha$}}_{k}=\frac{{\text{\boldmath$\omega$}}_{k}{\text{\boldmath$\Omega$}}_{k}^{-1}{\text{\boldmath$\psi$}}_{k}}{\sqrt{1-{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Omega$}}_{k}^{-1}{\text{\boldmath$\psi$}}_{k}}},

where 𝝎k=(𝛀111/2,,𝛀pp1/2){\text{\boldmath$\omega$}}_{k}=({\text{\boldmath$\Omega$}}_{11}^{1/2},\dots,{\text{\boldmath$\Omega$}}_{pp}^{1/2}), the probability density function of the multivariate skew normal distribution of Azzalini and Valle (1996) can be written as

f_{SN}({\text{\boldmath$y$}};{\text{\boldmath$\mu$}}_{k},{\text{\boldmath$\Omega$}}_{k},{\text{\boldmath$\alpha$}}_{k})=2\phi_{p}({\text{\boldmath$y$}};{\text{\boldmath$\mu$}}_{k},{\text{\boldmath$\Omega$}}_{k})\Phi({\text{\boldmath$\alpha$}}_{k}^{\top}{\text{\boldmath$\omega$}}^{-1}_{k}({\text{\boldmath$y$}}-{\text{\boldmath$\mu$}}_{k})). (9)

From (8), the conditional distribution of 𝒚i{\text{\boldmath$y$}}_{i} given both ziz_{i} and vikv_{ik} is normal, namely, 𝒚i|(zi=k),vikN(𝝁k+𝝍kvik,𝚺k){\text{\boldmath$y$}}_{i}|(z_{i}=k),v_{ik}\sim N({\text{\boldmath$\mu$}}_{k}+{\text{\boldmath$\psi$}}_{k}v_{ik},{\text{\boldmath$\Sigma$}}_{k}), so that the bias correction terms under given ziz_{i} and vikv_{ik} can be easily obtained in the same way as in the case of the Gaussian mixture models. Therefore, we consider the following complete estimating equations for the parameters conditional on both ziz_{i} and vikv_{ik}:

\begin{split}&\sum_{i=1}^{n}u_{ik}w_{ik}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-{\text{\boldmath$\psi$}}_{k}v_{ik})=0,\\ &\sum_{i=1}^{n}u_{ik}w_{ik}\Big{\{}{\text{\boldmath$\Sigma$}}_{k}-({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-{\text{\boldmath$\psi$}}_{k}v_{ik})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-{\text{\boldmath$\psi$}}_{k}v_{ik})^{\top}\Big{\}}-g({\text{\boldmath$\Sigma$}}_{k})\left(\sum_{i=1}^{n}u_{ik}\right){\text{\boldmath$\Sigma$}}_{k}=0,\\ &\sum_{i=1}^{n}u_{ik}v_{ik}w_{ik}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-{\text{\boldmath$\psi$}}_{k}v_{ik})=0,\end{split}

where wikw(𝒚i;𝜽k)w_{ik}\equiv w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k}) and g(𝚺k)g({\text{\boldmath$\Sigma$}}_{k}) have the same forms as in the Gaussian mixture case.

The proposed EEE algorithm for the robust fitting of the skew normal mixture is obtained after some modification to the normal case. Since the complete estimating equations contain the additional latent variables vikv_{ik}, some additional steps are included to impute the moments of vikv_{ik} in the equations. The moments are those of the conditional posterior distribution of vikv_{ik}, given zi=kz_{i}=k, which is N+(δik,τik2)N^{+}(\delta_{ik},\tau_{ik}^{2}), where

\delta_{ik}=\frac{{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})}{{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}{\text{\boldmath$\psi$}}_{k}+1},\ \ \ \ \tau_{ik}^{2}=\frac{1}{{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}{\text{\boldmath$\psi$}}_{k}+1}.

We define the following quantities:

\begin{split}&t^{2}_{ik}=\left(\gamma{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}{\text{\boldmath$\psi$}}_{k}+\frac{1}{\tau^{2}_{ik}}\right)^{-1},\ \ \ \ m_{ik}=t^{2}_{ik}\left\{\gamma{\text{\boldmath$\psi$}}_{k}^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})+\frac{\delta_{ik}}{\tau_{ik}^{2}}\right\},\\ &U_{ik}=|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}(2\pi)^{-\gamma p/2}\frac{\Phi(m_{ik}/t_{ik})}{\Phi(\delta_{ik}/\tau_{ik})}\left(\frac{t_{ik}^{2}}{\tau_{ik}^{2}}\right)^{1/2}\\ &\ \ \ \ \times\exp\left\{-\frac{\gamma}{2}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})-\frac{\delta^{2}_{ik}}{2\tau^{2}_{ik}}+\frac{m_{ik}^{2}}{2t_{ik}^{2}}\right\}.\end{split} (10)

Note that, for j=0,1j=0,1 and 22, we have

\displaystyle E[u_{ik}w_{ik}v_{ik}^{j}|{\text{\boldmath$y$}}_{i}]=E[E[u_{ik}w_{ik}v_{ik}^{j}|{\text{\boldmath$y$}}_{i},u_{ik}]|{\text{\boldmath$y$}}_{i}]=E[u_{ik}|{\text{\boldmath$y$}}_{i}]E[w_{ik}v_{ik}^{j}|u_{ik},{\text{\boldmath$y$}}_{i}].

Since the conditional distribution of vikv_{ik} given (uik,𝒚i)(u_{ik},{\text{\boldmath$y$}}_{i}) is N+(δik,τik2)N^{+}(\delta_{ik},\tau_{ik}^{2}), we have

\displaystyle E[v_{ik}^{j}w_{ik}|u_{ik},{\text{\boldmath$y$}}_{i}]
\displaystyle=\int_{0}^{\infty}|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}(2\pi)^{-\gamma p/2}\exp\left\{-\frac{\gamma}{2}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-v_{ik}{\text{\boldmath$\psi$}}_{k})^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}-v_{ik}{\text{\boldmath$\psi$}}_{k})\right\}
\displaystyle\ \ \ \ \ \times\frac{v_{ik}^{j}}{\Phi(\delta_{ik}/\tau_{ik})}(2\pi\tau_{ik}^{2})^{-1/2}\exp\left\{-\frac{(v_{ik}-\delta_{ik})^{2}}{2\tau_{ik}^{2}}\right\}dv_{ik}
\displaystyle=|{\text{\boldmath$\Sigma$}}_{k}|^{-\gamma/2}(2\pi)^{-\gamma p/2}\frac{\Phi(m_{ik}/t_{ik})}{\Phi(\delta_{ik}/\tau_{ik})}\left(\frac{t_{ik}^{2}}{\tau_{ik}^{2}}\right)^{1/2}\int_{0}^{\infty}v_{ik}^{j}\phi_{+}(v_{ik};m_{ik},t_{ik}^{2})dv_{ik}
\displaystyle\ \ \ \ \ \ \times\exp\left\{-\frac{\gamma}{2}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k})-\frac{\delta^{2}_{ik}}{2\tau^{2}_{ik}}+\frac{m_{ik}^{2}}{2t_{ik}^{2}}\right\},

and it follows from Lin et al. (2007) that

\displaystyle\int_{0}^{\infty}v_{ik}^{1}\phi_{+}(v_{ik};m_{ik},t_{ik}^{2})dv_{ik}=m_{ik}+t_{ik}\frac{\phi(m_{ik}/t_{ik})}{\Phi(m_{ik}/t_{ik})},
\displaystyle\int_{0}^{\infty}v_{ik}^{2}\phi_{+}(v_{ik};m_{ik},t_{ik}^{2})dv_{ik}=m_{ik}^{2}+t_{ik}^{2}+m_{ik}t_{ik}\frac{\phi(m_{ik}/t_{ik})}{\Phi(m_{ik}/t_{ik})}.

These expressions are applied in the analytical updating steps of the E-step. Starting from some initial values of the parameters, 𝚿(0)=(𝝅(0),𝜽1(0),,𝜽K(0)){\text{\boldmath$\Psi$}}^{(0)}=({\text{\boldmath$\pi$}}^{(0)},{\text{\boldmath$\theta$}}_{1}^{(0)},\ldots,{\text{\boldmath$\theta$}}_{K}^{(0)}), the proposed EEE algorithm iteratively updates the parameters in the ssth iteration as follows:

  • -

    E-step:   Compute the posterior expectations:

    \begin{split}u_{ik}^{(s)}&\equiv E^{(s)}[u_{ik}|{\text{\boldmath$y$}}_{i}]=\frac{\pi_{k}^{(s)}f_{SN}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\mu$}}_{k}^{(s)},{\text{\boldmath$\Omega$}}_{k}^{(s)},{\text{\boldmath$\alpha$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}\pi_{\ell}^{(s)}f_{SN}({\text{\boldmath$y$}}_{i};{\text{\boldmath$\mu$}}_{\ell}^{(s)},{\text{\boldmath$\Omega$}}_{\ell}^{(s)},{\text{\boldmath$\alpha$}}_{\ell}^{(s)})},\\ V_{ik}^{0(s)}&\equiv E^{(s)}[w_{ik}|{\text{\boldmath$y$}}_{i},u_{ik}]=U_{ik}^{(s)},\\ V_{ik}^{1(s)}&\equiv E^{(s)}[w_{ik}v_{ik}|{\text{\boldmath$y$}}_{i},u_{ik}]=U_{ik}^{(s)}\left\{m_{ik}^{(s)}+t_{ik}^{(s)}\frac{\phi(m_{ik}^{(s)}/t_{ik}^{(s)})}{\Phi(m_{ik}^{(s)}/t_{ik}^{(s)})}\right\},\\ V_{ik}^{2(s)}&\equiv E^{(s)}[w_{ik}v^{2}_{ik}|{\text{\boldmath$y$}}_{i},u_{ik}]=U_{ik}^{(s)}\left\{m_{ik}^{2(s)}+t_{ik}^{2(s)}+m_{ik}^{(s)}t_{ik}^{(s)}\frac{\phi(m_{ik}^{(s)}/t_{ik}^{(s)})}{\Phi(m_{ik}^{(s)}/t_{ik}^{(s)})}\right\},\end{split}

    where mik(s)m_{ik}^{(s)}, tik(s)t_{ik}^{(s)}, and Uik(s)U_{ik}^{(s)} denote mikm_{ik}, tikt_{ik}, and UikU_{ik} given in (10), evaluated at the current parameter values.

  • -

    EE-step:   Update the membership probabilities πk\pi_{k}’s as in Section 2 and component-specific parameters 𝜽k{\text{\boldmath$\theta$}}_{k}’s as

    \displaystyle\pi_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}V_{ik}^{0(s)}/B({\text{\boldmath$\theta$}}_{k}^{(s)})}{\sum_{\ell=1}^{K}\sum_{i=1}^{n}u_{i\ell}^{(s)}V_{i\ell}^{0(s)}/B({\text{\boldmath$\theta$}}_{\ell}^{(s)})},
    \displaystyle{\text{\boldmath$\mu$}}_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}(V_{ik}^{0(s)}{\text{\boldmath$y$}}_{i}-V_{ik}^{1(s)}{\text{\boldmath$\psi$}}_{k}^{(s)})}{\sum_{i=1}^{n}u_{ik}^{(s)}V_{ik}^{0(s)}},
    \displaystyle{\text{\boldmath$\psi$}}_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}V_{ik}^{1(s)}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})}{\sum_{i=1}^{n}u_{ik}^{(s)}V_{ik}^{2(s)}},
    \displaystyle{\text{\boldmath$\Sigma$}}_{k}^{(s+1)}=\frac{\sum_{i=1}^{n}u_{ik}^{(s)}{\text{\boldmath$\eta$}}_{ik}^{(s,s+1)}}{\sum_{i=1}^{n}u_{ik}^{(s)}V_{ik}^{0(s)}-g\big{(}{\text{\boldmath$\Sigma$}}_{k}^{(s)}\big{)}\sum_{i=1}^{n}u_{ik}^{(s)}},

    for k=1,,Kk=1,\ldots,K where

    \begin{split}{\text{\boldmath$\eta$}}_{ik}^{(s,s+1)}=&V_{ik}^{0(s)}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})^{\top}-V_{ik}^{1(s)}{\text{\boldmath$\psi$}}_{k}^{(s+1)}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})^{\top}\\ &-V_{ik}^{1(s)}({\text{\boldmath$y$}}_{i}-{\text{\boldmath$\mu$}}_{k}^{(s+1)})({\text{\boldmath$\psi$}}_{k}^{(s+1)})^{\top}+V_{ik}^{2(s)}{\text{\boldmath$\psi$}}^{(s+1)}_{k}({\text{\boldmath$\psi$}}_{k}^{(s+1)})^{\top}.\end{split}

Note that the form of B(𝜽k)B({\text{\boldmath$\theta$}}_{k}) under the skew normal mixture is the same as that in the case of the normal mixture, because it holds that E[w(𝒚i;𝜽k)]=E[E[w(𝒚i;𝜽k)|uik]]E[w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})]=E[E[w({\text{\boldmath$y$}}_{i};{\text{\boldmath$\theta$}}_{k})|u_{ik}]] and the conditional distribution of 𝒚i{\text{\boldmath$y$}}_{i} given uiku_{ik} is multivariate normal with variance-covariance matrix 𝚺k{\text{\boldmath$\Sigma$}}_{k}. In addition, note that all the steps in the above algorithm are obtained in closed form by exploiting the stochastic representation of the skew normal distribution in the weighted complete estimating equations. In our EEE algorithm, we impose the eigen-ratio constraint considered in Section 3 to avoid an ill-posed problem.
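
For a single component kk, the E-step quantities in (10) and the weighted moments V_ik^0, V_ik^1, V_ik^2 can be computed as in the following R sketch; the function and variable names are illustrative, and the sketch is vectorized over observations rather than reproducing the implementation in the GitHub repository.

```r
# E-step quantities of the robust skew normal EEE algorithm for component k:
# delta_ik, tau_ik^2, t_ik^2, m_ik, U_ik and the moments V_ik^0, V_ik^1, V_ik^2.
skew_estep_moments <- function(y, mu, psi, Sigma, gamma) {
  Sinv  <- solve(Sigma)
  p     <- length(mu)
  r     <- sweep(y, 2, mu)                       # y_i - mu_k (n x p)
  a     <- drop(r %*% Sinv %*% psi)              # psi' Sigma^{-1} (y_i - mu_k)
  q     <- rowSums((r %*% Sinv) * r)             # (y_i - mu_k)' Sigma^{-1} (y_i - mu_k)
  c0    <- drop(crossprod(psi, Sinv %*% psi))    # psi' Sigma^{-1} psi
  tau2  <- 1 / (c0 + 1)
  delta <- a * tau2
  t2    <- 1 / (gamma * c0 + 1 / tau2)
  m     <- t2 * (gamma * a + delta / tau2)
  U <- det(Sigma)^(-gamma/2) * (2*pi)^(-gamma*p/2) *
       pnorm(m / sqrt(t2)) / pnorm(delta / sqrt(tau2)) * sqrt(t2 / tau2) *
       exp(-gamma/2 * q - delta^2 / (2*tau2) + m^2 / (2*t2))          # U_ik
  V1 <- U * (m + sqrt(t2) * dnorm(m / sqrt(t2)) / pnorm(m / sqrt(t2)))
  V2 <- U * (m^2 + t2 + m * sqrt(t2) * dnorm(m / sqrt(t2)) / pnorm(m / sqrt(t2)))
  list(V0 = U, V1 = V1, V2 = V2, delta = delta, tau2 = tau2, m = m, t2 = t2)
}
```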

The following strategy is employed to set the initial values of the algorithm. First, an existing robust algorithm for fitting Gaussian mixture models is applied to obtain the cluster assignment and mean vectors of the clusters. We then set 𝝁k(0){\text{\boldmath$\mu$}}_{k}^{(0)} to the estimated mean vector and 𝝍k(0){\text{\boldmath$\psi$}}_{k}^{(0)} to the vector of marginal sample skewness of the observations in the kkth cluster. For 𝚺k{\text{\boldmath$\Sigma$}}_{k} and πk\pi_{k}, we set 𝚺k(0)=Ip{\text{\boldmath$\Sigma$}}_{k}^{(0)}=I_{p} and πk(0)=1/K\pi_{k}^{(0)}=1/K.

Simulation study

Gaussian mixture

The performance of the proposed method for the robust fitting of a Gaussian mixture is evaluated along with existing methods through simulation studies. First, the performance in terms of the estimation accuracy of the unknown model parameters is examined. We set the underlying true distribution to be a pp-dimensional Gaussian mixture model with K=3K=3 components, where the parameters in each Gaussian distribution are given by

\displaystyle{\text{\boldmath$\mu$}}_{1}=(-\xi,-\xi,0,\ldots,0),\ \ {\text{\boldmath$\mu$}}_{2}=(\xi,-\xi,0,\ldots,0),\ \ {\text{\boldmath$\mu$}}_{3}=(\xi,\xi,0,\ldots,0)

for some ξ>0\xi>0 and 𝚺k=blockdiag(𝚺k,Ip2){\text{\boldmath$\Sigma$}}_{k}={\rm blockdiag}({\text{\boldmath$\Sigma$}}_{k}^{\ast},I_{p-2}) with

\displaystyle{\text{\boldmath$\Sigma$}}_{1}^{\ast}=\left(\begin{array}[]{cc}2&0.3\\ 0.3&1\end{array}\right),\ \ {\text{\boldmath$\Sigma$}}_{2}^{\ast}=\left(\begin{array}[]{cc}1&-0.3\\ -0.3&1\end{array}\right),\ \ {\text{\boldmath$\Sigma$}}_{3}^{\ast}=\left(\begin{array}[]{cc}1&0.3\\ 0.3&2\end{array}\right).

The parameter ξ\xi controls the separation of the three clusters. The following two scenarios, ξ=2\xi=2 and ξ=3\xi=3, correspond to overlapping and well-separated clusters, respectively. In this study, the sample size is set to n=500n=500, and the three cases of the dimension, p{2,5,10}p\in\{2,5,10\}, are considered (the results under n=2000n=2000 are provided in the Supplementary Material). The true mixing probabilities are fixed at π1=0.3\pi_{1}=0.3, π2=0.3\pi_{2}=0.3 and π3=0.4\pi_{3}=0.4. Outliers are generated from the uniform distribution on ABA\setminus B, where A=(10,10)×(5,5)×(3,3)p2A=(-10,10)\times(-5,5)\times(-3,3)^{p-2} and BB is the set of points whose minimum squared Mahalanobis distance to the cluster means is at most 5p5p, namely,

B={x|mink=1,,K(x𝝁k)𝚺k1(x𝝁k)5p}.B=\{x\ |\ \min_{k=1,\ldots,K}(x-{\text{\boldmath$\mu$}}_{k})^{\top}{\text{\boldmath$\Sigma$}}_{k}^{-1}(x-{\text{\boldmath$\mu$}}_{k})\leq 5p\}.

Let ω\omega denote the proportion of outliers such that n(1ω)n(1-\omega) observations are drawn from the true mixture distribution, whereas the remaining observations are generated as outliers. We adopt the following three cases: ω{0,0.03,0.06}\omega\in\{0,0.03,0.06\}. Setting ω=0\omega=0 means that there are no outliers in the data. Therefore, the efficiency loss of the robust methods against standard non-robust methods can be assessed.
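For concreteness, this data-generating process can be sketched in R as follows (a hypothetical helper, not the code used in the paper); the Gaussian components are drawn with MASS::mvrnorm and the outliers are obtained by rejection sampling on A∖B.

```r
library(MASS)   # mvrnorm for the Gaussian components

blockdiag <- function(A, m) {                     # blockdiag(A, I_m)
  B <- diag(nrow(A) + m); B[1:nrow(A), 1:nrow(A)] <- A; B
}

gen_data <- function(n = 500, xi = 2, omega = 0.03, p = 2,
                     prob = c(0.3, 0.3, 0.4)) {
  mu <- rbind(c(-xi, -xi, rep(0, p - 2)),
              c( xi, -xi, rep(0, p - 2)),
              c( xi,  xi, rep(0, p - 2)))
  Sg <- list(blockdiag(matrix(c(2,  0.3,  0.3, 1), 2, 2), p - 2),
             blockdiag(matrix(c(1, -0.3, -0.3, 1), 2, 2), p - 2),
             blockdiag(matrix(c(1,  0.3,  0.3, 2), 2, 2), p - 2))
  n_out <- round(n * omega)
  z <- sample(1:3, n - n_out, replace = TRUE, prob = prob)
  y <- t(sapply(z, function(k) mvrnorm(1, mu[k, ], Sg[[k]])))
  out <- matrix(0, n_out, p); i <- 0
  while (i < n_out) {                             # rejection sampling on A \ B
    x <- c(runif(1, -10, 10), runif(1, -5, 5), runif(p - 2, -3, 3))
    d <- min(sapply(1:3, function(k) mahalanobis(x, mu[k, ], Sg[[k]])))
    if (d > 5 * p) { i <- i + 1; out[i, ] <- x }
  }
  list(y = rbind(y, out), z = c(z, rep(0, n_out)))  # z = 0 marks an outlier
}
```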

For each of the 1000 simulated datasets, the proposed weighted complete estimating equation method is applied with two fixed values of γ\gamma, namely γ=0.2\gamma=0.2 and γ=0.3\gamma=0.3. Hereafter, they are denoted as WCE1 and WCE2, respectively. For comparison, the standard (non-robust) maximum likelihood method for fitting the Gaussian mixture (GM) using the EM algorithm is also implemented. As existing robust contenders, we consider the trimmed robust clustering (TCL; Garcia-Escudero et al., 2008), improper maximum likelihood (IML; Coretto and Hennig, 2016) and contaminated Gaussian mixture (CGM; Punzo and McNicholas, 2016), which are available in the R packages ‘tclust’ (Fritz et al., 2012), ‘otrimle’ (Coretto and Hennig, 2019) and ‘ContaminatedMixt’ (Punzo et al., 2018), respectively. The trimming level in TCL is set to the default value of 0.050.05. In CGM, the fully structured variance-covariance matrix is used. The same eigen-ratio constraint is imposed for all the methods other than CGM by setting c=10c=10 in condition (6), noting that the true eigen-ratios are 2.272.27 for the first and third clusters and 1.861.86 for the second cluster. In WCE1 and WCE2, α=0.01\alpha=0.01 is employed for outlier detection.

The estimation performance of the methods is evaluated based on the squared error given by k=1K𝝁^k𝝁k2\sum_{k=1}^{K}\|\widehat{{\text{\boldmath$\mu$}}}_{k}-{\text{\boldmath$\mu$}}_{k}\|^{2}, k=1Kvec(𝚺^k)vec(𝚺k)2\sum_{k=1}^{K}\|{\rm vec}(\widehat{{\text{\boldmath$\Sigma$}}}_{k})-{\rm vec}({\text{\boldmath$\Sigma$}}_{k})\|^{2} and k=1K(π^kπk)2\sum_{k=1}^{K}(\widehat{\pi}_{k}-\pi_{k})^{2}. We also computed the following integrated squared error to check the overall estimation accuracy:

p{k=1Kπ^kf(x;𝜽^k)k=1Kπkf(x;𝜽k)}2dx,\int_{\mathcal{R}^{p}}\left\{\sum_{k=1}^{K}\widehat{\pi}_{k}f(x;\widehat{{\text{\boldmath$\theta$}}}_{k})-\sum_{k=1}^{K}\pi_{k}f(x;{\text{\boldmath$\theta$}}_{k})\right\}^{2}{\rm d}x, (11)

where the integral is approximated by Monte Carlo integration by generating 3000 random numbers uniformly distributed on [6,6]2×[3,3]p2[-6,6]^{2}\times[-3,3]^{p-2}. For measuring the classification accuracy, we calculated the classification error defined as |S|1iSI(z^izi)|S|^{-1}\sum_{i\in S}I(\widehat{z}_{i}\neq z_{i}), where z^i\widehat{z}_{i} and ziz_{i} are the estimated and true cluster assignments, respectively, and SS is the set of non-outliers. The above quantities were averaged over the 1000 simulated datasets, which gives the mean squared error (MSE) of the model parameters, the mean integrated squared error (MISE) of the density function, and the mean classification error (MCE) of the cluster assignment.
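A minimal sketch of these performance measures in R is given below; mvtnorm::dmvnorm is used for the component densities, the volume of the sampling region is included in the Monte Carlo estimate of (11) (only relative comparisons matter), and the cluster labels are assumed to have been aligned with the truth.

```r
library(mvtnorm)   # dmvnorm for the component densities

# Mixture density at the rows of x; mu is a K x p matrix, Sigma a list of K matrices.
dmix <- function(x, pi, mu, Sigma)
  rowSums(sapply(seq_along(pi),
                 function(k) pi[k] * dmvnorm(x, mu[k, ], Sigma[[k]])))

# Monte Carlo version of the integrated squared error (11).
ise_mc <- function(est, true, p, M = 3000) {
  x <- cbind(matrix(runif(2 * M, -6, 6), M, 2),
             matrix(runif((p - 2) * M, -3, 3), M, p - 2))
  vol <- 12^2 * 6^(p - 2)                          # volume of [-6,6]^2 x [-3,3]^(p-2)
  vol * mean((dmix(x, est$pi, est$mu, est$Sigma) -
              dmix(x, true$pi, true$mu, true$Sigma))^2)
}

# Classification error over the non-outliers (labels assumed aligned to the truth).
class_error <- function(zhat, z, nonoutlier) mean(zhat[nonoutlier] != z[nonoutlier])
```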

Tables 1-3 present the MSE, MISE and MCE of the competing methods under p=2,5p=2,5 and 1010, respectively. First, the performance of the standard GM is reasonable only in the cases where the clusters are well-separated (ξ=3\xi=3) and there are no outliers (ω=0\omega=0). When the clusters are overlapping (ξ=2\xi=2) and the data do not contain any outliers, the estimation accuracy of GM deteriorates as the dimension pp increases; in fact, in the case of p=10p=10, the robust methods, including the proposed method, performed better than GM. In the presence of outliers (ω>0\omega>0), the estimation accuracy of the robust methods in terms of MSE and MISE appears comparable. For the classification accuracy, the order of performance of the robust methods appears to be case-dependent, and they perform comparably overall. For example, in the cases where the clusters are well-separated with p=5p=5 and 1010, IML and CGM tend to result in smaller MCE than the proposed method. On the other hand, when p=10p=10 and the clusters are overlapping, the proposed WCE2 resulted in the smallest MCE.

The performance of outlier detection is also evaluated. We let δi\delta_{i} be an indicator of outliers such that δi=1\delta_{i}=1 denotes that 𝒚i{\text{\boldmath$y$}}_{i} is an outlier. The false discovery rate (FDR) and power (PW) are given by FDR=iSδ^i/i=1nδ^i{\rm FDR}=\sum_{i\in S}\widehat{\delta}_{i}/\sum_{i=1}^{n}\widehat{\delta}_{i} and PW=iΩδ^i/nω{\rm PW}=\sum_{i\in\Omega}\widehat{\delta}_{i}/n\omega, where Ω\Omega denotes the set of outliers. Table 4 presents the FDR and PW values averaged over the 1000 simulated datasets. The table shows that the proposed method performs well in the low-dimensional case (p=2p=2), achieving a relatively low FDR and high power compared with the contenders. It is also shown that, as the dimension increases, the FDR of the proposed method increases while the power remains high.
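In code, the two measures can be computed directly from the estimated and true outlier indicators, as in the following minimal sketch.

```r
# delta_hat: estimated 0/1 outlier indicators; is_out: true outlier indicators.
# FDR = falsely flagged non-outliers / all flagged; PW = flagged true outliers / (n * omega).
fdr_pw <- function(delta_hat, is_out) {
  c(FDR = sum(delta_hat[!is_out]) / max(sum(delta_hat), 1),   # max() guards against 0/0
    PW  = sum(delta_hat[is_out])  / sum(is_out))
}
```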

Next, we consider the situation where the number of outliers is close to the minimum cluster size. Specifically, we set ω=0.06\omega=0.06 and the true mixing probabilities as 𝝅=(0.1,0.3,0.6){\text{\boldmath$\pi$}}=(0.1,0.3,0.6). Using the same data generating process as in the case of p=2p=2, the MSE, MISE, and MCE of the six methods are evaluated based on the 1000 simulated datasets. The results are reported in Table 5. In this simulation setting, TCL appears to produce the best results with the smallest MSE, MISE, and MCE. The table also shows that some of the errors under IML and CGM are of very large magnitude.

Finally, we examine the performance in selecting the number of components. The same simulation scenarios with p=2p=2 are considered. For the simulated datasets, the optimal KK from {2,,6}\{2,\ldots,6\} is selected based on the trimmed BIC criterion (5). Since the trimmed BIC can be applied to a variety of robust methods that provide parameter estimates and detect outliers, the proposed methods, as well as the three existing robust methods (TCL, IML, and CGM) are considered in this study. The same settings are used for the tuning parameters as in the previous simulation. In all the scenarios, α=0.01\alpha=0.01 is used for WCE1 and WCE2. In addition, the performance is also assessed with α{0.005,0.001}\alpha\in\{0.005,0.001\} to check the sensitivity with respect to α\alpha in the most challenging situations where ω=0.06\omega=0.06. Based on the 300 simulated datasets, the selection probabilities for each KK are computed. The results are presented in Table 6. The table shows that when the data contains outliers, the standard GM method tends to select a larger number of components than the truth since the outliers are recognized as a new cluster. On the other hand, the proposed methods (WCE1 and WCE2) are highly accurate in selecting the true number of components even in the presence of outliers. Among the existing robust methods, TCL shows comparable performance with the proposed methods, whereas IML and CGM perform rather poorly.

Recalling that rather advantageous settings are used for the contenders, we can conclude that the overall estimation performance of the proposed method is comparable with the existing methods. The results of this study indicate that the proposed method is also a promising approach for the robust estimation of mixture models.

Table 1: Mean squared errors (MSE) of the model parameters, mean integrated squared errors (MISE) of the density estimation, and mean classification error (MCE) of the proposed methods (WCE1 and WCE2), the standard Gaussian mixture (GM), and the existing robust methods (TCL, IML, CGM), based on the 1000 simulated datasets with p=2p=2.
MSE MSE MSE MSE MSE MSE
𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE 𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE
(×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2}) (×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2})
(ξ=2,ω=0)(\xi=2,\ \omega=0) (ξ=3,ω=0)(\xi=3,\ \omega=0)
WCE1 7.27 0.37 1.85 3.22 4.55 5.16 0.23 1.33 2.60 1.12
WCE2 7.49 0.40 1.88 3.36 4.60 5.43 0.25 1.34 2.73 1.17
GM 7.09 0.34 1.82 3.09 3.91 4.92 0.22 1.33 2.49 0.24
TCL 9.04 0.82 2.62 9.35 8.69 6.01 0.60 1.65 6.55 5.03
IML 7.64 0.61 2.37 9.44 7.18 5.23 0.41 1.60 6.60 2.47
CGM 7.15 0.47 1.84 5.07 3.91 4.94 0.33 1.33 4.06 0.24
(ξ=2,ω=0.03)(\xi=2,\ \omega=0.03) (ξ=3,ω=0.03)(\xi=3,\ \omega=0.03)
WCE1 8.90 0.91 2.15 3.77 4.54 5.66 0.42 1.34 2.98 0.90
WCE2 8.28 0.57 2.00 3.59 4.56 5.77 0.34 1.35 2.91 1.02
GM 47.63 17.28 10.01 13.68 6.28 13.13 6.00 1.39 11.25 0.28
TCL 8.47 0.56 2.24 6.09 5.93 5.76 0.33 1.49 3.91 2.16
IML 8.73 0.64 2.22 6.62 6.04 5.74 0.45 1.62 5.45 2.14
CGM 16.17 2.77 3.71 5.13 4.41 5.69 0.48 1.40 3.60 0.26
(ξ=2,ω=0.06)(\xi=2,\ \omega=0.06) (ξ=3,ω=0.06)(\xi=3,\ \omega=0.06)
WCE1 17.13 6.72 4.23 6.37 4.99 6.40 1.54 1.46 4.45 0.74
WCE2 9.77 1.36 2.33 4.12 4.63 6.02 0.60 1.45 3.36 0.90
GM 164.50 52.31 32.37 24.63 10.97 32.90 23.56 1.81 20.84 0.32
TCL 9.73 0.69 2.15 4.30 4.13 6.24 0.47 1.51 3.22 0.35
IML 9.35 0.80 2.52 7.91 6.86 6.12 0.58 1.82 7.33 3.20
CGM 58.41 15.18 14.10 8.77 6.23 6.41 1.08 1.67 4.99 0.26
Table 2: Mean squared errors (MSE) of the model parameters, mean integrated squared errors (MISE) of the density estimation, and mean classification error (MCE) of the proposed methods (WCE1 and WCE2), the standard Gaussian mixture (GM), and the existing robust methods (TCL, IML, CGM), based on the 1000 simulated datasets with p=5p=5.
MSE MSE MSE MSE MSE MSE
𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE 𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE
(×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2}) (×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2})
(ξ=2,ω=0)(\xi=2,\ \omega=0) (ξ=3,ω=0)(\xi=3,\ \omega=0)
WCE1 1.66 2.93 2.41 1.80 6.00 1.19 2.92 1.43 1.74 2.53
WCE2 1.74 2.82 2.46 1.79 6.01 1.33 2.77 1.47 1.72 2.49
GM 2.23 3.58 4.19 1.97 4.90 1.07 3.70 1.34 1.93 0.32
TCL 1.64 1.31 2.27 2.21 8.98 1.21 1.01 1.58 1.69 5.14
IML 2.46 1.38 3.49 1.23 5.31 1.07 0.77 1.39 0.94 0.81
CGM 1.42 1.00 1.87 1.21 4.19 1.06 0.79 1.34 1.07 0.26
(ξ=2,ω=0.03)(\xi=2,\ \omega=0.03) (ξ=3,ω=0.03)(\xi=3,\ \omega=0.03)
WCE1 1.77 3.10 2.88 1.73 5.91 1.23 2.91 1.50 1.64 2.24
WCE2 1.78 2.91 2.54 1.76 5.91 1.37 2.82 1.55 1.69 2.35
GM 15.22 26.80 40.38 2.82 12.64 2.18 10.19 1.52 2.11 0.44
TCL 1.65 1.20 2.35 1.57 6.30 1.15 0.85 1.47 1.14 2.26
IML 3.70 3.02 5.98 1.35 6.00 1.12 0.87 1.42 0.99 0.96
CGM 5.61 13.12 9.81 1.47 6.20 1.18 1.68 1.46 0.98 0.28
(ξ=2,ω=0.06)(\xi=2,\ \omega=0.06) (ξ=3,ω=0.06)(\xi=3,\ \omega=0.06)
WCE1 2.18 4.02 4.59 1.70 6.05 1.27 2.96 1.54 1.56 1.93
WCE2 1.88 3.09 2.98 1.74 5.89 1.39 2.84 1.60 1.63 2.13
GM 30.16 59.38 66.20 3.69 17.15 5.18 32.89 2.24 3.26 0.57
TCL 2.12 2.91 2.88 1.31 4.87 1.30 1.63 1.46 0.97 0.29
IML 3.01 2.23 3.51 1.30 5.69 1.17 1.01 1.52 1.05 1.00
CGM 15.61 62.26 33.93 2.51 10.26 1.45 3.96 1.83 1.23 0.28
Table 3: Mean squared errors (MSE) of the model parameters, mean integrated squared errors (MISE) of the density estimation, and mean classification error (MCE) of the proposed methods (WCE1 and WCE2), the standard Gaussian mixture (GM), and the existing robust methods (TCL, IML, CGM), based on the 1000 simulated datasets with p=10p=10.
MSE MSE MSE MSE MSE MSE
𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE 𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE
(×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2}) (×102\times 10^{2}) (×103\times 10^{3}) (×106\times 10^{6}) (×102\times 10^{2})
(ξ=2,ω=0)(\xi=2,\ \omega=0) (ξ=3,ω=0)(\xi=3,\ \omega=0)
WCE1 0.86 6.51 25.39 8.52 10.40 0.24 4.46 1.63 7.06 2.99
WCE2 0.60 5.73 15.32 9.80 8.90 0.29 4.52 1.75 7.63 3.08
GM 1.29 8.26 43.01 8.60 11.56 0.20 5.29 1.39 7.42 0.37
TCL 1.99 11.86 24.72 19.24 19.99 0.22 2.66 1.60 11.18 5.23
IML 2.04 10.79 20.15 9.45 14.06 0.20 2.36 1.39 5.95 0.55
CGM 0.44 3.88 7.51 7.50 5.96 0.20 2.34 1.38 5.94 0.30
(ξ=2,ω=0.03)(\xi=2,\ \omega=0.03) (ξ=3,ω=0.03)(\xi=3,\ \omega=0.03)
WCE1 0.83 6.84 26.74 8.80 10.52 0.25 4.56 1.68 7.94 2.89
WCE2 0.60 5.89 14.83 9.62 8.95 0.30 4.65 1.83 8.58 2.98
GM 3.10 36.02 95.08 10.81 21.60 0.39 15.96 1.94 9.32 0.73
TCL 2.02 12.64 26.26 16.46 17.76 0.22 2.57 1.53 8.34 2.34
IML 2.01 14.29 19.64 10.49 13.91 0.22 2.61 1.50 6.63 0.61
CGM 1.13 23.54 25.64 10.54 10.01 0.23 5.87 1.54 6.86 0.34
(ξ=2,ω=0.06)(\xi=2,\ \omega=0.06) (ξ=3,ω=0.06)(\xi=3,\ \omega=0.06)
WCE1 1.49 11.88 65.37 11.79 15.68 0.26 4.64 1.72 7.85 2.81
WCE2 0.53 5.91 14.05 17.53 8.43 0.31 4.75 1.87 7.70 2.88
GM 3.54 55.76 92.74 7.92 19.89 0.87 37.46 4.88 11.25 1.48
TCL 3.12 26.48 42.54 13.23 20.68 0.25 5.05 1.43 6.90 0.45
IML 1.25 10.40 21.28 11.08 11.36 0.23 2.84 1.66 6.89 0.70
CGM 1.59 34.55 43.64 9.75 12.72 0.34 15.13 2.17 7.80 0.44
Table 4: Percentage of false discovery rate (FDR) and power (PW) of the outlier detection methods under p{2,5,10}p\in\{2,5,10\}, averaged over the 1000 simulated datasets.
p=2p=2 p=5p=5 p=10p=10
FDR PW FDR PW FDR PW
(ξ=2,ω=0.03\xi=2,\ \omega=0.03)
WCE1 13.8 89.8 31.0 93.6 40.0 93.9
WCE2 15.6 91.5 32.4 93.8 41.7 93.8
TCL 40.0 93.8 39.9 93.8 39.9 93.8
IML 16.4 85.4 10.0 87.5 4.7 84.1
CGM 10.5 78.5 7.2 68.0 6.6 72.1
(ξ=2,ω=0.06\xi=2,\ \omega=0.06)
WCE1 6.0 81.9 16.2 96.1 21.2 96.8
WCE2 7.3 91.0 18.2 96.7 25.0 96.8
TCL 1.2 79.7 0.3 80.4 0.0 80.6
IML 16.4 91.5 8.0 94.9 2.4 94.3
CGM 9.9 72.3 9.4 49.0 7.3 78.5
(ξ=3,ω=0.03\xi=3,\ \omega=0.03)
WCE1 18.3 89.8 38.4 93.9 44.7 93.9
WCE2 20.8 91.2 39.6 93.9 45.4 94.0
TCL 40.0 93.7 39.9 93.9 39.9 93.9
IML 18.3 81.7 9.6 89.5 5.8 91.9
CGM 13.8 85.0 10.5 90.8 7.2 89.2
(ξ=3,ω=0.06\xi=3,\ \omega=0.06)
WCE1 8.4 86.7 20.7 96.8 27.8 96.9
WCE2 10.3 91.7 22.6 96.8 28.2 96.8
TCL 2.1 79.0 0.3 80.4 0.1 80.6
IML 17.3 89.7 6.3 94.7 4.2 96.1
CGM 16.0 92.1 12.0 93.4 8.4 91.0
Table 5: Mean squared errors (MSE) of the model parameters, mean integrated squared errors (MISE) of the density estimation, and mean classification error (MCE) of the proposed methods (WCE1 and WCE2), the standard Gaussian mixture (GM), and the existing robust methods (TCL, IML, CGM), based on the 1000 simulated datasets under the additional scenarios with p=2p=2.
MSE MSE MSE MSE MSE MSE
𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE 𝝁\mu 𝚺\Sigma 𝝅\pi MISE MCE
(×103\times 10^{3}) (×106\times 10^{6}) (×100\times 100) (×103\times 10^{3}) (×106\times 10^{6}) (×100\times 100)
(ξ=2)(\xi=2) (ξ=3)(\xi=3)
WCE1 3.29 60.50 22.79 5.58 5.22 0.68 45.93 2.69 4.62 0.86
WCE2 1.70 41.50 12.05 4.50 4.79 0.26 23.61 1.88 3.63 0.94
GM 9.90 82.05 66.44 22.17 13.33 3.63 83.48 6.62 13.41 0.80
TCL 0.29 3.58 1.78 4.03 3.57 0.16 2.10 1.24 3.07 0.35
IML 15.78 8.47 110.44 40.32 27.60 57.51 8.45 194.53 59.72 40.14
CGM 9.09 236.34 54.47 19.76 11.30 5.04 311.11 10.26 5.62 1.25
Table 6: Proportions (%\%) of the selected number of components based on the six methods under p=2p=2 and n=500n=500. The true number of components (KK) is 33.
KK KK
2 3 4 5 6 2 3 4 5 6
(ξ=2,ω=0)(\xi=2,\ \omega=0) (ξ=3,ω=0)(\xi=3,\ \omega=0)
WCE1 0.0 92.3 7.7 0.0 0.0 0.0 97.0 3.0 0.0 0.0
WCE2 0.0 92.3 7.7 0.0 0.0 0.0 96.0 4.0 0.0 0.0
GM 0.0 100.0 0.0 0.0 0.0 0.0 97.7 2.3 0.0 0.0
TCL 0.0 100.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0
IML 23.1 53.8 23.1 0.0 0.0 68.0 25.3 3.0 3.0 0.7
CGM 0.0 84.6 15.4 0.0 0.0 0.0 71.0 20.7 5.0 3.3
(ξ=2,ω=0.03)(\xi=2,\ \omega=0.03) (ξ=3,ω=0.03)(\xi=3,\ \omega=0.03)
WCE1 0.0 98.7 1.3 0.0 0.0 0.0 98.3 1.7 0.0 0.0
WCE2 0.0 98.7 1.3 0.0 0.0 0.0 94.9 5.1 0.0 0.0
GM 0.3 92.7 7.0 0.0 0.0 0.0 97.5 2.5 0.0 0.0
TCL 0.0 100.0 0.0 0.0 0.0 0.0 100.0 0.0 0.0 0.0
IML 14.3 63.3 15.7 4.0 2.7 61.9 23.7 7.6 5.1 1.7
CGM 8.3 78.0 7.3 4.3 2.0 0.0 68.6 19.5 8.5 3.4
(ξ=2,ω=0.06)(\xi=2,\ \omega=0.06) (ξ=3,ω=0.06)(\xi=3,\ \omega=0.06)
WCE1 2.0 93.3 4.3 0.3 0.0 0.0 97.0 3.0 0.0 0.0
WCE2 0.0 93.7 6.3 0.0 0.0 0.0 96.0 4.0 0.0 0.0
WCE1 (α=0.005\alpha=0.005) 3.0 92.7 3.7 0.3 0.3 0.0 99.7 0.3 0.0 0.0
WCE2 (α=0.005\alpha=0.005) 0.0 96.7 3.3 0.0 0.0 0.0 99.0 1.0 0.0 0.0
WCE1 (α=0.001\alpha=0.001) 4.7 86.3 7.3 1.3 0.3 0.0 98.0 2.0 0.0 0.0
WCE2 (α=0.001\alpha=0.001) 1.0 93.0 5.7 0.3 0.0 0.0 95.3 4.7 0.0 0.0
GM 14.0 66.3 18.7 1.0 0.0 0.0 81.7 17.3 1.0 0.0
TCL 0.0 96.0 4.0 0.0 0.0 0.0 96.0 4.0 0.0 0.0
IML 8.0 59.7 21.0 8.7 2.7 56.7 31.0 6.3 4.0 2.0
CGM 43.3 33.0 14.7 7.0 2.0 2.7 74.7 16.0 4.7 2.0

Skew normal mixture

The performance of the proposed methods for the skew normal mixture is examined using simulated data along with some existing methods. We consider a pp-dimensional mixture of skew-normal distributions with K=2K=2 components, where the mixture components have the location parameter vectors 𝝁1=𝝁2=(ξ,ξ,0,,0){\text{\boldmath$\mu$}}_{1}=-{\text{\boldmath$\mu$}}_{2}=(\xi,\xi,0,\ldots,0), skewness parameter vectors 𝝍1=𝝍2=(2,2,0,,0){\text{\boldmath$\psi$}}_{1}=-{\text{\boldmath$\psi$}}_{2}=(2,2,0,\ldots,0) and scale matrices 𝚺1{\text{\boldmath$\Sigma$}}_{1} and 𝚺2{\text{\boldmath$\Sigma$}}_{2} given by 𝚺1=𝚺2=blockdiag(𝚺,Ip2){\text{\boldmath$\Sigma$}}_{1}={\text{\boldmath$\Sigma$}}_{2}={\rm blockdiag}({\text{\boldmath$\Sigma$}}^{\ast},I_{p-2}) with (𝚺)11=(𝚺)22=1({\text{\boldmath$\Sigma$}}^{\ast})_{11}=({\text{\boldmath$\Sigma$}}^{\ast})_{22}=1 and (𝚺)12=(𝚺)21=0.3({\text{\boldmath$\Sigma$}}^{\ast})_{12}=({\text{\boldmath$\Sigma$}}^{\ast})_{21}=0.3, in the parametrization given in (8). Two cases, ξ=1\xi=1 and ξ=2\xi=2, are considered, corresponding to overlapping and well-separated clusters, respectively. The mixing proportions are set to π1=0.4\pi_{1}=0.4 and π2=0.6\pi_{2}=0.6. The sample size is set to n=500n=500, and the results for n=2000n=2000 are provided in the Supplementary Material. The proportion of outliers is set to ω{0.03,0.06,0.09}\omega\in\{0.03,0.06,0.09\}, and the outliers are generated in the same way as in Section 6.1.
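A minimal sketch of generating data from this skew normal mixture in R is given below, assuming the standard stochastic representation y = μ + ψ|z| + ε with z ∼ N(0, 1) and ε ∼ N_p(0, Σ) underlying the parametrization in (8); outliers would then be appended as in Section 6.1.

```r
library(MASS)

# Draws from the (restricted) skew normal via the stochastic representation.
rskewnorm <- function(n, mu, psi, Sigma) {
  z <- abs(rnorm(n))                               # half-normal latent variable
  e <- mvrnorm(n, rep(0, length(mu)), Sigma)       # Gaussian noise
  sweep(e + outer(z, psi), 2, mu, "+")             # mu + psi * |z| + e
}

gen_skew_mix <- function(n = 500, xi = 1, p = 2, prob = c(0.4, 0.6)) {
  mu  <- rbind(c( xi,  xi, rep(0, p - 2)), c(-xi, -xi, rep(0, p - 2)))
  psi <- rbind(c(  2,   2, rep(0, p - 2)), c( -2,  -2, rep(0, p - 2)))
  S <- diag(p); S[1:2, 1:2] <- matrix(c(1, 0.3, 0.3, 1), 2, 2)  # blockdiag(Sigma*, I)
  k <- sample(1:2, n, replace = TRUE, prob = prob)
  y <- matrix(0, n, p)
  for (j in 1:2) {
    nj <- sum(k == j)
    if (nj > 0) y[k == j, ] <- rskewnorm(nj, mu[j, ], psi[j, ], S)
  }
  list(y = y, z = k)
}
```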

First, we show some results based on a single simulated dataset with p=2p=2. The proposed weighted complete estimating equation (WCE) method is applied to fit the skew normal mixture with γ=0.2\gamma=0.2. The standard skew normal mixture (SNM) is estimated using the maximum likelihood method with the EM algorithm of Lin et al. (2007). Figure 1 presents the contours of the true and estimated densities. The figure shows that the proposed WCE method provides reasonable estimates of the true density by adequately suppressing the influence of outliers through the density weights. On the other hand, SNM produces inaccurate estimates. In particular, it overestimates the variability of the true data-generating process by treating the outliers as non-outliers. It is also observed that the inaccuracy of SNM becomes more pronounced as the proportion of outliers increases.

Next, the performance of the proposed method and the existing methods is quantitatively compared. Two settings for the proposed WCE method, γ=0.2\gamma=0.2 and γ=0.3\gamma=0.3, are considered, denoted as WCE1 and WCE2, respectively. As contenders, we consider SNM and the skew-tt mixture (Lee and McLachlan, 2016) implemented in the R package ‘EMMIXcskew’ (Lee and McLachlan, 2017), denoted by STM. Note that STM is also robust against outliers because of the heavy tails of the skew-tt distribution, but the interpretations of the skewness parameters and scale matrices are different from those for WCE and SNM. Thus, the comparison with STM is made only with respect to the mean (location) parameters, mixing proportions, and classification accuracy. We considered two cases of pp, p=2p=2 and p=5p=5, but the computing time for STM under p=5p=5 is too long to include STM as a contender in our study, so the results of STM are reported only for p=2p=2. To evaluate the estimation performance, we employ the squared errors: k{1,2}𝝁^k𝝁k2\sum_{k\in\{1,2\}}\|\widehat{{\text{\boldmath$\mu$}}}_{k}-{\text{\boldmath$\mu$}}_{k}\|^{2}, k{1,2}𝝍^k𝝍k2\sum_{k\in\{1,2\}}\|\widehat{{\text{\boldmath$\psi$}}}_{k}-{\text{\boldmath$\psi$}}_{k}\|^{2}, k{1,2}vec(𝚺^k)vec(𝚺k)2\sum_{k\in\{1,2\}}\|{\rm vec}(\widehat{{\text{\boldmath$\Sigma$}}}_{k})-{\rm vec}({\text{\boldmath$\Sigma$}}_{k})\|^{2} and k{1,2}(π^kπk)2\sum_{k\in\{1,2\}}(\widehat{\pi}_{k}-\pi_{k})^{2}. Based on 1000 replications, the mean squared error (MSE) of the model parameters is obtained. We also computed the mean integrated squared error (MISE) of the density estimation (11) and the mean classification error (MCE) in the same way as in Section 6.1. The results are reported in Tables 7 and 8. The tables show that the standard SNM is severely affected by the outliers, resulting in large MSE values, as in the case of the normal mixture. In the present setting, the MCE of SNM in the case of overlapping clusters is particularly large. Comparing the proposed method and STM for p=2p=2, the proposed method resulted in a smaller MSE for 𝝁\mu and 𝝅\pi than STM, while the MCE values appear comparable. Overall, the proposed method also seems to work reasonably well in the skew normal mixture case, which suggests that our method is a useful tool applicable to a wide range of mixture models.

Figure 1: The contour plots of the true density function (blue), and estimated density functions based on WCE with γ=0.2\gamma=0.2 (red) and the standard maximum likelihood method (green) applied to simulated datasets with the 6 combinations of ξ{1,2}\xi\in\{1,2\} and ω{0.03,0.06,0.09}\omega\in\{0.03,0.06,0.09\}.
Table 7: Performance measures for the two proposed methods (WCE1 and WCE2), standard skew normal (SNM) mixture and skew-tt mixture (STM) under p=2p=2 and n=500n=500. The MSE values for 𝝅\pi are multiplied by 100.
p=2p=2 MSE MISE MCE
𝝁\mu 𝚺\Sigma 𝝍\psi 𝝅\pi (×105\times 10^{5}) (%)
(ξ=1,ω=0.03\xi=1,\ \omega=0.03)
WCE1 1.22 8.59 2.15 0.48 0.66 2.94
WCE2 1.43 10.97 2.76 0.43 0.61 2.83
SNM 25.48 29.00 38.70 21.61 2.42 24.47
STM 5.62 - - 1.93 - 4.12
(ξ=1,ω=0.06\xi=1,\ \omega=0.06)
WCE1 1.29 10.64 2.33 0.38 0.72 3.66
WCE2 1.70 12.75 3.30 0.32 0.57 3.62
SNM 46.21 61.28 75.54 39.67 3.22 47.80
STM 7.87 - - 3.88 - 7.02
(ξ=1,ω=0.09\xi=1,\ \omega=0.09)
WCE1 2.11 16.40 4.54 1.60 1.06 6.36
WCE2 2.17 15.08 4.10 0.48 0.63 4.51
SNM 45.16 80.68 76.50 36.66 3.62 49.63
STM 11.20 - - 7.82 - 14.83
(ξ=2,ω=0.03\xi=2,\ \omega=0.03)
WCE1 0.69 2.14 1.21 0.10 0.31 0.66
WCE2 0.71 2.47 1.33 0.10 0.32 0.65
SNM 1.19 8.98 2.28 0.12 2.11 0.18
STM 5.95 - - 0.87 - 0.91
(ξ=2,ω=0.06\xi=2,\ \omega=0.06)
WCE1 0.69 2.48 1.19 0.09 0.40 1.40
WCE2 0.86 3.22 1.59 0.09 0.38 1.32
SNM 1.25 25.51 2.85 0.22 3.28 0.33
STM 7.85 - - 0.89 - 1.07
(ξ=2,ω=0.09\xi=2,\ \omega=0.09)
WCE1 0.86 3.24 1.47 0.11 0.58 2.28
WCE2 1.12 4.09 2.01 0.10 0.48 2.03
SNM 1.28 44.31 3.21 0.46 3.94 0.47
STM 8.94 - - 1.34 - 1.47
Table 8: Performance measures for the two proposed methods (WCE1 and WCE2), standard skew normal (SNM) mixture and skew-tt mixture (STM) under p=5p=5 and n=500n=500. The MSE values for 𝝅\pi are multiplied by 100.
p=5p=5 MSE MISE MCE
𝝁\mu 𝚺\Sigma 𝝍\psi 𝝅\pi (×1010\times 10^{10}) (%)
(ξ=1,ω=0.03\xi=1,\ \omega=0.03)
WCE1 1.74 0.39 4.16 0.31 0.29 2.55
WCE2 1.83 0.40 4.28 0.32 0.32 2.65
SNM 2.22 6.66 5.40 1.48 1.23 4.12
(ξ=1,ω=0.06\xi=1,\ \omega=0.06)
WCE1 0.51 0.37 1.08 0.16 0.26 2.11
WCE2 0.56 0.38 1.11 0.16 0.28 2.10
SNM 1.08 20.72 2.57 0.81 1.85 3.02
(ξ=1,ω=0.09\xi=1,\ \omega=0.09)
WCE1 0.28 0.35 0.46 0.13 0.29 2.24
WCE2 0.31 0.37 0.46 0.13 0.29 2.23
SNM 1.72 36.35 3.49 1.20 2.15 3.07
(ξ=2,ω=0.03\xi=2,\ \omega=0.03)
WCE1 0.30 0.38 0.39 0.09 0.27 0.17
WCE2 0.34 0.38 0.44 0.10 0.28 0.17
SNM 0.44 6.51 0.80 0.11 1.15 0.19
(ξ=2,ω=0.06\xi=2,\ \omega=0.06)
WCE1 0.26 0.36 0.36 0.11 0.26 0.36
WCE2 0.29 0.38 0.38 0.11 0.28 0.35
SNM 1.52 23.78 3.24 0.25 1.93 0.38
(ξ=2,ω=0.09\xi=2,\ \omega=0.09)
WCE1 0.23 0.35 0.34 0.10 0.28 0.58
WCE2 0.30 0.37 0.41 0.10 0.29 0.55
SNM 3.06 40.12 6.80 0.52 2.21 0.53

Real data example

Tone perception data

The proposed robust mixture of experts modeling is applied to a famous tone perception dataset (Cohen, 1984). In the tone perception experiment, a pure fundamental tone, combined with electronically generated overtones determined by a stretching ratio, was played to a trained musician. The data consists of n=150n=150 pairs of the “stretch ratio” variable (denoted by YY) and the “tuned” variable (denoted by XX), and the conditional distribution of YY given X=xX=x is of interest. Following Chamroukhi (2016), we consider the two-component mixture of experts model given by

f(yi|xi)\displaystyle f(y_{i}|x_{i}) =g(xi;𝜼)ϕ(yi;β01+β11xi,σ12)\displaystyle=g(x_{i};{\text{\boldmath$\eta$}})\phi(y_{i};\beta_{01}+\beta_{11}x_{i},\sigma_{1}^{2})
+{1g(xi;𝜼)}ϕ(yi;β02+β12xi,σ22),\displaystyle+\{1-g(x_{i};{\text{\boldmath$\eta$}})\}\phi(y_{i};\beta_{02}+\beta_{12}x_{i},\sigma_{2}^{2}),

where g(xi;𝜼)=logit1(η0+η1xi)g(x_{i};{\text{\boldmath$\eta$}})={\rm logit}^{-1}(\eta_{0}+\eta_{1}x_{i}). The unknown parameters are estimated using two approaches: the standard maximum likelihood method via the EM algorithm, denoted by MOE, and the proposed WCE method with γ=0.3\gamma=0.3, denoted by RMOE. Based on the outlier detection rule in Section 2.3, we detected 13 outliers for α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}. The estimated regression lines, detected outliers, and classification results are shown in the upper panels of Figure 2. The figure shows that the estimated regression lines based on the two methods are slightly different, but both methods successfully capture the grouped structure of the dataset. It is also found that the estimate of η1\eta_{1} is almost equal to 0; thus, the mixing probability is almost constant in xix_{i}.
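For reference, the two-component mixture of experts density defined above can be written compactly in R as in the following minimal sketch (the argument names are ours).

```r
# Density of the two-component mixture of experts at (y, x);
# eta = (eta0, eta1), beta1 = (beta01, beta11), beta2 = (beta02, beta12).
moe_density <- function(y, x, eta, beta1, beta2, sigma1, sigma2) {
  g <- plogis(eta[1] + eta[2] * x)                       # gating probability g(x; eta)
  g * dnorm(y, beta1[1] + beta1[2] * x, sigma1) +        # expert 1
    (1 - g) * dnorm(y, beta2[1] + beta2[2] * x, sigma2)  # expert 2
}
```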

Next, we investigated the sensitivity of the model to outliers by artificially adding 10 identical pairs (0,4)(0,4) to the original dataset as outliers, as done in Chamroukhi (2016). The estimated regression lines obtained from the two methods are shown in the lower panels of Figure 2. The figure clearly shows that MOE is very sensitive to outliers. In contrast, the estimates of the proposed RMOE for the original and contaminated datasets are quite similar, indicating the desirable robustness properties of the proposed method. Different values of γ{0.3,0.4,0.5}\gamma\in\{0.3,0.4,0.5\} were also tested, but the results are not sensitive to γ\gamma. We also note that the result of the proposed method in Figure 2 is almost identical to that reported by Chamroukhi (2016), where the tt-distribution was used as a robust alternative. Finally, using the outlier detection rule, we identified 23 outliers, including the 10 artificial ones, for any α{0.001,0.005,0.01}\alpha\in\{0.001,0.005,0.01\}.

Figure 2: Estimated regression lines in the two clusters, detected outliers and classification results for the tone perception data based on the standard maximum likelihood (MOE) and robust (RMOE) method for the original tone data (upper) and contaminated data (lower).

Swiss banknote data

The proposed WCE method is applied to the famous Swiss banknote dataset (Flury and Riedwyl, 1988). The data consists of six measurements (p=6p=6) made on the 100 genuine and 100 counterfeit old Swiss 1000 franc bills. We apply the skew normal mixture model to the data using the proposed WCE and the standard EM algorithm.

We first select the number of clusters KK. To this end, we computed the trimmed BIC criterion (5) for K{1,2,3,4}K\in\{1,2,3,4\} and γ{0,0.05,,0.45,0.5}\gamma\in\{0,0.05,\ldots,0.45,0.5\}, setting α=0.001\alpha=0.001 (the threshold probability for outlier detection) and c=50c=50 (the eigen-ratio constraint). Note that we also carried out the same procedure with α{0.005,0.01}\alpha\in\{0.005,0.01\} and c{20,30}c\in\{20,30\}, but the results were almost identical. The results shown in Figure 3 indicate that K=2K=2 is a reasonable choice for the number of clusters, which is consistent with the two true labels (genuine and counterfeit). We also note that under the non-robust method (γ=0\gamma=0), the BIC values for K=2K=2 and K=3K=3 are almost identical.
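This monitoring step can be sketched as a simple grid search, where fit_wce below is a hypothetical stand-in for our WCE fitting routine returning the trimmed BIC of (5); minimization of the criterion is assumed in this sketch.

```r
# Grid of candidate (K, gamma) pairs; fit_wce is a hypothetical fitting function.
grid <- expand.grid(K = 1:4, gamma = seq(0, 0.5, by = 0.05))
grid$tBIC <- mapply(function(K, g)
  fit_wce(y, K = K, gamma = g, alpha = 0.001, eigen_c = 50)$tBIC,
  grid$K, grid$gamma)
grid[which.min(grid$tBIC), ]   # (K, gamma) optimizing the trimmed BIC
```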

Using γ=0.3\gamma=0.3 in the proposed WCE method, we fitted the skew normal mixture model with K=2K=2. The robust estimates and standard errors of the skewness parameters 𝝍k{\text{\boldmath$\psi$}}_{k} are listed in Table 9. The table shows significant skewness in two of the measurements (bottom and top), so the skew normal mixture model would be more suitable than a Gaussian mixture model for this dataset. By carrying out outlier detection with α=0.001\alpha=0.001, we identified 33 outliers, including 14 genuine and 19 counterfeit bills. The outliers and resulting cluster assignments are shown in Figure 4. The figure shows that the skew normal mixture can flexibly capture the cluster-wise distributions of the six measurements while successfully suppressing the influence of the outliers.

Finally, we assessed the classification accuracy of the proposed WCE method and the standard (non-robust) EM algorithm. Based on the fitted results, the cluster assignment for all observations, including outliers, is obtained as the maximizer of the posterior membership probability. The estimated assignments are then compared with the true labels (“genuine” or “counterfeit”). The proposed method classified all bills correctly, that is, there was no misclassification, whereas the standard EM algorithm misclassified 18 bills. This result is consistent with the simulation study and thus strongly supports the use of the robust mixture model for correctly classifying the observations into clusters.
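A minimal sketch of this final classification step is given below, where post denotes the n × K matrix of posterior membership probabilities from the fitted model and labels the true bill labels (both assumed objects).

```r
zhat <- max.col(post)   # cluster maximizing the posterior probability, for every bill
table(zhat, labels)     # compare with the true "genuine"/"counterfeit" labels
```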

Figure 3: Monitoring the trimmed BIC criterion for K{1,2,3,4}K\in\{1,2,3,4\} and γ\gamma under the skew normal mixture models applied to Swiss banknote data.
Figure 4: Cluster assignment and detected outliers by the proposed WCE method with the skew normal mixture models applied to the bank note data. The genuine bills are denoted by a green ×\times, counterfeit bills by a red \triangle, and outliers by a black filled circle.
Table 9: Robust estimates and standard errors of the skewness parameters in the skew normal mixture models applied to the Swiss banknote data.
Cluster 1 Cluster 2
Estimate SE Estimate SE
Length -0.02 0.17 -0.11 0.55
Left -0.09 0.05 0.09 0.26
Right -0.05 0.18 0.01 0.15
Bottom 0.82 0.08 1.33 0.24
Top -0.95 0.07 -1.00 0.23
Diagonal 0.14 0.11 -0.05 0.16

Concluding remarks

In this paper, we have introduced a new strategy for robust mixture modeling based on the weighted complete estimating equations. We then developed an EEE algorithm that iteratively solves the proposed estimating equations. The advantage of the proposed method is its computational feasibility because the proposed EEE algorithm admits relatively simple iterations. Applications of the proposed method to the three mixture models (multivariate Gaussian mixture, mixture of experts, and multivariate skew normal mixture) were considered, and feasible EEE algorithms to estimate these models were derived. In particular, for the skew normal mixture, we have slightly extended the proposed method and derived a novel EEE algorithm in which all parameters are updated in closed forms. The desirable performance of the proposed method is confirmed through a simulation study and real data examples.

In this work, we only considered the unstructured variance-covariance (scale) matrices of the Gaussian mixture and the skew normal mixture in Sections 3 and 5. However, it might be useful to consider a structured scale matrix characterized by some parameters, as done in the R package ‘mclust’ (Scrucca et al., 2016), or to select a suitable structure of the scale matrix via an information criterion (e.g., Cerioli et al., 2018). Furthermore, while γ\gamma is selected by monitoring the trimmed BIC defined in (5) in this paper, it might be possible to employ a selection criterion for robust divergences, as suggested by, for example, Basak et al. (2021) and Sugasawa and Yonekura (2021). Incorporating such approaches into the proposed method will be a valuable future study.

As briefly mentioned in Section 1, the proposed weighted complete estimating equations are closely related to the density power divergence (Basu et al., 1998), since the derivative of the density power divergence of the component-wise distribution given the latent membership variables leads to the weighted complete estimating equations for the component-wise parameters. Hence, other types of divergence can also be adopted to devise alternative estimating equations. However, there is no guarantee that a feasible EEE algorithm, as developed in the present study, can be obtained for other divergences. An investigation of other divergences and their comparison exceeds the scope of this study and is thus left for future research.

Acknowledgements

We would like to thank the three reviewers and the coordinating editor for their valuable comments and suggestions, which have significantly improved the paper. This work was supported by the Japan Society for the Promotion of Science (KAKENHI) Grant Numbers 18K12754, 18K12757, 20H00080, 21K01421, and 21H00699.

References

  • Azzalini and Valle (1996) Azzalini, A. and A. D. Valle (1996). The multivariate skew normal distribution. Biometrika 83, 715–725.
  • Bagnato et al. (2017) Bagnato, L., A. Punzo, and M. G. Zoia (2017). The multivariate leptokurtic-normal distribution and its application in model-based clustering. Canadian Journal of Statistics 45(1), 95–119.
  • Bai et al. (2012) Bai, X., W. Yao, and J. E. Boyer (2012). Robust fitting of mixture regression models. Computational Statistics & Data Analysis 56, 2347–2359.
  • Basak et al. (2021) Basak, S., A. Basu, and M. Jones (2021). On the ‘optimal’density power divergence tuning parameter. Journal of Applied Statistics 48(3), 536–556.
  • Basso et al. (2010) Basso, R. M., V. H. Lachos, C. R. B. Cabral, and P. Ghosh (2010). Robust mixture modeling based on scale mixtures of skew-normal distributions. Computational Statistics & Data Analysis 54(12), 2926–2941.
  • Basu et al. (1998) Basu, A., I. R. Harris, N. L. Hjort, and M. C. Jones (1998). Robust and efficient estimation by minimizing a density power divergence. Biometrika 85, 549–559.
  • Campbell (1984) Campbell, N. (1984). Mixture models and atypical values. Mathematical Geology 16, 465–467.
  • Cerioli and Farcomeni (2011) Cerioli, A. and A. Farcomeni (2011). Error rates for multivariate outlier detection. Computational Statistics & Data Analysis 55, 544–553.
  • Cerioli et al. (2018) Cerioli, A., L. A. Garcia-Escudero, A. Mayo-Iscar, and M. Riani (2018). Finding the number of normal groups in model-based clustering via constrained likelihoods. Journal of Computational and Graphical Statistics 27, 404–416.
  • Chamroukhi (2016) Chamroukhi, F. (2016). Robust mixture of experts modeling using the t distribution. Neural Networks 79, 20–36.
  • Cohen (1984) Cohen, E. A. (1984). Some effects of inharmonic partials on interval perception. Music Perception 1, 323–349.
  • Coretto and Hennig (2016) Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: Tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association 111(516), 1648–1659.
  • Coretto and Hennig (2019) Coretto, P. and C. Hennig (2019). otrimle: Robust model-based clustering. R package version 1.3 (https://CRAN.R-project.org/package=otrimle).
  • Elashoff and Ryan (2004) Elashoff, M. and L. Ryan (2004). An EM algorithm for estimating equations. Journal of Computational and Graphical Statistics 13, 48–65.
  • Farcomeni and Greco (2015) Farcomeni, A. and L. Greco (2015). S-estimation of hidden Markov models. Computational Statistics 30, 57–80.
  • Farcomeni and Punzo (2020) Farcomeni, A. and A. Punzo (2020). Robust model-based clustering with mild and gross outliers. Test 29(4), 989–1007.
  • Flury and Riedwyl (1988) Flury, B. and H. Riedwyl (1988). Multivariate Statistics, a Practical Approach. Cambridge University Press.
  • Forbes and Wraith (2014) Forbes, F. and D. Wraith (2014). A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering. Statistics and computing 24(6), 971–984.
  • Fritz et al. (2012) Fritz, H., L. Garcia-Escudero, and A. Mayo-Iscar (2012). tclust: An R package for a trimming approach to cluster analysis. Journal of Statistical Software 47, 1–26.
  • Fritz et al. (2013) Fritz, H., L. Garcia-Escudero, and A. Mayo-Iscar (2013). A fast algorithm for robust constrained clustering. Computational Statistics and Data Analysis 61, 124–136.
  • Frühwirth-Schnatter and Pyne (2010) Frühwirth-Schnatter, S. and S. Pyne (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics 11, 317–336.
  • Fujisawa and Eguchi (2006) Fujisawa, H. and S. Eguchi (2006). Robust estimation in the normal mixture model. Journal of Statistical Planning and Inference 136, 3989–4011.
  • Galimberti and Soffritti (2014) Galimberti, G. and G. Soffritti (2014). A multivariate linear regression analysis using finite mixtures of t distributions. Computational Statistics & Data Analysis 71, 138–150.
  • Garcia-Escudero et al. (2008) Garcia-Escudero, L., A. Gordaliza, C. Matran, and A. Mayo-Iscar (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics 36, 1324–1345.
  • Greco and Agostinelli (2020) Greco, L. and C. Agostinelli (2020). Weighted likelihood mixture modeling and model-based clustering. Statistics and Computing 30, 255–277.
  • Ingrassia et al. (2014) Ingrassia, S., S. C. Minotti, and A. Punzo (2014). Model-based clustering via linear cluster-weighted models. Computational Statistics & Data Analysis 71, 159–182.
  • Jacobs et al. (1991) Jacobs, R., M. Jordan, S. Nowlan, and G. Hinton (1991). Adaptive mixtures of local experts. Neural Computation 3, 79–87.
  • Lee and McLachlan (2016) Lee, S. and G. McLachlan (2016). Finite mixtures of canonical fundamental skew t-distributions: the unification of the restricted and unrestricted skew t-mixture models. Statistics and Computing 26, 573–589.
  • Lee and McLachlan (2017) Lee, S. and G. McLachlan (2017). EMMIXcskew: An R package for the fitting of a mixture of canonical fundamental skew t-distributions. Journal of Statistical Software 83, 1–32.
  • Lee and McLachlan (2014) Lee, S. and G. J. McLachlan (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing 24(2), 181–202.
  • Lin (2009) Lin, T. (2009). Maximum likelihood estimation for multivariate skew normal mixture models. Journal of Multivariate Analysis 100, 257–265.
  • Lin et al. (2007) Lin, T., J. Lee, and Y. Yen (2007). Finite mixture modelling using the skew normal distribution. Statistica Sinica 17, 909–927.
  • Lin (2010) Lin, T.-I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing 20(3), 343–356.
  • Mazza and Punzo (2020) Mazza, A. and A. Punzo (2020). Mixtures of multivariate contaminated normal regression models. Statistical Papers 61(2), 787–822.
  • McLachlan and Peel (2000) McLachlan, G. and D. Peel (2000). Finite mixture models. Wiley, New York.
  • Morris et al. (2019) Morris, K., A. Punzo, P. D. McNicholas, and R. P. Browne (2019). Asymmetric clusters and outliers: Mixtures of multivariate contaminated shifted asymmetric laplace distributions. Computational Statistics & Data Analysis 132, 145–166.
  • Nguyen and McLachlan (2016) Nguyen, H. D. and G. J. McLachlan (2016). Laplace mixture of linear experts. Computational Statistics & Data Analysis 93, 177–191.
  • Peel and McLachlan (2004) Peel, D. and G. J. McLachlan (2004). Robust mixture modelling using the tt distribution. Statistics and Computing 10, 339–348.
  • Punzo et al. (2018) Punzo, A., A. Mazza, and P. D. McNicholas (2018). ContaminatedMixt: An R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. Journal of Statistical Software 85, 1–25.
  • Punzo et al. (2017) Punzo, A., P. McNicholas, et al. (2017). Robust clustering in regression analysis via the contaminated Gaussian cluster-weighted model. Journal of Classification 34(2), 249–293.
  • Punzo and McNicholas (2016) Punzo, A. and P. D. McNicholas (2016). Parsimonious mixtures of multivariate contaminated normal distributions. Biometrical Journal 58, 1506–1537.
  • Punzo and Tortora (2021) Punzo, A. and C. Tortora (2021). Multiple scaled contaminated normal distribution and its application in clustering. Statistical Modelling 21(4), 332–358.
  • Scrucca et al. (2016) Scrucca, L., M. Fop, T. B. Murphy, and A. E. Raftery (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8, 289–317.
  • Song et al. (2014) Song, W., W. Yao, and Y. Xing (2014). Robust mixture regression model fitting by laplace distribution. Computational Statistics & Data Analysis 71, 128–137.
  • Sugasawa and Yonekura (2021) Sugasawa, S. and S. Yonekura (2021). On selection criteria for the tuning parameter in robust divergence. Entropy 23(9), 1147.
  • Sun et al. (2010) Sun, J., A. Kabán, and J. M. Garibaldi (2010). Robust mixture clustering using Pearson type VII distribution. Pattern Recognition Letters 31(16), 2447–2454.
  • Tang and Qu (2016) Tang, X. and A. Qu (2016). Mixture modeling for longitudinal data. Journal of Computational and Graphical Statistics 25, 1117–1137.
  • Venables and Ripley (2002) Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (Fourth ed.). New York: Springer.
  • Wang et al. (2009) Wang, K., S.-K. Ng, and G. J. McLachlan (2009). Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In 2009 Digital Image Computing: Techniques and Applications, pp.  526–531. IEEE.
  • Yao et al. (2014) Yao, W., Y. Wei, and C. Yu (2014). Robust mixture regression using the t-distribution. Computational Statistics & Data Analysis 71, 116–127.
  • Zarei et al. (2019) Zarei, S., A. Mohammadpour, S. Ingrassia, and A. Punzo (2019). On the use of the sub-gaussian α\alpha-stable distribution in the cluster-weighted model. Iranian Journal of Science and Technology, Transactions A: Science 43(3), 1059–1069.
  • Zhang and Liang (2010) Zhang, J. and F. Liang (2010). Robust clustering using exponential power mixtures. Biometrics 66(4), 1078–1086.