
Tin Lok James Ng
School of Computer Science and Statistics, Trinity College Dublin, Ireland
Email: [email protected]

Penalized Maximum Likelihood Estimator for Mixture of von Mises-Fisher Distributions

Tin Lok James Ng
(Received: date / Revised: date)
Abstract

The von Mises-Fisher distribution is one of the most widely used probability distributions to describe directional data. Finite mixtures of von Mises-Fisher distributions have found numerous applications. However, the likelihood function for the finite mixture of von Mises-Fisher distributions is unbounded and consequently the maximum likelihood estimation is not well defined. To address the problem of likelihood degeneracy, we consider a penalized maximum likelihood approach whereby a penalty function is incorporated. We prove strong consistency of the resulting estimator. An Expectation-Maximization algorithm for the penalized likelihood function is developed and experiments are performed to examine its performance.

Keywords:
mixture of von Mises-Fisher distributions penalized maximum likelihood estimation strong consistency

1 Introduction

In many areas of statistical modelling, data are represented as directions or unit length vectors (Mardia, 1972; Jupp, 1995; Mardia and Jupp, 2000). The analysis of directional data has attracted much research interest in various disciplines, from hydrology (Chen et al., 2013) to biology (Boomsma et al., 2006), and from image analysis (Zhe et al., 2019) to text mining (Banerjee et al., 2005). The von Mises-Fisher (vMF) distribution is one of the most commonly used distributions to model data distributed on the surface of the unit hypersphere (Fisher, 1953; Mardia and Jupp, 2000). The vMF distribution has been applied successfully in many domains (e.g., Sinkkonen and Kaski, 2002; Mcgraw et al., 2006; Bangert et al., 2010).

A mixture of vMF distributions (Banerjee et al., 2005) assumes that each observation is drawn from one of p vMF distributions. Applications of the vMF mixture model are diverse, including image analysis (Mcgraw et al., 2006) and text mining (Banerjee et al., 2005). More recently, it has been shown that the vMF mixture model can approximate any continuous density function on the unit hypersphere to arbitrary accuracy given a sufficient number of mixture components (Ng and Kwong, 2021).

Various estimation strategies have been developed to perform model estimation, including the maximum likelihood approach (Banerjee et al., 2005) and Bayesian methods (Bagchi and Kadane, 1991; Taghia et al., 2014). The maximum likelihood approach, which is typically implemented using the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Banerjee et al., 2005), is among the most popular approaches to parameter estimation. However, as we show in Section 3, the likelihood function is unbounded from above, and consequently a global maximum likelihood estimate (MLE) fails to exist.

The unboundedness of the likelihood function occurs in various mixture modelling contexts, particularly for mixture models with location-scale family distributions, including the mixture of normal distributions (Ciuperca et al., 2003; Chen et al., 2008) and the mixture of Gamma distributions (Chen et al., 2016). Various approaches have been developed to tackle the likelihood degeneracy problem, including resorting to local estimates (Peters et al., 1978), compactification of the parameter space (Redner, 1981), and constrained maximization of the likelihood function (Hathaway, 1985).

An alternative solution to the likelihood degeneracy problem is to penalize the likelihood function so that the resulting penalized likelihood function is bounded and the existence of the penalized MLE is guaranteed. The penalized maximum likelihood approach has been applied to normal mixture models (Ciuperca et al., 2003; Chen et al., 2008) and to two-parameter Gamma mixture models (Chen et al., 2016). The penalized likelihood approach also has a Bayesian interpretation (Goodd and Gaskins, 1971; Ciuperca et al., 2003), whereby the penalized likelihood function corresponds to a posterior density and the penalized maximum likelihood solution to the maximum a posteriori estimate.

The penalized maximum likelihood approach was previously applied to the mixture of von Mises distributions (Chen et al., 2007), where consistency results were obtained. The von Mises distribution is a special case of the von Mises-Fisher distribution defined on the circle. We generalize the results in Chen et al. (2007) to spheres of arbitrary dimension. The consistency proof in Chen et al. (2007) relies heavily on the univariate properties of the von Mises distribution, and generalizing the result to higher dimensions is not straightforward. In this paper we prove a few useful technical lemmas before proving the main results. To handle the non-identifiability of mixture models, we use the framework of Redner (1981) to obtain consistency in the quotient space.

In this paper, we consider the penalized likelihood approach to tackle the problem of likelihood unboundedness for the mixture of vMF distributions. We incorporate a penalty term into the likelihood function and maximize the resulting penalized likelihood function. We study conditions on the penalty function that ensure consistency of the penalized maximum likelihood estimator (PMLE), and we develop an Expectation-Maximization algorithm to perform model estimation based on the penalized likelihood function. The rest of the paper is structured as follows. Section 2 introduces the background on vMF mixtures and the key notation used in the subsequent sections. The problem of likelihood degeneracy is formally presented in Section 3. Section 4 develops the penalized maximum likelihood approach and discusses conditions on the penalty function that ensure strong consistency of the resulting estimator. An Expectation-Maximization algorithm is also developed in Section 4, and its performance is examined in Section 5. Section 6 illustrates the proposed EM algorithm using a data application. We conclude the paper with a discussion section.

2 Background

The probability density function of a d-dimensional vMF distribution is given by:

f(\mathbf{x};\boldsymbol{\mu},\kappa)=c_{d}(\kappa)e^{\kappa\boldsymbol{\mu}^{T}\mathbf{x}}, (1)

where \mathbf{x}\in\mathbb{S}^{d-1} is a d-dimensional unit vector (i.e., \mathbb{S}^{d-1}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|=1\}, where \|\cdot\| is the L_{2} norm), \boldsymbol{\mu}\in\mathbb{S}^{d-1} is the mean direction, and \kappa\geq 0 is the concentration parameter. The normalizing constant c_{d}(\kappa) has the form

c_{d}(\kappa)=\frac{\kappa^{d/2-1}}{(2\pi)^{d/2}I_{d/2-1}(\kappa)},

where I_{r}(\cdot) is the modified Bessel function of the first kind of order r. The vMF distribution becomes increasingly concentrated at the mean direction \boldsymbol{\mu} as the concentration parameter \kappa increases. The case \kappa=0 corresponds to the uniform distribution on \mathbb{S}^{d-1}.
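For readers who wish to experiment with the density (1) numerically, the following is a minimal sketch in Python (our own illustration, not part of the paper); the function name vmf_log_density is ours, and SciPy's exponentially scaled Bessel function ive is used for numerical stability at large \kappa.

```python
import numpy as np
from scipy.special import ive, gammaln  # ive(r, z) = I_r(z) * exp(-z)

def vmf_log_density(x, mu, kappa):
    """Log of the vMF density in Eq. (1) for unit vectors x and mu in R^d."""
    d = len(mu)
    if kappa == 0:
        # kappa = 0 gives the uniform distribution on S^{d-1}
        return gammaln(d / 2) - np.log(2.0) - (d / 2) * np.log(np.pi)
    # log c_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa),
    # where log I_{d/2-1}(kappa) = log ive(d/2-1, kappa) + kappa
    log_cd = ((d / 2 - 1) * np.log(kappa)
              - (d / 2) * np.log(2 * np.pi)
              - (np.log(ive(d / 2 - 1, kappa)) + kappa))
    return log_cd + kappa * float(np.dot(mu, x))
```

For example, vmf_log_density(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), 10.0) evaluates the log-density at the mode for d = 3.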

The probability density function of a mixture of vMF distributions with p mixture components can be expressed as

g(\mathbf{x};\{\pi_{k},\boldsymbol{\mu}_{k},\kappa_{k}\}_{k=1}^{p})=\sum_{k=1}^{p}\pi_{k}f(\mathbf{x};\boldsymbol{\mu}_{k},\kappa_{k}), (2)

where (\pi_{1},\ldots,\pi_{p}) are the mixing proportions, (\boldsymbol{\mu}_{k},\kappa_{k}) are the parameters of the kth component of the mixture, and f(\cdot;\boldsymbol{\mu}_{k},\kappa_{k}) is the vMF density function defined in (1).

We let \Theta:=\{\theta\equiv(\boldsymbol{\mu},\kappa):\boldsymbol{\mu}\in\mathbb{S}^{d-1},\kappa\geq 0\} be the parameter space of the vMF distribution, with the metric \rho(\cdot,\cdot) defined as

\rho(\theta_{1},\theta_{2})=\arccos(\boldsymbol{\mu}_{1}^{T}\boldsymbol{\mu}_{2})+|\kappa_{1}-\kappa_{2}|, (3)

for \theta_{1}=(\boldsymbol{\mu}_{1},\kappa_{1}) and \theta_{2}=(\boldsymbol{\mu}_{2},\kappa_{2}). For any \theta=(\boldsymbol{\mu},\kappa)\in\Theta, we write f_{\theta}(\cdot):=f(\cdot;\boldsymbol{\mu},\kappa) for the density function and \gamma_{\theta} for the corresponding measure. The space of mixing probabilities is denoted by \Pi:=\{(\pi_{1},\ldots,\pi_{p}):\sum_{k=1}^{p}\pi_{k}=1,\ \pi_{k}\geq 0,\ k=1,\ldots,p\}. A p-component mixture of vMF distributions can be expressed as \gamma=\sum_{k=1}^{p}\pi_{k}\gamma_{\theta_{k}}, where (\pi_{1},\ldots,\pi_{p})\in\Pi and (\theta_{1},\ldots,\theta_{p})\in\Theta^{p}, and where \Theta^{p}=\Theta\times\cdots\times\Theta is the product of the parameter spaces. We define the product space \Gamma:=\Pi\times\Theta^{p}, and we slightly abuse notation by letting \gamma denote both the mixing measure and the parameters in \Gamma. While \Gamma is a natural parameterization of the family of mixtures of vMF distributions, elements of \Gamma are not identifiable. Thus, we let \tilde{\Gamma} be the quotient topological space obtained from \Gamma by identifying all parameters (\pi_{1},\ldots,\pi_{p},\theta_{1},\ldots,\theta_{p}) whose corresponding densities are equal almost everywhere. For the rest of the paper, we assume that the number of mixture components p is known.

3 Likelihood Degeneracy

We investigate the likelihood degeneracy problem of the vMF mixture model in this section. For any observations generated from a vMF mixture model with two or more mixture components, we show that the resulting likelihood function on the parameter space \Gamma is unbounded above. As discussed in Section 1, likelihood degeneracy is a common problem for mixture models with location-scale distributions, including normal mixtures. In the case of normal mixture distributions, one can show that by setting the mean parameter of a mixture component equal to one of the observations and letting the variance of the same mixture component converge to zero while holding the other parameters fixed, the likelihood function diverges to positive infinity (Chen et al., 2008).

For vMF mixture distributions, the likelihood unboundedness is best understood in the special case \mathbf{x}\in\mathbb{S}^{1}, that is, the mixture of von Mises distributions. The von Mises distribution, also known as the circular normal distribution, approaches a normal distribution for a large concentration parameter \kappa:

f(x\,|\,\mu,\kappa)\approx\frac{1}{\sigma\sqrt{2\pi}}\exp\bigg[\frac{-(x-\mu)^{2}}{2\sigma^{2}}\bigg],

with \sigma^{2}=1/\kappa, and the approximation converges uniformly as \kappa goes to infinity. Therefore, the likelihood function of a mixture of von Mises distributions can be made to diverge to infinity by setting the mean parameter of a mixture component equal to one of the observations and letting the concentration parameter diverge to infinity.
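A quick numerical check of this approximation (our own illustration; the values of \kappa and x are arbitrary) can be carried out with SciPy:

```python
import numpy as np
from scipy.stats import vonmises, norm

kappa = 200.0  # a large concentration parameter
x = 0.05       # a point near the mean direction mu = 0
vm = vonmises.pdf(x, kappa)                 # von Mises density with mean 0
nm = norm.pdf(x, scale=1 / np.sqrt(kappa))  # normal density with sigma^2 = 1/kappa
print(vm, nm)  # the two values agree closely for large kappa
```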

We now consider the general case of the vMF mixture models. Let {\cal X}=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\} be observations generated from a mixture of vMF distributions with density function \sum_{k=1}^{p}\pi_{k}f_{\theta_{k}}(\cdot), where \theta_{k}=(\boldsymbol{\mu}_{k},\kappa_{k}). The likelihood function can be expressed as:

L({\cal X};\boldsymbol{\theta},\boldsymbol{\pi})=\prod_{i=1}^{n}\sum_{k=1}^{p}\pi_{k}f(\mathbf{x}_{i};\boldsymbol{\mu}_{k},\kappa_{k}),

where \boldsymbol{\theta}=(\theta_{1},\ldots,\theta_{p})=((\boldsymbol{\mu}_{1},\kappa_{1}),\ldots,(\boldsymbol{\mu}_{p},\kappa_{p})) and \boldsymbol{\pi}=(\pi_{1},\ldots,\pi_{p}). We can show that by setting the mean direction \boldsymbol{\mu}_{k} of one of the mixture components equal to an arbitrary observation and letting the corresponding concentration parameter \kappa_{k} go to infinity, the resulting likelihood function diverges.

Theorem 3.1.

For any observations {\cal X}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{n}), there exists a sequence (\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)}), q=1,2,\ldots, such that L({\cal X};\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)})\uparrow\infty as q\rightarrow\infty.

The proof of Theorem 3.1 is provided in the Appendix. The unboundedness of the likelihood function on the parameter space implies that the maximum likelihood estimator is not well defined.
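The construction behind Theorem 3.1 is easy to reproduce numerically. The sketch below (our own illustration, reusing vmf_log_density and mixture_log_density from Section 2) fixes the mean direction of one component at an observation and lets the corresponding concentration parameter grow; the log-likelihood then increases without bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# a small sample on S^2, drawn uniformly purely for illustration
n, d = 20, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

pis = [0.5, 0.5]
mus = [X[0], -X[0]]   # first component centred at the observation x_1
kappa2 = 1.0          # second component held fixed

for q in [1e1, 1e2, 1e3, 1e4]:
    kappas = [q, kappa2]  # let kappa_1 grow
    loglik = sum(mixture_log_density(x, pis, mus, kappas) for x in X)
    print(q, loglik)      # the log-likelihood keeps increasing with q
```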

4 Penalized Maximum Likelihood Estimation

4.1 Preliminary

Let \gamma_{0}\in\Gamma be the true mixing measure for the mixture of vMF distributions, with corresponding density function f_{0} on \mathbb{S}^{d-1}. We let M be the maximum of the true density f_{0}:

M:=\max_{\mathbf{x}\in\mathbb{S}^{d-1}}f_{0}(\mathbf{x}), (4)

and define the metric d(\mathbf{x},\mathbf{y})=\arccos(\mathbf{x}^{T}\mathbf{y}) on \mathbb{S}^{d-1} as the angle between two unit vectors \mathbf{x},\mathbf{y}\in\mathbb{S}^{d-1}. For any fixed \mathbf{x}\in\mathbb{S}^{d-1} and positive number \epsilon, the \epsilon-ball in \mathbb{S}^{d-1} centered at \mathbf{x} is defined as B_{\epsilon}(\mathbf{x})=\{\mathbf{y}\in\mathbb{S}^{d-1}:d(\mathbf{x},\mathbf{y})<\epsilon\}. For any measurable set B\subset\mathbb{S}^{d-1}, the spherical measure of B is given by \omega(B):=\int_{B}d\omega, where d\omega is the standard surface measure on \mathbb{S}^{d-1}.

For any \mathbf{x}\in\mathbb{S}^{d-1} and small positive number \epsilon, the measure of the ball B_{2\epsilon}(\mathbf{x}) in \mathbb{S}^{d-1} is given by (Li, 2011)

\omega(B_{2\epsilon}(\mathbf{x}))=\frac{2\pi^{(d-1)/2}}{\Gamma(\frac{d-1}{2})}\int_{0}^{2\epsilon}\sin^{d-2}(\theta)\,d\theta\leq 2^{d-1}\frac{2\pi^{(d-1)/2}}{\Gamma(\frac{d-1}{2})}\epsilon^{d-1}=A_{2}\epsilon^{d-1}, (5)-(6)

where

A_{2}=2^{d-1}\frac{2\pi^{(d-1)/2}}{\Gamma(\frac{d-1}{2})}. (7)

We define the function \delta(\cdot) by

\delta(\epsilon):=MA_{2}\epsilon^{d-1}, (8)

where the constants M and A_{2} are defined in Equations (4) and (7), respectively. The function \delta(\cdot) plays a crucial role in Lemmas 1 and 2, which are analogous to Lemmas 1 and 2 of Chen et al. (2008). They provide (almost sure) upper bounds on the number of observations in a small \epsilon-ball in \mathbb{S}^{d-1}. The upper bound in Lemma 1 holds for each fixed \epsilon in an interval, whereas the upper bound in Lemma 2 holds uniformly for all \epsilon in the same interval. The proof of Lemma 1 is given in the Appendix. The proof of Lemma 2 is similar to that of Lemma 2 in Chen et al. (2008) and is omitted. Lemmas 1 and 2 are crucial to ensuring consistency of the penalized maximum likelihood estimator.

We note that Lemmas 1 and 2 may be generalized by relaxing the assumption that the true density is a mixture of vMF densities. This is possible because the vMF assumption does not play a crucial role. Such a generalization has been obtained for normal mixtures (Chen, 2017, Lemma 3.2). However, this is not required for the proof of our main result.

Lemma 1.

For any sufficiently small positive number \xi_{0}, as n\rightarrow\infty, for each fixed \epsilon such that

\frac{\log n}{MnA_{2}}\leq\epsilon^{d-1}<\xi_{0},

the following inequality holds except for a zero probability event:

\sup_{\boldsymbol{\mu}\in\mathbb{S}^{d-1}}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{\epsilon}(\boldsymbol{\mu})\big)\bigg\}\leq 2\delta(\epsilon). (9)

Uniformly for all \epsilon such that

0<\epsilon^{d-1}<\frac{\log n}{MnA_{2}},

the following inequality holds except for a zero probability event:

\sup_{\boldsymbol{\mu}\in\mathbb{S}^{d-1}}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{\epsilon}(\boldsymbol{\mu})\big)\bigg\}\leq 2\frac{(\log n)^{2}}{n}. (10)
Lemma 2.

For any sufficiently small positive number \xi_{0}, as n\rightarrow\infty, uniformly for all \epsilon such that

\frac{\log n}{MnA_{2}}\leq\epsilon^{d-1}<\xi_{0},

the following inequality holds except for a zero probability event:

\sup_{\boldsymbol{\mu}\in\mathbb{S}^{d-1}}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{\epsilon}(\boldsymbol{\mu})\big)\bigg\}\leq 4\delta(\epsilon). (11)

Uniformly for all \epsilon such that

0<\epsilon^{d-1}<\frac{\log n}{MnA_{2}},

the following inequality holds except for a zero probability event:

\sup_{\boldsymbol{\mu}\in\mathbb{S}^{d-1}}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{\epsilon}(\boldsymbol{\mu})\big)\bigg\}\leq 2\frac{(\log n)^{2}}{n}. (12)

4.2 Penalized Maximum Likelihood Estimator

For any mixing measure of a p-component mixture \gamma=\sum_{l=1}^{p}\pi_{l}\gamma_{\theta_{l}} in \Gamma, and n i.i.d. observations {\cal X}, the penalized log-likelihood function is defined as

pl_{n}(\gamma)=l_{n}(\gamma)+p_{n}(\boldsymbol{\kappa}), (13)

where l_{n}(\gamma) is the log-likelihood function:

l_{n}(\gamma)=\sum_{i=1}^{n}\log\bigg\{\sum_{k=1}^{p}\pi_{k}f(\mathbf{x}_{i};\boldsymbol{\mu}_{k},\kappa_{k})\bigg\},

and p_{n}(\cdot) is a penalty function that depends on \boldsymbol{\kappa}=(\kappa_{1},\ldots,\kappa_{p}). Note that we slightly abuse notation by letting p_{n}(\cdot) denote the penalty function while p denotes the number of mixture components. We impose the following conditions on the penalty function p_{n}(\cdot).

  1. C1

    p_{n}(\boldsymbol{\kappa})=\sum_{l=1}^{p}\tilde{p}_{n}(\kappa_{l}),

  2. C2

    For l=1,\ldots,p, \sup_{\kappa_{l}>0}\max\{0,\tilde{p}_{n}(\kappa_{l})\}=o(n) and \tilde{p}_{n}(\kappa_{l})=o(n) for each fixed \kappa_{l}\geq 0,

  3. C3

    For l=1,\ldots,p, and for

    0<\frac{1}{(\log\kappa_{l})^{2d-2}}\leq\frac{\log n}{MnA_{2}},

    \tilde{p}_{n}(\kappa_{l})\leq-3(\log n)^{2}\log\kappa_{l} for large enough n.

Conditions C1-C3 on the penalty function are analogous to the three conditions proposed in Chen et al. (2008). Condition C1 assumes that the penalty function is of additive form. Condition C2 ensures that the penalty is not overly strong, while condition C3 allows the penalty to be severe when the concentration parameter is very large. Recall the true mixing measure \gamma_{0}\in\Gamma, and let \hat{\gamma} denote the maximizer of the penalized log-likelihood function defined in Equation (13). We have the following main result of this paper, demonstrating that the maximizer of the penalized log-likelihood function is strongly consistent.
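To make conditions C1-C3 concrete, the sketch below (our own, hedged) evaluates the penalized log-likelihood (13) with the additive penalty \tilde{p}_{n}(\kappa_{l})=-\psi_{n}\kappa_{l} and \psi_{n}=\zeta/n used later in Section 4.3, reusing mixture_log_density from Section 2:

```python
def penalized_log_likelihood(X, pis, mus, kappas, zeta=1.0):
    """Penalized log-likelihood (13) with penalty p_n(kappa) = -psi_n * sum_l kappa_l,
    where psi_n = zeta / n (an additive penalty as required by condition C1)."""
    n = len(X)
    psi_n = zeta / n
    loglik = sum(mixture_log_density(x, pis, mus, kappas) for x in X)
    return loglik - psi_n * sum(kappas)
```

Since \psi_{n}\propto n^{-1}, the penalty vanishes at any fixed \kappa, in line with condition C2.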

Theorem 4.1.

Let \hat{\gamma}_{n} be the maximizer of the penalized log-likelihood pl_{n}(\gamma). Then \hat{\gamma}_{n}\rightarrow\gamma_{0} almost surely in the quotient topological space \tilde{\Gamma}.

4.3 EM Algorithm

We develop an Expectation-Maximization algorithm to maximize the penalized log-likelihood function defined in Equation (13). By condition C1, the penalty function is assumed to have the form p_{n}(\boldsymbol{\kappa})=\sum_{l=1}^{p}\tilde{p}_{n}(\kappa_{l}). We take \tilde{p}_{n}(\kappa_{l}) to be of the form \tilde{p}_{n}(\kappa_{l})=-\psi_{n}\kappa_{l} for all l, where the constant \psi_{n}\propto n^{-1} depends on the sample size n. In particular, we may set \psi_{n}=\zeta/n for some constant \zeta>0, or \psi_{n}=S_{x}/n where S_{x} is the sample circular variance.

The resulting penalty function clearly satisfies condition C2. We note that condition C3 is also satisfied since for

0<\frac{1}{(\log\kappa_{l})^{2d-2}}\leq\frac{\log n}{MnA_{2}},

we have

\kappa_{l}\approx\exp\big((n/\log n)^{1/(2d-2)}\big).

The EM algorithm developed in Banerjee et al. (2005) can easily be modified to incorporate the additional penalty function. The E-step of the penalized EM algorithm involves computing the conditional probabilities:

p(Z_{i}=h|\mathbf{x}_{i},\boldsymbol{\theta})=\frac{\pi_{h}f(\mathbf{x}_{i};\theta_{h})}{\sum_{l=1}^{p}\pi_{l}f(\mathbf{x}_{i};\theta_{l})},\quad h=1,\ldots,p, (14)

where Z_{i} is the latent variable denoting the cluster membership of the ith observation. For the M-step, using the method of Lagrange multipliers, we optimize the full conditional penalized log-likelihood function below

\sum_{l=1}^{p}\bigg[\sum_{i=1}^{n}(\log(\pi_{l})+\log(c_{d}(\kappa_{l})))p(Z_{i}=l|\mathbf{x}_{i},\boldsymbol{\theta})+\sum_{i=1}^{n}\kappa_{l}\boldsymbol{\mu}_{l}^{T}\mathbf{x}_{i}\,p(Z_{i}=l|\mathbf{x}_{i},\boldsymbol{\theta})-\psi_{n}\kappa_{l}+\lambda_{l}(1-\boldsymbol{\mu}_{l}^{T}\boldsymbol{\mu}_{l})\bigg]

with respect to \boldsymbol{\mu}_{h},\kappa_{h},\pi_{h} for h=1,\ldots,p, which gives:

\hat{\pi}_{h}=\frac{1}{n}\sum_{i=1}^{n}p(Z_{i}=h|\mathbf{x}_{i},\boldsymbol{\theta}), (15)
\hat{\boldsymbol{\mu}}_{h}=\frac{r_{h}}{\|r_{h}\|}, (16)
\frac{I_{d/2}(\hat{\kappa}_{h})}{I_{d/2-1}(\hat{\kappa}_{h})}=\frac{-\psi_{n}+\|r_{h}\|}{\sum_{i=1}^{n}p(Z_{i}=h|\mathbf{x}_{i},\boldsymbol{\theta})}, (17)

where r_{h}=\sum_{i=1}^{n}\mathbf{x}_{i}p(Z_{i}=h|\mathbf{x}_{i},\boldsymbol{\theta}). We note that the assumption on \psi_{n} implies that -\psi_{n}+\|r_{h}\|\geq 0 almost surely as n\rightarrow\infty. However, for a finite sample size, there is a non-zero probability that -\psi_{n}+\|r_{h}\|<0, in which case the updating equation for \kappa_{h} is not well defined since the left hand side of Equation (17) is non-negative. However, the left hand side of Equation (17) is a strictly increasing function from [0,\infty) to [0,1) (Schou, 1978; Hornik and Grün, 2014), and in particular \hat{\kappa}_{h}=0 whenever

\frac{I_{d/2}(\hat{\kappa}_{h})}{I_{d/2-1}(\hat{\kappa}_{h})}=0.

Thus, we can simply set \hat{\kappa}_{h}=0 whenever -\psi_{n}+\|r_{h}\|<0. To solve Equation (17) for \hat{\kappa}_{h}, various approximations have been proposed (Banerjee et al., 2005; Tanabe et al., 2007; Song et al., 2012). Section 2.2 of Hornik and Grün (2014) contains a detailed review of the available approximations. We consider the approximation used in Banerjee et al. (2005):

\hat{\kappa}_{h}\approx\frac{\rho_{h}(d-\rho_{h}^{2})}{1-\rho_{h}^{2}},

with

\rho_{h}=\frac{-\psi_{n}+\|r_{h}\|}{\sum_{i=1}^{n}p(Z_{i}=h|\mathbf{x}_{i},\boldsymbol{\theta})}.

We initialize the EM algorithm by randomly assigning the observations to mixture components, and the algorithm is terminated when the change in the penalized log-likelihood falls below a small threshold, which is set to 10^{-5} in the experiments.
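The following is a compact sketch of the penalized EM iteration described above (a hedged implementation of Equations (14)-(17) with the Banerjee et al. (2005) approximation; the function and variable names, the initialization details, and the default settings are our own choices):

```python
import numpy as np
from scipy.special import ive, gammaln, logsumexp

def log_vmf(X, mu, kappa):
    """Row-wise vMF log-density, Eq. (1), for an (n, d) array X of unit vectors."""
    d = X.shape[1]
    if kappa <= 0:
        return np.full(X.shape[0], gammaln(d / 2) - np.log(2.0) - (d / 2) * np.log(np.pi))
    log_cd = ((d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi)
              - (np.log(ive(d / 2 - 1, kappa)) + kappa))
    return log_cd + kappa * (X @ mu)

def penalized_em(X, p, psi_n, n_iter=500, tol=1e-5, seed=0):
    """Penalized EM for a p-component vMF mixture with penalty -psi_n * kappa_l."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(p, 1.0 / p)
    mus = X[rng.choice(n, size=p, replace=False)].copy()  # initial mean directions
    kappas = np.ones(p)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step, Eq. (14): posterior membership probabilities
        logp = np.column_stack([np.log(pis[h]) + log_vmf(X, mus[h], kappas[h])
                                for h in range(p)])
        log_norm = logsumexp(logp, axis=1)
        resp = np.exp(logp - log_norm[:, None])
        # penalized log-likelihood (13); stop when its increase falls below tol
        pll = log_norm.sum() - psi_n * kappas.sum()
        if pll - prev < tol:
            break
        prev = pll
        # M-step, Eqs. (15)-(17)
        for h in range(p):
            nh = resp[:, h].sum()
            rh = X.T @ resp[:, h]
            pis[h] = nh / n
            mus[h] = rh / np.linalg.norm(rh)
            num = -psi_n + np.linalg.norm(rh)
            if num <= 0:
                kappas[h] = 0.0  # see the discussion below Eq. (17)
            else:
                rho_h = num / nh
                # Banerjee et al. (2005) approximation to invert Eq. (17)
                kappas[h] = rho_h * (d - rho_h ** 2) / (1 - rho_h ** 2)
    return pis, mus, kappas
```

A call such as penalized_em(X, p=2, psi_n=1.0/len(X)) then mirrors the setting used in the simulation studies of Section 5.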

5 Simulation Studies

We perform simulation studies to investigate the performance of the proposed EM algorithm for maximizing the penalized likelihood function. We generate data from mixtures of vMF distributions with two and three mixture components and with dimensions d=2,3,4. For each model, data are generated with increasing sample sizes to assess the convergence of the estimated parameters toward the true parameters. The concentration parameters \boldsymbol{\kappa} and the mixing proportions \boldsymbol{\pi} are pre-specified, whereas the mean directions \boldsymbol{\mu} are drawn from the uniform distribution on the surface of the unit hypersphere.

For the model with two mixture components, we specify the mixing proportions as \boldsymbol{\pi}=(0.5,0.5) and the concentration parameters as \boldsymbol{\kappa}=(10,1). For the model with three mixture components, we set \boldsymbol{\pi}=(0.4,0.3,0.3) and \boldsymbol{\kappa}=(10,5,1). For illustrative purposes, we consider the penalty function \tilde{p}_{n}(\kappa_{l})=-(1/n)\kappa_{l}. For each combination of dimension d and sample size n, we simulate 500 random samples from the model, and the EM algorithm developed in Section 4.3 is used to obtain the parameter estimates. We measure the distance between the estimated parameters and the true parameters for each random sample. For the mean direction parameters \boldsymbol{\mu}, the distance is measured using the metric d(\mathbf{x},\mathbf{y})=\arccos(\mathbf{x}^{T}\mathbf{y}).
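For completeness, a small sketch of two ingredients of this design (our own helper functions): drawing a mean direction uniformly on the unit hypersphere and computing the arccosine distance between a true and an estimated direction.

```python
import numpy as np

def random_direction(d, rng):
    """Draw a direction uniformly on S^{d-1} by normalizing a standard Gaussian vector."""
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def direction_distance(x, y):
    """Distance d(x, y) = arccos(x'y) used to compare true and estimated mean directions."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

rng = np.random.default_rng(2023)
mu_true = random_direction(3, rng)  # d = 3, one of the simulated settings
mu_hat = random_direction(3, rng)   # placeholder for an estimate returned by the EM algorithm
print(direction_distance(mu_true, mu_hat))
```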

Simulation results for the two- and three-component cases are presented in Tables 1 and 2, respectively. The average distance between the true and the estimated parameters over the 500 replications is reported, with the standard deviation in parentheses. We observe that the estimated parameters converge to the true parameters as n increases. We also notice that the mean direction parameter can be estimated with higher precision when the corresponding concentration parameter is large. This is expected since observations are more closely clustered when the concentration parameter is large.

Tables 3 and 4 show the number of degeneracies encountered when running the EM algorithm to compute the ordinary MLE for mixtures of vMF distributions. Observations are generated from a mixture of vMF distributions with one mixture component for Table 3 and with two mixture components for Table 4. We vary the dimension of the data from d=3 to d=4 and the sample size from n=100 to n=500. Mixtures of vMF distributions with p=2,3,4,5 components are fitted to the generated data. We compute the ordinary MLE using the EM algorithm and record the number of times that the EM algorithm fails to converge out of 1000 simulation runs. The EM algorithm is considered to have failed to converge if one of the concentration parameters becomes exceedingly large (greater than 10^{10}). From Tables 3 and 4, the EM algorithm tends to fail to converge more often with smaller sample sizes. We also note that when the fitted model has a larger number of mixture components p, the EM algorithm is more likely to fail to converge.

Table 1: Simulation results for the vMF mixtures with two mixture components.
d n π1(=0.5) μ1 μ2 κ1(=10) κ2(=1)
2 100 0.047 0.035 0.152 2.488 0.207
(0.050) (0.023) (0.124) (2.339) (0.159)
2 500 0.026 0.016 0.071 1.594 0.081
(0.024) (0.010) (0.062) (1.181) (0.064)
2 1000 0.022 0.010 0.046 1.410 0.078
(0.019) (0.007) (0.034) (1.037) (0.075)
3 100 0.037 0.048 0.275 2.175 0.171
(0.034) (0.026) (0.154) (1.712) (0.143)
3 500 0.025 0.023 0.126 1.345 0.098
(0.024) (0.013) (0.067) (0.894) (0.087)
3 1000 0.022 0.018 0.085 1.299 0.068
(0.017) (0.009) (0.047) (0.680) (0.058)
4 100 0.039 0.075 0.324 1.623 0.194
(0.033) (0.032) (0.229) (1.406) (0.122)
4 500 0.019 0.024 0.161 0.868 0.103
(0.017) (0.013) (0.065) (0.518) (0.083)
4 1000 0.018 0.020 0.142 0.842 0.060
(0.011) (0.011) (0.052) (0.431) (0.051)
Table 2: Simulation results for the vMF mixtures with three mixture components
d n π1(=0.4) π2(=0.3) μ1 μ2 μ3 κ1(=10) κ2(=5) κ3(=1)
2 100 0.071 0.039 0.046 0.085 0.327 2.828 2.016 0.293
(0.042) (0.026) (0.050) (0.091) (0.279) (2.571) (1.289) (0.302)
2 500 0.058 0.028 0.039 0.062 0.209 1.703 1.514 0.255
(0.501) (0.023) (0.044) (0.061) (0.154) (1.616) (1.176) (0.202)
2 1000 0.046 0.025 0.022 0.040 0.167 1.431 1.318 0.209
(0.032) (0.185) (0.036) (0.047) (0.125) (1.307) (0.892) (0.185)
3 100 0.037 0.041 0.053 0.113 0.452 1.717 1.720 0.249
(0.034) (0.044) (0.028) (0.096) (0.258) (1.224) (1.050) (0.274)
3 500 0.033 0.031 0.043 0.067 0.285 1.120 1.018 0.206
(0.024) (0.023) (0.037) (0.040) (0.138) (1.010) (0.914) (0.246)
3 1000 0.026 0.022 0.024 0.052 0.255 1.051 1.039 0.183
(0.026) (0.021) (0.018) (0.029) (0.126) (0.806) (0.747) (0.138)
4 100 0.051 0.021 0.073 0.121 0.417 1.432 1.356 0.334
(0.045) (0.017) (0.026) (0.058) (0.267) (1.207) (1.110) (0.260)
4 500 0.030 0.022 0.031 0.068 0.313 1.154 1.088 0.246
(0.026) (0.016) (0.016) (0.028) (0.018) (0.873) (0.760) (0.209)
4 1000 0.033 0.021 0.028 0.059 0.277 1.100 1.072 0.227
(0.027) (0.017) (0.015) (0.029) (0.163) (0.675) (0.736) (0.180)
Table 3: Number of degeneracies of the EM algorithm when computing the ordinary MLE for mixture of vMF distributions with one mixture component
d n p=2 p=3 p=4 p=5
3 100 13 47 86 218
3 200 2 8 34 53
3 500 0 2 2 5
4 100 6 33 64 169
4 200 1 3 12 39
4 500 0 0 1 4
Table 4: Number of degeneracies of the EM algorithm when computing the ordinary MLE for mixture of vMF distributions with two mixture components
d n p=2 p=3 p=4 p=5
3 100 0 51 147 233
3 200 0 27 45 87
3 500 0 4 11 23
4 100 0 49 113 193
4 200 0 14 47 81
4 500 0 2 8 11

6 Data Application

We illustrate the EM algorithm for penalized maximum likelihood estimation using the household data set from the R package HSAUR3. The data set contains the household expenditures of 20 single men and 20 single women on four commodity groups. As in Hornik and Grün (2014), we focus on three of the commodity groups (housing, food and service). The EM algorithms for the ordinary MLE and for the penalized MLE with 2 and 3 mixture components are fitted to the data. The results are shown in Tables 5 and 6, respectively, where the estimated parameters \hat{\boldsymbol{\pi}},\hat{\boldsymbol{\mu}},\hat{\boldsymbol{\kappa}} are shown for all cases. The estimated parameters for the MLE and for the penalized MLE are very similar for both p=2 and p=3. The log-likelihood evaluated at the MLE is slightly larger than the penalized log-likelihood evaluated at the penalized MLE. More interestingly, we observe that in each case the largest concentration parameter obtained under the penalized MLE is smaller than that obtained under the MLE. This behavior suggests that the incorporation of a penalty function pulls the estimate of the largest concentration parameter towards 0 and prevents the divergence of the likelihood function.

Table 5: Maximum likelihood estimates obtained from fitting mixtures of vMF distributions to the household expenses example.
π housing food service κ log likelihood
p=2p=2 0.47 0.95 0.13 0.27 114.70 113.08
0.53 0.67 0.63 0.40 17.96
p=3p=3 0.52 0.95 0.15 0.27 83.26 126.06
0.13 0.67 0.31 0.68 181.21
0.35 0.59 0.76 0.28 62.91
Table 6: Penalized maximum likelihood estimates obtained from fitting mixtures of vMF distributions to the household expenses example.
π housing food service κ penalized log likelihood
p=2p=2 0.47 0.95 0.13 0.27 112.20 112.94
0.53 0.67 0.63 0.40 18.48
p=3p=3 0.52 0.95 0.15 0.27 82.97 125.26
0.13 0.67 0.31 0.68 165.71
0.35 0.59 0.76 0.28 62.69

7 Discussion

In this paper we considered a penalized maximum likelihood approach to the estimation of the mixture of vMF distributions. By incorporating a suitable penalty function, we showed that the resulting penalized MLE is strongly consistent. An EM algorithm was derived to maximize the penalized likelihood function, and its performance and behavior were examined using simulation studies and a data application. The techniques used in this work to prove consistency could be applicable to study other mixture models for spherical observations.

8 Conflict of Interests

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix A Proofs

A.1 Proof of Theorem 3.1

Proof.

We fix 𝜽:=((𝝁1,κ1),,(𝝁p,κp))Θp\boldsymbol{\theta}:=((\boldsymbol{\mu}_{1},\kappa_{1}),\ldots,(\boldsymbol{\mu}_{p},\kappa_{p}))\in\Theta^{p} such that 𝝁l=𝐱m\boldsymbol{\mu}_{l}=\mathbf{x}_{m} for some pair (l,m)(l,m). We construct a sequence (𝜽(q),𝝅(q)),q=1,2,(\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)}),q=1,2,\ldots and show that L(𝒳;𝜽(q),𝝅(q))L({\cal X};\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)})\uparrow\infty as qq\uparrow\infty.

For k=1,2,,pk=1,2,\ldots,p and q=1,2,q=1,2,\ldots, we let 𝝁k(q)=𝝁k\boldsymbol{\mu}^{(q)}_{k}=\boldsymbol{\mu}_{k}, and πk(q)=(11/q)πk+1/(qp)\pi^{(q)}_{k}=(1-1/q)\pi_{k}+1/(qp). It is easy to verify that k=1pπk(q)=1\sum_{k=1}^{p}\pi^{(q)}_{k}=1. For each qq, we let κl(q)=q\kappa_{l}^{(q)}=q, and for klk\neq l, we let κk(q)=κk\kappa_{k}^{(q)}=\kappa_{k}. Since πk(q)1/(qp)\pi_{k}^{(q)}\geq 1/(qp), the likelihood is lower bounded by:

L(𝒳;𝜽(q),𝝅(q))1(qp)ni=1nk=1pf(𝐱i;𝝁k,κk(q)).L({\cal X};\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)})\geq\frac{1}{(qp)^{n}}\prod_{i=1}^{n}\sum_{k=1}^{p}f(\mathbf{x}_{i};\boldsymbol{\mu}_{k},\kappa_{k}^{(q)}).

For the mth observation, we have

\sum_{k=1}^{p}f(\mathbf{x}_{m};\boldsymbol{\mu}_{k},\kappa_{k}^{(q)})\geq f(\mathbf{x}_{m};\boldsymbol{\mu}_{l},\kappa_{l}^{(q)})=c_{d}(\kappa_{l}^{(q)})e^{\kappa_{l}^{(q)}}.

For any i\neq m and h\neq l, we have

\sum_{k=1}^{p}f(\mathbf{x}_{i};\boldsymbol{\mu}_{k},\kappa_{k}^{(q)})\geq f(\mathbf{x}_{i};\boldsymbol{\mu}_{h},\kappa_{h}^{(q)})=c_{d}(\kappa_{h}^{(q)})e^{\kappa_{h}^{(q)}\boldsymbol{\mu}_{h}^{T}\mathbf{x}_{i}}.

Therefore, the likelihood function can be bounded below by

L({\cal X};\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)})\geq\frac{1}{(qp)^{n}}c_{d}(\kappa_{l}^{(q)})e^{\kappa_{l}^{(q)}}\prod_{i\neq m}c_{d}(\kappa_{h}^{(q)})e^{\kappa_{h}^{(q)}\boldsymbol{\mu}_{h}^{T}\mathbf{x}_{i}}=\frac{1}{(qp)^{n}}c_{d}(q)e^{q}\prod_{i\neq m}c_{d}(\kappa_{h})e^{\kappa_{h}\boldsymbol{\mu}_{h}^{T}\mathbf{x}_{i}}.

Since c_{d}(\kappa)e^{\kappa} is of order \kappa^{d/2-1/2} and hence grows without bound as \kappa\rightarrow\infty, we have L({\cal X};\boldsymbol{\theta}^{(q)},\boldsymbol{\pi}^{(q)})\uparrow\infty as q\uparrow\infty. ∎

A.2 Technical Lemmas

The following technical lemmas are useful for the proofs of the main results. Lemma 3 gives the asymptotic expansion of the modified Bessel function of the first kind and is straightforward to derive.

Lemma 3.

Let Ir()I_{r}(\cdot) be the modified Bessel function of the first kind with degree rr. As zz\rightarrow\infty, we have

Ir(z)=ez2πz(1+𝒪(z1)).I_{r}(z)=\frac{e^{z}}{\sqrt{2\pi z}}(1+{\cal O}(z^{-1})).

Lemma 4 concerns the covering of the surface of the unit hypersphere with BϵB_{\epsilon}-balls and is needed for the proof of Lemma 1.

Lemma 4.

For any sufficiently small positive number \epsilon, there exist points \boldsymbol{\eta}_{1},\ldots,\boldsymbol{\eta}_{m}\in\mathbb{S}^{d-1} with m\leq A_{1}/\epsilon^{d-1}, where A_{1}>0 is a constant depending only on the dimension d, such that for any \mathbf{x}\in\mathbb{S}^{d-1} there exists \boldsymbol{\eta}_{j} with B_{\epsilon}(\mathbf{x})\subset B_{2\epsilon}(\boldsymbol{\eta}_{j}).

Proof.

Fix ϵ>0\epsilon>0 and consider an open cover {Bϵ(𝜼i)}i\{B_{\epsilon}(\boldsymbol{\eta}_{i})\}_{i} of 𝕊d1\mathbb{S}^{d-1}. By compactness of 𝕊d1\mathbb{S}^{d-1}, there exists a finite subcover {Bϵ(𝜼1),,Bϵ(𝜼m)}\{B_{\epsilon}(\boldsymbol{\eta}_{1}),\ldots,B_{\epsilon}(\boldsymbol{\eta}_{m})\} of 𝕊d1\mathbb{S}^{d-1}. Let 𝐱𝕊d1\mathbf{x}\in\mathbb{S}^{d-1} be fixed, and let 𝐳Bϵ(𝐱)\mathbf{z}\in B_{\epsilon}(\mathbf{x}) be arbitrary. We must show that d(𝐳,𝜼i)<2ϵd(\mathbf{z},\boldsymbol{\eta}_{i})<2\epsilon for some ii.

Since {Bϵ(𝜼1),,Bϵ(𝜼m)}\{B_{\epsilon}(\boldsymbol{\eta}_{1}),\ldots,B_{\epsilon}(\boldsymbol{\eta}_{m})\} is an open cover of 𝕊d1\mathbb{S}^{d-1}, we must have 𝐱Bϵ(𝜼i)\mathbf{x}\in B_{\epsilon}(\boldsymbol{\eta}_{i}) for some ii. Therefore,

d(𝐳,𝜼i)d(𝐳,𝐱)+d(𝐱,𝜼i)<ϵ+ϵ=2ϵ.d(\mathbf{z},\boldsymbol{\eta}_{i})\leq d(\mathbf{z},\mathbf{x})+d(\mathbf{x},\boldsymbol{\eta}_{i})<\epsilon+\epsilon=2\epsilon.

Hence, we have Bϵ(𝐱)B2ϵ(𝜼i)B_{\epsilon}(\mathbf{x})\subset B_{2\epsilon}(\boldsymbol{\eta}_{i}). Since xx is arbitrary, the result follows.

The statement that mA1/ϵd1m\leq A_{1}/\epsilon^{d-1} for some constant A1>0A_{1}>0 is clearly true for d=2d=2, and the general case holds using proof by induction with a geometric argument.

A.3 Proof of Lemma 1

Proof of Lemma 1

Proof.

Let ϵ\epsilon be a small positive number. By Lemma 4, there exists 𝜼1,,𝜼m𝕊d1\boldsymbol{\eta}_{1},\ldots,\boldsymbol{\eta}_{m}\in\mathbb{S}^{d-1} with mA1/ϵd1m\leq A_{1}/\epsilon^{d-1} such that for any 𝐱𝕊d1\mathbf{x}\in\mathbb{S}^{d-1}, we have Bϵ(𝐱)B2ϵ(𝜼j)B_{\epsilon}(\mathbf{x})\subset B_{2\epsilon}(\boldsymbol{\eta}_{j}) for some jj. Consequently, we have that

\sup_{\boldsymbol{\mu}\in\mathbb{S}^{d-1}}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{\epsilon}(\boldsymbol{\mu})\big)\bigg\}\leq\max_{j=1,\ldots,m}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{2\epsilon}(\boldsymbol{\eta}_{j})\big)\bigg\}\leq\max_{j=1,\ldots,m}\bigg\{\frac{1}{n}\sum_{i=1}^{n}I\big(X_{i}\in B_{2\epsilon}(\boldsymbol{\eta}_{j})\big)-\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))\bigg\}+\max_{j=1,\ldots,m}\bigg\{\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))\bigg\},

where γ0(B2ϵ(𝜼j))=γ0(XB2ϵ(𝜼j))\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))=\gamma_{0}(X\in B_{2\epsilon}(\boldsymbol{\eta}_{j})) is the probability that a random variable XX generated from the γ0\gamma_{0} takes value in the 2ϵ2\epsilon-ball B2ϵB_{2\epsilon}. For each j=1,,mj=1,\ldots,m, γ0(B2ϵ(𝜼j))\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j})) can be bounded above by

γ0(B2ϵ(𝜼j))=B2ϵ(𝜼j)f0(𝐱)𝑑ω(𝐱)Mω(B2ϵ(𝜼j))=MA2ϵd1=δ(ϵ),\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))=\int_{B_{2\epsilon}(\boldsymbol{\eta}_{j})}f_{0}(\mathbf{x})d\omega(\mathbf{x})\leq M\omega(B_{2\epsilon}(\boldsymbol{\eta}_{j}))=MA_{2}\epsilon^{d-1}=\delta(\epsilon),

where we recall that the constants MM and A2A_{2} are defined in Equation (4) and (7), respectively, and the function δ()\delta(\cdot) is defined in Equation (8). This implies that

maxj=1,,m{γ0(B2ϵ(𝜼j))}δ(ϵ).\displaystyle\max_{j=1,\ldots,m}\bigg{\{}\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))\bigg{\}}\leq\delta(\epsilon). (18)

Define the quantities

Δnj:=|1ni=1nI(XiB2ϵ(𝜼j))γ0(B2ϵ(𝜼j))|,j=1,,m.\Delta_{nj}:=\bigg{|}\frac{1}{n}\sum_{i=1}^{n}I\big{(}X_{i}\in B_{2\epsilon}(\boldsymbol{\eta}_{j})\big{)}-\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))\bigg{|},\quad j=1,\ldots,m.

For t>0t>0, by Bernstein’s inequality we have

\mathbb{P}(\Delta_{nj}\geq t)\leq 2\exp\bigg(-\frac{\frac{1}{2}n^{2}t^{2}}{n\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j}))(1-\gamma_{0}(B_{2\epsilon}(\boldsymbol{\eta}_{j})))+\frac{1}{3}nt}\bigg)\leq 2\exp\bigg(-\frac{n^{2}t^{2}}{2nMA_{2}\epsilon^{d-1}+\frac{2}{3}nt}\bigg)=2\exp\bigg(-\frac{nt^{2}}{2\delta(\epsilon)+\frac{2}{3}t}\bigg). (19)

We note that δ(ϵ)>logn/n\delta(\epsilon)>\log n/n whenever ϵd1>logn/(MnA2)\epsilon^{d-1}>\log n/(MnA_{2}). Letting t=δ(ϵ)t=\delta(\epsilon) in the inequality above, we obtain

(Δnjδ(ϵ))2n3.\mathbb{P}(\Delta_{nj}\geq\delta(\epsilon))\leq 2n^{-3}.

Since ϵd1>logn/(MnA2)\epsilon^{d-1}>\log n/(MnA_{2}) implies m<nm<n for sufficiently large nn, we apply the union upper bound to obtain

(maxjΔnjδ(ϵ))2n2.\displaystyle\mathbb{P}\Big{(}\max_{j}\Delta_{nj}\geq\delta(\epsilon)\Big{)}\leq 2n^{-2}. (20)

Combining the two inequalities (18) and (20), the first conclusion of the lemma follows by applying the Borel-Cantelli lemma.

For the second statement of the lemma, we observe that 0<\epsilon^{d-1}<\log n/(MnA_{2}) implies that \delta(\epsilon)<\log n/n. Letting t:=(\log n)^{2}/n, for large enough n we have

2δ(ϵ)<t/3.2\delta(\epsilon)<t/3.

Substituting tt into inequality (19) gives

(Δnj(logn)2n)exp((logn)2)n3.\mathbb{P}\bigg{(}\Delta_{nj}\geq\frac{(\log n)^{2}}{n}\bigg{)}\leq\exp(-(\log n)^{2})\leq n^{-3}.

The second conclusion of the lemma follows from an application of the Borel-Cantelli lemma.

A.4 Proof of Strong Consistency of PMLE

Proof.

We prove the strong consistency of PMLE for the case of two mixture components. The proof of strong consistency for the general case of pp mixture components is analogous but significantly more tedious. We briefly outline the key steps for the proof of the pp mixture components case, which are also along the lines of Section 3.3 of (Chen et al., 2008).

Recall that a two component mixture mixing measure has the form γ=πγθ1+(1π)γθ2\gamma=\pi\gamma_{\theta_{1}}+(1-\pi)\gamma_{\theta_{2}}, where 0π10\leq\pi\leq 1, θ1=(𝝁1,κ1)\theta_{1}=(\boldsymbol{\mu}_{1},\kappa_{1}), θ2=(𝝁2,κ2)\theta_{2}=(\boldsymbol{\mu}_{2},\kappa_{2}) and the corresponding penalized log-likelihood function is given by

pln(γ)=ln(γ)+p~n(κ1)+p~n(κ2).pl_{n}(\gamma)=l_{n}(\gamma)+\tilde{p}_{n}(\kappa_{1})+\tilde{p}_{n}(\kappa_{2}).

Let K0=E0logf(X;γ0)K_{0}=E_{0}\log f(X;\gamma_{0}) where E0E_{0} denotes the expectation under the true probability measure γ0\gamma_{0}. We follow the strategy in (Chen et al., 2008) to divide the parameter space Γ=Π×Θ2\Gamma=\Pi\times\Theta^{2} into three regions

Γ1:={π,1π,(𝝁1,κ1),(𝝁2,κ2):κ1κ2ν0},\Gamma_{1}:=\{\pi,1-\pi,(\boldsymbol{\mu}_{1},\kappa_{1}),(\boldsymbol{\mu}_{2},\kappa_{2}):\kappa_{1}\geq\kappa_{2}\geq\nu_{0}\},
Γ2:={π,1π,(𝝁1,κ1),(𝝁2,κ2):κ1τ0,κ2ν0},\Gamma_{2}:=\{\pi,1-\pi,(\boldsymbol{\mu}_{1},\kappa_{1}),(\boldsymbol{\mu}_{2},\kappa_{2}):\kappa_{1}\geq\tau_{0},\kappa_{2}\leq\nu_{0}\},
\Gamma_{3}:=\Gamma-(\Gamma_{1}\cup\Gamma_{2}).

We require \nu_{0} and \tau_{0} to be sufficiently large, with their exact magnitudes to be specified later. We will show that the penalized MLE \hat{\gamma} is almost surely not in \Gamma_{1} or \Gamma_{2}. Therefore, \hat{\gamma} must be in \Gamma_{3}, and its consistency follows from Theorem 5 of Redner (1981).

We first consider the region Γ1\Gamma_{1}, and define the following index sets:

D1:={i:XiBϵ1(𝝁1)},D2:={i:XiBϵ2(𝝁2)},D_{1}:=\{i:X_{i}\in B_{\epsilon_{1}}(\boldsymbol{\mu}_{1})\},\quad D_{2}:=\{i:X_{i}\in B_{\epsilon_{2}}(\boldsymbol{\mu}_{2})\},

where

ϵi=1(logκi)2,i=1,2.\epsilon_{i}=\frac{1}{(\log\kappa_{i})^{2}},\quad i=1,2.

D1D_{1} and D2D_{2} consist of observations that are very close to 𝝁1\boldsymbol{\mu}_{1} and 𝝁2\boldsymbol{\mu}_{2}, respectively. We separately assess the likelihood contributions of the observations in D1D_{1} and D2D_{2} in Lemmas 5 and 6.

By Lemmas 5 and 6 the maximizer γ^n\hat{\gamma}_{n} of pln(γ)pl_{n}(\gamma) is almost surely in Γ3\Gamma_{3}. Lemmas 5 and 6 also imply that γ0Γ3\gamma_{0}\in\Gamma_{3}. We want to apply Theorem 5 of Redner (1981) to conclude the strong consistency of the estimator γ^n\hat{\gamma}_{n}. First, it is clear following the definition of the metric ρ\rho on Θ\Theta given in Equation 3 that Γ3\Gamma_{3} is a compact subset of Γ\Gamma containing γ0\gamma_{0}. For any θ=(𝝁,κ)Θ\theta=(\boldsymbol{\mu},\kappa)\in\Theta, we can choose rr sufficiently small such that for all θ=(𝝁,κ)\theta^{\prime}=(\boldsymbol{\mu}^{\prime},\kappa^{\prime}) with ρ(θ,θ)<r\rho(\theta,\theta^{\prime})<r, the density f(;𝝁,κ)f(\cdot;\boldsymbol{\mu}^{\prime},\kappa^{\prime}) is bounded. Thus, we have

\int_{\mathbb{S}^{d-1}}\log\bigg(\max\big(1,\sup_{\theta^{\prime}:\rho(\theta,\theta^{\prime})<r}f(\mathbf{x},\boldsymbol{\mu}^{\prime},\kappa^{\prime})\big)\bigg)d\gamma_{0}\leq\int_{\mathbb{S}^{d-1}}C\,d\gamma_{0}\leq C,

for some constant CC. Therefore, Assumption 2a of Redner (1981) is also satisfied. For any θ1,θ2Θ\theta_{1},\theta_{2}\in\Theta where θ1=(𝝁1,κ1)\theta_{1}=(\boldsymbol{\mu}_{1},\kappa_{1}) and θ2=(𝝁2,κ2)\theta_{2}=(\boldsymbol{\mu}_{2},\kappa_{2}), since f(,𝝁1,κ1)f(\cdot,\boldsymbol{\mu}_{1},\kappa_{1}) is bounded, we have

𝕊d1|logf(𝐱;𝝁1,κ1)|𝑑γθ2<.\int_{\mathbb{S}^{d-1}}|\log f(\mathbf{x};\boldsymbol{\mu}_{1},\kappa_{1})|d\gamma_{\theta_{2}}<\infty.

Therefore, Assumption 4 of Redner (1981) is also satisfied. We can conclude by applying Theorem 5 of Redner (1981) that \hat{\gamma}_{n}\rightarrow\gamma_{0} almost surely in the quotient space \tilde{\Gamma}.

We outline the key steps of the proof for the general case of pp mixture components, which follows the same strategy as the proof for the 2 components case. We divide the parameter space Γ=Π×Θp\Gamma=\Pi\times\Theta^{p} into p+1p+1 regions

\Gamma_{1}:=\{(\pi_{1},\ldots,\pi_{p}),(\boldsymbol{\mu}_{1},\kappa_{1}),\ldots,(\boldsymbol{\mu}_{p},\kappa_{p}):\kappa_{1}\geq\kappa_{2}\geq\cdots\geq\kappa_{p}\geq\nu_{10}\},
\Gamma_{k}:=\{(\pi_{1},\ldots,\pi_{p}),(\boldsymbol{\mu}_{1},\kappa_{1}),\ldots,(\boldsymbol{\mu}_{p},\kappa_{p}):\kappa_{1}\geq\cdots\geq\kappa_{p-k+1}\geq\nu_{k0},\ \nu_{(k-1)0}\geq\kappa_{p-k+2}\geq\cdots\geq\kappa_{p}\},

for k=2,\ldots,p, and \Gamma_{p+1}=\Gamma-\cup_{k=1}^{p}\Gamma_{k}.
Given suitable constraints on the values of ν10,,νk0\nu_{10},\ldots,\nu_{k0}, an extension of Lemma 5 shows that

supγΓ1pln(γ)pln(γ0),a.s.,\sup_{\gamma\in\Gamma_{1}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\rightarrow-\infty,\quad\mbox{a.s.},

and an extension of Lemma 6 shows that

supγΓkpln(γ)pln(γ0),a.s.\sup_{\gamma\in\Gamma_{k}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\rightarrow-\infty,\quad\mbox{a.s.}

for k=2,,pk=2,\ldots,p. Therefore, the probability that γ^n\hat{\gamma}_{n}, the maximizer of pln(γ)pl_{n}(\gamma) belongs to Γ1,,Γp\Gamma_{1},\ldots,\Gamma_{p} goes to zero. With γ^n\hat{\gamma}_{n} almost surely in Γp\Gamma_{p} which is a compact subset of Γ\Gamma, we can apply Theorem 5 of Redner (1981) to conclude the strong consistency of γ^n\hat{\gamma}_{n}.

Lemma 5.

supγΓ1pln(γ)pln(γ0),a.s.\sup_{\gamma\in\Gamma_{1}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\rightarrow-\infty,\quad\mbox{a.s.}

Proof.

The log-likelihood contributions of observations in any index set DD is given by

ln(γ;D)=iDlog(πcd(κ1)exp(κ1𝐱iT𝝁1)+(1π)cd(κ2)exp(κ2𝐱iT𝝁2)).l_{n}(\gamma;D)=\sum_{i\in D}\log\bigg{(}\pi c_{d}(\kappa_{1})\exp(\kappa_{1}\mathbf{x}_{i}^{T}\boldsymbol{\mu}_{1})+(1-\pi)c_{d}(\kappa_{2})\exp(\kappa_{2}\mathbf{x}_{i}^{T}\boldsymbol{\mu}_{2})\bigg{)}.

For any observation in D_{1}, its likelihood contribution is bounded above by c_{d}(\kappa_{1})\exp(\kappa_{1}), and by the asymptotic expansion of the modified Bessel function of the first kind in Lemma 3, we have

c_{d}(\kappa_{1})\exp(\kappa_{1})\leq A_{3}\sqrt{\kappa_{1}}. (21)

for some constant A3>0A_{3}>0. Consequently, the log-likelihood of observations in D1D_{1} is bounded above by

ln(γ;D1)n(D1)log(A3κ1),l_{n}(\gamma;D_{1})\leq n(D_{1})\log(A_{3}\sqrt{\kappa_{1}}),

where n(D1)n(D_{1}) is the number of observations in D1D_{1}. By Lemma 2, for

lognMnA2ϵ1d1<ξ0,\frac{\log n}{MnA_{2}}\leq\epsilon_{1}^{d-1}<\xi_{0},

n(D1)n(D_{1}) is almost surely bounded above by

n(D1)=i=1nI(XiBϵ1(𝝁1))4nδ(ϵ1)=4nMA2ϵ1d1.n(D_{1})=\sum_{i=1}^{n}I(X_{i}\in B_{\epsilon_{1}}(\boldsymbol{\mu}_{1}))\leq 4n\delta(\epsilon_{1})=4nMA_{2}\epsilon_{1}^{d-1}.

Therefore, recalling that ϵ1=1/(logκ1)2\epsilon_{1}=1/(\log\kappa_{1})^{2}, ln(γ;D1)l_{n}(\gamma;D_{1}) can be bounded above by:

l_{n}(\gamma;D_{1})\leq 4nMA_{2}\epsilon_{1}^{d-1}\log(A_{3}\sqrt{\kappa_{1}})\leq A_{4}n\frac{1}{(\log\kappa_{1})^{2d-3}}\leq A_{4}n\frac{1}{(\log\nu_{0})^{2d-3}}, (22)

where A4>0A_{4}>0 is some constant, and the last inequality follows from κ1>ν0\kappa_{1}>\nu_{0}. For

0<ϵ1d1<lognMnA2,0<\epsilon_{1}^{d-1}<\frac{\log n}{MnA_{2}},

we have n(D1)2(logn)2n(D_{1})\leq 2(\log n)^{2} almost surely by Lemma 2. Therefore, with condition C3 on the penalty function p~n(κ1)\tilde{p}_{n}(\kappa_{1}), almost surely

n(D1)(logA3κ1)+p~n(κ1)2(logn)2(logA3κ1)+p~n(κ1)<0.\displaystyle n(D_{1})(\log A_{3}\sqrt{\kappa_{1}})+\tilde{p}_{n}(\kappa_{1})\leq 2(\log n)^{2}(\log A_{3}\sqrt{\kappa_{1}})+\tilde{p}_{n}(\kappa_{1})<0. (23)

The two bounds (22) and (23) can be combined to form

ln(γ;D1)+p~n(κ1)A4n1(logν0)2d3.\displaystyle l_{n}(\gamma;D_{1})+\tilde{p}_{n}(\kappa_{1})\leq A_{4}n\frac{1}{(\log\nu_{0})^{2d-3}}. (24)

The same approach can be used to derive the same bound for observations in D1cD2D_{1}^{c}D_{2}:

ln(γ;D1cD2)+p~n(κ2)A4n1(logν0)2d3.\displaystyle l_{n}(\gamma;D_{1}^{c}D_{2})+\tilde{p}_{n}(\kappa_{2})\leq A_{4}n\frac{1}{(\log\nu_{0})^{2d-3}}. (25)

For any observation \mathbf{x} that falls outside both D_{1} and D_{2}, we have \arccos(\mathbf{x}^{T}\boldsymbol{\mu}_{1})>\epsilon_{1} and \arccos(\mathbf{x}^{T}\boldsymbol{\mu}_{2})>\epsilon_{2}. Using the Taylor expansion of \cos(\epsilon) around 0 for small positive \epsilon, we can show that for \mathbf{x}\in D_{1}^{c}D_{2}^{c},

\mathbf{x}^{T}\boldsymbol{\mu}_{i}\leq 1-\frac{\epsilon_{i}^{2}}{3}=1-\frac{1}{3(\log\kappa_{i})^{4}},\quad i=1,2.

Consequently, recalling the inequality (21), and for large enough ν0\nu_{0}, the log-likelihood contribution of such 𝐱\mathbf{x} is bounded above by

logA3+logκiκi3(logκi)4logκiκi4(logκi)4logν0ν04(logν0)4.\log A_{3}+\log\sqrt{\kappa_{i}}-\frac{\kappa_{i}}{3(\log\kappa_{i})^{4}}\leq\log\kappa_{i}-\frac{\kappa_{i}}{4(\log\kappa_{i})^{4}}\leq\log\nu_{0}-\frac{\nu_{0}}{4(\log\nu_{0})^{4}}.

For large enough nn, we must have n(D1cD2c)n/2n(D_{1}^{c}D_{2}^{c})\geq n/2 almost surely. Therefore, almost surely the log-likelihood of the observations in D1cD2cD_{1}^{c}D_{2}^{c} is bounded above by

ln(γ;D1cD2c)(n/2)(logν0ν0(logν0)4).\displaystyle l_{n}(\gamma;D_{1}^{c}D_{2}^{c})\leq(n/2)\bigg{(}\log\nu_{0}-\frac{\nu_{0}}{(\log\nu_{0})^{4}}\bigg{)}. (26)

For sufficiently large ν0\nu_{0}, the following inequalities hold

2A41(logν0)2d31,2A_{4}\frac{1}{(\log\nu_{0})^{2d-3}}\leq 1,
logν0ν0(logν0)42K04.\log\nu_{0}-\frac{\nu_{0}}{(\log\nu_{0})^{4}}\leq 2K_{0}-4.

Therefore, combining the bounds (24), (25), (26), the penalized log-likelihood can be bounded above by

pl_{n}(\gamma)\leq 2A_{4}n\frac{1}{(\log\nu_{0})^{2d-3}}+(n/2)\bigg(\log\nu_{0}-\frac{\nu_{0}}{(\log\nu_{0})^{4}}\bigg)\leq n+(n/2)(2K_{0}-4)=n(K_{0}-1).

By strong law of large numbers, we have n1pln(γ0)K0n^{-1}pl_{n}(\gamma_{0})\rightarrow K_{0} almost surely. Therefore,

supγΓ1pln(γ)pln(γ0)\sup_{\gamma\in\Gamma_{1}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\rightarrow-\infty

almost surely.

Lemma 6.

supγΓ2pln(γ)pln(γ0),a.s.\sup_{\gamma\in\Gamma_{2}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\rightarrow-\infty,\quad\mbox{a.s.}

Proof.

To establish a similar result for Γ2\Gamma_{2}, we define

g(𝐱;γ)=πexp(κ12(𝐱Tμ11))+(1π)cd(κ2)exp(κ2𝐱Tμ2).g(\mathbf{x};\gamma)=\pi\exp\bigg{(}\frac{\kappa_{1}}{2}(\mathbf{x}^{T}\mu_{1}-1)\bigg{)}+(1-\pi)c_{d}(\kappa_{2})\exp\big{(}\kappa_{2}\mathbf{x}^{T}\mu_{2}\big{)}.

We note that the first part of the RHS above is not a vMF density, and is well defined as κ1\kappa_{1}\rightarrow\infty. Straightforward calculation shows that for all γΓ2\gamma\in\Gamma_{2}, we have

𝕊d1g(𝐱;γ)𝑑x<1.\int_{\mathbb{S}^{d-1}}g(\mathbf{x};\gamma)dx<1.

Therefore, by Jensen’s inequality, for all γΓ2\gamma\in\Gamma_{2},

E0[logg(X;γ)f0(X)]<0,\displaystyle E_{0}\bigg{[}\log\frac{g(X;\gamma)}{f_{0}(X)}\bigg{]}<0,

where we recall that f0()f_{0}(\cdot) is the true density function. Since supγΓ2g(𝐱;γ)\sup_{\gamma\in\Gamma_{2}}g(\mathbf{x};\gamma) is bounded and g(𝐱;γ)g(\mathbf{x};\gamma) is continuous in γ\gamma almost surely w.r.t. f0(𝐱)f_{0}(\mathbf{x}), it follows that

supγΓ2{1ni=1nlog(g(Xi;γ)f0(Xi))}η(τ0)<0,a.s.\displaystyle\sup_{\gamma\in\Gamma_{2}}\bigg{\{}\frac{1}{n}\sum_{i=1}^{n}\log\Big{(}\frac{g(X_{i};\gamma)}{f_{0}(X_{i})}\Big{)}\bigg{\}}\rightarrow-\eta(\tau_{0})<0,\quad\mbox{a.s.} (27)

where η(τ0)>0\eta(\tau_{0})>0 is an increasing function of τ0\tau_{0}. Hence, we can find τ0>ν0\tau_{0}>\nu_{0} such that

A41(logτ0)2d3η(ν0)/4<η(τ0)/4.\displaystyle A_{4}\frac{1}{(\log\tau_{0})^{2d-3}}\leq\eta(\nu_{0})/4<\eta(\tau_{0})/4. (28)

For any observation in D1D_{1}, its log-likelihood contribution is no larger than

log(A3κ1)+logg(𝐱;γ).\log(A_{3}\sqrt{\kappa_{1}})+\log g(\mathbf{x};\gamma).

For sufficiently large τ0\tau_{0}, the log-likelihood contribution of any observation not in D1D_{1} is no more than logg(𝐱;γ)\log g(\mathbf{x};\gamma). This follows since for large enough κ1>τ0\kappa_{1}>\tau_{0},

c_{d}(\kappa_{1})e^{\kappa_{1}\mathbf{x}^{T}\boldsymbol{\mu}_{1}}=c_{d}(\kappa_{1})e^{\kappa_{1}(\mathbf{x}^{T}\boldsymbol{\mu}_{1}-1)}e^{\kappa_{1}}\leq A_{3}\sqrt{\kappa_{1}}\,e^{\kappa_{1}(\mathbf{x}^{T}\boldsymbol{\mu}_{1}-1)}\leq e^{\frac{\kappa_{1}}{2}(\mathbf{x}^{T}\boldsymbol{\mu}_{1}-1)}.

Therefore, the penalized log-likelihood is almost surely bounded above by

\sup_{\Gamma_{2}}pl_{n}(\gamma)-pl_{n}(\gamma_{0})\leq\sup_{\kappa_{1}\geq\tau_{0}}\bigg\{\sum_{i\in D_{1}}\log(A_{3}\sqrt{\kappa_{1}})+\tilde{p}_{n}(\kappa_{1})\bigg\}+\sup_{\Gamma_{2}}\bigg\{\sum_{i=1}^{n}\log\frac{g(X_{i};\gamma)}{f_{0}(X_{i})}\bigg\}+p_{n}(\boldsymbol{\kappa}_{0})\leq A_{4}n\frac{1}{(\log\tau_{0})^{2d-3}}-\frac{3}{4}\eta(\tau_{0})n+p_{n}(\boldsymbol{\kappa}_{0})\leq-\frac{1}{2}\eta(\tau_{0})n+p_{n}(\boldsymbol{\kappa}_{0})\rightarrow-\infty,

where 𝜿0\boldsymbol{\kappa}_{0} is the vector of the concentration parameters of the true measure γ0\gamma_{0}, the second inequality follows from (24) and (27), and the last inequality follows from (28).

References

  • Bagchi and Kadane (1991) Bagchi, P. and Kadane, J. B. (1991), “Laplace approximations to posterior moments and marginal distributions on circles, spheres, and cylinders,” Can. J. Stat., 19, 67–77.
  • Banerjee et al. (2005) Banerjee, A., Dhillon, I. S., Ghosh, J., and Sra, S. (2005), “Clustering on the Unit Hypersphere using von Mises-Fisher Distributions.” J. Mach. Learn. Res., 6, 1345–1382.
  • Bangert et al. (2010) Bangert, M., Hennig, P., and Oelfke, U. (2010), “Using an Infinite Von Mises-Fisher Mixture Model to Cluster Treatment Beam Directions in External Radiation Therapy,” in 2010 Ninth International Conference on Machine Learning and Applications, pp. 746–751.
  • Boomsma et al. (2006) Boomsma, W., Kent, J., Mardia, K., Taylor, C., and Hamelryck, T. (2006), “Graphical models and directional statistics capture protein structure,” in Interdisciplinary Statistics and Bioinformatics, Leeds University Press, pp. 91–94.
  • Chen (2017) Chen, J. (2017), “Consistency of the MLE under mixture models,” Statist. Sci., 32, 47–63.
  • Chen et al. (2007) Chen, J., Li, P., and Tan, X. (2007), “Inference for von Mises mixture in mean directions and concentration parameters.” Journal of Systems Science and Mathematical Sciences, 27, 27, 59–67.
  • Chen et al. (2016) Chen, J., Li, S., and Tan, X. (2016), “Consistency of the penalized MLE for two-parameter gamma mixture models,” Sci. China Math., 59, 2301–2318.
  • Chen et al. (2008) Chen, J., Tan, X., and Zhang, R. (2008), “Inference for normal mixtures in mean and variance,” Statist. Sinica, 18, 443–465.
  • Chen et al. (2013) Chen, L., Singh, V. P., Guo, S., Fang, B., and Liu, P. (2013), “A new method for identification of flood seasons using directional statistics,” Hydrological Sciences Journal, 58, 28–40.
  • Ciuperca et al. (2003) Ciuperca, G., Ridolfi, A., and Idier, J. (2003), “Penalized maximum likelihood estimator for normal mixtures,” Scand. J. Statist., 30, 45–59.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B, 39, 1–38, with discussion.
  • Fisher (1953) Fisher, R. (1953), “Dispersion on a sphere,” Proc. Roy. Soc. London Ser. A, 217, 295–305.
  • Goodd and Gaskins (1971) Goodd, I. J. and Gaskins, R. A. (1971), “Nonparametric roughness penalties for probability densities,” Biometrika, 58, 255–277.
  • Hathaway (1985) Hathaway, R. J. (1985), “A constrained formulation of maximum-likelihood estimation for normal mixture distributions,” Ann. Statist., 13, 795–800.
  • Hornik and Grün (2014) Hornik, K. and Grün, B. (2014), “movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions,” Journal of Statistical Software, 58, 1–31.
  • Jupp (1995) Jupp, P. E. (1995), “Some applications of directional statistics to astronomy,” in New trends in probability and statistics, Vol. 3 (Tartu/Pühajärve, 1994), VSP, Utrecht, pp. 123–133.
  • Li (2011) Li, S. (2011), “Concise Formulas for the Area and Volume of a Hyperspherical Cap,” Asian Journal of Mathematics & Statistics, 4, 66–70.
  • Mardia (1972) Mardia, K. V. (1972), Statistics of directional data, Academic Press, London-New York, probability and Mathematical Statistics, No. 13.
  • Mardia and Jupp (2000) Mardia, K. V. and Jupp, P. E. (2000), Directional statistics, Wiley Series in Probability and Statistics, John Wiley & Sons, Ltd., Chichester.
  • Mcgraw et al. (2006) Mcgraw, T., Vemuri, B., Yezierski, B., and Mareci, T. (2006), “Von Mises-Fisher mixture model of the diffusion ODF,” IEEE International Symposium on Biomedical Imaging, 2006, 65–68.
  • Ng and Kwong (2021) Ng, T. L. J. and Kwong, K.-K. (2021), “Universal approximation on the hypersphere,” Communications in Statistics - Theory and Methods, 0, 1–11.
  • Peters et al. (1978) Peters, B. C., Jr., and Walker, H. F. (1978), “An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions,” SIAM J. Appl. Math, 362–378.
  • Redner (1981) Redner, R. (1981), “Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions,” Ann. Statist., 9, 225–228.
  • Schou (1978) Schou, G. (1978), “Estimation of the concentration parameter in von Mises-Fisher distributions,” Biometrika, 65, 369–377.
  • Sinkkonen and Kaski (2002) Sinkkonen, J. and Kaski, S. (2002), “Clustering based on conditional distributions in an auxiliary space,” Neural Comput., 14, 217–239.
  • Song et al. (2012) Song, H., Liu, J., and Wang, G. (2012), “High-order parameter approximation for von Mises-Fisher distributions,” Appl. Math. Comput., 218, 11880–11890.
  • Taghia et al. (2014) Taghia, J., Ma, Z., and Leijon, A. (2014), “Bayesian Estimation of the von-Mises Fisher Mixture Model with Variational Inference,” IEEE Trans. Pattern Anal. Mach. Intell., 36, 1701–1715.
  • Tanabe et al. (2007) Tanabe, A., Fukumizu, K., Oba, S., Takenouchi, T., and Ishii, S. (2007), “Parameter estimation for von Mises-Fisher distributions,” Comput. Statist., 22, 145–157.
  • Zhe et al. (2019) Zhe, X., Chen, S., and Yan, H. (2019), “Directional statistics-based deep metric learning for image classification and retrieval,” Pattern Recognit., 93, 113 – 123.