Learning Mixtures of Experts with EM
Abstract
Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate “expert” model trained on each partition. Recently, MoE have become popular as components in today’s large language models as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent on the log-likelihood. In this paper we focus on studying the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for the cases of linear or logistic experts, where we show that EM is equivalent to Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence based on the signal-to-noise ratio (SNR). Experiments on synthetic and (small-scale) real-world data show that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.
1 Introduction
Mixtures of Experts (MoE)[JJ94] are a crucial class of parametric latent variable models that have gained significant popularity in deep learning for their ability to reduce both training and inference costs [CDWGL22]. MoE are particularly effective when the feature space can be divided and processed by specialized models, known as experts. Instead of relying on a single, large model to handle all input-output mappings, MoE utilize an ensemble of specialized experts. Each expert is responsible for a specific subset of the input space, allowing the system to efficiently route inputs to the most appropriate expert. This partitioning enables each expert to focus on mapping its designated inputs to outputs using a separate, optimized model. By leveraging multiple specialized experts rather than a monolithic model, MoE can achieve greater scalability and flexibility. This modular approach not only enhances computational efficiency but also allows for improved performance, as each expert can be finely tuned to handle its particular segment of the input space effectively. Real-world applications of Mixture of Experts span across various domains, including language translation, speech recognition, recommendation systems, and more [FZS22, MZYCHC18, Hin+12].
Training an MoE model involves training both (a) the parameters in the individual experts, and (b) the gating function that routes inputs to the appropriate expert. Typically, these are both learnt jointly by first formulating the final loss function as applied to the ensembled output, and then minimizing this joint loss function (via SGD or its variants) over the parameters of the gate and the experts.
In this paper we investigate, primarily from a theoretical perspective, the training of MoE using a classic algorithm: Expectation Maximization (EM). As opposed to SGD-based methods which are agnostic to whether a parameter is in the gate or in an expert, EM first formulates two separate problems – one for the router, and another for the experts – in a specific way. It then solves each problem in isolation and in parallel, and then collates the outputs to arrive at the updated set of gate and expert parameters.
For our theoretical results, we consider two simple instances of MoE models: Mixture of Linear Experts (where each expert is a linear regressor) and Mixture of Logistic Experts (where each expert is a logistic regressor). The router in each case is a linear softmax.
Main Contributions:
1) A recent but cornerstone result on EM [KKS22] showed that for exponential families, EM corresponds to Mirror Descent with unit step size and a KL divergence regularizer. MoE models (including the ones we focus on) are generally not exponential families; we show that this correspondence nonetheless still holds.
2) Armed with this perspective, we study our linear and logistic MoE when there are two experts. We provide sufficient conditions for EM to converge to the true model parameters at sub-linear or linear rates, and we characterize the parameter regimes in which these conditions hold. To our knowledge, no comparable characterizations are known for SGD-based methods.
3) Finally, we show, on both synthetic and small-scale proof-of-concept real-world datasets, that EM outperforms gradient descent both in terms of convergence rate and achieved performance.
1.1 Related Work
The EM algorithm [DLR77] is a powerful method for fitting latent variable models. Previous research on EM has demonstrated that, under mild smoothness assumptions, its parameter iterates converge to a stationary point of the log-likelihood objective [DLR77, Wu83, Tse04]. Subsequent research on EM introduced new analytical frameworks to provide specialized guarantees, particularly regarding the convergence of EM iterates to the true model parameters and the rate of this convergence. A widely adopted framework over the past decade, introduced by [BWY17], interprets EM as a variant of gradient descent. For latent variable models with a strongly convex EM objective that satisfies a condition known as "first-order stability," it was shown that EM iterates converge linearly to the true parameters. Subsequent works utilized this framework to show a local linear rate for Mixtures of Gaussians and Mixtures of Linear Regressions [BWY17, DTZ17, DHKWJ18, KQCCD18, KC20a, KHC21, KC20].
In a recent work, [KKS22] proved that EM – for exponential families – is equivalent to the mirror descent algorithm with unit step-size and Kullback–Leibler (KL) divergence regularizer:
(1)   \theta^{t+1} = \arg\min_{\theta}\; \Big\{ \langle \nabla L(\theta^t), \theta - \theta^t \rangle + \mathrm{KL}\big(p_{\theta^t}\,\|\,p_{\theta}\big) \Big\}.
This led to the first non-asymptotic convergence rates for EM, independent of the parameterization. While this characterization of EM covered Mixtures of Gaussians, it did not extend to Mixtures of Regressions or Mixtures of Experts.
There has also been work on understanding how to fit MoE. [JX95] showed that for MoE with a strongly convex negative log-likelihood objective, EM iterations converge linearly to the true parameters. However, the objective is generally nonconvex, raising doubts about whether the necessary assumptions for their result hold, even locally. Other works have remarked that the nature of the gating function creates a form of competition between the experts during training that can lead to local minima. Bayesian methods for MoE include variational learning and maximum a posteriori (MAP) estimation; however, [YWG12] notes that these approaches are nontrivial because the softmax gate does not admit a conjugate prior, and they are prone to getting stuck in local minima. [MOKV19] analyzed a variant of the EM algorithm for MoE in which the expert parameters are first recovered using a tensor decomposition method. They proved that, under proper initialization, their method recovers the true parameters at a nearly linear rate.
2 Mixture of Experts
Next, we formally describe the setting under consideration for the Mixture of Experts (MoE).
Data Generation Model: First, the input or feature vector x ∈ R^d is sampled from a probability density function p(x). Second, given the feature vector x, a latent variable z, responsible for routing x to the appropriate expert, is sampled with probability mass function p(z | x). Finally, given the pair (x, z), the output or label y is generated according to the probability distribution p(y | x, z). Hence, the complete data distribution is p(x, y, z) = p(x) p(z | x) p(y | x, z), and the joint input-output probability distribution can be written as
(2)   p(x, y) = p(x) \sum_{i=1}^{k} p(z = i \mid x)\, p(y \mid x, z = i).
This is the most general form of data generation under a mixture of k experts; here we focus on the case where x is sampled from a unit spherical Gaussian distribution, i.e., x \sim \mathcal{N}(0, I_d), and where p(z \mid x) is parameterized by a set of vectors w_1, \dots, w_k \in \mathbb{R}^d and can be written as
(3)   p(z = i \mid x; w) = \frac{\exp(w_i^\top x)}{\sum_{j=1}^{k} \exp(w_j^\top x)}, \qquad i = 1, \dots, k.
In other words, the probability mass function of the latent variable z is the softmax of the inner products of x with the parameter vectors, concatenated as w = (w_1, \dots, w_k).
Regarding p(y \mid x, z = i), the probability distribution of the output variable conditioned on the input variable and the latent routing variable (i.e., expert i), we also assume that it is parameterized by a vector \beta_i \in \mathbb{R}^d. Thus, for the concatenated vector \beta = (\beta_1, \dots, \beta_k), we write p(y \mid x, z = i; \beta) = p(y \mid x, z = i; \beta_i).
To aid the reader's understanding in later sections, we focus on two special cases of experts. In the first case, we have
(4)   p(y \mid x, z = i; \beta_i) = \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{(y - \beta_i^\top x)^2}{2}\Big)
as the density function of a normal distribution. This is equivalent to assuming that the output is y = \beta_i^\top x + \varepsilon, where \varepsilon is additive zero-mean Gaussian noise with unit variance. For future reference, we will refer to this special case as a Mixture of Linear Experts.
In the second case, we assume a logistic model for each expert, so the conditional distribution of the label can be written as
(5)   p(y \mid x, z = i; \beta_i) = \frac{1}{1 + \exp(-y\, \beta_i^\top x)},
where the label y takes values in \{-1, +1\}. For future reference, we will refer to this special case as a Mixture of Logistic Experts.
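To make the generative model above concrete, the following is a minimal sampling sketch for the linear-expert case (4); the function name, array layout, and use of NumPy are our own illustrative choices and are not taken from the paper.

```python
import numpy as np

def sample_mixture_of_linear_experts(n, W, B, seed=None):
    """Sample n pairs (x, y) from the model above with k linear experts.

    W: (k, d) gating parameters w_1, ..., w_k; B: (k, d) expert parameters beta_1, ..., beta_k.
    x ~ N(0, I_d); z | x ~ softmax(Wx); y | x, z ~ N(beta_z^T x, 1).
    """
    rng = np.random.default_rng(seed)
    k, d = W.shape
    X = rng.standard_normal((n, d))                        # unit spherical Gaussian inputs
    logits = X @ W.T
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)              # softmax gate p(z = i | x)
    Z = np.array([rng.choice(k, p=p) for p in probs])      # latent expert assignments
    Y = np.sum(X * B[Z], axis=1) + rng.standard_normal(n)  # y = beta_z^T x + unit Gaussian noise
    return X, Y, Z
```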
Maximum Likelihood Loss: Given the assumed data distribution of (x, y), our goal is to find the set of feasible parameters w and β that maximize the log-likelihood. For ease of notation we define θ := (w, β) as the concatenation of all the parameters. Specifically, from (2), the expected log-likelihood is given by

\mathbb{E}_{x,y}\big[\log p(x, y; \theta)\big] = \mathbb{E}_{x,y}\big[\log p(x)\big] + \mathbb{E}_{x,y}\Big[\log \sum_{i=1}^{k} p(z = i \mid x; w)\, p(y \mid x, z = i; \beta_i)\Big].
Since only the last term of the sum depends on the parameters, the negative log-likelihood objective, L(θ), that we aim to minimize can be written as
(6)   L(\theta) := -\,\mathbb{E}_{x,y}\Big[\log \sum_{i=1}^{k} p(z = i \mid x; w)\, p(y \mid x, z = i; \beta_i)\Big].
Note that, as we will discuss in detail, for both mixtures of linear and logistic experts the above objective function is non-convex with respect to θ. In fact, this is generally true for Mixtures of Gaussians, Mixtures of Regressions, and Mixtures of Experts. In the next section we discuss the use of the Expectation-Maximization (EM) algorithm for solving this optimization problem.
3 EM for Mixtures of Experts
Next, we present the Expectation-Maximization (EM) algorithm for MoE. EM takes a structured approach to minimizing the objective given in (6). Each iteration of the EM algorithm is decomposed into two steps as follows. The first step is the "expectation" step: for the current parameter estimate θ^t, we compute the expectation of the complete-data log-likelihood with respect to the hidden variables, using the current parameter estimate (the parameter-free term log p(x) is dropped), and denote its negative by Q(θ | θ^t), i.e.,
(7)   Q(\theta \mid \theta^t) := -\,\mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x, y; \theta^t}\big[\log p(y, z \mid x; \theta)\big]\Big], \qquad p(y, z \mid x; \theta) = p(z \mid x; w)\, p(y \mid x, z; \beta).
Then, in the second step, called "maximization," we simply minimize the objective Q(θ | θ^t) with respect to θ (equivalently, maximize the expected complete-data log-likelihood) and obtain our new parameter iterate as
(8)   \theta^{t+1} = \arg\min_{\theta \in \Theta} Q(\theta \mid \theta^t).
Since \log p(y, z \mid x; \theta) = \log p(z \mid x; w) + \log p(y \mid x, z; \beta), the EM objective (7) is linearly separable in the parameters w and β. Thus, we can rewrite Q(θ | θ^t) as the sum of two functions that depend only on w and β respectively. Subsequently, the EM update (8) is obtained as the concatenation θ^{t+1} = (w^{t+1}, β^{t+1}), where

w^{t+1} = \arg\min_{w}\; -\,\mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x, y; \theta^t}\big[\log p(z \mid x; w)\big]\Big]
\qquad\text{and}\qquad
\beta^{t+1} = \arg\min_{\beta}\; -\,\mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x, y; \theta^t}\big[\log p(y \mid x, z; \beta)\big]\Big]
are both strongly convex minimization problems for the special cases of mixture of linear experts and mixture of logistic experts.
EM has two well-understood characteristics: (a) its update always minimizes an objective that upper bounds the negative log-likelihood, and (b) fixed points of the EM update are also stationary points of the negative log-likelihood. We show this below. The original objective, L(θ), can be decomposed into the difference between the EM objective and another function that is bounded below by its value at θ^t. Specifically, for any θ, θ^t ∈ Θ,

L(\theta) = -\,\mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x,y;\theta^t}\big[\log p(y, z \mid x; \theta)\big]\Big] + \mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x,y;\theta^t}\big[\log p(z \mid x, y; \theta)\big]\Big].

Denoting H(\theta \mid \theta^t) := -\,\mathbb{E}_{x,y}\big[\mathbb{E}_{z \mid x,y;\theta^t}[\log p(z \mid x, y; \theta)]\big], it follows that

(9)   L(\theta) = Q(\theta \mid \theta^t) - H(\theta \mid \theta^t),

where H(\theta \mid \theta^t) is bounded below by H(\theta^t \mid \theta^t). Thus, Q(\theta \mid \theta^t) - H(\theta^t \mid \theta^t) acts as an upper bound on the negative log-likelihood.
Next, applying Jensen's inequality shows that H(\cdot \mid \theta^t) is minimized at \theta = \theta^t. Consequently, the gradient of the negative log-likelihood matches that of the surrogate EM objective at \theta^t, i.e., \nabla L(\theta^t) = \nabla_\theta Q(\theta \mid \theta^t)\big|_{\theta = \theta^t}. This implies that any stationary point of L is also a fixed point of the EM update, and vice versa.
3.1 EM for Symmetric Mixture of 2-Experts
We define the special case of Symmetric Mixture of Experts (SMoE), the focus of our analysis in the following sections. SMoE is a simplified version of MoE where (i) the number of experts is restricted to k = 2, with the latent variable represented as z ∈ {−1, +1}, and (ii) the experts are symmetric around the linear separator, i.e., β_1 = −β_2 =: β. This symmetric structure simplifies the probability density functions introduced earlier, making the subsequent analysis easier to follow. We explore these simplifications in detail below.
As we are restricted to two experts, the expression for the gating probability (3) reduces to

p(z = 1 \mid x; w_1, w_2) = \frac{\exp(w_1^\top x)}{\exp(w_1^\top x) + \exp(w_2^\top x)} = \frac{1}{1 + \exp\!\big(-(w_1 - w_2)^\top x\big)}.

For ease of notation, we define w := w_1 − w_2, identify the two experts with the labels z ∈ {−1, +1}, and re-parameterize the probability mass function of z given x as

(10)   p(z \mid x; w) = \frac{1}{1 + \exp(-z\, w^\top x)} = \sigma(z\, w^\top x).
Thus, under this simplification, the EM update of the gating parameter w is now given as

w^{t+1} = \arg\min_{w}\; -\,\mathbb{E}_{x,y}\Big[\gamma^t(x, y)\, \log \sigma(w^\top x) + \big(1 - \gamma^t(x, y)\big)\, \log \sigma(-w^\top x)\Big],

where \gamma^t(x, y) := p(z = 1 \mid x, y; \theta^t) denotes the posterior responsibility of the first expert under the current iterate.
While the above minimization problem is strongly convex, it does not have a closed-form solution. In our synthetic and small-scale proof-of-concept real-world experiments, we use gradient descent to obtain w^{t+1}.
For the special case of a symmetric mixture of linear experts (SMoLinE), the expression for p(y \mid x, z; \beta) given in (4) simplifies to

(11)   p(y \mid x, z; \beta) = \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{(y - z\,\beta^\top x)^2}{2}\Big).
Under this simplification, the EM update of the expert parameter β is now obtained in closed form as

\beta^{t+1} = \arg\min_{\beta}\; \mathbb{E}_{x,y}\Big[\mathbb{E}_{z \mid x,y;\theta^t}\Big[\tfrac{1}{2}\big(y - z\, \beta^\top x\big)^2\Big]\Big] = \mathbb{E}\big[x x^\top\big]^{-1}\, \mathbb{E}_{x,y}\Big[\big(2\gamma^t(x, y) - 1\big)\, y\, x\Big],

which equals \mathbb{E}_{x,y}\big[(2\gamma^t(x, y) - 1)\, y\, x\big] since x \sim \mathcal{N}(0, I_d).
Similarly, for symmetric mixtures of logistic experts (SMoLogE) with y ∈ {−1, +1}, the expression for p(y \mid x, z; \beta) given in (5) simplifies to

(12)   p(y \mid x, z; \beta) = \frac{1}{1 + \exp(-y\, z\, \beta^\top x)} = \sigma(y\, z\, \beta^\top x).
Under this simplification, the EM update of the expert parameter β is now given as

\beta^{t+1} = \arg\min_{\beta}\; -\,\mathbb{E}_{x,y}\Big[\gamma^t(x, y)\, \log \sigma(y\, \beta^\top x) + \big(1 - \gamma^t(x, y)\big)\, \log \sigma(-y\, \beta^\top x)\Big].
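To make the above updates concrete, here is a minimal finite-sample sketch of one EM iteration for SMoLinE, in which the population expectations are replaced by sample averages; the function names, the number of inner gradient steps for the gate, and the step size are our own illustrative choices, and the responsibilities are computed from Bayes' rule under the model above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_sigmoid(t):
    return -np.logaddexp(0.0, -t)  # numerically stable log(sigmoid(t))

def em_step_smoline(X, y, w, beta, gate_steps=50, gate_lr=0.1):
    """One finite-sample EM iteration for SMoLinE: z in {-1, +1},
    p(z | x) = sigmoid(z w^T x), y | x, z ~ N(z beta^T x, 1)."""
    s = X @ beta
    # E-step: responsibilities gamma = p(z = +1 | x, y) from the two unnormalized log-weights.
    log_p_plus = log_sigmoid(X @ w) - 0.5 * (y - s) ** 2
    log_p_minus = log_sigmoid(-(X @ w)) - 0.5 * (y + s) ** 2
    gamma = sigmoid(log_p_plus - log_p_minus)
    # M-step for beta: the responsibility-weighted least-squares problem has the
    # closed-form solution (X^T X)^{-1} X^T [(2 gamma - 1) y].
    beta_new = np.linalg.solve(X.T @ X, X.T @ ((2.0 * gamma - 1.0) * y))
    # M-step for w: weighted logistic regression, solved approximately by gradient descent.
    w_new = w.copy()
    for _ in range(gate_steps):
        w_new -= gate_lr * X.T @ (sigmoid(X @ w_new) - gamma) / len(y)
    return w_new, beta_new
```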
4 Main Result
In the previous section, we derived the EM update for both MoE and SMoE as the solution to minimizing the EM objective in (7). We further demonstrated that this solution can be decomposed into the concatenation of the respective solutions to two minimization sub-problems. In this section, we show that this update is exactly equivalent to performing a single step of the Mirror Descent (MD) update on L(θ) with a unit step size and the KL divergence as a regularizer.
To illustrate more clearly that minimizing Q(θ | θ^t) in (7) corresponds to an MD step on the loss L at the point θ^t, we first provide a brief overview of the core concept behind the MD update.
In most gradient-based methods, the next iterate is obtained by minimizing an upper bound of the objective function. For example, in Gradient Descent (GD), the next iterate is found by minimizing the first-order Taylor expansion of the objective at θ^t, together with a squared-norm regularizer. Specifically, for minimizing the loss L, the GD update with step size η at θ^t is equivalent to minimizing the following quadratic function:

\theta^{t+1} = \arg\min_{\theta}\; \Big\{ L(\theta^t) + \langle \nabla L(\theta^t), \theta - \theta^t \rangle + \frac{1}{2\eta}\, \|\theta - \theta^t\|_2^2 \Big\}.

This quadratic indeed serves as an upper bound for L if the step size η is at most the inverse of the Lipschitz constant of ∇L.
Mirror Descent solves a similar sub-problem in which, instead of a squared-norm regularizer, we employ a Bregman divergence regularizer. The Bregman divergence induced by a differentiable, convex function Φ measures the difference between Φ and its first-order approximation at the second argument, i.e.,

(13)   D_{\Phi}(\theta, \theta') := \Phi(\theta) - \Phi(\theta') - \langle \nabla \Phi(\theta'), \theta - \theta' \rangle.
Thus, the iterations of MD are derived by minimizing the following expression:
(14)   \theta^{t+1} = \arg\min_{\theta}\; \Big\{ L(\theta^t) + \langle \nabla L(\theta^t), \theta - \theta^t \rangle + \frac{1}{\eta}\, D_{\Phi}(\theta, \theta^t) \Big\}.
As mentioned, this scheme is reasonable if the first-order approximation together with the Bregman divergence serves as an upper bound for the function L. This is ensured when the step size is sufficiently small and the condition of relative smoothness is satisfied.
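For intuition, setting the gradient of the objective in (14) to zero yields the familiar mirror descent optimality condition, which recovers the GD update when the mirror map is the squared Euclidean norm:

\nabla \Phi(\theta^{t+1}) \;=\; \nabla \Phi(\theta^{t}) - \eta\, \nabla L(\theta^{t}),
\qquad \text{so for } \Phi(\theta) = \tfrac{1}{2}\|\theta\|_2^2: \quad \theta^{t+1} = \theta^{t} - \eta\, \nabla L(\theta^{t}).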
In the upcoming theorem, we formally establish that for the Symmetric Mixture of Linear Experts (SMoLinE) and the Symmetric Mixture of Logistic Experts (SMoLogE) models, performing the EM update defined in equation (8) is exactly equivalent to solving the subproblem associated with a single step of MD, as defined in equation (14), with a step size of η = 1. This result demonstrates that the EM update in these specific models is essentially performing an MD step on L(θ). The proof of this result is quite technical and is deferred to Appendix A.
Theorem 4.1.
For SMoLinE and SMoLogE, there is a mirror map Φ such that the EM update in (8) simplifies and is equivalent to

(15)   \theta^{t+1} = \arg\min_{\theta}\; \Big\{ \langle \nabla L(\theta^t), \theta - \theta^t \rangle + D_{\Phi}(\theta, \theta^t) \Big\},
where the divergence function is equal to the KL divergence on the complete data:

(16)   D_{\Phi}(\theta, \theta^t) = \mathrm{KL}\big(p(x, y, z; \theta^t)\,\big\|\,p(x, y, z; \theta)\big).
In particular, in the case of SMoLinE,
(17) |
while in the case of SMoLogE,
(18) |
Finally, in both cases, the map Φ is strictly convex in θ and L is 1-smooth relative to Φ.
As the loss L is 1-smooth relative to Φ, this validates the choice of η = 1 in the Mirror Descent update and enables the subsequent convergence results from the MD literature.
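For reference, 1-relative smoothness in the sense of [LFN17] means that for all θ_1, θ_2,

L(\theta_1) \;\leq\; L(\theta_2) + \langle \nabla L(\theta_2), \theta_1 - \theta_2 \rangle + D_{\Phi}(\theta_1, \theta_2),

so the MD subproblem (15) with unit step size minimizes a global upper bound on L at every iteration.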
4.1 Discussion of Main Result
The results show that the EM update (8) for MoE is equivalent to mirror descent with a unit step-size and KL divergence regularizer on the complete data distribution. We offer the following additional remarks.
First, as shown in the proof of Theorem 4.1 in Appendix A.1, it is evident from (30) and (34) that the complete data distribution p(x, y, z; θ) does not belong to an exponential family for either SMoLinE or SMoLogE. Therefore, this result does not simply follow as a corollary of [KKS22, Proposition 1], but stands as an independent finding, introducing another class of latent variable models for which EM is equivalent to MD. Moreover, it suggests the possibility of a broader class of latent variable models (beyond those whose complete data lie in an exponential family) for which EM can be generalized to MD.
Second, if we have oracle access to the EM updates for w and β, EM requires no hyper-parameters, unlike GD, which is sensitive to the step size. This is especially advantageous for SMoLinE, where the β-update has a closed-form solution, making EM's benefits over GD more evident. Additionally, while GD regularizes progress based on the Euclidean distance between iterates, EM adjusts progress based on the divergence between the induced probability distributions across iterations. This is often more suitable for latent variable models, where small Euclidean changes may cause large shifts in the mixture distribution, and vice versa.
Finally, whereas previous analyses of EM in various settings hinged on different types of arguments, ranging from verifying hard-to-check conditions to direct proofs that are difficult to reproduce at scale, the connection to MD greatly unifies the analysis. In particular, as we discuss in Section 5, MD enjoys general yet powerful convergence guarantees. Notably, some of these guarantees stand out from previous analyses of MoE because they do not hinge on problem-specific constants or the choice of parameterization.
5 Convergence Analysis
This section presents robust convergence guarantees for EM on SMoLinE and SMoLogE, starting with a review of Mirror Descent guarantees for context. We then examine how these conditions relate to concrete statistical metrics and the true model parameters.
5.1 Convergence Guarantees
Building on prior work [LFN17], we contextualize the convergence properties of MD for SMoLinE and SMoLogE. Before presenting the result, we briefly review key concepts. We say the EM iterates are initialized in a locally convex region of L if there exists a convex set B containing θ^0 such that for all θ_1, θ_2 ∈ B,

(19)   L(\theta_1) \;\geq\; L(\theta_2) + \langle \nabla L(\theta_2), \theta_1 - \theta_2 \rangle.

Furthermore, L is called α-strongly convex relative to Φ on B if, for all θ_1, θ_2 ∈ B,

(20)   L(\theta_1) \;\geq\; L(\theta_2) + \langle \nabla L(\theta_2), \theta_1 - \theta_2 \rangle + \alpha\, D_{\Phi}(\theta_1, \theta_2).
The corollary that follows provides conditions for convergence of EM to (1) a stationary point in the KL divergence, (2) the true parameters at a sub-linear rate, and (3) the true parameters at a linear rate. The proof is adapted from [KKS22, Proposition 2, Corollary 1, and Corollary 3] and [LFN17, Theorem 3.1]; we include it in Appendix B.1 for completeness.
Corollary 5.1 (Convergence of EM).
For SMoLinE and SMoLogE with mirror map Φ given in (17) and (18) respectively, and denoting θ^* the true parameters, the EM iterates satisfy:
1) Stationarity. With no additional conditions,
(21)
2) Sub-linear rate to θ^*. If θ^0 is initialized in B, a locally convex region of L containing θ^*, then
(22)
3) Linear rate to θ^*. If θ^0 is initialized in B, a locally strongly convex region of L relative to Φ that contains θ^*, then
(23)
As noted in the literature, EM's convergence is sensitive to initialization. If θ^0 is initialized within a locally convex region of L, the EM iterates for the MoE problem converge sub-linearly to the true parameters. However, if θ^0 lies in a region where L is strongly convex relative to Φ, the iterates converge linearly. This assumption is weaker than those in prior work, which typically require θ^0 to be initialized in a locally strongly convex region.
5.2 Satisfiability of Conditions
The above result raises an important question: when does a locally convex or relatively strongly convex region of L containing θ^* exist? Interestingly, this is closely tied to the Signal-to-Noise Ratio (SNR). Before exploring this connection, we first introduce the concept of the Missing Information Matrix (MIM), introduced in [OW72].
The MIM quantifies the amount of information the pair (x, y) holds about the latent expert label z given parameters θ, and it is formally defined as
(24) |
Here, is the Fisher information matrix of the complete data distribution and is the Fisher information matrix of the conditional distribution of the latent unobserved variable given the observed ones, denoted by
Due to the pairwise independence of the coordinates of x and its rotational invariance, there exists an orthonormal matrix under which the former Fisher information matrix is positive semi-definite and diagonal and the latter is symmetric positive semi-definite (see Appendix B.2). As such, the MIM in (24) is also symmetric and positive semi-definite. Note that the MIM quantifies the difficulty of estimating the parameters when only (x, y) are observed. To understand its significance, consider the following: large eigenvalues of the MIM indicate that (x, y) contain little information about the true value of the latent variable z, making estimation more difficult. Conversely, small eigenvalues suggest that (x, y) provide enough information to effectively constrain the possible values of z. Thus, the MIM can be seen as analogous to the Signal-to-Noise Ratio (SNR).
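For background, and in standard notation that is not tied to the paper's equation (24), the classical missing information principle of [OW72] decomposes the Fisher information of the observed data as the complete-data information minus the information carried by the latent variable given the observations:

\mathcal{I}_{x,y}(\theta) \;=\; \mathcal{I}_{x,y,z}(\theta) \;-\; \mathcal{I}_{z \mid x,y}(\theta).

Since all three matrices are positive semi-definite, the eigenvalues of \mathcal{I}_{x,y,z}(\theta)^{-1}\, \mathcal{I}_{z \mid x,y}(\theta), a natural matrix-valued measure of the fraction of missing information, lie in [0, 1]; large values correspond to the low-information (low-SNR) regime described above.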
In the theorem below, we show how the eigenvalues of the MIM are related to the satisfiability of the conditions of Corollary 5.1 regarding the relative strong convexity of L with respect to the mirror map Φ.
Theorem 5.2.
The above result states that if the eigenvalue condition on the MIM holds for all parameters in B, then the EM updates converge linearly to θ^*. This offers a unified framework for analyzing EM for MoE, linking the rate of convergence to a classical statistical metric. To determine whether EM achieves linear or sub-linear convergence, we need to understand the behavior of the eigenvalues of the MIM, which indicates the existence and size of the local region where EM enjoys such convergence. The following theorem is technical, and its proof is provided in Appendix B.3.
Theorem 5.3.
For SMoLinE, the eigenvalues of belong to the set
(26) |
and , is given as the expectation over of a function that is decreasing as a function of :
(27) |
Similarly, for SMoLogE, the eigenvalues of belong to the set
(28) |
and is given as the expectation over of a function that is decreasing as a function of :
(29) |
The theorem above provides an upper bound for as a function of and . The advantage of this bound is that it can be reliably estimated from the observed data . To gain an intuition for how (27) and (29) scale with , consider a simplified scenario where is parallel to with probability 1. In this case, it can be shown that
and as a result . Substituting back into the definition of the MIM and using the tight bounds on the eigenvalues of , we can see that in a best-case scenario the maximum eigenvalue of the MIM is . When this upper bound lies in the required interval, Theorem 5.2 and part 3) of Corollary 5.1 guarantee that our parameter iterates converge at a linear rate. Moreover, when the SNR is large, we can expect the EM iterates to lie in a region that is strongly convex for L relative to Φ. For further interpretation of this result, we provide a more thorough discussion in Appendix B.3.
6 Experiments
In this section, we empirically validate our theoretical results by comparing the performance of EM with Gradient EM (Algorithm 1) and Gradient Descent (Algorithm 2). Recall that EM for MoE obtains its next parameter iterate as the concatenation of the solutions of two minimization problems; instead, Gradient EM obtains its next parameter iterate as the concatenation of a single gradient update on each of the respective EM sub-problems. This differs from Gradient Descent (GD), which obtains its next parameter iterate via a gradient update on the negative log-likelihood objective L(θ). We evaluate these methods on both a synthetic dataset and the real-world Fashion MNIST dataset, consistently showing significant improvements for EM and Gradient EM over GD. Note that our aim is not to achieve state-of-the-art accuracy, but to demonstrate that EM is more suitable than GD for fitting these specific models.
Synthetic Dataset. We created the synthetic dataset so as to simulate a population setting of SMoLinE. We sampled data points from an SMoLinE with known additive unit-variance Gaussian noise and randomly drawn true parameters. Subsequently, we run EM, Gradient EM, and GD and report the results on the training set averaged over several instances, each time re-sampling the true parameters, the initial parameters, and the whole dataset. The initial parameters are randomly initialized within a neighborhood of the true parameters and are shared across all benchmarks.
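As an illustration of this setup (not the authors' code), the following sketch compares EM with full-batch GD on a synthetic SMoLinE instance. It assumes the sigmoid and em_step_smoline helpers from the sketch in Section 3.1 are in scope; the sample size, dimension, step size, and the closed-form posterior and gradients used inside gd_step are our own derivations and choices for this model.

```python
import numpy as np

# Assumes `sigmoid` and `em_step_smoline` from the Section 3.1 sketch are in scope.
rng = np.random.default_rng(0)
n, d = 10_000, 5

# Ground-truth SMoLinE: z in {-1, +1} with p(z = 1 | x) = sigmoid(w*^T x), y = z beta*^T x + noise.
w_star = rng.standard_normal(d)
beta_star = 2.0 * rng.standard_normal(d)
X = rng.standard_normal((n, d))
z = np.where(rng.random(n) < sigmoid(X @ w_star), 1.0, -1.0)
y = z * (X @ beta_star) + rng.standard_normal(n)

def gd_step(X, y, w, beta, lr=0.2):
    """One full-batch GD step on the sample negative log-likelihood of SMoLinE."""
    gamma = sigmoid(X @ w + 2.0 * y * (X @ beta))                 # posterior p(z = 1 | x, y)
    grad_w = X.T @ (sigmoid(X @ w) - gamma) / len(y)
    grad_beta = X.T @ (X @ beta - (2.0 * gamma - 1.0) * y) / len(y)
    return w - lr * grad_w, beta - lr * grad_beta

# Common initialization in a neighborhood of the true parameters.
w_em = w_gd = w_star + 0.5 * rng.standard_normal(d)
b_em = b_gd = beta_star + 0.5 * rng.standard_normal(d)
for it in range(50):
    w_em, b_em = em_step_smoline(X, y, w_em, b_em)
    w_gd, b_gd = gd_step(X, y, w_gd, b_gd)
print("EM error:", np.linalg.norm(b_em - beta_star), "GD error:", np.linalg.norm(b_gd - beta_star))
```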
[Figure 1: Objective value versus iteration for EM, Gradient EM, and GD on the synthetic SMoLinE data.]
[Figure 2: Distance to the true parameters versus iteration for EM, Gradient EM, and GD on the synthetic SMoLinE data.]
Figure 1 shows the objective function progress. EM requires fewer iterations to fit the mixture compared to both Gradient EM and GD, with Gradient EM also outperforming GD in fitting time. Figure 2 illustrates the progress toward recovering the true SMoLinE parameters. Once again, EM requires significantly fewer iterations to fit the mixture compared to both Gradient EM and GD, with Gradient EM also taking considerably less time than GD.
Overall, we observe that all three algorithms exhibit a linear convergence rate, both in optimizing the objective function and fitting the true parameters. This aligns with our theoretical results for MoE and is consistent with findings for Mixtures of Gaussians and Mixtures of Linear Regression in high SNR scenarios.
Validation Experiment on Fashion MNIST. For the small-scale proof-of-concept experiment on Fashion MNIST [XRV17], we alter the dataset so as to simulate a mixture of logistic experts. To do this, we simply perform an inversion transformation on the images at random with a fixed probability. Effectively, the transformation inverts the images from a white article of clothing on a black background to a black article of clothing on a white background. As can be seen in Table 1, a single expert on the original Fashion MNIST dataset reaches an accuracy of 83.2% on the test set. Meanwhile, the single expert cannot achieve better than an accuracy of 10.2% on the altered dataset. This suggests that a 2-component MoLogE is appropriate for fitting the altered dataset, so long as the ground-truth partitioning is linear in image space.
The 2-component MoLogE to be trained consists of one linear gating layer and two logistic experts. We randomly initialize each linear layer to have unit norm and execute the algorithms on the same datasets with the same initializations. For Gradient EM, the only additional code needed over GD is to define the EM loss function appropriately and then perform a gradient step on the gating parameters and the expert parameters separately, as described in Algorithm 1. For EM, at each iteration we perform several gradient steps in an inner loop to approximately recover the solutions to the sub-problems described in (8). We report our findings in Table 2 and Figure 3.
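For illustration only, a PyTorch-style sketch of the random-inversion transform and a 2-component MoLogE with a linear gate might look as follows; the inversion probability of 0.5, the layer shapes, and all class and function names are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomInvert:
    """With probability p, invert a [0, 1]-scaled image tensor (white-on-black -> black-on-white)."""
    def __init__(self, p=0.5):
        self.p = p
    def __call__(self, img):
        return 1.0 - img if torch.rand(()) < self.p else img

class MoLogE(nn.Module):
    """Mixture of logistic (softmax) experts with a linear gate."""
    def __init__(self, d=28 * 28, n_classes=10, n_experts=2):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, n_classes) for _ in range(n_experts))

    def forward(self, x):
        x = x.flatten(1)
        log_gate = F.log_softmax(self.gate(x), dim=1)                        # log p(z | x)
        log_experts = torch.stack(
            [F.log_softmax(e(x), dim=1) for e in self.experts], dim=1
        )                                                                    # log p(y | x, z)
        return torch.logsumexp(log_gate.unsqueeze(-1) + log_experts, dim=1)  # log p(y | x)

def nll(model, x, y):
    """Negative log-likelihood of the mixture, used directly by GD."""
    return F.nll_loss(model(x), y)
```

Training with GD then amounts to minimizing nll jointly over the gate and expert parameters, whereas EM and Gradient EM instead form posterior responsibilities p(z | x, y) and update the gate and the experts on their separate sub-objectives.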
In Table 2, we report the final test accuracy and cross-entropy loss of EM, Gradient EM, and GD after fitting a 2-component MoLogE on the altered Fashion MNIST dataset, averaged over several instances. We see that EM achieves a much improved final test accuracy, nearly recovering the single-expert accuracy of 83.2% on the original, unaltered Fashion MNIST dataset. Meanwhile, Gradient EM also registers an improvement over GD. In Figure 3, we report the progress made on the test accuracy and the objective function over the iterations, averaged over several instances. As in our synthetic experiment, EM takes considerably fewer iterations to fit the mixture than both Gradient EM and GD, and Gradient EM in turn takes considerably less time to fit the mixture than GD.
Table 1: Single-expert test accuracy on the original and randomly inverted Fashion MNIST.

| | Accuracy | Random Invert |
| --- | --- | --- |
| Single Expert | 83.2% | No |
| Single Expert | 10.2% | Yes |
Table 2: Final test accuracy and cross-entropy loss of EM, Gradient EM, and GD on the altered Fashion MNIST dataset.

| | Accuracy | Cross Entropy |
| --- | --- | --- |
| EM | 79.2% | 0.801 |
| Gradient EM | 65.7% | 1.30 |
| Gradient Descent | 61.9% | 1.33 |
[Figure 3: Test accuracy and test objective versus iteration for EM, Gradient EM, and GD on the altered Fashion MNIST dataset.]
7 Conclusion
In this paper, we addressed the problem of learning Mixtures of Experts (MoE), focusing on the use of the EM method. We first showed that the EM update for MoE can be interpreted as a Mirror Descent step on the negative log-likelihood with a unit step size and a KL divergence regularizer, extending the result of [KKS22] beyond the exponential family. Building on this, we characterized different convergence rates for EM in this setting under various assumptions on the log-likelihood function and specified when these assumptions hold. We also compared EM with gradient descent empirically and showed that EM outperforms gradient descent in both convergence rate and final performance.
References
- [BWY17] Sivaraman Balakrishnan, Martin J. Wainwright and Bin Yu “Statistical guarantees for the EM algorithm: From population to sample-based analysis” In The Annals of Statistics 45.1 Institute of Mathematical Statistics, 2017, pp. 77–120 DOI: 10.1214/16-AOS1435
- [CDWGL22] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu and Yuanzhi Li “Towards Understanding Mixture of Experts in Deep Learning”, 2022 arXiv: https://arxiv.org/abs/2208.02813
- [DTZ17] Constantinos Daskalakis, Christos Tzamos and Manolis Zampetakis “Ten Steps of EM Suffice for Mixtures of Two Gaussians” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 704–710 URL: https://proceedings.mlr.press/v65/daskalakis17b.html
- [DLR77] A. P. Dempster, N. M. Laird and D. B. Rubin "Maximum Likelihood from Incomplete Data Via the EM Algorithm" In Journal of the Royal Statistical Society: Series B (Methodological) 39.1, 1977, pp. 1–22 DOI: 10.1111/j.2517-6161.1977.tb01600.x
- [DHKWJ18] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J Wainwright and Michael I Jordan “Theoretical guarantees for EM under misspecified Gaussian mixture models” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018 URL: https://proceedings.neurips.cc/paper_files/paper/2018/file/acc21473c4525b922286130ffbfe00b5-Paper.pdf
- [FZS22] William Fedus, Barret Zoph and Noam Shazeer “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity” In J. Mach. Learn. Res. 23.1 JMLR.org, 2022
- [Hin+12] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath and Brian Kingsbury “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups” In IEEE Signal Processing Magazine 29.6, 2012, pp. 82–97 DOI: 10.1109/MSP.2012.2205597
- [JJ94] Michael I. Jordan and Robert A. Jacobs "Hierarchical Mixtures of Experts and the EM Algorithm" In Neural Computation 6.2, 1994, pp. 181–214
- [JX95] Michael I. Jordan and Lei Xu “Convergence results for the EM approach to mixtures of experts architectures” In Neural Networks 8.9, 1995, pp. 1409–1431 DOI: https://doi.org/10.1016/0893-6080(95)00014-3
- [KKS22] Frederik Kunstner, Raunak Kumar and Mark Schmidt “Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent”, 2022 arXiv: https://arxiv.org/abs/2011.01170
- [KC20] Jeongyeol Kwon and Constantine Caramanis “EM Converges for a Mixture of Many Linear Regressions” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research PMLR, 2020, pp. 1727–1736 URL: https://proceedings.mlr.press/v108/kwon20a.html
- [KC20a] Jeongyeol Kwon and Constantine Caramanis “The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians” In Proceedings of Thirty Third Conference on Learning Theory 125, Proceedings of Machine Learning Research PMLR, 2020, pp. 2425–2487 URL: https://proceedings.mlr.press/v125/kwon20a.html
- [KHC21] Jeongyeol Kwon, Nhat Ho and Constantine Caramanis “On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 130, Proceedings of Machine Learning Research PMLR, 2021, pp. 1405–1413 URL: https://proceedings.mlr.press/v130/kwon21b.html
- [KQCCD18] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen and Damek Davis “Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression” arXiv, 2018 DOI: 10.48550/ARXIV.1810.05752
- [LFN17] Haihao Lu, Robert M. Freund and Yurii Nesterov “Relatively-Smooth Convex Optimization by First-Order Methods, and Applications”, 2017 arXiv: https://arxiv.org/abs/1610.05708
- [MZYCHC18] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong and Ed Chi “Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts”, 2018, pp. 1930–1939 DOI: 10.1145/3219819.3220007
- [MOKV19] Ashok Vardhan Makkuva, Sewoong Oh, Sreeram Kannan and Pramod Viswanath “Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms”, 2019 arXiv: https://arxiv.org/abs/1802.07417
- [OW72] Terence Orchard and Max A. Woodbury "A Missing Information Principle: Theory and Applications" In Volume 1: Theory of Statistics University of California Press, 1972, pp. 697–716 DOI: 10.1525/9780520325883-036
- [Tse04] Paul Tseng “An Analysis of the EM Algorithm and Entropy-Like Proximal Point Methods” In Mathematics of Operations Research 29.1 INFORMS, 2004, pp. 27–44 URL: http://www.jstor.org/stable/30035630
- [Wu83] C. F. J. Wu "On the Convergence Properties of the EM Algorithm" In The Annals of Statistics 11.1 Institute of Mathematical Statistics, 1983, pp. 95–103 DOI: 10.1214/aos/1176346060
- [XRV17] Han Xiao, Kashif Rasul and Roland Vollgraf “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms”, 2017 arXiv: https://arxiv.org/abs/1708.07747
- [YWG12] Seniha Yuksel, Joseph Wilson and Paul Gader “Twenty Years of Mixture of Experts” In Neural Networks and Learning Systems, IEEE Transactions on 23, 2012, pp. 1177–1193 DOI: 10.1109/TNNLS.2012.2200299
Appendix A EM is Mirror Descent for SMoLogE and SMoLinE
In this section, we provide the full and detailed proof of the main result, Theorem 4.1. For ease of comprehension and in the hope that this will provide useful insights into other types of non-exponential family mixtures for which EM is also connected to MD, we prove our result following the same general ideas as that of [KKS22, Proposition 1]. We split the proof into two parts (SMoLinE and SMoLogE) which can be found in Appendix A.1 and A.2.
For ease of reading, we re-state the theorem below:
Theorem 4.1: For SMoLinE and SMoLogE, there is a mirror map such that the EM update in (8) simplifies and is equivalent to
where the divergence function is equal to the divergence on the complete data:
In particular, in the case of SMoLinE,
while in the case of SMoLogE,
Finally, in both cases, the map is strictly convex in and is -smooth relative to .
A.1 Proof of Theorem 4.1 for SMoLinE
Proof.
Recall that we consider a 2-component SMoLinE (see Section 3.1), where z ∈ {−1, +1} is the latent unobserved variable, and
1) x \sim \mathcal{N}(0, I_d),
2) p(z \mid x; w) = \sigma(z\, w^\top x),
3) p(y \mid x, z; \beta) = \frac{1}{\sqrt{2\pi}} \exp\!\big(-(y - z\,\beta^\top x)^2/2\big).
We begin by deriving a near exponential form of the complete data probability density function :
Thus we have recovered the decomposition,
(30) |
where in , the feature variable cannot be linearly separated from the parameter :
At this point, we pause and discuss the implications of the obtained form. First we recall that for a random variable to belong to an exponential family, it must satisfy
for some functions that are called the normalization function, sufficient statistics, natural parameters, and log-partition function, respectively. To clarify, we note that 1) the natural parameters must be a function of the model parameters only and cannot depend on the data, and 2) the sufficient statistics must be a function of the data only and cannot depend on the model parameters. In other words, the data and the parameters must be linearly separated inside the exponent. With the above in mind, we see that SMoLinE is not an exponential family. Further, we remark that the above formulation does not extend beyond the symmetric setting of the Mixture of Linear Experts; with more than two experts, the relationship becomes non-linear. This turns out to be problematic for showing that EM is equivalent to MD in that case. Lastly, note that taking the expectation over the latent variable yields,
where the above follows from and is not a function of the parameter . With the obtained form (30), we now continue with the proof.
Part a): Show EM is MD, i.e., .
Taking the appropriate expectation, the EM objective can be written as
where . As a consequence, it is also true that
(31) |
Continuing, we use the above to simplify the expression for that will subsequently give us the MD loss:
where we first add and subtract the appropriate term; the next step follows from (31), and the last step follows from the decomposition derived in Section 3.
Finally, the first part of our result follows trivially as
Part b): Show
This result follows simply from decomposing as follows:
where follows from the fact that minimizes . To see this, we use Jensen’s inequality:
Finally, taking the derivative with respect to and setting equal to completes the proof:
Part c): Show is -smooth relative to .
The function is said to be -smooth relative to if for all , it holds that
Recall the following from Section 3. The objective function is related to the EM objective by (9),
where and for all . Consequently, it then holds that for all ,
(32) | ||||
(33) |
Recall also from part a) that . Then, the claim follows naturally from the above as follows:
It follows that is -smooth relative to .
Part d): and the MD objective is convex with respect to .
Here, we will show that the mirror descent objective is strongly convex in . It is important for this to hold so that the iterations of MD are well-defined; the minimizer of a strongly convex objective exists and is unique.
Note that the mirror descent objective, , is strongly convex in if is strongly convex in . Therefore, since given in (17) is twice continuously differentiable, it is strongly convex with respect to if and only if , for some . We begin:
where the last line follows from the assumption that is sampled from a unit spherical Gaussian distribution: for .
From the above, we see that is strictly convex, and it is strongly convex if for some . This follows from Lemma B.2 where we show its eigenvalues are bounded below by . Thus, it holds that
Restricting the feasible set to be all with for some , it holds that is strongly convex with respect to on .
With part d) proven, this concludes the proof of Theorem 4.1 for SMoLinE. We now prove the same for SMoLogE, referring to this section where necessary.
∎
A.2 Proof of Theorem 4.1 for SMoLogE
Proof.
Recall that we consider a 2-component SMoLogE (see Section 3.1), where z ∈ {−1, +1} is the latent unobserved variable, y ∈ {−1, +1}, and
1) x \sim \mathcal{N}(0, I_d),
2) p(z \mid x; w) = \sigma(z\, w^\top x),
3) p(y \mid x, z; \beta) = \sigma(y\, z\, \beta^\top x).
We begin by deriving a near exponential form of the complete data probability density function :
Thus we have recovered the decomposition,
(34) |
where in , cannot be linearly separated from the parameter :
Similar to SMoLinE, we can see that does not belong to an exponential family of distribution. Also, note that taking the expectation of over yields,
where the above follows from and is not a function of the parameter . We now continue with the proof.
From here on, the proofs of part a), part b) and part c) follow identically from that of SMoLinE, so we will refer to Appendix A.1 for those proofs. We will now show part d).
Part d): and the MD objective is convex with respect to .
Here, we will show that the mirror descent objective is strongly convex in . It is important for this to hold so that the iterations of MD are well-defined; the minimizer of a strongly convex objective exists and is unique.
Note that the mirror descent objective, , is strongly convex in if is strongly convex in . Therefore, since the map given in (18) is twice continuously differentiable, it is strongly convex with respect to if and only if , for some . We begin:
From the above, we see that is strictly convex, and it is strongly convex if and for some . This follows from Lemma B.2 where we show their respective eigenvalues are bounded below by, and . Thus, it holds that
Restricting to be all with for some , it holds that is strongly convex with respect to on .
With part d) proven, this concludes the proof of Theorem 4.1 for SMoLogE.
∎
Appendix B Convergence Analysis
In this section, we provide the proofs of Corollary 5.1, Theorem 5.2 and Theorem 5.3. As well, in the last subsection, we discuss the consequences of these results.
B.1 Convergence Properties of Mirror Descent: Corollary 5.1
In this section, we provide a contextual proof for Corollary 5.1 which follows trivially from previous work on MD (see [KKS22, Proposition 2, Corollary 1, and Corollary 3] and [LFN17, Theorem 3.1] for a comprehensive overview of the many convergence properties of MD). The proof is divided into three parts that correspond to each of the three sub-results of the corollary.
For ease of reading, we re-state the Corollary below:
Corollary 5.1: For SMoLinE and SMoLogE with mirror map Φ given in (17) and (18) respectively, and denoting θ^* the true parameters, the EM iterates satisfy:
1) Stationarity. With no additional conditions,
(35)
2) Sub-linear rate to θ^*. If θ^0 is initialized in B, a locally convex region of L containing θ^*, then
(36)
3) Linear rate to θ^*. If θ^0 is initialized in B, a locally strongly convex region of L relative to Φ that contains θ^*, then
(37)
Proof.
The proof is divided into three parts that correspond to each of the three sub-results of the corollary.
Part 1): Stationarity.
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Proposition 2]. We write it below for completeness.
Recall from Theorem 4.1 that is obtained as the minimizer of the convex objective, (14):
As such, differentiating and setting equal to , it holds that satisfies
(38) |
Further, by the above together with relative smoothness, it holds that
Thus we have shown that
(39) |
The claim then follows from taking the mean over iterations:
Part 2): Sub-linear Rate to .
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Corollary 1] and [LFN17, Theorem 3.1]. We write it below for completeness.
Here, we assume that is convex on the set . In part 1), we used (38) to show,
where the right hand side is non-positive since the Bregman divergence is non-negative if the inducing function is convex – which it is. Now, starting from relative smoothness, we see that
where i) follows from the convexity of L on the set B. Subsequently, we apply the three-point lemma and obtain,
(40) |
Finally, the result follows from summing the left and right hand side over iterations:
Part 3): Linear Rate to .
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Corollary 3] and [LFN17, Theorem 3.1]. We write it below for completeness.
In addition to convexity, we now assume that is -strongly convex relative to on the set . Specifically, we have that for any ,
(41) |
Using the three point lemma again, we have
B.2 Proof of Theorem 5.2
In this section, we provide the complete and detailed proof of Theorem 5.2. Before we begin, we make the additional useful remark that the respective Fisher information matrices introduced in Section 5 satisfy the following equalities:
(42) | ||||
(43) |
Thus, it also holds that the MIM is a function of and , i.e.,
(44) |
For ease of reading, we re-state the theorem below:
Theorem 5.2: For SMoLinE and SMoLogE and their respective mirror mappings (17) and (18), the objective is -strongly convex relative to the mirror map on the convex set if and only if
Proof.
Recall from Section 5 that is strongly convex relative to on if for all , it holds that
For and twice continuously differentiable, it was shown by [LFN17] that this is equivalent to the following bound on the Hessian:
Now, using , we see that
Therefore, since is symmetric positive definite (proven in Appendix A), our condition simplifies to
Finally, for from a unit spherical Gaussian distribution, we know that is symmetric positive-definite (see Lemma B.1). As a result, the above inequality is equivalent to the following bound on the eigenvalues of the MIM:
∎
We now provide the simple Lemma that the MIM is a symmetric matrix for SMoLinE and SMoLogE.
Lemma B.1 ( is symmetric).
For SMoLinE and SMoLogE, the MIM is a symmetric matrix, i.e. .
Proof.
Recall the assumption that is sampled from a unit spherical Gaussian distribution: . As such, for any orthonormal matrix , we know that ; this is called rotational invariance of the Gaussian distribution. Thus, consider orthonormal , such that where is the vector with a at index . Now, we can observe that for any , , is diagonalizable by an orthonormal matrix :
where in the above, 1) holds by the rotational invariance of the Gaussian distribution under orthonormal matrices, and 2) the off-diagonal elements evaluate to zero because, for all pairs of distinct indices, the corresponding zero-mean random variables are independent.
Finally, we put the above together to show is symmetric. We define the block diagonal orthonormal matrix as
From the above, it follows that both for SMoLinE and SMoLogE, is a diagonal matrix. We can now use this change of basis matrix to show the MIM is a symmetric matrix:
∎
B.3 Proof of Theorem 5.3
For ease of reading, we re-state the theorem below:
For SMoLinE, the eigenvalues of belong to the set
and , is given as the expectation over of a function that is decreasing as a function of :
Similarly, For SMoLogE, the eigenvalues of belong to the set
and is given as the expectation over of a function that is decreasing as a function of :
Proof.
We divide the proof into two parts. In the first part, we consider the SMoLinE setting and, in the second part, we consider the SMoLogE setting.
Part a): SMoLinE.
For , recall that is has the following form:
By Lemma B.2, we see that can be diagonalized into the following form:
where is an orthonormal rotation matrix and and . Therefore, has three eigenvalues given as . It follows that has the form
Therefore, has three eigenvalues given as .
Now, for , we first derive a more compact form for the conditional distribution of the latent variable, . From simple Bayes rule and algebraic manipulation, we see that
Now that we have this simplified form, we are able to derive (27) for :
This expression depends only on the random variable and the parameter iterate .
Part b): SMoLogE.
Recall that for SMoLogE, has the following form:
By Lemma B.2, we see that can be diagonalized into the following form:
where is an orthonormal rotation matrix and , , , and . Therefore, has four eigenvalues given as
. It follows that has the form
Therefore, has four eigenvalues given as .
Now, for , we first derive a more compact form for the conditional distribution of the latent variable, . From simple Bayes rule and algebraic manipulation, we see that
Now that we have this simplified form, we are able to derive (29) for :
This expression depends only on the random variable and the parameter iterate . ∎
Lemma B.2.
For and and , the symmetric positive definite matrix
(45) |
is diagonalizable by an orthonormal matrix and has two eigenvalues, , that satisfy
(46) | ||||
(47) |
Proof.
Recall that Gaussian random variables are rotationally invariant. Specifically, for orthonormal matrix and , it follows that . Moreover, . Using this notion, we will 1) diagonalize (45), then 2) evaluate the eigenvalues of (45) as the diagonal elements.
Consider the orthonormal rotation matrix that is such that where is the canonical vector of . Using this change of basis matrix, we can now obtain the diagonal matrix,
It has only two eigenvalues given in closed form as
The rest of the proof is spent evaluating tight lower and upper bounds on in terms of .
Part a): Bounds for .
For the probability density function is bounded above by : . Then we can upper bound as follows:
For the lower bounds, we will use the fact that for and . Then, we can lower bound as follows:
Therefore, it holds that .
Part b): Bounds for .
For the probability density function is bounded above by : . Then we can upper bound as follows:
For the lower bounds, we will use the fact that for and . Then, we can lower bound as follows:
Therefore, it holds that . ∎
B.4 Existence of Locally Convex Region
In this section, we further discuss the consequences of Corollary 5.1, Theorem 5.2, and Theorem 5.3.
To begin, we will discuss the sufficient condition for Gradient Descent to converge linearly to the true parameters and compare it to that of EM. For Gradient Descent, it is well understood that if is initialized in a convex set that contains and where is strongly convex, i.e.
(48) |
then the parameter iterates converge linearly to . However, as we have shown for SMoLinE and SMoLogE, the sufficient condition for EM to converge linearly to is slightly different. Instead, we require to satisfy that is strongly convex relative to , i.e.,
(49) |
Interestingly, if it holds that is -smooth, we see that (49) is weaker than (48), i.e.,
(50) |
But more interestingly, as long as is -smooth for some , it will hold that any set satisfying (48) for some will also satisfy (49) with . We note that the converse holds for strongly convex. In summary, for SMoLinE and SMoLogE, EM's sufficient condition for a linear rate is strictly weaker than that of Gradient Descent when the mirror map is convex but not strongly convex.
Next, we will further discuss the implications of Theorem 5.3. In the theorem, we obtain clear lower and upper bounds for the eigenvalues of . However, it is not clear how to do the same for as the trick to rotate the axis with an orthonormal matrix to simplify the expression will not work here because the distribution of the vector is not invariant to rotation. Still, there are some things that we can say for special cases. For this discussion, we will constrain ourselves to SMoLogE. However, the same lines of reasoning also apply to SMoLinE. First we recall from Theorem 5.3 that
From here, one easy way to approach bounding the above is to 1) recall that any outer product of the form has a single eigenvalue given as , and 2) the inner product between two vectors is equal to the product of their norms and the cosine of the angle between them, i.e. . Thus, denoting to be the angle between and the vector , we obtain
Denoting , we can write the above expectation as
Subsequently, the idea is bound in a way that makes integration easy.
The case where is fairly easy. Under this scenario, and can be upper-bounded as
Now, recall that , , , and see that
Subsequently, together with the fact that , it holds that the eigenvalues of the MIM are upper-bounded by
This special case is closely related to the case where is parallel to ; a similar approach will work.
Appendix C Experiments
In this section, we provide explicit formulations of Gradient EM and Gradient Descent, as they relate to the experiments we perform in Section 6.
Gradient EM. Whereas EM performs the global minimization of the EM objective given in (7), Gradient EM (see Algorithm 1) obtains its next parameter iterate as the concatenation of two gradient updates on the sub objectives,
(51) | |||
(52) |
Gradient Descent. Gradient descent obtains its next iterate as the global minimizer of the first-order approximation of L at θ^t plus a quadratic regularizer, i.e.,

\theta^{t+1} = \arg\min_{\theta}\; \Big\{ L(\theta^t) + \langle \nabla L(\theta^t), \theta - \theta^t \rangle + \frac{1}{2\eta}\, \|\theta - \theta^t\|_2^2 \Big\}.
Differentiating and setting the gradient to zero yields the well-known gradient update (53).
(53)   \theta^{t+1} = \theta^{t} - \eta\, \nabla L(\theta^{t}).
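For concreteness, the following sketch is our own instantiation of one Gradient EM iteration for SMoLinE: the responsibilities are frozen at the current iterate and a single gradient step is taken on each of the two sub-objectives (51)-(52). The step sizes are arbitrary, and Algorithm 1 in the paper may differ in such details.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient_em_step_smoline(X, y, w, beta, lr_gate=0.2, lr_expert=0.2):
    """One Gradient EM iteration for SMoLinE: compute the responsibilities at the
    current iterate, then take a single gradient step on each EM sub-objective."""
    # E-step: responsibilities gamma = p(z = +1 | x, y) at the current (w, beta).
    gamma = sigmoid(X @ w + 2.0 * y * (X @ beta))
    # Gradient of the gate sub-objective (a responsibility-weighted logistic regression).
    grad_w = X.T @ (sigmoid(X @ w) - gamma) / len(y)
    # Gradient of the expert sub-objective (a responsibility-weighted least-squares problem).
    grad_beta = X.T @ (X @ beta - (2.0 * gamma - 1.0) * y) / len(y)
    return w - lr_gate * grad_w, beta - lr_expert * grad_beta
```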