
Learning Mixtures of Experts with EM

Quentin Fruytier Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA     {[email protected], [email protected], [email protected]}    Aryan Mokhtari    Sujay Sanghavi
Abstract

Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate “expert” model trained on each partition. Recently, MoE have become popular as components in today’s large language models as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent on the log-likelihood. In this paper we focus on studying the efficiency of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for the cases of linear or logistic experts, where we show that EM is equivalent to Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence based on the signal-to-noise ratio (SNR). Experiments on synthetic and (small-scale) real-world data show that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.

1 Introduction

Mixtures of Experts (MoE)[JJ94] are a crucial class of parametric latent variable models that have gained significant popularity in deep learning for their ability to reduce both training and inference costs [CDWGL22]. MoE are particularly effective when the feature space can be divided and processed by specialized models, known as experts. Instead of relying on a single, large model to handle all input-output mappings, MoE utilize an ensemble of specialized experts. Each expert is responsible for a specific subset of the input space, allowing the system to efficiently route inputs to the most appropriate expert. This partitioning enables each expert to focus on mapping its designated inputs to outputs using a separate, optimized model. By leveraging multiple specialized experts rather than a monolithic model, MoE can achieve greater scalability and flexibility. This modular approach not only enhances computational efficiency but also allows for improved performance, as each expert can be finely tuned to handle its particular segment of the input space effectively. Real-world applications of Mixture of Experts span across various domains, including language translation, speech recognition, recommendation systems, and more [FZS22, MZYCHC18, Hin+12].

Training an MoE model involves training both (a) the parameters in the individual experts, and (b) the gating function that routes inputs to the appropriate expert. Typically, these are both learnt jointly by first formulating the final loss function as applied to the ensembled output, and then minimizing this joint loss function (via SGD or its variants) over the parameters of the gate and the experts.

In this paper we investigate, primarily from a theoretical perspective, the training of MoE using a classic algorithm: Expectation Maximization (EM). As opposed to SGD-based methods which are agnostic to whether a parameter is in the gate or in an expert, EM first formulates two separate problems – one for the router, and another for the experts – in a specific way. It then solves each problem in isolation and in parallel, and then collates the outputs to arrive at the updated set of gate and expert parameters.

For our theoretical results, we consider two simple instances of MoE models: Mixture of Linear Experts (where each expert is a linear regressor) and Mixture of Logistic Experts (where each expert is a logistic regressor). The router in each case is a linear softmax.

Main Contributions:

  • 1)

    A recent but cornerstone result in EM [KKS22] showed that for exponential families, EM corresponds to Mirror Descent with unit step size and a KL divergence regularizer. MoE models (including the ones we focus on) are generally not exponential families; however, we show that this correspondence nonetheless still holds.

  • 2)

    Armed with this perspective, we study our linear and logistic MoE when there are two experts. We provide sufficient conditions for EM to converge to the true model parameters at sub-linear or linear rates; we also characterize under what parameter regimes these conditions will hold. To our knowledge, no similar such characterizations are known for SGD-based methods.

  • 3)

    Finally, we show, on both synthetic and small-scale proof-of-concept real-world datasets, that EM outperforms gradient descent both in terms of convergence rate and the achieved performance.

1.1 Related Work

The EM algorithm [DLR77] is a powerful method for fitting latent variable models. Previous research on EM has demonstrated that, under mild smoothness assumptions, its parameter iterates converge to a stationary point of the log-likelihood objective [DLR77, Wu83, Tse04]. Subsequent research on EM introduced new analytical frameworks to provide specialized guarantees, particularly regarding the convergence of EM iterates to the true model parameters and the rate of this convergence. An adopted framework in the past decade, introduced by [BWY17], interprets EM as a variant of the gradient descent algorithm. For latent variable models with a strongly convex EM objective that satisfies a condition known as “first-order stability,” it was shown that EM iterates converge linearly to the true parameters. Subsequent works utilized this framework to show a local linear rate for Mixtures of Gaussians and Mixtures of Linear Regressions [BWY17, DTZ17, DHKWJ18, KQCCD18, KC20a, KHC21, KC20].

In a recent work, [KKS22] proved that EM – for exponential families – is equivalent to the mirror descent algorithm with unit step-size and Kullback–Leibler (KL) divergence regularizer:

\text{KL}\left[p(\bm{x};\bm{\theta})\,\|\,p(\bm{x};\bm{\phi})\right]=\mathbb{E}_{X}\left[\log\left(\frac{p(\bm{x};\bm{\theta})}{p(\bm{x};\bm{\phi})}\right)\right]. (1)

This led to the first non-asymptotic convergence rates for EM, independent of parameterization. While this characterization of EM covers Mixtures of Gaussians, it does not extend to Mixtures of Regressions or Mixtures of Experts.

There have also been works attempting to understand how to fit MoE. [JX95] showed that for MoE with a strongly convex negative log-likelihood objective, EM iterations converge linearly to the true parameters. However, the objective is generally nonconvex, raising doubts about whether the necessary assumptions for their result hold, even locally. Other works have remarked that the nature of the gating function creates a form of competition between the experts during training that can lead to local minima. Bayesian methods for MoE include variational learning and maximum a posteriori (MAP) estimation. However, [YWG12] notes that these solutions are not trivial, since the softmax gate does not admit a conjugate prior, and they are prone to getting stuck at local minima. [MOKV19] analyzed a variant of the EM algorithm for MoE in which the expert parameters are first recovered using a tensor decomposition method. They proved that, with proper initialization, the true parameters are recovered at a nearly linear rate.

2 Mixture of Experts

Next, we formally describe the setting under consideration for the Mixture of k Experts (MoE).

Data Generation Model: First, the input or feature vector \bm{x}\in\mathbb{R}^{d} is sampled from a probability density function p(\bm{x}). Second, given the feature vector \bm{x}, a latent variable z\in[k], responsible for routing \bm{x} to the appropriate expert, is sampled with probability mass function P(z|\bm{x}). Finally, given the pair (\bm{x},z), the output or label y is generated according to the probability distribution p(y|\bm{x},z). Hence, the complete data distribution is p(\bm{x},y,z)=p(y|\bm{x},z)P(z|\bm{x})p(\bm{x}) and the joint input-output probability distribution can be written as

p(\bm{x},y)=p(\bm{x})\sum_{z\in[k]}p(y|\bm{x},z)P(z|\bm{x}). (2)

This is the most general form of data generation under a mixture of k experts, but here we focus on the case where \bm{x} is sampled from a unit spherical Gaussian distribution, i.e., \bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}), and P(z|\bm{x})=P(z|\bm{x};\bm{w}^{*}) is parameterized by a set of vectors (\bm{w}_{1}^{*},\dots,\bm{w}_{k}^{*}) and can be written as

P(z=i|\bm{x};\bm{w}^{*})=\frac{e^{\bm{x}^{\top}\bm{w}_{i}^{*}}}{\sum_{j\in[k]}e^{\bm{x}^{\top}\bm{w}_{j}^{*}}},\qquad i\in[k]. (3)

In other words, the probability mass function of the latent variable is the softmax of the inner products of \bm{x} with the parameter vectors, concatenated as \bm{w}^{*}=(\bm{w}^{*}_{1},\dots,\bm{w}^{*}_{k})\in\mathbb{R}^{d\times k}.

Regarding p(y|\bm{x},z), the probability distribution of the output variable y conditioned on the input variable \bm{x} and the latent routing variable z\!=\!i (i.e., expert i), we also assume that it is parameterized by a vector \bm{\beta}_{i}^{*}, so that p(y|\bm{x},z\!=\!i)=p(y|\bm{x},z\!=\!i;\bm{\beta}_{i}^{*}). Thus, for the concatenated vector \bm{\beta}^{*}=(\bm{\beta}^{*}_{1},\dots,\bm{\beta}^{*}_{k})\in\mathbb{R}^{s\times k}, we write p(y|\bm{x},z)=p(y|\bm{x},z;\bm{\beta}^{*}).

To aid the reader's understanding in later sections, we focus on two special cases of experts. In the first case, we take

p(y|\bm{x},z=i;\bm{\beta}_{i}^{*})\propto\exp\left\{-\frac{(y-\bm{x}^{\top}\bm{\beta}_{i}^{*})^{2}}{2}\right\} (4)

to be (proportional to) the density function of a normal distribution. This is equivalent to assuming that the output is y=\bm{x}^{\top}\bm{\beta}_{i}^{*}+\epsilon, where \epsilon is additive zero-mean Gaussian noise with unit variance. For future reference, we will refer to this special case as a Mixture of Linear Experts.

In the second case, we assume a logistic model for each expert. Thus, the probability mass function p(y|\bm{x},z=i;\bm{\beta}_{i}^{*}) can be written as

P(y=1|\bm{x},z=i;\bm{\beta}_{i}^{*})=\frac{\exp(\bm{x}^{\top}\bm{\beta}_{i}^{*})}{1+\exp(\bm{x}^{\top}\bm{\beta}_{i}^{*})} (5)

and indeed P(y=-1|\bm{x},z\!=\!i;\bm{\beta}_{i}^{*})=1-P(y=1|\bm{x},z\!=\!i;\bm{\beta}_{i}^{*}). For future reference, we will refer to this special case as a Mixture of Logistic Experts.
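To make the generative process concrete, the following is a minimal sketch (our own code, not the authors') that samples from the model above for either expert type; the function name sample_moe and the column-wise parameter layout are our assumptions.

```python
import numpy as np

def sample_moe(n, W, B, expert="linear", rng=None):
    """Sample n points from the MoE generative model described above.

    W : (d, k) gating parameters w_1*, ..., w_k* stored as columns.
    B : (d, k) expert parameters beta_1*, ..., beta_k* stored as columns.
    expert : "linear" (y = x^T beta_z + N(0,1)) or "logistic" (y in {-1, +1}).
    """
    rng = np.random.default_rng(rng)
    d, k = W.shape
    X = rng.standard_normal((n, d))              # x ~ N(0, I_d)
    logits = X @ W                               # (n, k) gating scores x^T w_i
    gate = np.exp(logits - logits.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)      # softmax P(z = i | x; w*)
    Z = np.array([rng.choice(k, p=p) for p in gate])
    means = np.einsum("nd,dn->n", X, B[:, Z])    # x^T beta_z for each sample
    if expert == "linear":
        Y = means + rng.standard_normal(n)       # additive unit Gaussian noise
    else:
        p1 = 1.0 / (1.0 + np.exp(-means))        # P(y = 1 | x, z; beta_z)
        Y = np.where(rng.random(n) < p1, 1.0, -1.0)
    return X, Y, Z
```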

Maximum Likelihood Loss: Given the assumed data distribution of (\bm{x},y), our goal is to find the set of feasible parameters \bm{\beta}\in\mathbb{R}^{d\times k} and \bm{w}\in\mathbb{R}^{d\times k} that maximize the log-likelihood. For ease of notation we define \bm{\theta}:=(\bm{\beta},\bm{w})\in\mathbb{R}^{2d\times k} as the concatenation of all the parameters. Specifically, from (2), the expected log-likelihood is given by

\mathbb{E}_{\bm{X}}\left[\log p(\bm{x})\right]+\mathbb{E}_{\bm{X},Y}\left[\log\left(\sum_{z\in[k]}p(y|\bm{x},z)P(z|\bm{x})\right)\right].

Since only the second term depends on the parameters \bm{\theta}=(\bm{\beta},\bm{w}), the negative log-likelihood objective function \mathcal{L}(\bm{\theta}) that we aim to minimize can be written as

-\mathbb{E}_{\bm{X},Y}\left[\log\left(\sum_{z\in[k]}p(y|\bm{x},z;\bm{\beta})P(z|\bm{x};\bm{w})\right)\right] (6)

Note that, as we will discuss in detail, for both mixtures of linear and logistic experts the above objective function is known to be nonconvex with respect to \bm{\theta}. In fact, this is generally true for Mixtures of Gaussians, Mixtures of Regressions, and Mixtures of Experts. In the next section we discuss the use of the Expectation-Maximization (EM) algorithm for solving this optimization problem.
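For concreteness, the finite-sample counterpart of (6) for a two-expert mixture of linear experts, which is the objective that gradient descent operates on directly, can be written as follows. This is a minimal sketch in our own notation, with parameters stored column-wise; it is not the authors' implementation.

```python
import numpy as np

def neg_log_lik_linear_moe(theta, X, Y):
    """Empirical version of the objective (6) for a 2-expert mixture of
    linear experts with softmax gating.

    theta = (B, W) with B, W of shape (d, 2) holding expert / gate parameters.
    """
    B, W = theta
    gate_logits = X @ W                                   # (n, 2) gating scores
    log_gate = gate_logits - np.logaddexp(gate_logits[:, :1], gate_logits[:, 1:])
    resid = Y[:, None] - X @ B                            # y - x^T beta_i
    log_expert = -0.5 * resid**2 - 0.5 * np.log(2 * np.pi)
    # log sum_z p(y|x,z;beta) P(z|x;w), computed stably per sample
    log_mix = np.logaddexp(log_expert[:, 0] + log_gate[:, 0],
                           log_expert[:, 1] + log_gate[:, 1])
    return -log_mix.mean()
```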

3 EM for Mixtures of Experts

Next, we present the Expectation-Maximization (EM) algorithm for MoE. EM takes a structured approach to minimizing the objective \mathcal{L}(\bm{\theta}) given in (6). Each iteration of the EM algorithm is decomposed into two steps as follows. The first step is called “expectation”: for the current parameter estimate \bm{\theta}^{t}, we compute the expectation of the complete-data log-likelihood with respect to the hidden variables, using the current parameter estimate \bm{\theta}^{t}, and denote it by Q(\bm{\theta}|\bm{\theta}^{t}), i.e.,

Q(\bm{\theta}|\bm{\theta}^{t})=-\mathbb{E}_{X,Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}[\log p(\bm{x},y,z;\bm{\theta})]\right]. (7)

Then, in the second step, called “maximization”, we simply minimize the objective Q(\bm{\theta}|\bm{\theta}^{t}) (equivalently, maximize -Q(\bm{\theta}|\bm{\theta}^{t})) with respect to \bm{\theta}\in\Omega and obtain our new parameter iterate as

\bm{\theta}^{t+1}:=\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}Q(\bm{\theta}|\bm{\theta}^{t}). (8)

Since \log p(y,z|\bm{x};\bm{\theta})=\log p(y|z,\bm{x};\bm{\beta})+\log P(z|\bm{x};\bm{w}), the EM objective (7) is separable in the parameters \bm{\beta} and \bm{w}. Thus, we can re-write Q(\bm{\theta}|\bm{\phi}) as the sum of two functions that depend only on \bm{\beta} and \bm{w}, respectively. Subsequently, the EM update (8) is obtained as the concatenation \bm{\theta}^{t+1}=(\bm{w}^{t+1},\bm{\beta}^{t+1}) where

\bm{w}^{t+1}=\operatorname*{arg\,min}_{\bm{w}\in\mathbb{R}^{d\times k}}-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log P(z|\bm{x};\bm{w})\right]\right]
\bm{\beta}^{t+1}=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{d\times k}}-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log p(y|z,\bm{x};\bm{\beta})\right]\right]

are both strongly convex minimization problems for the special cases of mixture of linear experts and mixture of logistic experts.
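Both sub-problems are driven by the same E-step quantities, namely the posteriors P(z|\bm{x},y;\bm{\theta}^{t}). As a concrete illustration, here is a minimal sketch (ours, not the authors' code) of this posterior computation for a two-expert mixture of linear experts; the column-wise parameter layout is our assumption.

```python
import numpy as np

def e_step_posteriors(B, W, X, Y):
    """E-step for a 2-expert mixture of linear experts: responsibilities
    P(z = i | x, y; theta^t) used by both M-step sub-problems."""
    gate_logits = X @ W                               # (n, 2), x^T w_i
    log_expert = -0.5 * (Y[:, None] - X @ B)**2       # log p(y|x,z=i;beta_i) + const
    log_joint = gate_logits + log_expert              # log of gate * expert, up to a per-sample const
    log_joint -= log_joint.max(axis=1, keepdims=True) # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)     # (n, 2) posteriors
```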

EM has two well-understood characteristics: (a) its update minimizes an objective that is an upper bound on the negative log-likelihood, and (b) fixed points of the EM update are also stationary points of the negative log-likelihood. We show this below. The original objective function \mathcal{L}(\bm{\theta}) can be decomposed as the difference between the EM objective and another function that is bounded below by 0. Specifically, for any \bm{\theta},\bm{\phi}\in\Omega,

\mathcal{L}(\bm{\theta})=-\mathbb{E}_{\bm{X},Y}\left[\log p(y|\bm{x};\bm{\theta})\right]
=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\phi}}\left[\log p(y|\bm{x};\bm{\theta})\right]
=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\phi}}\left[\log\left(\frac{p(y,z|\bm{x};\bm{\theta})}{p(z|y,\bm{x};\bm{\theta})}\right)\right].

Denoting H(\bm{\theta}|\bm{\phi}):=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\phi}}\left[\log p(z|\bm{x},y;\bm{\theta})\right], it follows that

\mathcal{L}(\bm{\theta})=Q(\bm{\theta}|\bm{\phi})-H(\bm{\theta}|\bm{\phi}), (9)

where H(\bm{\theta}|\bm{\phi}) is bounded below by 0. Thus, Q(\bm{\theta}|\bm{\phi}) acts as an upper bound on the negative log-likelihood.

Next, applying Jensen's inequality shows that, for fixed \bm{\phi}, the function \bm{\theta}\mapsto H(\bm{\theta}|\bm{\phi}) is minimized at \bm{\theta}=\bm{\phi}, so its gradient vanishes there. Consequently, the gradient of the negative log-likelihood matches that of the surrogate EM objective at \bm{\phi}=\bm{\theta}, i.e., \nabla\mathcal{L}(\bm{\theta})=\nabla Q(\bm{\theta}|\bm{\theta}). This implies that any stationary point of \mathcal{L}(\bm{\theta}) is also a fixed point of the EM algorithm and vice versa.

3.1 EM for Symmetric Mixture of 2-Experts

We define the special case of the Symmetric Mixture of Experts (SMoE), the focus of our analysis in the following sections. SMoE is a simplified version of MoE where (i) the number of experts is restricted to 2, represented as z\in\{-1,1\}, and (ii) the experts are symmetric around the linear separator, i.e., \bm{\tilde{\beta}}^{*}:=\bm{\beta}_{1}^{*}=-\bm{\beta}_{-1}^{*}. This symmetric structure simplifies the probability density functions introduced earlier, making the subsequent analysis easier to follow. We explore these simplifications in detail below.

As we are restricted to two experts, the expression for P(z=1|\bm{x};\bm{w}^{*})=1-P(z=-1|\bm{x};\bm{w}^{*}) is

P(z=1|\bm{x};\bm{w}^{*})=\frac{e^{\bm{x}^{\top}\bm{w}_{1}^{*}}}{e^{\bm{x}^{\top}\bm{w}_{1}^{*}}+e^{\bm{x}^{\top}\bm{w}_{-1}^{*}}}.

For ease of notation, we define \bm{\tilde{w}}^{*}:=\bm{w}_{1}^{*}-\bm{w}_{-1}^{*} and re-parameterize the probability mass function of z given \bm{x} as

P(z|\bm{x};\bm{\tilde{w}}^{*})=\frac{\exp\{\frac{z+1}{2}\bm{x}^{\top}\bm{\tilde{w}}^{*}\}}{1+e^{\bm{x}^{\top}\bm{\tilde{w}}^{*}}}. (10)

Thus, under this simplification, the EM update of the gating parameter \bm{w} is now given as

\bm{w}^{t+1}=\operatorname*{arg\,min}_{\bm{w}\in\mathbb{R}^{d}}\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log\left(\frac{1+e^{\bm{x}^{\top}\bm{w}}}{e^{\frac{z+1}{2}\bm{x}^{\top}\bm{w}}}\right)\right].

While the above minimization problem is strongly convex, it does not have a closed-form solution. In our synthetic and small-scale proof-of-concept real-world experiments, we use gradient descent to obtain \bm{w}^{t+1}.

For the special case of a symmetric mixture of linear experts (SMoLinE), the expression for p(y|\bm{x},z;\bm{\beta}^{*}) given in (4) can be simplified as

p(y|\bm{x},z;\bm{\beta}^{*})\propto\exp\left\{-\frac{(y-z\bm{x}^{\top}\bm{\tilde{\beta}}^{*})^{2}}{2}\right\}. (11)

Under this simplification, the EM update of the expert parameter \bm{\beta} is now obtained more compactly as

\bm{\beta}^{t+1}=\mathbb{E}_{X,Y}\left[\left(2p(z=1|\bm{x},y;\bm{\theta}^{t})-1\right)\bm{x}y\right].

Similarly, for a symmetric mixture of logistic experts (SMoLogE) with y\in\{-1,1\}, the expression for P(y|\bm{x},z;\bm{\beta}^{*}) given in (5) can also be simplified as

P(y|\bm{x},z;\bm{\beta}^{*})=\frac{\exp\left\{\left(\frac{yz+1}{2}\right)\bm{x}^{\top}\bm{\tilde{\beta}}^{*}\right\}}{1+e^{\bm{x}^{\top}\bm{\tilde{\beta}}^{*}}}. (12)

Under this simplification, the EM update of the expert parameter \bm{\beta} is now given as

\bm{\beta}^{t+1}=\operatorname*{arg\,min}_{\bm{\beta}\in\mathbb{R}^{d}}\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y,\bm{\theta}^{t}}\left[\log\left(\frac{1+e^{\bm{x}^{\top}\bm{\beta}}}{e^{\frac{yz+1}{2}\bm{x}^{\top}\bm{\beta}}}\right)\right]\right].
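Putting the pieces of this subsection together, the following is a minimal finite-sample sketch (our own code, not the authors' implementation) of one EM iteration for SMoLinE. It uses the posterior P(z=1|\bm{x},y;\bm{\theta}^{t})=\sigma(\bm{x}^{\top}\bm{\tilde{w}}^{t}+2y\bm{x}^{\top}\bm{\tilde{\beta}}^{t}), the population identity \mathbb{E}[\bm{x}\bm{x}^{\top}]=\bm{I}_{d} for the closed-form \bm{\beta}-update, and a few inner gradient steps for the \bm{w}-sub-problem, as described above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def em_step_smoline(beta, w, X, Y, inner_steps=50, lr=0.1):
    """One EM iteration for SMoLinE (illustrative sketch).

    E-step: p_i = P(z=1 | x_i, y_i; theta^t) = sigmoid(x_i^T w + 2 y_i x_i^T beta).
    M-step for beta: closed form, relying on E[x x^T] = I as in the population update.
    M-step for w: a few gradient steps on the strongly convex sub-problem.
    """
    p = sigmoid(X @ w + 2.0 * Y * (X @ beta))            # E-step responsibilities
    beta_new = np.mean((2.0 * p - 1.0)[:, None] * Y[:, None] * X, axis=0)
    w_new = w.copy()
    for _ in range(inner_steps):                         # inner GD on the w sub-problem
        grad = X.T @ (sigmoid(X @ w_new) - p) / len(Y)
        w_new -= lr * grad
    return beta_new, w_new
```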

4 Main Result

In the previous section, we derived the EM update for both MoE and SMoE as the solution to minimizing the EM objective in (7). We further demonstrated that this solution can be decomposed into the concatenation of the respective solutions to two minimization sub-problems. In this section, we will show that this update is exactly equivalent to performing a single step of the Mirror Descent (MD) update on (𝜽)\mathcal{L}(\bm{\theta}) with a unit step size and the KL divergence as a regularizer.

To illustrate more clearly that minimizing Q(𝜽|𝜽t)Q(\bm{\theta}|\bm{\theta}^{t}) in (7) corresponds to a MD step on the loss (𝜽)\mathcal{L}(\bm{\theta}) at the point 𝜽t\bm{\theta}^{t}, we first provide a brief overview of the core concept behind the MD update.

In most gradient-based methods, the next iterate is obtained by minimizing an upper bound of the objective function. For example, in Gradient Descent (GD), the next iterate is found by minimizing the first-order Taylor expansion of the objective at \bm{\theta}^{t}, with a squared-norm regularizer. Specifically, for minimizing the loss \mathcal{L}(\cdot), the GD update with step size \eta at \bm{\theta}^{t} is equivalent to minimizing the following quadratic function:

\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+\frac{1}{2\eta}\|\bm{\theta}-\bm{\theta}^{t}\|_{2}^{2}.

This function indeed serves as an upper bound for \mathcal{L}(\cdot) if \eta\leq 1/L, where L is the Lipschitz constant of \nabla\mathcal{L}(\bm{\theta}).

Mirror Descent solves a similar sub-problem where, instead of a squared-norm regularizer, we employ a Bregman divergence regularizer: the Bregman divergence induced by a differentiable, convex function h:\mathbb{R}^{d}\to\mathbb{R} measures the difference between h(\bm{\theta}) and its first-order approximation at \bm{\theta}^{t}, i.e.,

D_{h}(\bm{\theta},\bm{\theta}^{t})=h(\bm{\theta})-h(\bm{\theta}^{t})-\langle\nabla h(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle. (13)

Thus, the iterations of MD are derived by minimizing the following expression:

\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+\frac{1}{\eta}D_{h}(\bm{\theta},\bm{\theta}^{t}). (14)

As mentioned, this scheme is reasonable if the function approximation using the Bregman divergence serves as an upper bound for the function (𝜽)\mathcal{L}(\bm{\theta}). This can be ensured when the step size η\eta is sufficiently small, and the condition of relative smoothness is satisfied.
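When \Omega=\mathbb{R}^{2d} and h is differentiable and strictly convex, the minimizer of (14) is characterized by \nabla h(\bm{\theta}^{t+1})=\nabla h(\bm{\theta}^{t})-\eta\nabla\mathcal{L}(\bm{\theta}^{t}), which is the usual "mirror" form of the update. As a standard, purely illustrative example (using the negative-entropy mirror map on the probability simplex, not the mirror map A(\bm{\theta}) derived below for MoE), a single MD step can be sketched as follows; the code is our own.

```python
import numpy as np

def mirror_descent_step_simplex(theta, grad, eta):
    """One MD step of the form (14) with h(theta) = sum_i theta_i log theta_i
    (negative entropy) on the probability simplex. Minimizing
    <grad, theta> + (1/eta) D_h(theta, theta_t) yields the multiplicative update."""
    unnormalized = theta * np.exp(-eta * grad)   # exp(grad h(theta_t) - eta*grad), up to a constant
    return unnormalized / unnormalized.sum()     # Bregman projection back onto the simplex
```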

In the upcoming theorem, we formally establish that for the Symmetric Mixture of Linear Experts (SMoLinE) and the Symmetric Mixture of Logistic Experts (SMoLogE) models, the EM update defined in (8) is exactly equivalent to solving the subproblem associated with a single step of MD, as defined in (14), with a step size of \eta=1. This demonstrates that the EM update in these specific models is essentially performing an MD step on \mathcal{L}. The proof of this result is quite technical and is deferred to Appendix A.

Theorem 4.1.

For SMoLinE and SMoLogE, there is a mirror map A(\bm{\theta}) such that the EM update in (8) simplifies and is equivalent to

\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}), (15)

where, \forall\bm{\phi},\bm{\theta}\in\Omega, the divergence function D_{A}(\bm{\theta},\bm{\theta}^{t}) is equal to the KL divergence on the complete data:

D_{A}(\bm{\phi},\bm{\theta})=\text{KL}[p(\bm{x},y,z;\bm{\theta})\|p(\bm{x},y,z;\bm{\phi})]. (16)

In particular, in the case of SMoLinE,

A(\bm{\theta})=\mathbb{E}_{\bm{X}}\left[\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right], (17)

while in the case of SMoLogE,

A(\bm{\theta})=\mathbb{E}_{\bm{X}}\left[\log\left(\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right)\right]. (18)

Finally, in both cases, the map A(\bm{\theta}) is strictly convex in \bm{\theta} and \mathcal{L}(\bm{\theta}) is 1-smooth relative to A(\bm{\theta}).

As the loss is 1-smooth relative to A, this validates the choice of \eta=1 for the Mirror Descent update and enables subsequent convergence results from the MD literature.
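For reference, the relative smoothness condition of [LFN17] used here reads: for all \bm{\phi},\bm{\theta}\in\Omega,

\mathcal{L}(\bm{\phi})\leq\mathcal{L}(\bm{\theta})+\langle\nabla\mathcal{L}(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle+L\,D_{A}(\bm{\phi},\bm{\theta}),

with L=1 in Theorem 4.1; this is exactly the condition under which the MD surrogate (14) with \eta=1/L upper-bounds \mathcal{L}.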

4.1 Discussion of Main Result

The results show that the EM update (8) for MoE is equivalent to mirror descent with a unit step-size and KL divergence regularizer on the complete data distribution. We offer the following additional remarks.

First, as shown in the proof of Theorem 4.1 in Appendix A.1, it is evident from (30) and (34) that p(𝒙,y,z;𝜽)p(\bm{x},y,z;\bm{\theta}) does not belong to an exponential family of distributions for either SMoLinE or SMoLogE. Therefore, this result does not simply follow as a corollary of [KKS22, Proposition 1], but stands as an independent finding, introducing another class of latent variable models where EM is equivalent to MD. Moreover, it suggests the possibility of a broader class of latent variable models—beyond those with complete data in an exponential family—where EM can be generalized to MD.

Second, if we have oracle access to the EM updates for 𝒘\bm{w} and 𝜷\bm{\beta}, EM requires no hyper-parameters, unlike GD, which is sensitive to the step size. This is especially advantageous for SMoLinE, where the 𝜷\bm{\beta}-update has a closed-form solution, making EM’s benefits over GD more evident. Additionally, while GD regularizes progress based on the Euclidean distance between iterates, EM adjusts progress based on the divergence between probability distributions across iterations. This is often more suitable for latent variable models, where small Euclidean changes may cause large shifts in the mixture distribution, and vice versa.

Finally, whereas previous analyses of EM in various settings hinged on different techniques, ranging from verifying hard-to-check conditions to direct proofs that are less reproducible at scale, the connection to MD greatly unifies the analysis. In particular, as we discuss in Section 5, MD enjoys general yet powerful convergence guarantees. Notably, some of these guarantees stand out from previous analyses for MoE in that they do not hinge on problem-specific constants or the choice of parameterization.

5 Convergence Analysis

This section presents robust convergence guarantees for EM on SMoLinE and SMoLogE, starting with a review of Mirror Descent guarantees for context. We then examine how these conditions relate to concrete statistical metrics and the true model parameters.

5.1 Convergence Guarantees

Building on prior work [LFN17], we contextualize convergence properties of MD for SMoLinE and SMoLogE. Before presenting the result, we briefly review key concepts. We say \bm{\theta}^{1} is initialized in a locally convex region of \mathcal{L}(\bm{\theta}) if there exists a convex set \Theta\subseteq\Omega containing \bm{\theta}^{1},\bm{\theta}^{*} such that for all \bm{\phi},\bm{\theta}\in\Theta,

\mathcal{L}(\bm{\phi})\geq\mathcal{L}(\bm{\theta})+\langle\nabla\mathcal{L}(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle. (19)

Furthermore, \mathcal{L} is called \alpha-strongly convex relative to h on \Theta if, for all \bm{\phi},\bm{\theta}\in\Theta,

\mathcal{L}(\bm{\phi})\geq\mathcal{L}(\bm{\theta})+\langle\nabla\mathcal{L}(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle+\alpha D_{h}(\bm{\phi},\bm{\theta}). (20)

The corollary that follows provides conditions for convergence of EM to (1) a stationary point in the KL divergence, (2) the true parameters at a sub-linear rate, and (3) the true parameters at a linear rate. The proof is adapted from [KKS22, Proposition 2, Corollary 1, and Corollary 3] and [LFN17, Theorem 3.1]; we include it in Appendix B.1 for completeness.

Corollary 5.1 (Convergence of EM).

For SMoLinE and SMoLogE with mirror map A(\bm{\theta}) given by (17) and (18), respectively, and denoting D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1}):=\text{KL}[p(\bm{x},y,z;\bm{\theta}^{t+1})\|p(\bm{x},y,z;\bm{\theta}^{t})], the EM iterates \{\bm{\theta}^{t}\}_{t\in[T]} satisfy:

  • 1)

    Stationarity. With no additional conditions,

    \min_{t\in[T]}D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})\leq\frac{\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{*})}{T}; (21)
  • 2)

    Sub-linear Rate to \bm{\theta}^{*}. If \bm{\theta}^{1} is initialized in \Theta, a locally convex region of \mathcal{L}(\bm{\theta}) containing \bm{\theta}^{*}, then

    \mathcal{L}(\bm{\theta}^{T})-\mathcal{L}(\bm{\theta}^{*})\leq\frac{D_{A}(\bm{\theta}^{*},\bm{\theta}^{1})}{T}; (22)
  • 3)

    Linear Rate to \bm{\theta}^{*}. If \bm{\theta}^{1} is initialized in \Theta\subseteq\Omega, a locally \alpha-strongly convex region of \mathcal{L}(\bm{\theta}) relative to A(\bm{\theta}) that contains \bm{\theta}^{*}, then

    \mathcal{L}(\bm{\theta}^{T})-\mathcal{L}(\bm{\theta}^{*})\leq(1-\alpha)^{T}(\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{*})). (23)

As noted in the literature, EM’s convergence is sensitive to initialization. If 𝜽1\bm{\theta}^{1} is initialized within a locally convex region of (𝜽)\mathcal{L}(\bm{\theta}), the EM iterates for the MoE problem will converge sub-linearly to the true parameter. However, if 𝜽1\bm{\theta}^{1} is in a region where (𝜽)\mathcal{L}(\bm{\theta}) is strongly convex relative to AA, the iterates will converge linearly. This assumption is weaker than those in prior work, which typically require 𝜽1\bm{\theta}^{1} to be initialized in a locally strongly convex region.

5.2 Satisfiability of Conditions

The above result raises an important question: when does a locally convex or relatively strongly convex region of \mathcal{L}(\bm{\theta}) containing \bm{\theta}^{*} exist? Interestingly, this is closely tied to the Signal-to-Noise Ratio (SNR). Before exploring this connection, we first recall the Missing Information Matrix (MIM), introduced in [OW72].

The MIM quantifies the amount of information the pair (\bm{x},y) holds about the latent expert label z given parameters \bm{\theta}, and it is formally defined as

\bm{M}(\bm{\theta})=\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}\bm{I}_{z|\bm{x},y,\bm{\theta}}. (24)

Here, \bm{I}_{\bm{x},z,y|\bm{\theta}} is the Fisher information matrix of the complete data distribution and \bm{I}_{z|\bm{x},y,\bm{\theta}} is the Fisher information matrix of the conditional distribution of the latent unobserved variable given the observed ones, i.e.,

\bm{I}_{\bm{x},z,y|\bm{\theta}}:=-\mathbb{E}_{\bm{X},Y,Z|\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log p(\bm{x},z,y;\bm{\theta})\right]
\bm{I}_{z|\bm{x},y,\bm{\theta}}:=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log P(z|\bm{x},y;\bm{\theta})\right]

Since the coordinates of \bm{X} are independent and its distribution is rotationally invariant, there exists an orthonormal matrix R such that \Delta:=R\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}R^{\top} is diagonal and positive semi-definite and J:=R\bm{I}_{z|\bm{x},y,\bm{\theta}}R^{\top} is symmetric positive semi-definite (see Appendix B.2). As such, the MIM in (24) is also symmetric and positive semi-definite: \bm{M}(\bm{\theta})=R^{\top}\Delta RR^{\top}JR=R^{\top}\Delta JR. Note that the MIM quantifies the difficulty of estimating parameters when only \bm{x},y are observed. To understand its significance, consider the following: large eigenvalues of \bm{M} indicate that (\bm{x},y) contain little information about the true value of the latent variable z, making estimation more difficult. Conversely, small eigenvalues suggest that (\bm{x},y) provide enough information to effectively constrain the possible values of z. Thus, the MIM can be seen as analogous to the Signal-to-Noise Ratio (SNR).

In the theorem below, we show how the eigenvalues of \bm{M}(\bm{\theta}) relate to the satisfiability of the conditions of Corollary 5.1 regarding the strong convexity of \mathcal{L}(\bm{\theta}) relative to the mirror map A(\bm{\theta}).

Theorem 5.2.

For SMoLinE and SMoLogE and their respective mirror maps (17) and (18), the objective \mathcal{L}(\bm{\theta}) is \alpha-strongly convex relative to the mirror map A(\bm{\theta}) on the convex set \Theta if and only if

\lambda_{\max}(\bm{M}(\bm{\theta}))\leq(1-\alpha)\text{ for all }\bm{\theta}\in\Theta. (25)

The above result states that if \bm{M}(\bm{\theta})\prec\bm{I}_{2d} for all \bm{\theta}\in\Theta and \bm{\theta}^{*}\in\Theta, then the EM updates converge linearly to \bm{\theta}^{*}. This offers a unified framework for analyzing EM for MoE, linking the rate of convergence to a classical statistical quantity. To determine whether EM achieves linear or sub-linear convergence, we need to understand the behavior of \bm{M}(\bm{\theta}) and \bm{M}(\bm{\theta}^{*}), which indicates the existence and size of the local region where EM enjoys such convergence. The following theorem is technical, and its proof is provided in Appendix B.3.

Theorem 5.3.

For SMoLinE, the eigenvalues of \bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} belong to the set

\lambda\left(\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}\right)=\{\Theta(\|\bm{w}\|_{2}^{3}),\Theta(\|\bm{w}\|_{2}),1\} (26)

and \bm{I}_{z|\bm{x},y,\bm{\theta}} is given as the expectation over (\bm{X},Y) of a function that is decreasing in \|\bm{\theta}\|_{2}:

\bm{I}_{z|\bm{x},y,\bm{\theta}}=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}{\left(1+\exp\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle\right)^{2}}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}^{\top}\right]. (27)

Similarly, for SMoLogE, the eigenvalues of \bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} belong to the set

\lambda\left(\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}\right)=\{\Theta(\|\bm{w}\|_{2}^{3}),\Theta(\|\bm{w}\|_{2}),\Theta(\|\bm{\beta}\|_{2}^{3}),\Theta(\|\bm{\beta}\|_{2})\} (28)

and \bm{I}_{z|\bm{x},y,\bm{\theta}} is given as the expectation over (\bm{X},Y) of a function that is decreasing in \|\bm{\theta}\|_{2}:

\bm{I}_{z|\bm{x},y,\bm{\theta}}=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}{\left(1+\exp\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle\right)^{2}}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}^{\top}\right]. (29)
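To get a feel for (27) in practice, the following is a small Monte Carlo sketch (our own code) that estimates \bm{I}_{z|\bm{x},y,\bm{\theta}} for SMoLinE from observed samples; the stacking of \bm{\theta} as (\bm{w},\bm{\beta}) against \bm{u}=[\bm{x};2y\bm{x}] is our assumption, and the SMoLogE expression (29) is analogous with \bm{u}=[\bm{x};y\bm{x}].

```python
import numpy as np

def mc_missing_information_smoline(w, beta, X, Y):
    """Monte Carlo estimate of (27) for SMoLinE from samples (x_i, y_i).
    Purely illustrative; theta is stacked as (w, beta) so that the inner product
    pairs x with w and 2*y*x with beta, matching the posterior
    P(z=1|x,y;theta) = sigmoid(x^T w + 2 y x^T beta)."""
    U = np.hstack([X, 2.0 * Y[:, None] * X])       # u_i = [x_i; 2 y_i x_i], shape (n, 2d)
    t = U @ np.concatenate([w, beta])               # <u_i, theta>
    sig = 1.0 / (1.0 + np.exp(-t))
    s = sig * (1.0 - sig)                           # e^t / (1 + e^t)^2, computed stably
    return (U * s[:, None]).T @ U / len(Y)          # average of s_i * u_i u_i^T, shape (2d, 2d)
```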

The theorem above provides an upper bound for \lambda_{\max}(\bm{M}(\bm{\theta})) as a function of \bm{\theta} and \bm{\theta}^{*}. The advantage of this bound is that it can be reliably estimated from the observed data (\bm{x},y). To gain an intuition for how (27) and (29) scale with \bm{\theta}, consider a simplified scenario where \begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix} is parallel to \bm{\theta} with probability 1. In this case, it can be shown that

\bm{I}_{z|\bm{x},y,\bm{\theta}}\leq\int_{0}^{\infty}\|\bm{u}\|_{2}^{2}\exp\{-\|\bm{u}\|_{2}\|\bm{\theta}\|_{2}\}\,d\|\bm{u}\|_{2}
=\frac{2}{\|\bm{\theta}\|_{2}^{3}}=\frac{2}{(\|\bm{w}\|_{2}+\|\bm{\beta}\|_{2})^{3}},

and as a result \bm{I}_{z|\bm{x},y,\bm{\theta}}\preceq\mathcal{O}\left(\frac{1}{\|\bm{\theta}\|_{2}^{3}}\right). Substituting back into the definition of the MIM and using the tight bounds on the eigenvalues of \bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}, we see that in a best-case scenario the maximum eigenvalue of the MIM is \mathcal{O}(\|\bm{w}\|_{2}^{3}/\|\bm{\theta}\|_{2}^{3}). When this upper bound lies in the interval [0,1], Theorem 5.2 and part 3) of Corollary 5.1 guarantee that our parameter iterates converge at the rate \mathcal{O}\left(\left(\frac{\|\bm{w}\|_{2}}{\|\bm{w}\|_{2}+\|\bm{\beta}\|_{2}}\right)^{3}\right). Moreover, when \bm{\beta}^{*} is large, we can expect \bm{\theta}^{*} to belong to a region in which \mathcal{L}(\bm{\theta}) is strongly convex relative to A(\bm{\theta}). We provide a more thorough discussion and further interpretation of this result in Appendix B.3.
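For completeness, the integral evaluated above is the standard Gamma integral with a=\|\bm{\theta}\|_{2}:

\int_{0}^{\infty}u^{2}e^{-au}\,du=\frac{\Gamma(3)}{a^{3}}=\frac{2}{a^{3}}.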

6 Experiments

In this section, we empirically validate our theoretical results by comparing the performance of EM with Gradient EM (Algorithm 1) and Gradient Descent (Algorithm 2). Recall that EM for MoE obtains its next parameter iterate as the concatenation of the solutions of two minimization problems; instead, Gradient EM obtains its next parameter iterate as the concatenation of a single gradient update on each of the respective EM sub-problems. This differs from Gradient Descent (GD), which obtains its next parameter iterate via a gradient update on the negative log-likelihood objective \mathcal{L}(\bm{\theta}). We evaluate these methods on both a synthetic dataset and the real-world Fashion MNIST dataset, consistently showing significant improvements for EM and Gradient EM over GD. Note that our aim is not to achieve state-of-the-art accuracy, but to demonstrate that EM is more suitable than GD for fitting specific models.

Synthetic Dataset. We created the synthetic dataset so as to simulate a population setting of SMoLinE. We sampled 10^{3} data points from an SMoLinE with known additive unit Gaussian noise (i.e., \mathcal{N}(0,1)) and true parameters \bm{\beta}^{*},\bm{w}^{*}\in\mathbb{R}^{10} that satisfy \|\bm{\beta}^{*}\|_{2}=\|\bm{w}^{*}\|_{2}=4. Subsequently, we run EM, Gradient EM, and GD for 50 iterations and report the results on the training set averaged over 50 instances, each time re-sampling the true parameters, the initial parameters, and the whole dataset. The initial parameters are randomly initialized within a neighborhood of the true parameters and are consistent across all benchmarks.
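A minimal sketch of how such a synthetic SMoLinE instance can be generated is given below (our own code; the initialization radius of 0.5 is an arbitrary choice, as the experiment only states that initialization is in a neighborhood of the true parameters). Starting from (beta0, w0), one can then iterate, e.g., the em_step_smoline sketch from Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000

# True parameters with ||beta*|| = ||w*|| = 4, as in the synthetic experiment.
beta_star = rng.standard_normal(d); beta_star *= 4.0 / np.linalg.norm(beta_star)
w_star = rng.standard_normal(d);    w_star *= 4.0 / np.linalg.norm(w_star)

# SMoLinE data: x ~ N(0, I), P(z=1|x) = sigmoid(x^T w*), y = z x^T beta* + N(0,1).
X = rng.standard_normal((n, d))
Z = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w_star))), 1.0, -1.0)
Y = Z * (X @ beta_star) + rng.standard_normal(n)

# Initialize in a neighborhood of the true parameters.
beta0 = beta_star + 0.5 * rng.standard_normal(d)
w0 = w_star + 0.5 * rng.standard_normal(d)
```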

Figure 1: Convergence of the objective errors \mathcal{L}(\bm{\theta}^{t})-\mathcal{L}(\bm{\theta}^{*}) and \mathcal{L}(\bm{\theta}^{t})-\mathcal{L}(\bm{\theta}^{T}) in Fig. 1(a) and Fig. 1(b), respectively, averaged over 50 instances when fitting a SMoLinE.
Figure 2: Progress toward the true parameters, \frac{\|\bm{\beta}^{t}-\bm{\beta}^{*}\|_{2}}{\|\bm{\beta}^{*}\|_{2}} and \frac{\|\bm{w}^{t}-\bm{w}^{*}\|_{2}}{\|\bm{w}^{*}\|_{2}} in Figs. 2(a) and 2(b), respectively, averaged over 50 instances when fitting a SMoLinE.

Figure 1 shows the objective function progress. EM requires fewer iterations to fit the mixture compared to both Gradient EM and GD, with Gradient EM also outperforming GD in fitting time. Figure 2 illustrates the progress toward recovering the true SMoLinE parameters. Once again, EM requires significantly fewer iterations to fit the mixture compared to both Gradient EM and GD, with Gradient EM also taking considerably less time than GD.

Overall, we observe that all three algorithms exhibit a linear convergence rate, both in optimizing the objective function and fitting the true parameters. This aligns with our theoretical results for MoE and is consistent with findings for Mixtures of Gaussians and Mixtures of Linear Regression in high SNR scenarios.

Validation Experiment on Fashion MNIST. For the small-scale proof-of-concept experiment on Fashion MNIST [XRV17], we alter the dataset so as to simulate a mixture of 2 Logistic Experts. To do this, we simply perform an inversion transformation on the images at random with probability \frac{1}{2}. Effectively, the transformation inverts the images from a white article of clothing on a black background to a black article of clothing on a white background. As can be seen in Table 1, the single expert on the original Fashion MNIST dataset reaches an accuracy of 83.2\% on the test set. Meanwhile, the single expert cannot achieve better than an accuracy of 10.2\% on the altered dataset. This suggests a 2-component MoLogE is appropriate for fitting the altered dataset, so long as the ground-truth partitioning is linear in image space.

The 2-component MoLogE to be trained consists of one linear gating layer of dimension 2\times 28\times 28, and 2 logistic experts of dimension 10\times 28\times 28 each. We randomly initialize each linear layer to be unit norm and execute the algorithms on the same datasets and with the same initializations. For Gradient EM, the only additional code needed over GD is to define the EM loss function appropriately and then perform a gradient step on the gating parameters and the expert parameters separately, as described in Algorithm 1. For EM, at each iteration we perform several gradient steps in an inner loop to approximately recover the solutions to the sub-problems described in (8). We report our findings in Table 2 and Figure 3.
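The random inversion itself can be implemented, for example, with torchvision; this is our own sketch of the data preparation (the authors' exact preprocessing and training code are not shown, and RandomInvert requires a reasonably recent torchvision version).

```python
import torch
from torchvision import datasets, transforms

# Altered Fashion MNIST: each image is inverted (white-on-black -> black-on-white)
# independently with probability 1/2, simulating a 2-component MoLogE.
transform = transforms.Compose([
    transforms.RandomInvert(p=0.5),              # random inversion with probability 1/2
    transforms.ToTensor(),                       # 1 x 28 x 28 tensors in [0, 1]
    transforms.Lambda(lambda t: t.flatten()),    # 784-dim inputs for the linear gate/experts
])
train_set = datasets.FashionMNIST(root="./data", train=True, download=True,
                                  transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```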

In Table 2, we report the final test accuracy and cross-entropy loss values after 100 iterations of EM, Gradient EM, and GD for fitting a 2-component MoLogE on the altered Fashion MNIST dataset, averaged over 25 instances. We see that EM attains a much improved final test accuracy of 79.2\%, nearly recovering the single-expert accuracy of 83.2\% on the original, unaltered Fashion MNIST dataset. Meanwhile, Gradient EM also registers an improvement over GD. In Figure 3, we report the progress made on the test accuracy and objective function over the 100 iterations, averaged over 25 instances. As was observed in our synthetic experiment, EM takes considerably fewer iterations to fit the mixture than both Gradient EM and GD, and Gradient EM in turn takes considerably less time to fit the mixture than GD.

Table 1: Performance for single Logistic Expert
                    Accuracy    Random Invert
Single Expert       83.2%       No
Single Expert       10.2%       Yes

Table 2: Performance for 2-Component MoLogE
                    Accuracy    Cross Entropy
EM                  79.2%       0.801
Gradient EM         65.7%       1.30
Gradient Descent    61.9%       1.33
Figure 3: Test accuracy and objective value, \frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}_{\hat{y}_{i}=y_{i}} and \mathcal{L}(\bm{\theta}^{t}), in Figs. 3(a) and 3(b), respectively, averaged over 25 instances for a 2-component MoLogE.

7 Conclusion

In this paper, we addressed the problem of learning Mixtures of Experts (MoE), focusing on the use of the EM algorithm. We first showed that the EM update for MoE can be interpreted as a Mirror Descent step on the negative log-likelihood with a unit step size and a KL divergence regularizer, extending the result of [KKS22] beyond the exponential family. Building on this, we characterized different convergence rates for EM in this setting under various assumptions on the log-likelihood function and specified when these assumptions hold. We also compared EM with gradient descent empirically and showed that EM outperforms gradient descent in both convergence rate and final performance.

References

  • [BWY17] Sivaraman Balakrishnan, Martin J. Wainwright and Bin Yu “Statistical guarantees for the EM algorithm: From population to sample-based analysis” In The Annals of Statistics 45.1 Institute of Mathematical Statistics, 2017, pp. 77–120 DOI: 10.1214/16-AOS1435
  • [CDWGL22] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu and Yuanzhi Li “Towards Understanding Mixture of Experts in Deep Learning”, 2022 arXiv: https://arxiv.org/abs/2208.02813
  • [DTZ17] Constantinos Daskalakis, Christos Tzamos and Manolis Zampetakis “Ten Steps of EM Suffice for Mixtures of Two Gaussians” In Proceedings of the 2017 Conference on Learning Theory 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 704–710 URL: https://proceedings.mlr.press/v65/daskalakis17b.html
  • [DLR77] A.P. Dempster, N.M. Laird and D.B. Rubin “Maximum Likelihood from Incomplete Data Via the EM Algorithm” In Journal of the Royal Statistical Society: Series B (Methodological) 39.1, 1977, pp. 1–22 DOI: https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
  • [DHKWJ18] Raaz Dwivedi, Nhat Ho, Koulik Khamaru, Martin J Wainwright and Michael I Jordan “Theoretical guarantees for EM under misspecified Gaussian mixture models” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018 URL: https://proceedings.neurips.cc/paper_files/paper/2018/file/acc21473c4525b922286130ffbfe00b5-Paper.pdf
  • [FZS22] William Fedus, Barret Zoph and Noam Shazeer “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity” In J. Mach. Learn. Res. 23.1 JMLR.org, 2022
  • [Hin+12] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath and Brian Kingsbury “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups” In IEEE Signal Processing Magazine 29.6, 2012, pp. 82–97 DOI: 10.1109/MSP.2012.2205597
  • [JJ94] Michael I. Jordan and Robert A. Jacobs “Hierarchical Mixtures of Experts and the EM Algorithm” In Neural Computation 6.2, 1994, pp. 181–214
  • [JX95] Michael I. Jordan and Lei Xu “Convergence results for the EM approach to mixtures of experts architectures” In Neural Networks 8.9, 1995, pp. 1409–1431 DOI: https://doi.org/10.1016/0893-6080(95)00014-3
  • [KKS22] Frederik Kunstner, Raunak Kumar and Mark Schmidt “Homeomorphic-Invariance of EM: Non-Asymptotic Convergence in KL Divergence for Exponential Families via Mirror Descent”, 2022 arXiv: https://arxiv.org/abs/2011.01170
  • [KC20] Jeongyeol Kwon and Constantine Caramanis “EM Converges for a Mixture of Many Linear Regressions” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research PMLR, 2020, pp. 1727–1736 URL: https://proceedings.mlr.press/v108/kwon20a.html
  • [KC20a] Jeongyeol Kwon and Constantine Caramanis “The EM Algorithm gives Sample-Optimality for Learning Mixtures of Well-Separated Gaussians” In Proceedings of Thirty Third Conference on Learning Theory 125, Proceedings of Machine Learning Research PMLR, 2020, pp. 2425–2487 URL: https://proceedings.mlr.press/v125/kwon20a.html
  • [KHC21] Jeongyeol Kwon, Nhat Ho and Constantine Caramanis “On the Minimax Optimality of the EM Algorithm for Learning Two-Component Mixed Linear Regression” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 130, Proceedings of Machine Learning Research PMLR, 2021, pp. 1405–1413 URL: https://proceedings.mlr.press/v130/kwon21b.html
  • [KQCCD18] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen and Damek Davis “Global Convergence of EM Algorithm for Mixtures of Two Component Linear Regression” arXiv, 2018 DOI: 10.48550/ARXIV.1810.05752
  • [LFN17] Haihao Lu, Robert M. Freund and Yurii Nesterov “Relatively-Smooth Convex Optimization by First-Order Methods, and Applications”, 2017 arXiv: https://arxiv.org/abs/1610.05708
  • [MZYCHC18] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong and Ed Chi “Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts”, 2018, pp. 1930–1939 DOI: 10.1145/3219819.3220007
  • [MOKV19] Ashok Vardhan Makkuva, Sewoong Oh, Sreeram Kannan and Pramod Viswanath “Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms”, 2019 arXiv: https://arxiv.org/abs/1802.07417
  • [OW72] Terence Orchard and Max A. Woodbury “A Missing Information Principle: Theory and Applications” In Volume 1 Theory of Statistics University of California Press, 1972, pp. 697–716 DOI: 10.1525/9780520325883-036
  • [Tse04] Paul Tseng “An Analysis of the EM Algorithm and Entropy-Like Proximal Point Methods” In Mathematics of Operations Research 29.1 INFORMS, 2004, pp. 27–44 URL: http://www.jstor.org/stable/30035630
  • [Wu83] C.F.J. Wu “On the Convergence Properties of the EM Algorithm” In The Annals of Statistics 11.1 Institute of Mathematical Statistics, 1983, pp. 95–103 DOI: 10.1214/aos/1176346060
  • [XRV17] Han Xiao, Kashif Rasul and Roland Vollgraf “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms”, 2017 arXiv: https://arxiv.org/abs/1708.07747
  • [YWG12] Seniha Yuksel, Joseph Wilson and Paul Gader “Twenty Years of Mixture of Experts” In Neural Networks and Learning Systems, IEEE Transactions on 23, 2012, pp. 1177–1193 DOI: 10.1109/TNNLS.2012.2200299

Appendix A EM is Mirror Descent for SMoLogE and SMoLinE

In this section, we provide the full and detailed proof of the main result, Theorem 4.1. For ease of comprehension and in the hope that this will provide useful insights into other types of non-exponential family mixtures for which EM is also connected to MD, we prove our result following the same general ideas as that of [KKS22, Proposition 1]. We split the proof into two parts (SMoLinE and SMoLogE) which can be found in Appendix A.1 and A.2.

For ease of reading, we re-state the theorem below:
Theorem 4.1: For SMoLinE and SMoLogE, there is a mirror map A(\bm{\theta}) such that the EM update in (8) simplifies and is equivalent to

\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}),

where, \forall\bm{\phi},\bm{\theta}\in\Omega, the divergence function D_{A}(\bm{\theta},\bm{\theta}^{t}) is equal to the KL divergence on the complete data:

D_{A}(\bm{\phi},\bm{\theta})=\text{KL}[p(\bm{x},y,z;\bm{\theta})\|p(\bm{x},y,z;\bm{\phi})].

In particular, in the case of SMoLinE,

A(\bm{\theta})=\mathbb{E}_{\bm{X}}\left[\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right],

while in the case of SMoLogE,

A(\bm{\theta})=\mathbb{E}_{\bm{X}}\left[\log\left(\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right)\right].

Finally, in both cases, the map A(\bm{\theta}) is strictly convex in \bm{\theta} and \mathcal{L}(\bm{\theta}) is 1-smooth relative to A(\bm{\theta}).

A.1 Proof of Theorem 4.1 for SMoLinE

Proof.

Recall that we consider a 2-component SMoLinE (see Section 3.1) where z\in\{-1,1\} is the latent unobserved variable, and

  • 1)

    \bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}),

  • 2)

    P(z|\bm{x};\bm{w})=\frac{\exp\{\frac{z+1}{2}\bm{x}^{\top}\bm{w}\}}{1+e^{\bm{x}^{\top}\bm{w}}},

  • 3)

    p(y|\bm{x},z;\bm{\beta})=\frac{\exp\left\{-\frac{(y-z\bm{x}^{\top}\bm{\beta})^{2}}{2}\right\}}{\sqrt{2\pi}}.

We begin by deriving a near-exponential form of the complete-data probability density function p(\bm{x},z,y;\bm{\theta}):

p(\bm{x},y,z;\bm{\theta})
=p(y|\bm{x},z;\bm{\theta})P(z|\bm{x};\bm{\theta})p(\bm{x})
=\exp\{\log p(y|\bm{x},z;\bm{\beta})+\log P(z|\bm{x};\bm{w})+\log p(\bm{x})\}
=\exp\left\{\frac{-(y-z\bm{x}^{\top}\bm{\beta})^{2}}{2}-\frac{1}{2}\log(2\pi)+\left(\frac{z+1}{2}\right)\bm{x}^{\top}\bm{w}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})\right\}
=\exp\left\{\frac{-y^{2}}{2}+yz\bm{x}^{\top}\bm{\beta}-\frac{z^{2}(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\left(\frac{z+1}{2}\right)\bm{x}^{\top}\bm{w}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})-\frac{1}{2}\log(2\pi)\right\}
=\exp\left\{\left\langle\begin{bmatrix}yz\bm{x}\\ \frac{z\bm{x}}{2}\end{bmatrix},\begin{bmatrix}\bm{\beta}\\ \bm{w}\end{bmatrix}\right\rangle+\frac{\bm{x}^{\top}\bm{w}}{2}-\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})-\frac{y^{2}}{2}-\frac{1}{2}\log(2\pi)\right\}.

Thus we have recovered the decomposition,

p(\bm{x},y,z;\bm{\theta})=\exp\left\{\left\langle\underbrace{\begin{bmatrix}\frac{z\bm{x}}{2}\\ yz\bm{x}\end{bmatrix}}_{S(\bm{x},y,z)},\begin{bmatrix}\bm{w}\\ \bm{\beta}\end{bmatrix}\right\rangle+a(\bm{x},y,\bm{\theta})\right\}, (30)

where in a(\bm{x},y,\bm{\theta}), the feature variable \bm{x} cannot be linearly separated from the parameter \bm{\theta}:

a(\bm{x},y,\bm{\theta})=\frac{\bm{x}^{\top}\bm{w}}{2}-\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}-\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)+\log p(\bm{x})-\frac{y^{2}}{2}-\frac{1}{2}\log(2\pi).

At this point, we pause and discuss the implications of the obtained form. First, we recall that for a random variable \bm{U} to belong to an exponential family, it must satisfy

p(\bm{u};\bm{\theta})=h(\bm{u})\exp\left\{\langle s(\bm{u}),\bm{\theta}\rangle-A(\bm{\theta})\right\}

for some h(\cdot),s(\cdot),\bm{\theta},A(\cdot), which are called the normalization function, sufficient statistics, natural parameters, and log-partition function, respectively. To clarify, we note that 1) A(\bm{\theta}) must be a function of the parameters only and cannot depend on \bm{u}, and 2) h(\bm{u}) must be a function of the variable \bm{u} only and cannot depend on the parameters. In other words, it must be that h(\bm{u}) and A(\bm{\theta}) are linearly separated inside the \exp. With the above in mind, we see that SMoLinE is not an exponential family. Further, we remark that the above formulation, showing that S(\bm{x},y,z) is paired linearly with (\bm{w},\bm{\beta}), does not extend beyond the symmetric setting of the Mixture of Linear Experts; for k\geq 3, this relationship becomes non-linear. This turns out to be problematic for showing EM is equivalent to MD for k\geq 3. Lastly, note that taking the expectation of a(\bm{x},y,\bm{\theta}) over (\bm{x},y)\sim p(\bm{x},y;\bm{\theta}^{*}) yields

\mathbb{E}_{\bm{X},Y}[a(\bm{x},y,\bm{\theta})]=-\mathbb{E}_{\bm{X}}\left[\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right]-C
=-A(\bm{\theta})-C,

where the above follows from \mathbb{E}_{\bm{X}}[\frac{\bm{x}^{\top}\bm{w}}{2}]=0 and C:=-\mathbb{E}_{\bm{X},Y}[\log p(\bm{x})-\frac{y^{2}}{2}-\frac{1}{2}\log(2\pi)] is not a function of the parameter \bm{\theta}. With the obtained form (30), we now continue with the proof.

Part a): Show EM is MD, i.e., \operatorname*{arg\,min}_{\bm{\theta}\in\Omega}Q(\bm{\theta}|\bm{\theta}^{t})=\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}).

Taking the appropriate expectation, the EM objective Q can be written as

Q(\bm{\theta}|\bm{\theta}^{t})=-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log p(\bm{x},y,z;\bm{\theta})\right]\right]
=-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\langle S(\bm{x},y,z),\bm{\theta}\rangle+a(\bm{x},y,\bm{\theta})\right]\right]
=-\mathbb{E}_{\bm{X},Y}\left[a(\bm{x},y,\bm{\theta})\right]-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\langle S(\bm{x},y,z),\bm{\theta}\rangle\right]\right]
=A(\bm{\theta})-\left\langle s(\bm{\theta}^{t}),\bm{\theta}\right\rangle+C,

where s(\bm{\theta}^{t}):=\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}[S(\bm{x},y,z)]. As a consequence, it is also true that

Q(𝜽t|𝜽t)=A(𝜽t)s(𝜽t).\nabla Q(\bm{\theta}^{t}|\bm{\theta}^{t})=\nabla A(\bm{\theta}^{t})-s(\bm{\theta}^{t}). (31)

Continuing, we use the above to simplify the expression for Q(𝜽|𝜽t)Q(𝜽t|𝜽t)Q(\bm{\theta}|\bm{\theta}^{t})-Q(\bm{\theta}^{t}|\bm{\theta}^{t}) that will subsequently give us the MD loss:

Q(𝜽|𝜽t)Q(𝜽t|𝜽t)\displaystyle Q(\bm{\theta}|\bm{\theta}^{t})-Q(\bm{\theta}^{t}|\bm{\theta}^{t}) =A(𝜽)s(𝜽t),𝜽A(𝜽t)+s(𝜽t),𝜽t\displaystyle=A(\bm{\theta})-\left\langle s(\bm{\theta}^{t}),\bm{\theta}\right\rangle-A(\bm{\theta}^{t})+\left\langle s(\bm{\theta}^{t}),\bm{\theta}^{t}\right\rangle
=s(𝜽t),𝜽𝜽t+A(𝜽t),𝜽𝜽tA(𝜽t),𝜽𝜽t+A(𝜽)A(𝜽t)\displaystyle=-\left\langle s(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\right\rangle+\langle\nabla A(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle-\langle\nabla A(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+A(\bm{\theta})-A(\bm{\theta}^{t})
=i)Q(𝜽t|𝜽t),𝜽𝜽t+DA(𝜽,𝜽t)\displaystyle\stackrel{{\scriptstyle i)}}{{=}}\left\langle\nabla Q(\bm{\theta}^{t}|\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\right\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t})
=ii)(𝜽t),𝜽𝜽t+DA(𝜽,𝜽t)\displaystyle\stackrel{{\scriptstyle ii)}}{{=}}\left\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\right\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t})

where we first add and subtract \langle\nabla A(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle; then i) follows from (31) and ii) follows from \nabla\mathcal{L}(\bm{\theta})=\nabla Q(\bm{\theta}|\bm{\theta}) (see Section 3 for the derivation).

Finally, the first part of our result follows trivially as

argmin𝜽ΩQ(𝜽|𝜽t)=argmin𝜽ΩQ(𝜽|𝜽t)Q(𝜽t|𝜽t).\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}Q(\bm{\theta}|\bm{\theta}^{t})=\operatorname*{arg\,min}_{\bm{\theta}\in\Omega}Q(\bm{\theta}|\bm{\theta}^{t})-Q(\bm{\theta}^{t}|\bm{\theta}^{t}).
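
We remark that the algebra in Part a) only used the fact that Q(\bm{\theta}|\bm{\theta}^{t}) takes the form A(\bm{\theta})-\langle s(\bm{\theta}^{t}),\bm{\theta}\rangle+C for a differentiable convex A. The following numpy sketch verifies the key identity Q(\bm{\theta}|\bm{\theta}^{t})-Q(\bm{\theta}^{t}|\bm{\theta}^{t})=\langle\nabla A(\bm{\theta}^{t})-s(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}) for a generic strictly convex stand-in A (not the SMoLinE mirror map) and randomly drawn \bm{\theta}, \bm{\theta}^{t}, s; it is illustrative only.

import numpy as np

rng = np.random.default_rng(1)
dim = 4

def A(t):                      # a generic strictly convex stand-in for the mirror map
    return 0.5 * t @ t + np.sum(np.logaddexp(0.0, t))

def grad_A(t):
    return t + 1.0 / (1.0 + np.exp(-t))

def bregman(p, q):             # Bregman divergence D_A(p, q)
    return A(p) - A(q) - grad_A(q) @ (p - q)

s = rng.normal(size=dim)       # plays the role of s(theta^t)
C = 0.7
Q = lambda t: A(t) - s @ t + C # Q(. | theta^t) written in the linear-in-theta form

theta_t, theta = rng.normal(size=dim), rng.normal(size=dim)
lhs = Q(theta) - Q(theta_t)
rhs = (grad_A(theta_t) - s) @ (theta - theta_t) + bregman(theta, theta_t)
print(lhs, rhs)                # identical up to floating-point rounding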

Part b): Show DA(ϕ,𝜽)=KL[p(𝒙,y,z;𝜽)p(𝒙,y,z;ϕ)]D_{A}(\bm{\phi},\bm{\theta})=\text{KL}[p(\bm{x},y,z;\bm{\theta})\|p(\bm{x},y,z;\bm{\phi})]

This result follows simply from decomposing KL[p(𝒙,y,z;𝜽)p(𝒙,y,z;ϕ)]\text{KL}[p(\bm{x},y,z;\bm{\theta})\|p(\bm{x},y,z;\bm{\phi})] as follows:

KL[p(𝒙,y,z;𝜽)p(𝒙,y,z;ϕ)]=𝔼𝑿,Y,Z|𝜽[logp(𝒙,y,z;𝜽)p(𝒙,y,z;ϕ)]\displaystyle\text{KL}[p(\bm{x},y,z;\bm{\theta})\|p(\bm{x},y,z;\bm{\phi})]=\mathbb{E}_{\bm{X},Y,Z|\bm{\theta}}\left[\log\frac{p(\bm{x},y,z;\bm{\theta})}{p(\bm{x},y,z;\bm{\phi})}\right]
=(30)s(𝜽),𝜽ϕA(𝜽)+A(ϕ)±𝔼𝑿,Y|𝜽[logp(𝒙)y2212log(2π)]\displaystyle\stackrel{{\scriptstyle\eqref{eqn: Not Exp SMOLINE}}}{{=}}\langle s(\bm{\theta}),\bm{\theta}-\bm{\phi}\rangle-A(\bm{\theta})+A(\phi)\pm\mathbb{E}_{\bm{X},Y|\bm{\theta}}\left[\log p(\bm{x})-\frac{y^{2}}{2}-\frac{1}{2}\log(2\pi)\right]
=A(ϕ)A(𝜽)s(𝜽),ϕ𝜽\displaystyle=A(\bm{\phi})-A(\bm{\theta})-\langle s(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle
=i)A(ϕ)A(𝜽)A(𝜽),ϕ𝜽.\displaystyle\stackrel{{\scriptstyle i)}}{{=}}A(\bm{\phi})-A(\bm{\theta})-\langle\nabla A(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle.

where i)i) follows from the fact that ϕ=𝜽\bm{\phi}=\bm{\theta} minimizes 𝔼𝑿,Z,Y|𝜽[logp(𝒙,y,z;ϕ)]-\mathbb{E}_{\bm{X},Z,Y|\bm{\theta}}\left[\log p(\bm{x},y,z;\bm{\phi})\right]. To see this, we use Jensen’s inequality:

0\displaystyle 0 𝔼𝑿,Z,Y|𝜽[logp(𝒙,y,z;𝜽)]\displaystyle\leq-\mathbb{E}_{\bm{X},Z,Y|\bm{\theta}}\left[\log p(\bm{x},y,z;\bm{\theta})\right]
Jensen’slog𝔼𝑿,Z,Y|𝜽[p(𝒙,y,z;𝜽)]=log𝒙,y,zp(𝒙,y,z:𝜽)2d𝒙dzdy\displaystyle\stackrel{{\scriptstyle\text{Jensen's}}}{{\leq}}-\log\mathbb{E}_{\bm{X},Z,Y|\bm{\theta}}\left[p(\bm{x},y,z;\bm{\theta})\right]=-\log\int_{\bm{x},y,z}p(\bm{x},y,z:\bm{\theta})^{2}d\bm{x}dzdy
Jensen’slog(𝒙,y,zp(𝒙,y,z;𝜽)d𝒙dzdy)2=log(1)=0.\displaystyle\stackrel{{\scriptstyle\text{Jensen's}}}{{\leq}}-\log\left(\int_{\bm{x},y,z}p(\bm{x},y,z;\bm{\theta})d\bm{x}dzdy\right)^{2}=-\log(1)=0.

Finally, taking the derivative with respect to ϕ\phi and setting equal to 0 completes the proof:

𝟎\displaystyle\bm{0} =𝜽𝔼𝑿,Z,Y|𝜽[logp(𝒙,y,z;ϕ)]|ϕ=𝜽\displaystyle=\frac{\partial}{\partial\bm{\theta}}\mathbb{E}_{\bm{X},Z,Y|\bm{\theta}}\left[\log p(\bm{x},y,z;\bm{\phi})\right]|_{\bm{\phi}=\bm{\theta}}
=𝔼𝑿,Z,Y|𝜽[S(𝒙,y,z)+ϕa(𝒙,y,ϕ)|ϕ=𝜽]\displaystyle=\mathbb{E}_{\bm{X},Z,Y|\bm{\theta}}\left[S(\bm{x},y,z)+\frac{\partial}{\partial\bm{\phi}}a(\bm{x},y,\bm{\phi})|_{\bm{\phi}=\bm{\theta}}\right]
=s(𝜽)A(𝜽).\displaystyle=s(\bm{\theta})-\nabla A(\bm{\theta}).

Part c): Show (𝜽)\mathcal{L}(\bm{\theta}) is 11-smooth relative to A(𝜽)A(\bm{\theta}).
The function (𝜽)\mathcal{L}(\bm{\theta}) is said to be 11-smooth relative to A(𝜽)A(\bm{\theta}) if for all ϕ,𝜽\bm{\phi},\bm{\theta}, it holds that

\mathcal{L}(\bm{\theta})\leq\mathcal{L}(\bm{\phi})+\left\langle\nabla\mathcal{L}(\bm{\phi}),\bm{\theta}-\bm{\phi}\right\rangle+D_{A}(\bm{\theta},\bm{\phi}).

Recall the following from Section 3. The objective function (𝜽)\mathcal{L}(\bm{\theta}) is related to the EM objective Q(ϕ|𝜽)Q(\bm{\phi}|\bm{\theta}) by (9),

(𝜽)=Q(𝜽|ϕ)H(𝜽|ϕ),\mathcal{L}(\bm{\theta})=Q(\bm{\theta}|\bm{\phi})-H(\bm{\theta}|\bm{\phi}),

where H(ϕ|𝜽)0H(\bm{\phi}|\bm{\theta})\geq 0 and H(𝜽|𝜽)=0H(\bm{\theta}|\bm{\theta})=0 for all ϕ,𝜽Ω\bm{\phi},\bm{\theta}\in\Omega. Consequently, it then holds that for all ϕ,𝜽Ω\bm{\phi},\bm{\theta}\in\Omega,

(𝜽)\displaystyle\mathcal{L}(\bm{\theta}) =Q(𝜽|𝜽)\displaystyle=Q(\bm{\theta}|\bm{\theta}) (32)
(𝜽)\displaystyle\mathcal{L}(\bm{\theta}) Q(𝜽|ϕ).\displaystyle\leq Q(\bm{\theta}|\bm{\phi}). (33)

Recall also from part a) that Q(𝜽|ϕ)Q(ϕ|ϕ)=(ϕ),𝜽ϕ+DA(𝜽,ϕ)Q(\bm{\theta}|\bm{\phi})-Q(\bm{\phi}|\bm{\phi})=\left\langle\nabla\mathcal{L}(\bm{\phi}),\bm{\theta}-\bm{\phi}\right\rangle+D_{A}(\bm{\theta},\bm{\phi}). Then, the claim follows naturally from the above as follows:

(𝜽)\displaystyle\mathcal{L}(\bm{\theta}) (33)Q(𝜽|ϕ)\displaystyle\stackrel{{\scriptstyle\eqref{eqn: Q is upper bound}}}{{\leq}}Q(\bm{\theta}|\bm{\phi})
=a)Q(ϕ|ϕ)+(ϕ),𝜽ϕ+DA(𝜽,ϕ)\displaystyle\stackrel{{\scriptstyle a)}}{{=}}Q(\bm{\phi}|\bm{\phi})+\left\langle\nabla\mathcal{L}(\bm{\phi}),\bm{\theta}-\bm{\phi}\right\rangle+D_{A}(\bm{\theta},\bm{\phi})
=(32)(ϕ)+(ϕ),𝜽ϕ+DA(𝜽,ϕ)\displaystyle\stackrel{{\scriptstyle\eqref{eqn: Q equality to L}}}{{=}}\mathcal{L}(\bm{\phi})+\left\langle\nabla\mathcal{L}(\bm{\phi}),\bm{\theta}-\bm{\phi}\right\rangle+D_{A}(\bm{\theta},\bm{\phi})

It follows that (𝜽)\mathcal{L}(\bm{\theta}) is 11-smooth relative to A(𝜽)A(\bm{\theta}).

Part d): A(\bm{\theta}) and the MD objective are convex with respect to \bm{\theta}.
Here, we will show that the mirror descent objective is strongly convex in 𝜽\bm{\theta}. It is important for this to hold so that the iterations of MD are well-defined; the minimizer of a strongly convex objective exists and is unique.

Note that the mirror descent objective, (𝜽),𝜽𝜽t+DA(𝜽,𝜽t)\langle\mathcal{\nabla L(\bm{\theta})},\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}), is strongly convex in 𝜽\bm{\theta} if A(𝜽)A(\bm{\theta}) is strongly convex in 𝜽\bm{\theta}. Therefore, since A(𝜽)A(\bm{\theta}) given in (17) is twice continuously differentiable, it is strongly convex with respect to 𝜽\bm{\theta} if and only if 2A(𝜽)r𝑰2d\nabla^{2}A(\bm{\theta})\succeq r\bm{I}_{2d}, for some r>0r>0. We begin:

2A(𝜽)\displaystyle\nabla^{2}A(\bm{\theta}) =2𝜽2𝔼𝑿[(𝒙𝜷)22+log(1+e𝒙𝒘)]\displaystyle=\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\mathbb{E}_{\bm{X}}\left[\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right]
=𝔼𝑿[2𝜽2((𝒙𝜷)22+log(1+e𝒙𝒘))]\displaystyle=\mathbb{E}_{\bm{X}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\left(\frac{(\bm{x}^{\top}\bm{\beta})^{2}}{2}+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right)\right]
=(𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]𝟎𝟎𝔼𝑿[𝒙𝒙])\displaystyle=\begin{pmatrix}\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]&\bm{0}\\ \bm{0}&\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\right]\end{pmatrix}
=(𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]𝟎𝟎𝑰d)\displaystyle=\begin{pmatrix}\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]&\bm{0}\\ \bm{0}&\bm{I}_{d}\end{pmatrix}

where the last line follows from the assumption that 𝒙\bm{x} is sampled from a unit spherical Gaussian distribution: 𝔼𝑿[𝒙𝒙]=𝑰d\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\right]=\bm{I}_{d} for 𝒙𝒩(𝟎,𝑰d)\bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}).

From the above, we see that A(𝜽)A(\bm{\theta}) is strictly convex, and it is strongly convex if 𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]r𝑰d\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]\succeq r\bm{I}_{d} for some r>0r>0. This follows from Lemma B.2 where we show its eigenvalues are bounded below by min{Ω(1𝒘2),Ω(1𝒘23)}\min\left\{\Omega\left(\frac{1}{\|\bm{w}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right)\right\}. Thus, it holds that

2A(𝜽)min{Ω(1𝒘2),Ω(1𝒘23),1}𝑰2d.\nabla^{2}A(\bm{\theta})\succeq\min\left\{\Omega\left(\frac{1}{\|\bm{w}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right),1\right\}\bm{I}_{2d}.

Restricting the feasible set Ω\Omega to be all 𝜽2d\bm{\theta}\in\mathbb{R}^{2d} with 𝜽2N\|\bm{\theta}\|_{2}\leq N for some N[0,)N\in[0,\infty), it holds that A(𝜽)A(\bm{\theta}) is strongly convex with respect to 𝜽\bm{\theta} on Ω\Omega.

With part d) proven, this concludes the proof of Theorem 4.1 for SMoLinE. We now prove the same for SMoLogE, referring to this section where necessary.

A.2 Proof of Theorem 4.1 for SMoLogE

Proof.

Recall that we consider a 22 component SMoLogE (see Section 3.1) where z{1,1}z\in\{-1,1\} is the latent unobserved variable, and

  • 1)

    𝒙𝒩(𝟎,𝑰d)\bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}),

  • 2)

    P(z|𝒙;𝒘)=exp{z+12𝒙𝒘}1+e𝒙𝒘.P(z|\bm{x};\bm{w})=\frac{\exp\{\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\}}{1+e^{\bm{x}^{\top}{\bm{w}}}}.

  • 3)

    P(y|\bm{x},z;\bm{\beta})=\frac{\exp\left\{\left(\frac{yz+1}{2}\right)\bm{x}^{\top}\bm{\beta}\right\}}{1+e^{\bm{x}^{\top}\bm{\beta}}}.

We begin by deriving a near exponential form of the complete data probability density function p(𝒙,z,y;𝜽)p(\bm{x},z,y;\bm{\theta}):

p(𝒙,z,y;𝜽)\displaystyle p(\bm{x},z,y;\bm{\theta})
=P(y|𝒙,z;𝜽)P(z|𝒙;𝜽)p(𝒙)\displaystyle=P(y|\bm{x},z;\bm{\theta})P(z|\bm{x};\bm{\theta})p(\bm{x})
=exp{logP(y|𝒙,z;𝜷)+logP(z|𝒙;𝒘)+logp(𝒙)}\displaystyle=\exp\{\log P(y|\bm{x},z;\bm{\beta})+\log P(z|\bm{x};\bm{w})+\log p(\bm{x})\}
=exp{log((exp{𝒙𝜷}1+e𝒙𝜷)yz+12(11+e𝒙𝜷)1yz+12)+(z+12)𝒙𝒘log(1+e𝒙𝒘)+logp(𝒙)}\displaystyle=\exp\left\{\log\left(\left(\frac{\exp\{\bm{x}^{\top}\bm{\beta}\}}{1+e^{\bm{x}^{\top}\bm{\beta}}}\right)^{\frac{yz+1}{2}}\left(\frac{1}{1+e^{\bm{x}^{\top}\bm{\beta}}}\right)^{1-\frac{yz+1}{2}}\right)+\left(\frac{z+1}{2}\right)\bm{x}^{\top}\bm{w}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})\right\}
=exp{yz+12log(exp{𝒙𝜷}1+e𝒙𝜷)+(1yz+12)log(11+e𝒙𝜷)+(z+12)𝒙𝒘log(1+e𝒙𝒘)+logp(𝒙)}\displaystyle=\exp\left\{{\frac{yz+1}{2}}\log\left(\frac{\exp\{\bm{x}^{\top}\bm{\beta}\}}{1+e^{\bm{x}^{\top}\bm{\beta}}}\right)+\left(1-\frac{yz+1}{2}\right)\log\left(\frac{1}{1+e^{\bm{x}^{\top}\bm{\beta}}}\right)+\left(\frac{z+1}{2}\right)\bm{x}^{\top}\bm{w}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})\right\}
=exp{(yz+12)𝒙𝜷log(1+e𝒙𝜷)+(z+12)𝒙𝒘log(1+e𝒙𝒘)+logp(𝒙)}\displaystyle=\exp\left\{\left(\frac{yz+1}{2}\right)\bm{x}^{\top}\bm{\beta}-\log\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)+\left(\frac{z+1}{2}\right)\bm{x}^{\top}\bm{w}-\log(1+e^{\bm{x}^{\top}\bm{w}})+\log p(\bm{x})\right\}
=exp{[yz𝒙2z𝒙2],[βw]+𝒙(𝒘+𝜷)2log[(1+e𝒙𝜷)(1+e𝒙𝒘)]+logp(𝒙)}.\displaystyle=\exp\left\{\left\langle\begin{bmatrix}\frac{yz\bm{x}}{2}\\ \frac{z\bm{x}}{2}\end{bmatrix},\begin{bmatrix}\beta\\ w\end{bmatrix}\right\rangle+\frac{\bm{x}^{\top}(\bm{w}+\bm{\beta})}{2}-\log\left[\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right]+\log p(\bm{x})\right\}.

Thus we have recovered the decomposition,

p(\bm{x},y,z;\bm{\theta})=\exp\left\{\left\langle\begin{bmatrix}\frac{yz\bm{x}}{2}\\ \frac{z\bm{x}}{2}\end{bmatrix},\begin{bmatrix}\bm{\beta}\\ \bm{w}\end{bmatrix}\right\rangle+a(\bm{x},y,\bm{\theta})\right\}, (34)

where in a(𝒙,y,𝜽)a(\bm{x},y,\bm{\theta}), 𝒙\bm{x} cannot be linearly separated from the parameter 𝜽\bm{\theta}:

a(𝒙,y,𝜽)=𝒙(𝒘+𝜷)2log[(1+e𝒙𝜷)(1+e𝒙𝒘)]+logp(𝒙).a(\bm{x},y,\bm{\theta})=\frac{\bm{x}^{\top}(\bm{w}+\bm{\beta})}{2}-\log\left[\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right]+\log p(\bm{x}).

Similar to SMoLinE, we can see that p(\bm{x},y,z;\bm{\theta}) does not belong to an exponential family of distributions. Also, note that taking the expectation of a(\bm{x},y,\bm{\theta}) over (\bm{x},y)\sim p(\bm{x},y;\bm{\theta}^{*}) yields,

𝔼𝑿,Y[a(𝒙,y,𝜽)]\displaystyle\mathbb{E}_{\bm{X},Y}[a(\bm{x},y,\bm{\theta})] =𝔼𝑿[log[(1+e𝒙𝜷)(1+e𝒙𝒘)]]C\displaystyle=-\mathbb{E}_{\bm{X}}\left[\log\left[\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right]\right]-C
=A(𝜽)C,\displaystyle=-A(\bm{\theta})-C,

where the above uses \mathbb{E}_{\bm{X}}[\frac{\bm{x}^{\top}(\bm{w}+\bm{\beta})}{2}]=0, and C:=-\mathbb{E}_{\bm{X},Y|\bm{\theta}^{*}}[\log p(\bm{x})] does not depend on the parameter \bm{\theta}. We now continue with the proof.
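
Before continuing, the decomposition (34) can be sanity-checked numerically: for any \bm{x}, any y,z\in\{-1,1\} and any hypothetical (\bm{\beta},\bm{w}), the product P(y|\bm{x},z;\bm{\beta})P(z|\bm{x};\bm{w})p(\bm{x}) should coincide with the exponential-like form on the right-hand side of (34). The short numpy sketch below performs this check; it is an illustration only.

import numpy as np

rng = np.random.default_rng(2)
d = 3
x = rng.normal(size=d)
beta, w = rng.normal(size=d), rng.normal(size=d)     # hypothetical parameters

def p_x(v):                                          # standard Gaussian density of x
    return np.exp(-v @ v / 2) / (2 * np.pi) ** (d / 2)

def joint(y, z):                                     # P(y | x, z) P(z | x) p(x)
    p_y = np.exp((y * z + 1) / 2 * (x @ beta)) / (1 + np.exp(x @ beta))
    p_z = np.exp((z + 1) / 2 * (x @ w)) / (1 + np.exp(x @ w))
    return p_y * p_z * p_x(x)

def exp_form(y, z):                                  # right-hand side of (34)
    S = np.concatenate([y * z * x / 2, z * x / 2])
    a = (x @ (w + beta)) / 2 \
        - np.log((1 + np.exp(x @ beta)) * (1 + np.exp(x @ w))) \
        + np.log(p_x(x))
    return np.exp(S @ np.concatenate([beta, w]) + a)

for y in (-1, 1):
    for z in (-1, 1):
        print(y, z, joint(y, z), exp_form(y, z))     # the two columns match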

From here on, the proofs of part a), part b) and part c) are identical to their SMoLinE counterparts, so we refer to Appendix A.1 for those arguments. We now show part d).

Part d): A(\bm{\theta}) and the MD objective are convex with respect to \bm{\theta}.
Here, we will show that the mirror descent objective is strongly convex in 𝜽\bm{\theta}. It is important for this to hold so that the iterations of MD are well-defined; the minimizer of a strongly convex objective exists and is unique.

Note that the mirror descent objective, \langle\nabla\mathcal{L}(\bm{\theta}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}), is strongly convex in \bm{\theta} if A(\bm{\theta}) is strongly convex in \bm{\theta}. Therefore, since A(\bm{\theta}) given in (18) is twice continuously differentiable, it is strongly convex with respect to \bm{\theta} if and only if \nabla^{2}A(\bm{\theta})\succeq r\bm{I}_{2d}, for some r>0. We begin:

2A(𝜽)\displaystyle\nabla^{2}A(\bm{\theta}) =2𝜽2𝔼𝑿[log((1+e𝒙𝜷)(1+e𝒙𝒘))]\displaystyle=\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\mathbb{E}_{\bm{X}}\left[\log\left(\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right)\right]
=𝔼𝑿[2𝜽2(log(1+e𝒙𝜷)+log(1+e𝒙𝒘))]\displaystyle=\mathbb{E}_{\bm{X}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\left(\log\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)+\log\left(1+e^{\bm{x}^{\top}\bm{w}}\right)\right)\right]
=(𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]𝟎𝟎𝔼𝑿[𝒙𝒙e𝒙𝜷(1+e𝒙𝜷)2]).\displaystyle=\begin{pmatrix}\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]&\bm{0}\\ \bm{0}&\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{\beta}}}{\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)^{2}}\right]\end{pmatrix}.

From the above, we see that A(𝜽)A(\bm{\theta}) is strictly convex, and it is strongly convex if 𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]r𝑰d\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]\succeq r\bm{I}_{d} and 𝔼𝑿[𝒙𝒙e𝒙𝜷(1+e𝒙𝜷)2]r𝑰d\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{\beta}}}{\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)^{2}}\right]\succeq r\bm{I}_{d} for some r>0r>0. This follows from Lemma B.2 where we show their respective eigenvalues are bounded below by, min{Ω(1𝒘2),Ω(1𝒘23)}\min\left\{\Omega\left(\frac{1}{\|\bm{w}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right)\right\} and min{Ω(1𝜷2),Ω(1𝜷23)}\min\left\{\Omega\left(\frac{1}{\|\bm{\beta}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{\beta}\|_{2}^{3}}\right)\right\}. Thus, it holds that

2A(𝜽)min{Ω(1𝒘2),Ω(1𝒘23),Ω(1𝜷2),Ω(1𝜷23)}𝑰2d.\nabla^{2}A(\bm{\theta})\succeq\min\left\{\Omega\left(\frac{1}{\|\bm{w}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right),\Omega\left(\frac{1}{\|\bm{\beta}\|_{2}}\right),\Omega\left(\frac{1}{\|\bm{\beta}\|_{2}^{3}}\right)\right\}\bm{I}_{2d}.

Restricting Ω\Omega to be all 𝜽\bm{\theta} with 𝜽2N\|\bm{\theta}\|_{2}\leq N for some N[0,)N\in[0,\infty), it holds that A(𝜽)A(\bm{\theta}) is strongly convex with respect to 𝜽\bm{\theta} on Ω\Omega.

With part d) proven, this concludes the proof of Theorem 4.1 for SMoLogE.

Appendix B Convergence Analysis

In this section, we provide the proofs of Corollary 5.1, Theorem 5.2 and Theorem 5.3. As well, in the last subsection, we discuss the consequences of these results.

B.1 Convergence Properties of Mirror Descent: Corollary 5.1

In this section, we provide a contextual proof for Corollary 5.1 which follows trivially from previous work on MD (see [KKS22, Proposition 2, Corollary 1, and Corollary 3] and [LFN17, Theorem 3.1] for a comprehensive overview of the many convergence properties of MD). The proof is divided into three parts that correspond to each of the three sub-results of the corollary.

For ease of reading, we re-state the Corollary below:
Corollary 5.1: For SMoLinE, SMoLogE with mirror map A(𝜽)A(\bm{\theta}) given as (17) (18) respectively, and denoting DA(𝜽t,𝜽t+1):=KL[p(𝒙,y,z;𝜽t+1)p(𝒙,y,z;𝜽t)]D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})\!:=\!\text{KL}[p(\bm{x},y,z;\bm{\theta}^{t+1})\|p(\bm{x},y,z;\bm{\theta}^{t})], the EM iterates {𝜽t}t[T]\{\bm{\theta}^{t}\}_{t\in[T]} satisfy:

  • 1)

    Stationarity. For no additional conditions,

    mint[T]DA(𝜽t+1,𝜽t)(𝜽1)(𝜽)T;\min_{t\in[T]}D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})\leq\frac{\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{*})}{T}; (35)
  • 2)

    Sub-linear Rate to θ\bm{\theta}^{*}. If 𝜽1\bm{\theta}^{1} is initialized in Θ\Theta, a locally convex region of (𝜽)\mathcal{L}(\bm{\theta}) containing 𝜽\bm{\theta}^{*}, then

    (𝜽T)(𝜽)DA(𝜽,𝜽1)T\mathcal{L}(\bm{\theta}^{T})-\mathcal{L}(\bm{\theta}^{*})\leq\frac{D_{A}(\bm{\theta}^{*},\bm{\theta}^{1})}{T} (36)
  • 3)

    Linear Rate to θ\bm{\theta}^{*}. If 𝜽1\bm{\theta}^{1} is initialized in ΘΩ\Theta\subseteq\Omega, a locally strongly convex region of (𝜽)\mathcal{L}(\bm{\theta}) relative to A(𝜽)A(\bm{\theta}) that contains 𝜽\bm{\theta}^{*}, then

    (𝜽T)(𝜽)(1α)T((𝜽1)(𝜽)).\mathcal{L}(\bm{\theta}^{T})-\mathcal{L}(\bm{\theta}^{*})\leq(1-\alpha)^{T}(\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{*})). (37)
Proof.

The proof is divided into three parts that correspond to each of the three sub-results of the corollary.

Part 1): Stationarity.
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Proposition 2]. We write it below for completeness.

Recall from Theorem 4.1 that 𝜽t+1\bm{\theta}^{t+1} is obtained as the minimizer of the convex objective, (14):

(𝜽t),𝜽𝜽t+DA(𝜽,𝜽t).\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta},\bm{\theta}^{t}).

As such, differentiating and setting equal to 0, it holds that 𝜽t+1\bm{\theta}^{t+1} satisfies

(𝜽t)=A(𝜽t)A(𝜽t+1)\nabla\mathcal{L}(\bm{\theta}^{t})=\nabla A(\bm{\theta}^{t})-\nabla A(\bm{\theta}^{t+1}) (38)

Further, by the above together with relative smoothness, it holds that

(𝜽t+1)\displaystyle\mathcal{L}(\bm{\theta}^{t+1}) (𝜽t)+(𝜽t),𝜽t+1𝜽t+DA(𝜽t+1,𝜽t)\displaystyle\leq\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
=(𝜽t)+A(𝜽t)A(𝜽t+1),𝜽t+1𝜽t+DA(𝜽t+1,𝜽t)\displaystyle=\mathcal{L}(\bm{\theta}^{t})+\langle\nabla A(\bm{\theta}^{t})-\nabla A(\bm{\theta}^{t+1}),\bm{\theta}^{t+1}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
=(𝜽t)A(𝜽t+1),𝜽t+1𝜽t+A(𝜽t+1)A(𝜽t)\displaystyle=\mathcal{L}(\bm{\theta}^{t})-\langle\nabla A(\bm{\theta}^{t+1}),\bm{\theta}^{t+1}-\bm{\theta}^{t}\rangle+A(\bm{\theta}^{t+1})-A(\bm{\theta}^{t})
=(𝜽t)DA(𝜽t,𝜽t+1).\displaystyle=\mathcal{L}(\bm{\theta}^{t})-D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1}).

Thus we have shown that

DA(𝜽t,𝜽t+1)(𝜽t)(𝜽t+1).D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})\leq\mathcal{L}(\bm{\theta}^{t})-\mathcal{L}(\bm{\theta}^{t+1}). (39)

The claim then follows from taking the mean over TT iterations:

mintTDA(𝜽t,𝜽t+1)1Tt=1TDA(𝜽t,𝜽t+1)1Tt=1T(𝜽t)(𝜽t+1)=(𝜽1)(𝜽T)T(𝜽1)(𝜽)T.\min_{t\leq T}D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})\leq\frac{1}{T}\sum_{t=1}^{T}D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})\leq\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}(\bm{\theta}^{t})-\mathcal{L}(\bm{\theta}^{t+1})=\frac{\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{T})}{T}\leq\frac{\mathcal{L}(\bm{\theta}^{1})-\mathcal{L}(\bm{\theta}^{*})}{T}.

Part 2): Sub-linear Rate to 𝜽\bm{\theta}^{*}.
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Corollary 1] and [LFN17, Theorem 3.1]. We write it below for completeness.

Here, we assume that (𝜽)\mathcal{L}(\bm{\theta}) is convex on the set Θ\Theta. In part 1), we used (38) to show,

(𝜽t),𝜽t+1𝜽t+DA(𝜽t+1,𝜽t)=DA(𝜽t,𝜽t+1),\displaystyle\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})=-D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1}),

where the right hand side is non-positive since the Bregman divergence is non-negative if the inducing function AA is convex – which it is. Now, starting from relative smoothness, we see that

(𝜽t+1)\displaystyle\mathcal{L}(\bm{\theta}^{t+1}) (𝜽t)+(𝜽t),𝜽t+1𝜽t+DA(𝜽t+1,𝜽t)\displaystyle\leq\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
=(𝜽t)+(𝜽t),𝜽t+1𝜽+𝜽𝜽t+DA(𝜽t+1,𝜽t)\displaystyle=\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{*}+\bm{\theta}^{*}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
=(𝜽t)+(𝜽t),𝜽t+1𝜽+(𝜽t),𝜽𝜽t+DA(𝜽t+1,𝜽t)\displaystyle=\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{*}\rangle+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{*}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
i)(𝜽)+(𝜽t),𝜽t+1𝜽+DA(𝜽t+1,𝜽t)\displaystyle\stackrel{{\scriptstyle i)}}{{\leq}}\mathcal{L}(\bm{\theta}^{*})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{t+1}-\bm{\theta}^{*}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})
=(38)(𝜽)+A(𝜽t)A(𝜽t+1),𝜽t+1𝜽+DA(𝜽t+1,𝜽t)\displaystyle\stackrel{{\scriptstyle\eqref{eqn:first order cond MD}}}{{=}}\mathcal{L}(\bm{\theta}^{*})+\langle\nabla A(\bm{\theta}^{t})-\nabla A(\bm{\theta}^{t+1}),\bm{\theta}^{t+1}-\bm{\theta}^{*}\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t})

where i) follows from convexity of \mathcal{L}(\bm{\theta}) on the set \Theta. Subsequently, we apply the three-point lemma, D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})=D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1})+\langle\bm{\theta}^{*}-\bm{\theta}^{t+1},\nabla A(\bm{\theta}^{t+1})-\nabla A(\bm{\theta}^{t})\rangle+D_{A}(\bm{\theta}^{t+1},\bm{\theta}^{t}), and obtain,

(𝜽t+1)(𝜽)+DA(𝜽,𝜽t)DA(𝜽,𝜽t+1).\mathcal{L}(\bm{\theta}^{t+1})\leq\mathcal{L}(\bm{\theta}^{*})+D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})-D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1}). (40)

Finally, the result follows from summing the left and right hand side over TT iterations:

T((𝜽T)(𝜽))t=1T(𝜽t+1)(𝜽)t=1TDA(𝜽,𝜽t)DA(𝜽,𝜽t+1)DA(𝜽,𝜽1)T(\mathcal{L}(\bm{\theta}^{T})-\mathcal{L}(\bm{\theta}^{*}))\leq\sum_{t=1}^{T}\mathcal{L}(\bm{\theta}^{t+1})-\mathcal{L}(\bm{\theta}^{*})\leq\sum_{t=1}^{T}D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})-D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1})\leq D_{A}(\bm{\theta}^{*},\bm{\theta}^{1})

Part 3): Linear Rate to 𝜽\bm{\theta}^{*}.
Given Theorem 4.1, this proof follows from identical arguments to that of [KKS22, Corollary 3] and [LFN17, Theorem 3.1]. We write it below for completeness.

In addition to convexity, we now assume that (𝜽)\mathcal{L}(\bm{\theta}) is α\alpha-strongly convex relative to A(𝜽)A(\bm{\theta}) on the set Θ\Theta. Specifically, we have that for any ϕ,𝜽Θ\bm{\phi},\bm{\theta}\in\Theta,

(𝜽)(ϕ)+(ϕ),𝜽ϕ+αDA(𝜽,ϕ).\mathcal{L}(\bm{\theta})\geq\mathcal{L}(\bm{\phi})+\langle\nabla\mathcal{L}(\bm{\phi}),\bm{\theta}-\bm{\phi}\rangle+\alpha D_{A}(\bm{\theta},\bm{\phi}). (41)

Using the three point lemma again, we have

DA(𝜽,𝜽t+1)\displaystyle D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1}) =DA(𝜽,𝜽t)+𝜽𝜽t,A(𝜽t)A(𝜽t+1)+DA(𝜽t,𝜽t+1)\displaystyle=D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+\langle\bm{\theta}^{*}-\bm{\theta}^{t},\nabla A(\bm{\theta}^{t})-\nabla A(\bm{\theta}^{t+1})\rangle+D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})
=(38)DA(𝜽,𝜽t)+(𝜽t),𝜽𝜽t+DA(𝜽t,𝜽t+1)\displaystyle\stackrel{{\scriptstyle\eqref{eqn:first order cond MD}}}{{=}}D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}^{*}-\bm{\theta}^{t}\rangle+D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})
(41)DA(𝜽,𝜽t)+(𝜽)(𝜽t)αDA(𝜽,𝜽t)+DA(𝜽t,𝜽t+1)\displaystyle\stackrel{{\scriptstyle\eqref{eqn:relative strong convexity}}}{{\leq}}D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+\mathcal{L}(\bm{\theta}^{*})-\mathcal{L}(\bm{\theta}^{t})-\alpha D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})
=(1α)DA(𝜽,𝜽t)+(𝜽)(𝜽t)+DA(𝜽t,𝜽t+1)\displaystyle=(1-\alpha)D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+\mathcal{L}(\bm{\theta}^{*})-\mathcal{L}(\bm{\theta}^{t})+D_{A}(\bm{\theta}^{t},\bm{\theta}^{t+1})
(39)(1α)DA(𝜽,𝜽t)+(𝜽)(𝜽t)+(𝜽t)(𝜽t+1)\displaystyle\stackrel{{\scriptstyle\eqref{eqn:Progress Bound}}}{{\leq}}(1-\alpha)D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})+\mathcal{L}(\bm{\theta}^{*})-\mathcal{L}(\bm{\theta}^{t})+\mathcal{L}(\bm{\theta}^{t})-\mathcal{L}(\bm{\theta}^{t+1})
(1α)DA(𝜽,𝜽t)\displaystyle\leq(1-\alpha)D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})
(1α)TDA(𝜽,𝜽1).\displaystyle\leq(1-\alpha)^{T}D_{A}(\bm{\theta}^{*},\bm{\theta}^{1}).

Finally, from (40), we see that

(𝜽t+1)(𝜽)\displaystyle\mathcal{L}(\bm{\theta}^{t+1})-\mathcal{L}(\bm{\theta}^{*}) DA(𝜽,𝜽t)DA(𝜽,𝜽t+1)\displaystyle\leq D_{A}(\bm{\theta}^{*},\bm{\theta}^{t})-D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1})
(1α)TDA(𝜽,𝜽1)DA(𝜽,𝜽t+1)\displaystyle\leq(1-\alpha)^{T}D_{A}(\bm{\theta}^{*},\bm{\theta}^{1})-D_{A}(\bm{\theta}^{*},\bm{\theta}^{t+1})
(1α)TDA(𝜽,𝜽1).\displaystyle\leq(1-\alpha)^{T}D_{A}(\bm{\theta}^{*},\bm{\theta}^{1}).
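
To make the three guarantees of Corollary 5.1 concrete, they can be observed on a toy problem for which the relative constants are known exactly. In the numpy sketch below (a hypothetical stand-in, not the MoE objective), the mirror map is the negative entropy A(\bm{\theta})=\sum_{i}\theta_{i}\log\theta_{i}-\theta_{i} on the positive orthant and \mathcal{L}(\bm{\theta})=\alpha A(\bm{\theta})+\langle c,\bm{\theta}\rangle, so that \mathcal{L} is 1-smooth and \alpha-strongly convex relative to A; unit-step mirror descent then exhibits the linear rate (37).

import numpy as np

rng = np.random.default_rng(3)
dim, alpha, T = 5, 0.6, 30
c = rng.normal(size=dim)

A      = lambda t: np.sum(t * np.log(t) - t)       # mirror map (negative entropy)
grad_A = lambda t: np.log(t)
L      = lambda t: alpha * A(t) + c @ t            # toy loss: alpha-strongly convex relative to A
grad_L = lambda t: alpha * np.log(t) + c

theta_star = np.exp(-c / alpha)                    # unique minimizer of L
theta = np.ones(dim)                               # theta^1
gap0 = L(theta) - L(theta_star)

for t in range(T):
    # unit-step mirror descent: grad_A(theta^{t+1}) = grad_A(theta^t) - grad_L(theta^t)
    theta = np.exp(grad_A(theta) - grad_L(theta))
    gap = L(theta) - L(theta_star)
    assert gap <= (1 - alpha) ** (t + 1) * gap0 + 1e-12   # linear rate (37)
print("final optimality gap:", L(theta) - L(theta_star))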

B.2 Proof of Theorem 5.2

In this section, we provide the complete and detailed proof of Theorem 5.2. Before we begin, we make the additional useful remark that the respective Fisher information matrices introduced in Section 5 satisfy the following equalities:

I𝒙,y,z|𝜽\displaystyle I_{\bm{x},y,z|\bm{\theta}} =2Q(ϕ|𝜽)|ϕ=𝜽=2A(𝜽)\displaystyle=\nabla^{2}Q(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}=\nabla^{2}A(\bm{\theta}) (42)
Iz|𝒙,y,𝜽\displaystyle I_{z|\bm{x},y,\bm{\theta}} =2H(ϕ|𝜽)|ϕ=𝜽.\displaystyle=\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}. (43)

Thus, it also holds that the MIM is a function of A(𝜽)A(\bm{\theta}) and H(ϕ|𝜽)H(\bm{\phi}|\bm{\theta}), i.e.,

𝑴(𝜽)=2A(𝜽)12H(ϕ|𝜽)|ϕ=𝜽.\bm{M}(\bm{\theta})={\nabla^{2}A(\bm{\theta})}^{-1}\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}. (44)

For ease of reading, we re-state the theorem below:
Theorem 5.2: For SMoLinE and SMoLogE and their respective mirror mappings (17) and (18), the objective (𝜽)\mathcal{L}(\bm{\theta}) is α\alpha-strongly convex relative to the mirror map A(𝜽)A(\bm{\theta}) on the convex set Θ\Theta if and only if

λmax(𝑴(𝜽))(1α) for all 𝜽Θ.\lambda_{\max}(\bm{M}(\bm{\theta}))\leq(1-\alpha)\text{ for all }\bm{\theta}\in\Theta.
Proof.

Recall from Section 5 that (𝜽)\mathcal{L}(\bm{\theta}) is strongly convex relative to A(𝜽)A(\bm{\theta}) on Θ\Theta if for all 𝜽,ϕΘ\bm{\theta},\bm{\phi}\in\Theta, it holds that

(ϕ)(𝜽)+(𝜽),ϕ𝜽+αDA(ϕ,𝜽).\mathcal{L}(\bm{\phi})\geq\mathcal{L}(\bm{\theta})+\langle\nabla\mathcal{L}(\bm{\theta}),\bm{\phi}-\bm{\theta}\rangle+\alpha D_{A}(\bm{\phi},\bm{\theta}).

For (𝜽)\mathcal{L}(\bm{\theta}) and A(𝜽)A(\bm{\theta}) twice continuously differentiable, it was shown by [LFN17] that this is equivalent to the following bound on the Hessian:

2(𝜽)α2A(𝜽).\nabla^{2}\mathcal{L}(\bm{\theta})\succeq\alpha\nabla^{2}A(\bm{\theta}).

Now, using (ϕ)=Q(ϕ|𝜽)H(ϕ|𝜽)\mathcal{L}(\bm{\phi})=Q(\bm{\phi}|\bm{\theta})-H(\bm{\phi}|\bm{\theta}), we see that

2(𝜽)\displaystyle\nabla^{2}\mathcal{L}(\bm{\theta}) =2(Q(ϕ|𝜽)H(ϕ|𝜽))|ϕ=𝜽\displaystyle=\nabla^{2}(Q(\bm{\phi}|\bm{\theta})-H(\bm{\phi}|\bm{\theta}))|_{\bm{\phi}=\bm{\theta}}
=2Q(ϕ|𝜽)|ϕ=𝜽2H(ϕ|𝜽)|ϕ=𝜽\displaystyle=\nabla^{2}Q(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}-\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}
=2A(𝜽)2H(ϕ|𝜽)|ϕ=𝜽\displaystyle=\nabla^{2}A(\bm{\theta})-\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}

Therefore, since 2A(𝜽)\nabla^{2}A(\bm{\theta}) is symmetric positive definite (proven in Appendix A), our condition simplifies to

(1α)2A(𝜽)2H(ϕ|𝜽)|ϕ=𝜽\displaystyle(1-\alpha)\nabla^{2}A(\bm{\theta})\succeq\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}
\displaystyle\iff (1α)𝑰2d2A(𝜽)12H(ϕ|𝜽)|ϕ=𝜽=𝑴(𝜽).\displaystyle(1-\alpha)\bm{I}_{2d}\succeq\nabla^{2}A(\bm{\theta})^{-1}\nabla^{2}H(\bm{\phi}|\bm{\theta})|_{\bm{\phi}=\bm{\theta}}=\bm{M}(\bm{\theta}).

Finally, for 𝒙\bm{x} from a unit spherical Gaussian distribution, we know that 𝑴(𝜽)\bm{M}(\bm{\theta}) is symmetric positive-definite (see Lemma B.1). As a result, the above inequality is equivalent to the following bound on the eigenvalues of the MIM:

1αλmax(𝑴(𝜽)).1-\alpha\geq\lambda_{\max}(\bm{M}(\bm{\theta})).
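
The equivalence used in the last step is a purely linear-algebraic fact: for \nabla^{2}A symmetric positive definite and \nabla^{2}H symmetric positive semi-definite, (1-\alpha)\nabla^{2}A\succeq\nabla^{2}H holds if and only if \lambda_{\max}(\nabla^{2}A^{-1}\nabla^{2}H)\leq 1-\alpha. The numpy sketch below checks this on randomly generated matrices (generic stand-ins, not the actual Fisher information matrices).

import numpy as np

rng = np.random.default_rng(4)
d = 6

B = rng.normal(size=(d, d))
hess_A = B @ B.T + np.eye(d)                   # SPD stand-in for grad^2 A
C = 0.5 * rng.normal(size=(d, d))
hess_H = C @ C.T                               # PSD stand-in for grad^2 H

M = np.linalg.solve(hess_A, hess_H)            # M = (grad^2 A)^{-1} grad^2 H
lam_max = np.max(np.real(np.linalg.eigvals(M)))
alpha_star = 1.0 - lam_max                     # largest alpha such that (1 - alpha) A >= H

# (1 - alpha_star) hess_A - hess_H is PSD with smallest eigenvalue ~ 0 (up to rounding)
print(np.min(np.linalg.eigvalsh((1 - alpha_star) * hess_A - hess_H)))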

We now provide the simple Lemma that the MIM is a symmetric matrix for SMoLinE and SMoLogE.

Lemma B.1 (𝑴(𝜽)\bm{M}(\bm{\theta}) is symmetric).

For SMoLinE and SMoLogE, the MIM is a symmetric matrix, i.e. 𝐌(𝛉)=𝐌(𝛉)\bm{M}(\bm{\theta})=\bm{M}(\bm{\theta})^{\top}.

Proof.

Recall the assumption that \bm{x} is sampled from a unit spherical Gaussian distribution: \bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}). As such, for any orthonormal d\times d matrix \bm{R}, we know that \bm{R}\bm{X}\sim\mathcal{N}(\bm{0},\bm{I}_{d}); this is called rotational invariance of the Gaussian distribution. Thus, consider an orthonormal \bm{R}_{\bm{u}}\in\mathbb{R}^{d\times d} such that \bm{R}_{\bm{u}}\bm{u}=\bm{e}_{1}\|\bm{u}\|_{2}, where \bm{e}_{j}\in\mathbb{R}^{d} denotes the j^{th} canonical basis vector, and write \tilde{\bm{x}}:=\bm{R}_{\bm{u}}\bm{x}. Now, we can observe that for any \bm{u}\in\mathbb{R}^{d}, the matrix \mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{u}}\right)^{2}}\right] is diagonalized by the orthonormal matrix \bm{R}_{\bm{u}}:

\displaystyle\bm{R}_{\bm{u}}\left(\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{u}}\right)^{2}}\right]\right)\bm{R}_{\bm{u}}^{\top}=\mathbb{E}_{\bm{X}}\left[\bm{R}_{\bm{u}}\bm{x}\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\frac{e^{\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\bm{R}_{\bm{u}}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\bm{R}_{\bm{u}}\bm{u}}\right)^{2}}\right]
\displaystyle=\mathbb{E}_{\tilde{\bm{X}}}\left[\tilde{\bm{x}}\tilde{\bm{x}}^{\top}\frac{e^{\tilde{\bm{x}}^{\top}\bm{e}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{\bm{x}}^{\top}\bm{e}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]
\displaystyle=\begin{pmatrix}\mathbb{E}_{\tilde{\bm{X}}}\left[\tilde{x}_{1}^{2}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]&\bm{0}&\bm{0}\\ \bm{0}&\dots&\bm{0}\\ \bm{0}&\bm{0}&\mathbb{E}_{\tilde{\bm{X}}}\left[\tilde{x}_{d}^{2}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]\end{pmatrix},

where in the above, 1) \bm{R}_{\bm{u}}^{\top}\bm{R}_{\bm{u}}=\bm{I}_{d} since \bm{R}^{\top}=\bm{R}^{-1} for orthonormal matrices, and 2) the off-diagonal entries evaluate to 0 because, for all i\neq j, the zero-mean random variables \tilde{X}_{j} and \tilde{X}_{i} are independent.

Finally, we put the above together to show 𝑴(𝜽)\bm{M}(\bm{\theta}) is symmetric. We define the block diagonal orthonormal matrix 𝑹𝑴\bm{R}_{\bm{M}} as

𝑹𝑴:=(𝑹𝒘𝟎𝟎𝑹𝜷).\bm{R}_{\bm{M}}:=\begin{pmatrix}\bm{R}_{\bm{w}}&\bm{0}\\ \bm{0}&\bm{R}_{\bm{\beta}}\end{pmatrix}.

From the above, it follows that both for SMoLinE and SMoLogE, 𝑹𝑴2A(𝜽)1𝑹𝑴\bm{R}_{\bm{M}}\nabla^{2}A(\bm{\theta})^{-1}\bm{R}_{\bm{M}}^{\top} is a diagonal matrix. We can now use this change of basis matrix to show the MIM is a symmetric matrix:

𝑴(𝜽)\displaystyle\bm{M}(\bm{\theta}) =2A(𝜽)12H(𝜽|𝜽)\displaystyle={\nabla^{2}A(\bm{\theta})}^{-1}\nabla^{2}H(\bm{\theta}|\bm{\theta})
=𝑹𝑴(𝑹𝑴2A(𝜽)1𝑹𝑴)(𝑹𝑴2H(𝜽|𝜽)R𝑴)𝑹𝑴\displaystyle=\bm{R}_{\bm{M}}^{\top}\left(\bm{R}_{\bm{M}}{\nabla^{2}A(\bm{\theta})}^{-1}\bm{R}_{\bm{M}}^{\top}\right)\left(\bm{R}_{\bm{M}}\bm{\nabla}^{2}H(\bm{\theta}|\bm{\theta})R_{\bm{M}}^{\top}\right)\bm{R}_{\bm{M}}
=𝑹𝑴(𝑹𝑴2H(𝜽|𝜽)R𝑴)(𝑹𝑴2A(𝜽)1𝑹𝑴)𝑹𝑴\displaystyle=\bm{R}_{\bm{M}}^{\top}\left(\bm{R}_{\bm{M}}\bm{\nabla}^{2}H(\bm{\theta}|\bm{\theta})R_{\bm{M}}^{\top}\right)\left(\bm{R}_{\bm{M}}{\nabla^{2}A(\bm{\theta})}^{-1}\bm{R}_{\bm{M}}^{\top}\right)\bm{R}_{\bm{M}}
=2H(𝜽|𝜽)(2A(𝜽)1)\displaystyle=\nabla^{2}H(\bm{\theta}|\bm{\theta})^{\top}(\nabla^{2}A(\bm{\theta})^{-1})^{\top}
=𝑴(𝜽).\displaystyle=\bm{M}(\bm{\theta})^{\top}.

B.3 Proof of Theorem 5.3

For ease of reading, we re-state the theorem below:
Theorem 5.3: For SMoLinE, the eigenvalues of \bm{I}_{\bm{x},y,z|\bm{\theta}}^{-1} belong to the set

λ(𝑰𝒙,y,z|𝜽1)={Θ(𝒘23),Θ(𝒘2),1}\lambda\left(\bm{I}_{\bm{x},y,z|\bm{\theta}}^{-1}\right)=\{\Theta(\|\bm{w}\|_{2}^{3}),\Theta(\|\bm{w}\|_{2}),1\}

and 𝑰z|𝒙,y,𝜽\bm{I}_{z|\bm{x},y,\bm{\theta}}, is given as the expectation over (𝑿,Y)(\bm{X},Y) of a function that is decreasing as a function of 𝜽2\|\bm{\theta}\|_{2}:

𝑰z|𝒙,y,𝜽=𝔼𝑿,Y[exp[𝒙2y𝒙],𝜽[𝒙2y𝒙][𝒙2y𝒙](1+exp[𝒙2y𝒙],𝜽)2].\bm{I}_{z|\bm{x},y,\bm{\theta}}=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}^{\top}}{\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)^{2}}\right].

Similarly, For SMoLogE, the eigenvalues of 𝑰𝒙,z,y|𝜽1\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} belong to the set

λ(𝑰𝒙,z,y|𝜽1)={Θ(𝒘23),Θ(𝒘2),Θ(𝜷23),Θ(𝜷2)}\!\lambda\left(\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}\right)=\{\Theta(\|\bm{w}\|_{2}^{3}),\Theta(\|\bm{w}\|_{2}),\Theta(\|\bm{\beta}\|_{2}^{3}),\Theta(\|\bm{\beta}\|_{2})\}

and 𝑰z|𝒙,y,𝜽\bm{I}_{z|\bm{x},y,\bm{\theta}} is given as the expectation over (𝑿,Y)(\bm{X},Y) of a function that is decreasing as a function of 𝜽2\|\bm{\theta}\|_{2}:

𝑰z|𝒙,y,𝜽=𝔼𝑿,Y[exp[𝒙y𝒙],𝜽[𝒙y𝒙][𝒙y𝒙](1+exp[𝒙y𝒙],𝜽)2].\!\bm{I}_{z|\bm{x},y,\bm{\theta}}=\mathbb{E}_{\bm{X},Y}\!\left[\frac{\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}^{\top}}{\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)^{2}}\right]\!.
Proof.

We divide the proof into two parts. In the first part, we consider the SMoLinE setting and, in the second part, we consider the SMoLogE setting.

Part a): SMoLinE.
For \bm{I}_{\bm{x},z,y|\bm{\theta}}, recall that it has the following form:

𝑰𝒙,y,z|𝜽=(𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]𝟎𝟎𝑰d).\bm{I}_{\bm{x},y,z|\bm{\theta}}=\begin{pmatrix}\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]&\bm{0}\\ \bm{0}&\bm{I}_{d}\end{pmatrix}.

By Lemma B.2, we see that 𝑰𝒙,z,y|𝜽\bm{I}_{\bm{x},z,y|\bm{\theta}} can be diagonalized into the following form:

𝑰𝒙,y,z|𝜽=𝑹𝑴(λ1𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎λ2𝟎𝟎𝟎𝟎𝑰d)𝑹𝑴\bm{I}_{\bm{x},y,z|\bm{\theta}}=\bm{R}_{\bm{M}}\begin{pmatrix}\lambda_{1}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&\dots&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&\lambda_{2}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&\bm{I}_{d}\end{pmatrix}\bm{R}_{\bm{M}}^{\top}

where \bm{R}_{\bm{M}} is an orthonormal rotation matrix, \lambda_{1}=\Theta\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right) and \lambda_{2}=\Theta\left(\frac{1}{\|\bm{w}\|_{2}}\right). Therefore, \bm{I}_{\bm{x},z,y|\bm{\theta}} has three eigenvalues given as \left\{\Theta\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right),\Theta\left(\frac{1}{\|\bm{w}\|_{2}}\right),1\right\}. It follows that \bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} has the form

\bm{I}_{\bm{x},y,z|\bm{\theta}}^{-1}=\bm{R}_{\bm{M}}\begin{pmatrix}1/\lambda_{1}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&\dots&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&1/\lambda_{2}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&\bm{I}_{d}\end{pmatrix}\bm{R}_{\bm{M}}^{\top}.

Therefore, 𝑰𝒙,y,z|𝜽1\bm{I}_{\bm{x},y,z|\bm{\theta}}^{-1} has three eigenvalues given as {Θ(𝒘23),Θ(𝒘2),1}\left\{\Theta\left(\|\bm{w}\|_{2}^{3}\right),\Theta\left(\|\bm{w}\|_{2}\right),1\right\}.

Now, for Iz|𝒙,y,𝜽I_{z|\bm{x},y,\bm{\theta}}, we first derive a more compact form for the conditional distribution of the latent variable, p(z|𝒙,y;𝜽)p(z|\bm{x},y;\bm{\theta}). From simple Bayes rule and algebraic manipulation, we see that

p(z|𝒙,y;𝜽)\displaystyle p(z|\bm{x},y;\bm{\theta}) =p(y|𝒙,z;𝜽)p(z|𝒙;𝜽)p(𝒙)p(𝒙,y;𝜽)\displaystyle=\frac{p(y|\bm{x},z;\bm{\theta})p(z|\bm{x};\bm{\theta})p(\bm{x})}{p(\bm{x},y;\bm{\theta})}
=12πexp{(yz𝒙𝜷)22}exp{z+12𝒙𝒘}1+e𝒙𝒘12πexp{(y𝒙𝜷)22}exp{𝒙𝒘}1+e𝒙𝒘+12πexp{(y+𝒙𝜷)22}11+e𝒙𝒘\displaystyle=\frac{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(y-z\bm{x}^{\top}\bm{\beta})^{2}}{2}\right\}\frac{\exp\{\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\}}{1+e^{\bm{x}^{\top}{\bm{w}}}}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(y-\bm{x}^{\top}\bm{\beta})^{2}}{2}\right\}\frac{\exp\{\bm{x}^{\top}{\bm{w}}\}}{1+e^{\bm{x}^{\top}{\bm{w}}}}+\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(y+\bm{x}^{\top}\bm{\beta})^{2}}{2}\right\}\frac{1}{1+e^{\bm{x}^{\top}{\bm{w}}}}}
=exp{(yz𝒙𝜷)22+z+12𝒙𝒘}exp{(y𝒙𝜷)22+𝒙𝒘}+exp{(y+𝒙𝜷)22}\displaystyle=\frac{\exp\left\{-\frac{(y-z\bm{x}^{\top}\bm{\beta})^{2}}{2}+\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\right\}}{\exp\left\{-\frac{(y-\bm{x}^{\top}\bm{\beta})^{2}}{2}+\bm{x}^{\top}{\bm{w}}\right\}+\exp\left\{-\frac{(y+\bm{x}^{\top}\bm{\beta})^{2}}{2}\right\}}
=exp{zy𝒙𝜷+z+12𝒙𝒘}exp{y𝒙𝜷+𝒙𝒘}+exp{y𝒙𝜷}\displaystyle=\frac{\exp\left\{zy\bm{x}^{\top}\bm{\beta}+\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\right\}}{\exp\left\{y\bm{x}^{\top}\bm{\beta}+\bm{x}^{\top}{\bm{w}}\right\}+\exp\left\{-y\bm{x}^{\top}\bm{\beta}\right\}}
=exp{z+12(2y𝒙𝜷+𝒙𝒘)}exp{2y𝒙𝜷+𝒙𝒘}+1\displaystyle=\frac{\exp\left\{\frac{z+1}{2}(2y\bm{x}^{\top}\bm{\beta}+\bm{x}^{\top}\bm{w})\right\}}{\exp\left\{2y\bm{x}^{\top}\bm{\beta}+\bm{x}^{\top}{\bm{w}}\right\}+1}
=exp{z+12[𝒙2y𝒙],[𝒘𝜷]}exp{[𝒙2y𝒙],[𝒘𝜷]}+1\displaystyle=\frac{\exp\left\{\frac{z+1}{2}\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\begin{bmatrix}\bm{w}\\ \bm{\beta}\end{bmatrix}\right\rangle\right\}}{\exp\left\{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\begin{bmatrix}\bm{w}\\ \bm{\beta}\end{bmatrix}\right\rangle\right\}+1}
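
As a quick numerical sanity check, the logistic expression above can be compared against a direct evaluation of Bayes' rule. The numpy sketch below does so for arbitrary (hypothetical) values of \bm{x}, y, \bm{w} and \bm{\beta}; it is illustrative only.

import numpy as np

rng = np.random.default_rng(5)
d = 4
x = rng.normal(size=d)
w, beta = rng.normal(size=d), rng.normal(size=d)   # hypothetical gate / expert parameters
y = 0.3                                            # an arbitrary real response value

def bayes_posterior(z):                            # direct Bayes-rule evaluation
    def joint(zz):
        lik  = np.exp(-(y - zz * (x @ beta)) ** 2 / 2) / np.sqrt(2 * np.pi)
        gate = np.exp((zz + 1) / 2 * (x @ w)) / (1 + np.exp(x @ w))
        return lik * gate
    return joint(z) / (joint(1) + joint(-1))

def logistic_posterior(z):                         # compact form derived above
    t = 2 * y * (x @ beta) + x @ w
    return np.exp((z + 1) / 2 * t) / (1 + np.exp(t))

for z in (-1, 1):
    print(z, bayes_posterior(z), logistic_posterior(z))   # rows match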

Now that we have this simplified form, we are able to derive (27) for Iz|𝒙,y,𝜽I_{z|\bm{x},y,\bm{\theta}}:

Iz|𝒙,y,𝜽\displaystyle I_{z|\bm{x},y,\bm{\theta}} =𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2logp(z|𝒙,y;𝜽)]\displaystyle=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log p(z|\bm{x},y;\bm{\theta})\right]
=𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2(z+12(𝒙2y𝒙),𝜽log(1+exp[𝒙2y𝒙],𝜽))]\displaystyle=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\left(\frac{z+1}{2}\left\langle\begin{pmatrix}\bm{x}\\ 2y\bm{x}\end{pmatrix},\bm{\theta}\right\rangle-\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right)\right]
=𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2log(1+exp[𝒙2y𝒙],𝜽)]\displaystyle=\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right]
=𝔼𝑿,Y[2𝜽2log(1+exp[𝒙2y𝒙],𝜽)]\displaystyle=\mathbb{E}_{\bm{X},Y}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right]
=𝔼𝑿,Y[exp[𝒙2y𝒙],𝜽[𝒙2y𝒙][𝒙2y𝒙](1+exp[𝒙2y𝒙],𝜽)2].\displaystyle=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix}^{\top}}{\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ 2y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)^{2}}\right].

This expression depends only on the random variable (𝑿,Y)(\bm{X},Y) and the parameter iterate 𝜽\bm{\theta}.
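
The fact that this matrix is driven by a logistic weight that decays with \|\bm{\theta}\|_{2} can also be observed empirically. The sketch below (numpy, illustrative only) samples (\bm{x},y) from a SMoLinE model at a hypothetical parameter \bm{\theta}^{*}=(\bm{w}^{*},\bm{\beta}^{*}), forms the Monte Carlo analogue of the expression above at rescaled parameters c\bm{\theta}^{*}, and reports its spectral norm, which shrinks as c grows.

import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 100_000
w_star, beta_star = rng.normal(size=d), rng.normal(size=d)   # hypothetical theta^*

# sample (x, y) from the SMoLinE model at (w*, beta*)
x = rng.normal(size=(n, d))
p_z1 = 1.0 / (1.0 + np.exp(-(x @ w_star)))            # P(z = 1 | x)
z = np.where(rng.random(n) < p_z1, 1.0, -1.0)
y = z * (x @ beta_star) + rng.normal(size=n)          # y | x, z ~ N(z x^T beta*, 1)

def I_z(theta):                                       # Monte Carlo estimate of the display above
    v = np.hstack([x, 2 * y[:, None] * x])            # v = (x, 2 y x)
    t = v @ theta
    s = np.exp(-np.abs(t)) / (1 + np.exp(-np.abs(t))) ** 2   # e^t / (1 + e^t)^2, computed stably
    return (v * s[:, None]).T @ v / n

theta_star = np.concatenate([w_star, beta_star])
for c in (0.5, 1.0, 2.0, 4.0):
    print(c, np.linalg.norm(I_z(c * theta_star), 2))  # spectral norm shrinks as ||theta|| grows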

Part b): SMoLogE.
Recall that for SMoLogE, 𝑰𝒙,z,y|𝜽\bm{I}_{\bm{x},z,y|\bm{\theta}} has the following form:

𝑰𝒙,z,y|𝜽=(𝔼𝑿[𝒙𝒙e𝒙𝒘(1+e𝒙𝒘)2]𝟎𝟎𝔼𝑿[𝒙𝒙e𝒙𝜷(1+e𝒙𝜷)2]).\bm{I}_{\bm{x},z,y|\bm{\theta}}=\begin{pmatrix}\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{w}}}{\left(1+e^{\bm{x}^{\top}\bm{w}}\right)^{2}}\right]&\bm{0}\\ \bm{0}&\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{\beta}}}{\left(1+e^{\bm{x}^{\top}\bm{\beta}}\right)^{2}}\right]\end{pmatrix}.

By Lemma B.2, we see that 𝑰𝒙,z,y|𝜽\bm{I}_{\bm{x},z,y|\bm{\theta}} can be diagonalized into the following form:

𝑰𝒙,z,y|𝜽=𝑹𝑴(λ1𝟎𝟎𝟎𝟎𝟎λ2𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎λ3𝟎𝟎𝟎𝟎𝟎λ4)𝑹𝑴\bm{I}_{\bm{x},z,y|\bm{\theta}}=\bm{R}_{\bm{M}}\begin{pmatrix}\lambda_{1}&\bm{0}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&\lambda_{2}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&\dots&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&\lambda_{3}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&\bm{0}&\lambda_{4}\end{pmatrix}\bm{R}_{\bm{M}}^{\top}

where \bm{R}_{\bm{M}} is an orthonormal rotation matrix, \lambda_{1}=\Theta\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right), \lambda_{2}=\Theta\left(\frac{1}{\|\bm{w}\|_{2}}\right), \lambda_{3}=\Theta\left(\frac{1}{\|\bm{\beta}\|_{2}^{3}}\right), and \lambda_{4}=\Theta\left(\frac{1}{\|\bm{\beta}\|_{2}}\right). Therefore, \bm{I}_{\bm{x},z,y|\bm{\theta}} has four eigenvalues given as
{Θ(1𝒘23),Θ(1𝒘2),Θ(1𝜷23),Θ(1𝜷2)}\left\{\Theta\left(\frac{1}{\|\bm{w}\|_{2}^{3}}\right),\Theta\left(\frac{1}{\|\bm{w}\|_{2}}\right),\Theta\left(\frac{1}{\|\bm{\beta}\|_{2}^{3}}\right),\Theta\left(\frac{1}{\|\bm{\beta}\|_{2}}\right)\right\}. It follows that 𝑰𝒙,z,y|𝜽1\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} has the form

𝑰𝒙,z,y|𝜽1=𝑹𝑴(1/λ1𝟎𝟎𝟎𝟎𝟎1/λ2𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎1/λ3𝟎𝟎𝟎𝟎𝟎1/λ4)𝑹𝑴.\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1}=\bm{R}_{\bm{M}}\begin{pmatrix}1/\lambda_{1}&\bm{0}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&1/\lambda_{2}&\bm{0}&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&\dots&\bm{0}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&1/\lambda_{3}&\bm{0}\\ \bm{0}&\bm{0}&\bm{0}&\bm{0}&1/\lambda_{4}\end{pmatrix}\bm{R}_{\bm{M}}^{\top}.

Therefore, 𝑰𝒙,z,y|𝜽1\bm{I}_{\bm{x},z,y|\bm{\theta}}^{-1} has four eigenvalues given as {Θ(𝒘23),Θ(𝒘2),Θ(𝜷23),Θ(𝜷2)}\left\{\Theta\left(\|\bm{w}\|_{2}^{3}\right),\Theta\left(\|\bm{w}\|_{2}\right),\Theta\left(\|\bm{\beta}\|_{2}^{3}\right),\Theta\left(\|\bm{\beta}\|_{2}\right)\right\}.

Now, for Iz|𝒙,y,𝜽I_{z|\bm{x},y,\bm{\theta}}, we first derive a more compact form for the conditional distribution of the latent variable, P(z|𝒙,y;𝜽)P(z|\bm{x},y;\bm{\theta}). From simple Bayes rule and algebraic manipulation, we see that

P(z|𝒙,y;𝜽)\displaystyle P(z|\bm{x},y;\bm{\theta}) =P(y|𝒙,z;𝜽)P(z|𝒙;𝜽)p(𝒙)p(𝒙,y;𝜽)\displaystyle=\frac{P(y|\bm{x},z;\bm{\theta})P(z|\bm{x};\bm{\theta})p(\bm{x})}{p(\bm{x},y;\bm{\theta})}
=exp{yz+12𝒙𝜷}1+e𝒙𝜷exp{z+12𝒙𝒘}1+e𝒙𝒘exp{y+12𝒙𝜷}1+e𝒙𝜷exp{𝒙𝒘}1+e𝒙𝒘+exp{y+12𝒙𝜷}1+e𝒙𝜷11+e𝒙𝒘\displaystyle=\frac{\frac{\exp\{\frac{yz+1}{2}\bm{x}^{\top}{\bm{\beta}}\}}{1+e^{\bm{x}^{\top}{\bm{\beta}}}}\frac{\exp\{\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\}}{1+e^{\bm{x}^{\top}{\bm{w}}}}}{\frac{\exp\{\frac{y+1}{2}\bm{x}^{\top}{\bm{\beta}}\}}{1+e^{\bm{x}^{\top}{\bm{\beta}}}}\frac{\exp\{\bm{x}^{\top}{\bm{w}}\}}{1+e^{\bm{x}^{\top}{\bm{w}}}}+\frac{\exp\{\frac{-y+1}{2}\bm{x}^{\top}{\bm{\beta}}\}}{1+e^{\bm{x}^{\top}{\bm{\beta}}}}\frac{1}{1+e^{\bm{x}^{\top}{\bm{w}}}}}
=exp{yz+12𝒙𝜷+z+12𝒙𝒘}exp{y+12𝒙𝜷+𝒙𝒘}+exp{y+12𝒙𝜷}\displaystyle=\frac{\exp\left\{\frac{yz+1}{2}\bm{x}^{\top}{\bm{\beta}}+\frac{z+1}{2}\bm{x}^{\top}{\bm{w}}\right\}}{\exp\left\{\frac{y+1}{2}\bm{x}^{\top}{\bm{\beta}}+\bm{x}^{\top}{\bm{w}}\right\}+\exp\left\{\frac{-y+1}{2}\bm{x}^{\top}{\bm{\beta}}\right\}}
=exp{z+12(y𝒙𝜷+𝒙𝒘)}exp{y𝒙β+𝒙𝒘}+1\displaystyle=\frac{\exp\left\{\frac{z+1}{2}\left(y\bm{x}^{\top}{\bm{\beta}}+\bm{x}^{\top}{\bm{w}}\right)\right\}}{\exp\left\{y\bm{x}^{\top}\beta+\bm{x}^{\top}{\bm{w}}\right\}+1}
=exp{z+12[𝒙y𝒙],[𝒘𝜷]}exp{[𝒙y𝒙],[𝒘𝜷]}+1\displaystyle=\frac{\exp\left\{\frac{z+1}{2}\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\begin{bmatrix}\bm{w}\\ \bm{\beta}\end{bmatrix}\right\rangle\right\}}{\exp\left\{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\begin{bmatrix}\bm{w}\\ \bm{\beta}\end{bmatrix}\right\rangle\right\}+1}

Now that we have this simplified form, we are able to derive (29) for Iz|𝒙,y,𝜽I_{z|\bm{x},y,\bm{\theta}}:

Iz|𝒙,y,𝜽\displaystyle I_{z|\bm{x},y,\bm{\theta}} =𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2logP(z|𝒙,y;𝜽)]\displaystyle=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log P(z|\bm{x},y;\bm{\theta})\right]
=𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2(z+12(𝒙y𝒙),𝜽log(1+exp[𝒙y𝒙],𝜽))]\displaystyle=-\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\left(\frac{z+1}{2}\left\langle\begin{pmatrix}\bm{x}\\ y\bm{x}\end{pmatrix},\bm{\theta}\right\rangle-\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right)\right]
=𝔼𝑿,Y𝔼Z|𝒙,y,𝜽[2𝜽2log(1+exp[𝒙y𝒙],𝜽)]\displaystyle=\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y,\bm{\theta}}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right]
=𝔼𝑿,Y[2𝜽2log(1+exp[𝒙y𝒙],𝜽)]\displaystyle=\mathbb{E}_{\bm{X},Y}\left[\frac{\partial^{2}}{\partial\bm{\theta}^{2}}\log\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)\right]
=𝔼𝑿,Y[exp[𝒙y𝒙],𝜽[𝒙y𝒙][𝒙y𝒙](1+exp[𝒙y𝒙],𝜽)2]\displaystyle=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}^{\top}}{\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)^{2}}\right]

This expression depends only on the random variable (𝑿,Y)(\bm{X},Y) and the parameter iterate 𝜽\bm{\theta}. ∎

Lemma B.2.

For 𝐱𝒩(𝟎,𝐈d)\bm{x}\sim\mathcal{N}(\bm{0},\bm{I}_{d}) and 𝐮d\bm{u}\in\mathbb{R}^{d} and 𝐮22\|\bm{u}\|_{2}\geq\sqrt{2}, the symmetric positive definite matrix

𝔼𝑿[𝒙𝒙e𝒙𝒖(1+e𝒙𝒖)2]\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{u}}\right)^{2}}\right] (45)

is diagonalizable by an orthonormal matrix 𝐑𝐮d×d\bm{R}_{\bm{u}}\in\mathbb{R}^{d\times d} and has two eigenvalues, λ1,λ20\lambda_{1},\lambda_{2}\geq 0, that satisfy

λ1\displaystyle\lambda_{1} =Θ(1𝒖23)\displaystyle=\Theta\left(\frac{1}{\|\bm{u}\|_{2}^{3}}\right) (46)
λ2\displaystyle\lambda_{2} =Θ(1𝒖2).\displaystyle=\Theta\left(\frac{1}{\|\bm{u}\|_{2}}\right). (47)
Proof.

Recall that Gaussian random variables are rotationally invariant. Specifically, for orthonormal matrix 𝑹d×d\bm{R}\in\mathbb{R}^{d\times d} and 𝑿𝒩(𝟎,𝑰d)\bm{X}\sim\mathcal{N}(\bm{0},\bm{I}_{d}), it follows that 𝑹𝑿𝒩(𝟎,𝑰d)\bm{R}\bm{X}\sim\mathcal{N}(\bm{0},\bm{I}_{d}). Moreover, 𝑹𝑹=𝑹𝑹=𝑰d\bm{R}^{\top}\bm{R}=\bm{R}\bm{R}^{\top}=\bm{I}_{d}. Using this notion, we will 1) diagonalize (45), then 2) evaluate the eigenvalues of (45) as the diagonal elements.

Consider the orthonormal rotation matrix \bm{R}_{\bm{u}}\in\mathbb{R}^{d\times d} such that \bm{R}_{\bm{u}}\bm{u}=\bm{e}_{1}\|\bm{u}\|_{2}, where \bm{e}_{j} is the j^{th} canonical vector of \mathbb{R}^{d}, and write \tilde{\bm{x}}:=\bm{R}_{\bm{u}}\bm{x}. Using this change of basis matrix, we can now obtain the diagonal matrix,

\displaystyle\bm{R}_{\bm{u}}\left(\mathbb{E}_{\bm{X}}\left[\bm{x}\bm{x}^{\top}\frac{e^{\bm{x}^{\top}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{u}}\right)^{2}}\right]\right)\bm{R}_{\bm{u}}^{\top}=\mathbb{E}_{\bm{X}}\left[\bm{R}_{\bm{u}}\bm{x}\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\frac{e^{\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\bm{R}_{\bm{u}}\bm{u}}}{\left(1+e^{\bm{x}^{\top}\bm{R}_{\bm{u}}^{\top}\bm{R}_{\bm{u}}\bm{u}}\right)^{2}}\right]
\displaystyle=\mathbb{E}_{\tilde{\bm{X}}}\left[\tilde{\bm{x}}\tilde{\bm{x}}^{\top}\frac{e^{\tilde{\bm{x}}^{\top}\bm{e}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{\bm{x}}^{\top}\bm{e}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]
\displaystyle=\begin{pmatrix}\mathbb{E}_{\tilde{\bm{X}}}\left[\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]&\bm{0}&\bm{0}\\ \bm{0}&\dots&\bm{0}\\ \bm{0}&\bm{0}&\mathbb{E}_{\tilde{\bm{X}}}\left[\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]\end{pmatrix}.

It has only two eigenvalues given in closed form as

\displaystyle\lambda_{1}=\mathbb{E}_{\tilde{X}_{1}}\left[\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]=\int_{-\infty}^{\infty}\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}
\displaystyle\lambda_{2}=\mathbb{E}_{\tilde{X}_{1}}\left[\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\right]=\int_{-\infty}^{\infty}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}.

The rest of the proof is spent evaluating tight lower and upper bounds on λ1,λ2\lambda_{1},\lambda_{2} in terms of 𝒖2\|\bm{u}\|_{2}.

Part a): Bounds for λ1\lambda_{1}.
For \tilde{x}_{1}\sim\mathcal{N}(0,1), the probability density function is bounded above by 1: p(\tilde{x}_{1})\leq 1. Moreover, the integrand below is symmetric in \tilde{x}_{1} and satisfies \frac{e^{t}}{(1+e^{t})^{2}}\leq e^{-t} for t\geq 0. Then we can upper bound \lambda_{1} as follows:

\displaystyle\lambda_{1}=\int_{-\infty}^{\infty}\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}
\displaystyle\leq\int_{-\infty}^{\infty}\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}d\tilde{x}_{1}
\displaystyle\leq 2\int_{0}^{\infty}\tilde{x}_{1}^{2}e^{-\tilde{x}_{1}\|\bm{u}\|_{2}}d\tilde{x}_{1}
\displaystyle=\frac{4}{\|\bm{u}\|_{2}^{3}}
\displaystyle=\mathcal{O}\left(\frac{1}{\|\bm{u}\|_{2}^{3}}\right).

For the lower bounds, we will use the fact that e^{-(\|\bm{u}\|_{2}\tilde{x}_{1}+\tilde{x}_{1}^{2}/2)}\geq e^{-\|\bm{u}\|_{2}^{2}\tilde{x}_{1}^{2}/2} for \tilde{x}_{1}\in\left[\frac{4}{\|\bm{u}\|_{2}},\infty\right) and \|\bm{u}\|_{2}\geq\sqrt{2}. Then, we can lower bound \lambda_{1} as follows:

\displaystyle\lambda_{1}=\int_{-\infty}^{\infty}\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}
\displaystyle\geq 2\int_{0}^{\infty}\frac{\tilde{x}_{1}^{2}e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(2e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\left(\frac{e^{-\frac{\tilde{x}_{1}^{2}}{2}}}{\sqrt{2\pi}}\right)d\tilde{x}_{1}
\displaystyle=\frac{1}{2\sqrt{2\pi}}\int_{0}^{\infty}\tilde{x}_{1}^{2}e^{-\tilde{x}_{1}\|\bm{u}\|_{2}-\frac{\tilde{x}_{1}^{2}}{2}}d\tilde{x}_{1}
\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int_{\frac{4}{\|\bm{u}\|_{2}}}^{\infty}\tilde{x}_{1}^{2}e^{-\|\bm{u}\|_{2}^{2}\tilde{x}_{1}^{2}/2}d\tilde{x}_{1}
\displaystyle\geq\frac{1}{4\sqrt{2\pi}\|\bm{u}\|_{2}^{3}}\left(\sqrt{\pi}\text{erf}(\tilde{x}_{1}\|\bm{u}\|_{2})-2\|\bm{u}\|_{2}\tilde{x}_{1}e^{-\|\bm{u}\|_{2}^{2}\tilde{x}_{1}^{2}}\right)_{\frac{4}{\|\bm{u}\|_{2}}}^{\infty}
\displaystyle\geq\Omega\left(\frac{1}{\|\bm{u}\|_{2}^{3}}\right).

Therefore, it holds that λ1=Θ(1𝒖23)\lambda_{1}=\Theta\left(\frac{1}{\|\bm{u}\|_{2}^{3}}\right).

Part b): Bounds for λ2\lambda_{2}.
For \tilde{x}_{1}\sim\mathcal{N}(0,1), the probability density function is bounded above by 1: p(\tilde{x}_{1})\leq 1. Using again the symmetry of the integrand and \frac{e^{t}}{(1+e^{t})^{2}}\leq e^{-t} for t\geq 0, we can upper bound \lambda_{2} as follows:

\displaystyle\lambda_{2}=\int_{-\infty}^{\infty}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}
\displaystyle\leq\int_{-\infty}^{\infty}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}d\tilde{x}_{1}
\displaystyle\leq 2\int_{0}^{\infty}e^{-\tilde{x}_{1}\|\bm{u}\|_{2}}d\tilde{x}_{1}
\displaystyle=\frac{2}{\|\bm{u}\|_{2}}
\displaystyle=\mathcal{O}\left(\frac{1}{\|\bm{u}\|_{2}}\right).

For the lower bounds, we will use the fact that e^{-(\|\bm{u}\|_{2}\tilde{x}_{1}+\tilde{x}_{1}^{2}/2)}\geq e^{-\|\bm{u}\|_{2}^{2}\tilde{x}_{1}^{2}/2} for \tilde{x}_{1}\in\left[\frac{4}{\|\bm{u}\|_{2}},\infty\right) and \|\bm{u}\|_{2}\geq\sqrt{2}. Then, we can lower bound \lambda_{2} as follows:

\displaystyle\lambda_{2}=\int_{-\infty}^{\infty}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(1+e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}p(\tilde{x}_{1})d\tilde{x}_{1}
\displaystyle\geq 2\int_{0}^{\infty}\frac{e^{\tilde{x}_{1}\|\bm{u}\|_{2}}}{\left(2e^{\tilde{x}_{1}\|\bm{u}\|_{2}}\right)^{2}}\left(\frac{e^{-\frac{\tilde{x}_{1}^{2}}{2}}}{\sqrt{2\pi}}\right)d\tilde{x}_{1}
\displaystyle=\frac{1}{2\sqrt{2\pi}}\int_{0}^{\infty}e^{-\tilde{x}_{1}\|\bm{u}\|_{2}-\frac{\tilde{x}_{1}^{2}}{2}}d\tilde{x}_{1}
\displaystyle\geq\frac{1}{2\sqrt{2\pi}}\int_{\frac{4}{\|\bm{u}\|_{2}}}^{\infty}e^{-\|\bm{u}\|_{2}^{2}\tilde{x}_{1}^{2}/2}d\tilde{x}_{1}
\displaystyle\geq\frac{1}{4\sqrt{2\pi}\|\bm{u}\|_{2}}\left(\sqrt{\pi}\text{erf}(\tilde{x}_{1}\|\bm{u}\|_{2})\right)_{\frac{4}{\|\bm{u}\|_{2}}}^{\infty}
\displaystyle\geq\Omega\left(\frac{1}{\|\bm{u}\|_{2}}\right).

Therefore, it holds that λ2=Θ(1𝒖2)\lambda_{2}=\Theta\left(\frac{1}{\|\bm{u}\|_{2}}\right). ∎
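
The two rates of Lemma B.2 are also easy to observe by simulation. In the numpy sketch below (illustrative only), the matrix (45) is estimated from samples for several magnitudes r=\|\bm{u}\|_{2}\geq\sqrt{2} along a fixed, arbitrarily chosen direction; the smallest eigenvalue scales like 1/r^{3} and the remaining eigenvalues like 1/r, so the reported products remain of constant order as r grows.

import numpy as np

rng = np.random.default_rng(7)
d, n = 4, 400_000
x = rng.normal(size=(n, d))
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

for r in (2.0, 4.0, 8.0, 16.0):                 # r = ||u||_2, all >= sqrt(2)
    u = r * direction
    t = x @ u
    s = np.exp(-np.abs(t)) / (1 + np.exp(-np.abs(t))) ** 2      # e^t / (1 + e^t)^2, computed stably
    M = (x * s[:, None]).T @ x / n              # Monte Carlo estimate of (45)
    eig = np.sort(np.linalg.eigvalsh(M))
    # smallest eigenvalue ~ 1/r^3 (along u); the others ~ 1/r
    print(r, eig[0] * r ** 3, eig[-1] * r)      # both products stay of constant order as r grows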

B.4 Existence of Locally Convex Region

In this section, we further discuss the consequences of Corollary 5.1, Theorem 5.2, and Theorem 5.3.

To begin, we will discuss the sufficient condition for Gradient Descent to converge linearly to the true parameters 𝜽\bm{\theta}^{*} and compare it to that of EM. For Gradient Descent, it is well understood that if 𝜽1\bm{\theta}^{1} is initialized in a convex set Θ\Theta that contains 𝜽\bm{\theta}^{*} and where (𝜽)\mathcal{L}(\bm{\theta}) is strongly convex, i.e.

2(𝜽)α𝑰2dfor all 𝜽Θ,\nabla^{2}\mathcal{L}(\bm{\theta})\succeq\alpha\bm{I}_{2d}\qquad\text{for all }\bm{\theta}\in\Theta, (48)

then the parameter iterates converge linearly to 𝜽\bm{\theta}^{*}. However, as we have shown for SMoLinE and SMoLogE, the sufficient condition for EM to converge linearly to 𝜽\bm{\theta}^{*} is slightly different. Instead, we require Θ\Theta to satisfy that (𝜽)\mathcal{L}(\bm{\theta}) is strongly convex relative to 𝑨(𝜽)\bm{A}(\bm{\theta}), i.e.,

\nabla^{2}\mathcal{L}(\bm{\theta})\succeq\alpha\nabla^{2}A(\bm{\theta})\qquad\text{for all }\bm{\theta}\in\Theta. (49)

Interestingly, if it holds that $A(\bm{\theta})$ is $1$-smooth, we see that (49) is weaker than (48), i.e.,

\nabla^{2}A(\bm{\theta})\preceq \bm{I}_{2d}\ \text{ and (48) holds}\implies\text{(49) holds}. (50)

More interestingly, as long as $A(\bm{\theta})$ is $\mu$-smooth for some $\mu>0$, any set $\Theta$ satisfying (48) for some $\alpha>0$ also satisfies (49) with $\tilde{\alpha}=\frac{\alpha}{\mu}>0$. We note that the converse holds when $A(\bm{\theta})$ is strongly convex. In summary, for SMoLinE and SMoLogE, EM's sufficient condition for a linear rate is strictly weaker than that of Gradient Descent when the mirror map $A(\bm{\theta})$ is convex but not strongly convex.
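As a concrete illustration (a minimal sketch, not code from the paper), the two conditions can be compared numerically at a single point: given Hessian matrices H_L for $\nabla^{2}\mathcal{L}(\bm{\theta})$ and H_A for $\nabla^{2}A(\bm{\theta})$ (hypothetical placeholders here), condition (48) is governed by the smallest eigenvalue of H_L, while condition (49) is governed by the smallest generalized eigenvalue of the pair (H_L, H_A), assuming H_A is positive definite.

# Illustrative check of conditions (48) and (49) at a single point.
import numpy as np
from scipy.linalg import eigh

def strong_convexity_constants(H_L, H_A):
    alpha_gd = eigh(H_L, eigvals_only=True)[0]       # best alpha in (48)
    alpha_em = eigh(H_L, H_A, eigvals_only=True)[0]  # best alpha in (49); H_A assumed positive definite
    return alpha_gd, alpha_em

# Toy example: A is 1-smooth and convex but nearly flat in one direction,
# while L is curved wherever A is curved, so (49) holds with a much larger
# constant than (48).
H_A = np.diag([1.0, 1e-3])
H_L = 0.5 * H_A + 1e-4 * np.eye(2)
print(strong_convexity_constants(H_L, H_A))

In this toy example the generalized constant is roughly $0.5$ while the ordinary strong convexity constant is of order $10^{-3}$, mirroring the discussion above.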

Next, we further discuss the implications of Theorem 5.3. In the theorem, we obtain clear lower and upper bounds on the eigenvalues of $\bm{I}_{\bm{x},y,z|\bm{\theta}}$. However, it is not clear how to do the same for $\bm{I}_{z|\bm{x},y,\bm{\theta}}$: the trick of rotating the axes with an orthonormal matrix $R$ to simplify the expression does not work here, because the distribution of the vector $(\bm{x},y\bm{x})^{\top}$ is not invariant to rotation. Still, there are some things we can say in special cases. For this discussion, we restrict ourselves to SMoLogE; the same line of reasoning also applies to SMoLinE. First, we recall from Theorem 5.3 that

\bm{I}_{z|\bm{x},y,\bm{\theta}}=\mathbb{E}_{\bm{X},Y}\left[\frac{\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix}^{\top}}{\left(1+\exp{\left\langle\begin{bmatrix}\bm{x}\\ y\bm{x}\end{bmatrix},\bm{\theta}\right\rangle}\right)^{2}}\right].

From here, one easy way to approach bounding the above is to recall that (1) any outer product of the form $\bm{u}\bm{u}^{\top}$ has a single nonzero eigenvalue, equal to $\|\bm{u}\|_{2}^{2}$, and (2) the inner product between two vectors equals the product of their norms and the cosine of the angle between them, i.e., $\bm{s}^{\top}\bm{u}=\|\bm{s}\|_{2}\|\bm{u}\|_{2}\cos(\phi_{\bm{s},\bm{u}})$. Thus, denoting by $\phi$ the angle between $\bm{\theta}$ and the vector $(\bm{x},y\bm{x})^{\top}$, we obtain

\displaystyle\bm{I}_{z|\bm{x},y,\bm{\theta}} \preceq\mathbb{E}_{\bm{X},Y,\Phi}\left[\frac{e^{\sqrt{(1+y^{2})\|\bm{x}\|_{2}^{2}}\|\bm{\theta}\|_{2}\cos(\phi)}(1+y^{2})\|\bm{x}\|_{2}^{2}}{\left(1+e^{\sqrt{(1+y^{2})\|\bm{x}\|_{2}^{2}}\|\bm{\theta}\|_{2}\cos(\phi)}\right)^{2}}\right]\bm{I}_{2d}
\displaystyle\preceq\mathbb{E}_{\bm{X},Y,\Phi}\left[e^{-\left|\sqrt{(1+y^{2})\|\bm{x}\|_{2}^{2}}\|\bm{\theta}\|_{2}\cos(\phi)\right|}(1+y^{2})\|\bm{x}\|_{2}^{2}\right]\bm{I}_{2d}.

Denoting $s:=\left|\sqrt{(1+y^{2})\|\bm{x}\|_{2}^{2}}\right|$, we can write the above expectation as

8\int_{0}^{\pi/2}\int_{0}^{\infty}s^{2}e^{-s\|\bm{\theta}\|_{2}\cos(\phi)}\,p(s,\phi)\,ds\,d\phi.

Subsequently, the idea is to bound $p(s,\phi)=p(s)p(\phi|s)$ in a way that makes the integration easy.

The case where $x,w,\beta\in\mathbb{R}$ is fairly easy. Under this scenario, $x\sim\mathcal{N}(0,1)$ and $\bm{I}_{z|\bm{x},y,\bm{\theta}}$ can be upper bounded as

\displaystyle\bm{I}_{z|\bm{x},y,\bm{\theta}} =\mathbb{E}_{X,Y}\left[\begin{pmatrix}x^{2}&yx^{2}\\ yx^{2}&y^{2}x^{2}\end{pmatrix}\frac{e^{x(w+y\beta)}}{\left(1+e^{x(w+y\beta)}\right)^{2}}\right]
\displaystyle\preceq\mathbb{E}_{X,Y}\left[x^{2}(1+y^{2})e^{-\left|x(w+y\beta)\right|}\right]\bm{I}_{2}.

Now, recall that $y\in\{-1,1\}$, $P(y|x)\leq 1$, and $p(x)=\frac{e^{-x^{2}/2}}{\sqrt{2\pi}}\leq\frac{2e^{-|x|}}{\sqrt{2\pi}}$, and see that

\displaystyle\bm{I}_{z|\bm{x},y,\bm{\theta}} \preceq\mathbb{E}_{X}\left[2x^{2}\left(e^{-\left|x(w-\beta)\right|}P(y=1|x)+e^{-\left|x(w+\beta)\right|}P(y=-1|x)\right)\right]\bm{I}_{2}
\displaystyle\preceq\mathbb{E}_{X}\left[2x^{2}\left(e^{-\left|x(w-\beta)\right|}+e^{-\left|x(w+\beta)\right|}\right)\right]\bm{I}_{2}
\displaystyle=4\int_{0}^{\infty}x^{2}\left(e^{-\left|x(w-\beta)\right|}+e^{-\left|x(w+\beta)\right|}\right)p(x)\,dx\;\bm{I}_{2}
\displaystyle\preceq 4\int_{0}^{\infty}x^{2}\left(e^{-x(|w-\beta|+1)}+e^{-x(|w+\beta|+1)}\right)dx\;\bm{I}_{2}
\displaystyle=8\left(\frac{1}{(1+|w-\beta|)^{3}}+\frac{1}{(1+|w+\beta|)^{3}}\right)\bm{I}_{2}
\displaystyle=\mathcal{O}\left(\frac{1}{(1+|w-\beta|)^{3}}+\frac{1}{(1+|w+\beta|)^{3}}\right)\bm{I}_{2}.

Subsequently, together with the fact that $\bm{I}_{\bm{x},y,z|\bm{\theta}}^{-1}\preceq\max\left\{\mathcal{O}\left(\|\bm{w}\|^{3}_{2}\right),\mathcal{O}\left(\|\bm{\beta}\|^{3}_{2}\right)\right\}\bm{I}_{2}$, it holds that the eigenvalues of the MIM are upper bounded by

\max\left\{\mathcal{O}\left(\left(\frac{w}{1+|w-\beta|}\right)^{3}+\left(\frac{w}{1+|w+\beta|}\right)^{3}\right),\mathcal{O}\left(\left(\frac{\beta}{1+|w-\beta|}\right)^{3}+\left(\frac{\beta}{1+|w+\beta|}\right)^{3}\right)\right\}.

This special case is closely related to the case where $\bm{\beta}$ is parallel to $\bm{w}$; a similar approach works there as well.
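As a sanity check of the final inequality in this scalar case (a minimal sketch under the stated assumptions $x\sim\mathcal{N}(0,1)$, $y\in\{-1,1\}$, and $P(y|x)\leq 1$; not part of the paper's experiments), one can compare the expectation $\mathbb{E}_{X}\left[2x^{2}\left(e^{-|x(w-\beta)|}+e^{-|x(w+\beta)|}\right)\right]$ against the bound $8\left((1+|w-\beta|)^{-3}+(1+|w+\beta|)^{-3}\right)$ by quadrature:

# Illustrative quadrature check of the scalar-case upper bound.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def lhs(w, b):
    f = lambda x: 2 * x**2 * (np.exp(-abs(x * (w - b))) + np.exp(-abs(x * (w + b)))) * norm.pdf(x)
    value, _ = quad(f, -np.inf, np.inf)
    return value

def rhs(w, b):
    return 8 * ((1 + abs(w - b)) ** -3 + (1 + abs(w + b)) ** -3)

for w, b in [(0.5, 0.5), (2.0, 1.0), (5.0, 3.0), (10.0, -4.0)]:
    print(w, b, lhs(w, b), rhs(w, b))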

Appendix C Experiments

In this section, we provide explicit formulations of Gradient EM and Gradient Descent, as they relate to the experiments we perform in Section 6.

Gradient EM. Whereas EM performs the global minimization of the EM objective given in (7), Gradient EM (see Algorithm 1) obtains its next parameter iterate as the concatenation of two gradient updates on the sub-objectives,

-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log P(z|\bm{x};\bm{w})\right]\right] (51)
-\mathbb{E}_{\bm{X},Y}\left[\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\log p(y|z,\bm{x};\bm{\beta})\right]\right] (52)

where the EM objective is given as the summation of (51) and (52).

Algorithm 1 Gradient EM for MOE
Input: Initial $\bm{\theta}^{1}\in\Omega$, data: $(\bm{X},Y)\sim p(\bm{x},y;\bm{\theta}^{*})$, step-sizes: $\gamma_{1},\gamma_{2}\in(0,\infty)$.
for $t=1,\dots,T$ do
     $\bm{\beta}$-Update: Obtain $\bm{\beta}^{t+1}$ as
\bm{\beta}^{t+1}=\bm{\beta}^{t}+\gamma_{1}\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\frac{\partial}{\partial\bm{\beta}}\log p(y|z,\bm{x};\bm{\beta})\right].
     $\bm{w}$-Update: Obtain $\bm{w}^{t+1}$ as
\bm{w}^{t+1}=\bm{w}^{t}+\gamma_{2}\mathbb{E}_{\bm{X},Y}\mathbb{E}_{Z|\bm{x},y;\bm{\theta}^{t}}\left[\frac{\partial}{\partial\bm{w}}\log P(z|\bm{x};\bm{w})\right].
end for
Output: $\bm{\theta}^{T}=(\bm{w}^{T},\bm{\beta}^{T})$
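For readers who prefer code, below is a minimal Python sketch of Algorithm 1 with the population expectations replaced by sample averages. It assumes a two-component model with latent $z\in\{+1,-1\}$, and the model-specific quantities are supplied as callbacks (posterior, grad_log_expert, grad_log_gate); these names are illustrative placeholders, not code from our experiments.

# Schematic Gradient EM (Algorithm 1) for a two-component MoE with z in {+1, -1}.
# Assumed callbacks (model-specific):
#   posterior(X, Y, theta)          -> P(z = +1 | x, y; theta) per sample, shape (n,)
#   grad_log_expert(X, Y, z, beta)  -> per-sample gradient of log p(y | z, x; beta), shape (n, d)
#   grad_log_gate(X, z, w)          -> per-sample gradient of log P(z | x; w), shape (n, d)
import numpy as np

def gradient_em(X, Y, theta0, posterior, grad_log_expert, grad_log_gate,
                gamma1=0.1, gamma2=0.1, T=100):
    w, beta = theta0
    for _ in range(T):
        r = posterior(X, Y, (w, beta))           # E-step: responsibilities at theta^t
        # beta-update: one gradient step on the expected expert log-likelihood
        g_beta = np.mean(r[:, None] * grad_log_expert(X, Y, +1, beta)
                         + (1 - r)[:, None] * grad_log_expert(X, Y, -1, beta), axis=0)
        # w-update: one gradient step on the expected gating log-likelihood
        g_w = np.mean(r[:, None] * grad_log_gate(X, +1, w)
                      + (1 - r)[:, None] * grad_log_gate(X, -1, w), axis=0)
        beta = beta + gamma1 * g_beta
        w = w + gamma2 * g_w
    return w, beta

Note that both updates use the responsibilities computed at $\bm{\theta}^{t}$, matching the two sub-objectives (51) and (52).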

Gradient Descent. The gradient descent update is obtained as the global minimizer of the first-order approximation of $\mathcal{L}(\bm{\theta})$ at $\bm{\theta}^{t}$ plus a quadratic regularizer, i.e.,

\mathcal{L}(\bm{\theta}^{t})+\langle\nabla\mathcal{L}(\bm{\theta}^{t}),\bm{\theta}-\bm{\theta}^{t}\rangle+\frac{1}{2\eta}\|\bm{\theta}-\bm{\theta}^{t}\|_{2}^{2}.

Differentiating with respect to $\bm{\theta}$ and setting the result to zero yields the well-known gradient update (53).
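For completeness, this step reads

\nabla\mathcal{L}(\bm{\theta}^{t})+\frac{1}{\eta}\left(\bm{\theta}-\bm{\theta}^{t}\right)=0\quad\Longleftrightarrow\quad\bm{\theta}=\bm{\theta}^{t}-\eta\nabla\mathcal{L}(\bm{\theta}^{t}),

so the minimizer is exactly the update in (53) with step-size $\gamma=\eta$.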

Algorithm 2 Gradient Descent for MOE
Input: Initial $\bm{\theta}^{1}\in\Omega$, data: $(\bm{X},Y)\sim p(\bm{x},y;\bm{\theta}^{*})$, step-size: $\gamma\in\mathbb{R}^{+}$.
for $t=1,\dots,T$ do
     $\bm{\theta}$-Update: Obtain $\bm{\theta}^{t+1}$ as
\bm{\theta}^{t+1}=\bm{\theta}^{t}-\gamma\nabla\mathcal{L}(\bm{\theta}^{t}). (53)
end for
Output: $\bm{\theta}^{T}=(\bm{w}^{T},\bm{\beta}^{T})$
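Analogously to the sketch of Algorithm 1 above, a minimal Python version of Algorithm 2 is given below; the callback grad_loss computing $\nabla\mathcal{L}$ (here from a finite sample) is an assumed placeholder, not code from our experiments.

# Schematic Gradient Descent (Algorithm 2) with a generic gradient callback.
#   grad_loss(X, Y, theta) -> gradient of the (empirical) loss at theta
import numpy as np

def gradient_descent(X, Y, theta0, grad_loss, gamma=0.1, T=100):
    theta = np.asarray(theta0, dtype=float)   # theta stacks (w, beta) into one vector
    for _ in range(T):
        theta = theta - gamma * grad_loss(X, Y, theta)   # update (53)
    return theta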