
Jing Wang ([email protected])
Anna Choromanska ([email protected])
Department of Electrical and Computer Engineering, New York University

SGB: Stochastic Gradient Bound Method for Optimizing
Partition Functions

Abstract

This paper addresses the problem of optimizing partition functions in a stochastic learning setting. We propose a stochastic variant of the bound majorization algorithm of [Jebara and Choromanska(2012)] that relies on upper-bounding the partition function with a quadratic surrogate. The update of the proposed method, which we refer to as Stochastic Partition Function Bound (SPFB), resembles scaled stochastic gradient descent, where the scaling factor relies on a second-order term that is, however, different from the Hessian. Similarly to quasi-Newton schemes, this term is constructed using stochastic approximations of the function value and its gradient. We prove a sub-linear convergence rate of the proposed method and show the construction of its low-rank variant (LSPFB). Experiments on logistic regression demonstrate that the proposed schemes significantly outperform SGD. We also discuss how to use the quadratic partition function bound for efficient training of deep learning models and in non-convex optimization.

1 Introduction

The problem of estimating the probability density function over a set of random variables underlies the majority of learning frameworks and heavily depends on the partition function. The partition function is the normalizer of a density function and ensures that it integrates to 1. This function needs to be minimized when learning a proper data distribution. Optimizing the partition function, however, is a hard and often intractable problem [Goodfellow et al.(2016)Goodfellow, Bengio, and Courville]. It has been addressed in a number of ways in the literature. Below we review strategies that directly confront the partition function (we skip pseudo-likelihood strategies [Huang and Ogata(2002)], score matching [Hyvärinen(2005)], and ratio matching [Hyvarinen(2007)] techniques, which avoid direct partition function computations).

There exists a variety of Markov chain Monte Carlo methods for approximately maximizing the likelihood of models with partition functions, such as i) contrastive divergence [Hinton(2002), Carreira-Perpiñán and Hinton(2005)] and persistent contrastive divergence [Tieleman(2008)], which perform Gibbs sampling and are used inside a gradient descent procedure to compute the model parameter update, and ii) fast persistent contrastive divergence [Tieleman and Hinton(2009)], which relies on re-parametrizing the model and introducing parameters that are trained with a much larger learning rate such that the Markov chain is forced to mix rapidly. The above-mentioned techniques are Gibbs sampling strategies that focus on estimating the gradient of the log-partition function. Another technique, noise-contrastive estimation [Gutmann and Hyvärinen(2012)], treats the partition function like an additional model parameter whose estimate can be learned via nonlinear logistic regression discriminating between the observed data and some artificially generated noise. Methods that directly estimate the partition function rely on importance sampling. More concretely, they estimate the ratio of the partition functions of two models, where one of the partition functions is known. The extensions of this technique, annealed importance sampling [Neal(2001), Jarzynski(1997)] and bridge sampling [Bennett(1976)], cope with the setting where the two considered distributions are far from each other.

Finally, bound majorization constitutes yet another strategy for performing density estimation. Bound majorization methods iteratively construct and optimize a variational bound on the original optimization problem. Among these techniques we have iterative scaling schemes [Darroch and Ratcliff(1972), Berger(1997)], the EM algorithm [Balakrishnan et al.(2017)Balakrishnan, Wainwright, and Yu, Dempster et al.(1977)Dempster, Laird, and Rubin], the non-negative matrix factorization method [Lee and Seung()], the convex-concave procedure [Yuille and Rangarajan(2002)], minimization by incremental surrogate optimization [Mairal(2015)], and the technique based on constructing a quadratic partition function bound [Jebara and Choromanska(2012)] (early predecessors of these techniques include [Böhning and Lindsay(1988), Lafferty et al.(2001)Lafferty, McCallum, and Pereira]). The latter technique uses a tighter bound than the aforementioned methods and exhibits faster convergence compared to generic first-order [Malouf(2002), Wallach(2002), Sha and Pereira(2003)] and second-order [Zhu et al.(1997)Zhu, Byrd, Lu, and Nocedal, Benson and More(2001), Andrew and Gao(2007)] techniques in the batch optimization setting for both convex and non-convex learning problems. In this paper we revisit the quadratic bound majorization technique and propose its stochastic variant, which we analyze both theoretically and empirically. We prove its convergence rate and show that it performs favorably compared to SGD [Robbins and Monro(1951), Bottou(2010)]. Finally, we propose future research directions that can utilize the quadratic partition function bound in non-convex optimization, including the deep learning setting. With a pressing need to develop landscape-driven deep learning optimization strategies [Feng and Tu(2020), Chaudhari et al.(2017)Chaudhari, A. Choromanska, Soatto, LeCun, Baldassi, Borgs, Chayes, Sagun, and Zecchina, Chaudhari et al.(2019)Chaudhari, A. Choromanska, Soatto, LeCun, Baldassi, Borgs, Chayes, Sagun, and Zecchina, Hochreiter and Schmidhuber(1997), Jastrzebski et al.(2018)Jastrzebski, Kenton, Arpit, Ballas, Fischer, Bengio, and Storkey, Wen et al.(2018)Wen, Wang, Yan, Xu, Chen, and Li, Keskar et al.(2017)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang, Sagun et al.(2018)Sagun, Evci, Güney, Dauphin, and Bottou, Simsekli et al.(2019)Simsekli, Sagun, and Gürbüzbalaban, Jiang et al.(2019)Jiang, Neyshabur, Mobahi, Krishnan, and Bengio, Baldassi et al.(2015)Baldassi, Ingrosso, Lucibello, Saglietti, and Zecchina, Baldassi et al.(2016a)Baldassi, Borgs, Chayes, Ingrosso, Lucibello, Saglietti, and Zecchina, Baldassi et al.(2016b)Baldassi, Gerace, Lucibello, Saglietti, and Zecchina], we foresee a resurgence of interest in bound majorization techniques and their applicability to non-convex learning problems.

2 Method

Algorithm 1: Partition Function Bound

Input: $\widetilde{\theta}\in\mathbb{R}^{d}$, observation $\mathbf{x}$, features $\mathbf{f}_{\mathbf{x}}(y)$ for all $y\in\{1,\dots,n\}$
Output: bound parameters $\Sigma$, $\mu$, $z$

Initialize $z\to 0^{+}$, $\mu=\mathbf{0}$, $\Sigma=z\mathbf{I}$
For $y=1,\dots,n$:
  $\alpha=\exp(\widetilde{\theta}^{T}\mathbf{f}_{\mathbf{x}}(y))$; $\mathbf{l}=\mathbf{f}_{\mathbf{x}}(y)-\mu$
  $\beta=\frac{\tanh(\frac{1}{2}\log(\alpha/z))}{2\log(\alpha/z)}$; $\kappa=\frac{\alpha}{z+\alpha}$
  $\Sigma \mathrel{+}= \beta\,\mathbf{l}\mathbf{l}^{T}$
  $\mu \mathrel{+}= \kappa\,\mathbf{l}$
  $z \mathrel{+}= \alpha$

Algorithm 2: Maximum Likelihood via Stochastic Partition Function Bound (SPFB)

Input: initial parameters $\theta_{0}$, training data set $\{(\mathbf{x}_{j},y_{j})\}_{j=1}^{T}$, features $\mathbf{f}_{\mathbf{x}_{j}}$, learning rates $\eta_{t}$, regularization coefficient $\lambda$
Output: model parameters $\theta$

Set $\theta=\theta_{0}$
While not converged:
  randomly select a training point $(\mathbf{x}_{t},y_{t})$
  compute $\Sigma_{t}$, $\mu_{t}$ from $\mathbf{f}_{\mathbf{x}_{t}}$ and $\theta$ via Algorithm 1
  $\theta \leftarrow \theta - \eta_{t}\left(\Sigma_{t}+\lambda\mathbf{I}\right)^{-1}\left(\mu_{t}-\mathbf{f}_{\mathbf{x}_{t}}(y_{t})+\lambda\theta\right)$

Consider the log-linear model given by a density function of the form

$$\operatorname{p}(y\,|\,\mathbf{x},\theta)=\exp(\theta^{T}\mathbf{f}_{\mathbf{x}}(y))/Z_{\mathbf{x}}(\theta), \qquad (2.1)$$

where $(\mathbf{x},y)$ is an observation-label pair ($y\in\{1,2,\dots,n\}$), $\mathbf{f}_{\mathbf{x}}:\{1,2,\dots,n\}\rightarrow\mathbb{R}^{d}$ is a feature map, $\theta\in\mathbb{R}^{d}$ is the model parameter vector, and $Z_{\mathbf{x}}(\theta)=\sum_{y=1}^{n}\exp(\theta^{T}\mathbf{f}_{\mathbf{x}}(y))$ is the partition function. The maximum likelihood framework estimates $\theta$ from a training data set $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{T}$ by maximizing the objective function

$$J(\theta)=\sum_{i=1}^{T}\log\operatorname{p}(y_{i}\,|\,\mathbf{x}_{i},\theta)-\frac{\lambda}{2}\|\theta\|_{2}^{2}=\sum_{i=1}^{T}\left[\theta^{T}\mathbf{f}_{\mathbf{x}_{i}}(y_{i})-\log Z_{\mathbf{x}_{i}}(\theta)\right]-\frac{\lambda}{2}\|\theta\|_{2}^{2}, \qquad (2.2)$$

where the second term is a regularizer ($\lambda$ is the regularization coefficient). This framework and its various extensions underlie logistic regression, conditional random fields, maximum entropy estimation, latent likelihood, deep belief networks, and other density estimation approaches. Maximizing Equation (2.2) requires minimizing the partition function $Z_{\mathbf{x}}(\theta)$. This can be done by instead optimizing a variational quadratic bound on the partition function. The bound is given in Theorem 1.
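Before stating the bound, the following NumPy sketch (not from the original paper) makes the setup concrete: it evaluates the model probability (2.1) and the objective (2.2). The block-structured feature map used here is only an illustrative assumption.

import numpy as np

def feature_map(x, y, n):
    # Illustrative block feature map: place x in the y-th block of a d = n*len(x) vector.
    f = np.zeros(n * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def log_partition(theta, x, n):
    # log Z_x(theta) = log sum_y exp(theta^T f_x(y)), computed in a numerically stable way.
    scores = np.array([theta @ feature_map(x, y, n) for y in range(n)])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def objective_J(theta, X, Y, n, lam):
    # Regularized log-likelihood of Eq. (2.2).
    J = 0.0
    for x, y in zip(X, Y):
        J += theta @ feature_map(x, y, n) - log_partition(theta, x, n)
    return J - 0.5 * lam * (theta @ theta)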

Theorem 1

[Jebara and Choromanska(2012)] Let $Z_{\mathbf{x}}(\theta)=\sum_{y=1}^{n}\exp\left(\theta^{T}\mathbf{f}_{\mathbf{x}}(y)\right)$. Algorithm 1 finds $z,\mu,\Sigma$ such that

$$Z_{\mathbf{x}}(\theta)\leq z\exp\left(\tfrac{1}{2}(\theta-\widetilde{\theta})^{T}\Sigma(\theta-\widetilde{\theta})+(\theta-\widetilde{\theta})^{T}\mu\right) \qquad (2.3)$$

for any $\theta,\widetilde{\theta},\mathbf{f}_{\mathbf{x}}(y)\in\mathbb{R}^{d}$ and any $y\in\{1,\dots,n\}$.
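For illustration, the following NumPy transcription of Algorithm 1 is a minimal sketch (assuming the features of one observation are stacked in an $n\times d$ array F); it computes the bound parameters and then numerically compares both sides of (2.3) at a random $\theta$.

import numpy as np

def partition_bound(theta_tilde, F, z0=1e-10):
    # Algorithm 1: quadratic upper bound on Z_x(theta) around theta_tilde.
    # F is an (n, d) array whose y-th row is f_x(y).
    n, d = F.shape
    z, mu, Sigma = z0, np.zeros(d), z0 * np.eye(d)
    for y in range(n):
        alpha = np.exp(theta_tilde @ F[y])
        l = F[y] - mu
        beta = np.tanh(0.5 * np.log(alpha / z)) / (2.0 * np.log(alpha / z))
        Sigma += beta * np.outer(l, l)
        mu += (alpha / (z + alpha)) * l
        z += alpha
    return Sigma, mu, z

# Numerical comparison of both sides of (2.3) at a random theta (Theorem 1 says lhs <= rhs).
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3))
theta_tilde, theta = rng.normal(size=3), rng.normal(size=3)
Sigma, mu, z = partition_bound(theta_tilde, F)
dtheta = theta - theta_tilde
lhs = np.exp(F @ theta).sum()
rhs = z * np.exp(0.5 * dtheta @ Sigma @ dtheta + dtheta @ mu)
print(lhs <= rhs)   # expected: True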

2.1 Stochastic Partition Function Bound (SPFB)

The partition function bound of Algorithm 1 can be used to optimize the objective in Equation (2.2). The maximum likelihood parameter update given by the bound takes the form:

$$\theta^{t+1}=\theta^{t}-\eta\left(\sum_{j=1}^{T}\Sigma_{j}+\lambda\mathbf{I}\right)^{-1}\left(\sum_{j=1}^{T}\left[\mu_{j}-\mathbf{f}_{\mathbf{x}_{j}}(y_{j})\right]+\lambda\theta^{t}\right), \qquad (2.4)$$

where the $\Sigma_{j}$s and $\mu_{j}$s are computed by Algorithm 1. In contrast to the above full-batch update, the Stochastic Partition Function Bound (SPFB) method that we propose in Algorithm 2 updates the parameters after seeing each training data point, rather than the entire data set, according to the formula:

$$\theta^{t+1}=\theta^{t}-\eta_{t}\left(\Sigma_{t}+\lambda\mathbf{I}\right)^{-1}\left(\mu_{t}-\mathbf{f}_{\mathbf{x}_{t}}(y_{t})+\lambda\theta^{t}\right), \qquad (2.5)$$

where $\eta_{t}=\eta_{0}/t$ is the learning rate. Denote by $f(\theta;\mathbf{x}_{t})=\log(Z_{\mathbf{x}_{t}}(\theta))-\theta^{T}\mathbf{f}_{\mathbf{x}_{t}}(y_{t})+\frac{\lambda}{2}\|\theta\|^{2}$ an unbiased estimate of the objective function $L(\theta)$, where $L(\theta)=-J(\theta)$. Formula (2.5) can then be rewritten as

$$\theta^{t+1}=\theta^{t}-\eta_{t}\left(\Sigma_{t}+\lambda\mathbf{I}\right)^{-1}\nabla f(\theta^{t};\mathbf{x}_{t}). \qquad (2.6)$$
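A single SPFB step of Eq. (2.5)-(2.6) can be sketched as follows, reusing the illustrative feature_map and partition_bound helpers from the earlier sketches (both are assumptions of this example, not code from the paper).

import numpy as np

def spfb_step(theta, x, y, n, lam, eta0, t):
    # One SPFB update (Eq. 2.5/2.6): a scaled stochastic gradient step whose scaling
    # matrix (Sigma_t + lambda*I)^{-1} comes from the bound of Algorithm 1.
    F = np.stack([feature_map(x, yy, n) for yy in range(n)])   # rows are f_{x_t}(y)
    Sigma, mu, _ = partition_bound(theta, F)
    grad = mu - F[y] + lam * theta                             # mu_t - f_{x_t}(y_t) + lambda*theta
    step = np.linalg.solve(Sigma + lam * np.eye(theta.size), grad)
    return theta - (eta0 / t) * step                           # eta_t = eta_0 / t

Iterating spfb_step over uniformly sampled training points reproduces the loop of Algorithm 2.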

The next theorem shows the convergence rate of SPFB.

Theorem 2

Let $\{\theta^{t}\}$ be the sequence of parameters generated by Algorithm 2. There exist $0<\mu_{1}<\mu_{2}$ and $0<\lambda_{1}<\lambda_{2}$ such that for all iterations $t$,

$$\mu_{1}\mathbf{I}\prec(\Sigma_{t}+\lambda\mathbf{I})^{-1}\prec\mu_{2}\mathbf{I} \quad\text{ and }\quad \lambda_{1}\mathbf{I}\prec\nabla^{2}L(\theta)\prec\lambda_{2}\mathbf{I}, \qquad (2.7)$$

and there exists a constant $\sigma$ such that for all $\theta\in\mathbb{R}^{d}$, $\mathbf{E}_{\mathbf{x}_{t}}\!\left[\left\lVert\nabla f(\theta;\mathbf{x}_{t})\right\rVert^{2}\right]\leq\sigma^{2}$. Define the learning rate in iteration $t$ as $\eta_{t}=\eta_{0}/t$, where $\eta_{0}>1/(2\mu_{1}\lambda_{1})$. Then for all $t>1$,

$$\mathbf{E}[L(\theta^{t})-L(\theta^{*})]\leq Q(\eta_{0})/t, \qquad (2.8)$$

where $Q(\eta_{0})=\max\left\{\frac{\lambda_{2}\mu_{2}^{2}\eta_{0}^{2}\sigma^{2}}{2(2\mu_{1}\lambda_{1}\eta_{0}-1)},L(\theta^{1})-L(\theta^{*})\right\}$.

Theorem 2 guarantees a sub-linear convergence rate for SPFB when the step size is diminishing. However, the per-iteration time complexity of SPFB is $O(nd^{2}+d^{3})=\widetilde{O}(d^{3})$, due to the computation and inversion of the matrix $\Sigma_{t}$, which is less appealing than the $O(nd)$ complexity of SGD. This is addressed next.

2.2 Low-rank bound

In this section, we provide a low-rank construction of the bound that applies to both the batch and the stochastic setting. We decompose the matrix $\Sigma$ as $\Sigma=\mathbf{V}^{T}\mathbf{S}\mathbf{V}+\mathbf{D}$, where $\mathbf{V}\in\mathbb{R}^{k\times d}$ is an orthonormal matrix, $\mathbf{S}\in\mathbb{R}^{k\times k}$, and $\mathbf{D}\in\mathbb{R}^{d\times d}$ is a diagonal matrix, and apply the Woodbury formula to compute the inverse: $\Sigma^{-1}=\mathbf{D}^{-1}-\mathbf{D}^{-1}\mathbf{V}^{T}(\mathbf{S}^{-1}+\mathbf{V}\mathbf{D}^{-1}\mathbf{V}^{T})^{-1}\mathbf{V}\mathbf{D}^{-1}$ (clearly, this inverse only requires $O(k^{3})$ time and does not affect the total time complexity when the rank $k\ll d$). Note that Algorithm 1 performs rank-one updates to the matrix $\Sigma$ of the form $\Sigma=\Sigma+\mathbf{r}\mathbf{r}^{T}$, where $\mathbf{r}=\sqrt{\beta}\,\mathbf{l}$. This update can be "projected" onto the matrices $\mathbf{V}$, $\mathbf{S}$, and $\mathbf{D}$. The concrete updates of $\mathbf{V}$, $\mathbf{S}$, and $\mathbf{D}$ are shown in Algorithm 3. The next theorem, Theorem 3, guarantees that the low-rank bound is indeed an upper bound on the partition function. (We simultaneously repair the low-rank bound construction of [Jebara and Choromanska(2012)], which breaks this property.)
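As an aside, the following NumPy sketch (an illustration, not code from the paper) applies the Woodbury identity just described to solve a linear system with $\Sigma=\mathbf{V}^{T}\mathbf{S}\mathbf{V}+\mathbf{D}$, and checks the result against a dense inverse on a small example.

import numpy as np

def lowrank_solve(V, S, D_diag, b):
    # Apply (V^T S V + D)^{-1} to b via the Woodbury identity in O(k^3 + kd) time.
    # V: (k, d) with orthonormal rows, S: (k, k), D_diag: (d,) positive diagonal of D.
    Dinv_b = b / D_diag
    Dinv_Vt = V.T / D_diag[:, None]                   # D^{-1} V^T, shape (d, k)
    core = np.linalg.inv(S) + V @ Dinv_Vt             # S^{-1} + V D^{-1} V^T, shape (k, k)
    return Dinv_b - Dinv_Vt @ np.linalg.solve(core, V @ Dinv_b)

# Consistency check against a dense solve (small dimensions only).
rng = np.random.default_rng(1)
k, d = 3, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
V = Q.T                                               # orthonormal rows
S = np.diag(rng.uniform(0.5, 2.0, size=k))
D_diag = rng.uniform(0.5, 2.0, size=d)
b = rng.normal(size=d)
dense = np.linalg.solve(V.T @ S @ V + np.diag(D_diag), b)
print(np.allclose(lowrank_solve(V, S, D_diag, b), dense))   # expected: True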

Theorem 3

Let $Z_{\mathbf{x}}(\theta)=\sum_{y=1}^{n}\exp\left(\theta^{T}\mathbf{f}_{\mathbf{x}}(y)\right)$. Each iteration of the $x$-loop in Algorithm 3 finds $z,\mu,\mathbf{V},\mathbf{S},\mathbf{D}$ such that

$$Z_{\mathbf{x}}(\theta)\leq z\exp\left(\tfrac{1}{2}(\theta-\widetilde{\theta})^{T}(\mathbf{V}^{T}\mathbf{S}\mathbf{V}+\mathbf{D})(\theta-\widetilde{\theta})+(\theta-\widetilde{\theta})^{T}\mu\right) \qquad (2.9)$$

for any $\theta,\widetilde{\theta},\mathbf{f}_{\mathbf{x}}(y)\in\mathbb{R}^{d}$ and any $y\in\{1,\dots,n\}$.

The low-rank variant of SPFB (Algorithm 2) is presented in Algorithm 4. Note that all proofs supporting this section are deferred to the Supplement.

Algorithm 3: Low-rank Partition Function Bound

Input: $\widetilde{\theta}\in\mathbb{R}^{d}$, observations $\mathbf{x}_{j}$ in a batch with features $\mathbf{f}_{\mathbf{x}_{j}}(y)$ for all $y\in\{1,\dots,n\}$, rank $k\in\mathbb{N}$
Output: low-rank bound parameters $\mathbf{V}$, $\mathbf{S}$, $\mathbf{D}$, $\mu$, $z$

Initialize $z\to 0^{+}$, $\mathbf{S}=\mathbf{0}$, $\mathbf{V}=$ orthonormal matrix in $\mathbb{R}^{k\times d}$, $\mathbf{D}=z\mathbf{I}$, $\mu=\mathbf{0}$
For each sample $\mathbf{x}_{j}$ in the batch ($x$-loop):
  Initialize $z_{j}\to 0^{+}$, $\upsilon=\mathbf{0}$
  For each label $y\in\{1,2,\dots,n\}$:
    $\alpha=\exp(\widetilde{\theta}^{T}\mathbf{f}_{\mathbf{x}_{j}}(y))$; $\mathbf{r}=\sqrt{\frac{\tanh(\frac{1}{2}\log(\alpha/z_{j}))}{2\log(\alpha/z_{j})}}\,(\mathbf{f}_{\mathbf{x}_{j}}(y)-\upsilon)$
    $\mathbf{p}=\mathbf{V}\mathbf{r}$; $\mathbf{a}=\mathbf{V}^{T}\mathbf{p}$; $\mathbf{g}=\mathbf{r}-\mathbf{a}$; $\mathbf{S} \mathrel{+}= \mathbf{p}\mathbf{p}^{T}$
    $\mathbf{Q}^{T}\mathbf{A}\mathbf{Q}=\mathrm{svd}(\mathbf{S})$; $\mathbf{S}\leftarrow\mathbf{A}$; $\mathbf{V}\leftarrow\mathbf{Q}\mathbf{V}$; $\mathbf{D} \mathrel{+}= \|\mathbf{g}\|\|\mathbf{a}\|\mathbf{I}$, with $\mathbf{I}\in\mathbb{R}^{d\times d}$
    $\mathbf{s}=[\mathbf{S}(1,1),\dots,\mathbf{S}(k,k),\|\mathbf{g}\|^{2}]^{T}$; $\widetilde{k}=\arg\min_{i=1,\dots,k+1}\mathbf{s}(i)$
    If $\widetilde{k}\leq k$:
      $\mathbf{D}=\mathbf{D}+\mathbf{S}(\widetilde{k},\widetilde{k})\,\mathbf{1}^{T}|\mathbf{V}(\widetilde{k},\cdot)|\,\mathrm{diag}(|\mathbf{V}(\widetilde{k},\cdot)|)$
      $\mathbf{S}(\widetilde{k},\widetilde{k})=\|\mathbf{g}\|^{2}$; $\mathbf{g}=\frac{\mathbf{g}}{\|\mathbf{g}\|}$; $\mathbf{V}(\widetilde{k},\cdot)=\mathbf{g}$
    Else:
      $\mathbf{D} \mathrel{+}= \mathbf{1}^{T}|\mathbf{g}|\,\mathrm{diag}(|\mathbf{g}|)$
    $\upsilon \mathrel{+}= \frac{\alpha}{z_{j}+\alpha}(\mathbf{f}_{\mathbf{x}_{j}}(y)-\upsilon)$; $z_{j} \mathrel{+}= \alpha$
  $\mu \mathrel{+}= \upsilon$; $z \mathrel{+}= z_{j}$


Algorithm 4: MLE via Low-rank Stochastic Partition Function Bound (LSPFB)

Input: initial parameters $\theta_{0}$, training data set $\{(\mathbf{x}_{j},y_{j})\}_{j=1}^{T}$, features $\mathbf{f}_{\mathbf{x}_{j}}$, learning rates $\eta_{t}$, regularization coefficient $\lambda$
Output: model parameters $\theta$

Set $\theta=\theta_{0}$
While not converged:
  randomly select a training point $(\mathbf{x}_{t},y_{t})$
  compute $\mathbf{V}_{t}$, $\mathbf{S}_{t}$, $\mathbf{D}_{t}$, $\mu_{t}$ from $\mathbf{x}_{t}$, $\mathbf{f}_{\mathbf{x}_{t}}$, and $\theta$ via Algorithm 3 (the input batch is the single data point $\mathbf{x}_{t}$)
  $\mathbf{D}_{t}=\mathbf{D}_{t}+\lambda\mathbf{I}$; $\mu_{t}=\mu_{t}-\mathbf{f}_{\mathbf{x}_{t}}(y_{t})+\lambda\theta$
  $\theta \leftarrow \theta - \eta_{t}\left(\mathbf{D}_{t}^{-1}-\mathbf{D}_{t}^{-1}\mathbf{V}_{t}^{T}\left(\mathbf{S}_{t}^{-1}+\mathbf{V}_{t}\mathbf{D}_{t}^{-1}\mathbf{V}_{t}^{T}\right)^{-1}\mathbf{V}_{t}\mathbf{D}_{t}^{-1}\right)\mu_{t}$
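Assuming the factors returned by Algorithm 3 and the lowrank_solve Woodbury helper sketched earlier in this section (both are assumptions of this example), one LSPFB update of Algorithm 4 could be written as the following sketch, with the diagonal of $\mathbf{D}_{t}$ stored as a vector.

def lspfb_step(theta, V_t, S_t, D_diag_t, mu_t, f_xt_yt, lam, eta_t):
    # One LSPFB update (Algorithm 4). V_t, S_t, D_diag_t, mu_t are assumed to come from
    # Algorithm 3; lowrank_solve is the illustrative Woodbury helper from Section 2.2.
    D_reg = D_diag_t + lam                     # D_t <- D_t + lambda*I (diagonal as a vector)
    g = mu_t - f_xt_yt + lam * theta           # mu_t <- mu_t - f_{x_t}(y_t) + lambda*theta
    return theta - eta_t * lowrank_solve(V_t, S_t, D_reg, g)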

3 Experiments

Experiments were performed on the adult (http://archive.ics.uci.edu/ml/datasets/Adult; $T=48842$, $n=2$, $d=14$), KMNIST (https://pytorch.org/docs/stable/torchvision/datasets.html; $T=60000$, $n=10$, $d=784$), and Fashion-MNIST (same source; $T=60000$, $n=10$, $d=784$) data sets. All algorithms were run with mini-batch size $m=1000$. For the low-rank methods, we explored the following settings of the rank $k$: $k=1,5,10,100$. Hyperparameters for all methods were chosen to achieve the best test performance. We compare SPFB and LSPFB with SGD for $\ell_{2}$-regularized logistic regression on the adult, KMNIST, and Fashion-MNIST data sets. Both SPFB and LSPFB show a clear advantage over SGD in terms of convergence speed.

Figure 3: Comparison of SPFB, LSPFB, and SGD on the $\ell_{2}$-regularized logistic regression problem.

4 Discussion

Here we briefly discuss future extensions of this work. First, we will analyze whether the batch bound method based on Algorithm 1 admits a super-linear convergence rate and thus matches the convergence rate of quasi-Newton techniques (the existing convergence analysis in the original paper shows only a linear rate). On the empirical side, we will investigate applying the bound techniques discussed in this paper to optimize each layer of the network during backpropagation. Per-layer bounds, when combined together, may potentially lead to a universal quadratic bound on the original highly complex deep learning loss function. This approach would open up new possibilities for training deep learning models, as it reduces the deep learning non-convex optimization problem to a convex one. We will also investigate applying the developed technique to the backpropagation-free [Choromanska et al.(2019)Choromanska, Cowen, Kumaravel, Luss, Rigotti, Rish, Diachille, Gurev, Kingsbury, Tejwani, and Bouneffouf] setting and to large-batch [You et al.(2017)You, Gitman, and Ginsburg] training of deep learning models. Finally, we will explore the applicability of the bound techniques in the context of biasing the gradient to explore wide valleys in the non-convex optimization landscape [Chaudhari et al.(2017)Chaudhari, A. Choromanska, Soatto, LeCun, Baldassi, Borgs, Chayes, Sagun, and Zecchina]. In this case, enforcing the width of the bound to be sufficiently large should provide a simple mechanism for finding solutions that lie in flat regions of the landscape.

References

  • [Andrew and Gao(2007)] Galen Andrew and Jianfeng Gao. Scalable training of l 1-regularized log-linear models. In Proceedings of the 24th international conference on Machine learning, pages 33–40, 2007.
  • [Balakrishnan et al.(2017)Balakrishnan, Wainwright, and Yu] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. Ann. Statist., 45(1):77–120, 2017.
  • [Baldassi et al.(2015)Baldassi, Ingrosso, Lucibello, Saglietti, and Zecchina] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical review letters, 115(12):128101, 2015.
  • [Baldassi et al.(2016a)Baldassi, Borgs, Chayes, Ingrosso, Lucibello, Saglietti, and Zecchina] C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. In PNAS, 2016a.
  • [Baldassi et al.(2016b)Baldassi, Gerace, Lucibello, Saglietti, and Zecchina] C. Baldassi, F. Gerace, C. Lucibello, L. Saglietti, and R. Zecchina. Learning may need only a few bits of synaptic precision. Physical Review E, 93(5):052313, 2016b.
  • [Bennett(1976)] C. H. Bennett. Efficient Estimation of Free Energy Differences from Monte Carlo Data. Journal of Computational Physics, 22(2):245–268, October 1976. 10.1016/0021-9991(76)90078-4.
  • [Benson and More(2001)] Steven J Benson and Jorge J More. A limited memory variable metric method in subspaces and bound constrained optimization problems. In in Subspaces and Bound Constrained Optimization Problems. Citeseer, 2001.
  • [Berger(1997)] Adam Berger. The improved iterative scaling algorithm: A gentle introduction, 1997.
  • [Böhning and Lindsay(1988)] Dankmar Böhning and Bruce G Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.
  • [Bottou(2010)] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [Broyden et al.(1973)Broyden, Dennis Jr, and Moré] Charles George Broyden, John E Dennis Jr, and Jorge J Moré. On the local and superlinear convergence of quasi-newton methods. IMA Journal of Applied Mathematics, 12(3):223–245, 1973.
  • [Byrd et al.(2016)Byrd, Hansen, Nocedal, and Singer] Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
  • [Carreira-Perpiñán and Hinton(2005)] Miguel Á. Carreira-Perpiñán and Geoffrey E. Hinton. On contrastive divergence learning. In AISTATS, 2005.
  • [Chaudhari et al.(2017)Chaudhari, A. Choromanska, Soatto, LeCun, Baldassi, Borgs, Chayes, Sagun, and Zecchina] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • [Chaudhari et al.(2019)Chaudhari, A. Choromanska, Soatto, LeCun, Baldassi, Borgs, Chayes, Sagun, and Zecchina] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys (journal version). Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
  • [Choromanska et al.(2019)Choromanska, Cowen, Kumaravel, Luss, Rigotti, Rish, Diachille, Gurev, Kingsbury, Tejwani, and Bouneffouf] Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Paolo Diachille, Viatcheslav Gurev, Brian Kingsbury, Ravi Tejwani, and Djallel Bouneffouf. Beyond backprop: Online alternating minimization with auxiliary variables. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1193–1202. PMLR, 2019. URL http://proceedings.mlr.press/v97/choromanska19a.html.
  • [Darroch and Ratcliff(1972)] John N Darroch and Douglas Ratcliff. Generalized iterative scaling for log-linear models. The annals of mathematical statistics, pages 1470–1480, 1972.
  • [Dempster et al.(1977)Dempster, Laird, and Rubin] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B, 39(1):1–38, 1977.
  • [Feng and Tu(2020)] Y. Feng and Y. Tu. How neural networks find generalizable solutions: Self-tuned annealing in deep learning. CoRR, abs/2001.01678, 2020.
  • [Goodfellow et al.(2016)Goodfellow, Bengio, and Courville] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [Gutmann and Hyvärinen(2012)] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13(null):307–361, February 2012. ISSN 1532-4435.
  • [Hinton(2002)] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002. URL http://www.ncbi.nlm.nih.gov/pubmed/12180402.
  • [Hochreiter and Schmidhuber(1997)] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
  • [Huang and Ogata(2002)] Fuchun Huang and Yosihiko Ogata. Generalized pseudo-likelihood estimates for markov random fields on lattice. Annals of the Institute of Statistical Mathematics, 54:1–18, 02 2002. 10.1023/A:1016170102988.
  • [Hyvärinen(2005)] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–709, December 2005. ISSN 1532-4435.
  • [Hyvarinen(2007)] Aapo Hyvarinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512, 2007. URL https://EconPapers.repec.org/RePEc:eee:csdana:v:51:y:2007:i:5:p:2499-2512.
  • [Jarzynski(1997)] C. Jarzynski. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78:2690–2693, Apr 1997. 10.1103/PhysRevLett.78.2690. URL https://link.aps.org/doi/10.1103/PhysRevLett.78.2690.
  • [Jastrzebski et al.(2018)Jastrzebski, Kenton, Arpit, Ballas, Fischer, Bengio, and Storkey] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Finding flatter minima with sgd. In ICLR Workshop Track, 2018.
  • [Jebara and Choromanska(2012)] Tony Jebara and Anna Choromanska. Majorization for crfs and latent likelihoods. In Advances in Neural Information Processing Systems, pages 557–565, 2012.
  • [Jiang et al.(2019)Jiang, Neyshabur, Mobahi, Krishnan, and Bengio] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them. CoRR, abs/1912.02178, 2019.
  • [Keskar et al.(2017)Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017. URL http://arxiv.org/abs/1609.04836.
  • [Krizhevsky et al.()Krizhevsky, Nair, and Hinton] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-100 (canadian institute for advanced research).
  • [Lafferty et al.(2001)Lafferty, McCallum, and Pereira] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
  • [Lee and Seung()] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999.
  • [Mairal(2015)] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. on Optimization, 25(2):829–855, April 2015. ISSN 1052-6234. 10.1137/140957639. URL https://doi.org/10.1137/140957639.
  • [Malouf(2002)] Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), 2002.
  • [Neal(2001)] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11, 01 2001.
  • [Robbins and Monro(1951)] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • [Sagun et al.(2018)Sagun, Evci, Güney, Dauphin, and Bottou] L. Sagun, U. Evci, V. Ugur Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the hessian of over-parametrized neural networks. In ICLR Workshop, 2018.
  • [Sha and Pereira(2003)] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 213–220, 2003.
  • [Simsekli et al.(2019)Simsekli, Sagun, and Gürbüzbalaban] U. Simsekli, L. Sagun, and M. Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. CoRR, abs/1901.06053, 2019.
  • [Tieleman(2008)] Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 1064–1071, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605582054. 10.1145/1390156.1390290. URL https://doi.org/10.1145/1390156.1390290.
  • [Tieleman and Hinton(2009)] Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th Annual International Conference on Machine Learning, page 1033–1040, 2009.
  • [Wallach(2002)] Hanna Wallach. Efficient training of conditional random fields. PhD thesis, Citeseer, 2002.
  • [Wen et al.(2018)Wen, Wang, Yan, Xu, Chen, and Li] W. Wen, Y. Wang, F. Yan, C. Xu, Y. Chen, and H. Li. Smoothout: Smoothing out sharp minima for generalization in large-batch deep learning. CoRR, abs/1805.07898, 2018.
  • [You et al.(2017)You, Gitman, and Ginsburg] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017.
  • [Yuille and Rangarajan(2002)] Alan L Yuille and Anand Rangarajan. The concave-convex procedure (cccp). In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1033–1040. MIT Press, 2002. URL http://papers.nips.cc/paper/2125-the-concave-convex-procedure-cccp.pdf.
  • [Zhu et al.(1997)Zhu, Byrd, Lu, and Nocedal] Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-bfgs-b: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.

 

Supplementary material

 

Appendix A Proof for Theorem 2: sub-linear convergence rate

The proof in this section is inspired by [Broyden et al.(1973)Broyden, Dennis Jr, and Moré, Byrd et al.(2016)Byrd, Hansen, Nocedal, and Singer]. We analyze the convergence rate of the stochastic version of the bound majorization method. It is easy to see that the objective function

$$L(\theta)=\frac{1}{T}\sum_{t=1}^{T}\left[\log(Z_{\mathbf{x}_{t}}(\theta))-\theta^{T}\mathbf{f}_{\mathbf{x}_{t}}(y_{t})\right]+\frac{\lambda}{2}\left\lVert\theta\right\rVert^{2}$$

is strongly convex and twice continuously differentiable, where $Z_{\mathbf{x}_{t}}(\theta)=\sum_{i=1}^{n}\exp(\theta^{T}\mathbf{f}_{\mathbf{x}_{t}}(i))$. Therefore, we can make the following assumption.

Assumption 1

Assume that

  • (1) The objective function $L$ is twice continuously differentiable.

  • (2) There exist $0<\lambda_{1}<\lambda_{2}$ such that for all $\theta\in\mathbb{R}^{d}$,

    $$\lambda_{1}\mathbf{I}\prec\nabla^{2}L(\theta)\prec\lambda_{2}\mathbf{I}. \qquad (A.1)$$

For the stochastic partition function bound (SPFB) method, define $\eta_{t}$ to be the learning rate, $\Sigma_{t}$ to be the Hessian approximation, and $f(\theta;\mathbf{x}_{t})=\log(Z_{\mathbf{x}_{t}}(\theta))-\theta^{T}\mathbf{f}_{\mathbf{x}_{t}}(y_{t})+\frac{\lambda}{2}\|\theta\|^{2}$ to be the unbiased estimate of the objective function $L(\theta)$ at each iteration. The update of the parameter $\theta$ is then

$$\theta^{t+1}=\theta^{t}-\eta_{t}(\Sigma_{t}+\lambda\mathbf{I})^{-1}\nabla f(\theta^{t};\mathbf{x}_{t}). \qquad (A.2)$$

In order to analyze the convergence of SPFB, we add the condition that the second moment of the stochastic gradient $\nabla f(\theta;\mathbf{x})$ is bounded and obtain the following assumption:

Assumption 2

Assume that

  • (1) The objective function $L$ is twice continuously differentiable.

  • (2) There exist $0<\lambda_{1}<\lambda_{2}$ such that for all $\theta\in\mathbb{R}^{d}$,

    $$\lambda_{1}\mathbf{I}\prec\nabla^{2}L(\theta)\prec\lambda_{2}\mathbf{I}. \qquad (A.3)$$

  • (3) There exists a constant $\sigma$ such that for all $\theta\in\mathbb{R}^{d}$,

    $$\mathbf{E}_{\mathbf{x}}\left[\left\lVert\nabla f(\theta;\mathbf{x})\right\rVert^{2}\right]\leq\sigma^{2}. \qquad (A.4)$$
Lemma 1

For the matrices $\{\Sigma_{t}\}$ computed in Algorithm 1, there exist $0<\mu_{1}<\mu_{2}$ such that for all $\Sigma_{t}$,

$$\mu_{1}\mathbf{I}\prec(\Sigma_{t}+\lambda\mathbf{I})^{-1}\prec\mu_{2}\mathbf{I}. \qquad (A.5)$$
Proof A.1.

For ease of notation, we omit the subscript and denote $\Sigma_{t}$ by $\Sigma$. From Algorithm 1, the formula for $\Sigma$ is

$$\Sigma=z_{0}\mathbf{I}+\sum_{i=1}^{n}\beta_{i}\mathbf{l}_{i}\mathbf{l}_{i}^{T}, \qquad (A.6)$$

where $z_{0}\to 0^{+}$ is the initialization of $z$ in Algorithm 1. Defining $z_{i}=\sum_{k=1}^{i}\alpha_{k}$, we can compute the upper bound of $\beta_{i}$:

$$\beta_{i}=\frac{\tanh(\frac{1}{2}\log(\alpha_{i}/z_{i}))}{2\log(\alpha_{i}/z_{i})}=\frac{1}{4}\cdot\frac{\tanh(\frac{1}{2}\log(\alpha_{i}/z_{i}))}{\frac{1}{2}\log(\alpha_{i}/z_{i})}\leq\frac{1}{4}. \qquad (A.7)$$

We can also conclude that there exists $k$ such that $\frac{1}{n}<\frac{\alpha_{k}}{z_{k}}<1$, which implies

$$\beta_{k}\geq\frac{\tanh(\frac{1}{2}\log(\frac{1}{n}))}{2\log(\frac{1}{n})}.$$

Since $\mathbf{l}_{i+1}=(-\frac{\alpha_{1}}{z_{i}}\mathbf{x},\cdots,-\frac{\alpha_{i}}{z_{i}}\mathbf{x},\mathbf{x},0,\cdots,0)$, where $\mathbf{x}$ is the current observation, we have

$$\left(1+\frac{1}{n}\right)\|\mathbf{x}\|^{2}\leq\|\mathbf{l}_{i+1}\|^{2}=\frac{\alpha_{1}^{2}+\alpha_{2}^{2}+\cdots+\alpha_{i}^{2}}{(\alpha_{1}+\alpha_{2}+\cdots+\alpha_{i})^{2}}\|\mathbf{x}\|^{2}+\|\mathbf{x}\|^{2}\leq 2\|\mathbf{x}\|^{2}. \qquad (A.8)$$

Therefore,

$$\|\Sigma\|\geq\beta_{k}\|\mathbf{l}_{k}\|^{2}\geq\left(1+\frac{1}{n}\right)\frac{\tanh(\frac{1}{2}\log(\frac{1}{n}))}{2\log(\frac{1}{n})}\|\mathbf{x}\|^{2}. \qquad (A.9)$$

Based on the above, the upper bound for the matrix $\Sigma$ is

$$\|\Sigma\|^{2}\leq z_{0}^{2}+\sum_{i=1}^{n}\beta_{i}\|\mathbf{l}_{i}\mathbf{l}_{i}^{T}\|^{2}\leq z_{0}^{2}+\frac{1}{4}\sum_{i=1}^{n}\|\mathbf{l}_{i}\|^{4}\leq z_{0}^{2}+\frac{1}{4}\|\mathbf{x}\|^{4}\sum_{i=1}^{n}2=z_{0}^{2}+\frac{n}{2}\|\mathbf{x}\|^{4}. \qquad (A.10)$$

When $z_{0}\to 0^{+}$, $\|\Sigma\|\leq\sqrt{\frac{n}{2}}\|\mathbf{x}\|^{2}$.

Denote by $\{\mathbf{x}_{t}\}$ the set of all observations. Define $\mu_{2}=1/\left(\left(1+\frac{1}{n}\right)\frac{\tanh(\frac{1}{2}\log(\frac{1}{n}))}{2\log(\frac{1}{n})}\max_{t}\|\mathbf{x}_{t}\|^{2}+\lambda\right)$ and $\mu_{1}=1/\left(\sqrt{\frac{n}{2}}\max_{t}\|\mathbf{x}_{t}\|^{2}+\lambda\right)$. Then

$$1/\mu_{2}\leq\|\Sigma_{t}+\lambda\mathbf{I}\|\leq 1/\mu_{1}\Longrightarrow\mu_{1}\leq\|(\Sigma_{t}+\lambda\mathbf{I})^{-1}\|\leq\mu_{2}.$$
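As a quick empirical sanity check of Lemma 1 (illustrative only, not part of the proof), the following NumPy snippet runs Algorithm 1 on random data and verifies that $\beta\leq 1/4$ and that the eigenvalues of $(\Sigma_{t}+\lambda\mathbf{I})^{-1}$ lie in a positive interval bounded above by $1/\lambda$.

import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 6, 4, 0.1
theta_tilde = rng.normal(size=d)
max_beta = 0.0
for _ in range(5):
    F = rng.normal(size=(n, d))                      # rows are f_x(y)
    z, mu, Sigma = 1e-10, np.zeros(d), 1e-10 * np.eye(d)
    for y in range(n):
        alpha = np.exp(theta_tilde @ F[y])
        l = F[y] - mu
        beta = np.tanh(0.5 * np.log(alpha / z)) / (2.0 * np.log(alpha / z))
        max_beta = max(max_beta, beta)
        Sigma += beta * np.outer(l, l)
        mu += (alpha / (z + alpha)) * l
        z += alpha
    eigs = np.linalg.eigvalsh(np.linalg.inv(Sigma + lam * np.eye(d)))
    print(eigs.min() > 0.0, eigs.max() <= 1.0 / lam + 1e-9)   # expected: True True
print(max_beta <= 0.25)                                        # expected: True (bound A.7)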
Theorem 2

Let $\{\theta^{t}\}$ be the sequence of parameters generated by Algorithm 2. There exist $0<\mu_{1}<\mu_{2}$ and $0<\lambda_{1}<\lambda_{2}$ such that for all iterations $t$,

$$\mu_{1}\mathbf{I}\prec(\Sigma_{t}+\lambda\mathbf{I})^{-1}\prec\mu_{2}\mathbf{I} \quad\text{ and }\quad \lambda_{1}\mathbf{I}\prec\nabla^{2}L(\theta)\prec\lambda_{2}\mathbf{I}, \qquad (A.11)$$

and there exists a constant $\sigma$ such that for all $\theta\in\mathbb{R}^{d}$,

$$\mathbf{E}_{\mathbf{x}_{t}}\left[\left\lVert\nabla f(\theta;\mathbf{x}_{t})\right\rVert^{2}\right]\leq\sigma^{2}. \qquad (A.12)$$

Define the learning rate in iteration $t$ as

$$\eta_{t}=\eta_{0}/t,\quad\text{where }\eta_{0}>1/(2\mu_{1}\lambda_{1}).$$

Then for all $t>1$,

$$\mathbf{E}[L(\theta^{t})-L(\theta^{*})]\leq Q(\eta_{0})/t, \qquad (A.13)$$

where $Q(\eta_{0})=\max\left\{\frac{\lambda_{2}\mu_{2}^{2}\eta_{0}^{2}\sigma^{2}}{2(2\mu_{1}\lambda_{1}\eta_{0}-1)},L(\theta^{1})-L(\theta^{*})\right\}$.

Proof A.2.
$$\begin{aligned} L(\theta^{t+1}) &= L\left(\theta^{t}-\eta_{t}(\Sigma_{t}+\lambda\mathbf{I})^{-1}\nabla f(\theta^{t};\mathbf{x}_{t})\right) \\ &\leq L(\theta^{t})+\nabla L(\theta^{t})^{T}\left(-\eta_{t}(\Sigma_{t}+\lambda\mathbf{I})^{-1}\nabla f(\theta^{t};\mathbf{x}_{t})\right)+\frac{\lambda_{2}}{2}\left\lVert\eta_{t}(\Sigma_{t}+\lambda\mathbf{I})^{-1}\nabla f(\theta^{t};\mathbf{x}_{t})\right\rVert^{2} \\ &\leq L(\theta^{t})-\eta_{t}\nabla L(\theta^{t})^{T}(\Sigma_{t}+\lambda\mathbf{I})^{-1}\nabla f(\theta^{t};\mathbf{x}_{t})+\frac{\lambda_{2}}{2}\eta_{t}^{2}\mu_{2}^{2}\left\lVert\nabla f(\theta^{t};\mathbf{x}_{t})\right\rVert^{2}. \end{aligned} \qquad (A.14)$$

Taking the expectation, we have

$$\mathbf{E}[L(\theta^{t+1})]\leq\mathbf{E}[L(\theta^{t})]-\eta_{t}\mu_{1}\left\lVert\nabla L(\theta^{t})\right\rVert^{2}+\frac{\lambda_{2}}{2}\eta_{t}^{2}\mu_{2}^{2}\sigma^{2}. \qquad (A.15)$$

Now, we want to relate $\left\lVert\nabla L(\theta^{t})\right\rVert^{2}$ to $L(\theta^{t})-L(\theta^{*})$. For all $\mathbf{v}\in\mathbb{R}^{d}$, by condition (2) in Assumption 2,

$$\begin{aligned} L(\mathbf{v}) &\geq L(\theta^{t})+\nabla L(\theta^{t})^{T}(\mathbf{v}-\theta^{t})+\frac{\lambda_{1}}{2}\|\mathbf{v}-\theta^{t}\|^{2} \\ &\geq L(\theta^{t})-\nabla L(\theta^{t})^{T}\left(\frac{1}{\lambda_{1}}\nabla L(\theta^{t})\right)+\frac{\lambda_{1}}{2}\left\|\frac{1}{\lambda_{1}}\nabla L(\theta^{t})\right\|^{2} \\ &= L(\theta^{t})-\frac{1}{2\lambda_{1}}\|\nabla L(\theta^{t})\|^{2}, \end{aligned} \qquad (A.16)$$

where the second inequality comes from computing the minimum of the quadratic function $q(\mathbf{v})=L(\theta^{t})+\nabla L(\theta^{t})^{T}(\mathbf{v}-\theta^{t})+\frac{\lambda_{1}}{2}\|\mathbf{v}-\theta^{t}\|^{2}$ over $\mathbf{v}$. Setting $\mathbf{v}=\theta^{*}$ yields

$$2\lambda_{1}[L(\theta^{t})-L(\theta^{*})]\leq\|\nabla L(\theta^{t})\|^{2}, \qquad (A.17)$$

which together with formula (A.15) yields

$$\begin{aligned} \mathbf{E}[L(\theta^{t+1})-L(\theta^{*})] &\leq \mathbf{E}[L(\theta^{t})-L(\theta^{*})]-2\eta_{t}\mu_{1}\lambda_{1}\mathbf{E}[L(\theta^{t})-L(\theta^{*})]+\frac{\lambda_{2}}{2}\eta_{t}^{2}\mu_{2}^{2}\sigma^{2} \\ &= (1-2\eta_{t}\mu_{1}\lambda_{1})\mathbf{E}[L(\theta^{t})-L(\theta^{*})]+\frac{\lambda_{2}}{2}\eta_{t}^{2}\mu_{2}^{2}\sigma^{2}. \end{aligned} \qquad (A.18)$$

Define $\phi_{t}=\mathbf{E}[L(\theta^{t})-L(\theta^{*})]$ and $Q(\eta_{0})=\max\left\{\frac{\lambda_{2}\mu_{2}^{2}\eta_{0}^{2}\sigma^{2}}{2(2\mu_{1}\lambda_{1}\eta_{0}-1)},L(\theta^{1})-L(\theta^{*})\right\}$. When $t=1$, $\phi_{1}=L(\theta^{1})-L(\theta^{*})\leq Q(\eta_{0})/1$ holds. We finish the proof by induction. Assuming $\phi_{t}\leq\frac{Q(\eta_{0})}{t}$, we have

$$\begin{aligned} \phi_{t+1} &\leq (1-2\eta_{t}\mu_{1}\lambda_{1})\phi_{t}+\frac{\lambda_{2}}{2}\eta_{t}^{2}\mu_{2}^{2}\sigma^{2} \\ &\leq \left(1-\frac{2\eta_{0}\mu_{1}\lambda_{1}}{t}\right)\frac{Q(\eta_{0})}{t}+\frac{\lambda_{2}\eta_{0}^{2}\mu_{2}^{2}\sigma^{2}}{2t^{2}} \\ &= \frac{t-2\eta_{0}\mu_{1}\lambda_{1}}{t^{2}}Q(\eta_{0})+\frac{\lambda_{2}\eta_{0}^{2}\mu_{2}^{2}\sigma^{2}}{2t^{2}} \\ &= \frac{t-1}{t^{2}}Q(\eta_{0})-\frac{2\eta_{0}\mu_{1}\lambda_{1}-1}{t^{2}}Q(\eta_{0})+\frac{\lambda_{2}\eta_{0}^{2}\mu_{2}^{2}\sigma^{2}}{2t^{2}} \\ &\leq \frac{t-1}{t^{2}}Q(\eta_{0})-\frac{2\eta_{0}\mu_{1}\lambda_{1}-1}{t^{2}}\cdot\frac{\lambda_{2}\mu_{2}^{2}\eta_{0}^{2}\sigma^{2}}{2(2\mu_{1}\lambda_{1}\eta_{0}-1)}+\frac{\lambda_{2}\eta_{0}^{2}\mu_{2}^{2}\sigma^{2}}{2t^{2}} \\ &\leq \frac{Q(\eta_{0})}{t+1}. \end{aligned} \qquad (A.19)$$

Appendix B Proof for Thm 3: Low-rank Bound

The low-rank bound in [Jebara and Choromanska(2012)] is not correct, since $\mathbf{a}\perp\mathbf{g}$ is not sufficient for $\mathbf{x}^{T}(\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T})\mathbf{x}=0$ for all $\mathbf{x}$. We loosen and fix the bound in this section. We reuse some proof steps from [Jebara and Choromanska(2012)].

Lemma 2

For all $\mathbf{x}\in\mathbb{R}^{d}$ and all $\mathbf{l}\in\mathbb{R}^{d}$ with $\mathbf{l}(i)\geq 0$ for every $i$, we have

$$\sum_{i=1}^{d}\mathbf{x}^{2}(i)\mathbf{l}(i)\geq\left(\sum_{i=1}^{d}\mathbf{x}(i)\frac{\mathbf{l}(i)}{\sqrt{\sum_{j=1}^{d}\mathbf{l}(j)}}\right)^{2}.$$

Proof B.1.

By Jensen's inequality, if $f:\mathbb{R}\rightarrow\mathbb{R}$ is convex and $\{a_{i}\}_{i=1}^{d}$ satisfies $a_{i}\geq 0$ and $\sum_{i=1}^{d}a_{i}=1$, then

$$\sum_{i=1}^{d}a_{i}f(\mathbf{x}(i))\geq f\left(\sum_{i=1}^{d}a_{i}\mathbf{x}(i)\right).$$

Setting $a_{i}=\frac{\mathbf{l}(i)}{\sum_{j=1}^{d}\mathbf{l}(j)}$ and $f(x)=x^{2}$ gives

$$\sum_{i=1}^{d}\mathbf{x}(i)^{2}\frac{\mathbf{l}(i)}{\sum_{j=1}^{d}\mathbf{l}(j)}\geq\left(\sum_{i=1}^{d}\mathbf{x}(i)\frac{\mathbf{l}(i)}{\sum_{j=1}^{d}\mathbf{l}(j)}\right)^{2} \;\Longrightarrow\; \sum_{i=1}^{d}\mathbf{x}^{2}(i)\mathbf{l}(i)\geq\left(\sum_{i=1}^{d}\mathbf{x}(i)\frac{\mathbf{l}(i)}{\sqrt{\sum_{j=1}^{d}\mathbf{l}(j)}}\right)^{2}.$$
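A quick numerical spot check of Lemma 2 (illustrative only):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10)
l = rng.uniform(0.0, 1.0, size=10)          # l must be entrywise non-negative
lhs = np.sum(x**2 * l)
rhs = (np.sum(x * l) / np.sqrt(l.sum()))**2
print(lhs >= rhs)                            # expected: True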
Lemma 3

If $\mathbf{a},\mathbf{g}\in\mathbb{R}^{n}$ are non-zero, then $\operatorname{rank}(\mathbf{a}\mathbf{g}^{T})=1$.

Proof B.2.

$$\operatorname{rank}(\mathbf{a}\mathbf{g}^{T})=\min\{\operatorname{rank}(\mathbf{a}),\operatorname{rank}(\mathbf{g})\}=1.$$
Lemma 4

Let $\mathbf{a},\mathbf{g}\in\mathbb{R}^{n}$ be non-zero and let $\mathbf{A}=\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}\in\mathbb{R}^{n\times n}$. If $\mathbf{a}\perp\mathbf{g}$ and $\mathbf{A}\neq\mathbf{0}$, then $\mathbf{A}$ has exactly two non-zero eigenvalues of opposite sign, $\pm\|\mathbf{a}\|_{2}\|\mathbf{g}\|_{2}$.

Proof B.3.

Because $\operatorname{rank}(\mathbf{a}\mathbf{g}^{T})=\operatorname{rank}(\mathbf{g}\mathbf{a}^{T})=1$ (by Lemma 3),

$$\operatorname{rank}(\mathbf{A})=\operatorname{rank}(\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T})\leq\operatorname{rank}(\mathbf{a}\mathbf{g}^{T})+\operatorname{rank}(\mathbf{g}\mathbf{a}^{T})=2.$$

  • (a) If $\operatorname{rank}(\mathbf{A})=1$, there is only one non-zero eigenvalue $\lambda$, and

    $$\lambda=\operatorname{trace}(\mathbf{A})=\sum_{i=1}^{n}\mathbf{A}_{ii}=\sum_{i=1}^{n}(\mathbf{a}_{i}\mathbf{g}_{i}+\mathbf{a}_{i}\mathbf{g}_{i})=2\mathbf{a}^{T}\mathbf{g}=0.$$

    Therefore, all eigenvalues of the matrix $\mathbf{A}$ are $0$ and $\mathbf{A}=\mathbf{0}$, which contradicts the condition $\mathbf{A}\neq\mathbf{0}$.

  • (b) If $\operatorname{rank}(\mathbf{A})=2$, let $\lambda_{1},\lambda_{2}$ be the two non-zero eigenvalues of $\mathbf{A}$. Then

    $$\lambda_{1}+\lambda_{2}=\operatorname{trace}(\mathbf{A})=0. \qquad (B.1)$$

    Without loss of generality, assume $\lambda_{1}>0$; then the characteristic polynomial of the matrix $\mathbf{A}$ is

    $$\lambda^{n-2}(\lambda-\lambda_{1})(\lambda-\lambda_{2})=\lambda^{n-2}(\lambda^{2}-\lambda_{1}^{2}),$$

    and the matrix $\mathbf{A}$ satisfies

    $$\mathbf{A}^{n-2}(\mathbf{A}^{2}-\lambda_{1}^{2}\mathbf{I})=\mathbf{0}. \qquad (B.2)$$

    Because

    $$\begin{aligned} \mathbf{A}^{3}=\mathbf{A}^{2}\mathbf{A} &= (\mathbf{a}\mathbf{g}^{T}\mathbf{a}\mathbf{g}^{T}+\mathbf{a}\mathbf{g}^{T}\mathbf{g}\mathbf{a}^{T}+\mathbf{g}\mathbf{a}^{T}\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}\mathbf{g}\mathbf{a}^{T})(\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}) \\ &= (\|\mathbf{a}\|_{2}^{2}\mathbf{g}\mathbf{g}^{T}+\|\mathbf{g}\|_{2}^{2}\mathbf{a}\mathbf{a}^{T})(\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}) \\ &= \|\mathbf{a}\|_{2}^{2}\|\mathbf{g}\|_{2}^{2}(\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}) \\ &= \|\mathbf{a}\|_{2}^{2}\|\mathbf{g}\|_{2}^{2}\mathbf{A}, \end{aligned}$$

    we can conclude that

    $$\mathbf{A}(\mathbf{A}^{2}-\|\mathbf{a}\|_{2}^{2}\|\mathbf{g}\|_{2}^{2}\mathbf{I})=\mathbf{0}\Longrightarrow\mathbf{A}^{n-2}(\mathbf{A}^{2}-\|\mathbf{a}\|_{2}^{2}\|\mathbf{g}\|_{2}^{2}\mathbf{I})=\mathbf{0}. \qquad (B.3)$$

    From formulas (B.1), (B.2), and (B.3) we conclude that the eigenvalues are $\lambda_{1}=\|\mathbf{a}\|_{2}\|\mathbf{g}\|_{2}$ and $\lambda_{2}=-\|\mathbf{a}\|_{2}\|\mathbf{g}\|_{2}$.
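A quick numerical spot check of Lemma 4 (illustrative only): for random orthogonal $\mathbf{a},\mathbf{g}$, the two extreme eigenvalues of $\mathbf{a}\mathbf{g}^{T}+\mathbf{g}\mathbf{a}^{T}$ match $\pm\|\mathbf{a}\|_{2}\|\mathbf{g}\|_{2}$.

import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=6)
g = rng.normal(size=6)
g -= (g @ a) / (a @ a) * a                   # make g orthogonal to a
A = np.outer(a, g) + np.outer(g, a)
eigs = np.sort(np.linalg.eigvalsh(A))
print(np.isclose(eigs[0], -np.linalg.norm(a) * np.linalg.norm(g)),
      np.isclose(eigs[-1], np.linalg.norm(a) * np.linalg.norm(g)))   # expected: True True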

Theorem 3

Let $Z_{\mathbf{x}}(\theta)=\sum_{y=1}^{n}\exp\left(\theta^{T}\mathbf{f}_{\mathbf{x}}(y)\right)$. Each iteration of the $x$-loop in Algorithm 3 finds $z,\mu,\mathbf{V},\mathbf{S},\mathbf{D}$ such that

$$Z_{\mathbf{x}}(\theta)\leq z\exp\left(\tfrac{1}{2}(\theta-\widetilde{\theta})^{T}(\mathbf{V}^{T}\mathbf{S}\mathbf{V}+\mathbf{D})(\theta-\widetilde{\theta})+(\theta-\widetilde{\theta})^{T}\mu\right). \qquad (B.4)$$
Proof B.4.

Define $Z_{\mathbf{x}}^{i}(\theta)=\sum_{y=1}^{i}\exp(\theta^{T}\mathbf{f}_{\mathbf{x}}(y))$; then the partition function is $Z_{\mathbf{x}}(\theta)=Z_{\mathbf{x}}^{n}(\theta)$, where $n$ is the number of labels. From the proof of Theorem 1 (details in [Jebara and Choromanska(2012)]), we already have a sequence of matrices $\{\Sigma_{i}\}_{i=1}^{n}$, vectors $\{\mu_{i}\}_{i=1}^{n}$, and constants $\{z_{i}\}_{i=1}^{n}$ such that:

$$Z_{\mathbf{x}}^{i}(\theta)\leq z_{i}\exp\left(\tfrac{1}{2}(\theta-\widetilde{\theta})^{T}\Sigma_{i}(\theta-\widetilde{\theta})+(\theta-\widetilde{\theta})^{T}\mu_{i}\right), \qquad (B.5)$$

where the terminal terms $\Sigma=\Sigma_{n}$, $\mu=\mu_{n}$, $z=z_{n}$ are the output of Algorithm 1. If we can find sequences of matrices $\{\mathbf{V}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{k\times d}$ (orthonormal), $\{\mathbf{S}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{k\times k}$, and $\{\mathbf{D}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d\times d}$ (diagonal) that upper-bound the matrices $\{\Sigma_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d\times d}$ as

$$\mathbf{x}^{T}\Sigma_{i}\mathbf{x}\leq\mathbf{x}^{T}(\mathbf{V}_{i}^{T}\mathbf{S}_{i}\mathbf{V}_{i}+\mathbf{D}_{i})\mathbf{x}\quad\forall\mathbf{x}\in\mathbb{R}^{d}, \qquad (B.6)$$

then the upper bound (B.5) of the partition function over the first $i$ labels can be replaced by

$$Z_{\mathbf{x}}^{i}(\theta)\leq z_{i}\exp\left(\tfrac{1}{2}(\theta-\widetilde{\theta})^{T}\left(\mathbf{V}_{i}^{T}\mathbf{S}_{i}\mathbf{V}_{i}+\mathbf{D}_{i}\right)(\theta-\widetilde{\theta})+(\theta-\widetilde{\theta})^{T}\mu_{i}\right). \qquad (B.7)$$

When $i=n$, denoting $\mathbf{V},\mathbf{S},\mathbf{D},\mu,z=\mathbf{V}_{n},\mathbf{S}_{n},\mathbf{D}_{n},\mu_{n},z_{n}$, formula (B.7) is equivalent to (B.4) and the proof is complete. In this way we decompose $\Sigma_{i}$ into a low-rank form while keeping the upper bound. In the following, we show how to construct the matrix sequences $\{\mathbf{V}_{i}\}_{i=1}^{n}$, $\{\mathbf{S}_{i}\}_{i=1}^{n}$, and $\{\mathbf{D}_{i}\}_{i=1}^{n}$ satisfying condition (B.6). The construction is by mathematical induction.

  • From the proof of Theorem 1 in [Jebara and Choromanska(2012)], $z_{0}\to 0^{+}$ and $\Sigma_{0}=z_{0}\mathbf{I}$. Define $\mathbf{V}_{0}=\mathbf{0}$, $\mathbf{S}_{0}=\mathbf{0}$, and $\mathbf{D}_{0}=z_{0}\mathbf{I}$; it is obvious that $\mathbf{x}^{T}\Sigma_{0}\mathbf{x}=\mathbf{x}^{T}(\mathbf{V}_{0}^{T}\mathbf{S}_{0}\mathbf{V}_{0}+\mathbf{D}_{0})\mathbf{x}$ holds for all $\mathbf{x}\in\mathbb{R}^{d}$.

  • Assume 𝐕i1\bm{\mathbf{V}}_{i-1}, 𝐒i1\bm{\mathbf{S}}_{i-1} and 𝐃i1\bm{\mathbf{D}}_{i-1} satisfies condition (B.6), we are going to find 𝐕i\bm{\mathbf{V}}_{i}, 𝐒i\bm{\mathbf{S}}_{i} and 𝐃i\bm{\mathbf{D}}_{i} satisfies this condition as well. From line 5 in algorithm 2, 𝚺i=𝚺i1+𝐫𝐫T\bm{\mathbf{\Sigma}}_{i}=\bm{\mathbf{\Sigma}}_{i-1}+\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}, where 𝐫=β𝐥\bm{\mathbf{r}}=\sqrt{\beta}\bm{\mathbf{l}}. Define the subspace constructed by row vectors of 𝐕i1\bm{\mathbf{V}}_{i-1} as 𝒜=span{𝐕i1(1,),,𝐕i1(k,)}.\mathcal{A}=span\{\bm{\mathbf{V}}_{i-1}(1,\cdot),...,\bm{\mathbf{V}}_{i-1}(k,\cdot)\}. Map the vector 𝐫\bm{\mathbf{r}} onto subspace 𝒜\mathcal{A} and denote the residual orthogonal to subspace 𝒜\mathcal{A} as 𝐠\bm{\mathbf{g}}

    \bm{\mathbf{r}} =\sum_{j=1}^{k}\left(\bm{\mathbf{V}}_{i-1}(j,\cdot)\,\bm{\mathbf{r}}\right)\bm{\mathbf{V}}_{i-1}^{T}(j,\cdot)+\bm{\mathbf{g}}
    =\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}+\bm{\mathbf{g}}, (B.8)

    Substituting (B.8) into $\bm{\mathbf{\Sigma}}_{i}=\bm{\mathbf{\Sigma}}_{i-1}+\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}$ and using the inductive hypothesis $\bm{\mathbf{\Sigma}}_{i-1}\preceq\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{S}}_{i-1}\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}_{i-1}$ (the matrix inequality equivalent to (B.6)), we obtain

    \bm{\mathbf{\Sigma}}_{i} =\bm{\mathbf{\Sigma}}_{i-1}+(\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}+\bm{\mathbf{g}})(\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}+\bm{\mathbf{g}})^{T}
    \preceq\bm{\mathbf{V}}_{i-1}^{T}(\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T})\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}. (B.9)

    Defining $\bm{\mathbf{a}}=\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}$, relation (B.9) can be written compactly as

    \bm{\mathbf{\Sigma}}_{i}\preceq\bm{\mathbf{V}}_{i-1}^{T}(\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T})\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{a}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{a}}^{T}, (B.10)

    and in terms of quadratic forms (B.10) reads

    \bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}_{i}\bm{\mathbf{x}}\leq\bm{\mathbf{x}}^{T}\left[\bm{\mathbf{V}}_{i-1}^{T}(\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T})\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}\right]\bm{\mathbf{x}}+\bm{\mathbf{x}}^{T}(\bm{\mathbf{a}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{a}}^{T})\bm{\mathbf{x}}. (B.11)

    By Lemma 4, since $\bm{\mathbf{a}}^{T}\bm{\mathbf{g}}=0$, the maximum eigenvalue of the matrix $\bm{\mathbf{a}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{a}}^{T}$ is $\|\bm{\mathbf{a}}\|_{2}\|\bm{\mathbf{g}}\|_{2}$. Defining $\bm{\mathbf{P}}=\|\bm{\mathbf{a}}\|_{2}\|\bm{\mathbf{g}}\|_{2}\bm{\mathbf{I}}$, we have

    \bm{\mathbf{x}}^{T}(\bm{\mathbf{a}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{a}}^{T})\bm{\mathbf{x}}\leq\max\{\lambda\,|\,\lambda\text{ is an eigenvalue of }\bm{\mathbf{a}}\bm{\mathbf{g}}^{T}+\bm{\mathbf{g}}\bm{\mathbf{a}}^{T}\}\,\bm{\mathbf{x}}^{T}\bm{\mathbf{x}}=\|\bm{\mathbf{a}}\|_{2}\|\bm{\mathbf{g}}\|_{2}\|\bm{\mathbf{x}}\|_{2}^{2}=\bm{\mathbf{x}}^{T}\bm{\mathbf{P}}\bm{\mathbf{x}}. (B.12)

    Define

    \bm{\mathbf{D}}^{\prime}_{i-1} =\bm{\mathbf{D}}_{i-1}+\bm{\mathbf{P}}\in\mathbb{R}^{d\times d}\text{ (diagonal),}
    \widetilde{\bm{\mathbf{\Sigma}}_{i}} :=\bm{\mathbf{V}}_{i-1}^{T}(\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T})\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}^{\prime}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}, (B.13)

    then (B.11) and (B.12) immediately give

    \bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}_{i}\bm{\mathbf{x}}\leq\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}. (B.14)

    We have shown that $\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}$ is at least $\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}_{i}\bm{\mathbf{x}}$, so it remains to upper-bound $\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}$ by a rank-$k$-plus-diagonal quadratic form. Perform an SVD of $\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T}$ and write the result as

    \bm{\mathbf{Q}}_{i-1}^{T}\bm{\mathbf{S}}^{\prime}_{i-1}\bm{\mathbf{Q}}_{i-1}=svd(\bm{\mathbf{S}}_{i-1}+\bm{\mathbf{V}}_{i-1}\bm{\mathbf{r}}\bm{\mathbf{r}}^{T}\bm{\mathbf{V}}_{i-1}^{T}),

    and define $\bm{\mathbf{V}}^{\prime}_{i-1}=\bm{\mathbf{Q}}_{i-1}\bm{\mathbf{V}}_{i-1}$; then (B.13) can be rewritten as

    \widetilde{\bm{\mathbf{\Sigma}}_{i}} =\bm{\mathbf{V}}_{i-1}^{T}\bm{\mathbf{Q}}_{i-1}^{T}\bm{\mathbf{S}}^{\prime}_{i-1}\bm{\mathbf{Q}}_{i-1}\bm{\mathbf{V}}_{i-1}+\bm{\mathbf{D}}^{\prime}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}
    =\bm{\mathbf{V}}_{i-1}^{{}^{\prime}T}\bm{\mathbf{S}}^{\prime}_{i-1}\bm{\mathbf{V}}^{\prime}_{i-1}+\bm{\mathbf{D}}^{\prime}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}
    =\underbrace{\left[\begin{matrix}\bm{\mathbf{V}}_{i-1}^{{}^{\prime}T}&\frac{\bm{\mathbf{g}}}{\|\bm{\mathbf{g}}\|_{2}}\end{matrix}\right]\left[\begin{matrix}\bm{\mathbf{S}}^{\prime}_{i-1}&\bm{\mathbf{0}}^{T}\\ \bm{\mathbf{0}}&\|\bm{\mathbf{g}}\|_{2}^{2}\end{matrix}\right]\left[\begin{matrix}\bm{\mathbf{V}}^{\prime}_{i-1}\\ \frac{\bm{\mathbf{g}}^{T}}{\|\bm{\mathbf{g}}\|_{2}}\end{matrix}\right]}_{=:\bm{\mathbf{B}}}+\bm{\mathbf{D}}^{\prime}_{i-1}. (B.15)

    Note that $rank(\bm{\mathbf{B}})=k+1$ and that (B.15) provides a rank-$(k+1)$ eigendecomposition of $\bm{\mathbf{B}}$ (the rows of $\bm{\mathbf{V}}^{\prime}_{i-1}$ are orthonormal and orthogonal to $\bm{\mathbf{g}}$). In order to retain a rank-$k$ representation of the bound, we remove the smallest eigenvalue and the corresponding eigenvector from the matrix $\bm{\mathbf{B}}$.

    • case 1)

      If $\|\bm{\mathbf{g}}\|_{2}^{2}\leq\min_{j=1\dots k}\bm{\mathbf{S}}^{\prime}_{i-1}(j,j)$, remove the eigenvalue $\|\bm{\mathbf{g}}\|_{2}^{2}$ and the corresponding eigenvector $\frac{\bm{\mathbf{g}}}{\|\bm{\mathbf{g}}\|_{2}}$:

      \bm{\mathbf{\Sigma}}^{\prime}_{i}=\bm{\mathbf{V}}_{i-1}^{{}^{\prime}T}\bm{\mathbf{S}}^{\prime}_{i-1}\bm{\mathbf{V}}^{\prime}_{i-1}+\bm{\mathbf{D}}^{\prime}_{i-1}=\widetilde{\bm{\mathbf{\Sigma}}_{i}}-\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}=\widetilde{\bm{\mathbf{\Sigma}}_{i}}-c\bm{\mathbf{v}}\bm{\mathbf{v}}^{T}, (B.16)

      where $c=\|\bm{\mathbf{g}}\|_{2}^{2}$ and $\bm{\mathbf{v}}=\frac{\bm{\mathbf{g}}}{\|\bm{\mathbf{g}}\|_{2}}$. In this case $\bm{\mathbf{V}}_{i}=\bm{\mathbf{V}}^{\prime}_{i-1}$ and $\bm{\mathbf{S}}_{i}=\bm{\mathbf{S}}^{\prime}_{i-1}$.

    • case 2)

      If $\|\bm{\mathbf{g}}\|_{2}^{2}>\min_{j=1\dots k}\bm{\mathbf{S}}^{\prime}_{i-1}(j,j)$, remove the $m$-th eigenvalue of $\bm{\mathbf{S}}^{\prime}_{i-1}$ (the one smallest in absolute value) and the corresponding eigenvector $\bm{\mathbf{V}}^{\prime}_{i-1}(m,\cdot)$:

      \bm{\mathbf{\Sigma}}^{\prime}_{i}=\bm{\mathbf{V}}_{i-1}^{{}^{\prime}T}\bm{\mathbf{S}}^{\prime}_{i-1}\bm{\mathbf{V}}^{\prime}_{i-1}+\bm{\mathbf{D}}^{\prime}_{i-1}+\bm{\mathbf{g}}\bm{\mathbf{g}}^{T}-\bm{\mathbf{S}}^{\prime}_{i-1}(m,m)\bm{\mathbf{V}}^{\prime}_{i-1}(m,\cdot)^{T}\bm{\mathbf{V}}^{\prime}_{i-1}(m,\cdot)=\widetilde{\bm{\mathbf{\Sigma}}_{i}}-c\bm{\mathbf{v}}\bm{\mathbf{v}}^{T}, (B.17)

      where $c=\bm{\mathbf{S}}^{\prime}_{i-1}(m,m)$ and $\bm{\mathbf{v}}=\bm{\mathbf{V}}^{\prime}_{i-1}(m,\cdot)^{T}$. In this case,

      \bm{\mathbf{V}}_{i} =\left[\begin{matrix}\bm{\mathbf{V}}^{\prime}_{i-1}(1,\cdot)\\ \vdots\\ \bm{\mathbf{V}}^{\prime}_{i-1}(m-1,\cdot)\\ \frac{\bm{\mathbf{g}}^{T}}{\|\bm{\mathbf{g}}\|_{2}}\\ \vdots\\ \bm{\mathbf{V}}^{\prime}_{i-1}(k,\cdot)\end{matrix}\right],
      \bm{\mathbf{S}}_{i} =diag\left(\bm{\mathbf{S}}^{\prime}_{i-1}(1,1),\cdots,\bm{\mathbf{S}}^{\prime}_{i-1}(m-1,m-1),\|\bm{\mathbf{g}}\|_{2}^{2},\cdots,\bm{\mathbf{S}}^{\prime}_{i-1}(k,k)\right).

    In both cases

    \bm{\mathbf{\Sigma}}^{\prime}_{i} =\bm{\mathbf{V}}_{i}^{T}\bm{\mathbf{S}}_{i}\bm{\mathbf{V}}_{i}+\bm{\mathbf{D}}^{\prime}_{i-1}, (B.18)
    \widetilde{\bm{\mathbf{\Sigma}}_{i}} =\bm{\mathbf{\Sigma}}^{\prime}_{i}+c\bm{\mathbf{v}}\bm{\mathbf{v}}^{T}, (B.19)

    where $\bm{\mathbf{V}}_{i}\in\mathbb{R}^{k\times d}$ is orthonormal, $\bm{\mathbf{S}}_{i}\in\mathbb{R}^{k\times k}$ and $\bm{\mathbf{D}}^{\prime}_{i-1}\in\mathbb{R}^{d\times d}$ are diagonal, and $\bm{\mathbf{v}}\in\mathbb{R}^{d}$ with $\bm{\mathbf{v}}^{T}\bm{\mathbf{v}}=1$. Therefore, for all $\bm{\mathbf{x}}\in\mathbb{R}^{d}$ we have

    \bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}=\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime}_{i}\bm{\mathbf{x}}+c\bm{\mathbf{x}}^{T}\bm{\mathbf{v}}\bm{\mathbf{v}}^{T}\bm{\mathbf{x}}=\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime}_{i}\bm{\mathbf{x}}+c(\bm{\mathbf{x}}^{T}\bm{\mathbf{v}})^{2}, (B.20)

    so $\bm{\mathbf{\Sigma}}^{\prime}_{i}$ alone does not guarantee $\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime}_{i}\bm{\mathbf{x}}\geq\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}$ for all $\bm{\mathbf{x}}$. We therefore compensate for the removed term by constructing $\bm{\mathbf{\Sigma}}^{\prime\prime}_{i}=\bm{\mathbf{\Sigma}}^{\prime}_{i}+\bm{\mathbf{F}}$ with $\bm{\mathbf{F}}\in\mathbb{R}^{d\times d}$ diagonal such that $\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime\prime}_{i}\bm{\mathbf{x}}\geq\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}$ for all $\bm{\mathbf{x}}$, which is equivalent to

    \bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime\prime}_{i}\bm{\mathbf{x}} \geq\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}_{i}}\bm{\mathbf{x}}
    \bm{\mathbf{x}}^{T}(\bm{\mathbf{\Sigma}}^{\prime}_{i}+\bm{\mathbf{F}})\bm{\mathbf{x}} \geq\bm{\mathbf{x}}^{T}(\bm{\mathbf{\Sigma}}^{\prime}_{i}+c\bm{\mathbf{v}}\bm{\mathbf{v}}^{T})\bm{\mathbf{x}}
    \sum_{i=1}^{d}\bm{\mathbf{x}}^{2}(i)\bm{\mathbf{F}}(i) \geq c\left(\sum_{i=1}^{d}\bm{\mathbf{x}}(i)\bm{\mathbf{v}}(i)\right)^{2}. (B.21)

    Setting $\bm{\mathbf{F}}(i)=c|\bm{\mathbf{v}}(i)|\sum_{j=1}^{d}|\bm{\mathbf{v}}(j)|$, Lemma 2 gives

    \sum_{i=1}^{d}\bm{\mathbf{x}}^{2}(i)\bm{\mathbf{F}}(i) =\sum_{i=1}^{d}|\bm{\mathbf{x}}(i)|^{2}\bm{\mathbf{F}}(i)
    \geq\left(\sum_{i=1}^{d}|\bm{\mathbf{x}}(i)|\frac{\bm{\mathbf{F}}(i)}{\sqrt{\sum_{j=1}^{d}\bm{\mathbf{F}}(j)}}\right)^{2}
    =\left(\sum_{i=1}^{d}|\bm{\mathbf{x}}(i)|\frac{c|\bm{\mathbf{v}}(i)|\sum_{j=1}^{d}|\bm{\mathbf{v}}(j)|}{\sqrt{c}\sum_{j=1}^{d}|\bm{\mathbf{v}}(j)|}\right)^{2}
    \geq c\left(\sum_{i=1}^{d}\bm{\mathbf{x}}(i)\bm{\mathbf{v}}(i)\right)^{2}. (B.22)

    Combining (B.21) and (B.22), $\bm{\mathbf{\Sigma}}^{\prime\prime}_{i}=\bm{\mathbf{\Sigma}}^{\prime}_{i}+\bm{\mathbf{F}}$ has the required property (a numerical spot-check of (B.22) is given at the end of this section). Letting $\bm{\mathbf{D}}_{i}=\bm{\mathbf{D}}_{i-1}^{\prime}+\bm{\mathbf{F}}$, together with $\bm{\mathbf{V}}_{i}$ and $\bm{\mathbf{S}}_{i}$ defined in cases 1) and 2) above, we obtain

    \bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}_{i}\bm{\mathbf{x}}\leq\bm{\mathbf{x}}^{T}\widetilde{\bm{\mathbf{\Sigma}}}_{i}\bm{\mathbf{x}}\leq\bm{\mathbf{x}}^{T}\bm{\mathbf{\Sigma}}^{\prime\prime}_{i}\bm{\mathbf{x}}=\bm{\mathbf{x}}^{T}\left(\bm{\mathbf{V}}_{i}^{T}\bm{\mathbf{S}}_{i}\bm{\mathbf{V}}_{i}+\bm{\mathbf{D}}_{i}\right)\bm{\mathbf{x}}.
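The inductive step above is fully constructive, so it can be exercised numerically. Below is a minimal numpy sketch of one update step: the projection (B.8), the diagonal cross-term correction $\bm{\mathbf{P}}$ justified by (B.12), the eigendecomposition leading to (B.15), removal of the smallest eigenvalue as in cases 1) and 2), and the diagonal compensation $\bm{\mathbf{F}}$ from (B.22). This is an illustration of the proof only, not the paper's pseudocode; all function and variable names are ours, and the inductive hypothesis is instantiated with equality for the test.

import numpy as np

def lowrank_rank_one_update(V, S_diag, D_diag, r):
    # One inductive step: given V^T diag(S) V + diag(D) >= Sigma_{i-1}
    # (as quadratic forms) and a rank-one term r r^T, return (V, S, D)
    # with V^T diag(S) V + diag(D) >= Sigma_{i-1} + r r^T.
    k, d = V.shape
    Vr = V @ r
    a = V.T @ Vr                       # projection of r onto the row space of V, cf. (B.8)
    g = r - a                          # orthogonal residual
    g_norm = np.linalg.norm(g)
    # bound the cross terms a g^T + g a^T by ||a|| ||g|| I, cf. (B.12)
    D_new = D_diag + np.linalg.norm(a) * g_norm
    # eigendecompose the small k x k block S + (V r)(V r)^T, cf. (B.15)
    w, Q = np.linalg.eigh(np.diag(S_diag) + np.outer(Vr, Vr))
    Vp = Q.T @ V                       # the proof's V' (rows remain orthonormal)
    if g_norm < 1e-12:
        return Vp, w, D_new            # r already lies in the row space of V
    # eigen-pairs of B: (w_j, Vp[j]) for j = 1..k and (||g||^2, g/||g||)
    vals = np.concatenate([w, [g_norm ** 2]])
    vecs = np.vstack([Vp, g / g_norm])
    m = int(np.argmin(vals))           # drop the smallest eigenvalue (cases 1 and 2)
    keep = [j for j in range(k + 1) if j != m]
    c, v = vals[m], vecs[m]
    # compensate the removed c v v^T by the diagonal F, cf. (B.22)
    F = c * np.abs(v) * np.sum(np.abs(v))
    return vecs[keep], vals[keep], D_new + F

# numerical check of x^T Sigma_i x <= x^T (V_i^T S_i V_i + D_i) x
rng = np.random.default_rng(0)
k, d = 4, 30
V = np.linalg.qr(rng.standard_normal((d, k)))[0].T
S_diag = rng.uniform(0.1, 1.0, size=k)
D_diag = rng.uniform(0.1, 1.0, size=d)
r = rng.standard_normal(d)
Sigma_prev = V.T @ np.diag(S_diag) @ V + np.diag(D_diag)   # hypothesis taken with equality
V2, S2, D2 = lowrank_rank_one_update(V, S_diag, D_diag, r)
gap = V2.T @ np.diag(S2) @ V2 + np.diag(D2) - (Sigma_prev + np.outer(r, r))
assert np.linalg.eigvalsh(gap).min() > -1e-8               # the gap matrix is PSD

Checking that the gap matrix is positive semi-definite is equivalent to the quadratic-form inequality holding for every $\bm{\mathbf{x}}$.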

We have thus constructed sequences of matrices $\{\bm{\mathbf{V}}_{i}\}_{i=1}^{n}$, $\{\bm{\mathbf{S}}_{i}\}_{i=1}^{n}$ and $\{\bm{\mathbf{D}}_{i}\}_{i=1}^{n}$ satisfying condition (B.6). Defining $\bm{\mathbf{V}},\bm{\mathbf{S}},\bm{\mathbf{D}}=\bm{\mathbf{V}}_{n},\bm{\mathbf{S}}_{n},\bm{\mathbf{D}}_{n}$ and keeping $\bm{\mathbf{\mu}}$ and $z$ the same as in Thm 1, formula (B.7) implies

Z_{\bm{\mathbf{x}}}(\bm{\mathbf{\theta}})\leq z\exp\left(\frac{1}{2}(\bm{\mathbf{\theta}}-\bm{\mathbf{\widetilde{\theta}}})^{T}(\bm{\mathbf{V}}^{T}\bm{\mathbf{S}}\bm{\mathbf{V}}+\bm{\mathbf{D}})(\bm{\mathbf{\theta}}-\bm{\mathbf{\widetilde{\theta}}})+(\bm{\mathbf{\theta}}-\bm{\mathbf{\widetilde{\theta}}})^{T}\bm{\mathbf{\mu}}\right). (B.23)
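As a last sanity check, the diagonal compensation step can be verified in isolation: the short numpy sketch below (again an illustration only, with our own variable names) confirms inequality (B.22), i.e. that $diag(\bm{\mathbf{F}})$ with $\bm{\mathbf{F}}(i)=c|\bm{\mathbf{v}}(i)|\sum_{j=1}^{d}|\bm{\mathbf{v}}(j)|$ dominates $c\bm{\mathbf{v}}\bm{\mathbf{v}}^{T}$ as a quadratic form.

import numpy as np

# spot-check of inequality (B.22): diag(F) - c v v^T is positive semi-definite
rng = np.random.default_rng(1)
d = 100
v = rng.standard_normal(d)
v /= np.linalg.norm(v)                       # unit eigenvector, as in (B.20)
c = float(rng.uniform(0.5, 2.0))             # removed (nonnegative) eigenvalue
F = c * np.abs(v) * np.sum(np.abs(v))        # diagonal compensation from (B.22)
gap = np.diag(F) - c * np.outer(v, v)
assert np.linalg.eigvalsh(gap).min() > -1e-10
# equivalently, sum_i x(i)^2 F(i) >= c (x^T v)^2 on random test vectors x
for x in rng.standard_normal((1000, d)):
    assert np.sum(x * x * F) >= c * (x @ v) ** 2 - 1e-10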