

Analysis of the Stochastic Alternating Least Squares Method for the Decomposition of Random Tensors
Submitted to the editors 04/01/2020.

Yanzhao Cao, Somak Das, Luke Oeding, and Hans-Werner van Wyk, Department of Mathematics and Statistics, Auburn University, Auburn, AL.
Abstract

Stochastic Alternating Least Squares (SALS) is a method that approximates the canonical decomposition of averages of sampled random tensors. Its simplicity and efficient memory usage make SALS an ideal tool for decomposing tensors in an online setting. We show, under mild regularization and readily verifiable assumptions on the boundedness of the data, that the SALS algorithm is globally convergent. Numerical experiments validate our theoretical findings and demonstrate the algorithm’s performance and complexity.

keywords:
canonical decomposition, parallel factors, stochastic optimization, stochastic alternating least squares, random tensor, block nonlinear Gauss-Seidel
AMS subject classifications:

65K10, 90C15, 15A69, 68W20

1 Introduction

Multi-modal data is represented by a tensor, or a multidimensional matrix. Tensor data is present in areas such as Natural Language Processing (NLP) [8], Blind Source Separation (BSS) [9, 28], and Phylogenetic Tree Reconstruction (PTR) [12, 2, 17]. In each of these areas, canonical decomposition (CANDECOMP)/parallel factors (PARAFAC), also known as CP tensor decomposition (representing a given tensor as a sum of rank-one tensors), provides important insights since the components of the rank-one terms, the factors, represent meaning in the data. For NLP, given a tensor constructed from large text databases, the rank-one terms could represent topics, and the factors could represent words. For BSS, given a tensor constructed from signals arriving at a single receiver from unknown sources, the rank-one terms could represent sources, and the factors could be used to locate the sources. For PTR, given a tensor constructed as a contingency table for instances of nucleotide combinations from aligned DNA from several species, the factors represent model parameters for the phylogenetic tree.

In applications, observations may represent samples of an underlying random tensor. For example, the word co-occurrence frequencies used in NLP may come from a sample of a collection of documents. Here, fast, reliable algorithms are desired for the CP decomposition of the average tensor. Because CP decomposition is known to be NP-hard in general [15], and because the average tensor is usually only available through sampling, we investigate approximate decompositions based on stochastic optimization. In [22], Maehara et al. propose a stochastic version of the widely-used alternating least squares (ALS) method, the stochastic alternating least squares (SALS) method, and show the algorithm’s efficiency by means of numerical experiments. In this paper, we provide a detailed analysis of the algorithm, showing under mild regularization and a minimal set of verifiable assumptions that it is globally convergent to a local minimizer for any initial guess. In Section 4, we also include a detailed discussion of the algorithm’s complexity and efficiency.

The alternating least squares (ALS) method, first proposed by Carroll and Chang [10], remains the most widely used algorithm for computing the CP decomposition of a tensor [13]. This block-nonlinear Gauss-Seidel method [5, 26] exploits the component-wise quadratic structure of the cost functional to compute iterates efficiently and with a low memory footprint. It has been modified [11] to exploit sparsity structure in data, which typically occurs in tensors representing co-occurrence frequencies such as in the tree reconstruction above. Regularized versions of the ALS method, as well as the associated proximal minimization iterations considered in [21], help mitigate the potential ill-posedness of the CP problem to within sample noise. Although one may compute the tensor’s expectation a priori and decompose it by means of the standard ALS method (see Remark 2.1), such an approach results in a loss of sparsity and cannot seamlessly accommodate the arrival of new samples. In [22], Maehara et al. proposed the Stochastic Alternating Least Squares method, a block-stochastic minimization method that preserves the salient features of the ALS method, while also efficiently incorporating tensor samples in the optimization procedure. Recent work (see [3, 20]) has considered the related problem of using randomized (or stochastic) methods to decompose existing large-scale tensors. In particular, in [3], a variant of the ALS method is developed that approximates the component least squares problem efficiently by randomized/sketching methods. In [20], a dense tensor is expressed as the expectation of a sequence of sparse ones, thereby allowing the use of stochastic gradient descent (SGD) methods.

The convergence analysis of the SALS algorithm applied to the CP decomposition of a random tensor is complicated by the fact that the underlying cost functional is not convex. Moreover, the algorithm itself is a stochastic, block-iterative, second order method whose successive iterates do not have the Markovian dependence structure present in classical stochastic gradient descent (SGD) methods. The convergence of the ALS method was studied in [6], where a quotient-linear convergence rate was established. Block-coordinate techniques for the unconstrained optimization of general, possibly nonconvex, deterministic functionals were treated in [14] (see also [25, 4, 32] and references therein). Xu and Yin [31] discuss the convergence of block stochastic gradient methods for averages of convex and nonconvex functionals. They rely on assumptions (such as the uniformly bounded variance of gradient iterates) that, while standard in the literature (see e.g. [7]), are difficult to verify in practice since they pertain to the algorithm itself. The main contributions of this paper are to prove the convergence of the SALS algorithm, a block-stochastic Newton method, for the CP decomposition of the average of a random tensor, subject to a single verifiable assumption relating to the boundedness of the observed data.

This paper is structured as follows. In Section 2, we introduce the CP decomposition problem for random tensors and describe the stochastic alternating least squares algorithm (Algorithm 2). Section 3 contains our main theoretical contributions. In Section 3.1, we exploit the special multinomial structure of the cost functional to quantify the regularity of its component-wise gradient and Hessian in terms of the size of the algorithm’s iterate vectors (see Lemma 3.5 and its corollaries). In Section 3.2 we show that the iterates themselves can be bounded in terms of the random tensor to be decomposed (Lemma 3.15), which naturally leads to our single, verifiable assumption on the latter’s statistical properties (3.16). In Section 3.3, we show that the iterates generated by the SALS algorithm converge to a local minimizer. We validate the SALS method numerically via a few computational examples in Section 4 and offer concluding remarks in Section 5.

2 Notation and Problem Description

We follow notational conventions that are closely aligned with the literature on the CP decomposition (see [19]), as well as on stochastic optimization (see [7, 31]). We use lowercase letters to denote scalars, bold letters for vectors, uppercase letters for matrices, and uppercase calligraphic letters for tensors. We use superscripts to indicate iteration (or sub-iteration) indices and subscripts for components, i.e. x_{i}^{k} is the i-th component at the k-th iteration. Multi-indices are denoted by bold letters, and sums, products, and integrals over these should be interpreted in iterated form. For the block coordinate minimization method in Algorithm 2, it is convenient to express the vector {\bm{x}}=({\bm{x}}_{1},\ldots,{\bm{x}}_{p}) in terms of its i-th block component {\bm{x}}_{i} and the remaining components, denoted {\bm{x}}_{i^{*}}:=({\bm{x}}_{1},\ldots,{\bm{x}}_{i-1},{\bm{x}}_{i+1},\ldots,{\bm{x}}_{p}). Using this notation, we write {\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}}) with the implicit understanding that the block components are in the correct order. The Frobenius norm of a vector, matrix, or tensor, i.e. the square root of the sum of squares of its entries, is denoted by \|\cdot\|. For 1\leq p\leq\infty, \|\cdot\|_{p} denotes the standard Euclidean p-norm for vectors and the induced matrix p-norm for matrices. We use ‘\circ’ to denote the outer product, ‘*’ for the component-wise (or Hadamard) product, ‘\odot’ for the column-wise Khatri-Rao product, and ‘\otimes’ for the Kronecker product.
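
To make this notation concrete, the following minimal numpy sketch (illustrative only; the variable names are ours and not part of the paper) forms the outer, Hadamard, Kronecker, and column-wise Khatri-Rao products for two small random factor matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))            # a factor matrix in R^{4 x 3}
B = rng.random((5, 3))            # a factor matrix in R^{5 x 3}
a, b = A[:, 0], B[:, 0]

outer = np.multiply.outer(a, b)   # outer product a ∘ b, shape (4, 5)
hadamard = (A.T @ A) * (B.T @ B)  # Hadamard product '*' of the two Gram matrices, shape (3, 3)
kron = np.kron(a, b)              # Kronecker product a ⊗ b, shape (20,)
# column-wise Khatri-Rao product A ⊙ B: Kronecker products of matching columns
khatri_rao = np.stack([np.kron(A[:, j], B[:, j]) for j in range(3)], axis=1)
print(outer.shape, hadamard.shape, kron.shape, khatri_rao.shape)
```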

Let (Ω,,dμ)(\Omega,\mathcal{F},d\mu) be a complete probability space and let 𝒳:Ωn1×n2×np\mathcal{X}:\Omega\rightarrow\mathbb{R}^{n_{1}\times n_{2}\ldots\times n_{p}} be a measurable map, also known as a pp-th order random tensor. In practice the underlying probability space is rarely known explicitly. Instead, its law can often be observed indirectly through sample realizations 𝒳1,𝒳2,,𝒳n\mathcal{X}^{1},\mathcal{X}^{2},\ldots,\mathcal{X}^{n} of 𝒳\mathcal{X} that are assumed to be independent and identically distributed (iid). We use 𝔼\mathbb{E} to denote expectation:

𝔼[f]=Ωf(𝒳)𝑑μ𝒳\mathbb{E}\left[f\right]=\int_{\Omega}f(\mathcal{X})d\mu_{\mathcal{X}}

for any integrable function f:n1××npmf:\mathbb{R}^{n_{1}\times\ldots\times n_{p}}\rightarrow\mathbb{R}^{m}.

The rank-rr decomposition problem for 𝒳\mathcal{X} (or its sample realizations) amounts to finding a set of rr rank-one deterministic tensors {𝒳^j}j=1r\{\hat{\mathcal{X}}_{j}\}_{j=1}^{r}, so that the quantity

(1) 𝒳^:=j=1r𝒳^j\hat{\mathcal{X}}:=\sum_{j=1}^{r}\hat{\mathcal{X}}_{j}

is a good overall representation of 𝒳\mathcal{X}. Each rank-one tensor 𝒳^j\hat{\mathcal{X}}_{j}, j=1,,rj=1,\ldots,r, is formed by the outer product 𝒳^j=𝒂1,j𝒂p,j\hat{\mathcal{X}}_{j}={\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j} where 𝒂i,j=(ai,j,1,,ai,j,ni)ni{\bm{a}}_{i,j}=(a_{i,j,1},\ldots,a_{i,j,n_{i}})\in\mathbb{R}^{n_{i}} for each i=1,pi=1,\ldots p, i.e. 𝒳^jn1××np\hat{\mathcal{X}}_{j}\in\mathbb{R}^{n_{1}\times\ldots\times n_{p}} is defined component-wise by

𝒳^j,i1,,ip=l=1pal,j,il.\hat{\mathcal{X}}_{j,i_{1},\ldots,i_{p}}=\prod_{l=1}^{p}a_{l,j,i_{l}}.

We use the mean squared error 𝔼[𝒳𝒳^2]\mathbb{E}\left[\|\mathcal{X}-\hat{\mathcal{X}}\|^{2}\right] to measure the quality of the representation. Other risk measures may be more appropriate, depending on the application, possibly requiring a different analysis and approximation technique.

For analysis and computation, it is convenient to consider two other representations of the design variable. We define the i-th factor matrix A_{i}=[{\bm{a}}_{i,1},\ldots,{\bm{a}}_{i,r}]\in\mathbb{R}^{n_{i}\times r} to consist of the i-th component vectors of each term in decomposition (1). Let {\bm{x}}=\mathrm{vec}([A_{1},\ldots,A_{p}])\in\mathbb{R}^{nr}, with n=\sum_{i=1}^{p}n_{i}, be the vectorization of the factor matrices, i.e. the concatenation of their column vectors. We also write {\bm{x}} in block component form as {\bm{x}}=({\bm{x}}_{1},\ldots,{\bm{x}}_{p}), where {\bm{x}}_{i}=\mathrm{vec}(A_{i})\in\mathbb{R}^{rn_{i}} for i=1,\ldots,p. To emphasize the dependence of \hat{\mathcal{X}} on {\bm{x}}, we write the following (which also defines \left\llbracket\cdot\right\rrbracket)

𝒳^=𝒙=A1,,Ap:=j=1r𝒂1,j𝒂p,j.\hat{\mathcal{X}}=\left\llbracket{\bm{x}}\right\rrbracket=\left\llbracket A_{1},\ldots,A_{p}\right\rrbracket:=\sum_{j=1}^{r}{\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j}.
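
As a concrete illustration of the bracket notation, the short sketch below (the helper name cp_reconstruct is ours, introduced only for illustration) assembles \left\llbracket A_{1},\ldots,A_{p}\right\rrbracket as a sum of r outer products.

```python
import numpy as np

def cp_reconstruct(factors):
    """Assemble [[A_1, ..., A_p]] = sum_{j=1}^r a_{1,j} o ... o a_{p,j} from the factor matrices."""
    r = factors[0].shape[1]
    X_hat = np.zeros(tuple(A.shape[0] for A in factors))
    for j in range(r):
        term = factors[0][:, j]
        for A in factors[1:]:
            term = np.multiply.outer(term, A[:, j])   # build up the rank-one term
        X_hat += term
    return X_hat

# example: a rank-2 candidate in R^{3 x 4 x 5}
rng = np.random.default_rng(0)
print(cp_reconstruct([rng.random((n, 2)) for n in (3, 4, 5)]).shape)   # (3, 4, 5)
```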

No efficient closed-form solution of the factorization problem exists (even for deterministic tensors) [15], so it is commonly reformulated as the optimization problem

(2) min𝒙nr𝔼[𝒳𝒙2].\min_{{\bm{x}}\in\mathbb{R}^{nr}}\mathbb{E}\left[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}\right].
Remark 2.1.

Letting 𝐢=(i1,,ip){\bm{i}}=(i_{1},\ldots,i_{p}) and 𝐧=(n1,,np){\bm{n}}=(n_{1},\ldots,n_{p}), it follows readily that

𝔼[𝒳𝒙2]\displaystyle\mathbb{E}\left[\left\|\mathcal{X}-\llbracket{\bm{x}}\rrbracket\right\|^{2}\right] =𝔼[𝒊=1𝒏(𝒳𝒊𝒙𝒊)2]\displaystyle=\mathbb{E}\left[\sum_{{\bm{i}}=1}^{\bm{n}}\left(\mathcal{X}_{\bm{i}}-\llbracket{\bm{x}}\rrbracket_{\bm{i}}\right)^{2}\right]
=𝒊=1𝒏𝔼[𝒳𝒊2]2𝔼[𝒳𝒊]𝒙𝒊+𝒙𝒊2\displaystyle=\sum_{{\bm{i}}=1}^{{\bm{n}}}\mathbb{E}\left[\mathcal{X}_{\bm{i}}^{2}\right]-2\mathbb{E}\left[\mathcal{X}_{\bm{i}}\right]\llbracket{\bm{x}}\rrbracket_{\bm{i}}+\llbracket{\bm{x}}\rrbracket_{\bm{i}}^{2}
=𝒊=1𝒏var(𝒳𝒊)+𝒊=1𝒏(𝔼(𝒳𝒊)𝒙𝒊)2.\displaystyle=\sum_{{\bm{i}}=1}^{\bm{n}}\mathrm{var}(\mathcal{X}_{\bm{i}})+\sum_{{\bm{i}}=1}^{\bm{n}}\left(\mathbb{E}(\mathcal{X}_{\bm{i}})-\llbracket{\bm{x}}\rrbracket_{\bm{i}}\right)^{2}.

Since the variance var(𝒳𝐢)\mathrm{var}(\mathcal{X}_{{\bm{i}}}) of 𝒳𝐢\mathcal{X}_{{\bm{i}}} is constant in the design variable 𝐱\bm{x}, the minimization Problem (2) is equivalent to the decomposition of the expected tensor, i.e.

minxnr12𝔼[𝒳]𝒙2.\min_{x\in\mathbb{R}^{nr}}\frac{1}{2}\|\mathbb{E}[\mathcal{X}]-\llbracket{\bm{x}}\rrbracket\|^{2}.

The analysis in this work is however based on the formulation given by (2), since it lends itself more readily to extensions to more general risk functions.
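
The identity in Remark 2.1 also holds exactly for a finite iid sample when the expectation and variance are replaced by the empirical mean and the (biased) empirical variance, which gives a quick numerical sanity check (purely illustrative, with arbitrary synthetic data standing in for \mathcal{X} and \llbracket{\bm{x}}\rrbracket).

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=(1000, 3, 4, 5))   # iid draws standing in for X
T = rng.random((3, 4, 5))                                        # an arbitrary candidate tensor

lhs = np.mean([np.sum((X - T) ** 2) for X in samples])           # empirical E||X - T||^2
rhs = np.sum(samples.var(axis=0)) + np.sum((samples.mean(axis=0) - T) ** 2)
print(np.isclose(lhs, rhs))                                      # True, up to round-off
```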

Tensor decompositions have scale ambiguity. Indeed, for any term j=1,\ldots,r and any scalars \beta_{1,j},\ldots,\beta_{p-1,j}>0, let \beta_{p,j}=1/\prod_{i=1}^{p-1}\beta_{i,j}. Then

j=1r(β1,j𝒂1,j)(βp,j𝒂p,j)=j=1r𝒂1,j𝒂p,j.\sum_{j=1}^{r}(\beta_{1,j}{\bm{a}}_{1,j})\circ\ldots\circ(\beta_{p,j}{\bm{a}}_{p,j})=\sum_{j=1}^{r}{\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j}.

The optimizer of (2) therefore lies on a simply connected manifold (see [30]), which can lead to difficulties in the convergence of optimization algorithms. To ensure isolated minima that can be readily located, it is common to enforce bounds on the size of the factors, either directly through an appropriate normalization (see [30]), or indirectly through the addition of a regularization term [22]. We pursue the latter strategy, leading to the problem

(3) min𝒙nrF(𝒙)=min𝒙nr𝔼[f(𝒙)], withf(𝒙;𝒳)=12𝒳𝒙2+λ2𝒙2,𝒙nr,λ>0.\min_{{\bm{x}}\in\mathbb{R}^{nr}}F({\bm{x}})=\min_{{\bm{x}}\in\mathbb{R}^{nr}}\mathbb{E}\left[f({\bm{x}})\right],\text{ with}\\ f({\bm{x}};\mathcal{X})=\frac{1}{2}\left\|\mathcal{X}-\llbracket{\bm{x}}\rrbracket\right\|^{2}+\frac{\lambda}{2}\|{\bm{x}}\|^{2},\qquad{\bm{x}}\in\mathbb{R}^{nr},\lambda>0.

While the regularization term biases the minimizers of Problem (2), its inclusion is key to guaranteeing well-posedness in the presence of noise. It plays a pivotal role in (i) proving the existence of a minimizer (Lemma 2.2), (ii) ensuring the positive-definiteness of the Hessian (Corollary 3.11), and (iii) guaranteeing the boundedness of the iterates in terms of the random tensor \mathcal{X} (Lemma 3.15). Heuristic methods, the most well-known of which is the Morozov discrepancy principle [24], are typically used to choose a value of \lambda that balances the bias of the optimizers against stability considerations.
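
For reference, the sampled objective in (3) can be transcribed directly as follows (a minimal sketch; the function and argument names are ours, with lam playing the role of \lambda).

```python
import numpy as np
from functools import reduce

def f_sample(factors, X, lam):
    """Sampled objective f(x; X) = 0.5*||X - [[x]]||^2 + 0.5*lam*||x||^2, cf. (3)."""
    r = factors[0].shape[1]
    # [[x]] assembled as a sum of rank-one outer products
    X_hat = sum(reduce(np.multiply.outer, [A[:, j] for A in factors]) for j in range(r))
    penalty = sum(np.sum(A ** 2) for A in factors)     # ||x||^2 = sum_i ||A_i||_F^2
    return 0.5 * np.sum((X - X_hat) ** 2) + 0.5 * lam * penalty
```

In particular, the regularization term alone gives f_sample(factors, X, lam) >= 0.5*lam*penalty, which is precisely the elementary bound (5) recorded below.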

Lemma 2.2 (Existence of Minimizers).

Problem (3) has at least one minimizer.

Proof 2.3.

Let F^{*}=\inf_{{\bm{x}}\in\mathbb{R}^{nr}}F({\bm{x}}). By the definition of the infimum, there exists a sequence \{{\bm{x}}^{k}\}_{k=1}^{\infty} in \mathbb{R}^{nr} with

limkF(𝒙k)=F,\lim_{k\rightarrow\infty}F({\bm{x}}^{k})=F^{*},

from which it follows that F(𝐱k)F({\bm{x}^{k}}) is bounded. The inequality

(4) 𝒙22λF(𝒙)for all 𝒙rn,\|{\bm{x}}\|^{2}\leq\frac{2}{\lambda}F({\bm{x}})\ \ \text{for all }{\bm{x}}\in\mathbb{R}^{rn},

allows us to establish the boundedness of {𝐱k}k=1\{{\bm{x}}^{k}\}_{k=1}^{\infty}. By compactness, there exists a convergent subsequence 𝐱ki𝐱{\bm{x}}^{k_{i}}\rightarrow{\bm{x}}^{*} as ii\rightarrow\infty. The continuity of FF then implies

F(𝒙)=limiF(𝒙ki)=F.F({\bm{x}^{*}})=\lim_{i\rightarrow\infty}F({\bm{x}}^{k_{i}})=F^{*}.

Remark 2.4.

For the sample realizations f(𝐱;𝒳)f({\bm{x}};\mathcal{X}) we have a bound similar to (4):

(5) for all 𝒙nr,𝒙22λf(𝒙;𝒳),a.s. on Ω.\text{for all }{\bm{x}}\in\mathbb{R}^{nr},\quad\|{\bm{x}}\|^{2}\leq\frac{2}{\lambda}f({\bm{x}};\mathcal{X}),\ \ \ \text{a.s. on }\Omega.

2.1 The Stochastic Alternating Least Squares Method

Although the objective function FF is a high degree polynomial in general, it is quadratic in each component factor matrix AiA_{i}. This is apparent from the matricized form of (3). Recall [18] that the columnwise Khatri-Rao product :ni×r×nj×rninj×r\odot:\mathbb{R}^{n_{i}\times r}\times\mathbb{R}^{n_{j}\times r}\rightarrow\mathbb{R}^{n_{i}n_{j}\times r} of matrices A=[𝒂1,,𝒂r]A=[\bm{a}_{1},...,{\bm{a}}_{r}] and B=[𝒃1,,𝒃r]B=[{\bm{b}}_{1},...,{\bm{b}}_{r}] is defined as their columnwise Kronecker product, i.e. AB=[𝒂1𝒃1,,𝒂r𝒃r]A\odot B=[{\bm{a}}_{1}\otimes{\bm{b}}_{1},...,{\bm{a}}_{r}\otimes{\bm{b}}_{r}]. The matricization A1,,Ap(i)\llbracket A_{1},\ldots,A_{p}\rrbracket_{(i)} of the rank rr tensor A1,,Ap\llbracket A_{1},\ldots,A_{p}\rrbracket along the ii-th fiber takes the form (see [1])

(6) A1,,Ap(i)=Ai(ApAi+1Ai1A1)T=:AiΘiT,\llbracket A_{1},\ldots,A_{p}\rrbracket_{(i)}=A_{i}\left(A_{p}\odot\ldots\odot A_{i+1}\odot A_{i-1}\odot\ldots\odot A_{1}\right)^{T}=:A_{i}\Theta_{i}^{T},

where Θi\Theta_{i} is defined to be the iterated Khatri-Rao product on the right. Note that Θi\Theta_{i} does not depend on AiA_{i}, and hence the matricized tensor decomposition is linear in AiA_{i}. Since the Frobenius norm is invariant under matricization, the sample objective function f(𝒙,𝒳)f({\bm{x}},\mathcal{X}) can then be rewritten as a quadratic function in AiA_{i}, i.e.

f(A1,,Ap;𝒳):=12𝒳(i)AiΘiT2+λ2j=1pAj2.f(A_{1},\ldots,A_{p};\mathcal{X}):=\frac{1}{2}\left\|\mathcal{X}_{(i)}-A_{i}\Theta_{i}^{T}\right\|^{2}+\frac{\lambda}{2}\sum_{j=1}^{p}\|A_{j}\|^{2}.

Vectorizing this form yields a linear least squares objective function in 𝒙i{\bm{x}}_{i}, namely

f(𝒙1,,𝒙p;𝒳)=12vec(𝒳(i))(ΘiIni)𝒙i2+λ2j=1p𝒙j2,f({\bm{x}}_{1},\ldots,{\bm{x}}_{p};\mathcal{X})=\frac{1}{2}\left\|\mathrm{vec}(\mathcal{X}_{(i)})-(\Theta_{i}\otimes I_{n_{i}}){\bm{x}}_{i}\right\|^{2}+\frac{\lambda}{2}\sum_{j=1}^{p}\|{\bm{x}}_{j}\|^{2},

whose component-wise gradient and Hessian are given respectively by

(7) 𝒙if(𝒙;𝒳)\displaystyle\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X}) =(ΘiTIni)vec(𝒳(i))+((ΘiTΘi+λIr)Ini)𝒙i,and\displaystyle=-(\Theta_{i}^{T}\otimes I_{n_{i}})\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}\right){\bm{x}}_{i},\quad\text{and}
(8) 𝒙i2f(𝒙)\displaystyle\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}) =(ΘiTΘi+λIr)Ini,\displaystyle=(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}},

where Irr×rI_{r}\in\mathbb{R}^{r\times r} and Inini×niI_{n_{i}}\in\mathbb{R}^{n_{i}\times n_{i}} are identity matrices. In the presence of regularization, the Hessian matrix (8) is strictly positive definite with lower bound that is independent of both 𝒳\mathcal{X} and of the remaining components 𝒙i{\bm{x}}_{i^{*}} (see Corollary 3.11). Consequently, each sampled component-wise problem

min𝒙irnif(𝒙;𝒳)\min_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}f({\bm{x}};\mathcal{X})

has a unique solution given by the stationary point 𝒙i{\bm{x}}_{i} satisfying 𝒙if(𝒙,𝒳)=0\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X})=0. It is more efficient to use the matricized form of the stationarity condition, i.e.

(9) 0=𝒳(i)Θi+Ai(ΘiTΘi+λIr).0=-\mathcal{X}_{(i)}\Theta_{i}+A_{i}(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r}).

For any Ani×r,Bnj×rA\in\mathbb{R}^{n_{i}\times r},B\in\mathbb{R}^{n_{j}\times r}, it can be shown [29] that

(AB)T(AB)=(ATA)(BTB),(A\odot B)^{T}(A\odot B)=(A^{T}A)*(B^{T}B),

where * denotes the (Hadamard) product. Repeatedly using this identity, we have

(10) ΘiTΘi:=(A1TA1)(Ai1TAi1)(Ai+1TAi+1)(ApTAp),\Theta_{i}^{T}\Theta_{i}:=(A_{1}^{T}A_{1})*\ldots*(A_{i-1}^{T}A_{i-1})*(A_{i+1}^{T}A_{i+1})*\ldots*(A_{p}^{T}A_{p}),

so the Hessian can be computed as the component-wise product of pp matrices in r×r\mathbb{R}^{r\times r}.
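
Putting (6), (9), and (10) together, one regularized block update can be sketched as follows (illustrative Python; the helper names and the mode-i unfolding convention, chosen here to match the column ordering in (6), are ours).

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Khatri-Rao product of a list of matrices that share the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.stack([np.kron(out[:, j], M[:, j]) for j in range(out.shape[1])], axis=1)
    return out

def als_component_update(X, factors, i, lam):
    """Solve the stationarity condition (9): A_i = X_(i) Theta_i (Theta_i^T Theta_i + lam*I)^{-1}."""
    p, r = len(factors), factors[0].shape[1]
    # Theta_i = A_p ⊙ ... ⊙ A_{i+1} ⊙ A_{i-1} ⊙ ... ⊙ A_1, as in (6)
    Theta = khatri_rao([factors[l] for l in reversed(range(p)) if l != i])
    # mode-i unfolding consistent with (6): remaining indices flattened with the first varying fastest
    X_i = np.reshape(np.moveaxis(X, i, 0), (X.shape[i], -1), order='F')
    gram = np.ones((r, r))
    for l in range(p):
        if l != i:
            gram *= factors[l].T @ factors[l]          # Hadamard identity (10)
    return np.linalg.solve(gram + lam * np.eye(r), (X_i @ Theta).T).T
```

Cycling this update over the blocks i=1,\ldots,p, with X replaced by (an approximation of) \mathbb{E}[\mathcal{X}], reproduces one sweep of Algorithm 1 below.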

The (deterministic) ALS method (Algorithm 1) exploits the component-wise quadratic structure of the objective functional FF, which it inherits from ff. In the kk-th block iteration, the ALS algorithm cycles through the pp components 𝒙1,,𝒙p{\bm{x}}_{1},\ldots,{\bm{x}}_{p} of 𝒙{\bm{x}}, updating each component in turn in the direction of the component-wise minimizer. The function is also updated at each subiteration to reflect this change. Specifically, the iterate at the beginning of the kk-th block is denoted by

𝒙k=𝒙k,0=(𝒙1k,𝒙2k,,𝒙pk).\displaystyle{\bm{x}}^{k}={\bm{x}}^{k,0}=({\bm{x}}_{1}^{k},{\bm{x}}_{2}^{k},...,{\bm{x}}_{p}^{k}).

The ALS algorithm then generates a sequence of sub-iterates 𝒙k,1,,𝒙k,p{\bm{x}}^{k,1},...,{\bm{x}}^{k,p}, where

𝒙k,i=(𝒙1k+1,,𝒙i1k+1,𝒙ik+1,𝒙i+1k,,𝒙pk).{\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k}).

Note that, under this convention, 𝒙k,p=𝒙k+1=𝒙k+1,0{\bm{x}}^{k,p}={\bm{x}}^{k+1}={\bm{x}}^{k+1,0}.

Algorithm 1 The Alternating Least Squares Algorithm
1:  Initial guess 𝒙1{\bm{x}^{1}}
2:  for k=1,2,k=1,2,\ldots do
3:     for i=1,,pi=1,\ldots,p do
4:        \displaystyle{\bm{x}}_{i}^{k+1}=\operatorname*{argmin}_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}F({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
5:        𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
6:     end for
7:  end for

Although the descent achieved by the ALS method during the k-th block iteration is most likely smaller than that achieved by a monolithic descent direction for F over the entire space \mathbb{R}^{nr}, the ALS updates can be obtained at a significantly lower computational cost and with a lower memory footprint.

The block-component-wise quadratic structure of the objective F allows us to compute {\bm{x}}_{i}^{k+1} explicitly. To this end, we form the second-order Taylor expansion of F in {\bm{x}}_{i} about the current iterate {\bm{x}}^{k,i-1}, i.e.

(11) F(𝒙1k+1,𝒙i1k+1,𝒙i+𝒑,𝒙i+1k,,𝒙pk)\displaystyle F({\bm{x}}_{1}^{k+1},\ldots{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}+{\bm{p}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k}) =Fk,i+(𝒈k,i)T𝒑+12𝒑THk,i𝒑,\displaystyle=F^{k,i}+({\bm{g}}^{k,i})^{T}{\bm{p}}+\frac{1}{2}{\bm{p}}^{T}H^{k,i}{\bm{p}},
withFk,i=F(𝒙k,i1)=𝔼[f(𝒙k,i1)],\displaystyle\text{with}\quad F^{k,i}=F({\bm{x}}^{k,i-1})=\mathbb{E}\left[f({\bm{x}}^{k,i-1})\right],
𝒈k,i=𝒙iF(𝒙k,i1)=𝔼[𝒙if(𝒙k,i1)],and\displaystyle{\bm{g}}^{k,i}=\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1})=\mathbb{E}\left[\nabla_{{\bm{x}}_{i}}f({\bm{x}}^{k,i-1})\right],\quad\text{and}
Hk,i=𝒙i2F(𝒙k,i1)=𝒙i2f(𝒙k,i1).\displaystyle H^{k,i}=\nabla_{{\bm{x}}_{i}}^{2}F({\bm{x}}^{k,i-1})=\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}).

Note that Hk,iH^{k,i} does not depend explicitly on 𝒳\mathcal{X} and is therefore a deterministic quantity for known 𝒙{\bm{x}}. The component-wise minimizer 𝒙ik+1{\bm{x}}_{i}^{k+1} of FF can be calculated explicitly as the Newton update

𝒙ik+1=𝒙ik(Hk,i)1𝒈k,i.{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-(H^{k,i})^{-1}{\bm{g}}^{k,i}.

Although both the design variable and the functions appearing in Problem (3) are deterministic, any gradient based optimization procedure, such as the ALS method described above, requires the approximation of an expectation at every iterate, the computational effort of which is considerable. This suggests the use of stochastic gradient sampling algorithms that efficiently incorporate the sampling procedure into the optimization iteration, while exploiting data sparsity.

The stochastic gradient descent method [27] addresses the cost of approximating the expectation by computing descent directions from small sampled batches of function values and gradients at each step of the iteration. To accommodate the noisy gradients, the step size is reduced at a predetermined rate. For the stochastic alternating least squares (SALS) method (Algorithm 2), we determine the sample at the beginning of the kk-th block iteration. For a batch 𝓧=(𝒳1,,𝒳m)\bm{\mathcal{X}}=(\mathcal{X}^{1},\ldots,\mathcal{X}^{m}) of mm iid random samples of 𝒳\mathcal{X}, we write the component-wise batch average f~\tilde{f} of ff as

f~(𝒙;𝓧)=1ml=1mf(𝒙;𝒳l).\tilde{f}({\bm{x}};\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}};\mathcal{X}^{l}).

As before, we can compute the component-wise minimizer

(12) \hat{\bm{x}}_{i}^{k+1}=\operatorname*{argmin}_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}})

at the ii-th subiteration by means of the Newton step. To this end we express f~\tilde{f} in terms of its second order Taylor expansion about the current iterate 𝒙k,i1{\bm{x}}^{k,i-1}, i.e.

f~(𝒙1k+1,,𝒙i1k+1,𝒙i+𝒑,𝒙i+1k,,𝒙pk;𝓧)=f~k,i+(𝒈~k,i)T𝒑+12𝒑THk,i𝒑,with\displaystyle\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}+{\bm{p}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}})=\tilde{f}^{k,i}+(\tilde{\bm{g}}^{k,i})^{T}{\bm{p}}+\frac{1}{2}{\bm{p}}^{T}H^{k,i}{\bm{p}},\quad\text{with}
f~k,i(𝓧)=1ml=1mf(𝒙k,i1;𝒳l),and𝒈~k,i(𝓧)=1ml=1m𝒙if(𝒙k,i1;𝒳l).\displaystyle\tilde{f}^{k,i}(\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}}^{k,i-1};\mathcal{X}^{l}),\quad\text{and}\quad\tilde{\bm{g}}^{k,i}(\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}\nabla_{{\bm{x}}_{i}}f({\bm{x}}^{k,i-1};\mathcal{X}^{l}).

The sample minimizer is thus

𝒙^ik+1=𝒙ik(Hk,i)1𝒈~k,i.\hat{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i}.

To mitigate the effects of noise on the estimate, especially during later iterations, we modify the update by introducing a variable stepsize parameter αk,i>0\alpha^{k,i}>0, so that

(13) 𝒙ik+1=𝒙ikαk,i(Hk,i)1𝒈~k,i.{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\alpha^{k,i}\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i}.
Algorithm 2 Stochastic Alternating Least Squares Algorithm
1:  Initial guess 𝒙1{\bm{x}^{1}}
2:  for k=1,2,k=1,2,\ldots do
3:     Generate a random sample 𝓧k=[𝒳k,1,,𝒳k,mk]\bm{\mathcal{X}}^{k}=[\mathcal{X}^{k,1},\ldots,\mathcal{X}^{k,m_{k}}]
4:     for i=1,,pi=1,\ldots,p do
5:        Compute sample gradient 𝒈~k,i\tilde{\bm{g}}^{k,i} and Hessian Hk,iH^{k,i}
6:        Compute step size αk,i\alpha^{k,i}
7:        Update ii-th block 𝒙ik+1=𝒙ikαk,i(Hk,i)1𝒈~k,i{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\alpha^{k,i}\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i} so that
𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
8:     end for
9:     𝒙k+1=𝒙k,p{\bm{x}}^{k+1}={\bm{x}}^{k,p}
10:  end for

It is well known (see e.g. [7]) that a stepsize αk,i\alpha^{k,i} that decreases at the rate of O(1k)O\left(\frac{1}{k}\right) as kk\rightarrow\infty leads to an optimal convergence rate for stochastic gradient methods. Here, we specify that the stepsize takes the form αk,i=ck,ik\alpha^{k,i}=\frac{c^{k,i}}{k}, where ck,ic^{k,i} is a strictly positive, bounded sequence, i.e. there are constants 0<cmin<cmax20<c_{\min}<c_{\max}\leq 2 such that

0<cminck,icmax,for all k=1,2,,i=1,2,,p.0<c_{\min}\leq c^{k,i}\leq c_{\max},\qquad\text{for all }k=1,2,\ldots,\ i=1,2,\ldots,p.
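
Combining the block update above with the damped step (13) and this stepsize rule yields the following sketch of Algorithm 2 (illustrative only; it assumes the hypothetical als_component_update helper sketched earlier in this subsection, uses the fact that the sampled gradient (7) is affine in the data so that the batch enters only through its mean, and clips the first step to 1, which is our choice rather than part of the algorithm's specification).

```python
import numpy as np

def sals(sample_batches, shape, r, lam, c=1.0, seed=0):
    """Sketch of Algorithm 2 for a stream of sample batches (each batch is a list of tensors)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, r)) for n in shape]          # initial guess x^1
    for k, batch in enumerate(sample_batches, start=1):
        X_bar = sum(batch) / len(batch)        # the sampled gradient g~ only sees the batch mean
        alpha = min(c / k, 1.0)                # stepsize alpha^{k,i} = c^{k,i}/k, clipped to 1
        for i in range(len(shape)):
            A_hat = als_component_update(X_bar, factors, i, lam)    # block minimizer of f~ in (12)
            factors[i] = (1 - alpha) * factors[i] + alpha * A_hat   # damped Newton update (13)
    return factors
```

The update is written in the convex-combination form {\bm{x}}_{i}^{k+1}=(1-\alpha^{k,i}){\bm{x}}_{i}^{k}+\alpha^{k,i}\hat{\bm{x}}_{i}^{k+1}, which is also the form used in the proof of Lemma 3.15 below.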

One of the difficulties in analyzing the convergence behavior of stochastic sampling methods arises from the fact that the iterates 𝒙k,i{\bm{x}}^{k,i} constitute a stochastic process generated by a stochastic algorithm that depends on the realizations of the sample batches 𝓧1,𝓧2,𝓧k1\bm{\mathcal{X}}^{1},\bm{\mathcal{X}}^{2},\ldots\bm{\mathcal{X}}^{k-1}. Consequently, even deterministic functions, such as FF, become stochastic when evaluated at these points, e.g.​ F(𝒙k,i)F({\bm{x}}^{k,i}) is a random quantity. Moreover, successive iterates are statistically dependent, since later iterates are updates of earlier ones. Specifically, let k=σ(𝓧1,,𝓧k)\mathcal{F}^{k}=\sigma(\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k}) be the σ\sigma-algebra generated by the first kk sample batches. At the end of the ii-th subiteration in the kk-th block, the first ii components of 𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}^{k+1}_{1},...,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},...,{\bm{x}}_{p}^{k}) have been updated using 𝓧k\bm{\mathcal{X}}^{k} and are thus k\mathcal{F}^{k}-measurable, whereas the remaining components 𝒙i+1k,,𝒙pk{\bm{x}}_{i+1}^{k},...,{\bm{x}}_{p}^{k} depend only on 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},...,\bm{\mathcal{X}}^{k-1} and are therefore only k1\mathcal{F}^{k-1}-measurable. To separate the effect of 𝓧k\bm{\mathcal{X}}^{k} from that of 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1} on an k\mathcal{F}^{k}-measurable random variable hh, it is often useful to invoke the law of total expectation, i.e.

𝔼𝓧1,,𝓧k1,𝓧k[h]=𝔼𝓧1,,𝓧k1[𝔼𝓧k[h|k1]].\mathbb{E}_{\bm{\mathcal{X}}^{1},...,\bm{\mathcal{X}}^{k-1},\bm{\mathcal{X}}^{k}}[h]=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\mathbb{E}_{\bm{\mathcal{X}}^{k}}\left[h|\mathcal{F}^{k-1}\right]\right].

3 Convergence Analysis

Here we discuss the convergence of Algorithm 2. Since Problem (3) is nonconvex, there can in general be no unique global minimizer. Theorem 3.25 establishes the convergence of the SALS iterates {\bm{x}}^{k,i} to a local minimizer in expectation. The special structure of the CP problem nevertheless allows us to forgo most of the standard assumptions made in the SGD framework; in fact, we only assume boundedness of the sampled data \mathcal{X}. Indeed, we show that the regularity of the component-wise problem, such as the Lipschitz constant of the gradient, depends only on the norm of the current iterate {\bm{x}}^{k,i}, which is in turn bounded by \|\mathcal{X}\| (Lemma 3.15).

3.1 Regularity Estimates

In the following we exploit the multinomial structure of the cost functional to establish component-wise regularity estimates, such as local Lipschitz continuity of the sampled gradient 𝒙if(𝒙,𝒳)\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X}) and of the Newton step (𝒙i2f(𝒙))1𝒙if(𝒙,𝒳)-\left(\nabla^{2}_{{\bm{x}}_{i}}f({\bm{x}})\right)^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X}), as well as bounded invertibility of the Hessian 𝒙i2f(𝒙)\nabla^{2}_{{\bm{x}}_{i}}f({\bm{x}}).

In the proofs below, we will often need to bound a sum of the form i=1p𝒙iq\sum_{i=1}^{p}\|{\bm{x}}_{i}\|^{q} in terms of 𝒙\|{\bm{x}}\|, where 𝒙=(𝒙1,,𝒙p){\bm{x}}=({\bm{x}}_{1},...,{\bm{x}}_{p}). For this we use the fact that for finite dimensional vector spaces all norms are equivalent. In particular, for any norm subscripts ss and tt, there exist dimension-dependent constants 0<cs,t<ct,s<0<c_{s,t}<c_{t,s}<\infty so that

(14) c_{s,t}\|{\bm{x}}\|_{s}\leq\|{\bm{x}}\|_{t}\leq c_{t,s}\|{\bm{x}}\|_{s}.

Using norm equivalence with s=2s=2 and t=qt=q, we obtain

(15) i=1p𝒙iqi=1p(c2,qi)q𝒙iqq(c¯2,q)q𝒙qq(c¯2,qcq,2)q𝒙q,\sum_{i=1}^{p}\|{\bm{x}}_{i}\|^{q}\leq\sum_{i=1}^{p}(c_{2,q}^{i})^{q}\|{\bm{x}}_{i}\|_{q}^{q}\leq(\bar{c}_{2,q})^{q}\|{\bm{x}}\|_{q}^{q}\leq\left(\frac{\bar{c}_{2,q}}{c_{q,2}}\right)^{q}\|{\bm{x}}\|^{q},

where c¯2,q=maxi=1,,pc2,qi\displaystyle\bar{c}_{2,q}=\max_{i=1,...,p}c_{2,q}^{i}, while c2,qic_{2,q}^{i} and cq,2c_{q,2} are the appropriate constants in the norm equivalence relation above for 𝒙i{\bm{x}}_{i} and 𝒙{\bm{x}} respectively.

The estimates established in this section all follow from the local Lipschitz continuity of the mapping 𝒙iΘi{\bm{x}}_{i}^{*}\mapsto\Theta_{i}, shown in Lemma 3.1 below.

Lemma 3.1.

Let Θi\Theta_{i} and Θ~i\tilde{\Theta}_{i} be matrices defined in terms of 𝐱i{\bm{x}}_{i^{*}} and 𝐱~i\tilde{\bm{x}}_{i^{*}} respectively via (6). Then

(16) ΘiΘ~iPΘ(𝒙i,𝒙~i)𝒙i𝒙~i,with\displaystyle\|\Theta_{i}-\tilde{\Theta}_{i}\|\leq P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,\quad\text{with}
PΘ(𝒙i,𝒙~i)=C(𝒙ip2+𝒙~ip2)\displaystyle P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=C\left(\|{\bm{x}}_{i^{*}}\|^{p-2}+\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}\right)

for some constant 0C<0\leq C<\infty.

Proof 3.2.

First note that for any A\in\mathbb{R}^{n_{i}\times r},B\in\mathbb{R}^{n_{j}\times r}, properties of Kronecker products, the Cauchy–Schwarz inequality, and Inequality (15) imply that

AB2\displaystyle\|A\odot B\|^{2} =[𝒂1𝒃1,,𝒂r𝒃r]2=j=1r𝒂j𝒃j2\displaystyle=\|[{\bm{a}}_{1}\otimes\bm{b}_{1},\ldots,{\bm{a}}_{r}\otimes\bm{b}_{r}]\|^{2}=\sum_{j=1}^{r}\|\bm{a}_{j}\otimes\bm{b}_{j}\|^{2}
(17) =j=1r𝒂j2𝒃j2(j=1r𝒂j4)12(j=1r𝒃j4)12(c¯2,4c4,2)2A2B2.\displaystyle=\sum_{j=1}^{r}\|\bm{a}_{j}\|^{2}\|\bm{b}_{j}\|^{2}\leq\left(\sum_{j=1}^{r}\|\bm{a}_{j}\|^{4}\right)^{\frac{1}{2}}\left(\sum_{j=1}^{r}\|\bm{b}_{j}\|^{4}\right)^{\frac{1}{2}}\leq\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{2}\|A\|^{2}\|B\|^{2}.

Let 𝐤=(p,p1,,i+1,i1,,1)\bm{k}=(p,p-1,\ldots,i+1,i-1,\ldots,1) denote the indices in the Khatri-Rao product (6). Addition and subtraction of the appropriate cross-terms provides

ΘiΘ~i=j=1p1Akjj=1p1A~kj=j=1p1j=1j1A~kj(AkjA~kj)j=j+1p1Akj.\Theta_{i}-\tilde{\Theta}_{i}=\bigodot_{j=1}^{p-1}A_{k_{j}}-\bigodot_{j=1}^{p-1}\tilde{A}_{k_{j}}=\sum_{j=1}^{p-1}\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}.

By (17), each term in the sum above can be bounded in norm by

j=1j1A~kj(AkjA~kj)j=j+1p1Akj\displaystyle\left\|\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}\right\|
\displaystyle\leq (c¯2,4c4,2)p(j=1j1A~kj)(j=j+1p1Akj)AkjA~kj\displaystyle\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{A}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|A_{k_{j^{\prime}}}\|\right)\|A_{k_{j}}-\tilde{A}_{k_{j}}\|
=\displaystyle= (c¯2,4c4,2)p(j=1j1𝒙~kj)(j=j+1p1𝒙kj)𝒙kj𝒙~kj.\displaystyle\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|\right)\|{\bm{x}}_{k_{j}}-\tilde{\bm{x}}_{k_{j}}\|.

By the arithmetic-geometric mean inequality, and the bound in (15), we have

(j=1j1𝒙~kj)(j=j+1p1𝒙kj)\displaystyle\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|\right) 1(p2)(j=1j1𝒙~kjp2+j=j+1p1𝒙kjp2)\displaystyle\leq\frac{1}{(p-2)}\left(\sum_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}+\sum_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}\right)
1p2(j=1p1𝒙~kjp2+j=1p1𝒙kjp2)\displaystyle\leq\frac{1}{p-2}\left(\sum_{j^{\prime}=1}^{p-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}+\sum_{j^{\prime}=1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}\right)
1p2(c¯2,p2cp2,2)p2(𝒙~ip2+𝒙ip2).\displaystyle\leq\frac{1}{p-2}\left(\frac{\bar{c}_{2,p-2}}{c_{p-2,2}}\right)^{p-2}\left(\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}+\|{\bm{x}}_{i^{*}}\|^{p-2}\right).

Since the upper bound

P1(𝒙i,𝒙~i)=1p2(c¯2,4c4,2)p(c¯2,p2cp2,2)p2(𝒙~ip2+𝒙ip2)P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=\frac{1}{p-2}\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\frac{\bar{c}_{2,p-2}}{c_{p-2,2}}\right)^{p-2}\left(\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}+\|{\bm{x}}_{i^{*}}\|^{p-2}\right)

is independent of jj, it then follows that

\|\Theta_{i}-\tilde{\Theta}_{i}\| \leq\sum_{j=1}^{p-1}\left\|\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}\right\|\leq P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\sum_{j=1}^{p-1}\|{\bm{x}}_{k_{j}}-\tilde{\bm{x}}_{k_{j}}\|
\leq P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,

which establishes (16) with PΘ(𝐱i,𝐱~i)=(c¯2,1c1,2)P1(𝐱i,𝐱~i)P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|).

In the special case when Θ~i=0\tilde{\Theta}_{i}=0, Inequality (16) reduces to

(18) ΘiCΘ𝒙ip1,with CΘ independent of 𝒙.\|\Theta_{i}\|\leq C_{\Theta}\|{\bm{x}}_{i^{*}}\|^{p-1},\qquad\text{with $C_{\Theta}$ independent of ${\bm{x}}$}.
Corollary 3.3.

For Θi\Theta_{i} and Θ~i\tilde{\Theta}_{i} defined in (6),

(19) ΘiTΘiΘ~iTΘ~iPΘTΘ(𝒙i,𝒙~i)𝒙i𝒙~i,where\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|\leq P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,\qquad\text{where}
(20) PΘTΘ(𝒙i,𝒙~i)=CΘTΘmax{𝒙ip1,𝒙~ip1}(𝒙ip2+𝒙~ip2),P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=C_{\Theta^{T}\Theta}\max\{\|{\bm{x}}_{i^{*}}\|^{p-1}\!,\|\tilde{\bm{x}}_{i^{*}}\|^{p-1}\}\left(\|{\bm{x}}_{i^{*}}\|^{p-2}\!+\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}\right),

for some constant 0CΘTΘ<0\leq C_{\Theta^{T}\Theta}<\infty independent of 𝐱{\bm{x}} and 𝐱~\tilde{\bm{x}}.

Proof 3.4.

By the equivalence of the induced Euclidean and the Frobenius norms,

ΘiTΘiΘ~iTΘ~i\displaystyle\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\| cF,2ΘiTΘiΘ~iTΘ~i2=cF,2ΘiT(ΘiΘ~i)+(ΘiΘ~i)TΘ~i2\displaystyle\leq c_{F,2}\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|_{2}=c_{F,2}\|\Theta_{i}^{T}(\Theta_{i}-\tilde{\Theta}_{i})+(\Theta_{i}-\tilde{\Theta}_{i})^{T}\tilde{\Theta}_{i}\|_{2}
cF,2(c2,F)2(ΘiTΘiΘ~i+ΘiΘ~iΘ~i)\displaystyle\leq\frac{c_{F,2}}{(c_{2,F})^{2}}\left(\|\Theta_{i}^{T}\|\|\Theta_{i}-\tilde{\Theta}_{i}\|+\|\Theta_{i}-\tilde{\Theta}_{i}\|\|\tilde{\Theta}_{i}\|\right)
2cF,2(c2,F)2max{Θi,Θ~i}ΘiΘ~i\displaystyle\leq\frac{2c_{F,2}}{(c_{2,F})^{2}}\max\{\|\Theta_{i}\|,\|\tilde{\Theta}_{i}\|\}\|\Theta_{i}-\tilde{\Theta}_{i}\|
2cF,2(c2,F)2max{𝒙ip1,𝒙~ip1}PΘ(𝒙i,𝒙~i)𝒙i𝒙~i.\displaystyle\leq\frac{2c_{F,2}}{(c_{2,F})^{2}}\max\{\|{\bm{x}}_{i^{*}}\|^{p-1},\|\tilde{\bm{x}}_{i^{*}}\|^{p-1}\}P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Letting Θ~i=0\tilde{\Theta}_{i}=0 in (19), yields the bound

(21) ΘiTΘiCΘTΘ𝒙i2(p1).\|\Theta_{i}^{T}\Theta_{i}\|\leq C_{\Theta^{T}\Theta}\|{\bm{x}}_{i^{*}}\|^{2(p-1)}.

As a first consequence of Lemma 3.1 and Corollary 3.3, we can obtain an explicit form for the component-wise Lipschitz constant of 𝒙if(𝒙)\nabla_{{\bm{x}}_{i}}f({\bm{x}}).

Lemma 3.5.

For any 𝐱,𝐱~rn{\bm{x}},\tilde{\bm{x}}\in\mathbb{R}^{rn}, there exists PL=PL(𝐱,𝐱~i,𝒳)P_{L}=P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|), given by (23), so that

(22) 𝒙if(𝒙;𝒳)𝒙if(𝒙~;𝒳)PL(𝒙,𝒙~i,𝒳)𝒙𝒙~.\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}-\tilde{\bm{x}}\|.

Proof 3.6.

Let \Theta_{i} and \tilde{\Theta}_{i} be constructed from {\bm{x}}_{i^{*}} and \tilde{\bm{x}}_{i^{*}} respectively via (6). Using (7), the difference in sampled component-wise gradients is given by

\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})
=\;-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}\right){\bm{x}}_{i}-\left((\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}\right)\tilde{\bm{x}}_{i}
=\;-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}\right)({\bm{x}}_{i}-\tilde{\bm{x}}_{i})
\quad+\left((\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i})\otimes I_{n_{i}}\right){\bm{x}}_{i}.

Since the singular values of a Kronecker product are formed from products of singular values of the constituent matrices, the matrix norm BI2=B2B\|B\otimes I\|_{2}=\|B\|_{2}\leq\|B\| for any matrix BB. In light of inequalities (16) and (19), we therefore have

\left\|-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i})\otimes I_{n_{i}}\right)\bm{x}_{i}\right\|
\displaystyle\leq\; ΘiΘ~i2𝒳+ΘiTΘiΘ~iTΘ~i2𝒙i\displaystyle\|\Theta_{i}-\tilde{\Theta}_{i}\|_{2}\|\mathcal{X}\|+\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|_{2}\|\bm{x}_{i}\|
\displaystyle\leq\; (PΘ(𝒙i,𝒙~i)𝒳+PΘTΘ(𝒙i,𝒙~i)𝒙i)𝒙i𝒙~i.\displaystyle\big{(}P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|\mathcal{X}\|+P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}\|\big{)}\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Similarly, Inequality (21) yields

(Θ~iTΘ~i+λIr)Ini(𝒙i𝒙~i)Θ~iTΘ~i+λIr2𝒙i𝒙~i\displaystyle\|(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}({\bm{x}}_{i}-\tilde{\bm{x}}_{i})\|\leq\|\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r}\|_{2}\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|
\displaystyle\leq\; (Θ~iTΘ~i+λ)𝒙i𝒙~i(CΘTΘ𝒙~i2(p1)+λ)𝒙i𝒙~i.\displaystyle\big{(}\|\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|+\lambda\big{)}\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|\leq\left(C_{\Theta^{T}\Theta}\|\tilde{\bm{x}}_{i^{*}}\|^{2(p-1)}+\lambda\right)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Hence, by norm equivalence (15) with s=1s=1 and t=Ft=F, the bound (22) follows with

(23) PL(𝒙,𝒙~i,𝒳)=(c¯F1c1F)max{PLi(𝒙,𝒙~i,𝒳),PLi(𝒙~i)},\displaystyle P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=\left(\frac{\bar{c}_{F1}}{c_{1F}}\right)\max\left\{P_{L_{i^{*}}}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|),P_{L_{i}}(\|\tilde{\bm{x}}_{i^{*}}\|)\right\},
withPLi(𝒙~i)=CΘTΘ𝒙~i2(p1)+λ\displaystyle\text{with}\quad P_{L_{i}}(\|\tilde{\bm{x}}_{i^{*}}\|)=C_{\Theta^{T}\Theta}\|\tilde{\bm{x}}_{i^{*}}\|^{2(p-1)}+\lambda
andPLi(𝒙,𝒙~i,𝒳)=PΘ(𝒙i,𝒙~i)𝒳+PΘTΘ(𝒙i,𝒙~i)𝒙i.\displaystyle\text{and}\quad P_{L_{i^{*}}}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|\mathcal{X}\|\!+\!P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}\|.

The following corollary gives a simplified upper bound in the special case when {\bm{x}} and \tilde{\bm{x}} differ only in their i-th block components.

Corollary 3.7.

For any {\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} and \tilde{\bm{x}}=(\tilde{\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} differing only in their i-th block, we have

(24) \|\nabla_{{\bm{x}}_{i}}f({\bm{x}}_{i},{\bm{x}}_{i^{*}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}}_{i},{\bm{x}}_{i^{*}};\mathcal{X})\|\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Proof 3.8.

Since 𝐱i=𝐱~i{\bm{x}}_{i^{*}}=\tilde{\bm{x}}_{i^{*}}, estimate (22) collapses to

(ΘiTΘi+λIr)Ini(𝒙i𝒙~i)PLi(𝒙i)𝒙i𝒙~i.\|(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}({\bm{x}}_{i}-\tilde{\bm{x}}_{i})\|\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

We may also bound the norm of the component-wise gradient in terms of the norm of 𝒙=(𝒙i,𝒙i){\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}}) and the data 𝒳\mathcal{X}, as shown in the following corollary.

Corollary 3.9.

For any 𝐱=(𝐱i,𝐱i)rn{\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn},

(25) \|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|\leq P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)\|{\bm{x}}\|,

where PLP_{L} is given by (23).

Proof 3.10.

To get inequality (25) set 𝐱~=𝟎\tilde{\bm{x}}=\bm{0} in (22) and note that 𝐱if(𝟎;𝒳)=𝟎{\nabla_{{\bm{x}}_{i}}f(\bm{0};\mathcal{X})=\bm{0}}.

Corollary 3.3 also implies the following bounds on the component-wise Hessian.

Corollary 3.11.

For any 𝐱=(𝐱i,𝐱i)rn{\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} and 𝐯irni{\bm{v}}_{i}\in\mathbb{R}^{rn_{i}},

(26) λ𝒗i2𝒗iT𝒙i2f(𝒙)𝒗iPLi(𝒙i)𝒗i2.\lambda\|\bm{v}_{i}\|^{2}\leq\bm{v}_{i}^{T}\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})\bm{v}_{i}\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|\bm{v}_{i}\|^{2}.

Proof 3.12.

This follows directly from the fact that \Theta_{i}^{T}\Theta_{i} is positive semi-definite, so that every eigenvalue \mu of \nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})=(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}} satisfies

\lambda\leq\mu\leq\|\Theta_{i}^{T}\Theta_{i}\|_{2}+\lambda\leq\|\Theta_{i}^{T}\Theta_{i}\|+\lambda\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|).
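
The eigenvalue bounds used in this proof are easy to confirm numerically on a small random instance, as the following self-contained check illustrates (illustrative only; the dimensions and seed are arbitrary).

```python
import numpy as np

# Check the eigenvalue bounds behind (26) for a random third-order instance with i = 1.
rng = np.random.default_rng(1)
lam, r, n1 = 0.1, 3, 4
A2, A3 = rng.standard_normal((5, r)), rng.standard_normal((6, r))
Theta1 = np.stack([np.kron(A3[:, j], A2[:, j]) for j in range(r)], axis=1)   # Theta_1 = A_3 ⊙ A_2
H = np.kron(Theta1.T @ Theta1 + lam * np.eye(r), np.eye(n1))                 # Hessian (8)
eigs = np.linalg.eigvalsh(H)
upper = np.linalg.norm(Theta1.T @ Theta1, 2) + lam
print(eigs.min() >= lam - 1e-12, eigs.max() <= upper + 1e-12)                # True True
```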

Finally, we establish the local Lipschitz continuity of the Newton step mapping 𝒙(𝒙i2f(𝒙))1𝒙if(𝒙;𝒳){\bm{x}}\mapsto(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X}).

Lemma 3.13.

For any 𝐱,𝐲rn{\bm{x}},\bm{y}\in\mathbb{R}^{rn}, we have

(27) (𝒙i2f(𝒙))1𝒙if(𝒙;𝒳)(𝒙i2f(𝒙~))1𝒙if(𝒙~;𝒳)PN(𝒙,𝒙~i,𝒳)𝒙𝒙~,\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\\ \leq P_{N}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}-\tilde{\bm{x}}\|,

where PNP_{N} is given by (28).

Proof 3.14.

Let 𝐱,𝐱~rn{\bm{x}},\tilde{\bm{x}}\in\mathbb{R}^{rn}. By adding and subtracting a cross-term, we obtain

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|
\leq\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|
+\|(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|.

By the second resolvent identity (see e.g. Theorem 4.8.2. in [16]), we have

(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1} =\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})^{-1}-(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})^{-1}\right)\otimes I_{n_{i}}
=\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})^{-1}(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}-\Theta_{i}^{T}\Theta_{i})(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})^{-1}\right)\otimes I_{n_{i}},

so that, by virtue of (19), (25), and (26),

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|
\leq\frac{1}{\lambda^{2}}\|{\bm{x}}\|P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Moreover, (22) and (26) imply

(𝒙i2f(𝒙~))12𝒙if(𝒙;𝒳)𝒙if(𝒙~;𝒳)1λPL(𝒙,𝒙~i,𝒳)𝒙i𝒙~i.\displaystyle\|(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq\frac{1}{\lambda}P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Combining these estimates gives

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq P_{N}\|{\bm{x}}-\tilde{\bm{x}}\|,\quad\text{with}
(28) PN(𝒙,𝒙~i,𝒳)=(c¯2,1c1,2)max{P2,P3},\begin{split}P_{N}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)\max\left\{P_{2},P_{3}\right\},\end{split}

where c¯2,1\bar{c}_{2,1} and c1,2c_{1,2} are appropriate norm equivalence constants,

P_{2}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|) =\frac{1}{\lambda^{2}}\|{\bm{x}}\|P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|),\qquad\text{and}
P3(𝒙,𝒙~i,𝒳)\displaystyle P_{3}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|) =1λPL(𝒙,𝒙~i,𝒳).\displaystyle=\frac{1}{\lambda}P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|).

3.2 Boundedness of the Data

In this section we bound the norm of the component iterates {\bm{x}}_{i}^{k} in terms of the norm of the initial guess and the maximum value of the sequence of sample averages of the norms of \mathcal{X}.

Lemma 3.15.

The iterates 𝐱k,i{\bm{x}}^{k,i} generated by Algorithm 2 satisfy

\|{\bm{x}}_{i}^{k+1}\|\leq\max\left\{\|{\bm{x}}_{i}^{1}\|,R(\bm{\mathcal{X}}^{k}),\ldots,R(\bm{\mathcal{X}}^{1})\right\}\qquad\text{a.s. on }\Omega

for each k=1,2k=1,2\ldots, and i=1,,pi=1,\ldots,p, where R(𝓧k)R(\bm{\mathcal{X}}^{k}) is given by (29).

Proof 3.16.

Note that the component-wise minimizer 𝐱^ik+1\hat{\bm{x}}_{i}^{k+1} given in (12) satisfies

𝒙^ik+12+j=1i1𝒙jk+12+j=i+1p𝒙jk2\displaystyle\|\hat{\bm{x}}_{i}^{k+1}\|^{2}+\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k}\|^{2}
\displaystyle\leq 2λf~(𝒙1k+1,,𝒙i1k+1,𝒙^ik+1,𝒙i+1k,,𝒙pk;𝓧k)\displaystyle\frac{2}{\lambda}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},\hat{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}}^{k}) (by (5))
\displaystyle\leq 2λf~(𝒙1k+1,,𝒙i1k+1,𝟎,𝒙i+1k,,𝒙pk;𝓧k)\displaystyle\frac{2}{\lambda}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{0}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}}^{k}) (by optimality)
=\displaystyle= 1λ(1mkl=1mk𝒳k,l2)+j=1i1𝒙jk+12+j=i+1p𝒙jk2,\displaystyle\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)+\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k}\|^{2},

so that

(29) 𝒙^ik+11λ(1mkl=1mk𝒳k,l2)=:R(𝓧k).\|\hat{\bm{x}}_{i}^{k+1}\|\leq\sqrt{\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)}=:R(\bm{\mathcal{X}}^{k}).

To bound the iterates 𝐱ik+1{\bm{x}}_{i}^{k+1} note that Equation (13) can be rewritten as the convex combination

{\bm{x}}_{i}^{k+1}=(1-\alpha^{k,i}){\bm{x}}_{i}^{k}+\alpha^{k,i}\hat{\bm{x}}_{i}^{k+1},

which, by virtue of the stepsize bounds 0αk,i20\leq\alpha^{k,i}\leq 2, implies

\|{\bm{x}}_{i}^{k+1}\| \leq|1-\alpha^{k,i}|\|{\bm{x}}_{i}^{k}\|+|\alpha^{k,i}|\|\hat{\bm{x}}_{i}^{k+1}\|\leq\max\{\|{\bm{x}}_{i}^{k}\|,R(\bm{\mathcal{X}}^{k})\}(|1-\alpha^{k,i}|+|\alpha^{k,i}|)
max{𝒙ik,R(𝓧k)}max{𝒙ik1,R(𝓧k),R(𝓧k1)}\displaystyle\leq\max\{\|{\bm{x}}_{i}^{k}\|,R(\bm{\mathcal{X}}^{k})\}\leq\max\left\{\|{\bm{x}}_{i}^{k-1}\|,R(\bm{\mathcal{X}}^{k}),R(\bm{\mathcal{X}}^{k-1})\right\}\leq\ldots
max{𝒙i1,R(𝓧k),R(𝓧k1),,R(𝓧1)}.\displaystyle\leq\max\left\{\|{\bm{x}}_{i}^{1}\|,R(\bm{\mathcal{X}}^{k}),R(\bm{\mathcal{X}}^{k-1}),\ldots,R(\bm{\mathcal{X}}^{1})\right\}.

Since the regularity estimates derived above all involve powers of 𝒙\|{\bm{x}}\| and of 𝓧k\|\bm{\mathcal{X}}^{k}\|, Lemma 3.15 suggests that a bound on the data 𝒳\mathcal{X} is sufficient to guarantee the regularity of the cost functional, gradient, and Newton steps that are necessary to show convergence. In the following we assume such a bound and pursue its consequences.

Assumption 3.16 (Bounded data).

There is a constant 0<M<\infty so that

(30) 𝒳M,a.s. on Ω.\|\mathcal{X}\|\leq M,\ \ a.s.\text{ on }\Omega.
Remark 3.17.

This assumption might conceivably be weakened to one pertaining to the statistical distribution of the maxima R(𝓧1),R(𝓧2),,R(𝓧k)R(\bm{\mathcal{X}}^{1}),R(\bm{\mathcal{X}}^{2}),\ldots,R(\bm{\mathcal{X}}^{k}). Specifically, letting rk=maxl=1,,kR(𝓧l)r_{k}=\max_{l=1,\ldots,k}R(\bm{\mathcal{X}}^{l}), it can be shown under appropriate conditions on the density of R(𝒳k)R(\mathcal{X}^{k}), that rkr_{k} converges in distribution to a random variable with known extreme value density. The analysis below will hold if it can be guaranteed that the limiting distribution has bounded moments of sufficiently high order. This possibility will be pursued in future work.

An immediate consequence of Assumption 3.16 is the existence of a uniform bound on the radius R(\bm{\mathcal{X}}) and hence on the iterates {\bm{x}}^{k,i}.

Corollary 3.18.

Given Assumption 3.16, there exist finite, non-negative constants M_{R}, M_{{\bm{x}}_{i}}, and M_{{\bm{x}}}, independent of {\bm{x}}^{k} and of \mathcal{X}^{k}, so that for all k=1,2,\ldots,

(31) R(\bm{\mathcal{X}}^{k})\leq M_{R}\qquad\text{a.s. on }\Omega,
(32) \|{\bm{x}}_{i}^{k}\|\leq M_{{\bm{x}}_{i}}\qquad\text{a.s. on }\Omega,
(33) \|{\bm{x}}^{k}\|\leq M_{{\bm{x}}}\qquad\text{a.s. on }\Omega.

Proof 3.19.

By (29) and Assumption 3.16,

R(𝓧k)=1λ(1mkl=1mk𝒳k,l2)Mλ=:MR.\displaystyle R(\bm{\mathcal{X}}^{k})=\sqrt{\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)}\leq\frac{M}{\sqrt{\lambda}}=:M_{R}.

Lemma 3.15 then implies

𝒙ikmax{𝒙i1,MR}=:M𝒙i\|{\bm{x}}_{i}^{k}\|\leq\max\{\|{\bm{x}}_{i}^{1}\|,M_{R}\}=:M_{{\bm{x}}_{i}}

and hence

𝒙k=i=1p𝒙ik2pM𝒙i=:M𝒙.\|{\bm{x}}^{k}\|=\sqrt{\sum_{i=1}^{p}\|{\bm{x}}_{i}^{k}\|^{2}}\leq\sqrt{p}M_{{\bm{x}}_{i}}=:M_{{\bm{x}}}.

3.3 Convergence

We now consider the difference 𝜹k=𝒈~k𝒈k\bm{\delta}^{k}=\tilde{\bm{g}}^{k}-\bm{g}^{k} between the sampled and expected search directions. For the standard stochastic gradient descent algorithm, this stochastic quantity vanishes in expectation, given past information k1\mathcal{F}^{k-1}, i.e. 𝔼[𝜹k|k1]=0\mathbb{E}\left[{\bm{\delta}^{k}}|\mathcal{F}^{k-1}\right]=0, since 𝒈~k\tilde{\bm{g}}^{k} is an unbiased estimator of 𝒈k{\bm{g}}^{k}. For the stochastic alternating least squares method, this is no longer the case. Lemma 3.20 however uses the regularity of the gradient and the Hessian to establish an upper bound that decreases on the order O(1k)O(\frac{1}{k}) as kk\rightarrow\infty.

Lemma 3.20.

There is a constant MN0M_{N}\geq 0 such that for every k=1,2,k=1,2,... and i=1,,pi=1,...,p,

(34) 𝔼[(Hk,i)1𝜹k,i|k1]MNk,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\right\|\leq\frac{M_{N}}{k},

where MN=2cmaxPN(M𝐱,0,MR)PN(M𝐱,M𝐱i,MR)M𝐱M_{N}=2c_{\max}P_{N}(M_{{\bm{x}}},0,M_{R})P_{N}(M_{{\bm{x}}},M_{{\bm{x}}_{i^{*}}},M_{R})M_{{\bm{x}}}.

Proof 3.21.

Recall that the current iterate 𝐱k,i{\bm{x}}^{k,i} is statistically dependent on sampled tensors 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}, while its first ii components also depend on 𝓧k\bm{\mathcal{X}}^{k}. For the purpose of computing 𝔼[(Hk,i)1𝛅k,i|k1]\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right], we suppose that 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1} are known and write 𝐱k,i1=𝐱k,i1(𝓧k)=(𝐱1k+1(𝓧k),,𝐱i1k+1(𝓧k),𝐱ik,,𝐱pk){\bm{x}}^{k,i-1}={\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})=({\bm{x}}_{1}^{k+1}(\bm{\mathcal{X}}^{k}),\ldots,{\bm{x}}_{i-1}^{k+1}(\bm{\mathcal{X}}^{k}),{\bm{x}}_{i}^{k},\ldots,{\bm{x}}_{p}^{k}) to emphasize its dependence on 𝓧k\bm{\mathcal{X}}^{k}. Thus

𝔼[(Hk,i)1𝜹k,i|k1]=\displaystyle\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]=
Ωmk(𝒙i2f(𝒙k,i1(𝓧k)))1(𝒙if~(𝒙k,i1(𝓧k);𝓧k)𝒙iF(𝒙k,i1(𝓧k)))𝑑μ𝓧k.\displaystyle\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\left(\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}}^{k})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)d\mu_{\bm{\mathcal{X}}^{k}}.

By definition, and since 𝐱if~\nabla_{{\bm{x}}_{i}}\tilde{f} is an unbiased estimator of 𝐱iF\nabla_{{\bm{x}}_{i}}F, we have

𝒙iF(𝒙k,i1(𝓧k))\displaystyle\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})) =Ωxif(𝒙k,i1(𝓧k);𝒳)𝑑μ𝒳\displaystyle=\int_{\Omega}\nabla_{x_{i}}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\mathcal{X})d\mu_{\mathcal{X}}
=Ωmkxif~(𝒙k,i1(𝓧k);𝓧)𝑑μ𝓧.\displaystyle=\int_{\Omega^{m_{k}}}\nabla_{x_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}})d\mu_{\bm{\mathcal{X}}}.

Moreover, since 𝓧\bm{\mathcal{X}} and 𝓧k\bm{\mathcal{X}}^{k} are identically distributed,

Ωmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙iF(𝒙k,i1(𝓧k))𝑑μ𝓧k\displaystyle\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))d\mu_{\bm{\mathcal{X}}^{k}}
=\displaystyle= ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙if~(𝒙k,i1(𝓧k);𝓧)𝑑μ𝓧𝑑μ𝓧k\displaystyle\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}
=\displaystyle= ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧)))1𝒙if~(𝒙k,i1(𝓧);𝓧k)𝑑μ𝓧𝑑μ𝓧k.\displaystyle\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}.

Therefore

(35) 𝔼[(Hk,i)1𝜹k,i|k1]=ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙if~(𝒙k,i1(𝓧k);𝓧k)𝑑μ𝓧𝑑μ𝓧kΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧)))1𝒙if~(𝒙k,i1(𝓧);𝓧k)𝑑μ𝓧𝑑μ𝓧k.\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\\ =\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}\\ -\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}.

In the special case i=1, the iterate {\bm{x}}^{k,0}={\bm{x}}^{k}, and hence H^{k,1}, does not depend on \bm{\mathcal{X}}^{k}. Since \nabla_{{\bm{x}}_{1}}\tilde{f}({\bm{x}}^{k};\bm{\mathcal{X}}^{k}) is an unbiased estimator of \nabla_{{\bm{x}}_{1}}F({\bm{x}}^{k}), we have

𝔼[(Hk,1)1𝜹k,1|k1]=0.\mathbb{E}\left[(H^{k,1})^{-1}{\bm{\delta}^{k,1}}|\mathcal{F}^{k-1}\right]=0.

We now consider the case i=2,,pi=2,...,p. Using the Lipschitz continuity of the mapping 𝐱(𝐱i2f(𝐱))1𝐱if(𝐱;𝓧k){\bm{x}}\mapsto\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})\right)^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\bm{\mathcal{X}}^{k}) (Lemma 3.13), the bounds in Corollary 3.18, and Jensen’s inequality, we obtain

𝔼[(Hk,i)1𝜹k,i|k1]PN(M𝒙,M𝒙i,M)ΩmkΩmk𝒙k,i1(𝓧k)𝒙k,i1(𝓧)𝑑μ𝓧𝑑μ𝓧k,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\right\|\\ \leq P_{N}(M_{\bm{x}},M_{{\bm{x}}_{i^{*}}},M)\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\|{\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})-{\bm{x}}^{k,i-1}(\bm{\mathcal{X}})\|d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}},

the integrand of which can be bounded by

\|{\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})-{\bm{x}}^{k,i-1}(\bm{\mathcal{X}})\|\leq\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}(\bm{\mathcal{X}}^{k})-{\bm{x}}_{j}^{k+1}(\bm{\mathcal{X}})\|\\ =\sum_{j=1}^{i-1}\alpha^{k,j}\|(H^{k,j}(\bm{\mathcal{X}}^{k}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}}^{k})-(H^{k,j}(\bm{\mathcal{X}}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}})\|\\ \leq\frac{c_{\max}}{k}\sum_{j=1}^{i-1}\left(\|(H^{k,j}(\bm{\mathcal{X}}^{k}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}}^{k})\|+\|(H^{k,j}(\bm{\mathcal{X}}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}})\|\right).

The result now follows by taking expectations and using (27) with \tilde{\bm{x}}=\bm{0}, in conjunction with (31) and (33).

Lemma 3.22.

If 𝐡{\bm{h}} is k1\mathcal{F}^{k-1}-measurable and 𝔼[𝐡]<\mathbb{E}[\|\bm{h}\|]<\infty then

(36) 𝔼[𝒉,(Hk,i)1𝜹k,i]MNk𝔼[𝒉].\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\right]\leq\frac{M_{N}}{k}\mathbb{E}\left[\|\bm{h}\|\right].

Proof 3.23.

Using the law of total expectation, the k1\mathcal{F}^{k-1}-measurability of 𝐡\bm{h}, and Lemma 3.20, we have

\displaystyle\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\right]=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\middle|\mathcal{F}^{k-1}\right]\right]
\displaystyle=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\left\langle\mathbb{E}\left[{\bm{h}}\middle|\mathcal{F}^{k-1}\right],\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\middle|\mathcal{F}^{k-1}\right]\right\rangle\right]
\displaystyle\leq\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\left\|\mathbb{E}\left[{\bm{h}}\middle|\mathcal{F}^{k-1}\right]\right\|\,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\middle|\mathcal{F}^{k-1}\right]\right\|\right]
\displaystyle\leq\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\frac{M_{N}}{k}\,\|{\bm{h}}\|\right]=\frac{M_{N}}{k}\mathbb{E}\left[\|\bm{h}\|\right].

The main convergence theorem is based on the following lemma (for a proof, see, e.g., [23, Lemma A.5]).

Lemma 3.24.

Let {ak}k=1\{a_{k}\}_{k=1}^{\infty} and {bk}k=1\{b_{k}\}_{k=1}^{\infty} be any two nonnegative, real sequences so that (i) k=1ak=\sum_{k=1}^{\infty}a_{k}=\infty, (ii) k=1akbk<\sum_{k=1}^{\infty}a_{k}b_{k}<\infty, and (iii) there is a constant K>0K>0 so that |bk+1bk|Kak|b_{k+1}-b_{k}|\leq Ka_{k} for k1k\geq 1. Then limkbk=0\lim_{k\rightarrow\infty}b_{k}=0.

Theorem 3.25.

Let 𝐱k,i{\bm{x}}^{k,i} be the sequence generated by the Stochastic Alternating Least Squares (SALS) method outlined in Algorithm 2. Then

limk𝔼[𝒙iF(𝒙k,i)2]=0,i=1,,p.\lim_{k\rightarrow\infty}\mathbb{E}\left[\|\nabla_{\bm{x}_{i}}F({\bm{x}}^{k,i})\|^{2}\right]=0,\qquad i=1,...,p.

Proof 3.26.

We base the proof on Lemma 3.24 with a_{k}=\frac{1}{k} and b_{k}=\mathbb{E}\left[\|\bm{g}^{k,i}\|^{2}\right]. Clearly, Condition (i) in Lemma 3.24 is satisfied. To show that Condition (ii) holds, i.e. that \sum_{k=1}^{\infty}\frac{1}{k}\mathbb{E}\left[\|{\bm{g}}^{k,i}\|^{2}\right]<\infty for i=1,\ldots,p, we use the component-wise Taylor expansion (11) of the expected cost, centered at the iterate {\bm{x}}^{k,i-1}, together with the SALS update given in (13), to express the decrease in the expected cost as

\displaystyle F({\bm{x}}^{k,i})-F({\bm{x}}^{k,i-1})
\displaystyle=\nabla_{\bm{x}_{i}}F({\bm{x}}^{k,i-1})^{T}({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})+\frac{1}{2}({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})^{T}\nabla_{{\bm{x}}_{i}}^{2}F({\bm{x}}^{k,i-1})({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}+\frac{1}{2}(\alpha^{k,i})^{2}((H^{k,i})^{-1}\tilde{\bm{g}}^{k,i})^{T}H^{k,i}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\bm{g}^{k,i}-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i}+\frac{1}{2}(\alpha^{k,i})^{2}(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}.

Since, by (32) and Corollary 3.11,

αk,i(𝒈k,i)T(Hk,i)1𝒈k,icmink1λ+M2f𝒈k,i2,\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{g}^{k,i}\geq\frac{c_{\min}}{k}\frac{1}{\lambda+M_{\nabla^{2}f}}\|\bm{g}^{k,i}\|^{2},

where M2f=PLi(M𝐱i)M_{\nabla^{2}f}=P_{L_{i}}(M_{{\bm{x}}_{i}^{*}}), the above expression can be rearranged as

(37) 1k𝒈k,i2\displaystyle\frac{1}{k}\|{\bm{g}}^{k,i}\|^{2} λ+M2fcmin(E1k,i+E2k,i+E3k,i),with\displaystyle\leq\frac{\lambda+M_{\nabla^{2}f}}{c_{\min}}\left(E_{1}^{k,i}+E_{2}^{k,i}+E_{3}^{k,i}\right),\quad\text{with}
E1k,i=F(𝒙k,i1)F(𝒙k,i),\displaystyle E_{1}^{k,i}=F({\bm{x}}^{k,i-1})-F({\bm{x}}^{k,i}),
\displaystyle E_{2}^{k,i}=-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i},\ \text{and}
E3k,i=1k2cmax2(𝒈~k,i)T(Hk,i)1𝒈~k,i.\displaystyle E_{3}^{k,i}=\frac{1}{k^{2}}\frac{c_{\max}}{2}(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}.

Evidently, Condition (ii) of Lemma 3.24 holds as long as k=1𝔼[Ejk,i]<\sum_{k=1}^{\infty}\mathbb{E}\left[E_{j}^{k,i}\right]<\infty for j=1,2,3j=1,2,3. For a fixed K>0K>0, the first term E1k,iE_{1}^{k,i} generates a telescoping sum so that

k=1Ki=1p𝔼[E1k,i]=k=1K𝔼[F(𝒙k)]𝔼[F(𝒙k+1)]\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{p}\mathbb{E}\left[E_{1}^{k,i}\right]=\sum_{k=1}^{K}\mathbb{E}\left[F({\bm{x}}^{k})\right]-\mathbb{E}\left[F({\bm{x}}^{k+1})\right]
(38) =\displaystyle=\; 𝔼[F(𝒙1)]𝔼[F(𝒙K+1)]𝔼[F(𝒙1)]<K>0,\displaystyle\mathbb{E}\left[F({\bm{x}}^{1})\right]-\mathbb{E}\left[F({\bm{x}}^{K+1})\right]\leq\mathbb{E}\left[F({\bm{x}}^{1})\right]<\infty\qquad\forall K>0,

since F(𝐱K+1)0F({\bm{x}}^{K+1})\geq 0. Consider

E2k,i\displaystyle E_{2}^{k,i} =αk,i(𝒈k,i)T(Hk,i)1𝜹k,i\displaystyle=-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i}
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i}-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}))^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}-\alpha^{k,i}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}.

Since 𝐱iF(𝐱k)\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}) is k1\mathcal{F}^{k-1}-measurable, Lemma 3.22 implies that the expectation of the second term can be bounded as follows

\mathbb{E}\left[-\alpha^{k,i}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right]\leq\frac{M_{\nabla f}M_{N}}{k^{2}},

where Mf=PL(M𝐱,0,M)M𝐱iM_{\nabla f}=P_{L}(M_{{\bm{x}}},0,M)M_{{\bm{x}}_{i}} by Corollaries 3.9 and 3.18. The first term satisfies

αk,i(𝒙iF(𝒙k,i)𝒙iF(𝒙k))T(Hk,i)1𝜹k,i\displaystyle-\alpha^{k,i}(\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}))^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}
\displaystyle\leq\; αk,i𝒙iF(𝒙k,i)𝒙iF(𝒙k)(Hk,i)1𝜹k,i\displaystyle\alpha^{k,i}\|\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})\|\|(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\|
\displaystyle\leq\; αk,iMNML𝒙k,i𝒙k\displaystyle\alpha^{k,i}M_{N}M_{L}\|{\bm{x}}^{k,i}-{\bm{x}}^{k}\|
=\displaystyle=\; MNMLαk,ij=1i(αk,j)2(Hk,j)1𝒈~k,j2\displaystyle M_{N}M_{L}\alpha^{k,i}\sqrt{\sum_{j=1}^{i}(\alpha^{k,j})^{2}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; p(MN)2ML(cmaxk)2,\displaystyle p(M_{N})^{2}M_{L}\left(\frac{c_{\max}}{k}\right)^{2},

where ML=PL(M𝐱,M𝐱i,M)M_{L}=P_{L}(M_{\bm{x}},M_{{\bm{x}}_{i^{*}}},M) by Lemmas 3.5 and 3.18. Combining these two bounds yields

(39) k=1i=1p𝔼[E2k,i](p2(cmaxMN)2ML+pMfMN)k=11k2<.\sum_{k=1}^{\infty}\sum_{i=1}^{p}\mathbb{E}\left[E_{2}^{k,i}\right]\leq\left(p^{2}(c_{\max}M_{N})^{2}M_{L}+pM_{\nabla f}M_{N}\right)\sum_{k=1}^{\infty}\frac{1}{k^{2}}<\infty.

Finally, we use Corollaries 3.9, 3.11, and 3.18 to bound

𝔼[(𝒈~k,i)T(Hk,i)1𝒈~k,i]\displaystyle\mathbb{E}\left[(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}\right]\leq 1λ𝔼[𝒈~k,i2](Mf)2λ,\displaystyle\frac{1}{\lambda}\mathbb{E}\left[\|\tilde{\bm{g}}^{k,i}\|^{2}\right]\leq\frac{(M_{\nabla f})^{2}}{\lambda},

so that

(40) k=1i=1p𝔼[E3k,i]pcmax(Mf)22λk=11k2<.\sum_{k=1}^{\infty}\sum_{i=1}^{p}\mathbb{E}\left[E_{3}^{k,i}\right]\leq\frac{pc_{\max}(M_{\nabla f})^{2}}{2\lambda}\sum_{k=1}^{\infty}\frac{1}{k^{2}}<\infty.

By virtue of Inequality (37), the upper bounds (38), (39), and (40) now imply that Condition (ii) of Lemma 3.24 holds. It remains to show that Condition (iii) of Lemma 3.24 holds, i.e. that \mathbb{E}\left[\|{\bm{g}}^{k+1,i}\|^{2}\right]-\mathbb{E}\left[\|{\bm{g}}^{k,i}\|^{2}\right]=O(1/k) as k\rightarrow\infty for all i=1,\ldots,p. By the reverse triangle inequality and the Lipschitz continuity and boundedness of the gradient,

𝒈k+1,i2𝒈k,i2(𝒈k+1,i+𝒈k,i)(𝒈k+1,i𝒈k,i)\displaystyle\|{\bm{g}}^{k+1,i}\|^{2}-\|{\bm{g}}^{k,i}\|^{2}\leq(\|{\bm{g}}^{k+1,i}\|+\|{\bm{g}}^{k,i}\|)(\|{\bm{g}}^{k+1,i}-{\bm{g}}^{k,i}\|)
\displaystyle\leq\; 2MfML𝒙k+1,i𝒙k,i\displaystyle 2M_{\nabla f}M_{L}\|{\bm{x}}^{k+1,i}-{\bm{x}}^{k,i}\|
=\displaystyle=\; 2MfMLj=1i𝒙jk+2𝒙jk+12+j=i+1p𝒙jk+1𝒙jk2\displaystyle 2M_{\nabla f}M_{L}\sqrt{\sum_{j=1}^{i}\|{\bm{x}}_{j}^{k+2}-{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k+1}-{\bm{x}}_{j}^{k}\|^{2}}
=\displaystyle=\; 2MfMLj=1i(αk+1,j)2(Hk+1,j)1𝒈~k+1,j2+j=i+1p(αk,j)2(Hk,j)1𝒈~k,j2\displaystyle 2M_{\nabla f}M_{L}\sqrt{\sum_{j=1}^{i}\left(\alpha^{k+1,j}\right)^{2}\|(H^{k+1,j})^{-1}\tilde{\bm{g}}^{k+1,j}\|^{2}+\sum_{j=i+1}^{p}\left(\alpha^{k,j}\right)^{2}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; 2MfMLcmaxkj=1i(Hk+1,j)1𝒈~k+1,j2+j=i+1p(Hk,j)1𝒈~k,j2\displaystyle 2M_{\nabla f}M_{L}\frac{c_{\max}}{k}\sqrt{\sum_{j=1}^{i}\|(H^{k+1,j})^{-1}\tilde{\bm{g}}^{k+1,j}\|^{2}+\sum_{j=i+1}^{p}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; (2pcmaxMfMLMN)1k.\displaystyle(2pc_{\max}M_{\nabla f}M_{L}M_{N})\frac{1}{k}.

4 Numerical Experiments

In this section we detail the implementation of the SALS algorithm, discuss its computational complexity and perform a variety of numerical experiments to explore its properties and validate our theoretical results.

We invoke the matricized form (9) of the component-wise minimization problem for the sake of computational and storage efficiency. In particular, at the k-th block iteration, we compute the sample/batch tensor average \tilde{\mathcal{X}}^{k}=\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\mathcal{X}^{k,l}. Cycling through the components i=1,\ldots,p, we then compute the component-wise minimizer as the matrix \hat{A}_{i}^{k+1} satisfying the matrix equation

(41) 𝒳~(i)kΘi=A^ik+1(ΘiTΘi+λIr).\tilde{\mathcal{X}}_{(i)}^{k}\Theta_{i}=\hat{A}_{i}^{k+1}(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r}).

Finally, we update the ii-th component factor matrix AikA_{i}^{k} using the relaxation step

Aik+1=αk,iA^ik+1+(1αk,i)Aik.A_{i}^{k+1}=\alpha^{k,i}\hat{A}_{i}^{k+1}+(1-\alpha^{k,i})A_{i}^{k}.

In all our numerical experiments, we use step sizes of the form αk,i=1k\alpha^{k,i}=\frac{1}{k}, i.e. ck,i=1c^{k,i}=1 for i=1,,pi=1,\ldots,p and k=1,2,k=1,2,\ldots.
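
To make the implementation concrete, the following is a minimal NumPy sketch of one such block iteration for a three-way tensor. It is meant only to illustrate the update (41) and the relaxation step above under the stated step-size rule, not to reproduce the code used for the experiments below; the names sals_sweep, X_batch, and factors are ours.

import numpy as np

def sals_sweep(X_batch, factors, lam, k, c=1.0):
    # One block iteration (the k-th) of SALS for a 3-way tensor.
    # X_batch: list of sampled tensors X^{k,l} of identical shape.
    # factors: list [A_1, A_2, A_3] of factor matrices, each with r columns.
    X_tilde = sum(X_batch) / len(X_batch)        # batch average
    alpha = c / k                                # step size alpha^{k,i} = c^{k,i}/k
    r = factors[0].shape[1]
    # einsum strings giving the MTTKRP for modes i = 1, 2, 3
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    for i in range(3):
        others = [factors[j] for j in range(3) if j != i]
        # Theta_i^T Theta_i is the Hadamard product of the Gram matrices
        gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
        rhs = np.einsum(specs[i], X_tilde, *others)                 # matricized tensor times Theta_i
        A_hat = np.linalg.solve(gram + lam * np.eye(r), rhs.T).T    # solve Eq. (41)
        factors[i] = alpha * A_hat + (1 - alpha) * factors[i]       # relaxation step
    return factors

A full run simply calls sals_sweep once per incoming batch with k incremented each time, which mirrors the 1/k step-size rule used throughout our experiments.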

Example 4.1 (Convergence).

In our first example, we apply the SALS algorithm to a 3-dimensional random tensor of size (30,40,50) constructed from a deterministic tensor \mathbb{E}[\mathcal{X}] with a known decomposition. We generate random samples by perturbing each entry of \mathbb{E}[\mathcal{X}] with uniformly distributed noise U\sim\mathrm{UNIF}(-\delta,\delta). We measure the accuracy at each step via the relative residual \sqrt{\mathbb{E}[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}]}/\sqrt{\mathbb{E}[\|\mathcal{X}\|^{2}]}, which can be computed exactly, since \mathbb{E}[\mathcal{X}] and \mathbb{E}[\|\mathcal{X}\|^{2}] are known. Figure 1 shows that when the noise parameter \delta is zero and the step size is constant, we recover the q-linear convergence of the regularized ALS method. In the presence of noise, however, the convergence rate decreases.

(a) Semi-log plot of the error in decomposing \mathbb{E}[\mathcal{X}] using ALS, with noise parameter \delta=0 and n_{1}=30,n_{2}=40,n_{3}=50.
(b) Log-log plot of the error using SALS with constant and variable step sizes, with noise parameter \delta=1.
Figure 1: The influence of noise on the ALS iteration.
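
To indicate how such an experiment can be set up, here is a small sketch (ours, not the script behind Figure 1) that builds \mathbb{E}[\mathcal{X}] from a known rank-r decomposition, draws uniformly perturbed samples, and evaluates the residual above exactly; for entrywise \mathrm{UNIF}(-\delta,\delta) noise one has \mathbb{E}[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}]=\|\mathbb{E}[\mathcal{X}]-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}+N\delta^{2}/3, with N the number of entries.

import numpy as np

rng = np.random.default_rng(0)
shape, r, delta = (30, 40, 50), 5, 1.0
N = np.prod(shape)

# deterministic tensor E[X] with a known rank-r decomposition
A, B, C = (rng.standard_normal((n, r)) for n in shape)
EX = np.einsum('ir,jr,kr->ijk', A, B, C)

def sample():
    # one noisy observation X = E[X] + U with U ~ UNIF(-delta, delta) entrywise
    return EX + rng.uniform(-delta, delta, size=shape)

def residual(factors):
    # exact relative residual sqrt(E||X - [[x]]||^2) / sqrt(E||X||^2)
    T = np.einsum('ir,jr,kr->ijk', *factors)
    noise = N * delta ** 2 / 3.0           # E||U||^2 for uniform noise
    return np.sqrt((np.linalg.norm(EX - T) ** 2 + noise)
                   / (np.linalg.norm(EX) ** 2 + noise))

Feeding batches of sample() into a routine such as sals_sweep above and recording residual(factors) after each sweep produces error curves of the kind shown in Figure 1.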

Computational Complexity

The computational effort of each block iteration of the SALS algorithm is determined by the cost of forming and solving Equation (41) for each component. Recall that rr is the specified rank, pp the tensor’s dimension, and n=i=1pnin=\sum_{i=1}^{p}n_{i} where nin_{i} is the size of the ii-th vector component of each outer product in the expansion. Further, let N=i=1pniN=\prod_{i=1}^{p}n_{i} be the total number of entries in 𝒳\mathcal{X} and nnz(𝒳)\mathrm{nnz}(\mathcal{X}) be the number of non-zero entries in 𝒳\mathcal{X}. It can then readily be seen that the cost of forming the coefficient matrix ΘiTΘi+λIr\Theta_{i}^{T}\Theta_{i}+\lambda I_{r} is O(pr2n)O(pr^{2}n), that of computing the matricized tensor times Khatri-Rao product (MTTKRP) 𝒳~(i)kΘi\tilde{\mathcal{X}}_{(i)}^{k}\Theta_{i} is O(prnnz(𝒳~(i)k))O(pr\cdot\mathrm{nnz}(\tilde{\mathcal{X}}_{(i)}^{k})), and that of solving the resulting dense symmetric system is O(pr3)O(pr^{3}). If rNr\ll N, the computational cost of the MTTKRP dominates, especially as the density of 𝒳~(i)k\tilde{\mathcal{X}}_{(i)}^{k} increases and hence nnz(𝒳~(i)k)N\mathrm{nnz}(\tilde{\mathcal{X}}_{(i)}^{k})\rightarrow N.
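
The leading-order terms just described can be collected in a small, purely illustrative cost model (constants and lower-order terms are ignored; the function name block_iteration_cost is ours):

def block_iteration_cost(p, r, n_sizes, nnz):
    # leading-order operation counts for one SALS block iteration
    # p: tensor order, r: target rank, n_sizes: (n_1, ..., n_p),
    # nnz: number of nonzeros in the averaged tensor
    n = sum(n_sizes)
    gram = p * r ** 2 * n      # forming Theta_i^T Theta_i + lambda I_r for all modes
    mttkrp = p * r * nnz       # matricized tensor times Khatri-Rao products
    solve = p * r ** 3         # dense symmetric solves
    return gram + mttkrp + solve

For the (30,40,50) tensors used below with a small rank and a dense average, the MTTKRP term already accounts for the bulk of the work in this model.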

It is often claimed that single-sample stochastic gradient methods, i.e. those using m_{k}=1 for k=1,2,\ldots, make more efficient use of samples than their batch counterparts, i.e. m_{k}=m>1 for k=1,2,\ldots. Below, we investigate this claim for the SALS method. For simplicity, assume that each sample tensor \mathcal{X}^{l} has a fixed fraction \gamma\in[0,1] of zero entries, so that \mathrm{nnz}(\mathcal{X}^{l})=N(1-\gamma). Barring cancellation and assuming a uniform, independent distribution of the nonzero entries, we estimate \mathrm{nnz}(\tilde{\mathcal{X}})=N(1-\gamma^{m}), where m is the sample size. Adding the cost O((m-1)N(1-\gamma)) of forming the average \tilde{\mathcal{X}} then gives the computational cost of each block iteration as

(42) C(m)=N(1γm)+(m1)N(1γ).C(m)=N(1-\gamma^{m})+(m-1)N(1-\gamma).

To compare the asymptotic efficiency of the single-sample SALS method with that of a batch method, we fix a computational budget CmaxC_{\max} and investigate the estimated error achieved by each approach. Note that once mm is fixed, the number of iterations is determined by the budget as k=Cmax/C(m)k=C_{\max}/C(m). We first consider the case m=1m=1 and let 𝒙=argminF(𝒙){\bm{x}}_{*}=\operatorname*{argmin}F({\bm{x}}) be the overall minimizer. If we assume the convergence rate to be sublinear, i.e. the error satisfies

𝔼[F(𝒙k)F(𝒙)]=O(1kβ), for some β>0,\mathbb{E}\left[F({\bm{x}}_{k})-F({\bm{x}}_{*})\right]=O\left(\frac{1}{k^{\beta}}\right),\quad\text{ for some }\beta>0,

then the accuracy of the SALS method with a budget of CmaxC_{\max} is on the order of

(43) E(1)=O((CmaxC(1))β)=O((Cmaxnnz(𝒳~))β).E(1)=O\left(\left(\frac{C_{\max}}{C(1)}\right)^{-\beta}\right)=O\left(\left(\frac{C_{\max}}{\mathrm{nnz}(\tilde{\mathcal{X}})}\right)^{-\beta}\right).

Now suppose m>1. Obtaining the best possible accuracy of the batch SALS method (with budget C_{\max}) requires the optimal combination of batch size m and number of iterations k. Let {\bm{x}}_{*}^{m}=\operatorname*{argmin}\tilde{f}_{m}({\bm{x}}) be the batch minimizer, where \tilde{f}_{m}({\bm{x}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}},\mathcal{X}^{l}), and let {\bm{x}}_{k} be the terminal iterate. The overall error satisfies

\mathbb{E}\left[F({\bm{x}}_{k})-F({\bm{x}}_{*})\right]\leq\mathbb{E}\left[F({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{k})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{*}^{m})\right]\\ +\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*}^{m})-\tilde{f}_{m}({\bm{x}}_{*})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*})-F({\bm{x}}_{*})\right].

Since \tilde{f}_{m}({\bm{x}}_{*}^{m})\leq\tilde{f}_{m}({\bm{x}}_{*}), we can bound the error above by the sum of a sampling error \mathbb{E}\left[F({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{k})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*})-F({\bm{x}}_{*})\right] and an optimization error \mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{*}^{m})\right]. Assuming the asymptotic behavior of the sampling error to be O(1/\sqrt{m}) and using the q-linear convergence rate O(\rho^{k}), for some \rho\in[0,1), of ALS (see [6]) as an optimistic estimate, the total error takes the form O(1/\sqrt{m}+\rho^{k}). Under the budget constraint, the error can thus be written as

(44) E(m)=O(1m+ρCmaxC(m)).E(m)=O\left(\frac{1}{\sqrt{m}}+\rho^{\frac{C_{\max}}{C(m)}}\right).

For large, high-dimensional tensors, the number of iterations C_{\max}/C(m) allowable under the budget decreases rapidly as m grows, so that the second term dominates the optimal error in Equation (44), whereas for smaller problems the optimal error is dominated by the Monte-Carlo sampling error 1/\sqrt{m}.
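
This trade-off can be explored numerically with the following sketch of the models (42), (43), and (44). Here gamma follows the convention above (the fraction of zero entries per sample), and the values of beta, rho, gamma, and the budget are illustrative placeholders rather than quantities estimated from our experiments.

import numpy as np

def C(m, N, gamma):
    # per-iteration cost model (42): work on the average plus the cost of averaging
    return N * (1.0 - gamma ** m) + (m - 1) * N * (1.0 - gamma)

def E_single(C_max, N, gamma, beta=1.0):
    # error model (43) for the single-sample method with sublinear rate 1/k^beta
    return (C_max / C(1, N, gamma)) ** (-beta)

def E_batch(m, C_max, N, gamma, rho=0.9):
    # error model (44): Monte-Carlo sampling error plus q-linear optimization error
    return 1.0 / np.sqrt(m) + rho ** (C_max / C(m, N, gamma))

# compare batch sizes under a fixed budget for a 30 x 40 x 50 tensor
N, gamma, C_max = 30 * 40 * 50, 0.9, 1.0e7
for m in (1, 10, 100, 1000):
    err = E_single(C_max, N, gamma) if m == 1 else E_batch(m, C_max, N, gamma)
    print(m, err)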

Example 4.2 (Sparsity).

To compare the computational complexity of the single-sample SALS method, i.e. m_{k}=1, with that of the batch SALS method, i.e. m_{k}=m, we form a sequence of three-dimensional random sparse tensors with n_{1}=30, n_{2}=40, and n_{3}=50 and a fixed fraction \gamma\in(0,1) of nonzero entries. For these tensors the first and second moments are known, and hence residuals can be computed accurately. In Figure 2 we verify that an increase in the batch size m deteriorates the sparsity of the averaged tensor and lowers the sampling error at the predicted rate.

(a) Deterioration of sparsity due to averaging.
(b) Log-log plot of the sampling error.
Figure 2: The effects of the batch size m on sparsity (left) and on the sampling error in the estimation of moments (right) for a (30,40,50) random tensor with sparsity parameter \gamma=0.1.

Figure 3 shows the efficiency of the SALS algorithm with batch sizes m=1,10,100, and 1000. To this end, we fix a computational budget of 10{,}000 total samples, which determines the number of block iterations for each batch size m. Clearly, the SALS method that uses a single sample per block iteration makes far more efficient use of samples than the variants with larger batch sizes.

(a) Log-log plot of the residual error as a function of the total number of samples.
(b) Log-log plot of the residual error as a function of computational cost.
Figure 3: The efficiency of the SALS method for batch sizes m=1,10,100,1000.
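
A driver for this comparison might look as follows; this is a sketch that assumes the sals_sweep, sample, and residual helpers from the earlier sketches are in scope, and the budget, rank, and regularization parameter are chosen purely for illustration.

import numpy as np

budget = 10_000                              # total number of samples available
for m in (1, 10, 100, 1000):
    init = np.random.default_rng(1)
    factors = [init.standard_normal((n, 5)) for n in (30, 40, 50)]
    for k in range(1, budget // m + 1):      # number of sweeps fixed by the budget
        batch = [sample() for _ in range(m)]
        factors = sals_sweep(batch, factors, lam=1.0, k=k)
    print(m, residual(factors))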

5 Conclusion

In this work, we showed that the SALS method is globally convergent and demonstrated the benefits of stochastic-gradient-based approaches, especially for the decomposition of large, sparse tensors. Our analysis focused on regularization in the Frobenius norm; we have not included a discussion of related proximal point algorithms or other regularization approaches (see, e.g., [21, 30, 31]). Another interesting avenue of exploration relates to the choice of cost functional in tensor decomposition. Even though the quadratic structure afforded by the Frobenius norm is essential to the effectiveness of the ALS method, the standard statistical basis of comparison, namely the expectation, could potentially be replaced by other statistical metrics, such as those used in risk-averse optimization methods. Finally, we foresee the application and extension of the SALS method to the analysis and decomposition of real data sets.

References

  • [1] E. Acar, D. M. Dunlavy, and T. G. Kolda, A scalable optimization approach for fitting canonical tensor decompositions, Journal of Chemometrics, 25 (2011), pp. 67–86.
  • [2] E. S. Allman and J. A. Rhodes, The identifiability of tree topology for phylogenetic models, including covarion and mixture models, Journal of Computational Biology, 13 (2006), pp. 1101–1113.
  • [3] C. Battaglino, G. Ballard, and T. G. Kolda, A practical randomized CP tensor decomposition, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 876–901.
  • [4] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037–2060.
  • [5] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Inc., USA, 1989.
  • [6] J. C. Bezdek and R. J. Hathaway, Convergence of alternating optimization, Neural, Parallel Sci. Comput., 11 (2003), pp. 351–368.
  • [7] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60 (2018), pp. 223–311.
  • [8] G. Bouchard, J. Naradowsky, S. Riedel, T. Rocktäschel, and A. Vlachos, Matrix and tensor factorization methods for natural language processing, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts, Beijing, China, July 2015, Association for Computational Linguistics, pp. 16–18.
  • [9] M. Boussé, O. Debals, and L. De Lathauwer, A tensor-based method for large-scale blind source separation using segmentation, IEEE Transactions on Signal Processing, 65 (2017), pp. 346–358.
  • [10] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition, Psychometrika, 35 (1970), pp. 283–319.
  • [11] D. Cheng, R. Peng, I. Perros, and Y. Liu, Spals: Fast alternating least squares via implicit leverage scores sampling, in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, 2016, Curran Associates Inc., p. 721–729.
  • [12] N. Eriksson, Algebraic statistics for computational biology, vol. 13, Cambridge University Press, 2005, ch. Tree construction using singular value decomposition, pp. 347–358.
  • [13] N. K. M. Faber, R. Bro, and P. K. Hopke, Recent developments in CANDECOMP/PARAFAC algorithms: a critical review, Chemometrics and Intelligent Laboratory Systems, 65 (2003), pp. 119–137.
  • [14] L. Grippo and M. Sciandrone, Globally convergent block-coordinate techniques for unconstrained optimization, Optimization Methods and Software, 10 (1999), pp. 587–637.
  • [15] C. J. Hillar and L.-H. Lim, Most tensor problems are NP-hard, J. ACM, 60 (2013). arXiv:0911.1393.
  • [16] E. Hille and R. S. Phillips, Functional analysis and semi-groups, American Mathematical Society, Providence, R. I. Third printing of the revised edition of 1957, American Mathematical Society Colloquium Publications, Vol. XXXI.
  • [17] M. Ishteva, H. Park, and L. Song, Unfolding latent tree structures using 4th order tensors, in International Conference on Machine Learning, 2013, pp. 316–324.
  • [18] C. G. Khatri and C. R. Rao, Solutions to some functional equations and their applications to characterization of probability distributions, Sankhya: The Indian Journal of Statistics, Series A (1961-2002), 30 (1968), pp. 167–180.
  • [19] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Review, 51 (2009), pp. 455–500.
  • [20] T. G. Kolda and D. Hong, Stochastic gradients for large-scale tensor decomposition.
  • [21] N. Li, S. Kindermann, and C. Navasca, Some convergence results on the regularized alternating least-squares method for tensor decomposition, Linear Algebra and its Applications, 438 (2013), pp. 796–812.
  • [22] T. Maehara, K. Hayashi, and K.-i. Kawarabayashi, Expected tensor decomposition with stochastic gradient descent, in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, AAAI Press, 2016, pp. 1919–1925.
  • [23] J. Mairal, Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization, in NIPS 2013 - Advances in Neural Information Processing Systems, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, eds., Advances in Neural Information Processing Systems 26, South Lake Tahoe, United States, Dec. 2013, pp. 2283–2291.
  • [24] V. A. Morozov, Methods for Solving Incorrectly Posed Problems, Springer New York, 1984.
  • [25] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22 (2012), pp. 341–362.
  • [26] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Society for Industrial and Applied Mathematics, 2000.
  • [27] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22 (1951), pp. 400–407.
  • [28] N. Sidiropoulos, G. Giannakis, and R. Bro, Blind PARAFAC receivers for DS-CDMA systems, IEEE Trans. on Signal Processing, 48 (2000), pp. 810–823.
  • [29] A. Smilde, R. Bro, and P. Geladi, Multi-Way Analysis with Applications in the Chemical Sciences, John Wiley & Sons, Ltd, 2004.
  • [30] A. Uschmajew, Local convergence of the alternating least squares algorithm for canonical tensor approximation, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 639–652.
  • [31] Y. Xu and W. Yin, Block stochastic gradient iteration for convex and nonconvex optimization, SIAM Journal on Optimization, 25 (2015), pp. 1686–1716.
  • [32] Y. Xu and W. Yin, A globally convergent algorithm for nonconvex optimization based on block coordinate update, Journal of Scientific Computing, 72 (2017), pp. 700–734.