

Analysis of the Stochastic Alternating Least Squares Method for the Decomposition of Random Tensors
Submitted to the editors 04/01/2020.

Yanzhao Cao, Somak Das, Luke Oeding, and Hans-Werner van Wyk, Department of Mathematics and Statistics, Auburn University, Auburn, AL.
Abstract

Stochastic Alternating Least Squares (SALS) is a method that approximates the canonical decomposition of averages of sampled random tensors. Its simplicity and efficient memory usage make SALS an ideal tool for decomposing tensors in an online setting. We show, under mild regularization and readily verifiable assumptions on the boundedness of the data, that the SALS algorithm is globally convergent. Numerical experiments validate our theoretical findings and demonstrate the algorithm’s performance and complexity.

keywords:
canonical decomposition, parallel factors, stochastic optimization, stochastic alternating least squares, random tensor, block nonlinear Gauss-Seidel
AMS subject classifications:

65K10, 90C15, 15A69, 68W20

1 Introduction

Multi-modal data is represented by a tensor, or a multidimensional matrix. Tensor data is present in areas such as Natural Language Processing (NLP) [8], Blind Source Separation (BSS) [9, 28], and Phylogenetic Tree Reconstruction (PTR) [12, 2, 17]. In each of these areas, canonical decomposition (CANDECOMP)/parallel factors (PARAFAC), also known as CP tensor decomposition (representing a given tensor as a sum of rank-one tensors), provides important insights since the components of the rank-one terms, the factors, represent meaning in the data. For NLP, given a tensor constructed from large text databases, the rank-one terms could represent topics, and the factors could represent words. For BSS, given a tensor constructed from signals arriving at a single receiver from unknown sources, the rank-one terms could represent sources, and the factors could be used to locate the sources. For PTR, given a tensor constructed as a contingency table for instances of nucleotide combinations from aligned DNA from several species, the factors represent model parameters for the phylogenetic tree.

In applications, observations may represent samples of an underlying random tensor. For example, the word co-occurrence frequencies used in NLP may come from a sample of a collection of documents. Here, fast, reliable algorithms are desired for the CP decomposition of the average tensor. Because CP decomposition is known to be NP-hard in general [15], and because the average tensor is usually only available through sampling, we investigate approximate decompositions based on stochastic optimization. In [22], Maehara et al. propose a stochastic version of the widely-used alternating least squares (ALS) method, the stochastic alternating least squares (SALS) method, and show the algorithm’s efficiency by means of numerical experiments. In this paper, we provide a detailed analysis of the algorithm, showing under mild regularization and a minimal set of verifiable assumptions that it is globally convergent to a local minimizer for any initial guess. In Section 4, we also include a detailed discussion of the algorithm’s complexity and efficiency.

The alternating least squares (ALS) method, first proposed by Carroll and Chang [10], remains the most widely used algorithm for computing the CP decomposition of a tensor [13]. This block-nonlinear Gauss-Seidel method [5, 26] exploits the component-wise quadratic structure of the cost functional to compute iterates efficiently and with a low memory footprint. It has been modified [11] to exploit sparsity structure in data, which typically occurs in tensors representing co-occurrence frequencies such as in the tree reconstruction above. Regularized versions of the ALS method, as well as the associated proximal minimization iterations considered in [21], help mitigate the potential ill-posedness of the CP problem to within sample noise. Although one may compute the tensor’s expectation a priori and decompose it by means of the standard ALS method (see Remark 2.1), such an approach results in a loss of sparsity and cannot seamlessly accommodate the arrival of new samples. In [22], Maehara et al. proposed the Stochastic Alternating Least Squares method, a block-stochastic minimization method that preserves the salient features of the ALS method, while also efficiently incorporating tensor samples in the optimization procedure. Recent work (see [3, 20]) has considered the related problem of using randomized (or stochastic) methods to decompose existing large-scale tensors. In particular, in [3], a variant of the ALS method is developed that approximates the component least squares problem efficiently by randomized/sketching methods. In [20], a dense tensor is expressed as the expectation of a sequence of sparse ones, thereby allowing the use of stochastic gradient descent (SGD) methods.

The convergence analysis of the SALS algorithm applied to the CP decomposition of a random tensor is complicated by the fact that the underlying cost functional is not convex. Moreover, the algorithm itself is a stochastic, block-iterative, second order method whose successive iterates do not have the Markovian dependence structure present in classical stochastic gradient descent (SGD) methods. The convergence of the ALS method was studied in [6], where a quotient-linear convergence rate was established. Block-coordinate techniques for the unconstrained optimization of general, possibly nonconvex, deterministic functionals were treated in [14] (see also [25, 4, 32] and references therein). Xu and Yin [31] discuss the convergence of block stochastic gradient methods for averages of convex and nonconvex functionals. They rely on assumptions (such as the uniformly bounded variance of gradient iterates) that, while standard in the literature (see e.g. [7]), are difficult to verify in practice since they pertain to the algorithm itself. The main contributions of this paper are to prove the convergence of the SALS algorithm, a block-stochastic Newton method, for the CP decomposition of the average of a random tensor, subject to a single verifiable assumption relating to the boundedness of the observed data.

This paper is structured as follows. In Section 2, we introduce the CP decomposition problem for random tensors and describe the stochastic alternating least squares algorithm (Algorithm 2). Section 3 contains our main theoretical contributions. In Section 3.1, we exploit the special multinomial structure of the cost functional to quantify the regularity of its component-wise gradient and Hessian in terms of the size of the algorithm’s iterate vectors (see Lemma 3.5 and its corollaries). In Section 3.2 we show that the iterates themselves can be bounded in terms of the random tensor to be decomposed (Lemma 3.15), which naturally leads to our single, verifiable assumption on the latter’s statistical properties (3.16). In Section 3.3, we show that the iterates generated by the SALS algorithm converge to a local minimizer. We validate the SALS method numerically via a few computational examples in Section 4 and offer concluding remarks in Section 5.

2 Notation and Problem Description

We follow notational conventions that are closely aligned with the literature on the CP decomposition (see [19]), as well as on stochastic optimization (see [7, 31]). We use lowercase letters to denote scalars, bold letters for vectors, uppercase letters for matrices, and uppercase calligraphic letters for tensors. We use superscripts to indicate iteration (or sub-iteration) indices and subscripts for components, i.e. x_{i}^{k} is the i-th component at the k-th iteration. Multi-indices are denoted by bold letters, and sums, products, and integrals over these should be interpreted in iterated form. For the block coordinate minimization method in Algorithm 2, it is convenient to express the vector {\bm{x}}=({\bm{x}}_{1},\ldots,{\bm{x}}_{p}) in terms of its i-th block component {\bm{x}}_{i} and the remaining components, denoted {\bm{x}}_{i^{*}}:=({\bm{x}}_{1},\ldots,{\bm{x}}_{i-1},{\bm{x}}_{i+1},\ldots,{\bm{x}}_{p}). Using this notation, we write {\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}}) with the implicit understanding that the block components are in the correct order. The Frobenius norm of a vector, matrix, or tensor, i.e. the square root of the sum of squares of its entries, is denoted by \|\cdot\|. For 1\leq p\leq\infty, \|\cdot\|_{p} denotes the standard Euclidean p-norm for vectors and the induced matrix p-norm for matrices. We use ‘\circ’ to denote the outer product, ‘*’ for the component-wise (or Hadamard) product, ‘\odot’ for the column-wise Khatri-Rao product, and ‘\otimes’ for the Kronecker product.
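
To make this notation concrete, the following minimal numpy sketch (illustrative only; the variable names are ours and not part of the paper) forms the outer, Hadamard, Kronecker, and column-wise Khatri-Rao products for two small random factor matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))            # a factor matrix in R^{4 x 3}
B = rng.random((5, 3))            # a factor matrix in R^{5 x 3}
a, b = A[:, 0], B[:, 0]

outer = np.multiply.outer(a, b)   # outer product a ∘ b, shape (4, 5)
hadamard = (A.T @ A) * (B.T @ B)  # Hadamard product '*' of the two Gram matrices, shape (3, 3)
kron = np.kron(a, b)              # Kronecker product a ⊗ b, shape (20,)
# column-wise Khatri-Rao product A ⊙ B: Kronecker products of matching columns
khatri_rao = np.stack([np.kron(A[:, j], B[:, j]) for j in range(3)], axis=1)
print(outer.shape, hadamard.shape, kron.shape, khatri_rao.shape)
```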

Let (Ω,,dμ)(\Omega,\mathcal{F},d\mu) be a complete probability space and let 𝒳:Ωn1×n2×np\mathcal{X}:\Omega\rightarrow\mathbb{R}^{n_{1}\times n_{2}\ldots\times n_{p}} be a measurable map, also known as a pp-th order random tensor. In practice the underlying probability space is rarely known explicitly. Instead, its law can often be observed indirectly through sample realizations 𝒳1,𝒳2,,𝒳n\mathcal{X}^{1},\mathcal{X}^{2},\ldots,\mathcal{X}^{n} of 𝒳\mathcal{X} that are assumed to be independent and identically distributed (iid). We use 𝔼\mathbb{E} to denote expectation:

𝔼[f]=Ωf(𝒳)𝑑μ𝒳\mathbb{E}\left[f\right]=\int_{\Omega}f(\mathcal{X})d\mu_{\mathcal{X}}

for any integrable function f:n1××npmf:\mathbb{R}^{n_{1}\times\ldots\times n_{p}}\rightarrow\mathbb{R}^{m}.

The rank-rr decomposition problem for 𝒳\mathcal{X} (or its sample realizations) amounts to finding a set of rr rank-one deterministic tensors {𝒳^j}j=1r\{\hat{\mathcal{X}}_{j}\}_{j=1}^{r}, so that the quantity

(1) 𝒳^:=j=1r𝒳^j\hat{\mathcal{X}}:=\sum_{j=1}^{r}\hat{\mathcal{X}}_{j}

is a good overall representation of 𝒳\mathcal{X}. Each rank-one tensor 𝒳^j\hat{\mathcal{X}}_{j}, j=1,,rj=1,\ldots,r, is formed by the outer product 𝒳^j=𝒂1,j𝒂p,j\hat{\mathcal{X}}_{j}={\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j} where 𝒂i,j=(ai,j,1,,ai,j,ni)ni{\bm{a}}_{i,j}=(a_{i,j,1},\ldots,a_{i,j,n_{i}})\in\mathbb{R}^{n_{i}} for each i=1,pi=1,\ldots p, i.e. 𝒳^jn1××np\hat{\mathcal{X}}_{j}\in\mathbb{R}^{n_{1}\times\ldots\times n_{p}} is defined component-wise by

𝒳^j,i1,,ip=l=1pal,j,il.\hat{\mathcal{X}}_{j,i_{1},\ldots,i_{p}}=\prod_{l=1}^{p}a_{l,j,i_{l}}.

We use the mean squared error 𝔼[𝒳𝒳^2]\mathbb{E}\left[\|\mathcal{X}-\hat{\mathcal{X}}\|^{2}\right] to measure the quality of the representation. Other risk measures may be more appropriate, depending on the application, possibly requiring a different analysis and approximation technique.

For analysis and computation, it is convenient to consider two other representations of the design variable. We define the i-th factor matrix A_{i}=[{\bm{a}}_{i,1},\ldots,{\bm{a}}_{i,r}]\in\mathbb{R}^{n_{i}\times r} to consist of the i-th component vectors of each term in decomposition (1). Let {\bm{x}}=\mathrm{vec}([A_{1},\ldots,A_{p}])\in\mathbb{R}^{nr}, with n=\sum_{i=1}^{p}n_{i}, be the vectorization of the factor matrices, i.e. the concatenation of their column vectors. We also write {\bm{x}} in block component form as {\bm{x}}=({\bm{x}}_{1},\ldots,{\bm{x}}_{p}), where {\bm{x}}_{i}=\mathrm{vec}(A_{i})\in\mathbb{R}^{rn_{i}} for i=1,\ldots,p. To emphasize the dependence of \hat{\mathcal{X}} on {\bm{x}}, we write the following (which also defines \left\llbracket\cdot\right\rrbracket)

𝒳^=𝒙=A1,,Ap:=j=1r𝒂1,j𝒂p,j.\hat{\mathcal{X}}=\left\llbracket{\bm{x}}\right\rrbracket=\left\llbracket A_{1},\ldots,A_{p}\right\rrbracket:=\sum_{j=1}^{r}{\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j}.
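
As a concrete illustration of the bracket notation, the short sketch below (the helper name cp_reconstruct is ours, introduced only for illustration) assembles \left\llbracket A_{1},\ldots,A_{p}\right\rrbracket as a sum of r outer products.

```python
import numpy as np

def cp_reconstruct(factors):
    """Assemble [[A_1, ..., A_p]] = sum_{j=1}^r a_{1,j} o ... o a_{p,j} from the factor matrices."""
    r = factors[0].shape[1]
    X_hat = np.zeros(tuple(A.shape[0] for A in factors))
    for j in range(r):
        term = factors[0][:, j]
        for A in factors[1:]:
            term = np.multiply.outer(term, A[:, j])   # build up the rank-one term
        X_hat += term
    return X_hat

# example: a rank-2 candidate in R^{3 x 4 x 5}
rng = np.random.default_rng(0)
print(cp_reconstruct([rng.random((n, 2)) for n in (3, 4, 5)]).shape)   # (3, 4, 5)
```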

No efficient closed-form solution of the factorization problem exists (even for deterministic tensors) [15], so it is commonly reformulated as the optimization problem

(2) min𝒙nr𝔼[𝒳𝒙2].\min_{{\bm{x}}\in\mathbb{R}^{nr}}\mathbb{E}\left[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}\right].
Remark 2.1.

Letting 𝐢=(i1,,ip){\bm{i}}=(i_{1},\ldots,i_{p}) and 𝐧=(n1,,np){\bm{n}}=(n_{1},\ldots,n_{p}), it follows readily that

𝔼[𝒳𝒙2]\displaystyle\mathbb{E}\left[\left\|\mathcal{X}-\llbracket{\bm{x}}\rrbracket\right\|^{2}\right] =𝔼[𝒊=1𝒏(𝒳𝒊𝒙𝒊)2]\displaystyle=\mathbb{E}\left[\sum_{{\bm{i}}=1}^{\bm{n}}\left(\mathcal{X}_{\bm{i}}-\llbracket{\bm{x}}\rrbracket_{\bm{i}}\right)^{2}\right]
=𝒊=1𝒏𝔼[𝒳𝒊2]2𝔼[𝒳𝒊]𝒙𝒊+𝒙𝒊2\displaystyle=\sum_{{\bm{i}}=1}^{{\bm{n}}}\mathbb{E}\left[\mathcal{X}_{\bm{i}}^{2}\right]-2\mathbb{E}\left[\mathcal{X}_{\bm{i}}\right]\llbracket{\bm{x}}\rrbracket_{\bm{i}}+\llbracket{\bm{x}}\rrbracket_{\bm{i}}^{2}
=𝒊=1𝒏var(𝒳𝒊)+𝒊=1𝒏(𝔼(𝒳𝒊)𝒙𝒊)2.\displaystyle=\sum_{{\bm{i}}=1}^{\bm{n}}\mathrm{var}(\mathcal{X}_{\bm{i}})+\sum_{{\bm{i}}=1}^{\bm{n}}\left(\mathbb{E}(\mathcal{X}_{\bm{i}})-\llbracket{\bm{x}}\rrbracket_{\bm{i}}\right)^{2}.

Since the variance var(𝒳𝐢)\mathrm{var}(\mathcal{X}_{{\bm{i}}}) of 𝒳𝐢\mathcal{X}_{{\bm{i}}} is constant in the design variable 𝐱\bm{x}, the minimization Problem (2) is equivalent to the decomposition of the expected tensor, i.e.

minxnr12𝔼[𝒳]𝒙2.\min_{x\in\mathbb{R}^{nr}}\frac{1}{2}\|\mathbb{E}[\mathcal{X}]-\llbracket{\bm{x}}\rrbracket\|^{2}.

The analysis in this work is however based on the formulation given by (2), since it lends itself more readily to extensions to more general risk functions.
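
The identity in Remark 2.1 also holds exactly for a finite iid sample when the expectation and variance are replaced by the empirical mean and the (biased) empirical variance, which gives a quick numerical sanity check (purely illustrative, with arbitrary synthetic data standing in for \mathcal{X} and \llbracket{\bm{x}}\rrbracket).

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=(1000, 3, 4, 5))   # iid draws standing in for X
T = rng.random((3, 4, 5))                                        # an arbitrary candidate tensor

lhs = np.mean([np.sum((X - T) ** 2) for X in samples])           # empirical E||X - T||^2
rhs = np.sum(samples.var(axis=0)) + np.sum((samples.mean(axis=0) - T) ** 2)
print(np.isclose(lhs, rhs))                                      # True, up to round-off
```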

Tensor decompositions have scale ambiguity. Indeed, for any term j=1,\ldots,r and any scalars \beta_{1,j},\ldots,\beta_{p-1,j}>0, let \beta_{p,j}=1/\prod_{i=1}^{p-1}\beta_{i,j}. Then

j=1r(β1,j𝒂1,j)(βp,j𝒂p,j)=j=1r𝒂1,j𝒂p,j.\sum_{j=1}^{r}(\beta_{1,j}{\bm{a}}_{1,j})\circ\ldots\circ(\beta_{p,j}{\bm{a}}_{p,j})=\sum_{j=1}^{r}{\bm{a}}_{1,j}\circ\ldots\circ{\bm{a}}_{p,j}.

The optimizer of (2) therefore lies on a simply connected manifold (see [30]), which can lead to difficulties in the convergence of optimization algorithms. To ensure isolated minima that can be readily located, it is common to enforce bounds on the size of the factors, either directly through an appropriate normalization (see [30]), or indirectly through the addition of a regularization term [22]. We pursue the latter strategy, leading to the problem

(3) min𝒙nrF(𝒙)=min𝒙nr𝔼[f(𝒙)], withf(𝒙;𝒳)=12𝒳𝒙2+λ2𝒙2,𝒙nr,λ>0.\min_{{\bm{x}}\in\mathbb{R}^{nr}}F({\bm{x}})=\min_{{\bm{x}}\in\mathbb{R}^{nr}}\mathbb{E}\left[f({\bm{x}})\right],\text{ with}\\ f({\bm{x}};\mathcal{X})=\frac{1}{2}\left\|\mathcal{X}-\llbracket{\bm{x}}\rrbracket\right\|^{2}+\frac{\lambda}{2}\|{\bm{x}}\|^{2},\qquad{\bm{x}}\in\mathbb{R}^{nr},\lambda>0.

While the regularization term biases the minimizers of Problem (2), its inclusion is key to guaranteeing well-posedness in the presence of noise. It plays a pivotal role in (i) proving the existence of a minimizer (Lemma 2.2), (ii) ensuring the positive-definiteness of the Hessian (Corollary 3.11), and (iii) guaranteeing the boundedness of the iterates in terms of the random tensor \mathcal{X} (Lemma 3.15). Heuristic methods, the most well-known of which is the Morozov discrepancy principle [24], are typically used to choose a value of \lambda that balances the bias of the optimizers against stability considerations.
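
For reference, the sampled objective in (3) can be transcribed directly as follows (a minimal sketch; the function and argument names are ours, with lam playing the role of \lambda).

```python
import numpy as np
from functools import reduce

def f_sample(factors, X, lam):
    """Sampled objective f(x; X) = 0.5*||X - [[x]]||^2 + 0.5*lam*||x||^2, cf. (3)."""
    r = factors[0].shape[1]
    # [[x]] assembled as a sum of rank-one outer products
    X_hat = sum(reduce(np.multiply.outer, [A[:, j] for A in factors]) for j in range(r))
    penalty = sum(np.sum(A ** 2) for A in factors)     # ||x||^2 = sum_i ||A_i||_F^2
    return 0.5 * np.sum((X - X_hat) ** 2) + 0.5 * lam * penalty
```

In particular, the regularization term alone gives f_sample(factors, X, lam) >= 0.5*lam*penalty, which is precisely the elementary bound (5) recorded below.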

Lemma 2.2 (Existence of Minimizers).

Problem (3) has at least one minimizer.

Proof 2.3.

Let F^{*}=\inf_{{\bm{x}}\in\mathbb{R}^{nr}}F({\bm{x}}). By the definition of the infimum, there exists a sequence \{{\bm{x}}^{k}\}_{k=1}^{\infty} in \mathbb{R}^{nr} with

limkF(𝒙k)=F,\lim_{k\rightarrow\infty}F({\bm{x}}^{k})=F^{*},

from which it follows that F(𝐱k)F({\bm{x}^{k}}) is bounded. The inequality

(4) 𝒙22λF(𝒙)for all 𝒙rn,\|{\bm{x}}\|^{2}\leq\frac{2}{\lambda}F({\bm{x}})\ \ \text{for all }{\bm{x}}\in\mathbb{R}^{rn},

allows us to establish the boundedness of {𝐱k}k=1\{{\bm{x}}^{k}\}_{k=1}^{\infty}. By compactness, there exists a convergent subsequence 𝐱ki𝐱{\bm{x}}^{k_{i}}\rightarrow{\bm{x}}^{*} as ii\rightarrow\infty. The continuity of FF then implies

F(𝒙)=limiF(𝒙ki)=F.F({\bm{x}^{*}})=\lim_{i\rightarrow\infty}F({\bm{x}}^{k_{i}})=F^{*}.

Remark 2.4.

For the sample realizations f(𝐱;𝒳)f({\bm{x}};\mathcal{X}) we have a bound similar to (4):

(5) for all 𝒙nr,𝒙22λf(𝒙;𝒳),a.s. on Ω.\text{for all }{\bm{x}}\in\mathbb{R}^{nr},\quad\|{\bm{x}}\|^{2}\leq\frac{2}{\lambda}f({\bm{x}};\mathcal{X}),\ \ \ \text{a.s. on }\Omega.

2.1 The Stochastic Alternating Least Squares Method

Although the objective function FF is a high degree polynomial in general, it is quadratic in each component factor matrix AiA_{i}. This is apparent from the matricized form of (3). Recall [18] that the columnwise Khatri-Rao product :ni×r×nj×rninj×r\odot:\mathbb{R}^{n_{i}\times r}\times\mathbb{R}^{n_{j}\times r}\rightarrow\mathbb{R}^{n_{i}n_{j}\times r} of matrices A=[𝒂1,,𝒂r]A=[\bm{a}_{1},...,{\bm{a}}_{r}] and B=[𝒃1,,𝒃r]B=[{\bm{b}}_{1},...,{\bm{b}}_{r}] is defined as their columnwise Kronecker product, i.e. AB=[𝒂1𝒃1,,𝒂r𝒃r]A\odot B=[{\bm{a}}_{1}\otimes{\bm{b}}_{1},...,{\bm{a}}_{r}\otimes{\bm{b}}_{r}]. The matricization A1,,Ap(i)\llbracket A_{1},\ldots,A_{p}\rrbracket_{(i)} of the rank rr tensor A1,,Ap\llbracket A_{1},\ldots,A_{p}\rrbracket along the ii-th fiber takes the form (see [1])

(6) A1,,Ap(i)=Ai(ApAi+1Ai1A1)T=:AiΘiT,\llbracket A_{1},\ldots,A_{p}\rrbracket_{(i)}=A_{i}\left(A_{p}\odot\ldots\odot A_{i+1}\odot A_{i-1}\odot\ldots\odot A_{1}\right)^{T}=:A_{i}\Theta_{i}^{T},

where Θi\Theta_{i} is defined to be the iterated Khatri-Rao product on the right. Note that Θi\Theta_{i} does not depend on AiA_{i}, and hence the matricized tensor decomposition is linear in AiA_{i}. Since the Frobenius norm is invariant under matricization, the sample objective function f(𝒙,𝒳)f({\bm{x}},\mathcal{X}) can then be rewritten as a quadratic function in AiA_{i}, i.e.

f(A1,,Ap;𝒳):=12𝒳(i)AiΘiT2+λ2j=1pAj2.f(A_{1},\ldots,A_{p};\mathcal{X}):=\frac{1}{2}\left\|\mathcal{X}_{(i)}-A_{i}\Theta_{i}^{T}\right\|^{2}+\frac{\lambda}{2}\sum_{j=1}^{p}\|A_{j}\|^{2}.

Vectorizing this form yields a linear least squares objective function in 𝒙i{\bm{x}}_{i}, namely

f(𝒙1,,𝒙p;𝒳)=12vec(𝒳(i))(ΘiIni)𝒙i2+λ2j=1p𝒙j2,f({\bm{x}}_{1},\ldots,{\bm{x}}_{p};\mathcal{X})=\frac{1}{2}\left\|\mathrm{vec}(\mathcal{X}_{(i)})-(\Theta_{i}\otimes I_{n_{i}}){\bm{x}}_{i}\right\|^{2}+\frac{\lambda}{2}\sum_{j=1}^{p}\|{\bm{x}}_{j}\|^{2},

whose component-wise gradient and Hessian are given respectively by

(7) 𝒙if(𝒙;𝒳)\displaystyle\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X}) =(ΘiTIni)vec(𝒳(i))+((ΘiTΘi+λIr)Ini)𝒙i,and\displaystyle=-(\Theta_{i}^{T}\otimes I_{n_{i}})\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}\right){\bm{x}}_{i},\quad\text{and}
(8) 𝒙i2f(𝒙)\displaystyle\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}) =(ΘiTΘi+λIr)Ini,\displaystyle=(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}},

where Irr×rI_{r}\in\mathbb{R}^{r\times r} and Inini×niI_{n_{i}}\in\mathbb{R}^{n_{i}\times n_{i}} are identity matrices. In the presence of regularization, the Hessian matrix (8) is strictly positive definite with lower bound that is independent of both 𝒳\mathcal{X} and of the remaining components 𝒙i{\bm{x}}_{i^{*}} (see Corollary 3.11). Consequently, each sampled component-wise problem

min𝒙irnif(𝒙;𝒳)\min_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}f({\bm{x}};\mathcal{X})

has a unique solution given by the stationary point 𝒙i{\bm{x}}_{i} satisfying 𝒙if(𝒙,𝒳)=0\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X})=0. It is more efficient to use the matricized form of the stationarity condition, i.e.

(9) 0=𝒳(i)Θi+Ai(ΘiTΘi+λIr).0=-\mathcal{X}_{(i)}\Theta_{i}+A_{i}(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r}).

For any Ani×r,Bnj×rA\in\mathbb{R}^{n_{i}\times r},B\in\mathbb{R}^{n_{j}\times r}, it can be shown [29] that

(AB)T(AB)=(ATA)(BTB),(A\odot B)^{T}(A\odot B)=(A^{T}A)*(B^{T}B),

where * denotes the (Hadamard) product. Repeatedly using this identity, we have

(10) ΘiTΘi:=(A1TA1)(Ai1TAi1)(Ai+1TAi+1)(ApTAp),\Theta_{i}^{T}\Theta_{i}:=(A_{1}^{T}A_{1})*\ldots*(A_{i-1}^{T}A_{i-1})*(A_{i+1}^{T}A_{i+1})*\ldots*(A_{p}^{T}A_{p}),

so the Hessian can be computed as the component-wise product of pp matrices in r×r\mathbb{R}^{r\times r}.
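
Putting (6), (9), and (10) together, one regularized block update can be sketched as follows (illustrative Python; the helper names and the mode-i unfolding convention, chosen here to match the column ordering in (6), are ours).

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Khatri-Rao product of a list of matrices that share the same column count."""
    out = mats[0]
    for M in mats[1:]:
        out = np.stack([np.kron(out[:, j], M[:, j]) for j in range(out.shape[1])], axis=1)
    return out

def als_component_update(X, factors, i, lam):
    """Solve the stationarity condition (9): A_i = X_(i) Theta_i (Theta_i^T Theta_i + lam*I)^{-1}."""
    p, r = len(factors), factors[0].shape[1]
    # Theta_i = A_p ⊙ ... ⊙ A_{i+1} ⊙ A_{i-1} ⊙ ... ⊙ A_1, as in (6)
    Theta = khatri_rao([factors[l] for l in reversed(range(p)) if l != i])
    # mode-i unfolding consistent with (6): remaining indices flattened with the first varying fastest
    X_i = np.reshape(np.moveaxis(X, i, 0), (X.shape[i], -1), order='F')
    gram = np.ones((r, r))
    for l in range(p):
        if l != i:
            gram *= factors[l].T @ factors[l]          # Hadamard identity (10)
    return np.linalg.solve(gram + lam * np.eye(r), (X_i @ Theta).T).T
```

Cycling this update over the blocks i=1,\ldots,p, with X replaced by (an approximation of) \mathbb{E}[\mathcal{X}], reproduces one sweep of Algorithm 1 below.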

The (deterministic) ALS method (Algorithm 1) exploits the component-wise quadratic structure of the objective functional FF, which it inherits from ff. In the kk-th block iteration, the ALS algorithm cycles through the pp components 𝒙1,,𝒙p{\bm{x}}_{1},\ldots,{\bm{x}}_{p} of 𝒙{\bm{x}}, updating each component in turn in the direction of the component-wise minimizer. The function is also updated at each subiteration to reflect this change. Specifically, the iterate at the beginning of the kk-th block is denoted by

𝒙k=𝒙k,0=(𝒙1k,𝒙2k,,𝒙pk).\displaystyle{\bm{x}}^{k}={\bm{x}}^{k,0}=({\bm{x}}_{1}^{k},{\bm{x}}_{2}^{k},...,{\bm{x}}_{p}^{k}).

The ALS algorithm then generates a sequence of sub-iterates 𝒙k,1,,𝒙k,p{\bm{x}}^{k,1},...,{\bm{x}}^{k,p}, where

𝒙k,i=(𝒙1k+1,,𝒙i1k+1,𝒙ik+1,𝒙i+1k,,𝒙pk).{\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k}).

Note that, under this convention, 𝒙k,p=𝒙k+1=𝒙k+1,0{\bm{x}}^{k,p}={\bm{x}}^{k+1}={\bm{x}}^{k+1,0}.

Algorithm 1 The Alternating Least Squares Algorithm
1:  Initial guess 𝒙1{\bm{x}^{1}}
2:  for k=1,2,k=1,2,\ldots do
3:     for i=1,,pi=1,\ldots,p do
4:        \displaystyle{\bm{x}}_{i}^{k+1}=\operatorname*{argmin}_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}F({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
5:        𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
6:     end for
7:  end for

Although the descent achieved by the ALS method during the k-th block iteration is most likely smaller than that achieved by a monolithic descent direction for F over the entire space \mathbb{R}^{nr}, the ALS updates can be obtained at a significantly lower computational cost and with a lower memory footprint.

The block-component-wise quadratic structure of the objective F allows us to compute {\bm{x}}_{i}^{k+1} explicitly. To this end, we form the second-order Taylor expansion of F in {\bm{x}}_{i} about the current iterate {\bm{x}}^{k,i-1}, i.e.

(11) F(𝒙1k+1,𝒙i1k+1,𝒙i+𝒑,𝒙i+1k,,𝒙pk)\displaystyle F({\bm{x}}_{1}^{k+1},\ldots{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}+{\bm{p}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k}) =Fk,i+(𝒈k,i)T𝒑+12𝒑THk,i𝒑,\displaystyle=F^{k,i}+({\bm{g}}^{k,i})^{T}{\bm{p}}+\frac{1}{2}{\bm{p}}^{T}H^{k,i}{\bm{p}},
withFk,i=F(𝒙k,i1)=𝔼[f(𝒙k,i1)],\displaystyle\text{with}\quad F^{k,i}=F({\bm{x}}^{k,i-1})=\mathbb{E}\left[f({\bm{x}}^{k,i-1})\right],
𝒈k,i=𝒙iF(𝒙k,i1)=𝔼[𝒙if(𝒙k,i1)],and\displaystyle{\bm{g}}^{k,i}=\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1})=\mathbb{E}\left[\nabla_{{\bm{x}}_{i}}f({\bm{x}}^{k,i-1})\right],\quad\text{and}
Hk,i=𝒙i2F(𝒙k,i1)=𝒙i2f(𝒙k,i1).\displaystyle H^{k,i}=\nabla_{{\bm{x}}_{i}}^{2}F({\bm{x}}^{k,i-1})=\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}).

Note that Hk,iH^{k,i} does not depend explicitly on 𝒳\mathcal{X} and is therefore a deterministic quantity for known 𝒙{\bm{x}}. The component-wise minimizer 𝒙ik+1{\bm{x}}_{i}^{k+1} of FF can be calculated explicitly as the Newton update

𝒙ik+1=𝒙ik(Hk,i)1𝒈k,i.{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-(H^{k,i})^{-1}{\bm{g}}^{k,i}.

Although both the design variable and the functions appearing in Problem (3) are deterministic, any gradient based optimization procedure, such as the ALS method described above, requires the approximation of an expectation at every iterate, the computational effort of which is considerable. This suggests the use of stochastic gradient sampling algorithms that efficiently incorporate the sampling procedure into the optimization iteration, while exploiting data sparsity.

The stochastic gradient descent method [27] addresses the cost of approximating the expectation by computing descent directions from small sampled batches of function values and gradients at each step of the iteration. To accommodate the noisy gradients, the step size is reduced at a predetermined rate. For the stochastic alternating least squares (SALS) method (Algorithm 2), we determine the sample at the beginning of the kk-th block iteration. For a batch 𝓧=(𝒳1,,𝒳m)\bm{\mathcal{X}}=(\mathcal{X}^{1},\ldots,\mathcal{X}^{m}) of mm iid random samples of 𝒳\mathcal{X}, we write the component-wise batch average f~\tilde{f} of ff as

f~(𝒙;𝓧)=1ml=1mf(𝒙;𝒳l).\tilde{f}({\bm{x}};\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}};\mathcal{X}^{l}).

As before, we can compute the component-wise minimizer

(12) \hat{\bm{x}}_{i}^{k+1}=\operatorname*{argmin}_{{\bm{x}}_{i}\in\mathbb{R}^{rn_{i}}}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}})

at the ii-th subiteration by means of the Newton step. To this end we express f~\tilde{f} in terms of its second order Taylor expansion about the current iterate 𝒙k,i1{\bm{x}}^{k,i-1}, i.e.

f~(𝒙1k+1,,𝒙i1k+1,𝒙i+𝒑,𝒙i+1k,,𝒙pk;𝓧)=f~k,i+(𝒈~k,i)T𝒑+12𝒑THk,i𝒑,with\displaystyle\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{x}}_{i}+{\bm{p}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}})=\tilde{f}^{k,i}+(\tilde{\bm{g}}^{k,i})^{T}{\bm{p}}+\frac{1}{2}{\bm{p}}^{T}H^{k,i}{\bm{p}},\quad\text{with}
f~k,i(𝓧)=1ml=1mf(𝒙k,i1;𝒳l),and𝒈~k,i(𝓧)=1ml=1m𝒙if(𝒙k,i1;𝒳l).\displaystyle\tilde{f}^{k,i}(\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}}^{k,i-1};\mathcal{X}^{l}),\quad\text{and}\quad\tilde{\bm{g}}^{k,i}(\bm{\mathcal{X}})=\frac{1}{m}\sum_{l=1}^{m}\nabla_{{\bm{x}}_{i}}f({\bm{x}}^{k,i-1};\mathcal{X}^{l}).

The sample minimizer is thus

𝒙^ik+1=𝒙ik(Hk,i)1𝒈~k,i.\hat{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i}.

To mitigate the effects of noise on the estimate, especially during later iterations, we modify the update by introducing a variable stepsize parameter αk,i>0\alpha^{k,i}>0, so that

(13) 𝒙ik+1=𝒙ikαk,i(Hk,i)1𝒈~k,i.{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\alpha^{k,i}\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i}.
Algorithm 2 Stochastic Alternating Least Squares Algorithm
1:  Initial guess 𝒙1{\bm{x}^{1}}
2:  for k=1,2,k=1,2,\ldots do
3:     Generate a random sample 𝓧k=[𝒳k,1,,𝒳k,mk]\bm{\mathcal{X}}^{k}=[\mathcal{X}^{k,1},\ldots,\mathcal{X}^{k,m_{k}}]
4:     for i=1,,pi=1,\ldots,p do
5:        Compute sample gradient 𝒈~k,i\tilde{\bm{g}}^{k,i} and Hessian Hk,iH^{k,i}
6:        Compute step size αk,i\alpha^{k,i}
7:        Update ii-th block 𝒙ik+1=𝒙ikαk,i(Hk,i)1𝒈~k,i{\bm{x}}_{i}^{k+1}={\bm{x}}_{i}^{k}-\alpha^{k,i}\left(H^{k,i}\right)^{-1}\tilde{\bm{g}}^{k,i} so that
𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k})
8:     end for
9:     𝒙k+1=𝒙k,p{\bm{x}}^{k+1}={\bm{x}}^{k,p}
10:  end for

It is well known (see e.g. [7]) that a stepsize αk,i\alpha^{k,i} that decreases at the rate of O(1k)O\left(\frac{1}{k}\right) as kk\rightarrow\infty leads to an optimal convergence rate for stochastic gradient methods. Here, we specify that the stepsize takes the form αk,i=ck,ik\alpha^{k,i}=\frac{c^{k,i}}{k}, where ck,ic^{k,i} is a strictly positive, bounded sequence, i.e. there are constants 0<cmin<cmax20<c_{\min}<c_{\max}\leq 2 such that

0<cminck,icmax,for all k=1,2,,i=1,2,,p.0<c_{\min}\leq c^{k,i}\leq c_{\max},\qquad\text{for all }k=1,2,\ldots,\ i=1,2,\ldots,p.
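
Combining the block update above with the damped step (13) and this stepsize rule yields the following sketch of Algorithm 2 (illustrative only; it assumes the hypothetical als_component_update helper sketched earlier in this subsection, uses the fact that the sampled gradient (7) is affine in the data so that the batch enters only through its mean, and clips the first step to 1, which is our choice rather than part of the algorithm's specification).

```python
import numpy as np

def sals(sample_batches, shape, r, lam, c=1.0, seed=0):
    """Sketch of Algorithm 2 for a stream of sample batches (each batch is a list of tensors)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, r)) for n in shape]          # initial guess x^1
    for k, batch in enumerate(sample_batches, start=1):
        X_bar = sum(batch) / len(batch)        # the sampled gradient g~ only sees the batch mean
        alpha = min(c / k, 1.0)                # stepsize alpha^{k,i} = c^{k,i}/k, clipped to 1
        for i in range(len(shape)):
            A_hat = als_component_update(X_bar, factors, i, lam)    # block minimizer of f~ in (12)
            factors[i] = (1 - alpha) * factors[i] + alpha * A_hat   # damped Newton update (13)
    return factors
```

The update is written in the convex-combination form {\bm{x}}_{i}^{k+1}=(1-\alpha^{k,i}){\bm{x}}_{i}^{k}+\alpha^{k,i}\hat{\bm{x}}_{i}^{k+1}, which is also the form used in the proof of Lemma 3.15 below.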

One of the difficulties in analyzing the convergence behavior of stochastic sampling methods arises from the fact that the iterates 𝒙k,i{\bm{x}}^{k,i} constitute a stochastic process generated by a stochastic algorithm that depends on the realizations of the sample batches 𝓧1,𝓧2,𝓧k1\bm{\mathcal{X}}^{1},\bm{\mathcal{X}}^{2},\ldots\bm{\mathcal{X}}^{k-1}. Consequently, even deterministic functions, such as FF, become stochastic when evaluated at these points, e.g.​ F(𝒙k,i)F({\bm{x}}^{k,i}) is a random quantity. Moreover, successive iterates are statistically dependent, since later iterates are updates of earlier ones. Specifically, let k=σ(𝓧1,,𝓧k)\mathcal{F}^{k}=\sigma(\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k}) be the σ\sigma-algebra generated by the first kk sample batches. At the end of the ii-th subiteration in the kk-th block, the first ii components of 𝒙k,i=(𝒙1k+1,,𝒙ik+1,𝒙i+1k,,𝒙pk){\bm{x}}^{k,i}=({\bm{x}}^{k+1}_{1},...,{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},...,{\bm{x}}_{p}^{k}) have been updated using 𝓧k\bm{\mathcal{X}}^{k} and are thus k\mathcal{F}^{k}-measurable, whereas the remaining components 𝒙i+1k,,𝒙pk{\bm{x}}_{i+1}^{k},...,{\bm{x}}_{p}^{k} depend only on 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},...,\bm{\mathcal{X}}^{k-1} and are therefore only k1\mathcal{F}^{k-1}-measurable. To separate the effect of 𝓧k\bm{\mathcal{X}}^{k} from that of 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1} on an k\mathcal{F}^{k}-measurable random variable hh, it is often useful to invoke the law of total expectation, i.e.

𝔼𝓧1,,𝓧k1,𝓧k[h]=𝔼𝓧1,,𝓧k1[𝔼𝓧k[h|k1]].\mathbb{E}_{\bm{\mathcal{X}}^{1},...,\bm{\mathcal{X}}^{k-1},\bm{\mathcal{X}}^{k}}[h]=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\mathbb{E}_{\bm{\mathcal{X}}^{k}}\left[h|\mathcal{F}^{k-1}\right]\right].

3 Convergence Analysis

Here we discuss the convergence of Algorithm 2. Since Problem (3) is nonconvex, there can in general be no unique global minimizer. Theorem 3.25 establishes the convergence of the SALS iterates {\bm{x}}^{k,i} to a local minimizer in expectation. The special structure of the CP problem nevertheless allows us to forgo most of the standard assumptions made in the SGD framework; in fact, we only assume boundedness of the sampled data \mathcal{X}. Indeed, we show that the regularity of the component-wise problem, such as the Lipschitz constant of the gradient, depends only on the norm of the current iterate {\bm{x}}^{k,i}, which is in turn bounded by \|\mathcal{X}\| (Lemma 3.15).

3.1 Regularity Estimates

In the following we exploit the multinomial structure of the cost functional to establish component-wise regularity estimates, such as local Lipschitz continuity of the sampled gradient 𝒙if(𝒙,𝒳)\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X}) and of the Newton step (𝒙i2f(𝒙))1𝒙if(𝒙,𝒳)-\left(\nabla^{2}_{{\bm{x}}_{i}}f({\bm{x}})\right)^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}},\mathcal{X}), as well as bounded invertibility of the Hessian 𝒙i2f(𝒙)\nabla^{2}_{{\bm{x}}_{i}}f({\bm{x}}).

In the proofs below, we will often need to bound a sum of the form i=1p𝒙iq\sum_{i=1}^{p}\|{\bm{x}}_{i}\|^{q} in terms of 𝒙\|{\bm{x}}\|, where 𝒙=(𝒙1,,𝒙p){\bm{x}}=({\bm{x}}_{1},...,{\bm{x}}_{p}). For this we use the fact that for finite dimensional vector spaces all norms are equivalent. In particular, for any norm subscripts ss and tt, there exist dimension-dependent constants 0<cs,t<ct,s<0<c_{s,t}<c_{t,s}<\infty so that

(14) c_{s,t}\|{\bm{x}}\|_{s}\leq\|{\bm{x}}\|_{t}\leq c_{t,s}\|{\bm{x}}\|_{s}.

Using norm equivalence with s=2s=2 and t=qt=q, we obtain

(15) i=1p𝒙iqi=1p(c2,qi)q𝒙iqq(c¯2,q)q𝒙qq(c¯2,qcq,2)q𝒙q,\sum_{i=1}^{p}\|{\bm{x}}_{i}\|^{q}\leq\sum_{i=1}^{p}(c_{2,q}^{i})^{q}\|{\bm{x}}_{i}\|_{q}^{q}\leq(\bar{c}_{2,q})^{q}\|{\bm{x}}\|_{q}^{q}\leq\left(\frac{\bar{c}_{2,q}}{c_{q,2}}\right)^{q}\|{\bm{x}}\|^{q},

where c¯2,q=maxi=1,,pc2,qi\displaystyle\bar{c}_{2,q}=\max_{i=1,...,p}c_{2,q}^{i}, while c2,qic_{2,q}^{i} and cq,2c_{q,2} are the appropriate constants in the norm equivalence relation above for 𝒙i{\bm{x}}_{i} and 𝒙{\bm{x}} respectively.

The estimates established in this section all follow from the local Lipschitz continuity of the mapping 𝒙iΘi{\bm{x}}_{i}^{*}\mapsto\Theta_{i}, shown in Lemma 3.1 below.

Lemma 3.1.

Let Θi\Theta_{i} and Θ~i\tilde{\Theta}_{i} be matrices defined in terms of 𝐱i{\bm{x}}_{i^{*}} and 𝐱~i\tilde{\bm{x}}_{i^{*}} respectively via (6). Then

(16) ΘiΘ~iPΘ(𝒙i,𝒙~i)𝒙i𝒙~i,with\displaystyle\|\Theta_{i}-\tilde{\Theta}_{i}\|\leq P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,\quad\text{with}
PΘ(𝒙i,𝒙~i)=C(𝒙ip2+𝒙~ip2)\displaystyle P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=C\left(\|{\bm{x}}_{i^{*}}\|^{p-2}+\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}\right)

for some constant 0C<0\leq C<\infty.

Proof 3.2.

First note that for any A\in\mathbb{R}^{n_{i}\times r},B\in\mathbb{R}^{n_{j}\times r}, properties of Kronecker products, the Cauchy–Schwarz inequality, and Inequality (15) imply that

AB2\displaystyle\|A\odot B\|^{2} =[𝒂1𝒃1,,𝒂r𝒃r]2=j=1r𝒂j𝒃j2\displaystyle=\|[{\bm{a}}_{1}\otimes\bm{b}_{1},\ldots,{\bm{a}}_{r}\otimes\bm{b}_{r}]\|^{2}=\sum_{j=1}^{r}\|\bm{a}_{j}\otimes\bm{b}_{j}\|^{2}
(17) =j=1r𝒂j2𝒃j2(j=1r𝒂j4)12(j=1r𝒃j4)12(c¯2,4c4,2)2A2B2.\displaystyle=\sum_{j=1}^{r}\|\bm{a}_{j}\|^{2}\|\bm{b}_{j}\|^{2}\leq\left(\sum_{j=1}^{r}\|\bm{a}_{j}\|^{4}\right)^{\frac{1}{2}}\left(\sum_{j=1}^{r}\|\bm{b}_{j}\|^{4}\right)^{\frac{1}{2}}\leq\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{2}\|A\|^{2}\|B\|^{2}.

Let 𝐤=(p,p1,,i+1,i1,,1)\bm{k}=(p,p-1,\ldots,i+1,i-1,\ldots,1) denote the indices in the Khatri-Rao product (6). Addition and subtraction of the appropriate cross-terms provides

ΘiΘ~i=j=1p1Akjj=1p1A~kj=j=1p1j=1j1A~kj(AkjA~kj)j=j+1p1Akj.\Theta_{i}-\tilde{\Theta}_{i}=\bigodot_{j=1}^{p-1}A_{k_{j}}-\bigodot_{j=1}^{p-1}\tilde{A}_{k_{j}}=\sum_{j=1}^{p-1}\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}.

By (17), each term in the sum above can be bounded in norm by

j=1j1A~kj(AkjA~kj)j=j+1p1Akj\displaystyle\left\|\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}\right\|
\displaystyle\leq (c¯2,4c4,2)p(j=1j1A~kj)(j=j+1p1Akj)AkjA~kj\displaystyle\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{A}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|A_{k_{j^{\prime}}}\|\right)\|A_{k_{j}}-\tilde{A}_{k_{j}}\|
=\displaystyle= (c¯2,4c4,2)p(j=1j1𝒙~kj)(j=j+1p1𝒙kj)𝒙kj𝒙~kj.\displaystyle\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|\right)\|{\bm{x}}_{k_{j}}-\tilde{\bm{x}}_{k_{j}}\|.

By the arithmetic-geometric mean inequality, and the bound in (15), we have

(j=1j1𝒙~kj)(j=j+1p1𝒙kj)\displaystyle\left(\prod_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|\right)\left(\prod_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|\right) 1(p2)(j=1j1𝒙~kjp2+j=j+1p1𝒙kjp2)\displaystyle\leq\frac{1}{(p-2)}\left(\sum_{j^{\prime}=1}^{j-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}+\sum_{j^{\prime}=j+1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}\right)
1p2(j=1p1𝒙~kjp2+j=1p1𝒙kjp2)\displaystyle\leq\frac{1}{p-2}\left(\sum_{j^{\prime}=1}^{p-1}\|\tilde{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}+\sum_{j^{\prime}=1}^{p-1}\|{\bm{x}}_{k_{j^{\prime}}}\|^{p-2}\right)
1p2(c¯2,p2cp2,2)p2(𝒙~ip2+𝒙ip2).\displaystyle\leq\frac{1}{p-2}\left(\frac{\bar{c}_{2,p-2}}{c_{p-2,2}}\right)^{p-2}\left(\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}+\|{\bm{x}}_{i^{*}}\|^{p-2}\right).

Since the upper bound

P1(𝒙i,𝒙~i)=1p2(c¯2,4c4,2)p(c¯2,p2cp2,2)p2(𝒙~ip2+𝒙ip2)P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=\frac{1}{p-2}\left(\frac{\bar{c}_{2,4}}{c_{4,2}}\right)^{p}\left(\frac{\bar{c}_{2,p-2}}{c_{p-2,2}}\right)^{p-2}\left(\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}+\|{\bm{x}}_{i^{*}}\|^{p-2}\right)

is independent of jj, it then follows that

\|\Theta_{i}-\tilde{\Theta}_{i}\| \leq\sum_{j=1}^{p-1}\left\|\bigodot_{j^{\prime}=1}^{j-1}\tilde{A}_{k_{j^{\prime}}}\odot(A_{k_{j}}-\tilde{A}_{k_{j}})\odot\bigodot_{j^{\prime}=j+1}^{p-1}A_{k_{j^{\prime}}}\right\|\leq P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\sum_{j=1}^{p-1}\|{\bm{x}}_{k_{j}}-\tilde{\bm{x}}_{k_{j}}\|
\leq P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,

which establishes (16) with PΘ(𝐱i,𝐱~i)=(c¯2,1c1,2)P1(𝐱i,𝐱~i)P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)P_{1}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|).

In the special case when Θ~i=0\tilde{\Theta}_{i}=0, Inequality (16) reduces to

(18) ΘiCΘ𝒙ip1,with CΘ independent of 𝒙.\|\Theta_{i}\|\leq C_{\Theta}\|{\bm{x}}_{i^{*}}\|^{p-1},\qquad\text{with $C_{\Theta}$ independent of ${\bm{x}}$}.
Corollary 3.3.

For Θi\Theta_{i} and Θ~i\tilde{\Theta}_{i} defined in (6),

(19) ΘiTΘiΘ~iTΘ~iPΘTΘ(𝒙i,𝒙~i)𝒙i𝒙~i,where\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|\leq P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|,\qquad\text{where}
(20) PΘTΘ(𝒙i,𝒙~i)=CΘTΘmax{𝒙ip1,𝒙~ip1}(𝒙ip2+𝒙~ip2),P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)=C_{\Theta^{T}\Theta}\max\{\|{\bm{x}}_{i^{*}}\|^{p-1}\!,\|\tilde{\bm{x}}_{i^{*}}\|^{p-1}\}\left(\|{\bm{x}}_{i^{*}}\|^{p-2}\!+\|\tilde{\bm{x}}_{i^{*}}\|^{p-2}\right),

for some constant 0CΘTΘ<0\leq C_{\Theta^{T}\Theta}<\infty independent of 𝐱{\bm{x}} and 𝐱~\tilde{\bm{x}}.

Proof 3.4.

By the equivalence of the induced Euclidean and the Frobenius norms,

ΘiTΘiΘ~iTΘ~i\displaystyle\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\| cF,2ΘiTΘiΘ~iTΘ~i2=cF,2ΘiT(ΘiΘ~i)+(ΘiΘ~i)TΘ~i2\displaystyle\leq c_{F,2}\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|_{2}=c_{F,2}\|\Theta_{i}^{T}(\Theta_{i}-\tilde{\Theta}_{i})+(\Theta_{i}-\tilde{\Theta}_{i})^{T}\tilde{\Theta}_{i}\|_{2}
cF,2(c2,F)2(ΘiTΘiΘ~i+ΘiΘ~iΘ~i)\displaystyle\leq\frac{c_{F,2}}{(c_{2,F})^{2}}\left(\|\Theta_{i}^{T}\|\|\Theta_{i}-\tilde{\Theta}_{i}\|+\|\Theta_{i}-\tilde{\Theta}_{i}\|\|\tilde{\Theta}_{i}\|\right)
2cF,2(c2,F)2max{Θi,Θ~i}ΘiΘ~i\displaystyle\leq\frac{2c_{F,2}}{(c_{2,F})^{2}}\max\{\|\Theta_{i}\|,\|\tilde{\Theta}_{i}\|\}\|\Theta_{i}-\tilde{\Theta}_{i}\|
2cF,2(c2,F)2max{𝒙ip1,𝒙~ip1}PΘ(𝒙i,𝒙~i)𝒙i𝒙~i.\displaystyle\leq\frac{2c_{F,2}}{(c_{2,F})^{2}}\max\{\|{\bm{x}}_{i^{*}}\|^{p-1},\|\tilde{\bm{x}}_{i^{*}}\|^{p-1}\}P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Letting Θ~i=0\tilde{\Theta}_{i}=0 in (19), yields the bound

(21) ΘiTΘiCΘTΘ𝒙i2(p1).\|\Theta_{i}^{T}\Theta_{i}\|\leq C_{\Theta^{T}\Theta}\|{\bm{x}}_{i^{*}}\|^{2(p-1)}.

As a first consequence of Lemma 3.1 and Corollary 3.3, we can obtain an explicit form for the component-wise Lipschitz constant of 𝒙if(𝒙)\nabla_{{\bm{x}}_{i}}f({\bm{x}}).

Lemma 3.5.

For any 𝐱,𝐱~rn{\bm{x}},\tilde{\bm{x}}\in\mathbb{R}^{rn}, there exists PL=PL(𝐱,𝐱~i,𝒳)P_{L}=P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|), given by (23), so that

(22) 𝒙if(𝒙;𝒳)𝒙if(𝒙~;𝒳)PL(𝒙,𝒙~i,𝒳)𝒙𝒙~.\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}-\tilde{\bm{x}}\|.

Proof 3.6.

Let \Theta_{i} and \tilde{\Theta}_{i} be constructed from {\bm{x}}_{i^{*}} and \tilde{\bm{x}}_{i^{*}} respectively via (6). Using (7), the difference in sampled component-wise gradients is given by

\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})
=\;-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}\right){\bm{x}}_{i}-\left((\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}\right)\tilde{\bm{x}}_{i}
=\;-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}\right)({\bm{x}}_{i}-\tilde{\bm{x}}_{i})
\quad+\left((\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i})\otimes I_{n_{i}}\right){\bm{x}}_{i}.

Since the singular values of a Kronecker product are formed from products of singular values of the constituent matrices, the matrix norm BI2=B2B\|B\otimes I\|_{2}=\|B\|_{2}\leq\|B\| for any matrix BB. In light of inequalities (16) and (19), we therefore have

\left\|-\left((\Theta_{i}^{T}-\tilde{\Theta}_{i}^{T})\otimes I_{n_{i}}\right)\mathrm{vec}(\mathcal{X}_{(i)})+\left((\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i})\otimes I_{n_{i}}\right)\bm{x}_{i}\right\|
\displaystyle\leq\; ΘiΘ~i2𝒳+ΘiTΘiΘ~iTΘ~i2𝒙i\displaystyle\|\Theta_{i}-\tilde{\Theta}_{i}\|_{2}\|\mathcal{X}\|+\|\Theta_{i}^{T}\Theta_{i}-\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|_{2}\|\bm{x}_{i}\|
\displaystyle\leq\; (PΘ(𝒙i,𝒙~i)𝒳+PΘTΘ(𝒙i,𝒙~i)𝒙i)𝒙i𝒙~i.\displaystyle\big{(}P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|\mathcal{X}\|+P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}\|\big{)}\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Similarly, Inequality (21) yields

(Θ~iTΘ~i+λIr)Ini(𝒙i𝒙~i)Θ~iTΘ~i+λIr2𝒙i𝒙~i\displaystyle\|(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})\otimes I_{n_{i}}({\bm{x}}_{i}-\tilde{\bm{x}}_{i})\|\leq\|\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r}\|_{2}\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|
\displaystyle\leq\; (Θ~iTΘ~i+λ)𝒙i𝒙~i(CΘTΘ𝒙~i2(p1)+λ)𝒙i𝒙~i.\displaystyle\big{(}\|\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}\|+\lambda\big{)}\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|\leq\left(C_{\Theta^{T}\Theta}\|\tilde{\bm{x}}_{i^{*}}\|^{2(p-1)}+\lambda\right)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Hence, by norm equivalence (15) with s=1s=1 and t=Ft=F, the bound (22) follows with

(23) PL(𝒙,𝒙~i,𝒳)=(c¯F1c1F)max{PLi(𝒙,𝒙~i,𝒳),PLi(𝒙~i)},\displaystyle P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=\left(\frac{\bar{c}_{F1}}{c_{1F}}\right)\max\left\{P_{L_{i^{*}}}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|),P_{L_{i}}(\|\tilde{\bm{x}}_{i^{*}}\|)\right\},
withPLi(𝒙~i)=CΘTΘ𝒙~i2(p1)+λ\displaystyle\text{with}\quad P_{L_{i}}(\|\tilde{\bm{x}}_{i^{*}}\|)=C_{\Theta^{T}\Theta}\|\tilde{\bm{x}}_{i^{*}}\|^{2(p-1)}+\lambda
andPLi(𝒙,𝒙~i,𝒳)=PΘ(𝒙i,𝒙~i)𝒳+PΘTΘ(𝒙i,𝒙~i)𝒙i.\displaystyle\text{and}\quad P_{L_{i^{*}}}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=P_{\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|\mathcal{X}\|\!+\!P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}\|.

The following corollary gives a simplified upper bound in the special case when {\bm{x}} and \tilde{\bm{x}} differ only in their i-th block components.

Corollary 3.7.

For any {\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} and \tilde{\bm{x}}=(\tilde{\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} differing only in their i-th block, we have

(24) \|\nabla_{{\bm{x}}_{i}}f({\bm{x}}_{i},{\bm{x}}_{i^{*}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}}_{i},{\bm{x}}_{i^{*}};\mathcal{X})\|\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Proof 3.8.

Since 𝐱i=𝐱~i{\bm{x}}_{i^{*}}=\tilde{\bm{x}}_{i^{*}}, estimate (22) collapses to

(ΘiTΘi+λIr)Ini(𝒙i𝒙~i)PLi(𝒙i)𝒙i𝒙~i.\|(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}}({\bm{x}}_{i}-\tilde{\bm{x}}_{i})\|\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

We may also bound the norm of the component-wise gradient in terms of the norm of 𝒙=(𝒙i,𝒙i){\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}}) and the data 𝒳\mathcal{X}, as shown in the following corollary.

Corollary 3.9.

For any 𝐱=(𝐱i,𝐱i)rn{\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn},

(25) \|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|\leq P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)\|{\bm{x}}\|,

where PLP_{L} is given by (23).

Proof 3.10.

To get inequality (25) set 𝐱~=𝟎\tilde{\bm{x}}=\bm{0} in (22) and note that 𝐱if(𝟎;𝒳)=𝟎{\nabla_{{\bm{x}}_{i}}f(\bm{0};\mathcal{X})=\bm{0}}.

Corollary 3.3 also implies the following bounds on the component-wise Hessian.

Corollary 3.11.

For any 𝐱=(𝐱i,𝐱i)rn{\bm{x}}=({\bm{x}}_{i},{\bm{x}}_{i^{*}})\in\mathbb{R}^{rn} and 𝐯irni{\bm{v}}_{i}\in\mathbb{R}^{rn_{i}},

(26) λ𝒗i2𝒗iT𝒙i2f(𝒙)𝒗iPLi(𝒙i)𝒗i2.\lambda\|\bm{v}_{i}\|^{2}\leq\bm{v}_{i}^{T}\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})\bm{v}_{i}\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|)\|\bm{v}_{i}\|^{2}.

Proof 3.12.

This follows directly from the fact that \Theta_{i}^{T}\Theta_{i} is positive semi-definite, so that every eigenvalue \mu of \nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})=(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})\otimes I_{n_{i}} satisfies

\lambda\leq\mu\leq\|\Theta_{i}^{T}\Theta_{i}\|_{2}+\lambda\leq\|\Theta_{i}^{T}\Theta_{i}\|+\lambda\leq P_{L_{i}}(\|{\bm{x}}_{i^{*}}\|).
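
The eigenvalue bounds used in this proof are easy to confirm numerically on a small random instance, as the following self-contained check illustrates (illustrative only; the dimensions and seed are arbitrary).

```python
import numpy as np

# Check the eigenvalue bounds behind (26) for a random third-order instance with i = 1.
rng = np.random.default_rng(1)
lam, r, n1 = 0.1, 3, 4
A2, A3 = rng.standard_normal((5, r)), rng.standard_normal((6, r))
Theta1 = np.stack([np.kron(A3[:, j], A2[:, j]) for j in range(r)], axis=1)   # Theta_1 = A_3 ⊙ A_2
H = np.kron(Theta1.T @ Theta1 + lam * np.eye(r), np.eye(n1))                 # Hessian (8)
eigs = np.linalg.eigvalsh(H)
upper = np.linalg.norm(Theta1.T @ Theta1, 2) + lam
print(eigs.min() >= lam - 1e-12, eigs.max() <= upper + 1e-12)                # True True
```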

Finally, we establish the local Lipschitz continuity of the Newton step mapping 𝒙(𝒙i2f(𝒙))1𝒙if(𝒙;𝒳){\bm{x}}\mapsto(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X}).

Lemma 3.13.

For any 𝐱,𝐲rn{\bm{x}},\bm{y}\in\mathbb{R}^{rn}, we have

(27) (𝒙i2f(𝒙))1𝒙if(𝒙;𝒳)(𝒙i2f(𝒙~))1𝒙if(𝒙~;𝒳)PN(𝒙,𝒙~i,𝒳)𝒙𝒙~,\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\\ \leq P_{N}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}-\tilde{\bm{x}}\|,

where PNP_{N} is given by (28).

Proof 3.14.

Let 𝐱,𝐱~rn{\bm{x}},\tilde{\bm{x}}\in\mathbb{R}^{rn}. By adding and subtracting a cross-term, we obtain

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|
\leq\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|
+\|(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|.

By the second resolvent identity (see e.g. Theorem 4.8.2. in [16]), we have

(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1} =\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})^{-1}-(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})^{-1}\right)\otimes I_{n_{i}}
=\left((\Theta_{i}^{T}\Theta_{i}+\lambda I_{r})^{-1}(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}-\Theta_{i}^{T}\Theta_{i})(\tilde{\Theta}_{i}^{T}\tilde{\Theta}_{i}+\lambda I_{r})^{-1}\right)\otimes I_{n_{i}},

so that, by virtue of (19), (25), and (26),

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})\|
\leq\frac{1}{\lambda^{2}}\|{\bm{x}}\|P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|)\|{\bm{x}}_{i^{*}}-\tilde{\bm{x}}_{i^{*}}\|.

Moreover, (22) and (26) imply

(𝒙i2f(𝒙~))12𝒙if(𝒙;𝒳)𝒙if(𝒙~;𝒳)1λPL(𝒙,𝒙~i,𝒳)𝒙i𝒙~i.\displaystyle\|(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\|_{2}\|\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq\frac{1}{\lambda}P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)\|{\bm{x}}_{i}-\tilde{\bm{x}}_{i}\|.

Combining these estimates gives

\|(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\mathcal{X})-(\nabla_{{\bm{x}}_{i}}^{2}f(\tilde{\bm{x}}))^{-1}\nabla_{{\bm{x}}_{i}}f(\tilde{\bm{x}};\mathcal{X})\|\leq P_{N}\|{\bm{x}}-\tilde{\bm{x}}\|,\quad\text{with}
(28) PN(𝒙,𝒙~i,𝒳)=(c¯2,1c1,2)max{P2,P3},\begin{split}P_{N}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|)=\left(\frac{\bar{c}_{2,1}}{c_{1,2}}\right)\max\left\{P_{2},P_{3}\right\},\end{split}

where c¯2,1\bar{c}_{2,1} and c1,2c_{1,2} are appropriate norm equivalence constants,

P_{2}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|) =\frac{1}{\lambda^{2}}\|{\bm{x}}\|P_{L}(\|{\bm{x}}\|,0,\|\mathcal{X}\|)P_{\Theta^{T}\Theta}(\|{\bm{x}}_{i^{*}}\|,\|\tilde{\bm{x}}_{i^{*}}\|),\qquad\text{and}
P3(𝒙,𝒙~i,𝒳)\displaystyle P_{3}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|) =1λPL(𝒙,𝒙~i,𝒳).\displaystyle=\frac{1}{\lambda}P_{L}(\|{\bm{x}}\|,\|\tilde{\bm{x}}_{i^{*}}\|,\|\mathcal{X}\|).

3.2 Boundedness of the Data

In this section we bound the norm of the component iterates {\bm{x}}_{i}^{k} in terms of the norm of the initial guess and the maximum value of the sequence of sample averages of the norms of \mathcal{X}.

Lemma 3.15.

The iterates 𝐱k,i{\bm{x}}^{k,i} generated by Algorithm 2 satisfy

\|{\bm{x}}_{i}^{k+1}\|\leq\max\left\{\|{\bm{x}}_{i}^{1}\|,R(\bm{\mathcal{X}}^{k}),\ldots,R(\bm{\mathcal{X}}^{1})\right\}\qquad\text{a.s. on }\Omega

for each k=1,2k=1,2\ldots, and i=1,,pi=1,\ldots,p, where R(𝓧k)R(\bm{\mathcal{X}}^{k}) is given by (29).

Proof 3.16.

Note that the component-wise minimizer 𝐱^ik+1\hat{\bm{x}}_{i}^{k+1} given in (12) satisfies

𝒙^ik+12+j=1i1𝒙jk+12+j=i+1p𝒙jk2\displaystyle\|\hat{\bm{x}}_{i}^{k+1}\|^{2}+\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k}\|^{2}
\displaystyle\leq 2λf~(𝒙1k+1,,𝒙i1k+1,𝒙^ik+1,𝒙i+1k,,𝒙pk;𝓧k)\displaystyle\frac{2}{\lambda}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},\hat{\bm{x}}_{i}^{k+1},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}}^{k}) (by (5))
\displaystyle\leq 2λf~(𝒙1k+1,,𝒙i1k+1,𝟎,𝒙i+1k,,𝒙pk;𝓧k)\displaystyle\frac{2}{\lambda}\tilde{f}({\bm{x}}_{1}^{k+1},\ldots,{\bm{x}}_{i-1}^{k+1},{\bm{0}},{\bm{x}}_{i+1}^{k},\ldots,{\bm{x}}_{p}^{k};\bm{\mathcal{X}}^{k}) (by optimality)
=\displaystyle= 1λ(1mkl=1mk𝒳k,l2)+j=1i1𝒙jk+12+j=i+1p𝒙jk2,\displaystyle\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)+\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k}\|^{2},

so that

(29) 𝒙^ik+11λ(1mkl=1mk𝒳k,l2)=:R(𝓧k).\|\hat{\bm{x}}_{i}^{k+1}\|\leq\sqrt{\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)}=:R(\bm{\mathcal{X}}^{k}).

To bound the iterates 𝐱ik+1{\bm{x}}_{i}^{k+1} note that Equation (13) can be rewritten as the convex combination

{\bm{x}}_{i}^{k+1}=(1-\alpha^{k,i}){\bm{x}}_{i}^{k}+\alpha^{k,i}\hat{\bm{x}}_{i}^{k+1},

which, by virtue of the stepsize bounds 0αk,i20\leq\alpha^{k,i}\leq 2, implies

\|{\bm{x}}_{i}^{k+1}\| \leq|1-\alpha^{k,i}|\|{\bm{x}}_{i}^{k}\|+|\alpha^{k,i}|\|\hat{\bm{x}}_{i}^{k+1}\|\leq\max\{\|{\bm{x}}_{i}^{k}\|,R(\bm{\mathcal{X}}^{k})\}(|1-\alpha^{k,i}|+|\alpha^{k,i}|)
max{𝒙ik,R(𝓧k)}max{𝒙ik1,R(𝓧k),R(𝓧k1)}\displaystyle\leq\max\{\|{\bm{x}}_{i}^{k}\|,R(\bm{\mathcal{X}}^{k})\}\leq\max\left\{\|{\bm{x}}_{i}^{k-1}\|,R(\bm{\mathcal{X}}^{k}),R(\bm{\mathcal{X}}^{k-1})\right\}\leq\ldots
max{𝒙i1,R(𝓧k),R(𝓧k1),,R(𝓧1)}.\displaystyle\leq\max\left\{\|{\bm{x}}_{i}^{1}\|,R(\bm{\mathcal{X}}^{k}),R(\bm{\mathcal{X}}^{k-1}),\ldots,R(\bm{\mathcal{X}}^{1})\right\}.

Since the regularity estimates derived above all involve powers of 𝒙\|{\bm{x}}\| and of 𝓧k\|\bm{\mathcal{X}}^{k}\|, Lemma 3.15 suggests that a bound on the data 𝒳\mathcal{X} is sufficient to guarantee the regularity of the cost functional, gradient, and Newton steps that are necessary to show convergence. In the following we assume such a bound and pursue its consequences.

Assumption 3.16 (Bounded data).

There is a constant 0<M<\infty so that

(30) 𝒳M,a.s. on Ω.\|\mathcal{X}\|\leq M,\ \ a.s.\text{ on }\Omega.
Remark 3.17.

This assumption might conceivably be weakened to one pertaining to the statistical distribution of the maxima R(𝓧1),R(𝓧2),,R(𝓧k)R(\bm{\mathcal{X}}^{1}),R(\bm{\mathcal{X}}^{2}),\ldots,R(\bm{\mathcal{X}}^{k}). Specifically, letting rk=maxl=1,,kR(𝓧l)r_{k}=\max_{l=1,\ldots,k}R(\bm{\mathcal{X}}^{l}), it can be shown under appropriate conditions on the density of R(𝒳k)R(\mathcal{X}^{k}), that rkr_{k} converges in distribution to a random variable with known extreme value density. The analysis below will hold if it can be guaranteed that the limiting distribution has bounded moments of sufficiently high order. This possibility will be pursued in future work.

An immediate consequence of Assumption 3.16 is the existence of a uniform bound on the radius R(\bm{\mathcal{X}}) and hence on the iterates {\bm{x}}^{k,i}.

Corollary 3.18.

Given Assumption 3.16, there exist finite, non-negative constants M_{R}, M_{{\bm{x}}_{i}}, and M_{{\bm{x}}}, independent of {\bm{x}}^{k} and of \mathcal{X}^{k}, so that for all k=1,2,\ldots,

(31) R(\bm{\mathcal{X}}^{k})\leq M_{R}\qquad\text{a.s. on }\Omega,
(32) \|{\bm{x}}_{i}^{k}\|\leq M_{{\bm{x}}_{i}}\qquad\text{a.s. on }\Omega,
(33) \|{\bm{x}}^{k}\|\leq M_{{\bm{x}}}\qquad\text{a.s. on }\Omega.

Proof 3.19.

By (29) and Assumption 3.16,

R(𝓧k)=1λ(1mkl=1mk𝒳k,l2)Mλ=:MR.\displaystyle R(\bm{\mathcal{X}}^{k})=\sqrt{\frac{1}{\lambda}\left(\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\|\mathcal{X}^{k,l}\|^{2}\right)}\leq\frac{M}{\sqrt{\lambda}}=:M_{R}.

Lemma 3.15 then implies

𝒙ikmax{𝒙i1,MR}=:M𝒙i\|{\bm{x}}_{i}^{k}\|\leq\max\{\|{\bm{x}}_{i}^{1}\|,M_{R}\}=:M_{{\bm{x}}_{i}}

and hence

𝒙k=i=1p𝒙ik2pM𝒙i=:M𝒙.\|{\bm{x}}^{k}\|=\sqrt{\sum_{i=1}^{p}\|{\bm{x}}_{i}^{k}\|^{2}}\leq\sqrt{p}M_{{\bm{x}}_{i}}=:M_{{\bm{x}}}.

3.3 Convergence

We now consider the difference 𝜹k=𝒈~k𝒈k\bm{\delta}^{k}=\tilde{\bm{g}}^{k}-\bm{g}^{k} between the sampled and expected search directions. For the standard stochastic gradient descent algorithm, this stochastic quantity vanishes in expectation, given past information k1\mathcal{F}^{k-1}, i.e. 𝔼[𝜹k|k1]=0\mathbb{E}\left[{\bm{\delta}^{k}}|\mathcal{F}^{k-1}\right]=0, since 𝒈~k\tilde{\bm{g}}^{k} is an unbiased estimator of 𝒈k{\bm{g}}^{k}. For the stochastic alternating least squares method, this is no longer the case. Lemma 3.20 however uses the regularity of the gradient and the Hessian to establish an upper bound that decreases on the order O(1k)O(\frac{1}{k}) as kk\rightarrow\infty.

Lemma 3.20.

There is a constant MN0M_{N}\geq 0 such that for every k=1,2,k=1,2,... and i=1,,pi=1,...,p,

(34) 𝔼[(Hk,i)1𝜹k,i|k1]MNk,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\right\|\leq\frac{M_{N}}{k},

where MN=2cmaxPN(M𝐱,0,MR)PN(M𝐱,M𝐱i,MR)M𝐱M_{N}=2c_{\max}P_{N}(M_{{\bm{x}}},0,M_{R})P_{N}(M_{{\bm{x}}},M_{{\bm{x}}_{i^{*}}},M_{R})M_{{\bm{x}}}.

Proof 3.21.

Recall that the current iterate 𝐱k,i{\bm{x}}^{k,i} is statistically dependent on sampled tensors 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}, while its first ii components also depend on 𝓧k\bm{\mathcal{X}}^{k}. For the purpose of computing 𝔼[(Hk,i)1𝛅k,i|k1]\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right], we suppose that 𝓧1,,𝓧k1\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1} are known and write 𝐱k,i1=𝐱k,i1(𝓧k)=(𝐱1k+1(𝓧k),,𝐱i1k+1(𝓧k),𝐱ik,,𝐱pk){\bm{x}}^{k,i-1}={\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})=({\bm{x}}_{1}^{k+1}(\bm{\mathcal{X}}^{k}),\ldots,{\bm{x}}_{i-1}^{k+1}(\bm{\mathcal{X}}^{k}),{\bm{x}}_{i}^{k},\ldots,{\bm{x}}_{p}^{k}) to emphasize its dependence on 𝓧k\bm{\mathcal{X}}^{k}. Thus

𝔼[(Hk,i)1𝜹k,i|k1]=\displaystyle\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]=
Ωmk(𝒙i2f(𝒙k,i1(𝓧k)))1(𝒙if~(𝒙k,i1(𝓧k);𝓧k)𝒙iF(𝒙k,i1(𝓧k)))𝑑μ𝓧k.\displaystyle\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\left(\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}}^{k})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)d\mu_{\bm{\mathcal{X}}^{k}}.

By definition, and since 𝐱if~\nabla_{{\bm{x}}_{i}}\tilde{f} is an unbiased estimator of 𝐱iF\nabla_{{\bm{x}}_{i}}F, we have

𝒙iF(𝒙k,i1(𝓧k))\displaystyle\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})) =Ωxif(𝒙k,i1(𝓧k);𝒳)𝑑μ𝒳\displaystyle=\int_{\Omega}\nabla_{x_{i}}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\mathcal{X})d\mu_{\mathcal{X}}
=Ωmkxif~(𝒙k,i1(𝓧k);𝓧)𝑑μ𝓧.\displaystyle=\int_{\Omega^{m_{k}}}\nabla_{x_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}})d\mu_{\bm{\mathcal{X}}}.

Moreover, since 𝓧\bm{\mathcal{X}} and 𝓧k\bm{\mathcal{X}}^{k} are identically distributed,

Ωmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙iF(𝒙k,i1(𝓧k))𝑑μ𝓧k\displaystyle\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))d\mu_{\bm{\mathcal{X}}^{k}}
=\displaystyle= ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙if~(𝒙k,i1(𝓧k);𝓧)𝑑μ𝓧𝑑μ𝓧k\displaystyle\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}
=\displaystyle= ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧)))1𝒙if~(𝒙k,i1(𝓧);𝓧k)𝑑μ𝓧𝑑μ𝓧k.\displaystyle\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}.

Therefore

(35) 𝔼[(Hk,i)1𝜹k,i|k1]=ΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧k)))1𝒙if~(𝒙k,i1(𝓧k);𝓧k)𝑑μ𝓧𝑑μ𝓧kΩmkΩmk(𝒙i2f(𝒙k,i1(𝓧)))1𝒙if~(𝒙k,i1(𝓧);𝓧k)𝑑μ𝓧𝑑μ𝓧k.\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\\ =\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}\\ -\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}}^{k,i-1}(\bm{\mathcal{X}}))\right)^{-1}\nabla_{{\bm{x}}_{i}}\tilde{f}({\bm{x}}^{k,i-1}(\bm{\mathcal{X}});\bm{\mathcal{X}}^{k})d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}}.

In the special case i=1, the iterate {\bm{x}}^{k,0}={\bm{x}}^{k}, and hence H^{k,1}, does not depend on \bm{\mathcal{X}}^{k}. Since \nabla_{{\bm{x}}_{1}}\tilde{f}({\bm{x}}^{k};\bm{\mathcal{X}}^{k}) is an unbiased estimator of \nabla_{{\bm{x}}_{1}}F({\bm{x}}^{k}), we have

𝔼[(Hk,1)1𝜹k,1|k1]=0.\mathbb{E}\left[(H^{k,1})^{-1}{\bm{\delta}^{k,1}}|\mathcal{F}^{k-1}\right]=0.

We now consider the case i=2,,pi=2,...,p. Using the Lipschitz continuity of the mapping 𝐱(𝐱i2f(𝐱))1𝐱if(𝐱;𝓧k){\bm{x}}\mapsto\left(\nabla_{{\bm{x}}_{i}}^{2}f({\bm{x}})\right)^{-1}\nabla_{{\bm{x}}_{i}}f({\bm{x}};\bm{\mathcal{X}}^{k}) (Lemma 3.13), the bounds in Corollary 3.18, and Jensen’s inequality, we obtain

𝔼[(Hk,i)1𝜹k,i|k1]PN(M𝒙,M𝒙i,M)ΩmkΩmk𝒙k,i1(𝓧k)𝒙k,i1(𝓧)𝑑μ𝓧𝑑μ𝓧k,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}^{k,i}}|\mathcal{F}^{k-1}\right]\right\|\\ \leq P_{N}(M_{\bm{x}},M_{{\bm{x}}_{i^{*}}},M)\int_{\Omega^{m_{k}}}\int_{\Omega^{m_{k}}}\|{\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})-{\bm{x}}^{k,i-1}(\bm{\mathcal{X}})\|d\mu_{\bm{\mathcal{X}}}d\mu_{\bm{\mathcal{X}}^{k}},

the integrand of which can be bounded by

\|{\bm{x}}^{k,i-1}(\bm{\mathcal{X}}^{k})-{\bm{x}}^{k,i-1}(\bm{\mathcal{X}})\|\leq\sum_{j=1}^{i-1}\|{\bm{x}}_{j}^{k+1}(\bm{\mathcal{X}}^{k})-{\bm{x}}_{j}^{k+1}(\bm{\mathcal{X}})\|\\ =\sum_{j=1}^{i-1}\alpha^{k,j}\|(H^{k,j}(\bm{\mathcal{X}}^{k}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}}^{k})-(H^{k,j}(\bm{\mathcal{X}}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}})\|\\ \leq\frac{c_{\max}}{k}\sum_{j=1}^{i-1}\left(\|(H^{k,j}(\bm{\mathcal{X}}^{k}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}}^{k})\|+\|(H^{k,j}(\bm{\mathcal{X}}))^{-1}\tilde{\bm{g}}^{k,j}(\bm{\mathcal{X}})\|\right).

The result now follows by taking expectations and using (27) with \tilde{\bm{x}}=\bm{0}, in conjunction with (31) and (33).

Lemma 3.22.

If 𝐡{\bm{h}} is k1\mathcal{F}^{k-1}-measurable and 𝔼[𝐡]<\mathbb{E}[\|\bm{h}\|]<\infty then

(36) 𝔼[𝒉,(Hk,i)1𝜹k,i]MNk𝔼[𝒉].\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\right]\leq\frac{M_{N}}{k}\mathbb{E}\left[\|\bm{h}\|\right].

Proof 3.23.

Using the law of total expectation, the k1\mathcal{F}^{k-1}-measurability of 𝐡\bm{h}, and Lemma 3.20, we have

\displaystyle\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\right]=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\mathbb{E}\left[\left\langle{\bm{h}},(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right\rangle\middle|\mathcal{F}^{k-1}\right]\right]
\displaystyle=\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\left\langle\mathbb{E}\left[{\bm{h}}\middle|\mathcal{F}^{k-1}\right],\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\middle|\mathcal{F}^{k-1}\right]\right\rangle\right]
\displaystyle\leq\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\left\|\mathbb{E}\left[{\bm{h}}\middle|\mathcal{F}^{k-1}\right]\right\|\,\left\|\mathbb{E}\left[(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\middle|\mathcal{F}^{k-1}\right]\right\|\right]
\displaystyle\leq\mathbb{E}_{\bm{\mathcal{X}}^{1},\ldots,\bm{\mathcal{X}}^{k-1}}\left[\frac{M_{N}}{k}\,\|{\bm{h}}\|\right]=\frac{M_{N}}{k}\mathbb{E}\left[\|\bm{h}\|\right].

The main convergence theorem is based on the following lemma (for a proof, see, e.g., [23, Lemma A.5]).

Lemma 3.24.

Let {ak}k=1\{a_{k}\}_{k=1}^{\infty} and {bk}k=1\{b_{k}\}_{k=1}^{\infty} be any two nonnegative, real sequences so that (i) k=1ak=\sum_{k=1}^{\infty}a_{k}=\infty, (ii) k=1akbk<\sum_{k=1}^{\infty}a_{k}b_{k}<\infty, and (iii) there is a constant K>0K>0 so that |bk+1bk|Kak|b_{k+1}-b_{k}|\leq Ka_{k} for k1k\geq 1. Then limkbk=0\lim_{k\rightarrow\infty}b_{k}=0.

Theorem 3.25.

Let 𝐱k,i{\bm{x}}^{k,i} be the sequence generated by the Stochastic Alternating Least Squares (SALS) method outlined in Algorithm 2. Then

limk𝔼[𝒙iF(𝒙k,i)2]=0,i=1,,p.\lim_{k\rightarrow\infty}\mathbb{E}\left[\|\nabla_{\bm{x}_{i}}F({\bm{x}}^{k,i})\|^{2}\right]=0,\qquad i=1,...,p.

Proof 3.26.

We base the proof on Lemma 3.24 with a_{k}=\frac{1}{k} and b_{k}=\mathbb{E}\left[\|\bm{g}^{k,i}\|^{2}\right]. Clearly, Condition (i) in Lemma 3.24 is satisfied. To show that Condition (ii) holds, i.e. that \sum_{k=1}^{\infty}\frac{1}{k}\mathbb{E}\left[\|{\bm{g}}^{k,i}\|^{2}\right]<\infty for i=1,\ldots,p, we use the component-wise Taylor expansion (11) of the expected cost, centered at the iterate {\bm{x}}^{k,i-1}, together with the SALS update given in (13), to express the decrease in the expected cost as

\displaystyle F({\bm{x}}^{k,i})-F({\bm{x}}^{k,i-1})
\displaystyle=\nabla_{\bm{x}_{i}}F({\bm{x}}^{k,i-1})^{T}({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})+\frac{1}{2}({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})^{T}\nabla_{{\bm{x}}_{i}}^{2}F({\bm{x}}^{k,i-1})({\bm{x}}^{k,i}-{\bm{x}}^{k,i-1})
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}+\frac{1}{2}(\alpha^{k,i})^{2}((H^{k,i})^{-1}\tilde{\bm{g}}^{k,i})^{T}H^{k,i}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\bm{g}^{k,i}-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i}+\frac{1}{2}(\alpha^{k,i})^{2}(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}.

Since, by (32) and Corollary 3.11,

αk,i(𝒈k,i)T(Hk,i)1𝒈k,icmink1λ+M2f𝒈k,i2,\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{g}^{k,i}\geq\frac{c_{\min}}{k}\frac{1}{\lambda+M_{\nabla^{2}f}}\|\bm{g}^{k,i}\|^{2},

where M2f=PLi(M𝐱i)M_{\nabla^{2}f}=P_{L_{i}}(M_{{\bm{x}}_{i}^{*}}), the above expression can be rearranged as

(37) 1k𝒈k,i2\displaystyle\frac{1}{k}\|{\bm{g}}^{k,i}\|^{2} λ+M2fcmin(E1k,i+E2k,i+E3k,i),with\displaystyle\leq\frac{\lambda+M_{\nabla^{2}f}}{c_{\min}}\left(E_{1}^{k,i}+E_{2}^{k,i}+E_{3}^{k,i}\right),\quad\text{with}
E1k,i=F(𝒙k,i1)F(𝒙k,i),\displaystyle E_{1}^{k,i}=F({\bm{x}}^{k,i-1})-F({\bm{x}}^{k,i}),
\displaystyle E_{2}^{k,i}=-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i},\ \text{and}
E3k,i=1k2cmax2(𝒈~k,i)T(Hk,i)1𝒈~k,i.\displaystyle E_{3}^{k,i}=\frac{1}{k^{2}}\frac{c_{\max}}{2}(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}.

Evidently, Condition (ii) of Lemma 3.24 holds as long as k=1𝔼[Ejk,i]<\sum_{k=1}^{\infty}\mathbb{E}\left[E_{j}^{k,i}\right]<\infty for j=1,2,3j=1,2,3. For a fixed K>0K>0, the first term E1k,iE_{1}^{k,i} generates a telescoping sum so that

k=1Ki=1p𝔼[E1k,i]=k=1K𝔼[F(𝒙k)]𝔼[F(𝒙k+1)]\displaystyle\sum_{k=1}^{K}\sum_{i=1}^{p}\mathbb{E}\left[E_{1}^{k,i}\right]=\sum_{k=1}^{K}\mathbb{E}\left[F({\bm{x}}^{k})\right]-\mathbb{E}\left[F({\bm{x}}^{k+1})\right]
(38) =\displaystyle=\; 𝔼[F(𝒙1)]𝔼[F(𝒙K+1)]𝔼[F(𝒙1)]<K>0,\displaystyle\mathbb{E}\left[F({\bm{x}}^{1})\right]-\mathbb{E}\left[F({\bm{x}}^{K+1})\right]\leq\mathbb{E}\left[F({\bm{x}}^{1})\right]<\infty\qquad\forall K>0,

since F(𝐱K+1)0F({\bm{x}}^{K+1})\geq 0. Consider

E2k,i\displaystyle E_{2}^{k,i} =αk,i(𝒈k,i)T(Hk,i)1𝜹k,i\displaystyle=-\alpha^{k,i}(\bm{g}^{k,i})^{T}(H^{k,i})^{-1}\bm{\delta}^{k,i}
\displaystyle=-\alpha^{k,i}({\bm{g}}^{k,i}-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}))^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}-\alpha^{k,i}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}.

Since 𝐱iF(𝐱k)\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}) is k1\mathcal{F}^{k-1}-measurable, Lemma 3.22 implies that the expectation of the second term can be bounded as follows

\mathbb{E}\left[-\alpha^{k,i}\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\right]\leq\frac{M_{\nabla f}M_{N}}{k^{2}},

where Mf=PL(M𝐱,0,M)M𝐱iM_{\nabla f}=P_{L}(M_{{\bm{x}}},0,M)M_{{\bm{x}}_{i}} by Corollaries 3.9 and 3.18. The first term satisfies

αk,i(𝒙iF(𝒙k,i)𝒙iF(𝒙k))T(Hk,i)1𝜹k,i\displaystyle-\alpha^{k,i}(\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k}))^{T}(H^{k,i})^{-1}{\bm{\delta}}^{k,i}
\displaystyle\leq\; αk,i𝒙iF(𝒙k,i)𝒙iF(𝒙k)(Hk,i)1𝜹k,i\displaystyle\alpha^{k,i}\|\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k,i})-\nabla_{{\bm{x}}_{i}}F({\bm{x}}^{k})\|\|(H^{k,i})^{-1}{\bm{\delta}}^{k,i}\|
\displaystyle\leq\; αk,iMNML𝒙k,i𝒙k\displaystyle\alpha^{k,i}M_{N}M_{L}\|{\bm{x}}^{k,i}-{\bm{x}}^{k}\|
=\displaystyle=\; MNMLαk,ij=1i(αk,j)2(Hk,j)1𝒈~k,j2\displaystyle M_{N}M_{L}\alpha^{k,i}\sqrt{\sum_{j=1}^{i}(\alpha^{k,j})^{2}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; p(MN)2ML(cmaxk)2,\displaystyle p(M_{N})^{2}M_{L}\left(\frac{c_{\max}}{k}\right)^{2},

where ML=PL(M𝐱,M𝐱i,M)M_{L}=P_{L}(M_{\bm{x}},M_{{\bm{x}}_{i^{*}}},M) by Lemmas 3.5 and 3.18. Combining these two bounds yields

(39) k=1i=1p𝔼[E2k,i](p2(cmaxMN)2ML+pMfMN)k=11k2<.\sum_{k=1}^{\infty}\sum_{i=1}^{p}\mathbb{E}\left[E_{2}^{k,i}\right]\leq\left(p^{2}(c_{\max}M_{N})^{2}M_{L}+pM_{\nabla f}M_{N}\right)\sum_{k=1}^{\infty}\frac{1}{k^{2}}<\infty.

Finally, we use Corollaries 3.9, 3.11, and 3.18 to bound

𝔼[(𝒈~k,i)T(Hk,i)1𝒈~k,i]\displaystyle\mathbb{E}\left[(\tilde{\bm{g}}^{k,i})^{T}(H^{k,i})^{-1}\tilde{\bm{g}}^{k,i}\right]\leq 1λ𝔼[𝒈~k,i2](Mf)2λ,\displaystyle\frac{1}{\lambda}\mathbb{E}\left[\|\tilde{\bm{g}}^{k,i}\|^{2}\right]\leq\frac{(M_{\nabla f})^{2}}{\lambda},

so that

(40) k=1i=1p𝔼[E3k,i]pcmax(Mf)22λk=11k2<.\sum_{k=1}^{\infty}\sum_{i=1}^{p}\mathbb{E}\left[E_{3}^{k,i}\right]\leq\frac{pc_{\max}(M_{\nabla f})^{2}}{2\lambda}\sum_{k=1}^{\infty}\frac{1}{k^{2}}<\infty.

By virtue of Inequality (37), the upper bounds (38), (39), and (40) now imply that Condition (ii) of Lemma 3.24 holds. It remains to show that Condition (iii) of Lemma 3.24 holds, i.e. that \mathbb{E}\left[\|{\bm{g}}^{k+1,i}\|^{2}\right]-\mathbb{E}\left[\|{\bm{g}}^{k,i}\|^{2}\right]=O(1/k) as k\rightarrow\infty for all i=1,\ldots,p. By the reverse triangle inequality and the Lipschitz continuity and boundedness of the gradient,

𝒈k+1,i2𝒈k,i2(𝒈k+1,i+𝒈k,i)(𝒈k+1,i𝒈k,i)\displaystyle\|{\bm{g}}^{k+1,i}\|^{2}-\|{\bm{g}}^{k,i}\|^{2}\leq(\|{\bm{g}}^{k+1,i}\|+\|{\bm{g}}^{k,i}\|)(\|{\bm{g}}^{k+1,i}-{\bm{g}}^{k,i}\|)
\displaystyle\leq\; 2MfML𝒙k+1,i𝒙k,i\displaystyle 2M_{\nabla f}M_{L}\|{\bm{x}}^{k+1,i}-{\bm{x}}^{k,i}\|
=\displaystyle=\; 2MfMLj=1i𝒙jk+2𝒙jk+12+j=i+1p𝒙jk+1𝒙jk2\displaystyle 2M_{\nabla f}M_{L}\sqrt{\sum_{j=1}^{i}\|{\bm{x}}_{j}^{k+2}-{\bm{x}}_{j}^{k+1}\|^{2}+\sum_{j=i+1}^{p}\|{\bm{x}}_{j}^{k+1}-{\bm{x}}_{j}^{k}\|^{2}}
=\displaystyle=\; 2MfMLj=1i(αk+1,j)2(Hk+1,j)1𝒈~k+1,j2+j=i+1p(αk,j)2(Hk,j)1𝒈~k,j2\displaystyle 2M_{\nabla f}M_{L}\sqrt{\sum_{j=1}^{i}\left(\alpha^{k+1,j}\right)^{2}\|(H^{k+1,j})^{-1}\tilde{\bm{g}}^{k+1,j}\|^{2}+\sum_{j=i+1}^{p}\left(\alpha^{k,j}\right)^{2}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; 2MfMLcmaxkj=1i(Hk+1,j)1𝒈~k+1,j2+j=i+1p(Hk,j)1𝒈~k,j2\displaystyle 2M_{\nabla f}M_{L}\frac{c_{\max}}{k}\sqrt{\sum_{j=1}^{i}\|(H^{k+1,j})^{-1}\tilde{\bm{g}}^{k+1,j}\|^{2}+\sum_{j=i+1}^{p}\|(H^{k,j})^{-1}\tilde{\bm{g}}^{k,j}\|^{2}}
\displaystyle\leq\; (2pcmaxMfMLMN)1k.\displaystyle(2pc_{\max}M_{\nabla f}M_{L}M_{N})\frac{1}{k}.

4 Numerical Experiments

In this section we detail the implementation of the SALS algorithm, discuss its computational complexity and perform a variety of numerical experiments to explore its properties and validate our theoretical results.

We invoke the matricized form (9) of the component-wise minimization problem for the sake of computational and storage efficiency. In particular, at the k-th block iteration, we compute the sample/batch tensor average \tilde{\mathcal{X}}^{k}=\frac{1}{m_{k}}\sum_{l=1}^{m_{k}}\mathcal{X}^{k,l}. Cycling through the components i=1,\ldots,p, we then compute the component-wise minimizer as the matrix \hat{A}_{i}^{k+1} satisfying the matrix equation

(41) 𝒳~(i)kΘi=A^ik+1(ΘiTΘi+λIr).\tilde{\mathcal{X}}_{(i)}^{k}\Theta_{i}=\hat{A}_{i}^{k+1}(\Theta_{i}^{T}\Theta_{i}+\lambda I_{r}).

Finally, we update the ii-th component factor matrix AikA_{i}^{k} using the relaxation step

Aik+1=αk,iA^ik+1+(1αk,i)Aik.A_{i}^{k+1}=\alpha^{k,i}\hat{A}_{i}^{k+1}+(1-\alpha^{k,i})A_{i}^{k}.

In all our numerical experiments, we use step sizes of the form αk,i=1k\alpha^{k,i}=\frac{1}{k}, i.e. ck,i=1c^{k,i}=1 for i=1,,pi=1,\ldots,p and k=1,2,k=1,2,\ldots.
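
To make the implementation concrete, the following is a minimal NumPy sketch of one such block iteration for a three-way tensor. It is meant only to illustrate the update (41) and the relaxation step above under the stated step-size rule, not to reproduce the code used for the experiments below; the names sals_sweep, X_batch, and factors are ours.

import numpy as np

def sals_sweep(X_batch, factors, lam, k, c=1.0):
    # One block iteration (the k-th) of SALS for a 3-way tensor.
    # X_batch: list of sampled tensors X^{k,l} of identical shape.
    # factors: list [A_1, A_2, A_3] of factor matrices, each with r columns.
    X_tilde = sum(X_batch) / len(X_batch)        # batch average
    alpha = c / k                                # step size alpha^{k,i} = c^{k,i}/k
    r = factors[0].shape[1]
    # einsum strings giving the MTTKRP for modes i = 1, 2, 3
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    for i in range(3):
        others = [factors[j] for j in range(3) if j != i]
        # Theta_i^T Theta_i is the Hadamard product of the Gram matrices
        gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
        rhs = np.einsum(specs[i], X_tilde, *others)                 # matricized tensor times Theta_i
        A_hat = np.linalg.solve(gram + lam * np.eye(r), rhs.T).T    # solve Eq. (41)
        factors[i] = alpha * A_hat + (1 - alpha) * factors[i]       # relaxation step
    return factors

A full run simply calls sals_sweep once per incoming batch with k incremented each time, which mirrors the 1/k step-size rule used throughout our experiments.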

Example 4.1 (Convergence).

In our first example, we apply the SALS algorithm to a 3-dimensional random tensor of size (30,40,50) constructed from a deterministic tensor \mathbb{E}[\mathcal{X}] with a known decomposition. We generate random samples by perturbing each entry of \mathbb{E}[\mathcal{X}] with uniformly distributed noise U\sim\mathrm{UNIF}(-\delta,\delta). We measure the accuracy at each step via the relative residual \sqrt{\mathbb{E}[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}]}/\sqrt{\mathbb{E}[\|\mathcal{X}\|^{2}]}, which can be computed exactly, since \mathbb{E}[\mathcal{X}] and \mathbb{E}[\|\mathcal{X}\|^{2}] are known. Figure 1 shows that when the noise parameter \delta is zero and the step size is constant, we recover the q-linear convergence of the regularized ALS method. In the presence of noise, however, the convergence rate decreases.

(a) Semi-log plot of the error in decomposing \mathbb{E}[\mathcal{X}] using ALS, with noise parameter \delta=0 and n_{1}=30,n_{2}=40,n_{3}=50.
(b) Log-log plot of the error using SALS with constant and variable step sizes, with noise parameter \delta=1.
Figure 1: The influence of noise on the ALS iteration.
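
To indicate how such an experiment can be set up, here is a small sketch (ours, not the script behind Figure 1) that builds \mathbb{E}[\mathcal{X}] from a known rank-r decomposition, draws uniformly perturbed samples, and evaluates the residual above exactly; for entrywise \mathrm{UNIF}(-\delta,\delta) noise one has \mathbb{E}[\|\mathcal{X}-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}]=\|\mathbb{E}[\mathcal{X}]-\left\llbracket{\bm{x}}\right\rrbracket\|^{2}+N\delta^{2}/3, with N the number of entries.

import numpy as np

rng = np.random.default_rng(0)
shape, r, delta = (30, 40, 50), 5, 1.0
N = np.prod(shape)

# deterministic tensor E[X] with a known rank-r decomposition
A, B, C = (rng.standard_normal((n, r)) for n in shape)
EX = np.einsum('ir,jr,kr->ijk', A, B, C)

def sample():
    # one noisy observation X = E[X] + U with U ~ UNIF(-delta, delta) entrywise
    return EX + rng.uniform(-delta, delta, size=shape)

def residual(factors):
    # exact relative residual sqrt(E||X - [[x]]||^2) / sqrt(E||X||^2)
    T = np.einsum('ir,jr,kr->ijk', *factors)
    noise = N * delta ** 2 / 3.0           # E||U||^2 for uniform noise
    return np.sqrt((np.linalg.norm(EX - T) ** 2 + noise)
                   / (np.linalg.norm(EX) ** 2 + noise))

Feeding batches of sample() into a routine such as sals_sweep above and recording residual(factors) after each sweep produces error curves of the kind shown in Figure 1.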

Computational Complexity

The computational effort of each block iteration of the SALS algorithm is determined by the cost of forming and solving Equation (41) for each component. Recall that rr is the specified rank, pp the tensor’s dimension, and n=i=1pnin=\sum_{i=1}^{p}n_{i} where nin_{i} is the size of the ii-th vector component of each outer product in the expansion. Further, let N=i=1pniN=\prod_{i=1}^{p}n_{i} be the total number of entries in 𝒳\mathcal{X} and nnz(𝒳)\mathrm{nnz}(\mathcal{X}) be the number of non-zero entries in 𝒳\mathcal{X}. It can then readily be seen that the cost of forming the coefficient matrix ΘiTΘi+λIr\Theta_{i}^{T}\Theta_{i}+\lambda I_{r} is O(pr2n)O(pr^{2}n), that of computing the matricized tensor times Khatri-Rao product (MTTKRP) 𝒳~(i)kΘi\tilde{\mathcal{X}}_{(i)}^{k}\Theta_{i} is O(prnnz(𝒳~(i)k))O(pr\cdot\mathrm{nnz}(\tilde{\mathcal{X}}_{(i)}^{k})), and that of solving the resulting dense symmetric system is O(pr3)O(pr^{3}). If rNr\ll N, the computational cost of the MTTKRP dominates, especially as the density of 𝒳~(i)k\tilde{\mathcal{X}}_{(i)}^{k} increases and hence nnz(𝒳~(i)k)N\mathrm{nnz}(\tilde{\mathcal{X}}_{(i)}^{k})\rightarrow N.
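
The leading-order terms just described can be collected in a small, purely illustrative cost model (constants and lower-order terms are ignored; the function name block_iteration_cost is ours):

def block_iteration_cost(p, r, n_sizes, nnz):
    # leading-order operation counts for one SALS block iteration
    # p: tensor order, r: target rank, n_sizes: (n_1, ..., n_p),
    # nnz: number of nonzeros in the averaged tensor
    n = sum(n_sizes)
    gram = p * r ** 2 * n      # forming Theta_i^T Theta_i + lambda I_r for all modes
    mttkrp = p * r * nnz       # matricized tensor times Khatri-Rao products
    solve = p * r ** 3         # dense symmetric solves
    return gram + mttkrp + solve

For the (30,40,50) tensors used below with a small rank and a dense average, the MTTKRP term already accounts for the bulk of the work in this model.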

It is often claimed that single-sample stochastic gradient methods, i.e. those using m_{k}=1 for k=1,2,\ldots, make more efficient use of samples than their batch counterparts, i.e. m_{k}=m>1 for k=1,2,\ldots. Below, we investigate this claim for the SALS method. For simplicity, assume that each sample tensor \mathcal{X}^{l} has a fixed fraction \gamma\in[0,1] of zero entries, so that \mathrm{nnz}(\mathcal{X}^{l})=N(1-\gamma). Barring cancellation and assuming a uniform, independent distribution of the nonzero entries, we estimate \mathrm{nnz}(\tilde{\mathcal{X}})=N(1-\gamma^{m}), where m is the sample size. Adding the cost O((m-1)N(1-\gamma)) of forming the average \tilde{\mathcal{X}} then gives the computational cost of each block iteration as

(42) C(m)=N(1γm)+(m1)N(1γ).C(m)=N(1-\gamma^{m})+(m-1)N(1-\gamma).

To compare the asymptotic efficiency of the single-sample SALS method with that of a batch method, we fix a computational budget CmaxC_{\max} and investigate the estimated error achieved by each approach. Note that once mm is fixed, the number of iterations is determined by the budget as k=Cmax/C(m)k=C_{\max}/C(m). We first consider the case m=1m=1 and let 𝒙=argminF(𝒙){\bm{x}}_{*}=\operatorname*{argmin}F({\bm{x}}) be the overall minimizer. If we assume the convergence rate to be sublinear, i.e. the error satisfies

𝔼[F(𝒙k)F(𝒙)]=O(1kβ), for some β>0,\mathbb{E}\left[F({\bm{x}}_{k})-F({\bm{x}}_{*})\right]=O\left(\frac{1}{k^{\beta}}\right),\quad\text{ for some }\beta>0,

then the accuracy of the SALS method with a budget of CmaxC_{\max} is on the order of

(43) E(1)=O((CmaxC(1))β)=O((Cmaxnnz(𝒳~))β).E(1)=O\left(\left(\frac{C_{\max}}{C(1)}\right)^{-\beta}\right)=O\left(\left(\frac{C_{\max}}{\mathrm{nnz}(\tilde{\mathcal{X}})}\right)^{-\beta}\right).

Now suppose m>1. Obtaining the best possible accuracy of the batch SALS method (with budget C_{\max}) requires the optimal combination of batch size m and number of iterations k. Let {\bm{x}}_{*}^{m}=\operatorname*{argmin}\tilde{f}_{m}({\bm{x}}) be the batch minimizer, where \tilde{f}_{m}({\bm{x}})=\frac{1}{m}\sum_{l=1}^{m}f({\bm{x}},\mathcal{X}^{l}), and let {\bm{x}}_{k} be the terminal iterate. The overall error satisfies

\mathbb{E}\left[F({\bm{x}}_{k})-F({\bm{x}}_{*})\right]\leq\mathbb{E}\left[F({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{k})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{*}^{m})\right]\\ +\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*}^{m})-\tilde{f}_{m}({\bm{x}}_{*})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*})-F({\bm{x}}_{*})\right].

Since \tilde{f}_{m}({\bm{x}}_{*}^{m})\leq\tilde{f}_{m}({\bm{x}}_{*}), we can bound the error above by the sum of a sampling error \mathbb{E}\left[F({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{k})\right]+\mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{*})-F({\bm{x}}_{*})\right] and an optimization error \mathbb{E}\left[\tilde{f}_{m}({\bm{x}}_{k})-\tilde{f}_{m}({\bm{x}}_{*}^{m})\right]. Assuming the asymptotic behavior of the sampling error to be O(1/\sqrt{m}) and using the q-linear convergence rate O(\rho^{k}), for some \rho\in[0,1), of ALS (see [6]) as an optimistic estimate, the total error takes the form O(1/\sqrt{m}+\rho^{k}). Under the budget constraint, the error can thus be written as

(44) E(m)=O(1m+ρCmaxC(m)).E(m)=O\left(\frac{1}{\sqrt{m}}+\rho^{\frac{C_{\max}}{C(m)}}\right).

For large, high-dimensional tensors, the number of iterations C_{\max}/C(m) allowable under the budget decreases rapidly as m grows, so that the second term dominates the optimal error in Equation (44), whereas for smaller problems the optimal error is dominated by the Monte-Carlo sampling error 1/\sqrt{m}.
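
This trade-off can be explored numerically with the following sketch of the models (42), (43), and (44). Here gamma follows the convention above (the fraction of zero entries per sample), and the values of beta, rho, gamma, and the budget are illustrative placeholders rather than quantities estimated from our experiments.

import numpy as np

def C(m, N, gamma):
    # per-iteration cost model (42): work on the average plus the cost of averaging
    return N * (1.0 - gamma ** m) + (m - 1) * N * (1.0 - gamma)

def E_single(C_max, N, gamma, beta=1.0):
    # error model (43) for the single-sample method with sublinear rate 1/k^beta
    return (C_max / C(1, N, gamma)) ** (-beta)

def E_batch(m, C_max, N, gamma, rho=0.9):
    # error model (44): Monte-Carlo sampling error plus q-linear optimization error
    return 1.0 / np.sqrt(m) + rho ** (C_max / C(m, N, gamma))

# compare batch sizes under a fixed budget for a 30 x 40 x 50 tensor
N, gamma, C_max = 30 * 40 * 50, 0.9, 1.0e7
for m in (1, 10, 100, 1000):
    err = E_single(C_max, N, gamma) if m == 1 else E_batch(m, C_max, N, gamma)
    print(m, err)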

Example 4.2 (Sparsity).

To compare the computational complexity of the single-sample SALS method, i.e. m_{k}=1, with that of the batch SALS method, i.e. m_{k}=m, we form a sequence of three-dimensional random sparse tensors with n_{1}=30, n_{2}=40, and n_{3}=50 and a fixed fraction \gamma\in(0,1) of nonzero entries. For these tensors the first and second moments are known, and hence residuals can be computed accurately. In Figure 2 we verify that an increase in the batch size m deteriorates the sparsity of the averaged tensor and lowers the sampling error at the predicted rate.

(a) Deterioration of sparsity due to averaging.
(b) Log-log plot of the sampling error.
Figure 2: The effects of the batch size m on sparsity (left) and on the sampling error in the estimation of moments (right) for a (30,40,50) random tensor with sparsity parameter \gamma=0.1.

Figure 3 shows the efficiency of the SALS algorithm with batch sizes m=1,10,100, and 1000. To this end, we fix a computational budget of 10{,}000 total samples, which determines the number of block iterations for each batch size m. Clearly, the SALS method that uses a single sample per block iteration makes far more efficient use of samples than the variants with larger batch sizes.

(a) Log-log plot of the residual error as a function of the total number of samples.
(b) Log-log plot of the residual error as a function of computational cost.
Figure 3: The efficiency of the SALS method for batch sizes m=1,10,100,1000.
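
A driver for this comparison might look as follows; this is a sketch that assumes the sals_sweep, sample, and residual helpers from the earlier sketches are in scope, and the budget, rank, and regularization parameter are chosen purely for illustration.

import numpy as np

budget = 10_000                              # total number of samples available
for m in (1, 10, 100, 1000):
    init = np.random.default_rng(1)
    factors = [init.standard_normal((n, 5)) for n in (30, 40, 50)]
    for k in range(1, budget // m + 1):      # number of sweeps fixed by the budget
        batch = [sample() for _ in range(m)]
        factors = sals_sweep(batch, factors, lam=1.0, k=k)
    print(m, residual(factors))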

5 Conclusion

In this work, we showed that the SALS method is globally convergent and demonstrated the benefits of stochastic-gradient-based approaches, especially for the decomposition of large, sparse tensors. Our analysis focused on regularization in the Frobenius norm; we have not included a discussion of related proximal point algorithms or other regularization approaches (see, e.g., [21, 30, 31]). Another interesting avenue of exploration relates to the choice of cost functional in tensor decomposition. Even though the quadratic structure afforded by the Frobenius norm is essential to the effectiveness of the ALS method, the standard statistical basis of comparison, namely the expectation, could potentially be replaced by other statistical metrics, such as those used in risk-averse optimization methods. Finally, we foresee the application and extension of the SALS method to the analysis and decomposition of real data sets.

References

  • [1] E. Acar, D. M. Dunlavy, and T. G. Kolda, A scalable optimization approach for fitting canonical tensor decompositions, Journal of Chemometrics, 25 (2011), pp. 67–86.
  • [2] E. S. Allman and J. A. Rhodes, The identifiability of tree topology for phylogenetic models, including covarion and mixture models, Journal of Computational Biology, 13 (2006), pp. 1101–1113.
  • [3] C. Battaglino, G. Ballard, and T. G. Kolda, A practical randomized CP tensor decomposition, SIAM Journal on Matrix Analysis and Applications, 39 (2018), pp. 876–901.
  • [4] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037–2060.
  • [5] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Inc., USA, 1989.
  • [6] J. C. Bezdek and R. J. Hathaway, Convergence of alternating optimization, Neural, Parallel Sci. Comput., 11 (2003), pp. 351–368.
  • [7] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60 (2018), pp. 223–311.
  • [8] G. Bouchard, J. Naradowsky, S. Riedel, T. Rocktäschel, and A. Vlachos, Matrix and tensor factorization methods for natural language processing, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts, Beijing, China, July 2015, Association for Computational Linguistics, pp. 16–18.
  • [9] M. Boussé, O. Debals, and L. De Lathauwer, A tensor-based method for large-scale blind source separation using segmentation, IEEE Transactions on Signal Processing, 65 (2017), pp. 346–358.
  • [10] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition, Psychometrika, 35 (1970), pp. 283–319.
  • [11] D. Cheng, R. Peng, I. Perros, and Y. Liu, Spals: Fast alternating least squares via implicit leverage scores sampling, in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, 2016, Curran Associates Inc., p. 721–729.
  • [12] N. Eriksson, Algebraic statistics for computational biology, vol. 13, Cambridge University Press, 2005, ch. Tree construction using singular value decomposition, pp. 347–358.
  • [13] N. K. M. Faber, R. Bro, and P. K. Hopke, Recent developments in CANDECOMP/PARAFAC algorithms: a critical review, Chemometrics and Intelligent Laboratory Systems, 65 (2003), pp. 119–137.
  • [14] L. Grippo and M. Sciandrone, Globally convergent block-coordinate techniques for unconstrained optimization, Optimization Methods and Software, 10 (1999), pp. 587–637.
  • [15] C. J. Hillar and L.-H. Lim, Most tensor problems are NP-hard, J. ACM, 60 (2013). arXiv:0911.1393.
  • [16] E. Hille and R. S. Phillips, Functional analysis and semi-groups, American Mathematical Society, Providence, R. I. Third printing of the revised edition of 1957, American Mathematical Society Colloquium Publications, Vol. XXXI.
  • [17] M. Ishteva, H. Park, and L. Song, Unfolding latent tree structures using 4th order tensors, in International Conference on Machine Learning, 2013, pp. 316–324.
  • [18] C. G. Khatri and C. R. Rao, Solutions to some functional equations and their applications to characterization of probability distributions, Sankhya: The Indian Journal of Statistics, Series A (1961-2002), 30 (1968), pp. 167–180.
  • [19] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Review, 51 (2009), pp. 455–500.
  • [20] T. G. Kolda and D. Hong, Stochastic gradients for large-scale tensor decomposition.
  • [21] N. Li, S. Kindermann, and C. Navasca, Some convergence results on the regularized alternating least-squares method for tensor decomposition, Linear Algebra and its Applications, 438 (2013), pp. 796–812.
  • [22] T. Maehara, K. Hayashi, and K.-i. Kawarabayashi, Expected tensor decomposition with stochastic gradient descent, in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, AAAI Press, 2016, pp. 1919–1925.
  • [23] J. Mairal, Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization, in NIPS 2013 - Advances in Neural Information Processing Systems, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, eds., Advances in Neural Information Processing Systems 26, South Lake Tahoe, United States, Dec. 2013, pp. 2283–2291.
  • [24] V. A. Morozov, Methods for Solving Incorrectly Posed Problems, Springer New York, 1984.
  • [25] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22 (2012), pp. 341–362.
  • [26] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Society for Industrial and Applied Mathematics, 2000.
  • [27] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22 (1951), pp. 400–407.
  • [28] N. Sidiropoulos, G. Giannakis, and R. Bro, Blind PARAFAC receivers for DS-CDMA systems, IEEE Trans. on Signal Processing, 48 (2000), pp. 810–823.
  • [29] A. Smilde, R. Bro, and P. Geladi, Multi-Way Analysis with Applications in the Chemical Sciences, John Wiley & Sons, Ltd, 2004.
  • [30] A. Uschmajew, Local convergence of the alternating least squares algorithm for canonical tensor approximation, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 639–652.
  • [31] Y. Xu and W. Yin, Block stochastic gradient iteration for convex and nonconvex optimization, SIAM Journal on Optimization, 25 (2015), pp. 1686–1716.
  • [32] Y. Xu and W. Yin, A globally convergent algorithm for nonconvex optimization based on block coordinate update, Journal of Scientific Computing, 72 (2017), pp. 700–734.