
Personalized Tucker Decomposition: Modeling Commonality and Peculiarity on Tensor Data

Jiuyun Hu1, Naichen Shi2, Raed Al Kontar2, Hao Yan1
1School of Computing and Augmented Intelligence
Arizona State University
2Department of Industrial & Operations Engineering
University of Michigan, Ann Arbor
Abstract

We propose personalized Tucker decomposition (perTucker) to address the limitations of traditional tensor decomposition methods in capturing heterogeneity across different datasets. perTucker decomposes tensor data into shared global components and personalized local components. We introduce a mode orthogonality assumption and develop a proximal gradient regularized block coordinate descent algorithm that is guaranteed to converge to a stationary point. By learning unique and common representations across datasets, we demonstrate perTucker’s effectiveness in anomaly detection, client classification, and clustering through a simulation study and two case studies on solar flare detection and tonnage signal classification.


Keywords: Tucker decomposition, Personalization, Heterogeneous data

1 Introduction

In recent years, tensor decomposition methods have grown rapidly, providing the ability to analyze and utilize high-dimensional data structures, which are essentially multi-dimensional arrays or matrices. These decompositions are powerful mathematical tools that facilitate the extraction of latent features and patterns from complex data, enabling efficient data representation, dimensionality reduction, compression, completion, noise removal, and prediction. Indeed, tensor decomposition has seen immense success across a wide variety of applications that include: natural image and video processing (Gatto et al.,, 2021; Momeni and Ebrahimkhanlou,, 2022), health care systems (Ren et al.,, 2022; Sandhu et al.,, 2018), point cloud data (Yan et al.,, 2019; Du et al.,, 2022) and manufacturing (Zhen et al.,, 2023; Yan et al.,, 2014), amongst many others.

Among the widely used techniques in this area, Tucker decomposition stands out as a prominent approach that has been successfully tested and deployed in various settings and applications (Kolda and Bader,, 2009; Zubair and Wang,, 2013; Li et al.,, 2020). The Tucker approach generalizes singular value decomposition (SVD) to higher-order tensors, providing a core tensor and a set of factor matrices that capture the interactions between dimensions (Tucker,, 1966). The factor matrices represent the underlying patterns and structures in the data, while the core tensor captures the interaction between these patterns. By analyzing factor matrices and the core tensor, one can identify and extract meaningful features that can be used for further analysis, such as anomaly detection (Yan et al.,, 2014) and process optimization (Yan et al.,, 2019).

Despite the efficacy of tensor decomposition methods, they assume that complex, heterogeneous data can be adequately represented by a single set of global factor matrices and a core tensor. This assumption may oversimplify the intrinsic disparities that exist when the datasets come from different sources, clients, or modalities, potentially compromising the accuracy of the resulting representations. In practice, nowadays, it is common to collect data across various edge devices, such as sensors and phones, which exhibit unique local data patterns due to various local conditions, system status, or data collection methodologies (Kontar et al.,, 2021).

Using a universal tensor decomposition strategy may not accurately capture these distinct data patterns, leading to suboptimal representations. An alternative strategy involves fitting a separate local tensor decomposition for each data source. However, this does not utilize the rich data available across sources and may excessively overfit the peculiarities of each dataset while neglecting the shared patterns or commonalities among the datasets. More importantly, both strategies overlook the opportunity to model heterogeneity across data sources and exploit this for improved downstream analysis, be it in prediction, clustering, classification, or anomaly detection.

For example, in the context of one of our case studies on tonnage signal monitoring processes, numerous sensors are employed to monitor the tonnage force at various locations within a production system. The data collected at each location features common patterns of normal signal variations and heterogeneous failure-related patterns. Here, the heterogeneous nature of the data and the presence of diverse failure patterns pose significant challenges for traditional tensor decomposition methods. Therefore, it is essential to develop a personalized tensor decomposition that can effectively capture and represent commonality and peculiarity across the data collected from each location. Consequently, these methods could reveal previously hidden patterns and relationships, allowing more effective data analysis, decision-making, and system optimization.

Inspired by a recent personalization technique for vector datasets coined as personalized principal component analysis (PCA) (Shi and Kontar,, 2022), we propose the personalized Tucker decomposition (perTucker) to decompose tensor data collected from different sources into shared global components and personalized components to capture the heterogeneity of the data. Global components model the common patterns shared across the datasets, while local components model the unique features of a specific dataset. At the heart of our approach is (i) a mode orthogonality constraint to distinguish global and local features and (ii) a proximal gradient-regularized block coordinate descent algorithm that operates within feasible regions of the constraint and can provably recover stationary solutions.

Using two case studies and simulated data, we highlight the ability of perTucker to benefit (i) anomaly detection as one can monitor changes in the local features to better (and faster) detect anomalies in data collected over time and (ii) classification & clustering as operating on local features may yield better statistical power than the raw data since differences are more explicit when global features are removed.

The remainder of the paper is organized as follows. Sec. 2 reviews relevant literature on tensor decomposition methods. Sec. 3 introduces perTucker, proposes an algorithm to estimate model parameters, proves convergence, and sheds light on potential applications of perTucker. Sec. 4 uses simulated data to highlight the advantageous properties of our model. Two case studies on solar flare detection and tonnage signal classification are then presented in Sec. 5. Finally, Sec. 6 concludes the paper with a discussion about open problems.

We note that hereon we will use data source, client and modality interchangeably to index the origin from which each dataset was created.

2 Literature Review

Various tensor decomposition methods have been proposed in the literature. Among them, Tucker (Tucker,, 1966) and CP decompositions (Hitchcock,, 1927) have received the most attention. They have been applied to both supervised and unsupervised learning tasks.

For unsupervised tasks, great emphasis was placed on anomaly detection and clustering. In anomaly detection, starting from the work of Nomikos and MacGregor, (1994), tensor-based detection has grown dramatically in the literature. Some examples include Li et al., (2011), where a robust tensor subspace learning algorithm is used for online anomaly detection, and Yan et al., (2014), which studied the relationship between Tucker decomposition, CP decomposition, multilinear principal component analysis, and tensor rank one decomposition and proposed monitoring statistics for each method. Interested readers are referred to Fanaee-T and Gama, (2016) for an overview of existing tensor-based methods for anomaly detection.

In clustering, various methods have been proposed to improve the accuracy and efficiency of clustering algorithms. These methods include tensor-based subspace clustering (Fu et al.,, 2016), multi-view clustering (Zhang et al.,, 2023; Li et al.,, 2023), and multi-mode clustering (He and Atia,, 2022).

Within these areas, Sun and Li, (2019) developed a dynamic tensor clustering method based on CP tensor decomposition, which does not limit the tensor order and can be learned efficiently. Wu et al., (2016) proposed tensor spectral co-clustering based on a stochastic Markov process. This method works for general tensors of any order and can simultaneously cluster all dimensions. Zhou et al., (2019) proposed a tensor low-rank reconstruction technique (TLRR). The reconstruction consists of a low-rank dictionary recovery component and sparse noise. The dictionary is then used to obtain a similarity matrix to cluster the data.

For supervised tasks, such as regression and classification, various tensor-based classification methods have been developed, including logistic tensor regression (Tan et al.,, 2013), support tensor machine (Hao et al.,, 2013), and tensor Fisher discriminant analysis (Yan et al.,, 2005). Furthermore, different forms of tensor regression have been proposed, depending on the dimensionality of the input and output variables. These include scalar-to-tensor regression (Zhou et al.,, 2013), tensor-to-scalar regression (Yan et al.,, 2019), and tensor-to-tensor regression (Gahrooei et al.,, 2021).

Here it is worth noting that a large body of literature has focused on separating noise from a low-rank background, starting from pioneering work on robust PCA (Candès et al.,, 2011). Subsequently, this separation has been extended to anomaly detection by decomposing the data into three parts: background, anomaly, and noise. For example, Yan et al., (2017) proposed a smooth sparse decomposition (SSD) by decomposing large-scale image data into a smooth background, a sparse anomaly, and noise and outperformed many other image-based detectors. Similar approaches can be found in crime monitoring (Zhao et al., 2022c, ), public health surveillance (Dulal et al.,, 2022; Zhao et al.,, 2020), and transfer learning applications (Li et al.,, 2022). Unfortunately, such methods suffer from an inability to learn the data representation as they assume that the basis functions or data representation are known, which limits their ability to handle complex datasets. To mitigate this, recent methods have been proposed to learn the representation of background components, including Bayesian methods (Guo et al.,, 2022) and deep neural networks (Zhao et al., 2022b, ). Still, such approaches cannot simultaneously learn the basis functions for the background and anomaly components of the dataset.

Given the above literature, to the best of our knowledge, there are to date no tensor decomposition methods capable of learning unique and common representations across different datasets. The closest work along this line is personalized PCA (perPCA), which introduces a novel technique to perform PCA on data originating from disparate sources or modalities that exhibit heterogeneous trends but also possess common characteristics (Shi and Kontar,, 2022). perPCA uses orthogonal global and local components to capture both shared and unique characteristics of each source. The paper offers a sufficient identifiability condition, theoretical guarantees, and competitive empirical results. Unfortunately, perPCA requires vectorization of the datasets and cannot directly handle tensor data. The extension of perPCA to tensor data faces fundamental challenges due to the large degrees of freedom and non-closed-form solutions of tensor decompositions, the difficulty in defining tensor-based orthogonality constraints, and the computational challenges involving high-order tensors.

Our work aims to bring personalization to a tensor paradigm and address the challenges imposed to make that possible.

3 Model Development

In this section, we first set the notation in Sec. 3.1, followed by the motivation and formulation of perTucker in Sec. 3.2. In Sec. 3.3, we propose an efficient algorithm to learn perTucker. Convergence, model initialization, and potential applications are highlighted in Sec. 3.4, Sec. 3.5, and Sec. 3.6, respectively. We note that the proofs of all propositions, lemmas, and theorems are deferred to the Appendix.

3.1 Preliminary

A tensor can be regarded as a data structure with more than 2 dimensions, also known as modes in tensor analysis (see Fig. 1). For example, in images, we use a vector of length 3 to represent the RGB channels of a pixel. Thus, a picture can be represented by a 3-dimensional tensor with dimensions height × width × RGB. If we have multiple pictures of the same dimensions, the dataset can be represented by a 4-dimensional tensor.

Refer to caption
Figure 1: Example of Vector, Matrix and Tensor Data
Notation

Throughout this paper, real numbers are denoted by letters, e.g., $N$, $i$; vectors by bold lowercase letters, e.g., $\bm{c}$; matrices by bold uppercase letters, e.g., $\bm{U}$; sets by script letters, e.g., $\mathcal{K}$; and tensors by bold script letters, e.g., $\bm{\mathcal{X}}$.

Mode-$k$ product and tensor unfolding

We briefly review the notion of a Tucker tensor product. A $K$-mode tensor is represented by $\bm{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{K}}$, where $I_{k}$ denotes the mode-$k$ dimension of $\bm{\mathcal{X}}$ for $k=1,\cdots,K$. We use $\bm{\mathcal{X}}[i_{1},i_{2},\cdots,i_{K}]$ to denote the $(i_{1},i_{2},\cdots,i_{K})$-th entry of $\bm{\mathcal{X}}$. The mode-$k$ product of a tensor $\bm{\mathcal{X}}$ with a matrix $\bm{V}\in\mathbb{R}^{J_{k}\times I_{k}}$ produces a tensor defined by $(\bm{\mathcal{X}}\times_{k}\bm{V})[i_{1},\cdots,i_{k-1},j_{k},i_{k+1},\cdots,i_{K}]=\sum_{i_{k}}\bm{\mathcal{X}}[i_{1},\cdots,i_{k},\cdots,i_{K}]\,\bm{V}[j_{k},i_{k}]$. For a tensor $\bm{\mathcal{X}}$ and a specific mode $k$, we use the subscript with parentheses $\bm{\mathcal{X}}_{(k)}\in\mathbb{R}^{I_{k}\times\prod_{q=1,q\neq k}^{K}I_{q}}$ to denote the unfolding of $\bm{\mathcal{X}}$ with respect to mode $k$: $\bm{\mathcal{X}}_{(k)}[i_{k},j]=\bm{\mathcal{X}}[i_{1},i_{2},\cdots,i_{K}]$, where $j=1+\sum_{q=1,q\neq k}^{K}(i_{q}-1)J_{q}$ and $J_{q}=\prod_{m=1,m\neq k}^{q-1}I_{m}$. The columns of the mode-$k$ unfolding $\bm{\mathcal{X}}_{(k)}$ are the mode-$k$ vectors of $\bm{\mathcal{X}}$.
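To make the unfolding and mode-$k$ product concrete, here is a minimal numpy sketch (our illustration, not code from the paper). The column ordering of this unfolding may differ from the convention above, but the mode-$k$ product does not depend on that ordering.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: a matrix of shape (I_k, product of the other dimensions)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, V, k):
    """Mode-k product X x_k V for a matrix V of shape (J_k, I_k)."""
    Xk = unfold(X, k)                                   # (I_k, prod of others)
    Yk = V @ Xk                                         # (J_k, prod of others)
    new_shape = (V.shape[0],) + tuple(np.delete(X.shape, k))
    return np.moveaxis(Yk.reshape(new_shape), 0, k)     # fold back with mode k replaced

# Example: a 3-mode tensor multiplied along mode 1 (modes are 0-indexed in code)
X = np.random.randn(4, 5, 6)
V = np.random.randn(3, 5)
print(mode_k_product(X, V, 1).shape)                    # (4, 3, 6)
```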

Tucker decomposition

Tucker decomposition (Tucker,, 1966) decomposes a tensor into a core tensor multiplied by a factor matrix along each mode, $\bm{\mathcal{X}}\approx\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\bm{U}_{2}\cdots\times_{K}\bm{U}_{K}$, where $\bm{U}_{k}$ is an orthonormal $I_{k}\times J_{k}$ factor matrix, typically with $J_{k}<I_{k}$. $\bm{U}_{k}$ can be regarded as the principal components in mode $k$.

Tucker decomposition has an equivalent formulation in terms of the unfolded tensor, that is, $\bm{\mathcal{X}}_{(k)}=\bm{U}_{k}\bm{\mathcal{C}}_{(k)}(\bm{U}_{K}\otimes\cdots\otimes\bm{U}_{k+1}\otimes\bm{U}_{k-1}\otimes\cdots\otimes\bm{U}_{1})^{\top}$, where $\otimes$ is the Kronecker product.

Tensor inner product

The inner product of two tensors of the same shape, $\bm{\mathcal{A}},\bm{\mathcal{B}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{K}}$, is defined as

\left\langle\bm{\mathcal{A}},\bm{\mathcal{B}}\right\rangle=\sum_{i_{1},\cdots,i_{K}}\bm{\mathcal{A}}[i_{1},\cdots,i_{K}]\,\bm{\mathcal{B}}[i_{1},\cdots,i_{K}].

Then the Frobenius norm of a tensor $\bm{\mathcal{A}}$ can be defined as $\left\lVert\bm{\mathcal{A}}\right\rVert_{F}^{2}=\left\langle\bm{\mathcal{A}},\bm{\mathcal{A}}\right\rangle$, which is the sum of squares of all its elements.

3.2 Motivation & formulation

Suppose we have tensor data from $N$ sources, each of order $K$. We use $\bm{\mathcal{Y}}_{n}$ to denote the data from source $n$ and assume that $\bm{\mathcal{Y}}_{n}$ has dimensions $I_{1}\times I_{2}\times\ldots\times I_{K}\times s_{n}$. All dimensions are shared across sources except the last one: in practical applications, the last dimension $s_{n}$ denotes the number of samples from source $n$, which often differs between sources.

Our approach relies on defining global and local components to model commonality and heterogeneity across different sources. To do so, we let the global components consist of shared global factor matrices $\bm{U}_{G,1},\ldots,\bm{U}_{G,K}$ and individual global core tensors $\bm{\mathcal{C}}_{G,1},\ldots,\bm{\mathcal{C}}_{G,N}$, one per source. The local components consist of individual core tensors $\bm{\mathcal{C}}_{L,1},\ldots,\bm{\mathcal{C}}_{L,N}$ and individual local factor matrices $\bm{V}_{n,1},\ldots,\bm{V}_{n,K}$. As such, the reconstructions of the global and local components for source $n$ are $\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ and $\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}$, respectively.

Based on the above definitions, we assume our data-generating process to be

\bm{\mathcal{Y}}_{n}=\underbrace{\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}}_{\text{global}}+\underbrace{\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}}_{\text{local }n}+\bm{\mathcal{E}}_{n}, \quad (1)

where $\bm{\mathcal{E}}_{n}$ are tensors that represent additive noise.
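To make the data-generating process (1) concrete, the sketch below draws one synthetic dataset with $K=2$ tensor modes plus a sample mode; the dimensions, ranks, and noise level are assumptions for illustration, and `mode_k_product` is the helper from the sketch in Sec. 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, g, l, s, N = 20, 20, 3, 2, 10, 3     # dims, global/local ranks, samples per source, sources

def orthonormal(rows, cols):
    return np.linalg.qr(rng.standard_normal((rows, cols)))[0]

U_G = [orthonormal(I1, g), orthonormal(I2, g)]           # shared global factors
data = []
for n in range(N):
    # local factors; mode 1 (index 0 in code) is forced orthogonal to the global factor
    V1 = rng.standard_normal((I1, l))
    V1 = np.linalg.qr(V1 - U_G[0] @ (U_G[0].T @ V1))[0]  # project out U_G,1 and re-orthonormalize
    V2 = orthonormal(I2, l)
    C_G = rng.standard_normal((g, g, s))                 # per-source global core (sample mode last)
    C_L = rng.standard_normal((l, l, s))                 # per-source local core
    Y_glob = mode_k_product(mode_k_product(C_G, U_G[0], 0), U_G[1], 1)
    Y_loc = mode_k_product(mode_k_product(C_L, V1, 0), V2, 1)
    data.append(Y_glob + Y_loc + 0.1 * rng.standard_normal((I1, I2, s)))
```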

Since global and local components should convey different information, they need to be distinguished so that each part can vary independently of each other. To do so, we require the orthogonality of the global and local tensors. Specifically, we assume that:

\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0,\quad\forall\,\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n},

where $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ and $\bm{\mathcal{Y}}_{L,n}=\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}$. Interestingly, this condition turns out to be equivalent to the global and local factor matrices being orthogonal in at least one mode, as stated in Proposition 1.

Proposition 1.

For each $n=1,\ldots,N$, the following two conditions are equivalent.

  • $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0,\quad\forall\,\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n}$.

  • There exists a mode $k\in\{1,\ldots,K\}$ such that $\bm{U}_{G,k}^{\top}\bm{V}_{n,k}=0$.

Given Proposition 1, we require the local factor matrices to be orthogonal to the global factor matrices for all sources in at least one mode. We denote the set of such orthogonal modes by $\mathcal{K}$, with $|\mathcal{K}|\geq 1$. Our objective is then to minimize the reconstruction loss of the data across all $N$ sources, written as

\min_{\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}}\ \sum_{n=1}^{N}\|\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}-\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2} \quad (2)
\text{s.t.}\ \bm{U}_{G,k}^{\top}\bm{U}_{G,k}=I,\ \bm{V}_{n,k}^{\top}\bm{V}_{n,k}=I,\quad n=1,\ldots,N,\ k=1,\ldots,K,
\bm{U}_{G,k}^{\top}\bm{V}_{n,k}=0,\quad n=1,\ldots,N,\ k\in\mathcal{K}.

We assume that the global core tensors have the same dimensions across sources, $\bm{\mathcal{C}}_{G,n}\in\mathbb{R}^{g_{1}\times\cdots\times g_{K}}$, and that the local core tensor for source $n$ has dimensions $\bm{\mathcal{C}}_{L,n}\in\mathbb{R}^{l_{n,1}\times\cdots\times l_{n,K}}$. These choices also define the dimensions of the global and local factor matrices.
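For concreteness, a small sketch of the reconstruction loss in (2) is given below, assuming the hypothetical `mode_k_product` helper from the Sec. 3.1 sketch; the orthogonality and orthonormality constraints are handled by the updates in Sec. 3.3 and are not checked here.

```python
import numpy as np

def multi_mode_product(C, factors):
    """Apply C x_1 F_1 ... x_K F_K over the first len(factors) modes."""
    Y = C
    for k, F in enumerate(factors):
        Y = mode_k_product(Y, F, k)
    return Y

def pertucker_loss(data, global_cores, U_G, local_cores, V):
    """Sum of squared Frobenius reconstruction errors across all N sources."""
    loss = 0.0
    for n, Y_n in enumerate(data):
        recon = multi_mode_product(global_cores[n], U_G) \
              + multi_mode_product(local_cores[n], V[n])
        loss += np.sum((Y_n - recon) ** 2)
    return loss
```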

3.3 Personalized Tucker algorithm

A natural algorithm to solve the objective in (2) is block coordinate descent (BCD), where we iteratively optimize each variable. A general framework for BCD in our context is outlined in Algorithm 1.

Data: $\bm{\mathcal{Y}}_{n}$, $n=1,\ldots,N$
Output: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Initialization: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
for iterations do
      for $k=1,\ldots,K$ do
            Update the global factor matrix $\bm{U}_{G,k}$
            for $n=1,\ldots,N$ do
                  Update the global core tensor $\bm{\mathcal{C}}_{G,n}$
                  Update the local factor matrix $\bm{V}_{n,k}$
                  Update the local core tensor $\bm{\mathcal{C}}_{L,n}$
            end for
      end for
end for
Return: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Algorithm 1: Pseudo code of the BCD algorithm

In the rest of Sec. 3.3, we explain the update steps in Algorithm 1 in detail. We start with the update of the global and local core tensors in Sec. 3.3.1, whose closed-form solution is a direct projection, similar to traditional Tucker decomposition, owing to the orthogonality between the global and local components. This solution is then assumed to hold when updating the global and local factor matrices in Sec. 3.3.2, which simplifies those updates by removing the core tensors from the optimization. Despite the challenges that pertain to the two distinct decomposition components within (2) and their orthogonality, a key result is that all updates can be done in closed form owing to the nice properties of the orthogonality constraint imposed on the model.

3.3.1 Update global and local core tensors

In this section, Proposition 2 provides the closed-form solution to update global and local core tensors, given the global and local factor matrices. The closed-form solution is the direct projection of the data to the global or local factor matrices. When updating the global and local factor matrices, we assume that the global and local core tensors are always the optimal solution. This simplifies the formula to update the global and local factor matrices by removing the core tensors from the optimization problem.

Proposition 2.

(Closed-form solutions to the core tensors) If $\left\lvert\mathcal{K}\right\rvert\geq 1$, when the global factor matrices $\{\bm{U}_{G,k}\}$ and the local factor matrices $\{\bm{V}_{n,k}\}$ are given, the global core tensors $\bm{\mathcal{C}}^{\star}_{G,n}$ that minimize (2) satisfy

\bm{\mathcal{C}}^{\star}_{G,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{U}_{G,1}^{\top}\ldots\times_{K}\bm{U}_{G,K}^{\top},

and the local core tensors $\bm{\mathcal{C}}^{\star}_{L,n}$ that minimize (2) satisfy

\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}.

The closed-form solutions presented in Proposition 2 take advantage of the orthogonality between the two components. As a result, the cross-term cancels, making the computation of the core tensors for both components efficient and straightforward. In the following sections, $\bm{\mathcal{C}}^{\star}_{G,n}$ and $\bm{\mathcal{C}}^{\star}_{L,n}$ denote the optimized global and local core tensors.
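A minimal sketch of these core updates follows, reusing the hypothetical `multi_mode_product` helper from the earlier snippets; the factor matrices are the tall matrices with orthonormal columns, so projection uses their transposes.

```python
def update_cores(Y_n, U_G, V_n):
    """Closed-form core updates of Proposition 2 for one source."""
    C_G = multi_mode_product(Y_n, [U.T for U in U_G])      # global core: project onto global factors
    C_L = multi_mode_product(Y_n, [Vk.T for Vk in V_n])    # local core: project onto local factors
    return C_G, C_L
```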

3.3.2 Update global and local factor matrices

In this section, we discuss the closed-form solutions for updating the global and local factor matrices. For simplicity of notation, we define the global and local residual tensors for each source $n=1,\ldots,N$ as:

\bm{\mathcal{R}}_{G,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K},
\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}.

The global residual is the reconstruction error from local components and the local residual is the reconstruction error from global components. Therefore, global reconstruction tends to model the global residual, and local reconstruction tends to model the local residual.

Proximal update

In practice, when updating the global and local factor matrices, we can incorporate a proximal term into the optimization problem to regularize the update of the factor matrices (Shen et al.,, 2022). More specifically, we define $\varrho$ as the subspace difference between the subspaces spanned by the current factor matrix $\bm{U}_{t}$ and the target factor matrix $\bm{U}$ to be optimized,

\varrho(\bm{U},\bm{U}_{t})=\|\bm{U}\bm{U}^{\top}-\bm{U}_{t}\bm{U}_{t}^{\top}\|_{F}^{2}. \quad (3)

The proximal penalty term is defined as this subspace difference multiplied by a parameter $\rho$. The proximal term stabilizes the update of the factor matrices by regularizing the change of subspace. Since the global and local components are fit by minimizing the reconstruction error, the optimization problem for the global factor matrix in mode $k$ at iteration $t$ is

\bm{U}_{G,k,t+1}=\arg\min_{\bm{U}_{G,k}}\sum_{n=1}^{N}\left\lVert\bm{\mathcal{R}}_{G,n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}\right\rVert_{F}^{2}+\rho\|\bm{U}_{G,k}\bm{U}_{G,k}^{\top}-\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}\|_{F}^{2}, \quad (4)

and the optimization problem for the local factor matrix of source $n$ is, when $k\in\mathcal{K}$,

\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}\perp\bm{U}_{G,k}}\left\lVert\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rVert_{F}^{2}+\rho\left\lVert\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}, \quad (5)

and when $k\not\in\mathcal{K}$,

\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}}\left\lVert\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rVert_{F}^{2}+\rho\left\lVert\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}, \quad (6)

where $\bm{U}_{G,k,t}$ and $\bm{V}_{n,k,t}$ denote the global and local factor matrices for source $n$, mode $k$, and iteration $t$, and $\bm{U}_{G,k,t+1}$ and $\bm{V}_{n,k,t+1}$ are the corresponding updated factor matrices. The objectives of (5) and (6) are the same: each consists of the Frobenius norm of the fitting error and a regularization on the change of subspace. The difference is that in (5) we explicitly require $\bm{V}_{n,k,t+1}$ to be orthogonal to $\bm{U}_{G,k,t+1}$, while in (6) we place no constraint on $\bm{V}_{n,k,t+1}$.

Though the optimization problems (4) to (6) seem complicated, it turns out we can obtain closed-form solutions. To achieve this, we first transform the minimization problem into a maximization problem and remove the core tensors in the optimization by Lemma 3.1 and Lemma 3.2.

Lemma 3.1.

For any orthonormal factor matrices $\bm{U}$ and $\bm{U}_{t}$, the subspace error between $\bm{U}$ and $\bm{U}_{t}$ defined in (3) can be formulated as

\varrho(\bm{U},\bm{U}_{t})=2c-2\mathrm{Tr}\left[\bm{U}^{\top}\bm{U}_{t}\bm{U}_{t}^{\top}\bm{U}\right], \quad (7)

where $c$ is the number of columns of $\bm{U}$.

Lemma 3.1 shows that the subspace error is differentiable with respect to $\bm{U}$. This property is useful when we design the update rules for the BCD algorithm and the evaluation metrics. Furthermore, Lemma 3.1 places a negative sign on the matrix trace term, which transforms the minimization problem into a maximization problem.
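A tiny numpy check of identity (7), under the convention that factor matrices have orthonormal columns (our illustration, not the paper's code):

```python
import numpy as np

def subspace_error(U, U_t):
    return np.linalg.norm(U @ U.T - U_t @ U_t.T, 'fro') ** 2

rng = np.random.default_rng(1)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]
U_t = np.linalg.qr(rng.standard_normal((10, 3)))[0]
c = U.shape[1]                                          # number of columns of U
trace_form = 2 * c - 2 * np.trace(U.T @ U_t @ U_t.T @ U)
print(np.isclose(subspace_error(U, U_t), trace_form))   # True
```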

Before deriving the solutions to (4) to (6), we introduce the following lemma that significantly simplifies our objective.

Lemma 3.2.

For each $n=1,\ldots,N$ and $k=1,\ldots,K$, we have

\sum_{n=1}^{N}\|\bm{\mathcal{R}}_{G,n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}\|_{F}^{2}=\sum_{n=1}^{N}\left(-\|\bm{\mathcal{R}}_{G,n}\times_{1}\bm{U}_{G,1}^{\top}\ldots\times_{K}\bm{U}_{G,K}^{\top}\|_{F}^{2}+\left\lVert\bm{\mathcal{R}}_{G,n}\right\rVert_{F}^{2}\right), \quad (8)
\|\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}=-\|\bm{\mathcal{R}}_{L,n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}\|_{F}^{2}+\left\lVert\bm{\mathcal{R}}_{L,n}\right\rVert_{F}^{2}. \quad (9)

Lemma 3.2 plays two fundamental roles in the derivation of the closed-form solutions for updating the global and local factor matrices. First, it places a negative sign on the term involving the factor matrices, which transforms the minimization problems (4) to (6) into maximization problems. Second, by plugging in the closed-form solutions of the global and local core tensors from Proposition 2, it simplifies the optimization problems by reducing the number of decision variables.

Update global factors

With these prerequisites, we are now ready to present, in Proposition 3, the closed-form solution of sub-problem (4), which updates the global factor matrix $\bm{U}_{G,k}$ in a specific mode $k$.

Proposition 3.

We use $\bm{W}_{G,n}$ to denote $\bm{W}_{G,n}=\left(\bm{\mathcal{R}}_{G,n}\right)_{(k)}(\bigotimes_{q\neq k}\bm{U}_{G,q}^{\top})^{\top}$, where $\bigotimes_{q\neq k}$ is the Kronecker product, in reverse order, of the factor matrices except the $k$-th one. If $\bm{U}_{G,k,t+1}$ is the optimal solution to (4), the columns of $\bm{U}_{G,k,t+1}$ are the unit eigenvectors of the matrix $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho\,\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}$ corresponding to the largest $g_{k}$ eigenvalues.

Proposition 3 shows that, with proximal regularization, the global components can be updated efficiently through singular value decomposition. In practice, we can use the equivalent form $\bm{W}_{G,n}=(\bm{\mathcal{R}}_{G,n}\times_{1}\bm{U}_{G,1}^{\top}\cdots\times_{k-1}\bm{U}_{G,k-1}^{\top}\times_{k+1}\bm{U}_{G,k+1}^{\top}\cdots\times_{K}\bm{U}_{G,K}^{\top})_{(k)}$ to improve computational efficiency. Note that when updating the global components, we do not impose orthogonality between the global and local components. The reason is that enforcing orthogonality of the global components to each of the local components is too restrictive and may leave no feasible space for the update when the number of sources is large. Therefore, we update the global components freely and enforce the local components to be orthogonal to the global components.
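The following sketch illustrates this global factor update, reusing `unfold` and `mode_k_product` from the Sec. 3.1 sketch (helper names and the argument layout are our assumptions); `U_G_k_prev` plays the role of $\bm{U}_{G,k,t}$.

```python
import numpy as np

def update_global_factor(R_G, U_G, k, U_G_k_prev, g_k, rho):
    """R_G: list of global residual tensors; U_G: current list of global factors."""
    S = 2 * rho * (U_G_k_prev @ U_G_k_prev.T)
    for R_n in R_G:
        T = R_n
        for q, U_q in enumerate(U_G):
            if q != k:
                T = mode_k_product(T, U_q.T, q)   # project every mode except k
        W = unfold(T, k)                          # this plays the role of W_{G,n}
        S += W @ W.T
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
    return eigvecs[:, -g_k:]                      # eigenvectors of the g_k largest eigenvalues
```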

Update local factors

We provide the closed-form solutions to sub-problems (5) and (6), which update the local factor matrices with or without the orthogonality constraint, in Proposition 4. The component optimized in Proposition 4 is the local factor matrix $\bm{V}_{n,k}$ for a specific source $n$ and mode $k$. We denote the current local factor matrix at iteration $t$ by $\bm{V}_{n,k,t}$.

Proposition 4.

Problems (5) and (6) have closed-form solutions. We denote $\bm{W}_{L,n}=\left(\bm{\mathcal{R}}_{L,n}\right)_{(k)}(\bigotimes_{q\neq k}\bm{V}_{n,q}^{\top})^{\top}$. Then,

  1. If $k\not\in\mathcal{K}$, the columns of the updated local factor matrix $\bm{V}_{n,k,t+1}$ are the unit eigenvectors of $\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}$ corresponding to the top $l_{n,k}$ eigenvalues.

  2. If $k\in\mathcal{K}$, the update of the local factor matrix $\bm{V}_{n,k,t+1}$ is as follows. Denote $\bm{S}^{\prime}=(I-\bm{U}_{G,k}\bm{U}_{G,k}^{\top})[\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}](I-\bm{U}_{G,k}\bm{U}_{G,k}^{\top})$. The columns of $\bm{V}_{n,k,t+1}$ are the eigenvectors of $\bm{S}^{\prime}$ corresponding to the top $l_{n,k}$ eigenvalues.

The proofs of Proposition 3 and Proposition 4 are given in Appendix E and Appendix F. Having completed all the steps required to update each component, we present the complete algorithm in Algorithm 2, where the subscript $t$ denotes the current iteration index for the global and local factor matrices. Despite the orthogonality constraint between the local and global components, each update step in Algorithm 2 can be efficiently implemented via a closed-form solution.
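Before turning to the full algorithm, the sketch below illustrates the local factor update of Proposition 4, again reusing the helpers from the earlier snippets; the flag `orthogonal_mode` indicates whether $k\in\mathcal{K}$, in which case the matrix is sandwiched by the projector onto the orthogonal complement of $\bm{U}_{G,k}$.

```python
import numpy as np

def update_local_factor(R_L_n, V_n, k, V_n_k_prev, U_G_k, l_k, rho, orthogonal_mode=True):
    """Local factor update for one source n and one mode k."""
    T = R_L_n
    for q, V_q in enumerate(V_n):
        if q != k:
            T = mode_k_product(T, V_q.T, q)
    W = unfold(T, k)                                        # this plays the role of W_{L,n}
    S = W @ W.T + 2 * rho * (V_n_k_prev @ V_n_k_prev.T)
    if orthogonal_mode:                                     # k in the orthogonal mode set
        P = np.eye(U_G_k.shape[0]) - U_G_k @ U_G_k.T        # projector onto the complement of U_{G,k}
        S = P @ S @ P
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs[:, -l_k:]                                # top l_{n,k} eigenvectors
```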

Data: $\bm{\mathcal{Y}}_{n}$, $n=1,\ldots,N$
Input: Global dimensions $g_{k}$, local dimensions $l_{n,k}$, orthogonal mode set $\mathcal{K}$, $\rho$
Output: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Initialization: Randomly initialize or PCA-initialize $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k,0}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k,0}\}$.
for iterations $t=0,\cdots,T-1$ do
      for $k=1,\ldots,K$ do
            Set $\bm{\mathcal{R}}_{G,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{k-1}\bm{V}_{n,k-1,t+1}\times_{k}\bm{V}_{n,k,t}\ldots\times_{K}\bm{V}_{n,K,t}$, $n=1,\ldots,N$
            Compute $\bm{W}_{G,n}=\left(\bm{\mathcal{R}}_{G,n}\right)_{(k)}(\bm{U}_{G,K,t}^{\top}\otimes\ldots\otimes\bm{U}_{G,k+1,t}^{\top}\otimes\bm{U}_{G,k-1,t+1}^{\top}\otimes\ldots\otimes\bm{U}_{G,1,t+1}^{\top})^{\top}$, $n=1,\ldots,N$
            Update $\bm{U}_{G,k,t+1}$ to be the eigenvectors of $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho\,\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}$ corresponding to the largest $g_{k}$ eigenvalues.
            for $n=1,\ldots,N$ do
                  Update $\bm{\mathcal{C}}^{\star}_{G,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{U}_{G,1,t+1}^{\top}\ldots\times_{k}\bm{U}_{G,k,t+1}^{\top}\times_{k+1}\bm{U}_{G,k+1,t}^{\top}\ldots\times_{K}\bm{U}_{G,K,t}^{\top}$
                  Set $\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1,t+1}\ldots\times_{k}\bm{U}_{G,k,t+1}\times_{k+1}\bm{U}_{G,k+1,t}\ldots\times_{K}\bm{U}_{G,K,t}$
                  Let $\bm{W}_{L,n}=\left(\bm{\mathcal{R}}_{L,n}\right)_{(k)}(\bm{V}_{n,K,t}^{\top}\otimes\ldots\otimes\bm{V}_{n,k+1,t}^{\top}\otimes\bm{V}_{n,k-1,t+1}^{\top}\otimes\ldots\otimes\bm{V}_{n,1,t+1}^{\top})^{\top}$
                  if $k\in\mathcal{K}$ then
                        Let $\bm{S}^{\prime}=(\bm{I}-\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top})[\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}](\bm{I}-\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top})$
                        Update $\bm{V}_{n,k,t+1}$ to be the eigenvectors of $\bm{S}^{\prime}$ corresponding to the largest $l_{n,k}$ eigenvalues.
                  else
                        Update $\bm{V}_{n,k,t+1}$ to be the eigenvectors of $\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}$ corresponding to the largest $l_{n,k}$ eigenvalues.
                  end if
                  Update $\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{k}\bm{V}_{n,k,t+1}^{\top}\times_{k+1}\bm{V}_{n,k+1,t}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}$
            end for
      end for
end for
Return: $\{\bm{\mathcal{C}}^{\star}_{G,n}\},\{\bm{U}_{G,k,T}\},\{\bm{\mathcal{C}}^{\star}_{L,n}\},\{\bm{V}_{n,k,T}\}$
Algorithm 2: BCD algorithm to solve perTucker

3.4 Convergence analysis of Algorithm 2

In this section, we provide the convergence analysis of Algorithm 2. The special update rule in Proposition 4 brings challenges to the convergence analysis: in Algorithm 2, the update of the local factors $\bm{V}_{n,k}$ differs from the standard Tucker decomposition update, as $\bm{V}_{n,k}$ is required to be orthogonal to $\bm{U}_{G,k}$. As a result, updating the local factors does not necessarily decrease the objective value in (2), and Algorithm 2 is therefore not a strictly descending algorithm. Despite such subtleties, we can show that, when the proximal parameter $\rho$ is not too small, our algorithm converges to stationary solutions.

We present our global convergence result in the following theorem. Recall that we use $\bm{U}_{G,k,t}$ to denote the $k$-th global factor $\bm{U}_{G,k}$ after iteration $t$, and $\bm{V}_{n,k,t}$ to denote the $k$-th local factor of source $n$ after iteration $t$.

Theorem 5.

If $|\mathcal{K}|\geq 2$ and there exists a constant $B>0$ such that $\left\lVert\bm{\mathcal{Y}}_{n}\right\rVert_{F}\leq B$ for each $n$, then, when we choose $\rho=O(B^{2})$, Algorithm 2 converges to stationary points where

\min_{t=1,\cdots,T}\sum_{k=1}^{K}\left\lVert\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top}-\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}\right\rVert_{F}^{2}=O\left(\frac{1}{T}\right), \quad (10)

and

\min_{t=1,\cdots,T}\sum_{n=1}^{N}\sum_{k=1}^{K}\left\lVert\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}=O\left(\frac{1}{\sqrt{T}}\right). \quad (11)

Theorem 5 provides several key insights. First, it shows that the subspaces spanned by the column vectors of the global and local factors all converge to fixed solutions. The result establishes the global convergence of the factors, as it does not require careful initialization. Second, the convergence rates for global and local factors differ. Global factors converge at a rate of $O(\frac{1}{T})$, which is standard in non-convex optimization. However, since some local factors must be orthogonal to the global factors, they converge at a slightly slower rate of $O(\frac{1}{\sqrt{T}})$. Third, our result requires $|\mathcal{K}|\geq 2$, which ensures that orthogonality can be maintained through another mode while each mode is being updated.

To validate the convergence rates, we provide a proof-of-concept simulation study. Fig. 2 displays an example of the convergence of the global and local factor matrices in this simulation: both the global and local subspace errors go to 0, and the slower rate of the local components, primarily a result of the orthogonality requirement, is consistent with Theorem 5. The details of this simulation study are relegated to Appendix H.

Refer to caption
Figure 2: Subspace error for global component and different local sources

3.5 Model initialization

One simple approach is to initialize all components randomly. Alternatively, one may use a Tucker decomposition of all the data for initialization. To do so, let $s=\sum_{n=1}^{N}s_{n}$ be the total number of samples from all sources and recall that the data $\bm{\mathcal{Y}}_{n}$ from source $n$ has dimensions $I_{1}\times\ldots\times I_{K}\times s_{n}$. The following steps can then be taken:

  1. Construct a tensor $\bm{\mathcal{Y}}$ that encompasses all samples from every source, with dimensions $I_{1}\times\ldots\times I_{K}\times s$.

  2. Use a Tucker decomposition $\bm{\mathcal{Y}}\approx\bm{\mathcal{C}}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ to initialize the global factors.

  3. Use Proposition 2 to initialize the global core tensor for each source.

  4. For each source $n$, perform a Tucker decomposition on the local residual tensor $\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ to initialize the local core tensor and the local factor matrices.

This initialization does not require orthogonality between the global and local components; therefore, we may observe an increase in the reconstruction error in the first iteration. In practice, however, we have found that this method yields faster convergence.
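A sketch of this Tucker-based initialization is given below, where a simple truncated HOSVD stands in for the Tucker decomposition (an assumption for illustration; any Tucker solver could be used) and `unfold`/`multi_mode_product` are the hypothetical helpers from the earlier snippets.

```python
import numpy as np

def hosvd_factors(Y, ranks):
    """Leading left singular vectors of each unfolding, one factor matrix per listed mode."""
    return [np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, :r]
            for k, r in enumerate(ranks)]

def initialize(data, global_ranks, local_ranks):
    Y_all = np.concatenate(data, axis=-1)              # step 1: stack samples along the last mode
    U_G = hosvd_factors(Y_all, global_ranks)           # step 2: initialize global factors
    C_G, V, C_L = [], [], []
    for n, Y_n in enumerate(data):
        C_G.append(multi_mode_product(Y_n, [U.T for U in U_G]))    # step 3: global cores
        R_L = Y_n - multi_mode_product(C_G[n], U_G)                # local residual tensor
        V.append(hosvd_factors(R_L, local_ranks[n]))               # step 4: local factors
        C_L.append(multi_mode_product(R_L, [Vk.T for Vk in V[n]])) # step 4: local cores
    return C_G, U_G, C_L, V
```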

In addition, in our model one needs to choose the modes on which orthogonality is imposed. Our theorem requires $|\mathcal{K}|\geq 2$, but we found in various simulation studies that the result also holds for $|\mathcal{K}|=1$. In practice, one can simply choose the mode with the largest dimension to impose the orthogonality constraint; alternatively, cross-validation can be used to select the mode.

3.6 Practical usage of perTucker

In this section, we introduce some practical applications of perTucker. Specifically, we shed light on its potential utility for improved classification, anomaly detection, and clustering. The key idea for all applications is to operate only on local components. This may allow for improved clustering, classification, and detection as differences become more explicit when shared knowledge is removed.

3.6.1 Classification via perTucker

To use perTucker for classification, we assume that each source corresponds to a class. We then perform perTucker to obtain the estimated local factor matrices $\hat{\bm{V}}_{n,k}$, $k=1,\ldots,K$, for each class. When a new data sample $\bm{\mathcal{Y}}^{\text{new}}$ arrives, we use the following decision rule to classify it:

\hat{n}=\arg\max_{n}\|\bm{\mathcal{C}}^{\star}_{L,n}\|_{F}^{2}=\arg\max_{n}\|\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\ldots\times_{K}\hat{\bm{V}}_{n,K}^{\top}\|_{F}^{2}. \quad (12)

The decision rule (12) shows that we can efficiently classify the data by selecting the class that maximizes the Frobenius norm of the local core tensor. The largest core norm indicates that the corresponding local subspace is the most suitable for representing the new data, since the local core tensor is a projection of the original tensor onto that local subspace. We have found that this decision rule is equivalent to choosing the class with the smallest reconstruction error; the discussion is relegated to Appendix I.
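A minimal sketch of decision rule (12) follows, reusing `multi_mode_product` from the earlier snippets; `V_hat[n]` is assumed to hold the estimated local factor matrices of class $n$.

```python
import numpy as np

def classify(Y_new, V_hat):
    """Return the class whose local subspace gives the largest projected core norm."""
    scores = [np.sum(multi_mode_product(Y_new, [Vk.T for Vk in V_n]) ** 2)
              for V_n in V_hat]
    return int(np.argmax(scores))
```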

We want to emphasize that this classification approach differs from traditional tensor-based classifiers (Klus and Gelß,, 2019), which directly train supervised learning models for tensor classification. Here, we focus on a generative approach, which first trains a data generation model (i.e., a local subspace) for each class and then uses the representation error to decide to which class the data belong. The algorithm constructs global and local subspaces, which is beneficial not only for classification but also for feature interpretation and visualization.

3.6.2 Anomaly detection via perTucker

By monitoring only the local components, perTucker can improve anomaly detection, as changes in the underlying data become more explicit when the common factors are removed. Specifically, we propose using $\|\bm{\mathcal{C}}_{L}\|_{F}^{2}$ as the key monitoring statistic for online anomaly detection.

Here we emphasize that perTucker does not implement a sparsity penalty as often used in tensor-based anomaly detection (Yan et al.,, 2018). This is a unique benefit of perTucker as we do not assume that the anomaly patterns are sparse, which is too restrictive in some applications. As a result, perTucker can accommodate a wide range of anomalous pattern distributions.

3.6.3 Clustering via perTucker

perTucker provides an alternative approach to client clustering based on the local factors. Specifically, we focus on the setting of subspace clustering, which aims to cluster clients that lie within the same local subspaces. The subspace distance between clients $n_{1}$ and $n_{2}$, $\rho_{n_{1},n_{2}}=\|\hat{\bm{V}}_{n_{1}}\hat{\bm{V}}_{n_{1}}^{\top}-\hat{\bm{V}}_{n_{2}}\hat{\bm{V}}_{n_{2}}^{\top}\|_{F}^{2}$, can be calculated, where $\hat{\bm{V}}_{n}$ is the Kronecker product of the local factor matrices of client $n$. We can then apply spectral clustering to group the clients and use multidimensional scaling to produce the clustering plot (Hastie et al.,, 2009).
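A sketch of this clustering step is shown below; turning the subspace distances into an affinity matrix via a Gaussian kernel is our assumption for illustration, and scikit-learn's SpectralClustering with a precomputed affinity stands in for the spectral clustering step.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def kron_factors(V_n):
    """Kronecker product of a client's local factor matrices in reverse-mode order."""
    M = V_n[0]
    for Vk in V_n[1:]:
        M = np.kron(Vk, M)
    return M

def cluster_clients(V_hat, n_clusters):
    N = len(V_hat)
    P = []
    for V_n in V_hat:
        M = kron_factors(V_n)
        P.append(M @ M.T)                               # projection matrix of the local subspace
    D = np.array([[np.linalg.norm(P[i] - P[j], 'fro') ** 2 for j in range(N)]
                  for i in range(N)])                   # pairwise subspace distances
    A = np.exp(-D / (D.mean() + 1e-12))                 # distances -> affinities
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```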

4 Numerical Studies

Now that we have introduced perTucker and its potential applications, we validate its claimed advantages through numerical simulations. Sec. 4.1 introduces the data generation procedure. Sec. 4.2, Sec. 4.3, and Sec. 4.4 evaluate the performance of perTucker in terms of data reconstruction, classification, and clustering, respectively.

4.1 Data generation

In this simulation, each data sample is a 50-by-50 grayscale image. Each sample is the sum of a low-rank global component, a heterogeneous local component, and i.i.d. standard normal noise, as in Eq. (13):

\bm{\mathcal{Y}}_{n}=\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n}+\bm{\mathcal{E}}_{n}. \quad (13)

We generate $N=3$ clients, defined by 3 patterns for the heterogeneous local component: a Swiss pattern, an oval pattern, and a rectangle pattern, as in Fig. 3. The pixel value inside each pattern is 5, while the rest is 0. For each pattern, we generate 10 sample images. There is some variability within each pattern, as shown in the left part of Fig. 3: the Swiss can be thin or thick; the oval can be vertical, horizontal, or circular; the rectangle can be wide, tall, or square.

For the global component, we randomly create orthonormal matrices $\bm{U}_{G,1}$ and $\bm{U}_{G,2}$ of dimension $50\times 5$ shared by the 3 clients. We then randomly generate the global core tensor $\bm{\mathcal{C}}_{G,n}$ of dimension $5\times 5\times 10$ for each client, with each entry following i.i.d. $N(0,100)$. The global components are constructed by $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\times_{2}\bm{U}_{G,2}$, $n=1,2,3$.

Therefore, the full data dimension is $3\times 50\times 50\times 10$. Some examples of the data generation structure are shown in the right part of Fig. 3. The three rows correspond to the three patterns. The first two columns show the global and local components, respectively, and the third column shows the sum of the global and local components along with the error term. With noise and the global background, the local pattern can barely be recognized, which makes accurate identification of the local patterns challenging.

Refer to caption
Refer to caption
Figure 3: Left: examples of variability in each pattern. Right: examples of different components in each pattern.

4.2 Performance

In the generated data, we apply perTucker to decouple the global and local components. For comparison, we also evaluate the performance of some benchmark algorithms.

  1. globalTucker: We first concatenate the samples of all clients into a tensor $\bm{\mathcal{Y}}$ and then apply Tucker decomposition to $\bm{\mathcal{Y}}$.

  2. localTucker: We apply a standard Tucker decomposition to each client $\bm{\mathcal{Y}}_{n}$ individually.

  3. robustTucker: We first concatenate the samples from all clients into $\bm{\mathcal{Y}}$, then apply the method in Lu et al., (2019) to identify the low-rank and sparse components.

  4. perPCA: We apply perPCA (Shi and Kontar,, 2022) on the vectorized dataset, where we vectorize the $50\times 50$ images into vectors of length 2500 and use perPCA to find global and local components. Although perPCA is designed for vector datasets, this comparison highlights the need for personalized tensor decompositions when data is in tensor form.

Refer to caption
Figure 4: Reconstruction result example for three patterns

Fig. 4 depicts examples of global and local component reconstruction via perTucker. The first row represents the data, and the second row shows the reconstruction from perTucker. The columns represent global and local components. The three examples with different patterns indicate that perTucker can effectively reconstruct the shared and unique data patterns.

Furthermore, to numerically compare the performance against the benchmark algorithms, we compute several performance metrics. $\hat{\bm{U}}_{G,k}$ and $\hat{\bm{V}}_{n,k}$ denote the estimated global and local factor matrices; $\hat{\bm{\mathcal{Y}}}_{G,n}$ and $\hat{\bm{\mathcal{Y}}}_{L,n}$ denote the estimated reconstructions of the global and local components.

  1. Global subspace error: the regularized global subspace error between the ground-truth global factors $\{\bm{U}_{G,k}\}$ and the estimated global factors $\{\hat{\bm{U}}_{G,k}\}$, computed as $\varrho(\bigotimes_{k=1}^{K}\bm{U}_{G,k},\bigotimes_{k=1}^{K}\hat{\bm{U}}_{G,k})/\|\bigotimes_{k=1}^{K}\bm{U}_{G,k}\|_{F}^{2}$.

  2. Local subspace error: we first generate 100 images for each pattern and use Tucker decomposition to estimate the local factors $\{\bm{V}_{n,k}\}$ for each pattern. The regularized local subspace error is then $\varrho(\bigotimes_{k=1}^{K}\bm{V}_{n,k},\bigotimes_{k=1}^{K}\hat{\bm{V}}_{n,k})/\|\bigotimes_{k=1}^{K}\bm{V}_{n,k}\|_{F}^{2}$, averaged over the 3 patterns.

  3. Global component error: $\sum_{n}\|\hat{\bm{\mathcal{Y}}}_{G,n}-\bm{\mathcal{Y}}_{G,n}\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{G,n}\|_{F}^{2}$.

  4. Local component error: $\sum_{n}\|\hat{\bm{\mathcal{Y}}}_{L,n}-\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}$.

  5. Denoised error: $\sum_{n}\|(\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n})-(\hat{\bm{\mathcal{Y}}}_{G,n}+\hat{\bm{\mathcal{Y}}}_{L,n})\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}$.

We run each experiment 10 times with 10 different random seeds and report the mean and standard deviation. The results for the different methods are summarized in Table 1, from which we draw the following conclusions.

Table 1: Component-wise reconstruction error with standard deviation in parenthesis
Metric | perTucker | perPCA | globalTucker | localTucker | robustTucker
Global subspace error ($10^{-3}$) | \textbf{2.3} (0.8) | 536 (6) | \textbf{2.4} (0.8) | N/A | 3.7 (0.9)
Local subspace error ($10^{-1}$) | \textbf{6.4} (0.4) | 9.88 (0.07) | N/A | 9.34 (0.03) | N/A
Global component error ($10^{-3}$) | \textbf{5.7} (1.6) | 372 (16) | \textbf{5.7} (1.6) | N/A | 264 (12)
Local component error ($10^{-1}$) | \textbf{2.8} (0.6) | 23.1 (1.3) | N/A | 63 (3) | 10 (0.09)
Denoised error ($10^{-2}$) | 4 (0.7) | 13.7 (0.7) | 14.1 (0.7) | \textbf{2} (2) | 13.7 (0.7)

1) perTucker yields the best results in separating the global and local components because it utilizes the low-rank tensor structure of both components. This follows from two observations: (a) comparing the global component error of perTucker and perPCA, perTucker identifies the global component more accurately owing to its use of low-rank tensor structures; (b) comparing the local component error of perTucker and robustTucker, although robustTucker identifies the global subspace with decent accuracy, it yields a much larger local component reconstruction error because its local (sparse) component does not assume any low-rank structure.

2) The reconstruction error for the local components is larger than that for the global components across all methods. Two factors contribute to this: (a) reconstructing the local component is generally harder than reconstructing the global components, since it is only shared within the same client, so fewer samples are available, resulting in lower accuracy and larger variance; (b) the local components are generated by shape variations of the Swiss, oval, and rectangle patterns, which are not exactly low-rank; thus, the true local subspace is itself an approximation obtained from the dataset.

4.3 Classification

In this section, we evaluate the classification performance of the perTucker algorithm. The training sample size ranges from 10 to 50 in steps of 10 for each pattern. We perform the perTucker decomposition to obtain the local factor matrices, then create 50 new images for each pattern and use the method described in Sec. 3.6.1 to classify the 150 images. This procedure is repeated 100 times. Table 2 displays the mean and standard deviation of the classification accuracy for the various training sample sizes. From Table 2 we can see that perTucker exhibits excellent classification performance even when the sample size is small, and increasing the training sample size leads to improved prediction accuracy. In comparison, if we perform local Tucker on the three clients and classify the new figures, the accuracy is around 33% regardless of the training sample size, which is close to random guessing. This is because local Tucker models the commonality and peculiarity simultaneously; hence, the global features, whose coefficients are randomly generated, also influence the decision.

Table 2: Classification accuracy and standard deviation for different training sample sizes
Training sample size | 10 | 20 | 30 | 40 | 50
Accuracy | 86.7% | 91.2% | 92.7% | 93.9% | 95.0%
SD of accuracy | 6% | 4% | 4% | 4% | 3%

To visualize the classification process, we use boxplots to summarize the test statistics in (12) in Fig. 5. The sample size is set to 50, and the test statistics are centered by the median statistic within each pattern. For each true pattern, the test statistics of the correct class are significantly higher than those of the other two classes, implying that the classification procedure is effective and robust.

Refer to caption
Figure 5: Box plot for test statistics defined in (12) for different patterns and different classes

4.4 Clustering

In this section, we study the clustering performance of perTucker under a different problem setting. When generating the data, the variability within each pattern is measured by a "ratio" variable, and clustering performance is evaluated by the ability to group clients with similar ratios within each pattern. We focus on the Swiss pattern and cut the range of the ratio from 0.7 to 1.4 into 7 intervals, each of width 0.1. For each ratio interval, we generate 3 clients, each with a sample size of 100 figures. We then make clustering plots for the 21 clients and observe the aggregation of clients with similar ratios, as shown in Fig. 6.

Refer to caption
Figure 6: Clustering plot for swiss patterns

5 Case Study

In this section, we use two case study experiments to demonstrate the application of the perTucker algorithm. Sec. 5.1 provides an anomaly detection example that illustrates the power of the perTucker algorithm in detecting solar flares. The subsequent section, Sec. 5.2, demonstrates the effectiveness of the perTucker technique in classifying tonnage fault signals.

5.1 Anomaly detection in solar flare data

The first example involves monitoring solar activities and detecting solar flares from a stream of solar images. A solar flare is a significant event on the sun that emits enormous amounts of energetic charged particles, which can cause power grids to fail (Marusek, 2007). Therefore, detecting solar flares promptly is critical for implementing preemptive and corrective measures. However, monitoring solar activities is challenging because of the high dimensionality of the solar thermal imaging data and the gradual change of solar temperature over time. Existing detection methods, which estimate the functional mean (background) by the sample mean and subtract it, are insufficient for detecting small transient flares in such a dynamic system (Zhao et al., 2022a). Other studies focus on decomposing solar images into background and anomaly components (Yan et al., 2018).

This dataset, publicly available in Zhao et al. (2022a), comprises a sequence of 232 × 292-pixel images captured by a satellite; we use a sample of 300 frames in this case study. To detect solar flares in real time, we first subtract the mean from each sample and preprocess the data using the method proposed by Aharon et al. (2006). We then use a sliding window of 8×8 pixels to divide each frame into small patches after discarding the four right-most columns of pixels, resulting in 1044 patches per frame. Each patch is vectorized, yielding a data tensor of dimension 300×1044×64. After this preprocessing step, we apply the perTucker algorithm to decompose the data into two parts: global components representing the slowly changing background and local components indicating the detected solar flares. A sketch of the patching step appears below.
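The patching step can be reproduced with a short NumPy sketch; the random array below merely stands in for the real mean-subtracted frames, and the variable names are illustrative.

```python
import numpy as np

frames = np.random.randn(300, 232, 292)        # placeholder for the preprocessed frames
frames = frames[:, :, :288]                    # discard the four right-most pixel columns
T, H, W = frames.shape
p = 8                                          # sliding-window (patch) size
patches = (frames
           .reshape(T, H // p, p, W // p, p)   # split rows and columns into 8x8 tiles
           .transpose(0, 1, 3, 2, 4)           # group the two tile indices together
           .reshape(T, (H // p) * (W // p), p * p))
print(patches.shape)                           # (300, 1044, 64)
```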

Figure 7: Detection of solar flares at $t=191$ and $t=217$

Fig. 7 shows two frames in which there is an abrupt change against the almost-stationary background. In the original images, such changes are not visible within the complicated background. However, after using perTucker to separate global and local components, one can clearly see the location and movement of small, rapid changes in the local components. The experiment highlights perTucker's ability to magnify weak transient signals relative to the background.

Moreover, since the local component represents the anomaly, we can use the Frobenius norm of the local core tensor, $\left\lVert\bm{\mathcal{C}}_{L,n}\right\rVert_{F}$, as a monitoring statistic to detect anomalies. The results are shown in Fig. 8, where we plot the logarithm of $\left\lVert\bm{\mathcal{C}}_{L,n}\right\rVert_{F}$ for the 300 frames.

Figure 8: The logarithm of the Frobenius norm of the local core tensors for the 300 frames

Most of the time, the curve in Fig. 8 remains at a low level, suggesting that no significant changes in solar activity are detected. However, there are two clear peaks above the red control line, corresponding to two solar flares. Fig. 8 therefore provides an intuitive approach for detecting anomalous behavior; a sketch of this monitoring rule is given below.
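A rough sketch of this monitoring rule is given below (NumPy); `local_cores` stands in for the per-frame local core tensors returned by perTucker, and the control limit is one simple illustrative choice rather than the exact limit used in Fig. 8.

```python
import numpy as np

local_cores = [np.random.randn(5, 5, 5) for _ in range(300)]   # placeholder cores

stats = np.array([np.log(np.linalg.norm(core)) for core in local_cores])
control_limit = np.median(stats) + 3 * stats.std()             # illustrative limit
alarm_frames = np.where(stats > control_limit)[0]              # frames flagged as flares
```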

5.2 Classification for tonnage data

perTucker also applies to monitoring tonnage profiles in a multi-operation forging process that utilizes multiple strain gauge sensors. Specifically, the forging process employs four columns, each equipped with a strain gauge sensor that measures the tonnage force exerted by the press uprights, as illustrated in Fig. 9. Consequently, each operational cycle generates a four-channel tonnage profile. The dataset for this case study consists of 305 in-control profiles, collected under normal production conditions, and 69 out-of-control profiles for each of four different fault classes. Each channel profile has length 1201, resulting in a data tensor $\bm{\mathcal{Y}}\in\mathbb{R}^{4\times 1201\times 305}$ for the normal class and $\bm{\mathcal{Y}}\in\mathbb{R}^{4\times 1201\times 69}$ for each fault class. Fig. 10 presents example profiles for the normal condition and the four fault conditions.

Figure 9: Tonnage Signal Monitoring
Figure 10: Tonnage Data
Figure 11: Relative decision test statistics compared to normal when the true label is (a) Fault 1, (b) Fault 2, (c) Fault 3, (d) Fault 4, (e) Normal

In this case study, we define the five “clients” as the normal condition and the four fault conditions. We select the first 50 normal samples and the first 10 samples from each fault condition to form the training dataset. The test dataset comprises the remaining 255 normal samples and 59 samples from each fault condition. We further assume that only the global and local factor matrices along the signal-length dimension are orthogonal to each other.

Applying the classification method in Sec. 3.6.1, we achieve 100% accuracy in classifying all fault modes and normal cases. Fig. 11 shows the relative test statistics, computed as the difference between the test statistic of each fault mode and that of the normal class. Consequently, the normal test statistic is normalized to $y=0$ for every sample, shown as a red line in Fig. 11.

We can see that for every fault mode, the test statistics of the corresponding fault mode are much higher than those of the other three faults and the normal baseline. When the true label is normal, the relative test statistics of all fault modes fall below the normal baseline. Thus, perTucker provides informative statistics for anomaly detection and fault diagnostics.

6 Conclusion

In this paper, we propose the perTucker method, which decomposes tensor data into global components with shared factor matrices and local components that are orthogonal to the global components. The global and local components model the commonality and peculiarity of the data, respectively. We present an efficient block coordinate descent algorithm to solve the model; the proximal terms included in the factor-matrix updates guarantee convergence to a stationary point. Several applications of perTucker, including anomaly detection, classification, and clustering, are demonstrated through a simulation study and two case studies.

Future work will explore additional structural properties of the global and local components. For example, the local components could be constrained to be sparse to model anomaly components in different applications.

References

  • Aharon et al., (2006) Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322.
  • Candès et al., (2011) Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37.
  • Du et al., (2022) Du, J., Yan, H., Chang, T.-S., and Shi, J. (2022). A tensor voting-based surface anomaly classification approach by using 3d point cloud data. Journal of Manufacturing Science and Engineering, 144(5).
  • Dulal et al., (2022) Dulal, D., Karim, R. G., and Navasca, C. (2022). COVID-19 analysis using tensor methods. arXiv preprint arXiv:2212.14558.
  • Fanaee-T and Gama, (2016) Fanaee-T, H. and Gama, J. (2016). Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 98:130–147.
  • Fu et al., (2016) Fu, Y., Gao, J., Tien, D., Lin, Z., and Hong, X. (2016). Tensor lrr and sparse coding-based subspace clustering. IEEE transactions on neural networks and learning systems, 27(10):2120–2133.
  • Gahrooei et al., (2021) Gahrooei, M. R., Yan, H., Paynabar, K., and Shi, J. (2021). Multiple tensor-on-tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, 63(2):147–159.
  • Gatto et al., (2021) Gatto, B. B., dos Santos, E. M., Koerich, A. L., Fukui, K., and Junior, W. S. (2021). Tensor analysis with n-mode generalized difference subspace. Expert Systems with Applications, 171:114559.
  • Guo et al., (2022) Guo, J., Yan, H., and Zhang, C. (2022). A bayesian partially observable online change detection approach with thompson sampling. Technometrics, pages 1–13.
  • Hao et al., (2013) Hao, Z., He, L., Chen, B., and Yang, X. (2013). A linear support higher-order tensor machine for classification. IEEE Transactions on Image Processing, 22(7):2911–2920.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer.
  • He and Atia, (2022) He, Y. and Atia, G. K. (2022). Multi-mode tensor space clustering based on low-tensor-rank representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36(6), pages 6893–6901.
  • Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
  • Klus and Gelß, (2019) Klus, S. and Gelß, P. (2019). Tensor-based algorithms for image classification. Algorithms, 12(11):240.
  • Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM review, 51(3):455–500.
  • Kontar et al., (2021) Kontar, R., Shi, N., Yue, X., Chung, S., Byon, E., Chowdhury, M., Jin, J., Kontar, W., Masoud, N., Nouiehed, M., et al. (2021). The internet of federated things (IoFT). IEEE Access, 9:156071–156113.
  • Li et al., (2023) Li, J., Gao, Q., Wang, Q., Xia, W., and Gao, X. (2023). Multi-view clustering via semi-non-negative tensor factorization. arXiv preprint arXiv:2303.16748.
  • Li et al., (2011) Li, J., Han, G., Wen, J., and Gao, X. (2011). Robust tensor subspace learning for anomaly detection. International Journal of Machine Learning and Cybernetics, 2:89–98.
  • Li et al., (2020) Li, Z., Sergin, N. D., Yan, H., Zhang, C., and Tsung, F. (2020). Tensor completion for weakly-dependent data on graph for metro passenger flow prediction. 34(04):4804–4810.
  • Li et al., (2022) Li, Z., Yan, H., Tsung, F., and Zhang, K. (2022). Profile decomposition based hybrid transfer learning for cold-start data anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(6):1–28.
  • Lu et al., (2019) Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., and Yan, S. (2019). Tensor robust principal component analysis with a new tensor nuclear norm. IEEE transactions on pattern analysis and machine intelligence, 42(4):925–938.
  • Marusek, (2007) Marusek, J. A. (2007). Solar storm threat analysis. J. Marusek.
  • Momeni and Ebrahimkhanlou, (2022) Momeni, H. and Ebrahimkhanlou, A. (2022). High-dimensional data analytics applications in shm and nde: Tensor analysis of thermal videos. In Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2022, volume 12046, pages 237–242. SPIE.
  • Nomikos and MacGregor, (1994) Nomikos, P. and MacGregor, J. F. (1994). Monitoring batch processes using multiway principal component analysis. AIChE Journal, 40(8):1361–1375.
  • Ren et al., (2022) Ren, B., Yang, L. T., Zhang, Q., Feng, J., and Nie, X. (2022). Blockchain-powered tensor meta-learning-driven intelligent healthcare system with iot assistance. IEEE Transactions on Network Science and Engineering.
  • Sandhu et al., (2018) Sandhu, R., Kaur, N., Sood, S. K., and Buyya, R. (2018). Tdrm: tensor-based data representation and mining for healthcare data in cloud computing environments. The Journal of Supercomputing, 74:592–614.
  • Shen et al., (2022) Shen, B., Xie, W., and Kong, Z. J. (2022). Smooth robust tensor completion for background/foreground separation with missing pixels: Novel algorithm with convergence guarantee. Journal of Machine Learning Research, 23(217):1–40.
  • Shi and Kontar, (2022) Shi, N. and Kontar, R. A. (2022). Personalized pca: Decoupling shared and unique features. arXiv preprint arXiv:2207.08041.
  • Sun and Luo, (2016) Sun, R. and Luo, Z.-Q. (2016). Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579.
  • Sun and Li, (2019) Sun, W. W. and Li, L. (2019). Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907.
  • Tan et al., (2013) Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., and Zhuang, Y. (2013). Logistic tensor regression for classification. In Intelligent Science and Intelligent Data Engineering: Third Sino-foreign-interchange Workshop, IScIDE 2012, Nanjing, China, October 15-17, 2012. Revised Selected Papers 3, pages 573–581. Springer.
  • Tucker, (1966) Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.
  • Wu et al., (2016) Wu, T., Benson, A. R., and Gleich, D. F. (2016). General tensor spectral co-clustering for higher-order data. Advances in Neural Information Processing Systems, 29.
  • Yan et al., (2019) Yan, H., Paynabar, K., and Pacella, M. (2019). Structured point cloud data analysis via regularized tensor regression for process modeling and optimization. Technometrics, 61(3):385–395.
  • Yan et al., (2014) Yan, H., Paynabar, K., and Shi, J. (2014). Image-based process monitoring using low-rank tensor decomposition. IEEE Transactions on Automation Science and Engineering, 12(1):216–227.
  • Yan et al., (2017) Yan, H., Paynabar, K., and Shi, J. (2017). Anomaly detection in images with smooth background via smooth-sparse decomposition. Technometrics, 59(1):102–114.
  • Yan et al., (2018) Yan, H., Paynabar, K., and Shi, J. (2018). Real-time monitoring of high-dimensional functional data streams via spatio-temporal smooth sparse decomposition. Technometrics, 60(2):181–197.
  • Yan et al., (2005) Yan, S., Xu, D., Yang, Q., Zhang, L., Tang, X., and Zhang, H.-J. (2005). Discriminant analysis with tensor representation. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 526–532. IEEE.
  • Zhang et al., (2023) Zhang, X., Shen, Q., Chen, Y., Zhang, G., Hua, Z., and Su, J. (2023). Multi-view ensemble clustering via low-rank and sparse decomposition: from matrix to tensor. ACM Transactions on Knowledge Discovery from Data.
  • (40) Zhao, X., Hu, J., Mei, Y., and Yan, H. (2022a). Adaptive partially observed sequential change detection and isolation. Technometrics, 64(4):502–512.
  • (41) Zhao, X., Yan, H., Hu, Z., and Du, D. (2022b). Deep spatio-temporal sparse decomposition for trend prediction and anomaly detection in cardiac electrical conduction. IISE Transactions on Healthcare Systems Engineering, 12(2):150–164.
  • (42) Zhao, Y., Yan, H., Holte, S., and Mei, Y. (2022c). Rapid detection of hot-spots via tensor decomposition with applications to crime rate data. Journal of Applied Statistics, 49(7):1636–1662.
  • Zhao et al., (2020) Zhao, Y., Yan, H., Holte, S. E., Kerani, R. P., and Mei, Y. (2020). Rapid detection of hot-spot by tensor decomposition on space and circular time with application to weekly gonorrhea data. In The 13th International Workshop on Intelligent Statistical Quality Control 2019.
  • Zhen et al., (2023) Zhen, Z., Paynabar, K., and Shi, J. (2023). Image-based feedback control using tensor analysis. Technometrics, pages 1–10.
  • Zhou et al., (2013) Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.
  • Zhou et al., (2019) Zhou, P., Lu, C., Feng, J., Lin, Z., and Yan, S. (2019). Tensor low-rank representation for data recovery and clustering. IEEE transactions on pattern analysis and machine intelligence, 43(5):1718–1732.
  • Zubair and Wang, (2013) Zubair, S. and Wang, W. (2013). Tensor dictionary learning with sparse tucker decomposition. In 2013 18th international conference on Digital Signal Processing (DSP), pages 1–6. IEEE.

Appendix A Proof of Proposition 1

In this section, we prove the equivalent condition stated in Proposition 1 for two tensors to be orthogonal to each other. We first introduce Lemma A.1.

Lemma A.1.

For two matrices $\bm{A}$ and $\bm{B}$,

$$\|\bm{A}\bigotimes\bm{B}\|_{F}^{2}=\|\bm{A}\|_{F}^{2}\|\bm{B}\|_{F}^{2}$$

Proof.

$$\begin{aligned}
\|\bm{A}\bigotimes\bm{B}\|_{F}^{2}&=Tr\left[(\bm{A}\bigotimes\bm{B})^{\top}(\bm{A}\bigotimes\bm{B})\right]\\
&=Tr\left[\bm{A}^{\top}\bm{A}\bigotimes\bm{B}^{\top}\bm{B}\right]\\
&=Tr(\bm{A}^{\top}\bm{A})\,Tr(\bm{B}^{\top}\bm{B})\\
&=\|\bm{A}\|_{F}^{2}\|\bm{B}\|_{F}^{2}
\end{aligned}$$

∎
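As a quick sanity check (not part of the proof), Lemma A.1 can be verified numerically, e.g., in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((5, 2))
lhs = np.linalg.norm(np.kron(A, B))**2
rhs = np.linalg.norm(A)**2 * np.linalg.norm(B)**2
assert np.isclose(lhs, rhs)   # ||A (x) B||_F^2 = ||A||_F^2 ||B||_F^2
```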

Then we prove Proposition 1

Proof.

$$\begin{aligned}
\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle&=\left\langle\bm{\mathcal{C}}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K},\ \bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rangle\\
&=\left\langle\left({\bm{U}}_{G,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}\right)\bm{c}_{G,n},\ \left(\bm{V}_{n,K}\bigotimes\cdots\bigotimes\bm{V}_{n,1}\right)\bm{c}_{L,n}\right\rangle\\
&=Tr\left(\bm{c}_{G,n}^{\top}\left({\bm{U}}_{G,K}^{\top}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\right)\left(\bm{V}_{n,K}\bigotimes\cdots\bigotimes\bm{V}_{n,1}\right)\bm{c}_{L,n}\right)\\
&=Tr\left(\bm{c}_{G,n}^{\top}\left({\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\right)\bm{c}_{L,n}\right),
\end{aligned}$$

where $\bm{c}_{G,n}$ and $\bm{c}_{L,n}$ are the vectorized global and local core tensors.

On the one hand, if ${\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}=0$ for some $k$, then ${\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}=\bm{0}$ and thus $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0$ for any $\bm{\mathcal{C}}_{G,n}$, $\bm{\mathcal{C}}_{L,n}$. On the other hand, if we require $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0$ for any $\bm{\mathcal{C}}_{G,n}$, $\bm{\mathcal{C}}_{L,n}$, we need

$$\begin{aligned}
&{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}=\bm{0}\\
\Leftrightarrow\ &\|{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\|_{F}^{2}=0\\
\Leftrightarrow\ &\|{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\|_{F}^{2}\cdots\|{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\|_{F}^{2}=0
\end{aligned}$$

Therefore, for at least one mode $k$, $\|{\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}\|_{F}^{2}=0$, which means that ${\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}=0$. ∎

Appendix B Proof of Proposition 2

Proof.

For each $n$, the optimization problem for the global and local core tensors is

$$\min_{\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n}}\|\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}-\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2} \quad (14)$$

By vectorizing the terms, we get

$$\begin{aligned}
&\min_{\bm{c}_{G,n},\bm{c}_{L,n}}\|\bm{y}_{n}-(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k})\bm{c}_{G,n}-(\bigotimes_{k=K}^{1}\bm{V}_{n,k})\bm{c}_{L,n}\|^{2}\\
\iff\ &\min_{\bm{c}_{G,n},\bm{c}_{L,n}}\left\lVert\bm{y}_{n}-\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}\begin{pmatrix}\bm{c}_{G,n}\\ \bm{c}_{L,n}\end{pmatrix}\right\rVert^{2}
\end{aligned}$$

where $\bigotimes_{k=K}^{1}$ denotes the Kronecker product of the matrices in the order $K,K-1,\ldots,1$, and $\bm{c}_{G,n}$ and $\bm{c}_{L,n}$ are the vectorized global and local core tensors from source $n$. This is a least-squares problem in linear regression, whose solution for $\begin{pmatrix}\bm{c}_{G,n}\\ \bm{c}_{L,n}\end{pmatrix}$ is

$$\begin{pmatrix}\bm{c}^{\star}_{G,n}\\ \bm{c}^{\star}_{L,n}\end{pmatrix}=\left[\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}^{\top}\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}\right]^{-1}\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}^{\top}\bm{y}_{n}$$

Since the global and local factor matrices are orthonormal, $(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k})^{\top}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}=\bm{I}$ and $(\bigotimes_{k=K}^{1}\bm{V}_{n,k})^{\top}\bigotimes_{k=K}^{1}\bm{V}_{n,k}=\bm{I}$. The upper off-diagonal block of the Gram matrix is

$$(\bigotimes_{k=K}^{1}\bm{V}_{n,k})^{\top}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}=\bigotimes_{k=K}^{1}\bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$$

where the last equality holds because $\bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$ for at least one mode. Similarly, the lower off-diagonal block is also $0$; therefore, the Gram matrix is the identity matrix and

$$\begin{pmatrix}\bm{c}_{G,n}^{\star}\\ \bm{c}_{L,n}^{\star}\end{pmatrix}=\begin{pmatrix}(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}^{\top})\bm{y}_{n}\\ (\bigotimes_{k=K}^{1}\bm{V}_{n,k}^{\top})\bm{y}_{n}\end{pmatrix}$$

Transforming the solution above back to tensor form gives the solution in the proposition. ∎
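For a 3-mode sample, the closed-form solution of Proposition 2 amounts to two multilinear projections of the data, as in the NumPy sketch below (illustrative names; the mode-1 local factor is generated orthogonal to the mode-1 global factor so that the mode-orthogonality assumption holds):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 20, 20))
U1, U2, U3 = (np.linalg.qr(rng.standard_normal((20, 4)))[0] for _ in range(3))
V1 = np.linalg.qr((np.eye(20) - U1 @ U1.T) @ rng.standard_normal((20, 4)))[0]
V2, V3 = (np.linalg.qr(rng.standard_normal((20, 4)))[0] for _ in range(2))
assert np.allclose(U1.T @ V1, 0)                       # mode-1 orthogonality

C_G = np.einsum('abc,ai,bj,ck->ijk', Y, U1, U2, U3)    # C_G = Y x1 U1^T x2 U2^T x3 U3^T
C_L = np.einsum('abc,ai,bj,ck->ijk', Y, V1, V2, V3)    # C_L = Y x1 V1^T x2 V2^T x3 V3^T
```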

Appendix C Proof of Lemma 3.1

Proof.

$$\begin{aligned}
\|\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top}\|_{F}^{2}&=Tr\left[(\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top})(\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top})^{\top}\right]\\
&=Tr\left[\bm{UU}^{\top}\bm{UU}^{\top}-2\bm{UU}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}+{\bm{U}}_{t}{\bm{U}}_{t}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}\right]\\
&=2c-2Tr\left[{\bm{U}}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}{\bm{U}}\right]
\end{aligned}$$

where $c$ is the number of columns of the factor matrix ${\bm{U}}$, so that $Tr(\bm{U}\bm{U}^{\top}\bm{U}\bm{U}^{\top})=Tr(\bm{U}^{\top}\bm{U})=c$. ∎
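The identity in Lemma 3.1 (with $c$ equal to the number of columns of the factor matrix) can also be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]    # orthonormal columns
Ut = np.linalg.qr(rng.standard_normal((10, 3)))[0]
lhs = np.linalg.norm(U @ U.T - Ut @ Ut.T)**2
rhs = 2 * U.shape[1] - 2 * np.trace(U.T @ Ut @ Ut.T @ U)
assert np.isclose(lhs, rhs)
```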

Appendix D Proof of Lemma 3.2

Proof.

The proofs of Equation (8) and Equation (9) are similar; here we take one component of Equation (8) as an example. We use ${\bm{\mathcal{C}}}^{\star}_{G,n}$ and ${\bm{\mathcal{C}}}^{\star}_{L,n}$ to denote the optimized core tensors.

$$\begin{aligned}
&\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}\\
=\ &\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-2\left\langle{\bm{\mathcal{R}}}_{G,n},{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\right\rangle+\|{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}\\
=\ &\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{C}}}^{\star}_{G,n}\right\rangle+\|{\bm{\mathcal{C}}}^{\star}_{G,n}\|_{F}^{2}
\end{aligned}$$

The second term can be written as

$$\begin{aligned}
&-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{C}}}^{\star}_{G,n}\right\rangle\\
=\ &-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\right\rangle\\
=\ &-2\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}
\end{aligned}$$

and the third term as

$$\begin{aligned}
\|{\bm{\mathcal{C}}}^{\star}_{G,n}\|_{F}^{2}&=\|({\bm{\mathcal{R}}}_{G,n}+{\bm{\mathcal{C}}}_{L,n}\times_{1}{\bm{V}}_{n,1}\ldots\times_{K}{\bm{V}}_{n,K})\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}\\
&=\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}
\end{aligned}$$

Therefore, for each $n=1,\ldots,N$,

$$\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}$$

and Equation (8) holds. Similarly, for each $n=1,\ldots,N$,

$$\|{\bm{\mathcal{R}}}_{L,n}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{L,n}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{L,n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}\|_{F}^{2}$$

and Equation (9) holds. ∎

Appendix E Proof of Proposition 3

Proof.

First, by Lemma 3.1, we have

$$\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}=2c-2Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]$$

where $c$ is the number of columns of the factor matrix and can therefore be omitted in the optimization. Then, by Lemma 3.2, we can rewrite the minimization problem as a maximization problem:

$$\begin{aligned}
&\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}-\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{U}}_{G,k}^{\top}({\bm{\mathcal{R}}}_{{G,n}})_{(k)}(\bigotimes_{q\neq k}{\bm{U}}_{G,q}^{\top})^{\top}\|_{F}^{2}+2\rho Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}Tr\left[{\bm{U}}_{G,k}^{\top}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}{\bm{U}}_{G,k}\right]+2\rho Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]\\
\iff\ &\max_{{\bm{U}}_{G,k}}Tr\left[{\bm{U}}_{G,k}^{\top}\left(\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\right){\bm{U}}_{G,k}\right]
\end{aligned}$$

Therefore, the columns of ${\bm{U}}_{G,k}$ are the eigenvectors of $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}$ corresponding to the top $g_{k}$ eigenvalues. ∎
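In code, the Proposition 3 update is a single symmetric eigendecomposition; the NumPy sketch below uses illustrative names and assumes the matrices $\bm{W}_{G,n}$ have already been formed from the matricized residuals as in the proposition.

```python
import numpy as np

def update_global_factor(W_list, U_prev, rho, g_k):
    """Return the top-g_k eigenvectors of sum_n W_n W_n^T + 2*rho*U_prev U_prev^T."""
    S = sum(W @ W.T for W in W_list) + 2 * rho * (U_prev @ U_prev.T)
    eigvals, eigvecs = np.linalg.eigh(S)                 # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:g_k]]
```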

Appendix F Proof of Proposition 4

We first introduce a lemma.

Lemma F.1.

Suppose $\bm{V}\in\mathbb{R}^{a\times b}$, ${\bm{U}}\in\mathbb{R}^{a\times c}$ is an orthonormal matrix, and $\bm{S}\in\mathbb{R}^{a\times a}$ is a symmetric matrix. The solution to the optimization problem

$$\max_{\bm{V}}Tr(\bm{V}^{\top}\bm{SV})\quad s.t.\quad\bm{V}^{\top}\bm{V}=\bm{I},\ \bm{V}^{\top}{\bm{U}}=0$$

consists of the eigenvectors of the matrix $(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})$ corresponding to the top $b$ eigenvalues.

Proof.

Since all feasible solutions satisfy $\bm{V}^{\top}{\bm{U}}=0$, we have

$$\begin{aligned}
&\max_{\bm{V}}Tr(\bm{V}^{\top}\bm{SV})\\
\iff\ &\max_{\bm{V}}Tr[\bm{V}^{\top}\bm{SV}-\bm{V}^{\top}\bm{UU}^{\top}\bm{SV}-\bm{V}^{\top}\bm{SUU}^{\top}\bm{V}+\bm{V}^{\top}\bm{UU}^{\top}\bm{S}\bm{UU}^{\top}\bm{V}]\\
\iff\ &\max_{\bm{V}}Tr[\bm{V}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})\bm{V}]
\end{aligned}$$

We claim that the optimal solution of

$$\max_{\bm{V}}Tr[\bm{V}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})\bm{V}]\quad s.t.\quad\bm{V}^{\top}\bm{V}=\bm{I},\ \bm{V}^{\top}{\bm{U}}=0$$

remains the same when the constraint $\bm{V}^{\top}{\bm{U}}=0$ is relaxed. When this constraint is relaxed, the optimal solution consists of the eigenvectors of the matrix $(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})$. For any eigenvector $\bm{x}$ with eigenvalue $\lambda$, we have

$$\lambda\bm{x}^{\top}{\bm{U}}=\bm{x}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top}){\bm{U}}=0$$

so $\bm{x}^{\top}{\bm{U}}=0$ whenever $\lambda\neq 0$. Therefore, the optimal solution of the relaxed problem also satisfies $\bm{V}^{\top}{\bm{U}}=0$. This completes the proof. ∎

Then we provide the proof of Proposition 4.

Proof.

For each $n=1,\ldots,N$ and $k=1,\ldots,K$, the optimization problem to update the local factor matrix $\bm{V}_{n,k}$ is

$$\bm{V}_{n,k}=\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} \quad (15)$$

Following a similar process to the update of the global factor matrices, the optimization problem above can be expressed as

$$\begin{aligned}
&\max_{\bm{V}_{n,k}}\|\bm{V}_{n,k}^{\top}({\bm{\mathcal{R}}}_{L,n})_{(k)}(\bigotimes_{q\neq k}\bm{V}_{n,q}^{\top})^{\top}\|_{F}^{2}+2\rho Tr\left[\bm{V}_{n,k}^{\top}\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\bm{V}_{n,k}\right]\\
\iff\ &\max_{\bm{V}_{n,k}}Tr\left[\bm{V}_{n,k}^{\top}\left(\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\right)\bm{V}_{n,k}\right]
\end{aligned}$$

Now denote $\bm{S}=\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}$. If $k\not\in\mathcal{K}$, there is no further constraint on $\bm{V}_{n,k}$; therefore, the columns of $\bm{V}_{n,k}$ are the eigenvectors of $\bm{S}$ corresponding to the top $l_{n,k}$ eigenvalues.
If $k\in\mathcal{K}$, the problem becomes

$$\max_{\bm{V}_{n,k}}Tr\left(\bm{V}_{n,k}^{\top}\bm{S}\bm{V}_{n,k}\right)\quad s.t.\quad\bm{V}_{n,k}^{\top}\bm{V}_{n,k}=\bm{I},\ \bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$$

Therefore, by Lemma F.1, letting $\bm{S}^{\prime}=(\bm{I}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top})\bm{S}(\bm{I}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top})$, the columns of $\bm{V}_{n,k}$ are the eigenvectors of $\bm{S}^{\prime}$ corresponding to the top $l_{n,k}$ eigenvalues. ∎
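The local-factor update can likewise be written as a projected eigendecomposition, following Lemma F.1; the sketch below (illustrative names) also verifies that the resulting factor is orthogonal to the global factor.

```python
import numpy as np

def update_local_factor(S, U_global, l_k):
    """Top-l_k eigenvectors of (I - UU^T) S (I - UU^T); columns are orthogonal to U."""
    P = np.eye(U_global.shape[0]) - U_global @ U_global.T
    eigvals, eigvecs = np.linalg.eigh(P @ S @ P)
    return eigvecs[:, np.argsort(eigvals)[::-1][:l_k]]

rng = np.random.default_rng(4)
M = rng.standard_normal((30, 30))
S = M @ M.T                                             # symmetric stand-in for S
U = np.linalg.qr(rng.standard_normal((30, 4)))[0]
V = update_local_factor(S, U, l_k=4)
assert np.allclose(V.T @ U, 0, atol=1e-8)               # constraint V^T U = 0 holds
```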

Appendix G Proof of Theorem 5

In this section, we prove Theorem 5. First, we show that the global factors ${\bm{U}}_{G,k}$ converge to stationary points; then we show that the local factors ${\bm{V}}_{n,k}$ also converge to stationary points. Some auxiliary lemmas are relegated to Sec. J for clarity.

For notational simplicity, we define

$$f_{G,n}\left({\bm{U}}_{G,1},{\bm{U}}_{G,2},\cdots,{\bm{U}}_{G,K}\right)=\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1}{\bm{U}}_{G,1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K}{\bm{U}}_{G,K}^{\top}\right\rVert_{F}^{2} \quad (16)$$

and

$$f_{L,n}\left({\bm{V}}_{n,1},{\bm{V}}_{n,2},\cdots,{\bm{V}}_{n,K}\right)=\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{V}}_{n,1}{\bm{V}}_{n,1}^{\top}\times_{2}\cdots\times_{K}{\bm{V}}_{n,K}{\bm{V}}_{n,K}^{\top}\right\rVert_{F}^{2} \quad (17)$$

Both $f_{G,n}$ and $f_{L,n}$ are bounded below by 0.

The following lemma shows that the global factors ${\bm{U}}_{G,k}$ converge to stationary points.

Lemma G.1.

The column spaces spanned by the ${\bm{U}}_{G,k}$'s converge to stationary points:

$$\sum_{t=1}^{T}\sum_{k=1}^{K}\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}^{2}\leq\frac{1}{\rho}\sum_{n=1}^{N}f_{G,n}\left({\bm{U}}_{G,1,1},{\bm{U}}_{G,2,1},\cdots,{\bm{U}}_{G,K,1}\right) \quad (18)$$

Notice that dividing both sides of (18) by $TK$ immediately proves the first inequality in Theorem 5.

Proof.

From Proposition 3, the update of global factor $k$ is given by

$${\bm{U}}_{G,k,t+1}=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n,t}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1,t+1}\ldots\times_{K}{\bm{U}}_{G,K,t}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}$$

where ${\bm{\mathcal{C}}}^{\star}_{G,n}={\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}$.

By Lemma 3.2, we know

$$\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}=-\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}+const$$

From the definition of ${\bm{\mathcal{R}}}_{G,n,t}$, we know that ${\bm{\mathcal{R}}}_{G,n,t}={\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{C}}}_{L,n,t}\times_{1}{\bm{V}}_{n,1,t}\times_{2}\cdots\times_{K}{\bm{V}}_{n,K,t+1}$. Since $|{\mathcal{K}}|\geq 2$ and ${\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}=0$ for every $k\in{\mathcal{K}}$ and every $t$, there exists at least one $m\in{\mathcal{K}}$ with $m\neq k$ such that ${\bm{U}}_{G,m,t}^{\top}{\bm{V}}_{n,m,t}=0$ and ${\bm{U}}_{G,m,t+1}^{\top}{\bm{V}}_{n,m,t+1}=0$. Therefore, ${\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}={\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}$.

Then we can apply Lemma 3.2 again and rewrite the maximization problem as the minimization problem

$$\begin{aligned}
{\bm{U}}_{G,k,t+1}&=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|\bm{\mathcal{Y}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}{\bm{U}}_{G,1,t+1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K,t}{\bm{U}}_{G,K,t}^{\top}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
&=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t})+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}
\end{aligned}$$

From the definition of the optimal solution, we know

$$\begin{aligned}
&\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k,t+1},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t})+\rho\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\leq\ &\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k,t},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t}) \quad (19)
\end{aligned}$$

Summing both sides over $k$ from $1$ to $K$ and then over $t$ from $1$ to $T$, the $f_{G,n}$ terms telescope and we obtain

$$\rho\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\leq\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,1},\cdots,{\bm{U}}_{G,K,1})-\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,T+1},\cdots,{\bm{U}}_{G,K,T+1}) \quad (20)$$

Since $f_{G,n}\geq 0$, this completes the proof. ∎

Now we analyze the convergence of the ${\bm{V}}_{n,k}$'s. The following lemma gives an upper bound on how much the subspace spanned by ${\bm{U}}_{G,k}$ can change in one update.

Lemma G.2.

If there exists a constant $B>0$ such that $\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\leq B$ for all $n$, then we have

$$\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\right\rVert_{F}\leq\frac{2B\sqrt{N}}{\sqrt{\rho}} \quad (21)$$

for all $t=1,2,\cdots,T$ and all $n=1,\cdots,N$.

Proof.

Since $\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\leq B$, by the triangle inequality we know that $f_{G,n}\leq 4B^{2}$. Applying (19) and the fact that $f_{G,n}\geq 0$ proves the desired inequality. ∎

We treat the two cases $k\in{\mathcal{K}}$ and $k\notin{\mathcal{K}}$ differently, and first consider the case where $k\in{\mathcal{K}}$. In Algorithm 2, after updating ${\bm{U}}_{G,k,t+1}$, ${\bm{V}}_{n,k,t}$ is generally not orthogonal to ${\bm{U}}_{G,k,t+1}$. However, we can show that for any source $n$, there exists an orthonormal matrix ${\tilde{\bm{V}}}_{n,k,t}$ such that ${\tilde{\bm{V}}}_{n,k,t}$ is close to ${\bm{V}}_{n,k,t}$ and ${\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0$.

Lemma G.3.

Under the same conditions as Lemma G.2, if additionally the regularization parameter $\rho$ in Algorithm 2 is not too small, namely $\rho\geq 8B^{2}N$, then there exists an orthonormal matrix ${\tilde{\bm{V}}}_{n,k,t}$ such that (1) $\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}\leq 11\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}$ and (2) ${\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0$.

Proof.

We define 𝑽~n,k,t{\tilde{\bm{V}}}_{n,k,t} as,

𝑽~n,k,t=(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)12{\tilde{\bm{V}}}_{n,k,t}=\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-\frac{1}{2}}

It is easy to verify that 𝑽~n,k,t𝑼G,k,t+1=0{\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0 and that 𝑽~n,k,t𝑽~n,k,t=𝑰{\tilde{\bm{V}}}_{n,k,t}^{\top}{\tilde{\bm{V}}}_{n,k,t}={\bm{I}}. Now we will prove an upper bound on the norm on the distance between the subspace spanned by 𝑽n,k,t{\bm{V}}_{n,k,t} and by 𝑽~n,k,t{\tilde{\bm{V}}}_{n,k,t}.

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t\displaystyle{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}
=(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)\displaystyle=\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)
(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)\displaystyle-\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)
+(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)𝑽n,k,t𝑽n,k,t\displaystyle+\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}

Therefore, by triangular inequality,

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF\displaystyle\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}
𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t22(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF\displaystyle\leq\left\lVert{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}^{2}\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tF+𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1F\displaystyle+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}+\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right\rVert_{F}
+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1F\displaystyle+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right\rVert_{F}
4(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F𝑽n,k,t𝑼G,k,t+1F2+2𝑽n,k,t𝑼G,k,t+1F+𝑽n,k,t𝑼G,k,t+1F2\displaystyle\leq 4\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}^{2}+2\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}+\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}^{2}

where the last inequality comes from the fact that

𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t2\displaystyle\left\lVert{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}
𝑽n,k,t+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t2\displaystyle\leq\left\lVert{\bm{V}}_{n,k,t}\right\rVert+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}
1+(𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,t)𝑽n,k,t2\displaystyle\leq 1+\left\lVert\left({\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right){\bm{V}}_{n,k,t}\right\rVert_{2}
1+2BNρ2\displaystyle\leq 1+\frac{2B\sqrt{N}}{\sqrt{\rho}}\leq 2

Since 𝑼G,k,t{\bm{U}}_{G,k,t} and 𝑽n,k,t{\bm{V}}_{n,k,t} are orthogonal 𝑼G,k,t𝑽n,k,t=0{\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}=0, we have,

𝑼G,k,t+1𝑽n,k,tF=𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF\displaystyle\left\lVert{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}=\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
=𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑼G,k,t𝑼G,k,t𝑽n,k,tF\displaystyle=\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\displaystyle\leq\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

Thus, when 𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF22\scriptstyle\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}\leq\frac{\sqrt{2}}{2}, we know 𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF12\scriptstyle\small\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}\leq\frac{1}{2}. As a result, (𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F2\scriptstyle\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\leq 2.

Combining this with the fact that 𝑼G,k,t+1𝑽n,k,tF1\left\lVert{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}\leq 1, we have,

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF11𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}\leq 11\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

This completes the proof. ∎

Finally, we will prove the second equation in Theorem 5.

Proof.

We start by analyzing the update rule of ${\bm{V}}_{n,k}$. Proposition 4 gives the update rule in two cases, depending on whether $k\in{\mathcal{K}}$; we analyze them separately.

Case I, $k\notin{\mathcal{K}}$. In this case, ${\bm{V}}_{n,k,t+1}$ is the optimal solution to

𝑽n,k,t+1=argmin𝑽n,k𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

where 𝓒L,n{\bm{\mathcal{C}}}^{\star}_{L,n} is optimized according to the factors 𝑽n,1,t+1,,𝑽n,K,t{\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t}. Thus we have,

argmin𝑽n,k𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}
=argmin𝑽n,k𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{\bm{V}_{n,k}}-\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

Since |𝒦|2\left\lvert{\mathcal{K}}\right\rvert\geq 2, there exists at least one m𝒦m\in{\mathcal{K}} and mkm\neq k such that 𝑼G,m,t𝑽n,m,t=0{\bm{U}}_{G,m,t}^{\top}{\bm{V}}_{n,m,t}=0 and 𝑼G,m,t+1𝑽n,m,t+1=0{\bm{U}}_{G,m,t+1}^{\top}{\bm{V}}_{n,m,t+1}=0. We thus have

𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2=𝓨n×1𝑽n,1,t+1×K𝑽n,K,tF2\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}=\|\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}

Therefore, we can use Lemma 3.2 again to derive

𝑽n,k,t+1\displaystyle{\bm{V}}_{n,k,t+1} =argmin𝑽n,k𝓨n𝓒L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{{\bm{V}}_{n,k}}\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{C}}}_{L,n,t}\times_{1}{\bm{V}}_{n,1,t+1}\cdots\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}+\rho\left\lVert{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}
=argmin𝑽n,kfL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{{\bm{V}}_{n,k}}f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\left\lVert{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}

Since 𝑽n,k,t{\bm{V}}_{n,k,t} is a feasible solution, we know

fL,n(𝑽n,1,t+1,,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} (22)
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,,𝑽n,K,t)\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},\cdots,{\bm{V}}_{n,K,t})

Case II, $k\in{\mathcal{K}}$. By Proposition 4, the update rule of ${\bm{V}}_{n,k}$ in this case is given by

𝑽n,k,t+1=argmin𝑽n,k𝑼G,k,t+1𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}\perp{\bm{U}}_{G,k,t+1}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

where 𝓒L,n{\bm{\mathcal{C}}}^{\star}_{L,n} is optimized according to the factors 𝑽n,1,t+1,,𝑽n,K,t{\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t}. Thus we have,

𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2=𝓡L,n,tF2𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2\displaystyle\small\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{L,n,t}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}

Therefore, we can rewrite the equivalent formulation to 𝑽n,k{\bm{V}}_{n,k} as,

𝑽n,k,t+1=argmin𝑽n,k𝑼G,k,t+1fL,n(𝑽n,1,t+1,,𝑽n,k,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\underset{\bm{V}_{n,k}\perp{\bm{U}}_{G,k,t+1}}{\mathrm{argmin}}f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

The ${\tilde{\bm{V}}}_{n,k,t}$ defined in Lemma G.3 is a feasible solution; therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF2\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

We can bound the difference between fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}) and fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}) as,

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
(𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF+𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF)2\displaystyle\leq\Big{(}\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}+\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}\Big{)}^{2}
𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF2\displaystyle\leq\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}
+2𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF\displaystyle+2\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}
+𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF2\displaystyle+\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}

The first term above is just fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}). By Lemma J.2, the second term is upper bounded by 2𝓨nF𝓨nF𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t2\scriptstyle 2\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right\rVert_{2}, and the third term is bounded by 𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t22𝓨nF2\scriptstyle\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right\rVert_{2}^{2}\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}^{2}.

Therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,,𝑽n,K,t)+𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t24B2\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},\cdots,{\bm{V}}_{n,K,t})+\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{2}4B^{2}

Therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} (23)
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle-f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
4B2𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t2+ρ𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF2\displaystyle\leq 4B^{2}\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{2}+\rho\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}
C3𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\displaystyle\leq C_{3}\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

where C3C_{3} is a constant defined as,

C3=11B2+242BNρC_{3}=11B^{2}+242B\sqrt{N\rho}

In the last inequality of (23), we applied Lemma G.2 and Lemma G.3.

Combining equation (22) in Case I and equation (23) in Case II, we know that,

fL,n(𝑽n,1,t+1,,𝑽n,K,t+1)fL,n(𝑽n,1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t+1})-f_{L,n}({\bm{V}}_{n,1,t},\cdots,{\bm{V}}_{n,K,t})
k=1Kρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2+k𝒦C3𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF\displaystyle\leq\sum_{k=1}^{K}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}+\sum_{k\in{\mathcal{K}}}C_{3}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}
k=1Kρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2+k=1KC3𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF\displaystyle\leq\sum_{k=1}^{K}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}+\sum_{k=1}^{K}C_{3}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}

Summing both sides for tt from 11 to TT and for kk from 11 to KK, we have

k=1Kt=1Tρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2C3k=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF+fL,n,0\displaystyle\sum_{k=1}^{K}\sum_{t=1}^{T}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}\leq C_{3}\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}+f_{L,n,0}

Considering lemma G.1 and the fact that

k=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tFTKk=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tF2\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}\leq\sqrt{TK}\sqrt{\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}}

we can divide both sides by $T$ and conclude that

mint{1,,T}k=1K𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2=O(1T)\displaystyle\min_{t\in\{1,\cdots,T\}}\sum_{k=1}^{K}\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}=O\left(\frac{1}{\sqrt{T}}\right)

This completes our proof. ∎

Appendix H Simulation Study about the Convergence Rate

In this simulation, we generate the data according to (1) but without noise for better visualization. We assume that there are 5 sources of data, each with 10 samples, and each sample has dimension $50\times 50\times 50$. Therefore, the data are ${\bm{\mathcal{Y}}}_{1},\ldots,{\bm{\mathcal{Y}}}_{5}$, each with dimension $50\times 50\times 50\times 10$.

For data generation, we draw $\bm{U}_{G,1},\bm{U}_{G,2},\bm{U}_{G,3}$ of dimension $50\times 5$ from the standard normal distribution and orthonormalize them via SVD. We then draw the global core tensors $\bm{\mathcal{C}}_{G,1},\ldots,\bm{\mathcal{C}}_{G,5}$ of dimension $5\times 5\times 5\times 10$ from the standard normal distribution. For $n=1,\ldots,5$, the global component is constructed as $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\times_{2}\bm{U}_{G,2}\times_{3}\bm{U}_{G,3}$. When generating the local factor matrices, we assume $\mathcal{K}=\{1,2\}$. For each $n\in\{1,\ldots,5\}$ and $k\in\mathcal{K}$, $\bm{V}_{n,k}$ is first drawn from the standard normal distribution with dimension $50\times 5$; we then apply $(\bm{I}-\bm{U}_{G,k}(\bm{U}_{G,k}^{\top}\bm{U}_{G,k})^{-1}\bm{U}_{G,k}^{\top})\bm{V}_{n,k}$ to project the local factor matrix onto the orthogonal complement of the corresponding global factor matrix, and finally orthonormalize the result via SVD. For $k\notin\mathcal{K}$, we draw $\bm{V}_{n,k}$ from the standard normal distribution and orthonormalize it via SVD. Similarly, we draw the local core tensors $\bm{\mathcal{C}}_{L,1},\ldots,\bm{\mathcal{C}}_{L,5}$ of dimension $5\times 5\times 5\times 10$ from the standard normal distribution. The local components, for $n=1,\ldots,5$, are constructed as $\bm{\mathcal{Y}}_{L,n}=\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\times_{2}\bm{V}_{n,2}\times_{3}\bm{V}_{n,3}$.
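
For concreteness, the following NumPy sketch mirrors the data-generation procedure above. It is a minimal illustration under the stated setting; helper names such as \texttt{orthonormal} and \texttt{mode\_product} are ours and not part of the paper's implementation.

\begin{verbatim}
import numpy as np

def orthonormal(M):
    """Return an orthonormal basis of the column space of M (thin SVD)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

rng = np.random.default_rng(0)
dim, rank, n_sources, n_samples = 50, 5, 5, 10
personalized_modes = {0, 1}      # K = {1, 2} in the paper (0-indexed here)

# Shared global factor matrices U_{G,1}, U_{G,2}, U_{G,3}
U_G = [orthonormal(rng.standard_normal((dim, rank))) for _ in range(3)]

data = []
for n in range(n_sources):
    # Global component Y_{G,n} = C_{G,n} x_1 U_{G,1} x_2 U_{G,2} x_3 U_{G,3}
    C_G = rng.standard_normal((rank, rank, rank, n_samples))
    Y_G = C_G
    for k in range(3):
        Y_G = mode_product(Y_G, U_G[k], k)

    # Local factor matrices; on personalized modes, project onto the
    # orthogonal complement of the global factors (I - U U^T, U orthonormal)
    V_n = []
    for k in range(3):
        V = rng.standard_normal((dim, rank))
        if k in personalized_modes:
            V = V - U_G[k] @ (U_G[k].T @ V)
        V_n.append(orthonormal(V))

    # Local component and noiseless observation Y_n = Y_{G,n} + Y_{L,n}
    C_L = rng.standard_normal((rank, rank, rank, n_samples))
    Y_L = C_L
    for k in range(3):
        Y_L = mode_product(Y_L, V_n[k], k)
    data.append(Y_G + Y_L)
\end{verbatim}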

When running perTucker, we use the true dimensions of the global and local factor matrices. Along the iterations, we monitor the mean subspace errors of the global and local factor matrices for each client, namely $\frac{1}{3}\sum_{k=1}^{3}\|{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}\|_{F}^{2}$ for the global factor matrices and $\frac{1}{3}\sum_{k=1}^{3}\|{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}\|_{F}^{2}$ for the local factor matrices of source $n$.
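
The monitoring statistic can be computed as in the short sketch below, assuming the estimated and true factor matrices are available as lists of NumPy arrays (the helper name \texttt{mean\_subspace\_error} is ours):

\begin{verbatim}
import numpy as np

def mean_subspace_error(estimated, truth):
    """Mean squared Frobenius distance between the projection matrices
    U_hat U_hat^T and U U^T over the factor matrices of the three modes."""
    errors = [
        np.linalg.norm(U_hat @ U_hat.T - U @ U.T, "fro") ** 2
        for U_hat, U in zip(estimated, truth)
    ]
    return sum(errors) / len(errors)

# Global monitoring at iteration t:    mean_subspace_error(U_G_hat_t, U_G)
# Local monitoring for source n at t:  mean_subspace_error(V_hat_t[n], V_true[n])
\end{verbatim}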

Appendix I Discussion of Classification Rule in Sec. 3.6.1

From Proposition 2, when a new piece of data $\bm{\mathcal{Y}}^{\text{new}}$ arrives, we can first derive the optimal global core tensor $\bm{\mathcal{C}}^{\star}_{G}$ and, for each class $n$, the optimal local core tensor $\bm{\mathcal{C}}^{\star}_{L,n}$. The decision rule that minimizes the reconstruction error over all classes is then given by

\[
\hat{n}=\arg\min_{n}\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}.
\tag{24}
\]

We now show that decision rule (24) is equivalent to decision rule (12). Expanding the squared norm in (24), we have

\[
\begin{aligned}
&\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}\\
&=\left\lVert\bm{\mathcal{Y}}^{\text{new}}\right\rVert_{F}^{2}+\left\lVert\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}\right\rVert_{F}^{2}+\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}\\
&\quad-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}\right\rangle-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle\\
&\quad-2\left\langle\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle.
\end{aligned}
\]

The first, second, and fourth terms do not depend on $n$ and can therefore be excluded. The last term is zero because the global and local components are orthogonal. The third and fifth terms can be rewritten as

\[
\begin{aligned}
&\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle\\
&=\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2}-2\left\langle\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\cdots\times_{K}\hat{\bm{V}}_{n,K}^{\top},\,\bm{\mathcal{C}}^{\star}_{L,n}\right\rangle\\
&=-\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2},
\end{aligned}
\]

where the first equality uses the orthonormality of the local factor matrices and the second uses the expression of the optimal local core tensor, $\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\cdots\times_{K}\hat{\bm{V}}_{n,K}^{\top}$.

Therefore,

\[
\arg\min_{n}\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}=\arg\max_{n}\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2}.
\]
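
As an illustration of this equivalence, a minimal sketch of the resulting classifier is given below. It assumes the estimated local factor matrices are orthonormal and that the optimal local core tensor is obtained by projecting $\bm{\mathcal{Y}}^{\text{new}}$ onto them, as in the derivation above; the names \texttt{classify\_by\_local\_energy} and \texttt{mode\_product} are ours.

\begin{verbatim}
import numpy as np

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

def classify_by_local_energy(Y_new, V_hat):
    """Assign Y_new to the class whose optimal local core tensor has the
    largest squared Frobenius norm, i.e., argmax_n ||C*_{L,n}||_F^2.

    V_hat maps each class n to its list of orthonormal local factor
    matrices [V_{n,1}, ..., V_{n,K}].
    """
    scores = {}
    for n, factors in V_hat.items():
        C_L = Y_new
        for k, V in enumerate(factors):
            C_L = mode_product(C_L, V.T, k)   # C*_{L,n} = Y_new x_1 V^T ... x_K V^T
        scores[n] = np.linalg.norm(C_L) ** 2
    return max(scores, key=scores.get)
\end{verbatim}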

Appendix J Auxiliary Lemmas

In this section, we present several auxiliary lemmas that are useful for the theoretical analysis.

Lemma J.1.

For two matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{n\times p}$, we have

\[
\left\lVert AB\right\rVert_{F}\leq\left\lVert A\right\rVert_{F}\left\lVert B\right\rVert_{2}.
\]

This lemma is identical to Proposition B.4 in Sun and Luo (2016); we therefore omit the proof to avoid duplication.
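
A quick numerical sanity check of Lemma J.1, for illustration only (the random matrices and tolerance are ours):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 5))

lhs = np.linalg.norm(A @ B, "fro")
rhs = np.linalg.norm(A, "fro") * np.linalg.norm(B, 2)  # ||B||_2: largest singular value
assert lhs <= rhs + 1e-12
\end{verbatim}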

Lemma J.2.

For a core tensor $\bm{\mathcal{C}}$ and $K$ factor matrices $\{\bm{U}_{k}\}_{k=1}^{K}$, the following holds:

\[
\left\lVert\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{K}\bm{U}_{K}\right\rVert_{F}\leq\left\lVert\bm{\mathcal{C}}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2}.
\]
Proof.

The proof is straightforward. Let $\bm{\mathcal{Y}}=\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{K}\bm{U}_{K}$. Its mode-1 unfolding is $\bm{\mathcal{Y}}_{(1)}=\bm{U}_{1}\bm{\mathcal{C}}_{(1)}\left(\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right)^{\top}$. Since $\left\lVert\bm{\mathcal{Y}}\right\rVert_{F}=\left\lVert\bm{\mathcal{Y}}_{(1)}\right\rVert_{F}$, we have

\[
\begin{aligned}
\left\lVert\bm{\mathcal{Y}}\right\rVert_{F}&=\left\lVert\bm{\mathcal{Y}}_{(1)}\right\rVert_{F}=\left\lVert\bm{U}_{1}\bm{\mathcal{C}}_{(1)}\left(\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right)^{\top}\right\rVert_{F}\\
&\leq\left\lVert\bm{U}_{1}\right\rVert_{2}\left\lVert\bm{\mathcal{C}}_{(1)}\right\rVert_{F}\left\lVert\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right\rVert_{2}\\
&\leq\left\lVert\bm{\mathcal{C}}_{(1)}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2}=\left\lVert\bm{\mathcal{C}}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2},
\end{aligned}
\]

where we used Lemma J.1 in the first inequality and the fact that $\left\lVert\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right\rVert_{2}\leq\left\lVert\bm{U}_{K}\right\rVert_{2}\cdots\left\lVert\bm{U}_{2}\right\rVert_{2}$ in the second inequality. ∎
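
A similar numerical check of Lemma J.2, with the mode product computed via unfolding; the helper \texttt{mode\_product} is the same one used in the earlier sketches, and the random dimensions are ours:

\begin{verbatim}
import numpy as np

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

rng = np.random.default_rng(2)
C = rng.standard_normal((3, 4, 5))
Us = [rng.standard_normal((7, 3)),
      rng.standard_normal((8, 4)),
      rng.standard_normal((9, 5))]

Y = C
for k, U in enumerate(Us):
    Y = mode_product(Y, U, k)

lhs = np.linalg.norm(Y)   # Frobenius norm of the full tensor
rhs = np.linalg.norm(C) * np.prod([np.linalg.norm(U, 2) for U in Us])
assert lhs <= rhs + 1e-9
\end{verbatim}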