
Personalized Tucker Decomposition: Modeling Commonality and Peculiarity on Tensor Data

Jiuyun Hu1, Naichen Shi2, Raed Al Kontar2, Hao Yan1
1School of Computing and Augmented Intelligence
Arizona State University
2Department of Industrial & Operations Engineering
University of Michigan, Ann Arbor
Abstract

We propose personalized Tucker decomposition (perTucker) to address the limitations of traditional tensor decomposition methods in capturing heterogeneity across different datasets. perTucker decomposes tensor data into shared global components and personalized local components. We introduce a mode orthogonality assumption and develop a proximal gradient regularized block coordinate descent algorithm that is guaranteed to converge to a stationary point. By learning unique and common representations across datasets, we demonstrate perTucker’s effectiveness in anomaly detection, client classification, and clustering through a simulation study and two case studies on solar flare detection and tonnage signal classification.


Keywords: Tucker decomposition, Personalization, Heterogeneous data

1 Introduction

In recent years, tensor decomposition methods have grown rapidly, providing the ability to analyze and utilize high-dimensional data structures, which are essentially multi-dimensional arrays or matrices. These decompositions are powerful mathematical tools that facilitate the extraction of latent features and patterns from complex data, enabling efficient data representation, dimensionality reduction, compression, completion, noise removal, and prediction. Indeed, tensor decomposition has seen immense success across a wide variety of applications that include: natural image and video processing (Gatto et al.,, 2021; Momeni and Ebrahimkhanlou,, 2022), health care systems (Ren et al.,, 2022; Sandhu et al.,, 2018), point cloud data (Yan et al.,, 2019; Du et al.,, 2022) and manufacturing (Zhen et al.,, 2023; Yan et al.,, 2014), amongst many others.

Among the widely used techniques in this area, Tucker decomposition stands out as a prominent approach that has been successfully tested and deployed in various settings and applications (Kolda and Bader,, 2009; Zubair and Wang,, 2013; Li et al.,, 2020). The Tucker approach generalizes singular value decomposition (SVD) to higher-order tensors, providing a core tensor and a set of factor matrices that capture the interactions between dimensions (Tucker,, 1966). The factor matrices represent the underlying patterns and structures in the data, while the core tensor captures the interaction between these patterns. By analyzing factor matrices and the core tensor, one can identify and extract meaningful features that can be used for further analysis, such as anomaly detection (Yan et al.,, 2014) and process optimization (Yan et al.,, 2019).

Despite the efficacy of tensor decomposition methods, they assume that complex, heterogeneous data can be adequately represented by a single set of global factor matrices and a core tensor. This assumption may oversimplify the intrinsic disparities that exist when the datasets come from different sources, clients, or modalities, potentially compromising the accuracy of the resulting representations. In practice, nowadays, it is common to collect data across various edge devices, such as sensors and phones, which exhibit unique local data patterns due to various local conditions, system status, or data collection methodologies (Kontar et al.,, 2021).

Using a universal tensor decomposition strategy may not accurately capture these distinct data patterns, leading to suboptimal representations. An alternative strategy involves fitting a separate local tensor decomposition for each data source. However, this does not utilize the rich data available across sources and may excessively overfit the peculiarities of each dataset while neglecting the shared patterns or commonalities among the datasets. More importantly, both strategies overlook the opportunity to model heterogeneity across data sources and exploit this for improved downstream analysis, be it in prediction, clustering, classification, or anomaly detection.

For example, in the context of one of our case studies on tonnage signal monitoring processes, numerous sensors are employed to monitor the tonnage force at various locations within a production system. The data collected at each location features common patterns of normal signal variations and heterogeneous failure-related patterns. Here, the heterogeneous nature of the data and the presence of diverse failure patterns pose significant challenges for traditional tensor decomposition methods. Therefore, it is essential to develop a personalized tensor decomposition that can effectively capture and represent commonality and peculiarity across the data collected from each location. Consequently, these methods could reveal previously hidden patterns and relationships, allowing more effective data analysis, decision-making, and system optimization.

Inspired by a recent personalization technique for vector datasets coined as personalized principal component analysis (PCA) (Shi and Kontar,, 2022), we propose the personalized Tucker decomposition (perTucker) to decompose tensor data collected from different sources into shared global components and personalized components to capture the heterogeneity of the data. Global components model the common patterns shared across the datasets, while local components model the unique features of a specific dataset. At the heart of our approach is (i) a mode orthogonality constraint to distinguish global and local features and (ii) a proximal gradient-regularized block coordinate descent algorithm that operates within feasible regions of the constraint and can provably recover stationary solutions.

Using two case studies and simulated data, we highlight the ability of perTucker to benefit (i) anomaly detection as one can monitor changes in the local features to better (and faster) detect anomalies in data collected over time and (ii) classification & clustering as operating on local features may yield better statistical power than the raw data since differences are more explicit when global features are removed.

The remainder of the paper is organized as follows. Sec. 2 reviews relevant literature on tensor decomposition methods. Sec. 3 introduces perTucker, proposes an algorithm to estimate model parameters, proves convergence, and sheds light on potential applications of perTucker. Sec. 4 uses simulated data to highlight the advantageous properties of our model. Two case studies on solar flare detection and tonnage signal classification are then presented in Sec. 5. Finally, Sec. 6 concludes the paper with a discussion about open problems.

We note that hereon we will use data source, client and modality interchangeably to index the origin from which each dataset was created.

2 Literature Review

Various tensor decomposition methods have been proposed in the literature. Among them, Tucker (Tucker,, 1966) and CP decompositions (Hitchcock,, 1927) have received the most attention. They have been applied to both supervised and unsupervised learning tasks.

For unsupervised tasks, great emphasis was placed on anomaly detection and clustering. In anomaly detection, starting from the work of Nomikos and MacGregor, (1994), tensor-based detection has grown dramatically in the literature. Some examples include Li et al., (2011), where a robust tensor subspace learning algorithm is used for online anomaly detection, and Yan et al., (2014), which studied the relationship between Tucker decomposition, CP decomposition, multilinear principal component analysis, and tensor rank one decomposition and proposed monitoring statistics for each method. Interested readers are referred to Fanaee-T and Gama, (2016) for an overview of existing tensor-based methods for anomaly detection.

In clustering, various methods have been proposed to improve the accuracy and efficiency of clustering algorithms. These methods include tensor-based subspace clustering (Fu et al.,, 2016), multi-view clustering (Zhang et al.,, 2023; Li et al.,, 2023), and multi-mode clustering (He and Atia,, 2022).

Within these areas, Sun and Li, (2019) developed a dynamic tensor clustering method based on CP tensor decomposition, which does not limit the tensor order and can be learned efficiently. Wu et al., (2016) proposed tensor spectral co-clustering based on a stochastic Markov process. This method works for general tensors of any order and can simultaneously cluster all dimensions. Zhou et al., (2019) proposed a tensor low-rank reconstruction technique (TLRR). The reconstruction consists of a low-rank dictionary recovery component and sparse noise. The dictionary is then used to obtain a similarity matrix to cluster the data.

For supervised tasks, such as regression and classification, various tensor-based classification methods have been developed, including logistic tensor regression (Tan et al.,, 2013), support tensor machine (Hao et al.,, 2013), and tensor Fisher discriminant analysis (Yan et al.,, 2005). Furthermore, different forms of tensor regression have been proposed, depending on the dimensionality of the input and output variables. These include scalar-to-tensor regression (Zhou et al.,, 2013), tensor-to-scalar regression (Yan et al.,, 2019), and tensor-to-tensor regression (Gahrooei et al.,, 2021).

Here it is worth noting that a large body of literature has focused on separating noise from a low-rank background, starting from pioneering work on robust PCA (Candès et al.,, 2011). Subsequently, this separation has been extended to anomaly detection by decomposing the data into three parts: background, anomaly, and noise. For example, Yan et al., (2017) proposed a smooth sparse decomposition (SSD) by decomposing large-scale image data into a smooth background, a sparse anomaly, and noise and outperformed many other image-based detectors. Similar approaches can be found in crime monitoring (Zhao et al., 2022c, ), public health surveillance (Dulal et al.,, 2022; Zhao et al.,, 2020), and transfer learning applications (Li et al.,, 2022). Unfortunately, such methods suffer from an inability to learn the data representation as they assume that the basis functions or data representation are known, which limits their ability to handle complex datasets. To mitigate this, recent methods have been proposed to learn the representation of background components, including Bayesian methods (Guo et al.,, 2022) and deep neural networks (Zhao et al., 2022b, ). Still, such approaches cannot simultaneously learn the basis functions for the background and anomaly components of the dataset.

Given the above literature, to the best of our knowledge, there are to date no tensor decomposition methods capable of learning unique and common representations across different datasets. The closest work along this line is personalized PCA (perPCA), which introduces a novel technique to perform PCA on data originating from disparate sources or modalities that exhibit heterogeneous trends but also possess common characteristics (Shi and Kontar,, 2022). perPCA uses orthogonal global and local components to capture both shared and unique characteristics of each source. The paper offers a sufficient identifiability condition, theoretical guarantees, and competitive empirical results. Unfortunately, perPCA requires vectorization of the datasets and cannot directly handle tensor data. The extension of perPCA to tensor data faces fundamental challenges due to the large degrees of freedom and non-closed-form solutions of tensor decompositions, the difficulty in defining tensor-based orthogonality constraints, and the computational challenges involving high-order tensors.

Our work aims to bring personalization to a tensor paradigm and address the challenges imposed to make that possible.

3 Model Development

In this section, we first set the notation in Sec. 3.1, followed by the motivation and formulation of perTucker in Sec. 3.2. In Sec. 3.3, we propose an efficient algorithm to learn perTucker. Convergence, model initialization, and potential applications are highlighted in Sec. 3.4, Sec. 3.5, and Sec. 3.6, respectively. We note that the proofs of all propositions, lemmas, and theorems are deferred to the Appendix.

3.1 Preliminary

A tensor can be regarded as a data structure with more than 2 dimensions, also known as modes in tensor analysis (see Fig. 1). For example, in images, we use a vector of length 3 to represent the RGB channels of a pixel. Thus, a picture can be represented by a 3-dimensional tensor with dimensions height × width × RGB. If we have multiple pictures of the same dimensions, the dataset can be represented by a 4-dimensional tensor.

Refer to caption
Figure 1: Example of Vector, Matrix and Tensor Data
Notation

Throughout this paper, real numbers are denoted by letters, e.g., $N$, $i$; vectors by bold lowercase letters, e.g., $\bm{c}$; matrices by bold uppercase letters, e.g., $\bm{U}$; sets by script letters, e.g., $\mathcal{K}$; and tensors by bold script letters, e.g., $\bm{\mathcal{X}}$.

Mode-$k$ product and tensor unfolding

We briefly review the notion of a Tucker tensor product. A $K$-mode tensor is represented by $\bm{\mathcal{X}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{K}}$, where $I_{k}$ denotes the mode-$k$ dimension of $\bm{\mathcal{X}}$ for $k=1,\cdots,K$. We use $\bm{\mathcal{X}}[i_{1},i_{2},\cdots,i_{K}]$ to denote the $(i_{1},i_{2},\cdots,i_{K})$-th entry of $\bm{\mathcal{X}}$. The mode-$k$ product of a tensor $\bm{\mathcal{X}}$ with a matrix $\bm{V}\in\mathbb{R}^{J_{k}\times I_{k}}$ produces a tensor defined by $(\bm{\mathcal{X}}\times_{k}\bm{V})[i_{1},\cdots,i_{k-1},j_{k},i_{k+1},\cdots,i_{K}]=\sum_{i_{k}}\bm{\mathcal{X}}[i_{1},\cdots,i_{k},\cdots,i_{K}]\,\bm{V}[j_{k},i_{k}]$. For a tensor $\bm{\mathcal{X}}$ and a specific mode $k$, we use the subscript with parentheses $\bm{\mathcal{X}}_{(k)}\in\mathbb{R}^{I_{k}\times\prod_{q=1,q\neq k}^{K}I_{q}}$ to denote the unfolding of $\bm{\mathcal{X}}$ with respect to mode $k$: $\bm{\mathcal{X}}_{(k)}[i_{k},j]=\bm{\mathcal{X}}[i_{1},i_{2},\cdots,i_{K}]$, where $j=1+\sum_{q=1,q\neq k}^{K}(i_{q}-1)J_{q}$ and $J_{q}=\prod_{m=1,m\neq k}^{q-1}I_{m}$. The columns of the mode-$k$ unfolding $\bm{\mathcal{X}}_{(k)}$ are the mode-$k$ vectors of $\bm{\mathcal{X}}$.
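To make the unfolding and mode-$k$ product concrete, here is a minimal numpy sketch (our illustration, not code from the paper). The column ordering of this unfolding may differ from the convention above, but the mode-$k$ product does not depend on that ordering.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding: a matrix of shape (I_k, product of the other dimensions)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def mode_k_product(X, V, k):
    """Mode-k product X x_k V for a matrix V of shape (J_k, I_k)."""
    Xk = unfold(X, k)                                   # (I_k, prod of others)
    Yk = V @ Xk                                         # (J_k, prod of others)
    new_shape = (V.shape[0],) + tuple(np.delete(X.shape, k))
    return np.moveaxis(Yk.reshape(new_shape), 0, k)     # fold back with mode k replaced

# Example: a 3-mode tensor multiplied along mode 1 (modes are 0-indexed in code)
X = np.random.randn(4, 5, 6)
V = np.random.randn(3, 5)
print(mode_k_product(X, V, 1).shape)                    # (4, 3, 6)
```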

Tucker decomposition

Tucker decomposition (Tucker,, 1966) decomposes a tensor into a core tensor multiplied by a factor matrix along each mode, $\bm{\mathcal{X}}\approx\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\bm{U}_{2}\cdots\times_{K}\bm{U}_{K}$, where $\bm{U}_{k}$ is an orthonormal $I_{k}\times J_{k}$ factor matrix, typically with $J_{k}<I_{k}$. $\bm{U}_{k}$ can be regarded as the principal components in mode $k$.

Tucker decomposition has an equivalent formulation in terms of the unfolded tensor, that is, $\bm{\mathcal{X}}_{(k)}=\bm{U}_{k}\bm{\mathcal{C}}_{(k)}(\bm{U}_{K}\otimes\cdots\otimes\bm{U}_{k+1}\otimes\bm{U}_{k-1}\otimes\cdots\otimes\bm{U}_{1})^{\top}$, where $\otimes$ is the Kronecker product.

Tensor inner product

The inner product of two tensors of the same shape, $\bm{\mathcal{A}},\bm{\mathcal{B}}\in\mathbb{R}^{I_{1}\times\cdots\times I_{K}}$, is defined as

\left\langle\bm{\mathcal{A}},\bm{\mathcal{B}}\right\rangle=\sum_{i_{1},\cdots,i_{K}}\bm{\mathcal{A}}[i_{1},\cdots,i_{K}]\,\bm{\mathcal{B}}[i_{1},\cdots,i_{K}].

Then the Frobenius norm of a tensor $\bm{\mathcal{A}}$ can be defined as $\left\lVert\bm{\mathcal{A}}\right\rVert_{F}^{2}=\left\langle\bm{\mathcal{A}},\bm{\mathcal{A}}\right\rangle$, which is the sum of squares of all its elements.

3.2 Motivation & formulation

Suppose we have tensor data from $N$ sources, each of order $K$. We use $\bm{\mathcal{Y}}_{n}$ to denote the data from source $n$ and assume that $\bm{\mathcal{Y}}_{n}$ has dimensions $I_{1}\times I_{2}\times\ldots\times I_{K}\times s_{n}$. All dimensions are shared across sources except the last one: in practical applications, the last dimension $s_{n}$ denotes the number of samples from source $n$, which often differs between sources.

Our approach relies on defining global and local components to model commonality and heterogeneity across different sources. To do so, we let the global components consist of shared global factor matrices $\bm{U}_{G,1},\ldots,\bm{U}_{G,K}$ and individual global core tensors $\bm{\mathcal{C}}_{G,1},\ldots,\bm{\mathcal{C}}_{G,N}$, one per source. The local components consist of individual core tensors $\bm{\mathcal{C}}_{L,1},\ldots,\bm{\mathcal{C}}_{L,N}$ and individual local factor matrices $\bm{V}_{n,1},\ldots,\bm{V}_{n,K}$. As such, the reconstructions of the global and local components for source $n$ are $\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ and $\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}$, respectively.

Based on the above definitions, we assume our data-generating process to be

\bm{\mathcal{Y}}_{n}=\underbrace{\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}}_{\text{global}}+\underbrace{\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}}_{\text{local }n}+\bm{\mathcal{E}}_{n}, \quad (1)

where $\bm{\mathcal{E}}_{n}$ are tensors that represent additive noise.
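To make the data-generating process (1) concrete, the sketch below draws one synthetic dataset with $K=2$ tensor modes plus a sample mode; the dimensions, ranks, and noise level are assumptions for illustration, and `mode_k_product` is the helper from the sketch in Sec. 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, g, l, s, N = 20, 20, 3, 2, 10, 3     # dims, global/local ranks, samples per source, sources

def orthonormal(rows, cols):
    return np.linalg.qr(rng.standard_normal((rows, cols)))[0]

U_G = [orthonormal(I1, g), orthonormal(I2, g)]           # shared global factors
data = []
for n in range(N):
    # local factors; mode 1 (index 0 in code) is forced orthogonal to the global factor
    V1 = rng.standard_normal((I1, l))
    V1 = np.linalg.qr(V1 - U_G[0] @ (U_G[0].T @ V1))[0]  # project out U_G,1 and re-orthonormalize
    V2 = orthonormal(I2, l)
    C_G = rng.standard_normal((g, g, s))                 # per-source global core (sample mode last)
    C_L = rng.standard_normal((l, l, s))                 # per-source local core
    Y_glob = mode_k_product(mode_k_product(C_G, U_G[0], 0), U_G[1], 1)
    Y_loc = mode_k_product(mode_k_product(C_L, V1, 0), V2, 1)
    data.append(Y_glob + Y_loc + 0.1 * rng.standard_normal((I1, I2, s)))
```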

Since global and local components should convey different information, they need to be distinguished so that each part can vary independently of each other. To do so, we require the orthogonality of the global and local tensors. Specifically, we assume that:

\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0,\quad\forall\,\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n},

where $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ and $\bm{\mathcal{Y}}_{L,n}=\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}$. Interestingly, this condition turns out to be equivalent to the global and local factor matrices being orthogonal in at least one mode, as stated in Proposition 1.

Proposition 1.

For each $n=1,\ldots,N$, the following two conditions are equivalent.

  • $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0,\quad\forall\,\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n}$.

  • There exists a mode $k\in\{1,\ldots,K\}$ such that $\bm{U}_{G,k}^{\top}\bm{V}_{n,k}=0$.

Given Proposition 1, we require the local factor matrices to be orthogonal to the global factor matrices for all sources in at least one mode. We denote the set of such orthogonal modes by $\mathcal{K}$, with $|\mathcal{K}|\geq 1$. Our objective is then to minimize the reconstruction loss of the data across all $N$ sources, written as

\min_{\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}}\ \sum_{n=1}^{N}\|\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}-\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2} \quad (2)
\text{s.t.}\ \bm{U}_{G,k}^{\top}\bm{U}_{G,k}=I,\ \bm{V}_{n,k}^{\top}\bm{V}_{n,k}=I,\quad n=1,\ldots,N,\ k=1,\ldots,K,
\bm{U}_{G,k}^{\top}\bm{V}_{n,k}=0,\quad n=1,\ldots,N,\ k\in\mathcal{K}.

We assume that the global core tensors have the same dimensions across sources, $\bm{\mathcal{C}}_{G,n}\in\mathbb{R}^{g_{1}\times\cdots\times g_{K}}$, and that the local core tensor for source $n$ has dimensions $\bm{\mathcal{C}}_{L,n}\in\mathbb{R}^{l_{n,1}\times\cdots\times l_{n,K}}$. These choices also define the dimensions of the global and local factor matrices.
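For concreteness, a small sketch of the reconstruction loss in (2) is given below, assuming the hypothetical `mode_k_product` helper from the Sec. 3.1 sketch; the orthogonality and orthonormality constraints are handled by the updates in Sec. 3.3 and are not checked here.

```python
import numpy as np

def multi_mode_product(C, factors):
    """Apply C x_1 F_1 ... x_K F_K over the first len(factors) modes."""
    Y = C
    for k, F in enumerate(factors):
        Y = mode_k_product(Y, F, k)
    return Y

def pertucker_loss(data, global_cores, U_G, local_cores, V):
    """Sum of squared Frobenius reconstruction errors across all N sources."""
    loss = 0.0
    for n, Y_n in enumerate(data):
        recon = multi_mode_product(global_cores[n], U_G) \
              + multi_mode_product(local_cores[n], V[n])
        loss += np.sum((Y_n - recon) ** 2)
    return loss
```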

3.3 Personalized Tucker algorithm

A natural algorithm to solve the objective in (2) is block coordinate descent (BCD), where we iteratively optimize each variable. A general framework for BCD in our context is outlined in Algorithm 1.

Data: $\bm{\mathcal{Y}}_{n}$, $n=1,\ldots,N$
Output: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Initialization: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
for iterations do
      for $k=1,\ldots,K$ do
            Update the global factor matrix $\bm{U}_{G,k}$
            for $n=1,\ldots,N$ do
                  Update the global core tensor $\bm{\mathcal{C}}_{G,n}$
                  Update the local factor matrix $\bm{V}_{n,k}$
                  Update the local core tensor $\bm{\mathcal{C}}_{L,n}$
            end for
      end for
end for
Return: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Algorithm 1: Pseudo code of the BCD algorithm

In the rest of Sec. 3.3, we explain the update steps in Algorithm 1 in detail. We start with the update of the global and local core tensors in Sec. 3.3.1, whose closed-form solution is a direct projection, similar to traditional Tucker decomposition, owing to the orthogonality between the global and local components. This solution is then assumed to hold when updating the global and local factor matrices in Sec. 3.3.2, which simplifies those updates by removing the core tensors from the optimization. Despite the challenges that pertain to the two distinct decomposition components within (2) and their orthogonality, a key result is that all updates can be done in closed form owing to the nice properties of the orthogonality constraint imposed on the model.

3.3.1 Update global and local core tensors

In this section, Proposition 2 provides the closed-form solution to update global and local core tensors, given the global and local factor matrices. The closed-form solution is the direct projection of the data to the global or local factor matrices. When updating the global and local factor matrices, we assume that the global and local core tensors are always the optimal solution. This simplifies the formula to update the global and local factor matrices by removing the core tensors from the optimization problem.

Proposition 2.

(Closed-form solutions to the core tensors) If $\left\lvert\mathcal{K}\right\rvert\geq 1$, when the global factor matrices $\{\bm{U}_{G,k}\}$ and the local factor matrices $\{\bm{V}_{n,k}\}$ are given, the global core tensors $\bm{\mathcal{C}}^{\star}_{G,n}$ that minimize (2) satisfy

\bm{\mathcal{C}}^{\star}_{G,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{U}_{G,1}^{\top}\ldots\times_{K}\bm{U}_{G,K}^{\top},

and the local core tensors $\bm{\mathcal{C}}^{\star}_{L,n}$ that minimize (2) satisfy

\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}.

The closed-form solutions presented in Proposition 2 take advantage of the orthogonality between the two components. As a result, the cross-term cancels, making the computation of the core tensors for both components efficient and straightforward. In the following sections, $\bm{\mathcal{C}}^{\star}_{G,n}$ and $\bm{\mathcal{C}}^{\star}_{L,n}$ denote the optimized global and local core tensors.
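A minimal sketch of these core updates follows, reusing the hypothetical `multi_mode_product` helper from the earlier snippets; the factor matrices are the tall matrices with orthonormal columns, so projection uses their transposes.

```python
def update_cores(Y_n, U_G, V_n):
    """Closed-form core updates of Proposition 2 for one source."""
    C_G = multi_mode_product(Y_n, [U.T for U in U_G])      # global core: project onto global factors
    C_L = multi_mode_product(Y_n, [Vk.T for Vk in V_n])    # local core: project onto local factors
    return C_G, C_L
```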

3.3.2 Update global and local factor matrices

In this section, we discuss the closed-form solutions for updating the global and local factor matrices. For simplicity of notation, we define the global and local residual tensors for each source $n=1,\ldots,N$ as:

\bm{\mathcal{R}}_{G,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K},
\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}.

The global residual is the reconstruction error from local components and the local residual is the reconstruction error from global components. Therefore, global reconstruction tends to model the global residual, and local reconstruction tends to model the local residual.

Proximal update

In practice, when updating the global and local factor matrices, we can incorporate a proximal term into the optimization problem to regularize the update of the factor matrices (Shen et al.,, 2022). More specifically, we define $\varrho$ as the subspace difference between the subspaces spanned by the current factor matrix $\bm{U}_{t}$ and the target factor matrix $\bm{U}$ to be optimized,

\varrho(\bm{U},\bm{U}_{t})=\|\bm{U}\bm{U}^{\top}-\bm{U}_{t}\bm{U}_{t}^{\top}\|_{F}^{2}. \quad (3)

The proximal penalty term is defined as this subspace difference multiplied by a parameter $\rho$. The proximal term stabilizes the update of the factor matrices by regularizing the change of subspace. Since the global and local components are fit by minimizing the reconstruction error, the optimization problem for the global factor matrix in mode $k$ at iteration $t$ is

\bm{U}_{G,k,t+1}=\arg\min_{\bm{U}_{G,k}}\sum_{n=1}^{N}\left\lVert\bm{\mathcal{R}}_{G,n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}\right\rVert_{F}^{2}+\rho\|\bm{U}_{G,k}\bm{U}_{G,k}^{\top}-\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}\|_{F}^{2}, \quad (4)

and the optimization problem for the local factor matrix of source $n$ is, when $k\in\mathcal{K}$,

\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}\perp\bm{U}_{G,k}}\left\lVert\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rVert_{F}^{2}+\rho\left\lVert\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}, \quad (5)

and when $k\not\in\mathcal{K}$,

\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}}\left\lVert\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rVert_{F}^{2}+\rho\left\lVert\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}, \quad (6)

where $\bm{U}_{G,k,t}$ and $\bm{V}_{n,k,t}$ denote the global and local factor matrices for source $n$, mode $k$, and iteration $t$, and $\bm{U}_{G,k,t+1}$ and $\bm{V}_{n,k,t+1}$ are the corresponding updated factor matrices. The objectives of (5) and (6) are the same: each consists of the Frobenius norm of the fitting error and a regularization on the change of subspace. The difference is that in (5) we explicitly require $\bm{V}_{n,k,t+1}$ to be orthogonal to $\bm{U}_{G,k,t+1}$, while in (6) we place no constraint on $\bm{V}_{n,k,t+1}$.

Though the optimization problems (4) to (6) seem complicated, it turns out we can obtain closed-form solutions. To achieve this, we first transform the minimization problem into a maximization problem and remove the core tensors in the optimization by Lemma 3.1 and Lemma 3.2.

Lemma 3.1.

For any orthonormal factor matrices $\bm{U}$ and $\bm{U}_{t}$, the subspace error between $\bm{U}$ and $\bm{U}_{t}$ defined in (3) can be formulated as

\varrho(\bm{U},\bm{U}_{t})=2c-2\mathrm{Tr}\left[\bm{U}^{\top}\bm{U}_{t}\bm{U}_{t}^{\top}\bm{U}\right], \quad (7)

where $c$ is the number of columns of $\bm{U}$.

Lemma 3.1 shows that the subspace error is differentiable with respect to $\bm{U}$. This property is useful when we design the update rules for the BCD algorithm and the evaluation metrics. Furthermore, Lemma 3.1 places a negative sign on the matrix trace term, which transforms the minimization problem into a maximization problem.
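A tiny numpy check of identity (7), under the convention that factor matrices have orthonormal columns (our illustration, not the paper's code):

```python
import numpy as np

def subspace_error(U, U_t):
    return np.linalg.norm(U @ U.T - U_t @ U_t.T, 'fro') ** 2

rng = np.random.default_rng(1)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]
U_t = np.linalg.qr(rng.standard_normal((10, 3)))[0]
c = U.shape[1]                                          # number of columns of U
trace_form = 2 * c - 2 * np.trace(U.T @ U_t @ U_t.T @ U)
print(np.isclose(subspace_error(U, U_t), trace_form))   # True
```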

Before deriving the solutions to (4) to (6), we introduce the following lemma that significantly simplifies our objective.

Lemma 3.2.

For each $n=1,\ldots,N$ and $k=1,\ldots,K$, we have

\sum_{n=1}^{N}\|\bm{\mathcal{R}}_{G,n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}\|_{F}^{2}=\sum_{n=1}^{N}\left(-\|\bm{\mathcal{R}}_{G,n}\times_{1}\bm{U}_{G,1}^{\top}\ldots\times_{K}\bm{U}_{G,K}^{\top}\|_{F}^{2}+\left\lVert\bm{\mathcal{R}}_{G,n}\right\rVert_{F}^{2}\right), \quad (8)
\|\bm{\mathcal{R}}_{L,n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}=-\|\bm{\mathcal{R}}_{L,n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}\|_{F}^{2}+\left\lVert\bm{\mathcal{R}}_{L,n}\right\rVert_{F}^{2}. \quad (9)

Lemma 3.2 plays two fundamental roles in the derivation of the closed-form solutions for updating the global and local factor matrices. First, it places a negative sign on the term involving the factor matrices, which transforms the minimization problems (4) to (6) into maximization problems. Second, by plugging in the closed-form solutions of the global and local core tensors from Proposition 2, it simplifies the optimization problems by reducing the number of decision variables.

Update global factors

With these prerequisites, we are now ready to present, in Proposition 3, the closed-form solution of sub-problem (4), which updates the global factor matrix $\bm{U}_{G,k}$ in a specific mode $k$.

Proposition 3.

We use $\bm{W}_{G,n}$ to denote $\bm{W}_{G,n}=\left(\bm{\mathcal{R}}_{G,n}\right)_{(k)}(\bigotimes_{q\neq k}\bm{U}_{G,q}^{\top})^{\top}$, where $\bigotimes_{q\neq k}$ is the Kronecker product, in reverse order, of the factor matrices except the $k$-th one. If $\bm{U}_{G,k,t+1}$ is the optimal solution to (4), the columns of $\bm{U}_{G,k,t+1}$ are the unit eigenvectors of the matrix $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho\,\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}$ corresponding to the largest $g_{k}$ eigenvalues.

Proposition 3 shows that, with proximal regularization, the global components can be updated efficiently through singular value decomposition. In practice, we can use the equivalent form $\bm{W}_{G,n}=(\bm{\mathcal{R}}_{G,n}\times_{1}\bm{U}_{G,1}^{\top}\cdots\times_{k-1}\bm{U}_{G,k-1}^{\top}\times_{k+1}\bm{U}_{G,k+1}^{\top}\cdots\times_{K}\bm{U}_{G,K}^{\top})_{(k)}$ to improve computational efficiency. Note that when updating the global components, we do not impose orthogonality between the global and local components. The reason is that enforcing orthogonality of the global components to each of the local components is too restrictive and may leave no feasible space for the update when the number of sources is large. Therefore, we update the global components freely and enforce the local components to be orthogonal to the global components.
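The following sketch illustrates this global factor update, reusing `unfold` and `mode_k_product` from the Sec. 3.1 sketch (helper names and the argument layout are our assumptions); `U_G_k_prev` plays the role of $\bm{U}_{G,k,t}$.

```python
import numpy as np

def update_global_factor(R_G, U_G, k, U_G_k_prev, g_k, rho):
    """R_G: list of global residual tensors; U_G: current list of global factors."""
    S = 2 * rho * (U_G_k_prev @ U_G_k_prev.T)
    for R_n in R_G:
        T = R_n
        for q, U_q in enumerate(U_G):
            if q != k:
                T = mode_k_product(T, U_q.T, q)   # project every mode except k
        W = unfold(T, k)                          # this plays the role of W_{G,n}
        S += W @ W.T
    eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
    return eigvecs[:, -g_k:]                      # eigenvectors of the g_k largest eigenvalues
```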

Update local factors

We provide the closed-form solutions to sub-problems (5) and (6), which update the local factor matrices with or without the orthogonality constraint, in Proposition 4. The component optimized in Proposition 4 is the local factor matrix $\bm{V}_{n,k}$ for a specific source $n$ and mode $k$. We denote the current local factor matrix at iteration $t$ by $\bm{V}_{n,k,t}$.

Proposition 4.

Problems (5) and (6) have closed-form solutions. We denote $\bm{W}_{L,n}=\left(\bm{\mathcal{R}}_{L,n}\right)_{(k)}(\bigotimes_{q\neq k}\bm{V}_{n,q}^{\top})^{\top}$. Then,

  1. If $k\not\in\mathcal{K}$, the columns of the updated local factor matrix $\bm{V}_{n,k,t+1}$ are the unit eigenvectors of $\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}$ corresponding to the top $l_{n,k}$ eigenvalues.

  2. If $k\in\mathcal{K}$, the update of the local factor matrix $\bm{V}_{n,k,t+1}$ is as follows. Denote $\bm{S}^{\prime}=(I-\bm{U}_{G,k}\bm{U}_{G,k}^{\top})[\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}](I-\bm{U}_{G,k}\bm{U}_{G,k}^{\top})$. The columns of $\bm{V}_{n,k,t+1}$ are the eigenvectors of $\bm{S}^{\prime}$ corresponding to the top $l_{n,k}$ eigenvalues.

The proofs of Proposition 3 and Proposition 4 are given in Appendix E and Appendix F. Having completed all the steps required to update each component, we present the complete algorithm in Algorithm 2, where the subscript $t$ denotes the current iteration index for the global and local factor matrices. Despite the orthogonality constraint between the local and global components, each update step in Algorithm 2 can be efficiently implemented via a closed-form solution.
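Before turning to the full algorithm, the sketch below illustrates the local factor update of Proposition 4, again reusing the helpers from the earlier snippets; the flag `orthogonal_mode` indicates whether $k\in\mathcal{K}$, in which case the matrix is sandwiched by the projector onto the orthogonal complement of $\bm{U}_{G,k}$.

```python
import numpy as np

def update_local_factor(R_L_n, V_n, k, V_n_k_prev, U_G_k, l_k, rho, orthogonal_mode=True):
    """Local factor update for one source n and one mode k."""
    T = R_L_n
    for q, V_q in enumerate(V_n):
        if q != k:
            T = mode_k_product(T, V_q.T, q)
    W = unfold(T, k)                                        # this plays the role of W_{L,n}
    S = W @ W.T + 2 * rho * (V_n_k_prev @ V_n_k_prev.T)
    if orthogonal_mode:                                     # k in the orthogonal mode set
        P = np.eye(U_G_k.shape[0]) - U_G_k @ U_G_k.T        # projector onto the complement of U_{G,k}
        S = P @ S @ P
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs[:, -l_k:]                                # top l_{n,k} eigenvectors
```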

Data: $\bm{\mathcal{Y}}_{n}$, $n=1,\ldots,N$
Input: Global dimensions $g_{k}$, local dimensions $l_{n,k}$, orthogonal mode set $\mathcal{K}$, $\rho$
Output: $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k}\}$
Initialization: Randomly initialize or PCA-initialize $\{\bm{\mathcal{C}}_{G,n}\},\{\bm{U}_{G,k,0}\},\{\bm{\mathcal{C}}_{L,n}\},\{\bm{V}_{n,k,0}\}$.
for iterations $t=0,\cdots,T-1$ do
      for $k=1,\ldots,K$ do
            Set $\bm{\mathcal{R}}_{G,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{k-1}\bm{V}_{n,k-1,t+1}\times_{k}\bm{V}_{n,k,t}\ldots\times_{K}\bm{V}_{n,K,t}$, $n=1,\ldots,N$
            Compute $\bm{W}_{G,n}=\left(\bm{\mathcal{R}}_{G,n}\right)_{(k)}(\bm{U}_{G,K,t}^{\top}\otimes\ldots\otimes\bm{U}_{G,k+1,t}^{\top}\otimes\bm{U}_{G,k-1,t+1}^{\top}\otimes\ldots\otimes\bm{U}_{G,1,t+1}^{\top})^{\top}$, $n=1,\ldots,N$
            Update $\bm{U}_{G,k,t+1}$ to be the eigenvectors of $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho\,\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}$ corresponding to the largest $g_{k}$ eigenvalues.
            for $n=1,\ldots,N$ do
                  Update $\bm{\mathcal{C}}^{\star}_{G,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{U}_{G,1,t+1}^{\top}\ldots\times_{k}\bm{U}_{G,k,t+1}^{\top}\times_{k+1}\bm{U}_{G,k+1,t}^{\top}\ldots\times_{K}\bm{U}_{G,K,t}^{\top}$
                  Set $\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1,t+1}\ldots\times_{k}\bm{U}_{G,k,t+1}\times_{k+1}\bm{U}_{G,k+1,t}\ldots\times_{K}\bm{U}_{G,K,t}$
                  Let $\bm{W}_{L,n}=\left(\bm{\mathcal{R}}_{L,n}\right)_{(k)}(\bm{V}_{n,K,t}^{\top}\otimes\ldots\otimes\bm{V}_{n,k+1,t}^{\top}\otimes\bm{V}_{n,k-1,t+1}^{\top}\otimes\ldots\otimes\bm{V}_{n,1,t+1}^{\top})^{\top}$
                  if $k\in\mathcal{K}$ then
                        Let $\bm{S}^{\prime}=(\bm{I}-\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top})[\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}](\bm{I}-\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top})$
                        Update $\bm{V}_{n,k,t+1}$ to be the eigenvectors of $\bm{S}^{\prime}$ corresponding to the largest $l_{n,k}$ eigenvalues.
                  else
                        Update $\bm{V}_{n,k,t+1}$ to be the eigenvectors of $\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\,\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}$ corresponding to the largest $l_{n,k}$ eigenvalues.
                  end if
                  Update $\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{k}\bm{V}_{n,k,t+1}^{\top}\times_{k+1}\bm{V}_{n,k+1,t}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}$
            end for
      end for
end for
Return: $\{\bm{\mathcal{C}}^{\star}_{G,n}\},\{\bm{U}_{G,k,T}\},\{\bm{\mathcal{C}}^{\star}_{L,n}\},\{\bm{V}_{n,k,T}\}$
Algorithm 2: BCD algorithm to solve perTucker

3.4 Convergence analysis of Algorithm 2

In this section, we provide the convergence analysis of Algorithm 2. The special update rule in Proposition 4 brings challenges to the convergence analysis: in Algorithm 2, the update of the local factors $\bm{V}_{n,k}$ differs from the standard Tucker decomposition update, as $\bm{V}_{n,k}$ is required to be orthogonal to $\bm{U}_{G,k}$. As a result, updating the local factors does not necessarily decrease the objective value in (2), and Algorithm 2 is therefore not a strictly descending algorithm. Despite such subtleties, we can show that, when the proximal parameter $\rho$ is not too small, our algorithm converges to stationary solutions.

We present our global convergence result in the following theorem. Recall that we use $\bm{U}_{G,k,t}$ to denote the $k$-th global factor $\bm{U}_{G,k}$ after iteration $t$, and $\bm{V}_{n,k,t}$ to denote the $k$-th local factor of source $n$ after iteration $t$.

Theorem 5.

If $|\mathcal{K}|\geq 2$ and there exists a constant $B>0$ such that $\left\lVert\bm{\mathcal{Y}}_{n}\right\rVert_{F}\leq B$ for each $n$, then, when we choose $\rho=O(B^{2})$, Algorithm 2 converges to stationary points where

\min_{t=1,\cdots,T}\sum_{k=1}^{K}\left\lVert\bm{U}_{G,k,t+1}\bm{U}_{G,k,t+1}^{\top}-\bm{U}_{G,k,t}\bm{U}_{G,k,t}^{\top}\right\rVert_{F}^{2}=O\left(\frac{1}{T}\right), \quad (10)

and

\min_{t=1,\cdots,T}\sum_{n=1}^{N}\sum_{k=1}^{K}\left\lVert\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}\bm{V}_{n,k,t}^{\top}\right\rVert_{F}^{2}=O\left(\frac{1}{\sqrt{T}}\right). \quad (11)

Theorem 5 provides several key insights. First, it shows that the subspaces spanned by the column vectors of the global and local factors all converge to fixed solutions. The result establishes the global convergence of the factors, as it does not require careful initialization. Second, the convergence rates for global and local factors differ. Global factors converge at a rate of $O(\frac{1}{T})$, which is standard in non-convex optimization. However, since some local factors must be orthogonal to the global factors, they converge at a slightly slower rate of $O(\frac{1}{\sqrt{T}})$. Third, our result requires $|\mathcal{K}|\geq 2$, which ensures that orthogonality can be maintained through another mode while each mode is being updated.

To validate the convergence rates, we provide a proof-of-concept simulation study. Fig. 2 displays an example of the convergence of the global and local factor matrices in this simulation: both the global and local subspace errors go to 0, and the slower rate of the local components, primarily a result of the orthogonality requirement, is consistent with Theorem 5. The details of this simulation study are relegated to Appendix H.

Refer to caption
Figure 2: Subspace error for global component and different local sources

3.5 Model initialization

One simple approach is to initialize all components randomly. Alternatively, one may use a Tucker decomposition of all the data for initialization. To do so, let $s=\sum_{n=1}^{N}s_{n}$ be the total number of samples from all sources and recall that the data $\bm{\mathcal{Y}}_{n}$ from source $n$ has dimensions $I_{1}\times\ldots\times I_{K}\times s_{n}$. The following steps can then be taken:

  1. Construct a tensor $\bm{\mathcal{Y}}$ that encompasses all samples from every source, with dimensions $I_{1}\times\ldots\times I_{K}\times s$.

  2. Use a Tucker decomposition $\bm{\mathcal{Y}}\approx\bm{\mathcal{C}}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ to initialize the global factors.

  3. Use Proposition 2 to initialize the global core tensor for each source.

  4. For each source $n$, perform a Tucker decomposition on the local residual tensor $\bm{\mathcal{R}}_{L,n}=\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}^{\star}_{G,n}\times_{1}\bm{U}_{G,1}\ldots\times_{K}\bm{U}_{G,K}$ to initialize the local core tensor and the local factor matrices.

This initialization does not require orthogonality between the global and local components; therefore, we may observe an increase in the reconstruction error in the first iteration. In practice, however, we have found that this method yields faster convergence.
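A sketch of this Tucker-based initialization is given below, where a simple truncated HOSVD stands in for the Tucker decomposition (an assumption for illustration; any Tucker solver could be used) and `unfold`/`multi_mode_product` are the hypothetical helpers from the earlier snippets.

```python
import numpy as np

def hosvd_factors(Y, ranks):
    """Leading left singular vectors of each unfolding, one factor matrix per listed mode."""
    return [np.linalg.svd(unfold(Y, k), full_matrices=False)[0][:, :r]
            for k, r in enumerate(ranks)]

def initialize(data, global_ranks, local_ranks):
    Y_all = np.concatenate(data, axis=-1)              # step 1: stack samples along the last mode
    U_G = hosvd_factors(Y_all, global_ranks)           # step 2: initialize global factors
    C_G, V, C_L = [], [], []
    for n, Y_n in enumerate(data):
        C_G.append(multi_mode_product(Y_n, [U.T for U in U_G]))    # step 3: global cores
        R_L = Y_n - multi_mode_product(C_G[n], U_G)                # local residual tensor
        V.append(hosvd_factors(R_L, local_ranks[n]))               # step 4: local factors
        C_L.append(multi_mode_product(R_L, [Vk.T for Vk in V[n]])) # step 4: local cores
    return C_G, U_G, C_L, V
```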

In addition, in our model one needs to choose the modes on which orthogonality is imposed. Our theorem requires $|\mathcal{K}|\geq 2$, but we found in various simulation studies that the result also holds for $|\mathcal{K}|=1$. In practice, one can simply choose the mode with the largest dimension to impose the orthogonality constraint; alternatively, cross-validation can be used to select the mode.

3.6 Practical usage of perTucker

In this section, we introduce some practical applications of perTucker. Specifically, we shed light on its potential utility for improved classification, anomaly detection, and clustering. The key idea for all applications is to operate only on local components. This may allow for improved clustering, classification, and detection as differences become more explicit when shared knowledge is removed.

3.6.1 Classification via perTucker

To use perTucker for classification, we assume that each source corresponds to a class. We then perform perTucker to obtain the estimated local factor matrices $\hat{\bm{V}}_{n,k}$, $k=1,\ldots,K$, for each class. When a new data sample $\bm{\mathcal{Y}}^{\text{new}}$ arrives, we use the following decision rule to classify it:

\hat{n}=\arg\max_{n}\|\bm{\mathcal{C}}^{\star}_{L,n}\|_{F}^{2}=\arg\max_{n}\|\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\ldots\times_{K}\hat{\bm{V}}_{n,K}^{\top}\|_{F}^{2}. \quad (12)

The decision rule (12) shows that we can efficiently classify the data by selecting the class that maximizes the Frobenius norm of the local core tensor. The largest core norm indicates that the corresponding local subspace is the most suitable for representing the new data, since the local core tensor is a projection of the original tensor onto that local subspace. We have found that this decision rule is equivalent to choosing the class with the smallest reconstruction error; the discussion is relegated to Appendix I.
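A minimal sketch of decision rule (12) follows, reusing `multi_mode_product` from the earlier snippets; `V_hat[n]` is assumed to hold the estimated local factor matrices of class $n$.

```python
import numpy as np

def classify(Y_new, V_hat):
    """Return the class whose local subspace gives the largest projected core norm."""
    scores = [np.sum(multi_mode_product(Y_new, [Vk.T for Vk in V_n]) ** 2)
              for V_n in V_hat]
    return int(np.argmax(scores))
```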

We want to emphasize that this classification approach differs from traditional tensor-based classifiers (Klus and Gelß,, 2019), which directly train supervised learning models for tensor classification. Here, we focus on a generative approach, which first trains a data generation model (i.e., a local subspace) for each class and then uses the representation error to decide to which class the data belong. The algorithm constructs global and local subspaces, which is beneficial not only for classification but also for feature interpretation and visualization.

3.6.2 Anomaly detection via perTucker

By monitoring only the local components, perTucker can improve anomaly detection, as changes in the underlying data become more explicit when the common factors are removed. Specifically, we propose using $\|\bm{\mathcal{C}}_{L}\|_{F}^{2}$ as the key monitoring statistic for online anomaly detection.

Here we emphasize that perTucker does not implement a sparsity penalty as often used in tensor-based anomaly detection (Yan et al.,, 2018). This is a unique benefit of perTucker as we do not assume that the anomaly patterns are sparse, which is too restrictive in some applications. As a result, perTucker can accommodate a wide range of anomalous pattern distributions.

3.6.3 Clustering via perTucker

perTucker provides an alternative approach to client clustering based on the local factors. Specifically, we focus on the setting of subspace clustering, which aims to cluster clients that lie within the same local subspaces. The subspace distance between clients $n_{1}$ and $n_{2}$, $\rho_{n_{1},n_{2}}=\|\hat{\bm{V}}_{n_{1}}\hat{\bm{V}}_{n_{1}}^{\top}-\hat{\bm{V}}_{n_{2}}\hat{\bm{V}}_{n_{2}}^{\top}\|_{F}^{2}$, can be calculated, where $\hat{\bm{V}}_{n}$ is the Kronecker product of the local factor matrices of client $n$. We can then apply spectral clustering to group the clients and use multidimensional scaling to produce the clustering plot (Hastie et al.,, 2009).
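A sketch of this clustering step is shown below; turning the subspace distances into an affinity matrix via a Gaussian kernel is our assumption for illustration, and scikit-learn's SpectralClustering with a precomputed affinity stands in for the spectral clustering step.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def kron_factors(V_n):
    """Kronecker product of a client's local factor matrices in reverse-mode order."""
    M = V_n[0]
    for Vk in V_n[1:]:
        M = np.kron(Vk, M)
    return M

def cluster_clients(V_hat, n_clusters):
    N = len(V_hat)
    P = []
    for V_n in V_hat:
        M = kron_factors(V_n)
        P.append(M @ M.T)                               # projection matrix of the local subspace
    D = np.array([[np.linalg.norm(P[i] - P[j], 'fro') ** 2 for j in range(N)]
                  for i in range(N)])                   # pairwise subspace distances
    A = np.exp(-D / (D.mean() + 1e-12))                 # distances -> affinities
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```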

4 Numerical Studies

Now that we have introduced perTucker and its potential applications, we validate its claimed advantages through numerical simulations. Sec. 4.1 introduces the data generation procedure. Sec. 4.2, Sec. 4.3, and Sec. 4.4 evaluate the performance of perTucker in terms of data reconstruction, classification, and clustering, respectively.

4.1 Data generation

In this simulation, each data sample is a 50-by-50 grayscale image. Each sample is the sum of a low-rank global component, a heterogeneous local component, and i.i.d. standard normal noise, as in Eq. (13):

\bm{\mathcal{Y}}_{n}=\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n}+\bm{\mathcal{E}}_{n}. \quad (13)

We generate $N=3$ clients, defined by 3 patterns for the heterogeneous local component: a Swiss pattern, an oval pattern, and a rectangle pattern, as in Fig. 3. The pixel value inside each pattern is 5, while the rest is 0. For each pattern, we generate 10 sample images. There is some variability within each pattern, as shown in the left part of Fig. 3: the Swiss can be thin or thick; the oval can be vertical, horizontal, or circular; the rectangle can be wide, tall, or square.

For the global component, we randomly create orthonormal matrices $\bm{U}_{G,1}$ and $\bm{U}_{G,2}$ of dimension $50\times 5$ shared by the 3 clients. We then randomly generate the global core tensor $\bm{\mathcal{C}}_{G,n}$ of dimension $5\times 5\times 10$ for each client, with each entry following i.i.d. $N(0,100)$. The global components are constructed by $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\times_{2}\bm{U}_{G,2}$, $n=1,2,3$.

Therefore, the full data dimension is $3\times 50\times 50\times 10$. Some examples of the data generation structure are shown in the right part of Fig. 3. The three rows correspond to the three patterns. The first two columns show the global and local components, respectively, and the third column shows the sum of the global and local components along with the error term. With noise and the global background, the local pattern can barely be recognized, which makes accurate identification of the local patterns challenging.

Refer to caption
Refer to caption
Figure 3: Left: examples of variability in each pattern. Right: examples of different components in each pattern.

4.2 Performance

In the generated data, we apply perTucker to decouple the global and local components. For comparison, we also evaluate the performance of some benchmark algorithms.

  1. globalTucker: We first concatenate the samples of all clients into a tensor $\bm{\mathcal{Y}}$ and then apply Tucker decomposition to $\bm{\mathcal{Y}}$.

  2. localTucker: We apply a standard Tucker decomposition to each client $\bm{\mathcal{Y}}_{n}$ individually.

  3. robustTucker: We first concatenate the samples from all clients into $\bm{\mathcal{Y}}$, then apply the method in Lu et al., (2019) to identify the low-rank and sparse components.

  4. perPCA: We apply perPCA (Shi and Kontar,, 2022) on the vectorized dataset, where we vectorize the $50\times 50$ images into vectors of length 2500 and use perPCA to find global and local components. Although perPCA is designed for vector datasets, this comparison highlights the need for personalized tensor decompositions when data is in tensor form.

Refer to caption
Figure 4: Reconstruction result example for three patterns

Fig. 4 depicts examples of global and local component reconstruction via perTucker. The first row represents the data, and the second row shows the reconstruction from perTucker. The columns represent global and local components. The three examples with different patterns indicate that perTucker can effectively reconstruct the shared and unique data patterns.

Furthermore, to numerically compare the performance against the benchmark algorithms, we compute several performance metrics. $\hat{\bm{U}}_{G,k}$ and $\hat{\bm{V}}_{n,k}$ denote the estimated global and local factor matrices; $\hat{\bm{\mathcal{Y}}}_{G,n}$ and $\hat{\bm{\mathcal{Y}}}_{L,n}$ denote the estimated reconstructions of the global and local components.

  1. Global subspace error: the regularized global subspace error between the ground-truth global factors $\{\bm{U}_{G,k}\}$ and the estimated global factors $\{\hat{\bm{U}}_{G,k}\}$, computed as $\varrho(\bigotimes_{k=1}^{K}\bm{U}_{G,k},\bigotimes_{k=1}^{K}\hat{\bm{U}}_{G,k})/\|\bigotimes_{k=1}^{K}\bm{U}_{G,k}\|_{F}^{2}$.

  2. Local subspace error: we first generate 100 images for each pattern and use Tucker decomposition to estimate the local factors $\{\bm{V}_{n,k}\}$ for each pattern. The regularized local subspace error is then $\varrho(\bigotimes_{k=1}^{K}\bm{V}_{n,k},\bigotimes_{k=1}^{K}\hat{\bm{V}}_{n,k})/\|\bigotimes_{k=1}^{K}\bm{V}_{n,k}\|_{F}^{2}$, averaged over the 3 patterns.

  3. Global component error: $\sum_{n}\|\hat{\bm{\mathcal{Y}}}_{G,n}-\bm{\mathcal{Y}}_{G,n}\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{G,n}\|_{F}^{2}$.

  4. Local component error: $\sum_{n}\|\hat{\bm{\mathcal{Y}}}_{L,n}-\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}$.

  5. Denoised error: $\sum_{n}\|(\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n})-(\hat{\bm{\mathcal{Y}}}_{G,n}+\hat{\bm{\mathcal{Y}}}_{L,n})\|_{F}^{2}/\sum_{n}\|\bm{\mathcal{Y}}_{G,n}+\bm{\mathcal{Y}}_{L,n}\|_{F}^{2}$.

We run each experiment 10 times with 10 different random seeds and report the mean and standard deviation. The results for the different methods are summarized in Table 1, from which we draw the following conclusions.

Table 1: Component-wise reconstruction error with standard deviation in parenthesis
Metric | perTucker | perPCA | globalTucker | localTucker | robustTucker
Global subspace error ($10^{-3}$) | \textbf{2.3} (0.8) | 536 (6) | \textbf{2.4} (0.8) | N/A | 3.7 (0.9)
Local subspace error ($10^{-1}$) | \textbf{6.4} (0.4) | 9.88 (0.07) | N/A | 9.34 (0.03) | N/A
Global component error ($10^{-3}$) | \textbf{5.7} (1.6) | 372 (16) | \textbf{5.7} (1.6) | N/A | 264 (12)
Local component error ($10^{-1}$) | \textbf{2.8} (0.6) | 23.1 (1.3) | N/A | 63 (3) | 10 (0.09)
Denoised error ($10^{-2}$) | 4 (0.7) | 13.7 (0.7) | 14.1 (0.7) | \textbf{2} (2) | 13.7 (0.7)

1) perTucker yields the best results in separating the global and local components because it utilizes the low-rank tensor structure of both components. This follows from two observations: (a) comparing the global component error of perTucker and perPCA, perTucker identifies the global component more accurately owing to its use of low-rank tensor structures; (b) comparing the local component error of perTucker and robustTucker, although robustTucker identifies the global subspace with decent accuracy, it yields a much larger local component reconstruction error because its local (sparse) component does not assume any low-rank structure.

2) The reconstruction error for the local components is larger than that for the global components across all methods. Two factors contribute to this: (a) reconstructing the local component is generally harder than reconstructing the global components, since it is only shared within the same client, so fewer samples are available, resulting in lower accuracy and larger variance; (b) the local components are generated by shape variations of the Swiss, oval, and rectangle patterns, which are not exactly low-rank; thus, the true local subspace is itself an approximation obtained from the dataset.

4.3 Classification

In this section, we evaluate the classification performance of the perTucker algorithm. The training sample size ranges from 10 to 50 in steps of 10 for each pattern. We perform the perTucker decomposition to obtain the local factor matrices, then create 50 new images for each pattern and use the method described in Sec. 3.6.1 to classify the 150 images. This procedure is repeated 100 times. Table 2 displays the mean and standard deviation of the classification accuracy for the various training sample sizes. From Table 2 we can see that perTucker exhibits excellent classification performance even when the sample size is small, and increasing the training sample size leads to improved prediction accuracy. In comparison, if we perform local Tucker on the three clients and classify the new figures, the accuracy is around 33% regardless of the training sample size, which is close to random guessing. This is because local Tucker models the commonality and peculiarity simultaneously; hence, the global features, whose coefficients are randomly generated, also influence the decision.

Table 2: Classification accuracy and standard deviation for different training sample sizes
Training sample size | 10 | 20 | 30 | 40 | 50
Accuracy | 86.7% | 91.2% | 92.7% | 93.9% | 95.0%
SD of accuracy | 6% | 4% | 4% | 4% | 3%

To visualize the classification process, we use boxplots to summarize the test statistics in (12) in Fig. 5. The sample size is set to 50, and the test statistics are centered by the median statistic within each pattern. For each true pattern, the test statistics of the correct class are significantly higher than those of the other two classes, implying that the classification procedure is effective and robust.

Refer to caption
Figure 5: Box plot for test statistics defined in (12) for different patterns and different classes

4.4 Clustering

In this section, we study the clustering performance of perTucker under a different problem setting. When generating the data, the variability within each pattern is measured by a "ratio" variable, and clustering performance is evaluated by the ability to group clients with similar ratios within each pattern. We focus on the Swiss pattern and cut the range of the ratio from 0.7 to 1.4 into 7 intervals, each of width 0.1. For each ratio interval, we generate 3 clients, each with a sample size of 100 figures. We then make clustering plots for the 21 clients and observe the aggregation of clients with similar ratios, as shown in Fig. 6.

Refer to caption
Figure 6: Clustering plot for swiss patterns

5 Case Study

In this section, we use two case study experiments to demonstrate the application of the perTucker algorithm. Sec. 5.1 provides an anomaly detection example that illustrates the power of the perTucker algorithm in detecting solar flares. The subsequent section, Sec. 5.2, demonstrates the effectiveness of the perTucker technique in classifying tonnage fault signals.

5.1 Anomaly detection in solar flare data

The first example involves monitoring solar activities and detecting solar flares from a stream of solar images. A solar flare is a significant event on the sun that emits enormous amounts of energetic charged particles, which can cause power grids to fail (Marusek, 2007). Therefore, detecting solar flares promptly is critical for implementing preemptive and corrective measures. However, monitoring solar activities is challenging because of the high dimensionality of the solar thermal imaging data and the gradual change of solar temperature over time. Existing detection methods, which estimate the functional mean (background) by the sample mean and subtract it, are insufficient for detecting small transient flares in such a dynamic system (Zhao et al., 2022a). Other studies focus on decomposing solar images into background and anomaly components (Yan et al., 2018).

This dataset, publicly available in Zhao et al. (2022a), comprises a sequence of 232 × 292-pixel images captured by a satellite; we use a sample of 300 frames in this case study. To detect solar flares in real time, we first subtract the mean from each sample and preprocess the data using the method proposed by Aharon et al. (2006). We then use a sliding window of 8×8 pixels to divide each frame into small patches after discarding the four right-most columns of pixels, resulting in 1044 patches per frame. Each patch is vectorized, yielding a data tensor of dimension 300×1044×64. After this preprocessing step, we apply the perTucker algorithm to decompose the data into two parts: global components representing the slowly changing background and local components indicating the detected solar flares. A sketch of the patching step appears below.
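The patching step can be reproduced with a short NumPy sketch; the random array below merely stands in for the real mean-subtracted frames, and the variable names are illustrative.

```python
import numpy as np

frames = np.random.randn(300, 232, 292)        # placeholder for the preprocessed frames
frames = frames[:, :, :288]                    # discard the four right-most pixel columns
T, H, W = frames.shape
p = 8                                          # sliding-window (patch) size
patches = (frames
           .reshape(T, H // p, p, W // p, p)   # split rows and columns into 8x8 tiles
           .transpose(0, 1, 3, 2, 4)           # group the two tile indices together
           .reshape(T, (H // p) * (W // p), p * p))
print(patches.shape)                           # (300, 1044, 64)
```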

Figure 7: Detection of solar flares at $t=191$ and $t=217$

Fig. 7 shows two frames in which there is an abrupt change against the almost-stationary background. In the original images, such changes are not visible within the complicated background. However, after using perTucker to separate global and local components, one can clearly see the location and movement of small, rapid changes in the local components. The experiment highlights perTucker's ability to magnify weak transient signals relative to the background.

Moreover, since the local component represents the anomaly, we can use the Frobenius norm of the local core tensor, $\left\lVert\bm{\mathcal{C}}_{L,n}\right\rVert_{F}$, as a monitoring statistic to detect anomalies. The results are shown in Fig. 8, where we plot the logarithm of $\left\lVert\bm{\mathcal{C}}_{L,n}\right\rVert_{F}$ for the 300 frames.

Figure 8: The logarithm of the Frobenius norm of the local core tensors for the 300 frames

Most of the time, the curve in Fig. 8 remains at a low level, suggesting that no significant changes in solar activity are detected. However, there are two clear peaks above the red control line, corresponding to two solar flares. Fig. 8 therefore provides an intuitive approach for detecting anomalous behavior; a sketch of this monitoring rule is given below.
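A rough sketch of this monitoring rule is given below (NumPy); `local_cores` stands in for the per-frame local core tensors returned by perTucker, and the control limit is one simple illustrative choice rather than the exact limit used in Fig. 8.

```python
import numpy as np

local_cores = [np.random.randn(5, 5, 5) for _ in range(300)]   # placeholder cores

stats = np.array([np.log(np.linalg.norm(core)) for core in local_cores])
control_limit = np.median(stats) + 3 * stats.std()             # illustrative limit
alarm_frames = np.where(stats > control_limit)[0]              # frames flagged as flares
```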

5.2 Classification for tonnage data

perTucker also applies to monitoring tonnage profiles in a multi-operation forging process that utilizes multiple strain gauge sensors. Specifically, the forging process employs four columns, each equipped with a strain gauge sensor that measures the tonnage force exerted by the press uprights, as illustrated in Fig. 9. Consequently, each operational cycle generates a four-channel tonnage profile. The dataset for this case study consists of 305 in-control profiles, collected under normal production conditions, and 69 out-of-control profiles for each of four different fault classes. Each channel profile has length 1201, resulting in a data tensor $\bm{\mathcal{Y}}\in\mathbb{R}^{4\times 1201\times 305}$ for the normal class and $\bm{\mathcal{Y}}\in\mathbb{R}^{4\times 1201\times 69}$ for each fault class. Fig. 10 presents example profiles for the normal condition and the four fault conditions.

Figure 9: Tonnage Signal Monitoring
Figure 10: Tonnage Data
Figure 11: Relative decision test statistics compared to normal when the true label is (a) Fault 1, (b) Fault 2, (c) Fault 3, (d) Fault 4, (e) Normal

In this case study, we define the five “clients” as the normal condition and the four fault conditions. We select the first 50 normal samples and the first 10 samples from each fault condition to form the training dataset. The test dataset comprises the remaining 255 normal samples and 59 samples from each fault condition. We further assume that only the global and local factor matrices along the signal-length dimension are orthogonal to each other.

Applying the classification method in Sec. 3.6.1, we achieve 100% accuracy in classifying all fault modes and normal cases. Fig. 11 shows the relative test statistics, computed as the difference between the test statistic of each fault mode and that of the normal class. Consequently, the normal test statistic is normalized to $y=0$ for every sample, shown as a red line in Fig. 11.

We can see that for every fault mode, the test statistics of the corresponding fault mode are much higher than those of the other three faults and the normal baseline. When the true label is normal, the relative test statistics of all fault modes fall below the normal baseline. Thus, perTucker provides informative statistics for anomaly detection and fault diagnostics.

6 Conclusion

In this paper, we propose the perTucker method, which decomposes tensor data into global components with shared factor matrices and local components that are orthogonal to the global components. The global and local components model the commonality and peculiarity of the data, respectively. We present an efficient block coordinate descent algorithm to solve the model; the proximal terms included in the factor-matrix updates guarantee convergence to a stationary point. Several applications of perTucker, including anomaly detection, classification, and clustering, are demonstrated through a simulation study and two case studies.

Future work will explore additional structural properties of the global and local components. For example, the local components could be constrained to be sparse to model anomaly components in different applications.

References

  • Aharon et al., (2006) Aharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322.
  • Candès et al., (2011) Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37.
  • Du et al., (2022) Du, J., Yan, H., Chang, T.-S., and Shi, J. (2022). A tensor voting-based surface anomaly classification approach by using 3d point cloud data. Journal of Manufacturing Science and Engineering, 144(5).
  • Dulal et al., (2022) Dulal, D., Karim, R. G., and Navasca, C. (2022). COVID-19 analysis using tensor methods. arXiv preprint arXiv:2212.14558.
  • Fanaee-T and Gama, (2016) Fanaee-T, H. and Gama, J. (2016). Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 98:130–147.
  • Fu et al., (2016) Fu, Y., Gao, J., Tien, D., Lin, Z., and Hong, X. (2016). Tensor lrr and sparse coding-based subspace clustering. IEEE transactions on neural networks and learning systems, 27(10):2120–2133.
  • Gahrooei et al., (2021) Gahrooei, M. R., Yan, H., Paynabar, K., and Shi, J. (2021). Multiple tensor-on-tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, 63(2):147–159.
  • Gatto et al., (2021) Gatto, B. B., dos Santos, E. M., Koerich, A. L., Fukui, K., and Junior, W. S. (2021). Tensor analysis with n-mode generalized difference subspace. Expert Systems with Applications, 171:114559.
  • Guo et al., (2022) Guo, J., Yan, H., and Zhang, C. (2022). A bayesian partially observable online change detection approach with thompson sampling. Technometrics, pages 1–13.
  • Hao et al., (2013) Hao, Z., He, L., Chen, B., and Yang, X. (2013). A linear support higher-order tensor machine for classification. IEEE Transactions on Image Processing, 22(7):2911–2920.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer.
  • He and Atia, (2022) He, Y. and Atia, G. K. (2022). Multi-mode tensor space clustering based on low-tensor-rank representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36(6), pages 6893–6901.
  • Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
  • Klus and Gelß, (2019) Klus, S. and Gelß, P. (2019). Tensor-based algorithms for image classification. Algorithms, 12(11):240.
  • Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM review, 51(3):455–500.
  • Kontar et al., (2021) Kontar, R., Shi, N., Yue, X., Chung, S., Byon, E., Chowdhury, M., Jin, J., Kontar, W., Masoud, N., Nouiehed, M., et al. (2021). The internet of federated things (IoFT). IEEE Access, 9:156071–156113.
  • Li et al., (2023) Li, J., Gao, Q., Wang, Q., Xia, W., and Gao, X. (2023). Multi-view clustering via semi-non-negative tensor factorization. arXiv preprint arXiv:2303.16748.
  • Li et al., (2011) Li, J., Han, G., Wen, J., and Gao, X. (2011). Robust tensor subspace learning for anomaly detection. International Journal of Machine Learning and Cybernetics, 2:89–98.
  • Li et al., (2020) Li, Z., Sergin, N. D., Yan, H., Zhang, C., and Tsung, F. (2020). Tensor completion for weakly-dependent data on graph for metro passenger flow prediction. 34(04):4804–4810.
  • Li et al., (2022) Li, Z., Yan, H., Tsung, F., and Zhang, K. (2022). Profile decomposition based hybrid transfer learning for cold-start data anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(6):1–28.
  • Lu et al., (2019) Lu, C., Feng, J., Chen, Y., Liu, W., Lin, Z., and Yan, S. (2019). Tensor robust principal component analysis with a new tensor nuclear norm. IEEE transactions on pattern analysis and machine intelligence, 42(4):925–938.
  • Marusek, (2007) Marusek, J. A. (2007). Solar storm threat analysis. J. Marusek.
  • Momeni and Ebrahimkhanlou, (2022) Momeni, H. and Ebrahimkhanlou, A. (2022). High-dimensional data analytics applications in shm and nde: Tensor analysis of thermal videos. In Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2022, volume 12046, pages 237–242. SPIE.
  • Nomikos and MacGregor, (1994) Nomikos, P. and MacGregor, J. F. (1994). Monitoring batch processes using multiway principal component analysis. AIChE Journal, 40(8):1361–1375.
  • Ren et al., (2022) Ren, B., Yang, L. T., Zhang, Q., Feng, J., and Nie, X. (2022). Blockchain-powered tensor meta-learning-driven intelligent healthcare system with iot assistance. IEEE Transactions on Network Science and Engineering.
  • Sandhu et al., (2018) Sandhu, R., Kaur, N., Sood, S. K., and Buyya, R. (2018). Tdrm: tensor-based data representation and mining for healthcare data in cloud computing environments. The Journal of Supercomputing, 74:592–614.
  • Shen et al., (2022) Shen, B., Xie, W., and Kong, Z. J. (2022). Smooth robust tensor completion for background/foreground separation with missing pixels: Novel algorithm with convergence guarantee. Journal of Machine Learning Research, 23(217):1–40.
  • Shi and Kontar, (2022) Shi, N. and Kontar, R. A. (2022). Personalized pca: Decoupling shared and unique features. arXiv preprint arXiv:2207.08041.
  • Sun and Luo, (2016) Sun, R. and Luo, Z.-Q. (2016). Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579.
  • Sun and Li, (2019) Sun, W. W. and Li, L. (2019). Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907.
  • Tan et al., (2013) Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., and Zhuang, Y. (2013). Logistic tensor regression for classification. In Intelligent Science and Intelligent Data Engineering: Third Sino-foreign-interchange Workshop, IScIDE 2012, Nanjing, China, October 15-17, 2012. Revised Selected Papers 3, pages 573–581. Springer.
  • Tucker, (1966) Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.
  • Wu et al., (2016) Wu, T., Benson, A. R., and Gleich, D. F. (2016). General tensor spectral co-clustering for higher-order data. Advances in Neural Information Processing Systems, 29.
  • Yan et al., (2019) Yan, H., Paynabar, K., and Pacella, M. (2019). Structured point cloud data analysis via regularized tensor regression for process modeling and optimization. Technometrics, 61(3):385–395.
  • Yan et al., (2014) Yan, H., Paynabar, K., and Shi, J. (2014). Image-based process monitoring using low-rank tensor decomposition. IEEE Transactions on Automation Science and Engineering, 12(1):216–227.
  • Yan et al., (2017) Yan, H., Paynabar, K., and Shi, J. (2017). Anomaly detection in images with smooth background via smooth-sparse decomposition. Technometrics, 59(1):102–114.
  • Yan et al., (2018) Yan, H., Paynabar, K., and Shi, J. (2018). Real-time monitoring of high-dimensional functional data streams via spatio-temporal smooth sparse decomposition. Technometrics, 60(2):181–197.
  • Yan et al., (2005) Yan, S., Xu, D., Yang, Q., Zhang, L., Tang, X., and Zhang, H.-J. (2005). Discriminant analysis with tensor representation. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 526–532. IEEE.
  • Zhang et al., (2023) Zhang, X., Shen, Q., Chen, Y., Zhang, G., Hua, Z., and Su, J. (2023). Multi-view ensemble clustering via low-rank and sparse decomposition: from matrix to tensor. ACM Transactions on Knowledge Discovery from Data.
  • (40) Zhao, X., Hu, J., Mei, Y., and Yan, H. (2022a). Adaptive partially observed sequential change detection and isolation. Technometrics, 64(4):502–512.
  • (41) Zhao, X., Yan, H., Hu, Z., and Du, D. (2022b). Deep spatio-temporal sparse decomposition for trend prediction and anomaly detection in cardiac electrical conduction. IISE Transactions on Healthcare Systems Engineering, 12(2):150–164.
  • (42) Zhao, Y., Yan, H., Holte, S., and Mei, Y. (2022c). Rapid detection of hot-spots via tensor decomposition with applications to crime rate data. Journal of Applied Statistics, 49(7):1636–1662.
  • Zhao et al., (2020) Zhao, Y., Yan, H., Holte, S. E., Kerani, R. P., and Mei, Y. (2020). Rapid detection of hot-spot by tensor decomposition on space and circular time with application to weekly gonorrhea data. In The 13th International Workshop on Intelligent Statistical Quality Control 2019.
  • Zhen et al., (2023) Zhen, Z., Paynabar, K., and Shi, J. (2023). Image-based feedback control using tensor analysis. Technometrics, pages 1–10.
  • Zhou et al., (2013) Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.
  • Zhou et al., (2019) Zhou, P., Lu, C., Feng, J., Lin, Z., and Yan, S. (2019). Tensor low-rank representation for data recovery and clustering. IEEE transactions on pattern analysis and machine intelligence, 43(5):1718–1732.
  • Zubair and Wang, (2013) Zubair, S. and Wang, W. (2013). Tensor dictionary learning with sparse tucker decomposition. In 2013 18th international conference on Digital Signal Processing (DSP), pages 1–6. IEEE.

Appendix A Proof of Proposition 1

In this section, we prove the equivalent condition stated in Proposition 1 for two tensors to be orthogonal to each other. We first introduce Lemma A.1.

Lemma A.1.

For two matrices $\bm{A}$ and $\bm{B}$,

$$\|\bm{A}\bigotimes\bm{B}\|_{F}^{2}=\|\bm{A}\|_{F}^{2}\|\bm{B}\|_{F}^{2}$$

Proof.

$$\begin{aligned}
\|\bm{A}\bigotimes\bm{B}\|_{F}^{2}&=Tr\left[(\bm{A}\bigotimes\bm{B})^{\top}(\bm{A}\bigotimes\bm{B})\right]\\
&=Tr\left[\bm{A}^{\top}\bm{A}\bigotimes\bm{B}^{\top}\bm{B}\right]\\
&=Tr(\bm{A}^{\top}\bm{A})\,Tr(\bm{B}^{\top}\bm{B})\\
&=\|\bm{A}\|_{F}^{2}\|\bm{B}\|_{F}^{2}
\end{aligned}$$

∎
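As a quick sanity check (not part of the proof), Lemma A.1 can be verified numerically, e.g., in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((5, 2))
lhs = np.linalg.norm(np.kron(A, B))**2
rhs = np.linalg.norm(A)**2 * np.linalg.norm(B)**2
assert np.isclose(lhs, rhs)   # ||A (x) B||_F^2 = ||A||_F^2 ||B||_F^2
```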

Then we prove Proposition 1

Proof.

$$\begin{aligned}
\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle&=\left\langle\bm{\mathcal{C}}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K},\ \bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\right\rangle\\
&=\left\langle\left({\bm{U}}_{G,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}\right)\bm{c}_{G,n},\ \left(\bm{V}_{n,K}\bigotimes\cdots\bigotimes\bm{V}_{n,1}\right)\bm{c}_{L,n}\right\rangle\\
&=Tr\left(\bm{c}_{G,n}^{\top}\left({\bm{U}}_{G,K}^{\top}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\right)\left(\bm{V}_{n,K}\bigotimes\cdots\bigotimes\bm{V}_{n,1}\right)\bm{c}_{L,n}\right)\\
&=Tr\left(\bm{c}_{G,n}^{\top}\left({\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\right)\bm{c}_{L,n}\right),
\end{aligned}$$

where $\bm{c}_{G,n}$ and $\bm{c}_{L,n}$ are the vectorized global and local core tensors.

On the one hand, if ${\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}=0$ for some $k$, then ${\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}=\bm{0}$ and thus $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0$ for any $\bm{\mathcal{C}}_{G,n}$, $\bm{\mathcal{C}}_{L,n}$. On the other hand, if we require $\left\langle\bm{\mathcal{Y}}_{G,n},\bm{\mathcal{Y}}_{L,n}\right\rangle=0$ for any $\bm{\mathcal{C}}_{G,n}$, $\bm{\mathcal{C}}_{L,n}$, we need

$$\begin{aligned}
&{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}=\bm{0}\\
\Leftrightarrow\ &\|{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\bigotimes\cdots\bigotimes{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\|_{F}^{2}=0\\
\Leftrightarrow\ &\|{\bm{U}}_{G,K}^{\top}\bm{V}_{n,K}\|_{F}^{2}\cdots\|{\bm{U}}_{G,1}^{\top}\bm{V}_{n,1}\|_{F}^{2}=0
\end{aligned}$$

Therefore, for at least one mode $k$, $\|{\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}\|_{F}^{2}=0$, which means that ${\bm{U}}_{G,k}^{\top}\bm{V}_{n,k}=0$. ∎

Appendix B Proof of Proposition 2

Proof.

For each $n$, the optimization problem for the global and local core tensors is

$$\min_{\bm{\mathcal{C}}_{G,n},\bm{\mathcal{C}}_{L,n}}\|\bm{\mathcal{Y}}_{n}-\bm{\mathcal{C}}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}-\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2} \quad (14)$$

By vectorizing the terms, we get

$$\begin{aligned}
&\min_{\bm{c}_{G,n},\bm{c}_{L,n}}\|\bm{y}_{n}-(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k})\bm{c}_{G,n}-(\bigotimes_{k=K}^{1}\bm{V}_{n,k})\bm{c}_{L,n}\|^{2}\\
\iff\ &\min_{\bm{c}_{G,n},\bm{c}_{L,n}}\left\lVert\bm{y}_{n}-\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}\begin{pmatrix}\bm{c}_{G,n}\\ \bm{c}_{L,n}\end{pmatrix}\right\rVert^{2}
\end{aligned}$$

where $\bigotimes_{k=K}^{1}$ denotes the Kronecker product of the matrices in the order $K,K-1,\ldots,1$, and $\bm{c}_{G,n}$ and $\bm{c}_{L,n}$ are the vectorized global and local core tensors from source $n$. This is a least-squares problem in linear regression, whose solution for $\begin{pmatrix}\bm{c}_{G,n}\\ \bm{c}_{L,n}\end{pmatrix}$ is

$$\begin{pmatrix}\bm{c}^{\star}_{G,n}\\ \bm{c}^{\star}_{L,n}\end{pmatrix}=\left[\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}^{\top}\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}\right]^{-1}\begin{pmatrix}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}&\bigotimes_{k=K}^{1}\bm{V}_{n,k}\end{pmatrix}^{\top}\bm{y}_{n}$$

Since the global and local factor matrices are orthonormal, $(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k})^{\top}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}=\bm{I}$ and $(\bigotimes_{k=K}^{1}\bm{V}_{n,k})^{\top}\bigotimes_{k=K}^{1}\bm{V}_{n,k}=\bm{I}$. The upper off-diagonal block of the Gram matrix is

$$(\bigotimes_{k=K}^{1}\bm{V}_{n,k})^{\top}\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}=\bigotimes_{k=K}^{1}\bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$$

where the last equality holds because $\bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$ for at least one mode. Similarly, the lower off-diagonal block is also $0$; therefore, the Gram matrix is the identity matrix and

$$\begin{pmatrix}\bm{c}_{G,n}^{\star}\\ \bm{c}_{L,n}^{\star}\end{pmatrix}=\begin{pmatrix}(\bigotimes_{k=K}^{1}{\bm{U}}_{G,k}^{\top})\bm{y}_{n}\\ (\bigotimes_{k=K}^{1}\bm{V}_{n,k}^{\top})\bm{y}_{n}\end{pmatrix}$$

Transforming the solution above back to tensor form gives the solution in the proposition. ∎
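For a 3-mode sample, the closed-form solution of Proposition 2 amounts to two multilinear projections of the data, as in the NumPy sketch below (illustrative names; the mode-1 local factor is generated orthogonal to the mode-1 global factor so that the mode-orthogonality assumption holds):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 20, 20))
U1, U2, U3 = (np.linalg.qr(rng.standard_normal((20, 4)))[0] for _ in range(3))
V1 = np.linalg.qr((np.eye(20) - U1 @ U1.T) @ rng.standard_normal((20, 4)))[0]
V2, V3 = (np.linalg.qr(rng.standard_normal((20, 4)))[0] for _ in range(2))
assert np.allclose(U1.T @ V1, 0)                       # mode-1 orthogonality

C_G = np.einsum('abc,ai,bj,ck->ijk', Y, U1, U2, U3)    # C_G = Y x1 U1^T x2 U2^T x3 U3^T
C_L = np.einsum('abc,ai,bj,ck->ijk', Y, V1, V2, V3)    # C_L = Y x1 V1^T x2 V2^T x3 V3^T
```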

Appendix C Proof of Lemma 3.1

Proof.

$$\begin{aligned}
\|\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top}\|_{F}^{2}&=Tr\left[(\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top})(\bm{UU}^{\top}-{\bm{U}}_{t}{\bm{U}}_{t}^{\top})^{\top}\right]\\
&=Tr\left[\bm{UU}^{\top}\bm{UU}^{\top}-2\bm{UU}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}+{\bm{U}}_{t}{\bm{U}}_{t}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}\right]\\
&=2c-2Tr\left[{\bm{U}}^{\top}{\bm{U}}_{t}{\bm{U}}_{t}^{\top}{\bm{U}}\right]
\end{aligned}$$

where $c$ is the number of columns of the factor matrix ${\bm{U}}$, so that $Tr(\bm{U}\bm{U}^{\top}\bm{U}\bm{U}^{\top})=Tr(\bm{U}^{\top}\bm{U})=c$. ∎
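The identity in Lemma 3.1 (with $c$ equal to the number of columns of the factor matrix) can also be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]    # orthonormal columns
Ut = np.linalg.qr(rng.standard_normal((10, 3)))[0]
lhs = np.linalg.norm(U @ U.T - Ut @ Ut.T)**2
rhs = 2 * U.shape[1] - 2 * np.trace(U.T @ Ut @ Ut.T @ U)
assert np.isclose(lhs, rhs)
```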

Appendix D Proof of Lemma 3.2

Proof.

The proofs of Equation (8) and Equation (9) are similar; here we take one component of Equation (8) as an example. We use ${\bm{\mathcal{C}}}^{\star}_{G,n}$ and ${\bm{\mathcal{C}}}^{\star}_{L,n}$ to denote the optimized core tensors.

$$\begin{aligned}
&\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}\\
=\ &\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-2\left\langle{\bm{\mathcal{R}}}_{G,n},{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\right\rangle+\|{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}\\
=\ &\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{C}}}^{\star}_{G,n}\right\rangle+\|{\bm{\mathcal{C}}}^{\star}_{G,n}\|_{F}^{2}
\end{aligned}$$

The second term can be written as

$$\begin{aligned}
&-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{C}}}^{\star}_{G,n}\right\rangle\\
=\ &-2\left\langle{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top},{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\right\rangle\\
=\ &-2\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}
\end{aligned}$$

and the third term as

$$\begin{aligned}
\|{\bm{\mathcal{C}}}^{\star}_{G,n}\|_{F}^{2}&=\|({\bm{\mathcal{R}}}_{G,n}+{\bm{\mathcal{C}}}_{L,n}\times_{1}{\bm{V}}_{n,1}\ldots\times_{K}{\bm{V}}_{n,K})\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}\\
&=\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}
\end{aligned}$$

Therefore, for each $n=1,\ldots,N$,

$$\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{G,n}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}$$

and Equation (8) holds. Similarly, for each $n=1,\ldots,N$,

$$\|{\bm{\mathcal{R}}}_{L,n}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{L,n}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{L,n}\times_{1}\bm{V}_{n,1}^{\top}\ldots\times_{K}\bm{V}_{n,K}^{\top}\|_{F}^{2}$$

and Equation (9) holds. ∎

Appendix E Proof of Proposition 3

Proof.

First, by Lemma 3.1, we have

$$\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}=2c-2Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]$$

where $c$ is the number of columns of the factor matrix and can therefore be omitted in the optimization. Then, by Lemma 3.2, we can rewrite the minimization problem as a maximization problem:

$$\begin{aligned}
&\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}-\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{U}}_{G,k}^{\top}({\bm{\mathcal{R}}}_{{G,n}})_{(k)}(\bigotimes_{q\neq k}{\bm{U}}_{G,q}^{\top})^{\top}\|_{F}^{2}+2\rho Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]\\
\iff\ &\max_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}Tr\left[{\bm{U}}_{G,k}^{\top}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}{\bm{U}}_{G,k}\right]+2\rho Tr\left[{\bm{U}}_{G,k}^{\top}{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}{\bm{U}}_{G,k}\right]\\
\iff\ &\max_{{\bm{U}}_{G,k}}Tr\left[{\bm{U}}_{G,k}^{\top}\left(\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\right){\bm{U}}_{G,k}\right]
\end{aligned}$$

Therefore, the columns of ${\bm{U}}_{G,k}$ are the eigenvectors of $\sum_{n=1}^{N}\bm{W}_{G,n}\bm{W}_{G,n}^{\top}+2\rho{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}$ corresponding to the top $g_{k}$ eigenvalues. ∎
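In code, the Proposition 3 update is a single symmetric eigendecomposition; the NumPy sketch below uses illustrative names and assumes the matrices $\bm{W}_{G,n}$ have already been formed from the matricized residuals as in the proposition.

```python
import numpy as np

def update_global_factor(W_list, U_prev, rho, g_k):
    """Return the top-g_k eigenvectors of sum_n W_n W_n^T + 2*rho*U_prev U_prev^T."""
    S = sum(W @ W.T for W in W_list) + 2 * rho * (U_prev @ U_prev.T)
    eigvals, eigvecs = np.linalg.eigh(S)                 # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:g_k]]
```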

Appendix F Proof of Proposition 4

We first introduce a lemma.

Lemma F.1.

Suppose $\bm{V}\in\mathbb{R}^{a\times b}$, ${\bm{U}}\in\mathbb{R}^{a\times c}$ is an orthonormal matrix, and $\bm{S}\in\mathbb{R}^{a\times a}$ is a symmetric matrix. The solution to the optimization problem

$$\max_{\bm{V}}Tr(\bm{V}^{\top}\bm{SV})\quad s.t.\quad\bm{V}^{\top}\bm{V}=\bm{I},\ \bm{V}^{\top}{\bm{U}}=0$$

consists of the eigenvectors of the matrix $(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})$ corresponding to the top $b$ eigenvalues.

Proof.

Since all feasible solutions satisfy $\bm{V}^{\top}{\bm{U}}=0$, we have

$$\begin{aligned}
&\max_{\bm{V}}Tr(\bm{V}^{\top}\bm{SV})\\
\iff\ &\max_{\bm{V}}Tr[\bm{V}^{\top}\bm{SV}-\bm{V}^{\top}\bm{UU}^{\top}\bm{SV}-\bm{V}^{\top}\bm{SUU}^{\top}\bm{V}+\bm{V}^{\top}\bm{UU}^{\top}\bm{S}\bm{UU}^{\top}\bm{V}]\\
\iff\ &\max_{\bm{V}}Tr[\bm{V}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})\bm{V}]
\end{aligned}$$

We claim that the optimal solution of

$$\max_{\bm{V}}Tr[\bm{V}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})\bm{V}]\quad s.t.\quad\bm{V}^{\top}\bm{V}=\bm{I},\ \bm{V}^{\top}{\bm{U}}=0$$

remains the same when the constraint $\bm{V}^{\top}{\bm{U}}=0$ is relaxed. When this constraint is relaxed, the optimal solution consists of the eigenvectors of the matrix $(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top})$. For any eigenvector $\bm{x}$ with eigenvalue $\lambda$, we have

$$\lambda\bm{x}^{\top}{\bm{U}}=\bm{x}^{\top}(\bm{I}-\bm{UU}^{\top})\bm{S}(\bm{I}-\bm{UU}^{\top}){\bm{U}}=0$$

so $\bm{x}^{\top}{\bm{U}}=0$ whenever $\lambda\neq 0$. Therefore, the optimal solution of the relaxed problem also satisfies $\bm{V}^{\top}{\bm{U}}=0$. This completes the proof. ∎

Then we provide the proof of Proposition 4.

Proof.

For each $n=1,\ldots,N$ and $k=1,\ldots,K$, the optimization problem to update the local factor matrix $\bm{V}_{n,k}$ is

$$\bm{V}_{n,k}=\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1}\ldots\times_{K}\bm{V}_{n,K}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} \quad (15)$$

Following a similar process to the update of the global factor matrices, the optimization problem above can be expressed as

$$\begin{aligned}
&\max_{\bm{V}_{n,k}}\|\bm{V}_{n,k}^{\top}({\bm{\mathcal{R}}}_{L,n})_{(k)}(\bigotimes_{q\neq k}\bm{V}_{n,q}^{\top})^{\top}\|_{F}^{2}+2\rho Tr\left[\bm{V}_{n,k}^{\top}\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\bm{V}_{n,k}\right]\\
\iff\ &\max_{\bm{V}_{n,k}}Tr\left[\bm{V}_{n,k}^{\top}\left(\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\right)\bm{V}_{n,k}\right]
\end{aligned}$$

Now denote $\bm{S}=\bm{W}_{L,n}\bm{W}_{L,n}^{\top}+2\rho\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}$. If $k\not\in\mathcal{K}$, there is no further constraint on $\bm{V}_{n,k}$; therefore, the columns of $\bm{V}_{n,k}$ are the eigenvectors of $\bm{S}$ corresponding to the top $l_{n,k}$ eigenvalues.
If $k\in\mathcal{K}$, the problem becomes

$$\max_{\bm{V}_{n,k}}Tr\left(\bm{V}_{n,k}^{\top}\bm{S}\bm{V}_{n,k}\right)\quad s.t.\quad\bm{V}_{n,k}^{\top}\bm{V}_{n,k}=\bm{I},\ \bm{V}_{n,k}^{\top}{\bm{U}}_{G,k}=0$$

Therefore, by Lemma F.1, letting $\bm{S}^{\prime}=(\bm{I}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top})\bm{S}(\bm{I}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top})$, the columns of $\bm{V}_{n,k}$ are the eigenvectors of $\bm{S}^{\prime}$ corresponding to the top $l_{n,k}$ eigenvalues. ∎
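The local-factor update can likewise be written as a projected eigendecomposition, following Lemma F.1; the sketch below (illustrative names) also verifies that the resulting factor is orthogonal to the global factor.

```python
import numpy as np

def update_local_factor(S, U_global, l_k):
    """Top-l_k eigenvectors of (I - UU^T) S (I - UU^T); columns are orthogonal to U."""
    P = np.eye(U_global.shape[0]) - U_global @ U_global.T
    eigvals, eigvecs = np.linalg.eigh(P @ S @ P)
    return eigvecs[:, np.argsort(eigvals)[::-1][:l_k]]

rng = np.random.default_rng(4)
M = rng.standard_normal((30, 30))
S = M @ M.T                                             # symmetric stand-in for S
U = np.linalg.qr(rng.standard_normal((30, 4)))[0]
V = update_local_factor(S, U, l_k=4)
assert np.allclose(V.T @ U, 0, atol=1e-8)               # constraint V^T U = 0 holds
```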

Appendix G Proof of Theorem 5

In this section, we prove Theorem 5. First, we show that the global factors ${\bm{U}}_{G,k}$ converge to stationary points; then we show that the local factors ${\bm{V}}_{n,k}$ also converge to stationary points. Some auxiliary lemmas are relegated to Sec. J for clarity.

For notational simplicity, we define

$$f_{G,n}\left({\bm{U}}_{G,1},{\bm{U}}_{G,2},\cdots,{\bm{U}}_{G,K}\right)=\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1}{\bm{U}}_{G,1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K}{\bm{U}}_{G,K}^{\top}\right\rVert_{F}^{2} \quad (16)$$

and

$$f_{L,n}\left({\bm{V}}_{n,1},{\bm{V}}_{n,2},\cdots,{\bm{V}}_{n,K}\right)=\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{V}}_{n,1}{\bm{V}}_{n,1}^{\top}\times_{2}\cdots\times_{K}{\bm{V}}_{n,K}{\bm{V}}_{n,K}^{\top}\right\rVert_{F}^{2} \quad (17)$$

Both $f_{G,n}$ and $f_{L,n}$ are bounded below by 0.

The following lemma shows that the global factors ${\bm{U}}_{G,k}$ converge to stationary points.

Lemma G.1.

The column spaces spanned by the ${\bm{U}}_{G,k}$'s converge to stationary points:

$$\sum_{t=1}^{T}\sum_{k=1}^{K}\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}^{2}\leq\frac{1}{\rho}\sum_{n=1}^{N}f_{G,n}\left({\bm{U}}_{G,1,1},{\bm{U}}_{G,2,1},\cdots,{\bm{U}}_{G,K,1}\right) \quad (18)$$

Notice that dividing both sides of (18) by $TK$ immediately proves the first inequality in Theorem 5.

Proof.

From Proposition 3, the update of global factor $k$ is given by

$${\bm{U}}_{G,k,t+1}=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n,t}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1,t+1}\ldots\times_{K}{\bm{U}}_{G,K,t}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}$$

where ${\bm{\mathcal{C}}}^{\star}_{G,n}={\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}$.

By Lemma 3.2, we know

$$\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}-{\bm{\mathcal{C}}}^{\star}_{G,n}\times_{1}{\bm{U}}_{G,1}\ldots\times_{K}{\bm{U}}_{G,K}\|_{F}^{2}=-\sum_{n=1}^{N}\|{\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K}^{\top}\|_{F}^{2}+const$$

From the definition of ${\bm{\mathcal{R}}}_{G,n,t}$, we know that ${\bm{\mathcal{R}}}_{G,n,t}={\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{C}}}_{L,n,t}\times_{1}{\bm{V}}_{n,1,t}\times_{2}\cdots\times_{K}{\bm{V}}_{n,K,t+1}$. Since $|{\mathcal{K}}|\geq 2$ and ${\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}=0$ for every $k\in{\mathcal{K}}$ and every $t$, there exists at least one $m\in{\mathcal{K}}$ with $m\neq k$ such that ${\bm{U}}_{G,m,t}^{\top}{\bm{V}}_{n,m,t}=0$ and ${\bm{U}}_{G,m,t+1}^{\top}{\bm{V}}_{n,m,t+1}=0$. Therefore, ${\bm{\mathcal{R}}}_{G,n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}={\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}^{\top}\times_{2}\cdots\times_{K}{\bm{U}}_{G,K,t}^{\top}$.

Then we can apply Lemma 3.2 again and rewrite the maximization problem as the minimization problem

$$\begin{aligned}
{\bm{U}}_{G,k,t+1}&=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}\|\bm{\mathcal{Y}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}{\bm{U}}_{G,1,t+1}{\bm{U}}_{G,1,t+1}^{\top}\ldots\times_{K}{\bm{U}}_{G,K,t}{\bm{U}}_{G,K,t}^{\top}\|_{F}^{2}+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
&=\arg\min_{{\bm{U}}_{G,k}}\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t})+\rho\|{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}
\end{aligned}$$

From the definition of the optimal solution, we know

$$\begin{aligned}
&\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k,t+1},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t})+\rho\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\\
\leq\ &\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,t+1},\cdots,{\bm{U}}_{G,k-1,t+1},{\bm{U}}_{G,k,t},{\bm{U}}_{G,k+1,t},\cdots,{\bm{U}}_{G,K,t}) \quad (19)
\end{aligned}$$

Summing both sides over $k$ from $1$ to $K$ and then over $t$ from $1$ to $T$, the $f_{G,n}$ terms telescope and we obtain

$$\rho\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\|_{F}^{2}\leq\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,1},\cdots,{\bm{U}}_{G,K,1})-\sum_{n=1}^{N}f_{G,n}({\bm{U}}_{G,1,T+1},\cdots,{\bm{U}}_{G,K,T+1}) \quad (20)$$

Since $f_{G,n}\geq 0$, this completes the proof. ∎

Now we analyze the convergence of the ${\bm{V}}_{n,k}$'s. The following lemma gives an upper bound on how much the subspace spanned by ${\bm{U}}_{G,k}$ can change in one update.

Lemma G.2.

If there exists a constant $B>0$ such that $\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\leq B$ for all $n$, then we have

$$\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{{\bm{U}}_{G,k,t}^{\top}}\right\rVert_{F}\leq\frac{2B\sqrt{N}}{\sqrt{\rho}} \quad (21)$$

for all $t=1,2,\cdots,T$ and all $n=1,\cdots,N$.

Proof.

Since $\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\leq B$, by the triangle inequality we know that $f_{G,n}\leq 4B^{2}$. Applying (19) and the fact that $f_{G,n}\geq 0$ proves the desired inequality. ∎

We treat the two cases $k\in{\mathcal{K}}$ and $k\notin{\mathcal{K}}$ differently, and first consider the case where $k\in{\mathcal{K}}$. In Algorithm 2, after updating ${\bm{U}}_{G,k,t+1}$, ${\bm{V}}_{n,k,t}$ is generally not orthogonal to ${\bm{U}}_{G,k,t+1}$. However, we can show that for any source $n$, there exists an orthonormal matrix ${\tilde{\bm{V}}}_{n,k,t}$ such that ${\tilde{\bm{V}}}_{n,k,t}$ is close to ${\bm{V}}_{n,k,t}$ and ${\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0$.

Lemma G.3.

Under the same conditions as Lemma G.2, if additionally the regularization parameter $\rho$ in Algorithm 2 is not too small, namely $\rho\geq 8B^{2}N$, then there exists an orthonormal matrix ${\tilde{\bm{V}}}_{n,k,t}$ such that (1) $\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}\leq 11\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}$ and (2) ${\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0$.

Proof.

We define 𝑽~n,k,t{\tilde{\bm{V}}}_{n,k,t} as,

𝑽~n,k,t=(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)12{\tilde{\bm{V}}}_{n,k,t}=\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-\frac{1}{2}}

It is easy to verify that 𝑽~n,k,t𝑼G,k,t+1=0{\tilde{\bm{V}}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}=0 and that 𝑽~n,k,t𝑽~n,k,t=𝑰{\tilde{\bm{V}}}_{n,k,t}^{\top}{\tilde{\bm{V}}}_{n,k,t}={\bm{I}}. Now we will prove an upper bound on the norm on the distance between the subspace spanned by 𝑽n,k,t{\bm{V}}_{n,k,t} and by 𝑽~n,k,t{\tilde{\bm{V}}}_{n,k,t}.

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t\displaystyle{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}
=(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)\displaystyle=\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)
(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)\displaystyle-\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)
+(𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)(𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1)𝑽n,k,t𝑽n,k,t\displaystyle+\left({\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)\left({\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right)-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}

Therefore, by triangular inequality,

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF\displaystyle\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}
𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t22(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF\displaystyle\leq\left\lVert{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}^{2}\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tF+𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1F\displaystyle+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}+\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right\rVert_{F}
+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1F\displaystyle+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}\right\rVert_{F}
4(𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F𝑽n,k,t𝑼G,k,t+1F2+2𝑽n,k,t𝑼G,k,t+1F+𝑽n,k,t𝑼G,k,t+1F2\displaystyle\leq 4\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}^{2}+2\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}+\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}\right\rVert_{F}^{2}

where the last inequality comes from the fact that

𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t2\displaystyle\left\lVert{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}
𝑽n,k,t+𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t2\displaystyle\leq\left\lVert{\bm{V}}_{n,k,t}\right\rVert+\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{2}
1+(𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,t)𝑽n,k,t2\displaystyle\leq 1+\left\lVert\left({\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right){\bm{V}}_{n,k,t}\right\rVert_{2}
1+2BNρ2\displaystyle\leq 1+\frac{2B\sqrt{N}}{\sqrt{\rho}}\leq 2

Since 𝑼G,k,t{\bm{U}}_{G,k,t} and 𝑽n,k,t{\bm{V}}_{n,k,t} are orthogonal 𝑼G,k,t𝑽n,k,t=0{\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}=0, we have,

𝑼G,k,t+1𝑽n,k,tF=𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF\displaystyle\left\lVert{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}=\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
=𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑼G,k,t𝑼G,k,t𝑽n,k,tF\displaystyle=\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}
𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\displaystyle\leq\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

Thus, when 𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF22\scriptstyle\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}\leq\frac{\sqrt{2}}{2}, we know 𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,tF12\scriptstyle\small\left\lVert{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}\leq\frac{1}{2}. As a result, (𝑰𝑽n,k,t𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t)1F2\scriptstyle\left\lVert\left({\bm{I}}-{\bm{V}}_{n,k,t}^{\top}{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right)^{-1}\right\rVert_{F}\leq 2.

Combining this with the fact that 𝑼G,k,t+1𝑽n,k,tF1\left\lVert{\bm{U}}_{G,k,t+1}^{\top}{\bm{V}}_{n,k,t}\right\rVert_{F}\leq 1, we have,

𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF11𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}\leq 11\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

This completes the proof. ∎

Finally, we will prove the second equation in Theorem 5.

Proof.

We start by analyzing the update rule of ${\bm{V}}_{n,k}$. Proposition 4 gives the update rule in two cases, depending on whether $k\in{\mathcal{K}}$; we analyze them separately.

Case I, $k\notin{\mathcal{K}}$. In this case, ${\bm{V}}_{n,k,t+1}$ is the optimal solution to

𝑽n,k,t+1=argmin𝑽n,k𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

where 𝓒L,n{\bm{\mathcal{C}}}^{\star}_{L,n} is optimized according to the factors 𝑽n,1,t+1,,𝑽n,K,t{\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t}. Thus we have,

argmin𝑽n,k𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle\arg\min_{\bm{V}_{n,k}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}
=argmin𝑽n,k𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{\bm{V}_{n,k}}-\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

Since |𝒦|2\left\lvert{\mathcal{K}}\right\rvert\geq 2, there exists at least one m𝒦m\in{\mathcal{K}} and mkm\neq k such that 𝑼G,m,t𝑽n,m,t=0{\bm{U}}_{G,m,t}^{\top}{\bm{V}}_{n,m,t}=0 and 𝑼G,m,t+1𝑽n,m,t+1=0{\bm{U}}_{G,m,t+1}^{\top}{\bm{V}}_{n,m,t+1}=0. We thus have

𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2=𝓨n×1𝑽n,1,t+1×K𝑽n,K,tF2\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}=\|\bm{\mathcal{Y}}_{n}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}

Therefore, we can use Lemma 3.2 again to derive

𝑽n,k,t+1\displaystyle{\bm{V}}_{n,k,t+1} =argmin𝑽n,k𝓨n𝓒L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{{\bm{V}}_{n,k}}\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{C}}}_{L,n,t}\times_{1}{\bm{V}}_{n,1,t+1}\cdots\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}+\rho\left\lVert{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}
=argmin𝑽n,kfL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\displaystyle=\arg\min_{{\bm{V}}_{n,k}}f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\left\lVert{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}

Since 𝑽n,k,t{\bm{V}}_{n,k,t} is a feasible solution, we know

fL,n(𝑽n,1,t+1,,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} (22)
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,,𝑽n,K,t)\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},\cdots,{\bm{V}}_{n,K,t})

Case II, $k\in{\mathcal{K}}$. By Proposition 4, the update rule of ${\bm{V}}_{n,k}$ in this case is given by

𝑽n,k,t+1=argmin𝑽n,k𝑼G,k,t+1𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\arg\min_{\bm{V}_{n,k}\perp{\bm{U}}_{G,k,t+1}}\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

where 𝓒L,n{\bm{\mathcal{C}}}^{\star}_{L,n} is optimized according to the factors 𝑽n,1,t+1,,𝑽n,K,t{\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t}. Thus we have,

𝓡L,n,t𝓒L,n×1𝑽n,1,t+1×K𝑽n,K,tF2=𝓡L,n,tF2𝓡L,n,t×1𝑽n,1,t+1×K𝑽n,K,tF2\displaystyle\small\|{\bm{\mathcal{R}}}_{L,n,t}-{\bm{\mathcal{C}}}^{\star}_{L,n}\times_{1}\bm{V}_{n,1,t+1}\ldots\times_{K}\bm{V}_{n,K,t}\|_{F}^{2}=\|{\bm{\mathcal{R}}}_{L,n,t}\|_{F}^{2}-\|{\bm{\mathcal{R}}}_{L,n,t}\times_{1}\bm{V}_{n,1,t+1}^{\top}\ldots\times_{K}\bm{V}_{n,K,t}^{\top}\|_{F}^{2}

Therefore, we can rewrite the equivalent formulation to 𝑽n,k{\bm{V}}_{n,k} as,

𝑽n,k,t+1=argmin𝑽n,k𝑼G,k,t+1fL,n(𝑽n,1,t+1,,𝑽n,k,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k𝑽n,k𝑽n,k,t𝑽n,k,tF2\bm{V}_{n,k,t+1}=\underset{\bm{V}_{n,k}\perp{\bm{U}}_{G,k,t+1}}{\mathrm{argmin}}f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k}\bm{V}_{n,k}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

The ${\tilde{\bm{V}}}_{n,k,t}$ defined in Lemma G.3 is a feasible solution; therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF2\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}

We can bound the difference between fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}) and fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}) as,

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
(𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF+𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF)2\displaystyle\leq\Big{(}\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}+\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}\Big{)}^{2}
𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF2\displaystyle\leq\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}
+2𝓨n𝓨n×1×k𝑽n,k,t𝑽n,k,t×K𝑽n,K,tF𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF\displaystyle+2\left\lVert{\bm{\mathcal{Y}}}_{n}-{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}
+𝓨n×1×k(𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t)×K𝑽n,K,tF2\displaystyle+\left\lVert{\bm{\mathcal{Y}}}_{n}\times_{1}\cdots\times_{k}\left({\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right)\times_{K}{\bm{V}}_{n,K,t}\right\rVert_{F}^{2}

The first term above is just fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t}). By Lemma J.2, the second term is upper bounded by 2𝓨nF𝓨nF𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t2\scriptstyle 2\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right\rVert_{2}, and the third term is bounded by 𝑽n,k,t𝑽n,k,t𝑽~n,k,t𝑽~n,k,t22𝓨nF2\scriptstyle\left\lVert{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}\right\rVert_{2}^{2}\left\lVert{\bm{\mathcal{Y}}}_{n}\right\rVert_{F}^{2}.

Therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽~n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\tilde{\bm{V}}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,,𝑽n,K,t)+𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t24B2\displaystyle\leq f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},\cdots,{\bm{V}}_{n,K,t})+\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{2}4B^{2}

Therefore, we have

fL,n(𝑽n,1,t+1,,𝑽n,k,t+1,𝑽n,k+1,t,,𝑽n,K,t)+ρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k,t+1},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})+\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2} (23)
fL,n(𝑽n,1,t+1,,𝑽n,k1,t+1,𝑽n,k,t,𝑽n,k+1,t,,𝑽n,K,t)\displaystyle-f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,k-1,t+1},{\bm{V}}_{n,k,t},{\bm{V}}_{n,k+1,t},\cdots,{\bm{V}}_{n,K,t})
4B2𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,t2+ρ𝑽~n,k,t𝑽~n,k,t𝑽n,k,t𝑽n,k,tF2\displaystyle\leq 4B^{2}\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{2}+\rho\left\lVert{\tilde{\bm{V}}}_{n,k,t}{\tilde{\bm{V}}}_{n,k,t}^{\top}-{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}\right\rVert_{F}^{2}
C3𝑼G,k,t+1𝑼G,k,t+1𝑼G,k,t𝑼G,k,tF\displaystyle\leq C_{3}\left\lVert{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}\right\rVert_{F}

where C3C_{3} is a constant defined as,

C3=11B2+242BNρC_{3}=11B^{2}+242B\sqrt{N\rho}

In the last inequality of (23), we applied Lemma G.2 and Lemma G.3.

Combining equation (22) in Case I and equation (23) in Case II, we know that,

fL,n(𝑽n,1,t+1,,𝑽n,K,t+1)fL,n(𝑽n,1,t,,𝑽n,K,t)\displaystyle f_{L,n}({\bm{V}}_{n,1,t+1},\cdots,{\bm{V}}_{n,K,t+1})-f_{L,n}({\bm{V}}_{n,1,t},\cdots,{\bm{V}}_{n,K,t})
k=1Kρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2+k𝒦C3𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF\displaystyle\leq\sum_{k=1}^{K}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}+\sum_{k\in{\mathcal{K}}}C_{3}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}
k=1Kρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2+k=1KC3𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF\displaystyle\leq\sum_{k=1}^{K}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}+\sum_{k=1}^{K}C_{3}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}

Summing both sides for tt from 11 to TT and for kk from 11 to KK, we have

k=1Kt=1Tρ𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2C3k=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑼n,k,t𝑼n,k,tF+fL,n,0\displaystyle\sum_{k=1}^{K}\sum_{t=1}^{T}\rho\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}\leq C_{3}\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{U}_{n,k,t}{\bm{U}_{n,k,t}^{\top}}\|_{F}+f_{L,n,0}

Considering lemma G.1 and the fact that

k=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tFTKk=1Kt=1T𝑼G,k,t+1𝑼G,k,t+1𝑽n,k,t𝑽n,k,tF2\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}\leq\sqrt{TK}\sqrt{\sum_{k=1}^{K}\sum_{t=1}^{T}\|{\bm{U}}_{G,k,t+1}{\bm{U}}_{G,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}}

we can divide both sides by $T$ and conclude that

mint{1,,T}k=1K𝑽n,k,t+1𝑽n,k,t+1𝑽n,k,t𝑽n,k,tF2=O(1T)\displaystyle\min_{t\in\{1,\cdots,T\}}\sum_{k=1}^{K}\|\bm{V}_{n,k,t+1}\bm{V}_{n,k,t+1}^{\top}-\bm{V}_{n,k,t}{\bm{V}_{n,k,t}^{\top}}\|_{F}^{2}=O\left(\frac{1}{\sqrt{T}}\right)

This completes our proof. ∎

Appendix H Simulation Study about the Convergence Rate

In this simulation, we generate the data according to (1) but without noise for better visualization. We assume that there are 5 sources of data, each with 10 samples, and each sample has dimension $50\times 50\times 50$. Therefore, the data are ${\bm{\mathcal{Y}}}_{1},\ldots,{\bm{\mathcal{Y}}}_{5}$, each with dimension $50\times 50\times 50\times 10$.

For data generation, we draw $\bm{U}_{G,1},\bm{U}_{G,2},\bm{U}_{G,3}$ of dimension $50\times 5$ from the standard normal distribution and orthonormalize them via SVD. We then draw the global core tensors $\bm{\mathcal{C}}_{G,1},\ldots,\bm{\mathcal{C}}_{G,5}$ of dimension $5\times 5\times 5\times 10$ from the standard normal distribution. For $n=1,\ldots,5$, the global component is constructed as $\bm{\mathcal{Y}}_{G,n}=\bm{\mathcal{C}}_{G,n}\times_{1}\bm{U}_{G,1}\times_{2}\bm{U}_{G,2}\times_{3}\bm{U}_{G,3}$. When generating the local factor matrices, we assume $\mathcal{K}=\{1,2\}$. For each $n\in\{1,\ldots,5\}$ and $k\in\mathcal{K}$, $\bm{V}_{n,k}$ is first drawn from the standard normal distribution with dimension $50\times 5$; we then apply $(\bm{I}-\bm{U}_{G,k}(\bm{U}_{G,k}^{\top}\bm{U}_{G,k})^{-1}\bm{U}_{G,k}^{\top})\bm{V}_{n,k}$ to project the local factor matrix onto the orthogonal complement of the corresponding global factor matrix, and finally orthonormalize the result via SVD. For $k\notin\mathcal{K}$, we draw $\bm{V}_{n,k}$ from the standard normal distribution and orthonormalize it via SVD. Similarly, we draw the local core tensors $\bm{\mathcal{C}}_{L,1},\ldots,\bm{\mathcal{C}}_{L,5}$ of dimension $5\times 5\times 5\times 10$ from the standard normal distribution. The local components, for $n=1,\ldots,5$, are constructed as $\bm{\mathcal{Y}}_{L,n}=\bm{\mathcal{C}}_{L,n}\times_{1}\bm{V}_{n,1}\times_{2}\bm{V}_{n,2}\times_{3}\bm{V}_{n,3}$.
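
For concreteness, the following NumPy sketch mirrors the data-generation procedure above. It is a minimal illustration under the stated setting; helper names such as \texttt{orthonormal} and \texttt{mode\_product} are ours and not part of the paper's implementation.

\begin{verbatim}
import numpy as np

def orthonormal(M):
    """Return an orthonormal basis of the column space of M (thin SVD)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

rng = np.random.default_rng(0)
dim, rank, n_sources, n_samples = 50, 5, 5, 10
personalized_modes = {0, 1}      # K = {1, 2} in the paper (0-indexed here)

# Shared global factor matrices U_{G,1}, U_{G,2}, U_{G,3}
U_G = [orthonormal(rng.standard_normal((dim, rank))) for _ in range(3)]

data = []
for n in range(n_sources):
    # Global component Y_{G,n} = C_{G,n} x_1 U_{G,1} x_2 U_{G,2} x_3 U_{G,3}
    C_G = rng.standard_normal((rank, rank, rank, n_samples))
    Y_G = C_G
    for k in range(3):
        Y_G = mode_product(Y_G, U_G[k], k)

    # Local factor matrices; on personalized modes, project onto the
    # orthogonal complement of the global factors (I - U U^T, U orthonormal)
    V_n = []
    for k in range(3):
        V = rng.standard_normal((dim, rank))
        if k in personalized_modes:
            V = V - U_G[k] @ (U_G[k].T @ V)
        V_n.append(orthonormal(V))

    # Local component and noiseless observation Y_n = Y_{G,n} + Y_{L,n}
    C_L = rng.standard_normal((rank, rank, rank, n_samples))
    Y_L = C_L
    for k in range(3):
        Y_L = mode_product(Y_L, V_n[k], k)
    data.append(Y_G + Y_L)
\end{verbatim}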

When running perTucker, we use the true dimensions of the global and local factor matrices. Along the iterations, we monitor the mean subspace errors of the global and local factor matrices for each client, namely $\frac{1}{3}\sum_{k=1}^{3}\|{\bm{U}}_{G,k,t}{\bm{U}}_{G,k,t}^{\top}-{\bm{U}}_{G,k}{\bm{U}}_{G,k}^{\top}\|_{F}^{2}$ for the global factor matrices and $\frac{1}{3}\sum_{k=1}^{3}\|{\bm{V}}_{n,k,t}{\bm{V}}_{n,k,t}^{\top}-{\bm{V}}_{n,k}{\bm{V}}_{n,k}^{\top}\|_{F}^{2}$ for the local factor matrices of source $n$.
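
The monitoring statistic can be computed as in the short sketch below, assuming the estimated and true factor matrices are available as lists of NumPy arrays (the helper name \texttt{mean\_subspace\_error} is ours):

\begin{verbatim}
import numpy as np

def mean_subspace_error(estimated, truth):
    """Mean squared Frobenius distance between the projection matrices
    U_hat U_hat^T and U U^T over the factor matrices of the three modes."""
    errors = [
        np.linalg.norm(U_hat @ U_hat.T - U @ U.T, "fro") ** 2
        for U_hat, U in zip(estimated, truth)
    ]
    return sum(errors) / len(errors)

# Global monitoring at iteration t:    mean_subspace_error(U_G_hat_t, U_G)
# Local monitoring for source n at t:  mean_subspace_error(V_hat_t[n], V_true[n])
\end{verbatim}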

Appendix I Discussion of Classification Rule in Sec. 3.6.1

From Proposition 2, when a new piece of data $\bm{\mathcal{Y}}^{\text{new}}$ arrives, we can first derive the optimal global core tensor $\bm{\mathcal{C}}^{\star}_{G}$ and, for each class $n$, the optimal local core tensor $\bm{\mathcal{C}}^{\star}_{L,n}$. The decision rule that minimizes the reconstruction error over all classes is then given by

\[
\hat{n}=\arg\min_{n}\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}.
\tag{24}
\]

We now show that decision rule (24) is equivalent to decision rule (12). Expanding the squared norm in (24), we have

\[
\begin{aligned}
&\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}\\
&=\left\lVert\bm{\mathcal{Y}}^{\text{new}}\right\rVert_{F}^{2}+\left\lVert\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}\right\rVert_{F}^{2}+\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}\\
&\quad-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}\right\rangle-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle\\
&\quad-2\left\langle\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle.
\end{aligned}
\]

The first, second, and fourth terms do not depend on $n$ and can therefore be excluded. The last term is zero because the global and local components are orthogonal. The third and fifth terms can be rewritten as

\[
\begin{aligned}
&\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}-2\left\langle\bm{\mathcal{Y}}^{\text{new}},\,\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rangle\\
&=\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2}-2\left\langle\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\cdots\times_{K}\hat{\bm{V}}_{n,K}^{\top},\,\bm{\mathcal{C}}^{\star}_{L,n}\right\rangle\\
&=-\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2},
\end{aligned}
\]

where the first equality uses the orthonormality of the local factor matrices and the second uses the expression of the optimal local core tensor, $\bm{\mathcal{C}}^{\star}_{L,n}=\bm{\mathcal{Y}}^{\text{new}}\times_{1}\hat{\bm{V}}_{n,1}^{\top}\cdots\times_{K}\hat{\bm{V}}_{n,K}^{\top}$.

Therefore,

\[
\arg\min_{n}\left\lVert\bm{\mathcal{Y}}^{\text{new}}-\bm{\mathcal{C}}^{\star}_{G}\times_{1}\hat{\bm{U}}_{G,1}\cdots\times_{K}\hat{\bm{U}}_{G,K}-\bm{\mathcal{C}}^{\star}_{L,n}\times_{1}\hat{\bm{V}}_{n,1}\cdots\times_{K}\hat{\bm{V}}_{n,K}\right\rVert_{F}^{2}=\arg\max_{n}\left\lVert\bm{\mathcal{C}}^{\star}_{L,n}\right\rVert_{F}^{2}.
\]
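
As an illustration of this equivalence, a minimal sketch of the resulting classifier is given below. It assumes the estimated local factor matrices are orthonormal and that the optimal local core tensor is obtained by projecting $\bm{\mathcal{Y}}^{\text{new}}$ onto them, as in the derivation above; the names \texttt{classify\_by\_local\_energy} and \texttt{mode\_product} are ours.

\begin{verbatim}
import numpy as np

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

def classify_by_local_energy(Y_new, V_hat):
    """Assign Y_new to the class whose optimal local core tensor has the
    largest squared Frobenius norm, i.e., argmax_n ||C*_{L,n}||_F^2.

    V_hat maps each class n to its list of orthonormal local factor
    matrices [V_{n,1}, ..., V_{n,K}].
    """
    scores = {}
    for n, factors in V_hat.items():
        C_L = Y_new
        for k, V in enumerate(factors):
            C_L = mode_product(C_L, V.T, k)   # C*_{L,n} = Y_new x_1 V^T ... x_K V^T
        scores[n] = np.linalg.norm(C_L) ** 2
    return max(scores, key=scores.get)
\end{verbatim}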

Appendix J Auxiliary Lemmas

In this section, we present several auxiliary lemmas that are useful for the theoretical analysis.

Lemma J.1.

For two matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{n\times p}$, we have

\[
\left\lVert AB\right\rVert_{F}\leq\left\lVert A\right\rVert_{F}\left\lVert B\right\rVert_{2}.
\]

This lemma is identical to Proposition B.4 in Sun and Luo (2016); we therefore omit the proof to avoid duplication.
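
A quick numerical sanity check of Lemma J.1, for illustration only (the random matrices and tolerance are ours):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 5))

lhs = np.linalg.norm(A @ B, "fro")
rhs = np.linalg.norm(A, "fro") * np.linalg.norm(B, 2)  # ||B||_2: largest singular value
assert lhs <= rhs + 1e-12
\end{verbatim}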

Lemma J.2.

For a core tensor $\bm{\mathcal{C}}$ and $K$ factor matrices $\{\bm{U}_{k}\}_{k=1}^{K}$, the following holds:

\[
\left\lVert\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{K}\bm{U}_{K}\right\rVert_{F}\leq\left\lVert\bm{\mathcal{C}}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2}.
\]
Proof.

The proof is straightforward. Let $\bm{\mathcal{Y}}=\bm{\mathcal{C}}\times_{1}\bm{U}_{1}\times_{2}\cdots\times_{K}\bm{U}_{K}$. Its mode-1 unfolding is $\bm{\mathcal{Y}}_{(1)}=\bm{U}_{1}\bm{\mathcal{C}}_{(1)}\left(\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right)^{\top}$. Since $\left\lVert\bm{\mathcal{Y}}\right\rVert_{F}=\left\lVert\bm{\mathcal{Y}}_{(1)}\right\rVert_{F}$, we have

\[
\begin{aligned}
\left\lVert\bm{\mathcal{Y}}\right\rVert_{F}&=\left\lVert\bm{\mathcal{Y}}_{(1)}\right\rVert_{F}=\left\lVert\bm{U}_{1}\bm{\mathcal{C}}_{(1)}\left(\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right)^{\top}\right\rVert_{F}\\
&\leq\left\lVert\bm{U}_{1}\right\rVert_{2}\left\lVert\bm{\mathcal{C}}_{(1)}\right\rVert_{F}\left\lVert\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right\rVert_{2}\\
&\leq\left\lVert\bm{\mathcal{C}}_{(1)}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2}=\left\lVert\bm{\mathcal{C}}\right\rVert_{F}\prod_{k=1}^{K}\left\lVert\bm{U}_{k}\right\rVert_{2},
\end{aligned}
\]

where we used Lemma J.1 in the first inequality and the fact that $\left\lVert\bm{U}_{K}\otimes\bm{U}_{K-1}\otimes\cdots\otimes\bm{U}_{2}\right\rVert_{2}\leq\left\lVert\bm{U}_{K}\right\rVert_{2}\cdots\left\lVert\bm{U}_{2}\right\rVert_{2}$ in the second inequality. ∎
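
A similar numerical check of Lemma J.2, with the mode product computed via unfolding; the helper \texttt{mode\_product} is the same one used in the earlier sketches, and the random dimensions are ours:

\begin{verbatim}
import numpy as np

def mode_product(T, M, mode):
    """Mode-k tensor-matrix product T x_k M, computed through unfolding."""
    T = np.moveaxis(T, mode, 0)
    out = M @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((M.shape[0],) + T.shape[1:]), 0, mode)

rng = np.random.default_rng(2)
C = rng.standard_normal((3, 4, 5))
Us = [rng.standard_normal((7, 3)),
      rng.standard_normal((8, 4)),
      rng.standard_normal((9, 5))]

Y = C
for k, U in enumerate(Us):
    Y = mode_product(Y, U, k)

lhs = np.linalg.norm(Y)   # Frobenius norm of the full tensor
rhs = np.linalg.norm(C) * np.prod([np.linalg.norm(U, 2) for U in Us])
assert lhs <= rhs + 1e-9
\end{verbatim}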