
SliceNStitch: Continuous CP Decomposition of Sparse Tensor Streams

Taehyung Kwon$^{1,2}$, Inkyu Park$^{1,2}$, Dongjin Lee$^{4}$, and Kijung Shin$^{2,4}$
$^{2}$Graduate School of AI and $^{4}$School of Electrical Engineering, KAIST, Daejeon, South Korea
{taehyung.kwon, inkyupark, dongjin.lee, kijungs}@kaist.ac.kr
Abstract

Consider traffic data (i.e., triplets in the form of source-destination-timestamp) that grow over time. Tensors (i.e., multi-dimensional arrays) with a time mode are widely used for modeling and analyzing such multi-aspect data streams. In such tensors, however, new entries are added only once per period, which is often an hour, a day, or even a year. This discreteness of tensors has limited their usage for real-time applications, where new data should be analyzed instantly as it arrives.

How can we analyze time-evolving multi-aspect sparse data ‘continuously’ using tensors where time is ‘discrete’? We propose SliceNStitch for continuous CANDECOMP/PARAFAC (CP) decomposition, which has numerous time-critical applications, including anomaly detection, recommender systems, and stock market prediction. SliceNStitch changes the starting point of each period adaptively, based on the current time, and updates factor matrices (i.e., outputs of CP decomposition) instantly as new data arrives. We show, theoretically and experimentally, that SliceNStitch is (1) ‘Any time’: updating factor matrices immediately without having to wait until the current time period ends, (2) Fast: with constant-time updates up to $464\times$ faster than online methods, and (3) Accurate: with fitness comparable (specifically, $72$-$100\%$) to offline methods.

* Equal contribution.

I Introduction

Tensors (i.e., multidimensional arrays) are simple but powerful tools widely used for representing time-evolving multi-aspect data from various fields, including bioinformatics [1], data mining [2, 3], text mining [4], and cybersecurity [5, 6]. For example, consider a traffic dataset given as triplets in the form of (source, destination, timestamp). The dataset is naturally represented as a $3$-mode tensor whose three modes correspond to sources, destinations, and time ranges, respectively (see Fig. 1a and 1b). Each $(i,j,k)$-th entry of the tensor represents the amount of traffic from the $i$-th source to the $j$-th destination during the $k$-th time range.

[Figure 1 consists of five plots and a table: (a) Coarse-grained Tensor, (b) Fine-grained Tensor, (c) Average Fitness, (d) Number of Parameters, (e) Runtime per Update, and (f) the summary table below.]

                   Conventional CPD    Conventional CPD    SliceNStitch
                   (Coarse-grained)    (Fine-grained)      (Proposed)
Update Interval    Long                Short               Short
Parameters         Few                 Many                Few
Fitness            High                Low                 High
(f) Summary: SliceNStitch is fast, space-efficient, and accurate.

Figure 1: Advantages of SliceNStitch. (a), (b) Coarse-grained and fine-grained tensors whose update intervals are 1 hour and 1 second, respectively. (c), (d) Given a tensor stream (see Section VI-B for detailed experimental settings), SliceNStitch updates factor matrices (i.e., outputs of CPD) instantly (i.e., with a short update interval) while achieving high fitness with a small number of parameters. (e) Even the runtime per update is shorter in SliceNStitch than in the three considered methods based on conventional CPD. (f) A summary of the comparisons in (c) and (d).

Once we represent data as a tensor, many tools [7, 8, 9, 10] are available for tensor analysis, and CANDECOMP/PARAFAC decomposition (CPD) [7] is one of the most widely used. Given an $M$-mode tensor, CPD gives a low-rank approximation, specifically a sum of a few outer products of $M$ vectors, which form $M$ factor matrices. CPD has been used for various applications, and many of them, including anomaly detection [2], recommendation [11, 12], stock market prediction [13], and weather forecasting [14], are time-critical.

While tensors and CPD are powerful tools, they are not suitable for real-time applications since time in them advances in a discrete manner, specifically once per period. For example, in the tensor in Fig. 1a, each slice represents the amounts of traffic for one hour, and thus the tensor grows with a new slice only once per hour. That is, it may take one hour for new traffic to be applied to the tensor. For instance, traffic occurring at 2:00:01 is applied to the tensor at 3:00:00. Due to this discreteness, the outputs of CPD (i.e., factor matrices) are updated also only once per period even if incremental algorithms [15, 16, 17] are used.

How can we perform CPD ‘continuously’ for real-time applications? A potential solution is to make the granularity of the time mode extremely fine. According to our preliminary studies, however, it causes the following problems:

  • Degradation of Fitness (Fig. 1c): An extremely fine-grained time mode results in an extremely sparse tensor, which is known to be of high rank [18], and thus it degrades the fitness of low-rank approximations such as conventional CPD. As shown in Fig. 1c, the finer the time mode is, the lower the fitness of CPD is.

  • Increase in the Number of Parameters (Fig. 1d): The parameters of CPD are the entries of factor matrices, as explained in Section II, and the size of each factor matrix is proportional to the length of the corresponding mode of the input tensor. An extremely fine-grained time mode leads to an extremely long time mode and thus extremely many parameters, which require huge computational and storage resources. As shown in Fig. 1d, the finer the time mode is, the more parameters CPD requires.

In this work, we propose SliceNStitch for continuous CPD without increasing the number of parameters. It consists of a data model and online algorithms for CPD. From the data model aspect, we propose the continuous tensor model for time-evolving tensors. In the model, the starting point of each period changes adaptively, based on the current time, so that newly arrived data are applied to the input tensor instantly. From the algorithmic aspect, we propose a family of online algorithms for CPD of sparse tensors in the continuous tensor model. They update factor matrices instantly in response to each change in an entry of the input tensor. To the best of our knowledge, they are the first algorithms for this purpose, and existing online CPD algorithms [15, 16, 17] update factor matrices only once per period. We summarize our contributions as follows:

  • New data model: We propose the continuous tensor model, which allows for processing time-evolving tensors continuously in real-time for time-critical applications.

  • Fast online algorithms: We propose the first online algorithms that update the outputs of CPD instantly in response to changes in an entry. Their fitness is comparable (specifically, $72$-$100\%$) even to that of offline competitors, and an update by them is up to $464\times$ faster than that of online competitors.

  • Extensive experiments: We extensively evaluate our algorithms on 4 real-world sparse tensors, and based on the results, we provide practitioner’s guides to algorithm selection and hyperparameter tuning.

Reproducibility: The code and datasets used in the paper are available at https://github.com/DMLab-Tensor/SliceNStitch.

Remarks: CPD may not be the best decomposition model for tensors with time modes, and there exist a number of alternatives, such as Tucker, INDSCAL, and DEDICOM (see [19]). Nevertheless, as a first step, we focus on making CPD ‘continuous’ due to its prevalence and simplicity. We leave extending our approach to more models as future work.

In Section II, we introduce some notations and preliminaries. In Section III, we provide a formal problem definition. In Sections IV and V, we present the model and optimization algorithms of SliceNStitch, respectively. In Section VI, we review our experiments. After discussing some related works in Section VII, we conclude in Section VIII.

II Preliminaries

In this section, we introduce some notations and preliminaries. Some frequently-used symbols are listed in Table I.

TABLE I: Table of frequently-used symbols
Symbol                                Definition
$\bm{A}$                              a matrix
$\bm{A}(i,:)$, $\bm{A}(:,i)$          $i$-th row of $\bm{A}$, $i$-th column of $\bm{A}$
$\bm{A}'$, $\bm{A}^{\dagger}$         transpose of $\bm{A}$, pseudoinverse of $\bm{A}$
$\odot$, $\ast$                       Khatri-Rao product, Hadamard product
$\bm{\mathcal{X}}$                    a tensor
$M$                                   order of $\bm{\mathcal{X}}$
$N_{m}$                               number of indices in the $m$-th mode of $\bm{\mathcal{X}}$
$x_{i_{1},i_{2},\cdots,i_{M}}$        $(i_{1},i_{2},\cdots,i_{M})$-th entry of $\bm{\mathcal{X}}$
$|\bm{\mathcal{X}}|$                  number of non-zeros of $\bm{\mathcal{X}}$
$\|\bm{\mathcal{X}}\|_{F}$            Frobenius norm of $\bm{\mathcal{X}}$
$\bm{X}_{(m)}$                        mode-$m$ matricization of $\bm{\mathcal{X}}$
$R$                                   rank of CPD
$\bm{A}^{(m)}$                        factor matrix of the $m$-th mode
$\widetilde{\bm{\mathcal{X}}}$        an approximation of $\bm{\mathcal{X}}$ by CPD
$\bm{\mathcal{D}}(t,W)$               tensor window at time $t$
$\Delta\bm{\mathcal{X}}$              a change in $\bm{\mathcal{X}}$
$W$                                   number of indices in the time mode
$deg(m,i_{m})$                        number of non-zeros with $m$-th mode index $i_{m}$
$a^{(m)}_{ij}$                        $(i,j)$-th entry of $\bm{A}^{(m)}$

Basic Notations: Consider a matrix $\bm{A}$. We denote its $i$-th row by $\bm{A}(i,:)$ and its $i$-th column by $\bm{A}(:,i)$. We denote the transpose of $\bm{A}$ by $\bm{A}'$ and the Moore-Penrose pseudoinverse of $\bm{A}$ by $\bm{A}^{\dagger}$. We denote the Khatri-Rao and Hadamard products by $\odot$ and $\ast$, respectively. See Section I of [20] for the definitions of the Moore-Penrose pseudoinverse and both products.

Consider an $M$-mode sparse tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M}}$, where $N_{m}$ denotes the length of the $m$-th mode. We denote each $(i_{1},i_{2},\cdots,i_{M})$-th entry of $\bm{\mathcal{X}}$ by $x_{i_{1},i_{2},\cdots,i_{M}}$. We let $|\bm{\mathcal{X}}|$ be the number of non-zero entries in $\bm{\mathcal{X}}$, and we let $\|\bm{\mathcal{X}}\|_{F}$ be the Frobenius norm of $\bm{\mathcal{X}}$. We let $\bm{X}_{(m)}$ be the matrix obtained by matricizing $\bm{\mathcal{X}}$ along the $m$-th mode. See Section I of [20] for the definitions of the Frobenius norm and matricization.

CANDECOMP/PARAFAC Decomposition (CPD): Given an $M$-mode tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M}}$ and rank $R\in\mathbb{N}$, CANDECOMP/PARAFAC Decomposition (CPD) [7] gives a rank-$R$ approximation of $\bm{\mathcal{X}}$, expressed as the sum of $R$ rank-$1$ tensors (i.e., outer products of vectors) as follows:

$$\bm{\mathcal{X}}\approx\sum_{r=1}^{R}\bm{a}^{(1)}_{r}\circ\bm{a}^{(2)}_{r}\circ\cdots\circ\bm{a}^{(M)}_{r}\equiv\sum_{r=1}^{R}\bm{A}^{(1)}(:,r)\circ\bm{A}^{(2)}(:,r)\circ\cdots\circ\bm{A}^{(M)}(:,r)\equiv\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket\equiv\widetilde{\bm{\mathcal{X}}},\quad(1)$$

where $\bm{a}^{(m)}_{r}\in\mathbb{R}^{N_{m}}$ for all $r\in\{1,2,\cdots,R\}$, and $\circ$ denotes the outer product (see Section I of [20] for the definition). Each $\bm{A}^{(m)}\equiv[\bm{a}^{(m)}_{1}~\bm{a}^{(m)}_{2}~\cdots~\bm{a}^{(m)}_{R}]\in\mathbb{R}^{N_{m}\times R}$ is called the factor matrix of the $m$-th mode.
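To make Eq. (1) concrete, the following is a minimal NumPy sketch (not the paper's C++ implementation) of how the approximation $\widetilde{\bm{\mathcal{X}}}$ is assembled from a list of factor matrices; the function name cp_reconstruct is illustrative only.

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild the dense approximation [[A^(1), ..., A^(M)]] from factor matrices.

    factors: list of M arrays, the m-th of shape (N_m, R).
    Returns an array of shape (N_1, ..., N_M).
    """
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    approx = np.zeros(shape)
    for r in range(R):
        # outer product of the r-th columns of all factor matrices
        rank_one = factors[0][:, r]
        for A in factors[1:]:
            rank_one = np.multiply.outer(rank_one, A[:, r])
        approx += rank_one
    return approx
```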

CP decomposition aims to find factor matrices that minimize the difference between the input tensor $\bm{\mathcal{X}}$ and its approximation $\widetilde{\bm{\mathcal{X}}}$. That is, it aims to solve Eq. (2).

$$\min_{\bm{A}^{(1)},\cdots,\bm{A}^{(M)}}\Big\|\bm{\mathcal{X}}-\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket\Big\|_{F},\quad(2)$$

where $\|\cdot\|_{F}$ is the Frobenius norm (see Section I of [20] for the definition).

Alternating Least Squares (ALS): Alternating least squares (ALS) [8, 21] is a standard algorithm for computing the CPD of a static tensor. For each $n$-th mode, $\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket_{(n)}=\bm{A}^{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})'$, and thus the mode-$n$ matricization of Eq. (2) becomes

$$\min_{\bm{A}^{(1)},\cdots,\bm{A}^{(M)}}\Big\|\bm{X}_{(n)}-\bm{A}^{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})'\Big\|_{F}.\quad(3)$$

While the objective function in Eq. (3) is non-convex, solving Eq. (3) only for $\bm{A}^{(n)}$ while fixing all the other factor matrices is a least-squares problem. Setting the derivative of the objective function in Eq. (3) with respect to $\bm{A}^{(n)}$ to zero leads to the following update rule for $\bm{A}^{(n)}$:

$$\bm{A}^{(n)}\leftarrow\bm{X}_{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})\{\ast_{m\neq n}^{M}{\bm{A}^{(m)}}'\bm{A}^{(m)}\}^{\dagger}.\quad(4)$$

ALS performs CPD by alternately updating each factor matrix $\bm{A}^{(n)}$ using Eq. (4) until convergence.
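For illustration, here is a minimal NumPy sketch of one ALS sweep using the update rule of Eq. (4). It is not the authors' implementation; the helper names (unfold, khatri_rao, als_step) are hypothetical, and the unfolding and Khatri-Rao conventions are simply chosen to be mutually consistent.

```python
import numpy as np
from functools import reduce

def unfold(X, mode):
    """Mode-n matricization (C-order columns; consistent with khatri_rao below)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that all have R columns."""
    return reduce(lambda A, B: np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1]), mats)

def als_step(X, factors):
    """One full ALS sweep: update every factor matrix once via Eq. (4)."""
    for n in range(len(factors)):
        others = [A for m, A in enumerate(factors) if m != n]
        K = khatri_rao(others)                              # Khatri-Rao of the other factors
        H = reduce(np.multiply, [A.T @ A for A in others])  # Hadamard product of Gram matrices
        factors[n] = unfold(X, n) @ K @ np.linalg.pinv(H)
    return factors
```

Repeating als_step until the fit stops improving yields the CPD of a static tensor.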

III Problem Definition

In this section, we first define multi-aspect data streams. Then, we describe how it is typically modeled as a tensor and discuss the limitation of the common tensor modeling method. Lastly, we introduce the problem considered in this work.

Multi-aspect Data Stream and Examples: We define a multi-aspect data stream, which we aim to analyze, as follows:

Definition 1 (Multi-aspect Data Stream).

A multi-aspect data stream is defined as a sequence of timestamped $M$-tuples $\{(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})\}_{n\in\mathbb{N}}$, where $i_{1},\cdots,i_{M-1}$ are categorical variables, $v_{n}\in\mathbb{R}$ is a numerical variable, and $t_{n}\in\mathbb{N}$ is the time (e.g., Unix timestamp) when the $n$-th tuple $e_{n}$ occurs. We assume that the sequence is chronological, i.e.,

$$t_{n}\leq t_{m}~\text{if}~n<m.$$

For simplicity, we also assume that the categorical variables are (assigned to) integers, i.e.,

$$i_{m}\in\{1,\cdots,N_{m}\},~\forall m\in\{1,\cdots,M-1\}.$$

Time-evolving data from various domains are naturally represented as a multi-aspect data stream as follows:

  • Traffic History: Each $3$-tuple $e_{n}$ = (source, destination, $1$) indicates a trip that started at time $t_{n}$.

  • Crime History: Each $3$-tuple $e_{n}$ = (location, type, $1$) indicates an incidence of crime at time $t_{n}$.

  • Purchase History: Each $4$-tuple $e_{n}$ = (user, product, color, quantity) indicates a purchase at time $t_{n}$.

Common Tensor Modeling Method and its Limitations: Multi-aspect data streams are commonly modeled as tensors to benefit from powerful tensor-analysis tools (e.g., CP decomposition) [22, 23, 24, 25, 26, 27, 16, 17]. Specifically, a multi-aspect data stream is modeled as an $M$-mode tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$, where $W$ is the number of indices in the time mode (i.e., the $M$-th mode). For each $w\in\{1,\cdots,W\}$, let $\bm{\mathcal{G}}_{w}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M-1}}$ be the $(M-1)$-mode tensor obtained from $\bm{\mathcal{X}}$ by fixing the $M$-th mode index to $w$. That is, $\bm{\mathcal{X}}\equiv\bm{\mathcal{G}}_{1}~||~\cdots~||~\bm{\mathcal{G}}_{W-1}~||~\bm{\mathcal{G}}_{W}$, where $||$ denotes concatenation. Each tensor $\bm{\mathcal{G}}_{w}$ is the sum over $T$ time units (i.e., $T$ is the period) ending at $wT$. That is, each $(j_{1},\cdots,j_{M-1})$-th entry of $\bm{\mathcal{G}}_{w}$ is the sum of $v_{n}$ over all $M$-tuples $e_{n}$ where $i_{m}=j_{m}$ for all $m\in\{1,\cdots,M-1\}$ and $t_{n}\in(wT-T,wT]$. See Figs. 1a and 1b for examples where $T$ is an hour and a second, respectively. As new tuples in the multi-aspect data stream arrive, a new $(M-1)$-mode tensor is added to $\bm{\mathcal{X}}$ once per period $T$; a minimal sketch of this bucketing is given below.
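The following rough sketch illustrates the conventional bucketing (dense array and 0-based categorical indices assumed for brevity; bucket_stream is a hypothetical helper, not part of the paper).

```python
import numpy as np

def bucket_stream(stream, dims, T, W):
    """Aggregate timestamped tuples into the W slices G_1, ..., G_W.

    stream: iterable of ((i_1, ..., i_{M-1}, v), t) with 0-based categorical indices.
    dims:   (N_1, ..., N_{M-1}); T: period; W: number of time-mode indices.
    Returns a dense array of shape dims + (W,); real code would keep it sparse.
    """
    X = np.zeros(dims + (W,))
    for (*idx, v), t in stream:
        w = int(np.ceil(t / T))        # the tuple falls into G_w, since t in (wT - T, wT]
        if 1 <= w <= W:                # keep only periods that fit in the window
            X[tuple(idx) + (w - 1,)] += v
    return X
```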

Additionally, in many previous studies on window-based tensor analysis [22, 23, 24, 26], the oldest $(M-1)$-mode tensor is removed from $\bm{\mathcal{X}}$ once per period $T$ to fix the number of indices in the time mode to $W$. That is, at time $t=W'T$ where $W'\in\{W+1,W+2,\cdots\}$, $\bm{\mathcal{X}}\equiv\bm{\mathcal{G}}_{W'-W+1}~||~\cdots~||~\bm{\mathcal{G}}_{W'-1}~||~\bm{\mathcal{G}}_{W'}$.

A serious limitation of this widely-used tensor model is that the tensor 𝓧\bm{\mathcal{X}} changes only once per period TT while the input multi-aspect data stream grows continuously. Thus, it is impossible to analyze multi-aspect data streams continuously in real time in response to the arrival of each new tuple.

Problem Definition: How can we continuously analyze multi-aspect data streams using tensor-analysis tools, specifically, CPD? We aim to answer this question, as stated in Problem 1.

Problem 1 (Continuous CP Decomposition).

(1) Given: a multi-aspect data stream, (2) to update: its CP decomposition instantly in response to each new tuple in the stream, (3) without having to wait for the current period to end.

Note that, as discussed in Section I and shown in Fig. 1, an extremely short period $T$ is not a proper solution since it drastically decreases the fitness of CPD while greatly increasing the number of parameters.

IV Proposed Data Model and Implementation

In this section, we propose the continuous tensor model and its efficient implementation. This data model is one component of SliceNStitch. See Section V for the other component.

IV-A Proposed Data Model: Continuous Tensor Model

We first define several terms on which our model is based.

Definition 2 (Tensor Slice).

Given a multi-aspect data stream (Definition 1), for each timestamped $M$-tuple $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$, we define the tensor slice $\bm{\mathcal{Z}}_{n}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}}$ as an $(M-1)$-mode tensor whose $(i_{1},\cdots,i_{M-1})$-th entry is $v_{n}$ and whose other entries are zero.

Definition 3 (Tensor Unit).

Given a multi-aspect data stream, a time $t$, and the period $T$, we define the tensor unit $\bm{\mathcal{Y}}_{t}$ as

$$\bm{\mathcal{Y}}_{t}\equiv\sum_{t_{n}\in(t-T,t]}\bm{\mathcal{Z}}_{n}.$$

That is, $\bm{\mathcal{Y}}_{t}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}}$ is an aggregation of the tuples that occurred within the half-open interval $(t-T,t]$.

Definition 4 (Tensor Window).

Given a multi-aspect data stream, a time $t$, the period $T$, and the number of time-mode indices $W$, we define the tensor window $\bm{\mathcal{D}}(t,W)$ as

$$\bm{\mathcal{D}}(t,W)\equiv\bm{\mathcal{Y}}_{t-(W-1)T}~||~\cdots~||~\bm{\mathcal{Y}}_{t-T}~||~\bm{\mathcal{Y}}_{t},$$

where $||$ denotes concatenation.

That is, $\bm{\mathcal{D}}(t,W)\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ concatenates the $W$ latest tensor units.
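A minimal sketch of Definitions 3-4, materializing $\bm{\mathcal{D}}(t,W)$ directly from the raw tuples (dense array and 0-based categorical indices assumed for brevity; tensor_window is an illustrative name):

```python
import numpy as np

def tensor_window(events, dims, t, T, W):
    """Build D(t, W): the concatenation of the W latest tensor units.

    events: list of ((i_1, ..., i_{M-1}, v), t_n) tuples with 0-based indices.
    The unit with 0-based time-mode index w (w = W-1 is the newest) covers the
    half-open interval (t - (W - w)T, t - (W - 1 - w)T].
    """
    D = np.zeros(dims + (W,))
    for (*idx, v), t_n in events:
        age = t - t_n                         # how long ago the tuple occurred
        if 0 <= age < W * T:
            w = W - 1 - int(age // T)         # tensor unit Y_{t - kT} with k = age // T
            D[tuple(idx) + (w,)] += v
    return D
```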

Figure 2: Example of the continuous tensor model. The starting points of tensor units (whose length is an hour) change adaptively based on the current time (which is 3:00:01).

The main idea of the continuous tensor model is to adaptively adjust the starting point (or equally the end point) of each tensor unit based on the current time, as described in Definition 5 and Fig. 2.

Definition 5 (Continuous Tensor Model).

In the continuous tensor model, given a multi-aspect data stream, the period $T$, and the number of time-mode indices $W$, the modeled tensor evolves from $\bm{\mathcal{D}}(t-dt,W)$ to $\bm{\mathcal{D}}(t,W)$ at each time $t$, where $dt$ represents the infinitesimal change in time.

Note that in the continuous tensor model, the modeled tensor changes ‘continuously’, i.e., once per minimum time unit of the input multi-aspect data stream (e.g., a millisecond), while the modeled tensor changes only once per period $T$ in the typical models discussed in Section III.

IV-B Event-driven Implementation of Continuous Tensor Model

In the continuous tensor model, it is crucial to efficiently update the modeled tensor, i.e., to efficiently compute the change in $\bm{\mathcal{D}}(t,W)$, because repeatedly rebuilding $\bm{\mathcal{D}}(t,W)$ from scratch at every time $t$ is computationally prohibitive.

We propose an efficient event-driven implementation of the continuous tensor model. Let $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ for simplicity. Our implementation, described in Algorithm 1, is based on the observation that each tuple $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ in the input multi-aspect data stream causes the following events:

  1. S.1: At time $t=t_{n}$, the value $v_{n}$ is added to $x_{i_{1},\cdots,i_{M-1},W}$.

  2. S.2: At time $t=t_{n}+wT$ for each $w\in\{1,\cdots,W-1\}$, the value $v_{n}$ is subtracted from $x_{i_{1},\cdots,i_{M-1},W-w+1}$ and then added to $x_{i_{1},\cdots,i_{M-1},W-w}$.

  3. S.3: At time $t=t_{n}+WT$, the value $v_{n}$ is subtracted from $x_{i_{1},\cdots,i_{M-1},1}$.

As formalized in Theorem 1, our implementation maintains the modeled tensor $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ up-to-date at each time $t$ by performing $O(MW)$ operations per tuple in the input stream. Note that $M$ and $W$ are usually small numbers, and if we regard them as constants, the time complexity becomes $O(1)$, i.e., processing each tuple takes constant time.

Theorem 1 (Time Complexity of the Continuous Tensor Model).

In Algorithm 1, the time complexity of processing each timestamped $M$-tuple is $O(MW)$.

Proof.

For each timestamped $M$-tuple, $W+1$ events occur. Processing an event (lines 4-5 or lines 7-10 of Algorithm 1) takes $O(M)$ time. ∎

Theorem 2 (Space Complexity of the Continuous Tensor Model).

In Algorithm 1, the space complexity is

$$O\Big(M\cdot\max_{t\in\mathbb{R}}|\{n\in\mathbb{N}:t_{n}\in(t-WT,t]\}|\Big).$$
Proof.

We call $S_{t}\equiv\{n\in\mathbb{N}:t_{n}\in(t-WT,t]\}$ the set of active tuples at time $t$. The number of non-zeros in $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ is upper bounded by $|S_{t}|$, and thus the space required for $\bm{\mathcal{X}}$ is $O(M\cdot|S_{t}|)$. Since at most one event is scheduled for each active tuple, the space required for storing all scheduled events is also $O(M\cdot|S_{t}|)$. ∎

Input: (1) a multi-aspect data stream, (2) period $T$,
   (3) number of indices in the time mode $W$
Output: up-to-date tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$

1   initialize $\bm{\mathcal{X}}$ to a zero tensor $\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$
2   wait until an event $E$ occurs
3   if $E=$ arrival of $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ then
4       add $v_{n}$ to $x_{i_{1},\cdots,i_{M-1},W}$
5       schedule the $1$-st update for $(e_{n},t_{n})$ at time $t_{n}+T$
6   if $E=$ $w$-th update for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ then
7       subtract $v_{n}$ from $x_{i_{1},\cdots,i_{M-1},W-w+1}$
8       if $w<W$ then
9           add $v_{n}$ to $x_{i_{1},\cdots,i_{M-1},W-w}$
10          schedule the $(w+1)$-th update for $(e_{n},t_{n})$ at time $t_{n}+(w+1)T$
11  goto line 2
Algorithm 1: Event-driven Implementation of the Continuous Tensor Model
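Below is a minimal Python sketch of Algorithm 1 that keeps the window as a sparse dictionary and schedules the unit-boundary crossings S.2-S.3 in a min-heap; the class and method names are illustrative and this is not the authors' C++ implementation.

```python
import heapq
from collections import defaultdict

class ContinuousWindow:
    """Event-driven maintenance of the tensor window X = D(t, W) (a sketch of Alg. 1).

    X is a sparse dict mapping (i_1, ..., i_{M-1}, w) -> value, with 0-based
    time-mode index w; w = W-1 is the newest tensor unit.
    """
    def __init__(self, T, W):
        self.T, self.W = T, W
        self.X = defaultdict(float)
        self.schedule = []                 # min-heap of (fire_time, w, t_n, idx, v)

    def advance(self, t):
        """Apply every scheduled unit-boundary crossing with fire_time <= t (S.2, S.3)."""
        while self.schedule and self.schedule[0][0] <= t:
            _, w, t_n, idx, v = heapq.heappop(self.schedule)
            self.X[idx + (self.W - w,)] -= v          # leave the slot it was in
            if w < self.W:
                self.X[idx + (self.W - w - 1,)] += v  # enter the next older slot
                heapq.heappush(self.schedule,
                               (t_n + (w + 1) * self.T, w + 1, t_n, idx, v))

    def arrive(self, idx, v, t_n):
        """S.1: a new tuple enters the newest unit and schedules its first crossing."""
        self.advance(t_n)
        self.X[idx + (self.W - 1,)] += v
        heapq.heappush(self.schedule, (t_n + self.T, 1, t_n, idx, v))
```

Each arriving tuple triggers $O(W)$ scheduled crossings over its lifetime, and each crossing touches $O(M)$ indices, matching the $O(MW)$ bound of Theorem 1.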

V Proposed Optimization Algorithms

In this section, we present the other part of SliceNStitch. We propose a family of online optimization algorithms for CPD of sparse tensors in the continuous tensor model. As stated in Problem 2, they aim to update factor matrices fast and accurately in response to each change in the tensor window.

Problem 2 (Online CP Decomposition of Sparse Tensors in the Continuous Tensor Model).

(1) Given:

  • the current tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$,

  • factor matrices $\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}$ for $\bm{\mathcal{X}}$,

  • an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$,

(2) Update: the factor matrices in response to the event,
(3) To Solve: the minimization problem in Eq. (2).

Below, we use $\Delta\bm{\mathcal{X}}$ to denote the change in $\bm{\mathcal{X}}$ due to the given event. By S.1-S.3 of Section IV-B, Definition 6 follows.

Definition 6 (Input Change).

The input change $\Delta\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ is defined as the change in $\bm{\mathcal{X}}$ due to an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$, i.e.,

  • [If $t=t_{n}$] The $(i_{1},\cdots,i_{M-1},W)$-th entry of $\Delta\bm{\mathcal{X}}$ is $v_{n}$, and the other entries are zero.

  • [If $t=t_{n}+wT$ for $1\leq w<W$] The $(i_{1},\cdots,i_{M-1},W-w)$-th and $(i_{1},\cdots,i_{M-1},W-w+1)$-th entries of $\Delta\bm{\mathcal{X}}$ are $v_{n}$ and $-v_{n}$, respectively, and the others are zero.

  • [If $t=t_{n}+WT$] The $(i_{1},\cdots,i_{M-1},1)$-th entry of $\Delta\bm{\mathcal{X}}$ is $-v_{n}$, and the others are zero.

We first introduce SliceNStitch-Matrix (SNSmat), which naively applies ALS to Problem 2. Then, we present SliceNStitch-Vector (SNSvec) and SliceNStitch-Random (SNSrnd). Lastly, we propose our main methods, SNS+vec and SNS+rnd.

V-A SliceNStitch-Matrix (SNSmat)

When we apply ALS to Problem 2, the factor matrices for the current window $\bm{\mathcal{X}}$ are strong initial points. Thus, a single iteration of ALS is enough to achieve high fitness. The detailed procedure of SNSmat is given in Algorithm 2. In line 6 of Algorithm 2, we normalize¹ the columns of each updated factor matrix to balance the scales of the factor matrices.

¹ Let the $r$-th entry of $\lambda\in\mathbb{R}^{R}$ be $\lambda_{r}$ and the $r$-th column of $\bm{A}^{(m)}$ be $\bm{a}^{(m)}_{r}$. We set $\lambda_{r}$ to $\|\bm{a}^{(m)}_{r}\|_{2}$ and set $\bm{\bar{a}}^{(m)}_{r}$ to $\bm{a}^{(m)}_{r}/\lambda_{r}$ for $r=1,\cdots,R$. Then, $\bm{\mathcal{X}}$ is approximated as $\sum_{r=1}^{R}\lambda_{r}\bm{\bar{a}}^{(1)}_{r}\circ\bm{\bar{a}}^{(2)}_{r}\circ\cdots\circ\bm{\bar{a}}^{(M)}_{r}$.

Input: (1) current tensor window $\bm{\mathcal{X}}$, (2) change $\Delta\bm{\mathcal{X}}$,
   (3) column-normalized factor matrices $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$,
   (4) $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$
Output: updated $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, and $\lambda$

1   for $m=1,\cdots,M$ do
2       $\bm{U}\leftarrow(\bm{X}+\Delta\bm{X})_{(m)}(\odot_{n\neq m}^{M}\bm{\bar{A}}^{(n)})$
3       $\bm{H}\leftarrow\ast_{n\neq m}^{M}{\bm{\bar{A}}^{(n)}}'\bm{\bar{A}}^{(n)}$
4       $\bm{A}^{(m)}\leftarrow\bm{U}\bm{H}^{\dagger}$   // by Eq. (4)
5       $\lambda\leftarrow$ $\ell^{2}$ norms of the columns of $\bm{A}^{(m)}$   // $\lambda\in\mathbb{R}^{R}$
6       $\bm{\bar{A}}^{(m)}\leftarrow$ column normalization of $\bm{A}^{(m)}$
7       update ${\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}$
8   return $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, and $\lambda$
Algorithm 2: SNSmat: Naive Extension of ALS

Pros and Cons: For each event, SNSmat accesses every entry of the current tensor window and updates every row of the factor matrices. Thus, it suffers from high computational cost, as formalized in Theorem 3, while it maintains a high-quality solution (i.e., factor matrices).

Theorem 3 (Time complexity of SNSmat).

Let $N_{M}=W$. Then, the time complexity of SNSmat is

$$O\Big(M^{2}R|\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}|+M^{2}R^{2}+MR^{3}+\sum_{m=1}^{M}N_{m}R^{2}\Big).\quad(5)$$
Proof.

See Section II.A of the online appendix [20]. ∎

Input: (1) current tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$,
   (2) change $\Delta\bm{\mathcal{X}}$ due to an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$,
   (3) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$,
   (4) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$, (5) period $T$
Output: updated $\{\bm{A}^{(m)}\}_{m=1}^{M}$ and $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$

1   $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}\leftarrow\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$   // used only in SNSrnd and SNS+rnd
2   $w\leftarrow(t-t_{n})/T$   // time-mode offset
3   if $w>0$ then
4       updateRow($M$, $W-w+1$, $\cdots$)   // Alg. 4 or 5
5   if $w<W$ then
6       updateRow($M$, $W-w$, $\cdots$)   // Alg. 4 or 5
7   for $m\leftarrow 1,\cdots,M-1$ do
8       updateRow($m$, $i_{m}$, $\cdots$)   // Alg. 4 or 5
9   return $\{\bm{A}^{(m)}\}_{m=1}^{M}$ and $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$
Algorithm 3: Common Outline of SNSvec, SNS+vec, SNSrnd, and SNS+rnd
Figure 3: Example of updating a row of $\bm{A}^{(1)}$. SNSvec and SNSrnd update $\bm{A}^{(1)}(i_{1},:)$ at once. SNS+vec and SNS+rnd update it entry by entry. SNSrnd and SNS+rnd sample $\theta$ entries from those in $\bm{\mathcal{X}}(i_{1},:,:)$ if more than $\theta$ entries there are non-zero (i.e., if $deg(1,i_{1})>\theta$).

V-B SliceNStitch-Vector (SNSvec)

We propose SNSvec, a fast algorithm for Problem 2. The outline of SNSvec is given in Algorithm 3, and the update rules are given in Algorithm 4, with a running example in Fig. 3. The key idea of SNSvec is to update only the rows of the factor matrices that approximate changed entries of the tensor window. Starting from the maintained factor matrices, SNSvec updates such rows of the time-mode factor matrix (lines 3-6 of Algorithm 3) and then such rows of the non-time mode factor matrices (lines 7-8). Below, we describe the update rules used.

Time Mode: Real-world tensors are typically modeled so that the time mode of $\bm{\mathcal{X}}$ has fewer indices than the other modes. Thus, each tensor unit (see Definition 3) in $\bm{\mathcal{X}}$ is likely to contain many non-zeros, and even updating only a few rows of the time-mode factor matrix (i.e., $\bm{A}^{(M)}$) is likely to incur considerable computational cost. To avoid this cost, SNSvec employs an approximate update rule.

From Eq. (4), the following update rule for the time-mode factor matrix follows:

$$\bm{A}^{(M)}\leftarrow(\bm{X}+\Delta\bm{X})_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger},\quad(6)$$

where $\bm{K}^{(M)}=\odot_{m=1}^{M-1}\bm{A}^{(m)}$ and $\bm{H}^{(M)}=\ast_{m=1}^{M-1}{\bm{A}^{(m)}}'\bm{A}^{(m)}$. If we assume that the approximated tensor $\widetilde{\bm{\mathcal{X}}}$ in Eq. (1) approximates $\bm{\mathcal{X}}$ well, then Eq. (6) is approximated by Eq. (7).

$$\bm{A}^{(M)}\leftarrow\bm{A}^{(M)}{\bm{K}^{(M)}}'\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}+\Delta\bm{X}_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}.\quad(7)$$

By a property of the Khatri-Rao product [19], Eq. (8) holds.

$${\bm{K}^{(M)}}'\bm{K}^{(M)}=(\odot_{m=1}^{M-1}\bm{A}^{(m)})'(\odot_{m=1}^{M-1}\bm{A}^{(m)})=\ast_{m=1}^{M-1}{\bm{A}^{(m)}}'\bm{A}^{(m)}=\bm{H}^{(M)}.\quad(8)$$

By Eq. (8), Eq. (7) reduces to Eq. (9).

$$\bm{A}^{(M)}\leftarrow\bm{A}^{(M)}+\Delta\bm{X}_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}.\quad(9)$$

Computing Eq. (9) is much cheaper than computing Eq. (6). Since $\Delta\bm{X}_{(M)}$ contains at most two non-zeros (see Problem 2), computing $\Delta\bm{X}_{(M)}\bm{K}^{(M)}$ in Eq. (9) takes $O(MR)$ time. For the same reason, at most two rows of $\bm{A}^{(M)}$ are updated.
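A minimal NumPy sketch of the approximate time-mode update in Eq. (9), assuming the Gram matrices ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ are maintained alongside the factors (function and argument names are illustrative):

```python
import numpy as np
from functools import reduce

def update_time_mode(factors, grams, delta):
    """Approximate time-mode update (Eq. (9)): add ΔX_(M) K^(M) (H^(M))^† to A^(M).

    factors: list of M factor matrices; grams[m] = A^(m)' A^(m) (maintained).
    delta:   dict mapping a full index (i_1, ..., i_{M-1}, w) -> changed value
             (at most two entries, by Definition 6).
    """
    M = len(factors)
    H = reduce(np.multiply, grams[:M - 1])            # Hadamard product of non-time Grams
    H_pinv = np.linalg.pinv(H)
    for index, value in delta.items():
        *nontime, w = index
        # the (i_1, ..., i_{M-1})-th row of K^(M) is the elementwise product of rows
        krow = reduce(np.multiply,
                      [factors[m][i, :] for m, i in enumerate(nontime)])
        factors[M - 1][w, :] += value * (krow @ H_pinv)
    return factors
```

After this row update, the maintained Gram matrix of the time mode would be refreshed via Eq. (13).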

Non-time Modes: When updating $\bm{A}^{(m)}$ while fixing the other factor matrices, the objective in Eq. (3) becomes Eq. (10).

$$\min_{\bm{A}^{(m)}}\Big\|(\bm{X}+\Delta\bm{X})_{(m)}-\bm{A}^{(m)}(\odot_{n\neq m}^{M}\bm{A}^{(n)})'\Big\|_{F}.\quad(10)$$

Note that $\Delta\bm{\mathcal{X}}$ contains up to two non-zeros, which are the entries changed in $\bm{\mathcal{X}}$, and their $m$-th mode indices are $i_{m}$ (see Problem 2). By Eq. (3), only the $i_{m}$-th row of $\bm{A}^{(m)}$ is used to approximate the changed entries, and thus SNSvec updates only the row.

If we fix all the other variables except $\bm{A}^{(m)}(i_{m},:)$, the problem in Eq. (10) becomes the problem in Eq. (11).

$$\min_{\bm{A}^{(m)}(i_{m},:)}\Big\|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)-\bm{A}^{(m)}(i_{m},:)(\odot_{n\neq m}^{M}\bm{A}^{(n)})'\Big\|_{F}.\quad(11)$$

The problem in Eq. (11) is a least-squares problem, and its analytical solution, given in Eq. (12), is available.

$$\bm{A}^{(m)}(i_{m},:)\leftarrow(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger},\quad(12)$$

where $\bm{K}^{(m)}=\odot_{n\neq m}^{M}\bm{A}^{(n)}$ and $\bm{H}^{(m)}=\ast_{n\neq m}^{M}{\bm{A}^{(n)}}'\bm{A}^{(n)}$.

After updating the $i_{m}$-th row of each $m$-th mode factor matrix $\bm{A}^{(m)}$ either by Eq. (9) or by Eq. (12), SNSvec incrementally maintains ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ up to date by Eq. (13).

$${\bm{A}^{(m)}}'\bm{A}^{(m)}\leftarrow{\bm{A}^{(m)}}'\bm{A}^{(m)}-\bm{p}'\bm{p}+\bm{A}^{(m)}(i_{m},:)'\bm{A}^{(m)}(i_{m},:),\quad(13)$$

where $\bm{p}$ is the $i_{m}$-th row of $\bm{A}^{(m)}$ before the update.
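The following sketch illustrates the non-time-mode row update of Eq. (12) together with the incremental Gram update of Eq. (13); it avoids materializing $\bm{K}^{(m)}$ by accumulating only over the non-zeros whose $m$-th mode index is $i_{m}$ (names are illustrative, NumPy assumed).

```python
import numpy as np
from functools import reduce

def update_nontime_row(factors, grams, nonzeros, m, i_m):
    """Row update of Eq. (12) plus the incremental Gram update of Eq. (13).

    factors:  list of M factor matrices; grams[n] = A^(n)' A^(n) (maintained).
    nonzeros: dict index -> value holding the non-zeros of X + ΔX whose
              m-th mode index equals i_m (i.e., deg(m, i_m) entries).
    """
    M, R = len(factors), factors[0].shape[1]
    H = reduce(np.multiply, [grams[n] for n in range(M) if n != m])
    rhs = np.zeros(R)
    for index, value in nonzeros.items():
        # the row of K^(m) for this non-zero = elementwise product of the other rows
        krow = reduce(np.multiply,
                      [factors[n][index[n], :] for n in range(M) if n != m])
        rhs += value * krow                    # accumulates (X + ΔX)_(m)(i_m, :) K^(m)
    p = factors[m][i_m, :].copy()              # old row, needed for Eq. (13)
    factors[m][i_m, :] = rhs @ np.linalg.pinv(H)
    new = factors[m][i_m, :]
    grams[m] += np.outer(new, new) - np.outer(p, p)   # Eq. (13)
    return factors, grams
```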

// Parenthesized inputs/outputs are for SNSrnd
Input: (1) mode $m$ and index $i_{m}$ to be updated,
   (2) current tensor window $\bm{\mathcal{X}}$, (3) change $\Delta\bm{\mathcal{X}}$,
   (4) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$ for $\bm{\mathcal{X}}$,
   (5) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$ (and $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}$),
   (6) (threshold $\theta$ for sampling)
Output: updated $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ (and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$)

// updateRow implemented in SNSvec
Procedure updateRowVec($m$, $i_{m}$, $\cdots$):
1   $\bm{p}\leftarrow\bm{A}^{(m)}(i_{m},:)$
2   if $m=M$ then update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (9)
3   else update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (12)
4   update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (13)
5   return $\bm{A}^{(m)}$ and ${\bm{A}^{(m)}}'\bm{A}^{(m)}$

// updateRow implemented in SNSrnd
Procedure updateRowRan($m$, $i_{m}$, $\cdots$):
1   $\bm{p}\leftarrow\bm{A}^{(m)}(i_{m},:)$
2   if $deg(m,i_{m})\leq\theta$ then
3       update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (12)
4   else
5       $S\leftarrow\theta$ indices of $\bm{\mathcal{X}}$ chosen uniformly at random, while fixing the $m$-th mode index to $i_{m}$
6       compute $\bm{\mathcal{\bar{X}}}$ from $S$
7       update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (16)
8   update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (13)
9   update ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ by Eq. (17)
10  return $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$
Algorithm 4: updateRow in SNSvec and SNSrnd

Pros and Cons: By updating only a few rows of each factor matrix, SNSvec significantly reduces the computational cost of SNSmat, as formalized in Theorem 4, without much loss in the quality of the solution. However, SNSvec slows down if many non-zeros share the same index (see Eq. (14)), and it is often unstable due to numerical errors, as discussed later.

Theorem 4 (Time complexity of SNSvec).

Let $deg(m,i_{m})\equiv|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)|$ be the number of non-zeros of $\bm{X}+\Delta\bm{X}$ whose $m$-th mode index is $i_{m}$. Then, the time complexity of SNSvec is

$$O\Big(MR\sum_{m=1}^{M-1}deg(m,i_{m})+(MR)^{2}+MR^{3}\Big).\quad(14)$$
Proof.

See Section II.B of the online appendix [20]. ∎

V-C SliceNStitch-Random (SNSrnd)

We introduce SNSrnd, which is even faster than SNSvec. The outline of SNSrnd (see Algorithm 3) is the same as that of SNSvec. That is, SNSrnd also updates only the rows of the factor matrices that approximate the changed entries in the current tensor window $\bm{\mathcal{X}}$. However, when updating such a row, the number of entries accessed by SNSrnd is upper bounded by a user-specific constant $\theta$, while SNSvec accesses $deg(m,i_{m})$ entries (see Theorem 4), which can be as many as all the entries in $\bm{\mathcal{X}}$. Below, we present its update rule.

Assume SNSrnd updates the $i_{m}$-th row of $\bm{A}^{(m)}$. That is, consider the problem in Eq. (11). As described in the procedure updateRowRan in Algorithm 4, SNSrnd uses different approaches depending on a user-specific threshold $\theta$ and $deg(m,i_{m})\equiv|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)|$, i.e., the number of non-zeros of $\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}$ whose $m$-th mode index is $i_{m}$. If $deg(m,i_{m})$ is smaller than or equal to $\theta$, then SNSrnd uses Eq. (12), which is also used in SNSvec.

However, if $deg(m,i_{m})$ is greater than $\theta$, SNSrnd speeds up the update through approximation. First, it samples $\theta$ indices of $\bm{\mathcal{X}}$ without replacement, while fixing the $m$-th mode index to $i_{m}$.² Let the set of sampled indices be $S$, and let $\bm{\mathcal{\bar{X}}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ be a tensor whose entries are all zero except those with the sampled indices $S$. For each sampled index $J=(j_{1},\cdots,j_{M})\in S$, $\bar{x}_{J}=x_{J}-\tilde{x}_{J}$. Note that for any index $J=(j_{1},\cdots,j_{M})$ of $\bm{\mathcal{X}}$, $\tilde{x}_{J}+\bar{x}_{J}=x_{J}$ if $J\in S$ and $\tilde{x}_{J}+\bar{x}_{J}=\tilde{x}_{J}$ otherwise. Thus, the more samples SNSrnd draws, the closer $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$ is to $\bm{\mathcal{X}}$. SNSrnd uses $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$ to approximate $\bm{\mathcal{X}}$ in the update. Specifically, it replaces $\bm{\mathcal{X}}$ in the objective function of Eq. (11) with $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$. Then, as in Eq. (12), the update rule in Eq. (15) follows.

² We ignore the indices of non-zeros in $\Delta\bm{\mathcal{X}}$ even if they are sampled.

$$\bm{A}^{(m)}(i_{m},:)\leftarrow(\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}+\Delta\bm{\mathcal{X}})_{(m)}(i_{m},:)\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger},\quad(15)$$

where $\bm{K}^{(m)}=\odot_{n\neq m}^{M}\bm{A}^{(n)}$ and $\bm{H}^{(m)}=\ast_{n\neq m}^{M}{\bm{A}^{(n)}}'\bm{A}^{(n)}$. Let $\bm{A}^{(m)}_{prev}$ be the $m$-th mode factor matrix before the update and $\bm{H}^{(m)}_{prev}$ be $\ast_{n\neq m}^{M}{\bm{A}^{(n)}_{prev}}'\bm{A}^{(n)}$. By Eq. (8), Eq. (15) is equivalent to Eq. (16).

$$\bm{A}^{(m)}(i_{m},:)\leftarrow\bm{A}^{(m)}(i_{m},:)\bm{H}^{(m)}_{prev}{\bm{H}^{(m)}}^{\dagger}+(\bm{\bar{X}}+\Delta\bm{X})_{(m)}\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger}.\quad(16)$$

Notably, $(\bm{\bar{X}}+\Delta\bm{X})_{(m)}$ has at most $\theta+2=O(\theta)$ non-zeros. SNSrnd uses Eq. (16) to update the $i_{m}$-th row of $\bm{A}^{(m)}$. It incrementally maintains ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ up to date by Eq. (13), as SNSvec does. It also maintains ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ up to date by Eq. (17).

$${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\leftarrow{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}-\bm{p}'\bm{p}+\bm{p}'\bm{A}^{(m)}(i_{m},:),\quad(17)$$

where $\bm{p}=\bm{A}^{(m)}_{prev}(i_{m},:)$.
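A rough sketch of the sampled update in Eq. (16), assuming the maintained quantities ${\bm{A}^{(n)}}'\bm{A}^{(n)}$ and ${\bm{A}^{(n)}_{prev}}'\bm{A}^{(n)}$ are available and that a hypothetical helper approx_entry returns the current CPD estimate $\tilde{x}_{J}$ of an entry; the subsequent refreshes by Eqs. (13) and (17) are omitted for brevity.

```python
import numpy as np
from functools import reduce

def sampled_row_update(factors, grams, grams_prev, sampled, delta, m, i_m, approx_entry):
    """Sampled row update of A^(m)(i_m, :) in the spirit of Eq. (16).

    sampled:      dict index -> x_J for the theta sampled non-zeros whose m-th index is i_m.
    delta:        dict index -> delta-x_J (at most two entries, Definition 6).
    approx_entry: hypothetical helper J -> x~_J, the current CPD estimate of entry J.
    grams[n] = A^(n)' A^(n); grams_prev[n] = A_prev^(n)' A^(n) (both maintained).
    """
    M, R = len(factors), factors[0].shape[1]
    others = [n for n in range(M) if n != m]
    H      = reduce(np.multiply, [grams[n] for n in others])        # H^(m)
    H_prev = reduce(np.multiply, [grams_prev[n] for n in others])   # H_prev^(m)
    H_pinv = np.linalg.pinv(H)

    def krow(index):
        # the row of K^(m) matching this entry: elementwise product of the other rows
        return reduce(np.multiply, [factors[n][index[n], :] for n in others])

    rhs = np.zeros(R)
    for index, x in sampled.items():
        rhs += (x - approx_entry(index)) * krow(index)   # xbar_J = x_J - x~_J
    for index, dx in delta.items():
        rhs += dx * krow(index)
    # Eq. (16): old row times H_prev H^dagger, plus the sampled correction term
    factors[m][i_m, :] = factors[m][i_m, :] @ H_prev @ H_pinv + rhs @ H_pinv
    return factors
```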

Pros and Cons: Through approximation, SNSrnd upper bounds the number of non-zeros of $(\bm{\bar{X}}+\Delta\bm{X})_{(m)}$ in Eq. (16) by $O(\theta)$. As a result, the time complexity of SNSrnd, given in Theorem 5, is lower than that of SNSvec. Specifically, $deg(m,i_{m})$ in Eq. (14), which can be as large as $|\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}|$, is replaced with the user-specific constant $\theta$ in Eq. (18). Notably, if we regard $M$, $R$, and $\theta$ in Eq. (18) as constants, the time complexity of SNSrnd becomes constant. This makes SNSrnd significantly faster than SNSvec, at the expense of a slight reduction in the quality of the solution.

Theorem 5 (Time complexity of SNSrnd).

If $\theta>1$, then the time complexity of SNSrnd is

$$O\big(M^{2}R\theta+M^{2}R^{2}+MR^{3}\big).\quad(18)$$

If $M$, $R$, and $\theta$ are regarded as constants, Eq. (18) is $O(1)$.

Proof.

See Section II.C of the online appendix [20]. ∎

Unlike SNSmat, SNSvec and SNSrnd do not normalize the columns of factor matrices during the update process. This is because normalization requires $O(R\sum_{m=1}^{M}N_{m})$ time, which is proportional to the number of all entries in all factor matrices, and thus significantly increases the time complexity of SNSvec and SNSrnd. However, without normalization, the entries of factor matrices may have extremely large or extremely small absolute values, making SNSvec and SNSrnd vulnerable to numerical errors. In our experiments (see Fig. 4 in Section VI-C), the accuracies of SNSvec and SNSrnd suddenly drop due to numerical errors in some datasets.

V-D SliceNStitch-Stable (SNS+vec and SNS+rnd)

In this section, we propose SNS+vec and SNS+rnd, which successfully address the aforementioned instability of SNSvec and SNSrnd. The main idea is to clip each entry (i.e., ensure that each entry stays within a predefined range) while at the same time ensuring that the objective function does not increase. To this end, SNS+vec and SNS+rnd employ coordinate descent, where the entries of the factor matrices are updated one by one. The outline of SNS+vec and SNS+rnd (see Algorithm 3) is the same as that of SNSvec and SNSrnd. Below, we present their update rules, which are used in Algorithm 5.

Coordinate descent updates one variable (i.e., one entry of a factor matrix) at a time while fixing all the other variables. Assume an entry $a^{(m)}_{i_{m}k}$ of $\bm{A}^{(m)}$ is updated. Solving the problem in Eq. (2) with respect to $a^{(m)}_{i_{m}k}$ while fixing the other variables is equivalent to solving the problem in Eq. (19).

$$\min_{a^{(m)}_{i_{m}k}}\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big(x_{J}+\Delta x_{J}-\sum_{r\neq k}^{R}\prod_{n=1}^{M}a_{j_{n}r}^{(n)}-a_{i_{m}k}^{(m)}\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)^{2},\quad(19)$$

where $J=(j_{1},\cdots,j_{M})$, and $\Omega^{(m)}_{i_{m}}$ is the set of indices of $\bm{\mathcal{X}}$ of which the $m$-th mode index is $i_{m}$. To describe its solution, we first define the following terms:

$$c_{k}^{(m)}\equiv\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}(a_{j_{n}k}^{(n)})^{2}\Big),$$
$$d_{i_{m}k}^{(m)}\equiv\sum_{r\neq k}^{R}\Big(a_{i_{m}r}^{(m)}\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}a_{j_{n}r}^{(n)}a_{j_{n}k}^{(n)}\Big)\Big),\quad(20)$$
$$e_{i_{m}k}^{(m)}\equiv\sum_{r=1}^{R}\Big(b^{(m)}_{i_{m}r}\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}b_{j_{n}r}^{(n)}a_{j_{n}k}^{(n)}\Big)\Big),$$

where $\bm{B}^{(m)}\equiv\bm{A}^{(m)}_{prev}$ is $\bm{A}^{(m)}$ before any update.

Solving the Problem in Eq. (19): The problem in Eq. (19) is a least-squares problem, and thus there exists a closed-form solution, which is used to update $a_{i_{m}k}^{(m)}$ in Eq. (21).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big((x_{J}+\Delta x_{J})\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(21)$$

Eq. (21) is used, instead of Eq. (12), when updating non-time mode factor matrices (i.e., when $m\neq M$) in SNS+vec. It is also used, instead of Eq. (12), in SNS+rnd when $deg(m,i_{m})\leq\theta$. As in SNSvec, when updating the time-mode factor matrix, SNS+vec approximates $\bm{\mathcal{X}}$ by $\widetilde{\bm{\mathcal{X}}}$, and thus it uses Eq. (22).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(e_{i_{m}k}^{(m)}+\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big(\Delta x_{J}\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(22)$$

Similarly, as in SNSrnd, when $deg(m,i_{m})>\theta$, SNS+rnd approximates $\bm{\mathcal{X}}$ by $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$, and thus it uses Eq. (23).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(e_{i_{m}k}^{(m)}+\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big((\bar{x}_{J}+\Delta x_{J})\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(23)$$

Note that Eq. (21), Eq. (22), and Eq. (23) are all based on Eq. (20). For the rapid computation of Eq. (20), SNS+vec and SNS+rnd incrementally maintain $\sum_{j_{m}=1}^{N_{m}}(a_{j_{m}k}^{(m)})^{2}$ and $\sum_{j_{m}=1}^{N_{m}}a_{j_{m}r}^{(m)}a_{j_{m}k}^{(m)}$, which are the $(k,k)$-th and $(r,k)$-th entries of ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, by Eq. (24) and Eq. (25). SNS+rnd also incrementally maintains $\sum_{j_{m}=1}^{N_{m}}b_{j_{m}r}^{(m)}a_{j_{m}k}^{(m)}$, which is the $(r,k)$-th entry of ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$, by Eq. (26).

$$q^{(m)}_{kk}\leftarrow q^{(m)}_{kk}-(b_{i_{m}k}^{(m)})^{2}+(a_{i_{m}k}^{(m)})^{2},\quad(24)$$
$$q^{(m)}_{rk}\leftarrow q^{(m)}_{rk}-a_{i_{m}r}^{(m)}b_{i_{m}k}^{(m)}+a_{i_{m}r}^{(m)}a_{i_{m}k}^{(m)},\quad(25)$$
$$u^{(m)}_{rk}\leftarrow u^{(m)}_{rk}-b_{i_{m}r}^{(m)}b_{i_{m}k}^{(m)}+b_{i_{m}r}^{(m)}a_{i_{m}k}^{(m)},\quad(26)$$

where $\bm{Q}^{(m)}\equiv{\bm{A}^{(m)}}'\bm{A}^{(m)}$ and $\bm{U}^{(m)}\equiv{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$. Proofs of Eqs. (21)-(26) can be found in the online appendix [20].

Clipping: To prevent the entries of the factor matrices from having extremely large or small absolute values, SNS+vec and SNS+rnd ensure that each absolute value is at most $\eta$, a user-specific threshold. Specifically, if an updated entry is greater than $\eta$, it is set to $\eta$, and if it is smaller than $-\eta$, it is set to $-\eta$. Eq. (21) followed by clipping never increases the objective function in Eq. (19).³ ⁴

³ Let $x$, $y$, and $z$ be $a^{(m)}_{i_{m}k}$ before the update, after being updated by Eq. (21), and after being clipped, respectively. The objective function in Eq. (19) is convex in $a^{(m)}_{i_{m}k}$, minimized at $y$, and symmetric around $y$, and $|y-z|\leq|y-x|$ holds.

⁴ For Eq. (22) and Eq. (23), this holds only when $\bm{\mathcal{X}}$ is well approximated.
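A minimal sketch of the entry-by-entry update of Eq. (21) with clipping, for a non-time-mode row (Eqs. (22)-(23) differ only in the $e$ term and in using $\bar{x}_{J}$); the Gram matrix is refreshed per entry in the spirit of Eqs. (24)-(25), and all names are illustrative.

```python
import numpy as np

def coordinate_update_row(factors, grams, nonzeros, m, i_m, eta):
    """Entry-by-entry update of A^(m)(i_m, :) with clipping, following Eq. (21).

    nonzeros: dict index -> (x_J + delta-x_J) over the set Omega^(m)_{i_m}, i.e.,
              the non-zeros of X + dX whose m-th mode index is i_m.
    eta:      clipping threshold; every updated entry stays within [-eta, eta].
    grams[n] = A^(n)' A^(n); grams[m] is refreshed per entry so it stays exact.
    """
    M, R = len(factors), factors[0].shape[1]
    others = [n for n in range(M) if n != m]
    for k in range(R):
        c = np.prod([grams[n][k, k] for n in others])                    # c_k^(m)
        d = sum(factors[m][i_m, r] * np.prod([grams[n][r, k] for n in others])
                for r in range(R) if r != k)                             # d_{i_m k}^(m)
        s = sum(v * np.prod([factors[n][index[n], k] for n in others])
                for index, v in nonzeros.items())
        old = factors[m][i_m, k]
        new = float(np.clip((s - d) / c, -eta, eta))                     # Eq. (21) + clip
        factors[m][i_m, k] = new
        grams[m][k, k] += new ** 2 - old ** 2                            # cf. Eq. (24)
        for r in range(R):
            if r != k:
                corr = factors[m][i_m, r] * (new - old)                  # cf. Eq. (25)
                grams[m][r, k] += corr
                grams[m][k, r] += corr
    return factors, grams
```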

// Parenthesized inputs/outputs are for SNS+rnd
Input: (1) mode $m$ and index $i_{m}$ to be updated,
   (2) current tensor window $\bm{\mathcal{X}}$, (3) change $\Delta\bm{\mathcal{X}}$,
   (4) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$ for $\bm{\mathcal{X}}$,
   (5) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$ (and $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}$),
   (6) $\eta$ for clipping (and threshold $\theta$ for sampling)
Output: updated $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ (and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$)

// updateRow implemented in SNS+vec
Procedure updateRowVec+($m$, $i_{m}$, $\cdots$):
1   for $k=1,\cdots,R$ do
2       if $m=M$ then update $a_{i_{m}k}^{(m)}$ by Eq. (22)
3       else update $a_{i_{m}k}^{(m)}$ by Eq. (21)
4       if $|a_{i_{m}k}^{(m)}|>\eta$ then $a_{i_{m}k}^{(m)}\leftarrow sign(a_{i_{m}k}^{(m)})\cdot\eta$
5       update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (24) and Eq. (25)
6   return $\bm{A}^{(m)}$ and ${\bm{A}^{(m)}}'\bm{A}^{(m)}$

// updateRow implemented in SNS+rnd
Procedure updateRowRan+($m$, $i_{m}$, $\cdots$):
1   if $deg(m,i_{m})>\theta$ then
2       $S\leftarrow\theta$ indices of $\bm{\mathcal{X}}$ chosen uniformly at random, while fixing the $m$-th mode index to $i_{m}$
3       compute $\bm{\mathcal{\bar{X}}}$ from $S$
4   for $k=1,\cdots,R$ do
5       if $deg(m,i_{m})\leq\theta$ then update $a_{i_{m}k}^{(m)}$ by Eq. (21)
6       else update $a_{i_{m}k}^{(m)}$ by Eq. (23)
7       if $|a_{i_{m}k}^{(m)}|>\eta$ then $a_{i_{m}k}^{(m)}\leftarrow sign(a_{i_{m}k}^{(m)})\cdot\eta$
8       update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (24) and Eq. (25)
9       update ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ by Eq. (26)
10  return $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$
Algorithm 5: updateRow in SNS+vec and SNS+rnd

Pros and Cons: SNS+vec and SNS+rnd do not suffer from the instability due to numerical errors that SNSvec and SNSrnd suffer from. Moreover, as shown in Theorems 6 and 7, the time complexities of SNS+vec and SNS+rnd are lower than those of SNSvec and SNSrnd, respectively. Empirically, however, SNS+vec and SNS+rnd are slightly slower and less accurate than SNSvec and SNSrnd, respectively (see Section VI-C).

Theorem 6 (Time complexity of SNS+vec).

The time complexity of SNS+vec is

$$O\Big(MR\sum_{m=1}^{M-1}deg(m,i_{m})+M^{2}R^{2}\Big).\quad(27)$$
Proof.

See Section II.E of the online appendix [20]. ∎

TABLE II: Summary of real-world sparse tensor datasets. All links are at https://github.com/DMLab-Tensor/SliceNStitch#datasets.
Name            Description                                               Size                                  # Non-zeros   Density
Divvy Bikes     sources × destinations × timestamps [minutes]             $673\times 673\times 525594$          $3.82$M       $1.604\times 10^{-5}$
Chicago Crime   communities × crime types × timestamps [hours]            $77\times 32\times 148464$            $5.33$M       $1.457\times 10^{-2}$
New York Taxi   sources × destinations × timestamps [seconds]             $265\times 265\times 5184000$         $84.39$M      $2.318\times 10^{-4}$
Ride Austin     sources × destinations × colors × timestamps [minutes]    $219\times 219\times 24\times 285136$ $0.89$M       $2.739\times 10^{-6}$
Theorem 7 (Time complexity of SNS+rnd).

If $\theta>1$, then the time complexity of SNS+rnd is

$$O\big(M^{2}R\theta+M^{2}R^{2}\big).\quad(28)$$

If $M$, $R$, and $\theta$ are regarded as constants, Eq. (28) is $O(1)$.

Proof.

See Section II.F of the online appendix [20]. ∎

VI Experiments

In this section, we design and review experiments to answer the following questions:

  • Q1. Advantages of Continuous CP Decomposition: What are the advantages of continuous CP decomposition over conventional CP decomposition?

  • Q2. Speed and Fitness: How rapidly and precisely does SliceNStitch fit the input tensor, compared to baselines?

  • Q3. Data Scalability: How does SliceNStitch scale with regard to the number of events?

  • Q4. Effect of Parameters: How do the user-specific thresholds $\theta$ and $\eta$ affect the performance of SliceNStitch?

  • Q5. Practitioner’s Guide: Which version of SliceNStitch should we use?

  • Q6. Application to Anomaly Detection: Can SliceNStitch spot abnormal events rapidly and accurately?

VI-A Experiment Specifications

Machine: We ran all experiments on a machine with a 3.7GHz Intel i5-9600K CPU and 64GB memory.

Datasets: We used four different real-world sparse tensor datasets, summarized in Table II. They are sparse tensors with a time mode, and their densities vary from $10^{-2}$ to $10^{-6}$.

Evaluation Metrics: We evaluated SliceNStitch and baselines using the following metrics:

  • Elapsed Time per Update: The average elapsed time for updating the factor matrices in response to each event.

  • Fitness (The higher the better): Fitness is a widely-used metric to evaluate the accuracy of tensor decomposition algorithms. It is defined as $1-(\|\widetilde{\bm{\mathcal{X}}}-\bm{\mathcal{X}}\|_{F}/\|\bm{\mathcal{X}}\|_{F})$, where $\bm{\mathcal{X}}$ is the input tensor and $\widetilde{\bm{\mathcal{X}}}$ (Eq. (1)) is its approximation (see the sketch after this list).

  • Relative Fitness [16] (The higher the better): Relative fitness is defined as the ratio between the fitness of the target algorithm and the fitness of ALS, i.e.,

    $$Relative\,Fitness\equiv\frac{Fitness_{target}}{Fitness_{ALS}}.$$

    Recall that ALS (see Section II) is the standard batch algorithm for tensor decomposition.
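For reference, the two accuracy metrics can be computed as follows (a small NumPy sketch; the Frobenius norm of an array is given by np.linalg.norm).

```python
import numpy as np

def fitness(X, X_approx):
    """Fitness = 1 - ||X_approx - X||_F / ||X||_F (higher is better)."""
    return 1.0 - np.linalg.norm(X_approx - X) / np.linalg.norm(X)

def relative_fitness(X, X_target, X_als):
    """Relative fitness: fitness of the target algorithm divided by that of ALS."""
    return fitness(X, X_target) / fitness(X, X_als)
```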

Baselines: Since there is no previous algorithm for continuous CPD, we compared SliceNStitch with ALS, OnlineSCP [16], CP-stream [15], and NeCPD($n$) with $n$ iterations [28], all of which are for conventional CPD (see Section VII). All baselines except ALS update factor matrices once per period $T$ (instead of whenever an event occurs).⁵ We implemented SliceNStitch and ALS in C++. We used the official implementation of OnlineSCP in MATLAB and that of CP-stream in C++.⁶ We implemented NeCPD in MATLAB.

⁵ We modified the baselines, which are for decomposing the entire tensor, to decompose the tensor window (see Definition 4), as SliceNStitch does.

⁶ https://shuozhou.github.io, https://github.com/ShadenSmith/splatt-stream

Experimental Setup: We set the hyperparameters as listed in Table III unless otherwise stated. We set the sampling threshold θ to be smaller than half of the average degree of indices (i.e., the average number of non-zeros when fixing an index) in the initial tensor window. In each experiment, we initialized the factor matrices by running ALS on the initial tensor window, and we processed the events arriving during the next 5WT time units. We measured relative fitness 5 times.
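The rule of thumb above for choosing θ can be sketched as follows; this is our illustrative reading, which treats the initial tensor window as a list of non-zero index tuples, and the variable and function names are ours.

    import numpy as np
    from collections import Counter

    def average_index_degree(coords):
        # coords: (#non-zeros, #modes) array listing the indices of the non-zero
        # entries of the initial tensor window. Returns the average number of
        # non-zeros obtained when fixing a single index of a single mode.
        coords = np.asarray(coords)
        degrees = []
        for mode in range(coords.shape[1]):
            degrees.extend(Counter(coords[:, mode].tolist()).values())
        return float(np.mean(degrees))

    # Rule of thumb: pick theta below half of this value, e.g.,
    # theta = max(1, int(average_index_degree(window_coords) // 2))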

VI-B Q1. Advantages of Continuous CP Decomposition

We compared continuous CPD and conventional CPD in terms of the update interval (i.e., the minimum interval between two consecutive updates), fitness, and the number of parameters, using the New York Taxi dataset. For continuous CPD, we used SNSrnd and fixed the period T to 1 hour; for conventional CPD, we used CP-stream, OnlineSCP, and ALS while varying T (i.e., the granularity of the time mode) from 1 second to 1 hour. Before measuring the fitness of the baselines, we merged the rows of their fine-grained time-mode factor matrices sequentially by adding entries so that each row corresponds to an hour; without this postprocessing step, the fitness of the baselines was even lower than reported in Fig. 1c. Fig. 1 shows the results, and we found Observation 1.

Observation 1 (Advantages of Continuous CPD).

Continuous CPD achieved (a) near-instant updates, (b) high fitness, and (c) a small number of parameters at the same time, which conventional CPD cannot. When the update interval was the same, continuous CPD achieved 2.26× higher fitness with 55× fewer parameters than conventional CPD. When they showed similar fitness, the update interval of continuous CPD was 3600× shorter than that of conventional CPD.
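The postprocessing step described above (merging the fine-grained time-mode rows of the baselines into hourly rows) can be sketched as follows; this is our illustrative reading, assuming a one-second granularity and summation of every 3,600 consecutive rows, not code from the paper.

    import numpy as np

    def merge_time_rows(time_factor, group_size=3600):
        # Sum every `group_size` consecutive rows of the time-mode factor matrix,
        # e.g., 3600 one-second rows -> one one-hour row.
        n_rows, rank = time_factor.shape
        n_groups = n_rows // group_size
        trimmed = time_factor[: n_groups * group_size]
        return trimmed.reshape(n_groups, group_size, rank).sum(axis=1)

    coarse = merge_time_rows(np.random.rand(7200, 20))  # 7200 one-second rows, rank 20
    print(coarse.shape)                                 # (2, 20): two one-hour rows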

TABLE III: Default hyperparameter settings.
Name | R | W | T (Period) | θ | η
Divvy Bikes | 20 | 10 | 1440 min (1 day) | 20 | 1000
Chicago Crime | 20 | 10 | 720 hours (1 month) | 20 | 1000
New York Taxi | 20 | 10 | 3600 sec (1 hour) | 20 | 1000
Ride Austin | 20 | 10 | 1440 min (1 day) | 50 | 1000

VI-C Q2. Speed and Fitness

Figure 4: Relative fitness of all versions of SliceNStitch and baselines over time on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. All versions of SliceNStitch (represented as lines) update outputs whenever an event occurs, while baselines (represented as dots) update outputs only once per period T.
Figure 5: (a) Runtime per update and (b) average relative fitness. All versions of SliceNStitch update factor matrices much faster than the best baseline while achieving comparable fitness.

We compared the speed and fitness of all versions of SliceNStitch and the baseline methods. Fig. 4 shows how the relative fitness (i.e., fitness relative to ALS) changed over time, and Fig. 5 shows the average relative fitness and the average elapsed time for processing an event. We found Observations 2, 3, and 4.

Observation 2 (Significant Speed-ups).

All versions of SliceNStitch updated factor matrices significantly faster than the fastest baseline. For example, SNS+rnd and SNSmat were up to 464× and 3.71× faster than CP-stream, respectively.

Observation 3 (Effect of Clipping).

SNSvec and SNSrnd failed in some datasets due to numerical errors, as discussed in the last paragraph of Section V-C. SNS+vec and SNS+rnd, where clipping is used, successfully addressed this problem.

Observation 4 (Comparable Fitness).

All stable versions of SliceNStitch (i.e., SNS+vec, SNS+rnd, and SNSmat) achieved 72–100% fitness relative to the most accurate baseline.

VI-D Q3. Data Scalability

In Fig. 6, we measured how rapidly the total running time of different versions of SliceNStitch increases with the number of events. We found Observation 5.

Observation 5 (Linear Scalability).

The total runtime of all SliceNStitch versions was linear in the number of events.

Figure 6: The total runtime of SliceNStitch is linear in the number of events on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. While we omit SNSmat due to its long execution time, it shows a similar trend.

VI-E Q4. Effect of Parameters

To investigate the effect of the sampling threshold θ on the performance of SNSrnd and SNS+rnd, we measured their relative fitness and update time while varying θ from 25% to 200% of its default value in Table III. (We set η to 500 in the Chicago Crime dataset since setting it to 1000 led to unstable results.) The results are reported in Fig. 7, and we found Observation 6.

Figure 7: Effect of θ on the relative fitness (top) and the elapsed time per update (bottom) of SNSrnd and SNS+rnd on (a, e) Divvy Bikes, (b, f) Chicago Crime, (c, g) New York Taxi, and (d, h) Ride Austin. As θ increases, the fitness increases with diminishing returns, while the runtime grows linearly. SNSrnd fails in the Chicago Crime dataset due to instability.
Observation 6 (Effect of θ).

As θ increases (i.e., more indices are sampled), the fitness of SNSrnd and SNS+rnd increases with diminishing returns, while their runtime grows linearly.

In Fig. 8, we measured the effect of the clipping threshold η on the relative fitness of SNS+vec and SNS+rnd, while varying η from 32 to 16,000. Note that η does not affect their speed. We found Observation 7.

Observation 7 (Effect of η).

The fitness of SNS+vec and SNS+rnd is insensitive to η as long as η is small enough.

Figure 8: Effect of η on the fitness of SNS+vec and SNS+rnd on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. The fitness is insensitive to η, as long as η is small enough.

VI-F Q5. Practitioner’s Guide

Based on the theoretical and empirical results above, we provide a practitioner’s guide for users of SliceNStitch.

  • We do not recommend SNSvec and SNSrnd. They are prone to numerical errors and thus unstable.

  • Among SNSmat, SNS+vec, and SNS+rnd, we recommend the version that best fits the input tensor within your runtime budget. There is a clear trade-off between their speed and fitness: in terms of speed, SNS+rnd is the fastest, followed by SNS+vec, and SNSmat is the slowest; in terms of fitness, SNSmat is the most accurate, followed by SNS+vec, and SNS+rnd is the least accurate.

  • If SNS+rnd is chosen, we recommend increasing θ as much as possible within your runtime budget.

VI-G Q6. Application to Anomaly Detection

We applied SNS+rnd, OnlineSCP, and CP-stream to an anomaly detection task. In the New York Taxi dataset, we injected abnormally large changes (specifically, 15, which is 5 times the maximum change in 1 second in the data stream) into 20 randomly chosen entries. Then, as each method proceeded, we measured the Z-scores of the errors in all entries of the latest tensor unit, where new changes arrive. After that, we investigated the top-20 Z-scores of each method. As summarized in Fig. 9, the precision, which equals the recall in our setting, was highest in SNS+rnd and OnlineSCP. More importantly, the average time gap between the occurrence and the detection of the injected anomalies was about 0.0015 seconds in SNS+rnd, while it exceeded 1,400 seconds in the others, which have to wait until the current period ends to update the CPD.
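A minimal sketch of this detection criterion, as we read it: standardize the entrywise errors of the latest tensor unit into Z-scores and flag the entries with the largest scores. The exact error definition used in the paper may differ, and the names below are ours.

    import numpy as np

    def top_k_anomalies(observed, reconstructed, k=20):
        # Z-score the entrywise absolute errors of the latest tensor unit and
        # return the indices (and scores) of the k entries with the largest Z-scores.
        errors = np.abs(observed - reconstructed).ravel()
        z = (errors - errors.mean()) / (errors.std() + 1e-12)
        top = np.argsort(z)[::-1][:k]
        return [np.unravel_index(i, observed.shape) for i in top], z[top]

    unit_true = np.zeros((265, 265))   # toy 265 x 265 tensor unit (sources x destinations)
    unit_recon = np.zeros((265, 265))  # hypothetical reconstruction from the factor matrices
    unit_true[42, 7] = 15.0            # one injected anomaly of magnitude 15
    cells, scores = top_k_anomalies(unit_true, unit_recon, k=1)
    print(cells[0], scores[0])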

VII Related Work

In this section, we review related work on online CP decomposition (CPD) and window-based tensor analysis. Then, we briefly discuss the relation between CPD and machine learning. See [19, 29] for more models, solvers, and applications.

VII-A Online Tensor Decomposition

Nion and Sidiropoulos [25] proposed Simultaneous Diagonalization Tracking (SDT) and Recursive Least Squares Tracking (RLST) for incremental CP decomposition (CPD) of three-mode dense tensors. Specifically, SDT incrementally tracks the SVD of the unfolded tensor, while RLST recursively updates the factor matrices to minimize the weighted squared error. A limitation of both algorithms is that they are applicable only to three-mode dense tensors. Gujral et al. [17] proposed a sampling-based method called SamBaTen for incremental CPD of three-mode dense and sparse tensors. Zhou et al. proposed onlineCP [27] and OnlineSCP [16] for incremental CPD of higher-order dense tensors and sparse tensors, respectively. Smith et al. [15] proposed an online CPD algorithm that can be extended to include non-negativity and sparsity constraints and is suitable for both sparse and dense tensors. SGD-based methods [28, 30] have also been developed for online CPD: Ye and Mateos [30] proposed one for Poisson-distributed count data with missing values, and Anaissi et al. [28] incorporated Nesterov’s Accelerated Gradient method into SGD. Sobral et al. [31] developed an online framework for subtracting background pixels from multispectral video data. However, all these algorithms process every new entry with the same time-mode index (e.g., a slice in Fig. 1a) at once, so they are not applicable to continuous CPD (Problem 2), where changes in entries need to be processed instantly.

BICP [32] efficiently updates block-based CPD [33, 34] when the size of the input tensor is fixed but some existing entries change per update. It requires partitioned subtensors and their CPDs rather than the CPD of the entire input tensor.

Moreover, several algorithms have been developed for incrementally updating the outputs of matrix factorization [35, 36], Bayesian probabilistic CP factorization [37], and generalized CPD [38], when previously unknown entries are revealed afterward. They are also not applicable to continuous CPD (Problem 2), where even increments and decrements of revealed entries need to be handled (see Definition 6).

Figure 9: SliceNStitch (spec., SNS+rnd) detects injected anomalies in the New York Taxi dataset much faster than baselines, with comparable accuracy. (a) Z-scores in SNS+rnd (left), OnlineSCP (middle), and CP-stream (right). (b) Numerical comparison with baselines:
Method | Precision @ Top-20 | Time Gap between Occurrence and Detection
SNS+rnd | 0.80 | 0.0015 seconds
OnlineSCP | 0.80 | 1601.00 seconds
CP-stream | 0.70 | 1424.57 seconds

Lastly, it is worth mentioning that there have been several studies on the approximation properties of some offline CPD algorithms. Haupt et al. [39] proved a sufficient condition under which a sparse random projection technique solves the low-rank tensor regression problem efficiently with provable approximation quality. Song et al. [40] showed that an importance-sampling-based orthogonal tensor decomposition algorithm achieves sublinear time complexity with provable guarantees. To the best of our knowledge, however, there has been little work on the theoretical properties of online CPD of tensor streams.

VII-B Window-based Tensor Analysis

Sun et al. [22, 23] first suggested the concept of window-based tensor analysis (WTA). Instead of analyzing the entire tensor at once, they proposed to analyze a temporally adjacent subtensor within a time window at a time, while sliding the window. Based on the sliding window model, they devised an incremental Tucker decomposition algorithm for tensors growing over time. Xu et al. [24] also suggested a Tucker decomposition algorithm for sliding window tensors and used it to detect anomalies in road networks. Zhang et al. [26] used the sliding window model with exponential weighting for robust Bayesian probabilistic CP factorization and completion. Note that all these studies assume a time window moves ‘discretely’, while in our continuous tensor model, a time window moves ‘continuously’, as explained in Section IV.

VII-C Relation to Machine Learning

CP decomposition (CPD) has been a core building block of numerous machine learning (ML) algorithms, which are designed for classification [41], weather forecasting [14], recommendation [11], and stock price prediction [13], to name a few. Moreover, CPD has proven useful for outlier removal [42, 43], imputation [12, 43], and dimensionality reduction [19], and thus it can be used as a preprocessing step for ML algorithms, many of which are known to be vulnerable to outliers, missing values, and the curse of dimensionality. We refer the reader to [44] for more roles of tensor decomposition in ML. By making this core building block “real-time”, our work represents a step towards real-time ML. Moreover, SliceNStitch can be directly used as a preprocessing step for existing streaming ML algorithms.

VIII Conclusion

In this work, we propose SliceNStitch, aiming to make tensor analysis “real-time” and applicable to time-critical applications. We summarize our contributions as follows:

  • New data model: We propose the continuous tensor model and its efficient event-driven implementation (Section IV). With our CPD algorithms, it achieves near real-time updates, high fitness, and a small number of parameters (Fig. 1).

  • Fast online algorithms: We propose a family of online algorithms for CPD in the continuous tensor model (Section V). They update factor matrices in response to a change in an entry up to 464× faster than online competitors, with fitness comparable (spec., 72–100%) to that of offline competitors (Fig. 5). We analyze their complexities (Theorems 3-7).

  • Extensive experiments: We evaluate the speed, fitness, and scalability of our algorithms on 4 real-world sparse tensors and analyze the effects of the hyperparameters. The results reveal a clear trade-off between speed and fitness, based on which we provide a practitioner’s guide (Section VI).

Reproducibility: The code and datasets used in the paper are available at https://github.com/DMLab-Tensor/SliceNStitch.

Acknowledgement

This work was supported by Samsung Electronics Co., Ltd., Disaster-Safety Platform Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2019M3D7A1094364), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

  • [1] L. Zhao and M. J. Zaki, “Tricluster: an effective algorithm for mining coherent clusters in 3d microarray data,” in SIGMOD, 2005.
  • [2] D. Koutra, E. E. Papalexakis, and C. Faloutsos, “Tensorsplat: Spotting latent anomalies in time,” in PCI, 2012.
  • [3] Y. Cai, H. Tong, W. Fan, P. Ji, and Q. He, “Facets: Fast comprehensive mining of coevolving high-order time series,” in KDD, 2015.
  • [4] B. W. Bader, M. W. Berry, and M. Browne, “Discussion tracking in enron email using parafac,” in Survey of Text Mining II.   Springer, 2008, pp. 147–163.
  • [5] D. Bruns-Smith, M. M. Baskaran, J. Ezick, T. Henretty, and R. Lethin, “Cyber security through multidimensional data decompositions,” in CYBERSEC, 2016.
  • [6] H. Fanaee-T and J. Gama, “Tensor-based anomaly detection: An interdisciplinary survey,” Knowledge-Based Systems, vol. 98, pp. 130–147, 2016.
  • [7] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927.
  • [8] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
  • [9] R. A. Harshman, “Parafac2: Mathematical and technical notes,” UCLA working papers in phonetics, vol. 22, no. 3044, p. 122215, 1972.
  • [10] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
  • [11] L. Yao, Q. Z. Sheng, Y. Qin, X. Wang, A. Shemshadi, and Q. He, “Context-aware point-of-interest recommendation using tensor factorization with social regularization,” in SIGIR, 2015.
  • [12] K. Shin, L. Sael, and U. Kang, “Fully scalable methods for distributed tensor factorization,” TKDE, vol. 29, no. 1, pp. 100–113, 2016.
  • [13] A. Spelta, “Financial market predictability with tensor decomposition and links forecast,” Applied network science, vol. 2, no. 1, p. 7, 2017.
  • [14] J. Xu, X. Liu, T. Wilson, P.-N. Tan, P. Hatami, and L. Luo, “Muscat: Multi-scale spatio-temporal learning with application to climate modeling.” in IJCAI, 2018.
  • [15] S. Smith, K. Huang, N. D. Sidiropoulos, and G. Karypis, “Streaming tensor factorization for infinite data sources,” in SDM, 2018.
  • [16] S. Zhou, S. Erfani, and J. Bailey, “Online cp decomposition for sparse tensors,” in ICDM, 2018.
  • [17] E. Gujral, R. Pasricha, and E. E. Papalexakis, “Sambaten: Sampling-based batch incremental tensor decomposition,” in SDM, 2018.
  • [18] R. Pasricha, E. Gujral, and E. E. Papalexakis, “Adaptive granularity in tensors: A quest for interpretable structure,” arXiv preprint arXiv:1912.09009, 2019.
  • [19] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [20] (2020) Supplementary document. [Online]. Available: https://github.com/DMLab-Tensor/SliceNStitch/blob/master/doc/supplementary.pdf
  • [21] R. A. Harshman et al., “Foundations of the parafac procedure: Models and conditions for an “explanatory” multimodal factor analysis,” 1970.
  • [22] J. Sun, S. Papadimitriou, and S. Y. Philip, “Window-based tensor analysis on high-dimensional and multi-aspect streams,” in ICDM, 2006.
  • [23] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, “Incremental tensor analysis: Theory and applications,” ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 3, pp. 1–37, 2008.
  • [24] M. Xu, J. Wu, H. Wang, and M. Cao, “Anomaly detection in road networks using sliding-window tensor factorization,” T-ITS, vol. 20, no. 12, pp. 4704–4713, 2019.
  • [25] D. Nion and N. D. Sidiropoulos, “Adaptive algorithms to track the parafac decomposition of a third-order tensor,” TSP, vol. 57, no. 6, pp. 2299–2310, 2009.
  • [26] Z. Zhang and C. Hawkins, “Variational bayesian inference for robust streaming tensor factorization and completion,” in ICDM, 2018.
  • [27] S. Zhou, N. X. Vinh, J. Bailey, Y. Jia, and I. Davidson, “Accelerating online cp decompositions for higher order tensors,” in KDD, 2016.
  • [28] A. Anaissi, B. Suleiman, and S. M. Zandavi, “Necpd: An online tensor decomposition with optimal stochastic gradient descent,” arXiv preprint arXiv:2003.08844, 2020.
  • [29] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, “Tensors for data mining and data fusion: Models, applications, and scalable algorithms,” ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 2, pp. 1–44, 2016.
  • [30] C. Ye and G. Mateos, “Online tensor decomposition and imputation for count data.” in DSW, 2019.
  • [31] A. Sobral, S. Javed, S. Ki Jung, T. Bouwmans, and E.-h. Zahzah, “Online stochastic tensor decomposition for background subtraction in multispectral video sequences,” in ICCVW, 2015.
  • [32] S. Huang, K. S. Candan, and M. L. Sapino, “Bicp: block-incremental cp decomposition with update sensitive refinement,” in CIKM, 2016.
  • [33] A. H. Phan and A. Cichocki, “Parafac algorithms for large-scale problems,” Neurocomputing, vol. 74, no. 11, pp. 1970–1984, 2011.
  • [34] X. Li, S. Huang, K. S. Candan, and M. L. Sapino, “2pcp: Two-phase cp decomposition for billion-scale dense tensors,” in ICDE, 2016.
  • [35] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in SIGIR, 2016.
  • [36] R. Devooght, N. Kourtellis, and A. Mantrach, “Dynamic matrix factorization with priors on unknown values,” in KDD, 2015.
  • [37] Y. Du, Y. Zheng, K.-c. Lee, and S. Zhe, “Probabilistic streaming tensor decomposition,” in ICDM, 2018.
  • [38] S. Zhou, S. M. Erfani, and J. Bailey, “Sced: A general framework for sparse tensor decomposition with constraints and elementwise dynamic learning,” in ICDM, 2017.
  • [39] J. Haupt, X. Li, and D. P. Woodruff, “Near optimal sketching of low-rank tensor regression,” in NIPS, 2017.
  • [40] Z. Song, D. P. Woodruff, and H. Zhang, “Sublinear time orthogonal tensor decomposition,” in NIPS, 2016.
  • [41] S. Rendle, “Factorization machines,” in ICDM, 2010.
  • [42] M. Najafi, L. He, and S. Y. Philip, “Outlier-robust multi-aspect streaming tensor completion and factorization.” in IJCAI, 2019.
  • [43] D. Lee and K. Shin, “Robust factorization of real-world tensor streams with patterns, missing values, and outliers,” in ICDE, 2021.
  • [44] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” TSP, vol. 65, no. 13, pp. 3551–3582, 2017.