
SliceNStitch: Continuous CP Decomposition of Sparse Tensor Streams

Taehyung Kwon$^{1,2}$, Inkyu Park$^{1,2}$, Dongjin Lee$^{4}$, and Kijung Shin$^{2,4}$
$^{2}$Graduate School of AI and $^{4}$School of Electrical Engineering, KAIST, Daejeon, South Korea
{taehyung.kwon, inkyupark, dongjin.lee, kijungs}@kaist.ac.kr
Abstract

Consider traffic data (i.e., triplets in the form of source-destination-timestamp) that grow over time. Tensors (i.e., multi-dimensional arrays) with a time mode are widely used for modeling and analyzing such multi-aspect data streams. In such tensors, however, new entries are added only once per period, which is often an hour, a day, or even a year. This discreteness of tensors has limited their usage for real-time applications, where new data should be analyzed instantly as it arrives.

How can we analyze time-evolving multi-aspect sparse data ‘continuously’ using tensors where time is ‘discrete’? We propose SliceNStitch for continuous CANDECOMP/PARAFAC (CP) decomposition, which has numerous time-critical applications, including anomaly detection, recommender systems, and stock market prediction. SliceNStitch changes the starting point of each period adaptively, based on the current time, and updates factor matrices (i.e., outputs of CP decomposition) instantly as new data arrives. We show, theoretically and experimentally, that SliceNStitch is (1) ‘Any time’: updating factor matrices immediately without having to wait until the current time period ends, (2) Fast: with constant-time updates up to $464\times$ faster than online methods, and (3) Accurate: with fitness comparable (specifically, $72$-$100\%$) to offline methods.

* Equal contribution.

I Introduction

Tensors (i.e., multidimensional arrays) are simple but powerful tools widely used for representing time-evolving multi-aspect data from various fields, including bioinformatics [1], data mining [2, 3], text mining [4], and cybersecurity [5, 6]. For example, consider a traffic dataset given as triplets in the form of (source, destination, timestamp). The dataset is naturally represented as a $3$-mode tensor whose three modes correspond to sources, destinations, and time ranges, respectively (see Fig. 1a and 1b). Each $(i,j,k)$-th entry of the tensor represents the amount of traffic from the $i$-th source to the $j$-th destination during the $k$-th time range.

[Figure 1 consists of five plots and a table: (a) Coarse-grained Tensor, (b) Fine-grained Tensor, (c) Average Fitness, (d) Number of Parameters, (e) Runtime per Update, and (f) the summary table below.]

                   Conventional CPD    Conventional CPD    SliceNStitch
                   (Coarse-grained)    (Fine-grained)      (Proposed)
Update Interval    Long                Short               Short
Parameters         Few                 Many                Few
Fitness            High                Low                 High
(f) Summary: SliceNStitch is fast, space-efficient, and accurate.

Figure 1: Advantages of SliceNStitch. (a), (b) Coarse-grained and fine-grained tensors whose update intervals are 1 hour and 1 second, respectively. (c), (d) Given a tensor stream (see Section VI-B for detailed experimental settings), SliceNStitch updates factor matrices (i.e., outputs of CPD) instantly (i.e., with a short update interval) while achieving high fitness with a small number of parameters. (e) Even the runtime per update is shorter in SliceNStitch than in the three considered methods based on conventional CPD. (f) A summary of the comparisons in (c) and (d).

Once we represent data as a tensor, many tools [7, 8, 9, 10] are available for tensor analysis, and CANDECOMP/PARAFAC decomposition (CPD) [7] is one of the most widely used. Given an $M$-mode tensor, CPD gives a low-rank approximation, specifically a sum of a few outer products of $M$ vectors, which form $M$ factor matrices. CPD has been used for various applications, and many of them, including anomaly detection [2], recommendation [11, 12], stock market prediction [13], and weather forecasting [14], are time-critical.

While tensors and CPD are powerful tools, they are not suitable for real-time applications since time in them advances in a discrete manner, specifically once per period. For example, in the tensor in Fig. 1a, each slice represents the amounts of traffic for one hour, and thus the tensor grows with a new slice only once per hour. That is, it may take one hour for new traffic to be applied to the tensor. For instance, traffic occurring at 2:00:01 is applied to the tensor at 3:00:00. Due to this discreteness, the outputs of CPD (i.e., factor matrices) are updated also only once per period even if incremental algorithms [15, 16, 17] are used.

How can we perform CPD ‘continuously’ for real-time applications? A potential solution is to make the granularity of the time mode extremely fine. According to our preliminary studies, however, it causes the following problems:

  • Degradation of Fitness (Fig. 1c): An extremely fine-grained time mode results in an extremely sparse tensor, which is known to be of high rank [18], and thus it degrades the fitness of low-rank approximations such as conventional CPD. As shown in Fig. 1c, the finer the time mode is, the lower the fitness of CPD is.

  • Increase in the Number of Parameters (Fig. 1d): The parameters of CPD are the entries of factor matrices, as explained in Section II, and the size of each factor matrix is proportional to the length of the corresponding mode of the input tensor. An extremely fine-grained time mode leads to an extremely long time mode and thus extremely many parameters, which require huge computational and storage resources. As shown in Fig. 1d, the finer the time mode is, the more parameters CPD requires.

In this work, we propose SliceNStitch for continuous CPD without increasing the number of parameters. It consists of a data model and online algorithms for CPD. From the data model aspect, we propose the continuous tensor model for time-evolving tensors. In the model, the starting point of each period changes adaptively, based on the current time, so that newly arrived data are applied to the input tensor instantly. From the algorithmic aspect, we propose a family of online algorithms for CPD of sparse tensors in the continuous tensor model. They update factor matrices instantly in response to each change in an entry of the input tensor. To the best of our knowledge, they are the first algorithms for this purpose, and existing online CPD algorithms [15, 16, 17] update factor matrices only once per period. We summarize our contributions as follows:

  • New data model: We propose the continuous tensor model, which allows for processing time-evolving tensors continuously in real-time for time-critical applications.

  • Fast online algorithms: We propose the first online algorithms that update the outputs of CPD instantly in response to changes in an entry. Their fitness is comparable (specifically, $72$-$100\%$) even to that of offline competitors, and an update by them is up to $464\times$ faster than that of online competitors.

  • Extensive experiments: We extensively evaluate our algorithms on 4 real-world sparse tensors, and based on the results, we provide practitioner’s guides to algorithm selection and hyperparameter tuning.

Reproducibility: The code and datasets used in the paper are available at https://github.com/DMLab-Tensor/SliceNStitch.

Remarks: CPD may not be the best decomposition model for tensors with time modes, and there exist a number of alternatives, such as Tucker, INDSCAL, and DEDICOM (see [19]). Nevertheless, as a first step, we focus on making CPD ‘continuous’ due to its prevalence and simplicity. We leave extending our approach to more models as future work.

In Section II, we introduce some notations and preliminaries. In Section III, we provide a formal problem definition. In Sections IV and V, we present the model and optimization algorithms of SliceNStitch, respectively. In Section VI, we review our experiments. After discussing some related works in Section VII, we conclude in Section VIII.

II Preliminaries

In this section, we introduce some notations and preliminaries. Some frequently-used symbols are listed in Table I.

TABLE I: Table of frequently-used symbols
Symbol                                Definition
$\bm{A}$                              a matrix
$\bm{A}(i,:)$, $\bm{A}(:,i)$          $i$-th row of $\bm{A}$, $i$-th column of $\bm{A}$
$\bm{A}'$, $\bm{A}^{\dagger}$         transpose of $\bm{A}$, pseudoinverse of $\bm{A}$
$\odot$, $\ast$                       Khatri-Rao product, Hadamard product
$\bm{\mathcal{X}}$                    a tensor
$M$                                   order of $\bm{\mathcal{X}}$
$N_{m}$                               number of indices in the $m$-th mode of $\bm{\mathcal{X}}$
$x_{i_{1},i_{2},\cdots,i_{M}}$        $(i_{1},i_{2},\cdots,i_{M})$-th entry of $\bm{\mathcal{X}}$
$|\bm{\mathcal{X}}|$                  number of non-zeros of $\bm{\mathcal{X}}$
$\|\bm{\mathcal{X}}\|_{F}$            Frobenius norm of $\bm{\mathcal{X}}$
$\bm{X}_{(m)}$                        mode-$m$ matricization of $\bm{\mathcal{X}}$
$R$                                   rank of CPD
$\bm{A}^{(m)}$                        factor matrix of the $m$-th mode
$\widetilde{\bm{\mathcal{X}}}$        an approximation of $\bm{\mathcal{X}}$ by CPD
$\bm{\mathcal{D}}(t,W)$               tensor window at time $t$
$\Delta\bm{\mathcal{X}}$              a change in $\bm{\mathcal{X}}$
$W$                                   number of indices in the time mode
$deg(m,i_{m})$                        number of non-zeros with $m$-th mode index $i_{m}$
$a^{(m)}_{ij}$                        $(i,j)$-th entry of $\bm{A}^{(m)}$

Basic Notations: Consider a matrix $\bm{A}$. We denote its $i$-th row by $\bm{A}(i,:)$ and its $i$-th column by $\bm{A}(:,i)$. We denote the transpose of $\bm{A}$ by $\bm{A}'$ and the Moore-Penrose pseudoinverse of $\bm{A}$ by $\bm{A}^{\dagger}$. We denote the Khatri-Rao and Hadamard products by $\odot$ and $\ast$, respectively. See Section I of [20] for the definitions of the Moore-Penrose pseudoinverse and both products.

Consider an $M$-mode sparse tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M}}$, where $N_{m}$ denotes the length of the $m$-th mode. We denote each $(i_{1},i_{2},\cdots,i_{M})$-th entry of $\bm{\mathcal{X}}$ by $x_{i_{1},i_{2},\cdots,i_{M}}$. We let $|\bm{\mathcal{X}}|$ be the number of non-zero entries in $\bm{\mathcal{X}}$, and we let $\|\bm{\mathcal{X}}\|_{F}$ be the Frobenius norm of $\bm{\mathcal{X}}$. We let $\bm{X}_{(m)}$ be the matrix obtained by matricizing $\bm{\mathcal{X}}$ along the $m$-th mode. See Section I of [20] for the definitions of the Frobenius norm and matricization.

CANDECOMP/PARAFAC Decomposition (CPD): Given an $M$-mode tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M}}$ and rank $R\in\mathbb{N}$, CANDECOMP/PARAFAC Decomposition (CPD) [7] gives a rank-$R$ approximation of $\bm{\mathcal{X}}$, expressed as the sum of $R$ rank-$1$ tensors (i.e., outer products of vectors) as follows:

$$\bm{\mathcal{X}}\approx\sum_{r=1}^{R}\bm{a}^{(1)}_{r}\circ\bm{a}^{(2)}_{r}\circ\cdots\circ\bm{a}^{(M)}_{r}\equiv\sum_{r=1}^{R}\bm{A}^{(1)}(:,r)\circ\bm{A}^{(2)}(:,r)\circ\cdots\circ\bm{A}^{(M)}(:,r)\equiv\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket\equiv\widetilde{\bm{\mathcal{X}}},\quad(1)$$

where $\bm{a}^{(m)}_{r}\in\mathbb{R}^{N_{m}}$ for all $r\in\{1,2,\cdots,R\}$, and $\circ$ denotes the outer product (see Section I of [20] for the definition). Each $\bm{A}^{(m)}\equiv[\bm{a}^{(m)}_{1}~\bm{a}^{(m)}_{2}~\cdots~\bm{a}^{(m)}_{R}]\in\mathbb{R}^{N_{m}\times R}$ is called the factor matrix of the $m$-th mode.
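To make Eq. (1) concrete, the following is a minimal NumPy sketch (not the paper's C++ implementation) of how the approximation $\widetilde{\bm{\mathcal{X}}}$ is assembled from a list of factor matrices; the function name cp_reconstruct is illustrative only.

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild the dense approximation [[A^(1), ..., A^(M)]] from factor matrices.

    factors: list of M arrays, the m-th of shape (N_m, R).
    Returns an array of shape (N_1, ..., N_M).
    """
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    approx = np.zeros(shape)
    for r in range(R):
        # outer product of the r-th columns of all factor matrices
        rank_one = factors[0][:, r]
        for A in factors[1:]:
            rank_one = np.multiply.outer(rank_one, A[:, r])
        approx += rank_one
    return approx
```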

CP decomposition aims to find factor matrices that minimize the difference between the input tensor $\bm{\mathcal{X}}$ and its approximation $\widetilde{\bm{\mathcal{X}}}$. That is, it aims to solve Eq. (2).

$$\min_{\bm{A}^{(1)},\cdots,\bm{A}^{(M)}}\Big\|\bm{\mathcal{X}}-\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket\Big\|_{F},\quad(2)$$

where $\|\cdot\|_{F}$ is the Frobenius norm (see Section I of [20] for the definition).

Alternating Least Squares (ALS): Alternating least squares (ALS) [8, 21] is a standard algorithm for computing the CPD of a static tensor. For each $n$-th mode, $\llbracket\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}\rrbracket_{(n)}=\bm{A}^{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})'$, and thus the mode-$n$ matricization of Eq. (2) becomes

$$\min_{\bm{A}^{(1)},\cdots,\bm{A}^{(M)}}\Big\|\bm{X}_{(n)}-\bm{A}^{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})'\Big\|_{F}.\quad(3)$$

While the objective function in Eq. (3) is non-convex, solving Eq. (3) only for $\bm{A}^{(n)}$ while fixing all the other factor matrices is a least-squares problem. Setting the derivative of the objective function in Eq. (3) with respect to $\bm{A}^{(n)}$ to zero leads to the following update rule for $\bm{A}^{(n)}$:

$$\bm{A}^{(n)}\leftarrow\bm{X}_{(n)}(\odot_{m\neq n}^{M}\bm{A}^{(m)})\{\ast_{m\neq n}^{M}{\bm{A}^{(m)}}'\bm{A}^{(m)}\}^{\dagger}.\quad(4)$$

ALS performs CPD by alternately updating each factor matrix $\bm{A}^{(n)}$ using Eq. (4) until convergence.
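For illustration, here is a minimal NumPy sketch of one ALS sweep using the update rule of Eq. (4). It is not the authors' implementation; the helper names (unfold, khatri_rao, als_step) are hypothetical, and the unfolding and Khatri-Rao conventions are simply chosen to be mutually consistent.

```python
import numpy as np
from functools import reduce

def unfold(X, mode):
    """Mode-n matricization (C-order columns; consistent with khatri_rao below)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that all have R columns."""
    return reduce(lambda A, B: np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1]), mats)

def als_step(X, factors):
    """One full ALS sweep: update every factor matrix once via Eq. (4)."""
    for n in range(len(factors)):
        others = [A for m, A in enumerate(factors) if m != n]
        K = khatri_rao(others)                              # Khatri-Rao of the other factors
        H = reduce(np.multiply, [A.T @ A for A in others])  # Hadamard product of Gram matrices
        factors[n] = unfold(X, n) @ K @ np.linalg.pinv(H)
    return factors
```

Repeating als_step until the fit stops improving yields the CPD of a static tensor.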

III Problem Definition

In this section, we first define multi-aspect data streams. Then, we describe how it is typically modeled as a tensor and discuss the limitation of the common tensor modeling method. Lastly, we introduce the problem considered in this work.

Multi-aspect Data Stream and Examples: We define a multi-aspect data stream, which we aim to analyze, as follows:

Definition 1 (Multi-aspect Data Stream).

A multi-aspect data stream is defined as a sequence of timestamped $M$-tuples $\{(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})\}_{n\in\mathbb{N}}$, where $i_{1},\cdots,i_{M-1}$ are categorical variables, $v_{n}\in\mathbb{R}$ is a numerical variable, and $t_{n}\in\mathbb{N}$ is the time (e.g., Unix timestamp) when the $n$-th tuple $e_{n}$ occurs. We assume that the sequence is chronological, i.e.,

$$t_{n}\leq t_{m}~\text{if}~n<m.$$

For simplicity, we also assume that the categorical variables are (assigned to) integers, i.e.,

$$i_{m}\in\{1,\cdots,N_{m}\},~\forall m\in\{1,\cdots,M-1\}.$$

Time-evolving data from various domains are naturally represented as a multi-aspect data stream as follows:

  • Traffic History: Each $3$-tuple $e_{n}$ = (source, destination, $1$) indicates a trip that started at time $t_{n}$.

  • Crime History: Each $3$-tuple $e_{n}$ = (location, type, $1$) indicates an incidence of crime at time $t_{n}$.

  • Purchase History: Each $4$-tuple $e_{n}$ = (user, product, color, quantity) indicates a purchase at time $t_{n}$.

Common Tensor Modeling Method and its Limitations: Multi-aspect data streams are commonly modeled as tensors to benefit from powerful tensor-analysis tools (e.g., CP decomposition) [22, 23, 24, 25, 26, 27, 16, 17]. Specifically, a multi-aspect data stream is modeled as an $M$-mode tensor $\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$, where $W$ is the number of indices in the time mode (i.e., the $M$-th mode). For each $w\in\{1,\cdots,W\}$, let $\bm{\mathcal{G}}_{w}\in\mathbb{R}^{N_{1}\times N_{2}\times\cdots\times N_{M-1}}$ be the $(M-1)$-mode tensor obtained from $\bm{\mathcal{X}}$ by fixing the $M$-th mode index to $w$. That is, $\bm{\mathcal{X}}\equiv\bm{\mathcal{G}}_{1}~||~\cdots~||~\bm{\mathcal{G}}_{W-1}~||~\bm{\mathcal{G}}_{W}$, where $||$ denotes concatenation. Each tensor $\bm{\mathcal{G}}_{w}$ is the sum over $T$ time units (i.e., $T$ is the period) ending at $wT$. That is, each $(j_{1},\cdots,j_{M-1})$-th entry of $\bm{\mathcal{G}}_{w}$ is the sum of $v_{n}$ over all $M$-tuples $e_{n}$ where $i_{m}=j_{m}$ for all $m\in\{1,\cdots,M-1\}$ and $t_{n}\in(wT-T,wT]$. See Figs. 1a and 1b for examples where $T$ is an hour and a second, respectively. As new tuples in the multi-aspect data stream arrive, a new $(M-1)$-mode tensor is added to $\bm{\mathcal{X}}$ once per period $T$; a minimal sketch of this bucketing is given below.
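The following rough sketch illustrates the conventional bucketing (dense array and 0-based categorical indices assumed for brevity; bucket_stream is a hypothetical helper, not part of the paper).

```python
import numpy as np

def bucket_stream(stream, dims, T, W):
    """Aggregate timestamped tuples into the W slices G_1, ..., G_W.

    stream: iterable of ((i_1, ..., i_{M-1}, v), t) with 0-based categorical indices.
    dims:   (N_1, ..., N_{M-1}); T: period; W: number of time-mode indices.
    Returns a dense array of shape dims + (W,); real code would keep it sparse.
    """
    X = np.zeros(dims + (W,))
    for (*idx, v), t in stream:
        w = int(np.ceil(t / T))        # the tuple falls into G_w, since t in (wT - T, wT]
        if 1 <= w <= W:                # keep only periods that fit in the window
            X[tuple(idx) + (w - 1,)] += v
    return X
```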

Additionally, in many previous studies on window-based tensor analysis [22, 23, 24, 26], the oldest $(M-1)$-mode tensor is removed from $\bm{\mathcal{X}}$ once per period $T$ to fix the number of indices in the time mode to $W$. That is, at time $t=W'T$ where $W'\in\{W+1,W+2,\cdots\}$, $\bm{\mathcal{X}}\equiv\bm{\mathcal{G}}_{W'-W+1}~||~\cdots~||~\bm{\mathcal{G}}_{W'-1}~||~\bm{\mathcal{G}}_{W'}$.

A serious limitation of this widely-used tensor model is that the tensor 𝓧\bm{\mathcal{X}} changes only once per period TT while the input multi-aspect data stream grows continuously. Thus, it is impossible to analyze multi-aspect data streams continuously in real time in response to the arrival of each new tuple.

Problem Definition: How can we continuously analyze multi-aspect data streams using tensor-analysis tools, specifically, CPD? We aim to answer this question, as stated in Problem 1.

Problem 1 (Continuous CP Decomposition).

(1) Given: a multi-aspect data stream, (2) to update: its CP decomposition instantly in response to each new tuple in the stream, (3) without having to wait for the current period to end.

Note that, as discussed in Section I and shown in Fig. 1, an extremely short period $T$ is not a proper solution since it drastically decreases the fitness of CPD while greatly increasing the number of parameters.

IV Proposed Data Model and Implementation

In this section, we propose the continuous tensor model and its efficient implementation. This data model is one component of SliceNStitch. See Section V for the other component.

IV-A Proposed Data Model: Continuous Tensor Model

We first define several terms on which our model is based.

Definition 2 (Tensor Slice).

Given a multi-aspect data stream (Definition 1), for each timestamped $M$-tuple $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$, we define the tensor slice $\bm{\mathcal{Z}}_{n}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}}$ as an $(M-1)$-mode tensor whose $(i_{1},\cdots,i_{M-1})$-th entry is $v_{n}$ and whose other entries are zero.

Definition 3 (Tensor Unit).

Given a multi-aspect data stream, a time $t$, and the period $T$, we define the tensor unit $\bm{\mathcal{Y}}_{t}$ as

$$\bm{\mathcal{Y}}_{t}\equiv\sum_{t_{n}\in(t-T,t]}\bm{\mathcal{Z}}_{n}.$$

That is, $\bm{\mathcal{Y}}_{t}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}}$ is an aggregation of the tuples that occurred within the half-open interval $(t-T,t]$.

Definition 4 (Tensor Window).

Given a multi-aspect data stream, a time $t$, the period $T$, and the number of time-mode indices $W$, we define the tensor window $\bm{\mathcal{D}}(t,W)$ as

$$\bm{\mathcal{D}}(t,W)\equiv\bm{\mathcal{Y}}_{t-(W-1)T}~||~\cdots~||~\bm{\mathcal{Y}}_{t-T}~||~\bm{\mathcal{Y}}_{t},$$

where $||$ denotes concatenation.

That is, $\bm{\mathcal{D}}(t,W)\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ concatenates the $W$ latest tensor units.
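A minimal sketch of Definitions 3-4, materializing $\bm{\mathcal{D}}(t,W)$ directly from the raw tuples (dense array and 0-based categorical indices assumed for brevity; tensor_window is an illustrative name):

```python
import numpy as np

def tensor_window(events, dims, t, T, W):
    """Build D(t, W): the concatenation of the W latest tensor units.

    events: list of ((i_1, ..., i_{M-1}, v), t_n) tuples with 0-based indices.
    The unit with 0-based time-mode index w (w = W-1 is the newest) covers the
    half-open interval (t - (W - w)T, t - (W - 1 - w)T].
    """
    D = np.zeros(dims + (W,))
    for (*idx, v), t_n in events:
        age = t - t_n                         # how long ago the tuple occurred
        if 0 <= age < W * T:
            w = W - 1 - int(age // T)         # tensor unit Y_{t - kT} with k = age // T
            D[tuple(idx) + (w,)] += v
    return D
```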

Figure 2: Example of the continuous tensor model. The starting points of tensor units (whose length is an hour) change adaptively based on the current time (which is 3:00:01).

The main idea of the continuous tensor model is to adaptively adjust the starting point (or equally the end point) of each tensor unit based on the current time, as described in Definition 5 and Fig. 2.

Definition 5 (Continuous Tensor Model).

In the continuous tensor model, given a multi-aspect data stream, the period $T$, and the number of time-mode indices $W$, the modeled tensor evolves from $\bm{\mathcal{D}}(t-dt,W)$ to $\bm{\mathcal{D}}(t,W)$ at each time $t$, where $dt$ represents the infinitesimal change in time.

Note that in the continuous tensor model, the modeled tensor changes ‘continuously’, i.e., once per minimum time unit of the input multi-aspect data stream (e.g., a millisecond), while the modeled tensor changes only once per period $T$ in the typical models discussed in Section III.

IV-B Event-driven Implementation of Continuous Tensor Model

In the continuous tensor model, it is crucial to efficiently update the modeled tensor, i.e., to efficiently compute the change in $\bm{\mathcal{D}}(t,W)$, because repeatedly rebuilding $\bm{\mathcal{D}}(t,W)$ from scratch at every time $t$ is computationally prohibitive.

We propose an efficient event-driven implementation of the continuous tensor model. Let $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ for simplicity. Our implementation, described in Algorithm 1, is based on the observation that each tuple $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ in the input multi-aspect data stream causes the following events:

  1. S.1: At time $t=t_{n}$, the value $v_{n}$ is added to $x_{i_{1},\cdots,i_{M-1},W}$.

  2. S.2: At time $t=t_{n}+wT$ for each $w\in\{1,\cdots,W-1\}$, the value $v_{n}$ is subtracted from $x_{i_{1},\cdots,i_{M-1},W-w+1}$ and then added to $x_{i_{1},\cdots,i_{M-1},W-w}$.

  3. S.3: At time $t=t_{n}+WT$, the value $v_{n}$ is subtracted from $x_{i_{1},\cdots,i_{M-1},1}$.

As formalized in Theorem 1, our implementation maintains the modeled tensor $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ up-to-date at each time $t$ by performing $O(MW)$ operations per tuple in the input stream. Note that $M$ and $W$ are usually small numbers, and if we regard them as constants, the time complexity becomes $O(1)$, i.e., processing each tuple takes constant time.

Theorem 1 (Time Complexity of the Continuous Tensor Model).

In Algorithm 1, the time complexity of processing each timestamped $M$-tuple is $O(MW)$.

Proof.

For each timestamped $M$-tuple, $W+1$ events occur. Processing an event (lines 4-5 or lines 7-10 of Algorithm 1) takes $O(M)$ time. ∎

Theorem 2 (Space Complexity of the Continuous Tensor Model).

In Algorithm 1, the space complexity is

$$O\Big(M\cdot\max_{t\in\mathbb{R}}|\{n\in\mathbb{N}:t_{n}\in(t-WT,t]\}|\Big).$$
Proof.

We call $S_{t}\equiv\{n\in\mathbb{N}:t_{n}\in(t-WT,t]\}$ the set of active tuples at time $t$. The number of non-zeros in $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$ is upper bounded by $|S_{t}|$, and thus the space required for $\bm{\mathcal{X}}$ is $O(M\cdot|S_{t}|)$. Since at most one event is scheduled for each active tuple, the space required for storing all scheduled events is also $O(M\cdot|S_{t}|)$. ∎

Input: (1) a multi-aspect data stream, (2) period $T$,
   (3) number of indices in the time mode $W$
Output: up-to-date tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$

1   initialize $\bm{\mathcal{X}}$ to a zero tensor $\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$
2   wait until an event $E$ occurs
3   if $E=$ arrival of $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ then
4       add $v_{n}$ to $x_{i_{1},\cdots,i_{M-1},W}$
5       schedule the $1$-st update for $(e_{n},t_{n})$ at time $t_{n}+T$
6   if $E=$ $w$-th update for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ then
7       subtract $v_{n}$ from $x_{i_{1},\cdots,i_{M-1},W-w+1}$
8       if $w<W$ then
9           add $v_{n}$ to $x_{i_{1},\cdots,i_{M-1},W-w}$
10          schedule the $(w+1)$-th update for $(e_{n},t_{n})$ at time $t_{n}+(w+1)T$
11  goto line 2
Algorithm 1: Event-driven Implementation of the Continuous Tensor Model
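Below is a minimal Python sketch of Algorithm 1 that keeps the window as a sparse dictionary and schedules the unit-boundary crossings S.2-S.3 in a min-heap; the class and method names are illustrative and this is not the authors' C++ implementation.

```python
import heapq
from collections import defaultdict

class ContinuousWindow:
    """Event-driven maintenance of the tensor window X = D(t, W) (a sketch of Alg. 1).

    X is a sparse dict mapping (i_1, ..., i_{M-1}, w) -> value, with 0-based
    time-mode index w; w = W-1 is the newest tensor unit.
    """
    def __init__(self, T, W):
        self.T, self.W = T, W
        self.X = defaultdict(float)
        self.schedule = []                 # min-heap of (fire_time, w, t_n, idx, v)

    def advance(self, t):
        """Apply every scheduled unit-boundary crossing with fire_time <= t (S.2, S.3)."""
        while self.schedule and self.schedule[0][0] <= t:
            _, w, t_n, idx, v = heapq.heappop(self.schedule)
            self.X[idx + (self.W - w,)] -= v          # leave the slot it was in
            if w < self.W:
                self.X[idx + (self.W - w - 1,)] += v  # enter the next older slot
                heapq.heappush(self.schedule,
                               (t_n + (w + 1) * self.T, w + 1, t_n, idx, v))

    def arrive(self, idx, v, t_n):
        """S.1: a new tuple enters the newest unit and schedules its first crossing."""
        self.advance(t_n)
        self.X[idx + (self.W - 1,)] += v
        heapq.heappush(self.schedule, (t_n + self.T, 1, t_n, idx, v))
```

Each arriving tuple triggers $O(W)$ scheduled crossings over its lifetime, and each crossing touches $O(M)$ indices, matching the $O(MW)$ bound of Theorem 1.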

V Proposed Optimization Algorithms

In this section, we present the other part of SliceNStitch. We propose a family of online optimization algorithms for CPD of sparse tensors in the continuous tensor model. As stated in Problem 2, they aim to update factor matrices fast and accurately in response to each change in the tensor window.

Problem 2 (Online CP Decomposition of Sparse Tensors in the Continuous Tensor Model).

(1) Given:

  • the current tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$,

  • factor matrices $\bm{A}^{(1)},\bm{A}^{(2)},\cdots,\bm{A}^{(M)}$ for $\bm{\mathcal{X}}$,

  • an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$,

(2) Update: the factor matrices in response to the event,
(3) To Solve: the minimization problem in Eq. (2).

Below, we use $\Delta\bm{\mathcal{X}}$ to denote the change in $\bm{\mathcal{X}}$ due to the given event. By S.1-S.3 of Section IV-B, Definition 6 follows.

Definition 6 (Input Change).

The input change $\Delta\bm{\mathcal{X}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ is defined as the change in $\bm{\mathcal{X}}$ due to an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$, i.e.,

  • [If $t=t_{n}$] The $(i_{1},\cdots,i_{M-1},W)$-th entry of $\Delta\bm{\mathcal{X}}$ is $v_{n}$, and the other entries are zero.

  • [If $t=t_{n}+wT$ for $1\leq w<W$] The $(i_{1},\cdots,i_{M-1},W-w)$-th and $(i_{1},\cdots,i_{M-1},W-w+1)$-th entries of $\Delta\bm{\mathcal{X}}$ are $v_{n}$ and $-v_{n}$, respectively, and the others are zero.

  • [If $t=t_{n}+WT$] The $(i_{1},\cdots,i_{M-1},1)$-th entry of $\Delta\bm{\mathcal{X}}$ is $-v_{n}$, and the others are zero.

We first introduce SliceNStitch-Matrix (SNSmat), which naively applies ALS to Problem 2. Then, we present SliceNStitch-Vector (SNSvec) and SliceNStitch-Random (SNSrnd). Lastly, we propose our main methods, SNS+vec and SNS+rnd.

V-A SliceNStitch-Matrix (SNSmat)

When we apply ALS to Problem 2, the factor matrices for the current window $\bm{\mathcal{X}}$ are strong initial points. Thus, a single iteration of ALS is enough to achieve high fitness. The detailed procedure of SNSmat is given in Algorithm 2. In line 6 of Algorithm 2, we normalize¹ the columns of each updated factor matrix to balance the scales of the factor matrices.

¹ Let the $r$-th entry of $\lambda\in\mathbb{R}^{R}$ be $\lambda_{r}$ and the $r$-th column of $\bm{A}^{(m)}$ be $\bm{a}^{(m)}_{r}$. We set $\lambda_{r}$ to $\|\bm{a}^{(m)}_{r}\|_{2}$ and set $\bm{\bar{a}}^{(m)}_{r}$ to $\bm{a}^{(m)}_{r}/\lambda_{r}$ for $r=1,\cdots,R$. Then, $\bm{\mathcal{X}}$ is approximated as $\sum_{r=1}^{R}\lambda_{r}\bm{\bar{a}}^{(1)}_{r}\circ\bm{\bar{a}}^{(2)}_{r}\circ\cdots\circ\bm{\bar{a}}^{(M)}_{r}$.

Input: (1) current tensor window $\bm{\mathcal{X}}$, (2) change $\Delta\bm{\mathcal{X}}$,
   (3) column-normalized factor matrices $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$,
   (4) $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$
Output: updated $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, and $\lambda$

1   for $m=1,\cdots,M$ do
2       $\bm{U}\leftarrow(\bm{X}+\Delta\bm{X})_{(m)}(\odot_{n\neq m}^{M}\bm{\bar{A}}^{(n)})$
3       $\bm{H}\leftarrow\ast_{n\neq m}^{M}{\bm{\bar{A}}^{(n)}}'\bm{\bar{A}}^{(n)}$
4       $\bm{A}^{(m)}\leftarrow\bm{U}\bm{H}^{\dagger}$   // by Eq. (4)
5       $\lambda\leftarrow$ $\ell^{2}$ norms of the columns of $\bm{A}^{(m)}$   // $\lambda\in\mathbb{R}^{R}$
6       $\bm{\bar{A}}^{(m)}\leftarrow$ column normalization of $\bm{A}^{(m)}$
7       update ${\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}$
8   return $\{\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, $\{{\bm{\bar{A}}^{(m)}}'\bm{\bar{A}}^{(m)}\}_{m=1}^{M}$, and $\lambda$
Algorithm 2: SNSmat: Naive Extension of ALS

Pros and Cons: For each event, SNSmat accesses every entry of the current tensor window and updates every row of the factor matrices. Thus, it suffers from high computational cost, as formalized in Theorem 3, while it maintains a high-quality solution (i.e., factor matrices).

Theorem 3 (Time complexity of SNSmat).

Let $N_{M}=W$. Then, the time complexity of SNSmat is

$$O\Big(M^{2}R|\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}|+M^{2}R^{2}+MR^{3}+\sum_{m=1}^{M}N_{m}R^{2}\Big).\quad(5)$$
Proof.

See Section II.A of the online appendix [20]. ∎

Input: (1) current tensor window $\bm{\mathcal{X}}=\bm{\mathcal{D}}(t,W)$,
   (2) change $\Delta\bm{\mathcal{X}}$ due to an event for $(e_{n}=(i_{1},\cdots,i_{M-1},v_{n}),t_{n})$ occurring at $t$,
   (3) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$,
   (4) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$, (5) period $T$
Output: updated $\{\bm{A}^{(m)}\}_{m=1}^{M}$ and $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$

1   $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}\leftarrow\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$   // used only in SNSrnd and SNS+rnd
2   $w\leftarrow(t-t_{n})/T$   // time-mode offset
3   if $w>0$ then
4       updateRow($M$, $W-w+1$, $\cdots$)   // Alg. 4 or 5
5   if $w<W$ then
6       updateRow($M$, $W-w$, $\cdots$)   // Alg. 4 or 5
7   for $m\leftarrow 1,\cdots,M-1$ do
8       updateRow($m$, $i_{m}$, $\cdots$)   // Alg. 4 or 5
9   return $\{\bm{A}^{(m)}\}_{m=1}^{M}$ and $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$
Algorithm 3: Common Outline of SNSvec, SNS+vec, SNSrnd, and SNS+rnd
Figure 3: Example of updating a row of $\bm{A}^{(1)}$. SNSvec and SNSrnd update $\bm{A}^{(1)}(i_{1},:)$ at once. SNS+vec and SNS+rnd update it entry by entry. SNSrnd and SNS+rnd sample $\theta$ entries from those in $\bm{\mathcal{X}}(i_{1},:,:)$ if more than $\theta$ entries there are non-zero (i.e., if $deg(1,i_{1})>\theta$).

V-B SliceNStitch-Vector (SNSvec)

We propose SNSvec, a fast algorithm for Problem 2. The outline of SNSvec is given in Algorithm 3, and the update rules are given in Algorithm 4, with a running example in Fig. 3. The key idea of SNSvec is to update only the rows of the factor matrices that approximate changed entries of the tensor window. Starting from the maintained factor matrices, SNSvec updates such rows of the time-mode factor matrix (lines 3-6 of Algorithm 3) and then such rows of the non-time mode factor matrices (lines 7-8). Below, we describe the update rules used.

Time Mode: Real-world tensors are typically modeled so that the time mode of $\bm{\mathcal{X}}$ has fewer indices than the other modes. Thus, each tensor unit (see Definition 3) in $\bm{\mathcal{X}}$ is likely to contain many non-zeros, and even updating only a few rows of the time-mode factor matrix (i.e., $\bm{A}^{(M)}$) is likely to incur considerable computational cost. To avoid this cost, SNSvec employs an approximate update rule.

From Eq. (4), the following update rule for the time-mode factor matrix follows:

$$\bm{A}^{(M)}\leftarrow(\bm{X}+\Delta\bm{X})_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger},\quad(6)$$

where $\bm{K}^{(M)}=\odot_{m=1}^{M-1}\bm{A}^{(m)}$ and $\bm{H}^{(M)}=\ast_{m=1}^{M-1}{\bm{A}^{(m)}}'\bm{A}^{(m)}$. If we assume that the approximated tensor $\widetilde{\bm{\mathcal{X}}}$ in Eq. (1) approximates $\bm{\mathcal{X}}$ well, then Eq. (6) is approximated by Eq. (7).

$$\bm{A}^{(M)}\leftarrow\bm{A}^{(M)}{\bm{K}^{(M)}}'\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}+\Delta\bm{X}_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}.\quad(7)$$

By a property of the Khatri-Rao product [19], Eq. (8) holds.

$${\bm{K}^{(M)}}'\bm{K}^{(M)}=(\odot_{m=1}^{M-1}\bm{A}^{(m)})'(\odot_{m=1}^{M-1}\bm{A}^{(m)})=\ast_{m=1}^{M-1}{\bm{A}^{(m)}}'\bm{A}^{(m)}=\bm{H}^{(M)}.\quad(8)$$

By Eq. (8), Eq. (7) reduces to Eq. (9).

$$\bm{A}^{(M)}\leftarrow\bm{A}^{(M)}+\Delta\bm{X}_{(M)}\bm{K}^{(M)}{\bm{H}^{(M)}}^{\dagger}.\quad(9)$$

Computing Eq. (9) is much cheaper than computing Eq. (6). Since $\Delta\bm{X}_{(M)}$ contains at most two non-zeros (see Problem 2), computing $\Delta\bm{X}_{(M)}\bm{K}^{(M)}$ in Eq. (9) takes $O(MR)$ time. For the same reason, at most two rows of $\bm{A}^{(M)}$ are updated.
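A minimal NumPy sketch of the approximate time-mode update in Eq. (9), assuming the Gram matrices ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ are maintained alongside the factors (function and argument names are illustrative):

```python
import numpy as np
from functools import reduce

def update_time_mode(factors, grams, delta):
    """Approximate time-mode update (Eq. (9)): add ΔX_(M) K^(M) (H^(M))^† to A^(M).

    factors: list of M factor matrices; grams[m] = A^(m)' A^(m) (maintained).
    delta:   dict mapping a full index (i_1, ..., i_{M-1}, w) -> changed value
             (at most two entries, by Definition 6).
    """
    M = len(factors)
    H = reduce(np.multiply, grams[:M - 1])            # Hadamard product of non-time Grams
    H_pinv = np.linalg.pinv(H)
    for index, value in delta.items():
        *nontime, w = index
        # the (i_1, ..., i_{M-1})-th row of K^(M) is the elementwise product of rows
        krow = reduce(np.multiply,
                      [factors[m][i, :] for m, i in enumerate(nontime)])
        factors[M - 1][w, :] += value * (krow @ H_pinv)
    return factors
```

After this row update, the maintained Gram matrix of the time mode would be refreshed via Eq. (13).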

Non-time Modes: When updating $\bm{A}^{(m)}$ while fixing the other factor matrices, the objective in Eq. (3) becomes Eq. (10).

$$\min_{\bm{A}^{(m)}}\Big\|(\bm{X}+\Delta\bm{X})_{(m)}-\bm{A}^{(m)}(\odot_{n\neq m}^{M}\bm{A}^{(n)})'\Big\|_{F}.\quad(10)$$

Note that $\Delta\bm{\mathcal{X}}$ contains up to two non-zeros, which are the entries changed in $\bm{\mathcal{X}}$, and their $m$-th mode indices are $i_{m}$ (see Problem 2). By Eq. (3), only the $i_{m}$-th row of $\bm{A}^{(m)}$ is used to approximate the changed entries, and thus SNSvec updates only the row.

If we fix all the other variables except $\bm{A}^{(m)}(i_{m},:)$, the problem in Eq. (10) becomes the problem in Eq. (11).

$$\min_{\bm{A}^{(m)}(i_{m},:)}\Big\|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)-\bm{A}^{(m)}(i_{m},:)(\odot_{n\neq m}^{M}\bm{A}^{(n)})'\Big\|_{F}.\quad(11)$$

The problem in Eq. (11) is a least-squares problem, and its analytical solution, given in Eq. (12), is available.

$$\bm{A}^{(m)}(i_{m},:)\leftarrow(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger},\quad(12)$$

where $\bm{K}^{(m)}=\odot_{n\neq m}^{M}\bm{A}^{(n)}$ and $\bm{H}^{(m)}=\ast_{n\neq m}^{M}{\bm{A}^{(n)}}'\bm{A}^{(n)}$.

After updating the $i_{m}$-th row of each $m$-th mode factor matrix $\bm{A}^{(m)}$ either by Eq. (9) or by Eq. (12), SNSvec incrementally maintains ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ up to date by Eq. (13).

$${\bm{A}^{(m)}}'\bm{A}^{(m)}\leftarrow{\bm{A}^{(m)}}'\bm{A}^{(m)}-\bm{p}'\bm{p}+\bm{A}^{(m)}(i_{m},:)'\bm{A}^{(m)}(i_{m},:),\quad(13)$$

where $\bm{p}$ is the $i_{m}$-th row of $\bm{A}^{(m)}$ before the update.
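The following sketch illustrates the non-time-mode row update of Eq. (12) together with the incremental Gram update of Eq. (13); it avoids materializing $\bm{K}^{(m)}$ by accumulating only over the non-zeros whose $m$-th mode index is $i_{m}$ (names are illustrative, NumPy assumed).

```python
import numpy as np
from functools import reduce

def update_nontime_row(factors, grams, nonzeros, m, i_m):
    """Row update of Eq. (12) plus the incremental Gram update of Eq. (13).

    factors:  list of M factor matrices; grams[n] = A^(n)' A^(n) (maintained).
    nonzeros: dict index -> value holding the non-zeros of X + ΔX whose
              m-th mode index equals i_m (i.e., deg(m, i_m) entries).
    """
    M, R = len(factors), factors[0].shape[1]
    H = reduce(np.multiply, [grams[n] for n in range(M) if n != m])
    rhs = np.zeros(R)
    for index, value in nonzeros.items():
        # the row of K^(m) for this non-zero = elementwise product of the other rows
        krow = reduce(np.multiply,
                      [factors[n][index[n], :] for n in range(M) if n != m])
        rhs += value * krow                    # accumulates (X + ΔX)_(m)(i_m, :) K^(m)
    p = factors[m][i_m, :].copy()              # old row, needed for Eq. (13)
    factors[m][i_m, :] = rhs @ np.linalg.pinv(H)
    new = factors[m][i_m, :]
    grams[m] += np.outer(new, new) - np.outer(p, p)   # Eq. (13)
    return factors, grams
```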

// Parenthesized inputs/outputs are for SNSrnd
Input: (1) mode $m$ and index $i_{m}$ to be updated,
   (2) current tensor window $\bm{\mathcal{X}}$, (3) change $\Delta\bm{\mathcal{X}}$,
   (4) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$ for $\bm{\mathcal{X}}$,
   (5) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$ (and $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}$),
   (6) (threshold $\theta$ for sampling)
Output: updated $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ (and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$)

// updateRow implemented in SNSvec
Procedure updateRowVec($m$, $i_{m}$, $\cdots$):
1   $\bm{p}\leftarrow\bm{A}^{(m)}(i_{m},:)$
2   if $m=M$ then update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (9)
3   else update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (12)
4   update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (13)
5   return $\bm{A}^{(m)}$ and ${\bm{A}^{(m)}}'\bm{A}^{(m)}$

// updateRow implemented in SNSrnd
Procedure updateRowRan($m$, $i_{m}$, $\cdots$):
1   $\bm{p}\leftarrow\bm{A}^{(m)}(i_{m},:)$
2   if $deg(m,i_{m})\leq\theta$ then
3       update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (12)
4   else
5       $S\leftarrow\theta$ indices of $\bm{\mathcal{X}}$ chosen uniformly at random, while fixing the $m$-th mode index to $i_{m}$
6       compute $\bm{\mathcal{\bar{X}}}$ from $S$
7       update $\bm{A}^{(m)}(i_{m},:)$ by Eq. (16)
8   update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (13)
9   update ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ by Eq. (17)
10  return $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$
Algorithm 4: updateRow in SNSvec and SNSrnd

Pros and Cons: By updating only a few rows of each factor matrix, SNSvec significantly reduces the computational cost of SNSmat, as formalized in Theorem 4, without much loss in the quality of the solution. However, SNSvec slows down if many non-zeros share the same index (see Eq. (14)), and it is often unstable due to numerical errors, as discussed later.

Theorem 4 (Time complexity of SNSvec).

Let $deg(m,i_{m})\equiv|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)|$ be the number of non-zeros of $\bm{X}+\Delta\bm{X}$ whose $m$-th mode index is $i_{m}$. Then, the time complexity of SNSvec is

$$O\Big(MR\sum_{m=1}^{M-1}deg(m,i_{m})+(MR)^{2}+MR^{3}\Big).\quad(14)$$
Proof.

See Section II.B of the online appendix [20]. ∎

V-C SliceNStitch-Random (SNSrnd)

We introduce SNSrnd, which is even faster than SNSvec. The outline of SNSrnd (see Algorithm 3) is the same as that of SNSvec. That is, SNSrnd also updates only the rows of the factor matrices that approximate the changed entries in the current tensor window $\bm{\mathcal{X}}$. However, when updating such a row, the number of entries accessed by SNSrnd is upper bounded by a user-specific constant $\theta$, while SNSvec accesses $deg(m,i_{m})$ entries (see Theorem 4), which can be as many as all the entries in $\bm{\mathcal{X}}$. Below, we present its update rule.

Assume SNSrnd updates the $i_{m}$-th row of $\bm{A}^{(m)}$. That is, consider the problem in Eq. (11). As described in the procedure updateRowRan in Algorithm 4, SNSrnd uses different approaches depending on a user-specific threshold $\theta$ and $deg(m,i_{m})\equiv|(\bm{X}+\Delta\bm{X})_{(m)}(i_{m},:)|$, i.e., the number of non-zeros of $\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}$ whose $m$-th mode index is $i_{m}$. If $deg(m,i_{m})$ is smaller than or equal to $\theta$, then SNSrnd uses Eq. (12), which is also used in SNSvec.

However, if $deg(m,i_{m})$ is greater than $\theta$, SNSrnd speeds up the update through approximation. First, it samples $\theta$ indices of $\bm{\mathcal{X}}$ without replacement, while fixing the $m$-th mode index to $i_{m}$.² Let the set of sampled indices be $S$, and let $\bm{\mathcal{\bar{X}}}\in\mathbb{R}^{N_{1}\times\cdots\times N_{M-1}\times W}$ be a tensor whose entries are all zero except those with the sampled indices $S$. For each sampled index $J=(j_{1},\cdots,j_{M})\in S$, $\bar{x}_{J}=x_{J}-\tilde{x}_{J}$. Note that for any index $J=(j_{1},\cdots,j_{M})$ of $\bm{\mathcal{X}}$, $\tilde{x}_{J}+\bar{x}_{J}=x_{J}$ if $J\in S$ and $\tilde{x}_{J}+\bar{x}_{J}=\tilde{x}_{J}$ otherwise. Thus, the more samples SNSrnd draws, the closer $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$ is to $\bm{\mathcal{X}}$. SNSrnd uses $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$ to approximate $\bm{\mathcal{X}}$ in the update. Specifically, it replaces $\bm{\mathcal{X}}$ in the objective function of Eq. (11) with $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$. Then, as in Eq. (12), the update rule in Eq. (15) follows.

² We ignore the indices of non-zeros in $\Delta\bm{\mathcal{X}}$ even if they are sampled.

$$\bm{A}^{(m)}(i_{m},:)\leftarrow(\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}+\Delta\bm{\mathcal{X}})_{(m)}(i_{m},:)\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger},\quad(15)$$

where $\bm{K}^{(m)}=\odot_{n\neq m}^{M}\bm{A}^{(n)}$ and $\bm{H}^{(m)}=\ast_{n\neq m}^{M}{\bm{A}^{(n)}}'\bm{A}^{(n)}$. Let $\bm{A}^{(m)}_{prev}$ be the $m$-th mode factor matrix before the update and $\bm{H}^{(m)}_{prev}$ be $\ast_{n\neq m}^{M}{\bm{A}^{(n)}_{prev}}'\bm{A}^{(n)}$. By Eq. (8), Eq. (15) is equivalent to Eq. (16).

$$\bm{A}^{(m)}(i_{m},:)\leftarrow\bm{A}^{(m)}(i_{m},:)\bm{H}^{(m)}_{prev}{\bm{H}^{(m)}}^{\dagger}+(\bm{\bar{X}}+\Delta\bm{X})_{(m)}\bm{K}^{(m)}{\bm{H}^{(m)}}^{\dagger}.\quad(16)$$

Notably, $(\bm{\bar{X}}+\Delta\bm{X})_{(m)}$ has at most $\theta+2=O(\theta)$ non-zeros. SNSrnd uses Eq. (16) to update the $i_{m}$-th row of $\bm{A}^{(m)}$. It incrementally maintains ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ up to date by Eq. (13), as SNSvec does. It also maintains ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ up to date by Eq. (17).

$${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\leftarrow{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}-\bm{p}'\bm{p}+\bm{p}'\bm{A}^{(m)}(i_{m},:),\quad(17)$$

where $\bm{p}=\bm{A}^{(m)}_{prev}(i_{m},:)$.
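A rough sketch of the sampled update in Eq. (16), assuming the maintained quantities ${\bm{A}^{(n)}}'\bm{A}^{(n)}$ and ${\bm{A}^{(n)}_{prev}}'\bm{A}^{(n)}$ are available and that a hypothetical helper approx_entry returns the current CPD estimate $\tilde{x}_{J}$ of an entry; the subsequent refreshes by Eqs. (13) and (17) are omitted for brevity.

```python
import numpy as np
from functools import reduce

def sampled_row_update(factors, grams, grams_prev, sampled, delta, m, i_m, approx_entry):
    """Sampled row update of A^(m)(i_m, :) in the spirit of Eq. (16).

    sampled:      dict index -> x_J for the theta sampled non-zeros whose m-th index is i_m.
    delta:        dict index -> delta-x_J (at most two entries, Definition 6).
    approx_entry: hypothetical helper J -> x~_J, the current CPD estimate of entry J.
    grams[n] = A^(n)' A^(n); grams_prev[n] = A_prev^(n)' A^(n) (both maintained).
    """
    M, R = len(factors), factors[0].shape[1]
    others = [n for n in range(M) if n != m]
    H      = reduce(np.multiply, [grams[n] for n in others])        # H^(m)
    H_prev = reduce(np.multiply, [grams_prev[n] for n in others])   # H_prev^(m)
    H_pinv = np.linalg.pinv(H)

    def krow(index):
        # the row of K^(m) matching this entry: elementwise product of the other rows
        return reduce(np.multiply, [factors[n][index[n], :] for n in others])

    rhs = np.zeros(R)
    for index, x in sampled.items():
        rhs += (x - approx_entry(index)) * krow(index)   # xbar_J = x_J - x~_J
    for index, dx in delta.items():
        rhs += dx * krow(index)
    # Eq. (16): old row times H_prev H^dagger, plus the sampled correction term
    factors[m][i_m, :] = factors[m][i_m, :] @ H_prev @ H_pinv + rhs @ H_pinv
    return factors
```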

Pros and Cons: Through approximation, SNSrnd upper bounds the number of non-zeros of $(\bm{\bar{X}}+\Delta\bm{X})_{(m)}$ in Eq. (16) by $O(\theta)$. As a result, the time complexity of SNSrnd, given in Theorem 5, is lower than that of SNSvec. Specifically, $deg(m,i_{m})$ in Eq. (14), which can be as large as $|\bm{\mathcal{X}}+\Delta\bm{\mathcal{X}}|$, is replaced with the user-specific constant $\theta$ in Eq. (18). Notably, if we regard $M$, $R$, and $\theta$ in Eq. (18) as constants, the time complexity of SNSrnd becomes constant. This makes SNSrnd significantly faster than SNSvec, at the expense of a slight reduction in the quality of the solution.

Theorem 5 (Time complexity of SNSrnd).

If $\theta>1$, then the time complexity of SNSrnd is

$$O\big(M^{2}R\theta+M^{2}R^{2}+MR^{3}\big).\quad(18)$$

If $M$, $R$, and $\theta$ are regarded as constants, Eq. (18) is $O(1)$.

Proof.

See Section II.C of the online appendix [20]. ∎

Unlike SNSmat, SNSvec and SNSrnd do not normalize the columns of factor matrices during the update process. This is because normalization requires $O(R\sum_{m=1}^{M}N_{m})$ time, which is proportional to the number of all entries in all factor matrices, and thus significantly increases the time complexity of SNSvec and SNSrnd. However, without normalization, the entries of factor matrices may have extremely large or extremely small absolute values, making SNSvec and SNSrnd vulnerable to numerical errors. In our experiments (see Fig. 4 in Section VI-C), the accuracies of SNSvec and SNSrnd suddenly drop due to numerical errors in some datasets.

V-D SliceNStitch-Stable (SNS+vec and SNS+rnd)

In this section, we propose SNS+vec and SNS+rnd, which successfully address the aforementioned instability of SNSvec and SNSrnd. The main idea is to clip each entry (i.e., ensure that each entry stays within a predefined range) while at the same time ensuring that the objective function does not increase. To this end, SNS+vec and SNS+rnd employ coordinate descent, where the entries of the factor matrices are updated one by one. The outline of SNS+vec and SNS+rnd (see Algorithm 3) is the same as that of SNSvec and SNSrnd. Below, we present their update rules, which are used in Algorithm 5.

Coordinate descent updates one variable (i.e., one entry of a factor matrix) at a time while fixing all the other variables. Assume an entry $a^{(m)}_{i_{m}k}$ of $\bm{A}^{(m)}$ is updated. Solving the problem in Eq. (2) with respect to $a^{(m)}_{i_{m}k}$ while fixing the other variables is equivalent to solving the problem in Eq. (19).

$$\min_{a^{(m)}_{i_{m}k}}\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big(x_{J}+\Delta x_{J}-\sum_{r\neq k}^{R}\prod_{n=1}^{M}a_{j_{n}r}^{(n)}-a_{i_{m}k}^{(m)}\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)^{2},\quad(19)$$

where $J=(j_{1},\cdots,j_{M})$, and $\Omega^{(m)}_{i_{m}}$ is the set of indices of $\bm{\mathcal{X}}$ of which the $m$-th mode index is $i_{m}$. To describe its solution, we first define the following terms:

$$c_{k}^{(m)}\equiv\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}(a_{j_{n}k}^{(n)})^{2}\Big),$$
$$d_{i_{m}k}^{(m)}\equiv\sum_{r\neq k}^{R}\Big(a_{i_{m}r}^{(m)}\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}a_{j_{n}r}^{(n)}a_{j_{n}k}^{(n)}\Big)\Big),\quad(20)$$
$$e_{i_{m}k}^{(m)}\equiv\sum_{r=1}^{R}\Big(b^{(m)}_{i_{m}r}\prod_{n\neq m}^{M}\Big(\sum_{j_{n}=1}^{N_{n}}b_{j_{n}r}^{(n)}a_{j_{n}k}^{(n)}\Big)\Big),$$

where $\bm{B}^{(m)}\equiv\bm{A}^{(m)}_{prev}$ is $\bm{A}^{(m)}$ before any update.

Solving the Problem in Eq. (19): The problem in Eq. (19) is a least-squares problem, and thus there exists a closed-form solution, which is used to update $a_{i_{m}k}^{(m)}$ in Eq. (21).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big((x_{J}+\Delta x_{J})\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(21)$$

Eq. (21) is used, instead of Eq. (12), when updating non-time mode factor matrices (i.e., when $m\neq M$) in SNS+vec. It is also used, instead of Eq. (12), in SNS+rnd when $deg(m,i_{m})\leq\theta$. As in SNSvec, when updating the time-mode factor matrix, SNS+vec approximates $\bm{\mathcal{X}}$ by $\widetilde{\bm{\mathcal{X}}}$, and thus it uses Eq. (22).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(e_{i_{m}k}^{(m)}+\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big(\Delta x_{J}\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(22)$$

Similarly, as in SNSrnd, when $deg(m,i_{m})>\theta$, SNS+rnd approximates $\bm{\mathcal{X}}$ by $\widetilde{\bm{\mathcal{X}}}+\bm{\mathcal{\bar{X}}}$, and thus it uses Eq. (23).

$$a_{i_{m}k}^{(m)}\leftarrow\Big(e_{i_{m}k}^{(m)}+\sum_{J\in\Omega^{(m)}_{i_{m}}}\Big((\bar{x}_{J}+\Delta x_{J})\prod_{n\neq m}^{M}a_{j_{n}k}^{(n)}\Big)-d_{i_{m}k}^{(m)}\Big)/c_{k}^{(m)}.\quad(23)$$

Note that Eq. (21), Eq. (22), and Eq. (23) are all based on Eq. (20). For the rapid computation of Eq. (20), SNS+vec and SNS+rnd incrementally maintain $\sum_{j_{m}=1}^{N_{m}}(a_{j_{m}k}^{(m)})^{2}$ and $\sum_{j_{m}=1}^{N_{m}}a_{j_{m}r}^{(m)}a_{j_{m}k}^{(m)}$, which are the $(k,k)$-th and $(r,k)$-th entries of ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, by Eq. (24) and Eq. (25). SNS+rnd also incrementally maintains $\sum_{j_{m}=1}^{N_{m}}b_{j_{m}r}^{(m)}a_{j_{m}k}^{(m)}$, which is the $(r,k)$-th entry of ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$, by Eq. (26).

$$q^{(m)}_{kk}\leftarrow q^{(m)}_{kk}-(b_{i_{m}k}^{(m)})^{2}+(a_{i_{m}k}^{(m)})^{2},\quad(24)$$
$$q^{(m)}_{rk}\leftarrow q^{(m)}_{rk}-a_{i_{m}r}^{(m)}b_{i_{m}k}^{(m)}+a_{i_{m}r}^{(m)}a_{i_{m}k}^{(m)},\quad(25)$$
$$u^{(m)}_{rk}\leftarrow u^{(m)}_{rk}-b_{i_{m}r}^{(m)}b_{i_{m}k}^{(m)}+b_{i_{m}r}^{(m)}a_{i_{m}k}^{(m)},\quad(26)$$

where $\bm{Q}^{(m)}\equiv{\bm{A}^{(m)}}'\bm{A}^{(m)}$ and $\bm{U}^{(m)}\equiv{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$. Proofs of Eqs. (21)-(26) can be found in the online appendix [20].

Clipping: To prevent the entries of the factor matrices from having extremely large or small absolute values, SNS+vec and SNS+rnd ensure that each absolute value is at most $\eta$, a user-specific threshold. Specifically, if an updated entry is greater than $\eta$, it is set to $\eta$, and if it is smaller than $-\eta$, it is set to $-\eta$. Eq. (21) followed by clipping never increases the objective function in Eq. (19).³ ⁴

³ Let $x$, $y$, and $z$ be $a^{(m)}_{i_{m}k}$ before the update, after being updated by Eq. (21), and after being clipped, respectively. The objective function in Eq. (19) is convex in $a^{(m)}_{i_{m}k}$, minimized at $y$, and symmetric around $y$, and $|y-z|\leq|y-x|$ holds.

⁴ For Eq. (22) and Eq. (23), this holds only when $\bm{\mathcal{X}}$ is well approximated.
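A minimal sketch of the entry-by-entry update of Eq. (21) with clipping, for a non-time-mode row (Eqs. (22)-(23) differ only in the $e$ term and in using $\bar{x}_{J}$); the Gram matrix is refreshed per entry in the spirit of Eqs. (24)-(25), and all names are illustrative.

```python
import numpy as np

def coordinate_update_row(factors, grams, nonzeros, m, i_m, eta):
    """Entry-by-entry update of A^(m)(i_m, :) with clipping, following Eq. (21).

    nonzeros: dict index -> (x_J + delta-x_J) over the set Omega^(m)_{i_m}, i.e.,
              the non-zeros of X + dX whose m-th mode index is i_m.
    eta:      clipping threshold; every updated entry stays within [-eta, eta].
    grams[n] = A^(n)' A^(n); grams[m] is refreshed per entry so it stays exact.
    """
    M, R = len(factors), factors[0].shape[1]
    others = [n for n in range(M) if n != m]
    for k in range(R):
        c = np.prod([grams[n][k, k] for n in others])                    # c_k^(m)
        d = sum(factors[m][i_m, r] * np.prod([grams[n][r, k] for n in others])
                for r in range(R) if r != k)                             # d_{i_m k}^(m)
        s = sum(v * np.prod([factors[n][index[n], k] for n in others])
                for index, v in nonzeros.items())
        old = factors[m][i_m, k]
        new = float(np.clip((s - d) / c, -eta, eta))                     # Eq. (21) + clip
        factors[m][i_m, k] = new
        grams[m][k, k] += new ** 2 - old ** 2                            # cf. Eq. (24)
        for r in range(R):
            if r != k:
                corr = factors[m][i_m, r] * (new - old)                  # cf. Eq. (25)
                grams[m][r, k] += corr
                grams[m][k, r] += corr
    return factors, grams
```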

// Parenthesized inputs/outputs are for SNS+rnd
Input: (1) mode $m$ and index $i_{m}$ to be updated,
   (2) current tensor window $\bm{\mathcal{X}}$, (3) change $\Delta\bm{\mathcal{X}}$,
   (4) factor matrices $\{\bm{A}^{(m)}\}_{m=1}^{M}$ for $\bm{\mathcal{X}}$,
   (5) $\{{\bm{A}^{(m)}}'\bm{A}^{(m)}\}_{m=1}^{M}$ (and $\{{\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}\}_{m=1}^{M}$),
   (6) $\eta$ for clipping (and threshold $\theta$ for sampling)
Output: updated $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ (and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$)

// updateRow implemented in SNS+vec
Procedure updateRowVec+($m$, $i_{m}$, $\cdots$):
1   for $k=1,\cdots,R$ do
2       if $m=M$ then update $a_{i_{m}k}^{(m)}$ by Eq. (22)
3       else update $a_{i_{m}k}^{(m)}$ by Eq. (21)
4       if $|a_{i_{m}k}^{(m)}|>\eta$ then $a_{i_{m}k}^{(m)}\leftarrow sign(a_{i_{m}k}^{(m)})\cdot\eta$
5       update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (24) and Eq. (25)
6   return $\bm{A}^{(m)}$ and ${\bm{A}^{(m)}}'\bm{A}^{(m)}$

// updateRow implemented in SNS+rnd
Procedure updateRowRan+($m$, $i_{m}$, $\cdots$):
1   if $deg(m,i_{m})>\theta$ then
2       $S\leftarrow\theta$ indices of $\bm{\mathcal{X}}$ chosen uniformly at random, while fixing the $m$-th mode index to $i_{m}$
3       compute $\bm{\mathcal{\bar{X}}}$ from $S$
4   for $k=1,\cdots,R$ do
5       if $deg(m,i_{m})\leq\theta$ then update $a_{i_{m}k}^{(m)}$ by Eq. (21)
6       else update $a_{i_{m}k}^{(m)}$ by Eq. (23)
7       if $|a_{i_{m}k}^{(m)}|>\eta$ then $a_{i_{m}k}^{(m)}\leftarrow sign(a_{i_{m}k}^{(m)})\cdot\eta$
8       update ${\bm{A}^{(m)}}'\bm{A}^{(m)}$ by Eq. (24) and Eq. (25)
9       update ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$ by Eq. (26)
10  return $\bm{A}^{(m)}$, ${\bm{A}^{(m)}}'\bm{A}^{(m)}$, and ${\bm{A}^{(m)}_{prev}}'\bm{A}^{(m)}$
Algorithm 5: updateRow in SNS+vec and SNS+rnd

Pros and Cons: SNS+vec and SNS+rnd do not suffer from the instability due to numerical errors that SNSvec and SNSrnd suffer from. Moreover, as shown in Theorems 6 and 7, the time complexities of SNS+vec and SNS+rnd are lower than those of SNSvec and SNSrnd, respectively. Empirically, however, SNS+vec and SNS+rnd are slightly slower and less accurate than SNSvec and SNSrnd, respectively (see Section VI-C).

Theorem 6 (Time complexity of SNS+vec).

The time complexity of SNS+vec is

$$O\Big(MR\sum_{m=1}^{M-1}deg(m,i_{m})+M^{2}R^{2}\Big).\quad(27)$$
Proof.

See Section II.E of the online appendix [20]. ∎

TABLE II: Summary of real-world sparse tensor datasets. All links are at https://github.com/DMLab-Tensor/SliceNStitch#datasets.
Name            Description                                               Size                                  # Non-zeros   Density
Divvy Bikes     sources × destinations × timestamps [minutes]             $673\times 673\times 525594$          $3.82$M       $1.604\times 10^{-5}$
Chicago Crime   communities × crime types × timestamps [hours]            $77\times 32\times 148464$            $5.33$M       $1.457\times 10^{-2}$
New York Taxi   sources × destinations × timestamps [seconds]             $265\times 265\times 5184000$         $84.39$M      $2.318\times 10^{-4}$
Ride Austin     sources × destinations × colors × timestamps [minutes]    $219\times 219\times 24\times 285136$ $0.89$M       $2.739\times 10^{-6}$
Theorem 7 (Time complexity of SNS+rnd).

If $\theta>1$, then the time complexity of SNS+rnd is

$$O\big(M^{2}R\theta+M^{2}R^{2}\big).\quad(28)$$

If $M$, $R$, and $\theta$ are regarded as constants, Eq. (28) is $O(1)$.

Proof.

See Section II.F of the online appendix [20]. ∎

VI Experiments

In this section, we design and review experiments to answer the following questions:

  • Q1. Advantages of Continuous CP Decomposition: What are the advantages of continuous CP decomposition over conventional CP decomposition?

  • Q2. Speed and Fitness: How rapidly and precisely does SliceNStitch fit the input tensor, compared to baselines?

  • Q3. Data Scalability: How does SliceNStitch scale with regard to the number of events?

  • Q4. Effect of Parameters: How do the user-specific thresholds $\theta$ and $\eta$ affect the performance of SliceNStitch?

  • Q5. Practitioner’s Guide: Which version of SliceNStitch should we use?

  • Q6. Application to Anomaly Detection: Can SliceNStitch spot abnormal events rapidly and accurately?

VI-A Experiment Specifications

Machine: We ran all experiments on a machine with a 3.7GHz Intel i5-9600K CPU and 64GB memory.

Datasets: We used four different real-world sparse tensor datasets, summarized in Table II. They are sparse tensors with a time mode, and their densities vary from $10^{-2}$ to $10^{-6}$.

Evaluation Metrics: We evaluated SliceNStitch and baselines using the following metrics:

  • Elapsed Time per Update: The average elapsed time for updating the factor matrices in response to each event.

  • Fitness (The higher the better): Fitness is a widely-used metric to evaluate the accuracy of tensor decomposition algorithms. It is defined as $1-(\|\widetilde{\bm{\mathcal{X}}}-\bm{\mathcal{X}}\|_{F}/\|\bm{\mathcal{X}}\|_{F})$, where $\bm{\mathcal{X}}$ is the input tensor and $\widetilde{\bm{\mathcal{X}}}$ (Eq. (1)) is its approximation (see the sketch after this list).

  • Relative Fitness [16] (The higher the better): Relative fitness is defined as the ratio between the fitness of the target algorithm and the fitness of ALS, i.e.,

    $$Relative\,Fitness\equiv\frac{Fitness_{target}}{Fitness_{ALS}}.$$

    Recall that ALS (see Section II) is the standard batch algorithm for tensor decomposition.
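For reference, the two accuracy metrics can be computed as follows (a small NumPy sketch; the Frobenius norm of an array is given by np.linalg.norm).

```python
import numpy as np

def fitness(X, X_approx):
    """Fitness = 1 - ||X_approx - X||_F / ||X||_F (higher is better)."""
    return 1.0 - np.linalg.norm(X_approx - X) / np.linalg.norm(X)

def relative_fitness(X, X_target, X_als):
    """Relative fitness: fitness of the target algorithm divided by that of ALS."""
    return fitness(X, X_target) / fitness(X, X_als)
```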

Baselines: Since there is no previous algorithm for continuous CPD, we compared SliceNStitch with ALS, OnlineSCP [16], CP-stream [15], and NeCPD($n$) with $n$ iterations [28], all of which are for conventional CPD (see Section VII). All baselines except ALS update factor matrices once per period $T$ (instead of whenever an event occurs).⁵ We implemented SliceNStitch and ALS in C++. We used the official implementation of OnlineSCP in MATLAB and that of CP-stream in C++.⁶ We implemented NeCPD in MATLAB.

⁵ We modified the baselines, which are for decomposing the entire tensor, to decompose the tensor window (see Definition 4), as SliceNStitch does.

⁶ https://shuozhou.github.io, https://github.com/ShadenSmith/splatt-stream

Experimental Setup: We set the hyperparameters as listed in Table III unless otherwise stated. We set the sampling threshold θ to be smaller than half of the average degree of indices (i.e., the average number of non-zeros when fixing an index) in the initial tensor window. In each experiment, we initialized the factor matrices by running ALS on the initial tensor window, and we processed the events arriving during the next 5WT time units. We measured relative fitness 5 times.
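The rule of thumb above for choosing θ can be sketched as follows; this is our illustrative reading, which treats the initial tensor window as a list of non-zero index tuples, and the variable and function names are ours.

    import numpy as np
    from collections import Counter

    def average_index_degree(coords):
        # coords: (#non-zeros, #modes) array listing the indices of the non-zero
        # entries of the initial tensor window. Returns the average number of
        # non-zeros obtained when fixing a single index of a single mode.
        coords = np.asarray(coords)
        degrees = []
        for mode in range(coords.shape[1]):
            degrees.extend(Counter(coords[:, mode].tolist()).values())
        return float(np.mean(degrees))

    # Rule of thumb: pick theta below half of this value, e.g.,
    # theta = max(1, int(average_index_degree(window_coords) // 2))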

VI-B Q1. Advantages of Continuous CP Decomposition

We compared continuous CPD and conventional CPD in terms of the update interval (i.e., the minimum interval between two consecutive updates), fitness, and the number of parameters, using the New York Taxi dataset. For continuous CPD, we used SNSrnd and fixed the period T to 1 hour; for conventional CPD, we used CP-stream, OnlineSCP, and ALS while varying T (i.e., the granularity of the time mode) from 1 second to 1 hour. Before measuring the fitness of the baselines, we merged the rows of their fine-grained time-mode factor matrices sequentially by adding entries so that each row corresponds to an hour; without this postprocessing step, the fitness of the baselines was even lower than reported in Fig. 1c. Fig. 1 shows the results, and we found Observation 1.

Observation 1 (Advantages of Continuous CPD).

Continuous CPD achieved (a) near-instant updates, (b) high fitness, and (c) a small number of parameters at the same time, which conventional CPD cannot. When the update interval was the same, continuous CPD achieved 2.26× higher fitness with 55× fewer parameters than conventional CPD. When they showed similar fitness, the update interval of continuous CPD was 3600× shorter than that of conventional CPD.
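The postprocessing step described above (merging the fine-grained time-mode rows of the baselines into hourly rows) can be sketched as follows; this is our illustrative reading, assuming a one-second granularity and summation of every 3,600 consecutive rows, not code from the paper.

    import numpy as np

    def merge_time_rows(time_factor, group_size=3600):
        # Sum every `group_size` consecutive rows of the time-mode factor matrix,
        # e.g., 3600 one-second rows -> one one-hour row.
        n_rows, rank = time_factor.shape
        n_groups = n_rows // group_size
        trimmed = time_factor[: n_groups * group_size]
        return trimmed.reshape(n_groups, group_size, rank).sum(axis=1)

    coarse = merge_time_rows(np.random.rand(7200, 20))  # 7200 one-second rows, rank 20
    print(coarse.shape)                                 # (2, 20): two one-hour rows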

TABLE III: Default hyperparameter settings.
Name | R | W | T (Period) | θ | η
Divvy Bikes | 20 | 10 | 1440 min (1 day) | 20 | 1000
Chicago Crime | 20 | 10 | 720 hours (1 month) | 20 | 1000
New York Taxi | 20 | 10 | 3600 sec (1 hour) | 20 | 1000
Ride Austin | 20 | 10 | 1440 min (1 day) | 50 | 1000

VI-C Q2. Speed and Fitness

Figure 4: Relative fitness of all versions of SliceNStitch and baselines over time on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. All versions of SliceNStitch (represented as lines) update outputs whenever an event occurs, while baselines (represented as dots) update outputs only once per period T.
Figure 5: (a) Runtime per update and (b) average relative fitness. All versions of SliceNStitch update factor matrices much faster than the best baseline while achieving comparable fitness.

We compared the speed and fitness of all versions of SliceNStitch and the baseline methods. Fig. 4 shows how the relative fitness (i.e., fitness relative to ALS) changed over time, and Fig. 5 shows the average relative fitness and the average elapsed time for processing an event. We found Observations 2, 3, and 4.

Observation 2 (Significant Speed-ups).

All versions of SliceNStitch updated factor matrices significantly faster than the fastest baseline. For example, SNS+rnd and SNSmat were up to 464× and 3.71× faster than CP-stream, respectively.

Observation 3 (Effect of Clipping).

SNSvec and SNSrnd failed in some datasets due to numerical errors, as discussed in the last paragraph of Section V-C. SNS+vec and SNS+rnd, where clipping is used, successfully addressed this problem.

Observation 4 (Comparable Fitness).

All stable versions of SliceNStitch (i.e., SNS+vec, SNS+rnd, and SNSmat) achieved 72–100% fitness relative to the most accurate baseline.

VI-D Q3. Data Scalability

In Fig. 6, we measured how rapidly the total running time of different versions of SliceNStitch increases with the number of events. We found Observation 5.

Observation 5 (Linear Scalability).

The total runtime of all SliceNStitch versions was linear in the number of events.

Figure 6: The total runtime of SliceNStitch is linear in the number of events on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. While we omit SNSmat due to its long execution time, it shows a similar trend.

VI-E Q4. Effect of Parameters

To investigate the effect of the sampling threshold θ on the performance of SNSrnd and SNS+rnd, we measured their relative fitness and update time while varying θ from 25% to 200% of its default value in Table III. (We set η to 500 in the Chicago Crime dataset since setting it to 1000 led to unstable results.) The results are reported in Fig. 7, and we found Observation 6.

Figure 7: Effect of θ on the relative fitness (top) and the elapsed time per update (bottom) of SNSrnd and SNS+rnd on (a, e) Divvy Bikes, (b, f) Chicago Crime, (c, g) New York Taxi, and (d, h) Ride Austin. As θ increases, the fitness increases with diminishing returns, while the runtime grows linearly. SNSrnd fails in the Chicago Crime dataset due to instability.
Observation 6 (Effect of θ).

As θ increases (i.e., more indices are sampled), the fitness of SNSrnd and SNS+rnd increases with diminishing returns, while their runtime grows linearly.

In Fig. 8, we measured the effect of the clipping threshold η on the relative fitness of SNS+vec and SNS+rnd, while varying η from 32 to 16,000. Note that η does not affect their speed. We found Observation 7.

Observation 7 (Effect of η).

The fitness of SNS+vec and SNS+rnd is insensitive to η as long as η is small enough.

Figure 8: Effect of η on the fitness of SNS+vec and SNS+rnd on (a) Divvy Bikes, (b) Chicago Crime, (c) New York Taxi, and (d) Ride Austin. The fitness is insensitive to η, as long as η is small enough.

VI-F Q5. Practitioner’s Guide

Based on the theoretical and empirical results above, we provide a practitioner’s guide for users of SliceNStitch.

  • We do not recommend SNSvec and SNSrnd. They are prone to numerical errors and thus unstable.

  • Among SNSmat, SNS+vec, and SNS+rnd, we recommend the version that best fits the input tensor within your runtime budget. There is a clear trade-off between their speed and fitness: in terms of speed, SNS+rnd is the fastest, followed by SNS+vec, and SNSmat is the slowest; in terms of fitness, SNSmat is the most accurate, followed by SNS+vec, and SNS+rnd is the least accurate.

  • If SNS+rnd is chosen, we recommend increasing θ as much as possible within your runtime budget.

VI-G Q6. Application to Anomaly Detection

We applied SNS+rnd, OnlineSCP, and CP-stream to an anomaly detection task. In the New York Taxi dataset, we injected abnormally large changes (specifically, 15, which is 5 times the maximum change in 1 second in the data stream) into 20 randomly chosen entries. Then, as each method proceeded, we measured the Z-scores of the errors in all entries of the latest tensor unit, where new changes arrive. After that, we investigated the top-20 Z-scores of each method. As summarized in Fig. 9, the precision, which equals the recall in our setting, was highest in SNS+rnd and OnlineSCP. More importantly, the average time gap between the occurrence and the detection of the injected anomalies was about 0.0015 seconds in SNS+rnd, while it exceeded 1,400 seconds in the others, which have to wait until the current period ends to update the CPD.
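A minimal sketch of this detection criterion, as we read it: standardize the entrywise errors of the latest tensor unit into Z-scores and flag the entries with the largest scores. The exact error definition used in the paper may differ, and the names below are ours.

    import numpy as np

    def top_k_anomalies(observed, reconstructed, k=20):
        # Z-score the entrywise absolute errors of the latest tensor unit and
        # return the indices (and scores) of the k entries with the largest Z-scores.
        errors = np.abs(observed - reconstructed).ravel()
        z = (errors - errors.mean()) / (errors.std() + 1e-12)
        top = np.argsort(z)[::-1][:k]
        return [np.unravel_index(i, observed.shape) for i in top], z[top]

    unit_true = np.zeros((265, 265))   # toy 265 x 265 tensor unit (sources x destinations)
    unit_recon = np.zeros((265, 265))  # hypothetical reconstruction from the factor matrices
    unit_true[42, 7] = 15.0            # one injected anomaly of magnitude 15
    cells, scores = top_k_anomalies(unit_true, unit_recon, k=1)
    print(cells[0], scores[0])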

VII Related Work

In this section, we review related work on online CP decomposition (CPD) and window-based tensor analysis. Then, we briefly discuss the relation between CPD and machine learning. See [19, 29] for more models, solvers, and applications.

VII-A Online Tensor Decomposition

Nion and Sidiropoulos [25] proposed Simultaneous Diagonalization Tracking (SDT) and Recursive Least Squares Tracking (RLST) for incremental CP decomposition (CPD) of three-mode dense tensors. Specifically, SDT incrementally tracks the SVD of the unfolded tensor, while RLST recursively updates the factor matrices to minimize the weighted squared error. A limitation of both algorithms is that they are applicable only to three-mode dense tensors. Gujral et al. [17] proposed a sampling-based method called SamBaTen for incremental CPD of three-mode dense and sparse tensors. Zhou et al. proposed onlineCP [27] and OnlineSCP [16] for incremental CPD of higher-order dense tensors and sparse tensors, respectively. Smith et al. [15] proposed an online CPD algorithm that can be extended to include non-negativity and sparsity constraints and is suitable for both sparse and dense tensors. SGD-based methods [28, 30] have also been developed for online CPD: Ye and Mateos [30] proposed one for Poisson-distributed count data with missing values, and Anaissi et al. [28] incorporated Nesterov’s Accelerated Gradient method into SGD. Sobral et al. [31] developed an online framework for subtracting background pixels from multispectral video data. However, all these algorithms process every new entry with the same time-mode index (e.g., a slice in Fig. 1a) at once, so they are not applicable to continuous CPD (Problem 2), where changes in entries need to be processed instantly.

BICP [32] efficiently updates block-based CPD [33, 34] when the size of the input tensor is fixed but some existing entries change per update. It requires partitioned subtensors and their CPDs rather than the CPD of the entire input tensor.

Moreover, several algorithms have been developed for incrementally updating the outputs of matrix factorization [35, 36], Bayesian probabilistic CP factorization [37], and generalized CPD [38], when previously unknown entries are revealed afterward. They are also not applicable to continuous CPD (Problem 2), where even increments and decrements of revealed entries need to be handled (see Definition 6).

Figure 9: SliceNStitch (spec., SNS+rnd) detects injected anomalies in the New York Taxi dataset much faster than baselines, with comparable accuracy. (a) Z-scores in SNS+rnd (left), OnlineSCP (middle), and CP-stream (right). (b) Numerical comparison with baselines:
Method | Precision @ Top-20 | Time Gap between Occurrence and Detection
SNS+rnd | 0.80 | 0.0015 seconds
OnlineSCP | 0.80 | 1601.00 seconds
CP-stream | 0.70 | 1424.57 seconds

Lastly, it is worth mentioning that there have been several studies on the approximation properties of some offline CPD algorithms. Haupt et al. [39] proved a sufficient condition under which a sparse random projection technique solves the low-rank tensor regression problem efficiently with provable approximation quality. Song et al. [40] showed that an importance-sampling-based orthogonal tensor decomposition algorithm achieves sublinear time complexity with provable guarantees. To the best of our knowledge, however, there has been little work on the theoretical properties of online CPD of tensor streams.

VII-B Window-based Tensor Analysis

Sun et al. [22, 23] first suggested the concept of window-based tensor analysis (WTA). Instead of analyzing the entire tensor at once, they proposed to analyze a temporally adjacent subtensor within a time window at a time, while sliding the window. Based on the sliding window model, they devised an incremental Tucker decomposition algorithm for tensors growing over time. Xu et al. [24] also suggested a Tucker decomposition algorithm for sliding window tensors and used it to detect anomalies in road networks. Zhang et al. [26] used the sliding window model with exponential weighting for robust Bayesian probabilistic CP factorization and completion. Note that all these studies assume a time window moves ‘discretely’, while in our continuous tensor model, a time window moves ‘continuously’, as explained in Section IV.

VII-C Relation to Machine Learning

CP decomposition (CPD) has been a core building block of numerous machine learning (ML) algorithms, which are designed for classification [41], weather forecasting [14], recommendation [11], and stock price prediction [13], to name a few. Moreover, CPD has proven useful for outlier removal [42, 43], imputation [12, 43], and dimensionality reduction [19], and thus it can be used as a preprocessing step for ML algorithms, many of which are known to be vulnerable to outliers, missing values, and the curse of dimensionality. We refer the reader to [44] for more roles of tensor decomposition in ML. By making this core building block “real-time”, our work represents a step towards real-time ML. Moreover, SliceNStitch can be directly used as a preprocessing step for existing streaming ML algorithms.

VIII Conclusion

In this work, we propose SliceNStitch, aiming to make tensor analysis “real-time” and applicable to time-critical applications. We summarize our contributions as follows:

  • New data model: We propose the continuous tensor model and its efficient event-driven implementation (Section IV). With our CPD algorithms, it achieves near real-time updates, high fitness, and a small number of parameters (Fig. 1).

  • Fast online algorithms: We propose a family of online algorithms for CPD in the continuous tensor model (Section V). They update factor matrices in response to a change in an entry up to 464× faster than online competitors, with fitness comparable (spec., 72–100%) to that of offline competitors (Fig. 5). We analyze their complexities (Theorems 3-7).

  • Extensive experiments: We evaluate the speed, fitness, and scalability of our algorithms on 4 real-world sparse tensors and analyze the effects of the hyperparameters. The results reveal a clear trade-off between speed and fitness, based on which we provide a practitioner’s guide (Section VI).

Reproducibility: The code and datasets used in the paper are available at https://github.com/DMLab-Tensor/SliceNStitch.

Acknowledgement

This work was supported by Samsung Electronics Co., Ltd., Disaster-Safety Platform Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2019M3D7A1094364), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

  • [1] L. Zhao and M. J. Zaki, “Tricluster: an effective algorithm for mining coherent clusters in 3d microarray data,” in SIGMOD, 2005.
  • [2] D. Koutra, E. E. Papalexakis, and C. Faloutsos, “Tensorsplat: Spotting latent anomalies in time,” in PCI, 2012.
  • [3] Y. Cai, H. Tong, W. Fan, P. Ji, and Q. He, “Facets: Fast comprehensive mining of coevolving high-order time series,” in KDD, 2015.
  • [4] B. W. Bader, M. W. Berry, and M. Browne, “Discussion tracking in enron email using parafac,” in Survey of Text Mining II.   Springer, 2008, pp. 147–163.
  • [5] D. Bruns-Smith, M. M. Baskaran, J. Ezick, T. Henretty, and R. Lethin, “Cyber security through multidimensional data decompositions,” in CYBERSEC, 2016.
  • [6] H. Fanaee-T and J. Gama, “Tensor-based anomaly detection: An interdisciplinary survey,” Knowledge-Based Systems, vol. 98, pp. 130–147, 2016.
  • [7] F. L. Hitchcock, “The expression of a tensor or a polyadic as a sum of products,” Journal of Mathematics and Physics, vol. 6, no. 1-4, pp. 164–189, 1927.
  • [8] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
  • [9] R. A. Harshman, “Parafac2: Mathematical and technical notes,” UCLA working papers in phonetics, vol. 22, no. 3044, p. 122215, 1972.
  • [10] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
  • [11] L. Yao, Q. Z. Sheng, Y. Qin, X. Wang, A. Shemshadi, and Q. He, “Context-aware point-of-interest recommendation using tensor factorization with social regularization,” in SIGIR, 2015.
  • [12] K. Shin, L. Sael, and U. Kang, “Fully scalable methods for distributed tensor factorization,” TKDE, vol. 29, no. 1, pp. 100–113, 2016.
  • [13] A. Spelta, “Financial market predictability with tensor decomposition and links forecast,” Applied network science, vol. 2, no. 1, p. 7, 2017.
  • [14] J. Xu, X. Liu, T. Wilson, P.-N. Tan, P. Hatami, and L. Luo, “Muscat: Multi-scale spatio-temporal learning with application to climate modeling.” in IJCAI, 2018.
  • [15] S. Smith, K. Huang, N. D. Sidiropoulos, and G. Karypis, “Streaming tensor factorization for infinite data sources,” in SDM, 2018.
  • [16] S. Zhou, S. Erfani, and J. Bailey, “Online cp decomposition for sparse tensors,” in ICDM, 2018.
  • [17] E. Gujral, R. Pasricha, and E. E. Papalexakis, “Sambaten: Sampling-based batch incremental tensor decomposition,” in SDM, 2018.
  • [18] R. Pasricha, E. Gujral, and E. E. Papalexakis, “Adaptive granularity in tensors: A quest for interpretable structure,” arXiv preprint arXiv:1912.09009, 2019.
  • [19] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
  • [20] (2020) Supplementary document. [Online]. Available: https://github.com/DMLab-Tensor/SliceNStitch/blob/master/doc/supplementary.pdf
  • [21] R. A. Harshman et al., “Foundations of the parafac procedure: Models and conditions for an “explanatory” multimodal factor analysis,” 1970.
  • [22] J. Sun, S. Papadimitriou, and S. Y. Philip, “Window-based tensor analysis on high-dimensional and multi-aspect streams,” in ICDM, 2006.
  • [23] J. Sun, D. Tao, S. Papadimitriou, P. S. Yu, and C. Faloutsos, “Incremental tensor analysis: Theory and applications,” ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 3, pp. 1–37, 2008.
  • [24] M. Xu, J. Wu, H. Wang, and M. Cao, “Anomaly detection in road networks using sliding-window tensor factorization,” T-ITS, vol. 20, no. 12, pp. 4704–4713, 2019.
  • [25] D. Nion and N. D. Sidiropoulos, “Adaptive algorithms to track the parafac decomposition of a third-order tensor,” TSP, vol. 57, no. 6, pp. 2299–2310, 2009.
  • [26] Z. Zhang and C. Hawkins, “Variational bayesian inference for robust streaming tensor factorization and completion,” in ICDM, 2018.
  • [27] S. Zhou, N. X. Vinh, J. Bailey, Y. Jia, and I. Davidson, “Accelerating online cp decompositions for higher order tensors,” in KDD, 2016.
  • [28] A. Anaissi, B. Suleiman, and S. M. Zandavi, “Necpd: An online tensor decomposition with optimal stochastic gradient descent,” arXiv preprint arXiv:2003.08844, 2020.
  • [29] E. E. Papalexakis, C. Faloutsos, and N. D. Sidiropoulos, “Tensors for data mining and data fusion: Models, applications, and scalable algorithms,” ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 2, pp. 1–44, 2016.
  • [30] C. Ye and G. Mateos, “Online tensor decomposition and imputation for count data.” in DSW, 2019.
  • [31] A. Sobral, S. Javed, S. Ki Jung, T. Bouwmans, and E.-h. Zahzah, “Online stochastic tensor decomposition for background subtraction in multispectral video sequences,” in ICCVW, 2015.
  • [32] S. Huang, K. S. Candan, and M. L. Sapino, “Bicp: block-incremental cp decomposition with update sensitive refinement,” in CIKM, 2016.
  • [33] A. H. Phan and A. Cichocki, “Parafac algorithms for large-scale problems,” Neurocomputing, vol. 74, no. 11, pp. 1970–1984, 2011.
  • [34] X. Li, S. Huang, K. S. Candan, and M. L. Sapino, “2pcp: Two-phase cp decomposition for billion-scale dense tensors,” in ICDE, 2016.
  • [35] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in SIGIR, 2016.
  • [36] R. Devooght, N. Kourtellis, and A. Mantrach, “Dynamic matrix factorization with priors on unknown values,” in KDD, 2015.
  • [37] Y. Du, Y. Zheng, K.-c. Lee, and S. Zhe, “Probabilistic streaming tensor decomposition,” in ICDM, 2018.
  • [38] S. Zhou, S. M. Erfani, and J. Bailey, “Sced: A general framework for sparse tensor decomposition with constraints and elementwise dynamic learning,” in ICDM, 2017.
  • [39] J. Haupt, X. Li, and D. P. Woodruff, “Near optimal sketching of low-rank tensor regression,” in NIPS, 2017.
  • [40] Z. Song, D. P. Woodruff, and H. Zhang, “Sublinear time orthogonal tensor decomposition,” in NIPS, 2016.
  • [41] S. Rendle, “Factorization machines,” in ICDM, 2010.
  • [42] M. Najafi, L. He, and S. Y. Philip, “Outlier-robust multi-aspect streaming tensor completion and factorization.” in IJCAI, 2019.
  • [43] D. Lee and K. Shin, “Robust factorization of real-world tensor streams with patterns, missing values, and outliers,” in ICDE, 2021.
  • [44] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” TSP, vol. 65, no. 13, pp. 3551–3582, 2017.