
Beyond the Signs: Nonparametric Tensor Completion
via Sign Series

Chanwoo Lee
University of Wisconsin – Madison
[email protected]
   Miaoyan Wang
University of Wisconsin – Madison
[email protected]
Abstract

We consider the problem of tensor estimation from noisy observations with possibly missing entries. A nonparametric approach to tensor completion is developed based on a new model which we coin as sign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low- and high-rank signals, while encompassing many existing tensor models—including CP models, Tucker models, single index models, several hypergraphon models—as special cases. We show that the sign tensor series is theoretically characterized, and computationally estimable, via classification tasks with carefully-specified weights. Excess risk bounds, estimation error rates, and sample complexities are established. We demonstrate the outperformance of our approach over previous methods on two datasets, one on human brain connectivity networks and the other on topic data mining.

1 Introduction

Higher-order tensors have recently received much attention in enormous fields including social networks (Anandkumar et al.,, 2014), neuroscience (Wang et al.,, 2017), and genomics (Hore et al.,, 2016). Tensor methods provide effective representation of the hidden structure in multiway data. In this paper we consider the signal plus noise model,

𝒴=Θ+,\mathcal{Y}=\Theta+\mathcal{E}, (1)

where 𝒴d1××dK\mathcal{Y}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is an order-KK data tensor, Θ\Theta is an unknown signal tensor of interest, and \mathcal{E} is a noise tensor. Our goal is to accurately estimate Θ\Theta from the incomplete, noisy observation of 𝒴\mathcal{Y}. In particular, we focus on the following two problems:

  • Q1 [Nonparametric tensor estimation]. How to flexibly estimate Θ\Theta under a wide range of structures, including both low-rankness and high-rankness?

  • Q2 [Complexity of tensor completion]. How many observed tensor entries do we need to consistently estimate the signal Θ\Theta?

1.1 Inadequacies of low-rank models

The signal plus noise model (1) is popular in the tensor literature. Existing methods estimate the signal tensor based on the low-rankness of Θ\Theta (Jain and Oh,, 2014; Montanari and Sun,, 2018). Common low-rank models include Canonical Polyadic (CP) tensors (Hitchcock,, 1927), Tucker tensors (De Lathauwer et al.,, 2000), and block tensors (Wang and Zeng,, 2019). While these methods have shown great success in signal recovery, tensors in applications often violate low-rankness. Here we provide two examples to illustrate the limitations of classical low-rank models.

The first example reveals the sensitivity of tensor rank to order-preserving transformations. Let 𝒵30×30×30\mathcal{Z}\in\mathbb{R}^{30\times 30\times 30} be an order-3 tensor with CP rank(𝒵)=3\text{rank}(\mathcal{Z})=3 (the formal definition is given in Section 1.3). Suppose a monotonic transformation f(z)=(1+exp(cz))1f(z)=(1+\exp(-cz))^{-1} is applied to 𝒵\mathcal{Z} entrywise, and we let the signal Θ\Theta in model (1) be the tensor after transformation. Figure 1a plots the numerical rank (see Section 7.1) of Θ\Theta versus cc. As we see, the rank increases rapidly with cc, rendering traditional low-rank tensor methods ineffective in the presence of mild order-preserving nonlinearities. In digital processing (Ghadermarzy et al.,, 2018) and genomics analysis (Hore et al.,, 2016), the tensor of interest often undergoes an unknown transformation prior to measurements. The sensitivity to transformation makes the low-rank model less desirable in practice.

Refer to caption
Figure 1: (a) Numerical rank of Θ\Theta versus cc in the first example. (b) Top d=30d=30 tensor singular values in the second example.

The second example demonstrates the inadequacy of classical low-rankness in representing special structures. Here we consider the signal tensor of the form Θ=log(1+𝒵)\Theta=\log(1+\mathcal{Z}), where 𝒵d×d×d\mathcal{Z}\in\mathbb{R}^{d\times d\times d} is an order-3 tensor with entries 𝒵(i,j,k)=1dmax(i,j,k)\mathcal{Z}(i,j,k)={1\over d}\max(i,j,k) for i,j,k{1,,d}i,j,k\in\{1,\ldots,d\}. The matrix analogue of Θ\Theta was studied by Chan and Airoldi, (2014) in graphon analysis. In this case neither Θ\Theta nor 𝒵\mathcal{Z} is low-rank; in fact, the rank is no smaller than the dimension dd, as illustrated in Figure 1b. Again, classical low-rank models fail to address this type of tensor structure.
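As a quick numerical illustration (a minimal Python/numpy sketch, not part of the original analysis; the dimension d=30 is chosen to match Figure 1b), one can construct Θ=log(1+𝒵) with 𝒵(i,j,k)=max(i,j,k)/d and check that a single matrix unfolding already has rank d, which lower-bounds the tensor rank:

import numpy as np

d = 30
idx = np.arange(1, d + 1)
# Z(i,j,k) = max(i,j,k) / d, built by broadcasting over the three index grids
Z = np.maximum.reduce(np.meshgrid(idx, idx, idx, indexing="ij")) / d
Theta = np.log(1 + Z)

# the tensor rank is lower-bounded by the rank of any matrix unfolding
unfold1 = Theta.reshape(d, d * d)          # mode-1 unfolding
print(np.linalg.matrix_rank(unfold1))      # expected to equal d = 30, consistent with Figure 1b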

In the above and many other examples, the signal tensors Θ\Theta of interest have high rank. Classical low-rank models will miss these important structures. New methods that allow flexible tensor modeling have yet to be developed.

1.2 Our contributions

We develop a new model called sign representable tensors to address the aforementioned challenges. Figure 2 illustrates our main idea. Our approach is built on the sign series representation of the signal tensor, and we propose to estimate the sign tensors through a series of weighted classifications. In contrast to existing methods, our method is guaranteed to recover a wide range of low- and high-rank signals. We highlight two main contributions that set our work apart from earlier literature.

Refer to caption

Figure 2: Illustration of our method. For visualization purposes, we plot an order-2 tensor (a.k.a. a matrix); a similar procedure applies to higher-order tensors. (a): a noisy and incomplete tensor input. (b) and (c): main steps of estimating the sign tensor series sgn(Θπ)\textup{sgn}(\Theta-\pi) for π{1,,1H,0,1H,,1}\pi\in\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\}. (d): the estimated signal Θ^\hat{\Theta}. The depicted signal is a full-rank matrix based on Example 5 in Section 3.

Statistically, the problem of high-rank tensor estimation is challenging. Existing estimation theory (Anandkumar et al.,, 2014; Montanari and Sun,, 2018; Cai et al.,, 2019) exclusively focuses on the regime of fixed rank rr and growing dimension dd. However, such a premise fails for high-rank tensors, where the rank may grow with, or even exceed, the dimension. A proper notion of nonparametric complexity is crucial. We show that, somewhat surprisingly, the sign tensor series not only preserves all information in the original signals, but also brings the benefits of flexibility and accuracy over classical low-rank models. The results fill the gap between parametric (low-rank) and nonparametric (high-rank) tensors, thereby greatly enriching the tensor model literature.

From a computational perspective, tensor optimization is in general NP-hard. Fortunately, tensors sought in applications are specially-structured, for which a number of efficient algorithms are available (Ghadermarzy et al.,, 2018; Wang and Li,, 2020; Han et al.,, 2020). Our high-rank tensor estimate is provably reducible to a series of classifications, and its divide-and-conquer nature facilitates efficient computation. The ability to import and adapt existing tensor algorithms is one advantage of our method.

We also highlight the challenges associated with tensors compared to matrices. High-rank matrix estimation has recently been studied under nonlinear models (Ganti et al.,, 2015) and subspace clustering (Ongie et al.,, 2017; Fan and Udell,, 2019). However, the problem for high-rank tensors is more challenging, because the tensor rank often exceeds the dimension when the order K3K\geq 3 (Anandkumar et al.,, 2017). This is in sharp contrast to matrices. We show that applying matrix methods to higher-order tensors results in suboptimal estimates. A full exploitation of the higher-order structure is needed; this is another challenge we address in this paper.

1.3 Notation

We use sgn():{1,1}\textup{sgn}(\cdot)\colon\mathbb{R}\to\{-1,1\} to denote the sign function, where sgn(y)=1\textup{sgn}(y)=1 if y0y\geq 0 and 1-1 otherwise. We allow univariate functions, such as sgn()\textup{sgn}(\cdot) and general f:f\colon\mathbb{R}\to\mathbb{R}, to be applied to tensors in an element-wise manner. We denote anbna_{n}\lesssim b_{n} if limnan/bnc\lim_{n\to\infty}a_{n}/b_{n}\leq c for some constant c0c\geq 0. We use the shorthand [n][n] to denote the nn-set {1,,n}\{1,\ldots,n\} for n+n\in\mathbb{N}_{+}. Let Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} denote an order-KK (d1,,dK)(d_{1},\ldots,d_{K})-dimensional tensor, and Θ(ω)\Theta(\omega)\in\mathbb{R} denote the tensor entry indexed by ω[d1]××[dK]\omega\in[d_{1}]\times\cdots\times[d_{K}]. An event EE is said to occur “with very high probability” if (E)\mathbb{P}(E) tends to 1 faster than any polynomial of tensor dimension d:=minkdkd:=\min_{k}d_{k}\to\infty. The CP decomposition (Hitchcock,, 1927) is defined by

Θ=s=1rλs𝒂s(1)𝒂s(K),\Theta=\sum_{s=1}^{r}\lambda_{s}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s}, (2)

where λ1λr>0\lambda_{1}\geq\cdots\geq\lambda_{r}>0 are tensor singular values, 𝒂s(k)dk\bm{a}^{(k)}_{s}\in\mathbb{R}^{d_{k}} are norm-1 tensor singular vectors, and \otimes denotes the outer product of vectors. The minimal r+r\in\mathbb{N}_{+} for which (2) holds is called the tensor rank, denoted rank(Θ)\textup{rank}(\Theta).
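For concreteness, a CP tensor of the form (2) can be assembled from its factors as in the following numpy sketch (illustrative only; the dimensions and rank are arbitrary choices):

import numpy as np

def cp_tensor(factors, weights):
    """Assemble sum_s weights[s] * a_s^(1) o ... o a_s^(K) from factor matrices.

    factors: list of K arrays, factors[k] has shape (d_k, r) with unit-norm columns a_s^(k);
    weights: length-r array of the singular values lambda_s.
    """
    K, r = len(factors), len(weights)
    dims = [A.shape[0] for A in factors]
    T = np.zeros(dims)
    for s in range(r):
        comp = weights[s]
        for k in range(K):                       # outer product across the K modes
            comp = np.multiply.outer(comp, factors[k][:, s])
        T += comp
    return T

# example: an order-3, rank-2 tensor of dimension (4, 5, 6)
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, 2)) for d in (4, 5, 6)]
factors = [A / np.linalg.norm(A, axis=0) for A in factors]   # normalize columns
Theta = cp_tensor(factors, weights=np.array([2.0, 1.0]))
print(Theta.shape)    # (4, 5, 6)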

2 Model and proposal overview

Let 𝒴\mathcal{Y} be an order-KK (d1,,dK)(d_{1},\ldots,d_{K})-dimensional data tensor generated from the following model

𝒴=Θ+,\mathcal{Y}=\Theta+\mathcal{E}, (3)

where Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is an unknown signal tensor of interest, and \mathcal{E} is a noise tensor consisting of mean-zero, independent but not necessarily identically distributed entries. We allow heterogeneous noise, in that the marginal distribution of the noise entry (ω)\mathcal{E}(\omega) may depend on ω\omega. Assume that 𝒴(ω)\mathcal{Y}(\omega) takes values in a bounded interval [A,A][-A,A]; without loss of generality, we set A=1A=1 throughout the paper.

Our observation is an incomplete data tensor from (3), denoted 𝒴Ω\mathcal{Y}_{\Omega}, where Ω[d1]××[dK]\Omega\subset[d_{1}]\times\cdots\times[d_{K}] is the index set of observed entries. We consider a general model on Ω\Omega that allows both uniform and non-uniform samplings. Specifically, let Π={pω}\Pi=\{p_{\omega}\} be an arbitrarily predefined probability distribution over the full index set with ω[d1]××[dK]pω=1\sum_{\omega\in[d_{1}]\times\cdots\times[d_{K}]}p_{\omega}=1. Assume that the entries ω\omega in Ω\Omega are i.i.d. draws with replacement from the full index set using distribution Π\Pi. The sampling rule is denoted as ωΠ\omega\sim\Pi.
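A minimal sketch of this sampling scheme (assuming numpy; Π is uniform unless a probability vector is supplied):

import numpy as np

def sample_indices(dims, n_obs, probs=None, seed=0):
    """Draw n_obs entries i.i.d. with replacement from Pi over the full index set.

    probs: flat array of length prod(dims) summing to 1 (uniform sampling if None).
    Returns a tuple of K index arrays usable for fancy indexing, e.g. Y[omega].
    """
    rng = np.random.default_rng(seed)
    total = int(np.prod(dims))
    flat = rng.choice(total, size=n_obs, replace=True, p=probs)
    return np.unravel_index(flat, dims)

omega = sample_indices((30, 30, 30), n_obs=5000)   # uniform Pi
# Y_obs = Y[omega] then gives the observed (possibly repeated) entries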

Before describing our main results, we provide the intuition behind our method. In the two examples in Section 1, the high-rankness in the signal Θ\Theta makes the estimation challenging. Now let us examine the sign of the π\pi-shifted signal sgn(Θπ)\textup{sgn}(\Theta-\pi) for any given π[1,1]\pi\in[-1,1]. It turns out that these sign tensors share the same sign patterns as low-rank tensors. Indeed, the signal tensor in the first example has the same sign pattern as a rank-44 tensor, since sgn(Θπ)=sgn(𝒵f1(π))\textup{sgn}(\Theta-\pi)=\textup{sgn}(\mathcal{Z}-f^{-1}(\pi)). The signal tensor in the second example has the same sign pattern as a rank-2 tensor, since sgn(Θπ)=sgn(max(i,j,k)d(eπ1))\textup{sgn}(\Theta-\pi)=\textup{sgn}(\max(i,j,k)-d(e^{\pi}-1)) (see Example 5 in Section 3).

The above observation suggests a general framework to estimate both low- and high-rank signal tensors. Figure 2 illustrates the main crux of our method. We dichotomize the data tensor into a series of sign tensors sgn(𝒴Ωπ)\textup{sgn}(\mathcal{Y}_{\Omega}-\pi) for π={1,,1H,0,1H,,1}\pi\in\mathcal{H}={\{\small-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\}}. Then, we estimate the sign signals sgn(Θπ)\textup{sgn}(\Theta-\pi) by performing classification

𝒵^π=argminlow rank tensor 𝒵Weighted-Loss(sgn(𝒵),sgn(𝒴Ωπ)),\hat{\mathcal{Z}}_{\pi}=\operatorname*{arg\,min}_{\text{low rank tensor $\mathcal{Z}$}}\text{Weighted-Loss}(\textup{sgn}(\mathcal{Z}),\textup{sgn}(\mathcal{Y}_{\Omega}-\pi)),

where Weighted-Loss(,)(\cdot,\cdot) denotes a carefully-designed classification objective function which will be described in later sections. Our final proposed tensor estimate takes the form

Θ^=12H+1πsgn(𝒵^π).\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}(\hat{\mathcal{Z}}_{\pi}).

Our approach is built on the nonparametric sign representation of signal tensors. The estimate Θ^\hat{\Theta} is essentially learned from the dichotomized tensor series {sgn(𝒴Ωπ):π}\{\textup{sgn}(\mathcal{Y}_{\Omega}-\pi)\colon\pi\in\mathcal{H}\} with proper weights. We show that a careful aggregation of dichotomized data not only preserves all information in the original signals, but also brings benefits of accuracy and flexibility over classical low-rank models. Unlike traditional methods, the sign representation is guaranteed to recover both low- and high-rank signals that were previously impossible to recover. The method enjoys statistical effectiveness and computational efficiency.
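The aggregation step can be previewed on a noiseless signal: averaging the sign series {sgn(Θ−π): π∈ℋ} recovers Θ itself up to resolution roughly 1/H. The numpy sketch below illustrates this representation property only; the actual estimator replaces sgn(Θ−π) with the estimated classifiers sgn(𝒵̂_π):

import numpy as np

rng = np.random.default_rng(1)
Theta = rng.uniform(-1, 1, size=(20, 20, 20))     # any signal taking values in [-1, 1]

H = 50
levels = np.arange(-H, H + 1) / H                 # pi in {-1, ..., -1/H, 0, 1/H, ..., 1}
signs = np.stack([np.sign(Theta - pi) for pi in levels])
signs[signs == 0] = 1                             # convention sgn(0) = 1
approx = signs.mean(axis=0)                       # (2H+1)^{-1} sum_pi sgn(Theta - pi)

print(np.abs(approx - Theta).max())               # small, on the order of 1/H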

3 Statistical properties of sign representable tensors

This section develops sign representable tensor models for Θ\Theta in (3). We characterize the algebraic and statistical properties of the sign tensor series, which serve as the theoretical foundation for our method.

3.1 Sign-rank and sign tensor series

Let Θ\Theta be the tensor of interest, and sgn(Θ)\textup{sgn}(\Theta) the corresponding sign pattern. The sign patterns induce an equivalence relation among tensors. Two tensors are called sign equivalent, denoted \simeq, if they have the same sign pattern.

Definition 1 (Sign-rank).

The sign-rank of a tensor Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is defined by the minimal rank among all tensors that share the same sign pattern as Θ\Theta; i.e.,

srank(Θ)=min{rank(Θ):ΘΘ,Θd1××dK}.\textup{srank}(\Theta)=\min\{\textup{rank}(\Theta^{\prime})\colon\Theta^{\prime}\simeq\Theta,\ \Theta^{\prime}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\}.

The sign-rank is also called support rank (Cohn and Umans,, 2013), minimal rank (Alon et al.,, 2016), and nondeterministic rank (De Wolf,, 2003). Earlier work defines sign-rank for binary-valued tensors; we extend the notion to continuous-valued tensors. Note that the sign-rank concerns only the sign pattern but discards the magnitude information of Θ\Theta. In particular, srank(Θ)=srank(sgnΘ)\textup{srank}(\Theta)=\textup{srank}(\textup{sgn}\Theta).

Like most tensor problems (Hillar and Lim,, 2013), determining the sign-rank for a general tensor is NP-hard (Alon et al.,, 2016). Fortunately, tensors arising in applications often possess special structures that facilitate the analysis. By definition, the sign-rank is upper bounded by the tensor rank. More generally, we have the following upper bounds.

Proposition 1 (Upper bounds of the sign-rank).

For any strictly monotonic function g:g\colon\mathbb{R}\to\mathbb{R} with g(0)=0g(0)=0,

srank(Θ)rank(g(Θ)).\textup{srank}(\Theta)\leq\textup{rank}(g(\Theta)).

Conversely, the sign-rank can be much smaller than the tensor rank, as we have shown in the examples of Section 1.

Proposition 2 (Broadness).

For every order K2K\geq 2 and dimension dd, there exist tensors Θd××d\Theta\in\mathbb{R}^{d\times\cdots\times d} such that rank(Θ)d\textup{rank}(\Theta)\geq d but srank(Θπ)2\textup{srank}(\Theta-\pi)\leq 2 for all π\pi\in\mathbb{R}.

We provide several examples in Section 7.2, in which the tensor rank grows with the dimension dd but the sign-rank remains a constant. The results highlight the advantages of using the sign-rank in high-dimensional tensor analysis. Propositions 1 and 2 together demonstrate that the low sign-rank family is strictly broader than the usual low-rank family.

We now introduce a tensor family, which we coin as “sign representable tensors”, for the signal model in (3).

Definition 2 (Sign representable tensors).

Fix a level π[1,1]\pi\in[-1,1]. A tensor Θ\Theta is called (r,π)(r,\pi)-sign representable, if the tensor (Θπ)(\Theta-\pi) has sign-rank bounded by rr. A tensor Θ\Theta is called rr-sign (globally) representable, if Θ\Theta is (r,π)(r,\pi)-sign representable for all π[1,1]\pi\in[-1,1]. The collection {sgn(Θπ):π[1,1]}\{\textup{sgn}(\Theta-\pi)\colon\pi\in[-1,1]\} is called the sign tensor series. We use 𝒫sgn(r)={Θ:srank(Θπ)r for all π[1,1]}\mathscr{P}_{\textup{sgn}}(r)=\{\Theta\colon\textup{srank}(\Theta-\pi)\leq r\text{ for all }\pi\in[-1,1]\} to denote the rr-sign representable tensor family.

We show that the rr-sign representable tensor family is a general model that incorporates most existing tensor models, including low-rank tensors, single index models, GLM models, and several hypergraphon models.

Example 1 (CP/Tucker low-rank models).

The CP and Tucker low-rank tensors are the two most popular tensor models (Kolda and Bader,, 2009). Let Θ\Theta be a low-rank tensor with CP rank rr. We see that Θ\Theta belongs to the sign representable family; i.e., Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1) (the constant 11 is due to rank(Θπ)r+1\textup{rank}(\Theta-\pi)\leq r+1). Similar results hold for Tucker low-rank tensors Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1), where r=krkr=\prod_{k}r_{k} with rkr_{k} being the kk-th mode Tucker rank of Θ\Theta.

Example 2 (Tensor block models (TBMs)).

The tensor block model (Wang and Zeng,, 2019; Chi et al.,, 2020) assumes a checkerboard structure among tensor entries up to marginal index permutation. The signal tensor Θ\Theta takes at most rr distinct values, where rr is the total number of multiway blocks. Our model incorporates the TBM because Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r).
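For illustration, a block-model signal can be generated as a multilinear product of a block-mean core with binary membership matrices; the numpy sketch below uses illustrative choices (3 clusters per mode, so at most 27 multiway blocks, and uniform block means):

import numpy as np

rng = np.random.default_rng(2)
d, r = 30, 3
# membership matrices M_k in {0,1}^{d x r}: each row has exactly one 1
M = [np.eye(r)[rng.integers(0, r, size=d)] for _ in range(3)]
C = rng.uniform(-1, 1, size=(r, r, r))                 # block means

# Theta(i,j,k) = C(z_1(i), z_2(j), z_3(k)); at most r^3 = 27 distinct values
Theta = np.einsum("ia,jb,kc,abc->ijk", M[0], M[1], M[2], C)
print(len(np.unique(Theta)))                           # <= 27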

Example 3 (Generalized linear models (GLMs)).

Let 𝒴\mathcal{Y} be a binary tensor from a logistic model (Wang and Li,, 2020) with mean Θ=logit(𝒵)\Theta=\text{logit}(\mathcal{Z}), where 𝒵\mathcal{Z} is a latent low-rank tensor. Notice that Θ\Theta itself may be high-rank (see Section 1). By definition, Θ\Theta is a low-rank sign representable tensor. The same conclusion holds for general exponential-family models with a (known) link function (Hong et al.,, 2020).

Example 4 (Single index models (SIMs)).

The single index model is a flexible semiparametric model proposed in economics (Robinson,, 1988) and high-dimensional statistics (Balabdaoui et al.,, 2019; Ganti et al.,, 2017). We here extend the model to higher-order tensors Θ\Theta. The SIM assumes the existence of an (unknown) monotonic function g:g\colon\mathbb{R}\to\mathbb{R} such that g(Θ)g(\Theta) has rank rr. We see that Θ\Theta belongs to the sign representable family; i.e., Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1).

Example 5 (Min/Max hypergraphon).

The graphon is a popular nonparametric model for networks (Chan and Airoldi,, 2014; Xu,, 2018). Here we revisit the model introduced in Section 1 for generality. Let Θ\Theta be an order-KK tensor generated from the hypergraphon Θ(i1,,iK)=log(1+maxkxik(k))\Theta(i_{1},\ldots,i_{K})=\log(1+\max_{k}x^{(k)}_{i_{k}}), where xik(k)x^{(k)}_{i_{k}} are given numbers in [0,1][0,1] for all ik[dk],k[K]i_{k}\in[d_{k}],k\in[K]. We conclude that Θ𝒫sgn(2)\Theta\in\mathscr{P}_{\textup{sgn}}(2), because the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) with an arbitrary π(0,log2)\pi\in(0,\ \log 2) is a block tensor with at most two blocks (see Figure 2c).

The results extend to general min/max hypergraphons. Let g()g(\cdot) be a continuous univariate function such that the equation g(z)=πg(z)=\pi has at most r1r\geq 1 distinct real roots for every π[1,1]\pi\in[-1,1]; this property holds, e.g., when g(z)g(z) is a polynomial of degree rr. Then, the tensor Θ\Theta generated from Θ(i1,,iK)=g(maxkxik(k))\Theta(i_{1},\ldots,i_{K})=g(\max_{k}x^{(k)}_{i_{k}}) belongs to 𝒫sgn(2r)\mathscr{P}_{\textup{sgn}}(2r) (see Section 7.2). The same conclusion holds if the maximum inside g()g(\cdot) is replaced by the minimum.
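Continuing the numerical illustration from Section 1 (a sketch only; π=0.5 is an arbitrary level inside (log(1+1/d), log 2)), the shifted sign tensor of the max hypergraphon is a two-block pattern, so every matrix unfolding of sgn(Θ−π) has rank at most 2:

import numpy as np

d = 30
idx = np.arange(1, d + 1)
Z = np.maximum.reduce(np.meshgrid(idx, idx, idx, indexing="ij")) / d
Theta = np.log(1 + Z)

pi = 0.5                                   # any level in (log(1 + 1/d), log 2)
S = np.where(Theta - pi >= 0, 1, -1)       # sgn(Theta - pi)
print(np.linalg.matrix_rank(S.reshape(d, -1)))   # 2: a two-block sign pattern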

3.2 Statistical characterization of sign tensors via weighted classification

Accurate estimation of a sign representable tensor depends on the behavior of sign tensor series, sgn(Θπ)\textup{sgn}(\Theta-\pi). In this section, we show that sign tensors are completely characterized by weighted classification. The results bridge the algebraic and statistical properties of sign representable tensors.

For a given π[1,1]\pi\in[-1,1], define a π\pi-shifted data tensor 𝒴¯Ω\bar{\mathcal{Y}}_{\Omega} with entries 𝒴¯(ω)=(𝒴(ω)π)\bar{\mathcal{Y}}(\omega)=(\mathcal{Y}(\omega)-\pi) for ωΩ\omega\in\Omega. We propose a weighted classification objective function

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over|\Omega|}\sum_{\omega\in\Omega}\ \underbrace{|\bar{\mathcal{Y}}(\omega)|}_{\text{weight}}\ \times\ \underbrace{|\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|}_{\text{classification loss}}, (4)

where 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is the decision variable to be optimized, and |𝒴¯(ω)||\bar{\mathcal{Y}}(\omega)| is the entry-specific weight, equal to the distance from the tensor entry to the target level π\pi. The entry-specific weights incorporate the magnitude information into the classification, where entries far away from the target level are penalized more heavily in the objective. In the special case of a binary tensor 𝒴{1,1}d1××dK\mathcal{Y}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}} and target level π=0\pi=0, the loss (4) reduces to the usual classification loss.
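A direct transcription of the objective (4) (a small numpy sketch; omega denotes a tuple of index arrays as produced by the sampling sketch in Section 2, and we use the convention sgn(0)=1):

import numpy as np

def weighted_loss(Z, Y, omega, pi):
    """L(Z, Ybar_Omega) in (4): weight |Y(w) - pi| times |sgn Z(w) - sgn(Y(w) - pi)|."""
    ybar = Y[omega] - pi
    sgn = lambda x: np.where(x >= 0, 1.0, -1.0)
    return np.mean(np.abs(ybar) * np.abs(sgn(Z[omega]) - sgn(ybar)))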

Our proposed weighted classification function (4) is important for characterizing sgn(Θπ)\textup{sgn}(\Theta-\pi). Define the weighted classification risk

Risk(𝒵)=𝔼𝒴ΩL(𝒵,𝒴¯Ω),\textup{Risk}(\mathcal{Z})=\mathbb{E}_{\mathcal{Y}_{\Omega}}L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}), (5)

where the expectation is taken with respect to 𝒴Ω\mathcal{Y}_{\Omega} under model (3) and the sampling distribution ωΠ\omega\sim\Pi. Note that the form of Risk()\textup{Risk}(\cdot) implicitly depends on π\pi; we suppress π\pi when no confusion arises.

Proposition 3 (Global optimum of weighted risk).

Suppose the data 𝒴Ω\mathcal{Y}_{\Omega} is generated from model (3) with Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r). Then, for all Θ¯\bar{\Theta} that are sign equivalent to sgn(Θπ)\textup{sgn}(\Theta-\pi),

Risk(Θ¯)\displaystyle\textup{Risk}(\bar{\Theta}) =inf{Risk(𝒵):𝒵d1××dK},\displaystyle=\inf\{\textup{Risk}(\mathcal{Z})\colon\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\},
=inf{Risk(𝒵):rank(𝒵)r}.\displaystyle=\inf\{\textup{Risk}(\mathcal{Z})\colon\textup{rank}(\mathcal{Z})\leq r\}. (6)

The results show that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) optimizes the weighted classification risk. This fact suggests a practical procedure to estimate sgn(Θπ)\textup{sgn}(\Theta-\pi) via empirical risk optimization of L(𝒵,𝒴¯Ω)L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}). In order to establish the recovery guarantee, we shall address the uniqueness (up to sign equivalence) of the optimizer of Risk()\textup{Risk}(\cdot). The local behavior of Θ\Theta around π\pi turns out to play a key role in the accuracy.

Some additional notation is needed. We use 𝒩={π:\mathcal{N}=\{\pi\colon ωΠ(Θ(ω)=π)0}\mathbb{P}_{\omega\sim\Pi}(\Theta(\omega)=\pi)\neq 0\} to denote the set of mass points of Θ\Theta under Π\Pi. Assume there exists a constant C>0C>0, independent of tensor dimension, such that |𝒩|C|\mathcal{N}|\leq C. Note that both Π\Pi and Θ\Theta implicitly depend on the tensor dimension. Our assumptions are imposed on Π=Π(d)\Pi=\Pi(d) and Θ=Θ(d)\Theta=\Theta(d) uniformly in the high-dimensional regime where d:=minkdkd:=\min_{k}d_{k}\to\infty.

Assumption 1 (α\alpha-smoothness).

Fix π𝒩\pi\notin\mathcal{N}. Assume there exist constants α=α(π)>0,c=c(π)>0\alpha=\alpha(\pi)>0,c=c(\pi)>0, independent of tensor dimension, such that,

sup0t<ρ(π,𝒩)ωΠ[|Θ(ω)π|t]tαc,\sup_{0\leq t<\rho(\pi,\mathcal{N})}{\mathbb{P}_{\omega\sim\Pi}[|\Theta(\omega)-\pi|\leq t]\over t^{\alpha}}\leq c, (7)

where ρ(π,𝒩):=minπ𝒩|ππ|\rho(\pi,\mathcal{N}):=\min_{\pi^{\prime}\in\mathcal{N}}|\pi-\pi^{\prime}| denotes the distance from π\pi to the nearest point in 𝒩\mathcal{N}. The largest possible α=α(π)\alpha=\alpha(\pi) in (7) is called the smoothness index at level π\pi. We make the convention that α=\alpha=\infty if the set {ω:|Θ(ω)π|t}\{\omega\colon|\Theta(\omega)-\pi|\leq t\} has zero measure, implying that almost no entries Θ(ω)\Theta(\omega) lie near the level π\pi. We call a tensor Θ\Theta α\alpha-globally smooth if (7) holds with a global constant c>0c>0 for all π[1,1]\pi\in[-1,1] except for a finite number of levels.

The smoothness index α\alpha quantifies the intrinsic hardness of recovering sgn(Θπ)\textup{sgn}(\Theta-\pi) from Risk()\textup{Risk}(\cdot). The value of α\alpha depends on both the sampling distribution ωΠ\omega\sim\Pi and the behavior of Θ(ω)\Theta(\omega). The recovery is easier at levels around which the entries are less concentrated, corresponding to a large value α>1\alpha>1, or equivalently, to a cumulative distribution function (CDF) G(π):=ωΠ[Θ(ω)π]G(\pi):=\mathbb{P}_{\omega\sim\Pi}[\Theta(\omega)\leq\pi] that remains flat around π\pi. A small value α<1\alpha<1 indicates a nonexistent (infinite) density at level π\pi, or equivalently, a G(π)G(\pi) that jumps at π\pi. A typical case is α=1\alpha=1, when G(π)G(\pi) has a finite nonzero derivative in the vicinity of π\pi. Table 1 illustrates G(π)G(\pi) for various models of Θ\Theta (see the simulations in Section 5 for details).
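When the signal is known, as in simulations, the local mass in (7) can be probed empirically. The sketch below (illustrative only; uniform Π) estimates P_{ω∼Π}[|Θ(ω)−π|≤t] on a grid of t; a ratio to t^α that stays bounded indicates α-smoothness at the level π:

import numpy as np

def local_mass(Theta, pi, ts):
    """Empirical P[|Theta(w) - pi| <= t] under uniform sampling, for each t in ts."""
    dist = np.abs(Theta - pi).ravel()
    return np.array([(dist <= t).mean() for t in ts])

# example: entries with a bounded density around pi = 0.3, hence alpha = 1
rng = np.random.default_rng(3)
Theta = rng.uniform(-1, 1, size=(30, 30, 30))
ts = np.array([0.01, 0.02, 0.05, 0.1])
print(local_mass(Theta, 0.3, ts) / ts)      # roughly constant ratio, consistent with alpha = 1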

We now reach the main theorem in this section. For two tensors Θ1,Θ2\Theta_{1},\Theta_{2}, define the mean absolute error (MAE)

MAE(Θ1,Θ2)=def𝔼ωΠ|Θ1(ω)Θ2(ω)|.\text{MAE}(\Theta_{1},\Theta_{2})\stackrel{{\scriptstyle\text{def}}}{{=}}\mathbb{E}_{\omega\sim\Pi}|\Theta_{1}(\omega)-\Theta_{2}(\omega)|.
Theorem 1 (Identifiability).

Under Assumption 1, for all tensors Θ¯sgn(Θπ)\bar{\Theta}\simeq\textup{sgn}(\Theta-\pi) and tensors 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}},

MAE(sgn𝒵,sgnΘ¯)C(π)[Risk(𝒵)Risk(Θ¯)]α/(α+1),\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq C(\pi)\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right]^{\alpha/(\alpha+1)},

where C(π)>0C(\pi)>0 is independent of 𝒵\mathcal{Z}.

The result establishes the recovery stability of the sign tensors sgn(Θπ)\textup{sgn}(\Theta-\pi) using optimization with the population risk (5). The bound immediately shows the uniqueness of the optimizer of Risk()\text{Risk}(\cdot) up to a zero-measure set under Π\Pi. We find that a higher value of α\alpha implies more stable recovery, as intuition would suggest. Similar results hold for optimization with the sample risk (4) (see Section 4).

We conclude this section by applying Assumption 1 to the examples described in Section 3.1. For simplicity, suppose for now that Π\Pi is the uniform sampling distribution. The tensor block model is \infty-globally smooth. This is because the set 𝒩\mathcal{N}, which consists of the distinct block means in Θ\Theta, has finitely many elements. Furthermore, we have α=\alpha=\infty for all π𝒩\pi\notin\mathcal{N}, since the numerator in (7) is zero for all such π\pi. The min/max hypergraphon model with an rr-degree polynomial is 11-globally smooth, because α=1\alpha=1 for all π\pi in the function range except for at most (r1)(r-1) stationary points.

4 Nonparametric tensor completion via sign series

In the previous sections we established the sign series representation and its relationship to classification. In this section, we present the algorithm proposed in Section 2 (Figure 2) in detail. We provide the estimation error bound and address the empirical implementation of the algorithm.

4.1 Estimation error and sample complexity

Given a noisy incomplete tensor observation 𝒴Ω\mathcal{Y}_{\Omega} from model (3), we cast the problem of estimating Θ\Theta into a series of weighted classifications. Specifically we propose the tensor estimate using the sign representation,

Θ^=12H+1πsgn𝒵^π,\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}{\hat{\mathcal{Z}}_{\pi}}, (8)

where 𝒵^πd1××dK\hat{\mathcal{Z}}_{\pi}\in\mathbb{R}^{d_{1}\times\dots\times d_{K}} is the π\pi-weighted classifier estimated at levels π={1,,1H,0,1H,,1}\pi\in\mathcal{H}=\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\},

𝒵^π=argmin𝒵:rank𝒵rL(𝒵,𝒴Ωπ).\hat{\mathcal{Z}}_{\pi}=\operatorname*{arg\,min}_{\mathcal{Z}\colon\text{rank}\mathcal{Z}\leq r}L(\mathcal{Z},\mathcal{Y}_{\Omega}-\pi). (9)

Here L(,)L(\cdot,\cdot) denotes the weighted classification objective defined in (4), where we have plugged 𝒴¯Ω=(𝒴Ωπ)\bar{\mathcal{Y}}_{\Omega}=(\mathcal{Y}_{\Omega}-\pi) into the expression, and the rank constraint follows from Proposition 3. For the theory, we assume the true rr is known; in practice, rr could be chosen in a data-adaptive fashion via cross-validation or the elbow method (Hastie et al.,, 2009). Step (9) corresponds to Figure 2c, while step (8) corresponds to Figure 2d.

The next theorem establishes the statistical convergence for the sign tensor estimate (9), which is an important ingredient for the final signal tensor estimate Θ^\hat{\Theta} in (8).

Theorem 2 (Sign tensor estimation).

Suppose Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r) and Θ(ω)\Theta(\omega) is α\alpha-globally smooth under ωΠ\omega\sim\Pi. Let 𝒵^π\hat{\mathcal{Z}}_{\pi} be the estimate in (9), dmax=maxk[K]dkd_{\max}=\max_{k\in[K]}d_{k}, and dmaxr|Ω|d_{\max}r\lesssim|\Omega|. Then, for all π[1,1]\pi\in[-1,1] except for a finite number of levels, with very high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(sgn𝒵^π,sgn(Θπ))(dmaxr|Ω|)αα+2+1ρ2(π,𝒩)dmaxr|Ω|.\displaystyle\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}}_{\pi},\textup{sgn}(\Theta-\pi))\lesssim\left({d_{\max}r\over|\Omega|}\right)^{\alpha\over\alpha+2}+{1\over\rho^{2}(\pi,\mathcal{N})}{d_{\max}r\over|\Omega|}. (10)

Theorem 2 provides the error bound for the sign tensor estimation. Compared to the population results in Theorem 1, we here explicitly reveal the dependence of accuracy on the sample complexity and on the level π\pi. The result demonstrates the polynomial decay of sign errors with |Ω||\Omega|. In particular, our sign estimate achieves consistent recovery using as few as O~(dmaxr)\tilde{O}(d_{\max}r) noisy entries.

Combining the sign representability of the signal tensor and the sign estimation accuracy, we obtain the main results on our nonparametric tensor estimation method.

Theorem 3 (Tensor estimation error).

Consider the same conditions of Theorem 2. Let Θ^\hat{\Theta} be the estimate in (8). With very high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(Θ^,Θ)(dmaxr|Ω|)αα+2+1H+Hdmaxr|Ω|.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left({d_{\max}r\over|\Omega|}\right)^{\alpha\over\alpha+2}+{1\over H}+{Hd_{\max}r\over|\Omega|}. (11)

In particular, setting H(|Ω|dmaxr)1/2\scriptstyle H\asymp\left(|\Omega|\over d_{\max}r\right)^{1/2} yields the error bound

MAE(Θ^,Θ)(dmaxr|Ω|)αα+212.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left(d_{\max}r\over|\Omega|\right)^{{\alpha\over\alpha+2}\vee{1\over 2}}. (12)

Theorem 3 demonstrates the convergence rate of our tensor estimation. The bound (11) reveals three sources of error: the estimation error of the sign tensors, the bias from the sign series representation, and the variance thereof. The resolution parameter HH controls the bias-variance tradeoff. We remark that the signal estimation error (12) is generally no better than the corresponding sign error (10). This is to be expected, since magnitude estimation is a harder problem than sign estimation.
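To make the role of HH explicit (a sketch of the arithmetic behind (12)), equating the bias term with the variance term in (11) gives

\frac{1}{H}=\frac{Hd_{\max}r}{|\Omega|}\quad\Longleftrightarrow\quad H\asymp\left(\frac{|\Omega|}{d_{\max}r}\right)^{1/2},

at which point both terms scale as (d_{\max}r/|\Omega|)^{1/2}; combining this with the first term of (11) yields the bound (12).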

In the special case of full observation with equal dimension dk=d,k[K]d_{k}=d,k\in[K], our signal estimate achieves convergence

MAE(Θ^,Θ)(rdK1)αα+212.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left(r\over d^{K-1}\right)^{{\alpha\over\alpha+2}\vee{1\over 2}}. (13)

Compared to earlier methods, our estimation accuracy applies to both low- and high-rank signal tensors. The rate depends on the sign complexity Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r), and this rr is often much smaller than the usual tensor rank (see Section 3.1). Our result also reveals that the convergence becomes favorable as the order of data tensor increases.

We apply our method to the main examples in Section 3.1, and compare the results with existing literature. The numerical comparison is provided in Section 5.

Example 2 (TBM).

Consider a tensor block model with rr multiway blocks in total. Our result implies a rate 𝒪(d(K1)/2)\mathcal{O}(d^{-(K-1)/2}) by taking α=\alpha=\infty. This rate agrees with the previous root-mean-square error (RMSE) rate for block tensor estimation (Wang and Zeng,, 2019).

Example 3 (GLM).

Consider a GLM tensor Θ=g(𝒵)\Theta=g(\mathcal{Z}), where gg is a known link function and 𝒵\mathcal{Z} is a latent low-rank tensor. Suppose the marginal density of Θ(ω)\Theta(\omega) is bounded as dd\to\infty. Applying our results with α=1\alpha=1 yields a rate 𝒪(d(K1)/3)\mathcal{O}(d^{-(K-1)/3}). This rate is slightly slower than the parametric RMSE rate (Zhang and Xia,, 2018; Wang and Li,, 2020). One possible reason is that our estimate remains valid for unknown gg and general high-rank tensors with α=1\alpha=1. The nonparametric rate is the price one has to pay for not knowing the form Θ=g(𝒵)\Theta=g(\mathcal{Z}) a priori.

The following sample complexity for nonparametric tensor completion is a direct consequence of Theorem 3.

Corollary 1 (Sample complexity for nonparametric completion).

Under the same conditions of Theorem 3 with α0\alpha\neq 0, with high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(Θ^,Θ)0,as|Ω|dmaxr.\textup{MAE}(\hat{\Theta},\Theta)\to 0,\quad\text{as}\quad{|\Omega|\over{d_{\max}}r}\to\infty.

Our result improves earlier work (Yuan and Zhang,, 2016; Ghadermarzy et al.,, 2019; Lee and Wang,, 2020) by allowing both low- and high-rank signals. Interestingly, the sample requirements depend only on the sign complexity rr but not on the nonparametric complexity α\alpha. Note that 𝒪~(dmaxr)\tilde{\mathcal{O}}(d_{\max}r) roughly matches the degrees of freedom of the sign tensors, suggesting the optimality of our sample requirements.

4.2 Numerical implementation

This section addresses the practical implementation of our estimator (8) illustrated in Figure 2. Our sign representation of the signal estimate Θ^\hat{\Theta} is an average of 2H+12H+1 sign tensors, which can be computed in a divide-and-conquer fashion. Briefly, we estimate the sign tensors 𝒵π\mathcal{Z}_{\pi} (detailed in the next paragraph) for the series π\pi\in\mathcal{H} in parallel, and then we aggregate the results to yield the output. The estimate therefore enjoys a computational cost similar to that of a single sign tensor estimation.

For the sign tensor estimation (9), the problem reduces to binary tensor decomposition with a weighted classification loss. A number of algorithms have been developed for this problem (Ghadermarzy et al.,, 2018; Wang and Li,, 2020; Hong et al.,, 2020). We adopt similar ideas by tailoring the algorithms to our context. Following the common practice in classification, we replace the binary loss (z,y)=|sgnzsgny|\ell(z,y)=|\textup{sgn}z-\textup{sgn}y| with a surrogate loss F(m)F(m), a continuous function of the margin m:=zsgn(y)m:=z\textup{sgn}(y). Examples of large-margin losses are the hinge loss F(m)=(1m)+F(m)=(1-m)_{+}, the logistic loss F(m)=log(1+em)F(m)=\log(1+e^{-m}), and the nonconvex ψ\psi-loss F(m)=2min(1,(1m)+)F(m)=2\min(1,(1-m)_{+}), where m+=max(m,0)m_{+}=\max(m,0). We implement the hinge loss and the logistic loss in our algorithm, although our framework is applicable to general large-margin losses (Bartlett et al.,, 2006).
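Written out explicitly, the three surrogates are simple functions of the margin; a small illustrative numpy sketch:

import numpy as np

def hinge(m):      # F(m) = (1 - m)_+
    return np.maximum(1 - m, 0)

def logistic(m):   # F(m) = log(1 + exp(-m))
    return np.logaddexp(0, -m)

def psi_loss(m):   # nonconvex psi-loss F(m) = 2 * min(1, (1 - m)_+)
    return 2 * np.minimum(1, np.maximum(1 - m, 0))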

Algorithm 1 Nonparametric tensor completion
1:Noisy and incomplete data tensor 𝒴Ω\mathcal{Y}_{\Omega}, rank rr, resolution parameter HH.
2:for π={1,,1H,0,1H,,1}\pi\in\mathcal{H}=\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\} do
3:     Random initialization of tensor factors 𝑨k=[𝒂1(k),,𝒂r(k)]dk×r\bm{A}_{k}=[\bm{a}^{(k)}_{1},\ldots,\bm{a}^{(k)}_{r}]\in\mathbb{R}^{d_{k}\times r} for all k[K]k\in[K].
4:     while not convergence do
5:         for k=1,,Kk=1,\ldots,K do
6:              Update 𝑨k\bm{A}_{k} while holding others fixed: 𝑨kargmin𝑨kdk×rωΩ|𝒴(ω)π|F(𝒵(ω)sgn(𝒴(ω)π))\bm{A}_{k}\leftarrow\operatorname*{arg\,min}_{\bm{A}_{k}\in\mathbb{R}^{d_{k}\times r}}\sum_{\omega\in\Omega}|\mathcal{Y}(\omega)-\pi|F(\mathcal{Z}(\omega)\textup{sgn}(\mathcal{Y}(\omega)-\pi)), where F()F(\cdot) is the large-margin loss, and 𝒵=s[r]𝒂s(1)𝒂s(K)\mathcal{Z}=\sum_{s\in[r]}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s} is a rank-rr tensor.
7:         end for
8:     end while
9:     Return 𝒵πs[r]𝒂s(1)𝒂s(K)\mathcal{Z}_{\pi}\leftarrow\sum_{s\in[r]}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s}.
10:end for
11:Estimated signal tensor Θ^=12H+1πsgn(𝒵π)\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}(\mathcal{Z}_{\pi}).

The rank constraint in the optimization (9) has been extensively studied in the literature. Recent developments involve convex norm relaxation (Ghadermarzy et al.,, 2018) and nonconvex optimization (Wang and Li,, 2020; Han et al.,, 2020). Unlike for matrices, computing the tensor convex norm is NP-hard, so we choose (nonconvex) alternating optimization due to its numerical efficiency. Briefly, we use the rank decomposition (2) of 𝒵=𝒵(𝑨1,,𝑨K)\mathcal{Z}=\mathcal{Z}(\bm{A}_{1},\ldots,\bm{A}_{K}) to optimize the unknown factor matrices 𝑨k=[𝒂1(k),,𝒂r(k)]dk×r\bm{A}_{k}=[\bm{a}^{(k)}_{1},\ldots,\bm{a}^{(k)}_{r}]\in\mathbb{R}^{d_{k}\times r}, where we choose to collect the tensor singular values into 𝑨K\bm{A}_{K}. We numerically solve (9) by optimizing one factor 𝑨k\bm{A}_{k} at a time while holding the others fixed. Each suboptimization reduces to a convex problem with a low-dimensional decision variable. Following common practice in tensor optimization (Anandkumar et al.,, 2014; Hong et al.,, 2020), we run the optimization from multiple initializations and retain the estimate with the lowest objective value. The full procedure is described in Algorithm 1.
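The sketch below gives a compact Python/numpy rendering of Algorithm 1 for order-3 tensors. It follows the alternating structure of Algorithm 1 with the hinge surrogate but, as a simplification, replaces each inner convex solve with plain (sub)gradient steps; the function names, learning rate, and iteration counts are illustrative assumptions rather than the authors' implementation:

import numpy as np

def cp_from_factors(A):
    """Rank-r tensor sum_s a_s^(1) o a_s^(2) o a_s^(3) from factor matrices A[k] (d_k x r)."""
    return np.einsum("ir,jr,kr->ijk", A[0], A[1], A[2])

def fit_sign_tensor(Y, omega, pi, rank, n_iter=200, lr=0.05, seed=0):
    """Estimate Z_pi by minimizing sum_w |Y(w)-pi| * hinge(Z(w) * sgn(Y(w)-pi))
    over rank-`rank` CP tensors Z; omega is a tuple of index arrays of observed entries."""
    rng = np.random.default_rng(seed)
    dims = Y.shape
    A = [0.1 * rng.standard_normal((d, rank)) for d in dims]
    w = np.abs(Y[omega] - pi)                      # entry-specific weights
    s = np.where(Y[omega] - pi >= 0, 1.0, -1.0)    # sgn(Y(w) - pi)
    for _ in range(n_iter):
        for k in range(3):                         # cycle over the three factors
            Z = cp_from_factors(A)
            m = Z[omega] * s                       # margins at observed entries
            g_out = np.where(m < 1, -w * s, 0.0)   # d/dZ(w) of the weighted hinge loss
            G = np.zeros(dims)
            np.add.at(G, omega, g_out)             # scatter the gradient back to tensor form
            # gradient w.r.t. A[k]: contract G with the other two factors
            others = [A[j] for j in range(3) if j != k]
            modes = "ijk".replace("ijk"[k], "")
            grad_k = np.einsum(f"ijk,{modes[0]}r,{modes[1]}r->{'ijk'[k]}r",
                               G, others[0], others[1])
            A[k] -= lr * grad_k / len(s)
    return cp_from_factors(A)

def nonparametric_completion(Y, omega, rank, H):
    """Theta_hat = (2H+1)^{-1} sum_pi sgn(Z_pi) over the grid pi in {-1,...,-1/H,0,1/H,...,1}."""
    levels = np.arange(-H, H + 1) / H
    out = np.zeros(Y.shape)
    for pi in levels:
        Z = fit_sign_tensor(Y, omega, pi, rank)
        out += np.where(Z >= 0, 1.0, -1.0)
    return out / len(levels)

In practice one would add multiple random initializations and a convergence check, as stated in Algorithm 1; the sketch keeps a fixed iteration budget for brevity.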

5 Simulations

In this section, we compare our nonparametric tensor method (NonParaT) with two alternative approaches: low-rank tensor CP decomposition (CPT), and the matrix version of our method applied to tensor unfoldings (NonParaM). We assess the performance under both complete and incomplete observations. The signal tensors are generated based on the four models listed in Table 1. The simulations cover a wide range of complexities, including block tensors, transformed low-rank tensors, and min/max hypergraphons with logarithmic and exponential functions. We consider order-3 tensors of equal dimension d1=d2=d3=dd_{1}=d_{2}=d_{3}=d, and set d{15,20,,55,60}d\in\{15,20,\ldots,55,60\}, r=2r=2, H=10+(d15)/5H=10+{(d-15)/5} in Algorithm 1. For NonParaM, we apply Algorithm 1 to each of the three unfolded matrices and report the average error. All summary statistics are averaged across 3030 replicates.

[Uncaptioned image]
Table 1: Simulation models used for comparison. We use 𝑴k{0,1}d×3\bm{M}_{k}\in\{0,1\}^{d\times 3} to denote membership matrices, 𝒞3×3×3\mathcal{C}\in\mathbb{R}^{3\times 3\times 3} the block means, and 𝒂=1d(1,2,,d)Td\bm{a}={1\over d}(1,2,\ldots,d)^{T}\in\mathbb{R}^{d}; 𝒵max\mathcal{Z}_{\max} and 𝒵min\mathcal{Z}_{\min} are order-3 tensors with entries 1dmax(i,j,k){1\over d}\max(i,j,k) and 1dmin(i,j,k){1\over d}\min(i,j,k), respectively.

Figure 3 compares the estimation error under full observation. The MAE decreases with the tensor dimension for all three methods. We find that our method NonParaT achieves the best performance in all scenarios, whereas the second-best method is CPT for models 1-2, and NonParaM for models 3-4. One possible reason is that models 1-2 have controlled multilinear tensor rank, which makes the tensor methods NonParaT and CPT more accurate than matrix methods. For models 3-4, the rank exceeds the tensor dimension, and therefore, the two nonparametric methods NonParaT and NonParaM exhibit a greater advantage for signal recovery.

Figure 4 shows the completion error against observation fraction. We fix d=40d=40 and gradually increase the observation fraction |Ω|d3{|\Omega|\over d^{3}} from 0.3 to 1. We find that NonParaT achieves the lowest error among all methods. Our simulation covers a reasonable range of complexities; for example, model 1 has 333^{3} jumps in the CDF of signal Θ\Theta, and models 2 and 4 have unbounded noise. Nevertheless, our method shows good performance in spite of model misspecification. This robustness is appealing in practice because the structure of underlying signal tensor is often unknown.

Refer to caption
Figure 3: Estimation error versus tensor dimension. Panels (a)-(d) correspond to simulation models 1-4 in Table 1.
Refer to caption
Figure 4: Completion error versus observation fraction. Panels (a)-(d) correspond to simulation models 1-4 in Table 1.

6 Data applications

We apply our method to two tensor datasets, the MRN-114 human brain connectivity data (Wang et al.,, 2017), and NIPS word occurrence data (Globerson et al.,, 2007).

6.1 Brain connectivity analysis

The brain dataset records the structural connectivity among 68 brain regions for 114 individuals along with their Intelligence Quotient (IQ) scores. We organize the connectivity data into an order-3 tensor, where entries encode the presence or absence of fiber connections between brain regions across individuals.

Refer to caption
Figure 5: Estimation error versus rank under different missing rate. Panels (a)-(d) correspond to missing rate 20%, 33%, 50%, and 67%, respectively. Error bar represents the standard error over 5-fold cross-validations.

Figure 5 shows the MAE based on 5-fold cross-validation with r=3,6,,15r=3,6,\ldots,15 and H=20H=20. We find that our method outperforms CPT in all combinations of ranks and missing rates. The achieved error reduction appears to be more pronounced as the missing rate increases. This trend highlights the applicability of our method to tensor completion tasks. In addition, our method exhibits a smaller standard error in the cross-validation experiments, as shown in Figure 5 and Table 2, demonstrating its stability over CPT. One possible reason is that our estimate is guaranteed to lie in [0,1][0,1] (for the binary tensor problem where 𝒴{0,1}d1××dK\mathcal{Y}\in\{0,1\}^{d_{1}\times\cdots\times d_{K}}), whereas the CPT estimate may fall outside the valid range [0,1][0,1].

                                                         MRN-114 brain connectivity dataset
  Method r=3r=3 r=6r=6 r=9r=9 r=12r=12 r=15r=15
NonparaT (Ours) 0.18(0.001){\bf 0.18}(0.001) 0.14(0.001){\bf 0.14}(0.001) 0.12(0.001){\bf 0.12}(0.001) 0.12(0.001){\bf 0.12}(0.001) 0.11(0.001){\bf 0.11}(0.001)
Low-rank CPT 0.26(0.006)0.26(0.006) 0.23(0.0060.23(0.006) 0.22(0.004)0.22(0.004) 0.21(0.006)0.21(0.006) 0.20(0.008)0.20(0.008)
                                                         NIPS word occurrence dataset
  Method r=3r=3 r=6r=6 r=9r=9 r=12r=12 r=15r=15
NonparaT (Ours) 0.18(0.002){\bf 0.18}(0.002) 0.16(0.002){\bf 0.16}(0.002) 0.15(0.001){\bf 0.15}(0.001) 0.14(0.001){\bf 0.14}(0.001) 0.13(0.001){\bf 0.13}(0.001)
Low-rank CPT 0.22(0.004)0.22(0.004) 0.20(0.007)0.20(0.007) 0.19(0.007)0.19(0.007) 0.17(0.007)0.17(0.007) 0.17(0.007)0.17(0.007)
Naive imputation (Baseline) 0.32(.001)0.32(.001)
Table 2: MAE comparison in the brain data and NIPS data analysis. Reported MAEs are averaged over five runs of cross-validation, with 20% entries for testing and 80% for training, with standard errors in parentheses. Bold numbers indicate the minimal MAE among three methods. For low-rank CPT, we use R function rTensor with default hyperparameters, and for our method, we set H=20H=20.

We next investigate the pattern in the estimated signal tensor. Figure 6a shows the identified top edges associated with IQ scores. Specifically, we first obtain a denoised tensor Θ^68×68×114\hat{\Theta}\in\mathbb{R}^{68\times 68\times 114} using our method with r=10r=10 and H=20H=20. Then, we perform a regression analysis of Θ^(i,j,:)114\hat{\Theta}(i,j,\colon)\in\mathbb{R}^{114} against the normalized IQ score across the 114 individuals. The regression model is repeated for each edge (i,j)[68]×[68](i,j)\in[68]\times[68]. We find that the top edges represent interhemispheric connections in the frontal lobes. The result is consistent with recent research relating brain connectivity to intelligence (Li et al.,, 2009; Wang et al.,, 2017).

6.2 NIPS data analysis

The NIPS dataset consists of word occurrence counts in papers published from 1987 to 2003. We focus on the top 100 authors and the 200 most frequent words, and normalize each word count by a log transformation with pseudo-count 1. The resulting dataset is an order-3 tensor whose entries represent the log counts of words by authors across years.

Table 2 compares the prediction accuracy of different methods. We find that our method substantially outperforms the low-rank CP method for every configuration under consideration. Further increasing the rank appears to have little effect on the performance. The comparison highlights the advantage of our method in achieving accuracy while maintaining low complexity. In addition, we also perform naive imputation, where the missing values are predicted by the sample average. Both our method and CPT outperform naive imputation, implying the necessity of incorporating the tensor structure in the analysis.

Refer to caption
Refer to caption
Figure 6: Estimated signal tensors in the data analysis. (a) top edges associated with IQ scores in the brain connectivity data. The color indicates the estimated IQ effect size. (b) top authors and words for years 1996-1999 in the NIPS data. Authors and words are ranked by marginal averages based on Θ^\hat{\Theta}, where the marginal average is denoted in the parentheses.

We next examine the estimated signal tensor Θ^\hat{\Theta} from our method. Figure 6b illustrates the results from the NIPS data, where we plot the entries in Θ^\hat{\Theta} corresponding to the top authors and the most frequent words (after excluding generic words such as figure, results, etc.). The identified pattern is consistent with the active topics in NIPS publications. Among the top words are neural (marginal mean = 1.95), learning (1.48), and network (1.21), whereas the top authors are T. Sejnowski (1.18), B. Scholkopf (1.17), M. Jordan (1.11), and G. Hinton (1.06). We also find strong heterogeneity among word occurrences across authors and years. For example, training and algorithm are popular words for B. Scholkopf and A. Smola in 1998-1999, whereas model occurs more often for M. Jordan and in 1996. The detected pattern and achieved accuracy demonstrate the applicability of our method.

7 Additional results and proofs

In this section, we provide additional results not covered in the previous sections. Section 7.1 gives a detailed explanation of the examples mentioned in Section 1. Section 7.2 supplements Section 3.1 by providing more theoretical results on the sign-rank and its relationship to the tensor rank. Section 7.3 collects the proofs of the theorems in the main text.

7.1 Sensitivity of tensor rank to monotonic transformations

In Section 1, we provided a motivating example showing the sensitivity of the tensor rank to monotonic transformations. Here, we describe the details of the example set-up.

Step 1 is to generate a rank-3 tensor 𝒵\mathcal{Z} based on the CP representation

𝒵=𝒂3+𝒃3+𝒄3,\mathcal{Z}=\bm{a}^{\otimes 3}+\bm{b}^{\otimes 3}+\bm{c}^{\otimes 3},

where 𝒂,𝒃,𝒄30\bm{a},\bm{b},\bm{c}\in\mathbb{R}^{30} are vectors consisting of N(0,1)N(0,1) entries, and the shorthand 𝒂3=𝒂𝒂𝒂\bm{a}^{\otimes 3}=\bm{a}\otimes\bm{a}\otimes\bm{a} denotes the Kronecker power. We then apply f(z)=(1+exp(cz))1f(z)=(1+\exp(-cz))^{-1} to 𝒵\mathcal{Z} entrywise, and obtain a transformed tensor Θ=f(𝒵)\Theta=f(\mathcal{Z}).

Step 2 is to determine the rank of Θ\Theta. Unlike for matrices, exact rank determination for tensors is NP-hard. Therefore, we choose to compute the numerical rank of Θ\Theta as an approximation. The numerical rank is defined as the minimal rank for which the relative approximation error falls below 0.10.1; i.e.,

r^(Θ)=min{s+:minΘ^:rank(Θ^)sΘΘ^FΘF0.1}.\hat{r}(\Theta)=\min\left\{s\in\mathbb{N}_{+}\colon\min_{\hat{\Theta}\colon\textup{rank}(\hat{\Theta})\leq s}{\lVert\Theta-\hat{\Theta}\rVert_{F}\over\lVert\Theta\rVert_{F}}\leq 0.1\right\}. (14)

We compute r^(Θ)\hat{r}(\Theta) by searching over s{1,,302}s\in\{1,\ldots,30^{2}\}, where for each ss we (approximately) solve the least-squares minimization using the CP function in the R package rTensor. We repeat Steps 1-2 ten times and plot the averaged numerical rank of Θ\Theta versus the transformation level cc in Figure 1a.
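The computation of (14) can also be sketched in Python as follows (the paper's computation used the R package rTensor; here we assume, as a substitute, tensorly's parafac and cp_to_tensor interface, and the steepness c=5 is an illustrative choice):

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def numerical_rank(Theta, tol=0.1, max_rank=50):
    """Smallest s whose best rank-s CP approximation has relative error <= tol, as in (14)."""
    nrm = tl.norm(Theta)
    for s in range(1, max_rank + 1):
        cp = parafac(tl.tensor(Theta), rank=s, n_iter_max=200)
        err = tl.norm(Theta - tl.cp_to_tensor(cp)) / nrm
        if err <= tol:
            return s
    return max_rank

# the transformed tensor from Step 1, at a given steepness c
rng = np.random.default_rng(0)
a, b, c_vec = (rng.standard_normal(30) for _ in range(3))
Z = np.einsum("i,j,k->ijk", a, a, a) + np.einsum("i,j,k->ijk", b, b, b) \
    + np.einsum("i,j,k->ijk", c_vec, c_vec, c_vec)
c = 5.0
Theta = 1 / (1 + np.exp(-c * Z))
print(numerical_rank(Theta))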

7.2 Tensor rank and sign-rank

In Section 3.1, we provided several tensor examples with high tensor rank but low sign-rank. This section provides more examples and their proofs. Unless otherwise specified, let Θ\Theta be an order-KK (d,,d)(d,\ldots,d)-dimensional tensor.

Example 6 (Max hypergraphon).

Suppose the tensor Θ\Theta takes the form

Θ(i1,,iK)=log(1+1dmax(i1,,iK)),for all (i1,,iK)[d]K.\Theta(i_{1},\ldots,i_{K})=\log\left(1+{1\over d}\max(i_{1},\ldots,i_{K})\right),\ \text{for all }(i_{1},\ldots,i_{K})\in[d]^{K}.

Then

rank(Θ)d,andsrank(Θπ)2for all π.\textup{rank}(\Theta)\geq d,\quad\text{and}\quad\textup{srank}(\Theta-\pi)\leq 2\ \text{for all }\pi\in\mathbb{R}.
Proof.

We first prove the results for K=2K=2. The full-rankness of Θ\Theta is verified by elementary row operations as follows:

((Θ2Θ1)/(log(1+2d)log(1+1d))(Θ3Θ2)/(log(1+3d)log(1+2d))(ΘdΘd1)/(log(1+dd)log(1+d1d))Θd/log(1+dd))=(100111111011111),\displaystyle\begin{pmatrix}(\Theta_{2}-\Theta_{1})/(\log(1+\frac{2}{d})-\log(1+\frac{1}{d}))\\ (\Theta_{3}-\Theta_{2})/(\log(1+\frac{3}{d})-\log(1+\frac{2}{d}))\\ \vdots\\ (\Theta_{d}-\Theta_{d-1})/(\log(1+\frac{d}{d})-\log(1+\frac{d-1}{d}))\\ \Theta_{d}/\log(1+\frac{d}{d})\end{pmatrix}=\begin{pmatrix}1&0&&&0\\ 1&1&\ddots&&\\ \vdots&\vdots&\ddots&\ddots&\\ 1&1&1&1&0\\ 1&1&1&1&1\end{pmatrix}, (15)

where Θi\Theta_{i} denotes the ii-th row of Θ\Theta. Now it suffices to show srank(Θπ)2\textup{srank}(\Theta-\pi)\leq 2 for π\pi in the feasible range (log(1+1d),log2)(\log(1+{1\over d}),\ \log 2). In this case, there exists an index i{2,,d}i^{*}\in\{2,\ldots,d\}, such that log(1+i1d)<πlog(1+id)\log(1+{i^{*}-1\over d})<\pi\leq\log(1+{i^{*}\over d}). By definition, the sign matrix sgn(Θπ)\textup{sgn}(\Theta-\pi) takes the form

sgn(Θ(i,j)π)={1,both i and j are smaller than i;1,otherwise.\textup{sgn}(\Theta(i,j)-\pi)=\begin{cases}-1,&\text{both $i$ and $j$ are smaller than $i^{*}$};\\ 1,&\text{otherwise}.\end{cases} (16)

Therefore, the matrix sgn(Θπ)\textup{sgn}(\Theta-\pi) is a rank-2 block matrix, which implies srank(Θπ)=2\textup{srank}(\Theta-\pi)=2.

We now extend the results to K3K\geq 3. By the definition of the tensor rank, the rank of a tensor is lower bounded by the rank of any of its matrix slices. So we have rank(Θ)rank(Θ(:,:,1,,1))=d\textup{rank}(\Theta)\geq\textup{rank}(\Theta(\colon,\colon,1,\ldots,1))=d. For the sign-rank with a feasible π\pi, notice that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) takes a similar form to (16),

sgn(Θ(i1,,iK)π)={1,ik<i for all k[K];1,otherwise,\textup{sgn}(\Theta(i_{1},\ldots,i_{K})-\pi)=\begin{cases}-1,&\text{$i_{k}<i^{*}$ for all $k\in[K]$};\\ 1,&\text{otherwise},\end{cases} (17)

where ii^{*} denotes the index that satisfies log(1+i1d)<πlog(1+id)\log(1+\frac{i^{*}-1}{d})<\pi\leq\log(1+\frac{i^{*}}{d}). Equation (17) implies that sgn(Θπ)=2𝒂K+1\textup{sgn}(\Theta-\pi)=-2\bm{a}^{\otimes K}+1, where 𝒂=(1,,1,0,,0)T\bm{a}=(1,\ldots,1,0,\ldots,0)^{T} has ii-th entry equal to 1 if i<ii<i^{*} and 0 otherwise. Hence srank(Θπ)=2\textup{srank}(\Theta-\pi)=2. ∎

In fact, Example 6 is a special case of the following proposition.

Proposition 4 (Min/Max hypergraphon).

Let 𝒵maxd1××dK\mathcal{Z}_{\max}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} denote a tensor with entries

𝒵max(i1,,iK)=max(xi1(1),,xiK(K)),\mathcal{Z}_{\max}(i_{1},\ldots,i_{K})=\max(x^{(1)}_{i_{1}},\ldots,x^{(K)}_{i_{K}}), (18)

where xik(k)[0,1]x^{(k)}_{i_{k}}\in[0,1] are given numbers for all ik[dk]i_{k}\in[d_{k}]. Let g:g\colon\mathbb{R}\to\mathbb{R} be a continuous function and Θ:=g(𝒵max)\Theta:=g(\mathcal{Z}_{\max}) be the transformed tensor. For a given π[1,1]\pi\in[-1,1], suppose the function g(z)=πg(z)=\pi has at most r1r\geq 1 distinct real roots. Then, the sign rank of (Θπ)(\Theta-\pi) satisfies

srank(Θπ)2r.\textup{srank}(\Theta-\pi)\leq 2r.

The same conclusion holds if we use min\min in place of max\max in (18).

Proof.

We reorder the tensor indices along each mode such that x1(k)xdk(k)x^{(k)}_{1}\leq\cdots\leq x^{(k)}_{d_{k}} for all k[K]k\in[K]. Based on the construction of 𝒵max\mathcal{Z}_{\max}, the reordering does not change the rank of 𝒵max\mathcal{Z}_{\max} or (Θπ)(\Theta-\pi). Let z1<<zrz_{1}<\cdots<z_{r} be the rr distinct real roots of the equation g(z)=πg(z)=\pi. We separate the proof into two cases, r=1r=1 and r2r\geq 2.

  • When r=1r=1. The continuity of g()g(\cdot) implies that the function (g(z)π)(g(z)-\pi) has at most one sign change point. Using a similar argument as in Example 6, we have

    sgn(Θπ)=12𝒂(1)𝒂(K) or sgn(Θπ)=2𝒂(1)𝒂(K)1,\displaystyle\textup{sgn}(\Theta-\pi)=1-2\bm{a}^{(1)}\otimes\cdots\otimes\bm{a}^{(K)}\quad\text{ or }\quad\textup{sgn}(\Theta-\pi)=2\bm{a}^{(1)}\otimes\cdots\otimes\bm{a}^{(K)}-1, (19)

    where 𝒂(k)\bm{a}^{(k)} are binary vectors defined by

    \bm{a}^{(k)}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}<z_{1}$}}0,\ldots,0)^{T},\quad\text{for }k\in[K].

    Therefore, srank(Θπ)rank(sgn(Θπ))=2\textup{srank}(\Theta-\pi)\leq\textup{rank}(\textup{sgn}(\Theta-\pi))=2.

  • When r2r\geq 2. By continuity, the function (g(z)π)(g(z)-\pi) is nonzero and keeps a constant sign in each of the intervals (zs,zs+1)(z_{s},z_{s+1}) for 1sr11\leq s\leq r-1. Define the index set ={s+:g(z)<π on the interval (zs,zs+1)}\mathcal{I}=\{s\in\mathbb{N}_{+}\colon\text{$g(z)<\pi$ on the interval $(z_{s},z_{s+1})$}\}. We now prove that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) has rank bounded by 2r12r-1. To see this, consider the tensor indices for which sgn(Θπ)=1\textup{sgn}(\Theta-\pi)=-1,

    {ω:Θ(ω)π<0}\displaystyle\{\omega\colon\Theta(\omega)-\pi<0\} ={ω:g(𝒵max(ω))<π}\displaystyle=\{\omega\colon g(\mathcal{Z}_{\max}(\omega))<\pi\}
    =s{ω:𝒵max(ω)(zs,zs+1)}\displaystyle=\cup_{s\in\mathcal{I}}\{\omega\colon\mathcal{Z}_{\max}(\omega)\in(z_{s},z_{s+1})\}
    =s({ω:xik(k)<zs+1 for all k[K]}{ω:xik(k)zs for all k[K]}c).\displaystyle=\cup_{s\in\mathcal{I}}\Big{(}\{\omega\colon\text{$x^{(k)}_{i_{k}}<z_{s+1}$ for all $k\in[K]$}\}\cap\{\omega\colon\text{$x^{(k)}_{i_{k}}\leq z_{s}$ for all $k\in[K]$}\}^{c}\Big{)}. (20)

    Equation (20) is equivalent to

    𝟙(Θ(i1,,iK)<π)\displaystyle\mathds{1}(\Theta(i_{1},\ldots,i_{K})<\pi) =s(k𝟙(xik(k)<zs+1)k𝟙(xik(k)zs)),\displaystyle=\sum_{s\in\mathcal{I}}\left(\prod_{k}\mathds{1}(x^{(k)}_{i_{k}}<z_{s+1})-\prod_{k}\mathds{1}(x^{(k)}_{i_{k}}\leq z_{s})\right), (21)

    for all (i1,,iK)[d1]××[dK](i_{1},\ldots,i_{K})\in[d_{1}]\times\cdots\times[d_{K}], where 𝟙(){0,1}\mathds{1}(\cdot)\in\{0,1\} denotes the indicator function. Equation (21) implies the following low-rank representation of sgn(Θπ)\textup{sgn}(\Theta-\pi):

    sgn(Θπ)=12s(𝒂s+1(1)𝒂s+1(K)𝒂¯s(1)𝒂¯s(K)),\textup{sgn}(\Theta-\pi)=1-2\sum_{s\in\mathcal{I}}\left(\bm{a}^{(1)}_{s+1}\otimes\cdots\otimes\bm{a}^{(K)}_{s+1}-\bar{\bm{a}}^{(1)}_{s}\otimes\cdots\otimes\bar{\bm{a}}^{(K)}_{s}\right), (22)

    where we have denoted the two binary vectors

    \bm{a}^{(k)}_{s+1}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}<z_{s+1}$}}0,\ldots,0)^{T},\quad\text{and}\quad\bar{\bm{a}}^{(k)}_{s}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}\leq z_{s}$}}0,\ldots,0)^{T}.

    Therefore, by (22) and the fact that |\mathcal{I}|\leq r-1, we conclude that

    srank(Θπ)1+2(r1)=2r1.\textup{srank}(\Theta-\pi)\leq 1+2(r-1)=2r-1.

Combining the two cases yields \textup{srank}(\Theta-\pi)\leq 2r for all r\geq 1. ∎
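The construction above is easy to verify numerically. The following sketch (ours, not part of the proof) checks the indicator identity (21) and the representation (22) for the illustrative choices g(z)=\cos(7z), \pi=0, and K=3; the two roots of g(z)=\pi that fall in the range of \mathcal{Z}_{\max} play the role of z_{1}<z_{2}, so |\mathcal{I}|=1 and the representation uses 1+2|\mathcal{I}|=3 rank-one terms. All variable names are illustrative.

```python
import numpy as np

# Minimal numerical check of (21)-(22) for the max-interaction model
# Theta = g(Z_max) with g(z) = cos(7z), pi = 0, and K = 3 (illustrative choices).
rng = np.random.default_rng(0)
d = 20
x = np.sort(rng.uniform(0, 1, d))            # shared coordinates x^{(k)} for all modes
z1, z2 = np.pi / 14, 3 * np.pi / 14          # roots of cos(7z) = 0 inside (0, 1)

X1, X2, X3 = x[:, None, None], x[None, :, None], x[None, None, :]
Zmax = np.maximum(np.maximum(X1, X2), X3)    # Z_max(omega) = max_k x^{(k)}_{i_k}
Theta = np.cos(7 * Zmax)                     # signal tensor Theta = g(Z_max)

# Right-hand side of (21): difference of two rank-one indicator products (s = 1)
a2 = (x < z2).astype(int)                    # a^{(k)}_{s+1}
a1 = (x <= z1).astype(int)                   # abar^{(k)}_{s}
rhs = (a2[:, None, None] * a2[None, :, None] * a2[None, None, :]
       - a1[:, None, None] * a1[None, :, None] * a1[None, None, :])

assert np.array_equal((Theta < 0).astype(int), rhs)    # identity (21) holds entrywise
assert np.array_equal(1 - 2 * rhs, np.sign(Theta))     # representation (22): 2r - 1 = 3 terms
```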

We next provide several additional examples such that \textup{rank}(\Theta)\geq d whereas \textup{srank}(\Theta)\leq c for a constant c independent of d. We state the examples in the matrix case, i.e., K=2. A similar conclusion extends to K\geq 3 by the following proposition.

Proposition 5.

Let 𝐌d1×d2\bm{M}\in\mathbb{R}^{d_{1}\times d_{2}} be a matrix. For any given K3K\geq 3, define an order-KK tensor Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} by

Θ=𝑴𝟏d3𝟏dK,\Theta=\bm{M}\otimes\mathbf{1}_{d_{3}}\otimes\cdots\otimes\mathbf{1}_{d_{K}},

where 𝟏dkdk\mathbf{1}_{d_{k}}\in\mathbb{R}^{d_{k}} denotes an all-one vector, for 3kK3\leq k\leq K. Then we have

rank(Θ)=rank(𝑴),andsrank(Θπ)=srank(𝑴π)for all π.\textup{rank}(\Theta)=\textup{rank}(\bm{M}),\quad\text{and}\quad\textup{srank}(\Theta-\pi)=\textup{srank}(\bm{M}-\pi)\ \text{for all $\pi\in\mathbb{R}$}.
Proof.

The conclusion directly follows from the definition of tensor rank. ∎

Example 7 (Stacked banded matrices).

Let 𝒂=(1,2,,d)T\bm{a}=(1,2,\ldots,d)^{T} be a dd-dimensional vector, and define a dd-by-dd banded matrix 𝑴=|𝒂𝟏𝟏𝒂|\bm{M}=|\bm{a}\otimes\mathbf{1}-\mathbf{1}\otimes\bm{a}|. Then

rank(𝑴)=d,andsrank(𝑴π)3,for all π.\textup{rank}(\bm{M})=d,\quad\text{and}\quad\textup{srank}(\bm{M}-\pi)\leq 3,\quad\text{for all }\pi\in\mathbb{R}.
Proof.

Note that 𝑴\bm{M} is a banded matrix with entries

𝑴(i,j)=|ij|,for all (i,j)[d]2.\bm{M}(i,j)={|i-j|},\quad\text{for all }(i,j)\in[d]^{2}.

Elementary row operation directly shows that 𝑴\bm{M} is full rank as follows,

((𝑴1+𝑴d)/(d1)𝑴1𝑴2𝑴2𝑴3𝑴d1𝑴d)=(11111111111111111111).\displaystyle\begin{pmatrix}(\bm{M}_{1}+\bm{M}_{d})/(d-1)\\ \bm{M}_{1}-\bm{M}_{2}\\ \bm{M}_{2}-\bm{M}_{3}\\ \vdots\\ \bm{M}_{d-1}-\bm{M}_{d}\end{pmatrix}=\begin{pmatrix}1&1&1&\ldots&1&1\\ -1&1&1&\ldots&1&1\\ -1&-1&1&\ldots&1&1\\ \vdots\\ -1&-1&-1&\ldots&-1&1\end{pmatrix}. (23)

We now show srank(𝑴π)3\textup{srank}(\bm{M}-\pi)\leq 3 by construction. Define two vectors 𝒃=(21,22,,2d)Td\bm{b}=(2^{-1},2^{-2},\ldots,2^{-d})^{T}\in\mathbb{R}^{d} and rev(𝒃)=(2d,,21)Td\text{rev}(\bm{b})=(2^{-d},\ldots,2^{-1})^{T}\in\mathbb{R}^{d}. We construct the following matrix

𝑨=𝒃rev(𝒃)+rev(𝒃)𝒃.\bm{A}=\bm{b}\otimes\text{rev}(\bm{b})+\text{rev}(\bm{b})\otimes\bm{b}. (24)

The matrix 𝑨d×d\bm{A}\in\mathbb{R}^{d\times d} is banded with entries

\bm{A}(i,j)=\bm{A}(j,i)=\bm{A}(d+1-i,d+1-j)=\bm{A}(d+1-j,d+1-i)=2^{-d-1}\left(2^{j-i}+2^{i-j}\right),\ \text{for all }(i,j)\in[d]^{2}.

Furthermore, the entry value \bm{A}(i,j) increases with respect to |i-j|; i.e.,

𝑨(i,j)𝑨(i,j),for all |ij||ij|.\bm{A}(i,j)\geq\bm{A}(i^{\prime},j^{\prime}),\quad\text{for all }|i-j|\geq|i^{\prime}-j^{\prime}|. (25)

Notice that for a given \pi\in\mathbb{R}, there exists \pi^{\prime}\in\mathbb{R} such that \textup{sgn}(\bm{A}-\pi^{\prime})=\textup{sgn}(\bm{M}-\pi). This is because both \bm{A} and \bm{M} are banded matrices whose entries increase in |i-j|, in the sense of (25). By definition (24), \bm{A} is a rank-2 matrix, so \textup{rank}(\bm{A}-\pi^{\prime})\leq 3. Hence, \textup{srank}(\bm{M}-\pi)=\textup{srank}(\bm{A}-\pi^{\prime})\leq\textup{rank}(\bm{A}-\pi^{\prime})\leq 3. ∎
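A quick numerical illustration of Example 7 (ours, not part of the proof): \bm{M} is full rank, while thresholding the rank-2 matrix \bm{A} from (24) between two bands reproduces \textup{sgn}(\bm{M}-\pi); the threshold values below are one illustrative choice.

```python
import numpy as np

# Numerical illustration of Example 7: M(i,j) = |i-j| has full rank, yet its
# sign pattern at a threshold pi is matched by a shift of the rank-2 matrix A in (24).
d = 10
idx = np.arange(1, d + 1)
M = np.abs(idx[:, None] - idx[None, :])

b = 2.0 ** (-idx)                                  # b = (2^{-1}, ..., 2^{-d})
A = np.outer(b, b[::-1]) + np.outer(b[::-1], b)    # rank-2 construction (24)

assert np.linalg.matrix_rank(M) == d               # M itself is full rank

# Threshold M between bands |i-j| = 3 and |i-j| = 4, and threshold A accordingly;
# both matrices increase in |i-j|, so the two sign patterns coincide.
pi = 3.5
pi_prime = 2.0 ** (-d - 1) * (2 ** 3.5 + 2 ** (-3.5))
assert np.array_equal(np.sign(M - pi), np.sign(A - pi_prime))
assert np.linalg.matrix_rank(A - pi_prime) <= 3    # hence srank(M - pi) <= 3
```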

Remark 1.

The tensor analogy of banded matrices Θ=|𝒂𝟏𝟏𝟏𝒂𝟏|\Theta=|\bm{a}\otimes\mathbf{1}\otimes\mathbf{1}-\mathbf{1}\otimes\bm{a}\otimes\mathbf{1}| is used as simulation model 3 in Table 1.

Example 8 (Stacked identity matrices).

Let 𝑰\bm{I} be a dd-by-dd identity matrix. Then

rank(𝑰)=d,andsrank(𝑰π)3for all π.\textup{rank}(\bm{I})=d,\quad\text{and}\quad\textup{srank}(\bm{I}-\pi)\leq 3\ \text{for all }\pi\in\mathbb{R}.
Proof.

Depending on the value of π\pi, the sign matrix sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) falls into one of the three cases: 1) sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) is a matrix of all 11; 2) sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) is a matrix of all 1-1; 3) sgn(𝑰π)=2𝑰𝟏d𝟏d\textup{sgn}(\bm{I}-\pi)=2\bm{I}-\mathbf{1}_{d}\otimes\mathbf{1}_{d}. The former two cases are trivial, so it suffices to show srank(𝑰π)3\textup{srank}(\bm{I}-\pi)\leq 3 in the third case.

Based on Example 7, the rank-2 matrix 𝑨\bm{A} in (24) satisfies

𝑨(i,j){=2d,i=j,2d+2d2,ij.\bm{A}(i,j)\begin{cases}=2^{-d},&i=j,\\ \geq 2^{-d}+2^{-d-2},&i\neq j.\end{cases}

Therefore, \textup{sgn}\left(2^{-d}+2^{-d-3}-\bm{A}\right)=2\bm{I}-\mathbf{1}_{d}\otimes\mathbf{1}_{d}. We conclude that \textup{srank}(\bm{I}-\pi)\leq\textup{rank}(2^{-d}+2^{-d-3}-\bm{A})\leq 3. ∎
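The third case can be checked numerically as well; the sketch below (ours, not part of the proof) verifies that the shift of \bm{A} reproduces \textup{sgn}(\bm{I}-\pi) with rank at most 3.

```python
import numpy as np

# Numerical check of Example 8, third case: sgn(shift - A) = 2I - 11^T with A from (24).
d = 10
idx = np.arange(1, d + 1)
b = 2.0 ** (-idx)
A = np.outer(b, b[::-1]) + np.outer(b[::-1], b)     # rank-2 matrix from (24)

shift = 2.0 ** (-d) + 2.0 ** (-d - 3)
target = 2 * np.eye(d) - np.ones((d, d))            # = sgn(I - pi) for any 0 < pi < 1
assert np.array_equal(np.sign(shift - A), target)
assert np.linalg.matrix_rank(shift - A) <= 3        # hence srank(I - pi) <= 3
```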

7.3 Proofs

7.3.1 Proofs of Propositions 1-3

Proof of Proposition 1.

The strict monotonicity of g implies that the inverse function g^{-1}\colon\mathbb{R}\to\mathbb{R} is well-defined. When g is strictly increasing, the mapping x\mapsto g(x) is sign preserving. Specifically, if x\geq 0, then g(x)\geq g(0)=0. Conversely, if g(x)\geq 0=g(0), then applying g^{-1} to both sides gives x\geq 0. When g is strictly decreasing, the mapping x\mapsto g(x) is sign reversing. Specifically, if x\geq 0, then g(x)\leq g(0)=0. Conversely, if g(x)\geq 0=g(0), then applying g^{-1} to both sides gives x\leq 0. Therefore, \Theta\simeq g(\Theta) or \Theta\simeq-g(\Theta). Since multiplication by a nonzero constant changes neither the rank nor the sign rank, we have \textup{srank}(\Theta)=\textup{srank}(g(\Theta))\leq\textup{rank}(g(\Theta)). ∎

Proof of Proposition 2.

See Section 7.2 for constructive examples. ∎

Proof of Proposition 3.

Fix π[1,1]\pi\in[-1,1]. Based on the definition of classification loss L(,)L(\cdot,\cdot), the function Risk()\textup{Risk}(\cdot) relies only on the sign pattern of the tensor. Therefore, without loss of generality, we assume both Θ¯,𝒵{1,1}d1××dK\bar{\Theta},\mathcal{Z}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}} are binary tensors. We evaluate the excess risk

\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}_{\omega\sim\Pi}\underbrace{\mathbb{E}_{\mathcal{Y}(\omega)}\left\{|\mathcal{Y}(\omega)-\pi|\left[\left|\mathcal{Z}(\omega)-\textup{sgn}(\bar{\mathcal{Y}}(\omega))\right|-\left|\bar{\Theta}(\omega)-\textup{sgn}(\bar{\mathcal{Y}}(\omega))\right|\right]\right\}}_{\stackrel{\text{def}}{=}I(\omega)}. (26)

Denote y=𝒴(ω)y=\mathcal{Y}(\omega), z=𝒵(ω)z=\mathcal{Z}(\omega), θ¯=Θ¯(ω)\bar{\theta}=\bar{\Theta}(\omega), and θ=Θ(ω)\theta=\Theta(\omega). The expression of I(ω)I(\omega) is simplified as

I(ω)\displaystyle I(\omega) =𝔼y[(yπ)(θ¯z)𝟙(yπ)+(πy)(zθ¯)𝟙(y<π)]\displaystyle=\mathbb{E}_{y}\left[(y-\pi)(\bar{\theta}-z)\mathds{1}(y\geq\pi)+(\pi-y)(z-\bar{\theta})\mathds{1}(y<\pi)\right]
=𝔼y[(θ¯z)(yπ)]\displaystyle=\mathbb{E}_{y}\left[(\bar{\theta}-z)(y-\pi)\right]
=[sgn(θπ)z](θπ)\displaystyle=\left[\textup{sgn}(\theta-\pi)-z\right]\left(\theta-\pi\right)
=|sgn(θπ)z||θπ|0,\displaystyle=|\textup{sgn}(\theta-\pi)-z||\theta-\pi|\geq 0, (27)

where the third line uses the fact \mathbb{E}y=\theta and \bar{\theta}=\textup{sgn}(\theta-\pi), and the last line uses the assumption z\in\{-1,1\}. The equality in (27) is attained when z=\textup{sgn}(\theta-\pi) or \theta=\pi. Combining (27) with (26), we conclude that, for all \mathcal{Z}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}},

\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}_{\omega\sim\Pi}|\textup{sgn}(\Theta(\omega)-\pi)-\mathcal{Z}(\omega)||\Theta(\omega)-\pi|\geq 0. (28)

In particular, setting 𝒵=Θ¯=sgn(Θπ)\mathcal{Z}=\bar{\Theta}=\textup{sgn}(\Theta-\pi) in (28) yields the minimum. Therefore,

Risk(Θ¯)=min{Risk(𝒵):𝒵d1××dK}min{Risk(𝒵):rank(𝒵)r}.\textup{Risk}(\bar{\Theta})=\min\{\textup{Risk}(\mathcal{Z})\colon\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\}\leq\min\{\textup{Risk}(\mathcal{Z})\colon\textup{rank}(\mathcal{Z})\leq r\}.

Since srank(Θπ)r\textup{srank}(\Theta-\pi)\leq r by assumption, the last inequality becomes equality. The proof is complete. ∎
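As a sanity check (ours, not part of the proof), the optimality of \textup{sgn}(\Theta-\pi) under the \pi-weighted risk can be verified by simulation for a single entry; the sketch below assumes Gaussian noise purely for illustration.

```python
import numpy as np

# Monte Carlo check of Proposition 3 for a single entry with mean theta:
# among z in {-1, +1}, E |Y - pi| * |z - sgn(Y - pi)| is minimized at z = sgn(theta - pi).
rng = np.random.default_rng(0)

def weighted_risk(z, theta, pi, n=200_000):
    y = theta + rng.normal(scale=0.5, size=n)     # noisy observations (illustrative noise)
    return np.mean(np.abs(y - pi) * np.abs(z - np.sign(y - pi)))

for theta, pi in [(0.3, 0.0), (-0.4, 0.1), (0.2, 0.5)]:
    risks = {z: weighted_risk(z, theta, pi) for z in (-1, +1)}
    assert min(risks, key=risks.get) == np.sign(theta - pi)   # Bayes sign wins
```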

7.3.2 Proof of Theorem 1

Proof of Theorem 1.

Fix π[1,1]\pi\in[-1,1]. Based on (28) in Proposition 3 we have

Risk(𝒵)Risk(Θ¯)=𝔼[|sgn𝒵sgnΘ¯||Θ¯|].\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}\left[|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}||\bar{\Theta}|\right]. (29)

The Assumption 1 states that

(|Θ¯|t)ctα,for all 0t<ρ(π,𝒩).\mathbb{P}\left(|\bar{\Theta}|\leq t\right)\leq ct^{\alpha},\quad\text{for all }0\leq t<\rho(\pi,\mathcal{N}). (30)

Unless otherwise specified, all relevant probability statements, such as \mathbb{E} and \mathbb{P}, are taken with respect to \omega\sim\Pi.

We divide the proof into two cases: α>0\alpha>0 and α=\alpha=\infty.

  • Case 1: α>0\alpha>0.

    By (29), for all 0t<ρ(π,𝒩)0\leq t<\rho(\pi,\mathcal{N}),

    \textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq t\,\mathbb{E}\left(|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}|\mathds{1}\{|\bar{\Theta}|>t\}\right)
    2t(sgn𝒵sgnΘ¯ and |Θ¯|>t)\displaystyle\geq 2t\mathbb{P}\left(\textup{sgn}\mathcal{Z}\neq\textup{sgn}\bar{\Theta}\text{ and }|\bar{\Theta}|>t\right)
    2t{(sgn𝒵sgnΘ¯)(|Θ¯|t)}\displaystyle\geq 2t\Big{\{}\mathbb{P}\left(\textup{sgn}\mathcal{Z}\neq\textup{sgn}\bar{\Theta}\right)-\mathbb{P}\left(|\bar{\Theta}|\leq t\right)\Big{\}}
    t{MAE(sgn𝒵,sgnΘ¯)2ctα},\displaystyle\geq t\Big{\{}\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})-2ct^{\alpha}\Big{\}}, (31)

    where the last line follows from the definition of MAE and (30). We maximize the lower bound (31) with respect to t, and obtain the optimal t_{\text{opt}} (checked numerically in the sketch following this proof),

    topt={ρ(π,𝒩),if MAE(sgn𝒵,sgnΘ¯)>2c(1+α)ρα(π,𝒩),[12c(1+α)MAE(sgn𝒵,sgnΘ¯)]1/α,if MAE(sgn𝒵,sgnΘ¯)2c(1+α)ρα(π,𝒩).t_{\text{opt}}=\begin{cases}\rho(\pi,\mathcal{N}),&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})>2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\\ \left[{1\over 2c(1+\alpha)}\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\right]^{1/\alpha},&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq 2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}).\end{cases}

    The corresponding lower bound of the inequality (31) becomes

    Risk(𝒵)Risk(Θ¯){c1ρ(π,𝒩)MAE(sgn𝒵,sgnΘ¯),if MAE(sgn𝒵,sgnΘ¯)>2c(1+α)ρα(π,𝒩),c2[MAE(sgn𝒵,sgnΘ¯)]1+αα,if MAE(sgn𝒵,sgnΘ¯)2c(1+α)ρα(π,𝒩),\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq\begin{cases}c_{1}\rho(\pi,\mathcal{N})\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}),&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})>2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\\ c_{2}\left[\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\right]^{1+\alpha\over\alpha},&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq 2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\end{cases}

    where c1,c2>0c_{1},c_{2}>0 are two constants independent of 𝒵\mathcal{Z}. Combining both cases gives

    MAE(sgn𝒵,sgnΘ¯)\displaystyle\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}) [Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)]\displaystyle\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right] (32)
    C(π)[Risk(𝒵)Risk(Θ¯)]α1+α,\displaystyle\leq C(\pi)[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}, (33)

    where C(π)>0C(\pi)>0 is a multiplicative factor independent of 𝒵\mathcal{Z}.

  • Case 2: \alpha=\infty. The inequality (31) now becomes

    Risk(𝒵)Risk(Θ¯)tMAE(sgnΘ¯,sgn𝒵),for all 0t<ρ(π,𝒩).\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq t\textup{MAE}(\textup{sgn}\bar{\Theta},\textup{sgn}\mathcal{Z}),\quad\text{for all }0\leq t<\rho(\pi,\mathcal{N}). (34)

    The conclusion follows by taking t={\rho(\pi,\mathcal{N})\over 2} in the inequality (34). ∎
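The maximization step in Case 1 of the proof above admits a direct numerical check. The sketch below (ours, with arbitrary illustrative parameter values; m plays the role of MAE) compares the closed-form maximizer of t\mapsto t(m-2ct^{\alpha}) over [0,\rho) with a grid search.

```python
import numpy as np

# Check the maximizer t_opt of h(t) = t * (m - 2 c t^alpha) over [0, rho) used in Case 1.
def t_opt(m, c, alpha, rho):
    t_star = (m / (2 * c * (1 + alpha))) ** (1 / alpha)   # stationary point of h
    return min(t_star, rho)                               # clip to the feasible range

m, c, alpha, rho = 0.3, 1.0, 2.0, 0.25                    # illustrative values
grid = np.linspace(1e-6, rho - 1e-6, 100_000)
h = grid * (m - 2 * c * grid ** alpha)
assert abs(grid[np.argmax(h)] - t_opt(m, c, alpha, rho)) < 1e-3
```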

Remark 2.

The proof of Theorem 1 shows that, under Assumption 1,

MAE(sgn𝒵,sgnΘ¯)[Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)],\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right], (35)

for all \mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}. For fixed \pi, the factor 1/\rho(\pi,\mathcal{N}) is a constant, so the second term can be absorbed into the first.

7.3.3 Proof of Theorem 2

The following lemma provides the variance-to-mean relationship implied by the α\alpha-smoothness of Θ\Theta. The relationship plays a key role in determining the convergence rate based on empirical process theory (Shen and Wong,, 1994).

Lemma 1 (Variance-to-mean relationship).

Consider the same setup as in Theorem 2. Fix π[1,1]\pi\in[-1,1]. Let L(𝒵,Y¯Ω)L(\mathcal{Z},\bar{Y}_{\Omega}) be the π\pi-weighted classification loss

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over|\Omega|}\sum_{\omega\in\Omega}\ \underbrace{|\bar{\mathcal{Y}}(\omega)|}_{\text{weight}}\ \times\ \underbrace{|\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|}_{\text{classification loss}}={1\over|\Omega|}\sum_{\omega\in\Omega}\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}), (36)

where we have denoted the function ω(𝒵,𝒴¯)=def|𝒴¯(ω)||sgn𝒵(ω)sgn𝒴¯(ω)|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}})\stackrel{{\scriptstyle\text{def}}}{{=}}|\bar{\mathcal{Y}}(\omega)||\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|. Under Assumption 1 of the (α,π)(\alpha,\pi)-smoothness of Θ\Theta, we have

Var[ω(𝒵,𝒴¯)ω(Θ¯,𝒴¯Ω)][Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)],\textup{Var}[\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})]\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})], (37)

for all tensors 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}. Here the expectation and variance are taken with respect to both 𝒴\mathcal{Y} and ωΠ\omega\sim\Pi.

Proof of Lemma 1.

We expand the variance by

Var[ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)]\displaystyle\text{Var}[\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})] 𝔼|ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)|2\displaystyle\lesssim\mathbb{E}|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})|^{2}
𝔼|ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)|\displaystyle\lesssim\mathbb{E}|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})|
𝔼|sgn𝒵sgnΘ¯|=MAE(sgn𝒵,sgnΘ¯),\displaystyle\leq\mathbb{E}|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}|=\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}), (38)

where the second line comes from the boundedness of the classification loss L(\cdot,\cdot), and the third line comes from the inequality \left||a-b|-|c-b|\right|\leq|a-c| for a,b,c\in\{-1,1\}, together with the boundedness of the classification weight |\bar{\mathcal{Y}}(\omega)|. Here we have absorbed the constant multipliers into \lesssim. The conclusion (37) then directly follows by applying Remark 2 to (38). ∎
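For concreteness, the weighted loss (36) is straightforward to compute from data; the sketch below (ours) evaluates it for an order-3 tensor, reading \bar{\mathcal{Y}}=\mathcal{Y}-\pi as in (26), with all variable names illustrative.

```python
import numpy as np

# Sketch of the pi-weighted empirical classification loss (36): the average over Omega
# of |Ybar(omega)| times the sign disagreement |sgn Z(omega) - sgn Ybar(omega)|.
def weighted_loss(Z, Y, pi, omega):
    """Z, Y: arrays of the same shape; omega: tuple of index arrays for observed entries."""
    ybar = Y[omega] - pi                          # shifted observations Ybar = Y - pi
    return np.mean(np.abs(ybar) * np.abs(np.sign(Z[omega]) - np.sign(ybar)))

# toy usage on an order-3 tensor with 500 observed entries
rng = np.random.default_rng(1)
Y = rng.uniform(-1, 1, size=(10, 10, 10))
Z = rng.normal(size=(10, 10, 10))
omega = tuple(rng.integers(0, 10, size=500) for _ in range(3))
print(weighted_loss(Z, Y, pi=0.2, omega=omega))
```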

Proof of Theorem 2.

Fix π[1,1]\pi\in[-1,1]. For notational simplicity, we suppress the subscript π\pi and write 𝒵^\hat{\mathcal{Z}} in place of 𝒵^π\hat{\mathcal{Z}}_{\pi}. Denote n=|Ω|n=|\Omega| and ρ=ρ(π,𝒩)\rho=\rho(\pi,\mathcal{N}).

Because the classification loss L(,)L(\cdot,\cdot) is scale-free, i.e., L(𝒵,)=L(c𝒵,)L(\mathcal{Z},\cdot)=L(c\mathcal{Z},\cdot) for every c>0c>0, we consider the estimation subject to 𝒵F1\lVert\mathcal{Z}\rVert_{F}\leq 1 without loss of generality. Specifically, let

𝒵^=argmin𝒵:rank(𝒵)r,𝒵F1L(𝒵,𝒴¯Ω).\hat{\mathcal{Z}}=\operatorname*{arg\,min}_{\mathcal{Z}\colon\textup{rank}(\mathcal{Z})\leq r,\lVert\mathcal{Z}\rVert_{F}\leq 1}L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}).

We next apply empirical process theory to bound the excess risk of \hat{\mathcal{Z}}. To facilitate the analysis, we view the data \bar{\mathcal{Y}}_{\Omega}=\{\bar{\mathcal{Y}}(\omega)\colon\omega\in\Omega\} as a collection of n independent random variables, where the randomness is from both \bar{\mathcal{Y}} and \omega\sim\Pi. Write the index set \Omega=\{1,\ldots,n\}, so the loss function (36) becomes

L(𝒵,𝒴¯Ω)=1ni=1ni(𝒵,𝒴¯).L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over n}\sum_{i=1}^{n}\ell_{i}(\mathcal{Z},\bar{\mathcal{Y}}).

We use f_{\mathcal{Z}}\colon[d_{1}]\times\cdots\times[d_{K}]\to\mathbb{R} to denote the function induced by the tensor \mathcal{Z} such that f_{\mathcal{Z}}(\omega)=\mathcal{Z}(\omega) for \omega\in[d_{1}]\times\cdots\times[d_{K}]. Under this setup, the quantity of interest

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-L(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})={1\over n}\sum_{i=1}^{n}\underbrace{\left[\ell_{i}(\mathcal{Z},\bar{\mathcal{Y}})-\ell_{i}(\bar{\Theta},\bar{\mathcal{Y}})\right]}_{\stackrel{\text{def}}{=}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})}, (39)

is an empirical process indexed by functions f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}, where \mathcal{T}=\{\mathcal{Z}\colon\textup{rank}(\mathcal{Z})\leq r,\ \lVert\mathcal{Z}\rVert_{F}\leq 1\}. Note that there is a one-to-one correspondence between the sets \mathcal{F}_{\mathcal{T}} and \mathcal{T}.

The remaining proof adapts the techniques of Wang et al., (2008, Theorem 3) to bound (39) over the function family f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}. We summarize only the key differences here and refer to Wang et al., (2008) for the complete proof. Based on Lemma 1, the (\alpha,\pi)-smoothness of \Theta implies

VarΔi(f𝒵,𝒴¯)[𝔼Δi(f𝒵,𝒴¯)]α1+α+1ρ𝔼Δi(f𝒵,𝒴¯),for all f𝒵𝒯.\textup{Var}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})\lesssim\left[\mathbb{E}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})\right]^{\alpha\over 1+\alpha}+{1\over\rho}\mathbb{E}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}}),\quad\text{for all $f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}$}. (40)

Applying local iterative techniques in Wang et al., (2008, Theorem 3) to the empirical process (39) with the variance-to-mean relationship (40) gives that

(Risk(𝒵^)Risk(Θ¯)Ln)exp(nLn),\mathbb{P}\left(\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})\geq L_{n}\right)\lesssim\exp(-nL_{n}), (41)

where the convergence rate Ln>0L_{n}>0 is determined by the solution to the following inequality,

1LnLnLnα/(α+1)+Lnρ[](ε,𝒯,2)𝑑εCn,{1\over L_{n}}\int_{L_{n}}^{\sqrt{L_{n}^{\alpha/(\alpha+1)}+{L_{n}\over\rho}}}\sqrt{\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})}d\varepsilon\leq C\sqrt{n}, (42)

for some constant C>0C>0. In particular, the smallest LnL_{n} satisfying (42) yields the best upper bound of the error rate. Here [](ε,𝒯,2)\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2}) denotes the L2L_{2}-metric, ε\varepsilon-bracketing number (c.f. Definition 3) of family 𝒯\mathcal{F}_{\mathcal{T}}.

It remains to solve for the smallest possible LnL_{n} in (42). Based on Lemma 2, the inequality (42) is satisfied with

Lntn(α+1)/(α+2)+1ρtn,where tn=dmaxrKlogKn.L_{n}\asymp t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho}t_{n},\quad\text{where }t_{n}={d_{\max}rK\log K\over n}. (43)

Therefore, by (41), the following bound holds with high probability,

Risk(𝒵^)Risk(Θ¯)tn(α+1)/(α+2)+1ρtn.\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})\leq t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho}t_{n}.

Inserting the above bound into (35) gives

MAE(sgn𝒵^,sgnΘ¯)\displaystyle\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}},\textup{sgn}\bar{\Theta}) [Risk(𝒵^)Risk(Θ¯)]α/(α+1)+1ρ[Risk(𝒵^)Risk(Θ¯)]\displaystyle\lesssim[\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})]^{\alpha/(\alpha+1)}+{1\over\rho}[\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})]
\lesssim t_{n}^{\alpha/(\alpha+2)}+{1\over\rho^{\alpha/(\alpha+1)}}t_{n}^{\alpha/(\alpha+1)}+{1\over\rho}t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho^{2}}t_{n}
4tnα/(α+2)+4ρ2tn,\displaystyle\leq 4t_{n}^{\alpha/(\alpha+2)}+{4\over\rho^{2}}t_{n}, (44)

where the last line follows from the fact that a(b^{2}+b^{(\alpha+2)/(\alpha+1)}+b+1)\leq 4a(b^{2}+1) with a={t_{n}\over\rho^{2}} and b=\rho t_{n}^{-1/(\alpha+2)}. We plug t_{n} into (44) and absorb the term K\log K into the constant. The conclusion is then proved. ∎

Definition 3 (Bracketing number).

Consider a family of functions \mathcal{F}, and let ε>0\varepsilon>0. Let 𝒳\mathcal{X} denote the domain space equipped with measure Π\Pi. We call {(fml,fmu)}m=1M\{(f^{l}_{m},f^{u}_{m})\}_{m=1}^{M} an L2L_{2}-metric, ε\varepsilon-bracketing function set of \mathcal{F}, if for every ff\in\mathcal{F}, there exists an m[M]m\in[M] such that

fml(x)f(x)fmu(x),for all x𝒳,f^{l}_{m}(x)\leq f(x)\leq f^{u}_{m}(x),\quad\text{for all }x\in\mathcal{X},

and

fmlfmu2=def𝔼xΠ|fml(x)fmu(x)|2ε,for all m=1,,M.\lVert f^{l}_{m}-f^{u}_{m}\rVert_{2}\stackrel{{\scriptstyle\text{def}}}{{=}}\sqrt{\mathbb{E}_{x\sim\Pi}|f^{l}_{m}(x)-f^{u}_{m}(x)|^{2}}\leq\varepsilon,\ \text{for all }m=1,\ldots,M.

The bracketing number with L2L_{2}-metric, denoted [](ε,,2)\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F},\lVert\cdot\rVert_{2}), is the logarithm of the smallest cardinality of the ε\varepsilon-bracketing function set of \mathcal{F}.

Lemma 2 (Bracketing complexity of low-rank tensors).

Define the family of rank-rr bounded tensors 𝒯={𝒵d1××dK:rank(𝒵)r,𝒵F1}\mathcal{T}=\{\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\colon\textup{rank}(\mathcal{Z})\leq r,\ \lVert\mathcal{Z}\rVert_{F}\leq 1\} and the induced function family 𝒯={f𝒵:𝒵𝒯}\mathcal{F}_{\mathcal{T}}=\{f_{\mathcal{Z}}\colon\mathcal{Z}\in\mathcal{T}\}. Set

Ln(dmaxrKlogKn)(α+1)/(α+2)+1ρ(π,𝒩)(dmaxrKlogKn).L_{n}\asymp\left({d_{\max}rK\log K\over n}\right)^{(\alpha+1)/(\alpha+2)}+{1\over\rho(\pi,\mathcal{N})}\left({d_{\max}rK\log K\over n}\right). (45)

Then, the following inequality is satisfied.

1LnLnLnα/(α+1)+Lnρ(π,𝒩)[](ε,𝒯,2)𝑑εCn1/2,{1\over L_{n}}\int^{\sqrt{L_{n}^{\alpha/(\alpha+1)}+{L_{n}\over\rho(\pi,\mathcal{N})}}}_{L_{n}}\sqrt{\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})}d\varepsilon\leq Cn^{1/2}, (46)

where C>0C>0 is a constant independent of r,Kr,K and dmaxd_{\text{max}}.

Proof of Lemma 2.

To simplify the notation, we denote ρ=ρ(π,𝒩)\rho=\rho(\pi,\mathcal{N}). Notice that

\lVert f_{\mathcal{Z}_{1}}-f_{\mathcal{Z}_{2}}\rVert_{2}\leq\lVert f_{\mathcal{Z}_{1}}-f_{\mathcal{Z}_{2}}\rVert_{\infty}\leq\lVert\mathcal{Z}_{1}-\mathcal{Z}_{2}\rVert_{F}\quad\text{ for all }\mathcal{Z}_{1},\mathcal{Z}_{2}\in\mathcal{T}. (47)

It follows from Kosorok, (2007, Theorem 9.22) that the L2L_{2}-metric, (2ϵ)(2\epsilon)-bracketing number of 𝒯\mathcal{F}_{\mathcal{T}} is bounded by

[](2ε,𝒯,2)(ε,𝒯,F)CdmaxrKlogKε.\mathcal{H}_{[\ ]}(2\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})\leq\mathcal{H}(\varepsilon,\mathcal{T},\lVert\cdot\rVert_{F})\leq Cd_{\max}rK\log{K\over\varepsilon}.

The last inequality is from the covering number bounds for rank-rr bounded tensors; see Mu et al., (2014, Lemma 3).

Inserting the bracketing number into (46) gives

g(L)=1LLLα/(α+1)+ρ1LdmaxrKlog(Kε)𝑑ε.g(L)={1\over L}\int^{\sqrt{L^{\alpha/(\alpha+1)}+{\rho^{-1}L}}}_{L}\sqrt{d_{\max}rK\log\left({K\over\varepsilon}\right)}d\varepsilon. (48)

By the monotonicity of the integrand in (48), we bound g(L)g(L) by

g(L)\displaystyle g(L) dmaxrKLLLα/(α+1)+ρ1Llog(KL)𝑑ε\displaystyle\leq{\sqrt{d_{\max}rK}\over L}\int_{L}^{\sqrt{L^{\alpha/(\alpha+1)}+\rho^{-1}L}}\sqrt{\log\left(K\over L\right)}d\varepsilon
dmaxrK(logKlogL)(Lα/(2α+2)+ρ1LL1)\displaystyle\leq\sqrt{d_{\max}rK(\log K-\log L)}\left({L^{\alpha/(2\alpha+2)}+\sqrt{\rho^{-1}L}\over L}-1\right)
dmaxrKlogK(1L(α+2)/(2α+2)+1ρL),\displaystyle\leq\sqrt{d_{\max}rK\log K}\left({1\over L^{(\alpha+2)/(2\alpha+2)}}+{1\over\sqrt{\rho L}}\right), (49)

where the second line follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for a,b>0. It remains to verify that g(L_{n})\leq Cn^{1/2} for L_{n} specified in (45). Plugging L_{n} into the last line of (49) gives

g(Ln)\displaystyle g(L_{n}) dmaxrKlogK(1Ln(α+2)/(2α+2)+1ρLn)\displaystyle\leq\sqrt{d_{\max}rK\log K}\left({1\over L_{n}^{(\alpha+2)/(2\alpha+2)}}+{1\over\sqrt{\rho L_{n}}}\right) (50)
dmaxrKlogK([(dmaxrKlogKn)α+1α+2]α+22α+2+[ρ(dmaxrKlogKρn)]12)\displaystyle\leq\sqrt{d_{\max}rK\log K}\left(\left[\left(d_{\max}rK\log K\over n\right)^{\alpha+1\over\alpha+2}\right]^{-{\alpha+2\over 2\alpha+2}}+\left[\rho\left(d_{\max}rK\log K\over\rho n\right)\right]^{-{1\over 2}}\right) (51)
Cn1/2,\displaystyle\leq Cn^{1/2}, (52)

where C>0C>0 is a constant independent of r,Kr,K and dmaxd_{\text{max}}. The proof is therefore complete. ∎

7.3.4 Proof of Theorem 3

Proof of Theorem 3.

By definition of Θ^\hat{\Theta}, we have

MAE(Θ^,Θ)\displaystyle\text{MAE}(\hat{\Theta},\Theta) =𝔼|12H+1πΠsgnZ^πΘ|\displaystyle=\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}\hat{Z}_{\pi}-\Theta\right|
𝔼|12H+1πΠ(sgnZ^πsgn(Θπ))|+𝔼|12H+1πΠsgn(Θπ)Θ|\displaystyle\leq\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\left(\textup{sgn}\hat{Z}_{\pi}-\textup{sgn}(\Theta-\pi)\right)\right|+\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}(\Theta-\pi)-\Theta\right|
12H+1πΠMAE(sgnZ^π,sgn(Θπ))+1H,\displaystyle\leq\frac{1}{2H+1}\sum_{\pi\in\Pi}\text{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))+\frac{1}{H}, (53)

where the last line comes from the triangle inequality and the inequality

|12H+1πΠsgn(Θ(ω)π)Θ(ω)|1H,for all ω[d1]××[dK].\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}(\Theta(\omega)-\pi)-\Theta(\omega)\right|\leq\frac{1}{H},\quad\text{for all }\omega\in[d_{1}]\times\cdots\times[d_{K}]. (54)

Write n=|\Omega|. It now suffices to bound the first term in (53). We prove that

12H+1πΠMAE(sgnZ^π,sgn(Θπ))tnα/(α+2)+1H+Htn,with tn=dmaxrKlogKn.{1\over 2H+1}\sum_{\pi\in\Pi}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))\lesssim t_{n}^{\alpha/(\alpha+2)}+{1\over H}+Ht_{n},\quad\text{with }t_{n}={d_{\max}rK\log K\over n}. (55)

Theorem 2 implies that the sign estimation accuracy at a level \pi\in\Pi depends on the closeness of \pi to the mass points in \mathcal{N}. Therefore, we partition the levels \pi\in\Pi based on their closeness to \mathcal{N}. Specifically, let \mathcal{N}_{H}\stackrel{\text{def}}{=}\bigcup_{\pi^{\prime}\in\mathcal{N}}\left(\pi^{\prime}-\frac{1}{H},\pi^{\prime}+\frac{1}{H}\right) denote the set of levels within \frac{1}{H} of a mass point. We expand the left-hand side of (55) as

12H+1πΠMAE(sgnZ^π,sgn(Θπ))\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))
=\displaystyle= 12H+1πΠ𝒩HMAE(sgnZ^π,sgn(Θπ))+12H+1πΠ𝒩HcMAE(sgnZ^π,sgn(Θπ)).\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))+{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi)). (56)

By assumption, the first term involves only a finite number of summands: each interval in \mathcal{N}_{H} contains at most two points of \Pi and each MAE term is bounded by 2, so the first term is bounded by 4C/(2H+1), where C>0 is a constant such that |\mathcal{N}|\leq C. We bound the second term using the explicit form of \rho(\pi,\mathcal{N}) for \pi\in\Pi\cap\mathcal{N}_{H}^{c}. Based on Theorem 2,

12H+1πΠ𝒩HcMAE(sgn𝒵^π,sgn(Θπ))\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}}_{\pi},\textup{sgn}(\Theta-\pi)) 12H+1πΠ𝒩Hctnα/(α+2)+tn2H+1πΠ𝒩Hc1ρ2(π,𝒩)\displaystyle\lesssim{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over\rho^{2}(\pi,\mathcal{N})} (57)
tnα/(α+2)+tn2H+1πΠ𝒩Hcπ𝒩1|ππ|2\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\sum_{\pi^{\prime}\in\mathcal{N}}{1\over|\pi-\pi^{\prime}|^{2}} (58)
tnα/(α+2)+tn2H+1π𝒩πΠ𝒩Hc1|ππ|2\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi^{\prime}\in\mathcal{N}}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}} (59)
tnα/(α+2)+2CHtn,\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+2CHt_{n}, (60)

where the last inequality follows from Lemma 3. Combining the bounds for the two terms in (56) completes the proof of (55). Finally, plugging (55) into (53) yields

MAE(Θ^,Θ)(dmaxrKlogK|Ω|)α/(α+2)+1H+HdmaxrKlogK|Ω|.\displaystyle\text{MAE}(\hat{\Theta},\Theta)\lesssim\left(d_{\max}rK\log K\over|\Omega|\right)^{\alpha/(\alpha+2)}+\frac{1}{H}+H{d_{\max}rK\log K\over|\Omega|}. (61)

The conclusion follows by absorbing KlogKK\log K into the constant term in the statement. ∎
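The aggregation step and the discretization inequality (54) are easy to illustrate numerically. The sketch below (ours, not part of the proof) replaces the level-wise estimates \textup{sgn}\hat{\mathcal{Z}}_{\pi} with the exact sign tensors \textup{sgn}(\Theta-\pi), isolating the 1/H discretization error.

```python
import numpy as np

# Averaging sgn(Theta - pi) over the grid Pi = {-1, ..., -1/H, 0, 1/H, ..., 1}
# recovers Theta entrywise up to error 1/H, as claimed in (54).
rng = np.random.default_rng(2)
H = 20
levels = np.arange(-H, H + 1) / H                       # the 2H + 1 levels in Pi
Theta = rng.uniform(-1, 1, size=(15, 15, 15))           # any signal with entries in [-1, 1]

Theta_hat = np.mean(np.sign(Theta[None, ...] - levels[:, None, None, None]), axis=0)
assert np.max(np.abs(Theta_hat - Theta)) <= 1 / H + 1e-12
```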

Lemma 3.

Fix π𝒩\pi^{\prime}\in\mathcal{N} and a sequence Π={1,,1/H,0,1/H,,1}\Pi=\{-1,\ldots,-1/H,0,1/H,\ldots,1\} with H2H\geq 2. Then,

πΠ𝒩Hc1|ππ|24H2.\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}}\leq 4H^{2}.
Proof of Lemma 3.

Notice that all points \pi\in\Pi\cap\mathcal{N}_{H}^{c} satisfy |\pi-\pi^{\prime}|\geq{1\over H} for all \pi^{\prime}\in\mathcal{N}. We use this fact to compute the sum

πΠ𝒩Hc1|ππ|2\displaystyle\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}} =hHΠ𝒩Hc1|hHπ|2\displaystyle=\sum_{\frac{h}{H}\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\frac{h}{H}-\pi^{\prime}|^{2}} (62)
2H2h=1H1h2\displaystyle\leq 2H^{2}\sum_{h=1}^{H}{1\over h^{2}} (63)
2H2{1+121x2𝑑x+231x2𝑑x++H1H1x2𝑑x}\displaystyle\leq 2H^{2}\left\{1+\int_{1}^{2}{1\over x^{2}}dx+\int_{2}^{3}{1\over x^{2}}dx+\cdots+\int_{H-1}^{H}{1\over x^{2}}dx\right\} (64)
=2H2(1+1H1x2𝑑x)4H2,\displaystyle=2H^{2}\left(1+\int^{H}_{1}{1\over x^{2}}dx\right)\leq 4H^{2}, (65)

where the third line uses the monotonicity of 1x2{1\over x^{2}} for x1x\geq 1. ∎
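Lemma 3 can also be checked by brute force; the sketch below (ours) verifies the bound for several grid resolutions H and mass-point locations \pi^{\prime}\in[-1,1].

```python
import numpy as np

# Brute-force check of Lemma 3: grid points of Pi at distance at least 1/H from a
# mass point pi' contribute at most 4 H^2 to the sum of inverse squared distances.
for H in (2, 5, 20, 100):
    grid = np.arange(-H, H + 1) / H                     # the levels in Pi
    for pi_prime in np.linspace(-1, 1, 41):
        pts = grid[np.abs(grid - pi_prime) >= 1 / H]    # points outside the 1/H neighborhood
        assert np.sum(1.0 / (pts - pi_prime) ** 2) <= 4 * H ** 2
```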

8 Conclusion

We have developed a tensor completion method that addresses both low- and high-rankness based on the sign series representation. Our work provides a nonparametric framework for tensor estimation and obtains results that were previously out of reach. We hope the work opens up new inquiries and allows more researchers to contribute to this field.

Acknowledgements

This research is supported in part by NSF grant DMS-1915978 and Wisconsin Alumni Research Foundation.

References

  • Alon et al., (2016) Alon, N., Moran, S., and Yehudayoff, A. (2016). Sign rank versus VC dimension. In Conference on Learning Theory, pages 47–80.
  • Anandkumar et al., (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.
  • Anandkumar et al., (2017) Anandkumar, A., Ge, R., and Janzamin, M. (2017). Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18(1):752–791.
  • Balabdaoui et al., (2019) Balabdaoui, F., Durot, C., and Jankowski, H. (2019). Least squares estimation in the monotone single index model. Bernoulli, 25(4B):3276–3310.
  • Bartlett et al., (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
  • Cai et al., (2019) Cai, C., Li, G., Poor, H. V., and Chen, Y. (2019). Nonconvex low-rank tensor completion from noisy data. In Advances in Neural Information Processing Systems, pages 1863–1874.
  • Chan and Airoldi, (2014) Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.
  • Chi et al., (2020) Chi, E. C., Gaines, B. J., Sun, W. W., Zhou, H., and Yang, J. (2020). Provable convex co-clustering of tensors. Journal of Machine Learning Research, 21(214):1–58.
  • Cohn and Umans, (2013) Cohn, H. and Umans, C. (2013). Fast matrix multiplication using coherent configurations. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1074–1087.
  • De Lathauwer et al., (2000) De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.
  • De Wolf, (2003) De Wolf, R. (2003). Nondeterministic quantum query and communication complexities. SIAM Journal on Computing, 32(3):681–699.
  • Fan and Udell, (2019) Fan, J. and Udell, M. (2019). Online high rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8690–8698.
  • Ganti et al., (2017) Ganti, R., Rao, N., Balzano, L., Willett, R., and Nowak, R. (2017). On learning high dimensional structured single index models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1898–1904.
  • Ganti et al., (2015) Ganti, R. S., Balzano, L., and Willett, R. (2015). Matrix completion under monotonic single index models. In Advances in Neural Information Processing Systems, pages 1873–1881.
  • Ghadermarzy et al., (2018) Ghadermarzy, N., Plan, Y., and Yilmaz, O. (2018). Learning tensors from partial binary measurements. IEEE Transactions on Signal Processing, 67(1):29–40.
  • Ghadermarzy et al., (2019) Ghadermarzy, N., Plan, Y., and Yilmaz, Ö. (2019). Near-optimal sample complexity for convex tensor completion. Information and Inference: A Journal of the IMA, 8(3):577–619.
  • Globerson et al., (2007) Globerson, A., Chechik, G., Pereira, F., and Tishby, N. (2007). Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295.
  • Han et al., (2020) Han, R., Willett, R., and Zhang, A. (2020). An optimal statistical and computational framework for generalized tensor estimation. arXiv preprint arXiv:2002.11255.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • Hillar and Lim, (2013) Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45.
  • Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
  • Hong et al., (2020) Hong, D., Kolda, T. G., and Duersch, J. A. (2020). Generalized canonical polyadic tensor decomposition. SIAM Review, 62(1):133–163.
  • Hore et al., (2016) Hore, V., Viñuela, A., Buil, A., Knight, J., McCarthy, M. I., Small, K., and Marchini, J. (2016). Tensor decomposition for multiple-tissue gene expression experiments. Nature genetics, 48(9):1094.
  • Jain and Oh, (2014) Jain, P. and Oh, S. (2014). Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, volume 27, pages 1431–1439.
  • Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
  • Kosorok, (2007) Kosorok, M. R. (2007). Introduction to empirical processes and semiparametric inference. Springer Science & Business Media.
  • Lee and Wang, (2020) Lee, C. and Wang, M. (2020). Tensor denoising and completion based on ordinal observations. In International Conference on Machine Learning, pages 5778–5788.
  • Li et al., (2009) Li, Y., Liu, Y., Li, J., Qin, W., Li, K., Yu, C., and Jiang, T. (2009). Brain anatomical network and intelligence. PLoS Comput Biol, 5(5):e1000395.
  • Montanari and Sun, (2018) Montanari, A. and Sun, N. (2018). Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425.
  • Mu et al., (2014) Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81.
  • Ongie et al., (2017) Ongie, G., Willett, R., Nowak, R. D., and Balzano, L. (2017). Algebraic variety models for high-rank matrix completion. In International Conference on Machine Learning, pages 2691–2700.
  • Robinson, (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 56(4):931–954.
  • Shen and Wong, (1994) Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics, 22:580–615.
  • Wang et al., (2008) Wang, J., Shen, X., and Liu, Y. (2008). Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167.
  • Wang et al., (2017) Wang, L., Durante, D., Jung, R. E., and Dunson, D. B. (2017). Bayesian network–response regression. Bioinformatics, 33(12):1859–1866.
  • Wang and Li, (2020) Wang, M. and Li, L. (2020). Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. Journal of Machine Learning Research, 21(154):1–38.
  • Wang and Zeng, (2019) Wang, M. and Zeng, Y. (2019). Multiway clustering via tensor block models. In Advances in Neural Information Processing Systems, pages 713–723.
  • Xu, (2018) Xu, J. (2018). Rates of convergence of spectral methods for graphon estimation. In International Conference on Machine Learning, pages 5433–5442.
  • Yuan and Zhang, (2016) Yuan, M. and Zhang, C.-H. (2016). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068.
  • Zhang and Xia, (2018) Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, 64(11):7311 – 7338.