
Beyond the Signs: Nonparametric Tensor Completion
via Sign Series

Chanwoo Lee
University of Wisconsin – Madison
[email protected]
   Miaoyan Wang
University of Wisconsin – Madison
[email protected]
Abstract

We consider the problem of tensor estimation from noisy observations with possibly missing entries. A nonparametric approach to tensor completion is developed based on a new model which we coin as sign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low- and high-rank signals, while encompassing many existing tensor models—including CP models, Tucker models, single index models, several hypergraphon models—as special cases. We show that the sign tensor series is theoretically characterized, and computationally estimable, via classification tasks with carefully-specified weights. Excess risk bounds, estimation error rates, and sample complexities are established. We demonstrate the outperformance of our approach over previous methods on two datasets, one on human brain connectivity networks and the other on topic data mining.

1 Introduction

Higher-order tensors have recently received much attention in enormous fields including social networks (Anandkumar et al.,, 2014), neuroscience (Wang et al.,, 2017), and genomics (Hore et al.,, 2016). Tensor methods provide effective representation of the hidden structure in multiway data. In this paper we consider the signal plus noise model,

𝒴=Θ+,\mathcal{Y}=\Theta+\mathcal{E}, (1)

where 𝒴d1××dK\mathcal{Y}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is an order-KK data tensor, Θ\Theta is an unknown signal tensor of interest, and \mathcal{E} is a noise tensor. Our goal is to accurately estimate Θ\Theta from the incomplete, noisy observation of 𝒴\mathcal{Y}. In particular, we focus on the following two problems:

  • Q1 [Nonparametric tensor estimation]. How to flexibly estimate Θ\Theta under a wide range of structures, including both low-rankness and high-rankness?

  • Q2 [Complexity of tensor completion]. How many observed tensor entries do we need to consistently estimate the signal Θ\Theta?

1.1 Inadequacies of low-rank models

The signal plus noise model (1) is popular in the tensor literature. Existing methods estimate the signal tensor based on the low-rankness of Θ\Theta (Jain and Oh,, 2014; Montanari and Sun,, 2018). Common low-rank models include Canonical Polyadic (CP) tensors (Hitchcock,, 1927), Tucker tensors (De Lathauwer et al.,, 2000), and block tensors (Wang and Zeng,, 2019). While these methods have shown great success in signal recovery, tensors in applications often violate low-rankness. Here we provide two examples to illustrate the limitations of classical low-rank models.

The first example reveals the sensitivity of tensor rank to order-preserving transformations. Let 𝒵30×30×30\mathcal{Z}\in\mathbb{R}^{30\times 30\times 30} be an order-3 tensor with CP rank(𝒵)=3\text{rank}(\mathcal{Z})=3 (the formal definition is given in Section 1.3). Suppose a monotonic transformation f(z)=(1+exp(cz))1f(z)=(1+\exp(-cz))^{-1} is applied to 𝒵\mathcal{Z} entrywise, and we let the signal Θ\Theta in model (1) be the tensor after transformation. Figure 1a plots the numerical rank (see Section 7.1) of Θ\Theta versus cc. As we see, the rank increases rapidly with cc, rendering traditional low-rank tensor methods ineffective in the presence of mild order-preserving nonlinearities. In digital processing (Ghadermarzy et al.,, 2018) and genomics analysis (Hore et al.,, 2016), the tensor of interest often undergoes an unknown transformation prior to measurements. The sensitivity to transformation makes the low-rank model less desirable in practice.

Refer to caption
Figure 1: (a) Numerical rank of Θ\Theta versus cc in the first example. (b) Top d=30d=30 tensor singular values in the second example.

The second example demonstrates the inadequacy of classical low-rankness in representing special structures. Here we consider the signal tensor of the form Θ=log(1+𝒵)\Theta=\log(1+\mathcal{Z}), where 𝒵d×d×d\mathcal{Z}\in\mathbb{R}^{d\times d\times d} is an order-3 tensor with entries 𝒵(i,j,k)=1dmax(i,j,k)\mathcal{Z}(i,j,k)={1\over d}\max(i,j,k) for i,j,k{1,,d}i,j,k\in\{1,\ldots,d\}. The matrix analogue of Θ\Theta was studied by Chan and Airoldi, (2014) in graphon analysis. In this case neither Θ\Theta nor 𝒵\mathcal{Z} is low-rank; in fact, the rank is no smaller than the dimension dd, as illustrated in Figure 1b. Again, classical low-rank models fail to address this type of tensor structure.
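As a quick numerical illustration (a minimal Python/numpy sketch, not part of the original analysis; the dimension d=30 is chosen to match Figure 1b), one can construct Θ=log(1+𝒵) with 𝒵(i,j,k)=max(i,j,k)/d and check that a single matrix unfolding already has rank d, which lower-bounds the tensor rank:

import numpy as np

d = 30
idx = np.arange(1, d + 1)
# Z(i,j,k) = max(i,j,k) / d, built by broadcasting over the three index grids
Z = np.maximum.reduce(np.meshgrid(idx, idx, idx, indexing="ij")) / d
Theta = np.log(1 + Z)

# the tensor rank is lower-bounded by the rank of any matrix unfolding
unfold1 = Theta.reshape(d, d * d)          # mode-1 unfolding
print(np.linalg.matrix_rank(unfold1))      # expected to equal d = 30, consistent with Figure 1b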

In the above and many other examples, the signal tensors Θ\Theta of interest have high rank. Classical low-rank models will miss these important structures. New methods that allow flexible tensor modeling have yet to be developed.

1.2 Our contributions

We develop a new model called sign representable tensors to address the aforementioned challenges. Figure 2 illustrates our main idea. Our approach is built on the sign series representation of the signal tensor, and we propose to estimate the sign tensors through a series of weighted classifications. In contrast to existing methods, our method is guaranteed to recover a wide range of low- and high-rank signals. We highlight two main contributions that set our work apart from earlier literature.

Refer to caption

Figure 2: Illustration of our method. For visualization purposes, we plot an order-2 tensor (a.k.a. a matrix); a similar procedure applies to higher-order tensors. (a): a noisy and incomplete tensor input. (b) and (c): main steps of estimating the sign tensor series sgn(Θπ)\textup{sgn}(\Theta-\pi) for π{1,,1H,0,1H,,1}\pi\in\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\}. (d): the estimated signal Θ^\hat{\Theta}. The depicted signal is a full-rank matrix based on Example 5 in Section 3.

Statistically, the problem of high-rank tensor estimation is challenging. Existing estimation theory (Anandkumar et al.,, 2014; Montanari and Sun,, 2018; Cai et al.,, 2019) exclusively focuses on the regime of fixed rank rr and growing dimension dd. However, such a premise fails for high-rank tensors, where the rank may grow with, or even exceed, the dimension. A proper notion of nonparametric complexity is crucial. We show that, somewhat surprisingly, the sign tensor series not only preserves all information in the original signals, but also brings the benefits of flexibility and accuracy over classical low-rank models. The results fill the gap between parametric (low-rank) and nonparametric (high-rank) tensors, thereby greatly enriching the tensor model literature.

From a computational perspective, tensor optimization is in general NP-hard. Fortunately, tensors sought in applications are specially-structured, for which a number of efficient algorithms are available (Ghadermarzy et al.,, 2018; Wang and Li,, 2020; Han et al.,, 2020). Our high-rank tensor estimate is provably reducible to a series of classifications, and its divide-and-conquer nature facilitates efficient computation. The ability to import and adapt existing tensor algorithms is one advantage of our method.

We also highlight the challenges associated with tensors compared to matrices. High-rank matrix estimation has recently been studied under nonlinear models (Ganti et al.,, 2015) and subspace clustering (Ongie et al.,, 2017; Fan and Udell,, 2019). However, the problem for high-rank tensors is more challenging, because the tensor rank often exceeds the dimension when the order K3K\geq 3 (Anandkumar et al.,, 2017). This is in sharp contrast to matrices. We show that applying matrix methods to higher-order tensors results in suboptimal estimates. A full exploitation of the higher-order structure is needed; this is another challenge we address in this paper.

1.3 Notation

We use sgn():{1,1}\textup{sgn}(\cdot)\colon\mathbb{R}\to\{-1,1\} to denote the sign function, where sgn(y)=1\textup{sgn}(y)=1 if y0y\geq 0 and 1-1 otherwise. We allow univariate functions, such as sgn()\textup{sgn}(\cdot) and general f:f\colon\mathbb{R}\to\mathbb{R}, to be applied to tensors in an element-wise manner. We denote anbna_{n}\lesssim b_{n} if limnan/bnc\lim_{n\to\infty}a_{n}/b_{n}\leq c for some constant c0c\geq 0. We use the shorthand [n][n] to denote the nn-set {1,,n}\{1,\ldots,n\} for n+n\in\mathbb{N}_{+}. Let Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} denote an order-KK (d1,,dK)(d_{1},\ldots,d_{K})-dimensional tensor, and Θ(ω)\Theta(\omega)\in\mathbb{R} denote the tensor entry indexed by ω[d1]××[dK]\omega\in[d_{1}]\times\cdots\times[d_{K}]. An event EE is said to occur “with very high probability” if (E)\mathbb{P}(E) tends to 1 faster than any polynomial of tensor dimension d:=minkdkd:=\min_{k}d_{k}\to\infty. The CP decomposition (Hitchcock,, 1927) is defined by

Θ=s=1rλs𝒂s(1)𝒂s(K),\Theta=\sum_{s=1}^{r}\lambda_{s}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s}, (2)

where λ1λr>0\lambda_{1}\geq\cdots\geq\lambda_{r}>0 are tensor singular values, 𝒂s(k)dk\bm{a}^{(k)}_{s}\in\mathbb{R}^{d_{k}} are norm-1 tensor singular vectors, and \otimes denotes the outer product of vectors. The minimal r+r\in\mathbb{N}_{+} for which (2) holds is called the tensor rank, denoted rank(Θ)\textup{rank}(\Theta).
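For concreteness, a CP tensor of the form (2) can be assembled from its factors as in the following numpy sketch (illustrative only; the dimensions and rank are arbitrary choices):

import numpy as np

def cp_tensor(factors, weights):
    """Assemble sum_s weights[s] * a_s^(1) o ... o a_s^(K) from factor matrices.

    factors: list of K arrays, factors[k] has shape (d_k, r) with unit-norm columns a_s^(k);
    weights: length-r array of the singular values lambda_s.
    """
    K, r = len(factors), len(weights)
    dims = [A.shape[0] for A in factors]
    T = np.zeros(dims)
    for s in range(r):
        comp = weights[s]
        for k in range(K):                       # outer product across the K modes
            comp = np.multiply.outer(comp, factors[k][:, s])
        T += comp
    return T

# example: an order-3, rank-2 tensor of dimension (4, 5, 6)
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, 2)) for d in (4, 5, 6)]
factors = [A / np.linalg.norm(A, axis=0) for A in factors]   # normalize columns
Theta = cp_tensor(factors, weights=np.array([2.0, 1.0]))
print(Theta.shape)    # (4, 5, 6)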

2 Model and proposal overview

Let 𝒴\mathcal{Y} be an order-KK (d1,,dK)(d_{1},\ldots,d_{K})-dimensional data tensor generated from the following model

𝒴=Θ+,\mathcal{Y}=\Theta+\mathcal{E}, (3)

where Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is an unknown signal tensor of interest, and \mathcal{E} is a noise tensor consisting of mean-zero, independent but not necessarily identically distributed entries. We allow heterogeneous noise, in that the marginal distribution of the noise entry (ω)\mathcal{E}(\omega) may depend on ω\omega. Assume that 𝒴(ω)\mathcal{Y}(\omega) takes values in a bounded interval [A,A][-A,A]; without loss of generality, we set A=1A=1 throughout the paper.

Our observation is an incomplete data tensor from (3), denoted 𝒴Ω\mathcal{Y}_{\Omega}, where Ω[d1]××[dK]\Omega\subset[d_{1}]\times\cdots\times[d_{K}] is the index set of observed entries. We consider a general model on Ω\Omega that allows both uniform and non-uniform samplings. Specifically, let Π={pω}\Pi=\{p_{\omega}\} be an arbitrarily predefined probability distribution over the full index set with ω[d1]××[dK]pω=1\sum_{\omega\in[d_{1}]\times\cdots\times[d_{K}]}p_{\omega}=1. Assume that the entries ω\omega in Ω\Omega are i.i.d. draws with replacement from the full index set using distribution Π\Pi. The sampling rule is denoted as ωΠ\omega\sim\Pi.
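A minimal sketch of this sampling scheme (assuming numpy; Π is uniform unless a probability vector is supplied):

import numpy as np

def sample_indices(dims, n_obs, probs=None, seed=0):
    """Draw n_obs entries i.i.d. with replacement from Pi over the full index set.

    probs: flat array of length prod(dims) summing to 1 (uniform sampling if None).
    Returns a tuple of K index arrays usable for fancy indexing, e.g. Y[omega].
    """
    rng = np.random.default_rng(seed)
    total = int(np.prod(dims))
    flat = rng.choice(total, size=n_obs, replace=True, p=probs)
    return np.unravel_index(flat, dims)

omega = sample_indices((30, 30, 30), n_obs=5000)   # uniform Pi
# Y_obs = Y[omega] then gives the observed (possibly repeated) entries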

Before describing our main results, we provide the intuition behind our method. In the two examples in Section 1, the high-rankness in the signal Θ\Theta makes the estimation challenging. Now let us examine the sign of the π\pi-shifted signal sgn(Θπ)\textup{sgn}(\Theta-\pi) for any given π[1,1]\pi\in[-1,1]. It turns out that these sign tensors share the same sign patterns as low-rank tensors. Indeed, the signal tensor in the first example has the same sign pattern as a rank-44 tensor, since sgn(Θπ)=sgn(𝒵f1(π))\textup{sgn}(\Theta-\pi)=\textup{sgn}(\mathcal{Z}-f^{-1}(\pi)). The signal tensor in the second example has the same sign pattern as a rank-2 tensor, since sgn(Θπ)=sgn(max(i,j,k)d(eπ1))\textup{sgn}(\Theta-\pi)=\textup{sgn}(\max(i,j,k)-d(e^{\pi}-1)) (see Example 5 in Section 3).

The above observation suggests a general framework to estimate both low- and high-rank signal tensors. Figure 2 illustrates the main crux of our method. We dichotomize the data tensor into a series of sign tensors sgn(𝒴Ωπ)\textup{sgn}(\mathcal{Y}_{\Omega}-\pi) for π={1,,1H,0,1H,,1}\pi\in\mathcal{H}={\{\small-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\}}. Then, we estimate the sign signals sgn(Θπ)\textup{sgn}(\Theta-\pi) by performing classification

𝒵^π=argminlow rank tensor 𝒵Weighted-Loss(sgn(𝒵),sgn(𝒴Ωπ)),\hat{\mathcal{Z}}_{\pi}=\operatorname*{arg\,min}_{\text{low rank tensor $\mathcal{Z}$}}\text{Weighted-Loss}(\textup{sgn}(\mathcal{Z}),\textup{sgn}(\mathcal{Y}_{\Omega}-\pi)),

where Weighted-Loss(,)(\cdot,\cdot) denotes a carefully-designed classification objective function which will be described in later sections. Our final proposed tensor estimate takes the form

Θ^=12H+1πsgn(𝒵^π).\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}(\hat{\mathcal{Z}}_{\pi}).

Our approach is built on the nonparametric sign representation of signal tensors. The estimate Θ^\hat{\Theta} is essentially learned from the dichotomized tensor series {sgn(𝒴Ωπ):π}\{\textup{sgn}(\mathcal{Y}_{\Omega}-\pi)\colon\pi\in\mathcal{H}\} with proper weights. We show that a careful aggregation of dichotomized data not only preserves all information in the original signals, but also brings benefits of accuracy and flexibility over classical low-rank models. Unlike traditional methods, the sign representation is guaranteed to recover both low- and high-rank signals that were previously impossible to recover. The method enjoys statistical effectiveness and computational efficiency.
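The aggregation step can be previewed on a noiseless signal: averaging the sign series {sgn(Θ−π): π∈ℋ} recovers Θ itself up to resolution roughly 1/H. The numpy sketch below illustrates this representation property only; the actual estimator replaces sgn(Θ−π) with the estimated classifiers sgn(𝒵̂_π):

import numpy as np

rng = np.random.default_rng(1)
Theta = rng.uniform(-1, 1, size=(20, 20, 20))     # any signal taking values in [-1, 1]

H = 50
levels = np.arange(-H, H + 1) / H                 # pi in {-1, ..., -1/H, 0, 1/H, ..., 1}
signs = np.stack([np.sign(Theta - pi) for pi in levels])
signs[signs == 0] = 1                             # convention sgn(0) = 1
approx = signs.mean(axis=0)                       # (2H+1)^{-1} sum_pi sgn(Theta - pi)

print(np.abs(approx - Theta).max())               # small, on the order of 1/H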

3 Statistical properties of sign representable tensors

This section develops sign representable tensor models for Θ\Theta in (3). We characterize the algebraic and statistical properties of the sign tensor series, which serve as the theoretical foundation for our method.

3.1 Sign-rank and sign tensor series

Let Θ\Theta be the tensor of interest, and sgn(Θ)\textup{sgn}(\Theta) the corresponding sign pattern. The sign patterns induce an equivalence relation among tensors. Two tensors are called sign equivalent, denoted \simeq, if they have the same sign pattern.

Definition 1 (Sign-rank).

The sign-rank of a tensor Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is defined by the minimal rank among all tensors that share the same sign pattern as Θ\Theta; i.e.,

srank(Θ)=min{rank(Θ):ΘΘ,Θd1××dK}.\textup{srank}(\Theta)=\min\{\textup{rank}(\Theta^{\prime})\colon\Theta^{\prime}\simeq\Theta,\ \Theta^{\prime}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\}.

The sign-rank is also called support rank (Cohn and Umans,, 2013), minimal rank (Alon et al.,, 2016), and nondeterministic rank (De Wolf,, 2003). Earlier work defines sign-rank for binary-valued tensors; we extend the notion to continuous-valued tensors. Note that the sign-rank concerns only the sign pattern but discards the magnitude information of Θ\Theta. In particular, srank(Θ)=srank(sgnΘ)\textup{srank}(\Theta)=\textup{srank}(\textup{sgn}\Theta).

Like most tensor problems (Hillar and Lim,, 2013), determining the sign-rank for a general tensor is NP-hard (Alon et al.,, 2016). Fortunately, tensors arising in applications often possess special structures that facilitate the analysis. By definition, the sign-rank is upper bounded by the tensor rank. More generally, we have the following upper bounds.

Proposition 1 (Upper bounds of the sign-rank).

For any strictly monotonic function g:g\colon\mathbb{R}\to\mathbb{R} with g(0)=0g(0)=0,

srank(Θ)rank(g(Θ)).\textup{srank}(\Theta)\leq\textup{rank}(g(\Theta)).

Conversely, the sign-rank can be much smaller than the tensor rank, as we have shown in the examples of Section 1.

Proposition 2 (Broadness).

For every order K2K\geq 2 and dimension dd, there exist tensors Θd××d\Theta\in\mathbb{R}^{d\times\cdots\times d} such that rank(Θ)d\textup{rank}(\Theta)\geq d but srank(Θπ)2\textup{srank}(\Theta-\pi)\leq 2 for all π\pi\in\mathbb{R}.

We provide several examples in Section 7.2, in which the tensor rank grows with the dimension dd but the sign-rank remains a constant. The results highlight the advantages of using the sign-rank in high-dimensional tensor analysis. Propositions 1 and 2 together demonstrate that the low sign-rank family is strictly broader than the usual low-rank family.

We now introduce a tensor family, which we coin as “sign representable tensors”, for the signal model in (3).

Definition 2 (Sign representable tensors).

Fix a level π[1,1]\pi\in[-1,1]. A tensor Θ\Theta is called (r,π)(r,\pi)-sign representable, if the tensor (Θπ)(\Theta-\pi) has sign-rank bounded by rr. A tensor Θ\Theta is called rr-sign (globally) representable, if Θ\Theta is (r,π)(r,\pi)-sign representable for all π[1,1]\pi\in[-1,1]. The collection {sgn(Θπ):π[1,1]}\{\textup{sgn}(\Theta-\pi)\colon\pi\in[-1,1]\} is called the sign tensor series. We use 𝒫sgn(r)={Θ:srank(Θπ)r for all π[1,1]}\mathscr{P}_{\textup{sgn}}(r)=\{\Theta\colon\textup{srank}(\Theta-\pi)\leq r\text{ for all }\pi\in[-1,1]\} to denote the rr-sign representable tensor family.

We show that the rr-sign representable tensor family is a general model that incorporates most existing tensor models, including low-rank tensors, single index models, GLM models, and several hypergraphon models.

Example 1 (CP/Tucker low-rank models).

The CP and Tucker low-rank tensors are the two most popular tensor models (Kolda and Bader,, 2009). Let Θ\Theta be a low-rank tensor with CP rank rr. We see that Θ\Theta belongs to the sign representable family; i.e., Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1) (the constant 11 is due to rank(Θπ)r+1\textup{rank}(\Theta-\pi)\leq r+1). Similar results hold for Tucker low-rank tensors Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1), where r=krkr=\prod_{k}r_{k} with rkr_{k} being the kk-th mode Tucker rank of Θ\Theta.

Example 2 (Tensor block models (TBMs)).

The tensor block model (Wang and Zeng,, 2019; Chi et al.,, 2020) assumes a checkerboard structure among tensor entries up to marginal index permutation. The signal tensor Θ\Theta takes at most rr distinct values, where rr is the total number of multiway blocks. Our model incorporates the TBM because Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r).
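For illustration, a block-model signal can be generated as a multilinear product of a block-mean core with binary membership matrices; the numpy sketch below uses illustrative choices (3 clusters per mode, so at most 27 multiway blocks, and uniform block means):

import numpy as np

rng = np.random.default_rng(2)
d, r = 30, 3
# membership matrices M_k in {0,1}^{d x r}: each row has exactly one 1
M = [np.eye(r)[rng.integers(0, r, size=d)] for _ in range(3)]
C = rng.uniform(-1, 1, size=(r, r, r))                 # block means

# Theta(i,j,k) = C(z_1(i), z_2(j), z_3(k)); at most r^3 = 27 distinct values
Theta = np.einsum("ia,jb,kc,abc->ijk", M[0], M[1], M[2], C)
print(len(np.unique(Theta)))                           # <= 27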

Example 3 (Generalized linear models (GLMs)).

Let 𝒴\mathcal{Y} be a binary tensor from a logistic model (Wang and Li,, 2020) with mean Θ=logit(𝒵)\Theta=\text{logit}(\mathcal{Z}), where 𝒵\mathcal{Z} is a latent low-rank tensor. Notice that Θ\Theta itself may be high-rank (see Section 1). By definition, Θ\Theta is a low-rank sign representable tensor. The same conclusion holds for general exponential-family models with a (known) link function (Hong et al.,, 2020).

Example 4 (Single index models (SIMs)).

The single index model is a flexible semiparametric model proposed in economics (Robinson,, 1988) and high-dimensional statistics (Balabdaoui et al.,, 2019; Ganti et al.,, 2017). We here extend the model to higher-order tensors Θ\Theta. The SIM assumes the existence of an (unknown) monotonic function g:g\colon\mathbb{R}\to\mathbb{R} such that g(Θ)g(\Theta) has rank rr. We see that Θ\Theta belongs to the sign representable family; i.e., Θ𝒫sgn(r+1)\Theta\in\mathscr{P}_{\textup{sgn}}(r+1).

Example 5 (Min/Max hypergraphon).

The graphon is a popular nonparametric model for networks (Chan and Airoldi,, 2014; Xu,, 2018). Here we revisit the model introduced in Section 1 for generality. Let Θ\Theta be an order-KK tensor generated from the hypergraphon Θ(i1,,iK)=log(1+maxkxik(k))\Theta(i_{1},\ldots,i_{K})=\log(1+\max_{k}x^{(k)}_{i_{k}}), where xik(k)x^{(k)}_{i_{k}} are given numbers in [0,1][0,1] for all ik[dk],k[K]i_{k}\in[d_{k}],k\in[K]. We conclude that Θ𝒫sgn(2)\Theta\in\mathscr{P}_{\textup{sgn}}(2), because the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) with an arbitrary π(0,log2)\pi\in(0,\ \log 2) is a block tensor with at most two blocks (see Figure 2c).

The results extend to general min/max hypergraphons. Let g()g(\cdot) be a continuous univariate function such that the equation g(z)=πg(z)=\pi has at most r1r\geq 1 distinct real roots for every π[1,1]\pi\in[-1,1]; this property holds, e.g., when g(z)g(z) is a polynomial of degree rr. Then, the tensor Θ\Theta generated from Θ(i1,,iK)=g(maxkxik(k))\Theta(i_{1},\ldots,i_{K})=g(\max_{k}x^{(k)}_{i_{k}}) belongs to 𝒫sgn(2r)\mathscr{P}_{\textup{sgn}}(2r) (see Section 7.2). The same conclusion holds if the maximum inside g()g(\cdot) is replaced by the minimum.
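Continuing the numerical illustration from Section 1 (a sketch only; π=0.5 is an arbitrary level inside (log(1+1/d), log 2)), the shifted sign tensor of the max hypergraphon is a two-block pattern, so every matrix unfolding of sgn(Θ−π) has rank at most 2:

import numpy as np

d = 30
idx = np.arange(1, d + 1)
Z = np.maximum.reduce(np.meshgrid(idx, idx, idx, indexing="ij")) / d
Theta = np.log(1 + Z)

pi = 0.5                                   # any level in (log(1 + 1/d), log 2)
S = np.where(Theta - pi >= 0, 1, -1)       # sgn(Theta - pi)
print(np.linalg.matrix_rank(S.reshape(d, -1)))   # 2: a two-block sign pattern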

3.2 Statistical characterization of sign tensors via weighted classification

Accurate estimation of a sign representable tensor depends on the behavior of sign tensor series, sgn(Θπ)\textup{sgn}(\Theta-\pi). In this section, we show that sign tensors are completely characterized by weighted classification. The results bridge the algebraic and statistical properties of sign representable tensors.

For a given π[1,1]\pi\in[-1,1], define a π\pi-shifted data tensor 𝒴¯Ω\bar{\mathcal{Y}}_{\Omega} with entries 𝒴¯(ω)=(𝒴(ω)π)\bar{\mathcal{Y}}(\omega)=(\mathcal{Y}(\omega)-\pi) for ωΩ\omega\in\Omega. We propose a weighted classification objective function

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over|\Omega|}\sum_{\omega\in\Omega}\ \underbrace{|\bar{\mathcal{Y}}(\omega)|}_{\text{weight}}\ \times\ \underbrace{|\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|}_{\text{classification loss}}, (4)

where 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} is the decision variable to be optimized, and |𝒴¯(ω)||\bar{\mathcal{Y}}(\omega)| is the entry-specific weight, equal to the distance from the tensor entry to the target level π\pi. The entry-specific weights incorporate the magnitude information into the classification, where entries far away from the target level are penalized more heavily in the objective. In the special case of a binary tensor 𝒴{1,1}d1××dK\mathcal{Y}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}} and target level π=0\pi=0, the loss (4) reduces to the usual classification loss.
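A direct transcription of the objective (4) (a small numpy sketch; omega denotes a tuple of index arrays as produced by the sampling sketch in Section 2, and we use the convention sgn(0)=1):

import numpy as np

def weighted_loss(Z, Y, omega, pi):
    """L(Z, Ybar_Omega) in (4): weight |Y(w) - pi| times |sgn Z(w) - sgn(Y(w) - pi)|."""
    ybar = Y[omega] - pi
    sgn = lambda x: np.where(x >= 0, 1.0, -1.0)
    return np.mean(np.abs(ybar) * np.abs(sgn(Z[omega]) - sgn(ybar)))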

Our proposed weighted classification function (4) is important for characterizing sgn(Θπ)\textup{sgn}(\Theta-\pi). Define the weighted classification risk

Risk(𝒵)=𝔼𝒴ΩL(𝒵,𝒴¯Ω),\textup{Risk}(\mathcal{Z})=\mathbb{E}_{\mathcal{Y}_{\Omega}}L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}), (5)

where the expectation is taken with respect to 𝒴Ω\mathcal{Y}_{\Omega} under model (3) and the sampling distribution ωΠ\omega\sim\Pi. Note that the form of Risk()\textup{Risk}(\cdot) implicitly depends on π\pi; we suppress π\pi when no confusion arises.

Proposition 3 (Global optimum of weighted risk).

Suppose the data 𝒴Ω\mathcal{Y}_{\Omega} is generated from model (3) with Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r). Then, for all Θ¯\bar{\Theta} that are sign equivalent to sgn(Θπ)\textup{sgn}(\Theta-\pi),

Risk(Θ¯)\displaystyle\textup{Risk}(\bar{\Theta}) =inf{Risk(𝒵):𝒵d1××dK},\displaystyle=\inf\{\textup{Risk}(\mathcal{Z})\colon\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\},
=inf{Risk(𝒵):rank(𝒵)r}.\displaystyle=\inf\{\textup{Risk}(\mathcal{Z})\colon\textup{rank}(\mathcal{Z})\leq r\}. (6)

The results show that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) optimizes the weighted classification risk. This fact suggests a practical procedure to estimate sgn(Θπ)\textup{sgn}(\Theta-\pi) via empirical risk optimization of L(𝒵,𝒴¯Ω)L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}). In order to establish the recovery guarantee, we shall address the uniqueness (up to sign equivalence) of the optimizer of Risk()\textup{Risk}(\cdot). The local behavior of Θ\Theta around π\pi turns out to play a key role in the accuracy.

Some additional notation is needed. We use 𝒩={π:\mathcal{N}=\{\pi\colon ωΠ(Θ(ω)=π)0}\mathbb{P}_{\omega\sim\Pi}(\Theta(\omega)=\pi)\neq 0\} to denote the set of mass points of Θ\Theta under Π\Pi. Assume there exists a constant C>0C>0, independent of tensor dimension, such that |𝒩|C|\mathcal{N}|\leq C. Note that both Π\Pi and Θ\Theta implicitly depend on the tensor dimension. Our assumptions are imposed on Π=Π(d)\Pi=\Pi(d) and Θ=Θ(d)\Theta=\Theta(d) uniformly in the high-dimensional regime where d:=minkdkd:=\min_{k}d_{k}\to\infty.

Assumption 1 (α\alpha-smoothness).

Fix π𝒩\pi\notin\mathcal{N}. Assume there exist constants α=α(π)>0,c=c(π)>0\alpha=\alpha(\pi)>0,c=c(\pi)>0, independent of tensor dimension, such that,

sup0t<ρ(π,𝒩)ωΠ[|Θ(ω)π|t]tαc,\sup_{0\leq t<\rho(\pi,\mathcal{N})}{\mathbb{P}_{\omega\sim\Pi}[|\Theta(\omega)-\pi|\leq t]\over t^{\alpha}}\leq c, (7)

where ρ(π,𝒩):=minπ𝒩|ππ|\rho(\pi,\mathcal{N}):=\min_{\pi^{\prime}\in\mathcal{N}}|\pi-\pi^{\prime}| denotes the distance from π\pi to the nearest point in 𝒩\mathcal{N}. The largest possible α=α(π)\alpha=\alpha(\pi) in (7) is called the smoothness index at level π\pi. We make the convention that α=\alpha=\infty if the set {ω:|Θ(ω)π|t}\{\omega\colon|\Theta(\omega)-\pi|\leq t\} has zero measure, implying that almost no entries Θ(ω)\Theta(\omega) lie near the level π\pi. We call a tensor Θ\Theta α\alpha-globally smooth if (7) holds with a global constant c>0c>0 for all π[1,1]\pi\in[-1,1] except for a finite number of levels.

The smoothness index α\alpha quantifies the intrinsic hardness of recovering sgn(Θπ)\textup{sgn}(\Theta-\pi) from Risk()\textup{Risk}(\cdot). The value of α\alpha depends on both the sampling distribution ωΠ\omega\sim\Pi and the behavior of Θ(ω)\Theta(\omega). The recovery is easier at levels around which the entries are less concentrated, corresponding to a large value α>1\alpha>1, or equivalently, to a cumulative distribution function (CDF) G(π):=ωΠ[Θ(ω)π]G(\pi):=\mathbb{P}_{\omega\sim\Pi}[\Theta(\omega)\leq\pi] that remains flat around π\pi. A small value α<1\alpha<1 indicates a nonexistent (infinite) density at level π\pi, or equivalently, a G(π)G(\pi) that jumps at π\pi. A typical case is α=1\alpha=1, when G(π)G(\pi) has a finite nonzero derivative in the vicinity of π\pi. Table 1 illustrates G(π)G(\pi) for various models of Θ\Theta (see the simulations in Section 5 for details).
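When the signal is known, as in simulations, the local mass in (7) can be probed empirically. The sketch below (illustrative only; uniform Π) estimates P_{ω∼Π}[|Θ(ω)−π|≤t] on a grid of t; a ratio to t^α that stays bounded indicates α-smoothness at the level π:

import numpy as np

def local_mass(Theta, pi, ts):
    """Empirical P[|Theta(w) - pi| <= t] under uniform sampling, for each t in ts."""
    dist = np.abs(Theta - pi).ravel()
    return np.array([(dist <= t).mean() for t in ts])

# example: entries with a bounded density around pi = 0.3, hence alpha = 1
rng = np.random.default_rng(3)
Theta = rng.uniform(-1, 1, size=(30, 30, 30))
ts = np.array([0.01, 0.02, 0.05, 0.1])
print(local_mass(Theta, 0.3, ts) / ts)      # roughly constant ratio, consistent with alpha = 1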

We now reach the main theorem in this section. For two tensors Θ1,Θ2\Theta_{1},\Theta_{2}, define the mean absolute error (MAE)

MAE(Θ1,Θ2)=def𝔼ωΠ|Θ1(ω)Θ2(ω)|.\text{MAE}(\Theta_{1},\Theta_{2})\stackrel{{\scriptstyle\text{def}}}{{=}}\mathbb{E}_{\omega\sim\Pi}|\Theta_{1}(\omega)-\Theta_{2}(\omega)|.
Theorem 1 (Identifiability).

Under Assumption 1, for all tensors Θ¯sgn(Θπ)\bar{\Theta}\simeq\textup{sgn}(\Theta-\pi) and tensors 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}},

MAE(sgn𝒵,sgnΘ¯)C(π)[Risk(𝒵)Risk(Θ¯)]α/(α+1),\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq C(\pi)\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right]^{\alpha/(\alpha+1)},

where C(π)>0C(\pi)>0 is independent of 𝒵\mathcal{Z}.

The result establishes the recovery stability of the sign tensors sgn(Θπ)\textup{sgn}(\Theta-\pi) using optimization with the population risk (5). The bound immediately shows the uniqueness of the optimizer of Risk()\text{Risk}(\cdot) up to a zero-measure set under Π\Pi. We find that a higher value of α\alpha implies more stable recovery, as intuition would suggest. Similar results hold for optimization with the sample risk (4) (see Section 4).

We conclude this section by applying Assumption 1 to the examples described in Section 3.1. For simplicity, suppose for now that Π\Pi is the uniform sampling distribution. The tensor block model is \infty-globally smooth. This is because the set 𝒩\mathcal{N}, which consists of the distinct block means in Θ\Theta, has finitely many elements. Furthermore, we have α=\alpha=\infty for all π𝒩\pi\notin\mathcal{N}, since the numerator in (7) is zero for all such π\pi. The min/max hypergraphon model with an rr-degree polynomial is 11-globally smooth, because α=1\alpha=1 for all π\pi in the function range except for at most (r1)(r-1) stationary points.

4 Nonparametric tensor completion via sign series

In the previous sections we established the sign series representation and its relationship to classification. In this section, we present the algorithm proposed in Section 2 (Figure 2) in detail. We provide the estimation error bound and address the empirical implementation of the algorithm.

4.1 Estimation error and sample complexity

Given a noisy incomplete tensor observation 𝒴Ω\mathcal{Y}_{\Omega} from model (3), we cast the problem of estimating Θ\Theta into a series of weighted classifications. Specifically we propose the tensor estimate using the sign representation,

Θ^=12H+1πsgn𝒵^π,\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}{\hat{\mathcal{Z}}_{\pi}}, (8)

where 𝒵^πd1××dK\hat{\mathcal{Z}}_{\pi}\in\mathbb{R}^{d_{1}\times\dots\times d_{K}} is the π\pi-weighted classifier estimated at levels π={1,,1H,0,1H,,1}\pi\in\mathcal{H}=\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\},

𝒵^π=argmin𝒵:rank𝒵rL(𝒵,𝒴Ωπ).\hat{\mathcal{Z}}_{\pi}=\operatorname*{arg\,min}_{\mathcal{Z}\colon\text{rank}\mathcal{Z}\leq r}L(\mathcal{Z},\mathcal{Y}_{\Omega}-\pi). (9)

Here L(,)L(\cdot,\cdot) denotes the weighted classification objective defined in (4), where we have plugged 𝒴¯Ω=(𝒴Ωπ)\bar{\mathcal{Y}}_{\Omega}=(\mathcal{Y}_{\Omega}-\pi) into the expression, and the rank constraint follows from Proposition 3. For the theory, we assume the true rr is known; in practice, rr could be chosen in a data-adaptive fashion via cross-validation or the elbow method (Hastie et al.,, 2009). Step (9) corresponds to Figure 2c, while step (8) corresponds to Figure 2d.

The next theorem establishes the statistical convergence for the sign tensor estimate (9), which is an important ingredient for the final signal tensor estimate Θ^\hat{\Theta} in (8).

Theorem 2 (Sign tensor estimation).

Suppose Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r) and Θ(ω)\Theta(\omega) is α\alpha-globally smooth under ωΠ\omega\sim\Pi. Let 𝒵^π\hat{\mathcal{Z}}_{\pi} be the estimate in (9), dmax=maxk[K]dkd_{\max}=\max_{k\in[K]}d_{k}, and dmaxr|Ω|d_{\max}r\lesssim|\Omega|. Then, for all π[1,1]\pi\in[-1,1] except for a finite number of levels, with very high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(sgn𝒵^π,sgn(Θπ))(dmaxr|Ω|)αα+2+1ρ2(π,𝒩)dmaxr|Ω|.\displaystyle\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}}_{\pi},\textup{sgn}(\Theta-\pi))\lesssim\left({d_{\max}r\over|\Omega|}\right)^{\alpha\over\alpha+2}+{1\over\rho^{2}(\pi,\mathcal{N})}{d_{\max}r\over|\Omega|}. (10)

Theorem 2 provides the error bound for the sign tensor estimation. Compared to the population results in Theorem 1, we here explicitly reveal the dependence of accuracy on the sample complexity and on the level π\pi. The result demonstrates the polynomial decay of sign errors with |Ω||\Omega|. In particular, our sign estimate achieves consistent recovery using as few as O~(dmaxr)\tilde{O}(d_{\max}r) noisy entries.

Combining the sign representability of the signal tensor and the sign estimation accuracy, we obtain the main results on our nonparametric tensor estimation method.

Theorem 3 (Tensor estimation error).

Consider the same conditions of Theorem 2. Let Θ^\hat{\Theta} be the estimate in (8). With very high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(Θ^,Θ)(dmaxr|Ω|)αα+2+1H+Hdmaxr|Ω|.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left({d_{\max}r\over|\Omega|}\right)^{\alpha\over\alpha+2}+{1\over H}+{Hd_{\max}r\over|\Omega|}. (11)

In particular, setting H(|Ω|dmaxr)1/2\scriptstyle H\asymp\left(|\Omega|\over d_{\max}r\right)^{1/2} yields the error bound

MAE(Θ^,Θ)(dmaxr|Ω|)αα+212.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left(d_{\max}r\over|\Omega|\right)^{{\alpha\over\alpha+2}\vee{1\over 2}}. (12)

Theorem 3 demonstrates the convergence rate of our tensor estimation. The bound (11) reveals three sources of error: the estimation error of the sign tensors, the bias from the sign series representation, and the variance thereof. The resolution parameter HH controls the bias-variance tradeoff. We remark that the signal estimation error (12) is generally no better than the corresponding sign error (10). This is to be expected, since magnitude estimation is a harder problem than sign estimation.
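To make the role of HH explicit (a sketch of the arithmetic behind (12)), equating the bias term with the variance term in (11) gives

\frac{1}{H}=\frac{Hd_{\max}r}{|\Omega|}\quad\Longleftrightarrow\quad H\asymp\left(\frac{|\Omega|}{d_{\max}r}\right)^{1/2},

at which point both terms scale as (d_{\max}r/|\Omega|)^{1/2}; combining this with the first term of (11) yields the bound (12).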

In the special case of full observation with equal dimension dk=d,k[K]d_{k}=d,k\in[K], our signal estimate achieves convergence

MAE(Θ^,Θ)(rdK1)αα+212.\textup{MAE}(\hat{\Theta},\Theta)\lesssim\left(r\over d^{K-1}\right)^{{\alpha\over\alpha+2}\vee{1\over 2}}. (13)

Compared to earlier methods, our estimation accuracy applies to both low- and high-rank signal tensors. The rate depends on the sign complexity Θ𝒫sgn(r)\Theta\in\mathscr{P}_{\textup{sgn}}(r), and this rr is often much smaller than the usual tensor rank (see Section 3.1). Our result also reveals that the convergence becomes favorable as the order of data tensor increases.

We apply our method to the main examples in Section 3.1, and compare the results with existing literature. The numerical comparison is provided in Section 5.

Example 2 (TBM).

Consider a tensor block model with rr multiway blocks in total. Our result implies a rate 𝒪(d(K1)/2)\mathcal{O}(d^{-(K-1)/2}) by taking α=\alpha=\infty. This rate agrees with the previous root-mean-square error (RMSE) rate for block tensor estimation (Wang and Zeng,, 2019).

Example 3 (GLM).

Consider a GLM tensor Θ=g(𝒵)\Theta=g(\mathcal{Z}), where gg is a known link function and 𝒵\mathcal{Z} is a latent low-rank tensor. Suppose the marginal density of Θ(ω)\Theta(\omega) is bounded as dd\to\infty. Applying our results with α=1\alpha=1 yields a rate 𝒪(d(K1)/3)\mathcal{O}(d^{-(K-1)/3}). This rate is slightly slower than the parametric RMSE rate (Zhang and Xia,, 2018; Wang and Li,, 2020). One possible reason is that our estimate remains valid for unknown gg and general high-rank tensors with α=1\alpha=1. The nonparametric rate is the price one has to pay for not knowing the form Θ=g(𝒵)\Theta=g(\mathcal{Z}) a priori.

The following sample complexity for nonparametric tensor completion is a direct consequence of Theorem 3.

Corollary 1 (Sample complexity for nonparametric completion).

Under the same conditions of Theorem 3 with α0\alpha\neq 0, with high probability over 𝒴Ω\mathcal{Y}_{\Omega},

MAE(Θ^,Θ)0,as|Ω|dmaxr.\textup{MAE}(\hat{\Theta},\Theta)\to 0,\quad\text{as}\quad{|\Omega|\over{d_{\max}}r}\to\infty.

Our result improves earlier work (Yuan and Zhang,, 2016; Ghadermarzy et al.,, 2019; Lee and Wang,, 2020) by allowing both low- and high-rank signals. Interestingly, the sample requirements depend only on the sign complexity rr but not on the nonparametric complexity α\alpha. Note that 𝒪~(dmaxr)\tilde{\mathcal{O}}(d_{\max}r) roughly matches the degrees of freedom of the sign tensors, suggesting the optimality of our sample requirements.

4.2 Numerical implementation

This section addresses the practical implementation of our estimator (8) illustrated in Figure 2. Our sign representation of the signal estimate Θ^\hat{\Theta} is an average of 2H+12H+1 sign tensors, which can be computed in a divide-and-conquer fashion. Briefly, we estimate the sign tensors 𝒵π\mathcal{Z}_{\pi} (detailed in the next paragraph) for the series π\pi\in\mathcal{H} in parallel, and then we aggregate the results to yield the output. The estimate therefore enjoys a computational cost similar to that of a single sign tensor estimation.

For the sign tensor estimation (9), the problem reduces to binary tensor decomposition with a weighted classification loss. A number of algorithms have been developed for this problem (Ghadermarzy et al.,, 2018; Wang and Li,, 2020; Hong et al.,, 2020). We adopt similar ideas by tailoring the algorithms to our context. Following the common practice in classification, we replace the binary loss (z,y)=|sgnzsgny|\ell(z,y)=|\textup{sgn}z-\textup{sgn}y| with a surrogate loss F(m)F(m), a continuous function of the margin m:=zsgn(y)m:=z\textup{sgn}(y). Examples of large-margin losses are the hinge loss F(m)=(1m)+F(m)=(1-m)_{+}, the logistic loss F(m)=log(1+em)F(m)=\log(1+e^{-m}), and the nonconvex ψ\psi-loss F(m)=2min(1,(1m)+)F(m)=2\min(1,(1-m)_{+}), where m+=max(m,0)m_{+}=\max(m,0). We implement the hinge loss and the logistic loss in our algorithm, although our framework is applicable to general large-margin losses (Bartlett et al.,, 2006).
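Written out explicitly, the three surrogates are simple functions of the margin; a small illustrative numpy sketch:

import numpy as np

def hinge(m):      # F(m) = (1 - m)_+
    return np.maximum(1 - m, 0)

def logistic(m):   # F(m) = log(1 + exp(-m))
    return np.logaddexp(0, -m)

def psi_loss(m):   # nonconvex psi-loss F(m) = 2 * min(1, (1 - m)_+)
    return 2 * np.minimum(1, np.maximum(1 - m, 0))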

Algorithm 1 Nonparametric tensor completion
1:Noisy and incomplete data tensor 𝒴Ω\mathcal{Y}_{\Omega}, rank rr, resolution parameter HH.
2:for π={1,,1H,0,1H,,1}\pi\in\mathcal{H}=\{-1,\ldots,-{1\over H},0,{1\over H},\ldots,1\} do
3:     Random initialization of tensor factors 𝑨k=[𝒂1(k),,𝒂r(k)]dk×r\bm{A}_{k}=[\bm{a}^{(k)}_{1},\ldots,\bm{a}^{(k)}_{r}]\in\mathbb{R}^{d_{k}\times r} for all k[K]k\in[K].
4:     while not convergence do
5:         for k=1,,Kk=1,\ldots,K do
6:              Update 𝑨k\bm{A}_{k} while holding others fixed: 𝑨kargmin𝑨kdk×rωΩ|𝒴(ω)π|F(𝒵(ω)sgn(𝒴(ω)π))\bm{A}_{k}\leftarrow\operatorname*{arg\,min}_{\bm{A}_{k}\in\mathbb{R}^{d_{k}\times r}}\sum_{\omega\in\Omega}|\mathcal{Y}(\omega)-\pi|F(\mathcal{Z}(\omega)\textup{sgn}(\mathcal{Y}(\omega)-\pi)), where F()F(\cdot) is the large-margin loss, and 𝒵=s[r]𝒂s(1)𝒂s(K)\mathcal{Z}=\sum_{s\in[r]}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s} is a rank-rr tensor.
7:         end for
8:     end while
9:     Return 𝒵πs[r]𝒂s(1)𝒂s(K)\mathcal{Z}_{\pi}\leftarrow\sum_{s\in[r]}\bm{a}^{(1)}_{s}\otimes\cdots\otimes\bm{a}^{(K)}_{s}.
10:end for
11:Estimated signal tensor Θ^=12H+1πsgn(𝒵π)\hat{\Theta}={1\over 2H+1}\sum_{\pi\in\mathcal{H}}\textup{sgn}(\mathcal{Z}_{\pi}).

The rank constraint in the optimization (9) has been extensively studied in the literature. Recent developments involve convex norm relaxation (Ghadermarzy et al.,, 2018) and nonconvex optimization (Wang and Li,, 2020; Han et al.,, 2020). Unlike for matrices, computing the tensor convex norm is NP-hard, so we choose (nonconvex) alternating optimization due to its numerical efficiency. Briefly, we use the rank decomposition (2) of 𝒵=𝒵(𝑨1,,𝑨K)\mathcal{Z}=\mathcal{Z}(\bm{A}_{1},\ldots,\bm{A}_{K}) to optimize the unknown factor matrices 𝑨k=[𝒂1(k),,𝒂r(k)]dk×r\bm{A}_{k}=[\bm{a}^{(k)}_{1},\ldots,\bm{a}^{(k)}_{r}]\in\mathbb{R}^{d_{k}\times r}, where we choose to collect the tensor singular values into 𝑨K\bm{A}_{K}. We numerically solve (9) by optimizing one factor 𝑨k\bm{A}_{k} at a time while holding the others fixed. Each suboptimization reduces to a convex problem with a low-dimensional decision variable. Following common practice in tensor optimization (Anandkumar et al.,, 2014; Hong et al.,, 2020), we run the optimization from multiple initializations and retain the estimate with the lowest objective value. The full procedure is described in Algorithm 1.
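The sketch below gives a compact Python/numpy rendering of Algorithm 1 for order-3 tensors. It follows the alternating structure of Algorithm 1 with the hinge surrogate but, as a simplification, replaces each inner convex solve with plain (sub)gradient steps; the function names, learning rate, and iteration counts are illustrative assumptions rather than the authors' implementation:

import numpy as np

def cp_from_factors(A):
    """Rank-r tensor sum_s a_s^(1) o a_s^(2) o a_s^(3) from factor matrices A[k] (d_k x r)."""
    return np.einsum("ir,jr,kr->ijk", A[0], A[1], A[2])

def fit_sign_tensor(Y, omega, pi, rank, n_iter=200, lr=0.05, seed=0):
    """Estimate Z_pi by minimizing sum_w |Y(w)-pi| * hinge(Z(w) * sgn(Y(w)-pi))
    over rank-`rank` CP tensors Z; omega is a tuple of index arrays of observed entries."""
    rng = np.random.default_rng(seed)
    dims = Y.shape
    A = [0.1 * rng.standard_normal((d, rank)) for d in dims]
    w = np.abs(Y[omega] - pi)                      # entry-specific weights
    s = np.where(Y[omega] - pi >= 0, 1.0, -1.0)    # sgn(Y(w) - pi)
    for _ in range(n_iter):
        for k in range(3):                         # cycle over the three factors
            Z = cp_from_factors(A)
            m = Z[omega] * s                       # margins at observed entries
            g_out = np.where(m < 1, -w * s, 0.0)   # d/dZ(w) of the weighted hinge loss
            G = np.zeros(dims)
            np.add.at(G, omega, g_out)             # scatter the gradient back to tensor form
            # gradient w.r.t. A[k]: contract G with the other two factors
            others = [A[j] for j in range(3) if j != k]
            modes = "ijk".replace("ijk"[k], "")
            grad_k = np.einsum(f"ijk,{modes[0]}r,{modes[1]}r->{'ijk'[k]}r",
                               G, others[0], others[1])
            A[k] -= lr * grad_k / len(s)
    return cp_from_factors(A)

def nonparametric_completion(Y, omega, rank, H):
    """Theta_hat = (2H+1)^{-1} sum_pi sgn(Z_pi) over the grid pi in {-1,...,-1/H,0,1/H,...,1}."""
    levels = np.arange(-H, H + 1) / H
    out = np.zeros(Y.shape)
    for pi in levels:
        Z = fit_sign_tensor(Y, omega, pi, rank)
        out += np.where(Z >= 0, 1.0, -1.0)
    return out / len(levels)

In practice one would add multiple random initializations and a convergence check, as stated in Algorithm 1; the sketch keeps a fixed iteration budget for brevity.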

5 Simulations

In this section, we compare our nonparametric tensor method (NonParaT) with two alternative approaches: low-rank tensor CP decomposition (CPT), and the matrix version of our method applied to tensor unfoldings (NonParaM). We assess the performance under both complete and incomplete observations. The signal tensors are generated based on the four models listed in Table 1. The simulations cover a wide range of complexities, including block tensors, transformed low-rank tensors, and min/max hypergraphons with logarithmic and exponential functions. We consider order-3 tensors of equal dimension d1=d2=d3=dd_{1}=d_{2}=d_{3}=d, and set d{15,20,,55,60}d\in\{15,20,\ldots,55,60\}, r=2r=2, H=10+(d15)/5H=10+{(d-15)/5} in Algorithm 1. For NonParaM, we apply Algorithm 1 to each of the three unfolded matrices and report the average error. All summary statistics are averaged across 3030 replicates.

[Uncaptioned image]
Table 1: Simulation models used for comparison. We use 𝑴k{0,1}d×3\bm{M}_{k}\in\{0,1\}^{d\times 3} to denote membership matrices, 𝒞3×3×3\mathcal{C}\in\mathbb{R}^{3\times 3\times 3} the block means, and 𝒂=1d(1,2,,d)Td\bm{a}={1\over d}(1,2,\ldots,d)^{T}\in\mathbb{R}^{d}; 𝒵max\mathcal{Z}_{\max} and 𝒵min\mathcal{Z}_{\min} are order-3 tensors with entries 1dmax(i,j,k){1\over d}\max(i,j,k) and 1dmin(i,j,k){1\over d}\min(i,j,k), respectively.

Figure 3 compares the estimation error under full observation. The MAE decreases with the tensor dimension for all three methods. We find that our method NonParaT achieves the best performance in all scenarios, whereas the second-best method is CPT for models 1-2, and NonParaM for models 3-4. One possible reason is that models 1-2 have controlled multilinear tensor rank, which makes the tensor methods NonParaT and CPT more accurate than matrix methods. For models 3-4, the rank exceeds the tensor dimension, and therefore, the two nonparametric methods NonParaT and NonParaM exhibit a greater advantage for signal recovery.

Figure 4 shows the completion error against observation fraction. We fix d=40d=40 and gradually increase the observation fraction |Ω|d3{|\Omega|\over d^{3}} from 0.3 to 1. We find that NonParaT achieves the lowest error among all methods. Our simulation covers a reasonable range of complexities; for example, model 1 has 333^{3} jumps in the CDF of signal Θ\Theta, and models 2 and 4 have unbounded noise. Nevertheless, our method shows good performance in spite of model misspecification. This robustness is appealing in practice because the structure of underlying signal tensor is often unknown.

Refer to caption
Figure 3: Estimation error versus tensor dimension. Panels (a)-(d) correspond to simulation models 1-4 in Table 1.
Refer to caption
Figure 4: Completion error versus observation fraction. Panels (a)-(d) correspond to simulation models 1-4 in Table 1.

6 Data applications

We apply our method to two tensor datasets, the MRN-114 human brain connectivity data (Wang et al.,, 2017), and NIPS word occurrence data (Globerson et al.,, 2007).

6.1 Brain connectivity analysis

The brain dataset records the structural connectivity among 68 brain regions for 114 individuals along with their Intelligence Quotient (IQ) scores. We organize the connectivity data into an order-3 tensor, where entries encode the presence or absence of fiber connections between brain regions across individuals.

Refer to caption
Figure 5: Estimation error versus rank under different missing rate. Panels (a)-(d) correspond to missing rate 20%, 33%, 50%, and 67%, respectively. Error bar represents the standard error over 5-fold cross-validations.

Figure 5 shows the MAE based on 5-fold cross-validation with r=3,6,,15r=3,6,\ldots,15 and H=20H=20. We find that our method outperforms CPT in all combinations of ranks and missing rates. The achieved error reduction appears to be more pronounced as the missing rate increases. This trend highlights the applicability of our method to tensor completion tasks. In addition, our method exhibits a smaller standard error in the cross-validation experiments, as shown in Figure 5 and Table 2, demonstrating its stability over CPT. One possible reason is that our estimate is guaranteed to lie in [0,1][0,1] (for the binary tensor problem where 𝒴{0,1}d1××dK\mathcal{Y}\in\{0,1\}^{d_{1}\times\cdots\times d_{K}}), whereas the CPT estimate may fall outside the valid range [0,1][0,1].

                                                         MRN-114 brain connectivity dataset
  Method r=3r=3 r=6r=6 r=9r=9 r=12r=12 r=15r=15
NonparaT (Ours) 0.18(0.001){\bf 0.18}(0.001) 0.14(0.001){\bf 0.14}(0.001) 0.12(0.001){\bf 0.12}(0.001) 0.12(0.001){\bf 0.12}(0.001) 0.11(0.001){\bf 0.11}(0.001)
Low-rank CPT 0.26(0.006)0.26(0.006) 0.23(0.0060.23(0.006) 0.22(0.004)0.22(0.004) 0.21(0.006)0.21(0.006) 0.20(0.008)0.20(0.008)
                                                         NIPS word occurrence dataset
  Method r=3r=3 r=6r=6 r=9r=9 r=12r=12 r=15r=15
NonparaT (Ours) 0.18(0.002){\bf 0.18}(0.002) 0.16(0.002){\bf 0.16}(0.002) 0.15(0.001){\bf 0.15}(0.001) 0.14(0.001){\bf 0.14}(0.001) 0.13(0.001){\bf 0.13}(0.001)
Low-rank CPT 0.22(0.004)0.22(0.004) 0.20(0.007)0.20(0.007) 0.19(0.007)0.19(0.007) 0.17(0.007)0.17(0.007) 0.17(0.007)0.17(0.007)
Naive imputation (Baseline) 0.32(.001)0.32(.001)
Table 2: MAE comparison in the brain data and NIPS data analysis. Reported MAEs are averaged over five runs of cross-validation, with 20% entries for testing and 80% for training, with standard errors in parentheses. Bold numbers indicate the minimal MAE among three methods. For low-rank CPT, we use R function rTensor with default hyperparameters, and for our method, we set H=20H=20.

We next investigate the pattern in the estimated signal tensor. Figure 6a shows the identified top edges associated with IQ scores. Specifically, we first obtain a denoised tensor Θ^68×68×114\hat{\Theta}\in\mathbb{R}^{68\times 68\times 114} using our method with r=10r=10 and H=20H=20. Then, we perform a regression analysis of Θ^(i,j,:)114\hat{\Theta}(i,j,\colon)\in\mathbb{R}^{114} against the normalized IQ score across the 114 individuals. The regression model is repeated for each edge (i,j)[68]×[68](i,j)\in[68]\times[68]. We find that the top edges represent interhemispheric connections in the frontal lobes. The result is consistent with recent research relating brain connectivity to intelligence (Li et al.,, 2009; Wang et al.,, 2017).

6.2 NIPS data analysis

The NIPS dataset consists of word occurrence counts in papers published from 1987 to 2003. We focus on the top 100 authors and the 200 most frequent words, and normalize each word count by a log transformation with pseudo-count 1. The resulting dataset is an order-3 tensor whose entries represent the log counts of words by authors across years.

Table 2 compares the prediction accuracy of different methods. We find that our method substantially outperforms the low-rank CP method for every configuration under consideration. Further increasing the rank appears to have little effect on the performance. The comparison highlights the advantage of our method in achieving accuracy while maintaining low complexity. In addition, we also perform naive imputation, where the missing values are predicted by the sample average. Both our method and CPT outperform naive imputation, implying the necessity of incorporating the tensor structure in the analysis.

Refer to caption
Refer to caption
Figure 6: Estimated signal tensors in the data analysis. (a) top edges associated with IQ scores in the brain connectivity data. The color indicates the estimated IQ effect size. (b) top authors and words for years 1996-1999 in the NIPS data. Authors and words are ranked by marginal averages based on Θ^\hat{\Theta}, where the marginal average is denoted in the parentheses.

We next examine the estimated signal tensor Θ^\hat{\Theta} from our method. Figure 6b illustrates the results from the NIPS data, where we plot the entries in Θ^\hat{\Theta} corresponding to the top authors and the most frequent words (after excluding generic words such as figure, results, etc.). The identified pattern is consistent with the active topics in NIPS publications. Among the top words are neural (marginal mean = 1.95), learning (1.48), and network (1.21), whereas the top authors are T. Sejnowski (1.18), B. Scholkopf (1.17), M. Jordan (1.11), and G. Hinton (1.06). We also find strong heterogeneity among word occurrences across authors and years. For example, training and algorithm are popular words for B. Scholkopf and A. Smola in 1998-1999, whereas model occurs more often for M. Jordan and in 1996. The detected pattern and achieved accuracy demonstrate the applicability of our method.

7 Additional results and proofs

In this section, we provide additional results not covered in the previous sections. Section 7.1 gives a detailed explanation of the examples mentioned in Section 1. Section 7.2 supplements Section 3.1 by providing more theoretical results on the sign-rank and its relationship to the tensor rank. Section 7.3 collects the proofs of the theorems in the main text.

7.1 Sensitivity of tensor rank to monotonic transformations

In Section 1, we provided a motivating example showing the sensitivity of the tensor rank to monotonic transformations. Here, we describe the details of the example set-up.

Step 1 is to generate a rank-3 tensor 𝒵\mathcal{Z} based on the CP representation

𝒵=𝒂3+𝒃3+𝒄3,\mathcal{Z}=\bm{a}^{\otimes 3}+\bm{b}^{\otimes 3}+\bm{c}^{\otimes 3},

where 𝒂,𝒃,𝒄30\bm{a},\bm{b},\bm{c}\in\mathbb{R}^{30} are vectors consisting of N(0,1)N(0,1) entries, and the shorthand 𝒂3=𝒂𝒂𝒂\bm{a}^{\otimes 3}=\bm{a}\otimes\bm{a}\otimes\bm{a} denotes the Kronecker power. We then apply f(z)=(1+exp(cz))1f(z)=(1+\exp(-cz))^{-1} to 𝒵\mathcal{Z} entrywise, and obtain a transformed tensor Θ=f(𝒵)\Theta=f(\mathcal{Z}).

Step 2 is to determine the rank of Θ\Theta. Unlike for matrices, exact rank determination for tensors is NP-hard. Therefore, we choose to compute the numerical rank of Θ\Theta as an approximation. The numerical rank is defined as the minimal rank for which the relative approximation error falls below 0.10.1; i.e.,

r^(Θ)=min{s+:minΘ^:rank(Θ^)sΘΘ^FΘF0.1}.\hat{r}(\Theta)=\min\left\{s\in\mathbb{N}_{+}\colon\min_{\hat{\Theta}\colon\textup{rank}(\hat{\Theta})\leq s}{\lVert\Theta-\hat{\Theta}\rVert_{F}\over\lVert\Theta\rVert_{F}}\leq 0.1\right\}. (14)

We compute r^(Θ)\hat{r}(\Theta) by searching over s{1,,302}s\in\{1,\ldots,30^{2}\}, where for each ss we (approximately) solve the least-squares minimization using the CP function in the R package rTensor. We repeat Steps 1-2 ten times and plot the averaged numerical rank of Θ\Theta versus the transformation level cc in Figure 1a.
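The computation of (14) can also be sketched in Python as follows (the paper's computation used the R package rTensor; here we assume, as a substitute, tensorly's parafac and cp_to_tensor interface, and the steepness c=5 is an illustrative choice):

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def numerical_rank(Theta, tol=0.1, max_rank=50):
    """Smallest s whose best rank-s CP approximation has relative error <= tol, as in (14)."""
    nrm = tl.norm(Theta)
    for s in range(1, max_rank + 1):
        cp = parafac(tl.tensor(Theta), rank=s, n_iter_max=200)
        err = tl.norm(Theta - tl.cp_to_tensor(cp)) / nrm
        if err <= tol:
            return s
    return max_rank

# the transformed tensor from Step 1, at a given steepness c
rng = np.random.default_rng(0)
a, b, c_vec = (rng.standard_normal(30) for _ in range(3))
Z = np.einsum("i,j,k->ijk", a, a, a) + np.einsum("i,j,k->ijk", b, b, b) \
    + np.einsum("i,j,k->ijk", c_vec, c_vec, c_vec)
c = 5.0
Theta = 1 / (1 + np.exp(-c * Z))
print(numerical_rank(Theta))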

7.2 Tensor rank and sign-rank

In Section 3.1, we provided several tensor examples with high tensor rank but low sign-rank. This section provides more examples and their proofs. Unless otherwise specified, let Θ\Theta be an order-KK (d,,d)(d,\ldots,d)-dimensional tensor.

Example 6 (Max hypergraphon).

Suppose the tensor Θ\Theta takes the form

Θ(i1,,iK)=log(1+1dmax(i1,,iK)),for all (i1,,iK)[d]K.\Theta(i_{1},\ldots,i_{K})=\log\left(1+{1\over d}\max(i_{1},\ldots,i_{K})\right),\ \text{for all }(i_{1},\ldots,i_{K})\in[d]^{K}.

Then

rank(Θ)d,andsrank(Θπ)2for all π.\textup{rank}(\Theta)\geq d,\quad\text{and}\quad\textup{srank}(\Theta-\pi)\leq 2\ \text{for all }\pi\in\mathbb{R}.
Proof.

We first prove the results for K=2K=2. The full-rankness of Θ\Theta is verified by elementary row operations as follows:

((Θ2Θ1)/(log(1+2d)log(1+1d))(Θ3Θ2)/(log(1+3d)log(1+2d))(ΘdΘd1)/(log(1+dd)log(1+d1d))Θd/log(1+dd))=(100111111011111),\displaystyle\begin{pmatrix}(\Theta_{2}-\Theta_{1})/(\log(1+\frac{2}{d})-\log(1+\frac{1}{d}))\\ (\Theta_{3}-\Theta_{2})/(\log(1+\frac{3}{d})-\log(1+\frac{2}{d}))\\ \vdots\\ (\Theta_{d}-\Theta_{d-1})/(\log(1+\frac{d}{d})-\log(1+\frac{d-1}{d}))\\ \Theta_{d}/\log(1+\frac{d}{d})\end{pmatrix}=\begin{pmatrix}1&0&&&0\\ 1&1&\ddots&&\\ \vdots&\vdots&\ddots&\ddots&\\ 1&1&1&1&0\\ 1&1&1&1&1\end{pmatrix}, (15)

where Θi\Theta_{i} denotes the ii-th row of Θ\Theta. Now it suffices to show srank(Θπ)2\textup{srank}(\Theta-\pi)\leq 2 for π\pi in the feasible range (log(1+1d),log2)(\log(1+{1\over d}),\ \log 2). In this case, there exists an index i{2,,d}i^{*}\in\{2,\ldots,d\}, such that log(1+i1d)<πlog(1+id)\log(1+{i^{*}-1\over d})<\pi\leq\log(1+{i^{*}\over d}). By definition, the sign matrix sgn(Θπ)\textup{sgn}(\Theta-\pi) takes the form

sgn(Θ(i,j)π)={1,both i and j are smaller than i;1,otherwise.\textup{sgn}(\Theta(i,j)-\pi)=\begin{cases}-1,&\text{both $i$ and $j$ are smaller than $i^{*}$};\\ 1,&\text{otherwise}.\end{cases} (16)

Therefore, the matrix sgn(Θπ)\textup{sgn}(\Theta-\pi) is a rank-2 block matrix, which implies srank(Θπ)=2\textup{srank}(\Theta-\pi)=2.

We now extend the results to K3K\geq 3. By the definition of the tensor rank, the rank of a tensor is lower bounded by the rank of any of its matrix slices. So we have rank(Θ)rank(Θ(:,:,1,,1))=d\textup{rank}(\Theta)\geq\textup{rank}(\Theta(\colon,\colon,1,\ldots,1))=d. For the sign-rank with a feasible π\pi, notice that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) takes a similar form to (16),

sgn(Θ(i1,,iK)π)={1,ik<i for all k[K];1,otherwise,\textup{sgn}(\Theta(i_{1},\ldots,i_{K})-\pi)=\begin{cases}-1,&\text{$i_{k}<i^{*}$ for all $k\in[K]$};\\ 1,&\text{otherwise},\end{cases} (17)

where ii^{*} denotes the index that satisfies log(1+i1d)<πlog(1+id)\log(1+\frac{i^{*}-1}{d})<\pi\leq\log(1+\frac{i^{*}}{d}). Equation (17) implies that sgn(Θπ)=2𝒂K+1\textup{sgn}(\Theta-\pi)=-2\bm{a}^{\otimes K}+1, where 𝒂=(1,,1,0,,0)T\bm{a}=(1,\ldots,1,0,\ldots,0)^{T} has ii-th entry equal to 1 if i<ii<i^{*} and 0 otherwise. Hence srank(Θπ)=2\textup{srank}(\Theta-\pi)=2. ∎

In fact, Example 6 is a special case of the following proposition.

Proposition 4 (Min/Max hypergraphon).

Let 𝒵maxd1××dK\mathcal{Z}_{\max}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} denote a tensor with entries

𝒵max(i1,,iK)=max(xi1(1),,xiK(K)),\mathcal{Z}_{\max}(i_{1},\ldots,i_{K})=\max(x^{(1)}_{i_{1}},\ldots,x^{(K)}_{i_{K}}), (18)

where xik(k)[0,1]x^{(k)}_{i_{k}}\in[0,1] are given numbers for all ik[dk]i_{k}\in[d_{k}]. Let g:g\colon\mathbb{R}\to\mathbb{R} be a continuous function and Θ:=g(𝒵max)\Theta:=g(\mathcal{Z}_{\max}) be the transformed tensor. For a given π[1,1]\pi\in[-1,1], suppose the function g(z)=πg(z)=\pi has at most r1r\geq 1 distinct real roots. Then, the sign rank of (Θπ)(\Theta-\pi) satisfies

srank(Θπ)2r.\textup{srank}(\Theta-\pi)\leq 2r.

The same conclusion holds if we use min\min in place of max\max in (18).

Proof.

We reorder the tensor indices along each mode such that x1(k)xdk(k)x^{(k)}_{1}\leq\cdots\leq x^{(k)}_{d_{k}} for all k[K]k\in[K]. Based on the construction of 𝒵max\mathcal{Z}_{\max}, the reordering does not change the rank of 𝒵max\mathcal{Z}_{\max} or (Θπ)(\Theta-\pi). Let z1<<zrz_{1}<\cdots<z_{r} be the rr distinct real roots of the equation g(z)=πg(z)=\pi. We separate the proof into two cases, r=1r=1 and r2r\geq 2.

  • When r=1r=1. The continuity of g()g(\cdot) implies that the function (g(z)π)(g(z)-\pi) has at most one sign change point. Using a similar argument as in Example 6, we have

    sgn(Θπ)=12𝒂(1)𝒂(K) or sgn(Θπ)=2𝒂(1)𝒂(K)1,\displaystyle\textup{sgn}(\Theta-\pi)=1-2\bm{a}^{(1)}\otimes\cdots\otimes\bm{a}^{(K)}\quad\text{ or }\quad\textup{sgn}(\Theta-\pi)=2\bm{a}^{(1)}\otimes\cdots\otimes\bm{a}^{(K)}-1, (19)

    where 𝒂(k)\bm{a}^{(k)} are binary vectors defined by

    \bm{a}^{(k)}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}<z_{1}$}}0,\ldots,0)^{T},\quad\text{for }k\in[K].

    Therefore, srank(Θπ)rank(sgn(Θπ))=2\textup{srank}(\Theta-\pi)\leq\textup{rank}(\textup{sgn}(\Theta-\pi))=2.

  • When r2r\geq 2. By continuity, the function (g(z)π)(g(z)-\pi) is nonzero and keeps a constant sign in each of the intervals (zs,zs+1)(z_{s},z_{s+1}) for 1sr11\leq s\leq r-1. Define the index set ={s+:g(z)<π on the interval (zs,zs+1)}\mathcal{I}=\{s\in\mathbb{N}_{+}\colon\text{$g(z)<\pi$ on the interval $(z_{s},z_{s+1})$}\}. We now prove that the sign tensor sgn(Θπ)\textup{sgn}(\Theta-\pi) has rank bounded by 2r12r-1. To see this, consider the tensor indices for which sgn(Θπ)=1\textup{sgn}(\Theta-\pi)=-1,

    {ω:Θ(ω)π<0}\displaystyle\{\omega\colon\Theta(\omega)-\pi<0\} ={ω:g(𝒵max(ω))<π}\displaystyle=\{\omega\colon g(\mathcal{Z}_{\max}(\omega))<\pi\}
    =s{ω:𝒵max(ω)(zs,zs+1)}\displaystyle=\cup_{s\in\mathcal{I}}\{\omega\colon\mathcal{Z}_{\max}(\omega)\in(z_{s},z_{s+1})\}
    =s({ω:xik(k)<zs+1 for all k[K]}{ω:xik(k)zs for all k[K]}c).\displaystyle=\cup_{s\in\mathcal{I}}\Big{(}\{\omega\colon\text{$x^{(k)}_{i_{k}}<z_{s+1}$ for all $k\in[K]$}\}\cap\{\omega\colon\text{$x^{(k)}_{i_{k}}\leq z_{s}$ for all $k\in[K]$}\}^{c}\Big{)}. (20)

    Equation (20) is equivalent to

    𝟙(Θ(i1,,iK)<π)\displaystyle\mathds{1}(\Theta(i_{1},\ldots,i_{K})<\pi) =s(k𝟙(xik(k)<zs+1)k𝟙(xik(k)zs)),\displaystyle=\sum_{s\in\mathcal{I}}\left(\prod_{k}\mathds{1}(x^{(k)}_{i_{k}}<z_{s+1})-\prod_{k}\mathds{1}(x^{(k)}_{i_{k}}\leq z_{s})\right), (21)

    for all (i1,,iK)[d1]××[dK](i_{1},\ldots,i_{K})\in[d_{1}]\times\cdots\times[d_{K}], where 𝟙(){0,1}\mathds{1}(\cdot)\in\{0,1\} denotes the indicator function. Equation (21) implies the following low-rank representation of sgn(Θπ)\textup{sgn}(\Theta-\pi):

    sgn(Θπ)=12s(𝒂s+1(1)𝒂s+1(K)𝒂¯s(1)𝒂¯s(K)),\textup{sgn}(\Theta-\pi)=1-2\sum_{s\in\mathcal{I}}\left(\bm{a}^{(1)}_{s+1}\otimes\cdots\otimes\bm{a}^{(K)}_{s+1}-\bar{\bm{a}}^{(1)}_{s}\otimes\cdots\otimes\bar{\bm{a}}^{(K)}_{s}\right), (22)

    where we have denoted the two binary vectors

    \bm{a}^{(k)}_{s+1}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}<z_{s+1}$}}0,\ldots,0)^{T},\quad\text{and}\quad\bar{\bm{a}}^{(k)}_{s}=(\underbrace{1,\ldots,1,}_{\text{positions for which $x^{(k)}_{i_{k}}\leq z_{s}$}}0,\ldots,0)^{T}.

    Therefore, by (22) and the fact that |\mathcal{I}|\leq r-1, we conclude that

    srank(Θπ)1+2(r1)=2r1.\textup{srank}(\Theta-\pi)\leq 1+2(r-1)=2r-1.

Combining the two cases yields \textup{srank}(\Theta-\pi)\leq 2r for all r\geq 1. ∎
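The construction above is easy to verify numerically. The following sketch (ours, not part of the proof) checks the indicator identity (21) and the representation (22) for the illustrative choices g(z)=\cos(7z), \pi=0, and K=3; the two roots of g(z)=\pi that fall in the range of \mathcal{Z}_{\max} play the role of z_{1}<z_{2}, so |\mathcal{I}|=1 and the representation uses 1+2|\mathcal{I}|=3 rank-one terms. All variable names are illustrative.

```python
import numpy as np

# Minimal numerical check of (21)-(22) for the max-interaction model
# Theta = g(Z_max) with g(z) = cos(7z), pi = 0, and K = 3 (illustrative choices).
rng = np.random.default_rng(0)
d = 20
x = np.sort(rng.uniform(0, 1, d))            # shared coordinates x^{(k)} for all modes
z1, z2 = np.pi / 14, 3 * np.pi / 14          # roots of cos(7z) = 0 inside (0, 1)

X1, X2, X3 = x[:, None, None], x[None, :, None], x[None, None, :]
Zmax = np.maximum(np.maximum(X1, X2), X3)    # Z_max(omega) = max_k x^{(k)}_{i_k}
Theta = np.cos(7 * Zmax)                     # signal tensor Theta = g(Z_max)

# Right-hand side of (21): difference of two rank-one indicator products (s = 1)
a2 = (x < z2).astype(int)                    # a^{(k)}_{s+1}
a1 = (x <= z1).astype(int)                   # abar^{(k)}_{s}
rhs = (a2[:, None, None] * a2[None, :, None] * a2[None, None, :]
       - a1[:, None, None] * a1[None, :, None] * a1[None, None, :])

assert np.array_equal((Theta < 0).astype(int), rhs)    # identity (21) holds entrywise
assert np.array_equal(1 - 2 * rhs, np.sign(Theta))     # representation (22): 2r - 1 = 3 terms
```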

We next provide several additional examples such that \textup{rank}(\Theta)\geq d whereas \textup{srank}(\Theta)\leq c for a constant c independent of d. We state the examples in the matrix case, i.e., K=2. A similar conclusion extends to K\geq 3 by the following proposition.

Proposition 5.

Let 𝐌d1×d2\bm{M}\in\mathbb{R}^{d_{1}\times d_{2}} be a matrix. For any given K3K\geq 3, define an order-KK tensor Θd1××dK\Theta\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}} by

Θ=𝑴𝟏d3𝟏dK,\Theta=\bm{M}\otimes\mathbf{1}_{d_{3}}\otimes\cdots\otimes\mathbf{1}_{d_{K}},

where 𝟏dkdk\mathbf{1}_{d_{k}}\in\mathbb{R}^{d_{k}} denotes an all-one vector, for 3kK3\leq k\leq K. Then we have

rank(Θ)=rank(𝑴),andsrank(Θπ)=srank(𝑴π)for all π.\textup{rank}(\Theta)=\textup{rank}(\bm{M}),\quad\text{and}\quad\textup{srank}(\Theta-\pi)=\textup{srank}(\bm{M}-\pi)\ \text{for all $\pi\in\mathbb{R}$}.
Proof.

The conclusion directly follows from the definition of tensor rank. ∎

Example 7 (Stacked banded matrices).

Let 𝒂=(1,2,,d)T\bm{a}=(1,2,\ldots,d)^{T} be a dd-dimensional vector, and define a dd-by-dd banded matrix 𝑴=|𝒂𝟏𝟏𝒂|\bm{M}=|\bm{a}\otimes\mathbf{1}-\mathbf{1}\otimes\bm{a}|. Then

rank(𝑴)=d,andsrank(𝑴π)3,for all π.\textup{rank}(\bm{M})=d,\quad\text{and}\quad\textup{srank}(\bm{M}-\pi)\leq 3,\quad\text{for all }\pi\in\mathbb{R}.
Proof.

Note that 𝑴\bm{M} is a banded matrix with entries

𝑴(i,j)=|ij|,for all (i,j)[d]2.\bm{M}(i,j)={|i-j|},\quad\text{for all }(i,j)\in[d]^{2}.

Elementary row operation directly shows that 𝑴\bm{M} is full rank as follows,

((𝑴1+𝑴d)/(d1)𝑴1𝑴2𝑴2𝑴3𝑴d1𝑴d)=(11111111111111111111).\displaystyle\begin{pmatrix}(\bm{M}_{1}+\bm{M}_{d})/(d-1)\\ \bm{M}_{1}-\bm{M}_{2}\\ \bm{M}_{2}-\bm{M}_{3}\\ \vdots\\ \bm{M}_{d-1}-\bm{M}_{d}\end{pmatrix}=\begin{pmatrix}1&1&1&\ldots&1&1\\ -1&1&1&\ldots&1&1\\ -1&-1&1&\ldots&1&1\\ \vdots\\ -1&-1&-1&\ldots&-1&1\end{pmatrix}. (23)

We now show srank(𝑴π)3\textup{srank}(\bm{M}-\pi)\leq 3 by construction. Define two vectors 𝒃=(21,22,,2d)Td\bm{b}=(2^{-1},2^{-2},\ldots,2^{-d})^{T}\in\mathbb{R}^{d} and rev(𝒃)=(2d,,21)Td\text{rev}(\bm{b})=(2^{-d},\ldots,2^{-1})^{T}\in\mathbb{R}^{d}. We construct the following matrix

𝑨=𝒃rev(𝒃)+rev(𝒃)𝒃.\bm{A}=\bm{b}\otimes\text{rev}(\bm{b})+\text{rev}(\bm{b})\otimes\bm{b}. (24)

The matrix 𝑨d×d\bm{A}\in\mathbb{R}^{d\times d} is banded with entries

\bm{A}(i,j)=\bm{A}(j,i)=\bm{A}(d+1-i,d+1-j)=\bm{A}(d+1-j,d+1-i)=2^{-d-1}\left(2^{j-i}+2^{i-j}\right),\ \text{for all }(i,j)\in[d]^{2}.

Furthermore, the entry value \bm{A}(i,j) increases with respect to |i-j|; i.e.,

𝑨(i,j)𝑨(i,j),for all |ij||ij|.\bm{A}(i,j)\geq\bm{A}(i^{\prime},j^{\prime}),\quad\text{for all }|i-j|\geq|i^{\prime}-j^{\prime}|. (25)

Notice that for a given \pi\in\mathbb{R}, there exists \pi^{\prime}\in\mathbb{R} such that \textup{sgn}(\bm{A}-\pi^{\prime})=\textup{sgn}(\bm{M}-\pi). This is because both \bm{A} and \bm{M} are banded matrices whose entries increase in |i-j|, in the sense of (25). By definition (24), \bm{A} is a rank-2 matrix, so \textup{rank}(\bm{A}-\pi^{\prime})\leq 3. Hence, \textup{srank}(\bm{M}-\pi)=\textup{srank}(\bm{A}-\pi^{\prime})\leq\textup{rank}(\bm{A}-\pi^{\prime})\leq 3. ∎
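A quick numerical illustration of Example 7 (ours, not part of the proof): \bm{M} is full rank, while thresholding the rank-2 matrix \bm{A} from (24) between two bands reproduces \textup{sgn}(\bm{M}-\pi); the threshold values below are one illustrative choice.

```python
import numpy as np

# Numerical illustration of Example 7: M(i,j) = |i-j| has full rank, yet its
# sign pattern at a threshold pi is matched by a shift of the rank-2 matrix A in (24).
d = 10
idx = np.arange(1, d + 1)
M = np.abs(idx[:, None] - idx[None, :])

b = 2.0 ** (-idx)                                  # b = (2^{-1}, ..., 2^{-d})
A = np.outer(b, b[::-1]) + np.outer(b[::-1], b)    # rank-2 construction (24)

assert np.linalg.matrix_rank(M) == d               # M itself is full rank

# Threshold M between bands |i-j| = 3 and |i-j| = 4, and threshold A accordingly;
# both matrices increase in |i-j|, so the two sign patterns coincide.
pi = 3.5
pi_prime = 2.0 ** (-d - 1) * (2 ** 3.5 + 2 ** (-3.5))
assert np.array_equal(np.sign(M - pi), np.sign(A - pi_prime))
assert np.linalg.matrix_rank(A - pi_prime) <= 3    # hence srank(M - pi) <= 3
```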

Remark 1.

The tensor analogy of banded matrices Θ=|𝒂𝟏𝟏𝟏𝒂𝟏|\Theta=|\bm{a}\otimes\mathbf{1}\otimes\mathbf{1}-\mathbf{1}\otimes\bm{a}\otimes\mathbf{1}| is used as simulation model 3 in Table 1.

Example 8 (Stacked identity matrices).

Let 𝑰\bm{I} be a dd-by-dd identity matrix. Then

rank(𝑰)=d,andsrank(𝑰π)3for all π.\textup{rank}(\bm{I})=d,\quad\text{and}\quad\textup{srank}(\bm{I}-\pi)\leq 3\ \text{for all }\pi\in\mathbb{R}.
Proof.

Depending on the value of π\pi, the sign matrix sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) falls into one of the three cases: 1) sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) is a matrix of all 11; 2) sgn(𝑰π)\textup{sgn}(\bm{I}-\pi) is a matrix of all 1-1; 3) sgn(𝑰π)=2𝑰𝟏d𝟏d\textup{sgn}(\bm{I}-\pi)=2\bm{I}-\mathbf{1}_{d}\otimes\mathbf{1}_{d}. The former two cases are trivial, so it suffices to show srank(𝑰π)3\textup{srank}(\bm{I}-\pi)\leq 3 in the third case.

Based on Example 7, the rank-2 matrix 𝑨\bm{A} in (24) satisfies

𝑨(i,j){=2d,i=j,2d+2d2,ij.\bm{A}(i,j)\begin{cases}=2^{-d},&i=j,\\ \geq 2^{-d}+2^{-d-2},&i\neq j.\end{cases}

Therefore, \textup{sgn}\left(2^{-d}+2^{-d-3}-\bm{A}\right)=2\bm{I}-\mathbf{1}_{d}\otimes\mathbf{1}_{d}. We conclude that \textup{srank}(\bm{I}-\pi)\leq\textup{rank}(2^{-d}+2^{-d-3}-\bm{A})\leq 3. ∎
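The third case can be checked numerically as well; the sketch below (ours, not part of the proof) verifies that the shift of \bm{A} reproduces \textup{sgn}(\bm{I}-\pi) with rank at most 3.

```python
import numpy as np

# Numerical check of Example 8, third case: sgn(shift - A) = 2I - 11^T with A from (24).
d = 10
idx = np.arange(1, d + 1)
b = 2.0 ** (-idx)
A = np.outer(b, b[::-1]) + np.outer(b[::-1], b)     # rank-2 matrix from (24)

shift = 2.0 ** (-d) + 2.0 ** (-d - 3)
target = 2 * np.eye(d) - np.ones((d, d))            # = sgn(I - pi) for any 0 < pi < 1
assert np.array_equal(np.sign(shift - A), target)
assert np.linalg.matrix_rank(shift - A) <= 3        # hence srank(I - pi) <= 3
```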

7.3 Proofs

7.3.1 Proofs of Propositions 1-3

Proof of Proposition 1.

The strict monotonicity of g implies that the inverse function g^{-1}\colon\mathbb{R}\to\mathbb{R} is well-defined. When g is strictly increasing, the mapping x\mapsto g(x) is sign preserving. Specifically, if x\geq 0, then g(x)\geq g(0)=0. Conversely, if g(x)\geq 0=g(0), then applying g^{-1} to both sides gives x\geq 0. When g is strictly decreasing, the mapping x\mapsto g(x) is sign reversing. Specifically, if x\geq 0, then g(x)\leq g(0)=0. Conversely, if g(x)\geq 0=g(0), then applying g^{-1} to both sides gives x\leq 0. Therefore, \Theta\simeq g(\Theta) or \Theta\simeq-g(\Theta). Since multiplication by a nonzero constant changes neither the rank nor the sign rank, we have \textup{srank}(\Theta)=\textup{srank}(g(\Theta))\leq\textup{rank}(g(\Theta)). ∎

Proof of Proposition 2.

See Section 7.2 for constructive examples. ∎

Proof of Proposition 3.

Fix π[1,1]\pi\in[-1,1]. Based on the definition of classification loss L(,)L(\cdot,\cdot), the function Risk()\textup{Risk}(\cdot) relies only on the sign pattern of the tensor. Therefore, without loss of generality, we assume both Θ¯,𝒵{1,1}d1××dK\bar{\Theta},\mathcal{Z}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}} are binary tensors. We evaluate the excess risk

\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}_{\omega\sim\Pi}\underbrace{\mathbb{E}_{\mathcal{Y}(\omega)}\left\{|\mathcal{Y}(\omega)-\pi|\left[\left|\mathcal{Z}(\omega)-\textup{sgn}(\bar{\mathcal{Y}}(\omega))\right|-\left|\bar{\Theta}(\omega)-\textup{sgn}(\bar{\mathcal{Y}}(\omega))\right|\right]\right\}}_{\stackrel{\text{def}}{=}I(\omega)}. (26)

Denote y=𝒴(ω)y=\mathcal{Y}(\omega), z=𝒵(ω)z=\mathcal{Z}(\omega), θ¯=Θ¯(ω)\bar{\theta}=\bar{\Theta}(\omega), and θ=Θ(ω)\theta=\Theta(\omega). The expression of I(ω)I(\omega) is simplified as

I(ω)\displaystyle I(\omega) =𝔼y[(yπ)(θ¯z)𝟙(yπ)+(πy)(zθ¯)𝟙(y<π)]\displaystyle=\mathbb{E}_{y}\left[(y-\pi)(\bar{\theta}-z)\mathds{1}(y\geq\pi)+(\pi-y)(z-\bar{\theta})\mathds{1}(y<\pi)\right]
=𝔼y[(θ¯z)(yπ)]\displaystyle=\mathbb{E}_{y}\left[(\bar{\theta}-z)(y-\pi)\right]
=[sgn(θπ)z](θπ)\displaystyle=\left[\textup{sgn}(\theta-\pi)-z\right]\left(\theta-\pi\right)
=|sgn(θπ)z||θπ|0,\displaystyle=|\textup{sgn}(\theta-\pi)-z||\theta-\pi|\geq 0, (27)

where the third line uses the fact \mathbb{E}y=\theta and \bar{\theta}=\textup{sgn}(\theta-\pi), and the last line uses the assumption z\in\{-1,1\}. The equality in (27) is attained when z=\textup{sgn}(\theta-\pi) or \theta=\pi. Combining (27) with (26), we conclude that, for all \mathcal{Z}\in\{-1,1\}^{d_{1}\times\cdots\times d_{K}},

\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}_{\omega\sim\Pi}|\textup{sgn}(\Theta(\omega)-\pi)-\mathcal{Z}(\omega)||\Theta(\omega)-\pi|\geq 0. (28)

In particular, setting 𝒵=Θ¯=sgn(Θπ)\mathcal{Z}=\bar{\Theta}=\textup{sgn}(\Theta-\pi) in (28) yields the minimum. Therefore,

Risk(Θ¯)=min{Risk(𝒵):𝒵d1××dK}min{Risk(𝒵):rank(𝒵)r}.\textup{Risk}(\bar{\Theta})=\min\{\textup{Risk}(\mathcal{Z})\colon\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\}\leq\min\{\textup{Risk}(\mathcal{Z})\colon\textup{rank}(\mathcal{Z})\leq r\}.

Since srank(Θπ)r\textup{srank}(\Theta-\pi)\leq r by assumption, the last inequality becomes equality. The proof is complete. ∎
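As a sanity check (ours, not part of the proof), the optimality of \textup{sgn}(\Theta-\pi) under the \pi-weighted risk can be verified by simulation for a single entry; the sketch below assumes Gaussian noise purely for illustration.

```python
import numpy as np

# Monte Carlo check of Proposition 3 for a single entry with mean theta:
# among z in {-1, +1}, E |Y - pi| * |z - sgn(Y - pi)| is minimized at z = sgn(theta - pi).
rng = np.random.default_rng(0)

def weighted_risk(z, theta, pi, n=200_000):
    y = theta + rng.normal(scale=0.5, size=n)     # noisy observations (illustrative noise)
    return np.mean(np.abs(y - pi) * np.abs(z - np.sign(y - pi)))

for theta, pi in [(0.3, 0.0), (-0.4, 0.1), (0.2, 0.5)]:
    risks = {z: weighted_risk(z, theta, pi) for z in (-1, +1)}
    assert min(risks, key=risks.get) == np.sign(theta - pi)   # Bayes sign wins
```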

7.3.2 Proof of Theorem 1

Proof of Theorem 1.

Fix π[1,1]\pi\in[-1,1]. Based on (28) in Proposition 3 we have

Risk(𝒵)Risk(Θ¯)=𝔼[|sgn𝒵sgnΘ¯||Θ¯|].\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})=\mathbb{E}\left[|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}||\bar{\Theta}|\right]. (29)

The Assumption 1 states that

(|Θ¯|t)ctα,for all 0t<ρ(π,𝒩).\mathbb{P}\left(|\bar{\Theta}|\leq t\right)\leq ct^{\alpha},\quad\text{for all }0\leq t<\rho(\pi,\mathcal{N}). (30)

Unless otherwise specified, all relevant probability statements, such as \mathbb{E} and \mathbb{P}, are taken with respect to \omega\sim\Pi.

We divide the proof into two cases: α>0\alpha>0 and α=\alpha=\infty.

  • Case 1: α>0\alpha>0.

    By (29), for all 0t<ρ(π,𝒩)0\leq t<\rho(\pi,\mathcal{N}),

    \textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq t\,\mathbb{E}\left(|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}|\mathds{1}\{|\bar{\Theta}|>t\}\right)
    2t(sgn𝒵sgnΘ¯ and |Θ¯|>t)\displaystyle\geq 2t\mathbb{P}\left(\textup{sgn}\mathcal{Z}\neq\textup{sgn}\bar{\Theta}\text{ and }|\bar{\Theta}|>t\right)
    2t{(sgn𝒵sgnΘ¯)(|Θ¯|t)}\displaystyle\geq 2t\Big{\{}\mathbb{P}\left(\textup{sgn}\mathcal{Z}\neq\textup{sgn}\bar{\Theta}\right)-\mathbb{P}\left(|\bar{\Theta}|\leq t\right)\Big{\}}
    t{MAE(sgn𝒵,sgnΘ¯)2ctα},\displaystyle\geq t\Big{\{}\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})-2ct^{\alpha}\Big{\}}, (31)

    where the last line follows from the definition of MAE and (30). We maximize the lower bound (31) with respect to t, and obtain the optimal t_{\text{opt}} (checked numerically in the sketch following this proof),

    topt={ρ(π,𝒩),if MAE(sgn𝒵,sgnΘ¯)>2c(1+α)ρα(π,𝒩),[12c(1+α)MAE(sgn𝒵,sgnΘ¯)]1/α,if MAE(sgn𝒵,sgnΘ¯)2c(1+α)ρα(π,𝒩).t_{\text{opt}}=\begin{cases}\rho(\pi,\mathcal{N}),&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})>2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\\ \left[{1\over 2c(1+\alpha)}\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\right]^{1/\alpha},&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq 2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}).\end{cases}

    The corresponding lower bound of the inequality (31) becomes

    Risk(𝒵)Risk(Θ¯){c1ρ(π,𝒩)MAE(sgn𝒵,sgnΘ¯),if MAE(sgn𝒵,sgnΘ¯)>2c(1+α)ρα(π,𝒩),c2[MAE(sgn𝒵,sgnΘ¯)]1+αα,if MAE(sgn𝒵,sgnΘ¯)2c(1+α)ρα(π,𝒩),\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq\begin{cases}c_{1}\rho(\pi,\mathcal{N})\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}),&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})>2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\\ c_{2}\left[\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\right]^{1+\alpha\over\alpha},&\text{if }\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\leq 2c(1+\alpha)\rho^{\alpha}(\pi,\mathcal{N}),\end{cases}

    where c1,c2>0c_{1},c_{2}>0 are two constants independent of 𝒵\mathcal{Z}. Combining both cases gives

    MAE(sgn𝒵,sgnΘ¯)\displaystyle\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}) [Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)]\displaystyle\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right] (32)
    C(π)[Risk(𝒵)Risk(Θ¯)]α1+α,\displaystyle\leq C(\pi)[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}, (33)

    where C(π)>0C(\pi)>0 is a multiplicative factor independent of 𝒵\mathcal{Z}.

  • Case 2: \alpha=\infty. The inequality (31) now becomes

    Risk(𝒵)Risk(Θ¯)tMAE(sgnΘ¯,sgn𝒵),for all 0t<ρ(π,𝒩).\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\geq t\textup{MAE}(\textup{sgn}\bar{\Theta},\textup{sgn}\mathcal{Z}),\quad\text{for all }0\leq t<\rho(\pi,\mathcal{N}). (34)

    The conclusion follows by taking t={\rho(\pi,\mathcal{N})\over 2} in the inequality (34). ∎
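The maximization step in Case 1 of the proof above admits a direct numerical check. The sketch below (ours, with arbitrary illustrative parameter values; m plays the role of MAE) compares the closed-form maximizer of t\mapsto t(m-2ct^{\alpha}) over [0,\rho) with a grid search.

```python
import numpy as np

# Check the maximizer t_opt of h(t) = t * (m - 2 c t^alpha) over [0, rho) used in Case 1.
def t_opt(m, c, alpha, rho):
    t_star = (m / (2 * c * (1 + alpha))) ** (1 / alpha)   # stationary point of h
    return min(t_star, rho)                               # clip to the feasible range

m, c, alpha, rho = 0.3, 1.0, 2.0, 0.25                    # illustrative values
grid = np.linspace(1e-6, rho - 1e-6, 100_000)
h = grid * (m - 2 * c * grid ** alpha)
assert abs(grid[np.argmax(h)] - t_opt(m, c, alpha, rho)) < 1e-3
```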

Remark 2.

The proof of Theorem 1 shows that, under Assumption 1,

MAE(sgn𝒵,sgnΘ¯)[Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)],\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta})\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}\left[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})\right], (35)

for all \mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}. For fixed \pi, the factor 1/\rho(\pi,\mathcal{N}) is a constant, so the second term can be absorbed into the first.

7.3.3 Proof of Theorem 2

The following lemma provides the variance-to-mean relationship implied by the α\alpha-smoothness of Θ\Theta. The relationship plays a key role in determining the convergence rate based on empirical process theory (Shen and Wong,, 1994).

Lemma 1 (Variance-to-mean relationship).

Consider the same setup as in Theorem 2. Fix π[1,1]\pi\in[-1,1]. Let L(𝒵,Y¯Ω)L(\mathcal{Z},\bar{Y}_{\Omega}) be the π\pi-weighted classification loss

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over|\Omega|}\sum_{\omega\in\Omega}\ \underbrace{|\bar{\mathcal{Y}}(\omega)|}_{\text{weight}}\ \times\ \underbrace{|\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|}_{\text{classification loss}}={1\over|\Omega|}\sum_{\omega\in\Omega}\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}), (36)

where we have denoted the function ω(𝒵,𝒴¯)=def|𝒴¯(ω)||sgn𝒵(ω)sgn𝒴¯(ω)|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}})\stackrel{{\scriptstyle\text{def}}}{{=}}|\bar{\mathcal{Y}}(\omega)||\textup{sgn}\mathcal{Z}(\omega)-\textup{sgn}\bar{\mathcal{Y}}(\omega)|. Under Assumption 1 of the (α,π)(\alpha,\pi)-smoothness of Θ\Theta, we have

Var[ω(𝒵,𝒴¯)ω(Θ¯,𝒴¯Ω)][Risk(𝒵)Risk(Θ¯)]α1+α+1ρ(π,𝒩)[Risk(𝒵)Risk(Θ¯)],\textup{Var}[\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})]\lesssim[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})]^{\alpha\over 1+\alpha}+{1\over\rho(\pi,\mathcal{N})}[\textup{Risk}(\mathcal{Z})-\textup{Risk}(\bar{\Theta})], (37)

for all tensors 𝒵d1××dK\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}. Here the expectation and variance are taken with respect to both 𝒴\mathcal{Y} and ωΠ\omega\sim\Pi.

Proof of Lemma 1.

We expand the variance by

Var[ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)]\displaystyle\text{Var}[\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})] 𝔼|ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)|2\displaystyle\lesssim\mathbb{E}|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})|^{2}
𝔼|ω(𝒵,𝒴¯Ω)ω(Θ¯,𝒴¯Ω)|\displaystyle\lesssim\mathbb{E}|\ell_{\omega}(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-\ell_{\omega}(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})|
𝔼|sgn𝒵sgnΘ¯|=MAE(sgn𝒵,sgnΘ¯),\displaystyle\leq\mathbb{E}|\textup{sgn}\mathcal{Z}-\textup{sgn}\bar{\Theta}|=\textup{MAE}(\textup{sgn}\mathcal{Z},\textup{sgn}\bar{\Theta}), (38)

where the second line comes from the boundedness of the classification loss L(\cdot,\cdot), and the third line comes from the inequality \left||a-b|-|c-b|\right|\leq|a-c| for a,b,c\in\{-1,1\}, together with the boundedness of the classification weight |\bar{\mathcal{Y}}(\omega)|. Here we have absorbed the constant multipliers into \lesssim. The conclusion (37) then directly follows by applying Remark 2 to (38). ∎
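For concreteness, the weighted loss (36) is straightforward to compute from data; the sketch below (ours) evaluates it for an order-3 tensor, reading \bar{\mathcal{Y}}=\mathcal{Y}-\pi as in (26), with all variable names illustrative.

```python
import numpy as np

# Sketch of the pi-weighted empirical classification loss (36): the average over Omega
# of |Ybar(omega)| times the sign disagreement |sgn Z(omega) - sgn Ybar(omega)|.
def weighted_loss(Z, Y, pi, omega):
    """Z, Y: arrays of the same shape; omega: tuple of index arrays for observed entries."""
    ybar = Y[omega] - pi                          # shifted observations Ybar = Y - pi
    return np.mean(np.abs(ybar) * np.abs(np.sign(Z[omega]) - np.sign(ybar)))

# toy usage on an order-3 tensor with 500 observed entries
rng = np.random.default_rng(1)
Y = rng.uniform(-1, 1, size=(10, 10, 10))
Z = rng.normal(size=(10, 10, 10))
omega = tuple(rng.integers(0, 10, size=500) for _ in range(3))
print(weighted_loss(Z, Y, pi=0.2, omega=omega))
```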

Proof of Theorem 2.

Fix π[1,1]\pi\in[-1,1]. For notational simplicity, we suppress the subscript π\pi and write 𝒵^\hat{\mathcal{Z}} in place of 𝒵^π\hat{\mathcal{Z}}_{\pi}. Denote n=|Ω|n=|\Omega| and ρ=ρ(π,𝒩)\rho=\rho(\pi,\mathcal{N}).

Because the classification loss L(,)L(\cdot,\cdot) is scale-free, i.e., L(𝒵,)=L(c𝒵,)L(\mathcal{Z},\cdot)=L(c\mathcal{Z},\cdot) for every c>0c>0, we consider the estimation subject to 𝒵F1\lVert\mathcal{Z}\rVert_{F}\leq 1 without loss of generality. Specifically, let

𝒵^=argmin𝒵:rank(𝒵)r,𝒵F1L(𝒵,𝒴¯Ω).\hat{\mathcal{Z}}=\operatorname*{arg\,min}_{\mathcal{Z}\colon\textup{rank}(\mathcal{Z})\leq r,\lVert\mathcal{Z}\rVert_{F}\leq 1}L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega}).

We next apply empirical process theory to bound the excess risk of \hat{\mathcal{Z}}. To facilitate the analysis, we view the data \bar{\mathcal{Y}}_{\Omega}=\{\bar{\mathcal{Y}}(\omega)\colon\omega\in\Omega\} as a collection of n independent random variables, where the randomness is from both \bar{\mathcal{Y}} and \omega\sim\Pi. Write the index set \Omega=\{1,\ldots,n\}, so the loss function (36) becomes

L(𝒵,𝒴¯Ω)=1ni=1ni(𝒵,𝒴¯).L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})={1\over n}\sum_{i=1}^{n}\ell_{i}(\mathcal{Z},\bar{\mathcal{Y}}).

We use f_{\mathcal{Z}}\colon[d_{1}]\times\cdots\times[d_{K}]\to\mathbb{R} to denote the function induced by the tensor \mathcal{Z} such that f_{\mathcal{Z}}(\omega)=\mathcal{Z}(\omega) for \omega\in[d_{1}]\times\cdots\times[d_{K}]. Under this setup, the quantity of interest

L(\mathcal{Z},\bar{\mathcal{Y}}_{\Omega})-L(\bar{\Theta},\bar{\mathcal{Y}}_{\Omega})={1\over n}\sum_{i=1}^{n}\underbrace{\left[\ell_{i}(\mathcal{Z},\bar{\mathcal{Y}})-\ell_{i}(\bar{\Theta},\bar{\mathcal{Y}})\right]}_{\stackrel{\text{def}}{=}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})}, (39)

is an empirical process indexed by functions f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}, where \mathcal{T}=\{\mathcal{Z}\colon\textup{rank}(\mathcal{Z})\leq r,\ \lVert\mathcal{Z}\rVert_{F}\leq 1\}. Note that there is a one-to-one correspondence between the sets \mathcal{F}_{\mathcal{T}} and \mathcal{T}.

The remaining proof adapts the techniques of Wang et al., (2008, Theorem 3) to bound (39) over the function family f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}. We summarize only the key differences here and refer to Wang et al., (2008) for the complete proof. Based on Lemma 1, the (\alpha,\pi)-smoothness of \Theta implies

VarΔi(f𝒵,𝒴¯)[𝔼Δi(f𝒵,𝒴¯)]α1+α+1ρ𝔼Δi(f𝒵,𝒴¯),for all f𝒵𝒯.\textup{Var}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})\lesssim\left[\mathbb{E}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}})\right]^{\alpha\over 1+\alpha}+{1\over\rho}\mathbb{E}\Delta_{i}(f_{\mathcal{Z}},\bar{\mathcal{Y}}),\quad\text{for all $f_{\mathcal{Z}}\in\mathcal{F}_{\mathcal{T}}$}. (40)

Applying local iterative techniques in Wang et al., (2008, Theorem 3) to the empirical process (39) with the variance-to-mean relationship (40) gives that

(Risk(𝒵^)Risk(Θ¯)Ln)exp(nLn),\mathbb{P}\left(\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})\geq L_{n}\right)\lesssim\exp(-nL_{n}), (41)

where the convergence rate Ln>0L_{n}>0 is determined by the solution to the following inequality,

1LnLnLnα/(α+1)+Lnρ[](ε,𝒯,2)𝑑εCn,{1\over L_{n}}\int_{L_{n}}^{\sqrt{L_{n}^{\alpha/(\alpha+1)}+{L_{n}\over\rho}}}\sqrt{\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})}d\varepsilon\leq C\sqrt{n}, (42)

for some constant C>0C>0. In particular, the smallest LnL_{n} satisfying (42) yields the best upper bound of the error rate. Here [](ε,𝒯,2)\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2}) denotes the L2L_{2}-metric, ε\varepsilon-bracketing number (c.f. Definition 3) of family 𝒯\mathcal{F}_{\mathcal{T}}.

It remains to solve for the smallest possible LnL_{n} in (42). Based on Lemma 2, the inequality (42) is satisfied with

Lntn(α+1)/(α+2)+1ρtn,where tn=dmaxrKlogKn.L_{n}\asymp t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho}t_{n},\quad\text{where }t_{n}={d_{\max}rK\log K\over n}. (43)

Therefore, by (41), the following bound holds with high probability,

Risk(𝒵^)Risk(Θ¯)tn(α+1)/(α+2)+1ρtn.\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})\leq t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho}t_{n}.

Inserting the above bound into (35) gives

MAE(sgn𝒵^,sgnΘ¯)\displaystyle\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}},\textup{sgn}\bar{\Theta}) [Risk(𝒵^)Risk(Θ¯)]α/(α+1)+1ρ[Risk(𝒵^)Risk(Θ¯)]\displaystyle\lesssim[\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})]^{\alpha/(\alpha+1)}+{1\over\rho}[\textup{Risk}(\hat{\mathcal{Z}})-\textup{Risk}(\bar{\Theta})]
\lesssim t_{n}^{\alpha/(\alpha+2)}+{1\over\rho^{\alpha/(\alpha+1)}}t_{n}^{\alpha/(\alpha+1)}+{1\over\rho}t_{n}^{(\alpha+1)/(\alpha+2)}+{1\over\rho^{2}}t_{n}
4tnα/(α+2)+4ρ2tn,\displaystyle\leq 4t_{n}^{\alpha/(\alpha+2)}+{4\over\rho^{2}}t_{n}, (44)

where the last line follows from the fact that a(b^{2}+b^{(\alpha+2)/(\alpha+1)}+b+1)\leq 4a(b^{2}+1) with a={t_{n}\over\rho^{2}} and b=\rho t_{n}^{-1/(\alpha+2)}. We plug t_{n} into (44) and absorb the term K\log K into the constant. The conclusion is then proved. ∎

Definition 3 (Bracketing number).

Consider a family of functions \mathcal{F}, and let ε>0\varepsilon>0. Let 𝒳\mathcal{X} denote the domain space equipped with measure Π\Pi. We call {(fml,fmu)}m=1M\{(f^{l}_{m},f^{u}_{m})\}_{m=1}^{M} an L2L_{2}-metric, ε\varepsilon-bracketing function set of \mathcal{F}, if for every ff\in\mathcal{F}, there exists an m[M]m\in[M] such that

fml(x)f(x)fmu(x),for all x𝒳,f^{l}_{m}(x)\leq f(x)\leq f^{u}_{m}(x),\quad\text{for all }x\in\mathcal{X},

and

fmlfmu2=def𝔼xΠ|fml(x)fmu(x)|2ε,for all m=1,,M.\lVert f^{l}_{m}-f^{u}_{m}\rVert_{2}\stackrel{{\scriptstyle\text{def}}}{{=}}\sqrt{\mathbb{E}_{x\sim\Pi}|f^{l}_{m}(x)-f^{u}_{m}(x)|^{2}}\leq\varepsilon,\ \text{for all }m=1,\ldots,M.

The bracketing number with L2L_{2}-metric, denoted [](ε,,2)\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F},\lVert\cdot\rVert_{2}), is the logarithm of the smallest cardinality of the ε\varepsilon-bracketing function set of \mathcal{F}.

Lemma 2 (Bracketing complexity of low-rank tensors).

Define the family of rank-rr bounded tensors 𝒯={𝒵d1××dK:rank(𝒵)r,𝒵F1}\mathcal{T}=\{\mathcal{Z}\in\mathbb{R}^{d_{1}\times\cdots\times d_{K}}\colon\textup{rank}(\mathcal{Z})\leq r,\ \lVert\mathcal{Z}\rVert_{F}\leq 1\} and the induced function family 𝒯={f𝒵:𝒵𝒯}\mathcal{F}_{\mathcal{T}}=\{f_{\mathcal{Z}}\colon\mathcal{Z}\in\mathcal{T}\}. Set

Ln(dmaxrKlogKn)(α+1)/(α+2)+1ρ(π,𝒩)(dmaxrKlogKn).L_{n}\asymp\left({d_{\max}rK\log K\over n}\right)^{(\alpha+1)/(\alpha+2)}+{1\over\rho(\pi,\mathcal{N})}\left({d_{\max}rK\log K\over n}\right). (45)

Then, the following inequality is satisfied.

1LnLnLnα/(α+1)+Lnρ(π,𝒩)[](ε,𝒯,2)𝑑εCn1/2,{1\over L_{n}}\int^{\sqrt{L_{n}^{\alpha/(\alpha+1)}+{L_{n}\over\rho(\pi,\mathcal{N})}}}_{L_{n}}\sqrt{\mathcal{H}_{[\ ]}(\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})}d\varepsilon\leq Cn^{1/2}, (46)

where C>0C>0 is a constant independent of r,Kr,K and dmaxd_{\text{max}}.

Proof of Lemma 2.

To simplify the notation, we denote ρ=ρ(π,𝒩)\rho=\rho(\pi,\mathcal{N}). Notice that

\lVert f_{\mathcal{Z}_{1}}-f_{\mathcal{Z}_{2}}\rVert_{2}\leq\lVert f_{\mathcal{Z}_{1}}-f_{\mathcal{Z}_{2}}\rVert_{\infty}\leq\lVert\mathcal{Z}_{1}-\mathcal{Z}_{2}\rVert_{F}\quad\text{ for all }\mathcal{Z}_{1},\mathcal{Z}_{2}\in\mathcal{T}. (47)

It follows from Kosorok, (2007, Theorem 9.22) that the L2L_{2}-metric, (2ϵ)(2\epsilon)-bracketing number of 𝒯\mathcal{F}_{\mathcal{T}} is bounded by

[](2ε,𝒯,2)(ε,𝒯,F)CdmaxrKlogKε.\mathcal{H}_{[\ ]}(2\varepsilon,\mathcal{F}_{\mathcal{T}},\lVert\cdot\rVert_{2})\leq\mathcal{H}(\varepsilon,\mathcal{T},\lVert\cdot\rVert_{F})\leq Cd_{\max}rK\log{K\over\varepsilon}.

The last inequality is from the covering number bounds for rank-rr bounded tensors; see Mu et al., (2014, Lemma 3).

Inserting the bracketing number into (46) gives

g(L)=1LLLα/(α+1)+ρ1LdmaxrKlog(Kε)𝑑ε.g(L)={1\over L}\int^{\sqrt{L^{\alpha/(\alpha+1)}+{\rho^{-1}L}}}_{L}\sqrt{d_{\max}rK\log\left({K\over\varepsilon}\right)}d\varepsilon. (48)

By the monotonicity of the integrand in (48), we bound g(L)g(L) by

g(L)\displaystyle g(L) dmaxrKLLLα/(α+1)+ρ1Llog(KL)𝑑ε\displaystyle\leq{\sqrt{d_{\max}rK}\over L}\int_{L}^{\sqrt{L^{\alpha/(\alpha+1)}+\rho^{-1}L}}\sqrt{\log\left(K\over L\right)}d\varepsilon
dmaxrK(logKlogL)(Lα/(2α+2)+ρ1LL1)\displaystyle\leq\sqrt{d_{\max}rK(\log K-\log L)}\left({L^{\alpha/(2\alpha+2)}+\sqrt{\rho^{-1}L}\over L}-1\right)
dmaxrKlogK(1L(α+2)/(2α+2)+1ρL),\displaystyle\leq\sqrt{d_{\max}rK\log K}\left({1\over L^{(\alpha+2)/(2\alpha+2)}}+{1\over\sqrt{\rho L}}\right), (49)

where the second line follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for a,b>0. It remains to verify that g(L_{n})\leq Cn^{1/2} for L_{n} specified in (45). Plugging L_{n} into the last line of (49) gives

g(Ln)\displaystyle g(L_{n}) dmaxrKlogK(1Ln(α+2)/(2α+2)+1ρLn)\displaystyle\leq\sqrt{d_{\max}rK\log K}\left({1\over L_{n}^{(\alpha+2)/(2\alpha+2)}}+{1\over\sqrt{\rho L_{n}}}\right) (50)
dmaxrKlogK([(dmaxrKlogKn)α+1α+2]α+22α+2+[ρ(dmaxrKlogKρn)]12)\displaystyle\leq\sqrt{d_{\max}rK\log K}\left(\left[\left(d_{\max}rK\log K\over n\right)^{\alpha+1\over\alpha+2}\right]^{-{\alpha+2\over 2\alpha+2}}+\left[\rho\left(d_{\max}rK\log K\over\rho n\right)\right]^{-{1\over 2}}\right) (51)
Cn1/2,\displaystyle\leq Cn^{1/2}, (52)

where C>0C>0 is a constant independent of r,Kr,K and dmaxd_{\text{max}}. The proof is therefore complete. ∎

7.3.4 Proof of Theorem 3

Proof of Theorem 3.

By definition of Θ^\hat{\Theta}, we have

MAE(Θ^,Θ)\displaystyle\text{MAE}(\hat{\Theta},\Theta) =𝔼|12H+1πΠsgnZ^πΘ|\displaystyle=\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}\hat{Z}_{\pi}-\Theta\right|
𝔼|12H+1πΠ(sgnZ^πsgn(Θπ))|+𝔼|12H+1πΠsgn(Θπ)Θ|\displaystyle\leq\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\left(\textup{sgn}\hat{Z}_{\pi}-\textup{sgn}(\Theta-\pi)\right)\right|+\mathbb{E}\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}(\Theta-\pi)-\Theta\right|
12H+1πΠMAE(sgnZ^π,sgn(Θπ))+1H,\displaystyle\leq\frac{1}{2H+1}\sum_{\pi\in\Pi}\text{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))+\frac{1}{H}, (53)

where the last line comes from the triangle inequality and the inequality

|12H+1πΠsgn(Θ(ω)π)Θ(ω)|1H,for all ω[d1]××[dK].\left|\frac{1}{2H+1}\sum_{\pi\in\Pi}\textup{sgn}(\Theta(\omega)-\pi)-\Theta(\omega)\right|\leq\frac{1}{H},\quad\text{for all }\omega\in[d_{1}]\times\cdots\times[d_{K}]. (54)

Write n=|\Omega|. It now suffices to bound the first term in (53). We prove that

12H+1πΠMAE(sgnZ^π,sgn(Θπ))tnα/(α+2)+1H+Htn,with tn=dmaxrKlogKn.{1\over 2H+1}\sum_{\pi\in\Pi}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))\lesssim t_{n}^{\alpha/(\alpha+2)}+{1\over H}+Ht_{n},\quad\text{with }t_{n}={d_{\max}rK\log K\over n}. (55)

Theorem 2 implies that the sign estimation accuracy at a level \pi\in\Pi depends on the closeness of \pi to the mass points in \mathcal{N}. Therefore, we partition the levels \pi\in\Pi based on their closeness to \mathcal{N}. Specifically, let \mathcal{N}_{H}\stackrel{\text{def}}{=}\bigcup_{\pi^{\prime}\in\mathcal{N}}\left(\pi^{\prime}-\frac{1}{H},\pi^{\prime}+\frac{1}{H}\right) denote the set of levels within \frac{1}{H} of a mass point. We expand the left-hand side of (55) as

12H+1πΠMAE(sgnZ^π,sgn(Θπ))\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))
=\displaystyle= 12H+1πΠ𝒩HMAE(sgnZ^π,sgn(Θπ))+12H+1πΠ𝒩HcMAE(sgnZ^π,sgn(Θπ)).\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi))+{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\textup{MAE}(\textup{sgn}\hat{Z}_{\pi},\textup{sgn}(\Theta-\pi)). (56)

By assumption, the first term involves only a finite number of summands: each interval in \mathcal{N}_{H} contains at most two points of \Pi and each MAE term is bounded by 2, so the first term is bounded by 4C/(2H+1), where C>0 is a constant such that |\mathcal{N}|\leq C. We bound the second term using the explicit form of \rho(\pi,\mathcal{N}) for \pi\in\Pi\cap\mathcal{N}_{H}^{c}. Based on Theorem 2,

12H+1πΠ𝒩HcMAE(sgn𝒵^π,sgn(Θπ))\displaystyle{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\textup{MAE}(\textup{sgn}\hat{\mathcal{Z}}_{\pi},\textup{sgn}(\Theta-\pi)) 12H+1πΠ𝒩Hctnα/(α+2)+tn2H+1πΠ𝒩Hc1ρ2(π,𝒩)\displaystyle\lesssim{1\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over\rho^{2}(\pi,\mathcal{N})} (57)
tnα/(α+2)+tn2H+1πΠ𝒩Hcπ𝒩1|ππ|2\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}\sum_{\pi^{\prime}\in\mathcal{N}}{1\over|\pi-\pi^{\prime}|^{2}} (58)
tnα/(α+2)+tn2H+1π𝒩πΠ𝒩Hc1|ππ|2\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+{t_{n}\over 2H+1}\sum_{\pi^{\prime}\in\mathcal{N}}\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}} (59)
tnα/(α+2)+2CHtn,\displaystyle\leq t_{n}^{\alpha/(\alpha+2)}+2CHt_{n}, (60)

where the last inequality follows from Lemma 3. Combining the bounds for the two terms in (56) completes the proof of (55). Finally, plugging (55) into (53) yields

MAE(Θ^,Θ)(dmaxrKlogK|Ω|)α/(α+2)+1H+HdmaxrKlogK|Ω|.\displaystyle\text{MAE}(\hat{\Theta},\Theta)\lesssim\left(d_{\max}rK\log K\over|\Omega|\right)^{\alpha/(\alpha+2)}+\frac{1}{H}+H{d_{\max}rK\log K\over|\Omega|}. (61)

The conclusion follows by absorbing KlogKK\log K into the constant term in the statement. ∎
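The aggregation step and the discretization inequality (54) are easy to illustrate numerically. The sketch below (ours, not part of the proof) replaces the level-wise estimates \textup{sgn}\hat{\mathcal{Z}}_{\pi} with the exact sign tensors \textup{sgn}(\Theta-\pi), isolating the 1/H discretization error.

```python
import numpy as np

# Averaging sgn(Theta - pi) over the grid Pi = {-1, ..., -1/H, 0, 1/H, ..., 1}
# recovers Theta entrywise up to error 1/H, as claimed in (54).
rng = np.random.default_rng(2)
H = 20
levels = np.arange(-H, H + 1) / H                       # the 2H + 1 levels in Pi
Theta = rng.uniform(-1, 1, size=(15, 15, 15))           # any signal with entries in [-1, 1]

Theta_hat = np.mean(np.sign(Theta[None, ...] - levels[:, None, None, None]), axis=0)
assert np.max(np.abs(Theta_hat - Theta)) <= 1 / H + 1e-12
```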

Lemma 3.

Fix π𝒩\pi^{\prime}\in\mathcal{N} and a sequence Π={1,,1/H,0,1/H,,1}\Pi=\{-1,\ldots,-1/H,0,1/H,\ldots,1\} with H2H\geq 2. Then,

πΠ𝒩Hc1|ππ|24H2.\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}}\leq 4H^{2}.
Proof of Lemma 3.

Notice that all points \pi\in\Pi\cap\mathcal{N}_{H}^{c} satisfy |\pi-\pi^{\prime}|\geq{1\over H} for all \pi^{\prime}\in\mathcal{N}. We use this fact to compute the sum

πΠ𝒩Hc1|ππ|2\displaystyle\sum_{\pi\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\pi-\pi^{\prime}|^{2}} =hHΠ𝒩Hc1|hHπ|2\displaystyle=\sum_{\frac{h}{H}\in\Pi\cap\mathcal{N}_{H}^{c}}{1\over|\frac{h}{H}-\pi^{\prime}|^{2}} (62)
2H2h=1H1h2\displaystyle\leq 2H^{2}\sum_{h=1}^{H}{1\over h^{2}} (63)
2H2{1+121x2𝑑x+231x2𝑑x++H1H1x2𝑑x}\displaystyle\leq 2H^{2}\left\{1+\int_{1}^{2}{1\over x^{2}}dx+\int_{2}^{3}{1\over x^{2}}dx+\cdots+\int_{H-1}^{H}{1\over x^{2}}dx\right\} (64)
=2H2(1+1H1x2𝑑x)4H2,\displaystyle=2H^{2}\left(1+\int^{H}_{1}{1\over x^{2}}dx\right)\leq 4H^{2}, (65)

where the third line uses the monotonicity of 1x2{1\over x^{2}} for x1x\geq 1. ∎
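Lemma 3 can also be checked by brute force; the sketch below (ours) verifies the bound for several grid resolutions H and mass-point locations \pi^{\prime}\in[-1,1].

```python
import numpy as np

# Brute-force check of Lemma 3: grid points of Pi at distance at least 1/H from a
# mass point pi' contribute at most 4 H^2 to the sum of inverse squared distances.
for H in (2, 5, 20, 100):
    grid = np.arange(-H, H + 1) / H                     # the levels in Pi
    for pi_prime in np.linspace(-1, 1, 41):
        pts = grid[np.abs(grid - pi_prime) >= 1 / H]    # points outside the 1/H neighborhood
        assert np.sum(1.0 / (pts - pi_prime) ** 2) <= 4 * H ** 2
```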

8 Conclusion

We have developed a tensor completion method that addresses both low- and high-rankness based on the sign series representation. Our work provides a nonparametric framework for tensor estimation and obtains results that were previously out of reach. We hope the work opens up new inquiries and allows more researchers to contribute to this field.

Acknowledgements

This research is supported in part by NSF grant DMS-1915978 and Wisconsin Alumni Research Foundation.

References

  • Alon et al., (2016) Alon, N., Moran, S., and Yehudayoff, A. (2016). Sign rank versus VC dimension. In Conference on Learning Theory, pages 47–80.
  • Anandkumar et al., (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.
  • Anandkumar et al., (2017) Anandkumar, A., Ge, R., and Janzamin, M. (2017). Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18(1):752–791.
  • Balabdaoui et al., (2019) Balabdaoui, F., Durot, C., and Jankowski, H. (2019). Least squares estimation in the monotone single index model. Bernoulli, 25(4B):3276–3310.
  • Bartlett et al., (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
  • Cai et al., (2019) Cai, C., Li, G., Poor, H. V., and Chen, Y. (2019). Nonconvex low-rank tensor completion from noisy data. In Advances in Neural Information Processing Systems, pages 1863–1874.
  • Chan and Airoldi, (2014) Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.
  • Chi et al., (2020) Chi, E. C., Gaines, B. J., Sun, W. W., Zhou, H., and Yang, J. (2020). Provable convex co-clustering of tensors. Journal of Machine Learning Research, 21(214):1–58.
  • Cohn and Umans, (2013) Cohn, H. and Umans, C. (2013). Fast matrix multiplication using coherent configurations. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1074–1087.
  • De Lathauwer et al., (2000) De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.
  • De Wolf, (2003) De Wolf, R. (2003). Nondeterministic quantum query and communication complexities. SIAM Journal on Computing, 32(3):681–699.
  • Fan and Udell, (2019) Fan, J. and Udell, M. (2019). Online high rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8690–8698.
  • Ganti et al., (2017) Ganti, R., Rao, N., Balzano, L., Willett, R., and Nowak, R. (2017). On learning high dimensional structured single index models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1898–1904.
  • Ganti et al., (2015) Ganti, R. S., Balzano, L., and Willett, R. (2015). Matrix completion under monotonic single index models. In Advances in Neural Information Processing Systems, pages 1873–1881.
  • Ghadermarzy et al., (2018) Ghadermarzy, N., Plan, Y., and Yilmaz, O. (2018). Learning tensors from partial binary measurements. IEEE Transactions on Signal Processing, 67(1):29–40.
  • Ghadermarzy et al., (2019) Ghadermarzy, N., Plan, Y., and Yilmaz, Ö. (2019). Near-optimal sample complexity for convex tensor completion. Information and Inference: A Journal of the IMA, 8(3):577–619.
  • Globerson et al., (2007) Globerson, A., Chechik, G., Pereira, F., and Tishby, N. (2007). Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295.
  • Han et al., (2020) Han, R., Willett, R., and Zhang, A. (2020). An optimal statistical and computational framework for generalized tensor estimation. arXiv preprint arXiv:2002.11255.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • Hillar and Lim, (2013) Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45.
  • Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
  • Hong et al., (2020) Hong, D., Kolda, T. G., and Duersch, J. A. (2020). Generalized canonical polyadic tensor decomposition. SIAM Review, 62(1):133–163.
  • Hore et al., (2016) Hore, V., Viñuela, A., Buil, A., Knight, J., McCarthy, M. I., Small, K., and Marchini, J. (2016). Tensor decomposition for multiple-tissue gene expression experiments. Nature genetics, 48(9):1094.
  • Jain and Oh, (2014) Jain, P. and Oh, S. (2014). Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, volume 27, pages 1431–1439.
  • Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
  • Kosorok, (2007) Kosorok, M. R. (2007). Introduction to empirical processes and semiparametric inference. Springer Science & Business Media.
  • Lee and Wang, (2020) Lee, C. and Wang, M. (2020). Tensor denoising and completion based on ordinal observations. In International Conference on Machine Learning, pages 5778–5788.
  • Li et al., (2009) Li, Y., Liu, Y., Li, J., Qin, W., Li, K., Yu, C., and Jiang, T. (2009). Brain anatomical network and intelligence. PLoS Comput Biol, 5(5):e1000395.
  • Montanari and Sun, (2018) Montanari, A. and Sun, N. (2018). Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425.
  • Mu et al., (2014) Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81.
  • Ongie et al., (2017) Ongie, G., Willett, R., Nowak, R. D., and Balzano, L. (2017). Algebraic variety models for high-rank matrix completion. In International Conference on Machine Learning, pages 2691–2700.
  • Robinson, (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 56(4):931–954.
  • Shen and Wong, (1994) Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics, 22:580–615.
  • Wang et al., (2008) Wang, J., Shen, X., and Liu, Y. (2008). Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167.
  • Wang et al., (2017) Wang, L., Durante, D., Jung, R. E., and Dunson, D. B. (2017). Bayesian network–response regression. Bioinformatics, 33(12):1859–1866.
  • Wang and Li, (2020) Wang, M. and Li, L. (2020). Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. Journal of Machine Learning Research, 21(154):1–38.
  • Wang and Zeng, (2019) Wang, M. and Zeng, Y. (2019). Multiway clustering via tensor block models. In Advances in Neural Information Processing Systems, pages 713–723.
  • Xu, (2018) Xu, J. (2018). Rates of convergence of spectral methods for graphon estimation. In International Conference on Machine Learning, pages 5433–5442.
  • Yuan and Zhang, (2016) Yuan, M. and Zhang, C.-H. (2016). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068.
  • Zhang and Xia, (2018) Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, 64(11):7311 – 7338.