Beyond the Signs: Nonparametric Tensor Completion via Sign Series
Abstract
We consider the problem of tensor estimation from noisy observations with possibly missing entries. A nonparametric approach to tensor completion is developed based on a new model that we coin sign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low- and high-rank signals, while encompassing many existing tensor models—including CP models, Tucker models, single index models, and several hypergraphon models—as special cases. We show that the sign tensor series is theoretically characterized, and computationally estimable, via classification tasks with carefully specified weights. Excess risk bounds, estimation error rates, and sample complexities are established. We demonstrate that our approach outperforms previous methods on two datasets, one on human brain connectivity networks and the other on topic data mining.
1 Introduction
Higher-order tensors have recently received much attention in numerous fields, including social networks (Anandkumar et al., 2014), neuroscience (Wang et al., 2017), and genomics (Hore et al., 2016). Tensor methods provide effective representations of the hidden structure in multiway data. In this paper we consider the signal plus noise model,
𝒴 = Θ + ℰ,   (1)
where 𝒴 is an order-K data tensor, Θ is an unknown signal tensor of interest, and ℰ is a noise tensor. Our goal is to accurately estimate Θ from the incomplete, noisy observation of 𝒴. In particular, we focus on the following two problems:
• Q1 [Nonparametric tensor estimation]. How can we flexibly estimate Θ under a wide range of structures, including both low-rankness and high-rankness?
• Q2 [Complexity of tensor completion]. How many observed tensor entries do we need to consistently estimate the signal Θ?
1.1 Inadequacies of low-rank models
The signal plus noise model (1) is popular in the tensor literature. Existing methods estimate the signal tensor based on the low-rankness of Θ (Jain and Oh, 2014; Montanari and Sun, 2018). Common low-rank models include Canonical Polyadic (CP) tensors (Hitchcock, 1927), Tucker tensors (De Lathauwer et al., 2000), and block tensors (Wang and Zeng, 2019). While these methods have shown great success in signal recovery, tensors in applications often violate the low-rankness assumption. Here we provide two examples to illustrate the limitation of classical models.
The first example reveals the sensitivity of the tensor rank to order-preserving transformations. Let 𝒳 be an order-3 tensor with CP rank 3 (the formal definition is deferred to the end of this section). Suppose a monotonic transformation f is applied to 𝒳 entrywise, and we let the signal in model (1) be the tensor f(𝒳) after transformation. Figure 1a plots the numerical rank (see Section 7.1) of f(𝒳) versus the transformation level. As we see, the rank increases rapidly with the amount of nonlinearity, rendering traditional low-rank tensor methods ineffective in the presence of mild order-preserving nonlinearities. In digital processing (Ghadermarzy et al., 2018) and genomics analysis (Hore et al., 2016), the tensor of interest often undergoes an unknown transformation prior to measurement. This sensitivity to transformation makes the low-rank model less desirable in practice.

The second example demonstrates the inadequacy of classical low-rankness in representing special structures. Here we consider a signal tensor of the form Θ = g(Z) for a monotonic function g, where Z is an order-3 tensor with entries Z(i, j, k) = max(x_i, x_j, x_k) for given values x_i ∈ [0, 1]. The matrix analogue of Θ was studied by Chan and Airoldi (2014) in graphon analysis. In this case neither Z nor Θ is low-rank; in fact, the rank is no smaller than the dimension, as illustrated in Figure 1b. Again, classical low-rank models fail to address this type of tensor structure.
In the above and many other examples, the signal tensors of interest have high rank. Classical low-rank models will miss these important structures. New methods that allow flexible tensor modeling have yet to be developed.
1.2 Our contributions
We develop a new model called sign representable tensors to address the aforementioned challenges. Figure 2 illustrates our main idea. Our approach is built on the sign series representation of the signal tensor, and we propose to estimate the sign tensors through a series of weighted classifications. In contrast to existing methods, our method is guaranteed to recover a wide range of low- and high-rank signals. We highlight two main contributions that set our work apart from earlier literature.
Statistically, the problem of high-rank tensor estimation is challenging. Existing estimation theory (Anandkumar et al., 2014; Montanari and Sun, 2018; Cai et al., 2019) exclusively focuses on the regime of fixed rank and growing dimension. However, such a premise fails for high-rank tensors, where the rank may grow with, or even exceed, the dimension. A proper notion of nonparametric complexity is crucial. We show that, somewhat surprisingly, the sign tensor series not only preserves all information in the original signals, but also brings the benefits of flexibility and accuracy over classical low-rank models. The results fill the gap between parametric (low-rank) and nonparametric (high-rank) tensors, thereby greatly enriching the tensor model literature.
From a computational perspective, tensor optimization problems are in general NP-hard. Fortunately, tensors sought in applications are specially structured, and a number of efficient algorithms are available for them (Ghadermarzy et al., 2018; Wang and Li, 2020; Han et al., 2020). Our high-rank tensor estimate is provably reducible to a series of classifications, and its divide-and-conquer nature facilitates efficient computation. The ability to import and adapt existing tensor algorithms is one advantage of our method.
We also highlight the challenges associated with tensors compared to matrices. High-rank matrix estimation has recently been studied under nonlinear models (Ganti et al., 2015) and subspace clustering (Ongie et al., 2017; Fan and Udell, 2019). However, the problem for high-rank tensors is more challenging, because the tensor rank often exceeds the dimension when the order K ≥ 3 (Anandkumar et al., 2017). This is in sharp contrast to matrices. We show that applying matrix methods to higher-order tensors results in suboptimal estimates. A full exploitation of the higher-order structure is needed; this is another challenge we address in this paper.
1.3 Notation
We use sgn(·) to denote the sign function, where sgn(z) = 1 if z ≥ 0 and sgn(z) = −1 otherwise. We allow univariate functions, such as sgn(·) and a general f, to be applied to tensors in an element-wise manner. We denote a ≲ b if a ≤ Cb for some constant C > 0. We use the shorthand [n] to denote the set {1, …, n} for an integer n. Let Θ ∈ ℝ^{d_1 × ⋯ × d_K} denote an order-K (d_1, …, d_K)-dimensional tensor, and let Θ(ω) denote the tensor entry indexed by ω = (i_1, …, i_K). An event A is said to occur “with very high probability” if ℙ(A) tends to 1 faster than any polynomial of the tensor dimension. The CP decomposition (Hitchcock, 1927) is defined by
Θ = Σ_{i=1}^{r} λ_i · a_i^{(1)} ⊗ ⋯ ⊗ a_i^{(K)},   (2)
where λ_1 ≥ ⋯ ≥ λ_r > 0 are tensor singular values, a_i^{(k)} ∈ ℝ^{d_k} are norm-1 tensor singular vectors, and ⊗ denotes the outer product of vectors. The minimal r for which (2) holds is called the tensor rank, denoted rank(Θ).
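To fix ideas, the following sketch assembles a tensor from a given CP representation (2); the function name and the random factors are illustrative rather than part of the paper.

```python
import numpy as np

def cp_tensor(lambdas, factors):
    """Form an order-K tensor from CP factors: sum_i lambda_i * a_i^(1) o ... o a_i^(K).

    lambdas : array of length r (tensor singular values).
    factors : list of K matrices; the k-th has shape (d_k, r) with unit-norm columns.
    """
    K = len(factors)
    dims = [A.shape[0] for A in factors]
    T = np.zeros(dims)
    for i, lam in enumerate(lambdas):
        outer = factors[0][:, i]
        for k in range(1, K):
            # accumulate one rank-1 component via successive outer products
            outer = np.multiply.outer(outer, factors[k][:, i])
        T += lam * outer
    return T

# Example: a rank-3, order-3 tensor of dimension (10, 10, 10)
rng = np.random.default_rng(0)
factors = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(3)]
Theta = cp_tensor(np.array([3.0, 2.0, 1.0]), factors)
```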
2 Model and proposal overview
Let 𝒴 ∈ ℝ^{d_1 × ⋯ × d_K} be an order-K (d_1, …, d_K)-dimensional data tensor generated from the following model
𝒴 = Θ + ℰ,   (3)
where Θ is an unknown signal tensor of interest, and ℰ is a noise tensor consisting of mean-zero, independent, but not necessarily identically distributed entries. We allow heterogeneous noise, in that the marginal distribution of the noise entry ℰ(ω) may depend on ω. Assume that Θ takes values in a bounded interval; without loss of generality, we set this interval to [−1, 1] throughout the paper.
Our observation is an incomplete data tensor from (3), denoted 𝒴_Ω, where Ω ⊂ [d_1] × ⋯ × [d_K] is the index set of observed entries. We consider a general model on Ω that allows both uniform and non-uniform sampling. Specifically, let Π be an arbitrary, predefined probability distribution over the full index set [d_1] × ⋯ × [d_K]. Assume that the entries in Ω are i.i.d. draws with replacement from the full index set using the distribution Π. The sampling rule is denoted ω ∼ Π.
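A minimal sketch of this sampling model, with illustrative names: indices are drawn i.i.d. with replacement from the full index set, either uniformly or under a user-supplied distribution Π.

```python
import numpy as np

def sample_indices(dims, n_obs, prob=None, seed=0):
    """Draw n_obs entry indices i.i.d. with replacement from the full index set.

    prob : optional array of shape dims giving the sampling distribution Pi;
           defaults to uniform sampling.
    """
    rng = np.random.default_rng(seed)
    total = int(np.prod(dims))
    p = None if prob is None else prob.ravel() / prob.sum()
    flat = rng.choice(total, size=n_obs, replace=True, p=p)
    return np.unravel_index(flat, dims)  # tuple of index arrays, one per mode

# Example: observe ~30% of a (10, 10, 10) tensor under uniform sampling
omega = sample_indices((10, 10, 10), n_obs=300)
```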
Before describing our main results, we provide the intuition behind our method. In the two examples in Section 1, the high-rankness in the signal makes the estimation challenging. Now let us examine the sign of the π-shifted signal, sgn(Θ − π), for an arbitrary given level π. It turns out that these sign tensors share the same sign patterns as low-rank tensors. Indeed, the signal tensor in the first example has the same sign pattern as a low-rank tensor, since sgn(f(𝒳) − π) = sgn(𝒳 − f^{-1}(π)) and 𝒳 − f^{-1}(π) has rank at most 4. The signal tensor in the second example has the same sign pattern as a rank-2 tensor, since sgn(g(Z) − π) = sgn(Z − g^{-1}(π)) is a block tensor with at most two blocks (see Example 5 in Section 3).
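As a quick numerical illustration of the second example (with an illustrative monotone transformation g(z) = log(1 + z); the exact choice of g is not essential), the signal itself is full-rank already on a single slice, yet each sign tensor coincides with an explicit rank-2 CP tensor:

```python
import numpy as np

d = 30
x = np.arange(1, d + 1) / d                                  # x_i = i/d
Z = np.maximum.reduce(np.meshgrid(x, x, x, indexing="ij"))   # Z(i,j,k) = max(x_i, x_j, x_k)
Theta = np.log(1 + Z)                                        # monotone transformation (illustrative)

# The signal is high-rank: even its frontal slice has full matrix rank d.
print(np.linalg.matrix_rank(Theta[:, :, 0]))                 # -> 30

# Yet every sign tensor sgn(Theta - pi) equals an explicit rank-2 CP tensor.
pi = 0.4
u = (np.log(1 + x) < pi).astype(float)                       # u_i = 1{g(x_i) < pi}
rank2 = np.ones((d, d, d)) - 2 * np.einsum("i,j,k->ijk", u, u, u)
print(np.array_equal(np.where(Theta - pi >= 0, 1.0, -1.0), rank2))  # -> True
```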
The above observation suggests a general framework to estimate both low- and high-rank signal tensors. Figure 2 illustrates the main crux of our method. We dichotomize the data tensor into a series of sign tensors sgn(Θ − π_h) at levels π_h = h/H for h ∈ {−H, …, H}. Then, we estimate the sign signals by performing the classification
𝒵̂_{π_h} = arg min_{𝒵} Weighted-Loss(𝒵; 𝒴_Ω, π_h),
where Weighted-Loss denotes a carefully designed classification objective function, which will be described in later sections. Our final proposed tensor estimate takes the form
Θ̂ = (1 / (2H + 1)) Σ_{h = −H}^{H} sgn(𝒵̂_{π_h}).
Our approach is built on the nonparametric sign representation of signal tensors. The estimate is essentially learned from the dichotomized tensor series with proper weights. We show that a careful aggregation of the dichotomized data not only preserves all information in the original signals, but also brings benefits of accuracy and flexibility over classical low-rank models. Unlike traditional methods, the sign representation is guaranteed to recover both low- and high-rank signals, which was previously impossible. The method enjoys statistical effectiveness and computational efficiency.
3 Statistical properties of sign representable tensors
This section develops the sign representable tensor model for Θ in (3). We characterize the algebraic and statistical properties of the sign tensor series, which serve as the theoretical foundation for our method.
3.1 Sign-rank and sign tensor series
Let Θ ∈ ℝ^{d_1 × ⋯ × d_K} be the tensor of interest, and sgn(Θ) ∈ {−1, 1}^{d_1 × ⋯ × d_K} the corresponding sign pattern. The sign patterns induce an equivalence relation between tensors. Two tensors Θ, Θ′ are called sign equivalent, denoted Θ ≃ Θ′, if they have the same sign pattern.
Definition 1 (Sign-rank).
The sign-rank of a tensor Θ is defined as the minimal rank among all tensors that share the same sign pattern as Θ; i.e.,
srank(Θ) = min{ rank(Θ′) : Θ′ ≃ Θ, Θ′ ∈ ℝ^{d_1 × ⋯ × d_K} }.
The sign-rank is also called the support rank (Cohn and Umans, 2013), minimal rank (Alon et al., 2016), and nondeterministic rank (De Wolf, 2003). Earlier work defines the sign-rank for binary-valued tensors; we extend the notion to continuous-valued tensors. Note that the sign-rank concerns only the sign pattern but discards the magnitude information of Θ. In particular, srank(Θ) = srank(sgn Θ).
Like most tensor problems (Hillar and Lim, 2013), determining the sign-rank of a general tensor is NP-hard (Alon et al., 2016). Fortunately, tensors arising in applications often possess special structures that facilitate analysis. By definition, the sign-rank is upper bounded by the tensor rank. More generally, we have the following upper bounds.
Proposition 1 (Upper bounds of the sign-rank).
For any strictly monotonic function g: ℝ → ℝ with g(0) = 0, we have srank(g(Θ)) ≤ rank(Θ).
Conversely, the sign-rank can be much smaller than the tensor rank, as we have shown in the examples of Section 1.
Proposition 2 (Broadness).
For every order K ≥ 2 and dimension d ≥ 2, there exist order-K (d, …, d)-dimensional tensors Θ such that rank(Θ) ≥ d but srank(Θ − π) ≤ 2 for all π.
We provide several examples in Section 7.2 in which the tensor rank grows with the dimension but the sign-rank remains a constant. The results highlight the advantages of using the sign-rank in high-dimensional tensor analysis. Propositions 1 and 2 together demonstrate the strict broadness of the low sign-rank family over the usual low-rank family.
We now introduce a tensor family, which we coin as “sign representable tensors”, for the signal model in (3).
Definition 2 (Sign representable tensors).
Fix a level π ∈ [−1, 1] and an integer r ≥ 1. A tensor Θ is called (r, π)-sign representable if the tensor Θ − π has sign-rank bounded by r. A tensor Θ is called r-sign (globally) representable if Θ is (r, π)-sign representable for all π ∈ [−1, 1]. The collection {sgn(Θ − π) : π ∈ [−1, 1]} is called the sign tensor series. We use 𝒫_sgn(r) to denote the r-sign representable tensor family.
We show that the r-sign representable tensor family is a general model that incorporates most existing tensor models, including low-rank tensors, single index models, GLMs, and several hypergraphon models.
Example 1 (CP/Tucker low-rank models).
The CP and Tucker low-rank tensors are the two most popular tensor models (Kolda and Bader, 2009). Let Θ be a low-rank tensor with CP rank r. We see that Θ belongs to the sign representable family; i.e., Θ ∈ 𝒫_sgn(r + 1) (the constant 1 is due to the level shift π). Similar results hold for Tucker low-rank tensors Θ, where Θ ∈ 𝒫_sgn(r̄ + 1) with r̄ bounded in terms of the mode-k Tucker ranks r_k of Θ.
Example 2 (Tensor block models (TBMs)).
A tensor block model (Wang and Zeng, 2019) assumes that Θ is piecewise constant with respect to a partition of each mode into clusters, so the entries of Θ take only a finite number of distinct block means. For every π, the sign tensor sgn(Θ − π) is itself a block tensor, so Θ belongs to 𝒫_sgn(r) with r determined by the number of multiway blocks.
Example 3 (Generalized linear models (GLMs)).
Let 𝒴 be a binary tensor from a logistic model (Wang and Li, 2020) with mean Θ = f(Ξ), where f is the logistic link and Ξ is a latent low-rank tensor. Notice that Θ itself may be high-rank (see Section 1). By definition, Θ is a low-rank sign representable tensor. The same conclusion holds for general exponential-family models with a (known) link function (Hong et al., 2020).
Example 4 (Single index models (SIMs)).
The single index model is a flexible semiparametric model proposed in economics (Robinson, 1988) and high-dimensional statistics (Balabdaoui et al., 2019; Ganti et al., 2017). We here extend the model to higher-order tensors. The SIM assumes the existence of an (unknown) strictly monotonic function g such that g(Θ) has rank r. We see that Θ belongs to the sign representable family; i.e., Θ ∈ 𝒫_sgn(r + 1).
Example 5 (Min/Max hypergraphon).
Graphons are popular nonparametric models for networks (Chan and Airoldi, 2014; Xu, 2018). Here we revisit the model introduced in Section 1 in its general form. Let Θ be an order-K tensor generated from the hypergraphon Θ(i_1, …, i_K) = max(x_{i_1}, …, x_{i_K}), where the x_i are given numbers in [0, 1] for all i ∈ [d]. We conclude that Θ ∈ 𝒫_sgn(2), because the sign tensor sgn(Θ − π) with an arbitrary π is a block tensor with at most two blocks (see Figure 2c).
The results extend to general min/max hypergraphons. Let g be a continuous univariate function such that the equation g(z) = π has at most s distinct real roots for every π; this property holds, e.g., when g is a polynomial of degree s. Then, the tensor generated from Θ(i_1, …, i_K) = g(max(x_{i_1}, …, x_{i_K})) belongs to 𝒫_sgn(s + 1) (see Section 7.2). The same conclusion holds if the maximum is replaced by the minimum.
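The two-block structure behind Example 5 follows from a one-line identity (a rewriting of the argument detailed in Section 7.2):
sgn(Θ(i_1, …, i_K) − π) = 1 − 2 · 1{max(x_{i_1}, …, x_{i_K}) < π} = 1 − 2 · ∏_{k=1}^{K} 1{x_{i_k} < π},
so sgn(Θ − π) = J − 2 · u ⊗ ⋯ ⊗ u, where u(i) = 1{x_i < π} and J is the all-one tensor; this is a tensor of rank at most 2.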
3.2 Statistical characterization of sign tensors via weighted classification
Accurate estimation of a sign representable tensor depends on the behavior of the sign tensor series {sgn(Θ − π) : π ∈ [−1, 1]}. In this section, we show that the sign tensors are completely characterized by weighted classification. The results bridge the algebraic and statistical properties of sign representable tensors.
For a given level π ∈ [−1, 1], define the π-shifted data tensor 𝒴̄^π with entries 𝒴̄^π(ω) = 𝒴(ω) − π for ω ∈ Ω. We propose a weighted classification objective function
L(𝒵, 𝒴̄^π) = (1 / |Ω|) Σ_{ω ∈ Ω} |𝒴̄^π(ω)| · 1{ sgn 𝒵(ω) ≠ sgn 𝒴̄^π(ω) },   (4)
where 𝒵 is the decision variable to be optimized, and |𝒴̄^π(ω)| = |𝒴(ω) − π| is the entry-specific weight equal to the distance from the tensor entry to the target level π. The entry-specific weights incorporate the magnitude information into the classification, where entries far away from the target level are penalized more heavily in the objective. In the special case of a binary tensor 𝒴 and a target level π equidistant from the two binary values, the weights are constant and the loss (4) reduces to the usual classification loss.
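A direct transcription of the objective (4), assuming the sgn(0) = 1 convention from Section 1.3; the function and variable names below are ours.

```python
import numpy as np

def sgn(t):
    # sign convention of the paper: sgn(t) = 1 if t >= 0 and -1 otherwise
    return np.where(t >= 0, 1.0, -1.0)

def weighted_loss(Z, y_obs, omega, pi):
    """Empirical weighted classification objective (4) at level pi.

    Z     : candidate tensor (decision variable), full array.
    y_obs : observed entries Y(omega) as a 1-D array.
    omega : tuple of index arrays for the observed positions.
    """
    ybar = y_obs - pi                               # pi-shifted data
    mismatch = sgn(Z[omega]) != sgn(ybar)           # 0/1 classification error
    return np.mean(np.abs(ybar) * mismatch)         # entry-specific weights |Y - pi|
```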
Our proposed weighted classification function (4) is important for characterizing sgn(Θ − π). Define the weighted classification risk
Risk(𝒵) = 𝔼 [ |𝒴̄^π(ω)| · 1{ sgn 𝒵(ω) ≠ sgn 𝒴̄^π(ω) } ],   (5)
where the expectation is taken with respect to 𝒴 under model (3) and the sampling distribution ω ∼ Π. Note that the form of Risk(·) implicitly depends on the level π; we suppress π when no confusion arises.
Proposition 3 (Global optimum of weighted risk).
Suppose the data 𝒴 is generated from model (3) with signal tensor Θ. Then, for all tensors 𝒵̄ that are sign equivalent to sgn(Θ − π),
𝒵̄ ∈ arg min_{𝒵} Risk(𝒵).   (6)
The result shows that the sign tensor sgn(Θ − π) optimizes the weighted classification risk. This fact suggests a practical procedure to estimate sgn(Θ − π) via empirical risk optimization of (4). In order to establish the recovery guarantee, we shall address the uniqueness (up to sign equivalence) of the optimizer of (5). The local behavior of the risk around sgn(Θ − π) turns out to play a key role in the accuracy.
Some additional notation is needed. We use 𝒩 to denote the set of mass points of Θ(ω) under ω ∼ Π. Assume there exists a constant c_0, independent of the tensor dimension, such that |𝒩| ≤ c_0. Note that both 𝒩 and the distribution of Θ(ω) implicitly depend on the tensor dimension. Our assumptions are imposed on Θ and Π uniformly in the high-dimensional regime where min_k d_k → ∞.
Assumption 1 (α-smoothness).
Fix π ∈ [−1, 1]. Assume there exist constants α = α(π) > 0 and c > 0, both independent of the tensor dimension, such that
ℙ_{ω ∼ Π}{ ω : |Θ(ω) − π| ≤ t } ≤ c · t^α   for all t ≥ ρ(π, 𝒩),   (7)
where ρ(π, 𝒩) denotes the distance from π to the nearest point in 𝒩. The largest possible α in (7) is called the smoothness index at level π. We make the convention that α(π) = ∞ if the probability on the left-hand side of (7) vanishes for all small t, implying that almost no entries of Θ are around the level π. We call a tensor Θ α-globally smooth if (7) holds with a global constant α for all π ∈ [−1, 1], except for a finite number of levels.
The smoothness index α quantifies the intrinsic hardness of recovering sgn(Θ − π) from 𝒴. The value of α depends on both the sampling distribution Π and the behavior of Θ(ω) around π. The recovery is easier at levels π where the entries are less concentrated around π, i.e., with a large value of α, or equivalently, when the cumulative distribution function (CDF) of Θ(ω) remains flat around π. A small value of α indicates a nonexistent (infinite) density at the level π, or equivalently, a jump of the CDF at π. A typical case is α = 1, which occurs when the CDF has a finite, non-zero derivative in the vicinity of π. Table 1 illustrates the smoothness index α for various models of Θ (see Section 5 Simulation for details).
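As a concrete instance of the typical case, here is a short calculation under the added assumption that Θ(ω), ω ∼ Π, has a cumulative distribution function F that is Lipschitz with constant c_0 near π:
ℙ_{ω ∼ Π}{ |Θ(ω) − π| ≤ t } ≤ F(π + t) − F(π − t) ≤ 2 c_0 t,
so the left-hand side of (7) grows at most linearly in t near π, giving the smoothness index α(π) = 1.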
We now reach the main theorem of this section. For two tensors 𝒵 and 𝒵′ of the same dimension, define the mean absolute error (MAE)
MAE(𝒵, 𝒵′) = 𝔼_{ω ∼ Π} |𝒵(ω) − 𝒵′(ω)|.
Theorem 1 (Identifiability).
The result establishes the recovery stability of sign tensors via optimization of the population risk (5). The bound immediately shows the uniqueness of the optimizer of (5) up to a zero-measure set under Π. We find that a higher value of α implies more stable recovery, as intuition would suggest. Similar results hold for optimization with the sample risk (4) (see Section 4).
We conclude this section by applying Assumption 1 to the examples described in Section 3.1. For simplicity, suppose Π is the uniform sampling for now. The tensor block model is ∞-globally smooth. This is because the set 𝒩, which consists of the distinct block means in Θ, has finitely many elements. Furthermore, we have α(π) = ∞ for all π ∉ 𝒩, since the numerator in (7) is zero for all such π. The min/max hypergraphon model with a degree-s polynomial function is 1-globally smooth, because α(π) = 1 for all π in the function range except at the (at most s − 1) stationary points.
4 Nonparametric tensor completion via sign series
In the previous sections we have established the sign series representation and its relationship to classification. In this section, we present the algorithm proposed in Section 2 (Figure 2) in detail. We provide the estimation error bound and address the empirical implementation of the algorithm.
4.1 Estimation error and sample complexity
Given a noisy, incomplete tensor observation 𝒴_Ω from model (3), we cast the problem of estimating Θ into a series of weighted classifications. Specifically, we propose the tensor estimate Θ̂ using the sign representation,
Θ̂ = (1 / (2H + 1)) Σ_{h = −H}^{H} sgn(𝒵̂_{π_h}),   (8)
where 𝒵̂_{π_h} is the weighted classifier estimated at the levels π_h = h/H for h ∈ {−H, …, H},
𝒵̂_{π_h} ∈ arg min_{𝒵 : rank(𝒵) ≤ r} L(𝒵, 𝒴̄^{π_h}).   (9)
Here L(·, ·) denotes the weighted classification objective defined in (4), where we have plugged in the π_h-shifted data 𝒴̄^{π_h} = 𝒴 − π_h, and the rank constraint follows from Proposition 3. For the theory, we assume the true r is known; in practice, r can be chosen in a data-adaptive fashion via cross-validation or the elbow method (Hastie et al., 2009). Step (9) corresponds to Figure 2c, while (8) corresponds to Figure 2d.
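Putting (8)–(9) together, the estimator can be sketched as follows; `fit_sign_tensor` is a stand-in for the weighted low-rank classification step (9), whose implementation is discussed in Section 4.2, and all names are illustrative.

```python
import numpy as np

def estimate_signal(y_obs, omega, dims, H, fit_sign_tensor):
    """Nonparametric tensor estimate (8): average of estimated sign tensors.

    fit_sign_tensor(y_obs, omega, dims, pi) should return an estimate of
    sgn(Theta - pi), i.e. a full tensor of shape dims (step (9)).
    """
    theta_hat = np.zeros(dims)
    levels = [h / H for h in range(-H, H + 1)]      # pi_h = h/H
    for pi in levels:                               # embarrassingly parallel over levels
        Z_hat = fit_sign_tensor(y_obs, omega, dims, pi)
        theta_hat += np.where(Z_hat >= 0, 1.0, -1.0)
    return theta_hat / len(levels)                  # divide by 2H + 1
```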
The next theorem establishes the statistical convergence for the sign tensor estimate (9), which is an important ingredient for the final signal tensor estimate in (8).
Theorem 2 (Sign tensor estimation).
Suppose Θ ∈ 𝒫_sgn(r) and Θ is α-globally smooth under Π. Let 𝒵̂_π be the estimate in (9). Then, for all π except for a finite number of levels, with very high probability over 𝒴_Ω,
(10) |
Theorem 2 provides the error bound for the sign tensor estimation. Compared to the population results in Theorem 1, here we explicitly reveal the dependence of the accuracy on the sample size |Ω| and on the level π. The result demonstrates the polynomial decay of the sign error with |Ω|. In particular, our sign estimate achieves consistent recovery using a number of noisy entries that is far smaller than the total number of entries (see Corollary 1).
Combining the sign representability of the signal tensor and the sign estimation accuracy, we obtain the main results on our nonparametric tensor estimation method.
Theorem 3 (Tensor estimation error).
Theorem 3 demonstrates the convergence rate of our tensor estimation. The bound (11) reveals three sources of error: the estimation error of the sign tensors, the bias from the sign series representation, and the variance thereof. The resolution parameter H controls the bias-variance tradeoff. We remark that the signal estimation error (12) is generally no better than the corresponding sign error (10). This is to be expected, since magnitude estimation is a harder problem than sign estimation.
In the special case of full observation with equal dimension d_1 = ⋯ = d_K = d, our signal estimate achieves the convergence rate
(13) |
Compared to earlier methods, our estimation accuracy applies to both low- and high-rank signal tensors. The rate depends on the sign-rank r, which is often much smaller than the usual tensor rank (see Section 3.1). Our result also reveals that the convergence becomes more favorable as the order of the data tensor increases.
We apply our method to the main examples in Section 3.1, and compare the results with existing literature. The numerical comparison is provided in Section 5.
Example 2 (TBM).
Consider a tensor block model with a fixed total number of multiway blocks. Our result implies a fast convergence rate by taking α = ∞. This rate agrees with the previous root-mean-square error (RMSE) rate for block tensor estimation (Wang and Zeng, 2019).
Example 3 (GLM).
Consider a GLM tensor Θ = f(Ξ), where f is a known link function and Ξ is a latent low-rank tensor. Suppose the marginal density of Θ(ω) is bounded, so that α = 1. Applying our results with α = 1 yields the corresponding nonparametric rate. This rate is slightly slower than the parametric RMSE rate (Zhang and Xia, 2018; Wang and Li, 2020). One possible reason is that our estimate remains valid for an unknown link f and for general high-rank tensors with low sign-rank. The nonparametric rate is the price one has to pay for not knowing the form of f a priori.
The following sample complexity for nonparametric tensor completion is a direct consequence of Theorem 3.
Corollary 1 (Sample complexity for nonparametric completion).
Under the same conditions as Theorem 3, with high probability over 𝒴_Ω, the estimate Θ̂ is consistent provided the number of observed entries |Ω| grows on the order of the degrees of freedom of the sign tensors, up to logarithmic factors.
Our result improves earlier work (Yuan and Zhang, 2016; Ghadermarzy et al., 2019; Lee and Wang, 2020) by allowing both low- and high-rank signals. Interestingly, the sample requirement depends only on the sign complexity r but not on the nonparametric complexity α. Note that this requirement roughly matches the degrees of freedom of the sign tensors, suggesting the optimality of our sample requirements.
4.2 Numerical implementation
This section addresses the practical implementation of our estimation (8), illustrated in Figure 2. Our sign representation of the signal estimate is an average of 2H + 1 sign tensors, which can be computed in a divide-and-conquer fashion. Briefly, we estimate the sign tensors 𝒵̂_{π_h} (detailed in the next paragraph) for the series of levels through parallel implementation, and then we aggregate the results to yield the output. The estimate enjoys a low computational cost similar to that of a single sign tensor estimation.
For the sign tensor estimation (9), the problem reduces to binary tensor decomposition with a weighted classification loss. A number of algorithms have been developed for this problem (Ghadermarzy et al., 2018; Wang and Li, 2020; Hong et al., 2020). We adopt similar ideas by tailoring the algorithms to our context. Following common practice in classification, we replace the 0/1 loss in (4) with a surrogate loss given by a continuous function of the margin m = sgn(𝒴̄^π(ω)) · 𝒵(ω). Examples of large-margin losses are the hinge loss (1 − m)_+, the logistic loss log(1 + e^{−m}), and the nonconvex ψ-loss (a truncated variant of the hinge loss). We implement the hinge loss and logistic loss in our algorithm, although our framework is applicable to general large-margin losses (Bartlett et al., 2006).
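For reference, the surrogate losses can be coded as functions of the margin; the ψ-loss below is one common nonconvex variant (a truncated hinge), shown only for illustration.

```python
import numpy as np

def hinge(m):
    return np.maximum(0.0, 1.0 - m)                     # (1 - m)_+

def logistic(m):
    return np.log1p(np.exp(-m))                         # log(1 + e^{-m})

def psi_loss(m):
    return np.minimum(1.0, np.maximum(0.0, 1.0 - m))    # truncated hinge (one nonconvex choice)

# In the weighted objective, the margin of entry omega at level pi is
#   m = sgn(Y(omega) - pi) * Z(omega),
# and the 0/1 mismatch indicator in (4) is replaced by the surrogate value.
```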
The rank constraints in the optimization (9) have been extensively studied in the literature. Recent developments involve convex norm relaxation (Ghadermarzy et al., 2018) and nonconvex optimization (Wang and Li, 2020; Han et al., 2020). Unlike for matrices, computing the tensor convex norm is NP-hard, so we choose (non-convex) alternating optimization due to its numerical efficiency. Briefly, we use the rank decomposition (2) of 𝒵 to optimize the unknown factor matrices A^{(k)} = [a_1^{(k)}, …, a_r^{(k)}] for k ∈ [K], where we absorb the tensor singular values into the factors. We numerically solve (9) by optimizing one factor at a time while holding the others fixed. Each suboptimization reduces to a convex optimization with a low-dimensional decision variable. Following common practice in tensor optimization (Anandkumar et al., 2014; Hong et al., 2020), we run the optimization from multiple initializations and report the final estimate with the lowest objective value. The full procedure is described in Algorithm 1.
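A sketch of one block update for an order-3 tensor using the logistic surrogate; this is our illustration of the alternating scheme, not the literal update rule of Algorithm 1.

```python
import numpy as np

def update_factor(A, B, C, y_obs, omega, pi, lr=0.1, n_steps=50):
    """Refine factor A with B, C held fixed, by gradient descent on the
    weighted logistic surrogate of the objective (4).

    A, B, C : factor matrices of shapes (d1, r), (d2, r), (d3, r).
    omega   : tuple (i, j, k) of index arrays of observed entries.
    """
    i, j, k = omega
    w = np.abs(y_obs - pi)                       # entry-specific weights |Y - pi|
    s = np.where(y_obs - pi >= 0, 1.0, -1.0)     # labels sgn(Y - pi)
    A = A.copy()
    for _ in range(n_steps):
        z = np.sum(A[i] * B[j] * C[k], axis=1)   # Z(omega) under the CP parametrization
        margin = s * z
        g = -w * s / (1.0 + np.exp(margin))      # d/dz of w * log(1 + exp(-s z))
        grad_A = np.zeros_like(A)
        np.add.at(grad_A, i, g[:, None] * (B[j] * C[k]))  # accumulate per-row gradients
        A -= lr * grad_A / len(y_obs)
    return A
```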
5 Simulations
In this section, we compare our nonparametric tensor method (NonParaT) with two alternative approaches: low-rank tensor CP decomposition (CPT), and the matrix version of our method applied to tensor unfoldings (NonParaM). We assess the performance under both complete and incomplete observations. The signal tensors are generated from the four models listed in Table 1. The simulation covers a wide range of complexity, including block tensors, transformed low-rank tensors, and min/max hypergraphons with logarithmic and exponential functions. We consider order-3 tensors of equal dimension along each mode, and specify the resolution H, the rank r, and the remaining inputs of Algorithm 1 accordingly. For NonParaM, we apply Algorithm 1 to each of the three unfolded matrices and report the average error. All summary statistics are averaged across replicates.
[Table 1: the four simulation models for the signal tensor used in Section 5, together with their smoothness indices.]
Figure 3 compares the estimation error under full observation. The MAE decreases with the tensor dimension for all three methods. We find that our method NonParaT achieves the best performance in all scenarios, whereas the second-best method is CPT for models 1-2 and NonParaM for models 3-4. One possible reason is that models 1-2 have controlled multilinear tensor rank, which makes the tensor methods NonParaT and CPT more accurate than matrix methods. For models 3-4, the rank exceeds the tensor dimension, and therefore the two nonparametric methods NonParaT and NonParaM exhibit a greater advantage for signal recovery.
Figure 4 shows the completion error against the observation fraction. We fix the tensor dimension and gradually increase the observation fraction from 0.3 to 1. We find that NonParaT achieves the lowest error among all methods. Our simulation covers a reasonable range of complexities; for example, model 1 has jumps in the CDF of the signal entries, and models 2 and 4 have unbounded noise. Nevertheless, our method shows good performance in spite of model misspecification. This robustness is appealing in practice because the structure of the underlying signal tensor is often unknown.


6 Data applications
We apply our method to two tensor datasets, the MRN-114 human brain connectivity data (Wang et al.,, 2017), and NIPS word occurrence data (Globerson et al.,, 2007).
6.1 Brain connectivity analysis
The brain dataset records the structural connectivity among 68 brain regions for 114 individuals, along with their Intelligence Quotient (IQ) scores. We organize the connectivity data into an order-3 tensor of dimension 68 × 68 × 114, where entries encode the presence or absence of fiber connections between brain regions across individuals.

Figure 5 shows the MAE based on 5-fold cross-validation over a range of ranks and missing rates. We find that our method outperforms CPT in all combinations of rank and missing rate. The achieved error reduction appears to be more profound as the missing rate increases. This trend highlights the applicability of our method to tensor completion tasks. In addition, our method exhibits a smaller standard error in the cross-validation experiments, as shown in Figure 5 and Table 2, demonstrating its stability relative to CPT. One possible reason is that our estimate is guaranteed to lie in [0, 1] (for the binary tensor problem, where the signal takes values in [0, 1]), whereas the CPT estimate may fall outside this valid range.
[Table 2: prediction MAE from 5-fold cross-validation across a range of ranks. Top: MRN-114 brain connectivity dataset, comparing NonparaT (ours) with low-rank CPT. Bottom: NIPS word occurrence dataset, comparing NonparaT (ours), low-rank CPT, and naive imputation (baseline).]
We next investigate the pattern in the estimated signal tensor. Figure 6a shows the identified top edges associated with IQ scores. Specifically, we first obtain a denoised tensor Θ̂ using our method. Then, we perform a regression analysis of the estimated connection strength against the normalized IQ score across the 114 individuals. The regression is repeated for each edge, i.e., each pair of brain regions. We find that the top edges represent interhemispheric connections in the frontal lobes. The result is consistent with recent research on brain connectivity and intelligence (Li et al., 2009; Wang et al., 2017).
6.2 NIPS data analysis
The NIPS dataset consists of word occurrence counts in papers published from 1987 to 2003. We focus on the top 100 authors and the 200 most frequent words, and normalize each word count by a log transformation with pseudo-count 1. The resulting dataset is an order-3 tensor whose entries represent the log counts of words by authors across years.
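The normalization is the standard log transform with pseudo-count 1; a minimal sketch with a hypothetical count array (the array shape and values are placeholders, not the actual data):

```python
import numpy as np

# counts: raw word-occurrence tensor of shape (authors, words, years) -- hypothetical values
counts = np.random.default_rng(0).poisson(1.0, size=(100, 200, 17))
log_counts = np.log1p(counts)   # log(1 + count), the pseudo-count-1 transform
```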
Table 2 compares the prediction accuracy of different methods. We find that our method substantially outperforms the low-rank CP method for every configuration under consideration. Further increasing the rank appears to have little effect on the performance. The comparison highlights the advantage of our method in achieving accuracy while maintaining low complexity. In addition, we also perform naive imputation, where the missing values are predicted using the sample average. Both our method and CPT outperform the naive imputation, implying the necessity of incorporating tensor structure into the analysis.


We next examine the estimated signal tensor from our method. Figure 6b illustrates the results from the NIPS data, where we plot the entries of the estimate corresponding to the top authors and most frequent words (after excluding generic words such as figure, results, etc.). The identified pattern is consistent with the active topics in NIPS publications. Among the top words are neural (marginal mean = 1.95), learning (1.48), and network (1.21), whereas the top authors are T. Sejnowski (1.18), B. Scholkopf (1.17), M. Jordan (1.11), and G. Hinton (1.06). We also find strong heterogeneity in word occurrences across authors and years. For example, training and algorithm are popular words for B. Scholkopf and A. Smola in 1998-1999, whereas model occurs more often for M. Jordan, particularly in 1996. The detected pattern and achieved accuracy demonstrate the applicability of our method.
7 Additional results and proofs
In this section, we provide additional results not covered in the previous sections. Section 7.1 gives a detailed explanation of the examples mentioned in Section 1. Section 7.2 supplements Section 3.1 by providing further theoretical results on the sign-rank and its relationship to the tensor rank. Section 7.3 collects the proofs of the theorems in the main text.
7.1 Sensitivity of tensor rank to monotonic transformations
In Section 1, we have provided a motivating example to show the sensitivity of tensor rank to monotonic transformations. Here, we describe the details of the example set-up.
Step 1 is to generate a rank-3 tensor 𝒳 based on the CP representation (2) with r = 3, where the factor vectors consist of d entries each, and the shorthand a^{⊗3} = a ⊗ a ⊗ a denotes the Kronecker power. We then apply the monotonic transformation f entrywise and obtain the transformed tensor f(𝒳).
Step 2 is to determine the rank of f(𝒳). Unlike matrices, exact rank determination for tensors is NP-hard. Therefore, we compute the numerical rank of f(𝒳) as an approximation. The numerical rank is determined as the minimal rank for which the relative approximation error falls below a small threshold ε; i.e.,
rank_num(f(𝒳)) = min{ r′ ∈ ℕ : min_{rank(𝒯) ≤ r′} ‖𝒯 − f(𝒳)‖_F / ‖f(𝒳)‖_F ≤ ε }.   (14)
We compute (14) by searching over r′ = 1, 2, …, where for each r′ we (approximately) solve the least-squares minimization using the CP function in the R package rTensor. We repeat Steps 1-2 ten times and plot the averaged numerical rank of f(𝒳) versus the transformation level in Figure 1a.
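A sketch of the numerical-rank search in (14); the paper uses the CP routine of the R package rTensor, and the sketch below assumes the Python package tensorly as a stand-in, with the threshold ε left as a placeholder.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def numerical_rank(T, eps=0.1, max_rank=30):
    """Smallest R whose best rank-R CP approximation has relative error below eps."""
    normT = np.linalg.norm(T)
    for R in range(1, max_rank + 1):
        cp = parafac(tl.tensor(T), rank=R, n_iter_max=200, tol=1e-8)
        err = np.linalg.norm(T - tl.cp_to_tensor(cp)) / normT
        if err <= eps:
            return R
    return max_rank
```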
7.2 Tensor rank and sign-rank
In section 3.1, we have provided several tensor examples with high tensor rank but low sign-rank. This section provides more examples and their proofs.
Unless otherwise specified, let Θ be an order-K (d, …, d)-dimensional tensor.
Example 6 (Max hypergraphon).
Suppose the tensor Θ takes the form Θ(i_1, …, i_K) = max(x_{i_1}, …, x_{i_K}), where x_1, …, x_d are distinct values in [0, 1]; without loss of generality, assume x_1 < ⋯ < x_d. Then rank(Θ) ≥ d, whereas srank(Θ − π) ≤ 2 for every π in the feasible range (x_1, x_d).
Proof.
We first prove the result for the matrix case K = 2. The full-rankness of Θ is verified via elementary row operations: subtracting the i-th row of Θ from the (i+1)-th row yields the vector (x_{i+1} − x_i) · (1, …, 1, 0, …, 0) with i leading nonzero entries, for i = 1, …, d − 1. These d − 1 difference rows, together with the first row of Θ, are linearly independent because the x_i's are distinct; hence rank(Θ) = d. Now it suffices to show srank(Θ − π) ≤ 2 for π in the feasible range (x_1, x_d). In this case, there exists an index i* ∈ [d − 1] such that x_{i*} < π ≤ x_{i*+1}. By definition, the sign matrix takes the block form
sgn(Θ − π) = ( −J_{i*×i*}   J_{i*×(d−i*)} ; J_{(d−i*)×i*}   J_{(d−i*)×(d−i*)} ),   (16)
where J denotes an all-one matrix of the indicated size. Therefore, the matrix sgn(Θ − π) is a rank-2 block matrix, which implies srank(Θ − π) ≤ 2.
We now extend the results to K ≥ 3. By definition of the tensor rank, the rank of a tensor is lower bounded by the rank of any of its matrix slices; the slice Θ(·, ·, 1, …, 1) equals the K = 2 matrix above, so rank(Θ) ≥ d. For the sign-rank with feasible π, notice that the sign tensor sgn(Θ − π) takes a similar form as in (16): its entries equal −1 on the block {1, …, i*}^K and +1 elsewhere,   (17)
where i* denotes the index that satisfies x_{i*} < π ≤ x_{i*+1}. The equation (17) implies that sgn(Θ − π) = J − 2 · u ⊗ ⋯ ⊗ u, where J is the all-one tensor and u ∈ {0, 1}^d takes value 1 on the i-th entry if i ≤ i* and 0 otherwise. Henceforth srank(Θ − π) ≤ 2. ∎
In fact, Example 6 is a special case of the following proposition.
Proposition 4 (Min/Max hypergraphon).
Let Z denote an order-K (d, …, d)-dimensional tensor with entries
Z(i_1, …, i_K) = max(x_{i_1}, …, x_{i_K}),   (18)
where the x_i ∈ [0, 1] are given numbers for all i ∈ [d]. Let g be a continuous function and Θ = g(Z) be the transformed tensor. For a given π, suppose the function g(z) − π has at most s ≥ 1 distinct real roots. Then, the sign rank of Θ − π satisfies srank(Θ − π) ≤ s + 1.
The same conclusion holds if we use min in place of max in (18).
Proof.
We reorder the tensor indices along each mode such that x_1 ≤ x_2 ≤ ⋯ ≤ x_d. Based on the construction of Z, the reordering does not change the rank of Z or of Θ. Let z_1 < ⋯ < z_s be the distinct real roots of the equation g(z) = π. We separate the proof into two cases, s = 1 and s ≥ 2.
• Case 1: s = 1. The continuity of g implies that the function g(z) − π has at most one sign-change point. Using a similar argument as in Example 6, we obtain the representation
(19)
of sgn(Θ − π) in terms of the all-one vector and the binary vector with entries 1{x_i < z_1}. Therefore, srank(Θ − π) ≤ 2.
• Case 2: s ≥ 2. By continuity, the function g(z) − π is non-zero and keeps an unchanged sign within each of the intervals (z_l, z_{l+1}) for l = 0, 1, …, s, where z_0 and z_{s+1} denote the endpoints of the range. Define the index set of intervals on which the sign is negative. We now prove that the sign tensor sgn(Θ − π) has rank bounded by s + 1. To see this, consider the tensor indices ω for which Z(ω) falls into a given interval,
(20)
The equation (20) is equivalent to
(21)
for all ω, where 1{·} denotes the indicator function. The equation (21) implies the low-rank representation of sgn(Θ − π),
(22)
where we have denoted the two binary vectors with entries 1{x_i < z_l} and 1{x_i < z_{l+1}}, respectively. Therefore, by (22) and the assumption that the number of roots is at most s, we conclude that srank(Θ − π) ≤ s + 1.
Combining the two cases yields srank(Θ − π) ≤ s + 1 for any π. ∎
We next provide several additional examples in which rank(Θ) grows with d whereas srank(Θ − π) is bounded by a constant independent of d. We state the examples in the matrix case, i.e., K = 2. A similar conclusion extends to K ≥ 3 by the following proposition.
Proposition 5.
Let Θ be a d-by-d matrix. For any given K ≥ 3, define an order-K tensor Θ′ = Θ ⊗ 1 ⊗ ⋯ ⊗ 1, where 1 ∈ ℝ^d denotes an all-one vector and the outer product with 1 is taken K − 2 times. Then we have rank(Θ′) = rank(Θ) and srank(Θ′ − π) = srank(Θ − π) for every π.
Proof.
The conclusion directly follows from the definition of tensor rank. ∎
Example 7 (Stacked banded matrices).
Let be a -dimensional vector, and define a -by- banded matrix . Then
Proof.
Note that is a banded matrix with entries
Elementary row operation directly shows that is full rank as follows,
(23) |
We now show by construction. Define two vectors and . We construct the following matrix
(24) |
The matrix is banded with entries
Furthermore, the entry value decreases with respect to ; i.e.,
(25) |
Notice that, for a given π, there exists a threshold level at which the constructed matrix has the same sign pattern as Θ − π. This is because both Θ and the matrix defined in (24) are banded matrices satisfying the monotonicity (25). By definition (24), the constructed matrix is a rank-2 matrix. Hence srank(Θ − π) ≤ 2. ∎
Remark 1.
The tensor analogy of banded matrices is used as simulation model 3 in Table 1.
Example 8 (Stacked identity matrices).
Let Θ = I_d be the d-by-d identity matrix. Then rank(Θ) = d, whereas srank(Θ − π) ≤ 3 for all π ∈ (0, 1).
Proof.
Depending on the value of π, the sign matrix sgn(Θ − π) falls into one of three cases: 1) sgn(Θ − π) is a matrix of all 1; 2) sgn(Θ − π) is a matrix of all −1; 3) sgn(Θ − π) = 2I_d − J, where J is the all-one matrix. The former two cases are trivial, so it suffices to show srank(2I_d − J) ≤ 3 in the third case. This follows from the classical construction that places d unit vectors v_1, …, v_d within a small arc of the unit circle: for a suitably small ε, the rank-3 matrix with entries ⟨v_i, v_j⟩ − cos(ε) has positive diagonal and negative off-diagonal entries, and hence shares the sign pattern of 2I_d − J. ∎
7.3 Proofs
7.3.1 Proofs of Propositions 1-3
Proof of Proposition 1.
The strict monotonicity of g implies that the inverse function g^{-1} is well-defined. When g is strictly increasing, the mapping is sign preserving: if Θ(ω) ≥ 0, then g(Θ(ω)) ≥ g(0) = 0; conversely, if g(Θ(ω)) ≥ 0, then applying g^{-1} to both sides gives Θ(ω) ≥ 0. When g is strictly decreasing, the mapping is sign reversing: if Θ(ω) ≥ 0, then g(Θ(ω)) ≤ g(0) = 0; conversely, if g(Θ(ω)) ≤ 0, then applying g^{-1} to both sides gives Θ(ω) ≥ 0. Therefore, sgn g(Θ) = sgn Θ, or sgn g(Θ) = sgn(−Θ). Since constant multiplication does not change the tensor rank, we have srank(g(Θ)) ≤ rank(Θ). ∎
Proof of Proposition 3.
Fix π. Based on the definition of the classification loss L, the risk function relies only on the sign pattern of its tensor argument. Therefore, without loss of generality, we assume the tensors under comparison are binary (sign) tensors. We evaluate the excess risk
(26) |
Denote , , , and . The expression of is simplified as
(27) |
where the third line uses the fact and , and the last line uses the assumption . The equality in (27) is attained when or . Combining (27) with (26), we conclude that, for all 𝒵,
(28) |
In particular, setting in (28) yields the minimum. Therefore,
Since by assumption, the last inequality becomes equality. The proof is complete. ∎
7.3.2 Proof of Theorem 1
Proof of Theorem 1.
Fix . Based on (28) in Proposition 3 we have
(29) |
The Assumption 1 states that
(30) |
Without further specification, all relevant probability statements are with respect to the sampling distribution Π.
We divide the proof into two cases: and .
• Case 1.
By (29), for all ,
(31)
where the last line follows from the definition of the MAE and (30). We maximize the lower bound (31) with respect to the free parameter, and obtain the optimal choice; the corresponding lower bound of the inequality (31) becomes
where the two constants are independent of the tensor dimension. Combining both cases gives
(32) (33)
where the multiplicative factor is independent of the tensor dimension.
∎
7.3.3 Proof of Theorem 2
The following lemma provides the variance-to-mean relationship implied by the -smoothness of . The relationship plays a key role in determining the convergence rate based on empirical process theory (Shen and Wong,, 1994).
Lemma 1 (Variance-to-mean relationship).
Proof of Lemma 1.
We expand the variance by
(38) |
where the second line comes from the boundedness of the classification loss, and the third line comes from the stated elementary inequality together with the boundedness of the classification weight. Here we have absorbed the constant multipliers into the constant. The conclusion (37) then directly follows by applying Remark 2 to (38). ∎
Proof of Theorem 2.
Fix . For notational simplicity, we suppress the subscript and write in place of . Denote and .
Because the classification loss is scale-free, i.e., for every , we consider the estimation subject to without loss of generality. Specifically, let
We next apply empirical process theory to bound the estimation error. To facilitate the analysis, we view the data as a collection of independent random variables, where the randomness comes from both the noise and the sampling. Write the index set Ω = {ω_1, …, ω_n}, so the loss function (4) becomes
We use to denote the function induced by tensor such that for . Under this set-up, the quantity of interest
(39) |
is an empirical process induced by the corresponding function class. Note that there is a one-to-one correspondence between the two sets.
Our remaining proof adopts the techniques of Wang et al. (2008, Theorem 3) to bound (39) over the function family. We summarize only the key differences here and refer to Wang et al. (2008) for the complete proof. Based on Lemma 1, the α-smoothness of Θ implies
(40) |
Applying local iterative techniques in Wang et al., (2008, Theorem 3) to the empirical process (39) with the variance-to-mean relationship (40) gives that
(41) |
where the convergence rate is determined by the solution to the following inequality,
(42) |
for some constant. In particular, the smallest rate satisfying (42) yields the best upper bound on the error rate. Here we use the L_2-metric, ε-bracketing number (cf. Definition 3) of the function family.
It remains to solve for the smallest possible in (42). Based on Lemma 2, the inequality (42) is satisfied with
(43) |
Therefore, by (41), with very high probability.
Inserting the above bound into (35) gives
(44) |
where the last line follows from the fact that with and . We plug this into (44) and absorb the remaining term into the constant. The conclusion is then proved. ∎
Definition 3 (Bracketing number).
Consider a family of functions ℱ and let ε > 0. Let 𝒳 denote the domain space equipped with the measure Π. We call a collection of function pairs {(f_l^L, f_l^U)} an L_2-metric, ε-bracketing function set of ℱ if, for every f ∈ ℱ, there exists a pair (f_l^L, f_l^U) such that
f_l^L(x) ≤ f(x) ≤ f_l^U(x) for all x ∈ 𝒳, and ‖f_l^U − f_l^L‖_{L_2(Π)} ≤ ε.
The bracketing number with L_2-metric is the logarithm of the smallest cardinality of an ε-bracketing function set of ℱ.
Lemma 2 (Bracketing complexity of low-rank tensors).
Define the family of rank- bounded tensors and the induced function family . Set
(45) |
Then, the following inequality is satisfied.
(46) |
where is a constant independent of and .
Proof of Lemma 2.
To simplify the notation, we denote . Notice that
(47) |
It follows from Kosorok, (2007, Theorem 9.22) that the -metric, -bracketing number of is bounded by
The last inequality is from the covering number bounds for rank- bounded tensors; see Mu et al., (2014, Lemma 3).
Inserting the bracketing number into (46) gives
(48) |
By the monotonicity of the integrand in (48), we bound by
(49) |
where the second line follows from for . It remains to verify that for specified in (46). Plugging into the last line of (49) gives
(50) (51) (52)
where is a constant independent of and . The proof is therefore complete. ∎
7.3.4 Proof of Theorem 3
Proof of Theorem 3.
By definition of Θ̂, we have
(53) |
where the last line comes from the triangle inequality and the inequality
(54) |
Now it suffices to bound the first term in (53). We prove that
(55) |
Theorem 2 implies that the sign estimation accuracy depends on the closeness of the level π to the mass points in 𝒩. Therefore, we partition the levels based on their closeness to 𝒩; specifically, we collect the levels close to the mass points into one set. We expand (55) by
(56) |
By assumption, the first term involves only a finite number of summands and thus can be bounded by , where is a constant such that . We bound the second term using the explicit forms of the levels in the sequence. Based on Theorem 2,
(57) (58) (59) (60)
where the last inequality follows from Lemma 3. Combining the bounds for the two terms in (56) completes the proof of conclusion (55). Finally, plugging (55) into (53) yields
(61) |
The conclusion follows by absorbing into the constant term in the statement. ∎
Lemma 3.
Fix and a sequence with . Then,
Proof of Lemma 3.
Notice that all points satisfy for all . We use this fact to compute the sum
(62) (63) (64) (65)
where the third line uses the monotonicity of for . ∎
8 Conclusion
We have developed a tensor completion method that addresses both low- and high-rankness based on the sign series representation. Our work provides a nonparametric framework for tensor estimation, and we obtain results that were previously out of reach. We hope the work opens up new lines of inquiry that allow more researchers to contribute to this field.
Acknowledgements
This research is supported in part by NSF grant DMS-1915978 and Wisconsin Alumni Research Foundation.
References
- Alon et al., (2016) Alon, N., Moran, S., and Yehudayoff, A. (2016). Sign rank versus VC dimension. In Conference on Learning Theory, pages 47–80.
- Anandkumar et al., (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.
- Anandkumar et al., (2017) Anandkumar, A., Ge, R., and Janzamin, M. (2017). Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18(1):752–791.
- Balabdaoui et al., (2019) Balabdaoui, F., Durot, C., and Jankowski, H. (2019). Least squares estimation in the monotone single index model. Bernoulli, 25(4B):3276–3310.
- Bartlett et al., (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
- Cai et al., (2019) Cai, C., Li, G., Poor, H. V., and Chen, Y. (2019). Nonconvex low-rank tensor completion from noisy data. In Advances in Neural Information Processing Systems, pages 1863–1874.
- Chan and Airoldi, (2014) Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.
- Chi et al., (2020) Chi, E. C., Gaines, B. J., Sun, W. W., Zhou, H., and Yang, J. (2020). Provable convex co-clustering of tensors. Journal of Machine Learning Research, 21(214):1–58.
- Cohn and Umans, (2013) Cohn, H. and Umans, C. (2013). Fast matrix multiplication using coherent configurations. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1074–1087.
- De Lathauwer et al., (2000) De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.
- De Wolf, (2003) De Wolf, R. (2003). Nondeterministic quantum query and communication complexities. SIAM Journal on Computing, 32(3):681–699.
- Fan and Udell, (2019) Fan, J. and Udell, M. (2019). Online high rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8690–8698.
- Ganti et al., (2017) Ganti, R., Rao, N., Balzano, L., Willett, R., and Nowak, R. (2017). On learning high dimensional structured single index models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1898–1904.
- Ganti et al., (2015) Ganti, R. S., Balzano, L., and Willett, R. (2015). Matrix completion under monotonic single index models. In Advances in Neural Information Processing Systems, pages 1873–1881.
- Ghadermarzy et al., (2018) Ghadermarzy, N., Plan, Y., and Yilmaz, O. (2018). Learning tensors from partial binary measurements. IEEE Transactions on Signal Processing, 67(1):29–40.
- Ghadermarzy et al., (2019) Ghadermarzy, N., Plan, Y., and Yilmaz, Ö. (2019). Near-optimal sample complexity for convex tensor completion. Information and Inference: A Journal of the IMA, 8(3):577–619.
- Globerson et al., (2007) Globerson, A., Chechik, G., Pereira, F., and Tishby, N. (2007). Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265–2295.
- Han et al., (2020) Han, R., Willett, R., and Zhang, A. (2020). An optimal statistical and computational framework for generalized tensor estimation. arXiv preprint arXiv:2002.11255.
- Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
- Hillar and Lim, (2013) Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45.
- Hitchcock, (1927) Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189.
- Hong et al., (2020) Hong, D., Kolda, T. G., and Duersch, J. A. (2020). Generalized canonical polyadic tensor decomposition. SIAM Review, 62(1):133–163.
- Hore et al., (2016) Hore, V., Viñuela, A., Buil, A., Knight, J., McCarthy, M. I., Small, K., and Marchini, J. (2016). Tensor decomposition for multiple-tissue gene expression experiments. Nature genetics, 48(9):1094.
- Jain and Oh, (2014) Jain, P. and Oh, S. (2014). Provable tensor factorization with missing data. In Advances in Neural Information Processing Systems, volume 27, pages 1431–1439.
- Kolda and Bader, (2009) Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
- Kosorok, (2007) Kosorok, M. R. (2007). Introduction to empirical processes and semiparametric inference. Springer Science & Business Media.
- Lee and Wang, (2020) Lee, C. and Wang, M. (2020). Tensor denoising and completion based on ordinal observations. In International Conference on Machine Learning, pages 5778–5788.
- Li et al., (2009) Li, Y., Liu, Y., Li, J., Qin, W., Li, K., Yu, C., and Jiang, T. (2009). Brain anatomical network and intelligence. PLoS Comput Biol, 5(5):e1000395.
- Montanari and Sun, (2018) Montanari, A. and Sun, N. (2018). Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425.
- Mu et al., (2014) Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81.
- Ongie et al., (2017) Ongie, G., Willett, R., Nowak, R. D., and Balzano, L. (2017). Algebraic variety models for high-rank matrix completion. In International Conference on Machine Learning, pages 2691–2700.
- Robinson, (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 56(4):931–954.
- Shen and Wong, (1994) Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics, 22:580–615.
- Wang et al., (2008) Wang, J., Shen, X., and Liu, Y. (2008). Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167.
- Wang et al., (2017) Wang, L., Durante, D., Jung, R. E., and Dunson, D. B. (2017). Bayesian network–response regression. Bioinformatics, 33(12):1859–1866.
- Wang and Li, (2020) Wang, M. and Li, L. (2020). Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality. Journal of Machine Learning Research, 21(154):1–38.
- Wang and Zeng, (2019) Wang, M. and Zeng, Y. (2019). Multiway clustering via tensor block models. In Advances in Neural Information Processing Systems, pages 713–723.
- Xu, (2018) Xu, J. (2018). Rates of convergence of spectral methods for graphon estimation. In International Conference on Machine Learning, pages 5433–5442.
- Yuan and Zhang, (2016) Yuan, M. and Zhang, C.-H. (2016). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068.
- Zhang and Xia, (2018) Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, 64(11):7311 – 7338.