A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression
Abstract
This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.
Index Terms:
logistic regression, low-rank matrix, minimax risk, singular value decomposition.
I Introduction
Logistic Regression (LR) is a statistical model commonly used in machine learning for classification problems [1]. In the simplest terms, LR models the conditional probability that a binary response variable $y \in \{0,1\}$ takes on one of its two possible values, which represent two possible events (such as success/failure or detection/no detection). In regression analysis, the aim is to accurately estimate the model parameters from a set of training data points, each consisting of a covariate sample vector $\mathbf{x}$ and its response $y$. The conventional LR model is as follows:
$\mathbb{P}\left(y=1 \mid \mathbf{x}\right) = \dfrac{\exp\left(\mathbf{x}^{\top}\boldsymbol{\beta} + b\right)}{1 + \exp\left(\mathbf{x}^{\top}\boldsymbol{\beta} + b\right)}$   (1)
where $y$ is the binary response, $\mathbf{x}$ is the known covariate vector, $\boldsymbol{\beta}$ is the deterministic but unknown coefficient vector, and $b \in \mathbb{R}$ is the intercept.
In many practical applications, covariates naturally take the form of two-dimensional arrays. Common examples include images, biological data such as electroencephalography (EEG), fiber bundle imaging [2] and spatio-temporal data [3]. Classical machine learning techniques vectorize such matrix covariates and estimate a coefficient vector. This leads to computational inefficiency and the destruction of the underlying spatial structure of the coefficient matrix, which often carries valuable information [4, 5]. To address these issues, one can model the matrix LR problem as
$\mathbb{P}\left(y=1 \mid \mathbf{X}\right) = \dfrac{\exp\left(\langle\mathbf{X},\mathbf{B}\rangle + b\right)}{1 + \exp\left(\langle\mathbf{X},\mathbf{B}\rangle + b\right)}$   (2)
where $\mathbf{X}$ is the known covariate matrix, $\mathbf{B}$ is the unknown coefficient matrix, $b$ is the intercept, and $\langle\mathbf{X},\mathbf{B}\rangle = \operatorname{trace}(\mathbf{B}^{\top}\mathbf{X})$ denotes the matrix inner product. The model in (2) preserves the matrix structure of, and the valuable information in, the covariate data. In matrix-variate logistic regression analysis, the goal is to find an estimate of the coefficient matrix $\mathbf{B}$.
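To make the model concrete, the following is a minimal sketch of how data could be generated from (2) under the assumptions introduced later in Section II (i.i.d. Gaussian covariate entries and a fixed low-rank coefficient matrix); all dimensions, the rank, the intercept value, and variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): m x n covariates, rank-r coefficient matrix.
m, n, r, num_samples = 8, 6, 2, 500

# Planted rank-r coefficient matrix B = U diag(s) V^T and intercept b.
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
B = U @ np.diag([2.0, 1.0]) @ V.T
b = 0.1

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Draw i.i.d. Gaussian covariate matrices and Bernoulli responses from model (2).
X = rng.standard_normal((num_samples, m, n))
logits = np.einsum('kij,ij->k', X, B) + b   # <X_k, B> + b for each sample k
y = rng.binomial(1, sigmoid(logits))

print(y[:10], logits[:5])
```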
In this paper, we focus on the high-dimensional setting and derive a lower bound on the minimax risk of the matrix LR problem under the assumption that the coefficient matrix $\mathbf{B}$ has rank $r$. We note that the low-rank assumption has been used ubiquitously in regression analysis in prior works [6, 7]. This low-rank structure may arise from the presence of latent variables, such that the model’s intrinsic degrees of freedom are smaller than its extrinsic dimensionality. This allows us to represent the data in a lower-dimensional space and potentially reduce the sample complexity of estimating the parameters. Minimax lower bounds are useful in understanding the fundamental error thresholds of the problem under study and in assessing the performance of existing algorithms. They also provide beneficial insights into the parameters on which an achievable minimax risk might depend.
Whilst a number of works in the literature derive minimax lower bounds for higher-order linear regression problems [8, 7], as well as for vector LR problems [6, 9, 10], to the best of our knowledge, little work has been done on this topic for matrix-variate LR problems. Among the existing works that study the matrix LR problem, many extend the model in (2) by proposing regularized matrix LR formulations in order to obtain rank-optimized or sparse estimates of the coefficient matrix [5, 11]. Recently, works have introduced regularized matrix LR models for inference on image data [12]. Much of the literature develops efficient algorithms and provides empirical results on their performance [5, 11, 12], while some works provide theoretical guarantees for the proposed algorithms [11]. By contrast, Berthet and Baldin [13] model the problem of graph regression as a matrix LR problem in which the coefficient matrix is assumed to be block-sparse, low-rank, and square. Under these assumptions (which are restrictive and not the most general), the authors derive minimax lower bounds on the estimation error.
The three prior works [4, 11, 12] implement algorithms for matrix logistic regression but do not prove sample complexity bounds (upper or lower). In this paper, we derive a minimax lower bound on the estimation error of a low-rank matrix LR model, which yields a bound on the number of samples necessary for estimating the coefficient matrix. Contrary to prior works, we impose minimal assumptions on the coefficient matrix, making our results applicable to a larger class of matrices.
Our minimax lower bound depends on the distribution of the covariates and the energy of the regression matrix. Following previous works that derive minimax lower bounds in the dictionary learning setting [14, 15], we use the standard strategy of lower bounding the minimax risk in parametric estimation problems by the maximum error probability in a multiple hypothesis testing problem and applying Fano’s inequality [16]. We derive a lower bound that is proportional to the degrees of freedom in the rank-$r$ matrix LR problem, showing that the sample complexity scales with the intrinsic degrees of freedom of the rank-$r$ coefficient matrix rather than with its total number of entries, as in the vectorized setting. This result is intuitively pleasing as it coincides with the number of free parameters in the model. We also show that our methods extend readily to the tensor case, i.e., to a coefficient tensor with known low-rank structure.
II Model and Problem Formulation
We use the following notational conventions throughout the paper: scalars, vectors, matrices, and tensors are denoted by distinct typefaces, and $\mathbf{I}$ denotes the identity matrix. We write $\|\cdot\|_F$ for the Frobenius norm; the other vector and matrix norms used below are the standard ones. Given a matrix, we index its columns in the usual way; given a third-order tensor, fixing the indices in the first and second modes yields its frontal slices. The floor of a real number is the greatest integer not exceeding it. We write $\otimes$ for the Kronecker product of two matrices and $\mathrm{vec}(\cdot)$ for the column-wise vectorization of a matrix.
Consider the matrix LR model in (2), in which the Bernoulli responses correspond to independent and identically distributed (i.i.d.) observations. Each covariate matrix has independent, zero-mean, normally distributed entries with common variance. Each response is generated according to (2) from its covariate matrix and a fixed coefficient matrix $\mathbf{B}$ whose energy (Frobenius norm) is upper bounded by a constant.
We consider the case where $\mathbf{B}$ is a rank-$r$ matrix. Specifically, the rank-$r$ singular value decomposition of $\mathbf{B}$ is
$\mathbf{B} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$   (3)
where $\mathbf{U}$ and $\mathbf{V}$ are (tall) singular vector matrices with orthonormal columns, and $\boldsymbol{\Sigma}$ is the diagonal matrix of singular values. Under this low-rank structure of $\mathbf{B}$, we have
$\mathbb{P}\left(y=1 \mid \mathbf{X}\right) = \dfrac{\exp\left(\langle\mathbf{X},\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\rangle + b\right)}{1 + \exp\left(\langle\mathbf{X},\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\rangle + b\right)}.$   (4)
We use Kronecker product properties [17] to express $\mathrm{vec}(\mathbf{B}) = \mathrm{vec}(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top})$ as $(\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(\boldsymbol{\Sigma})$.
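The identity used in this step can be verified numerically; the sketch below (with purely illustrative dimensions) checks that $\mathrm{vec}(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}) = (\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(\boldsymbol{\Sigma})$ and, consequently, that the matrix inner product can be written in vectorized form.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 6, 2

# Rank-r SVD factors as in (3): orthonormal-column U, V and diagonal Sigma.
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
Sigma = np.diag(rng.uniform(0.5, 2.0, size=r))
B = U @ Sigma @ V.T

# vec(U Sigma V^T) = (V kron U) vec(Sigma), with column-wise vectorization.
vec = lambda M: M.reshape(-1, order='F')
print(np.allclose(vec(B), np.kron(V, U) @ vec(Sigma)))  # True

# Consequently <X, B> = vec(X)^T (V kron U) vec(Sigma).
X = rng.standard_normal((m, n))
print(np.isclose(np.sum(X * B), vec(X) @ np.kron(V, U) @ vec(Sigma)))  # True
```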
The goal in LR is to find an estimate of $\mathbf{B}$ using the training data. We assume that the true coefficient matrix belongs to the set
(5) |
the open ball of a given radius with respect to the chosen distance metric, which resides in a given parameter space,
(6) |
Thus the true coefficient matrix has rank $r$ and energy bounded by a constant.
Our objective is to find a lower bound on the minimax risk for estimating the coefficient matrix $\mathbf{B}$. The minimax risk is measured through a non-decreasing loss function of the estimation error; here, it is the worst-case mean squared error (MSE) of the best estimator, i.e.,
(7) |
where the covariate tensor collects the covariate matrices of all samples. We point out that the minimax risk in (7) is an inherent property of the LR problem, and a lower bound on it applies to all possible estimators.
III Minimax Lower Bound For Low-Rank Matrix Logistic Regression
The minimax lower bound in the low-rank matrix LR setting, which is based on Fano’s inequality, depends explicitly on the dimensions of the covariate matrices and their distribution, the rank and the upper bound on the energy of the coefficient matrix, the number of samples, and the construction of the multiple hypothesis test set.
The novelty of our work is that we explicitly leverage the rank-$r$ structure of the coefficient matrix in (3), which leads to a hypothesis construction that is structurally different in our setting compared to vector LR [6, 10, 9]. Furthermore, existing low-rank matrix packings, such as those in Negahban and Wainwright [18], are not useful in the LR setting of this paper, since the logistic function in our model makes part of our analysis non-trivial and fundamentally different from such works. In this work, we create the hypothesis set from three constructed sets (one each for $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$), and derive conditions under which all three sets can exist simultaneously. These conditions ensure the existence of a hypothesis set whose elements have a rank-$r$ matrix structure and the additional essential properties noted below. We also show that our constructed hypothesis set and analysis are well suited to the matrix-variate model because they are easily generalizable to the tensor setting for LR.
III-A Main Result
We derive lower bounds on the minimax risk using an argument based on Fano’s inequality [19]. To do so, we first relate the minimax risk in (7) to a multi-way hypothesis testing problem over a family of distinct coefficient matrices that is exponentially large with respect to the dimensions of the matrices, with energy upper bounded by a constant, i.e., matrices residing inside the open ball of fixed radius:
(8) |
where the correct hypothesis is generated uniformly at random from this set.
More specifically, suppose there exists an estimator whose worst-case MSE matches the minimax risk. Such an estimator can be used to solve a multiple hypothesis testing problem by means of a minimum distance decoder. The minimax risk can then be lower bounded by the probability of error in such a multiple hypothesis test. Our main challenge is to further lower bound this probability of error in order to derive lower bounds on the minimax risk.
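To make the reduction concrete, here is a minimal sketch of a minimum distance decoder of the kind formalized in (11) below; the hypothesis set used here is random and purely illustrative, not the packing construction of the proof.

```python
import numpy as np

def min_distance_decode(B_hat, hypotheses):
    """Return the index of the hypothesis closest to the estimate in Frobenius norm."""
    dists = [np.linalg.norm(B_hat - Bk, 'fro') for Bk in hypotheses]
    return int(np.argmin(dists))

# Toy usage: when the estimate lies much closer to the true hypothesis than to any
# other, the decoder recovers the correct index.
rng = np.random.default_rng(2)
hypotheses = [rng.standard_normal((4, 3)) for _ in range(8)]
true_idx = 5
B_hat = hypotheses[true_idx] + 0.01 * rng.standard_normal((4, 3))
print(min_distance_decode(B_hat, hypotheses) == true_idx)
```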
We now state our main result on the minimax risk of the low-rank matrix coefficient estimation problem in LR.
Theorem 1.
Consider the rank-$r$ matrix LR problem in (4) with i.i.d. observations, where the true coefficient matrix belongs to the parameter set defined in Section II. Then, for covariate matrices with i.i.d. zero-mean Gaussian entries, the minimax risk is lower bounded by
(9) |
where the remaining quantities in (9) are constants specified in the proof.
III-B Roadmap for Theorem 1
For notational convenience, we collect the responses of all samples into a response vector and the covariate matrices of all samples into a covariate tensor whose frontal slices are the individual covariate matrices.
For our analysis, we construct a hypothesis set of distinct matrices, as defined in (8). Each hypothesis is a low-rank matrix following the decomposition in (3) and is assembled from three separately constructed sets (one each for $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$). Now consider the mutual information between the observations and the random index of the correct hypothesis. The construction of the hypothesis set is used to provide upper and lower bounds on this mutual information, from which we derive lower bounds on the minimax risk.
To find a lower bound on the mutual information, we first consider the packing set, exponentially large with respect to the matrix dimensions, in which any two distinct hypotheses are separated by a minimum distance, i.e.,
(10) |
for some positive separation parameter. To achieve our desired bounds on the minimax risk, we suppose the existence of an estimator that achieves the minimax risk. We consider the minimum distance decoder
(11) |
which seeks to detect the correct coefficient matrix. For the minimum distance decoder to detect the correct hypothesis, and for the probability of detection error to be small, we require the estimate to lie within half the minimum packing distance of the true coefficient matrix; a detection error might occur when the estimate falls outside this neighborhood. The following bound relates the loss to the probability of error:
(12)
Next, (12) is used to obtain a lower bound on the mutual information via Fano’s inequality, stated below [16]:
(13) |
Secondly, to find an upper bound on the mutual information, we recognize that the LR model produces response variables that are Bernoulli random variables when conditioned on the covariates. On this basis, we evaluate the mutual information conditioned on the covariate tensor. Next, for each hypothesis, consider the conditional probability distribution of the responses given the covariates under that coefficient matrix, and the relative entropy between the distributions produced by any two distinct hypotheses. Due to the convexity of the relative entropy, we upper bound the conditional mutual information as follows [20, 14]:
(14) |
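As a purely numerical illustration of the quantity being bounded, the sketch below estimates the expected KL divergence between the conditional Bernoulli response distributions induced by two coefficient matrices; the matrices, dimensions, and sample size are illustrative and are not the packing construction used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bernoulli_kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Two candidate coefficient matrices and a batch of Gaussian covariates.
m, n, N = 5, 4, 2000
B1, B2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
X = rng.standard_normal((N, m, n))

# Monte Carlo estimate of E_X[ KL( P(y|X; B1) || P(y|X; B2) ) ], the kind of
# pairwise quantity that appears in the convexity-based bound on the mutual information.
p1 = sigmoid(np.einsum('kij,ij->k', X, B1))
p2 = sigmoid(np.einsum('kij,ij->k', X, B2))
print(np.mean(bernoulli_kl(p1, p2)))
```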
IV Proof of Main Result
The formal proof of Theorem 1 relies on the following lemmas. Lemma 1 introduces the set of vectors, exponentially large with respect to the dimensions, from which we will later construct part of the hypothesis set. The set is constructed as a subset of a hypercube with a required minimum distance between any two distinct elements, and Lemma 1 bounds the probability that this minimum distance requirement is violated.
Lemma 1.
Fix the set cardinality and the minimum distance parameter, and consider the set of vectors whose entries are independent and identically distributed random variables taking values uniformly on a two-point set. The probability that there exists a distinct pair of vectors in the set violating the minimum distance requirement is upper bounded as follows:
(16) |
The above is a direct result of a standard application of McDiarmid’s inequality [21]. Notice, however, that our technique differs from that of [22], because we impose minimum distance conditions in the Hamming metric rather than conditions on inner products between vectors.
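As a quick numerical sketch of the type of construction behind Lemma 1 (with illustrative values of the dimension and set size, and entries drawn from {−1, +1} as an assumed two-point alphabet): pairwise Hamming distances between independent random sign vectors concentrate around half the dimension, which is why a minimum distance requirement fails for some pair only with small probability.

```python
import numpy as np

rng = np.random.default_rng(4)

# Random sign vectors as "generating" vectors: entries i.i.d. uniform on {-1, +1}.
p, L = 200, 50
G = rng.choice([-1.0, 1.0], size=(L, p))

# Pairwise Hamming distances between distinct vectors; each concentrates around p/2,
# so the minimum over all pairs stays close to p/2 with high probability.
ham = np.array([[np.sum(G[i] != G[j]) for j in range(L)] for i in range(L)])
off_diag = ham[~np.eye(L, dtype=bool)]
print(off_diag.min(), p / 2)   # minimum pairwise distance vs. its typical value
```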
Similar to the above, the following corollary introduces a set of matrices from which we will construct the singular vector matrices.
Corollary 1.
Fix the set cardinality and the minimum distance parameter, and consider the set of matrices whose entries are independent and identically distributed random variables taking values uniformly on a two-point set. The probability that there exists a distinct pair of matrices in the set violating the minimum distance requirement is upper bounded as follows:
(17) |
From the results in Lemma 1 and Corollary 1, Lemma 2 derives conditions under which a hypothesis set with a certain set of properties exists, and constructs two types of sets: 1) sets of orthonormal-column matrices (for the left and right singular vectors) built from the “generating” matrices defined in Corollary 1, and 2) a set of diagonal matrices built from the “generating” vectors in Lemma 1. Each hypothesis is then constructed from these sets according to the decomposition in (3). Next, we prove lower and upper bounds on the distance between any two distinct elements of the hypothesis set. The lower bound determines the minimum packing distance between any two hypotheses, whilst the upper bound is used to derive the results in Lemma 3.
Lemma 2.
There exists a collection of matrices, for some choice of the construction parameters, of cardinality
(18) |
such that for any
(19) |
we have
(20) |
Lemma 3 derives an upper bound on the conditional mutual information.
Lemma 3.
The final lemma bounds the probability of detection error, under the conditions mentioned in Section III-B, for the recovery of the correct hypothesis by the minimum distance decoder. The proof follows exactly that of Lemma 8 in [14].
Lemma 4.
Consider the minimum distance decoder in (11) applied to the matrices constructed in Lemma 2, and consider the LR model in (2) with minimax risk assumed to be suitably upper bounded. The minimum distance decoder recovers the correct hypothesis whenever the estimate lies within half the minimum packing distance of the true coefficient matrix, and the probability of detection error of the minimum distance decoder is upper bounded accordingly.
Proof:
Fix the construction parameters such that (19) is satisfied. Suppose there exists an estimator which guarantees the minimax risk. By Lemma 2, there exists a packing set containing distinct rank-$r$ matrices, whose cardinality satisfies (18) and in which the distance between any pair satisfies (10). By Lemma 3, the conditional mutual information is upper bounded by (21), and the probability of detection error is upper bounded by Lemma 4. Substituting (21) and (13) into (15), we have:
(22) |
Lastly, choosing the free scaling parameter appropriately, we achieve the result in (9). ∎
V Discussion and Conclusion
In this paper we provided a minimax lower bound for the low-rank matrix-variate LR problem in the high-dimensional setting. We constructed a packing set of low-rank structured matrices with finite energy. Using the construction, we derived bounds on the conditional mutual information defined in our problem in order to obtain a lower bound on the minimax risk. Compared to the vector case, such as in [6], the result in Theorem 1 shows that the scaling of the lower bound decreases from the total number of entries of the coefficient matrix to its intrinsic degrees of freedom. The result also shows that the lower bound on the minimax risk is proportional to the intrinsic degrees of freedom of the rank-$r$ coefficient matrix and decreases with increasing sample size. This suggests that we can develop algorithms that take advantage of the low-rank structure of the coefficient matrices. Moreover, the result in (9) can be generalized from low-rank matrices to low-rank tensors. Imposing a rank-$r$ decomposition on a coefficient tensor affords the same advantages as those discussed above for low-rank matrices. This low-rank structure is a special case of the well-known canonical polyadic or parallel factors decomposition (CANDECOMP/PARAFAC, or CP) [17], formally defined as
(23) |
where each factor is a column vector along one mode of the tensor, the outer product of one such column from each mode forms a rank-1 tensor, and each rank-1 tensor is weighted by a scalar. The rank-$r$ CP decomposition thus expresses a tensor as a sum of $r$ rank-1 tensors. Equivalent to (23) is the following expression of a CP-structured tensor:
(24) |
where the core tensor is simply a higher-order analogue of the diagonal matrix in (3) (only the elements along its super-diagonal are non-zero), and the mode-wise factor matrices are rank-$r$ matrices with orthonormal columns. Thus, the setup and construction proposed in this paper can be extended to the tensor case, in which the matrix decomposition in (3) is simply a special case with two factor matrices.
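As an illustrative sketch of this structure (all sizes are arbitrary, and the factor matrices are simply random orthonormal-column matrices), the following reconstructs a third-order tensor from a rank-2 CP decomposition in the spirit of (23) and (24).

```python
import numpy as np

rng = np.random.default_rng(5)

def cp_reconstruct(weights, factors):
    """Rebuild a third-order tensor from a rank-R CP decomposition:
    a weighted sum of R rank-1 tensors formed by outer products of factor columns."""
    d1, d2, d3 = (F.shape[0] for F in factors)
    T = np.zeros((d1, d2, d3))
    for rho in range(len(weights)):
        a, b, c = factors[0][:, rho], factors[1][:, rho], factors[2][:, rho]
        T += weights[rho] * np.einsum('i,j,k->ijk', a, b, c)
    return T

# Illustrative rank-2 CP structure with orthonormal factor columns, mirroring (24).
R, dims = 2, (5, 4, 3)
factors = [np.linalg.qr(rng.standard_normal((d, R)))[0] for d in dims]
weights = np.array([3.0, 1.0])
T = cp_reconstruct(weights, factors)
print(T.shape)
```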
VI Appendix
Proof:
Consider the set of vectors defined in Lemma 1. For any two distinct indices, define the vector given by the point-wise product of the corresponding pair of vectors, with entries indexed accordingly. Define also the function
(25) |
We use this notation to denote two vectors that agree in all entries except one.
We require a minimum packing distance
(26) |
The probability that the requirement in (26) is not satisfied is defined as
(27) |
The function defined in (25) satisfies
(28) |
According to McDiarmid’s inequality in [21], applied with the property in (28), we have
(29) |
The probability in (29) holds for any fixed pair of distinct indices. We take a union bound over all distinct pairs and upper bound the probability as:
(30) |
∎
Proof:
Fix the following arbitrary real orthogonal bases: one set of distinct orthogonal bases associated with the left singular vectors, and one set of distinct orthogonal bases associated with the right singular vectors.
Next, consider the following hypercubes or subsets thereof: 1) The set of vectors from Lemma 1:
(31) |
2) Two sets of matrices, one for each of the two singular vector factors, from Corollary 1:
(32) |
and
(33) |
respectively.
From Lemma 1 we have the following bounds on the probability that the minimum distance condition is violated for the set (31):
(34) |
Likewise, from Corollary 1, we have the following bounds on the probability that the minimum distance conditions are violated for the sets (32) and (33), respectively:
(35) |
and
(36) |
We require the sets in (31), (32) and (33), from which the set of coefficient matrices is built, to exist simultaneously. Hence, using a union bound on (34), (35) and (36), we can choose parameters to guarantee the existence of a construction. This is satisfied if the following conditions on the cardinalities of the three sets hold:
(37) |
(38) |
and
(39) |
Note that (37), (38) and (39) are sufficient conditions for the simultaneous existence of sets in (31), (32) and (33), such that the minimum distance condition between any two elements in each set is satisfied.
We proceed with the following steps in order to construct the final set of coefficient matrices. Without loss of generality, we assume that the energy of any coefficient matrix is upper bounded by a constant. We will construct diagonal matrices and matrices with orthonormal columns, all of which will be used to construct each hypothesis. In other words, due to our LR model, any coefficient matrix will have a rank-$r$ singular value decomposition; thus, a bound on the matrix norm of the coefficient matrix gives a bound on the norm of its matrix of singular values.
Firstly, we construct a collection of vectors from the generating vectors in (31), as follows:
(40) |
From (40), the required norm bound on these vectors follows.
Similarly, we construct two collections of matrices from the generating matrices in (32) and (33), respectively. Denoting their columns in the natural way, let the columns be constructed as follows:
(41) |
and
(42) |
Secondly, we construct a sparse vector, element-wise, from the vector obtained above, using the following construction:
(43) |
and we note that
(44) |
We also construct the two collections of matrices with orthonormal columns. We show the construction of the first collection only: the construction of the second follows the same procedure. Denoting the columns of each matrix in the natural way, we set
(45) |
and define
(46) |
and
(47) |
for the remaining columns.
The steps in (45), (46) and (47) constitute the well-known Gram-Schmidt process. Thus, the resulting set of vectors is orthonormal, i.e., each vector has unit norm and any two distinct vectors are orthogonal. Consequently, the constructed matrices have orthonormal columns.
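For reference, a generic classical Gram-Schmidt routine (not the specific vectors of the construction above; the input matrix is arbitrary) can be sketched as follows.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt: orthonormalize the columns of A in order,
    as in construction steps of the form (45)-(47)."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ A[:, j]) * Q[:, i]   # remove components along earlier columns
        Q[:, j] = v / np.linalg.norm(v)             # normalize
    return Q

# Toy check with an arbitrary full-column-rank matrix (sizes are illustrative).
rng = np.random.default_rng(6)
A = rng.standard_normal((6, 3))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: columns are orthonormal
```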
Finally, we define the vector and the two matrices as:
(48) |
and
(49) |
respectively, for some positive scaling constant.
We also define the diagonal matrix whose diagonal is given by the constructed vector.
Now, by designating
(50) |
as the set of possible tuples of constructed factors, we have
(51) |
where the cardinality bound follows from (37), (38) and (39). We define the set of coefficient matrices as
(52)
(53) |
and we restrict the scaling constant such that
(54) |
in order to guarantee the required energy bound. We make the final note that, due to the Kronecker product, we can express each vectorized coefficient matrix as:
(55) |
We have the following remaining tasks at hand: 1) We must show that the energy of any constructed coefficient matrix is below the prescribed bound. 2) We must derive upper and lower bounds on the distance between any two distinct elements of the constructed set.
We begin with the energy bound:
where the first step follows from the fact that the norm of a Kronecker product is the product of the norms of its factors, and the final step holds due to (54).
We proceed with deriving lower and upper bounds on the distance between any two distinct elements. For the lower bound, it can be shown that the closest pair of elements occurs when the two elements share all but one of their constructed factors. Thus we have the following:
where the step follows from the fact that the Kronecker product of orthogonal bases is an orthogonal basis.
Finally, for the upper bound on the distance between any two distinct elements, we have:
(56)
(57) |
where the last step follows from the triangle inequality. ∎
Proof:
Consider the set defined in Lemma 2, for which the distance bounds derived above and in (57) hold. Consider the matrix LR model in (2). For the i.i.d. samples, consider covariate matrices with independent zero-mean Gaussian entries. According to (2), the observations follow a Bernoulli distribution when conditioned on the covariates. Consider the response vector and covariate tensor defined earlier, and define the mutual information between the observations and the hypothesis index conditioned on the covariate side information. From [20, 23], we have
(58) |
where the summands are the Kullback-Leibler (KL) divergences between the conditional distributions of the responses given the covariates under two distinct coefficient matrices. We evaluate the KL divergence as follows:
(59) |
Now, considering the distribution of the covariates, we take the expectation of (59) with respect to the side information. After the resulting simplifications, we are left with:
(60) |
Define the zero-mean, normally distributed random variable appearing in the expectation in (60); its variance is determined by the covariate variance and the distance between the two coefficient matrices. Its absolute value is a half-normal random variable with mean:
(61) |
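The half-normal mean fact invoked here can be checked numerically; the sketch below assumes the zero-mean case, with an arbitrary illustrative value of the standard deviation, and compares a Monte Carlo estimate against the standard closed-form expression.

```python
import numpy as np

rng = np.random.default_rng(7)

# If Z ~ N(0, sigma^2), then E|Z| = sigma * sqrt(2 / pi), the half-normal mean.
sigma = 1.7
Z = sigma * rng.standard_normal(1_000_000)
print(np.mean(np.abs(Z)), sigma * np.sqrt(2.0 / np.pi))
```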
Plugging (61) and (60) into (58) gives us
where the last step follows from (57). ∎
References
- [1] R. E. Wright, “Logistic regression.” in Reading and Understanding Multivariate Statistics. American Psychological Association, 1995, pp. 217–244.
- [2] J. P. Dumas, M. A. Lodhi, B. A. Taki, W. U. Bajwa, and M. C. Pierce, “Computational endoscopy—a framework for improving spatial resolution in fiber bundle imaging,” Optics Letters, vol. 44, no. 16, pp. 3968–3971, 2019.
- [3] M. A. Lodhi and W. U. Bajwa, “Learning product graphs underlying smooth graph signals,” arXiv preprint arXiv:2002.11277, 2020.
- [4] H. Hung and C.-C. Wang, “Matrix variate logistic regression model with application to EEG data,” Biostatistics, vol. 14, no. 1, pp. 189–202, 2013.
- [5] J. Zhang and J. Jiang, “Rank-optimized logistic matrix regression toward improved matrix data classification,” Neural Computation, vol. 30, no. 2, pp. 505–525, 2018.
- [6] L. P. Barnes and A. Ozgur, “Minimax bounds for distributed logistic regression,” arXiv preprint arXiv:1910.01625, 2019.
- [7] A. R. Zhang, Y. Luo, G. Raskutti, and M. Yuan, “ISLET: Fast and optimal low-rank tensor regression via importance sketching,” SIAM J. on Mathematics of Data Science, vol. 2, no. 2, pp. 444–479, 2020.
- [8] G. Raskutti, M. Yuan, and H. Chen, “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
- [9] D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan, “Logistic regression: The importance of being improper,” in Proc. Conf. On Learning Theory (PMLR), 2018, pp. 167–208.
- [10] F. Abramovich and V. Grinshtein, “High-dimensional classification by sparse logistic regression,” IEEE Trans. on Inf. Theory, vol. 65, no. 5, pp. 3068–3079, 2018.
- [11] J. V. Shi, Y. Xu, and R. G. Baraniuk, “Sparse bilinear logistic regression,” arXiv preprint arXiv:1404.4104, 2014.
- [12] B. An and B. Zhang, “Logistic regression with image covariates via the combination of and Sobolev regularizations,” PLOS One, vol. 15, no. 6, p. e0234975, 2020.
- [13] Q. Berthet and N. Baldin, “Statistical and computational rates in graph logistic regression,” in Proc. Int. Conf. on Artificial Intelligence and Statistics (PMLR), 2020, pp. 2719–2730.
- [14] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE Trans. on Inf. Theory, vol. 62, no. 3, pp. 1501–1515, 2016.
- [15] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds for kronecker-structured dictionary learning,” in Proc. 2016 IEEE Int. Symp. on Inf. Theory (ISIT). IEEE, 2016, pp. 1148–1152.
- [16] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997, pp. 423–435.
- [17] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
- [18] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix completion: Optimal bounds with noise,” The J. of Machine Learning Res. (JMLR), vol. 13, pp. 1665–1697, 2012.
- [19] R. Z. Khas’minskii, “A lower bound on the risks of non-parametric estimates of densities in the uniform metric,” Theory of Probability & Its Applications, vol. 23, no. 4, pp. 794–798, 1979.
- [20] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
- [21] D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
- [22] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learning for tensor data,” IEEE Trans. on Inf. Theory, vol. 64, no. 4, pp. 2706–2726, 2018.
- [23] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Trans. on Inf. Theory, vol. 55, no. 12, pp. 5728–5741, 2009.