A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression
Abstract
This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.
Index Terms:
logistic regression, low-rank matrix, minimax risk, singular value decomposition.
I Introduction
Logistic Regression (LR) is a statistical model commonly used in machine learning for classification problems [1]. In the simplest terms, LR models the conditional probability that a binary response variable $y \in \{0,1\}$ takes on one of its two possible values, which represent two possible events (such as success/failure or detection/no detection). In regression analysis, the aim is to accurately estimate the model parameters from a set of training data points, each consisting of a covariate sample vector $\mathbf{x}$ and its response $y$. The conventional LR model is as follows:
$\mathbb{P}\left(y=1 \mid \mathbf{x}\right) = \dfrac{\exp\left(\mathbf{x}^{\top}\boldsymbol{\beta} + b\right)}{1 + \exp\left(\mathbf{x}^{\top}\boldsymbol{\beta} + b\right)}$   (1)
where $y$ is the binary response, $\mathbf{x}$ is the known covariate vector, $\boldsymbol{\beta}$ is the deterministic but unknown coefficient vector, and $b \in \mathbb{R}$ is the intercept.
In many practical applications, covariates naturally take the form of two-dimensional arrays. Common examples include images, biological data such as electroencephalography (EEG), fiber bundle imaging [2] and spatio-temporal data [3]. Classical machine learning techniques vectorize such matrix covariates and estimate a coefficient vector. This leads to computational inefficiency and the destruction of the underlying spatial structure of the coefficient matrix, which often carries valuable information [4, 5]. To address these issues, one can model the matrix LR problem as
$\mathbb{P}\left(y=1 \mid \mathbf{X}\right) = \dfrac{\exp\left(\langle\mathbf{X},\mathbf{B}\rangle + b\right)}{1 + \exp\left(\langle\mathbf{X},\mathbf{B}\rangle + b\right)}$   (2)
where $\mathbf{X}$ is the known covariate matrix, $\mathbf{B}$ is the unknown coefficient matrix, $b$ is the intercept, and $\langle\mathbf{X},\mathbf{B}\rangle = \operatorname{trace}(\mathbf{B}^{\top}\mathbf{X})$ denotes the matrix inner product. The model in (2) preserves the matrix structure of, and the valuable information in, the covariate data. In matrix-variate logistic regression analysis, the goal is to find an estimate of the coefficient matrix $\mathbf{B}$.
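To make the model concrete, the following is a minimal sketch of how data could be generated from (2) under the assumptions introduced later in Section II (i.i.d. Gaussian covariate entries and a fixed low-rank coefficient matrix); all dimensions, the rank, the intercept value, and variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): m x n covariates, rank-r coefficient matrix.
m, n, r, num_samples = 8, 6, 2, 500

# Planted rank-r coefficient matrix B = U diag(s) V^T and intercept b.
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
B = U @ np.diag([2.0, 1.0]) @ V.T
b = 0.1

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Draw i.i.d. Gaussian covariate matrices and Bernoulli responses from model (2).
X = rng.standard_normal((num_samples, m, n))
logits = np.einsum('kij,ij->k', X, B) + b   # <X_k, B> + b for each sample k
y = rng.binomial(1, sigmoid(logits))

print(y[:10], logits[:5])
```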
In this paper, we focus on the high-dimensional setting and derive a lower bound on the minimax risk of the matrix LR problem under the assumption that the coefficient matrix $\mathbf{B}$ has rank $r$. We note that the low-rank assumption has been used ubiquitously in regression analysis in prior works [6, 7]. This low-rank structure may arise from the presence of latent variables, such that the model’s intrinsic degrees of freedom are smaller than its extrinsic dimensionality. This allows us to represent the data in a lower-dimensional space and potentially reduce the sample complexity of estimating the parameters. Minimax lower bounds are useful in understanding the fundamental error thresholds of the problem under study and in assessing the performance of existing algorithms. They also provide beneficial insights into the parameters on which an achievable minimax risk might depend.
Whilst a number of works in the literature derive minimax lower bounds for higher-order linear regression problems [8, 7], as well as for vector LR problems [6, 9, 10], to the best of our knowledge, little work has been done on this topic for matrix-variate LR problems. Among the existing works that study the matrix LR problem, many extend the model in (2) by proposing regularized matrix LR formulations in order to obtain rank-optimized or sparse estimates of the coefficient matrix [5, 11]. Recently, works have introduced regularized matrix LR models for inference on image data [12]. Much of the literature develops efficient algorithms and provides empirical results on their performance [5, 11, 12], while some works provide theoretical guarantees for the proposed algorithms [11]. By contrast, Berthet and Baldin [13] model the problem of graph regression as a matrix LR problem in which the coefficient matrix is assumed to be block-sparse, low-rank, and square. Under these assumptions (which are restrictive and not the most general), the authors derive minimax lower bounds on the estimation error.
The three prior works [4, 11, 12] implement algorithms for matrix logistic regression but do not prove sample complexity bounds (upper or lower). In this paper, we derive a minimax lower bound on the estimation error of a low-rank matrix LR model, which yields a bound on the number of samples necessary for estimating the coefficient matrix. Contrary to prior works, we impose minimal assumptions on the coefficient matrix, making our results applicable to a larger class of matrices.
Our minimax lower bound depends on the distribution of the covariates and the energy of the regression matrix. Following previous works that derive minimax lower bounds in the dictionary learning setting [14, 15], we use the standard strategy of lower bounding the minimax risk in parametric estimation problems by the maximum error probability in a multiple hypothesis testing problem and applying Fano’s inequality [16]. We derive a lower bound that is proportional to the degrees of freedom in the rank-$r$ matrix LR problem, showing that the sample complexity scales with the intrinsic degrees of freedom of the rank-$r$ coefficient matrix rather than with its total number of entries, as in the vectorized setting. This result is intuitively pleasing as it coincides with the number of free parameters in the model. We also show that our methods extend readily to the tensor case, i.e., to a coefficient tensor with known low-rank structure.
II Model and Problem Formulation
We use the following notational conventions throughout the paper: scalars, vectors, matrices, and tensors are denoted by distinct typefaces, and $\mathbf{I}$ denotes the identity matrix. We write $\|\cdot\|_F$ for the Frobenius norm; the other vector and matrix norms used below are the standard ones. Given a matrix, we index its columns in the usual way; given a third-order tensor, fixing the indices in the first and second modes yields its frontal slices. The floor of a real number is the greatest integer not exceeding it. We write $\otimes$ for the Kronecker product of two matrices and $\mathrm{vec}(\cdot)$ for the column-wise vectorization of a matrix.
Consider the matrix LR model in (2), in which the Bernoulli responses correspond to independent and identically distributed (i.i.d.) observations. Each covariate matrix has independent, zero-mean, normally distributed entries with common variance. Each response is generated according to (2) from its covariate matrix and a fixed coefficient matrix $\mathbf{B}$ whose energy (Frobenius norm) is upper bounded by a constant.
We consider the case where $\mathbf{B}$ is a rank-$r$ matrix. Specifically, the rank-$r$ singular value decomposition of $\mathbf{B}$ is
$\mathbf{B} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$   (3)
where $\mathbf{U}$ and $\mathbf{V}$ are (tall) singular vector matrices with orthonormal columns, and $\boldsymbol{\Sigma}$ is the diagonal matrix of singular values. Under this low-rank structure of $\mathbf{B}$, we have
$\mathbb{P}\left(y=1 \mid \mathbf{X}\right) = \dfrac{\exp\left(\langle\mathbf{X},\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\rangle + b\right)}{1 + \exp\left(\langle\mathbf{X},\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\rangle + b\right)}.$   (4)
We use Kronecker product properties [17] to express $\mathrm{vec}(\mathbf{B}) = \mathrm{vec}(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top})$ as $(\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(\boldsymbol{\Sigma})$.
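The identity used in this step can be verified numerically; the sketch below (with purely illustrative dimensions) checks that $\mathrm{vec}(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}) = (\mathbf{V}\otimes\mathbf{U})\,\mathrm{vec}(\boldsymbol{\Sigma})$ and, consequently, that the matrix inner product can be written in vectorized form.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 6, 2

# Rank-r SVD factors as in (3): orthonormal-column U, V and diagonal Sigma.
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
Sigma = np.diag(rng.uniform(0.5, 2.0, size=r))
B = U @ Sigma @ V.T

# vec(U Sigma V^T) = (V kron U) vec(Sigma), with column-wise vectorization.
vec = lambda M: M.reshape(-1, order='F')
print(np.allclose(vec(B), np.kron(V, U) @ vec(Sigma)))  # True

# Consequently <X, B> = vec(X)^T (V kron U) vec(Sigma).
X = rng.standard_normal((m, n))
print(np.isclose(np.sum(X * B), vec(X) @ np.kron(V, U) @ vec(Sigma)))  # True
```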
The goal in LR is to find an estimate of $\mathbf{B}$ using the training data. We assume that the true coefficient matrix belongs to the set
(5) |
the open ball of a given radius with respect to the chosen distance metric, which resides in a given parameter space,
(6) |
Thus the true coefficient matrix has rank $r$ and energy bounded by a constant.
Our objective is to find a lower bound on the minimax risk for estimating the coefficient matrix $\mathbf{B}$. The minimax risk is measured through a non-decreasing loss function of the estimation error; here, it is the worst-case mean squared error (MSE) of the best estimator, i.e.,
(7) |
where the covariate tensor collects the covariate matrices of all samples. We point out that the minimax risk in (7) is an inherent property of the LR problem, and a lower bound on it applies to all possible estimators.
III Minimax Lower Bound For Low-Rank Matrix Logistic Regression
The minimax lower bound in the low-rank matrix LR setting, which is based on Fano’s inequality, depends explicitly on the dimensions of the covariate matrices and their distribution, the rank and the upper bound on the energy of the coefficient matrix, the number of samples, and the construction of the multiple hypothesis test set.
The novelty of our work is that we explicitly leverage the rank-$r$ structure of the coefficient matrix in (3), which leads to a hypothesis construction that is structurally different in our setting compared to vector LR [6, 10, 9]. Furthermore, existing low-rank matrix packings, such as those in Negahban and Wainwright [18], are not useful in the LR setting of this paper, since the logistic function in our model makes part of our analysis non-trivial and fundamentally different from such works. In this work, we create the hypothesis set from three constructed sets (one each for $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$), and derive conditions under which all three sets can exist simultaneously. These conditions ensure the existence of a hypothesis set whose elements have a rank-$r$ matrix structure and the additional essential properties noted below. We also show that our constructed hypothesis set and analysis are well suited to the matrix-variate model because they are easily generalizable to the tensor setting for LR.
III-A Main Result
We derive lower bounds on the minimax risk using an argument based on Fano’s inequality [19]. To do so, we first relate the minimax risk in (7) to a multi-way hypothesis testing problem over a family of distinct coefficient matrices that is exponentially large with respect to the dimensions of the matrices, with energy upper bounded by a constant, i.e., matrices residing inside the open ball of fixed radius:
(8) |
where the correct hypothesis is generated uniformly at random from this set.
More specifically, suppose there exists an estimator whose worst-case MSE matches the minimax risk. Such an estimator can be used to solve a multiple hypothesis testing problem by means of a minimum distance decoder. The minimax risk can then be lower bounded by the probability of error in such a multiple hypothesis test. Our main challenge is to further lower bound this probability of error in order to derive lower bounds on the minimax risk.
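To make the reduction concrete, here is a minimal sketch of a minimum distance decoder of the kind formalized in (11) below; the hypothesis set used here is random and purely illustrative, not the packing construction of the proof.

```python
import numpy as np

def min_distance_decode(B_hat, hypotheses):
    """Return the index of the hypothesis closest to the estimate in Frobenius norm."""
    dists = [np.linalg.norm(B_hat - Bk, 'fro') for Bk in hypotheses]
    return int(np.argmin(dists))

# Toy usage: when the estimate lies much closer to the true hypothesis than to any
# other, the decoder recovers the correct index.
rng = np.random.default_rng(2)
hypotheses = [rng.standard_normal((4, 3)) for _ in range(8)]
true_idx = 5
B_hat = hypotheses[true_idx] + 0.01 * rng.standard_normal((4, 3))
print(min_distance_decode(B_hat, hypotheses) == true_idx)
```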
We now state our main result on the minimax risk of the low-rank matrix coefficient estimation problem in LR.
Theorem 1.
Consider the rank-$r$ matrix LR problem in (4) with i.i.d. observations, where the true coefficient matrix belongs to the parameter set defined in Section II. Then, for covariate matrices with i.i.d. zero-mean Gaussian entries, the minimax risk is lower bounded by
(9) |
where the remaining quantities in (9) are constants specified in the proof.
III-B Roadmap for Theorem 1
For notational convenience, we collect the responses of all samples into a response vector and the covariate matrices of all samples into a covariate tensor whose frontal slices are the individual covariate matrices.
For our analysis, we construct a hypothesis set of distinct matrices, as defined in (8). Each hypothesis is a low-rank matrix following the decomposition in (3) and is assembled from three separately constructed sets (one each for $\mathbf{U}$, $\boldsymbol{\Sigma}$, and $\mathbf{V}$). Now consider the mutual information between the observations and the random index of the correct hypothesis. The construction of the hypothesis set is used to provide upper and lower bounds on this mutual information, from which we derive lower bounds on the minimax risk.
To find a lower bound on the mutual information, we first consider the packing set, exponentially large with respect to the matrix dimensions, in which any two distinct hypotheses are separated by a minimum distance, i.e.,
(10) |
for some positive separation parameter. To achieve our desired bounds on the minimax risk, we suppose the existence of an estimator that achieves the minimax risk. We consider the minimum distance decoder
(11) |
which seeks to detect the correct coefficient matrix. For the minimum distance decoder to detect the correct hypothesis, and for the probability of detection error to be small, we require the estimate to lie within half the minimum packing distance of the true coefficient matrix; a detection error might occur when the estimate falls outside this neighborhood. The following bound relates the loss to the probability of error:
(12)
Next, (12) is used to obtain a lower bound on the mutual information via Fano’s inequality, stated below [16]:
(13) |
Secondly, to find an upper bound on the mutual information, we recognize that the LR model produces response variables that are Bernoulli random variables when conditioned on the covariates. On this basis, we evaluate the mutual information conditioned on the covariate tensor. Next, for each hypothesis, consider the conditional probability distribution of the responses given the covariates under that coefficient matrix, and the relative entropy between the distributions produced by any two distinct hypotheses. Due to the convexity of the relative entropy, we upper bound the conditional mutual information as follows [20, 14]:
(14) |
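As a purely numerical illustration of the quantity being bounded, the sketch below estimates the expected KL divergence between the conditional Bernoulli response distributions induced by two coefficient matrices; the matrices, dimensions, and sample size are illustrative and are not the packing construction used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bernoulli_kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Two candidate coefficient matrices and a batch of Gaussian covariates.
m, n, N = 5, 4, 2000
B1, B2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))
X = rng.standard_normal((N, m, n))

# Monte Carlo estimate of E_X[ KL( P(y|X; B1) || P(y|X; B2) ) ], the kind of
# pairwise quantity that appears in the convexity-based bound on the mutual information.
p1 = sigmoid(np.einsum('kij,ij->k', X, B1))
p2 = sigmoid(np.einsum('kij,ij->k', X, B2))
print(np.mean(bernoulli_kl(p1, p2)))
```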
IV Proof of Main Result
The formal proof of Theorem 1 relies on the following lemmas. Lemma 1 introduces the set of vectors, exponentially large with respect to the dimensions, from which we will later construct part of the hypothesis set. The set is constructed as a subset of a hypercube with a required minimum distance between any two distinct elements, and Lemma 1 bounds the probability that this minimum distance requirement is violated.
Lemma 1.
Fix the set cardinality and the minimum distance parameter, and consider the set of vectors whose entries are independent and identically distributed random variables taking values uniformly on a two-point set. The probability that there exists a distinct pair of vectors in the set violating the minimum distance requirement is upper bounded as follows:
(16) |
The above is a direct result of a standard application of McDiarmid’s inequality [21]. Notice, however, that our technique differs from that of [22], because we impose minimum distance conditions in the Hamming metric rather than conditions on inner products between vectors.
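As a quick numerical sketch of the type of construction behind Lemma 1 (with illustrative values of the dimension and set size, and entries drawn from {−1, +1} as an assumed two-point alphabet): pairwise Hamming distances between independent random sign vectors concentrate around half the dimension, which is why a minimum distance requirement fails for some pair only with small probability.

```python
import numpy as np

rng = np.random.default_rng(4)

# Random sign vectors as "generating" vectors: entries i.i.d. uniform on {-1, +1}.
p, L = 200, 50
G = rng.choice([-1.0, 1.0], size=(L, p))

# Pairwise Hamming distances between distinct vectors; each concentrates around p/2,
# so the minimum over all pairs stays close to p/2 with high probability.
ham = np.array([[np.sum(G[i] != G[j]) for j in range(L)] for i in range(L)])
off_diag = ham[~np.eye(L, dtype=bool)]
print(off_diag.min(), p / 2)   # minimum pairwise distance vs. its typical value
```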
Similar to the above, the following corollary introduces a set of matrices from which we will construct the singular vector matrices.
Corollary 1.
Fix the set cardinality and the minimum distance parameter, and consider the set of matrices whose entries are independent and identically distributed random variables taking values uniformly on a two-point set. The probability that there exists a distinct pair of matrices in the set violating the minimum distance requirement is upper bounded as follows:
(17) |
From the results in Lemma 1 and Corollary 1, Lemma 2 derives conditions under which a hypothesis set with a certain set of properties exists, and constructs two types of sets: 1) sets of orthonormal-column matrices (for the left and right singular vectors) built from the “generating” matrices defined in Corollary 1, and 2) a set of diagonal matrices built from the “generating” vectors in Lemma 1. Each hypothesis is then constructed from these sets according to the decomposition in (3). Next, we prove lower and upper bounds on the distance between any two distinct elements of the hypothesis set. The lower bound determines the minimum packing distance between any two hypotheses, whilst the upper bound is used to derive the results in Lemma 3.
Lemma 2.
There exists a collection of matrices, for some choice of the construction parameters, of cardinality
(18) |
such that for any
(19) |
we have
(20) |
Lemma 3 derives an upper bound on the conditional mutual information.
Lemma 3.
The final lemma bounds the probability of detection error, under the conditions mentioned in Section III-B, for the recovery of the correct hypothesis by the minimum distance decoder. The proof follows exactly that of Lemma 8 in [14].
Lemma 4.
Consider the minimum distance decoder in (11) applied to the matrices constructed in Lemma 2, and consider the LR model in (2) with minimax risk assumed to be suitably upper bounded. The minimum distance decoder recovers the correct hypothesis whenever the estimate lies within half the minimum packing distance of the true coefficient matrix, and the probability of detection error of the minimum distance decoder is upper bounded accordingly.
Proof:
Fix the construction parameters such that (19) is satisfied. Suppose there exists an estimator which guarantees the minimax risk. By Lemma 2, there exists a packing set containing distinct rank-$r$ matrices, whose cardinality satisfies (18) and in which the distance between any pair satisfies (10). By Lemma 3, the conditional mutual information is upper bounded by (21), and the probability of detection error is upper bounded by Lemma 4. Substituting (21) and (13) into (15), we have:
(22) |
Lastly, choosing the free scaling parameter appropriately, we achieve the result in (9). ∎
V Discussion and Conclusion
In this paper we provided a minimax lower bound for the low-rank matrix-variate LR problem in the high-dimensional setting. We constructed a packing set of low-rank structured matrices with finite energy. Using the construction, we derived bounds on the conditional mutual information defined in our problem in order to obtain a lower bound on the minimax risk. Compared to the vector case, such as in [6], the result in Theorem 1 shows that the scaling of the lower bound decreases from the total number of entries of the coefficient matrix to its intrinsic degrees of freedom. The result also shows that the lower bound on the minimax risk is proportional to the intrinsic degrees of freedom of the rank-$r$ coefficient matrix and decreases with increasing sample size. This suggests that we can develop algorithms that take advantage of the low-rank structure of the coefficient matrices. Moreover, the result in (9) can be generalized from low-rank matrices to low-rank tensors. Imposing a rank-$r$ decomposition on a coefficient tensor affords the same advantages as those discussed above for low-rank matrices. This low-rank structure is a special case of the well-known canonical polyadic or parallel factors decomposition (CANDECOMP/PARAFAC, or CP) [17], formally defined as
(23) |
where each factor is a column vector along one mode of the tensor, the outer product of one such column from each mode forms a rank-1 tensor, and each rank-1 tensor is weighted by a scalar. The rank-$r$ CP decomposition thus expresses a tensor as a sum of $r$ rank-1 tensors. Equivalent to (23) is the following expression of a CP-structured tensor:
(24) |
where the core tensor is simply a higher-order analogue of the diagonal matrix in (3) (only the elements along its super-diagonal are non-zero), and the mode-wise factor matrices are rank-$r$ matrices with orthonormal columns. Thus, the setup and construction proposed in this paper can be extended to the tensor case, in which the matrix decomposition in (3) is simply a special case with two factor matrices.
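As an illustrative sketch of this structure (all sizes are arbitrary, and the factor matrices are simply random orthonormal-column matrices), the following reconstructs a third-order tensor from a rank-2 CP decomposition in the spirit of (23) and (24).

```python
import numpy as np

rng = np.random.default_rng(5)

def cp_reconstruct(weights, factors):
    """Rebuild a third-order tensor from a rank-R CP decomposition:
    a weighted sum of R rank-1 tensors formed by outer products of factor columns."""
    d1, d2, d3 = (F.shape[0] for F in factors)
    T = np.zeros((d1, d2, d3))
    for rho in range(len(weights)):
        a, b, c = factors[0][:, rho], factors[1][:, rho], factors[2][:, rho]
        T += weights[rho] * np.einsum('i,j,k->ijk', a, b, c)
    return T

# Illustrative rank-2 CP structure with orthonormal factor columns, mirroring (24).
R, dims = 2, (5, 4, 3)
factors = [np.linalg.qr(rng.standard_normal((d, R)))[0] for d in dims]
weights = np.array([3.0, 1.0])
T = cp_reconstruct(weights, factors)
print(T.shape)
```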
VI Appendix
Proof:
Consider the set of vectors defined in Lemma 1. For any two distinct indices, define the vector given by the point-wise product of the corresponding pair of vectors, with entries indexed accordingly. Define also the function
(25) |
We use this notation to denote two vectors that agree in all entries except one.
We require a minimum packing distance
(26) |
The probability that the requirement in (26) is not satisfied is defined as
(27) |
The function defined in (25) satisfies
(28) |
According to McDiarmid’s inequality in [21], applied with the property in (28), we have
(29) |
The probability in (29) holds for any fixed pair of distinct indices. We take a union bound over all distinct pairs and upper bound the probability as:
(30) |
∎
Proof:
Fix the following arbitrary real orthogonal bases: one set of distinct orthogonal bases associated with the left singular vectors, and one set of distinct orthogonal bases associated with the right singular vectors.
Next, consider the following hypercubes or subsets thereof: 1) The set of vectors from Lemma 1:
(31) |
2) Two sets of matrices, one for each of the two singular vector factors, from Corollary 1:
(32) |
and
(33) |
respectively.
From Lemma 1 we have the following bounds on the probability that the minimum distance condition is violated for the set (31):
(34) |
Likewise, from Corollary 1, we have the following bounds on the probability that the minimum distance conditions are violated for the sets (32) and (33), respectively:
(35) |
and
(36) |
We require the sets in (31), (32) and (33), from which the set of coefficient matrices is built, to exist simultaneously. Hence, using a union bound on (34), (35) and (36), we can choose parameters to guarantee the existence of a construction. This is satisfied if the following conditions on the cardinalities of the three sets hold:
(37) |
(38) |
and
(39) |
Note that (37), (38) and (39) are sufficient conditions for the simultaneous existence of sets in (31), (32) and (33), such that the minimum distance condition between any two elements in each set is satisfied.
We proceed with the following steps in order to construct the final set of coefficient matrices. Without loss of generality, we assume that the energy of any coefficient matrix is upper bounded by a constant. We will construct diagonal matrices and matrices with orthonormal columns, all of which will be used to construct each hypothesis. In other words, due to our LR model, any coefficient matrix will have a rank-$r$ singular value decomposition; thus, a bound on the matrix norm of the coefficient matrix gives a bound on the norm of its matrix of singular values.
Firstly, we construct a collection of vectors from the generating vectors in (31), as follows:
(40) |
From (40), the required norm bound on these vectors follows.
Similarly, we construct two collections of matrices from the generating matrices in (32) and (33), respectively. Denoting their columns in the natural way, let the columns be constructed as follows:
(41) |
and
(42) |
Secondly, we construct a sparse vector, element-wise, from the vector obtained above, using the following construction:
(43) |
and we note that
(44) |
We also construct the two collections of matrices with orthonormal columns. We show the construction of the first collection only: the construction of the second follows the same procedure. Denoting the columns of each matrix in the natural way, we set
(45) |
and define
(46) |
and
(47) |
for the remaining columns.
The steps in (45), (46) and (47) constitute the well-known Gram-Schmidt process. Thus, the resulting set of vectors is orthonormal, i.e., each vector has unit norm and any two distinct vectors are orthogonal. Consequently, the constructed matrices have orthonormal columns.
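For reference, a generic classical Gram-Schmidt routine (not the specific vectors of the construction above; the input matrix is arbitrary) can be sketched as follows.

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt: orthonormalize the columns of A in order,
    as in construction steps of the form (45)-(47)."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ A[:, j]) * Q[:, i]   # remove components along earlier columns
        Q[:, j] = v / np.linalg.norm(v)             # normalize
    return Q

# Toy check with an arbitrary full-column-rank matrix (sizes are illustrative).
rng = np.random.default_rng(6)
A = rng.standard_normal((6, 3))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: columns are orthonormal
```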
Finally, we define the vector and the two matrices as:
(48) |
and
(49) |
respectively, for some positive scaling constant.
We also define the diagonal matrix whose diagonal is given by the constructed vector.
Now, by designating
(50) |
as the set of possible tuples of constructed factors, we have
(51) |
where the cardinality bound follows from (37), (38) and (39). We define the set of coefficient matrices as
(52)
(53) |
and we restrict the scaling constant such that
(54) |
in order to guarantee the required energy bound. We make the final note that, due to the Kronecker product, we can express each vectorized coefficient matrix as:
(55) |
We have the following remaining tasks at hand: 1) We must show that the energy of any constructed coefficient matrix is below the prescribed bound. 2) We must derive upper and lower bounds on the distance between any two distinct elements of the constructed set.
We begin with the energy bound:
where the first step follows from the fact that the norm of a Kronecker product is the product of the norms of its factors, and the final step holds due to (54).
We proceed with deriving lower and upper bounds on the distance between any two distinct elements. For the lower bound, it can be shown that the closest pair of elements occurs when the two elements share all but one of their constructed factors. Thus we have the following:
where the step follows from the fact that the Kronecker product of orthogonal bases is an orthogonal basis.
Finally, for the upper bound on the distance between any two distinct elements, we have:
(56)
(57) |
where the last step follows from the triangle inequality. ∎
Proof:
Consider the set defined in Lemma 2, for which the distance bounds derived above and in (57) hold. Consider the matrix LR model in (2). For the i.i.d. samples, consider covariate matrices with independent zero-mean Gaussian entries. According to (2), the observations follow a Bernoulli distribution when conditioned on the covariates. Consider the response vector and covariate tensor defined earlier, and define the mutual information between the observations and the hypothesis index conditioned on the covariate side information. From [20, 23], we have
(58) |
where the summands are the Kullback-Leibler (KL) divergences between the conditional distributions of the responses given the covariates under two distinct coefficient matrices. We evaluate the KL divergence as follows:
(59) |
Now, considering the distribution of the covariates, we take the expectation of (59) with respect to the side information. After the resulting simplifications, we are left with:
(60) |
Define the zero-mean, normally distributed random variable appearing in the expectation in (60); its variance is determined by the covariate variance and the distance between the two coefficient matrices. Its absolute value is a half-normal random variable with mean:
(61) |
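The half-normal mean fact invoked here can be checked numerically; the sketch below assumes the zero-mean case, with an arbitrary illustrative value of the standard deviation, and compares a Monte Carlo estimate against the standard closed-form expression.

```python
import numpy as np

rng = np.random.default_rng(7)

# If Z ~ N(0, sigma^2), then E|Z| = sigma * sqrt(2 / pi), the half-normal mean.
sigma = 1.7
Z = sigma * rng.standard_normal(1_000_000)
print(np.mean(np.abs(Z)), sigma * np.sqrt(2.0 / np.pi))
```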
Plugging (61) and (60) into (58) gives us
where the last step follows from (57). ∎
References
- [1] R. E. Wright, “Logistic regression.” in Reading and Understanding Multivariate Statistics. American Psychological Association, 1995, pp. 217–244.
- [2] J. P. Dumas, M. A. Lodhi, B. A. Taki, W. U. Bajwa, and M. C. Pierce, “Computational endoscopy—a framework for improving spatial resolution in fiber bundle imaging,” Optics Letters, vol. 44, no. 16, pp. 3968–3971, 2019.
- [3] M. A. Lodhi and W. U. Bajwa, “Learning product graphs underlying smooth graph signals,” arXiv preprint arXiv:2002.11277, 2020.
- [4] H. Hung and C.-C. Wang, “Matrix variate logistic regression model with application to EEG data,” Biostatistics, vol. 14, no. 1, pp. 189–202, 2013.
- [5] J. Zhang and J. Jiang, “Rank-optimized logistic matrix regression toward improved matrix data classification,” Neural Computation, vol. 30, no. 2, pp. 505–525, 2018.
- [6] L. P. Barnes and A. Ozgur, “Minimax bounds for distributed logistic regression,” arXiv preprint arXiv:1910.01625, 2019.
- [7] A. R. Zhang, Y. Luo, G. Raskutti, and M. Yuan, “ISLET: Fast and optimal low-rank tensor regression via importance sketching,” SIAM J. on Mathematics of Data Science, vol. 2, no. 2, pp. 444–479, 2020.
- [8] G. Raskutti, M. Yuan, and H. Chen, “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
- [9] D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan, “Logistic regression: The importance of being improper,” in Proc. Conf. On Learning Theory (PMLR), 2018, pp. 167–208.
- [10] F. Abramovich and V. Grinshtein, “High-dimensional classification by sparse logistic regression,” IEEE Trans. on Inf. Theory, vol. 65, no. 5, pp. 3068–3079, 2018.
- [11] J. V. Shi, Y. Xu, and R. G. Baraniuk, “Sparse bilinear logistic regression,” arXiv preprint arXiv:1404.4104, 2014.
- [12] B. An and B. Zhang, “Logistic regression with image covariates via the combination of and Sobolev regularizations,” PLOS One, vol. 15, no. 6, p. e0234975, 2020.
- [13] Q. Berthet and N. Baldin, “Statistical and computational rates in graph logistic regression,” in Proc. Int. Conf. on Artificial Intelligence and Statistics (PMLR), 2020, pp. 2719–2730.
- [14] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE Trans. on Inf. Theory, vol. 62, no. 3, pp. 1501–1515, 2016.
- [15] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds for kronecker-structured dictionary learning,” in Proc. 2016 IEEE Int. Symp. on Inf. Theory (ISIT). IEEE, 2016, pp. 1148–1152.
- [16] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997, pp. 423–435.
- [17] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
- [18] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix completion: Optimal bounds with noise,” The J. of Machine Learning Res. (JMLR), vol. 13, pp. 1665–1697, 2012.
- [19] R. Z. Khas’minskii, “A lower bound on the risks of non-parametric estimates of densities in the uniform metric,” Theory of Probability & Its Applications, vol. 23, no. 4, pp. 794–798, 1979.
- [20] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
- [21] D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
- [22] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learning for tensor data,” IEEE Trans. on Inf. Theory, vol. 64, no. 4, pp. 2706–2726, 2018.
- [23] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Trans. on Inf. Theory, vol. 55, no. 12, pp. 5728–5741, 2009.