
A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression

Batoul Taki, Mohsen Ghassemi, Anand D. Sarwate, and Waheed U. Bajwa
E-mail: {batoul.taki, mohsen.ghassemi, anand.sarwate, waheed.bajwa}@rutgers.edu
This work was supported by the US NSF Grants CCF-1910110 and CCF-1453073. Department of Electrical and Computer Engineering, Rutgers University–New Brunswick, Piscataway, NJ
Abstract

This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold for estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests that the sample complexity of the low-rank matrix logistic regression problem can be lower than that of vectorized logistic regression. The proof techniques utilized in this work also set the stage for the development of minimax lower bounds for tensor-variate logistic regression problems.

Index Terms:
logistic regression, low-rank matrix, minimax risk, singular value decomposition.

I Introduction

Logistic Regression (LR) is a statistical model used commonly in machine learning for classification problems [1]. In the simplest terms, LR seeks to model the conditional probability that a categorical response variable $y_{i}$ takes on one of two possible values, $y_{i}\in\{0,1\}$, representing one of two possible events (such as success/failure or detection/no detection). In regression analysis, the aim is to accurately estimate the model parameters through a set of training data points $\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$, where $\mathbf{x}_{i}$ is the $i^{th}$ covariate sample vector. The conventional LR model is as follows:

$\mathbb{P}_{y|\mathbf{x}}(y_{i}=1|\mathbf{x}_{i})=\frac{1}{1+\exp\left(-(\mathbf{b}^{T}\mathbf{x}_{i}+z)\right)},$  (1)

where $y_{i}\in\{0,1\}$ is the binary response, $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is the known covariate vector, $\mathbf{b}\in\mathbb{R}^{m}$ is the deterministic but unknown coefficient vector, and $z$ is the intercept with $\mathbb{E}[z]=0$.

In many practical applications, covariates naturally take the form of two-dimensional arrays. Common examples include images, biological data such as electroencephalography (EEG), fiber bundle imaging [2] and spatio-temporal data [3]. Classical machine learning techniques vectorize such matrix covariates and estimate a coefficient vector. This leads to computational inefficiency and the destruction of the underlying spatial structure of the coefficient matrix, which often carries valuable information [4, 5]. To address these issues, one can model the matrix LR problem as

$\mathbb{P}_{y|\mathbf{X}}(y_{i}=1|\mathbf{X}_{i})=\frac{1}{1+\exp\left(-(\langle\mathbf{B},\mathbf{X}_{i}\rangle+z)\right)},$  (2)

where $\mathbf{X}_{i}\in\mathbb{R}^{m_{1}\times m_{2}}$ is the known covariate matrix, $\mathbf{B}\in\mathbb{R}^{m_{1}\times m_{2}}$ is the coefficient matrix, and $z$ is the intercept. The model in (2) preserves the matrix structure of, and the valuable information in, the covariate data. In matrix-variate logistic regression analysis, the goal is to find an estimate of the coefficient matrix $\mathbf{B}$.
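To make the data-generating process in (2) concrete, the following short sketch (ours, not from the paper; a minimal Python/NumPy illustration with the hypothetical helper name simulate_matrix_lr) samples Gaussian covariate matrices and draws Bernoulli responses through the logistic link.

```python
import numpy as np

def simulate_matrix_lr(B, n, sigma=1.0, z=0.0, rng=None):
    """Draw n (X_i, y_i) pairs from the matrix LR model in (2).

    Each covariate matrix X_i has i.i.d. N(0, sigma^2) entries, and the
    Bernoulli response y_i has success probability 1 / (1 + exp(-(<B, X_i> + z))).
    """
    rng = np.random.default_rng(rng)
    m1, m2 = B.shape
    X = rng.normal(0.0, sigma, size=(n, m1, m2))             # covariate tensor
    logits = np.tensordot(X, B, axes=([1, 2], [0, 1])) + z   # <B, X_i> + z
    p = 1.0 / (1.0 + np.exp(-logits))                        # logistic link
    y = rng.binomial(1, p)                                   # Bernoulli responses
    return X, y

# Example: a rank-2 coefficient matrix of size 8 x 6.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(8, 2)), rng.normal(size=(6, 2))
B = U @ V.T
X, y = simulate_matrix_lr(B, n=500, rng=rng)
print(X.shape, y.mean())
```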

In this paper, we focus on the high-dimensional setting where $n\ll m_{1}m_{2}$ and derive a lower bound on the minimax risk of the matrix LR problem under the assumption that $\mathbf{B}$ has rank $r\ll\min\{m_{1},m_{2}\}$. We note that the low-rank assumption has been used ubiquitously in regression analysis in prior works [6, 7]. This low-rank structure may arise from the presence of latent variables, such that the model’s intrinsic degrees of freedom are smaller than its extrinsic dimensionality. This allows us to represent the data in a lower-dimensional space and potentially reduce the sample complexity of estimating the parameters. Minimax lower bounds are useful in understanding the fundamental error thresholds of the problem under study, and in assessing the performance of existing algorithms. They also provide beneficial insights into the parameters on which an achievable minimax risk might depend.

While a number of works in the literature derive minimax lower bounds for higher-order linear regression problems [8, 7], as well as vector LR problems [6, 9, 10], to the best of our knowledge, little work has been done on this topic for matrix-variate LR problems. Several existing works do study the matrix LR problem. Many extend the matrix LR problem of (2) by proposing regularized matrix LR models in order to obtain rank-optimized or sparse estimates of the coefficient matrix in (2) [5, 11]. Recently, regularized matrix LR models have been introduced for inference on image data [12]. Much of the literature develops efficient algorithms and provides empirical results on their performance [5, 11, 12], while some works provide theoretical guarantees for the proposed algorithms [11]. By contrast, Berthet and Baldin [13] model the problem of graph regression as a matrix LR problem, where $\mathbf{B}\in\mathbb{R}^{m\times m}$. The matrix $\mathbf{B}$ is assumed to be block-sparse, low-rank, and square. Under these assumptions (which are restrictive and not the most general), the authors derive minimax lower bounds on the estimation error $\left\|\widehat{\mathbf{B}}-\mathbf{B}\right\|_{F}^{2}$.

The three prior works [4, 11, 12] implement algorithms for matrix logistic regression but do not prove sample complexity bounds (upper or lower). In this paper, we derive a minimax lower bound on the error of a low-rank matrix LR model, which gives a bound on the number of samples necessary for estimating $\mathbf{B}$. Contrary to prior works, we impose minimal assumptions on $\mathbf{B}$, making our results applicable to a larger class of matrices.

Our minimax lower bound relies on the distribution of the covariates and the energy of the regression matrix. Following previous works that derive minimax lower bounds in the dictionary learning setting [14, 15], we use the standard strategy of lower bounding the minimax risk in parametric estimation problems by the maximum error probability in a multiple hypothesis testing problem and applying Fano’s inequality [16]. We derive a lower bound that is proportional to the degrees of freedom in the rank-$r$ matrix LR problem, which reduces the sample complexity from $\mathcal{O}(m_{1}m_{2})$ in the vector setting to $\mathcal{O}(r(m_{1}+m_{2}))$. This result is intuitively pleasing as it coincides with the number of free parameters in the model. We also show that our methods extend readily to the tensor case, i.e., when $\underline{\mathbf{B}}$ is a coefficient tensor with some known properties.

II Model and Problem Formulation

We use the following notational conventions throughout the paper: $x$, $\mathbf{x}$, $\mathbf{X}$ and $\underline{\mathbf{X}}$ denote scalars, vectors, matrices and tensors, respectively. We denote by $\mathbf{I}_{m}$ the $m\times m$ identity matrix. Also, $\left\|\mathbf{x}\right\|_{0}$, $\left\|\mathbf{x}\right\|_{2}$ and $\left\|\mathbf{X}\right\|_{F}$ denote the $\ell_{0}$, $\ell_{2}$ and Frobenius norms, respectively. Given a matrix $\mathbf{X}$, $\mathbf{x}_{j}$ is the $j^{th}$ column of $\mathbf{X}$. If $\underline{\mathbf{X}}$ is a third-order tensor, then the matrix $\underline{\mathbf{X}}_{:,:,j}$, obtained by fixing the index in the third mode, is the $j^{th}$ frontal slice of $\underline{\mathbf{X}}$. If $a\in\mathbb{R}$, then $\lfloor a\rfloor$ is the greatest integer less than or equal to $a$. We denote by $\mathbf{A}\otimes\mathbf{C}$ the Kronecker product of matrices $\mathbf{A}$ and $\mathbf{C}$. The function $\mathrm{vec}(\mathbf{X})$ is the column-wise vectorization of matrix $\mathbf{X}$. Finally, $[R]$ is shorthand for $\{1,\dots,R\}$.

Consider the matrix LR model in (2), in which the Bernoulli response $y_{i}\in\{0,1\}$ is the $i^{th}$ response variable among $n$ independent and identically distributed (i.i.d.) observations. The covariate matrix $\mathbf{X}_{i}\in\mathbb{R}^{m_{1}\times m_{2}}$ has independent, normally distributed, zero-mean entries with variance $\sigma^{2}$. The response $y_{i}$ is generated according to (2) from $\mathbf{X}_{i}$ and a fixed coefficient matrix $\mathbf{B}\in\mathbb{R}^{m_{1}\times m_{2}}$ with $\left\|\mathbf{B}\right\|_{F}^{2}$ upper-bounded by a constant.

We consider the case where $\mathbf{B}$ is a rank-$r$ matrix. Specifically, the rank-$r$ singular value decomposition of $\mathbf{B}$ is

$\mathbf{B}=\mathbf{B}_{1}\mathbf{G}\mathbf{B}_{2}^{T},$  (3)

where $\mathbf{B}_{1}\in\mathbb{R}^{m_{1}\times r}$ and $\mathbf{B}_{2}\in\mathbb{R}^{m_{2}\times r}$ are (tall) singular vector matrices with orthonormal columns, and $\mathbf{G}=\mathrm{diag}(\lambda_{1},\cdots,\lambda_{r})\in\mathbb{R}^{r\times r}$ is the matrix of singular values with $\lambda_{i}>0,\forall i\in[r]$. Under this low-rank structure of $\mathbf{B}$, we have

$\mathbb{P}_{y|\mathbf{X}}(y_{i}=1|\mathbf{X}_{i})=\frac{1}{1+\exp\left(-(\langle\mathbf{B}_{1}\mathbf{G}\mathbf{B}_{2}^{T},\mathbf{X}_{i}\rangle+z)\right)}.$  (4)

We use Kronecker product properties [17] to express $\mathbf{b}=\mathrm{vec}(\mathbf{B})$ as $\mathbf{b}=\left(\mathbf{B}_{2}\otimes\mathbf{B}_{1}\right)\mathrm{vec}(\mathbf{G})$.
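The vectorization step above can be sanity-checked numerically; the following minimal NumPy sketch (ours, not the authors') verifies that $\mathrm{vec}(\mathbf{B}_{1}\mathbf{G}\mathbf{B}_{2}^{T})=(\mathbf{B}_{2}\otimes\mathbf{B}_{1})\mathrm{vec}(\mathbf{G})$ when $\mathrm{vec}(\cdot)$ is the column-wise vectorization.

```python
import numpy as np

# Numerical check of the identity: vec(B1 @ G @ B2.T) == kron(B2, B1) @ vec(G),
# with column-major (Fortran-order) vectorization.
rng = np.random.default_rng(1)
m1, m2, r = 5, 4, 2

B1, _ = np.linalg.qr(rng.normal(size=(m1, r)))   # orthonormal columns
B2, _ = np.linalg.qr(rng.normal(size=(m2, r)))
G = np.diag(rng.uniform(0.5, 2.0, size=r))       # positive singular values

vec = lambda M: M.reshape(-1, order="F")         # column-wise vectorization
lhs = vec(B1 @ G @ B2.T)
rhs = np.kron(B2, B1) @ vec(G)
print(np.allclose(lhs, rhs))                     # True
```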

The goal in LR is to find an estimate $\widehat{\mathbf{B}}$ of $\mathbf{B}$ using the training data $\bigl\{\mathbf{X}_{i},y_{i}\bigr\}_{i=1}^{n}$. We assume that the true coefficient matrix $\mathbf{B}$ belongs to the set

$\mathcal{B}_{d}(\mathbf{0})\triangleq\{\mathbf{B}^{\prime}\in\mathcal{P}_{r}:\rho(\mathbf{B}^{\prime},\mathbf{0})<d\},$  (5)

the open ball of radius $d$ with distance metric $\rho=\left\|\cdot\right\|_{F}$, which resides in the parameter space

$\mathcal{P}_{r}\triangleq\Bigl\{\mathbf{B}^{\prime}\in\mathbb{R}^{m_{1}\times m_{2}}:\mathrm{rank}(\mathbf{B}^{\prime})=r,~\mathbf{B}^{\prime}=\mathbf{B}^{\prime}_{1}\mathbf{G}^{\prime}\mathbf{B}^{\prime T}_{2},~\left\|\mathbf{b}^{\prime}_{1,j}\right\|_{2}=\left\|\mathbf{b}^{\prime}_{2,j}\right\|_{2}=1,~\mathbf{b}^{\prime}_{k,j}\perp\mathbf{b}^{\prime}_{k,j^{\prime}},~\forall j,j^{\prime}\in[r],~j\neq j^{\prime},~k\in\{1,2\}\Bigr\}~\cup~\{\mathbf{0}\}.$  (6)

Thus $\mathcal{B}_{d}(\mathbf{0})\subset\mathcal{P}_{r}$ and $\mathbf{B}$ has energy bounded by $\left\|\mathbf{B}\right\|_{F}^{2}<d^{2}$.

Our objective is to find a lower bound on the minimax risk for estimating the coefficient matrix $\mathbf{B}$. We define the non-decreasing function $\phi:\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$, $\phi(t)=t^{2}$, with $\phi(0)=0$, which is applied to the distance $\rho$. The minimax risk is thus defined as the worst-case mean squared error (MSE) of the best estimator, i.e.,

$\varepsilon^{*}=\underset{\widehat{\mathbf{B}}}{\inf}~\underset{\mathbf{B}\in\mathcal{B}_{d}(\mathbf{0})}{\sup}~\mathbb{E}_{\mathbf{y},\underline{\mathbf{X}}^{c}}\biggl\{\left\|\widehat{\mathbf{B}}-\mathbf{B}\right\|_{F}^{2}\biggr\},$  (7)

where $\underline{\mathbf{X}}^{c}$ is the covariate tensor of $n$ samples. We point out that the minimax risk in (7) is an inherent property of the LR problem and holds for all possible estimators.

III Minimax Lower Bound For Low-Rank Matrix Logistic Regression

The minimax lower bounds in the low-rank matrix LR setting that are based on Fano’s inequality depend explicitly on the dimensions of the covariate matrices $\mathbf{X}_{i}$ and their distribution, the rank, the upper bound on the energy of the coefficient matrix $\mathbf{B}$, the number of samples $n$, and the construction of the multiple hypothesis test set.

The novelty of our work is that we explicitly leverage the rank-$r$ structure of the coefficient matrix $\mathbf{B}$ in (3), which leads to a hypothesis construction that is structurally different from that of vector LR [6, 10, 9]. Furthermore, existing low-rank matrix packings, such as those in Negahban and Wainwright [18], are not useful in the LR setting of this paper, since the logistic function in our model makes part of our analysis non-trivial and fundamentally different from such works. In this work, we create a set $\mathcal{B}_{L}$ from three constructed sets (for $\mathbf{B}_{1}$, $\mathbf{B}_{2}$ and $\mathbf{G}$), and derive conditions under which all sets can exist. These conditions ensure the existence of a hypothesis set $\mathcal{B}_{L}$ whose elements have a rank-$r$ matrix structure and the additional essential properties noted below. We also show that our chosen hypothesis construction and analysis are well suited to our matrix-variate model because they generalize easily to the tensor setting for LR.

III-A Main Result

We derive lower bounds on $\varepsilon^{*}$ using an argument based on Fano’s inequality [19]. To do so, we first relate the minimax risk in (7) to a multi-way hypothesis testing problem over a family, exponentially large in the dimensions of the matrices, of $L$ distinct coefficient matrices with energy upper bounded by $d^{2}$, i.e., matrices residing inside the open ball $\mathcal{B}_{d}(\mathbf{0})$ of fixed radius:

$\mathcal{B}_{L}=\{\mathbf{B}_{l}\colon l\in[L]\}\subset\mathcal{B}_{d}(\mathbf{0}),$  (8)

where the correct hypothesis $\mathbf{B}_{l}$ is generated uniformly at random from the set $\mathcal{B}_{L}$.

More specifically, suppose there exists an estimator whose worst-case MSE matches $\varepsilon^{*}$. This estimator can be used to solve a multiple hypothesis testing problem using a minimum distance decoder. The minimax risk can then be lower bounded by the probability of error in such a multiple hypothesis test. Our main challenge is to further lower bound this probability of error in order to derive lower bounds on the minimax risk.

We now state our main result on the minimax risk of the low-rank matrix coefficient estimation problem in LR.

Theorem 1.

Consider the rank-$r$ matrix LR problem in (4) with $n$ i.i.d. observations $\bigl\{\mathbf{X}_{i},y_{i}\bigr\}_{i=1}^{n}$, where the true coefficient matrix satisfies $\left\|\mathbf{B}\right\|_{F}^{2}<d^{2}$. Then, for covariates $\mathrm{vec}(\mathbf{X}_{i})\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{m_{1}m_{2}})$, the minimax risk is lower bounded by

$\varepsilon^{*}\geq\frac{\Bigl[c_{2}\bigl(c_{1}r(m_{1}+m_{2}-2)+c_{1}(r-1)\bigr)-c_{3}\Bigr]-1}{8n\sigma\sqrt{\frac{2}{\pi}}},$  (9)

where $c_{1}=\left(1-\frac{1}{10}\right)^{2}$, $c_{2}=\frac{\log_{2}(e)(\sqrt{2}-1)}{4\sqrt{2}}$ and $c_{3}=\frac{3(\sqrt{2}-1)}{\sqrt{8}}\log_{2}\left(\frac{3}{2}\right)$.

We discuss the implications of (9) in Theorem 1 in Section V.
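As a rough numerical illustration (ours, not part of the paper), the following snippet simply plugs the stated constants $c_{1}$, $c_{2}$, $c_{3}$ into the right-hand side of (9) for sample problem sizes, showing how the bound scales with $r(m_{1}+m_{2})$ and decays as $1/n$.

```python
import numpy as np

def minimax_lb(m1, m2, r, n, sigma):
    """Right-hand side of the lower bound (9), as stated in Theorem 1."""
    c1 = (1 - 1 / 10) ** 2
    c2 = np.log2(np.e) * (np.sqrt(2) - 1) / (4 * np.sqrt(2))
    c3 = (3 * (np.sqrt(2) - 1) / np.sqrt(8)) * np.log2(3 / 2)
    numer = (c2 * (c1 * r * (m1 + m2 - 2) + c1 * (r - 1)) - c3) - 1
    denom = 8 * n * sigma * np.sqrt(2 / np.pi)
    return numer / denom

for n in (100, 1000, 10000):
    print(n, minimax_lb(m1=64, m2=64, r=3, n=n, sigma=1.0))
```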

III-B Roadmap for Theorem 1

For notational convenience, we define the response vector of $n$ samples as $\mathbf{y}=[y_{1},\cdots,y_{n}]$, and the covariate tensor of $n$ samples as $\underline{\mathbf{X}}^{c}\in\mathbb{R}^{m_{1}\times m_{2}\times n}$, where the $i^{th}$ frontal slice, $\underline{\mathbf{X}}_{:,:,i}^{c}$, is the covariate matrix $\mathbf{X}_{i}$ of the $i^{th}$ sample.

For our analysis, we construct a hypothesis set of $L$ distinct matrices, as defined in (8). Each hypothesis $\mathbf{B}_{l}$ is a low-rank matrix following the decomposition in (3) and is composed of three separately constructed sets (for $\mathbf{B}_{1}$, $\mathbf{B}_{2}$, and $\mathbf{G}$). Now, consider $\mathbb{I}(\mathbf{y};l)$, the mutual information between the observations $\mathbf{y}$ and the random index $l$. The construction of the set $\mathcal{B}_{L}$ is used to provide upper and lower bounds on $\mathbb{I}(\mathbf{y};l)$, from which we derive lower bounds on the minimax risk $\varepsilon^{*}$.

To find a lower bound on $\mathbb{I}(\mathbf{y};l)$, we first consider the packing set $\mathcal{B}_{L}$, exponentially large in the dimensions of $\mathbf{B}$, in which any two distinct hypotheses $\mathbf{B}_{l}$ and $\mathbf{B}_{l^{\prime}}$ are separated by a minimum distance, i.e.,

$\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}\geq 8\delta$  (10)

for some positive $\delta$. To achieve our desired bounds on the minimax risk, we require the existence of an estimator producing an estimate $\widehat{\mathbf{B}}$ and achieving the minimax bound $\varepsilon^{*}<\sqrt{\delta}$. We consider the minimum distance decoder

$\widehat{l}(\mathbf{y})\triangleq\underset{l^{\prime}\in[L]}{\operatorname*{arg\,min}}\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2},$  (11)

which seeks to detect the correct coefficient matrix $\mathbf{B}_{l}$. We require the estimate $\widehat{\mathbf{B}}$ to satisfy $\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}<\sqrt{2\delta}$ for the minimum distance decoder to detect $\mathbf{B}_{l}$ and for the probability of detection error to be $\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)=0$. A detection error might occur when $\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}\geq\sqrt{2\delta}$. The following bound relates the loss $\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}$ and the probability of error $\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)$:

$\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)\leq\mathbb{P}\Bigl(\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}\geq\sqrt{2\delta}\Bigr).$  (12)

Next, (12) is used to obtain a lower bound on $\mathbb{I}(\mathbf{y};l)$ using Fano’s inequality, stated below [16]:

$\mathbb{I}(\mathbf{y};l)\geq\left(1-\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)\right)\log_{2}(L)-1\triangleq u_{1}.$  (13)
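A minimum distance decoder of the form (11) is straightforward to express in code; the sketch below (ours, with a hypothetical packing list B_list) returns the index of the hypothesis closest to a given estimate in squared Frobenius norm.

```python
import numpy as np

def min_distance_decoder(B_hat, B_list):
    """Return the index l' minimizing ||B_hat - B_l'||_F^2 over the packing set."""
    dists = [np.linalg.norm(B_hat - B_l, "fro") ** 2 for B_l in B_list]
    return int(np.argmin(dists))

# Toy usage: three candidate matrices, and a noisy estimate of the second one.
rng = np.random.default_rng(2)
B_list = [rng.normal(size=(4, 3)) for _ in range(3)]
B_hat = B_list[1] + 0.01 * rng.normal(size=(4, 3))
print(min_distance_decoder(B_hat, B_list))  # expected: 1
```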

Secondly, to find an upper bound on $\mathbb{I}(\mathbf{y};l)$, we recognize that the LR model produces response variables $y_{i}$ that are Bernoulli random variables conditioned on $\mathbf{X}_{i}$. On this basis, we evaluate the mutual information conditioned on $\underline{\mathbf{X}}^{c}$, namely $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$. Next, define $f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})$ as the conditional probability distribution of $\mathbf{y}$ given $\underline{\mathbf{X}}^{c}$ with coefficient matrix $\mathbf{B}_{l}$, and let $D_{KL}$ be the relative entropy between any two distributions $f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})$ and $f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c})$ produced from two distinct $\mathbf{B}_{l}$, $\mathbf{B}_{l^{\prime}}$. Due to the convexity of $-\log$, we upper bound $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$ as follows [20, 14]:

$\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq\frac{1}{L^{2}}\sum_{l,l^{\prime}}\mathbb{E}_{\underline{\mathbf{X}}^{c}}D_{KL}(f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})||f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c}))\triangleq u_{2}.$  (14)

Lastly, we remark that from [14] we have $\mathbb{I}(\mathbf{y};l)\leq\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$, and $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$ is therefore also lower bounded by (13). Thus, (13) and (14) give rise to upper and lower bounds on the conditional mutual information:

$u_{1}\leq\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq u_{2},$  (15)

where $u_{1}$ and $u_{2}$ require evaluation, and from which we obtain a lower bound on the minimax risk.

IV Proof of Main Result

The formal proof of Theorem 1 relies on the following lemmas. Lemma 1 introduces the set of vectors, exponentially large in the dimensions, from which we will later construct $\mathbf{G}$. The set is constructed as a subset of an $(r-1)$-dimensional hypercube with a required minimum distance between any two distinct elements in the set. Lemma 1 bounds the probability that this minimum distance is violated.

Lemma 1.

Let $r>0$ and $F\geq 2$. Consider the set of $F$ vectors $\bigl\{\mathbf{s}_{f}\in\mathbb{R}^{r-1}:f\in[F]\bigr\}$, where each entry of the vector $\mathbf{s}_{f}$ is an independent and identically distributed random variable taking values in $\bigl\{-\frac{1}{\sqrt{r-1}},+\frac{1}{\sqrt{r-1}}\bigr\}$ uniformly. The probability that there exists a distinct pair $(f,f^{\prime})$ such that $\left\|\mathbf{s}_{f}-\mathbf{s}_{f^{\prime}}\right\|_{0}<\frac{r-1}{20}$ is upper bounded as follows:

$\mathbb{P}\Bigl(\exists(f,f^{\prime})\in[F]\times[F],f\neq f^{\prime}:\left\|\mathbf{s}_{f}-\mathbf{s}_{f^{\prime}}\right\|_{0}<\frac{r-1}{20}\Bigr)\leq\exp\left[2\log(F)-\log(2)-\frac{1}{2}\left(1-\frac{1}{10}\right)^{2}(r-1)\right].$  (16)

The above is a direct consequence of a standard application of McDiarmid’s inequality [21]. Notice, however, that our technique differs from [22] because we impose minimum distance conditions in the Hamming metric, rather than conditions on the inner product between vectors.
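As a quick empirical illustration (ours, not part of the paper), the following sketch draws random sign vectors as in Lemma 1 and estimates how often some pair violates the Hamming-distance threshold $(r-1)/20$; the observed frequency can be compared against the right-hand side of (16).

```python
import numpy as np
from itertools import combinations

def violation_frequency(r, F, trials=2000, rng=None):
    """Estimate P(some pair of F random sign vectors is closer than (r-1)/20 in Hamming distance)."""
    rng = np.random.default_rng(rng)
    thresh = (r - 1) / 20
    hits = 0
    for _ in range(trials):
        # F vectors with i.i.d. +-1/sqrt(r-1) entries; only the signs matter for the l0 distance.
        S = rng.choice([-1, 1], size=(F, r - 1))
        if any(np.sum(S[f] != S[fp]) < thresh for f, fp in combinations(range(F), 2)):
            hits += 1
    return hits / trials

r, F = 21, 8
bound = np.exp(2 * np.log(F) - np.log(2) - 0.5 * (1 - 0.1) ** 2 * (r - 1))
print(violation_frequency(r, F), "vs bound", bound)
```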

Similar to the above, the following corollary introduces a set of matrices from which we will construct $\mathbf{B}_{1}$ and $\mathbf{B}_{2}$.

Corollary 1.

Let $m,r>0$ and $P\geq 2$. Consider the set of $P$ matrices $\bigl\{\mathbf{S}_{p}\in\mathbb{R}^{(m-1)\times r}:p\in[P]\bigr\}$, where each entry of the matrix $\mathbf{S}_{p}$ is an independent and identically distributed random variable taking values in $\bigl\{-\frac{1}{\sqrt{(m-1)r}},+\frac{1}{\sqrt{(m-1)r}}\bigr\}$ uniformly. The probability that there exists a distinct pair $(p,p^{\prime})$ such that $\left\|\mathbf{S}_{p}-\mathbf{S}_{p^{\prime}}\right\|_{0}<\frac{(m-1)r}{20}$ is upper bounded as follows:

$\mathbb{P}\Bigl(\exists(p,p^{\prime})\in[P]\times[P],p\neq p^{\prime}:\left\|\mathbf{S}_{p}-\mathbf{S}_{p^{\prime}}\right\|_{0}<\frac{(m-1)r}{20}\Bigr)\leq\exp\left[2\log(P)-\log(2)-\frac{1}{2}\left(1-\frac{1}{10}\right)^{2}(m-1)r\right].$  (17)

From the results in Lemma 1 and Corollary 1, Lemma 2 derives conditions on $L$ such that a set $\mathcal{B}_{L}$ with a certain set of properties exists, and constructs: 1) $m_{1}\times r$ and $m_{2}\times r$ matrices with orthonormal columns from the set of “generating” matrices defined in Corollary 1, and 2) $r\times r$ diagonal matrices from the set of “generating” vectors in Lemma 1. Each element $\mathbf{B}_{l}\in\mathcal{B}_{L}$ is constructed from the three sets defined above, according to the decomposition in (3). Next, we prove lower and upper bounds on the distance between any two distinct elements in $\mathcal{B}_{L}$. The lower bound determines the minimum packing distance between any two $\mathbf{B}_{l},\mathbf{B}_{l^{\prime}}$, whilst the upper bound is used to derive the results in Lemma 3.

Lemma 2.

There exists a collection of $L$ matrices $\mathcal{B}_{L}\triangleq\{\mathbf{B}_{l}:l\in[L]\}\subset\mathcal{B}_{d}(\mathbf{0})$, for some $d>0$, of cardinality

$L=2^{\bigl\lfloor\frac{\log_{2}(e)}{4}\left(\left(1-\frac{1}{10}\right)^{2}r(m_{1}+m_{2}-2)+\left(1-\frac{1}{10}\right)^{2}(r-1)\right)-\frac{3}{2}\log_{2}\left(\frac{3}{2}\right)\bigr\rfloor}$  (18)

such that for any

$\sqrt{\frac{8(r-1)}{r}}<\varepsilon\leq d\sqrt{\frac{r-1}{r}},$  (19)

we have

$\frac{r\varepsilon^{2}}{r-1}<\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}\leq 4\frac{r\varepsilon^{2}}{r-1}.$  (20)

Lemma 3 derives an upper bound on $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$.

Lemma 3.

Consider the matrix LR problem given by the model in (2) such that $\mathbf{B}\in\mathcal{B}_{d}(\mathbf{0})$, for some $d>0$, and consider the set $\mathcal{B}_{L}$ defined in Lemma 2. Consider $n$ i.i.d. observations $y_{i}$ that follow a Bernoulli distribution when conditioned on $\mathbf{X}_{i}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$. Then we have

$\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq n\frac{2}{r}\sqrt{\frac{2}{\pi}}\sigma\varepsilon.$  (21)

The final lemma bounds the error probability $\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)$ under the conditions on $\varepsilon^{*}$ and $\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}$, mentioned in Section III-B, for recovery of the hypothesis $\mathbf{B}_{l}$ by the minimum distance decoder. The proof follows exactly that of Lemma 8 in [14].

Lemma 4.

Consider the minimum distance decoder in (11) for the $L$ matrices constructed in Lemma 2. Consider the LR model in (2) with minimax risk $\varepsilon^{*}$, which is assumed to be upper bounded by $\varepsilon^{*}\leq\sqrt{\delta}$. The minimum distance decoder recovers the correct hypothesis if $\left\|\widehat{\mathbf{B}}-\mathbf{B}_{l}\right\|_{F}^{2}<\sqrt{2\delta}$. The detection error of the minimum distance decoder is upper bounded by $\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)\leq\frac{1}{\sqrt{2}}$.

Proof:

Fix $r$, let $\varepsilon>0$ satisfy (19), and define $\varepsilon^{2}\triangleq\frac{8\delta(r-1)}{r}$. Suppose there exists an estimator $\widehat{\mathbf{B}}$ which guarantees a risk $\varepsilon^{*}\leq\sqrt{\delta}=\sqrt{\frac{r}{8(r-1)}}\varepsilon$. By Lemma 2, there exists a packing set $\mathcal{B}_{L}$ containing $L$ distinct rank-$r$ matrices, where $L$ satisfies (18) and, for any pair $\mathbf{B}_{l},\mathbf{B}_{l^{\prime}}\in\mathcal{B}_{L}$, the distance $\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}$ satisfies (10). By Lemma 3, the conditional mutual information $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$ is upper bounded by (21), and $\mathbb{P}(\widehat{l}(\mathbf{y})\neq l)$ is upper bounded by Lemma 4. Substituting (21) and (13) into (15), we have:

$\frac{\sqrt{2}-1}{\sqrt{2}}\log_{2}L-1\leq\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq n\sigma\frac{2}{r}\sqrt{\frac{2}{\pi}}\varepsilon.$  (22)

Lastly, setting $\varepsilon^{*}=\sqrt{\frac{r}{8(r-1)}}\varepsilon$ yields the result in (9). ∎

V Discussion and Conclusion

In this paper we provided a minimax lower bound for the low-rank matrix-variate LR problem in the high-dimensional setting. We constructed a packing set of low-rank structured matrices with finite energy. Using the construction, we derived bounds on the conditional mutual information defined in our problem, in order to obtain a lower bound on the minimax risk. Compared to the vector case, such as in [6], the result in Theorem 1 shows a decrease in the lower bound from $\mathcal{O}(m_{1}m_{2})$ to $\mathcal{O}(r(m_{1}+m_{2}))$. The result also shows that the lower bound on the minimax risk is proportional to the intrinsic degrees of freedom of the matrix LR coefficient (i.e., $r(m_{1}+m_{2}+1)$), and decreases with increasing sample size $n$. This suggests that we can develop algorithms that take advantage of the low-rank structure of the coefficient matrices. Moreover, the result in (9) can be generalized from low-rank matrices to low-rank tensors. Imposing a rank-$r$ decomposition on a coefficient tensor $\underline{\mathbf{B}}$ offers the same advantages as those discussed above for low-rank matrices. This low-rank structure on $\underline{\mathbf{B}}$ is a special case of the well-known Canonical Polyadic or Parallel Factors decomposition (CANDECOMP/PARAFAC, or CP) [17], formally defined as

$\underline{\mathbf{B}}:=\sum\limits_{h=1}^{r}\lambda_{h}\mathbf{b}_{1}^{(h)}\circ\cdots\circ\mathbf{b}_{K}^{(h)},$  (23)

where $\mathbf{b}_{k}^{(h)}\in\mathbb{R}^{m_{k}}$ is a column vector along the $k^{th}$ mode of $\underline{\mathbf{B}}$, $\circ$ is the outer product, and $\lambda_{h}\mathbf{b}_{1}^{(h)}\circ\cdots\circ\mathbf{b}_{K}^{(h)}$, for each $h$, is a rank-1 tensor weighted by $\lambda_{h}$. The rank-$r$ CP decomposition expresses a tensor as a sum of $r$ rank-1 tensors. Equivalent to (23) is the following expression of a CP-structured tensor:

$\underline{\mathbf{B}}=\underline{\mathbf{G}}\times_{1}\mathbf{B}_{1}\times_{2}\cdots\times_{K}\mathbf{B}_{K},$  (24)

where the tensor $\underline{\mathbf{G}}\in\mathbb{R}^{\overbrace{r\times\cdots\times r}^{K\text{-times}}}$ is simply a higher-order analogue of the diagonal matrix $\mathbf{G}$ in (3) (only the elements along the super-diagonal of $\underline{\mathbf{G}}$ are non-zero), and the mode-$k$ factor matrices $\mathbf{B}_{k}\in\mathbb{R}^{m_{k}\times r},~\forall k\in[K]$, are rank-$r$ matrices with orthonormal columns. Thus, the setup and construction proposed in this paper can be extended to the tensor case, where $\mathbf{B}$ is simply a special case of $\underline{\mathbf{B}}$ with $K=2$ factor matrices.
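For intuition, a CP-structured tensor like (23) can be assembled directly from its factor vectors; the NumPy sketch below (ours, for a third-order example) builds a rank-$r$ tensor as a weighted sum of outer products and checks it against the equivalent multilinear form in (24).

```python
import numpy as np

# Build a rank-r CP tensor of order K = 3 as in (23): sum_h lambda_h b1^(h) o b2^(h) o b3^(h).
rng = np.random.default_rng(3)
m, r = (5, 4, 3), 2
factors = [np.linalg.qr(rng.normal(size=(mk, r)))[0] for mk in m]  # orthonormal columns
lam = rng.uniform(0.5, 2.0, size=r)

B = np.zeros(m)
for h in range(r):
    B += lam[h] * np.einsum("i,j,k->ijk", factors[0][:, h], factors[1][:, h], factors[2][:, h])

# Equivalent form (24): super-diagonal core G multiplied along each mode by the factor matrices.
G = np.zeros((r, r, r))
G[np.arange(r), np.arange(r), np.arange(r)] = lam
B_alt = np.einsum("abc,ia,jb,kc->ijk", G, *factors)
print(np.allclose(B, B_alt))  # True
```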

VI Appendix

Proof:

Consider the set of $F$ vectors $\{\mathbf{s}_{f}\in\mathbb{R}^{r-1}:f\in[F]\}$ in Lemma 1. For indices $f,f^{\prime}\in[F]$, $f\neq f^{\prime}$, define the vector $\widetilde{\mathbf{s}}\triangleq\mathbf{s}_{f}\odot\mathbf{s}_{f^{\prime}}$ as the point-wise product of $\mathbf{s}_{f}$ and $\mathbf{s}_{f^{\prime}}$, and $\widetilde{s}_{i}$ as the $i^{th}$ entry of $\widetilde{\mathbf{s}}$, for $i\in[r-1]$. Define also the function

$h(\widetilde{\mathbf{s}})\triangleq(r-1)\sum\limits_{i=1}^{r-1}\widetilde{s}_{i}.$  (25)

We write $\widetilde{\mathbf{s}}\approx\widetilde{\mathbf{s}}^{\prime}$ to say that $\widetilde{s}_{i}=\widetilde{s}^{\prime}_{i}$ for all entries $i\in[r-1]$ except one.

We require a minimum packing distance

$\left\|\mathbf{s}_{f}-\mathbf{s}_{f^{\prime}}\right\|_{0}=\frac{1}{2}\left[(r-1)-(r-1)\sum\limits_{i=1}^{r-1}\widetilde{s}_{i}\right]>\frac{r-1}{20}.$  (26)

The probability that the requirement in (26) is not satisfied is

$\mathbb{P}\left((r-1)\sum_{i=1}^{r-1}\widetilde{s}_{i}\geq(r-1)\Bigl(1-\frac{1}{10}\Bigr)\right).$  (27)

The function h()h(\cdot) satisfies

$\underset{\widetilde{\mathbf{s}}\approx\widetilde{\mathbf{s}}^{\prime}}{\sup}\lvert h(\widetilde{\mathbf{s}})-h(\widetilde{\mathbf{s}}^{\prime})\rvert=\Bigl\lvert(r-1)\frac{1}{r-1}+(r-1)\frac{1}{r-1}\Bigr\rvert=2.$  (28)

According to McDiarmid’s inequality in [21], for $\frac{9(r-1)}{10}>0$, and since $\mathbb{E}_{\widetilde{\mathbf{s}}}[h(\widetilde{\mathbf{s}})]=0$, we have

$\mathbb{P}\biggl((r-1)\sum_{i=1}^{r-1}\widetilde{s}_{i}\geq(r-1)\frac{9}{10}\biggr)\leq\exp\left[-\frac{1}{2}\left(1-\frac{1}{10}\right)^{2}(r-1)\right].$  (29)

The probability in (29) holds for any distinct pair $(f,f^{\prime})\in[F]\times[F]$. We take a union bound over all $\binom{F}{2}$ distinct pairs and upper bound the probability as:

$\mathbb{P}\Bigl(\exists(f,f^{\prime})\in[F]\times[F],f\neq f^{\prime}:(r-1)\sum_{i=1}^{r-1}\widetilde{s}_{i}\geq\frac{9(r-1)}{10}\Bigr)\leq\frac{F^{2}}{2}\exp\left[-\frac{1}{2}\left(1-\frac{1}{10}\right)^{2}(r-1)\right]=\exp\left[2\log(F)-\log(2)-\frac{1}{2}\left(1-\frac{1}{10}\right)^{2}(r-1)\right].$  (30)

∎

Proof:

Fix the following arbitrary real orthogonal bases: $\mathbf{Q}$ of $\mathbb{R}^{r}$, a set of $r$ distinct bases $\bigl\{\mathbf{U}_{1,j}\bigr\}_{j=1}^{r}$ of $\mathbb{R}^{m_{1}}$, and a set of $r$ distinct bases $\bigl\{\mathbf{U}_{2,j}\bigr\}_{j=1}^{r}$ of $\mathbb{R}^{m_{2}}$.

Next, consider the following hypercubes or subsets thereof: 1) the set of $F$ vectors $\{\mathbf{s}_{f}\}$ from Lemma 1:

$\mathbf{s}_{f}\in\biggl\{\frac{-1}{\sqrt{r-1}},\frac{+1}{\sqrt{r-1}}\biggr\}^{r-1}\subset\mathbb{R}^{r-1},$  (31)

2) two sets of $P_{1}$ and $P_{2}$ matrices, from Corollary 1:

$\mathbf{S}_{p_{1}}\in\biggl\{\frac{-1}{\sqrt{(m_{1}-1)r}},\frac{+1}{\sqrt{(m_{1}-1)r}}\biggr\}^{(m_{1}-1)\times r}\subset\mathbb{R}^{(m_{1}-1)\times r},$  (32)

and

$\mathbf{S}_{p_{2}}\in\biggl\{\frac{-1}{\sqrt{(m_{2}-1)r}},\frac{+1}{\sqrt{(m_{2}-1)r}}\biggr\}^{(m_{2}-1)\times r}\subset\mathbb{R}^{(m_{2}-1)\times r},$  (33)

respectively.

From Lemma 1 we have the following bounds on the probability that the minimum distance condition is violated for the set (31):

$\mathbb{P}\biggl(\exists(f,f^{\prime})\in[F]\times[F],f\neq f^{\prime}:\left\|\mathbf{s}_{f}-\mathbf{s}_{f^{\prime}}\right\|_{0}<\frac{r-1}{20}\biggr)\leq\exp\left[\log\Bigl(\frac{F^{2}}{2}\Bigr)-\frac{(r-1)}{2}\left(1-\frac{1}{10}\right)^{2}\right].$  (34)

Likewise, from Corollary 1, we have the following bounds on the probability that the minimum distance conditions are violated for the sets (32) and (33), respectively:

$\mathbb{P}\biggl(\exists(p_{1},p_{1}^{\prime})\in[P_{1}]\times[P_{1}],p_{1}\neq p_{1}^{\prime}:\left\|\mathbf{S}_{p_{1}}-\mathbf{S}_{p_{1}^{\prime}}\right\|_{0}<\frac{(m_{1}-1)r}{20}\biggr)\leq\exp\left[\log\Bigl(\frac{P_{1}^{2}}{2}\Bigr)-\frac{(m_{1}-1)r}{2}\left(1-\frac{1}{10}\right)^{2}\right],$  (35)

and

$\mathbb{P}\biggl(\exists(p_{2},p_{2}^{\prime})\in[P_{2}]\times[P_{2}],p_{2}\neq p_{2}^{\prime}:\left\|\mathbf{S}_{p_{2}}-\mathbf{S}_{p_{2}^{\prime}}\right\|_{0}<\frac{(m_{2}-1)r}{20}\biggr)\leq\exp\left[\log\Bigl(\frac{P_{2}^{2}}{2}\Bigr)-\frac{(m_{2}-1)r}{2}\left(1-\frac{1}{10}\right)^{2}\right].$  (36)

We require the sets in (31), (32) and (33), from which the set of coefficient matrices $\mathcal{B}_{L}$ is built, to exist simultaneously. Hence, using a union bound on (34), (35) and (36), we can choose parameters to guarantee the existence of a construction. This is satisfied if the following conditions on the cardinalities $F$, $P_{1}$ and $P_{2}$ hold:

$0<F<2^{\frac{\log_{2}(e)}{4}\left(1-\frac{1}{10}\right)^{2}(r-1)-\frac{1}{2}\log_{2}(\frac{3}{2})},$  (37)

$0<P_{1}<2^{\frac{\log_{2}(e)}{4}\left(1-\frac{1}{10}\right)^{2}(m_{1}-1)r-\frac{1}{2}\log_{2}(\frac{3}{2})},$  (38)

and

$0<P_{2}<2^{\frac{\log_{2}(e)}{4}\left(1-\frac{1}{10}\right)^{2}(m_{2}-1)r-\frac{1}{2}\log_{2}(\frac{3}{2})}.$  (39)

Note that (37), (38) and (39) are sufficient conditions for the simultaneous existence of sets in (31), (32) and (33), such that the minimum distance condition between any two elements in each set is satisfied.

We proceed with the following steps in order to construct the final set $\mathcal{B}_{L}$ of coefficient matrices. Without loss of generality, we assume that the energy of any $\mathbf{B}_{l}$ is upper bounded by $d^{2}$. We will construct diagonal matrices $\mathbf{G}_{f}$, and matrices with orthonormal columns, namely $\mathbf{B}_{1,p_{1}}$ and $\mathbf{B}_{2,p_{2}}$, all of which will be used to construct each $\mathbf{B}_{l}\in\mathcal{B}_{L}$. In other words, due to our LR model, any matrix $\mathbf{B}_{l}$ will have a rank-$r$ singular value decomposition. Thus, a bound on the matrix norm of $\mathbf{B}_{l}$ gives a bound on the norm of its matrix of singular values.

Firstly, we construct vectors $\mathbf{g}_{f}^{(1)}\in\mathbb{R}^{r}$ for $f\in[F]$, using $\mathbf{Q}$ and $\mathbf{s}_{f}$, as follows:

$\mathbf{g}_{f}^{(1)}=\mathbf{Q}\begin{bmatrix}\sqrt{\frac{1}{r-1}}\\ \mathbf{s}_{f}\end{bmatrix},\quad\forall f\in[F].$  (40)

From (40), since $\left\|\mathbf{s}_{f}\right\|_{2}^{2}=1$, we have:

$\left\|\mathbf{g}_{f}^{(1)}\right\|_{2}^{2}=\left\|\mathbf{Q}\begin{bmatrix}\sqrt{\frac{1}{r-1}}\\ \mathbf{s}_{f}\end{bmatrix}\right\|_{2}^{2}=\frac{r}{r-1}.$

Similarly, we construct matrices $\mathbf{B}_{1,p_{1}}^{(1)}\in\mathbb{R}^{m_{1}\times r}$, for $p_{1}\in[P_{1}]$, and $\mathbf{B}_{2,p_{2}}^{(1)}\in\mathbb{R}^{m_{2}\times r}$, for $p_{2}\in[P_{2}]$, respectively. Define $\mathbf{b}_{1,p_{1},j}^{(1)}$ as the $j^{th}$ column of $\mathbf{B}_{1,p_{1}}^{(1)}$, and $\mathbf{b}_{2,p_{2},j}^{(1)}$ as the $j^{th}$ column of $\mathbf{B}_{2,p_{2}}^{(1)}$. Let the columns be constructed as follows:

$\mathbf{b}_{1,p_{1},j}^{(1)}=\mathbf{U}_{1,j}\begin{bmatrix}1\\ \mathbf{s}_{p_{1},j}\end{bmatrix},\quad\forall p_{1}\in[P_{1}],$  (41)

and

$\mathbf{b}_{2,p_{2},j}^{(1)}=\mathbf{U}_{2,j}\begin{bmatrix}1\\ \mathbf{s}_{p_{2},j}\end{bmatrix},\quad\forall p_{2}\in[P_{2}].$  (42)

From (41) and (42), we have:

$\left\|\mathbf{b}_{1,p_{1},j}^{(1)}\right\|_{2}^{2}=\left\|\mathbf{b}_{2,p_{2},j}^{(1)}\right\|_{2}^{2}=\left\|\begin{bmatrix}1\\ \mathbf{s}_{p_{1},j}\end{bmatrix}\right\|_{2}^{2}=\frac{r+1}{r}.$

Secondly, we construct an $r$-sparse vector $\mathbf{g}_{f}^{(2)}\in\mathbb{R}^{r^{2}}$, element-wise, from $\mathbf{g}_{f}^{(1)}$. Define $g_{f,i}^{(2)}$ as the $i^{th}$ element of $\mathbf{g}_{f}^{(2)}$, where $i\in[r^{2}]$, and use the following construction:

$g_{f,i}^{(2)}=\begin{cases}|g_{f,i^{\prime}}^{(1)}|&i=i^{\prime}+r(i^{\prime}-1),~i^{\prime}\in\{1,\dots,r\}\\ 0&\textrm{otherwise},\end{cases}$  (43)

and we note that

$\left\|\mathbf{g}_{f}^{(2)}\right\|_{2}^{2}=\left\|\mathbf{g}_{f}^{(1)}\right\|_{2}^{2}=\frac{r}{r-1}.$  (44)
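The index map in (43) simply places $|g_{f,i^{\prime}}^{(1)}|$ on the super-diagonal positions of a vectorized $r\times r$ matrix; a short NumPy check of this (ours, not the authors') is given below.

```python
import numpy as np

r = 4
g1 = np.random.default_rng(4).normal(size=r)           # stand-in for g_f^(1)
g2 = np.zeros(r * r)
idx = np.arange(1, r + 1)                               # i' = 1, ..., r
g2[(idx + r * (idx - 1)) - 1] = np.abs(g1)              # 0-based version of i = i' + r(i'-1)
# (43) is exactly the column-wise vectorization of diag(|g_f^(1)|):
print(np.allclose(g2, np.diag(np.abs(g1)).reshape(-1, order="F")))  # True
```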

We also construct matrices $\mathbf{B}_{1,p_{1}}^{(2)}\in\mathbb{R}^{m_{1}\times r}$, for $p_{1}\in[P_{1}]$, and $\mathbf{B}_{2,p_{2}}^{(2)}\in\mathbb{R}^{m_{2}\times r}$, for $p_{2}\in[P_{2}]$. We show the construction of $\mathbf{B}_{1,p_{1}}^{(2)}$ only; the construction of $\mathbf{B}_{2,p_{2}}^{(2)}$ follows the same procedure. Define $\mathbf{b}_{1,p_{1},j}^{(2)}\in\mathbb{R}^{m_{1}}$ as the $j^{th}$ column of $\mathbf{B}_{1,p_{1}}^{(2)}$, for $j\in[r]$. We set

$\mathbf{b}_{1,p_{1},1}^{(2)}=\frac{\mathbf{b}_{1,p_{1},1}^{(1)}}{\left\|\mathbf{b}_{1,p_{1},1}^{(1)}\right\|_{2}},$  (45)

and define

$\mathbf{a}_{j+1}\triangleq\mathbf{b}_{1,p_{1},j+1}^{(1)}-\sum\limits_{j^{\prime}=1}^{j}\langle\mathbf{b}_{1,p_{1},j+1}^{(1)},\mathbf{b}_{1,p_{1},j^{\prime}}^{(2)}\rangle\mathbf{b}_{1,p_{1},j^{\prime}}^{(2)},$  (46)

and

$\mathbf{b}_{1,p_{1},j+1}^{(2)}\triangleq\frac{\mathbf{a}_{j+1}}{\left\|\mathbf{a}_{j+1}\right\|_{2}},$  (47)

for $j\in[r-1]$.

The steps in (45), (46) and (47) constitute the well-known Gram-Schmidt process. Thus, the set of vectors $\mathbf{b}_{1,p_{1},j}^{(2)}$, for $j\in[r]$ and $p_{1}\in[P_{1}]$, is orthonormal, i.e., $\left\|\mathbf{b}_{1,p_{1},j}^{(2)}\right\|_{2}^{2}=1$ and $\mathbf{b}_{1,p_{1},j}^{(2)}\perp\mathbf{b}_{1,p_{1},j^{\prime}}^{(2)}$ for any two distinct $j,j^{\prime}\in[r]$. Consequently, $\bigl(\mathbf{B}_{1,p_{1}}^{(2)}\bigr)^{T}\mathbf{B}_{1,p_{1}}^{(2)}=\bigl(\mathbf{B}_{2,p_{2}}^{(2)}\bigr)^{T}\mathbf{B}_{2,p_{2}}^{(2)}=\mathbf{I}_{r}$.
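As an illustration (ours, not the authors'), the Gram-Schmidt steps in (45)-(47) can be written compactly as follows; the hypothetical function gram_schmidt orthonormalizes the columns of a generating matrix such as $\mathbf{B}_{1,p_{1}}^{(1)}$.

```python
import numpy as np

def gram_schmidt(B):
    """Orthonormalize the columns of B, mirroring the steps in (45)-(47)."""
    m, r = B.shape
    Q = np.zeros((m, r))
    Q[:, 0] = B[:, 0] / np.linalg.norm(B[:, 0])            # step (45)
    for j in range(1, r):
        a = B[:, j] - Q[:, :j] @ (Q[:, :j].T @ B[:, j])    # step (46): remove projections
        Q[:, j] = a / np.linalg.norm(a)                     # step (47): normalize
    return Q

rng = np.random.default_rng(5)
B1 = rng.normal(size=(7, 3))                               # stand-in for B_{1,p_1}^{(1)}
Q = gram_schmidt(B1)
print(np.allclose(Q.T @ Q, np.eye(3)))                     # True: orthonormal columns
```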

Finally, we define the vector $\mathbf{g}_{f}$ and the matrices $\mathbf{B}_{1,p_{1}}$ and $\mathbf{B}_{2,p_{2}}$ as:

$\mathbf{g}_{f}=\frac{\varepsilon}{r}\mathbf{g}_{f}^{(2)},\quad\forall f\in[F],$  (48)

and

$\mathbf{B}_{1,p_{1}}=\mathbf{B}_{1,p_{1}}^{(2)},\quad\mathbf{B}_{2,p_{2}}=\mathbf{B}_{2,p_{2}}^{(2)},$  (49)

respectively, for some positive number $\varepsilon$.

We also define the diagonal matrix $\mathbf{G}_{f}\in\mathbb{R}^{r\times r}$, where $\mathrm{vec}(\mathbf{G}_{f})=\mathbf{g}_{f}$.

Now, by designating

$\mathcal{L}\triangleq\left\{(f,p_{1},p_{2}):f\in[F],p_{1}\in[P_{1}],p_{2}\in[P_{2}]\right\},$  (50)

as the set of possible tuples $(f,p_{1},p_{2})$, we have

$L=|\mathcal{L}|\overset{(a)}{<}2^{\left[\frac{\log_{2}(e)}{4}\left(c_{1}\left(r(m_{1}+m_{2}-2)+(r-1)\right)\right)-\frac{3}{2}\log_{2}\left(\frac{3}{2}\right)\right]},$  (51)

where $(a)$ follows from (37), (38) and (39). We define the set of coefficient matrices $\mathcal{B}_{L}$ as

$\mathcal{B}_{L}\triangleq\biggl\{\mathbf{B}_{l}=\mathbf{B}_{1,p_{1}}\mathbf{G}_{f}\mathbf{B}_{2,p_{2}}^{T}:l\in[L],f\in[F],p_{1}\in[P_{1}],p_{2}\in[P_{2}]\biggr\},$  (52)

and we restrict $\varepsilon$ such that

$\sqrt{\frac{8(r-1)}{r}}<\varepsilon<d\sqrt{\frac{r-1}{r}},$  (54)

in order to guarantee $\left\|\mathbf{G}_{f}\right\|_{F}^{2}<\frac{d^{2}}{r^{2}}$. We make the final note that, due to properties of the Kronecker product, we can express $\mathrm{vec}(\mathbf{B}_{l})$ as:

$\mathrm{vec}(\mathbf{B}_{l})=(\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}})\mathbf{g}_{f}.$  (55)

We have the following remaining tasks at hand: 1) we must show that the energy of any $\mathbf{B}_{l}$ is less than $d^{2}$, and 2) we must derive upper and lower bounds on the distance $\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}$ between any two distinct $\mathbf{B}_{l},\mathbf{B}_{l^{\prime}}\in\mathcal{B}_{L}$.

We begin by showing $\left\|\mathbf{B}_{l}\right\|_{F}^{2}<d^{2}$:

$\left\|\mathbf{B}_{l}\right\|_{F}^{2}=\left\|\left(\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right)\mathbf{g}_{f}\right\|_{2}^{2}\leq\left\|\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right\|_{F}^{2}\left\|\mathbf{g}_{f}\right\|_{2}^{2}\overset{(b)}{=}\left\|\mathbf{B}_{1,p_{1}}\right\|_{F}^{2}\left\|\mathbf{B}_{2,p_{2}}\right\|_{F}^{2}\left\|\mathbf{g}_{f}\right\|_{2}^{2}=\frac{r\varepsilon^{2}}{r-1}\overset{(c)}{<}d^{2},$

where $(b)$ follows from the fact that the Frobenius norm of a Kronecker product is the product of the Frobenius norms, and $(c)$ holds due to (54).

We proceed with deriving lower and upper bounds on $\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}$ for any two distinct $\mathbf{B}_{l},\mathbf{B}_{l^{\prime}}\in\mathcal{B}_{L}$. For the lower bound on $\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}$, it can be shown that the closest pair $\mathbf{B}_{l},\mathbf{B}_{l^{\prime}}$ occurs for $p_{1}=p_{1}^{\prime}$, $p_{2}=p_{2}^{\prime}$ and $f\neq f^{\prime}$. Thus we have the following:

$\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}=\left\|\left(\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right)\mathbf{g}_{f}-\left(\mathbf{B}_{2,p^{\prime}_{2}}\otimes\mathbf{B}_{1,p^{\prime}_{1}}\right)\mathbf{g}_{f^{\prime}}\right\|_{2}^{2}\geq\left\|\left(\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right)\left(\mathbf{g}_{f}-\mathbf{g}_{f^{\prime}}\right)\right\|_{2}^{2}\overset{(d)}{=}\frac{\varepsilon^{2}}{r^{2}}\left\|\mathbf{g}_{f}^{(2)}-\mathbf{g}_{f^{\prime}}^{(2)}\right\|_{2}^{2}>\frac{r\varepsilon^{2}}{r-1}\cdot\frac{4}{20},$  (56)

where $(d)$ follows from the fact that the Kronecker product of matrices with orthonormal columns also has orthonormal columns.

Finally, for the upper bound on $\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}$, we have:

$\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}^{2}\overset{(e)}{\leq}\left[\left\|\left(\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right)\mathbf{g}_{f}\right\|_{F}+\left\|\left(\mathbf{B}_{2,p^{\prime}_{2}}\otimes\mathbf{B}_{1,p^{\prime}_{1}}\right)\mathbf{g}_{f^{\prime}}\right\|_{F}\right]^{2}\leq\left[\left\|\mathbf{B}_{2,p_{2}}\otimes\mathbf{B}_{1,p_{1}}\right\|_{F}\left\|\mathbf{g}_{f}\right\|_{2}+\left\|\mathbf{B}_{2,p^{\prime}_{2}}\otimes\mathbf{B}_{1,p^{\prime}_{1}}\right\|_{F}\left\|\mathbf{g}_{f^{\prime}}\right\|_{2}\right]^{2}=\left[2\left\|\mathbf{B}_{1,p_{1}}\right\|_{F}\left\|\mathbf{B}_{2,p_{2}}\right\|_{F}\left\|\mathbf{g}_{f}\right\|_{2}\right]^{2}\leq 4\frac{r\varepsilon^{2}}{r-1},$  (57)

where $(e)$ follows from the triangle inequality. ∎

Proof:

Consider the set $\mathcal{B}_{L}$ defined in Lemma 2, where the bounds in (56) and (57) hold. Consider the matrix LR model in (2). For $n$ i.i.d. samples, consider covariate matrices $\mathbf{X}_{i}\in\mathbb{R}^{m_{1}\times m_{2}},\forall i\in[n]$, where $\mathrm{vec}(\mathbf{X}_{i})\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{m_{1}m_{2}})$. According to (2), the observations $y_{i}$ follow a Bernoulli distribution when conditioned on $\mathbf{X}_{i}$, $\forall i\in[n]$. Consider $\mathbf{y}$ and $\underline{\mathbf{X}}^{c}$ as defined in Section II, and define $\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})$ as the mutual information between the observations $\mathbf{y}$ and the index $l$ conditioned on the side-information $\underline{\mathbf{X}}^{c}$. From [20, 23], we have

$\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq\frac{1}{L^{2}}\sum_{l,l^{\prime}}\mathbb{E}_{\underline{\mathbf{X}}^{c}}D_{KL}(f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})||f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c})),$  (58)

where $D_{KL}(f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})||f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c}))$ is the Kullback-Leibler (KL) divergence between the conditional probability distributions of $\mathbf{y}$ given $\underline{\mathbf{X}}^{c}$ under $\mathbf{B}_{l}$ and $\mathbf{B}_{l^{\prime}}\in\mathcal{B}_{L}$. Denote $\sigma_{l}\triangleq\frac{1}{1+\exp(-\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle)}$ and $\sigma_{l^{\prime}}\triangleq\frac{1}{1+\exp(-\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle)}$. We evaluate the KL divergence as follows:

$D_{KL}(f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})||f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c}))=\sum\limits_{i\in[n]}\Bigl[\sigma_{l}\log\left(\frac{\sigma_{l}}{\sigma_{l^{\prime}}}\right)+(1-\sigma_{l})\log\left(\frac{1-\sigma_{l}}{1-\sigma_{l^{\prime}}}\right)\Bigr]$

$=\sum\limits_{i\in[n]}\Bigl[\sigma_{l}\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle-\sigma_{l}\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle-\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle+\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle+\log(1+\exp(-\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle))-\log(1+\exp(-\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle))\Bigr].$  (59)

Now, considering the distribution of the covariates $\mathbf{X}_{i},\forall i\in[n]$, we take the expectation of (59) with respect to the side-information $\underline{\mathbf{X}}^{c}$. We have $\mathbb{E}_{\underline{\mathbf{X}}^{c}}\bigl[\log(1+\exp(-\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle))\bigr]=\mathbb{E}_{\underline{\mathbf{X}}^{c}}\bigl[\log(1+\exp(-\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle))\bigr]$ and $\mathbb{E}_{\underline{\mathbf{X}}^{c}}\bigl[\langle\mathbf{B}_{l},\mathbf{X}_{i}\rangle\bigr]=\mathbb{E}_{\underline{\mathbf{X}}^{c}}\bigl[\langle\mathbf{B}_{l^{\prime}},\mathbf{X}_{i}\rangle\bigr]=0$. We are left with:

$\mathbb{E}_{\underline{\mathbf{X}}^{c}}D_{KL}(f_{l}(\mathbf{y}|\underline{\mathbf{X}}^{c})||f_{l^{\prime}}(\mathbf{y}|\underline{\mathbf{X}}^{c}))=\sum\limits_{i\in[n]}\mathbb{E}_{\mathbf{X}}\bigl[\sigma_{l}\langle\mathbf{X}_{i},\mathbf{B}_{l}\rangle-\sigma_{l}\langle\mathbf{X}_{i},\mathbf{B}_{l^{\prime}}\rangle\bigr]=\sum\limits_{i\in[n]}\mathbb{E}_{\mathbf{X}}\bigl[\sigma_{l}\langle\mathbf{X}_{i},\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\rangle\bigr]=\sum\limits_{i\in[n]}\mathbb{E}_{\mathbf{X}}\biggl[\frac{\langle\mathbf{X}_{i},\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\rangle}{1+\exp(-\langle\mathbf{X}_{i},\mathbf{B}_{l}\rangle)}\biggr]\leq n\,\mathbb{E}_{\mathbf{X}}\Bigl[\left|\langle\mathrm{vec}(\mathbf{X}_{i}),\mathrm{vec}(\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}})\rangle\right|\Bigr].$  (60)

Define the random variable $\widetilde{X}\triangleq\langle\mathbf{X}_{i},\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\rangle$. Then $\widetilde{X}$ is a normally distributed random variable with mean $\mu_{\widetilde{X}}=0$ and variance $\sigma_{\widetilde{X}}^{2}=\sigma^{2}\mathrm{vec}(\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}})^{T}\mathbf{I}_{m_{1}m_{2}}\mathrm{vec}(\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}})$. Now, define $\bar{X}\triangleq|\widetilde{X}|$. The random variable $\bar{X}$ follows a half-normal distribution with mean:

$\mathbb{E}_{\bar{X}}[\bar{X}]=\sigma\sqrt{\mathrm{vec}(\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}})^{T}\mathrm{vec}(\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}})}\sqrt{\frac{2}{\pi}}=\sigma\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}\sqrt{\frac{2}{\pi}}.$  (61)

Plugging (60) and (61) into (58) gives us

$\mathbb{I}(\mathbf{y};l|\underline{\mathbf{X}}^{c})\leq n\sigma\left\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\right\|_{F}\sqrt{\frac{2}{\pi}}\overset{(f)}{\leq}n\sigma\sqrt{\frac{4r}{r-1}}\sqrt{\frac{2}{\pi}}\varepsilon,$

where $(f)$ follows from (57). ∎
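As an empirical check of the key step (61) (our illustration, not part of the proof), the snippet below estimates $\mathbb{E}\bigl|\langle\mathbf{X},\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\rangle\bigr|$ by Monte Carlo for Gaussian covariates and compares it to $\sigma\|\mathbf{B}_{l}-\mathbf{B}_{l^{\prime}}\|_{F}\sqrt{2/\pi}$.

```python
import numpy as np

rng = np.random.default_rng(6)
m1, m2, sigma = 6, 5, 1.3
Bl = rng.normal(size=(m1, m2))
Blp = rng.normal(size=(m1, m2))
Delta = Bl - Blp

# Monte Carlo estimate of E|<X, B_l - B_l'>| with vec(X) ~ N(0, sigma^2 I).
X = rng.normal(0.0, sigma, size=(200_000, m1, m2))
inner = np.tensordot(X, Delta, axes=([1, 2], [0, 1]))
empirical = np.abs(inner).mean()

half_normal_mean = sigma * np.linalg.norm(Delta, "fro") * np.sqrt(2 / np.pi)
print(empirical, half_normal_mean)   # the two values should be close
```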

References

  • [1] R. E. Wright, “Logistic regression.” in Reading and Understanding Multivariate Statistics.   American Psychological Association, 1995, pp. 217–244.
  • [2] J. P. Dumas, M. A. Lodhi, B. A. Taki, W. U. Bajwa, and M. C. Pierce, “Computational endoscopy—a framework for improving spatial resolution in fiber bundle imaging,” Optics Letters, vol. 44, no. 16, pp. 3968–3971, 2019.
  • [3] M. A. Lodhi and W. U. Bajwa, “Learning product graphs underlying smooth graph signals,” arXiv preprint arXiv:2002.11277, 2020.
  • [4] H. Hung and C.-C. Wang, “Matrix variate logistic regression model with application to EEG data,” Biostatistics, vol. 14, no. 1, pp. 189–202, 2013.
  • [5] J. Zhang and J. Jiang, “Rank-optimized logistic matrix regression toward improved matrix data classification,” Neural Computation, vol. 30, no. 2, pp. 505–525, 2018.
  • [6] L. P. Barnes and A. Ozgur, “Minimax bounds for distributed logistic regression,” arXiv preprint arXiv:1910.01625, 2019.
  • [7] A. R. Zhang, Y. Luo, G. Raskutti, and M. Yuan, “ISLET: Fast and optimal low-rank tensor regression via importance sketching,” SIAM J. on Mathematics of Data Science, vol. 2, no. 2, pp. 444–479, 2020.
  • [8] G. Raskutti, M. Yuan, and H. Chen, “Convex regularization for high-dimensional multiresponse tensor regression,” The Annals of Statistics, vol. 47, no. 3, pp. 1554–1584, 2019.
  • [9] D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan, “Logistic regression: The importance of being improper,” in Proc. Conf. On Learning Theory (PMLR), 2018, pp. 167–208.
  • [10] F. Abramovich and V. Grinshtein, “High-dimensional classification by sparse logistic regression,” IEEE Trans. on Information Theory, vol. 65, no. 5, pp. 3068–3079, 2018.
  • [11] J. V. Shi, Y. Xu, and R. G. Baraniuk, “Sparse bilinear logistic regression,” arXiv preprint arXiv:1404.4104, 2014.
  • [12] B. An and B. Zhang, “Logistic regression with image covariates via the combination of 1\ell_{1} and Sobolev regularizations,” PLOS One, vol. 15, no. 6, p. e0234975, 2020.
  • [13] Q. Berthet and N. Baldin, “Statistical and computational rates in graph logistic regression,” in Proc. Int. Conf. on Artificial Intelligence and Statistics (PMLR), 2020, pp. 2719–2730.
  • [14] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE Trans. on Inf. Theory, vol. 62, no. 3, pp. 1501–1515, 2016.
  • [15] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds for Kronecker-structured dictionary learning,” in Proc. 2016 IEEE Int. Symp. on Inf. Theory (ISIT).   IEEE, 2016, pp. 1148–1152.
  • [16] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam.   Springer, 1997, pp. 423–435.
  • [17] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.
  • [18] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix completion: Optimal bounds with noise,” The J. of Machine Learning Res. (JMLR), vol. 13, pp. 1665–1697, 2012.
  • [19] R. Z. Khas’minskii, “A lower bound on the risks of non-parametric estimates of densities in the uniform metric,” Theory of Probability & Its Applications, vol. 23, no. 4, pp. 794–798, 1979.
  • [20] T. M. Cover and J. A. Thomas, Elements of Information Theory.   John Wiley & Sons, 2012.
  • [21] D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms.   Cambridge University Press, 2009.
  • [22] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learning for tensor data,” IEEE Trans. on Inf. Theory, vol. 64, no. 4, pp. 2706–2726, 2018.
  • [23] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting,” IEEE Trans. on Inf. Theory, vol. 55, no. 12, pp. 5728–5741, 2009.