Sparse Hanson-Wright Inequality for a Bilinear Form of Sub-Gaussian Variables
Abstract
In this paper, we derive a new version of the Hanson-Wright inequality for a sparse bilinear form of sub-Gaussian variables. Our results generalize previous deviation inequalities that consider either sparse quadratic forms or dense bilinear forms. We apply the new concentration inequality to testing the cross-covariance matrix when data are subject to missingness. Using our results, we can find a threshold value for the correlations that controls the family-wise error rate. Furthermore, we discuss the multiplicative measurement error case for the bilinear form under a boundedness condition.
Keywords: Concentration inequality; covariance matrix; missing data; measurement error; general missing dependency.
1 Introduction
Let $(X_i, Y_i)$, $i = 1, \dots, n$, be pairs of (possibly dependent) sub-Gaussian random variables with mean zero. Suppose $(\delta_i, \varepsilon_i)$ is a pair of random variables that corresponds to $(X_i, Y_i)$ and is independent of $(X_i, Y_i)$. We allow (general) dependency between $\delta_i$ and $\varepsilon_i$. We write the random variables as vectors $X = (X_1, \dots, X_n)^\top$, $Y = (Y_1, \dots, Y_n)^\top$, $\delta = (\delta_1, \dots, \delta_n)^\top$, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$. For a given non-random matrix $A \in \mathbb{R}^{n \times n}$, Rudelson and Vershynin (2013) dealt with a concentration inequality for the quadratic form $X^\top A X$, which is known as the Hanson-Wright inequality. Recently, this result has been extended in two directions. On the one hand, Zhou (2019) included Bernoulli random variables in the quadratic form, i.e., $(X \circ \delta)^\top A (X \circ \delta)$, where $\circ$ is the Hadamard (element-wise) product of two vectors (or matrices). On the other hand, Park et al. (2021) proved that the two sub-Gaussian vectors could be distinct and derived a deviation inequality for the bilinear form $X^\top A Y$.
In this paper, considering the two extensions together, we aim to analyze the concentration of the bilinear form $(X \circ \delta)^\top A (Y \circ \varepsilon)$, where $\delta$ and $\varepsilon$ are vectors of Bernoulli random variables that may be dependent on each other. Furthermore, we generalize the Bernoulli variables to bounded random variables, which is the first such attempt in the literature. Table 1 compares the different types of quadratic and bilinear forms. One may consider converting the bilinear form into a quadratic form by concatenating $X \circ \delta$ and $Y \circ \varepsilon$ into a single vector, as sketched below.
However, we cannot directly apply the previous results (e.g., Theorem 1 in Zhou (2019)) because the entries of the concatenated vector are not independent.
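For concreteness, the reduction can be written as follows; the stacked vector $Z$ and the block matrix $\widetilde{A}$ are our illustrative notation, not symbols from the paper.

```latex
% Stack the two masked vectors into Z \in \mathbb{R}^{2n} and symmetrize A:
%   Z = \bigl( (X \circ \delta)^\top, \; (Y \circ \varepsilon)^\top \bigr)^\top,
%   \widetilde{A} = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}.
% Then the bilinear form is a quadratic form in Z:
(X \circ \delta)^\top A \,(Y \circ \varepsilon) \;=\; \tfrac{1}{2}\, Z^\top \widetilde{A}\, Z .
```

The obstacle is that the $i$-th and $(n+i)$-th coordinates of $Z$, namely $X_i \delta_i$ and $Y_i \varepsilon_i$, may be dependent, so the coordinate independence required by the sparse quadratic-form result fails for $Z$.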
We apply the new version of the Hanson-Wright inequality to estimating a high-dimensional cross-covariance matrix when data are subject to either missingness or measurement errors. We treat the problem as multiple testing of the individual elements of the matrix and propose an estimator that thresholds their sample estimates. To determine the cutoff value, we make use of the extended Hanson-Wright inequality, and thus we can control the family-wise error rate at a desired level.
|  | Identical ($Y = X$) | Distinct ($Y \neq X$) |
|---|---|---|
| Dense | $X^\top A X$ (Rudelson and Vershynin, 2013) | $X^\top A Y$ (Park et al., 2021) |
| Sparse | $(X \circ \delta)^\top A (X \circ \delta)$ (Zhou, 2019) | $(X \circ \delta)^\top A (Y \circ \varepsilon)$ (this paper) |

Table 1: Quadratic and bilinear forms covered by existing Hanson-Wright-type inequalities and by this paper.
The paper is organized as follows. In Section 2, we present the main result of this paper, Theorem 1, and apply it to the problem of testing the cross-covariance matrix in Section 3. In Section 4, we conclude the paper with a brief discussion.
2 Main result
We first define the sub-Gaussian (or $\psi_2$-) norm of a random variable $X$ by
$$ \|X\|_{\psi_2} = \inf\{ t > 0 : \mathbb{E} \exp(X^2 / t^2) \le 2 \}, $$
and $X$ is called sub-Gaussian if its $\psi_2$-norm is finite. For a matrix $A$, its operator norm $\|A\|$ is defined as the square root of the largest eigenvalue of $A^\top A$, and $\|A\|_F$ denotes its Frobenius norm. For a vector $v$, $\|v\|_2$ is its 2-norm, and $\mathrm{diag}(v)$ is the diagonal matrix having the entries of $v$ as its diagonal entries.
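As a numerical illustration of this definition (not part of the formal development), the $\psi_2$-norm of a variable we can sample may be approximated by Monte Carlo and bisection over $t$; the function name, sample size, and tolerances below are our own choices.

```python
import numpy as np

def psi2_norm_mc(sampler, n_mc=200_000, t_hi=50.0, tol=1e-3, seed=0):
    """Monte Carlo approximation of the sub-Gaussian (psi_2) norm:
    the smallest t > 0 such that E[exp(X^2 / t^2)] <= 2."""
    rng = np.random.default_rng(seed)
    x2 = sampler(rng, n_mc) ** 2              # one fixed sample reused for every t
    def mgf_leq_2(t):
        return np.mean(np.exp(x2 / t ** 2)) <= 2.0
    lo, hi = 1e-6, t_hi                       # bisection bracket for t
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf_leq_2(mid):
            hi = mid
        else:
            lo = mid
    return hi

# For X ~ N(0, 1), the exact value under this convention is sqrt(8/3) ≈ 1.63.
print(psi2_norm_mc(lambda rng, n: rng.standard_normal(n)))
```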
We now describe the main theorem of this paper.
Theorem 1.
Let , , be independent pairs of (possibly correlated) random variables satisfying , and , and . Also, suppose , , are independent pairs of (possibly correlated) Bernoulli random variables such that and for , . Assume and are independent for distinct pairs . Then, we have that for every
for some numerical constant . For a matrix , we define .
Note that Theorem 1 is a combination of the results for the diagonal and off-diagonal cases given below.
Lemma 1.
Assume are defined as in Theorem 1. Let be a diagonal matrix. We denote . Then, for , we have
for some numerical constant .
Lemma 2.
Assume are defined as in Theorem 1. Let be a matrix with its diagonals zero. We denote for and . Then, for , we have
for some numerical constant .
A complete proof of the two lemmas, in the spirit of Zhou (2019), is provided in the Appendix. It is noted that our theorem above recovers Theorem 1.1 in Zhou (2019) and Lemma 5 in Park et al. (2021) under their respective settings.
To handle the more general case where we have no information about the means, we provide a concentration inequality for non-centered sub-Gaussian variables.
Theorem 2.
Let , , be independent pairs of (possibly correlated) random variables with , , and , and . Also, suppose , , are independent pairs of (possibly correlated) Bernoulli random variables such that and for , . Assume and are independent for distinct pairs . Then, we have that for every
for some numerical constants . and are defined by
For a matrix , we define . Also, we define and similarly.
3 Application to testing the cross-covariance matrix
A cross-covariance matrix between two random vectors and , with its -th entry being , refers to the off-diagonal block of the covariance matrix of . It is often considered a less important part of the covariance matrix of , , and tends to be over-penalized by shrinkage estimators that favor an invertible estimate. However, it is a crucial statistical summary in some applications. For example, the study of multi-omics data, which aims to explain molecular variation at different molecular levels, has received much attention due to the public availability of biological big data, and the covariation between two different data sources is just as important as that within each data source. The main question here is to find the pairs (or positions in the matrix) of ’s and ’s that exhibit a large degree of association, which can be treated as a hypothesis testing problem:
(1) |
for .
Testing the cross-covariance matrix has not been much explored in the literature. Cai et al. (2019) directly address the problem of estimating the cross-covariance matrix. However, they vaguely assume and consider simultaneous testing of the hypotheses
for . They build Hotelling’s statistics for the individual hypotheses and decide which rows of are nonzero. Hence, the sparsity pattern in Cai et al. (2019) is not the same as the one considered in this paper. Moreover, their method cannot handle missing data.
Since (1) is equivalent to , where , one can also analyze the sample correlation coefficient . Bailey et al. (2019) analyzed a universal thresholding estimator of based on its asymptotic normality. Though they are interested in the estimation of a large correlation matrix, not a cross-correlation matrix directly, their method can be applied to the estimation of the cross-covariance matrix. For example, the proposed estimator would be a matrix with its component , and they aim to find a cutoff value that controls the family-wise error rate (FWER). However, again, if data are subject to missingness, their method is no longer valid.
Here, we address the multiple testing problem for the cross-covariance matrix when some of the data are missing or measured with errors. We apply the modified Hanson-Wright inequalities (Theorems 1–4) in order to choose a threshold value that controls the FWER. More specifically, we derive concentration results for an appropriate cross-covariance estimator in the following form:
under some regularity conditions. We begin with the simplest case where data are completely observed and walk through more complicated cases later.
3.1 Complete data case
We begin with the complete data case.
Assumption 1.
Assume each component of the random vectors and is sub-Gaussian, i.e., it holds for some
(2) |
Let us denote the mean vector and cross-covariance matrix of and as follows:
(3) |
Suppose we observe independent samples of in Assumption 1. Then, the cross-covariance can be estimated by
where , are the sample means corresponding to , . We can analyze the concentration of each component of using Theorem 1 as a special case in which all ’s are equal to 1. We first notice that
where and . Hence, we can rewrite the sample cross-covariance estimator in matrix form as
where , and . Note that and . Then, by Theorem 1, the element-wise deviation inequality for the sample cross-covariance is
for some numerical constant . Putting and using the union bounds, we can get
if for some numerical constants .
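For illustration, a minimal sketch of the resulting thresholding procedure for complete data is given below; the constant `c` and the cutoff rate `sqrt(log(p*q)/n)` are plug-in choices motivated by the union bound above, not the paper's exact calibration.

```python
import numpy as np

def thresholded_cross_cov(X, Y, c=2.0):
    """Elementwise-thresholded sample cross-covariance between X (n x p) and
    Y (n x q); entries below the universal cutoff are set to zero."""
    n, p = X.shape
    q = Y.shape[1]
    Xc = X - X.mean(axis=0)                  # center each column
    Yc = Y - Y.mean(axis=0)
    S = Xc.T @ Yc / n                        # p x q sample cross-covariance
    cutoff = c * np.sqrt(np.log(p * q) / n)  # union-bound style threshold
    return np.where(np.abs(S) >= cutoff, S, 0.0)

# toy usage: one planted association, the rest null
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
Y = rng.standard_normal((500, 30))
Y[:, 0] += 0.8 * X[:, 0]
print(np.count_nonzero(thresholded_cross_cov(X, Y)))   # ideally 1
```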
3.2 Missing data case
For the case where data are subject to missingness, we introduce assumptions on the missing indicators.
Assumption 2.
Each component of the indicator vector corresponding to is 1 if is observed and 0 otherwise. is similarly defined. Their moments are given by
Note that the above assumption does not assume independence between the components of the indicator vector; it allows and , , to be dependent on each other. This implies that multiple components at different positions can be missing together under some dependency structure. Assumption 2 of Park et al. (2021) is equivalent to this, and they called it the general missing dependency assumption (an illustrative way to generate such indicators is sketched after Assumption 3 below). For the missing mechanism, missing completely at random (MCAR) is assumed, that is, is independent of . More generally, MCAR can be stated as follows.
Assumption 3 (Assumption 3 of Park et al. (2021)).
An event that an observation is missing is independent of both observed and unobserved random variables.
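As an illustration of the general missing dependency under MCAR (our construction, not the paper's), one can threshold a latent Gaussian vector drawn independently of the data: the indicators are then dependent within an observation, yet the missing pattern is independent of the observed and unobserved values.

```python
import numpy as np
from scipy.stats import norm

def dependent_mcar_mask(n, p, obs_prob=0.8, rho=0.5, seed=0):
    """n x p matrix of 0/1 indicators (1 = observed).  Entries within a row
    are dependent (equicorrelated latent Gaussian), but the latent draw is
    independent of the data, so the MCAR assumption still holds."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((n, 1))            # shared factor per sample
    idio = rng.standard_normal((n, p))              # idiosyncratic part
    latent = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio
    return (latent <= norm.ppf(obs_prob)).astype(int)

delta = dependent_mcar_mask(1000, 5)
print(delta.mean(axis=0))                            # marginal rates ≈ 0.8
```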
Suppose independent samples are generated from the population model under Assumptions 1, 2, and 3. Each sample consists of the observational data and their missing indicators . However, due to missingness, we can only observe , , for , , and . We can easily check that
From the above observation, we define an estimator of the cross-covariance as follows, which is unbiased for .
(4) |
The last representation is a bilinear form of , and , which are defined below.
Thus, we apply Theorem 2 to and get, for
where are some numerical constants and are defined below.
Putting and applying the union bound over the indices , we can get for some numerical constants
if . A simple calculation leads to
where , , , , , and . The superscripts “J” and “M” stand for joint and marginal, respectively. Then, we conclude that for some numerical constants
if holds.
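A minimal numerical sketch of this missing-data estimator is given below; the plug-in use of empirical joint and marginal observation rates for the corresponding probabilities, and all names, are our illustrative choices rather than the paper's exact formulas.

```python
import numpy as np

def cross_cov_missing(Xobs, Yobs, D, E):
    """IPW-style cross-covariance from partially observed data.
    Xobs (n x p), Yobs (n x q): data with unobserved entries set to 0;
    D, E: matching 0/1 observation indicators."""
    n = Xobs.shape[0]
    pi_joint = D.T @ E / n                  # p x q empirical P(delta_a=1, eps_b=1)
    pi_x = D.mean(axis=0)                   # empirical P(delta_a=1)
    pi_y = E.mean(axis=0)                   # empirical P(eps_b=1)
    cross = Xobs.T @ Yobs / (n * pi_joint)  # corrects the joint observation rate
    mu_x = Xobs.sum(axis=0) / (n * pi_x)    # corrects the marginal rates
    mu_y = Yobs.sum(axis=0) / (n * pi_y)
    return cross - np.outer(mu_x, mu_y)

# toy usage with (here, independent) MCAR masks
rng = np.random.default_rng(2)
n, p, q = 2000, 4, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
Y[:, 0] += X[:, 0]                          # one nonzero cross-covariance
D = (rng.random((n, p)) < 0.8).astype(float)
E = (rng.random((n, q)) < 0.7).astype(float)
print(np.round(cross_cov_missing(X * D, Y * E, D, E), 2))
```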
3.3 Measurement error case
The missing data problem is a special case of the multiplicative measurement error problem in which the multiplicative factor takes only the values 1 or 0. Under a boundedness condition on the factors, we can extend the current framework to the multiplicative measurement error case, as given below.
Theorem 3.
Let , , be independent pairs of (possibly correlated) random variables satisfying , and , and . Also, suppose , , are independent pairs of (possibly correlated) non-negative random variables such that almost surely for some , and . Assume and are independent for distinct pairs . Then, we have that for every
for some numerical constant . is a diagonal matrix with diagonal elements being from . is similarly defined.
The proof is straightforward, and thus we outline the idea behind it and omit the details. For the diagonal part, we need to modify (5) as
For the off-diagonal part, a careful investigation of the proof shows that the missing indicators are conditioned on in the analysis, and the fact that they are Bernoulli random variables is not used until Step 4 in Appendix A.3. The result of Lemma 5 (see Step 4 in Appendix A.3) can be extended to bounded random errors since we can derive
Furthermore, we state the result for the case with non-zero means.
Theorem 4.
Let , , be independent pairs of (possibly correlated) random variables with , , and , and . Also, suppose are pairs of (possibly correlated) non-negative random variables such that almost surely for some , and . Assume and are independent for distinct pairs . Then, we have that for every
for some numerical constants . and are defined by
The rest of the argument is similar to that of Section 3.2. Assume we observe and , . While is an independent copy of in Assumption 1, is no longer a vector of binary variables but an independent copy of in Assumption 4.
Assumption 4.
Each component of is a measurement error corresponding to of , which is a non-negative random variable satisfying almost surely for each . is similarly defined. Their moments are given by
Accordingly, the unbiased estimator for is
In this case, we can derive
where are some numerical constants and are defined below.
Moreover, we can observe
where , , , , , , , and
The superscripts “J” and “M” stand for joint and marginal, respectively.
Repeating the calculation as in the previous section, we conclude that for some numerical constants
if holds.
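Analogously to the missing data case, a sketch of a moment-corrected estimator under multiplicative measurement errors is given below, assuming the error moments E[u_a v_b], E[u_a], and E[v_b] are known; the specific plug-in form and all names are our illustration, not the paper's exact estimator.

```python
import numpy as np

def cross_cov_mult_error(Xobs, Yobs, m_joint, m_x, m_y):
    """Cross-covariance estimate from multiplicatively corrupted data,
    Xobs = X * U and Yobs = Y * V entrywise, with bounded non-negative errors
    (U, V) independent of (X, Y).  Assumed known error moments:
    m_joint[a, b] = E[u_a v_b], m_x[a] = E[u_a], m_y[b] = E[v_b]."""
    n = Xobs.shape[0]
    cross = Xobs.T @ Yobs / (n * m_joint)   # corrects E[x_a u_a y_b v_b]
    mu_x = Xobs.mean(axis=0) / m_x          # corrects E[x_a u_a]
    mu_y = Yobs.mean(axis=0) / m_y
    return cross - np.outer(mu_x, mu_y)

# toy usage: errors uniform on [0.5, 1.5] (bounded, mean 1), independent of the data
rng = np.random.default_rng(3)
n, p, q = 5000, 3, 2
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
Y[:, 1] += X[:, 2]
U = rng.uniform(0.5, 1.5, (n, p))
V = rng.uniform(0.5, 1.5, (n, q))
m_joint = np.ones((p, q))                   # E[u_a v_b] = 1 for independent errors
print(np.round(cross_cov_mult_error(X * U, Y * V, m_joint, np.ones(p), np.ones(q)), 2))
```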
4 Discussion
We have discussed a generalized Hanson-Wright inequality in which a sparse structure and a bilinear form are considered together for the first time. This extension facilitates the analysis of the concentration of the sample (cross-)covariance estimator even when the mean parameters are unknown and some of the data are missing. We apply this result to multiple testing of the cross-covariance matrix.
We further consider a measurement error case that extends the missing data case, though it is limited to bounded random errors. It would be interesting to consider more general multiplicative errors in future work; for example, a truncated normal distribution defined on would be a natural candidate.
References
- Bailey, N., Pesaran, M., and Smith, L. V. (2019). A multiple testing approach to the regularisation of large sample correlation matrices. Journal of Econometrics, 208(2):507–534.
- Cai, T., Cai, T., Liao, K., and Liu, W. (2019). Large-scale simultaneous testing of cross-covariance matrices with applications to PheWAS. Statistica Sinica, 29:983–1005.
- Park, S., Wang, X., and Lim, J. (2021). Estimating high-dimensional covariance and precision matrices under general missing dependence. Electronic Journal of Statistics, 15(2):4868–4915.
- Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18, 9 pp.
- Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge.
- Zhou, S. (2019). Sparse Hanson-Wright inequalities for sub-Gaussian quadratic forms. Bernoulli, 25(3):1603–1639.
Appendix
Appendix A Proof of Theorem 1
A.1 Proof of Lemma 1 (diagonal part)
Lemma 3.
Assume that for some and for given . Then, for any , we have
Proof.
First, we define the sub-exponential (or $\psi_1$-) norm of a random variable $X$ by
$$ \|X\|_{\psi_1} = \inf\{ t > 0 : \mathbb{E} \exp(|X|/t) \le 2 \}. $$
Since the product of sub-Gaussian random variables is sub-exponential, we define . Setting , we get
In the last two inequalities, we use the following observations. First, note that for a sub-exponential variable satisfying , it holds that (see Proposition 2.7.1 of Vershynin (2018))
Second, we use Stirling’s formula for that
If , then and thus we get
Using the above, we have
∎
Then, for and , we have
For the optimal choice of , that is,
we can obtain the concentration bound
Considering instead of in the theorem, we have the same probability bound for , , which concludes the proof.
A.2 Proof of Lemma 2 (off-diagonal part)
Define by
whose expectation is zero due to independence across distinct and . Then, we claim the lemma below:
Lemma 4.
Assume and let be given. For for some numerical constant , we have
The proof is deferred to Appendix A.3. Without loss of generality, we can assume ; otherwise, set . Using Lemma 4, we get for and
For the optimal choice of , that is,
we can obtain the concentration bound
Considering instead of in the theorem, we have the same probability bound for , , which concludes the proof.
A.3 Proof of Lemma 4
This proof follows the logic of the proof of (9) in Zhou (2019). We first decouple the off-diagonal sum .
Step 1. Decoupling
We introduce Bernoulli variables with for any . They are independent of each other and also independent of and . Given , we define by a subvector of at which and by a subvector of at which , respectively. Let be the expectation with respect to a random variable where can be any of in our context, or to denote and together. Define a random variable
Using and Jensen’s inequality with for any , we get
We condition all the variables except on and consider its moment generating function denoted by
Note that can be seen as a linear combination of sub-Gaussian variables , for such that , that is,
conditional on all other variables. Thus, the conditional distribution of is sub-Gaussian satisfying
where . Therefore, we have that for some
(6) |
Taking the expectations on both sides, we get
Step 2. Reduction to normal random variables
Assume that , , and are fixed. Let be given where is i.i.d. from . Replacing by in , we define a random variable by
Due to the property of Gaussian variables, the conditional distribution of follows . Hence, its conditional moment generating function is given by
(7) |
Now, consider as a linear combination of by rewriting it as
where ’s are regarded as random variables and the others as fixed. Then, we can get for some
(8) |
We now derive an upper bound of from Step 1 by using (7) and (8). First, choosing in (7) gives the same term as in (6)
Then, we apply (8) to the right-hand side of the above:
(9) |
where has its -th element if ; , otherwise. Note that is symmetric.
Step 3. Integrating out the normal random variables
For fixed , the quadratic form is identical in distribution to due to the rotational invariance of , where are the eigenvalues of . Note that the eigenvalues satisfy, for any realization of ,
(10) |
Now, using and for , we get from (9) and (10)
for . It is worth noting that the Bernoulli variables are decoupled by .
Step 4. Integrating out the Bernoulli random variables
We now integrate out from by using the following lemma. For , we have
Lemma 5.
For , the conditional expectation satisfies
Proof.
Due to the independence of , we get
Then, we use the local approximation of the exponential function:
To use it, should satisfy, for given and ,
(11) |
which leads to
where we use in the last inequality. A sufficient condition for for any realization of and in (11) is since
Hence, for ,
We apply the same argument as used for to in the last inequality. Note that is also sufficient since
Finally, we observe that
which concludes the proof. ∎
Step 5. Combining things together
Appendix B Proof of Theorem 2
Let us define the centered sub-Gaussian variable . Then, it is easy to check that the bilinear form is decomposed into 8 terms.
We will use Theorem 1 for the bilinear forms , and Hoeffding’s inequality, stated below, for the linear combinations . For , we treat as centered sub-Gaussian variables with -norm at most , in which the sub-Gaussian variables are completely observed. Applying the union bound and combining the results for , we can derive the bound for .
Theorem 5 (Hoeffding’s inequality).
Let , , be mean-zero independent sub-Gaussian variables satisfying , and , , be independent Bernoulli variables. Also, suppose and are independent for all . Let be a given coefficient vector. Then, we have
for some numerical constant . Moreover, we have
for some numerical constant .
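As a simple numerical illustration of the type of bound stated in Theorem 5 (our simulation; the constant in the reference curve is an ad-hoc placeholder, not the theorem's constant), one can compare the empirical tail of a Bernoulli-masked sub-Gaussian linear combination with a sub-Gaussian-shaped reference curve.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 50_000
a = rng.standard_normal(n)                    # fixed coefficient vector
p_obs = 0.7                                   # Bernoulli observation probability

# S = sum_i a_i * delta_i * X_i with X_i ~ N(0, 1), delta_i ~ Bernoulli(p_obs)
X = rng.standard_normal((reps, n))
delta = rng.random((reps, n)) < p_obs
S = (X * delta) @ a

a_norm = np.linalg.norm(a)
for t in np.linspace(0.5, 3.0, 6) * a_norm:
    emp = np.mean(np.abs(S) >= t)                 # empirical tail probability
    ref = 2 * np.exp(-0.5 * (t / a_norm) ** 2)    # sub-Gaussian-shaped reference
    print(f"t = {t:7.2f}   empirical {emp:.5f}   reference {ref:.5f}")
```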