
Sparse Hanson-Wright Inequality for a Bilinear Form of Sub-Gaussian Variables

Seongoh Park, School of Mathematics, Statistics and Data Science, Sungshin Women's University, Seoul, Korea; Xinlei Wang, Department of Statistical Science, Southern Methodist University, Dallas, TX, USA; Johan Lim (to whom all correspondence should be addressed; email: johanlim@snu.ac.kr), Department of Statistics, Seoul National University, Seoul, Korea
Abstract

In this paper, we derive a new version of the Hanson-Wright inequality for a sparse bilinear form of sub-Gaussian variables. Our result generalizes previous deviation inequalities that consider either sparse quadratic forms or dense bilinear forms. We apply the new concentration inequality to testing the cross-covariance matrix when data are subject to missingness. Using our results, we can find a threshold value for the correlations that controls the family-wise error rate. Furthermore, we discuss the multiplicative measurement error case for the bilinear form under a boundedness condition.


Keywords: Concentration inequality; covariance matrix; missing data; measurement error; general missing dependency.

1 Introduction

Let $(Z_{1j},Z_{2j})$, $j=1,\ldots,n$, be pairs of (possibly dependent) sub-Gaussian random variables with mean zero. Suppose $\gamma_{ij}$ is a random variable that corresponds to $Z_{ij}$ and is independent of $Z_{ij}$. We allow (general) dependency between $\gamma_{1j}$ and $\gamma_{2j}$. We collect the random variables into vectors $\boldsymbol{Z}_{i}=(Z_{i1},\ldots,Z_{in})^{\top}$, $\boldsymbol{\gamma}_{i}=(\gamma_{i1},\ldots,\gamma_{in})^{\top}$, $i=1,2$. For a given $n\times n$ non-random matrix $\boldsymbol{A}$, Rudelson and Vershynin (2013) derived a concentration inequality for the quadratic form $\boldsymbol{Z}_{1}^{\top}\boldsymbol{A}\boldsymbol{Z}_{1}$, which is known as the Hanson-Wright inequality. Recently, this result has been extended in two directions. On the one hand, Zhou (2019) included Bernoulli random variables $\boldsymbol{\gamma}_{1}$ in the quadratic form, i.e., $(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})$, where $*$ is the Hadamard (element-wise) product of two vectors (or matrices). On the other hand, Park et al. (2021) allowed the sub-Gaussian vectors to be distinct, $\boldsymbol{Z}_{1}\neq\boldsymbol{Z}_{2}$, and derived a deviation inequality for the bilinear form $\boldsymbol{Z}_{1}^{\top}\boldsymbol{A}\boldsymbol{Z}_{2}$.

In this paper, combining the two extensions, we analyze the concentration of the bilinear form $(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})$, where $\boldsymbol{\gamma}_{1},\boldsymbol{\gamma}_{2}$ are vectors of Bernoulli random variables that may be dependent on each other. Furthermore, we generalize the Bernoulli variables to bounded random variables, which is the first such attempt in the literature. Table 1 compares the different types of bilinear forms. One may consider converting the bilinear form into a quadratic form by concatenating $\boldsymbol{Z}_{1},\boldsymbol{Z}_{2}$, i.e.,

$$(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})=(\bar{\boldsymbol{Z}}*\bar{\boldsymbol{\gamma}})^{\top}\begin{bmatrix}\boldsymbol{0}&\boldsymbol{A}/2\\ \boldsymbol{A}/2&\boldsymbol{0}\end{bmatrix}(\bar{\boldsymbol{Z}}*\bar{\boldsymbol{\gamma}}),$$

where $\bar{\boldsymbol{Z}}=(\boldsymbol{Z}_{1}^{\top},\boldsymbol{Z}_{2}^{\top})^{\top}$ and $\bar{\boldsymbol{\gamma}}=(\boldsymbol{\gamma}_{1}^{\top},\boldsymbol{\gamma}_{2}^{\top})^{\top}$. However, we cannot directly apply the previous results (e.g., Theorem 1 in Zhou (2019)) because the entries of $\bar{\boldsymbol{Z}}$ are not independent.
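As a quick sanity check, the concatenation identity above can be verified numerically; the following sketch (ours, with arbitrary dimensions and mask probabilities) is not part of the original argument.

```python
import numpy as np

# Sanity check (ours) of the concatenation identity above.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)
g1, g2 = rng.binomial(1, 0.7, n), rng.binomial(1, 0.7, n)  # Bernoulli masks

bilinear = (Z1 * g1) @ A @ (Z2 * g2)

Zbar, gbar = np.concatenate([Z1, Z2]), np.concatenate([g1, g2])
B = np.block([[np.zeros((n, n)), A / 2], [A / 2, np.zeros((n, n))]])
quadratic = (Zbar * gbar) @ B @ (Zbar * gbar)

assert np.isclose(bilinear, quadratic)
```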

We apply the new version of the Hanson-Wright inequality to estimating the high-dimensional cross-covariance matrix when data are subject to either missingness or measurement errors. We treat the problem as multiple testing of the individual elements of the matrix and propose an estimator that thresholds their sample estimates. To determine the cutoff value, we make use of the extended Hanson-Wright inequality and can thus control the family-wise error rate at a desired level.

|  | Identical ($\boldsymbol{Z}_{1}=\boldsymbol{Z}_{2}$) | Distinct ($\boldsymbol{Z}_{1}\neq\boldsymbol{Z}_{2}$) |
| --- | --- | --- |
| Dense | $\boldsymbol{Z}_{1}^{\top}\boldsymbol{A}\boldsymbol{Z}_{1}$ (Rudelson and Vershynin, 2013) | $\boldsymbol{Z}_{1}^{\top}\boldsymbol{A}\boldsymbol{Z}_{2}$ (Park et al., 2021) |
| Sparse | $(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})$ (Zhou, 2019) | $(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})$ (this paper) |

Table 1: Comparison of bilinear forms in the literature and their references.

The paper is organized as follows. In Section 2, we present the main result for $(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})$, Theorem 1, and we apply it to the problem of testing the cross-covariance matrix in Section 3. In Section 4, we conclude the paper with a brief discussion.

2 Main result

We first define the sub-Gaussian (or $\psi_{2}$-) norm of a random variable $X$ in $\mathbb{R}$ by

$$||X||_{\psi_{2}}=\sup_{p\geq 1}\dfrac{(\mathbb{E}|X|^{p})^{1/p}}{\sqrt{p}},$$

and $X$ is called sub-Gaussian if its $\psi_{2}$-norm is finite. For a matrix $\boldsymbol{A}$, its operator norm $||\boldsymbol{A}||_{2}$ is the square root of the largest eigenvalue of $\boldsymbol{A}^{\top}\boldsymbol{A}$, and its Frobenius norm is $||\boldsymbol{A}||_{F}=\sqrt{\sum_{i,j}a_{ij}^{2}}$. For a vector $\boldsymbol{x}$, $||\boldsymbol{x}||_{2}$ is its Euclidean norm, and $\boldsymbol{D}(\boldsymbol{x})$ is the diagonal matrix whose diagonal entries are given by $\boldsymbol{x}$.
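For concreteness, the matrix and vector quantities above map directly onto standard numerical routines; a minimal sketch (ours):

```python
import numpy as np

# The norms and the diagonal operator used throughout (a direct transcription; ours).
A = np.array([[1.0, 2.0], [0.0, 3.0]])
op_norm = np.linalg.norm(A, 2)        # ||A||_2: sqrt of largest eigenvalue of A^T A
frob_norm = np.linalg.norm(A, "fro")  # ||A||_F = sqrt(sum_{i,j} a_ij^2)
x = np.array([1.0, -2.0])
D = np.diag(x)                        # D(x): diagonal matrix with diagonal entries x
```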

We now state the main theorem of this paper.

Theorem 1.

Let $(Z_{1j},Z_{2j})$, $j=1,\ldots,n$, be independent pairs of (possibly correlated) random variables satisfying $\mathbb{E}Z_{1j}=\mathbb{E}Z_{2j}=0$ and $||Z_{ij}||_{\psi_{2}}\leq K_{i}$ for $i=1,2$ and all $j$. Also, suppose $(\gamma_{1j},\gamma_{2j})$, $j=1,\ldots,n$, are independent pairs of (possibly correlated) Bernoulli random variables such that $\mathbb{E}\gamma_{ij}=\pi_{ij}$ and $\mathbb{E}\gamma_{1j}\gamma_{2j}=\pi_{12,j}$ for $j=1,\ldots,n$, $i=1,2$. Assume $Z_{ij}$ and $\gamma_{i^{\prime}j^{\prime}}$ are independent whenever $(i,j)\neq(i^{\prime},j^{\prime})$. Then, for every $t>0$,

$$\begin{array}{l}{\rm P}\Big[\big|(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\\ \qquad\leq 2\exp\left\{-c\min\left(\dfrac{t^{2}}{K_{1}^{2}K_{2}^{2}||\boldsymbol{A}||_{F,\pi}^{2}},\dfrac{t}{K_{1}K_{2}||\boldsymbol{A}||_{2}}\right)\right\},\end{array}$$

for some numerical constant $c>0$. For a matrix $\boldsymbol{A}=(a_{ij},1\leq i,j\leq n)$, we define $||\boldsymbol{A}||_{F,\pi}=\sqrt{\sum_{j=1}^{n}\pi_{12,j}a_{jj}^{2}+\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}}$.
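The sparse norm $||\boldsymbol{A}||_{F,\pi}$ is a probability-weighted Frobenius norm; a direct transcription (ours, with our own argument names) for concreteness:

```python
import numpy as np

def sparse_frob_norm(A, pi1, pi2, pi12):
    """||A||_{F,pi} of Theorem 1:
    sqrt( sum_j pi12[j]*A[j,j]**2 + sum_{i != j} A[i,j]**2 * pi1[i] * pi2[j] )."""
    diag_part = np.sum(pi12 * np.diag(A) ** 2)
    weighted = A ** 2 * np.outer(pi1, pi2)           # pi1_i * pi2_j weights, all (i, j)
    off_part = weighted.sum() - np.trace(weighted)   # drop the i = j terms
    return np.sqrt(diag_part + off_part)
```

With all $\pi$'s equal to 1, this reduces to the usual Frobenius norm, matching the dense case.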

Note that Theorem 1 is a combination of the results for the diagonal and off-diagonal parts given below.

Lemma 1.

Assume $\boldsymbol{Z}_{1},\boldsymbol{Z}_{2},\boldsymbol{\gamma}_{1},\boldsymbol{\gamma}_{2}$ are defined as in Theorem 1. Let $\boldsymbol{A}_{\rm diag}={\rm diag}(a_{11},\ldots,a_{nn})$ be a diagonal matrix, and denote $\mathbb{E}\gamma_{1j}\gamma_{2j}=\pi_{12,j}$. Then, for $t>0$, we have

$$\begin{array}{l}{\rm P}\Big[\big|(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}_{\rm diag}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}_{\rm diag}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\\ \qquad\leq 2\exp\left\{-c\min\left(\dfrac{t^{2}}{K_{1}^{2}K_{2}^{2}\sum_{j=1}^{n}\pi_{12,j}a_{jj}^{2}},\dfrac{t}{K_{1}K_{2}\max\limits_{1\leq j\leq n}|a_{jj}|}\right)\right\},\end{array}$$

for some numerical constant $c>0$.

Lemma 2.

Assume $\boldsymbol{Z}_{1},\boldsymbol{Z}_{2},\boldsymbol{\gamma}_{1},\boldsymbol{\gamma}_{2}$ are defined as in Theorem 1. Let $\boldsymbol{A}_{\rm off}$ be an $n\times n$ matrix with zero diagonal, and denote $\mathbb{E}\gamma_{ij}=\pi_{ij}$ for $i=1,2$ and $j=1,\ldots,n$. Then, for $t>0$, we have

$$\begin{array}{l}{\rm P}\Big[\big|(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}_{\rm off}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\\ \qquad\leq 2\exp\left\{-c\min\left(\dfrac{t^{2}}{K_{1}^{2}K_{2}^{2}\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}},\dfrac{t}{K_{1}K_{2}||\boldsymbol{A}_{\rm off}||_{2}}\right)\right\},\end{array}$$

for some numerical constant $c>0$.

A complete proof of the two lemmas, in the spirit of Zhou (2019), is provided in the Appendix. Note that the theorem above recovers Theorem 1.1 in Zhou (2019) and Lemma 5 in Park et al. (2021) under their respective settings.

To handle the more general case where no information about the means is available, we provide a concentration inequality for non-centered sub-Gaussian variables.

Theorem 2.

Let $(\tilde{Z}_{1j},\tilde{Z}_{2j})$, $j=1,\ldots,n$, be independent pairs of (possibly correlated) random variables with $\mathbb{E}\tilde{Z}_{1j}=\mu_{1j}$, $\mathbb{E}\tilde{Z}_{2j}=\mu_{2j}$, and $||\tilde{Z}_{ij}-\mu_{ij}||_{\psi_{2}}\leq K_{i}$ for $i=1,2$ and all $j$. Also, suppose $(\gamma_{1j},\gamma_{2j})$, $j=1,\ldots,n$, are independent pairs of (possibly correlated) Bernoulli random variables such that $\mathbb{E}\gamma_{ij}=\pi_{ij}$ and $\mathbb{E}\gamma_{1j}\gamma_{2j}=\pi_{12,j}$ for $j=1,\ldots,n$, $i=1,2$. Assume $\tilde{Z}_{ij}$ and $\gamma_{i^{\prime}j^{\prime}}$ are independent whenever $(i,j)\neq(i^{\prime},j^{\prime})$. Then, for every $t>0$,

$${\rm P}\Big[\big|(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\leq d\exp\left\{-c\min\left(\dfrac{t^{2}}{E_{1}},\dfrac{t}{E_{2}}\right)\right\},$$

for some numerical constants $c,d>0$. $E_{1}$ and $E_{2}$ are defined by

$$\begin{array}{lll}E_{1}=\max\Bigg\{&K_{1}^{2}K_{2}^{2}||\boldsymbol{A}||_{F,\pi}^{2},&V_{1,\pi}^{2}V_{2,\pi}^{2}||\boldsymbol{D}(\boldsymbol{\mu}_{1})\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2})||_{F}^{2},\\ &V_{1,\pi}^{2}K_{2}^{2}||\boldsymbol{D}(\boldsymbol{\mu}_{1}*\boldsymbol{\mu}_{1})^{1/2}\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\pi}_{2})^{1/2}||_{F}^{2},&V_{2,\pi}^{2}K_{1}^{2}||\boldsymbol{D}(\boldsymbol{\pi}_{1})^{1/2}\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2}*\boldsymbol{\mu}_{2})^{1/2}||_{F}^{2},\\ &K_{2}^{2}||\boldsymbol{A}^{\top}(\boldsymbol{\mu}_{1}*\boldsymbol{\pi}_{1})||_{2}^{2},&K_{1}^{2}||\boldsymbol{A}(\boldsymbol{\mu}_{2}*\boldsymbol{\pi}_{2})||_{2}^{2},\\ &V_{1,\pi}^{2}||\{\boldsymbol{A}(\boldsymbol{\mu}_{2}*\boldsymbol{\pi}_{2})\}*\boldsymbol{\mu}_{1}||_{2}^{2},&V_{2,\pi}^{2}||\{\boldsymbol{A}(\boldsymbol{\mu}_{1}*\boldsymbol{\pi}_{1})\}*\boldsymbol{\mu}_{2}||_{2}^{2}\Bigg\},\\[10pt] E_{2}=\max\Bigg\{&K_{1}K_{2}||\boldsymbol{A}||_{2},&V_{1,\pi}V_{2,\pi}||\boldsymbol{D}(\boldsymbol{\mu}_{1})\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2})||_{2},\\ &V_{1,\pi}K_{2}||\boldsymbol{D}(\boldsymbol{\mu}_{1})\boldsymbol{A}||_{2},&V_{2,\pi}K_{1}||\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2})||_{2}\Bigg\}.\end{array}$$

Here $||\boldsymbol{A}||_{F,\pi}$ is defined as in Theorem 1 for a matrix $\boldsymbol{A}=(a_{ij},1\leq i,j\leq n)$. Also, we define $V_{1,\pi}=\max_{1\leq i\leq n}\pi_{1i}(1-\pi_{1i})$ and $V_{2,\pi}$ similarly.

The bilinear form of $\tilde{Z}_{ij}$ in Theorem 2 can be decomposed into bilinear forms and linear combinations of centered random variables, to which we can apply Theorem 1 and Hoeffding's inequality (Theorem 5 in the Appendix), respectively. The details of the proof can be found in Appendix B.

3 Application to testing the cross-covariance matrix

A cross-covariance matrix $\boldsymbol{\Sigma}_{XY}$ between two random vectors $\boldsymbol{X}=(X_{1},\ldots,X_{p})^{\top}$ and $\boldsymbol{Y}=(Y_{1},\ldots,Y_{q})^{\top}$, with $(k,\ell)$-th entry $\sigma_{k\ell}^{XY}=\text{Cov}(X_{k},Y_{\ell})$, is the off-diagonal block of the covariance matrix of $\boldsymbol{Z}=(\boldsymbol{X}^{\top},\boldsymbol{Y}^{\top})^{\top}$. It is often considered a less important part of the covariance matrix of $\boldsymbol{Z}$, $\boldsymbol{\Sigma}_{ZZ}$, and tends to be over-penalized by shrinkage estimators that favor an invertible estimate. However, it is a crucial statistical summary in some applications. For example, the study of multi-omics data, which aims to explain molecular variation at different molecular levels, has received much attention due to the public availability of biological big data, and the covariation between two different data sources is just as important as that within each source. The main question is to find pairs (or positions in the matrix) of $X_{k}$'s and $Y_{\ell}$'s that exhibit a large degree of association, which can be cast as the hypothesis testing problem

$$\mathcal{H}_{0,k\ell}:\sigma_{k\ell}^{XY}=0\quad\text{vs.}\quad\mathcal{H}_{1,k\ell}:\text{not }\mathcal{H}_{0,k\ell},\qquad(1)$$

for $1\leq k\leq p$, $1\leq\ell\leq q$.

Testing the cross-covariance matrix has not been much explored in the literature. Cai et al. (2019) directly address the problem of estimating the cross-covariance matrix. However, they implicitly assume $q\ll p$ and consider simultaneous testing of the hypotheses

$$\mathcal{H}_{0k}:\sigma_{k1}^{XY}=\cdots=\sigma_{kq}^{XY}=0\quad\text{vs.}\quad\mathcal{H}_{1k}:\text{not }\mathcal{H}_{0k}$$

for $k=1,\ldots,p$. They build Hotelling's $T^{2}$ statistics for the individual hypotheses and decide which rows of $\boldsymbol{\Sigma}_{XY}$ are nonzero. Hence, the sparsity pattern in Cai et al. (2019) is not the same as the one considered in this paper. Moreover, their method cannot handle missing data.

Since (1) is equivalent to $\mathcal{H}_{0,k\ell}:\rho_{k\ell}^{XY}=0$, where $\rho_{k\ell}^{XY}=\text{Cor}(X_{k},Y_{\ell})$, one can also analyze the sample correlation coefficient $\gamma_{k\ell}$. Bailey et al. (2019) analyzed a universal thresholding estimator of $\gamma_{k\ell}$ based on its asymptotic normality. Though they are interested in estimating a large correlation matrix, not a cross-correlation matrix directly, their method can be applied to estimation of the cross-covariance matrix. For example, the resulting estimator would be a $p\times q$ matrix with components $\gamma_{k\ell}1_{(|\gamma_{k\ell}|>c(n,p,q))}$, where the cutoff value $c(n,p,q)$ is chosen to control the family-wise error rate (FWER). However, again, if data are subject to missingness, their method is no longer valid.
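To make the thresholding idea concrete, the sketch below (ours) applies a universal cutoff to the sample cross-correlation matrix; the cutoff used here is a placeholder of order $\sqrt{\log(pq/\alpha)/n}$, not the exact choice of Bailey et al. (2019).

```python
import numpy as np

def threshold_cross_correlation(X, Y, alpha=0.05):
    """X: (n, p) and Y: (n, q) complete data; returns a thresholded p x q matrix."""
    n, p = X.shape
    q = Y.shape[1]
    G = np.corrcoef(X.T, Y.T)[:p, p:]          # sample cross-correlation block
    c = np.sqrt(np.log(p * q / alpha) / n)     # hypothetical cutoff c(n, p, q)
    return np.where(np.abs(G) > c, G, 0.0)     # keep only entries above the cutoff
```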

Here, we address the multiple testing problem for the cross-covariance matrix when some of the data are missing or measured with errors. We apply the modified Hanson-Wright inequalities (Theorems 1, 2, 3, and 4) to choose a threshold value that controls the FWER. More specifically, we derive concentration results for an appropriate cross-covariance estimator $\hat{\sigma}_{k\ell}^{XY}$ of the following form:

$${\rm P}\left[\max_{k,\ell}\big|\hat{\sigma}_{k\ell}^{XY}-\sigma_{k\ell}^{XY}\big|>c(n,p,q)\right]\leq\alpha,\quad 0<\alpha<1,$$

under some regularity conditions. We begin with the simplest case, where the data are completely observed, and proceed to more complicated cases later.

3.1 Complete data case

We begin with the complete data case.

Assumption 1.

Assume each component of the random vectors $\boldsymbol{X}\in\mathbb{R}^{p}$ and $\boldsymbol{Y}\in\mathbb{R}^{q}$ is sub-Gaussian, i.e., for some $K_{X},K_{Y}>0$,

$$\max_{1\leq k\leq p}||X_{k}-\mathbb{E}X_{k}||_{\psi_{2}}\leq K_{X},\quad\max_{1\leq\ell\leq q}||Y_{\ell}-\mathbb{E}Y_{\ell}||_{\psi_{2}}\leq K_{Y}.\qquad(2)$$

Let us denote the mean vectors and cross-covariance matrix of $\boldsymbol{X}$ and $\boldsymbol{Y}$ as follows:

$$\begin{array}{l}\mathbb{E}\boldsymbol{X}=\boldsymbol{\mu}^{X}=(\mu_{k}^{X},1\leq k\leq p)^{\top},\\ \mathbb{E}\boldsymbol{Y}=\boldsymbol{\mu}^{Y}=(\mu_{\ell}^{Y},1\leq\ell\leq q)^{\top},\\ \text{Cov}(\boldsymbol{X},\boldsymbol{Y})=\boldsymbol{\Sigma}_{XY}=(\sigma_{k\ell}^{XY},1\leq k\leq p,1\leq\ell\leq q).\end{array}\qquad(3)$$

Suppose we observe $n$ independent samples $\{(\boldsymbol{X}_{i},\boldsymbol{Y}_{i})\}_{i=1}^{n}$ of $(\boldsymbol{X},\boldsymbol{Y})$ in Assumption 1. Then, the cross-covariance $\sigma_{k\ell}^{XY}$ can be estimated by

$$s_{k\ell}=\dfrac{1}{n-1}\sum_{i=1}^{n}(X_{ik}-\bar{X}_{k})(Y_{i\ell}-\bar{Y}_{\ell}),$$

where $\bar{X}_{k}$, $\bar{Y}_{\ell}$ are the sample means corresponding to $\mu_{k}^{X}$, $\mu_{\ell}^{Y}$. We can analyze the concentration of each component of $\boldsymbol{S}_{XY}=(s_{k\ell},1\leq k\leq p,1\leq\ell\leq q)$ using Theorem 1 as the special case where all the $\pi$'s equal 1. We first notice that

$$s_{k\ell}=\dfrac{1}{n}\sum_{i=1}^{n}(X_{ik}-\mu_{k}^{X})(Y_{i\ell}-\mu_{\ell}^{Y})-\dfrac{1}{n(n-1)}\sum_{i\neq j}(X_{ik}-\mu_{k}^{X})(Y_{j\ell}-\mu_{\ell}^{Y}),$$

where $\mu_{k}^{X}=\mathbb{E}X_{ik}$ and $\mu_{\ell}^{Y}=\mathbb{E}Y_{j\ell}$. Hence, we can rewrite the sample cross-covariance estimator in matrix form as

$$s_{k\ell}=\boldsymbol{Z}_{1(k)}^{\top}\boldsymbol{A}\boldsymbol{Z}_{2(\ell)},$$

where $\boldsymbol{Z}_{1(k)}=(X_{1k}-\mu_{k}^{X},\ldots,X_{nk}-\mu_{k}^{X})^{\top}$, $\boldsymbol{Z}_{2(\ell)}=(Y_{1\ell}-\mu_{\ell}^{Y},\ldots,Y_{n\ell}-\mu_{\ell}^{Y})^{\top}$, and $\boldsymbol{A}=\{n(n-1)\}^{-1}(n\mathbf{I}-\boldsymbol{1}\boldsymbol{1}^{\top})$. Note that $||\boldsymbol{A}||_{F}=1/\sqrt{n-1}$ and $||\boldsymbol{A}||_{2}=1/(n-1)$. Then, by Theorem 1, the element-wise deviation inequality for the sample cross-covariance is

$${\rm P}\big[|s_{k\ell}-\sigma_{k\ell}^{XY}|>t\big]\leq 2\exp\left\{-\dfrac{c_{1}(n-1)t^{2}}{K_{X}^{2}K_{Y}^{2}}\right\},\quad\dfrac{t}{K_{X}K_{Y}}<1,$$

for some numerical constant $c_{1}>0$. Putting $t=K_{X}K_{Y}\sqrt{\log(2pq/\alpha)/\{c_{1}(n-1)\}}$ and using the union bound, we get

$${\rm P}\left[\max_{k,\ell}\big|s_{k\ell}-\sigma_{k\ell}^{XY}\big|>C_{1}K_{X}K_{Y}\sqrt{\dfrac{\log(pq/\alpha)}{n-1}}\,\right]\leq\alpha,$$

if $n/\log(pq/\alpha)>d_{1}$ for some numerical constants $C_{1},d_{1}>0$.
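The bilinear representation above, and the two norm identities for $\boldsymbol{A}$, are easy to verify numerically; a sketch (ours):

```python
import numpy as np

# Numerical check (ours): the sample cross-covariance equals the bilinear form
# Z_{1(k)}^T A Z_{2(l)} with A = (n I - 1 1^T) / {n(n-1)}.  The identity is invariant
# to shifting x and y by constants, so centering at the true means changes nothing.
rng = np.random.default_rng(1)
n = 50
x = rng.standard_normal(n)                 # X_{ik}, i = 1, ..., n, for a fixed k
y = 0.5 * x + rng.standard_normal(n)       # Y_{il} for a fixed l
s_kl = np.cov(x, y, ddof=1)[0, 1]          # usual 1/(n-1) sample covariance

A = (n * np.eye(n) - np.ones((n, n))) / (n * (n - 1))
assert np.isclose(s_kl, x @ A @ y)
assert np.isclose(np.linalg.norm(A, "fro"), 1 / np.sqrt(n - 1))
assert np.isclose(np.linalg.norm(A, 2), 1 / (n - 1))
```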

3.2 Missing data case

For the case where data are subject to missingness, we introduce assumptions on the missing indicators.

Assumption 2.

Each component $\delta_{k}^{X}$ of the indicator vector $\boldsymbol{\delta}^{X}=(\delta_{k}^{X},1\leq k\leq p)^{\top}$ corresponding to $\boldsymbol{X}$ is 1 if $X_{k}$ is observed and 0 otherwise; $\boldsymbol{\delta}^{Y}$ is defined similarly. Their moments are given by

$$\begin{array}{l}\mathbb{E}\boldsymbol{\delta}^{X}=\boldsymbol{\pi}^{X}=(\pi_{k}^{X},1\leq k\leq p)^{\top},\\ \mathbb{E}\boldsymbol{\delta}^{Y}=\boldsymbol{\pi}^{Y}=(\pi_{\ell}^{Y},1\leq\ell\leq q)^{\top},\\ \mathbb{E}\boldsymbol{\delta}^{X}(\boldsymbol{\delta}^{Y})^{\top}=\boldsymbol{\pi}^{XY}=(\pi_{k\ell},1\leq k\leq p,1\leq\ell\leq q).\end{array}$$

Note that the above assumption says nothing about independence between components of the indicator vector; that is, it allows $\delta_{k}^{X}$ and $\delta_{\ell}^{X}$, $k\neq\ell$, to be dependent on each other. This implies that multiple components in different positions can be missing together under some dependency structure. Assumption 2 of Park et al. (2021) is equivalent to this, and they called it the general missing dependency assumption. For the missing mechanism, missing completely at random (MCAR) is assumed, that is, $(\boldsymbol{X},\boldsymbol{Y})$ is independent of $(\boldsymbol{\delta}^{X},\boldsymbol{\delta}^{Y})$. More generally, MCAR can be stated as follows.

Assumption 3 (Assumption 3 of Park et al. (2021)).

An event that an observation is missing is independent of both observed and unobserved random variables.

Suppose $n$ independent samples are generated from the population model under Assumptions 1, 2, and 3. Each sample consists of the observational data $(\boldsymbol{X}_{i},\boldsymbol{Y}_{i})$ and the missing indicators $(\boldsymbol{\delta}_{i}^{X},\boldsymbol{\delta}_{i}^{Y})$. However, due to missingness, we can only observe $\tilde{X}_{ik}=\delta_{ik}^{X}X_{ik}$ and $\tilde{Y}_{j\ell}=\delta_{j\ell}^{Y}Y_{j\ell}$ for $i,j=1,\ldots,n$, $k=1,\ldots,p$, and $\ell=1,\ldots,q$. We can easily check that

$$\mathbb{E}\Big[\sum_{i=1}^{n}\tilde{X}_{ik}\tilde{Y}_{i\ell}\Big]=n\pi_{k\ell}(\sigma_{k\ell}^{XY}+\mu_{k}^{X}\mu_{\ell}^{Y}),\quad\mathbb{E}\Big[\sum_{i\neq j}^{n}\tilde{X}_{ik}\tilde{Y}_{j\ell}\Big]=n(n-1)\pi_{k}^{X}\pi_{\ell}^{Y}\mu_{k}^{X}\mu_{\ell}^{Y}.$$

From this observation, we define the following estimator of the cross-covariance, which is unbiased for $\sigma_{k\ell}^{XY}$:

$$\tilde{s}_{k\ell}=\dfrac{\sum_{i=1}^{n}\tilde{X}_{ik}\tilde{Y}_{i\ell}}{n\pi_{k\ell}}-\dfrac{\sum_{i\neq j}^{n}\tilde{X}_{ik}\tilde{Y}_{j\ell}}{n(n-1)\pi_{k}^{X}\pi_{\ell}^{Y}}=(\tilde{\boldsymbol{Z}}_{1(k)}*\boldsymbol{\delta}_{1(k)})^{\top}\boldsymbol{A}_{k\ell}(\tilde{\boldsymbol{Z}}_{2(\ell)}*\boldsymbol{\delta}_{2(\ell)}).\qquad(4)$$

The last representation is a bilinear form with $\tilde{\boldsymbol{Z}}_{1(k)},\tilde{\boldsymbol{Z}}_{2(\ell)},\boldsymbol{\delta}_{1(k)},\boldsymbol{\delta}_{2(\ell)}$, and $\boldsymbol{A}_{k\ell}$ defined as

$$\begin{array}{l}\tilde{\boldsymbol{Z}}_{1(k)}=(X_{1k},\ldots,X_{nk})^{\top},\quad\tilde{\boldsymbol{Z}}_{2(\ell)}=(Y_{1\ell},\ldots,Y_{n\ell})^{\top},\\ \boldsymbol{\delta}_{1(k)}=(\delta_{1k}^{X},\ldots,\delta_{nk}^{X})^{\top},\quad\boldsymbol{\delta}_{2(\ell)}=(\delta_{1\ell}^{Y},\ldots,\delta_{n\ell}^{Y})^{\top},\\ \boldsymbol{A}_{k\ell}=\left(\dfrac{1}{n\pi_{k\ell}}+\dfrac{1}{n(n-1)\pi_{k}^{X}\pi_{\ell}^{Y}}\right)\mathbf{I}-\dfrac{1}{n(n-1)\pi_{k}^{X}\pi_{\ell}^{Y}}\boldsymbol{1}\boldsymbol{1}^{\top}.\end{array}$$
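A direct transcription (ours) of the estimator (4) from masked observations, assuming the observation probabilities $\pi_{k\ell},\pi_{k}^{X},\pi_{\ell}^{Y}$ are known:

```python
import numpy as np

def s_tilde(x_obs, y_obs, pi_joint, pi_x, pi_y):
    """Unbiased cross-covariance estimator (4); x_obs = delta^X * X and
    y_obs = delta^Y * Y hold zeros at missing entries (names are ours)."""
    n = len(x_obs)
    cross = np.sum(x_obs * y_obs)                  # sum_i X~_ik Y~_il
    off = np.sum(x_obs) * np.sum(y_obs) - cross    # sum_{i != j} X~_ik Y~_jl
    return cross / (n * pi_joint) - off / (n * (n - 1) * pi_x * pi_y)
```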

Thus, we apply Theorem 2 to $\tilde{s}_{k\ell}$ and get, for $t<E_{1,k\ell}/E_{2,k\ell}$,

$${\rm P}\big[|\tilde{s}_{k\ell}-\sigma_{k\ell}^{XY}|>t\big]\leq d\exp\left\{-\dfrac{c_{2}t^{2}}{E_{1,k\ell}}\right\},$$

where $c_{2},d>0$ are numerical constants and $E_{1,k\ell},E_{2,k\ell}$ are defined below:

$$\begin{array}{l}E_{1,k\ell}=\max\bigg[\max\Big\{K_{X}^{2}K_{Y}^{2},K_{X}^{2}|\mu_{\ell}^{Y}|^{2},|\mu_{k}^{X}|^{2}K_{Y}^{2},|\mu_{k}^{X}|^{2}|\mu_{\ell}^{Y}|^{2}\Big\}\left(\dfrac{1}{n\pi_{k\ell}^{2}}+\dfrac{1}{n(n-1)(\pi_{k}^{X})^{2}(\pi_{\ell}^{Y})^{2}}\right),\\ \qquad\qquad\max\Big\{K_{X}^{2}|\mu_{\ell}^{Y}|^{2},|\mu_{k}^{X}|^{2}K_{Y}^{2},|\mu_{k}^{X}|^{4},|\mu_{\ell}^{Y}|^{4}\Big\}\dfrac{1}{n}\left(\dfrac{1}{\pi_{k\ell}}-\dfrac{1}{\pi_{k}^{X}\pi_{\ell}^{Y}}\right)^{2}\bigg],\\ E_{2,k\ell}=\max\Big\{K_{X}K_{Y},K_{X}|\mu_{\ell}^{Y}|,|\mu_{k}^{X}|K_{Y},|\mu_{k}^{X}||\mu_{\ell}^{Y}|\Big\}\left(\dfrac{1}{n\pi_{k\ell}}+\dfrac{1}{n(n-1)\pi_{k}^{X}\pi_{\ell}^{Y}}\right).\end{array}$$

Putting $t=\sqrt{E_{1,k\ell}\log(dpq/\alpha)/c_{2}}$ and applying the union bound over the indices $(k,\ell)$, we get, for some numerical constants $C_{2},d_{2}>0$,

$${\rm P}\left[\max_{k,\ell}\big|\tilde{s}_{k\ell}-\sigma_{k\ell}^{XY}\big|>C_{2}\sqrt{\log(pq/\alpha)\max_{k,\ell}E_{1,k\ell}}\,\right]\leq\alpha,$$

if $\sqrt{\log(pq/\alpha)}<d_{2}\min_{k,\ell}\sqrt{E_{1,k\ell}}/E_{2,k\ell}$. A simple calculation leads to

$$\begin{array}{l}\max\limits_{k,\ell}E_{1,k\ell}\leq\dfrac{\{f_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})\}^{2}}{(n-1)(\pi_{\min}^{J}\wedge(\pi_{\min}^{M})^{2})^{2}},\\ \min\limits_{k,\ell}\sqrt{E_{1,k\ell}}/E_{2,k\ell}\geq g_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})\sqrt{n-1}\,(\pi_{\min}^{J}\wedge(\pi_{\min}^{M})^{2}),\end{array}$$

where $\pi_{\min}^{J}=\min_{k,\ell}\pi_{k\ell}$, $\pi_{\min}^{M}=\min(\min_{k}\pi_{k}^{X},\min_{\ell}\pi_{\ell}^{Y})$, $\mu_{X}=\max_{k}|\mu_{k}^{X}|$, $\mu_{Y}=\max_{\ell}|\mu_{\ell}^{Y}|$, $f_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})=\max\{K_{X}K_{Y},\mu_{X}K_{Y},K_{X}\mu_{Y},\mu_{X}\mu_{Y},\mu_{X}^{2},\mu_{Y}^{2}\}$, and $g_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})=\min\{1,\mu_{X}/K_{Y},\mu_{Y}/K_{X},\mu_{X}\mu_{Y}/(K_{X}K_{Y})\}$. The superscripts "J" and "M" stand for joint and marginal, respectively. Then, we conclude that for some numerical constants $\tilde{C}_{2},d_{2}>0$,

$${\rm P}\left[\max_{k,\ell}\big|\tilde{s}_{k\ell}-\sigma_{k\ell}^{XY}\big|>\tilde{C}_{2}f_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})\sqrt{\dfrac{\log(pq/\alpha)}{(n-1)(\pi_{\min}^{J}\wedge(\pi_{\min}^{M})^{2})^{2}}}\,\right]\leq\alpha,$$

if $\dfrac{n-1}{\log(pq/\alpha)}>\dfrac{d_{2}}{g_{2}(K_{X},K_{Y},\mu_{X},\mu_{Y})\,(\pi_{\min}^{J}\wedge(\pi_{\min}^{M})^{2})}$ holds. (The squared factor $(\pi_{\min}^{J}\wedge(\pi_{\min}^{M})^{2})^{2}$ follows from the bound on $\max_{k,\ell}E_{1,k\ell}$ above, in analogy with the measurement error case in Section 3.3.)
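Operationally, the display above yields a plug-in cutoff; the sketch below (ours) treats the constant and the population quantities $f_{2}$, $\pi_{\min}^{J}$, $\pi_{\min}^{M}$ as known inputs, which in practice would have to be estimated.

```python
import numpy as np

def fwer_cutoff_missing(n, p, q, alpha, f2_value, pi_min_joint, pi_min_marginal, C=1.0):
    # threshold ~ C * f2 * sqrt( log(pq/alpha) / ((n-1) * min(pi^J, (pi^M)^2)^2) )
    eff = min(pi_min_joint, pi_min_marginal ** 2)
    return C * f2_value * np.sqrt(np.log(p * q / alpha) / ((n - 1) * eff ** 2))

# usage: keep entry (k, l) of the estimated matrix iff |s_tilde_kl| exceeds the cutoff
```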

3.3 Measurement error case

The missing data problem is a special case of the multiplicative measurement error problem in which the multiplicative factor takes only the values 1 and 0. Under a boundedness condition on the factors, we can extend the current framework to the multiplicative measurement error case, as given below.

Theorem 3.

Let $(Z_{1j},Z_{2j})$, $j=1,\ldots,n$, be independent pairs of (possibly correlated) random variables satisfying $\mathbb{E}Z_{1j}=\mathbb{E}Z_{2j}=0$ and $||Z_{ij}||_{\psi_{2}}\leq K_{i}$ for $i=1,2$ and all $j$. Also, suppose $(\gamma_{1j},\gamma_{2j})$, $j=1,\ldots,n$, are independent pairs of (possibly correlated) non-negative random variables such that $\gamma_{ij}\leq B_{ij}$ almost surely for some $B_{ij}>0$, $i=1,2$ and all $j$. Assume $Z_{ij}$ and $\gamma_{i^{\prime}j^{\prime}}$ are independent whenever $(i,j)\neq(i^{\prime},j^{\prime})$. Then, for every $t>0$,

$$\begin{array}{l}{\rm P}\Big[\big|(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\boldsymbol{Z}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\boldsymbol{Z}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\\ \qquad\leq 2\exp\left\{-c\min\left(\dfrac{t^{2}}{K_{1}^{2}K_{2}^{2}||\boldsymbol{D}(\boldsymbol{B}_{1})\boldsymbol{A}\boldsymbol{D}(\boldsymbol{B}_{2})||_{F}^{2}},\dfrac{t}{K_{1}K_{2}||\boldsymbol{A}||_{2}}\right)\right\},\end{array}$$

for some numerical constant $c>0$. Here $\boldsymbol{D}(\boldsymbol{B}_{1})$ is the diagonal matrix whose diagonal elements come from $\boldsymbol{B}_{1}=(B_{1i},1\leq i\leq n)^{\top}$, and $\boldsymbol{D}(\boldsymbol{B}_{2})$ is defined similarly.

The proof is straightforward, so we outline the idea behind it and omit the details. In the diagonal part, we need to modify (5) as

$$\mathbb{E}(\lambda a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i})^{s}\leq\mathbb{E}\big(\lambda|a_{ii}|B_{1i}B_{2i}|Z_{1i}Z_{2i}|\big)^{s},\quad\lambda>0.$$

For the off-diagonal part, a careful look at the proof shows that the missing indicators are conditioned on throughout the analysis, and the fact that they are Bernoulli random variables is not used until Step 4 in Appendix A.3. The result of Lemma 5 (see Step 4 in Appendix A.3) extends to bounded random errors since we can derive

$$\mathbb{E}\left[\exp\left(\tau\sum_{i:\eta_{i}=1}\gamma_{1i}\sum_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\eta\right]\leq\exp\left(\tau\sum_{i\neq j}a_{ij}^{2}B_{1i}B_{2j}\right),\quad\tau>0.$$

Furthermore, we state the result for the case with non-zero means.

Theorem 4.

Let $(\tilde{Z}_{1j},\tilde{Z}_{2j})$, $j=1,\ldots,n$, be independent pairs of (possibly correlated) random variables with $\mathbb{E}\tilde{Z}_{1j}=\mu_{1j}$, $\mathbb{E}\tilde{Z}_{2j}=\mu_{2j}$, and $||\tilde{Z}_{ij}-\mu_{ij}||_{\psi_{2}}\leq K_{i}$ for $i=1,2$ and all $j$. Also, suppose $(\gamma_{1j},\gamma_{2j})$, $j=1,\ldots,n$, are independent pairs of (possibly correlated) non-negative random variables such that $\gamma_{ij}\leq B_{ij}$ almost surely for some $B_{ij}>0$, $i=1,2$ and all $j$. Assume $\tilde{Z}_{ij}$ and $\gamma_{i^{\prime}j^{\prime}}$ are independent whenever $(i,j)\neq(i^{\prime},j^{\prime})$. Then, for every $t>0$,

$${\rm P}\Big[\big|(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})\big|>t\Big]\leq d\exp\left\{-c\min\left(\dfrac{t^{2}}{E_{1}},\dfrac{t}{E_{2}}\right)\right\},$$

for some numerical constants $c,d>0$. $E_{1}$ and $E_{2}$ are defined by

$$\begin{array}{lll}E_{1}=\max\Bigg\{&K_{1}^{2}K_{2}^{2}||\boldsymbol{D}(\boldsymbol{B}_{1})\boldsymbol{A}\boldsymbol{D}(\boldsymbol{B}_{2})||_{F}^{2},&\max\limits_{1\leq i\leq n}B_{1i}^{2}\,K_{2}^{2}||\boldsymbol{D}(\boldsymbol{\mu}_{1}*\boldsymbol{\mu}_{1})^{1/2}\boldsymbol{A}\boldsymbol{D}(\boldsymbol{B}_{2})^{1/2}||_{F}^{2},\\ &\max\limits_{1\leq i\leq n}B_{2i}^{2}\,K_{1}^{2}||\boldsymbol{D}(\boldsymbol{B}_{1})^{1/2}\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2}*\boldsymbol{\mu}_{2})^{1/2}||_{F}^{2},&K_{2}^{2}||\{\boldsymbol{A}^{\top}(\boldsymbol{\mu}_{1}*\boldsymbol{u}_{1})\}*\boldsymbol{B}_{2}||_{2}^{2},\\ &K_{1}^{2}||\{\boldsymbol{A}(\boldsymbol{\mu}_{2}*\boldsymbol{u}_{2})\}*\boldsymbol{B}_{1}||_{2}^{2},&\max\limits_{1\leq i\leq n}B_{1i}^{2}\max\limits_{1\leq i\leq n}B_{2i}^{2}\,||\boldsymbol{D}(\boldsymbol{\mu}_{1})\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2})||_{F}^{2}\Bigg\},\\[10pt] E_{2}=\max\Bigg\{&K_{1}K_{2}||\boldsymbol{A}||_{2},&\max\limits_{1\leq i\leq n}B_{1i}\,K_{2}||\boldsymbol{D}(\boldsymbol{\mu}_{1})\boldsymbol{A}||_{2},\\ &&\max\limits_{1\leq i\leq n}B_{2i}\,K_{1}||\boldsymbol{A}\boldsymbol{D}(\boldsymbol{\mu}_{2})||_{2}\Bigg\},\end{array}$$

where $\boldsymbol{u}_{i}=\mathbb{E}\boldsymbol{\gamma}_{i}$, $i=1,2$.

The rest of the argument is similar to Section 3.2. Assume we observe $\tilde{\boldsymbol{X}}_{i}=\boldsymbol{\delta}_{i}^{X}*\boldsymbol{X}_{i}$ and $\tilde{\boldsymbol{Y}}_{i}=\boldsymbol{\delta}_{i}^{Y}*\boldsymbol{Y}_{i}$, $i=1,\ldots,n$. While $(\boldsymbol{X}_{i},\boldsymbol{Y}_{i})$ is an independent copy of $(\boldsymbol{X},\boldsymbol{Y})$ in Assumption 1, $(\boldsymbol{\delta}_{i}^{X},\boldsymbol{\delta}_{i}^{Y})$ is no longer a vector of binary variables but an independent copy of $(\boldsymbol{\delta}^{X},\boldsymbol{\delta}^{Y})$ in Assumption 4.

Assumption 4.

Each component $\delta_{k}^{X}$ of $\boldsymbol{\delta}^{X}=(\delta_{k}^{X},1\leq k\leq p)^{\top}$ is a measurement error corresponding to $X_{k}$ of $\boldsymbol{X}$: a non-negative random variable satisfying $\delta_{k}^{X}\leq B_{k}^{X}$ almost surely for each $k$. $\boldsymbol{\delta}^{Y}$ is defined similarly. Their moments are given by

$$\begin{array}{l}\mathbb{E}\boldsymbol{\delta}^{X}=\boldsymbol{u}^{X}=(u_{k}^{X},1\leq k\leq p)^{\top},\\ \mathbb{E}\boldsymbol{\delta}^{Y}=\boldsymbol{u}^{Y}=(u_{\ell}^{Y},1\leq\ell\leq q)^{\top},\\ \mathbb{E}\boldsymbol{\delta}^{X}(\boldsymbol{\delta}^{Y})^{\top}=\boldsymbol{U}^{XY}=(u_{k\ell},1\leq k\leq p,1\leq\ell\leq q).\end{array}$$

Accordingly, the unbiased estimator for $\sigma_{k\ell}^{XY}$ is

$$\check{s}_{k\ell}=\dfrac{\sum_{i=1}^{n}\tilde{X}_{ik}\tilde{Y}_{i\ell}}{nu_{k\ell}}-\dfrac{\sum_{i\neq j}^{n}\tilde{X}_{ik}\tilde{Y}_{j\ell}}{n(n-1)u_{k}^{X}u_{\ell}^{Y}}.$$
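The estimator $\check{s}_{k\ell}$ has the same form as (4), with the error moments in place of the observation probabilities; a transcription (ours), assuming $u_{k\ell},u_{k}^{X},u_{\ell}^{Y}$ are known:

```python
import numpy as np

def s_check(x_obs, y_obs, u_joint, u_x, u_y):
    """Unbiased cross-covariance estimator under multiplicative measurement error;
    x_obs = delta^X * X, y_obs = delta^Y * Y with bounded non-negative errors delta."""
    n = len(x_obs)
    cross = np.sum(x_obs * y_obs)
    off = np.sum(x_obs) * np.sum(y_obs) - cross
    return cross / (n * u_joint) - off / (n * (n - 1) * u_x * u_y)
```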

In this case, we can derive

$${\rm P}\big[|\check{s}_{k\ell}-\sigma_{k\ell}^{XY}|>t\big]\leq d\exp\left\{-\dfrac{c_{3}t^{2}}{E_{1,k\ell}}\right\},\quad t<E_{1,k\ell}/E_{2,k\ell},$$

where $c_{3},d>0$ are numerical constants and $E_{1,k\ell},E_{2,k\ell}$ are defined below:

$$\begin{array}{l}E_{1,k\ell}=\max\bigg[\max\Big\{K_{X}^{2}K_{Y}^{2}(B_{k}^{X})^{2}(B_{\ell}^{Y})^{2},|\mu_{k}^{X}|^{2}K_{Y}^{2}(B_{k}^{X})^{2}B_{\ell}^{Y},\\ \qquad\qquad K_{X}^{2}|\mu_{\ell}^{Y}|^{2}B_{k}^{X}(B_{\ell}^{Y})^{2},|\mu_{k}^{X}|^{2}|\mu_{\ell}^{Y}|^{2}(B_{k}^{X})^{2}(B_{\ell}^{Y})^{2}\Big\}\left(\dfrac{1}{nu_{k\ell}^{2}}+\dfrac{1}{n(n-1)(u_{k}^{X})^{2}(u_{\ell}^{Y})^{2}}\right),\\ \qquad\qquad\max\Big\{K_{X}^{2}|\mu_{\ell}^{Y}|^{2}(B_{k}^{X})^{2}(u_{\ell}^{Y})^{2},|\mu_{k}^{X}|^{2}K_{Y}^{2}(u_{k}^{X})^{2}(B_{\ell}^{Y})^{2}\Big\}\dfrac{1}{n}\left(\dfrac{1}{u_{k\ell}}-\dfrac{1}{u_{k}^{X}u_{\ell}^{Y}}\right)^{2}\bigg],\\ E_{2,k\ell}=\max\Big\{K_{X}K_{Y},K_{X}|\mu_{\ell}^{Y}|B_{k}^{X},|\mu_{k}^{X}|K_{Y}B_{\ell}^{Y}\Big\}\left(\dfrac{1}{nu_{k\ell}}+\dfrac{1}{n(n-1)u_{k}^{X}u_{\ell}^{Y}}\right).\end{array}$$

Moreover, we can observe

$$\begin{array}{l}\max\limits_{k,\ell}E_{1,k\ell}\leq\dfrac{\{f_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})\}^{2}}{(n-1)(u_{\min}^{J}\wedge(u_{\min}^{M})^{2})^{2}},\\ \min\limits_{k,\ell}\sqrt{E_{1,k\ell}}/E_{2,k\ell}\geq g_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})\sqrt{n-1}\,(u_{\min}^{J}\wedge(u_{\min}^{M})^{2}),\end{array}$$

where $u_{\min}^{J}=\min_{k,\ell}u_{k\ell}$, $u_{\min}^{M}=\min(\min_{k}u_{k}^{X},\min_{\ell}u_{\ell}^{Y})$, $\mu_{X}=\max_{k}|\mu_{k}^{X}|$, $\mu_{Y}=\max_{\ell}|\mu_{\ell}^{Y}|$, $B_{X}=\max_{k}B_{k}^{X}$, $B_{Y}=\max_{\ell}B_{\ell}^{Y}$, $g_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})=\min\{1,K_{X}/(B_{X}\mu_{X}),K_{Y}/(B_{Y}\mu_{Y})\}$, and

$$f_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})=\max\{K_{X}K_{Y}B_{X}B_{Y},\mu_{X}K_{Y}B_{X}B_{Y},K_{X}\mu_{Y}B_{X}B_{Y},\mu_{X}\mu_{Y}B_{X}B_{Y},K_{X}\mu_{Y}B_{X}u_{Y},\mu_{X}K_{Y}u_{X}B_{Y}\}.$$

The superscripts "J" and "M" stand for joint and marginal, respectively.

Repeating the calculation from the previous section, we conclude that for some numerical constants $C_{3},d_{3}>0$,

$${\rm P}\left[\max_{k,\ell}\big|\check{s}_{k\ell}-\sigma_{k\ell}^{XY}\big|>C_{3}f_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})\sqrt{\dfrac{\log(pq/\alpha)}{(n-1)\,((u_{\min}^{J})^{2}\wedge(u_{\min}^{M})^{4})}}\,\right]\leq\alpha,$$

if $\dfrac{n-1}{\log(pq/\alpha)}\geq\dfrac{d_{3}}{g_{3}(K_{X},K_{Y},\mu_{X},\mu_{Y},B_{X},B_{Y})\,(u_{\min}^{J}\wedge(u_{\min}^{M})^{2})}$ holds.

4 Discussion

We have discussed a generalized Hanson-Wright inequality in which a sparse structure and a bilinear form are considered together for the first time. This extension enables an analysis of the concentration of the sample (cross-)covariance estimator even when the mean parameters are unknown and some of the data are missing. We apply this result to multiple testing of the cross-covariance matrix.

We further consider a measurement error case that extends the missing data case, though it is limited to bounded random variables. It would be interesting to consider more general multiplicative errors in future work; for example, a truncated normal distribution on $(0,\infty)$ would be a natural candidate.

References

  • Bailey, N., Pesaran, M., and Smith, L. V. (2019). A multiple testing approach to the regularisation of large sample correlation matrices. Journal of Econometrics, 208(2):507–534.
  • Cai, T., Cai, T., Liao, K., and Liu, W. (2019). Large-scale simultaneous testing of cross-covariance matrices with applications to PheWAS. Statistica Sinica, 29:983–1005.
  • Park, S., Wang, X., and Lim, J. (2021). Estimating high-dimensional covariance and precision matrices under general missing dependence. Electronic Journal of Statistics, 15(2):4868–4915.
  • Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18:1–9.
  • Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge.
  • Zhou, S. (2019). Sparse Hanson-Wright inequalities for subgaussian quadratic forms. Bernoulli, 25(3):1603–1639.

Appendix

Appendix A Proof of Theorem 1

A.1 Proof of Lemma 1 (diagonal part)

Define $S_{0}$ by

$$S_{0}=\sum_{i=1}^{n}a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i}-\mathbb{E}\left[\sum_{i=1}^{n}a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i}\right].$$

For $|\lambda|<1/(4eK_{1}K_{2}\max_{i}|a_{ii}|)$,

$$\begin{array}{l}\mathbb{E}\exp(\lambda S_{0})=\dfrac{\mathbb{E}\exp(\lambda\sum_{i=1}^{n}a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i})}{\exp\left(\lambda\mathbb{E}\left[\sum_{i=1}^{n}a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i}\right]\right)}=\prod\limits_{i=1}^{n}\dfrac{\mathbb{E}_{\gamma}\mathbb{E}_{X}\exp(\lambda a_{ii}\gamma_{1i}\gamma_{2i}Z_{1i}Z_{2i})}{\exp\left(\lambda a_{ii}\mathbb{E}\gamma_{1i}\gamma_{2i}\mathbb{E}Z_{1i}Z_{2i}\right)}\\[8pt] =\prod\limits_{i=1}^{n}\dfrac{(1-\pi_{12,i})+\pi_{12,i}\mathbb{E}_{X}\exp(\lambda a_{ii}Z_{1i}Z_{2i})}{\exp\left(\lambda a_{ii}\pi_{12,i}\mathbb{E}Z_{1i}Z_{2i}\right)}\leq\prod\limits_{i=1}^{n}\dfrac{(1-\pi_{12,i})+\pi_{12,i}(1+\lambda a_{ii}\mathbb{E}Z_{1i}Z_{2i}+16\lambda^{2}a_{ii}^{2}K_{1}^{2}K_{2}^{2})}{\exp\left(\lambda a_{ii}\pi_{12,i}\mathbb{E}Z_{1i}Z_{2i}\right)}\\[8pt] \leq\prod\limits_{i=1}^{n}\dfrac{\exp\left\{\pi_{12,i}(\lambda a_{ii}\mathbb{E}Z_{1i}Z_{2i}+16\lambda^{2}a_{ii}^{2}K_{1}^{2}K_{2}^{2})\right\}}{\exp\left(\lambda a_{ii}\pi_{12,i}\mathbb{E}Z_{1i}Z_{2i}\right)}\leq\exp\left\{16\lambda^{2}K_{1}^{2}K_{2}^{2}\sum\limits_{i=1}^{n}\pi_{12,i}a_{ii}^{2}\right\},\end{array}\qquad(5)$$

where the first inequality uses Lemma 3 and the second holds since $1+x\leq e^{x}$.

Lemma 3.

Assume that $||Z_{1i}||_{\psi_{2}}\leq K_{1}$, $||Z_{2i}||_{\psi_{2}}\leq K_{2}$ for some $K_{1},K_{2}>0$, and $|\lambda|<1/(4eK_{1}K_{2}\max_{i}|a_{ii}|)$ for given $\{a_{ii}\}_{i=1}^{n}\subset\mathbb{R}$. Then, for any $i$, we have

$$\mathbb{E}\exp\left(\lambda a_{ii}Z_{1i}Z_{2i}\right)-1\leq\lambda a_{ii}\mathbb{E}Z_{1i}Z_{2i}+16\lambda^{2}a_{ii}^{2}K_{1}^{2}K_{2}^{2}.$$
Proof.

First, we define the sub-exponential (or $\psi_{1}$-) norm of a random variable by

$$||X||_{\psi_{1}}=\sup_{p\geq 1}\dfrac{(\mathbb{E}|X|^{p})^{1/p}}{p}.$$

Since the product of two sub-Gaussian random variables is sub-exponential, we may define $Y_{i}=Z_{1i}Z_{2i}/||Z_{1i}Z_{2i}||_{\psi_{1}}$. Setting $t_{i}=\lambda a_{ii}||Z_{1i}Z_{2i}||_{\psi_{1}}$, we get

$$\begin{array}{l}\mathbb{E}\exp\left(t_{i}Y_{i}\right)=1+t_{i}\mathbb{E}Y_{i}+\sum_{s\geq 2}\dfrac{t_{i}^{s}\mathbb{E}Y_{i}^{s}}{s!}\leq 1+t_{i}\mathbb{E}Y_{i}+\sum_{s\geq 2}\dfrac{|t_{i}|^{s}\mathbb{E}|Y_{i}|^{s}}{s!}\\ \qquad\leq 1+t_{i}\mathbb{E}Y_{i}+\sum_{s\geq 2}\dfrac{2|t_{i}|^{s}s^{s}}{s!}\leq 1+t_{i}\mathbb{E}Y_{i}+\sum_{s\geq 2}\dfrac{|t_{i}|^{s}e^{s}}{\sqrt{\pi}}.\end{array}$$

For the last two inequalities, we use the following observations. First, for a sub-exponential variable with $||Y_{i}||_{\psi_{1}}=1$, it holds that (see Proposition 2.7.1 of Vershynin (2018))

$$\mathbb{E}|Y_{i}|^{s}\leq 2s^{s},\quad s\geq 1.$$

Second, we use Stirling's formula: for $s\geq 2$,

$$\dfrac{1}{s!}\leq\dfrac{e^{s}}{2s^{s}\sqrt{\pi}}.$$

If $|\lambda|<1/(4eK_{1}K_{2}\max_{i}|a_{ii}|)$, then $e|t_{i}|\leq 1/2$, and thus

$$\sum_{s\geq 3}e^{s}|t_{i}|^{s}\leq(e|t_{i}|)^{3}\sum_{s\geq 0}(1/2)^{s}\leq(e|t_{i}|)^{2}.$$

Using the above, we have

$$\begin{array}{rcl}\mathbb{E}\exp\left(t_{i}Y_{i}\right)&\leq&1+t_{i}\mathbb{E}Y_{i}+2e^{2}|t_{i}|^{2}/\sqrt{\pi}\\ &=&1+\lambda a_{ii}\mathbb{E}Z_{1i}Z_{2i}+2e^{2}\lambda^{2}a_{ii}^{2}||Z_{1i}Z_{2i}||_{\psi_{1}}^{2}/\sqrt{\pi}\\ &\leq&1+\lambda a_{ii}\mathbb{E}Z_{1i}Z_{2i}+16\lambda^{2}a_{ii}^{2}K_{1}^{2}K_{2}^{2}.\end{array}$$

∎

Then, for $x>0$ and $0<\lambda<1/(4eK_{1}K_{2}\max_{i}|a_{ii}|)$, we have

$${\rm P}(S_{0}>x)={\rm P}\left(\exp(\lambda S_{0})>\exp(\lambda x)\right)\leq\dfrac{\mathbb{E}\exp(\lambda S_{0})}{\exp(\lambda x)}\leq\exp\left\{-\lambda x+16\lambda^{2}K_{1}^{2}K_{2}^{2}\sum_{i=1}^{n}\pi_{12,i}a_{ii}^{2}\right\}.$$

For the optimal choice of $\lambda$, that is,

$$\lambda=\min\left(\dfrac{x}{32eK_{1}^{2}K_{2}^{2}\sum_{i}\pi_{12,i}a_{ii}^{2}},\dfrac{1}{4eK_{1}K_{2}\max_{i}|a_{ii}|}\right),$$

we can obtain the concentration bound

$${\rm P}(S_{0}>x)\leq\exp\left\{-\min\left(\dfrac{x^{2}}{32K_{1}^{2}K_{2}^{2}\sum_{i}\pi_{12,i}a_{ii}^{2}},\dfrac{x}{8eK_{1}K_{2}\max_{i}|a_{ii}|}\right)\right\}.$$
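The optimization over $\lambda$ is the standard sub-exponential tail argument; for completeness, a short justification (ours), with $b=16K_{1}^{2}K_{2}^{2}\sum_{i}\pi_{12,i}a_{ii}^{2}$ and $\lambda_{\max}=1/(4eK_{1}K_{2}\max_{i}|a_{ii}|)$:

```latex
% If the unconstrained minimizer \lambda^* = x/(2b) of -\lambda x + b\lambda^2 satisfies
% \lambda^* \le \lambda_{\max}, the minimum is -x^2/(4b); otherwise \lambda_{\max} < x/(2b)
% implies b\lambda_{\max}^2 < \lambda_{\max}x/2, so the value at \lambda_{\max} is at most
% -\lambda_{\max}x/2.  In either case,
\min_{0<\lambda\le\lambda_{\max}}\left(-\lambda x+b\lambda^{2}\right)
\leq-\min\left(\frac{x^{2}}{4b},\,\frac{\lambda_{\max}x}{2}\right),
% which is the two-regime (sub-Gaussian / sub-exponential) bound displayed above,
% up to the absolute constants.
```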

Considering $-\boldsymbol{A}_{\rm diag}$ instead of $\boldsymbol{A}_{\rm diag}$ in the lemma, we have the same probability bound for ${\rm P}(S_{0}<-x)$, $x>0$, which concludes the proof.

A.2 Proof of Lemma 2 (off-diagonal part)

Define $S_{\rm off}$ by

$$S_{\rm off}=\sum_{1\leq i\neq j\leq n}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j},$$

whose expectation is zero due to the independence across distinct $i$ and $j$. Then, we claim the lemma below:

Lemma 4.

Assume $||Z_{1i}||_{\psi_{2}}=||Z_{2i}||_{\psi_{2}}=1$ and let $\{a_{ij}\}_{1\leq i\neq j\leq n}\subset\mathbb{R}$ be given. For $|\lambda|<1/(2\sqrt{C_{4}}||A||_{2})$ with some numerical constant $C_{4}>0$, we have

$$\mathbb{E}\exp\left(\lambda S_{\rm off}\right)\leq\exp\left(1.44C_{4}\lambda^{2}\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right).$$

The proof is deferred to Appendix A.3. Without loss of generality, we may assume $||Z_{1i}||_{\psi_{2}}=||Z_{2i}||_{\psi_{2}}=1$; otherwise set $Z_{1i}\leftarrow Z_{1i}/||Z_{1i}||_{\psi_{2}}$, $Z_{2i}\leftarrow Z_{2i}/||Z_{2i}||_{\psi_{2}}$. Using Lemma 4, we get for $x>0$ and $0<\lambda<1/(2\sqrt{C_{4}}||A||_{2})$

$${\rm P}(S_{\rm off}>x)={\rm P}\left(\exp(\lambda S_{\rm off})>\exp(\lambda x)\right)\leq\dfrac{\mathbb{E}\exp(\lambda S_{\rm off})}{\exp(\lambda x)}\leq\exp\left\{-\lambda x+1.44C_{4}\lambda^{2}\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right\}.$$

For the optimal choice of $\lambda$, that is,

$$\lambda=\min\left(\dfrac{x}{2.88C_{4}\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}},\dfrac{1}{2\sqrt{C_{4}}||A||_{2}}\right),$$

we can obtain the concentration bound

$${\rm P}(S_{\rm off}>x)\leq\exp\left\{-c\min\left(\dfrac{x^{2}}{\sum_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}},\dfrac{x}{||A||_{2}}\right)\right\}.$$

Considering $-\boldsymbol{A}_{\rm off}$ instead of $\boldsymbol{A}_{\rm off}$ in the lemma, we have the same probability bound for ${\rm P}(S_{\rm off}<-x)$, $x>0$, which concludes the proof.

A.3 Proof of Lemma 4

This proof follows the logic of the proof of (9) in Zhou (2019). We first decouple the off-diagonal sum $S_{\rm off}$.

Step 1. Decoupling

We introduce Bernoulli variables $\eta=(\eta_{1},\ldots,\eta_{n})^{\top}$ with $\mathbb{E}\eta_{k}=1/2$ for every $k$. They are independent of each other and also independent of $Z_{1},Z_{2}$ and $\gamma_{1},\gamma_{2}$. Given $\eta$, we define $Z_{1}^{\eta}$ as the subvector of $Z_{1}$ at which $\eta_{i}=1$, and $Z_{2}^{1-\eta}$ as the subvector of $Z_{2}$ at which $\eta_{j}=0$. Let $\mathbb{E}_{Q}$ denote the expectation with respect to a random variable $Q$, where $Q$ can be any of $Z_{1},Z_{2},\gamma_{1},\gamma_{2},\eta$ in our context, or $X,\gamma$ to denote $Z_{1},Z_{2}$ and $\gamma_{1},\gamma_{2}$ together. Define a random variable

$$S_{\eta}=\sum_{1\leq i,j\leq n}a_{ij}\eta_{i}(1-\eta_{j})\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}.$$

Using $S_{\rm off}=4\mathbb{E}_{\eta}S_{\eta}$ and Jensen's inequality with $x\mapsto e^{ax}$ for $a\in\mathbb{R}$, we get

$$\mathbb{E}\exp(\lambda S_{\rm off})=\mathbb{E}_{\gamma}\mathbb{E}_{X}\exp(\mathbb{E}_{\eta}4\lambda S_{\eta})\leq\mathbb{E}_{\gamma}\mathbb{E}_{X}\mathbb{E}_{\eta}\exp(4\lambda S_{\eta}).$$
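The factor 4 comes from $\mathbb{E}[\eta_{i}(1-\eta_{j})]=1/4$ for $i\neq j$ (and $=0$ for $i=j$), so averaging over $\eta$ recovers exactly the off-diagonal sum; a numerical check (ours):

```python
import numpy as np

# Check (ours) of S_off = 4 * E_eta[S_eta]: the eta-average keeps each off-diagonal
# term with weight E[eta_i (1 - eta_j)] = 1/4 and kills the diagonal (i = j) terms.
rng = np.random.default_rng(2)
n = 4
a = rng.standard_normal((n, n))
w = rng.standard_normal((n, n))    # stands in for gamma_1i * gamma_2j * Z_1i * Z_2j
S_off = sum(a[i, j] * w[i, j] for i in range(n) for j in range(n) if i != j)

E_weight = 0.25 * (1 - np.eye(n))  # E[eta_i * (1 - eta_j)]
assert np.isclose(S_off, 4 * np.sum(a * w * E_weight))
```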

Conditioning on all the variables except $Z_{2}^{1-\eta}=(Z_{2j},\eta_{j}=0)$, we consider the conditional moment generating function of $\exp(4\lambda S_{\eta})$, denoted by $f(\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta})$:

$$f(\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta})=\mathbb{E}\left(\exp(4\lambda S_{\eta})\,|\,\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta}\right).$$

Note that, conditional on all other variables, $S_{\eta}$ can be seen as a linear combination of the sub-Gaussian variables $Z_{2j}$ for $j$ such that $\eta_{j}=0$, that is,

$$S_{\eta}=\sum_{j:\eta_{j}=0}Z_{2j}\left(\sum_{i:\eta_{i}=1}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}\right).$$

Thus, the conditional distribution of $S_{\eta}$ is sub-Gaussian, satisfying

$$||S_{\eta}||_{\psi_{2}}\leq C_{0}\sigma_{\eta,\gamma},$$

where $\sigma_{\eta,\gamma}^{2}=\sum_{j:\eta_{j}=0}\gamma_{2j}\left(\sum_{i:\eta_{i}=1}a_{ij}\gamma_{1i}Z_{1i}\right)^{2}$. Therefore, for some $C^{\prime}>0$,

$$f(\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta})=\mathbb{E}\left(\exp(4\lambda S_{\eta})\,|\,\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta}\right)\leq\exp(C^{\prime}\lambda^{2}\sigma_{\eta,\gamma}^{2}).\qquad(6)$$

Taking expectations on both sides, we get

$$\mathbb{E}_{\gamma}\mathbb{E}_{Z_{1}^{\eta}}f(\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta})\leq\mathbb{E}_{\gamma}\mathbb{E}_{Z_{1}^{\eta}}\exp(C^{\prime}\lambda^{2}\sigma_{\eta,\gamma}^{2})=:\tilde{f}_{\eta}.$$

Step 2. Reduction to normal random variables

Assume that $\eta$, $\gamma_{1},\gamma_{2}$, and $Z_{1}^{\eta}$ are fixed. Let $g=(g_{1},\ldots,g_{n})^{\top}$, where the $g_{i}$ are i.i.d. $N(0,1)$. Replacing $Z_{2j}$ by $g_{j}$ in $S_{\eta}$, we define a random variable $Z$ by

$$Z=\sum_{j:\eta_{j}=0}g_{j}\left(\sum_{i:\eta_{i}=1}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}\right).$$

By the properties of Gaussian variables, the conditional distribution of $Z$ is $N(0,\sigma_{\eta,\gamma}^{2})$. Hence, its conditional moment generating function is given by

$$\mathbb{E}_{g}\exp(tZ)=\mathbb{E}\left[\exp(tZ)\,|\,\eta,\gamma_{1},\gamma_{2},Z_{1}^{\eta}\right]=\exp(t^{2}\sigma_{\eta,\gamma}^{2}/2).\qquad(7)$$

Now, consider $Z$ as a linear combination of $\{Z_{1i}:\eta_{i}=1\}$ by rewriting it as

$$Z=\sum_{i:\eta_{i}=1}Z_{1i}\left(\sum_{j:\eta_{j}=0}g_{j}a_{ij}\gamma_{1i}\gamma_{2j}\right),$$

where the $Z_{1i}$'s are regarded as random variables and the others are fixed. Then, for some $C_{3}>0$,

$$\mathbb{E}_{Z_{1}^{\eta}}\exp(\sqrt{2C^{\prime}}\lambda Z)=\mathbb{E}\left[\exp(\sqrt{2C^{\prime}}\lambda Z)\,|\,g,\eta,\gamma_{1},\gamma_{2}\right]\leq\exp\left(C_{3}\lambda^{2}\sum_{i:\eta_{i}=1}\gamma_{1i}\left(\sum_{j:\eta_{j}=0}g_{j}a_{ij}\gamma_{2j}\right)^{2}\right).\qquad(8)$$

We now derive an upper bound on $\tilde{f}_{\eta}$ from Step 1 by using (7) and (8). First, choosing $t=\sqrt{2C^{\prime}}\lambda$ in (7) produces the same term as in (6):

$$\tilde{f}_{\eta}=\mathbb{E}_{\gamma}\mathbb{E}_{Z_{1}^{\eta}}\mathbb{E}_{g}\exp(\sqrt{2C^{\prime}}\lambda Z)=\mathbb{E}_{\gamma}\mathbb{E}_{g}\mathbb{E}_{Z_{1}^{\eta}}\exp(\sqrt{2C^{\prime}}\lambda Z).$$

Then, we apply (8) to the right-hand side of the above:

$$\tilde{f}_{\eta}\leq\mathbb{E}_{\gamma}\mathbb{E}_{g}\exp\left(C_{3}\lambda^{2}\sum_{i:\eta_{i}=1}\gamma_{1i}\left(\sum_{j:\eta_{j}=0}g_{j}a_{ij}\gamma_{2j}\right)^{2}\right)=\mathbb{E}\left[\exp\left(C_{3}\lambda^{2}(g^{\top}B_{\eta,\gamma}g)\right)|\eta\right],\qquad(9)$$

where the $(k,\ell)$-th element of $B_{\eta,\gamma}$ is $\sum_{i:\eta_{i}=1}\gamma_{1i}a_{ik}a_{i\ell}\gamma_{2k}\gamma_{2\ell}$ if $k,\ell\in\{j:\eta_{j}=0\}$, and 0 otherwise. Note that $B_{\eta,\gamma}$ is symmetric.

Step 3. Integrating out the normal random variables

For fixed $\gamma_{1},\gamma_{2},\eta$, the quadratic form $g^{\top}B_{\eta,\gamma}g$ is identical in distribution to $\sum_{j=1}^{n}s_{j}^{2}g_{j}^{2}$ by the rotational invariance of $g$, where the $s_{j}^{2}$ are the eigenvalues of $B_{\eta,\gamma}$. Note that, for any realization of $\gamma_{1},\gamma_{2},\eta$, the eigenvalues satisfy

$$\begin{array}{rcl}\max\limits_{1\leq j\leq n}s_{j}^{2}&=&||B_{\eta,\gamma}||_{2}\leq||A||_{2}^{2},\\ \sum\limits_{1\leq j\leq n}s_{j}^{2}&=&{\rm tr}(B_{\eta,\gamma})=\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}.\end{array}\qquad(10)$$

Now, using $g_{j}^{2}\sim\chi^{2}_{1}$ and $\mathbb{E}\exp(tg_{j}^{2})=(1-2t)^{-1/2}\leq\exp(2t)$ for $t<1/4$, we get from (9) and (10)

$$\begin{array}{l}\tilde{f}_{\eta}\leq\mathbb{E}\left[\exp\left(C_{3}\lambda^{2}(g^{\top}B_{\eta,\gamma}g)\right)|\eta\right]=\mathbb{E}_{\gamma}\mathbb{E}\left[\exp\left(C_{3}\lambda^{2}(g^{\top}B_{\eta,\gamma}g)\right)|\eta,\gamma_{1},\gamma_{2}\right]\\[5pt] =\mathbb{E}_{\gamma}\mathbb{E}\left[\exp\left(\sum\limits_{j=1}^{n}C_{3}\lambda^{2}s_{j}^{2}g_{j}^{2}\right)|\eta,\gamma_{1},\gamma_{2}\right]\leq\mathbb{E}\left[\prod\limits_{j=1}^{n}\exp(2C_{3}\lambda^{2}s_{j}^{2})|\eta\right]\\[5pt] =\mathbb{E}\left[\exp\left(2C_{3}\lambda^{2}\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)|\eta\right]\end{array}$$

for $\lambda^{2}<1/(4C_{3}||A||_{2}^{2})\;(\leq 1/(4C_{3}\max_{j}s_{j}^{2}))$. It is worth noting that the Bernoulli variables $\gamma_{1},\gamma_{2}$ are decoupled by $\eta$.

Step 4. Integrating out the Bernoulli random variables

We now integrate out $\gamma_{1},\gamma_{2}$ from $\tilde{f}_{\eta}$ by using the following lemma. For $0<\lambda^{2}\leq 1/(8C_{3}||A||_{2}^{2})$, we have

$$\tilde{f}_{\eta}\leq\mathbb{E}\left[\exp\left(2C_{3}\lambda^{2}\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\eta\right]\leq\exp\left(2.88C_{3}\lambda^{2}\sum\limits_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right).$$
Lemma 5.

For $0<\tau\leq 1/(4||A||_{2}^{2})$, the conditional expectation satisfies

$$\mathbb{E}\left[\exp\left(\tau\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\eta\right]\leq\exp\left(1.44\tau\sum\limits_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right).$$
Proof.

By the independence of $\gamma_{1i}$, $i=1,2,\ldots,n$, we get

$$\begin{array}{l}\mathbb{E}\left[\exp\left(\tau\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\gamma_{2}^{1-\eta},\eta\right]=\prod\limits_{i:\eta_{i}=1}\mathbb{E}\left[\exp\left(\tau\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\gamma_{2}^{1-\eta},\eta\right]\\ =\prod\limits_{i:\eta_{i}=1}\left[(1-\pi_{1i})+\pi_{1i}\exp\left(\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\right].\end{array}$$

Then, we use the local approximation of the exponential function:

e^{x}-1\leq 1.2x,\qquad 0\leq x\leq 0.35.
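A minimal numerical check of this approximation (tight near $x=0.35$: $e^{0.35}-1\approx 0.4191\leq 0.42$), together with the mixture bound $(1-p)+pe^{x}\leq e^{1.2px}$ that it yields below:

```python
import numpy as np

# Check e^x - 1 <= 1.2 x on [0, 0.35], and the resulting mixture bound
# (1 - p) + p e^x <= exp(1.2 p x) for p in [0, 1].
x = np.linspace(0.0, 0.35, 2_001)
assert np.all(np.exp(x) - 1.0 <= 1.2 * x + 1e-12)
for p in np.linspace(0.0, 1.0, 11):
    assert np.all((1.0 - p) + p * np.exp(x) <= np.exp(1.2 * p * x) + 1e-12)
```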

To use it, $\tau$ must satisfy, for every $i$ and any given $\eta$ and $\gamma_{2}$,

\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}<0.35, \qquad (11)

which leads to

(1-\pi_{1i})+\pi_{1i}\exp\left(\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\leq 1+1.2\pi_{1i}\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\leq\exp\left(1.2\pi_{1i}\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right),

where we use $1+x\leq e^{x}$ in the last inequality. A sufficient condition on $\tau$ for (11) to hold for any realization of $\eta$ and $\gamma_{2}$ is $\tau\leq 1/(4||A||_{2}^{2})$, since

\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\leq\tau\max_{i}\sum_{j}a_{ij}^{2}\leq\tau||A||_{2}^{2}\leq 0.25.
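The middle step uses that every squared row sum of $A$ is at most the squared spectral norm, since $\sum_{j}a_{ij}^{2}=||e_{i}^{\top}A||_{2}^{2}\leq||A||_{2}^{2}$; a one-line numerical check:

```python
import numpy as np

# Check max_i sum_j a_ij^2 <= ||A||_2^2 (row norms bounded by the spectral norm).
rng = np.random.default_rng(1)
A = rng.standard_normal((9, 9))
assert (A ** 2).sum(axis=1).max() <= np.linalg.norm(A, 2) ** 2 + 1e-10
```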

Hence, for $\tau\leq 1/(4||A||_{2}^{2})$,

\begin{array}{l}
\mathbb{E}\left[\exp\left(\tau\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\eta\right]\\
=\mathbb{E}_{\gamma_{2}^{1-\eta}}\mathbb{E}\left[\exp\left(\tau\sum\limits_{i:\eta_{i}=1}\gamma_{1i}\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\gamma_{2}^{1-\eta},\eta\right]\\
\leq\mathbb{E}\left[\exp\left(\sum\limits_{i:\eta_{i}=1}1.2\pi_{1i}\tau\sum\limits_{j:\eta_{j}=0}a_{ij}^{2}\gamma_{2j}\right)\Big|\eta\right]\\
=\prod\limits_{j:\eta_{j}=0}\mathbb{E}\left[\exp\left(1.2\tau\gamma_{2j}\sum\limits_{i:\eta_{i}=1}\pi_{1i}a_{ij}^{2}\right)\Big|\eta\right]\\
\leq\prod\limits_{j:\eta_{j}=0}\exp\left(1.2^{2}\tau\pi_{2j}\sum\limits_{i:\eta_{i}=1}\pi_{1i}a_{ij}^{2}\right).
\end{array}

In the last inequality, we apply to $\gamma_{2j}$, $j\in\{j:\eta_{j}=0\}$, the same argument used above for $\gamma_{1i}$, $i\in\{i:\eta_{i}=1\}$. Note that $\tau\leq 1/(4||A||_{2}^{2})$ is again sufficient, since

1.2\tau\sum\limits_{i:\eta_{i}=1}a_{ij}^{2}\pi_{1i}\leq 1.2\tau\max_{j}\sum_{i}a_{ij}^{2}\leq 1.2\tau||A||_{2}^{2}\leq 0.3.

Finally, since the index sets $\{i:\eta_{i}=1\}$ and $\{j:\eta_{j}=0\}$ are disjoint, every pair $(i,j)$ in the double sum has $i\neq j$, and hence

\exp\left(1.2^{2}\tau\sum\limits_{j:\eta_{j}=0}\sum\limits_{i:\eta_{i}=1}a_{ij}^{2}\pi_{1i}\pi_{2j}\right)\leq\exp\left(1.2^{2}\tau\sum\limits_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right),

which concludes the proof. ∎
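As an aside, Lemma 5 can be verified by brute force on a tiny instance when $\gamma_{1},\gamma_{2}$ are independent Bernoulli vectors, so that the conditional expectation is an exact finite sum. The following Python sketch makes this check; the dimension, seed, and $\eta$ are chosen only for illustration.

```python
import itertools
import numpy as np

# Brute-force check of Lemma 5 with independent Bernoulli gamma_1, gamma_2:
# compute the conditional expectation exactly by enumerating all 2^n x 2^n outcomes.
rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
eta = np.array([1, 0, 1, 0])
p1 = rng.uniform(0.1, 0.9, n)   # pi_{1i}
p2 = rng.uniform(0.1, 0.9, n)   # pi_{2j}
tau = 1.0 / (4.0 * np.linalg.norm(A, 2) ** 2)

lhs = 0.0
for g1 in itertools.product([0, 1], repeat=n):
    for g2 in itertools.product([0, 1], repeat=n):
        w = np.prod([p1[i] if g1[i] else 1 - p1[i] for i in range(n)])
        w *= np.prod([p2[j] if g2[j] else 1 - p2[j] for j in range(n)])
        s = sum(A[i, j] ** 2
                for i in range(n) if eta[i] == 1 and g1[i] == 1
                for j in range(n) if eta[j] == 0 and g2[j] == 1)
        lhs += w * np.exp(tau * s)

rhs = np.exp(1.44 * tau * sum(A[i, j] ** 2 * p1[i] * p2[j]
                              for i in range(n) for j in range(n) if i != j))
assert lhs <= rhs
```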

Step 5. Putting everything together

Assume $|\lambda|<1/(\sqrt{8C_{3}}\,||A||_{2})$. Combining all the steps above, we get

\begin{array}{l}
\mathbb{E}\exp(\lambda S_{\rm off})\\
\leq\mathbb{E}_{\eta}\mathbb{E}_{\gamma}\mathbb{E}_{X}\exp(4\lambda S_{\eta})\\
\leq\mathbb{E}_{\eta}\mathbb{E}_{\gamma}\mathbb{E}_{Z_{1}^{\eta}}\mathbb{E}\left(\exp(4\lambda S_{\eta})\,\big|\,\gamma_{1},\gamma_{2},\eta,Z_{1}^{\eta}\right)\\
\leq\mathbb{E}_{\eta}\tilde{f}_{\eta}\\
\leq\exp\left(2.88C_{3}\lambda^{2}\sum\limits_{i\neq j}a_{ij}^{2}\pi_{1i}\pi_{2j}\right),
\end{array}

which proves Lemma 4.

Appendix B Proof of Theorem 2

Let us define the centered sub-Gaussian variables $Z_{ij}=\tilde{Z}_{ij}-\mu_{ij}$. Then, it is straightforward to check that the centered bilinear form decomposes into eight terms:

\begin{array}{rcl}
S&\equiv&(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})-\mathbb{E}(\tilde{\boldsymbol{Z}}_{1}*\boldsymbol{\gamma}_{1})^{\top}\boldsymbol{A}(\tilde{\boldsymbol{Z}}_{2}*\boldsymbol{\gamma}_{2})\\
&=&\sum\limits_{i,j}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}-\mathbb{E}\left[\sum\limits_{i,j}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}\right]\\
&&+\sum\limits_{i,j}a_{ij}\mu_{1i}\gamma_{1i}\gamma_{2j}Z_{2j}\\
&&+\sum\limits_{i,j}a_{ij}\mu_{2j}\gamma_{1i}\gamma_{2j}Z_{1i}\\
&&+\sum\limits_{i,j}a_{ij}\mu_{1i}\mu_{2j}\gamma_{1i}\gamma_{2j}-\mathbb{E}\left[\sum\limits_{i,j}a_{ij}\mu_{1i}\mu_{2j}\gamma_{1i}\gamma_{2j}\right]\\
&=&\sum\limits_{i,j}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}-\mathbb{E}\left[\sum\limits_{i,j}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}\right]\\
&&+\sum\limits_{i,j}a_{ij}\mu_{1i}(\gamma_{1i}-\pi_{1i})\gamma_{2j}Z_{2j}+\sum\limits_{i,j}a_{ij}\mu_{1i}\pi_{1i}\gamma_{2j}Z_{2j}\\
&&+\sum\limits_{i,j}a_{ij}\mu_{2j}\gamma_{1i}Z_{1i}(\gamma_{2j}-\pi_{2j})+\sum\limits_{i,j}a_{ij}\mu_{2j}\pi_{2j}\gamma_{1i}Z_{1i}\\
&&+\sum\limits_{i,j}a_{ij}\mu_{1i}\mu_{2j}(\gamma_{1i}-\pi_{1i})(\gamma_{2j}-\pi_{2j})+\sum\limits_{i,j}a_{ij}\mu_{1i}\mu_{2j}\pi_{2j}(\gamma_{1i}-\pi_{1i})+\sum\limits_{i,j}a_{ij}\mu_{1i}\pi_{1i}\mu_{2j}(\gamma_{2j}-\pi_{2j})\\
&\equiv&S_{1,1}\\
&&+S_{2,1}+S_{2,2}\\
&&+S_{3,1}+S_{3,2}\\
&&+S_{4,1}+S_{4,2}+S_{4,3}.
\end{array}
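To make "straightforward to check" concrete: the term $\mathbb{E}[\sum_{i,j}a_{ij}\gamma_{1i}\gamma_{2j}Z_{1i}Z_{2j}]$ appears on both sides (inside $S_{1,1}$) and cancels, and if $\mathbb{E}[\gamma_{1i}\gamma_{2j}]=\pi_{1i}\pi_{2j}$ (e.g., when $\boldsymbol{\gamma}_{1}$ and $\boldsymbol{\gamma}_{2}$ are independent), what remains is a purely algebraic identity that can be checked pathwise. A Python sketch under these assumptions:

```python
import numpy as np

# Pathwise check of the eight-term decomposition. E[sum a_ij g1 g2 Z1 Z2]
# cancels on both sides, and E[g1_i g2_j] = p1_i p2_j is substituted for the
# remaining expectation, leaving an algebraic identity in the realized values.
rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
mu1, mu2 = rng.standard_normal(n), rng.standard_normal(n)
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)     # centered parts
p1, p2 = rng.uniform(size=n), rng.uniform(size=n)
g1 = (rng.uniform(size=n) < p1).astype(float)
g2 = (rng.uniform(size=n) < p2).astype(float)

bil = lambda u, v: u @ A @ v        # sum_{i,j} a_ij u_i v_j

lhs = bil(g1 * (Z1 + mu1), g2 * (Z2 + mu2)) - bil(p1 * mu1, p2 * mu2)
rhs = (bil(g1 * Z1, g2 * Z2)                                             # S_{1,1} + E-part
       + bil(mu1 * (g1 - p1), g2 * Z2) + bil(mu1 * p1, g2 * Z2)          # S_{2,1}, S_{2,2}
       + bil(g1 * Z1, mu2 * (g2 - p2)) + bil(g1 * Z1, mu2 * p2)          # S_{3,1}, S_{3,2}
       + bil(mu1 * (g1 - p1), mu2 * (g2 - p2))                           # S_{4,1}
       + bil(mu1 * (g1 - p1), mu2 * p2)                                  # S_{4,2}
       + bil(mu1 * p1, mu2 * (g2 - p2)))                                 # S_{4,3}
assert np.isclose(lhs, rhs)
```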

We will use Theorem 1 for the bilinear forms $S_{1,1},S_{2,1},S_{3,1},S_{4,1}$, and Hoeffding's inequality, stated below, for the linear combinations $S_{2,2},S_{3,2},S_{4,2},S_{4,3}$. For example, for $S_{2,1}=\sum_{i,j}a_{ij}\mu_{1i}(\gamma_{1i}-\pi_{1i})\gamma_{2j}Z_{2j}$, we treat the $\gamma_{1i}-\pi_{1i}$ as centered sub-Gaussian variables with $\psi_{2}$-norm at most $1$ that are completely observed (i.e., with observation probability one). Applying the union bound and combining the results for the $S_{i,j}$, we can derive the bound for ${\rm P}\big[|S|>t\big]$.

Theorem 5 (Hoeffding’s inequality).

Let $Z_{i}$, $i=1,\ldots,n$, be mean-zero independent sub-Gaussian variables satisfying $\max_{1\leq i\leq n}||Z_{i}||_{\psi_{2}}\leq K$, and let $\gamma_{i}$, $i=1,\ldots,n$, be independent Bernoulli variables. Also, suppose $Z_{i}$ and $\gamma_{i}$ are independent for all $i$. Let $\boldsymbol{\alpha}=(\alpha_{i},1\leq i\leq n)^{\top}\in\mathbb{R}^{n}$ be a given coefficient vector. Then, we have

\mathbb{E}\exp\left(\lambda\sum_{i=1}^{n}\alpha_{i}\gamma_{i}Z_{i}\right)\leq\exp\left(C\lambda^{2}K^{2}\sum_{i=1}^{n}\alpha_{i}^{2}\right),\quad\lambda>0,

for some numerical constant $C>0$. Moreover, we have

{\rm P}\left[\Big|\sum\limits_{i=1}^{n}\alpha_{i}\gamma_{i}Z_{i}\Big|>t\right]\leq 2\exp\left\{-\dfrac{ct^{2}}{K^{2}||\boldsymbol{\alpha}||_{2}^{2}}\right\},\quad t>0,

for some numerical constant $c>0$.
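As an illustration of the MGF bound (not a proof of the general sub-Gaussian case), for standard normal $Z_{i}$ each factor of the MGF is available in closed form, and the bound holds with the explicit constant $C=1/2$, taking $K=1$ under a variance-proxy convention; a minimal Python check:

```python
import numpy as np

# For Z_i ~ N(0,1): E exp(lam * a_i * g_i * Z_i) = (1 - p_i) + p_i * exp((lam*a_i)^2/2),
# and each factor is at most exp((lam*a_i)^2/2), giving C = 1/2 in this Gaussian case.
rng = np.random.default_rng(4)
n, lam = 10, 0.7
alpha = rng.standard_normal(n)
pi = rng.uniform(size=n)
mgf = np.prod((1 - pi) + pi * np.exp(0.5 * (lam * alpha) ** 2))
bound = np.exp(0.5 * lam ** 2 * np.sum(alpha ** 2))
assert mgf <= bound + 1e-12
```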