
Covariate-Adjusted Inference for Differential Analysis of High-Dimensional Networks

Aaron Hudson and Ali Shojaie
Department of Biostatistics, University of Washington, Seattle, Washington
Abstract

Differences between biological networks corresponding to disease conditions can help delineate the underlying disease mechanisms. Existing methods for differential network analysis do not account for dependence of networks on covariates. As a result, these approaches may detect spurious differential connections induced by the effect of the covariates on both the disease condition and the network. To address this issue, we propose a general covariate-adjusted test for differential network analysis. Our method assesses differential network connectivity by testing the null hypothesis that the network is the same for individuals who have identical covariates and only differ in disease status. We show empirically in a simulation study that the covariate-adjusted test exhibits improved type-I error control compared with naïve hypothesis testing procedures that do not account for covariates. We additionally show that there are settings in which our proposed methodology provides improved power to detect differential connections. We illustrate our method by applying it to detect differences in breast cancer gene co-expression networks by subtype.

1 Introduction

Complex diseases are often associated with aberrations in biological networks, such as gene regulatory networks and brain functional or structural connectivity networks [1]. Performing differential network analysis, or identifying connections in biological networks that change with disease condition, can provide insights into the disease mechanisms and lead to the identification of network-based biomarkers [19, 10].

Probabilistic graphical models are commonly used to summarize the conditional independence structure of a set of nodes in a biological network. A common approach to differential network analysis is to first estimate the graph corresponding to each disease condition and then assess between-condition differences in the graph. For instance, when using Gaussian graphical models, one can learn the network by estimating the inverse covariance matrix using the graphical LASSO [9]; one can then identify changes in the inverse covariance matrix associated with disease condition [49, 38, 15]. Alternatively, the condition-specific networks can be estimated using neighborhood selection [27]; in this approach, partial correlations among nodes are estimated by fitting a series of linear regressions in which one node is treated as the outcome, and the remaining nodes are treated as regressors. Changes in the network can then be delineated from differences in the regression coefficients by disease condition [2, 39]. More generally, the condition-specific networks are often modeled using exponential family pairwise interaction models [23, 40, 44, 43].

The approach to differential network analysis described above may lead to the detection of between-group differences in biological networks that are not necessarily meaningful, in particular, when the condition-specific networks depend on covariates (e.g., age and sex). This is because between-group network differences can be induced by confounding variables, i.e., variables that are associated with both the within-group networks, and the disease condition. In such cases, the network differences by disease condition may only reflect the association between the confounding variable and the disease. It is therefore important to account for the relationship between covariates and biological networks when performing differential network analysis.

In this paper, we propose a two-sample test for differential network analysis that accounts for within-group dependence of the networks on covariates. More specifically, we propose to perform covariate-adjusted inference using a class of pairwise interaction models for the within-group networks. Our approach treats each condition-specific network as a function of the covariates. It then performs a hypothesis test for equivalence of these functions. To accommodate the high-dimensional setting, in which the number of nodes in the network is large relative to the number of samples collected, we propose to estimate the networks using a regularized estimator and to perform hypothesis testing using a bias-corrected version of the regularized estimate [11].

Our proposal is related to existing literature on modeling networks as functions of a small number of variables. For example, there are various proposals for estimating high-dimensional inverse covariance matrices, conditional upon continuous low-dimensional features [50, 36]. Also related are methods for regularized estimation of high-dimensional varying coefficient models, wherein the regression coefficients are functions of a small number of covariates [35]. Our method is similar but places a particular emphasis on hypothesis testing in order to assess the statistical significance of observed changes in the network. Our approach lays the foundation for a general class of graphical models and is the first, to the best of our knowledge, to perform covariate-adjusted hypothesis tests for differential network analysis.

The rest of the paper is organized as follows. In Section 2, we begin with a broad overview of our proposed framework for covariate-adjusted differential network analysis in pairwise interaction exponential family models and introduce some working examples. In the following sections, we specialize our framework by considering two different approaches for estimation and inference: In Section 3, we describe a method that uses neighborhood selection [27, 7, 40], and in Section 4, we discuss an alternative estimation approach that utilizes the score matching framework of Hyvärinen [17, 18]. We assess the performance of our proposed methodology on synthetic data in Section 5 and apply it to a breast cancer data set from The Cancer Genome Atlas (TCGA) [37] in Section 6. We conclude with a brief discussion in Section 7.

2 Overview of the Proposed Framework

2.1 Differential Network Analysis without Covariate Adjustment

To formalize our problem, we begin by introducing some notation. We compare networks between two groups, labeled by $g\in\{\mathrm{I},\mathrm{II}\}$. We obtain measurements of $p$ variables $X^{g}=\left(X^{g}_{1},\ldots,X_{p}^{g}\right)^{\top}$, corresponding to nodes in a graphical model [26], on $n^{\mathrm{I}}$ subjects in group $\mathrm{I}$ and $n^{\mathrm{II}}$ subjects in group $\mathrm{II}$. We define $\mathcal{X}\subseteq\mathds{R}^{p}$ as the sample space of $X^{g}$. Let $X^{g}_{i,j}$ denote the data for node $j$ for subject $i$ in group $g$, and let $\mathbf{X}_{j}^{g}=(X^{g}_{1,j},\ldots,X^{g}_{n^{g},j})^{\top}$ be an $n^{g}$-dimensional vector of measurements on node $j$ for group $g$.

Our objective is to determine whether the association between variables $X_{j}$ and $X_{k}$, conditional upon all other variables, differs by group. Our approach is to specify a model for $X^{g}$ such that the conditional dependence between any two nodes $X^{g}_{j}$ and $X^{g}_{k}$ can be represented by a single scalar parameter $\beta^{g,*}_{j,k}$. If the association between nodes $j$ and $k$ is the same in both groups $\mathrm{I}$ and $\mathrm{II}$, then $\beta^{\mathrm{I},*}_{j,k}=\beta^{\mathrm{II},*}_{j,k}$. Conversely, if $\beta^{\mathrm{I},*}_{j,k}\neq\beta^{\mathrm{II},*}_{j,k}$, we say nodes $j$ and $k$ are differentially connected. We assess differential connectivity by performing a test of the null hypothesis

$$H^{0}_{j,k}:\beta^{\mathrm{I},*}_{j,k}=\beta^{\mathrm{II},*}_{j,k}. \qquad (1)$$

We consider a general class of exponential family pairwise interaction models. For $x=(x_{1},\ldots,x_{p})^{\top}$, we assume the density function for $X^{g}$ takes the form

$$f^{g,*}(x)=\exp\left(\sum_{j=1}^{p}\mu_{j}(x_{j})+\sum_{j=1}^{p}\sum_{k=1}^{j}\beta^{g,*}_{j,k}\psi_{j,k}(x_{j},x_{k})-U\left(\boldsymbol{\beta}^{g,*}\right)\right), \qquad (2)$$

where $\psi_{j,k}$ and $\mu_{j}$ are fixed and known functions, $\boldsymbol{\beta}^{g,*}$ is a $p\times p$ matrix with elements $\beta^{g,*}_{j,k}$, and $U(\boldsymbol{\beta}^{g,*})$ is the log-partition function. The dependence between $X^{g}_{j}$ and $X^{g}_{k}$ is measured by $\beta^{g,*}_{j,k}$, and nodes $j$ and $k$ are conditionally independent in group $g$ if and only if $\beta^{g,*}_{j,k}=0$.

This class of exponential family distributions is rich and includes several models that have been studied previously in the graphical modeling literature. One such example is the Gaussian graphical model, perhaps the most widely-used graphical model for continuous data. For $x\in\mathds{R}^{p}$, the density function for mean-centered Gaussian random vectors can be expressed as

$$f^{g,*}(x)\propto\exp\left(-\sum_{j=1}^{p}\sum_{k=1}^{j}\beta^{g,*}_{j,k}x_{j}x_{k}\right), \qquad (3)$$

and is thus a special case of (2) with $\psi_{j,k}=-x_{j}x_{k}$ and $\mu_{j}=0$. The non-negative Gaussian density, which takes the form of (3) with the constraint that $x$ takes values in $\mathds{R}^{p}_{+}$, also belongs to the exponential family class. Another canonical example is the Ising model, commonly used for studying conditional dependencies among binary random variables. For $x\in\{0,1\}^{p}$, the density function for the Ising model can be expressed as

$$f(x)\propto\exp\left(\sum_{j=1}^{p}\sum_{k=1}^{j}\beta_{j,k}x_{j}x_{k}\right).$$

Additional examples include the Poisson model, the exponential graphical model, and conditionally-specified mixed graphical models [40, 7].

When asymptotically normal estimates of $\beta^{\mathrm{I},*}_{j,k}$ and $\beta^{\mathrm{II},*}_{j,k}$ are available, one can perform a calibrated test of $H^{0}_{j,k}$ based on the difference between the estimates. In many cases, asymptotically normal estimates can be obtained using well-established methodology. For instance, when the log-partition function $U(\boldsymbol{\beta}^{g,*})$ is available in closed form and is tractable, one can obtain estimates via (penalized) maximum likelihood. This is a standard approach in the Gaussian setting, in which case the log-partition function is easy to compute. However, this is not the case for other exponential family models. Likelihood-based estimation strategies are thus generally difficult to implement. In this paper, we consider two alternative strategies that have been proposed to overcome these computational challenges and are more broadly applicable.

The first approach we discuss is neighborhood selection [7, 27, 40]. Consider a sub-class of exponential family graphical models for which the conditional density function for any node $X^{g}_{j}$ given the remaining nodes belongs to a univariate exponential family model. Because the log-partition function in univariate exponential family models is available in closed form, it is computationally feasible to estimate each conditional density function. By estimating the conditional density functions, one can identify the neighbors of node $j$, that is, the nodes upon which its conditional distribution depends. This approach was first proposed as an alternative to maximum likelihood estimation for Gaussian graphical models [27]. To describe our approach, we focus on the Gaussian case, though this approach is more widely applicable and can be used for modeling dependencies among, e.g., Poisson, binomial, and exponential random variables as well [7, 40].

In Gaussian graphical models, the dependency of node $j$ on all other nodes can be determined based on the linear model

$$\mathds{E}\left[X^{g}_{j}|X^{g}_{1},\ldots,X^{g}_{p}\right]=\beta^{g,*}_{j,0}+\sum_{k\neq j}\beta^{g,*}_{j,k}X^{g}_{k}. \qquad (4)$$

The regression coefficients $\beta^{g,*}_{j,k}$ measure the strength of linear association between nodes $j$ and $k$ conditional upon all other nodes and are zero if and only if nodes $j$ and $k$ are conditionally independent; $\beta^{g,*}_{j,0}$ is an intercept term and is zero if all nodes are mean-centered. (We acknowledge a slight abuse of notation here, as the regression coefficients in (4) are not equivalent to the parameters in (2). However, either estimand fully characterizes conditional independence.) In the low-dimensional setting (i.e., $p\ll n^{g}$), statistically efficient and asymptotically normal estimates of the regression coefficients can be readily obtained via ordinary least squares. In high dimensions (i.e., $p\geq n^{g}$), the ordinary least squares estimates are inconsistent, so to obtain consistent estimates we typically rely upon regularized estimators such as the LASSO and the elastic net [33, 51]. Regularized estimators are generally biased and have intractable sampling distributions, and as such, are unsuitable for performing formal statistical inference. However, several methods have recently emerged for obtaining asymptotically normal estimates by correcting the bias of regularized estimators [20, 12, 47].
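To make the low-dimensional neighborhood selection step concrete, the following sketch fits the node-wise regressions in (4) by ordinary least squares for one group; the function and variable names are ours, not part of the paper's implementation, and the data matrix is assumed to have full column rank.

```r
# Minimal sketch (not the authors' code): neighborhood selection for a Gaussian
# graphical model in low dimensions via node-wise ordinary least squares.
# X is an n x p data matrix for one group.
neighborhood_ols <- function(X) {
  p <- ncol(X)
  coef_mat <- matrix(0, p, p)      # coef_mat[j, k] estimates beta*_{j,k}
  se_mat   <- matrix(NA, p, p)     # standard errors, used later for testing
  for (j in seq_len(p)) {
    fit <- lm(X[, j] ~ X[, -j])    # regress node j on all remaining nodes
    est <- summary(fit)$coefficients[-1, , drop = FALSE]  # drop the intercept row
    coef_mat[j, -j] <- est[, "Estimate"]
    se_mat[j, -j]   <- est[, "Std. Error"]
  }
  list(coef = coef_mat, se = se_mat)
}
```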

The second computationally efficient approach we consider is to estimate the density function using the score matching framework of Hyvärinen [17, 18]. Hyvärinen derives a loss function for estimation of density functions for continuous random variables that is based on the gradient of the log-density with respect to the observations. As such, the score matching loss does not depend on the log-partition function in exponential family models. Moreover, when the joint distribution for $X^{g}$ belongs to an exponential family model, the loss is quadratic in the unknown parameters, allowing for efficient computation. In low dimensions, the minimizer of the score matching loss is consistent and asymptotically normal. In high dimensions, one can obtain asymptotically normal estimates by minimizing a regularized version of the score matching loss to obtain an initial estimate [23, 44] and subsequently correcting for the bias induced by regularization [43].

2.2 Covariate-Adjusted Differential Network Analysis

We now consider the setting in which the within-group networks depend on covariates. We denote by $W^{g}$ a $q$-dimensional random vector of covariate measurements for group $g$, and we define $\mathcal{W}$ as the sample space of $W^{g}$. Let $W^{g}_{i,r}$ refer to the value of covariate $r$ for subject $i$ in group $g$, and let $W^{g}_{i}=(W^{g}_{i,1},\ldots,W^{g}_{i,q})^{\top}$ be a $q$-dimensional vector containing all covariates for subject $i$ in group $g$. We assume the number of covariates is small relative to the sample size (i.e., $q\ll n^{g}$).

To study the dependence of the within-group networks on the covariates, we specify a model for the nodes $X^{g}$ given the covariates $W^{g}$ that allows the inter-node dependencies to vary as a function of $W^{g}$. The model defines a function $\eta^{g,*}_{j,k}:\mathcal{W}\to\mathds{R}$ that takes as input a vector of covariates and returns a measure of association between nodes $j$ and $k$ for a subject in group $g$ with those covariate values. One can interpret $\eta^{g,*}_{j,k}$ as a conditional version of $\beta^{g,*}_{j,k}$, given the covariates.

We assume that $\eta^{g,*}_{j,k}$ can be written as a low-dimensional linear basis expansion in $W^{g}$ of dimension $d$ — that is,

$$\eta^{g,*}_{j,k}\left(W^{g}\right)=\left\langle\phi\left(W^{g}\right),\alpha_{j,k}^{g,*}\right\rangle, \qquad (5)$$

where $\phi:\mathds{R}^{q}\to\mathds{R}^{d}$ is a map from a set of covariates to its expansion, $\alpha_{j,k}^{g,*}$ is a $d$-dimensional vector, and $\langle\cdot,\cdot\rangle$ denotes the vector inner product. Let $\phi_{c}(w)$ refer to the $c$-th element of $\phi(w)$. One can take the simple approach of specifying $\phi$ as a linear basis, $\phi(w)=\left(1,w_{1},\ldots,w_{q}\right)$ for $w\in\mathds{R}^{q}$, though more flexible choices such as polynomial or B-spline bases can also be considered. It may be preferable to specify $\phi$ so that $\eta^{g,*}_{j,k}$ is an additive function of the covariates. This allows one to easily assess the effect of any specific covariate on the network by estimating the sub-vector of $\alpha^{g,*}_{j,k}$ that is relevant to the covariate of interest.
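As an illustration of the basis map $\phi$, the sketch below codes a linear basis and an additive cubic polynomial basis for two covariates; the function names are hypothetical and the choice of bases mirrors the ones used later in Section 5.

```r
# Two possible basis maps phi for q = 2 covariates (names are ours).
phi_linear <- function(w) c(1, w[1], w[2])                      # d = 3
phi_cubic  <- function(w) c(1, w[1], w[1]^2, w[1]^3,
                            w[2], w[2]^2, w[2]^3)               # d = 7, additive in w1, w2

# Example: evaluate the cubic expansion at one subject's covariate vector.
phi_cubic(c(0.2, -0.5))
```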

When the association between nodes $j$ and $k$ does not depend on group membership, $\eta^{\mathrm{I},*}_{j,k}(w)=\eta^{\mathrm{II},*}_{j,k}(w)$ for all $w$, and $\alpha^{\mathrm{I},*}_{j,k}=\alpha^{\mathrm{II},*}_{j,k}$. In other words, if one subject from group $\mathrm{I}$ and another subject from group $\mathrm{II}$ have identically-valued covariates, the corresponding measure of association between nodes $j$ and $k$ is also the same. In the covariate-adjusted setting, we say that nodes $j$ and $k$ are differentially connected if there exists $w$ such that $\eta^{\mathrm{I},*}_{j,k}(w)\neq\eta^{\mathrm{II},*}_{j,k}(w)$, or equivalently, if $\alpha^{\mathrm{I},*}_{j,k}\neq\alpha^{\mathrm{II},*}_{j,k}$. We can thus assess differential connectivity between nodes $j$ and $k$ by testing the null hypothesis

$$G^{0}_{j,k}:\alpha^{\mathrm{I},*}_{j,k}=\alpha^{\mathrm{II},*}_{j,k}. \qquad (6)$$

Similar to the unadjusted setting, when asymptotically normal estimates of $\alpha^{\mathrm{I},*}_{j,k}$ and $\alpha^{\mathrm{II},*}_{j,k}$ are available, a calibrated test can be constructed based on the difference between the estimates.

We now specify a form for the conditional distribution of $X^{g}$ given $W^{g}$ as a generalization of the exponential family pairwise interaction model (2). We assume the conditional density for $X^{g}$ given $W^{g}$ can be expressed as

$$f^{g,*}(x|w)\propto\exp\left(\sum_{j=1}^{p}\mu_{j}(x_{j})+\sum_{j=1}^{p}\sum_{k=1}^{j}\eta^{g,*}_{j,k}(w)\psi_{j,k}(x_{j},x_{k})+\sum_{j=1}^{p}\sum_{c=1}^{d}\theta_{j,c}^{g,*}\zeta_{j,c}\left(x_{j},\phi_{c}(w)\right)\right), \qquad (7)$$

where $w=(w_{1},\ldots,w_{q})^{\top}$, and the proportionality is up to a normalizing constant that does not depend on $x$. Above, $\zeta_{j,c}$ is a fixed and known function, and the main effects of the covariates on $X^{g}$ are represented by the scalar parameters $\theta^{g,*}_{j,c}$. The conditional dependence between nodes $j$ and $k$, given all other nodes and given that $W^{g}=w$, is quantified by $\eta^{g,*}_{j,k}(w)$, and $\eta^{g,*}_{j,k}(w)=0$ if and only if nodes $j$ and $k$ are conditionally independent at $w$. One can thus view $\eta^{g,*}_{j,k}$ as a conditional version of $\beta^{g,*}_{j,k}$ in (2).

Either of the estimation strategies introduced in Section 2.1 can be used to perform covariate-adjusted inference. When the conditional distribution of each node given the remaining nodes and the covariates belongs to a univariate exponential family model, the covariate-dependent network can be estimated using neighborhood selection because the node conditional distributions can be estimated efficiently with likelihood-based methods. Alternatively, we can estimate the conditional density function (7) using score matching.

As a working example, we again consider estimation of covariate-dependent Gaussian networks using neighborhood selection. Suppose the conditional distribution of $X^{g}$ given $W^{g}$ takes the form

$$f^{g,*}(x|w)\propto\exp\left(-\sum_{j=1}^{p}\sum_{k=1}^{j}\eta^{g,*}_{j,k}(w)x_{j}x_{k}-\sum_{j=1}^{p}\sum_{c=1}^{d}\theta_{j,c}^{g,*}x_{j}\phi_{c}(w)\right). \qquad (8)$$

Then the dependencies of node $j$ on all other nodes can be determined based on the following varying coefficient model [14]:

$$\mathds{E}\left[X^{g}_{j}|X_{1}^{g},\ldots,X_{p}^{g},W^{g}\right]=\eta_{j,0}^{g,*}(W^{g})+\sum_{k\neq j}\eta^{g,*}_{j,k}\left(W^{g}\right)X_{k}^{g}. \qquad (9)$$

The varying coefficient model is a generalization of the linear model that treats the regression coefficients as functions of the covariates. In (9), $\eta^{g,*}_{j,k}(w)$ returns a regression coefficient that quantifies the linear relationship between nodes $j$ and $k$ for subjects in group $g$ with covariates equal to $w$. Then $X^{g}_{j}$ and $X^{g}_{k}$ are conditionally independent given all other nodes and given $W^{g}=w$ if and only if $\eta^{g,*}_{j,k}(w)=0$. The varying coefficients $\eta^{g,*}_{j,k}$ can thus be viewed as conditional versions of the regression coefficients in (4). (We have again abused the notation, as the varying coefficient functions in (9) are not equal to the parameters in (8), though both functions are zero for the same values of $w$.) The intercept term $\eta^{g,*}_{j,0}$ accounts for the main effect of $W^{g}$ on $X^{g}_{j}$. We can remove this main effect by first centering the nodes $X^{g}_{j}$ about their conditional mean given $W^{g}$, which can be estimated by performing a linear regression of $X^{g}_{j}$ on $\phi(W^{g})$.
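The centering step described above can be carried out, for example, as in the following sketch, which replaces each node by its residual from a linear regression on $\phi(W^{g})$; the function name and the assumption that the intercept is already a column of the basis-expanded covariate matrix are ours.

```r
# Minimal sketch, assuming X (n x p nodes) and Phi (n x d basis-expanded covariates):
# remove the main effect of the covariates by replacing each node with its residual
# from a linear regression of that node on phi(W).
center_nodes <- function(X, Phi) {
  # Phi is assumed to contain the intercept column, so lm's own intercept is suppressed.
  apply(X, 2, function(xj) residuals(lm(xj ~ Phi - 1)))
}
```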

In Sections 3 and 4, we discuss construction of asymptotically normal estimators of $\alpha^{g,*}_{j,k}$ in the low- and high-dimensional settings using neighborhood selection and score matching. Before proceeding, we first examine the connection between the null hypotheses $H^{0}_{j,k}$ and $G^{0}_{j,k}$.

2.3 The Relationship between Hypotheses $H^{0}_{j,k}$ and $G^{0}_{j,k}$

Hypotheses $H^{0}_{j,k}$ in (1) and $G^{0}_{j,k}$ in (6) are related but not equivalent: it is possible that $H^{0}_{j,k}$ holds while $G^{0}_{j,k}$ fails, and vice versa. We provide an example below. Suppose we are using neighborhood selection to perform differential network analysis in the Gaussian setting, so we are comparing linear regression coefficients between the two groups. Suppose further that the within-group networks depend on a single scalar covariate $W^{g}$, and the nodes are centered about their conditional mean given $W^{g}$. One can show that the regression coefficients $\beta^{g,*}_{j,k}$ are equal to the average of their conditional versions $\eta^{g,*}_{j,k}(W^{g})$. That is, $\beta^{g,*}_{j,k}=\mathds{E}[\eta^{g,*}_{j,k}(W^{g})]$. Now, suppose $G^{0}_{j,k}$ holds. If $W^{\mathrm{I}}$ and $W^{\mathrm{II}}$ do not share the same distribution (e.g., the covariate tends to take higher values in group $\mathrm{I}$ than in group $\mathrm{II}$), the average conditional inter-node association may differ, and $H^{0}_{j,k}$ may not hold. Although the conditional association between nodes, given the covariate, does not differ by group, the average conditional association does differ, as illustrated in Figure 1a. In such a scenario, the difference in the average conditional association is induced by the dependence of the covariate on group membership and the dependence of the inter-node association on the covariate. Thus, inequality of $\beta^{\mathrm{I},*}_{j,k}$ and $\beta^{\mathrm{II},*}_{j,k}$ does not necessarily capture a meaningful association between the network and group membership.

Similarly, when $H^{0}_{j,k}$ holds, it is possible that $\eta^{\mathrm{I},*}_{j,k}\neq\eta^{\mathrm{II},*}_{j,k}$. For instance, suppose that the distribution of the covariate is the same in both groups, and $\mathds{E}[\eta^{g,*}_{j,k}(W^{g})]=0$ in both groups. If the between-node association depends more strongly upon the covariate in one group than in the other, $G^{0}_{j,k}$ will be false. This example is depicted in Figure 1b. In this scenario, adjusting for covariates should provide improved power to detect differential connections. We note that for other distributions, it does not necessarily hold that $\beta^{g,*}_{j,k}=\mathds{E}[\eta^{g,*}_{j,k}(W^{g})]$, but regardless, there is generally no equivalence between hypotheses $H^{0}_{j,k}$ and $G^{0}_{j,k}$.
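The following toy computation (ours, not taken from the paper) illustrates the Figure 1a scenario: the conditional association $\eta(w)$ is identical in the two groups, but because the covariate distributions differ, the marginal coefficients $\beta^{g,*}_{j,k}=\mathds{E}[\eta^{g,*}_{j,k}(W^{g})]$ do not agree, so the unadjusted null $H^{0}_{j,k}$ fails even though $G^{0}_{j,k}$ holds.

```r
# Toy illustration of confounding by a covariate with group-dependent distribution.
set.seed(1)
eta <- function(w) 0.5 + 0.5 * w                 # same conditional association in both groups
wI  <- 2 * rbeta(1e5, 3/2, 1) - 1                # group I covariate, rescaled to (-1, 1)
wII <- 2 * rbeta(1e5, 1, 3/2) - 1                # group II covariate, a different distribution
c(beta_I = mean(eta(wI)), beta_II = mean(eta(wII)))   # marginal coefficients differ
```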

Figure 1: Displayed are the association between nodes $j$ and $k$, $\eta^{g}_{j,k}(\cdot)$, as a function of the covariate $W^{g}$, and the distribution of $W^{g}$ in groups $\mathrm{I}$ and $\mathrm{II}$. The average inter-node association is represented by the dashed colored lines. In (a), the average inter-node association depends on group membership, though the inter-node association given the covariate does not. In (b), the average inter-node association does not depend on group membership, though the conditional association between nodes given the covariate does depend on group membership.

3 Covariate-Adjusted Differential Network Analysis Using Neighborhood Selection

In this section, we describe in detail an approach for covariate-adjusted differential network analysis using neighborhood selection. To simplify our presentation, we focus on Gaussian graphical models, though this strategy is generally applicable to graphical models for which the node conditional distributions belong to univariate exponential family models.

3.1 Covariate-Adjustment via Neighborhood Selection in Low Dimensions

We first discuss testing the unadjusted null hypothesis $H^{0}_{j,k}$ in (1), where the $\beta^{g,*}_{j,k}$ are the regression coefficients in (4). Suppose, for now, that we are in the low-dimensional setting, so the number of nodes $p$ is smaller than the sample sizes $n^{g}$, $g\in\{\mathrm{I},\mathrm{II}\}$.

It is well-known that the regression coefficients can be characterized as the minimizers of the expected least squares loss — that is,

$$\boldsymbol{\beta}_{j}^{g,*}=(\beta^{g,*}_{j,1},\ldots,\beta^{g,*}_{j,p})^{\top}=\underset{\beta_{1},\ldots,\beta_{p}\in\mathds{R}}{\text{arg min}}\;\mathds{E}\left[\left(X^{g}_{j}-\sum_{k\neq j}X^{g}_{k}\beta_{k}\right)^{2}\right].$$

One can obtain an estimate $\hat{\boldsymbol{\beta}}_{j}^{g}=(\hat{\beta}^{g}_{j,1},\ldots,\hat{\beta}^{g}_{j,p})$ of $\boldsymbol{\beta}^{g,*}_{j}=(\beta^{g,*}_{j,1},\ldots,\beta^{g,*}_{j,p})$ by minimizing the empirical average of the least squares loss, taking

$$\hat{\boldsymbol{\beta}}^{g}_{j}=\underset{\beta_{1},\ldots,\beta_{p}\in\mathds{R}}{\text{arg min}}\;\frac{1}{n^{g}}\left\|\mathbf{X}^{g}_{j}-\sum_{k\neq j}\mathbf{X}^{g}_{k}\beta_{k}\right\|_{2}^{2},$$

where $\|\cdot\|_{2}$ denotes the $\ell_{2}$ norm. The ordinary least squares estimate $\hat{\boldsymbol{\beta}}^{g}_{j}$ is available in closed form and is easy to compute. The estimates $\hat{\beta}^{g}_{j,k}$ are unbiased and, under mild assumptions, are approximately normally distributed for sufficiently large $n^{g}$ — that is,

$$\hat{\beta}^{g}_{j,k}\sim N\left(\beta^{g,*}_{j,k},\tau^{g}_{j,k}\right),$$

with $\tau^{g}_{j,k}>0$ (though $\tau^{g}_{j,k}$ can be calculated in closed form, we omit the expression for brevity).

We construct a test of $H_{j,k}^{0}$ based on the difference between the estimates of the group-specific regression coefficients, $\hat{\beta}^{\mathrm{I}}_{j,k}-\hat{\beta}^{\mathrm{II}}_{j,k}$. When $H^{0}_{j,k}$ holds, $\hat{\beta}^{\mathrm{I}}_{j,k}-\hat{\beta}^{\mathrm{II}}_{j,k}$ is normally distributed with mean zero and variance $\tau^{\mathrm{I}}_{j,k}+\tau^{\mathrm{II}}_{j,k}$. Given a consistent estimate $\hat{\tau}^{g}_{j,k}$ of the variance, we can use the test statistic

$$T_{j,k}=\frac{\left(\hat{\beta}_{j,k}^{\mathrm{I}}-\hat{\beta}^{\mathrm{II}}_{j,k}\right)^{2}}{\hat{\tau}^{\mathrm{I}}_{j,k}+\hat{\tau}^{\mathrm{II}}_{j,k}},$$

which follows a chi-squared distribution with one degree of freedom under the null for $n^{\mathrm{I}}$ and $n^{\mathrm{II}}$ sufficiently large. A p-value for $H^{0}_{j,k}$ can be calculated as

$$\rho_{j,k}=P\left(\chi^{2}_{1}>T_{j,k}\right).$$
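In code, this unadjusted test can be assembled directly from two group-specific least squares fits, as in the following sketch (function and variable names are ours).

```r
# Minimal sketch of the unadjusted test of H0_{j,k}: compare OLS estimates of
# beta_{j,k} between groups. XI and XII are the two groups' n x p data matrices.
unadjusted_test <- function(XI, XII, j, k) {
  get_est <- function(X) {
    coefs <- summary(lm(X[, j] ~ X[, -j]))$coefficients[-1, , drop = FALSE]
    row   <- which(setdiff(seq_len(ncol(X)), j) == k)   # position of node k among regressors
    c(est = coefs[row, "Estimate"], var = coefs[row, "Std. Error"]^2)
  }
  a <- get_est(XI); b <- get_est(XII)
  T_jk <- (a["est"] - b["est"])^2 / (a["var"] + b["var"])
  c(statistic = unname(T_jk),
    p_value   = pchisq(unname(T_jk), df = 1, lower.tail = FALSE))
}
```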

In the low-dimensional setting, performing a covariate-adjusted test is similar to performing the unadjusted test. We can obtain an estimate $\hat{\boldsymbol{\alpha}}^{g}_{j}=\left((\hat{\alpha}^{g}_{j,1})^{\top},\ldots,(\hat{\alpha}^{g}_{j,p})^{\top}\right)^{\top}$ of $\boldsymbol{\alpha}^{g,*}_{j}=\left((\alpha^{g,*}_{j,1})^{\top},\ldots,(\alpha^{g,*}_{j,p})^{\top}\right)^{\top}$ by minimizing the empirical average of the least squares loss

$$\hat{\boldsymbol{\alpha}}^{g}_{j}=\underset{\alpha_{j,1},\ldots,\alpha_{j,p}\in\mathds{R}^{d}}{\text{arg min}}\;\frac{1}{n^{g}}\sum_{i=1}^{n^{g}}\left(X^{g}_{i,j}-\sum_{k\neq j}\left\langle\phi\left(W_{i}^{g}\right),\alpha_{j,k}\right\rangle X_{i,k}^{g}\right)^{2}. \qquad (10)$$

To simplify the presentation, we introduce additional notation that allows us to rewrite (10) in a condensed form. Let $\mathcal{V}^{g}_{k}$ be the $n^{g}\times d$ matrix

$$\mathcal{V}^{g}_{k}=\begin{pmatrix}X_{1,k}^{g}\times\phi\left(W_{1}^{g}\right)\\ \vdots\\ X_{n^{g},k}^{g}\times\phi\left(W^{g}_{n^{g}}\right)\end{pmatrix}. \qquad (11)$$

We can now equivalently express (10) as

$$\hat{\boldsymbol{\alpha}}^{g}_{j}=\underset{\alpha_{j,1},\ldots,\alpha_{j,p}\in\mathds{R}^{d}}{\text{arg min}}\;\frac{1}{n^{g}}\left\|\mathbf{X}^{g}_{j}-\sum_{k\neq j}\mathcal{V}^{g}_{k}\alpha_{j,k}\right\|_{2}^{2}. \qquad (12)$$

Again, $\hat{\alpha}^{g}_{j,k}$ is unbiased and approximately normally distributed for sufficiently large $n^{g}$, satisfying

$$\hat{\alpha}^{g}_{j,k}\sim N\left(\alpha^{g,*}_{j,k},\Omega^{g}_{j,k}\right),$$

where $\Omega^{g}_{j,k}$ is a positive definite matrix of dimension $d\times d$ (though a closed-form expression is available, we omit it here for brevity).

We construct a test of $G^{0}_{j,k}$ based on $\hat{\alpha}^{\mathrm{I}}_{j,k}-\hat{\alpha}^{\mathrm{II}}_{j,k}$. Under the null hypothesis, $\hat{\alpha}^{\mathrm{I}}_{j,k}-\hat{\alpha}^{\mathrm{II}}_{j,k}$ follows a normal distribution with mean zero and variance $\Omega^{\mathrm{I}}_{j,k}+\Omega^{\mathrm{II}}_{j,k}$. Given a consistent estimate $\hat{\Omega}^{g}_{j,k}$ of $\Omega^{g}_{j,k}$, we can test $G^{0}_{j,k}$ using the test statistic

$$S_{j,k}=\left(\hat{\alpha}_{j,k}^{\mathrm{I}}-\hat{\alpha}_{j,k}^{\mathrm{II}}\right)^{\top}\left(\hat{\Omega}^{\mathrm{I}}_{j,k}+\hat{\Omega}^{\mathrm{II}}_{j,k}\right)^{-1}\left(\hat{\alpha}_{j,k}^{\mathrm{I}}-\hat{\alpha}_{j,k}^{\mathrm{II}}\right).$$

Under the null, the test statistic follows a chi-squared distribution with $d$ degrees of freedom, and a p-value can therefore be calculated as

$$P\left(\chi^{2}_{d}>S_{j,k}\right).$$
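A minimal sketch of this low-dimensional covariate-adjusted test is given below, assuming the nodes have already been centered on $\phi(W^{g})$; the blocks are built as in (11), and all function and variable names are ours.

```r
# Sketch of the low-dimensional adjusted test of G0_{j,k}. X* are n x p node
# matrices, Phi* are n x d basis-expanded covariates (including the intercept).
adjusted_test <- function(XI, PhiI, XII, PhiII, j, k) {
  d <- ncol(PhiI)
  fit_group <- function(X, Phi) {
    others <- setdiff(seq_len(ncol(X)), j)
    V   <- do.call(cbind, lapply(others, function(kk) X[, kk] * Phi))  # blocks V_k = X_k * phi(W)
    fit <- lm(X[, j] ~ V - 1)                                          # nodes pre-centered, no intercept
    pos <- (which(others == k) - 1) * d + seq_len(d)                   # columns for node k
    list(alpha = coef(fit)[pos], Omega = vcov(fit)[pos, pos])
  }
  a <- fit_group(XI, PhiI); b <- fit_group(XII, PhiII)
  diff <- a$alpha - b$alpha
  S <- drop(t(diff) %*% solve(a$Omega + b$Omega) %*% diff)
  c(statistic = S, p_value = pchisq(S, df = d, lower.tail = FALSE))
}
```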

3.2 Covariate-Adjustment via Neighborhood Selection in High Dimensions

The methods described in Section 3.1 are only appropriate when the number of nodes $p$ is small relative to the sample size. Model (9) has $(p-1)d$ parameters, so the least squares estimator of Section 3.1 provides stable estimates as long as $n^{\mathrm{I}}$ and $n^{\mathrm{II}}$ are larger than $(p-1)d$. However, in the high-dimensional setting, where the number of parameters exceeds the sample size, the ordinary least squares estimates behave poorly.

To fit the varying coefficient model (9) in the high-dimensional setting, we use a regularized estimator that relies upon an assumption of sparsity in the networks. The sparsity assumption requires that within each group only a small number of nodes are partially correlated, meaning that in (9), only a few of the vectors $\alpha^{g,*}_{j,k}$ are nonzero. To leverage the sparsity assumption, we propose to use the group LASSO estimator [46]:

$$\tilde{\boldsymbol{\alpha}}_{j}^{g}=\underset{\alpha_{j,1},\ldots,\alpha_{j,p}\in\mathds{R}^{d}}{\text{arg min}}\;\frac{1}{n^{g}}\left\|\mathbf{X}^{g}_{j}-\sum_{k\neq j}\mathcal{V}^{g}_{k}\alpha_{j,k}\right\|_{2}^{2}+\lambda\sum_{k\neq j}\left\|\alpha_{j,k}\right\|_{2}, \qquad (13)$$

where $\lambda>0$ is a tuning parameter. The group LASSO provides a sparse estimate and sets some $\tilde{\alpha}_{j,k}$ to exactly zero, resulting in networks with few edges. The level of sparsity of $\tilde{\boldsymbol{\alpha}}^{g}_{j}$ is determined by $\lambda$, with higher values of $\lambda$ forcing more of the $\tilde{\alpha}_{j,k}$ to zero. We discuss selection of the tuning parameter in Section 5.1.
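For concreteness, the group LASSO problem (13) can be solved with the gglasso R package used in Section 5.1; the sketch below assumes the standard gglasso interface (see the package documentation) and is not the authors' implementation.

```r
library(gglasso)

# Sketch of the group LASSO fit in (13). V is the n x (p-1)d matrix formed by
# stacking the blocks V_k column-wise, y is the (centered) node j, and each block
# of d columns forms one group.
fit_group_lasso <- function(y, V, d, lambda) {
  grp <- rep(seq_len(ncol(V) / d), each = d)          # one group per node k
  fit <- gglasso(x = V, y = y, group = grp, loss = "ls", lambda = lambda)
  matrix(coef(fit)[-1], nrow = d)                     # d x (p-1) matrix of alpha-tilde_{j,k}
}
```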

Though the group LASSO provides a consistent estimate of $\boldsymbol{\alpha}^{g,*}_{j}$, the estimate is not approximately normally distributed. The group LASSO estimate of $\alpha^{g,*}_{j,k}$ retains a bias that diminishes at the same rate as the standard error. As a result, the group LASSO estimator has a non-standard sampling distribution that cannot be derived analytically and is therefore unsuitable for hypothesis testing.

We can obtain approximately normal estimates of $\alpha^{g,*}_{j,k}$ by correcting the bias of $\tilde{\alpha}^{g}_{j,k}$, as was first proposed to obtain normal estimates for the classical $\ell_{1}$-penalized version of the LASSO [12, 47]. These “de-biased” or “de-sparsified” estimators can be shown to be approximately normal in moderately large samples even in the high-dimensional setting; they are therefore suitable for hypothesis testing. Our approach is to use a de-biased version of the group LASSO. Bias correction in group LASSO problems is well studied [11, 16, 28], so we are able to perform covariate-adjusted inference by applying previously-developed methods.

The bias of the group LASSO estimate can be written as

$$\delta^{g}_{j,k}=\mathds{E}\left[\tilde{\alpha}^{g}_{j,k}\right]-\alpha^{g,*}_{j,k}, \qquad (14)$$

where $\delta^{g}_{j,k}$ is a nonzero $d$-dimensional vector (recall that $d$ is the dimension of $\alpha^{g,*}_{j,k}$). Our approach is to obtain an estimate $\tilde{\delta}^{g}_{j,k}$ of the bias and to use a de-biased estimator, defined as

$$\check{\alpha}^{g}_{j,k}=\tilde{\alpha}^{g}_{j,k}-\tilde{\delta}^{g}_{j,k}. \qquad (15)$$

For a suitable choice of $\tilde{\delta}^{g}_{j,k}$, the bias-corrected estimator is approximately normal for a sufficiently large sample size $n^{g}$ under mild conditions, i.e.,

$$\check{\alpha}^{g}_{j,k}\sim N\left(\alpha^{g,*}_{j,k},\Omega^{g}_{j,k}\right), \qquad (16)$$

where the variance $\Omega^{g}_{j,k}$ is a positive definite matrix, for which we obtain an estimate $\check{\Omega}^{g}_{j,k}$. We provide a derivation of the bias correction and the form of our variance estimate in Appendix A.

Similar to Section 3.1, we test the null hypothesis $G^{0}_{j,k}$ in (6) using the test statistic

$$S_{j,k}=\left(\check{\alpha}_{j,k}^{\mathrm{I}}-\check{\alpha}_{j,k}^{\mathrm{II}}\right)^{\top}\left(\check{\Omega}_{j,k}^{\mathrm{I}}+\check{\Omega}_{j,k}^{\mathrm{II}}\right)^{-1}\left(\check{\alpha}_{j,k}^{\mathrm{I}}-\check{\alpha}_{j,k}^{\mathrm{II}}\right). \qquad (17)$$

The test statistic asymptotically follows a chi-squared distribution with $d$ degrees of freedom under the null hypothesis.
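Given the de-biased estimates and variance estimates (whose construction is deferred to Appendix A), the test in (17) reduces to a few lines, as in this sketch (names are ours).

```r
# Sketch of the Wald-type test (17), given de-biased estimates and estimated
# covariance matrices for the two groups.
debiased_test <- function(alpha_I, Omega_I, alpha_II, Omega_II) {
  diff <- alpha_I - alpha_II
  S    <- drop(t(diff) %*% solve(Omega_I + Omega_II) %*% diff)
  c(statistic = S, p_value = pchisq(S, df = length(diff), lower.tail = FALSE))
}
```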

4 Covariate-Adjusted Differential Network Analysis Using Score Matching

In this section, we discuss covariate adjustment using the score matching framework introduced in Section 2. We first describe the score matching estimator in greater detail and then specialize the framework to estimation of pairwise exponential family graphical models in the low- and high-dimensional settings. As shown later in this section, for exponential family distributions with continuous support, the score matching loss function is a quadratic function of the parameters, providing a computationally efficient framework for estimating graphical models.

4.1 The Score Matching Framework

We begin by providing a brief summary of the score matching framework [17, 18]. Let $Z\in\mathcal{Z}\subseteq\mathds{R}^{p}$ be a random vector generated from a distribution with density function $h^{*}$. For any candidate density $h$, we denote the gradient and Laplacian of the log-density by

$$\nabla\log{h}(z)=\left\{\frac{\partial}{\partial z_{j}}\log h(z)\right\}_{j=1}^{p}\in\mathds{R}^{p};\qquad\Delta\log h(z)=\sum_{j=1}^{p}\frac{\partial^{2}}{\partial z_{j}^{2}}\log h(z).$$

The score matching loss $L$ is defined as a measure of divergence between a candidate density function $h$ and the true density $h^{*}$:

$$L(h)=\int\left\|\nabla\log h(z)-\nabla\log h^{*}(z)\right\|_{2}^{2}h^{*}(z)\,dz=\mathds{E}\left[\left\|\nabla\log h(Z)-\nabla\log h^{*}(Z)\right\|_{2}^{2}\right]. \qquad (18)$$

It is apparent that the score matching loss is minimized when $h=h^{*}$. A natural approach to constructing an estimator for $h^{*}$ would then be to minimize the empirical score matching loss given observations $Z_{1},\ldots,Z_{n}$, defined as

$$L_{n}(h)=\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla\log h\left(Z_{i}\right)-\nabla\log h^{*}\left(Z_{i}\right)\right\|_{2}^{2}.$$

Because the score matching loss function takes as input the gradient of the log density function, the loss does not depend on the normalizing constant. This makes score matching appealing when the normalizing constant is intractable.

The empirical loss seemingly depends on prior knowledge of $h^{*}$. However, if $h(z)$ and $\|\nabla h(z)\|_{2}$ both tend to zero as $z$ approaches the boundary of $\mathcal{Z}$, a partial integration argument can be used to show that the score matching loss can be expressed as

$$L(h)=\int\left\{\Delta\log h(z)+\frac{1}{2}\left\|\nabla\log h(z)\right\|_{2}^{2}\right\}h^{*}(z)\,dz+\text{const.}, \qquad (19)$$

where ‘const.’ denotes a term that does not depend on $h$. We can therefore estimate $h^{*}$ by minimizing an empirical version of the score matching loss that does not depend on $h^{*}$. We can express the empirical loss as

$$L_{n}(h)=\frac{1}{n}\sum_{i=1}^{n}\left\{\Delta\log h(Z_{i})+\frac{1}{2}\left\|\nabla\log h(Z_{i})\right\|_{2}^{2}\right\}.$$
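As a small illustration, the empirical loss $L_{n}(h)$ above can be evaluated for any log-density known only up to its normalizing constant, as in the following sketch (function names are ours); the commented example shows the mean-zero Gaussian case with inverse covariance $K$, for which $\nabla\log h(z)=-Kz$ and $\Delta\log h(z)=-\mathrm{tr}(K)$.

```r
# Sketch of the empirical score matching loss L_n(h). Z is an n x p data matrix;
# grad_logh(z) returns the p-vector of partial derivatives of log h at z, and
# lap_logh(z) returns the Laplacian. Neither requires the normalizing constant.
score_matching_loss <- function(Z, grad_logh, lap_logh) {
  terms <- apply(Z, 1, function(z) lap_logh(z) + 0.5 * sum(grad_logh(z)^2))
  mean(terms)
}

# Example (mean-zero Gaussian with inverse covariance K):
#   score_matching_loss(Z, grad_logh = function(z) -K %*% z,
#                          lap_logh  = function(z) -sum(diag(K)))
```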

The score matching loss is particularly appealing for exponential family distributions with continuous support, as it leads to a quadratic optimization problem [23]. However, when $Z$ is non-negative, the arguments used to express (18) as (19) fail because $h(z)$ and $\|\nabla h(z)\|_{2}$ do not approach zero at the boundary. We can overcome this problem by instead considering the generalized score matching framework [44, 18], an extension that is suitable for non-negative data. Let $v_{1},\ldots,v_{p}:\mathds{R}^{+}\to\mathds{R}^{+}$ be positive and differentiable functions, let $v(z)=\left(v_{1}(z_{1}),\ldots,v_{p}(z_{p})\right)^{\top}$, let $\dot{v}_{j}$ denote the derivative of $v_{j}$, and let $\circ$ denote the element-wise product operator. The generalized score matching loss is defined as

$$L(h)=\int\left\|\left\{\nabla\log h(z)-\nabla\log h^{*}(z)\right\}\circ v^{1/2}(z)\right\|_{2}^{2}h^{*}(z)\,dz, \qquad (20)$$

and is also minimized when $h=h^{*}$. As for the original score matching loss (18), the generalized score matching loss seemingly depends on prior knowledge of $h^{*}$. However, under mild technical conditions on $h$ and $v$ (see Appendix B.1), the loss in (20) can be rewritten as

$$L(h)=\int\bigg[\sum_{j=1}^{p}\dot{v}_{j}(z_{j})\left\{\frac{\partial\log h(z)}{\partial z_{j}}\right\}+v_{j}(z_{j})\left\{\frac{\partial^{2}\log h(z)}{\partial z^{2}_{j}}\right\}+\frac{1}{2}v_{j}(z_{j})\left\{\frac{\partial\log h(z)}{\partial z_{j}}\right\}^{2}\bigg]h^{*}(z)\,dz. \qquad (21)$$

The generalized score matching loss thus no longer depends on $h^{*}$, and an estimator can be constructed by minimizing the empirical version of (21) with respect to $h$. To this end, the original generalized score matching estimator used $v_{j}(z_{j})=z_{j}^{2}$ [18]. In this case, it becomes necessary to estimate high-order moments of $h^{*}$, leading to poor performance of the estimator. It has been shown that by instead taking $v_{j}$ to be a slowly increasing function, such as $v_{j}(z_{j})=\log(1+z_{j})$, one obtains improved theoretical results and better empirical performance [44].

4.2 Covariate-Adjustment in High-Dimensional Exponential Family Models via Score Matching

In this sub-section, we discuss construction of asymptotically normal estimators for the parameters of the exponential family pairwise interaction model (7) using the generalized score matching framework. To simplify our presentation, we consider the setting in which we are only interested in studying the connectedness between one node $X^{g}_{j}$ and all other nodes in the network. To this end, it suffices to estimate the conditional density of $X^{g}_{j}$ given all other nodes and the covariates $W^{g}$. A similar approach to the one we describe below can also be used to estimate the entire joint density (7). For simplicity, we assume that in (7), there exist functions $\psi$ and $\zeta$ such that $\psi=\psi_{j,k}$ for all $(j,k)$ and $\zeta=\zeta_{j,c}$ for all $(j,c)$, and that $\mu_{j}=0$. For $x=(x_{1},\ldots,x_{p})^{\top}$ and $w=(w_{1},\ldots,w_{q})^{\top}$, the conditional density can thus be expressed as

$$f_{j}^{g,*}(x_{j}|x_{1},\ldots,x_{p},w)\propto\exp\left(\sum_{k=1}^{p}\left\langle\alpha^{g,*}_{j,k},\phi\left(w\right)\right\rangle\psi(x_{j},x_{k})+\sum_{c=1}^{d}\theta_{j,c}^{g,*}\zeta\left(x_{j},\phi_{c}(w)\right)\right), \qquad (22)$$

where the proportionality is up to a normalizing constant that does not depend on $x_{j}$.

We first explicitly define the score matching loss for the conditional density function (22). Let $\boldsymbol{\alpha}^{g,*}_{j}=\left((\alpha^{g,*}_{j,1})^{\top},\ldots,(\alpha^{g,*}_{j,p})^{\top}\right)^{\top}$, and similarly let $\boldsymbol{\theta}^{g,*}_{j}=(\theta^{g,*}_{j,1},\ldots,\theta^{g,*}_{j,d})^{\top}$. Let $\dot{\psi}$ and $\ddot{\psi}$ denote the first and second derivatives of $\psi$ with respect to $x_{j}$, and similarly, let $\dot{\zeta}$ and $\ddot{\zeta}$ denote the first and second derivatives of $\zeta$ with respect to $x_{j}$. We define a non-negative function $v_{j}:\mathds{R}_{+}\to\mathds{R}_{+}$, and let $\dot{v}_{j}$ denote its first derivative. Then for candidate parameters $\boldsymbol{\alpha}_{j}=\left(\alpha^{\top}_{j,1},\ldots,\alpha^{\top}_{j,p}\right)^{\top}$ and $\boldsymbol{\theta}_{j}=(\theta_{j,1},\ldots,\theta_{j,d})^{\top}$, the empirical generalized score matching loss for the conditional density of $X_{j}^{g}$ given all other nodes and the covariates can be expressed as

$$\begin{aligned}L^{g}_{n,j}(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j})=\;&\frac{1}{2n^{g}}\sum_{i=1}^{n^{g}}v_{j}\left(X_{i,j}^{g}\right)\bigg\{\sum_{k=1}^{p}\left\langle\alpha_{j,k},\phi\left(W^{g}_{i}\right)\right\rangle\dot{\psi}\left(X^{g}_{i,j},X^{g}_{i,k}\right)+\sum_{c=1}^{d}\theta_{j,c}\dot{\zeta}\left(X^{g}_{i,j},\phi_{c}\left(W^{g}_{i}\right)\right)\bigg\}^{2}\\&+\frac{1}{n^{g}}\sum_{i=1}^{n^{g}}v_{j}\left(X^{g}_{i,j}\right)\bigg\{\sum_{k=1}^{p}\left\langle\alpha_{j,k},\phi\left(W^{g}_{i}\right)\right\rangle\ddot{\psi}\left(X^{g}_{i,j},X^{g}_{i,k}\right)+\sum_{c=1}^{d}\theta_{j,c}\ddot{\zeta}\left(X^{g}_{i,j},\phi_{c}\left(W^{g}_{i}\right)\right)\bigg\}\\&+\frac{1}{n^{g}}\sum_{i=1}^{n^{g}}\dot{v}_{j}\left(X^{g}_{i,j}\right)\bigg\{\sum_{k=1}^{p}\left\langle\alpha_{j,k},\phi\left(W^{g}_{i}\right)\right\rangle\dot{\psi}\left(X^{g}_{i,j},X^{g}_{i,k}\right)+\sum_{c=1}^{d}\theta_{j,c}\dot{\zeta}\left(X^{g}_{i,j},\phi_{c}\left(W^{g}_{i}\right)\right)\bigg\}.\end{aligned} \qquad (23)$$

The true parameters $\boldsymbol{\alpha}^{g,*}_{j}$ and $\boldsymbol{\theta}^{g,*}_{j}$ can be characterized as the minimizers of the population score matching loss $\mathds{E}\left[L^{g}_{n,j}\left(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j}\right)\right]$, as discussed in Section 4.1.

The loss function in (23) is quadratic in the parameters $\boldsymbol{\alpha}_{j}$ and $\boldsymbol{\theta}_{j}$ and can thus be minimized efficiently. When the sample size $n^{g}$ is much larger than the number of unknown parameters $(p+1)d$, one can estimate $\boldsymbol{\alpha}^{g,*}_{j}$ and $\boldsymbol{\theta}_{j}^{g,*}$ by simply minimizing $L^{g}_{n,j}$ with respect to the unknown parameters. Because the empirical loss function is quadratic in $(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j})$, the minimizer of the loss is available in closed form and can be computed efficiently. Moreover, we can readily establish asymptotic normality of the parameter estimates using results from classical M-estimation theory [34]. To avoid cumbersome notation, we reserve the details for Appendix B.2.

When the sample size is smaller than the number of parameters, the minimizer of $L^{g}_{n,j}$ is no longer consistent. Similar to Section 3.2, we use regularization to obtain a consistent estimator in the high-dimensional setting. We define the $\ell_{2}$-regularized generalized score matching estimator as

$$\left(\tilde{\boldsymbol{\alpha}}^{g}_{j},\tilde{\boldsymbol{\theta}}^{g}_{j}\right)=\underset{\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j}}{\text{arg min}}\;L^{g}_{n,j}(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j})+\lambda\sum_{k=1}^{p}\left\|\alpha_{j,k}\right\|_{2}, \qquad (24)$$

where $\lambda>0$ is a tuning parameter. Similar to the group LASSO estimator (13), the regularization term in (24) induces sparsity in the estimate $\tilde{\boldsymbol{\alpha}}_{j}^{g}$ and sets some $\tilde{\alpha}^{g}_{j,k}$ to exactly zero. The tuning parameter controls the level of sparsity, with more of the vectors $\tilde{\alpha}^{g}_{j,k}$ set to zero for larger $\lambda$. In Appendix B.3, we establish consistency of the regularized score matching estimator assuming sparsity of $\boldsymbol{\alpha}^{g,*}_{j}$ and some additional regularity conditions.

As is the case for the group LASSO estimator, the regularized score matching estimator has an intractable limiting distribution because its bias and standard error diminish at the same rate. We can obtain an asymptotically normal estimate by subtracting from the initial estimate an estimate of the bias. In Appendix B.4, we construct such a bias-corrected estimate $\check{\alpha}^{g}_{j,k}$ that, for sufficiently large $n^{g}$, satisfies

$$\check{\alpha}^{g}_{j,k}\sim N\left(\alpha^{g,*}_{j,k},\Omega^{g}_{j,k}\right),$$

for a positive definite matrix $\Omega^{g}_{j,k}$. Given bias-corrected estimates and a consistent estimate $\check{\Omega}^{g}_{j,k}$ of $\Omega^{g}_{j,k}$, we can test the null hypothesis (6) using the test statistic

$$S_{j,k}=\left(\check{\alpha}^{\mathrm{I}}_{j,k}-\check{\alpha}^{\mathrm{II}}_{j,k}\right)^{\top}\left(\check{\Omega}^{\mathrm{I}}_{j,k}+\check{\Omega}^{\mathrm{II}}_{j,k}\right)^{-1}\left(\check{\alpha}^{\mathrm{I}}_{j,k}-\check{\alpha}^{\mathrm{II}}_{j,k}\right).$$

Under the null hypothesis, the test statistic follows a chi-squared distribution with $d$ degrees of freedom.

5 Numerical Studies

In this section, we examine the performance of our proposed test in a simulation study. We consider the neighborhood selection approach described in Section 3. Our simulation study has three objectives: (1) to assess the stability of our estimators for the covariate-dependent networks, (2) to examine the effect of sample size on statistical power and type-I error control, and (3) to illustrate that failing to adjust for covariates can in some settings result in poor type-I error control or reduced statistical power.

5.1 Implementation

We first discuss implementation of the neighborhood selection approach. The group LASSO estimate (13) does not exist in closed form, in contrast to the ordinary least squares estimate (12). To solve (13), we use the efficient algorithm implemented in the publicly available R package gglasso [42].

The group LASSO estimator requires selection of a tuning parameter $\lambda$, which controls the sparsity of the estimate. We select the tuning parameter by performing $K$-fold cross-validation, using $K=10$ folds. Since the selection of $\lambda$ is sensitive to the scale of the columns of $\mathcal{V}_{k}^{g}$ in (11), we scale the columns by their standard deviations prior to cross-validating. After fitting the group LASSO with the selected tuning parameter, we convert the estimates back to their original scale by dividing the estimates by the standard deviations of the columns of $\mathcal{V}_{k}^{g}$.
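A sketch of this tuning and rescaling procedure is given below; it assumes the cv.gglasso cross-validation interface of the gglasso package (argument names should be checked against the package documentation) and is intended only as an illustration.

```r
library(gglasso)

# Sketch: standardize columns of V, choose lambda by 10-fold cross-validation,
# refit, and return coefficient estimates on the original scale.
fit_cv_group_lasso <- function(y, V, d) {
  grp <- rep(seq_len(ncol(V) / d), each = d)
  sds <- apply(V, 2, sd)
  Vs  <- sweep(V, 2, sds, "/")                                   # standardize columns
  cv  <- cv.gglasso(x = Vs, y = y, group = grp, loss = "ls",
                    pred.loss = "L2", nfolds = 10)
  fit <- gglasso(x = Vs, y = y, group = grp, loss = "ls",
                 lambda = cv$lambda.min)
  coef(fit)[-1] / sds                                            # back to the original scale
}
```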

5.2 Simulation Setting

In what follows, we describe our simulation setting. In short, we generate data from the varying coefficient model (9), where we treat nodes $1$ through $(p-1)$ as predictors and treat node $p$ as the response. We first randomly generate data for nodes $1$ through $(p-1)$ in groups $\mathrm{I}$ and $\mathrm{II}$ from the same multivariate normal distribution. We then construct $\eta^{g,*}_{j,k}$ and generate data for two covariates $W_{i}^{g}=(W_{i,1}^{g},W_{i,2}^{g})^{\top}$ so that one covariate acts as a confounding variable, and the other covariate should improve statistical power to detect differential associations after adjustment.

To simulate data for nodes $1$ through $(p-1)$, we first generate a random graph with $(p-1)$ nodes and an edge density of .05 from a power law distribution with power parameter 5 [30]. Denoting the edge set of the graph by $E$, we generate the $(p-1)\times(p-1)$ matrix $\Theta$ as

$$\Theta_{j,k}=\begin{cases}0&(j,k)\notin E\\ .5&(j,k)\in E\text{ with 50\% probability}\\ -.5&(j,k)\in E\text{ with 50\% probability}\end{cases},$$

with $\Theta_{j,k}=\Theta_{k,j}$. Denoting by $a^{*}$ the smallest eigenvalue of $\Theta$, we set $\Sigma=(\Theta-(a^{*}-.1)I)^{-1}$, where $I$ is the identity matrix. We then draw $(X_{i,1}^{g},\ldots,X_{i,p-1}^{g})^{\top}$ from a multivariate normal distribution with mean zero and covariance $\Sigma$ for $i=1,\ldots,n^{g}$ for each group $g$.

We generate $W^{\mathrm{I}}_{i,1}$ from a $\text{Beta}(3/2,1)$ distribution and $W^{\mathrm{II}}_{i,1}$ from a $\text{Beta}(1,3/2)$ distribution. We center and scale both $W^{\mathrm{I}}_{i,1}$ and $W^{\mathrm{II}}_{i,1}$ to the $(-1,1)$ interval. We generate $W^{\mathrm{I}}_{i,2}$ and $W_{i,2}^{\mathrm{II}}$ each from a $\text{Uniform}(-1,1)$ distribution.
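The covariate-generation step can be written, for instance, as in the following sketch (the function name is ours).

```r
# Sketch of the covariate generation described above: W1 follows a group-specific
# Beta distribution rescaled to (-1, 1); W2 is Uniform(-1, 1) in both groups.
gen_covariates <- function(n, group = c("I", "II")) {
  group <- match.arg(group)
  w1 <- if (group == "I") rbeta(n, 3/2, 1) else rbeta(n, 1, 3/2)
  cbind(w1 = 2 * w1 - 1,
        w2 = runif(n, -1, 1))
}
```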

We consider two different choices for the varying coefficient functions $\eta^{g,*}_{j,k}$:

  • Linear Polynomial:

    $\eta^{\mathrm{I},*}_{p,1}(w_{1},w_{2})=.5+.5w_{1}$;  $\eta^{\mathrm{II},*}_{p,1}(w_{1},w_{2})=.5+.5w_{1}$
    $\eta^{\mathrm{I},*}_{p,2}(w_{1},w_{2})=.5+.25w_{2}$;  $\eta^{\mathrm{II},*}_{p,2}(w_{1},w_{2})=.5+.75w_{2}$
    $\eta^{\mathrm{I},*}_{p,3}(w_{1},w_{2})=0$;  $\eta^{\mathrm{II},*}_{p,3}(w_{1},w_{2})=.5$,

    and $\eta^{g,*}_{p,k}=0$ for $k\geq 4$.

  • Cubic Polynomial:

    $\eta^{\mathrm{I},*}_{p,1}(w_{1},w_{2})=.5+.5\left(w_{1}+w_{1}^{2}+w_{1}^{3}\right)$;  $\eta^{\mathrm{II},*}_{p,1}(w_{1},w_{2})=.5+.5\left(w_{1}+w_{1}^{2}+w_{1}^{3}\right)$
    $\eta^{\mathrm{I},*}_{p,2}(w_{1},w_{2})=.5+.25\left(w_{2}+w_{2}^{3}\right)$;  $\eta^{\mathrm{II},*}_{p,2}(w_{1},w_{2})=.5+.75\left(w_{2}+w_{2}^{3}\right)$
    $\eta^{\mathrm{I},*}_{p,3}(w_{1},w_{2})=0$;  $\eta^{\mathrm{II},*}_{p,3}(w_{1},w_{2})=.5$,

    and $\eta^{g,*}_{p,k}=0$ for $k\geq 4$.

The first covariate $W_{i,1}^{g}$ confounds the association between nodes $p$ and $1$: the distribution of $W_{i,1}^{g}$ depends on group membership, and $W_{i,1}^{g}$ affects the association between nodes $p$ and $1$, yet $\eta^{\mathrm{I},*}_{p,1}(w)=\eta^{\mathrm{II},*}_{p,1}(w)$ for all $w$. Thus, $G^{0}_{p,1}$ in (6) holds while $H^{0}_{p,1}$ in (1) fails, as depicted in Figure 1a. Failing to adjust for $W_{1}^{g}$ should therefore result in an inflated type-I error rate for the hypothesis $G^{0}_{p,1}$. Adjusting for the second covariate $W_{i,2}^{g}$ should improve the power to detect the differential connection between nodes $p$ and $2$. We have constructed $\eta^{g,*}_{p,2}$ so that $\mathds{E}\left[\eta^{\mathrm{I},*}_{p,2}\left(W^{\mathrm{I}}\right)\right]=\mathds{E}\left[\eta^{\mathrm{II},*}_{p,2}\left(W^{\mathrm{II}}\right)\right]$, though the association between nodes $p$ and $2$ depends more strongly on $W^{g}$ in group $\mathrm{II}$ than in group $\mathrm{I}$. Thus, $H^{0}_{p,2}$ holds while $G^{0}_{p,2}$ fails, as depicted in Figure 1b. The association between nodes $p$ and $3$ does not depend on either covariate, though the association differs by group; one should therefore be able to identify this differential connection using either the adjusted or the unadjusted test. Node $p$ is conditionally independent of all remaining nodes ($k\geq 4$) in both groups.

For $i=1,\ldots,n^{g}$, we generate $X^{g}_{i,p}$ as

$$X^{g}_{i,p}=\sum_{k\neq p}\eta^{g,*}_{p,k}\left(W_{i}^{g}\right)X^{g}_{i,k}+\epsilon^{g}_{i},$$

where $\epsilon^{g}_{i}$ follows a normal distribution with zero mean and unit variance. We use balanced sample sizes $n^{\mathrm{I}}=n^{\mathrm{II}}=n$ and consider $n\in\{80,160,240\}$. We set the number of nodes to $p=40$. The graph for nodes $1$ through $(p-1)$ contains 15 edges. Leaving $\Sigma$ fixed, we generate 400 random data sets following the above approach.
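For reference, a sketch of the data-generating step for node $p$ under the Linear Polynomial specification is given below; the function name and argument layout are ours.

```r
# Sketch: generate node p from the varying coefficient model under the
# Linear Polynomial setting, given the remaining nodes Xmat (n x (p-1)) and the
# covariates W (n x 2, already rescaled to (-1, 1)).
gen_node_p <- function(Xmat, W, group = c("I", "II")) {
  group <- match.arg(group)
  eta1 <- 0.5 + 0.5 * W[, 1]                                   # identical in both groups
  eta2 <- if (group == "I") 0.5 + 0.25 * W[, 2] else 0.5 + 0.75 * W[, 2]
  eta3 <- if (group == "I") 0 else 0.5
  eta1 * Xmat[, 1] + eta2 * Xmat[, 2] + eta3 * Xmat[, 3] + rnorm(nrow(Xmat))
}
```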

We consider two choices of the basis expansion $\phi$:

  1. Linear basis: $\phi(w_{1},w_{2})=\begin{pmatrix}1&w_{1}&w_{2}\end{pmatrix}^{\top}$;

  2. Cubic polynomial basis: $\phi(w_{1},w_{2})=\begin{pmatrix}1&w_{1}&w_{1}^{2}&w_{1}^{3}&w_{2}&w_{2}^{2}&w_{2}^{3}\end{pmatrix}^{\top}$.

Using a linear basis, $d=3$, and model (9) has $117$ parameters. With the cubic polynomial basis, $d=7$, and there are $273$ parameters.

We compare our proposed methodology with the approach for differential network analysis without covariate adjustment described in Section 3.1. In the unadjusted analysis, ordinary least squares estimation is justified because although $(p-1)d$ is large with respect to $n$, $(p-1)$ is smaller than $n$.

5.3 Simulation Results

Figure 2 shows the Monte Carlo estimates of the expected $\ell_{2}$ error for the de-biased group LASSO estimates $\check{\alpha}^{g}_{p,k}$, $\mathds{E}\left[\left\|d^{-1}\left(\check{\alpha}^{g}_{p,k}-\alpha^{g,*}_{p,k}\right)\right\|_{2}\right]$, for $k=1,\ldots,(p-1)$. We only report the $\ell_{2}$ error when the basis $\phi$ is correctly specified for the varying coefficient function $\eta^{g,*}_{p,k}$ — that is, when $\phi$ is a linear basis and $\eta^{g,*}_{p,k}$ is a linear function, or when $\phi$ is a cubic basis and $\eta^{g,*}_{p,k}$ is a cubic function. In both the linear and cubic polynomial settings, the average $\ell_{2}$ estimation error for $\alpha^{g,*}_{p,k}$ decreases with the sample size for all $k$, as expected. We also find that in small samples, the estimation error is substantially lower in the linear setting than in the cubic setting. This suggests that estimates are less stable in more complex models.

Figure 2: Monte Carlo estimates of the expected \ell_{2} error, \mathds{E}\left[\left\|d^{-1}\left(\check{\alpha}^{g}_{p,k}-\alpha^{g,*}_{p,k}\right)\right\|_{2}\right], for k=1,\ldots,39. The linear polynomial plots display the \ell_{2} error when \eta^{g,*}_{p,k} is a linear function and \phi is the linear basis. The cubic polynomial plots display the \ell_{2} error when \eta^{g,*}_{p,k} is a cubic polynomial and \phi is the cubic basis.

In Table 1, we report Monte Carlo estimates of the probability of rejecting G^{0}_{p,k}, the null hypothesis that nodes p and k are not differentially connected given W^{g}, for k=1, k=2, k=3, and k\geq 4, using both the adjusted and unadjusted tests at the significance level \kappa=.05. As the purpose of the simulation study is to examine the behavior of the edge-wise test, we do not perform a multiple testing correction.
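The edge-wise test compares the group-specific estimates of \alpha_{p,k}. As a schematic illustration of how such a comparison can be computed from de-biased estimates and their variance estimates (a Wald-type form with a chi-squared reference distribution, assumed here rather than transcribed verbatim from Sections 3 and 4; the function name and inputs are ours):

```python
import numpy as np
from scipy import stats

def edge_pvalue(alpha_I, Omega_I, alpha_II, Omega_II):
    # Wald-type comparison of the d-dimensional estimates of alpha_{p,k} in the
    # two groups, using their estimated variance matrices; the statistic is
    # referred to a chi-squared distribution with d degrees of freedom.
    diff = np.asarray(alpha_I) - np.asarray(alpha_II)
    stat = diff @ np.linalg.solve(np.asarray(Omega_I) + np.asarray(Omega_II), diff)
    return stat, stats.chi2.sf(stat, df=diff.size)
```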

For k=1 (i.e., when H^{0}_{p,k} fails, but G^{0}_{p,k} holds), the unadjusted test is anti-conservative, and the probability of falsely rejecting G^{0}_{p,k} increases with the sample size. When the adjusted test is performed using a linear basis and \eta^{g,*}_{p,1} is linear, the type-I error rate is slightly inflated but appears to approach the nominal level of .05 as the sample size increases. However, when \eta^{g,*}_{p,1} is a cubic function and the linear basis is therefore mis-specified, the type-I error rate is inflated, though in larger samples it is still lower than that of the unadjusted test. For both specifications of \eta^{g,*}_{p,1}, the covariate-adjusted test controls the type-I error rate near the nominal level when a cubic polynomial basis is used. For k=2 (i.e., when H^{0}_{p,k} holds, but G^{0}_{p,k} fails), the unadjusted test exhibits low power to detect differential associations. The adjusted test provides greatly improved power when either a linear or cubic basis is used. For k=3 (i.e., when both H^{0}_{p,k} and G^{0}_{p,k} fail), the unadjusted test and both adjusted tests are well-powered against the null. For k\geq 4 (i.e., when nodes p and k are conditionally independent in both groups), the unadjusted test and the adjusted test with a linear basis both control the type-I error near the nominal level. However, the covariate-adjusted test is conservative when a cubic basis is used.

                          Unadjusted                  Linear Adjustment           Cubic Adjustment
                          n=80    n=160   n=240       n=80    n=160   n=240       n=80    n=160   n=240
Linear \eta^{g,*}_{p,k}
  k=1                     0.15    0.278   0.385       0.13    0.09    0.072       0.04    0.062   0.05
  k=2                     0.042   0.078   0.045       0.27    0.532   0.73        0.08    0.27    0.52
  k=3                     0.48    0.912   0.988       0.605   0.922   0.965       0.218   0.738   0.902
  k≥4                     0.052   0.054   0.053       0.045   0.048   0.048       0.009   0.017   0.025
Cubic \eta^{g,*}_{p,k}
  k=1                     0.21    0.505   0.668       0.358   0.315   0.342       0.07    0.055   0.07
  k=2                     0.052   0.068   0.082       0.6     0.882   0.975       0.195   0.73    0.93
  k=3                     0.408   0.84    0.978       0.55    0.898   0.982       0.202   0.772   0.945
  k≥4                     0.056   0.05    0.054       0.053   0.054   0.052       0.009   0.02    0.027

Table 1: Monte Carlo estimates of the probability of rejecting G^{0}_{p,k}, the null hypothesis that nodes p and k are not differentially connected given W^{g}. All tests are performed at the significance level \kappa=.05, and no multiple testing correction is performed.

The simulation results corroborate our expectations and suggest that there are potential benefits to covariate adjustment. We find that when the sample size is large, the covariate-adjusted test behaves reasonably well with either choice of basis function. However, in small samples, the covariate-adjusted test is somewhat imprecise, and the type-I error rate can be slightly above or below the nominal level. Practitioners should therefore exercise caution when using our proposed methodology in very small samples.

6 Data Example

Breast cancer classification based on expression of the estrogen receptor (ER) is prognostic of clinical outcomes. Breast cancers can be classified as estrogen receptor positive (ER+) or estrogen receptor negative (ER-), with approximately 70% of breast cancers being ER+ [25]. In ER+ breast cancer, the cancer cells require estrogen to grow; ER+ status has been shown to be associated with more favorable clinical outcomes than ER- status [6]. Identifying differences between the biological pathways of ER+ and ER- breast cancers can be helpful for understanding the underlying disease mechanisms.

It has been shown that age is associated with ER status and that age can be associated with gene expression [22, 41]. This warrants consideration of age as an adjustment variable in a comparison of gene co-expression networks between ER groups.

We perform an age-adjusted differential analysis of the ER+ and ER- breast cancer networks, using publicly available data from The Cancer Genome Atlas (TCGA) [37]. We obtain clinical measurements and gene expression data from a total of 806 ER+ patients and 237 ER- patients. We consider the set of p=145 genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) breast cancer pathway [21] and adjust for age as our only covariate. The average age in the ER+ group is 59.3 years (SD = 13.3), and the average age in the ER- group is 55.9 years (SD = 12.4). We use a linear basis for covariate adjustment. In the ER+ group, the sample size is considerably larger than the number of parameters, so we can fit the varying coefficient model (9) using ordinary least squares. We use the de-biased group LASSO to estimate the network for the ER- group because the sample size is smaller than the number of model parameters. We compare the results from the covariate-adjusted analysis with the unadjusted approach described in Section 3.1.
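As an illustration of the ER+ fit, the sketch below constructs the linear-basis (intercept and age) design for a single response gene and estimates its varying coefficients by ordinary least squares; the variable names and the data-loading step are placeholders, and any intercept or main-effect terms in model (9) are omitted here.

```python
import numpy as np

def varying_coef_design(X_other, age):
    # Interact each remaining gene with the linear basis phi(age) = (1, age)^T,
    # giving (p - 1) * 2 columns for the varying coefficient model (9).
    cols = []
    for k in range(X_other.shape[1]):
        cols.append(X_other[:, k])        # coefficient for the constant basis term
        cols.append(X_other[:, k] * age)  # coefficient for the age term
    return np.column_stack(cols)

# Hypothetical inputs: expr is an (n x 145) expression matrix for the ER+ samples,
# age is the corresponding vector of ages, and j indexes the response gene.
# V = varying_coef_design(np.delete(expr, j, axis=1), age)
# alpha_hat, *_ = np.linalg.lstsq(V, expr[:, j], rcond=None)
```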

To assess differential connectivity between any two nodes j and k, we can treat either node j or node k as the response in the varying coefficient model (9). We can then test either of the hypotheses G^{0}_{j,k}:\alpha^{\mathrm{I},*}_{j,k}=\alpha^{\mathrm{II},*}_{j,k} or G^{0}_{k,j}:\alpha^{\mathrm{I},*}_{k,j}=\alpha^{\mathrm{II},*}_{k,j}. We set the p-value for differential connectivity between nodes j and k to the minimum of the p-values for the tests of G^{0}_{j,k} and G^{0}_{k,j}, though we acknowledge that this strategy is anti-conservative.

Our objective is to identify all pairs of differentially connected genes, so we need to adjust for the fact that we perform a separate hypothesis test for each gene pair. We account for multiplicity by controlling the false discovery rate at the level \kappa=.05 using the Benjamini-Yekutieli method [3].
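The following sketch illustrates how the two per-orientation p-values are combined and how the Benjamini-Yekutieli adjustment is applied, using a placeholder p-value matrix:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p = 145
# Placeholder per-orientation p-values: p_jk[j, k] is the p-value for G^0_{j,k}
# with node j treated as the response in model (9).
p_jk = rng.uniform(size=(p, p))

# Combine the two orientations for each gene pair by taking the minimum.
p_edge = np.minimum(p_jk, p_jk.T)
iu = np.triu_indices(p, k=1)

# Benjamini-Yekutieli adjustment over all gene pairs, FDR level 0.05.
reject, p_adj, _, _ = multipletests(p_edge[iu], alpha=0.05, method="fdr_by")
print(int(reject.sum()), "gene pairs declared differentially connected")
```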

Figure 3: Differential breast cancer network by estrogen receptor status from covariate-adjusted analysis. Nodes with at least five differentially connected neighbors are colored blue. The false discovery rate is controlled at .05.

The differential networks obtained from the unadjusted and adjusted analyses are substantially different. We report 106 differentially connected edges from the adjusted analysis (shown in Figure 3), compared with only two such edges from the unadjusted analysis. This suggests that the relationship between the gene co-expression network and age may differ by ER group.

7 Discussion

In this paper, we have addressed challenges that arise when performing differential network analysis [32] in the setting where the network depends on covariates. Using both synthetic and real data, we showed that accounting for covariates can result in better control of type-I error and improved power.

We propose a parsimonious approach for covariate adjustment in differential network analysis. A number of improvements and extensions can be made to our current work. First, while this paper focuses on differential network analysis in exponential family models, our framework can be applied to other models where conditional dependence between any pair of nodes can be represented by a single scalar parameter. This includes semi-parametric models such as the nonparanormal model [24], as well as distributions defined over complex domains, which can be modeled using the generalized score matching framework [45]. Additionally, we only discuss testing edge-wise differences between the networks, though testing differences between sub-networks may also be of interest. When the sub-networks are low-dimensional, one can construct a chi-squared test using test statistics similar to those presented in Section 3 and Section 4, because joint asymptotic normality of a low-dimensional set of the estimators \check{\alpha}^{g}_{j,k} can be readily established. Such an approach is not applicable to high-dimensional sub-networks, but it may be possible to construct a calibrated test using recent results on simultaneous inference in high-dimensional models [48, 43]. We can also improve the statistical efficiency of the network estimates by considering joint estimation procedures that borrow information across groups [13, 8, 31]. Finally, we assume that the relationship between the network and the covariates can be represented by a low-dimensional basis expansion. Investigating nonparametric approaches that relax this assumption can be a fruitful area of research.

Funding

The authors gratefully acknowledge the support of the NSF Graduate Research Fellowship Program under grant DGE-1762114 as well as NSF grant DMS-1561814 and NIH grant R01-GM114029. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

Availability of Data

The findings of this paper are supported by data from The Cancer Genome Atlas, which are accessible using the publicly available R package RTCGA.

Code Availability

An implementation of the proposed methodology is available at https://github.com/awhudson/CovDNA.

References

  • [1] Barabási, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12(1), 56–68 (2011)
  • [2] Belilovsky, E., Varoquaux, G., Blaschko, M.B.: Testing for differences in Gaussian graphical models: Applications to brain connectivity. In: Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc. (2016)
  • [3] Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics pp. 1165–1188 (2001)
  • [4] Breheny, P., Huang, J.: Penalized methods for bi-level variable selection. Statistics and its Interface 2(3), 369 (2009)
  • [5] Bühlmann, P., van de Geer, S.: Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media (2011)
  • [6] Carey, L.A., Perou, C.M., Livasy, C.A., Dressler, L.G., Cowan, D., Conway, K., Karaca, G., Troester, M.A., Tse, C.K., Edmiston, S., et al.: Race, breast cancer subtypes, and survival in the carolina breast cancer study. Journal of the American Medical Association 295(21), 2492–2502 (2006)
  • [7] Chen, S., Witten, D.M., Shojaie, A.: Selection and estimation for mixed graphical models. Biometrika 102(1), 47–64 (2015)
  • [8] Danaher, P., Wang, P., Witten, D.M.: The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B 76(2), 373–397 (2014)
  • [9] Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)
  • [10] de la Fuente, A.: From ‘differential expression’ to ‘differential networking’ – identification of dysfunctional regulatory networks in diseases. Trends in Genetics 26(7), 326–333 (2010)
  • [11] van de Geer, S.: Estimation and testing under sparsity. Lecture Notes in Mathematics 2159 (2016)
  • [12] van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R.: On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42(3), 1166–1202 (2014)
  • [13] Guo, J., Levina, E., Michailidis, G., Zhu, J.: Joint estimation of multiple graphical models. Biometrika 98(1), 1–15 (2011)
  • [14] Hastie, T., Tibshirani, R.: Varying-coefficient models. Journal of the Royal Statistical Society: Series B 55(4), 757–779 (1993)
  • [15] He, H., Cao, S., Zhang, J.g., Shen, H., Wang, Y.P., Deng, H.: A statistical test for differential network analysis based on inference of Gaussian graphical model. Scientific Reports 9(1), 1–8 (2019)
  • [16] Honda, T.: The de-biased group lasso estimation for varying coefficient models. Annals of the Institute of Statistical Mathematics pp. 1–27 (2019)
  • [17] Hyvärinen, A.: Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6(24), 695–709 (2005)
  • [18] Hyvärinen, A.: Some extensions of score matching. Computational Statistics & Data Analysis 51(5), 2499–2512 (2007)
  • [19] Ideker, T., Krogan, N.J.: Differential network biology. Molecular Systems Biology 8(1) (2012)
  • [20] Javanmard, A., Montanari, A.: Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15(1), 2869–2909 (2014)
  • [21] Kanehisa, M., Goto, S.: Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28(1), 27–30 (2000)
  • [22] Khan, S.A., Rogers, M.A., Khurana, K.K., Meguid, M.M., Numann, P.J.: Estrogen receptor expression in benign breast epithelium and breast cancer risk. Journal of the National Cancer Institute 90(1), 37–42 (1998)
  • [23] Lin, L., Drton, M., Shojaie, A.: Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics 10(1), 806–854 (2016)
  • [24] Liu, H., Lafferty, J., Wasserman, L.: The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10(80), 2295–2328 (2009)
  • [25] Lumachi, F., Brunello, A., Maruzzo, M., Basso, U., Basso, S.M.: Treatment of estrogen receptor-positive breast cancer. Current Medicinal Chemistry 20(5), 596–604 (2013)
  • [26] Maathuis, M., Drton, M., Lauritzen, S., Wainwright, M.: Handbook of graphical models. CRC Press (2018)
  • [27] Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34(3), 1436–1462 (2006)
  • [28] Mitra, R., Zhang, C.H.: The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics 10(2), 1829–1873 (2016)
  • [29] Negahban, S.N., Ravikumar, P., Wainwright, M.J., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27(4), 538–557 (2012)
  • [30] Newman, M.E.: The structure and function of complex networks. SIAM Review 45(2), 167–256 (2003)
  • [31] Saegusa, T., Shojaie, A.: Joint estimation of precision matrices in heterogeneous populations. Electronic Journal of Statistics 10(2), 1341–1392 (2016)
  • [32] Shojaie, A.: Differential network analysis: A statistical perspective. Wiley Interdisciplinary Reviews: Computational Statistics p. e1508 (2020)
  • [33] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1), 267–288 (1996)
  • [34] van der Vaart, A.W.: Asymptotic statistics, vol. 3. Cambridge University Press (2000)
  • [35] Wang, H., Xia, Y.: Shrinkage estimation of the varying coefficient model. Journal of the American Statistical Association 104(486), 747–757 (2009)
  • [36] Wang, J., Kolar, M.: Inference for sparse conditional precision matrices. arXiv preprint arXiv:1412.7638 (2014)
  • [37] Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., Stuart, J.M.: The cancer genome atlas pan-cancer analysis project. Nature Genetics 45(10), 1113–1120 (2013)
  • [38] Xia, Y., Cai, T., Cai, T.T.: Testing differential networks with applications to the detection of gene-gene interactions. Biometrika 102(2), 247–266 (2015)
  • [39] Xia, Y., Cai, T., Cai, T.T.: Two-sample tests for high-dimensional linear regression with an application to detecting interactions. Statistica Sinica 28, 63–92 (2018)
  • [40] Yang, E., Ravikumar, P., Allen, G.I., Liu, Z.: Graphical models via univariate exponential family distributions. Journal of Machine Learning Research 16(1), 3813–3847 (2015)
  • [41] Yang, J., Huang, T., Petralia, F., Long, Q., Zhang, B., Argmann, C., Zhao, Y., Mobbs, C.V., Schadt, E.E., Zhu, J., et al.: Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Scientific Reports 5(1), 1–16 (2015)
  • [42] Yang, Y., Zou, H.: A fast unified algorithm for solving group-lasso penalize learning problems. Statistics and Computing 25(6), 1129–1141 (2015)
  • [43] Yu, M., Gupta, V., Kolar, M.: Simultaneous inference for pairwise graphical models with generalized score matching. Journal of Machine Learning Research 21(91), 1–51 (2020)
  • [44] Yu, S., Drton, M., Shojaie, A.: Generalized score matching for non-negative data. Journal of Machine Learning Research 20(76), 1–70 (2019)
  • [45] Yu, S., Drton, M., Shojaie, A.: Generalized score matching for general domains. Information and Inference: A Journal of the IMA (2021)
  • [46] Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 68(1), 49–67 (2006)
  • [47] Zhang, C.H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B 76(1), 217–242 (2014)
  • [48] Zhang, X., Cheng, G.: Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association 112(518), 757–768 (2017)
  • [49] Zhao, S.D., Cai, T.T., Li, H.: Direct estimation of differential networks. Biometrika 101(2), 253–268 (2014)
  • [50] Zhou, S., Lafferty, J., Wasserman, L.: Time varying undirected graphs. Machine Learning 80(2), 295–319 (2010)
  • [51] Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67(2), 301–320 (2005)

Appendix

A  De-biased Group LASSO Estimator

In this section, we derive a de-biased group LASSO estimator. Our construction is essentially the same as the one presented in van de Geer [11].

With 𝒱j\mathcal{V}_{j} as defined in (11), let 𝒱jg=(𝒱1g,,𝒱j1g,𝒱j+1g,,𝒱pg)\mathcal{V}_{-j}^{g}=\left(\mathcal{V}^{g}_{1},\cdots,\mathcal{V}^{g}_{j-1},\mathcal{V}^{g}_{j+1},\cdots,\mathcal{V}^{g}_{p}\right) be an n×(p1)dn\times(p-1)d dimensional matrix. For αj,1,,αj,pd\alpha_{j,1},\ldots,\alpha_{j,p}\in\mathds{R}^{d}, let 𝜶j=(αj,1,,αj,p)\boldsymbol{\alpha}_{j}=\left(\alpha_{j,1}^{\top},\cdots,\alpha_{j,p}^{\top}\right)^{\top}, let 𝒫j(𝜶j)=kjαj,k2\mathcal{P}_{j}\left(\boldsymbol{\alpha}_{j}\right)=\sum_{k\neq j}\left\|\alpha_{j,k}\right\|_{2}, and let 𝒫j\nabla\mathcal{P}_{j} denote the sub-gradient of 𝒫j\mathcal{P}_{j}. We can express the sub-gradient as 𝒫j(𝜶j)=((αj,12),,(αj,p2))\nabla\mathcal{P}_{j}(\boldsymbol{\alpha}_{j})=\left((\nabla\|\alpha_{j,1}\|_{2})^{\top},\cdots,(\nabla\|\alpha_{j,p}\|_{2})^{\top}\right)^{\top} where αj,k2=αj,k/αj,k2\nabla\|{\alpha}_{j,k}\|_{2}=\alpha_{j,k}/\|\alpha_{j,k}\|_{2} if αj,k20\|\alpha_{j,k}\|_{2}\neq 0, and αj,k2\nabla\|{\alpha}_{j,k}\|_{2} is otherwise a vector with 2\ell_{2} norm less than one. The KKT conditions for the group LASSO imply that the estimate 𝜶~jg\tilde{\boldsymbol{\alpha}}^{g}_{j} satisfies

\displaystyle\left(n^{g}\right)^{-1}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\left(\mathbf{X}_{j}^{g}-\mathcal{V}_{-j}^{g}\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)=\lambda\nabla\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right).

With some algebra, we can rewrite this as

\displaystyle\left(n^{g}\right)^{-1}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}_{-j}^{g}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}^{g,*}_{j}\right)=-\lambda\nabla\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)+\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{-j}\right)^{\top}\left(\mathbf{X}_{j}^{g}-\mathcal{V}_{-j}^{g}\boldsymbol{\alpha}^{g,*}_{j}\right).

Let Σj\Sigma_{j} be defined as the matrix

Σj=𝔼[(ng)1(𝒱jg)𝒱jg],\displaystyle\Sigma_{j}=\mathds{E}\left[\left(n^{g}\right)^{-1}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}_{-j}^{g}\right],

and let \tilde{M}_{j} be an estimate of \Sigma_{j}^{-1}. We can write \left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right) as

(𝜶~jg𝜶jg,)=\displaystyle\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)= λM~j𝒫j(𝜶~jg)(i)+(ng)1M~j(𝒱jg)(𝐗jg𝒱jg𝜶jg,)(ii)+\displaystyle\underset{\mathrm{(i)}}{\underbrace{-\lambda\tilde{M}_{j}\nabla\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)}}+\underset{\mathrm{(ii)}}{\underbrace{\left(n^{g}\right)^{-1}\tilde{M}_{j}\left(\mathcal{V}^{g}_{-j}\right)^{\top}\left(\mathbf{X}_{j}^{g}-\mathcal{V}_{-j}^{g}\boldsymbol{\alpha}^{g,*}_{j}\right)}}+
{I(ng)1M~j(𝒱jg)𝒱jg}(𝜶~jg𝜶jg,)(iii).\displaystyle\underset{\mathrm{(iii)}}{\underbrace{\left\{I-\left(n^{g}\right)^{-1}\tilde{M}_{j}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}_{-j}^{g}\right\}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}^{g,*}_{j}\right)}}. (25)

The first term (\mathrm{i}) in (25) is an approximation for the bias of the group LASSO estimate. This term is a function only of the observed data and not of any unknown quantities; it can therefore be subtracted directly from the initial estimate \tilde{\boldsymbol{\alpha}}_{j}^{g}. If \tilde{M}_{j} is a consistent estimate of \Sigma_{j}^{-1}, the second term (\mathrm{ii}) is asymptotically equivalent to

\displaystyle\left(n^{g}\right)^{-1}\Sigma^{-1}_{j}\left(\mathcal{V}^{g}_{-j}\right)^{\top}\left(\mathbf{X}_{j}^{g}-\mathcal{V}_{-j}^{g}\boldsymbol{\alpha}^{g,*}_{j}\right).

Thus, (ii)(\mathrm{ii}) is asymptotically equivalent to a sample average of mean zero i.i.d. random variables. The central limit theorem can then be applied to establish convergence in distribution to the multivariate normal distribution at an n1/2n^{1/2} rate for any low-dimensional sub-vector. The third term will also be asymptotically negligible if M~j\tilde{M}_{j} is an approximate inverse of (ng)1(𝒱jg)𝒱jg(n^{g})^{-1}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}^{g}_{-j}. This would suggest that an estimator of the form

𝜶ˇjg=𝜶~jg+λM~j𝒫j(𝜶~jg)\displaystyle\check{\boldsymbol{\alpha}}_{j}^{g}=\tilde{\boldsymbol{\alpha}}_{j}^{g}+\lambda\tilde{M}_{j}\nabla\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)

will be asymptotically normal for an appropriate choice of M~j\tilde{M}_{j}.

Before describing our construction of M~j\tilde{M}_{j}, we find it helpful to consider an alternative expression for Σj1\Sigma^{-1}_{j}. We define the d×dd\times d matrices Γj,k,l\Gamma^{*}_{j,k,l} as

Γj,k,1,,Γj,k,p=arg minΓ1,,Γpd×d𝔼[trace{(ng)1(𝒱kglk,j𝒱lgΓl)(𝒱kglk,j𝒱lgΓl)}].\displaystyle\Gamma^{*}_{j,k,1},\ldots,\Gamma^{*}_{j,k,p}=\underset{\Gamma_{1},\ldots,\Gamma_{p}\in\mathds{R}^{d\times d}}{\text{arg min}}\mathds{E}\left[\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\Gamma_{l}\right)^{\top}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\Gamma_{l}\right)\right\}\right]. (26)

We also define the d\times d matrix C^{*}_{j,k} as

Cj,k=𝔼[(ng)1(𝒱kglk,j𝒱lgΓj,k,l)𝒱kg].\displaystyle C^{*}_{j,k}=\mathds{E}\left[\left(n^{g}\right)^{-1}\left(\mathcal{V}_{k}^{g}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\Gamma^{*}_{j,k,l}\right)^{\top}\mathcal{V}_{k}^{g}\right].

It can be shown that Σj1\Sigma^{-1}_{j} can be expressed as

Σj1=((Cj,1)1𝟎𝟎(Cj,p)1)(IΓj,1,2Γj,1,pΓj,2,1IΓj,2,pΓj,p,1Γj,p,2I).\displaystyle\Sigma^{-1}_{j}=\begin{pmatrix}\left(C_{j,1}^{*}\right)^{-1}&\cdots&\mathbf{0}\\ \vdots&\ddots&\vdots\\ \mathbf{0}&\cdots&\left(C_{j,p}^{*}\right)^{-1}\end{pmatrix}\begin{pmatrix}I&-\Gamma^{*}_{j,1,2}&\cdots&-\Gamma^{*}_{j,1,p}\\ -\Gamma^{*}_{j,2,1}&I&\cdots&-\Gamma^{*}_{j,2,p}\\ \vdots&\vdots&\ddots&\vdots\\ -\Gamma^{*}_{j,p,1}&-\Gamma^{*}_{j,p,2}&\cdots&I\end{pmatrix}.

We can thus estimate Σj1\Sigma_{j}^{-1} by performing a series of regressions to estimate each matrix Γj,k,l\Gamma^{*}_{j,k,l}.

Following the approach of van de Geer et al. [12], we use a group LASSO variant of the nodewise LASSO to construct M~j\tilde{M}_{j}. To proceed, we require some additional notation. For any d×dd\times d matrix Γ=(γ1,,γd)\Gamma=(\gamma_{1},\cdots,\gamma_{d}) for dd-dimensional vectors γc\gamma_{c}, let Γ2,=c=1dγc2\|\Gamma\|_{2,*}=\sum_{c=1}^{d}\|\gamma_{c}\|_{2}. Let Γ2,=(γ1/γ12,,γd/γd2)\nabla\|\Gamma\|_{2,*}=\left(\gamma_{1}/\|\gamma_{1}\|_{2},\cdots,\gamma_{d}/\|\gamma_{d}\|_{2}\right) be the subgradient of Γ2,\|\Gamma\|_{2,*}. We use the group LASSO to obtain estimates Γ~j,k,l\tilde{\Gamma}_{j,k,l} of Γj,k,l\Gamma^{*}_{j,k,l}:

Γ~j,k,1,,Γ~j,k,p=\displaystyle\tilde{\Gamma}_{j,k,1},\ldots,\tilde{\Gamma}_{j,k,p}=
arg minΓ1,,Γpd×dtrace{(ng)1(𝒱kglk,j𝒱lgΓl)(𝒱kglk,j𝒱lgΓl)}+ωlk,jΓl2,.\displaystyle\underset{\Gamma_{1},\ldots,\Gamma_{p}\in\mathds{R}^{d\times d}}{\text{arg min}}\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\Gamma_{l}\right)^{\top}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\Gamma_{l}\right)\right\}+\omega\sum_{l\neq k,j}\|\Gamma_{l}\|_{2,*}. (27)

We then estimate Cj,kC^{*}_{j,k} as

C~j,k=(ng)1(𝒱kglk,j𝒱lgΓ~j,k,l)(𝒱kg).\displaystyle\tilde{C}_{j,k}=\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq k,j}\mathcal{V}^{g}_{l}\tilde{\Gamma}_{j,k,l}\right)^{\top}\left(\mathcal{V}_{k}^{g}\right).

Our estimate M~j\tilde{M}_{j} takes the form

M~j=(C~j,11𝟎𝟎C~j,p1)(IΓ~j,1,2Γ~j,1,pΓ~j,2,1IΓ~j,2,pΓ~j,p,1Γ~j,p,2I).\displaystyle\tilde{M}_{j}=\begin{pmatrix}\tilde{C}^{-1}_{j,1}&\cdots&\mathbf{0}\\ \vdots&\ddots&\vdots\\ \mathbf{0}&\cdots&\tilde{C}^{-1}_{j,p}\end{pmatrix}\begin{pmatrix}I&-\tilde{\Gamma}_{j,1,2}&\cdots&-\tilde{\Gamma}_{j,1,p}\\ -\tilde{\Gamma}_{j,2,1}&I&\cdots&-\tilde{\Gamma}_{j,2,p}\\ \vdots&\vdots&\ddots&\vdots\\ -\tilde{\Gamma}_{j,p,1}&-\tilde{\Gamma}_{j,p,2}&\cdots&I\end{pmatrix}.
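Schematically, \tilde{M}_{j} is assembled from the nodewise group LASSO output as in the following sketch, which assumes the blocks \tilde{C}_{j,k} and \tilde{\Gamma}_{j,k,l} have already been computed:

```python
import numpy as np

def assemble_M(C_tilde, Gamma_tilde, d):
    # Build M~_j from the nodewise group LASSO output.
    # C_tilde: list of m (d x d) matrices C~_{j,k}.
    # Gamma_tilde[k][l]: (d x d) matrix Gamma~_{j,k,l}; entries with l == k are ignored.
    m = len(C_tilde)
    left = np.zeros((m * d, m * d))   # block diagonal of the C~_{j,k}^{-1}
    right = np.eye(m * d)             # identity with -Gamma~ blocks off the diagonal
    for k in range(m):
        rows = slice(k * d, (k + 1) * d)
        left[rows, rows] = np.linalg.inv(C_tilde[k])
        for l in range(m):
            if l != k:
                right[rows, l * d:(l + 1) * d] = -Gamma_tilde[k][l]
    return left @ right
```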

With this construction of M~j\tilde{M}_{j}, we can establish a bound on the remainder term (iii)(\mathrm{iii}) in (25). To show this, we make use of the following lemma, which states a special case of the dual norm inequality for the group LASSO norm 𝒫j\mathcal{P}_{j} (see, e.g., Chapter 6 of van de Geer [11]).

Lemma 1.

Let a1,,apa_{1},\ldots,a_{p} and b1,,bpb_{1},\ldots,b_{p} be dd-dimensional vectors, and let 𝐚=(a1,,ap)\mathbf{a}=\left(a_{1}^{\top},\cdots,a_{p}^{\top}\right)^{\top} and 𝐛=(b1,,bp)\mathbf{b}=\left(b_{1}^{\top},\cdots,b_{p}^{\top}\right)^{\top} be pdpd-dimensional vectors. Then

𝐚,𝐛(j=1paj2)maxjbj2.\displaystyle\langle\mathbf{a},\mathbf{b}\rangle\leq\left(\sum_{j=1}^{p}\|a_{j}\|_{2}\right)\max_{j}\left\|b_{j}\right\|_{2}.

The KKT conditions for (27) imply that for all lj,kl\neq j,k

(ng)1(𝒱lg)(𝒱kgrk,j𝒱rgΓ~j,k,r)=ωΓ~j,k,l2,.\displaystyle\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{l}\right)^{\top}\left(\mathcal{V}^{g}_{k}-\sum_{r\neq k,j}\mathcal{V}^{g}_{r}\tilde{\Gamma}_{j,k,r}\right)=-\omega\nabla\left\|\tilde{\Gamma}_{j,k,l}\right\|_{2,*}. (28)

Lemma 1 and (28) imply that

(C~j,1𝟎𝟎C~j,p){I(ng)1M~j(𝒱jg)𝒱jg}(𝜶~jg𝜶jg,)ω𝒫j(𝜶~jg𝜶jg,),\displaystyle\left\|\begin{pmatrix}\tilde{C}_{j,1}&\cdots&\mathbf{0}\\ \vdots&\ddots&\vdots\\ \mathbf{0}&\cdots&\tilde{C}_{j,p}\end{pmatrix}\left\{I-\left(n^{g}\right)^{-1}\tilde{M}_{j}\left(\mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}_{-j}^{g}\right\}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}^{g,*}_{j}\right)\right\|_{\infty}\leq\omega\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}^{g,*}_{j}\right),

where \|\cdot\|_{\infty} is the \ell_{\infty} norm. With ω{log(p)/n}1/2\omega\asymp\left\{\log(p)/n\right\}^{1/2}, M~j\tilde{M}_{j} can be shown to be consistent under sparsity of Γj,k,l\Gamma^{*}_{j,k,l} (i.e., only a few matrices Γj,k,l\Gamma^{*}_{j,k,l} have some nonzero columns) and some additional regularity conditions. Additionally, it can be shown under sparsity of 𝜶g,\boldsymbol{\alpha}^{g,*} (i.e., very few vectors αj,kg,\alpha^{g,*}_{j,k} are nonzero) and some additional regularity conditions that 𝒫j(𝜶~jg𝜶jg,)=OP({log(p)/n}1/2)\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}_{j}^{g,*}\right)=O_{P}\left(\left\{\log(p)/n\right\}^{1/2}\right). Thus, a scaled version of the remainder term (iii) is oP(n1/2)o_{P}(n^{-1/2}) if n1/2log(p)0n^{-1/2}\log(p)\to 0. We refer readers to Chapter 8 of Bühlmann and van de Geer [5] for a more comprehensive discussion of assumptions required for consistency of the group LASSO.

We now express the de-biased group LASSO estimator for αj,kg,\alpha^{g,*}_{j,k} as

\displaystyle\check{\alpha}^{g}_{j,k}=\tilde{\alpha}^{g}_{j,k}+\left(n^{g}\right)^{-1}\tilde{C}^{-1}_{j,k}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq j,k}\mathcal{V}_{l}^{g}\tilde{\Gamma}_{j,k,l}\right)^{\top}\left(\mathbf{X}^{g}_{j}-\mathcal{V}^{g}_{-j}\tilde{\boldsymbol{\alpha}}^{g}_{j}\right). (29)

We have established that αˇj,kg\check{\alpha}^{g}_{j,k} can be written as

\displaystyle\tilde{C}_{j,k}\left(\check{\alpha}^{g}_{j,k}-\alpha^{g,*}_{j,k}\right)=\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq j,k}\mathcal{V}_{l}^{g}\Gamma^{*}_{j,k,l}\right)^{\top}\left(\mathbf{X}^{g}_{j}-\mathcal{V}^{g}_{-j}\boldsymbol{\alpha}^{g,*}_{j}\right)+o_{P}(n^{-1/2}).

As stated above, the central limit theorem implies asymptotic normality of \check{\alpha}^{g}_{j,k}.
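For concreteness, the correction in (29) can be computed blockwise as in the following sketch, which assumes the group LASSO fit, the nodewise estimates \tilde{\Gamma}_{j,k,l}, and \tilde{C}_{j,k} are available; the indexing conventions are simplified.

```python
import numpy as np

def debias_block(k, alpha_tilde, V_blocks, X_j, C_tilde, Gamma_tilde, n):
    # De-biased estimate of alpha_{j,k} following (29).
    # V_blocks: list of (n x d) matrices, one per remaining node;
    # alpha_tilde: list of d-vectors (the group LASSO estimates);
    # Gamma_tilde: dict mapping l -> (d x d) matrix Gamma~_{j,k,l} for l != j, k;
    # C_tilde: the (d x d) matrix C~_{j,k}.
    fitted = sum(V @ a for V, a in zip(V_blocks, alpha_tilde))
    resid = X_j - fitted                               # X_j - V_{-j} alpha~_j
    V_res = V_blocks[k] - sum(V_blocks[l] @ G for l, G in Gamma_tilde.items())
    correction = np.linalg.solve(C_tilde, V_res.T @ resid) / n
    return alpha_tilde[k] + correction
```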

We now construct an estimate for the variance of αˇj,kg\check{\alpha}^{g}_{j,k}. Suppose the residual 𝐗jg𝒱jg𝜶jg,\mathbf{X}^{g}_{j}-\mathcal{V}^{g}_{-j}\boldsymbol{\alpha}^{g,*}_{j} is independent of 𝒱g\mathcal{V}^{g}, and let τjg\tau_{j}^{g} denote the residual variance

τjg=𝔼[(Xjgkjϕig,αj,kg,Xkg)2].\displaystyle\tau_{j}^{g}=\mathds{E}\left[\left(X_{j}^{g}-\sum_{k\neq j}\left\langle\phi_{i}^{g},\alpha_{j,k}^{g,*}\right\rangle X_{k}^{g}\right)^{2}\right].

We can approximate the variance of αˇj,kg\check{\alpha}^{g}_{j,k} as

\displaystyle\check{\Omega}^{g}_{j,k}=\left(n^{g}\right)^{-2}\tau_{j}^{g}\tilde{C}^{-1}_{j,k}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq j,k}\mathcal{V}_{l}^{g}\tilde{\Gamma}_{j,k,l}\right)^{\top}\left(\mathcal{V}^{g}_{k}-\sum_{l\neq j,k}\mathcal{V}_{l}^{g}\tilde{\Gamma}_{j,k,l}\right)\left(\tilde{C}^{-1}_{j,k}\right)^{\top}. (30)

As \tau_{j}^{g} is typically unknown, we instead use the estimate

τ~jg=𝐗jg𝒱jg𝜶~jg22ndf^,\displaystyle\tilde{\tau}_{j}^{g}=\frac{\left\|\mathbf{X}^{g}_{j}-\mathcal{V}^{g}_{-j}\tilde{\boldsymbol{\alpha}}^{g}_{j}\right\|_{2}^{2}}{n-\widehat{df}},

where \widehat{df} is an estimate of the degrees of freedom for the group LASSO estimate \tilde{\boldsymbol{\alpha}}_{j}^{g}. In our implementation, we use the estimate proposed by Breheny and Huang [4].
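A small sketch of the resulting plug-in variance computation (30), treating the degrees-of-freedom estimate \widehat{df} as given (we do not reproduce the Breheny and Huang [4] estimate here):

```python
import numpy as np

def variance_estimate(V_res, resid, C_tilde, n, df_hat):
    # Plug-in variance estimate (30) for the de-biased estimate of alpha_{j,k}.
    # V_res: (n x d) residualized block V_k - sum_l V_l Gamma~_{j,k,l};
    # resid: length-n residual X_j - V_{-j} alpha~_j; df_hat: estimated degrees
    # of freedom of the group LASSO fit (treated as given here).
    tau = np.sum(resid ** 2) / (n - df_hat)
    C_inv = np.linalg.inv(C_tilde)
    return tau * C_inv @ (V_res.T @ V_res) @ C_inv.T / n ** 2
```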

B  Generalized Score Matching Estimator

In this section, we establish consistency of the regularized score matching estimator and derive a bias-corrected estimator.

B.1  Form of Generalized Score Matching Loss

Below, we restate Theorem 3 of [44], which provides conditions under which the score matching loss (20) can be expressed as (21).

Theorem 1.

Assume the following conditions hold:

limzjh(z)(zj){zjh(zj)}=0z1,,zj1,zj+1,,zp+,h\displaystyle\lim_{z_{j}\to\infty}h^{*}(z)(z_{j})\left\{\frac{\partial}{\partial z_{j}}h(z_{j})\right\}=0\quad\forall\,z_{1},\ldots,z_{j-1},z_{j+1},\ldots,z_{p}\in\mathds{R}_{+},\quad\forall h\in\mathcal{H}
limzj0h(z)(zj){zjh(zj)}=0z1,,zj1,zj+1,,zp+,h\displaystyle\lim_{z_{j}\to 0}h^{*}(z)(z_{j})\left\{\frac{\partial}{\partial z_{j}}h(z_{j})\right\}=0\quad\forall\,z_{1},\ldots,z_{j-1},z_{j+1},\ldots,z_{p}\in\mathds{R}_{+},\quad\forall h\in\mathcal{H}
suphlogh(z)v1/2(z)22h(z)𝑑z<\displaystyle\sup_{h\in\mathcal{H}}\int\left\|\nabla\log h(z)\circ v^{1/2}(z)\right\|_{2}^{2}h^{*}(z)dz<\infty
suph{logh(z)v1/2(z)}1h(z)𝑑z<,\displaystyle\sup_{h\in\mathcal{H}}\int\left\|\left\{\nabla\log h(z)\circ v^{1/2}(z)\right\}^{\prime}\right\|_{1}h^{*}(z)dz<\infty,

where the prime symbol denotes the element-wise derivative. Then (20) and (21) are equivalent up to an additive constant that does not depend on hh.

B.2  Generalized Score Matching Estimator in Low Dimensions

In this section, we provide an explicit form for the generalized score matching estimator in the low-dimensional setting and state its limiting distribution. We first introduce some additional notation below that allows for the generalized score matching loss to be written in a condensed form. Recall the form of the conditional density for the pairwise interaction model in (22). We define

𝒱j,k,1g=(vj1/2(X1,jg)ψ˙(X1,jg,X1,kg)×ϕ(W1g)vj1/2(Xng,jg)ψ˙(Xng,jg,Xng,kg)×ϕ(Wngg)),\displaystyle\mathcal{V}^{g}_{j,k,1}=\begin{pmatrix}v_{j}^{1/2}\left(X^{g}_{1,j}\right)\dot{\psi}\left(X^{g}_{1,j},X^{g}_{1,k}\right)\times\phi\left(W^{g}_{1}\right)\\ \vdots\\ v_{j}^{1/2}\left(X^{g}_{n^{g},j}\right)\dot{\psi}\left(X^{g}_{n^{g},j},X^{g}_{n^{g},k}\right)\times\phi\left(W^{g}_{n^{g}}\right)\end{pmatrix},
𝒱2,jg=(vj1/2(X1,jg)×{ζ˙(X1,jg,ϕ1(W1g)),,ζ˙(X1,jg,ϕd(W1g))}vj1/2(Xng,jg)×{ζ˙(Xng,jg,ϕ1(Wngg)),,ζ˙(Xng,jg,ϕd(Wngg))}),\displaystyle\mathcal{V}^{g}_{2,j}=\begin{pmatrix}v_{j}^{1/2}\left(X^{g}_{1,j}\right)\times\left\{\dot{\zeta}\left(X^{g}_{1,j},\phi_{1}(W^{g}_{1})\right),\cdots,\dot{\zeta}\left(X^{g}_{1,j},\phi_{d}(W^{g}_{1})\right)\right\}\\ \vdots\\ v_{j}^{1/2}\left(X^{g}_{n^{g},j}\right)\times\left\{\dot{\zeta}\left(X^{g}_{n^{g},j},\phi_{1}(W^{g}_{n^{g}})\right),\cdots,\dot{\zeta}\left(X^{g}_{n^{g},j},\phi_{d}(W^{g}_{n^{g}})\right)\right\}\end{pmatrix},
𝒰j,k,1g=({v˙j(X1,jg)ψ˙(X1,jg,X1,kg)+vj(X1,jg)ψ¨(X1,jg,X1,kg)}×ϕ(W1g){v˙j(X1,jg)ψ˙(Xng,jg,Xng,kg)+vj(Xng,jg)ψ¨(X1,jg,Xng,kg)}×ϕ(Wngg)),\displaystyle\mathcal{U}^{g}_{j,k,1}=\begin{pmatrix}\left\{\dot{v}_{j}\left(X^{g}_{1,j}\right)\dot{\psi}\left(X^{g}_{1,j},X^{g}_{1,k}\right)+v_{j}\left(X^{g}_{1,j}\right)\ddot{\psi}\left(X^{g}_{1,j},X^{g}_{1,k}\right)\right\}\times\phi\left(W^{g}_{1}\right)\\ \vdots\\ \left\{\dot{v}_{j}\left(X^{g}_{1,j}\right)\dot{\psi}\left(X^{g}_{n^{g},j},X^{g}_{n^{g},k}\right)+v_{j}\left(X^{g}_{n^{g},j}\right)\ddot{\psi}\left(X^{g}_{1,j},X^{g}_{n^{g},k}\right)\right\}\times\phi\left(W^{g}_{n^{g}}\right)\end{pmatrix},
𝒰j,2g=(vj(X1,jg)ζ¨(X1,jg,ϕ1(W1g))vj(X1,jg)ζ¨(X1,jg,ϕd(W1g))vj(Xng,jg)ζ¨(Xng,jg,ϕ1(Wngg))vj(Xng,jg)ζ¨(Xng,jg,ϕd(Wngg)))+\displaystyle\mathcal{U}^{g}_{j,2}=\begin{pmatrix}v_{j}\left(X_{1,j}^{g}\right)\ddot{\zeta}\left(X^{g}_{1,j},\phi_{1}(W^{g}_{1})\right)&\cdots&v_{j}\left(X_{1,j}^{g}\right)\ddot{\zeta}\left(X^{g}_{1,j},\phi_{d}(W^{g}_{1})\right)\\ \vdots&\ddots&\vdots\\ v_{j}\left(X_{n^{g},j}^{g}\right)\ddot{\zeta}\left(X^{g}_{n^{g},j},\phi_{1}(W^{g}_{n^{g}})\right)&\cdots&v_{j}\left(X_{n^{g},j}^{g}\right)\ddot{\zeta}\left(X^{g}_{n^{g},j},\phi_{d}(W^{g}_{n^{g}})\right)\end{pmatrix}+
(v˙j(X1,jg)ζ˙(X1,jg,ϕ1(W1g))v˙j(X1,jg)ζ˙(X1,jg,ϕd(W1g))v˙j(Xng,jg)ζ˙(Xng,jg,ϕ1(Wngg))v˙j(Xng,jg)ζ˙(Xng,jg,ϕd(Wngg))),\displaystyle\quad\quad\quad\begin{pmatrix}\dot{v}_{j}\left(X_{1,j}^{g}\right)\dot{\zeta}\left(X^{g}_{1,j},\phi_{1}(W^{g}_{1})\right)&\cdots&\dot{v}_{j}\left(X_{1,j}^{g}\right)\dot{\zeta}\left(X^{g}_{1,j},\phi_{d}(W^{g}_{1})\right)\\ \vdots&\ddots&\vdots\\ \dot{v}_{j}\left(X_{n^{g},j}^{g}\right)\dot{\zeta}\left(X^{g}_{n^{g},j},\phi_{1}(W^{g}_{n^{g}})\right)&\cdots&\dot{v}_{j}\left(X_{n^{g},j}^{g}\right)\dot{\zeta}\left(X^{g}_{n^{g},j},\phi_{d}(W^{g}_{n^{g}})\right)\end{pmatrix},
𝒱j,1g=(𝒱j,1,1g𝒱j,p,1g);𝒰j,1g=(𝒰1,j,1g𝒰j,p,1g).\displaystyle\mathcal{V}^{g}_{j,1}=\begin{pmatrix}\mathcal{V}^{g}_{j,1,1}\\ \vdots\\ \mathcal{V}^{g}_{j,p,1}\end{pmatrix};\quad\mathcal{U}^{g}_{j,1}=\begin{pmatrix}\mathcal{U}^{g}_{1,j,1}\\ \vdots\\ \mathcal{U}^{g}_{j,p,1}\end{pmatrix}.

Let 𝜶j=(αj,1,,αj,p)\boldsymbol{\alpha}_{j}=\left(\alpha_{j,1}^{\top},\cdots,\alpha_{j,p}^{\top}\right)^{\top} for αj,kd\alpha_{j,k}\in\mathds{R}^{d} and 𝜽j=(θj,1,,θj,d)\boldsymbol{\theta}_{j}=(\theta_{j,1},\cdots,\theta_{j,d})^{\top} for θj,c\theta_{j,c}\in\mathds{R}. We can express the empirical score matching loss (23) as

Ln,jg(𝜶j,𝜽j)\displaystyle L^{g}_{n,j}(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j}) =(2ng)1(𝒱j,1g𝜶j+𝒱2,jg𝜽j)(𝒱j,1g𝜶j+𝒱2,jg𝜽j)+(ng)1𝟏(𝒰1,jg𝜶j+𝒰2,jg𝜽j).\displaystyle=\left(2n^{g}\right)^{-1}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}+\mathcal{V}^{g}_{2,j}\boldsymbol{\theta}_{j}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}+\mathcal{V}^{g}_{2,j}\boldsymbol{\theta}_{j}\right)+\left(n^{g}\right)^{-1}\mathbf{1}^{\top}\left(\mathcal{U}^{g}_{1,j}\boldsymbol{\alpha}_{j}+\mathcal{U}^{g}_{2,j}\boldsymbol{\theta}_{j}\right).

We write the gradient of the risk function as

Ln,jg(𝜶j,𝜽j)\displaystyle\nabla L^{g}_{n,j}(\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j}) =(ng)1((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶j𝜽j)+(ng)1((𝒰j,1g)𝟏(𝒰j,2g)𝟏).\displaystyle=\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\boldsymbol{\alpha}_{j}\\ \boldsymbol{\theta}_{j}\end{pmatrix}+\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\\ \left(\mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1}\end{pmatrix}.

Thus, the minimizer (𝜶^jg,𝜽^jg)(\hat{\boldsymbol{\alpha}}^{g}_{j},\hat{\boldsymbol{\theta}}^{g}_{j}) of the empirical loss takes the form

(𝜶^jg𝜽^jg)=((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)1((𝒰j,1g)𝟏(𝒰j,2g)𝟏).\displaystyle\begin{pmatrix}\hat{\boldsymbol{\alpha}}^{g}_{j}\\ \hat{\boldsymbol{\theta}}^{g}_{j}\end{pmatrix}=-\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}^{-1}\begin{pmatrix}\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\\ \left(\mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1}\end{pmatrix}.

By applying Theorem 5.23 of van der Vaart [34],

(ng)1/2(𝜶^jg𝜶jg,𝜽^jg𝜽jg,)dN(0,(ABA)),\displaystyle\left(n^{g}\right)^{1/2}\begin{pmatrix}\hat{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \hat{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\end{pmatrix}\to_{d}N\left(0,\begin{pmatrix}ABA\end{pmatrix}\right),

where the matrices AA and BB are defined as

A=𝔼[(ng)1((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)]1,\displaystyle A=\mathds{E}\left[\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\right]^{-1},
B=Cov((ng)1((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶jg,𝜽jg,)+(ng)1((𝒰j,1g)𝟏(𝒰j,2g)𝟏)).\displaystyle B=\text{Cov}\left(\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\boldsymbol{\alpha}^{g,*}_{j}\\ \boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\\ \left(\mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1}\end{pmatrix}\right).

We estimate the variance of (𝜶^jg,𝜽^jg)(\hat{\boldsymbol{\alpha}}^{g}_{j},\hat{\boldsymbol{\theta}}^{g}_{j}) as Ω^jg=(ng)1A^B^A^\hat{\Omega}^{g}_{j}=\left(n^{g}\right)^{-1}\hat{A}\hat{B}\hat{A}, where

A^=ng((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)1,\displaystyle\hat{A}=n^{g}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}^{-1},
B^=(ng)1ξ^ξ^,ξ^=(diag(𝒱j,1g𝜶^jg+𝒱j,2g𝜽^jg)𝒱j,1gdiag(𝒱j,1g𝜶^jg+𝒱j,2g𝜽^jg)𝒱j,2g)+(𝒰j,1g𝒰j,2g).\displaystyle\hat{B}=\left(n^{g}\right)^{-1}\hat{\xi}^{\top}\hat{\xi},\quad\hat{\xi}=\begin{pmatrix}\text{diag}\left(\mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j}+\mathcal{V}_{j,2}^{g}\hat{\boldsymbol{\theta}}^{g}_{j}\right)\mathcal{V}_{j,1}^{g}\\ \text{diag}\left(\mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j}+\mathcal{V}_{j,2}^{g}\hat{\boldsymbol{\theta}}^{g}_{j}\right)\mathcal{V}_{j,2}^{g}\end{pmatrix}+\begin{pmatrix}\mathcal{U}_{j,1}^{g}\\ \mathcal{U}_{j,2}^{g}\end{pmatrix}.
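In matrix form, the low-dimensional estimate and its sandwich variance can be computed directly, as in the sketch below; here V1, V2, U1, and U2 stand for the stacked blocks multiplying \boldsymbol{\alpha}_{j} and \boldsymbol{\theta}_{j} and the corresponding offset terms, with rows indexed by observations (an assumption about how the matrices are arranged).

```python
import numpy as np

def score_matching_fit(V1, V2, U1, U2):
    # Closed-form minimizer of the empirical generalized score matching loss and
    # its sandwich variance estimate; V1, V2 multiply alpha_j and theta_j, and
    # U1, U2 are the corresponding offset blocks, all with rows indexed by i.
    n = V1.shape[0]
    V = np.hstack([V1, V2])
    U = np.hstack([U1, U2])
    Sigma_n = V.T @ V / n
    est = -np.linalg.solve(Sigma_n, U.sum(axis=0) / n)   # (alpha_hat, theta_hat)
    A_hat = np.linalg.inv(Sigma_n)
    xi = V * (V @ est)[:, None] + U                      # per-observation scores
    B_hat = xi.T @ xi / n
    return est, A_hat @ B_hat @ A_hat / n                # estimate, variance
```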

B.3  Consistency of Regularized Generalized Score Matching Estimator

In this subsection, we argue that the regularized generalized score matching estimators \tilde{\boldsymbol{\alpha}}^{g}_{j} and \tilde{\boldsymbol{\theta}}^{g}_{j} from (24) are consistent. Let \mathcal{P}_{j}(\boldsymbol{\alpha}_{j})=\sum_{k\neq j}\|\alpha_{j,k}\|_{2}. We establish convergence rates of \mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}_{j}^{g}-\boldsymbol{\alpha}_{j}^{g,*}\right) and \left\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\right\|_{2}. Our approach is based on proof techniques described in Bühlmann and van de Geer [5].

Our result requires a notion of compatibility between the penalty function 𝒫j\mathcal{P}_{j} and the loss Ln,jgL^{g}_{n,j}. Such notions are commonly assumed in the high-dimensional literature. Below, we define the compatibility condition.

Definition 1 (Compatibility Condition).

Let S\subset\{1,\ldots,p\} be the set of indices k for which \alpha^{g,*}_{j,k} is nonzero, and let \bar{S} denote its complement. Let \mathds{1}_{S} be the (p-1)d-dimensional vector whose entries equal one in the blocks corresponding to S and zero otherwise. The group LASSO compatibility condition holds for the index set S and constant C>0 if, for all (\boldsymbol{\alpha}_{j},\boldsymbol{\theta}_{j}) satisfying \mathcal{P}_{j}\left(\boldsymbol{\alpha}_{j}\circ\mathds{1}_{\bar{S}}\right)\leq 3\,\mathcal{P}_{j}\left(\boldsymbol{\alpha}_{j}\circ\mathds{1}_{S}\right)+\|\boldsymbol{\theta}_{j}\|_{2},

\displaystyle\mathcal{P}_{j}\left(\boldsymbol{\alpha}_{j}\circ\mathds{1}_{S}\right)+\|\boldsymbol{\theta}_{j}\|_{2}/2\leq\frac{|S|^{1/2}}{C}\left\{\left(n^{g}\right)^{-1}\begin{pmatrix}\boldsymbol{\alpha}_{j}^{\top}&\boldsymbol{\theta}_{j}^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\boldsymbol{\alpha}_{j}\\ \boldsymbol{\theta}_{j}\end{pmatrix}\right\}^{1/2},

where \circ is the element-wise product operator.

Theorem 2.

Let \mathcal{E} be the set

=\displaystyle\mathcal{E}= {maxkj{(𝒱j,k,1g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,1g)𝟏2}ngλ0}\displaystyle\left\{\max_{k\neq j}\left\{\left\|\left(\mathcal{V}_{j,k,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\right\|_{2}\right\}\leq n^{g}\lambda_{0}\right\}\cap
{(𝒱j,k,2g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,2g)𝟏2ngλ0}\displaystyle\left\{\left\|\left(\mathcal{V}_{j,k,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\right\|_{2}\leq n^{g}\lambda_{0}\right\}

for some λ0λ/2\lambda_{0}\leq\lambda/2. Suppose the compatibility condition also holds. Then on the set \mathcal{E},

𝒫(𝜶~jg𝜶jg,)+𝜽~jg𝜽jg,2λ4|S|C2.\displaystyle\mathcal{P}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)+\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\|_{2}\leq\frac{\lambda 4|S|}{C^{2}}.
Proof of Theorem 2.

The regularized score matching estimator 𝜶~jg\tilde{\boldsymbol{\alpha}}_{j}^{g} necessarily satisfies the following basic inequality:

Ln,jg(𝜶~jg,𝜽~jg)+λ𝒫j(𝜶~jg)Ln,jg(𝜶jg,,𝜽jg,)+λ𝒫j(𝜶jg,).\displaystyle L^{g}_{n,j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j},\tilde{\boldsymbol{\theta}}^{g}_{j}\right)+\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\leq L^{g}_{n,j}\left(\boldsymbol{\alpha}^{g,*}_{j},\boldsymbol{\theta}^{g,*}_{j}\right)+\lambda\mathcal{P}_{j}\left(\boldsymbol{\alpha}^{g,*}_{j}\right).

With some algebra, this inequality can be rewritten as

(2ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)+λ𝒫j(𝜶~jg)\displaystyle\left(2n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\leq
(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,1g)𝟏(𝒱j,2g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,2g)𝟏)+λ𝒫j(𝜶jg,).\displaystyle-\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\end{pmatrix}+\lambda\mathcal{P}_{j}\left(\boldsymbol{\alpha}^{g,*}_{j}\right).

By Lemma 1, on the set \mathcal{E} and using \lambda\geq 2\lambda_{0}, we get

(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)+2λ𝒫j(𝜶~jg)\displaystyle\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+2\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\leq
λ𝜽~j𝜽j2+2λ𝒫j(𝜶jg,)+λ𝒫j(𝜶~jg𝜶jg,).\displaystyle\lambda\left\|\tilde{\boldsymbol{\theta}}_{j}-\boldsymbol{\theta}^{*}_{j}\right\|_{2}+2\lambda\mathcal{P}_{j}\left(\boldsymbol{\alpha}^{g,*}_{j}\right)+\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right).

On the left hand side, we apply the triangle inequality to get

𝒫j(𝜶~jg)=𝒫j(𝜶~jg𝟙S)+𝒫j(𝜶~jg𝟙S¯)𝒫j(𝜶jg,𝟙S)𝒫j((𝜶~jg𝜶jg,)𝟙S)+𝒫j(𝜶~jg𝟙S¯).\displaystyle\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)=\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{S}\right)+\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{\bar{S}}\right)\geq\mathcal{P}_{j}\left(\boldsymbol{\alpha}_{j}^{g,*}\circ\mathds{1}_{S}\right)-\mathcal{P}_{j}\left(\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)\circ\mathds{1}_{S}\right)+\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{\bar{S}}\right).

On the right hand side, we observe that

𝒫j(𝜶~jg𝜶jg,)=𝒫j((𝜶~jg𝜶jg,)𝟙S)+𝒫j(𝜶~jg𝟙S¯).\displaystyle\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)=\mathcal{P}_{j}\left(\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)\circ\mathds{1}_{S}\right)+\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{\bar{S}}\right).

We then have

(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)+λ𝒫j(𝜶~jg𝟙S¯)\displaystyle\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{\bar{S}}\right)\leq
λ𝜽~jg𝜽jg,2+3λ𝒫j((𝜶~jg𝜶jg,)𝟙S).\displaystyle\lambda\left\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right\|_{2}+3\lambda\mathcal{P}_{j}\left(\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)\circ\mathds{1}_{S}\right).

Now,

(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)+\displaystyle\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+
λ𝒫j(𝜶~jg𝜶jg,)+λ𝜽~jg𝜽jg,2=\displaystyle\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)+\lambda\left\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right\|_{2}=
(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)+\displaystyle\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}+
λ𝒫j(𝜶~jg𝟙S¯)+λ𝒫j((𝜶~jg𝜶jg,)𝟙S)+λ𝜽~jg𝜽jg,2\displaystyle\lambda\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\circ\mathds{1}_{\bar{S}}\right)+\lambda\mathcal{P}_{j}\left(\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\right)\circ\mathds{1}_{S}\right)+\lambda\left\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right\|_{2}\leq
λ2|S|1/2C{(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,)}1/2\displaystyle\frac{\lambda 2|S|^{1/2}}{C}\left\{\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix}\right\}^{1/2}\leq
λ24|S|C2+(ng)1((𝜶~jg𝜶jg,)(𝜽~jg𝜽jg,))((𝒱j,1g)𝒱j,1g(𝒱j,1g)𝒱j,2g(𝒱j,2g)𝒱j,1g(𝒱j,2g)𝒱j,2g)(𝜶~jg𝜶jg,𝜽~jg𝜽jg,),\displaystyle\frac{\lambda^{2}4|S|}{C^{2}}+\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)^{\top}&\left(\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\right)^{\top}\end{pmatrix}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}^{g,*}_{j}\end{pmatrix},

where we use the compatibility condition for the first inequality, and for the second inequality use the fact that

abb2+a2\displaystyle ab\leq b^{2}+a^{2}

for any a,ba,b\in\mathds{R}. The conclusion follows immediately. ∎

If the event \mathcal{E} occurs with probability tending to one, Theorem 2 implies

𝒫(𝜶~jg𝜶jg,)+𝜽~jg𝜽jg,2=OP(λ).\displaystyle\mathcal{P}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)+\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\|_{2}=O_{P}\left(\lambda\right).

We select λ\lambda so that the event \mathcal{E} occurs with high probability. For instance, suppose the elements of the matrix

ξ=(diag(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)𝒱j,1gdiag(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)𝒱j,2g)\displaystyle\xi=\begin{pmatrix}\text{diag}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}^{g,*}_{j}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}^{g,*}_{j}\right)\mathcal{V}_{j,1}^{g}\\ \text{diag}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}^{g,*}_{j}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}^{g,*}_{j}\right)\mathcal{V}_{j,2}^{g}\end{pmatrix}

are sub-Gaussian, and consider the event

¯=\displaystyle\bar{\mathcal{E}}= {((𝒱j,1g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,1g)𝟏(𝒱j,2g)(𝒱j,1g𝜶jg,+𝒱j,2g𝜽jg,)+(𝒰j,2g)𝟏)ngλ0d},\displaystyle\left\{\left\|\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\end{pmatrix}\right\|_{\infty}\leq\frac{n^{g}\lambda_{0}}{d}\right\},

where \|\cdot\|_{\infty} is the \ell_{\infty} norm. Observing that \bar{\mathcal{E}}\subset\mathcal{E}, it is only necessary to show that \bar{\mathcal{E}} holds with high probability. It is shown in Corollary 2 of Negahban et al. [29] that there exist constants u_{1},u_{2}>0 such that with \lambda_{0}\asymp\{\log(p)/n\}^{1/2}, \bar{\mathcal{E}} holds with probability at least 1-u_{1}p^{-u_{2}}. Thus, \mathcal{E} occurs with probability tending to one as p\to\infty. For distributions with heavier tails, a larger choice of \lambda may be required [44].

B.4  De-biased Score Matching Estimator

The KKT conditions for the regularized score matching loss imply that the estimators \tilde{\boldsymbol{\alpha}}^{g}_{j} and \tilde{\boldsymbol{\theta}}^{g}_{j} satisfy

\displaystyle\nabla L_{n,j}(\tilde{\boldsymbol{\alpha}}^{g}_{j},\tilde{\boldsymbol{\theta}}^{g}_{j})=\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}_{j}^{g}\\ \tilde{\boldsymbol{\theta}}_{j}^{g}\end{pmatrix}+\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\\ \left(\mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1}\end{pmatrix}=\begin{pmatrix}\lambda\nabla P\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\\ \mathbf{0}\end{pmatrix}.

With some algebra, we can rewrite the KKT conditions as

\displaystyle\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\end{pmatrix}=
\displaystyle\lambda\begin{pmatrix}\nabla P\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\\ \mathbf{0}\end{pmatrix}-\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\end{pmatrix}.

Now, let \Sigma_{j,n} be the matrix

\displaystyle\Sigma_{j,n}=\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix},

let \Sigma_{j}=\mathds{E}[\Sigma_{j,n}], and let \tilde{M}_{j} be an estimate of \Sigma_{j}^{-1}. We can now rewrite the KKT conditions as

\displaystyle\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\end{pmatrix}=\underset{(\mathrm{i})}{\underbrace{\lambda\tilde{M}_{j}\begin{pmatrix}\nabla P\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\\ \mathbf{0}\end{pmatrix}}}-\underset{(\mathrm{ii})}{\underbrace{\left(n^{g}\right)^{-1}\tilde{M}_{j}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\end{pmatrix}}}+
\displaystyle\quad\quad\quad\underset{(\mathrm{iii})}{\underbrace{\left\{I-\tilde{M}_{j}\Sigma_{j,n}\right\}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\end{pmatrix}}}. (31)

As is the case for the de-biased group LASSO in Appendix A, the first term (\mathrm{i}) in (31) depends only on the observed data and can be directly subtracted from the initial estimate. The second term (\mathrm{ii}) is asymptotically equivalent to

\displaystyle\left(n^{g}\right)^{-1}\Sigma_{j}^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,1}\right)^{\top}\mathbf{1}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\left(\mathcal{V}_{j,1}^{g}\boldsymbol{\alpha}_{j}^{g,*}+\mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*}\right)+\left(\mathcal{U}^{g}_{j,2}\right)^{\top}\mathbf{1}\end{pmatrix}, (32)

if \tilde{M}_{j} is a consistent estimate of \Sigma_{j}^{-1}. Using the fact that \mathds{E}\left[\nabla L^{g}_{n,j}\left(\boldsymbol{\alpha}^{g,*}_{j},\boldsymbol{\theta}^{g,*}_{j}\right)\right]=\mathbf{0}, it can be seen that (32) is an average of i.i.d. random quantities with mean zero. The central limit theorem then implies that any low-dimensional sub-vector is asymptotically normal. The last term (\mathrm{iii}) is asymptotically negligible if \tilde{M}_{j} is an approximate inverse of \Sigma_{j,n} and if (\tilde{\boldsymbol{\alpha}}_{j}^{g},\tilde{\boldsymbol{\theta}}_{j}^{g}) is consistent for (\boldsymbol{\alpha}_{j}^{g,*},\boldsymbol{\theta}_{j}^{g,*}). Thus, for an appropriate choice of \tilde{M}_{j}, we expect asymptotic normality of an estimator of the form

\displaystyle\begin{pmatrix}\check{\boldsymbol{\alpha}}^{g}_{j}\\ \check{\boldsymbol{\theta}}^{g}_{j}\end{pmatrix}=\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}\end{pmatrix}-\lambda\tilde{M}_{j}\begin{pmatrix}\nabla P\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}\right)\\ \mathbf{0}\end{pmatrix}.
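
To illustrate the one-step correction above (a minimal numerical sketch; the array layout and the assumption that P is a group LASSO penalty with groups of size d are ours, not a statement of the authors' implementation):

\begin{verbatim}
import numpy as np

def debias_score_matching(alpha_tilde, theta_tilde, M_tilde, lam, d):
    """One-step de-biasing of the regularized score matching estimate.

    Assumes (for illustration) that P is a group LASSO penalty with
    groups of size d, so its (sub)gradient at a block with nonzero norm
    is the block divided by its Euclidean norm.
    """
    # Block-wise (sub)gradient of the assumed group penalty at alpha_tilde.
    blocks = alpha_tilde.reshape(-1, d)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    grad_P = np.where(norms > 0, blocks / np.maximum(norms, 1e-12), 0.0).ravel()

    # Pad with zeros for the theta block, which is not penalized.
    grad_full = np.concatenate([grad_P, np.zeros_like(theta_tilde)])

    # Subtract lambda * M_tilde * (grad P, 0) from the initial estimate.
    est = np.concatenate([alpha_tilde, theta_tilde]) - lam * (M_tilde @ grad_full)
    return est[:alpha_tilde.size], est[alpha_tilde.size:]
\end{verbatim}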

Before constructing \tilde{M}_{j}, we first provide an alternative expression for \Sigma_{j}^{-1}. We define the d\times d matrices \Gamma^{*}_{j,k,l} and \Delta^{*}_{j,k} as

\displaystyle\Gamma^{*}_{j,k,1},\ldots,\Gamma^{*}_{j,k,p},\Delta^{*}_{j,k}=
\displaystyle\underset{\Gamma_{1},\ldots,\Gamma_{p},\Delta\in\mathds{R}^{d\times d}}{\text{arg min}}\mathds{E}\left[\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\Gamma_{l}-\mathcal{V}^{g}_{j,2}\Delta\right)^{\top}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\Gamma_{l}-\mathcal{V}^{g}_{j,2}\Delta\right)\right\}\right].

We also define the d\times d matrices \Lambda^{*}_{j,k} as

\displaystyle\Lambda^{*}_{j,1},\ldots,\Lambda^{*}_{j,p}=\underset{\Lambda_{1},\ldots,\Lambda_{p}\in\mathds{R}^{d\times d}}{\text{arg min}}\mathds{E}\left[\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\Lambda_{k}\right)^{\top}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\Lambda_{k}\right)\right\}\right].

Additionally, we define the d\times d matrices C^{*}_{j,k} and D^{*}_{j} as

\displaystyle C^{*}_{j,k}=\mathds{E}\left[\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,k,1}\right)^{\top}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\Gamma^{*}_{j,k,l}-\mathcal{V}^{g}_{j,2}\Delta^{*}_{j,k}\right)\right]
\displaystyle D^{*}_{j}=\mathds{E}\left[\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,2}\right)^{\top}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\Lambda^{*}_{j,k}\right)\right].

It can be shown that \Sigma_{j}^{-1} can be expressed as

\displaystyle\Sigma^{-1}_{j}=\begin{pmatrix}\left(C^{*}_{j,1}\right)^{-1}&\cdots&\mathbf{0}&\mathbf{0}\\ \vdots&\ddots&\vdots&\vdots\\ \mathbf{0}&\cdots&\left(C^{*}_{j,p}\right)^{-1}&\mathbf{0}\\ \mathbf{0}&\cdots&\mathbf{0}&\left(D^{*}_{j}\right)^{-1}\end{pmatrix}\begin{pmatrix}I&-\Gamma^{*}_{j,1,2}&\cdots&-\Gamma^{*}_{j,1,p}&-\Delta^{*}_{j,1}\\ -\Gamma^{*}_{j,2,1}&I&\cdots&-\Gamma^{*}_{j,2,p}&-\Delta^{*}_{j,2}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ -\Gamma^{*}_{j,p,1}&-\Gamma^{*}_{j,p,2}&\cdots&I&-\Delta^{*}_{j,p}\\ -\Lambda^{*}_{j,1}&-\Lambda^{*}_{j,2}&\cdots&-\Lambda^{*}_{j,p}&I\end{pmatrix}.

We can thus estimate \Sigma_{j}^{-1} by estimating each of the matrices \Gamma^{*}_{j,k,l}, \Lambda^{*}_{j,k}, and \Delta^{*}_{j,k}.

Similar to our discussion of the de-biased group LASSO in Appendix A, we use a group-penalized variant of the nodewise LASSO to construct \tilde{M}_{j}. We estimate \Gamma^{*}_{j,k,l} and \Delta^{*}_{j,k} as

\displaystyle\tilde{\Gamma}_{j,k,1},\ldots,\tilde{\Gamma}_{j,k,p},\tilde{\Delta}_{j,k}=
\displaystyle\underset{\Gamma_{1},\ldots,\Gamma_{p},\Delta\in\mathds{R}^{d\times d}}{\text{arg min}}\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\Gamma_{l}-\mathcal{V}^{g}_{j,2}\Delta\right)^{\top}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\Gamma_{l}-\mathcal{V}^{g}_{j,2}\Delta\right)\right\}+
\displaystyle\quad\quad\quad\quad\quad\omega_{1}\sum_{l\neq k,j}\|\Gamma_{l}\|_{2,*}+\omega_{1}\|\Delta\|_{2,*},

where \omega_{1},\omega_{2}>0 are tuning parameters, and \|\cdot\|_{2,*} is as defined in Appendix A. We estimate \Lambda^{*}_{j,k} as

\displaystyle\tilde{\Lambda}_{j,1},\ldots,\tilde{\Lambda}_{j,p}=\underset{\Lambda_{1},\ldots,\Lambda_{p}\in\mathds{R}^{d\times d}}{\text{arg min}}\text{trace}\left\{\left(n^{g}\right)^{-1}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\Lambda_{k}\right)^{\top}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\Lambda_{k}\right)\right\}+\omega_{2}\sum_{k\neq j}\|\Lambda_{k}\|_{2,*}.
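
A minimal proximal-gradient sketch of these group-penalized nodewise regressions is given below. It assumes, for illustration only, that \|\cdot\|_{2,*} is the Frobenius norm of each d\times d block (the norm actually used is defined in Appendix A), and the step size and iteration count are arbitrary choices rather than part of the method.

\begin{verbatim}
import numpy as np

def group_nodewise_regression(Y, X_blocks, omega, n_iter=500):
    """Proximal gradient descent for a group-penalized nodewise regression.

    Minimizes (1/n) * ||Y - sum_l X_l B_l||_F^2 + omega * sum_l ||B_l||_F
    over d x d coefficient blocks B_l, with the Frobenius norm used as a
    stand-in for the paper's group norm.
    """
    n, d = Y.shape
    X = np.hstack(X_blocks)                        # n x (m * d) design matrix
    m = len(X_blocks)
    B = np.zeros((m * d, d))
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant

    for _ in range(n_iter):
        grad = -(2.0 / n) * X.T @ (Y - X @ B)      # gradient of the smooth part
        B -= step * grad
        for l in range(m):                          # block soft-thresholding
            block = B[l * d:(l + 1) * d]
            norm = np.linalg.norm(block)
            shrink = max(0.0, 1.0 - step * omega / norm) if norm > 0 else 0.0
            B[l * d:(l + 1) * d] = shrink * block
    return [B[l * d:(l + 1) * d] for l in range(m)]
\end{verbatim}

Calling this with Y = \mathcal{V}^{g}_{j,k,1} and the remaining \mathcal{V}^{g}_{j,l,1} blocks together with \mathcal{V}^{g}_{j,2} as predictors yields quantities playing the role of \tilde{\Gamma}_{j,k,l} and \tilde{\Delta}_{j,k}; with Y = \mathcal{V}^{g}_{j,2} it yields the \tilde{\Lambda}_{j,k}.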

Additionally, we define the d\times d matrices \tilde{C}_{j,k} and \tilde{D}_{j} as

\displaystyle\tilde{C}_{j,k}=\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,k,1}\right)^{\top}\left(\mathcal{V}_{j,k,1}^{g}-\sum_{l\neq k,j}\mathcal{V}_{j,l,1}^{g}\tilde{\Gamma}_{j,k,l}-\mathcal{V}^{g}_{j,2}\tilde{\Delta}_{j,k}\right)
\displaystyle\tilde{D}_{j}=\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,2}\right)^{\top}\left(\mathcal{V}_{j,2}^{g}-\sum_{k\neq j}\mathcal{V}_{j,k,1}^{g}\tilde{\Lambda}_{j,k}\right).

We then take \tilde{M}_{j} as

\displaystyle\tilde{M}_{j}=\begin{pmatrix}\tilde{C}^{-1}_{j,1}&\cdots&\mathbf{0}&\mathbf{0}\\ \vdots&\ddots&\vdots&\vdots\\ \mathbf{0}&\cdots&\tilde{C}^{-1}_{j,p}&\mathbf{0}\\ \mathbf{0}&\cdots&\mathbf{0}&\tilde{D}^{-1}_{j}\end{pmatrix}\begin{pmatrix}I&-\tilde{\Gamma}_{j,1,2}&\cdots&-\tilde{\Gamma}_{j,1,p}&-\tilde{\Delta}_{j,1}\\ -\tilde{\Gamma}_{j,2,1}&I&\cdots&-\tilde{\Gamma}_{j,2,p}&-\tilde{\Delta}_{j,2}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ -\tilde{\Gamma}_{j,p,1}&-\tilde{\Gamma}_{j,p,2}&\cdots&I&-\tilde{\Delta}_{j,p}\\ -\tilde{\Lambda}_{j,1}&-\tilde{\Lambda}_{j,2}&\cdots&-\tilde{\Lambda}_{j,p}&I\end{pmatrix}.
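
Assembling \tilde{M}_{j} from the fitted blocks amounts to forming the block-diagonal factor and the coefficient factor displayed above and multiplying them; a small sketch follows (the dictionary-based storage and variable names are illustrative assumptions, not taken from any released implementation).

\begin{verbatim}
import numpy as np
from scipy.linalg import block_diag

def assemble_M(Gamma, Delta, Lam, C, D, d):
    """Assemble M_tilde from its estimated d x d blocks.

    Gamma[k][l], Delta[k], Lam[k], and C[k] are dictionaries of d x d
    arrays indexed by node labels, and D is a single d x d array.
    """
    nodes = sorted(C.keys())
    p = len(nodes)
    size = (p + 1) * d

    # Right factor: identity blocks on the diagonal, negated regression
    # coefficients elsewhere; the last block row/column corresponds to theta.
    B = np.zeros((size, size))
    for a, k in enumerate(nodes):
        B[a * d:(a + 1) * d, a * d:(a + 1) * d] = np.eye(d)
        for b, l in enumerate(nodes):
            if l != k:
                B[a * d:(a + 1) * d, b * d:(b + 1) * d] = -Gamma[k][l]
        B[a * d:(a + 1) * d, p * d:] = -Delta[k]
        B[p * d:, a * d:(a + 1) * d] = -Lam[k]
    B[p * d:, p * d:] = np.eye(d)

    # Left factor: block-diagonal matrix of inverted C and D blocks.
    inv_blocks = [np.linalg.inv(C[k]) for k in nodes] + [np.linalg.inv(D)]
    return block_diag(*inv_blocks) @ B
\end{verbatim}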

When \Gamma^{*}_{j,k,l}, \Delta^{*}_{j,k}, and \Lambda^{*}_{j,k} satisfy appropriate sparsity conditions and some additional regularity assumptions, \tilde{M}_{j} is a consistent estimate of \Sigma_{j}^{-1} for \omega_{1}\asymp\{\log(p)/n\}^{1/2} and \omega_{2}\asymp\{\log(p)/n\}^{1/2} (see, e.g., Chapter 8 of Bühlmann and van de Geer [5] for a more comprehensive discussion). Using the same argument presented in Appendix A, we are able to obtain the following bound on a scaled version of the remainder term (\mathrm{iii}):

\displaystyle\left\|\begin{pmatrix}\tilde{C}_{j,1}&\cdots&\mathbf{0}&\mathbf{0}\\ \vdots&\ddots&\vdots&\vdots\\ \mathbf{0}&\cdots&\tilde{C}_{j,p}&\mathbf{0}\\ \mathbf{0}&\cdots&\mathbf{0}&\tilde{D}_{j}\end{pmatrix}\left\{I-\tilde{M}_{j}\left(n^{g}\right)^{-1}\begin{pmatrix}\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\\ \left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g}&\left(\mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g}\end{pmatrix}\right\}\begin{pmatrix}\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}_{j}^{g,*}\\ \tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\end{pmatrix}\right\|_{\infty}\leq
\displaystyle\max\{\omega_{1},\omega_{2}\}\left\{\mathcal{P}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j}-\boldsymbol{\alpha}^{g,*}_{j}\right)+\|\tilde{\boldsymbol{\theta}}^{g}_{j}-\boldsymbol{\theta}_{j}^{g,*}\|_{2}\right\}.

The remainder is o_{P}(n^{-1/2}) and hence asymptotically negligible if n^{1/2}\max\{\omega_{1},\omega_{2}\}\lambda\to 0, where \lambda is the tuning parameter for the regularized score matching estimator (see Theorem 2).
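
For example, if \omega_{1}\asymp\omega_{2}\asymp\{\log(p)/n\}^{1/2} as above and \lambda is also of order \{\log(p)/n\}^{1/2} (as when d is fixed and \lambda is taken proportional to \lambda_{0}), then

\displaystyle n^{1/2}\max\{\omega_{1},\omega_{2}\}\lambda\asymp n^{1/2}\left\{\frac{\log(p)}{n}\right\}^{1/2}\left\{\frac{\log(p)}{n}\right\}^{1/2}=\frac{\log(p)}{n^{1/2}},

which tends to zero whenever \log(p)=o(n^{1/2}).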

The de-biased estimate \check{\alpha}^{g}_{j,k} of \alpha^{g,*}_{j,k} can be expressed as

\displaystyle\check{\alpha}^{g}_{j,k}=\tilde{\alpha}^{g}_{j,k}-\left(n^{g}\right)^{-1}\tilde{C}^{-1}_{j,k}\left(\mathcal{V}^{g}_{j,k,1}-\sum_{l\neq j,k}\mathcal{V}_{j,l,1}^{g}\tilde{\Gamma}_{j,k,l}\right)^{\top}\left(\mathcal{V}^{g}_{j,1}\tilde{\boldsymbol{\alpha}}^{g}_{j}+\mathcal{V}^{g}_{j,2}\tilde{\boldsymbol{\theta}}_{j}^{g}+\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\right). (33)

The difference between the de-biased estimator \check{\alpha}^{g}_{j,k} and the true parameter \alpha^{g,*}_{j,k} can be expressed as

\displaystyle\tilde{C}_{j,k}\left(\check{\alpha}^{g}_{j,k}-\alpha^{g,*}_{j,k}\right)=-\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,k,1}-\sum_{l\neq j,k}\mathcal{V}_{j,l,1}^{g}\Gamma^{*}_{j,k,l}\right)^{\top}\left(\mathcal{V}^{g}_{j,1}\boldsymbol{\alpha}^{g,*}_{j}+\mathcal{V}^{g}_{j,2}\boldsymbol{\theta}_{j}^{g,*}+\left(\mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1}\right)+
\displaystyle\left(n^{g}\right)^{-1}\left(\mathcal{V}^{g}_{j,2}\Delta^{*}_{j,k}\right)^{\top}\left(\mathcal{V}^{g}_{j,1}\boldsymbol{\alpha}^{g,*}_{j}+\mathcal{V}^{g}_{j,2}\boldsymbol{\theta}_{j}^{g,*}+\left(\mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1}\right)+o_{P}\left(n^{-1/2}\right).

As discussed above, the central limit theorem implies asymptotic normality of \check{\alpha}^{g}_{j,k}. We can estimate the asymptotic variance of \check{\alpha}^{g}_{j,k} as

\displaystyle\left(n^{g}\right)^{-2}\tilde{C}_{j,k}^{-1}\tilde{M}_{j,k}\tilde{\xi}^{\top}\tilde{\xi}\tilde{M}^{\top}_{j,k}\left(\tilde{C}_{j,k}^{-1}\right)^{\top},

where we define

\displaystyle\tilde{\xi}=\begin{pmatrix}\text{diag}\left(\mathcal{V}_{j,1}^{g}\tilde{\boldsymbol{\alpha}}_{j}^{g}+\mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g}\right)\mathcal{V}_{j,1}^{g}+\mathcal{U}^{g}_{j,1}\\ \text{diag}\left(\mathcal{V}_{j,1}^{g}\tilde{\boldsymbol{\alpha}}_{j}^{g}+\mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g}\right)\mathcal{V}_{j,2}^{g}+\mathcal{U}^{g}_{j,2}\end{pmatrix}
\displaystyle\tilde{M}_{j,k}=\begin{pmatrix}-\tilde{\Gamma}_{j,k,1}&\cdots&-\tilde{\Gamma}_{j,k,k-1}&I&-\tilde{\Gamma}_{j,k,k+1}&\cdots&-\tilde{\Gamma}_{j,k,p}&-\tilde{\Delta}_{j,k}\end{pmatrix}.
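
The variance estimate above is a standard sandwich form and can be computed directly from the fitted quantities; the sketch below assumes \tilde{\xi} is stored with one row per observation and one column per entry of the stacked score (the array layout and names are assumptions, not the authors' implementation).

\begin{verbatim}
import numpy as np

def sandwich_variance(C_k, M_k, xi, n_g):
    """Plug-in sandwich estimate of the asymptotic variance of the
    de-biased estimate of alpha_{j,k}.

    C_k : d x d matrix, the estimate C_tilde_{j,k}
    M_k : d x ((p+1)*d) block row M_tilde_{j,k}
    xi  : n_g x ((p+1)*d) matrix of estimated score contributions,
          one row per observation (layout assumed for illustration)
    """
    C_inv = np.linalg.inv(C_k)
    meat = xi.T @ xi          # sum of outer products of the score rows
    return n_g ** -2 * C_inv @ M_k @ meat @ M_k.T @ C_inv.T
\end{verbatim}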