
Ising Model Selection Using $\ell_1$-Regularized Linear Regression: A Statistical Mechanics Analysis

Xiangming Meng
Institute for Physics of Intelligence
The University of Tokyo
7-3-1, Hongo, Tokyo 113-0033, Japan
[email protected]
Tomoyuki Obuchi
Department of Systems Science
Kyoto University
Kyoto 606-8501, Japan
[email protected]
Yoshiyuki Kabashima
Institute for Physics of Intelligence
The University of Tokyo
7-3-1, Hongo, Tokyo 113-0033, Japan
[email protected]
Corresponding author.
Abstract

We theoretically analyze the typical learning performance of $\ell_1$-regularized linear regression ($\ell_1$-LinR) for Ising model selection using the replica method from statistical mechanics. For typical random regular graphs in the paramagnetic phase, an accurate estimate of the typical sample complexity of $\ell_1$-LinR is obtained. Remarkably, despite the model misspecification, $\ell_1$-LinR is model selection consistent with the same order of sample complexity as $\ell_1$-regularized logistic regression ($\ell_1$-LogR), i.e., $M=\mathcal{O}(\log N)$, where $N$ is the number of variables of the Ising model. Moreover, we provide an efficient method to accurately predict the non-asymptotic behavior of $\ell_1$-LinR for moderate $M, N$, such as the precision and recall. Simulations show a fairly good agreement between theoretical predictions and experimental results, even for graphs with many loops, which supports our findings. Although this paper mainly focuses on $\ell_1$-LinR, our method is readily applicable for precisely characterizing the typical learning performances of a wide class of $\ell_1$-regularized $M$-estimators including $\ell_1$-LogR and interaction screening.

1 Introduction

The advent of massive data across various scientific disciplines has led to the widespread use of undirected graphical models, also known as Markov random fields (MRFs), as a tool for discovering and visualizing dependencies among covariates in multivariate data [1]. The Ising model, originally proposed in statistical physics, is a special class of binary MRFs with pairwise potentials and has been widely used in different domains such as image analysis, social networking, and gene network analysis [2, 3, 4, 5, 6, 7]. Among various applications, one fundamental problem of interest is Ising model selection, which refers to recovering the underlying graph structure of the original Ising model from independent, identically distributed (i.i.d.) samples. A variety of methods have been proposed [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], demonstrating the possibility of successful Ising model selection even when the number of samples is smaller than the number of variables. Notably, it has been shown that for the $\ell_1$-regularized logistic regression ($\ell_1$-LogR) [10, 16] and interaction screening (IS) [14, 15] estimators, $M=\mathcal{O}(\log N)$ samples suffice for an Ising model with $N$ spins under certain assumptions, which matches the previously established information-theoretic lower bound [11]. Both $\ell_1$-LogR and IS are $\ell_1$-regularized $M$-estimators [19] with the logistic and interaction screening objective (ISO) loss functions, respectively.

In this paper, we focus on a simpler linear estimator called $\ell_1$-regularized linear regression ($\ell_1$-LinR) and theoretically investigate its typical learning performance using the powerful replica method [20, 21, 22, 23] from statistical mechanics. The $\ell_1$-LinR estimator, widely known as the least absolute shrinkage and selection operator (LASSO) [24] in statistics and machine learning, is considered here mainly for two reasons. On the one hand, it is a representative example of model misspecification: the quadratic loss of $\ell_1$-LinR neither matches the true log conditional likelihood as $\ell_1$-LogR does, nor has the interaction screening property of IS. On the other hand, as one of the most popular linear estimators, $\ell_1$-LinR is computationally more efficient than $\ell_1$-LogR and IS, and thus it is of practical importance to investigate its learning performance for Ising model selection. Since it is difficult to obtain results for general graphs, as a first step we consider random regular (RR) graphs $\mathcal{G}_{N,d,K_0}$ in the paramagnetic phase [23], where $\mathcal{G}_{N,d,K_0}$ denotes the ensemble of RR graphs with constant node degree $d$ and uniform coupling strength $K_0$ on the edges.

1.1 Contributions

The main contributions are summarized as follows. First, we obtain an accurate estimate of the typical sample complexity of $\ell_1$-LinR for Ising model selection on typical RR graphs in the paramagnetic phase, which, remarkably, has the same order as that of $\ell_1$-LogR. Specifically, for a typical RR graph $G\in\mathcal{G}_{N,d,K_0}$, using $\ell_1$-LinR with a regularization parameter $0<\lambda<\tanh(K_0)$, one can consistently reconstruct the structure with $M>\frac{c(\lambda,K_0)\log N}{\lambda^2}$ samples, where $c(\lambda,K_0)=\frac{2(1-\tanh^2(K_0)+d\lambda^2)}{1+(d-1)\tanh^2(K_0)}$. The accuracy of our typical sample complexity prediction is verified by its excellent agreement with experimental results. To the best of our knowledge, this is the first result that provides an accurate typical sample complexity for Ising model selection. Interestingly, as $\lambda\to\tanh(K_0)$, a lower bound $M>\frac{2\log N}{\tanh^2(K_0)}$ on the typical sample complexity is obtained, which has the same scaling as the information-theoretic lower bound $M>\frac{c'\log N}{K_0^2}$ [11] for some constant $c'$ at high temperatures (i.e., small $K_0$), since $\tanh(K_0)=\mathcal{O}(K_0)$ as $K_0\rightarrow 0$.

Second, we provide a computationally efficient method to precisely predict the typical learning performance of $\ell_1$-LinR in the non-asymptotic regime of moderate $M, N$, such as the precision, recall, and residual sum of squares (RSS). Such precise non-asymptotic predictions for Ising model selection have been unavailable even for $\ell_1$-LogR [10, 16] and IS [14, 15], and they differ from previous asymptotic results for $\ell_1$-LinR that assume a fixed ratio $\alpha\equiv M/N$ [25, 26, 27, 28]. Moreover, although our theoretical analysis is based on a tree-like structure assumption, experimental results on two-dimensional (2D) grid graphs also show a fairly good agreement, indicating that our theoretical result can be a good approximation even for graphs with many loops.

Third, while this paper mainly focuses on $\ell_1$-LinR, our method is readily applicable to a wide class of $\ell_1$-regularized $M$-estimators [19], including $\ell_1$-LogR [10] and IS [14, 15]. Thus, an additional technical contribution is a generic approach for precisely characterizing the typical learning performances of various $\ell_1$-regularized $M$-estimators for Ising model selection. Although the replica method from statistical mechanics is non-rigorous, our results are conjectured to be correct, which is supported by their excellent agreement with the experimental results. Additionally, several technical advances proposed in this paper, e.g., the computation of the entropy term by averaging over the Haar measure and the modification of the equations of state (EOS) to account for the finite-size effect, might be of general interest to those who use the replica method as a tool for performance analysis.

1.2 Related works

There have been several earlier works on the analysis of Ising model selection (also known as the inverse Ising problem) using the replica method [4, 5, 6, 7, 29] from statistical mechanics. For example, in [6], the performance of the pseudo-likelihood (PL) method [30] is studied. However, instead of graph structure learning, [6] focuses on the problem of parameter learning. Then, [7] extends the analysis to Ising models with sparse couplings using logistic regression without regularization. The recent work [29] analyzes the performance of $\ell_2$-regularized linear regression, but the techniques developed there are not applicable to $\ell_1$-LinR since the $\ell_1$-norm breaks the rotational invariance property.

Regarding the study of $\ell_1$-LinR (LASSO) under model misspecification, the past few years have seen a line of research in the field of signal processing with a specific focus on the single-index model [31, 32, 27, 33, 34]. These studies are closely related to ours, but there are several important differences. First, in our study, the covariates are generated from an Ising model rather than a Gaussian distribution. Second, we focus on the model selection consistency of $\ell_1$-LinR, while most previous studies consider estimation consistency, with the exception of [33]. However, [33] only considers the classical asymptotic regime, while our analysis includes the high-dimensional setting where $M\ll N$.

As far as we have searched, there is no earlier study of the $\ell_1$-LinR estimator for Ising model selection, though some are found for Gaussian graphical models [35, 36]. One closely related work [15] states that at high temperatures, when the coupling magnitude approaches zero, both the logistic and ISO losses can be approximated by a quadratic loss. However, their claim is restricted to couplings of very small magnitude near zero, while our analysis extends the validity range to the whole paramagnetic phase. Moreover, they evaluate the minimum number of samples necessary for consistently reconstructing arbitrary Ising models, which seems much larger than what is actually needed. By contrast, we provide the first accurate assessment of the typical sample complexity for consistently reconstructing typical samples of Ising models defined on RR graphs. Furthermore, [15] does not provide precise predictions of the non-asymptotic learning performance as we do.

2 Background and Problem Setup

2.1 Ising Model

The Ising model is a special class of MRFs with pairwise potentials in which each variable takes binary values [22, 23]; it is a classical model from statistical physics. The joint probability distribution of an Ising model with $N$ variables (spins) $\boldsymbol{s}=(s_i)_{i=0}^{N-1}\in\{-1,+1\}^N$ has the form

P_{\textrm{Ising}}\left(\boldsymbol{s}|\boldsymbol{J}^{*}\right)=\frac{1}{Z_{\textrm{Ising}}\left(\boldsymbol{J}^{*}\right)}\exp\left\{\sum_{i<j}J^{*}_{ij}s_{i}s_{j}\right\}, \quad (1)

where $Z_{\textrm{Ising}}(\boldsymbol{J}^{*})=\sum_{\boldsymbol{s}}\exp\{\sum_{i<j}J^{*}_{ij}s_i s_j\}$ is the partition function and $\boldsymbol{J}^{*}=(J^{*}_{ij})_{i,j}$ are the original couplings. In general, there are also external fields, but here they are assumed to be zero for simplicity. The structure of the Ising model can be described by an undirected graph $G=(\mathtt{V},\mathtt{E})$, where $\mathtt{V}=\{0,1,\dots,N-1\}$ is the collection of vertices at which the spins are assigned, and $\mathtt{E}=\{(i,j)\,|\,J^{*}_{ij}\neq 0\}$ is the collection of undirected edges, i.e., $J^{*}_{ij}=0$ for all pairs $(i,j)\notin\mathtt{E}$. For each vertex $i\in\mathtt{V}$, its neighborhood is defined as the subset $\mathcal{N}(i)\equiv\{j\in\mathtt{V}\,|\,(i,j)\in\mathtt{E}\}$.
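As a concrete illustration of (1), the following short Python sketch (our own illustration, not part of the original paper) enumerates all configurations of a small Ising model to evaluate the partition function $Z_{\textrm{Ising}}$ and the pairwise correlations $\langle s_i s_j\rangle$; the hand-picked 3-regular graph on $N=8$ spins and the value $K_0=0.4$ are illustrative choices.

import itertools
import numpy as np

def ising_exact(J):
    """Exact enumeration of a small zero-field Ising model P(s) ∝ exp(sum_{i<j} J_ij s_i s_j).

    Returns the partition function Z and the correlation matrix C[i, j] = <s_i s_j>.
    Feasible only for small N, since all 2^N configurations are enumerated.
    """
    N = J.shape[0]
    configs = np.array(list(itertools.product([-1, 1], repeat=N)))   # shape (2^N, N)
    expo = 0.5 * np.einsum('ci,ij,cj->c', configs, J, configs)       # sum_{i<j} J_ij s_i s_j (J symmetric, zero diagonal)
    weights = np.exp(expo)
    Z = weights.sum()
    P = weights / Z
    C = np.einsum('c,ci,cj->ij', P, configs, configs)                # <s_i s_j>
    return Z, C

# Illustrative 3-regular graph on N = 8 spins with uniform coupling K0 = 0.4
N, K0 = 8, 0.4
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (2, 4), (2, 6),
         (3, 5), (3, 7), (4, 7), (5, 6), (6, 7)]
J = np.zeros((N, N))
for i, j in edges:
    J[i, j] = J[j, i] = K0
Z, C = ising_exact(J)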

2.2 Neighborhood-based $\ell_1$-regularized linear regression ($\ell_1$-LinR)

The problem of Ising model selection refers to recovering the graph $G$ (edge set $\mathtt{E}$) given $M$ i.i.d. samples $\mathcal{D}^M=\{\boldsymbol{s}^{(1)},\dots,\boldsymbol{s}^{(M)}\}$ from the Ising model. While the maximum likelihood method has the nice properties of consistency and asymptotic efficiency, it suffers from high computational complexity. To tackle this difficulty, several local learning algorithms have been proposed, notably the $\ell_1$-LogR estimator [10] and the IS estimator [14]. Both $\ell_1$-LogR and IS optimize a regularized local cost function $\ell(\cdot)$ for each spin, i.e., $\forall i\in\mathtt{V}$,

\boldsymbol{\hat{J}}_{\setminus i}=\underset{\boldsymbol{J}_{\setminus i}}{\arg\min}\left[\frac{1}{M}\sum_{\mu=1}^{M}\ell\left(s_{i}^{\left(\mu\right)}h_{\setminus i}^{\left(\mu\right)}\right)+\lambda\left\|\boldsymbol{J}_{\setminus i}\right\|_{1}\right], \quad (2)

where $h_{\setminus i}^{(\mu)}\equiv\sum_{j\neq i}J_{ij}s_j^{(\mu)}$, $\boldsymbol{J}_{\setminus i}\equiv(J_{ij})_{j(\neq i)}$, and $\|\cdot\|_1$ denotes the $\ell_1$ norm. Specifically, $\ell(x)=\log(1+e^{-2x})$ for $\ell_1$-LogR and $\ell(x)=e^{-x}$ for IS, which correspond to the negative log conditional distribution [10] and the ISO [14], respectively. Consequently, the problem of recovering the edge set $\mathtt{E}$ is equivalently reduced to local neighborhood selection, i.e., recovering the neighborhood set $\mathcal{N}(i)$ for each vertex $i\in\mathtt{V}$. In particular, given the estimates $\boldsymbol{\hat{J}}_{\setminus i}$ in (2), the neighborhood set of vertex $i$ can be estimated via the nonzero coefficients, i.e.,

\hat{\mathcal{N}}\left(i\right)=\left\{j\,|\,\hat{J}_{ij}\neq 0,\,j\in\mathtt{V}\setminus i\right\},\;\forall i\in\mathtt{V}. \quad (3)

In this paper, we focus on a simple linear estimator, termed the $\ell_1$-LinR estimator, i.e., $\forall i\in\mathtt{V}$,

\boldsymbol{\hat{J}}_{\setminus i}=\underset{\boldsymbol{J}_{\setminus i}}{\arg\min}\left[\frac{1}{2M}\sum_{\mu=1}^{M}\left(s_{i}^{\left(\mu\right)}-h_{\setminus i}^{\left(\mu\right)}\right)^{2}+\lambda\left\|\boldsymbol{J}_{\setminus i}\right\|_{1}\right], \quad (4)

which, recalling that $s_i^{(\mu)}\in\{-1,+1\}$, corresponds to the quadratic loss $\ell(x)=\frac{1}{2}(x-1)^2$ in (2). The neighborhood set of each vertex $i\in\mathtt{V}$ is estimated in the same way as in (3). Interestingly, the quadratic loss used in (4) implies that the postulated conditional distribution is Gaussian and thus inconsistent with the true one, which is a typical case of model misspecification. Furthermore, compared with the nonlinear estimators $\ell_1$-LogR and IS, the $\ell_1$-LinR estimator is more efficient to implement.
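To make the neighborhood-selection procedure (3)-(4) concrete, the short Python sketch below (our own illustration; the choice of scikit-learn's Lasso as the solver is ours, not the authors') regresses each spin on the remaining ones and reads off the estimated neighborhoods from the nonzero coefficients. scikit-learn's Lasso minimizes $\frac{1}{2M}\|y-Xw\|_2^2+\alpha\|w\|_1$, which matches (4) with $\alpha=\lambda$.

import numpy as np
from sklearn.linear_model import Lasso

def l1_linr_neighborhoods(S, lam):
    """Neighborhood-based l1-LinR: for each spin i, solve (4) by regressing s_i on the other spins.

    S   : (M, N) array of +/-1 samples D^M
    lam : regularization parameter lambda
    Returns an (N, N) matrix of estimated couplings J_hat (zero diagonal).
    """
    M, N = S.shape
    J_hat = np.zeros((N, N))
    for i in range(N):
        others = np.delete(np.arange(N), i)
        reg = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        reg.fit(S[:, others], S[:, i])
        J_hat[i, others] = reg.coef_
    return J_hat

# Estimated neighborhood of vertex i, as in (3): N_hat(i) = {j : J_hat[i, j] != 0}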

3 Statistical Mechanics Analysis

In this section, a statistical mechanics analysis of the $\ell_1$-LinR estimator is presented for typical RR graphs in the paramagnetic phase. Our analysis is applicable to any $M$-estimator of the form (2); please refer to Appendix A for a unified analysis, including detailed results for the $\ell_1$-LogR estimator.

To characterize the structure learning performance, the precision and recall are considered:

Precision=\frac{TP}{TP+FP},\;\;\;Recall=\frac{TP}{TP+FN}, \quad (5)

where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives among the estimated couplings, respectively. The notion of model selection consistency for an estimator is given in Definition 1; it is also known as the sparsistency property [10].

Definition 1.

An estimator is called model selection consistent if both the associated precision and recall satisfy $Precision\to 1$ and $Recall\to 1$ as $M\rightarrow\infty$.

Additionally, if one is further interested in the specific values of the estimated couplings, our analysis can also yield the residual sum of squares (RSS) for the estimated couplings.

Our theoretical analysis of the learning performance builds on the statistical mechanics framework. In contrast to the probably approximately correct (PAC) learning theory [37] in mathematical statistics, statistical mechanics aims to describe the typical behavior (as defined in Definition 2) exactly, rather than to bound the worst case, which is likely to be over-pessimistic [38].

Definition 2.

“Typical” means not just most probable, but in addition that the probability of situations different from the typical one can be made arbitrarily small as $N\to\infty$ [38].

Similarly, when referring to typical RR graphs, we mean tree-like RR graphs, i.e., graphs that, seen from a random node, look like part of an infinite tree; these are typical realizations from the uniform probability distribution on the ensemble of RR graphs.

3.1 Problem Formulation

For simplicity and without loss of generality, we focus on spin $s_0$. With a slight abuse of notation, we drop certain subscripts in the following descriptions; e.g., $\boldsymbol{J}_{\setminus i}$ will be denoted as $\boldsymbol{J}$ hereafter, which represents a vector rather than a matrix. The basic idea of the statistical mechanical approach is to introduce the following Hamiltonian and Boltzmann distribution induced by the loss function $\ell(\cdot)$:

\mathcal{H}\left(\boldsymbol{J}|\mathcal{D}^{M}\right)=\sum_{\mu=1}^{M}\ell\left(s_{0}^{\left(\mu\right)}h^{\left(\mu\right)}\right)+\lambda M\left\|\boldsymbol{J}\right\|_{1}, \quad (6)
P\left(\boldsymbol{J}|\mathcal{D}^{M}\right)=\frac{1}{Z}e^{-\beta\mathcal{H}\left(\boldsymbol{J}|\mathcal{D}^{M}\right)}, \quad (7)

where $Z=\int d\boldsymbol{J}\,e^{-\beta\mathcal{H}(\boldsymbol{J}|\mathcal{D}^M)}$ is the partition function and $\beta\,(>0)$ is the inverse temperature. In the zero-temperature limit $\beta\rightarrow+\infty$, the Boltzmann distribution (7) converges to a point-wise measure on the estimator (2). The macroscopic properties of (7) can be analyzed by assessing the free energy density $f(\mathcal{D}^M)=-\frac{1}{N\beta}\log Z$; once it is obtained, averages of various quantities can be evaluated simply by taking its derivatives w.r.t. external fields [21]. In the current case, $f(\mathcal{D}^M)$ depends on the predetermined randomness of $\mathcal{D}^M$, which plays the role of quenched disorder. As $N,M\to\infty$, $f(\mathcal{D}^M)$ is expected to show the self-averaging property [21]: for typical datasets $\mathcal{D}^M$, $f(\mathcal{D}^M)$ converges to its average over the random data $\mathcal{D}^M$:

f=-\frac{1}{N\beta}\left[\log Z\right]_{\mathcal{D}^{M}}, \quad (8)

where $[\cdot]_{\mathcal{D}^M}$ denotes the expectation over $\mathcal{D}^M$, i.e., $[\cdot]_{\mathcal{D}^M}=\sum_{\boldsymbol{s}^{(1)},\dots,\boldsymbol{s}^{(M)}}(\cdot)\prod_{\mu=1}^{M}P_{\textrm{Ising}}(\boldsymbol{s}^{(\mu)}|\boldsymbol{J}^{*})$. Consequently, one can analyze the typical performance of any $\ell_1$-regularized $M$-estimator of the form (2) via the assessment of (8), with $\ell_1$-LinR in (4) being the special case $\ell(x)=\frac{1}{2}(x-1)^2$.

3.2 Replica computation of the free energy density

Unfortunately, computing (8) rigorously is difficult. To overcome this difficulty in practice, we resort to the powerful replica method [20, 21, 22, 23] from statistical mechanics, which is based on the following identity

f=-\frac{1}{N\beta}\left[\log Z\right]_{\mathcal{D}^{M}}=-\underset{n\rightarrow 0}{\lim}\frac{1}{N\beta}\frac{\partial\log\left[Z^{n}\right]_{\mathcal{D}^{M}}}{\partial n}. \quad (9)

The basic idea is as follows. One replaces the average of $\log Z$ by that of the $n$-th power $Z^n$, which is analytically tractable for $n\in\mathbb{N}$ in the large-$N$ limit, constructs an expression that can be analytically continued from $\mathbb{N}$ to $\mathbb{R}$, and then takes the limit $n\to 0$ using that expression. Although the replica method is not rigorous, it has been empirically verified in extensive studies of disordered systems [22, 23] and has also proved useful in the study of high-dimensional models in machine learning [39, 40]. For more details of the replica method, please refer to [20, 21, 22, 23].
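To spell out the step behind (9): since $[Z^n]_{\mathcal{D}^M}=[e^{n\log Z}]_{\mathcal{D}^M}=1+n[\log Z]_{\mathcal{D}^M}+\mathcal{O}(n^2)$ for small $n$, one has

\lim_{n\rightarrow 0}\frac{\partial\log\left[Z^{n}\right]_{\mathcal{D}^{M}}}{\partial n}=\lim_{n\rightarrow 0}\frac{\partial}{\partial n}\log\left(1+n\left[\log Z\right]_{\mathcal{D}^{M}}+\mathcal{O}(n^{2})\right)=\left[\log Z\right]_{\mathcal{D}^{M}},

which, after multiplying by $-\frac{1}{N\beta}$, recovers (9); the nontrivial part of the method is the analytic continuation of $[Z^n]_{\mathcal{D}^M}$ from integer to real $n$.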

Specifically, with the Hamiltonian $\mathcal{H}(\boldsymbol{J}|\mathcal{D}^M)$ and assuming $n\in\mathbb{N}$ is a positive integer, the replicated partition function $[Z^n]_{\mathcal{D}^M}$ in (9) can be written as

\left[Z^{n}\right]_{\mathcal{D}^{M}}=\int\prod_{a=1}^{n}d\boldsymbol{J}^{a}e^{-\beta\lambda M\sum_{a=1}^{n}\left\|\boldsymbol{J}^{a}\right\|_{1}}\left\{\sum_{\boldsymbol{s}}P_{\textrm{Ising}}\left(\boldsymbol{s}|\boldsymbol{J}^{*}\right)\exp\left[-\beta\sum_{a=1}^{n}\ell\left(s_{0}h^{a}\right)\right]\right\}^{M}, \quad (10)

where $h^a=\sum_j J_j^a s_j$ will be termed the local field hereafter, and $a$ (and $b$ in the following) indexes the replicas. The analysis below essentially depends on the distribution of $h^a$, which is nontrivial. To resolve this, we take an approach similar to [7, 29] and introduce the following ansatz.

Ansatz 1 (A1): Denote by $\Psi=\{j\,|\,j\in\mathcal{N}(0)\}$ and $\bar{\Psi}=\{j\,|\,j=1,\dots,N-1,\,j\notin\mathcal{N}(0)\}$ the active and inactive sets of spin $s_0$, respectively. Then, for a typical RR graph $G\in\mathcal{G}_{N,d,K_0}$ in the paramagnetic phase, i.e., $(d-1)\tanh^2(K_0)<1$, the $\ell_1$-LinR estimator in (4) is a random vector determined by the random realization of $\mathcal{D}^M$ and obeys the following form

\hat{J}_{j}=\begin{cases}\bar{J}_{j}+\frac{1}{\sqrt{N}}w_{j}, & j\in\Psi,\\ \frac{1}{\sqrt{N}}w_{j}, & j\in\bar{\Psi},\end{cases} \quad (11)

where $\bar{J}_j$ is the mean value of the estimator and $w_j$ is a random variable that is asymptotically zero-mean with variance of order $\mathcal{O}(1)$.

The consistency of Ansatz 1 is checked in Appendix B. Under Ansatz 1, the local fields $h^a$ can be decomposed as $h^a=\sum_{j\in\Psi}\bar{J}_j s_j + h_w^a$, where $h_w^a\equiv\sum_j\frac{1}{\sqrt{N}}w_j^a s_j$ is the “noise” part. According to the central limit theorem (CLT), $h_w^a$ can be approximated as multivariate Gaussian variables, which, under the replica symmetric (RS) ansatz [21], can be fully described by two order parameters

Q\equiv\frac{1}{N}\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{a},\;\;\;q\equiv\frac{1}{N}\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{b},\;(a\neq b), \quad (12)

where $C^{\backslash 0}\equiv\{C_{ij}^{\backslash 0}\}$ is the covariance matrix of the original Ising model without the spin $s_0$. Since the difference between $C^{\backslash 0}$ and the covariance matrix including $s_0$ is not essential in the limit $N\to\infty$, hereafter the superscript $\backslash 0$ is discarded. As shown in Appendix A, for the quadratic loss $\ell(x)=\frac{1}{2}(1-x)^2$ of $\ell_1$-LinR, the average free energy density (9) in the limit $\beta\rightarrow\infty$ can be computed as

f\left(\beta\rightarrow\infty\right)=-\mathtt{Extr}\left\{-\xi+S\right\}, \quad (13)

where $\mathtt{Extr}\{\cdot\}$ denotes the extremum operation w.r.t. the relevant variables and $\xi, S$ denote the energy and entropy terms:

S=\underset{\beta\to\infty}{\lim}\,\underset{n\rightarrow 0}{\lim}\frac{1}{N\beta}\frac{\partial}{\partial n}\log I, \quad (14)
I=\int\prod_{a=1}^{n}dw^{a}\prod_{a=1}^{n}e^{-\lambda\beta\left\|w^{a}\right\|_{1}}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}w_{j}^{a}-NQ\right)\times\prod_{a<b}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}w_{j}^{b}-Nq\right), \quad (15)
\xi=\frac{\alpha\,\mathbb{E}_{s,z}\left(s_{0}-\sum_{j\in\Psi}\bar{J}_{j}s_{j}-\sqrt{Q}z\right)^{2}}{2\left(1+\chi\right)}+\alpha\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|, \quad (16)

where $\alpha\equiv M/N$, $\chi\equiv\lim_{\beta\rightarrow\infty}\beta(Q-q)$, and $\mathbb{E}_{s,z}(\cdot)$ denotes the expectation w.r.t. $z\sim\mathcal{N}(0,1)$ and $(s_0,\boldsymbol{s}_{\Psi})\sim P_{\rm Ising}(s_0,\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})\propto e^{s_0\sum_{j\in\Psi}J^{*}_{j}s_{j}}$ [7]. For different losses $\ell(\cdot)$, the free energy result (13) differs only in the energy term $\xi$, which in general is non-analytical (e.g., the logistic loss of $\ell_1$-LogR) but can be evaluated numerically. Please refer to Appendix A.3 for more details.

In contrast to the case of the $\ell_2$-norm in [29], the $\ell_1$-norm in (15) breaks the rotational invariance property, i.e., $\|w^a\|_1\neq\|Ow^a\|_1$ for a general orthogonal matrix $O$, making it difficult to compute the entropy term $S$. To circumvent this difficulty, we employ the observation that, when considering the RR graph ensemble $\mathcal{G}_{N,d,K_0}$ as the coupling network of the Ising model, the orthogonal matrix $O$ diagonalizing the covariance matrix $C$ appears to be distributed according to the Haar orthogonal measure [41, 42]. Thus, it is assumed that $I$ in (15) can be replaced by its average $[I]_O$ over the Haar-distributed $O$:

Ansatz 2 (A2): Denote by $C\equiv\mathbb{E}_{\boldsymbol{s}}[\boldsymbol{s}\boldsymbol{s}^{T}]$, where $\mathbb{E}_{\boldsymbol{s}}[\cdot]=\sum_{\boldsymbol{s}}P_{\rm Ising}(\boldsymbol{s}|\boldsymbol{J}^{*})(\cdot)$, the covariance matrix of the spin configurations $\boldsymbol{s}$, and let $C=O\Lambda O^{T}$ be its eigendecomposition, where $O$ is an orthogonal matrix. Then $O$ can be regarded as a random sample generated from the Haar orthogonal measure, and thus for typical graph realizations from $\mathcal{G}_{N,d,K_0}$, $I$ in (15) is equal to its average $[I]_O$.

The consistency of Ansatz (A2) is numerically checked in Appendix C. Under Ansatz (A2), the entropy term $S$ in (14) can alternatively be computed as $\lim_{n\rightarrow 0}\frac{1}{N\beta}\frac{\partial}{\partial n}\log[I]_O$, as shown in Appendix A. Finally, under the RS ansatz, the average free energy density (9) in the limit $\beta\rightarrow\infty$ reads

f\left(\beta\rightarrow\infty\right)=-\underset{\varTheta}{\mathtt{Extr}}\left\{\begin{array}{c}-\frac{\alpha}{2\left(1+\chi\right)}\mathbb{E}_{s,z}\left(\left(s_{0}-\sum_{j\in\Psi}\bar{J}_{j}s_{j}-\sqrt{Q}z\right)^{2}\right)-\lambda\alpha\sum_{j\in\Psi}\left|\bar{J}_{j}\right|\\ +\left(-ER+F\eta\right)G^{\prime}\left(-E\eta\right)+\frac{1}{2}EQ-\frac{1}{2}F\chi+\frac{1}{2}KR-\frac{1}{2}H\eta\\ -\mathbb{E}_{z}\,\underset{w}{\min}\left\{\frac{K}{2}w^{2}-\sqrt{H}zw+\frac{\lambda M}{\sqrt{N}}\left|w\right|\right\}\end{array}\right\}, \quad (20)

where $z\sim\mathcal{N}(0,1)$, and $G(x)$ is a function defined as

G\left(x\right)=-\frac{1}{2}\log x-\frac{1}{2}+\underset{\Lambda}{\mathtt{Extr}}\left\{-\frac{1}{2}\int\log\left(\Lambda-\gamma\right)\rho\left(\gamma\right)d\gamma+\frac{\Lambda}{2}x\right\}, \quad (21)

where $\rho(\gamma)$ is the eigenvalue distribution (EVD) of the covariance matrix $C$, and $\varTheta$ is the collection of macroscopic parameters $\varTheta=\{\chi,Q,E,R,F,\eta,K,H,\{\bar{J}_j\}_{j\in\Psi}\}$. For details of these macroscopic parameters and $\rho(\gamma)$, please refer to Appendices A and F, respectively. Note that in (20), apart from the ratio $\alpha\equiv M/N$, $N$ and $M$ also appear through the combination $\lambda M/\sqrt{N}$ in the free energy result, which is different from previous results [7, 29, 39]. The reason is that, thanks to the $\ell_1$-regularization term $\lambda M\|\boldsymbol{J}\|_1$, the mean estimates $\{\bar{J}_j\}_{j\in\Psi}$ in the active set $\Psi$ and the noise $w$ in the inactive set $\bar{\Psi}$ give essentially different scaling contributions to the free energy density.

Although there are no analytic solutions, the macroscopic parameters in (20) can be obtained by numerically solving the corresponding equations of state (EOS), in the physics terminology. Specifically, for the $\ell_1$-LinR estimator, the EOS can be obtained from the extremization condition in (20) as follows (for the EOS of a general $M$-estimator and of $\ell_1$-LogR, please refer to Appendix A.3):

\begin{cases}E=\frac{\alpha}{1+\chi},\\ F=\frac{\alpha}{\left(1+\chi\right)^{2}}\left[\mathbb{E}_{s}\left(s_{0}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}\right)^{2}+Q\right],\\ R=\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right],\\ E\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma,\\ Q=\frac{F}{E^{2}}+R\tilde{\varLambda}-\frac{\left(-ER+F\eta\right)\eta}{\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma},\\ K=E\tilde{\varLambda}+\frac{1}{\eta},\\ \chi=\frac{1}{E}+\eta\tilde{\varLambda},\\ H=\frac{R}{\eta^{2}}+F\tilde{\varLambda}+\frac{\left(-ER+F\eta\right)E}{\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma},\\ \eta=\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\\ \bar{J}_{j}=\frac{\mathtt{soft}\left(\tanh\left(K_{0}\right),\lambda\left(1+\chi\right)\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)},\;j\in\Psi,\end{cases} \quad (22)

where $\tilde{\varLambda}$, satisfying $E\eta=-\int\frac{\rho(\gamma)}{\tilde{\varLambda}-\gamma}d\gamma$, is determined by the extremization condition in (21), and $\mathtt{soft}(z,\tau)=\mathtt{sign}(z)(|z|-\tau)_{+}$ is the soft-thresholding function. Once the EOS is solved, the free energy density defined in (8) is readily obtained.

3.3 High-dimensional asymptotic result

Figure 1: Equivalent low-dimensional estimators for the high-dimensional $\ell_1$-LinR estimator obtained from the statistical mechanics analysis. (a) and (b) are diagrams of the pair of scalar estimators in Eqs. (23) and (24). (c) is a schematic description of the modified estimator in Eq. (28), which takes into account the finite-size effect.

One important result of our replica analysis is that, as derived from the free energy result (20) (see Appendix A.3), the original high-dimensional $\ell_1$-LinR estimator (4) is decoupled into a pair of scalar estimators, one for the active set and one for the inactive set, i.e.,

\hat{J}_{j}=\frac{\mathtt{soft}\left(\tanh\left(K_{0}\right),\lambda\left(1+\chi\right)\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)}\equiv\bar{J}_{j},\quad j\in\Psi, \quad (23)
\hat{J}_{j}=\frac{\sqrt{H}}{K\sqrt{N}}\mathtt{soft}\left(z_{j},\frac{\lambda M}{\sqrt{HN}}\right),\quad j\in\bar{\Psi}, \quad (24)

where $z_j\sim\mathcal{N}(0,1)$, $j\in\bar{\Psi}$, are i.i.d. standard Gaussian random variables. The decoupling property asserts that, once the EOS (22) is solved, the asymptotic behavior of $\ell_1$-LinR can be statistically described by a pair of simple scalar soft-thresholding estimators (see Figs. 1(a) and 1(b)).
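The following short Python sketch (our own illustration; the order parameters $\chi$, $H$, $K$ are treated as given inputs obtained by solving the EOS (22)) implements the two scalar estimators (23) and (24) and can be used, e.g., to estimate the per-coupling false-positive probability $P(\hat{J}_j\neq 0)$, $j\in\bar{\Psi}$, by Monte Carlo.

import numpy as np

def soft(z, tau):
    """Soft-thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def active_mean_estimate(K0, lam, chi, d):
    """Scalar estimator (23) for couplings in the active set."""
    return soft(np.tanh(K0), lam * (1.0 + chi)) / (1.0 + (d - 1) * np.tanh(K0) ** 2)

def inactive_estimates(M, N, lam, H, K, size=100000, seed=0):
    """Monte Carlo draws of the scalar estimator (24) for couplings in the inactive set."""
    z = np.random.default_rng(seed).standard_normal(size)
    return np.sqrt(H) / (K * np.sqrt(N)) * soft(z, lam * M / np.sqrt(H * N))

# Example usage (chi, H, K assumed to come from a solved EOS (22)):
# J_bar = active_mean_estimate(K0=0.4, lam=0.3, chi=chi, d=3)
# fp_rate = np.mean(inactive_estimates(M, N, lam=0.3, H=H, K=K) != 0.0)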

In the high-dimensional setting where $N$ is allowed to grow as a function of $M$, one important question is: what is the minimum number of samples $M$ required to achieve model selection consistency as $N\rightarrow\infty$? Although we obtain the pair of scalar estimators (23) and (24), there is no analytical solution to the EOS (22), making it difficult to derive an explicit condition. To overcome this difficulty, as shown in Appendix D, we perform a perturbation analysis of the EOS and obtain the asymptotic relation $H\simeq F$. We then obtain that, for a RR graph $G\in\mathcal{G}_{N,d,K_0}$, given $M$ i.i.d. samples $\mathcal{D}^M$, the $\ell_1$-LinR estimator (4) can consistently recover the graph structure $G$ as $N\rightarrow\infty$ if

M>\frac{c\left(\lambda,K_{0}\right)\log N}{\lambda^{2}},\quad 0<\lambda<\tanh\left(K_{0}\right), \quad (25)

where $c(\lambda,K_0)$ is a constant depending on the regularization parameter $\lambda$ and the coupling strength $K_0$, for which a sharp prediction (as verified in Sec. 5) is obtained as

c\left(\lambda,K_{0}\right)=\frac{2\left(1-\tanh^{2}\left(K_{0}\right)+d\lambda^{2}\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)}. \quad (26)

For details of the analysis, including the counterpart for $\ell_1$-LogR, see Appendix D. Consequently, we obtain the typical sample complexity of $\ell_1$-LinR for Ising model selection on typical RR graphs in the paramagnetic phase. The result in (25) is derived for $\ell_1$-LinR with a fixed regularization parameter $\lambda$. Since the value of $\lambda$ is upper bounded by $\tanh(K_0)$ (otherwise false negatives occur, as discussed in Appendix D), a lower bound on the typical sample complexity of $\ell_1$-LinR is obtained as

M>\frac{2\log N}{\tanh^{2}\left(K_{0}\right)}. \quad (27)

Interestingly, the scaling in (27) is the same as that of the information-theoretic worst-case result $M>\frac{c\log N}{K_0^2}$ obtained in [11] at high temperatures (i.e., small $K_0$), since $\tanh(K_0)=\mathcal{O}(K_0)$ as $K_0\rightarrow 0$.
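As a quick numerical illustration of (25)-(27), the following snippet (ours, not part of the paper) evaluates the predicted constant $c(\lambda,K_0)$, the corresponding minimum sample size, and the lower bound (27) for illustrative parameter values; for $d=3$, $K_0=0.4$, $\lambda=0.3$ it reproduces the critical scaling value $c_0(\lambda,K_0)=c(\lambda,K_0)/\lambda^2\approx 19.41$ quoted in Sec. 5.

import numpy as np

def c_lambda(lam, K0, d):
    """Constant c(lambda, K0) in (26) for an RR graph with node degree d."""
    t2 = np.tanh(K0) ** 2
    return 2.0 * (1.0 - t2 + d * lam ** 2) / (1.0 + (d - 1) * t2)

def min_samples(N, lam, K0, d):
    """Predicted typical sample complexity M > c(lambda, K0) * log(N) / lambda^2 from (25)."""
    return c_lambda(lam, K0, d) * np.log(N) / lam ** 2

def lower_bound(N, K0):
    """Lower bound (27), obtained as lambda -> tanh(K0)."""
    return 2.0 * np.log(N) / np.tanh(K0) ** 2

print(c_lambda(0.3, 0.4, 3) / 0.3 ** 2)      # ~19.41, the critical scaling value c_0
for N in (200, 1600, 3200):
    print(N, min_samples(N, lam=0.3, K0=0.4, d=3), lower_bound(N, K0=0.4))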

3.4 Non-asymptotic result for moderate M,NM,N

In practice, it is desirable to predict the non-asymptotic performance of the $\ell_1$-LinR estimator for moderate $M, N$. However, the scalar estimator (23) for the active set (see Fig. 1(a)) fails to capture the fluctuations around the mean estimates. This is because, in obtaining the energy term $\xi$ (16) of the free energy density (20), the fluctuations around the mean estimates $\{\bar{J}_j\}_{j\in\Psi}$ are averaged out by the expectation $\mathbb{E}_{s,z}(\cdot)$. To address this problem, we replace $\mathbb{E}_{s,z}(\cdot)$ in (20) with a sample average, thereby accounting for the finite-size effect, and obtain the following modified estimator for the active set:

\{\hat{J}_{j}\}_{j\in\Psi}=\underset{J_{j},\,j\in\Psi}{\arg\min}\left[\frac{\sum_{\mu=1}^{M}\left(s_{0}^{\mu}-\sum_{j\in\Psi}s_{j}^{\mu}J_{j}-\sqrt{Q}z^{\mu}\right)^{2}}{2\left(1+\chi\right)M}+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\right], \quad (28)

where $s_0^{\mu},s_{j,j\in\Psi}^{\mu}\sim P(s_0,\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})$ and $z^{\mu}\sim\mathcal{N}(0,1)$, $\mu=1,\dots,M$. The modified $d$-dimensional estimator (28) (see Fig. 1(c) for a schematic) is equivalent to the scalar estimator (23) (Fig. 1(a)) as $M\rightarrow\infty$, but it enables us to capture the fluctuations of $\{\hat{J}_j\}_{j\in\Psi}$ for moderate $M$. Note that, due to the replacement of the expectation with a sample average in the free energy density (20), the EOS (22) also needs to be modified; it can be solved iteratively as sketched in Algorithm 1. The details are given in Appendix E.1.

Input: $M, N, \lambda, K_0, \rho(\gamma)$, and $T_{\rm MC}$
Output: $\chi, Q, E, R, F, \eta, K, H, \{\hat{J}^t_j\}_{j\in\Psi}$
Initialization: $\chi, Q, E, R, F, \eta, K, H$
MC sampling: for $t=1,\dots,T_{\rm MC}$, draw random samples $s_0^{\mu,t},\{s^{\mu,t}_j\}_{j\in\Psi}\sim P(s_0,\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})$ and $z^{\mu,t}\sim\mathcal{N}(0,1)$, $\mu=1,\dots,M$
repeat
    for $t=1$ to $T_{\rm MC}$ do
        Solve $\{\hat{J}^{t}_{j}\}_{j\in\Psi}=\underset{J_{j},\,j\in\Psi}{\arg\min}\left[\frac{\sum_{\mu=1}^{M}\left(s_{0}^{\mu,t}-\sum_{j\in\Psi}s_{j}^{\mu,t}J_{j}-\sqrt{Q}z^{\mu,t}\right)^{2}}{2\left(1+\chi\right)M}+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\right]$
        Compute $\triangle^{t}=\frac{1}{M}\sum_{\mu=1}^{M}\left(s_{0}^{\mu,t}-\sum_{j\in\Psi}s_{j}^{\mu,t}\hat{J}^{t}_{j}\right)^{2}$
    Solve the EOS (22) with $\mathbb{E}_{s}\left(s_{0}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}\right)^{2}=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\triangle^{t}$
    Update the values of $\chi, Q, E, R, F, \eta, K, H$
until convergence
Algorithm 1: Method to solve the EOS (22) together with (28)
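A condensed Python sketch of the iteration in Algorithm 1 is given below (our own schematic, under simplifying assumptions). The callables sample_active and solve_eos are hypothetical placeholders: the former draws $(s_0,\boldsymbol{s}_{\Psi})$ from $P(s_0,\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*})$, and the latter stands in for the full order-parameter update of the EOS (22) detailed in Appendix E.1. The inner problem (28) is solved with scikit-learn's Lasso after rescaling, since multiplying (28) by $(1+\chi)$ gives the standard LASSO objective with effective regularization $\lambda(1+\chi)$.

import numpy as np
from sklearn.linear_model import Lasso

def algorithm1(sample_active, solve_eos, M, lam, Q0, chi0, T_MC=1000, n_iter=50, seed=0):
    """Schematic of Algorithm 1: alternate between the modified active-set estimator (28)
    and the EOS update, with the MC samples drawn once at the start."""
    rng = np.random.default_rng(seed)
    data = [sample_active(M, rng) for _ in range(T_MC)]          # list of (s0, S_Psi) pairs
    zs = [rng.standard_normal(M) for _ in range(T_MC)]
    Q, chi = Q0, chi0
    J_hats = []
    for _ in range(n_iter):
        residuals, J_hats = [], []
        for (s0, S_Psi), z in zip(data, zs):
            y = s0 - np.sqrt(Q) * z
            reg = Lasso(alpha=lam * (1.0 + chi), fit_intercept=False, max_iter=10000)
            reg.fit(S_Psi, y)                                    # solves (28) for this MC trial
            J_hats.append(reg.coef_)
            residuals.append(np.mean((s0 - S_Psi @ reg.coef_) ** 2))   # Delta^t in Algorithm 1
        Q, chi = solve_eos(np.mean(residuals))                   # update order parameters via EOS (22)
    return Q, chi, np.array(J_hats)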

Consequently, for moderate $M, N$, the non-asymptotic statistical properties of the $\ell_1$-LinR estimator can be characterized by the reduced $d$-dimensional $\ell_1$-LinR estimator (28) (Fig. 1(c)) and the scalar estimator (24) (Fig. 1(b)) using MC simulations. Denote by $\{\hat{J}_j^t\}$, $t=1,\dots,T_{\rm MC}$, the estimates in the $t$-th MC simulation, where $\{\hat{J}^t_j\}_{j\in\Psi}$ and $\{\hat{J}^t_j\}_{j\in\bar{\Psi}}$ are the solutions of (28) and (24), respectively, and $T_{\rm MC}$ is the total number of MC simulations. Then, the $Precision$ and $Recall$ are computed as

Precision=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\frac{\left\|\hat{J}_{j,j\in\Psi}^{t}\right\|_{0}}{\left\|\hat{J}_{j,j\in\Psi}^{t}\right\|_{0}+\left\|\hat{J}_{j,j\in\bar{\Psi}}^{t}\right\|_{0}},\;\;Recall=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\frac{\left\|\hat{J}_{j,j\in\Psi}^{t}\right\|_{0}}{d}, \quad (29)

where $\|\cdot\|_0$ is the $\ell_0$-norm indicating the number of nonzero elements. In addition, the RSS can be computed as $RSS=\sum_{j}|\hat{J}_j-J^{*}_j|^2=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\sum_{j\in\Psi}|\hat{J}_j^t-K_0|^2+R$.
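For completeness, the metrics in (29) and the RSS can be evaluated from the MC trials as in the short sketch below (ours; J_active and J_inactive are assumed to hold the per-trial solutions of (28) and (24), and R comes from the solved EOS).

import numpy as np

def mc_metrics(J_active, J_inactive, K0, R):
    """Precision and Recall as in (29), plus the RSS, from T_MC Monte Carlo trials.

    J_active   : (T_MC, d) estimates for the active set, solutions of (28)
    J_inactive : (T_MC, N-1-d) estimates for the inactive set, solutions of (24)
    """
    d = J_active.shape[1]
    tp = np.count_nonzero(J_active, axis=1)       # true positives per trial
    fp = np.count_nonzero(J_inactive, axis=1)     # false positives per trial
    precision = np.mean(tp / (tp + fp))
    recall = np.mean(tp / d)
    rss = np.mean(np.sum((J_active - K0) ** 2, axis=1)) + R
    return precision, recall, rss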

4 Discussions

It might seem surprising that, despite the apparent model misspecification due to the use of the quadratic loss, the $\ell_1$-LinR estimator can still correctly infer the structure with the same order of sample complexity as $\ell_1$-LogR. Our theoretical analysis thus implies that the idea of using linear regression for binary data is not as outrageous as one might imagine. Here we provide an intuitive explanation of its success, together with a discussion of its limitations.

On average, from (4), the stationarity condition for the $\ell_1$-LinR estimator is given as

\langle s_{0}s_{k}\rangle-\sum_{j\neq 0}\langle s_{j}s_{k}\rangle J_{j}=\lambda\partial|J_{k}|,\;k=1,\ldots,N-1, \quad (30)

where $\langle\cdot\rangle$ and $\partial|J_k|$ represent the average w.r.t. the Ising model distribution (1) and the sub-gradient of $|J_k|$, respectively. In the paramagnetic phase, the magnitude of $\langle s_i s_j\rangle$ decays exponentially w.r.t. the distance between sites $i$ and $j$. This guarantees that once the connections $J_k$ of sites in the first nearest-neighbor set $\Psi$ are given so that

\langle s_{0}s_{k}\rangle=\sum_{j\in\Psi}\langle s_{j}s_{k}\rangle J_{j}+\lambda\,{\rm sign}(J_{k}),\;\forall k\in\Psi \quad (31)

holds, the other conditions are automatically satisfied by setting all the other connections not in $\Psi$ to zero. For an appropriate choice of $\lambda$, (31) has solutions with ${\rm sign}(J_k^{*})J_k>0$, $\forall k\in\Psi$; namely, for each $k\in\Psi$, the estimate of $J_k$ has the same sign as the true value $J_k^{*}$. This implies that, on average, the $\ell_1$-LinR estimator can successfully recover the network structure, up to the connection signs, if $\lambda$ is chosen appropriately.
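This average-case argument can be checked numerically on a small instance. The sketch below (our own illustration, reusing the exact-enumeration helper ising_exact from the example in Sec. 2.1) solves (31) for the active-set couplings under the guess ${\rm sign}(J_k)=+1$ (appropriate for $K_0>0$) and verifies both the recovered signs and the remaining sub-gradient conditions $|\langle s_0 s_k\rangle-\sum_{j\in\Psi}\langle s_j s_k\rangle J_j|\le\lambda$ on the inactive set.

import numpy as np

def check_average_case(C, active, lam):
    """Check the averaged optimality conditions (30)-(31) for spin 0.

    C      : exact correlation matrix <s_i s_j> (e.g., from ising_exact)
    active : list of true neighbors Psi of spin 0
    lam    : regularization parameter
    """
    inactive = [k for k in range(1, C.shape[0]) if k not in active]
    # Solve (31) assuming sign(J_k) = +1 for all k in Psi
    A = C[np.ix_(active, active)]
    b = C[0, active] - lam
    J_active = np.linalg.solve(A, b)
    signs_ok = bool(np.all(J_active > 0))
    # Conditions of (30) for k not in Psi with J_k = 0: |<s_0 s_k> - sum_j <s_j s_k> J_j| <= lam
    slack = C[0, inactive] - C[np.ix_(inactive, active)] @ J_active
    inactive_ok = bool(np.all(np.abs(slack) <= lam))
    return J_active, signs_ok, inactive_ok

# Example (with C from the earlier ising_exact example, where the neighbors of spin 0 are 1, 2, 3):
# J_active, signs_ok, inactive_ok = check_average_case(C, active=[1, 2, 3], lam=0.3)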

The key to the above argument is that $\langle s_i s_j\rangle$ decays exponentially fast w.r.t. the distance between the two sites, which no longer holds beyond the phase transition. Thus, it is conjectured that the $\ell_1$-LinR estimator starts to fail in the network recovery exactly at the phase transition point. However, it is worth noting that this is in fact not limited to $\ell_1$-LinR: $\ell_1$-LogR also exhibits similar behavior unless post-thresholding is used, as reported in [43].
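For reference, the paramagnetic condition $(d-1)\tanh^2(K_0)<1$ used throughout locates the coupling strength at which the above argument is conjectured to break down; a one-line computation (ours) for the degree used in Sec. 5:

import numpy as np
# Boundary of the paramagnetic phase: (d - 1) * tanh(K_c)^2 = 1, i.e., K_c = arctanh(1 / sqrt(d - 1))
d = 3
K_c = np.arctanh(1.0 / np.sqrt(d - 1))
print(K_c)   # ~0.8814; the value K0 = 0.4 used in Sec. 5 lies well inside the paramagnetic phase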

5 Experimental results

Figure 2: Theoretical and experimental results of the RSS, $Precision$, and $Recall$ for the RR graph and the 2D grid using $\ell_1$-LinR and $\ell_1$-LogR with different values of $\alpha\equiv M/N$. The standard error bars are obtained from 1000 random runs. A good agreement between theory and experiment is achieved, even for the small-size 2D grid graph with many loops. For more results, please refer to Appendix G.
Figure 3: Left: critical scaling value $c_0(\lambda,K_0)\equiv\frac{c(\lambda,K_0)}{\lambda^2}$ of $\ell_1$-LinR and $\ell_1$-LogR for the RR graph $G\in\mathcal{G}_{N,d,K_0}$ with $d=3, K_0=0.4$. Middle and Right: Precision and Recall for the RR graph using $\ell_1$-LinR with $\lambda=0.3$. Experimental results are shown for $N=200, 400, 800, 1600, 3200$. When $c>c_0(\lambda,K_0)$ ($c_0(\lambda=0.3,K_0)\approx 19.41$ in this case), the Precision increases consistently with $N$ and approaches 1 as $N\rightarrow\infty$, while it decreases consistently with $N$ when $c<c_0(\lambda,K_0)$. The Recall increases consistently and approaches 1 as $N\rightarrow\infty$. For more results, please refer to Appendix G.

In this section, we conduct numerical experiments to verify the accuracy of the theoretical analysis. The experimental procedure is as follows. First, a random graph $G\in\mathcal{G}_{N,d,K_0}$ is generated and the Ising model is defined on it. Then, the spin snapshots are obtained using the Metropolis–Hastings algorithm [44, 45, 46] in the same way as [7], yielding the dataset $\mathcal{D}^M$. We randomly choose a center spin $s_0$ and infer its neighborhood using the $\ell_1$-LinR (4) and $\ell_1$-LogR [10] estimators. To obtain standard error bars, we repeat this sequence of operations 1000 times. The RR graph $G\in\mathcal{G}_{N,d,K_0}$ with node degree $d=3$ and coupling strength $K_0=0.4$ is considered, which satisfies the paramagnetic condition $(d-1)\tanh^2(K_0)<1$. The active couplings $\{J_{ij}\}_{(i,j)\in\mathtt{E}}$ take positive and negative signs with equal probability.¹

¹Though this setting differs from the analysis, where the nonzero couplings take a uniform sign, the results can be directly compared thanks to gauge symmetry [21].
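For reference, a minimal single-spin-flip Metropolis–Hastings sampler for generating the spin snapshots $\mathcal{D}^M$ is sketched below (our own simplified illustration, not the authors' experimental code; the burn-in and thinning values are illustrative assumptions).

import numpy as np

def metropolis_samples(J, M, burn_in=5000, thin=50, seed=0):
    """Draw M approximate samples from the Ising model (1) with coupling matrix J
    (symmetric, zero diagonal) using single-spin-flip Metropolis-Hastings updates."""
    rng = np.random.default_rng(seed)
    N = J.shape[0]
    s = rng.choice([-1, 1], size=N)
    samples = []
    for sweep in range(burn_in + M * thin):
        for _ in range(N):                       # one sweep = N single-spin update attempts
            i = rng.integers(N)
            dE = 2.0 * s[i] * (J[i] @ s)         # decrease of the exponent in (1) when flipping s_i
            if dE <= 0 or rng.random() < np.exp(-dE):
                s[i] = -s[i]
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            samples.append(s.copy())
    return np.array(samples)                     # (M, N) array of +/-1 spins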

We first verify the precise non-asymptotic predictions of our method described in Sec. 3.4. Fig. 2 (upper panel) shows the replica and experimental results of the RSS, $Precision$, and $Recall$ for $N=200$ with different values of $\alpha\equiv M/N$. It can be seen that for both $\ell_1$-LinR and $\ell_1$-LogR, there is a fairly good agreement between the theoretical predictions and experimental results, even for small $N=200$ and small $\alpha$ (equivalently small $M$), verifying the correctness of the replica analysis. Interestingly, a quantitatively similar behavior between $\ell_1$-LinR and $\ell_1$-LogR is observed in terms of precision and recall. Regarding the RSS, the two estimators actually behave differently, as can be clearly seen in Fig. 7 in Appendix G: the RSS is much smaller for $\ell_1$-LogR, which is natural since its estimates are closer to the true values thanks to the model match. As our theoretical analysis is based on the typical tree-like structure assumption, it is interesting to see whether it is applicable to graphs with loops. To this end, we consider the 2D four-nearest-neighbor grid with periodic boundary conditions, which is a common loopy graph. Fig. 2 (lower panel) shows the results for a $15\times 15$ 2D grid with uniform constant coupling $K_0=0.2$. The agreement between the theoretical and numerical results is fairly good, indicating that our theoretical result can be a good approximation even for loopy graphs. More results for different values of $N$ and $\lambda$ are shown in Figs. 7 and 8 in Appendix G.

Subsequently, the asymptotic result and the sharpness of the critical scaling value $c_0(\lambda,K_0)\equiv\frac{c(\lambda,K_0)}{\lambda^2}$ in (25) are evaluated. First, Fig. 3 (left) compares $c_0(\lambda,K_0)$ between $\ell_1$-LinR and $\ell_1$-LogR for the RR graph $G\in\mathcal{G}_{N,d,K_0}$ with $d=3, K_0=0.4$, indicating similar behavior of $\ell_1$-LogR and $\ell_1$-LinR. Then, we conduct experiments with $M=c\log N$ for different values of $c$ around $c_0(\lambda,K_0)$ and investigate the trend of $Precision$ and $Recall$ as $N$ increases. For $\lambda=0.3$, Fig. 3 (middle and right) shows the results for Precision and Recall, respectively. As expected, the Precision increases consistently with $N$ when $c>c_0(\lambda,K_0)$ and decreases consistently with $N$ when $c<c_0(\lambda,K_0)$, while the Recall increases consistently and approaches 1 as $N\rightarrow\infty$, which verifies the sharpness of the predicted critical scaling value. The results for $\ell_1$-LogR, including the case of $\lambda=0.1$ for both $\ell_1$-LinR and $\ell_1$-LogR, are shown in Figs. 9 and 10 in Appendix G.

6 Conclusion

In this paper, we provide a unified statistical mechanics framework for the analysis of the typical learning performance of $\ell_1$-regularized $M$-estimators, $\ell_1$-LinR in particular, for Ising model selection on typical paramagnetic RR graphs. Using the powerful replica method, the high-dimensional $\ell_1$-regularized $M$-estimator is decoupled into a pair of scalar estimators, from which we obtain an accurate estimate of the typical sample complexity. It is revealed that, perhaps surprisingly, the misspecified $\ell_1$-LinR estimator is model selection consistent using $M=\mathcal{O}(\log N)$ samples, which is of the same order as $\ell_1$-LogR. Moreover, with a slight modification of the scalar estimator for the active set to account for the finite-size effect, we further obtain sharp predictions of the non-asymptotic behavior of $\ell_1$-LinR (and also $\ell_1$-LogR) for moderate $M, N$. There is an excellent agreement between theoretical predictions and experimental results, even for graphs with many loops, which supports our findings. Several key assumptions are made in our theoretical analysis, such as the paramagnetic assumption, which implies that the coupling strength should not be too large. It is worth noting that the restrictive paramagnetic assumption applies not only to $\ell_1$-LinR but also to other low-complexity estimators such as $\ell_1$-LogR, unless post-thresholding is used [43]. These assumptions restrict the applicability of the presented results, and overcoming such limitations will be an important direction for future work.

Acknowledgements

This work was supported by JSPS KAKENHI Nos. 17H00764, 18K11463, and 19H01812, and JST CREST Grant Number JPMJCR1912, Japan.

References

  • [1] Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
  • [2] H Chau Nguyen, Riccardo Zecchina, and Johannes Berg. Inverse statistical problems: from the inverse ising problem to data science. Advances in Physics, 66(3):197–261, 2017.
  • [3] Erik Aurell and Magnus Ekeberg. Inverse ising inference using all the data. Physical review letters, 108(9):090201, 2012.
  • [4] Ludovica Bachschmid-Romano and Manfred Opper. Learning of couplings for random asymmetric kinetic ising models revisited: random correlation matrices and learning curves. Journal of Statistical Mechanics: Theory and Experiment, 2015(9):P09016, 2015.
  • [5] Johannes Berg. Statistical mechanics of the inverse ising problem and the optimal objective function. Journal of Statistical Mechanics: Theory and Experiment, 2017(8):083402, 2017.
  • [6] Ludovica Bachschmid-Romano and Manfred Opper. A statistical physics approach to learning curves for the inverse ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2017(6):063406, 2017.
  • [7] Alia Abbara, Yoshiyuki Kabashima, Tomoyuki Obuchi, and Yingying Xu. Learning performance in inverse ising problems with sparse teacher couplings. Journal of Statistical Mechanics: Theory and Experiment, 2020(7):073402, 2020.
  • [8] Martin J Wainwright, John D Lafferty, and Pradeep K Ravikumar. High-dimensional graphical model selection using $\ell_1$-regularized logistic regression. In Advances in neural information processing systems, pages 1465–1472, 2007.
  • [9] Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10(4):883–906, 2009.
  • [10] Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. High-dimensional ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
  • [11] Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.
  • [12] Aurélien Decelle and Federico Ricci-Tersenghi. Pseudolikelihood decimation algorithm improving the inference of the interaction network in a general class of ising models. Physical review letters, 112(7):070603, 2014.
  • [13] Guy Bresler. Efficiently learning ising models on arbitrary graphs. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 771–782, 2015.
  • [14] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.
  • [15] Andrey Y Lokhov, Marc Vuffray, Sidhant Misra, and Michael Chertkov. Optimal structure and parameter learning of ising models. Science advances, 4(3):e1700791, 2018.
  • [16] Shanshan Wu, Sujay Sanghavi, and Alexandros G Dimakis. Sparse logistic regression learns all discrete pairwise graphical models. Advances in Neural Information Processing Systems, 32:8071–8081, 2019.
  • [17] Guy Bresler, Elchanan Mossel, and Allan Sly. Reconstruction of markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 343–356. Springer, 2008.
  • [18] Adarsh Prasad, Vishwak Srinivasan, Sivaraman Balakrishnan, and Pradeep Ravikumar. On learning ising models under huber’s contamination model. Advances in Neural Information Processing Systems, 33, 2020.
  • [19] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
  • [20] Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro. Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing Company, 1987.
  • [21] Hidetoshi Nishimori. Statistical physics of spin glasses and information processing: an introduction. Number 111. Clarendon Press, 2001.
  • [22] Manfred Opper and David Saad. Advanced mean field methods: Theory and practice. MIT press, 2001.
  • [23] Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
  • [24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [25] Mohsen Bayati and Andrea Montanari. The lasso risk for gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2011.
  • [26] Sundeep Rangan, Alyson K Fletcher, and Vivek K Goyal. Asymptotic analysis of map estimation via the replica method and applications to compressed sensing. IEEE Transactions on Information Theory, 58(3):1902–1923, 2012.
  • [27] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Lasso with non-linear measurements is equivalent to one with linear measurements. Advances in Neural Information Processing Systems, 28:3420–3428, 2015.
  • [28] Cédric Gerbelot, Alia Abbara, and Florent Krzakala. Asymptotic errors for convex penalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
  • [29] Xiangming Meng, Tomoyuki Obuchi, and Yoshiyuki Kabashima. Structure learning in inverse ising problems using $\ell_2$-regularized linear estimator. Journal of Statistical Mechanics: Theory and Experiment, 2021(5):053403, 2021.
  • [30] Julian Besag. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3):179–195, 1975.
  • [31] David R Brillinger. A generalized linear model with gaussian regressor variables. In A Festschrift for Erich L. Lehmann, page 97–114. 1982.
  • [32] Yaniv Plan and Roman Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
  • [33] Yue Zhang, Weihong Guo, and Soumya Ray. On the consistency of feature selection with lasso for non-linear targets. In International Conference on Machine Learning, pages 183–191. PMLR, 2016.
  • [34] Martin Genzel. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. IEEE Transactions on Information Theory, 63(3):1601–1619, 2016.
  • [35] Nicolai Meinshausen, Peter Bühlmann, et al. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
  • [36] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • [37] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [38] Andreas Engel and Christian Van den Broeck. Statistical mechanics of learning. Cambridge University Press, 2001.
  • [39] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR, 2020.
  • [40] Madhu Advani and Surya Ganguli. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X, 6(3):031034, 2016.
  • [41] Persi Diaconis and Mehrdad Shahshahani. On the eigenvalues of random matrices. Journal of Applied Probability, 31(A):49–62, 1994.
  • [42] Kurt Johansson. On random matrices from the compact classical groups. Annals of mathematics, pages 519–545, 1997.
  • [43] José Bento and Andrea Montanari. Which graphical models are difficult to learn? arXiv preprint arXiv:0910.5761, 2009.
  • [44] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
  • [45] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970.
  • [46] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
  • [47] Federico Ricci-Tersenghi. The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08015, 2012.
  • [48] H Chau Nguyen and Johannes Berg. Bethe–Peierls approximation and the inverse Ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2012(03):P03004, 2012.
  • [49] Brendan D McKay. The expected eigenvalue distribution of a large regular graph. Linear Algebra and its Applications, 40:203–216, 1981.

Appendix A Free energy density ff computation

We present the detailed derivation of the average free energy density f=1Nβ[logZ]𝒟Mf=-\frac{1}{N\beta}\left[\log Z\right]_{\mathcal{D}^{M}} in (9) using the replica method. Our method provides a unified framework for the statistical mechanics analysis of any 1\ell_{1}-regularized MM-estimator of the form (2). For generality, in the following derivations we first focus on a generic 1\ell_{1}-regularized MM-estimator (2) with a generic loss function ()\ell\left(\cdot\right). After obtaining the generic results, specific results are provided for both the 1\ell_{1}-LinR estimator (4) with quadratic loss (x)=12(x1)2\ell\left(x\right)=\frac{1}{2}\left(x-1\right)^{2} and the 1\ell_{1}-LogR estimator with logistic loss (x)=log(1+e2x)\ell\left(x\right)=\log\left(1+e^{-2x}\right). For the IS estimator, the results can be easily obtained by substituting (x)=ex\ell\left(x\right)=e^{-x}, though the specific results are not shown.

A.1 Energy term ξ\xi of ff

The key step of the replica method is to compute the replicated partition function [Zn]𝒟M\left[Z^{n}\right]_{\mathcal{D}^{M}}. According to the definition in (10) and Ansatz (A1) in Sec. 3.2, the average replicated partition function [Zn]𝒟M\left[Z^{n}\right]_{\mathcal{D}^{M}} can be rewritten as

[Zn]𝒟M\displaystyle\left[Z^{n}\right]_{\mathcal{D}^{M}} =a=1nd𝑱aeβλMa=1nj|Jja|{sPIsing(s|𝑱)exp[βa=1n(s0ha)]}M,\displaystyle=\int\prod_{a=1}^{n}d\boldsymbol{J}^{a}e^{-\beta\lambda M\sum_{a=1}^{n}\sum_{j}\left|J_{j}^{a}\right|}\left\{\sum_{s}P_{\textrm{Ising}}\left(s|\boldsymbol{J}^{*}\right)\exp\left[-\beta\sum_{a=1}^{n}\ell\left(s_{0}h^{a}\right)\right]\right\}^{M},
a=1ndwaeβλM(a=1njΨ|J¯j|+a=1n1Nwa1)×\displaystyle\approx\int\prod_{a=1}^{n}dw^{a}e^{-\beta\lambda M\left(\sum_{a=1}^{n}\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\sum_{a=1}^{n}\frac{1}{\sqrt{N}}\left\|w^{a}\right\|_{1}\right)}\times
{sPIsing(s|𝑱)a𝑑hwaδ(hwa1NjΨ¯wjasj)eβa=1n(s0(jΨJ¯jsj+hwa))}αN\displaystyle\left\{\sum_{s}P_{\textrm{Ising}}\left(s|\boldsymbol{J}^{*}\right)\prod_{a}\int dh_{w}^{a}\delta\left(h_{w}^{a}-\frac{1}{\sqrt{N}}\sum_{j\in\bar{\Psi}}w_{j}^{a}s_{j}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}\right\}^{\alpha N}
=a=1ndwaeβλM(njΨ|J¯j|+a=1nwa1N)×\displaystyle=\int\prod_{a=1}^{n}dw^{a}e^{-\beta\lambda M\left(n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\sum_{a=1}^{n}\frac{\left\|w^{a}\right\|_{1}}{\sqrt{N}}\right)}\times
{s0,sΨa=1ndhwaP(s0,sΨ,{hwa}a|𝑱,{wa}a)eβa=1n(s0(jΨJ¯jsj+hwa))}αN\displaystyle\left\{\sum_{s_{0},s_{\Psi}}\int\prod_{a=1}^{n}dh_{w}^{a}P\left(s_{0},s_{\Psi},\left\{h_{w}^{a}\right\}_{a}|\boldsymbol{J}^{*},\left\{w^{a}\right\}_{a}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}\right\}^{\alpha N}
a=1ndwaeβλM(njΨ|J¯j|+a=1nwa1N)×\displaystyle\approx\int\prod_{a=1}^{n}dw^{a}e^{-\beta\lambda M\left(n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\sum_{a=1}^{n}\frac{\left\|w^{a}\right\|_{1}}{\sqrt{N}}\right)}\times
{s0,sΨP(s0,𝒔Ψ|𝑱)a=1ndhwaPnoise({hwa}a|{wa}a)eβa=1n(s0(jΨJ¯jsj+hwa))}αN,\displaystyle\left\{\sum_{s_{0},s_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\prod_{a=1}^{n}dh_{w}^{a}P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}\right\}^{\alpha N}, (32)

where {1Nwja,jΨ}\left\{\frac{1}{\sqrt{N}}w_{j}^{a},j\in\Psi\right\} in the finite active set Ψ\Psi are neglected in the second line when NN is large, P(s0,𝒔Ψ|𝑱)=𝒔Ψ¯PIsing(s|𝑱)P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)=\sum_{\boldsymbol{s}_{\bar{\Psi}}}P_{\textrm{Ising}}\left(s|\boldsymbol{J}^{*}\right) is the marginal distribution of s0,𝒔Ψs_{0},\boldsymbol{s}_{\Psi}, which can be computed as in [7], and Pnoise({hwa}a|{wa}a)P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right) is the distribution of the “noise” part hwa1NjΨ¯wjasjh_{w}^{a}\equiv\frac{1}{\sqrt{N}}\sum_{j\in\bar{\Psi}}w_{j}^{a}s_{j} of the local field. In the last line, the asymptotic independence between hwah_{w}^{a} and (s0,𝒔Ψs_{0},\boldsymbol{s}_{\Psi}) is applied, as discussed in [7]. Regarding the marginal distribution P(s0,𝒔Ψ|𝑱)P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right), in general we have to take the cavity fields into account. In the case considered in this paper, however, the paramagnetic assumption simplifies the marginal distribution, which is finally proportional to es0jΨJjsje^{s_{0}\sum_{j\in\Psi}J_{j}^{*}s_{j}} [7]. When Ψ\Psi has a small cardinality dd, we can compute the expectation w.r.t. (s0,sΨ)(s_{0},s_{\Psi}) exactly by exhaustive enumeration. For large dd, MC methods like the Metropolis–Hastings algorithm [44, 45, 46] might be used.
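As an illustration of this enumeration step, the following minimal Python sketch enumerates the simplified marginal for small dd with uniform couplings K0K_{0} and checks the pairwise correlations used later in (73) and (79); the function name and the parameter values are purely illustrative, and for large dd the enumeration would be replaced by MC sampling.

```python
import itertools
import numpy as np

def active_set_marginal(K0, d):
    """Enumerate the simplified marginal P(s0, s_Psi) ∝ exp(s0 * K0 * sum_j s_j)
    over the 2^(d+1) configurations of (s0, s_1, ..., s_d)."""
    configs = np.array(list(itertools.product([-1, 1], repeat=d + 1)))
    weights = np.exp(configs[:, 0] * K0 * configs[:, 1:].sum(axis=1))
    return configs, weights / weights.sum()

# Check E[s0 s_k] = tanh(K0) and E[s_k s_j] = tanh(K0)^2 for d = 3, K0 = 0.4
configs, probs = active_set_marginal(K0=0.4, d=3)
print(np.sum(probs * configs[:, 0] * configs[:, 1]), np.tanh(0.4))
print(np.sum(probs * configs[:, 1] * configs[:, 2]), np.tanh(0.4) ** 2)
```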

To proceed with the calculation, according to the CLT, the noise part {hwa}a=1n\left\{h_{w}^{a}\right\}_{a=1}^{n} can be regarded as Gaussian variables so that Pnoise({hwa}a|{wa}a)P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right) can be approximated as a multivariate Gaussian distribution. Under the RS ansatz, two auxiliary order parameters are introduced, i.e.,

Q\displaystyle Q\equiv 1Ni,jΨ¯wiaCij\0wja,\displaystyle\frac{1}{N}\sum_{i,j\in\bar{\Psi}}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{a}, (33)
q\displaystyle q\equiv 1Ni,jΨ¯wiaCij\0wjb,(ab),\displaystyle\frac{1}{N}\sum_{i,j\in\bar{\Psi}}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{b},\;\left(a\neq b\right), (34)

where C\0={Cij\0}C^{\backslash 0}=\left\{C_{ij}^{\backslash 0}\right\} is the covariance matrix of the original Ising model without s0s_{0}. To write the integration in terms of the order parameters Q,qQ,q, we introduce the following trivial identities

1\displaystyle 1 =N𝑑Qδ(i,j0wiaCij\0wjaNQ),a=1,,n\displaystyle=N\int dQ\delta\left(\sum_{i,j\neq 0}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{a}-NQ\right),a=1,...,n (35)
1\displaystyle 1 =N𝑑qδ(i,j0wiaCij\0wjbNq),a<b,\displaystyle=N\int dq\delta\left(\sum_{i,j\neq 0}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{b}-Nq\right),a<b, (36)

so that [Zn]𝒟M\left[Z^{n}\right]_{\mathcal{D}^{M}} in (32) can be rewritten as

[Zn]𝒟M\displaystyle\left[Z^{n}\right]_{\mathcal{D}^{M}} =eβλMnjΨ|J¯j|𝑑Q𝑑qa=1ndwaeλβMNa=1nwa1\displaystyle=e^{-\beta\lambda Mn\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\int dQdq\int\prod_{a=1}^{n}dw^{a}e^{-\lambda\beta\frac{M}{\sqrt{N}}\sum_{a=1}^{n}\left\|w^{a}\right\|_{1}}
×a=1nδ(i,jwiaCij\0wjaNQ)a<bδ(i,jwiaCij\0wjbNq)×\displaystyle\times\prod_{a=1}^{n}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{a}-NQ\right)\prod_{a<b}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{b}-Nq\right)\times
{s0,sΨP(s0,𝒔Ψ|𝑱)a=1ndhwaPnoise({hwa}a|{wa}a)eβa=1n(s0(jΨJ¯jsj+hwa))}αN\displaystyle\left\{\sum_{s_{0},s_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\prod_{a=1}^{n}dh_{w}^{a}P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}\right\}^{\alpha N} (37)
=\displaystyle= 𝑑Q𝑑qIeMlogL,\displaystyle\int dQdqIe^{M\log L}, (38)

where

I\displaystyle I a=1ndwaeλβMNa=1nwa1a=1nδ(i,jwiaCij\0wjaNQ)a<bδ(i,jwiaCij\0wjbNq),\displaystyle\equiv\int\prod_{a=1}^{n}dw^{a}e^{-\lambda\beta\frac{M}{\sqrt{N}}\sum_{a=1}^{n}\left\|w^{a}\right\|_{1}}\prod_{a=1}^{n}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{a}-NQ\right)\prod_{a<b}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}^{\backslash 0}w_{j}^{b}-Nq\right), (39)
L\displaystyle L eβλnjΨ|J¯j|s0,𝒔ΨP(s0,𝒔Ψ|𝑱)a=1ndhwaPnoise({hwa}a|{wa}a)eβa=1n(s0(jΨJ¯jsj+hwa)).\displaystyle\equiv e^{-\beta\lambda n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\prod_{a=1}^{n}dh_{w}^{a}P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}. (40)

According to the CLT together with (33) and (34), the noise parts hwa,a=1,,nh_{w}^{a},a=1,\ldots,n follow a multivariate Gaussian distribution with zero mean (paramagnetic assumption) and covariances

hwahwb\0=Qδab+(1δab)q.\left\langle h_{w}^{a}h_{w}^{b}\right\rangle^{\backslash 0}=Q\delta_{ab}+\left(1-\delta_{ab}\right)q. (41)

Consequently, by introducing two auxiliary i.i.d. standard Gaussian random variables va𝒩(0,1),z𝒩(0,1)v_{a}\sim\mathcal{N}\left(0,1\right),z\sim\mathcal{N}\left(0,1\right), the noise parts hwa,a=1,,nh_{w}^{a},a=1,\ldots,n can be written in a compact form

hwa=Qqva+qz,a=1,,nh_{w}^{a}=\sqrt{Q-q}v_{a}+\sqrt{q}z,a=1,\ldots,n (42)

so that LL in (40) could be written as

L\displaystyle L =eβλnjΨ|J¯j|s0,𝒔ΨP(s0,𝒔Ψ|𝑱)a=1ndhwaPnoise({hwa}a|{wa}a)eβa=1n(s0(jΨJ¯jsj+hwa))\displaystyle=e^{-\beta\lambda n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\prod_{a=1}^{n}dh_{w}^{a}P_{\textrm{noise}}\left(\left\{h_{w}^{a}\right\}_{a}|\left\{w^{a}\right\}_{a}\right)e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+h_{w}^{a}\right)\right)}
=eβλnjΨ|J¯j|s0,𝒔ΨP(s0,𝒔Ψ|𝑱)𝒟za𝒟vaeβa=1n(s0(jΨJ¯jsj+Qqva+qz))\displaystyle=e^{-\beta\lambda n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\mathcal{D}z\prod_{a}\mathcal{D}v_{a}e^{-\beta\sum_{a=1}^{n}\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+\sqrt{Q-q}v_{a}+\sqrt{q}z\right)\right)}
=eβλnjΨ|J¯j|s0,𝒔ΨP(s0,𝒔Ψ|𝑱)𝒟z[𝒟veβ(s0(jΨJ¯jsj+Qqv+qz))𝐴]n\displaystyle=e^{-\beta\lambda n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\mathcal{D}z\left[\underset{A}{\underbrace{\int\mathcal{D}ve^{-\beta\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+\sqrt{Q-q}v+\sqrt{q}z\right)\right)}}}\right]^{n}
=eβλnjΨ|J¯j|s0,𝒔ΨP(s0,𝒔Ψ|𝑱)𝔼z(An),\displaystyle=e^{-\beta\lambda n\sum_{j\in\Psi}\left|\bar{J}_{j}\right|}\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\mathbb{E}_{z}\left(A^{n}\right), (43)

where 𝒟z=dz2πez22\mathcal{D}z=\frac{dz}{\sqrt{2\pi}}e^{-\frac{z^{2}}{2}}. As a result, using the replica formula, we have

limn01nlogL\displaystyle\underset{n\rightarrow 0}{\lim}\frac{1}{n}\log L
=\displaystyle= βλjΨ|J¯j|+limn0logs0,𝒔ΨP(s0,𝒔Ψ|𝑱)Ez(An)n\displaystyle-\beta\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\underset{n\rightarrow 0}{\lim}\frac{\log\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)E_{z}\left(A^{n}\right)}{n}
=\displaystyle= βλjΨ|J¯j|+𝔼z[s0,𝒔ΨP(s0,𝒔Ψ|𝑱)logA]\displaystyle-\beta\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\mathbb{E}_{z}\left[\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\log A\right]
=\displaystyle= βλjΨ|J¯j|+s0,𝒔ΨP(s0,𝒔Ψ|𝑱)𝒟zlog𝒟veβ(s0(jΨJ¯jsj+Qqv+qz))\displaystyle-\beta\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\mathcal{D}z\log\int\mathcal{D}ve^{-\beta\ell\left(s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+\sqrt{Q-q}v+\sqrt{q}z\right)\right)}
=\displaystyle= βλjΨ|J¯j|+s0,𝒔ΨP(s0,𝒔Ψ|𝑱)𝒟zlogdy2π(Qq)e[ys0(jΨJ¯jsj+qz)]22(Qq)eβ(y),\displaystyle-\beta\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\sum_{s_{0},\boldsymbol{s}_{\Psi}}P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right)\int\mathcal{D}z\log\int\frac{dy}{\sqrt{2\pi\left(Q-q\right)}}e^{-\frac{\left[y-s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+\sqrt{q}z\right)\right]^{2}}{2\left(Q-q\right)}}e^{-\beta\ell\left(y\right)}, (44)

where in the last line, a change of variable y=s0(jΨJ¯jsj+Qqv+qz)y=s_{0}\left(\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}+\sqrt{Q-q}v+\sqrt{q}z\right) is used.

As a result, from (9), the average free energy density in the limit β\beta\rightarrow\infty reads

f(β)\displaystyle f\left(\beta\rightarrow\infty\right) =limβ1Nβ{limn0nlogI+Mlimn0nlogL}\displaystyle=\lim_{\beta\rightarrow\infty}-\frac{1}{N\beta}\left\{\underset{n\rightarrow 0}{\lim}\frac{\partial}{\partial n}\log I+M\underset{n\rightarrow 0}{\lim}\frac{\partial}{\partial n}\log L\right\}
=𝙴𝚡𝚝𝚛{ξ+S},\displaystyle=-\mathtt{Extr}\left\{-\mathcal{\xi}+S\right\}, (45)

where Extr{}{\textrm{Extr}}\left\{\cdot\right\} denotes extremization w.r.t. some relevant variables, and ξ,S\mathcal{\xi},S are the corresponding energy and entropy terms of ff, respectively:

S\displaystyle S =limn01NβnlogI,\displaystyle=\underset{n\rightarrow 0}{\lim}\frac{1}{N\beta}\frac{\partial}{\partial n}\log I, (46)
I\displaystyle I =a=1ndwaeλβa=1nwa1a=1nδ(i,jwiaCijwjaNQ)a<bδ(i,jwiaCijwjbNq),\displaystyle=\int\prod_{a=1}^{n}dw^{a}e^{-\lambda\beta\sum_{a=1}^{n}\left\|w^{a}\right\|_{1}}\prod_{a=1}^{n}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}w_{j}^{a}-NQ\right)\prod_{a<b}\delta\left(\sum_{i,j}w_{i}^{a}C_{ij}w_{j}^{b}-Nq\right), (47)
ξ\displaystyle\mathcal{\mathcal{\xi}} =αλjΨ|J¯j|+α𝔼s,z(min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)]),\displaystyle=\alpha\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|+\alpha\mathbb{E}_{s,z}\left(\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right]\right), (48)

and the relation limββ(Qq)χ\lim_{\beta\rightarrow\infty}\beta\left(Q-q\right)\equiv\chi is used [6, 7]. The extremization in the free energy result (45) comes from the saddle-point method in the large NN limit.

A.2 Entropy term SS of ff

To obtain the final result of the free energy density, there is still one remaining entropy term SS to compute, which requires the result of II (47). However, unlike the 2\ell_{2}-norm, the 1\ell_{1}-norm in (47) breaks the rotational invariance property, which makes the computation of II difficult, and the methods in [7, 29] are no longer applicable. To address this problem, applying the Haar Orthogonal Ansatz (A2) in Sec. 3.2, we replace II with an average [I]O\left[I\right]_{O} over the orthogonal matrix OO generated from the Haar orthogonal measure.

Specifically, also under the RS ansatz, two auxiliary order parameters are introduced, i.e.,

R\displaystyle R\equiv 1Ni,jwiawja,\displaystyle\frac{1}{N}\sum_{i,j}w_{i}^{a}w_{j}^{a}, (49)
r\displaystyle r\equiv 1Ni,jwiawjb,(ab).\displaystyle\frac{1}{N}\sum_{i,j}w_{i}^{a}w_{j}^{b},\;\left(a\neq b\right). (50)

Then, by inserting the delta functions aδ((wa)TwaNR)a<bδ((wa)TwbNr)\prod_{a}\delta\left(\left(w^{a}\right)^{T}w^{a}-NR\right)\prod_{a<b}\delta\left(\left(w^{a}\right)^{T}w^{b}-Nr\right), we obtain

I\displaystyle I =a=1ndwaeλβMNa=1nwa1a=1nδ((wa)TCwaNQ)a<bδ((wa)TCwbNq)\displaystyle=\int\prod_{a=1}^{n}dw^{a}e^{-\frac{\lambda\beta M}{\sqrt{N}}\sum_{a=1}^{n}\left\|w^{a}\right\|_{1}}\prod_{a=1}^{n}\delta\left(\left(w^{a}\right)^{T}Cw^{a}-NQ\right)\prod_{a<b}\delta\left(\left(w^{a}\right)^{T}Cw^{b}-Nq\right)
×dRdraδ((wa)TwaNR)a<bδ((wa)TwbNr).\displaystyle\times\int dRdr\prod_{a}\delta\left(\left(w^{a}\right)^{T}w^{a}-NR\right)\prod_{a<b}\delta\left(\left(w^{a}\right)^{T}w^{b}-Nr\right). (51)

Moreover, expressing the original delta functions in (51) via the following identities

{δ((wa)TCwaNQ)=𝑑Q^eQ^2((wa)TCwaNQ),δ((wa)TCwbNq)=𝑑q^eq^((wa)TCwbNq),\begin{cases}\delta\left(\left(w^{a}\right)^{T}Cw^{a}-NQ\right)=\int d\hat{Q}e^{-\frac{\hat{Q}}{2}\left(\left(w^{a}\right)^{T}Cw^{a}-NQ\right)},\\ \delta\left(\left(w^{a}\right)^{T}Cw^{b}-Nq\right)=\int d\hat{q}e^{\hat{q}\left(\left(w^{a}\right)^{T}Cw^{b}-Nq\right)},\end{cases}

and taking the average over the orthogonal matrix OO, after some algebra, II is replaced by the following average [I]O\left[I\right]_{O}

[I]O\displaystyle\left[I\right]_{O} =𝑑R𝑑r𝑑Q^𝑑q^a=1ndwaeλβMNa=1nwa1aδ((wa)TwaNR)a<bδ((wa)TwbNr)\displaystyle=\int dRdrd\hat{Q}d\hat{q}\prod_{a=1}^{n}dw^{a}e^{-\frac{\lambda\beta M}{\sqrt{N}}\sum_{a=1}^{n}\left\|w^{a}\right\|_{1}}\prod_{a}\delta\left(\left(w^{a}\right)^{T}w^{a}-NR\right)\prod_{a<b}\delta\left(\left(w^{a}\right)^{T}w^{b}-Nr\right)
×exp{Nn2Q^QNn2(n1)q^q}×[e12𝚃𝚛(CLn)]O,\displaystyle\times\exp\left\{\frac{Nn}{2}\hat{Q}Q-\frac{Nn}{2}\left(n-1\right)\hat{q}q\right\}\times\left[e^{\frac{1}{2}\mathtt{Tr}\left(CL_{n}\right)}\right]_{O}, (52)
Ln\displaystyle L_{n} =(Q^+q^)a=1nwa(wa)T+q^(a=1nwa)(b=1nwb)T.\displaystyle=-\left(\hat{Q}+\hat{q}\right)\sum_{a=1}^{n}w^{a}\left(w^{a}\right)^{T}+\hat{q}\left(\sum_{a=1}^{n}w^{a}\right)\left(\sum_{b=1}^{n}w^{b}\right)^{T}. (53)

To proceed with the computation, the eigendecomposition of the matrix LnL_{n} is performed. After some algebra, for configurations of waw^{a} that satisfy both (wa)Twa=NR\left(w^{a}\right)^{T}w^{a}=NR and (wa)Twb=Nr\left(w^{a}\right)^{T}w^{b}=Nr, the eigenvalues and associated eigenvectors of the matrix LnL_{n} can be calculated as follows

{λ1=N(Q^+q^nq^)(Rr+nr),u1=a=1nwa,λ2=N(Q^+q^)(Rr),ua=wa1nb=1nwb,a=2,,n,\displaystyle\begin{cases}\lambda_{1}=-N\left(\hat{Q}+\hat{q}-n\hat{q}\right)\left(R-r+nr\right),\\ u_{1}=\sum_{a=1}^{n}w^{a},\\ \lambda_{2}=-N\left(\hat{Q}+\hat{q}\right)\left(R-r\right),\\ u_{a}=w^{a}-\frac{1}{n}\sum_{b=1}^{n}w^{b},a=2,...,n,\end{cases} (54)

where λ1\lambda_{1} is the eigenvalue corresponding to the eigenvector u1u_{1} while λ2\lambda_{2} is the degenerate eigenvalue corresponding to eigenvectors ua,a=2,,nu_{a},a=2,...,n. To compute [e12𝚃𝚛(CLn)]O\left[e^{\frac{1}{2}\mathtt{Tr}\left(CL_{n}\right)}\right]_{O}, we define a function G(x)G\left(x\right) as

G(x)\displaystyle G\left(x\right) 1Nlog[exp(x2𝚃𝚛C(𝟏𝟏T))]O\displaystyle\equiv\frac{1}{N}\log\left[\exp\left(\frac{x}{2}\mathtt{Tr}C\left(\mathbf{1}\mathbf{1}^{T}\right)\right)\right]_{O}
=𝙴𝚡𝚝𝚛Λ{12log(Λγ)ρ(γ)𝑑γ+Λ2x}12logx12,\displaystyle=\underset{\Lambda}{\mathtt{Extr}}\left\{-\frac{1}{2}\int\log\left(\Lambda-\gamma\right)\rho\left(\gamma\right)d\gamma+\frac{\Lambda}{2}x\right\}-\frac{1}{2}\log x-\frac{1}{2}, (55)

and ρ(γ)\rho\left(\gamma\right) is the eigenvalue distribution (EVD) of CC. Then, combined with (54), after some algebra, we obtain that

1Nlog[e12𝚃𝚛(CLn)]O=G((Q^+q^nq^)(Rr+nr))+(n1)G((Q^+q^)(Rr)).\frac{1}{N}\log\left[e^{\frac{1}{2}\mathtt{Tr}\left(CL_{n}\right)}\right]_{O}=G\left(-\left(\hat{Q}+\hat{q}-n\hat{q}\right)\left(R-r+nr\right)\right)+\left(n-1\right)G\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right). (56)

Furthermore, expressing the remaining delta functions in (51) as

{δ((wa)TwaNR)=𝑑R^eR^2((wa)TwaNR),δ((wa)TwbNr)=𝑑r^er^((wa)TwbNr),\begin{cases}\delta\left(\left(w^{a}\right)^{T}w^{a}-NR\right)=\int d\hat{R}e^{-\frac{\hat{R}}{2}\left(\left(w^{a}\right)^{T}w^{a}-NR\right)},\\ \delta\left(\left(w^{a}\right)^{T}w^{b}-Nr\right)=\int d\hat{r}e^{\hat{r}\left(\left(w^{a}\right)^{T}w^{b}-Nr\right)},\end{cases}

we obtain

[I]0\displaystyle\left[I\right]_{0} =𝑑R𝑑r𝑑Q^𝑑q^𝑑R^𝑑r^a=1ndwaexp{a=1nλβMNwa1R^+r^2a=1n(wa)Twa+r^2a,b(wa)Twb}\displaystyle=\int dRdrd\hat{Q}d\hat{q}d\hat{R}d\hat{r}\prod_{a=1}^{n}dw^{a}\exp\left\{-\sum_{a=1}^{n}\frac{\lambda\beta M}{\sqrt{N}}\left\|w^{a}\right\|_{1}-\frac{\hat{R}+\hat{r}}{2}\sum_{a=1}^{n}\left(w^{a}\right)^{T}w^{a}+\frac{\hat{r}}{2}\sum_{a,b}\left(w^{a}\right)^{T}w^{b}\right\}
×exp{Nn2R^RNn2(n1)r^r+Nn2Q^QNn2(n1)q^q}×[e12𝚃𝚛(CLn)]O.\displaystyle\times\exp\left\{\frac{Nn}{2}\hat{R}R-\frac{Nn}{2}\left(n-1\right)\hat{r}r+\frac{Nn}{2}\hat{Q}Q-\frac{Nn}{2}\left(n-1\right)\hat{q}q\right\}\times\left[e^{\frac{1}{2}\mathtt{Tr}\left(CL_{n}\right)}\right]_{O}. (57)

In addition, by linearizing the quadratic term that couples the replicas with a Gaussian integral, the integral over waw^{a} can be evaluated as

a=1ndwaexp{a=1nλβMNwa1R^+r^2a=1n(wa)Twa+r^2a,b(wa)Twb}\displaystyle\int\prod_{a=1}^{n}dw^{a}\exp\left\{-\sum_{a=1}^{n}\frac{\lambda\beta M}{\sqrt{N}}\left\|w^{a}\right\|_{1}-\frac{\hat{R}+\hat{r}}{2}\sum_{a=1}^{n}\left(w^{a}\right)^{T}w^{a}+\frac{\hat{r}}{2}\sum_{a,b}\left(w^{a}\right)^{T}w^{b}\right\}
=\displaystyle= a=1ndwaexp{a=1ni=1NλβMN|wia|R^+r^2a=1ni=1N(wia)2+r^2i=1N(a=1nwia)2}\displaystyle\int\prod_{a=1}^{n}dw^{a}\exp\left\{-\sum_{a=1}^{n}\sum_{i=1}^{N}\frac{\lambda\beta M}{\sqrt{N}}\left|w_{i}^{a}\right|-\frac{\hat{R}+\hat{r}}{2}\sum_{a=1}^{n}\sum_{i=1}^{N}\left(w_{i}^{a}\right)^{2}+\frac{\hat{r}}{2}\sum_{i=1}^{N}\left(\sum_{a=1}^{n}w_{i}^{a}\right)^{2}\right\}
=\displaystyle= i𝒟zia=1ndwaexp{a=1nλβMN|wia|R^+r^2a=1n(wia)2+r^ziawia}\displaystyle\prod_{i}\int\mathcal{D}z_{i}\int\prod_{a=1}^{n}dw^{a}\exp\left\{-\sum_{a=1}^{n}\frac{\lambda\beta M}{\sqrt{N}}\left|w_{i}^{a}\right|-\frac{\hat{R}+\hat{r}}{2}\sum_{a=1}^{n}\left(w_{i}^{a}\right)^{2}+\sqrt{\hat{r}}z_{i}\sum_{a}w_{i}^{a}\right\}
=\displaystyle= i𝒟zi{𝑑wexp[R^+r^2wi2+(r^zλβMN sign(wi))wi]}n,\displaystyle\prod_{i}\int\mathcal{D}z_{i}\left\{\int dw\exp\left[-\frac{\hat{R}+\hat{r}}{2}w_{i}^{2}+\left(\sqrt{\hat{r}}z-\frac{\lambda\beta M}{\sqrt{N}}\textrm{ sign}\left(w_{i}\right)\right)w_{i}\right]\right\}^{n},

where 𝒟zi=dzi2πezi22\mathcal{D}{z_{i}}=\frac{dz_{i}}{\sqrt{2\pi}}e^{-\frac{{z_{i}}^{2}}{2}}. Consequently, the entropy term SS of the free energy density ff is computed as

limn01Nnlog[I]O\displaystyle\underset{n\rightarrow 0}{\lim}\frac{1}{N}\frac{\partial}{\partial n}\log\left[I\right]_{O} =(q^(Rr)(Q^+q^)r)G((Q^+q^)(Rr))\displaystyle=\left(\hat{q}\left(R-r\right)-\left(\hat{Q}+\hat{q}\right)r\right)G^{{}^{\prime}}\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right)
+G((Q^+q^)(Rr))+R^R2+r^r2+Q^Q2+q^q2\displaystyle+G\left(-\left(\hat{Q}+\hat{q}\right)\left(R-r\right)\right)+\frac{\hat{R}R}{2}+\frac{\hat{r}r}{2}+\frac{\hat{Q}Q}{2}+\frac{\hat{q}q}{2}
+Dzlog𝑑wexp[R^+r^2w2+(r^zλβMN sign(w))w].\displaystyle+\int Dz\log\int dw\exp\left[-\frac{\hat{R}+\hat{r}}{2}w^{2}+\left(\sqrt{\hat{r}}z-\frac{\lambda\beta M}{\sqrt{N}}\textrm{ sign}\left(w\right)\right)w\right].

For β\beta\to\infty, according to the characteristic scaling of the Boltzmann distribution, the following scaling relations are assumed to hold:

{Q^+q^βEq^β2FR^+r^βKr^β2Hβ(Qq)χβ(Rr)η\begin{cases}\hat{Q}+\hat{q}&\equiv\beta E\\ \hat{q}&\equiv\beta^{2}F\\ \hat{R}+\hat{r}&\equiv\beta K\\ \hat{r}&\equiv\beta^{2}H\\ \beta\left(Q-q\right)&\equiv\chi\\ \beta\left(R-r\right)&\equiv\eta\end{cases} (58)

Finally, the entropy term is computed as

S\displaystyle S =(ER+Fη)G(Eη)+12EQ12Fχ+12KR12Hη\displaystyle=\left(-ER+F\eta\right)G^{{}^{\prime}}\left(-E\eta\right)+\frac{1}{2}EQ-\frac{1}{2}F\chi+\frac{1}{2}KR-\frac{1}{2}H\eta- (59)
min𝑤{K2w2(HzλMN sign(w))w}Dz.\displaystyle\int\underset{w}{\min}\left\{\frac{K}{2}w^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\textrm{ sign}\left(w\right)\right)w\right\}Dz. (60)

A.3 Free energy density result

Combining the results (48) and (60), the free energy density for a general loss function ()\ell\left(\cdot\right) in the limit β\beta\rightarrow\infty is obtained as

f(β)\displaystyle f\left(\beta\rightarrow\infty\right) =ExtrΘ{α𝔼s,z(min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)])αλjΨ|J¯j|+(ER+Fη)G(Eη)+12EQ12Fχ+12KR12Hη𝔼z(min𝑤{K2w2(HzλMN sign(w))w})},\displaystyle=-\underset{\mathtt{\varTheta}}{\textrm{Extr}}\left\{\begin{array}[]{c}-\alpha\mathbb{E}_{s,z}\left(\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right]\right)-\alpha\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|\\ +\left(-ER+F\eta\right)G^{{}^{\prime}}\left(-E\eta\right)+\frac{1}{2}EQ-\frac{1}{2}F\chi\\ +\frac{1}{2}KR-\frac{1}{2}H\eta-\mathbb{E}_{z}\left(\underset{w}{\min}\left\{\frac{K}{2}w^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\textrm{ sign}\left(w\right)\right)w\right\}\right)\end{array}\right\}, (64)

where the values of the parameters Θ={χ,Q,E,R,F,η,K,H,{J¯j} jΨ}\varTheta=\left\{\chi,Q,E,R,F,\eta,K,H,\text{$\left\{\bar{J}_{j}\right\}$ }_{j\in\Psi}\right\} can be calculated from the extremization condition, i.e., by solving the equations of state (EOS). For a general loss function (y)\ell\left(y\right), the EOS associated with (64) is as follows

{y^(s,z,χ,Q,J)=argmax𝑦{(ys0(Qz+jΨJ¯jsj))22χ(y)}E=αQ𝔼s,z(s0zd(y)dyy=y^(s,z,χ,Q,J))F=α𝔼s,z((d(y)dyy=y^(s,z,χ,Q,J))2)R=1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN]Eη=ρ(γ)Λ~γ𝑑γQ=FE2+RΛ~(ER+Fη)η1ρ(λ)(Λ~λ)2𝑑λK=EΛ~+1ηχ=1E+ηΛ~H=Rη2+FΛ~+(ER+Fη)E1ρ(λ)(Λ~λ)2𝑑λη=1Kerfc(λM2HN)J¯j,jΨ=argminJj,jΨ{𝔼s,z([(y^(s,z,χ,Q,J)s0(Qz+jΨJjsj))22χ+(y^(s,z,χ,Q,J))])+λjΨ|Jj|}\begin{cases}\hat{y}\left(s,z,\chi,Q,J\right)=\underset{y}{\arg\max}\left\{-\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}-\ell\left(y\right)\right\}\\ E=\frac{\alpha}{\sqrt{Q}}\mathbb{E}_{s,z}\left(s_{0}z\frac{d\ell\left(y\right)}{dy}\mid_{y=\hat{y}\left(s,z,\chi,Q,J\right)}\right)\\ F=\alpha\mathbb{E}_{s,z}\left(\left(\frac{d\ell\left(y\right)}{dy}\mid_{y=\hat{y}\left(s,z,\chi,Q,J\right)}\right)^{2}\right)\\ R=\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right]\\ E\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma\\ Q=\frac{F}{E^{2}}+R\tilde{\varLambda}-\left(-ER+F\eta\right)\eta\frac{1}{\int\frac{\rho\left(\lambda\right)}{\left(\tilde{\varLambda}-\lambda\right)^{2}}d\lambda}\\ K=E\tilde{\varLambda}+\frac{1}{\eta}\\ \chi=\frac{1}{E}+\eta\tilde{\varLambda}\\ H=\frac{R}{\eta^{2}}+F\tilde{\varLambda}+\left(-ER+F\eta\right)E\frac{1}{\int\frac{\rho\left(\lambda\right)}{\left(\tilde{\varLambda}-\lambda\right)^{2}}d\lambda}\\ \eta=\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\\ \bar{J}_{j,j\in\Psi}=\underset{{J}_{j,j\in\Psi}}{\arg\min}\;\left\{\mathbb{E}_{s,z}\left(\left[\frac{\left(\hat{y}\left(s,z,\chi,Q,J\right)-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{${J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(\hat{y}\left(s,z,\chi,Q,J\right)\right)\right]\right)+\lambda\sum_{j\in\Psi}\left|{J}_{j}\right|\right\}\end{cases} (65)

where Λ~\tilde{\varLambda} satisfying Eη=ρ(γ)Λ~γ𝑑γE\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma is determined by the extremization condition in (55) combined with the free energy result (64). In general, the EOS (65) has no analytic solution but can be solved numerically.

A.3.1 Quadratic loss (y)=(y1)2/2\ell\left(y\right)=\left(y-1\right)^{2}/2

In the case of the quadratic loss (y)=(y1)2/2\ell\left(y\right)=\left(y-1\right)^{2}/2 for the 1\ell_{1}-LinR estimator, there is an analytic solution to yy in min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)]\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right] and thus the results can be further simplified. Specifically, the free energy can be written as follows

f(β)\displaystyle f\left(\beta\rightarrow\infty\right) =ExtrΘ{α2(1+χ)𝔼s,z[(s0jΨsjJ¯jQz)2]αλjΨ|J¯j|+(ER+Fη)G(Eη)+12EQ12Fχ+12KR12Hη𝔼z[min𝑤{K2w2(HzλMN sign(w))w}]},\displaystyle=-\underset{\mathtt{\varTheta}}{\textrm{Extr}}\left\{\begin{array}[]{c}-\frac{\alpha}{2\left(1+\chi\right)}\mathbb{E}_{s,z}\left[\left(s_{0}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}-\sqrt{Q}z\right)^{2}\right]-\alpha\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|\\ +\left(-ER+F\eta\right)G^{{}^{\prime}}\left(-E\eta\right)+\frac{1}{2}EQ-\frac{1}{2}F\chi\\ +\frac{1}{2}KR-\frac{1}{2}H\eta-\mathbb{E}_{z}\left[\underset{w}{\min}\left\{\frac{K}{2}w^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\textrm{ sign}\left(w\right)\right)w\right\}\right]\end{array}\right\}, (69)

and the corresponding EOS can be written as

{E=α(1+χ),(a)F=α(1+χ)2[𝔼s(sijΨsjJ¯j)2+Q],(b)R=1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN],(c)Eη=ρ(γ)Λ~γ𝑑γ,(d)Q=FE2+RΛ~(ER+Fη)ηρ(γ)(Λ~γ)2𝑑γ,(e)K=EΛ~+1η,(f)χ=1E+ηΛ~,(g)H=Rη2+FΛ~+(ER+Fη)Eρ(γ)(Λ~γ)2𝑑γ,(h)η=1Kerfc(λM2HN),(i)J¯j=𝚜𝚘𝚏𝚝(tanh(K0),λ(1+χ))1+(d1)tanh2(K0),jΨ,(j)\begin{cases}E=\frac{\alpha}{\left(1+\chi\right)},&\left(a\right)\\ F=\frac{\alpha}{\left(1+\chi\right)^{2}}\left[\mathbb{E}_{s}\left(s_{i}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}\right)^{2}+Q\right],&\left(b\right)\\ R=\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right],&\left(c\right)\\ E\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma,&\left(d\right)\\ Q=\frac{F}{E^{2}}+R\tilde{\varLambda}-\left(-ER+F\eta\right)\frac{\eta}{\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma},&\left(e\right)\\ K=E\tilde{\varLambda}+\frac{1}{\eta},&\left(f\right)\\ \chi=\frac{1}{E}+\eta\tilde{\varLambda},&\left(g\right)\\ H=\frac{R}{\eta^{2}}+F\tilde{\varLambda}+\left(-ER+F\eta\right)\frac{E}{\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma},&\left(h\right)\\ \eta=\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right),&\left(i\right)\\ \bar{J}_{j}=\frac{\mathtt{soft}\left(\tanh\left(K_{0}\right),\lambda\left(1+\chi\right)\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)},j\in\Psi,&\left(j\right)\end{cases} (70)

Note that the mean estimates {J¯j,jΨ}\left\{\bar{J}_{j},j\in\Psi\right\} in (70) are obtained by solving the following reduced optimization problem

argmin{J¯j}{12(1+χ)𝔼s,z[(s0jΨsjJ¯jQz)2]+λjΨ|J¯j|},\underset{\left\{\bar{J}_{j}\right\}}{\arg\min}\left\{\frac{1}{2\left(1+\chi\right)}\mathbb{E}_{s,z}\left[\left(s_{0}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}-\sqrt{Q}z\right)^{2}\right]+\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|\right\}, (71)

where the corresponding fixed-point equation associated with any J¯k,kΨ\bar{J}_{k},k\in\Psi can be written as follows

11+χ𝔼s[sk(s0jΨsjJ¯j)]λ𝚜𝚒𝚐𝚗(J¯k)=0,kΨ,\frac{1}{1+\chi}\mathbb{E}_{s}\left[s_{k}\left(s_{0}-\sum_{j\in\Psi}s_{j}\bar{J}_{j}\right)\right]-\lambda\mathtt{sign}\left(\bar{J}_{k}\right)=0,\forall k\in\Psi, (72)

where 𝚜𝚒𝚐𝚗()\mathtt{sign}(\cdot) denotes an element-wise application of the standard sign function. For a RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with degree dd and coupling strength K0K_{0}, without loss of generality, assuming that all the active couplings are positive, we have 𝔼s(s0sk)=tanh(K0),kΨ\mathbb{E}_{s}\left(s_{0}s_{k}\right)=\tanh\left({K_{0}}\right),\forall k\in\Psi, and 𝔼s(sksj)=tanh2(K0),k,jΨ,kj\mathbb{E}_{s}\left(s_{k}s_{j}\right)=\tanh^{2}\left({K_{0}}\right),\;\forall k,j\in\Psi,k\neq j. Given these results and thanks to the symmetry, we obtain

J¯j=𝚜𝚘𝚏𝚝(tanh(K0),λ(1+χ))1+(d1)tanh2(K0),jΨ,\bar{J}_{j}=\frac{\mathtt{soft}\left(\tanh\left(K_{0}\right),\lambda\left(1+\chi\right)\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)},j\in\Psi, (73)

where 𝚜𝚘𝚏𝚝(z,τ)=𝚜𝚒𝚐𝚗(z)(|z|τ)+\mathtt{soft}\left(z,\tau\right)=\mathtt{sign}\left(z\right)\left(\left|z\right|-\tau\right)_{+} is the soft-thresholding function, i.e.,

𝚜𝚘𝚏𝚝(z,τ)𝚜𝚒𝚐𝚗(z)(|z|τ)+{zτ,z>τ0,|z|τz+τ,z<τ\displaystyle\mathtt{soft}\left(z,\tau\right)\equiv\mathtt{sign}\left(z\right)\left(\left|z\right|-\tau\right)_{+}\equiv\begin{cases}z-\tau,&z>\tau\\ 0,&\left|z\right|\leq\tau\\ z+\tau,&z<-\tau\end{cases} (74)
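For concreteness, a minimal Python sketch of the soft-thresholding function (74) and the resulting active-set mean estimate (73) is given below; the function names and the parameter values in the example are purely illustrative.

```python
import numpy as np

def soft(z, tau):
    """Soft-thresholding function (74): sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mean_active_estimate(K0, d, lam, chi):
    """Mean estimate on the active set for the l1-LinR estimator, Eq. (73)."""
    return soft(np.tanh(K0), lam * (1.0 + chi)) / (1.0 + (d - 1) * np.tanh(K0) ** 2)

# Example with illustrative values; chi -> 0 in the M = O(log N) regime (cf. Appendix D.1.1)
print(mean_active_estimate(K0=0.4, d=3, lam=0.1, chi=0.0))
```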

On the other hand, in the inactive set Ψ¯\bar{\Psi}, each component of the scaled noise estimates can be statistically described as the solution to the scalar estimator min𝑤{K2w2(HzλMN sign(w))w}\underset{w}{\min}\left\{\frac{K}{2}w^{2}-\left(\sqrt{H}z-\frac{\lambda M}{\sqrt{N}}\textrm{ sign}\left(w\right)\right)w\right\} in (64). Consequently, recalling the definition of ww in (11), the estimates {J^j,jΨ¯}\left\{\hat{J}_{j},j\in\bar{\Psi}\right\} in the inactive set Ψ¯\bar{\Psi} are

J^j\displaystyle\hat{J}_{j} =HKN𝚜𝚘𝚏𝚝(zj,λMHN)\displaystyle=\frac{\sqrt{H}}{K\sqrt{N}}\mathtt{soft}\left(z_{j},\frac{\lambda M}{\sqrt{HN}}\right)
=argminJj[12(Jj1KHNzj)2+λMKN|Jj|],jΨ¯,\displaystyle=\underset{J_{j}}{\arg\min}\left[\frac{1}{2}\left(J_{j}-\frac{1}{K}\sqrt{\frac{H}{N}}z_{j}\right)^{2}+\frac{\lambda M}{KN}\left|J_{j}\right|\right],j\in\bar{\Psi}, (75)

where zj𝒩(0,1),jΨ¯z_{j}\sim\mathcal{N}\left(0,1\right),j\in\bar{\Psi} are i.i.d. standard Gaussian random variables.

Consequently, it can be seen from (73) and (75) that, statistically, the 1\ell_{1}-LinR estimator is decoupled into two scalar thresholding estimators for the active set Ψ\Psi and the inactive set Ψ¯\bar{\Psi}, respectively.
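This decoupling can be mimicked numerically: the active-set estimate follows the closed form (73), while the inactive-set estimates are drawn from the scalar estimator (75). The short Python sketch below is illustrative only; the order parameters HH and KK would in practice be obtained by solving the EOS (70).

```python
import numpy as np

rng = np.random.default_rng(0)

def soft(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def sample_inactive_estimates(H, K, lam, M, N, size=100000):
    """Draw inactive-set estimates following the scalar estimator (75)."""
    z = rng.standard_normal(size)
    return np.sqrt(H) / (K * np.sqrt(N)) * soft(z, lam * M / np.sqrt(H * N))

# Illustrative order-parameter values; the fraction of nonzero estimates
# approximates the false positive rate erfc(lam * M / sqrt(2 * H * N)), cf. (83).
J_inactive = sample_inactive_estimates(H=1.0, K=1.0, lam=0.1, M=300, N=200)
print(np.mean(J_inactive != 0.0))
```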

A.3.2 Logistic loss (y)=log(1+e2y)\ell\left(y\right)=\log\left(1+e^{-2y}\right)

In the case of the logistic loss (y)=log(1+e2y)\ell\left(y\right)=\log\left(1+e^{-2y}\right) for the 1\ell_{1}-LogR estimator, however, there is no analytic solution to yy in min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)]\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right] and we have to solve it iteratively together with the other parameters Θ\varTheta. After some algebra, we obtain the EOS for the 1\ell_{1}-LogR estimator:

{y^(s,z,χ,Q,J)s0(Qz+jΨJ¯jsj)χ=1tanh(y^(s,z,χ,Q,J)),E=α𝔼s,z(s0zQtanh(y^(S,z,χ,Q,J))),F=α𝔼s,z((1tanh(y^(S,z,χ,Q,J)))2),R=1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN],Eη=ρ(γ)Λ~γ𝑑γ,Q=FE2+RΛ~(ER+Fη)η1ρ(λ)(Λ~λ)2𝑑λ,K=EΛ~+1η,χ=1E+ηΛ~,H=Rη2+FΛ~+(ER+Fη)E1ρ(λ)(Λ~λ)2𝑑λ,η=1Kerfc(λM2HN),J¯j=𝚜𝚘𝚏𝚝(𝔼s,z(y^(s,z,χ,Q,J)s0jΨsj),λdχ)d(1+(d1)tanh2(K0)),jΨ.\begin{cases}\frac{\hat{y}\left(s,z,\chi,Q,J\right)-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)}{\chi}=1-\tanh\left(\hat{y}\left(s,z,\chi,Q,J\right)\right),\\ E=\alpha\mathbb{E}_{s,z}\left(\frac{s_{0}z}{\sqrt{Q}}\tanh\left(\hat{y}\left(S,z,\chi,Q,J\right)\right)\right),\\ F=\alpha\mathbb{E}_{s,z}\left(\left(1-\tanh\left(\hat{y}\left(S,z,\chi,Q,J\right)\right)\right)^{2}\right),\\ R=\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right],\\ E\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma,\\ Q=\frac{F}{E^{2}}+R\tilde{\varLambda}-\left(-ER+F\eta\right)\eta\frac{1}{\int\frac{\rho\left(\lambda\right)}{\left(\tilde{\varLambda}-\lambda\right)^{2}}d\lambda},\\ K=E\tilde{\varLambda}+\frac{1}{\eta},\\ \chi=\frac{1}{E}+\eta\tilde{\varLambda},\\ H=\frac{R}{\eta^{2}}+F\tilde{\varLambda}+\left(-ER+F\eta\right)E\frac{1}{\int\frac{\rho\left(\lambda\right)}{\left(\tilde{\varLambda}-\lambda\right)^{2}}d\lambda},\\ \eta=\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right),\\ \bar{J}_{j}=\frac{\mathtt{soft}\left(\mathbb{E}_{s,z}\left(\hat{y}\left(s,z,\chi,Q,J\right)s_{0}\sum_{j\in\Psi}s_{j}\right),\lambda d\chi\right)}{d\left(1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)\right)},j\in\Psi.\end{cases} (76)

In the active set Ψ\Psi, the mean estimates {J¯j,jΨ}\left\{\bar{J}_{j},j\in\Psi\right\} can be obtained by solving a reduced 1\ell_{1}-regularized optimization problem

min{J¯j}jΨ{𝔼s,z(min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+log(1+e2y)])+λjΨ|J¯j|}.\underset{\left\{\bar{J}_{j}\right\}_{j\in\Psi}}{\min}\left\{\mathbb{E}_{s,z}\left(\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\text{$\bar{J}_{j}$}s_{j}\right)\right)^{2}}{2\chi}+\log\left(1+e^{-2y}\right)\right]\right)+\lambda\sum_{j\in\Psi}\left|\bar{J}_{j}\right|\right\}. (77)

In contrast to the 1\ell_{1}-LinR estimator, the mean estimates {J¯j,jΨ}\left\{\bar{J}_{j},j\in\Psi\right\} in (77) for the 1\ell_{1}-LogR estimator do not have analytic solutions and also have to be solved numerically. For a RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with degree dd and coupling strength K0K_{0}, after some algebra, the corresponding fixed-point equations for {J¯j=J,jΨ}\left\{\bar{J}_{j}=J,j\in\Psi\right\} are obtained as follows

J\displaystyle J =𝚜𝚘𝚏𝚝(𝔼s,z(y^(s,z,χ,Q,J)s0jΨsj),λdχ)d(1+(d1)tanh2(K0)),\displaystyle=\frac{\mathtt{soft}\left(\mathbb{E}_{s,z}\left(\hat{y}\left(s,z,\chi,Q,J\right)s_{0}\sum_{j\in\Psi}s_{j}\right),\lambda d\chi\right)}{d\left(1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)\right)}, (78)

which can be solved iteratively.
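A minimal numerical sketch of this iteration is given below (Python). Here χ\chi and QQ are treated as fixed, illustrative inputs, whereas in the full EOS (76) they are updated jointly with JJ; the inner equation (82) for y^\hat{y} is solved by a one-dimensional root finder.

```python
import itertools
import numpy as np
from scipy.optimize import brentq

def logistic_active_fixed_point(K0, d, lam, chi, Q, n_z=500, n_iter=20, seed=0):
    """Iterate the fixed-point equation (78) for the l1-LogR mean estimate J.

    chi and Q are illustrative, fixed order-parameter values; the inner
    equation (82) for y_hat is solved by Brent's method for each (s, z)."""
    rng = np.random.default_rng(seed)
    z_samples = rng.standard_normal(n_z)
    # simplified active-set marginal P(s0, s_Psi) ∝ exp(s0 * K0 * sum_j s_j)
    configs = np.array(list(itertools.product([-1, 1], repeat=d + 1)))
    w = np.exp(configs[:, 0] * K0 * configs[:, 1:].sum(axis=1))
    probs = w / w.sum()

    J = 0.1  # initial guess
    for _ in range(n_iter):
        acc = 0.0  # Monte Carlo estimate of E_{s,z}[y_hat * s0 * sum_j s_j]
        for cfg, p in zip(configs, probs):
            s0, s_sum = cfg[0], cfg[1:].sum()
            for z in z_samples:
                b = s0 * (np.sqrt(Q) * z + J * s_sum)
                # (y - b)/chi = 1 - tanh(y), bracketed on [b, b + 2*chi]
                y_hat = brentq(lambda y: (y - b) / chi - 1.0 + np.tanh(y), b, b + 2.0 * chi)
                acc += p * y_hat * s0 * s_sum / n_z
        num = np.sign(acc) * max(abs(acc) - lam * d * chi, 0.0)  # soft threshold, Eq. (78)
        J = num / (d * (1.0 + (d - 1) * np.tanh(K0) ** 2))
    return J

print(logistic_active_fixed_point(K0=0.4, d=3, lam=0.05, chi=0.5, Q=0.1))
```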

The estimates in the inactive set Ψ¯\bar{\Psi} are the same as those of 1\ell_{1}-LinR in (75), and can be described by a scalar thresholding estimator once the EOS is solved.

Appendix B Check the consistency of ansatz (A1)

To check the consistency of Ansatz (A1), we first categorize the estimates based on the distance, or generation, of the corresponding spins from the focused spin s0s_{0}. Considering the original Ising model whose coupling network is a tree-like graph, we can naturally define generations of the spins according to their distance from the focused spin s0s_{0}. We categorize the spins directly connected to s0s_{0} as the first generation and denote the corresponding index set as Ω1={i|Ji0,i{1,,N1}}\Omega_{1}=\{i|J_{i}^{*}\neq 0,i\in\left\{1,\ldots,N-1\right\}\}. Each spin in Ω1\Omega_{1} is connected to some other spins besides s0s_{0}, and those spins constitute the second generation, whose index set is denoted as Ω2\Omega_{2}. This recursive construction of generations can be unambiguously continued on the tree-like graph, and we denote the index set of the gg-th generation from spin s0s_{0} as Ωg\Omega_{g}. The overall construction of generations is graphically represented in Fig. 4. In general, suppose that the support of the mean estimates of the 1\ell_{1}-LinR estimator is Ψ={Ω1,,Ωg}\Psi=\left\{\Omega_{1},\ldots,\Omega_{g}\right\}. Then, Ansatz (A1) means that the correct active set of the mean estimates is Ψ={Ω1}\Psi=\left\{\Omega_{1}\right\}.

Refer to caption
Figure 4: Schematic of generations of spins. In general, the gg-th generation of spin s0s_{0} is denoted as Ωg\Omega_{g}, whose distance from spin s0s_{0} is gg.

To verify this, we examine the values of the mean estimates based on (69). Due to the symmetry, it is expected that for each generation a=1,,ga=1,...,g, the mean estimates take a common value J¯jΩa=Ja\bar{J}_{j\in\Omega_{a}}=J_{a} within the same set Ωa\Omega_{a}. In addition, if the solutions satisfy Ansatz (A1) in (11), i.e., J1=J,Ja=0,a2J_{1}=J,J_{a}=0,a\geq 2, from (69) we obtain

{11+χ[tanh(K0)(1+(d1)tanh2(K0))J]λ=0,jΩ1;|11+χ[tanha(K0)tanha1(K0)(1+(d1)tanh2(K0))J]|λ,jΩa,a2,\begin{cases}\frac{1}{1+\chi}\left[\tanh\left(K_{0}\right)-\left(1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)\right)J\right]-\lambda=0,&j\in\Omega_{1};\\ \left|\frac{1}{1+\chi}\left[\tanh^{a}\left(K_{0}\right)-\tanh^{a-1}\left(K_{0}\right)\left(1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)\right)J\right]\right|\leq\lambda,&j\in\Omega_{a},a\geq 2,\end{cases} (79)

where the result 𝔼s(sisj)=tanhd0(K0)\mathbb{E}_{s}\left(s_{i}s_{j}\right)=\tanh^{d_{0}}\left({K_{0}}\right) is used for any two spins si,sjs_{i},s_{j} whose distance is d0d_{0} in the RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}}. Note that the solution of the first equation in (79) automatically satisfies the second equation (sub-gradient condition) since |tanh(K0)|1\left|\tanh\left(K_{0}\right)\right|\leq 1, which indicates that J1=J,Ja=0,a2J_{1}=J,J_{a}=0,a\geq 2 is a valid solution. Moreover, the convexity of the quadratic loss function indicates that this is the unique and correct solution, which confirms Ansatz (A1).

Appendix C Check the consistency of ansatz (A2)

We here check the consistency of a part of Ansatz (A2) in Sec. 3.2, namely that the orthogonal matrix OO diagonalizing the covariance matrix CC is distributed according to the Haar orthogonal measure. To achieve this, we compare certain properties of the orthogonal matrix obtained from the diagonalization of the covariance matrix CC with those of an orthogonal matrix actually generated from the Haar orthogonal measure. Specifically, we compute the cumulants of the trace of the kk-th power of the orthogonal matrix. All cumulants of degree r3r\geq 3 are known to vanish in the large NN limit [41, 42], and the only nontrivial cumulants are the second-order ones with the same power kk. We have computed these cumulants for the orthogonal matrix obtained from the covariance matrix CC and found that they exhibit the same behavior as the ones generated from the true Haar measure, as shown in Fig. 5.

Refer to caption
Figure 5: The RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with N=1000,d=3,K0=0.4N=1000,d=3,K_{0}=0.4 is generated and we compute the associated covariance matrix CC and then diagonalize it as C=OΛOTC=O\Lambda O^{T}, obtaining the orthogonal matrix OO. Then the 𝚃𝚛(Ok),𝚃𝚛(Ok)\mathtt{Tr}\left(O^{k}\right),\mathtt{Tr}\left(O^{-k}\right) for several kk (k=18k=1\sim 8) are computed, where 𝚃𝚛()\mathtt{Tr}\left(\cdot\right) is the trace operation. This procedure is repeated 200 times with different random numbers, from which we obtain the ensemble of 𝚃𝚛(Ok)\mathtt{Tr}\left(O^{k}\right) and 𝚃𝚛(Ok)\mathtt{Tr}\left(O^{-k}\right). Consequently, the cumulants of 1st, 2nd, and 3rd orders are computed. All of them exhibit the expected theoretical behavior.
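A minimal sketch of this cumulant comparison is given below (Python). The reference ensemble of truly Haar-distributed orthogonal matrices is generated with scipy.stats.ortho_group; producing the covariance matrices CC themselves (e.g., from Monte Carlo samples of the Ising model on independently generated RR graphs) is omitted and only indicated in a comment, and the ensemble sizes are illustrative.

```python
import numpy as np
from scipy.stats import ortho_group

def trace_power_cumulants(mats, k_max=8):
    """First three cumulants of Tr(O^k), k = 1..k_max, over an ensemble of matrices."""
    traces = np.array([[np.trace(np.linalg.matrix_power(O, k)) for k in range(1, k_max + 1)]
                       for O in mats])
    mean = traces.mean(axis=0)
    var = traces.var(axis=0)
    third = ((traces - mean) ** 3).mean(axis=0)  # third cumulant = third central moment
    return mean, var, third

# Reference ensemble: Haar-distributed orthogonal matrices (illustrative sizes)
N, n_rep = 200, 100
haar = [ortho_group.rvs(N) for _ in range(n_rep)]
print(trace_power_cumulants(haar))

# For the check in Fig. 5, `haar` would be replaced by the eigenvector matrices O
# obtained from diagonalizing the covariance matrices C of independently generated
# RR graphs, e.g., O from np.linalg.eigh(C).
```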

Appendix D Details of the high-dimensional asymptotic result

Here the asymptotic performances of PrecisionPrecision and RecallRecall are considered for both the 1\ell_{1}-LinR estimator and the 1\ell_{1}-LogR estimator. Recall that perfect Ising model selection is achieved if and only if Precision=1Precision=1 and Recall=1Recall=1.

D.1 Recall rate

According to the definition in (5), the recall rate is only related to the statistical properties of estimates in the active set Ψ\Psi and thus the mean estimates {J¯j}jΨ\left\{\bar{J}_{j}\right\}_{j\in\Psi} in the limit MM\to\infty are considered.

D.1.1 Quadratic loss

In this case, in the limit MM\rightarrow\infty, the mean estimates {J¯j=J}jΨ\left\{\bar{J}_{j}=J\right\}_{j\in\Psi} in the active set Ψ\Psi are shown in (73) and rewritten as follows for ease of reference

J=𝚜𝚘𝚏𝚝(tanh(K0),λ(1+χ))1+(d1)tanh2(K0).J=\frac{\mathtt{soft}\left(\tanh\left(K_{0}\right),\lambda\left(1+\chi\right)\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)}. (80)

As a result, as long as λ(1+χ)<tanh(K0)\lambda\left(1+\chi\right)<\tanh\left(K_{0}\right), J>0J>0 and thus we can successfully recover the active set so that Recall=1Recall=1. In addition, when M=𝒪(logN)M=\mathcal{O}\left(\log N\right), χ0\chi\rightarrow 0 as NN\rightarrow\infty, as demonstrated later by the relation in (90). Consequently, the regularization parameter needs to satisfy 0<λ<tanh(K0)0<\lambda<\tanh\left(K_{0}\right).

D.1.2 Logistic loss

In this case, in the limit MM\rightarrow\infty, the mean estimates {J¯j=J}jΨ\left\{\bar{J}_{j}=J\right\}_{j\in\Psi} in the active set Ψ\Psi are shown in (78) and rewritten as follows for ease of reference

J\displaystyle J =𝚜𝚘𝚏𝚝(𝔼s,z(y^(s,z,χ,Q,J)s0jΨsj),λdχ)d(1+(d1)tanh2(K0)).\displaystyle=\frac{\mathtt{soft}\left(\mathbb{E}_{s,z}\left(\hat{y}\left(s,z,\chi,Q,J\right)s_{0}\sum_{j\in\Psi}s_{j}\right),\lambda d\chi\right)}{d\left(1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)\right)}. (81)

There is no analytic solution for y^(s,z,χ,Q,J)\hat{y}\left(s,z,\chi,Q,J\right) and the following fixed-point equation has to be solved numerically

y^(s,z,χ,Q,J)s0(Qz+JjΨsj)χ=1tanh(y^(s,z,χ,Q,J)).\displaystyle\frac{\hat{y}\left(s,z,\chi,Q,J\right)-s_{0}\left(\sqrt{Q}z+J\sum_{j\in\Psi}s_{j}\right)}{\chi}=1-\tanh\left(\hat{y}\left(s,z,\chi,Q,J\right)\right). (82)

Then one can determine the valid choice of λ\lambda to enable J>0J>0. Numerical results show that the choice of λ\lambda is similar to that of the quadratic loss.

D.2 Precision rate

According to the definition in (5), to compute the PrecisionPrecision, the numbers of true positives TPTP and false positives FPFP are needed. On the one hand, as discussed in Appendix D.1, in the limit MM\to\infty, the recall rate approaches one and thus we have TP=dTP=d for a RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}}. On the other hand, the number of false positives FPFP can be computed as FP=FPRNFP=FPR\cdot N, where FPRFPR is the false positive rate (FPR).

As shown in Appendix A.3, the estimator in the inactive set Ψ¯\bar{\Psi} can be statistically described by a scalar estimator (75) and thus the FPRFPR can be computed as

FPR=erfc(λM2HN),FPR=\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right), (83)

which depends on λ,M,N,H\lambda,M,N,H. However, for both the quadratic loss and the logistic loss, there is no analytic result for HH in (65). Nevertheless, we can obtain asymptotic results using a perturbative analysis.
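For reference, once HH is known from the EOS, the FPRFPR in (83) can be evaluated directly; a short Python sketch with illustrative arguments is

```python
import numpy as np
from scipy.special import erfc

def false_positive_rate(lam, M, N, H):
    """FPR of the decoupled inactive-set estimator, Eq. (83)."""
    return erfc(lam * M / np.sqrt(2.0 * H * N))

print(false_positive_rate(lam=0.1, M=300, N=200, H=1.5))  # H is an illustrative value
```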

Specifically, we focus on the asymptotic behavior of the macroscopic parameters, e.g., χ,Q,K,E,H,F\chi,Q,K,E,H,F, in the regime FPR0FPR\rightarrow 0, which is necessary for successful Ising model selection. From η=1Kerfc(λM2HN)\eta=\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right) in EOS (65) and the FPRFPR in (83), we have FPR=KηFPR=K\eta. Moreover, by combining Eη=ρ(γ)Λ~γ𝑑γE\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma and K=EΛ~+1ηK=E\tilde{\varLambda}+\frac{1}{\eta}, the following relation can be obtained

erfc(λM2HN)=1ρ(γ)1γΛ~𝑑γ.\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)=1-\int\frac{\rho\left(\gamma\right)}{1-\frac{\gamma}{\tilde{\varLambda}}}d\gamma. (84)

Thus, as FPR=erfc(λM2HN)0FPR=\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\rightarrow 0, we have ρ(γ)1γΛ~𝑑γ1\int\frac{\rho\left(\gamma\right)}{1-\frac{\gamma}{\tilde{\varLambda}}}d\gamma\rightarrow 1, implying that the magnitude of Λ~\tilde{\varLambda} tends to infinity. Consequently, using a truncated series expansion, we obtain

Eη\displaystyle E\eta =ρ(γ)Λ~γ𝑑γ\displaystyle=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma
=1Λ~k=0γkΛ~k\displaystyle=-\frac{1}{\tilde{\varLambda}}\sum_{k=0}^{\infty}\frac{\left\langle\gamma^{k}\right\rangle}{\tilde{\varLambda}^{k}}
1Λ~γΛ~2,\displaystyle\simeq-\frac{1}{\tilde{\varLambda}}-\frac{\left\langle\gamma\right\rangle}{\tilde{\varLambda}^{2}}, (85)

where γk=ρ(γ)γk𝑑γ\left\langle\gamma^{k}\right\rangle=\int\rho\left(\gamma\right)\gamma^{k}d\gamma. Then, solving the quadratic equation (85), we obtain the following solution for Λ~\tilde{\varLambda} (the other root is discarded since it is smaller in magnitude):

Λ~\displaystyle\tilde{\varLambda} =114Eηγ2Eηγ1Eη.\displaystyle=\frac{-1-\sqrt{1-4E\eta\left\langle\gamma\right\rangle}}{2E\eta}\simeq\left\langle\gamma\right\rangle-\frac{1}{E\eta}. (86)

To compute ρ(γ)(Λ~γ)2𝑑γ\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma, we use the following relation

f(Λ~)\displaystyle f\left(\tilde{\varLambda}\right) =ρ(γ)Λ~γ𝑑γ1Λ~γΛ~2,\displaystyle=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma\simeq-\frac{1}{\tilde{\varLambda}}-\frac{\left\langle\gamma\right\rangle}{\tilde{\varLambda}^{2}}, (87)
df(Λ~)dΛ~\displaystyle\frac{df\left(\tilde{\varLambda}\right)}{d\tilde{\varLambda}} =ρ(γ)(Λ~γ)2𝑑γ1Λ~2+2γΛ~3.\displaystyle=\int\frac{\rho\left(\gamma\right)}{\left(\tilde{\varLambda}-\gamma\right)^{2}}d\gamma\simeq\frac{1}{\tilde{\varLambda}^{2}}+2\frac{\left\langle\gamma\right\rangle}{\tilde{\varLambda}^{3}}. (88)

Substituting the results (86) - (88) into (65), after some algebra, we obtain

K\displaystyle K Eγ,\displaystyle\simeq E\left\langle\gamma\right\rangle, (89)
χ\displaystyle\chi ηγ,\displaystyle\simeq\eta\left\langle\gamma\right\rangle, (90)
Q\displaystyle Q γ3E2η2Rγ3EFη3+3γ2Fη2Rγ3Eηγ1,\displaystyle\simeq\frac{\left\langle\gamma\right\rangle^{3}E^{2}\eta^{2}R-\left\langle\gamma\right\rangle^{3}EF\eta^{3}+3\left\langle\gamma\right\rangle^{2}F\eta^{2}-R\left\langle\gamma\right\rangle}{3E\eta\left\langle\gamma\right\rangle-1}, (91)
H\displaystyle H γ3E2η2Fγ3RηE3+3γ2RE2Fγ3Eηγ1.\displaystyle\simeq\frac{\left\langle\gamma\right\rangle^{3}E^{2}\eta^{2}F-\left\langle\gamma\right\rangle^{3}R\eta E^{3}+3\left\langle\gamma\right\rangle^{2}RE^{2}-F\left\langle\gamma\right\rangle}{3E\eta\left\langle\gamma\right\rangle-1}. (92)

In addition, as FPR=erfc(λM2HN)0FPR=\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\rightarrow 0, from (65) we obtain

R\displaystyle R =1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN]\displaystyle=\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right]
HK2erfc(λM2HN)HKηHEγη,\displaystyle\simeq\frac{H}{K^{2}}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)\simeq\frac{H}{K}\eta\simeq\frac{H}{E\left\langle\gamma\right\rangle}\eta, (93)

where the first approximation uses the asymptotic relation erfc(x)1xπex2\textrm{erfc}\left(x\right)\simeq\frac{1}{x\sqrt{\pi}}e^{-x^{2}} as xx\to\infty, and the last one follows from the asymptotic relation in (89).

(3Eηγ1)Hγ3E2η2Fγ2η2E2H+3EηγHFγ.\displaystyle\left(3E\eta\left\langle\gamma\right\rangle-1\right)H\simeq\left\langle\gamma\right\rangle^{3}E^{2}\eta^{2}F-\left\langle\gamma\right\rangle^{2}\eta^{2}E^{2}H+3E\eta\left\langle\gamma\right\rangle H-F\left\langle\gamma\right\rangle. (94)

Interestingly, the common terms 3EηγH3E\eta\left\langle\gamma\right\rangle H on both sides of (94) cancel each other. Therefore, the key result for HH is obtained as follows

HFγ.\displaystyle H\simeq F\left\langle\gamma\right\rangle. (95)

In addition, from (95) and (91), QQ can be simplified as

Q\displaystyle Q Rγ.\displaystyle\simeq R\left\langle\gamma\right\rangle. (96)

As shown in (65), F=α𝔼s,z(d(y)dyy=y^(s,z,χ,Q,J))2F=\alpha\mathbb{E}_{s,z}\left(\frac{d\ell\left(y\right)}{dy}\mid_{y=\hat{y}\left(s,z,\chi,Q,J\right)}\right)^{2}, thus the result HFγH\simeq F\left\langle\gamma\right\rangle in (95) implies a linear relation between HH and αM/N\alpha\equiv M/N. The relations between E,F,HE,F,H and α\alpha are also verified numerically in Fig. 6 when M=50(logN)M=50(\log N) for N=1021012N=10^{2}\sim 10^{12} using the 1\ell_{1}-LinR estimator.

Refer to caption
Figure 6: E,F,HE,F,H versus α\alpha when α=50(logN)/N\alpha=50(\log N)/N for N=1021012N=10^{2}\sim 10^{12} for RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with d=3,K0=0.4d=3,K_{0}=0.4. Note that in this case, there is γ=1\left\langle\gamma\right\rangle=1.

In the paramagnetic phase, the mean value of the eigenvalues γ\left\langle\gamma\right\rangle can be obtained explicitly. Specifically, we have γ=1Ni=1Nγi=1NTrC=(1/N)×N=1\left\langle\gamma\right\rangle=\frac{1}{N}\sum_{i=1}^{N}\gamma_{i}=\frac{1}{N}\textrm{Tr}C=(1/N)\times N=1. Denoting HFγαH\simeq F\left\langle\gamma\right\rangle\equiv\alpha\triangle, where =𝔼s,z(d(y)dyy=y^(s,z,χ,Q,J))2=𝒪(1)\triangle=\mathbb{E}_{s,z}\left(\frac{d\ell\left(y\right)}{dy}\mid_{y=\hat{y}\left(s,z,\chi,Q,J\right)}\right)^{2}=\mathcal{O}\left(1\right), the FPRFPR in (83) can be rewritten as follows

FPR\displaystyle FPR =erfc(λM2αN)\displaystyle=\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2\alpha\triangle N}}\right)
=erfc(λM2)\displaystyle=\textrm{erfc}\left(\lambda\sqrt{\frac{M}{2\triangle}}\right)
1πeλ2M212log(λ2M2),\displaystyle\leq\frac{1}{\sqrt{\pi}}e^{-\frac{\lambda^{2}M}{2\triangle}-\frac{1}{2}\log\left(\frac{\lambda^{2}M}{2\triangle}\right)}, (97)

where the last inequality uses the upper bound of erfc function, i.e., erfc(x)1xπex2\textrm{erfc}\left(x\right)\leq\frac{1}{x\sqrt{\pi}}e^{-x^{2}}. Consequently, the number of false positives FPFP satisfies

FP\displaystyle FP Nπeλ2M212log(λ2M2)\displaystyle\leq\frac{N}{\sqrt{\pi}}e^{-\frac{\lambda^{2}M}{2\triangle}-\frac{1}{2}\log\left(\frac{\lambda^{2}M}{2\triangle}\right)}
=1πeλ2M212log(λ2M2)+logN\displaystyle=\frac{1}{\sqrt{\pi}}e^{-\frac{\lambda^{2}M}{2\triangle}-\frac{1}{2}\log\left(\frac{\lambda^{2}M}{2\triangle}\right)+\log N}
<1πeλ2M2+logN,\displaystyle<\frac{1}{\sqrt{\pi}}e^{-\frac{\lambda^{2}M}{2\triangle}+\log N}, (98)

where the last inequality holds when λ2M2>1\frac{\lambda^{2}M}{2\triangle}>1, which is necessary for FP0FP\rightarrow 0 as NN\rightarrow\infty. Consequently, to ensure FP0FP\rightarrow 0 as NN\rightarrow\infty, from (98), the term λ2M2\frac{\lambda^{2}M}{2\triangle} should grow faster than logN\log N, i.e.,

M\displaystyle M >2logNλ2.\displaystyle>\frac{2\triangle\log N}{\lambda^{2}}. (99)

Meanwhile, the number of false positives FPFP will decay as 𝒪(eclogN)\mathcal{O}\left(e^{-c\log{N}}\right) for some constant c(>0)c\left(>0\right).
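The scaling in (98)-(99) can be explored numerically with the short sketch below, which evaluates the minimal MM from (99) and the corresponding expected number of false positives FP=FPRNFP=FPR\cdot N; the values of λ\lambda and \triangle are illustrative only.

```python
import numpy as np
from scipy.special import erfc

def min_samples(N, lam, delta):
    """Minimal M from Eq. (99): M > 2 * delta * log(N) / lambda**2."""
    return 2.0 * delta * np.log(N) / lam ** 2

def expected_false_positives(M, N, lam, delta):
    """Expected number of false positives FP = N * FPR, with FPR from (97)."""
    return N * erfc(lam * np.sqrt(M / (2.0 * delta)))

lam, delta = 0.3, 1.0  # illustrative; delta depends on the loss, see (101) and (104)
for N in [10 ** 2, 10 ** 4, 10 ** 6]:
    M = int(np.ceil(min_samples(N, lam, delta))) + 50
    print(N, M, expected_false_positives(M, N, lam, delta))
```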

D.2.1 Quadratic loss

In this case, when 0<λ<tanh(K0)0<\lambda<\tanh{\left(K_{0}\right)}, from (70), we can obtain an analytic result for \triangle as follows

\displaystyle\bigtriangleup 𝔼s0(sjΨsjJ¯j)2\displaystyle\simeq\mathbb{E}_{s_{0}}\left(s-\sum_{j\in\Psi}s_{j}\bar{J}_{j}\right)^{2} (100)
=1tanh2K0+dλ21+(d1)tanh2K0.\displaystyle=\frac{1-\tanh^{2}K_{0}+d\lambda^{2}}{1+\left(d-1\right)\tanh^{2}K_{0}}. (101)

On the other hand, from the discussion in Appendix D.1, the recall rate Recall1Recall\rightarrow 1 as MM\to\infty when 0<λ<tanhK00<\lambda<\tanh K_{0}. Overall, for a RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with degree dd and coupling strength K0K_{0}, given MM i.i.d. samples 𝒟M={𝒔(1),,𝒔(M)}\mathcal{D}^{M}=\left\{\boldsymbol{s}^{\left(1\right)},...,\boldsymbol{s}^{\left(M\right)}\right\}, using 1\ell_{1}-LinR estimator (4) with regularization parameter λ\lambda, perfect recovery of the graph structure GG can be achieved as NN\to\infty if the number of samples MM satisfies

M\displaystyle M >c(λ,K0)logNλ2,λ(0,tanh(K0))\displaystyle>\frac{c\left(\lambda,K_{0}\right)\log N}{\lambda^{2}},\lambda\in\left(0,\tanh\left(K_{0}\right)\right) (102)

where c(λ,K0)c\left(\lambda,K_{0}\right) is a constant depending on the regularization parameter λ\lambda and the coupling strength K0K_{0}, which can be approximated in the limit NN\rightarrow\infty as:

c(λ,K0)=2(1tanh2(K0)+dλ2)1+(d1)tanh2(K0).c\left(\lambda,K_{0}\right)=\frac{2\left(1-\tanh^{2}\left(K_{0}\right)+d\lambda^{2}\right)}{1+\left(d-1\right)\tanh^{2}\left(K_{0}\right)}. (103)
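A direct numerical evaluation of (102)-(103) is straightforward; the following Python sketch (with illustrative arguments) computes c(λ,K0)c\left(\lambda,K_{0}\right) and the corresponding sufficient number of samples.

```python
import numpy as np

def c_coeff(lam, K0, d):
    """Constant c(lambda, K0) of Eq. (103) for a RR graph of degree d."""
    t2 = np.tanh(K0) ** 2
    return 2.0 * (1.0 - t2 + d * lam ** 2) / (1.0 + (d - 1) * t2)

def sufficient_samples(N, lam, K0, d):
    """Sufficient number of samples for perfect recovery as N -> infinity, Eq. (102)."""
    assert 0.0 < lam < np.tanh(K0), "lambda must lie in (0, tanh(K0))"
    return c_coeff(lam, K0, d) * np.log(N) / lam ** 2

print(sufficient_samples(N=10 ** 4, lam=0.2, K0=0.4, d=3))
```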

D.2.2 Logistic loss

In this case, from (76), the value of \triangle can be computed as

\displaystyle\bigtriangleup 𝔼s,z((1tanh(y^(S,z,χ,Q,J)))2).\displaystyle\simeq\mathbb{E}_{s,z}\left(\left(1-\tanh\left(\hat{y}\left(S,z,\chi,Q,J\right)\right)\right)^{2}\right). (104)

However, unlike the case of the 1\ell_{1}-LinR estimator, there is no analytic solution and \bigtriangleup has to be calculated numerically. It can be seen that, for Ising model selection, the 1\ell_{1}-LinR estimator differs from the 1\ell_{1}-LogR estimator only in the value of the scaling factor \bigtriangleup.

Appendix E Details of the non-asymptotic result for moderate M,NM,N

As demonstrated in Appendix A.3, from the replica analysis, both the 1\ell_{1}-LinR and 1\ell_{1}-LogR estimators are decoupled and their asymptotic behavior can be described by two scalar estimators for the active set and the inactive set, respectively. It is desirable to obtain non-asymptotic results for moderate M,NM,N. However, it is found that simply inserting finite values of M,NM,N into the EOS does not always lead to good consistency between the two scalar estimators and the experimental results, especially for the RecallRecall when MM is small. This can be explained by the derivation of the free energy density. In calculating the energy term ξ\mathcal{\mathcal{\xi}}, the limit MM\rightarrow\infty is taken implicitly when assuming the limit NN\rightarrow\infty with αM/N\alpha\equiv M/N. As a result, the scalar estimator associated with the active set can only describe the asymptotic performance in the limit MM\rightarrow\infty. Thus, one cannot describe the fluctuating behavior of the estimates in the active set, such as the recall rate, for finite MM. To characterize the non-asymptotic behavior of the estimates in the active set Ψ\Psi, we replace the expectation 𝔼s()\mathbb{E}_{s}(\cdot) in (64) by the sample average over MM samples, and the corresponding estimates are obtained as

{J^j}jΨ=argminJj,jΨ{1Mμ=1Mminyμ[(yμs0μ(Qzμ+jΨJjsjμ))22χ+(yμ)]+λjΨ|Jj|},\left\{\hat{J}_{j}\right\}_{j\in\Psi}=\underset{J_{j,j\in\Psi}}{\arg\min}\left\{\frac{1}{M}\sum_{\mu=1}^{M}\underset{y^{\mu}}{\min}\left[\frac{\left(y^{\mu}-s_{0}^{\mu}\left(\sqrt{Q}z^{\mu}+\sum_{j\in\Psi}\text{$J_{j}$}s_{j}^{\mu}\right)\right)^{2}}{2\chi}+\ell\left(y^{\mu}\right)\right]+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\right\}, (105)

where zμ𝒩(0,1)z^{\mu}\sim\mathcal{N}\left(0,1\right) and s0μ,sj,jΨμP(s0,𝒔Ψ|𝑱)s_{0}^{\mu},s_{j,j\in\Psi}^{\mu}\sim P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right) are random samples for μ=1,,M\mu=1,...,M. Note that the mean estimates {J¯j}jΨ\left\{\bar{J}_{j}\right\}_{j\in\Psi} are replaced by {J^j}jΨ\left\{\hat{J}_{j}\right\}_{j\in\Psi} in (105), as we now focus on their fluctuations due to the finite-size effect. In the limit MM\rightarrow\infty, the sample average converges to the expectation, so (105) becomes equivalent to (77).

E.1 Quadratic loss (y)=(y1)2/2\ell\left(y\right)=\left(y-1\right)^{2}/2

In the case of the quadratic loss (y)=(y1)2/2\ell\left(y\right)=\left(y-1\right)^{2}/2, there is an analytic solution for yy in min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)]\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\bar{J}_{j}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right]. Consequently, similar to (71), the result of (105) for the 1\ell_{1}-LinR estimator becomes

\left\{\hat{J}_{j}\right\}_{j\in\Psi}=\underset{J_{j,j\in\Psi}}{\arg\min}\left[\frac{1}{2\left(1+\chi\right)M}\sum_{\mu=1}^{M}\left(s_{0}^{\mu}-\sum_{j\in\Psi}s_{j}^{\mu}J_{j}-\sqrt{Q}z^{\mu}\right)^{2}+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\right]. (106)
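Given the current values of \chi and Q from the EOS, (106) is an ordinary LASSO problem in the |\Psi| active couplings, so any off-the-shelf solver can be used. The sketch below (an illustration, not part of Algorithm 1) uses scikit-learn; multiplying (106) by (1+\chi) shows that it matches scikit-learn's Lasso objective once the penalty is rescaled to \alpha=\lambda(1+\chi).

import numpy as np
from sklearn.linear_model import Lasso

def active_set_lasso(s0, s_act, z, chi, Q, lam):
    # Solve (106): (1/(2(1+chi)M)) * ||(s0 - sqrt(Q) z) - s_act J||^2 + lam * ||J||_1.
    # Multiplying by (1 + chi) gives scikit-learn's Lasso objective with alpha = lam * (1 + chi).
    target = s0 - np.sqrt(Q) * z
    model = Lasso(alpha=lam * (1.0 + chi), fit_intercept=False)
    model.fit(s_act, target)
    return model.coef_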

As the mean estimates {J¯j}jΨ\left\{\bar{J}_{j}\right\}_{j\in\Psi} are modified as in (106), the corresponding solution to the EOS in (70) also needs to be modified; it can be solved iteratively as sketched in Algorithm 1. The details of a practical implementation of Algorithm 1 are described in the following.

First, in the EOS (22), we need to obtain Λ~\tilde{\varLambda} satisfying the following relation

Eη=ρ(γ)Λ~γ𝑑γ,E\eta=-\int\frac{\rho\left(\gamma\right)}{\tilde{\varLambda}-\gamma}d\gamma, (107)

which is difficult to solve directly. To obtain Λ~\tilde{\varLambda}, we introduce an auxiliary variable Γ1Λ~\varGamma\equiv-\frac{1}{\tilde{\varLambda}}, by which (107) can be rewritten as

Γ=Eηρ(γ)1+Γγ𝑑γ,\varGamma=\frac{E\eta}{\int\frac{\rho\left(\gamma\right)}{1+\varGamma\gamma}d\gamma}, (108)

which can be solved iteratively. Accordingly, the quantities χ,Q,K,H\chi,Q,K,H in the EOS (22) can be equivalently written in terms of Γ\varGamma.

Second, when solving the EOS (22) iteratively using numerical methods, convergence can be improved by introducing a damping factor 𝚍𝚊𝚖𝚙[0,1)\mathtt{damp}\in[0,1) for χ,Q,E,R,F,η,K,H,Γ\chi,Q,E,R,F,\eta,K,H,\varGamma in each iteration.
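As a minimal sketch (assuming \rho(\gamma) is available as a callable with known support, as derived in Appendix F), the damped fixed-point iteration for \varGamma in (108) can be implemented as follows.

import numpy as np
from scipy.integrate import quad

def solve_Gamma(E, eta, rho, gamma_min, gamma_max,
                Gamma0=1.0, damp=0.5, tol=1e-10, max_iter=10000):
    # Damped fixed-point iteration for Gamma in Eq. (108); the damping scheme
    # mirrors the update used inside Algorithm 2.
    Gamma = Gamma0
    for _ in range(max_iter):
        integral, _ = quad(lambda g: rho(g) / (1.0 + Gamma * g), gamma_min, gamma_max)
        Gamma_new = (1.0 - damp) * E * eta / integral + damp * Gamma
        if abs(Gamma_new - Gamma) < tol:
            return Gamma_new
        Gamma = Gamma_new
    return Gamma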

The detailed implementation of Algorithm 1 is shown in Algorithm 2.

Input: M,N,λ,K0,ρ(γ)M,N,\lambda,K_{0},\rho\left(\gamma\right), 𝚍𝚊𝚖𝚙,TMC\mathtt{damp},T_{\rm MC}
Output: χ,Q,E,R,F,η,K,H,Γ,{J^j,jΨt}t=1TMC\chi,Q,E,R,F,\eta,K,H,\varGamma,\{\hat{J}^{t}_{j,j\in\Psi}\}_{t=1}^{T_{MC}}
Initialization: χ,Q,E,R,F,η,K,H,Γ\chi,Q,E,R,F,\eta,K,H,\varGamma
1 MC sampling: For t=1TMCt=1...T_{MC}, draw random samples s0μ,t,{sjμ,t}jΨP(s0,𝒔Ψ|𝑱)s_{0}^{\mu,t},\left\{s^{\mu,t}_{j}\right\}_{j\in\Psi}\sim P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right) and zμ,t𝒩(0,1)z^{\mu,t}\sim\mathcal{N}\left(0,1\right), μ=1M\mu=1...M
2repeat
3       for t=1t=1 to TMCT_{\rm MC} do
4             Solve J^j,jΨt=argminJj,jΨ[μ=1M(s0μ,tjΨsjμ,tJjQzμ,t)22(1+χ)M+λjΨ|Jj|]\hat{J}^{t}_{j,j\in\Psi}=\underset{J_{j,j\in\Psi}}{\arg\min}\left[\frac{\sum_{\mu=1}^{M}\left(s_{0}^{\mu,t}-\sum_{j\in\Psi}s_{j}^{\mu,t}J_{j}-\sqrt{Q}z^{\mu,t}\right)^{2}}{2\left(1+\chi\right)M}+\lambda\sum_{j\in\Psi}\left|{J}_{j}\right|\right]
5            Compute (t)=1Mμ=1M(s0μ,tjΨsjμ,tJ^jt)2\triangle\left(t\right)=\frac{1}{M}\sum_{\mu=1}^{M}\left(s_{0}^{\mu,t}-\sum_{j\in\Psi}s_{j}^{\mu,t}\hat{J}_{j}^{t}\right)^{2}
6      Set \bar{\triangle}=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\triangle\left(t\right)
7      E=(1𝚍𝚊𝚖𝚙)α(1+χ)+𝚍𝚊𝚖𝚙EE=(1-\mathtt{damp})\frac{\alpha}{\left(1+\chi\right)}+\mathtt{damp}\cdot{E}
8      F=(1𝚍𝚊𝚖𝚙)α(1+χ)2(¯+Q)+𝚍𝚊𝚖𝚙FF=(1-\mathtt{damp})\frac{\alpha}{\left(1+\chi\right)^{2}}\left(\bar{\triangle}+Q\right)+\mathtt{damp}\cdot{F}
9      R=(1𝚍𝚊𝚖𝚙)1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN]+𝚍𝚊𝚖𝚙RR=(1-\mathtt{damp})\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right]+\mathtt{damp}\cdot{R}
10      repeat
11             Γ=(1𝚍𝚊𝚖𝚙)Eηρ(γ)1+Γγ𝑑γ+𝚍𝚊𝚖𝚙Γ\varGamma=(1-\mathtt{damp})\frac{E\eta}{\int\frac{\rho\left(\gamma\right)}{1+\varGamma\gamma}d\gamma}+\mathtt{damp}\cdot{\varGamma}
12      until convergence
13      K=(1𝚍𝚊𝚖𝚙)(EΓ+1η)+𝚍𝚊𝚖𝚙KK=(1-\mathtt{damp})\left(-\frac{E}{\varGamma}+\frac{1}{\eta}\right)+\mathtt{damp}\cdot{K}
14      χ=(1𝚍𝚊𝚖𝚙)(ηΓ+1E)+𝚍𝚊𝚖𝚙χ\chi=(1-\mathtt{damp})\left(-\frac{\eta}{\varGamma}+\frac{1}{E}\right)+\mathtt{damp}\cdot{\chi}
15      Q=(1𝚍𝚊𝚖𝚙)(FE2RΓ(ER+Fη)ηΓ2ρ(γ)(1+Γγ)2𝑑γ)+𝚍𝚊𝚖𝚙QQ=(1-\mathtt{damp})\left(\frac{F}{E^{2}}-\frac{R}{\varGamma}-\frac{\left(-ER+F\eta\right)\eta}{\varGamma^{2}\int\frac{\rho\left(\gamma\right)}{\left(1+\varGamma\gamma\right)^{2}}d\gamma}\right)+\mathtt{damp}\cdot{Q}
16      H=(1𝚍𝚊𝚖𝚙)(RE2FΓ(ER+Fη)EΓ2ρ(γ)(1+Γγ)2𝑑γ)+𝚍𝚊𝚖𝚙HH=(1-\mathtt{damp})\left(\frac{R}{E^{2}}-\frac{F}{\varGamma}-\frac{\left(-ER+F\eta\right)E}{\varGamma^{2}\int\frac{\rho\left(\gamma\right)}{\left(1+\varGamma\gamma\right)^{2}}d\gamma}\right)+\mathtt{damp}\cdot{H}
17      η=(1𝚍𝚊𝚖𝚙)1Kerfc(λM2HN)+𝚍𝚊𝚖𝚙η\eta=(1-\mathtt{damp})\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)+\mathtt{damp}\cdot{\eta}
18until convergence
Algorithm 2 Detailed implementation of Algorithm 1 for the 1\ell_{1}-LinR estimator with moderate M,NM,N.

E.2 Logistic loss (y)=log(1+e2y)\ell\left(y\right)=\log\left(1+e^{-2y}\right)

In the case of the logistic loss (y)=log(1+e2y)\ell\left(y\right)=\log\left(1+e^{-2y}\right), since there is no analytic solution for yy in min𝑦[(ys0(Qz+jΨJ¯jsj))22χ+(y)]\underset{y}{\min}\left[\frac{\left(y-s_{0}\left(\sqrt{Q}z+\sum_{j\in\Psi}\bar{J}_{j}s_{j}\right)\right)^{2}}{2\chi}+\ell\left(y\right)\right], the result of (105) for the 1\ell_{1}-LogR estimator becomes

\hat{J}_{j,j\in\Psi}=\underset{J_{j,j\in\Psi}}{\arg\min}\left[\frac{1}{M}\sum_{\mu=1}^{M}\underset{y^{\mu}}{\min}\left[\frac{\left(y^{\mu}-s_{0}^{\mu}\left(\sqrt{Q}z^{\mu}+\sum_{j\in\Psi}J_{j}s_{j}^{\mu}\right)\right)^{2}}{2\chi}+\log\left(1+e^{-2y^{\mu}}\right)\right]+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\right]. (109)

Similarly to the quadratic-loss case, as the mean estimates {J¯j}jΨ\left\{\bar{J}_{j}\right\}_{j\in\Psi} are modified as in (109), the corresponding solution to the EOS in (76) also needs to be modified, which can be solved iteratively as shown in Algorithm 3.
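A compact sketch of the inner alternating step (lines 7-8 of Algorithm 3) is given below. It is an illustration under two assumptions: the \hat{y}-update is solved per sample by one-dimensional minimization, and the J-update, after multiplying the objective by \chi and writing the design as s_{0}^{\mu}s_{j}^{\mu} with target \hat{y}^{\mu}-s_{0}^{\mu}\sqrt{Q}z^{\mu}, is treated as a LASSO with penalty \alpha=\lambda\chi.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Lasso

def active_set_logistic(s0, s_act, z, chi, Q, lam, n_outer=50, tol=1e-8):
    # Alternating minimization for (109); s0: (M,), s_act: (M, |Psi|), z: (M,).
    M, K = s_act.shape
    J = np.zeros(K)
    X = s0[:, None] * s_act                      # design matrix with entries s0^mu * s_j^mu
    for _ in range(n_outer):
        h = s0 * (np.sqrt(Q) * z) + X @ J        # s0^mu * (sqrt(Q) z^mu + sum_j J_j s_j^mu)
        y_hat = np.array([
            minimize_scalar(lambda y, a=a: (y - a) ** 2 / (2.0 * chi)
                            + np.logaddexp(0.0, -2.0 * y)).x
            for a in h
        ])
        target = y_hat - s0 * np.sqrt(Q) * z     # LASSO target for the J-update
        J_new = Lasso(alpha=lam * chi, fit_intercept=False).fit(X, target).coef_
        if np.max(np.abs(J_new - J)) < tol:      # simple convergence check
            return J_new
        J = J_new
    return J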

Input: M,N,λ,K0,ρ(γ)M,N,\lambda,K_{0},\rho\left(\gamma\right), 𝚍𝚊𝚖𝚙,TMC\mathtt{damp},T_{\rm MC}
Output: χ,Q,E,R,F,η,K,H,Γ,{J^j,jΨt}t=1TMC\chi,Q,E,R,F,\eta,K,H,\varGamma,\{\hat{J}^{t}_{j,j\in\Psi}\}_{t=1}^{T_{MC}}
Initialization: χ,Q,E,R,F,η,K,H,Γ\chi,Q,E,R,F,\eta,K,H,\varGamma
1
2MC sampling: For t=1TMCt=1...T_{MC}, draw random samples s0μ,t,{sjμ,t}jΨP(s0,𝒔Ψ|𝑱)s_{0}^{\mu,t},\left\{s^{\mu,t}_{j}\right\}_{j\in\Psi}\sim P\left(s_{0},\boldsymbol{s}_{\Psi}|\boldsymbol{J}^{*}\right) and zμ,t𝒩(0,1)z^{\mu,t}\sim\mathcal{N}\left(0,1\right), μ=1M\mu=1...M
3repeat
4       for t=1t=1 to TMCT_{\rm MC} do
5             Initialization J^j,jΨt\hat{J}^{t}_{j,j\in\Psi}
6            repeat
7                   \hat{y}^{\mu,t}=\underset{y^{\mu,t}}{\arg\min}\left[\frac{\left(y^{\mu,t}-s_{0}^{\mu,t}\left(\sqrt{Q}z^{\mu,t}+\sum_{j\in\Psi}\hat{J}^{t}_{j}s_{j}^{\mu,t}\right)\right)^{2}}{2\chi}+\log\left(1+e^{-2y^{\mu,t}}\right)\right],\mu=1...M
8                  \hat{J}^{t}_{j,j\in\Psi}=\underset{J_{j,j\in\Psi}}{\arg\min}\bigg\{\frac{1}{M}\sum_{\mu=1}^{M}\left[\frac{\left(\hat{y}^{\mu,t}-s_{0}^{\mu,t}\left(\sqrt{Q}z^{\mu,t}+\sum_{j\in\Psi}J_{j}s_{j}^{\mu,t}\right)\right)^{2}}{2\chi}+\log\left(1+e^{-2\hat{y}^{\mu,t}}\right)\right]+\lambda\sum_{j\in\Psi}\left|J_{j}\right|\bigg\}
9            until convergence
10            Compute 1(t)=1Mμ=1M(s0μ,tzμ,tQ(1tanh(y^μ,t)))\triangle_{1}\left(t\right)=\frac{1}{M}\sum_{\mu=1}^{M}\left(\frac{s^{\mu,t}_{0}z^{\mu,t}}{-\sqrt{Q}}(1-\tanh\left(\hat{y}^{\mu,t}\right))\right)
11            Compute 2(t)=1Mμ=1M(1tanh(y^μ,t))2\triangle_{2}\left(t\right)=\frac{1}{M}\sum_{\mu=1}^{M}\left(1-\tanh\left(\hat{y}^{\mu,t}\right)\right)^{2}
12      
13      Set ¯1=1TMCt=1TMC1(t)\bar{\triangle}_{1}=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\triangle_{1}\left(t\right) and ¯2=1TMCt=1TMC2(t)\bar{\triangle}_{2}=\frac{1}{T_{\rm MC}}\sum_{t=1}^{T_{\rm MC}}\triangle_{2}\left(t\right)
14      E=(1𝚍𝚊𝚖𝚙)α¯1+𝚍𝚊𝚖𝚙EE=(1-\mathtt{damp})\cdot{\alpha\bar{\triangle}_{1}}+\mathtt{damp}\cdot{E}
15      F=(1𝚍𝚊𝚖𝚙)α¯2+𝚍𝚊𝚖𝚙FF=(1-\mathtt{damp})\cdot{\alpha\bar{\triangle}_{2}}+\mathtt{damp}\cdot{F}
16      R=(1𝚍𝚊𝚖𝚙)1K2[(H+λ2M2N)erfc(λM2HN)2λMHN12πeλ2M22HN]+𝚍𝚊𝚖𝚙RR=(1-\mathtt{damp})\frac{1}{K^{2}}\left[\left(H+\frac{\lambda^{2}M^{2}}{N}\right)\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)-2\lambda M\sqrt{\frac{H}{N}}\frac{1}{\sqrt{2\pi}}e^{-\frac{\lambda^{2}M^{2}}{2HN}}\right]+\mathtt{damp}\cdot{R}
17      repeat
18             Γ=(1𝚍𝚊𝚖𝚙)Eηρ(γ)1+Γγ𝑑γ+𝚍𝚊𝚖𝚙Γ\varGamma=(1-\mathtt{damp})\frac{E\eta}{\int\frac{\rho\left(\gamma\right)}{1+\varGamma\gamma}d\gamma}+\mathtt{damp}\cdot{\varGamma}
19      until convergence
20      K=(1𝚍𝚊𝚖𝚙)(EΓ+1η)+𝚍𝚊𝚖𝚙KK=(1-\mathtt{damp})\left(-\frac{E}{\varGamma}+\frac{1}{\eta}\right)+\mathtt{damp}\cdot{K}
21      χ=(1𝚍𝚊𝚖𝚙)(ηΓ+1E)+𝚍𝚊𝚖𝚙χ\chi=(1-\mathtt{damp})\left(-\frac{\eta}{\varGamma}+\frac{1}{E}\right)+\mathtt{damp}\cdot{\chi}
22      Q=(1𝚍𝚊𝚖𝚙)(FE2RΓ(ER+Fη)ηΓ2ρ(γ)(1+Γγ)2𝑑γ)+𝚍𝚊𝚖𝚙QQ=(1-\mathtt{damp})\left(\frac{F}{E^{2}}-\frac{R}{\varGamma}-\frac{\left(-ER+F\eta\right)\eta}{\varGamma^{2}\int\frac{\rho\left(\gamma\right)}{\left(1+\varGamma\gamma\right)^{2}}d\gamma}\right)+\mathtt{damp}\cdot{Q}
23      H=(1𝚍𝚊𝚖𝚙)(RE2FΓ(ER+Fη)EΓ2ρ(γ)(1+Γγ)2𝑑γ)+𝚍𝚊𝚖𝚙HH=(1-\mathtt{damp})\left(\frac{R}{E^{2}}-\frac{F}{\varGamma}-\frac{\left(-ER+F\eta\right)E}{\varGamma^{2}\int\frac{\rho\left(\gamma\right)}{\left(1+\varGamma\gamma\right)^{2}}d\gamma}\right)+\mathtt{damp}\cdot{H}
24      η=(1𝚍𝚊𝚖𝚙)1Kerfc(λM2HN)+𝚍𝚊𝚖𝚙η\eta=(1-\mathtt{damp})\frac{1}{K}\textrm{erfc}\left(\frac{\lambda M}{\sqrt{2HN}}\right)+\mathtt{damp}\cdot{\eta}
25until convergence
Algorithm 3 Detailed implementation of solving the EOS (76) together with (109) for 1\ell_{1}-LogR with moderate M,NM,N.

Appendix F Eigenvalue Distribution ρ(γ)\rho\left(\gamma\right)

From the replica analysis presented, the learning performance will depend on the eigenvalue distribution (EVD) ρ(γ)\rho\left(\gamma\right) of the covariance matrix CC of the original Ising model.

Two points should be noted. The first concerns the formula connecting the performance of the estimator with the spectral density, and the second concerns the numerical values of the quantities computed from that formula. For the first point, no assumption about the spectral density is needed to obtain the formula itself; the formula is valid whenever the graph structure is tree-like and the Ising model defined on the graph is in the paramagnetic phase. For the second point, the specific form of the spectral density is in general required to obtain numerical solutions. As a demonstration, we consider the random regular graph with constant coupling strength, for which the spectral density is known analytically [7].

In general, it is difficult to obtain this EVD; however, for sparse tree-like graphs such as the RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} with constant node degree dd and a sufficiently small coupling strength K0K_{0} that yields the paramagnetic state (𝔼𝒔(𝒔)=𝟎\mathbb{E}_{\boldsymbol{s}}(\boldsymbol{s})=\boldsymbol{0}), it can be computed analytically. To this end, we express the covariances as

Cij=𝔼𝒔(sisj)𝔼𝒔(si)𝔼𝒔(sj)=2logZ(𝜽)θiθj,C_{ij}=\mathbb{E}_{\boldsymbol{s}}(s_{i}s_{j})-\mathbb{E}_{\boldsymbol{s}}(s_{i})\mathbb{E}_{\boldsymbol{s}}(s_{j})=\frac{\partial^{2}\log Z(\boldsymbol{\theta})}{\partial\theta_{i}\partial\theta_{j}}, (110)

where Z(𝜽)=𝑑𝒔PIsing(𝒔|J)exp(i=0N1θisi)Z(\boldsymbol{\theta})=\int d\boldsymbol{s}P_{\rm Ising}(\boldsymbol{s}|J^{*})\exp(\sum_{i=0}^{N-1}\theta_{i}s_{i}) and the assessment is carried out at 𝜽=𝟎\boldsymbol{\theta}=\boldsymbol{0}.

In addition, for technical convenience we introduce the Gibbs free energy as

A(𝒎)=max𝜽{𝜽T𝒎logZ(𝜽)}.A\left(\boldsymbol{m}\right)=\underset{\boldsymbol{\theta}}{\max}\left\{\boldsymbol{\theta}^{T}\boldsymbol{m}-\log Z\left(\boldsymbol{\theta}\right)\right\}. (111)

The definition of (111) indicates that the following two relations hold:

miθj=2logZ(𝜽)θiθj=Cij,\displaystyle\frac{\partial m_{i}}{\partial\theta_{j}}=\frac{\partial^{2}\log Z(\boldsymbol{\theta})}{\partial\theta_{i}\partial\theta_{j}}=C_{ij},
θimj=[C1]ij=2A(𝒎)mimj,\displaystyle\frac{\partial\theta_{i}}{\partial m_{j}}=[C^{-1}]_{ij}=\frac{\partial^{2}A(\boldsymbol{m})}{\partial m_{i}\partial m_{j}}, (112)

where the evaluations are performed at 𝜽=𝟎\boldsymbol{\theta}=\boldsymbol{0} and 𝒎=argmin𝒎A(𝒎)\boldsymbol{m}=\arg\min_{\boldsymbol{m}}A(\boldsymbol{m}) (=𝟎=\boldsymbol{0} under the paramagnetic assumption).

Consequently, we can focus on the computation of A(𝒎)A\left(\boldsymbol{m}\right) to obtain the EVD of C1C^{-1}. The inverse covariance matrix of a RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}} can be computed from the Hessian of the Gibbs free energy [7, 47, 48] as

\displaystyle\left[C^{-1}\right]_{ij} =\frac{\partial^{2}A\left(\boldsymbol{m}\right)}{\partial m_{i}\partial m_{j}}
=(d1tanh2K0d+1)δijtanh(Jij)1tanh2(Jij)(1δij),\displaystyle=\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1\right)\delta_{ij}-\frac{\tanh\left(J_{ij}\right)}{1-\tanh^{2}\left(J_{ij}\right)}\left(1-\delta_{ij}\right), (113)

and in matrix form, we have

C1=(d1tanh2K0d+1)𝐈tanh(𝑱)1tanh2(𝑱),C^{-1}=\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1\right)\mathbf{I}-\frac{\tanh\left(\boldsymbol{J}\right)}{1-\tanh^{2}\left(\boldsymbol{J}\right)}, (114)

where 𝐈\mathbf{I} is an identity matrix of proper size, and the operations tanh(),tanh2()\tanh\left(\cdot\right),\tanh^{2}\left(\cdot\right) on the matrix 𝑱\boldsymbol{J} are defined component-wise. For the RR graph G𝒢N,d,K0G\in\mathcal{G}_{N,d,K_{0}}, 𝑱\boldsymbol{J} is a sparse matrix; therefore the matrix tanh(𝑱)1tanh2(𝑱)\frac{\tanh\left(\boldsymbol{J}\right)}{1-\tanh^{2}\left(\boldsymbol{J}\right)} also corresponds to a sparse coupling matrix (whose nonzero coupling positions are the same as those of 𝑱\boldsymbol{J}) with constant coupling strength K1=tanh(K0)1tanh2(K0)K_{1}=\frac{\tanh\left(K_{0}\right)}{1-\tanh^{2}\left(K_{0}\right)} and fixed connectivity dd. The corresponding eigenvalue (denoted as ζ\zeta) distribution can be calculated as [49]

ρζ(ζ)=d4K12(d1)ζ22π(K12d2ζ2),|ζ|2K1d1.\rho_{\zeta}\left(\zeta\right)=\frac{d\sqrt{4K_{1}^{2}\left(d-1\right)-\zeta^{2}}}{2\pi\left(K_{1}^{2}d^{2}-\zeta^{2}\right)},\;\left|\zeta\right|\leq 2K_{1}\sqrt{d-1}. (115)

From (114), the eigenvalue η\eta of C1C^{-1} is

ηi=d1tanh2K0d+1ζi,\eta_{i}=\frac{d}{1-\tanh^{2}K_{0}}-d+1-\zeta_{i}, (116)

which, when combined with (115), readily yields the EVD of η\eta as NN\rightarrow\infty as follows:

ρη(η)\displaystyle\rho_{\eta}\left(\eta\right) =ρζ(d1tanh2K0d+1η)\displaystyle=\rho_{\zeta}\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\eta\right)
=d4(tanh(K0)1tanh2(K0))2(d1)(d1tanh2K0d+1η)22π((tanh(K0)1tanh2(K0))2d2(d1tanh2K0d+1η)2),\displaystyle=\frac{d\sqrt{4\left(\frac{\tanh\left(K_{0}\right)}{1-\tanh^{2}\left(K_{0}\right)}\right)^{2}\left(d-1\right)-\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\eta\right)^{2}}}{2\pi\left(\left(\frac{\tanh\left(K_{0}\right)}{1-\tanh^{2}\left(K_{0}\right)}\right)^{2}d^{2}-\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\eta\right)^{2}\right)}, (117)

where η[d1tanh2K0d+12tanh(K0)d11tanh2(K0),d1tanh2K0d+1+2tanh(K0)d11tanh2(K0)]\eta\in\left[\frac{d}{1-\tanh^{2}K_{0}}-d+1-\frac{2\tanh\left(K_{0}\right)\sqrt{d-1}}{1-\tanh^{2}\left(K_{0}\right)},\frac{d}{1-\tanh^{2}K_{0}}-d+1+\frac{2\tanh\left(K_{0}\right)\sqrt{d-1}}{1-\tanh^{2}\left(K_{0}\right)}\right].

Consequently, since γ=1/η\gamma=1/\eta, we obtain the EVD ρ(γ)\rho\left(\gamma\right) as follows

ρ(γ)\displaystyle\rho\left(\gamma\right) =1γ2ρη(η=1γ)\displaystyle=\frac{1}{\gamma^{2}}\rho_{\eta}\left(\eta=\frac{1}{\gamma}\right)
=d4(tanh(K0)1tanh2(K0))2(d1)(d1tanh2K0d+11γ)22πγ2((tanh(K0)1tanh2(K0))2d2(d1tanh2K0d+11γ)2)\displaystyle=\frac{d\sqrt{4\left(\frac{\tanh\left(K_{0}\right)}{1-\tanh^{2}\left(K_{0}\right)}\right)^{2}\left(d-1\right)-\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\frac{1}{\gamma}\right)^{2}}}{2\pi\gamma^{2}\left(\left(\frac{\tanh\left(K_{0}\right)}{1-\tanh^{2}\left(K_{0}\right)}\right)^{2}d^{2}-\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\frac{1}{\gamma}\right)^{2}\right)} (118)

where γ[1/(d1tanh2K0d+1+2tanh(K0)d11tanh2(K0)),1/(d1tanh2K0d+12tanh(K0)d11tanh2(K0))]\gamma\in\left[1/\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1+\frac{2\tanh\left(K_{0}\right)\sqrt{d-1}}{1-\tanh^{2}\left(K_{0}\right)}\right),1/\left(\frac{d}{1-\tanh^{2}K_{0}}-d+1-\frac{2\tanh\left(K_{0}\right)\sqrt{d-1}}{1-\tanh^{2}\left(K_{0}\right)}\right)\right].
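For a finite-size sanity check of (114) and (118) (an illustration only; NumPy and networkx are assumed to be available), the sketch below constructs C^{-1} for a sampled random d-regular graph, obtains its eigenvalues \gamma=1/\eta, and evaluates the closed-form density \rho(\gamma); comparing a histogram of the former with the latter provides a direct check.

import numpy as np
import networkx as nx

def inverse_covariance_rr(N, d, K0, seed=0):
    # C^{-1} of Eq. (114) for a sampled random d-regular graph with uniform coupling K0.
    A = nx.to_numpy_array(nx.random_regular_graph(d, N, seed=seed))
    t = np.tanh(K0)
    return (d / (1.0 - t ** 2) - d + 1.0) * np.eye(N) - (t / (1.0 - t ** 2)) * A

def rho_gamma(gamma, d, K0):
    # Closed-form density of Eq. (118); returns 0 outside its support.
    t = np.tanh(K0)
    K1 = t / (1.0 - t ** 2)
    a = d / (1.0 - t ** 2) - d + 1.0             # center of the eta-band in Eq. (116)
    x = a - 1.0 / gamma
    disc = 4.0 * K1 ** 2 * (d - 1) - x ** 2
    dens = d * np.sqrt(np.maximum(disc, 0.0)) / (
        2.0 * np.pi * gamma ** 2 * (K1 ** 2 * d ** 2 - x ** 2))
    return np.where(disc > 0.0, dens, 0.0)

# Empirical eigenvalues gamma = 1/eta from one finite graph (illustrative sizes).
gammas = 1.0 / np.linalg.eigvalsh(inverse_covariance_rr(N=2000, d=3, K0=0.4))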

Appendix G Additional Experimental Results

Fig. 7 and Fig. 8 show the full results of the non-asymptotic learning performance prediction for λ=0.1\lambda=0.1 and λ=0.3\lambda=0.3, respectively. Good agreement between the replica predictions and the experimental results is achieved in all cases. As can be seen, there is a negligible difference in PrecisionPrecision and RecallRecall between 1\ell_{1}-LinR and 1\ell_{1}-LogR. Moreover, compared with Fig. 7 (λ=0.1\lambda=0.1), the difference in RSS between 1\ell_{1}-LinR and 1\ell_{1}-LogR is reduced when λ=0.3\lambda=0.3. In addition, comparing Fig. 7 and Fig. 8 shows that, under the same setting, increasing λ\lambda yields a larger PrecisionPrecision but a smaller RecallRecall, implying a tradeoff in choosing λ\lambda in practice for Ising model selection with finite M,NM,N.

Figure 7: Theoretical and experimental results of RSSRSS, PrecisionPrecision and RecallRecall for both 1\ell_{1}-LinR and 1\ell_{1}-LogR when λ=0.1\lambda=0.1, N=200,400,800N=200,400,800 with different values of αM/N\alpha\equiv M/N. The standard error bars are obtained from 1000 random runs. An excellent agreement between theory and experiment is achieved, even for small N=200N=200 and small α\alpha (small MM).
Figure 8: Theoretical and experimental results of RSS, PrecisionPrecision and RecallRecall for both 1\ell_{1}-LinR and 1\ell_{1}-LogR when λ=0.3\lambda=0.3, N=200,400,800N=200,400,800 with different values of αM/N\alpha\equiv M/N. The standard error bars are obtained from 1000 random runs. An excellent agreement between theory and experiment is achieved, even for small N=200N=200 and small α\alpha (small MM).

Fig. 9 and Fig. 10 show the full results of the critical scaling prediction for λ=0.1\lambda=0.1 and λ=0.3\lambda=0.3, respectively. For comparison, the results of both 1\ell_{1}-LinR and 1\ell_{1}-LogR are shown. It can be seen that, in addition to the good agreement between the replica and experimental results, the prediction of the scaling value c0(λ,K0)c(λ,K0)λ2c_{0}\left(\lambda,K_{0}\right)\equiv\frac{c\left(\lambda,K_{0}\right)}{\lambda^{2}} is very accurate.

Figure 9: PrecisionPrecision and RecallRecall versus NN when M=clogNM=c\log N and K0=0.4K_{0}=0.4 for 1\ell_{1}-LinR and 1\ell_{1}-LogR when λ=0.1\lambda=0.1, where c0(λ,K0)c(λ,K0)λ2137c_{0}\left(\lambda,K_{0}\right)\equiv\frac{c\left(\lambda,K_{0}\right)}{\lambda^{2}}\approx 137. When c>c0(λ,K0)c>c_{0}\left(\lambda,K_{0}\right), the Precision increases consistently with NN and approaches 1 as NN\rightarrow\infty, while it decreases consistently with NN when c<c0(λ,K0)c<c_{0}\left(\lambda,K_{0}\right).
Figure 10: PrecisionPrecision and RecallRecall versus NN when M=clogNM=c\log N and K0=0.4K_{0}=0.4 for 1\ell_{1}-LinR and 1\ell_{1}-LogR when λ=0.3\lambda=0.3, where c0(λ,K0)c(λ,K0)λ219.4c_{0}\left(\lambda,K_{0}\right)\equiv\frac{c\left(\lambda,K_{0}\right)}{\lambda^{2}}\approx 19.4. When c>c0(λ,K0)c>c_{0}\left(\lambda,K_{0}\right), the Precision increases consistently with NN and approaches 1 as NN\rightarrow\infty, while it decreases consistently with NN when c<c0(λ,K0)c<c_{0}\left(\lambda,K_{0}\right). The Recall increases consistently and approaches 1 as NN\rightarrow\infty.