
Free Energy Evaluation Using Marginalized Annealed Importance Sampling

Muneki Yasuda [email protected], Graduate School of Science and Engineering, Yamagata University, Japan.    Chako Takahashi, Graduate School of Science and Engineering, Yamagata University, Japan.
Abstract

The evaluation of the free energy of a stochastic model is considered a significant issue in various fields of physics and machine learning. However, exact free energy evaluation is computationally infeasible because the free energy expression includes an intractable partition function. Annealed importance sampling (AIS) is a type of importance sampling based on the Markov chain Monte Carlo method that is similar to simulated annealing and can effectively approximate the free energy. This study proposes an AIS-based approach, referred to as marginalized AIS (mAIS). The statistical efficiency of mAIS is investigated in detail from theoretical and numerical perspectives. Based on the investigation, it is proved that mAIS is more effective than AIS under a certain condition.

free energy evaluation, annealed importance sampling, Rao–Blackwellization, Ising model, restricted Boltzmann machine

I Introduction

The evaluation of the free energy (i.e., the negative log of the partition function) of a stochastic model is considered to be an important issue in various fields of physics. Moreover, the free energy plays an important role in machine learning models, e.g., the Boltzmann machine Ackley et al. (1985). The Boltzmann machine and its variants, e.g., the restricted Boltzmann machine (RBM) Smolensky (1986); Hinton (2002) and the deep Boltzmann machine (DBM) Salakhutdinov and Hinton (2009), have been actively investigated in the fields of machine learning and physics Roudi et al. (2009); Decelle and Furtlehner (2021); Chen et al. (2018); Nomura and Imada (2021); Torlai et al. (2018); Carleo and Troyer (2017); Gao and Duan (2017). However, an exact free energy evaluation is computationally infeasible because the expression for the free energy includes an intractable partition function.

Annealed importance sampling (AIS) Neal (2001) is a type of importance sampling based on the Markov chain Monte Carlo (MCMC) method that is similar to simulated annealing and can effectively approximate the free energy Neal (2001); Salakhutdinov and Murray (2008). In AIS, sequential sampling (or ancestral sampling) is executed from a tractable initial distribution to the target distribution, in which the transitions between the distributions are performed using, for example, Gibbs sampling Geman and Geman (1984). The AIS-based free energy evaluation is essentially the same as the method proposed by Jarzynski (the so-called Jarzynski equality) Jarzynski (1997). Several researchers have addressed the development of AIS Burda et al. (2015); Liu et al. (2015); Carlson et al. (2016); Mazzanti and Romero (2020); Krause et al. (2020); Yasuda and Sekimoto (2021). Recently, the relationship between AIS and the normalizing flows proposed in the deep learning field has been investigated Caselle et al. (2022).

In AIS, the free energy is estimated as follows: first, the estimator of the partition function is obtained as the sample average of the importance weights collected during the sequential sampling process; subsequently, the free energy estimator is obtained as the negative log of the partition function estimator. It is known that the partition function estimator is unbiased, whereas the free energy estimator is biased Burda et al. (2015). To evaluate the free energy, we propose an AIS-based approach that we refer to as marginalized AIS (mAIS). The concept of mAIS is simple: mAIS corresponds to AIS in a marginalized model, i.e., mAIS can be regarded as a special case of AIS. Therefore, the basic principle of mAIS is similar to that of AIS. Suppose that our target model represents a distribution in an $n$-dimensional space. In AIS, we perform a sampling procedure in the $n$-dimensional space to evaluate the free energy. In mAIS, by contrast, the space in which the sampling procedure is performed is smaller because the dimension of the model is reduced through marginalization. Intuitively, mAIS therefore seems to be more effective than AIS. This intuition is valid under a certain condition, under which the following two statements can be proved (see Section III.2): (1) the partition function estimator obtained from mAIS is more accurate than that obtained from AIS, and (2) the bias of the free energy estimator obtained from mAIS is lower than that of the estimator obtained from AIS. However, as discussed in Section III.2, the condition assumed in these statements limits the range in which the effectiveness of mAIS is assured. As discussed in Section III.3, this condition can be satisfied when AIS and mAIS are used on Markov random fields (MRFs) defined on bipartite graphs, in which a layer-wise marginalization can be performed. Bipartite-type MRFs include important applications, e.g., RBM and DBM.

The remainder of this paper is organized as follows: Section II explains the AIS-based free energy evaluation. Section III introduces mAIS: Section III.1 details mAIS, Section III.2 describes its theoretical analysis, and Section III.3 discusses its application to bipartite-type MRFs. Section IV numerically demonstrates the validity of mAIS. Section V summarizes the study, along with discussions.

II Free Energy Evaluation using Annealed Importance Sampling

Consider a distribution with $n$ random variables, $\bm{x}:=\{x_{i}\in\mathcal{X}_{i}\mid i\in U:=\{1,2,\ldots,n\}\}$, as

P_{\mathrm{model}}(\bm{x}) := \frac{1}{Z}\exp\big(-E(\bm{x})\big),   (1)

where $\mathcal{X}_{i}$ is the continuous or discrete sample space of $x_{i}$ and $E(\bm{x})$ is the energy function (or the Hamiltonian); $Z$ is the partition function and is defined as

Z := \sum_{\bm{x}}\exp\big(-E(\bm{x})\big),   (2)

where $\sum_{\bm{x}}:=\sum_{x_{1}\in\mathcal{X}_{1}}\sum_{x_{2}\in\mathcal{X}_{2}}\cdots\sum_{x_{n}\in\mathcal{X}_{n}}$ denotes the multiple summation over all realizations of $\bm{x}$. When $\mathcal{X}_{i}$ is a continuous space, $\sum_{x_{i}}$ is replaced by integration over $\mathcal{X}_{i}$.

This study evaluates the free energy $F:=-\ln Z$. The evaluation of the free energy is infeasible because it requires the evaluation of the intractable partition function. AIS is a type of importance sampling based on the MCMC method that is similar to simulated annealing and can evaluate the free energy Neal (2001); Salakhutdinov and Murray (2008). This free energy evaluation method is essentially the same as the method proposed by Jarzynski Jarzynski (1997). The AIS-based free energy evaluation is briefly explained in the following.

First, we design a sequence of distributions as

\{P_{k}(\bm{x})\mid k=0,1,\ldots,K\},   (3)

where $P_{K}(\bm{x})=P_{\mathrm{model}}(\bm{x})$, and each $P_{k}(\bm{x})$ is expressed as

P_{k}(\bm{x}) := \frac{1}{Z_{k}}\exp\big(-E_{k}(\bm{x})\big),   (4)

where $Z_{k}$ is the partition function of the $k$th distribution. Because $P_{K}(\bm{x})=P_{\mathrm{model}}(\bm{x})$, $E_{K}(\bm{x})=E(\bm{x})$ and $Z_{K}=Z$ are assumed. In this sequence, the initial distribution $P_{0}(\bm{x})$ is set to a tractable distribution (e.g., a uniform distribution). For example, the sequence expressed as

P_{k}(\bm{x}) \propto P_{0}(\bm{x})^{1-\beta_{k}}P_{\mathrm{model}}(\bm{x})^{\beta_{k}},   (5)

for $0=\beta_{0}<\beta_{1}<\cdots<\beta_{K}=1$, is frequently used, where $\beta_{k}$ corresponds to the annealing temperature. The $k$th distribution expressed by equation (5) corresponds to equation (4) with $E_{k}(\bm{x})=-(1-\beta_{k})\ln P_{0}(\bm{x})+\beta_{k}E(\bm{x})$. However, we pursue the following arguments without specifying the form of $E_{k}(\bm{x})$.

On the $k$th distribution, a transition probability $T_{k}(\bm{x}^{\prime}\mid\bm{x})$, which satisfies the condition

P_{k}(\bm{x}^{\prime}) = \sum_{\bm{x}}T_{k}(\bm{x}^{\prime}\mid\bm{x})P_{k}(\bm{x}),   (6)

is defined. Using the transition probability, the annealing process from the initial state $\bm{x}^{(1)}$ to the final state $\bm{x}^{(K)}$ is defined as

T(\bm{X}) := \Big(\prod_{k=1}^{K-1}T_{k}(\bm{x}^{(k+1)}\mid\bm{x}^{(k)})\Big)P_{0}(\bm{x}^{(1)}),   (7)

where $\bm{X}:=\{\bm{x}^{(k)}\mid k=1,2,\ldots,K\}$. $T(\bm{X})$ denotes the joint distribution over $\bm{X}$. For $T(\bm{X})$, the corresponding “reverse” process,

\tilde{T}(\bm{X}) := \Big(\prod_{k=1}^{K-1}\tilde{T}_{k}(\bm{x}^{(k)}\mid\bm{x}^{(k+1)})\Big)P_{K}(\bm{x}^{(K)}),   (8)

is defined, where

\tilde{T}_{k}(\bm{x}\mid\bm{x}^{\prime}) := \frac{T_{k}(\bm{x}^{\prime}\mid\bm{x})P_{k}(\bm{x})}{P_{k}(\bm{x}^{\prime})}   (9)

is the reverse transition. $T(\bm{X})$ expresses the transition process from $\bm{x}^{(1)}$ to $\bm{x}^{(K)}$, whereas its reverse process $\tilde{T}(\bm{X})$ expresses the reverse transition from $\bm{x}^{(K)}$ to $\bm{x}^{(1)}$. Because the normalization condition, $\sum_{\bm{x}}\tilde{T}_{k}(\bm{x}\mid\bm{x}^{\prime})=1$, is ensured by the condition in equation (6), $\tilde{T}(\bm{X})$ can be considered the joint distribution over $\bm{X}$; it satisfies

\sum_{\bm{x}^{(1)}}\sum_{\bm{x}^{(2)}}\cdots\sum_{\bm{x}^{(K-1)}}\tilde{T}(\bm{X}) = P_{K}(\bm{x}^{(K)}).

AIS is regarded as importance sampling in which $\tilde{T}(\bm{X})$ and $T(\bm{X})$ are the target and the corresponding proposal distributions, respectively. Using equations (7), (8), and (9), the ratio between the target and proposal distributions, i.e., the (normalized) importance weight, is obtained as

\frac{\tilde{T}(\bm{X})}{T(\bm{X})} = \frac{Z_{0}}{Z_{K}}W(\bm{X}),   (10)

where

W(\bm{X}) := \prod_{k=1}^{K}w_{k}(\bm{x}^{(k)}),   (11)
w_{k}(\bm{x}) := \exp\big(-E_{k}(\bm{x})+E_{k-1}(\bm{x})\big).   (12)

Here, $Z_{0}$ is the partition function of $P_{0}(\bm{x})$ (i.e., a tractable distribution) and $Z_{K}=Z$ is the true partition function of the objective distribution. From the normalization condition of $\tilde{T}(\bm{X})$, the relation,

1 = \sum_{\bm{X}}\frac{\tilde{T}(\bm{X})}{T(\bm{X})}T(\bm{X}),

is obtained. Substituting equation (10) into this relation yields

Z_{K} = Z_{0}\sum_{\bm{X}}W(\bm{X})T(\bm{X}),   (13)

where $\sum_{\bm{X}}:=\sum_{\bm{x}^{(1)}}\sum_{\bm{x}^{(2)}}\cdots\sum_{\bm{x}^{(K)}}$ denotes the multiple summation over all realizations of $\bm{X}$.

Sampling from the proposal distribution can be performed using ancestral sampling, from $\bm{x}^{(1)}$ to $\bm{x}^{(K)}$, on $T(\bm{X})$; i.e., the sequence of sample points, $\mathbf{X}:=\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\ldots,\mathbf{x}^{(K)}\}$, is generated by the following sampling process:

\mathbf{x}^{(1)} \leftarrow P_{0}(\bm{x}), \quad \mathbf{x}^{(k)} \leftarrow T_{k-1}(\bm{x}\mid\mathbf{x}^{(k-1)}) \quad (k=2,3,\ldots,K).   (14)

Using $N$ independent sample sequences, $S_{T}:=\{\mathbf{X}_{\mu}\mid\mu=1,2,\ldots,N\}$, obtained by conducting this ancestral sampling process $N$ times, equation (13) is approximated as

Z \approx Z_{\mathrm{AIS}}(S_{T}) := \frac{Z_{0}}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu}).   (15)

Using equation (15), the free energy is approximated as

F \approx F_{\mathrm{AIS}}(S_{T}) := -\ln Z_{\mathrm{AIS}}(S_{T}) = -\ln Z_{0} - \ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big).   (16)

$Z_{\mathrm{AIS}}(S_{T})$ is an unbiased estimator for the true partition function because its expectation equals $Z$: $\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})]=Z$, where $\mathbb{E}_{T}[\cdots]$ denotes the expectation with respect to the distribution over $S_{T}=\{\mathbf{X}_{\mu}\}$, i.e.,

\mathbb{E}_{T}[\cdots] := \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{X}_{\mu}}T(\mathbf{X}_{\mu})\Big)(\cdots).   (17)

However, $F_{\mathrm{AIS}}(S_{T})$ is not an unbiased estimator for the true free energy Burda et al. (2015). Using Jensen’s inequality,

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] = -\mathbb{E}_{T}[\ln Z_{\mathrm{AIS}}(S_{T})] \geq -\ln\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})] = F   (18)

is obtained, which implies that the expectation of $F_{\mathrm{AIS}}(S_{T})$ provides an upper bound of the true free energy.
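To make the above procedure concrete, the following is a minimal Python sketch of the AIS estimator in equations (14)–(16) for a small Ising model with $\pm 1$ spins (an illustrative toy assumption, not a model from this paper). It uses the tempered sequence of equation (5) with a uniform $P_{0}$, so that $E_{k}(\bm{x})=\beta_{k}E(\bm{x})$ and $Z_{0}=2^{n}$, and one asynchronous Gibbs sweep as $T_{k}$; the model is small enough that the exact $F$ can be obtained by enumeration for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K, N = 10, 100, 500                      # spins, annealing steps, sample sequences
J = np.triu(rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)), 1)
J = J + J.T                                 # symmetric couplings, zero diagonal
b = rng.uniform(-0.1, 0.1, n)

def energy(x):                              # E(x) = -sum_{i<j} J_ij x_i x_j - sum_i b_i x_i
    return -0.5 * x @ J @ x - b @ x

def gibbs_sweep(x, beta):                   # transition T_k: one sweep leaving P_k invariant
    for i in range(n):
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * (J[i] @ x + b[i])))
        x[i] = 1.0 if rng.random() < p_up else -1.0
    return x

betas = np.linspace(0.0, 1.0, K + 1)        # 0 = beta_0 < ... < beta_K = 1, eq. (5)
log_w = np.zeros(N)
for mu in range(N):
    x = rng.choice([-1.0, 1.0], n)          # x^(1) ~ uniform P_0, eq. (14)
    for k in range(1, K + 1):
        log_w[mu] += -(betas[k] - betas[k - 1]) * energy(x)   # ln w_k(x^(k)), eq. (12)
        x = gibbs_sweep(x, betas[k])
log_Z0 = n * np.log(2.0)                    # Z_0 = 2^n for the uniform P_0
m = log_w.max()                             # log-sum-exp for a stable eq. (16)
F_AIS = -(log_Z0 + m + np.log(np.mean(np.exp(log_w - m))))

# exact F by enumerating all 2^n states (feasible only at this toy size)
states = np.array([[1.0 if (s >> i) & 1 else -1.0 for i in range(n)]
                   for s in range(2 ** n)])
F_exact = -np.log(np.sum(np.exp([-energy(s) for s in states])))
print(F_AIS, F_exact)
```

Because the importance weights can vary over many orders of magnitude, the log-sum-exp form above evaluates equation (16) stably.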

III Proposed Method

Suppose that $\bm{x}$ is divided into two mutually disjoint sets, $\bm{v}:=\{v_{i}\mid i\in V\}$ and $\bm{h}:=\{h_{j}\mid j\in H\}$, i.e., $U=V\cup H$ and $V\cap H=\emptyset$; therefore, $P_{k}(\bm{x})$ can be considered the joint distribution over $\bm{v}$ and $\bm{h}$: $P_{k}(\bm{x})=P_{k}(\bm{v},\bm{h})$. The method discussed in Section II is regarded as AIS based on this joint distribution. In contrast, the mAIS proposed in this section is regarded as AIS based on a marginal distribution of $P_{k}(\bm{v},\bm{h})$.

III.1 Marginalized annealed importance sampling

For the sequence of the distributions in equation (3), we introduce the sequence of their marginal distributions as

\{P_{k}(\bm{v})\mid k=0,1,\ldots,K\},   (19)

where $P_{k}(\bm{v})$ is the marginal distribution of $P_{k}(\bm{x})$, i.e.,

P_{k}(\bm{v}) = \sum_{\bm{h}}P_{k}(\bm{x}) = \frac{1}{Z_{k}}\exp\big(-E_{V}(\bm{v},k)\big),   (20)

where

E_{V}(\bm{v},k) := -\ln\sum_{\bm{h}}\exp\big(-E_{k}(\bm{x})\big)   (21)

is the energy function of the marginal distribution. By definition, the partition functions of $P_{k}(\bm{x})$ and $P_{k}(\bm{v})$ are identical.

Based on the sequence of the marginal distributions in equation (19), in a manner similar to equation (7), the annealing process from the initial state $\bm{v}^{(1)}$ to the final state $\bm{v}^{(K)}$ can be defined as

\tau(\bm{V}) := \Big(\prod_{k=1}^{K-1}\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)}),   (22)

where $\bm{V}:=\{\bm{v}^{(k)}\mid k=1,2,\ldots,K\}$ and $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ is the transition probability on the marginal distribution in equation (20). The transition probability satisfies the following condition:

P_{k}(\bm{v}^{\prime}) = \sum_{\bm{v}}\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{v}).   (23)

Using almost the same derivation as that used to obtain equation (13), we derive

Z_{K} = Z_{0}\sum_{\bm{V}}\Lambda(\bm{V})\tau(\bm{V}),   (24)

where $\Lambda(\bm{V})$ is the importance weight of mAIS, defined as

\Lambda(\bm{V}) := \prod_{k=1}^{K}\lambda_{k}(\bm{v}^{(k)}),   (25)

where

\lambda_{k}(\bm{v}) := \exp\big(-E_{V}(\bm{v},k)+E_{V}(\bm{v},k-1)\big).   (26)

In mAIS, the free energy can be evaluated using a technique similar to the derivation of equation (16). The sequence of sample points, $\mathbf{V}:=\{\mathbf{v}^{(1)},\mathbf{v}^{(2)},\ldots,\mathbf{v}^{(K)}\}$, is generated by ancestral sampling on $\tau(\bm{V})$:

\mathbf{v}^{(1)} \leftarrow P_{0}(\bm{v}), \quad \mathbf{v}^{(k)} \leftarrow \tau_{k-1}(\bm{v}\mid\mathbf{v}^{(k-1)}) \quad (k=2,3,\ldots,K).   (27)

By conducting this ancestral sampling process $N$ times, $N$ independent sample sequences $S_{\tau}:=\{\mathbf{V}_{\mu}\mid\mu=1,2,\ldots,N\}$ are obtained; subsequently, using these $N$ sample sequences, equation (24) is approximated as

Z \approx Z_{\mathrm{mAIS}}(S_{\tau}) := \frac{Z_{0}}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu}).   (28)

Therefore, the free energy is evaluated as

F \approx F_{\mathrm{mAIS}}(S_{\tau}) := -\ln Z_{\mathrm{mAIS}}(S_{\tau}) = -\ln Z_{0} - \ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big).   (29)

Similar to $Z_{\mathrm{AIS}}(S_{T})$, $Z_{\mathrm{mAIS}}(S_{\tau})$ is an unbiased estimator for the true partition function because $\mathbb{E}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]=Z$, where $\mathbb{E}_{\tau}[\cdots]$ denotes the expectation with respect to the distribution over $S_{\tau}=\{\mathbf{V}_{\mu}\}$, i.e.,

\mathbb{E}_{\tau}[\cdots] := \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)(\cdots).   (30)

Similar to equation (18), based on Jensen’s inequality,

\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F   (31)

can be obtained; therefore, $F_{\mathrm{mAIS}}(S_{\tau})$ is not an unbiased estimator for the true free energy, as is the case with $F_{\mathrm{AIS}}(S_{T})$.

In the aforementioned discussions, we have considered mAIS based on the sequence of $P_{k}(\bm{v})$ (we refer to this mAIS as “mAIS-V”). The opposite mAIS (referred to as “mAIS-H”), which is based on the sequence of

P_{k}(\bm{h}) = \sum_{\bm{v}}P_{k}(\bm{x}) = \frac{1}{Z_{k}}\exp\big(-E_{H}(\bm{h},k)\big),   (32)

where

E_{H}(\bm{h},k) := -\ln\sum_{\bm{v}}\exp\big(-E_{k}(\bm{x})\big),   (33)

can be constructed in the same manner. However, we do not need to consider mAIS-H separately because the difference between the two methods lies only in the way the variables are labeled; i.e., mAIS-V is identified with mAIS-H by exchanging the labels of the variable sets. Therefore, we regard mAIS-V as mAIS in this paper.

Theoretically, mAIS can be applied to any model. However, whether it is practical or not strongly depends on the structures of $\{P_{k}(\bm{x})\}$ because mAIS requires the marginalization expressed in equation (20). mAIS can be efficiently applied to a bipartite-type MRF (i.e., an MRF defined on a bipartite undirected graph), as discussed in Section III.3.

Suppose that we have a model to which mAIS can be efficiently applied. The free energy of the model can then be evaluated using either of the two methods, i.e., $F_{\mathrm{AIS}}(S_{T})$ or $F_{\mathrm{mAIS}}(S_{\tau})$. Intuitively, mAIS seems to be better because the sample space of the variables is reduced through the marginalization, which is similar to the concept of Rao–Blackwellization Liu (2001).

Here, we briefly explain Rao–Blackwellization. Suppose that we wish to evaluate the expectation of $f(\bm{x}_{t})$ over a distribution $Q(\bm{x})$: $m_{t}:=\sum_{\bm{x}}f(\bm{x}_{t})Q(\bm{x})$, where $\bm{x}_{t}\subseteq\bm{x}$. Based on a simple sampling approximation (or Monte Carlo integration), it is approximated by

m_{t} \approx \frac{1}{M}\sum_{\nu=1}^{M}f(\mathbf{x}_{t}^{(\nu)}),   (34)

where $\{\mathbf{x}^{(\nu)}\mid\nu=1,2,\ldots,M\}$ is the sample set generated from $Q(\bm{x})$, and $\mathbf{x}_{t}^{(\nu)}$ is the corresponding subset of $\mathbf{x}^{(\nu)}$. Using the decomposition $Q(\bm{x})=Q(\bm{x}_{a}\mid\bm{x}_{b})Q(\bm{x}_{b})$, where $\bm{x}_{a}$ and $\bm{x}_{b}$ are disjoint sets with $\bm{x}=\bm{x}_{a}\cup\bm{x}_{b}$, $m_{t}$ can be transformed as

m_{t} = \sum_{\bm{x}_{b}}m_{t}^{\dagger}(\bm{x}_{b})Q(\bm{x}_{b}),   (35)

where $m_{t}^{\dagger}(\bm{x}_{b})$ is the conditional expectation expressed as

m_{t}^{\dagger}(\bm{x}_{b}) := \sum_{\bm{x}_{a}}f(\bm{x}_{t})Q(\bm{x}_{a}\mid\bm{x}_{b}).

Therefore, based on equation (35), $m_{t}$ is approximated by

m_{t} \approx \frac{1}{M}\sum_{\nu=1}^{M}m_{t}^{\dagger}(\mathbf{x}_{b}^{(\nu)}),   (36)

where $\mathbf{x}_{b}^{(\nu)}$ is the corresponding subset of $\mathbf{x}^{(\nu)}$. The Rao–Blackwell theorem guarantees that the estimator in equation (36) is more effective than that in equation (34) in terms of variance. The transformation in equation (35) is called Rao–Blackwellization. Applications of Rao–Blackwellization have been widely investigated; for example, an application to MRFs, spatial Monte Carlo integration, has been developed Yasuda (2015); Yasuda and Uchizawa (2021); Yasuda and Sekimoto (2021).

The relation between equations (34) and (36) resembles that between AIS and mAIS. Thus, we can expect mAIS to be more effective. In fact, this expectation can be justified under a certain condition, which is discussed in the next section.
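As a concrete toy illustration of this variance reduction (an assumed example, not from the paper), the sketch below estimates $m_{t}=\mathbb{E}[x_{a}^{2}]$ with $x_{b}\sim\mathcal{N}(0,1)$ and $x_{a}\mid x_{b}\sim\mathcal{N}(x_{b},1)$, for which the conditional expectation $m_{t}^{\dagger}(x_{b})=x_{b}^{2}+1$ is available in closed form and the true value is $m_{t}=2$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 200, 2000
naive, rb = np.zeros(trials), np.zeros(trials)
for t in range(trials):
    xb = rng.normal(0.0, 1.0, M)            # x_b ~ Q(x_b) = N(0, 1)
    xa = xb + rng.normal(0.0, 1.0, M)       # x_a | x_b ~ N(x_b, 1)
    naive[t] = np.mean(xa ** 2)             # eq. (34): plain Monte Carlo
    rb[t] = np.mean(xb ** 2 + 1.0)          # eq. (36): averaging E[x_a^2 | x_b]
print(naive.var(), rb.var())                # the Rao-Blackwellized variance is smaller
```

Here $\mathrm{Var}(x_{a}^{2})=8$ while $\mathrm{Var}(x_{b}^{2}+1)=2$, so the Rao–Blackwellized estimator has roughly a quarter of the variance of the naive one.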

III.2 Statistical efficiency of mAIS

First, we compare the statistical efficiencies of the two estimators for the partition function, i.e., $Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$. The variance of $Z_{\mathrm{AIS}}(S_{T})$ is

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] = \frac{1}{N}\Big(Z_{0}^{2}\sum_{\bm{X}}W(\bm{X})^{2}T(\bm{X})-Z^{2}\Big),   (37)

where $\mathbb{V}_{T}[A]:=\mathbb{E}_{T}[A^{2}]-\mathbb{E}_{T}[A]^{2}$. The variance of $Z_{\mathrm{mAIS}}(S_{\tau})$ is

\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{1}{N}\Big(Z_{0}^{2}\sum_{\bm{V}}\Lambda(\bm{V})^{2}\tau(\bm{V})-Z^{2}\Big),   (38)

where $\mathbb{V}_{\tau}[A]:=\mathbb{E}_{\tau}[A^{2}]-\mathbb{E}_{\tau}[A]^{2}$. As discussed in Sections II and III.1, both estimators are unbiased; therefore, the estimator with the smaller variance is more effective.

Figure 1: State transition, from the previous state to the next state, according to equation (39).

Assume that the transition probability of AIS is expressed as

T_{k}(\bm{x}^{\prime}\mid\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\tau_{k}(\bm{v}^{\prime}\mid\bm{v}),   (39)

where $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ is the transition probability of mAIS satisfying equation (23), and

P_{k}(\bm{h}\mid\bm{v}) = \frac{P_{k}(\bm{x})}{\sum_{\bm{h}}P_{k}(\bm{x})} = \frac{\exp\big(-E_{k}(\bm{x})\big)}{\exp\big(-E_{V}(\bm{v},k)\big)}   (40)

is the conditional distribution on $P_{k}(\bm{x})$. According to equation (39), the state transition from $\bm{x}$ to $\bm{x}^{\prime}$ is performed as follows: first, the state of $\bm{v}$ is updated to $\bm{v}^{\prime}$ according to $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$; subsequently, the state of $\bm{h}$ is updated to $\bm{h}^{\prime}$ according to $P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})$ (see figure 1). The use of the transition probability in equation (39) is acceptable in AIS because it satisfies the condition in equation (6) as follows:

\sum_{\bm{x}}P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\underbrace{\sum_{\bm{v}}\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{v})}_{P_{k}(\bm{v}^{\prime})} = P_{k}(\bm{x}^{\prime}).

Based on this assumption, the following theorem can be obtained.

Theorem 1

$Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$, defined in equations (15) and (28), respectively, are unbiased estimators for $Z$. For the two estimators, the inequality

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] \geq \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]

always holds if the transition probabilities of AIS and mAIS satisfy the condition expressed in equation (39).

Based on this theorem, it is ensured that $Z_{\mathrm{mAIS}}(S_{\tau})$ is statistically more efficient. The proof of this theorem is provided in Appendix A.1. As mentioned in the final part of Appendix A.1, the claim of this theorem is essentially the same as that of the Rao–Blackwell theorem.

Next, the statistical efficiencies of the two estimators for the free energy, i.e., $F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$, are compared. As expressed in equations (18) and (31), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]$ are upper bounds of the true free energy. Based on the assumption of equation (39), the following theorem can be obtained.

Theorem 2

$F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$, defined in equations (16) and (29), respectively, are biased estimators for $F$. For the two estimators, the inequality

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] \geq \mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F

always holds if the transition probabilities of AIS and mAIS satisfy the condition expressed in equation (39).

Based on this theorem, although $F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$ are biased, the bias of $F_{\mathrm{mAIS}}(S_{\tau})$ is ensured to be smaller. The proof of this theorem is presented in Appendix A.2. The aforementioned two theorems do not depend on the details of the design of $E_{k}(\bm{x})$ or on the values of $N$ and $K$; only the condition in equation (39) is required.

We have proved that mAIS is statistically more efficient than AIS with regard to both the partition function and the free energy when the condition of equation (39) is satisfied. The remaining issue is determining whether this condition is satisfied in practical situations. The condition significantly limits the modeling of $T_{k}(\bm{x}^{\prime}\mid\bm{x})$; e.g., the standard (synchronous or asynchronous) Gibbs sampling on $P_{k}(\bm{x})$ does not satisfy the condition because it uses the states of all variables to obtain the subsequent states. As explained in Section III.3, the state transition according to equation (39) is natural and practical when $P_{k}(\bm{x})$ is a bipartite-type MRF.

In Theorems 1 and 2, the condition of equation (39) is not a necessary condition but a sufficient one. This implies the possibility of relaxing the condition; however, this remains an open problem.

III.3 Application to bipartite-type MRFs

The proposed mAIS is applicable when the marginalization in equation (20) is feasible. Owing to this restriction, the range of application of mAIS is limited. As discussed below, mAIS is practical on bipartite-type MRFs, which include important applications such as RBM and DBM (a DBM can be viewed as a bipartite-type MRF Gao and Duan (2017)). Moreover, a square-grid-type MRF can be regarded as a bipartite-type MRF (see figure 2).

Figure 2: A square-grid graph can be regarded as a bipartite graph.

Suppose that $P_{k}(\bm{x})$ is a bipartite-type MRF and that $\bm{v}$ and $\bm{h}$ are the variables of the two layers. In this case, the energy function $E_{k}(\bm{x})$ can be expressed as

E_{k}(\bm{x}) = -\sum_{i\in V}\phi_{i}^{(k)}(v_{i}) - \sum_{j\in H}\phi_{j}^{(k)}(h_{j}) - \sum_{i\in V}\sum_{j\in H}\psi_{i,j}^{(k)}(v_{i},h_{j}),   (41)

where $\phi_{i}^{(k)}$ and $\psi_{i,j}^{(k)}$ are one- and two-variable functions determined by the design of $\{P_{k}(\bm{x})\}$; the marginalization in equation (20) is feasible and leads to

E_{V}(\bm{v},k) = -\sum_{i\in V}\phi_{i}^{(k)}(v_{i}) - \sum_{j\in H}\ln\sum_{h_{j}}\exp\Big(\phi_{j}^{(k)}(h_{j}) + \sum_{i\in V}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big).   (42)

On the bipartite-type MRF, layer-wise blocked Gibbs sampling based on the conditional distributions $P_{k}(\bm{h}\mid\bm{v})$ and $P_{k}(\bm{v}\mid\bm{h})$ can be easily implemented, where $P_{k}(\bm{h}\mid\bm{v})$ is the conditional distribution presented in equation (40) and

P_{k}(\bm{v}\mid\bm{h}) = \frac{P_{k}(\bm{x})}{\sum_{\bm{v}}P_{k}(\bm{x})} = \frac{\exp\big(-E_{k}(\bm{x})\big)}{\exp\big(-E_{H}(\bm{h},k)\big)},   (43)

because the $v_{i}$ are conditionally independent of each other in $P_{k}(\bm{v}\mid\bm{h})$ and the $h_{j}$ are likewise conditionally independent of each other in $P_{k}(\bm{h}\mid\bm{v})$, i.e.,

P_{k}(\bm{v}\mid\bm{h}) \propto \prod_{i\in V}\exp\Big(\phi_{i}^{(k)}(v_{i}) + \sum_{j\in H}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big),
P_{k}(\bm{h}\mid\bm{v}) \propto \prod_{j\in H}\exp\Big(\phi_{j}^{(k)}(h_{j}) + \sum_{i\in V}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big).

Based on the layer-wise blocked Gibbs sampling, the transition probability of mAIS can be modeled as

\tau_{k}(\bm{v}^{\prime}\mid\bm{v}) = \sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v}),   (44)

which satisfies the condition in equation (23) as follows:

\sum_{\bm{v}}\sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v})P_{k}(\bm{v}) = P_{k}(\bm{v}^{\prime}).

This transition probability can be considered a collapsed Gibbs sampling Liu (1994). From equations (39) and (44), when the transition probability of AIS is modeled by

T_{k}(\bm{x}^{\prime}\mid\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v}),   (45)

the theoretical results presented in Section III.2 (Theorems 1 and 2) are applicable. Approximating the expectations in equations (44) and (45) using a sampling approximation with one sample point leads to the widely used sampling procedure, i.e., blocked Gibbs sampling, on bipartite-type MRFs (see figure 3).

Figure 3: State transitions on a bipartite-type MRF based on the one-sample approximation of equations (44) and (45): the transitions of (a) mAIS and (b) AIS. The transitions between the layers (illustrated by the arrows) are blocked Gibbs sampling based on $P_{k}(\bm{h}\mid\bm{v})$ and $P_{k}(\bm{v}\mid\bm{h})$.
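To illustrate the transitions in figure 3, the following sketch implements the one-sample versions of equations (44) and (45) for an assumed $\pm 1$ bipartite MRF with bilinear pair potentials $\psi_{i,j}(v_{i},h_{j})=w_{i,j}v_{i}h_{j}$ at a fixed annealing step (the setup and all names are illustrative; the index $k$ is suppressed).

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative toy parameters: +-1 spins, bilinear couplings w, biases bv, bh
nv, nh = 4, 6
w = rng.normal(0.0, 0.3, (nv, nh))
bv, bh = np.zeros(nv), np.zeros(nh)

def sample_h_given_v(v):
    # h_j are conditionally independent given v: P(h_j = +1 | v)
    p = 1.0 / (1.0 + np.exp(-2.0 * (bh + v @ w)))
    return np.where(rng.random(nh) < p, 1.0, -1.0)

def sample_v_given_h(h):
    # v_i are conditionally independent given h: P(v_i = +1 | h)
    p = 1.0 / (1.0 + np.exp(-2.0 * (bv + w @ h)))
    return np.where(rng.random(nv) < p, 1.0, -1.0)

def tau(v):
    # one-sample version of tau_k(v' | v), eq. (44): mAIS tracks only v
    return sample_v_given_h(sample_h_given_v(v))

def T(v):
    # one-sample version of T_k(x' | x), eq. (45): AIS also refreshes the hidden layer
    v_new = tau(v)
    return v_new, sample_h_given_v(v_new)   # h' ~ P_k(h' | v'), eq. (39)
```

mAIS needs only tau, whereas AIS appends the extra draw $\bm{h}^{\prime}\sim P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})$, matching the structure of equation (39).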

IV Numerical Experiments

The performance of mAIS (i.e., mAIS-V) is examined using numerical experiments on an RBM whose energy function is defined as

E(\bm{x}) = -\frac{1}{T}\Big(\sum_{i\in V}b_{i}v_{i} + \sum_{j\in H}c_{j}h_{j} + \sum_{i\in V}\sum_{j\in H}w_{i,j}v_{i}h_{j}\Big),   (46)

where $v_{i},h_{j}\in\{-1,+1\}$ and $T>0$ is the temperature of the RBM. The temperature controls the complexity of the distribution: a lower $T$ yields a more strongly clustered multimodal distribution. On the RBM, the distribution sequence $\{P_{k}(\bm{x})\}$ is designed according to equation (5), where $\beta_{k+1}=\beta_{k}+1/K$ and the initial distribution $P_{0}(\bm{x})$ is fixed to a uniform distribution; therefore, in this case, $E_{k}(\bm{x})$ and $E_{V}(\bm{v},k)$ in equations (41) and (42) are

E_{k}(\bm{x}) = -\frac{\beta_{k}}{T}\Big(\sum_{i\in V}b_{i}v_{i} + \sum_{j\in H}c_{j}h_{j} + \sum_{i\in V}\sum_{j\in H}w_{i,j}v_{i}h_{j}\Big)

and

E_{V}(\bm{v},k) = -\frac{\beta_{k}}{T}\sum_{i\in V}b_{i}v_{i} - \sum_{j\in H}\ln 2\cosh\frac{\beta_{k}}{T}\Big(c_{j} + \sum_{i\in V}w_{i,j}v_{i}\Big),

respectively.
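For concreteness, the following is a minimal Python sketch of the resulting mAIS estimator, equations (26)–(29), for this RBM under settings of the form used in Section IV.1 (uniform $P_{0}$, $\beta_{k}=k/K$, blocked Gibbs transitions); the seed and variable names are illustrative assumptions. For the uniform $P_{0}$, $Z_{0}=2^{|V|+|H|}$ and $E_{V}(\bm{v},0)=-|H|\ln 2$, both of which the code reproduces at $\beta=0$.

```python
import numpy as np

rng = np.random.default_rng(1)

nv, nh, K, N, T = 20, 40, 60, 1000, 1.0     # |V|, |H|, annealing steps, samples, temperature
b = rng.uniform(-0.001, 0.001, nv)
c = rng.uniform(-0.001, 0.001, nh)
w = rng.normal(0.0, 1.0 / np.sqrt(nv + nh), (nv, nh))   # variance 1/n, n = |V| + |H|

def E_V(v, beta):
    # marginalized energy E_V(v, k) of the +-1 RBM: the ln 2 cosh form above
    return -(beta / T) * (b @ v) - np.sum(np.log(2.0 * np.cosh((beta / T) * (c + v @ w))))

def blocked_gibbs(v, beta):
    # one draw from tau_k(v' | v) of eq. (44): h ~ P_k(h | v), then v' ~ P_k(v' | h)
    ph = 1.0 / (1.0 + np.exp(-2.0 * (beta / T) * (c + v @ w)))
    h = np.where(rng.random(nh) < ph, 1.0, -1.0)
    pv = 1.0 / (1.0 + np.exp(-2.0 * (beta / T) * (b + w @ h)))
    return np.where(rng.random(nv) < pv, 1.0, -1.0)

betas = np.linspace(0.0, 1.0, K + 1)
log_L = np.zeros(N)
for mu in range(N):
    v = np.where(rng.random(nv) < 0.5, 1.0, -1.0)        # v^(1) ~ P_0(v), eq. (27)
    for k in range(1, K + 1):
        log_L[mu] += -E_V(v, betas[k]) + E_V(v, betas[k - 1])   # ln lambda_k, eq. (26)
        v = blocked_gibbs(v, betas[k])
log_Z0 = (nv + nh) * np.log(2.0)            # Z_0 = 2^(|V|+|H|) for the uniform joint P_0
m = log_L.max()
f_mAIS = -(log_Z0 + m + np.log(np.mean(np.exp(log_L - m)))) / (nv + nh)
print(f_mAIS)                               # free energy per variable, f = F / n
```

Replacing blocked_gibbs with the joint transition of figure 3(b) and $\ln\lambda_{k}$ with $\ln w_{k}$ of equation (12) recovers plain AIS, which allows a direct comparison of the two estimators.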

IV.1 Quantitative investigation of theoretical results

The theoretical results obtained in Section III.2 revealed the qualitative effectiveness of mAIS. Here, we investigate the quantitative effectiveness of mAIS using numerical experiments. In the experiments, the $b_{i}$ and $c_{j}$ were independently drawn from a uniform distribution in $[-0.001,+0.001]$, and the $w_{i,j}$ were independently drawn from a normal distribution with mean zero and variance $1/n$; the sizes of the sample set and of $\bm{v}$ were fixed to $N=1000$ and $|V|=20$, respectively. The blocked Gibbs sampling shown in figure 3 was used for the state transitions of AIS and mAIS.

Figure 4: APEs versus $1/T$ for $K=10$, 30, and 60. These plots represent the average values over 1000 experiments.

Figure 4 shows the absolute percentage error (APE) between the true free energy, $f:=F/n$, and its approximation, $f_{\mathrm{app}}:=F_{\mathrm{app}}/n$, obtained using AIS or mAIS, against the inverse temperature $1/T$, where the APE is defined as

\mathrm{APE} := 100\times\frac{|f-f_{\mathrm{app}}|}{|f|}\ [\%].   (47)

In this experiment, the size of $\bm{h}$ was fixed to $|H|=40$. We observe that the APEs obtained using mAIS are always lower than those obtained using AIS, as supported by Theorem 1. mAIS is particularly effective in the high-temperature region.

Table 1: Comparison of the true free energy $f$ with the trial averages of its approximations, $\mathbb{E}_{\mathrm{trial}}[f_{\mathrm{app}}]$, obtained using AIS and mAIS, for several $T$ and $K$ values. The values are the averages obtained over 1000 experiments.
1/T     true       AIS K=10   AIS K=30   AIS K=60   mAIS K=10  mAIS K=30  mAIS K=60
0.2    -0.69759   -0.69759   -0.69759   -0.69759   -0.69759   -0.69759   -0.69759
0.4    -0.71084   -0.71083   -0.71084   -0.71084   -0.71084   -0.71084   -0.71084
0.8    -0.76376   -0.76373   -0.76375   -0.76375   -0.76375   -0.76376   -0.76376
1      -0.80306   -0.80300   -0.80305   -0.80305   -0.80305   -0.80306   -0.80306
2      -1.10992   -1.10782   -1.10977   -1.10987   -1.10963   -1.10987   -1.10990
4      -1.95593   -1.93328   -1.95345   -1.95545   -1.95143   -1.95535   -1.95575
8      -3.80281   -3.70846   -3.78813   -3.79920   -3.78087   -3.79925   -3.80186

Next, we compare the true free energy $f$ with the trial averages of its approximations, $\mathbb{E}_{\mathrm{trial}}[f_{\mathrm{app}}]$, where $\mathbb{E}_{\mathrm{trial}}[\cdots]$ was estimated based on the average over 30 trials. The trial average can be regarded as the approximation of $\mathbb{E}_{T}[\cdots]$ in AIS or $\mathbb{E}_{\tau}[\cdots]$ in mAIS. The results are listed in table 1. In this experiment, the size of $\bm{h}$ was fixed to $|H|=40$. The estimates of both AIS and mAIS are higher than the true free energy, and the estimates of mAIS are lower than those of AIS, as supported by Theorem 2.

IV.2 Dependency on the size ratio of two layers

We investigate the relative accuracy of mAIS compared with AIS as a function of the fraction $\alpha:=|H|/|V|$. The experiments in Section IV.1 correspond to the case of $\alpha=2$. Intuitively, mAIS is more efficient for a larger $\alpha$ because, as $\alpha$ increases, the fraction of the variables remaining after marginalization relative to $n$ decreases (in other words, the dimension of the space evaluated by the sampling approximation relatively shrinks). However, the theoretical results obtained in Section III.2 do not directly support this assumption. Thus, we check the relative accuracy for several $\alpha$ values using numerical experiments on the RBM. In the experiments, the relative accuracy is measured by the ratio of the APEs obtained using AIS and mAIS, i.e.,

r := \frac{\text{APE of AIS}}{\text{APE of mAIS}} = \frac{|f-f_{\mathrm{AIS}}|}{|f-f_{\mathrm{mAIS}}|}.

The relative accuracy of mAIS increases as $r$ increases. In the experiment, the sizes of the sample set and of $\bm{v}$ were fixed to $N=1000$ and $|V|=20$; the parameter setting of the RBM was the same as that in Section IV.1, and blocked Gibbs sampling was used for the state transitions of AIS and mAIS. The distributions of $\ln r$ for $1/T=0.5$, $1$, and $2$ are shown in figures 5–7, respectively. Each distribution in these figures was created by kernel density estimation Bishop (2006) using the results of 4000 experiments, with a Gaussian kernel of bandwidth (or smoothing parameter) 0.25. In all cases, the distributions shift in the positive direction as $\alpha$ increases, which implies that the relative accuracy monotonically improves with $\alpha$. This result suggests that we should use mAIS-V when $\alpha>1$ and mAIS-H when $\alpha<1$.

Figure 5: Distributions of $\ln r$ for several $\alpha$ values when $1/T=0.5$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
Figure 6: Distributions of $\ln r$ for several $\alpha$ values when $1/T=1$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
Figure 7: Distributions of $\ln r$ for several $\alpha$ values when $1/T=2$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
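For reference, the kernel density estimate used above admits a direct implementation; the following sketch (the function name, grid, and placeholder data are assumptions) applies a Gaussian kernel with bandwidth 0.25 to a set of $\ln r$ values.

```python
import numpy as np

def gaussian_kde(samples, grid, bw=0.25):
    # average of Gaussian bumps N(sample, bw^2) evaluated on the grid
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * bw * np.sqrt(2.0 * np.pi))

# usage on hypothetical ln(r) values standing in for the 4000 experimental results
log_r = np.random.default_rng(0).normal(0.5, 0.4, 4000)
grid = np.linspace(-2.0, 3.0, 200)
density = gaussian_kde(log_r, grid)
```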

IV.3 Convergence property for KK

Figure 8: Illustration of Roussel’s sampling method Roussel et al. (2021). The transition from $\bm{h}_{\mathrm{old}}$ to $\bm{h}_{\mathrm{new}}$ is performed using the MH method in the $\bm{h}$ space. The transitions between the two layers are the same as in blocked Gibbs sampling. This sampling procedure can be used as $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$.

The experiments in Sections IV.1 and IV.2 were conducted with relatively small $K$ values to emphasize the performance difference between AIS and mAIS under less-than-ideal conditions. In practice, however, $K$ is set to a sufficiently large value to obtain precise estimates; specifically, $K$ should be set larger than the mixing time (or relaxation time), which tends to increase with the complexity of the distribution. In this section, the convergence property of the APE in equation (47) over a wider range of $K$ is demonstrated using numerical experiments.

Detailed discussions of the mixing time in RBMs have been provided from both inference and learning perspectives Roussel et al. (2021); Decelle et al. (2021). Roussel et al. showed that the standard blocked Gibbs sampling shown in figure 3 is inefficient, in terms of the mixing time, in RBMs with highly clustered distributions, and they proposed a more efficient sampling method that combines blocked Gibbs sampling with the Metropolis–Hastings (MH) method, as illustrated in figure 8 Roussel et al. (2021). Roussel’s sampling method can be employed as the transition $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ in our framework. For the experiments, we used two different RBMs: the RBM with Gaussian interactions, which is the same model used in Sections IV.1 and IV.2, and the RBM with Hopfield-type interactions, in which $w_{i,j}=\xi_{i}^{(j)}/\sqrt{|V|}$, where $\xi_{i}^{(j)}\in\{-1,+1\}$ was determined uniformly at random. The Hopfield-type interactions can exhibit a clustered distribution Roussel et al. (2021). In both RBMs, the bias parameters $b_{i}$ and $c_{j}$ were independently drawn from a uniform distribution in $[-0.001,+0.001]$, and $|V|=20$ and $|H|=40$ were fixed. The size of the sample set was fixed to $N=10000$.

Figure 9: APEs versus $K$ on the RBM with the Gaussian interactions: (a) $1/T=2$ and (b) $1/T=20$. These plots represent the average values over 1000 experiments.
Figure 10: APEs versus $K$ on the RBM with the Hopfield-type interactions: (a) $1/T=2$ and (b) $1/T=20$. These plots represent the average values over 1000 experiments.

Figures 9 and 10 show the behavior of the APE as $K$ increases. In the figures, “AIS” and “mAIS” indicate the results obtained using the standard blocked Gibbs sampling, while “AIS+MH” and “mAIS+MH” indicate the results obtained using Roussel’s sampling method. mAIS exhibits faster convergence than AIS in all experiments, and Roussel’s sampling method improves the convergence properties of both AIS and mAIS, although the improvement is very small in figure 10(b). mAIS+MH achieved the best performance in all experiments. The convergence speed of mAIS appears to be higher than that of AIS+MH, which implies that the improvement provided by mAIS is larger than that provided by Roussel’s sampling method. Finally, we comment on the computational efficiency of the four methods: AIS, AIS+MH, mAIS, and mAIS+MH. The computational costs of the four methods are of the same order, namely $O(NK|V||H|)$. However, mAIS is remarkably faster than AIS+MH with regard to the CPU time (around 10 times faster in our implementation); the MH procedure is the bottleneck in Roussel’s sampling method, and thus mAIS+MH is as slow as AIS+MH.

V Summary and Discussion

In this study, we have proposed mAIS, which is identified as AIS on a marginalized model. The proposed method is a special case of AIS that can be used when a partial marginalization can be evaluated exactly; it is not an improved version of AIS. This study therefore contributes to the related studies of AIS listed in Section I. Two important theorems for mAIS (Theorems 1 and 2) have been obtained: when the transition probabilities of AIS and mAIS satisfy the condition of equation (39), (a) the partition-function estimator of mAIS is more accurate than that of AIS, and (b) the bias of the free-energy estimator of mAIS is lower than that of AIS. These results theoretically support the qualitative effectiveness of mAIS. Furthermore, the numerical results presented in Section IV empirically support the quantitative effectiveness of mAIS.

mAIS can be applied to any model in which a partial marginalization can be evaluated exactly. However, the effectiveness of the resultant mAIS relative to AIS cannot be theoretically guaranteed when the transition probabilities of AIS and mAIS do not satisfy the condition of equation (39). This condition significantly limits the applicability of the theory obtained in this study. As mentioned in Section III.2, the condition of equation (39) is sufficient for Theorems 1 and 2 and has not been identified as a necessary condition. If it is not necessary, it may be relaxed. A relaxation of the condition will be investigated in our future studies.

Appendix A Proofs

This appendix provides the proofs of the two theorems presented in Section III.2.

A.1 Proof of Theorem 1

As discussed in Sections II and III.1, $Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$ are the unbiased estimators of $Z$ because

\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})] = \mathbb{E}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = Z.

The relationship between the annealing processes in equations (7) and (22) is considered first. Based on the assumption of equation (39), the marginal distribution of $T(\bm{X})$ can be expressed as

\sum_{\bm{H}}T(\bm{X}) = \sum_{\bm{H}}\Big(\prod_{k=1}^{K-1}P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)},\bm{h}^{(1)}) = \Big(\prod_{k=1}^{K-1}\sum_{\bm{h}^{(k+1)}}P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)\sum_{\bm{h}^{(1)}}P_{0}(\bm{v}^{(1)},\bm{h}^{(1)}) = \Big(\prod_{k=1}^{K-1}\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)}).

Therefore,

\tau(\bm{V}) = \sum_{\bm{H}}T(\bm{X})   (48)

is obtained, where $\bm{H}:=\{\bm{h}^{(k)}\mid k=1,2,\ldots,K\}$ and $\sum_{\bm{H}}$ denotes the multiple summation over all realizations of $\bm{H}$. Moreover, the assumption of equation (39) leads to

\frac{T(\bm{X})}{\tau(\bm{V})} = \Big(\prod_{k=1}^{K-1}\frac{P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})}{\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})}\Big)\frac{P_{0}(\bm{v}^{(1)},\bm{h}^{(1)})}{P_{0}(\bm{v}^{(1)})} = \prod_{k=1}^{K}P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)}).   (49)

From equations (48) and (49), the annealing process of AIS can be factorized as

T(\bm{X}) = T(\bm{H}\mid\bm{V})\tau(\bm{V}),   (50)

where $T(\bm{H}\mid\bm{V}):=\prod_{k=1}^{K}P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)})$ is the conditional distribution over $\bm{H}$. Using equation (50), the difference between the variances can be expressed as

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] - \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{Z_{0}^{2}}{N}\Big(\sum_{\bm{X}}W(\bm{X})^{2}T(\bm{X}) - \sum_{\bm{V}}\Lambda(\bm{V})^{2}\tau(\bm{V})\Big) = \frac{Z_{0}^{2}}{N}\sum_{\bm{V}}\Big(\sum_{\bm{H}}W(\bm{X})^{2}T(\bm{H}\mid\bm{V}) - \Lambda(\bm{V})^{2}\Big)\tau(\bm{V}).   (51)

From equations (12) and (26),

\lambda_{k}(\bm{v}^{(k)}) = \frac{\sum_{\bm{h}^{(k)}}\exp(-E_{k}(\bm{x}^{(k)}))}{\sum_{\bm{h}^{(k)}}\exp(-E_{k-1}(\bm{x}^{(k)}))} = \sum_{\bm{h}^{(k)}}\frac{\exp(-E_{k}(\bm{x}^{(k)}))}{\exp(-E_{k-1}(\bm{x}^{(k)}))}\frac{\exp(-E_{k-1}(\bm{x}^{(k)}))}{\sum_{\bm{h}^{(k)}}\exp(-E_{k-1}(\bm{x}^{(k)}))} = \sum_{\bm{h}^{(k)}}w_{k}(\bm{x}^{(k)})P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)})

is obtained, which leads to

\Lambda(\bm{V}) = \sum_{\bm{H}}W(\bm{X})T(\bm{H}\mid\bm{V}).   (52)

Substituting equation (52) into (51) yields

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] - \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{Z_{0}^{2}}{N}\sum_{\bm{X}}\big(W(\bm{X})-\Lambda(\bm{V})\big)^{2}T(\bm{X}) \geq 0.

Therefore, $\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})]\geq\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]$ is ensured. \square

The above result can be understood in terms of Rao–Blackwellization. $Z_{\mathrm{AIS}}(S_{T})$ is the simple sampling approximation of equation (13). From equation (50), equation (13) can be rewritten as

Z_{K} = Z_{0}\sum_{\bm{V}}\sum_{\bm{H}}W(\bm{X})T(\bm{H}\mid\bm{V})\tau(\bm{V}).

From this equation and equation (52), equation (24) can be seen as the expectation of the conditional expectation of $W(\bm{X})$. mAIS is therefore identified with a Rao–Blackwellization of AIS, and hence $\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})]\geq\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]$ is ensured by the Rao–Blackwell theorem.

A.2 Proof of Theorem 2

As shown in equations (18) and (31), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]$ are upper bounds of the true free energy $F$, i.e., $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]\geq F$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]\geq F$. Using equation (50), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ can be rewritten as

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] = -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{X}_{\mu}}T(\mathbf{X}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big) = -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big).   (53)

Using Jensen’s inequality, the following inequality is obtained:

\Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big) \leq \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})W(\mathbf{X}_{\mu})\Big) = \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big).   (54)

Equation (52) is used in the derivation of the last line of equation (54). From equations (53) and (54),

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] \geq -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big) = \mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F

is ensured; here, the final inequality is obtained from equation (31). \square

Acknowledgment

This work was partially supported by JSPS KAKENHI (grant numbers 18K11459, 18H03303, 20K23342, 21K11778, and 21K17804).

References

  • Ackley et al. (1985) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science 9, 147 (1985).
  • Smolensky (1986) P. Smolensky, Parallel distributed processing: Explorations in the microstructure of cognition 1, 194 (1986).
  • Hinton (2002) G. E. Hinton, Neural Computation 14, 1771 (2002).
  • Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. E. Hinton, In Proc. of the 12th International Conference on Artificial Intelligence and Statistics, 448 (2009).
  • Roudi et al. (2009) Y. Roudi, E. Aurell, and J. Hertz, Frontiers in Computational Neuroscience 3, 1 (2009).
  • Decelle and Furtlehner (2021) A. Decelle and C. Furtlehner, Chinese Physics B 30, 040202 (2021).
  • Chen et al. (2018) J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, Physical Review B 97, 085104 (2018).
  • Nomura and Imada (2021) Y. Nomura and M. Imada, Physical Review X 11, 031034 (2021).
  • Torlai et al. (2018) G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nature Physics 14, 447 (2018).
  • Carleo and Troyer (2017) G. Carleo and M. Troyer, Science 355, 602 (2017).
  • Gao and Duan (2017) X. Gao and L. Duan, Nature Communications 8, 1 (2017).
  • Neal (2001) R. M. Neal, Statistics and Computing 11, 125 (2001).
  • Salakhutdinov and Murray (2008) R. Salakhutdinov and I. Murray, In Proc. of the 25th International Conference on Machine Learning 25, 872 (2008).
  • Geman and Geman (1984) S. Geman and D. Geman, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721 (1984).
  • Jarzynski (1997) C. Jarzynski, Phys. Rev. E 56, 5018 (1997).
  • Burda et al. (2015) Y. Burda, R. B. Grosse, and R. Salakhutdinov, In Proc. of the 18th International Conference on Artificial Intelligence and Statistics, 102 (2015).
  • Liu et al. (2015) Q. Liu, A. Ihler, J. Peng, and J. Fisher, In Proc. of the 31st Conference on Uncertainty in Artificial Intelligence, 514 (2015).
  • Carlson et al. (2016) D. E. Carlson, P. Stinson, A. Pakman, and L. Paninski, In Proc. of the 33rd International Conference on Machine Learning 6, 4248 (2016).
  • Mazzanti and Romero (2020) F. Mazzanti and E. Romero, arXiv:2007.11926 (2020).
  • Krause et al. (2020) O. Krause, A. Fischer, and C. Igel, Artificial Intelligence 278, 103195 (2020).
  • Yasuda and Sekimoto (2021) M. Yasuda and K. Sekimoto, Physical Review E 103, 052118 (2021).
  • Caselle et al. (2022) M. Caselle, E. Cellini, A. Nada, and M. Panero, Journal of High Energy Physics 2022, 015 (2022).
  • Liu (2001) J. S. Liu, Monte Carlo strategies in scientific computing (Springer, 2001).
  • Yasuda (2015) M. Yasuda, Journal of the Physical Society of Japan 84, 034001 (2015).
  • Yasuda and Uchizawa (2021) M. Yasuda and K. Uchizawa, Neural Computation 33, 1037 (2021).
  • Liu (1994) J. S. Liu, Journal of the American Statistical Association 89, 958 (1994).
  • Bishop (2006) C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006).
  • Roussel et al. (2021) C. Roussel, S. Cocco, and R. Monasson, Phys. Rev. E 104, 034109 (2021).
  • Decelle et al. (2021) A. Decelle, C. Furtlehner, and B. Seoane, In Proc. of the 35th Conference on Neural Information Processing Systems (2021).