
Free Energy Evaluation Using Marginalized Annealed Importance Sampling

Muneki Yasuda [email protected], Graduate School of Science and Engineering, Yamagata University, Japan.    Chako Takahashi, Graduate School of Science and Engineering, Yamagata University, Japan.
Abstract

The evaluation of the free energy of a stochastic model is considered a significant issue in various fields of physics and machine learning. However, exact free energy evaluation is computationally infeasible because the free energy expression includes an intractable partition function. Annealed importance sampling (AIS) is a type of importance sampling based on the Markov chain Monte Carlo method that is similar to simulated annealing and can effectively approximate the free energy. This study proposes an AIS-based approach, referred to as marginalized AIS (mAIS). The statistical efficiency of mAIS is investigated in detail from theoretical and numerical perspectives. Based on the investigation, it is proved that mAIS is more effective than AIS under a certain condition.

free energy evaluation, annealed importance sampling, Rao–Blackwellization, Ising model, restricted Boltzmann machine

I Introduction

The evaluation of the free energy (i.e., the negative log of the partition function) of a stochastic model is considered to be an important issue in various fields of physics. Moreover, the free energy plays an important role in machine learning models, e.g., the Boltzmann machine Ackley et al. (1985). The Boltzmann machine and its variants, e.g., the restricted Boltzmann machine (RBM) Smolensky (1986); Hinton (2002) and the deep Boltzmann machine (DBM) Salakhutdinov and Hinton (2009), have been actively investigated in the fields of machine learning and physics Roudi et al. (2009); Decelle and Furtlehner (2021); Chen et al. (2018); Nomura and Imada (2021); Torlai et al. (2018); Carleo and Troyer (2017); Gao and Duan (2017). However, an exact free energy evaluation is computationally infeasible because the expression for the free energy includes an intractable partition function.

Annealed importance sampling (AIS) Neal (2001) is a type of importance sampling based on the Markov chain Monte Carlo (MCMC) method that is similar to simulated annealing and can effectively approximate the free energy Neal (2001); Salakhutdinov and Murray (2008). In AIS, sequential sampling (or ancestral sampling) is executed from a tractable initial distribution to the target distribution, in which the transitions between the distributions are performed using, for example, Gibbs sampling Geman and Geman (1984). The AIS-based free energy evaluation is essentially the same as the method proposed by Jarzynski (the so-called Jarzynski equality) Jarzynski (1997). Several researchers have addressed the development of AIS Burda et al. (2015); Liu et al. (2015); Carlson et al. (2016); Mazzanti and Romero (2020); Krause et al. (2020); Yasuda and Sekimoto (2021). Recently, the relationship between AIS and the normalizing flows proposed in the deep learning field has been investigated Caselle et al. (2022).

In AIS, the free energy is estimated as follows: first, the estimator of the partition function is obtained as the sample average of the importance weights collected during the sequential sampling process; subsequently, the free energy estimator is obtained as the negative log of the partition function estimator. It is known that the partition function estimator is unbiased, whereas the free energy estimator is biased Burda et al. (2015). To evaluate the free energy, we propose an AIS-based approach that we refer to as marginalized AIS (mAIS). The concept of mAIS is simple: mAIS corresponds to AIS in a marginalized model, i.e., mAIS can be regarded as a special case of AIS. Therefore, the basic principle of mAIS is similar to that of AIS. Suppose that our target model represents a distribution in an $n$-dimensional space. In AIS, we perform a sampling procedure in the $n$-dimensional space to evaluate the free energy. In mAIS, by contrast, the space in which the sampling procedure is performed is smaller because the dimension of the model is reduced through marginalization. Intuitively, mAIS therefore seems to be more effective than AIS. This intuition is valid under a certain condition, under which the following two statements can be proved (see Section III.2): (1) the partition function estimator obtained from mAIS is more accurate than that obtained from AIS, and (2) the bias of the free energy estimator obtained from mAIS is lower than that of the estimator obtained from AIS. However, as discussed in Section III.2, the condition assumed in these statements limits the range in which the effectiveness of mAIS is assured. As discussed in Section III.3, this condition can be satisfied when AIS and mAIS are used on Markov random fields (MRFs) defined on bipartite graphs, in which a layer-wise marginalization can be performed. Bipartite-type MRFs include important applications, e.g., RBM and DBM.

The remainder of this paper is organized as follows: Section II explains the AIS-based free energy evaluation. Section III introduces mAIS: Section III.1 details mAIS, Section III.2 describes its theoretical analysis, and Section III.3 discusses its application to bipartite-type MRFs. Section IV numerically demonstrates the validity of mAIS. Section V summarizes the study, along with discussions.

II Free Energy Evaluation using Annealed Importance Sampling

Consider a distribution with $n$ random variables, $\bm{x}:=\{x_{i}\in\mathcal{X}_{i}\mid i\in U:=\{1,2,\ldots,n\}\}$, as

P_{\mathrm{model}}(\bm{x}) := \frac{1}{Z}\exp\big(-E(\bm{x})\big),   (1)

where $\mathcal{X}_{i}$ is the continuous or discrete sample space of $x_{i}$ and $E(\bm{x})$ is the energy function (or the Hamiltonian); $Z$ is the partition function and is defined as

Z := \sum_{\bm{x}}\exp\big(-E(\bm{x})\big),   (2)

where $\sum_{\bm{x}}:=\sum_{x_{1}\in\mathcal{X}_{1}}\sum_{x_{2}\in\mathcal{X}_{2}}\cdots\sum_{x_{n}\in\mathcal{X}_{n}}$ denotes the multiple summation over all realizations of $\bm{x}$. When $\mathcal{X}_{i}$ is a continuous space, $\sum_{x_{i}}$ is replaced by integration over $\mathcal{X}_{i}$.

This study evaluates the free energy $F:=-\ln Z$. The evaluation of the free energy is infeasible because it requires the evaluation of the intractable partition function. AIS is a type of importance sampling based on the MCMC method that is similar to simulated annealing and can evaluate the free energy Neal (2001); Salakhutdinov and Murray (2008). This free energy evaluation method is essentially the same as the method proposed by Jarzynski Jarzynski (1997). The AIS-based free energy evaluation is briefly explained in the following.

First, we design a sequence of distributions as

\{P_{k}(\bm{x})\mid k=0,1,\ldots,K\},   (3)

where $P_{K}(\bm{x})=P_{\mathrm{model}}(\bm{x})$, and each $P_{k}(\bm{x})$ is expressed as

P_{k}(\bm{x}) := \frac{1}{Z_{k}}\exp\big(-E_{k}(\bm{x})\big),   (4)

where $Z_{k}$ is the partition function of the $k$th distribution. Because $P_{K}(\bm{x})=P_{\mathrm{model}}(\bm{x})$, $E_{K}(\bm{x})=E(\bm{x})$ and $Z_{K}=Z$ are assumed. In this sequence, the initial distribution $P_{0}(\bm{x})$ is set to a tractable distribution (e.g., a uniform distribution). For example, the sequence expressed as

P_{k}(\bm{x}) \propto P_{0}(\bm{x})^{1-\beta_{k}}P_{\mathrm{model}}(\bm{x})^{\beta_{k}},   (5)

for $0=\beta_{0}<\beta_{1}<\cdots<\beta_{K}=1$, is frequently used, where $\beta_{k}$ corresponds to the annealing temperature. The $k$th distribution expressed by equation (5) corresponds to equation (4) with $E_{k}(\bm{x})=-(1-\beta_{k})\ln P_{0}(\bm{x})+\beta_{k}E(\bm{x})$. However, we pursue the following arguments without specifying the form of $E_{k}(\bm{x})$.

On the $k$th distribution, a transition probability $T_{k}(\bm{x}^{\prime}\mid\bm{x})$, which satisfies the condition

P_{k}(\bm{x}^{\prime}) = \sum_{\bm{x}}T_{k}(\bm{x}^{\prime}\mid\bm{x})P_{k}(\bm{x}),   (6)

is defined. Using the transition probability, the annealing process from the initial state $\bm{x}^{(1)}$ to the final state $\bm{x}^{(K)}$ is defined as

T(\bm{X}) := \Big(\prod_{k=1}^{K-1}T_{k}(\bm{x}^{(k+1)}\mid\bm{x}^{(k)})\Big)P_{0}(\bm{x}^{(1)}),   (7)

where $\bm{X}:=\{\bm{x}^{(k)}\mid k=1,2,\ldots,K\}$. $T(\bm{X})$ denotes the joint distribution over $\bm{X}$. For $T(\bm{X})$, the corresponding “reverse” process,

\tilde{T}(\bm{X}) := \Big(\prod_{k=1}^{K-1}\tilde{T}_{k}(\bm{x}^{(k)}\mid\bm{x}^{(k+1)})\Big)P_{K}(\bm{x}^{(K)}),   (8)

is defined, where

\tilde{T}_{k}(\bm{x}\mid\bm{x}^{\prime}) := \frac{T_{k}(\bm{x}^{\prime}\mid\bm{x})P_{k}(\bm{x})}{P_{k}(\bm{x}^{\prime})}   (9)

is the reverse transition. $T(\bm{X})$ expresses the transition process from $\bm{x}^{(1)}$ to $\bm{x}^{(K)}$, whereas its reverse process $\tilde{T}(\bm{X})$ expresses the reverse transition from $\bm{x}^{(K)}$ to $\bm{x}^{(1)}$. Because the normalization condition, $\sum_{\bm{x}}\tilde{T}_{k}(\bm{x}\mid\bm{x}^{\prime})=1$, is ensured by the condition in equation (6), $\tilde{T}(\bm{X})$ can be considered the joint distribution over $\bm{X}$; it satisfies

\sum_{\bm{x}^{(1)}}\sum_{\bm{x}^{(2)}}\cdots\sum_{\bm{x}^{(K-1)}}\tilde{T}(\bm{X}) = P_{K}(\bm{x}^{(K)}).

AIS is regarded as importance sampling in which $\tilde{T}(\bm{X})$ and $T(\bm{X})$ are the target and the corresponding proposal distributions, respectively. Using equations (7), (8), and (9), the ratio between the target and proposal distributions, i.e., the (normalized) importance weight, is obtained as

\frac{\tilde{T}(\bm{X})}{T(\bm{X})} = \frac{Z_{0}}{Z_{K}}W(\bm{X}),   (10)

where

W(\bm{X}) := \prod_{k=1}^{K}w_{k}(\bm{x}^{(k)}),   (11)
w_{k}(\bm{x}) := \exp\big(-E_{k}(\bm{x})+E_{k-1}(\bm{x})\big).   (12)

Here, $Z_{0}$ is the partition function of $P_{0}(\bm{x})$ (i.e., a tractable distribution) and $Z_{K}=Z$ is the true partition function of the objective distribution. From the normalization condition of $\tilde{T}(\bm{X})$, the relation,

1 = \sum_{\bm{X}}\frac{\tilde{T}(\bm{X})}{T(\bm{X})}T(\bm{X}),

is obtained. Substituting equation (10) into this relation yields

Z_{K} = Z_{0}\sum_{\bm{X}}W(\bm{X})T(\bm{X}),   (13)

where $\sum_{\bm{X}}:=\sum_{\bm{x}^{(1)}}\sum_{\bm{x}^{(2)}}\cdots\sum_{\bm{x}^{(K)}}$ denotes the multiple summation over all realizations of $\bm{X}$.

Sampling from the proposal distribution can be performed using ancestral sampling, from $\bm{x}^{(1)}$ to $\bm{x}^{(K)}$, on $T(\bm{X})$; i.e., the sequence of sample points, $\mathbf{X}:=\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\ldots,\mathbf{x}^{(K)}\}$, is generated by the following sampling process:

\mathbf{x}^{(1)} \leftarrow P_{0}(\bm{x}), \quad \mathbf{x}^{(k)} \leftarrow T_{k-1}(\bm{x}\mid\mathbf{x}^{(k-1)}) \quad (k=2,3,\ldots,K).   (14)

Using $N$ independent sample sequences, $S_{T}:=\{\mathbf{X}_{\mu}\mid\mu=1,2,\ldots,N\}$, obtained by conducting this ancestral sampling process $N$ times, equation (13) is approximated as

Z \approx Z_{\mathrm{AIS}}(S_{T}) := \frac{Z_{0}}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu}).   (15)

Using equation (15), the free energy is approximated as

F \approx F_{\mathrm{AIS}}(S_{T}) := -\ln Z_{\mathrm{AIS}}(S_{T}) = -\ln Z_{0} - \ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big).   (16)

$Z_{\mathrm{AIS}}(S_{T})$ is an unbiased estimator for the true partition function because its expectation equals $Z$: $\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})]=Z$, where $\mathbb{E}_{T}[\cdots]$ denotes the expectation with respect to the distribution over $S_{T}=\{\mathbf{X}_{\mu}\}$, i.e.,

\mathbb{E}_{T}[\cdots] := \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{X}_{\mu}}T(\mathbf{X}_{\mu})\Big)(\cdots).   (17)

However, $F_{\mathrm{AIS}}(S_{T})$ is not an unbiased estimator for the true free energy Burda et al. (2015). Using Jensen’s inequality,

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] = -\mathbb{E}_{T}[\ln Z_{\mathrm{AIS}}(S_{T})] \geq -\ln\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})] = F   (18)

is obtained, which implies that the expectation of $F_{\mathrm{AIS}}(S_{T})$ provides an upper bound of the true free energy.
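To make the above procedure concrete, the following is a minimal Python sketch of the AIS estimator in equations (14)–(16) for a small Ising model with $\pm 1$ spins (an illustrative toy assumption, not a model from this paper). It uses the tempered sequence of equation (5) with a uniform $P_{0}$, so that $E_{k}(\bm{x})=\beta_{k}E(\bm{x})$ and $Z_{0}=2^{n}$, and one asynchronous Gibbs sweep as $T_{k}$; the model is small enough that the exact $F$ can be obtained by enumeration for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K, N = 10, 100, 500                      # spins, annealing steps, sample sequences
J = np.triu(rng.normal(0.0, 1.0 / np.sqrt(n), (n, n)), 1)
J = J + J.T                                 # symmetric couplings, zero diagonal
b = rng.uniform(-0.1, 0.1, n)

def energy(x):                              # E(x) = -sum_{i<j} J_ij x_i x_j - sum_i b_i x_i
    return -0.5 * x @ J @ x - b @ x

def gibbs_sweep(x, beta):                   # transition T_k: one sweep leaving P_k invariant
    for i in range(n):
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * (J[i] @ x + b[i])))
        x[i] = 1.0 if rng.random() < p_up else -1.0
    return x

betas = np.linspace(0.0, 1.0, K + 1)        # 0 = beta_0 < ... < beta_K = 1, eq. (5)
log_w = np.zeros(N)
for mu in range(N):
    x = rng.choice([-1.0, 1.0], n)          # x^(1) ~ uniform P_0, eq. (14)
    for k in range(1, K + 1):
        log_w[mu] += -(betas[k] - betas[k - 1]) * energy(x)   # ln w_k(x^(k)), eq. (12)
        x = gibbs_sweep(x, betas[k])
log_Z0 = n * np.log(2.0)                    # Z_0 = 2^n for the uniform P_0
m = log_w.max()                             # log-sum-exp for a stable eq. (16)
F_AIS = -(log_Z0 + m + np.log(np.mean(np.exp(log_w - m))))

# exact F by enumerating all 2^n states (feasible only at this toy size)
states = np.array([[1.0 if (s >> i) & 1 else -1.0 for i in range(n)]
                   for s in range(2 ** n)])
F_exact = -np.log(np.sum(np.exp([-energy(s) for s in states])))
print(F_AIS, F_exact)
```

Because the importance weights can vary over many orders of magnitude, the log-sum-exp form above evaluates equation (16) stably.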

III Proposed Method

Suppose that $\bm{x}$ is divided into two mutually disjoint sets, $\bm{v}:=\{v_{i}\mid i\in V\}$ and $\bm{h}:=\{h_{j}\mid j\in H\}$, i.e., $U=V\cup H$ and $V\cap H=\emptyset$; therefore, $P_{k}(\bm{x})$ can be considered the joint distribution over $\bm{v}$ and $\bm{h}$: $P_{k}(\bm{x})=P_{k}(\bm{v},\bm{h})$. The method discussed in Section II is regarded as AIS based on this joint distribution. In contrast, the mAIS proposed in this section is regarded as AIS based on a marginal distribution of $P_{k}(\bm{v},\bm{h})$.

III.1 Marginalized annealed importance sampling

For the sequence of the distributions in equation (3), we introduce the sequence of their marginal distributions as

\{P_{k}(\bm{v})\mid k=0,1,\ldots,K\},   (19)

where $P_{k}(\bm{v})$ is the marginal distribution of $P_{k}(\bm{x})$, i.e.,

P_{k}(\bm{v}) = \sum_{\bm{h}}P_{k}(\bm{x}) = \frac{1}{Z_{k}}\exp\big(-E_{V}(\bm{v},k)\big),   (20)

where

E_{V}(\bm{v},k) := -\ln\sum_{\bm{h}}\exp\big(-E_{k}(\bm{x})\big)   (21)

is the energy function of the marginal distribution. By definition, the partition functions of $P_{k}(\bm{x})$ and $P_{k}(\bm{v})$ are identical.

Based on the sequence of the marginal distributions in equation (19), in a manner similar to equation (7), the annealing process from the initial state $\bm{v}^{(1)}$ to the final state $\bm{v}^{(K)}$ can be defined as

\tau(\bm{V}) := \Big(\prod_{k=1}^{K-1}\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)}),   (22)

where $\bm{V}:=\{\bm{v}^{(k)}\mid k=1,2,\ldots,K\}$ and $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ is the transition probability on the marginal distribution in equation (20). The transition probability satisfies the following condition:

P_{k}(\bm{v}^{\prime}) = \sum_{\bm{v}}\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{v}).   (23)

Using almost the same derivation as that used to obtain equation (13), we derive

Z_{K} = Z_{0}\sum_{\bm{V}}\Lambda(\bm{V})\tau(\bm{V}),   (24)

where $\Lambda(\bm{V})$ is the importance weight of mAIS, defined as

\Lambda(\bm{V}) := \prod_{k=1}^{K}\lambda_{k}(\bm{v}^{(k)}),   (25)

where

\lambda_{k}(\bm{v}) := \exp\big(-E_{V}(\bm{v},k)+E_{V}(\bm{v},k-1)\big).   (26)

In mAIS, the free energy can be evaluated using a technique similar to the derivation of equation (16). The sequence of sample points, $\mathbf{V}:=\{\mathbf{v}^{(1)},\mathbf{v}^{(2)},\ldots,\mathbf{v}^{(K)}\}$, is generated by ancestral sampling on $\tau(\bm{V})$:

\mathbf{v}^{(1)} \leftarrow P_{0}(\bm{v}), \quad \mathbf{v}^{(k)} \leftarrow \tau_{k-1}(\bm{v}\mid\mathbf{v}^{(k-1)}) \quad (k=2,3,\ldots,K).   (27)

By conducting this ancestral sampling process $N$ times, $N$ independent sample sequences $S_{\tau}:=\{\mathbf{V}_{\mu}\mid\mu=1,2,\ldots,N\}$ are obtained; subsequently, using these $N$ sample sequences, equation (24) is approximated as

Z \approx Z_{\mathrm{mAIS}}(S_{\tau}) := \frac{Z_{0}}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu}).   (28)

Therefore, the free energy is evaluated as

F \approx F_{\mathrm{mAIS}}(S_{\tau}) := -\ln Z_{\mathrm{mAIS}}(S_{\tau}) = -\ln Z_{0} - \ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big).   (29)

Similar to $Z_{\mathrm{AIS}}(S_{T})$, $Z_{\mathrm{mAIS}}(S_{\tau})$ is an unbiased estimator for the true partition function because $\mathbb{E}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]=Z$, where $\mathbb{E}_{\tau}[\cdots]$ denotes the expectation with respect to the distribution over $S_{\tau}=\{\mathbf{V}_{\mu}\}$, i.e.,

\mathbb{E}_{\tau}[\cdots] := \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)(\cdots).   (30)

Similar to equation (18), based on Jensen’s inequality,

\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F   (31)

can be obtained; therefore, $F_{\mathrm{mAIS}}(S_{\tau})$ is not an unbiased estimator for the true free energy, as is the case with $F_{\mathrm{AIS}}(S_{T})$.

In the aforementioned discussions, we have considered mAIS based on the sequence of $P_{k}(\bm{v})$ (we refer to this mAIS as “mAIS-V”). The opposite mAIS (referred to as “mAIS-H”), which is based on the sequence of

P_{k}(\bm{h}) = \sum_{\bm{v}}P_{k}(\bm{x}) = \frac{1}{Z_{k}}\exp\big(-E_{H}(\bm{h},k)\big),   (32)

where

E_{H}(\bm{h},k) := -\ln\sum_{\bm{v}}\exp\big(-E_{k}(\bm{x})\big),   (33)

can be constructed in the same manner. However, we do not need to consider mAIS-H separately because the difference between the two methods lies only in the way the variables are labeled; i.e., mAIS-V is identified with mAIS-H by exchanging the labels of the variable sets. Therefore, we regard mAIS-V as mAIS in this paper.

Theoretically, mAIS can be applied to any model. However, whether it is practical or not strongly depends on the structures of $\{P_{k}(\bm{x})\}$ because mAIS requires the marginalization expressed in equation (20). mAIS can be efficiently applied to a bipartite-type MRF (i.e., an MRF defined on a bipartite undirected graph), as discussed in Section III.3.

Suppose that we have a model to which mAIS can be efficiently applied. The free energy of the model can then be evaluated using either of the two methods, i.e., $F_{\mathrm{AIS}}(S_{T})$ or $F_{\mathrm{mAIS}}(S_{\tau})$. Intuitively, mAIS seems to be better because the sample space of the variables is reduced through the marginalization, which is similar to the concept of Rao–Blackwellization Liu (2001).

Here, we briefly explain Rao–Blackwellization. Suppose that we wish to evaluate the expectation of $f(\bm{x}_{t})$ over a distribution $Q(\bm{x})$: $m_{t}:=\sum_{\bm{x}}f(\bm{x}_{t})Q(\bm{x})$, where $\bm{x}_{t}\subseteq\bm{x}$. Based on a simple sampling approximation (or Monte Carlo integration), it is approximated by

m_{t} \approx \frac{1}{M}\sum_{\nu=1}^{M}f(\mathbf{x}_{t}^{(\nu)}),   (34)

where $\{\mathbf{x}^{(\nu)}\mid\nu=1,2,\ldots,M\}$ is the sample set generated from $Q(\bm{x})$, and $\mathbf{x}_{t}^{(\nu)}$ is the corresponding subset of $\mathbf{x}^{(\nu)}$. Using the decomposition $Q(\bm{x})=Q(\bm{x}_{a}\mid\bm{x}_{b})Q(\bm{x}_{b})$, where $\bm{x}_{a}$ and $\bm{x}_{b}$ are disjoint sets with $\bm{x}=\bm{x}_{a}\cup\bm{x}_{b}$, $m_{t}$ can be transformed as

m_{t} = \sum_{\bm{x}_{b}}m_{t}^{\dagger}(\bm{x}_{b})Q(\bm{x}_{b}),   (35)

where $m_{t}^{\dagger}(\bm{x}_{b})$ is the conditional expectation expressed as

m_{t}^{\dagger}(\bm{x}_{b}) := \sum_{\bm{x}_{a}}f(\bm{x}_{t})Q(\bm{x}_{a}\mid\bm{x}_{b}).

Therefore, based on equation (35), $m_{t}$ is approximated by

m_{t} \approx \frac{1}{M}\sum_{\nu=1}^{M}m_{t}^{\dagger}(\mathbf{x}_{b}^{(\nu)}),   (36)

where $\mathbf{x}_{b}^{(\nu)}$ is the corresponding subset of $\mathbf{x}^{(\nu)}$. The Rao–Blackwell theorem guarantees that the estimator in equation (36) is more effective than that in equation (34) in terms of variance. The transformation in equation (35) is called Rao–Blackwellization. Applications of Rao–Blackwellization have been widely investigated; for example, an application to MRFs, spatial Monte Carlo integration, has been developed Yasuda (2015); Yasuda and Uchizawa (2021); Yasuda and Sekimoto (2021).

The relation between equations (34) and (36) resembles that between AIS and mAIS. Thus, we can expect mAIS to be more effective. In fact, this expectation can be justified under a certain condition, which is discussed in the next section.
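As a concrete toy illustration of this variance reduction (an assumed example, not from the paper), the sketch below estimates $m_{t}=\mathbb{E}[x_{a}^{2}]$ with $x_{b}\sim\mathcal{N}(0,1)$ and $x_{a}\mid x_{b}\sim\mathcal{N}(x_{b},1)$, for which the conditional expectation $m_{t}^{\dagger}(x_{b})=x_{b}^{2}+1$ is available in closed form and the true value is $m_{t}=2$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 200, 2000
naive, rb = np.zeros(trials), np.zeros(trials)
for t in range(trials):
    xb = rng.normal(0.0, 1.0, M)            # x_b ~ Q(x_b) = N(0, 1)
    xa = xb + rng.normal(0.0, 1.0, M)       # x_a | x_b ~ N(x_b, 1)
    naive[t] = np.mean(xa ** 2)             # eq. (34): plain Monte Carlo
    rb[t] = np.mean(xb ** 2 + 1.0)          # eq. (36): averaging E[x_a^2 | x_b]
print(naive.var(), rb.var())                # the Rao-Blackwellized variance is smaller
```

Here $\mathrm{Var}(x_{a}^{2})=8$ while $\mathrm{Var}(x_{b}^{2}+1)=2$, so the Rao–Blackwellized estimator has roughly a quarter of the variance of the naive one.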

III.2 Statistical efficiency of mAIS

First, we compare the statistical efficiencies of the two estimators for the partition function, i.e., $Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$. The variance of $Z_{\mathrm{AIS}}(S_{T})$ is

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] = \frac{1}{N}\Big(Z_{0}^{2}\sum_{\bm{X}}W(\bm{X})^{2}T(\bm{X})-Z^{2}\Big),   (37)

where $\mathbb{V}_{T}[A]:=\mathbb{E}_{T}[A^{2}]-\mathbb{E}_{T}[A]^{2}$. The variance of $Z_{\mathrm{mAIS}}(S_{\tau})$ is

\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{1}{N}\Big(Z_{0}^{2}\sum_{\bm{V}}\Lambda(\bm{V})^{2}\tau(\bm{V})-Z^{2}\Big),   (38)

where $\mathbb{V}_{\tau}[A]:=\mathbb{E}_{\tau}[A^{2}]-\mathbb{E}_{\tau}[A]^{2}$. As discussed in Sections II and III.1, both estimators are unbiased; therefore, the estimator with the smaller variance is more effective.

Figure 1: State transition, from the previous state to the next state, according to equation (39).

Assume that the transition probability of AIS is expressed as

T_{k}(\bm{x}^{\prime}\mid\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\tau_{k}(\bm{v}^{\prime}\mid\bm{v}),   (39)

where $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ is the transition probability of mAIS satisfying equation (23), and

P_{k}(\bm{h}\mid\bm{v}) = \frac{P_{k}(\bm{x})}{\sum_{\bm{h}}P_{k}(\bm{x})} = \frac{\exp\big(-E_{k}(\bm{x})\big)}{\exp\big(-E_{V}(\bm{v},k)\big)}   (40)

is the conditional distribution on $P_{k}(\bm{x})$. According to equation (39), the state transition from $\bm{x}$ to $\bm{x}^{\prime}$ is performed as follows: first, the state of $\bm{v}$ is updated to $\bm{v}^{\prime}$ according to $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$; subsequently, the state of $\bm{h}$ is updated to $\bm{h}^{\prime}$ according to $P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})$ (see figure 1). The use of the transition probability in equation (39) is acceptable in AIS because it satisfies the condition in equation (6) as follows:

\sum_{\bm{x}}P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\underbrace{\sum_{\bm{v}}\tau_{k}(\bm{v}^{\prime}\mid\bm{v})P_{k}(\bm{v})}_{P_{k}(\bm{v}^{\prime})} = P_{k}(\bm{x}^{\prime}).

Based on this assumption, the following theorem can be obtained.

Theorem 1

$Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$, defined in equations (15) and (28), respectively, are unbiased estimators for $Z$. For the two estimators, the inequality

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] \geq \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]

always holds if the transition probabilities of AIS and mAIS satisfy the condition expressed in equation (39).

Based on this theorem, it is ensured that $Z_{\mathrm{mAIS}}(S_{\tau})$ is statistically more efficient. The proof of this theorem is provided in Appendix A.1. As mentioned in the final part of Appendix A.1, the claim of this theorem is essentially the same as that of the Rao–Blackwell theorem.

Next, the statistical efficiencies of the two estimators for the free energy, i.e., $F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$, are compared. As expressed in equations (18) and (31), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]$ are upper bounds of the true free energy. Based on the assumption of equation (39), the following theorem can be obtained.

Theorem 2

$F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$, defined in equations (16) and (29), respectively, are biased estimators for $F$. For the two estimators, the inequality

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] \geq \mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F

always holds if the transition probabilities of AIS and mAIS satisfy the condition expressed in equation (39).

Based on this theorem, although $F_{\mathrm{AIS}}(S_{T})$ and $F_{\mathrm{mAIS}}(S_{\tau})$ are biased, the bias of $F_{\mathrm{mAIS}}(S_{\tau})$ is ensured to be smaller. The proof of this theorem is presented in Appendix A.2. The aforementioned two theorems do not depend on the details of the design of $E_{k}(\bm{x})$ or on the values of $N$ and $K$; only the condition in equation (39) is required.

We have proved that mAIS is statistically more efficient than AIS with regard to both the partition function and the free energy when the condition of equation (39) is satisfied. The remaining issue is determining whether this condition is satisfied in practical situations. The condition significantly limits the modeling of $T_{k}(\bm{x}^{\prime}\mid\bm{x})$; e.g., the standard (synchronous or asynchronous) Gibbs sampling on $P_{k}(\bm{x})$ does not satisfy the condition because it uses the states of all variables to obtain the subsequent states. As explained in Section III.3, the state transition according to equation (39) is natural and practical when $P_{k}(\bm{x})$ is a bipartite-type MRF.

In Theorems 1 and 2, the condition of equation (39) is not a necessary condition but a sufficient one. This implies the possibility of relaxing the condition; however, this remains an open problem.

III.3 Application to bipartite-type MRFs

The proposed mAIS is applicable when the marginalization in equation (20) is feasible. Owing to this restriction, the range of application of mAIS is limited. As discussed below, mAIS is practical on bipartite-type MRFs, which include important applications such as RBM and DBM (a DBM can be viewed as a bipartite-type MRF Gao and Duan (2017)). Moreover, a square-grid-type MRF can be regarded as a bipartite-type MRF (see figure 2).

Figure 2: A square-grid graph can be regarded as a bipartite graph.

Suppose that $P_{k}(\bm{x})$ is a bipartite-type MRF and that $\bm{v}$ and $\bm{h}$ are the variables of the two layers. In this case, the energy function $E_{k}(\bm{x})$ can be expressed as

E_{k}(\bm{x}) = -\sum_{i\in V}\phi_{i}^{(k)}(v_{i}) - \sum_{j\in H}\phi_{j}^{(k)}(h_{j}) - \sum_{i\in V}\sum_{j\in H}\psi_{i,j}^{(k)}(v_{i},h_{j}),   (41)

where $\phi_{i}^{(k)}$ and $\psi_{i,j}^{(k)}$ are one- and two-variable functions determined by the design of $\{P_{k}(\bm{x})\}$; the marginalization in equation (20) is feasible and leads to

E_{V}(\bm{v},k) = -\sum_{i\in V}\phi_{i}^{(k)}(v_{i}) - \sum_{j\in H}\ln\sum_{h_{j}}\exp\Big(\phi_{j}^{(k)}(h_{j}) + \sum_{i\in V}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big).   (42)

On the bipartite-type MRF, layer-wise blocked Gibbs sampling based on the conditional distributions $P_{k}(\bm{h}\mid\bm{v})$ and $P_{k}(\bm{v}\mid\bm{h})$ can be easily implemented, where $P_{k}(\bm{h}\mid\bm{v})$ is the conditional distribution presented in equation (40) and

P_{k}(\bm{v}\mid\bm{h}) = \frac{P_{k}(\bm{x})}{\sum_{\bm{v}}P_{k}(\bm{x})} = \frac{\exp\big(-E_{k}(\bm{x})\big)}{\exp\big(-E_{H}(\bm{h},k)\big)},   (43)

because the $v_{i}$ are conditionally independent of each other in $P_{k}(\bm{v}\mid\bm{h})$ and the $h_{j}$ are likewise conditionally independent of each other in $P_{k}(\bm{h}\mid\bm{v})$, i.e.,

P_{k}(\bm{v}\mid\bm{h}) \propto \prod_{i\in V}\exp\Big(\phi_{i}^{(k)}(v_{i}) + \sum_{j\in H}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big),
P_{k}(\bm{h}\mid\bm{v}) \propto \prod_{j\in H}\exp\Big(\phi_{j}^{(k)}(h_{j}) + \sum_{i\in V}\psi_{i,j}^{(k)}(v_{i},h_{j})\Big).

Based on the layer-wise blocked Gibbs sampling, the transition probability of mAIS can be modeled as

\tau_{k}(\bm{v}^{\prime}\mid\bm{v}) = \sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v}),   (44)

which satisfies the condition in equation (23) as follows:

\sum_{\bm{v}}\sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v})P_{k}(\bm{v}) = P_{k}(\bm{v}^{\prime}).

This transition probability can be considered a collapsed Gibbs sampling Liu (1994). From equations (39) and (44), when the transition probability of AIS is modeled by

T_{k}(\bm{x}^{\prime}\mid\bm{x}) = P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})\sum_{\bm{h}}P_{k}(\bm{v}^{\prime}\mid\bm{h})P_{k}(\bm{h}\mid\bm{v}),   (45)

the theoretical results presented in Section III.2 (Theorems 1 and 2) are applicable. Approximating the expectations in equations (44) and (45) using a sampling approximation with one sample point leads to the widely used sampling procedure, i.e., blocked Gibbs sampling, on bipartite-type MRFs (see figure 3).

Figure 3: State transitions on a bipartite-type MRF based on the one-sample approximation of equations (44) and (45): the transitions of (a) mAIS and (b) AIS. The transitions between the layers (illustrated by the arrows) are blocked Gibbs sampling based on $P_{k}(\bm{h}\mid\bm{v})$ and $P_{k}(\bm{v}\mid\bm{h})$.
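To illustrate the transitions in figure 3, the following sketch implements the one-sample versions of equations (44) and (45) for an assumed $\pm 1$ bipartite MRF with bilinear pair potentials $\psi_{i,j}(v_{i},h_{j})=w_{i,j}v_{i}h_{j}$ at a fixed annealing step (the setup and all names are illustrative; the index $k$ is suppressed).

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative toy parameters: +-1 spins, bilinear couplings w, biases bv, bh
nv, nh = 4, 6
w = rng.normal(0.0, 0.3, (nv, nh))
bv, bh = np.zeros(nv), np.zeros(nh)

def sample_h_given_v(v):
    # h_j are conditionally independent given v: P(h_j = +1 | v)
    p = 1.0 / (1.0 + np.exp(-2.0 * (bh + v @ w)))
    return np.where(rng.random(nh) < p, 1.0, -1.0)

def sample_v_given_h(h):
    # v_i are conditionally independent given h: P(v_i = +1 | h)
    p = 1.0 / (1.0 + np.exp(-2.0 * (bv + w @ h)))
    return np.where(rng.random(nv) < p, 1.0, -1.0)

def tau(v):
    # one-sample version of tau_k(v' | v), eq. (44): mAIS tracks only v
    return sample_v_given_h(sample_h_given_v(v))

def T(v):
    # one-sample version of T_k(x' | x), eq. (45): AIS also refreshes the hidden layer
    v_new = tau(v)
    return v_new, sample_h_given_v(v_new)   # h' ~ P_k(h' | v'), eq. (39)
```

mAIS needs only tau, whereas AIS appends the extra draw $\bm{h}^{\prime}\sim P_{k}(\bm{h}^{\prime}\mid\bm{v}^{\prime})$, matching the structure of equation (39).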

IV Numerical Experiments

The performance of mAIS (i.e., mAIS-V) is examined using numerical experiments on an RBM whose energy function is defined as

E(\bm{x}) = -\frac{1}{T}\Big(\sum_{i\in V}b_{i}v_{i} + \sum_{j\in H}c_{j}h_{j} + \sum_{i\in V}\sum_{j\in H}w_{i,j}v_{i}h_{j}\Big),   (46)

where $v_{i},h_{j}\in\{-1,+1\}$ and $T>0$ is the temperature of the RBM. The temperature controls the complexity of the distribution: a lower $T$ yields a more strongly clustered multimodal distribution. On the RBM, the distribution sequence $\{P_{k}(\bm{x})\}$ is designed according to equation (5), where $\beta_{k+1}=\beta_{k}+1/K$ and the initial distribution $P_{0}(\bm{x})$ is fixed to a uniform distribution; therefore, in this case, $E_{k}(\bm{x})$ and $E_{V}(\bm{v},k)$ in equations (41) and (42) are

E_{k}(\bm{x}) = -\frac{\beta_{k}}{T}\Big(\sum_{i\in V}b_{i}v_{i} + \sum_{j\in H}c_{j}h_{j} + \sum_{i\in V}\sum_{j\in H}w_{i,j}v_{i}h_{j}\Big)

and

E_{V}(\bm{v},k) = -\frac{\beta_{k}}{T}\sum_{i\in V}b_{i}v_{i} - \sum_{j\in H}\ln 2\cosh\frac{\beta_{k}}{T}\Big(c_{j} + \sum_{i\in V}w_{i,j}v_{i}\Big),

respectively.
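For concreteness, the following is a minimal Python sketch of the resulting mAIS estimator, equations (26)–(29), for this RBM under settings of the form used in Section IV.1 (uniform $P_{0}$, $\beta_{k}=k/K$, blocked Gibbs transitions); the seed and variable names are illustrative assumptions. For the uniform $P_{0}$, $Z_{0}=2^{|V|+|H|}$ and $E_{V}(\bm{v},0)=-|H|\ln 2$, both of which the code reproduces at $\beta=0$.

```python
import numpy as np

rng = np.random.default_rng(1)

nv, nh, K, N, T = 20, 40, 60, 1000, 1.0     # |V|, |H|, annealing steps, samples, temperature
b = rng.uniform(-0.001, 0.001, nv)
c = rng.uniform(-0.001, 0.001, nh)
w = rng.normal(0.0, 1.0 / np.sqrt(nv + nh), (nv, nh))   # variance 1/n, n = |V| + |H|

def E_V(v, beta):
    # marginalized energy E_V(v, k) of the +-1 RBM: the ln 2 cosh form above
    return -(beta / T) * (b @ v) - np.sum(np.log(2.0 * np.cosh((beta / T) * (c + v @ w))))

def blocked_gibbs(v, beta):
    # one draw from tau_k(v' | v) of eq. (44): h ~ P_k(h | v), then v' ~ P_k(v' | h)
    ph = 1.0 / (1.0 + np.exp(-2.0 * (beta / T) * (c + v @ w)))
    h = np.where(rng.random(nh) < ph, 1.0, -1.0)
    pv = 1.0 / (1.0 + np.exp(-2.0 * (beta / T) * (b + w @ h)))
    return np.where(rng.random(nv) < pv, 1.0, -1.0)

betas = np.linspace(0.0, 1.0, K + 1)
log_L = np.zeros(N)
for mu in range(N):
    v = np.where(rng.random(nv) < 0.5, 1.0, -1.0)        # v^(1) ~ P_0(v), eq. (27)
    for k in range(1, K + 1):
        log_L[mu] += -E_V(v, betas[k]) + E_V(v, betas[k - 1])   # ln lambda_k, eq. (26)
        v = blocked_gibbs(v, betas[k])
log_Z0 = (nv + nh) * np.log(2.0)            # Z_0 = 2^(|V|+|H|) for the uniform joint P_0
m = log_L.max()
f_mAIS = -(log_Z0 + m + np.log(np.mean(np.exp(log_L - m)))) / (nv + nh)
print(f_mAIS)                               # free energy per variable, f = F / n
```

Replacing blocked_gibbs with the joint transition of figure 3(b) and $\ln\lambda_{k}$ with $\ln w_{k}$ of equation (12) recovers plain AIS, which allows a direct comparison of the two estimators.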

IV.1 Quantitative investigation of theoretical results

The theoretical results obtained in Section III.2 revealed the qualitative effectiveness of mAIS. Here, we investigate the quantitative effectiveness of mAIS using numerical experiments. In the experiments, the $b_{i}$ and $c_{j}$ were independently drawn from a uniform distribution in $[-0.001,+0.001]$, and the $w_{i,j}$ were independently drawn from a normal distribution with mean zero and variance $1/n$; the sizes of the sample set and of $\bm{v}$ were fixed to $N=1000$ and $|V|=20$, respectively. The blocked Gibbs sampling shown in figure 3 was used for the state transitions of AIS and mAIS.

Figure 4: APEs versus $1/T$ for $K=10$, 30, and 60. These plots represent the average values over 1000 experiments.

Figure 4 shows the absolute percentage error (APE) between the true free energy, $f:=F/n$, and its approximation, $f_{\mathrm{app}}:=F_{\mathrm{app}}/n$, obtained using AIS or mAIS, against the inverse temperature $1/T$, where the APE is defined as

\mathrm{APE} := 100\times\frac{|f-f_{\mathrm{app}}|}{|f|}\ [\%].   (47)

In this experiment, the size of $\bm{h}$ was fixed to $|H|=40$. We observe that the APEs obtained using mAIS are always lower than those obtained using AIS, as supported by Theorem 1. mAIS is particularly effective in the high-temperature region.

Table 1: Comparison of the true free energy $f$ with the trial averages of its approximations, $\mathbb{E}_{\mathrm{trial}}[f_{\mathrm{app}}]$, obtained using AIS and mAIS, for several $T$ and $K$ values. The values are the averages obtained over 1000 experiments.
1/T     true       AIS K=10   AIS K=30   AIS K=60   mAIS K=10  mAIS K=30  mAIS K=60
0.2    -0.69759   -0.69759   -0.69759   -0.69759   -0.69759   -0.69759   -0.69759
0.4    -0.71084   -0.71083   -0.71084   -0.71084   -0.71084   -0.71084   -0.71084
0.8    -0.76376   -0.76373   -0.76375   -0.76375   -0.76375   -0.76376   -0.76376
1      -0.80306   -0.80300   -0.80305   -0.80305   -0.80305   -0.80306   -0.80306
2      -1.10992   -1.10782   -1.10977   -1.10987   -1.10963   -1.10987   -1.10990
4      -1.95593   -1.93328   -1.95345   -1.95545   -1.95143   -1.95535   -1.95575
8      -3.80281   -3.70846   -3.78813   -3.79920   -3.78087   -3.79925   -3.80186

Next, we compare the true free energy $f$ with the trial averages of its approximations, $\mathbb{E}_{\mathrm{trial}}[f_{\mathrm{app}}]$, where $\mathbb{E}_{\mathrm{trial}}[\cdots]$ was estimated based on the average over 30 trials. The trial average can be regarded as the approximation of $\mathbb{E}_{T}[\cdots]$ in AIS or $\mathbb{E}_{\tau}[\cdots]$ in mAIS. The results are listed in table 1. In this experiment, the size of $\bm{h}$ was fixed to $|H|=40$. The estimates of both AIS and mAIS are higher than the true free energy, and the estimates of mAIS are lower than those of AIS, as supported by Theorem 2.

IV.2 Dependency on the size ratio of two layers

We investigate the relative accuracy of mAIS compared with AIS as a function of the fraction $\alpha:=|H|/|V|$. The experiments in Section IV.1 correspond to the case of $\alpha=2$. Intuitively, mAIS is more efficient for a larger $\alpha$ because, as $\alpha$ increases, the fraction of the variables remaining after marginalization relative to $n$ decreases (in other words, the dimension of the space evaluated by the sampling approximation relatively shrinks). However, the theoretical results obtained in Section III.2 do not directly support this assumption. Thus, we check the relative accuracy for several $\alpha$ values using numerical experiments on the RBM. In the experiments, the relative accuracy is measured by the ratio of the APEs obtained using AIS and mAIS, i.e.,

r := \frac{\text{APE of AIS}}{\text{APE of mAIS}} = \frac{|f-f_{\mathrm{AIS}}|}{|f-f_{\mathrm{mAIS}}|}.

The relative accuracy of mAIS increases as $r$ increases. In the experiment, the sizes of the sample set and of $\bm{v}$ were fixed to $N=1000$ and $|V|=20$; the parameter setting of the RBM was the same as that in Section IV.1, and blocked Gibbs sampling was used for the state transitions of AIS and mAIS. The distributions of $\ln r$ for $1/T=0.5$, $1$, and $2$ are shown in figures 5–7, respectively. Each distribution in these figures was created by kernel density estimation Bishop (2006) using the results of 4000 experiments, with a Gaussian kernel of bandwidth (or smoothing parameter) 0.25. In all cases, the distributions shift in the positive direction as $\alpha$ increases, which implies that the relative accuracy monotonically improves with $\alpha$. This result suggests that we should use mAIS-V when $\alpha>1$ and mAIS-H when $\alpha<1$.

Figure 5: Distributions of $\ln r$ for several $\alpha$ values when $1/T=0.5$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
Figure 6: Distributions of $\ln r$ for several $\alpha$ values when $1/T=1$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
Figure 7: Distributions of $\ln r$ for several $\alpha$ values when $1/T=2$: (a) $K=10$, (b) $K=30$, and (c) $K=60$.
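For reference, the kernel density estimate used above admits a direct implementation; the following sketch (the function name, grid, and placeholder data are assumptions) applies a Gaussian kernel with bandwidth 0.25 to a set of $\ln r$ values.

```python
import numpy as np

def gaussian_kde(samples, grid, bw=0.25):
    # average of Gaussian bumps N(sample, bw^2) evaluated on the grid
    z = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * bw * np.sqrt(2.0 * np.pi))

# usage on hypothetical ln(r) values standing in for the 4000 experimental results
log_r = np.random.default_rng(0).normal(0.5, 0.4, 4000)
grid = np.linspace(-2.0, 3.0, 200)
density = gaussian_kde(log_r, grid)
```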

IV.3 Convergence property for KK

Figure 8: Illustration of Roussel’s sampling method Roussel et al. (2021). The transition from $\bm{h}_{\mathrm{old}}$ to $\bm{h}_{\mathrm{new}}$ is performed using the MH method in the $\bm{h}$ space. The transitions between the two layers are the same as in blocked Gibbs sampling. This sampling procedure can be used as $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$.

The experiments in Sections IV.1 and IV.2 were conducted with relatively small $K$ values to emphasize the performance difference between AIS and mAIS under less-than-ideal conditions. In practice, however, $K$ is set to a sufficiently large value to obtain precise estimates; specifically, $K$ should be set larger than the mixing time (or relaxation time), which tends to increase with the complexity of the distribution. In this section, the convergence property of the APE in equation (47) over a wider range of $K$ is demonstrated using numerical experiments.

Detailed discussions of the mixing time in RBMs have been provided from both inference and learning perspectives Roussel et al. (2021); Decelle et al. (2021). Roussel et al. showed that the standard blocked Gibbs sampling shown in figure 3 is inefficient, in terms of the mixing time, in RBMs with highly clustered distributions, and they proposed a more efficient sampling method that combines blocked Gibbs sampling with the Metropolis–Hastings (MH) method, as illustrated in figure 8 Roussel et al. (2021). Roussel’s sampling method can be employed as the transition $\tau_{k}(\bm{v}^{\prime}\mid\bm{v})$ in our framework. For the experiments, we used two different RBMs: the RBM with Gaussian interactions, which is the same model used in Sections IV.1 and IV.2, and the RBM with Hopfield-type interactions, in which $w_{i,j}=\xi_{i}^{(j)}/\sqrt{|V|}$, where $\xi_{i}^{(j)}\in\{-1,+1\}$ was determined uniformly at random. The Hopfield-type interactions can exhibit a clustered distribution Roussel et al. (2021). In both RBMs, the bias parameters $b_{i}$ and $c_{j}$ were independently drawn from a uniform distribution in $[-0.001,+0.001]$, and $|V|=20$ and $|H|=40$ were fixed. The size of the sample set was fixed to $N=10000$.

Figure 9: APEs versus $K$ on the RBM with the Gaussian interactions: (a) $1/T=2$ and (b) $1/T=20$. These plots represent the average values over 1000 experiments.
Figure 10: APEs versus $K$ on the RBM with the Hopfield-type interactions: (a) $1/T=2$ and (b) $1/T=20$. These plots represent the average values over 1000 experiments.

Figures 9 and 10 show the behavior of the APE as $K$ increases. In the figures, “AIS” and “mAIS” indicate the results obtained using the standard blocked Gibbs sampling, while “AIS+MH” and “mAIS+MH” indicate the results obtained using Roussel’s sampling method. mAIS exhibits faster convergence than AIS in all experiments, and Roussel’s sampling method improves the convergence properties of both AIS and mAIS, although the improvement is very small in figure 10(b). mAIS+MH achieved the best performance in all experiments. The convergence speed of mAIS appears to be higher than that of AIS+MH, which implies that the improvement provided by mAIS is larger than that provided by Roussel’s sampling method. Finally, we comment on the computational efficiency of the four methods: AIS, AIS+MH, mAIS, and mAIS+MH. The computational costs of the four methods are of the same order, namely $O(NK|V||H|)$. However, mAIS is remarkably faster than AIS+MH with regard to the CPU time (around 10 times faster in our implementation); the MH procedure is the bottleneck in Roussel’s sampling method, and thus mAIS+MH is as slow as AIS+MH.

V Summary and Discussion

In this study, we have proposed mAIS, which is identified as AIS on a marginalized model. The proposed method is a special case of AIS that can be used when a partial marginalization can be evaluated exactly; it is not an improved version of AIS. This study therefore contributes to the related studies of AIS listed in Section I. Two important theorems for mAIS (Theorems 1 and 2) have been obtained: when the transition probabilities of AIS and mAIS satisfy the condition of equation (39), (a) the partition-function estimator of mAIS is more accurate than that of AIS, and (b) the bias of the free-energy estimator of mAIS is lower than that of AIS. These results theoretically support the qualitative effectiveness of mAIS. Furthermore, the numerical results presented in Section IV empirically support the quantitative effectiveness of mAIS.

mAIS can be applied to any model in which a partial marginalization can be evaluated exactly. However, the effectiveness of the resultant mAIS relative to AIS cannot be theoretically guaranteed when the transition probabilities of AIS and mAIS do not satisfy the condition of equation (39). This condition significantly limits the applicability of the theory obtained in this study. As mentioned in Section III.2, the condition of equation (39) is sufficient for Theorems 1 and 2 and has not been identified as a necessary condition. If it is not necessary, it may be relaxed. A relaxation of the condition will be investigated in our future studies.

Appendix A Proofs

This appendix provides the proofs of the two theorems presented in Section III.2.

A.1 Proof of Theorem 1

As discussed in Sections II and III.1, $Z_{\mathrm{AIS}}(S_{T})$ and $Z_{\mathrm{mAIS}}(S_{\tau})$ are the unbiased estimators of $Z$ because

\mathbb{E}_{T}[Z_{\mathrm{AIS}}(S_{T})] = \mathbb{E}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = Z.

The relationship between the annealing processes in equations (7) and (22) is considered first. Based on the assumption of equation (39), the marginal distribution of $T(\bm{X})$ can be expressed as

\sum_{\bm{H}}T(\bm{X}) = \sum_{\bm{H}}\Big(\prod_{k=1}^{K-1}P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)},\bm{h}^{(1)}) = \Big(\prod_{k=1}^{K-1}\sum_{\bm{h}^{(k+1)}}P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)\sum_{\bm{h}^{(1)}}P_{0}(\bm{v}^{(1)},\bm{h}^{(1)}) = \Big(\prod_{k=1}^{K-1}\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})\Big)P_{0}(\bm{v}^{(1)}).

Therefore,

\tau(\bm{V}) = \sum_{\bm{H}}T(\bm{X})   (48)

is obtained, where $\bm{H}:=\{\bm{h}^{(k)}\mid k=1,2,\ldots,K\}$ and $\sum_{\bm{H}}$ denotes the multiple summation over all realizations of $\bm{H}$. Moreover, the assumption of equation (39) leads to

\frac{T(\bm{X})}{\tau(\bm{V})} = \Big(\prod_{k=1}^{K-1}\frac{P_{k}(\bm{h}^{(k+1)}\mid\bm{v}^{(k+1)})\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})}{\tau_{k}(\bm{v}^{(k+1)}\mid\bm{v}^{(k)})}\Big)\frac{P_{0}(\bm{v}^{(1)},\bm{h}^{(1)})}{P_{0}(\bm{v}^{(1)})} = \prod_{k=1}^{K}P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)}).   (49)

From equations (48) and (49), the annealing process of AIS can be factorized as

T(\bm{X}) = T(\bm{H}\mid\bm{V})\tau(\bm{V}),   (50)

where $T(\bm{H}\mid\bm{V}):=\prod_{k=1}^{K}P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)})$ is the conditional distribution over $\bm{H}$. Using equation (50), the difference between the variances can be expressed as

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] - \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{Z_{0}^{2}}{N}\Big(\sum_{\bm{X}}W(\bm{X})^{2}T(\bm{X}) - \sum_{\bm{V}}\Lambda(\bm{V})^{2}\tau(\bm{V})\Big) = \frac{Z_{0}^{2}}{N}\sum_{\bm{V}}\Big(\sum_{\bm{H}}W(\bm{X})^{2}T(\bm{H}\mid\bm{V}) - \Lambda(\bm{V})^{2}\Big)\tau(\bm{V}).   (51)

From equations (12) and (26),

\lambda_{k}(\bm{v}^{(k)}) = \frac{\sum_{\bm{h}^{(k)}}\exp(-E_{k}(\bm{x}^{(k)}))}{\sum_{\bm{h}^{(k)}}\exp(-E_{k-1}(\bm{x}^{(k)}))} = \sum_{\bm{h}^{(k)}}\frac{\exp(-E_{k}(\bm{x}^{(k)}))}{\exp(-E_{k-1}(\bm{x}^{(k)}))}\frac{\exp(-E_{k-1}(\bm{x}^{(k)}))}{\sum_{\bm{h}^{(k)}}\exp(-E_{k-1}(\bm{x}^{(k)}))} = \sum_{\bm{h}^{(k)}}w_{k}(\bm{x}^{(k)})P_{k-1}(\bm{h}^{(k)}\mid\bm{v}^{(k)})

is obtained, which leads to

\Lambda(\bm{V}) = \sum_{\bm{H}}W(\bm{X})T(\bm{H}\mid\bm{V}).   (52)

Substituting equation (52) into (51) yields

\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})] - \mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})] = \frac{Z_{0}^{2}}{N}\sum_{\bm{X}}\big(W(\bm{X})-\Lambda(\bm{V})\big)^{2}T(\bm{X}) \geq 0.

Therefore, $\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})]\geq\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]$ is ensured. \square

The above result can be understood in terms of Rao–Blackwellization. $Z_{\mathrm{AIS}}(S_{T})$ is the simple sampling approximation of equation (13). From equation (50), equation (13) can be rewritten as

Z_{K} = Z_{0}\sum_{\bm{V}}\sum_{\bm{H}}W(\bm{X})T(\bm{H}\mid\bm{V})\tau(\bm{V}).

From this equation and equation (52), equation (24) can be seen as the expectation of the conditional expectation of $W(\bm{X})$. mAIS is therefore identified with a Rao–Blackwellization of AIS, and hence $\mathbb{V}_{T}[Z_{\mathrm{AIS}}(S_{T})]\geq\mathbb{V}_{\tau}[Z_{\mathrm{mAIS}}(S_{\tau})]$ is ensured by the Rao–Blackwell theorem.

A.2 Proof of Theorem 2

As shown in equations (18) and (31), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]$ are upper bounds of the true free energy $F$, i.e., $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]\geq F$ and $\mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})]\geq F$. Using equation (50), $\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})]$ can be rewritten as

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] = -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{X}_{\mu}}T(\mathbf{X}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big) = -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big).   (53)

Using Jensen’s inequality, the following inequality is obtained:

\Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}W(\mathbf{X}_{\mu})\Big) \leq \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\sum_{\mathbf{H}_{\mu}}T(\mathbf{H}_{\mu}\mid\mathbf{V}_{\mu})W(\mathbf{X}_{\mu})\Big) = \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big).   (54)

Equation (52) is used in the derivation of the last line of equation (54). From equations (53) and (54),

\mathbb{E}_{T}[F_{\mathrm{AIS}}(S_{T})] \geq -\ln Z_{0} - \Big(\prod_{\mu=1}^{N}\sum_{\mathbf{V}_{\mu}}\tau(\mathbf{V}_{\mu})\Big)\ln\Big(\frac{1}{N}\sum_{\mu=1}^{N}\Lambda(\mathbf{V}_{\mu})\Big) = \mathbb{E}_{\tau}[F_{\mathrm{mAIS}}(S_{\tau})] \geq F

is ensured; here, the final inequality is obtained from equation (31). \square

Acknowledgment

This work was partially supported by JSPS KAKENHI (grant numbers 18K11459, 18H03303, 20K23342, 21K11778, and 21K17804).

References

  • Ackley et al. (1985) D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science 9, 147 (1985).
  • Smolensky (1986) P. Smolensky, Parallel distributed processing: Explorations in the microstructure of cognition 1, 194 (1986).
  • Hinton (2002) G. E. Hinton, Neural Computation 14, 1771 (2002).
  • Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. E. Hinton, In Proc. of the 12th International Conference on Artificial Intelligence and Statistics, 448 (2009).
  • Roudi et al. (2009) Y. Roudi, E. Aurell, and J. Hertz, Frontiers in Computational Neuroscience 3, 1 (2009).
  • Decelle and Furtlehner (2021) A. Decelle and C. Furtlehner, Chinese Physics B 30, 040202 (2021).
  • Chen et al. (2018) J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, Physical Review B 97, 085104 (2018).
  • Nomura and Imada (2021) Y. Nomura and M. Imada, Physical Review X 11, 031034 (2021).
  • Torlai et al. (2018) G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nature Physics 14, 447 (2018).
  • Carleo and Troyer (2017) G. Carleo and M. Troyer, Science 355, 602 (2017).
  • Gao and Duan (2017) X. Gao and L. Duan, Nature Communications 8, 1 (2017).
  • Neal (2001) R. M. Neal, Statistics and Computing 11, 125 (2001).
  • Salakhutdinov and Murray (2008) R. Salakhutdinov and I. Murray, In Proc. of the 25th International Conference on Machine Learning 25, 872 (2008).
  • Geman and Geman (1984) S. Geman and D. Geman, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721 (1984).
  • Jarzynski (1997) C. Jarzynski, Phys. Rev. E 56, 5018 (1997).
  • Burda et al. (2015) Y. Burda, R. B. Grosse, and R. Salakhutdinov, In Proc. of the 18th International Conference on Artificial Intelligence and Statistics, 102 (2015).
  • Liu et al. (2015) Q. Liu, A. Ihler, J. Peng, and J. Fisher, In Proc. of the 31st Conference on Uncertainty in Artificial Intelligence, 514 (2015).
  • Carlson et al. (2016) D. E. Carlson, P. Stinson, A. Pakman, and L. Paninski, In Proc. of the 33rd International Conference on Machine Learning 6, 4248 (2016).
  • Mazzanti and Romero (2020) F. Mazzanti and E. Romero, arXiv:2007.11926 (2020).
  • Krause et al. (2020) O. Krause, A. Fischer, and C. Igel, Artificial Intelligence 278, 103195 (2020).
  • Yasuda and Sekimoto (2021) M. Yasuda and K. Sekimoto, Physical Review E 103, 052118 (2021).
  • Caselle et al. (2022) M. Caselle, E. Cellini, A. Nada, and M. Panero, Journal of High Energy Physics 2022, 015 (2022).
  • Liu (2001) J. S. Liu, Monte Carlo strategies in scientific computing (Springer, 2001).
  • Yasuda (2015) M. Yasuda, Journal of the Physical Society of Japan 84, 034001 (2015).
  • Yasuda and Uchizawa (2021) M. Yasuda and K. Uchizawa, Neural Computation 33, 1037 (2021).
  • Liu (1994) J. S. Liu, Journal of the American Statistical Association 89, 958 (1994).
  • Bishop (2006) C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006).
  • Roussel et al. (2021) C. Roussel, S. Cocco, and R. Monasson, Phys. Rev. E 104, 034109 (2021).
  • Decelle et al. (2021) A. Decelle, C. Furtlehner, and B. Seoane, In Proc. of the 35th Conference on Neural Information Processing Systems (2021).