
Markov Chain Monte Carlo in Tensor Network Representation

Synge Todo [email protected] Department of Physics, The University of Tokyo, Tokyo 113-0033, Japan; Institute for Physics of Intelligence, The University of Tokyo, Tokyo 113-0033, Japan; Institute for Solid State Physics, The University of Tokyo, Kashiwa 277-8581, Japan
Abstract

Markov chain Monte Carlo (MCMC) is a powerful tool for sampling from complex probability distributions. Despite its versatility, MCMC often suffers from strong autocorrelation and the negative sign problem, which slow down the convergence of the statistical error. We propose a novel MCMC formulation based on tensor network representations to reduce the population variance and mitigate these issues systematically. By introducing stochastic projectors into the tensor network framework and employing Markov chain sampling, our method eliminates the systematic error associated with low-rank approximation in tensor contraction while maintaining the high accuracy of the tensor network method. We demonstrate the effectiveness of the proposed method on the two-dimensional Ising model, achieving an exponential reduction in statistical error with increasing bond dimension cutoff. Furthermore, we address the sign problem in systems with negative weights, showing significant improvements in the average sign as the bond dimension cutoff increases. The proposed framework provides a robust solution for accurate statistical estimation in complex systems, paving the way for broader applications in computational physics and beyond.

Markov chain Monte Carlo (MCMC) is a flexible algorithm for generating random samples according to an arbitrary multidimensional probability distribution [1, 2]. It has been widely applied to simulations of many-body systems in statistical physics, particle physics, and biophysics, as well as to statistical estimation, Bayesian inference, optimization problems, etc. MCMC generates random samples using a stochastic process called a Markov chain. The Markov chain must satisfy the global balance with respect to the target distribution and ergodicity in the phase space. Although an MCMC method that satisfies these two conditions guarantees unbiased results in infinite time in principle, the MCMC dynamics in the phase space must be fast for the technique to work in practice [3, 4, 5]. The variance of an averaged quantity, $\sigma^{2}$, along a Markov chain is generally given by

\sigma^{2}\approx\frac{2\sigma_{0}^{2}\tau_{\text{int}}}{M}, \qquad (1)

where $\sigma_{0}^{2}$ is the population variance of the raw time-series data, $\tau_{\text{int}}$ is the autocorrelation time, and $M$ is the number of Monte Carlo (MC) steps.

The MCMC method has an intrinsic difficulty that is even more severe than autocorrelation, called the negative sign problem. For example, the Boltzmann weights sometimes become negative, or even complex, in the path-integral representation for frustrated quantum magnets, and they can no longer be regarded as (unnormalized) probabilities for Monte Carlo sampling; one has to simulate an alternative system, whose weight function is the absolute value of the original one, and take the sign into account via the reweighting technique [6, 7, 8]. In many interesting systems, such as frustrated quantum magnets, itinerant electron systems, and real-time unitary evolution, however, the average sign, $S$, i.e., the average of the reweighting factor, becomes exponentially small for larger system sizes, lower temperatures, or longer real-time integration. As a result, the r.h.s. of Eq. 1 acquires an extra factor $S^{-2}$, which ultimately leads to an explosive growth of the statistical error.
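
To spell out the reweighting step (a textbook identity, stated here only to fix notation): if the weight $W(x)$ can be negative, one samples configurations with probability proportional to $|W(x)|$ and estimates

\langle O\rangle=\frac{\sum_{x}O(x)W(x)}{\sum_{x}W(x)}=\frac{\langle O\operatorname{sgn}W\rangle_{|W|}}{\langle\operatorname{sgn}W\rangle_{|W|}}=\frac{\langle O\operatorname{sgn}W\rangle_{|W|}}{S},

so the relative error of the ratio is amplified by roughly $1/S$, which is the origin of the $S^{-2}$ factor mentioned above.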

The increase of variance can be interpreted differently; due to the correlations, the number of samples that can be considered statistically independent decreases from $M$ to $MS^{2}/2\tau_{\text{int}}$. So far, many efforts have been made to improve the statistics in MCMC sampling: non-local block updates [9, 10, 11, 12, 13, 14], extended ensemble methods [15, 16, 17, 18], irreversible transition kernels [19, 20, 21, 22], mitigation of the negative sign problem [23, 24, 25, 26], etc. However, almost all previous attempts have focused on recovering the effective sample size.

In this paper, we propose a novel approach for improving the efficiency of MCMC from a different perspective; we consider systematically reducing the population variance of physical quantities. One might think that the population variance is not affected by the details of the sampling scheme, since it is determined by the target distribution, e.g., the canonical ensemble, which reflects the system's physical characteristics. However, this intuition is not entirely correct. The population variance can change depending on the representation of the partition function, that is, the definition of configurations and associated weights on which MCMC sampling is performed. One example is the so-called improved estimator in the cluster and worm algorithms, in which the partition function and the estimators of physical quantities are defined in terms of graphs, and, in some cases, the population variance is reduced drastically [27, 28, 29, 30, 31, 32, 33]. Unfortunately, there has been no systematic research in this direction so far.

In the following, we propose an MCMC formulation in a tensor network representation that systematically improves the population variance. Although our method relies on the low-rank approximation in the conversion to a tensor network representation, the systematic error due to that approximation is eliminated by MCMC sampling. By adopting the tensor network representation, we demonstrate that the statistical error decreases exponentially with increasing bond dimension cutoff in the low-rank approximation and that even the negative sign problem is avoided at polynomial computational cost.

Recently, tensor network methods have attracted attention as a promising numerical technique. Tensor networks, a versatile data structure for compactly representing large matrices or tensors, have applications in various fields [34, 35, 36]. Beyond their first applications in statistical mechanics for many-body systems [37, 38, 39], they are employed for diverse tasks such as compressing machine learning models [40, 41, 42], simulating complex quantum circuits [43, 44, 45, 46], and providing high-precision approximate solutions for partial differential equations [47, 48, 49, 50].

There are two main approaches in the application of tensor networks. One is the variational (or Hamiltonian) approach, where tensor networks are used as approximate functions with high representability. They are used as variational wave functions for quantum many-body systems or as architectures for generative models in machine learning. The other is called the renormalization (or Lagrangian) approach. For example, the partition function of various classical lattice models and quantum circuits can be expressed precisely as tensor networks. Then, the partition function or quantum amplitude can be obtained as the contraction of the whole network.

However, efficient tensor network contraction is a non-trivial task in either approach [51, 52]. Generally, exact contraction requires exponential computational cost for high-dimensional tensor networks. Therefore, a low-rank approximation based on the singular value decomposition (SVD) is usually employed during the contraction. Since the introduction of the renormalization group methods such as the density-matrix renormalization group [37], corner transfer matrix renormalization group [38], and tensor renormalization group method [39], various improvements have been proposed, such as generalization to higher-dimensional systems [53, 54, 55] and fermionic systems [56, 57, 58, 59], improvements in computational cost [54, 55], incorporation of environmental effects [60, 61, 62], and removal of local correlations [63, 64, 65]. The bond dimension cutoff, $d$, which denotes the number of singular values retained in the low-rank approximation, is crucial in controlling the systematic error in the approximate contraction of tensor networks. The sum of squares of the discarded singular values gives the squared Frobenius norm of the difference between the original and the approximated tensors. The systematic error decreases rapidly as the bond dimension cutoff increases, but the computational cost grows as some power of $d$. In addition, since multiple approximations are combined during the renormalization, the decrease of the systematic error is often non-monotonic, making the extrapolation to the $d\rightarrow\infty$ limit difficult.
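
To make the role of $d$ concrete, here is a minimal numpy sketch (generic linear algebra, not the TRG itself) verifying that the squared Frobenius error of the best rank-$d$ approximation equals the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(42)
# a 64x64 matrix with rapidly decaying singular values, mimicking a typical tensor unfolding
M = rng.standard_normal((64, 64)) @ np.diag(0.5 ** np.arange(64)) @ rng.standard_normal((64, 64))

U, s, Vh = np.linalg.svd(M, full_matrices=False)

for d in (4, 8, 16):
    M_d = (U[:, :d] * s[:d]) @ Vh[:d, :]          # best rank-d approximation (Eckart-Young)
    err2 = np.linalg.norm(M - M_d, "fro") ** 2    # squared Frobenius error of the truncation
    print(d, err2, np.sum(s[d:] ** 2))            # the two error measures agree
```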

Figure 1: Tensor network of the Levin-Nave TRG in the projector formulation. Black circles represent the tensors after the initial (exact) SVD. Red, blue, and orange pairs of three-leg tensors represent the projectors for the successive low-rank decompositions. See Appendix C for more details.

Let’s consider the tensor renormalization group (TRG) method proposed by Levin and Nave in 2007 [39] as an example. In TRG, at each renormalization step, a four-leg tensor is decomposed into two three-leg tensors, and then four three-leg tensors are contracted to form a renormalized four-leg tensor. By repeating this procedure, all tensors are contracted to produce the partition function (see Fig. 6 in Appendix C).

The successive low-rank decompositions in the renormalization process can be replaced by the insertion of projectors, as shown in Fig. 1, where each step is described by

A^{*}B\approx A^{*}PB, \qquad (2)

where $P=W_{R}W_{L}^{*}=\sum_{i=1}^{d}\eta_{i}\xi_{i}^{*}\in\mathbb{C}^{r\times r}$ is a projector onto $d$ dimensions, $A$ and $B$ are the tensors to be approximated, and $r$ denotes the bond dimension between $A^{*}$ and $B$. By choosing the largest $d$ singular values in the SVD of $A^{*}B=U\Sigma V^{*}=\sum_{i=1}^{r}c_{i}\mu_{i}\nu_{i}^{*}$, the projector corresponding to the best rank-$d$ approximation is defined by $\xi_{i}=c_{i}^{-1/2}A\mu_{i}$ and $\eta_{i}=c_{i}^{-1/2}B\nu_{i}$ (here we consider the full-rank case, i.e., we assume that the rank of $A^{*}B$ equals the bond dimension $r$ between $A^{*}$ and $B$; see Appendix B for rank-deficient cases and for more details). The total number of projectors, denoted by $N_{\text{p}}$ hereafter, equals the number of low-rank approximations performed and is of the same order as the number of initial tensors. Note that a pair of $W_{R}$ and $W_{L}$ that make up a single projector may be far apart in the final tensor network, as the projectors are inserted recursively (Fig. 1). Other SVD-based algorithms, such as TEBD [67, 68], ATRG [54], BTRG [62], and CATN [55], can be rewritten similarly in the projector formulation.

The core technique in our proposal is to replace the projectors with stochastic ones and to sample them using MCMC. In Ref. 69, Ferris proposed a seminal sampling scheme that, instead of choosing the optimal rank-$d$ projector in the tensor renormalization group process, selects $d$ rank-1 projectors out of $r$ according to a certain weight, where the number of possible combinations is $n_{\text{c}}=\binom{r}{d}$. The sampling weights are set so that the rank-1 projectors corresponding to the larger singular values are selected with higher probabilities. A scale factor is introduced for each rank-1 projector so that the random average of the stochastic projectors becomes the identity operator (although the inserted tensor is not exactly a projector due to the scale factors, we refer to it as a "projector" in this paper for simplicity):

\langle W_{R}(\theta)W_{L}^{*}(\theta)\rangle_{\theta}=\frac{1}{n_{\text{c}}}\sum_{\theta=1}^{n_{\text{c}}}W_{R}(\theta)W_{L}^{*}(\theta)\,p(\theta)=I_{r}, \qquad (3)

where $\theta$ denotes an index specifying the set of chosen rank-1 projectors, and $p(\theta)$ is the probability of choosing $\theta$ (see Appendix D for more details). Thus, the resulting tensor network with stochastic projectors defines an unbiased estimator for the partition function [69].
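
The unbiasedness in Eq. 3 can be checked numerically for a small example. The sketch below enumerates all $\binom{r}{d}$ subsets, selects them with probability proportional to the product of their weights, and rescales each selected rank-1 projector by the inverse of its marginal inclusion probability; this particular scale-factor convention is an assumption made for the illustration (the precise bookkeeping used in the simulations follows Appendix D):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
r, d = 6, 3

# a complete dual (bi-orthonormal) basis, xi_i^* eta_j = delta_ij, as in Appendix B;
# real matrices are used, so the conjugate transpose is just the transpose
E = rng.standard_normal((r, r))
eta = E                      # eta_i is the i-th column of E
xi = np.linalg.inv(E).T      # xi_i is the i-th column of inv(E)^T

w = rng.random(r) + 0.1      # weights of the rank-1 projectors, e.g. w_i = c_i^omega (cf. Eq. 31)

# a subset theta of d rank-1 projectors is proposed with probability ~ product of its weights
subsets = list(combinations(range(r), d))
probs = np.array([np.prod(w[list(s)]) for s in subsets])
probs /= probs.sum()

# marginal inclusion probability q_k of each rank-1 projector (cf. Eq. 33)
q = np.zeros(r)
for s, p in zip(subsets, probs):
    q[list(s)] += p

# stochastic projector: the selected rank-1 terms, each rescaled by 1/q_k (assumed scale factor)
avg = np.zeros((r, r))
for s, p in zip(subsets, probs):
    P_theta = sum(np.outer(eta[:, k], xi[:, k]) / q[k] for k in s)
    avg += p * P_theta

print(np.allclose(avg, np.eye(r)))   # True: the random average of the stochastic projectors is I_r
```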

However, there is a fatal flaw in the approach of Ref. 69. The number of projectors, $N_{\text{p}}$, is of the same order as the number of tensors in the original network. The variance of the contraction of a tensor network containing numerous random tensors explodes exponentially with $N_{\text{p}}$, as the contraction is essentially a product of an extensive number of random numbers. Such naive importance sampling only works for relatively small systems, since the estimator's variance diverges exponentially with the system size (see Appendix A for more details). We must therefore introduce a method that can control the variance, such as sequential Monte Carlo [71, 72] with particle resampling, or MCMC [1, 2]. In many cases, we are interested in the expectation values of physical quantities rather than in the partition function itself, so the latter, MCMC, is the method of choice.

In the following, we denote the state of each stochastic projector by $\theta_{i}$. We aim to implement MCMC based on a simple Metropolis-Hastings (MH) scheme [73, 74] with $\{\theta_{i}\}$ as the state variables. To do this, we need to introduce the following additional techniques.

Independent proposal distribution: In the original proposal by Ferris [69], a set of rank-1 projectors is obtained based on the results of the previous short-scale renormalization step, and then importance sampling is performed. In this method, when the short-scale projectors (e.g., red projectors in Fig. 1) are updated, the definition of longer-scale projectors (blue and orange ones) changes, so it is difficult to implement MCMC. In our method, on the other hand, we first execute the conventional deterministic tensor renormalization procedure (in the projector formulation) as an initialization process. At the same time, we determine all the rank-1 projectors and their weights using the results of the SVD in that process. During the MCMC process, we do not change the projector sets and the weights. This initialization scheme makes the proposal distribution of the MCMC projectors independent for each projector, and the partition function can be written as

Z=\sum_{\{\theta_{i}\}}g(\theta_{1},\theta_{2},\ldots,\theta_{N_{\text{p}}})\,p(\theta_{1})p(\theta_{2})\cdots p(\theta_{N_{\text{p}}}). \qquad (4)

In addition, since the SVD is performed only during initialization, the MCMC process consists solely of selecting rank-1 projectors and contracting the tensor network with the embedded projectors; this makes the sampling stage highly efficient and well suited to modern computer architectures.
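
A minimal sketch of the resulting MH sweep is shown below (schematic only; `contract_with_projectors` and `propose` stand for the tensor network contraction with the embedded projectors and the independent proposal of Appendix D, and are assumed rather than defined here):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_sweep(theta, g, contract_with_projectors, propose):
    """One MCMC sweep over all N_p stochastic projectors.

    The target distribution is pi(theta) ~ |g(theta)| * prod_i p(theta_i); since the
    proposal for projector i is the independent distribution p(theta_i'), the factors
    p(theta_i) cancel and the acceptance probability reduces to min(1, |g'| / |g|).
    """
    for i in range(len(theta)):
        theta_new = list(theta)
        theta_new[i] = propose(i)                      # draw theta_i' with probability p(theta_i')
        g_new = contract_with_projectors(theta_new)    # re-contract (only ancestors in practice)
        if rng.random() < min(1.0, abs(g_new) / abs(g)):
            theta, g = theta_new, g_new
    return theta, g
```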

Computational graph: The weight factor $g(\{\theta_{i}\})$ in Eq. 4 is the contraction of the tensor network with the embedded projectors. When the projector configuration changes, the contraction of the whole network must be evaluated again. If we re-evaluate the entire tensor network every time, the computational complexity is $O(d^{5}N)$ per update; that is, the cost of one MCMC sweep becomes $O(d^{5}N^{2})$, which is very inefficient. To reduce this computational complexity, we introduce the concept of a computational graph. The series of contractions can be represented as a tree structure, whose leaves correspond to the initial tensors and the projectors. When a projector, i.e., a pair of $W_{L}(\theta)$ and $W_{R}(\theta)$, is updated, only its ancestors need to be re-evaluated. In this way, in the case of TRG, the computational complexity of a single MCMC sweep is reduced to $O(d^{5}N\log N)$ (see the sketch below).
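
A minimal sketch of the caching idea (hypothetical helper class, not the production implementation): each internal node stores its partially contracted tensor, and updating one projector marks only the nodes on the path to the root as dirty.

```python
class Node:
    """Node of the contraction tree: a leaf stores a fixed tensor or a stochastic projector;
    an internal node caches the contraction of its two children."""

    def __init__(self, left=None, right=None, tensor=None, contract=None):
        self.left, self.right = left, right
        self.tensor = tensor        # cached result (or the leaf tensor itself)
        self.contract = contract    # callable combining the two child tensors (internal nodes only)
        self.parent = None
        self.dirty = False
        for child in (left, right):
            if child is not None:
                child.parent = self

    def value(self):
        if self.contract is None:                 # leaf: return the stored tensor
            return self.tensor
        if self.dirty or self.tensor is None:     # recompute only if something below has changed
            self.tensor = self.contract(self.left.value(), self.right.value())
            self.dirty = False
        return self.tensor

    def invalidate(self):
        node = self                               # mark this node and all of its ancestors dirty:
        while node is not None:                   # O(log N) nodes for a balanced contraction tree
            node.dirty = True
            node = node.parent

# updating one stochastic projector stored in `leaf`:
#   leaf.tensor = new_projector_tensor
#   leaf.invalidate()
#   Z = root.value()    # only the ancestors of `leaf` are re-contracted
```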


Figure 2: Temperature dependence of the specific heat (a) and the magnetization squared (b) of the Ising model on the $N=16\times 16$ square lattice. Dark blue curves represent the exact results of the transfer matrix calculation. Red diamonds represent the results of MCMC in the tensor network representation with $d=6$. We also show the results of the Levin-Nave TRG with impurity tensors with $d=2$ (purple), 4 (green), 6 (cyan), 8 (orange), 10 (dark green), $\ldots$, 16 (black). The inset in (b) shows the $d$-dependence of the asymptotic variance of the energy (purple), the specific heat (green), and the magnetization squared (cyan), together with those of the standard MH method (horizontal lines).

Physical quantities: In standard tensor network algorithms, finite differences or automatic differentiation of the free energy is used to evaluate the expectation values of physical quantities, such as the internal energy and the magnetization. In our MCMC algorithm, we adopt the impurity tensor method [75, 76, 77] instead. In the impurity tensor method, we prepare "impurities," which are obtained by differentiating the initial tensors with respect to external variables, such as the temperature and the magnetic field, and then recursively evaluate the contraction of the tensor network that contains only one or two impurities in total, depending on the order of differentiation. This method is easily implemented in projector-based tensor network renormalization and is compatible with our MCMC formalism. One drawback of the impurity tensor method in ordinary tensor renormalization groups is that it does not take the derivatives of the projectors into account, which introduces extra systematic errors. In our MCMC method, however, the projectors become identity operators after taking the random average and do not depend on the external variables, so the impurity tensor method introduces no systematic error.

To demonstrate the effectiveness of the proposed method, we present simulation results for the two-dimensional square-lattice Ising model, whose Hamiltonian is given by

H=-J\sum_{\langle i,j\rangle}\sigma_{i}\sigma_{j}-h\sum_{i}\sigma_{i}, \qquad (5)

where $\sigma_{i}=\pm 1$ is the Ising spin variable at the $i$-th site, $J$ is the coupling constant, and $h$ is the magnetic field.

Figure 3: Temperature dependence of the average sign. Blue filled symbols represent the standard MCMC results for $N=8\times 8$ (squares), $16\times 16$ (circles), and $32\times 32$ (diamonds). Open diamonds represent those of MCMC in the tensor network representation for $N=32\times 32$. The bond dimension cutoff is $d=2$ (purple), 3 (green), 4 (orange), and 6 (red).

First, we present the results at zero magnetic field, $h=0$. In Fig. 2, we show the specific heat and the magnetization squared as functions of temperature. The system size is $N=16\times 16$, and the bond dimension cutoff is set to $d=6$. The total number of MC steps is $2^{14}$, of which the first $2^{11}$ steps are discarded as burn-in. For comparison, we also show the exact results obtained by the transfer matrix method and the results of the Levin-Nave TRG with impurity tensors with $d=2,4,6,8,\ldots,16$. Note that $d=256$ is required for TRG to obtain the exact result for $N=16\times 16$. The TRG results suffer from a systematic error due to the finite bond dimension cutoff and the impurity tensor method. Furthermore, the convergence to the exact result is non-monotonic. On the other hand, the MCMC results using $d=6$ are fully consistent with the exact results, confirming that the systematic error is eliminated.

In the inset of Fig. 2, we show the $d$-dependence of the asymptotic variance, $2\sigma_{0}^{2}\tau_{\text{int}}$ in Eq. 1, of the internal energy, the specific heat, and the magnetization squared at the critical temperature, $T_{c}=2J/\log(1+\sqrt{2})$ [78]. The asymptotic variance is evaluated using binning analysis. The variance is already smaller than that of the conventional MH method at $d=4$ and decreases exponentially with the bond dimension cutoff $d$. Given that the computational complexity of the proposed method is $O(d^{5}N\log N)$, the present results indicate that an exponential acceleration has been achieved.

Next, we present the results at a pure imaginary magnetic field, $h=i\pi/2\beta$, or a negative fugacity, $z=e^{-2\beta h}=-1$. This magnetic field is the intersection of the unit circle $|z|=1$ of the Yang-Lee zeros [79] and the negative real axis of $z$. At $z=-1$, the Boltzmann weight in the original spin-based representation is given by $W(\{\sigma_{i}\})=(-1)^{m/2}e^{\beta\sum\sigma_{i}\sigma_{j}}$, with $m$ being the total magnetization, and the standard MCMC method suffers from a severe sign problem due to the factor $(-1)^{m/2}$. In Fig. 3, we show the temperature dependence of the average sign. Our proposed method, in which the average sign is given by $\langle\operatorname{sgn}g(\{\theta_{i}\})\rangle$, also yields negative signs at small $d$. In contrast to the standard MCMC method, however, the average sign improves systematically and approaches unity rapidly as $d$ increases. Thus, the negative sign problem is also overcome at polynomial cost.

To summarize, we have introduced a novel MCMC framework in tensor network representations. By combining the projector formulation of tensor network renormalization with MCMC sampling, we eliminate the systematic error arising from the finite bond dimension cutoff while maintaining the computational efficiency of tensor network techniques. We demonstrated the effectiveness of our approach through simulations of the two-dimensional Ising model, showing exponential reductions in statistical error and a systematic resolution of the negative sign problem at polynomial computational cost. This framework is versatile and applicable to various sophisticated tensor network algorithms, such as HOTRG [53], TEBD [67, 68], ATRG [54], BTRG [62], and CATN [55].

The proposed method facilitates accurate evaluation of partition functions, physical quantities, and tensor network optimization, making it a powerful tool for tackling complex physics problems. Future work could explore extensions to higher-dimensional systems, applications in real-time unitary dynamics, and integration with advanced MCMC techniques, such as the extended ensemble methods [15, 16, 17, 18] and irreversible transition kernels [19, 20, 21, 22]. By offering both accuracy and computational efficiency, the proposed method represents a significant step forward in developing robust and scalable sampling techniques for computational science.

Acknowledgements.
The author thanks Hidemaro Suwa and Tsuyoshi Okubo for fruitful discussions and comments. This work was supported by JSPS KAKENHI Grant Numbers 20H01824 and 24K00543, the Center of Innovation for Sustainable Quantum AI (SQAI), JST Grant Number JPMJPF2221, and JST CREST Grant Number JPMJCR24I1.

Appendix A Product of Random Numbers

In this Appendix, we focus on the problem of sampling the product of $N$ independent random numbers. Suppose that we have $N$ independent random variables, $\{X_{i}\}_{i=1}^{N}$, each distributed according to a probability distribution with mean $\mu_{i}$ and variance $\sigma_{i}^{2}$. We are interested in estimating the expectation of the product of these random variables, $\text{E}[X_{1}X_{2}\cdots X_{N}]$. As the $\{X_{i}\}$ are independent of each other, the expectation of the product is given by the product of their expectations:

\text{E}[X_{1}X_{2}\cdots X_{N}]=\text{E}[X_{1}]\text{E}[X_{2}]\cdots\text{E}[X_{N}]=\prod_{i}\mu_{i}\equiv\Gamma. \qquad (6)

The sample mean of the product of $N$ random variables is an unbiased estimator of $\Gamma$. However, the relative statistical error diverges exponentially with $N$:

\begin{split}\text{Var}[X_{1}X_{2}\cdots X_{N}]&=\text{E}[X_{1}^{2}X_{2}^{2}\cdots X_{N}^{2}]-\text{E}[X_{1}X_{2}\cdots X_{N}]^{2}\\ &=\text{E}[X_{1}^{2}]\text{E}[X_{2}^{2}]\cdots\text{E}[X_{N}^{2}]-\Gamma^{2}\\ &=\prod_{i}(\mu_{i}^{2}+\sigma_{i}^{2})-\Gamma^{2}\\ &\approx\Gamma^{2}\prod_{i}(1+\sigma_{i}^{2}/\mu_{i}^{2})\qquad\text{for $N\gg 1$}.\end{split} \qquad (7)

The divergence of the relative error can also be understood from a different viewpoint. The multiplication of $N$ random variables can be interpreted as an $N$-step random walk on the logarithmic axis:

\log(X_{1}X_{2}\cdots X_{N})=\sum_{i=1}^{N}\log X_{i}. \qquad (8)

The linear increase of the variance of the random walk causes an exponential divergence of the relative error in the estimate of $\Gamma$.

Figure 4: Histogram of $\log(x_{1}x_{2}\cdots x_{N})$ for $N=4$ (purple), 8 (green), 16 (cyan), 32 (orange), and 64 (red), where the $x_{i}$ are sampled independently from the uniform distribution on $(0,2]$. The distributions exhibit biased random-walk behavior with a peak at $(\log 2-1)N$ and a width of order $\sqrt{N}$.
Figure 5: $N$-dependence of the sample mean of the product of $N$ random variables for the naive sequential sampling (purple) and for sampling with multinomial resampling (green). The number of samples is $M=2^{24}$ in both cases. The inset shows the $N$-dependence of the variance of the sample mean on a logarithmic scale. After introducing resampling, the variance is suppressed by orders of magnitude.

Let's consider a simple example: $\{X_{i}\}_{i=1}^{N}$ are independent and identically distributed (i.i.d.), uniformly on $(0,2]$:

P(X=x)=\begin{cases}1/2&\text{for $0<x\leq 2$}\\ 0&\text{otherwise}.\end{cases} \qquad (9)

The mean and the variance are $\mu=1$ and $\sigma^{2}=1/3$, respectively. The probability density function of $Y=\log X$ is given by

P(Y=y)=\begin{cases}e^{y}&\text{for $y\leq\log 2$}\\ 0&\text{otherwise},\end{cases} \qquad (10)

and the mean and the variance are $\text{E}[Y]=\log 2-1$ and $\text{Var}[Y]=1$, respectively.

In Fig. 4, we show the histogram of the samples $\{\log(x_{1}x_{2}\cdots x_{N})\}$ for $N=4,8,\ldots,64$. The number of samples is $M=2^{24}$. As expected from Eq. 8, a biased random-walk behavior is observed; the peak position shifts as $(\log 2-1)N$, while the width of the distribution grows as $\sqrt{N}$. In other words, on the original linear scale, the typical value of $x_{1}x_{2}\cdots x_{N}$ is exponentially small, of order $\exp[(\log 2-1)N]$, and rare events with extremely large contributions dominate the mean $\Gamma=1$. In Fig. 5, we plot the variance as a function of $N$, which diverges exponentially as $(4/3)^{N}$, as predicted by Eq. 7. When $N$ is large, the mean and the variance are greatly underestimated because the sample size is insufficient to pick up the rare events.

Thus, even though the sample mean of the product of the random variables is a "perfect" estimator of the product of the means, i.e., each sample is statistically independent and unbiased, it suffers from exponential growth of the statistical error when $N$ is large. This is precisely the issue that surfaced in Ref. 69.

The naive perfect sampling above is equivalent to the sequential Monte Carlo (SMC) method without resampling. In the SMC method [71, 72], many particles, or walkers, are simulated simultaneously, and a weight is assigned to each particle. The particles are updated according to the model. However, a problem called "weight degeneracy" occurs, where a few particles gradually end up with extremely high weights while many others contribute almost nothing. To prevent this phenomenon, we perform resampling during the particle updates. Resampling prunes particles with small weights and duplicates those with high weights to even out the overall distribution. This technique prevents the information from being concentrated in a few particles and increases the effective sample size.

Here, we adopt one of the simplest schemes, multinomial resampling, for the present problem, where the weight of each particle is the product of the random numbers [80]. First, we prepare $M$ particles and initialize their weights to unity. Then, each particle's weight is multiplied by a random number $x$ at each step. After each update, we perform resampling so that the weights of all the particles become the same. In the present problem, since a particle has no internal state other than its weight, we simply replace the weight of every particle with the average weight. This set of update and resampling steps is repeated $N$ times to estimate the mean of the product. In Fig. 5, we also plot the results obtained after introducing resampling. The suppression of the variance is remarkable. The variance is now given by replacing $\sigma^{2}$ in Eq. 7 by $\sigma^{2}/M$, which is almost linear in $N$:

\text{Var}[X_{1}X_{2}\cdots X_{N}]=\Big(1+\frac{\sigma^{2}}{M}\Big)^{N}-1\approx\frac{\sigma^{2}N}{M} \qquad (11)

as long as $M\gg N$.
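
A compact numerical illustration of both effects discussed above (a small-scale sketch of the experiment behind Figs. 4 and 5, with reduced sample sizes):

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 2**16, 64          # number of samples/particles and number of factors (smaller than in the text)

# naive estimator: sample mean of the product of N i.i.d. uniforms on (0, 2]
x = rng.uniform(0.0, 2.0, size=(M, N))
naive = x.prod(axis=1)
print("naive    :", naive.mean())       # unbiased for Gamma = 1, but with huge (~(4/3)^N) variance

# multinomial-resampling version: after every factor, replace each particle's weight
# by the average weight (no internal state besides the weight in this toy problem)
weights = np.ones(M)
log_scale = 0.0
for _ in range(N):
    weights *= rng.uniform(0.0, 2.0, size=M)
    mean_w = weights.mean()
    log_scale += np.log(mean_w)         # accumulate the normalization to avoid under/overflow
    weights = np.ones(M)                # "resampling" to equal weights
print("resampled:", np.exp(log_scale))  # close to 1, with variance ~ sigma^2 N / M
```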

Another established approach for controlling the variance of weights is the MCMC method [1, 2]. If we are interested in the weighted average of quantities, not the average weight, the MCMC method is more suitable. In the main text, we adopt the MCMC method to estimate the average physical quantities in the canonical ensemble.

Appendix B Projector formulation of low-rank approximation

Let's consider the SVD, or rank-1 decomposition, of the product of two matrices, $A\in\mathbb{C}^{r\times p}$ and $B\in\mathbb{C}^{r\times q}$:

A^{*}B=U\Sigma V^{*}=\sum_{i=1}^{s}c_{i}\mu_{i}\nu_{i}^{*}, \qquad (12)

where ${}^{*}$ denotes the matrix conjugate transpose, $s\leq\min(p,q,r)$ is the rank of $A^{*}B$, $U\in\mathbb{C}^{p\times s}$ and $V\in\mathbb{C}^{q\times s}$ are isometries ($U^{*}U=V^{*}V=I_{s}$), $\Sigma\in\mathbb{R}^{s\times s}$ is a diagonal matrix whose diagonal elements are the singular values $\{c_{i}\}_{i=1}^{s}$, and $\mu_{i}\in\mathbb{C}^{p}$ and $\nu_{i}\in\mathbb{C}^{q}$ denote the $i$-th column vectors of $U$ and $V$, respectively. We assume that the singular values in $\Sigma$ are sorted in descending order. For later convenience, we define the diagonal matrix $\Lambda=\Sigma^{1/2}$.

The rank-$d$ approximation gives a $p\times q$ matrix of rank $d$ ($\leq s$) that approximates the input $A^{*}B$. Using the SVD introduced above, the optimal rank-$d$ approximation that minimizes the Frobenius norm of the difference from the input is given by

A^{*}B\approx U\bar{\Lambda}(V\bar{\Lambda})^{*}. \qquad (13)

Here, $\bar{\Lambda}\in\mathbb{R}^{s\times d}$ is the matrix consisting of the first $d$ columns of $\Lambda$. This procedure also defines the low-rank decomposition $A^{*}B\approx\bar{A}^{*}\bar{B}$ with $\bar{A}=(U\bar{\Lambda})^{*}\in\mathbb{C}^{d\times p}$ and $\bar{B}=(V\bar{\Lambda})^{*}\in\mathbb{C}^{d\times q}$.

The low-rank approximation can be represented in a different form. We define $P=W_{R}W_{L}^{*}\in\mathbb{C}^{r\times r}$ with

W_{L}=AU(\bar{\Lambda}^{+})^{*}, \qquad (14)
W_{R}=BV(\bar{\Lambda}^{+})^{*}, \qquad (15)

where $\bar{\Lambda}^{+}\in\mathbb{R}^{d\times s}$ is the Moore-Penrose inverse of $\bar{\Lambda}$. By noticing that $\bar{\Lambda}^{+}\Sigma=\bar{\Lambda}^{*}$, one can confirm that $P$ satisfies

P^{2}=(W_{R}W_{L}^{*})^{2}=W_{R}W_{L}^{*}=P, \qquad (16)
A^{*}PB=A^{*}W_{R}W_{L}^{*}B=\bar{A}^{*}\bar{B}. \qquad (17)

Thus, $P$ is a projector and its insertion between $A^{*}$ and $B$ yields the rank-$d$ approximation, Eq. 13.
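
The algebra of Eqs. 12-17 can be checked directly in a few lines of numpy (a generic verification sketch; real matrices are used, so ${}^{*}$ is just the transpose):

```python
import numpy as np

rng = np.random.default_rng(3)
r, p, q, d = 8, 10, 12, 4

A = rng.standard_normal((r, p))
B = rng.standard_normal((r, q))

U, c, Vh = np.linalg.svd(A.T @ B, full_matrices=False)     # A* B = U Sigma V*, Eq. (12)
best = (U[:, :d] * c[:d]) @ Vh[:d, :]                       # optimal rank-d approximation, Eq. (13)

Lam_bar_pinv = np.diag(1.0 / np.sqrt(c[:d]))                # Moore-Penrose inverse of Lambda_bar
W_L = A @ U[:, :d] @ Lam_bar_pinv                           # Eq. (14)
W_R = B @ Vh[:d, :].T @ Lam_bar_pinv                        # Eq. (15)
P = W_R @ W_L.T                                             # P = W_R W_L*

print(np.allclose(P @ P, P))             # P is a projector, Eq. (16)
print(np.allclose(A.T @ P @ B, best))    # inserting P reproduces the rank-d truncation, Eq. (17)
```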

The projector $P$ can be decomposed into a sum of rank-1 projectors as

P=\sum_{i=1}^{d}\eta_{i}\xi_{i}^{*}, \qquad (18)

where $\xi_{i}$ and $\eta_{i}$ are the $i$-th column vectors of $W_{L}$ and $W_{R}$, respectively, i.e.,

\xi_{i}=c_{i}^{-1/2}A\mu_{i}, \qquad (19)
\eta_{i}=c_{i}^{-1/2}B\nu_{i}, \qquad (20)

The sets $\{\xi_{i}\}_{i=1}^{d}$ and $\{\eta_{i}\}_{i=1}^{d}$ form a dual orthonormal basis:

\xi_{i}^{*}\eta_{j}=\delta_{ij}\qquad\forall i,j\in[1,d], \qquad (21)

or equivalently,

W_{L}^{*}W_{R}=I_{d}. \qquad (22)

Note that by setting $d=s$, one recovers the input matrix:

A^{*}PB=A^{*}B\qquad\text{when $d=s$}. \qquad (23)

However, $P$ is not necessarily an identity matrix even in that case; if $s<r$, the rank of $P$ is smaller than $r$, and it cannot be the identity.

In such a case, we augment the basis vectors to form a complete dual orthonormal basis set, $\{\xi_{i}\}_{i=1}^{r}$ and $\{\eta_{i}\}_{i=1}^{r}$; we add $2(r-s)$ vectors to the basis set by Gram-Schmidt orthonormalization. After the augmentation, $W_{L}$ and $W_{R}$ become square, and thus

P=W_{R}W_{L}^{*}=I_{r} \qquad (24)

follows from $W_{L}^{*}W_{R}=I_{r}$.

In our MCMC method, we record these complete dual basis sets $\{(\xi_{i},\eta_{i})\}$ together with the corresponding singular values $\{c_{i}\}$ during the initialization stage, in which the standard tensor renormalization group is performed with optimal low-rank approximations. The completeness property, Eq. 24, is essential when introducing MC sampling of the projectors, as the projectors will be inserted between matrices (or tensors) that are generated stochastically and differ from those used in the initialization stage.

Appendix C Tensor Renormalization Group in Projector Formulation

Figure 6: Original Levin-Nave TRG procedure [39]. Tensors in the original tensor network (a) are iteratively replaced by decomposition (red arrows) and contraction (blue arrows). We perform the low-rank approximation at the decomposition steps, keeping the bond dimension at or below the cutoff $d$ (red, blue, and orange bonds). After each set of decomposition and contraction, the number of tensors is halved.
Figure 7: TRG in the projector formulation. The first step (b) is the same as in the original TRG. The next steps, (c) and (d) in Fig. 6, are then replaced by the insertion of projectors (c), where pairs of three-leg tensors (red) are inserted between the tensors. We repeat the insertion of projectors ((d) and (e)). Note that the three-leg tensors in (b), obtained by the initial (exact) decomposition, do not change during the renormalization steps.

The partition function of the Ising model, Eq. 5, can be represented by a tensor network. First, we regard the Boltzmann weight of each bond term as a $2\times 2$ matrix:

W=\begin{bmatrix}e^{\beta J}&e^{-\beta J}\\ e^{-\beta J}&e^{\beta J}\end{bmatrix}, \qquad (25)

which is positive definite and can be decomposed as

W=K^{*}K, \qquad (26)
where
K=\begin{bmatrix}\cosh^{1/2}\beta J&\cosh^{1/2}\beta J\\ \sinh^{1/2}\beta J&-\sinh^{1/2}\beta J\end{bmatrix}. \qquad (27)

Using $K$, we define a four-leg site tensor as

T_{ijkl}=\sum_{m}K_{im}K_{jm}K_{km}K_{lm}e^{\beta h\sigma_{m}}. \qquad (28)

The partition function is then represented by the tensor network of $T$ on the square lattice, as shown in Fig. 6(a).
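
As a concrete reference point, the initial tensor of Eqs. 25-28 can be built in a few lines (a numpy sketch; the function name and the tiny 1x1-torus sanity check are ours):

```python
import numpy as np

def site_tensor(beta, J=1.0, h=0.0):
    """Initial four-leg tensor of the square-lattice Ising model, Eqs. (25)-(28).
    Assumes a ferromagnetic coupling, beta * J > 0, so that sinh(beta J) >= 0."""
    K = np.array([[np.sqrt(np.cosh(beta * J)),  np.sqrt(np.cosh(beta * J))],
                  [np.sqrt(np.sinh(beta * J)), -np.sqrt(np.sinh(beta * J))]])
    field = np.exp(beta * h * np.array([+1.0, -1.0]))      # e^{beta h sigma_m}, sigma_m = +-1
    # T_{ijkl} = sum_m K_{im} K_{jm} K_{km} K_{lm} e^{beta h sigma_m}
    return np.einsum("im,jm,km,lm,m->ijkl", K, K, K, K, field)

# sanity check: a single site on a 1x1 torus (both bonds connect the site to itself)
beta, J = 0.5, 1.0
T = site_tensor(beta, J)
Z = np.einsum("ijij->", T)
print(np.isclose(Z, 2.0 * np.exp(2.0 * beta * J)))   # Z = sum_{sigma=+-1} e^{2 beta J} at h = 0
```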

At each iteration step of the Levin-Nave TRG [39], we decompose a four-leg tensor into two three-leg tensors [Fig. 6(b), (d), (f), (h)] and then contract four three-leg tensors to form a renormalized four-leg tensor [Fig. 6(c), (e), (g), (i)]. By repeating this procedure, all tensors are finally contracted to produce a scalar, the partition function. The decomposition is performed by SVD with the optimal low-rank truncation of the singular values (Appendix B). In particular, the first decomposition step, Fig. 6(b), is performed explicitly as

T_{ijkl}=\sum_{m}R_{ijm}R_{klm}, \qquad (29)
where
R_{ijm}=K_{im}K_{jm}e^{\beta h\sigma_{m}/2}. \qquad (30)

The impurity tensors for calculating the internal energy and the specific heat (the magnetization and the susceptibility) are obtained by taking derivatives of $R$ with respect to the inverse temperature $\beta$ (the external field $h$).

The above tensor renormalization procedure can be rewritten in the projector formulation (Appendix B), where the insertion of projectors replaces the contraction-and-decomposition of tensors. The first step of the projector-based TRG is the same as in the original TRG [Fig. 7(b)]. The next steps, Fig. 6(c) and (d), are then replaced by Fig. 7(c), where projectors, i.e., pairs of three-leg tensors (red), are inserted between the tensors. This procedure is repeated as shown in Fig. 7(d) and (e). Note that in the original Levin-Nave TRG, the tensors (denoted by black circles in Fig. 6) are transformed successively as the renormalization proceeds. In the projector-based TRG, on the other hand, the three-leg tensors obtained in the first decomposition step [Fig. 7(b)] remain unchanged until the final tensor network [Fig. 7(e)].

In the original TRG and its projector representation, we can use the translational symmetry of the system to reduce the $N$-dependence of the computational complexity from $O(N)$ to $O(\log N)$. It should be noted, however, that when sampling projectors, the projectors at all locations must be treated as distinct, or a systematic bias will be introduced [81].

As for the $d$-dependence of the computational complexity, the original TRG has $O(d^{6})$ complexity for the SVD of a four-leg tensor into two three-leg tensors. By using the randomized SVD or the partial SVD, the complexity can be reduced to $O(d^{5})$ [82, 83]. In the initialization process of the proposed MCMC method, however, we need the full SVD to obtain all rank-1 projectors, whose complexity is still $O(d^{6})$. In the subsequent MCMC sampling stage, the complexity of the tensor network contraction is $O(d^{5})$.

Appendix D Sampling Projectors

At each step of the MCMC, a projector, i.e., a set of rank-1 projectors, is proposed as a candidate according to predetermined trial probabilities; we choose a projector in $r$ dimensions projecting onto a $d$-dimensional subspace. A particular set of rank-1 projectors, $\{k_{1},k_{2},\ldots,k_{d}\}$ ($1\leq k_{1}<k_{2}<\cdots<k_{d}\leq r$), is chosen with probability proportional to the product of the weights of its rank-1 projectors, $\prod_{i=1}^{d}w_{k_{i}}$. In this work, the weight of each rank-1 projector is defined using the corresponding singular value $c_{i}$ as

w_{i}=[\max(c_{i},c_{\text{max}}\times 10^{-12})]^{\omega}, \qquad (31)

where $c_{\text{max}}$ is the largest singular value and the exponent $\omega$ is a hyperparameter that controls the distribution of the weights. For the numerical demonstration in the main text, we use $\omega=1$; the optimal value of $\omega$ should be studied in more detail in the future. There are $\binom{r}{d}$ possible combinations of choosing $d$ out of $r$. In typical applications, $r\approx d^{2}$. For example, for $d=32$, the number of combinations exceeds $10^{60}$. Thus, naive tower sampling does not work at all.

For the present purpose, we can employ dynamic programming [84]. Note that if $m$ elements have already been chosen from $\{1,\ldots,k-1\}$, the probability of choosing the $k$-th element is given by

q^{(m)}_{k}=\frac{w_{k}R_{k,m+1}}{w_{k}R_{k,m+1}+R_{k,m}}, \qquad (32)

where $R_{k,m}$ ($k=0,\ldots,r$; $m=0,\ldots,d$) is the sum of weight products over the $\binom{r-k}{d-m}$ combinations of choosing $(d-m)$ elements from $\{k+1,\ldots,r\}$. We can use this conditional probability to decide whether to adopt each element, $k=1,\ldots,r$, one by one. Also, the probability that a chosen set includes the $k$-th element ($k=1,\ldots,r$) can be written as

q_{k}=\frac{s_{k}}{z_{k}} \qquad (33)

with

s_{k}=\sum_{m=0}^{d-1}L_{k-1,m}w_{k}R_{k,m+1}, \qquad (34)
z_{k}=\sum_{m=0}^{d-1}L_{k-1,m}w_{k}R_{k,m+1}+\sum_{m=0}^{d}L_{k-1,m}R_{k,m}, \qquad (35)

where $L_{k,m}$ ($k=0,\ldots,r$; $m=0,\ldots,d$) is the sum of weight products over the $\binom{k}{m}$ combinations of choosing $m$ elements from $\{1,\ldots,k\}$. $\{R_{k,m}\}$ and $\{L_{k,m}\}$ are calculated using the recursion relations

R_{k,m}=\begin{cases}\delta_{m,d}&\text{for $k=r$}\\ w_{k+1}R_{k+1,m+1}+R_{k+1,m}&\text{otherwise,}\end{cases} \qquad (36)
L_{k,m}=\begin{cases}\delta_{m,0}&\text{for $k=0$}\\ w_{k}L_{k-1,m-1}+L_{k-1,m}&\text{otherwise,}\end{cases} \qquad (37)

respectively, in advance at a cost of $O(rd)$. In applying Eqs. 36 and 37, terms with out-of-range indices (e.g., $R_{k,d+1}$ or $L_{k,-1}$) should be regarded as zero. It is immediately clear from the above definitions that $R_{k,d}=L_{k,0}=1$ for $k=1,\ldots,r$. By using Eqs. 36 and 37, $z_{k}$ can be rewritten as

\begin{split}z_{k}&=\sum_{m=0}^{d-1}L_{k-1,m}(w_{k}R_{k,m+1}+R_{k,m})+L_{k-1,d}R_{k,d}\\ &=\sum_{m=0}^{d}L_{k-1,m}R_{k-1,m}.\end{split} \qquad (38)

Note that $z_{k}$, which is nothing but the partition function, does not depend on $k$ and is equal to $R_{0,0}$ or $L_{r,d}$. In the Appendix of Ref. 69, an equivalent formulation has been presented in the tensor network language.
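
A direct transcription of this sampling procedure (a sketch that stores the raw $R_{k,m}$ of Eq. 36 and therefore ignores the overflow issue addressed below; 0-based indices are used instead of the 1-based $k$ of the text):

```python
import numpy as np

def sample_projector_indices(w, d, rng):
    """Draw a size-d subset of {0, ..., r-1} with probability proportional to the product
    of the selected weights, using the dynamic programming of Eqs. (32) and (36).
    Note: for large r this would over/underflow; the rescaled quantities of
    Eqs. (39)-(44) should then be used instead."""
    r = len(w)
    # R[k, m]: sum of weight products over all ways of choosing (d - m) elements
    # from the elements {k, ..., r-1}; R[k, d] = 1 (empty product), R[k, d+1] = 0.
    R = np.zeros((r + 1, d + 2))
    R[r, d] = 1.0
    for k in range(r - 1, -1, -1):
        for m in range(d + 1):
            R[k, m] = w[k] * R[k + 1, m + 1] + R[k + 1, m]
    chosen, m = [], 0
    for k in range(r):
        # conditional probability of including element k, given m elements already chosen
        prob = w[k] * R[k + 1, m + 1] / (w[k] * R[k + 1, m + 1] + R[k + 1, m])
        if rng.random() < prob:
            chosen.append(k)
            m += 1
    return chosen

rng = np.random.default_rng(5)
w = rng.random(12)
print(sample_projector_indices(w, d=4, rng=rng))   # prints d distinct indices drawn as desired
```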

When applying the above expressions, additional care must be taken, since $\{R_{k,m}\}$ and $\{L_{k,m}\}$ are sums of products of many weights, and overflow or underflow easily occurs when $r$ becomes large. One possible workaround is to use multi-precision floating-point arithmetic. Alternatively, one can store the logarithm of each value and proceed with the log-sum-exp trick. In the following, we present a more natural implementation. As we see below, our method is portable, as it does not require multi-precision floating-point arithmetic, and faster, as it does not require logarithmic and exponential operations.

First, we introduce new quantities $\{S_{k,m}\}$ and $\{T_{k,m}\}$, defined as ratios of the $\{R_{k,m}\}$ and the $\{L_{k,m}\}$, respectively:

S_{k,m}=\begin{cases}R_{r,m}=\delta_{m,d}&\text{for $k=r$}\\ 1&\text{for $m=d$}\\ R_{k,m}/R_{k,m+1}&\text{for $m<d$ and $R_{k,m+1}>0$}\\ 0&\text{otherwise,}\end{cases} \qquad (39)
T_{k,m}=\begin{cases}L_{0,m}=\delta_{m,0}&\text{for $k=0$}\\ 1&\text{for $m=0$}\\ L_{k,m}/L_{k,m-1}&\text{if $m>0$ and $L_{k,m-1}>0$}\\ 0&\text{otherwise.}\end{cases} \qquad (40)

Note that $\{S_{k,m}\}$ and $\{T_{k,m}\}$ are quantities of the order of $w_{k}$, unlike $\{R_{k,m}\}$ and $\{L_{k,m}\}$. The following recursion relations make this fact manifest:

S_{k,m}=\begin{cases}w_{k+1}+S_{k+1,d-1}&\text{for $m=d-1$}\\ S_{k+1,m+1}\displaystyle\frac{w_{k+1}+S_{k+1,m}}{w_{k+1}+S_{k+1,m+1}}&\text{otherwise,}\end{cases} \qquad (41)
T_{k,m}=\begin{cases}w_{k-1}+T_{k-1,1}&\text{for $m=1$}\\ T_{k-1,m-1}\displaystyle\frac{w_{k-1}+T_{k-1,m}}{w_{k-1}+T_{k-1,m-1}}&\text{otherwise.}\end{cases} \qquad (42)

By using $\{S_{k,m}\}$, the conditional probability $q_{k}^{(m)}$ [Eq. 32] is represented as

q_{k}^{(m)}=\begin{cases}\displaystyle\frac{w_{k}}{w_{k}+S_{k,m}}&\text{if $w_{k}>0$}\\ 0&\text{otherwise.}\end{cases} \qquad (43)

Similarly, the marginal probability $q_{k}$ [Eq. 33] is represented using $\{S_{k,m}\}$ and $\{T_{k,m}\}$ as

\begin{split}q_{k}&=\frac{\displaystyle\sum_{m=0}^{d-1}\Big[\prod_{j=1}^{m}T_{k-1,j}\Big]w_{k}\Big[\prod_{j=m+1}^{d}S_{k,j}\Big]}{\displaystyle\Big[\prod_{j=0}^{d}S_{0,j}\Big]^{1/2}\Big[\prod_{j=0}^{d}T_{r,j}\Big]^{1/2}}\\ &=w_{k}\sum_{m=0}^{d-1}\Big[\prod_{j=1}^{m}\frac{T_{k-1,j}}{S_{0,j}^{1/2}T_{r,j}^{1/2}}\Big]\Big[\prod_{j=m+1}^{d}\frac{S_{k,j}}{S_{0,j}^{1/2}T_{r,j}^{1/2}}\Big],\end{split} \qquad (44)

which can be evaluated at a cost of $O(d^{2})$ for each $k$ and is free from overflow or underflow.

References