
Tradeoff of generalization error in unsupervised learning

Gilhan Kim1, Hojun Lee1, Junghyo Jo2, and Yongjoo Baek1 1 Department of Physics and Astronomy & Center for Theoretical Physics, Seoul National University, Seoul 08826, Korea 2 Department of Physics Education & Center for Theoretical Physics, Seoul National University, Seoul 08826, Korea [email protected]
Abstract

Finding the optimal model complexity that minimizes the generalization error (GE) is a key issue of machine learning. For the conventional supervised learning, this task typically involves the bias-variance tradeoff: lowering the bias by making the model more complex entails an increase in the variance. Meanwhile, little has been studied about whether the same tradeoff exists for unsupervised learning. In this study, we propose that unsupervised learning generally exhibits a two-component tradeoff of the GE, namely the model error and the data error—using a more complex model reduces the model error at the cost of the data error, with the data error playing a more significant role for a smaller training dataset. This is corroborated by training the restricted Boltzmann machine to generate the configurations of the two-dimensional Ising model at a given temperature and the totally asymmetric simple exclusion process with given entry and exit rates. Our results also indicate that the optimal model tends to be more complex when the data to be learned are more complex.

Keywords: Machine Learning, Classical phase transitions, Stochastic processes

1 Introduction

Inductive reasoning, which derives general principles from specific observations, is an essential generalization process that builds up human knowledge. With the advent of big data, there is a rapidly growing need for automating the process of inductive reasoning. Lately, much progress has been made in this direction thanks to various machine learning techniques based on artificial neural networks (ANNs) [1, 2, 3, 4]. However, despite their tremendous success, we still have a very limited understanding of when and how an ANN achieves a good generalization capacity.

There are two major types of generalization tasks performed by the ANNs. The more well-studied type is supervised learning, which refers to the task of guessing the correct form of a function over its entire domain by generalizing some given examples of its functional relations. In this case, the degree of failure to properly generalize the given examples, called the generalization error (GE), can be defined in terms of the mean squared error (MSE) between the predicted function and the true function. In practice, the true function is unknown, so the MSE estimated using independently drawn examples of functional relations (called the test error) is used as a proxy for the GE.

Thanks to the mathematical structure of the MSE, this GE is readily decomposed into two parts [2]. The first part, called the bias, quantifies how the predicted function on average deviates from the true function. The second part, called the variance, quantifies how much the predicted function fluctuates from its average behavior. In many examples of supervised learning, these two components of the GE exhibit a tradeoff behavior: as the model complexity (e.g., the size of the ANN) increases, the bias decreases at the cost of the increasing variance. This leads to the GE showing a U-shaped dependence on the model complexity, which is called the bias-variance tradeoff. It should be noted that the decomposition is not limited to the MSE but can also be generalized to other types of the GE (see, for example, [5]). In addition, according to recent studies, the GE of supervised learning again exhibits a monotonic decrease if the model complexity is increased even further. This is called the double descent phenomenon, whose origin has been extensively discussed [6, 7, 8, 9].

Meanwhile, there is the less studied but no less important type of tasks, namely unsupervised learning, which refers to the task of finding the probability distribution that best captures the statistical properties of a dataset sampled from an unknown distribution. The GE for unsupervised learning can be defined as the Kullback-Leibler (KL) divergence of the predicted distribution from the true distribution to be found. Again, there has been a proposal about how this GE can be decomposed into the bias and the variance [10]. However, little has been studied about whether the GE of unsupervised learning also exhibits a tradeoff behavior.

In this study, we address the problem by training the Restricted Boltzmann Machine (RBM) to learn the data generated from the two-dimensional (2-d) Ising model and the totally asymmetric simple exclusion process (TASEP), which are well-known models of equilibrium and nonequilibrium phase transitions, respectively. Since the distributions of the configurations of these models are exactly known, the GE can be calculated exactly. Examining how these quantities depend on the number of hidden nodes of the RBM, we observe that the GE exhibits a tradeoff behavior. We propose a two-component decomposition scheme of the GE, so that the tradeoff is determined by the monotonic behaviors of the two error components, one related to the model limitations and the other to the fluctuations in the data. We also examine how the optimal model complexity, resulting from the tradeoff between these two GE components, depends on the complexity of the given training dataset.

The rest of the paper is organized as follows. In Sec. 2, we introduce the restricted Boltzmann machine and our decomposition scheme for its generalization error. In Sec. 3, we describe how the RBM is trained using the data generated from the 2-d Ising model and the TASEP. In Sec. 4, we present our results about the tradeoff behaviors observed in unsupervised learning of the RBMs. Finally, we summarize our findings and discuss future works in Sec. 5.

Refer to caption
Figure 1: (a) A schematic diagram showing the structure of the RBM. The generalization error (GE) of the RBM depends nonmonotonically on the number of hidden nodes N_{H}, exhibiting a tradeoff between the data error (DE) and the model error (ME), as illustrated by the generation tasks of (b) the 2-d Ising configurations at temperature T=3.6 and (c) the TASEP configurations at \alpha=\beta=0.7. The lines are to guide the eye.

2 Theory

2.1 Restricted Boltzmann Machine

The RBM is an energy-based generative model proposed by Smolensky [11] and popularized by Hinton [12, 13]. It has the form of a bipartite network of N_{V} nodes in the visible layer and N_{H} nodes in the hidden layer, see Fig. 1(a). For the visible-layer configuration \mathbf{v}\equiv\{v_{i}\}_{i=1}^{N_{V}} and the hidden-layer configuration \mathbf{h}\equiv\{h_{i}\}_{i=1}^{N_{H}}, the corresponding energy is given by

E(\mathbf{v},\mathbf{h})=-\mathbf{a}^{\mathrm{T}}\,\mathbf{v}-\mathbf{b}^{\mathrm{T}}\,\mathbf{h}-\mathbf{v}^{\mathrm{T}}\,\mathbb{W}\,\mathbf{h}, (1)

where \mathbf{a}\equiv\{a_{i}\}_{i=1}^{N_{V}} and \mathbf{b}\equiv\{b_{i}\}_{i=1}^{N_{H}} indicate the bias terms, and \mathbb{W} is the N_{V}\times N_{H} weight matrix coupling the two layers. The state of each node is a Boolean variable, i.e., v_{i}\in\{0,\,1\} and h_{i}\in\{0,\,1\}. The probability of each configuration is determined by this energy function according to the Boltzmann distribution

Q_{VH}(\mathbf{v},\mathbf{h})=\frac{1}{Z}\exp\left[-E(\mathbf{v},\mathbf{h})\right], (2)

where Z is the normalizing factor (or the partition function)

Z\equiv\sum_{\mathbf{v},\mathbf{h}}\exp\left[-E(\mathbf{v},\mathbf{h})\right]. (3)
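As a concrete illustration of Eqs. (1)–(3), the following minimal Python sketch (ours, not part of the original work; all names are illustrative) evaluates the RBM energy and the partition function by brute-force enumeration, which is feasible only for small networks such as those used in this paper:

```python
import itertools
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Energy of a joint configuration, Eq. (1): E = -a.v - b.h - v.W.h"""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def partition_function(a, b, W):
    """Exact partition function Z, Eq. (3), summed over all 2^(Nv+Nh) states."""
    Nv, Nh = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=Nv):
        for h in itertools.product([0, 1], repeat=Nh):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), a, b, W))
    return Z

# Tiny example: 2 visible nodes, 1 hidden node, random parameters
rng = np.random.default_rng(0)
a = rng.normal(size=2)
b = rng.normal(size=1)
W = rng.normal(size=(2, 1))
Z = partition_function(a, b, W)
```

The cost grows exponentially with the number of nodes, which is why practical trainings resort to the approximation methods mentioned later in this section.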

The goal of the RBM is to find \mathbf{a}, \mathbf{b}, and \mathbb{W} such that the marginal distribution Q_{V}(\mathbf{v})\equiv\sum_{\mathbf{h}}Q_{VH}(\mathbf{v},\mathbf{h}) is as similar as possible to some given empirical distribution P_{V}(\mathbf{v}). To put it precisely, the RBM seeks to achieve

Q^{*}_{V}\equiv\underset{Q_{V}}{\mathrm{arg\,min}}\;D_{\mathrm{KL}}(P_{V}\|Q_{V}), (4)

where

D_{\mathrm{KL}}(P_{V}\|Q_{V})\equiv\sum_{\mathbf{v}}P_{V}(\mathbf{v})\log\frac{P_{V}(\mathbf{v})}{Q_{V}(\mathbf{v})} (5)

is the Kullback-Leibler (KL) divergence. Towards this purpose, the above KL divergence is taken to be the loss function, and the RBM is updated according to the gradient descent

a_{i}(t+1)=a_{i}(t)-\alpha\frac{\partial}{\partial a_{i}}D_{\mathrm{KL}}(P_{V}\|Q_{V}), (6)
b_{i}(t+1)=b_{i}(t)-\alpha\frac{\partial}{\partial b_{i}}D_{\mathrm{KL}}(P_{V}\|Q_{V}), (7)
W_{ij}(t+1)=W_{ij}(t)-\alpha\frac{\partial}{\partial W_{ij}}D_{\mathrm{KL}}(P_{V}\|Q_{V}). (8)

Denoting by

Q_{H|V}(\mathbf{h}|\mathbf{v})\equiv\frac{Q_{VH}(\mathbf{v},\mathbf{h})}{Q_{V}(\mathbf{v})}=\prod_{j=1}^{N_{H}}\frac{\exp\left[b_{j}h_{j}+\sum_{i=1}^{N_{V}}v_{i}W_{ij}h_{j}\right]}{1+\exp\left[b_{j}+\sum_{i=1}^{N_{V}}v_{i}W_{ij}\right]} (9)

the conditional probability of the hidden-layer configuration, we can show that the gradients of the KL divergence satisfy

\frac{\partial}{\partial a_{i}}D_{\mathrm{KL}}(P_{V}\|Q_{V})=\langle v_{i}\rangle_{Q_{V}}-\langle v_{i}\rangle_{P_{V}}, (10)
\frac{\partial}{\partial b_{i}}D_{\mathrm{KL}}(P_{V}\|Q_{V})=\langle h_{i}\rangle_{Q_{VH}}-\langle h_{i}\rangle_{P_{V}Q_{H|V}}, (11)
\frac{\partial}{\partial W_{ij}}D_{\mathrm{KL}}(P_{V}\|Q_{V})=\langle v_{i}h_{j}\rangle_{Q_{VH}}-\langle v_{i}h_{j}\rangle_{P_{V}Q_{H|V}}, (12)

where \langle\cdot\rangle_{F} denotes an average with respect to the probability distribution F. Thus, the training saturates when the first and the second moments of the state variables are equal for the empirical (P_{V}Q_{H|V}) and the model (Q_{VH}) distributions.
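To make Eqs. (9)–(12) concrete, here is a minimal sketch (our own, with illustrative names) that computes the exact marginal Q_V and the exact gradients of the KL divergence by enumerating all visible states, using the standard closed form Q_V(\mathbf{v})\propto e^{\mathbf{a}^{\mathrm{T}}\mathbf{v}}\prod_j\left(1+e^{b_j+\sum_i v_i W_{ij}}\right):

```python
import itertools
import numpy as np

def exact_marginal(a, b, W):
    """Model marginal Q_V(v) over all 2^Nv visible states (itertools order)."""
    Nv, Nh = W.shape
    states = np.array(list(itertools.product([0, 1], repeat=Nv)))
    # log of the unnormalized marginal: a.v + sum_j log(1 + exp(b_j + (vW)_j))
    logq = states @ a + np.log1p(np.exp(states @ W + b)).sum(axis=1)
    q = np.exp(logq - logq.max())          # shift for numerical stability
    return states, q / q.sum()

def exact_gradients(P, a, b, W):
    """Gradients of D_KL(P_V || Q_V), Eqs. (10)-(12), as moment differences.
    P is the empirical distribution over visible states, in itertools order."""
    states, Q = exact_marginal(a, b, W)
    ph = 1.0 / (1.0 + np.exp(-(states @ W + b)))   # Q_{H|V}(h_j=1|v), Eq. (9)
    ga = Q @ states - P @ states                   # <v>_{Q_V}  - <v>_{P_V}
    gb = Q @ ph - P @ ph                           # <h>_{Q_VH} - <h>_{P_V Q_{H|V}}
    gW = states.T @ (Q[:, None] * ph) - states.T @ (P[:, None] * ph)
    return ga, gb, gW
```

Plugging these gradients into the updates of Eqs. (6)–(8) reproduces the exact training used in this study; the enumeration cost grows as 2^{N_V}, so this is only practical for small systems such as the 3×3 lattice considered here.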

Since these gradients involve the average \langle\cdot\rangle_{Q_{VH}}, whose computation is difficult for large networks, various approximation methods are used in practice, such as contrastive divergence (CD) [12], persistent contrastive divergence (PCD) [14], fast PCD [15], Bennett’s acceptance ratio method [16], and the Thouless-Anderson-Palmer equations [17]. However, to avoid further complications arising from the approximations, in this study we stick to the exact gradients written above.

2.2 Error components in unsupervised learning

The goal of unsupervised learning is to construct a generative model whose statistical properties are as similar as possible to the true distribution underlying an ensemble of objects. We may formally describe the problem as follows. Suppose that there exists a true probability distribution P^{0}_{V} generating an ensemble of objects. Then we can think of the best model Q^{0}_{V} that the RBM can express, namely

Q^{0}_{V}\equiv\underset{Q_{V}}{\mathrm{arg\,min}}\;D_{\mathrm{KL}}(P^{0}_{V}\|Q_{V}). (13)

Ideally, the RBM should generate Q^{0}_{V} at the end of the training. However, this may fail for three reasons. First, the KL divergence is generally not a convex function of \mathbf{a}, \mathbf{b}, and \mathbb{W}, so the gradient descent shown in Eqs. (6)–(8) may end up in a local minimum far away from Q^{0}_{V}. Second, even if the RBM does evolve towards Q^{0}_{V}, the convergence may take an extremely long time. In this case, the training will have to end before Q^{0}_{V} is reached. Third, in practical situations, we do not have direct access to the true distribution P^{0}_{V} generating the observed samples. Due to the sampling error, the distribution P_{V} used in the training is generally different from P^{0}_{V}. Then, even if the RBM does manage to find the distribution most similar to P_{V}, it may still be quite different from Q^{0}_{V}.

For these reasons, the distribution generated by the RBM at the end of the training is not Q^{0}_{V}; instead, each training results in its own model distribution Q^{\mathrm{m}}_{V}. Then the GE, which quantifies the expected difference of the model distribution from the true distribution, can be defined as

\mathrm{GE}\equiv\left\langle D_{\mathrm{KL}}(P^{0}_{V}\|Q^{\mathrm{m}}_{V})\right\rangle_{\mathrm{m}}, (14)

where \langle\cdot\rangle_{\mathrm{m}} represents the average over different models obtained by independent trainings. By definition, the GE cannot be smaller than the model error (ME)

\mathrm{ME}\equiv D_{\mathrm{KL}}(P^{0}_{V}\|Q^{0}_{V}), (15)

which indicates the fundamental lower bound on how similar the model distribution generated by the RBM can be to the true distribution P^{0}_{V}. Finally, we identify the excess part of the error,

\mathrm{DE}\equiv\mathrm{GE}-\mathrm{ME}, (16)

as the data error (DE), which mainly stems from deviations of the training data from the true distribution, as will be shown below. Thus, Eqs. (14)–(16) define a two-component decomposition of the GE for unsupervised learning.
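Given the true distribution, the optimal model, and a set of trained models as discrete probability vectors, the decomposition of Eqs. (14)–(16) can be computed directly. A minimal sketch (illustrative names; assuming the model distributions have full support so the KL divergence is finite):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p||q), Eq. (5), with 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def decompose(P0, Q0, Qm_list):
    """Two-component decomposition, Eqs. (14)-(16):
    GE = <D_KL(P0||Qm)>_m,  ME = D_KL(P0||Q0),  DE = GE - ME."""
    GE = float(np.mean([kl(P0, Qm) for Qm in Qm_list]))
    ME = kl(P0, Q0)
    return GE, ME, GE - ME
```

By construction, the DE vanishes when every training reproduces the optimal model Q^{0}_{V} exactly.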

3 Methods

To examine the behaviors of the error components defined above, we train the RBMs on two basic models of statistical physics: the 2-d Ising model and the TASEP with open boundaries. We chose these models for two reasons. First, they have exactly known steady-state distributions: the Ising model follows the standard Boltzmann statistics, and the nonequilibrium steady-state statistics of the TASEP can be obtained exactly using the matrix-product ansatz [18]. Thus, for these models, we have full information about P^{0}_{V}. Second, they provide examples of equilibrium and nonequilibrium phase transitions. Depending on the nature of the phases and the transitions between them, this allows us to control the complexity of the data and examine how the tradeoff behavior is affected by it. With these considerations in mind, we describe the two models in the following.

Refer to caption
Figure 2: (a) The magnetization and the susceptibility of the 2-d Ising model on the 3\times 3 square lattice with periodic boundaries. The vertical lines indicate the values of temperature used for generating the training datasets. (b) A schematic illustration of the TASEP with open boundaries. (c) The phase diagram of the TASEP with open boundaries. The red points indicate the control parameters used for generating the training datasets.

3.1 2-d Ising model

The Ising model, originally proposed as a simplified model of ferromagnetism [19], is a paradigmatic model of equilibrium phase transitions with exact solutions for its free energy [20] and spontaneous magnetization [21] on the 2-d square lattice. In this study, we use the Ising model on a 3\times 3 square lattice with periodic boundary conditions (v_{4,j}=v_{1,j} and v_{i,4}=v_{i,1}) following the Boltzmann statistics

P^{0}_{V}(\mathbf{v})\propto\exp\left[-\frac{1}{T}\mathcal{H}(\mathbf{v})\right], (17)

where the Hamiltonian \mathcal{H} is given by

\mathcal{H}(\mathbf{v})=-\sum_{i=1}^{3}\sum_{j=1}^{3}\left[(2v_{i,j}-1)(2v_{i+1,j}-1)+(2v_{i,j}-1)(2v_{i,j+1}-1)\right], (18)

with v_{i,j}\in\{0,\,1\} for each i and j. We train the RBMs to the equilibrium distributions at three different temperatures T=1.9, T=3.6, and T=16. These values are chosen so that T=1.9 corresponds to the ferromagnetic (ordered) phase, T=3.6 to the critical regime, and T=16 to the paramagnetic (disordered) phase. Even though the size of the system used in our study is small, these three values of temperature generate markedly different ensembles of spin configurations, as shown by the order parameter (magnetization) and its fluctuations (susceptibility) plotted in Fig. 2(a).
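Since the system is only 3\times 3, the true distribution P^{0}_{V} of Eqs. (17)–(18) can be tabulated exactly by enumerating all 512 configurations. A minimal sketch (ours, with illustrative names):

```python
import itertools
import numpy as np

def ising_boltzmann(T, L=3):
    """Exact Boltzmann distribution P^0_V, Eqs. (17)-(18), for the L x L Ising
    model with periodic boundaries (v_ij in {0,1}), over all 2^(L*L) states."""
    configs = np.array(list(itertools.product([0, 1], repeat=L * L)))
    s = (2 * configs - 1).reshape(-1, L, L)        # spins in {-1, +1}
    # nearest-neighbour bonds with periodic wrap-around in both directions
    H = -(s * np.roll(s, -1, axis=1)).sum(axis=(1, 2)) \
        - (s * np.roll(s, -1, axis=2)).sum(axis=(1, 2))
    w = np.exp(-H / T)
    return configs, w / w.sum()

configs, P0 = ising_boltzmann(T=3.6)   # 512 configurations and their weights
```

The resulting vector P0 is exactly the true distribution used both for sampling the training data and for computing the exact gradients below.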

Using P^{0}_{V} thus defined, we train the RBM in two different ways. The first is the usual unsupervised learning. We draw a certain number of equilibrium spin configurations from P^{0}_{V} by the Metropolis-Hastings algorithm. To remove correlations between different sampled configurations, we save a snapshot of the system every 90 or more Monte Carlo steps (varied depending on the correlation time). This sampled set of Ising configurations forms P_{V}, which we use in the gradient descent dynamics described by Eqs. (6)–(12). We repeat the process 10 times to obtain 10 independent models \{Q^{\mathrm{m}}_{V}\}, with which we calculate the GE according to Eq. (14).

Meanwhile, to calculate the ME and the DE, we train the RBM in a different way. Instead of the sampled P_{V}, we use the true distribution P^{0}_{V} to directly calculate the true gradients in Eqs. (6)–(12). This is done by evaluating the averages over all 2^{9}=512 spin configurations of the 2-d Ising model on the 3\times 3 square lattice. Repeating this training 10 times, we choose the resulting model that minimizes the KL divergence as an estimate of Q^{0}_{V}. In fact, we find that the KL divergence obtained by this training method exhibits only small variability: the error of the estimated ME is at most of the order of the symbol size in the plots.

3.2 TASEP with open boundaries

The TASEP is a simple model of nonequilibrium 1-d transport of hard-core particles. Originally proposed as a model of biopolymerization [22], the model has found numerous applications in various traffic [23] and biological [24] systems.

For the case of open boundaries, the process is defined as follows (see Fig. 2(b) for a schematic illustration). Consider a 1-d chain of L discrete sites. Each site can hold at most a single particle. From the 1-st site to the (L-1)-th site, every particle moves to the right with rate 1 if the next site is empty; the movement is forbidden if the next site is occupied. Meanwhile, if the 1-st site is empty (the L-th site is occupied), a particle enters the site from the left particle reservoir (leaves the site to the right particle reservoir) with entry rate \alpha (exit rate \beta).
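The dynamics just described can be simulated with a simple random-sequential kinetic Monte Carlo scheme. The sketch below (ours, with illustrative names; the uniform-bond-selection update with rejection is one standard choice, not necessarily the authors' exact implementation) generates steady-state snapshots in the spirit of Sec. 3.2:

```python
import numpy as np

def tasep_step(state, alpha, beta, rng):
    """One update attempt of the open-boundary TASEP: pick one of the L+1
    bonds (entry, L-1 bulk hops, exit) uniformly and try the move with
    probability equal to its rate (bulk rate 1, entry alpha, exit beta)."""
    L = len(state)
    k = rng.integers(L + 1)
    if k == 0:                                    # entry at site 1
        if state[0] == 0 and rng.random() < alpha:
            state[0] = 1
    elif k == L:                                  # exit from site L
        if state[-1] == 1 and rng.random() < beta:
            state[-1] = 0
    else:                                         # bulk hop k -> k+1
        if state[k - 1] == 1 and state[k] == 0:
            state[k - 1], state[k] = 0, 1
    return state

def sample_tasep(L=9, alpha=0.7, beta=0.7, n_samples=10, gap=90, seed=0):
    """Generate steady-state snapshots, discarding a burn-in and keeping one
    snapshot every `gap` Monte Carlo steps (one step = L+1 attempts)."""
    rng = np.random.default_rng(seed)
    state = np.zeros(L, dtype=int)
    for _ in range(1000 * (L + 1)):               # burn-in to reach steady state
        tasep_step(state, alpha, beta, rng)
    samples = []
    for _ in range(n_samples):
        for _ in range(gap * (L + 1)):
            tasep_step(state, alpha, beta, rng)
        samples.append(state.copy())
    return np.array(samples)

samples = sample_tasep()
```

Snapshots collected this way play the role of the training set P_{V} for the TASEP.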

Depending on the values of \alpha and \beta, the TASEP exhibits various phases distinguished from each other by the current J and the bulk density \rho_{b}, as shown in Fig. 2(c). The phase boundaries can be correctly predicted by the simple kinematic wave theory or by mean-field arguments, although the exact solution for the steady-state statistics requires the matrix-product ansatz [18, 25].

Similarly to the case of the 2-d Ising model, we train the RBM in two different ways. First, we generate a certain number of steady-state particle configurations by directly running kinetic Monte Carlo simulations of the TASEP. To remove correlations between different sampled configurations, we take a snapshot every 90 Monte Carlo steps. This sampled set is used as P_{V} in the gradient descent dynamics described by Eqs. (6)–(12). The same process is repeated 10 times to obtain 10 independent models \{Q^{\mathrm{m}}_{V}\}, with which we calculate the GE according to Eq. (14).

To calculate the ME, we need the true distribution P^{0}_{V} of the TASEP. Using the recursive relations between the matrix-product representations of the steady-state probabilities shown in [25], we can iteratively determine the probabilities of all 2^{9}=512 TASEP configurations. Based on these probabilities, we directly calculate the true gradients in Eqs. (6)–(12). The ME is then found by choosing the model that minimizes the KL divergence as an estimate of Q^{0}_{V}.

The codes for the training methods described above are available in [github].

Refer to caption
Figure 3: Training dynamics of the RBM. (a) The evolution of the generalization error (GE) when the RBM is trained to 512 sampled configurations of the 2-d Ising model at temperature T=1.9. (b) The evolution of the GE when the RBM is trained to the exact gradients of the KL divergence at the same temperature. The insets show the average trajectories of the RBM in terms of the error and the weight width during each dynamics, with the direction of time shown by black arrows. The error bars are comparable to the symbol size.

4 Results

4.1 Training dynamics

In Fig. 3(a), we show the evolution of the mean error \left\langle D_{\mathrm{KL}}(P^{0}_{V}\|Q^{\mathrm{m}}_{V})\right\rangle_{\mathrm{m}} as the RBM is trained on the 512 sampled configurations (using mini-batches of size 256) of the 2-d Ising model at T=1.9 (ferromagnetic phase). When N_{H}\leq 2, the error monotonically decreases as the training proceeds, saturating by epoch 10^{3}. However, for N_{H}\geq 3, the error reaches its minimum around epoch 500 and increases again, reaching a higher value as N_{H} is increased. These tendencies reflect that more complex RBMs (with larger N_{H}) end up learning even the noisy features of the sampled configurations that deviate from the true P^{0}_{V}. This is equivalent to the overfitting phenomenon in supervised learning.

Meanwhile, in Fig. 3(b), we show the evolution of the same mean error as the RBM is trained on the true distribution of the 2-d Ising model at T=1.9. When N_{H}\leq 4, the error saturates to some value that decreases as N_{H} is increased. When N_{H} is increased further, the error keeps decreasing as the training proceeds, never saturating within the observation time span. This shows that the overfitting effect is absent when the true distribution is directly used for the training.

The presence of overfitting can also be checked by observing the weights of the RBM. As shown in the inset of Fig. 3(a), when the RBM is trained to the sampled dataset, the width (standard deviation) of the weight distribution tends to increase towards the end of the training. This is because here the RBM is effectively decreasing its temperature, trying to distinguish configurations which happen to appear in the sampled dataset from those which happen not to appear, even if they are very similar to each other. In contrast, when the RBM is trained to the true distribution, the weight width saturates to some value and stops increasing, as shown in Fig. 3(b). This shows that there is no overfitting involved in the dynamics.

4.2 Effects of data volume

Refer to caption
Figure 4: Dependence of the data error (DE) on the training dataset volume. We show results for the 2-d Ising model at (a) T=1.9 (ferromagnetic), (b) T=3.6 (critical), (c) T=16 (paramagnetic) and for the TASEP with open boundaries at (d) \alpha=0.3, \beta=0.7 (low density), (e) \alpha=0.5, \beta=0.7 (phase boundary), (f) \alpha=0.7, \beta=0.7 (maximal current). The error bars are comparable to the symbol size.

As stated in Eq. (15), the ME depends only on the true distribution P^{0}_{V} and the optimal model distribution Q^{0}_{V}; thus the ME does not depend on the volume of the training dataset. The only part of the GE that can be affected by the data volume is the DE, whose behaviors are shown in Fig. 4. As the data volume V increases, the DE tends to decrease like V^{-1}. This scaling behavior can be understood as follows. By careful sampling, all V samples of the training dataset can be made independent and identically distributed. We note that P_{V}(\mathbf{v}) can be interpreted as the sample mean of a random variable whose value is 1 when the sample is in the state \mathbf{v} and zero otherwise. Then, when V is very large, the central limit theorem implies that the sample mean P_{V}(\mathbf{v}) deviates from the true probability P^{0}_{V}(\mathbf{v}) by an amount proportional to V^{-1/2}. Then |Q^{\mathrm{m}}_{V}(\mathbf{v})-Q^{0}_{V}(\mathbf{v})|, with Q^{0}_{V} related to P^{0}_{V} and Q^{\mathrm{m}}_{V} to P_{V}, would also be of order V^{-1/2} for every \mathbf{v}. Now, the DE defined in Eq. (16) can be rewritten as

\mathrm{DE}=\left\langle D_{\mathrm{KL}}(P^{0}_{V}\|Q^{\mathrm{m}}_{V})\right\rangle_{\mathrm{m}}-D_{\mathrm{KL}}(P^{0}_{V}\|Q^{0}_{V}), (19)

which reaches its minimum value of zero when Q^{0}_{V} and Q^{\mathrm{m}}_{V} are exactly equal. However, as discussed above, we expect these distributions to differ by a small amount proportional to V^{-1/2} for every configuration. Since the DE is close to its minimum, this difference of order V^{-1/2} leads to a correction to the DE of order (V^{-1/2})^{2}\sim V^{-1}. Hence, the DE scales like V^{-1}.
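The V^{-1} scaling can be checked numerically in a stripped-down setting: the mean KL divergence between an empirical distribution built from V i.i.d. samples and the underlying true distribution decays like V^{-1} (asymptotically (K-1)/(2V) for K outcomes). A small sketch with a toy distribution of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
P0 = np.array([0.5, 0.25, 0.125, 0.125])   # toy "true" distribution, K = 4

def kl_emp(counts, P0):
    """D_KL(P_emp || P0) with the 0 log 0 = 0 convention."""
    p = counts / counts.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / P0[mask])))

def mean_kl(V, n_rep=2000):
    """Mean divergence over n_rep independent datasets of volume V."""
    return float(np.mean([kl_emp(rng.multinomial(V, P0), P0) for _ in range(n_rep)]))

# doubling V roughly halves the mean divergence, consistent with a V^{-1} law
for V in [200, 400, 800]:
    print(V, mean_kl(V))
```

Here the roles of model fitting are stripped away, leaving only the sampling fluctuation that the text identifies as the origin of the DE.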

We note that some deviations from \mathrm{DE}\sim V^{-1} are observed for the RBMs with N_{H}=1 and 2 trained on the Ising model at T=1.9 and T=3.6. When N_{H} is too small compared to the complexity of the training dataset, the dynamics of the RBM may develop glassy features (as is the case for supervised learning [26]), such as a rugged loss function landscape and the presence of multiple local minima. In such cases, the DE may not be entirely induced by the difference between P^{0}_{V} and P_{V}. However, in the regime of large N_{H}, where the DE becomes a significant part of the GE, we expect the DE to be dominated by the effects of P_{V}\neq P^{0}_{V}.

4.3 Effects of N_{H}

Refer to caption
Figure 5: Dependence of the generalization error (GE), the data error (DE), and the model error (ME) on the number of hidden nodes in the RBM. We show results for the 2-d Ising model at (a) T=1.9 (ferromagnetic), (b) T=3.6 (critical), (c) T=16 (paramagnetic) and for the TASEP with open boundaries at (d) \alpha=0.3, \beta=0.7 (low density), (e) \alpha=0.5, \beta=0.7 (phase boundary), (f) \alpha=0.7, \beta=0.7 (maximal current). The error bars are comparable to the symbol size.

Now we examine the effects of N_{H} on the GE and its two components defined by Eqs. (14)–(16), which are shown in Fig. 5. The results for the 2-d Ising model are shown in Fig. 5(a), (b), and (c). In all cases, the ME (DE) monotonically decreases (increases) with N_{H}. For T=1.9 and T=3.6, this leads to a nonmonotonic behavior of the GE, which is minimized at N_{H}=3 for T=1.9 and N_{H}=4 for T=3.6. Meanwhile, for T=16, the ME is already very small at N_{H}=1, reflecting that a single hidden node is enough to describe the disordered state, where all spins are unbiased i.i.d. random variables.

The results for the TASEP are shown in the lower panel of Fig. 5. For \alpha=0.3 and \beta=0.7, the ME is almost always zero, see Fig. 5(d). This is because, in the low-density phase, the occupancies of most of the sites are i.i.d. random variables. The monotonic decrease of the ME becomes more visible as the system crosses the phase boundary (see Fig. 5(e)), although the tradeoff behavior of the GE becomes clear only when the system is well within the maximal-current phase, see Fig. 5(f). For the TASEP, it is known that the mean-field approximations are valid sufficiently far away from the boundaries. While the boundary effects decay exponentially with distance in the low-density and the high-density phases, the decay is algebraic in the maximal-current phase. Thus, the amount of correlation present in the data tends to be larger in the maximal-current phase.

To sum up, the GE exhibits a U-shaped tradeoff behavior as the ME (DE) decreases (increases) monotonically with N_{H}. The GE minimum tends to occur at a higher N_{H} when the dataset to be modeled is more complex. The situation is analogous to the bias-variance tradeoff observed in supervised learning, as detailed in Table 1.

4.4 Comparison with the bias-variance decomposition

Refer to caption
Figure 6: Decomposition of the generalization error (GE) according to the bias-variance scheme à la Heskes. We show results for the 2-d Ising model at (a) T=1.9 (ferromagnetic), (b) T=3.6 (critical), (c) T=16 (paramagnetic) and for the TASEP with open boundaries at (d) \alpha=0.3, \beta=0.7 (low density), (e) \alpha=0.5, \beta=0.7 (phase boundary), (f) \alpha=0.7, \beta=0.7 (maximal current). The error bars are comparable to the symbol size.

While the behaviors of the ME and the DE shown in Fig. 5 are similar to those of the bias and the variance in supervised learning, there is no clear mathematical connection between these quantities. In fact, in 1998, Heskes proposed a bias-variance decomposition scheme for the KL divergence [10], which in our problem would read

\left\langle D_{\mathrm{KL}}(P^{0}_{V}\|Q^{\mathrm{m}}_{V})\right\rangle_{\mathrm{m}}=\underbrace{D_{\mathrm{KL}}(P^{0}_{V}\|\bar{Q}_{V})}_{\equiv\mathrm{Bias}}+\underbrace{\left\langle D_{\mathrm{KL}}(\bar{Q}_{V}\|Q^{\mathrm{m}}_{V})\right\rangle_{\mathrm{m}}}_{\equiv\mathrm{Variance}}, (20)

where

\bar{Q}_{V}(\mathbf{v})\equiv\frac{1}{\mathcal{Z}}\exp\,\langle\log Q^{\mathrm{m}}_{V}(\mathbf{v})\rangle_{\mathrm{m}} (21)

is the mean distribution, with \mathcal{Z} the suitable normalization constant. This generalizes the bias-variance decomposition originally proposed for the MSE in the following sense:

1. The variance does not depend on P^{0}_{V} directly. It is also nonnegative, and equal to zero if and only if all the Q^{\mathrm{m}}_{V} are identical.

2. The bias depends only on P^{0}_{V} and the “average model” \bar{Q}_{V}, which is defined as the minimizer of the variance.

Note that this scheme is related to our ME-DE decomposition scheme by

\mathrm{ME}=\mathrm{Bias}-\sum_{\mathbf{v}}P^{0}_{V}(\mathbf{v})\log\frac{Q^{0}_{V}(\mathbf{v})}{\bar{Q}_{V}(\mathbf{v})}, (22)
\mathrm{DE}=\mathrm{Variance}+\sum_{\mathbf{v}}P^{0}_{V}(\mathbf{v})\log\frac{Q^{0}_{V}(\mathbf{v})}{\bar{Q}_{V}(\mathbf{v})}. (23)

Since \sum_{\mathbf{v}}P^{0}_{V}(\mathbf{v})\log\left[Q^{0}_{V}(\mathbf{v})/\bar{Q}_{V}(\mathbf{v})\right] is not sign definite (note that \bar{Q}_{V} may be a distribution that cannot be generated by an RBM, so it can be even “better” than the optimal RBM-generated distribution Q^{0}_{V}), there are no definite inequalities between these error components.
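For completeness, the Heskes decomposition of Eqs. (20)–(21) is straightforward to evaluate for discrete distributions: the mean model \bar{Q}_{V} is the normalized geometric mean of the trained models. A minimal sketch (illustrative function name; assuming full support so all logarithms are finite):

```python
import numpy as np

def heskes_decomposition(P0, Qm_list):
    """Bias-variance decomposition a la Heskes, Eqs. (20)-(21).
    The mean model Qbar is the normalized geometric mean of the Q^m's."""
    logQ = np.log(np.array(Qm_list))
    Qbar = np.exp(logQ.mean(axis=0))
    Qbar /= Qbar.sum()                    # normalization constant Z
    bias = float(np.sum(P0 * np.log(P0 / Qbar)))
    variance = float(np.mean([np.sum(Qbar * np.log(Qbar / Qm)) for Qm in Qm_list]))
    return bias, variance
```

By construction, the bias and the variance sum exactly to the left-hand side of Eq. (20), i.e., to the GE of Eq. (14).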

We reexamine the effects of N_{H} shown in Fig. 5 using the Heskes decomposition scheme. As shown in Fig. 6, while the variance monotonically increases with N_{H}, the bias exhibits a nonmonotonic behavior as N_{H} is increased. It seems that, in this case, the bias also contains significant contributions from the sampling fluctuations, which are captured by the DE in our decomposition scheme. This might be due to the training outcome Q^{\mathrm{m}}_{V} having a skewed distribution around the optimal distribution Q^{0}_{V}. Thus, to describe the tradeoff behavior of the GE, the decomposition into the ME and the DE seems more appropriate than the Heskes bias-variance decomposition.

5 Summary and outlook

                           Supervised learning (FNN)   Unsupervised learning (RBM)
Generalization error       Mean-squared error          KL divergence
Limitation of the model    Bias                        Model error
Performance variability    Variance                    Data error
What it learns             Functional relationship     Distribution
What it overfits           Noise                       Sampling error
Table 1: A comparison between the GE tradeoff behaviors in supervised (feed-forward neural network; FNN) and unsupervised learning.

In this study, we proposed that the generalization error (GE) in unsupervised learning can be decomposed into two components, namely the model error (ME) and the data error (DE). To examine how these quantities behave as the data and the model complexities are varied, we trained RBMs with various numbers of hidden nodes to generate the steady-state configurations of the 2-d Ising model at a given temperature and the TASEP with given entry and exit rates, whose statistics are exactly known. For both models, we observed that the DE tends to decrease in proportion to the inverse size of the training dataset, verifying that our decomposition properly distinguishes between the two sources of the GE, namely the inadequacies of the model (ME) and those of the training data (DE). Moreover, we found that the ME (DE) decreases (increases) monotonically as the number of hidden nodes is increased, leading to the tradeoff behavior of the GE. The GE minimum occurs at a higher number of hidden nodes when there exist stronger correlations between different parts of the system. This is analogous to the bias-variance tradeoff in supervised learning: too simple a machine fails to capture the essential features of the data, while too complex a machine fails to filter out the noise.
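The shrinking of the DE with growing training data can be illustrated outside the RBM setting by a deliberately simple hypothetical example: fitting a multinomial distribution by (smoothed) maximum likelihood. The model family then contains $P^{0}_{V}$ exactly, so the ME vanishes and the entire GE is data error, which decays as the training dataset grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL divergence between two distributions on a finite set."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical example: the "model" is a multinomial distribution fitted to the
# training data, so the model family contains P0 and the ME vanishes; the
# remaining GE is pure data error (DE) caused by sampling fluctuations.
K = 6
P0 = rng.dirichlet(np.ones(K))

def mean_DE(M, trials=400):
    """Average KL(P0 || Q^m) over training datasets of size M."""
    total = 0.0
    for _ in range(trials):
        counts = rng.multinomial(M, P0)
        Qm = (counts + 1.0) / (M + K)   # additive smoothing keeps the KL finite
        total += kl(P0, Qm)
    return total / trials

de_small, de_large = mean_DE(200), mean_DE(4000)
assert de_large < de_small              # DE shrinks as the dataset grows
```

For the unsmoothed maximum-likelihood fit, the leading behavior is $\mathrm{DE}\approx (K-1)/(2M)$, consistent with the inverse-size scaling observed for the RBMs.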

Thus, our study clarifies the nature of the GE tradeoff in unsupervised learning. Various theoretical studies are possible from here. For example, while our study has reported only numerical results for the RBMs, it would be interesting to construct an analytically tractable example of unsupervised learning for which the location of the GE minimum can be explicitly connected to the statistical properties of the training data. Also notable is the abrupt change in the training dynamics at the onset of the overfitting regime shown in Fig. 3. It would be worthwhile to check whether the RBM crosses any phase boundary as its dynamical behavior changes. It would also be interesting to investigate (i) how regularization suppresses overfitting and affects the GE minimum, (ii) how the hidden layer encodes information differently in the underparametrized and the overparametrized regimes [27, 28], and (iii) whether the double descent of the GE observed in supervised learning [6, 7, 8, 9] is also possible in unsupervised learning.

Figure 7: Dependence of the log-likelihood on the number of hidden nodes in the RBM. We show results for the 2-d Ising model at (a) $T=1.9$ (ferromagnetic), (b) $T=3.6$ (critical), (c) $T=16$ (paramagnetic) and for the TASEP with open boundaries at (d) $\alpha=0.3$, $\beta=0.7$ (low density), (e) $\alpha=0.5$, $\beta=0.7$ (phase boundary), (f) $\alpha=0.7$, $\beta=0.7$ (maximal current). The error bars are explicitly shown for the Ising model, while they are much smaller than the symbol size for the TASEP.

On the practical side, we note that the GE defined as the KL divergence in Eq. (14) is not easy to calculate, as $P^{0}_{V}$ is unknown in practice. Instead, one can focus on the tradeoff behavior of the log-likelihood

\mathcal{L} \equiv \left\langle \frac{1}{M}\sum_{i=1}^{M} \log Q^{\mathrm{m}}_{V}(\mathbf{v}_{i}) \right\rangle_{\mathrm{m}}, \qquad (24)

whose value is obtained using a sampled test dataset $\{\mathbf{v}_{1},\ldots,\mathbf{v}_{M}\}$. Unlike the GE, $\mathcal{L}$ is not bounded by zero, but the nonmonotonic behavior of the GE as a function of $N_H$ will manifest itself as the inverted nonmonotonic behavior of $\mathcal{L}$, as illustrated for the 2-d Ising model and the TASEP in Fig. 7. It would be interesting to check whether the same behavior is observed for a more diverse range of examples.
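As a rough illustration of how such a log-likelihood can be evaluated, consider a hypothetical toy RBM with random parameters, small enough that the marginal $Q_{V}$ is computable by exhaustive enumeration (the average over training realizations $\langle\cdot\rangle_{\mathrm{m}}$ of Eq. (24) is omitted here for brevity; in practice $Q_{V}$ must be approximated, e.g., by annealed importance sampling).

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical small RBM with random parameters; for these sizes the marginal
# Q_V(v) can be computed exactly by enumerating all visible configurations.
NV, NH = 5, 3
W = rng.normal(scale=0.5, size=(NV, NH))      # couplings
a = rng.normal(scale=0.5, size=NV)            # visible biases
b = rng.normal(scale=0.5, size=NH)            # hidden biases

def unnormalized_QV(v):
    """Sum over binary hidden units done analytically:
    sum_h exp(a.v + b.h + v.W.h) = exp(a.v) * prod_j (1 + exp(b_j + (v.W)_j))."""
    return np.exp(a @ v) * np.prod(1.0 + np.exp(b + v @ W))

states = [np.array(s) for s in itertools.product([0, 1], repeat=NV)]
Z = sum(unnormalized_QV(v) for v in states)   # partition function by enumeration
probs = np.array([unnormalized_QV(v) / Z for v in states])

# Estimate the log-likelihood on a sampled test dataset of M configurations.
M = 1000
test_idx = rng.choice(len(states), size=M, p=probs)
L = np.mean(np.log(probs[test_idx]))
assert L <= 0.0                               # log-probabilities are nonpositive
```

Maximizing this quantity over $N_H$ then plays the same role as minimizing the GE, up to the constant entropy of $P^{0}_{V}$.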

Acknowledgments

This research has been supported by the POSCO Science Fellowship of POSCO TJ Park Foundation (GK, HL, and YB). It has also been supported in part by the Creative-Pioneering Researchers Program through Seoul National University, the National Research Foundation of Korea (NRF) grant (Grant No. 2022R1A2C1006871) (JJ).

References

  • [1] Goodfellow I, Bengio Y and Courville A 2016 Deep learning (Cambridge: MIT Press)
  • [2] Mehta P, Bukov M, Wang C H, Day A G, Richardson C, Fisher C K and Schwab D J 2019 Phys. Rep. 810 1–124
  • [3] Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L and Zdeborová L 2019 Rev. Mod. Phys. 91(4) 045002
  • [4] Bahri Y, Kadmon J, Pennington J, Schoenholz S S, Sohl-Dickstein J and Ganguli S 2020 Annu. Rev. Condens. Mat. Phys. 11 501–528
  • [5] Kohavi R and Wolpert D H 1996 Bias plus variance decomposition for zero-one loss functions ICML vol 96 (Citeseer) pp 275–83
  • [6] Belkin M, Hsu D, Ma S and Mandal S 2019 Proc. Natl. Acad. Sci. 116 15849–15854
  • [7] Spigler S, Geiger M, d’Ascoli S, Sagun L, Biroli G and Wyart M 2019 J. Phys. A: Math. Theor. 52 474001
  • [8] Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B and Sutskever I 2021 J. Stat. Mech.: Theor. Exp. 2021 124003
  • [9] Rocks J W and Mehta P 2022 Phys. Rev. Res. 4(1) 013201
  • [10] Heskes T 1998 Neural Comput. 10 1425–1433
  • [11] Smolensky P 1986 Information processing in dynamical systems: Foundations of harmony theory (Cambridge: MIT Press) pp 194–281
  • [12] Hinton G E 2002 Neural Comput. 14 1771–1800
  • [13] Hinton G E 2012 A Practical Guide to Training Restricted Boltzmann Machines (Berlin: Springer) pp 599–619
  • [14] Tieleman T 2008 Training restricted Boltzmann machines using approximations to the likelihood gradient Proceedings of the 25th International Conference on Machine Learning ICML ’08 (New York, NY, USA: Association for Computing Machinery) pp 1064–1071
  • [15] Tieleman T and Hinton G 2009 Using fast weights to improve persistent contrastive divergence Proceedings of the 26th Annual International Conference on Machine Learning ICML ’09 (New York, NY, USA: Association for Computing Machinery) pp 1033–1040
  • [16] Krause O, Fischer A and Igel C 2020 Artif. Intell. 278 103195
  • [17] Gabrie M, Tramel E W and Krzakala F 2015 Training restricted Boltzmann machines via the Thouless-Anderson-Palmer free energy Advances in Neural Information Processing Systems vol 28 ed Cortes C, Lawrence N, Lee D, Sugiyama M and Garnett R (Curran Associates, Inc.)
  • [18] Derrida B, Evans M R, Hakim V and Pasquier V 1993 J. Phys. A: Math. Gen. 26 1493
  • [19] Ising E 1925 Zeit. Phys. 31(1) 253–258
  • [20] Onsager L 1944 Phys. Rev. 65(3-4) 117–149
  • [21] Yang C N 1952 Phys. Rev. 85(5) 808–816
  • [22] MacDonald C T, Gibbs J H and Pipkin A C 1968 Biopolymers 6 1–25
  • [23] Helbing D 2001 Rev. Mod. Phys. 73 1067–1141
  • [24] Chowdhury D, Schadschneider A and Nishinari K 2005 Phys. Life Rev. 2 318–352
  • [25] Blythe R A and Evans M R 2007 J. Phys. A: Math. Theor. 40 R333
  • [26] Baity-Jesi M, Sagun L, Geiger M, Spigler S, Arous G B, Cammarota C, LeCun Y, Wyart M and Biroli G 2019 J. Stat. Mech.: Theor. Exp. 2019 124013
  • [27] Song J, Marsili M and Jo J 2018 J. Stat. Mech.: Theor. Exp. 2018 123406
  • [28] Harsh M, Tubiana J, Cocco S and Monasson R 2020 J. Phys. A: Math. Theor. 53 174002