Tradeoff of generalization error in unsupervised learning
Abstract
Finding the optimal model complexity that minimizes the generalization error (GE) is a key issue in machine learning. In conventional supervised learning, this task typically involves the bias-variance tradeoff: lowering the bias by making the model more complex entails an increase in the variance. Meanwhile, little has been studied about whether the same tradeoff exists for unsupervised learning. In this study, we propose that unsupervised learning generally exhibits a two-component tradeoff of the GE, namely the model error and the data error: using a more complex model reduces the model error at the cost of the data error, with the data error playing a more significant role for a smaller training dataset. This is corroborated by training the restricted Boltzmann machine to generate the configurations of the two-dimensional Ising model at a given temperature and the totally asymmetric simple exclusion process with given entry and exit rates. Our results also indicate that the optimal model tends to be more complex when the data to be learned are more complex.
Keywords: Machine Learning, Classical phase transitions, Stochastic processes
1 Introduction
Inductive reasoning, which derives general principles from specific observations, is an essential generalization process that builds up human knowledge. With the advent of big data, there is a rapidly growing need for automating the process of inductive reasoning. Lately, much progress has been made in this direction thanks to various machine learning techniques based on artificial neural networks (ANNs) [1, 2, 3, 4]. However, despite their tremendous success, we still have a very limited understanding of when and how an ANN achieves a good generalization capacity.
There are two major types of generalization tasks performed by ANNs. The better-studied type is supervised learning, which refers to the task of guessing the correct form of a function over its entire domain by generalizing given examples of its functional relations. In this case, the failure to properly generalize the given examples, called the generalization error (GE), can be defined in terms of the mean squared error (MSE) of the predicted function from the true function. In practice, the true function is unknown, so the MSE estimated using independently drawn examples of functional relations (called the test error) is used as a proxy for the GE.
Thanks to the mathematical structure of the MSE, this GE is readily decomposed into two parts [2]. The first part, called the bias, quantifies how the predicted function on average deviates from the true function. The second part, called the variance, quantifies how much the predicted function fluctuates around its average behavior. In many examples of supervised learning, these two components of the GE exhibit a tradeoff behavior: as the model complexity (e.g., the size of the ANN) increases, the bias decreases at the cost of the increasing variance. This leads to the GE showing a U-shaped dependence on the model complexity, which is called the bias-variance tradeoff. It should be noted that the decomposition is not limited to the MSE but can also be generalized to other types of the GE (see, for example, [5]). In addition, according to recent studies, the GE of supervised learning exhibits another monotonic decrease if the model complexity is increased even further. This is called the double descent phenomenon, whose origin has been extensively discussed [6, 7, 8, 9].
Meanwhile, there is a less studied but no less important type of task, namely unsupervised learning, which refers to the task of finding the probability distribution that best captures the statistical properties of a dataset sampled from an unknown distribution. The GE of unsupervised learning can be defined as the Kullback-Leibler (KL) divergence of the predicted distribution from the true distribution to be found. Again, there has been a proposal about how this GE can be decomposed into the bias and the variance [10]. However, little has been studied about whether the GE of unsupervised learning also exhibits a tradeoff behavior.
In this study, we address this problem by training the restricted Boltzmann machine (RBM) to learn data generated from the two-dimensional (2-d) Ising model and the totally asymmetric simple exclusion process (TASEP), which are well-known models of equilibrium and nonequilibrium phase transitions, respectively. Since the distributions of the configurations of these models are exactly known, the GE can be calculated exactly. Examining how the GE depends on the number of hidden nodes of the RBM, we observe that it exhibits a tradeoff behavior. We propose a two-component decomposition scheme of the GE so that the tradeoff is determined by the monotonic behaviors of the two error components, one related to the limitations of the model and the other to the fluctuations in the data. We also examine how the optimal model complexity, resulting from the tradeoff between these two GE components, depends on the complexity of the given training dataset.
The rest of the paper is organized as follows. In Sec. 2, we introduce the restricted Boltzmann machine and our decomposition scheme for its generalization error. In Sec. 3, we describe how the RBM is trained using the data generated from the 2-d Ising model and the TASEP. In Sec. 4, we present our results about the tradeoff behaviors observed in unsupervised learning of the RBMs. Finally, we summarize our findings and discuss future works in Sec. 5.

2 Theory
2.1 Restricted Boltzmann Machine
The RBM is an energy-based generative model proposed by Smolensky [11] and popularized by Hinton [12, 13]. It has the form of a bipartite network of $N_v$ nodes in the visible layer and $N_h$ nodes in the hidden layer, see Fig. 1(a). For the visible-layer configuration $\mathbf{v} = (v_1, \ldots, v_{N_v})$ and the hidden-layer configuration $\mathbf{h} = (h_1, \ldots, h_{N_h})$, the corresponding energy is given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j, \tag{1}$$

where $\mathbf{a}$ and $\mathbf{b}$ indicate the bias terms, and $W$ is the weight matrix coupling the two layers. The state of each node is a Boolean variable, i.e., $v_i \in \{0, 1\}$ and $h_j \in \{0, 1\}$. The probability of each configuration is determined by this energy function according to the Boltzmann distribution

$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \tag{2}$$

where $Z$ is the normalizing factor (or the partition function)

$$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \tag{3}$$
The goal of the RBM is to find $\mathbf{a}$, $\mathbf{b}$, and $W$ such that the marginal distribution $p(\mathbf{v}) \equiv \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h})$ is as similar as possible to some given empirical distribution $q(\mathbf{v})$. To put it precisely, the RBM seeks to achieve

$$(\mathbf{a}, \mathbf{b}, W) = \operatorname*{arg\,min}_{\mathbf{a}', \mathbf{b}', W'} D_{\mathrm{KL}}(q \,\|\, p), \tag{4}$$

where

$$D_{\mathrm{KL}}(q \,\|\, p) = \sum_{\mathbf{v}} q(\mathbf{v}) \ln \frac{q(\mathbf{v})}{p(\mathbf{v})} \tag{5}$$

is the Kullback-Leibler (KL) divergence. Towards this purpose, the above KL divergence is taken to be the loss function, and the RBM is updated according to the gradient descent

$$a_i \leftarrow a_i - \eta \, \frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial a_i}, \tag{6}$$

$$b_j \leftarrow b_j - \eta \, \frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial b_j}, \tag{7}$$

$$W_{ij} \leftarrow W_{ij} - \eta \, \frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial W_{ij}}, \tag{8}$$

where $\eta$ is the learning rate.
Denoting by

$$p(\mathbf{h} \,|\, \mathbf{v}) = \frac{p(\mathbf{v}, \mathbf{h})}{\sum_{\mathbf{h}'} p(\mathbf{v}, \mathbf{h}')} \tag{9}$$

the conditional probability of the hidden-layer configuration, we can show that the gradients of the KL divergence satisfy

$$\frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial a_i} = \langle v_i \rangle_{p} - \langle v_i \rangle_{q}, \tag{10}$$

$$\frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial b_j} = \langle h_j \rangle_{p} - \langle h_j \rangle_{q \, p(\mathbf{h}|\mathbf{v})}, \tag{11}$$

$$\frac{\partial D_{\mathrm{KL}}(q \,\|\, p)}{\partial W_{ij}} = \langle v_i h_j \rangle_{p} - \langle v_i h_j \rangle_{q \, p(\mathbf{h}|\mathbf{v})}, \tag{12}$$

where $\langle \cdot \rangle_{f}$ denotes an average with respect to the probability distribution $f$. Thus, the training saturates when the first and the second moments of the state variables are equal for both empirical ($q$) and model ($p$) distributions.
Since these gradients involve averages over the model distribution $p$, whose computation is difficult for large networks, various approximation methods are used in practice, such as contrastive divergence (CD) [12], persistent contrastive divergence (PCD) [14], fast PCD [15], Bennett’s acceptance ratio method [16], and the Thouless-Anderson-Palmer equations [17]. However, to avoid further complications arising from the approximations, in this study we stick to the exact gradients written above.
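To make this procedure concrete, the following is a minimal sketch (in Python with NumPy; all function and variable names are ours, not taken from our actual code) of the exact-gradient descent of Eqs. (6)–(12) for an RBM small enough that all $2^{N_v}$ visible and $2^{N_h}$ hidden configurations can be enumerated:

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n binary configurations as an array of shape (2^n, n)."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def rbm_joint(a, b, W):
    """Exact joint distribution p(v, h) of a small RBM, shape (2^Nv, 2^Nh)."""
    V, H = all_states(len(a)), all_states(len(b))
    # E(v, h) = -a.v - b.h - v.W.h, evaluated for every pair of configurations
    E = -(V @ a)[:, None] - (H @ b)[None, :] - V @ W @ H.T
    p = np.exp(-E)
    return p / p.sum()                      # normalize by the partition function Z

def train_exact(q, Nh, eta=0.1, epochs=5000, seed=0):
    """Gradient descent on D_KL(q || p) using the exact gradients of Eqs. (10)-(12).
    q must list probabilities in the same configuration order as all_states(Nv)."""
    rng = np.random.default_rng(seed)
    Nv = int(np.log2(len(q)))
    V, H = all_states(Nv), all_states(Nh)
    a, b = np.zeros(Nv), np.zeros(Nh)
    W = 0.01 * rng.standard_normal((Nv, Nh))
    for _ in range(epochs):
        pvh = rbm_joint(a, b, W)            # model joint p(v, h)
        pv = pvh.sum(axis=1)                # model marginal p(v)
        ph_v = pvh / pv[:, None]            # conditional p(h | v)
        qvh = q[:, None] * ph_v             # "data" distribution q(v) p(h | v)
        # model average minus data average, as in Eqs. (10)-(12)
        a -= eta * (pv @ V - q @ V)
        b -= eta * (pvh.sum(axis=0) @ H - qvh.sum(axis=0) @ H)
        W -= eta * (V.T @ pvh @ H - V.T @ qvh @ H)
    return a, b, W
```

For instance, `train_exact(q, Nh=2)` descends $D_{\mathrm{KL}}(q \,\|\, p)$ for a two-hidden-node RBM; replacing the empirical $q$ by the true distribution gives the training used below to estimate the optimal model.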
2.2 Error components in unsupervised learning
The goal of unsupervised learning is to construct a generative model whose statistical properties are as similar as possible to the true distribution underlying an ensemble of objects. We may formally describe the problem as follows. Suppose that there exists the true probability distribution $q_0(\mathbf{v})$ generating an ensemble of objects. Then we can think of the best model that the RBM can express, namely

$$p^* = \operatorname*{arg\,min}_{p \in \mathcal{P}_{\mathrm{RBM}}} D_{\mathrm{KL}}(q_0 \,\|\, p), \tag{13}$$

where $\mathcal{P}_{\mathrm{RBM}}$ denotes the set of marginal distributions expressible by the RBM.
Ideally, the RBM should generate $p^*$ at the end of the training. However, this may not be true for three reasons. First, the KL divergence is generally not a convex function of $\mathbf{a}$, $\mathbf{b}$, and $W$, so the gradient descent shown in Eqs. (6), (7), and (8) may end up in a local minimum far away from $p^*$. Second, even if the RBM does evolve towards $p^*$, the convergence may take an extremely long time. In this case, the training will have to end before $p^*$ is reached. Third, in practical situations, we do not have direct access to the true distribution $q_0$ generating the observed samples. Due to the sampling error, the empirical distribution $q$ used in the training is generally different from $q_0$. Then, even if the RBM does manage to find the distribution most similar to $q$, it may still be quite different from $p^*$.
For these reasons, the distribution generated by the RBM at the end of the training is not $p^*$; instead, each training will result in its own model distribution $p_f$. Then the GE, which quantifies the expected difference of the model distribution from the true distribution, can be defined as

$$\mathrm{GE} \equiv \overline{D_{\mathrm{KL}}(q_0 \,\|\, p_f)}, \tag{14}$$

where $\overline{\,\cdots\,}$ represents the average with respect to different models obtained by independent trainings. By definition, the GE cannot be smaller than the model error (ME)

$$\mathrm{ME} \equiv D_{\mathrm{KL}}(q_0 \,\|\, p^*), \tag{15}$$

which indicates the fundamental lower bound on how similar the model distribution generated by the RBM can be to the true distribution $q_0$. Finally, we identify the excess part of the error,

$$\mathrm{DE} \equiv \mathrm{GE} - \mathrm{ME}, \tag{16}$$

as the data error (DE), which mainly stems from deviations of the training data from the true distribution, as will be shown below. Thus, Eqs. (14), (15), and (16) define a two-component decomposition of the GE for unsupervised learning.
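When $q_0$, $p^*$, and the trained distributions $p_f$ are all available as explicit arrays (as for the small systems studied below), this decomposition can be evaluated directly. A minimal sketch, with all names ours:

```python
import numpy as np

def kl(q, p, eps=1e-300):
    """Kullback-Leibler divergence D_KL(q || p) of Eq. (5)."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / np.maximum(p[mask], eps))))

def decompose(q0, p_star, p_finals):
    """GE, ME, DE of Eqs. (14)-(16): q0 is the true distribution, p_star the
    best RBM-expressible one, p_finals the independently trained models."""
    ge = float(np.mean([kl(q0, pf) for pf in p_finals]))
    me = kl(q0, p_star)
    return ge, me, ge - me      # DE = GE - ME
```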
3 Methods
To examine the behaviors of the error components defined above, we train the RBMs on two basic models of statistical physics: the 2-d Ising model and the TASEP with open boundaries. We chose these models for two reasons. First, these models have exactly known steady-state distributions. The Ising model follows the standard Boltzmann statistics, and the nonequilibrium steady-state statistics of the TASEP can be exactly obtained using the matrix-product ansatz [18]. Thus, for these models, we have full information about $q_0$. Second, these models provide examples of equilibrium and nonequilibrium phase transitions. By choosing control parameters in different phases, we can control the complexity of the data and examine how the tradeoff behavior is affected by it. With these considerations in mind, we describe the two models in the following.

3.1 2-d Ising model
The Ising model, originally proposed as a simplified model of ferromagnetism [19], is a paradigmatic model of equilibrium phase transitions with exact solutions for its free energy [20] and spontaneous magnetization [21] on the 2-d square lattice. In this study, we use the Ising model on an $L \times L$ square lattice with periodic boundary conditions ($s_{\mathbf{r} + L\hat{\mathbf{x}}} = s_{\mathbf{r}}$ and $s_{\mathbf{r} + L\hat{\mathbf{y}}} = s_{\mathbf{r}}$) following the Boltzmann statistics

$$q_0(\mathbf{s}) = \frac{1}{Z} \, e^{-H(\mathbf{s})/T}, \tag{17}$$

where we set $k_B = 1$ and the Hamiltonian is given by

$$H(\mathbf{s}) = -J \sum_{\langle \mathbf{r}, \mathbf{r}' \rangle} s_{\mathbf{r}} s_{\mathbf{r}'}, \tag{18}$$

with $s_{\mathbf{r}} \in \{-1, +1\}$ for each lattice site $\mathbf{r}$, the sum running over nearest-neighbor pairs. We train the RBMs to the equilibrium distributions at three different temperatures, chosen below, near, and above the exactly known critical temperature $T_c = 2J/\ln(1+\sqrt{2})$: the lowest temperature corresponds to the ferromagnetic (ordered) phase, the middle one stands for the critical regime, and the highest one generates the paramagnetic (disordered) phase. Even though the size of the system used in our study is small, these three values of temperature generate markedly different ensembles of spin configurations, as shown in the order parameter (magnetization) and its fluctuations (susceptibility) plotted in Fig. 2(a).
Using $q_0$ thus defined, we train the RBM in two different ways. The first way is the usual unsupervised learning. We draw a certain number of equilibrium spin configurations from $q_0$ by the Metropolis-Hastings algorithm. To remove correlations between different sampled configurations, we save a snapshot of the system only at long intervals of Monte Carlo steps (varied depending on the correlation time). This sampled set of Ising configurations forms $q$, which we use in the gradient descent dynamics described by Eqs. (6)–(12). We repeat the process to obtain a number of independent models $p_f$, with which we calculate the GE according to Eq. (14).
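A minimal sketch of this sampling step (single-spin-flip Metropolis with a decorrelation interval; the lattice size, temperature, and interval lengths below are illustrative placeholders, not the values used in our runs):

```python
import numpy as np

def sample_ising(L=4, T=2.3, n_samples=1000, decorr_sweeps=100,
                 burn_in_sweeps=1000, seed=0):
    """Draw spin configurations of the 2-d Ising model (J = 1, periodic
    boundaries) by single-spin-flip Metropolis, keeping one snapshot
    every decorr_sweeps sweeps to suppress correlations."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=(L, L))

    def sweep():
        for _ in range(L * L):               # one sweep = L*L flip attempts
            i, j = rng.integers(L, size=2)
            nb = s[(i+1) % L, j] + s[(i-1) % L, j] + s[i, (j+1) % L] + s[i, (j-1) % L]
            dE = 2.0 * s[i, j] * nb          # energy cost of flipping spin (i, j)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i, j] *= -1

    for _ in range(burn_in_sweeps):          # equilibrate before sampling
        sweep()
    samples = []
    for _ in range(n_samples):
        for _ in range(decorr_sweeps):       # decorrelate successive snapshots
            sweep()
        samples.append(s.copy())
    return samples
```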
Meanwhile, to calculate the ME and the DE, we train the RBM in a different way. Instead of the sampled $q$, we use the true distribution $q_0$ to directly calculate the true gradients in Eqs. (6)–(12). This is done by evaluating the averages over all the spin configurations of the 2-d Ising model on the square lattice. Repeating this training several times, we choose the resulting model that minimizes the KL divergence as an estimate for $p^*$. In fact, we find that the KL divergence obtained by this training method exhibits only small variabilities: the error of the estimated ME is at most of the order of the symbol size in the plots.
3.2 TASEP with open boundaries
The TASEP is a simple model of nonequilibrium 1-d transport of hard-core particles. Originally proposed as a model of biopolymerization [22], the model has found numerous applications in various traffic [23] and biological [24] systems.
For the case of open boundaries, the process is defined as follows (see Fig. 2(b) for a schematic illustration). Consider a 1-d chain of $L$ discrete sites. Each site can hold at most a single particle. From the 1st site to the $(L-1)$-th site, every particle moves one site to the right at a constant hopping rate (which sets the unit of time) if the next site is empty, and the movement is forbidden if the next site is occupied. Meanwhile, if the 1st site is empty (the $L$-th site is occupied), a particle moves into the chain from the left particle reservoir with entry rate $\alpha$ (moves from the $L$-th site into the right particle reservoir with exit rate $\beta$).
Depending on the values of $\alpha$ and $\beta$, the TASEP exhibits various phases distinguished from each other by the current and the bulk density, as shown in Fig. 2(c). The phase boundaries can be correctly predicted by the simple kinematic wave theory or the mean-field arguments, although the exact solution for the steady-state statistics requires the matrix-product ansatz [18, 25].
Similarly to the case of the 2-d Ising model, we train the RBM in two different ways. First, we generate a certain number of steady-state particle configurations by directly running the kinetic Monte Carlo simulation of the TASEP. To remove correlations between different sampled configurations, we take a snapshot only after a sufficiently large number of Monte Carlo steps. This sampled set is used as $q$ in the gradient descent dynamics described by Eqs. (6)–(12). Then the same process is repeated to obtain a number of independent models $p_f$, with which we calculate the GE according to Eq. (14).
To calculate the ME, we need the true distribution $q_0$ of the TASEP. Using the recursive relations between matrix-product representations of the steady-state probabilities shown in [25], we can iteratively determine the probabilities of all TASEP configurations. Based on these probabilities, we directly calculate the true gradients in Eqs. (6)–(12). The ME is then found by choosing the model that minimizes the KL divergence as an estimate for $p^*$.
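A minimal sketch of the kinetic Monte Carlo sampling step for the open-boundary TASEP (random-sequential updates with unit hopping rate, assuming $\alpha, \beta \le 1$; all parameter values are illustrative placeholders):

```python
import numpy as np

def sample_tasep(L=10, alpha=0.3, beta=0.7, n_samples=1000,
                 decorr_steps=1000, burn_in=10000, seed=0):
    """Draw steady-state occupation configurations of the open-boundary
    TASEP by random-sequential Monte Carlo updates."""
    rng = np.random.default_rng(seed)
    n = np.zeros(L, dtype=int)               # n[k] = 1 if site k is occupied
    samples = []

    def step():
        k = rng.integers(-1, L)              # pick the entry move, a bond, or the exit move
        if k == -1:                          # entry at site 0 with rate alpha
            if n[0] == 0 and rng.random() < alpha:
                n[0] = 1
        elif k == L - 1:                     # exit from site L-1 with rate beta
            if n[L-1] == 1 and rng.random() < beta:
                n[L-1] = 0
        elif n[k] == 1 and n[k+1] == 0:      # bulk hop with unit rate
            n[k], n[k+1] = 0, 1

    for _ in range(burn_in):                 # relax to the steady state
        step()
    for _ in range(n_samples):
        for _ in range(decorr_steps):        # decorrelate successive snapshots
            step()
        samples.append(n.copy())
    return samples
```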
The codes for the training methods described above are available in [github].

4 Results
4.1 Training dynamics
In Fig. 3(a), we show the evolution of the mean error as the RBM is trained on the sampled configurations (using mini-batches of fixed size) of the 2-d Ising model at the temperature corresponding to the ferromagnetic phase. For small numbers of hidden nodes $N_h$, the error monotonically decreases as the training proceeds and eventually saturates. However, for larger $N_h$, the error reaches a minimum at an intermediate epoch and increases again, reaching a higher value as $N_h$ is increased. These tendencies reflect that more complex RBMs (with larger $N_h$) end up learning even the noisy features of the sampled configurations that deviate from the true distribution $q_0$. This is equivalent to the overfitting phenomenon in supervised learning.
Meanwhile, in Fig. 3(b), we show the evolution of the same mean error as the RBM is trained on the true distribution of the 2-d Ising model at the same temperature. For small $N_h$, the error saturates to some value that decreases as $N_h$ is increased. When $N_h$ is increased further, the error keeps decreasing as the training proceeds, never saturating within the observation time span. This shows that the overfitting effect is absent when the true distribution is directly used for the training.
The presence of overfitting can also be checked by observing the weights of the RBM. As shown in the inset of Fig. 3(a), when the RBM is trained to the sampled dataset, the width (standard deviation) of the weight distribution tends to increase towards the end of the training. This is because here the RBM is effectively decreasing its temperature, trying to distinguish configurations which happen to appear in the sampled dataset from those which happen not to appear, even if they are very similar to each other. In contrast, when the RBM is trained to the true distribution, the weight width saturates to some value and stops increasing, as shown in Fig. 3(b). This shows that there is no overfitting involved in the dynamics.
4.2 Effects of data volume

As stated in Eq. (15), the ME depends only on the true distribution $q_0$ and the optimal model distribution $p^*$; thus the ME does not depend on the volume of the training dataset. The only part of the GE that can be affected by the data volume is the DE, whose behaviors are shown in Fig. 4. As the number of samples $M$ increases, the DE tends to decrease like $M^{-1}$. This scaling behavior can be understood as follows. By careful sampling, all samples of the training dataset can be made independently and identically distributed. We note that $q(\mathbf{v})$ can be interpreted as the sample mean of a random variable whose value is one when the sample is in the state $\mathbf{v}$ and zero otherwise. Then, when $M$ is very large, the central limit theorem implies that the sample mean $q(\mathbf{v})$ deviates from the true probability $q_0(\mathbf{v})$ by an amount proportional to $M^{-1/2}$. Then the difference between $p_f$ and $p^*$, with $p_f$ determined by $q$ and $p^*$ by $q_0$, would also be of order $M^{-1/2}$ for every configuration. Now, the DE defined in Eq. (16) can be rewritten as

$$\mathrm{DE} = \overline{D_{\mathrm{KL}}(q_0 \,\|\, p_f)} - D_{\mathrm{KL}}(q_0 \,\|\, p^*), \tag{19}$$

which reaches the minimum value zero when $p_f$ and $p^*$ are exactly equal. However, as discussed above, we expect these distributions to differ by a small amount proportional to $M^{-1/2}$ for every configuration. Since the DE is close to its minimum, this difference of order $M^{-1/2}$ leads to a correction to the DE of order $M^{-1}$. Hence, the DE scales like $M^{-1}$.
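The $M^{-1}$ scaling is not special to the RBM; by the same central-limit argument, the KL divergence of a finite-sample empirical distribution from the distribution it was drawn from also decays as $M^{-1}$. A minimal sketch of this sanity check (a toy multinomial example, not our Ising or TASEP data):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16
q0 = rng.dirichlet(np.ones(K))              # a random "true" distribution over K states

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

for M in [10**3, 10**4, 10**5, 10**6]:
    # average D_KL(q || q0) of the empirical distribution q over 100 datasets
    kls = [kl(rng.multinomial(M, q0) / M, q0) for _ in range(100)]
    # M * KL should approach the constant (K - 1) / 2, confirming the 1/M decay
    print(M, np.mean(kls), M * np.mean(kls))
```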
We note that some deviations from the $M^{-1}$ scaling are observed for the RBMs with small $N_h$ trained on the Ising model at the two lower temperatures. When $N_h$ is too small compared to the complexity of the training dataset, the dynamics of the RBM may develop glassy features (as is the case for supervised learning [26]), such as a rugged loss-function landscape and the presence of multiple local minima. In such cases, the DE may not be entirely induced by the difference between $q$ and $q_0$. However, in the regime of large $N_h$ where the DE becomes a significant part of the GE, we expect the DE to be dominated by the effects of the sampling error.
4.3 Effects of the number of hidden nodes

Now we examine the effects of $N_h$ on the GE and its two components defined by Eqs. (14), (15), and (16), which are shown in Fig. 5. The results for the 2-d Ising model are shown in Fig. 5(a), (b), and (c). In all cases, the ME (DE) monotonically decreases (increases) with $N_h$. For the ferromagnetic and the critical temperatures, this leads to the nonmonotonic behavior of the GE, which is minimized at an intermediate value of $N_h$. Meanwhile, for the paramagnetic temperature, the ME is already very small at $N_h = 1$, which reflects that a single hidden node is enough to describe the disordered state, where all spins are unbiased i.i.d. random variables.
The results for the TASEP are shown in the lower panel of Fig. 5. For entry and exit rates deep in the low-density phase, the ME is almost always zero, see Fig. 5(d). This is because, in the low-density phase, the occupancies of most of the sites are i.i.d. random variables. The monotonic decrease of the ME becomes more visible as the system crosses the phase boundary (see Fig. 5(e)), although the tradeoff behavior of the GE becomes clear only when the system is well within the maximal-current phase, see Fig. 5(f). For the TASEP, it is known that the mean-field approximations are valid sufficiently far away from the boundaries. While those boundary effects decay exponentially with distance in the low-density and the high-density phases, the decay is algebraic in the maximal-current phase. Thus, the amount of correlations present in the data tends to be larger in the maximal-current phase.
To sum up, the GE exhibits a U-shaped tradeoff behavior as the ME (DE) decreases (increases) monotonically with $N_h$. The GE minimum tends to occur at a higher $N_h$ when the dataset to be modeled is more complex. The situation is analogous to the bias-variance tradeoff observed in supervised learning, as detailed in Table 1.
4.4 Comparison with the bias-variance decomposition

While the behaviors of the ME and the DE shown in Fig. 5 are similar to those shown by the bias and the variance in supervised learning, there are no clear mathematical connections between these quantities. In fact, in 1998, Heskes proposed a bias-variance decomposition scheme for the KL divergence [10], which would read in our problem as

$$\underbrace{\overline{D_{\mathrm{KL}}(q_0 \,\|\, p_f)}}_{\mathrm{GE}} = \underbrace{D_{\mathrm{KL}}(q_0 \,\|\, \bar{p})}_{\text{bias}} + \underbrace{\overline{D_{\mathrm{KL}}(\bar{p} \,\|\, p_f)}}_{\text{variance}}, \tag{20}$$

where

$$\bar{p}(\mathbf{v}) = \frac{1}{Z'} \exp\!\left[\,\overline{\ln p_f(\mathbf{v})}\,\right] \tag{21}$$

is the mean distribution for the suitable normalization constant $Z'$. This generalizes the bias-variance decomposition originally proposed for the MSE in the following sense:
1. The variance does not depend on $q_0$ directly. Also, it is nonnegative and equal to zero if and only if the trained distributions $p_f$ are always identical.

2. The bias depends only on $q_0$ and the “average model” $\bar{p}$, which is defined as the minimizer of the variance.
Note that this scheme is related to our ME-DE decomposition scheme by

$$\text{bias} = \mathrm{ME} + \delta, \tag{22}$$

$$\text{variance} = \mathrm{DE} - \delta, \tag{23}$$

where $\delta \equiv D_{\mathrm{KL}}(q_0 \,\|\, \bar{p}) - D_{\mathrm{KL}}(q_0 \,\|\, p^*)$. Since $\delta$ is not sign definite (note that $\bar{p}$ might be a distribution that cannot be generated by an RBM, so it can be even “better” than the optimal RBM-generated distribution $p^*$), there are no definite inequalities between these error components.
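For completeness, a minimal sketch of evaluating the Heskes decomposition of Eqs. (20) and (21) from a set of trained distributions (names ours; the “average model” is the normalized geometric mean):

```python
import numpy as np

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

def heskes_bias_variance(q0, p_finals):
    """Bias and variance of Eqs. (20)-(21); the distributions in p_finals
    must be strictly positive, as is the case for RBM marginals."""
    logs = np.mean([np.log(pf) for pf in p_finals], axis=0)
    p_bar = np.exp(logs)
    p_bar /= p_bar.sum()                 # divide by the normalization constant Z'
    bias = kl(q0, p_bar)
    variance = float(np.mean([kl(p_bar, pf) for pf in p_finals]))
    return bias, variance                # bias + variance = GE
```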
We reexamine the effects of $N_h$ shown in Fig. 5 using the Heskes decomposition scheme. As shown in Fig. 6, while the variance monotonically increases with $N_h$, the bias exhibits a nonmonotonic behavior as $N_h$ is increased. It seems that, in this case, the bias also contains significant contributions from the sampling fluctuations, which were captured by the DE in our decomposition scheme. This might be due to the training outcome having a skewed distribution around the optimal distribution $p^*$. Thus, to describe the tradeoff behavior of the GE, the decomposition into the ME and the DE seems more appropriate than the Heskes bias-variance decomposition.
5 Summary and outlook
Table 1. Analogous quantities in supervised and unsupervised learning.

| | Supervised learning (FNN) | Unsupervised learning (RBM) |
| --- | --- | --- |
| Generalization error | Mean-squared error | KL divergence |
| Limitation of the model | Bias | Model error |
| Performance variability | Variance | Data error |
| What it learns | Functional relationship | Distribution |
| What it overfits | Noise | Sampling error |
In this study, we proposed that the generalization error (GE) in unsupervised learning can be decomposed into two components, namely the model error (ME) and the data error (DE). To examine how these quantities behave as the data and the model complexities are varied, we trained the RBMs with various numbers of hidden nodes to generate the steady-state configurations of the 2-d Ising model at a given temperature and the TASEP with given entry and exit rates, whose statistics are exactly known. For both models, we observed that the DE tends to decrease as the inverse volume of the training dataset, verifying that our decomposition properly distinguishes between the two sources of the GE, namely the inadequacies of the model (ME) and those of the training data (DE). Moreover, we found that the ME (DE) decreases (increases) monotonically as the number of hidden nodes is increased, leading to the tradeoff behavior of the GE. The GE minimum occurs at a higher number of hidden nodes when there exist stronger correlations between different parts of the system. This is analogous to the bias-variance tradeoff in supervised learning—too simple machines fail to capture the essential features of the data, while too complex machines fail to filter out the noise.
Thus, our study clarifies the nature of the GE tradeoff in unsupervised learning. Various theoretical studies are possible from here. For example, while our study has reported only numerical results for the RBMs, it would be interesting to think of an analytically tractable example of unsupervised learning for which the location of the GE minimum can be explicitly connected to the statistical properties of the training data. Also notable is the abrupt change in the training dynamics at the onset of the overfitting regime shown in Fig. 3. It would be worthwhile to check whether the RBM crosses any phase boundary as its dynamical behavior changes. The issues of (i) how regularization suppresses overfitting and affects the GE minimum, (ii) how the hidden layer differently encodes information in the underparametrized and the overparametrized regimes [27, 28], and (iii) whether the double descent of the GE observed in supervised learning [6, 7, 8, 9] is also possible in unsupervised learning would also be interesting to investigate.

On the practical side, we note that the GE defined as the KL divergence in Eq. (14) is not easy to calculate as $q_0$ is unknown in practice. Instead, one can focus on the tradeoff behavior of the log-likelihood

$$\mathcal{L} = \overline{\frac{1}{|\mathcal{T}|} \sum_{\mathbf{v} \in \mathcal{T}} \ln p_f(\mathbf{v})}, \tag{24}$$

whose value is obtained using a sampled test dataset $\mathcal{T}$. Unlike the GE, $\mathcal{L}$ is not bounded by zero, but the nonmonotonic behavior of the GE as a function of $N_h$ will manifest itself as the inverted nonmonotonic behavior of $\mathcal{L}$, as illustrated for the 2-d Ising model and the TASEP in Fig. 7. It would be interesting to check whether the same behavior is observed for a more diverse range of examples.
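For an RBM small enough that the partition function can be computed by exact enumeration, the inner average of Eq. (24) can be evaluated for a single trained model as in the following sketch (names ours; Eq. (24) additionally averages over independent trainings, and for larger models $\ln Z$ would have to be approximated by sampling-based estimators):

```python
import numpy as np

def log_likelihood(a, b, W, test_set):
    """Mean log-likelihood of test configurations under a small RBM, using
    ln p(v) = a.v + sum_j ln(1 + exp(b_j + (v W)_j)) - ln Z."""
    def free(V):
        # unnormalized log-probability after summing out the binary hidden layer
        return V @ a + np.sum(np.logaddexp(0.0, b + V @ W), axis=1)
    Nv = len(a)
    # exact ln Z by enumerating all 2^Nv visible configurations
    V_all = np.array([[(k >> i) & 1 for i in range(Nv)]
                      for k in range(2 ** Nv)], dtype=float)
    logZ = np.log(np.sum(np.exp(free(V_all))))
    return float(np.mean(free(np.asarray(test_set, dtype=float))) - logZ)
```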
Acknowledgments
This research has been supported by the POSCO Science Fellowship of POSCO TJ Park Foundation (GK, HL, and YB). It has also been supported in part by the Creative-Pioneering Researchers Program through Seoul National University, the National Research Foundation of Korea (NRF) grant (Grant No. 2022R1A2C1006871) (JJ).
References
- [1] Goodfellow I, Bengio Y and Courville A 2016 Deep learning (Cambridge: MIT Press)
- [2] Mehta P, Bukov M, Wang C H, Day A G, Richardson C, Fisher C K and Schwab D J 2019 Phys. Rep. 810 1–124
- [3] Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L and Zdeborová L 2019 Rev. Mod. Phys. 91(4) 045002
- [4] Bahri Y, Kadmon J, Pennington J, Schoenholz S S, Sohl-Dickstein J and Ganguli S 2020 Annu. Rev. Condens. Mat. Phys. 11 501–528
- [5] Kohavi R and Wolpert D H 1996 Bias plus variance decomposition for zero-one loss functions ICML vol 96 (Citeseer) pp 275–283
- [6] Belkin M, Hsu D, Ma S and Mandal S 2019 Proc. Natl. Acad. Sci. 116 15849–15854
- [7] Spigler S, Geiger M, d’Ascoli S, Sagun L, Biroli G and Wyart M 2019 J. Phys. A: Math. Theor. 52 474001
- [8] Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B and Sutskever I 2021 J. Stat. Mech.: Theor. Exp. 2021 124003
- [9] Rocks J W and Mehta P 2022 Phys. Rev. Res. 4(1) 013201
- [10] Heskes T 1998 Neural Comput. 10 1425–1433
- [11] Smolensky P 1986 Information processing in dynamical systems: Foundations of harmony theory (Cambridge: MIT Press) pp 194–281
- [12] Hinton G E 2002 Neural Comput. 14 1771–1800
- [13] Hinton G E 2012 A Practical Guide to Training Restricted Boltzmann Machines (Berlin: Springer) pp 599–619
- [14] Tieleman T 2008 Training restricted boltzmann machines using approximations to the likelihood gradient Proceedings of the 25th International Conference on Machine Learning ICML ’08 (New York, NY, USA: Association for Computing Machinery) pp 1064–1071
- [15] Tieleman T and Hinton G 2009 Using fast weights to improve persistent contrastive divergence Proceedings of the 26th Annual International Conference on Machine Learning ICML ’09 (New York, NY, USA: Association for Computing Machinery) pp 1033–1040
- [16] Krause O, Fischer A and Igel C 2020 Artif. Intell. 278 103195
- [17] Gabrie M, Tramel E W and Krzakala F 2015 Training restricted boltzmann machine via the thouless-anderson-palmer free energy Advances in Neural Information Processing Systems vol 28 ed Cortes C, Lawrence N, Lee D, Sugiyama M and Garnett R (Curran Associates, Inc.)
- [18] Derrida B, Evans M R, Hakim V and Pasquier V 1993 J. Phys. A: Math. Gen. 26 1493
- [19] Ising E 1925 Zeit. Phys. 31(1) 253–258
- [20] Onsager L 1944 Phys. Rev. 65(3-4) 117–149
- [21] Yang C N 1952 Phys. Rev. 85(5) 808–816
- [22] MacDonald C T, Gibbs J H and Pipkin A C 1968 Biopolymers 6 1–25
- [23] Helbing D 2001 Rev. Mod. Phys. 73 1067–1141
- [24] Chowdhury D, Schadschneider A and Nishinari K 2005 Phys. Life Rev. 2 318–352
- [25] Blythe R A and Evans M R 2007 J. Phys. A: Math. Theor. 40 R333
- [26] Baity-Jesi M, Sagun L, Geiger M, Spigler S, Arous G B, Cammarota C, LeCun Y, Wyart M and Biroli G 2019 J. Stat. Mech.: Theor. Exp. 2019 124013
- [27] Song J, Marsili M and Jo J 2018 J. Stat. Mech.: Theor. Exp. 2018 123406
- [28] Harsh M, Tubiana J, Cocco S and Monasson R 2020 J. Phys. A: Math. Theor. 53 174002