
Justices for Information Bottleneck Theory

Faxian Cao
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Yongqiang Cheng
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Adil Mehmood Khan
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Zhijing Yang
School of Information Engineering
Guangdong University of Technology
Guangzhou, 510006
[email protected]
Corresponding Author
Abstract

This study comes as a timely response to mounting criticism of the information bottleneck (IB) theory, injecting fresh perspectives to rectify misconceptions and reaffirm its validity. Firstly, we introduce an auxiliary function to reinterpret the maximal coding rate reduction method as a special, locally optimal case of IB theory. Through this auxiliary function, we clarify the paradox of decreasing mutual information when the ReLU activation is applied in deep learning (DL) networks. Secondly, we challenge the doubts about IB theory's applicability by demonstrating its capacity to explain the absence of a compression phase with linear activation functions in hidden layers, when viewed through the lens of the auxiliary function. Lastly, by taking a novel theoretical stance, we provide a new way to interpret the inner organization of DL networks using IB theory, aligning it with recent experimental evidence. Thus, this paper serves as an act of justice for IB theory, potentially reinvigorating its standing and application in DL and other fields such as communications and biomedical research.

1 Introduction

Both information bottleneck (IB) theory [1] and maximal coding rate reduction (MCR2) [2] originate from rate distortion theory [3] in the field of information theory [4]. IB theory aims to find a short code for input signals that preserves the maximum information about output signals while compressing the mutual information between the input signals and the corresponding short code [1]. On the other hand, MCR2 strives to maximize the difference between the coding rate/length of the entire dataset and the average of all subsets in each category [2; 5], i.e., the objective of MCR2 is to maximize both the mutual information between the input signals and their corresponding short code and the mutual information between the short code of the input signal and the output signal. Both IB theory and MCR2 have gained remarkable attention and are widely applied in various fields, including communications [6], biomedical research [7], and speech decomposition [8] with IB theory, and classification [9] and segmentation [10] with MCR2.

Recently, researchers have employed both IB theory and MCR2 to interpret how deep learning (DL) networks work, i.e., the inner organization of DL networks [11]. By applying MCR2 to DL networks, ReduNet [5; 10] was presented as a deep-layered architecture construction method. ReduNet claims that both linear and nonlinear operators, as well as network parameters, are explicitly constructed layer by layer based on the principle of MCR2. For example, the mechanism of the ReLU activation function in DL could be explained by MCR2: the role of the ReLU activation function is to maximize the mutual information between samples in the input layer and data in the hidden layers of DL networks. In other words, high-dimensional data usually have a low-dimensional structure; by increasing the mutual information between samples of the input layer and data of the hidden layers, samples belonging to different classes become more distinguishable, which facilitates classification or segmentation. However, existing experiments [12] already show that this is not always the case, since the mutual information between samples in the input layer and data in the hidden layers of DL networks might decrease even when the ReLU activation function is used in the hidden layers. Besides, compared with IB theory, MCR2 cannot explain the mechanism of nonlinear activation functions in DL networks, such as the tanh activation function.

Furthermore, some typical works [13; 14] use IB theory to analyze the inner organization of DL networks, suggesting that the hidden layers of DL go through two main phases. The first phase is called "fitting", where the training errors of DL drop dramatically by fitting the training labels, i.e., the mutual information between samples in the input layer and data in the hidden layers of the DL network increases. The second phase is referred to as "compression", which starts once the training errors become small, i.e., the mutual information between samples in the input layer and data in the hidden layers of the DL network decreases. These researchers also claim [13; 14] that the compression phase plays a vital role in the excellent generalization performance of DL. However, other studies have shown that when the nonlinear activation function in the hidden layers of DL networks is replaced with a linear function (i.e., the ReLU function), sometimes there is no compression phase when training the DL networks [12]. This discrepancy has led some researchers to challenge the validity and capability of IB theory for interpreting the inner organization of DL networks [12].

Moreover, when applying IB theory to interpret the inner organization of DL networks, all academic works, including those by the inventors of IB theory [1; 15], have aimed at minimizing the mutual information between the input signal and its projection while maximizing the mutual information between that projection and the output signal. However, some experiments show that this is not always the case. Therefore, it is necessary and important to provide a new way to interpret the mechanism of DL networks using IB theory.

In this paper, to address the three issues above, we introduce an auxiliary function of conditional entropy into IB theory, which leads to the following contributions:

  • By introducing an auxiliary function to IB theory and deriving IB theory under a Gaussian distribution and a linear projection/activation function, we discover that MCR2 is simply a special case of IB theory. This finding implies that MCR2 is only a locally optimal solution of IB theory for interpreting the inner organization of DL networks. Consequently, through this transformation, MCR2 and IB theory can be unified. More importantly, when the ReLU activation function is used in a DL network, the phenomenon that the mutual information between samples in the input layer and data in the hidden layers decreases can be explained.

  • With the help of the auxiliary function, the IB method can be transformed into another form, providing a new theoretical perspective that explains why the compression phase does not occur in the hidden layers of DL networks once the nonlinear activation function is replaced by a linear activation function, i.e., why the mutual information between samples in the input layer and data in the hidden layers keeps increasing. Our findings therefore validate the principle of IB theory for interpreting DL networks' inner organization.

  • With the transformed IB theory obtained by introducing the auxiliary function, we provide a new way to interpret the inner organization of DL networks: 1) the mutual information between data in the hidden layers and the output data in the final layer of the DL network keeps increasing; 2) for the mutual information between samples in the input layer and data in the hidden layers, sometimes only one phase occurs, while at other times both the fitting and compression phases occur; 3) in some cases, DL networks ultimately try to maximize the difference between the coding rate of the whole dataset and the average of all subsets within each class, while in other cases they aim to minimize the sum of these two quantities.

2 MCR2 and IB theory

Learning distributions from a finite set of i.i.d. samples of $K$ classes/categories is one of the most fundamental problems in machine learning [10]. The input data $X$ consists of $M$ samples, each with dimension $D$, represented as $X=\begin{bmatrix}x_{1}&x_{2}&\cdots&x_{M}\end{bmatrix}\in\mathbb{R}^{D\times M}$. To enable clustering or classification tasks, it is essential to find a good representation using a mapping function $f(x_{i},\Theta):\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$, where $i$ ranges from 1 to $M$. This mapping function captures the intrinsic structures of sample $x_{i}$ and projects it to a feature space of dimensionality $d$ with parameter $\Theta$ [10].

In the context of supervised learning in DL, we can view the output data in the hidden layer as selecting certain discriminative features, represented by $Z=f(X,\Theta)\in\mathbb{R}^{d\times M}$, that facilitate the subsequent classification task. Then the class label $Y$ can be predicted by optimizing a classifier denoted by $g(Z)$. Therefore, this process can be represented mathematically as follows:

X\stackrel{f(X,\Theta)}{\longrightarrow}Z\stackrel{g(Z)}{\longrightarrow}Y. (1)
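To make the pipeline in Eq. (1) concrete, the following minimal Python (numpy) sketch instantiates $f$ as a linear map and $g$ as a linear classifier; the dimensions, random parameters, and argmax readout are illustrative assumptions rather than the architectures studied later in the paper.

# Minimal sketch of the pipeline in Eq. (1): X --f(X, Theta)--> Z --g(Z)--> Y.
# All shapes and the linear-map / linear-classifier choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, d, M, K = 12, 5, 256, 2           # input dim, feature dim, samples, classes

X = rng.standard_normal((D, M))      # input data X in R^{D x M}
Theta = rng.standard_normal((d, D))  # parameters of the feature map f
W = rng.standard_normal((K, d))      # parameters of the classifier g

Z = Theta @ X                        # Z = f(X, Theta) in R^{d x M}
logits = W @ Z                       # classifier scores g(Z)
Y_hat = logits.argmax(axis=0)        # predicted labels Y
print(Z.shape, Y_hat.shape)          # (5, 256) (256,)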

MCR2 for interpreting DL [2; 5]: MCR2 intends to encode the discriminative features, denoted by $Z=\begin{bmatrix}z_{1}&z_{2}&\cdots&z_{M}\end{bmatrix}=\begin{bmatrix}f(x_{1},\Theta)&f(x_{2},\Theta)&\cdots&f(x_{M},\Theta)\end{bmatrix}$, up to a precision of $\epsilon$, i.e., $\hat{z}_{i}=z_{i}+c_{i}$, where $c_{i}$ is drawn from a Gaussian distribution with zero mean and covariance $\frac{\epsilon^{2}}{d}E$ (note that $E$ is the identity matrix, with 1's down its diagonal and 0's everywhere else). In addition, by assuming that $z_{i}$ follows a Gaussian distribution with zero mean and unit variance, i.e., $\sum_{i=1}^{d}z_{i}=0$ (if $z_{i}$ has a non-zero mean, we can subtract $\sum_{i=1}^{d}z_{i}$ from $z_{i}$), the number of bits required to encode the discriminative features $Z$ is given by $\frac{(M+d)}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)$ [5]. Consequently, the average coding rate per sample, subject to the precision or distortion level $\epsilon$, is expressed as [2]

R(Z,\epsilon)=\frac{1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E). (2)
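For reference, Eq. (2) can be transcribed directly into numpy; the toy feature matrix below is an illustrative assumption.

# A direct numpy transcription of the coding rate in Eq. (2); toy data are illustrative.
import numpy as np

def coding_rate(Z, eps=0.1):
    """R(Z, eps) = 1/2 * log det(d / (M * eps^2) * Z Z^T + E), in nats."""
    d, M = Z.shape
    E = np.eye(d)
    return 0.5 * np.linalg.slogdet(d / (M * eps**2) * Z @ Z.T + E)[1]

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 256))    # d = 5 features, M = 256 samples
print(coding_rate(Z, eps=0.1))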

In addition, the multi-class features $Z$ may belong to multiple low-dimensional subspaces, which can affect the accuracy of the rate distortion evaluation. To address this issue, MCR2 partitions the features into several subsets, denoted as $Z=Z_{1}\cup Z_{2}\cup\cdots\cup Z_{K}$, with each subset lying in a separate low-dimensional subspace. In order to encode the membership of the $M$ samples in the $K$ classes more effectively, a set of diagonal matrices $\Pi=\{\Pi_{j}\}_{j=1}^{K}$ is introduced. These matrices have diagonal entries representing the membership of each sample in each class:

\sum_{j=1}^{K}\Pi_{j}=E, (3)

and each $\Pi_{j}$ is defined as

\Pi_{j}=\begin{bmatrix}\hat{\pi}_{1,j}&0&\cdots&0\\ 0&\hat{\pi}_{2,j}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\hat{\pi}_{M,j}\end{bmatrix}\in\mathbb{R}^{M\times M}, (4)

where $\hat{\pi}_{i,j}\in\{0,1\}$: if $x_{i}$ (or $z_{i}$) belongs to the $j^{th}$ class, $\hat{\pi}_{i,j}=1$; otherwise, $\hat{\pi}_{i,j}=0$.
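For concreteness, the membership matrices of Eqs. (3)-(4) can be built from a label vector as in the following small Python sketch; the example labels are an illustrative assumption.

# A small sketch of the membership matrices Pi_j in Eqs. (3)-(4), built from labels.
import numpy as np

def membership_matrices(labels, K):
    """Return diagonal Pi_j (M x M) with hat{pi}_{i,j} = 1 iff sample i is in class j."""
    return [np.diag((labels == j).astype(float)) for j in range(K)]

labels = np.array([0, 1, 1, 0, 1])                 # illustrative labels
Pis = membership_matrices(labels, K=2)
print(np.allclose(sum(Pis), np.eye(len(labels))))  # Eq. (3): sum_j Pi_j = E -> True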

With respect to this partition, the average number of bits per sample, subject to the precision/distortion $\epsilon$, can be written as [2]

R^{c}(Z,\epsilon|\Pi)=\frac{1}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E), (5)

where $tr(\cdot)$ is the trace of a matrix. Finally, MCR2 tries to maximize the difference between the coding rate/length of the whole dataset in Eq. (2) and the average of all the subsets in Eq. (5) [2; 5]:

\Delta R(Z,\epsilon,\Pi)=R(Z,\epsilon)-R^{c}(Z,\epsilon|\Pi)
=\frac{1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)-\frac{1}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (6)
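A compact numpy sketch of the full MCR2 objective in Eq. (6), combining Eqs. (2) and (5), may help make the notation concrete; the toy features and labels are illustrative assumptions.

# A numpy sketch of the MCR^2 objective in Eq. (6): Delta R = R(Z, eps) - R^c(Z, eps | Pi).
import numpy as np

def coding_rate(Z, eps):
    d, M = Z.shape
    return 0.5 * np.linalg.slogdet(d / (M * eps**2) * Z @ Z.T + np.eye(d))[1]

def class_coding_rate(Z, labels, K, eps):
    d, M = Z.shape
    Rc = 0.0
    for j in range(K):
        Pi_j = np.diag((labels == j).astype(float))   # membership matrix from Eq. (4)
        tr = np.trace(Pi_j)
        if tr > 0:
            Rc += (tr / M) * 0.5 * np.linalg.slogdet(
                d / (tr * eps**2) * Z @ Pi_j @ Z.T + np.eye(d))[1]
    return Rc

def delta_R(Z, labels, K, eps=0.1):
    return coding_rate(Z, eps) - class_coding_rate(Z, labels, K, eps)

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 256))            # illustrative features
labels = rng.integers(0, 2, size=256)        # illustrative labels
print(delta_R(Z, labels, K=2))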

In simpler terms, MCR2 has two goals:

  • Maximize the mutual information between the input data $X$ and the discriminative features $Z$, which is done by enlarging the space of $Z$ and is measured through $R(Z,\epsilon)$.

  • Maximize the mutual information between the discriminative features $Z$ and the output/label $Y$, which is done by compressing the space of the discriminative features $Z_{j}$ of each category and is measured through $R^{c}(Z,\epsilon|\Pi)$.

IB theory for interpreting DL [1; 13; 14]: With the training data $X$ and the corresponding label $Y$, IB theory contains two steps: encoding and decoding. To encode the discriminative features $Z$, IB theory tries to minimize the mutual information $I(X,Z)$ between the input data $X$ and the discriminative features $Z$, while during the decoding stage, IB theory aims to maximize the mutual information $I(Y,Z)$ between the discriminative features $Z$ and the output/label $Y$. Thus, IB theory can be formulated as finding an optimal representation $Z$ that minimizes the following Lagrangian:

\Delta I(Z,\beta,X,Y)=I(X,Z)-\beta I(Y,Z), (7)

where $\beta\in(0,+\infty)$ is the tradeoff parameter that balances the two types of mutual information above [1].
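For discrete variables, Eq. (7) can be evaluated directly from a joint distribution $p(x,y)$ and a stochastic encoder $p(z|x)$; the following Python sketch does so, with the small distributions chosen purely for illustration.

# A discrete-variable sketch of the IB Lagrangian in Eq. (7): I(X, Z) - beta * I(Y, Z).
# The joint p(x, y) and encoder p(z|x) below are illustrative assumptions.
import numpy as np

def mutual_information(p_ab):
    """I(A, B) in nats for a joint distribution p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask]))

def ib_lagrangian(p_xy, p_z_given_x, beta):
    p_x = p_xy.sum(axis=1)                    # marginal p(x)
    p_xz = p_z_given_x * p_x[:, None]         # joint p(x, z)
    p_yz = p_xy.T @ p_z_given_x               # joint p(y, z), since Z depends on X only
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                 # joint p(x, y)
p_z_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])          # encoder p(z|x), rows sum to 1
print(ib_lagrangian(p_xy, p_z_given_x, beta=2.0))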

By analyzing the objective functions of MCR2 in Eq. (6) and IB theory in Eq. (7), we can observe that both MCR2 and IB theory aim to maximize the mutual information between the discriminative features $Z$ and the output/label $Y$. However, when it comes to the mutual information between the input data $X$ and the discriminative features $Z$, MCR2 maximizes this quantity while IB theory aims to minimize it. In addition, as a promising method, MCR2 claims to be the first principle of DL, and the mechanism of the ReLU activation function in DL networks could be explained by MCR2. However, the mutual information between samples in the input layer and data in the hidden layers sometimes decreases when the ReLU activation function is applied to a DL network. Therefore, investigating the connection between IB theory and MCR2 is not only necessary but also crucial. Furthermore, since IB theory's goal is to reduce the mutual information between the input data $X$ and the discriminative features $Z$, it proposes that this mutual information initially increases and then goes through a compression phase in which it begins to decrease. It is important to note that this compression phase is not present in DL networks when nonlinear activation functions are replaced with linear functions such as ReLU [12]. Hence, presenting a new theoretical perspective to explain this phenomenon in DL is critical. Besides, since the tradeoff parameter $\beta>0$, the goal of IB theory when explaining the inner organization of DL is to minimize the mutual information $I(X,Z)$. However, recent experiments [12] have shown that in certain situations the mutual information $I(X,Z)$ of DL networks keeps increasing, while in other situations the fitting and compression phases alternate. Therefore, providing a new perspective is essential for a better understanding of the behavior of DL networks.

3 Proposed method based on IB theory

In this section, an auxiliary function is introduced into IB theory, which resolves the three concerns outlined in Sections 1 and 2. Starting from the objective function of IB theory in Eq. (7) and introducing an auxiliary function $\beta(H(Z|X)-H(Z|X))$, we can reformulate Eq. (7) as

\Delta I(Z,\beta,X,Y)=I(X,Z)-\beta I(Y,Z)+\beta(H(Z|X)-H(Z|X)), (8)

where $H(\cdot)$ is entropy [15]. Since $H(Z|X)-H(Z|X)=0$, the objective function in Eq. (8) is the same as the objective function of IB theory in Eq. (7). In addition, according to the definitions of mutual information and entropy [15], $I(X,Z)=H(Z)-H(Z|X)$ and $I(Y,Z)=H(Z)-H(Z|Y)$, so we can rewrite Eq. (8) as

\Delta I(Z,\beta,X,Y)=H(Z)-H(Z|X)-\beta(H(Z)-H(Z|Y))+\beta(H(Z|X)-H(Z|X))
=(1-\beta)H(Z)-(1-\beta)H(Z|X)+\beta(H(Z|Y)-H(Z|X))
=(1-\beta)(H(Z)-H(Z|X))+\beta(H(Z|Y)-H(Z|X)). (9)
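As a sanity check, the following Python sketch verifies numerically on a random discrete joint distribution that Eq. (9) returns exactly the same value as the original Lagrangian in Eq. (7), since the auxiliary term is identically zero; the toy distribution and the value of $\beta$ are illustrative assumptions.

# Numeric check that Eq. (9) equals Eq. (7) for a random discrete joint p(x, y, z).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
p_xyz = rng.random((3, 2, 4))
p_xyz /= p_xyz.sum()                                 # joint p(x, y, z)

p_z = p_xyz.sum(axis=(0, 1))
p_xz = p_xyz.sum(axis=1)                             # p(x, z)
p_yz = p_xyz.sum(axis=0)                             # p(y, z)

H_Z = entropy(p_z)
H_Z_given_X = entropy(p_xz.ravel()) - entropy(p_xz.sum(axis=1))   # H(Z|X) = H(X,Z) - H(X)
H_Z_given_Y = entropy(p_yz.ravel()) - entropy(p_yz.sum(axis=1))   # H(Z|Y) = H(Y,Z) - H(Y)

beta = 2.5
I_XZ, I_YZ = H_Z - H_Z_given_X, H_Z - H_Z_given_Y
eq7 = I_XZ - beta * I_YZ                                                       # Eq. (7)
eq9 = (1 - beta) * (H_Z - H_Z_given_X) + beta * (H_Z_given_Y - H_Z_given_X)    # Eq. (9)
print(np.isclose(eq7, eq9))                          # True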

From Eq. (9), it can be observed that the first term on the right-hand side is the mutual information between the input data $X$ and the discriminative features $Z$, i.e., $H(Z)-H(Z|X)=I(X,Z)$. Since $\beta$ lies within the range $(0,+\infty)$, IB theory aims to minimize the mutual information between the input data $X$ and the discriminative features $Z$ for $0<\beta<1$, while maximizing this mutual information for $\beta>1$. Moreover, the second term in Eq. (9) corresponds to the difference between the uncertainty of the discriminative features $Z$ given the labels $Y$ and given the samples $X$, i.e., $H(Z|Y)-H(Z|X)$. When $X$ is given, the uncertainty of the discriminative features $Z$ is fixed, which implies that IB theory tries to decrease the uncertainty of the discriminative features $Z$ given the labels $Y$. In other words, IB theory tries to maximize the mutual information between data in the hidden layers and the output data in the final layer of DL networks. Now, we shall show that MCR2 is just a special case and a locally optimal solution of IB theory under a Gaussian distribution and a linear mapping.

Assume that the data $x_{i}$ follows a Gaussian distribution with zero mean, i.e., $\sum_{i=1}^{D}x_{i}=0$ (if it has a non-zero mean, we can simply subtract $\sum_{i=1}^{D}x_{i}$ from $x_{i}$). We also assume that the mapping function $f(x_{i},\Theta)$ is a linear mapping, $f(x_{i},\Theta)=\Theta x_{i}=z_{i}$. Additionally, we denote $\hat{z}_{i}$ as the approximation of $z_{i}$ and allow the squared error between $\hat{z}_{i}$ and $z_{i}$ to be $\epsilon^{2}$, meaning that, given a coding precision $\epsilon$, we may model the approximation error as independent additive Gaussian noise:

\hat{z}_{i}=z_{i}+c_{i}, (10)

where $c_{i}$ has zero mean and covariance $\frac{\epsilon^{2}}{d}E$ for $i=1,\cdots,M$ (notice that all of the assumptions above are the same as those of MCR2). With these assumptions, we can rewrite Eq. (9) as

\Delta I(Z,\epsilon,\beta,X,Y)=(1-\beta)(H(\hat{Z})-H(\hat{Z}|X))+\beta(H(\hat{Z}|Y)-H(\hat{Z}|X)). (11)

Since the input data $x_{i}$ follows a Gaussian distribution and a linear transformation of a Gaussian distribution is still Gaussian, both $\hat{z}_{i}$ and $z_{i}$ are Gaussian. Then, according to the definition of the differential entropy of a Gaussian distribution [16], the terms $H(\hat{Z})$, $H(\hat{Z}|X)$ and $H(\hat{Z}|Y)$ in Eq. (11) can be expressed as

\begin{cases}H(\hat{Z})=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}}))\\ H(\hat{Z}|X)=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X}))\\ H(\hat{Z}|Y)=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|Y}))\end{cases}, (12)

where $e$ is Euler's number, $\Sigma_{\hat{Z}}$ is the covariance matrix of $\hat{Z}$, and $\Sigma_{\hat{Z}|X}$ and $\Sigma_{\hat{Z}|Y}$ are the conditional covariance matrices. Then, with Eq. (12), we can rewrite Eq. (11) as

\Delta I(Z,\epsilon,\beta,X,Y)=(1-\beta)(\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}}))-\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X})))+\beta(\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|Y}))-\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X})))
=\frac{1-\beta}{2}\log\det(\Sigma_{\hat{Z}|X}^{-1}\Sigma_{\hat{Z}})+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\Sigma_{\hat{Z}|X})), (13)

where the properties of the $\log$ and $\det$ functions [17] are used in the derivation of Eq. (13), i.e., $\log ab=\log a+\log b$, $\log\frac{a}{b}=\log a-\log b$ and $\frac{\det(A)}{\det(B)}=\det(B^{-1}A)$. In addition, since both $X$ and $\hat{Z}$ are Gaussian and $\hat{z}_{i}=\Theta x_{i}+c_{i}$, according to the Schur complement formula [18], the relevant covariance matrices $\Sigma_{\hat{Z}}$ and $\Sigma_{\hat{Z}|X}$ in Eq. (13) can be written as

\begin{cases}\Sigma_{\hat{Z}}=\Theta\Sigma_{X}\Theta^{T}+\frac{\epsilon^{2}}{d}E\\ \Sigma_{X,\hat{Z}}=\Theta\Sigma_{X}\\ \Sigma_{\hat{Z}|X}=\Sigma_{\hat{Z}}-\Sigma_{X,\hat{Z}}\Sigma_{X}^{-1}\Sigma_{\hat{Z},X}=\frac{\epsilon^{2}}{d}E\end{cases}, (14)

then, with Eq. (14), we can rewrite Eq. (13) as

\Delta I(Z,\epsilon,\beta,X,Y)=\frac{1-\beta}{2}\log\det(\Sigma_{\hat{Z}|X}^{-1}\Sigma_{\hat{Z}})+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\Sigma_{\hat{Z}|X}))
=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\frac{\epsilon^{2}}{d}E)). (15)
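The chain from Eq. (12) to Eq. (15) can be checked numerically under the stated assumptions (Gaussian $x_{i}$, linear map, additive noise with covariance $\frac{\epsilon^{2}}{d}E$); the dimensions and the random $\Theta$ in the following Python sketch are illustrative assumptions.

# Numeric sanity check of Eqs. (12)-(15): Gaussian entropies plus the Schur complement
# of Eq. (14) reproduce the first term of Eq. (15). Dimensions and Theta are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, d, eps = 8, 4, 0.1

Sigma_X = np.cov(rng.standard_normal((D, 1000)))            # a valid covariance for X
Theta = rng.standard_normal((d, D))
E = np.eye(d)

Sigma_Zhat = Theta @ Sigma_X @ Theta.T + (eps**2 / d) * E   # Eq. (14), first line
Sigma_Zhat_X = (eps**2 / d) * E                             # Eq. (14), Schur complement

def gaussian_entropy(Sigma):
    """H = 1/2 * log((2*pi*e)^k * det(Sigma)), as in Eq. (12)."""
    k = Sigma.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

lhs = gaussian_entropy(Sigma_Zhat) - gaussian_entropy(Sigma_Zhat_X)    # H(Zhat) - H(Zhat|X)
rhs = 0.5 * np.linalg.slogdet(d / eps**2 * Theta @ Sigma_X @ Theta.T + E)[1]
print(np.isclose(lhs, rhs))   # first term of Eq. (15), up to the (1 - beta) factor -> True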

Now, the final step is to analyze the term $\log\det(\Sigma_{\hat{Z}|Y})$. Denote by $p(Y^{j})$ the probability that a data vector $x_{i}$ or $\hat{z}_{i}$ belongs to the $j^{th}$ class, i.e., the proportion of the data $x_{i}$ or $\hat{z}_{i}$ in each class; then we have

\begin{cases}H(\hat{Z}|Y)=\sum_{j=1}^{K}p(Y^{j})H(\hat{Z}|Y^{j})=\sum_{j=1}^{K}p(Y^{j})\sum_{l=1}^{K}p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}\\ \sum_{j=1}^{K}p(Y^{j})=1\\ \sum_{j=1}^{K}p(Y^{j})H(\hat{Z}|X)=H(\hat{Z}|X)\end{cases}, (16)

where $\hat{Z}_{l}$ is the subset of $\hat{Z}$ that belongs to the $l^{th}$ class, i.e., $\hat{Z}=\hat{Z}_{1}\cup\hat{Z}_{2}\cup\cdots\cup\hat{Z}_{K}$; thus $p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}=0$ for $l\neq j$, and we have

H(\hat{Z}|Y)=\sum_{j=1}^{K}p(Y^{j})\sum_{l=1}^{K}p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}=\sum_{j=1}^{K}p(Y^{j})p(\hat{Z}_{j}|Y^{j})\log\frac{1}{p(\hat{Z}_{j}|Y^{j})}
=\sum_{j=1}^{K}p(Y^{j})H(\hat{Z}_{j}|Y^{j}). (17)

With Eqs. (16) and (17) and $\Sigma_{\hat{Z}_{j}|Y^{j}}=\Sigma_{\hat{Z}_{j}}=\Theta\Sigma_{X_{j}}\Theta^{T}+\frac{\epsilon^{2}}{d}E$, where $X_{j}$ is the subset of the training data $X$ that belongs to the $j^{th}$ class, the objective function of Eq. (15) can be rewritten as

\Delta I(Z,\epsilon,\beta,X,Y)=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})(\log\det(\Sigma_{\hat{Z}_{j}|Y^{j}})-\log\det(\frac{\epsilon^{2}}{d}E))
=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X_{j}}\Theta^{T}+E). (18)

Since $\Theta\Sigma_{X}\Theta^{T}$ and $\Theta\Sigma_{X_{j}}\Theta^{T}$ are the covariance matrices of $Z$ and $Z_{j}$, respectively, we can rewrite the objective function of Eq. (18) as

\Delta I(Z,\epsilon,\beta,Y)=\frac{1-\beta}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})\log\det(\frac{d}{Mp(Y^{j})\epsilon^{2}}Z_{j}Z_{j}^{T}+E). (19)

Following the same procedure as MCR2, the set of diagonal matrices $\Pi=\{\Pi_{j}\}_{j=1}^{K}$ from Eq. (4) is introduced. In addition, since $p(Y^{j})=\frac{tr(\Pi_{j})}{M}$ and $Z_{j}Z_{j}^{T}=Z\Pi_{j}Z^{T}$, Eq. (19) can be rewritten as

\Delta I(Z,\epsilon,\beta,\Pi)=\frac{1-\beta}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (20)
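The step from Eq. (19) to Eq. (20) relies only on the identities $p(Y^{j})=\frac{tr(\Pi_{j})}{M}$ and $Z_{j}Z_{j}^{T}=Z\Pi_{j}Z^{T}$; the following Python sketch confirms numerically that the two forms agree (the toy features, labels and $\beta$ are illustrative assumptions).

# Numeric check that Eq. (19) and Eq. (20) give the same value for the same data.
import numpy as np

rng = np.random.default_rng(0)
d, M, K, eps, beta = 4, 200, 3, 0.1, 5.0
Z = rng.standard_normal((d, M))
labels = rng.integers(0, K, size=M)
E = np.eye(d)

def logdet(A):
    return np.linalg.slogdet(A)[1]

# Eq. (19): class-wise sums written with p(Y^j) and the submatrices Z_j
eq19 = (1 - beta) / 2 * logdet(d / (M * eps**2) * Z @ Z.T + E)
for j in range(K):
    Zj = Z[:, labels == j]
    p_j = Zj.shape[1] / M
    eq19 += beta / 2 * p_j * logdet(d / (M * p_j * eps**2) * Zj @ Zj.T + E)

# Eq. (20): the same objective written with the membership matrices Pi_j
eq20 = (1 - beta) / 2 * logdet(d / (M * eps**2) * Z @ Z.T + E)
for j in range(K):
    Pi_j = np.diag((labels == j).astype(float))
    tr = np.trace(Pi_j)
    eq20 += beta / 2 * (tr / M) * logdet(d / (tr * eps**2) * Z @ Pi_j @ Z.T + E)

print(np.isclose(eq19, eq20))   # True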

With Eq. (20), we can see that IB theory tries to maximize $-\Delta I(Z,\epsilon,\beta,\Pi)$:

\frac{\beta-1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)-\frac{\beta}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (21)

Compared with the objective function of MCR2 in Eq. (6), we can see that the difference between MCR2 in Eq. (6) and IB theory in Eq. (21) is just the coefficient $\beta$. It is then easy to see that when $\beta$ is large enough, $\beta-1\approx\beta$, and IB theory in Eq. (21) degenerates to MCR2, i.e.,

-\Delta I(Z,\epsilon,\beta,\Pi)=\Delta R(Z,\epsilon,\Pi). (22)
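The following Python sketch illustrates Eq. (22) numerically: as $\beta$ grows, $-\Delta I(Z,\epsilon,\beta,\Pi)/\beta$ approaches the MCR2 objective $\Delta R(Z,\epsilon,\Pi)$ of Eq. (6); the toy data and the particular $\beta$ values are illustrative assumptions.

# Numeric illustration of Eq. (22): for large beta, -Delta I scales like beta * Delta R.
import numpy as np

rng = np.random.default_rng(0)
d, M, K, eps = 4, 200, 3, 0.1
Z = rng.standard_normal((d, M))
labels = rng.integers(0, K, size=M)
E = np.eye(d)

def logdet(A):
    return np.linalg.slogdet(A)[1]

def R(Z):
    return 0.5 * logdet(d / (M * eps**2) * Z @ Z.T + E)

def Rc(Z):
    out = 0.0
    for j in range(K):
        Pi_j = np.diag((labels == j).astype(float))
        tr = np.trace(Pi_j)
        out += 0.5 * (tr / M) * logdet(d / (tr * eps**2) * Z @ Pi_j @ Z.T + E)
    return out

delta_R = R(Z) - Rc(Z)                                # MCR^2 objective, Eq. (6)
for beta in [2.0, 10.0, 1000.0]:
    neg_delta_I = (beta - 1) * R(Z) - beta * Rc(Z)    # Eq. (21)
    print(beta, neg_delta_I / beta, delta_R)          # the ratio approaches Delta R as beta grows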

This completes the proof that MCR2 is a special case of IB theory. In the next section, we discuss the three concerns stated in Sections 1 and 2 by using the proposed transformed IB theory in Eq. (9).

4 Discussion

Figure 1: Information plane dynamics for different tasks and different DL network architectures (excluding the final layer of the DL network in all four subfigures). A curve in the corresponding information plane is produced for each of the hidden layers, with the first hidden layer at the far right and the final hidden layer at the far left. (a) Binary classification task [14] with a 10-7-5-3 hidden-layer architecture. (b) Binary classification task [14] with a 12-10-7-5-4-3-2 hidden-layer architecture. (c) Binary classification task [14] with a 10-7-5-4-3 hidden-layer architecture. (d) MNIST dataset [12] with a 32-28-24-20-16-12 hidden-layer architecture.

In this section, some experimental results are shown to discuss the three concerns raised in this work, using the proposed transformed IB theory based on the auxiliary function. The code and data sets used for all experiments in this section are from existing work [12]; they are available at https://github.com/artemyk/ibsgd/tree/iclr2018.

Figs. 1(a), (b) and (c) show the information plane dynamics of three neural networks with 4 fully connected hidden layers of width 10-7-5-3, 7 fully connected hidden layers of width 12-10-7-5-4-3-2, and 5 fully connected hidden layers of width 10-7-5-4-3, respectively. Following the same settings as in [12], all three networks are trained with stochastic gradient descent on a binary classification task whose 12-dimensional input represents 12 uniformly distributed points on a 2D sphere [14], with 256 randomly selected samples per batch. Besides, we also show the information plane dynamics on the MNIST dataset [12] (see Fig. 1(d)) for a neural network with 6 fully connected hidden layers of width 32-28-24-20-16-12. Following the same setting as in [12], this network is trained with stochastic gradient descent with 128 randomly selected samples per batch.
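For readers who wish to reproduce such plots, the following Python sketch outlines the kind of binning estimator used to compute information-plane coordinates in the spirit of [12; 14]; the toy data, the bin count, and the treatment of each input pattern as a single discrete symbol are illustrative assumptions, not the exact estimator of those works.

# A minimal sketch of a binning estimator for information-plane coordinates:
# discretize hidden activations and estimate I(X, Z) and I(Y, Z) from frequencies.
import numpy as np

def discrete_mi(a_ids, b_ids):
    """Plug-in mutual information (nats) between two discrete label sequences."""
    joint, pa, pb = {}, {}, {}
    for a, b in zip(a_ids, b_ids):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pa[a] = pa.get(a, 0) + 1
        pb[b] = pb.get(b, 0) + 1
    n = len(a_ids)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * np.log(c * n / (pa[a] * pb[b]))
    return mi

def layer_information(X, Y, Z, n_bins=30):
    """Estimate (I(X, Z), I(Y, Z)) for one layer's activations Z (M x d)."""
    edges = np.linspace(Z.min(), Z.max(), n_bins + 1)
    Z_binned = np.digitize(Z, edges)
    z_ids = [tuple(row) for row in Z_binned]     # each binned activation vector is one symbol
    x_ids = [tuple(row) for row in X]            # each input pattern is one symbol (assumption)
    return discrete_mi(x_ids, z_ids), discrete_mi(list(Y), z_ids)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(512, 12))           # 12-bit inputs, as in the synthetic task
Y = X.sum(axis=1) % 2                            # an illustrative binary label
Z = np.tanh(X @ rng.standard_normal((12, 5)))    # one hidden layer's activations
print(layer_information(X, Y, Z))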

As can be seen from Fig. 1(a), with the ReLU activation function the two phases of the mutual information $I(X,Z)$, fitting and compression, vary across layers, i.e., there is just one phase in some hidden layers while both phases occur in others. However, on one side, MCR2 always tries to increase the mutual information $I(X,Z)$ by enlarging the space of the discriminative features $Z$, so it cannot explain the inner organization of DL networks. On the other side, IB theory [1] always aims to compress the mutual information $I(X,Z)$ by squeezing the space of the discriminative features $Z$, so it is also not applicable to explain the inner organization of DL networks. Fortunately, by introducing the auxiliary function into IB theory, we unified IB theory [1] and MCR2 [2] by proving that MCR2 is a special case and sub-optimal solution of IB theory, which means that IB theory degenerates to MCR2 when the coefficient $\beta$ approaches positive infinity. With this finding, IB theory can explain the phenomenon that two phases happen in some layers of a DL network while only one phase occurs in others, which is explicitly captured by the term $(1-\beta)(H(Z)-H(Z|X))$ in Eq. (9). In more detail, according to the finding in the state of the art [14], a DL network always tries to reach the theoretical IB limit, which means that the optimal value of $\beta$ is effectively determined by the DL network during the training stage. Since $I(X,Z)=H(Z)-H(Z|X)$ and $\beta\in(0,+\infty)$, the mutual information $I(X,Z)$ decreases when $\beta<1$ and increases when $\beta>1$, so IB theory in Eq. (9) has the ability to explain why both situations happen in DL networks.

In addition, regarding the argument about two phases in DL networks, i.e., fitting and compression, it can be observed from Figs. 1(a), (b), (c) and (d) that the trend of the mutual information $I(X,Z)$ may vary depending on whether the ReLU or tanh activation function is applied to the hidden layers. In more detail, with the tanh activation function both the fitting and compression phases occur, as previously demonstrated in the work of the inventors of IB theory [14]. On the other hand, with the ReLU activation function there may be only a fitting phase (as shown in Figs. 1(c) and (d)), or both fitting and compression phases (as shown in Figs. 1(a) and (b)). However, as IB theory tries to compress the mutual information $I(X,Z)$, it cannot explain all of these situations, especially when the ReLU activation function is applied to a DL network and the compression phase is absent (as reflected in Figs. 1(c) and (d)). Fortunately, the proposed method in Eq. (9), which adds an auxiliary function to IB theory, explains all of these different scenarios. This strengthens the justification for IB theory. With all of these findings, we can see that the aims of DL networks are: 1) to maximize the mutual information between features in the hidden layers and output data in the final layer; 2) to maximize or minimize the mutual information between samples in the input layer and data in the hidden layers; 3) to maximize the difference, or minimize the sum, of the coding rate of the whole dataset and the average of all subsets within each class.

5 Conclusion and future works

In this paper, we provided justice for IB theory and solved three issues by introducing an auxiliary function to IB theory. 1) We unified IB theory and MCR2, showing that MCR2 is just a special case and a locally optimal solution of IB theory under a Gaussian distribution and a linear activation function. In addition, the decrease of the mutual information between samples in the input layer and features in the hidden layers, which could not be explained by MCR2, can be explained by IB theory. 2) We ended both the argument about the two phases in DL networks and the doubts about the validity and capability of IB theory for interpreting their inner organization. 3) We provided a new perspective to explain the inner organization of DL networks when applying IB theory to them.

For future work, we will try to derive the analytical form of IB theory under non-Gaussian distributions and nonlinear activations. With this, all types of DL networks might be unified by IB theory, just as the mechanism of the ReLU activation function can be explained by the MCR2 method, so that all of the mechanisms in DL networks can be described by one law.

References

  • [1] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
  • [2] Ma, Y., Derksen, H., Hong, W., and Wright, J. (2007). Segmentation of multivariate mixed data via lossy data coding and compression. IEEE transactions on pattern analysis and machine intelligence, 29(9), 1546-1562.
  • [3] Cover, T. M., and Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience.
  • [4] Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423.
  • [5] Chan, K. H. R., Yu, Y., You, C., Qi, H., Wright, J., and Ma, Y. (2022). ReduNet: A white-box deep network from the principle of maximizing rate reduction. The Journal of Machine Learning Research, 23(1), 4907-5009.
  • [6] Pezone, F., Barbarossa, S., and Di Lorenzo, P. (2022, May). Goal-oriented communication for edge learning based on the information bottleneck. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8832-8836). IEEE.
  • [7] Mao, H., Chen, X., Fu, Q., Du, L., Han, S., and Zhang, D. (2021, October). Neuron campaign for initialization guided by information bottleneck theory. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (pp. 3328-3332).
  • [8] Qian, K., Zhang, Y., Chang, S., Hasegawa-Johnson, M., and Cox, D. (2020, November). Unsupervised speech decomposition via triple information bottleneck. In International Conference on Machine Learning (pp. 7836-7846). PMLR.
  • [9] Baek, C., Wu, Z., Chan, K. H. R., Ding, T., Ma, Y., and Haeffele, B. D. (2022). Efficient maximal coding rate reduction by variational forms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 500-508).
  • [10] Yu, Y., Chan, K. H. R., You, C., Song, C., and Ma, Y. (2020). Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33, 9422-9434.
  • [11] Tian, L., Tu, Z., Zhang, D., Liu, J., Li, B., and Yuan, J. (2020). Unsupervised learning of optical flow with CNN-based non-local filtering. IEEE Transactions on Image Processing, 29, 8429-8442.
  • [12] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124020.
  • [13] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1-5). IEEE.
  • [14] Shwartz-Ziv, R., and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
  • [15] Painsky, A., and Tishby, N. (2017). Gaussian lower bound for the information bottleneck limit. Journal of Machine Learning Research, 18(213), 1-29.
  • [16] Cai, T. T., Liang, T., and Zhou, H. H. (2015). Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. Journal of Multivariate Analysis, 137, 161-172.
  • [17] Horn, R. A., and Johnson, C. R. (2012). Matrix analysis. Cambridge university press.
  • [18] Magnus, J. R., and Neudecker, H. (2019). Matrix differential calculus with applications in statistics and econometrics. John Wiley and Sons.