
Justices for Information Bottleneck Theory

Faxian Cao
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Yongqiang Cheng
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Adil Mehmood Khan
School of Computer Science
University of Hull
Hull, HU6 7RX
[email protected]
Zhijing Yang
School of Information Engineering
Guangdong University of Technology
Guangzhou, 510006
[email protected]
Corresponding Author
Abstract

This study comes as a timely response to mounting criticism of the information bottleneck (IB) theory, injecting fresh perspectives to rectify misconceptions and reaffirm its validity. Firstly, we introduce an auxiliary function to reinterpret the maximal coding rate reduction method as a special, locally optimal case of IB theory. Through this auxiliary function, we clarify the paradox of decreasing mutual information when the ReLU activation is applied in deep learning (DL) networks. Secondly, we challenge the doubts about IB theory's applicability by demonstrating its capacity to explain the absence of a compression phase with linear activation functions in hidden layers, when viewed through the lens of the auxiliary function. Lastly, by taking a novel theoretical stance, we provide a new way to interpret the inner organization of DL networks using IB theory, aligning it with recent experimental evidence. Thus, this paper serves as an act of justice for IB theory, potentially reinvigorating its standing and application in DL and other fields such as communications and biomedical research.

1 Introduction

Both information bottleneck (IB) theory [1] and maximal coding rate reduction (MCR2) [2] originate from rate distortion theory [3] in the field of information theory [4]. IB theory aims to find a short code for input signals that preserves the maximum information about output signals while compressing the mutual information between the input signals and the corresponding short code [1]. On the other hand, MCR2 strives to maximize the difference between the coding rate/length of the entire dataset and the average of all subsets in each category [2; 5], i.e., the objective of MCR2 is to maximize both the mutual information between the input signals and their corresponding short code and the mutual information between the short code of the input signal and the output signal. Both IB theory and MCR2 have gained remarkable attention and are widely applied in various fields, including communications [6], biomedical research [7], and speech decomposition [8] with IB theory, and classification [9] and segmentation [10] with MCR2.

Recently, researchers have employed both IB theory and MCR2 to interpret how deep learning (DL) networks work, i.e., the inner organization of DL networks [11]. By applying MCR2 to DL networks, ReduNet [5; 10] was presented as a deep-layered architecture construction method. ReduNet claims that both linear and nonlinear operators, as well as network parameters, are explicitly constructed layer by layer based on the principle of MCR2. For example, the mechanism of the ReLU activation function in DL could be explained by MCR2: the role of the ReLU activation function is to maximize the mutual information between samples in the input layer and data in the hidden layers of DL networks. In other words, high-dimensional data usually have a low-dimensional structure; by increasing the mutual information between samples of the input layer and data of the hidden layers, samples belonging to different classes become more distinguishable, which facilitates classification or segmentation. However, existing experiments [12] already show that this is not always the case, since the mutual information between samples in the input layer and data in the hidden layers of DL networks might decrease even when the ReLU activation function is used in the hidden layers. Besides, compared with IB theory, MCR2 cannot explain the mechanism of nonlinear activation functions in DL networks, such as the tanh activation function.

Furthermore, some typical works [13; 14] use IB theory to analyze the inner organization of DL networks, suggesting that the hidden layers of DL go through two main phases. The first phase is called "fitting", where the training errors of DL drop dramatically by fitting the training labels, i.e., the mutual information between samples in the input layer and data in the hidden layers of the DL network increases. The second phase is referred to as "compression", which starts once the training errors become small, i.e., the mutual information between samples in the input layer and data in the hidden layers of the DL network decreases. These researchers also claim [13; 14] that the compression phase plays a vital role in the excellent generalization performance of DL. However, other studies have shown that when the nonlinear activation function in the hidden layers of DL networks is replaced with a linear function (i.e., the ReLU function), sometimes there is no compression phase when training the DL networks [12]. This discrepancy has led some researchers to challenge the validity and capability of IB theory for interpreting the inner organization of DL networks [12].

Moreover, when applying IB theory to interpret the inner organization of DL networks, all academic works, including those by the inventors of IB theory [1; 15], have aimed at minimizing the mutual information between the input signal and its projection while maximizing the mutual information between that projection and the output signal. However, some experiments show that this is not always the case. Therefore, it is necessary and important to provide a new way to interpret the mechanism of DL networks using IB theory.

In this paper, to address the three issues above, we introduce an auxiliary function of conditional entropy into IB theory, which leads to the following contributions:

  • By introducing an auxiliary function to IB theory and deriving IB theory under a Gaussian distribution and a linear projection/activation function, we discover that MCR2 is simply a special case of IB theory. This finding implies that MCR2 is only a locally optimal solution of IB theory for interpreting the inner organization of DL networks. Consequently, through this transformation, MCR2 and IB theory can be unified. More importantly, when the ReLU activation function is used in a DL network, the phenomenon that the mutual information between samples in the input layer and data in the hidden layers decreases can be explained.

  • With the help of the auxiliary function, the IB method can be transformed into another form, providing a new theoretical perspective that explains why the compression phase does not occur in the hidden layers of DL networks once the nonlinear activation function is replaced by a linear activation function, i.e., why the mutual information between samples in the input layer and data in the hidden layers keeps increasing. Our findings therefore validate the principle of IB theory for interpreting DL networks' inner organization.

  • With the transformed IB theory obtained by introducing the auxiliary function, we provide a new way to interpret the inner organization of DL networks: 1) the mutual information between data in the hidden layers and the output data in the final layer of the DL network keeps increasing; 2) for the mutual information between samples in the input layer and data in the hidden layers, sometimes only one phase occurs, while at other times both the fitting and compression phases occur; 3) in some cases, DL networks ultimately try to maximize the difference between the coding rate of the whole dataset and the average of all subsets within each class, while in other cases they aim to minimize the sum of these two quantities.

2 MCR2 and IB theory

Learning distributions from a finite set of i.i.d. samples of $K$ classes/categories is one of the most fundamental problems in machine learning [10]. The input data $X$ consists of $M$ samples, each with dimension $D$, represented as $X=\begin{bmatrix}x_{1}&x_{2}&\cdots&x_{M}\end{bmatrix}\in\mathbb{R}^{D\times M}$. To enable clustering or classification tasks, it is essential to find a good representation using a mapping function $f(x_{i},\Theta):\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$, where $i$ ranges from 1 to $M$. This mapping function captures the intrinsic structures of sample $x_{i}$ and projects it to a feature space of dimensionality $d$ with parameter $\Theta$ [10].

In the context of supervised learning in DL, we can view the output data in the hidden layer as selecting certain discriminative features, represented by $Z=f(X,\Theta)\in\mathbb{R}^{d\times M}$, that facilitate the subsequent classification task. Then the class label $Y$ can be predicted by optimizing a classifier denoted by $g(Z)$. Therefore, this process can be represented mathematically as follows:

X\stackrel{f(X,\Theta)}{\longrightarrow}Z\stackrel{g(Z)}{\longrightarrow}Y. (1)
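To make the pipeline in Eq. (1) concrete, the following minimal Python (numpy) sketch instantiates $f$ as a linear map and $g$ as a linear classifier; the dimensions, random parameters, and argmax readout are illustrative assumptions rather than the architectures studied later in the paper.

# Minimal sketch of the pipeline in Eq. (1): X --f(X, Theta)--> Z --g(Z)--> Y.
# All shapes and the linear-map / linear-classifier choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, d, M, K = 12, 5, 256, 2           # input dim, feature dim, samples, classes

X = rng.standard_normal((D, M))      # input data X in R^{D x M}
Theta = rng.standard_normal((d, D))  # parameters of the feature map f
W = rng.standard_normal((K, d))      # parameters of the classifier g

Z = Theta @ X                        # Z = f(X, Theta) in R^{d x M}
logits = W @ Z                       # classifier scores g(Z)
Y_hat = logits.argmax(axis=0)        # predicted labels Y
print(Z.shape, Y_hat.shape)          # (5, 256) (256,)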

MCR2 for interpreting DL [2; 5]: MCR2 intends to encode the discriminative features, denoted by $Z=\begin{bmatrix}z_{1}&z_{2}&\cdots&z_{M}\end{bmatrix}=\begin{bmatrix}f(x_{1},\Theta)&f(x_{2},\Theta)&\cdots&f(x_{M},\Theta)\end{bmatrix}$, up to a precision of $\epsilon$, i.e., $\hat{z}_{i}=z_{i}+c_{i}$, where $c_{i}$ is drawn from a Gaussian distribution with zero mean and covariance $\frac{\epsilon^{2}}{d}E$ (note that $E$ is the identity matrix, with 1's down its diagonal and 0's everywhere else). In addition, by assuming that $z_{i}$ follows a Gaussian distribution with zero mean and unit variance, i.e., $\sum_{i=1}^{d}z_{i}=0$ (if $z_{i}$ has a non-zero mean, we can subtract $\sum_{i=1}^{d}z_{i}$ from $z_{i}$), the number of bits required to encode the discriminative features $Z$ is given by $\frac{(M+d)}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)$ [5]. Consequently, the average coding rate per sample, subject to the precision or distortion level $\epsilon$, is expressed as [2]

R(Z,\epsilon)=\frac{1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E). (2)
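For reference, Eq. (2) can be transcribed directly into numpy; the toy feature matrix below is an illustrative assumption.

# A direct numpy transcription of the coding rate in Eq. (2); toy data are illustrative.
import numpy as np

def coding_rate(Z, eps=0.1):
    """R(Z, eps) = 1/2 * log det(d / (M * eps^2) * Z Z^T + E), in nats."""
    d, M = Z.shape
    E = np.eye(d)
    return 0.5 * np.linalg.slogdet(d / (M * eps**2) * Z @ Z.T + E)[1]

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 256))    # d = 5 features, M = 256 samples
print(coding_rate(Z, eps=0.1))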

In addition, the multi-class features $Z$ may belong to multiple low-dimensional subspaces, which can affect the accuracy of the rate distortion evaluation. To address this issue, MCR2 partitions the features into several subsets, denoted as $Z=Z_{1}\cup Z_{2}\cup\cdots\cup Z_{K}$, with each subset lying in a separate low-dimensional subspace. In order to encode the membership of the $M$ samples in the $K$ classes more effectively, a set of diagonal matrices $\Pi=\{\Pi_{j}\}_{j=1}^{K}$ is introduced. These matrices have diagonal entries representing the membership of each sample in each class:

\sum_{j=1}^{K}\Pi_{j}=E, (3)

and each $\Pi_{j}$ is defined as

\Pi_{j}=\begin{bmatrix}\hat{\pi}_{1,j}&0&\cdots&0\\ 0&\hat{\pi}_{2,j}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\hat{\pi}_{M,j}\end{bmatrix}\in\mathbb{R}^{M\times M}, (4)

where $\hat{\pi}_{i,j}\in\{0,1\}$: if $x_{i}$ (or $z_{i}$) belongs to the $j^{th}$ class, $\hat{\pi}_{i,j}=1$; otherwise, $\hat{\pi}_{i,j}=0$.
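For concreteness, the membership matrices of Eqs. (3)-(4) can be built from a label vector as in the following small Python sketch; the example labels are an illustrative assumption.

# A small sketch of the membership matrices Pi_j in Eqs. (3)-(4), built from labels.
import numpy as np

def membership_matrices(labels, K):
    """Return diagonal Pi_j (M x M) with hat{pi}_{i,j} = 1 iff sample i is in class j."""
    return [np.diag((labels == j).astype(float)) for j in range(K)]

labels = np.array([0, 1, 1, 0, 1])                 # illustrative labels
Pis = membership_matrices(labels, K=2)
print(np.allclose(sum(Pis), np.eye(len(labels))))  # Eq. (3): sum_j Pi_j = E -> True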

With respect to this partition, the average number of bits per sample, subject to the precision/distortion $\epsilon$, can be written as [2]

R^{c}(Z,\epsilon|\Pi)=\frac{1}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E), (5)

where $tr(\cdot)$ is the trace of a matrix. Finally, MCR2 tries to maximize the difference between the coding rate/length of the whole dataset in Eq. (2) and the average of all the subsets in Eq. (5) [2; 5]:

\Delta R(Z,\epsilon,\Pi)=R(Z,\epsilon)-R^{c}(Z,\epsilon|\Pi)
=\frac{1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)-\frac{1}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (6)
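A compact numpy sketch of the full MCR2 objective in Eq. (6), combining Eqs. (2) and (5), may help make the notation concrete; the toy features and labels are illustrative assumptions.

# A numpy sketch of the MCR^2 objective in Eq. (6): Delta R = R(Z, eps) - R^c(Z, eps | Pi).
import numpy as np

def coding_rate(Z, eps):
    d, M = Z.shape
    return 0.5 * np.linalg.slogdet(d / (M * eps**2) * Z @ Z.T + np.eye(d))[1]

def class_coding_rate(Z, labels, K, eps):
    d, M = Z.shape
    Rc = 0.0
    for j in range(K):
        Pi_j = np.diag((labels == j).astype(float))   # membership matrix from Eq. (4)
        tr = np.trace(Pi_j)
        if tr > 0:
            Rc += (tr / M) * 0.5 * np.linalg.slogdet(
                d / (tr * eps**2) * Z @ Pi_j @ Z.T + np.eye(d))[1]
    return Rc

def delta_R(Z, labels, K, eps=0.1):
    return coding_rate(Z, eps) - class_coding_rate(Z, labels, K, eps)

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 256))            # illustrative features
labels = rng.integers(0, 2, size=256)        # illustrative labels
print(delta_R(Z, labels, K=2))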

In simpler terms, MCR2 has two goals:

  • Maximize the mutual information between the input data $X$ and the discriminative features $Z$, which is done by enlarging the space of $Z$ and is measured through $R(Z,\epsilon)$.

  • Maximize the mutual information between the discriminative features $Z$ and the output/label $Y$, which is done by compressing the space of the discriminative features $Z_{j}$ of each category and is measured through $R^{c}(Z,\epsilon|\Pi)$.

IB theory for interpreting DL [1; 13; 14]: With the training data $X$ and the corresponding label $Y$, IB theory contains two steps: encoding and decoding. To encode the discriminative features $Z$, IB theory tries to minimize the mutual information $I(X,Z)$ between the input data $X$ and the discriminative features $Z$, while during the decoding stage, IB theory aims to maximize the mutual information $I(Y,Z)$ between the discriminative features $Z$ and the output/label $Y$. Thus, IB theory can be formulated as finding an optimal representation $Z$ that minimizes the following Lagrangian:

\Delta I(Z,\beta,X,Y)=I(X,Z)-\beta I(Y,Z), (7)

where $\beta\in(0,+\infty)$ is the tradeoff parameter that balances the two types of mutual information above [1].
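For discrete variables, Eq. (7) can be evaluated directly from a joint distribution $p(x,y)$ and a stochastic encoder $p(z|x)$; the following Python sketch does so, with the small distributions chosen purely for illustration.

# A discrete-variable sketch of the IB Lagrangian in Eq. (7): I(X, Z) - beta * I(Y, Z).
# The joint p(x, y) and encoder p(z|x) below are illustrative assumptions.
import numpy as np

def mutual_information(p_ab):
    """I(A, B) in nats for a joint distribution p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask]))

def ib_lagrangian(p_xy, p_z_given_x, beta):
    p_x = p_xy.sum(axis=1)                    # marginal p(x)
    p_xz = p_z_given_x * p_x[:, None]         # joint p(x, z)
    p_yz = p_xy.T @ p_z_given_x               # joint p(y, z), since Z depends on X only
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                 # joint p(x, y)
p_z_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])          # encoder p(z|x), rows sum to 1
print(ib_lagrangian(p_xy, p_z_given_x, beta=2.0))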

By analyzing the objective functions of MCR2 in Eq. (6) and IB theory in Eq. (7), we can observe that both MCR2 and IB theory aim to maximize the mutual information between the discriminative features $Z$ and the output/label $Y$. However, when it comes to the mutual information between the input data $X$ and the discriminative features $Z$, MCR2 maximizes this quantity while IB theory aims to minimize it. In addition, as a promising method, MCR2 claims to be the first principle of DL, and the mechanism of the ReLU activation function in DL networks could be explained by MCR2. However, the mutual information between samples in the input layer and data in the hidden layers sometimes decreases when the ReLU activation function is applied to a DL network. Therefore, investigating the connection between IB theory and MCR2 is not only necessary but also crucial. Furthermore, since IB theory's goal is to reduce the mutual information between the input data $X$ and the discriminative features $Z$, it proposes that this mutual information initially increases and then goes through a compression phase in which it begins to decrease. It is important to note that this compression phase is not present in DL networks when nonlinear activation functions are replaced with linear functions such as ReLU [12]. Hence, presenting a new theoretical perspective to explain this phenomenon in DL is critical. Besides, since the tradeoff parameter $\beta>0$, the goal of IB theory when explaining the inner organization of DL is to minimize the mutual information $I(X,Z)$. However, recent experiments [12] have shown that in certain situations the mutual information $I(X,Z)$ of DL networks keeps increasing, while in other situations the fitting and compression phases alternate. Therefore, providing a new perspective is essential for a better understanding of the behavior of DL networks.

3 Proposed method based on IB theory

In this section, an auxiliary function is introduced into IB theory, which resolves the three concerns outlined in Sections 1 and 2. Starting from the objective function of IB theory in Eq. (7) and introducing an auxiliary function $\beta(H(Z|X)-H(Z|X))$, we can reformulate Eq. (7) as

\Delta I(Z,\beta,X,Y)=I(X,Z)-\beta I(Y,Z)+\beta(H(Z|X)-H(Z|X)), (8)

where $H(\cdot)$ is entropy [15]. Since $H(Z|X)-H(Z|X)=0$, the objective function in Eq. (8) is the same as the objective function of IB theory in Eq. (7). In addition, according to the definitions of mutual information and entropy [15], $I(X,Z)=H(Z)-H(Z|X)$ and $I(Y,Z)=H(Z)-H(Z|Y)$, so we can rewrite Eq. (8) as

\Delta I(Z,\beta,X,Y)=H(Z)-H(Z|X)-\beta(H(Z)-H(Z|Y))+\beta(H(Z|X)-H(Z|X))
=(1-\beta)H(Z)-(1-\beta)H(Z|X)+\beta(H(Z|Y)-H(Z|X))
=(1-\beta)(H(Z)-H(Z|X))+\beta(H(Z|Y)-H(Z|X)). (9)
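As a sanity check, the following Python sketch verifies numerically on a random discrete joint distribution that Eq. (9) returns exactly the same value as the original Lagrangian in Eq. (7), since the auxiliary term is identically zero; the toy distribution and the value of $\beta$ are illustrative assumptions.

# Numeric check that Eq. (9) equals Eq. (7) for a random discrete joint p(x, y, z).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
p_xyz = rng.random((3, 2, 4))
p_xyz /= p_xyz.sum()                                 # joint p(x, y, z)

p_z = p_xyz.sum(axis=(0, 1))
p_xz = p_xyz.sum(axis=1)                             # p(x, z)
p_yz = p_xyz.sum(axis=0)                             # p(y, z)

H_Z = entropy(p_z)
H_Z_given_X = entropy(p_xz.ravel()) - entropy(p_xz.sum(axis=1))   # H(Z|X) = H(X,Z) - H(X)
H_Z_given_Y = entropy(p_yz.ravel()) - entropy(p_yz.sum(axis=1))   # H(Z|Y) = H(Y,Z) - H(Y)

beta = 2.5
I_XZ, I_YZ = H_Z - H_Z_given_X, H_Z - H_Z_given_Y
eq7 = I_XZ - beta * I_YZ                                                       # Eq. (7)
eq9 = (1 - beta) * (H_Z - H_Z_given_X) + beta * (H_Z_given_Y - H_Z_given_X)    # Eq. (9)
print(np.isclose(eq7, eq9))                          # True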

From Eq. (9), it can be observed that the first term on the right-hand side is the mutual information between the input data $X$ and the discriminative features $Z$, i.e., $H(Z)-H(Z|X)=I(X,Z)$. Since $\beta$ lies within the range $(0,+\infty)$, IB theory aims to minimize the mutual information between the input data $X$ and the discriminative features $Z$ for $0<\beta<1$, while maximizing this mutual information for $\beta>1$. Moreover, the second term in Eq. (9) corresponds to the difference between the uncertainty of the discriminative features $Z$ given the labels $Y$ and given the samples $X$, i.e., $H(Z|Y)-H(Z|X)$. When $X$ is given, the uncertainty of the discriminative features $Z$ is fixed, which implies that IB theory tries to decrease the uncertainty of the discriminative features $Z$ given the labels $Y$. In other words, IB theory tries to maximize the mutual information between data in the hidden layers and the output data in the final layer of DL networks. Now, we shall show that MCR2 is just a special case and a locally optimal solution of IB theory under a Gaussian distribution and a linear mapping.

Assume that the data $x_{i}$ follows a Gaussian distribution with zero mean, i.e., $\sum_{i=1}^{D}x_{i}=0$ (if it has a non-zero mean, we can simply subtract $\sum_{i=1}^{D}x_{i}$ from $x_{i}$). We also assume that the mapping function $f(x_{i},\Theta)$ is a linear mapping, $f(x_{i},\Theta)=\Theta x_{i}=z_{i}$. Additionally, we denote $\hat{z}_{i}$ as the approximation of $z_{i}$ and allow the squared error between $\hat{z}_{i}$ and $z_{i}$ to be $\epsilon^{2}$, meaning that, given a coding precision $\epsilon$, we may model the approximation error as independent additive Gaussian noise:

\hat{z}_{i}=z_{i}+c_{i}, (10)

where $c_{i}$ has zero mean and covariance $\frac{\epsilon^{2}}{d}E$ for $i=1,\cdots,M$ (notice that all of the assumptions above are the same as those of MCR2). With these assumptions, we can rewrite Eq. (9) as

\Delta I(Z,\epsilon,\beta,X,Y)=(1-\beta)(H(\hat{Z})-H(\hat{Z}|X))+\beta(H(\hat{Z}|Y)-H(\hat{Z}|X)). (11)

Since the input data $x_{i}$ follows a Gaussian distribution and a linear transformation of a Gaussian distribution is still Gaussian, both $\hat{z}_{i}$ and $z_{i}$ are Gaussian. Then, according to the definition of the differential entropy of a Gaussian distribution [16], the terms $H(\hat{Z})$, $H(\hat{Z}|X)$ and $H(\hat{Z}|Y)$ in Eq. (11) can be expressed as

\begin{cases}H(\hat{Z})=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}}))\\ H(\hat{Z}|X)=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X}))\\ H(\hat{Z}|Y)=\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|Y}))\end{cases}, (12)

where $e$ is Euler's number, $\Sigma_{\hat{Z}}$ is the covariance matrix of $\hat{Z}$, and $\Sigma_{\hat{Z}|X}$ and $\Sigma_{\hat{Z}|Y}$ are the conditional covariance matrices. Then, with Eq. (12), we can rewrite Eq. (11) as

\Delta I(Z,\epsilon,\beta,X,Y)=(1-\beta)(\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}}))-\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X})))+\beta(\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|Y}))-\frac{1}{2}\log((2\pi e)^{d}\det(\Sigma_{\hat{Z}|X})))
=\frac{1-\beta}{2}\log\det(\Sigma_{\hat{Z}|X}^{-1}\Sigma_{\hat{Z}})+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\Sigma_{\hat{Z}|X})), (13)

where the properties of the $\log$ and $\det$ functions [17] are used in the derivation of Eq. (13), i.e., $\log ab=\log a+\log b$, $\log\frac{a}{b}=\log a-\log b$ and $\frac{\det(A)}{\det(B)}=\det(B^{-1}A)$. In addition, since both $X$ and $\hat{Z}$ are Gaussian and $\hat{z}_{i}=\Theta x_{i}+c_{i}$, according to the Schur complement formula [18], the relevant covariance matrices $\Sigma_{\hat{Z}}$ and $\Sigma_{\hat{Z}|X}$ in Eq. (13) can be written as

\begin{cases}\Sigma_{\hat{Z}}=\Theta\Sigma_{X}\Theta^{T}+\frac{\epsilon^{2}}{d}E\\ \Sigma_{X,\hat{Z}}=\Theta\Sigma_{X}\\ \Sigma_{\hat{Z}|X}=\Sigma_{\hat{Z}}-\Sigma_{X,\hat{Z}}\Sigma_{X}^{-1}\Sigma_{\hat{Z},X}=\frac{\epsilon^{2}}{d}E\end{cases}, (14)

then, with Eq. (14), we can rewrite Eq. (13) as

\Delta I(Z,\epsilon,\beta,X,Y)=\frac{1-\beta}{2}\log\det(\Sigma_{\hat{Z}|X}^{-1}\Sigma_{\hat{Z}})+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\Sigma_{\hat{Z}|X}))
=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}(\log\det(\Sigma_{\hat{Z}|Y})-\log\det(\frac{\epsilon^{2}}{d}E)). (15)
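The chain from Eq. (12) to Eq. (15) can be checked numerically under the stated assumptions (Gaussian $x_{i}$, linear map, additive noise with covariance $\frac{\epsilon^{2}}{d}E$); the dimensions and the random $\Theta$ in the following Python sketch are illustrative assumptions.

# Numeric sanity check of Eqs. (12)-(15): Gaussian entropies plus the Schur complement
# of Eq. (14) reproduce the first term of Eq. (15). Dimensions and Theta are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, d, eps = 8, 4, 0.1

Sigma_X = np.cov(rng.standard_normal((D, 1000)))            # a valid covariance for X
Theta = rng.standard_normal((d, D))
E = np.eye(d)

Sigma_Zhat = Theta @ Sigma_X @ Theta.T + (eps**2 / d) * E   # Eq. (14), first line
Sigma_Zhat_X = (eps**2 / d) * E                             # Eq. (14), Schur complement

def gaussian_entropy(Sigma):
    """H = 1/2 * log((2*pi*e)^k * det(Sigma)), as in Eq. (12)."""
    k = Sigma.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

lhs = gaussian_entropy(Sigma_Zhat) - gaussian_entropy(Sigma_Zhat_X)    # H(Zhat) - H(Zhat|X)
rhs = 0.5 * np.linalg.slogdet(d / eps**2 * Theta @ Sigma_X @ Theta.T + E)[1]
print(np.isclose(lhs, rhs))   # first term of Eq. (15), up to the (1 - beta) factor -> True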

Now, the final step is to analyze the term $\log\det(\Sigma_{\hat{Z}|Y})$. Denote by $p(Y^{j})$ the probability that a data vector $x_{i}$ or $\hat{z}_{i}$ belongs to the $j^{th}$ class, i.e., the proportion of the data $x_{i}$ or $\hat{z}_{i}$ in each class; then we have

\begin{cases}H(\hat{Z}|Y)=\sum_{j=1}^{K}p(Y^{j})H(\hat{Z}|Y^{j})=\sum_{j=1}^{K}p(Y^{j})\sum_{l=1}^{K}p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}\\ \sum_{j=1}^{K}p(Y^{j})=1\\ \sum_{j=1}^{K}p(Y^{j})H(\hat{Z}|X)=H(\hat{Z}|X)\end{cases}, (16)

where $\hat{Z}_{l}$ is the subset of $\hat{Z}$ that belongs to the $l^{th}$ class, i.e., $\hat{Z}=\hat{Z}_{1}\cup\hat{Z}_{2}\cup\cdots\cup\hat{Z}_{K}$; thus $p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}=0$ for $l\neq j$, and we have

H(\hat{Z}|Y)=\sum_{j=1}^{K}p(Y^{j})\sum_{l=1}^{K}p(\hat{Z}_{l}|Y^{j})\log\frac{1}{p(\hat{Z}_{l}|Y^{j})}=\sum_{j=1}^{K}p(Y^{j})p(\hat{Z}_{j}|Y^{j})\log\frac{1}{p(\hat{Z}_{j}|Y^{j})}
=\sum_{j=1}^{K}p(Y^{j})H(\hat{Z}_{j}|Y^{j}). (17)

With Eqs. (16) and (17) and $\Sigma_{\hat{Z}_{j}|Y^{j}}=\Sigma_{\hat{Z}_{j}}=\Theta\Sigma_{X_{j}}\Theta^{T}+\frac{\epsilon^{2}}{d}E$, where $X_{j}$ is the subset of the training data $X$ that belongs to the $j^{th}$ class, the objective function of Eq. (15) can be rewritten as

\Delta I(Z,\epsilon,\beta,X,Y)=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})(\log\det(\Sigma_{\hat{Z}_{j}|Y^{j}})-\log\det(\frac{\epsilon^{2}}{d}E))
=\frac{1-\beta}{2}\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X}\Theta^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})\log\det(\frac{d}{\epsilon^{2}}\Theta\Sigma_{X_{j}}\Theta^{T}+E). (18)

Since $\Theta\Sigma_{X}\Theta^{T}$ and $\Theta\Sigma_{X_{j}}\Theta^{T}$ are the covariance matrices of $Z$ and $Z_{j}$, respectively, we can rewrite the objective function of Eq. (18) as

\Delta I(Z,\epsilon,\beta,Y)=\frac{1-\beta}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}p(Y^{j})\log\det(\frac{d}{Mp(Y^{j})\epsilon^{2}}Z_{j}Z_{j}^{T}+E). (19)

Following the same procedure as MCR2, the set of diagonal matrices $\Pi=\{\Pi_{j}\}_{j=1}^{K}$ from Eq. (4) is introduced. In addition, since $p(Y^{j})=\frac{tr(\Pi_{j})}{M}$ and $Z_{j}Z_{j}^{T}=Z\Pi_{j}Z^{T}$, Eq. (19) can be rewritten as

\Delta I(Z,\epsilon,\beta,\Pi)=\frac{1-\beta}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)+\frac{\beta}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (20)
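The step from Eq. (19) to Eq. (20) relies only on the identities $p(Y^{j})=\frac{tr(\Pi_{j})}{M}$ and $Z_{j}Z_{j}^{T}=Z\Pi_{j}Z^{T}$; the following Python sketch confirms numerically that the two forms agree (the toy features, labels and $\beta$ are illustrative assumptions).

# Numeric check that Eq. (19) and Eq. (20) give the same value for the same data.
import numpy as np

rng = np.random.default_rng(0)
d, M, K, eps, beta = 4, 200, 3, 0.1, 5.0
Z = rng.standard_normal((d, M))
labels = rng.integers(0, K, size=M)
E = np.eye(d)

def logdet(A):
    return np.linalg.slogdet(A)[1]

# Eq. (19): class-wise sums written with p(Y^j) and the submatrices Z_j
eq19 = (1 - beta) / 2 * logdet(d / (M * eps**2) * Z @ Z.T + E)
for j in range(K):
    Zj = Z[:, labels == j]
    p_j = Zj.shape[1] / M
    eq19 += beta / 2 * p_j * logdet(d / (M * p_j * eps**2) * Zj @ Zj.T + E)

# Eq. (20): the same objective written with the membership matrices Pi_j
eq20 = (1 - beta) / 2 * logdet(d / (M * eps**2) * Z @ Z.T + E)
for j in range(K):
    Pi_j = np.diag((labels == j).astype(float))
    tr = np.trace(Pi_j)
    eq20 += beta / 2 * (tr / M) * logdet(d / (tr * eps**2) * Z @ Pi_j @ Z.T + E)

print(np.isclose(eq19, eq20))   # True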

With Eq. (20), we can see that IB theory tries to maximize $-\Delta I(Z,\epsilon,\beta,\Pi)$:

\frac{\beta-1}{2}\log\det(\frac{d}{M\epsilon^{2}}ZZ^{T}+E)-\frac{\beta}{2}\sum_{j=1}^{K}\frac{tr(\Pi_{j})}{M}\log\det(\frac{d}{tr(\Pi_{j})\epsilon^{2}}Z\Pi_{j}Z^{T}+E). (21)

Compared with the objective function of MCR2 in Eq. (6), we can see that the difference between MCR2 in Eq. (6) and IB theory in Eq. (21) is just the coefficient $\beta$. It is then easy to see that when $\beta$ is large enough, $\beta-1\approx\beta$, and IB theory in Eq. (21) degenerates to MCR2, i.e.,

-\Delta I(Z,\epsilon,\beta,\Pi)=\Delta R(Z,\epsilon,\Pi). (22)
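The following Python sketch illustrates Eq. (22) numerically: as $\beta$ grows, $-\Delta I(Z,\epsilon,\beta,\Pi)/\beta$ approaches the MCR2 objective $\Delta R(Z,\epsilon,\Pi)$ of Eq. (6); the toy data and the particular $\beta$ values are illustrative assumptions.

# Numeric illustration of Eq. (22): for large beta, -Delta I scales like beta * Delta R.
import numpy as np

rng = np.random.default_rng(0)
d, M, K, eps = 4, 200, 3, 0.1
Z = rng.standard_normal((d, M))
labels = rng.integers(0, K, size=M)
E = np.eye(d)

def logdet(A):
    return np.linalg.slogdet(A)[1]

def R(Z):
    return 0.5 * logdet(d / (M * eps**2) * Z @ Z.T + E)

def Rc(Z):
    out = 0.0
    for j in range(K):
        Pi_j = np.diag((labels == j).astype(float))
        tr = np.trace(Pi_j)
        out += 0.5 * (tr / M) * logdet(d / (tr * eps**2) * Z @ Pi_j @ Z.T + E)
    return out

delta_R = R(Z) - Rc(Z)                                # MCR^2 objective, Eq. (6)
for beta in [2.0, 10.0, 1000.0]:
    neg_delta_I = (beta - 1) * R(Z) - beta * Rc(Z)    # Eq. (21)
    print(beta, neg_delta_I / beta, delta_R)          # the ratio approaches Delta R as beta grows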

This completes the proof that MCR2 is a special case of IB theory. In the next section, we discuss the three concerns stated in Sections 1 and 2 by using the proposed transformed IB theory in Eq. (9).

4 Discussion

Figure 1: Information plane dynamics for different tasks and different DL network architectures (excluding the final layer of the DL network in all four subfigures). A curve in the corresponding information plane is produced for each of the hidden layers, with the first hidden layer at the far right and the final hidden layer at the far left. (a) Binary classification task [14] with a 10-7-5-3 hidden-layer architecture. (b) Binary classification task [14] with a 12-10-7-5-4-3-2 hidden-layer architecture. (c) Binary classification task [14] with a 10-7-5-4-3 hidden-layer architecture. (d) MNIST dataset [12] with a 32-28-24-20-16-12 hidden-layer architecture.

In this section, some experimental results are shown to discuss the three concerns raised in this work, using the proposed transformed IB theory based on the auxiliary function. The code and data sets used for all experiments in this section are from existing work [12]; they are available at https://github.com/artemyk/ibsgd/tree/iclr2018.

Figs. 1(a), (b) and (c) show the information plane dynamics of three neural networks with 4 fully connected hidden layers of width 10-7-5-3, 7 fully connected hidden layers of width 12-10-7-5-4-3-2, and 5 fully connected hidden layers of width 10-7-5-4-3, respectively. Following the same settings as in [12], all three networks are trained with stochastic gradient descent on a binary classification task whose 12-dimensional input represents 12 uniformly distributed points on a 2D sphere [14], with 256 randomly selected samples per batch. Besides, we also show the information plane dynamics on the MNIST dataset [12] (see Fig. 1(d)) for a neural network with 6 fully connected hidden layers of width 32-28-24-20-16-12. Following the same setting as in [12], this network is trained with stochastic gradient descent with 128 randomly selected samples per batch.
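For readers who wish to reproduce such plots, the following Python sketch outlines the kind of binning estimator used to compute information-plane coordinates in the spirit of [12; 14]; the toy data, the bin count, and the treatment of each input pattern as a single discrete symbol are illustrative assumptions, not the exact estimator of those works.

# A minimal sketch of a binning estimator for information-plane coordinates:
# discretize hidden activations and estimate I(X, Z) and I(Y, Z) from frequencies.
import numpy as np

def discrete_mi(a_ids, b_ids):
    """Plug-in mutual information (nats) between two discrete label sequences."""
    joint, pa, pb = {}, {}, {}
    for a, b in zip(a_ids, b_ids):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        pa[a] = pa.get(a, 0) + 1
        pb[b] = pb.get(b, 0) + 1
    n = len(a_ids)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * np.log(c * n / (pa[a] * pb[b]))
    return mi

def layer_information(X, Y, Z, n_bins=30):
    """Estimate (I(X, Z), I(Y, Z)) for one layer's activations Z (M x d)."""
    edges = np.linspace(Z.min(), Z.max(), n_bins + 1)
    Z_binned = np.digitize(Z, edges)
    z_ids = [tuple(row) for row in Z_binned]     # each binned activation vector is one symbol
    x_ids = [tuple(row) for row in X]            # each input pattern is one symbol (assumption)
    return discrete_mi(x_ids, z_ids), discrete_mi(list(Y), z_ids)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(512, 12))           # 12-bit inputs, as in the synthetic task
Y = X.sum(axis=1) % 2                            # an illustrative binary label
Z = np.tanh(X @ rng.standard_normal((12, 5)))    # one hidden layer's activations
print(layer_information(X, Y, Z))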

As can be seen from Fig. 1(a), with the ReLU activation function the two phases of the mutual information $I(X,Z)$, fitting and compression, vary across layers, i.e., there is just one phase in some hidden layers while both phases occur in others. However, on one side, MCR2 always tries to increase the mutual information $I(X,Z)$ by enlarging the space of the discriminative features $Z$, so it cannot explain the inner organization of DL networks. On the other side, IB theory [1] always aims to compress the mutual information $I(X,Z)$ by squeezing the space of the discriminative features $Z$, so it is also not applicable to explain the inner organization of DL networks. Fortunately, by introducing the auxiliary function into IB theory, we unified IB theory [1] and MCR2 [2] by proving that MCR2 is a special case and sub-optimal solution of IB theory, which means that IB theory degenerates to MCR2 when the coefficient $\beta$ approaches positive infinity. With this finding, IB theory can explain the phenomenon that two phases happen in some layers of a DL network while only one phase occurs in others, which is explicitly captured by the term $(1-\beta)(H(Z)-H(Z|X))$ in Eq. (9). In more detail, according to the finding in the state of the art [14], a DL network always tries to reach the theoretical IB limit, which means that the optimal value of $\beta$ is effectively determined by the DL network during the training stage. Since $I(X,Z)=H(Z)-H(Z|X)$ and $\beta\in(0,+\infty)$, the mutual information $I(X,Z)$ decreases when $\beta<1$ and increases when $\beta>1$, so IB theory in Eq. (9) has the ability to explain why both situations happen in DL networks.

In addition, regarding the argument about two phases in DL networks, i.e., fitting and compression, it can be observed from Figs. 1(a), (b), (c) and (d) that the trend of the mutual information $I(X,Z)$ may vary depending on whether the ReLU or tanh activation function is applied to the hidden layers. In more detail, with the tanh activation function both the fitting and compression phases occur, as previously demonstrated in the work of the inventors of IB theory [14]. On the other hand, with the ReLU activation function there may be only a fitting phase (as shown in Figs. 1(c) and (d)), or both fitting and compression phases (as shown in Figs. 1(a) and (b)). However, as IB theory tries to compress the mutual information $I(X,Z)$, it cannot explain all of these situations, especially when the ReLU activation function is applied to a DL network and the compression phase is absent (as reflected in Figs. 1(c) and (d)). Fortunately, the proposed method in Eq. (9), which adds an auxiliary function to IB theory, explains all of these different scenarios. This strengthens the justification for IB theory. With all of these findings, we can see that the aims of DL networks are: 1) to maximize the mutual information between features in the hidden layers and output data in the final layer; 2) to maximize or minimize the mutual information between samples in the input layer and data in the hidden layers; 3) to maximize the difference, or minimize the sum, of the coding rate of the whole dataset and the average of all subsets within each class.

5 Conclusion and future works

In this paper, we provided justice for IB theory and solved three issues by introducing an auxiliary function to IB theory. 1) We unified IB theory and MCR2, showing that MCR2 is just a special case and a locally optimal solution of IB theory under a Gaussian distribution and a linear activation function. In addition, the decrease of the mutual information between samples in the input layer and features in the hidden layers, which could not be explained by MCR2, can be explained by IB theory. 2) We ended both the argument about the two phases in DL networks and the doubts about the validity and capability of IB theory for interpreting their inner organization. 3) We provided a new perspective to explain the inner organization of DL networks when applying IB theory to them.

For future work, we will try to derive the analytical form of IB theory under non-Gaussian distributions and nonlinear activations. With this, all types of DL networks might be unified by IB theory, just as the mechanism of the ReLU activation function can be explained by the MCR2 method, so that all of the mechanisms in DL networks can be described by one law.

References

  • [1] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
  • [2] Ma, Y., Derksen, H., Hong, W., and Wright, J. (2007). Segmentation of multivariate mixed data via lossy data coding and compression. IEEE transactions on pattern analysis and machine intelligence, 29(9), 1546-1562.
  • [3] Cover, T. M., and Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience.
  • [4] Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423.
  • [5] Chan, K. H. R., Yu, Y., You, C., Qi, H., Wright, J., and Ma, Y. (2022). ReduNet: A white-box deep network from the principle of maximizing rate reduction. The Journal of Machine Learning Research, 23(1), 4907-5009.
  • [6] Pezone, F., Barbarossa, S., and Di Lorenzo, P. (2022, May). Goal-oriented communication for edge learning based on the information bottleneck. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8832-8836). IEEE.
  • [7] Mao, H., Chen, X., Fu, Q., Du, L., Han, S., and Zhang, D. (2021, October). Neuron campaign for initialization guided by information bottleneck theory. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (pp. 3328-3332).
  • [8] Qian, K., Zhang, Y., Chang, S., Hasegawa-Johnson, M., and Cox, D. (2020, November). Unsupervised speech decomposition via triple information bottleneck. In International Conference on Machine Learning (pp. 7836-7846). PMLR.
  • [9] Baek, C., Wu, Z., Chan, K. H. R., Ding, T., Ma, Y., and Haeffele, B. D. (2022). Efficient maximal coding rate reduction by variational forms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 500-508).
  • [10] Yu, Y., Chan, K. H. R., You, C., Song, C., and Ma, Y. (2020). Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33, 9422-9434.
  • [11] Tian, L., Tu, Z., Zhang, D., Liu, J., Li, B., and Yuan, J. (2020). Unsupervised learning of optical flow with CNN-based non-local filtering. IEEE Transactions on Image Processing, 29, 8429-8442.
  • [12] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124020.
  • [13] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1-5). IEEE.
  • [14] Shwartz-Ziv, R., and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
  • [15] Painsky, A., and Tishby, N. (2017). Gaussian lower bound for the information bottleneck limit. Journal of Machine Learning Research, 18(213), 1-29.
  • [16] Cai, T. T., Liang, T., and Zhou, H. H. (2015). Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. Journal of Multivariate Analysis, 137, 161-172.
  • [17] Horn, R. A., and Johnson, C. R. (2012). Matrix analysis. Cambridge university press.
  • [18] Magnus, J. R., and Neudecker, H. (2019). Matrix differential calculus with applications in statistics and econometrics. John Wiley and Sons.