
Blocked and Hierarchical Disentangled Representation From Information Theory Perspective

Ziwen Liu
University of Chinese Academy of Sciences
[email protected]
   Mingqiang Li
Information Science Academy of China Electronics Technology Group Corporation
[email protected]
   Congying Han
University of Chinese Academy of Sciences
[email protected]
Abstract

We propose a novel, theoretically grounded model, the blocked and hierarchical variational autoencoder (BHiVAE), to obtain a better-disentangled representation. Information theory is well known to have excellent explanatory power for neural networks, so we approach the disentanglement problem from an information-theoretic perspective. BHiVAE mainly builds on the information bottleneck theory and the information maximization principle. Our main ideas are that (1) a block of neurons, rather than a single neuron, is used to represent each attribute, so that the block can carry enough information; and (2) a hierarchical structure places different attributes on different layers, so that we can segment the information within each layer to ensure that the final representation is disentangled. Furthermore, we present supervised and unsupervised BHiVAE, respectively, where the difference lies mainly in how information is separated between different blocks. In supervised BHiVAE, we utilize the label information as the criterion for separating blocks. In unsupervised BHiVAE, without extra information, we use the Total Correlation (TC) measure to achieve independence, and we design a new prior distribution over the latent space to guide the representation learning. BHiVAE also exhibits excellent disentanglement results in experiments and superior classification accuracy in representation learning.

1 Introduction

Disentangled Representation

Learning an interpretable and disentangled representation of data that reflects its semantic meaning is a long-standing goal of machine learning [5, 6, 8, 27]. A disentangled representation is defined in [5] as: a representation where a change in one dimension corresponds to a change in one factor of variation, while being relatively invariant to changes in other factors. In our understanding, the requirement that different dimensions do not affect each other means that they are probabilistically independent.

As popular generative models, the Variational Autoencoder (VAE) [15] and Generative Adversarial Networks (GAN) [11] have been applied to disentanglement. For example, InfoGAN [8], based on the GAN model, maximizes the mutual information between a small subset of the latent variables and the observations, which makes the latent variables contain more information about the real data and hence increases the interpretability of the latent representation. Based on InfoGAN, FineGAN [18, 30] creates a hierarchical architecture that assigns the background, object shape, and object appearance to different hierarchy levels to generate images of fine-grained object categories. The VAE model, derived from the autoencoder [1], is also widely applied to representation learning, and VAEs have demonstrated their unique power to constrain representations toward disentanglement. For example, $\beta$-VAE [12], $\beta$-TCVAE [7], FactorVAE [14], and others [10] are able to obtain more disentangled representations.

Information Theory

Information theory was proposed by Shannon in 1948 [28] and originated in communication research. Mutual information is the fundamental metric for measuring the amount of information shared between random variables. In representation learning, it has been applied widely [3, 8, 13, 25], including to graph networks [26, 34], and it offers explanatory insight into machine learning [29]. These applications can be summarized as two ideas. The first is the Information Maximization principle (InfoMax) [4, 19], which enforces the representation to preserve as much information about the input data as possible through the transformation (CNN, GNN); some works [8, 13, 35] regularize their original model with an InfoMax term to obtain a more informative and interpretable model. The other is the Information Bottleneck (IB) theory [29, 32, 33], which analyzes the process of information transmission and loss through the network. IB theory regards the network as a Markov chain and uses the Data Processing Inequality (DPI) [9] to explain the variation of information in deep networks. The Variational Information Bottleneck (VIB) method [2] offers a variational form of the supervised IB objective. IB theory has also been shown [36] to explain how and why VAE models adopt their architecture.

With this knowledge of disentanglement and information, we build our model, the blocked and hierarchical variational autoencoder (BHiVAE), entirely from an information theory perspective to obtain better interpretability and controllability. In BHiVAE, because the network's ability to extract features differs with depth, we place data factors at different layers. Furthermore, the weak expressiveness of a single neuron pushes us to use neuron blocks to represent features. We also discuss supervised and unsupervised versions of the model. In the supervised model, we utilize the labels to separate the representation from the feature information. In the unsupervised model, we introduce a specific prior distribution to better match our model and use additional discriminators to split the information. We also provide extensive experiments on the MNIST [17], CelebA [20], and dSprites [23] datasets to show strong performance in disentanglement. In summary, our work mainly makes the following contributions:

  • We approach the disentanglement problem for the first time entirely from an information theory perspective. Most previous works on disentanglement are based on existing models, modified to fit the disentanglement problem.

  • We present the Blocked and Hierarchical Variational Autoencoder (BHiVAE) in both supervised and unsupervised cases. In the supervised case, we utilize the known feature information to guide the representation learning in each hierarchy; in the unsupervised case, we propose a novel distribution-based method to match our neural-block setting.

  • We perform thorough experiments on several public datasets (MNIST, dSprites, and CelebA), comparing with VAE, $\beta$-VAE, FactorVAE, $\beta$-TCVAE, and Guided-VAE on several classic metrics. The results show that BHiVAE performs excellently when all the indicators are considered together.

2 Related Work

In order to obtain disentangled representations, previous work has made significant contributions. Based on the VAE, $\beta$-VAE [12] adds a coefficient to the KL-divergence term of the VAE loss and obtains a more disentangled representation. A significant advantage is that it trains more stably than InfoGAN. However, $\beta$-VAE sacrifices reconstruction quality at the same time. $\beta$-TCVAE [7] and FactorVAE [14] explored this issue in more detail and found that the TC term is the immediate cause of the improved disentanglement.

Guided-VAE [10] also gives a model that uses different strategies in supervised and unsupervised settings to obtain disentangled representations. It uses an additional discriminator to guide the representation learning and to learn knowledge about latent geometric transformations and principal components. This idea of using different methods under different supervision inspires us. FineGAN [30], based on InfoGAN, generates background, object shape, and object appearance images at different hierarchy levels, then combines these three images into the final image. In FineGAN, what helps the disentanglement is the mutual information between the latent codes and each factor. MixNMatch [18], developed from FineGAN, is a conditional generative model that learns disentangled representations and encodes different features from real images, then uses additional discriminators to match the representation to the prior distribution given by the FineGAN model.

Previous works make simple modifications to the $\beta$-VAE or GAN models, adding terms that help disentanglement. In our work, we consider the disentanglement problem fully from information theory and then establish the BHiVAE model. Information theory and optimal coding theory [9, 36] have shown that a longer code can express more information. So in our model, instead of using only a single dimension to represent a ground-truth factor as in previous work, we use multiple neural nodes to do so.

Meanwhile, different ground-truth factors of data contain different levels of information, and the depth of a neural network affects the depth of the information extracted, so our model uses a hierarchical architecture that extracts different factor features at different layers. Therefore, to satisfy the requirement of disentangled representation, i.e., the irrelevance between representation neural blocks, we only need to minimize the mutual information between blocks within the same layer, thanks to the hierarchical architecture.

Figure 1: Architecture of the hierarchical VAE model: (a) encoder part on the left; (b) decoder part on the right.

3 Proposed Method

We propose our model motivated by IB theory and by VAEs such as $\beta$-VAE, FactorVAE, $\beta$-TCVAE, Guided-VAE, and FineGAN. Therefore, in this section, we first introduce IB theory and VAE models, and then we present our detailed model architecture and discuss the supervised and unsupervised BHiVAE methods.

3.1 Information Theory and VAEs

IB theory aims to learn a representation $Z$ that maximizes the compression of information in the real data $X$ while maximizing the expression of the target $Y$. We can describe it as:

\min I(X;Z) - \beta I(Z;Y)   (1)

The target $Y$ is the attribute information in the supervised case and is equal to $X$ in the unsupervised case [36].

In the case of supervised IB theory [2], we can get the upper bound:

I_{\phi}(X;Z) - \beta I_{\theta}(Z;Y) \leq \mathbb{E}_{p_{D}(x)}[D_{KL}(q_{\phi}(z|x)\|p(z))] - \beta\,\mathbb{E}_{p(x,y)}[\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(y|z)]]   (2)

The first term is the KL divergence between the posterior $q_{\phi}(z|x)$ and the prior distribution $p(z)$; the second term is exactly the cross-entropy loss of the label prediction.
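To make the bound in Eq. (2) concrete, below is a minimal PyTorch-style sketch of it, assuming a diagonal-Gaussian posterior $q_{\phi}(z|x)$, a standard normal prior $p(z)$, and a softmax classifier for $p_{\theta}(y|z)$; all names and choices are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def vib_loss(mu, logvar, logits, labels, beta=1.0):
    """Variational bound of Eq. (2): KL(q(z|x) || N(0, I)) - beta * E_q[log p(y|z)].

    mu, logvar : parameters of the diagonal-Gaussian posterior q(z|x), shape (B, d).
    logits     : classifier outputs for a sample z ~ q(z|x), shape (B, K).
    labels     : ground-truth class indices, shape (B,).
    """
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)
    # Cross-entropy is a Monte-Carlo estimate of -E_q[log p(y|z)].
    ce = F.cross_entropy(logits, labels, reduction="none")
    return (kl + beta * ce).mean()

# Typical use with the reparameterization trick:
# z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
# loss = vib_loss(mu, logvar, classifier(z), labels, beta=10.0)
```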

In the case of unsupervised IB theory, we can rewrite the objective of Eq. (1) as:

\min I_{\phi}(X;Z) - \beta I_{\theta}(Z;X)   (3)

Unsupervised IB theory can be seen as a generalization of the VAE model, with an encoder that learns the representation and a decoder that reconstructs the input. The $\beta$-VAE [12] objective is actually an upper bound of it:

\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{p(x)}[D_{KL}(q_{\phi}(z|x)\|p(z)) - \beta\,\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]   (4)

FactorVAE [14] and $\beta$-TCVAE [7] simply put more weight on the TC term $\mathbb{E}_{q(z)}[\log\frac{q(z)}{\tilde{q}(z)}]$, where $\tilde{q}(z)=\prod_{i=1}^{n}q(z_{i})$, which expresses the dependence across the dimensions of a variable in information theory.
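For intuition, TC has a closed form for a zero-mean Gaussian with covariance $\Sigma$: $TC=\frac{1}{2}\big(\sum_{i}\log\Sigma_{ii}-\log\det\Sigma\big)$, which vanishes exactly when $\Sigma$ is diagonal. The short NumPy check below illustrates this; the example covariance is our own and is not from the paper.

```python
import numpy as np

def gaussian_tc(cov):
    """Total correlation of a zero-mean Gaussian with covariance `cov`:
    TC = 0.5 * (sum_i log cov_ii - log det cov); zero iff `cov` is diagonal."""
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

print(gaussian_tc(np.eye(2)))                 # 0.0   -> independent dimensions
print(gaussian_tc([[1.0, 0.5], [0.5, 1.0]]))  # ~0.144 -> dependent (correlated) block
```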

We build our BHiVAE model upon the works and models above. We focus on information transmission and loss through the whole network, and we realize the objective through different methods in the supervised and unsupervised cases.

3.2 BHiVAE

Now let us present our detailed model architecture. As shown in Fig. 1, we feed the data $X$ into the encoder (parameterized by $\phi$), and in the first layer we obtain the latent representation $z^{1}$, which is divided into two parts, $s^{1}$ and $h^{1}$. The part $s^{1}$ is a final representation part, which corresponds to the feature $y^{1}$, and $h^{1}$ is the input of the next layer's encoder, which produces the latent representation $z^{2}$. Then, through three similar network stages, we obtain three representation parts $s^{1}, s^{2}, s^{3}$, which are disentangled, and the part $c^{3}$ in the last layer, which contains information other than the above attributes of the data. Together they make up the whole representation $z=(s^{1};s^{2};s^{3};c^{3})$. Each representation part is then mapped to the same space by a different decoder (all parameterized by $\theta$) and finally concatenated to reconstruct the raw data, as shown in Fig. 1(b). For the problem we discussed, we need the final representation $z$ to be disentangled, i.e., we need independence between the representation parts $s^{1}, s^{2}, s^{3}$, and $c^{3}$.
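The following is a minimal PyTorch sketch of this blocked, hierarchical encoder/decoder, assuming fully connected layers, three hierarchy levels, and the block sizes used later in the experiments ($d(s^{i})=2$, $d(c^{3})=8$); the layer widths and module names are our own illustrative choices, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class BHiVAESketch(nn.Module):
    """Each encoder layer splits z^i into (s^i, h^i); h^i feeds the next layer.
    The full representation is z = (s^1, s^2, s^3, c^3)."""

    def __init__(self, x_dim=64 * 64, h_dims=(128, 64), s_dim=2, c_dim=8):
        super().__init__()
        in_dims = (x_dim,) + h_dims
        out_dims = (s_dim + h_dims[0], s_dim + h_dims[1], s_dim + c_dim)
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))
            for i, o in zip(in_dims, out_dims))
        # One small decoder per block maps it into a shared space before reconstruction.
        self.block_decoders = nn.ModuleList(
            [nn.Linear(s_dim, 32) for _ in range(3)] + [nn.Linear(c_dim, 32)])
        self.reconstruct = nn.Sequential(nn.Linear(4 * 32, 256), nn.ReLU(),
                                         nn.Linear(256, x_dim), nn.Sigmoid())
        self.s_dim = s_dim

    def encode(self, x):
        h, blocks = x, []
        for enc in self.encoders:
            z = enc(h)
            s, h = z[:, :self.s_dim], z[:, self.s_dim:]  # split z^i into (s^i, h^i)
            blocks.append(s)
        blocks.append(h)                                 # the last h is c^3
        return blocks                                    # [s^1, s^2, s^3, c^3]

    def forward(self, x):
        blocks = self.encode(x)
        mapped = [dec(b) for dec, b in zip(self.block_decoders, blocks)]
        return self.reconstruct(torch.cat(mapped, dim=1)), blocks
```

A reconstruction loss can be applied to the first output of `forward`, while the per-layer objectives of Sec. 3.2 attach to each `(s^i, h^i)` split.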

Figure 2: Different methods for constraining the information segmentation between $s^{i}$ and $z^{i}$: (a) unsupervised; (b) supervised.

We can then separate the whole problem into two sub-problems in the $i$-th layer, whose input is $h^{i-1}$ (where $h^{0}=x$):

  • (1) Information flow $h^{i-1}\rightarrow s^{i}\rightarrow y^{i}$: encode the upper layer's output $h^{i-1}$ into the representation $z^{i}$, with one part $s^{i}$ containing sufficient information about one feature factor $y^{i}$;

  • (2) Information separation of $s^{i}$ and $h^{i}$: eliminate the information about $s^{i}$ in $h^{i}$ while requiring $s^{i}$ to contain only the information of the label $y^{i}$.

The first sub-problem can be regarded as an IB problem: the goal is to learn a representation $s^{i}$ that is maximally expressive about the feature $y^{i}$ while minimally informative about the input $h^{i-1}$. It can be described as:

\min I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})   (5)

Satisfying the second sub-problem is more complex and requires different methods under different supervision conditions, so we introduce these conditions in detail below. In summary, our representation is designed to enhance the internal correlation of each block while reducing the relationships between blocks, to achieve the desired disentanglement goal.

3.2.1 Supervised BHiVAE

In the supervised case, we denote the input of the $i$-th layer as $h^{i-1}$ ($h^{0}=x$). Given the $i$-th layer label $y^{i}$, we require the representation part $s^{i}$ to predict the feature correctly while being as compressed as possible. So the objective in the $i$-th ($i=1,2,3$) layer can be described with an information measure as:

\mathcal{L}_{sup}^{class}(i) = I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})   (6)

We can get an upper bound of it:

\mathcal{L}_{sup}^{class}(i) = I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})
  \leq \mathbb{E}_{p(h^{i-1})}[D_{KL}(q_{\phi}(s^{i}|h^{i-1})\|p(s))] - \beta\,\mathbb{E}_{p(h^{i-1},y^{i})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(y^{i}|s^{i})]]
  \triangleq \mathcal{L}_{sup}^{class_{up}}(i)   (7)

So we need one more classifier $\mathcal{C}_{i}$, shown in Fig. 2(b), to predict $y^{i}$ from $s^{i}$.

For the second requirement, since $s^{i}$ is made fully informative about $y^{i}$ by the first sub-problem, the information about $y^{i}$ must be eliminated from $h^{i}$:

\mathcal{L}_{info}^{sup}(i) = I(h^{i};y^{i}) = H(y^{i}) - H(y^{i}|h^{i})   (8)

$H(y^{i})$ is a constant, so minimizing $\mathcal{L}_{info}^{sup}(i)$ is equivalent to minimizing:

\mathcal{L}_{info}^{sup_{e}}(i) = -H(y^{i}|h^{i})   (9)

This resembles a principle of maximum entropy: it simply requires that $h^{i}$ cannot predict the factor feature $y^{i}$ at all, i.e., the probability $h^{i}$ assigns to each category is $\frac{1}{n_{i}}$ (where $n_{i}$ denotes the number of categories of the $i$-th feature). And $h^{i}$ shares the classifier $\mathcal{C}_{i}$ with $s^{i}$, as Fig. 2(b) shows.

So in our supervised model, we can get the total objective as:

\min\Big\{\mathcal{L}^{sup} = \sum_{i=1}^{n}\big(\mathcal{L}_{sup}^{class}(i) + \gamma\,\mathcal{L}_{info}^{sup_{e}}(i)\big)\Big\}   (10)

where $\beta$ and $\gamma$ in the objective are hyper-parameters. The objective (10) satisfies the two requirements we need and handles the second sub-problem with a novel approach.
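A minimal sketch of the supervised per-layer terms, assuming both $s^{i}$ and $h^{i}$ are scored by the shared classifier $\mathcal{C}_{i}$ of Fig. 2(b) and that the KL part of Eq. (7) is handled elsewhere; the entropy estimate below is one straightforward way to implement Eq. (9), and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_layer_loss(s_logits, h_logits, labels, gamma=3.0):
    """Per-layer supervised terms: the classification part of Eq. (7) for s^i plus
    the maximum-entropy term of Eq. (9) for h^i.

    s_logits : logits of the shared classifier C_i evaluated on s^i, shape (B, n_i).
    h_logits : logits of the same classifier evaluated on h^i, shape (B, n_i).
    labels   : ground-truth categories of the i-th feature, shape (B,).
    """
    # s^i must predict the i-th feature correctly (cross-entropy = -E[log p(y^i | s^i)]).
    class_loss = F.cross_entropy(s_logits, labels)

    # h^i must not predict the feature: minimize -H(y^i | h^i), which pushes the
    # predicted distribution toward the uniform 1/n_i over the n_i categories.
    log_probs = F.log_softmax(h_logits, dim=1)
    info_loss = (log_probs.exp() * log_probs).sum(dim=1).mean()   # = -H(y^i | h^i)

    return class_loss + gamma * info_loss
```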

3.2.2 Unsupervised BHiVAE

In the unsupervised case, we know nothing about the data source, so we can only use reconstruction to constrain the representation. However, reconstruction alone is not enough for the disentanglement problem [21], so we use a unique prior distribution over the representation to guide the representation learning. All disentanglement models of the VAE family match the posterior distribution $q_{\phi}(z|x)$ to the standard normal prior $\mathcal{N}(0,I)$, and they obtain representations that are disentangled in each dimension because of the independence across the dimensions of $\mathcal{N}(0,I)$. To match our neural-block representation setting, we set the prior distribution $p(z)$ to $\mathcal{N}(0,\Sigma)$, where $\Sigma$ is a block-diagonal symmetric matrix. The dimension of each block corresponds to the segmentation of each hidden layer. In the unsupervised model, the target is reconstruction, so we can decompose Eq. (5) as:

\min\; I(h^{i-1};s^{i}) - \beta I(s^{i};x)
  \leq \mathbb{E}_{p(h^{i-1})}[D_{KL}(q_{\phi}(z^{i}|h^{i-1})\|p(z))]   (11)
  \quad - D_{KL}(q_{\phi}(z^{i})\|p(z))   (12)
  \quad - \beta\big[\mathbb{E}_{p(h^{i-1})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(x|s^{i})]]   (13)
  \quad - D_{KL}(q_{\phi}(z^{i-1})\|p_{D}(x))\big]   (14)

The first two terms constrain the capacity of the representation $z^{i}$, and the last two reinforce the reconstruction. VAE models use (11) and (13) to achieve this, while the adversarial autoencoder [22] uses the KL divergence (12) between the aggregated posterior $q_{\phi}(z^{i})$ and the prior $p(z)$ to constrain the capacity of the representation and obtain a better representation.

In our model, we also minimize the KL divergence between the aggregated posterior $q_{\phi}(z^{i})$ and the prior $\mathcal{N}(0,\Sigma)$, i.e., $D_{KL}(q_{\phi}(z^{i})\|\mathcal{N}(0,\Sigma))\rightarrow 0$. Since we choose a deterministic encoder, we obtain the objective:

\mathcal{L}_{recon}^{uns} = D_{KL}(q_{\phi}(z^{i})\|\mathcal{N}(0,\Sigma)) - \beta\,\mathbb{E}_{p(h^{i-1})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(x|s^{i})]]   (15)

We use a discriminator, shown at the top of Fig. 2(a), to estimate and optimize $D_{KL}(p_{\phi}(h^{i})\|\mathcal{N}(0,\Sigma))$.

Unlike the supervised case, we adopt a different method to satisfy the information separation requirement. When $s^{i}$ and $h^{i}$ are probabilistically independent, the mutual information between them is zero, i.e., there is no shared information between $s^{i}$ and $h^{i}$. Here we apply an alternative form of mutual information, the Total Correlation (TC) penalty [14, 37], which is a popular measure of dependence for multiple random variables.

$KL(q(z)\|q(\tilde{z}))$, where $q(\tilde{z})=\prod_{j=1}^{d}q(z_{j})$, is the typical TC form; in our case we use the form $KL(p(z^{i})\|p(h^{i})p(s^{i}))=I(h^{i};s^{i})$. So we obtain the information separation objective:

\mathcal{L}_{info}^{uns}(i) = I(h^{i};s^{i})   (16)
  = KL(p(z^{i})\|p(h^{i})p(s^{i}))   (17)

In practice, the KL term is intractable to compute. The product of the marginal distributions $p(h^{i})p(s^{i})$ is not analytically computable, so we take a sampling approach to simulate it. After obtaining a batch of representations $\{z^{i}_{j}=(s^{i}_{j};h^{i}_{j})\}_{j=1}^{N}$ in the $i$-th layer, we randomly permute $\{s^{i}_{j}\}_{j=1}^{N}$ and $\{h^{i}_{j}\}_{j=1}^{N}$ across the batch to generate a sample batch under the distribution $p(h^{i})p(s^{i})$. However, directly estimating the density ratio $\frac{p(z^{i})}{p(h^{i})p(s^{i})}$ is often impossible. Thus, with these random samples, we apply the density-ratio trick [24, 31]: an additional classifier $D$ distinguishes between samples from the two distributions, shown at the bottom of Fig. 2(a):

\mathcal{L}_{info}^{uns}(i) = KL(p(z^{i})\|p(h^{i})p(s^{i})) = TC(z^{i}) = \mathbb{E}_{q(z)}\Big[\log\frac{p(z^{i})}{p(h^{i})p(s^{i})}\Big] \approx \mathbb{E}_{q(z)}\Big[\log\frac{D(z^{i})}{1-D(z^{i})}\Big]   (18)
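A sketch of this density-ratio estimate, assuming $z^{i}=(s^{i};h^{i})$ is stored row-wise in a batch: permuting $s^{i}$ and $h^{i}$ independently across the batch simulates samples from the product of marginals, and a discriminator trained to tell the two sets apart yields the log-ratio in Eq. (18), as in FactorVAE [14]. The function and module names are illustrative.

```python
import torch
import torch.nn.functional as F

def permute_blocks(s_i, h_i):
    """Shuffle s^i and h^i independently across the batch to simulate samples
    from the product of marginals p(h^i) p(s^i)."""
    return torch.cat([s_i[torch.randperm(s_i.size(0))],
                      h_i[torch.randperm(h_i.size(0))]], dim=1)

def tc_estimate(discriminator, s_i, h_i):
    """Eq. (18): TC(z^i) ~ E[log(D(z^i) / (1 - D(z^i)))], where D outputs the
    probability that a sample comes from the joint p(z^i) rather than the permuted batch."""
    logits = discriminator(torch.cat([s_i, h_i], dim=1))      # D(z) = sigmoid(logits)
    return (F.logsigmoid(logits) - F.logsigmoid(-logits)).mean()

def discriminator_loss(discriminator, s_i, h_i):
    """Binary cross-entropy for training D to separate joint samples from permuted ones."""
    joint = torch.cat([s_i, h_i], dim=1).detach()
    product = permute_blocks(s_i.detach(), h_i.detach())
    logits = torch.cat([discriminator(joint), discriminator(product)], dim=0)
    targets = torch.cat([torch.ones_like(logits[: len(joint)]),
                         torch.zeros_like(logits[len(joint):])], dim=0)
    return F.binary_cross_entropy_with_logits(logits, targets)
```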

In summary, the total objective in the unsupervised case is:

\min\Big\{\mathcal{L}^{uns} = \sum_{i=1}^{n}\big(\mathcal{L}_{recon}^{uns} + \gamma\,\mathcal{L}_{info}^{uns}(i)\big)\Big\}   (19)

4 Experiments

In this section, we present quantitative and qualitative experimental results. We also perform experiments comparing with $\beta$-VAE, FactorVAE, and $\beta$-TCVAE on several classic metrics. The datasets used in our experiments are:

MNIST [17]: handwritten digit ($28\times 28\times 1$) images with 60,000 training samples and 10,000 test samples;

dSprites [23]: 737,280 2D shape ($64\times 64\times 1$) images procedurally generated from 6 ground-truth independent latent factors: shape (heart, oval, and square), x-position (32 values), y-position (32 values), scale (6 values), and rotation (40 values);

CelebA (cropped version) [20]: 202,599 celebrity face ($64\times 64\times 3$) images with 5 landmark locations and 40 binary attribute annotations.

In the following, we perform several qualitative and quantitative experiments on these datasets and compare results in both the unsupervised and supervised cases. We demonstrate the ability of our model to disentangle in the unsupervised case, and we also show the representation learned in the supervised case.

Figure 3: Scatter distribution vs. prior distribution: scatter plots of the three layers' representations $\{s^{i}\}_{i=1}^{3}$: (a) layer 1 with KL = 0.61; (b) layer 2 with KL = 0.49; (c) layer 3 with KL = 0.11. Panel (c) visualizes the known category information with different colors.
Figure 4: Traversal images on MNIST for (a) $\beta$-VAE, (b) FactorVAE, (c) Guided-VAE, and (d) BHiVAE. In (a), (b), and (c), the images in the $i$-th row are generated by changing $z_{i}$ from -3 to 3; in (d), we change $\{s^{1},s^{2},s^{3},c^{3}\}$ from (-3,-3) to (3,3) and generate the images in each row.

4.1 Training Details

When training the BHiVAE model, we need the encoder and decoder (Fig. 1) in both the supervised and unsupervised cases. On the CelebA dataset, we build our network with both convolutional and fully connected layers. On the MNIST and dSprites datasets, the inputs are both $64\times 64$ binary images, so we design the network to consist entirely of fully connected layers.

To evaluate the experimental results, we use the Z-diff [12], SAP [16], and MIG [7] metrics to measure the quality of the disentangled representation, and we observe the images generated by traversing the representation. Moreover, we use pre-trained classifiers on attribute features to analyze the model according to the classification accuracy.

4.2 Unsupervised BHiVAE

In the unsupervised case, as introduced in the previous section, the most significant novelty is that we use a different prior, $\mathcal{N}(0,\Sigma)$, to guide the representation learning. Additionally, we need another discriminator to estimate the KL divergence in (18). Therefore, two extra discriminators are needed for BHiVAE, as in Fig. 2(a). Actually, because we aim for $D_{KL}(q_{\phi}(z^{i})\|p(z))=0$, the latent representations $\{z_{j}^{i}\}_{j=1}^{N}$ can be considered as generated from the true distribution, while the prior samples and the permuted representations $\{z^{i\text{-}perm}_{j}\}_{j=1}^{N}$ can both be considered as false. Therefore, we can simplify the network to contain only one discriminator that scores these three distributions.

We want to reinforce the relationship within $s^{i}$ to retain the information and to decrease the dependency between $s^{i}$ and $h^{i}$ to separate the information, so in our unsupervised experiments we use the prior $\mathcal{N}(0,\Sigma)$, where

\Sigma = \begin{bmatrix} 1 & 0.5 & 0 & \cdots & 0 \\ 0.5 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}
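A small NumPy sketch of how such a prior can be constructed and sampled, assuming a per-layer prior over $z^{i}=(s^{i};h^{i})$ in which the 2-dimensional $s^{i}$ block has within-block correlation 0.5 and the remaining dimensions are left independent, as in the matrix above; the dimensions follow the $d(s^{i})=2$, $d(c^{3})=8$ setting used below, and the construction is our illustrative reading of the prior.

```python
import numpy as np

def layer_prior_cov(s_dim=2, rest_dim=8, rho=0.5):
    """Covariance of the per-layer prior N(0, Sigma) for z^i = (s^i; h^i): the s^i block
    is internally correlated with coefficient rho, the remaining dimensions are independent."""
    sigma = np.eye(s_dim + rest_dim)
    block = np.full((s_dim, s_dim), rho)
    np.fill_diagonal(block, 1.0)
    sigma[:s_dim, :s_dim] = block
    return sigma

def sample_prior(n, sigma, seed=0):
    """Draw n prior samples, e.g. to feed the prior-matching discriminator of Fig. 2(a)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n)

sigma = layer_prior_cov()
print(sigma[:3, :3])               # leading block [[1, .5], [.5, 1]], identity elsewhere
z_prior = sample_prior(1000, sigma)
```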

First, we use several experiments to demonstrate the feasibility and validity of this prior setting. We train the model on the dSprites dataset, setting the dimension of the representation $z$ to 14 ($d(z)=14$), with $d(s^{i})=2$, $i=1,2,3$, and $d(c^{3})=8$. We then obtain the representation of 1000 test images at each layer; the three subfigures in Fig. 3 show a scatter plot of each layer's representation, and the curves in these figures are the contours of the block target distribution $p(s)\sim\mathcal{N}\big(0,\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\big)$. Fig. 3 shows that in the first and second layers the distributions of $s^{1}$ and $s^{2}$ do not sufficiently match the prior $p(s)$, but as the layers go deeper, the KL divergence between $q_{\phi}(s^{i})$ and $p(s)$ keeps decreasing, and the scatter plot of $s^{i}$ fits the prior distribution more closely. In our model, the encoder is trained globally, so the front layer's representation learning can be influenced by changes in the deeper representations, which yields a larger KL divergence than in the next layer.

Even more interestingly, in Fig. 3(c) we find that in the third layer, when visualizing the 'Shape' attribute of the dSprites dataset, there is an apparent clustering effect (the different colors denote different categories). This result supports our hypothesis about the deep network's ability: the deeper the network, the more detailed the information it extracts. The representation also matches the prior almost perfectly. Fig. 3(c) further suggests a better traversal scheme. In previous works, because only one dimension represents an attribute, one can simply change the representation from $a$ to $b$ ($a$ and $b$ both constants). However, this does not fit our model, so the direction of the category transformation in Fig. 3(c) inspires us to traverse the data along the diagonal line ($y=x$). Our block prior $p(s)$ also supports this (because the prior distribution's major axis is the diagonal line too).
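A sketch of this diagonal traversal, assuming a trained decoder that maps a full representation $z=(s^{1};s^{2};s^{3};c^{3})$ back to an image; one chosen 2-dimensional block is moved along the line $y=x$ from $(-3,-3)$ to $(3,3)$ while all other blocks stay fixed. The decoder interface and names are illustrative.

```python
import torch

def diagonal_traversal(decode, z, block_slice, steps=8, lo=-3.0, hi=3.0):
    """Traverse one 2-D representation block along the diagonal y = x and decode
    an image at each step, keeping the other blocks fixed.

    decode      : function mapping a representation batch of shape (1, d_z) to images.
    z           : base representation of one sample, shape (1, d_z).
    block_slice : slice selecting the traversed block, e.g. slice(0, 2) for s^1.
    """
    images = []
    for t in torch.linspace(lo, hi, steps):
        z_t = z.clone()
        z_t[:, block_slice] = t          # sets both dimensions of the block to (t, t)
        images.append(decode(z_t))
    return torch.cat(images, dim=0)      # stacked traversal images, shape (steps, ...)

# Example: traverse s^2 (dimensions 2:4 of z) for one test sample:
# imgs = diagonal_traversal(trained_decoder, z_sample, slice(2, 4))
```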

We perform several experiments under the above architecture setting and traversal scheme to show the disentanglement quality on the MNIST dataset. The disentanglement results compared with $\beta$-VAE [12], FactorVAE [14], and Guided-VAE [10] are presented in Fig. 4. Here, considering the dimension of the representation and the number of parameters, the other models' bottleneck size is set to 12, i.e., $d(z)=12$. This setting helps reduce the impact of differences in complexity between model frameworks. However, for a better comparison, we only select the seven dimensions that change most regularly. In our model, we traverse the three block representations $\{s^{i}\}_{i=1}^{3}$, and the remaining representation $c^{3}$ is changed two dimensions at a time, i.e., $c^{3}=(c^{3}_{1:2},c^{3}_{3:4},c^{3}_{5:6},c^{3}_{7:8})$. Fig. 4 shows that $\beta$-VAE hardly obtains a worthwhile disentangled representation, while FactorVAE exhibits attribute changes as the representation varies. Moreover, Fig. 4(c) and Fig. 4(d) both show well-disentangled images: with $h_{1}$ changing in Guided-VAE and $s^{1}$ changing in BHiVAE, the handwriting gets thicker, and $h_{3}$ and $s^{2}$ control the angle of inclination. These results demonstrate the capability of our model.

Method                          Z-diff ↑   SAP ↑    MIG ↑
VAE [15]                        67.1       0.14     0.23
β-VAE [12] (β=6)                97.3       0.17     0.41
FactorVAE [14] (γ=7)            98.4       0.19     0.44
β-TCVAE [7] (α=1, β=8, γ=2)     96.5       0.41     0.49
Guided-VAE [10]                 99.2       0.4320   0.57
BHiVAE (Ours) (β=10, γ=3)       99.0       0.4312   0.61
Table 1: Disentanglement scores: Z-diff, SAP, and MIG scores on the dSprites dataset in the unsupervised case. Bold denotes the best result and blue the second-best result.

We then proceed to the traversal experiments on the dSprites dataset. This dataset has clear attribute distinctions, which allows us to better observe the disentangled representation. In these experiments, BHiVAE learns a 10-dimensional representation $z=(s^{1},s^{2},s^{3},c^{3}_{1:2},c^{3}_{3:4})$, and the other models learn an 8-dimensional $z=(z_{1},z_{2},\dots,z_{8})$. We present the reconstruction and traversal results in Fig. 5. The first and second rows of each subfigure show the original and reconstructed images, respectively. Fig. 5(d) shows that our first three variables $s^{1},s^{2},s^{3}$ have learned the attribute characteristics (Scale, Orientation, and Position) of the data.

Moreover, we perform two quantitative experiments comparing with previous works and present the results in Table 1 and Table 2. The experiments are all based on the same experimental setting as in Fig. 4.

Figure 5: Traversal images on dSprites for (a) $\beta$-VAE, (b) FactorVAE, (c) Guided-VAE, and (d) BHiVAE (ours). The images in the first and second rows of each subfigure are the original and reconstructed images, respectively; the other rows correspond to the traversal images.

First, we compare BHiVAE with previous models on the Z-diff score [12], SAP score [16], and MIG score [7], and present the results in Table 1. Our model is clearly near the top, and on the MIG metric it is better than the other popular models. A high Z-diff score indicates that the learned disentangled representation has less variance in the attributes of the generated data as the corresponding dimension changes, while SAP measures the degree of coupling between data factors and representations. The MIG metric uses mutual information to measure the correlation between the data factors and the learned disentangled representation; our work is modeled precisely from the perspective of mutual information, which is why we perform best on the MIG score.

Beyond that, we also perform transferability experiments by conducting classification tasks on the learned representation. Here we set the representation dimensions to be the same in all models. First, we pre-train a model to obtain the representation $z$ and a classifier to predict the MNIST image label from the representation. We compare the classification accuracy under different dimension settings in Table 2.

Method                          d_z=10 ↑         d_z=16 ↑         d_z=32 ↑
VAE [15]                        97.21% ± 0.42    96.62% ± 0.51    96.41% ± 0.22
β-VAE [12] (β=6)                94.32% ± 0.48    95.22% ± 0.36    94.78% ± 0.53
FactorVAE [14] (γ=7)            93.7% ± 0.07     94.62% ± 0.12    93.69% ± 0.26
β-TCVAE [7] (α=1, β=8, γ=2)     98.4% ± 0.04     98.6% ± 0.05     98.9% ± 0.11
Guided-VAE [10]                 98.2% ± 0.08     98.2% ± 0.07     98.40% ± 0.08
BHiVAE (Ours) (β=10, γ=3)       98.2% ± 0.09     98.7% ± 0.10     99.0% ± 0.05
Table 2: Accuracy of representation in the unsupervised case: bold denotes the best results and blue the second-best results.

Our model does not achieve higher accuracy than $\beta$-TCVAE and Guided-VAE in the case of $d_{z}=10$: the block representation setting limits the number of factors it can learn. However, as $d(z)$ increases, our representation can learn more attribute factors of the data, and the classification accuracy improves accordingly.

4.3 Supervised BHiVAE

In the supervised case, we again perform qualitative and quantitative experiments to evaluate our model. As in the unsupervised case, the overall autoencoder is required, and we additionally need a classifier to enforce the segmentation of information at each level, as shown in Fig. 2(b). We set the dimension of the representation $z$ to 12 ($d(z)=12$, $d(c^{3})=6$, and $d(s^{i})=2$, $i=1,2,3$).

Figure 6: Traversal results comparison on CelebA: the first column is the traversal change of Gender, and the second column is the change of Black Hair; the first row is from Guided-VAE [10], and the second row is ours, following the procedure of Guided-VAE.

We first perform several experiments comparing with Guided-VAE [10] on two attributes (Gender and Black Hair) and present the results in Fig. 6. When changing each attribute $s^{i}\in\{s^{1},s^{2},s^{3}\}$, we keep the other attribute representations and the content representation $c^{3}$ unchanged. We use the third-layer representation $s^{3}$ to control the Gender attribute, while the first two layers correspond to Black Hair and Bald, respectively. In the supervised case, compared to Guided-VAE, we use multiple dimensions to control an attribute, while Guided-VAE uses only one dimension, which may carry insufficient information to control the traversal results. Fig. 6 shows that our model has a broader range of control over attributes, especially reflected in the range of hair color from pure white to pure black.

Besides, our quantitative experiment first pre-trains the BHiVAE model and three attribute classifiers of the representation, then obtains the representations of the training set, traversing the three representation blocks from $(-3,-3)$ to $(3,3)$ along the diagonal ($y=x$). Fig. 7 shows that all three attributes have a transformation threshold in the corresponding representation blocks.

Figure 7: Classifier results used to determine whether the property is present. We traverse the Black Hair ($s^{1}$), Bald ($s^{2}$), and Gender ($s^{3}$) attributes.
Figure 8: Comparison of the accuracy of the Block and Single settings for the Black Hair attribute.

4.4 Block Nodes vs. Single Node

In the previous experiments, we only judged how well the representation is disentangled and did not show that the block setting itself is beneficial, so we set up the following comparison experiment for this question.

For this comparison, we set the dimension of the model representation $z$ to 10, 16, and 32. In the comparison model, we change the dimension of the representation $s^{1}$ (Black Hair) in the first layer to 1, so the dimension of $c^{3}$ changes to 5, 11, and 27 accordingly. We first pre-train the two models under the same conditions and learn a binary classifier that predicts the Black Hair attribute from the representation $z$. Fig. 8 shows that Block is better than Single in every dimension setting, and the accuracy of both increases with the representation dimension. A possible reason is that some information about Black Hair remains in other parts of the representation, and increasing the dimension allows more information about Black Hair to be preserved, yielding better prediction accuracy.

5 Conclusion and Future Work

We propose a new model, the blocked and hierarchical variational autoencoder, that considers and solves the disentanglement problem entirely from the perspective of information theory. We innovatively propose a blocked disentangled representation and a hierarchical architecture. Then, following the idea of information segmentation, we use different methods to guide the information transfer in the unsupervised and supervised cases. Outstanding performance in both image traversal and representation learning gives BHiVAE a wide field of application.

References

  • [1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.
  • [2] Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
  • [3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [4] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [6] Yoshua Bengio and Yann Lecun. Scaling Learning Algorithms towards AI. MIT Press, 2007.
  • [7] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
  • [8] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [9] Thomas M. Cover and Joy Thomas. Elements of Information Theory. Wiley, 1991.
  • [10] Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, and Zhuowen Tu. Guided variational autoencoder for disentanglement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7920–7929, 2020.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [13] Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR 2019. ICLR, April 2019.
  • [14] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [15] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [16] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.
  • [17] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [18] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8039–8048, 2020.
  • [19] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
  • [20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [21] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124, 2019.
  • [22] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
  • [23] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  • [24] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [25] A. Oord, Y. Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018.
  • [26] Yuxiang Ren, Bo Liu, Chao Huang, Peng Dai, Liefeng Bo, and Jiawei Zhang. Heterogeneous deep graph infomax. AAAI, 2020.
  • [27] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Comput., 4(6):863–879, 1992.
  • [28] Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
  • [29] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • [30] Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6490–6499, 2019.
  • [31] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
  • [32] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
  • [33] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
  • [34] Petar Velickovic, William Fedus, William L Hamilton, Pietro Lio, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In ICLR (Poster), 2019.
  • [35] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In ICLR. OpenReview.net, 2019.
  • [36] Slava Voloshynovskiy, Mouad Kondah, Shideh Rezaeifar, Olga Taran, Taras Holotyak, and Danilo Jimenez Rezende. Information bottleneck through variational glasses. arXiv preprint arXiv:1912.00830, 2019.
  • [37] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.