
Blocked and Hierarchical Disentangled Representation From Information Theory Perspective

Ziwen Liu
University of Chinese Academy of Sciences
[email protected]
   Mingqiang Li
Information Science Academy of China Electronics Technology Group Corporation
[email protected]
   Congying Han
University of Chinese Academy of Sciences
[email protected]
Abstract

We propose a novel, theoretically grounded model, the blocked and hierarchical variational autoencoder (BHiVAE), to obtain a better-disentangled representation. Information theory is well known to have excellent explanatory power for neural networks, so we approach the disentanglement problem from an information-theoretic perspective. BHiVAE mainly builds on the information bottleneck theory and the information maximization principle. Our main ideas are that (1) a block of neurons, rather than a single neuron, is used to represent each attribute, so that the block can carry enough information; and (2) a hierarchical structure places different attributes on different layers, so that we can segment the information within each layer to ensure that the final representation is disentangled. Furthermore, we present supervised and unsupervised BHiVAE, respectively, where the difference lies mainly in how information is separated between different blocks. In supervised BHiVAE, we utilize the label information as the criterion for separating blocks. In unsupervised BHiVAE, without extra information, we use the Total Correlation (TC) measure to achieve independence, and we design a new prior distribution over the latent space to guide the representation learning. BHiVAE also exhibits excellent disentanglement results in experiments and superior classification accuracy in representation learning.

1 Introduction

Disentangled Representation

Learning an interpretable and disentangled representation of data that reflects its semantic meaning is a long-standing goal of machine learning [5, 6, 8, 27]. A disentangled representation is defined in [5] as: a representation where a change in one dimension corresponds to a change in one factor of variation, while being relatively invariant to changes in other factors. In our understanding, the requirement that different dimensions do not affect each other means that they are probabilistically independent.

As popular generative models, the Variational Autoencoder (VAE) [15] and Generative Adversarial Networks (GAN) [11] have been applied to disentanglement. For example, InfoGAN [8], based on the GAN model, maximizes the mutual information between a small subset of the latent variables and the observations, which makes the latent variables contain more information about the real data and hence increases the interpretability of the latent representation. Based on InfoGAN, FineGAN [18, 30] creates a hierarchical architecture that assigns the background, object shape, and object appearance to different hierarchy levels to generate images of fine-grained object categories. The VAE model, derived from the autoencoder [1], is also widely applied to representation learning, and VAEs have demonstrated their unique power to constrain representations toward disentanglement. For example, $\beta$-VAE [12], $\beta$-TCVAE [7], FactorVAE [14], and others [10] are able to obtain more disentangled representations.

Information Theory

Information theory was proposed by Shannon in 1948 [28] and originated in communication research. Mutual information is the fundamental metric for measuring the amount of information shared between random variables. In representation learning, it has been applied widely [3, 8, 13, 25], including to graph networks [26, 34], and it offers explanatory insight into machine learning [29]. These applications can be summarized as two ideas. The first is the Information Maximization principle (InfoMax) [4, 19], which enforces the representation to preserve as much information about the input data as possible through the transformation (CNN, GNN); some works [8, 13, 35] regularize their original model with an InfoMax term to obtain a more informative and interpretable model. The other is the Information Bottleneck (IB) theory [29, 32, 33], which analyzes the process of information transmission and loss through the network. IB theory regards the network as a Markov chain and uses the Data Processing Inequality (DPI) [9] to explain the variation of information in deep networks. The Variational Information Bottleneck (VIB) method [2] offers a variational form of the supervised IB objective. IB theory has also been shown [36] to explain how and why VAE models adopt their architecture.

With this knowledge of disentanglement and information, we build our model, the blocked and hierarchical variational autoencoder (BHiVAE), entirely from an information theory perspective to obtain better interpretability and controllability. In BHiVAE, because the network's ability to extract features differs with depth, we place data factors at different layers. Furthermore, the weak expressiveness of a single neuron pushes us to use neuron blocks to represent features. We also discuss supervised and unsupervised versions of the model. In the supervised model, we utilize the labels to separate the representation from the feature information. In the unsupervised model, we introduce a specific prior distribution to better match our model and use additional discriminators to split the information. We also provide extensive experiments on the MNIST [17], CelebA [20], and dSprites [23] datasets to show strong performance in disentanglement. In summary, our work mainly makes the following contributions:

  • We approach the disentanglement problem for the first time entirely from an information theory perspective. Most previous works on disentanglement are based on existing models, modified to fit the disentanglement problem.

  • We present the Blocked and Hierarchical Variational Autoencoder (BHiVAE) in both supervised and unsupervised cases. In the supervised case, we utilize the known feature information to guide the representation learning in each hierarchy; in the unsupervised case, we propose a novel distribution-based method to match our neural-block setting.

  • We perform thorough experiments on several public datasets (MNIST, dSprites, and CelebA), comparing with VAE, $\beta$-VAE, FactorVAE, $\beta$-TCVAE, and Guided-VAE on several classic metrics. The results show that BHiVAE performs excellently when all the indicators are considered together.

2 Related Work

In order to obtain disentangled representations, previous work has made significant contributions. Based on the VAE, $\beta$-VAE [12] adds a coefficient to the KL-divergence term of the VAE loss and obtains a more disentangled representation. A significant advantage is that it trains more stably than InfoGAN. However, $\beta$-VAE sacrifices reconstruction quality at the same time. $\beta$-TCVAE [7] and FactorVAE [14] explored this issue in more detail and found that the TC term is the immediate cause of the improved disentanglement.

Guided-VAE [10] also gives a model that uses different strategies in supervised and unsupervised settings to obtain disentangled representations. It uses an additional discriminator to guide the representation learning and to learn knowledge about latent geometric transformations and principal components. This idea of using different methods under different supervision inspires us. FineGAN [30], based on InfoGAN, generates background, object shape, and object appearance images at different hierarchy levels, then combines these three images into the final image. In FineGAN, what helps the disentanglement is the mutual information between the latent codes and each factor. MixNMatch [18], developed from FineGAN, is a conditional generative model that learns disentangled representations and encodes different features from real images, then uses additional discriminators to match the representation to the prior distribution given by the FineGAN model.

Previous works make simple modifications to the $\beta$-VAE or GAN models, adding terms that help disentanglement. In our work, we consider the disentanglement problem fully from information theory and then establish the BHiVAE model. Information theory and optimal coding theory [9, 36] have shown that a longer code can express more information. So in our model, instead of using only a single dimension to represent a ground-truth factor as in previous work, we use multiple neural nodes to do so.

Meanwhile, different ground-truth factors of data contain different levels of information, and the depth of a neural network affects the depth of the information extracted, so our model uses a hierarchical architecture that extracts different factor features at different layers. Therefore, to satisfy the requirement of disentangled representation, i.e., the irrelevance between representation neural blocks, we only need to minimize the mutual information between blocks within the same layer, thanks to the hierarchical architecture.

Figure 1: Architecture of the hierarchical VAE model: (a) encoder part on the left; (b) decoder part on the right.

3 Proposed Method

We propose our model motivated by IB theory and by VAEs such as $\beta$-VAE, FactorVAE, $\beta$-TCVAE, Guided-VAE, and FineGAN. Therefore, in this section, we first introduce IB theory and VAE models, and then we present our detailed model architecture and discuss the supervised and unsupervised BHiVAE methods.

3.1 Information Theory and VAEs

IB theory aims to learn a representation $Z$ that maximizes the compression of information in the real data $X$ while maximizing the expression of the target $Y$. We can describe it as:

\min I(X;Z) - \beta I(Z;Y)   (1)

The target $Y$ is the attribute information in the supervised case and is equal to $X$ in the unsupervised case [36].

In the case of supervised IB theory [2], we can get the upper bound:

I_{\phi}(X;Z) - \beta I_{\theta}(Z;Y) \leq \mathbb{E}_{p_{D}(x)}[D_{KL}(q_{\phi}(z|x)\|p(z))] - \beta\,\mathbb{E}_{p(x,y)}[\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(y|z)]]   (2)

The first term is the KL divergence between the posterior $q_{\phi}(z|x)$ and the prior distribution $p(z)$; the second term is exactly the cross-entropy loss of the label prediction.
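To make the bound in Eq. (2) concrete, below is a minimal PyTorch-style sketch of it, assuming a diagonal-Gaussian posterior $q_{\phi}(z|x)$, a standard normal prior $p(z)$, and a softmax classifier for $p_{\theta}(y|z)$; all names and choices are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def vib_loss(mu, logvar, logits, labels, beta=1.0):
    """Variational bound of Eq. (2): KL(q(z|x) || N(0, I)) - beta * E_q[log p(y|z)].

    mu, logvar : parameters of the diagonal-Gaussian posterior q(z|x), shape (B, d).
    logits     : classifier outputs for a sample z ~ q(z|x), shape (B, K).
    labels     : ground-truth class indices, shape (B,).
    """
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)
    # Cross-entropy is a Monte-Carlo estimate of -E_q[log p(y|z)].
    ce = F.cross_entropy(logits, labels, reduction="none")
    return (kl + beta * ce).mean()

# Typical use with the reparameterization trick:
# z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
# loss = vib_loss(mu, logvar, classifier(z), labels, beta=10.0)
```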

In the case of unsupervised IB theory, we can rewrite the objective of Eq. (1) as:

\min I_{\phi}(X;Z) - \beta I_{\theta}(Z;X)   (3)

Unsupervised IB theory can be seen as a generalization of the VAE model, with an encoder that learns the representation and a decoder that reconstructs the input. The $\beta$-VAE [12] objective is actually an upper bound of it:

\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{p(x)}[D_{KL}(q_{\phi}(z|x)\|p(z)) - \beta\,\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]]   (4)

FactorVAE [14] and $\beta$-TCVAE [7] simply put more weight on the TC term $\mathbb{E}_{q(z)}[\log\frac{q(z)}{\tilde{q}(z)}]$, where $\tilde{q}(z)=\prod_{i=1}^{n}q(z_{i})$, which expresses the dependence across the dimensions of a variable in information theory.
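For intuition, TC has a closed form for a zero-mean Gaussian with covariance $\Sigma$: $TC=\frac{1}{2}\big(\sum_{i}\log\Sigma_{ii}-\log\det\Sigma\big)$, which vanishes exactly when $\Sigma$ is diagonal. The short NumPy check below illustrates this; the example covariance is our own and is not from the paper.

```python
import numpy as np

def gaussian_tc(cov):
    """Total correlation of a zero-mean Gaussian with covariance `cov`:
    TC = 0.5 * (sum_i log cov_ii - log det cov); zero iff `cov` is diagonal."""
    cov = np.asarray(cov, dtype=float)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

print(gaussian_tc(np.eye(2)))                 # 0.0   -> independent dimensions
print(gaussian_tc([[1.0, 0.5], [0.5, 1.0]]))  # ~0.144 -> dependent (correlated) block
```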

We build our BHiVAE model upon the works and models above. We focus on information transmission and loss through the whole network, and we realize the objective through different methods in the supervised and unsupervised cases.

3.2 BHiVAE

Now let us present our detailed model architecture. As shown in Fig. 1, we feed the data $X$ into the encoder (parameterized by $\phi$), and in the first layer we obtain the latent representation $z^{1}$, which is divided into two parts, $s^{1}$ and $h^{1}$. The part $s^{1}$ is a final representation part, which corresponds to the feature $y^{1}$, and $h^{1}$ is the input of the next layer's encoder, which produces the latent representation $z^{2}$. Then, through three similar network stages, we obtain three representation parts $s^{1}, s^{2}, s^{3}$, which are disentangled, and the part $c^{3}$ in the last layer, which contains information other than the above attributes of the data. Together they make up the whole representation $z=(s^{1};s^{2};s^{3};c^{3})$. Each representation part is then mapped to the same space by a different decoder (all parameterized by $\theta$) and finally concatenated to reconstruct the raw data, as shown in Fig. 1(b). For the problem we discussed, we need the final representation $z$ to be disentangled, i.e., we need independence between the representation parts $s^{1}, s^{2}, s^{3}$, and $c^{3}$.
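The following is a minimal PyTorch sketch of this blocked, hierarchical encoder/decoder, assuming fully connected layers, three hierarchy levels, and the block sizes used later in the experiments ($d(s^{i})=2$, $d(c^{3})=8$); the layer widths and module names are our own illustrative choices, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class BHiVAESketch(nn.Module):
    """Each encoder layer splits z^i into (s^i, h^i); h^i feeds the next layer.
    The full representation is z = (s^1, s^2, s^3, c^3)."""

    def __init__(self, x_dim=64 * 64, h_dims=(128, 64), s_dim=2, c_dim=8):
        super().__init__()
        in_dims = (x_dim,) + h_dims
        out_dims = (s_dim + h_dims[0], s_dim + h_dims[1], s_dim + c_dim)
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(i, 256), nn.ReLU(), nn.Linear(256, o))
            for i, o in zip(in_dims, out_dims))
        # One small decoder per block maps it into a shared space before reconstruction.
        self.block_decoders = nn.ModuleList(
            [nn.Linear(s_dim, 32) for _ in range(3)] + [nn.Linear(c_dim, 32)])
        self.reconstruct = nn.Sequential(nn.Linear(4 * 32, 256), nn.ReLU(),
                                         nn.Linear(256, x_dim), nn.Sigmoid())
        self.s_dim = s_dim

    def encode(self, x):
        h, blocks = x, []
        for enc in self.encoders:
            z = enc(h)
            s, h = z[:, :self.s_dim], z[:, self.s_dim:]  # split z^i into (s^i, h^i)
            blocks.append(s)
        blocks.append(h)                                 # the last h is c^3
        return blocks                                    # [s^1, s^2, s^3, c^3]

    def forward(self, x):
        blocks = self.encode(x)
        mapped = [dec(b) for dec, b in zip(self.block_decoders, blocks)]
        return self.reconstruct(torch.cat(mapped, dim=1)), blocks
```

A reconstruction loss can be applied to the first output of `forward`, while the per-layer objectives of Sec. 3.2 attach to each `(s^i, h^i)` split.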

Figure 2: Different methods for constraining the information segmentation between $s^{i}$ and $z^{i}$: (a) unsupervised; (b) supervised.

We can then separate the whole problem into two sub-problems in the $i$-th layer, whose input is $h^{i-1}$ (where $h^{0}=x$):

  • (1) Information flow $h^{i-1}\rightarrow s^{i}\rightarrow y^{i}$: encode the upper layer's output $h^{i-1}$ into the representation $z^{i}$, with one part $s^{i}$ containing sufficient information about one feature factor $y^{i}$;

  • (2) Information separation of $s^{i}$ and $h^{i}$: eliminate the information about $s^{i}$ in $h^{i}$ while requiring $s^{i}$ to contain only the information of the label $y^{i}$.

The first sub-problem can be regarded as an IB problem: the goal is to learn a representation $s^{i}$ that is maximally expressive about the feature $y^{i}$ while minimally informative about the input $h^{i-1}$. It can be described as:

\min I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})   (5)

Satisfying the second sub-problem is more complex and requires different methods under different supervision conditions, so we introduce these conditions in detail below. In summary, our representation is designed to enhance the internal correlation of each block while reducing the relationships between blocks, to achieve the desired disentanglement goal.

3.2.1 Supervised BHiVAE

In the supervised case, we denote the input of the $i$-th layer as $h^{i-1}$ ($h^{0}=x$). Given the $i$-th layer label $y^{i}$, we require the representation part $s^{i}$ to predict the feature correctly while being as compressed as possible. So the objective in the $i$-th ($i=1,2,3$) layer can be described with an information measure as:

\mathcal{L}_{sup}^{class}(i) = I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})   (6)

We can get an upper bound of it:

\mathcal{L}_{sup}^{class}(i) = I(h^{i-1};s^{i}) - \beta I(s^{i};y^{i})
  \leq \mathbb{E}_{p(h^{i-1})}[D_{KL}(q_{\phi}(s^{i}|h^{i-1})\|p(s))] - \beta\,\mathbb{E}_{p(h^{i-1},y^{i})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(y^{i}|s^{i})]]
  \triangleq \mathcal{L}_{sup}^{class_{up}}(i)   (7)

So we need one more classifier $\mathcal{C}_{i}$, shown in Fig. 2(b), to predict $y^{i}$ from $s^{i}$.

For the second requirement, since $s^{i}$ is made fully informative about $y^{i}$ by the first sub-problem, the information about $y^{i}$ must be eliminated from $h^{i}$:

\mathcal{L}_{info}^{sup}(i) = I(h^{i};y^{i}) = H(y^{i}) - H(y^{i}|h^{i})   (8)

$H(y^{i})$ is a constant, so minimizing $\mathcal{L}_{info}^{sup}(i)$ is equivalent to minimizing:

\mathcal{L}_{info}^{sup_{e}}(i) = -H(y^{i}|h^{i})   (9)

This resembles a principle of maximum entropy: it simply requires that $h^{i}$ cannot predict the factor feature $y^{i}$ at all, i.e., the probability $h^{i}$ assigns to each category is $\frac{1}{n_{i}}$ (where $n_{i}$ denotes the number of categories of the $i$-th feature). And $h^{i}$ shares the classifier $\mathcal{C}_{i}$ with $s^{i}$, as Fig. 2(b) shows.

So in our supervised model, we can get the total objective as:

\min\Big\{\mathcal{L}^{sup} = \sum_{i=1}^{n}\big(\mathcal{L}_{sup}^{class}(i) + \gamma\,\mathcal{L}_{info}^{sup_{e}}(i)\big)\Big\}   (10)

where $\beta$ and $\gamma$ in the objective are hyper-parameters. The objective (10) satisfies the two requirements we need and handles the second sub-problem with a novel approach.
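A minimal sketch of the supervised per-layer terms, assuming both $s^{i}$ and $h^{i}$ are scored by the shared classifier $\mathcal{C}_{i}$ of Fig. 2(b) and that the KL part of Eq. (7) is handled elsewhere; the entropy estimate below is one straightforward way to implement Eq. (9), and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def supervised_layer_loss(s_logits, h_logits, labels, gamma=3.0):
    """Per-layer supervised terms: the classification part of Eq. (7) for s^i plus
    the maximum-entropy term of Eq. (9) for h^i.

    s_logits : logits of the shared classifier C_i evaluated on s^i, shape (B, n_i).
    h_logits : logits of the same classifier evaluated on h^i, shape (B, n_i).
    labels   : ground-truth categories of the i-th feature, shape (B,).
    """
    # s^i must predict the i-th feature correctly (cross-entropy = -E[log p(y^i | s^i)]).
    class_loss = F.cross_entropy(s_logits, labels)

    # h^i must not predict the feature: minimize -H(y^i | h^i), which pushes the
    # predicted distribution toward the uniform 1/n_i over the n_i categories.
    log_probs = F.log_softmax(h_logits, dim=1)
    info_loss = (log_probs.exp() * log_probs).sum(dim=1).mean()   # = -H(y^i | h^i)

    return class_loss + gamma * info_loss
```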

3.2.2 Unsupervised BHiVAE

In the unsupervised case, we know nothing about the data source, so we can only use reconstruction to constrain the representation. However, reconstruction alone is not enough for the disentanglement problem [21], so we use a unique prior distribution over the representation to guide the representation learning. All disentanglement models of the VAE family match the posterior distribution $q_{\phi}(z|x)$ to the standard normal prior $\mathcal{N}(0,I)$, and they obtain representations that are disentangled in each dimension because of the independence across the dimensions of $\mathcal{N}(0,I)$. To match our neural-block representation setting, we set the prior distribution $p(z)$ to $\mathcal{N}(0,\Sigma)$, where $\Sigma$ is a block-diagonal symmetric matrix. The dimension of each block corresponds to the segmentation of each hidden layer. In the unsupervised model, the target is reconstruction, so we can decompose Eq. (5) as:

\min\; I(h^{i-1};s^{i}) - \beta I(s^{i};x)
  \leq \mathbb{E}_{p(h^{i-1})}[D_{KL}(q_{\phi}(z^{i}|h^{i-1})\|p(z))]   (11)
  \quad - D_{KL}(q_{\phi}(z^{i})\|p(z))   (12)
  \quad - \beta\big[\mathbb{E}_{p(h^{i-1})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(x|s^{i})]]   (13)
  \quad - D_{KL}(q_{\phi}(z^{i-1})\|p_{D}(x))\big]   (14)

The first two terms constrain the capacity of the representation $z^{i}$, and the last two reinforce the reconstruction. VAE models use (11) and (13) to achieve this, while the adversarial autoencoder [22] uses the KL divergence (12) between the aggregated posterior $q_{\phi}(z^{i})$ and the prior $p(z)$ to constrain the capacity of the representation and obtain a better representation.

In our model, we also minimize the KL divergence between the aggregated posterior $q_{\phi}(z^{i})$ and the prior $\mathcal{N}(0,\Sigma)$, i.e., $D_{KL}(q_{\phi}(z^{i})\|\mathcal{N}(0,\Sigma))\rightarrow 0$. Since we choose a deterministic encoder, we obtain the objective:

\mathcal{L}_{recon}^{uns} = D_{KL}(q_{\phi}(z^{i})\|\mathcal{N}(0,\Sigma)) - \beta\,\mathbb{E}_{p(h^{i-1})}[\mathbb{E}_{q_{\phi}(s^{i}|h^{i-1})}[\log p_{\theta}(x|s^{i})]]   (15)

We use a discriminator, shown at the top of Fig. 2(a), to estimate and optimize $D_{KL}(p_{\phi}(h^{i})\|\mathcal{N}(0,\Sigma))$.

Unlike the supervised case, we adopt a different method to satisfy the information separation requirement. When $s^{i}$ and $h^{i}$ are probabilistically independent, the mutual information between them is zero, i.e., there is no shared information between $s^{i}$ and $h^{i}$. Here we apply an alternative form of mutual information, the Total Correlation (TC) penalty [14, 37], which is a popular measure of dependence for multiple random variables.

$KL(q(z)\|q(\tilde{z}))$, where $q(\tilde{z})=\prod_{j=1}^{d}q(z_{j})$, is the typical TC form; in our case we use the form $KL(p(z^{i})\|p(h^{i})p(s^{i}))=I(h^{i};s^{i})$. So we obtain the information separation objective:

\mathcal{L}_{info}^{uns}(i) = I(h^{i};s^{i})   (16)
  = KL(p(z^{i})\|p(h^{i})p(s^{i}))   (17)

In practice, the KL term is intractable to compute. The product of the marginal distributions $p(h^{i})p(s^{i})$ is not analytically computable, so we take a sampling approach to simulate it. After obtaining a batch of representations $\{z^{i}_{j}=(s^{i}_{j};h^{i}_{j})\}_{j=1}^{N}$ in the $i$-th layer, we randomly permute $\{s^{i}_{j}\}_{j=1}^{N}$ and $\{h^{i}_{j}\}_{j=1}^{N}$ across the batch to generate a sample batch under the distribution $p(h^{i})p(s^{i})$. However, directly estimating the density ratio $\frac{p(z^{i})}{p(h^{i})p(s^{i})}$ is often impossible. Thus, with these random samples, we apply the density-ratio trick [24, 31]: an additional classifier $D$ distinguishes between samples from the two distributions, shown at the bottom of Fig. 2(a):

\mathcal{L}_{info}^{uns}(i) = KL(p(z^{i})\|p(h^{i})p(s^{i})) = TC(z^{i}) = \mathbb{E}_{q(z)}\Big[\log\frac{p(z^{i})}{p(h^{i})p(s^{i})}\Big] \approx \mathbb{E}_{q(z)}\Big[\log\frac{D(z^{i})}{1-D(z^{i})}\Big]   (18)
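A sketch of this density-ratio estimate, assuming $z^{i}=(s^{i};h^{i})$ is stored row-wise in a batch: permuting $s^{i}$ and $h^{i}$ independently across the batch simulates samples from the product of marginals, and a discriminator trained to tell the two sets apart yields the log-ratio in Eq. (18), as in FactorVAE [14]. The function and module names are illustrative.

```python
import torch
import torch.nn.functional as F

def permute_blocks(s_i, h_i):
    """Shuffle s^i and h^i independently across the batch to simulate samples
    from the product of marginals p(h^i) p(s^i)."""
    return torch.cat([s_i[torch.randperm(s_i.size(0))],
                      h_i[torch.randperm(h_i.size(0))]], dim=1)

def tc_estimate(discriminator, s_i, h_i):
    """Eq. (18): TC(z^i) ~ E[log(D(z^i) / (1 - D(z^i)))], where D outputs the
    probability that a sample comes from the joint p(z^i) rather than the permuted batch."""
    logits = discriminator(torch.cat([s_i, h_i], dim=1))      # D(z) = sigmoid(logits)
    return (F.logsigmoid(logits) - F.logsigmoid(-logits)).mean()

def discriminator_loss(discriminator, s_i, h_i):
    """Binary cross-entropy for training D to separate joint samples from permuted ones."""
    joint = torch.cat([s_i, h_i], dim=1).detach()
    product = permute_blocks(s_i.detach(), h_i.detach())
    logits = torch.cat([discriminator(joint), discriminator(product)], dim=0)
    targets = torch.cat([torch.ones_like(logits[: len(joint)]),
                         torch.zeros_like(logits[len(joint):])], dim=0)
    return F.binary_cross_entropy_with_logits(logits, targets)
```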

In summary, the total objective in the unsupervised case is:

\min\Big\{\mathcal{L}^{uns} = \sum_{i=1}^{n}\big(\mathcal{L}_{recon}^{uns} + \gamma\,\mathcal{L}_{info}^{uns}(i)\big)\Big\}   (19)

4 Experiments

In this section, we present quantitative and qualitative experimental results. We also perform experiments comparing with $\beta$-VAE, FactorVAE, and $\beta$-TCVAE on several classic metrics. The datasets used in our experiments are:

MNIST [17]: handwritten digit ($28\times 28\times 1$) images with 60,000 training samples and 10,000 test samples;

dSprites [23]: 737,280 2D shape ($64\times 64\times 1$) images procedurally generated from 6 ground-truth independent latent factors: shape (heart, oval, and square), x-position (32 values), y-position (32 values), scale (6 values), and rotation (40 values);

CelebA (cropped version) [20]: 202,599 celebrity face ($64\times 64\times 3$) images with 5 landmark locations and 40 binary attribute annotations.

In the following, we perform several qualitative and quantitative experiments on these datasets and compare results in both the unsupervised and supervised cases. We demonstrate the ability of our model to disentangle in the unsupervised case, and we also show the representation learned in the supervised case.

Figure 3: Scatter distribution vs. prior distribution: scatter plots of the three layers' representations $\{s^{i}\}_{i=1}^{3}$: (a) layer 1 with KL = 0.61; (b) layer 2 with KL = 0.49; (c) layer 3 with KL = 0.11. Panel (c) visualizes the known category information with different colors.
Figure 4: Traversal images on MNIST for (a) $\beta$-VAE, (b) FactorVAE, (c) Guided-VAE, and (d) BHiVAE. In (a), (b), and (c), the images in the $i$-th row are generated by changing $z_{i}$ from -3 to 3; in (d), we change $\{s^{1},s^{2},s^{3},c^{3}\}$ from (-3,-3) to (3,3) and generate the images in each row.

4.1 Training Details

When training the BHiVAE model, we need the encoder and decoder (Fig. 1) in both the supervised and unsupervised cases. On the CelebA dataset, we build our network with both convolutional and fully connected layers. On the MNIST and dSprites datasets, the inputs are both $64\times 64$ binary images, so we design the network to consist entirely of fully connected layers.

To evaluate the experimental results, we use the Z-diff [12], SAP [16], and MIG [7] metrics to measure the quality of the disentangled representation, and we observe the images generated by traversing the representation. Moreover, we use pre-trained classifiers on attribute features to analyze the model according to the classification accuracy.

4.2 Unsupervised BHiVAE

In the unsupervised case, as introduced in the previous section, the most significant novelty is that we use a different prior, $\mathcal{N}(0,\Sigma)$, to guide the representation learning. Additionally, we need another discriminator to estimate the KL divergence in (18). Therefore, two extra discriminators are needed for BHiVAE, as in Fig. 2(a). Actually, because we aim for $D_{KL}(q_{\phi}(z^{i})\|p(z))=0$, the latent representations $\{z_{j}^{i}\}_{j=1}^{N}$ can be considered as generated from the true distribution, while the prior samples and the permuted representations $\{z^{i\text{-}perm}_{j}\}_{j=1}^{N}$ can both be considered as false. Therefore, we can simplify the network to contain only one discriminator that scores these three distributions.

We want to reinforce the relationship within $s^{i}$ to retain the information and to decrease the dependency between $s^{i}$ and $h^{i}$ to separate the information, so in our unsupervised experiments we use the prior $\mathcal{N}(0,\Sigma)$, where

\Sigma = \begin{bmatrix} 1 & 0.5 & 0 & \cdots & 0 \\ 0.5 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}
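A small NumPy sketch of how such a prior can be constructed and sampled, assuming a per-layer prior over $z^{i}=(s^{i};h^{i})$ in which the 2-dimensional $s^{i}$ block has within-block correlation 0.5 and the remaining dimensions are left independent, as in the matrix above; the dimensions follow the $d(s^{i})=2$, $d(c^{3})=8$ setting used below, and the construction is our illustrative reading of the prior.

```python
import numpy as np

def layer_prior_cov(s_dim=2, rest_dim=8, rho=0.5):
    """Covariance of the per-layer prior N(0, Sigma) for z^i = (s^i; h^i): the s^i block
    is internally correlated with coefficient rho, the remaining dimensions are independent."""
    sigma = np.eye(s_dim + rest_dim)
    block = np.full((s_dim, s_dim), rho)
    np.fill_diagonal(block, 1.0)
    sigma[:s_dim, :s_dim] = block
    return sigma

def sample_prior(n, sigma, seed=0):
    """Draw n prior samples, e.g. to feed the prior-matching discriminator of Fig. 2(a)."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n)

sigma = layer_prior_cov()
print(sigma[:3, :3])               # leading block [[1, .5], [.5, 1]], identity elsewhere
z_prior = sample_prior(1000, sigma)
```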

First, we use several experiments to demonstrate the feasibility and validity of this prior setting. We train the model on the dSprites dataset, setting the dimension of the representation $z$ to 14 ($d(z)=14$), with $d(s^{i})=2$, $i=1,2,3$, and $d(c^{3})=8$. We then obtain the representation of 1000 test images at each layer; the three subfigures in Fig. 3 show a scatter plot of each layer's representation, and the curves in these figures are the contours of the block target distribution $p(s)\sim\mathcal{N}\big(0,\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\big)$. Fig. 3 shows that in the first and second layers the distributions of $s^{1}$ and $s^{2}$ do not sufficiently match the prior $p(s)$, but as the layers go deeper, the KL divergence between $q_{\phi}(s^{i})$ and $p(s)$ keeps decreasing, and the scatter plot of $s^{i}$ fits the prior distribution more closely. In our model, the encoder is trained globally, so the front layer's representation learning can be influenced by changes in the deeper representations, which yields a larger KL divergence than in the next layer.

Even more interestingly, in Fig. 3(c) we find that in the third layer, when visualizing the 'Shape' attribute of the dSprites dataset, there is an apparent clustering effect (the different colors denote different categories). This result supports our hypothesis about the deep network's ability: the deeper the network, the more detailed the information it extracts. The representation also matches the prior almost perfectly. Fig. 3(c) further suggests a better traversal scheme. In previous works, because only one dimension represents an attribute, one can simply change the representation from $a$ to $b$ ($a$ and $b$ both constants). However, this does not fit our model, so the direction of the category transformation in Fig. 3(c) inspires us to traverse the data along the diagonal line ($y=x$). Our block prior $p(s)$ also supports this (because the prior distribution's major axis is the diagonal line too).
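A sketch of this diagonal traversal, assuming a trained decoder that maps a full representation $z=(s^{1};s^{2};s^{3};c^{3})$ back to an image; one chosen 2-dimensional block is moved along the line $y=x$ from $(-3,-3)$ to $(3,3)$ while all other blocks stay fixed. The decoder interface and names are illustrative.

```python
import torch

def diagonal_traversal(decode, z, block_slice, steps=8, lo=-3.0, hi=3.0):
    """Traverse one 2-D representation block along the diagonal y = x and decode
    an image at each step, keeping the other blocks fixed.

    decode      : function mapping a representation batch of shape (1, d_z) to images.
    z           : base representation of one sample, shape (1, d_z).
    block_slice : slice selecting the traversed block, e.g. slice(0, 2) for s^1.
    """
    images = []
    for t in torch.linspace(lo, hi, steps):
        z_t = z.clone()
        z_t[:, block_slice] = t          # sets both dimensions of the block to (t, t)
        images.append(decode(z_t))
    return torch.cat(images, dim=0)      # stacked traversal images, shape (steps, ...)

# Example: traverse s^2 (dimensions 2:4 of z) for one test sample:
# imgs = diagonal_traversal(trained_decoder, z_sample, slice(2, 4))
```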

We perform several experiments under the above architecture setting and traversal scheme to show the disentanglement quality on the MNIST dataset. The disentanglement results compared with $\beta$-VAE [12], FactorVAE [14], and Guided-VAE [10] are presented in Fig. 4. Here, considering the dimension of the representation and the number of parameters, the other models' bottleneck size is set to 12, i.e., $d(z)=12$. This setting helps reduce the impact of differences in complexity between model frameworks. However, for a better comparison, we only select the seven dimensions that change most regularly. In our model, we traverse the three block representations $\{s^{i}\}_{i=1}^{3}$, and the remaining representation $c^{3}$ is changed two dimensions at a time, i.e., $c^{3}=(c^{3}_{1:2},c^{3}_{3:4},c^{3}_{5:6},c^{3}_{7:8})$. Fig. 4 shows that $\beta$-VAE hardly obtains a worthwhile disentangled representation, while FactorVAE exhibits attribute changes as the representation varies. Moreover, Fig. 4(c) and Fig. 4(d) both show well-disentangled images: with $h_{1}$ changing in Guided-VAE and $s^{1}$ changing in BHiVAE, the handwriting gets thicker, and $h_{3}$ and $s^{2}$ control the angle of inclination. These results demonstrate the capability of our model.

Method                          Z-diff ↑   SAP ↑    MIG ↑
VAE [15]                        67.1       0.14     0.23
β-VAE [12] (β=6)                97.3       0.17     0.41
FactorVAE [14] (γ=7)            98.4       0.19     0.44
β-TCVAE [7] (α=1, β=8, γ=2)     96.5       0.41     0.49
Guided-VAE [10]                 99.2       0.4320   0.57
BHiVAE (Ours) (β=10, γ=3)       99.0       0.4312   0.61
Table 1: Disentanglement scores: Z-diff, SAP, and MIG scores on the dSprites dataset in the unsupervised case. Bold denotes the best result and blue the second-best result.

We then proceed to the traversal experiments on the dSprites dataset. This dataset has clear attribute distinctions, which allows us to better observe the disentangled representation. In these experiments, BHiVAE learns a 10-dimensional representation $z=(s^{1},s^{2},s^{3},c^{3}_{1:2},c^{3}_{3:4})$, and the other models learn an 8-dimensional $z=(z_{1},z_{2},\dots,z_{8})$. We present the reconstruction and traversal results in Fig. 5. The first and second rows of each subfigure show the original and reconstructed images, respectively. Fig. 5(d) shows that our first three variables $s^{1},s^{2},s^{3}$ have learned the attribute characteristics (Scale, Orientation, and Position) of the data.

Moreover, we perform two quantitative experiments comparing with previous works and present the results in Table 1 and Table 2. The experiments are all based on the same experimental setting as in Fig. 4.

Figure 5: Traversal images on dSprites for (a) $\beta$-VAE, (b) FactorVAE, (c) Guided-VAE, and (d) BHiVAE (ours). The images in the first and second rows of each subfigure are the original and reconstructed images, respectively; the other rows correspond to the traversal images.

First, we compare BHiVAE with previous models on the Z-diff score [12], SAP score [16], and MIG score [7], and present the results in Table 1. Our model is clearly near the top, and on the MIG metric it is better than the other popular models. A high Z-diff score indicates that the learned disentangled representation has less variance in the attributes of the generated data as the corresponding dimension changes, while SAP measures the degree of coupling between data factors and representations. The MIG metric uses mutual information to measure the correlation between the data factors and the learned disentangled representation; our work is modeled precisely from the perspective of mutual information, which is why we perform best on the MIG score.

Beyond that, we also perform transferability experiments by conducting classification tasks on the learned representation. Here we set the representation dimensions to be the same in all models. First, we pre-train a model to obtain the representation $z$ and a classifier to predict the MNIST image label from the representation. We compare the classification accuracy under different dimension settings in Table 2.

Method                          d_z=10 ↑         d_z=16 ↑         d_z=32 ↑
VAE [15]                        97.21% ± 0.42    96.62% ± 0.51    96.41% ± 0.22
β-VAE [12] (β=6)                94.32% ± 0.48    95.22% ± 0.36    94.78% ± 0.53
FactorVAE [14] (γ=7)            93.7% ± 0.07     94.62% ± 0.12    93.69% ± 0.26
β-TCVAE [7] (α=1, β=8, γ=2)     98.4% ± 0.04     98.6% ± 0.05     98.9% ± 0.11
Guided-VAE [10]                 98.2% ± 0.08     98.2% ± 0.07     98.40% ± 0.08
BHiVAE (Ours) (β=10, γ=3)       98.2% ± 0.09     98.7% ± 0.10     99.0% ± 0.05
Table 2: Accuracy of representation in the unsupervised case: bold denotes the best results and blue the second-best results.

Our model does not achieve higher accuracy than $\beta$-TCVAE and Guided-VAE in the case of $d_{z}=10$: the block representation setting limits the number of factors it can learn. However, as $d(z)$ increases, our representation can learn more attribute factors of the data, and the classification accuracy improves accordingly.

4.3 Supervised BHiVAE

In the supervised case, we again perform qualitative and quantitative experiments to evaluate our model. As in the unsupervised case, the overall autoencoder is required, and we additionally need a classifier to enforce the segmentation of information at each level, as shown in Fig. 2(b). We set the dimension of the representation $z$ to 12 ($d(z)=12$, $d(c^{3})=6$, and $d(s^{i})=2$, $i=1,2,3$).

Figure 6: Traversal results comparison on CelebA: the first column is the traversal change of Gender, and the second column is the change of Black Hair; the first row is from Guided-VAE [10], and the second row is ours, following the procedure of Guided-VAE.

We first perform several experiments comparing with Guided-VAE [10] on two attributes (Gender and Black Hair) and present the results in Fig. 6. When changing each attribute $s^{i}\in\{s^{1},s^{2},s^{3}\}$, we keep the other attribute representations and the content representation $c^{3}$ unchanged. We use the third-layer representation $s^{3}$ to control the Gender attribute, while the first two layers correspond to Black Hair and Bald, respectively. In the supervised case, compared to Guided-VAE, we use multiple dimensions to control an attribute, while Guided-VAE uses only one dimension, which may carry insufficient information to control the traversal results. Fig. 6 shows that our model has a broader range of control over attributes, especially reflected in the range of hair color from pure white to pure black.

Besides, our quantitative experiment first pre-trains the BHiVAE model and three attribute classifiers of the representation, then obtains the representations of the training set, traversing the three representation blocks from $(-3,-3)$ to $(3,3)$ along the diagonal ($y=x$). Fig. 7 shows that all three attributes have a transformation threshold in the corresponding representation blocks.

Figure 7: Classifier results used to determine whether the property is present. We traverse the Black Hair ($s^{1}$), Bald ($s^{2}$), and Gender ($s^{3}$) attributes.
Figure 8: Comparison of the accuracy of the Block and Single settings for the Black Hair attribute.

4.4 Block Nodes vs. Single Node

In the previous experiments, we only judged how well the representation is disentangled and did not show that the block setting itself is beneficial, so we set up the following comparison experiment for this question.

For this comparison, we set the dimension of the model representation $z$ to 10, 16, and 32. In the comparison model, we change the dimension of the representation $s^{1}$ (Black Hair) in the first layer to 1, so the dimension of $c^{3}$ changes to 5, 11, and 27 accordingly. We first pre-train the two models under the same conditions and learn a binary classifier that predicts the Black Hair attribute from the representation $z$. Fig. 8 shows that Block is better than Single in every dimension setting, and the accuracy of both increases with the representation dimension. A possible reason is that some information about Black Hair remains in other parts of the representation, and increasing the dimension allows more information about Black Hair to be preserved, yielding better prediction accuracy.

5 Conclusion and Future Work

We propose a new model, the blocked and hierarchical variational autoencoder, that considers and solves the disentanglement problem entirely from the perspective of information theory. We innovatively propose a blocked disentangled representation and a hierarchical architecture. Then, following the idea of information segmentation, we use different methods to guide the information transfer in the unsupervised and supervised cases. Outstanding performance in both image traversal and representation learning gives BHiVAE a wide field of application.

References

  • [1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.
  • [2] Alex Alemi, Ian Fischer, Josh Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
  • [3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [4] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [6] Yoshua Bengio and Yann Lecun. Scaling Learning Algorithms towards AI. MIT Press, 2007.
  • [7] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
  • [8] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
  • [9] Thomas M. Cover and Joy Thomas. Elements of Information Theory. Wiley, 1991.
  • [10] Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, and Zhuowen Tu. Guided variational autoencoder for disentanglement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7920–7929, 2020.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [13] Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR 2019. ICLR, April 2019.
  • [14] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. volume 80 of Proceedings of Machine Learning Research, pages 2649–2658, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • [15] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [16] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.
  • [17] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • [18] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8039–8048, 2020.
  • [19] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
  • [20] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [21] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124, 2019.
  • [22] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016.
  • [23] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
  • [24] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
  • [25] A. Oord, Y. Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018.
  • [26] Yuxiang Ren, Bo Liu, Chao Huang, Peng Dai, Liefeng Bo, and Jiawei Zhang. Heterogeneous deep graph infomax. AAAI, 2020.
  • [27] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Comput., 4(6):863–879, 1992.
  • [28] Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
  • [29] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • [30] Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Finegan: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6490–6499, 2019.
  • [31] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density-ratio matching under the bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
  • [32] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
  • [33] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
  • [34] Petar Velickovic, William Fedus, William L Hamilton, Pietro Lio, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In ICLR (Poster), 2019.
  • [35] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In ICLR. OpenReview.net, 2019.
  • [36] Slava Voloshynovskiy, Mouad Kondah, Shideh Rezaeifar, Olga Taran, Taras Holotyak, and Danilo Jimenez Rezende. Information bottleneck through variational glasses. arXiv preprint arXiv:1912.00830, 2019.
  • [37] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.