Disentangling semantics in language through VAEs and a certain architectural choice

Ghazi Felhi
LIPN
Université Sorbonne Paris Nord
Villetaneuse, France
[email protected]
Joseph Le Roux
LIPN
Université Sorbonne Paris Nord
Villetaneuse, France
[email protected]
Djamé Seddah
INRIA Paris
Paris, France
[email protected]

Abstract

We present an unsupervised method to obtain disentangled representations of sentences that single out semantic content. Using modified Transformers as building blocks, we train a Variational Autoencoder to translate the sentence to a fixed number of hierarchically structured latent variables. We study the influence of each latent variable in generation on the dependency structure of sentences, and on the predicate structure it yields when passed through an Open Information Extraction model. Our model could separate verbs, subjects, direct objects, and prepositional objects into latent variables we identified. We show that varying the corresponding latent variables results in varying these elements in sentences, and that swapping them between pairs of sentences leads to the expected partial semantic swap.

1 Introduction

Deep learning has brought about a powerful framework to extract information from the real world through universal approximators. Consequently, a wide range of techniques have been introduced to project complex high dimensional observations such as text or images into low dimensional spaces. These low dimensional projections most often yield desirable properties such as linear separability with regard to certain high level attributes, or semantically meaningful algebraic operations (Mikolov et al., 2013, Bojanowski et al., 2017). Among these properties, disentanglement has received a lot of attention in recent studies.
Transparency is of great importance in the deployment of machine intelligence. In that sense, obtaining neural representations with clearly identified chunks of information is sought as a gateway to fine-grained explanation and/or controllable generation in deep learning. Interestingly, Variational Auto-Encoders (VAEs) seem to naturally disentangle neural information and have successfully been applied to this problem in numerous works (Li et al., 2020b, John et al., 2020, Chen et al., 2019). This effect has been studied in depth and explained in Rolinek et al. (2019). It appears that the use of diagonal Gaussians in the VAEs’ approximate posteriors, which was originally aimed at minimizing computational costs, pushes latent variables to have independent dimensions, which leads to the observed disentanglement.
In this work, we aim to use this property of VAEs, together with attention based language decoders, to encode sentences into $N$ latent variables with similar computational roles. These latent variables will be decoded into a sentence using co-attention, as if they represented tokens from the source language in machine translation.
We first relate our work to the current Natural Language Processing (NLP) landscape in section 2. In section 3, we describe our generative model and explain the motivation behind our design choices. Then in section 4, we construct an objective that proved effective in dealing with posterior collapse for our model. Finally (section 5), we conduct a series of experiments exhibiting quantitative and qualitative evidence that establishes our model’s disentanglement capabilities.
Our contribution sums up as follows: we describe an architecture and an objective that are capable of singling out semantics in a sentence without using labeled data or linguistic cues. To the best of our knowledge, we are the first to explore this research direction, and among the first few to tackle unsupervised disentanglement in NLP. As this is a first step, we will use the plain text from the SNLI dataset, as was done in Schmidt et al. (2020), to work on homogeneous, low complexity sentences. Consequently, as opposed to mainstream language modeling studies, we will disregard long range dependencies.

2 Related Works

Parallels between linguistics and neural architectures

Such parallels are valuable as they enable better inductive bias in machine learning systems. RNNG (Dyer et al., 2016) and ON-LSTM (Shen et al., 2019) are examples, among others (Zhang et al., 2020, Du et al., 2020), of successful attempts at inducing linguistic structure in a neural language model. A plethora of post hoc works (Hu et al., 2020, Kodner and Gupta, 2020, Marvin and Linzen, 2020, Kulmizev et al., 2020) have also investigated the linguistic capabilities of neural language models, the types of linguistic annotations that emerge best in them, and syntactic error analyses. The transformer based model BERT (Devlin et al., 2019) has, in turn, been subject to studies showing that it operates on sentences as would a classical NLP pipeline (Tenney et al., 2020), and that its attention heads perform impressively well at dependency parsing (Clark et al., 2019), inter alia.

Disentanglement in NLP

As discussed by Burgess et al. (2018), disentanglement is not only important for improving interpretability by representing high-level abstract concepts, but can also improve transfer. In contrast to the image processing field, attempts at disentanglement in NLP were mainly supervised. The main line of work revolves around multitask training schemes aimed at separating concepts in neural representations (e.g. style vs content (John et al., 2020), syntax vs semantics (Bao et al., 2020, Chen et al., 2019)). A close attempt was that of Cheng et al. (2020), which successfully disentangles content from style using only style supervision.

Open Information Extraction

Open Information Extraction (OpenIE) is the task of extracting, from a sentence, a list of predicates coupled with their arguments. The resulting tuples are handy, as they bypass complex parse trees towards a relationship-centered structure. The task can be accomplished using supervised learning on labeled samples (Stanovsky et al., 2018), as well as earlier carefully crafted syntactic and lexical constraints (Roy et al., 2020).

3 Model

3.1 Graphical Model

As a sentence’s structure can be modeled as a tree (a dependency tree), we will make use of a hierarchy of latent variables in our model. The inference and generation graphical models are depicted in figures 1 and 2 respectively.

Figure 1: Inference Graphical Model

$z_{1}$ , $z_{2}$ , and $z_{3}$ are sets of respectively $n_{1}$ , $n_{2}$ , and $n_{3}$ independent multivariate diagonal Gaussian latent variables of size $z_{size}$ . A fixed standard normal distribution $p$ is set as a prior for $z_{1}$ in the generative model. Consequently, the generative model decomposes into $p_{\theta}(x,z_{1},z_{2},z_{3})=p(z_{1})p_{\theta}(z_{2}|z_{1})p_{\theta}(z_{3}|z_{1},z_{2})p_{\theta}(x|z_{1},z_{2},z_{3})$ , and the inference model decomposes into $p_{\phi}(x,z_{1},z_{2},z_{3})=p_{data}(x)p_{\phi}(z_{3}|x)p_{\phi}(z_{2}|z_{3},x)p_{\phi}(z_{1}|z_{2},z_{3},x)$ , where $p_{data}$ is the true data distribution.
The neural components modelling the conditional distributions above are described in the upcoming sections.
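As a rough illustration of the generative factorization above, the following sketch performs ancestral sampling through the three levels of latent variables. The conditional means used here (plain averages of the parent values), the unit variances, and the value `z_size=4` are placeholders assumed for the example; in the model these conditionals are produced by the neural components described in the following sections.

```python
import random

def sample_level(mean, n_vars, z_size, rng):
    # Draw n_vars independent Gaussian vectors of size z_size with unit variance.
    return [[rng.gauss(mean, 1.0) for _ in range(z_size)] for _ in range(n_vars)]

def level_mean(*levels):
    # Placeholder for a learned conditional: the average of all parent values.
    values = [v for level in levels for var in level for v in var]
    return sum(values) / len(values)

def ancestral_sample(n1=16, n2=16, n3=16, z_size=4, seed=0):
    rng = random.Random(seed)
    z1 = sample_level(0.0, n1, z_size, rng)                 # z1 ~ p(z1), standard normal
    z2 = sample_level(level_mean(z1), n2, z_size, rng)      # z2 ~ p(z2 | z1)
    z3 = sample_level(level_mean(z1, z2), n3, z_size, rng)  # z3 ~ p(z3 | z1, z2)
    return z1, z2, z3  # x would then be decoded from (z1, z2, z3)
```

The sampling order (first $z_{1}$ , then $z_{2}$ , then $z_{3}$ ) is the only part of this sketch that is taken from the factorization itself.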

3.2 Encoder

Constructing $p_{\phi}(z_{3}|x)$ :

The model differs from classical VAE encoders in that it will encode a sentence into $n_{3}$ latent variables, where $n_{3}$ is a fixed integer (regardless of the sentence length).
Our choice was to use a transformer (Vaswani et al., 2017). More specifically, we use the transformer encoder-decoder architecture that is mostly used for machine translation. Contrary to an encoder-only transformer, this architecture allows for obtaining a number of output elements that is different from that of the input sequence (as is needed for translation). It has been established that transformers can store sentence-level statistics in artificially introduced tokens (e.g. SEP in Devlin et al. (2019)). In a similar manner, we feed a fixed set of $n_{3}$ learnable vectors to the decoder in place of its targets. The transformer encoder-decoder’s architecture and the decoding process are detailed in figure 3, where the “Previous Latent variable value” placeholder (light blue) is empty. We apply $n_{3}$ distinct linear transformations (resp. $n_{3}$ MLPs with softplus activations) to obtain the means (resp. the standard deviations) of the posterior distribution of $z_{3}$ .
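The last step of this paragraph, mapping each of the $n_{3}$ decoder outputs to the parameters of a diagonal Gaussian, can be sketched as follows. This is a minimal pure-Python illustration under several assumptions: the weight matrices are hypothetical, a single linear layer followed by a softplus stands in for each MLP, and the reparameterized sampling at the end is standard VAE practice rather than something stated in the text.

```python
import math
import random

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def matvec(weights, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def posterior_params(outputs, mean_weights, std_weights):
    # outputs[i] is the decoder output for latent variable i;
    # each latent variable has its own (hypothetical) weight matrices.
    means = [matvec(w, h) for w, h in zip(mean_weights, outputs)]
    stds = [[softplus(s) for s in matvec(w, h)] for w, h in zip(std_weights, outputs)]
    return means, stds

def sample(means, stds, rng):
    # Reparameterization (assumed): z = mu + sigma * eps with eps ~ N(0, 1).
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sd)]
            for mu, sd in zip(means, stds)]
```

Using one set of weights per latent variable, rather than a shared projection, mirrors the "distinct linear transformations" mentioned above.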

Constructing $p_{\phi}(z_{2}|z_{3},x)$ and $p_{\phi}(z_{1}|z_{2},z_{3},x)$ :

Similarly to the way we obtain $z_{3}$ , we use a Transformer encoder-decoder architecture. The latent variables that we condition on are introduced here by concatenating them to the input sentence after positional-encoding and Transformer-encoding, as depicted in figure 3. These latent variables are viewed as additional elements of the sentence with no specific positioning.

Figure 3: The Encoder

3.3 Decoder

Constructing $p_{\theta}(z_{2}|z_{1})$ and $p_{\theta}(z_{3}|z_{1},z_{2})$ :

As explained in section 3.1, we use a learnable structured prior $p_{\theta}(z_{1},z_{2},z_{3})=p(z_{1})p_{\theta}(z_{2}|z_{1})p_{\theta}(z_{3}|z_{1},z_{2})$ . To obtain the parameters of $z_{2}$ and $z_{3}$ in the generative model, we use the same architecture we used in the encoder without inputting text (i.e. we use the model in figure 3 while dropping the green part).

Latent Variable Identifier

Note that, given our training procedure (cf. ELBo in section 4), all latent variables in $z_{1}$ are constrained to follow the same prior. In the generation step, we will be sampling a set of similarly distributed random variables with no means for the decoding network to distinguish between them. As our objective is to encode $n_{1}$ different types of information in $z_{1}$ , and to have the decoder identify and leverage this information, we concatenate the vector corresponding to the value of each latent variable in $z_{1}$ to a latent-variable-specific trainable vector before having it decoded. The same is done for $z_{2}$ and $z_{3}$ , even though their trainable priors enable better distinguishability.
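The identifier mechanism described above amounts to a per-variable concatenation, sketched below. The function name and the toy vectors are illustrative assumptions; in the model, the identifier vectors are trainable parameters.

```python
def add_identifiers(latent_values, identifiers):
    # Concatenate each latent variable's value with its own identifier vector.
    if len(latent_values) != len(identifiers):
        raise ValueError("one identifier per latent variable is required")
    return [list(z) + list(e) for z, e in zip(latent_values, identifiers)]

# Two latent variables that happen to take the same value remain
# distinguishable for the decoder once their identifiers are attached.
values = [[0.3, -0.1], [0.3, -0.1]]
identifiers = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for trainable vectors
decoder_inputs = add_identifiers(values, identifiers)
```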

Sequence decoder $p_{\theta}(x|z_{1},z_{2},z_{3})$

Sequence to sequence models (Seq2seq; Sutskever et al., 2014) that do not use attention have consistently been found to be lacking in comparison to those that do. As a side effect of our architectural choices, we are able to use attention based decoders and thus benefit from their higher expressiveness. In the same spirit as that of the previous section, we use sequence transduction components that were originally designed for machine translation to simultaneously translate and align.
We chose to use here the same transformer encoder-decoder architecture used in the encoding stage, but with different inputs. It closely follows machine translation in this step by receiving the latent variable values as source inputs, and the previously generated tokens as target inputs. Contrary to what is done in the sequence encoder, the transformer applied to targets uses an attention mask that forces the currently generated word to depend only on previous words. A latent variable identifier coupled with a transformer decoder is depicted in figure 4.
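The attention mask mentioned above can be illustrated with a small sketch. It shows a generic causal mask and a softmax restricted to the allowed positions; the function names are chosen for this example, and here each position is allowed to attend to itself and to earlier positions, which is the usual convention once target inputs are shifted by one token. Attention from the targets to the latent variables (the source side) is not masked in this scheme.

```python
import math

def causal_mask(length):
    # mask[i][j] is True when target position i may attend to position j (j <= i).
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    # Softmax over the allowed positions only; masked positions get weight 0.
    top = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - top) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])  # second target position
```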

4 Optimization

Preliminary experiments have revealed that this model was subject to severe posterior collapse. Using $\operatorname*{KL}$ -annealing Bowman et al. (2016), or its combination with $\operatorname*{KL}$ -thresholding Li et al. (2020a) was not effective in yielding adequate results. As $\operatorname*{KL}$ -thresholding forces all the latent dimensions to stay at least $\gamma$ bits away from the prior ( $\gamma$ being the threshold), it may create artificial redundancy in the latent variables, which counteracts disentanglement.
In the following, we will describe a procedure that turned out to bring satisfactory generation results while keeping this generation dependent on the latent variables.
The original objective of VAEs is the Evidence Lower Bound (ELBo):

$$\log p_{\theta}(x)\geq\mathbb{E}_{(z_{1},z_{2},z_{3})\sim q_{\phi}(z_{1},z_{2},z_{3}|x)}\left[\log p_{\theta}(x|z_{1},z_{2},z_{3})\right]-\operatorname*{KL}[q_{\phi}(z_{1},z_{2},z_{3}|x)||p_{\theta}(z_{1},z_{2},z_{3})]\tag{1}$$

where $\operatorname*{KL}[.]$ is the Kullback-Leibler divergence. The first term of the right hand side is the reconstruction term. The second term represents the information we get about our latent variables from the observation $x$ . “Posterior collapse” happens when this term collapses to zero (i.e. when $x$ brings no more information on $z_{1}$ , $z_{2}$ , and $z_{3}$ than what was described by the prior). The upcoming alternative objective aims at keeping this term at a multiple of the reconstruction’s value, while spreading the information gain from the observation across the 3 levels of latent variables:

$$
\begin{aligned}
&\max\Big(\mathbb{E}_{(z_{1},z_{2},z_{3})\sim q_{\phi}(z_{1},z_{2},z_{3}|x)}\left[\log p_{\theta}(x|z_{1},z_{2},z_{3})\right],\;-\alpha\beta{\operatorname*{KL}}_{max}\Big)\\
&\text{s.t.}\quad{\operatorname*{KL}}_{max}=\max\Big(\\
&\qquad\mathbb{E}_{(z_{2},z_{3})\sim q_{\phi}(z_{2},z_{3}|x)}\left[\operatorname*{KL}[q_{\phi}(z_{1}|z_{2},z_{3},x)||p(z_{1})]\right],\\
&\qquad\mathbb{E}_{(z_{1},z_{3})\sim q_{\phi}(z_{1},z_{3}|x)}\left[\operatorname*{KL}[q_{\phi}(z_{2}|z_{3},x)||p_{\theta}(z_{2}|z_{1})]\right],\\
&\qquad\mathbb{E}_{(z_{1},z_{2})\sim q_{\phi}(z_{1},z_{2}|x)}\left[\operatorname*{KL}[q_{\phi}(z_{3}|x)||p_{\theta}(z_{3}|z_{1},z_{2})]\right]\Big)
\end{aligned}
\tag{2}
$$

The global max ensures that we are minimizing the selected Kullback-Leibler divergence up to $\frac{1}{\alpha\beta}$ times the reconstruction loss so far. The values of $\alpha$ and $\beta$ will be discussed in section 5.1. In contrast to $\operatorname*{KL}$ -thresholding, this objective thresholds each latent variable layer as a whole, and uses a mobile threshold that is linear in the reconstruction loss of the example at hand. A lower value of $\alpha\beta$ allows for a better perplexity at the cost of a lower $\operatorname*{KL}$ -divergence (more posterior collapse), while a higher value guarantees more informative posteriors at the cost of a higher perplexity (empirically leading to semantically inconsistent sentences). As for $\operatorname*{KL}_{max}$ , it ensures that we are optimizing the hierarchy level that strays most from the prior for each example. In fact, when using structured generative models, the first layer tends to absorb all the mutual information with observations while the subsequent layers are hardly informative about the observation. This behavior was demonstrated and studied in depth by Zhao et al. (2017), and confirmed by our preliminary experiments.
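To make the quantities in this objective concrete, the sketch below computes the closed-form Kullback-Leibler divergence between two diagonal Gaussians (a standard result, expressed here in nats) and then evaluates the expression of Equation 2 literally for a single example. The helper names and the numeric values of $\alpha$ , $\beta$ , the reconstruction log-likelihood, and the per-level divergences are hypothetical; the sketch does not cover how the expression is optimized.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # KL[N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2))] in nats,
    # summed over dimensions.
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def equation_2(rec_log_likelihood, kl_per_level, alpha, beta):
    # Literal evaluation of Equation 2 for one example (hypothetical helper).
    kl_max = max(kl_per_level)  # the level whose posterior strays most from its prior
    return max(rec_log_likelihood, -alpha * beta * kl_max)

# Hypothetical values: log-likelihood of -20 and KL terms of 1, 4 and 2 for the three levels.
value = equation_2(-20.0, [1.0, 4.0, 2.0], alpha=2.0, beta=1.0)
```

With these made-up numbers, only the second level (the largest divergence) enters the expression, which reflects the role of $\operatorname*{KL}_{max}$ described above.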

5 Experiments

5.1 Setup

As previously mentioned, our training set consists of low complexity text extracted from the SNLI dataset by Schmidt et al. (2020). The sentences are on average $8.92\pm 2.66$ tokens long. We use 90K samples as a training set, and 10K samples as a test set.
We found it best for disentanglement to train the model with more latent variables than it needs, instead of fixing the number of latent variables to the expected number of disentangled concepts. This observation is not surprising, as it is well known that overparametrized neural networks have higher chances of containing well initialized subnetworks (Frankle and Carbin, 2018). $n_{1}$ , $n_{2}$ , and $n_{3}$ are therefore fixed to 16 each. Training details are in Appendix A. The code for training our model and performing the evaluations below is publicly available at https://github.com/ghazi-f/Disentanglement_Transformer.

5.2 Evaluation Protocol

We analyze our models qualitatively as well as quantitatively. Our quantitative analysis partly relies on the OpenIE system of Stanovsky et al. (2018), for which AllenNLP provides an online live demo (https://demo.allennlp.org/open-information-extraction). We obtain the necessary statistics with the following process:
We sample 100 sentences from the model. Then, for each sentence, we resample each of the 48 latent variables 10 times, one at a time, and generate the resulting new sentence. This results in 48K (original sentence, modified sentence) pairs. After parsing all the sentences using spaCy (Honnibal and Montani, 2017), and obtaining their first predicate structure using Stanovsky et al. (2018) (the predicates that follow the first OpenIE predicate correspond to subordinate clauses), we calculate the following between the original and the modified sentences:

1. ROOT-DEP-APPEAR: the set of non-common dependency labels in the children of ROOT.
2. DEP-APPEAR: the set of non-common dependency labels over the whole sentence.
3. OIE-APPEAR: the set of non-common OpenIE labels over the whole sentence.
4. DEP-ALTER: if both sentences have the same length, we extract the list of dependency labels for which the text spans have changed.
5. OIE-ALTER: if both sentences have the same first predicate structure, we extract the list of predicate arguments for which the text spans have changed.

This information is used to calculate statistics about the influence of each latent variable in the model on the generated sentence. *-APPEAR variables (resp. *-ALTER variables) are used to analyze the influence of latent variables on the structure (resp. content) of sentences.
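The statistics above can be sketched on pre-parsed sentences as follows. The dependency labels in the example are written by hand for illustration and are not actual parser output. The function names, the token-by-token comparison used for DEP-ALTER, and the choice of reading the label from the original sentence are assumptions of this sketch rather than details given in the protocol.

```python
def appear(labels_a, labels_b):
    # *-APPEAR: labels present in exactly one of the two sentences.
    return set(labels_a) ^ set(labels_b)

def dep_alter(tokens_a, labels_a, tokens_b):
    # DEP-ALTER is only defined when both sentences have the same length.
    if len(tokens_a) != len(tokens_b):
        return None
    # Assumption: compare token by token and report the original sentence's label.
    return [label for ta, tb, label in zip(tokens_a, tokens_b, labels_a) if ta != tb]

# Number of (original, modified) pairs produced by the protocol: 48,000.
N_SENTENCES, N_RESAMPLES, N_LATENTS = 100, 10, 48
N_PAIRS = N_SENTENCES * N_RESAMPLES * N_LATENTS

original = "a girl is holding a ball".split()
original_labels = ["det", "nsubj", "aux", "ROOT", "det", "dobj"]  # hand-written labels
modified = "a girl is holding a baby".split()
changed = dep_alter(original, original_labels, modified)
```

In this example only the direct object differs, so the altered label list contains a single `dobj` entry.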

5.3 Quantitative results

From a dependency structure perspective

ROOT-DEP-APPEAR, DEP-APPEAR, and DEP-ALTER are lists of dependency labels. We found it interesting to look at the latent variables that cause each of the dependency labels to appear in each of the 3 lists.
Influencing ROOT-DEP-APPEAR means having the corresponding dependency label appear/disappear from the ROOT children. Influencing DEP-APPEAR means having the corresponding dependency label appear/disappear from the whole dependency tree. Influencing DEP-ALTER means having the text behind a certain dependency label change in a static length sentence.
We report the latent variables with the highest influence on each dependency label for these three statistics in figure 5. A first look at the results shows that a major part of the variability in the generated sentences is expressed by latent variable (LV) 10. Figure 5 shows that it is responsible for the content of the ROOT node, which explains how its influence propagates to the major part of the sentence. In terms of content, LV 30 seems to influence nominal subjects (passive or active) and auxiliaries (possibly for conjugation); in terms of structure, it influences numeral modifiers and expletives. These cues point to LV 30 being responsible for subject related information. Other highly influential LVs are 35 and 43. LV 35 is responsible for the appearance of conjunctions, as well as the content of direct objects. LV 43 controls the content in prepositional objects, and is structurally related to the appearance of compounds, adverbial clause modifiers, and markers. Consequently, we expect these last two LVs to control most of the information past verbs and subjects.

From an OpenIE perspective

We plot the influence of each latent variable on the appearance of OpenIE arguments (OIE-APPEAR) and on their content (OIE-ALTER) as heat maps. OIE-APPEAR showed no evidence of the presence of variables that control structure while disregarding content. Therefore the heat map for OIE-APPEAR is reported in Appendix B, while the heat map for OIE-ALTER is in figure 6.
As was expected from the dependency parsing analysis, LV 10 is the most influential on the verb. It can also be seen that LV 30 has the highest influence on ARG0 (i.e. the subject). Along with the information from the dependency analysis, figure 6 further stresses the roles of LV 35 and LV 43. LV 35 seems to specialize in the direct object (ARG1), while LV 43 partly describes the direct object as well as secondary arguments (ARG2 often corresponds to prepositional objects), and contextual arguments (ARGM-DIR, ARGM-LOC, and ARGM-MNR correspond to direction, location, and manner). One should notice that LV 10 is in the root level of latent variables ( $10<16$ ), LV 30 in the middle level ( $16<30<32$ ) and LVs 35 and 43 in the leaf level ( $32<35<43<48$ ). The disentangled information is consequently also arranged as dictated by a linguistic dependency structure. This further confirms the ability of machine learning models to align with our conception of linguistic structure.
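The mapping from a latent variable index to its level in the hierarchy, as used in the inequalities above, can be written as a one-line computation. This sketch assumes 0-based indices laid out level by level (first $z_{1}$ , then $z_{2}$ , then $z_{3}$ ) with 16 variables per level; the indexing convention is an assumption, and it does not affect the four variables discussed here.

```python
def level_of(lv_index, per_level=16):
    # Assumes 0-based indices: 0-15 root (z1), 16-31 middle (z2), 32-47 leaf (z3).
    names = ("root", "middle", "leaf")
    return names[lv_index // per_level]

levels = {lv: level_of(lv) for lv in (10, 30, 35, 43)}
```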

5.4 Qualitative results

Here we will exhibit some samples where the latent variables were varied in different manners. We will take a special interest in LVs 10, 30, 35, and 43 as these have shown potential for interpretability. Table 1 shows an example where we altered a latent variable for some sentences.
In a second experiment, we swapped the value of certain latent variables between two sentences. The results are in table 2.
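The swap operation itself is simple to state on the latent codes, as in the sketch below. It assumes each sentence's latent code is stored as a flat list of 48 values and leaves decoding aside. Whether the variables of dependent levels are re-sampled after a swap is not specified here, so this sketch leaves them untouched.

```python
def swap_latent(code_a, code_b, index):
    # Return copies of both latent codes with the variable at `index` exchanged.
    new_a, new_b = list(code_a), list(code_b)
    new_a[index], new_b[index] = code_b[index], code_a[index]
    return new_a, new_b

# Toy codes: strings stand in for latent vectors.
code_1 = [f"s1_lv{i}" for i in range(48)]
code_2 = [f"s2_lv{i}" for i in range(48)]
swapped_1, swapped_2 = swap_latent(code_1, code_2, 30)
```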

LV 10

As was pointed out by the quantitative analyses, this LV is the most influential (overall) on the sentence. But it seems to specialize, to a certain extent, in specifying the verb. In table 1, we can see that varying LV 10 keeps the same subject for the sentence, but varies the verb and the object (which is highly dependent on the verb). As LV 10 is a root level variable, changing it results in an incompatibility with the variables of the subsequent levels, and a radical change in the sentence is observed. The fact that verbs can’t be changed independently from their subject is also observed in table 2. In fact, swapping LV 10 clearly results in unexpected changes.

Original sentence	ALV	Sample 1	Sample 2	Sample 3
a girl is holding a ball	10	a girl is riding a bike down the sidewalk	a girl is making a toy	a girl is sitting at a table
a child is running in the park	10	a child is in a store	a child is standing on a bench	a child is painting a marathon
a man is looking at something	10	a man is playing with his dog	a man is cooking in a park	a man is in a dress
a girl is holding a ball	30	a man is holding a ball	a group of people are sitting down	a kid is holding a ball
a child is running in the park	30	two men are running in the street	a boy is running in the park	a man is looking at a large boat .
a man and a woman are sitting in a race	30	a man is laying on a bench	a child is laying on a bench	a kid is laying on a bench
a girl is holding a ball	35	a girl is holding a bicycle	a girl is holding a baseball	a girl is holding a baby
two girls are wearing a pink and pink shirt	35	two girls are wearing a hat and talking about to get to get	two girls are wearing a red and pink shirt	two girls are wearing a green and pink shirt
a group of people are standing in a park	35	a group of people are dancing in a park	a group of people are in a parade	a group of people are walking in a park
two girls are wearing a pink and pink shirt	43	two girls are wearing a pink uniform	two girls are wearing a pink dress	two girls are wearing a pink hat
a man is sitting in a chair	43	a man is sitting in a $<?>$	a man is sitting in a park	man is sitting in a house
a group of people are sitting around a table	43	a group of people are sitting at a table	a group of people are sitting in a restaurant	a group of people are sitting on a beach

Table 1: Varying the value of a specific latent variable for a sentence. ALV is the Altered LV.

Sentence 1	Sentence 2	SLV	Swapped Sentence 1	Swapped Sentence 2
two people are outside at a store	a man is wearing a helmet	30	a man is outside with a red shirt	two people are wearing a helmet
a child is jumping in the snow	a boy is running through the snow	30	a boy is jumping in the snow	a child is running through the snow
a boy is using a phone	a man is walking on a sidewalk	30	a man is using a phone	a boy is walking on a sidewalk
a young boy is playing with a ball	a little girl jumps on a bike	10	a man is singing in a park	a little girl is playing in a park
a person is looking at a boat	a young boy is playing with a ball	10	a person is playing with a ball	a man is riding a horse
a little girl jumps on a bike	a man is taking a nap	10	a little girl is running in the water	a man is standing on a bench
a person is riding a bicycle on a beach	a man is riding a bike	35	a person is riding a bike on a beach	a man is riding a bicycle
a couple of people are playing with a little girl	a snowboarder is playing with a white mountain	43	a couple of people are playing in a city	a snowboarder is playing with a child
a boy is holding a ball	a man is holding a sign	35	a boy is holding a sign	a man is holding a ball

Table 2: Swapping the value of a specific latent variable between two sentences. SLV is the swapped LV.

LV 30

Despite some negative examples (table 1, 5th row, 5th column), tables 1 and 2 clearly demonstrate that LV 30 contains the information on the subject. We can see, nevertheless, that changing it results in co-adaptation of the rest of the sentence, such as the conjugation of a verb (table 1, 5th row, 3rd column). A surprising observation can be made in table 1, 6th row: a change of subject from plural to singular resulted in the same co-adaptation of the verb on 3 examples. It is unclear whether “sitting” has been reinterpreted as “laying”, or the latent code stores the action for a group in a different area than the action for a single individual.

LVs 43 and 35

These two LVs most often encode low level information (i.e. leaf information in the dependency parsing sense). To generate table 2, we had to try both LVs and see which contained the information for the verb at hand. Another constraint for the results in table 2 to be coherent was for the sentences to feature the same verb. As these LVs are leaf LVs, it is only natural that they can only be swapped between sentences where they remain a high probability sample when conditioned on the root LVs. As can be seen in table 1 (rows 8 and 10), when varying these LVs for the same sentence, they change different aspects of the object. We could also confirm that LV 35 most often controls direct objects (table 1 lines 7 and 8, and table 2 lines 7 and 9), while LV 43 holds the information on prepositional objects (table 1 lines 11 and 12, and table 2 line 8). LV 35 also seems to control some intransitive verbs (table 1 line 9). Given that most sentences are in the past, these verbs may be perceived by the model as objects to the auxiliary.

Encoder-Decoder Discrepancy

An inherent shortcoming of VAEs is the fact that their objective (ELBo) is only a lower bound to the exact marginal log-likelihood of the data. In fact, the positive term quantifying the gap between the ELBo and $\log p_{\theta}(x)$ is $\operatorname*{KL}[q_{\phi}(z|x)||p_{\theta}(z|x)]$ , i.e. the divergence between the approximate posterior and the true posterior. This difference results in a discrepancy between encoder and decoder. We study this discrepancy in the light of the attention values between input tokens and latent variables. Example heat maps of these attention values for our latent variables of interest (10, 30, 35, and 43) are provided in Appendix C. The most striking observation that can be made is that our latent variables attend to positions where the information they need is expected to appear, with little reliance on the tokens present in the attended positions. In fact, LV 10, LV 30, LV 43, and LV 35 almost exclusively attend to positions $\left[3-5\right]$ , $\left[1-3\right]$ , $\left[5-6\right]$ , and $\left[6-7\right]$ respectively. Fortunately, the latent variables in the generator do not seem to only influence the same predefined positions, as is illustrated by the examples in tables 1 and 2 (e.g. LV 43 influences tokens at position 9 in the last example of table 1). A secondary observation that can be made is that LV 10 and LV 30 attend a lot to latent variables from previous layers. This establishes that the encoder is actively using its latent variable structure (i.e. it successfully learned a structured posterior).
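The gap between the ELBo and the marginal log-likelihood can be checked numerically on a toy model with a single binary latent variable. All probabilities below are made up for the illustration. The sketch verifies the standard identity stating that the log-evidence equals the ELBo plus the divergence between the approximate posterior and the true posterior.

```python
import math

def elbo_gap(prior, likelihood, q):
    # prior[z] = p(z), likelihood[z] = p(x | z) for one fixed x, q[z] = q(z | x).
    joint = [p * l for p, l in zip(prior, likelihood)]   # p(x, z)
    evidence = sum(joint)                                # p(x)
    posterior = [j / evidence for j in joint]            # true posterior p(z | x)
    elbo = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
    gap = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))  # KL[q(z|x) || p(z|x)]
    return math.log(evidence), elbo, gap

log_px, elbo, gap = elbo_gap(prior=[0.6, 0.4], likelihood=[0.2, 0.7], q=[0.5, 0.5])
```

The gap vanishes only when the approximate posterior matches the true posterior, which is why an imperfect encoder and the decoder can disagree.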

6 Discussion & Conclusion

Our model’s capabilities differ from unsupervised OpenIE in that it factors information that aligns with predicate arguments instead of extracting text spans that correspond to these arguments. This is demonstrated by the fact that some words have information from more than one of our disentangled latent variables (direct objects are defined both with information from LV 10 and LV 35). It is also noteworthy that our model is limited with regard to two aspects. The first is that we could not discover structure related disentangled information in its latent variables. In that regard, future iterations may be able to obtain improvements through a fine grained use of self attention (separate latent variables for keys, queries, and values), or through non-sequential generation. The second limitation is posterior collapse, which was handled to a certain extent by our modified ELBo. By calibrating $\alpha\beta$ we could compromise between low perplexity and high $\operatorname*{KL}$ -divergence, but a great proportion of the generative model’s descriptive capacity still resides in $p_{\theta}(x|z_{1},z_{2},z_{3})$ instead of $p_{\theta}(z_{1},z_{2},z_{3})$ . In fact, contrary to our expectations, sequential sampling from $p_{\theta}(x|z_{1},z_{2},z_{3})$ with fixed $(z_{1},z_{2},z_{3})$ (as opposed to greedy sampling) did not yield paraphrases with fixed semantics. Our model is therefore expected to greatly improve with future strides in dealing with posterior collapse. To the best of our knowledge, our model is the first to induce a form of disentanglement that separates semantics in a sentence. Through our analysis, we could highlight 4 latent variables with distinguishable semantic content. Moreover, this model is also the first to accomplish disentanglement on language without any form of supervision (e.g. labels, linguistic cues, etc.).
Hence, its inductive bias could serve as a basis to derive semi-supervised models, weakly supervised models or other forms of data efficient machine learning models. Data efficiency is central to contemporary NLP as annotated data is getting more expensive with the explosive rise of User Generated Content and the concomitant annotation difficulties (Seddah et al., 2020). A promising research direction and a natural extension of our work would be to explore the results of applying our method to word level representations (disentangling morphological phenomena) and to document level representations (disentangling rhetorical structure).

Acknowledgments

We want to thank Antoine Simoulin for his proofreading and valuable comments. This work is supported by the PARSITI project grant (ANR-16-CE33-0021) given by the French National Research Agency (ANR), the Laboratoire d’excellence “Empirical Foundations of Linguistics” (ANR-10-LABX-0083), as well as the ONTORULE project. It was also granted access to the HPC resources of IDRIS under the allocation 20XX-AD011012112 made by GENCI.

References

Bao et al. (2020) Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2020. Generating sentences from disentangled syntactic and semantic spaces. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 6008–6019.
Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.
Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. CoNLL 2016 - 20th SIGNLL Conference on Computational Natural Language Learning, Proceedings, pages 10–21.
Burgess et al. (2018) Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. 2018. Understanding disentangling in β-VAE. arXiv, (Nips).
Chen et al. (2019) Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. 2019. A multi-task approach for disentangling syntax and semantics in sentence representations. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1:2453–2464.
Cheng et al. (2020) Pengyu Cheng, Martin Renqiang Min, Dinghan Shen, Christopher Malon, Yizhe Zhang, Yitong Li, and Lawrence Carin. 2020. Improving Disentangled Text Representation Learning with Information-Theoretic Guidance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7530–7541.
Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT’s Attention.
Devlin et al. (2019) Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1:4171–4186.
Du et al. (2020) Wenyu Du, Zhouhan Lin, Yikang Shen, Timothy J. O’Donnell, Yoshua Bengio, and Yue Zhang. 2020. Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6611–6628.
Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016 - Proceedings of the Conference, pages 199–209.
Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR).
Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Hu et al. (2020) Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P. Levy. 2020. A Systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744.
John et al. (2020) Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. 2020. Disentangled representation learning for non-parallel text style transfer. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 424–434.
Kodner and Gupta (2020) Jordan Kodner and Nitish Gupta. 2020. Overestimation of syntactic representation in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1757–1762.
Kulmizev et al. (2020) Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091.
Li et al. (2020a) Bohan Li, Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick, and Yiming Yang. 2020a. A surprisingly effective fix for deep latent variable modeling of text. In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 3603–3614.
Li et al. (2020b) Zhiyuan Li, Jaideep Vitthal Murkute, Prashnna Kumar Gyawali, and Linwei Wang. 2020b. Progressive Learning and Disentanglement of Hierarchical Representations. Proceedings of the International Conference on Learning Representations.
Marvin and Linzen (2020) Rebecca Marvin and Tal Linzen. 2020. Targeted syntactic evaluation of language models. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pages 1192–1202.
Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, pages 1–12.
Rolinek et al. (2019) Michal Rolinek, Dominik Zietlow, and Georg Martius. 2019. Variational autoencoders pursue pca directions (by accident). Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:12398–12407.
Roy et al. (2020) Arpita Roy, Youngja Park, Taesung Lee, and Shimei Pan. 2020. Supervising unsupervised open information extraction models. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 728–737.
Schmidt et al. (2020) Florian Schmidt, Stephan Mandt, and Thomas Hofmann. 2020. Autoregressive text generation beyond feedback loops. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, (2003):3400–3406.
Seddah et al. (2020) Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, and Abhishek Srivastava. 2020. Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150. Association for Computational Linguistics.
Shen et al. (2019) Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In 7th International Conference on Learning Representations, ICLR 2019, pages 1–14.
Stanovsky et al. (2018) Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1(Section 4):885–895.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391.
Tenney et al. (2020) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2020. BERT rediscovers the classical NLP pipeline. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4593–4601.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. NeurIPS, (Nips).
Zhang et al. (2020) Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, and Lawrence Carin. 2020. Syntax-infused variational autoencoder for text generation. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 2069–2078.
Zhao et al. (2017) Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017. Learning hierarchical features from deep generative models. 34th International Conference on Machine Learning, ICML 2017, 8:6195–6204.

Appendix A Training details

We use 48-dimensional Transformers with 4 attention heads, 2 layers for each of the encoder modules, and 3 layers for each of the decoder modules. The model is warmed up by training it in pure reconstruction ( $\alpha=0$ ) for 3000 steps, then annealing the $\operatorname*{KL}$ -divergence (linearly raising $\alpha$ to 1) over 3000 steps. $\beta$ is initialized at 6, then decreased by 1 each time the perplexity (evaluated every 3 epochs) stops decreasing. Training is halted when $\beta$ reaches 3 and the perplexity stops decreasing. This setup was reached through a manual search for the hyperparameters that best exhibited the behavior we sought. During evaluation, we decode sentences greedily, conditioned on latent variables sampled beforehand.
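The warm-up, annealing, and $\beta$ decay described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual training code: the class name `KLSchedule`, its method names, and the criterion used for "the perplexity stops decreasing" (no improvement over the best value seen so far) are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class KLSchedule:
    """Hypothetical helper tracking alpha (KL annealing) and beta."""

    warmup_steps: int = 3000   # pure reconstruction, alpha = 0
    anneal_steps: int = 3000   # alpha raised linearly from 0 to 1
    beta: float = 6.0          # initial value of beta
    beta_min: float = 3.0      # halt once beta is here and perplexity stalls
    best_ppl: float = float("inf")
    done: bool = False

    def alpha(self, step: int) -> float:
        """Return alpha for a given training step."""
        if step < self.warmup_steps:
            return 0.0
        progress = (step - self.warmup_steps) / self.anneal_steps
        return min(1.0, progress)

    def on_eval(self, perplexity: float) -> None:
        """Call after each perplexity evaluation (every 3 epochs)."""
        # Assumption: "stops decreasing" means no new best perplexity.
        if perplexity < self.best_ppl:
            self.best_ppl = perplexity
            return
        if self.beta > self.beta_min:
            self.beta -= 1.0   # perplexity stalled: lower beta by 1
        else:
            self.done = True   # beta already at 3 and perplexity stalled


schedule = KLSchedule()
print(schedule.alpha(4500))  # halfway through the annealing phase
```

A training loop would query `schedule.alpha(step)` at every step, call `schedule.on_eval(ppl)` after each evaluation, and stop once `schedule.done` is set.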

Appendix B OIE-APPEAR

The heat map for OIE-APPEAR is plotted in figure 7.

Appendix C Encoder Attention

We provide the attention heat maps illustrating our qualitative attention analysis in figures 8 and 9.

Figure 8: First 4 encoder attention examples. The y-axis labels are to be read “<latent variable index>_<encoder layer>”. The lighter the color of the box, the higher the attention value. The last column <latent> is the summation of the attention values between the indicated latent variable and the latent variables from the previous latent variable layer (cf. figure 2 for the encoder latent variable structure).

Figure 9: Second 4 encoder attention examples, generated in the same way as those of figure 8.