
Deep Clustering of Text Representations for Supervision-free Probing of Syntax

Vikram Gupta,1 Haoyue Shi,2 Kevin Gimpel,2 Mrinmaya Sachan3
Abstract

We explore deep clustering of text representations for unsupervised model interpretation and induction of syntax. As these representations are high-dimensional, out-of-the-box methods like KMeans do not work well. Thus, our approach jointly transforms the representations into a lower-dimensional, cluster-friendly space and clusters them. We consider two notions of syntax in this work: part-of-speech induction (POSI) and constituency labelling (CoLab). Interestingly, we find that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English; possibly even as much as English BERT (E-BERT). Our model can be used as a supervision-free probe, which is arguably a less biased way of probing. We find that unsupervised probes show benefits from higher layers as compared to supervised probes. We further note that our unsupervised probe utilizes E-BERT and mBERT representations differently, especially for POSI. We validate the efficacy of our probe by demonstrating its capabilities as an unsupervised syntax induction technique. Our probe works well for both syntactic formalisms by simply adapting the input representations. We report competitive performance of our probe on 45-tag English POSI, state-of-the-art performance on 12-tag POSI across 10 languages, and competitive results on CoLab. We also perform zero-shot syntax induction on resource-impoverished languages and report strong results.

1 Introduction

Contextualized text representations (Peters et al. 2018a; Devlin et al. 2019) have been used in many supervised NLP problems such as part-of-speech (POS) tagging (Tsai et al. 2019), syntactic parsing (Kitaev and Klein 2018; Zhou and Zhao 2019; Mrini et al. 2019), and coreference resolution (Lee, He, and Zettlemoyer 2018; Joshi et al. 2019; Wu et al. 2020), often leading to significant improvements. Recent works have shown that these representations encode linguistic information including POS (Belinkov et al. 2017), morphology (Peters et al. 2018a), and syntactic structure (Linzen, Dupoux, and Goldberg 2016; Peters et al. 2018b; Tenney, Das, and Pavlick 2019; Hewitt and Manning 2019).

While there has been a lot of focus on using contextualized representations in supervised settings, either for solving NLP problems or for interpreting these representations, their efficacy for unsupervised learning is not well explored. Some recent work, such as DIORA (Drozdov et al. 2019b, a), has explored specialized methods for unsupervised discovery and representation of constituents using ELMo (Peters et al. 2018a); Jin et al. (2019) used ELMo with a normalizing flow model, while Cao, Kitaev, and Klein (2020) used RoBERTa (Liu et al. 2019b) for unsupervised constituency parsing. Most of the recent work on “probing” contextual representations has focused on building supervised classifiers and using their accuracy to interpret these representations. This has led to a debate, as it is not clear whether the supervised probe is probing the model or trying to solve the task (Hewitt and Manning 2019; Pimentel et al. 2020).

Thus, in this work, we explore a new clustering-based approach to probe contextualized text representations. Our probe allows for studying text representations with relatively few task-specific transformations due to the absence of supervision. Thus, our approach is arguably a less biased way to discover linguistic structure than supervised probes (Hewitt and Manning 2019; Pimentel et al. 2020; Zhou and Srikumar 2021). We focus on two syntactic formalisms, part-of-speech induction (POSI) and constituency labelling (CoLab), and explore the efficacy of contextualized representations for encoding syntax in an unsupervised manner. We investigate the following research questions: Do contextualized representations encode enough information for unsupervised syntax induction? How do they perform on POSI, which has traditionally been solved using small context windows and morphology, and on span-based CoLab?

For both formalisms, we find that naively clustering text representations does not perform well. We speculate that this is because contextualized text representations are high-dimensional and not very amenable to existing clustering approaches. Thus, we develop a deep clustering approach (Xie, Girshick, and Farhadi 2016; Ghasedi Dizaji et al. 2017; Jiang et al. 2016; Chang et al. 2017; Yang, Parikh, and Batra 2016; Yang et al. 2017) which transforms these representations into a lower-dimensional, clustering-friendly latent space. This transformation is learned jointly with the clustering using a combination of reconstruction and clustering objectives. The procedure iteratively refines the transformation and the clustering using an auxiliary target distribution derived from the current soft clustering. As this process is repeated, it gradually improves the transformed representations as well as the clustering. We show a t-SNE visualization of mBERT embeddings and embeddings learned by our deep clustering probe (SyntDEC) in Figure 1.

Figure 1: t-SNE visualization of (a) mBERT embeddings (clustered using KMeans) and (b) SyntDEC (our probe) embeddings of tokens from the Penn Treebank. Colors correspond to ground-truth POS tags.

We further explore architectural variations such as pretrained subword embeddings from fastText (Joulin et al. 2017), a continuous bag of words (CBoW) loss (Mikolov et al. 2013), and span representations (Toshniwal et al. 2020) to incorporate task-dependent information into the latent space, and observe significant improvements. It is important to note that we do not claim that clustering contextualized representations is the optimal approach for POSI, as representations with short context (Lin et al. 2015; He, Neubig, and Berg-Kirkpatrick 2018) and word-based POSI (Yatbaz, Sert, and Yuret 2012) have shown the best results. Our approach explores the potential of contextualized representations for unsupervised induction of syntax and acts as an unsupervised probe for interpreting these representations. Nevertheless, we report competitive many-to-one (M1) accuracies for POSI on the 45-tag Penn Treebank WSJ dataset compared to specialized state-of-the-art approaches in the literature (He, Neubig, and Berg-Kirkpatrick 2018) and improve upon the state of the art on the 12-tag universal treebank dataset across multiple languages (Stratos, Collins, and Hsu 2016; Stratos 2019). We further show that our approach can be used in a zero-shot crosslingual setting where a model trained on one language can be used for evaluation in another language. We observe impressive crosslingual POSI performance, showcasing the representational power of mBERT, especially when the languages are related. Our method also achieves competitive results on CoLab on the WSJ test set, outperforming the initial DIORA approach (Drozdov et al. 2019b) and performing comparably to recent DIORA variants (Drozdov et al. 2019a) which incorporate more complex methods such as latent chart parsing and discrete representation learning. In contrast to specialized state-of-the-art methods for syntax induction, our framework is more general, as it demonstrates good performance for both CoLab and POSI by simply adapting the input representations.

We further investigate the effectiveness of multilingual BERT (mBERT) (Devlin et al. 2019) for POSI across multiple languages and for CoLab in English, and see improved performance from using mBERT for both tasks, even in English. This is in contrast with the supervised experiments, where mBERT and E-BERT perform competitively with each other. In contrast to various supervised probes in the literature (Liu et al. 2019a; Tenney, Das, and Pavlick 2019), our unsupervised probe finds that syntactic information is captured in higher layers, on average, than previously reported (Tenney, Das, and Pavlick 2019). Upon further layer-wise analysis of the two probes, we find that while supervised probes suggest that all layers of E-BERT contain syntactic information fairly uniformly, middle layers lead to better performance on the investigated syntactic tasks with our unsupervised probe.

Figure 2: An illustration of the POSI and CoLab formalisms.

2 Problem Definition

We consider two syntax induction problems in this work:

  1. Part-of-speech induction (POSI): determining the part of speech of each word in a sentence.

  2. Constituency label induction (CoLab): determining the constituency label for a given constituent (a span of contiguous tokens). Note that constituents need not be contiguous in general, but we only consider contiguous constituents for simplicity.

Figure 2 shows an illustration of the two tasks. In order to do well, both tasks require reasoning about the context. This motivates us to use contextualized representations, which have shown an ability to model such information effectively. Letting $[m]$ denote $\{1,2,\ldots,m\}$, we model unsupervised syntax induction as the task of learning a mapping function $C:X\longrightarrow[m]$. For POSI, $X$ is the set of word tokens in the corpus and $m$ is the number of part-of-speech tags ($X$ is distinct from the corpus vocabulary; in POSI, we tag each word token in each sentence with a POS tag). For CoLab, $X$ is the set of constituents across all sentences in the corpus and $m$ is the number of constituent labels. For each element $x\in X$, let $c(x)$ denote the context of $x$ in the sentence containing $x$. The number $m$ of true clusters is assumed to be known. For CoLab, we also assume gold constituent spans from manually annotated constituency parse trees, focusing only on determining constituent labels, following Drozdov et al. (2019a).

3 Proposed Method

We address unsupervised syntax induction via clustering, where $C$ defines a clustering of $X$ into $m$ clusters. We define a deep embedded clustering framework and modify it to support common NLP objectives such as continuous bag of words (Mikolov et al. 2013). Our framework jointly transforms the text representations into a lower-dimensional space and learns the clustering parameters in an end-to-end setup.

Deep Clustering

Unlike traditional clustering approaches that work with fixed, and often hand-designed, features, deep clustering (Xie, Girshick, and Farhadi 2016; Ghasedi Dizaji et al. 2017; Jiang et al. 2016; Chang et al. 2017; Yang, Parikh, and Batra 2016; Yang et al. 2017) transforms the data $X$ into a latent feature space $Z$ with a mapping function $f_{\theta}:X\longrightarrow Z$, where $\theta$ are learnable parameters. The dimensionality of $Z$ is typically much smaller than that of $X$. The data points are clustered by simultaneously learning a clustering $\tilde{C}:Z\rightarrow[m]$. While $C$ might be hard to learn directly (due to the high dimensionality of $X$), learning $\tilde{C}$ may be easier.

Figure 3: An illustration of our SyntDEC model.

Deep Embedded Clustering: We draw on a particular deep clustering approach: Deep Embedded Clustering (DEC; Xie, Girshick, and Farhadi 2016). Our approach consists of two stages: (a) a pretraining stage, and (b) a joint representation learning and clustering stage. In the pretraining stage, a mapping function $f_{\theta}$ is pretrained using a stacked autoencoder (SAE). The SAE learns to reconstruct $X$ through the bottleneck $Z$, i.e., $X\xrightarrow{\mathit{encoder}}Z\xrightarrow{\mathit{decoder}}X^{\prime}$. We use the mean squared error (MSE) as the reconstruction loss:

\mathcal{L}_{\mathit{rec}}=||X-X^{\prime}||^{2}=\sum_{x\in X}||x-x^{\prime}||^{2}

The encoder parameters are used to initialize the mapping function $f_{\theta}$.
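To make the pretraining stage concrete, below is a minimal PyTorch sketch of such an autoencoder with a single-layer encoder mapping 768-dimensional inputs to a 75-dimensional latent space (the sizes used in our experiments; see Appendix D). The activation function and the plain training loop are illustrative assumptions, and the greedy layer-wise pretraining described in Section 4 is omitted for brevity.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Autoencoder used to pretrain the SyntDEC encoder f_theta (sketch)."""

    def __init__(self, input_dim: int = 768, latent_dim: int = 75):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # latent code z = f_theta(x)
        return self.decoder(z), z         # reconstruction x' and code z


def pretrain(model, embeddings, epochs=50, lr=0.1, batch_size=64):
    """Minimize the MSE reconstruction loss L_rec; the trained encoder
    weights then initialize the mapping function f_theta."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in torch.split(embeddings, batch_size):
            x_rec, _ = model(x)
            loss = mse(x_rec, x)          # ||x - x'||^2 (averaged per batch)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```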

In the joint representation learning and clustering stage, we finetune the encoder $f_{\theta}$ trained in the pretraining stage to minimize a clustering loss $\mathcal{L}_{\textit{KL}}$. The goal of this step is to learn a latent space that is amenable to clustering. We learn a set of $m$ cluster centers $\{\mu_{i}\in Z\}_{i=1}^{m}$ in the latent space $Z$ and alternate between computing an auxiliary target distribution and minimizing the Kullback-Leibler (KL) divergence. First, a soft cluster assignment is computed for each embedded point. Then, the mapping function $f_{\theta}$ is refined along with the cluster centers by learning from the assignments using an auxiliary target distribution. This process is repeated. The soft assignment is computed via the Student's $t$-distribution. The probability of assigning data point $i$ to cluster $j$ is denoted $q_{ij}$ and defined as:

q_{ij}=\frac{(1+||z_{i}-\mu_{j}||^{2}/\nu)^{-\frac{\nu+1}{2}}}{\sum_{j^{\prime}}(1+||z_{i}-\mu_{j^{\prime}}||^{2}/\nu)^{-\frac{\nu+1}{2}}}

where $\nu$ is set to 1 in all experiments. Then, a cluster assignment hardening loss (Xie, Girshick, and Farhadi 2016) is used to make these soft assignment probabilities more peaked. This is done by letting the cluster assignment probability distribution $q$ approach a more peaked auxiliary (target) distribution $p$:

p_{ij}=\frac{q_{ij}^{2}/n_{j}}{\sum_{j^{\prime}}q_{ij^{\prime}}^{2}/n_{j^{\prime}}}\qquad n_{j}=\sum_{i}q_{ij}

By squaring the original distribution and then normalizing it, the auxiliary distribution $p$ forces assignments to have more peaked probabilities. This aims to improve cluster purity, to emphasize data points assigned with high confidence, and to prevent large clusters from distorting the latent space. The divergence between the two probability distributions is the Kullback-Leibler divergence:

\mathcal{L}_{\textit{KL}}=\sum_{i}\sum_{j}p_{ij}\log\frac{p_{ij}}{q_{ij}}

The representation learning and clustering model is learned end-to-end.
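The core quantities of this stage take only a few lines to implement; below is a minimal PyTorch sketch of the soft assignment $q_{ij}$, the auxiliary target $p_{ij}$, and the KL loss (the function names are ours).

```python
import torch

def soft_assignments(z, centers, nu=1.0):
    """Student's t soft assignment q_ij of latent points z to cluster centers."""
    dist_sq = torch.cdist(z, centers).pow(2)          # ||z_i - mu_j||^2
    q = (1.0 + dist_sq / nu).pow(-(nu + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)             # normalize over clusters

def target_distribution(q):
    """Auxiliary target p_ij: square q and normalize by soft cluster size n_j."""
    weight = q.pow(2) / q.sum(dim=0, keepdim=True)    # q_ij^2 / n_j
    return weight / weight.sum(dim=1, keepdim=True)

def kl_clustering_loss(q, p):
    """Cluster assignment hardening loss: KL(p || q)."""
    return (p * (p / q).log()).sum()
```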

SyntDEC: DEC for Syntax Induction

We further modify DEC for syntax induction:

a) CBoW autoencoders: While DEC uses a conventional autoencoder, i.e., the input and output are the same, we modify it to support the continuous bag of words (CBoW) objective (Mikolov et al. 2013). This encourages the low-dimensional representations to focus on context words, which are expected to be helpful for POSI. In particular, given a set of tokens $c(x)$ that defines the context for an element $x\in X$, CBoW combines the distributed representations of the tokens in $c(x)$ to predict the element $x$ in the middle. See Appendix A for an illustration.
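As an illustration of how such (context, target) pairs can be built, the sketch below uses a context window of one token on each side (the width we use; see Appendix D) and concatenates the context embeddings as the autoencoder input, with the token's own embedding as the reconstruction target. Zero-padding at sentence boundaries is our simplifying assumption.

```python
import torch

def cbow_pairs(sentence_embeddings, window=1):
    """Build (context, target) pairs for the CBoW variant of SyntDEC.

    sentence_embeddings: tensor of shape (n_tokens, dim) for one sentence.
    Returns concatenated context embeddings (inputs) and the corresponding
    token embeddings (reconstruction targets).
    """
    n, d = sentence_embeddings.shape
    pad = torch.zeros(window, d)
    padded = torch.cat([pad, sentence_embeddings, pad], dim=0)
    contexts, targets = [], []
    for i in range(n):
        left = padded[i : i + window]                        # tokens before i
        right = padded[i + window + 1 : i + 2 * window + 1]  # tokens after i
        contexts.append(torch.cat([left, right], dim=0).reshape(-1))
        targets.append(sentence_embeddings[i])
    return torch.stack(contexts), torch.stack(targets)
```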

b) Finetuning with reconstruction loss: We found that in the clustering stage, finetuning with respect to the KL divergence loss alone easily leads to trivial solutions where all points map to the same cluster. To address this, we add the reconstruction loss as a regularization term. This is in agreement with subsequent work in deep clustering (Yang et al. 2017). Instead of solely minimizing $\mathcal{L}_{\textit{KL}}$, we minimize

\mathcal{L}_{\textit{total}}=\mathcal{L}_{\textit{KL}}+\lambda\mathcal{L}_{\textit{rec}} \qquad (1)

in the clustering stage, where $\lambda$ is a hyperparameter denoting the weight of the reconstruction loss.
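Putting the two losses together, one optimization step of the clustering stage might look like the sketch below; it reuses the functions from the previous sketches, treats the target distribution as fixed within the step (how often $p$ is recomputed is an implementation choice), and assumes the cluster centers are a learnable tensor registered with the optimizer. The weight $\lambda=5$ matches the value reported in Appendix D.

```python
import torch
import torch.nn.functional as F

def syntdec_step(model, centers, x, optimizer, lam=5.0, nu=1.0):
    """One SyntDEC update on a batch x: L_total = L_KL + lambda * L_rec."""
    x_rec, z = model(x)                    # autoencoder forward pass
    q = soft_assignments(z, centers, nu)   # from the sketch above
    p = target_distribution(q).detach()    # target held constant this step
    loss = kl_clustering_loss(q, p) + lam * F.mse_loss(x_rec, x)
    optimizer.zero_grad()
    loss.backward()                        # updates encoder, decoder, centers
    optimizer.step()
    return loss.item()
```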

c) Contextualized representations: We represent linguistic elements $x$ by embeddings extracted from pretrained networks like BERT (Devlin et al. 2019), SpanBERT (Joshi et al. 2020), and multilingual BERT (Devlin et al. 2019). All of these networks are multi-layer architectures, so we average the embeddings across the layers. We experimented with different layer combinations but found the average to be the best choice for these tasks. We average the embeddings of the subword units to compute word embeddings (in preliminary experiments, we also tried other pooling mechanisms such as min/max pooling over subwords, but averaging performed best). For CoLab, we represent spans by concatenating the representations of the end points (Toshniwal et al. 2020).
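For illustration, feature extraction along these lines can be done with the HuggingFace transformers library as sketched below (one sentence at a time, no batching); this is a simplified sketch of the procedure rather than our exact extraction code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def word_embeddings(words, model_name="bert-base-multilingual-cased"):
    """Average over all layers and over subword pieces to get word vectors."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states            # tuple of (1, T, 768)
    layer_avg = torch.stack(hidden).mean(dim=0)[0]     # average across layers
    word_ids = enc.word_ids()                          # subword -> word index
    vecs = []
    for w in range(len(words)):
        piece_idx = [i for i, wid in enumerate(word_ids) if wid == w]
        vecs.append(layer_avg[piece_idx].mean(dim=0))  # average subword pieces
    return torch.stack(vecs)

def span_embedding(word_vecs, start, end):
    """CoLab span representation: concatenate the endpoint word embeddings."""
    return torch.cat([word_vecs[start], word_vecs[end]])
```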

d) Task-specific representations: Previous work in unsupervised syntax induction has shown the value of task-specific features. In particular, a number of morphological features based on prefixes and suffixes, as well as spelling cues like capitalization, have been used in unsupervised POSI work (Tseng, Jurafsky, and Manning 2005; Stratos 2019; Yatbaz, Sert, and Yuret 2012). In our POSI experiments, we incorporate such morphological features using word representations from fastText (Joulin et al. 2017): we use fastText embeddings of the final character trigram of each word, concatenated with the contextualized representations, as input.
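A hedged sketch of this feature construction with the official fastText Python bindings is given below; querying the pretrained model with the final character trigram of each word is our simplified reading of this feature, and the model path is a placeholder.

```python
import numpy as np
import fasttext  # official fastText Python bindings

def morph_features(words, ft_model_path="cc.en.300.bin"):
    """300-d fastText vectors of each word's trailing character trigram,
    to be concatenated with the contextualized embeddings (sketch)."""
    ft = fasttext.load_model(ft_model_path)
    feats = []
    for w in words:
        trigram = w[-3:] if len(w) >= 3 else w
        feats.append(ft.get_word_vector(trigram))
    return np.stack(feats)
```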

4 Experimental Details

Datasets: We evaluate our approach for POSI on two datasets: the 45-tag Penn Treebank Wall Street Journal (WSJ) dataset (Marcus, Santorini, and Marcinkiewicz 1993) and multilingual 12-tag datasets drawn from the universal dependencies project (Nivre et al. 2016). The WSJ dataset has approximately one million words tagged with 45 part-of-speech tags. For the multilingual experiments, we use the 12-tag universal treebank v2.0 dataset, which consists of corpora from 10 languages (we use v2.0 in order to compare to Stratos, 2019). The words in this dataset have been tagged with 12 universal POS tags (McDonald et al. 2013). For CoLab, we follow the existing benchmark (Drozdov et al. 2019a) and evaluate on the WSJ test set. For POSI, as per standard practice (Stratos 2019), we use the complete dataset (train + val + test) for training as well as evaluation. However, for CoLab, we use the train set to train our model and the test set for reporting results, following Drozdov et al. (2019a).

Evaluation Metrics: For POSI, we use the standard measures of many-to-one (M1; Johnson 2007) accuracy and V-Measure (Rosenberg and Hirschberg 2007). For CoLab, we use the F1 score, following Drozdov et al. (2019a), ignoring spans that consist of only a single word and spans with the “TOP” label. In addition to F1, we also report M1 accuracy for CoLab to show the clustering performance more naturally and intuitively.
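For reference, M1 accuracy maps each induced cluster to its most frequent gold tag and scores the resulting tagging; a minimal sketch is below (V-Measure is taken directly from scikit-learn).

```python
from collections import Counter
from sklearn.metrics import v_measure_score

def many_to_one_accuracy(gold, pred):
    """M1: assign each induced cluster its most frequent gold tag."""
    correct = 0
    for cluster in set(pred):
        tags_in_cluster = [g for g, p in zip(gold, pred) if p == cluster]
        correct += Counter(tags_in_cluster).most_common(1)[0][1]
    return correct / len(gold)

# Example with a toy tagging:
# many_to_one_accuracy(["NN", "NN", "VB"], [3, 3, 7])  -> 1.0
# v_measure_score(["NN", "NN", "VB"], [3, 3, 7])       -> 1.0
```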

Training Details: Similar to Xie, Girshick, and Farhadi (2016), we use greedy layerwise pretraining (Bengio et al. 2007) for initialization. New hidden layers are successively added to the autoencoder, and each layer is trained to denoise the output of the previous layer. After layerwise pretraining, we train the autoencoder end-to-end and take the trained encoder as the SyntDEC encoder (Section 3). KMeans is used to initialize the cluster means and assignments. SyntDEC is then trained end-to-end with the reconstruction and clustering losses. More details are in the appendix.
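A sketch of the KMeans initialization over the pretrained encoder's latent space (assuming an `encoder` callable that maps inputs to $Z$):

```python
import torch
from sklearn.cluster import KMeans

def init_cluster_centers(encoder, embeddings, n_clusters):
    """Run KMeans on the latent codes of the pretrained encoder and use the
    resulting centroids to initialize the SyntDEC cluster centers."""
    with torch.no_grad():
        z = encoder(embeddings).cpu().numpy()
    km = KMeans(n_clusters=n_clusters, n_init=20).fit(z)
    centers = torch.nn.Parameter(
        torch.tensor(km.cluster_centers_, dtype=torch.float32))
    return centers, km.labels_
```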

5 Part of Speech Induction (POSI)

Method M1 VM
SyntDEC_Morph 79.5 (±0.9) 73.9 (±0.7)
SyntDEC 77.6 (±1.5) 72.5 (±0.9)
SAE 75.3 (±1.4) 69.9 (±0.9)
KMeans 72.4 (±2.9) -
Brown et al. (1992) 65.6 (±NA) -
Stratos, Collins, and Hsu (2016) 67.7 (±NA) -
Berg-Kirkpatrick et al. (2010) 74.9 (±1.5) -
Blunsom and Cohn (2011) 77.5 (±NA) 69.8
Stratos (2019) 78.1 (±0.8) -
Tran et al. (2016) 79.1 (±NA) 71.7 (±NA)
Yuret, Yatbaz, and Sert (2014) 79.5 (±0.3) 69.1 (±2.7)
Yatbaz, Sert, and Yuret (2012) (word-based) 80.2 (±0.7) 72.1 (±0.4)
He, Neubig, and Berg-Kirkpatrick (2018) 80.8 (±1.3) 74.1 (±0.7)
Table 1: Many-to-one (M1) accuracy and V-Measure (VM) for POSI on the 45-tag Penn Treebank WSJ dataset over 10 random runs. mBERT is used in all of our experiments (upper part of the table).
de en es fr id it ja ko pt-br sv Mean
SAE 74.8 70.7 71.1 66.7 75.4 66.2 82.1 65.4 75.1 61.6 70.9
(±1.5) (±2.2) (±2.4) (±1.9) (±1.6) (±3.3) (±0.9) (±1.7) (±4.1) (±2.6)
SyntDEC 81.5 76.5 78.9 70.7 76.8 71.7 84.7 69.7 77.7 68.8 75.7
(±1.8) (±1.1) (±1.9) (±3.9) (±1.1) (±3.3) (±1.2) (±1.5) (±2.1) (±3.9)
Stratos (2019) 75.4 73.1 73.1 70.4 73.6 67.4 77.9 65.6 70.7 67.1 71.4
(±1.5) (±1.7) (±1.0) (±2.9) (±1.5) (±3.3) (±0.4) (±1.2) (±2.3) (±1.5)
Stratos, Collins, and Hsu (2016) 63.4 71.4 74.3 71.9 67.3 60.2 69.4 61.8 65.8 61.0 66.7
Berg-Kirkpatrick et al. (2010) 67.5 62.4 67.1 62.1 61.3 52.9 78.2 60.5 63.2 56.7 63.2
(±1.8) (±3.5) (±3.1) (±4.5) (±3.9) (±2.9) (±2.9) (±3.6) (±2.2) (±2.5)
Brown et al. (1992) 60.0 62.9 67.4 66.4 59.3 66.1 60.3 47.5 67.4 61.9 61.9
Table 2: M1 accuracy and standard deviations on the 12-tag universal treebank dataset, averaged over 5 random runs. mBERT is used for all of our experiments (upper part of the table). The number of epochs is proportional to the number of samples, and the M1 accuracy from the last epoch is reported.

45-Tag Penn Treebank WSJ: In Table 1, we evaluate the performance of contextualized representations and our probe on the 45-tag Penn Treebank WSJ dataset. KMeans clustering over the mBERT embeddings improves upon Brown clustering (Brown et al. 1992) and the hidden Markov model based approach of Stratos, Collins, and Hsu (2016), showing that mBERT embeddings encode syntactic information. The stacked autoencoder, SAE (trained during the pretraining stage), improves upon the result of KMeans by nearly 3 points, which demonstrates the effectiveness of transforming the mBERT embeddings to lower dimensionality with an autoencoder before clustering. Our method (SyntDEC) further improves the result and shows that transforming the pretrained mBERT embeddings using a clustering objective helps extract syntactic information more effectively. When augmenting the mBERT embeddings with morphological features (SyntDEC_Morph), we improve over Stratos (2019) and Tran et al. (2016). We also obtain similar M1 accuracy with higher VM compared to Yuret, Yatbaz, and Sert (2014).

Morphology: We also note that the M1 accuracies of Tran et al. (2016) and Stratos (2019) drop significantly, by nearly 14 points, in the absence of morphological features, while SyntDEC degrades by only 2 points. This trend suggests that mBERT representations already encode morphology to some extent.

Yatbaz, Sert, and Yuret (2012) are not directly comparable to our work, as they performed word-based POSI, which attaches the same tag to all instances of a word, while all the other works in Table 1 perform token-based POSI. They also use task-specific hand-engineered rules, such as the presence of hyphens or apostrophes, which might not translate to other languages and tasks. He, Neubig, and Berg-Kirkpatrick (2018) train a POSI-specialized model with a Markov syntax model and short-context word embeddings and report the current state of the art on POSI. In contrast to their method, SyntDEC is fairly task-agnostic.

12-Tag Universal Treebanks: In Table 2, we report M1 accuracies on the 12-tag datasets averaged over 5 random runs. Across all languages, we report state-of-the-art results, improving on average over the previous best method (Stratos 2019) from 71.4% to 75.7%. We also note improvements of SyntDEC over SAE (70.9% to 75.7%) across languages, which reiterates the importance of finetuning representations for clustering. Our methods yield larger gains on this coarse-grained 12-tag POSI task than on the fine-grained 45-tag POSI task, and we hope to explore the reasons for this in future work.

Ablation Studies: Next, we study the impact of our choices on the 45-tag WSJ dataset. Table 3 demonstrates that multilingual BERT (mBERT) is better than English BERT (E-BERT) across settings. For both mBERT and E-BERT, compressing the representations with SAE and finetuning using SyntDEC performs better than KMeans. Also, focusing the representations on the local context (CBoW) improves performance with E-BERT, though not with mBERT. In the appendix, we show the impact of using different types of fastText character embeddings and note the best results when we use embeddings of the last trigram of each word.

Method M1
E-BERT KMeans 69.1 (±0.9)
SAE 71.6 (±2.3)
CBoW 73.8 (±0.7)
SyntDEC (SAE) 72.7 (±1.2)
SyntDEC (CBoW) 74.4 (±0.6)
mBERT KMeans 72.4 (±2.9)
SAE 75.3 (±1.4)
CBoW 75.1 (±0.3)
SyntDEC (SAE) 77.8 (±1.4)
SyntDEC (CBoW) 75.9 (±0.3)
Table 3: Comparison of E-BERT and mBERT on the 45-tag POSI task. We report oracle results in this table.
Figure 4: Comparison of confusion matrices of (a) mBERT (one-to-one accuracy: 54.4%) and (b) SyntDEC (one-to-one accuracy: 65.9%) for the 12-tag experiments on English. A one-to-one mapping is used to assign labels to clusters.
Figure 5: Expected Layer for POSI and CoLab under unsupervised SyntDEC (blue) and supervised settings (green) with E-BERT and mBERT representations.
Figure 6: Comparison of M1/F1 measures for (a) POSI and (b) CoLab under unsupervised (SyntDEC) and supervised settings with mBERT and E-BERT representations.

Error Analysis: We compared SyntDEC and KMeans (when both use mBERT) and found that SyntDEC does better on noun phrases and nominal tags. It helps alleviate confusion among fine-grained noun tags (e.g., NN vs. NNS), while also showing better handling of numerals (CD) and personal pronouns (PRP). However, SyntDEC still shows considerable confusion among fine-grained verb categories. For the 12-tag experiments, we similarly found that SyntDEC outperforms KMeans for the majority of tags, especially nouns and verbs, resulting in a gain of more than 20% in one-to-one accuracy. We further compare t-SNE visualizations of SyntDEC and mBERT embeddings and observe that the SyntDEC embeddings form relatively compact clusters. Detailed results and visualizations are shown in Figure 4 and the appendix.

Nearby Distant
en de sv es fr pt it ko id ja Mean
distance to en 0 0.36 0.4 0.46 0.46 0.48 0.50 0.69 0.71 0.71 -
Monolingual 76.5 81.5 68.8 78.9 70.7 77.7 71.7 69.7 76.8 84.7 75.7
(±1.1) (±1.8) (±3.9) (±1.9) (±3.9) (±2.1) (±3.3) (±1.5) (±1.1) (±1.2) -
Crosslingual 76.5 71.9 66.7 75.7 73.5 77.6 73.5 67.5 75.4 80.3 73.9
(±1.1) (±1.5) (±1.9) (±1.4) (±1.1) (±1.1) (±1.2) (±0.9) (±1.7) (±1.3) -
Table 4: POSI M1 for SyntDEC with mBERT on the 12-tag universal treebank in monolingual and crosslingual settings. Monolingual: clusters are learned and evaluated on the same language. Crosslingual: clusters are learned on English and evaluated on all languages.

6 SyntDEC as an Unsupervised Probe

Next, we leverage SyntDEC as an unsupervised probe to analyse where syntactic information is captured in the pretrained representations. Existing approaches to probing usually rely on supervised training of probes. However, as argued recently by Zhou and Srikumar (2021), this can be unreliable. Our supervision-free probe arguably avoids biases in interpretation that arise from the training data involved in probing. We compare our unsupervised probe to a reimplementation of the supervised shallow-MLP-based probe of Tenney, Das, and Pavlick (2019). Similar to their paper, we report the Expected Layer under supervised and unsupervised settings for the two tasks in Figure 5. The Expected Layer represents the average layer number weighted by incremental performance gains: $E_{\Delta}[l]=\frac{\sum_{l=1}^{L}l\cdot\Delta^{(l)}}{\sum_{l=1}^{L}\Delta^{(l)}}$, where $\Delta^{(l)}$ is the change in the performance metric when adding layer $l$ to the previous layers. Layers are incrementally added from lower to higher layers (a small computational sketch is given after the observations below). We use the F1 score and M1 accuracy as the performance metrics for the supervised and unsupervised experiments, respectively. We observe that:

  1. The Expected Layer under the unsupervised probe (blue) is higher than under the supervised probe (green) for both tasks and both models, showing that unsupervised syntax induction benefits more from higher layers.

  2. The differences between the E-BERT and mBERT Expected Layers are larger under the unsupervised setting, suggesting that our unsupervised probe utilizes mBERT and E-BERT layers differently than the supervised one does.
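As a concrete illustration of the Expected Layer computation defined above, here is a minimal sketch with hypothetical prefix scores:

```python
def expected_layer(scores):
    """Expected Layer E_Delta[l]; scores[l] is the metric (F1 or M1) when
    layers 0..l are included, so Delta(l) = scores[l] - scores[l-1]."""
    deltas = [scores[l] - scores[l - 1] for l in range(1, len(scores))]
    return sum(l * d for l, d in enumerate(deltas, start=1)) / sum(deltas)

# Hypothetical scores for layers 0..3:
# expected_layer([60.0, 62.0, 66.0, 67.0])  # (1*2 + 2*4 + 3*1) / 7 ≈ 1.86
```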

In Figure 6, we further probe the performance of each layer individually by computing the F1 score for the supervised probe and the M1 score for the unsupervised probe. We observe a noticeable improvement at Layer 1 for supervised POSI and at Layers 1/4/6 for CoLab, which correlates with their respective Expected Layer values. For the unsupervised setting, the improvements are shared more evenly across the initial layers. Although F1 and M1 are not directly comparable, supervised performance remains competitive even at higher layers, while unsupervised performance drops. We present detailed results in the appendix.

7 Crosslingual POSI

Pires, Schlinger, and Garrette (2019) and Wu and Dredze (2019) show that mBERT is effective at zero-shot crosslingual transfer. Inspired by this, we evaluate crosslingual performance on the 12-tag universal treebank (Table 4). The first row shows M1 accuracies when training and evaluating SyntDEC on the same language (monolingual). The second row shows M1 accuracies of the English-trained SyntDEC on other languages (crosslingual). In general, we find that clusters learned on a high-resource language like English can be used for other languages. Similar to He et al. (2019), we use the distances of the languages from English to group languages as nearby or distant. The distance is calculated by accounting for syntactic, genetic, and geographic distances according to the URIEL linguistic database (Littell et al. 2017). Our results highlight the effectiveness of mBERT in crosslingual POSI. Even for Asian languages (ko, id, and ja), which have a higher distance from English, the performance is comparable across settings. For nearby languages, crosslingual SyntDEC performs well and even outperforms the monolingual setting for some languages.
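A minimal sketch of this zero-shot evaluation, under the assumption that the English-trained encoder and cluster centers are applied unchanged to the target language's mBERT embeddings (each token then receives the label of its nearest center, which equals the argmax of the Student's t soft assignment):

```python
import torch

def crosslingual_cluster_labels(encoder, centers, target_embeddings):
    """Assign target-language tokens to clusters learned on English."""
    with torch.no_grad():
        z = encoder(target_embeddings)     # English-trained encoder
        dist = torch.cdist(z, centers)     # distances to English centers
    return dist.argmin(dim=1)              # nearest-center cluster ids
```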

Method $F1_{\mu}$ $F1_{\mathit{max}}$ M1 VM
DIORA 62.5 (±0.5) 63.4 - -
DIORA$_{\mathit{CB}}$ (*) 64.5 (±0.6) 65.5 - -
DIORA$_{\mathit{CB}}^{*}$ (*) 66.4 (±0.7) 67.8 - -
DIORA Baselines E-BERT (**) 41.8 42.2 - -
ELMo (**) 58.5 59.4 - -
ELMo$_{\mathit{CI}}$ (**) 53.4 56.3 - -
SyntDEC E-BERT 60.8 (±0.7) 62.7 75.4 (±1.1) 41.2 (±1.4)
SpanBERT 61.3 (±0.8) 63.3 75.9 (±1.0) 40.8 (±1.1)
mBERT 64.0 (±0.4) 64.6 79.6 (±0.6) 44.5 (±0.7)
Table 5: CoLab results on the WSJ test set using gold parses, over five random runs. Our models were trained for 15 epochs, and results from the final epoch of each run are recorded. DIORA results are reported from Drozdov et al. (2019a). DIORA$_{\mathit{CB}}$ and DIORA$_{\mathit{CB}}^{*}$ are fairly specialized models involving codebook learning (*). We also report the E-BERT and ELMo baselines from Drozdov et al. (2019a) (**); we significantly outperform these previously reported E-BERT/ELMo baselines. Our results are not directly comparable to DIORA, as it uses the WSJ dev set for tuning and early stopping whereas we do not.

8 Constituency Labelling (CoLab)

In Table 5, we report the F1 and M1 scores for constituency labelling (CoLab) on the WSJ test set. We represent constituents by concatenating the embeddings of the first and last words in the span (where word embeddings are computed by averaging the corresponding subword embeddings). We observe an improvement over DIORA (Drozdov et al. 2019b), a recent unsupervised constituency parsing model, and achieve results competitive with recent variants that improve DIORA with discrete representation learning (Drozdov et al. 2019a). Our model and the DIORA variants use gold constituents for these experiments. We compute F1 metrics for comparison with previous work but also report M1 accuracies. As with POSI, our results suggest that mBERT outperforms both SpanBERT and E-BERT on the CoLab task as well. We also note that SpanBERT performs better than E-BERT, presumably because SpanBERT seeks to learn span representations explicitly. In the Appendix (Table 7), we explore other ways of representing constituents and note that mean/max pooling followed by clustering does not perform well. Compressing and finetuning the mean-pooled representation using SyntDEC (SyntDEC_Mean) is also suboptimal. We hypothesize that mean/max pooling results in a loss of information about word order in the constituent, whereas the concatenation of the first and last words retains this information. Even a stacked autoencoder (SAE) over the concatenation of the first and last tokens achieves competitive results, but finetuning with SyntDEC improves $F1_{\mu}$ by nearly 4.5%. This demonstrates that for CoLab as well, the transformation to lower dimensions and finetuning toward a clustering-friendly space are important for achieving competitive performance.

9 Related Work

Deep Clustering: Unlike previous work where feature extraction and clustering were applied sequentially, deep clustering aims to jointly optimize for both by combining a clustering loss with the feature extraction. A number of deep clustering methods have been proposed which primarily differ in their clustering approach: Yang et al. (2017) use KMeans, Xie, Girshick, and Farhadi (2016) use cluster assignment hardening, Ghasedi Dizaji et al. (2017) add a balanced assignments loss on top of cluster assignment hardening, Huang et al. (2014) introduce a locality-preserving loss and a group sparsity loss on the clustering, Yang, Parikh, and Batra (2016) use agglomerative clustering, and Ji et al. (2017) use subspace clustering. All of these approaches can be used to cluster contextualized representations, and future work may improve upon our results by exploring these approaches. The interplay between deep clustering for syntax and recent advancements in NLP, such as contextualized representations, has not previously been studied. In this paper, we fill this gap.

Unsupervised Syntax Induction: There has been a lot of work on unsupervised induction of syntax, namely unsupervised constituency parsing (Klein and Manning 2002; Seginer 2007; Kim, Dyer, and Rush 2019) and dependency parsing (Klein and Manning 2004; Smith and Eisner 2006; Gillenwater et al. 2010; Spitkovsky, Alshawi, and Jurafsky 2013; Jiang, Han, and Tu 2016). While most prior work focuses on inducing unlabeled syntactic structures, we focus on inducing constituent labels while assuming the gold syntactic structure is available. This goal has also been pursued in prior work (Drozdov et al. 2019a; Jin and Schuler 2020). Compared to them, we present simpler models that induce syntactic labels directly from pretrained models via dimensionality reduction and clustering. Similar to us, Li and Eisner (2019) also note gains for supervised NLP tasks upon reducing the representation dimension.

Probing Pretrained Representations: Recent analysis work (Liu et al. 2019a; Tenney et al. 2019; Aljalbout et al. 2018; Jawahar, Sagot, and Seddah 2019, inter alia) has shown that pretrained language models encode syntactic information effectively. Most of these works train a supervised model using pretrained representations and labeled examples, and show that pretrained language models effectively encode part-of-speech and constituency information. In contrast to these works, we propose an unsupervised approach to probing which does not rely on any training data. Zhou and Srikumar (2021) also pursue the same goals by studying the geometry of these representations.

10 Conclusion

In this work, we explored the problem of clustering text representations for model interpretation and induction of syntax. We observed that off-the-shelf methods like KMeans are suboptimal, as these representations are high-dimensional and thus not directly suitable for clustering. We therefore proposed a deep clustering approach which jointly transforms these representations into a lower-dimensional, cluster-friendly space and clusters them. By integrating a small number of task-specific features and using multilingual representations, our approach achieves performance on unsupervised POSI and CoLab that is competitive with more complex methods in the literature. Finally, we also showed that our technique can be used as a supervision-free approach to probe for syntax in these representations, and contrasted our unsupervised probe with supervised ones.

References

  • Aljalbout et al. (2018) Aljalbout, E.; Golkov, V.; Siddiqui, Y.; Strobel, M.; and Cremers, D. 2018. Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648.
  • Belinkov et al. (2017) Belinkov, Y.; Durrani, N.; Dalvi, F.; Sajjad, H.; and Glass, J. 2017. What do neural machine translation models learn about morphology? In Proc. of ACL.
  • Bengio et al. (2007) Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2007. Greedy layer-wise training of deep networks. In Proc. of NeurIPS.
  • Berg-Kirkpatrick et al. (2010) Berg-Kirkpatrick, T.; Bouchard-Côté, A.; DeNero, J.; and Klein, D. 2010. Painless unsupervised learning with features. In Proc. of NAACL-HLT.
  • Blunsom and Cohn (2011) Blunsom, P.; and Cohn, T. 2011. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In Proc. of ACL-HLT.
  • Brown et al. (1992) Brown, P. F.; Della Pietra, V. J.; Desouza, P. V.; Lai, J. C.; and Mercer, R. L. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4): 467–480.
  • Cao, Kitaev, and Klein (2020) Cao, S.; Kitaev, N.; and Klein, D. 2020. Unsupervised Parsing via Constituency Tests. arXiv preprint arXiv:2010.03146.
  • Chang et al. (2017) Chang, J.; Wang, L.; Meng, G.; Xiang, S.; and Pan, C. 2017. Deep adaptive image clustering. In Proc. of ICCV.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT.
  • Drozdov et al. (2019a) Drozdov, A.; Verga, P.; Chen, Y.-P.; Iyyer, M.; and McCallum, A. 2019a. Unsupervised Labeled Parsing with Deep Inside-Outside Recursive Autoencoders. In Proc. of EMNLP-IJCNLP.
  • Drozdov et al. (2019b) Drozdov, A.; Verga, P.; Yadav, M.; Iyyer, M.; and McCallum, A. 2019b. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders. In Proc. of NAACL-HLT.
  • Ghasedi Dizaji et al. (2017) Ghasedi Dizaji, K.; Herandi, A.; Deng, C.; Cai, W.; and Huang, H. 2017. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In Proc. of ICCV.
  • Gillenwater et al. (2010) Gillenwater, J.; Ganchev, K.; Graça, J.; Pereira, F.; and Taskar, B. 2010. Sparsity in Dependency Grammar Induction. In Proc. of ACL.
  • He, Neubig, and Berg-Kirkpatrick (2018) He, J.; Neubig, G.; and Berg-Kirkpatrick, T. 2018. Unsupervised learning of syntactic structure with invertible neural projections. In Proc. of EMNLP.
  • He et al. (2019) He, J.; Zhang, Z.; Berg-Kirkpatrick, T.; and Neubig, G. 2019. Cross-lingual syntactic transfer through unsupervised adaptation of invertible projections. In Proc. of ACL.
  • Hewitt and Manning (2019) Hewitt, J.; and Manning, C. D. 2019. A structural probe for finding syntax in word representations. In Proc. of NAACL-HLT.
  • Huang et al. (2014) Huang, P.; Huang, Y.; Wang, W.; and Wang, L. 2014. Deep embedding network for clustering. In Proc. of International Conference on Pattern Recognition.
  • Jawahar, Sagot, and Seddah (2019) Jawahar, G.; Sagot, B.; and Seddah, D. 2019. What Does BERT Learn about the Structure of Language? In Proc. of ACL.
  • Ji et al. (2017) Ji, P.; Zhang, T.; Li, H.; Salzmann, M.; and Reid, I. 2017. Deep subspace clustering networks. In Proc. of NeurIPS.
  • Jiang, Han, and Tu (2016) Jiang, Y.; Han, W.; and Tu, K. 2016. Unsupervised Neural Dependency Parsing. In Proc. of EMNLP.
  • Jiang et al. (2016) Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; and Zhou, H. 2016. Variational deep embedding: An unsupervised and generative approach to clustering. In Proc. of IJCAI.
  • Jin et al. (2019) Jin, L.; Doshi-Velez, F.; Miller, T.; Schwartz, L.; and Schuler, W. 2019. Unsupervised learning of PCFGs with normalizing flow. In Proc. of ACL.
  • Jin and Schuler (2020) Jin, L.; and Schuler, W. 2020. The Importance of Category Labels in Grammar Induction with Child-directed Utterances. In Proc. of International Conference on Parsing Technologies.
  • Johnson (2007) Johnson, M. 2007. Why doesn’t EM find good HMM POS-taggers? In Proc. of EMNLP-CoNLL.
  • Joshi et al. (2020) Joshi, M.; Chen, D.; Liu, Y.; Weld, D. S.; Zettlemoyer, L.; and Levy, O. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. TACL, 8: 64–77.
  • Joshi et al. (2019) Joshi, M.; Levy, O.; Zettlemoyer, L.; and Weld, D. 2019. BERT for Coreference Resolution: Baselines and Analysis. In Proc. of EMNLP-IJCNLP.
  • Joulin et al. (2017) Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2017. Bag of Tricks for Efficient Text Classification. In Proc. of EACL.
  • Kim, Dyer, and Rush (2019) Kim, Y.; Dyer, C.; and Rush, A. M. 2019. Compound Probabilistic Context-Free Grammars for Grammar Induction. In Proc. of ACL.
  • Kitaev and Klein (2018) Kitaev, N.; and Klein, D. 2018. Constituency Parsing with a Self-Attentive Encoder. In Proc. of ACL.
  • Klein and Manning (2002) Klein, D.; and Manning, C. D. 2002. A Generative Constituent-Context Model for Improved Grammar Induction. In Proc. of ACL.
  • Klein and Manning (2004) Klein, D.; and Manning, C. D. 2004. Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proc. of ACL.
  • Lee, He, and Zettlemoyer (2018) Lee, K.; He, L.; and Zettlemoyer, L. 2018. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. In Proc. of NAACL-HLT.
  • Li and Eisner (2019) Li, X. L.; and Eisner, J. 2019. Specializing word embeddings (for parsing) by information bottleneck. arXiv preprint arXiv:1910.00163.
  • Lin et al. (2015) Lin, C.-C.; Ammar, W.; Dyer, C.; and Levin, L. 2015. Unsupervised POS Induction with Word Embeddings. In Proc. of NAACL-HLT.
  • Linzen, Dupoux, and Goldberg (2016) Linzen, T.; Dupoux, E.; and Goldberg, Y. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL, 4: 521–535.
  • Littell et al. (2017) Littell, P.; Mortensen, D. R.; Lin, K.; Kairis, K.; Turner, C.; and Levin, L. 2017. Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proc. of EACL.
  • Liu et al. (2019a) Liu, N. F.; Gardner, M.; Belinkov, Y.; Peters, M. E.; and Smith, N. A. 2019a. Linguistic Knowledge and Transferability of Contextual Representations. In Proc. of NAACL-HLT.
  • Liu et al. (2019b) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Marcus, Santorini, and Marcinkiewicz (1993) Marcus, M. P.; Santorini, B.; and Marcinkiewicz, M. A. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2): 313–330.
  • McDonald et al. (2013) McDonald, R.; Nivre, J.; Quirmbach-Brundage, Y.; Goldberg, Y.; Das, D.; Ganchev, K.; Hall, K.; Petrov, S.; Zhang, H.; Täckström, O.; Bedini, C.; Bertomeu Castelló, N.; and Lee, J. 2013. Universal Dependency Annotation for Multilingual Parsing. In Proc. of ACL.
  • Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mrini et al. (2019) Mrini, K.; Dernoncourt, F.; Bui, T.; Chang, W.; and Nakashole, N. 2019. Rethinking self-attention: An interpretable self-attentive encoder-decoder parser. arXiv preprint arXiv:1911.03875.
  • Nivre et al. (2016) Nivre, J.; de Marneffe, M.-C.; Ginter, F.; Goldberg, Y.; Hajič, J.; Manning, C. D.; McDonald, R.; Petrov, S.; Pyysalo, S.; Silveira, N.; Tsarfaty, R.; and Zeman, D. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proc. of LREC.
  • Peters et al. (2018a) Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018a. Deep Contextualized Word Representations. In Proc. of NAACL-HLT.
  • Peters et al. (2018b) Peters, M.; Neumann, M.; Zettlemoyer, L.; and Yih, W.-t. 2018b. Dissecting Contextual Word Embeddings: Architecture and Representation. In Proc. of EMNLP.
  • Pimentel et al. (2020) Pimentel, T.; Valvoda, J.; Hall Maudslay, R.; Zmigrod, R.; Williams, A.; and Cotterell, R. 2020. Information-Theoretic Probing for Linguistic Structure. In Proc. of ACL.
  • Pires, Schlinger, and Garrette (2019) Pires, T.; Schlinger, E.; and Garrette, D. 2019. How Multilingual is Multilingual BERT? In Proc. of ACL.
  • Rosenberg and Hirschberg (2007) Rosenberg, A.; and Hirschberg, J. 2007. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proc. of EMNLP-CoNLL.
  • Seginer (2007) Seginer, Y. 2007. Fast Unsupervised Incremental Parsing. In Proc. of ACL.
  • Smith and Eisner (2006) Smith, N. A.; and Eisner, J. 2006. Annealing Structural Bias in Multilingual Weighted Grammar Induction. In Proc. of COLING-ACL.
  • Spitkovsky, Alshawi, and Jurafsky (2013) Spitkovsky, V. I.; Alshawi, H.; and Jurafsky, D. 2013. Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction. In Proc. of EMNLP.
  • Stratos (2019) Stratos, K. 2019. Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction. In Proc. of NAACL-HLT.
  • Stratos, Collins, and Hsu (2016) Stratos, K.; Collins, M.; and Hsu, D. 2016. Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models. TACL, 4: 245–257.
  • Tenney, Das, and Pavlick (2019) Tenney, I.; Das, D.; and Pavlick, E. 2019. BERT Rediscovers the Classical NLP Pipeline. In Proc. of ACL.
  • Tenney et al. (2019) Tenney, I.; Xia, P.; Chen, B.; Wang, A.; Poliak, A.; McCoy, R. T.; Kim, N.; Van Durme, B.; Bowman, S. R.; Das, D.; et al. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. In Proc. of ICLR.
  • Toshniwal et al. (2020) Toshniwal, S.; Shi, H.; Shi, B.; Gao, L.; Livescu, K.; and Gimpel, K. 2020. A Cross-Task Analysis of Text Span Representations. In Proc. of RepL4NLP.
  • Tran et al. (2016) Tran, K. M.; Bisk, Y.; Vaswani, A.; Marcu, D.; and Knight, K. 2016. Unsupervised Neural Hidden Markov Models. In Proc. of the Workshop on Structured Prediction for NLP.
  • Tsai et al. (2019) Tsai, H.; Riesa, J.; Johnson, M.; Arivazhagan, N.; Li, X.; and Archer, A. 2019. Small and Practical BERT Models for Sequence Labeling. In Proc. of EMNLP-IJCNLP.
  • Tseng, Jurafsky, and Manning (2005) Tseng, H.; Jurafsky, D.; and Manning, C. 2005. Morphological features help POS tagging of unknown words across language varieties. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing.
  • Wu and Dredze (2019) Wu, S.; and Dredze, M. 2019. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proc. of EMNLP-IJCNLP.
  • Wu et al. (2020) Wu, W.; Wang, F.; Yuan, A.; Wu, F.; and Li, J. 2020. CorefQA: Coreference Resolution as Query-based Span Prediction. In Proc. of ACL.
  • Xie, Girshick, and Farhadi (2016) Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In Proc. of ICML.
  • Yang et al. (2017) Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proc. of ICML.
  • Yang, Parikh, and Batra (2016) Yang, J.; Parikh, D.; and Batra, D. 2016. Joint unsupervised learning of deep representations and image clusters. In Proc. of CVPR.
  • Yatbaz, Sert, and Yuret (2012) Yatbaz, M. A.; Sert, E.; and Yuret, D. 2012. Learning syntactic categories using paradigmatic representations of word context. In Proc. of EMNLP-CoNLL.
  • Yuret, Yatbaz, and Sert (2014) Yuret, D.; Yatbaz, M. A.; and Sert, E. 2014. Unsupervised instance-based part of speech induction using probable substitutes. In Proc. of COLING.
  • Zhou and Zhao (2019) Zhou, J.; and Zhao, H. 2019. Head-Driven Phrase Structure Grammar Parsing on Penn Treebank. In Proc. of ACL.
  • Zhou and Srikumar (2021) Zhou, Y.; and Srikumar, V. 2021. DirectProbe: Studying Representations without Classifiers. In Proc. of NAACL-HLT.

Appendix A Task and Architecture

CBoW Illustration

Figure 7 shows an illustration of the CBoW variant of SyntDEC.

Figure 7: CBoW variant of SyntDEC. Embeddings of the context tokens are concatenated and used as input to SyntDEC to reconstruct the embedding of the token.
Figure 8: Confusion matrices for KMeans over mBERT (left) and SyntDEC_Morph (right) for the 20 most frequent tags (45-tag POSI).

Appendix B POSI Analysis

45-Tag POSI Analysis

We show a t-SNE visualization of mBERT embeddings and the embeddings learned by our deep clustering model in Figure 1. We note that the clusters formed by SyntDEC are more coherent and dense.

In Figure 8, we show the confusion matrices of SyntDEC_Morph and mBERT for the 20 most frequent tags in the 45 tag POSI task by assigning labels to predicted clusters using the optimal 1-to-1 mapping. We observe that SyntDEC_Morph outperforms mBERT for most tags.

12-Tag POSI Analysis

In Figure 9, we show t-SNE visualizations of SyntDEC and mBERT embeddings of tokens from the 12-tag Universal Treebank English dataset. The SyntDEC embeddings produce more distinct clusters.

Figure 9: t-SNE visualization of (a) mBERT and (b) SyntDEC embeddings of tokens from the 12-tag Universal Treebank English dataset. Colors correspond to the ground-truth POS tags.

45-Tag POSI Ablation Studies

Morph. Feats. M1 VM
Unigram 77.9 (±2.3) 73.2 (±1.4)
Bigram 77.4 (±2.4) 72.7 (±1.9)
Trigram 79.5 (±0.9) 73.9 (±0.7)
Table 6: Comparison of different orders of character $n$-gram embeddings for the 45-tag POSI task.

In Table 6, we study the impact of different character embeddings and achieve the best results using embeddings of the trailing trigram of each token.

CoLab Ablation Studies

In Table 7, we present the ablation results for CoLab. We find that KMeans on max- or mean-pooled span representations of mBERT does not work well. Even deep clustering (SyntDEC_Mean) over the mean of the span representation does not help. SAE and SyntDEC trained over the concatenation of the representations of the end points substantially improve the results.

Method $F1_{\mu}$ $F1_{\mathit{max}}$ M1
KMeans (Mean) 39.9 (±0.4) 40.2 48.6 (±0.3)
KMeans (Max) 40.1 (±0.6) 40.9 49.6 (±0.7)
SyntDEC_Mean 40.8 (±1.1) 42.4 49.8 (±1.2)
SAE 61.2 (±1.2) 62.8 76.6 (±1.1)
SyntDEC 64.0 (±0.4) 64.6 79.6 (±0.6)
Table 7: Comparison of different methods to represent spans for CoLab. mBERT is used in these experiments.

Appendix C Unsupervised Probing

In Tables 8 and 9, we report the results of adding layers incrementally from lower to higher layers for POSI with E-BERT and mBERT, respectively. We present similar results for CoLab in Tables 10 and 11. In Tables 14 and 15, we report the results of individual layers for POSI with mBERT and E-BERT, respectively. We present similar results for CoLab in Tables 12 and 13.

Appendix D Hyperparameters

Words are represented by 768-dimensional vectors obtained by taking the mean of the BERT layers. We also tried max and mean pooling but did not notice much improvement. Morphological embeddings extracted from fastText have 300 dimensions. The number of clusters is set equal to the number of ground-truth tags for all experiments. Following previous work (Stratos 2019), we use the 45-tag POSI experiments on English to select the hyperparameters for our framework and use these hyperparameters across all other languages and tasks.

We use a SyntDEC architecture with one encoder layer and a latent dimension of 75. Layer-wise and end-to-end training is done for 50 epochs with a batch size of 64, a learning rate of 0.1, and momentum of 0.9 using the SGD optimizer. In the clustering stage, we train SyntDEC for 4000 iterations with a batch size of 256 and a learning rate of 0.001 with momentum 0.9 using SGD. We set the reconstruction error weight $\lambda=5$ for all our experiments. We set the context width to one for CBoW. For out-of-vocabulary words, we use an average over all subword embeddings. For all experiments, we report results for the last training iteration, as we do not have access to the ground-truth labels for model selection. For the supervised experiments, we follow the training and architecture details of Tenney, Das, and Pavlick (2019). All our experiments are performed on a 12GB GeForce RTX 2080 Ti GPU, and each run takes approximately 3 hours.

Layers M1 VM
Layer 0 61.6 (±0.5) 59.8 (±0.6)
Layer 0_1 61.9 (±0.8) 59.9 (±0.8)
Layer 0_2 66.5 (±1.0) 64.5 (±1.0)
Layer 0_3 67.4 (±2.4) 65.6 (±1.6)
Layer 0_4 68.5 (±2.2) 65.9 (±1.6)
Layer 0_5 69.4 (±2.4) 66.2 (±1.3)
Layer 0_6 70.7 (±1.2) 67.1 (±1.6)
Layer 0_7 72.8 (±1.2) 68.3 (±0.7)
Layer 0_8 72.6 (±0.6) 68.6 (±0.3)
Layer 0_9 72.7 (±0.7) 68.9 (±0.5)
Layer 0_10 72.1 (±1.4) 67.9 (±0.9)
Layer 0_11 72.0 (±1.2) 67.9 (±0.9)
Layer 0_12 72.7 (±1.2) 68.9 (±0.8)
Table 8: Comparison of different E-BERT layers for the 45-tag POSI task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 69.6 (±2.7) 66.4 (±2.0)
Layer 0_1 69.8 (±1.9) 66.9 (±0.7)
Layer 0_2 72.1 (±1.7) 68.2 (±0.9)
Layer 0_3 71.5 (±1.6) 68.5 (±0.9)
Layer 0_4 72.1 (±1.7) 68.5 (±0.8)
Layer 0_5 73.1 (±1.5) 69.1 (±0.8)
Layer 0_6 75.0 (±1.7) 70.1 (±1.6)
Layer 0_7 76.2 (±2.6) 71.5 (±1.8)
Layer 0_8 77.9 (±1.3) 72.2 (±1.1)
Layer 0_9 77.8 (±1.9) 72.6 (±1.0)
Layer 0_10 76.9 (±2.8) 72.1 (±1.8)
Layer 0_11 77.5 (±0.9) 72.1 (±0.6)
Layer 0_12 77.8 (±1.4) 72.6 (±1.0)
Table 9: Comparison of different mBERT layers for the 45-tag POSI task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 54.3 (±1.5) 31.6 (±1.7)
Layer 0_1 54.2 (±1.9) 31.5 (±2.0)
Layer 0_2 52.6 (±1.3) 32.6 (±2.2)
Layer 0_3 58.8 (±0.7) 37.2 (±1.1)
Layer 0_4 58.9 (±1.8) 38.0 (±2.4)
Layer 0_5 61.3 (±0.1) 42.1 (±0.1)
Layer 0_6 60.5 (±1.9) 40.7 (±3.0)
Layer 0_7 62.1 (±0.7) 42.7 (±1.0)
Layer 0_8 60.1 (±2.9) 40.7 (±3.5)
Layer 0_9 61.9 (±0.2) 42.7 (±1.0)
Layer 0_10 60.8 (±1.7) 42.0 (±2.3)
Layer 0_11 60.6 (±3.6) 41.9 (±4.4)
Layer 0_12 61.4 (±0.5) 42.4 (±0.8)
Table 10: Comparison of different E-BERT layers for the CoLab task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 54.0 (±2.1) 32.9 (±1.7)
Layer 0_1 53.9 (±3.3) 33.6 (±2.5)
Layer 0_2 58.4 (±1.7) 36.3 (±1.4)
Layer 0_3 56.8 (±3.2) 35.9 (±2.3)
Layer 0_4 60.0 (±1.8) 39.1 (±1.7)
Layer 0_5 61.2 (±1.3) 40.4 (±1.5)
Layer 0_6 62.9 (±0.6) 43.1 (±0.5)
Layer 0_7 62.7 (±0.7) 42.5 (±1.1)
Layer 0_8 62.9 (±0.4) 43.1 (±0.6)
Layer 0_9 62.8 (±0.6) 42.9 (±0.9)
Layer 0_10 63.2 (±0.5) 43.5 (±0.6)
Layer 0_11 63.3 (±0.6) 43.5 (±0.8)
Layer 0_12 64.2 (±0.4) 45.0 (±0.7)
Table 11: Comparison of different mBERT layers for the CoLab task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 54.6 (±0.9) 31.8 (±1.0)
Layer 1 53.7 (±0.9) 33.6 (±2.0)
Layer 2 59.9 (±1.0) 39.7 (±1.0)
Layer 3 60.2 (±1.8) 40.8 (±2.2)
Layer 4 62.4 (±1.3) 44.0 (±1.8)
Layer 5 58.7 (±4.1) 39.3 (±5.5)
Layer 6 59.2 (±1.2) 39.1 (±1.9)
Layer 7 58.1 (±2.1) 36.9 (±2.6)
Layer 8 57.3 (±0.4) 37.5 (±0.8)
Layer 9 56.4 (±1.2) 34.9 (±1.4)
Layer 10 42.9 (±1.5) 16.43 (±2.3)
Layer 11 40.9 (±1.1) 14.9 (±1.2)
Layer 12 47.7 (±1.7) 22.3 (±2.0)
Table 12: Comparison of individual E-BERT layers for the CoLab task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 53.6 (±2.6) 32.1 (±1.9)
Layer 1 55.5 (±2.2) 35.3 (±1.4)
Layer 2 60.2 (±1.5) 39.1 (±0.9)
Layer 3 61.6 (±1.7) 41.2 (±0.9)
Layer 4 63.4 (±0.6) 43.0 (±0.7)
Layer 5 63.9 (±0.5) 44.1 (±0.6)
Layer 6 63.7 (±0.9) 44.7 (±0.9)
Layer 7 63.5 (±0.3) 44.9 (±0.5)
Layer 8 63.1 (±0.8) 44.9 (±0.8)
Layer 9 63.9 (±0.5) 46.2 (±0.8)
Layer 10 64.3 (±0.6) 46.3 (±0.8)
Layer 11 63.7 (±0.4) 45.1 (±0.2)
Layer 12 62.9 (±0.6) 43.3 (±0.6)
Table 13: Comparison of individual mBERT layers for the CoLab task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 66.9 (±1.4) 64.5 (±0.6)
Layer 1 70.8 (±0.8) 67.4 (±0.4)
Layer 2 72.6 (±0.8) 68.1 (±0.4)
Layer 3 74.9 (±1.4) 70.0 (±1.0)
Layer 4 76.2 (±1.8) 71.3 (±1.3)
Layer 5 79.2 (±0.4) 72.9 (±0.6)
Layer 6 77.5 (±2.3) 72.0 (±1.4)
Layer 7 78.1 (±1.7) 71.7 (±1.1)
Layer 8 75.6 (±1.9) 70.0 (±1.5)
Layer 9 73.9 (±1.3) 68.5 (±0.3)
Layer 10 73.2 (±0.9) 69.1 (±0.5)
Layer 11 74.5 (±2.3) 69.6 (±1.1)
Layer 12 71.9 (±1.6) 66.3 (±1.1)
Table 14: Comparison of individual mBERT layers for the 45-tag POSI task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.
Layers M1 VM
Layer 0 60.5 (±1.0) 60.0 (±0.7)
Layer 1 64.6 (±1.5) 64.3 (±0.8)
Layer 2 67.9 (±1.2) 66.2 (±0.4)
Layer 3 69.5 (±1.1) 66.6 (±0.7)
Layer 4 71.6 (±1.4) 67.2 (±0.9)
Layer 5 72.6 (±0.4) 68.1 (±0.6)
Layer 6 73.9 (±1.4) 67.9 (±0.9)
Layer 7 71.7 (±0.5) 67.2 (±0.5)
Layer 8 72.6 (±0.6) 67.2 (±0.5)
Layer 9 73.0 (±0.9) 67.7 (±0.7)
Layer 10 65.7 (±1.1) 59.9 (±0.6)
Layer 11 61.5 (±1.7) 54.4 (±1.0)
Layer 12 67.2 (±2.6) 60.4 (±2.2)
Table 15: Comparison of individual E-BERT layers for the 45-tag POSI task. We report oracle M1 accuracy and V-Measure (VM) averaged over 5 random runs.