
Are Classes Clusters?

Kees Varekamp
Stanford Center for Professional Development
[email protected]
Abstract

Sentence embedding models aim to provide general-purpose embeddings for sentences. Most of the models studied in this paper claim to perform well on STS tasks, but they do not report on their suitability for clustering.

This paper (a project completed as part of the XCS224U professional course) looks at four recent sentence embedding models: Universal Sentence Encoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER (Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020). It gives a brief overview of the ideas behind their implementations.

It then investigates how well topic classes in two text classification datasets (Amazon Reviews (Ni et al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their corresponding sentence embedding space. While the performance of the resulting classification model is far from perfect, it is better than random.

This is interesting because the classification model has been constructed in an unsupervised way: the topic classes in these real-life topic classification datasets can be partly reconstructed by clustering the corresponding sentence embeddings.

Model Cluster F1 LogReg F1
Sentence-BERT 0.22 0.53
Universal Sentence Encoder 0.29 0.56
DeCLUTR 0.22 0.56
LASER 0.10 0.42
TfidfVectorizer 0.07 0.55
Table 1: Average F1 scores of unsupervised cluster classifier and logistic regression classifier. Averages taken over two datasets: Amazon Reviews dataset and News dataset. Bold is best of column.

1 Introduction

Since the success of universal word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), interest in a sentence-level equivalent of such embeddings has been increasing.

Sentence embeddings promise to be useful for many tasks in natural language processing. They can provide standardized inputs for custom models such as topic classifiers or sentiment analysis models. But they are even more useful for tasks that involve computing the semantic similarity between many sentences.

It is perfectly possible to train a well-performing model on sentence similarity without any pre-computed sentence embeddings. But as Reimers and Gurevych (2019) point out, you will then need to run one inference for every pair of sentences. For a similarity matrix between n sentences you will need to run n(n-1)/2 inferences.

And when the model is large (a Transformer-based model, for example), these costs add up. The big advantage of universal sentence embedding models is that the vector representations of the sentences can be pre-computed individually. If the proximity of two vectors in the embedding space closely matches the semantic similarity of the two corresponding sentences (and most sentence embedding models indeed aim for this), the similarity calculation reduces to something as simple as a negated L2 distance or a cosine similarity. Only n potentially expensive embeddings need to be pre-computed instead of running n(n-1)/2 inferences, after which the above similarity matrix can be produced much more efficiently.
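
As a minimal sketch of this workflow (the embedding matrix below is a random stand-in for pre-computed sentence embeddings from any of the models discussed later), the full similarity matrix can be produced with plain vector operations:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    # Normalize each embedding to unit length; a single matrix product then
    # yields all n(n-1)/2 pairwise cosine similarities at once.
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normalized @ normalized.T

# Stand-in for 100 pre-computed, 512-dimensional sentence embeddings.
embeddings = np.random.rand(100, 512)
similarities = cosine_similarity_matrix(embeddings)  # shape (100, 100)
```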

This opens the door to many useful applications that involve semantic search in one way or another, two of the most prominent ones being k-nearest-neighbour search and cluster analysis.

This paper focuses on cluster analysis. It will try to answer the following question: Do topic classes in classification datasets correspond to clusters in the corresponding embedding space?

It attempts to answer this question in a straightforward manner: by creating a classification model based on an unsupervised clustering algorithm. We evaluate this unsupervised model with the usual classification metrics and finally compare the result to a logistic regression model, which acts as a standard supervised baseline.

It turns out (Table 1) that it is in fact somewhat possible to use clusters in the embedding space of a topic classification corpus to classify the corresponding topics. The approach is nowhere near state of the art, but the fact that it works is interesting and may be useful for unsupervised text classification applications.

2 Related work: SentEval

The canonical benchmark for sentence embedding models is SentEval (Conneau and Kiela, 2018). SentEval consists of two parts. The first is a set of downstream tasks, which use the embeddings as input to common NLP benchmark tasks. The second is a set of probing tasks, which attempt to recover specific linguistic properties of the original sentence.

The tasks report a variety of scores, but all of them include at least either accuracy or Spearman’s rank correlation coefficient.

The models in this study have been evaluated on SentEval for comparison with the Cluster Classifier experiment. Results are in Table 2.

3 Data

3.1 Amazon Reviews Dataset

The first dataset that we investigate is the Amazon Reviews Dataset (Ni et al., 2019). The data was downloaded from https://nijianmo.github.io/amazon/index.html and was sampled down to 1,000 examples per category. There are 29 product categories in the dataset (for example ”Books”, ”Appliances”, ”Toys and Games”, etc.).

The reviews have titles, but these are not always suitable for topic classification: they tend to be short and generic (”Great Product”). So rather than using the titles, we pick the first sentence in the review that is longer than 10 words and shorter than 20 words.

This approach aims to select sentences that are long enough to be representative of the topic label, filtering out short sentences that may be too generic (such as ‘I really like this product’). It also avoids sentences that are too long for the sentence embedding models to represent accurately (the representations would get too diluted).
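
A minimal sketch of this selection heuristic (the sentence splitting rule is an assumption; the paper does not specify which splitter was used):

```python
import re

def pick_example_sentence(review_text: str, min_words: int = 10, max_words: int = 20):
    """Return the first sentence strictly longer than min_words and strictly
    shorter than max_words, or None if no sentence qualifies."""
    # Naive sentence splitting on ., ! and ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", review_text.strip())
    for sentence in sentences:
        if min_words < len(sentence.split()) < max_words:
            return sentence
    return None
```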

The approach works reasonably well at surfacing meaningful examples:
(Luxury and Beauty): ”It doesnt stay on or protect your hands through washing.”
(Fashion): ”Bought the just as gorgeous rich, red Fireman's scarf, too.”
(Magazine Subscriptions): ”I enjoy this magazine very much And especially this one on COLOR.”
(Automotive): ”Great for routine maintenance tasks such as oil changes, etc.”

3.2 News Dataset

The second dataset collects news stories from the Huffington Post (Misra, 2018) and was downloaded from https://www.kaggle.com/rmisra/news-category-dataset. It has also been sampled down to 1,000 examples per news category. There are 40 news categories, for example ”SPORTS”, ”CRIME”, etc.

Each example contains metadata, a short description, a headline and a category. We choose the headline for our analysis.

Some examples:
(CRIME): ”There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV”
(POLITICS): ”How The Chinese Exclusion Act Can Help Us Understand Immigration Politics Today”
(WOMEN): ”The 20 Funniest Tweets From Women This Week”
(TECH): ”Self-Driving Uber In Fatal Accident Had 6 Seconds To React Before Crash”

4 Models

4.1 Embedding Models

This section describes the universal sentence embedding models that we will investigate.

4.1.1 Predecessors

A lot of work has been done in this space, and there is not enough room here to go over all of it. This summary therefore only covers four of the most recent models in some detail (USE, SBERT, LASER, and DeCLUTR). It omits important groundwork, in particular the following three studies and their resulting models:

  • SkipThought (Kiros et al., 2015): Basically “Thou shalt know a word by the company it keeps”, but extended to sentences. An unsupervised method based on the idea that sentences that are close to each other are more similar than sentences that are far apart, much like word2vec and GloVe.

  • InferSent (Conneau et al., 2017): Max pooling over the hidden states of a biLSTM trained on the Stanford Natural Language Inference dataset.

  • QuickThoughts (Logeswaran and Lee, 2018): A non-generative version of SkipThought – rather than trying to reconstruct neighbouring sentences, it tries to train a classifier for predicting neighbouring/not neighbouring.

4.1.2 Universal Sentence Encoder

Universal Sentence Encoder (USE) (Cer et al., 2018) is a model developed by Google. The goal of USE is to provide easy-to-use sentence embeddings with good transfer performance.

There are two implementations of the encoder:

  • A Transformer-based model using (self) attention that mean-pools the word embeddings to produce sentence embeddings

  • A Deep Averaging Network (DAN) that simply averages word embeddings and bi-gram embeddings and feeds the resulting representation into a straightforward deep neural network (sketched below). The DAN trades some accuracy for speed compared to the Transformer encoder, whose cost scales with the square of the sentence length.
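
A rough sketch of the DAN idea (the shapes, the tanh activations, and the omission of bi-gram embeddings are simplifications for illustration, not the published configuration):

```python
import numpy as np

def dan_embedding(token_vectors: np.ndarray, w_hidden: np.ndarray, w_out: np.ndarray) -> np.ndarray:
    """Toy Deep Averaging Network: average the input embeddings, then pass
    the average through a small feed-forward network."""
    averaged = token_vectors.mean(axis=0)   # (d,)
    hidden = np.tanh(w_hidden @ averaged)   # (d,)
    return np.tanh(w_out @ hidden)          # (d,) sentence embedding

# Illustrative shapes: 7 tokens with 512-dimensional embeddings.
tokens = np.random.rand(7, 512)
w1, w2 = np.random.rand(512, 512), np.random.rand(512, 512)
sentence_vector = dan_embedding(tokens, w1, w2)
```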

The models are trained on a set of self-supervised tasks and on SNLI.

The self-supervised tasks use data from Wikipedia, web news, web question-answer pages and discussion forums.

One interesting note in the paper: for STS tasks the authors use the angular distance (the arccos of the cosine similarity) as a distance measure, because “arccos converts cosine similarity into an angular distance that obeys the triangle inequality. We find that angular distance performs better on STS than cosine similarity.”
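
A minimal sketch of this distance measure (the division by π, which maps the distance into [0, 1], is a common normalization and an assumption here rather than part of the quote):

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity, clipped to guard arccos against floating point rounding.
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)) / np.pi)
```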

4.1.3 Sentence-BERT

Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) is a BERT based model that uses a Siamese network approach to fine-tune BERT on a set of tasks in order to produce useful sentence embeddings.

For SBERT the goal is to produce embeddings that are primarily useful for computationally and/or combinatorially expensive tasks like clustering and nearest neighbour search – producing vectors that can be used in optimized computation libraries such as numpy or sklearn.

SBERT promises semantically meaningful sentence embeddings (similar sentences are close in vector space).

It tries to accomplish this by pooling the token embeddings while training on STS tasks, using tied weights between the two BERT networks so that the resulting embeddings for the two input sentences are created with identical weights. It claims to be better than InferSent or USE on STS and SentEval.

SBERT trains using 3 objectives:

  • Classification, by concatenating the embeddings u and v as (u; v; |u - v|) and feeding the result into a softmax classifier, using cross-entropy loss

  • Regression (presumably on STS), using the cosine similarity cos(u, v) with mean squared error as the loss

  • Triplet loss: Given three sentences a (anchor), p (positive example) and n (negative example), train the model such that |a - p| < |a - n|, or in other words: minimize |a - p| - |a - n| (sketched in code directly below)
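
A minimal sketch of the triplet objective on already-computed embeddings (the hinge formulation and the margin value are illustrative assumptions; the paper's exact hyperparameters are not reproduced here):

```python
import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray,
                 margin: float = 1.0) -> float:
    """Push the anchor-positive distance below the anchor-negative distance
    by at least `margin` (illustrative value)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```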

SBERT is trained on SNLI and MultiNLI. It does well on STS tasks (as expected).

4.1.4 LASER

LASER (Artetxe and Schwenk, 2019) is a bit different from the other models in this summary in that it is primarily interested in universal language-agnostic sentence embeddings. It is trained on a large dataset covering 93 languages. It uses a relatively simple biLSTM to encode the input text: BPE-tokenized text of any language is fed into the same encoder. This creates a fixed-size language-agnostic embedding that is then used by the decoder to translate the output into a specified language. The paper claims that fixed-length representations are more versatile and compatible than variable-length representations: “For instance, there is not always a one-to-one correspondence among words in different languages (e.g. a single word of a morphologically complex language might correspond to several words of a morphologically simple language), so having a separate vector for each word might not transfer as well across languages.”

  • The embeddings are the result of max pooling over the hidden states of the biLSTM (dimensionality 512 per direction), concatenating the forward and backward hidden states into sentence representations of dimensionality 1024 (see the sketch after this list)

  • The authors claim that the model does well on

    • XNLI (entailment)

    • MLDoc (document classification)

    • BUCC (finding the same sentence in another language)

    • Tatoeba: a new cross language similarity search benchmark

  • In the future, the authors would like to improve the results by using a self-attention encoder, pre-trained word embeddings, and back-translation.
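
A rough sketch of the pooling step described in the first bullet above (the hidden states are random stand-ins for the outputs of a real biLSTM):

```python
import numpy as np

# Stand-ins for forward and backward hidden states of a biLSTM over a
# 12-token sentence, 512 dimensions per direction.
forward_states = np.random.rand(12, 512)
backward_states = np.random.rand(12, 512)

# Concatenate per time step (12, 1024), then max-pool over time -> (1024,).
states = np.concatenate([forward_states, backward_states], axis=1)
sentence_embedding = states.max(axis=0)
assert sentence_embedding.shape == (1024,)
```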

4.1.5 DeCLUTR

DeCLUTR (Giorgi et al., 2020) aims to produce useful universal sentence embeddings by unsupervised learning only. It points out that the best results so far have been obtained by methods that used at least some supervised learning, but that it is important to close the gap between supervised and unsupervised methods for languages and domains for which no supervised data exists.

  • Wants to be good at a wide variety of tasks

  • Most universal sentence embedders train supervised on SNLI or MultiNLI (entailment, contradiction, neutral). Examples are InferSent, USE, and SBERT.

  • The authors describe Skip-thought as an unsupervised generative model that uses sentence embeddings to predict words in neighbouring sentences. They mention that the generative nature of the model makes it expensive and surface focused. QuickThoughts tries to improve on this by classifying context sentences from non-context sentences rather than generating them.

  • The authors describe DeCLUTR’s approach as similar to SBERT, but self-supervised, and the objective as similar to QuickThoughts, but using segments rather than whole sentences.

  • Like SBERT’s triplet approach, it uses a contrastive loss: it tries to minimize |a - p| - |a - n|.

  • Like QuickThoughts, it classifies sentences (or rather sentence segments) as nearby or far away.

4.1.6 TfidfVectorizer

This is simply the TfidfVectorizer model from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). It is a bag-of-words vectorizer with TF-IDF weighting applied to it. It has been included in the experiments as a baseline for the encoder models.
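
A minimal usage sketch (the example sentences are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Great for routine maintenance tasks such as oil changes.",
    "I enjoy this magazine very much.",
]
vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence
```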

4.2 Classifiers

This section describes the classifier models we will be using for our experiments.

4.2.1 Cluster Classifier

We test the main hypothesis by using a simple model (a code sketch follows the list below):

  • For every dataset, we cluster the data into as many clusters as there are classes. We use k-means clustering.

  • For every cluster, we allocate the most frequently occurring class within the cluster as its designated class

  • This is the model. Prediction is simply a mapping from the predicted cluster of an example to the designated class of that cluster

  • F1 (macro averaged) will give a measure of how well the model performs
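
A minimal sketch of the Cluster Classifier using scikit-learn's k-means (the function names, random seed, and handling of empty clusters are my own choices; the paper does not publish implementation details):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

def fit_cluster_classifier(embeddings: np.ndarray, labels: np.ndarray, n_classes: int):
    """Cluster into as many clusters as there are classes, then map every
    cluster to the most frequent class among its members."""
    kmeans = KMeans(n_clusters=n_classes, random_state=0).fit(embeddings)
    cluster_to_class = {}
    for cluster in range(n_classes):
        members = labels[kmeans.labels_ == cluster]
        # Empty clusters (rare) get an arbitrary class; here class 0.
        cluster_to_class[cluster] = int(np.bincount(members).argmax()) if len(members) else 0
    return kmeans, cluster_to_class

def cluster_classify(kmeans: KMeans, cluster_to_class: dict, embeddings: np.ndarray) -> np.ndarray:
    return np.array([cluster_to_class[c] for c in kmeans.predict(embeddings)])

# Usage, with X_* as embedding matrices and y_* as integer class labels:
# kmeans, mapping = fit_cluster_classifier(X_train, y_train, n_classes=29)
# predictions = cluster_classify(kmeans, mapping, X_test)
# print(f1_score(y_test, predictions, average="macro"))
```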

4.2.2 Logistic Regression

To establish a supervised baseline for the classifier comparison, we train a linear model with logistic regression on the embedding data.
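
A minimal sketch of this baseline (the max_iter setting and the function name are assumptions; the paper does not report hyperparameters):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def logistic_regression_baseline(X_train, y_train, X_test, y_test) -> float:
    """Supervised linear baseline on the same embedding matrices."""
    classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, classifier.predict(X_test), average="macro")
```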

5 Experiments

5.1 Cluster Classifier

The Cluster Classifier was able to retrieve some of the topic classes. Table 1 shows that the F1 scores of the Cluster Classifier are not great, but they are also not random.

Most of the classes ended up being mapped to by one (or more!) of the clusters. The cluster classifier also mixed up some classes and missed some others altogether.

The full ranking for Cluster Classifier is:

  1. USE
  2. DeCLUTR
  3. SBERT
  4. LASER
  5. TfidfVectorizer

5.2 Logistic Regression Classifier

The Logistic Regression Classifier did significantly better than the Cluster Classifier. This is not surprising as it is a supervised model that has been given the classes in advance.

It is interesting to note that TfidfVectorizer performed reasonably well (better than SBERT) using this classifier. The full ranking for Logistic Regression is:

  1. USE
  2. DeCLUTR
  3. TfidfVectorizer
  4. SBERT
  5. LASER

5.2.1 SentEval

If we count the number of SentEval tasks for which each model has the highest score (see Table 2), we end up with the following ranking:

  1. DeCLUTR (10 wins)
  2. SBERT (8 wins)
  3. LASER (6 wins)
  4. USE (2 wins)

6 Analysis

The embedding models that were trained with STS tasks in mind (SBERT, USE, and DeCLUTR) scored better than the ones that were not (LASER and TfidfVectorizer). This was to be expected.

But there are also unexpected results. Universal Sentence Encoder performs the worst on SentEval, yet it obtains the best Cluster Classifier scores by a clear margin and the best (if only marginally better) Logistic Regression scores. It is unclear where this discrepancy comes from. Perhaps it is related to the fact that USE uses angular distance rather than cosine similarity. But it is difficult to dig deeper, as the paper is light on details about the training data and procedures. At the very least we can conclude that great performance on STS tasks does not automatically lead to great performance on clustering tasks.

Overall, it is encouraging that the Cluster Classifier even partially works. The datasets used in this experiment are real-life datasets. The topic classes in them were not created artificially with academic purposes in mind, but evolved out of practical considerations in real-world scenarios. Yet the various sentence embedding models have managed to find a partial overlap.

7 Conclusion

Sentence embeddings can be clustered. In real-life datasets, the resulting clusters show at least some overlap with topic classes.

This is interesting because clustering is an unsupervised analysis technique. It means that sentence embedding clusters can be used to set up unsupervised text classification, which is a major task in real-world applications such as customer feedback analysis.

References

Appendix A Supplemental Material

Task Metric SBERT DeCLUTR USE LASER
STS12 spearman 0.695 0.635 0.656 0.623
STS13 spearman 0.737 0.726 0.680 0.516
STS14 spearman 0.763 0.717 0.715 0.670
STS15 spearman 0.821 0.799 0.808 0.754
STS16 spearman 0.807 0.796 0.787 0.723
MR accuracy 79.830 84.990 75.150 74.080
CR accuracy 86.780 90.010 81.780 80.530
MPQA accuracy 86.630 88.330 87.150 88.210
SUBJ accuracy 92.680 95.240 91.800 91.480
SST2 accuracy 84.130 89.840 80.510 79.850
SST5 accuracy 45.930 48.550 42.810 44.250
TREC accuracy 89.200 91.800 92.200 89.200
MRPC accuracy 73.860 74.030 69.620 75.190
SICKEntailment accuracy 81.290 82.100 82.360 80.980
SICKRelatedness spearman 0.802 0.786 0.789 0.790
STSBenchmark spearman 0.819 0.794 0.791 0.778
Length accuracy 64.430 82.360 64.910 79.660
WordContent accuracy 73.970 62.150 70.130 40.380
Depth accuracy 31.440 34.430 27.430 39.390
TopConstituents accuracy 71.800 71.050 62.840 78.490
BigramShift accuracy 76.380 87.630 59.930 67.590
Tense accuracy 87.000 88.400 80.080 87.290
SubjNumber accuracy 83.320 85.800 74.550 90.220
ObjNumber accuracy 81.810 83.130 72.550 88.740
OddManOut accuracy 56.500 64.210 54.030 50.790
CoordinationInversion accuracy 57.080 66.290 54.300 67.840
Table 2: SentEval scores for SBERT, DeCLUTR, USE, and LASER. SBERT scores very well on STS tasks. LASER scores well on probing tasks (linguistic properties of the input). Bold is best of row. DeCLUTR scores best overall.
Model Clusters F1 Clusters Acc LogReg F1 LogReg Acc
SBERT 0.225 0.263 0.586 0.586
USE 0.321 0.345 0.618 0.618
DeCLUTR 0.239 0.276 0.613 0.612
LASER 0.087 0.113 0.438 0.446
TfidfVectorizer 0.088 0.107 0.605 0.604
Table 3: F1 macro and accuracy scores for SBERT, USE, DeCLUTR, LASER, and TfidfVectorizer on the Amazon Reviews dataset. Bold is best of column.
Model Clusters F1 Clusters Acc LogReg F1 LogReg Acc
SBERT 0.212 0.247 0.472 0.475
USE 0.256 0.286 0.511 0.515
DeCLUTR 0.206 0.243 0.506 0.508
LASER 0.112 0.138 0.398 0.402
TfidfVectorizer 0.058 0.075 0.486 0.487
Table 4: F1 macro and accuracy scores for SBERT, USE, DeCLUTR, LASER, and TfidfVectorizer on the News dataset. Bold is best of column.
  • Full results of the SentEval benchmark are available in Table 2.

  • Full results of the Cluster Classifier and Logistic Regression on the Amazon Reviews dataset are available in Table 3.

  • Full results of the Cluster Classifier and Logistic Regression on the News dataset are available in Table 4.