
A Simple and Plug-and-play Method for Unsupervised Sentence Representation Enhancement

Lingfeng Shen
Johns Hopkins University

Haiyun Jiang
Tencent AI Lab

Lemao Liu
Tencent AI Lab

Shuming Shi
Tencent AI Lab
Abstract

Generating proper embeddings of sentences in an unsupervised way is beneficial to semantic matching and retrieval problems in real-world scenarios. This paper presents Representation ALchemy (RepAL), an extremely simple post-processing method that enhances sentence representations. The basic idea of RepAL is to de-emphasize the redundant information in sentence embeddings generated by pre-trained models. Through comprehensive experiments, we show that RepAL is free of training and is a plug-and-play method that can be combined with most existing unsupervised sentence learning models. We also conduct an in-depth analysis to understand RepAL.

1 Introduction

Learning high-quality sentence embeddings is a fundamental task in the field of Natural Language Processing (NLP) Socher et al. (2011); Le and Mikolov (2014); Kiros et al. (2015); Reimers and Gurevych (2019); Gao et al. (2021). In real-world scenarios, especially when large amounts of supervised data are unavailable, an approach that provides high-quality sentence embeddings in an unsupervised paradigm is of great value.

Generally, unsupervised sentence encoders (USEs) fall into two paradigms. The first is the pre-trained language model (PTM) paradigm Devlin et al. (2019); Liu et al. (2019): PTMs are naturally usable as unsupervised sentence representation models. For example, BERT Devlin et al. (2019) and BERT-like models Liu et al. (2019); He et al. (2020); Raffel et al. (2020) are trained with self-supervised objectives such as masked language modeling and next sentence prediction. However, designing stronger PTMs for better sentence representation is extremely expensive, time-consuming, and labor-intensive. The second paradigm performs secondary training on top of PTMs, e.g., contrastive methods Reimers and Gurevych (2019); Logeswaran and Lee (2018); Gao et al. (2021), which have proved effective in further improving the quality of sentence representations. For example, SimCSE Gao et al. (2021) pulls positive pairs of sentences closer and pushes negative pairs apart in the embedding space, achieving promising performance.

This paper focuses on enhancing the sentence embeddings generated by the above two paradigms in an unsupervised way. Our basic idea is to refine sentence representations by removing redundant information at the sentence level and the corpus level. At the corpus level, information shared across the corpus can make all sentence embeddings homogeneous and thus diminish the distinctiveness between sentences. At the sentence level, trivial words within a sentence have been shown to negatively affect downstream NLP tasks such as Natural Language Inference (NLI) Mahabadi et al. (2020); Zhou and Bansal (2020) and text classification Choi et al. (2020); Qian et al. (2021).

Therefore, we propose a simple, straightforward, and effective method called Representation ALchemy (RepAL), which improves sentence representations without training or extra resources. RepAL takes as input the raw sentence representations generated by existing unsupervised sentence models and outputs refined representations by extracting and subtracting two redundant representations from different perspectives. Intuitively, it resembles alchemy in that it improves sentence representations by refinement. It is worth mentioning that RepAL can be applied to almost all USEs and serves as a plug-and-play method for sentence embedding enhancement without extra training cost. To verify this, we perform extensive experiments on both English and Chinese benchmarks, and the results demonstrate the effectiveness of the proposed RepAL.

2 Related Work

Methods for unsupervised sentence learning have been extensively explored. Early works are mainly based on the distributional hypothesis Socher et al. (2011); Le and Mikolov (2014). Hill et al. (2016) proposed to learn sentence representations from their internal structure. Pagliardini et al. (2018) then proposed Sent2Vec, a simple unsupervised model that composes sentence embeddings from word vectors.

Strong pre-trained language models Devlin et al. (2019) then emerged. Such pre-trained models have the potential to improve the quality of sentence representations. However, models like BERT exhibit strong anisotropy in their embedding space: the sentence embeddings they produce have extremely high cosine similarity to each other, leading to unsatisfactory performance on sentence embedding tasks.

Recently, contrastive learning began to play an important role in unsupervised sentence representation learning Zhang et al. (2020); Yan et al. (2021); Meng et al. (2021); Gao et al. (2021); Wang et al. (2021). Such methods are based on the assumption that high-quality embedding methods should bring similar sentences closer while pushing away dissimilar ones.

Specifically, the most relevant work is BERT-whitening Su et al. (2021a), a post-processing method; a detailed comparison between it and our work is given in Appendix A.

3 Methodology

3.1 Problem Formulation

In unsupervised sentence representation learning, we take a collection of unlabeled sentences $\{x_{i}\}_{i=1}^{n}$ and choose a suitable unsupervised sentence learning model (e.g., BERT) as the encoder $f(\cdot;\theta)$, where $\theta$ represents the trainable parameters of $f$. Specifically, a carefully designed training objective $\mathcal{L}$ is used for unsupervised training, and $\theta$ is then fixed at $\theta_{0}=\operatorname{argmin}_{\theta}\mathcal{L}$. Finally, we obtain the sentence representation $v_{i}$ for $x_{i}$ by feeding it into the encoder, i.e., $v_{i}=f(x_{i};\theta_{0})$.

RepAL refines $v_{i}$ into $v_{i}^{\prime}=g(v_{i})$ instead of directly using $v_{i}$ as the sentence representation. RepAL aims to extract and remove two types of redundancy: sentence-level redundancy and corpus-level redundancy. Sentence-level redundancy denotes the useless word information hidden in the target sentence, which may bias the representation away from the core semantics of the sentence. Corpus-level redundancy denotes the redundant information shared by all sentence representations within the dataset, which makes the representations homogeneous and thus reduces their distinctiveness.

RepAL generates $x_{i}^{*}$ through an operation called partial mask on $x_{i}$, then feeds $x_{i}^{*}$ into the encoder $f(\cdot;\theta_{0})$ to obtain the sentence-level redundancy embedding $v_{i}^{*}$. Besides, RepAL produces a global vector $\hat{v}$ as the corpus-level redundancy embedding. Finally, RepAL generates the refined embedding $v_{i}^{\prime}$ for downstream tasks through an embedding refinement operation that combines $v_{i}$, $v_{i}^{*}$, and $\hat{v}$.

3.2 Redundant Embedding Generation

In RepAL, we first detect redundant information and generate its embeddings from the target sentence; this step is central to our method and largely determines its performance. We also conduct deeper analyses of the two kinds of redundancy, which are deferred to Appendix C.

3.2.1 Sentence-level Redundancy

We apply a partial mask to extract the sentence-level redundancy. Specifically, given a sentence $x_{i}=\{w_{1},w_{2},\dots,w_{N}\}$, the partial mask produces a partially masked sentence $x_{i}^{*}$, a masked version of $x_{i}$ in which the informative words are replaced with [MASK] so as to distill the trivial words from the sentence. We judge words as informative according to their TF-IDF Luhn (1958); Jones (1972) values computed on a general corpus (in RepAL, we use the default TF-IDF-based keyword extraction in the Jieba toolkit). We then take $f(x_{i}^{*})$, the encoding of the partially masked sentence in which only the keywords of $x_{i}$ are masked, as the corresponding redundant embedding. Since the model is forced to see only the non-masked context words, $f(x_{i}^{*})$ actually encodes the information of the trivial words (non-keywords), and the sentence-level redundant embedding is defined as follows:

$x_{i}^{*}=\text{PartialMask}(x_{i},\text{keyword});\quad v_{i}^{*}=f(x_{i}^{*})$ (1)

where $v_{i}^{*}$ is the sentence-level redundant embedding of $x_{i}$. Subtracting $v_{i}^{*}$ from the representation of $x_{i}$ de-emphasizes the influence of trivial words when producing the sentence embedding.
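The following is a minimal sketch of the partial-mask step, assuming a HuggingFace-style BERT encoder and Jieba's TF-IDF keyword extractor; the helper names, the mean-pooling choice, and the example sentence are illustrative assumptions, not the authors' released code.

```python
# Sketch of sentence-level redundancy extraction (Eq. 1); all helpers are hypothetical.
import jieba
import jieba.analyse
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def partial_mask(sentence: str, top_k: int = 5) -> str:
    """Replace the TF-IDF keywords of a sentence with [MASK] tokens."""
    keywords = set(jieba.analyse.extract_tags(sentence, topK=top_k))
    tokens = jieba.lcut(sentence)
    return "".join("[MASK]" if tok in keywords else tok for tok in tokens)

@torch.no_grad()
def encode(sentence: str) -> torch.Tensor:
    """Mean-pooled sentence embedding f(x); the pooling strategy is an assumption."""
    inputs = tokenizer(sentence, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

sentence = "花呗分期付款的手续费怎么计算"    # hypothetical example sentence
v_i = encode(sentence)                      # original embedding f(x_i)
v_i_star = encode(partial_mask(sentence))   # sentence-level redundant embedding v_i*
```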

3.2.2 Corpus-level Redundancy

Given a sentence set $\mathcal{X}=\{x_{i}\}_{i=1}^{n}$, we feed all the sentences into the encoder $f$ and take the average embedding $\hat{v}$ as the corpus-level redundant embedding, formally defined as follows:

$\hat{v}=\frac{\sum_{i=1}^{n}f(x_{i})}{n}$ (2)

where $\hat{v}$ is the corpus-level redundant embedding. Subtracting $\hat{v}$ from each sentence's representation makes the representations more distinguishable from each other.
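A minimal sketch of this averaging step (Eq. 2), reusing the hypothetical encode() helper from the previous sketch:

```python
# Corpus-level redundant embedding: average of all sentence embeddings.
import torch

def corpus_redundancy(sentences: list[str]) -> torch.Tensor:
    embeddings = torch.stack([encode(s) for s in sentences])  # (n, dim)
    return embeddings.mean(dim=0)                             # v_hat

v_hat = corpus_redundancy(["sentence 1", "sentence 2", "sentence 3"])
```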

3.3 Embedding Refinement

After generating the redundant embeddings, the embedding refinement operation is a conceptually simple subtraction, defined as follows:

$v_{i}^{\prime}=f(x_{i})-\lambda_{1}\cdot v_{i}^{*}-\lambda_{2}\cdot\hat{v}$ (3)

where $f(x_{i})$ is the original embedding of $x_{i}$, and $v_{i}^{*}$ and $\hat{v}$ are the redundant embeddings at the two levels, respectively. Since the two redundant embeddings typically do not contribute equally to the embedding $v_{i}$, we introduce two independent hyper-parameters $\lambda_{1}$ and $\lambda_{2}$ to balance the two terms.
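Putting the pieces together, a minimal sketch of the refinement in Eq. 3 using the hypothetical helpers above; the default weights are placeholders, since the paper tunes them on a dev set:

```python
# RepAL refinement (Eq. 3); lambda1/lambda2 values here are placeholders.
def repal_refine(sentence: str, v_hat: torch.Tensor,
                 lambda1: float = 0.3, lambda2: float = 0.5) -> torch.Tensor:
    v_i = encode(sentence)                      # original embedding
    v_i_star = encode(partial_mask(sentence))   # sentence-level redundancy
    return v_i - lambda1 * v_i_star - lambda2 * v_hat  # refined embedding v_i'
```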

Baseline ATEC BQ LCQMC PAWSX STS-B Avg
BERT 16.51\rightarrow19.58 29.35\rightarrow32.89 41.71\rightarrow44.53 9.84\rightarrow11.28 34.65\rightarrow47.00 26.41\rightarrow31.06(+4.65)
RoBERTa 24.61\rightarrow27.00 40.54\rightarrow39.51 70.55\rightarrow70.98 16.23\rightarrow16.98 63.55\rightarrow64.01 43.10\rightarrow43.70(+0.60)
RoFormer 24.29\rightarrow25.07 41.91\rightarrow42.56 64.87\rightarrow65.33 20.15\rightarrow20.13 56.65\rightarrow57.23 41.57\rightarrow42.06(+0.49)
NEZHA 17.39\rightarrow18.98 29.63\rightarrow30.53 40.60\rightarrow41.85 14.90\rightarrow15.43 35.84\rightarrow36.68 27.67\rightarrow28.69(+1.02)
W-BERT 20.61\rightarrow23.29 25.76\rightarrow29.83 48.91\rightarrow50.01 16.82\rightarrow16.96 61.19\rightarrow61.46 34.66\rightarrow36.31(+1.65)
W-RoBERTa 29.59\rightarrow30.44 28.95\rightarrow43.12 70.82\rightarrow71.39 17.99\rightarrow18.48 69.19\rightarrow70.92 43.31\rightarrow46.87(+2.56)
W-RoFormer 26.04\rightarrow27.68 28.13\rightarrow42.63 60.92\rightarrow61.55 23.08\rightarrow23.05 66.96\rightarrow67.13 41.03\rightarrow44.38(+3.35)
W-NEZHA 18.83\rightarrow21.33 21.94\rightarrow23.02 50.52\rightarrow52.01 18.15\rightarrow19.00 60.84\rightarrow60.82 34.06\rightarrow35.24(+1.18)
C-BERT 26.35\rightarrow28.69 46.68\rightarrow48.02 69.22\rightarrow69.98 10.89\rightarrow12.03 68.89\rightarrow69.66 44.41\rightarrow45.68(+1.27)
C-RoBERTa 27.39\rightarrow28.43 47.20\rightarrow47.14 67.34\rightarrow67.98 09.36\rightarrow10.55 72.02\rightarrow71.80 44.66\rightarrow45.18(+0.52)
C-RoFormer 26.24\rightarrow27.68 47.13\rightarrow47.63 66.92\rightarrow67.85 11.08\rightarrow11.65 69.84\rightarrow69.73 44.24\rightarrow44.91(+0.67)
C-NEZHA 26.02\rightarrow26.73 47.44\rightarrow48.02 70.02\rightarrow70.63 11.46\rightarrow11.80 68.97\rightarrow69.53 44.78\rightarrow45.34(+0.56)
Sim-BERT 33.14\rightarrow33.48 50.67\rightarrow51.14 69.99\rightarrow72.44 12.95\rightarrow13.58 69.04\rightarrow69.55 47.16\rightarrow48.04(+0.88)
Sim-RoBERTa 32.23\rightarrow33.10 50.61\rightarrow51.53 74.22\rightarrow74.77 12.25\rightarrow13.28 71.13\rightarrow72.20 48.09\rightarrow48.98(+0.89)
Sim-RoFormer 32.33\rightarrow32.59 49.13\rightarrow49.46 71.61\rightarrow72.13 15.25\rightarrow15.69 69.45\rightarrow70.01 47.55\rightarrow48.02(+0.47)
Sim-NEZHA 32.14\rightarrow32.52 46.08\rightarrow47.42 60.38\rightarrow60.51 16.60\rightarrow16.58 68.50\rightarrow69.19 44.74\rightarrow45.26(+0.52)
Table 1: The experimental results of RepAL on Chinese semantic similarity benchmarks. The numbers before \rightarrow indicate the performance without RepAL and the numbers after \rightarrow mean the performance with RepAL. Blue numbers indicate RepAL improves the baseline.

4 Experiments

This section shows that our method can be adapted to various USEs and improves their performance.

4.1 Baselines and Benchmarks

To verify the effectiveness of our method, we evaluate RepAL on both Chinese and English benchmarks. To investigate whether our method can be applied to various unsupervised sentence encoders (USEs), we choose two kinds of encoders: vanilla USEs and secondary-trained USEs. For vanilla USEs, we select BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), RoFormer Su et al. (2021b), and NEZHA Wei et al. (2019) for Chinese; for English, we select BERTbase, BERTlarge Devlin et al. (2019), and RoBERTabase Reimers and Gurevych (2019). Specifically, we name the secondary-trained USEs equipped with whitening Huang et al. (2021), ConSERT Yan et al. (2021), and SimCSE Gao et al. (2021) as W-USE (e.g., W-BERT), C-USE (e.g., C-BERT), and Sim-USE (e.g., Sim-BERT), respectively. The results of Sim-USE and C-USE are from our own implementation. Details of training SimCSE on Chinese benchmarks are deferred to Appendix B.

  • Chinese: We select five Chinese benchmarks for evaluation: ATEC, LCQMC, BQ, PAWSX, and STS-B. Details about them are deferred to Appendix B.

  • English: We select STS task benchmarks as our English datasets, including STS 2012-2016 tasks Agirre et al. (2012, 2013, 2014, 2015, 2016), the STS benchmark Cer et al. (2017) and the SICK-Relatedness dataset Marelli et al. (2014).

4.2 Experimental Settings

The vanilla USEs in our experiments use the same settings as in their original papers, and we likewise keep the settings of the other baselines the same as their original ones. As for hyper-parameters, we follow previous unsupervised work Gao et al. (2021); Yan et al. (2021) and tune $\lambda_{1}$ and $\lambda_{2}$ on the STS-B dev set. Results are evaluated with Spearman correlation.
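A minimal sketch of how such tuning could look, assuming the hypothetical repal_refine() helper from Section 3 and placeholder dev data (dev_pairs, gold_scores); the grid values are illustrative, not the settings used in the paper:

```python
# Grid search over (lambda1, lambda2) using Spearman correlation on a dev set.
import itertools
import torch
from scipy.stats import spearmanr

def spearman_on_dev(dev_pairs, gold_scores, v_hat, lambda1, lambda2):
    preds = []
    for s1, s2 in dev_pairs:
        e1 = repal_refine(s1, v_hat, lambda1, lambda2)
        e2 = repal_refine(s2, v_hat, lambda1, lambda2)
        preds.append(torch.cosine_similarity(e1, e2, dim=0).item())
    return spearmanr(preds, gold_scores).correlation

def grid_search(dev_pairs, gold_scores, v_hat, grid=(0.0, 0.1, 0.3, 0.5, 0.7, 1.0)):
    # Returns the (lambda1, lambda2) pair with the best dev Spearman correlation.
    return max(itertools.product(grid, grid),
               key=lambda lam: spearman_on_dev(dev_pairs, gold_scores, v_hat, *lam))
```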

Baseline STS-12 STS-13 STS-14 STS-15 STS-16 Avg
BERT 57.86\rightarrow59.55 61.97\rightarrow66.20 62.49\rightarrow65.19 70.96\rightarrow73.50 69.76\rightarrow72.10 63.69\rightarrow66.70(+3.01)
BERTl 57.74\rightarrow59.90 61.16\rightarrow66.20 61.18\rightarrow65.62 68.06\rightarrow73.01 70.30\rightarrow74.72 62.62\rightarrow67.47(+4.85)
RoBERTa 58.52\rightarrow60.88 56.21\rightarrow62.20 60.12\rightarrow64.10 69.12\rightarrow71.41 63.69\rightarrow69.94 60.59\rightarrow65.41 (+4.82)
W-BERT 63.62\rightarrow64.50 73.02\rightarrow73.69 69.23\rightarrow69.69 74.52\rightarrow74.69 72.15\rightarrow76.11 69.21\rightarrow70.39 (+1.18)
W-BERTl 63.62\rightarrow63.90 73.02\rightarrow73.41 69.23\rightarrow70.01 74.52\rightarrow75.18 72.15\rightarrow75.89 69.21\rightarrow70.39 (+1.18)
W-RoBERTa 68.18\rightarrow68.85 62.21\rightarrow63.03 67.13\rightarrow67.69 67.63\rightarrow68.23 74.78\rightarrow75.44 67.17\rightarrow68.43 (+1.26)
C-BERT 64.09\rightarrow65.01 78.21\rightarrow78.54 68.68\rightarrow69.04 79.56\rightarrow79.90 75.41\rightarrow75.74 72.27\rightarrow72.69 (+0.42)
C-BERTl 70.23\rightarrow70.70 82.13\rightarrow82.54 73.60\rightarrow74.12 81.72\rightarrow82.01 77.01\rightarrow77.58 76.03\rightarrow76.48 (+0.45)
Sim-BERT 68.93\rightarrow69.33 78.68\rightarrow78.93 73.57\rightarrow73.95 79.68\rightarrow80.01 79.11\rightarrow79.29 75.11\rightarrow75.44 (+0.33)
Sim-BERTl 69.25\rightarrow69.60 78.96\rightarrow79.30 73.64\rightarrow73.92 80.06\rightarrow80.31 79.08\rightarrow79.42 75.31\rightarrow75.61 (+0.30)
Table 2: The experimental results of RepAL on English semantic similarity benchmarks. ‘Avg’ indicates the average performance of all English benchmarks including STS-B and SICK-R in Table 3, and BERTl means BERTlarge during the experiments.
Baseline STS-B SICK-R
BERT 59.04\rightarrow 66.35 63.75\rightarrow 64.55
BERTl 59.59\rightarrow 68.21 60.34\rightarrow 64.61
RoBERTa 55.16\rightarrow 65.75 61.33\rightarrow 63.61
W-BERT 71.34\rightarrow 71.45 60.60\rightarrow 62.61
W-BERTl 71.34\rightarrow 69.56 60.60\rightarrow 65.00
W-RoBERTa 71.43\rightarrow 72.03 58.80\rightarrow 63.95
C-BERT 73.12\rightarrow 73.45 66.79\rightarrow 67.15
C-BERTl 77.48\rightarrow 77.91 70.02\rightarrow 70.51
Sim-BERT 75.71\rightarrow 76.00 70.12\rightarrow 70.51
Sim-BERTl 75.84\rightarrow 76.11 70.34\rightarrow 70.61
Table 3: The results of RepAL on STS-B and SICK-R

4.3 Results

As shown in Table 1, RepAL improves the baselines' performance in most cases. For example, RepAL brings 4.65%, 1.65%, 1.27%, and 0.88% improvements to BERT, W-BERT, C-BERT, and Sim-BERT, respectively. Generally, as the USE becomes stronger, the improvement brought by RepAL decreases. Still, RepAL makes progress over strong baselines such as C-BERT and Sim-BERT, achieving 1.27% and 0.88% performance increases, which indicates its effectiveness even on very strong baselines. The results on the English benchmarks are listed in Tables 2 and 3, where RepAL also obtains improvements over various USEs. Overall, the results on both Chinese and English benchmarks demonstrate the effectiveness of RepAL and illustrate that it is a plug-and-play method for sentence representation enhancement.

4.4 Ablation study

This section investigates the individual effect of the embedding refinement at the two levels. As shown in Table 4, each operation is beneficial, and combining them leads to stronger performance.

BERT W-BERT C-BERT S-BERT
RepAL 31.06 36.31 45.68 48.04
w/o Sen 29.28 34.03 43.49 46.09
w/o Cor 30.42 35.78 45.03 47.63
Table 4: Ablation studies of RepAL on Chinese benchmarks. ‘S-BERT’ refers to ‘Sim-BERT’.

5 Conclusion

In this paper, we propose RepAL, a universal method for unsupervised sentence representation enhancement. Based on the idea of de-emphasizing redundant information, RepAL extracts and then removes redundant information from sentence embeddings at the sentence level and the corpus level. Through a simple embedding refinement operation, RepAL achieves improvements on both Chinese and English benchmarks and proves to be a simple, plug-and-play addition to modern techniques for unsupervised sentence representation.

6 Limitation

RepAL is a universal method for sentence representation, and sentence representations are eventually used for downstream tasks. However, RepAL does not consider task-specific information, even though different downstream tasks may have different preferences. Therefore, exploring task-specific modifications of RepAL is a future direction.

References

  • Agirre et al. (2015) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 252–263.
  • Agirre et al. (2014) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 81–91.
  • Agirre et al. (2016) Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.
  • Agirre et al. (2012) Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.
  • Agirre et al. (2013) Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. * sem 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (* SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, pages 32–43.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
  • Choi et al. (2020) Seungtaek Choi, Haeju Park, Jinyoung Yeo, and Seung-won Hwang. 2020. Less is more: Attention supervision with counterfactuals for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6695–6704.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
  • Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.
  • Huang et al. (2021) Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, and Nan Duan. 2021. Whiteningbert: An easy unsupervised sentence embedding approach. arXiv preprint arXiv:2104.01767.
  • Jones (1972) Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
  • Luhn (1958) Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165.
  • Mahabadi et al. (2020) Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. 2020. End-to-end bias mitigation by modelling biases in corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8706–8716.
  • Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In Lrec, pages 216–223. Reykjavik.
  • Meng et al. (2021) Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. Coco-lm: Correcting and contrasting text sequences for language model pretraining.
  • Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540.
  • Qian et al. (2021) Chen Qian, Fuli Feng, Lijie Wen, Chunping Ma, and Pengjun Xie. 2021. Counterfactual inference for text classification debiasing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5434–5445.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
  • Socher et al. (2011) Richard Socher, Eric Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in neural information processing systems, 24.
  • Su et al. (2021a) Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021a. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.
  • Su et al. (2021b) Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. 2021b. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  • Wang et al. (2021) Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. Cline: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2332–2342.
  • Wei et al. (2019) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. 2019. Nezha: Neural contextualized representation for chinese language understanding. arXiv preprint arXiv:1909.00204.
  • Yan et al. (2021) Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.
  • Yang et al. (2019) Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. Paws-x: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692.
  • Zhang et al. (2020) Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610.
  • Zhou and Bansal (2020) Xiang Zhou and Mohit Bansal. 2020. Towards robustifying nli models against lexical dataset biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8759–8771.

Appendix A Difference from BERT-whitening

Several post-processing methods have been proposed to improve the quality of contextual sentence embeddings and address the anisotropy problem. The post-processing paradigm aims to enhance sentence embeddings through simple and efficient operations without extra training or data. The most prominent such method is whitening Huang et al. (2021), which transforms sentence embeddings into Gaussian-like embeddings and has proved effective for improving sentence embeddings. It is the work most relevant to ours, since our corpus-level refinement is similar to the average-embedding subtraction in whitening. However, there are three principal differences between the two. First, the motivation is different: whitening aims at transforming sentence embeddings into Gaussian-like embeddings so that distances can be measured on an orthogonal basis, whereas our method starts from the perspective of redundancy refinement and aims to diminish the impact of trivial words within a sentence during similarity calculation. Second, the methodology is different: our method additionally employs a partial mask to filter the redundancy and introduces weight factors to control its impact during embedding refinement. Lastly, our in-depth analysis shows that our method reduces the upper bound of the largest eigenvalue of the embedding matrix as well as the impact of trivial words, effects that differ from those of whitening.

Appendix B Details of Chinese benchmarks and training Chinese SimCSE

(1) ATEC: a semantic similarity dataset related to customer service; (2) LCQMC: a question matching dataset covering multiple domains; (3) BQ: a question matching dataset related to banking and finance; (4) PAWSX Yang et al. (2019): a multilingual dataset of paraphrase and non-paraphrase pairs, of which we use the Chinese part; (5) STS-B: a Chinese benchmark labeled with the semantic relatedness between two sentences.

To train SimCSE on these benchmarks, we remove the labels from ATEC, LCQMC, BQ, PAWSX, and STS-B and merge them into an unsupervised corpus. SimCSE is then trained on the merged corpus, and the best checkpoint is selected according to the STS-B dev set, following the same setting as previous works Gao et al. (2021); Yan et al. (2021). Specifically, we find that the best dropout rate $r$ on the Chinese corpus is about 0.3.

Appendix C Detailed Analysis and Discussion

The proposed RepAL enhances sentence embeddings by filtering redundant information at two levels: the sentence level and the corpus level. Beyond the overall experimental results and analysis presented above, the intrinsic properties of RepAL remain unclear. In this section, we illustrate why RepAL is effective in enhancing sentence embeddings.

In Sec. C.1, we provide evidence of the impact of trivial words on sentence embeddings and show the capacity of our sentence-level embedding refinement. In Sec. C.2, we show why the corpus-level embedding refinement enhances sentence embeddings and illustrate the relation between the largest eigenvalue and performance.

C.1 Sentence-level Refinement

We conduct analyses for sentence-level refinement (SR) as follows: we investigate the impact of trivial words with and without SR, which explains the necessity of removing such redundant information and validates the effectiveness of SR.

We first define the importance $H$ of a word $w\in x_{i}$ in semantic similarity calculation as follows:

$H(x_{i},x^{-}_{i};w)=\mathrm{Sim}(x_{i},x^{-}_{i})-\mathrm{Sim}(x_{i}/w,\,x^{-}_{i})$ (4)

where $x_{i}$ and $x^{-}_{i}$ are a pair of sentences and $x_{i}/w$ means deleting the word $w$ from $x_{i}$. Note that we do not consider the words in $x^{-}_{i}$, since doing so would be equivalent to evaluating on more sentences. We then define the set of trivial words within $x_{i}$, denoted $S(x_{i})$, as the words left unmasked by Jieba (i.e., the non-keywords). Thus we can define the redundancy overlap ratio $r(p_{i})$ of a sentence pair $p_{i}=(x_{i},x^{-}_{i})$ as follows:

$r(p_{i})=\frac{|S(x_{i})\cap T(x_{i})|}{|T(x_{i})|}$ (5)

where $T(x_{i})$ is the set of the top-5 words with the highest importance $H$ in $x_{i}$. $r(p_{i})$ reflects the impact of trivial words on the semantic similarity of the pair $p_{i}$: a higher $r(p_{i})$ indicates that more trivial words are important for the similarity calculation. We randomly sample 300 sentence pairs from STS-B Cer et al. (2017), select BERT as the USE, and calculate the average redundancy overlap ratio $\hat{r}=\frac{\sum_{i=1}^{N}r(p_{i})}{N}$ with and without SR. The results show that $\hat{r}$ reaches 10.2% without SR and drops to 7.1% after applying SR. (Note that $\hat{r}$ changes because the inputs to the similarity calculation change when SR is active: with SR, Eq. 4 becomes $H(x_{i},x^{-}_{i};w)=\mathrm{Sim}(G(x_{i}),G(x^{-}_{i}))-\mathrm{Sim}(G(x_{i}/w),G(x^{-}_{i}))$, where $G(\cdot)$ denotes the SR operation.) The results demonstrate that SR diminishes the impact of trivial words when measuring semantic similarity.
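A minimal sketch of how $H$ and $r$ could be computed, reusing the hypothetical encode() helper and Jieba keyword extractor from the earlier sketches (this is the no-SR variant; with SR one would encode the refined embeddings instead):

```python
# Word importance H (Eq. 4) and redundancy overlap ratio r (Eq. 5); helpers are hypothetical.
import jieba
import jieba.analyse
import torch

def cos(a, b):
    return torch.cosine_similarity(a, b, dim=0).item()

def importance(x, x_neg, w):
    """H(x, x_neg; w): similarity drop caused by deleting word w from x."""
    x_without_w = "".join(t for t in jieba.lcut(x) if t != w)
    return cos(encode(x), encode(x_neg)) - cos(encode(x_without_w), encode(x_neg))

def overlap_ratio(x, x_neg, top_n=5):
    """r(p): fraction of the top-n most important words that are trivial (non-keywords)."""
    words = jieba.lcut(x)
    trivial = set(words) - set(jieba.analyse.extract_tags(x, topK=5))  # S(x)
    top = sorted(words, key=lambda w: importance(x, x_neg, w), reverse=True)[:top_n]  # T(x)
    return len([w for w in top if w in trivial]) / len(top)
```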

Moreover, we select some representative words and evaluate their importance with and without RepAL. As shown in Table 5, SR indeed diminishes the impact of such trivial words when calculating semantic similarity.

Word No Refinement With Refinement Δ
the 1.02 0.56 -0.46
a 0.98 0.43 -0.55
to 0.59 0.32 -0.27
in 0.68 0.21 -0.47
some 0.60 0.31 -0.29
with 0.72 0.24 -0.48
and 0.99 0.61 -0.38
Table 5: The importance of trivial words with and without sentence-level refinement. Δ denotes the change in importance.

C.2 Corpus-level Refinement

To investigate whether corpus-level refinement reduces the upper bound of the largest eigenvalue of the embedding matrix $E^{*}$, we conduct numerical experiments to examine the relationship between performance (Spearman correlation), $\lambda$, and the upper bound of the largest eigenvalue of $E^{*}$.

Specifically, we run the experiments on six English benchmarks with BERTbase. As shown in Figure LABEL:figure:lamda, when the performance peaks, the upper bound of the largest eigenvalue of the embedding matrix $E^{*}$ is near its minimum, showing that the two coincide. These numerical results indicate that corpus-level refinement enhances sentence embeddings because it reduces the largest eigenvalue of $E^{*}$. The previous method Huang et al. (2021) is equivalent to subtracting the average vector with $\lambda=1.0$, which does not fully suppress the largest eigenvalue of the embedding matrix. In contrast, our method subtracts the average vector with an adaptive weight $\lambda$, further suppressing the upper bound of the largest eigenvalue of the embedding matrix. The results show that average-embedding subtraction needs an adaptive weight, which also explains why our method can still bring substantial improvements over W-BERT.
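A minimal sketch of the quantity tracked in this analysis, under the assumption that the "largest eigenvalue of the embedding matrix" refers to the largest eigenvalue of $E^{\top}E$ (i.e., the squared top singular value of $E$); the toy data and the lambda sweep below are illustrative only:

```python
# Largest eigenvalue of E^T E before/after corpus-level refinement with weight lambda.
import numpy as np

def largest_eigenvalue(E: np.ndarray) -> float:
    """E has shape (n_sentences, dim); return the largest eigenvalue of E^T E."""
    top_singular_value = np.linalg.svd(E, compute_uv=False)[0]
    return float(top_singular_value ** 2)

def refined_matrix(E: np.ndarray, lam: float) -> np.ndarray:
    """Corpus-level refinement: subtract lam * average embedding from each row."""
    return E - lam * E.mean(axis=0, keepdims=True)

# Sweep lambda over an embedding matrix (toy random data stands in for real embeddings).
E = np.random.randn(300, 768)
for lam in (0.0, 0.5, 1.0, 1.5):
    print(lam, largest_eigenvalue(refined_matrix(E, lam)))
```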