
A Simple and Plug-and-play Method for Unsupervised Sentence Representation Enhancement

Lingfeng Shen
Johns Hopkins University

Haiyun Jiang
Tencent AI Lab

Lemao Liu
Tencent AI Lab

Shuming Shi
Tencent AI Lab
Abstract

Generating proper embeddings of sentences in an unsupervised way is beneficial to semantic matching and retrieval problems in real-world scenarios. This paper presents Representation ALchemy (RepAL), an extremely simple post-processing method that enhances sentence representations. The basic idea of RepAL is to de-emphasize the redundant information in sentence embeddings generated by pre-trained models. Through comprehensive experiments, we show that RepAL is free of training and is a plug-and-play method that can be combined with most existing unsupervised sentence learning models. We also conduct an in-depth analysis to understand RepAL.

1 Introduction

Learning high-quality sentence embeddings is a fundamental task in the field of Natural Language Processing (NLP) Socher et al. (2011); Le and Mikolov (2014); Kiros et al. (2015); Reimers and Gurevych (2019); Gao et al. (2021). In real-world scenarios, especially when large amounts of supervised data are unavailable, an approach that provides high-quality sentence embeddings in an unsupervised paradigm is of great value.

Generally, unsupervised sentence encoders (USEs) fall into two paradigms. The first is the pre-trained language model (PTM) paradigm Devlin et al. (2019); Liu et al. (2019): PTMs are naturally usable as unsupervised sentence representation models. For example, BERT Devlin et al. (2019) and BERT-like models Liu et al. (2019); He et al. (2020); Raffel et al. (2020) are trained with self-supervised objectives such as masked language modeling and next sentence prediction. However, designing stronger PTMs for better sentence representation is extremely expensive, time-consuming, and labor-intensive. The second paradigm performs secondary training on top of PTMs, e.g., contrastive methods Reimers and Gurevych (2019); Logeswaran and Lee (2018); Gao et al. (2021), which have proved effective in further improving the quality of sentence representations. For example, SimCSE Gao et al. (2021) pulls positive pairs of sentences closer and pushes negative pairs apart in the embedding space, achieving promising performance.

This paper focuses on enhancing the sentence embeddings generated by the above two paradigms in an unsupervised way. Our basic idea is to refine sentence representations by removing redundant information at the sentence level and the corpus level. At the corpus level, information shared across the corpus can make all sentence embeddings homogeneous and thus diminish the distinctiveness between sentences. At the sentence level, trivial words within a sentence have been shown to negatively affect downstream NLP tasks such as Natural Language Inference (NLI) Mahabadi et al. (2020); Zhou and Bansal (2020) and text classification Choi et al. (2020); Qian et al. (2021).

Therefore, we propose a simple, straightforward, and effective method called Representation ALchemy (RepAL), which improves sentence representations without training or extra resources. RepAL takes as input the raw sentence representations generated by existing unsupervised sentence models and outputs refined representations by extracting and subtracting two redundant representations from different perspectives. Intuitively, it resembles alchemy in that it improves sentence representations by refinement. It is worth mentioning that RepAL can be applied to almost all USEs and serves as a plug-and-play method for sentence embedding enhancement without extra training cost. To verify this, we perform extensive experiments on both English and Chinese benchmarks, and the results demonstrate the effectiveness of the proposed RepAL.

2 Related Work

Methods for unsupervised sentence learning have been extensively explored. Early works are mainly based on the distributional hypothesis Socher et al. (2011); Le and Mikolov (2014). Hill et al. (2016) proposed to learn sentence representations from their internal structure. Pagliardini et al. (2018) then proposed Sent2Vec, a simple unsupervised model that composes sentence embeddings from word vectors.

Strong pre-trained language models Devlin et al. (2019) then emerged. Such pre-trained models have the potential to improve the quality of sentence representations. However, models like BERT exhibit strong anisotropy in their embedding space: the sentence embeddings they produce have extremely high cosine similarity to each other, leading to unsatisfactory performance on sentence embedding tasks.

Recently, contrastive learning began to play an important role in unsupervised sentence representation learning Zhang et al. (2020); Yan et al. (2021); Meng et al. (2021); Gao et al. (2021); Wang et al. (2021). Such methods are based on the assumption that high-quality embedding methods should bring similar sentences closer while pushing away dissimilar ones.

Specifically, the most relevant work is BERT-whitening Su et al. (2021a), a post-processing method; a detailed comparison between it and our work is given in Appendix A.

3 Methodology

3.1 Problem Formulation

In unsupervised sentence representation learning, we take a collection of unlabeled sentences $\{x_{i}\}_{i=1}^{n}$ and choose a suitable unsupervised sentence learning model (e.g., BERT) as the encoder $f(\cdot;\theta)$, where $\theta$ represents the trainable parameters of $f$. Specifically, a carefully designed training objective $\mathcal{L}$ is used for unsupervised training, and $\theta$ is then fixed at $\theta_{0}=\operatorname{argmin}_{\theta}\mathcal{L}$. Finally, we obtain the sentence representation $v_{i}$ for $x_{i}$ by feeding it into the encoder, i.e., $v_{i}=f(x_{i};\theta_{0})$.

RepAL refines $v_{i}$ into $v_{i}^{\prime}=g(v_{i})$ instead of directly using $v_{i}$ as the sentence representation. RepAL aims to extract and remove two types of redundancy: sentence-level redundancy and corpus-level redundancy. Sentence-level redundancy denotes the useless word information hidden in the target sentence, which may bias the representation away from the core semantics of the sentence. Corpus-level redundancy denotes the redundant information shared by all sentence representations within the dataset, which makes the representations homogeneous and thus reduces their distinctiveness.

RepAL generates $x_{i}^{*}$ through an operation called partial mask on $x_{i}$, then feeds $x_{i}^{*}$ into the encoder $f(\cdot;\theta_{0})$ to obtain the sentence-level redundancy embedding $v_{i}^{*}$. Besides, RepAL produces a global vector $\hat{v}$ as the corpus-level redundancy embedding. Finally, RepAL generates the refined embedding $v_{i}^{\prime}$ for downstream tasks through an embedding refinement operation that combines $v_{i}$, $v_{i}^{*}$, and $\hat{v}$.

3.2 Redundant Embedding Generation

In RepAL, we first detect redundant information and generate its embeddings from the target sentence; this step is central to our method and largely determines its performance. We also conduct deeper analyses of the two kinds of redundancy, which are deferred to Appendix C.

3.2.1 Sentence-level Redundancy

We apply a partial mask to extract the sentence-level redundancy. Specifically, given a sentence $x_{i}=\{w_{1},w_{2},\dots,w_{N}\}$, the partial mask produces a partially masked sentence $x_{i}^{*}$, a masked version of $x_{i}$ in which the informative words are replaced with [MASK] so as to distill the trivial words from the sentence. We judge words as informative according to their TF-IDF Luhn (1958); Jones (1972) values computed on a general corpus (in RepAL, we use the default TF-IDF-based keyword extraction in the Jieba toolkit). We then take $f(x_{i}^{*})$, the encoding of the partially masked sentence in which only the keywords of $x_{i}$ are masked, as the corresponding redundant embedding. Since the model is forced to see only the non-masked context words, $f(x_{i}^{*})$ actually encodes the information of the trivial words (non-keywords), and the sentence-level redundant embedding is defined as follows:

$x_{i}^{*}=\text{PartialMask}(x_{i},\text{keyword});\quad v_{i}^{*}=f(x_{i}^{*})$ (1)

where $v_{i}^{*}$ is the sentence-level redundant embedding of $x_{i}$. Subtracting $v_{i}^{*}$ from the representation of $x_{i}$ de-emphasizes the influence of trivial words when producing the sentence embedding.
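The following is a minimal sketch of the partial-mask step, assuming a HuggingFace-style BERT encoder and Jieba's TF-IDF keyword extractor; the helper names, the mean-pooling choice, and the example sentence are illustrative assumptions, not the authors' released code.

```python
# Sketch of sentence-level redundancy extraction (Eq. 1); all helpers are hypothetical.
import jieba
import jieba.analyse
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def partial_mask(sentence: str, top_k: int = 5) -> str:
    """Replace the TF-IDF keywords of a sentence with [MASK] tokens."""
    keywords = set(jieba.analyse.extract_tags(sentence, topK=top_k))
    tokens = jieba.lcut(sentence)
    return "".join("[MASK]" if tok in keywords else tok for tok in tokens)

@torch.no_grad()
def encode(sentence: str) -> torch.Tensor:
    """Mean-pooled sentence embedding f(x); the pooling strategy is an assumption."""
    inputs = tokenizer(sentence, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

sentence = "花呗分期付款的手续费怎么计算"    # hypothetical example sentence
v_i = encode(sentence)                      # original embedding f(x_i)
v_i_star = encode(partial_mask(sentence))   # sentence-level redundant embedding v_i*
```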

3.2.2 Corpus-level Redundancy

Given a sentence set $\mathcal{X}=\{x_{i}\}_{i=1}^{n}$, we feed all the sentences into the encoder $f$ and take the average embedding $\hat{v}$ as the corpus-level redundant embedding, formally defined as follows:

$\hat{v}=\frac{\sum_{i=1}^{n}f(x_{i})}{n}$ (2)

where $\hat{v}$ is the corpus-level redundant embedding. Subtracting $\hat{v}$ from each sentence's representation makes the representations more distinguishable from each other.
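A minimal sketch of this averaging step (Eq. 2), reusing the hypothetical encode() helper from the previous sketch:

```python
# Corpus-level redundant embedding: average of all sentence embeddings.
import torch

def corpus_redundancy(sentences: list[str]) -> torch.Tensor:
    embeddings = torch.stack([encode(s) for s in sentences])  # (n, dim)
    return embeddings.mean(dim=0)                             # v_hat

v_hat = corpus_redundancy(["sentence 1", "sentence 2", "sentence 3"])
```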

3.3 Embedding Refinement

After generating the redundant embeddings, the embedding refinement operation is a conceptually simple subtraction, defined as follows:

$v_{i}^{\prime}=f(x_{i})-\lambda_{1}\cdot v_{i}^{*}-\lambda_{2}\cdot\hat{v}$ (3)

where $f(x_{i})$ is the original embedding of $x_{i}$, and $v_{i}^{*}$ and $\hat{v}$ are the redundant embeddings at the two levels, respectively. Since the two redundant embeddings typically do not contribute equally to the embedding $v_{i}$, we introduce two independent hyper-parameters $\lambda_{1}$ and $\lambda_{2}$ to balance the two terms.
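Putting the pieces together, a minimal sketch of the refinement in Eq. 3 using the hypothetical helpers above; the default weights are placeholders, since the paper tunes them on a dev set:

```python
# RepAL refinement (Eq. 3); lambda1/lambda2 values here are placeholders.
def repal_refine(sentence: str, v_hat: torch.Tensor,
                 lambda1: float = 0.3, lambda2: float = 0.5) -> torch.Tensor:
    v_i = encode(sentence)                      # original embedding
    v_i_star = encode(partial_mask(sentence))   # sentence-level redundancy
    return v_i - lambda1 * v_i_star - lambda2 * v_hat  # refined embedding v_i'
```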

Baseline ATEC BQ LCQMC PAWSX STS-B Avg
BERT 16.51\rightarrow19.58 29.35\rightarrow32.89 41.71\rightarrow44.53 9.84\rightarrow11.28 34.65\rightarrow47.00 26.41\rightarrow31.06(+4.65)
RoBERTa 24.61\rightarrow27.00 40.54\rightarrow39.51 70.55\rightarrow70.98 16.23\rightarrow16.98 63.55\rightarrow64.01 43.10\rightarrow43.70(+0.60)
RoFormer 24.29\rightarrow25.07 41.91\rightarrow42.56 64.87\rightarrow65.33 20.15\rightarrow20.13 56.65\rightarrow57.23 41.57\rightarrow42.06(+0.49)
NEZHA 17.39\rightarrow18.98 29.63\rightarrow30.53 40.60\rightarrow41.85 14.90\rightarrow15.43 35.84\rightarrow36.68 27.67\rightarrow28.69(+1.02)
W-BERT 20.61\rightarrow23.29 25.76\rightarrow29.83 48.91\rightarrow50.01 16.82\rightarrow16.96 61.19\rightarrow61.46 34.66\rightarrow36.31(+1.65)
W-RoBERTa 29.59\rightarrow30.44 28.95\rightarrow43.12 70.82\rightarrow71.39 17.99\rightarrow18.48 69.19\rightarrow70.92 43.31\rightarrow46.87(+2.56)
W-RoFormer 26.04\rightarrow27.68 28.13\rightarrow42.63 60.92\rightarrow61.55 23.08\rightarrow23.05 66.96\rightarrow67.13 41.03\rightarrow44.38(+3.35)
W-NEZHA 18.83\rightarrow21.33 21.94\rightarrow23.02 50.52\rightarrow52.01 18.15\rightarrow19.00 60.84\rightarrow60.82 34.06\rightarrow35.24(+1.18)
C-BERT 26.35\rightarrow28.69 46.68\rightarrow48.02 69.22\rightarrow69.98 10.89\rightarrow12.03 68.89\rightarrow69.66 44.41\rightarrow45.68(+1.27)
C-RoBERTa 27.39\rightarrow28.43 47.20\rightarrow47.14 67.34\rightarrow67.98 09.36\rightarrow10.55 72.02\rightarrow71.80 44.66\rightarrow45.18(+0.52)
C-RoFormer 26.24\rightarrow27.68 47.13\rightarrow47.63 66.92\rightarrow67.85 11.08\rightarrow11.65 69.84\rightarrow69.73 44.24\rightarrow44.91(+0.67)
C-NEZHA 26.02\rightarrow26.73 47.44\rightarrow48.02 70.02\rightarrow70.63 11.46\rightarrow11.80 68.97\rightarrow69.53 44.78\rightarrow45.34(+0.56)
Sim-BERT 33.14\rightarrow33.48 50.67\rightarrow51.14 69.99\rightarrow72.44 12.95\rightarrow13.58 69.04\rightarrow69.55 47.16\rightarrow48.04(+0.88)
Sim-RoBERTa 32.23\rightarrow33.10 50.61\rightarrow51.53 74.22\rightarrow74.77 12.25\rightarrow13.28 71.13\rightarrow72.20 48.09\rightarrow48.98(+0.89)
Sim-RoFormer 32.33\rightarrow32.59 49.13\rightarrow49.46 71.61\rightarrow72.13 15.25\rightarrow15.69 69.45\rightarrow70.01 47.55\rightarrow48.02(+0.47)
Sim-NEZHA 32.14\rightarrow32.52 46.08\rightarrow47.42 60.38\rightarrow60.51 16.60\rightarrow16.58 68.50\rightarrow69.19 44.74\rightarrow45.26(+0.52)
Table 1: The experimental results of RepAL on Chinese semantic similarity benchmarks. The numbers before \rightarrow indicate the performance without RepAL and the numbers after \rightarrow mean the performance with RepAL. Blue numbers indicate RepAL improves the baseline.

4 Experiments

This section shows that our method can be adapted to various USEs and improves their performance.

4.1 Baselines and Benchmarks

To verify the effectiveness of our method, we evaluate RepAL on both Chinese and English benchmarks. To investigate whether our method can be applied to various unsupervised sentence encoders (USEs), we choose two kinds of encoders: vanilla USEs and secondary-trained USEs. For vanilla USEs, we select BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), RoFormer Su et al. (2021b), and NEZHA Wei et al. (2019) for Chinese; for English, we select BERTbase, BERTlarge Devlin et al. (2019), and RoBERTabase Reimers and Gurevych (2019). Specifically, we name the secondary-trained USEs equipped with whitening Huang et al. (2021), ConSERT Yan et al. (2021), and SimCSE Gao et al. (2021) as W-USE (e.g., W-BERT), C-USE (e.g., C-BERT), and Sim-USE (e.g., Sim-BERT), respectively. The results of Sim-USE and C-USE are from our own implementation. Details of training SimCSE on Chinese benchmarks are deferred to Appendix B.

  • Chinese: We select five Chinese benchmarks for evaluation: ATEC, LCQMC, BQ, PAWSX, and STS-B. Details about them are deferred to Appendix B.

  • English: We select STS task benchmarks as our English datasets, including STS 2012-2016 tasks Agirre et al. (2012, 2013, 2014, 2015, 2016), the STS benchmark Cer et al. (2017) and the SICK-Relatedness dataset Marelli et al. (2014).

4.2 Experimental Settings

The vanilla USEs in our experiments use the same settings as in their original papers, and we likewise keep the settings of the other baselines the same as their original ones. As for hyper-parameters, we follow previous unsupervised work Gao et al. (2021); Yan et al. (2021) and tune $\lambda_{1}$ and $\lambda_{2}$ on the STS-B dev set. Results are evaluated with Spearman correlation.
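A minimal sketch of how such tuning could look, assuming the hypothetical repal_refine() helper from Section 3 and placeholder dev data (dev_pairs, gold_scores); the grid values are illustrative, not the settings used in the paper:

```python
# Grid search over (lambda1, lambda2) using Spearman correlation on a dev set.
import itertools
import torch
from scipy.stats import spearmanr

def spearman_on_dev(dev_pairs, gold_scores, v_hat, lambda1, lambda2):
    preds = []
    for s1, s2 in dev_pairs:
        e1 = repal_refine(s1, v_hat, lambda1, lambda2)
        e2 = repal_refine(s2, v_hat, lambda1, lambda2)
        preds.append(torch.cosine_similarity(e1, e2, dim=0).item())
    return spearmanr(preds, gold_scores).correlation

def grid_search(dev_pairs, gold_scores, v_hat, grid=(0.0, 0.1, 0.3, 0.5, 0.7, 1.0)):
    # Returns the (lambda1, lambda2) pair with the best dev Spearman correlation.
    return max(itertools.product(grid, grid),
               key=lambda lam: spearman_on_dev(dev_pairs, gold_scores, v_hat, *lam))
```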

Baseline STS-12 STS-13 STS-14 STS-15 STS-16 Avg
BERT 57.86\rightarrow59.55 61.97\rightarrow66.20 62.49\rightarrow65.19 70.96\rightarrow73.50 69.76\rightarrow72.10 63.69\rightarrow66.70(+3.01)
BERTl 57.74\rightarrow59.90 61.16\rightarrow66.20 61.18\rightarrow65.62 68.06\rightarrow73.01 70.30\rightarrow74.72 62.62\rightarrow67.47(+4.85)
RoBERTa 58.52\rightarrow60.88 56.21\rightarrow62.20 60.12\rightarrow64.10 69.12\rightarrow71.41 63.69\rightarrow69.94 60.59\rightarrow65.41 (+4.82)
W-BERT 63.62\rightarrow64.50 73.02\rightarrow73.69 69.23\rightarrow69.69 74.52\rightarrow74.69 72.15\rightarrow76.11 69.21\rightarrow70.39 (+1.18)
W-BERTl 63.62\rightarrow63.90 73.02\rightarrow73.41 69.23\rightarrow70.01 74.52\rightarrow75.18 72.15\rightarrow75.89 69.21\rightarrow70.39 (+1.18)
W-RoBERTa 68.18\rightarrow68.85 62.21\rightarrow63.03 67.13\rightarrow67.69 67.63\rightarrow68.23 74.78\rightarrow75.44 67.17\rightarrow68.43 (+1.26)
C-BERT 64.09\rightarrow65.01 78.21\rightarrow78.54 68.68\rightarrow69.04 79.56\rightarrow79.90 75.41\rightarrow75.74 72.27\rightarrow72.69 (+0.42)
C-BERTl 70.23\rightarrow70.70 82.13\rightarrow82.54 73.60\rightarrow74.12 81.72\rightarrow82.01 77.01\rightarrow77.58 76.03\rightarrow76.48 (+0.45)
Sim-BERT 68.93\rightarrow69.33 78.68\rightarrow78.93 73.57\rightarrow73.95 79.68\rightarrow80.01 79.11\rightarrow79.29 75.11\rightarrow75.44 (+0.33)
Sim-BERTl 69.25\rightarrow69.60 78.96\rightarrow79.30 73.64\rightarrow73.92 80.06\rightarrow80.31 79.08\rightarrow79.42 75.31\rightarrow75.61 (+0.30)
Table 2: The experimental results of RepAL on English semantic similarity benchmarks. ‘Avg’ indicates the average performance of all English benchmarks including STS-B and SICK-R in Table 3, and BERTl means BERTlarge during the experiments.
Baseline STS-B SICK-R
BERT 59.04\rightarrow 66.35 63.75\rightarrow 64.55
BERTl 59.59\rightarrow 68.21 60.34\rightarrow 64.61
RoBERTa 55.16\rightarrow 65.75 61.33\rightarrow 63.61
W-BERT 71.34\rightarrow 71.45 60.60\rightarrow 62.61
W-BERTl 71.34\rightarrow 69.56 60.60\rightarrow 65.00
W-RoBERTa 71.43\rightarrow 72.03 58.80\rightarrow 63.95
C-BERT 73.12\rightarrow 73.45 66.79\rightarrow 67.15
C-BERTl 77.48\rightarrow 77.91 70.02\rightarrow 70.51
Sim-BERT 75.71\rightarrow 76.00 70.12\rightarrow 70.51
Sim-BERTl 75.84\rightarrow 76.11 70.34\rightarrow 70.61
Table 3: The results of RepAL on STS-B and SICK-R

4.3 Results

As shown in Table 1, RepAL improves the baselines' performance in most cases. For example, RepAL brings 4.65%, 1.65%, 1.27%, and 0.88% improvements to BERT, W-BERT, C-BERT, and Sim-BERT, respectively. Generally, as the USE becomes stronger, the improvement brought by RepAL decreases. Still, RepAL makes progress over strong baselines such as C-BERT and Sim-BERT, achieving 1.27% and 0.88% performance increases, which indicates its effectiveness even on very strong baselines. The results on the English benchmarks are listed in Tables 2 and 3, where RepAL also obtains improvements over various USEs. Overall, the results on both Chinese and English benchmarks demonstrate the effectiveness of RepAL and illustrate that it is a plug-and-play method for sentence representation enhancement.

4.4 Ablation study

This section investigates the individual effect of the embedding refinement at the two levels. As shown in Table 4, each operation is beneficial, and combining them leads to stronger performance.

BERT W-BERT C-BERT S-BERT
RepAL 31.06 36.31 45.68 48.04
w/o Sen 29.28 34.03 43.49 46.09
w/o Cor 30.42 35.78 45.03 47.63
Table 4: Ablation studies of RepAL on Chinese benchmarks. ‘S-BERT’ refers to ‘Sim-BERT’.

5 Conclusion

In this paper, we propose RepAL, a universal method for unsupervised sentence representation enhancement. Based on the idea of de-emphasizing redundant information, RepAL extracts and then removes redundant information from sentence embeddings at the sentence level and the corpus level. Through a simple embedding refinement operation, RepAL achieves improvements on both Chinese and English benchmarks and proves to be a simple, plug-and-play addition to modern techniques for unsupervised sentence representation.

6 Limitation

RepAL is a universal method for sentence representation, and sentence representations are eventually used for downstream tasks. However, RepAL does not consider task-specific information, even though different downstream tasks may have different preferences. Therefore, exploring task-specific modifications of RepAL is a future direction.

References

  • Agirre et al. (2015) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. 2015. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pages 252–263.
  • Agirre et al. (2014) Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014), pages 81–91.
  • Agirre et al. (2016) Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.
  • Agirre et al. (2012) Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. Semeval-2012 task 6: A pilot on semantic textual similarity. In * SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.
  • Agirre et al. (2013) Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. * sem 2013 shared task: Semantic textual similarity. In Second joint conference on lexical and computational semantics (* SEM), volume 1: proceedings of the Main conference and the shared task: semantic textual similarity, pages 32–43.
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
  • Choi et al. (2020) Seungtaek Choi, Haeju Park, Jinyoung Yeo, and Seung-won Hwang. 2020. Less is more: Attention supervision with counterfactuals for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6695–6704.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.
  • Hill et al. (2016) Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1367–1377.
  • Huang et al. (2021) Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, and Nan Duan. 2021. Whiteningbert: An easy unsupervised sentence embedding approach. arXiv preprint arXiv:2104.01767.
  • Jones (1972) Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation.
  • Kiros et al. (2015) Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.
  • Luhn (1958) Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165.
  • Mahabadi et al. (2020) Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. 2020. End-to-end bias mitigation by modelling biases in corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8706–8716.
  • Marelli et al. (2014) Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A sick cure for the evaluation of compositional distributional semantic models. In Lrec, pages 216–223. Reykjavik.
  • Meng et al. (2021) Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. Coco-lm: Correcting and contrasting text sequences for language model pretraining.
  • Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 528–540.
  • Qian et al. (2021) Chen Qian, Fuli Feng, Lijie Wen, Chunping Ma, and Pengjun Xie. 2021. Counterfactual inference for text classification debiasing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5434–5445.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
  • Socher et al. (2011) Richard Socher, Eric Huang, Jeffrey Pennin, Christopher D Manning, and Andrew Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Advances in neural information processing systems, 24.
  • Su et al. (2021a) Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021a. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.
  • Su et al. (2021b) Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. 2021b. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  • Wang et al. (2021) Dong Wang, Ning Ding, Piji Li, and Haitao Zheng. 2021. Cline: Contrastive learning with semantic negative examples for natural language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2332–2342.
  • Wei et al. (2019) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. 2019. Nezha: Neural contextualized representation for chinese language understanding. arXiv preprint arXiv:1909.00204.
  • Yan et al. (2021) Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.
  • Yang et al. (2019) Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. Paws-x: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692.
  • Zhang et al. (2020) Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610.
  • Zhou and Bansal (2020) Xiang Zhou and Mohit Bansal. 2020. Towards robustifying nli models against lexical dataset biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8759–8771.

Appendix A Difference from BERT-whitening

Several post-processing methods have been proposed to improve the quality of contextual sentence embeddings and address the anisotropy problem. The post-processing paradigm aims to enhance sentence embeddings through simple and efficient operations without extra training or data. The most prominent such method is whitening Huang et al. (2021), which transforms sentence embeddings into Gaussian-like embeddings and has proved effective for improving sentence embeddings. It is the work most relevant to ours, since our corpus-level refinement is similar to the average-embedding subtraction in whitening. However, there are three principal differences between the two. First, the motivation is different: whitening aims at transforming sentence embeddings into Gaussian-like embeddings so that distances can be measured on an orthogonal basis, whereas our method starts from the perspective of redundancy refinement and aims to diminish the impact of trivial words within a sentence during similarity calculation. Second, the methodology is different: our method additionally employs a partial mask to filter the redundancy and introduces weight factors to control its impact during embedding refinement. Lastly, our in-depth analysis shows that our method reduces the upper bound of the largest eigenvalue of the embedding matrix as well as the impact of trivial words, effects that differ from those of whitening.

Appendix B Details of Chinese benchmarks and training Chinese SimCSE

(1) ATEC: a semantic similarity dataset related to customer service; (2) LCQMC: a question matching dataset covering multiple domains; (3) BQ: a question matching dataset related to banking and finance; (4) PAWSX Yang et al. (2019): a multilingual dataset of paraphrase and non-paraphrase pairs, of which we use the Chinese part; (5) STS-B: a Chinese benchmark labeled with the semantic relatedness between two sentences.

To train SimCSE on these benchmarks, we remove the labels from ATEC, LCQMC, BQ, PAWSX, and STS-B and merge them into an unsupervised corpus. SimCSE is then trained on the merged corpus, and the best checkpoint is selected according to the STS-B dev set, following the same setting as previous works Gao et al. (2021); Yan et al. (2021). Specifically, we find that the best dropout rate $r$ on the Chinese corpus is about 0.3.

Appendix C Detailed Analysis and Discussion

The proposed RepAL enhances sentence embeddings by filtering redundant information at two levels: the sentence level and the corpus level. Beyond the overall experimental results and analysis presented above, the intrinsic properties of RepAL remain unclear. In this section, we illustrate why RepAL is effective in enhancing sentence embeddings.

In Sec. C.1, we provide evidence of the impact of trivial words on sentence embeddings and show the capacity of our sentence-level embedding refinement. In Sec. C.2, we show why the corpus-level embedding refinement enhances sentence embeddings and illustrate the relation between the largest eigenvalue and performance.

C.1 Sentence-level Refinement

We conduct analyses for sentence-level refinement (SR) as follows: we investigate the impact of trivial words with and without SR, which explains the necessity of removing such redundant information and validates the effectiveness of SR.

We first define the importance $H$ of a word $w\in x_{i}$ in semantic similarity calculation as follows:

$H(x_{i},x^{-}_{i};w)=\mathrm{Sim}(x_{i},x^{-}_{i})-\mathrm{Sim}(x_{i}/w,\,x^{-}_{i})$ (4)

where $x_{i}$ and $x^{-}_{i}$ are a pair of sentences and $x_{i}/w$ means deleting the word $w$ from $x_{i}$. Note that we do not consider the words in $x^{-}_{i}$, since doing so would be equivalent to evaluating on more sentences. We then define the set of trivial words within $x_{i}$, denoted $S(x_{i})$, as the words left unmasked by Jieba (i.e., the non-keywords). Thus we can define the redundancy overlap ratio $r(p_{i})$ of a sentence pair $p_{i}=(x_{i},x^{-}_{i})$ as follows:

$r(p_{i})=\frac{|S(x_{i})\cap T(x_{i})|}{|T(x_{i})|}$ (5)

where $T(x_{i})$ is the set of the top-5 words with the highest importance $H$ in $x_{i}$. $r(p_{i})$ reflects the impact of trivial words on the semantic similarity of the pair $p_{i}$: a higher $r(p_{i})$ indicates that more trivial words are important for the similarity calculation. We randomly sample 300 sentence pairs from STS-B Cer et al. (2017), select BERT as the USE, and calculate the average redundancy overlap ratio $\hat{r}=\frac{\sum_{i=1}^{N}r(p_{i})}{N}$ with and without SR. The results show that $\hat{r}$ reaches 10.2% without SR and drops to 7.1% after applying SR. (Note that $\hat{r}$ changes because the inputs to the similarity calculation change when SR is active: with SR, Eq. 4 becomes $H(x_{i},x^{-}_{i};w)=\mathrm{Sim}(G(x_{i}),G(x^{-}_{i}))-\mathrm{Sim}(G(x_{i}/w),G(x^{-}_{i}))$, where $G(\cdot)$ denotes the SR operation.) The results demonstrate that SR diminishes the impact of trivial words when measuring semantic similarity.
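A minimal sketch of how $H$ and $r$ could be computed, reusing the hypothetical encode() helper and Jieba keyword extractor from the earlier sketches (this is the no-SR variant; with SR one would encode the refined embeddings instead):

```python
# Word importance H (Eq. 4) and redundancy overlap ratio r (Eq. 5); helpers are hypothetical.
import jieba
import jieba.analyse
import torch

def cos(a, b):
    return torch.cosine_similarity(a, b, dim=0).item()

def importance(x, x_neg, w):
    """H(x, x_neg; w): similarity drop caused by deleting word w from x."""
    x_without_w = "".join(t for t in jieba.lcut(x) if t != w)
    return cos(encode(x), encode(x_neg)) - cos(encode(x_without_w), encode(x_neg))

def overlap_ratio(x, x_neg, top_n=5):
    """r(p): fraction of the top-n most important words that are trivial (non-keywords)."""
    words = jieba.lcut(x)
    trivial = set(words) - set(jieba.analyse.extract_tags(x, topK=5))  # S(x)
    top = sorted(words, key=lambda w: importance(x, x_neg, w), reverse=True)[:top_n]  # T(x)
    return len([w for w in top if w in trivial]) / len(top)
```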

Moreover, we select some representative words and evaluate their importance with and without RepAL. As shown in Table 5, SR indeed diminishes the impact of such trivial words when calculating semantic similarity.

Word No Refinement With Refinement Δ
the 1.02 0.56 -0.46
a 0.98 0.43 -0.55
to 0.59 0.32 -0.27
in 0.68 0.21 -0.47
some 0.60 0.31 -0.29
with 0.72 0.24 -0.48
and 0.99 0.61 -0.38
Table 5: The importance of trivial words with and without sentence-level refinement. Δ denotes the change in importance.

C.2 Corpus-level Refinement

To investigate whether corpus-level refinement reduces the upper bound of the largest eigenvalue of the embedding matrix $E^{*}$, we conduct numerical experiments to examine the relationship between performance (Spearman correlation), $\lambda$, and the upper bound of the largest eigenvalue of $E^{*}$.

Specifically, we run the experiments on six English benchmarks with BERTbase. As shown in Figure LABEL:figure:lamda, when the performance peaks, the upper bound of the largest eigenvalue of the embedding matrix $E^{*}$ is near its minimum, showing that the two coincide. These numerical results indicate that corpus-level refinement enhances sentence embeddings because it reduces the largest eigenvalue of $E^{*}$. The previous method Huang et al. (2021) is equivalent to subtracting the average vector with $\lambda=1.0$, which does not fully suppress the largest eigenvalue of the embedding matrix. In contrast, our method subtracts the average vector with an adaptive weight $\lambda$, further suppressing the upper bound of the largest eigenvalue of the embedding matrix. The results show that average-embedding subtraction needs an adaptive weight, which also explains why our method can still bring substantial improvements over W-BERT.
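A minimal sketch of the quantity tracked in this analysis, under the assumption that the "largest eigenvalue of the embedding matrix" refers to the largest eigenvalue of $E^{\top}E$ (i.e., the squared top singular value of $E$); the toy data and the lambda sweep below are illustrative only:

```python
# Largest eigenvalue of E^T E before/after corpus-level refinement with weight lambda.
import numpy as np

def largest_eigenvalue(E: np.ndarray) -> float:
    """E has shape (n_sentences, dim); return the largest eigenvalue of E^T E."""
    top_singular_value = np.linalg.svd(E, compute_uv=False)[0]
    return float(top_singular_value ** 2)

def refined_matrix(E: np.ndarray, lam: float) -> np.ndarray:
    """Corpus-level refinement: subtract lam * average embedding from each row."""
    return E - lam * E.mean(axis=0, keepdims=True)

# Sweep lambda over an embedding matrix (toy random data stands in for real embeddings).
E = np.random.randn(300, 768)
for lam in (0.0, 0.5, 1.0, 1.5):
    print(lam, largest_eigenvalue(refined_matrix(E, lam)))
```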